WO2023038957A1 - Monitoring a software development pipeline
- Publication number
- WO2023038957A1 (application PCT/US2022/042737)
- Authority
- WIPO (PCT)
- Prior art keywords
- data
- user
- nodes
- graph
- processes
- Prior art date
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
Definitions
- Fig. 1A shows an illustrative configuration in which a data platform is configured to perform various operations with respect to a cloud environment that includes a plurality of compute assets.
- Fig. 1B shows an illustrative implementation of the configuration of Fig. 1A.
- Fig. 1C illustrates an example computing device.
- Fig. 1D illustrates an example of an environment in which activities that occur within datacenters are modeled.
- Fig. 2A illustrates an example of a process, used by an agent, to collect and report information about a client.
- Fig. 2B illustrates a 5-tuple of data collected by an agent, physically and logically.
- Fig. 2C illustrates a portion of a polygraph.
- Fig. 2D illustrates a portion of a polygraph.
- Fig. 2E illustrates an example of a communication polygraph.
- Fig. 2F illustrates an example of a polygraph.
- Fig. 2G illustrates an example of a polygraph as rendered in an interface.
- Fig. 2H illustrates an example of a portion of a polygraph as rendered in an interface.
- Fig. 2I illustrates an example of a portion of a polygraph as rendered in an interface.
- Fig. 2J illustrates an example of a portion of a polygraph as rendered in an interface.
- Fig. 2K illustrates an example of a portion of a polygraph as rendered in an interface.
- Fig. 2L illustrates an example of an insider behavior graph as rendered in an interface.
- Fig. 2M illustrates an example of a privilege change graph as rendered in an interface.
- Fig. 2N illustrates an example of a user login graph as rendered in an interface.
- Fig. 2O illustrates an example of a machine server graph as rendered in an interface.
- Fig. 3A illustrates an example of a process for detecting anomalies in a network environment.
- Fig. 3B depicts a set of example processes communicating with other processes.
- Fig. 3C depicts a set of example processes communicating with other processes.
- Fig. 3D depicts a set of example processes communicating with other processes.
- Fig. 3E depicts two pairs of clusters.
- Fig. 3F is a representation of a user logging into a first machine, then into a second machine from the first machine, and then making an external connection.
- Fig. 3G is an alternate representation of actions occurring in Fig. 3F.
- Fig. 3H illustrates an example of a process for performing extended user tracking.
- Fig. 3I is a representation of a user logging into a first machine, then into a second machine from the first machine, and then making an external connection.
- Fig. 3J illustrates an example of a process for performing extended user tracking.
- Fig. 3K illustrates example records.
- Fig. 3L illustrates example output from performing an ssh connection match.
- Fig. 3M illustrates example records.
- Fig. 3N illustrates example records.
- Fig. 3O illustrates example records.
- Fig. 3P illustrates example records.
- Fig. 3Q illustrates an adjacency relationship between two login sessions.
- Fig. 3R illustrates example records.
- Fig. 3S illustrates an example of a process for detecting anomalies.
- Fig. 4A illustrates a representation of an embodiment of an insider behavior graph.
- Fig. 4B illustrates an embodiment of a portion of an insider behavior graph.
- Fig. 4C illustrates an embodiment of a portion of an insider behavior graph.
- Fig. 4D illustrates an embodiment of a portion of an insider behavior graph.
- Fig. 4E illustrates a representation of an embodiment of a user login graph.
- Fig. 4F illustrates an example of a privilege change graph.
- Fig. 4G illustrates an example of a privilege change graph.
- Fig. 4H illustrates an example of a user interacting with a portion of an interface.
- Fig. 4I illustrates an example of a dossier for an event.
- Fig. 4J illustrates an example of a dossier for a domain.
- Fig. 4K depicts an example of an Entity Join graph by FilterKey and FilterKey Group (implicit join).
- Fig. 4L illustrates an example of a process for dynamically generating and executing a query.
- Fig. 5 sets forth a flowchart illustrating an example method of dynamically generating monitoring tools for software applications in accordance with some embodiments.
- Fig. 6 sets forth a flowchart illustrating an additional example method of dynamically generating monitoring tools for software applications in accordance with some embodiments.
- Fig. 7 sets forth a flowchart illustrating an additional example method of dynamically generating monitoring tools for software applications in accordance with some embodiments.
- Fig. 8 sets forth a flow chart illustrating an example method of using real-time monitoring to inform static analysis in accordance with some embodiments of the present disclosure.
- Fig. 9 sets forth a flow chart illustrating an additional example method of using real-time monitoring to inform static analysis in accordance with some embodiments.
- Fig. 10 sets forth a flow chart illustrating an additional example method of using real-time monitoring to inform static analysis in accordance with some embodiments.
- Fig. 11 sets forth a flowchart illustrating an example method of configuring cloud deployments (or components in a software development pipeline) based on learnings obtained by monitoring other cloud deployments (or components in another software development pipeline) in accordance with some embodiments of the present disclosure.
- Fig. 12 sets forth a flow chart illustrating an example method of monitoring a software development pipeline in accordance with some embodiments.
- Fig. 13 sets forth a flow chart illustrating another example method of monitoring a software development pipeline in accordance with some embodiments.
- Fig. 14 sets forth a flow chart illustrating another example method of monitoring a software development pipeline in accordance with some embodiments.
- Fig. 15 sets forth a flow chart illustrating another example method of monitoring a software development pipeline in accordance with some embodiments.
- Fig. 1A shows an illustrative configuration 10 in which a data platform 12 is configured to perform various operations with respect to a cloud environment 14 that includes a plurality of compute assets 16-1 through 16-N (collectively “compute assets 16”).
- Data platform 12 may include data ingestion resources 18 configured to ingest data from cloud environment 14 into data platform 12, data processing resources 20 configured to perform data processing operations with respect to the data, and user interface resources 22 configured to provide one or more external users and/or compute resources (e.g., computing device 24) with access to an output of data processing resources 20.
- Cloud environment 14 may include any suitable network-based computing environment as may serve a particular application.
- cloud environment 14 may be implemented by one or more compute resources provided and/or otherwise managed by one or more cloud service providers, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and/or any other cloud service provider configured to provide public and/or private access to network-based compute resources.
- Compute assets 16 may include, but are not limited to, containers (e.g., container images, deployed and executing container instances, etc.), virtual machines, workloads, applications, processes, physical machines, compute nodes, clusters of compute nodes, software runtime environments (e.g., container runtime environments), and/or any other virtual and/or physical compute resource that may reside in and/or be executed by one or more computer resources in cloud environment 14.
- one or more compute assets 16 may reside in one or more datacenters.
- a compute asset 16 may be associated with (e.g., owned, deployed, or managed by) a particular entity, such as a customer or client of cloud environment 14 and/or data platform 12. Accordingly, for purposes of the discussion herein, cloud environment 14 may be used by one or more entities.
- Data platform 12 may be configured to perform one or more data security monitoring and/or remediation services, compliance monitoring services, anomaly detection services, DevOps services, compute asset management services, and/or any other type of data analytics service as may serve a particular implementation.
- Data platform 12 may be managed or otherwise associated with any suitable data platform provider, such as a provider of any of the data analytics services described herein.
- the various resources included in data platform 12 may reside in the cloud and/or be located on-premises and be implemented by any suitable combination of physical and/or virtual compute resources, such as one or more computing devices, microservices, applications, etc.
- Data ingestion resources 18 may be configured to ingest data from cloud environment 14 into data platform 12. This may be performed in various ways, some of which are described in detail herein. For example, as illustrated by arrow 26, data ingestion resources 18 may be configured to receive the data from one or more agents deployed within cloud environment 14, utilize an event streaming platform (e.g., Kafka) to obtain the data, and/or pull data (e.g., configuration data) from cloud environment 14. In some examples, data ingestion resources 18 may obtain the data using one or more agentless configurations.
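The ingestion paths just described (agents pushing data, an event-streaming platform carrying it, and the platform pulling configuration data agentlessly) can be sketched as follows. This is a minimal, hypothetical illustration, not the patent's implementation; all names are invented, and an in-memory queue stands in for an event-streaming platform such as Kafka.

```python
from collections import deque

class IngestionPipeline:
    """Hypothetical sketch of the ingestion paths described above."""
    def __init__(self):
        self.queue = deque()   # stands in for an event-streaming topic (e.g., Kafka)
        self.data_store = []   # stands in for data store 30

    def receive_from_agent(self, event):
        # Push path (arrow 26): data sent by an agent deployed in the cloud environment.
        self.queue.append(event)

    def pull_configuration(self, cloud_api):
        # Agentless path: the platform polls the environment for configuration data.
        self.queue.append({"type": "config", "payload": cloud_api()})

    def drain(self):
        # Load everything ingested so far into the data store.
        while self.queue:
            self.data_store.append(self.queue.popleft())

pipeline = IngestionPipeline()
pipeline.receive_from_agent({"type": "process", "name": "nginx"})
pipeline.pull_configuration(lambda: {"region": "us-east-1"})
pipeline.drain()
print(len(pipeline.data_store))  # 2
```

In a real deployment the queue, the cloud API call, and the data store would each be separate networked services; the point here is only the separation of push and pull ingestion paths feeding one store.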
- the data ingested by data ingestion resources 18 from cloud environment 14 may include any type of data as may serve a particular implementation.
- the data may include data representative of configuration information associated with compute assets 16, information about one or more processes running on compute assets 16, network activity information, information about events (creation events, modification events, communication events, user-initiated events, etc.) that occur with respect to compute assets 16, etc.
- the data may or may not include actual customer data processed or otherwise generated by compute assets 16.
- data ingestion resources 18 may be configured to load the data ingested from cloud environment 14 into a data store 30.
- Data store 30 is illustrated in Fig. 1A as being separate from and communicatively coupled to data platform 12. However, in some alternative embodiments, data store 30 is included within data platform 12.
- Data store 30 may be implemented by any suitable data warehouse, data lake, data mart, and/or other type of database structure as may serve a particular implementation.
- Such data stores may be proprietary or may be embodied as vendor provided products or services such as, for example, Snowflake, Google BigQuery, Druid, Amazon Redshift, IBM Db2, Dremio, Databricks Lakehouse Platform, Cloudera, Azure Synapse Analytics, and others.
- data that is collected from agents and other sources may be stored in different ways.
- data that is collected from agents and other sources may be stored in a data warehouse, data lake, data mart, and/or any other data store.
- a data warehouse may be embodied as an analytic database (e.g., a relational database) that is created from two or more data sources. Such a data warehouse may be leveraged to store historical data, often on the scale of petabytes. Data warehouses may have compute and memory resources for running complicated queries and generating reports. Data warehouses may be the data sources for business intelligence (‘BI’) systems, machine learning applications, and/or other applications. By leveraging a data warehouse, data that has been copied into the data warehouse may be indexed for good analytic query performance, without affecting the write performance of a database (e.g., an Online Transaction Processing (‘OLTP’) database). Data warehouses also enable joining of data from multiple sources for analysis.
- Data lakes, which store files of data in their native format, may be considered “schema on read” resources. As such, any application that reads data from the lake may impose its own types and relationships on the data.
- Data warehouses are “schema on write,” meaning that data types, indexes, and relationships are imposed on the data as it is stored in an enterprise data warehouse (EDW). “Schema on read” resources may be beneficialial for data that may be used in several contexts and that poses little risk of data loss.
- “Schema on write” resources may be beneficial for data that has a specific purpose, and good for data that must relate properly to data from other sources.
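The schema-on-read versus schema-on-write distinction can be made concrete with a small sketch. This is an illustrative example only (the record fields are invented): a lake keeps raw lines and each reader imposes types on load, while a warehouse-style writer enforces the schema at store time and rejects nonconforming records.

```python
import json

raw_records = ['{"user": "alice", "bytes": "1024"}', '{"user": "bob"}']

# "Schema on read": the lake keeps files in native format; each
# application imposes its own types and relationships when it reads.
lake = list(raw_records)
def read_with_schema(line):
    rec = json.loads(line)
    return {"user": rec.get("user"), "bytes": int(rec.get("bytes", 0))}
parsed = [read_with_schema(line) for line in lake]

# "Schema on write": types are imposed as data is stored; records that
# do not relate properly to the schema are rejected at write time.
warehouse = []
def write_with_schema(line):
    rec = json.loads(line)
    if "user" not in rec or "bytes" not in rec:
        raise ValueError("schema violation")
    warehouse.append({"user": str(rec["user"]), "bytes": int(rec["bytes"])})

write_with_schema(raw_records[0])
try:
    write_with_schema(raw_records[1])  # missing "bytes" -> rejected on write
except ValueError:
    pass
print(parsed[1]["bytes"], len(warehouse))  # 0 1
```

The lake accepts both records and lets the reader paper over the missing field; the warehouse refuses the incomplete record up front, which is exactly the trade-off the surrounding text describes.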
- Such data stores may include data that is encrypted using homomorphic encryption, data encrypted using privacy-preserving encryption, smart contracts, non-fungible tokens, decentralized finance, and other techniques.
- Data marts may contain data oriented towards a specific business line whereas data warehouses contain enterprise-wide data. Data marts may be dependent on a data warehouse, independent of the data warehouse (e.g., drawn from an operational database or external source), or a hybrid of the two. In embodiments described herein, different types of data stores (including combinations thereof) may be leveraged.
- Data processing resources 20 may be configured to perform various data processing operations with respect to data ingested by data ingestion resources 18, including data ingested and stored in data store 30.
- data processing resources 20 may be configured to perform one or more data security monitoring and/or remediation operations, compliance monitoring operations, anomaly detection operations, DevOps operations, compute asset management operations, and/or any other type of data analytics operation as may serve a particular implementation.
- data processing resources 20 may be configured to access data in data store 30 to perform the various operations described herein. In some examples, this may include performing one or more queries with respect to the data stored in data store 30. Such queries may be generated using any suitable query language.
- the queries provided by data processing resources 20 may be configured to direct data store 30 to perform one or more data analytics operations with respect to the data stored within data store 30.
- These data analytics operations may be with respect to data specific to a particular entity (e.g., data residing in one or more silos within data store 30 that are associated with a particular customer) and/or data associated with multiple entities.
- data processing resources 20 may be configured to analyze data associated with a first entity and use the results of the analysis to perform one or more operations with respect to a second entity.
- One or more operations performed by data processing resources 20 may be performed periodically according to a predetermined schedule. For example, one or more operations may be performed by processing resources 20 every hour or any other suitable time interval.
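The periodic scheduling described above can be sketched as a simple interval-driven task. This is a hypothetical illustration (the class and the simulated clock are invented, not part of the disclosure); a production system would use a real scheduler rather than explicit ticks.

```python
class PeriodicTask:
    """Hypothetical sketch of a periodically scheduled data-processing
    operation (e.g., an hourly anomaly-detection query)."""
    def __init__(self, interval_s, operation):
        self.interval_s = interval_s
        self.operation = operation
        self.next_run = 0
        self.results = []

    def tick(self, now_s):
        # Run the operation whenever the schedule comes due.
        if now_s >= self.next_run:
            self.results.append(self.operation())
            self.next_run = now_s + self.interval_s

task = PeriodicTask(3600, lambda: "anomaly scan complete")
for t in (0, 1800, 3600, 5400, 7200):  # simulated clock, in seconds
    task.tick(t)
print(len(task.results))  # 3 (runs at t=0, 3600, and 7200)
```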
- one or more operations performed by data processing resources 20 may be performed in substantially real-time (or near real-time) as data is ingested into data platform 12.
- The results of such operations (e.g., one or more detected anomalies in the data) may be provided to one or more external entities (e.g., computing device 24 and/or one or more users) in substantially real-time and/or in near real-time.
- User interface resources 22 may be configured to perform one or more user interface operations, examples of which are described herein.
- user interface resources 22 may be configured to present one or more results of the data processing performed by data processing resources 20 to one or more external entities (e.g., computing device 24 and/or one or more users), as illustrated by arrow 34.
- user interface resources 22 may access data in data store 30 to perform the one or more user interface operations
- Fig. 1B illustrates an implementation of configuration 10 in which an agent 38 (e.g., agent 38-1 through agent 38-N) is installed on each of compute assets 16.
- an agent may include a self-contained binary and/or other type of code or application that can be run on any appropriate platforms, including within containers and/or other virtual compute assets.
- Agents 38 may monitor the nodes on which they execute for a variety of different activities, including but not limited to, connection, process, user, machine, and file activities.
- agents 38 can be executed in user space, and can use a variety of kernel modules (e.g., auditd, iptables, netfilter, pcap, etc.) to collect data.
- Agents can be implemented in any appropriate programming language, such as C or Golang, using applicable kernel APIs.
- Agents 38 may be deployed in any suitable manner.
- an agent 38 may be deployed as a containerized application or as part of a containerized application.
- agents 38 may selectively report information to data platform 12 in varying amounts of detail and/or with variable frequency.
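One way such selective reporting could work is a policy that trades detail for frequency as load grows. The thresholds and field names below are purely hypothetical, chosen only to illustrate the idea of varying detail and frequency; the disclosure does not specify this policy.

```python
def report_plan(event_rate, backlog_bytes):
    """Hypothetical policy: an agent reports less detail, less often,
    when the node is busy or its transmit backlog grows."""
    if backlog_bytes > 10_000_000 or event_rate > 1000:
        return {"detail": "summary", "interval_s": 300}
    if event_rate > 100:
        return {"detail": "standard", "interval_s": 60}
    return {"detail": "full", "interval_s": 10}

print(report_plan(50, 0)["detail"])        # full detail on a quiet node
print(report_plan(5000, 0)["interval_s"])  # back off to 5-minute summaries
```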
- Data platform 12 may also include a load balancer 40 configured to perform one or more load balancing operations with respect to data ingestion operations performed by data ingestion resources 18 and/or user interface operations performed by user interface resources 22.
- Load balancer 40 is shown to be included in data platform 12. However, load balancer 40 may alternatively be located external to data platform 12. Load balancer 40 may be implemented by any suitable microservice, application, and/or other computing resources. In some alternative examples, data platform 12 may not utilize a load balancer such as load balancer 40.
- The configuration may also include long term storage 42 with which data ingestion resources 18 may interface, as illustrated by arrow 44.
- Long term storage 42 may be implemented by any suitable type of storage resources, such as cloud-based storage (e.g., AWS S3, etc.) and/or on-premises storage and may be used by data ingestion resources 18 as part of the data ingestion process. Examples of this are described herein. In some examples, data platform 12 may not utilize long term storage 42.
- a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task.
- processor refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
- a non-transitory computer-readable medium storing computer-readable instructions may be provided in accordance with the principles described herein.
- the instructions when executed by a processor of a computing device, may direct the processor and/or computing device to perform one or more operations, including one or more of the operations described herein.
- Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
- a non-transitory computer-readable medium as referred to herein may include any non-transitory storage medium that participates in providing data (e.g., instructions) that may be read and/or executed by a computing device (e.g., by a processor of a computing device).
- a non-transitory computer-readable medium may include, but is not limited to, any combination of non-volatile storage media and/or volatile storage media.
- Exemplary non-volatile storage media include, but are not limited to, read-only memory, flash memory, a solid-state drive, and a magnetic storage device.
- Fig. 1C illustrates an example computing device 50 that may be specifically configured to perform one or more of the processes described herein. Any of the systems, microservices, computing devices, and/or other components described herein may be implemented by computing device 50.
- computing device 50 may include a communication interface 52, a processor 54, a storage device 56, and an input/output (“I/O”) module 58 communicatively connected one to another via a communication infrastructure 60. While an exemplary computing device 50 is shown in Fig. 1C, the components illustrated in Fig. 1C are not intended to be limiting. Additional or alternative components may be used in other embodiments. Components of computing device 50 shown in Fig. 1C will now be described in additional detail.
- Communication interface 52 may be configured to communicate with one or more computing devices.
- Examples of communication interface 52 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
- Processor 54 generally represents any type or form of processing unit capable of processing data and/or interpreting, executing, and/or directing execution of one or more of the instructions, processes, and/or operations described herein. Processor 54 may perform operations by executing computer-executable instructions 62 (e.g., an application, software, code, and/or other executable data instance) stored in storage device 56.
- computer-executable instructions 62 e.g., an application, software, code, and/or other executable data instance
- Storage device 56 may include one or more data storage media, devices, or configurations and may employ any type, form, and combination of data storage media and/or device.
- storage device 56 may include, but is not limited to, any combination of the non- volatile media and/or volatile media described herein.
- Electronic data, including data described herein, may be temporarily and/or permanently stored in storage device 56.
- data representative of computer-executable instructions 62 configured to direct processor 54 to perform any of the operations described herein may be stored within storage device 56.
- data may be arranged in one or more databases residing within storage device 56.
- I/O module 58 may include one or more I/O modules configured to receive user input and provide user output.
- I/O module 58 may include any hardware, firmware, software, or combination thereof supportive of input and output capabilities.
- I/O module 58 may include hardware and/or software for capturing user input, including, but not limited to, a keyboard or keypad, a touchscreen component (e.g., touchscreen display), a receiver (e.g., an RF or infrared receiver), motion sensors, and/or one or more input buttons.
- I/O module 58 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers.
- I/O module 58 is configured to provide graphical data to a display for presentation to a user.
- the graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
- Fig. 1D illustrates an example implementation 100 of configuration 10.
- one or more components shown in Fig. 1D may implement one or more components shown in Fig. 1A and/or Fig. 1B.
- implementation 100 illustrates an environment in which activities that occur within datacenters are modeled using data platform 12.
- a baseline of datacenter activity can be modeled, and deviations from that baseline can be identified as anomalous.
- Anomaly detection can be beneficial in a security context, a compliance context, an asset management context, a DevOps context, and/or any other data analytics context as may serve a particular implementation.
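The baseline-and-deviation idea above can be illustrated with a minimal statistical sketch. This is not the polygraph-based approach the disclosure develops; it is a deliberately simple stand-in (a k-sigma rule over invented hourly counts) that shows what "deviation from a modeled baseline" means operationally.

```python
from statistics import mean, stdev

def is_anomalous(history, observed, k=3.0):
    """Minimal sketch: flag a measurement that deviates more than k
    standard deviations from its modeled baseline."""
    mu, sigma = mean(history), stdev(history)
    return abs(observed - mu) > k * max(sigma, 1e-9)

# e.g., hourly outbound-connection counts observed for one node
baseline = [100, 104, 98, 101, 99, 103, 97, 102]
print(is_anomalous(baseline, 101))  # False: within normal variation
print(is_anomalous(baseline, 250))  # True: flagged as anomalous
```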
- a datacenter may include dedicated equipment (e.g., owned and operated by entity A, or owned/leased by entity A and operated exclusively on entity A’s behalf by a third party).
- a datacenter can also include cloud-based resources, such as infrastructure as a service (IaaS), platform as a service (PaaS), and/or software as a service (SaaS) elements.
- the techniques described herein can be used in conjunction with multiple types of datacenters, including ones wholly using dedicated equipment, ones that are entirely cloud-based, and ones that use a mixture of both dedicated equipment and cloud-based resources.
- Both datacenter 104 and datacenter 106 include a plurality of nodes, depicted collectively as set of nodes 108 and set of nodes 110, respectively, in Fig. 1D. These nodes may implement compute assets 16. Installed on each of the nodes are in-server / in-virtual-machine (VM) / embedded-in-IoT-device agents (e.g., agent 112), which are configured to collect data and report it to data platform 12 for analysis. As described herein, agents may be small, self-contained binaries that can be run on any appropriate platforms, including virtualized ones (and, as applicable, within containers). Agents may monitor the nodes on which they execute for a variety of different activities, including: connection, process, user, machine, and file activities.
- Agents can be executed in user space, and can use a variety of kernel modules (e.g., auditd, iptables, netfilter, pcap, etc.) to collect data. Agents can be implemented in any appropriate programming language, such as C or Golang, using applicable kernel APIs. As described herein, agents can selectively report information to data platform 12 in varying amounts of detail and/or with variable frequency. As is also described herein, the data collected by agents may be used by data platform 12 to create polygraphs, which are graphs of logical entities, connected by behaviors. In some embodiments, agents report information directly to data platform 12. In other embodiments, at least some agents provide information to a data aggregator, such as data aggregator 114, which in turn provides information to data platform 12.
- a data aggregator can be implemented as a separate binary or other application (distinct from an agent binary), and can also be implemented by having an agent execute in an “aggregator mode” in which the designated aggregator node acts as a Layer 7 proxy for other agents that do not have access to data platform 12. Further, a chain of multiple aggregators can be used, if applicable (e.g., with agent 112 providing data to data aggregator 114, which in turn provides data to another aggregator (not pictured) which provides data to data platform 12).
- An example way to implement an aggregator is through a program written in an appropriate language, such as C or Golang.
- Use of an aggregator can be beneficial in sensitive environments (e.g., involving financial or medical transactions) where various nodes are subject to regulatory or other architectural requirements (e.g., prohibiting a given node from communicating with systems outside of datacenter 104). Use of an aggregator can also help to minimize security exposure more generally. As one example, by limiting communications with data platform 12 to data aggregator 114, individual nodes in nodes 108 need not make external network connections (e.g., via Internet 124), which can potentially expose them to compromise (e.g., by other external devices, such as device 118, operated by a criminal). Similarly, data platform 12 can provide updates, configuration information, etc., to data aggregator 114 (which in turn distributes them to nodes 108), rather than requiring nodes 108 to allow incoming connections from data platform 12 directly.
- Another benefit of an aggregator model is that network congestion can be reduced (e.g., with a single connection being made at any given time between data aggregator 114 and data platform 12, rather than potentially many different connections being open between various of nodes 108 and data platform 12). Similarly, network consumption can also be reduced (e.g., with the aggregator applying compression techniques/bundling data received from multiple agents).
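The congestion and bandwidth benefits just described can be sketched briefly: instead of many agents each holding a connection open, an aggregator bundles their payloads and compresses the bundle before one upstream transfer. This is an illustrative sketch only (the payload shapes are invented, and zlib stands in for whatever compression the aggregator applies).

```python
import json
import zlib

def bundle(agent_payloads):
    """Sketch of aggregator behavior: bundle data received from many
    agents and compress it, so a single upstream connection carries it."""
    blob = json.dumps(agent_payloads).encode()
    return zlib.compress(blob)

# Twenty agents' worth of (highly repetitive) event data.
payloads = [{"agent": i, "events": ["connect"] * 50} for i in range(20)]
compressed = bundle(payloads)
raw_size = len(json.dumps(payloads).encode())
print(len(compressed) < raw_size)  # True: one small transfer instead of 20 connections
```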
- An agent (e.g., agent 112, installed on node 116) may transmit collected information to data aggregator 114 using data serialization protocols such as Apache Avro.
- One example type of information sent by agent 112 to data aggregator 114 is status information.
- Status information may be sent by an agent periodically (e.g., once an hour, or at any other predetermined interval). Alternatively, status information may be sent continuously or in response to the occurrence of one or more events.
- the status information may include, but is not limited to:
  a. an amount of event backlog (in bytes) that has not yet been transmitted,
  b. configuration information,
  c. any data loss period for which data was dropped,
  d. a cumulative count of errors encountered since the agent started,
  e. version information for the agent binary, and/or
  f. cumulative statistics on data collection (e.g., number of network packets processed, new processes seen, etc.).
- a second example type of information that may be sent by agent 112 to data aggregator 114 is event data (described in more detail herein), which may include a UTC timestamp for each event.
- the agent can control the amount of data that it sends to the data aggregator in each call (e.g., a maximum of 10MB) by adjusting the amount of data sent to manage the conflicting goals of transmitting data as soon as possible, and maximizing throughput.
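The per-call size limit described above can be sketched as a simple batching routine. The 10MB figure comes from the text; treating each event as an already-serialized byte string is an assumption made for illustration.

```python
def chunk_events(events, max_bytes=10 * 1024 * 1024):
    """Split serialized events into batches whose total size stays at or
    under max_bytes, bounding the amount of data sent in each call
    while still draining the backlog as quickly as possible."""
    batches, current, current_size = [], [], 0
    for event in events:
        size = len(event)
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

An agent following this sketch would transmit one batch per call, retrying a batch before moving on to the next.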
- Data can also be compressed (or left uncompressed) by the agent, as applicable, prior to sending.
- Each data aggregator may run within a particular customer environment.
- a data aggregator (e.g., data aggregator 114) may facilitate data routing from many different agents (e.g., agents executing on nodes 108) to data platform 12.
- data aggregator 114 may implement a SOCKS 5 caching proxy through which agents can connect to data platform 12.
- data aggregator 114 can encrypt (or otherwise obfuscate) sensitive information prior to transmitting it to data platform 12, and can also distribute key material to agents which can encrypt the information (as applicable).
- Data aggregator 114 may include a local storage, to which agents can upload data (e.g., pcap packets). The storage may have a key-value interface.
- the local storage can also be omitted, and agents configured to upload data to a cloud storage or other storage area, as applicable.
- Data aggregator 114 can, in some embodiments, also cache locally and distribute software upgrades, patches, or configuration information (e.g., as received from data platform 12).
- User A may access a web frontend (e.g., web app 120) using a computer 126 and enroll (on behalf of entity A) for an account with data platform 12. After enrollment is complete, user A may be presented with a set of installers, pre-built and customized for the environment of entity A, that user A can download from data platform 12 and deploy on nodes 108.
- installers include, but are not limited to, a Windows executable file, an iOS app, a Linux package (e.g., .deb or .rpm), a binary, or a container (e.g., a Docker container).
- User B (e.g., a network administrator) may be similarly presented with a set of installers that are pre-built and customized for the environment of entity B.
- User A deploys an appropriate installer on each of nodes 108 (e.g., with a Windows executable file deployed on a Windows-based platform or a Linux package deployed on a Linux platform, as applicable).
- the agent can be deployed in a container.
- Agent deployment can also be performed using one or more appropriate automation tools, such as Chef, Puppet, Salt, and Ansible.
- Deployment can also be performed using managed/hosted container management/orchestration frameworks such as Kubernetes, Mesos, and/or Docker Swarm.
- the agent may be installed in the user space (i.e., it is not a kernel module), and the same binary is executed on each node of the same type (e.g., all Windows-based platforms have the same Windows-based binary installed on them).
- An illustrative function of an agent, such as agent 112, is to collect data (e.g., associated with node 116) and report it (e.g., to data aggregator 114).
- Other tasks that can be performed by agents include data configuration and upgrading.
- One approach to collecting data as described herein is to collect virtually all information available about a node (and, e.g., the processes running on it).
- the agent may monitor for network connections, and then begin collecting information about processes associated with the network connections, using the presence of a network packet associated with a process as a trigger for collecting additional information about the process.
- For an application that does not typically interact with the network (such as a calculator application), no information about use of that application may be collected by agent 112 and/or sent to data aggregator 114. If a process does engage in network activity, however, agent 112 may collect information about the process and provide associated information to data aggregator 114.
- the agent may always collect/report information about certain events, such as privilege escalation, irrespective of whether the event is associated with network activity.
- One approach to collecting information is as follows, described in conjunction with process 200 depicted in Fig. 2A.
- An agent (e.g., agent 112) monitors its node (e.g., node 116) for network activity.
- One way that agent 112 can monitor node 116 for network activity is by using a network packet capture tool (e.g., listening using libpcap).
- the agent obtains and maintains (e.g., in an in-memory cache) connection information associated with the network activity (202). Examples of such information include DNS query/response, TCP, UDP, and IP information.
- the agent may also determine a process associated with the network connection (203).
- Connection information can also be obtained using a kernel network diagnostic API (e.g., netlink_diag), or by running a netstat-style scan (e.g., on /proc/net/tcp, /proc/net/tcp6, /proc/net/udp, and /proc/net/udp6).
- Information such as socket state (e.g., whether a socket is connected, listening, etc.) can also be collected by the agent.
- One way to determine a mapping between a given inode and a process identifier is to scan within the /proc/<pid> directory. For each of the processes currently running, the agent examines each of their file descriptors. If a file descriptor is a match for the inode, the agent can determine that the process associated with the file descriptor owns the inode. Once a mapping is determined between an inode and a process identifier, the mapping is cached. As additional packets are received for the connection, the cached process information is used (rather than a new search being performed).
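The inode-to-PID matching and caching just described can be sketched as follows. On a real Linux host the `fd_tables` argument would be built by calling `readlink` on the entries of each `/proc/<pid>/fd` directory; here it is injected as a plain dictionary so the logic can be illustrated (and exercised) without a live `/proc` filesystem.

```python
_inode_cache = {}  # socket inode -> owning PID, filled as mappings are found

def find_pid_for_inode(inode, fd_tables):
    """Return the PID whose file descriptor table contains the given
    socket inode, caching the answer so later packets for the same
    connection skip the scan entirely.

    fd_tables maps pid -> list of readlink targets such as
    'socket:[12345]' (an illustrative stand-in for /proc/<pid>/fd).
    """
    if inode in _inode_cache:
        return _inode_cache[inode]
    needle = "socket:[%d]" % inode
    for pid, targets in fd_tables.items():
        if needle in targets:
            _inode_cache[inode] = pid
            return pid
    return None  # connection may have ended before the scan completed
```

Once a mapping is cached, a second lookup succeeds even with an empty file descriptor table, mirroring the "cached process information is used" behavior in the text.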
- Another example of an optimization is to prioritize searching the file descriptors of certain processes over others.
- One such prioritization is to search through the subdirectories of /proc/ starting with the youngest process.
- One approximation of such a sort order is to search through /proc/ in reverse order (e.g., examining highest numbered processes first). Higher numbered processes are more likely to be newer (i.e., not long-standing processes), and thus more likely to be associated with new connections (i.e., ones for which inode-process mappings are not already cached).
- the most recently created process may not have the highest process identifier (e.g., due to the kernel wrapping through process identifiers).
- Another example prioritization is to query the kernel for an identification of the most recently created process and to search in a backward order through the directories in /proc/ (e.g., starting at the most recently created process and working backwards, then wrapping to the highest value (e.g., 32768) and continuing to work backward from there).
- An alternate approach is for the agent to keep track of the newest process that it has reported information on (e.g., to data aggregator 114), and begin its search of /proc/ in a forward order starting from the PID of that process.
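The backward-with-wrap scan ordering described above can be sketched as a small generator. The pid_max default of 32768 comes from the text; the starting PID is assumed to be obtained separately (e.g., from the kernel query or from the agent's own bookkeeping).

```python
def pid_scan_order(start_pid, pid_max=32768):
    """Yield candidate PIDs beginning at the most recently created
    process and working backward, wrapping to pid_max and continuing
    backward from there, so newer processes (the likely owners of new
    connections) are examined first."""
    pid = start_pid
    for _ in range(pid_max):
        yield pid
        pid -= 1
        if pid < 1:
            pid = pid_max  # wrap to the highest PID value
```

An agent could iterate this order over the numbered subdirectories of /proc/, skipping PIDs that are not currently in use.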
- Another example prioritization is to maintain, for each user actively using node 116, a list of the five (or any other number) most recently active processes.
- an agent may encounter a socket that does not correspond to the inode being matched against and is not already cached. The identity of that socket (and its corresponding inode) can be cached, once discovered, thus removing a future need to search for that pair.
- a connection may terminate before the agent is able to determine its associated process (e.g., due to a very short-lived connection, due to a backlog in agent processing, etc.).
- One approach to addressing such a situation is to asynchronously collect information about the connection using the audit kernel API, which streams information to user space.
- the information collected from the audit API (which can include PID/inode information) can be matched by the agent against pcap/inode information.
- In some embodiments, the audit API is always used, for all connections. However, due to CPU utilization considerations, use of the audit API can also be reserved for short/otherwise problematic connections (and/or omitted, as applicable).
- the agent can then collect additional information associated with the process (204).
- some of the collected information may include attributes of the process (e.g., a process parent hierarchy, and an identification of a binary associated with the process).
- other collected information is derived (e.g., session summarization data and hash values).
- the collected information is then transmitted (205), e.g., by an agent (e.g., agent 112) to a data aggregator (e.g., data aggregator 114), which in turn provides the information to data platform 12.
- all information collected by an agent may be transmitted (e.g., to a data aggregator and/or to data platform 12).
- the amount of data transmitted may be minimized (e.g., for efficiency reasons), using various techniques.
- One approach to minimizing the amount of data flowing from agents (such as agents installed on nodes 108) to data platform 12 is to use a technique of implicit references with unique keys.
- the keys can be explicitly used by data platform 12 to extract/derive relationships, as necessary, in a data set at a later time, without impacting performance.
- some data collected about a process is constant and does not change over the lifetime of the process (e.g., attributes), and some data changes (e.g., statistical information and other variable information).
- Constant data can be transmitted (210) once, when the agent first becomes aware of the process. And, if any changes to the constant data are detected (e.g., a process changes its parent), a refreshed version of the data can be transmitted (210) as applicable.
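The transmit-once-and-refresh-on-change handling of constant data can be sketched by fingerprinting each process's attributes. The use of a SHA-256 digest over a canonical JSON encoding is an illustrative choice, not a detail from the source.

```python
import hashlib
import json

class AttributeReporter:
    """Track a fingerprint of each process's constant attributes so
    they are reported only when first seen, or when they change
    (e.g., a process changes its parent)."""

    def __init__(self):
        self._seen = {}  # process key -> attribute fingerprint

    def maybe_report(self, process_key, attributes):
        digest = hashlib.sha256(
            json.dumps(attributes, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if self._seen.get(process_key) == digest:
            return None  # unchanged: nothing to transmit
        self._seen[process_key] = digest
        return {"key": process_key, "attributes": attributes}
```

Variable data (statistics, memory usage, etc.) would bypass this check and be sent at its own interval, referencing the same process key.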
- Variable data (e.g., data that may change over the lifetime of the process) can be transmitted (210) at periodic (or other) intervals.
- variable data may be transmitted in substantially real time as it is collected.
- the variable data may indicate a thread count for a process, a total virtual memory used by the process, the total resident memory used by the process, the total time spent by the process executing in user space, and/or the total time spent by the process executing in kernel space.
- the data may include a hash that may be used within data platform 12 to join process creation time attributes with runtime attributes to construct a full dataset.
- The following are examples of data that agent 112 can collect and provide to data platform 12.
- Core User Data: user name, UID (user ID), primary group, other groups, home directory.
- Failed Login Data: IP address, hostname, username, count.
- User Login Data: user name, hostname, IP address, start time, TTY (terminal), UID (user ID), GID (group ID), process, end time.
- Machine Data: hostname, domain name, architecture, kernel, kernel release, kernel version, OS, OS version, OS description, CPU, memory, model number, number of cores, last boot time, last boot reason, tags (e.g., Cloud provider tags such as AWS, GCP, or Azure tags), default router, interface name, interface hardware address, interface IP address and mask, promiscuous mode.
- Network Connection Data: source IP address, destination IP address, source port, destination port, protocol, start time, end time, incoming and outgoing bytes, source process, destination process, direction of connection, histograms of packet length, inter-packet delay, session lengths, etc.
- Listening Ports in Server: source IP address, port number, protocol, process.
- Dropped Packet Data: source IP address, destination IP address, destination port, protocol, count.
- Arp Data: source hardware address, source IP address, destination hardware address, destination IP address.
- DNS Data: source IP address, response code, response string, question (request), packet length, final answer (response).
- Package Data: exe path, package name, architecture, version, package path, checksums (MD5, SHA-1, SHA-256), size, owner, owner ID.
- Application Data: command line, PID (process ID), start time, UID (user ID), EUID (effective UID), PPID (parent process ID), PGID (process group ID), SID (session ID), exe path, username, container ID.
- Container Image Data: image creation time, parent ID, author, container type, repo, (AWS) tags, size, virtual size, image version.
- Container Data: container start time, container type, container name, container ID, network mode, privileged, PID mode, IP addresses, listening ports, volume map, process ID.
- File Data: file path, file data hash, symbolic links, file creation data, file change data, file metadata, file mode.
- An agent (such as agent 112) can also collect information about a container (e.g., a Docker container) running on its node.
- Collection about a container can be performed by an agent irrespective of whether the agent is itself deployed in a container or not (as the agent can be deployed in a container running in a privileged mode that allows for monitoring).
- Agents can discover containers (e.g., for monitoring) by listening for container create events (e.g., provided by Docker), and can also perform periodic ordered discovery scans to determine whether containers are running on a node.
- the agent can obtain attributes of the container, e.g., using standard Docker API calls (e.g., to obtain IP addresses associated with the container, whether there’s a server running inside, what port it is listening on, associated PIDs, etc.). Information such as the parent process that started the container can also be collected, as can information about the image (which comes from the Docker repository).
- agents may use namespaces to determine whether a process is associated with a container.
- Namespaces are a feature of the Linux kernel that can be used to isolate resources of a collection of processes. Examples of namespaces include process ID (PID) namespaces, network namespaces, and user namespaces. Given a process, the agent can perform a fast lookup to determine whether the process is part of the namespace that a container claims as its own.
- agents can be configured to report certain types of information (e.g., attribute information) once, when the agent first becomes aware of a process.
- static information is not reported again (or is reported once a day, every twelve hours, etc.), unless it changes (e.g., a process changes its parent, changes its owner, or a SHA-1 of the binary associated with the process changes).
- agents are configured to report a list of current connections every minute (or other appropriate time interval). In that connection list will be connections that started in that minute interval, connections that ended in that minute interval, and connections that were ongoing throughout the minute interval (e.g., a one minute slice of a one hour connection).
- agents are configured to collect/compute statistical information about connections (e.g., at the one-minute level of granularity and/or at any other time interval). Examples of such information include, for the time interval, the number of bytes transferred, and in which direction. Another example of information collected by an agent about a connection is the length of time between packets. For connections that span multiple time intervals (e.g., a seven minute connection), statistics may be calculated for each minute of the connection. Such statistical information (for all connections) can be reported (e.g., to a data aggregator) once a minute.
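Slicing a multi-interval connection into per-minute reporting windows, as described above, can be sketched as follows (times are epoch seconds; the one-minute interval is from the text, while the function and field shapes are illustrative).

```python
def minute_slices(start, end, interval=60):
    """Split a connection's [start, end) lifetime into per-interval
    slices, returning (slice_start, active_seconds) pairs so that
    statistics can be attributed to, and reported for, each interval
    the connection was alive in."""
    slices = []
    slice_start = (start // interval) * interval
    while slice_start < end:
        slice_end = slice_start + interval
        active = min(end, slice_end) - max(start, slice_start)
        slices.append((slice_start, active))
        slice_start = slice_end
    return slices
```

A connection lasting from second 90 to second 250, for example, contributes partial activity to its first and last minutes and full activity to the minutes in between.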
- agents are also configured to maintain histogram data for a given network connection, and provide the histogram data (e.g., in the Apache Avro data exchange format) under the Connection event type data.
- histograms include: 1. a packet length histogram (packet_len_hist), which characterizes network packet distribution; 2. a session length histogram (session_len_hist), which characterizes a network session length; 3. a session time histogram (session_time_hist), which characterizes a network session time; and 4. a session switch time histogram (session_switch_time_hist), which characterizes network session switch time (i.e., incoming->outgoing and vice versa).
- histogram data may include one or more of the following fields: 1. count, which provides a count of the elements in the sampling; 2. sum, which provides a sum of elements in the sampling; 3. max, which provides the highest value element in the sampling; 4. std_dev, which provides the standard deviation of elements in the sampling; and 5. buckets, which provides a discrete sample bucket distribution of sampling data (if applicable).
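The histogram fields listed above can be computed as in the following sketch. The bucket-boundary scheme (upper bounds, with one overflow bucket) and the use of the population standard deviation are assumptions for illustration; the source does not fix either choice.

```python
import math

def histogram_summary(samples, bucket_bounds):
    """Compute count, sum, max, std_dev, and a discrete bucket
    distribution for a list of numeric samples. bucket_bounds are
    inclusive upper bounds; samples above the last bound land in a
    final overflow bucket."""
    count = len(samples)
    total = sum(samples)
    mean = total / count
    variance = sum((s - mean) ** 2 for s in samples) / count
    buckets = [0] * (len(bucket_bounds) + 1)
    for s in samples:
        i = 0
        while i < len(bucket_bounds) and s > bucket_bounds[i]:
            i += 1
        buckets[i] += 1
    return {
        "count": count,
        "sum": total,
        "max": max(samples),
        "std_dev": math.sqrt(variance),
        "buckets": buckets,
    }
```

Summaries of this shape could then be serialized (e.g., in Avro) under the Connection event type, per the text above.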
- For some connections, a connection is opened, a string is sent, a string is received, and the connection is closed. With other protocols (e.g., NFS), both sides of the connection engage in a constant chatter. Histograms allow data platform 12 to model application behavior (e.g., using machine learning techniques), for establishing baselines, and for detecting deviations.
- Suppose that, typically, a connection generates 500 bytes of traffic, or 2,000 bytes of traffic; such connections would be considered within the typical usage pattern of the server. Suppose, however, that a connection is made that results in 10G of traffic. Such a connection is anomalous and can be flagged accordingly.
- data aggregator 114 may be configured to provide information (e.g., collected from nodes 108 by agents) to data platform 12.
- Data aggregator 128 may be similarly configured to provide information to data platform 12.
- both aggregator 114 and aggregator 128 may connect to a load balancer 130, which accepts connections from aggregators (and/or as applicable, agents), as well as other devices, such as computer 126 (e.g., when it communicates with web app 120), and supports fair balancing.
- load balancer 130 is a reverse proxy that load balances accepted connections internally to various microservices (described in more detail below), allowing for services provided by data platform 12 to scale up as more agents are added to the environment and/or as more entities subscribe to services provided by data platform 12.
- Example ways to implement load balancer 130 include, but are not limited to, using HaProxy, using nginx, and using elastic load balancing (ELB) services made available by Amazon.
- Agent service 132 is a microservice that is responsible for accepting data collected from agents (e.g., provided by aggregator 114).
- agent service 132 uses a standard secure protocol, such as HTTPS to communicate with aggregators (and as applicable agents), and receives data in an appropriate format such as Apache Avro.
- agent service 132 can perform a variety of checks, such as to see whether the data is being provided by a current customer, and whether the data is being provided in an appropriate format. If the data is not appropriately formatted (and/or is not provided by a current customer), it may be rejected.
- agent service 132 may facilitate copying the received data to stable storage using a streaming service (e.g., Amazon Kinesis and/or any other suitable streaming service). Once the ingesting into the streaming service is complete, service 132 may send an acknowledgement to the data provider (e.g., data aggregator 114). If the agent does not receive such an acknowledgement, it is configured to retry sending the data to data platform 12.
- Agent service 132 can be implemented using a REST API server framework (e.g., Java DropWizard), and can interface with Kinesis (e.g., using a Kinesis library).
- data platform 12 uses one or more streams (e.g., Kinesis streams) for all incoming customer data (e.g., including data provided by data aggregator 114 and data aggregator 128), and the data is sharded based on the node (also referred to herein as a “machine”) that originated the data (e.g., node 116 vs. node 122), with each node having a globally unique identifier within data platform 12.
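Sharding incoming data by originating machine, as described above, can be sketched with a deterministic hash of the node's globally unique identifier. This is a simplified stand-in: Kinesis itself derives shard placement from an MD5 hash of the partition key, and a real deployment would simply pass the machine identifier as the partition key rather than computing shard indices directly.

```python
import hashlib

def shard_for_machine(machine_id, num_shards):
    """Deterministically map a node's globally unique identifier to a
    shard index, so all records from one machine land on the same
    shard regardless of which agent-service instance writes them."""
    digest = hashlib.md5(machine_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Because the mapping is a pure function of the machine identifier, multiple agent-service instances writing concurrently still route each node's data consistently.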
- multiple instances of agent service 132 can write to multiple shards.
- Kinesis is a streaming service with a limited period (e.g., 1-7 days). To persist data longer than a day, the data may be copied to long term storage 42 (e.g., S3).
- Data loader 136 is a microservice that is responsible for picking up data from a data stream (e.g., a Kinesis stream) and persisting it in long term storage 42.
- files collected by data loader 136 from the Kinesis stream are placed into one or more buckets, and segmented using a combination of a customer identifier and time slice.
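A key layout combining a customer identifier with a time slice, as described above, might look like the following sketch. The path structure, the one-hour slice, and the function name are all hypothetical; the source only states that files are segmented by customer identifier and time slice.

```python
from datetime import datetime, timezone

def object_key(customer_id, event_time, filename, slice_seconds=3600):
    """Build a long-term-storage key segmented by customer identifier
    and time slice (layout is illustrative, not from the source)."""
    # Floor the event time to the start of its slice.
    slice_start = int(event_time.timestamp()) // slice_seconds * slice_seconds
    slice_dt = datetime.fromtimestamp(slice_start, tz=timezone.utc)
    return "%s/%s/%s" % (
        customer_id,
        slice_dt.strftime("%Y/%m/%d/%H"),
        filename,
    )
```

Keys of this shape make it cheap to list or load all of one customer's files for a given hour.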
- Data loader 136 can be implemented in any appropriate programming language, such as Java or C, and can be configured to use a Kinesis library to interface with Kinesis. In various embodiments, data loader 136 uses the Amazon Simple Queue Service (SQS) (e.g., to alert DB loader 140 that there is work for it to do).
- DB loader 140 is a microservice that is responsible for loading data into an appropriate data store 30, such as SnowflakeDB or Amazon Redshift, using individual per-customer databases.
- DB loader 140 is configured to periodically load data into a set of raw tables from files created by data loader 136 as per above.
- DB loader 140 manages throughput, errors, etc., to make sure that data is loaded consistently and continuously. Further, DB loader 140 can read incoming data and load into data store 30 data that is not already present in tables of data store 30 (also referred to herein as a database).
- DB loader 140 can be implemented in any appropriate programming language, such as Java or C, and can use an SQL framework such as jOOQ (e.g., to manage SQLs for insertion of data) and SQL/JDBC libraries.
- DB loader 140 may use Amazon S3 and Amazon Simple Queue Service (SQS) to manage files being transferred to and from data store 30.
- Customer data included in data store 30 can be augmented with data from additional data sources, such as AWS CloudTrail and/or other types of external tracking services.
- data platform may include a tracking service analyzer 144, which is another microservice.
- Tracking service analyzer 144 may pull data from an external tracking service (e.g., Amazon CloudTrail) for each applicable customer account, as soon as the data is available. Tracking service analyzer 144 may normalize the tracking data as applicable, so that it can be inserted into data store 30 for later querying/analysis. Tracking service analyzer 144 can be written in any appropriate programming language, such as Java or C. Tracking service analyzer 144 also makes use of SQL/JDBC libraries to interact with data store 30 to insert/query data.
- data platform 12 can model activities that occur within datacenters, such as datacenters 104 and 106.
- the model may be stable over time, and differences, even subtle ones (e.g., between a current state of the datacenter and the model) can be surfaced.
- the ability to surface such anomalies can be particularly beneficial in datacenter environments where rogue employees and/or external attackers may operate slowly (e.g., over a period of months), hoping that the elastic nature of typical resource use (e.g., virtualized servers) will help conceal their nefarious activities.
- data platform 12 can automatically discover entities (which may implement compute assets 16) deployed in a given datacenter.
- entities include workloads, applications, processes, machines, virtual machines, containers, files, IP addresses, domain names, and users.
- the entities may be grouped together logically (into analysis groups) based on behaviors, and temporal behavior baselines can be established.
- periodic graphs can be constructed (also referred to herein as polygraphs), in which the nodes are applicable logical entities, and the edges represent behavioral relationships between the logical entities in the graph. Baselines can be created for every node and edge.
- Communication is one example of a behavior.
- a model of communications between processes is an example of a behavioral model.
- the launching of applications is another example of a behavior that can be modeled.
- the baselines may be periodically updated (e.g., hourly) for every entity. Additionally or alternatively, the baselines may be continuously updated in substantially real-time as data is collected by agents. Deviations from the expected normal behavior can then be detected and automatically reported (e.g., as anomalies or threats detected). Such deviations may be due to a desired change, a misconfiguration, or malicious activity. As applicable, data platform 12 can score the detected deviations (e.g., based on severity and threat posed). Additional examples of analysis groups include models of machine communications, models of privilege changes, and models of insider behaviors (monitoring the interactive behavior of human users as they operate within the datacenter).
- agents may collect information about every connection involving their respective nodes. And, for each connection, information about both the server and the client may be collected (e.g., using the connection-to-process identification techniques described above). DNS queries and responses may also be collected.
- the DNS query information can be used in logical entity graphing (e.g., collapsing many different IP addresses to a single service - e.g., s3.amazon.com).
- process level information collected by agents include attributes (user ID, effective user ID, and command line). Information such as what user/application is responsible for launching a given process and the binary being executed (and its SHA-256 values) may also be provided by agents.
- the dataset collected by agents across a datacenter can be very large, and many resources (e.g., virtual machines, IP addresses, etc.) are recycled very quickly. For example, an IP address and port number used at a first point in time by a first process on a first virtual machine may very rapidly be used (e.g., an hour later) by a different process/virtual machine.
- a dataset (and elements within it) can be considered at both a physical level, and a logical level, as illustrated in Fig. 2B.
- Fig. 2B illustrates an example 5-tuple of data 210 collected by an agent, represented physically (216) and logically (217).
- the 5-tuple includes a source address 211, a source port 212, a destination address 213, a destination port 214, and a protocol 215.
- Note that port numbers (e.g., 212, 214) are of limited use for identification, as port usage is ephemeral.
- a Docker container can listen on an ephemeral port, which is unrelated to the service it will run. When another Docker container starts (for the same service), the port may well be different.
- IP addresses may be recycled frequently (and are thus also potentially ephemeral) or could be NATed, which makes identification difficult.
- a physical representation of the 5-tuple is depicted in region 216.
- a process 218 (executing on machine 219) has opened a connection to machine 220.
- process 218 is in communication with process 221.
- Information such as the number of packets exchanged between the two machines over the respective ports can be recorded.
- portions of the 5-tuple may change - potentially frequently - but still be associated with the same behavior.
- Suppose one application (e.g., Apache) is communicating with another application (e.g., Oracle). Either or both of Apache and Oracle may be multi-homed.
- This can lead to potentially thousands of 5-tuples (or more) that all correspond to Apache communicating with Oracle within a datacenter.
- Apache could be executed on a single machine, and could also be executed across fifty machines, which are variously spun up and down (with different IP addresses each time).
- An alternate representation of the 5-tuple of data 210 is depicted in region 217, and is logical.
- the logical representation of the 5-tuple aggregates the 5-tuple (along with other connections between Apache and Oracle having other 5-tuples) as logically representing the same connection.
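The collapse of many physical 5-tuples into one logical edge can be sketched as follows. The record shape is illustrative, and the application labels (e.g., "apache", "oracle") are assumed to have already been resolved from processes via the connection-to-process identification techniques described earlier.

```python
from collections import defaultdict

def aggregate_connections(records):
    """Collapse physical 5-tuples into logical application-to-application
    edges, keeping per-edge statistics.

    Each record is (src_app, src_ip, src_port, dst_app, dst_ip,
    dst_port, protocol, nbytes); IPs and ports are ignored for the
    logical view, since they are ephemeral."""
    edges = defaultdict(lambda: {"connections": 0, "bytes": 0})
    for src_app, _sip, _sport, dst_app, _dip, _dport, proto, nbytes in records:
        edge = edges[(src_app, dst_app, proto)]
        edge["connections"] += 1
        edge["bytes"] += nbytes
    return dict(edges)
```

Thousands of Apache-to-Oracle 5-tuples thus reduce to a single edge carrying summary statistics, which is the form a polygraph node/edge baseline is built from.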
- Fig. 2C depicts a portion of a logical polygraph.
- a datacenter has seven instances of the application update engine 225, executing as seven different processes on seven different machines, having seven different IP addresses, and using seven different ports.
- the instances of update engine variously communicate with update.core-os.net 226, which may have a single IP address or many IP addresses itself, over the one hour time period represented in the polygraph.
- update engine is a client, connecting to the server update.core-os.net, as indicated by arrow 228.
- Behaviors of the seven processes are clustered together, into a single summary. As indicated in region 227, statistical information about the connections is also maintained (e.g., number of connections, histogram information, etc.).
- a polygraph such as is depicted in Fig. 2C can be used to establish a baseline of behavior (e.g., at the one-hour level), allowing for the future detection of deviations from that baseline. As one example, suppose that statistically an update engine instance transmits data at 11 bytes per second. If an instance were instead to transmit data at 1000 bytes per second, such behavior would represent a deviation from the baseline and could be flagged accordingly.
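The baseline-deviation check in the example above can be sketched with a simple threshold test. The multiplier is an illustrative knob chosen here, not a value from the source; a production system would likely use a statistical model rather than a fixed ratio.

```python
def is_deviation(observed_rate, baseline_rates, threshold=10.0):
    """Flag an observed transmission rate as deviating from the
    behavioral baseline when it exceeds the baseline mean by more than
    `threshold` times (threshold is a hypothetical tuning parameter)."""
    mean = sum(baseline_rates) / len(baseline_rates)
    return observed_rate > threshold * mean
```

With a baseline around 11 bytes per second, an instance transmitting at 1000 bytes per second would be flagged, while ordinary fluctuation would not.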
- Fig. 2D illustrates a portion of a polygraph for a service that evidences more complex behaviors than are depicted in Fig. 2C.
- Fig. 2D illustrates the behaviors of S3 as a service (as used by a particular customer datacenter).
- Clients within the datacenter variously connect to the S3 service using one of five fully qualified domains (listed in region 230). Contact with any of the domains is aggregated as contact with S3 (as indicated in region 231).
- Depicted in region 232 are various containers which (as clients) connect with S3. Other containers (which do not connect with S3) are not included.
- statistical information about the connections is known and summarized, such as the number of bytes transferred, histogram information, etc.
- Fig. 2E illustrates a communication polygraph for a datacenter.
- the polygraph indicates a one hour summary of approximately 500 virtual machines, which collectively run one million processes, and make 100 million connections in that hour.
- a polygraph represents a drastic reduction in size (e.g., from tracking information on 100 million connections in an hour, to a few hundred nodes and a few hundred edges).
- as the datacenter scales (e.g., from 10 virtual machines to 100 virtual machines), the polygraph for the datacenter will tend to stay the same size (with the 100 virtual machines clustering into the same nodes that the 10 virtual machines previously clustered into).
- the polygraph may automatically scale to include behaviors involving those applications.
- nodes generally correspond to workers, and edges correspond to communications the workers engage in (with connection activity being the behavior modeled in polygraph 235).
- Another example polygraph could model other behavior, such as application launching.
- the communications graphed in Fig. 2E include traffic entering the datacenter, traffic exiting the datacenter, and traffic that stays wholly within the datacenter (e.g., traffic between workers).
- One example of a node included in polygraph 235 is the sshd application, depicted as node 236.
- 421 instances of sshd were executing during the one hour time period of data represented in polygraph 235.
- nodes within the datacenter communicated with a total of 1349 IP addresses outside of the datacenter (and not otherwise accounted for, e.g., as belonging to a service such as Amazon AWS 238 or Slack 239).
- User B, an administrator of datacenter 106, can use data platform 12 to view visualizations of polygraphs in a web browser (e.g., as served to user B via web app 120).
- One type of polygraph user B can view is an application-communication polygraph, which indicates, for a given one hour window (or any other suitable time interval), which applications communicated with which other applications.
- Another type of polygraph user B can view is an application launch polygraph.
- User B can also view graphs related to user behavior, such as an insider behavior graph which tracks user connections (e.g., to internal and external applications, including chains of such behavior), a privilege change graph which tracks how privileges change between processes, and a user login graph, which tracks which (logical) machines a user logs into.
- FIG. 2F illustrates an example of an application-communication polygraph for a datacenter (e.g., datacenter 106) for the one hour period of 9am-10am on June 5.
- the time slice currently being viewed is indicated in region 240. If user B clicks his mouse in region 241, user B will be shown a representation of the application-communication polygraph as generated for the following hour (10am-11am on June 5).
- Fig. 2G depicts what is shown in user B’s browser after he has clicked on region 241, and has further clicked on region 242.
- the selection in region 242 turns on and off the ability to compare two time intervals to one another.
- User B can select from a variety of options when comparing the 9am-10am and 10am-11am time intervals.
- By clicking region 248, user B will be shown the union of both graphs (i.e., any connections that were present in either time interval).
- By clicking region 249, user B will be shown the intersection of both graphs (i.e., only those connections that were present in both time intervals).
- Region 250 depicts connections that are only present in the 9am-10am polygraph in a first color 251, and depicts connections that are only present in the 10am-11am polygraph in a second color 252. Connections present in both polygraphs are omitted from display.
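The union, intersection, and difference views described above correspond to set operations over each interval's edge set; a sketch with hypothetical edges:

```python
# Edges as (source, destination) pairs for two hourly polygraphs.
g_9am = {("server", "sshd"), ("server", "systemd"), ("client1", "slack")}
g_10am = {("bad_ip", "nginx"), ("client2", "slack")}

union = g_9am | g_10am         # connections present in either interval
intersection = g_9am & g_10am  # connections present in both intervals
only_first = g_9am - g_10am    # drawn in the first color
only_second = g_10am - g_9am   # drawn in the second color

print(len(union), len(intersection))
```

Connections in `only_first` ended before the second interval; those in `only_second` are new in it.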
- a connection was made by a server to sshd (253) and also to systemd (254). Both of those connections ended prior to 10am and are thus depicted in the first color.
- A connection was made from a known bad external IP to nginx (255). The connection was not present during the 9am-10am time slice and thus is depicted in the second color.
- As another example, two different connections were made to a Slack service between 9am and 11am. However, the first was made by a first client during the 9am-10am time slice (256) and the second was made by a different client during the 10am-11am slice (257), and so the two connections are depicted respectively in the first and second colors.
- Fig. 2H illustrates the interface after user B has searched for applications containing the term “etcd”.
- three applications containing the term “etcd” were engaged in communications during the 9am-10am window.
- One application is etcdctl, a command line client for etcd.
- a total of three different etcdctl processes were executed during the 9am-10am window, and were clustered together (260).
- Fig. 2H also depicts two different clusters that are both named etcd2.
- the first cluster includes (for the 9am-10am window) five members (261) and the second cluster includes (for the same window) eight members (262).
- the reason for these two distinct clusters is that the two groups of applications behave differently (e.g., they exhibit two distinct sets of communication patterns).
- the instances of etcd2 in cluster 261 only communicate with locksmithctl (263) and other etcd2 instances (in both clusters 261 and 262).
- the instances of etcd2 in cluster 262 communicate with additional entities, such as etcdctl and Docker containers.
- user B can click on one of the clusters (e.g., cluster 261) and be presented with summary information about the applications included in the cluster, as is shown in Fig. 2I (e.g., in region 265).
- User B can also double click on a given cluster (e.g., cluster 261) to see details on each of the individual members of the cluster broken out.
- Fig. 2J illustrates an example of a portion of a launch polygraph.
- user B has typed “find” into region 266, to see how the “find” application is being launched. As illustrated, find applications (267) are launched by bash (268) and by systemd (269).
- Fig. 2K illustrates another example of a portion of an application launch polygraph.
- user B has searched (270) for “python ma” to see how “python marathon_lb” (271) is launched.
- python marathon_lb is launched as a result of a chain of the same seven applications each time. If python marathon_lb is ever launched in a different manner, this indicates anomalous behavior.
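A minimal sketch of flagging an anomalous launch chain; the application names in the baseline chain below are hypothetical placeholders, not taken from Fig. 2K:

```python
# Hypothetical baseline ancestor chain observed for every prior launch.
BASELINE_CHAIN = ("systemd", "docker", "containerd", "runc", "bash",
                  "start.sh", "python")

def launch_is_anomalous(observed_chain):
    # A launch deviates if its ancestor chain differs from the baseline.
    return tuple(observed_chain) != BASELINE_CHAIN

print(launch_is_anomalous(BASELINE_CHAIN))      # matches the baseline
print(launch_is_anomalous(("bash", "python")))  # deviates from the baseline
```

A real implementation would maintain a set of observed chains per application rather than a single tuple.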
- the behavior could be indicative of malicious activities, but could also be due to other reasons, such as a misconfiguration, a performance-related issue, and/or a failure, etc.
- the insider behavior graph tracks information about behaviors such as processes started by a user interactively using protocols such as ssh or telnet, and any processes started by those processes.
- Suppose an administrator logs into a first virtual machine in datacenter 106 (e.g., using sshd via an external connection he makes from a hotel), with a first set of credentials (e.g., first.last@example.com and an appropriate password). If the administrator then connects to a second virtual machine (e.g., using the same credentials), then uses the sudo command to change identities to those of another user, and then launches a program, graphs built by data platform 12 can be used to associate the administrator with each of his actions, including launching the program using the identity of another user.
- Fig. 2L illustrates an example of a portion of an insider behavior graph.
- user B is viewing a graph that corresponds to the time slice of 3pm-4pm on June 1.
- Fig. 2L illustrates the intemal/external applications that users connected to during the one hour time slice. If a user typically communicates with particular applications, that information will become part of a baseline. If the user deviates from his baseline behavior (e.g., using new applications, or changing privilege in anomalous ways), such anomalies can be surfaced.
- Fig. 2M illustrates an example of a portion of a privilege change graph, which identifies how privileges are changed between processes.
- When a user launches a process (e.g., “ls”), the process inherits the same privileges that the user has.
- Information included in the privilege change graph can be determined by examining the parent of each running process, and determining whether there is a match in privilege between the parent and the child. If the privileges are different, a privilege change has occurred (whether a change up or a change down).
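The parent/child privilege comparison described above can be sketched over a process table; the pid/ppid/uid fields are illustrative stand-ins for agent-reported process data:

```python
def find_privilege_changes(processes):
    # processes: {pid: {"ppid": int, "uid": int}}. Returns one tuple
    # (pid, parent_uid, child_uid) per child whose uid differs from its
    # parent's, i.e., per privilege change (up or down).
    changes = []
    for pid, info in processes.items():
        parent = processes.get(info["ppid"])
        if parent is not None and parent["uid"] != info["uid"]:
            changes.append((pid, parent["uid"], info["uid"]))
    return changes

procs = {
    1: {"ppid": 0, "uid": 0},
    100: {"ppid": 1, "uid": 1000},  # login shell dropped to a normal user
    200: {"ppid": 100, "uid": 0},   # e.g., sudo escalated back to root
}
print(find_privilege_changes(procs))
```

Both the drop (root to user) and the escalation (user to root) appear in the output, matching the "whether a change up or a change down" behavior noted above.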
- the application ntpd is one rare example of a scenario in which a process escalates (272) to root, and then returns back (273).
- the sudo command is another example (e.g., used by an administrator to temporarily have a higher privilege).
- ntpd’s privilege change actions, and the legitimate actions of various administrators (e.g., using sudo), will be incorporated into a baseline model by data platform 12.
- When deviations occur, such as where a new application that is not ntpd escalates privilege, or where an individual who has not previously/does not routinely use sudo does so, such behaviors can be identified as anomalous.
- Fig. 2N illustrates an example of a portion of a user login graph, which identifies which users log into which logical nodes.
- Physical nodes (whether bare metal or virtualized) are clustered into a logical machine cluster, for example, using yet another graph, a machine-server graph, an example of which is shown in Fig. 2O.
- a determination is made as to what type of machine it is, based on what kind(s) of workflows it runs.
- some machines run as master nodes (having a typical set of workflows they run, as master nodes) and can thus be clustered as master nodes.
- Worker nodes are different from master nodes, for example, because they run Docker containers, and frequently change as containers move around. Worker nodes can similarly be clustered.
- the polygraph depicted in Fig. 2E corresponds to activities in a datacenter in which, in a given hour, approximately 500 virtual machines collectively run one million processes, and make 100 million connections in that hour.
- the polygraph represents a drastic reduction in size (e.g., from tracking information on 100 million connections in an hour, to a few hundred nodes and a few hundred edges).
- a polygraph can be constructed (e.g., using commercially available computing infrastructure) in less than an hour (e.g., within a few minutes).
- ongoing hourly snapshots of a datacenter can be created within a two hour moving window (i.e., collecting data for the time period 8am-9am, while also generating a snapshot for the previous time period 7am-8am).
- the following describes various example infrastructure that can be used in polygraph construction, and also describes various techniques that can be used to construct polygraphs.
- embodiments of data platform 12 may be built using any suitable infrastructure as a service (IaaS) (e.g., AWS).
- data platform 12 can use Simple Storage Service (S3) for data storage, Key Management Service (KMS) for managing secrets, Simple Queue Service (SQS) for managing messaging between applications, Simple Email Service (SES) for sending emails, and Route 53 for managing DNS.
- Other infrastructure tools can also be used.
- Examples include: orchestration tools (e.g., Kubernetes or Mesos/Marathon), service discovery tools (e.g., Mesos-DNS), service load balancing tools (e.g., marathon-LB), container tools (e.g., Docker or rkt), log/metric tools (e.g., collectd, fluentd, kibana, etc.), big data processing systems (e.g., Spark, Hadoop, AWS Redshift, Snowflake etc.), and distributed key value stores (e.g., Apache Zookeeper or etcd2).
- data platform 12 may make use of a collection of microservices.
- Each microservice can have multiple instances, and may be configured to recover from failure, scale, and distribute work amongst various such instances, as applicable.
- microservices are auto-balancing for new instances, and can distribute workload if new instances are started or existing instances are terminated.
- microservices may be deployed as self-contained Docker containers.
- a Mesos- Marathon or Spark framework can be used to deploy the microservices (e.g., with Marathon monitoring and restarting failed instances of microservices as needed).
- the service etcd2 can be used by microservice instances to discover how many peer instances are running, and used for calculating a hash-based scheme for workload distribution.
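A minimal sketch of such a hash-based workload distribution scheme; the use of SHA-256 and the instance names are assumptions for illustration (the document specifies only a hash-based scheme over peers discovered via etcd2):

```python
import hashlib

def owner(work_item_id, peer_instances):
    # Deterministically map a work item to one of the peer instances
    # (e.g., as discovered via etcd2), using a stable hash so that all
    # peers agree on the assignment without coordination.
    peers = sorted(peer_instances)
    digest = hashlib.sha256(work_item_id.encode()).hexdigest()
    return peers[int(digest, 16) % len(peers)]

peers = ["instance-a", "instance-b", "instance-c"]
assignment = owner("customer-42", peers)
print(assignment in peers)
print(owner("customer-42", list(reversed(peers))) == assignment)
```

Because the peer list is sorted before hashing, every instance computes the same owner for a given work item regardless of discovery order.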
- Microservices may be configured to publish various health/status metrics to either an SQS queue, or etcd2, as applicable.
- Amazon DynamoDB can be used for state management.
- Graph generator 146 is a microservice that may be responsible for generating raw behavior graphs on a per customer basis periodically (e.g., once an hour). In particular, graph generator 146 may generate graphs of entities (as the nodes in the graph) and activities between entities (as the edges). In various embodiments, graph generator 146 also performs other functions, such as aggregation, enrichment (e.g., geolocation and threat), reverse DNS resolution, TF-IDF based command line analysis for command type extraction, parent process tracking, etc.
- [00190] Graph generator 146 may perform joins on data collected by the agents, so that both sides of a behavior are linked.
- first and second virtual machines may each report information on their view of the communication (e.g., the PID of their respective processes, the amount of data exchanged and in which direction, etc.).
- When graph generator 146 performs a join on the data provided by both agents, the graph will include a node for each of the processes, and an edge indicating communication between them (as well as other information, such as the directionality of the communication - i.e., which process acted as the server and which as the client in the communication).
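The join described above can be sketched as follows; the record fields (conn, role, pid, bytes) are hypothetical stand-ins for the agent-reported connection data:

```python
def join_agent_reports(reports):
    # reports: per-agent records of one connection, keyed by a shared
    # connection tuple. Each matched client/server pair becomes one edge
    # preserving directionality (client -> server).
    by_conn = {}
    for r in reports:
        by_conn.setdefault(r["conn"], {})[r["role"]] = r
    edges = []
    for sides in by_conn.values():
        if "client" in sides and "server" in sides:
            edges.append({
                "src": sides["client"]["pid"],
                "dst": sides["server"]["pid"],
                "bytes": sides["client"]["bytes"] + sides["server"]["bytes"],
            })
    return edges

reports = [
    {"conn": ("10.0.0.1", 5000, "10.0.0.2", 80), "role": "client",
     "pid": 1234, "bytes": 300},
    {"conn": ("10.0.0.1", 5000, "10.0.0.2", 80), "role": "server",
     "pid": 5678, "bytes": 700},
]
print(join_agent_reports(reports))
```

Each virtual machine reports its own view of the connection; the join links both views into a single directed edge.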
- connections are process to process (e.g., from a process on one virtual machine within the cloud environment associated with entity A to another process on a virtual machine within the cloud environment associated with entity A).
- a process may be in communication with a node (e.g., outside of entity A) which does not have an agent deployed upon it.
- a node within entity A might be in communication with node 172, outside of entity A.
- communications with node 172 are modeled (e.g., by graph generator 146) using the IP address of node 172.
- the IP address of the node can be used by graph generator in modeling.
- Graphs created by graph generator 146 may be written to data store 30 and cached for further processing.
- a graph may be a summary of all activity that happened in a particular time interval. As each graph corresponds to a distinct period of time, different rows can be aggregated to find summary information over a larger timestamp. In some examples, picking two different graphs from two different timestamps can be used to compare different periods. If necessary, graph generator can parallelize its workload (e.g., where its backlog cannot otherwise be handled within a particular time period, such as an hour, or if it is required to process a graph spanning a long time period).
- Graph generator 146 can be implemented in any appropriate programming language, such as Java or C, and machine learning libraries, such as Spark’s MLLib. Example ways that graph generator computations can be implemented include using SQL or Map-R, using Spark or Hadoop.
- SSH tracker 148 is a microservice that may be responsible for following ssh connections and process parent hierarchies to determine trails of user ssh activity. Identified ssh trails are placed by the SSH tracker 148 into data store 30 and cached for further processing.
- SSH tracker 148 can be implemented in any appropriate programming language, such as Java or C, and machine learning libraries, such as Spark’s MLLib.
- Example ways that SSH tracker computations can be implemented include using SQL or Map-R, using Spark or Hadoop.
- Threat aggregator 150 is a microservice that may be responsible for obtaining third party threat information from various applicable sources, and making it available to other micro- services. Examples of such information include reverse DNS information, GeoIP information, lists of known bad domains/IP addresses, lists of known bad files etc. As applicable, the threat information is normalized before insertion into data store 30. Threat aggregator 150 can be implemented in any appropriate programming language, such as Java or C, using SQL/JDBC libraries to interact with data store 30 (e.g., for insertions and queries).
- Scheduler 152 is a microservice that may act as a scheduler and that may run arbitrary jobs organized as a directed graph. In some examples, scheduler 152 ensures that all jobs for all customers are able to run during a given time interval (e.g., every hour). Scheduler 152 may handle errors and retrying for failed jobs, track dependencies, manage appropriate resource levels, and/or scale jobs as needed. Scheduler 152 can be implemented in any appropriate programming language, such as Java or C. A variety of components can also be used, such as open source scheduler frameworks (e.g., Airflow), or AWS services (e.g., the AWS Data pipeline) which can be used for managing schedules.
- Graph Behavior Modeler (GBM) 154 is a microservice that may compute polygraphs.
- GBM 154 can be used to find clusters of nodes in a graph that should be considered similar based on some set of their properties and relationships to other nodes. As described herein, the clusters and their relationships can be used to provide visibility into a datacenter environment without requiring user specified labels. GBM 154 may track such clusters over time persistently, allowing for changes to be detected and alerts to be generated.
- GBM 154 may take as input a raw graph (e.g., as generated by graph generator 146). Nodes are actors of a behavior, and edges are the behavior relationship itself. For example, in the case of communication, example actors include processes, which communicate with other processes.
- the GBM 154 clusters the raw graph based on behaviors of actors and produces a summary (the polygraph). The polygraph summarizes behavior at a datacenter level.
- the GBM also produces “observations” that represent changes detected in the datacenter. Such observations may be based on differences in cumulative behavior (e.g., the baseline) of the datacenter with its current behavior.
- the GBM 154 can be implemented in any appropriate programming language, such as Java, C, or Golang, using appropriate libraries (as applicable) to handle distributed graph computations (handling large amounts of data analysis in a short amount of time).
- Apache Spark is another example tool that can be used to compute polygraphs.
- the GBM can also take feedback from users and adjust the model according to that feedback. For example, if a given user is interested in relearning behavior for a particular entity, the GBM can be instructed to “forget” the implicated part of the polygraph.
- GBM runner 156 is a microservice that may be responsible for interfacing with GBM 154 and providing GBM 154 with raw graphs (e.g., using a query language, such as SQL, to push any computations it can to data store 30). GBM runner 156 may also insert polygraph output from GBM 154 to data store 30. GBM runner 156 can be implemented in any appropriate programming language, such as Java or C, using SQL/JDBC libraries to interact with data store 30 to insert and query data.
- Alert generator 158 is a microservice that may be responsible for generating alerts.
- Alert generator 158 may examine observations (e.g., produced by GBM 154) in aggregate, deduplicate them, and score them. Alerts may be generated for observations with a score exceeding a threshold.
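The aggregation, deduplication, and scoring flow can be sketched as follows; the observation fields and the 0.7 threshold are hypothetical:

```python
def generate_alerts(observations, threshold=0.7):
    # Deduplicate observations by key, keep the highest score per key,
    # and emit alerts only for scores exceeding the threshold.
    best = {}
    for obs in observations:
        key = obs["key"]
        if key not in best or obs["score"] > best[key]["score"]:
            best[key] = obs
    return [o for o in best.values() if o["score"] > threshold]

obs = [
    {"key": "new_external_ip", "score": 0.9},
    {"key": "new_external_ip", "score": 0.4},  # duplicate, lower score
    {"key": "new_parent", "score": 0.2},       # below threshold
]
print(generate_alerts(obs))
```

Scoring itself could be computed with a machine learning library (e.g., Spark's MLLib, as noted below); this sketch assumes scores are already attached.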
- Alert generator 158 may also compute (or retrieve, as applicable) data that a customer (e.g., user A or user B) might need when reviewing the alert. Examples of events that can be detected by data platform 12 (and alerted on by alert generator 158) include, but are not limited to, the following:
- new user: This event may be created when a user (e.g., of node 116) is first observed by an agent within a datacenter.
- user launched new binary: This event may be generated when an interactive user launches an application for the first time.
- new privilege escalation: This event may be generated when user privileges are escalated and a new application is run.
- new application or container: This event may be generated when an application or container is seen for the first time.
- new external connection: This event may be generated when a connection to an external IP/domain is made from a new application.
- new external host or IP: This event may be generated when a new external host or IP is involved in a connection with a datacenter.
- new internal connection: This event may be generated when a connection between internal-only applications is seen for the first time.
- new external client: This event may be generated when a new external connection is seen for an application which typically does not have external connections.
- new parent: This event may be generated when an application is launched by a different parent.
- connection to known bad IP/domain: Data platform 12 maintains (or can otherwise access) one or more reputation feeds. If an environment makes a connection to a known bad IP or domain, an event will be generated.
- An event may be generated when a successful connection to a datacenter from a known bad IP is observed by data platform 12.
- Alert generator 158 can be implemented in any appropriate programming language, such as Java or C, using SQL/JDBC libraries to interact with data store 30 to insert and query data. In various embodiments, alert generator 158 also uses one or more machine learning libraries, such as Spark’s MLLib (e.g., to compute scoring of various observations). Alert generator 158 can also take feedback from users about which kinds of events are of interest and which to suppress.
- QsJobServer 160 is a microservice that may look at all the data produced by data platform 12 for an hour, and compile a materialized view (MV) out of the data to make queries faster. The MV helps make sure that the queries customers most frequently run, and data that they search for, can be easily queried and answered.
- QsJobServer 160 may also precompute and cache a variety of different metrics so that they can quickly be provided as answers at query time.
- QsJobServer 160 can be implemented using any appropriate programming language, such as Java or C, using SQL/JDBC libraries.
- QsJobServer 160 is able to compute an MV efficiently at scale, where there could be a large number of joins.
- An SQL engine, such as Oracle, can be used to efficiently execute the SQL, as applicable.
- Alert notifier 162 is a microservice that may take alerts produced by alert generator 158 and send them to customers’ integrated Security Information and Event Management (SIEM) products (e.g., Splunk, Slack, etc.).
- Alert notifier 162 can be implemented using any appropriate programming language, such as Java or C.
- Alert notifier 162 can be configured to use an email service (e.g., AWS SES or pagerduty) to send emails.
- Alert notifier 162 may also provide templating support (e.g., Velocity or Moustache) to manage templates and structured notifications to SIEM products.
- Reporting module 164 is a microservice that may be responsible for creating reports out of customer data (e.g., daily summaries of events, etc.) and providing those reports to customers (e.g., via email). Reporting module 164 can be implemented using any appropriate programming language, such as Java or C. Reporting module 164 can be configured to use an email service (e.g., AWS SES or pagerduty) to send emails. Reporting module 164 may also provide templating support (e.g., Velocity or Moustache) to manage templates (e.g., for constructing HTML-based email).
- Web app 120 is a microservice that provides a user interface to data collected and processed on data platform 12.
- Web app 120 may provide login, authentication, query, data visualization, etc. features.
- Web app 120 may, in some embodiments, include both client and server elements.
- Example ways the server elements can be implemented are using Java DropWizard or Node.js to serve business logic, and a combination of JSON/HTTP to manage the service.
- Example ways the client elements can be implemented are using frameworks such as React, Angular, or Backbone. JSON, jQuery, and JavaScript libraries (e.g., underscore) can also be used.
- Query service 166 is a microservice that may manage all database access for web app 120.
- Query service 166 abstracts out data obtained from data store 30 and provides a JSON-based REST API service to web app 120.
- Query service 166 may generate SQL queries for the REST APIs that it receives at run time.
- Query service 166 can be implemented using any appropriate programming language, such as Java or C and SQL/JDBC libraries, or an SQL framework such as jOOQ.
- Query service 166 can internally make use of a variety of types of databases, including a relational database engine 168 (e.g., AWS Aurora) and/or data store 30 to manage data for clients. Examples of tables that query service 166 manages are OLTP tables and data warehousing tables.
- Cache 170 may be implemented by Redis and/or any other service that provides a key- value store.
- Data platform 12 can use cache 170 to keep information for frontend services about users. Examples of such information include valid tokens for a customer, valid cookies of customers, the last time a customer tried to login, etc.
- Fig. 3A illustrates an example of a process for detecting anomalies in a network environment.
- process 300 is performed by data platform 12.
- the process begins at 301 when data associated with activities occurring in a network environment (such as entity A’s datacenter) is received.
- In various embodiments, the received data comprises the agent-collected data described above (e.g., in conjunction with process 200).
- a logical graph model is generated, using at least a portion of the monitored activities.
- a variety of approaches can be used to generate such logical graph models, and a variety of logical graphs can be generated (whether using the same, or different approaches).
- the following is one example of how data received at 301 can be used to generate and maintain a model.
- data platform 12 creates an aggregate graph of physical connections (also referred to herein as an aggregated physical graph) by matching connections that occurred in the first hour into communication pairs. Clustering is then performed on the communication pairs. Examples of such clustering, described in more detail below, include performing Matching Neighbor clustering and similarity (e.g., SimRank) clustering. Additional processing can also be performed (and is described in more detail below), such as by splitting clusters based on application type, and annotating nodes with DNS query information.
- the resulting graph (also referred to herein as a base graph or common graph) can be used to generate a variety of models, where a subset of node and edge types (described in more detail below) and their properties are considered in a given model.
- Examples of such models include a UID to UID model (also referred to herein as a Uid2Uid model), a CType model which clusters together processes that share command line similarity, and a PType model which clusters together processes that share behaviors over time.
- the cumulative graph (also referred to herein as a cumulative PType graph and a polygraph) is a running model of how processes behave over time. Nodes in the cumulative graph are PType nodes, and provide information such as a list of all active processes and PIDs in the last hour, the number of historic total processes, the average number of active processes per hour, the application type of the process (e.g., the CType of the PType), and historic CType information/frequency.
- Edges in the cumulative graph can represent connectivity and provide information such as connectivity frequency.
- the edges can be weighted (e.g., based on number of connections, number of bytes exchanged, etc.).
- Edges in the cumulative graph (and snapshots) can also represent transitions.
- One approach to merging a snapshot of the activity of the last hour into a cumulative graph is as follows. An aggregate graph of physical connections is made for the connections included in the snapshot (as was previously done for the original snapshot used during bootstrap). And, clustering/splitting is similarly performed on the snapshot’s aggregate graph. Next, PType clusters in the snapshot’s graph are compared against PType clusters in the cumulative graph to identify commonality.
- One approach to determining commonality is, for any two nodes that are members of a given CmdType (described in more detail below), comparing internal neighbors and calculating a set membership Jaccard distance. The pairs of nodes are then ordered by decreasing similarity (i.e., with the most similar sets first). For nodes with a threshold amount of commonality (e.g., at least 66% members in common), any new nodes (i.e., appearing in the snapshot’s graph but not the cumulative graph) are assigned the same PType identifier as is assigned to the corresponding node in the cumulative graph.
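A sketch of the commonality check, using Jaccard similarity (the complement of the Jaccard distance mentioned above); the neighbor sets shown are hypothetical:

```python
def jaccard(a, b):
    # Jaccard similarity of two neighbor sets: |a ∩ b| / |a ∪ b|.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

cumulative_neighbors = {"etcd2", "locksmithctl", "docker"}
snapshot_neighbors = {"etcd2", "locksmithctl"}

# With a threshold amount of commonality (e.g., at least 66% members in
# common), the snapshot node inherits the cumulative node's PType identifier.
sim = jaccard(cumulative_neighbors, snapshot_neighbors)
print(sim >= 0.66)
```

Node pairs would be ordered by decreasing similarity before assignment, so the most similar sets are matched first.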
- a network signature is generated (i.e., indicative of the kinds of network connections the node makes, who the node communicates with, etc.). The following processing is then performed until convergence. If a match of the network signature is found in the cumulative graph, the unclassified node is assigned the PType identifier of the corresponding node in the cumulative graph. Any nodes which remain unclassified after convergence are new PTypes and are assigned new identifiers and added to the cumulative graph as new. As applicable, the detection of a new PType can be used to generate an alert. If the new PType has a new CmdType, a severity of the alert can be increased.
- For any surviving nodes (i.e., present in both the cumulative graph and the snapshot graph), if the PType changes, the change is noted as a transition, and an alert can be generated. If a surviving node changes PType and also changes CmdType, a severity of the alert can be increased.
- Changes to the cumulative graph can be used (e.g., at 303) to detect anomalies (described in more detail below).
- Two example kinds of anomalies that can be detected by data platform 12 include security anomalies (e.g., a user or process behaving in an unexpected manner) and devops/root cause anomalies (e.g., network congestion, application failure, etc.).
- Detected anomalies can be recorded and surfaced (e.g., to administrators, auditors, etc.), such as through alerts which are generated at 304 based on anomaly detection.
- an aggregated physical graph can be generated on a per customer basis periodically (e.g., once an hour) from raw physical graph information, by matching connections (e.g., between two processes on two virtual machines).
- a deterministic fixed approach is used to cluster nodes in the aggregated physical graph (e.g., representing processes and their communications).
- Matching Neighbors Clustering (MNC) can be performed on the aggregated physical graph to determine which entities exhibit identical behavior and cluster such entities together.
- Fig. 3B depicts a set of example processes (p1, p2, p3, and p4) communicating with other processes (p10 and p11).
- Fig. 3B is a graphical representation of a small portion of an aggregated physical graph showing (for a given time period, such as an hour) which processes in a datacenter communicate with which other processes.
- processes pl, p2, and p3 will be clustered together (305), as they exhibit identical behavior (they communicate with plO and only plO).
- Process p4 which communicates with both plO and pl 1, will be clustered separately.
- MNC only those processes exhibiting identical (communication) behavior will be clustered.
- an alternate clustering approach can also/instead be used, which uses a similarity measure (e.g., constrained by a threshold value, such as a 60% similarity) to cluster items.
- In some embodiments, the output of MNC is used as input to SimRank; in other embodiments, MNC is omitted.
- Fig. 3C depicts a set of example processes (p4, p5, p6) communicating with other processes (p7, p8, p9). As illustrated, most of nodes p4, p5, and p6 communicate with most of nodes p7, p8, and p9 (as indicated in Fig. 3C with solid connection lines). As one example, process p4 communicates with process p7 (310), process p8 (311), and process p9 (312). An exception is process p6, which communicates with processes p7 and p8, but does not communicate with process p9 (as indicated by dashed line 313). If MNC were applied to the nodes depicted in Fig. 3C, nodes p4 and p5 would be clustered (and node p6 would not be included in their cluster).
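As a concrete illustration of the MNC step described above, the following sketch (names and data layout are illustrative, not the platform's actual implementation) clusters processes whose sets of communication peers are identical:

```python
from collections import defaultdict

def matching_neighbors_clustering(edges):
    """Cluster nodes that communicate with exactly the same set of peers.

    edges: iterable of (src, dst) pairs from the aggregated physical graph.
    Returns a list of clusters (each a sorted list of node names).
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)

    # Nodes with identical neighbor sets share a signature and are grouped.
    by_signature = defaultdict(list)
    for node, peers in neighbors.items():
        by_signature[frozenset(peers)].append(node)
    return [sorted(members) for members in by_signature.values()]

# The Fig. 3B example: p1-p3 talk only to p10; p4 talks to p10 and p11.
edges = [("p1", "p10"), ("p2", "p10"), ("p3", "p10"),
         ("p4", "p10"), ("p4", "p11")]
clusters = matching_neighbors_clustering(edges)
# p1, p2, and p3 are clustered together; p4 is clustered separately.
```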
- For a node v in a graph, let I(v) and O(v) denote the respective sets of in-neighbors and out-neighbors of v.
- Individual in-neighbors are denoted as Ii(v), for 1 ≤ i ≤ |I(v)|.
- The similarity between two objects a and b can be denoted by s(a,b) ∈ [0,1]. Per the SimRank equation, s(a,b) = (C / (|I(a)| |I(b)|)) · Σi Σj s(Ii(a), Ij(b)), summing over 1 ≤ i ≤ |I(a)| and 1 ≤ j ≤ |I(b)|, with s(a,a) = 1 and s(a,b) = 0 when I(a) or I(b) is empty, where C is a decay factor between 0 and 1.
- One example value for the decay factor C is 0.8 (and a fixed number of iterations such as five).
- Another example value for the decay factor C is 0.6 (and/or a different number of iterations).
- n is the number of nodes in G.
- For each iteration k, n² entries sk(*,*) are kept, where sk(a,b) gives the score between a and b on iteration k.
- Successive computations of sk+1(*,*) are made based on sk(*,*), starting with s0(*,*), where each s0(a,b) is a lower bound on the actual SimRank score: s0(a,b) = 1 if a = b, and s0(a,b) = 0 otherwise.
- the similarity of (a,b) is updated using the similarity scores of the neighbors of (a,b) from the previous iteration k according to the SimRank equation.
- The values sk(*,*) are nondecreasing as k increases.
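A direct (unoptimized) implementation of the iterative computation above can be sketched as follows; the graph, node names, and parameter values are illustrative:

```python
from itertools import product

def simrank(nodes, in_neighbors, C=0.8, iterations=5):
    """Iterative SimRank: sk approaches the true score s from below.

    in_neighbors maps each node to its list of in-neighbors I(v).
    """
    # s0(a, b): 1 on the diagonal, 0 elsewhere (a lower bound on s).
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        nxt = {}
        for a, b in product(nodes, repeat=2):
            Ia, Ib = in_neighbors.get(a, []), in_neighbors.get(b, [])
            if a == b:
                nxt[(a, b)] = 1.0
            elif not Ia or not Ib:
                nxt[(a, b)] = 0.0
            else:
                # Update s(a,b) from the previous iteration's neighbor scores.
                total = sum(s[(x, y)] for x in Ia for y in Ib)
                nxt[(a, b)] = C * total / (len(Ia) * len(Ib))
        s = nxt
    return s

# Clients initiate connections, so the servers' in-neighbors are clients:
# s1 serves c1 and c2; s2 serves c1, c2, and c3.
in_nb = {"s1": ["c1", "c2"], "s2": ["c1", "c2", "c3"]}
scores = simrank(["s1", "s2", "c1", "c2", "c3"], in_nb)
# scores[("s1", "s2")] settles at 0.8 * 2 / 6: similar, but not identical.
```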
- Fig. 3D depicts a set of processes, and in particular server processes s1 and s2, and client processes c1, c2, c3, c4, c5, and c6.
- Suppose that client processes c1, c2, c3, c4, c5, and c6 are present in the graph depicted in Fig. 3D (and the other nodes depicted are omitted from consideration).
- Using MNC, nodes s1 and s2 would be clustered together, as would nodes c1 and c2.
- SimRank clustering as described above would also result in those two clusters (s1 and s2, and c1 and c2).
- With MNC, identical behavior is required.
- MNC would therefore not include c3 in a cluster with c2 and c1, because node c3 only communicates with node s2 and not node s1.
- In contrast, a SimRank clustering of a graph that includes nodes s1, s2, c1, c2, and c3 would result (based, e.g., on an applicable selected decay value and number of iterations) in a first cluster comprising nodes s1 and s2, and a second cluster comprising c1, c2, and c3.
- nodes s1 and s2 will become decreasingly similar (i.e., their intersection is reduced).
- In some embodiments, SimRank is modified (from what is described above) to accommodate the asymmetry between client and server connections.
- SimRank can be modified to use different thresholds for client communications (e.g., an 80% match among nodes c1-c6) and for server communications (e.g., a 60% match among nodes s1 and s2).
- Such modification can also help achieve convergence in situations such as where a server process dies on one node and restarts on another node.
- Suppose that, as a result of such similarity-based clustering, nodes p1-p4 (of Fig. 3B) are all included in a single cluster.
- Both MNC and SimRank operate agnostically of which application a given process belongs to.
- Suppose that processes p1-p3 each correspond to a first application (e.g., an update engine), and process p4 corresponds to a second application (e.g., sshd).
- Further suppose that process p10 corresponds to contact with AWS. Clustering all four of the processes together (e.g., as a result of SimRank) could be problematic, particularly in a security context (e.g., where granular information useful in detecting threats would be lost).
- data platform 12 may maintain a mapping between processes and the applications to which they belong.
- In various embodiments, the output of SimRank (e.g., SimRank clusters) is split based on the applications to which cluster members belong (such a split is also referred to herein as a “CmdType split”). If all cluster members share a common application, the cluster remains. If different cluster members originate from different applications, the cluster members are split along application-type (CmdType) lines.
- Suppose that node c4 belongs to “ssh,” and that node c6 belongs to “bash.”
- Absent a CmdType split, all six nodes (c1-c6) might be clustered into a single cluster.
- After a CmdType split, the single cluster will be broken into three clusters (c1, c2, c3, c5; c4; and c6).
- the resulting clusters comprise processes associated with the same type of application, which exhibit similar behaviors (e.g., communication behaviors).
- Each of the three clusters resulting from the CmdType split represents, respectively, a node (also referred to herein as a PType) of a particular CmdType.
- Each PType is given a persistent identifier and stored persistently as a cumulative graph.
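The CmdType split described above can be sketched as follows. The application names for c1, c2, c3, and c5 are made-up placeholders (the text only fixes c4 as "ssh" and c6 as "bash"):

```python
from collections import defaultdict

def cmdtype_split(clusters, cmdtype_of):
    """Split behavior-based clusters along application-type (CmdType) lines.

    clusters: list of lists of process identifiers.
    cmdtype_of: mapping from process identifier to its CmdType.
    """
    result = []
    for cluster in clusters:
        groups = defaultdict(list)
        for proc in cluster:
            groups[cmdtype_of[proc]].append(proc)  # preserves input order
        result.extend(groups.values())
    return result

# c4 is "ssh" and c6 is "bash"; the shared CmdType of the other four
# nodes is a hypothetical placeholder.
cmdtypes = {"c1": "appX", "c2": "appX", "c3": "appX", "c5": "appX",
            "c4": "ssh", "c6": "bash"}
split = cmdtype_split([["c1", "c2", "c3", "c4", "c5", "c6"]], cmdtypes)
# → three clusters: c1/c2/c3/c5, c4 alone, and c6 alone.
```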
- a variety of approaches can be used to determine a CmdType for a given process.
- a one-to-one mapping exists between the CmdType and the application/binary name.
- processes corresponding to the execution of sshd will be classified using a CmdType of sshd.
- a list of common application/binary names (e.g., sshd, apache, etc.) is maintained by data platform 12 and manually curated as applicable.
- command line / execution path information can be used in determining a CmdType.
- the subapplication can be used as the CmdType of the application, and/or term frequency analysis (e.g., TF/IDF) can be used on command line information to group, for example, any marathon-related applications together (e.g., as a python.marathon CmdType) and separately from other Python applications (e.g., as a python.airflow CmdType).
- machine learning techniques are used to determine a CmdType.
- the CmdType model is constrained such that the execution path for each CmdType is unique.
- One example approach to making a CmdType model is a random forest based approach.
- An initial CmdType model is bootstrapped using process parameters (e.g., available within one minute of process startup) obtained using one hour of information for a given customer (e.g., entity A). Examples of such parameters include the command line of the process, the command line of the process’s parent(s) (if applicable), the uptime of the process, UID/EUID and any change information, TTY and any change information, listening ports, and children (if any).
- Another approach is to perform term frequency clustering over command line information to convert command lines into cluster identifiers.
- the random forest model can be used (e.g., in subsequent hours) to predict a CmdType for a process (e.g., based on features of the process). If a match is found, the process can be assigned the matching CmdType. If a match is not found, a comparison between features of the process and its nearest CmdType (e.g., as determined using a Levenshtein distance) can be performed.
- the existing CmdType can be expanded to include the process, or, as applicable, a new CmdType can be created (and other actions taken, such as generating an alert).
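The nearest-CmdType fallback can be sketched as follows, using Levenshtein distance over command lines as a stand-in for the feature comparison; the distance threshold is an assumed parameter, not one specified in the text:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_cmdtype(command_line, known, max_distance=3):
    """Return the closest known CmdType, or None if nothing is close
    enough (in which case a new CmdType could be created and, as
    applicable, an alert generated)."""
    best = min(known, key=lambda k: levenshtein(command_line, k))
    return best if levenshtein(command_line, best) <= max_distance else None
```

For example, `nearest_cmdtype("/usr/sbin/sshd -D", ["/usr/sbin/sshd", "/usr/bin/python"])` matches the existing sshd CmdType, while a completely unfamiliar command line yields `None`.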
- Another approach to handling processes which do not match an existing CmdType is to designate such processes as unclassified, and once an hour, create a new random forest seeded with process information from a sampling of classified processes (e.g., 10 or 100 processes per CmdType) and the new processes. If a given new process winds up in an existing set, the process is given the corresponding CmdType. If a new cluster is created, a new CmdType can be created.
- a polygraph represents the smallest possible graph of clusters that preserves a set of rules (e.g., in which nodes included in a cluster must share a CmdType and behavior).
- Via SimRank and cluster splitting (e.g., CmdType splitting), many processes are clustered together based on commonality of behavior (e.g., communication behavior) and commonality of application type.
- Such clustering represents a significant reduction in graph size (e.g., compared to the original raw physical graph). Nonetheless, further clustering can be performed (e.g., by iterating on the graph data using the GBM to achieve such a polygraph). As more information within the graph is correlated, more nodes can be clustered together, reducing the size of the graph, until convergence is reached and no further clustering is possible.
- Fig. 3E depicts two pairs of clusters.
- cluster 320 represents a set of client processes sharing the same CmdType (“al”), communicating (collectively) with a server process having a CmdType (“a2”).
- Cluster 322 also represents a set of client processes having a CmdType al communicating with a server process having a CmdType a2.
- Absent additional information, the nodes in clusters 320 and 322 (and similarly the nodes in 321 and 323) would remain separately clustered (as depicted) after MNC/SimRank/CmdType splitting, as isolated islands.
- server process 321 corresponds to processes executing on a first machine (having an IP address of 1.1.1.1).
- Suppose the machine fails, and a new server process 323 starts on a second machine (having an IP address of 2.2.2.2) and takes over for process 321.
- Communications between a cluster of nodes (e.g., nodes of cluster 320) and the first IP address can be considered different behavior from communications between the same set of nodes and the second IP address, and thus communications 324 and 325 will not be combined by MNC/SimRank in various embodiments. Nonetheless, it could be desirable for nodes of clusters 320/322 to be combined (into cluster 326), and for nodes of clusters 321/323 to be combined (into cluster 327), as representing (collectively) communications between al and a2.
- One task that can be performed by data platform 12 is to use DNS query information to map IP addresses to logical entities.
- GBM 154 can make use of the DNS query information to determine that graph nodes of cluster 320 and graph nodes of cluster 322 both made DNS queries for “appserverabc.example.com,” which first resolved to 1.1.1.1 and then to 2.2.2.2, and to combine nodes 320/322 and 321/323 together into a single pair of nodes (326 communicating with 327).
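A sketch of using collected DNS query information to identify clusters that contacted the same logical entity (cluster identifiers and domain names are illustrative):

```python
from collections import defaultdict

def merge_candidates(cluster_domains):
    """Find clusters whose members queried the same logical domain.

    cluster_domains: dict mapping cluster id -> set of domain names the
    cluster's members issued DNS queries for. Clusters sharing a domain
    are candidates for being combined, even if the domain resolved to
    different IP addresses (e.g., 1.1.1.1, then 2.2.2.2) over time.
    """
    by_domain = defaultdict(list)
    for cluster, domains in cluster_domains.items():
        for domain in sorted(domains):
            by_domain[domain].append(cluster)
    return {d: c for d, c in by_domain.items() if len(c) > 1}

merged = merge_candidates({
    "cluster320": {"appserverabc.example.com"},
    "cluster322": {"appserverabc.example.com"},
    "cluster340": {"db.example.com"},
})
# cluster320 and cluster322 both queried the same domain and can be
# combined (as with cluster 326 in the example above).
```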
- GBM 154 operates in a batch manner in which it receives as input the nodes and edges of a graph for a particular time period along with its previous state, and generates as output clustered nodes, cluster membership edges, cluster-to-cluster edges, events, and its next state.
- GBM 154 may not try to consider all types of entities and their relationships that may be available in a conceptual common graph all at once. Instead, GBM uses a concept of models where a subset of node and edge types and their properties are considered in a given model. Such an approach is helpful for scalability, and also to help preserve detailed information (of particular importance in a security context) - as clustering entities in a more complex and larger graph could result in less useful results. In particular, such an approach allows for different types of relationships between entities to be preserved/more easily analyzed.
- While GBM 154 can be used with different models corresponding to different subgraphs, core abstractions remain the same across types of models.
- each node type in a GBM model is considered to belong to a class.
- the class can be thought of as a way for the GBM to split nodes based on the criteria it uses for the model.
- the class for a node is represented as a string whose value is derived from the node’s key and properties depending on the GBM Model.
- different GBM models may create different class values for the same node.
- For a given node type, GBM 154 can generate clusters of nodes of that type.
- a GBM generated cluster for a given member node type cannot span more than one class for that node type.
- GBM 154 generates edges between clusters that have the same types as the edges between source and destination cluster node types.
- the processes described herein as being used for a particular model can be used (can be the same) across models, and different models can also be configured with different settings.
- the node types and the edge types may correspond to existing types in the common graph node and edge tables but this is not necessary.
- the properties provided to GBM 154 are not limited to the properties that are stored in the corresponding graph table entries. They can be enriched with additional information before being passed to GBM 154.
- Edge triplets can be expressed, for example, as an array of source node type, edge type, and destination node type. And, each node type is associated with node properties, and each edge type is associated with edge properties. Other edge triplets can also be used (and/or edge triplets can be extended) in accordance with various embodiments.
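For instance, the logical input description might be represented as follows. The type names echo ones used elsewhere in this document, but the exact set of triplets shown for the model is an illustrative assumption:

```python
from typing import NamedTuple

class EdgeTriplet(NamedTuple):
    """Logical-input descriptor: source node type, edge type, destination node type."""
    src_node_type: str
    edge_type: str
    dst_node_type: str

# Hypothetical triplets for a connectivity-oriented model; node and edge
# properties would be associated with each type separately.
triplets = [
    EdgeTriplet("Process", "ConnectedTo", "Process"),
    EdgeTriplet("Process", "ConnectedTo", "IPSep"),
    EdgeTriplet("Process", "ConnectedTo", "DNSSep"),
]
```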
- the physical input to the GBM model need not (and does not, in various embodiments) conform to the logical input.
- the edges in the PtypeConn model correspond to edges between Matching Neighbors (MN) clusters, where each process node has an MN cluster identifier property.
- In the Uid2Uid model, edges are not explicitly provided separately from nodes (as the euid array in the node properties serves the same purpose). In both cases, however, the physical information provides the applicable information necessary for the logical input.
- the state input for a particular GBM model can be stored in a file, a database, or other appropriate storage.
- the state file (from a previous run) is provided, along with graph data, except for when the first run for a given model is performed, or the model is reset. In some cases, no data may be available for a particular model in a given time period, and GBM may not be run for that time period. As data becomes available at a future time, GBM can run using the latest state file as input.
- GBM 154 outputs cluster nodes, cluster membership edges, and inter-cluster relationship edges that are stored (in some embodiments) in the graph node tables: node_c, node_cm, and node_icr, respectively.
- the type names of nodes and edges may conform to the following rules: A given node type can be used in multiple different GBM models. The type names of the cluster nodes generated by two such models for that node type will be different. For instance, process type nodes will appear in both PtypeConn and Uid2Uid models, but their cluster nodes will have different type names.
- the membership edge type name is “MemberOf.”
- the edge type names for cluster-to-cluster edges will be the same as the edge type names in the underlying node-to-node edges in the input.
- GBM 154 can generate: new class, new cluster, new edge from class to class, split class (the notion that GBM 154 considers all nodes of a given type and class to be in the same cluster initially and if GBM 154 splits them into multiple clusters, it is splitting a class), new edge from cluster and class, new edge between cluster and cluster, and/or new edge from class to cluster.
- One underlying node or edge in the logical input can cause multiple types of events to be generated. Conversely, one event can correspond to multiple nodes or edges in the input. Not every model generates every event type.
- a PTypeConn Model clusters nodes of the same class that have similar connectivity relationships. For example, if two processes had similar incoming neighbors of the same class and outgoing neighbors of the same class, they could be clustered.
- the node input to the PTypeConn model for a given time period includes non-interactive (i.e., not associated with a tty) process nodes that had connections in the time period, and the base graph nodes of other types (IP Service Endpoint (IPSep), comprising an IP address and a port; DNS Service Endpoint (DNSSep); and IP Address) that have been involved in those connections.
- the edge inputs to this model are the ConnectedTo edges from the MN cluster, instead of individual node-to-node ConnectedTo edges from the base graph.
- the membership edges created by this model refer to the base graph node type provided in the input.
- nodes are determined as follows depending on the node type (e.g., Process nodes, IPSep nodes, DNSSep nodes, and IP Address nodes).
- For IPSep nodes: if IP internal, then “IntIPS”
- if IP internal = 1, then “<hostname>”
- IP Address nodes (will appear only on client side):
- “IPIntC”
- a new class event in this model for a process node is equivalent to seeing a new CType being involved in a connection for the first time. Note that this does not mean the CType was not seen before. It is possible that it was previously seen but did not make a connection at that time.
- a new class event in this model for an IPSep node with IP internal = 0 is equivalent to seeing a connection to a new external IP address for the first time.
- a new class event in this model for a DNSSep node is equivalent to seeing a connection to a new domain for the first time.
- a new class to class edge from a class for a process node to a class for a process node is equivalent to seeing the source CType make a connection to the destination CType for the first time.
- a new class to class edge from a class for a process node to a class for a DNSSep node is equivalent to seeing the source CType make a connection to the destination domain name for the first time.
- An IntPConn Model may be similar to the PtypeConn Model, except that connection edges between parent/child processes and connections between processes where both sides are not interactive are filtered out.
- a Uid2Uid Model may cluster processes with the same username that show similar privilege change behavior. For instance, if two processes with the same username had similar effective user values, launched processes with similar usernames, and were launched by processes with similar usernames, then they could be clustered.
- An edge between a source cluster and destination cluster generated by this model means that all of the processes in the source cluster had a privilege change relationship to at least one process in the destination cluster.
- the node input to this model for a given time period includes process nodes that are running in that period.
- the value of a class of process nodes is “<username>”.
- the base relationship that is used for clustering is privilege change, either by the process changing its effective user ID, or by launching a child process which runs with a different user.
- the physical input for this model includes process nodes (only), with the caveat that the complete ancestor hierarchy of process nodes active (i.e., running) for a given time period is provided as input even if an ancestor is not active in that time period.
- effective user IDs of a process are represented as an array in the process node properties, and launch relationships are available from ppid_hash fields in the properties as well.
- a new class event in this model is equivalent to seeing a user for the first time.
- a new class to class edge event is equivalent to seeing the source user making a privilege change to the destination user for the first time.
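The privilege-change relationship underlying the Uid2Uid model can be sketched as follows. The record field names ('username', 'euids', 'parent_username') are illustrative, not the actual node property schema:

```python
def privilege_change_edges(processes):
    """Derive user-to-user privilege-change (class to class) edges.

    processes: list of dicts with 'username', 'euids' (the effective
    user values observed for the process), and 'parent_username' (from
    the launch relationship; None for a root of the hierarchy).
    Returns a set of (source_user, destination_user) edges.
    """
    edges = set()
    for proc in processes:
        user = proc["username"]
        # The process changed its effective user ID.
        for euid_user in proc.get("euids", []):
            if euid_user != user:
                edges.add((user, euid_user))
        # The process was launched by a parent running as a different user.
        parent = proc.get("parent_username")
        if parent is not None and parent != user:
            edges.add((parent, user))
    return edges

edges = privilege_change_edges([
    {"username": "charlie", "euids": ["charlie", "root"],
     "parent_username": None},
    {"username": "support", "euids": ["support"],
     "parent_username": "dave"},
])
# → {("charlie", "root"), ("dave", "support")}
```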
- a Ct2Ct Model may cluster processes with the same CType that show similar launch behavior. For instance, if two processes with the same CType have launched processes with similar CTypes, then they could be clustered.
- the node input to this model for a given time period includes process nodes that are running in that period.
- the class value of process nodes is CType (similar to how it is created for the PtypeConn Model).
- the base relationship that is used for clustering is a parent process with a given CType launching a child process with another given destination CType.
- the physical input for this model includes process nodes (only), with the caveat that the complete ancestor hierarchy of active (i.e., running) process nodes for a given time period is provided as input even if an ancestor is not active in that time period. Note that launch relationships are available from ppid_hash fields in the process node properties.
- An edge between a source cluster and destination cluster generated by this model means that all of the processes in the source cluster launched at least one process in the destination cluster.
- A new class event in this model is equivalent to seeing a CType for the first time. Note that the same type of event will be generated by the PtypeConn Model as well. A new class to class edge event is equivalent to seeing the source CType launching the destination CType for the first time.
- An MTypeConn Model may cluster nodes of the same class that have similar connectivity relationships. For example, if two machines had similar incoming neighbors of the same class and outgoing neighbors of the same class, they could be clustered.
- a new class event in this model will be generated for external IP addresses or (as applicable) domain names seen for the first time. Note that a new class to class edge (from a Machine class to the class of the corresponding IPSep or DNSName node) will also be generated at the same time.
- the membership edges generated by this model will refer to Machine, IP Address, DNSName, and IPSep nodes in the base graph. Though the nodes provided to this model are IP Address nodes instead of IPSep nodes, the membership edges it generates will refer to IPSep type nodes. Alternatively, the base graph can generate edges between Machine and IPSep node types. Note that the Machine to IP Address edges have tcp_dst_ports/udp_dst_ports properties that can be used for this purpose.
- the node input to this model for a given time period includes machine nodes that had connections in the time period and the base graph nodes of other types (IP Address and DNSName) that were involved in those connections.
- the base relationship is the connectivity relationship for the following type triplets:
- the edge inputs to this model are the corresponding ConnectedTo edges in the base graph.
- the machine terms property in the Machine nodes is used, in various embodiments, for labeling machines that are clustered together. If a majority of the machines clustered together share a term in the machine terms, that term can be used for labeling the cluster.
- the class value for IPSep nodes is determined as follows:
- the class value for IpAddress nodes is determined as follows:
- GBM 154 generates multiple events of this type for the same class value.
- the set size indicates the size of the cluster referenced in the keys field. Conditions:
- NewClass events can be generated if there is more than one cluster in that class. NewNode events will not be generated separately in this case.
- the key field contains source and destination class values and also source and destination cluster identifiers (i.e., the src/dst_node:key.cid represents the src/dst cluster identifier).
- an event of this type could involve multiple edges between different cluster pairs that have the same source and destination class values.
- GBM 154 can generate multiple events in this case with different source and destination cluster identifiers.
- the source and destination sizes represent the sizes of the clusters given in the keys field.
- NewClassToClass events can be generated if there is more than one pair of clusters in that class pair. NewNodeToNode events are not generated separately in this case.
- NewEdgeClassToNode events with the same model and source class can be output if there are multiple new edges from source clusters to the destination clusters in that source class (the first time seeing this class as a source cluster class for the destination cluster).
- These events may be combined at the class level and treated as a single event when it is desirable to view changes at the class level, e.g., when one wants to know when there is a new CType.
- different models may have partial overlap in the types of nodes they use from the base graph. Therefore, they can generate NewClass type events for the same class. NewClass events can also be combined across models when it is desirable to view changes at the class level.
- actions can be associated with processes and (e.g., by associating processes with users) actions can thus also be associated with extended user sessions.
- Extended user session tracking can also be useful in operational use cases without malicious intent, e.g., where users make original logins with distinct usernames (e.g., “charlie” or “dave”) but then perform actions under a common username (e.g., “admin” or “support”).
- users with administrator privileges exist, and they need to gain superuser privilege to perform a particular type of maintenance.
- It may be desirable to know which operations are performed (as the superuser) by which original user when debugging issues.
- Extended user session tracking is not limited to the ssh protocol, and the techniques described herein can be extended to other login mechanisms.
- Fig. 3F is a representation of a user logging into a first machine and then into a second machine from the first machine, as well as information associated with such actions.
- a user Charlie logs into Machine A (331) from a first IP address (332).
- The login is made using a username (333).
- an openssh privileged process (334) is created to handle the connection for the user, and a terminal session is created and a bash process (335) is created as a child.
- Charlie launches an ssh client (336) from the shell, and uses it to connect (337) to Machine B (338).
- Fig. 3G is an alternate representation of actions occurring in Fig. 3F, where events occurring on Machine A are indicated along line 350, and events occurring on Machine B are indicated along line 351.
- an incoming ssh connection is received at Machine A (352).
- Charlie logs in (as user “x”) and an ssh privileged process is created to handle Charlie’s connection (353).
- a terminal session is created and a bash process is created (354) as a child of process 353.
- the external domain could be a malicious domain, or it could be benign.
- Suppose that the external domain is malicious (and, e.g., that Charlie has malicious intent). It would be advantageous (e.g., for security reasons) to be able to trace the contact with the external domain back to Machine A, and then back to Charlie’s IP address.
- Using techniques described herein (e.g., by correlating process information collected by various agents), such tracking of Charlie’s activities back to his original login (330) can be accomplished.
- an extended user session can be tracked that associates Charlie’s ssh processes together with a single original login and thus original user.
- software agents may run on machines (such as a machine that implements one of nodes 116) and detect new connections, processes, and/or logins.
- agents send associated records to data platform 12 which includes one or more datastores (e.g., data store 30) for persistently storing such data.
- Such data can be modeled using logical tables, also persisted in datastores (e.g., in a relational database that provides an SQL interface), allowing for querying of the data.
- Other datastores such as graph oriented databases and/or hybrid schemes can also be used.
- An ssh login session can be identified uniquely by an (MID, PID hash) tuple.
- the MID is a machine identifier that is unique to each machine, whether physical or virtual, across time and space.
- Operating systems use numbers called process identifiers (PIDs) to identify processes running at a given time. Over time processes may die and new processes may be started on a machine or the machine itself may restart.
- the PID is not necessarily unique across time in that the same PID value can be reused for different processes at different times. In order to track process descendants across time, one should therefore account for time as well.
- Accordingly, another number, called a PID hash, is generated for the process.
- the PID hash is generated using a collision-resistant hash function that takes the PID, start time, and (in various embodiments, as applicable) other properties of a process.
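A sketch of such a PID hash, assuming SHA-256 and a particular field layout (the actual hash function and property set may differ):

```python
import hashlib

def pid_hash(pid, start_time_ns, extra_properties=()):
    """Collision-resistant process identifier.

    Hashes the PID together with the process start time (and, as
    applicable, other properties) so that a reused PID value at a
    different time maps to a different process identity.
    """
    h = hashlib.sha256()
    for part in (pid, start_time_ns, *extra_properties):
        h.update(str(part).encode())
        h.update(b"\x00")  # unambiguous field separator
    return h.hexdigest()

first = pid_hash(4242, 1_690_000_000_000)
reused = pid_hash(4242, 1_699_999_999_999)  # same PID, later start time
# The two identities differ even though the OS-level PID is identical.
```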
- Input data collected by agents comprises the input data model and is represented by the following logical tables:
- a connections table may maintain records of TCP/IP connections observed on each machine.
- Example columns included in a connections table are as follows:
- the source fields correspond to the side from which the connection was initiated.
- the agent associates an ssh connection with the privileged ssh process that is created for that connection.
- a processes table maintains records of processes observed on each machine. It may have the following columns:
- a logins table may maintain records of logins to machines. It may have the following columns:
- a login-local-descendant table maintains the local (i.e., on the same machine) descendant processes of each ssh login session. It may have the following columns:
- a login-connections table may maintain the connections associated with ssh logins. It may have the following columns:
- a login-lineage table may maintain the lineage of ssh login sessions. It may have the following columns:
- the parent MID and parent sshd PID hash columns can be null if there is no parent ssh login. In that case, the (MID, sshd PID hash) tuple will be the same as the (origin MID, origin sshd PID hash) tuple.
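Resolving the origin of a login can then be sketched as a walk up the parent links in the login-lineage table until a login with no parent is reached; the identifiers below are illustrative:

```python
def resolve_origin(lineage, mid, sshd_pid_hash):
    """Walk parent links in a login-lineage table to find the origin login.

    lineage: dict mapping (MID, sshd PID hash) -> parent (MID, sshd PID
    hash), or None when there is no parent ssh login. When the parent
    is None, the login is its own origin.
    """
    key = (mid, sshd_pid_hash)
    while lineage.get(key) is not None:
        key = lineage[key]
    return key

# A login on machine B is a child of a login on machine A, which itself
# has no parent ssh login.
lineage = {
    ("B", "B1"): ("A", "A1"),
    ("A", "A1"): None,
}
origin = resolve_origin(lineage, "B", "B1")
# → ("A", "A1"): activity on machine B traces back to the original
# login on machine A.
```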
- Fig. 3H illustrates an example of a process for performing extended user tracking.
- process 361 is performed by data platform 12.
- the process begins at 362 when data associated with activities occurring in a network environment (such as entity A’s datacenter) is received.
- One example of such data is the agent-collected data described above (e.g., collected in conjunction with process 200).
- At 363, the received network activity is used to identify user login activity.
- At 364, a logical graph that links the user login activity to at least one user and at least one process is generated (or updated, as applicable). Additional detail regarding process 361, and in particular portions 363 and 364 of process 361, is described below (e.g., in conjunction with discussion of Fig. 3J).
- Fig. 3I depicts a representation of a user logging into a first machine, then into a second machine from the first machine, and then making an external connection.
- the scenario depicted in Fig. 3I is used to describe an example of processing that can be performed on data collected by agents to generate extended user session tracking information.
- Fig. 3I is an alternate depiction of the information shown in Figs. 3F and 3G.
- a first ssh connection is made to Machine A (366) from an external source (367) by a user having a username of “X.”
- an external source has an IP address of 1.1.1.10 and uses source port 10000 to connect to Machine A (which has an IP address of 2.2.2.20 and a destination port 22).
- External source 367 is considered an external source because its IP address is outside of the environment being monitored (e.g., is a node outside of entity A’s datacenter, connecting to a node inside of entity A’s datacenter).
- a first ssh login session LS1 is created on machine A for user X.
- the privileged openssh process for this login is A1 (368).
- the user creates a bash shell process with PID hash A2 (369).
- a second ssh login session LS2 is created on machine B for user Y.
- the privileged openssh process for this login is B1 (373).
- the user creates a bash shell process with PID hash B2 (374).
- B3 is a descendant of B1 and is thus associated with LS2.
- connection to the external domain is thus associated with LS2.
- An association between A3 and LS2 can be established based on the fact that LS2 was created based on an ssh connection initiated from A3. Accordingly, it can be determined that LS2 is a child of LS1.
- Fig. 3J illustrates an example of a process for performing extended user tracking.
- process 380 is performed periodically (e.g., once an hour in a batch fashion) by ssh tracker 148 to generate new output data.
- batch processing allows for efficient analysis of large volumes of data.
- the approach can be adapted, as applicable, to process input data on a record-by-record fashion while maintaining the same logical data processing flow.
- the results of a given portion of process 380 are stored for use in a subsequent portion.
- the process begins at 381 when new ssh connection records are identified.
- new ssh connections started during the current time period are identified by querying the connections table.
- the query uses filters on the start_time and dst_port columns.
- the values of the range filter on the start_time column are based on the current time period.
- the dst_port column is checked against ssh listening port(s).
- the ssh listening port number is 22.
- the port(s) that openssh servers are listening to in the environment can be determined by data collection agents dynamically and used as the filter value for the dst_port as applicable.
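- The step-381 query can be sketched as a range filter on the start time plus a dst_port filter. The following is a minimal illustration; the table layout and values are assumptions for demonstration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE connections
    (src_ip TEXT, src_port INTEGER, dst_ip TEXT, dst_port INTEGER,
     start_time INTEGER)""")
rows = [
    ("1.1.1.10", 10000, "2.2.2.20", 22, 100),   # ssh connection in period
    ("2.2.2.20", 40000, "3.3.3.30", 443, 120),  # not ssh
    ("1.1.1.10", 10001, "2.2.2.20", 22, 900),   # ssh, outside period
]
conn.executemany("INSERT INTO connections VALUES (?,?,?,?,?)", rows)

period_start, period_end = 0, 500
ssh_ports = (22,)  # could instead be discovered dynamically by agents
placeholders = ",".join("?" * len(ssh_ports))
new_ssh = conn.execute(
    f"""SELECT * FROM connections
        WHERE start_time >= ? AND start_time < ?
          AND dst_port IN ({placeholders})""",
    (period_start, period_end, *ssh_ports)).fetchall()
print(len(new_ssh))  # → 1
```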
- the query result will generate the records shown in Fig.
- connection records reported from source and destination sides of the same connection are matched.
- the ssh connection records (e.g., returned from the query at 381) are matched based on the following criteria: the five-tuples (src_IP, dst_IP, IP_prot, src_port, dst_port) of the connection records must match.
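- Matching the two sides of a connection on the five-tuple can be sketched as follows. The record shape (dicts with a "side" field) is an illustrative assumption, not the platform's actual data model:

```python
from collections import defaultdict

# Hypothetical records reported from each side of the same connection.
records = [
    {"side": "src", "mid": "A", "pid_hash": "A3",
     "src_ip": "2.2.2.20", "src_port": 10001,
     "dst_ip": "2.2.2.21", "dst_port": 22, "ip_prot": "tcp"},
    {"side": "dst", "mid": "B", "pid_hash": "B1",
     "src_ip": "2.2.2.20", "src_port": 10001,
     "dst_ip": "2.2.2.21", "dst_port": 22, "ip_prot": "tcp"},
]

def five_tuple(r):
    return (r["src_ip"], r["dst_ip"], r["ip_prot"],
            r["src_port"], r["dst_port"])

# Group records by five-tuple, then keep only tuples seen from both sides.
by_tuple = defaultdict(dict)
for r in records:
    by_tuple[five_tuple(r)][r["side"]] = r

matched = [
    {"src_mid": sides["src"]["mid"], "src_pid_hash": sides["src"]["pid_hash"],
     "dst_mid": sides["dst"]["mid"], "dst_pid_hash": sides["dst"]["pid_hash"]}
    for sides in by_tuple.values()
    if "src" in sides and "dst" in sides
]
print(matched)  # one matched record linking process A3 on A to B1 on B
```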
- Example output of portion 382 of process 380 is shown in Fig. 3L.
- the values in the dst PID hash column (391) are that of the sshd privileged process associated with ssh logins.
- new logins during the current time period are identified by querying the logins table. The query uses a range filter on the login time column with values based on the current time period. In the example depicted in Fig. 3I, the query result will generate the records depicted in Fig. 3M.
- matched ssh connection records created at 382 and new login records created at 383 are joined to create new records that will eventually be stored in the login-connection table.
- the join condition is that dst MID of the matched connection record is equal to the MID of the login record and the dst PID hash of the matched connection record is equal to the sshd PID hash of the login record.
- the processing performed at 384 will generate the records depicted in Fig. 3N.
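- The join condition at 384 can be sketched directly in SQL. Table and column names below are assumptions based on the description:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matched_connections
    (dst_mid TEXT, dst_pid_hash TEXT, src_mid TEXT, src_pid_hash TEXT);
CREATE TABLE logins (mid TEXT, sshd_pid_hash TEXT, username TEXT);
INSERT INTO matched_connections VALUES ('B', 'B1', 'A', 'A3');
INSERT INTO logins VALUES ('B', 'B1', 'Y');
""")
# Join condition described at 384: the connection's dst MID equals the
# login's MID, and the connection's dst PID hash equals the sshd PID hash.
login_connections = conn.execute("""
    SELECT l.mid, l.sshd_pid_hash, l.username, m.src_mid, m.src_pid_hash
    FROM matched_connections m
    JOIN logins l ON m.dst_mid = l.mid AND m.dst_pid_hash = l.sshd_pid_hash
""").fetchall()
print(login_connections)  # [('B', 'B1', 'Y', 'A', 'A3')]
```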
- login-local-descendant records in the lookback time period are identified. It is possible that a process created in a previous time period makes an ssh connection in the current analysis batch period. Although not depicted in the example illustrated in Fig. 3I, consider a case where bash process A2 does not create ssh process A3 right away, but instead the ssh connection A3 later makes to machine B is processed in a later time period than the one in which A2 was processed.
- the time period for which look back is performed can be limited to reduce the amount of historical data that is considered. However, this is not a requirement (and the amount of look back can be determined, e.g., based on available processing resources).
- the login local descendants in the lookback time period can be identified by querying the login-local-descendant table.
- the query uses a range filter on the login time column, where the range is from (start time of current period - lookback time) to (start time of current period). (No records are obtained as a result of performing 385 on the scenario depicted in Fig. 3I, as only a single time period is applicable in the example scenario.)
- new login-local-descendant records are identified. The purpose is to determine whether any of the new processes in the current time period are descendants of an ssh login process and, if so, to create records that will be stored in the login-local-descendant table for them. In order to do so, the parent-child relationships between the processes are recursively followed. Either a top-down or bottom-up approach can be used. In a top-down approach, the ssh local descendants in the lookback period identified at 385, along with new ssh login processes in the current period identified at 384, are considered as possible ancestors for the new processes in the current period identified at 386.
- the recursive approach can be considered to include multiple sub-steps where new processes that are identified to be ssh local descendants in the current sub-step are considered as ancestors for the next step.
- descendancy relationships will be established in two sub-steps:
- Sub-step 1: Process A2 is a local descendant of LS1 because it is a child of privileged process A1, and process B2 is a local descendant of LS2 because it is a child of privileged process B1.
- Sub-step 2: Process A3 is a local descendant of LS1 because it is a child of process A2, which is associated to LS1 in sub-step 1. Process B3 is a local descendant of LS2 because it is a child of process B2, which is associated to LS2 in sub-step 1.
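- The sub-step recursion described above can be sketched as an iterative pass: start from the known privileged login processes and repeatedly claim their children until no new associations are made. Record shapes and names are illustrative assumptions:

```python
# Hypothetical process records: (pid_hash, parent_pid_hash).
processes = [
    ("A2", "A1"),  # bash under privileged sshd A1
    ("A3", "A2"),  # ssh client launched from the bash shell
    ("B2", "B1"),
    ("B3", "B2"),
    ("C1", None),  # unrelated process
]

# Seed: privileged sshd processes mapped to their login sessions.
session_of = {"A1": "LS1", "B1": "LS2"}

# Each sweep associates children of already-associated processes;
# repeat until a sweep adds nothing (the conceptual "sub-steps").
changed = True
while changed:
    changed = False
    for pid, ppid in processes:
        if pid not in session_of and ppid in session_of:
            session_of[pid] = session_of[ppid]
            changed = True

print(session_of)
# {'A1': 'LS1', 'B1': 'LS2', 'A2': 'LS1', 'A3': 'LS1', 'B2': 'LS2', 'B3': 'LS2'}
```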
- Implementation of portion 387 can use a datastore that supports recursive query capabilities, or queries can be constructed to process multiple conceptual sub-steps at once. In the example depicted in Fig. 3I, the processing performed at 387 will generate the records depicted in Fig. 3P. Note that the ssh privileged processes associated with the logins are also included, as they are part of the login session. At 388, the lineage of new ssh logins created in the current time period is determined by associating their ssh connections to source processes that may be descendants of other ssh logins (which may have been created in the current period or previous time periods). In order to do so, first an attempt is made to join the new ssh login connections in the current period (identified at
- output data is generated.
- the new login-connection, login-local-descendant, and login-lineage records generated at 384, 387, and 388 are inserted into their respective output tables (e.g., in a transactional manner).
- the agent can configure the network stack (e.g., using iptables functionality on Linux) to intercept an outgoing TCP SYN packet and modify it to add the generated GUID as a TCP option.
- the agent already extracts TCP SYN packets and thus can look for this option and extract the GUID if it exists.
- Example graph-based user tracking and threat detection embodiments associated with data platform 12 will now be described.
- Administrators and other users of network environments (e.g., entity A's datacenter 1014) often take on different identities over the course of their work.
- Joe performs various tasks as himself (e.g., answering emails, generating status reports, writing code, etc.).
- Joe may require different/additional permission than his individual account has (e.g., root privileges).
- One way Joe can gain access to such permissions is by using sudo, which will allow Joe to run a single command with root privileges.
- Another way Joe can gain access to such permissions is by su or otherwise logging into a shell as root.
- Another thing that Joe can do is switch identities.
- Joe may use "su help" or "su database-admin" to become (respectively) the help user or the database-admin user on a system. He may also connect from one machine to another, potentially changing identities along the way (e.g., logging in as joe.smith at a first console, and connecting to a database server as database-admin).
- Joe can relinquish his root privileges by closing out of any additional shells created, reverting back to a shell created for user joe.smith.
- While there are many legitimate reasons for Joe to change his identity throughout the day, such changes may also correspond to nefarious activity. Joe himself may be nefarious, or Joe's account (joe.smith) may have been compromised by a third party (whether an "outsider" outside of entity A's network, or an "insider").
- Using techniques described herein, the behavior of users of the environment can be tracked (including across multiple accounts and/or multiple machines) and modeled (e.g., using various graphs described herein). Such models can be used to generate alerts (e.g., to anomalous user behavior).
- Such models can also be used forensically, e.g., helping an investigator visualize various aspects of a network and activities that have occurred, and to attribute particular types of actions (e.g., network connections or file accesses) to specific users.
- In a typical day in a datacenter, a user (e.g., Joe Smith) will log in, run various processes, and (optionally) log out. The user will typically log in from the same set of IP addresses, from IP addresses within the same geographical area (e.g., city or country), or from historically known IP addresses/geographical areas (i.e., ones the user has previously/occasionally used). A deviation from the user's typical (or historical) behavior indicates a change in login behavior. However, it does not necessarily mean that a breach has occurred.
- a user may take a variety of actions. As a first example, a user might execute a binary/script.
- Such binary/script might communicate with other nodes in the datacenter, or outside of the datacenter, and transfer data to the user (e.g., executing “curl” to obtain data from a service external to the datacenter).
- the user can similarly transfer data (e.g., out of the datacenter), such as by using POST.
- a user might change privilege (one or more times), at which point the user can send/receive data as per above.
- a user might connect to a different machine within the datacenter (one or more times), at which point the user can send/receive data as per the above.
- the above information associated with user behavior is broken into four tiers.
- the tiers represent example types of information that data platform 12 can use in modeling user behavior:
- 1. The user's entry point (e.g., domains, IP addresses, and/or geolocation information such as country/city) from which a user logs in.
- the user executes a script ("collect_data.sh") on Machine03.
- the script internally communicates (as root) to a MySQL-based service internal to the datacenter, and downloads data from the MySQL-based service.
- the user externally communicates with a server outside the datacenter ("External01"), using a POST command.
- the source/entry point is IP01.
- Data is transferred to an external server External01.
- the machine performing the transfer to External01 is Machine03.
- the user transferring the data is "root" (on Machine03), while the actual user (hiding behind root) is UserA.
- the "original user" (ultimately responsible for transmitting data to External01) is UserA, who logged in from IP01.
- Each of the processes ultimately started by UserA, whether started at the command line (tty) such as "runnable.sh" or started after an ssh connection such as "new_runnable.sh," and whether as UserA or as a subsequent identity, are all examples of child processes which can be arranged into a process hierarchy.
- machines can be clustered together logically into machine clusters.
- One approach to clustering is to classify machines based on information such as the types of services they provide/binaries they have installed upon them/processes they execute. Machines sharing a given machine class (as they share common binaries/services/etc.) will behave similarly to one another.
- Each machine in a datacenter can be assigned to a machine cluster, and each machine cluster can be assigned an identifier (also referred to herein as a machine class).
- One or more tags can also be assigned to a given machine class (e.g., database servers west or prod web frontend).
- One approach to assigning a tag to a machine class is to apply term frequency analysis (e.g., TF/IDF) to the applications run by a given machine class, selecting as tags those most unique to the class.
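- Term-frequency analysis for machine-class tags can be sketched as follows. The class names and application inventories below are hypothetical, and the TF-IDF weighting is one of several common variants:

```python
import math
from collections import Counter

# Hypothetical application inventories per machine class.
classes = {
    "db_west":  ["mysqld", "sshd", "cron", "mysqldump"],
    "web_prod": ["nginx", "sshd", "cron", "gunicorn"],
    "ci":       ["jenkins", "sshd", "cron", "docker"],
}

def tag_for(cls):
    """Pick the application with the highest TF-IDF score as the class tag."""
    n_classes = len(classes)
    # Document frequency: in how many classes does each app appear?
    df = Counter(app for apps in classes.values() for app in set(apps))
    tf = Counter(classes[cls])
    total = sum(tf.values())
    score = {app: (tf[app] / total) * math.log(n_classes / df[app])
             for app in tf}
    return max(score, key=score.get)

# Apps shared by every class (sshd, cron) score zero and are never chosen.
print(tag_for("db_west"))  # one of the MySQL binaries, unique to the class
```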
- Data platform 12 can use behavioral baselines taken for a class of machines to identify deviations from the baseline (e.g., by a particular machine in the class).
- Fig. 3S illustrates an example of a process for detecting anomalies.
- process 392 is performed by data platform 12. As explained above, a given session will have an original user. And, each action taken by the original user can be tied back to the original user, despite privilege changes and/or lateral movement throughout a datacenter.
- Process 392 begins at 393 when log data associated with a user session (and thus an original user) is received.
- a logical graph is generated, using at least a portion of the collected data.
- an anomaly is detected (395), it can be recorded, and as applicable, an alert is generated (396).
- the following are examples of graphs that can be generated (e.g., at 394), with corresponding examples of anomalies that can be detected (e.g., at 395) and alerted upon (e.g., at 396).
- Fig. 4A illustrates a representation of an embodiment of an insider behavior graph.
- each node in the graph can be: (1) a cluster of users; (2) a cluster of launched processes; (3) a cluster of processes/servers running on a machine class; (4) a cluster of external IP addresses (of incoming clients); or (5) a cluster of external servers based on DNS/IP/etc.
- graph data is vertically tiered into four tiers.
- Tier 0 (400) corresponds to entry point information (e.g., domains, IP addresses, and/or geolocation information) associated with a client entering the datacenter from an external entry point. Entry points are clustered together based on such information.
- Tier 1 corresponds to a user on a machine class, with a given user on a given machine class represented as a node.
- Tier 2 corresponds to launched processes, child processes, and/or interactive processes. Processes for a given user and having similar connectivity (e.g., sharing the processes they launch and the machines with which they communicate) are grouped into nodes.
- Tier 3 corresponds to the services/servers/domains/IP addresses with which processes communicate.
- Tier 0 nodes log in to tier 1 nodes.
- Tier 1 nodes launch tier 2 nodes.
- Tier 2 nodes connect to tier 3 nodes.
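- The four-tier structure described above can be sketched as an edge list between tier-labeled nodes, where edges may only go from one tier to the next. All node labels here are hypothetical:

```python
# Nodes are (tier, label) pairs; edges follow the ordering described above:
# entry points log in to tier-1 nodes, tier-1 launches tier-2, and tier-2
# connects to tier-3.
edges = [
    ((0, "52.32.40.231"),              (1, "aruneli_prod@web_frontend")),
    ((1, "aruneli_prod@web_frontend"), (2, "bash cluster C1")),
    ((2, "bash cluster C1"),           (3, "api.example.com")),
]

def valid_edge(src, dst):
    """Edges may only go from tier N to tier N+1."""
    return dst[0] == src[0] + 1

assert all(valid_edge(s, d) for s, d in edges)

# Outgoing adjacency, useful for walking from an entry point out to the
# external services it ultimately reached.
adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)
print(adj[(0, "52.32.40.231")])  # [(1, 'aruneli_prod@web_frontend')]
```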
- Tier 1 corresponds to a user (e.g., user “U”) logging into a machine having a particular machine class (e.g., machine class “M”).
- Tier 2 is a cluster of processes having command line similarity (e.g., CType "C"), having an original user "U," and running as a particular effective user (e.g., user "U1").
- the effective user in the Tier 2 node may or may not match the original user (while the original user in the Tier 2 node will match the original user in the Tier 1 node).
- a change from a user U into a user U1 can take place in a variety of ways. Examples include where U becomes U1 on the same machine (e.g., via su), and also where U sshes to other machine(s). In both situations, U can perform multiple changes, and can combine approaches. For example, U can become U1 on a first machine, ssh to a second machine (as U1), become U2 on the second machine, and ssh to a third machine (whether as user U2 or user U3).
- the complexity of how user U ultimately becomes U3 (or U5, etc.) is hidden from a viewer of an insider behavior graph, and only an original user (e.g., U) and the effective user of a given node (e.g., U5) are depicted.
- additional detail about the path (e.g., an end-to-end path of edges from user U to user U5) can be surfaced (e.g., via user interactions with nodes).
- Fig. 4B illustrates an example of a portion of an insider behavior graph (e.g., as rendered in a web browser).
- node 405 (the external IP address, 52.32.40.231) is an example of a Tier 0 node, and represents an entry point into a datacenter.
- two users, "aruneli_prod" and "harish_prod," both made use of the source IP 52.32.40.231 when logging in between 5pm and 6pm on Sunday, July 30 (408).
- Nodes 409 and 410 are examples of Tier 1 nodes, having aruneli_prod and harish_prod as associated respective original users.
- Tier 1 nodes correspond to a combination of a user and a machine class.
- the machine class associated with nodes 409 and 410 is hidden from view to simplify visualization, but can be surfaced to a viewer of interface 404 (e.g., when the user clicks on node 409 or 410).
- Nodes 414-423 are examples of Tier 2 nodes - processes that are launched by users in Tier 1 and their child, grandchild, etc. processes. Note that also depicted in Fig. 4B is a Tier 1 node 411 that corresponds to a user, “root,” that logged in to a machine cluster from within the datacenter (i.e., has an entry point within the datacenter). Nodes 425-1 and 425-2 are examples of Tier 3 nodes - internal/external IP addresses, servers, etc., with which Tier 2 nodes communicate.
- a viewer of interface 404 has clicked on node 423.
- the user running the marathon container is “root.”
- the viewer can determine that the original user, responsible for node 423, is "aruneli_prod," who logged into the datacenter from IP 52.32.40.231.
- a user communicates with an external server which has a geolocation not previously used by that user.
- Such changes can be surfaced as alerts, e.g., to help an administrator determine when/what anomalous behavior occurs within a datacenter.
- the behavior graph model can be used (e.g., during forensic analysis) to answer questions helpful during an investigation. Examples of such questions include:
- FIG. 4C depicts a baseline of behavior for a user, “Bill.”
- Bill typically logs into a datacenter from the IP address, 71.198.44.40 (427). He typically makes use of ssh (428), and sudo (429), makes use of a set of typical applications (430) and connects (as root) with the external service, api.lacework.net (431).
- FIG. 4D depicts an embodiment of how the graph depicted in Fig. 4C would appear once Eve begins exfiltrating data from the datacenter.
- Eve logs into the datacenter (using Bill’s credentials) from 52.5.66.8 (432).
- After changing identity to Alex (e.g., via su alex), Eve executes a script, "sneak.sh" (433), which launches another script, "post.sh" (434), which contacts external server 435, which has an IP address of 52.5.66.7, and transmits data to it.
- Edges 436-439 each represent changes in Bill’s behavior. As previously mentioned, such changes can be detected as anomalies and associated alerts can be generated. As a first example, Bill logging in from an IP address he has not previously logged in from (436) can generate an alert. As a second example, while Bill does typically make use of sudo (429), he has not previously executed sneak.sh (433) or post.sh (434) and the execution of those scripts can generate alerts as well. As a third example, Bill has not previously communicated with server 435, and an alert can be generated when he does so (439). Considered individually, each of edges 436-439 may indicate nefarious behavior, or may be benign.
- Bill begins working from a home office two days a week. The first time he logs in from his home office (i.e., from an IP address that is not 71.198.44.40), an alert can be generated that he has logged in from a new location. Over time, however, as Bill continues to log in from his home office but otherwise engages in typical activities, Bill’s graph will evolve to include logins from both 71.198.44.40 and his home office as baseline behavior. Similarly, if Bill begins using a new tool in his job, an alert can be generated the first time he executes the tool, but over time will become part of his baseline.
- a single edge can indicate a serious threat.
- edge 436 indicates compromise.
- An alert that includes an appropriate severity level (e.g., "threat level high") can be generated.
- a combination of edges could indicate a threat (where a single edge might otherwise result in a lesser warning).
- the presence of multiple new edges is indicative of a serious threat.
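- Detecting "new edge" anomalies against a baseline can be sketched as a set difference, with severity escalating with the number of new edges. The thresholds below are illustrative assumptions, not the platform's actual policy:

```python
# Baseline edges for a user (Fig. 4C style), as (src, dst) pairs.
baseline = {
    ("71.198.44.40", "login"), ("login", "ssh"), ("ssh", "sudo"),
    ("sudo", "typical_apps"), ("typical_apps", "api.lacework.net"),
}

# Edges observed in the current period (Fig. 4D style).
observed = {
    ("52.5.66.8", "login"), ("login", "ssh"), ("ssh", "sudo"),
    ("sudo", "sneak.sh"), ("sneak.sh", "post.sh"),
    ("post.sh", "52.5.66.7"),
}

# Edges never seen before are candidate anomalies; several at once is
# treated as a stronger signal than a single deviation.
new_edges = observed - baseline
severity = ("info" if not new_edges else
            "medium" if len(new_edges) == 1 else
            "high")
print(len(new_edges), severity)  # 4 high
```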
- Fig. 4E illustrates a representation of an embodiment of a user login graph.
- Tier 0 clusters source IP addresses as belonging to a particular country (including an "unknown" country) or as a known bad IP.
- Tier 1 clusters user logins.
- Tier 2 clusters the type of machine class into which a user is logging in.
- the user login graph tracks the typical login behavior of users. By interacting with a representation of the graph, answers to questions such as the following can be obtained:
- a user accesses a machine class that the user has not previously accessed.
- One way to track privilege changes in a datacenter is by monitoring a process hierarchy of processes.
- the hierarchy of processes can be constrained to those associated with network activity.
- each process has two identifiers assigned to it, a process identifier (PID) and a parent process identifier (PPID).
- a graph can be constructed (also referred to herein as a privilege change graph) which models privilege changes.
- a graph can be constructed which identifies where a process P1 launches a process P2, where P1 and P2 each have an associated user U1 and U2, with U1 being an original user, and U2 being an effective user.
- each node is a cluster of processes (sharing a CType) executed by a particular (original) user. As all the processes in the cluster belong to the same user, a label that can be used for the cluster is the user’s username.
- An edge in the graph from a first node to a second node, indicates that a user of the first node changed its privilege to the user of the second node.
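- Deriving privilege-change edges from launch records can be sketched as follows; the record format (original user, effective user) is an assumption based on the description:

```python
# Hypothetical launch records: (original_user, effective_user) for each
# process launch in which the user may have changed.
launches = [
    ("bill", "bill"),
    ("bill", "root"),    # privilege escalation: bill -> root
    ("root", "daemon"),  # privilege change: root -> daemon
]

# Only launches where the user actually changes become graph edges.
edges = set()
for u1, u2 in launches:
    if u1 != u2:
        edges.add((u1, u2))

# Privilege escalation is the special case where the target user is root.
escalations = {(u1, u2) for u1, u2 in edges if u2 == "root"}
print(sorted(edges))        # [('bill', 'root'), ('root', 'daemon')]
print(sorted(escalations))  # [('bill', 'root')]
```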
- Fig. 4F illustrates an example of a privilege change graph.
- each node (e.g., nodes 444 and 445) corresponds to a user.
- Privilege changes are indicated by edges, such as edge 446.
- anomalies in graph 443 can be used to generate alerts. Three examples of such alerts are as follows:
- FIG. 4F is a representation of an example of an interface that depicts such an alert. Specifically, as indicated in region 447, an alert for the time period 1pm- 2pm on June 8 was generated. The alert identifies that a new user, Bill (448) executed a process.
- Privilege change: As explained above, a new edge from a first node (user A) to a second node (user B) indicates that user A has changed privilege to user B.
- Privilege escalation is a particular case of privilege change, in which the first user becomes root.
- An example of an anomalous privilege change and an example of an anomalous privilege escalation are each depicted in graph 450 of Fig. 4G.
- as indicated in region 451, two alerts for the time period 2pm-3pm on June 8 were generated (corresponding to the detection of the two anomalous events).
- as indicated in region 452, root has changed privilege to the user "daemon," which root has not previously done.
- This anomaly is indicated to the user by highlighting the daemon node (e.g., outlining it in a particular color, e.g., red).
- Bill has escalated his privilege to the user root (which can similarly be highlighted in region 454). This action by Bill represents a privilege escalation.
- datacenters are highly dynamic environments.
- different customers of data platform 12 (e.g., entity A vs. entity B) may have different/disparate needs/requirements of data platform 12, e.g., due to having different types of assets, different applications, etc.
- data platform 12 makes use of predefined relational schema (including by having different predefined relational schema for different customers).
- the complexity and cost of maintaining/updating such predefined relational schema can rapidly become problematic - particularly where the schema includes a mix of relational, nested, and hierarchical (graph) datasets.
- data platform 12 supports dynamic query generation by automatic discovery of join relations via static or dynamic filtering key specifications among composable data sets. This allows a user of data platform 12 to be agnostic to modifications made to existing data sets as well as creation of new data sets.
- the extensible query interface also provides a declarative and configurable specification for optimizing internal data generation and derivations.
- data platform 12 is configured to dynamically translate user interactions (e.g., received via web app 120) into SQL queries (and without the user needing to know how to write queries). Such queries can then be performed (e.g., by query service 166) against any compatible backend (e.g., data store 30).
- FIG. 4H illustrates an example of a user interacting with a portion of an interface.
- suppose a user is interacting with data platform 12 (e.g., via web app 120 using a browser).
- data is extracted from data store 30 as needed (e.g., by query service 166) to provide the user with information, such as the visualizations depicted variously herein.
- as the user interacts with the interface (e.g., clicking on graph nodes, entering text into search boxes, navigating between tabs (e.g., tab 455 vs. 465)), such interactions act as triggers that cause query service 166 to continue to obtain information from data store 30 as needed (and as described in more detail below).
- in Fig. 4H, user A is viewing a dashboard that provides various information about entity A users (455), during the time period March 2 at midnight - March 25 at 7pm (which she selected by interacting with region 456).
- Various statistical information is presented to user A in region 457.
- Region 458 presents a timeline of events that occurred during the selected time period.
- User A has opted to list only the critical, high, and medium events during the time period by clicking on the associated boxes (459-461). A total of 55 low severity, and 155 info-only events also occurred during the time period.
- Each time user A interacts with an element in Fig. 4H, her actions are translated/formalized into filters on the data set and used to dynamically generate SQL queries.
- the SQL queries are generated transparently to user A (and also to a designer of the user interface shown in Fig. 4H).
- User A notes in the timeline (462) that a user, Harish, connected to a known bad server (examplebad.com) using wget, an event that has a critical severity level.
- User A can click on region 463 to expand details about the event inline (which will display, for example, the text "External connection made to known bad host examplebad.com at port 80 from application 'wget' running on host dev1.lacework.internal as user harish") directly below line 462.
- User A can also click on link 464-1, which will take her to a dossier for the event (depicted in Fig. 41).
- a dossier is a template for a collection of visualizations.
- the event of Harish using wget to contact examplebad.com on March 16 was assigned an event ID of 9291 by data platform 12 (467).
- the event is also added to her dashboard in region 476 as a bookmark (468).
- a summary of the event is depicted in region 469.
- user A can see a timeline of related events. In this case, user A has indicated that she would like to see other events involving the wget application (by clicking box 471). Events of critical and medium security involving wget occurred during the one hour window selected in region 472.
- Region 473 automatically provides user A with answers to questions that may be helpful to have answers to while investigating event 9291. If user A clicks on any of the links in the event description (474), she will be taken to a corresponding dossier for the link. As one example, suppose user A clicks on link 475. She will then be presented with interface 477 shown in Fig. 4J.
- Interface 477 is an embodiment of a dossier for a domain.
- the domain is “examplebad.com,” as shown in region 478.
- user A would like to track down more information about interactions entity A resources have made with examplebad.com between January 1 and March 20. She selects the appropriate time period in region 479 and information in the other portions of interface 477 automatically update to provide various information corresponding to the selected time frame.
- user A can see that contact was made with examplebad.com a total of 17 times during the time period (480), as well as a list of each contact (481).
- Various statistical information is also included in the dossier for the time period (482). If she scrolls down in interface 477, user A will be able to view various polygraphs associated with examplebad.com, such as an application-communication polygraph (483).
- Data stored in data store 30 can be internally organized as an activity graph.
- nodes are also referred to as Entities. Activities generated by Entities are modeled as directional edges between nodes. Thus, each edge is an activity between two Entities.
- A first example of an Activity is a "login" Activity, in which a user Entity logs into a machine Entity (with a directed edge from the user to the machine).
- a second example of an Activity is a “launch” Activity, in which a parent process launches a child process (with a directed edge from the parent to the child).
- a third example of an Activity is a “DNS query” Activity, in which either a process or a machine performs a query (with a directed edge from the requestor to the answer, e.g., an edge from a process to www.example.com).
- a fourth example of an Activity is a network “connected to” Activity, in which processes, IP addresses, and listen ports can connect to each other (with a directed edge from the initiator to the server).
- query service 166 provides either relational views or graph views on top of data stored in data store 30.
- a user will want to see data filtered using the activity graph. For example, if an entity was not involved in an activity in a given time period, that entity should be filtered out of query results. Thus, a request to show “all machines” in a given time frame will be interpreted as “show distinct machines that were active” during the time frame.
- Query service 166 relies on three main data model elements: fields, entities, and filters.
- a field is a collection of values with the same type (logical and physical).
- a field can be represented in a variety of ways, including: 1. a column of relations (table/view), 2. a return field from another entity, 3. an SQL aggregation (e.g., COUNT, SUM, etc.), 4. an SQL expression with the references of other fields specified, and 5. a nested field of a JSON object.
- an entity is a collection of fields that describe a data set.
- the data set can be composed in a variety of ways, including: 1. a relational table, 2. a parameterized SQL statement, 3. dynamic SQL created by a Java function, and 4. a join/project/aggregate/subclass of other entities.
- Some fields are common for all entities.
- One example of such a field is a “first observed” timestamp (when first use of the entity was detected).
- a second example of such a field is the entity classification type (e.g., one of: 1. Machine (on which an agent is installed), 2. Process, 3. Binary, 4. UID, 5. IP, 6. DNS Information, 7. ListenPort, and 8. PType).
- a third example of such a field is a “last observed” timestamp.
- a filter is an operator that: 1. takes an entity and field values as inputs, 2. is a valid SQL expression with specific reference(s) of entity fields, or 3. is a conjunct/disjunct of filters.
- filters can be used to filter data in various ways, and limit data returned by query service 166 without changing the associated data set.
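The filter behavior described above, limiting returned data without changing the underlying data set and composing as conjuncts/disjuncts, can be sketched as composable SQL expression fragments. The class and the field names below are hypothetical, not the platform's API.

```python
class Filter:
    """A filter limits data returned by a query without changing the
    underlying data set; filters compose as conjuncts and disjuncts."""
    def __init__(self, sql):
        self.sql = sql

    def to_sql(self):
        return self.sql

    def __and__(self, other):  # conjunct of filters
        return Filter(f"({self.to_sql()}) AND ({other.to_sql()})")

    def __or__(self, other):   # disjunct of filters
        return Filter(f"({self.to_sql()}) OR ({other.to_sql()})")

active = Filter("last_observed >= :start_ts")
is_machine = Filter("entity_type = 'Machine'")
combined = active & is_machine
print(combined.to_sql())
# → (last_observed >= :start_ts) AND (entity_type = 'Machine')
```

A query service could append such an expression to a WHERE clause, so a request for "all machines" becomes "distinct machines that were active" during the time frame.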
- a dossier is a template for a collection of visualizations.
- Each visualization (e.g., the box including chart 484) has a corresponding card which identifies particular target information needed (e.g., from data store 30) to generate the visualization.
- data platform 12 maintains a global set of dossiers/cards. Users of data platform 12 such as user A can build their own dashboard interfaces using preexisting dossiers/cards as components, and/or they can make use of a default dashboard (which incorporates various of such dossiers/cards).
- a JSON file can be used to store multiple cards (e.g., as part of a query service catalog).
- a particular card is represented by a single JSON object with a unique name as a field name.
- Each card may be described by the following named fields:
- TYPE: the type of the card.
- PARAMETERS: a JSON array object that contains an array of parameter objects with the following fields:
- props (a generic JSON object for properties of the parameter. Possible values are: “utype” (a user-defined type), and “scope” (an optional property to configure a namespace of the parameter))
- SOURCES: a JSON array object explicitly specifying references of input entities. Each source reference has the following attributes:
- alias (an alias used to access this source entity in other fields (e.g., returns, filters, groups, etc.))
- RETURNS: a required JSON array object of return field objects.
- a return field object can be described by the following attributes:
- aggr (possible aggregations are: COUNT, COUNT DISTINCT, DISTINCT, MAX, MIN, AVG, SUM, FIRST_VALUE, LAST_VALUE)
- SQL: a JSON array of string literals for SQL statements. Each string literal can contain parameterized expressions ${ParameterName} and/or reference a composable entity by #{EntityName}.
- GRAPH: required for a graph entity. Has the following required fields:
- JOINS: a JSON array of join operators. Possible fields for a join operator include: type (possible join types include: “loj” (Left Outer Join), “join” (Inner Join), “in” (Semi Join), “implicit” (Implicit Join))
- FKEYS: a JSON array of FilterKey(s).
- the fields for a FilterKey are:
- FILTERS: a JSON array of filters (conjunct). Possible fields for a filter include:
- ORDERS: a JSON array of ORDER BY specifications for returned fields. Possible attributes for the ORDER BY clause include:
- GROUPS: a JSON array of GROUP BY specifications for returned fields. Field attributes are: field (ordinal index (1-based) or alias from the return fields)
- LIMIT: a limit on the number of records to be returned.
- OFFSET: an offset of the starting position of returned data, used in combination with LIMIT for pagination.
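As a purely hypothetical illustration of the named fields above, a card in a query service catalog might look like the following. The card name, parameter names, source entity, and all values are invented, and the exact attribute spellings in an actual catalog may differ.

```json
{
  "activeMachines": {
    "TYPE": "sql",
    "PARAMETERS": [
      { "name": "startTime", "props": { "utype": "timestamp" } }
    ],
    "SOURCES": [
      { "name": "Machine", "alias": "m" }
    ],
    "RETURNS": [
      { "field": "m.hostname", "aggr": "COUNT DISTINCT", "alias": "machineCount" }
    ],
    "FILTERS": [
      { "sql": "m.last_observed >= ${startTime}" }
    ],
    "ORDERS": [
      { "field": 1, "dir": "desc" }
    ],
    "LIMIT": 100
  }
}
```

The single JSON object keyed by a unique card name mirrors the representation described above, and the ${startTime} reference shows the parameterized-expression form used in SQL string literals.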
- customers of data platform 12 (e.g., entity A and entity B) can define new data transformations or a new aggregation of data from an existing data set (as well as a corresponding visualization for the newly defined data set).
- the data models and filtering applications used by data platform 12 are extensible.
- two example scenarios of extensibility are (1) extending the filter data set, and (2) extending a FilterKey in the filter data set.
- Data platform 12 includes a query service catalog that enumerates cards available to users of data platform 12. New cards can be included for use in data platform 12 by being added to the query service catalog (e.g., by an operator of data platform 12).
- a single external-facing card (e.g., available for use in a dossier) may be built from one or more internal cards.
- Each newly added card (whether external or internal) will also have associated FilterKey(s) defined.
- a user interface (UI) developer can then develop a visualization for the new data set in one or more dossier templates.
- the same external card can be used in multiple dossier templates, and a given external card can be used multiple times in the same dossier (e.g., after customization). Examples of external card customization include customization via parameters, ordering, and/or various mappings of external data fields (columns).
- a second extensibility scenario is one in which a FilterKey in the filter data set is extended (i.e., existing template functions are used to define a new data set).
- data sets used by data platform 12 are composable/reusable/extensible, irrespective of whether the data sets are relational or graph data sets.
- One example data set is the User Tracking polygraph, which is generated as a graph data set (comprising nodes and edges). Like other polygraphs, User Tracking is an external data set that can be visualized both as a graph (via the nodes and edges) and can also be used as a filter data set for other cards, via the cluster identifier (CID) field.
- Dynamic composition of filter datasets can be implemented using FilterKeys and FilterKey Types.
- a FilterKey can be defined as a list of columns and/or fields in a nested structure (e.g., JSON). Instances of the same FilterKey Type can be formed as an Implicit Join Group. The same instance of a FilterKey can participate in different Implicit Join Groups.
- a list of relationships among all possible Implicit Join Groups is represented as a Join graph for the entire search space to create a final data filter set by traversing edges and producing Join Path(s).
- Each card (e.g., as stored in the query service catalog and used in a dossier) can be introspected by a /card/describe/CardID REST request.
- query service 166 parses the list of implicit joins and creates a Join graph to manifest relationships of FilterKeys among Entities.
- a Join graph (an example of which is depicted in Fig. 4K) comprises a list of Join Link(s).
- a Join Link represents each implicit join group by the same FilterKey type.
- a Join Link maintains a reverse map (Entity-to-FilterKey) of FilterKeys and their Entities. As previously mentioned, Entities can have more than one FilterKey defined. The reverse map guarantees one FilterKey per Entity can be used for each JoinLink.
- Each JoinLink also maintains a list of entities for the priority order of joins.
- Each JoinLink is also responsible for creating and adding directional edge(s) to graphs.
- An edge represents a possible join between two Entities.
- each Implicit Join uses the Join graph to find all possible join paths. The search of possible join paths starts with the outer FilterKey of an implicit join.
- One approach is to use a shortest path approach with breadth-first traversal, subject to the following criteria: 1. use the priority order list of Join Links for all entities in the same implicit join group, and 2. stop when a node (Entity) is reached which has local filter(s).
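The join-path search described above, a breadth-first traversal that stops at an Entity carrying local filters, can be sketched as follows. The graph contents, entity names, and the dictionary representation of the Join graph are illustrative assumptions.

```python
from collections import deque

def shortest_join_path(join_graph, start, has_local_filter):
    """Breadth-first search over a Join graph (entity -> joinable entities),
    returning the shortest path that ends at an entity with local filter(s)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        # Stop when a node (Entity) with local filter(s) is reached.
        if node != start and has_local_filter(node):
            return path
        for neighbor in join_graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no join path reaches a filtered entity

graph = {"Process": ["Machine", "Binary"], "Machine": ["IP"], "Binary": [], "IP": []}
print(shortest_join_path(graph, "Process", lambda e: e == "IP"))
# → ['Process', 'Machine', 'IP']
```

Honoring the priority order of Join Links would amount to ordering each entity's neighbor list accordingly before traversal, since BFS explores neighbors in list order.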
- Fig. 4L illustrates an example of a process for dynamically generating and executing a query.
- process 485 is performed by data platform 12.
- the process begins at 486 when a request is received to filter information associated with activities within a network environment.
- One example of such a request occurs in response to user A clicking on tab 465.
- Another example of such a request occurs in response to user A clicking on link 464-1.
- Yet another example of such a request occurs in response to user A clicking on link 464-2 and selecting (e.g., from a dropdown) an option to filter (e.g., include, exclude) based on specific criteria that she provides (e.g., an IP address, a username, a range of criteria, etc.).
- a query is generated based on an implicit join.
- processing that can be performed at 487 is as follows. As explained above, one way dynamic composition of filter datasets can be implemented is by using FilterKeys and FilterKey Types. Instances of the same FilterKey Type can be formed as an Implicit Join Group. A Join graph for the entire search space can be constructed from a list of all relationships among all possible Join Groups, and a final data filter set can be created by traversing edges and producing one or more Join Paths. Finally, the shortest path in the join paths is used to generate an SQL query string. One approach to generating an SQL query string is to use a query building library (authored in an appropriate language such as Java).
- a common interface, “sqlGen,” may be used in conjunction with process 485, as follows.
- a card/entity is composed by a list of input cards/entities, where each input card recursively is composed by its own list of input cards.
- This nested structure can be visualized as a tree of query blocks (SELECT) in standard SQL constructs. SQL generation can be performed as a traversal of the tree from root to leaf entities (top-down), calling the sqlGen of each entity.
- Each entity can be treated as a subclass of the Java class (Entity).
- An implicit join filter (EntityFilter) is implemented as a subclass of Entity, similar to the right hand side of a SQL semi-join operator.
- A preSQLGen interface is primarily the entry point for EntityFilter to run a search and generate nested implicit join filters.
- pullUpCachable can be used to pull up common sub-query blocks, including those dynamically generated by preSQLGen, such that SELECT statements of those cacheable blocks are generated only once at top-level WITH clauses.
- a recursive interface, sqlWith is used to generate nested subqueries inside WITH clauses.
- the recursive calls of a sqlWith function can generate nested WITH clauses as well.
- An sqlFrom function can be used to generate SQL FROM clauses by referencing those subquery blocks in the WITH clauses. It also produces INNER/OUTER join operators based on the joins in the specification.
- Another recursive interface, sqlWhere, can be used to generate conjuncts and disjuncts of local predicates and semi-join predicates based on implicit join transformations.
- sqlProject, sqlGroupBy, sqlOrderBy, and sqlLimitOffset can respectively be used to translate the corresponding directives in JSON spec to SQL SELECT list, GROUP BY, ORDER BY, and LIMIT/OFFSET clauses.
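The top-down traversal and WITH-clause generation described above can be sketched in Python (rather than the Java described in the text). The entity names and SQL fragments are invented, and the "generate each cacheable block only once" behavior of pullUpCachable is reduced here to a simple seen-set.

```python
class Entity:
    """An entity whose data set is composed from input entities; sql_gen
    walks the tree top-down, emitting each named block once in a WITH clause."""
    def __init__(self, name, sql, inputs=()):
        self.name, self.sql, self.inputs = name, sql, list(inputs)

    def sql_with(self, seen):
        # Recursively emit WITH sub-blocks for input entities, leaf-first.
        clauses = []
        for child in self.inputs:
            clauses += child.sql_with(seen)
        if self.name not in seen:  # each cacheable block is generated only once
            seen.add(self.name)
            clauses.append(f"{self.name} AS ({self.sql})")
        return clauses

    def sql_gen(self):
        seen, withs = set(), []
        for child in self.inputs:
            withs += child.sql_with(seen)
        prefix = f"WITH {', '.join(withs)} " if withs else ""
        return prefix + self.sql

machines = Entity("machines", "SELECT mid, hostname FROM machine_table")
procs = Entity("procs", "SELECT pid, mid FROM process_table")
top = Entity(
    "top",
    "SELECT p.pid, m.hostname FROM procs p JOIN machines m ON p.mid = m.mid",
    inputs=[machines, procs],
)
print(top.sql_gen())
```

Running this emits a single query whose WITH clause names both input blocks, which the outer SELECT then references by alias, mirroring the sqlWith/sqlFrom division of labor.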
- the query (generated at 487) is used to respond to the request.
- the generated query is used to query data store 30 and provide (e.g., to web app 120) fact data formatted in accordance with a schema (e.g., as associated with a card associated with the request received at 486).
- data that is collected from agents and other sources may be stored in different ways, such as in a data warehouse, data lake, data mart, and/or any other data store.
- a data warehouse may be embodied as an analytic database (e.g., a relational database) that is created from two or more data sources. Such a data warehouse may be leveraged to store historical data, often on the scale of petabytes. Data warehouses may have compute and memory resources for running complicated queries and generating reports.
- Data warehouses may be the data sources for business intelligence (‘BI’) systems, machine learning applications, and/or other applications.
- data that has been copied into the data warehouse may be indexed for good analytic query performance, without affecting the write performance of a database (e.g., an Online Transaction Processing (‘OLTP’) database).
- Data warehouses also enable joining data from multiple sources for analysis. For example, a sales OLTP application probably has no need to know about the weather at various sales locations, but sales predictions could take advantage of that data.
- By adding historical weather data to a data warehouse it would be possible to factor it into models of historical sales data.
- Data lakes, which store files of data in their native format, may be considered “schema on read” resources.
- any application that reads data from the lake may impose its own types and relationships on the data.
- Data warehouses are “schema on write,” meaning that data types, indexes, and relationships are imposed on the data as it is stored in the enterprise data warehouse (‘EDW’).
- “Schema on read” resources may be beneficial for data that may be used in several contexts, and pose little risk of losing data.
- “Schema on write” resources may be beneficial for data that has a specific purpose, and good for data that must relate properly to data from other sources.
- Such data stores may include data that is encrypted using homomorphic encryption or other privacy-preserving encryption, as well as data associated with smart contracts, non-fungible tokens, decentralized finance, and other techniques.
- Data marts may contain data oriented towards a specific business line whereas data warehouses contain enterprise-wide data. Data marts may be dependent on a data warehouse, independent of the data warehouse (e.g., drawn from an operational database or external source), or a hybrid of the two. In embodiments described herein, different types of data stores (including combinations thereof) may be leveraged. Such data stores may be proprietary or may be embodied as vendor provided products or services such as, for example, Google BigQuery, Druid, Amazon Redshift, IBM Db2, Dremio, Databricks Lakehouse Platform, Cloudera, Azure Synapse Analytics, and others.
- the deployments (e.g., a customer’s cloud deployment) that are analyzed, monitored, evaluated, or otherwise observed by the systems described herein (e.g., systems that include components such as the platform 12 of Fig. 1D, the data collection agents described herein, and/or other components) may be provisioned using infrastructure as code (‘IaC’). IaC involves the managing and/or provisioning of infrastructure through code instead of through manual processes.
- configuration files may be created that include infrastructure specifications.
- IaC can be beneficial as configurations may be edited and distributed, while also ensuring that environments are provisioned in a consistent manner.
- IaC approaches may be enabled in a variety of ways including, for example, using IaC software tools such as Terraform by HashiCorp. Through the usage of such tools, users may define and provision data center infrastructure using JavaScript Object Notation (‘JSON’), YAML, proprietary formats, or some other format.
- the configuration files may be used to emulate a cloud deployment for the purposes of analyzing the emulated cloud deployment using the systems described herein.
- the configuration files themselves may be used as inputs to the systems described herein, such that the configuration files may be inspected to identify vulnerabilities, misconfigurations, violations of regulatory requirements, or other issues.
- configuration files for multiple cloud deployments may even be used by the systems described herein to identify best practices, to identify configuration files that deviate from typical configuration files, to identify configuration files with similarities to deployments that have been determined to be deficient in some way, or the configuration files may be leveraged in some other ways to detect vulnerabilities, misconfigurations, violations of regulatory requirements, or other issues prior to deploying an infrastructure that is described in the configuration files.
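A minimal sketch of inspecting configuration files for misconfigurations might look like the following. The resource types, property names, and the two rules are assumptions for illustration, not a real policy catalog or any actual IaC tool's schema.

```python
import json

def scan_config(config_text):
    """Flag hypothetical misconfigurations in an IaC-style JSON config.
    The rule set here is illustrative only."""
    cfg = json.loads(config_text)
    findings = []
    for res in cfg.get("resources", []):
        props = res.get("properties", {})
        # Rule 1 (illustrative): storage buckets should not be publicly readable.
        if res.get("type") == "storage_bucket" and props.get("public_read", False):
            findings.append(f"{res['name']}: bucket allows public read")
        # Rule 2 (illustrative): security groups should not allow 0.0.0.0/0 ingress.
        if res.get("type") == "security_group" and "0.0.0.0/0" in props.get("ingress_cidrs", []):
            findings.append(f"{res['name']}: ingress open to the internet")
    return findings

example = '''{"resources": [
  {"type": "storage_bucket", "name": "logs", "properties": {"public_read": true}},
  {"type": "security_group", "name": "web", "properties": {"ingress_cidrs": ["10.0.0.0/8"]}}
]}'''
print(scan_config(example))  # → ['logs: bucket allows public read']
```

Checking configuration files this way, before provisioning, is what allows issues to be caught prior to deploying the infrastructure they describe.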
- the techniques described herein may be used in multi-cloud, multi-tenant, cross-cloud, cross-tenant, cross-user, industry cloud, digital platform, and other scenarios depending on specific need or situation.
- the deployments that are analyzed, monitored, evaluated, or otherwise observed by the systems described herein may be monitored to determine the extent to which a particular component has experienced “drift” relative to its associated IaC configuration.
- Discrepancies between how cloud resources were defined in an IaC configuration file and how they are currently configured at runtime may be identified, and remediation workflows may be initiated to generate an alert, reconfigure the deployment, or take some other action. Such discrepancies may occur for a variety of reasons.
- the systems described herein may prevent unwanted drift from occurring during runtime and after a deployment has been created in accordance with an IaC configuration.
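Drift detection of the kind described above reduces to comparing declared resource attributes against their observed runtime state. The attribute names in this sketch are illustrative assumptions.

```python
def detect_drift(declared, observed):
    """Compare an IaC-declared resource configuration with its runtime state
    and report per-key discrepancies ("drift")."""
    drift = {}
    for key in set(declared) | set(observed):
        want, have = declared.get(key), observed.get(key)
        if want != have:
            drift[key] = {"declared": want, "observed": have}
    return drift

declared = {"instance_type": "m5.large", "encrypted": True}
observed = {"instance_type": "m5.xlarge", "encrypted": True}
print(detect_drift(declared, observed))
# → {'instance_type': {'declared': 'm5.large', 'observed': 'm5.xlarge'}}
```

A remediation workflow could then key off the returned map: an empty map means no drift, while any entry could trigger an alert or a reconfiguration back toward the declared state.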
- the deployments (e.g., a customer’s cloud deployment) that are analyzed, monitored, evaluated, or otherwise observed by the systems described herein may also be secured using security as code (‘SaC’). SaC extends IaC concepts by defining cybersecurity policies and/or standards programmatically, so that the policies and/or standards can be referenced automatically in the configuration scripts used to provision cloud deployments. Stated differently, SaC can automate policy implementation, and cloud deployments may even be compared with the policies to prevent “drift.” For example, if a policy is created where all personally identifiable information (‘PII’) or personal health information (‘PHI’) must be encrypted when it is stored, that policy is translated into a process that is automatically launched whenever a developer submits code, and code that violates the policy may be automatically rejected.
- SaC may be implemented by initially classifying workloads (e.g., by sensitivity, by criticality, by deployment model, by segment). Policies that can be instantiated as code may subsequently be designed. For example, compute-related policies may be designed, access-related policies may be designed, application-related policies may be designed, network-related policies may be designed, data-related policies may be designed, and so on. Security as code may then be instantiated through architecture and automation, as successful implementation of SaC can benefit from making key architectural-design decisions and executing the right automation capabilities. Next, operating model protections may be built and supported.
- an operating model may “shift left” to maximize self-service and achieve full-life-cycle security automation (e.g., by standardizing common development toolchains, CI/CD pipelines, and the like).
- security policies and access controls may be part of the pipeline, automatic code review and bug/defect detection may be performed, automated build processes may be performed, vulnerability scanning may be performed, checks against a risk-control framework may be made, and other tasks may be performed all before deploying an infrastructure or components thereof.
- In GitOps, a Git repository may be viewed as the one and only source of truth.
- GitOps may require that the desired state of infrastructure (e.g., a customer’s cloud deployment) be stored in version control such that the entire audit trail of changes to such infrastructure can be viewed or audited.
- all changes to infrastructure are embodied as fully traceable commits that are associated with committer information, commit IDs, time stamps, and/or other information.
- the systems described herein are described as analyzing, monitoring, evaluating, or otherwise observing a GitOps environment, in other embodiments other source control mechanisms may be utilized for creating infrastructure, making changes to infrastructure, and so on. In these embodiments, the systems described herein may similarly be used for analyzing, monitoring, evaluating, or otherwise observing such environments.
- the systems described herein may be used to analyze, monitor, evaluate, or otherwise observe a customer’s cloud deployment. While securing traditional datacenters requires managing and securing an IP-based perimeter with networks and firewalls, hardware security modules (‘HSMs’), security information and event management (‘SIEM’) technologies, and other physical access restrictions, such solutions are not particularly useful when applied to cloud deployments. As such, the systems described herein may be configured to interact with and even monitor other solutions that are appropriate for cloud deployments such as, for example, “zero trust” solutions.
- a zero trust security model (a.k.a., zero trust architecture) describes an approach to the design and implementation of IT systems.
- a primary concept behind zero trust is that devices should not be trusted by default, even if they are connected to a managed corporate network such as the corporate LAN and even if they were previously verified.
- Zero trust security models help prevent successful breaches by eliminating the concept of trust from an organization's network architecture.
- Zero trust security models can include multiple forms of authentication and authorization (e.g., machine authentication and authorization, human/user authentication and authorization) and can also be used to control multiple types of accesses or interactions (e.g., machine-to-machine access, human-to-machine access).
- the systems described herein may be configured to interact with zero trust solutions in a variety of ways.
- agents that collect input data for the systems described herein may be configured to access various machines, applications, data sources, or other entities through a zero trust solution, especially where local instances of the systems described herein are deployed at edge locations.
- the zero trust solution itself may be monitored to identify vulnerabilities, anomalies, and so on. For example, network traffic to and from the zero trust solution may be analyzed, the zero trust solution may be monitored to detect unusual interactions, log files generated by the zero trust solution may be gathered and analyzed, and so on.
- the systems described herein may leverage various tools and mechanisms in the process of performing their primary tasks (e.g., monitoring a cloud deployment).
- Linux eBPF (extended Berkeley Packet Filter) is a mechanism for writing code to be executed in the Linux kernel space.
- user mode processes can hook into specific trace points in the kernel and access data structures and other information.
- eBPF may be used to gather information that enables the systems described herein to attribute the utilization of networking resources or network traffic to specific processes. This may be useful in analyzing the behavior of a particular process, which may be important for observability/SIEM.
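Attributing network traffic to specific processes, as described above, ultimately reduces to aggregating per-process counters from kernel-side events. The sketch below assumes (pid, comm, bytes) records of the kind an eBPF hook on a kernel send path could emit; the eBPF program itself is not shown, and the event shape is an assumption for illustration.

```python
from collections import defaultdict

def attribute_traffic(events):
    """Aggregate per-process byte counts from (pid, comm, bytes) tuples,
    i.e., the user-space half of an eBPF-based traffic-attribution pipeline."""
    totals = defaultdict(int)
    for pid, comm, nbytes in events:
        totals[(pid, comm)] += nbytes
    return dict(totals)

# Hypothetical events as they might arrive from a kernel-side hook:
events = [(101, "curl", 512), (101, "curl", 256), (202, "nginx", 1024)]
print(attribute_traffic(events))
# → {(101, 'curl'): 768, (202, 'nginx'): 1024}
```

The resulting per-process totals are exactly the kind of attribution that makes a single process's network behavior analyzable for observability/SIEM purposes.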
- the systems described herein may be configured to collect security event logs (or any other type of log or similar record of activity) and telemetry in real time for threat detection, for analyzing compliance requirements, or for other purposes.
- the systems described herein may analyze telemetry in real time (or near real time), as well as historical telemetry, to detect attacks or other activities of interest.
- the attacks or activities of interest may be analyzed to determine their potential severity and impact on an organization. In fact, the attacks or activities of interest may be reported, and relevant events, logs, or other information may be stored for subsequent examination.
- systems described herein may be configured to collect security event logs (or any other type of log or similar record of activity) and telemetry in real time to provide customers with a SIEM or SIEM-like solution.
- SIEM technology aggregates event data produced by security devices, network infrastructure, systems, applications, or other sources. Centralizing all of the data that may be generated by a cloud deployment may be challenging for a traditional SIEM, however, as each component in a cloud deployment may generate log data or other forms of machine data, such that the collective amount of data that can be used to monitor the cloud deployment can grow to be quite large.
- a traditional SIEM architecture, where data is centralized and aggregated, can quickly result in large amounts of data that may be expensive to store, process, retain, and so on. As such, SIEM technologies may frequently be implemented such that silos are created to separate the data.
- data that is ingested by the systems described herein may be stored in a cloud-based data warehouse such as those provided by Snowflake and others.
- Because companies like Snowflake offer data analytics and other services to operate on data that is stored in their data warehouses, one or more of the components of the systems described herein may be deployed in or near Snowflake as part of a secure data lake architecture (a.k.a., a security data lake architecture, a security data lake/warehouse).
- components of the systems described herein may be deployed in or near Snowflake to collect data, transform data, analyze data for the purposes of detecting threats or vulnerabilities, initiate remediation workflows, generate alerts, or perform any of the other functions that can be performed by the systems described herein.
- data may be received from a variety of sources (e.g., EDR or EDR-like tools that handle endpoint data, cloud access security broker (‘CASB’) or CASB-like tools that handle data describing interactions with cloud applications, Identity and Access Management (‘IAM’) or IAM-like tools, and many others), normalized for storage in a data warehouse, and such normalized data may be used by the systems described herein.
- one data source that is ingested by the systems described herein is log data, although other forms of data such as network telemetry data (flows and packets) and/or many other forms of data may also be utilized.
- event data can be combined with contextual information about users, assets, threats, vulnerabilities, and so on, for the purposes of scoring, prioritization and expediting investigations.
- input data may be normalized, so that events, data, contextual information, or other information from disparate sources can be analyzed more efficiently for specific purposes (e.g., network security event monitoring, user activity monitoring, compliance reporting).
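Normalization of events from disparate sources can be sketched as a per-source field mapping onto one common schema. The source names ("edr", "casb") and all field names below are hypothetical, not the formats of any particular vendor's tools.

```python
def normalize_event(raw, source):
    """Map fields from heterogeneous sources onto a common schema so that
    events can be analyzed uniformly; mappings here are illustrative."""
    mappings = {
        "edr":  {"ts": "timestamp", "host": "asset", "event": "action"},
        "casb": {"time": "timestamp", "app": "asset", "activity": "action"},
    }
    normalized = {"source": source}
    for src_key, dst_key in mappings[source].items():
        normalized[dst_key] = raw.get(src_key)
    return normalized

edr_event = {"ts": 1700000000, "host": "host-1", "event": "login"}
casb_event = {"time": 1700000100, "app": "mail-app", "activity": "share"}
print(normalize_event(edr_event, "edr"))
print(normalize_event(casb_event, "casb"))
```

Once both events carry the same timestamp/asset/action fields, a single query or detection rule can cover either source, which is the efficiency gain the normalization step is after.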
- the embodiments described here offer real-time analysis of events for security monitoring, advanced analysis of user and entity behaviors, querying and long-range analytics for historical analysis, other support for incident investigation and management, reporting (for compliance requirements, for example), and other functionality.
- the systems described herein may be part of an application performance monitoring (‘APM’) solution.
- APM software and tools enable the observation of application behavior, observation of its infrastructure dependencies, observation of users and business key performance indicators (‘KPIs’) throughout the application’s life cycle, and more.
- the applications being observed may be developed internally, as packaged applications, as software as a service (‘SaaS’), or embodied in some other way.
- the systems described herein may provide one or more capabilities in support of APM, including with respect to virtual desktop infrastructure (‘VDI’).
- the systems described herein may be part of a solution for developing and/or managing artificial intelligence (‘AI’) or machine learning (‘ML’) applications.
- the systems described herein may be part of an AutoML tool that automates the tasks associated with developing and deploying ML models.
- the systems described herein may perform various functions as part of an AutoML tool such as, for example, monitoring the performance of a series of processes, microservices, and so on that are used to collectively form the AutoML tool.
- the systems described herein may perform other functions as part of an AutoML tool or may be used to monitor, analyze, or otherwise observe an environment that the AutoML tool is deployed within.
- the systems described herein may be used to manage, analyze, or otherwise observe deployments that include other forms of AI/ML tools.
- the systems described herein may manage, analyze, or otherwise observe deployments that include AI services.
- AI services are, like other resources in an as-a-service model, ready-made models and AI applications that are consumable as services and made available through APIs.
- organizations may access pre-trained models that accomplish specific tasks. Whether an organization needs natural language processing (‘NLP’), automatic speech recognition (‘ASR’), image recognition, or some other capability, AI services simply plug into an application through an API.
- the systems described herein may be used to manage, analyze, or otherwise observe deployments that include other forms of AI/ML tools such as Amazon SageMaker (or other cloud machine-learning platforms that enable developers to create, train, and deploy ML models) and related services such as Data Wrangler (a service to accelerate data prep for ML) and Pipelines (a CI/CD service for ML).
- data services may include secure data sharing services, data marketplace services, private data exchanges services, and others.
- Secure data sharing services can allow access to live data from its original location, where those who are granted access to the data simply reference the data in a controlled and secure manner, without latency or contention from concurrent users. Because changes to data are made to a single version, data remains up-to-date for all consumers, which ensures data models are always using the latest version of such data.
- Data marketplace services operate as a single location to access live, ready-to-query data (or data that is otherwise ready for some other use).
- a data marketplace can even include “feature stores,” which can allow data scientists to repurpose existing work. For example, once a data scientist has converted raw data into a metric (e.g., costs of goods sold), this universal metric can be found quickly and used by other data scientists for quick analysis against that data.
- the systems described herein may be used to manage, analyze, or otherwise observe deployments that include distributed training engines or similar mechanisms such as, for example, tools built on Dask.
- Dask is an open source library for parallel computing that is written in Python.
- Dask is designed to enable data scientists to improve model accuracy faster, as Dask enables data scientists to do everything in Python end-to-end, which means that they no longer need to convert their code to execute in environments like Apache Spark. The result is reduced complexity and increased efficiency.
- the systems described herein may also be used to manage, analyze, or otherwise observe deployments that include technologies such as RAPIDS (an open source Python framework which is built on top of Dask).
- RAPIDS optimizes compute time and speed by providing data pipelines and executing data science code entirely on graphics processing units (GPUs) rather than CPUs.
- Multi-cluster shared data architectures, DataFrames, and Java user-defined functions (UDFs) may be supported to enable trained models to run within a data warehouse.
- the systems described herein may be leveraged for the specific use case of detecting and/or remediating ransomware attacks and/or other malicious action taken with respect to data, systems, and/or other resources associated with one or more entities.
- Ransomware is a type of malware from cryptovirology that threatens to publish the victim’s data or perpetually block access to such data unless a ransom is paid.
- ransomware attacks may be carried out in a manner such that patterns (e.g., specific process-to-process communications, specific data access patterns, unusual amounts of encryption/re-encryption activities) emerge, where the systems described herein may monitor for such patterns.
- ransomware attacks may involve behavior that deviates from normal behavior of a cloud deployment that is not experiencing a ransomware attack, such that the mere presence of unusual activity may trigger the systems described herein to generate alerts or take some other action, even without explicit knowledge that the unusual activity is associated with a ransomware attack.
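- The deviation-based detection described above can be sketched simply. This is a hedged illustration only (the event counts, the 3-sigma threshold, and the idea of counting encryption events per minute are assumptions for the example, not the patent's actual detector): "normal activity" is modeled from observed behavior, and anything far outside it is flagged as unusual without needing to know that it is specifically ransomware.

```python
# Illustrative anomaly sketch: learn a baseline of encryption-related event
# rates, then flag counts that deviate sharply from that baseline.
import statistics

def learn_baseline(per_minute_counts):
    """Model 'normal activity' as mean/stdev of observed event counts."""
    return statistics.mean(per_minute_counts), statistics.pstdev(per_minute_counts)

def is_unusual(count, baseline, k=3.0):
    """Deviation from normal is treated as unusual, even without explicit
    knowledge that the activity is a ransomware attack."""
    mean, stdev = baseline
    return count > mean + k * max(stdev, 1.0)

# Counts of encrypt/re-write events per minute during normal operation.
normal = [2, 3, 1, 2, 4, 2, 3, 2]
baseline = learn_baseline(normal)

assert not is_unusual(4, baseline)   # within the normal range
assert is_unusual(250, baseline)     # burst of re-encryption -> alert
```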
- policies may be put in place and the systems described herein may be configured to enforce such policies as part of an effort to thwart ransomware attacks.
- such policies may govern, for example, the use of particular network sharing protocols (e.g., Common Internet File System (‘CIFS’), Network File System (‘NFS’))
- policies that protect backup systems may be implemented and enforced to ensure that usable backups are always available
- multifactor authentication for particular accounts may be utilized and accounts may be configured with the minimum privilege required to function
- isolated recovery environments may be created and isolation may be monitored and enforced to ensure the integrity of the recovery environment, and so on.
- the systems described herein may be configured to explicitly enforce such policies or may be configured to detect unusual activity that represents a violation of such policies, such that the mere presence of unusual activity may trigger the systems described herein to generate alerts or take some other action, even without explicit knowledge that the unusual activity is associated with a violation of a particular policy.
- ransomware attacks are often deployed as part of a larger attack that may involve, for example:
- embodiments of the present disclosure may be configured as follows:
- the systems may include one or more components that detect malicious activity based on the behavior of a process.
- the systems may include one or more components that store indicator of compromise (‘IOC’) or indicator of attack ('IOA’) data for retrospective analysis.
- the systems may include one or more components that detect and block fileless malware attacks.
- the systems may include one or more components that remove malware automatically when detected.
- the systems may include a cloud-based, SaaS-style, multitenant infrastructure.
- the systems may include one or more components that identify changes made by malware and provide the recommended remediation steps or a rollback capability.
- the systems may include one or more components that detect various application vulnerabilities and memory exploit techniques.
- the systems may include one or more components that continue to collect suspicious event data even when a managed endpoint is outside of an organization’s network.
- the systems may include one or more components that perform static, on-demand malware detection scans of folders, drives, devices, or other entities.
- the systems may include data loss prevention (DLP) functionality.
- the systems described herein may manage, analyze, or otherwise observe deployments that include deception technologies.
- Deception technologies allow for the use of decoys that may be generated based on scans of true network areas and data. Such decoys may be deployed as mock networks running on the same infrastructure as the real networks, but when an intruder attempts to enter the real network, they are directed to the false network and security is immediately notified. Such technologies may be useful for detecting and stopping various types of cyber threats such as, for example, Advanced Persistent Threats (‘APTs’), malware, ransomware, credential dumping, lateral movement and malicious insiders. To continue to outsmart increasingly sophisticated attackers, these solutions may continuously deploy, support, refresh and respond to deception alerts.
- the systems described herein may manage, analyze, or otherwise observe deployments that include various authentication technologies, such as multi-factor authentication and role-based authentication.
- the authentication technologies may be included in the set of resources that are managed, analyzed, or otherwise observed, as interactions with the authentication technologies may be monitored.
- log files or other information retained by the authentication technologies may be gathered by one or more agents and used as input to the systems described herein.
- the systems described herein may be leveraged for the specific use case of detecting supply chain attacks. More specifically, the systems described herein may be used to monitor a deployment that includes software components, virtualized hardware components, and other components of an organization’s supply chain such that interactions with an outside partner or provider with access to an organization’s systems and data can be monitored. In such embodiments, supply chain attacks may be carried out in a manner such that patterns (e.g., specific interactions between internal and external systems) emerge, where the systems described herein may monitor for such patterns.
- supply chain attacks may involve behavior that deviates from normal behavior of a cloud deployment that is not experiencing a supply chain attack, such that the mere presence of unusual activity may trigger the systems described herein to generate alerts or take some other action, even without explicit knowledge that the unusual activity is associated with a supply chain attack.
- the systems described herein may be leveraged for other specific use cases such as, for example, detecting the presence of (or preventing infiltration from) cryptocurrency miners (e.g., bitcoin miners), token miners, hashing activity, non-fungible token activity, other viruses, other malware, and so on.
- the systems described herein may monitor for such threats using known patterns or by detecting unusual activity, such that the mere presence of unusual activity may trigger the systems described herein to generate alerts or take some other action, even without explicit knowledge that the unusual activity is associated with a particular type of threat, intrusion, vulnerability, and so on.
- agents, sensors, or similar mechanisms may be deployed on or near managed endpoints such as computers, servers, virtualized hardware, internet of things (‘IoT’) devices, mobile devices, phones, tablets, watches, other personal digital devices, storage devices, thumb drives, secure data storage cards, or some other entity.
- an endpoint protection platform may provide functionality such as:
- Example embodiments are described in which policy enforcement, threat detection, or some other function is carried out by the systems described herein by detecting unusual activity, such that the mere presence of unusual activity may trigger the systems described herein to generate alerts or take some other action, even without explicit knowledge that the unusual activity is associated with a particular type of threat, intrusion, vulnerability, and so on.
- the systems described herein may be configured to learn what constitutes ‘normal activity’ - where ‘normal activity’ is activity observed, modeled, or otherwise identified in the absence of a particular type of threat, intrusion, vulnerability, and so on.
- detecting ‘unusual activity’ may alternatively be viewed as detecting a deviation from ‘normal activity’ such that ‘unusual activity’ does not need to be identified and sought out. Instead, deviations from ‘normal activity’ may be assumed to be ‘unusual activity’.
- the systems described herein or the deployments that are monitored by such systems may implement a variety of techniques.
- the systems described herein or the deployments that are monitored by such systems may tag data and logs to provide meaning or context, persistent monitoring techniques may be used to monitor a deployment at all times and in real time, custom alerts may be generated based on rules, tags, and/or known baselines from one or more polygraphs, and so on.
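- The tag-and-rule alerting idea above can be illustrated with a toy example. Everything here is hypothetical (the tag name `asset`, the `max_logins_per_min` baseline field, and the rule itself are assumptions, not a real polygraph schema): events carry context tags, and an alert fires when a tagged event exceeds a known baseline.

```python
# Illustrative tag-driven custom alerting against an assumed baseline.
BASELINE = {"prod-db": {"max_logins_per_min": 5}}  # assumed known baseline

def check(event):
    """Return an alert string if a tagged event violates its baseline."""
    tags = event["tags"]
    limit = BASELINE.get(tags.get("asset"), {}).get("max_logins_per_min")
    if limit is not None and event["logins_per_min"] > limit:
        return f"alert: {tags['asset']} exceeded login baseline"
    return None

assert check({"tags": {"asset": "prod-db"}, "logins_per_min": 3}) is None
assert check({"tags": {"asset": "prod-db"}, "logins_per_min": 40}) == \
    "alert: prod-db exceeded login baseline"
```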
- some embodiments may utilize agentless deployments where no agent (or similar mechanism) is deployed on one or more customer devices, deployed within a customer’s cloud deployment, or deployed at another location that is external to the data platform.
- the data platform may acquire data through one or more APIs such as the APIs that are available through various cloud services.
- one or more APIs that enable a user to access data captured by Amazon CloudTrail may be utilized by the data platform to obtain data from a customer’s cloud deployment without the use of an agent that is deployed on the customer’s resources.
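- As a hedged sketch of the agentless model, the snippet below normalizes a simplified CloudTrail-style JSON payload into tuples the platform could analyze. The record shape shown is a reduced approximation of real CloudTrail records (which carry many more fields), and the actual API call to fetch the payload is assumed rather than shown.

```python
# Agentless acquisition sketch: parse audit records pulled via a cloud
# provider's API instead of deploying an agent on customer resources.
import json

payload = json.dumps({
    "Records": [
        {"eventName": "ConsoleLogin", "userIdentity": {"userName": "alice"},
         "sourceIPAddress": "203.0.113.7"},
        {"eventName": "DeleteBucket", "userIdentity": {"userName": "bob"},
         "sourceIPAddress": "198.51.100.2"},
    ]
})

def extract_events(raw):
    """Normalize provider audit records into (user, action, source_ip) tuples."""
    return [
        (r["userIdentity"]["userName"], r["eventName"], r["sourceIPAddress"])
        for r in json.loads(raw)["Records"]
    ]

events = extract_events(payload)
assert ("bob", "DeleteBucket", "198.51.100.2") in events
```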
- agents may be deployed as part of a data acquisition service or tool that does not utilize a customer’s resources or environment.
- agents (deployed on a customer’s resources or elsewhere) and mechanisms in the data platform that can be used to obtain data through one or more APIs, such as the APIs that are available through various cloud services, may be utilized.
- one or more cloud services themselves may be configured to push data to some entity (deployed anywhere), which may or may not be an agent.
- other data acquisition techniques may be utilized, including combinations and variations of the techniques described above, each of which is within the scope of the present disclosure.
- additional examples can include multi-cloud deployments, on-premises environments, hybrid cloud environments, sovereign cloud environments, heterogeneous environments, DevOps environments, DevSecOps environments, GitOps environments, quantum computing environments, data fabrics, composable applications, composable networks, decentralized applications, and many others.
- Other types of data can include, for example, data collected from different tools (e.g., DevOps tools, DevSecOps tools, GitOps tools), different forms of network data (e.g., routing data, network translation data, message payload data, Wi-Fi data, Bluetooth data, personal area networking data, payment device data, near field communication data, metadata describing interactions carried out over a network, and many others), data describing processes executing in a container, lambda, EC2 instance, virtual machine, or other execution environment, information describing the execution environment itself, and many other types of data.
- one or more of the components described above may be deployed using a secure access service edge (‘SASE’) model or similar model.
- the services, functionality, or components described above may be deployed at edge devices (or relatively close to such edge device) such as a user’s laptop, tablet, smartphone, or other device.
- network security controls may be delivered on such edge devices.
- SASE capabilities may be delivered as a service based upon the identity of the entity, real-time context, enterprise security/compliance policies, and continuous assessment of risk/trust throughout the sessions, where the identity of entities can be associated with people, groups of people, devices, applications, services, IoT systems, or edge computing locations, and so on.
- one or more of the components described above may be deployed at or near the edge devices, and the edge devices may even include local applications that are configured to utilize one or more of the components described above where the components are not deployed on the edge devices themselves.
- Fig. 5 sets forth a flowchart illustrating an example method of dynamically generating monitoring tools for software applications in accordance with some embodiments of the present disclosure. Dynamically generating monitoring tools for software applications may be carried out using the systems described above. As such, one or more of the steps depicted in Fig. 5 may be performed by the systems described above.
- the example method depicted in Fig. 5 includes inspecting 504, using static code analysis, a non-executable representation 502 of an application to identify one or more points 506 in an application for monitoring.
- the static code analysis may be performed at the time that an application is compiled, at some other point in a CI/CD pipeline, or at some other time in the development lifecycle, including any time before the application is actually deployed.
- the static code analysis may be used to identify, for example, points during the execution of the software application where external libraries are accessed, points during the execution of the software application where the application accesses/is accessible by the public internet, points during the execution of the software application where the application is connected to sensitive data sources, and many other potential points of interest in the application.
- the points of interest in the application may be determined to be the points in the execution of the application that will be monitored.
- the amount of monitoring that is needed can be reduced. Reducing the amount of monitoring can introduce efficiencies as resources do not need to be dedicated to monitoring and analyzing points in the application’s execution that need not be monitored (e.g., points where the application is secure, points where the application is optimized).
- abstract syntax tree (‘AST’) analyzers may be used for the purposes of identifying particular points of interest in an application. For example, an AST analyzer may identify the nodes of a particular application so that each node may be identified as being a point of interest, nodes of a particular type may be identified as being a point of interest, nodes that are connected to other nodes of a particular type may be identified as being a point of interest, and so on. In some embodiments, additional aspects of the application that do not have a relationship to the code itself may be used to identify particular points of interest in an application.
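- A minimal AST-analyzer sketch follows, using Python's standard `ast` module. The watchlist of "interesting" call names is an assumption for illustration; a real analyzer would apply richer rules about node types and their connections.

```python
# Illustrative AST-based point-of-interest discovery: walk the tree and
# record line numbers where the code touches the network or external files.
import ast

INTERESTING_CALLS = {"urlopen", "connect", "open"}  # assumed watchlist

def points_of_interest(source):
    tree = ast.parse(source)
    points = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both attribute calls (obj.urlopen) and name calls (open).
            name = getattr(func, "attr", getattr(func, "id", None))
            if name in INTERESTING_CALLS:
                points.append((node.lineno, name))
    return points

sample = """\
import urllib.request

def fetch(url):
    return urllib.request.urlopen(url)  # network access -> monitor here

def local():
    return 1 + 1                        # nothing of interest
"""

assert points_of_interest(sample) == [(4, "urlopen")]
```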
- if a particular line of code has a debug symbol attached to it, that line may be identified as being a point of interest in the application by virtue of the fact that a developer (or other tool/person) thought that the line of code was interesting enough to warrant a debug symbol.
- although in some examples a non-executable representation 502 of an application is inspected 504, in other embodiments a fully executable version of the application may be inspected.
- the example method depicted in Fig. 5 also includes, for each of the one or more points 506 in the application, generating 508 a monitoring program 510.
- the monitoring program 510 may be embodied, for example, as one or more extended Berkeley Packet Filter (‘eBPF’) programs that may be attached to various tracepoints (e.g., tracepoints located at the predetermined points of interest in the application) to monitor aspects of the execution of the application.
- the particular nature of the monitoring program 510 that is generated 508 may vary, as it may be desirable to monitor different things at different points in the application’s execution.
- the monitoring program 510 may be attached to a code path and whenever that code path is traversed, the monitoring program 510 can execute.
- the precise nature of the monitoring program 510 that is generated 508 can be based on the results of the static code analysis.
- the monitoring program 510 may be generated dynamically based on the results of the static code analysis, for example, by using the static analysis to determine that at a particular point in an application’s execution the application may be engaged in data communications over a network such that the monitoring program 510 that is generated 508 must be capable of monitoring data communications-related information (e.g., IP addresses of communication endpoints that are communicated with, whether communications were encrypted, payload included in one or more messages).
- static code analysis may reveal that at a particular point in an application’s execution the application may be accessing an external database, such that the monitoring program 510 that is generated 508 must be capable of monitoring database access-related information (e.g., queries generated, identity of the database that was accessed, whether credentials were provided as part of the access request), and so on.
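- The idea that the kind of activity at a point drives what the generated monitor records can be sketched as follows. This is an illustration only: real implementations might emit eBPF programs, whereas here each "monitoring program" is simply a Python closure, and the point kinds and event field names are assumptions.

```python
# Hedged sketch: generate a different monitor depending on what static
# analysis says happens at a given point in the application.
def generate_monitor(point_kind):
    if point_kind == "network":
        def monitor(event):
            # Capture communications-related details.
            return {"peer_ip": event["peer_ip"], "encrypted": event["encrypted"]}
    elif point_kind == "database":
        def monitor(event):
            # Capture database-access-related details.
            return {"query": event["query"], "db": event["db"]}
    else:
        def monitor(event):
            return {}
    return monitor

net_monitor = generate_monitor("network")
record = net_monitor({"peer_ip": "192.0.2.9", "encrypted": True})
assert record == {"peer_ip": "192.0.2.9", "encrypted": True}
```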
- the particular nature of the task being performed by an application at a particular point in its execution may be used to drive the particular nature of the monitoring program 510 that is generated 508, as the monitoring program 510 that is inserted will need to be configured to monitor the particular tasks being performed by the application.
- the monitoring program 510 may be generated 508 dynamically based on the results of the static code analysis in other ways.
- because monitoring programs 510 are generated 508 that are specifically designed to monitor a particular application, the monitoring of an environment may become more customized to a particular customer’s deployment, even as that deployment changes over time with new applications, updated versions of existing applications, and other changes.
- the monitoring programs 510 that are generated 508 may be unique to a particular customer, unique to a particular application, unique to a particular executable, unique to a particular Docker build, unique to a particular environment, unique to a particular point in time, or unique in some other way.
- the example method depicted in Fig. 5 also includes, for each of the one or more points 506 in the application, inserting 512, into an executable representation 514 of the application, the monitoring program 510a, 510b, 510n at a location in the executable representation 514 of the application that corresponds to the identified point 506 in the application.
- the executable representation 514 of the application may be embodied, for example, as one or more fully executable binaries or in some other way.
- By inserting a monitoring program 510a, 510b, 510n at each of the identified points 506 in the application, various things may be monitored at various points 506 of the application’s execution.
- FIG. 6 sets forth a flowchart illustrating an additional example method of dynamically generating monitoring tools for software applications in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 6 is similar to the example method depicted in Fig. 5, as the example method depicted in Fig. 6 includes many of the steps from Fig. 5.
- inspecting 504 a non-executable representation 502 of an application to identify one or more points 506 in an application for monitoring can include inspecting 602 source code for the application that is stored in a code repository. Inspecting 602 source code for the application that is stored in a code repository may be carried out, for example, by examining each line of code to identify lines that perform some particular function that may be worth monitoring. For example, each line of code may be examined to identify lines of code that receive incoming communications, lines of code that generate outgoing communications, lines of code that issue commands to external data sources, and so on.
- inspecting 504 a non-executable representation 502 of an application to identify one or more points 506 in an application for monitoring can alternatively include inspecting 604 an intermediate representation of the application as the application is being compiled.
- the intermediate representation may be embodied, for example, as an AST as described above, as a version of the application that is being compiled, or as some other version of the application that precedes a deployed, executable version of the application. Readers will appreciate that because an intermediate representation of the application is inspected 604, the process of inspecting 504 an application to identify one or more points 506 to be monitored may begin before the application is even deployed.
- the example method depicted in Fig. 6 also includes creating 608 a monitoring program repository 606.
- the monitoring program repository 606 may be embodied, for example, as a database, as a table, or as some other appropriate data structure.
- Each entry in the monitoring program repository 606 may associate a particular monitoring program with an identification of an application and a point within the application (or portion thereof) where the monitoring program should be inserted.
- each application (or portion thereof) may be identified by a hash value that is generated by applying a hash function to the source code of the application (or portion thereof), or by associating some other identifier with the application (or portion thereof).
- Creating 608 a monitoring program repository 606 may be carried out, for example, by creating the appropriate data structure and populating the monitoring program repository 606 with one or more entries as described above.
- the generated eBPF programs may be included in a library or other repository so that the eBPF programs can be reused elsewhere.
- a hash function may be applied to the source code (or even a binary) for a first application to generate a hash value for the first application.
- the same hash function may be applied to any subsequently deployed applications (or portions thereof) to generate hash values for those subsequently deployed applications (or portions thereof).
- where the hash values match, the one or more eBPF programs that were used to monitor the first application may be attached to the same points of the subsequently deployed applications to monitor those applications. Readers will appreciate that although such applications are described as being ‘subsequently deployed’, in some embodiments the applications may not be actually deployed prior to attaching the one or more eBPF programs that were used to monitor the first application.
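- The hash-keyed reuse scheme above can be sketched directly. This is a simplified illustration (the repository is a plain dictionary, the "monitoring program" is just a string label, and SHA-256 is one reasonable choice of hash function, not one the source mandates): a later deployment of identical code maps to the same hash and therefore retrieves the same monitoring program without regeneration.

```python
# Sketch of a hash-keyed monitoring program repository.
import hashlib

repository = {}

def source_hash(source):
    """Identify an application (or portion thereof) by hashing its source."""
    return hashlib.sha256(source.encode()).hexdigest()

def register(source, program, insertion_point):
    repository[source_hash(source)] = (program, insertion_point)

def lookup(source):
    """Return the reusable monitoring program for identical code, if any."""
    return repository.get(source_hash(source))

first_app = "def handler(req):\n    return db.query(req.sql)\n"
register(first_app, program="db_access_monitor", insertion_point=2)

# A subsequently deployed application with identical source reuses the program.
redeployed = "def handler(req):\n    return db.query(req.sql)\n"
assert lookup(redeployed) == ("db_access_monitor", 2)
```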
- the method depicted in Fig. 6 also includes creating 610 an entry in the monitoring program repository 606 for a generated monitoring program 510. Creating 610 an entry in the monitoring program repository 606 for a generated monitoring program 510 may be carried out, for example, by creating an entry that associates the application (or some portion thereof) with a monitoring program 510 that was generated 508 to monitor the application (or some portion thereof). Such an entry may include other information such as, for example, the location in the application where the monitoring program 510 was inserted, or any other relevant information.
- Fig. 7 sets forth a flowchart illustrating an additional example method of dynamically generating monitoring tools for software applications in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 7 is similar to the example methods depicted in Fig. 5 and Fig. 6, as the example method depicted in Fig. 7 includes many of the steps from Fig. 5 and Fig. 6.
- the example method depicted in Fig. 7 also includes detecting 702, by a particular monitoring program 510a, 510b, 510n, that the application is preparing to take an undesirable action.
- One or more of the monitoring programs 510a, 510b, 510n may detect 702 that the application is preparing to take an undesirable action, for example, by monitoring the application for data packets being generated by the application and determining that a not-yet-sent data packet is targeting a known malicious address, by monitoring the application for data communications connections that are in the process of being established and determining that a pending data communications connection will be established with a known malicious address/actor, and so on.
- Because the monitoring programs 510a, 510b, 510n monitor the actions that are being taken by the application rather than monitoring packets that have already been sent/received by the application, undesirable actions can be prevented before they are taken. For example, an undesirable action of establishing a data communications connection between the application and a malicious actor may be prevented when the monitoring program detects that the application is about to establish such a connection.
- the example method depicted in Fig. 7 also includes preventing 704 the application from taking the undesirable action.
- Preventing 704 the application from taking the undesirable action may be carried out, for example, by blocking or sending an instruction to block some pending action. For example, a pending data communications message may not be sent, a pending sequence to open a data communications connection with some communications endpoint may be terminated, and so on.
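- The pre-send blocking step can be sketched as a wrapper around the real transmit path. This is an illustration under stated assumptions: the blocklist contents are invented, and real enforcement would hook the application's send path (e.g., via an attached monitoring program) rather than wrap a Python function.

```python
# Illustrative pre-send enforcement: inspect the destination before the
# message leaves the application, and block the pending action if needed.
KNOWN_MALICIOUS = {"203.0.113.66"}  # assumed threat-intel blocklist

def send_with_monitor(message, dest_ip, transmit):
    """Only invoke the real transmit function if the monitor allows it."""
    if dest_ip in KNOWN_MALICIOUS:
        return ("blocked", dest_ip)   # undesirable action prevented pre-send
    return transmit(message, dest_ip)

sent = []
transmit = lambda msg, ip: sent.append((msg, ip)) or ("sent", ip)

assert send_with_monitor(b"hello", "198.51.100.9", transmit) == ("sent", "198.51.100.9")
assert send_with_monitor(b"exfil", "203.0.113.66", transmit) == ("blocked", "203.0.113.66")
assert sent == [(b"hello", "198.51.100.9")]  # the blocked message never left
```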
- a particular monitoring program 510a, 510b, 510n may be configured to inspect data communications messages generated by the application prior to the application sending the data communications messages.
- the data communications messages may be embodied, for example, as actual data communications packets, as messages (or a similar mechanism) that are used to open a data communications session, or as any other data communication between the application and something that is external to the application.
- a monitoring program 510a, 510b, 510n may be configured to inspect data communications messages generated by a first portion of the application (i.e., a first service) that are sent to a second portion of the application (i.e., a second service).
- a particular monitoring program is configured to monitor for known exploitable conditions.
- the techniques and technologies described above may be used for the purposes of inspecting an application or environment for exploitable conditions.
- exploitable conditions such as an Out-of-bounds Write, Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection'), and many others may be detected by monitoring the actual usage of an application and comparing that usage to rules, heuristics, or some other mechanism to detect the exploitable condition.
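- As a toy illustration of rule-based detection of such a condition, the snippet below flags query strings whose WHERE clause contains a classic tautology, a common sign of SQL injection. The regex heuristic is an assumption for the example; real detectors compare actual usage against far richer rules and heuristics.

```python
# Illustrative SQL-injection heuristic: detect tautologies like OR '1'='1'.
import re

TAUTOLOGY = re.compile(r"(?i)\bOR\s+'?(\w+)'?\s*=\s*'?\1'?")

def looks_injected(query):
    """Return True if the query contains a tautology-style injection marker."""
    return bool(TAUTOLOGY.search(query))

assert looks_injected("SELECT * FROM users WHERE name = '' OR '1'='1'")
assert not looks_injected("SELECT * FROM users WHERE name = 'alice'")
```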
- the techniques described above may be leveraged in place of (or even in coordination with) packet capture (‘PCAP’) analysis in which the agents described above capture and analyze individual data packets that travel through a network in the deployment that is being monitored. Unlike PCAP analysis which monitors packets after they have been placed on the wire (i.e., in the network and outside of the application), the techniques described above may be leveraged to monitor the application without the application actually placing data on the wire.
- the techniques described above may be used as part of a real-time (or near real-time) system. Through the usage of static code analysis, points of interest in the execution of an application may be identified and those points of interest may be monitored in real-time for certain conditions.
- alerts could be generated in real-time (or near real-time) upon detecting the occurrence of a particular condition.
- other remedial actions could be taken in real-time (or near real-time) upon detecting the occurrence of a particular condition.
- Such remedial actions could include, as one example among many possible remedial actions, preventing an application from actually engaging in undesirable activity upon detecting that the application is about to engage in undesirable activity.
- monitoring may be achieved through the use of other technologies.
- static code analysis may be used to identify points of interest that may be monitored using uBPF programs, monitored using a Microsoft Windows kernel driver or some other Windows hooking technology that allows the systems described above to do a run-time verification of an application state at various checkpoints, monitored using BPF programs, monitored using pcap (or a variant thereof) programs, or monitored in some other way, including through the use of various platforms (e.g., FreeBSD, NetBSD, WinPcap) that can be used to convert BPF instructions into native code.
- monitoring the actual usage of an application and performing static analysis of the application may be carried out in a system that includes a feedback loop such that the results from monitoring the actual usage of an application are used to improve the static analysis of the application.
- monitoring the actual usage of the application as described above may reveal that the application or environment is vulnerable to some set of malicious acts (e.g., bitcoin mining attacks, ransomware attacks, etc.).
- some vulnerabilities may only be detected when the systems described above are monitoring the actual usage of an application, when the systems described above are monitoring the state of an environment, or when the systems described above are engaged in some other form of monitoring that can provide context into the current state of some resource (e.g., an application, a data store, a virtual machine, a container cluster, etc.).
- some other vulnerabilities may be detected statically.
- the systems described above may be configured to evaluate whether a vulnerability, anomaly, or some other condition that was detected through monitoring the application could have been detected statically.
- FIG. 8 sets forth a flow chart illustrating an example method of using real-time monitoring to inform static analysis in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 8 may be carried out, for example, by one or more modules of computer program instructions executing on physical hardware, virtual hardware, or in some other execution environment (e.g., one or more AWS Lambdas, one or more containers).
- modules may be part of the systems described above or otherwise coupled to the systems described above.
- the example method depicted in Fig. 8 includes inspecting 802, using one or more static code analysis techniques, one or more components of a cloud deployment.
- the one or more components of a cloud deployment may be embodied as one or more applications, one or more services, one or more microservices, some portion of an application, and so on.
- the one or more components of the cloud deployment may be inspected 802 using one or more static code analysis techniques, for example, by examining source code before a program is run, analyzing the source code against a set (or multiple sets) of coding rules.
- Static code analysis (also referred to as 'static program analysis') is distinguishable from dynamic analysis (which is performed on programs during their execution) largely by virtue of static code analysis being performed without executing the program, application, or executable object associated with the source code of such a program, application, or executable object.
- the static code analysis techniques may be used to perform unit level, technology level, system level, and/or mission level analysis.
- static code analysis may be used to detect some vulnerable condition in an application, to detect some exploitable aspect of an application, or to otherwise detect deficiencies in the one or more components of the cloud deployment.
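As an illustrative sketch only (no code forms part of the disclosure), inspecting source code against a set of coding rules as described above might be structured as follows. The `RULES` table, rule names, and sample source are all hypothetical, introduced here purely for illustration:

```python
import ast

# Hypothetical rule set: each rule name maps to a predicate over AST nodes.
RULES = {
    "no-eval": lambda node: isinstance(node, ast.Call)
    and isinstance(node.func, ast.Name)
    and node.func.id in ("eval", "exec"),
    "no-bare-except": lambda node: isinstance(node, ast.ExceptHandler)
    and node.type is None,
}

def inspect_source(source, rules=RULES):
    """Walk the AST of `source` and report (rule, line) for every violation."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        for name, matches in rules.items():
            if matches(node):
                findings.append((name, node.lineno))
    return findings

sample = """\
data = eval(user_input)
try:
    run()
except:
    pass
"""
print(inspect_source(sample))
```

Because the program is never executed, this kind of inspection can run on any component of a cloud deployment before that component is deployed.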
- the example method depicted in Fig. 8 also includes detecting 804, using data gathered during the execution of the component in the cloud deployment, a condition 806.
- Detecting 804 a condition 806 may be carried out, for example, as described above where agents are deployed on or near various components in a cloud deployment to gather data which is subsequently analyzed in a variety of ways. In some embodiments, such an analysis may reveal anomalous activity (i.e., a condition 806) resulting in an alert being raised, a remediation workflow being initiated, or some other action being taken.
- detecting 804 a condition 806 (e.g., an anomaly, a violation of some rule, a violation of some policy) is done using data gathered during the execution of the component in the cloud deployment. That is, detecting 804 a condition 806 is done using dynamic analysis techniques.
- the example method depicted in Fig. 8 includes modifying 808, based on the detected condition 806, the one or more static code analysis techniques. Modifying 808 the one or more static code analysis techniques based on the detected condition 806 may be carried out to implement the feedback loop described above where dynamic analysis is used to inform, improve, or otherwise influence what static analysis is performed on the components in a cloud deployment. In such a way, conditions that are detected via dynamic analysis may be evaluated to determine whether such conditions could be detected/prevented using static analysis techniques, including static code analysis techniques.
- modifying 808 the one or more static code analysis techniques based on the detected condition 806 may be carried out, for example, by creating one or more static code analysis rules designed to capture the detected condition 806.
- a condition 806 is detected 804 using data gathered during the execution of the component in the cloud deployment where an SQL injection is detected.
- the one or more static code analysis techniques may be modified 808 so as to add one or more static code analysis rules that look for code performing SQL queries and check whether or not those queries are dependent upon untrusted, external input.
- Such static code analysis techniques may be modified to include rules to determine whether the untrusted, external input is sanitized to remove any potentially malicious or dangerous content before use. If the untrusted, external input is not sanitized (meaning that untrusted input is used in an SQL query), then the modified static code analysis techniques may label the associated source code as containing a potential SQL injection vulnerability.
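A minimal sketch of such an added rule follows, assuming a Python code base. The function name is hypothetical, and the rule is deliberately narrow: it flags `.execute(...)` calls whose first argument is built dynamically (concatenation, %-formatting, or an f-string) rather than passed as a constant query with bound parameters; it would not, for example, catch `.format()` calls:

```python
import ast

def flags_sql_injection(source):
    """Flag .execute() calls whose query argument is assembled dynamically
    instead of being a constant string with bound parameters."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"
                and node.args):
            query = node.args[0]
            # BinOp covers "+" concatenation and "%" formatting;
            # JoinedStr covers f-strings.
            if isinstance(query, (ast.JoinedStr, ast.BinOp)):
                findings.append(node.lineno)
    return findings

vulnerable = 'cur.execute("SELECT * FROM users WHERE name = \'" + name + "\'")'
safe = 'cur.execute("SELECT * FROM users WHERE name = %s", (name,))'
print(flags_sql_injection(vulnerable), flags_sql_injection(safe))
```

The parameterized variant passes because the query itself is a constant; only the dynamically assembled query is labeled as a potential SQL injection vulnerability.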
- the one or more static code analysis techniques are modified 808 based on the detected condition 806
- the one or more static code analysis techniques are modified 808 based on the data gathered during the execution of the component in the cloud deployment.
- the one or more static code analysis techniques may be modified 808 based on some combination that includes the detected condition 806 and the data gathered during the execution of the component in the cloud deployment.
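The feedback loop itself, in which a condition detected dynamically installs a new static rule, can be sketched as follows. The `StaticAnalyzer` class, the `on_dynamic_detection` hook, and the condition name are hypothetical illustrations, not part of the disclosed system:

```python
import ast

class StaticAnalyzer:
    """A rule set that can be extended at run time."""
    def __init__(self):
        self.rules = {}

    def add_rule(self, name, predicate):
        self.rules[name] = predicate

    def inspect(self, source):
        return sorted(
            name
            for node in ast.walk(ast.parse(source))
            for name, predicate in self.rules.items()
            if predicate(node)
        )

def on_dynamic_detection(analyzer, condition):
    """Feedback loop: when run-time monitoring reports a condition, install a
    static rule that looks for code patterns associated with that condition."""
    if condition == "sql-injection":
        analyzer.add_rule(
            "dynamic-sql-query",
            lambda n: isinstance(n, ast.Call)
            and isinstance(n.func, ast.Attribute)
            and n.func.attr == "execute"
            and n.args
            and not isinstance(n.args[0], ast.Constant),
        )

analyzer = StaticAnalyzer()
code = 'cur.execute(f"SELECT * FROM t WHERE id = {uid}")'
print(analyzer.inspect(code))                    # no rules installed yet
on_dynamic_detection(analyzer, "sql-injection")  # condition seen at run time
print(analyzer.inspect(code))                    # now flagged statically
```

The same source code that passed inspection before the condition was observed is flagged afterwards, which is the sense in which dynamic analysis informs, improves, or otherwise influences the static analysis that is performed.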
- Fig. 9 sets forth a flow chart illustrating an additional example method of using real-time monitoring to inform static analysis in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 9 is similar to the example method depicted in Fig. 8, as Fig. 9 also includes: inspecting 802, using one or more static code analysis techniques, one or more components of a cloud deployment; detecting 804, using data gathered during the execution of the component in the cloud deployment, a condition 806; and modifying 808, based on the detected condition 806, the one or more static code analysis techniques.
- Modifying 902 real-time monitoring of the component based on one or more modifications to the static code analysis techniques may be carried out, for example, by removing real-time monitoring for conditions that the static code analysis techniques have been modified 808 to detect. For example (and continuing with the examples described above), if the static code analysis techniques have been modified 808 to detect all SQL injection vulnerabilities and the source code has been determined to not have any SQL injection vulnerabilities, the real-time monitoring of the component may be modified 902 to not look for SQL injections.
- Modifying 902 real-time monitoring (i.e., dynamic analysis) of the component based on one or more modifications to the static code analysis techniques may be carried out to implement the feedback loop described above where static analysis is used to inform, improve, or otherwise influence what dynamic analysis is performed on the components in a cloud deployment.
- static analysis is used to inform, improve, or otherwise influence what dynamic analysis is performed on the components in a cloud deployment.
- conditions that are incapable of detection via static analysis, expensive to detect via static analysis, or otherwise determined to be more capably discovered via real-time monitoring (i.e., dynamic analysis) may be evaluated to determine whether such conditions could be better detected/prevented using real-time monitoring (i.e., dynamic analysis).
- the burden to detect conditions that could be detected via static analysis may be shifted to the static code analysis tools and shifted away from the dynamic analysis tools.
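One hedged sketch of this burden shifting: drop a run-time monitor only when the condition is of a type the modified static analysis can detect and the re-inspected source was found clean of it. All names and condition labels below are hypothetical:

```python
def prune_monitoring(active_monitors, static_findings, statically_detectable):
    """Drop run-time monitors for conditions that static analysis can detect
    and that the (re-)inspected source code was found clean of."""
    return sorted(
        condition for condition in active_monitors
        if not (condition in statically_detectable
                and condition not in static_findings)
    )

active = {"sql-injection", "crypto-mining"}
# Static analysis was modified to detect SQL injection, and re-inspection
# found no SQL injection vulnerabilities, so that monitor can be retired.
print(prune_monitoring(active, static_findings=set(),
                       statically_detectable={"sql-injection"}))
```

A condition that static analysis cannot detect (here, crypto-mining activity) remains under run-time monitoring, matching the division of labor described above.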
- FIG. 10 sets forth a flow chart illustrating an additional example method of using real-time monitoring to inform static analysis in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 10 is similar to the example methods depicted in Fig. 8 and Fig. 9, as Fig. 10 also includes: inspecting 802, using one or more static code analysis techniques, one or more components of a cloud deployment; detecting 804, using data gathered during the execution of the component in the cloud deployment, a condition 806; and modifying 808, based on the detected condition 806, the one or more static code analysis techniques.
- the example method depicted in Fig. 10 also includes detecting 1002, using the modified one or more static code analysis techniques, the condition 806. Detecting 1002 the condition 806 using the modified static code analysis techniques may be carried out, for example, by re-performing static code analysis on the one or more components using the modified one or more static code analysis techniques. Readers will appreciate that in this example, an actual occurrence of a particular condition 806 may not be possible to detect by the modified static code analysis techniques. For example, if the detected condition 806 was an SQL injection, the modified static code analysis techniques would not detect an actual SQL injection given that an actual SQL injection would only occur when the one or more components (e.g., an application) were executing. As such, detecting 1002 the condition 806 may be carried out by detecting that some code is vulnerable to the detected condition 806 or is otherwise structured such that the detected condition 806 could occur if the code were executed.
- this type of condition may be of a predetermined type (i.e., an SQL injection vulnerability type) that can be detected via static analysis.
- conditions may be categorized and designated as the types of conditions that can or cannot be detected via static analysis.
- a categorization and designation of conditions may be done by a system administrator or other user.
- Alternatively, such a categorization and designation of conditions may be machine learned. In such a way, a determination may be made as to whether a condition 806 that actually occurred in a particular cloud deployment could have been avoided with more rigorous static analysis.
- static code analysis techniques deployed for a first customer may be different than static code analysis techniques deployed for a second customer.
- a first customer may have a first cloud deployment and the second customer may have a second cloud deployment, such that the static code analysis techniques that are used to inspect the components in each cloud deployment are distinct.
- condition 806 may be detected for a component in a cloud deployment of a first customer and the static code analysis techniques that were modified may be used for inspecting components in a cloud deployment of a second customer.
- condition 806 may be detected for a component in a first cloud deployment and the static code analysis techniques that were modified may be used for inspecting components in a second cloud deployment.
- in such a way, static code analysis techniques for a particular customer or a particular cloud deployment may be improved without actually deploying components in the particular customer's cloud deployment.
- Readers will appreciate that by using dynamic analysis in another customer’s cloud deployment to inform, improve, or otherwise influence what static analysis is performed on the components in a cloud deployment, a customer’s cloud deployment may be inspected by well-conceived static code analysis techniques from the outset.
- the static code analysis techniques may be used to inspect source code in a code repository prior to deploying the source code in the cloud deployment.
- the source code repository may be embodied, for example, as a Git Repository or as some other repository.
- the static code analysis techniques may be used to inspect source code in a code repository prior to deploying the source code in the cloud deployment, for example, by using static analysis as part of the software development process and providing the results of such static analysis to developers, quality assurance teams, and so on.
- the use of the improved static code analysis techniques may be leveraged prior to actually deploying a component in a cloud environment, thereby improving the quality of such components.
- Evaluating whether a vulnerability, anomaly, or some other condition that was detected through monitoring the application could have been detected statically may be carried out in a variety of ways.
- evaluating whether a vulnerability, anomaly, or some other condition that was detected through monitoring the application could have been detected statically may be carried out through the use of machine learning techniques. For example, information describing an application, an environment, or other resources may be provided to a machine learning model along with information describing various conditions that were detected by monitoring the application, environment, or other resources. Through the usage of such machine learning techniques, static markers of an application, an environment, or other resources may be identified that correlate with the occurrence of some detected condition.
- such machine learning techniques may reveal that environments that experience a successful bitcoin mining attack (as detected by the systems described above monitoring such environments) frequently have some combination of configuration characteristics that makes the environment more prone to bitcoin mining attack.
- the static analysis that is performed may be augmented using such a feedback loop to evaluate an environment for the identified combination of configuration characteristics that makes the environment more prone to bitcoin mining attack. Readers will appreciate that by using such a feedback loop, the systems described above may evolve over time to improve their static analysis capabilities such that vulnerabilities or other conditions are prevented from ever occurring - rather than being detected through monitoring.
- the feedback loop may be used to optimize static analysis capabilities, which may even include identifying static analysis functions that need not be performed.
- Some static analysis capabilities may not need to be performed, for example, because there may be no correlation between some statically detectable attributes of an application, an environment, or other resource and the occurrence of any conditions that may be detected through monitoring the application, the environment, or the other resource.
- because performing static analysis does come with some costs (e.g., utilizing resources to perform static analysis, delays in deploying applications or resources generally), the costs associated with performing some static analysis capabilities may not be justified by the gains that can be achieved by performing such static analysis.
- information describing an application, an environment, or other resources may be provided to a machine learning model along with information describing various conditions that were detected by monitoring the application, environment, or other resources.
- static markers of an application, an environment, or other resources may be identified that have no (or sufficiently low when compared to some threshold) correlation with the occurrence of some detected condition.
- static analysis capabilities that are designed to discover such inconsequential static markers of an application, an environment, or other resources may be disabled.
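A hedged sketch of how such inconsequential static markers might be identified: compute, for each binary marker, its phi coefficient with the occurrence of a monitored condition across observed environments, and retain only checks whose correlation clears a threshold. The environment records, marker names, and threshold below are hypothetical:

```python
def marker_correlation(observations, marker, incident):
    """Phi coefficient between a binary static marker and a binary incident
    across observed environments (0.0 when either variable is constant)."""
    n = len(observations)
    a = sum(1 for o in observations if o[marker] and o[incident])
    b = sum(1 for o in observations if o[marker] and not o[incident])
    c = sum(1 for o in observations if not o[marker] and o[incident])
    d = n - a - b - c
    denom = ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5
    return (a * d - b * c) / denom if denom else 0.0

def prune_checks(observations, checks, incident, threshold=0.3):
    """Disable static checks whose marker shows negligible correlation
    with the monitored incident."""
    return [m for m in checks
            if abs(marker_correlation(observations, m, incident)) >= threshold]

envs = [
    {"open_ports": True,  "debug_mode": True,  "mining_attack": True},
    {"open_ports": True,  "debug_mode": False, "mining_attack": True},
    {"open_ports": False, "debug_mode": True,  "mining_attack": False},
    {"open_ports": False, "debug_mode": False, "mining_attack": False},
]
print(prune_checks(envs, ["open_ports", "debug_mode"], "mining_attack"))
```

In this toy data the `open_ports` marker correlates perfectly with mining attacks and is kept, while `debug_mode` shows no correlation and its check is disabled, mirroring the optimization described above.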
- the particular static analysis capabilities that are enabled for one customer may differ from the static analysis capabilities that are enabled for another customer.
- the static analysis capabilities that are enabled for each customer may be different, for example, because the customers have deployed different applications in different environments, because the customers have resources that are utilized differently, and for a variety of other reasons.
- two customers may have containerized applications that are each deployed in a K8s cluster, but one customer may have their K8s cluster exposed only via a private network whereas another customer may have their K8s cluster exposed to the public internet.
- the conditions (e.g., threats, vulnerabilities) that need to be either prevented through static analysis or detected through monitoring may be different for each customer.
- the particular static analysis capabilities that are enabled/disabled for one customer may be influenced by the static analysis capabilities that are enabled/disabled for another customer.
- the static analysis capabilities that are enabled/disabled for each customer may be similar, for example, because the customers have deployed similar applications in similar environments, because the customers have resources that are utilized similarly, and for a variety of other reasons.
- two customers may have containerized applications that are each deployed in a K8s cluster that is exposed to the public internet.
- the conditions (e.g., threats, vulnerabilities) that need to be either prevented through static analysis or detected through monitoring may be similar for each customer.
- the outputs of the systems described above may continuously be evaluated to ensure that the systems described above are producing useful output.
- generating polygraphs (and other forms of output) comes with a cost. For example, storage resources need to be consumed to store data describing a customer’s environment, processing resources need to be consumed to process (including enriching) the data, processing resources need to be consumed to evaluate the data for the presence of certain conditions, and so on.
- the systems described above may be configured to evaluate whether the output that is being generated is useful, so that costs can be avoided if those costs only result in relatively low value output. Readers will appreciate that, as another motivation to avoid producing low value output, presenting low value output to customers also has the possibility of degrading the customer’s experience, distracting the customer from high value output, and other negative impacts.
- the systems described above may be configured to evaluate a customer's utilization of its output as an indication of the relative value of the output. For example, if a first type of output (e.g., alerts related to some particular vulnerability) is generally acted upon whereas a second type of output (e.g., alerts related to a different vulnerability) is generally ignored, this may be an indication that the second type of output is relatively low value output whereas the first type of output is relatively high value output.
- other information related to the manner in which one or more customers consume output may be used to determine the usefulness of the output.
- Such other information can include, for example, how quickly a customer acted upon some output, whether a customer presented the output to others (e.g., did an admin forward the output to peers), how a customer categorized the resolution of some output (e.g., resolved v. ignored), and so on.
- the value of some output may be specific to a particular customer (e.g., one customer may routinely ignore a first type of output relative to other types of output while another customer routinely acts upon the first type of output relative to other types of output).
- the value that is assigned to some output may be influenced by how other customers interact with (or consume) that output.
- a combination of such approaches may be used for each customer as one type of output may be treated in a way that is highly specific to a particular customer while another type of output may not be treated in a way that is specific to the particular customer.
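The combined per-customer and cross-customer valuation described above might be sketched as follows. The event records, field names, and fallback threshold are hypothetical: score an output type by the fraction of instances the customer acted on, falling back to the cross-customer rate when that customer has too few interactions of that type:

```python
def usefulness(events, customer, output_type, min_local=5):
    """Score an output type by the fraction of alerts the customer acted on,
    falling back to the cross-customer rate when the customer has too few
    interactions of that output type."""
    local = [e["acted"] for e in events
             if e["customer"] == customer and e["type"] == output_type]
    if len(local) >= min_local:
        return sum(local) / len(local)
    overall = [e["acted"] for e in events if e["type"] == output_type]
    return sum(overall) / len(overall) if overall else 0.0

events = (
    [{"customer": "a", "type": "vuln-x", "acted": True}] * 6
    + [{"customer": "a", "type": "vuln-y", "acted": False}] * 6
    + [{"customer": "b", "type": "vuln-y", "acted": True}] * 2
)
print(usefulness(events, "a", "vuln-x"))  # routinely acted upon: high value
print(usefulness(events, "a", "vuln-y"))  # routinely ignored: low value
print(usefulness(events, "b", "vuln-x"))  # too little local data: global rate
```

Low-scoring output types could then be demoted or suppressed, so that storage and processing resources are not consumed producing output that customers do not find useful.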
- changes and modifications to the systems described above may be evaluated in a more objective manner. For example, if an update to the system causes relatively high value output to be hidden and also causes relatively low value output to be promoted, this change may be viewed negatively. If an update to the system causes relatively high value output to be promoted and also causes relatively low value output to be hidden, however, this change may be viewed positively. In such a way, the quality of the system described above may be improved by modeling output in terms of its usefulness. Likewise, the systems may operate more efficiently by modeling output in terms of its usefulness such that the system does not consume valuable resources (e.g., storage, compute, time) in the pursuit of output that is not sufficiently useful as indicated by a customer’s consumption of that output.
- Fig. 11 sets forth a flowchart illustrating an example method of configuring cloud deployments (or components in a software development pipeline) based on learnings obtained by monitoring other cloud deployments (or components in another software development pipeline) in accordance with some embodiments of the present disclosure.
- the cloud deployments 1108, 1114 may be similar to the cloud deployments described above, where a particular cloud deployment can include a variety of components 1110, 1112 such as one or more applications, one or more data sources, networking resources, processing resources, and other resources.
- Such components 1110, 1112 may, in some embodiments, be deployed in the cloud deployments 1108, 1114 using one or more as-a-service models where software, infrastructure, platforms, databases, and other components are delivered as services.
- Configuring cloud deployments 1108, 1114 based on learnings obtained by monitoring other cloud deployments may be carried out using the systems described above. As such, one or more of the steps depicted in Fig. 11 may be performed by the systems described above. Readers will appreciate that although the examples described here relate to an embodiment where learnings that are related to one cloud deployment are used to improve another cloud deployment, in other embodiments learnings that are obtained by monitoring or otherwise observing a first software development pipeline may be used to improve a second software development pipeline.
- the example method depicted in Fig. 11 includes determining 1102 normal behavior for one or more components 1110 in a first cloud deployment 1108. Determining 1102 normal behavior for one or more components 1110 in a first cloud deployment 1108 may be carried out, for example, as described in greater detail above (at times described as identifying ‘normal activity’) by the systems described above (also referred to herein as a ‘data platform’).
- the example method depicted in Fig. 11 also includes determining 1104 normal behavior for one or more components 1112 in one or more other cloud deployments 1114. Determining 1104 normal behavior for one or more components 1112 in one or more other cloud deployments 1114 may also be carried out, for example, as described in greater detail above (at times described as identifying ‘normal activity’) by the systems described above.
- a customer-specific data platform may be used to analyze, monitor, or otherwise observe a particular customer’s cloud deployment (or some other deployment).
- various clusters may exist. For example, a collection of microservices may form a cluster by virtue of those microservices communicating only (or mostly) with each other.
- one or more cloud computing instances (e.g., one or more EC2 instances) and a database may form a cluster by virtue of the EC2 instances accessing the database as the only source of data utilized by the EC2 instances. Using the techniques and mechanisms described above, such clusters may be identified.
- although clusters may be identified and characteristics associated with the cluster may be learned, limited insights may be gained if only a particular customer's cloud deployment is analyzed, monitored, or otherwise observed.
- cross-customer analysis may be leveraged to gain deeper insights than would be gained if only a single customer’s cloud deployment is analyzed, monitored, or otherwise observed.
- cross-customer analysis may be carried out by gathering information related to cloud deployments (or some other deployments) for multiple customers and comparing such information.
- information describing clusters identified in a first customer’s cloud deployment may be compared to information describing clusters identified in a second customer’s cloud deployment for the purposes of identifying similar or identical clusters in each customer’s cloud deployment.
- assume, for example, that each customer's deployment included a web server that was deployed in one or more EC2 instances.
- a particular cluster that represents the web server may be identified in each customer’s deployment.
- a first cluster in the first customer’s deployment may represent a first web server and a second cluster in the second customer’s deployment may represent a second web server.
- because the first cluster and the second cluster would have similar characteristics (e.g., each cluster receives data communications using HTTP or HTTPS or any other suitable communication protocol, each cluster communicates with a web browser, each cluster requires similar computing resources, and so on), the first cluster and the second cluster may be identified as being identical clusters by comparing the characteristics of each cluster.
- This process may be repeated across the cloud deployments for many customers such that a collection of ‘web server’ clusters (in this example) may be identified.
- identifying the nature or type (e.g., a web server) of the clusters is not required.
- similar or identical clusters may be identified even if the exact nature/type of those clusters is not known. For example, a comparison of the characteristics of multiple clusters may only reveal that the clusters are identical, even if such a comparison does not reveal that the clusters are 'web server' clusters.
- Multiple clusters that have been identified as being identical (or sufficiently similar as measured by a threshold) will be referred to throughout the remainder of this document as a "cluster set," where the clusters that are members of the cluster set may be deployed across multiple customers' cloud deployments.
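A hedged sketch of forming such cluster sets: represent each cluster by a set of observed characteristics and greedily group clusters whose Jaccard similarity clears a threshold. The cluster names, characteristic labels, and threshold below are hypothetical:

```python
def jaccard(a, b):
    """Similarity between two clusters' characteristic sets."""
    return len(a & b) / len(a | b)

def cluster_sets(clusters, threshold=0.8):
    """Greedily group clusters (possibly from different customers'
    deployments) with sufficiently similar characteristics into cluster sets."""
    sets = []
    for name, traits in clusters.items():
        for group in sets:
            if jaccard(traits, group["traits"]) >= threshold:
                group["members"].append(name)
                group["traits"] |= traits  # widen the group's footprint
                break
        else:
            sets.append({"members": [name], "traits": set(traits)})
    return [sorted(g["members"]) for g in sets]

clusters = {
    "cust1/web": {"http", "https", "talks-to-browser", "low-cpu"},
    "cust2/web": {"http", "https", "talks-to-browser", "low-cpu"},
    "cust1/db":  {"sql", "private-network", "high-memory"},
}
print(cluster_sets(clusters))
```

Note that the grouping never needs to know that the first cluster set consists of 'web server' clusters; similarity of characteristics alone is sufficient, as described above.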
- information describing each member of the cluster set may be utilized to identify distributions across the cluster set.
- Distributions may be identified for traditional resource consumption metrics such as, for example, CPU usage, memory usage, network bandwidth usage, and others.
- a distribution may reveal, for example, that all members of the cluster set utilize between 10-60 Mb/s of network bandwidth, with the vast majority of members of the cluster set utilizing between 45-60 Mb/s of network bandwidth. Readers will appreciate that distributions may also be identified for other quantifiable characteristics of each cluster.
- Such quantifiable characteristics can include, for example, the failure rate of a cluster or particular components thereof, an identification of communication protocols used by a cluster or particular components thereof, an identification of the types of communications endpoints that a cluster or particular components thereof communicate with (e.g., endpoints that reside on the public internet v. endpoints that are in a private network), characteristics that can be classified by a binary value (e.g., does any component in the cluster perform privileged operations), information describing the various privileges that are given to a particular cluster, and many more.
- the distributions may be used to identify ‘normal’ behavior for a particular cluster.
- a distribution may be identified which indicates that all members of the cluster set communicate with (and have privileged access to) the same set of cloud services, including: 1) a cloud database service (e.g., Amazon Aurora, Microsoft Azure SQL Database, Amazon Relational Database Service, Google Cloud SQL, Amazon DynamoDB), 2) a vendor provided SaaS offering that provides bill payment services, and 3) a vendor provided SaaS offering that provides accounting services.
- a baseline may be established that identifies ‘normal’ behavior for each member of the cluster set, at least with respect to the specific characteristic (i.e., what cloud services are utilized by members) that the distribution is based on.
- if a member of the cluster set began communicating with a source code repository cloud service (e.g., GitHub Enterprise on AWS), this sort of access would be outside of the typical distribution for this cluster set and could serve as the basis for raising an alert, denying access to the service, or initiating some other alerting/remediation workflow.
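A minimal sketch of this baseline-and-alert flow, assuming each member of a cluster set is represented by the list of cloud services it accesses (the member names, service names, and the 90% membership threshold are hypothetical):

```python
from collections import Counter

def service_baseline(accesses, min_fraction=0.9):
    """Services used by at least `min_fraction` of the cluster set's members
    form the 'normal' service footprint for the cluster set."""
    counts = Counter(s for member in accesses.values() for s in set(member))
    n = len(accesses)
    return {s for s, c in counts.items() if c / n >= min_fraction}

def out_of_distribution(member_services, baseline):
    """Accesses outside the baseline could trigger an alert, a denial of
    access, or some other alerting/remediation workflow."""
    return sorted(set(member_services) - baseline)

accesses = {
    "m1": ["db", "billing", "accounting"],
    "m2": ["db", "billing", "accounting"],
    "m3": ["db", "billing", "accounting", "github"],  # atypical access
}
baseline = service_baseline(accesses)
print(out_of_distribution(accesses["m3"], baseline))
```

Here the anomalous source-code-repository access by one member falls outside the footprint shared by the cluster set, so it surfaces as an out-of-distribution event.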
- Readers will appreciate that many distributions may be created for each cluster set, where each distribution is based on one or more characteristics of the members of the cluster set.
- the example method depicted in Fig. 11 also includes recommending 1106, based on the normal behavior for one or more components 1112 in one or more other cloud deployments 1114, a change to the first cloud deployment 1108.
- Recommending 1106 a change to the first cloud deployment 1108 may be carried out, for example, in response to determining that the normal behavior in one or more other cloud deployments 1114 differs from the normal behavior for one or more components 1110 in a first cloud deployment 1108.
- changes to the first cloud deployment 1108 may be recommended that (if implemented) would cause the first cloud deployment 1108 to be more similar to the other cloud deployments 1114.
- if the normal behavior in one or more other cloud deployments 1114 indicates that all computing resources (e.g., virtual machines, containers, serverless computing resources) communicate with each other using a particular secure data communications protocol and the normal behavior for one or more components 1110 in the first cloud deployment 1108 is for computing resources to communicate using some other data communications protocol, a change may be recommended that involves reconfiguring the computing resources to communicate using the particular secure data communications protocol.
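Recommending 1106 such a change might be sketched as diffing the first deployment's normal behavior against settings on which the peer deployments agree. The behavior keys and values below are hypothetical illustrations:

```python
def recommend_changes(target_normal, peer_normals):
    """Recommend settings where the target deployment's normal behavior
    diverges from behavior common to every peer deployment."""
    recommendations = []
    for key in set().union(*peer_normals):
        peer_values = {p.get(key) for p in peer_normals}
        if len(peer_values) == 1:                 # all peers agree
            (expected,) = peer_values
            if target_normal.get(key) != expected:
                recommendations.append((key, target_normal.get(key), expected))
    return sorted(recommendations)

target = {"transport": "http", "auth": "token"}
peers = [
    {"transport": "mtls", "auth": "token"},
    {"transport": "mtls", "auth": "token"},
]
print(recommend_changes(target, peers))
```

Each recommendation carries the setting, the target's current value, and the peer consensus, so implementing it would make the first cloud deployment more similar to the other cloud deployments.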
- Fig. 12 sets forth an example method of monitoring a software development pipeline 1202 in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 12 may be carried out, for example, by one or more modules of computer program instructions executing on physical hardware, virtual hardware, or in some other execution environment (e.g., one or more AWS Lambdas, one or more containers).
- modules may be part of the systems described above or otherwise coupled to the systems described above.
- the example method depicted in Fig. 12 includes retrieving 1208, from one or more components 1204a, 1204b, 1204n in the software development pipeline 1202, information 1206 associated with a software application.
- the software application may include a software application executable in a variety of computing environments, including cloud-based systems, on-premises systems, and the like.
- an instance or version of the software application may be deployed in a customer-facing or customer-accessible environment (e.g., a production environment).
- the software application may be embodied as source code or other resources stored in a version control system or source code repository such as Git.
- the software application may or may not be deployed to a production environment, with the software application being under development in the software development pipeline 1202.
- the software application may be embodied as one or more containerized applications executable or deployable in a container orchestration environment such as Kubernetes.
- the software development pipeline 1202 includes various hardware and software resources facilitating the development, testing, and deployment of the software application.
- the software development pipeline 1202 may be implemented as a continuous integration/continuous deployment (CI/CD) pipeline.
- the software development pipeline 1202 may facilitate various stages of the development and deployment process for the software application.
- the software development pipeline 1202 includes endpoint devices for developers or other entities that may write or compile code for the software application (e.g., a “build” stage).
- the software development pipeline 1202 includes resources for testing the software application, including test suites executed on the endpoint devices as well as in a version control system or source code repository (e.g., a “test” stage).
- the software development pipeline 1202 includes resources for a version control system or source code repository.
- a code repository may store source code generated or contributed by developers via endpoint devices.
- the source code repository may maintain a code branch that is compiled and deployed into a production environment (e.g., a “release” stage). Accordingly, the developer-produced code may be pushed or merged into this branch for subsequent deployment.
- the software development pipeline 1202 includes resources for deploying the software application to a production or customer-facing environment (e.g., a “deployment” stage).
- the software development pipeline 1202 may include scripts or code for allocating computational resources (e.g., in a cloud-computing environment or other environment) that facilitate deployment of the software application.
- the software development pipeline 1202 may include resources that execute infrastructure-as-code (IaC) scripts to allocate resources for application deployment.
- the software development pipeline 1202 may include a container orchestration system such as Kubernetes to manage the resource allocation and deployment of containerized applications.
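The various components and stages described above might be modeled as follows; the component names, stage labels, and kinds are hypothetical illustrations rather than the disclosed implementation:

```python
from dataclasses import dataclass

@dataclass
class PipelineComponent:
    """One component 1204a-n of a software development pipeline."""
    name: str
    stage: str   # e.g., "build", "test", "release", "deploy"
    kind: str    # e.g., "endpoint", "repository", "orchestrator"

pipeline = [
    PipelineComponent("developer-laptop", "build", "endpoint"),
    PipelineComponent("git-repo", "test", "repository"),
    PipelineComponent("release-branch", "release", "repository"),
    PipelineComponent("k8s-cluster", "deploy", "orchestrator"),
]

def components_for_stage(components, stage):
    """Return the names of components participating in a given stage."""
    return [c.name for c in components if c.stage == stage]

print(components_for_stage(pipeline, "deploy"))  # ['k8s-cluster']
```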
- Each of these various components for the various stages of the software development pipeline 1202 may correspond to a particular component 1204a,b-n.
- the information 1206 may then describe various actions performed in or by a particular component 1204a,b-n.
- the information 1206 may describe code submissions and therefore describe a particular user or entity submitting code to a code repository.
- the information 1206 may also describe various build-time dependencies (e.g., for particular libraries, applications, and the like) for the software application as indicated in source code.
- the information 1206 may further include particular portions of source code, or an entirety of source code for the software application.
- the information 1206 may also describe the results of various test operations performed on submitted or stored code in the repository.
- the information 1206 may further describe actions related to merging submitted code into a master branch or other deployable code branch for the software application.
- the information 1206 may also include logs or other data describing events related to the compilation and deployment of the software application.
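Retrieving 1208 the information 1206 might be sketched as gathering an event stream from each component; the `PipelineEvent` fields and the `events()` interface are assumptions for illustration — a real integration would call repository, CI, or deployment APIs:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineEvent:
    component: str   # which pipeline component reported the event
    action: str      # e.g., "code_submission", "test_run", "merge", "deploy"
    user: str
    details: dict = field(default_factory=dict)

def retrieve_information(components):
    """Gather events from every pipeline component into a single stream."""
    information = []
    for component in components:
        information.extend(component.events())
    return information

class FakeRepository:
    """Stand-in for a source code repository component."""
    def events(self):
        yield PipelineEvent("git-repo", "code_submission", "alice", {"branch": "feature-x"})
        yield PipelineEvent("git-repo", "merge", "alice", {"into": "main"})

info = retrieve_information([FakeRepository()])
print([e.action for e in info])  # ['code_submission', 'merge']
```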
- the example method depicted in Fig. 12 also includes identifying 1210, based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application.
- the anomaly 1212 may include a new user or account submitting code for the application (e.g., to a branch of code for the software application).
- the anomaly 1212 may include a user lacking a certification, signature, or other credential attempting to submit code for the application.
- the anomaly 1212 may include a failed compilation or test for the application.
- the anomaly 1212 may include a newly failed test (e.g., a test that had previously passed but now fails).
- the anomaly 1212 may include a failed vulnerability scan.
- the vulnerability scan may include a scan of source code (e.g., a static code analysis).
- the vulnerability scan may include a scan of a containerized application.
- the vulnerability scan may include a scan of one or more build-time dependencies.
- the anomaly 1212 may include a temporal anomaly based on a known or predefined time series or schedule.
- the anomaly 1212 may include a scheduled build or deployment varying from a defined or inferred schedule or frequency.
- the anomaly 1212 may include a code submission or merge request occurring at a time varying from an inferred or defined schedule or frequency.
- the anomaly 1212 may include a location-based or geographic anomaly, such as a code submission, build request, deployment request, or other request originating from an aberrant location or falling out of some defined geographic boundary.
- other anomalies 1212 may also be identified 1210 and are contemplated within the scope of the present disclosure.
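A few of the checks above (a new user submitting code, a temporal anomaly) can be sketched as simple rules over the retrieved events; the event fields, user allowlist, and working-hours window are illustrative assumptions only:

```python
def identify_anomalies(events, known_users, expected_hour_range=(8, 18)):
    """Flag code submissions from unknown users and events occurring far
    outside an expected schedule. Rules and thresholds are illustrative."""
    anomalies = []
    for event in events:
        if event["action"] == "code_submission" and event["user"] not in known_users:
            anomalies.append(("new_user_submission", event))
        hour = event.get("hour")
        if hour is not None and not (expected_hour_range[0] <= hour <= expected_hour_range[1]):
            anomalies.append(("temporal_anomaly", event))
    return anomalies

events = [
    {"action": "code_submission", "user": "mallory", "hour": 3},
    {"action": "merge", "user": "alice", "hour": 10},
]
found = identify_anomalies(events, known_users={"alice", "bob"})
print([kind for kind, _ in found])  # ['new_user_submission', 'temporal_anomaly']
```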
- the example method depicted in Fig. 12 also includes performing 1214 a remedial action based on the anomaly 1212.
- the remedial action includes presenting an alert associated with the anomaly 1212.
- the alert may include a log entry identifying the anomaly 1212.
- the alert may include a notification in a dashboard or other user interface presented to a user associated with the software application (e.g., an administrator, a developer, and the like).
- the alert may include an email notification, push notification, or other visual notification as can be appreciated.
- the remedial action includes preventing one or more actions or stages of the software development pipeline 1202.
- the remedial action includes preventing a deployment of the software application.
- the remedial action includes preventing a build or compilation of the software application.
- the remedial action includes preventing a merger of a branch of code associated with the anomaly 1212 into the code of the software application.
- the remedial action includes rolling back a deployed instance of the software application.
- a previous version of the software application (e.g., created prior to identification of the anomaly 1212) stored in a code repository may be compiled and deployed in place of a currently executed instance of the software application.
- the remedial action includes deleting or isolating particular branches of source code associated with the anomaly 1212. For example, particular branches submitted by a user associated with the anomaly 1212 may be rolled back, deleted, or isolated (e.g., prevented from merger) in response to identifying the anomaly 1212. In some embodiments, particular user access may be blocked or otherwise restricted in response to identifying 1210 the anomaly 1212.
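Performing 1214 the remedial actions above might be dispatched from a simple mapping of anomaly kind to actions; the anomaly names and action lists here are hypothetical, and a real system could combine alerting with blocking, rollback, or access restriction as described:

```python
def perform_remedial_action(anomaly_kind):
    """Map an identified anomaly to one or more remedial actions.
    Unrecognized anomalies default to alerting only."""
    actions = {
        "new_user_submission": ["alert", "block_merge"],
        "failed_vulnerability_scan": ["alert", "block_deployment"],
        "runtime_anomaly": ["alert", "rollback_deployment"],
    }
    return actions.get(anomaly_kind, ["alert"])

print(perform_remedial_action("failed_vulnerability_scan"))
# ['alert', 'block_deployment']
```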
- Fig. 13 sets forth an additional example method of monitoring a software development pipeline 1202 in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 13 is similar to the example method depicted in Fig. 12, as the method depicted in Fig. 13 also includes: retrieving 1208, from one or more components 1204a, 1204b, 1204n in the software development pipeline 1202, information 1206 associated with a software application; identifying 1210, based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application; and performing 1214 one or more remedial actions based on the anomaly 1212.
- the method of Fig. 13 differs from Fig. 12 in that identifying 1210, based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application includes identifying 1302 one or more dependencies for the software application.
- the one or more dependencies include one or more run-time dependencies (e.g., dependencies on other software, services, or resources required during run-time execution of the software application).
- the one or more dependencies include one or more build-time dependencies (e.g., dependencies on other code, libraries, and the like required at build-time in order to compile the software application).
- identifying 1302 the one or more dependencies includes scanning source code for references to libraries, external code resources, and the like that are required for compiling the software application.
- the scanned source code may include an entire corpus of source code for the software application, or particular subsets of source code such as newly created or modified portions of source code.
- identifying 1210 based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application includes identifying 1304, as the anomaly 1212 associated with the software application, an anomaly associated with the one or more dependencies.
- an anomaly associated with the one or more dependencies may include a newly introduced dependency (e.g., a newly required or identified dependency).
- the anomaly associated with the one or more dependencies may include a vulnerability associated with the one or more dependencies.
- a vulnerability scan is performed on code or libraries identified as a build-time dependency.
- a build-time dependency may be compared to a repository of known vulnerable libraries in order to identify the vulnerability.
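Comparing build-time dependencies against a repository of known vulnerable libraries might look like the following sketch; the manifest syntax (`require <name> <version>`) and the vulnerable-library entries are invented for illustration:

```python
import re

# Hypothetical repository of known vulnerable (library, version) pairs.
VULNERABLE = {("liblog", "4.1"), ("imageparse", "2.0")}

def scan_dependencies(source):
    """Extract build-time dependencies from a manifest-style source listing
    and flag any that appear in the known-vulnerable repository."""
    deps = re.findall(r"require\s+(\S+)\s+(\S+)", source)
    return [dep for dep in deps if dep in VULNERABLE]

manifest = """
require liblog 4.1
require fastjson 9.9
"""
print(scan_dependencies(manifest))  # [('liblog', '4.1')]
```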
- Fig. 14 sets forth an additional example method of monitoring a software development pipeline 1202 in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 14 is similar to the example method depicted in Fig. 13, as the method depicted in Fig. 14 also includes: retrieving 1208, from one or more components 1204a, 1204b, 1204n in the software development pipeline 1202, information 1206 associated with a software application; identifying 1210, based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application, including: identifying 1302 one or more dependencies for the software application; and identifying 1304, as the anomaly 1212 associated with the software application, an anomaly associated with the one or more dependencies; and performing 1214 one or more remedial actions based on the anomaly 1212.
- the method of Fig. 14 differs from Fig. 13 in that the method of Fig. 14 also includes presenting 1402, in one or more polygraphs, the one or more dependencies.
- a polygraph is a visual representation of particular relationships between entities during a particular time period. Accordingly, the presented polygraph may change to reflect the relationships during different instances in time.
- the polygraph may identify build-time dependencies for the software application.
- a dependency may be represented in the polygraph by an icon or user interface element corresponding to a particular library, source code file, or other dependency referenced by the source code of the software application.
- the polygraph may present relationships using various degrees of granularity.
- a polygraph may indicate particular containerized applications or services with separate icons for dependencies.
- the polygraph may also indicate particular source code files or branches, with separate icons for their dependencies.
- the software application and its dependencies may be indicated in the polygraph using other degrees of granularity as can be appreciated.
- the performed 1214 remedial action may include presenting an alert within the polygraph.
- particular dependencies failing a vulnerability scan or newly introduced dependencies may be highlighted or otherwise indicated in the polygraph.
- a polygraph indicates the particular software library dependencies for the software application during a selected period in time. Assume that a new dependency is added, resulting in a new icon or indicator in the polygraph. Where this dependency fails a vulnerability test, is introduced by a new or unverified user, or is otherwise anomalous, the user interface element for this particular dependency may be highlighted or otherwise identified in order to indicate the anomaly 1212 to a user.
- a polygraph of build-time dependencies may be presented independent of identifying 1210 particular anomalies 1212. For example, a polygraph of build-time dependencies may be presented even if no particular anomalies 1212 have been identified, or if no active anomaly 1212 identification is being performed.
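The polygraph's time-windowed relationships can be sketched as an edge set recomputed per selected time window, so the presented graph changes across instances in time; the event schema here is an assumption for illustration:

```python
def build_polygraph(events, start, end):
    """Build the edge set of a polygraph: (application -> dependency)
    relationships observed between `start` and `end`. Re-running with a
    different window yields a different snapshot of the graph."""
    edges = set()
    for event in events:
        if start <= event["time"] <= end and event["kind"] == "depends_on":
            edges.add((event["app"], event["dependency"]))
    return edges

events = [
    {"time": 1, "kind": "depends_on", "app": "svc-a", "dependency": "liblog"},
    {"time": 5, "kind": "depends_on", "app": "svc-a", "dependency": "newlib"},
]
print(build_polygraph(events, 0, 2))  # {('svc-a', 'liblog')}
print(build_polygraph(events, 0, 9))  # now also includes the newly added 'newlib' edge
```

A newly appearing edge (such as `newlib` above) is the kind of user interface element that could be highlighted when the new dependency is anomalous.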
- Fig. 15 sets forth an additional example method of monitoring a software development pipeline 1202 in accordance with some embodiments of the present disclosure.
- the example method depicted in Fig. 15 is similar to the example method depicted in Fig. 14, as the method depicted in Fig. 15 also includes: retrieving 1208, from one or more components 1204a, 1204b, 1204n in the software development pipeline 1202, information 1206 associated with a software application; identifying 1210, based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application; and performing 1214 one or more remedial actions based on the anomaly 1212.
- the method of Fig. 15 differs from Fig. 14 in that identifying 1210, based on the information 1206 associated with the software application, an anomaly 1212 associated with the software application includes correlating 1502 another anomaly associated with an execution of the software application with one or more actions in the software development pipeline 1202.
- actions may include user access to a code repository, a code submission or push, a merge of a code branch, a build of the software application, and the like.
- the other anomaly associated with the execution of the software application may include any real-time monitored anomaly for an executed and deployed application as is described above. Accordingly, correlating 1502 the other anomaly with the one or more actions in the software development pipeline 1202 may include identifying 1504 the particular one or more actions that were performed prior to the other anomaly being identified or introduced in the executed application, therefore indicating which actions may have caused the introduction of these anomalies. In some embodiments, correlating 1502 the one or more actions may include identifying a sequence of related actions.
- a user associated with this newly merged branch of code may also be identified.
- a user account that may have introduced malicious code may also be identified.
- an anomaly in execution may be traced back through the software development pipeline 1202 to any action that may have introduced a vulnerability or malicious code, potentially back to the user that initially introduced the vulnerability or potentially malicious code.
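Tracing an execution anomaly back through the pipeline might be sketched as a time-window correlation over prior actions, most recent first, surfacing the merge, build, or submission (and the user behind it) that may have introduced the problem; the window size and action records are illustrative assumptions:

```python
def correlate_anomaly(anomaly_time, pipeline_actions, window=100):
    """Return pipeline actions that occurred within `window` time units
    before the anomaly, ordered most recent first."""
    related = [a for a in pipeline_actions
               if anomaly_time - window <= a["time"] <= anomaly_time]
    return sorted(related, key=lambda a: a["time"], reverse=True)

actions = [
    {"time": 10, "action": "code_submission", "user": "mallory"},
    {"time": 40, "action": "merge", "user": "mallory"},
    {"time": 60, "action": "deploy", "user": "ci-bot"},
]
trail = correlate_anomaly(anomaly_time=70, pipeline_actions=actions)
print([(a["action"], a["user"]) for a in trail])
# [('deploy', 'ci-bot'), ('merge', 'mallory'), ('code_submission', 'mallory')]
```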
- a method of monitoring a software development pipeline including: retrieving, from one or more components in the software development pipeline, information associated with a software application; identifying, based on the information associated with the software application, an anomaly associated with the software application; and performing one or more remedial actions based on the anomaly.
- identifying the anomaly associated with the software application further comprises: identifying one or more dependencies for the software application; and identifying, as the anomaly associated with the software application, an anomaly associated with the one or more dependencies.
- identifying the anomaly associated with the software application further comprises: correlating another anomaly associated with an execution of the software application with one or more actions in the software development pipeline; and identifying, as the anomaly associated with the software application, the one or more actions.
- a system of monitoring a software development pipeline configured to perform steps comprising: retrieving, from one or more components in the software development pipeline, information associated with a software application; identifying, based on the information associated with the software application, an anomaly associated with the software application; and performing one or more remedial actions based on the anomaly.
- identifying the anomaly associated with the software application further comprises: identifying one or more dependencies for the software application; and identifying, as the anomaly associated with the software application, an anomaly associated with the one or more dependencies.
- identifying the anomaly associated with the software application further comprises: correlating another anomaly associated with an execution of the software application with one or more actions in the software development pipeline; and identifying, as the anomaly associated with the software application, the one or more actions.
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Monitoring a software development pipeline, including: retrieving, from one or more components in the software development pipeline, information associated with a software application; identifying, based on the information associated with the software application, an anomaly associated with the software application; and performing one or more remedial actions based on the anomaly.
Applications Claiming Priority (8)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202163241966P | 2021-09-08 | 2021-09-08 | |
| US63/241,966 | 2021-09-08 | ||
| US202163244336P | 2021-09-15 | 2021-09-15 | |
| US63/244,336 | 2021-09-15 | ||
| US202163287506P | 2021-12-08 | 2021-12-08 | |
| US63/287,506 | 2021-12-08 | ||
| US17/838,974 US20220311794A1 (en) | 2017-11-27 | 2022-06-13 | Monitoring a software development pipeline |
| US17/838,974 | 2022-06-13 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023038957A1 true WO2023038957A1 (fr) | 2023-03-16 |
Family
ID=83508926
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2022/042737 Ceased WO2023038957A1 (fr) | 2021-09-08 | 2022-09-07 | Surveillance d'un pipeline de développement logiciel |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023038957A1 (fr) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170169229A1 (en) * | 2015-12-10 | 2017-06-15 | Sap Se | Vulnerability analysis of software components |
| US10691810B1 (en) * | 2019-09-16 | 2020-06-23 | Fmr Llc | Detecting vulnerabilities associated with a software application build |
| US20200202006A1 (en) * | 2018-12-19 | 2020-06-25 | Red Hat, Inc. | Vulnerability analyzer for application dependencies in development pipelines |
| EP3693874A1 (fr) * | 2019-01-28 | 2020-08-12 | Visa International Service Association | Gestion de la vulnérabilité continue pour des applications modernes |
- 2022-09-07 WO PCT/US2022/042737 patent/WO2023038957A1/fr not_active Ceased
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12355626B1 (en) | 2017-11-27 | 2025-07-08 | Fortinet, Inc. | Tracking infrastructure as code (IaC) asset lifecycles |
| US12158894B2 (en) | 2022-10-21 | 2024-12-03 | Elementum Ltd | Systems and methods for a data-driven workflow platform |
| US20250045417A1 (en) * | 2023-08-04 | 2025-02-06 | Huaneng Information Technology Co., Ltd. | Security analysis method and system based on protocol state |
| US12235972B1 (en) * | 2023-08-04 | 2025-02-25 | Huaneng Information Technology Co., Ltd. | Security analysis method and system based on protocol state |
| CN118228243A (zh) * | 2024-05-23 | 2024-06-21 | 成都赛力斯科技有限公司 | 车辆的访问管理方法 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12506762B1 (en) | Leveraging information gathered using static analysis for remediating detected issues in a monitored deployment | |
| US11741238B2 (en) | Dynamically generating monitoring tools for software applications | |
| US11849000B2 (en) | Using real-time monitoring to inform static analysis | |
| US11792284B1 (en) | Using data transformations for monitoring a cloud compute environment | |
| US20220311794A1 (en) | Monitoring a software development pipeline | |
| US20230075355A1 (en) | Monitoring a Cloud Environment | |
| US20220294816A1 (en) | Ingesting event data into a data warehouse | |
| US20220360600A1 (en) | Agentless Workload Assessment by a Data Platform | |
| US20230275917A1 (en) | Identifying An Attack Surface Of A Cloud Deployment | |
| US12095879B1 (en) | Identifying encountered and unencountered conditions in software applications | |
| US12526297B2 (en) | Annotating changes in software across computing environments | |
| US20220200869A1 (en) | Configuring cloud deployments based on learnings obtained by monitoring other cloud deployments | |
| US12095796B1 (en) | Instruction-level threat assessment | |
| US12452279B1 (en) | Role-based permission by a data platform | |
| US12355626B1 (en) | Tracking infrastructure as code (IaC) asset lifecycles | |
| US12309236B1 (en) | Analyzing log data from multiple sources across computing environments | |
| US12261866B1 (en) | Time series anomaly detection | |
| US12363148B1 (en) | Operational adjustment for an agent collecting data from a cloud compute environment monitored by a data platform | |
| US12284197B1 (en) | Reducing amounts of data ingested into a data warehouse | |
| US12445474B1 (en) | Attack path risk mitigation by a data platform | |
| WO2023038957A1 (fr) | Surveillance d'un pipeline de développement logiciel | |
| WO2024044053A1 (fr) | Évaluation et remédiation de scénario de risque de ressources en nuage | |
| US12407702B1 (en) | Gathering and presenting information related to common vulnerabilities and exposures | |
| WO2023215491A1 (fr) | Identification d'une surface d'attaque d'un déploiement en nuage | |
| US12401669B1 (en) | Container vulnerability management by a data platform |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22783120 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 22783120 Country of ref document: EP Kind code of ref document: A1 |