Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies at different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
As used in this specification and the claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed precisely in order. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
Fig. 1 is a schematic application scenario diagram of a system for determining suspicious traffic in encrypted traffic according to some embodiments of the present disclosure.
DPI (Deep Packet Inspection) refers to a technique in which equipment at key points of the network detects and analyzes traffic and packet content, filters and controls the detected traffic according to a predefined policy, and performs link-level functions such as fine-grained service identification, service flow direction analysis, service traffic ratio statistics, service traffic shaping, mitigation of application-layer denial-of-service attacks, virus and Trojan filtering, and P2P abuse control. For example, the decryption DPI module may perform subsequent decryption analysis on the encrypted traffic to be tested when it determines that the traffic type of the encrypted traffic to be tested is suspicious.
As shown in fig. 1, the application scenario 100 may include a network 110, a router 120, a processor 130, encrypted traffic 140, and a traffic determination result 150. Router 120 may obtain the encrypted traffic 140 to be tested from the network 110, and processor 130 may copy the encrypted traffic 140 in router 120 to collect the encrypted traffic 140 and generate the traffic determination result 150.
Network 110 may include any suitable network capable of facilitating the exchange of information and/or data for the application scenario 100. The router 120 of the application scenario 100 may exchange information and/or data with the network 110. For example, network 110 may send user-generated traffic information to router 120. In some embodiments, the network 110 may be any one or more of a wired network or a wireless network. In some embodiments, network 110 may include one or more network access points. For example, network 110 may include wired or wireless network access points. In some embodiments, the network may have a point-to-point, shared, centralized, or other topology, or a combination of such topologies.
Router 120 may be a network device that reads addresses in packets and then stores, groups, and forwards the packets. In some embodiments, router 120 may be used to connect two or more networks 110. In some embodiments, router 120 receives encrypted traffic 140 of network 110 and forwards encrypted traffic 140 stored in router 120 to processor 130. Router 120 may be local, or may be remote.
The processor 130 may include an execution device for the method for determining suspicious traffic in the encrypted traffic 140; it may process data and/or information obtained from the router 120 and execute the method for determining suspicious traffic in encrypted traffic provided in the present specification according to the related data, to generate the traffic determination result 150. For example, the processor 130 may determine traffic characteristics according to the encrypted traffic information received by the router 120, and determine whether the encrypted traffic 140 is suspicious based on the traffic characteristics, thereby generating the traffic determination result 150. In some embodiments, the processor 130 may be a single server or a group of servers. In some embodiments, processor 130 may be integrated with the suspicious traffic determination system (e.g., integrated within router 120). The processor 130 may be local or remote. Processor 130 may be implemented on a cloud platform.
Traffic may be traffic generated by a user while surfing the internet. In some embodiments, the traffic may be encrypted traffic or unencrypted traffic. Traffic is encrypted to resist eavesdropping and man-in-the-middle attacks, so that web pages are largely protected from tampering and the user's browsing security is ensured. However, some malicious traffic may still be hidden within the encrypted traffic 140. In some embodiments, the determination of malicious traffic is performed by processor 130. For example, after encrypted traffic 140 containing malicious traffic is transmitted by network 110 to the router, it is determined by processor 130 to be suspicious traffic.
The traffic determination result 150 may include whether the encrypted traffic 140 is suspicious traffic or normal traffic. In some embodiments, the traffic determination 150 is performed by the processor 130.
It should be noted that the application scenario 100 is provided for illustrative purposes only and is not intended to limit the scope of the present application. Many modifications and variations will be apparent to those of ordinary skill in the art in light of the present description. For example, the application scenario 100 may also include an information source. However, such changes and modifications do not depart from the scope of the present application.
Fig. 2 is a block diagram of a system for determining suspicious traffic among encrypted traffic according to some embodiments of the present description.
As shown in fig. 2, in some embodiments, the suspicious traffic determination system 200 may include a traffic characteristics acquisition module 210 and a traffic type determining module 220.
The traffic characteristics acquisition module 210 may be configured to collect the encrypted traffic to be tested and extract encrypted traffic characteristics of the encrypted traffic to be tested. In some embodiments, the encrypted traffic characteristics may include a first traffic characteristic, which may include access characteristic information, protocol characteristic information, and transfer characteristic information. Details regarding traffic characteristic acquisition can be found in step 310 and its associated description.
The traffic type determining module 220 may be configured to determine the traffic type of the encrypted traffic to be tested based on the encrypted traffic characteristics of the encrypted traffic to be tested. In some embodiments, traffic types may include normal traffic and suspicious traffic. Suspicious traffic is subjected to subsequent decryption analysis. For details regarding the determination of the traffic type, reference may be made to step 320 and its description.
In some embodiments, the traffic type determining module 220 may be further configured to process the encrypted traffic characteristics of the encrypted traffic to be tested based on a suspicious traffic identification model, which is a machine learning model, to determine the traffic type of the encrypted traffic to be tested. Details of the suspicious traffic identification model may be found in fig. 5 and its associated description.
In some embodiments, the suspicious traffic determination system 200 may further include a decryption DPI module 230 configured to perform subsequent decryption analysis on the encrypted traffic to be tested in response to the traffic type of the encrypted traffic to be tested being suspicious.
It should be noted that the above description of the suspicious traffic determination system 200 and its modules is for convenience only and is not intended to limit the present description to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily, or a subsystem may be constructed in connection with other modules, without departing from such principles. In some embodiments, the traffic characteristics acquisition module 210 and the traffic type determining module 220 disclosed in fig. 2 may be different modules in a system, or a single module may implement the functions of two or more modules. For example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present description.
Fig. 3 is an exemplary flow chart of a method of determining suspicious traffic among encrypted traffic according to some embodiments of the present description. In some embodiments, the process 300 may be performed by the processor 130. As shown in fig. 3, the process 300 includes the following steps.
Step 310, collecting the encrypted traffic to be tested, and extracting the encrypted traffic characteristics of the encrypted traffic to be tested.
Encrypted traffic refers to the portion of network traffic that has been encrypted. The traffic may be encrypted by the user or by the service provider to protect privacy. For example, for users who need to handle business online over internet communication, encryption mechanisms can be relied on in mobile applications, cloud applications, and Web applications, and keys, certificates, and the like are used to ensure security and establish trust through the data encryption process.
The basic process of data encryption is to process a file or data (traffic) originally in plaintext with an algorithm so that it becomes an unreadable code, commonly called "ciphertext." Through data encryption, data is protected from illegal theft and reading. In some embodiments, encrypted traffic may include encrypted normal traffic and encrypted suspicious traffic, where encrypted suspicious traffic tends to disguise or hide malicious traffic features. For example, encrypted suspicious traffic often disguises or hides Trojan horses, infectious viruses, worm viruses, malicious downloaders, etc. that attack servers, causing problems such as server crashes.
The encrypted traffic characteristics are traffic characteristics associated with the encrypted traffic to be tested. Traffic characteristics may include statistics such as quintuple information, encryption protocol information, the average size of packets, the average transmission interval of packets, etc. The quintuple information includes a source IP address, a source port, a destination IP address, a destination port, and a transport layer protocol. The encryption protocol information refers to the messages related to the protocol by which secure communication is established between the server and the client during the authentication process. The authentication process comprises the client sending a message to the server, the server responding to the client with its own authentication message, and the client and the server completing the key exchange, after which the authentication process ends. In some embodiments, the encryption protocol information may include TLS/SSL protocol versions, extension fields, and the like. The average size of a packet refers to the average length of data in a number of packets, expressed in bytes; e.g., ten IP packets have an average length of 1000 bytes. The average transmission interval of packets refers to the average time interval between the transmission of the current data frame and the transmission of the next data frame during data transmission; for example, a data frame is transmitted on average every 2 seconds.
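The statistical features above can be computed directly from per-packet records. The following minimal sketch assumes a hypothetical list of (timestamp, length) records rather than any specific capture format, and shows how the average packet size and average transmission interval might be derived:

```python
from typing import List, Tuple


def packet_statistics(packets: List[Tuple[float, int]]) -> Tuple[float, float]:
    """Compute the average packet size (bytes) and the average transmission
    interval (seconds) from (timestamp, length) records; the record layout
    is a hypothetical example, not a format defined by this specification."""
    sizes = [length for _, length in packets]
    avg_size = sum(sizes) / len(sizes)

    # Average gap between consecutive packets, ordered by timestamp.
    times = sorted(timestamp for timestamp, _ in packets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    avg_interval = sum(gaps) / len(gaps) if gaps else 0.0
    return avg_size, avg_interval


# Example from the text: ten packets of 1000 bytes sent every 2 seconds.
print(packet_statistics([(2.0 * i, 1000) for i in range(10)]))  # (1000.0, 2.0)
```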
In some embodiments, the encrypted traffic characteristics may include a first traffic characteristic, which is a characteristic related to content contained in the encrypted traffic to be tested. In some embodiments, the first traffic characteristic may include access characteristic information, protocol characteristic information, and transfer characteristic information.
Access characteristic information refers to characteristic information related to access. Access may refer to a process in which a visitor actively retrieves content from a network platform for a specific purpose; traffic may be generated during access. For example, the encrypted traffic may be traffic generated by the visitor clicking on a website URL collected in a bookmark, or traffic generated by the visitor directly entering the website address in the browser address bar. The access characteristic information may be used to distinguish between different sessions, e.g., communications between different users. In some embodiments, the access characteristic information may include information such as the source IP address, source port, destination IP address, and destination port.
It may be determined whether the encrypted traffic is suspicious traffic based on the access characteristic information. For example, if the access volume of the IP address corresponding to a certain browser in one day is 500% higher than before, it is necessary to check whether the increase in access data is caused by suspicious traffic.
Protocol feature information refers to feature information related to a protocol. The protocol characteristic information may be used to distinguish the manner in which network traffic is transmitted, e.g., whether it is encrypted or not, etc. In some embodiments, the protocol feature information may include transmission protocol information, encryption protocol information, and the like.
The probability that the encrypted traffic is suspicious traffic may be determined based on the protocol characteristic information. For example, statistics show that historical malicious traffic tends to use covert encrypted transport protocols; by identifying the encryption protocol employed by the encrypted traffic, such as the Secure Sockets Layer (SSL) protocol, the traffic can be determined to be more likely to carry malicious traffic.
The transfer characteristic information refers to characteristic information related to information transfer. In some embodiments, the transfer characteristic information may include a packet average size, a packet average transmission interval, and the like.
It may be determined whether the encrypted traffic is suspicious traffic based on the transfer characteristic information. For example, a network platform normally sees access between 8:00 and 18:00, with an average packet size of 512 bytes and an average transmission interval of 20 ms; if on a certain day a large number of dense accesses occur between 02:00 and 04:00, with an average packet size of 1500 bytes and an average transmission interval abnormally reduced to 6 ms, the abnormal traffic can be determined to be suspicious traffic.
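Such a check can be expressed as a simple rule against baseline values. The sketch below is illustrative only; the function name and the deviation thresholds (double the baseline size, half the baseline interval) are assumptions, not values prescribed by this specification.

```python
def transfer_looks_suspicious(avg_size_bytes: float,
                              avg_interval_ms: float,
                              hour_of_day: int,
                              baseline_size_bytes: float = 512.0,
                              baseline_interval_ms: float = 20.0,
                              business_hours: range = range(8, 18)) -> bool:
    """Flag traffic whose transfer characteristics deviate strongly from the
    baseline; the thresholds here are illustrative assumptions."""
    off_hours = hour_of_day not in business_hours
    oversized = avg_size_bytes > 2 * baseline_size_bytes
    too_dense = avg_interval_ms < baseline_interval_ms / 2
    return off_hours and (oversized or too_dense)


# The example above: dense access at about 03:00, 1500-byte packets, 6 ms interval.
print(transfer_looks_suspicious(avg_size_bytes=1500, avg_interval_ms=6, hour_of_day=3))  # True
```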
In some embodiments, the first traffic feature may further comprise a byte distribution probability vector.
In the field of computer security, data is transmitted from a sender to a receiver in the form of packets, which contain a header; the data sent by the sender is called the payload, i.e., the receiver subtracts the IP header length from the total length of the IP packet to determine the size of the packet's payload. The header is appended to the payload for transmission and is discarded upon successful arrival at the destination. The main vehicle for viruses carried by malicious traffic is the payload. The payload may include data that causes corruption, messages with insulting text, or bulk emails sent to a large number of people. Byte distribution refers to the count of each byte value in the packet payload. For example, the byte distribution of a packet may be such that the first byte value "00000001" occurs 10 times, the second byte value "00000011" occurs 15 times, and the nth byte value "11111111" occurs 5 times in the payload of the packet. The byte distribution probability refers to the probability of occurrence of each byte value in the payload of the packet. In some embodiments, the occurrence probability of each byte value may be approximated by the occurrence frequency of that byte value. The byte distribution probability vector refers to a vector formed by the probabilities with which each of the 256 values a byte may take appears in the data stream. Byte distribution probability vectors can reveal a large amount of data encoding and data padding information, in which many illegal acts of malicious traffic are often hidden. In some embodiments, the byte distribution count for each byte value may be divided by the total number of bytes in the payload to obtain a byte distribution frequency, which is used to represent the byte distribution probability; this feature is ultimately represented as a 1 × 256-dimensional byte distribution probability vector.

For example, malicious traffic may utilize certain fields of the HTTP header (e.g., content-type, server, etc.) to initiate malicious activity, which illustrates that HTTP fields can indicate malicious behavior well. HTTP context flows refer to all HTTP flows sent by the same source IP address within a time window of the secure transport protocol TLS (Transport Layer Security). All observed HTTP header information is represented by a feature vector of binary variables; if any HTTP flow has a particular header value (i.e., contains a header associated with malicious traffic), the corresponding variable is 1 regardless of the other HTTP flows. For example, for the byte distribution probability vector P1, the processor 130 may count 100 flows with vector P1 in the network traffic in a preset period, of which 60 have an HTTP header feature of 1; that is, 60% of the flows whose byte distribution probability vector is P1 are malicious traffic.
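A minimal sketch of how the 1 × 256-dimensional byte distribution probability vector described above could be computed from a packet payload; the example payload below is made up for illustration.

```python
from typing import List


def byte_distribution_probability(payload: bytes) -> List[float]:
    """Return a 256-element vector whose i-th entry is the frequency of byte
    value i in the payload, used as an estimate of its occurrence probability."""
    counts = [0] * 256
    for value in payload:
        counts[value] += 1
    total = len(payload)
    return [count / total for count in counts] if total else [0.0] * 256


# Made-up payload: byte 0x01 occurs 10 times, 0x03 occurs 15 times, 0xFF occurs 5 times.
vector = byte_distribution_probability(bytes([0x01] * 10 + [0x03] * 15 + [0xFF] * 5))
print(round(vector[0x01], 3), round(vector[0x03], 3), round(vector[0xFF], 3))  # 0.333 0.5 0.167
```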
The encrypted traffic to be tested can be collected by different traffic collection methods, including but not limited to Sniffer (sniffing), SNMP (Simple Network Management Protocol), NetFlow, sFlow, etc.
In some embodiments, a sniffer may be used to collect the encrypted traffic. By way of example only, a data collection point may be set at a mirror port of the switch, through which data information in the network is completely replicated, to collect the encrypted traffic information to be tested.
After the encrypted traffic to be tested is acquired, the encrypted traffic characteristics of the encrypted traffic to be tested are extracted. The encrypted traffic features may be extracted in a variety of ways, such as an encrypted traffic basic information extraction library (e.g., flowcontainer), an encrypted traffic feature extraction tool (e.g., Wireshark, QPA, Tstat, etc.), or other encrypted traffic extraction algorithms, machine learning models, etc.
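By way of example only, the following sketch shows how per-flow features of the kind described in step 310 might be pulled from a captured trace using the open-source Scapy library; the trace file name is hypothetical, and any of the tools listed above could be used instead.

```python
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP  # pip install scapy


def extract_flow_features(pcap_path: str) -> dict:
    """Group packets by five-tuple and compute per-flow average packet size
    and average transmission interval (a simplified illustration)."""
    flows = defaultdict(list)
    for pkt in rdpcap(pcap_path):
        if IP in pkt and TCP in pkt:
            five_tuple = (pkt[IP].src, pkt[TCP].sport,
                          pkt[IP].dst, pkt[TCP].dport, "TCP")
            flows[five_tuple].append((float(pkt.time), len(pkt)))

    features = {}
    for five_tuple, records in flows.items():
        times = sorted(t for t, _ in records)
        gaps = [b - a for a, b in zip(times, times[1:])]
        features[five_tuple] = {
            "avg_packet_size": sum(size for _, size in records) / len(records),
            "avg_transmission_interval": sum(gaps) / len(gaps) if gaps else 0.0,
        }
    return features


# features = extract_flow_features("mirror_port_capture.pcap")  # hypothetical trace file
```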
In some embodiments, the encrypted traffic characteristics may also include a second traffic characteristic. The second traffic characteristic is a feature that cannot be obtained from the content of the encrypted traffic to be tested alone; it may be determined from other external content related to it (such as querying the domain name on the internet or in a knowledge graph) after the target domain name is extracted from the encrypted traffic to be tested. In some embodiments, the second traffic characteristic may include domain name hotness.
Domain name hotness refers to the degree to which malicious traffic tends to visit a domain name. In some embodiments, the domain name hotness may include the number (or frequency, probability, etc.) of times malicious traffic accesses the domain name. The more frequently a domain name is accessed by malicious traffic, the higher its hotness. In some embodiments, if the network traffic involves a domain name with high hotness, the traffic type determining module 220 may determine that the network traffic is more likely to be suspicious.
In some embodiments, the traffic characteristics acquisition module 210 may acquire the second traffic characteristic of the encrypted traffic to be tested. In some embodiments, the second traffic characteristic may be represented as a score value. For example, when the second traffic characteristic is domain name hotness, the higher the score value, the higher the domain name hotness, and vice versa. The score value can be obtained from the historical access records of the domain name, user reports on the domain name, and the like. In some embodiments, the second traffic characteristic may be used to determine whether the domain name is prone to malicious traffic and is further used for traffic type determination.
In some embodiments, the second traffic characteristic may be obtained by means of a relationship graph.
FIG. 4 is an exemplary schematic diagram illustrating the acquisition of a second flow characteristic via a relationship map according to some embodiments of the present description.
The relationship graph may include domain name nodes, entity nodes, edges connecting entity nodes, and edges connecting entity nodes with domain name nodes; the edge attributes may include communication-related data and traffic types. For example, a domain name node may be jming.com, and the corresponding entity node may be the IP address 207.46.197.101. One IP address may correspond to multiple domain names, but one domain name has only one IP address. When a user types in a domain name, the domain name first arrives at the Domain Name System (DNS) server, which resolves the domain name into the IP address of the corresponding website; the process of completing this task is called domain name resolution. Access from the client host to the server is accomplished through the domain name node and the entity node together.
In some embodiments, the traffic characteristics acquisition module 210 may construct a relationship graph based on the encrypted traffic information to be tested, and obtain the second traffic characteristic according to a malicious adjacency value determined from the relationship graph. The malicious adjacency value represents the number of edges of a certain node (e.g., node A) that satisfy a preset condition. The preset condition may include that the direction of the edge points to node A, that is, node A is the end point, and that the traffic type of the edge is malicious traffic.
As shown in fig. 4, relationship graph 410 may include domain name nodes 420 (e.g., node a, node B, node C), entity nodes 430 (e.g., node 1, node 2, node 3), and edges 440 connecting the nodes. Wherein the edges are directed edges.
In some embodiments, the traffic characteristics acquisition module 210 may construct the edges of the relationship graph from communications between the various nodes. The communication represented by an edge is an abstract communication. An abstract communication may include multiple information interactions within a short period of time. For example, if nodes A and B are connected by a directed edge, the edge represents multiple information interactions between node A and node B in a short period of time, which may be regarded as one abstract communication. The direction of the edge may be determined by the initiator of the first information interaction. For example, in the multiple information interactions described above, if the first information interaction is initiated from A to B, then the direction of the multiple information interactions may accordingly be determined as A to B. In some embodiments, node A and node B may have multiple directed edges when there are multiple communications between them (e.g., communications occurring at different times over a longer time span). In some embodiments, processor 130 may count the number of edges between nodes whose traffic type is "malicious traffic" based on the attributes of the edges. A malicious adjacency value of 0 indicates that no edge with the traffic type of malicious traffic exists between two nodes, a malicious adjacency value of 1 indicates that one such edge exists, a malicious adjacency value of 2 indicates that two such edges exist, and so on. In some embodiments, the second traffic characteristic may be determined from the malicious adjacency value. For more on determining the second traffic characteristic based on the malicious adjacency value, see fig. 4 and its related description.
The schematic flow 400 is an example of determining the second traffic characteristic from the relationship graph. Illustratively, the second traffic characteristic 460 in the schematic flow 400 is domain name hotness. Specifically, the traffic characteristics acquisition module 210 may find the node corresponding to the traffic characteristic information (e.g., an IP address) based on the encrypted traffic information to be tested; the processor 130 may obtain all edges 440 connected to that node based on the relationship graph 410, count the number of edges of the node whose edge attribute records the traffic type as malicious traffic, determine the malicious adjacency value 450 based on that number, and determine the domain name hotness based on the malicious adjacency value. As shown in fig. 4, the malicious adjacency values of node 1 and node B are 1, the malicious adjacency value of node 3 is 2, and the malicious adjacency value of node 2 is 3, so the domain name corresponding to node 2 has the highest hotness.
In some embodiments, the traffic characteristics acquisition module 210 may determine the domain name hotness according to the malicious adjacency value 450 and a preset adjacency rule. The preset adjacency rule may be that the malicious adjacency values corresponding to all nodes are ranked by size, and the higher the ranking, the higher the domain name hotness. The preset adjacency rule can be set according to actual requirements. For example, the domain names corresponding to the top three nodes, which have the highest domain name hotness, are output; for a domain name with high hotness, the traffic corresponding to the domain name can directly enter the subsequent decryption DPI analysis without traffic type classification.
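One way to realize the counting and ranking described above is with a directed multigraph. The sketch below uses the open-source networkx library; the node names and edges are made up to reproduce the illustrative counts of fig. 4 (node 2: 3, node 3: 2, node 1 and node B: 1) and are not a required implementation.

```python
import networkx as nx  # pip install networkx

# Relationship graph: domain name / entity nodes connected by directed edges
# whose attributes record the traffic type of each abstract communication.
graph = nx.MultiDiGraph()
graph.add_edge("nodeA", "node2", traffic_type="malicious")
graph.add_edge("nodeB", "node2", traffic_type="malicious")
graph.add_edge("nodeC", "node2", traffic_type="malicious")
graph.add_edge("node1", "node3", traffic_type="malicious")
graph.add_edge("nodeA", "node3", traffic_type="malicious")
graph.add_edge("node1", "nodeB", traffic_type="malicious")
graph.add_edge("nodeC", "node1", traffic_type="malicious")


def malicious_adjacency(g: nx.MultiDiGraph, node: str) -> int:
    """Count the edges that end at `node` and whose traffic type is malicious."""
    return sum(1 for _, _, data in g.in_edges(node, data=True)
               if data.get("traffic_type") == "malicious")


# Preset adjacency rule: rank by malicious adjacency value; the top-ranked
# nodes correspond to the domain names with the highest hotness.
ranking = sorted(graph.nodes, key=lambda n: malicious_adjacency(graph, n), reverse=True)
print([(n, malicious_adjacency(graph, n)) for n in ranking[:3]])
# [('node2', 3), ('node3', 2), ('nodeB', 1)]  (nodeB and node1 tie at 1)
```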
In some embodiments, the edge features of the relationship graph may also include the number of times the communication data between two nodes has been reported by users. The user end is located at a node attacked by malicious traffic. When the communication data between two nodes is reported by a user, the traffic type corresponding to that communication is recorded as malicious traffic. Processor 130 may count the number of edges between nodes whose traffic type is "malicious traffic," and further determine the second traffic characteristic based on that number of edges. For example only, when the second traffic characteristic is domain name hotness, a greater number of edges with the traffic type "malicious traffic" indicates a higher domain name hotness for the domain name corresponding to the node.
In the embodiments of this specification, determining the second traffic characteristic through the relationship graph makes it possible to effectively integrate the traffic types of network traffic and to construct the associations between domain names, between entity IPs, and between domain names and entity IPs; that is, edges between nodes are constructed according to the flow direction of traffic. This more effectively supports the mining and extraction of the second traffic characteristic, and determining the second traffic characteristic can improve the accuracy of judging whether encrypted traffic is malicious.
Step 320, determining the traffic type of the encrypted traffic to be tested based on the encrypted traffic characteristics of the encrypted traffic to be tested.
In some embodiments, traffic types may include normal traffic and suspicious traffic. The traffic type of the encrypted traffic to be tested may be determined in a number of ways. In some embodiments, the traffic type may be determined based on historical data, preset rules, a suspicious traffic identification model, or other means. In some embodiments, determining the traffic type based on historical data includes the traffic type determining module 220 obtaining historical suspicious traffic and comparing its traffic characteristics with those of the encrypted traffic to be tested; when the similarity is greater than a certain threshold (e.g., greater than 0.8), the traffic type of the encrypted traffic to be tested is determined to be suspicious. In some embodiments, determining the traffic type based on a preset rule includes determining that the traffic type of the encrypted traffic to be tested is suspicious when the number of suspicious traffic features of the encrypted traffic to be tested is greater than a certain value (e.g., greater than 1). In some embodiments, the suspicious traffic identification model may be a machine learning model. Details of the suspicious traffic identification model may be found in fig. 5 and its associated description.
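A minimal sketch of the two non-model approaches just described, determination by similarity to historical suspicious traffic and determination by counting suspicious features; the cosine similarity measure and the helper names are assumptions, while the 0.8 and 1 thresholds follow the illustrative values in the text.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def type_from_history(features: List[float],
                      historical_suspicious: List[List[float]],
                      threshold: float = 0.8) -> str:
    """Suspicious if the feature vector is similar enough to any historical
    suspicious traffic feature vector (illustrative similarity measure)."""
    if any(cosine_similarity(features, hist) > threshold for hist in historical_suspicious):
        return "suspicious"
    return "normal"


def type_from_rule(num_suspicious_features: int, limit: int = 1) -> str:
    """Suspicious when more than `limit` individual features look suspicious."""
    return "suspicious" if num_suspicious_features > limit else "normal"


print(type_from_history([0.9, 0.1, 0.0], [[1.0, 0.0, 0.0]]))  # suspicious
print(type_from_rule(2))                                      # suspicious
```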
Step 330, in response to the traffic type of the encrypted traffic to be tested being suspicious, performing subsequent decryption analysis on the encrypted traffic to be tested.
In some embodiments, if the traffic type of the encrypted traffic to be tested is normal traffic, no subsequent decryption analysis is required.
The decryption analysis may include confirmation of the protocol type, segmentation of protocol domains, SSL offload, payload analysis, and identification of the negotiation protocol. By performing decryption analysis on suspicious traffic, it is further determined whether the suspicious traffic is malicious traffic. Alternatively, the traffic characteristics corresponding to the suspicious traffic may be marked by the decryption DPI module 230 and stored in the traffic type determining module 220, so as to identify suspicious traffic in the encrypted traffic to be tested; this also makes it easier to obtain more training samples for model training, so that the judgment made in decryption analysis becomes more accurate.
According to the embodiments of this specification, the normal traffic and the suspicious traffic in the encrypted traffic are screened, and only the suspicious traffic is subjected to subsequent decryption analysis, which reduces the load of subsequent analysis work and improves analysis efficiency.
FIG. 5 is a schematic diagram of a suspicious traffic identification model according to some embodiments of the present description.
In some embodiments, determining the traffic type of the encrypted traffic to be tested based on the encrypted traffic characteristics of the encrypted traffic to be tested includes processing the encrypted traffic characteristics of the encrypted traffic to be tested based on a suspicious traffic identification model to determine the traffic type of the encrypted traffic to be tested, where the suspicious traffic identification model is a machine learning model.
As shown in fig. 5, the initial suspicious traffic identification model 550 may be trained on a number of labeled training samples 540 to obtain the trained suspicious traffic identification model 520. Specifically, the labeled training samples 540 are input into the initial suspicious traffic identification model 550, which is trained based on the labels. In some embodiments, the training samples 540 may be normal traffic and suspicious traffic.
In some embodiments, the label of a training sample may indicate whether the training sample is suspicious traffic. For example, if the training sample is suspicious traffic, its label is 1; otherwise, its label is 0.
In some embodiments, the initial suspicious traffic identification model 550 may be a classifier obtained by training suspicious traffic as positive samples and normal traffic as negative samples. In some embodiments, the classifier may be one of a logistic regression model, a support vector machine, a random forest, or other classification model.
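A minimal training sketch, assuming scikit-learn and randomly generated placeholder feature vectors in place of real labeled traffic; it illustrates the positive/negative sample setup and the accuracy check against a preset threshold discussed below, not the specific model of this specification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: each row stands in for an encrypted-traffic feature vector
# (first/second traffic characteristics); label 1 = suspicious, 0 = normal.
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic labeling rule for the demo

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Suspicious traffic as positive samples, normal traffic as negative samples.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accept the model only when accuracy reaches a preset threshold (e.g., 90%).
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy = {accuracy:.2f}, meets threshold = {accuracy >= 0.90}")
```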
In some embodiments, the suspicious traffic identification model 520 may be used to determine the traffic class to which the input traffic characteristics correspond. In some embodiments, the input to the suspicious traffic identification model 520 may include the first traffic characteristic 510-1 and/or the second traffic characteristic 510-2, and the output of the suspicious traffic identification model 520 may include one of suspicious traffic 530-1 and normal traffic 530-2.
In some embodiments, training is ended when the trained suspicious traffic identification model satisfies a preset condition. The preset condition may be that the accuracy rate is greater than or equal to a preset threshold. The preset threshold value can be specifically set according to actual requirements, for example, 90% or 95% and the like.
In some embodiments, the accuracy of the trained suspicious traffic identification model may be determined using a plurality of test samples that carry labels indicating whether they are suspicious traffic. After the test samples are input into the trained suspicious traffic identification model, the corresponding predicted categories are output; when a predicted category is consistent with the label, the prediction is correct, otherwise it is incorrect. The accuracy may be the number of correctly predicted samples divided by the total number of test samples.
According to the embodiments of this specification, using a machine learning model to identify the traffic type allows the intrinsic characteristics of malicious traffic to be learned from a large amount of historical traffic data, so that whether the encrypted traffic to be tested is suspicious can be judged more accurately.
In some embodiments, the output of the suspicious traffic identification model may also include a classification vector 530-3, the classification vector 530-3 including the confidence that the encrypted traffic to be tested belongs to suspicious traffic of different classes.
In some embodiments, before the suspicious traffic identification model is used to output classification vectors, a large number of multi-class training samples should be used to train the initial suspicious traffic identification model so that it acquires multi-class classification capability. In some embodiments, the training samples may be normal traffic and different classes of malicious traffic; e.g., malicious traffic may belong to "privacy disclosure suspicious traffic," "malicious attack suspicious traffic," and so on. In some embodiments, the label of a training sample may be the category of the training sample. For example, malicious traffic labeled A belongs to the category "privacy disclosure suspicious traffic," and malicious traffic labeled B belongs to the category "malicious attack suspicious traffic." In some embodiments, the classification vector output by the suspicious traffic identification model may represent the confidence that the suspicious traffic belongs to different malicious behaviors. In some embodiments, the classification vector output by the suspicious traffic model includes a plurality of values between 0 and 1, each of which represents the confidence that the sample belongs to the corresponding class. As an example, the suspicious traffic identification model may output a vector [0.2, 0.8, 0.1], where 0.2 indicates that the sample belongs to class A with a confidence of 0.2, 0.8 indicates that the sample belongs to class B with a confidence of 0.8, and 0.1 indicates that the sample belongs to class C with a confidence of 0.1; it may then be determined that the sample belongs to class B.
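A small worked example of reading such a classification vector, using the confidences from the text; the bracketed category names mirror the illustrative classes above, and class C is assumed to be a further category for the sake of the example.

```python
classes = ["A (privacy disclosure suspicious traffic)",
           "B (malicious attack suspicious traffic)",
           "C (assumed further suspicious traffic category)"]
classification_vector = [0.2, 0.8, 0.1]  # per-class confidences output by the model

# The sample is assigned to the class with the highest confidence.
best_index = max(range(len(classes)), key=lambda i: classification_vector[i])
print(classes[best_index])  # B (malicious attack suspicious traffic)
```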
In some embodiments, the input of the suspicious traffic identification model may also include the reference malicious value 510-3 of the byte distribution probability vector. The manner in which the byte distribution probability vector is determined may be found in step 310 and its associated description.
The reference malicious value refers to the likelihood that traffic with a given byte distribution probability vector is suspicious traffic.
In some embodiments, the reference malicious value of the byte distribution probability vector may be determined based on historical data or the like.
In some embodiments, determining the reference malicious value based on historical data includes obtaining the byte distribution probability vectors of historical suspicious traffic, comparing them with the byte distribution probability vector corresponding to the encrypted traffic to be tested, and, when the similarity is greater than a certain threshold (e.g., greater than 0.8), taking the malicious value of the matching historical byte distribution probability vector as the reference malicious value of the current byte distribution probability vector.
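A sketch of the history-based lookup just described; the pair layout of the history records and the use of cosine similarity are assumptions, while the 0.8 threshold follows the text.

```python
import numpy as np
from typing import List, Optional, Sequence, Tuple


def reference_malicious_value(current_vector: Sequence[float],
                              history: List[Tuple[Sequence[float], float]],
                              threshold: float = 0.8) -> Optional[float]:
    """history holds (byte distribution probability vector, malicious value)
    pairs recorded for historical suspicious traffic (a hypothetical layout).
    Returns the malicious value of the first sufficiently similar vector."""
    current = np.asarray(current_vector, dtype=float)
    for historical_vector, malicious_value in history:
        hist = np.asarray(historical_vector, dtype=float)
        similarity = float(current @ hist /
                           (np.linalg.norm(current) * np.linalg.norm(hist)))
        if similarity > threshold:
            return malicious_value
    return None  # no sufficiently similar historical vector was found


history = [([0.5, 0.3, 0.2], 0.9)]
print(reference_malicious_value([0.5, 0.31, 0.19], history))  # 0.9
```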
In some embodiments, the edge attributes of the relationship graph further comprise a byte distribution probability vector.
In some embodiments, the reference malicious value may be obtained based on the relationship graph: based on the edges in the relationship graph that meet a preset condition, the frequency with which the traffic type in the edge attributes of those edges is malicious traffic is counted, and the reference malicious value is determined based on that frequency.
In some embodiments, the preset condition is that the similarity between the byte distribution probability vector in the edge attribute and the byte distribution probability vector of the encrypted traffic to be tested falls within a preset range. The preset range may be a system default value, an empirical value, a manually preset value, or the like. For example, for the byte distribution probability vector P2, the processor 130 may count 100 flows with vector P2 in the network traffic in a preset period, of which 40 are normal traffic and 60 are malicious traffic; that is, 60% of the flows whose byte distribution probability vector is P2 are malicious traffic.
The reference malicious value may be further calculated based on the aforementioned 60%; the greater the frequency, the greater the reference malicious value. In some embodiments, the byte distribution probability vector of the traffic currently under test is P2, and all vectors whose similarity to P2 falls within the preset range are found in the relationship graph. For example, the vectors similar to P2 are P3, P4, and P5, where the edges corresponding to P3 and P4 are malicious traffic and the edge corresponding to P5 is normal traffic. The frequency of malicious traffic is then 2/3, and the reference malicious value of P2 is calculated based on this frequency of 2/3. The reference malicious value may be determined from the frequency according to a rule table. For example, if 60% of the traffic with a given byte distribution probability vector is malicious, the corresponding malicious value in the rule table is 80%; if 80% of the traffic is malicious, the corresponding malicious value in the rule table is 90%.
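A sketch of the graph-based calculation above: among the edges whose byte distribution probability vector falls within the preset similarity range of the vector under test, count the fraction labeled malicious and map it to a reference malicious value through an assumed rule table (the 60% → 80% and 80% → 90% entries follow the example in the text; the remaining entry is an assumption).

```python
from typing import List


def reference_malicious_value_from_edges(similar_edge_types: List[str]) -> float:
    """similar_edge_types holds the traffic types ('malicious' / 'normal') of the
    edges whose byte distribution probability vector is within the preset
    similarity range; edge selection on the relationship graph is assumed to
    have been done elsewhere."""
    malicious = sum(1 for t in similar_edge_types if t == "malicious")
    frequency = malicious / len(similar_edge_types)

    # Assumed rule table: malicious-traffic frequency -> reference malicious value.
    rule_table = [(0.8, 0.90), (0.6, 0.80), (0.0, 0.50)]
    for min_frequency, value in rule_table:
        if frequency >= min_frequency:
            return value
    return 0.0


# P3 and P4 correspond to malicious edges, P5 to a normal edge: frequency 2/3 -> 0.80.
print(reference_malicious_value_from_edges(["malicious", "malicious", "normal"]))
```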
In some embodiments of the present disclosure, the preset relationship graph may be updated according to the association between the information of the encrypted traffic to be tested and the reference malicious value. The method includes comparing the frequency with which the byte distribution probability vector corresponding to the encrypted traffic to be tested represents malicious traffic against the frequency corresponding to the reference malicious value. If the frequency for the encrypted traffic to be tested is greater than the frequency corresponding to the current reference malicious value, a child node is added for the encrypted traffic to be tested and associated with the byte distribution probability vector representing malicious traffic, so as to update the preset relationship graph. If the frequency for the encrypted traffic to be tested is less than the frequency corresponding to the current reference malicious value, a child node is added for the encrypted traffic to be tested and associated with the byte distribution probability vector representing normal traffic, so as to update the preset relationship graph.
According to the embodiments of this specification, obtaining the reference malicious value through the relationship graph makes it possible to derive a more accurate reference malicious value from a large number of statistically collected byte distribution probability vectors; at the same time, the relationship graph is updated in real time, so the reference malicious value can be obtained more accurately and efficiently in real time.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly stated herein, various modifications, improvements, and adaptations to the present disclosure may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested by this specification and are therefore intended to be within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that, in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure does not imply that the subject matter of the present description requires more features than are set forth in the claims. Indeed, the claimed subject matter may lie in less than all features of a single embodiment disclosed above.
In some embodiments, numbers describing the quantities of components and attributes are used; it should be understood that such numbers used in the description of embodiments are modified in some examples by the modifiers "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows for a 20% variation. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought by individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ a general method of retaining digits. Although the numerical ranges and parameters used to confirm the breadth of ranges in some embodiments of this specification are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., referred to in this specification is incorporated herein by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is noted that, if the description, definition, and/or use of a term in material attached to this specification does not conform to or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.