Network space vulnerability clustering method based on feature value similarity calculation
Technical Field
The invention relates to the field of network security, and discloses a network space vulnerability clustering method based on feature value similarity calculation.
Background Art
In today's digital age, network security has become a focus of attention across industries. For enterprises and security institutions, effectively managing and analyzing the vulnerabilities in their systems, and formulating corresponding protection strategies, has become an important and urgent task.
Network vulnerabilities are flaws or weaknesses in a computer system, network device, or application that may be exploited by an attacker. They come in a wide variety and may result from system design flaws, software implementation errors, or improper configuration. Common vulnerability types include, but are not limited to, SQL injection, cross-site scripting (XSS), buffer overflow, and privilege escalation. Once exploited, these vulnerabilities may lead to leakage of sensitive information, system crashes, and even paralysis of the entire network.
With the popularization of the internet and the expansion of its range of applications, the scale of the network space keeps growing, and the number of related devices and applications is increasing rapidly. The vulnerabilities in different devices, applications, and network environments vary widely, as do their manifestations and impacts. This diversity and complexity make conventional vulnerability analysis methods difficult to apply.
Current vulnerability management generally relies on manual analysis: experts decide on coping strategies by comprehensively evaluating the characteristics, hazard degree, and repair difficulty of each vulnerability. However, as the number of vulnerabilities grows, manual analysis can no longer meet practical requirements. Firstly, manual analysis is time-consuming and inefficient. Secondly, its consistency and accuracy are difficult to guarantee, because different analysts may reach different results. In addition, manual analysis cannot cope with large-scale data processing, and its limitations are particularly apparent when facing novel vulnerabilities and complex attack techniques. With automated means, the classification, analysis, and formulation of coping strategies for a large number of vulnerabilities can be completed in a short time, thereby improving the overall protection capability of network security.
In order to better manage and analyze the large number of vulnerabilities in a network space, the invention provides a network space vulnerability clustering method based on feature value similarity calculation. Compared with the traditional manual analysis method, the automated clustering method has several advantages.
Firstly, the automated clustering method can greatly improve the efficiency of vulnerability analysis. By calculating the similarity among vulnerabilities, a large amount of vulnerability data is classified and organized automatically, which reduces the manual workload. Secondly, the method can effectively improve the consistency and accuracy of the analysis results. Because algorithm-based analysis is not influenced by human factors, similar vulnerabilities are identified and classified consistently. In addition, visual display of the clustering result helps security personnel understand vulnerability distribution and trends more intuitively and formulate more targeted protection measures.
The vulnerability clustering method based on feature value similarity calculation is used as an automatic analysis means, and provides a new solution for solving the problem of large-scale vulnerability analysis. The method can improve the efficiency and accuracy of vulnerability analysis, and can help security personnel to better understand and manage complex vulnerability data, so that the security protection capability of the whole network space is improved.
Disclosure of Invention
The invention combines feature value similarity calculation with a clustering algorithm to provide a method capable of efficiently managing and analyzing network space vulnerabilities. By classifying and labeling the vulnerabilities and extracting their features, the method can automatically group and display a large amount of complex vulnerability data, thereby reducing the workload of manual analysis and improving the efficiency and accuracy of vulnerability management.
The method specifically comprises the following steps:
(10) Obtaining and standardizing vulnerability data;
(11) Defining the data fields of the network security vulnerability data to be crawled and stored, including:
● Title: a brief summary of the vulnerability, including the software name and vulnerability type;
● Identification number: the CVE-ID and the identifiers assigned by the different vulnerability libraries;
● Vulnerability description: a description of the vulnerability principle, triggering method, vulnerability type, and the like;
● Vulnerability type: the CWE (Common Weakness Enumeration) type or the vulnerability type defined by each library;
● Affected products: identified with the CPE (Common Platform Enumeration) standard, which provides a scheme for identifying vendors, products (software and hardware), and versions; affected products also include affected software dependencies, operating systems, and the like;
● Attack vector: the way in which the vulnerability is exploited;
● Attack complexity: the technical difficulty of mounting an attack that exploits the vulnerability;
● Vulnerability severity: the CVSS score and the hazard rating defined by each library;
● Exploit information: proof-of-concept (PoC) code that allows a user to reproduce the vulnerability;
● Reference links: other references, such as patch links provided by the affected vendors;
● Release time: the date on which the vulnerability information was published;
● Update time: the time at which the vulnerability information was last updated;
● Vulnerability submitter: the submitter or publisher of the vulnerability.
(12) Designing a coroutine-based asynchronous crawler: asynchronous requests are implemented with aiohttp and asyncio. A coroutine-style asynchronous request function is built with aiohttp inside an async with aiohttp.… statement, and the response information is obtained through response = await session.…. The main coroutine accesses the target server, receives its response, and parses it; the auxiliary coroutine de-duplicates and stores the parsed Web information; execution switches back and forth between the two whenever one of them blocks.
(13) Parsing the HTML document in the response information with Beautiful Soup and storing the extracted data in a database.
(20) Designing a tagged vulnerability description method: after the network and its systems are scanned with a vulnerability scanning tool and network security experts confirm the defects present in the system, the vulnerabilities are stored as a list using the vulnerability description method. The description reflects the preconditions, results, and characteristics of exploiting each vulnerability and facilitates the clustering and merging of vulnerabilities. The specific form is as follows:
A vulnerability (atomic attack) $V_t$ indicates that exploiting an $\alpha$-type vulnerability (atomic attack) requires the precondition $x$; when technique $\lambda$ is applied and tool $\delta$ is used to attack target $\sigma$, the subsequent result $y$ is obtained. Namely:
$$V_t = (x, \alpha, \lambda, \delta, \sigma, y)$$
Wherein each attribute value has the following characteristics:
$\alpha \in \mathrm{Vuln\_type}$: Vuln_type is the set of vulnerability types, including but not limited to buffer overflow, code injection, and authentication problems;
$\lambda \in \mathrm{Tech}$: Tech is the collection of network attack techniques in the ATT&CK framework, and $\lambda$ is the particular ATT&CK technique used to launch the attack that exploits the vulnerability;
$\delta \in \mathrm{Tool}$: Tool is the set of penetration testing tools or network attack weapon libraries that an attacker might employ. The attack means in the Tool set mainly comprise sending data packets, custom scripts, and the like. Some hacking organizations use their own attack weapon libraries when carrying out attacks; information on these libraries can also be found in ATT&CK, and they are likewise elements of Tool;
$\sigma \in \mathrm{Tar}$: Tar is the set of elements in the system that can be targeted by an attack. A target may be a node or device type in the network, a component on a node, or a running service;
$x$ and $y$ respectively represent the precondition for exploiting the vulnerability (atomic attack) and the result caused by exploiting it;
Wherein $x \in \Sigma$ and $y \in \Sigma$, where $\Sigma$ is the collection of target system assets, that is, all available resources or rights of the system, such as user rights, manager rights, and system data. These assets are also referred to as attributes. Such a description method can describe not only vulnerabilities but also atomic attacks.
(30) Computing the vulnerabilities of a specific network space;
(31) Modeling network space data:
For a specific network space, a hierarchical network space model is constructed from the devices present in the network space, the software installed on them, and the association relationships among the devices and software, providing data support for the subsequent vulnerability computation.
(32) Network space vulnerability computation:
Based on the network space data, the vendors, products, and versions of the devices and software present in the current network space are collected. The vulnerability database is then queried in bulk using these asset attributes to retrieve all vulnerability data that may apply to the network space, and the association relationships between the vulnerabilities and the network assets are recorded.
(40) Classifying and labeling the network security vulnerability data and extracting features;
(41) Determining the classification and labeling methods:
Determining the vulnerability type: the vulnerability type (CWE) and the vulnerability description are used to decide which common vulnerability type the vulnerability belongs to. The manner and difficulty of the attack are inferred from the attack vector and the attack complexity.
Identifying attack tactics and techniques: using the vulnerability description and the attack vector, and referring to the MITRE ATT&CK framework, the corresponding TTPs (tactics, techniques, and procedures) are found. The attack complexity is used to assess how much preparation and how many resources the attacker needs.
Identifying the tools used by the attacker: the PoC code and tools in the exploit information are searched to identify commonly used attack tools, and the information in the reference links is consulted.
Identifying the type of the attack target: the affected vendor, product, and version are used to judge which systems or software are potential targets. CPE information is used to identify the specific hardware, software, or operating system version.
(42) The tags in text form are converted into feature vectors using TF-IDF. TF-IDF measures the relative importance of each tag, and thereby performs feature extraction on the vulnerability data:
$$\mathrm{TF}(t,d) = \frac{f(t,d)}{N_d}$$
Where $f(t,d)$ is the number of times the term $t$ occurs in document $d$ and $N_d$ is the total number of terms in document $d$.
$$\mathrm{IDF}(t,D) = \log\frac{N}{1 + \mathrm{DF}(t)}$$
Where $N$ is the total number of documents in the corpus $D$ and $\mathrm{DF}(t)$ is the number of documents containing the term $t$; adding 1 as smoothing prevents a zero denominator when the term does not appear in any document.
$$\text{TF-IDF}(t,d,D) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t,D)$$
Multiplying TF by IDF gives the importance of a term in a particular document relative to the entire corpus.
(50) Clustering the network security vulnerabilities based on the features and a clustering algorithm;
(51) Determining the number of clusters K according to the target classes $\sigma \in \mathrm{Tar}$ of the vulnerability description method;
The classical K-Means clustering algorithm partitions the dataset into K clusters such that the data points within a cluster are similar to one another and differ from the data points in other clusters. It minimizes the intra-cluster variance by assigning each data point to its nearest centroid.
(52) Randomly selecting one data point from each class $\sigma$ in the dataset (K points in total) as the initial centroids;
(53) Cluster assignment: for each data point, its distance to every centroid is calculated, and the data point is assigned to the cluster of the nearest centroid. The method uses the Euclidean distance:
$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$
Where $x$ is the data point, $c$ is the centroid, and $n$ is the number of features.
(54) Updating the centroids: the centroid of each cluster is recomputed as the average of all data points in the cluster. Let $C_k$ be the kth cluster; the centroid update formula is:
$$c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
Where $c_k$ is the new centroid of the kth cluster, $x_i$ are the data points in the cluster, and $|C_k|$ is the number of data points in the cluster.
(55) Repeating steps (53) and (54) until the centroid change is less than a threshold value or the maximum number of iterations is reached.
(60) Visually displaying the clustered network space vulnerability categories.
Through clustering based on feature value similarity calculation, the method achieves efficient analysis and management of large-scale network vulnerabilities. Vulnerabilities are classified and labeled automatically, which greatly reduces the workload of manual analysis and improves vulnerability management efficiency. The clustering algorithm keeps the analysis results consistent and avoids the deviations and errors that manual analysis may introduce. The clustering result is displayed visually, so that security personnel can intuitively understand the vulnerability distribution and risk situation and formulate more accurate protection strategies, thereby improving the security protection capability of the network space.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of the feature extraction algorithm of the method of the present invention.
FIG. 3 is a flow chart of the clustering algorithm of the method of the present invention.
FIG. 4 is a flow chart of the experiment of the method of the present invention.
Detailed Description
For the purpose of promoting an understanding of the principles and advantages of the invention, reference will now be made to the drawings and specific examples.
The invention provides a network space vulnerability clustering method based on feature value similarity calculation, which comprises the following specific steps:
(10) Obtaining and standardizing vulnerability data;
(11) Defining the data fields of the network security vulnerability data to be crawled and stored, including:
● Title: a brief summary of the vulnerability, including the software name and vulnerability type;
● Identification number: the CVE-ID and the identifiers assigned by the different vulnerability libraries;
● Vulnerability description: a description of the vulnerability principle, triggering method, vulnerability type, and the like;
● Vulnerability type: the CWE (Common Weakness Enumeration) type or the vulnerability type defined by each library;
● Affected products: identified with the CPE (Common Platform Enumeration) standard, which provides a scheme for identifying vendors, products (software and hardware), and versions; affected products also include affected software dependencies, operating systems, and the like;
● Attack vector: the way in which the vulnerability is exploited;
● Attack complexity: the technical difficulty of mounting an attack that exploits the vulnerability;
● Vulnerability severity: the CVSS score and the hazard rating defined by each library;
● Exploit information: proof-of-concept (PoC) code that allows a user to reproduce the vulnerability;
● Reference links: other references, such as patch links provided by the affected vendors;
● Release time: the date on which the vulnerability information was published;
● Update time: the time at which the vulnerability information was last updated;
● Vulnerability submitter: the submitter or publisher of the vulnerability.
(12) Designing a coroutine-based asynchronous crawler: asynchronous requests are implemented with aiohttp and asyncio. A coroutine-style asynchronous request function is built with aiohttp inside an async with aiohttp.… statement, and the response information is obtained through response = await session.…. The main coroutine accesses the target server, receives its response, and parses it; the auxiliary coroutine de-duplicates and stores the parsed Web information; execution switches back and forth between the two whenever one of them blocks.
(13) Parsing the HTML document in the response information with Beautiful Soup and storing the extracted data in a database.
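The crawler of steps (12) and (13) might be sketched as follows. This is a minimal illustration only: the use of aiohttp.ClientSession and session.get is an assumption about the calls the text abbreviates, and the names fetch_with_aiohttp, parse_title, and crawl, the queue-based hand-off, and the in-memory dictionary standing in for the database are all hypothetical.

```python
import asyncio


async def fetch_with_aiohttp(url):
    """Fetch one page. ClientSession/get are assumed, not taken from the text."""
    import aiohttp  # third-party; imported lazily so the sketch loads without it
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


def parse_title(html):
    """Extract the page title with Beautiful Soup (one field, for brevity)."""
    from bs4 import BeautifulSoup  # third-party; imported lazily
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""


async def crawl(urls, fetch=fetch_with_aiohttp, parse=parse_title):
    """Main coroutine fetches and parses; auxiliary coroutine de-duplicates and stores."""
    queue = asyncio.Queue()
    stored = {}  # stand-in for the database: parsed record -> first URL seen

    async def main_coroutine():
        pages = await asyncio.gather(*(fetch(u) for u in urls))
        for url, html in zip(urls, pages):
            await queue.put((url, parse(html)))
        await queue.put(None)  # sentinel: nothing more to store

    async def auxiliary_coroutine():
        while True:
            item = await queue.get()  # yields control while the queue is empty
            if item is None:
                break
            url, record = item
            stored.setdefault(record, url)  # keep only the first copy of a record

    await asyncio.gather(main_coroutine(), auxiliary_coroutine())
    return stored
```

Passing the fetch and parse functions as parameters keeps the control flow testable without network access; a real deployment would reuse one session across requests and write to a database instead of a dictionary.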
(20) Designing a tagged vulnerability description method: after the network and its systems are scanned with a vulnerability scanning tool and network security experts confirm the defects present in the system, the vulnerabilities are stored as a list using the vulnerability description method. The description reflects the preconditions, results, and characteristics of exploiting each vulnerability and facilitates the clustering and merging of vulnerabilities. The specific form is as follows:
A vulnerability (atomic attack) $V_t$ indicates that exploiting an $\alpha$-type vulnerability (atomic attack) requires the precondition $x$; when technique $\lambda$ is applied and tool $\delta$ is used to attack target $\sigma$, the subsequent result $y$ is obtained. Namely:
$$V_t = (x, \alpha, \lambda, \delta, \sigma, y)$$
Wherein each attribute value has the following characteristics:
$\alpha \in \mathrm{Vuln\_type}$: Vuln_type is the set of vulnerability types, including but not limited to buffer overflow, code injection, and authentication problems;
$\lambda \in \mathrm{Tech}$: Tech is the collection of network attack techniques in the ATT&CK framework, and $\lambda$ is the particular ATT&CK technique used to launch the attack that exploits the vulnerability;
$\delta \in \mathrm{Tool}$: Tool is the set of penetration testing tools or network attack weapon libraries that an attacker might employ. The attack means in the Tool set mainly comprise sending data packets, custom scripts, and the like. Some hacking organizations use their own attack weapon libraries when carrying out attacks; information on these libraries can also be found in ATT&CK, and they are likewise elements of Tool;
$\sigma \in \mathrm{Tar}$: Tar is the set of elements in the system that can be targeted by an attack. A target may be a node or device type in the network, a component on a node, or a running service;
$x$ and $y$ respectively represent the precondition for exploiting the vulnerability (atomic attack) and the result caused by exploiting it;
Wherein $x \in \Sigma$ and $y \in \Sigma$, where $\Sigma$ is the collection of target system assets, that is, all available resources or rights of the system, such as user rights, manager rights, and system data. These assets are also referred to as attributes. Such a description method can describe not only vulnerabilities but also atomic attacks.
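As an illustration, the description form of step (20) could be held in a small record type whose fields follow the precondition, vulnerability type, technique, tool, target, and result described above. The class name VulnTag, the field names, the tokens helper, and the sample values are hypothetical and are not taken from the method itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnTag:
    """Six-element description of a vulnerability (atomic attack)."""
    pre: str        # precondition: asset or right required before exploitation
    vuln_type: str  # alpha, an element of Vuln_type
    technique: str  # lambda, an element of Tech (an ATT&CK technique)
    tool: str       # delta, an element of Tool
    target: str     # sigma, an element of Tar
    post: str       # result: asset or right obtained after exploitation

    def tokens(self):
        """Return the tag as prefixed text labels, one per attribute."""
        parts = [
            ("pre", self.pre), ("type", self.vuln_type), ("tech", self.technique),
            ("tool", self.tool), ("target", self.target), ("post", self.post),
        ]
        return [f"{k}:{v.strip().lower().replace(' ', '_')}" for k, v in parts]


# Hypothetical sample values, for illustration only.
example = VulnTag(
    pre="user rights",
    vuln_type="buffer overflow",
    technique="Exploitation for Privilege Escalation",
    tool="custom script",
    target="file server",
    post="manager rights",
)
```

Emitting one prefixed label per attribute is one possible way to keep the attributes distinguishable once the tags are turned into text features.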
(30) Computing the vulnerabilities of a specific network space;
(31) Modeling network space data:
For a specific network space, a hierarchical network space model is constructed from the devices present in the network space, the software installed on them, and the association relationships among the devices and software, providing data support for the subsequent vulnerability computation.
(32) Network space vulnerability computation:
Based on the network space data, the vendors, products, and versions of the devices and software present in the current network space are collected. The vulnerability database is then queried in bulk using these asset attributes to retrieve all vulnerability data that may apply to the network space, and the association relationships between the vulnerabilities and the network assets are recorded.
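The asset-to-vulnerability association of step (32) can be sketched with in-memory data. The inventory, the vulnerability records, and the identifiers VULN-A to VULN-C below are invented placeholders; an actual system would query the vulnerability database populated in step (10), and exact matching on (vendor, product, version) is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical asset inventory: device -> installed (vendor, product, version)
ASSETS = {
    "web-01": [("apache", "http_server", "2.4.49"), ("apache", "log4j", "2.14.1")],
    "db-01": [("oracle", "mysql", "8.0.30")],
}

# Hypothetical vulnerability records: identifier -> affected (vendor, product, version)
VULNS = {
    "VULN-A": [("apache", "log4j", "2.14.1"), ("apache", "log4j", "2.13.0")],
    "VULN-B": [("apache", "http_server", "2.4.49")],
    "VULN-C": [("oracle", "mysql", "5.7.1")],
}


def build_index(vulns):
    """Index vulnerabilities by affected (vendor, product, version) for fast lookup."""
    index = defaultdict(set)
    for vuln_id, affected in vulns.items():
        for key in affected:
            index[key].add(vuln_id)
    return index


def associate(assets, vulns):
    """Return (device, software, vulnerability) links for every exact match."""
    index = build_index(vulns)
    links = []
    for device, software in assets.items():
        for key in software:
            for vuln_id in sorted(index.get(key, ())):
                links.append((device, key, vuln_id))
    return links
```

Building the index once means each asset lookup is a single dictionary access, which matters when the inventory and the vulnerability library are both large.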
(40) Classifying and labeling the network security vulnerability data and extracting features;
(41) Determining the classification and labeling methods:
Determining the vulnerability type: the vulnerability type (CWE) and the vulnerability description are used to decide which common vulnerability type the vulnerability belongs to. The manner and difficulty of the attack are inferred from the attack vector and the attack complexity.
Identifying attack tactics and techniques: using the vulnerability description and the attack vector, and referring to the MITRE ATT&CK framework, the corresponding TTPs (tactics, techniques, and procedures) are found. The attack complexity is used to assess how much preparation and how many resources the attacker needs.
Identifying the tools used by the attacker: the PoC code and tools in the exploit information are searched to identify commonly used attack tools, and the information in the reference links is consulted.
Identifying the type of the attack target: the affected vendor, product, and version are used to judge which systems or software are potential targets. CPE information is used to identify the specific hardware, software, or operating system version.
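A rule-based labeling pass in the spirit of step (41) might look like the sketch below. The lookup tables are small illustrative assumptions (in particular, mapping CWE-269 to "privilege escalation" is a simplification), the record keys are hypothetical, and the target is read from a CPE 2.3 formatted string. Mapping to ATT&CK techniques and tools is omitted.

```python
# Assumed lookup from CWE identifier to a common vulnerability type.
CWE_TO_TYPE = {
    "CWE-79": "cross-site scripting",
    "CWE-89": "sql injection",
    "CWE-120": "buffer overflow",
    "CWE-269": "privilege escalation",
}

# Assumed keyword fallback applied to the description when no CWE matches.
KEYWORDS = ("sql injection", "buffer overflow", "cross-site scripting")


def label(record):
    """Derive type, attack manner, difficulty, and target labels from one record."""
    vuln_type = CWE_TO_TYPE.get(record.get("cwe", ""))
    if vuln_type is None:
        text = record.get("description", "").lower()
        vuln_type = next((kw for kw in KEYWORDS if kw in text), "unknown")

    # CPE 2.3 formatted string: cpe:2.3:<part>:<vendor>:<product>:<version>:...
    parts = record.get("cpe", "").split(":")
    target = ":".join(parts[3:6]) if len(parts) >= 6 else "unknown"

    return {
        "type": vuln_type,
        "vector": record.get("attack_vector", "unknown").lower(),
        "complexity": record.get("attack_complexity", "unknown").lower(),
        "target": target,
    }
```

For instance, a record carrying CWE-89, a NETWORK attack vector, and a CPE string for a hypothetical vendor would be labeled as an SQL injection reachable over the network against that vendor's product and version.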
(42) The tags in text form are converted into feature vectors using TF-IDF. TF-IDF measures the relative importance of each tag, and thereby performs feature extraction on the vulnerability data:
$$\mathrm{TF}(t,d) = \frac{f(t,d)}{N_d}$$
Where $f(t,d)$ is the number of times the term $t$ occurs in document $d$ and $N_d$ is the total number of terms in document $d$.
$$\mathrm{IDF}(t,D) = \log\frac{N}{1 + \mathrm{DF}(t)}$$
Where $N$ is the total number of documents in the corpus $D$ and $\mathrm{DF}(t)$ is the number of documents containing the term $t$; adding 1 as smoothing prevents a zero denominator when the term does not appear in any document.
$$\text{TF-IDF}(t,d,D) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t,D)$$
Multiplying TF by IDF gives the importance of a term in a particular document relative to the entire corpus.
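The feature extraction of step (42) can be sketched in plain Python, taking TF as the term count divided by the document length and IDF as the logarithm of N over 1 plus the document frequency. The function name tf_idf and the representation of each vulnerability as a list of tag strings are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, vectors) for a list of token lists.

    TF(t, d)  = f(t, d) / N_d
    IDF(t, D) = log(N / (1 + DF(t)))
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    vocab = sorted(df)

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / (1 + df[t])) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors
```

With this smoothing, a term that occurs in all but one document receives weight zero, and a term that occurs in every document receives a slightly negative weight, so such ubiquitous tags contribute little to the distance between vulnerabilities.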
(50) Clustering the network security vulnerabilities based on the features and a clustering algorithm;
(51) Determining the number of clusters K according to the target classes $\sigma \in \mathrm{Tar}$ of the vulnerability description method;
The classical K-Means clustering algorithm partitions the dataset into K clusters such that the data points within a cluster are similar to one another and differ from the data points in other clusters. It minimizes the intra-cluster variance by assigning each data point to its nearest centroid.
(52) Randomly selecting one data point from each class $\sigma$ in the dataset (K points in total) as the initial centroids;
(53) Cluster assignment: for each data point, its distance to every centroid is calculated, and the data point is assigned to the cluster of the nearest centroid. The method uses the Euclidean distance:
$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$
Where $x$ is the data point, $c$ is the centroid, and $n$ is the number of features.
(54) Updating the centroids: the centroid of each cluster is recomputed as the average of all data points in the cluster. Let $C_k$ be the kth cluster; the centroid update formula is:
$$c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
Where $c_k$ is the new centroid of the kth cluster, $x_i$ are the data points in the cluster, and $|C_k|$ is the number of data points in the cluster.
(55) Repeating steps (53) and (54) until the centroid change is less than a threshold value or the maximum number of iterations is reached.
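Steps (51) to (55) can be sketched as follows. This is a minimal plain-Python illustration rather than a tuned implementation: the function names, the default tolerance and iteration limit, the fixed random seed, and the choice to keep the previous centroid when a cluster becomes empty are all assumptions.

```python
import math
import random


def euclidean(x, c):
    """Euclidean distance between a data point and a centroid."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def kmeans(points, classes, max_iter=100, tol=1e-6, seed=0):
    """Cluster points, seeding one centroid from each target class (K = number of classes)."""
    rng = random.Random(seed)
    by_class = {}
    for point, sigma in zip(points, classes):
        by_class.setdefault(sigma, []).append(point)
    # Step (52): one randomly chosen point per class as the initial centroid.
    centroids = [list(rng.choice(by_class[s])) for s in sorted(by_class)]

    for _ in range(max_iter):
        # Step (53): assign each point to its nearest centroid.
        assign = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Step (54): recompute each centroid as the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # assumption: an empty cluster keeps its centroid
        shift = max(euclidean(o, n) for o, n in zip(centroids, updated))
        centroids = updated
        # Step (55): stop once the centroids have effectively stopped moving.
        if shift < tol:
            break
    return assign, centroids
```

In use, the points would be the TF-IDF feature vectors of the vulnerabilities and the classes their target labels, so that the number of clusters follows directly from the number of distinct targets.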
(60) Visually displaying the clustered network space vulnerability categories.
Example analysis:
The latest vulnerability data are obtained with the crawler from several public vulnerability databases (e.g., NVD, CVE, Exploit-DB) and stored in a database according to the structure described in the steps above.
For each vulnerability, a vulnerability tag is generated in the following format. For example, for CVE-2021-44228 (Log4j):
Precondition (state before the vulnerability is exploited): Log4j 2.0-beta9 to 2.14.1 is running.
$\alpha$ (condition or context before exploitation): the JNDI component is not enabled.
$\lambda$ (attack vector or attack method): remote attack over the LDAP protocol.
$\delta$ (direct effect or result of exploitation): remote code execution.
$\sigma$ (scope or object affected by the vulnerability): the affected servers and applications.
Result (state after the vulnerability is exploited): the server is controlled or implanted with malicious code.
In this embodiment, the network space has a device list that includes the operating system versions, the application software, and their version information. Based on this information, all vulnerabilities related to the devices are retrieved by querying the vulnerability database. For example, if a system runs Windows 10 and has a particular version of Apache HTTP Server installed, the query results may include all known vulnerabilities associated with that system and software.
The obtained vulnerability data are classified and labeled according to the vulnerability type, the scope of impact, and other characteristics. For example, all vulnerabilities involving remote code execution are classified as one type and vulnerabilities involving privilege escalation as another. A corresponding feature tag $V_t$ is then generated for each vulnerability.
The text-form vulnerability tag $V_t$ is converted into a feature vector using the TF-IDF algorithm. This process represents the text labels of each vulnerability as a multi-dimensional vector, where each dimension corresponds to a unique feature value and its numerical value is the weight of that feature in the vulnerability data.
The feature vectors are analyzed with the K-Means clustering algorithm, which groups vulnerabilities with higher similarity into the same cluster. Each cluster represents a set of vulnerabilities with similar characteristics. For example, all vulnerabilities involving SQL injection may be clustered into one group and all vulnerabilities involving buffer overflows into another.
After the clustering is completed, the clustering result is visually displayed in the network space.
The invention is not limited to the above embodiments. Any technical solution that is the same as or similar to that of the present invention and is arrived at in light of the present invention shall fall within the protection scope of the present invention.