CN119577790B - Software risk assessment method, device, storage medium, program product and equipment - Google Patents
- Publication number
- CN119577790B, CN202510131246.0A
- Authority
- CN
- China
- Prior art keywords
- word vector
- score
- risk
- risk score
- risk assessment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The application discloses a software risk assessment method, device, storage medium, program product and equipment. The method comprises: parsing a code file of software to be assessed to obtain code information, wherein the code information indicates the components and library functions corresponding to the code file; extracting features based on the code information to obtain word vector features, and determining a word vector risk score according to the word vector features; performing risk assessment on the components based on the code file to obtain a first static feature risk score, and performing risk assessment on the library functions to obtain a second static feature risk score; determining a static feature risk score based on the first static feature risk score and the second static feature risk score; and determining an overall risk score based on the word vector risk score and the static feature risk score. Risks existing in the software can thereby be identified efficiently and accurately, improving software quality and software security.
Description
Technical Field
The present application relates to the field of software engineering technologies, and in particular, to a method, an apparatus, a storage medium, a program product, and a device for evaluating software risk.
Background
As software systems grow in complexity and scale, analyzing software components has become an important task in software engineering. Conventional analysis methods typically require manual risk assessment of the software source code against preset risk indicators. However, this approach is inefficient and inaccurate when processing large-scale code.
With the popularization of open source software, a large amount of source code is available online for developers to call. If source code of poor quality is carelessly called, it poses a threat to the software project under development (for example, it may cause the developed software to crash). How to efficiently and accurately identify and analyze the degree of risk in software has therefore become an important subject for improving software quality and software security.
Disclosure of Invention
To solve the above technical problems, embodiments of the present application provide a software risk assessment method, apparatus, storage medium, program product, and device, which can efficiently and accurately identify risks existing in software, thereby improving software quality and security.
In a first aspect, an embodiment of the present application provides a software risk assessment method, including:
parsing a code file of the software to be evaluated to obtain code information, wherein the code information is used for indicating components and library functions corresponding to the code file;
extracting features based on the code information to obtain word vector features, and determining a word vector risk score according to the word vector features;
performing risk assessment on the components based on the code file to obtain a first static feature risk score, and performing risk assessment on the library functions to obtain a second static feature risk score;
determining a static feature risk score based on the first static feature risk score and the second static feature risk score;
and determining an overall risk score based on the word vector risk score and the static feature risk score.
Optionally, the code information includes an abstract syntax tree used for indicating the components and the library functions, and the parsing of the code file of the software to be evaluated to obtain code information includes:
preprocessing the code file to obtain target code;
determining a lexical analyzer and a grammar analyzer based on grammar rules matched with the code file;
invoking the lexical analyzer to decompose the target code into at least one lexical unit;
and invoking the grammar analyzer to construct the abstract syntax tree based on the at least one lexical unit.
Optionally, the extracting of features based on the code information to obtain word vector features and the determining of a word vector risk score according to the word vector features include:
extracting features of the code information by using a neural network model to generate at least one word vector;
constructing the word vector features based on the at least one word vector;
and calculating the absolute value of each dimension of the word vector features, and taking the sum of the absolute values as the word vector risk score.
Optionally, the extracting of features of the code information by using a neural network model to generate at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the syntax structure and context relations indicated by the code information;
configuring the neural network model with the determined hyperparameters;
and extracting features of the code information by using the configured neural network model to generate the at least one word vector.
Optionally, the constructing of the word vector features based on the at least one word vector includes:
performing, based on the code information, TF-IDF statistics on each of the at least one word vector to obtain a weight corresponding to each word vector;
normalizing each of the at least one word vector;
and performing a weighted calculation based on the normalized word vectors and their corresponding weights to obtain the word vector features.
Optionally, the first static feature risk score comprises at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
Optionally, the method further comprises:
Identifying each component in the code file by utilizing a pre-constructed LDA-based component analysis and risk assessment model, and carrying out risk assessment on each component in the code file to obtain an LDA component risk score;
the determining of an overall risk score based on the word vector risk score and the static feature risk score includes:
performing a weighted summation of the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.
Optionally, the identifying of each component in the code file by using the pre-constructed LDA-based component analysis and risk assessment model includes:
decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments and annotation fragments;
and identifying the plurality of sub-documents by using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In a second aspect, an embodiment of the present application provides a software risk assessment apparatus, including:
an analysis module, configured to parse a code file of the software to be evaluated to obtain code information, wherein the code information is used for indicating components and library functions corresponding to the code file;
a first risk assessment module, configured to extract features based on the code information to obtain word vector features, and to determine a word vector risk score according to the word vector features;
a second risk assessment module, configured to perform risk assessment on the components based on the code file to obtain a first static feature risk score, and to perform risk assessment on the library functions to obtain a second static feature risk score;
a third risk assessment module, configured to determine a static feature risk score based on the first static feature risk score and the second static feature risk score;
and an overall risk scoring module, configured to determine an overall risk score based on the word vector risk score and the static feature risk score.
Optionally, the code information includes an abstract syntax tree used for indicating the components and the library functions, and the parsing of the code file of the software to be evaluated to obtain code information includes:
preprocessing the code file to obtain target code;
determining a lexical analyzer and a grammar analyzer based on grammar rules matched with the code file;
invoking the lexical analyzer to decompose the target code into at least one lexical unit;
and invoking the grammar analyzer to construct the abstract syntax tree based on the at least one lexical unit.
Optionally, the extracting of features based on the code information to obtain word vector features and the determining of a word vector risk score according to the word vector features include:
extracting features of the code information by using a neural network model to generate at least one word vector;
constructing the word vector features based on the at least one word vector;
and calculating the absolute value of each dimension of the word vector features, and taking the sum of the absolute values as the word vector risk score.
Optionally, the extracting of features of the code information by using a neural network model to generate at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the syntax structure and context relations indicated by the code information;
configuring the neural network model with the determined hyperparameters;
and extracting features of the code information by using the configured neural network model to generate the at least one word vector.
Optionally, the constructing of the word vector features based on the at least one word vector includes:
performing, based on the code information, TF-IDF statistics on each of the at least one word vector to obtain a weight corresponding to each word vector;
normalizing each of the at least one word vector;
and performing a weighted calculation based on the normalized word vectors and their corresponding weights to obtain the word vector features.
Optionally, the first static feature risk score comprises at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
Optionally, the apparatus further comprises:
The fourth risk assessment module is used for identifying each component in the code file by utilizing a pre-constructed LDA-based component analysis and risk assessment model, and carrying out risk assessment on each component in the code file to obtain an LDA component risk score;
the determining of an overall risk score based on the word vector risk score and the static feature risk score includes:
performing a weighted summation of the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.
Optionally, the identifying of each component in the code file by using the pre-constructed LDA-based component analysis and risk assessment model includes:
decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments and annotation fragments;
and identifying the plurality of sub-documents by using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In a third aspect, an embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the software risk assessment method of any of the above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the software risk assessment method of any of the above.
In a fifth aspect, an embodiment of the present application provides a computer device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the steps of the software risk assessment method according to any one of the preceding claims when the computer program is executed by the processor.
In summary, the embodiment of the application has at least the following beneficial effects:
In the method and apparatus provided by the embodiments of the present application, a code file of the software to be evaluated is parsed to obtain code information indicating the components and library functions corresponding to the code file; features are extracted based on the code information to obtain word vector features, and a word vector risk score is determined according to the word vector features; risk assessment is performed on the components based on the code file to obtain a first static feature risk score, and on the library functions to obtain a second static feature risk score; a static feature risk score is determined based on the first static feature risk score and the second static feature risk score; and an overall risk score is determined based on the word vector risk score and the static feature risk score. Risks existing in the software can thereby be identified efficiently and accurately, improving the quality and security of the software.
Drawings
FIG. 1 is a flowchart of a software risk assessment method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of preprocessing provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of lexical analysis and grammar analysis provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a word vector feature construction provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of risk assessment provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of overall risk score calculation and risk rating provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a software risk assessment device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
In the description of the present application, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first," "second," "third," etc. may explicitly or implicitly include one or more such features. Unless otherwise indicated, "a plurality" means two or more. The term "include" and its variations are open-ended, i.e., "including, but not limited to." The terms "based on" and "according to" mean "based at least in part on." The term "one embodiment" means "at least one embodiment," "another embodiment" means "at least one additional embodiment," and "some embodiments" means "at least some embodiments."
In the description of the present application, unless explicitly stated or limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly; a connection may be, for example, fixed, detachable, or integral; mechanical or electrical; direct, or indirect via an intervening medium; or an internal communication between two elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present application, it should be noted that all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art unless defined otherwise. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application.
The following is an explanation of some of the term concepts involved in embodiments of the present application:
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method used to evaluate the importance of a word to a document within a corpus.
LDA (Latent Dirichlet Allocation) is a topic model for discovering abstract topics in a collection of documents.
In a first aspect, referring to FIG. 1, which shows a flowchart of a software risk assessment method provided by an embodiment of the present application, the method includes steps S101 to S105, specifically as follows:
S101, parsing a code file of the software to be evaluated to obtain code information, wherein the code information is used for indicating components and library functions corresponding to the code file;
In one example, the language type corresponding to the code file may be Java (an object-oriented programming language), in which case the code file is a Java source code file.
S102, extracting features based on the code information to obtain word vector features, and determining word vector risk scores according to the word vector features;
S103, performing risk assessment on the component based on the code file to obtain a first static feature risk score, and performing risk assessment on the library function to obtain a second static feature risk score;
S104, determining a static feature risk score based on the first static feature risk score and the second static feature risk score, wherein in one example, the static feature risk score may be obtained by summing, or computing a weighted sum of, the first static feature risk score and the second static feature risk score.
S105, determining an overall risk score based on the word vector risk score and the static feature risk score.
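Steps S104 and S105 can be illustrated with a minimal sketch. It assumes that both combinations are weighted sums; the text only requires that the scores be combined, so the function names and the default weights below are hypothetical choices made for illustration.

```python
def weighted_sum(scores, weights):
    """Return sum(w_i * s_i) over paired scores and weights."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def static_feature_score(first_score, second_score, weights=(0.5, 0.5)):
    # S104: combine the component (first) and library-function (second) scores.
    # The equal weights are an assumption, not values from the source.
    return weighted_sum([first_score, second_score], weights)


def overall_risk_score(word_vector_score, static_score, weights=(0.4, 0.6)):
    # S105: combine the word vector score with the static feature score.
    # The 0.4 / 0.6 split is an assumption, not a value from the source.
    return weighted_sum([word_vector_score, static_score], weights)


static = static_feature_score(6.0, 4.0)
overall = overall_risk_score(3.0, static)
print(static, round(overall, 6))
```

In practice the weights would be tuned to the project; plain summation corresponds to setting every weight to 1.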
In an alternative embodiment, the code information includes an abstract syntax tree used for indicating the components and the library functions, and the parsing of the code file of the software to be evaluated to obtain code information includes:
preprocessing the code file to obtain target code;
In one example, referring to FIG. 2, preprocessing the code file to obtain the target code may include the following. First, the source code files and the binary files generated by compiling the software are extracted from the code file, with the integrity of both ensured during extraction so that no file is missed or lost. Next, the source code files are cleaned, where the cleaning may include at least one of: precisely removing irrelevant characters using regular expressions, removing comments, and keeping the encoding of the source code files consistent. The source code files and binary files are then renamed according to the naming convention of the language type (such as Java) corresponding to the code file, so that file names are clear, intuitive and free of special characters. Finally, where the language type corresponding to the code file is Java, the source code files and binary files are classified and organized by package name and functional module, following the modular design principle of Java projects, to facilitate subsequent analysis. This embodiment keeps the data clean and consistent, improves data quality, and provides a normalized data set for subsequent component analysis and risk detection.
determining a lexical analyzer and a grammar analyzer based on grammar rules matched with the code file;
In one example, referring to FIG. 3, if the language used in the code file is Java, Java grammar rules, including class definitions, method definitions, variable declarations, etc., may be written in the grammar definition language of ANTLR (ANother Tool for Language Recognition, a parser generator) according to the Java language specification, so that the ANTLR tool automatically generates an efficient lexical analyzer and grammar analyzer from the defined Java grammar rules.
invoking the lexical analyzer to decompose the target code into at least one lexical unit;
and invoking the grammar analyzer to construct the abstract syntax tree based on the at least one lexical unit.
In one example, referring to FIG. 3, where the language used by the code file is Java, the lexical analyzer may decompose the target code into a series of lexical units (tokens), which are the basic elements of a Java program, such as keywords, identifiers, and literals. The grammar analyzer then uses these lexical units to build an AST (Abstract Syntax Tree) according to the defined grammar rules. The tree structure of the AST clearly shows the syntactic hierarchy and logical relations of the Java source code, providing a foundation for subsequent in-depth analysis. During parsing, this embodiment pays particular attention to characteristics of the Java language, such as the use of rich standard libraries and third-party dependency packages, and complex syntax structures (e.g., inner classes, anonymous classes, interfaces and their implementations). Both the lexical analyzer and the grammar analyzer of this embodiment are therefore designed and optimized to handle these characteristics accurately and generate a complete and accurate AST.
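The cleaning and lexical-analysis steps can be sketched as follows. This is a hand-written stand-in, not the ANTLR-generated analyzers described above: it handles only a small, assumed subset of Java tokens, the keyword set is deliberately incomplete, and it stops at the token stream without building an AST. The names `strip_comments`, `tokenize`, and `KEYWORDS` are hypothetical.

```python
import re

# String and char literals are matched first so that comment markers
# appearing inside literals are left untouched.
_COMMENT_OR_LITERAL = re.compile(
    r'"(?:\\.|[^"\\])*"'      # string literal
    r"|'(?:\\.|[^'\\])*'"     # char literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def strip_comments(source):
    """Replace each comment with a space, keeping literals as they are."""
    def keep_literals(match):
        text = match.group(0)
        return text if text[0] in "\"'" else " "
    return _COMMENT_OR_LITERAL.sub(keep_literals, source)


# Illustrative subset only; a real lexer would list every Java keyword.
KEYWORDS = {"import", "class", "public", "static", "void", "int", "return", "new"}

_TOKEN = re.compile(
    r'(?P<STRING>"(?:\\.|[^"\\])*")'
    r"|(?P<NUMBER>\d+(?:\.\d+)?)"
    r"|(?P<IDENT>[A-Za-z_$][A-Za-z0-9_$]*)"
    r"|(?P<SYMBOL>[{}()\[\];,.=+\-*/<>!&|@:?%^~])"
    r"|(?P<SKIP>\s+)"
)


def tokenize(code):
    """Split cleaned code into (kind, text) lexical units."""
    tokens = []
    pos = 0
    while pos < len(code):
        match = _TOKEN.match(code, pos)
        if match is None:
            raise ValueError(f"unexpected character {code[pos]!r} at {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        text = match.group(0)
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


sample = "import java.util.List; // core library\nclass A { int x = 1; /* tmp */ }"
tokens = tokenize(strip_comments(sample))
print(tokens)
```

A generated parser would consume this token stream and apply the grammar rules to produce the tree; the sketch only shows why comments are removed before tokenization and what a lexical unit looks like.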
In an optional embodiment, the extracting of features based on the code information to obtain word vector features and the determining of a word vector risk score according to the word vector features include:
extracting features of the code information by using a neural network model to generate at least one word vector;
In one example, referring to FIG. 4, the neural network model may be a Word2Vec model. Where the language used in the code file is Java, the strict syntax structure and rich context of Java code allow the Word2Vec model to capture semantic relationships between functions, variables and classes. The vocabulary in the code (including keywords, variable names, function names, class names, etc.) can therefore be extracted by a vocabulary extraction preprocessing step, after which the Word2Vec model learns vector representations of the extracted vocabulary, generating at least one word vector (i.e., word embedding). This embodiment pays particular attention to characteristics specific to Java code, such as package imports, class inheritance, and interface implementation, which typically appear in specific syntax structures. This embodiment therefore retains this syntax structure information when constructing the vocabulary (i.e., the syntax structure contained in the code information is suitable for characterizing the above characteristics), so that the Word2Vec model can better understand the context in the code.
constructing the word vector features based on the at least one word vector;
In some cases, where the neural network model is a Word2Vec model and the language used by the code file is Java, the Word2Vec model relies on word frequency, so an important Java package (e.g., a core library) that is imported only once may be underestimated, while unimportant words may be overestimated because they occur frequently. To solve this problem, this embodiment may introduce a weight assignment mechanism when constructing the word vector features, assigning weights according to the importance and frequency of use of each word in the code file. For example, higher weights are given to word vectors corresponding to important packages and classes (e.g., java.util, java.io, etc.), while lower weights are given to word vectors corresponding to words that occur frequently but are of lower importance (e.g., temporary variable names, loop variables, etc.). Word vectors corresponding to important words thus have a larger influence on the result, which improves the quality of the constructed word vector features (i.e., the word vector representation).
And calculating the absolute value of each dimension in the word vector feature, and taking the sum of the absolute values of the dimensions as the word vector risk score.
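The scoring rule in the last step is the L1 norm of the word vector feature. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def word_vector_risk_score(feature):
    """Sum of the absolute values of all dimensions (the L1 norm)."""
    return sum(abs(x) for x in feature)


# Made-up three-dimensional feature, used only to show the arithmetic.
score = word_vector_risk_score([0.5, -1.5, 2.0])
print(score)  # 0.5 + 1.5 + 2.0 = 4.0
```

Because the absolute value is taken per dimension, positive and negative components do not cancel each other out.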
In an optional implementation, the extracting of features of the code information by using the neural network model to generate at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the syntax structure and context relations indicated by the code information, wherein the window size determines the context range so that the model can capture sufficient context information, the minimum word frequency filters out words that occur too rarely so as to reduce the effect of noise on word vector quality, and either or both may be determined according to the syntax structure and the context relations;
configuring the determined hyper-parameters for the neural network model;
And extracting the characteristics of the code information by using the configured neural network model to generate at least one word vector.
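The effect of the two hyperparameters can be seen in a small sketch that builds the (center, context) training pairs a skip-gram style Word2Vec model would learn from. It is pure Python and trains no model; `training_pairs` is a hypothetical helper, and the filtering and windowing shown here are a simplified assumption about how such a model consumes its input.

```python
from collections import Counter


def training_pairs(sentences, window=2, min_count=1):
    """Build (center, context) pairs from tokenized code segments.

    min_count drops tokens that occur fewer than min_count times in the
    whole corpus; window bounds how far a context token may be from the
    center token.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


corpus = [
    ["import", "java", "util", "List"],
    ["List", "items", "new", "ArrayList"],
]
pairs = training_pairs(corpus, window=1, min_count=1)
print(len(pairs), pairs[:3])
```

A larger window links tokens that are further apart in a statement, while a higher minimum frequency removes rare tokens; raising `min_count` to 2 on this tiny corpus leaves only `List` and therefore no pairs at all, which shows how an important token imported only once could be filtered out.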
In an optional embodiment, the constructing of the word vector features based on the at least one word vector includes:
Based on the code information, respectively carrying out TF-IDF statistics on each word vector in the at least one word vector to obtain a weight (TF-IDF weight) corresponding to each word vector, wherein the TF-IDF weight is suitable for representing the occurrence frequency and importance of words corresponding to the word vector in a code segment;
Respectively carrying out normalization processing on each word vector in the at least one word vector, wherein the normalization processing means that each word vector is normalized to ensure comparability among different word vectors;
and carrying out weighted calculation based on the normalized at least one word vector and the weight corresponding to the normalized at least one word vector to obtain the word vector characteristics.
In this embodiment, when the neural network model is a Word2Vec model, the advantages of TF-IDF and Word2Vec Word vectors are combined, so that semantic features of the code segments can be more comprehensively reflected.
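The three steps above (TF-IDF weighting, normalization, weighted combination) can be sketched as follows. The sketch makes several assumptions that the text does not fix: the unsmoothed TF-IDF variant tf × log(N / df), L2 normalization, and a weighted sum as the combination. The word vectors are made-up two-dimensional values and all function names are hypothetical.

```python
import math


def tf_idf(token, segment, corpus):
    """Unsmoothed TF-IDF; assumes segment is one of the documents in corpus."""
    tf = segment.count(token) / len(segment)
    df = sum(1 for doc in corpus if token in doc)
    return tf * math.log(len(corpus) / df)


def l2_normalize(vec):
    """Scale a vector to unit length so that vectors are comparable."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def word_vector_feature(segment, corpus, vectors):
    """Weighted sum of normalized word vectors for one code segment."""
    dim = len(next(iter(vectors.values())))
    feature = [0.0] * dim
    for token in set(segment):
        if token not in vectors:
            continue  # no embedding available for this token
        weight = tf_idf(token, segment, corpus)
        unit = l2_normalize(vectors[token])
        feature = [f + weight * u for f, u in zip(feature, unit)]
    return feature


corpus = [["import", "java", "util", "i"], ["i", "i", "tmp"]]
vectors = {"java": [3.0, 4.0], "util": [1.0, 0.0], "i": [0.0, 2.0]}
feature = word_vector_feature(corpus[0], corpus, vectors)
print(feature)
```

In this toy corpus the loop variable `i` appears in every segment, so its inverse document frequency is zero and it contributes nothing, while `java` and `util` dominate the feature.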
In an alternative embodiment, the first static feature risk score includes at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
In one example, referring to FIG. 5, the scoring criteria for risk assessment of static features may include VSS (Vulnerability Severity Score), which is based on CVSS (Common Vulnerability Scoring System); UFS (Usage Frequency Score); VS (Version Score); and LCS (License Compliance Score). UFS reflects that a frequently used component has a greater impact if it has a problem; VS assigns a higher score to the latest version and a lower score to older versions; and LCS assigns a higher score to license types that comply with the project requirements. SF (Static Feature-based Risk Assessment) can thus be calculated from one or more of the above scoring criteria.
In one example, the first static feature risk score or the second static feature risk score $SF$ can be determined by the following formula:
$$SF = w_1 \cdot VSS + w_2 \cdot UFS + w_3 \cdot VS + w_4 \cdot LCS$$
where $w_1$, $w_2$, $w_3$, $w_4$ are, in order, the weights corresponding to VSS, UFS, VS, and LCS.
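A weighted combination of the four scoring criteria could be computed as in the following Python sketch. The weight values, the assumption that all four inputs are already expressed on a common scale, and the function name `static_feature_score` are illustrative and not specified by the application.

```python
def static_feature_score(vss, ufs, vs, lcs, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of the VSS, UFS, VS and LCS criteria (weights are assumed values)."""
    w_vss, w_ufs, w_vs, w_lcs = weights
    return w_vss * vss + w_ufs * ufs + w_vs * vs + w_lcs * lcs

# Hypothetical criterion scores for one component and one library function.
component_sf = static_feature_score(vss=7.5, ufs=6.0, vs=2.0, lcs=1.0)
library_sf = static_feature_score(vss=3.0, ufs=8.0, vs=1.0, lcs=1.0)
```

Because the embodiment allows at least one of the four criteria to be used, a criterion can be switched off in this sketch by setting its weight to zero.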
In an alternative embodiment, the method further comprises:
Identifying each component in the code file by utilizing a pre-constructed LDA-based component analysis and risk assessment model, and carrying out risk assessment on each component in the code file to obtain an LDA component risk score;
In one example, the LDA component risk score $LCRS$ can be determined by the following formula:
$$LCRS = f(CI, RC)$$
where $f$ is the function that calculates LCRS (LDA-based risk assessment) by converting CI (Component Interdependencies, component relevance) and RC (Risk Characteristics, potential risk features) into a risk score. CI, the degree of similarity or association between components, can be calculated based on the distribution output by the LDA model. RC may include factors such as known vulnerability patterns, code complexity, and external library dependencies. $f$ may represent addition, multiplication, weighted summation, or the like, and is not particularly limited herein.
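One possible instance of this calculation is sketched below in Python. It assumes that CI is measured as the cosine similarity between the distributions that the LDA model outputs for two components, that RC has already been condensed into a single number, and that f is a weighted sum; all three choices, together with the names and values used, are assumptions for illustration.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def lda_component_risk(ci, rc, w_ci=0.5, w_rc=0.5):
    """f(CI, RC) realised as a weighted sum (one of the options the text allows)."""
    return w_ci * ci + w_rc * rc

# Hypothetical LDA output distributions for two components and an assumed RC value.
topics_a = [0.7, 0.2, 0.1]
topics_b = [0.6, 0.3, 0.1]
ci = cosine_similarity(topics_a, topics_b)
lcrs = lda_component_risk(ci, rc=0.8)
```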
The determining an overall risk score based on the word vector risk score and the static feature risk score includes:
and carrying out weighted summation based on the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.
In one example, the overall risk score $R$ can be determined by the following formula:
$$R = \alpha \cdot WV + \beta \cdot LCRS + \gamma \cdot SF$$
where $\alpha$, $\beta$, $\gamma$ are, in order, the weights corresponding to the word vector risk score $WV$, the LDA component risk score $LCRS$, and the static feature risk score $SF$.
In an alternative embodiment, the identifying each component in the code file using a pre-constructed LDA-based component analysis and risk assessment model includes:
Decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments and annotation fragments;
And identifying the plurality of sub-documents by using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In one example, referring to FIG. 5, to enable LDA to better identify the different components in the code file, the code file may first be parsed into multiple fragments such as classes, methods, and annotations, each of which is treated as an independent small "sub-document". An LDA model is then constructed and its hyperparameters are set, such as the number of topics (tc), the Dirichlet priors (alpha and beta), and the Gibbs sampling parameters (such as the number of sampling iterations n and the sampling interval si). The choice of these parameters directly affects the performance and results of the LDA model. Through the LDA model, different topics, or components, can be extracted from a code file (e.g., a Java file). These components may correspond to different modules, classes, or functions in the source code. Analyzing these components gives a more complete understanding of the structure and composition of the source code.
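The decomposition into sub-documents could be approximated as in the following Python sketch, which splits a Java source string into annotation, class, and method fragments. It is deliberately naive: the regular expressions, the brace matching, and the names `split_into_subdocuments` and `SAMPLE` are assumptions for illustration, and a real implementation would more likely rely on a proper parser, since string literals containing comment markers or braces would confuse this version.

```python
import re

COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
CLASS_RE = re.compile(r"\bclass\s+\w+[^{]*\{")
METHOD_RE = re.compile(
    r"\b(?:public|protected|private)\s+(?:static\s+)?[\w<>\[\]]+\s+\w+\s*\([^)]*\)\s*\{"
)

def _block_end(text, open_idx):
    """Index just past the brace that closes the block opened at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return len(text)

def split_into_subdocuments(java_source):
    """Return (kind, text) pairs for annotation, class and method fragments."""
    docs = [("annotation", m.group(0)) for m in COMMENT_RE.finditer(java_source)]
    code = COMMENT_RE.sub("", java_source)  # classes and methods are cut from comment-free code
    for kind, pattern in (("class", CLASS_RE), ("method", METHOD_RE)):
        for m in pattern.finditer(code):
            docs.append((kind, code[m.start():_block_end(code, m.end() - 1)]))
    return docs

SAMPLE = """\
/** Stores users. */
public class UserStore {
    // unsafe query
    public String find(String id) {
        return db.query("select * from u where id=" + id);
    }
}
"""
sub_documents = split_into_subdocuments(SAMPLE)
```

Each resulting fragment can then be tokenized and passed to the LDA model as one document.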
In one example, referring to FIG. 6, risk ratings may be divided into A-E levels according to overall risk scores, wherein:
Class A (secure): risk score 0-1; no known vulnerabilities, and all components are the latest version.
Class B (low risk): risk score 1-3; a few low-risk vulnerabilities exist, and all external libraries are the latest version.
Class C (medium risk): risk score 3-5; multiple known vulnerabilities exist, and some external libraries are out of date.
Class D (high risk): risk score 5-7; severe vulnerabilities exist, and multiple key components and libraries have not been updated.
Class E (unacceptable risk): risk score above 7; the system presents a significant security hazard and requires immediate disablement and a comprehensive review.
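Combining the partial scores and mapping the result to a rating could look like the following Python sketch. The weight values are assumed, and because the listed ranges share their boundary values, the sketch further assumes that each upper bound belongs to the lower class (a score of exactly 1 is rated A); the application does not state either choice.

```python
from bisect import bisect_left

GRADE_UPPER_BOUNDS = [1.0, 3.0, 5.0, 7.0]  # inclusive upper bounds of classes A to D
GRADES = ["A", "B", "C", "D", "E"]

def overall_risk_score(word_vector, lda_component, static_feature,
                       weights=(0.3, 0.3, 0.4)):
    """Weighted sum of the three partial risk scores (weights are assumed values)."""
    w_wv, w_lda, w_sf = weights
    return w_wv * word_vector + w_lda * lda_component + w_sf * static_feature

def risk_rating(score):
    """Map an overall risk score to a class from A (secure) to E (unacceptable)."""
    return GRADES[bisect_left(GRADE_UPPER_BOUNDS, score)]

# Hypothetical partial scores for one piece of software.
score = overall_risk_score(word_vector=2.0, lda_component=4.0, static_feature=6.0)
rating = risk_rating(score)
```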
As described in the above embodiments, the present application can effectively identify potential security vulnerabilities and risks in software by combining word vector techniques with machine learning classifiers. A neural network model (such as a Word2Vec model) captures the semantic relations in the code, which deepens the understanding of components and library functions and improves the accuracy of risk assessment. Defects can be analyzed automatically, reducing manual intervention, ensuring more comprehensive risk detection, and providing a solid foundation for software maintenance and security management.
In addition, a risk assessment model (i.e., a risk assessment system) based on multi-dimensional scores is established, allowing detailed risk analysis of different components and library functions. By combining factors such as vulnerability severity, usage frequency, version updates, and license compliance, an overall risk score is constructed, making the risk assessment result more scientific and reasonable. The flexible rating standard helps a development team quickly identify high-risk components and formulate a prioritized repair plan, thereby significantly improving the security and reliability of the software.
The constructed risk assessment model is highly adaptable and extensible: the scoring criteria can be adjusted, or new indicators added, as the software environment and technical requirements change. For example, if a new type of security vulnerability or a new component usage scenario needs to be considered, only the corresponding data input needs to be updated, without rebuilding the entire model. This modular design gives the invention wide applicability and high practical value.
In a second aspect, correspondingly, the embodiment of the present application further provides a software risk assessment device, which can implement all the flows of the software risk assessment method provided in the foregoing embodiment.
Referring to FIG. 7, which shows a schematic structural diagram of a software risk assessment device provided by an embodiment of the present application, the software risk assessment device includes:
an analysis module 701, configured to parse a code file of the software to be evaluated to obtain code information, where the code information indicates the components and library functions corresponding to the code file;
a first risk assessment module 702, configured to perform feature extraction based on the code information to obtain a word vector feature, and to determine a word vector risk score according to the word vector feature;
a second risk assessment module 703, configured to perform, based on the code file, risk assessment on the components to obtain a first static feature risk score and risk assessment on the library functions to obtain a second static feature risk score;
a third risk assessment module 704, configured to determine a static feature risk score based on the first static feature risk score and the second static feature risk score;
and an overall risk score module 705, configured to determine an overall risk score based on the word vector risk score and the static feature risk score.
In an alternative embodiment, the code information includes an abstract syntax tree that indicates the components and the library functions, and the parsing of the code file of the software to be evaluated to obtain code information includes:
preprocessing the code file to obtain an object code;
determining a lexical analyzer and a grammar analyzer based on grammar rules matched with the code file;
invoking the lexical analyzer to decompose the object code into at least one lexical unit;
and invoking the grammar analyzer to construct the abstract syntax tree based on the at least one lexical unit.
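The lexical and grammatical analysis steps can be illustrated with the following Python sketch, which uses Python's standard `tokenize` and `ast` modules as stand-ins for the lexical analyzer and the grammar analyzer. The choice of Python source as the input language, the omission of the preprocessing step, the rule that treats imported modules as components and attribute calls as library functions, and the name `parse_code` are all assumptions for illustration; for a Java code file a Java lexer and parser would be selected instead.

```python
import ast
import io
import tokenize

def parse_code(source):
    """Tokenize `source`, build its abstract syntax tree, and list components and library calls."""
    # Lexical analysis: decompose the code into lexical units (tokens).
    wanted = (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)
    tokens = [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type in wanted
    ]
    # Grammar analysis: construct the abstract syntax tree.
    tree = ast.parse(source)
    components, library_calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            components.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            components.add(node.module)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
                library_calls.add(f"{func.value.id}.{func.attr}")
    return tokens, tree, components, library_calls

# Hypothetical code file content.
SOURCE = "import os\nimport subprocess\nsubprocess.call(os.getenv('CMD'))\n"
tokens, tree, components, library_calls = parse_code(SOURCE)
```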
In an optional embodiment, performing feature extraction based on the code information to obtain a word vector feature and determining a word vector risk score according to the word vector feature includes:
Extracting features of the code information by using a neural network model to generate at least one word vector;
Constructing the word vector feature based on the at least one word vector;
and calculating the absolute value of each dimension in the word vector feature, and taking the sum of the absolute values of the dimensions as the word vector risk score.
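The last step amounts to taking the L1 norm of the word vector feature. A minimal Python sketch, in which the function name and the sample feature are assumptions for illustration:

```python
def word_vector_risk_score(feature):
    """Sum of the absolute values of all dimensions (the L1 norm) of the feature."""
    return sum(abs(x) for x in feature)

# Hypothetical three-dimensional word vector feature.
risk = word_vector_risk_score([0.5, -1.5, 2.0])
```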
In an optional implementation manner, the feature extraction of the code information by using the neural network model to generate at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the grammar structure and the context relation indicated by the code information;
configuring the determined hyperparameters for the neural network model;
and extracting features from the code information by using the configured neural network model to generate at least one word vector.
In an optional embodiment, the constructing the word vector feature based on the at least one word vector includes:
Based on the code information, respectively carrying out TF-IDF statistics on each word vector in the at least one word vector to obtain the weight corresponding to each word vector;
respectively carrying out normalization processing on each word vector in the at least one word vector;
and performing a weighted calculation based on the normalized at least one word vector and the corresponding weights to obtain the word vector feature.
In an alternative embodiment, the first static feature risk score includes at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
In an alternative embodiment, the apparatus further comprises:
The fourth risk assessment module is used for identifying each component in the code file by utilizing a pre-constructed LDA-based component analysis and risk assessment model, and carrying out risk assessment on each component in the code file to obtain an LDA component risk score;
the determining an overall risk score based on the word vector risk score and the static feature risk score includes:
and carrying out weighted summation based on the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.
In an alternative embodiment, the identifying each component in the code file using a pre-constructed LDA-based component analysis and risk assessment model includes:
Decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments and annotation fragments;
And identifying the plurality of sub-documents by using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In a third aspect, an embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the software risk assessment method of any of the above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the software risk assessment method of any of the above.
In a fifth aspect, an embodiment of the present application provides a computer device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the steps of the software risk assessment method according to any one of the above when executing the computer program.
Referring to FIG. 8, the computer device of this embodiment comprises a processor 801, a memory 802, and a computer program, such as a software risk assessment program, stored in the memory 802 and executable on the processor 801. The processor 801, when executing the computer program, implements the steps of the various embodiments of the software risk assessment method described above, such as steps S101-S105 shown in FIG. 1.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 802 and executed by the processor 801 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the computer device.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, the processor 801 and the memory 802. Those skilled in the art will appreciate that the schematic diagram is merely an example of a computer device and does not limit it; the computer device may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device may also include input and output devices, network access devices, buses, and the like.
The processor 801 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor 801 is the control center of the computer device and connects the various parts of the entire computer device using various interfaces and lines.
The memory 802 may be used to store the computer program and/or modules, and the processor 801 implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory 802 and invoking data stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the data storage area may store data created during use of the device (such as audio data, a phonebook, etc.). In addition, the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
If the modules/units integrated in the computer device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiments by means of a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by the processor 801, implements the steps of each method embodiment described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
In summary, the embodiment of the application has at least the following beneficial effects:
In the embodiments of the present application, a code file of the software to be evaluated is parsed to obtain code information indicating the components and library functions corresponding to the code file. Feature extraction is performed based on the code information to obtain a word vector feature, and a word vector risk score is determined according to the word vector feature. Based on the code file, risk assessment is performed on the components to obtain a first static feature risk score and on the library functions to obtain a second static feature risk score, and a static feature risk score is determined from these two scores. An overall risk score is then determined based on the word vector risk score and the static feature risk score, so that risks in the software can be identified efficiently and accurately, improving the quality and security of the software.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application may be implemented by software plus a necessary hardware platform, or entirely in hardware. Based on such understanding, all or part of the contribution of the technical solution of the present application over the background art may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present application.
While the foregoing is directed to the preferred embodiments of the present application, those skilled in the art will appreciate that changes and modifications may be made without departing from the principles of the application, and such changes and modifications are also intended to fall within the scope of the application.
Claims (15)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510131246.0A CN119577790B (en) | 2025-02-06 | 2025-02-06 | Software risk assessment method, device, storage medium, program product and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN119577790A CN119577790A (en) | 2025-03-07 |
CN119577790B true CN119577790B (en) | 2025-05-13 |
Family
ID=94800124
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202510131246.0A Active CN119577790B (en) | 2025-02-06 | 2025-02-06 | Software risk assessment method, device, storage medium, program product and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119577790B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114462049A (en) * | 2022-01-28 | 2022-05-10 | 河海大学 | An automatic vulnerability classification method and system based on weighted Word2vec |
CN118133278A (en) * | 2024-02-26 | 2024-06-04 | 杭州电子科技大学 | Source code traceability analysis and evaluation system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101261604B (en) * | 2008-04-09 | 2010-09-29 | 中兴通讯股份有限公司 | Software quality evaluation apparatus and software quality evaluation quantitative analysis method |
US20190138731A1 (en) * | 2016-04-22 | 2019-05-09 | Lin Tan | Method for determining defects and vulnerabilities in software code |
CN116757194A (en) * | 2023-06-15 | 2023-09-15 | 中国银行股份有限公司 | Text processing method, device, equipment and readable storage medium |
CN116702156B (en) * | 2023-06-20 | 2024-04-09 | 任丽娜 | Information security risk evaluation system and method thereof |
CN116974947A (en) * | 2023-08-29 | 2023-10-31 | 中国电信股份有限公司技术创新中心 | Component detection method and device, electronic equipment and storage medium |
CN118797580A (en) * | 2023-09-08 | 2024-10-18 | 中国移动通信集团内蒙古有限公司 | Application risk assessment method, device, system and computer-readable storage medium |
CN117473571B (en) * | 2023-11-10 | 2024-05-14 | 广东深技信息科技有限公司 | Data information security processing method and system |
CN118094548B (en) * | 2024-03-28 | 2024-11-05 | 河南省电子规划研究院有限责任公司 | Security detection system for software dependent package |
US11068376B2 (en) | Analytics engine selection management |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||