
CN119577790B - Software risk assessment method, device, storage medium, program product and equipment - Google Patents

Info

Publication number
CN119577790B
Authority
CN
China
Prior art keywords
word vector
score
risk
risk score
risk assessment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510131246.0A
Other languages
Chinese (zh)
Other versions
CN119577790A (en)
Inventor
张辰
王玲
沈潇军
王艺丹
赵帅
卢文达
彭梁英
詹佳雯
陈逍潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202510131246.0A
Publication of CN119577790A
Application granted
Publication of CN119577790B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Stored Programmes (AREA)

Abstract

The application discloses a software risk assessment method, device, storage medium, program product and equipment. The method comprises: parsing a code file of the software to be assessed to obtain code information, where the code information indicates the components and library functions corresponding to the code file; extracting features based on the code information to obtain word vector features, and determining a word vector risk score according to the word vector features; performing risk assessment on the components, based on the code file, to obtain a first static feature risk score, and performing risk assessment on the library functions to obtain a second static feature risk score; determining a static feature risk score based on the first and second static feature risk scores; and determining an overall risk score based on the word vector risk score and the static feature risk score. Risks in the software can thereby be identified efficiently and accurately, improving software quality and security.

Description

Software risk assessment method, device, storage medium, program product and equipment
Technical Field
The present application relates to the field of software engineering technologies, and in particular, to a method, an apparatus, a storage medium, a program product, and a device for evaluating software risk.
Background
As software systems grow in complexity and scale, analysis of software components has become an important task in software engineering. Conventional analysis methods typically require manual risk assessment of the software source code against preset risk indicators. However, manual assessment is inefficient and inaccurate when processing large-scale code.
With the popularization of open source software, a large amount of source code is available online for developers to call. Carelessly calling source code of poor quality can threaten the software project under development (for example, by causing the developed software to crash). How to efficiently and accurately identify and analyze the degree of risk in software has therefore become an important issue for improving software quality and security.
Disclosure of Invention
To solve the above technical problems, embodiments of the present application provide a software risk assessment method, apparatus, storage medium, program product, and device that can efficiently and accurately identify risks in software, thereby improving software quality and security.
In a first aspect, an embodiment of the present application provides a software risk assessment method, including:
parsing a code file of the software to be evaluated to obtain code information, wherein the code information indicates the components and library functions corresponding to the code file;
extracting features based on the code information to obtain word vector features, and determining a word vector risk score according to the word vector features;
performing risk assessment on the components based on the code file to obtain a first static feature risk score, and performing risk assessment on the library functions to obtain a second static feature risk score;
determining a static feature risk score based on the first static feature risk score and the second static feature risk score; and
determining an overall risk score based on the word vector risk score and the static feature risk score.
Optionally, the code information includes an abstract syntax tree that indicates the components and the library functions, and parsing the code file of the software to be evaluated to obtain the code information includes:
preprocessing the code file to obtain target code;
determining a lexical analyzer and a syntax analyzer based on grammar rules matched with the code file;
invoking the lexical analyzer to decompose the target code into at least one lexical unit; and
invoking the syntax analyzer to construct the abstract syntax tree based on the at least one lexical unit.
Optionally, extracting features based on the code information to obtain the word vector features, and determining the word vector risk score according to the word vector features, includes:
extracting features from the code information using a neural network model to generate at least one word vector;
constructing the word vector feature based on the at least one word vector; and
calculating the absolute value of each dimension of the word vector feature, and taking the sum of these absolute values as the word vector risk score.
Optionally, extracting features from the code information using the neural network model to generate the at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the grammar structure and the context relations indicated by the code information;
configuring the neural network model with the determined hyperparameters; and
extracting features from the code information using the configured neural network model to generate the at least one word vector.
Optionally, constructing the word vector feature based on the at least one word vector includes:
performing, based on the code information, TF-IDF statistics for each word vector in the at least one word vector to obtain the weight corresponding to each word vector;
normalizing each word vector in the at least one word vector; and
performing a weighted calculation based on the normalized word vectors and their corresponding weights to obtain the word vector feature.
Optionally, the first static feature risk score comprises at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
Optionally, the method further comprises:
identifying each component in the code file using a pre-constructed LDA-based component analysis and risk assessment model, and performing risk assessment on each component in the code file to obtain an LDA component risk score;
wherein determining the overall risk score based on the word vector risk score and the static feature risk score includes:
performing a weighted summation of the LDA component risk score, the word vector risk score, and the static feature risk score to obtain the overall risk score.
Optionally, identifying each component in the code file using the pre-constructed LDA-based component analysis and risk assessment model includes:
decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments, and annotation fragments; and
identifying the plurality of sub-documents using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In a second aspect, an embodiment of the present application provides a software risk assessment apparatus, including:
an analysis module, configured to parse a code file of the software to be evaluated to obtain code information, wherein the code information indicates the components and library functions corresponding to the code file;
a first risk assessment module, configured to extract features based on the code information to obtain word vector features, and to determine a word vector risk score according to the word vector features;
a second risk assessment module, configured to perform risk assessment on the components based on the code file to obtain a first static feature risk score, and to perform risk assessment on the library functions to obtain a second static feature risk score;
a third risk assessment module, configured to determine a static feature risk score based on the first static feature risk score and the second static feature risk score; and
an overall risk scoring module, configured to determine an overall risk score based on the word vector risk score and the static feature risk score.
Optionally, the code information includes an abstract syntax tree that indicates the components and the library functions, and parsing the code file of the software to be evaluated to obtain the code information includes:
preprocessing the code file to obtain target code;
determining a lexical analyzer and a syntax analyzer based on grammar rules matched with the code file;
invoking the lexical analyzer to decompose the target code into at least one lexical unit; and
invoking the syntax analyzer to construct the abstract syntax tree based on the at least one lexical unit.
Optionally, extracting features based on the code information to obtain the word vector features, and determining the word vector risk score according to the word vector features, includes:
extracting features from the code information using a neural network model to generate at least one word vector;
constructing the word vector feature based on the at least one word vector; and
calculating the absolute value of each dimension of the word vector feature, and taking the sum of these absolute values as the word vector risk score.
Optionally, extracting features from the code information using the neural network model to generate the at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the grammar structure and the context relations indicated by the code information;
configuring the neural network model with the determined hyperparameters; and
extracting features from the code information using the configured neural network model to generate the at least one word vector.
Optionally, constructing the word vector feature based on the at least one word vector includes:
performing, based on the code information, TF-IDF statistics for each word vector in the at least one word vector to obtain the weight corresponding to each word vector;
normalizing each word vector in the at least one word vector; and
performing a weighted calculation based on the normalized word vectors and their corresponding weights to obtain the word vector feature.
Optionally, the first static feature risk score comprises at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
Optionally, the apparatus further comprises:
a fourth risk assessment module, configured to identify each component in the code file using a pre-constructed LDA-based component analysis and risk assessment model, and to perform risk assessment on each component in the code file to obtain an LDA component risk score;
wherein determining the overall risk score based on the word vector risk score and the static feature risk score includes:
performing a weighted summation of the LDA component risk score, the word vector risk score, and the static feature risk score to obtain the overall risk score.
Optionally, identifying each component in the code file using the pre-constructed LDA-based component analysis and risk assessment model includes:
decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments, and annotation fragments; and
identifying the plurality of sub-documents using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In a third aspect, an embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the software risk assessment method of any of the above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the software risk assessment method of any of the above.
In a fifth aspect, an embodiment of the present application provides a computer device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the steps of the software risk assessment method of any of the above when executing the computer program.
In summary, the embodiment of the application has at least the following beneficial effects:
In the method and apparatus provided by the embodiments of the present application, a code file of the software to be evaluated is parsed to obtain code information that indicates the components and library functions corresponding to the code file; features are extracted based on the code information to obtain word vector features, and a word vector risk score is determined according to the word vector features; risk assessment is performed on the components based on the code file to obtain a first static feature risk score, and on the library functions to obtain a second static feature risk score; a static feature risk score is determined based on the first and second static feature risk scores; and an overall risk score is determined based on the word vector risk score and the static feature risk score. Risks in the software can thereby be identified efficiently and accurately, improving software quality and security.
Drawings
FIG. 1 is a flowchart of a software risk assessment method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of preprocessing provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of lexical analysis and syntax analysis provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a word vector feature construction provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of risk assessment provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of overall risk score calculation and risk rating provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a software risk assessment device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
In the description of the present application, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first," "second," "third," etc. may explicitly or implicitly include one or more such features. Unless otherwise indicated, "a plurality" means two or more. The term "include" and its variations are open-ended, i.e., "including but not limited to." The term "based on" means "based at least in part on." The term "according to" means "based at least in part on." The term "one embodiment" means "at least one embodiment," "another embodiment" means "at least one additional embodiment," and "some embodiments" means "at least some embodiments."
In the description of the present application, unless explicitly stated or limited otherwise, the terms "mounted" and "connected" are to be construed broadly; a connection may be, for example, fixed, detachable, or integral; mechanical or electrical; direct or indirect via an intervening medium; or an internal communication between two elements. Those of ordinary skill in the art will understand the specific meaning of these terms in the present application according to the specific case.
In the description of the present application, it should be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application; those of ordinary skill in the art will understand the specific meaning of these terms in the context of the application.
The following explains some of the terms and concepts involved in the embodiments of the present application:
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method used to evaluate the importance of a word to a document within a collection of documents (a corpus).
LDA (Latent Dirichlet Allocation) is a topic model for discovering abstract topics in a collection of documents.
In a first aspect, FIG. 1 shows a flowchart of a software risk assessment method provided by an embodiment of the present application. The method includes steps S101 to S105, as follows:
S101, parsing a code file of the software to be evaluated to obtain code information, wherein the code information indicates the components and library functions corresponding to the code file;
In one example, the language type corresponding to the code file may be Java (an object-oriented programming language), in which case the code file is a Java source code file.
S102, extracting features based on the code information to obtain word vector features, and determining a word vector risk score according to the word vector features;
S103, performing risk assessment on the components based on the code file to obtain a first static feature risk score, and performing risk assessment on the library functions to obtain a second static feature risk score;
S104, determining a static feature risk score based on the first static feature risk score and the second static feature risk score. In one example, the static feature risk score may be obtained by summing or weighting the first static feature risk score and the second static feature risk score.
S105, determining an overall risk score based on the word vector risk score and the static feature risk score.
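Steps S104 and S105 can be illustrated with a minimal Python sketch. The patent states only that the scores may be summed or weighted; the helper name `combine_scores`, the numeric scores, and the weights below are all hypothetical values chosen for illustration.

```python
def combine_scores(scores, weights):
    """Weighted sum of named scores. The weights are illustrative assumptions."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must have the same keys")
    return sum(weights[name] * scores[name] for name in scores)


# S104: static feature risk score from the component score (first static
# feature risk score) and the library-function score (second one).
static_score = combine_scores(
    {"component": 6.0, "library_function": 4.0},
    {"component": 0.5, "library_function": 0.5},
)

# S105: overall risk score from the word vector and static feature scores.
overall_score = combine_scores(
    {"word_vector": 3.0, "static_feature": static_score},
    {"word_vector": 0.4, "static_feature": 0.6},
)
```

With equal weights, S104 reduces to an average of the two static scores; other weightings would let one source dominate.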
In an alternative embodiment, the code information includes an abstract syntax tree that indicates the components and the library functions, and parsing the code file of the software to be evaluated to obtain the code information includes:
preprocessing the code file to obtain target code;
In one example, referring to FIG. 2, preprocessing the code file to obtain the target code may proceed as follows. First, the source code files and the binary files generated by compiling the software are extracted from the code file, and the integrity of both is ensured during extraction so that no file is omitted or lost. Next, the source code files are cleaned, where cleaning may include at least one of: precisely removing irrelevant characters using regular expressions, removing comments, and keeping the encoding of the source code files consistent. The source code files and binary files are then renamed according to the naming conventions of the language type (such as Java) corresponding to the code file, so that file names are clear, intuitive, and free of special characters. Finally, when the language type corresponding to the code file is Java, the source code files and binary files are classified and organized by package name and functional module, following the modular design principle of Java projects, to facilitate subsequent analysis. This embodiment keeps the data clean and consistent, improves data quality, and provides a normalized data set for subsequent component analysis and risk detection.
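The cleaning step can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `strip_comments` and `clean_source` are hypothetical, only comment removal and whitespace cleanup are shown, and Java text blocks (triple-quoted strings) are not handled.

```python
import re

# String and character literals are matched first so that comment markers
# inside them (for example "http://x") are left untouched.
_PATTERN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment (newline is kept)
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def strip_comments(source):
    """Remove // and /* */ comments while preserving literals."""
    def replace(match):
        text = match.group(0)
        if text.startswith("//"):
            return ""
        if text.startswith("/*"):
            return " "  # keep a separator so adjacent tokens do not merge
        return text
    return _PATTERN.sub(replace, source)


def clean_source(source):
    """Normalize line endings, strip comments, drop blank lines."""
    source = source.replace("\r\n", "\n")
    lines = [line.rstrip() for line in strip_comments(source).split("\n")]
    return "\n".join(line for line in lines if line) + "\n"
```

Matching literals in the same pattern as comments is a design choice: a simpler pattern that only looks for `//` would corrupt any string containing a URL.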
determining a lexical analyzer and a syntax analyzer based on grammar rules matched with the code file;
In one example, referring to FIG. 3, if the language used in the code file is Java, Java grammar rules, including class definitions, method definitions, variable declarations, and so on, may be written in the grammar definition language of ANTLR (ANother Tool for Language Recognition, a parser generator) according to the Java language specification. The ANTLR tool can then automatically generate an efficient lexical analyzer and syntax analyzer from the defined Java grammar rules.
invoking the lexical analyzer to decompose the target code into at least one lexical unit; and
invoking the syntax analyzer to construct the abstract syntax tree based on the at least one lexical unit.
In one example, referring to FIG. 3, when the language used by the code file is Java, the lexical analyzer may decompose the target code into a series of lexical units (tokens), which are the basic elements of a Java program, such as keywords, identifiers, and literals. The syntax analyzer can then use these lexical units to build an AST (Abstract Syntax Tree) according to the defined grammar rules. The tree structure of the AST clearly shows the syntactic hierarchy and logical relationships of the Java source code, providing a basis for subsequent in-depth analysis. During parsing, this embodiment pays particular attention to the characteristics of the Java language, such as its rich standard library, third-party dependency packages, and complex grammar structures (e.g., inner classes, anonymous classes, and interfaces and their implementations). The lexical analyzer and syntax analyzer of this embodiment are therefore designed and optimized to handle these characteristics accurately and to generate a complete and accurate AST.
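To make the lexical analysis step concrete, the following Python sketch decomposes a Java-like snippet into lexical units using the standard `re` module. It is a hand-written stand-in, not an ANTLR-generated lexer: the token categories, the abbreviated keyword set, and the function name `tokenize` are illustrative assumptions, and AST construction is not shown.

```python
import re

# A small subset of Java keywords, for illustration only.
JAVA_KEYWORDS = {
    "class", "public", "private", "static", "void", "int",
    "return", "new", "import", "if", "else",
}

# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("STRING", r'"(?:\\.|[^"\\])*"'),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("OP", r"[{}()\[\];,.=+\-*/<>!&|]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(code):
    """Split source text into (kind, text) lexical units."""
    tokens = []
    for match in TOKEN_RE.finditer(code):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(
                f"unexpected character {text!r} at offset {match.start()}"
            )
        if kind == "IDENT" and text in JAVA_KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens
```

For example, `tokenize("import java.util.List;")` yields a `KEYWORD` token followed by alternating `IDENT` and `OP` tokens, which is the kind of stream a syntax analyzer would consume to recognize an import declaration.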
In an optional embodiment, extracting features based on the code information to obtain the word vector features, and determining the word vector risk score according to the word vector features, includes:
extracting features from the code information using a neural network model to generate at least one word vector;
In one example, referring to FIG. 4, the neural network model may be a Word2Vec model. When the language used in the code file is Java, the Word2Vec model can capture semantic relationships between functions, variables, and classes, because Java code has a strict syntax structure and rich context. The vocabulary in the code (including keywords, variable names, function names, class names, etc.) can be extracted by vocabulary extraction preprocessing of the Java code, and the Word2Vec model then learns vector representations of the extracted vocabulary, generating at least one word vector (i.e., word embedding). This embodiment pays particular attention to characteristics specific to Java code, such as package imports, class inheritance, and interface implementation, which typically appear in specific syntax structures. This embodiment therefore retains this syntax structure information when constructing the vocabulary (i.e., the syntax structures contained in the code information are suitable for characterizing these features), so that the Word2Vec model can better understand the context in the code.
constructing the word vector feature based on the at least one word vector;
In some cases, when the neural network model is a Word2Vec model and the language used by the code file is Java, the Word2Vec model's reliance on word frequency may cause some important Java packages (e.g., core libraries) that are imported only once to be underestimated, while unimportant words may be overestimated because they occur frequently. To solve this problem, this embodiment may introduce a weight assignment mechanism when constructing the word vector feature, assigning weights according to the importance and frequency of use of each word in the code file. For example, higher weights are given to word vectors corresponding to important packages and classes (e.g., java.util, java.io), while lower weights are given to word vectors corresponding to words that occur frequently but are of lower importance (e.g., temporary variable names, loop variables). Word vectors corresponding to important words thus have a larger influence on the result, which improves the quality of the constructed word vector feature (i.e., the word vector representation).
calculating the absolute value of each dimension of the word vector feature, and taking the sum of these absolute values as the word vector risk score.
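The scoring rule in the last step, summing the absolute values of all dimensions, is the L1 norm of the word vector feature. A minimal Python sketch follows; the function name and the sample vector are hypothetical.

```python
def word_vector_risk_score(feature):
    """Sum of absolute values over all dimensions (the L1 norm)."""
    return sum(abs(value) for value in feature)


# Hypothetical 4-dimensional word vector feature.
score = word_vector_risk_score([0.5, -1.25, 0.0, 2.0])
```

Because absolute values are taken, negative and positive components contribute equally, so the score grows with the magnitude of the feature regardless of sign.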
In an optional implementation manner, the feature extraction of the code information by using the neural network model to generate at least one word vector includes:
determining a super-parameter comprising a window size and a minimum word frequency based on a grammar structure and a context indicated by the code information, wherein the window size is suitable for determining a context range to ensure that a model can capture enough context information, and the minimum word frequency is suitable for filtering out words with too low occurrence frequency to reduce the quality influence of noise on word vectors, and the window size and/or the minimum word frequency can be determined according to the grammar structure and the context relation;
configuring the determined hyper-parameters for the neural network model;
And extracting the characteristics of the code information by using the configured neural network model to generate at least one word vector.
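To make the role of the two hyperparameters concrete, the following sketch shows what they control before any training takes place: the minimum word frequency prunes rare tokens, and the window size decides which neighbouring tokens count as context. It is plain Python rather than a real Word2Vec implementation, and the names `min_count` and `window` as well as the token list are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple


def filter_rare_tokens(tokens: List[str], min_count: int) -> List[str]:
    """Drop tokens that occur fewer than min_count times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


def context_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Pair each token with its neighbours at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical token stream taken from a code fragment.
tokens = ["List", "items", "=", "new", "List", "tmp", "=", "items"]
kept = filter_rare_tokens(tokens, min_count=2)
pairs = context_pairs(kept, window=2)
print(kept)
print(len(pairs))
```

A larger window yields more (target, context) pairs per token, while a larger minimum frequency shrinks the vocabulary the model has to learn.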
In an optional embodiment, the constructing the word vector feature based on the at least one word vector includes:
Based on the code information, respectively carrying out TF-IDF statistics on each word vector in the at least one word vector to obtain a weight (TF-IDF weight) corresponding to each word vector, wherein the TF-IDF weight is suitable for representing the occurrence frequency and importance of words corresponding to the word vector in a code segment;
Respectively carrying out normalization processing on each word vector in the at least one word vector, so as to ensure comparability among different word vectors;
and carrying out weighted calculation based on the normalized at least one word vector and the weight corresponding to the normalized at least one word vector to obtain the word vector characteristics.
In this embodiment, when the neural network model is a Word2Vec model, the advantages of TF-IDF and Word2Vec Word vectors are combined, so that semantic features of the code segments can be more comprehensively reflected.
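The three steps above (TF-IDF weighting, normalization, weighted calculation) can be sketched as follows. Several choices here are assumptions, since the text does not fix them: a smoothed IDF variant is used, normalization is taken to be L2 normalization, and the weighted calculation is taken to be a weighted sum. All function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(segment: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """TF-IDF weight of each word in `segment` relative to `corpus`."""
    counts = Counter(segment)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / len(segment)
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption; other variants are possible).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = tf * idf
    return weights


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def weighted_feature(vectors: Dict[str, List[float]],
                     weights: Dict[str, float]) -> List[float]:
    """Weighted sum of the normalized word vectors."""
    dim = len(next(iter(vectors.values())))
    feature = [0.0] * dim
    for word, vec in vectors.items():
        unit = l2_normalize(vec)
        for k in range(dim):
            feature[k] += weights.get(word, 0.0) * unit[k]
    return feature


# Hypothetical 2-dimensional word vectors and weights.
vectors = {"java.util": [3.0, 4.0], "i": [0.0, 2.0]}
print(weighted_feature(vectors, {"java.util": 1.0, "i": 0.5}))
```

Normalizing before weighting means the weight alone, not the raw vector length, decides how much each word contributes to the feature.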
In an alternative embodiment, the first static feature risk score includes at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
In one example, referring to FIG. 5, the scoring criteria for risk assessment of static features may include VSS (Vulnerability Severity Score), which is based on CVSS (Common Vulnerability Scoring System), UFS (Usage Frequency Score), VS (Version Score), and LCS (License Compliance Score). UFS reflects that a problem in a frequently used component has a greater impact; VS assigns a higher score to the latest version and a lower score to older versions; and LCS assigns a higher score to license types that comply with the project requirements. SF (Static Feature-based Risk Assessment) can then be calculated by combining one or more of the above scoring criteria.
In one example, the first static feature risk score or the second static feature risk score, denoted SF, may be determined by the following formula:
$$SF = w_{VSS} \cdot VSS + w_{UFS} \cdot UFS + w_{VS} \cdot VS + w_{LCS} \cdot LCS$$
Wherein w_{VSS}, w_{UFS}, w_{VS} and w_{LCS} are the weights corresponding to VSS, UFS, VS and LCS, respectively.
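As an illustration of combining the four criteria, the sketch below assumes the scores are merged by a weighted sum, which the mention of one weight per criterion suggests. The class name, function name, weight values and sample scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StaticScores:
    vss: float  # vulnerability severity score
    ufs: float  # usage frequency score
    vs: float   # version score
    lcs: float  # license compliance score


def static_feature_risk(scores: StaticScores,
                        weights: Tuple[float, float, float, float]) -> float:
    """Weighted sum of VSS, UFS, VS and LCS (assumed combination rule)."""
    w_vss, w_ufs, w_vs, w_lcs = weights
    return (w_vss * scores.vss + w_ufs * scores.ufs
            + w_vs * scores.vs + w_lcs * scores.lcs)


# Hypothetical scores for one component and illustrative weights.
component = StaticScores(vss=8.0, ufs=5.0, vs=2.0, lcs=1.0)
print(static_feature_risk(component, (0.5, 0.25, 0.125, 0.125)))
```

The same function could be applied once to a component and once to a library function to obtain the first and second static feature risk scores.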
In an alternative embodiment, the method further comprises:
Identifying each component in the code file by utilizing a pre-constructed LDA-based component analysis and risk assessment model, and carrying out risk assessment on each component in the code file to obtain an LDA component risk score;
in one example, the LDA component risk score LCRS may be determined by the following formula:
$$LCRS = f(CI, RC)$$
Wherein f is a function that calculates LCRS (the LDA-based risk assessment score) by converting CI (Component Interdependencies, i.e., component relevance) and RC (Risk Characteristics, i.e., potential risk features) into a risk score. CI, the degree of similarity or association between components, can be calculated based on the distribution output by the LDA model. RC may include factors such as known vulnerability patterns, code complexity, and external library dependencies. f may be addition, multiplication, weighted summation, etc., and is not particularly limited herein.
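The following sketch illustrates one possible reading of this step. It assumes CI is measured as the cosine similarity between two components' topic distributions, and it offers the three combination rules the text names for f. The function names, the `mode` parameter and the default weights are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Similarity between two topic distributions output by an LDA model."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def lda_component_risk(ci: float, rc: float, mode: str = "weighted",
                       w_ci: float = 0.5, w_rc: float = 0.5) -> float:
    """Combine CI and RC into LCRS with one of the rules named in the text."""
    if mode == "add":
        return ci + rc
    if mode == "multiply":
        return ci * rc
    if mode == "weighted":
        return w_ci * ci + w_rc * rc
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical topic distributions of two components and an RC value.
ci = cosine_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(lda_component_risk(ci, rc=2.0, mode="weighted"))
```

Which rule is appropriate depends on how RC is scaled; multiplication, for example, lets a zero CI suppress the score entirely.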
The determining an overall risk score based on the word vector risk score and the static feature risk score includes:
and carrying out weighted summation based on the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.
In one example, the overall risk score R may be determined by the following formula:
$$R = w_1 \cdot WV + w_2 \cdot LCRS + w_3 \cdot SF$$
Wherein w_1, w_2 and w_3 are, in order, the weights corresponding to the word vector risk score WV, the LDA component risk score LCRS and the static feature risk score SF.
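The weighted summation of the three scores can be written in a few lines. In this sketch the function name, argument names and weight values are illustrative assumptions; the application does not state concrete weights.

```python
from typing import Tuple


def overall_risk(word_vector_score: float, lda_score: float, static_score: float,
                 weights: Tuple[float, float, float]) -> float:
    """Weighted sum of the word vector, LDA component and static feature scores."""
    w_wv, w_lda, w_sf = weights
    return w_wv * word_vector_score + w_lda * lda_score + w_sf * static_score


# Hypothetical scores and weights.
print(overall_risk(4.0, 2.0, 6.0, weights=(0.5, 0.25, 0.25)))
```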
In an alternative embodiment, the identifying each component in the code file using a pre-constructed LDA-based component analysis and risk assessment model includes:
Decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments and annotation fragments;
And identifying the plurality of sub-documents by using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In one example, referring to FIG. 5, to enable the LDA to better identify the different components in the code file, the code file may first be parsed into multiple segments such as classes, methods and annotations, and each segment may be treated independently as a small "sub-document". An LDA model is then constructed, and the hyperparameters of the LDA are set, such as the topic number (tc), the Dirichlet priors (alpha and beta), and the parameters related to Gibbs sampling (such as the number of samples n and the sampling interval si). The choice of these parameters directly affects the performance and results of the LDA model. Through the LDA model, different topics or components can be extracted from a code file (e.g., a Java document). These components may correspond to different modules, classes, or functions in the source code. By analyzing these components, the structure and composition of the source code can be understood more fully.
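The decomposition into sub-documents can be illustrated as follows. The example in the text concerns Java; this sketch instead splits Python source with the standard `ast` and `tokenize` modules, purely as a stand-in, and stops before the LDA model itself. The function name and the sample source are hypothetical.

```python
import ast
import io
import tokenize
from typing import Dict, List

SAMPLE = '''# utility class
class Greeter:
    def greet(self, name):
        # build the greeting
        return "hi " + name
'''


def split_into_sub_documents(source: str) -> Dict[str, List[str]]:
    """Split source code into class, method and comment fragments."""
    tree = ast.parse(source)
    classes, methods = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            classes.append(ast.get_source_segment(source, node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            methods.append(ast.get_source_segment(source, node))
    # Comments are not part of the syntax tree, so take them from the token stream.
    readline = io.StringIO(source).readline
    comments = [tok.string for tok in tokenize.generate_tokens(readline)
                if tok.type == tokenize.COMMENT]
    return {"class": classes, "method": methods, "comment": comments}


fragments = split_into_sub_documents(SAMPLE)
for kind, items in fragments.items():
    print(kind, len(items))
```

Each fragment would then be tokenized and passed to the topic model as its own document.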
In one example, referring to FIG. 6, risk ratings may be divided into A-E levels according to overall risk scores, wherein:
Class A (safe): risk score 0-1; no known vulnerabilities, and all components are the latest version.
Class B (low risk): risk score 1-3; there are a few low-risk vulnerabilities, and the external libraries are all the latest version.
Class C (medium risk): risk score 3-5; there are multiple known vulnerabilities, and some external libraries are outdated.
Class D (high risk): risk score 5-7; severe vulnerabilities exist, and multiple key components and libraries have not been updated.
Class E (unacceptable risk): risk score above 7; the system presents a significant security hazard and must be disabled immediately and comprehensively reviewed.
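The rating table can be expressed as a small lookup function. The listed ranges share their endpoints (1, 3, 5 and 7), so this sketch assumes that a score exactly on a boundary falls into the lower-risk class; that tie-breaking rule and the function name are assumptions.

```python
def risk_grade(score: float) -> str:
    """Map an overall risk score to a rating from A (safe) to E (unacceptable)."""
    if score < 0:
        raise ValueError("risk score must be non-negative")
    if score <= 1:
        return "A"
    if score <= 3:
        return "B"
    if score <= 5:
        return "C"
    if score <= 7:
        return "D"
    return "E"


for s in (0.5, 2.0, 4.0, 6.5, 9.0):
    print(s, risk_grade(s))
```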
As described in the above related embodiments, the present application can effectively identify potential security vulnerabilities and risks in software by combining word vector techniques with machine learning classifiers. The semantic relation of the codes is captured by adopting a neural network model (such as a Word2Vec model), so that the understanding depth of components and library functions is improved, the accuracy of risk assessment is improved, defects can be automatically analyzed, manual intervention is reduced, more comprehensive risk detection is ensured, and a solid foundation is provided for software maintenance and safety management.
In addition, a risk assessment model (namely a risk assessment system) based on multi-dimensional scores is also established, and detailed risk analysis can be conducted on different components and library functions. By combining factors such as vulnerability severity, use frequency, version updating, license compliance and the like, an overall risk score is constructed, so that a risk assessment result is more scientific and reasonable. The flexible rating standard can help a development team to quickly identify high-risk components, and a priority repair plan is formulated, so that the safety and reliability of software are remarkably improved.
The constructed risk assessment model is highly adaptable and extensible: when the software environment and technical requirements change, the scoring criteria can be adjusted or new indicators added. For example, if a new type of security vulnerability or a new component usage condition needs to be considered, only the corresponding data input needs to be updated, and the model as a whole does not need to be rebuilt. This modular design gives the invention wide applicability and high practical value in real-world use.
In a second aspect, correspondingly, the embodiment of the present application further provides a software risk assessment device, which can implement all the flows of the software risk assessment method provided in the foregoing embodiment.
Referring to fig. 7, a schematic structural diagram of a software risk assessment device provided by an embodiment of the present application is shown, where the software risk assessment device includes:
The parsing module 701 is configured to parse a code file of the software to be evaluated to obtain code information, where the code information is used to indicate the components and library functions corresponding to the code file;
The first risk assessment module 702 is configured to perform feature extraction based on the code information, obtain word vector features, and determine a word vector risk score according to the word vector features;
A second risk assessment module 703, configured to perform risk assessment on the component to obtain a first static feature risk score, and perform risk assessment on the library function to obtain a second static feature risk score, based on the code file;
a third risk assessment module 704 configured to determine a static feature risk score based on the first static feature risk score and the second static feature risk score;
The overall risk score module 705 is configured to determine an overall risk score based on the word vector risk score and the static feature risk score.
In an alternative embodiment, the code information includes an abstract syntax tree, the abstract syntax tree being used for indicating the component and the library function, and the parsing of the code file of the software to be evaluated to obtain code information includes:
preprocessing the code file to obtain an object code;
determining a lexical analyzer and a grammar analyzer based on grammar rules matched with the code file;
invoking the lexical analyzer to decompose the object code into at least one lexical unit;
And calling the grammar analyzer to construct the abstract grammar tree based on the at least one lexical unit.
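The lexing and parsing steps above can be illustrated with Python's standard `tokenize` and `ast` modules, used here only as a stand-in for whatever lexical analyzer and grammar analyzer match the language of the code file. Treating imports as components and call names as library functions is an assumption of this sketch, and the sample source and function names are hypothetical.

```python
import ast
import io
import tokenize
from typing import List, Set, Tuple

SOURCE = '''
import os
from math import sqrt

def area(r):
    return 3.14 * sqrt(r) * os.getpid()
'''


def lex(source: str) -> List[str]:
    """Decompose source code into lexical units (names, operators, literals)."""
    readline = io.StringIO(source).readline
    wanted = (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in wanted]


def components_and_calls(source: str) -> Tuple[Set[str], Set[str]]:
    """Build the abstract syntax tree and collect imported modules and called names."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:  # None for relative imports such as "from . import x"
                imports.add(node.module)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                calls.add(func.id)
            elif isinstance(func, ast.Attribute):
                calls.add(func.attr)
    return imports, calls


print(lex(SOURCE)[:4])
print(components_and_calls(SOURCE))
```

Note that `ast.parse` performs its own lexing internally; `lex` is shown separately only to make the lexical units visible.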
In an optional embodiment, the feature extraction based on the code information, to obtain a word vector feature, and determining a word vector risk score according to the word vector feature, includes:
Extracting features of the code information by using a neural network model to generate at least one word vector;
Constructing the word vector feature based on the at least one word vector;
and calculating the absolute value of each dimension in the word vector feature, and taking the sum of the absolute values of the dimensions as the word vector risk score.
In an optional implementation manner, the feature extraction of the code information by using the neural network model to generate at least one word vector includes:
determining hyperparameters, including a window size and a minimum word frequency, based on the grammar structure and context relation indicated by the code information;
configuring the determined hyper-parameters for the neural network model;
And extracting the characteristics of the code information by using the configured neural network model to generate at least one word vector.
In an optional embodiment, the constructing the word vector feature based on the at least one word vector includes:
Based on the code information, respectively carrying out TF-IDF statistics on each word vector in the at least one word vector to obtain the weight corresponding to each word vector;
respectively carrying out normalization processing on each word vector in the at least one word vector;
and carrying out weighted calculation based on the normalized at least one word vector and the weight corresponding to the normalized at least one word vector to obtain the word vector characteristics.
In an alternative embodiment, the first static feature risk score includes at least one of a first vulnerability severity score, a first frequency of use score, a first version score, a first license compliance score;
the second static feature risk score includes at least one of a second vulnerability severity score, a second frequency of use score, a second version score, a second license compliance score.
In an alternative embodiment, the apparatus further comprises:
The fourth risk assessment module is used for identifying each component in the code file by utilizing a pre-constructed LDA-based component analysis and risk assessment model, and carrying out risk assessment on each component in the code file to obtain an LDA component risk score;
the determining an overall risk score based on the word vector risk score and the static feature risk score includes:
and carrying out weighted summation based on the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.
In an alternative embodiment, the identifying each component in the code file using a pre-constructed LDA-based component analysis and risk assessment model includes:
Decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents comprise class fragments, method fragments and annotation fragments;
And identifying the plurality of sub-documents by using the LDA-based component analysis and risk assessment model to obtain each component in the code file.
In a third aspect, an embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the software risk assessment method of any of the above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the software risk assessment method of any of the above.
In a fifth aspect, an embodiment of the present application provides a computer device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the steps of the software risk assessment method according to any one of the preceding claims when the computer program is executed by the processor.
Referring to fig. 8, the computer device of this embodiment comprises a processor 801, a memory 802, and a computer program, such as a software risk assessment program, stored in the memory 802 and executable on the processor 801. The processor 801, when executing the computer program, implements the steps of the various embodiments of the software risk assessment method described above, such as steps S101-S105 shown in fig. 1.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 802 and executed by the processor 801 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the computer device.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer devices may include, but are not limited to, a processor 801, a memory 802. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a computer device and is not limiting of the computer device, and may include more or fewer components than shown, or may combine some of the components, or different components, e.g., the computer device may also include input and output devices, network access devices, buses, etc.
The processor 801 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor, or the processor 801 may be any conventional processor or the like. The processor 801 is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory 802 may be used to store the computer programs and/or modules, and the processor 801 performs various functions of the computer device by running or executing the computer programs and/or modules stored in the memory 802 and invoking data stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
Where the modules/units integrated in the computer device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by means of a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium and, when executed by the processor 801, implements the steps of each method embodiment described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
In summary, the embodiment of the application has at least the following beneficial effects:
According to the software risk assessment method, device, storage medium, program product and equipment provided by the embodiments of the present application, a code file of the software to be evaluated is parsed to obtain code information, where the code information indicates the components and library functions corresponding to the code file. Feature extraction is performed based on the code information to obtain word vector features, and a word vector risk score is determined according to the word vector features. Based on the code file, risk assessment is performed on the components to obtain a first static feature risk score and on the library functions to obtain a second static feature risk score, and a static feature risk score is determined based on the first and second static feature risk scores. An overall risk score is then determined based on the word vector risk score and the static feature risk score, so that risks existing in the software can be identified efficiently and accurately, improving the quality and security of the software.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application may be implemented by means of software plus the necessary hardware platform, or may of course be implemented entirely in hardware. Based on such understanding, the part of the technical solution of the present application that contributes over the background art may be embodied, in whole or in part, in the form of a software product, which may be stored in a storage medium, such as a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, an optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the method described in the embodiments or in some parts of the embodiments of the present application.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the application, such changes and modifications are also intended to be within the scope of the application.

Claims (15)

1.一种软件风险评估方法,其特征在于,包括:1. A software risk assessment method, comprising: 基于待评估软件的代码文件进行解析,得到代码信息,其中,所述代码信息用于指示所述代码文件对应的组件及库函数;Parsing the code file of the software to be evaluated to obtain code information, wherein the code information is used to indicate the components and library functions corresponding to the code file; 基于所述代码信息进行特征提取,得到词向量特征,并根据所述词向量特征确定词向量风险评分;Perform feature extraction based on the code information to obtain word vector features, and determine a word vector risk score based on the word vector features; 基于所述代码文件,对所述组件进行风险评估得到第一静态特征风险评分,以及,对所述库函数进行风险评估得到第二静态特征风险评分;Based on the code file, performing risk assessment on the component to obtain a first static feature risk score, and performing risk assessment on the library function to obtain a second static feature risk score; 基于所述第一静态特征风险评分及所述第二静态特征风险评分,确定静态特征风险评分;Determining a static feature risk score based on the first static feature risk score and the second static feature risk score; 基于所述词向量风险评分及所述静态特征风险评分,确定总体风险评分;Determining an overall risk score based on the word vector risk score and the static feature risk score; 所述基于所述代码信息进行特征提取,得到词向量特征,并根据所述词向量特征确定词向量风险评分,包括:The extracting features based on the code information to obtain word vector features, and determining the word vector risk score according to the word vector features, includes: 利用神经网络模型,对所述代码信息进行特征提取,生成至少一词向量;Using a neural network model, extracting features from the code information to generate at least one word vector; 基于所述至少一词向量,构建所述词向量特征;Based on the at least one word vector, construct the word vector feature; 计算所述词向量特征中每一维度的绝对值,并将各所述维度的绝对值之和作为所述词向量风险评分;Calculate the absolute value of each dimension in the word vector feature, and take the sum of the absolute values of each dimension as the word vector risk score; 所述基于所述至少一词向量,构建所述词向量特征,包括:The constructing the word vector feature based on the at least one word vector includes: 基于所述代码信息,分别对所述至少一词向量中每一词向量进行TF-IDF统计,以得到所述每一词向量对应的权重;Based on the code information, performing 
TF-IDF statistics on each word vector in the at least one word vector to obtain a weight corresponding to each word vector; 分别对所述至少一词向量中每一词向量进行归一化处理;Normalizing each word vector in the at least one word vector respectively; 基于经归一化处理的至少一词向量及其各自对应的权重,进行加权计算得到所述词向量特征。Based on at least one normalized word vector and its corresponding weight, a weighted calculation is performed to obtain the word vector feature. 2.根据权利要求1所述的软件风险评估方法,其特征在于,所述代码信息包括抽象语法树,所述抽象语法树用于指示所述组件及所述库函数,所述基于待评估软件的代码文件进行解析,得到代码信息,包括:2. The software risk assessment method according to claim 1, wherein the code information includes an abstract syntax tree, the abstract syntax tree is used to indicate the component and the library function, and the code information is obtained by parsing the code file based on the software to be assessed, including: 对所述代码文件进行预处理得到目标代码;Preprocessing the code file to obtain target code; 基于与所述代码文件相匹配的语法规则,确定词法分析器和语法分析器;Determine a lexical analyzer and a syntax analyzer based on grammatical rules matching the code file; 调用所述词法分析器,将所述目标代码分解为至少一词法单元;Calling the lexical analyzer to decompose the target code into at least one lexical unit; 调用所述语法分析器,基于所述至少一词法单元构建出所述抽象语法树。The syntax analyzer is called to construct the abstract syntax tree based on the at least one token unit. 3.根据权利要求1所述的软件风险评估方法,其特征在于,所述利用神经网络模型,对所述代码信息进行特征提取,生成至少一词向量,包括:3. 
The software risk assessment method according to claim 1, characterized in that the use of a neural network model to extract features from the code information to generate at least one word vector comprises: 基于所述代码信息所指示的语法结构及上下文关系,确定包括窗口大小及最小词频的超参数;Determine hyperparameters including window size and minimum word frequency based on the grammatical structure and contextual relationship indicated by the code information; 为所述神经网络模型配置所确定的超参数;Configuring the determined hyperparameters for the neural network model; 利用所配置的神经网络模型,对所述代码信息进行特征提取,生成至少一词向量。The configured neural network model is used to extract features from the code information to generate at least one word vector. 4.根据权利要求1所述的软件风险评估方法,其特征在于,4. The software risk assessment method according to claim 1, characterized in that: 所述第一静态特征风险评分包括如下至少之一:第一漏洞严重性评分、第一使用频率评分、第一版本评分、第一许可证合规评分;The first static feature risk score includes at least one of the following: a first vulnerability severity score, a first usage frequency score, a first version score, and a first license compliance score; 所述第二静态特征风险评分包括如下至少之一:第二漏洞严重性评分、第二使用频率评分、第二版本评分、第二许可证合规评分。The second static feature risk score includes at least one of the following: a second vulnerability severity score, a second usage frequency score, a second version score, and a second license compliance score. 5.根据权利要求1-4任一项所述的软件风险评估方法,其特征在于,所述方法还包括:5. 
The software risk assessment method according to any one of claims 1 to 4, characterized in that the method further comprises: 利用预先构建的基于LDA的成分分析与风险评估模型,识别出所述代码文件中的各成分,以及,对所述代码文件中的各成分进行风险评估得到LDA成分风险评分;Using a pre-built LDA-based component analysis and risk assessment model, identifying each component in the code file, and performing risk assessment on each component in the code file to obtain an LDA component risk score; 所述基于所述词向量风险评分及所述静态特征风险评分,确定总体风险评分,包括:The determining of the overall risk score based on the word vector risk score and the static feature risk score includes: 基于所述LDA成分风险评分、所述词向量风险评分及所述静态特征风险评分进行加权求和,得到所述总体风险评分。The overall risk score is obtained by performing a weighted sum based on the LDA component risk score, the word vector risk score and the static feature risk score. 6.根据权利要求5所述的软件风险评估方法,其特征在于,所述利用预先构建的基于LDA的成分分析与风险评估模型,识别出所述代码文件中的各成分,包括:6. The software risk assessment method according to claim 5, characterized in that the use of a pre-built LDA-based component analysis and risk assessment model to identify the components in the code file comprises: 将所述代码文件分解为多个子文档,其中,所述多个子文档包括类片段、方法片段及注释片段;Decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents include class fragments, method fragments and comment fragments; 利用所述基于LDA的成分分析与风险评估模型,对所述多个子文档进行识别,得到所述代码文件中的各成分。The LDA-based component analysis and risk assessment model is used to identify the multiple sub-documents to obtain the components in the code file. 7.一种软件风险评估装置,其特征在于,包括:7. 
A software risk assessment device, comprising: 解析模块,用于基于待评估软件的代码文件进行解析,得到代码信息,其中,所述代码信息用于指示所述代码文件对应的组件及库函数;A parsing module, used to parse the code file of the software to be evaluated to obtain code information, wherein the code information is used to indicate the components and library functions corresponding to the code file; 第一风险评估模块,用于基于所述代码信息进行特征提取,得到词向量特征,并根据所述词向量特征确定词向量风险评分;A first risk assessment module is used to perform feature extraction based on the code information to obtain word vector features, and determine a word vector risk score based on the word vector features; 第二风险评估模块,用于基于所述代码文件,对所述组件进行风险评估得到第一静态特征风险评分,以及,对所述库函数进行风险评估得到第二静态特征风险评分;A second risk assessment module is used to perform risk assessment on the component based on the code file to obtain a first static feature risk score, and to perform risk assessment on the library function to obtain a second static feature risk score; 第三风险评估模块,用于基于所述第一静态特征风险评分及所述第二静态特征风险评分,确定静态特征风险评分;a third risk assessment module, configured to determine a static feature risk score based on the first static feature risk score and the second static feature risk score; 总体风险评分模块,用于基于所述词向量风险评分及所述静态特征风险评分,确定总体风险评分;An overall risk scoring module, configured to determine an overall risk score based on the word vector risk score and the static feature risk score; 所述基于所述代码信息进行特征提取,得到词向量特征,并根据所述词向量特征确定词向量风险评分,包括:The extracting features based on the code information to obtain word vector features, and determining the word vector risk score according to the word vector features, includes: 利用神经网络模型,对所述代码信息进行特征提取,生成至少一词向量;Using a neural network model, extracting features from the code information to generate at least one word vector; 基于所述至少一词向量,构建所述词向量特征;Based on the at least one word vector, construct the word vector feature; 计算所述词向量特征中每一维度的绝对值,并将各所述维度的绝对值之和作为所述词向量风险评分;Calculate the absolute value of each dimension in the word vector feature, and take the sum of the absolute values of each dimension as the word vector risk 
score;
wherein constructing the word vector feature based on the at least one word vector comprises:
performing, based on the code information, TF-IDF statistics on each word vector of the at least one word vector to obtain a weight corresponding to each word vector;
normalizing each word vector of the at least one word vector; and
performing a weighted calculation on the normalized at least one word vector and the respective corresponding weights to obtain the word vector feature.

8. The software risk assessment device according to claim 7, wherein the code information comprises an abstract syntax tree, the abstract syntax tree indicates the component and the library function, and parsing the code file of the software to be assessed to obtain the code information comprises:
preprocessing the code file to obtain target code;
determining a lexical analyzer and a syntax analyzer based on grammar rules matching the code file;
calling the lexical analyzer to decompose the target code into at least one lexical unit; and
calling the syntax analyzer to construct the abstract syntax tree based on the at least one lexical unit.

9. The software risk assessment device according to claim 7, wherein using a neural network model to extract features from the code information to generate at least one word vector comprises:
determining hyperparameters, including a window size and a minimum word frequency, based on the grammatical structure and contextual relationships indicated by the code information;
configuring the neural network model with the determined hyperparameters; and
using the configured neural network model to extract features from the code information to generate the at least one word vector.

10. The software risk assessment device according to claim 7, wherein:
the first static feature risk score comprises at least one of: a first vulnerability severity score, a first usage frequency score, a first version score, and a first license compliance score; and
the second static feature risk score comprises at least one of: a second vulnerability severity score, a second usage frequency score, a second version score, and a second license compliance score.

11. The software risk assessment device according to any one of claims 7 to 10, wherein the device further comprises:
a fourth risk assessment module, configured to identify each component in the code file using a pre-built LDA-based component analysis and risk assessment model, and to perform risk assessment on each component in the code file to obtain an LDA component risk score;
wherein determining the overall risk score based on the word vector risk score and the static feature risk score comprises:
performing a weighted sum of the LDA component risk score, the word vector risk score and the static feature risk score to obtain the overall risk score.

12. The software risk assessment device according to claim 11, wherein identifying each component in the code file using the pre-built LDA-based component analysis and risk assessment model comprises:
decomposing the code file into a plurality of sub-documents, wherein the plurality of sub-documents include class fragments, method fragments and comment fragments; and
identifying the plurality of sub-documents using the LDA-based component analysis and risk assessment model to obtain the components in the code file.

13. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the software risk assessment method according to any one of claims 1 to 6.

14. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the software risk assessment method according to any one of claims 1 to 6.

15. A computer device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the software risk assessment method according to any one of claims 1 to 6.
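For illustration only, the word vector feature construction recited in claim 7 (TF-IDF weighting, normalization, weighted calculation) and the weighted sum recited in claim 11 might be sketched in Python as follows. The claims do not fix the exact formulas, so several choices here are assumptions: the smoothed IDF form, L2 normalization, the use of a weighted average as the "weighted calculation", and the example combination weights (0.3, 0.4, 0.3). All function names are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def tfidf_weights(target_tokens, corpus):
    """Return a TF-IDF weight for each distinct token in target_tokens.

    corpus is a list of token lists (one per code document).
    The smoothed IDF form below is an assumption, not taken from the claims.
    """
    tf = Counter(target_tokens)
    n_docs = len(corpus)
    weights = {}
    for token, count in tf.items():
        df = sum(1 for doc in corpus if token in doc)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[token] = (count / len(target_tokens)) * idf
    return weights


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def word_vector_feature(word_vectors, weights):
    """Weighted average of the normalized word vectors (assumed aggregation)."""
    tokens = [t for t in weights if t in word_vectors]
    total = sum(weights[t] for t in tokens)
    if not tokens or total == 0:
        return []
    feature = [0.0] * len(word_vectors[tokens[0]])
    for t in tokens:
        for i, x in enumerate(l2_normalize(word_vectors[t])):
            feature[i] += weights[t] * x / total
    return feature


def overall_risk_score(lda_score, word_vector_score, static_score,
                       weights=(0.3, 0.4, 0.3)):
    """Weighted sum of the three risk scores; the default weights are illustrative."""
    w_lda, w_vec, w_static = weights
    return w_lda * lda_score + w_vec * word_vector_score + w_static * static_score


if __name__ == "__main__":
    # Toy token lists standing in for parsed code information.
    corpus = [["strcpy", "buf", "len"], ["memcpy", "buf"], ["printf", "len"]]
    vectors = {"strcpy": [3.0, 4.0], "buf": [0.0, 2.0], "len": [1.0, 0.0]}
    token_weights = tfidf_weights(corpus[0], corpus)
    print(word_vector_feature(vectors, token_weights))
    print(overall_risk_score(0.5, 0.8, 0.2))
```

In this sketch a token that appears in fewer documents (here `strcpy`) receives a larger weight and therefore contributes more to the feature than a token that appears in many documents.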
CN202510131246.0A 2025-02-06 2025-02-06 Software risk assessment method, device, storage medium, program product and equipment Active CN119577790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510131246.0A CN119577790B (en) 2025-02-06 2025-02-06 Software risk assessment method, device, storage medium, program product and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510131246.0A CN119577790B (en) 2025-02-06 2025-02-06 Software risk assessment method, device, storage medium, program product and equipment

Publications (2)

Publication Number Publication Date
CN119577790A CN119577790A (en) 2025-03-07
CN119577790B CN119577790B (en) 2025-05-13

Family

ID=94800124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510131246.0A Active CN119577790B (en) 2025-02-06 2025-02-06 Software risk assessment method, device, storage medium, program product and equipment

Country Status (1)

Country Link
CN (1) CN119577790B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462049A (en) * 2022-01-28 2022-05-10 河海大学 An automatic vulnerability classification method and system based on weighted Word2vec
CN118133278A (en) * 2024-02-26 2024-06-04 杭州电子科技大学 Source code traceability analysis and evaluation system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101261604B (en) * 2008-04-09 2010-09-29 中兴通讯股份有限公司 Software quality evaluation apparatus and software quality evaluation quantitative analysis method
US20190138731A1 (en) * 2016-04-22 2019-05-09 Lin Tan Method for determining defects and vulnerabilities in software code
CN116757194A (en) * 2023-06-15 2023-09-15 中国银行股份有限公司 Text processing method, device, equipment and readable storage medium
CN116702156B (en) * 2023-06-20 2024-04-09 任丽娜 Information security risk evaluation system and method thereof
CN116974947A (en) * 2023-08-29 2023-10-31 中国电信股份有限公司技术创新中心 Component detection method and device, electronic equipment and storage medium
CN118797580A (en) * 2023-09-08 2024-10-18 中国移动通信集团内蒙古有限公司 Application risk assessment method, device, system and computer-readable storage medium
CN117473571B (en) * 2023-11-10 2024-05-14 广东深技信息科技有限公司 Data information security processing method and system
CN118094548B (en) * 2024-03-28 2024-11-05 河南省电子规划研究院有限责任公司 Security detection system for software dependent package

Also Published As

Publication number Publication date
CN119577790A (en) 2025-03-07

Similar Documents

Publication Publication Date Title
US12141557B2 (en) Pruning engine
US20210303274A1 (en) Method and System for Arbitrary-Granularity Execution Clone Detection
US8577823B1 (en) Taxonomy system for enterprise data management and analysis
WO2019051420A1 (en) Automating identification of code snippets for library suggestion models
WO2019051422A1 (en) Automating identification of test cases for library suggestion models
EP3695310A1 (en) Blackbox matching engine
EP3679481A1 (en) Automating generation of library suggestion engine models
US20120303661A1 (en) Systems and methods for information extraction using contextual pattern discovery
Qiu et al. Vulnerability detection via multiple-graph-based code representation
Asadi et al. A heuristic-based approach to identify concepts in execution traces
White et al. TCTracer: Establishing test-to-code traceability links using dynamic and static techniques
Cheng et al. A similarity integration method based information retrieval and word embedding in bug localization
Ajienka et al. An empirical study on the interplay between semantic coupling and co-change of software classes
US20240202824A1 (en) Smart contract security auditing
Babur et al. Language usage analysis for EMF metamodels on GitHub
Ardimento et al. A text-based regression approach to predict bug-fix time
Mujhid et al. A search engine for finding and reusing architecturally significant code
Seyam et al. Code complexity and version history for enhancing hybrid bug localization
CN119558404A (en) Optimization method, device, equipment and medium for large model illusion
CN118733713A (en) Data processing method, data processing device and storage medium
Domínguez-Álvarez et al. ReChan: an automated analysis of Android app release notes to report inconsistencies
Inokuchi et al. From academia to software development: publication citations in source code comments
CN119577790B (en) Software risk assessment method, device, storage medium, program product and equipment
Saifan et al. Feature location enhancement based on source code augmentation with synonyms of terms
US11068376B2 (en) Analytics engine selection management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant