CN114996111A - Method and system for analyzing influence of configuration items on performance of software system

Info

Publication number: CN114996111A
Application number: CN202210736612.1A
Authority: CN
Inventors: 陈鹏飞; 陈志明; 关雅雯; 郑子彬
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2022-06-27
Filing date: 2022-06-27
Publication date: 2022-09-02

Abstract

The present invention provides a method and system for analyzing the impact of configuration items on software system performance. The method includes: identifying and marking all performance operations in the software system according to code patterns preset for the software system, where the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system; identifying the dependencies between each performance operation and each configuration item of the software system to obtain the performance operation set corresponding to each configuration item; constructing a feature vector for each configuration item from its performance operation set; and inputting the feature vector of each configuration item into a trained qualitative performance impact model to determine whether the configuration item affects the performance of the software system, thereby obtaining the set of configuration items that affect performance. The invention can discover the set of configuration items that actually affect software system performance without running the software system, improves the efficiency of software system configuration, and helps configure the software system correctly to improve its performance.

Description

A method and system for analyzing the impact of configuration items on software system performance

Technical field

The invention relates to the technical field of software systems, and in particular to a method and system for analyzing the impact of configuration items on software system performance.

Background

A computer's software system comprises the programs, data, and related documentation that run on the computer, including system software, supporting software, and application software. Many modern software systems are designed to be highly customizable: they can be configured for the user's hardware platform, operating system, and requirements, so as to meet the user's functional or performance needs.

However, software systems have many configuration items, and some configuration items depend on one another. This complexity makes tuning a software system's configuration a major challenge. Studies have shown that misconfiguration has become one of the main causes of system failures and performance problems. Its consequences can be severe: misconfiguration of commercial storage systems and open-source operating systems can cause hard-to-diagnose crashes, hangs, or severe performance degradation.

Beyond outright configuration errors, users usually find it hard to understand the actual effect that changing a configuration item has on the software system. They are therefore often forced to tune the configuration through time-consuming trial and error, which makes software system configuration inefficient.

Summary of the invention

The purpose of the present invention is to provide a method and system for analyzing the impact of configuration items on software system performance, so as to solve the technical problems of software system misconfiguration and low configuration efficiency in the prior art.

The purpose of the present invention can be achieved through the following technical solutions:

A method for analyzing the impact of configuration items on software system performance, including:

identifying and marking all performance operations in the software system according to code patterns preset for the software system, where the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system;

identifying the dependencies between each performance operation and each configuration item of the software system to obtain the performance operation set corresponding to each configuration item, where every performance operation in a performance operation set has a dependency on the corresponding configuration item;

constructing a feature vector for each configuration item from its performance operation set;

inputting the feature vector of each configuration item into a trained qualitative performance impact model, determining whether the configuration item affects the performance of the software system, and obtaining the set of configuration items that affect software system performance, where the qualitative performance impact model is trained on feature vectors corresponding to the configuration items of multiple software systems.
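The first and third steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the regular expressions in `PATTERNS`, the four category names, and the "count per category" feature layout are all hypothetical, since the text only says that patterns are preset per software system.

```python
import re
from collections import Counter

# Hypothetical code patterns for Java-like source; real patterns are preset
# per software system and are not given in the text.
PATTERNS = {
    "lock": re.compile(r"\bsynchronized\b|\.lock\(\)"),
    "io": re.compile(r"\.(read|write|flush)\("),
    "sleep": re.compile(r"Thread\.sleep\("),
    "alloc": re.compile(r"new\s+\w+\s*\["),
}
CATEGORIES = sorted(PATTERNS)  # fixed feature order: alloc, io, lock, sleep


def mark_performance_operations(source_lines):
    """Return {line_number: category} for every line matching a pattern."""
    marked = {}
    for number, line in enumerate(source_lines, start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                marked[number] = category
                break  # one category per line is enough for this sketch
    return marked


def build_feature_vector(operation_lines, marked):
    """Count one option's dependent performance operations per category."""
    counts = Counter(marked[n] for n in operation_lines)
    return [counts[c] for c in CATEGORIES]


source = [
    'int n = conf.getInt("buffer.size", 4096);',
    'byte[] buf = new byte[n];',
    'if (conf.getBoolean("sync.writes", false)) {',
    '    out.flush();',
    '}',
]
marked = mark_performance_operations(source)
# Assume dependency analysis linked line 2 to "buffer.size" and line 4 to "sync.writes".
buffer_vector = build_feature_vector([2], marked)
sync_vector = build_feature_vector([4], marked)
```

Here the link between an option and its operation lines is supplied by hand; in the method it comes from the dependency identification step.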

Optionally, the qualitative performance impact model includes:

a random forest classification model and a configuration item dependency detector;

where the random forest classification model performs binary classification on whether a configuration item affects software system performance, and the configuration item dependency detector corrects the classification results of the random forest classification model.

Optionally, the dependencies include:

data dependencies and control dependencies;

where a data dependency is a dependency between data flows, and a control dependency is a dependency caused by the program's control flow.

Optionally, identifying the dependencies between each performance operation and each configuration item of the software system includes:

identifying the data dependencies between each performance operation and each configuration item of the software system using taint analysis;

identifying the control dependencies between each performance operation and each configuration item of the software system using a program dependence graph, where the program dependence graph is built with program dependence analysis and describes the program's control dependencies and data dependencies.

Optionally, identifying the data dependencies between each performance operation and each configuration item of the software system using taint analysis includes:

entering the program entry point of the software system, traversing the control flow, and creating a taint as a source at each configuration item loading API;

recording the data propagation path of the source and the sink it finally reaches, in which case the performance operation at that sink has a data dependency on the configuration item; a sink is a program statement that the source is not expected to reach, and sinks are set in advance before the statements corresponding to the performance operations.
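A toy forward taint propagation may make the source and sink roles concrete. It assumes a straight-line program encoded as hypothetical `(kind, target, operands)` tuples; a production analysis would be interprocedural and would run on real intermediate code, which this sketch does not attempt.

```python
def taint_analysis(statements):
    """Map each option to the performance operations its value reaches.

    Statement kinds (a hypothetical encoding for this sketch):
      ("load",   variable,  [option_name])  configuration loading API: source
      ("assign", variable,  [variables])    data flows from operands to variable
      ("sink",   operation, [variables])    placed before a performance operation
    """
    taint = {}         # variable -> set of option names that taint it
    dependencies = {}  # option name -> set of performance operations reached
    for kind, target, operands in statements:
        if kind == "load":
            taint[target] = set(operands)
        elif kind == "assign":
            merged = set()
            for variable in operands:
                merged |= taint.get(variable, set())
            taint[target] = merged
        elif kind == "sink":
            for variable in operands:
                for option in taint.get(variable, set()):
                    dependencies.setdefault(option, set()).add(target)
    return dependencies


program = [
    ("load", "n", ["buffer.size"]),
    ("load", "verbose", ["log.verbose"]),
    ("assign", "size", ["n"]),
    ("assign", "k", []),                     # constant, carries no taint
    ("sink", "allocate_buffer", ["size"]),   # reached by the buffer.size taint
    ("sink", "sleep", ["k"]),                # not reached by any taint
]
data_dependencies = taint_analysis(program)
```

Only `buffer.size` ends up with a data-dependent performance operation; `log.verbose` is loaded but never flows to a sink.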

Optionally, identifying the control dependencies between each performance operation and each configuration item of the software system using a program dependence graph includes:

traversing all nodes in the program dependence graph and constructing the control region of each configuration item, where the control region of a configuration item is a sequence of statements that has a direct control dependency on that configuration item;

identifying the control dependencies between each performance operation and each configuration item of the software system according to the control region of each configuration item.
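The two steps above can be sketched on a hand-written graph. The dictionary encoding of control-dependence edges, the statement labels, and the helper names are assumptions made for illustration; the text does not prescribe a data structure.

```python
def control_regions(control_edges, predicate_options):
    """Build each option's control region.

    control_edges:     predicate node -> statements directly control dependent on it
    predicate_options: predicate node -> options whose value the predicate tests
    """
    regions = {}
    for node, children in control_edges.items():
        for option in predicate_options.get(node, ()):
            regions.setdefault(option, set()).update(children)
    return regions


def control_dependencies(regions, performance_ops):
    """Keep only the marked performance operations inside each control region."""
    return {
        option: {performance_ops[s] for s in statements if s in performance_ops}
        for option, statements in regions.items()
    }


# s3: if (syncWrites) { s4: out.flush(); s5: count++; }
# s6: if (i < 10)     { s7: Thread.sleep(5); }   (s6 tests no option)
control_edges = {"s3": ["s4", "s5"], "s6": ["s7"]}
predicate_options = {"s3": ["sync.writes"]}
performance_ops = {"s4": "flush", "s7": "sleep"}

regions = control_regions(control_edges, predicate_options)
control_deps = control_dependencies(regions, performance_ops)
```

The `sleep` operation is control dependent on `s6`, but `s6` does not test a configuration item, so it falls in no option's control region.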

Optionally, the training process of the random forest classification model includes:

dividing the feature vectors corresponding to the configuration items of multiple software systems into a training set and a test set;

training the random forest classification model on the training set using the random forest algorithm.
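As a rough illustration of this training step, the sketch below splits labelled feature vectors and fits a toy bagged ensemble of depth-1 trees, each trained on a bootstrap sample with one randomly chosen feature. It is a simplified stand-in for a random forest, written in pure Python so it is self-contained. The sample data, function names, and hyperparameters are invented for the example, and a real deployment would more likely use an established library implementation with full decision trees.

```python
import random


def split(samples, n_test, rng):
    """Shuffle labelled samples and hold out n_test of them as the test set."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def train_stump(samples, feature):
    """Pick the threshold on one feature with the fewest training errors."""
    best = None
    for threshold in sorted({x[feature] for x, _ in samples}):
        for above in (0, 1):  # label predicted when value >= threshold
            errors = sum((above if x[feature] >= threshold else 1 - above) != y
                         for x, y in samples)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, above)
    return best[1:]


def train_forest(train, n_trees, rng):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    n_features = len(train[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(train) for _ in train]
        forest.append(train_stump(bootstrap, rng.randrange(n_features)))
    return forest


def predict(forest, x):
    """Majority vote over all stumps; 1 means 'affects performance'."""
    votes = sum(above if x[f] >= t else 1 - above for f, t, above in forest)
    return 1 if 2 * votes > len(forest) else 0


# Invented (feature_vector, label) pairs pooled from several software systems.
samples = [([3, 1], 1), ([4, 0], 1), ([5, 2], 1), ([2, 2], 1), ([6, 1], 1),
           ([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([0, 2], 0), ([1, 1], 0)]
rng = random.Random(0)
train, test = split(samples, 2, rng)
forest = train_forest(train, 25, rng)
predictions = [predict(forest, x) for x, _ in test]
```

The held-out `test` set is what one would use to estimate how well the classifier generalizes to configuration items it has not seen.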

Optionally, the correction of the random forest classification model's results by the configuration item dependency detector includes:

when a first configuration item of the software system depends on a second configuration item, if the random forest classification model determines that the first configuration item affects software system performance and the second configuration item does not, the configuration item dependency detector corrects the second configuration item to be performance-affecting.
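The correction rule can be written down directly. In this sketch the option names and the `depends_on` mapping are hypothetical, and the rule is applied in a single pass over the classifier's original output; whether corrections should also propagate along chains of dependencies is not stated in the text, so the sketch does not do so.

```python
def correct_with_dependencies(predictions, depends_on):
    """Apply the dependency detector's rule to classifier output.

    predictions: option -> True if classified as affecting performance
    depends_on:  first option -> options it depends on (the "second" options)
    """
    corrected = dict(predictions)
    for first, seconds in depends_on.items():
        if predictions.get(first, False):
            for second in seconds:
                # The first option affects performance, so the option it
                # depends on is treated as performance-affecting as well.
                corrected[second] = True
    return corrected


predictions = {"cache.size": True, "cache.enabled": False, "log.level": False}
depends_on = {"cache.size": ["cache.enabled"]}  # cache.size depends on cache.enabled
corrected = correct_with_dependencies(predictions, depends_on)
```

After correction, `cache.enabled` is reported as performance-affecting while `log.level` is left unchanged.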

Optionally, before identifying the dependencies between each performance operation and each configuration item of the software system, the method further includes:

extracting the configuration item information of the software system, where the configuration item information includes at least the names of the configuration items, their number, and the API used to load them into the software system.
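One simple way to gather this information is to scan the source text for loading calls. The sketch below assumes a hypothetical loading API of the form `conf.getXxx("name", ...)`; the actual API differs per software system and would have to be supplied for each one.

```python
import re

# Hypothetical loading API: conf.getInt("name", ...), conf.getBoolean("name", ...)
LOAD_CALL = re.compile(r'\bconf\.(get\w*)\(\s*"([^"]+)"')


def extract_options(source_lines):
    """Return {option_name: set of loading API methods used for it}."""
    options = {}
    for line in source_lines:
        for api, name in LOAD_CALL.findall(line):
            options.setdefault(name, set()).add(api)
    return options


source = [
    'int n = conf.getInt("buffer.size", 4096);',
    'boolean s = conf.getBoolean("sync.writes", false);',
    'int m = conf.getInt("buffer.size", 1024);',
]
options = extract_options(source)
option_count = len(options)
```

The recorded API methods are the places where a later taint analysis would create its sources.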

The present invention also provides a system for analyzing the impact of configuration items on software system performance, including:

a performance operation identification module, configured to identify and mark all performance operations in the software system according to code patterns preset for the software system, where the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system;

a dependency identification module, configured to identify the dependencies between each performance operation and each configuration item of the software system and to obtain the performance operation set corresponding to each configuration item, where every performance operation in a performance operation set has a dependency on the corresponding configuration item;

a feature vector construction module, configured to construct a feature vector for each configuration item from its performance operation set;

a configuration item set determination module, configured to input the feature vector of each configuration item into the trained qualitative performance impact model, determine whether the configuration item affects the performance of the software system, and obtain the set of configuration items that affect software system performance, where the qualitative performance impact model is trained on feature vectors corresponding to the configuration items of multiple software systems.

The present invention provides a method and system for analyzing the impact of configuration items on software system performance. The method includes: identifying and marking all performance operations in the software system according to code patterns preset for the software system, where the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system; identifying the dependencies between each performance operation and each configuration item of the software system to obtain the performance operation set corresponding to each configuration item, where every performance operation in a performance operation set has a dependency on the corresponding configuration item; constructing a feature vector for each configuration item from its performance operation set; and inputting the feature vector of each configuration item into the trained qualitative performance impact model, determining whether the configuration item affects the performance of the software system, and obtaining the set of configuration items that affect software system performance, where the qualitative performance impact model is trained on feature vectors corresponding to the configuration items of multiple software systems.

In view of this, the beneficial effects of the present invention are as follows:

The present invention uses program analysis to track the time-intensive or space-intensive operations that depend on configuration items, and constructs a feature vector for each configuration item from the analysis results. The analysis is fine-grained down to the individual configuration item, is not limited to Boolean or exhaustively enumerable finite numeric types, and supports configuration items of any type. The qualitative performance impact model is built with a random forest. No time-consuming local measurement is needed: a single analysis of the software system's source code suffices to determine whether a specific configuration item affects the performance of the configurable system, which greatly reduces the cost of performance analysis. Without running the software system, the invention can predict with reasonable accuracy whether each configuration item affects software system performance, discover the set of configuration items that actually affect performance, improve the efficiency of software system configuration, and help configure the software system correctly to improve its performance. In addition, the invention is interpretable: the program analysis results and the classification rules of the performance model reveal the underlying reasons why a configuration item affects performance.

Description of drawings

Fig. 1 is a schematic flowchart of the method of the present invention;

Fig. 2 is a schematic structural diagram of the system of the present invention;

Fig. 3 is the program dependence graph of an example program, in which solid lines represent control dependencies and dashed lines represent data dependencies;

Fig. 4 is an example of taint analysis with FlowDroid;

Fig. 5 is a schematic flowchart of an embodiment of the method of the present invention;

Fig. 6 shows examples of the categories of performance operations and their code patterns in the method of the present invention;

Fig. 7 shows the classification of dependencies between performance operations and configuration items in the method of the present invention, with examples;

Fig. 8 is a schematic flowchart of the taint analysis in the method of the present invention;

Fig. 9 is an example of a configuration item control region in the method of the present invention;

Fig. 10 is a schematic diagram of the qualitative performance impact model in the method of the present invention;

Fig. 11 is a schematic diagram of the program analysis module of the ConfigAnalyzer tool of the present invention.

Detailed description

Explanation of terms:

Configuration item (Option): a special type of input that has a type and a range of values. For example, a Boolean configuration item has the value range {true, false}, and an Integer configuration item may have the value range {0, 1, 2, 3}. Configuration items let users change the internal execution logic of a software system without modifying its code, so users change the system's functionality or performance by changing the values of configuration items. In some literature, configuration items are also called features.

Configuration: the complete setting of all configuration items in a software system. All configuration items, each set to a definite value, together constitute the configuration of the software system.

Configuration space: the set of all possible configurations of a software system.
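These three terms fit together in a few lines. In the sketch below the option names are invented; the value ranges are the Boolean and Integer ranges from the definition above. A configuration is one choice of value per option, and the configuration space is the Cartesian product of the value ranges.

```python
from itertools import product

# Two illustrative options with the value ranges used in the definition above.
options = {
    "cache.enabled": [True, False],  # Boolean option
    "retries": [0, 1, 2, 3],         # Integer option with a finite range
}


def configuration_space(options):
    """Enumerate every configuration: one value chosen for each option."""
    names = list(options)
    return [dict(zip(names, values))
            for values in product(*(options[name] for name in names))]


space = configuration_space(options)
```

With one Boolean option and one four-valued option the space has 2 × 4 = 8 configurations; every additional option multiplies that count by the size of its value range, which is why measuring every configuration quickly becomes impractical.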

Configurable system: a software system that offers users configuration so that it can be run in a customized way.

Misconfiguration: a configuration in which a configuration item is set to an inappropriate value, so that the behavior or performance of the software system does not meet expectations. Software system errors caused by misconfiguration are called configuration errors.

Environment: the hardware and software, taken as a whole, on which the software system depends to run. In general, the environment does not change while the software system is running.

Workload: the amount of work the software system must complete within a given period of time. Users submit tasks to the software system in order to accomplish a given goal. The more work the system must complete within a given period, the greater the workload and the more computing resources the system needs.

Performance: a property that expresses the working capability of a software system, usually directly related to energy consumption and operating cost. Performance is measured in different ways under different quality-of-service requirements. The most intuitive measure of software system performance is the running time required to complete a task.

Performance-influence model: a class of models that describe the performance of a software system running in a specific environment under a specific workload.

Control flow: the order in which a program's statements, instructions, or function calls are executed at run time. In an imperative programming language such as Java, programs have an explicit control flow (unlike programs written in declarative programming languages).

Control-flow statement: a class of program statements that affect the control flow actually executed, depending on the control-flow decision taken. For example, the if, switch, for, and while statements in Java are all control-flow statements.

Control-flow decision: the actual execution of a control-flow statement, that is, the choice to execute one particular branch.

Control-flow graph (CFG): a kind of flow chart used to represent the control flow of a program.

Data flow: an abstraction of the chains of data dependencies in a program.

Embodiments of the present invention provide a method and system for analyzing the impact of configuration items on software system performance, so as to solve the technical problems of software system misconfiguration and low configuration efficiency in the prior art.

To facilitate understanding, the present invention is described more fully below with reference to the accompanying drawings, which show preferred embodiments of the invention. The invention may, however, be embodied in many different forms and is not limited to the embodiments described here. Rather, these embodiments are provided so that the disclosure of the invention will be thorough and complete.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the invention. The terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the invention. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

(1) Configuration of modern software systems and its complexity

Today, many modern software systems are designed to be highly customizable and can be configured for the user's hardware platform, operating system, and requirements. Configuration lets users conveniently change the behavior or state of a software system without modifying its code, which improves the flexibility and security of the software and meets users' functional or performance needs. A software system that offers configuration to its users is called a configurable system.

Put simply, a configuration can be represented as a set of configuration items (options), each of which expresses some property of the software, such as the hardware platform in use, the operating system, or whether a particular plug-in is loaded. However, while highly customizable configuration can enrich software functionality or improve software performance, it also poses major challenges for users and developers.

Studies have shown that misconfiguration has become one of the main causes of system failures and performance problems. It has been reported that misconfiguration is the second largest cause of service-level incidents in Google's main production services, and that at Facebook misconfiguration caused 16% of service-level incidents and is regarded as a key challenge to Facebook's reliability. A study of enterprise backup systems showed that most task failures were caused by misconfiguration. The consequences can be severe: studies of misconfiguration in commercial storage systems and open-source operating systems showed that a considerable share of misconfigurations lead to hard-to-diagnose crashes, hangs, or severe performance degradation.

Beyond outright configuration errors, understanding what a configuration does is also a major challenge. Users often find it hard to understand the actual effect that changing a configuration item has on the software system, so they are frequently forced to tune the configuration through time-consuming trial and error. Vendors bear a cost as well: reports show that configuration problems are a major source of user support costs for vendors of cloud services and data center software. Configuration also makes the development, testing, and operation of software systems more complex.

In summary, the configuration of modern software systems can meet users' functional or performance needs, but its complexity (reflected in the large number of configuration items and in the interactions and dependencies among them) makes tuning software configuration a major challenge.

(2) Impact of configuration on software system performance

The performance of a software system (together with the energy consumption and operating cost that usually follow directly from it) is an attribute that both users and developers value highly. Users typically want a system that provides the required functionality with lower energy consumption and operating cost; developers typically want to build systems that can be configured efficiently and deliver a high-quality user experience. The inventors found that, in open-source cloud systems, performance problems account for about 50% of the configuration-related patches released and about 30% of configuration-related forum discussions. In cloud systems, the severe performance problems and outages caused by performance-related misconfiguration have cost hundreds of millions of dollars.

Setting performance-sensitive configuration items is a challenging task that often requires a deep understanding of the software system. For example, setting a configuration item may involve balancing memory usage against system response time, and making that trade-off requires a deep understanding of the software system, the hardware in use, or the current workload. To make matters worse, software documentation often does not explain these relationships clearly, and even when it does, factors such as workload tend to be complex and variable, which makes it hard for users to choose an appropriate configuration.

In another example, setting a configuration item to a particular value may cause every write operation issued by the user to take a lock, which increases write latency, while the documentation only states that with this value there is no restriction on the objects that write operations support. Unless users trace the code down to the specific implementation logic, they cannot understand why system performance changed.

(3) Existing technical solutions

A performance-influence model of a software system expresses how configuration affects system performance, and applying such models is an important class of techniques for analyzing the relationship between configuration and software system performance. Different configurations are fed to the performance-influence model to obtain predicted performance values, from which one judges whether a configuration affects system performance. Existing technical solutions therefore differ mainly in how they build the performance-influence model.

One class of related work builds the performance-influence model with a black-box approach. The idea is to treat the software system as a black box, sample the configuration space to obtain a subset of configurations, measure the system's performance under each configuration in the subset for a specific workload, and then learn a performance-influence model from these observations.

Existing black-box methods must trade modeling cost against model accuracy. A more accurate model requires more samples, more samples require a larger configuration subset, and the target software system must then be measured more times under the given workload, at a greater cost in time. In addition, most performance-influence models built with black-box methods are based on deep learning models, whose interpretability is generally low; they cannot explain the root cause of a performance change that follows a change to a configuration item.

Another class of related work builds the performance-influence model with a white-box approach. A white-box method no longer treats the software system as a monolithic black box. Following the ideas of program analysis, it divides the system into multiple components or modules and analyzes and models each of them, thereby building a performance-influence model of the whole system. Besides predicting performance correctly, such a model can also explain why the observed performance arises, for example which components or modules are responsible for it.

Existing white-box methods for building performance-influence models nevertheless have shortcomings. Some support only configuration items of Boolean type or of exhaustively enumerable finite numeric type (a configuration item of finite enumerable type must be discretized into several Boolean configuration items). This is a significant limitation, and discretization effectively increases the number of configuration items substantially, so that the tool's running time grows exponentially.

Other white-box methods can learn a more accurate performance-influence model, but they still need to prepare a runtime environment for the software system, sample the configuration space, run the target software system under the sampled configuration subset, and collect the system's performance metrics at run time.

Although the existing methods above differ in implementation details and each has its own drawbacks, they all share an unavoidable runtime cost: each of them must prepare a runtime environment for the software system, select and iterate over a configuration subset, and measure the system's performance under each configuration in the subset for a specific workload. The time required to run the software system repeatedly and collect its runtime performance metrics far exceeds the time required to analyze the data and build the performance-influence model.

(4) Program analysis

Program analysis is the process of automatically analyzing the behavior of a program. It focuses on properties such as correctness, robustness, safety, and liveness. In other words, program analysis systematically examines a program in order to determine its properties.

Program analysis can be divided into:

1) Static program analysis: analyzing the program without running it;

2) Dynamic program analysis: running the program on a real or virtual processor and analyzing it according to its runtime behavior.

Static program analysis cannot obtain runtime information, but because it does not need to actually run the program, it saves a great deal of time and computing resources compared with dynamic program analysis. Moreover, more information is not always better in program analysis; a balance must be struck between benefit and cost. The method proposed in the present invention therefore uses static program analysis.

(5) Taint analysis

Taint analysis is a kind of program analysis, also known as information-flow analysis. It detects whether sensitive or private information in the source code can be obtained through an injected vulnerability. Taint analysis is generally used to identify how user input flows through a system in order to understand the security implications of the system's design. It can be divided into static taint analysis and dynamic taint analysis.

Taint analysis is defined over a quadruple (P, SO, SI, SA), where:

1) P denotes the program under analysis (Program);

2) SO denotes the set of sources (Source); a source is information that needs to be tracked;

3) SI denotes the set of sinks (Sink); a sink is a program statement that a source should not reach;

4) SA denotes the set of sanitizers (Sanitizer); if a source passes through a sanitizer during propagation, its harmfulness is eliminated.

Theorem 1: A program contains an information-leak vulnerability (taint-flow vulnerability) if and only if there is a path in the program from some source to some sink that does not pass through any sanitizer.
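Theorem 1 can be read as a reachability question on a flow graph. The following is a minimal sketch under that reading, assuming the flow relation is already available as an adjacency map; the function name has_taint_flow and the statement names are hypothetical and do not come from the patent.

```python
from collections import deque

def has_taint_flow(edges, sources, sinks, sanitizers):
    """Return True if some source reaches some sink without crossing a sanitizer."""
    blocked = set(sanitizers)
    queue = deque(s for s in sources if s not in blocked)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in sinks:
            return True
        for nxt in edges.get(node, ()):
            # Sanitizers are never entered, so every explored path avoids them.
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical flow graph: statement -> statements its data can flow to.
flow = {"read_option": ["clean", "copy"], "clean": ["alloc"], "copy": ["alloc"]}
leaky = has_taint_flow(flow, {"read_option"}, {"alloc"}, {"clean"})

# Same source and sink, but the only path runs through the sanitizer.
only_clean = {"read_option": ["clean"], "clean": ["alloc"]}
safe = has_taint_flow(only_clean, {"read_option"}, {"alloc"}, {"clean"})
```

In the first graph the path through copy bypasses the sanitizer, so a taint flow exists; in the second graph every path is sanitized.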

It should be noted that "vulnerability" here refers broadly to all sinks in the program code. In information security analysis, vulnerabilities are all the places in a program where information can leak. Outside the information security context, vulnerabilities are all the places where sensitive information is used.

In this embodiment, in the context of performance analysis of configurable systems, a taint-flow vulnerability exists exactly when there is a path along which a configuration item influences a space-intensive or time-intensive operation.

(6) Program dependency analysis

Control dependencies and data dependencies of a program can be defined in several ways. The following adopts one of the more intuitive definitions:

(6.1) Control dependencies

Definition of control dependency: for any branch statement S1 and any program statement S2:

If S1 is the branch statement that precedes S2 and is closest to S2, S1 has multiple branch targets, and a change in the branch decision of S1 may cause S2 not to be executed, then S2 is said to be control dependent on S1, or S2 has a control dependency on S1, written S2 δc S1.

(6.2) Data dependencies

Data dependencies exist between program statements that access or modify the same resource. They include flow dependencies, anti-dependencies, output dependencies, and input dependencies, of which flow dependency is the most basic.

(6.3) Program dependency analysis and program dependency graph

The purpose of program dependency analysis is to determine the control dependencies and data dependencies in a program. Unlike the statement-level definitions above, practical program dependency analysis generally takes the basic block as its smallest unit.

A program dependency graph (PDG) describes the control dependencies and data dependencies of a program.

An example program is as follows:

FIG. 3 is the program dependency graph of the example program above, in which solid lines denote control dependencies and dashed lines denote data dependencies.

(7) Random forest

A decision tree is a white-box predictive model commonly used in data mining and machine learning. Its structure is a flowchart-like tree in which:

· each internal node represents a test on an attribute;

· each branch represents an outcome of that test;

· each leaf node represents a class label;

· each path from the root to a leaf represents a classification rule. The classification rules are constructed by the decision tree algorithm from the feature vectors and class labels.

Decision tree learning builds a decision tree from a source data set. During learning, the data set is split repeatedly and the tree is pruned recursively until no further split is possible or a branch can be assigned to a single class. A learned decision tree tends to overfit the training set.

A random forest is a classifier that contains multiple decision trees; its output class is the mode of the classes output by the individual trees. When a random forest is built, multiple decision trees are constructed randomly from different parts of the training set.
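The voting step described above can be sketched as follows. This is only an illustration of taking the mode of the trees' outputs: the three "trees" are hand-written stand-in rules rather than learned decision trees, the random construction from parts of the training set is not shown, and the labels "affects" and "no_effect" are assumed names.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the most common class among the individual tree predictions."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": each maps a feature vector to a class label.
trees = [
    lambda x: "affects" if x[0] > 0 else "no_effect",
    lambda x: "affects" if x[1] > 2 else "no_effect",
    lambda x: "affects" if x[0] + x[1] > 1 else "no_effect",
]

label_a = forest_predict(trees, [3, 0])   # two of three trees vote "affects"
label_b = forest_predict(trees, [0, 1])   # all three trees vote "no_effect"
```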

As a widely used classifier, random forest has several notable advantages:

1) in many application scenarios, a random forest is not prone to overfitting;

2) when a random forest is used on high-dimensional data (data with many features), feature selection is generally unnecessary;

3) for imbalanced classification data sets, a random forest can balance the error.

(8) Soot: an analysis and transformation framework for Java and Android applications

Soot began as a Java optimization framework and gradually developed into a framework for analyzing, instrumenting, optimizing, and visualizing Java and Android applications. In brief, Soot converts the input program (Java bytecode) into an intermediate representation (IR), analyzes and transforms the IR, and can then convert the processed code into a target language such as Java bytecode for output.

With the Soot framework, the following can be done:

· build a call graph;

· perform points-to analysis;

· build def-use chains (the basis of data-flow analysis, on which data dependencies can be analyzed);

· perform template-driven intraprocedural data-flow analysis;

· perform template-driven interprocedural data-flow analysis;

· perform flow-, field-, and context-sensitive pointer analysis.

(9) FlowDroid: a taint analysis framework for Java and Android applications

FlowDroid is a context-, flow-, field-, and object-sensitive static taint analysis framework for Java and Android applications. It is implemented on top of Soot and Heros, where Heros is a generic multi-threaded solver for IFDS (interprocedural finite distributive subset) and IDE (interprocedural distributive environment) problems.

FlowDroid ensures context and flow sensitivity by building a fairly precise call graph, and ensures field and object sensitivity through IFDS-based flow functions. To ensure context and field sensitivity, FlowDroid implements precise and efficient alias tracking.

FIG. 4 shows a concrete example of FlowDroid taint analysis. Steps 1 to 7 in FIG. 4 form a path, found by FlowDroid, from a source to a sink that does not pass through a sanitizer. In the process FlowDroid discovers that z.g.f, a.g.f, and b.g are all aliases of x.f.

A program often uses differently named local variables, class fields, global variables, and so on to refer to the same variable. When the program is not running, it cannot be determined which names refer to a given variable. Static program analysis therefore needs alias analysis to obtain all the names that refer to a variable.

Referring to FIG. 1 and FIG. 5, the present invention provides an embodiment of a method for analyzing the impact of configuration items on software system performance, comprising:

S100: identifying and marking all performance operations in the software system according to preset code patterns of the software system, where a performance operation is a time-intensive operation and/or a space-intensive operation that affects the performance of the software system;

S200: identifying the dependencies between each performance operation and each configuration item of the software system to obtain the performance operation set corresponding to each configuration item, where every performance operation in the set has a dependency on that configuration item;

S300: constructing the feature vector corresponding to each configuration item from its performance operation set;

S400: inputting the feature vector of each configuration item into a trained qualitative performance impact model, determining whether the configuration item affects software system performance, and obtaining the set of configuration items that affect software system performance, where the qualitative performance impact model is trained on feature vectors of the configuration items of multiple software systems.

The analysis method provided in this embodiment is a new white-box configuration performance analysis method. In step S100, all performance operations in the software system are identified and marked according to the preset code patterns of the software system.

A performance operation (PerfOp) in this embodiment is a time-intensive or space-intensive operation. The main difference is that a time-intensive operation takes the computer a long time to complete, whereas a space-intensive operation consumes a large amount of resources such as memory and disk.

Performance operations are, understandably, strongly correlated with time-consuming and space-consuming operations. Note that being time-intensive is not the same as having high time complexity, and likewise being space-intensive is not the same as having high space complexity. Complexity is evaluated with respect to input size. For example, suppose a function f() has very low time and space complexity, but an operation o must run f() 1000 times to complete. Operation o may then be time-intensive or space-intensive, yet its complexity still depends on that of f(); since f() has low complexity, so does o. In other words, o can have very low time and space complexity and still be a time-intensive or space-intensive operation.
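The distinction can be made concrete with a small sketch. The names f and operation_o mirror the example in the text; the counter is an assumed device for making the total amount of work visible and is not part of the patent.

```python
def f(counter):
    """Constant-time, constant-space work: one increment per call."""
    counter["calls"] += 1

def operation_o(repeats=1000):
    """Runs f() a fixed number of times, independent of any input size."""
    counter = {"calls": 0}
    for _ in range(repeats):
        f(counter)
    return counter["calls"]

total_calls = operation_o()
```

The cost of operation_o does not grow with input size, so its complexity is constant, yet it performs 1000 units of work on every invocation. If each unit were an expensive action such as a disk write, the operation would be time-intensive despite its low complexity.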

Referring to FIG. 6, based on observation and study of software systems, and taking Java software systems as an example, this embodiment divides performance operations into four categories and summarizes the corresponding code patterns, as shown in FIG. 6. The categories in FIG. 6 are Java IO, thread operations, synchronization operations, and array creation, and each has its own code patterns. For example, the code patterns for Java IO are calls to methods in the java.io package and calls to methods in the java.nio package.

It should be noted that different types of software systems involve different performance operations, and the categories above do not necessarily cover every software system. The method proposed in this embodiment is general and extensible: a new category of performance operation is supported as soon as it is defined and its code pattern is provided.
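One way to picture pattern-based identification and its extensibility is a table from category to call-target prefixes. The sketch below is an assumption-laden illustration: only the java.io and java.nio prefixes come from the text, the thread prefix and the "network" category are hypothetical, and matching on name prefixes is a simplification of whatever the patterns in FIG. 6 actually specify.

```python
# Category -> prefixes of fully qualified call targets.
# Only the java.io / java.nio prefixes come from the text; the thread prefix
# is a hypothetical stand-in for a pattern the text does not spell out.
PATTERNS = {
    "java_io": ("java.io.", "java.nio."),
    "thread": ("java.lang.Thread.",),
}

def classify(call_target, patterns):
    """Return the PerfOp category whose pattern matches, or None."""
    for category, prefixes in patterns.items():
        if call_target.startswith(prefixes):
            return category
    return None

def with_category(patterns, name, prefixes):
    """Return a new pattern table extended with one more category."""
    extended = dict(patterns)
    extended[name] = tuple(prefixes)
    return extended

calls = ["java.io.FileInputStream.read", "java.util.List.size", "java.net.Socket.connect"]
before = [classify(c, PATTERNS) for c in calls]
# "network" is a made-up category used only to show extensibility.
extended = with_category(PATTERNS, "network", ["java.net."])
after = [classify(c, extended) for c in calls]
```

Adding a category only adds a table entry; the classification routine itself is unchanged.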

In step S200, the dependencies between each performance operation and each configuration item of the software system are identified to obtain the performance operation set corresponding to each configuration item, where every performance operation in the set has a dependency on that configuration item.

In this embodiment, before the dependencies between the performance operations and the configuration items are identified, the method further includes extracting configuration item information of the software system, which includes at least the names and number of the configuration items and the API used to load them into the software system.

In this embodiment, any configuration item that affects software system performance has a data dependency or a control dependency with some performance operation. Referring to FIG. 7, the configuration item there is OptionX, an abstract configuration item that can stand for any configuration item in the program. In FIG. 7, the performance operation that creates the array arr has both data dependencies and control dependencies on OptionX. The control dependencies comprise if-branch control dependencies and loop control dependencies, and the latter comprise loop-bound control dependencies and loop-step control dependencies.

Note that the control dependencies and data dependencies between configuration items and performance operations are largely independent yet interrelated: in most cases they are identified independently, but some cases require combining the two.

Taint analysis, as a means of tracking information flow, is essentially a data-flow analysis technique, and it can be used to identify the data dependencies of configuration items in a program. In step S200 of this embodiment, taint analysis is used to track information flow and thereby identify the data dependencies between performance operations and configuration items. The procedure for identifying data dependencies is shown in FIG. 8, and each step is explained below:

Step 1: enter the program entry point.

If the program provides multiple entry points, a virtual entry point is created as the unique entry point, with a control-flow edge from the virtual entry point to every program entry point.

Step 2: traverse the control flow and insert sinks to mark the performance operations.

The control flow is traversed; once the relevant branch statements and performance operations have been identified through the code patterns, a statement that calls a sink function is inserted before each of them.

Step 3: return to the program entry point and traverse the control flow again.

The first two steps are preparation (setting up the sinks). Step 3 returns to the entry point and traverses the control flow again in order to analyze the program with the sinks inserted.

Step 4: create taints (sources) at the configuration-loading API.

The configuration-loading API is where configuration first enters the program, so tracking starts there.

Step 5: taint propagation (information-flow propagation).

Put simply, the source is the initial taint, and as data propagates the taint marks other variables as tainted too. The path of taint propagation is therefore the data dependency chain.

Step 6: record the sinks reached.

If the taint created for a configuration item at the configuration-loading API eventually reaches a pre-inserted sink through data propagation, this shows that the performance operation (or branch statement) at that sink has a data dependency on the configuration item.

It can be understood that, in this embodiment, identifying the data dependencies between the performance operations and the configuration items of the software system using taint analysis specifically comprises: entering the program entry point of the software system, traversing the control flow, and creating a taint at the configuration-loading API as the source; and recording the data propagation path of the source and the sinks it finally reaches, whereby the performance operation at each such sink has a data dependency on the configuration item. Here a sink is a program statement that a source should not reach, and the sinks are set in advance before the statements corresponding to the performance operations.
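The steps above can be sketched on a toy straight-line program. This is a deliberately simplified illustration, not the FlowDroid-based analysis of the embodiment: it assumes no branches, calls, or aliasing, and the statement encoding ("load", "assign", "sink") and all identifiers are hypothetical.

```python
def data_dependencies(statements):
    """Map each option to the PerfOps reached by its taint (straight-line code only)."""
    taint = {}      # variable -> set of option names it depends on
    reached = {}    # option name -> set of PerfOp identifiers
    for kind, target, args in statements:
        if kind == "load":
            # Configuration-loading API: create the taint (source).
            taint[target] = {args}
        elif kind == "assign":
            # Propagation: the target inherits the taint of every operand.
            taint[target] = set().union(*(taint.get(a, set()) for a in args))
        elif kind == "sink":
            # Pre-inserted sink before a PerfOp: record every option that arrives.
            for a in args:
                for option in taint.get(a, ()):
                    reached.setdefault(option, set()).add(target)
    return reached

program = [
    ("load", "n", "OptionX"),
    ("load", "m", "OptionY"),
    ("assign", "size", ["n"]),
    ("assign", "k", []),                   # a constant, carries no taint
    ("sink", "new_array@12", ["size"]),
    ("sink", "file_read@20", ["k"]),
]
deps = data_dependencies(program)
```

Here the array creation has a data dependency on OptionX, while OptionY reaches no sink and so has no recorded data dependency.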

Note that control flow is the order in which the statements, instructions, or functions in the code are executed, and it must be taken into account when identifying data dependencies: only through the control flow is the execution order of the parts of the program code known.

In step S200 of this embodiment, the control dependencies between performance operations and configuration items are identified by constructing the control region of each configuration item.

Specifically, the control region of a configuration item is a sequence of statements that have a control dependency on that configuration item, such that, in control-flow order, the statement following the sequence is the immediate postdominator of the statements in the sequence.

In a control flow graph, the postdominance relation is defined as follows: for control-flow nodes n and m, if every path that starts at the program entry and passes through n must pass through m to reach the program exit, then m postdominates n. If m postdominates n and does not postdominate any other node that postdominates n, then m immediately postdominates n, and m is the immediate postdominator of n.
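The definition can be illustrated with a small iterative computation. The sketch below uses the standard fixed-point formulation of postdominator sets, which is a common technique and not necessarily the one used in the embodiment; it assumes a single exit node reachable from every node, and the graph and node names are hypothetical.

```python
def postdominators(cfg, exit_node):
    """Compute, for every node, the set of nodes that postdominate it."""
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            succ_sets = [pdom[s] for s in cfg[n]]
            # n postdominates itself, plus whatever postdominates all successors.
            new = {n} | (set.intersection(*succ_sets) if succ_sets else set())
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

def immediate_postdominator(pdom, n):
    """The strict postdominator of n that postdominates no other one."""
    strict = pdom[n] - {n}
    for m in strict:
        if all(m not in pdom[p] for p in strict - {m}):
            return m
    return None

# A simple if/else diamond.
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}
pdom = postdominators(cfg, "exit")
ipdom_cond = immediate_postdominator(pdom, "cond")
```

For the branch node cond, both join and exit are postdominators, and join is the immediate one, so the statements between cond and join form the region governed by the branch.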

Intuitively, the control region of a given configuration item is the section of the program directly controlled by that configuration item. FIG. 9 marks the regions of influence of four configuration items: OptionA, OptionB, OptionC, and OptionD.

In this embodiment, using the program dependency graph to identify the control dependencies between the performance operations and the configuration items of the software system comprises: traversing all nodes of the program dependency graph and constructing the control region of each configuration item, the control region being a sequence of statements with a direct control dependency on that configuration item; and identifying the control dependencies between the performance operations and the configuration items from the control regions.

Specifically, the process of identifying the control dependencies between performance operations and configuration items is:

1) build the program dependency graph using program dependency analysis;

2) traverse all nodes of the program dependency graph and construct the configuration item control regions;

3) for a given configuration item, once its control region has been constructed, the performance operations within the region are of two kinds: those with both a data dependency and a control dependency on the configuration item, and those with a control dependency but no data dependency on it.

Note that analyzing control dependencies uses control flow, but control flow alone can analyze only part of them. Specifically, simple control flow alone can identify only the performance operations that have both data and control dependencies on a configuration item; identifying those with a control dependency but no data dependency requires constructing the configuration item's control region.
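A minimal sketch of the control-region idea follows. It assumes the branch whose condition reads the configuration item and that branch's immediate postdominator are already known, and it works on a plain control flow graph rather than a full program dependency graph; all node names are hypothetical.

```python
def control_region(cfg, branch, ipdom):
    """Nodes reachable from `branch` before its immediate postdominator."""
    region, stack = set(), list(cfg[branch])
    while stack:
        node = stack.pop()
        if node in (branch, ipdom) or node in region:
            continue
        region.add(node)
        stack.extend(cfg[node])
    return region

cfg = {
    "if_optionA": ["new_array", "log"],   # branch condition reads OptionA
    "new_array": ["join"],
    "log": ["join"],
    "join": ["file_write"],
    "file_write": [],
}
perf_ops = {"new_array", "file_write"}    # statements already marked as PerfOps
data_dependent = set()                    # assume taint analysis found none for OptionA

region = control_region(cfg, "if_optionA", "join")
controlled = region & perf_ops            # PerfOps with a control dependency
control_only = controlled - data_dependent
```

The array creation lies inside the region and is therefore control dependent on OptionA even though no data flows from the option into it, which is the second kind of performance operation described in item 3). The file write after the join point lies outside the region.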

In step S300, the feature vector corresponding to each configuration item is constructed from its performance operation set.

For any configuration item, the set of performance operations that have a data dependency or control dependency on it is now known. The performance operation set of a configuration item can be regarded as a specific set of code locations identified through specific code patterns.

After the performance operation set of a configuration item is obtained, a feature vector is constructed for the configuration item from it; that is, the feature vector describes the dependencies between the configuration item and performance operations.

Constructing the feature vector is fairly involved. The construction of part of it is explained below:

Let Options be the set of configuration items and PerfOps the set of the four categories of performance operation proposed in this embodiment. For any configuration item option ∈ Options and performance operation category perfOp ∈ PerfOps, let f(option, perfOp) denote the number of performance operations of category perfOp in the target software system that have a data dependency or control dependency on option.

In addition, let perfOp^data, perfOp^if, and perfOp^loop denote the perfOp operations that have, respectively, a data dependency, an if-branch control dependency, and a loop control dependency on option. For k ∈ {data, if, loop}, let g(option, perfOp^k) be the number of performance operations perfOp in the target software system that have the dependency corresponding to k on option.
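The counting functions f and g can be sketched over a flat list of dependency records. The record layout (option, category, kind, operation id), the category identifiers, and their order after Java IO are assumptions made for illustration; the text fixes only that Java IO is the first category.

```python
# Assumed identifiers for the four PerfOp categories, Java IO first.
CATEGORIES = ["java_io", "thread", "sync", "array"]

def f(deps, option, category):
    """Number of distinct PerfOps of `category` with any dependency on `option`."""
    return len({op for (opt, cat, kind, op) in deps
                if opt == option and cat == category})

def g(deps, option, category, kind):
    """Number of distinct PerfOps of `category` with a `kind` dependency on `option`."""
    return len({op for (opt, cat, k, op) in deps
                if opt == option and cat == category and k == kind})

# Hypothetical dependency records: (option, category, kind, PerfOp id).
deps = [
    ("OptionX", "array", "data", "arr@7"),
    ("OptionX", "array", "loop", "arr@7"),     # same operation, second kind of dependency
    ("OptionX", "java_io", "if", "read@15"),
    ("OptionY", "thread", "data", "start@30"),
]

totals_x = [f(deps, "OptionX", c) for c in CATEGORIES]
array_loop_x = g(deps, "OptionX", "array", "loop")
array_if_x = g(deps, "OptionX", "array", "if")
```

An operation that depends on an option in more than one way is counted once by f but once per kind by g, which is why the sketch collects operation identifiers into sets before counting.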

The first 22 dimensions of the feature vector v are:

Here i and j serve only as subscript counters. For example, when i = 0, the first formula, v₀ = f(option, PerfOps₀), states that the first dimension of the feature vector is the number of operations of the first performance operation category, Java IO, that have a dependency on option.

As for the terms based on g(option, perfOp^k), take i = 0 as an example. For k = data, the term concerns the operations of the first class of performance operations, Java IO, that have a data dependency on option. Similarly, for k = if, it concerns the Java IO operations that have an if-branch control dependency on option, and for k = loop, it concerns the Java IO operations that have a loop control dependency on option.
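The counting functions f and g can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the `Dependency` record, the class names other than Java IO, the use of a code location to identify an operation, and the order in which the counts are concatenated are all assumptions made for the example.

```python
from typing import NamedTuple

# Hypothetical class names; only "JavaIO" is named in the text.
PERF_OPS = ["JavaIO", "OpClassB", "OpClassC", "OpClassD"]
KINDS = ["data", "if", "loop"]


class Dependency(NamedTuple):
    option: str   # configuration item name
    perf_op: str  # class of the dependent performance operation
    kind: str     # "data", "if", or "loop"
    site: str     # code location of the operation (hypothetical identifier)


def f(deps, option, perf_op):
    """Number of distinct perf_op operations with any dependency on option."""
    return len({d.site for d in deps
                if d.option == option and d.perf_op == perf_op})


def g(deps, option, perf_op, kind):
    """Number of distinct perf_op operations with a `kind` dependency on option."""
    return len({d.site for d in deps
                if d.option == option and d.perf_op == perf_op and d.kind == kind})


def feature_counts(deps, option):
    """Concatenate the f counts, then the g counts (this ordering is assumed)."""
    totals = [f(deps, option, p) for p in PERF_OPS]
    by_kind = [g(deps, option, p, k) for p in PERF_OPS for k in KINDS]
    return totals + by_kind
```

An operation that depends on the same configuration item in more than one way is counted once by f and once per dependency kind by g, which is one plausible reading of the definitions.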

In step S400, the feature vector corresponding to each configuration item is input into the trained qualitative performance impact model, which determines whether the configuration item affects software system performance, yielding the set of configuration items that affect software system performance. The qualitative performance impact model is obtained by training on the feature vectors corresponding to the configuration items of multiple software systems.

In this embodiment, a qualitative performance impact model of configuration items must first be constructed. Thereafter, for any new target software system, once the corresponding feature vector has been computed for each of its configuration items by the foregoing method, the feature vector is input into the constructed model, which automatically determines whether that configuration item affects the performance of the software system. Every configuration item that affects performance is added to the configuration item set, which finally contains all configuration items that affect the performance of the software system. The specific process is as follows:

First, several software systems are collected to build the training set. Each software system is run multiple times under different configurations and the actual running time is recorded, in order to determine whether each configuration item of the software system actually affects its performance. In addition, the feature vector corresponding to each configuration item in the software system is constructed, which yields the training set data.
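The labeling step can be sketched as below. The text does not state how the recorded running times are turned into a label, so the criterion used here, a relative difference between mean running times above an assumed 5% threshold, is purely illustrative, as are the function names. The labels 1 and -1 follow the class encoding used in the experiments.

```python
from statistics import mean


def affects_performance(times_by_value, rel_threshold=0.05):
    """Label a configuration item from measured running times.

    times_by_value maps each tested value of the item to the running times
    (in seconds) recorded with all other items held fixed. The 5% relative
    threshold is an assumed criterion, not one given in the text.
    """
    means = [mean(ts) for ts in times_by_value.values()]
    fastest, slowest = min(means), max(means)
    return (slowest - fastest) / fastest > rel_threshold


def build_training_row(feature_vector, times_by_value):
    """Pair a feature vector with its label: 1 = affects performance, -1 = does not."""
    label = 1 if affects_performance(times_by_value) else -1
    return feature_vector, label
```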

Then, this embodiment establishes the qualitative performance impact model of configuration items. As shown in FIG. 10, the model consists of two parts: a random forest classification model and the cDEP configuration item dependency detector. The random forest model is trained with the random forest algorithm in the sklearn library; the key point is to divide the feature vectors into a training set and a test set, which are used to build the random forest classification model RandomForestClassifier.

In this embodiment, a feature vector is constructed for each configuration item, and its dimension is positively correlated with the number of classes to which the performance operations appearing in the target software system belong. The dimension of the feature vector is therefore usually high. In addition, because each software system has different functions, the frequency and characteristics of performance operations differ markedly between different kinds of software systems (for example, compute-intensive or memory-intensive), so the data in the constructed training set is likely to be imbalanced.

Random forests are not prone to overfitting, can handle high-dimensional data, and can balance the error on classification datasets, so they address the problems above; they can also learn a number of interpretable classification rules. This embodiment therefore uses a random forest classification model to perform the binary classification "does the configuration item affect software system performance", giving a preliminary, qualitative answer to the question of how configuration items affect software system performance.

Up to this point, this embodiment has not considered dependencies between configuration items and has treated each configuration item as independent. In practice, however, the configuration items of a software system may depend on one another, and configuration items with dependencies usually have to be considered together when the configuration is adjusted.

This embodiment holds that if configuration item OptionA depends on configuration item OptionB, and OptionA affects software system performance, then OptionB also affects software system performance.

cDEP is a detection tool for discovering dependencies between configuration items, proposed by Qingrong Chen et al. in 2020. To take the dependencies between configuration items into account, this embodiment uses cDEP to detect the dependencies between the configuration items of the software system and uses them to further correct and refine the classification results of the random forest classification model.

In this embodiment, the correction of the random forest classification results by the configuration item dependency detector includes:

When a first configuration item of the software system depends on a second configuration item, if the random forest classification model determines that the first configuration item affects software system performance and the second does not, the cDEP configuration item dependency detector corrects the second configuration item to "affects software system performance".
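The correction rule can be sketched as a single pass over the dependency pairs. The function name and the data shapes are assumptions for illustration; in the method the pairs would come from cDEP, whose actual output format is not modeled here.

```python
def apply_dependency_correction(predicted, depends_on):
    """Apply the correction rule to random-forest predictions.

    predicted maps option name -> 1 (affects performance) or -1 (does not).
    depends_on is a list of (first, second) pairs meaning `first` depends on
    `second`. The input dictionary is left unchanged.
    """
    corrected = dict(predicted)
    for first, second in depends_on:
        # Rule from the text: the first item affects performance and the
        # second was judged not to, so the second is corrected.
        if predicted.get(first) == 1 and predicted.get(second) == -1:
            corrected[second] = 1
    return corrected
```

The rule is applied only to the labels produced by the random forest, as stated in the text. Whether a corrected item should in turn trigger corrections along longer dependency chains is not specified, so the sketch does not propagate transitively.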

The method for analyzing the impact of configuration items on software system performance provided by this embodiment of the present invention uses program analysis to track the time-intensive or space-intensive operations that depend on configuration items and builds a feature vector for each configuration item from the program analysis results. The analysis is fine-grained down to the individual configuration item, is not limited to Boolean types or exhaustively enumerable finite value types, and supports configuration items of any type. A random forest is used to build the qualitative performance impact model, so no time-consuming local measurements are needed: a single analysis of the software system's source code is enough to determine whether a specific configuration item affects the performance of the configurable system, which greatly reduces the cost of performance analysis. Without running the software system, the method can predict fairly accurately whether each configuration item affects software system performance, can find the set of configuration items that actually affect performance, can improve the efficiency of configuring the software system, and helps configure the software system correctly to improve its performance. In addition, the embodiment is interpretable: the program analysis results and the classification rules of the performance model reveal the underlying reasons why a configuration item affects performance.

Referring to FIG. 2, the present invention also provides an embodiment of a system for analyzing the impact of configuration items on software system performance, comprising:

a performance operation identification module 11, configured to identify and mark all performance operations in the software system according to preset code patterns of the software system, the performance operations being time-intensive and/or space-intensive operations that affect software system performance;

a dependency identification module 22, configured to identify the dependencies between each performance operation and each configuration item of the software system to obtain the performance operation set corresponding to each configuration item, each performance operation in the set having a dependency on the configuration item;

a feature vector construction module 33, configured to construct the feature vector corresponding to each configuration item according to the performance operation set;

a configuration item set determination module 44, configured to input the feature vector corresponding to each configuration item into the trained qualitative performance impact model, determine whether the configuration item affects software system performance, and obtain the set of configuration items that affect software system performance, the qualitative performance impact model being obtained by training on the feature vectors corresponding to the configuration items of multiple software systems.

The system for analyzing the impact of configuration items on software system performance provided by this embodiment of the present invention uses program analysis to track the time-intensive or space-intensive operations that depend on configuration items and builds a feature vector for each configuration item from the program analysis results. The analysis is fine-grained down to the individual configuration item, is not limited to Boolean types or exhaustively enumerable finite value types, and supports configuration items of any type. A random forest is used to build the qualitative performance impact model, so no time-consuming local measurements are needed: a single analysis of the software system's source code is enough to determine whether a specific configuration item affects the performance of the configurable system, which greatly reduces the cost of performance analysis. Without running the software system, the system can predict fairly accurately whether each configuration item affects software system performance, can find the set of configuration items that actually affect performance, can improve the efficiency of configuring the software system, and helps configure the software system correctly to improve its performance. In addition, the present invention is interpretable: the program analysis results and the classification rules of the performance model reveal the underlying reasons why a configuration item affects performance.

In addition, based on the program-analysis-based white-box performance analysis method proposed above, the present invention designs and implements ConfigAnalyzer, a configuration analysis tool for Java applications.

ConfigAnalyzer implements the white-box performance analysis method proposed by the present invention and supports Java software systems. It uses FlowDroid to perform static taint analysis that is context-, flow-, field-, and object-sensitive, and makes the necessary custom extensions to the Soot analysis framework on which FlowDroid is based in order to support the analysis logic that ConfigAnalyzer requires.

ConfigAnalyzer is divided into the following two main modules:

1) Program analysis module: marks performance operations and identifies the dependencies between configuration items and performance operations;

2) Performance model module: constructs the feature vectors of configuration items and the qualitative performance impact model. This module has two parts: constructing feature vectors from the results of the program analysis module, and constructing the qualitative performance impact model from the feature vectors.

The program analysis module of ConfigAnalyzer is shown in FIG. 11. It contains three packages, whose roles are as follows:

1) edu.sysu.dds.analysis: implements taint analysis, program dependency analysis, information extraction, and so on;

2) edu.sysu.dds.visual: visualizes intermediate results;

3) edu.sysu.dds.utility: provides helper functionality for the two packages above.

Information extraction is mainly used to extract configuration item information from the software system, such as which configuration items exist, how many there are, and which APIs are used to load them into the software system.

The visualization of intermediate results mainly includes visualization of the control regions of configuration items (as shown in FIG. 9), visualization of the inserted marker sinks, and so on.

The present invention follows the evaluation approach used in existing research on white-box performance analysis methods and, balancing evaluation quality against workload, selects 6 representative software systems from 19 existing real-world software systems to evaluate the qualitative performance impact model built by ConfigAnalyzer. Table 1 gives an overview of the target software systems.

Table 1

Note that a valid configuration is one under which the software system executes correctly without crashing. Such configurations allow the software system to complete its task, but the resources needed to complete the task differ between configurations.

Consider 5 configuration items of Boolean type, each with the value range {false, true}. These 5 configuration items alone yield 2⁵ = 32 configurations, and 10 Boolean configuration items yield 1024 configurations. If the configuration items are not Boolean, the value space is larger still, and the total number of configurations becomes very large.
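The growth of the configuration space can be checked with a short enumeration. The option names and the helper function below are hypothetical and serve only to reproduce the arithmetic.

```python
from itertools import product


def enumerate_configurations(domains):
    """Yield every configuration as a dict, given option name -> list of values."""
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        yield dict(zip(names, values))


# Hypothetical option names opt0..opt4, each Boolean.
five_bool = {f"opt{i}": [False, True] for i in range(5)}
print(sum(1 for _ in enumerate_configurations(five_bool)))  # 2**5 = 32
```

With 10 Boolean options the same enumeration yields 1024 configurations, and any option with more than two values multiplies the total by the size of its value range.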

In the experiments of the present invention, the 6 target software systems are divided into two groups: the data corresponding to the configuration items (total) of the four software systems Batik, H2, Kanzi, and Prevayler are used as the training set, and the samples corresponding to the configuration items (total) of the two software systems Catena and Sunflow are used as the test set.

The feature vector constructed from the program analysis results has 50 dimensions in total. A random forest regressor model is built, and cDEP is then run to refine the configuration dependencies.

In this embodiment, after the program analysis module of ConfigAnalyzer generates the feature vector of a configuration item, the feature vector is input into the random forest model. A random forest consists of many decision trees; the feature vector passes through every decision tree, each of which applies its classification rules, to obtain the predicted class label.
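How a forest combines its trees can be sketched with hard majority voting over toy one-split trees. Both the stump construction and the voting scheme are simplifying assumptions for illustration; library implementations may combine trees differently, for example by averaging per-tree class probabilities.

```python
from collections import Counter


def make_stump(dimension, threshold):
    """A one-split tree: predicts 1 if v[dimension] > threshold, else -1."""
    return lambda v: 1 if v[dimension] > threshold else -1


def forest_predict(trees, v):
    """Return the label predicted by the most trees for feature vector v."""
    votes = Counter(tree(v) for tree in trees)
    return votes.most_common(1)[0][0]
```

Using an odd number of trees avoids ties between the two class labels in this simplified scheme.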

The experimental results are shown in Table 2, which compares the impact predicted by the qualitative performance model for the configuration items of the tested software systems with their actual impact. Here, y denotes the actual class of a configuration item and y_predit denotes the class predicted by the model. A class of -1 means the configuration item does not affect software system performance; a class of 1 means it does.

Table 2

The experimental results show that, without running the program, ConfigAnalyzer correctly predicted whether the configuration item affects performance for 84.21% of the configuration items in the tested software systems. This relatively high accuracy shows that ConfigAnalyzer can effectively build a qualitative performance impact model of a software system.
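The accuracy figure is the fraction of configuration items whose predicted class matches the actual class. A minimal sketch of that computation follows; the labels in it are toy values chosen for illustration and are not the values from Table 2.

```python
def accuracy(y, y_predit):
    """Fraction of configuration items whose predicted class equals the actual class."""
    if len(y) != len(y_predit):
        raise ValueError("y and y_predit must have the same length")
    correct = sum(1 for actual, predicted in zip(y, y_predit) if actual == predicted)
    return correct / len(y)


# Toy labels for illustration only; these are not the values from Table 2.
y = [1, -1, 1, -1]
y_predit = [1, -1, -1, -1]
print(accuracy(y, y_predit))  # 0.75
```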

The present invention provides a system for analyzing the impact of configuration items on software system performance and implements the ConfigAnalyzer tool, a configuration analysis tool for Java applications.

ConfigAnalyzer first uses program analysis techniques such as taint analysis and program control analysis to statically track the time-intensive or space-intensive operations that depend on configuration items. It then builds feature vectors from the program analysis results and uses a random forest to build the qualitative performance impact model.

Without running the software system, ConfigAnalyzer helps users find the set of configuration items that actually affect system performance. Unlike traditional black-box methods, ConfigAnalyzer is interpretable: users can learn the underlying reasons why a configuration item affects performance from the program analysis results and the classification rules of the performance model. Unlike existing white-box methods, ConfigAnalyzer supports configuration items of any type, and because it builds a qualitative performance impact model it needs no time-consuming local measurements, which greatly reduces analysis overhead.

The ConfigAnalyzer tool of the present invention has the following advantages:

(1) Interpretability

ConfigAnalyzer first uses program analysis techniques such as taint analysis and program control analysis to statically track the time-intensive or space-intensive operations that depend on configuration items. It then builds feature vectors from the program analysis results and uses a random forest to build the qualitative performance impact model. Unlike traditional black-box methods, ConfigAnalyzer is interpretable: users can learn the underlying reasons why a configuration item affects performance from the program analysis results and the classification rules of the performance model.

(2) Accuracy

The experimental results show that, without running the program, ConfigAnalyzer correctly predicted whether the configuration item affects performance for 84.21% of the configuration items in the tested software systems, and can effectively build a qualitative performance impact model of a software system.

(3) Analysis granularity

The present invention can determine precisely whether a specific configuration item affects the performance of a configurable system. This distinguishes it from testing methods that treat the software system as a black box, sample the configuration space to obtain a subset of configurations, and measure the system's performance under each configuration in that subset for a specific workload. Such methods can only determine whether a configuration affects the performance of the configurable system; they cannot reach the granularity of individual configuration items.

(4) Efficiency and completeness

The white-box performance analysis methods proposed in existing work support only configuration items of Boolean type or of exhaustively enumerable finite value types (a configuration item of an enumerable finite type must be discretized into several Boolean configuration items). This is a severe limitation, and the discretization effectively increases the number of configuration items substantially, so the running time of the tool grows exponentially. The present invention needs only a single analysis of the configurable system's source code to determine whether a specific configuration item affects the performance of the configurable system. The configuration item types cover all types permitted in Java programs and are not limited to Boolean or enumerable finite value types. No hardware is needed to run the software system, no execution environment has to be set up for the configurable system, and the cost of testing the configurable system under different configurations for a specific workload does not arise.

The embodiments of the present invention provide the ConfigAnalyzer tool. By analyzing the source code of the target system once, ConfigAnalyzer finds the set of configuration items that actually affect system performance, with no restriction on configuration item type. Unlike traditional black-box approaches to building an accurate performance impact model of a software system, ConfigAnalyzer is interpretable: users can combine the program analysis results produced by ConfigAnalyzer with the classification rules of the qualitative performance model to analyze the relationship between configuration items and specific performance operations, and thereby understand the root cause of each configuration item's effect on system performance.

Unlike existing white-box approaches to building performance impact models, ConfigAnalyzer incurs none of the costs of running the configurable software system, such as hardware, software environment, the time spent measuring and testing the software system, and energy consumption. It removes the restriction on configuration item types and greatly reduces the cost of analyzing the relationship between configuration items and software system performance.

The qualitative performance impact model built by ConfigAnalyzer was evaluated experimentally. The results show that, without running the program, ConfigAnalyzer correctly predicted whether the configuration item affects performance for 84.21% of the configuration items in the tested software systems; it can effectively build a qualitative performance impact model of a software system and achieves good accuracy.

It should be noted that the present invention includes but is not limited to the specific embodiments above; all technical solutions consistent with the inventive concept fall within the protection scope of the present invention. For example, the following also fall within the protection scope of the present invention:

(1) Replacing the static taint analysis used in the present invention with dynamic taint analysis, combined with program testing that reaches a high level of test coverage (80%-99%), can also produce the results output by the program analysis module of the present invention, which can then be used as input to the performance model module.

(2) The random forest classification model in the performance model module can be replaced with any other classification model to classify configuration items. Some interpretability of the classification results may be lost, but the classification results can still be produced.

(3) Configuration item sampling and program testing alone can also determine whether a configuration item affects the performance of a configurable system. Each configuration item is sampled, and the Cartesian product of the per-item samples gives a subset of the configuration space (here, the configurations in the subset differ in the value of only one configuration item; all other configuration items are set to the same values). A program performance test is run for each configuration in the subset, and analyzing the test results across configurations shows whether a configuration item affects the performance or behavior of the configurable system.
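The sampling scheme in item (3) can be sketched as follows, under one reading of the parenthetical remark: every generated configuration differs from a common baseline in exactly one configuration item. The function name, the baseline dictionary, and the option names in the usage note are hypothetical.

```python
def one_at_a_time(baseline, samples):
    """Build configurations that differ from baseline in exactly one option.

    baseline maps option name -> default value; samples maps option name ->
    sampled values to try. Returns (option, configuration) pairs.
    """
    configs = []
    for option, values in samples.items():
        for value in values:
            if value == baseline[option]:
                continue  # identical to the baseline, nothing to compare
            variant = dict(baseline)
            variant[option] = value
            configs.append((option, variant))
    return configs
```

For a baseline `{"threads": 1, "cache": False}` and samples `{"threads": [1, 4, 8], "cache": [False, True]}`, the sketch produces three variants, each of which would be performance-tested and compared against the baseline run.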

所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统，装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

在本申请所提供的实施例中，应该理解到，所揭露的系统，装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

另外，在本发明各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-OnlyMemory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention is essentially or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: U disk, removable hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for analyzing the impact of configuration items on software system performance, characterized by comprising:

identifying and marking all performance operations in the software system according to a preset code pattern of the software system, wherein the performance operations are time-intensive operations and/or space-intensive operations that affect the performance of the software system;

identifying the dependency between each of the performance operations and each configuration item of the software system, and obtaining a performance operation set corresponding to each of the configuration items, wherein each performance operation in the performance operation set has a dependency on the configuration item;

constructing a feature vector corresponding to each of the configuration items according to the performance operation set;

inputting the feature vector corresponding to each of the configuration items into a trained qualitative performance impact model, determining whether the configuration item affects the performance of the software system, and obtaining a set of configuration items that affect the performance of the software system, wherein the qualitative performance impact model is obtained by training with feature vectors corresponding to configuration items of multiple software systems.

2. The method for analyzing the impact of configuration items on software system performance according to claim 1, wherein the qualitative performance impact model comprises:

a random forest classification model and a configuration item dependency detector;

wherein the random forest classification model performs binary classification on whether the configuration item affects the performance of the software system, and the configuration item dependency detector corrects the classification result of the random forest classification model.

3. The method for analyzing the impact of a configuration item on software system performance according to claim 1, wherein the dependency relationship comprises:

data dependencies and control dependencies;

wherein the data dependencies are dependencies between data flows, and the control dependencies are dependencies caused by the program control flow.

4. The method for analyzing the impact of configuration items on software system performance according to claim 3, wherein identifying the dependencies between each of the performance operations and each of the configuration items of the software system comprises:

identifying data dependencies between each of the performance operations and each configuration item of the software system by using taint analysis;

identifying, by using a program dependency graph, the control dependencies between each of the performance operations and each configuration item of the software system, wherein the program dependency graph is constructed by using program dependency analysis techniques and describes the control dependencies and data dependencies of the program.

5. The method for analyzing the impact of configuration items on software system performance according to claim 4, wherein identifying data dependencies between each of the performance operations and each of the configuration items of the software system using taint analysis comprises:

entering the program entry point of the software system, traversing the control flow, and creating a taint at the configuration item loading API as a source point;

recording the data propagation path of the source point and the sink point finally reached, whereby the performance operation at the sink point has a data dependency on the configuration item; the sink point is a program statement that the source point is not expected to reach, and the sink point is preset before the statement corresponding to the performance operation.
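The source-to-sink propagation described in claim 5 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a toy straight-line program with no branches, represented by a hypothetical `Stmt` record, and it reports only which performance operations are reached, not the full propagation path. The names `Stmt`, `data_dependencies`, and the example configuration keys are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Stmt:
    """One statement of a toy straight-line program (hypothetical IR)."""
    target: Optional[str] = None       # variable defined by the statement, if any
    uses: Tuple[str, ...] = ()         # variables read by the statement
    config_key: Optional[str] = None   # set when the statement calls the config-loading API
    perf_op: Optional[str] = None      # set when the statement is a marked performance operation


def data_dependencies(program):
    """Return {config key: set of performance operations data-dependent on it}."""
    taint = {}  # variable -> set of config keys whose value flows into it
    deps = {}
    for stmt in program:
        # Union of the taints carried by every variable the statement reads.
        incoming = set()
        for var in stmt.uses:
            incoming |= taint.get(var, set())
        if stmt.config_key is not None:
            # Source point: the config-loading API introduces a new taint.
            incoming = incoming | {stmt.config_key}
            deps.setdefault(stmt.config_key, set())
        if stmt.perf_op is not None:
            # Sink point: a tainted value reaches a performance operation.
            for key in incoming:
                deps.setdefault(key, set()).add(stmt.perf_op)
        if stmt.target is not None:
            # An assignment replaces the previous taint of its target.
            taint[stmt.target] = incoming
    return deps


# Hypothetical program: buffer.size flows into two performance operations,
# while log.level is loaded but never reaches one.
program = [
    Stmt(target="size", config_key="buffer.size"),
    Stmt(target="n", uses=("size",)),
    Stmt(target="buf", uses=("n",), perf_op="allocate_buffer"),
    Stmt(target="lvl", config_key="log.level"),
    Stmt(uses=("buf",), perf_op="write_file"),
]
deps = data_dependencies(program)
```

A real analysis would traverse the control flow from the program entry point across branches and calls; the single forward pass here is enough only because the toy program has no branches.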

6. The method for analyzing the impact of configuration items on software system performance according to claim 4, wherein identifying the control dependencies between each of the performance operations and each of the configuration items of the software system by using a program dependency graph comprises:

traversing all nodes in the program dependency graph and constructing the control area of each configuration item, wherein the control area of a configuration item is a statement sequence that has a direct control dependency on the configuration item;

identifying the control dependency between each of the performance operations and each configuration item of the software system according to the control area of each configuration item.
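As a rough illustration of claim 6, the sketch below builds control areas from a toy program dependency graph. The graph encoding (a dict whose nodes carry a `config` field and a `controls` list of directly control-dependent statements), the function names, and the example configuration keys are assumptions made for this example; the patent does not prescribe a data structure.

```python
def control_areas(pdg):
    """Traverse every PDG node and collect, per configuration item, the
    statements that are directly control-dependent on a predicate reading it."""
    areas = {}
    for node in pdg.values():
        item = node["config"]
        if item is None:
            continue  # this node does not test a configuration item
        area = areas.setdefault(item, [])
        for stmt in node["controls"]:
            if stmt not in area:
                area.append(stmt)
    return areas


def control_dependencies(areas, perf_ops):
    """Keep only the marked performance operations inside each control area."""
    return {item: [s for s in area if s in perf_ops] for item, area in areas.items()}


# Hypothetical PDG: s1 branches on cache.enabled, s5 branches on sync.mode.
pdg = {
    "s1": {"config": "cache.enabled", "controls": ["s2", "s3"]},
    "s2": {"config": None, "controls": []},
    "s3": {"config": None, "controls": ["s4"]},
    "s4": {"config": None, "controls": []},
    "s5": {"config": "sync.mode", "controls": ["s6"]},
    "s6": {"config": None, "controls": []},
}
perf_ops = {"s2", "s4", "s6"}  # statements marked as performance operations

areas = control_areas(pdg)
ctrl_deps = control_dependencies(areas, perf_ops)
```

Statement `s4` is controlled by `s3`, not directly by the `cache.enabled` predicate, so it falls outside that control area, matching the "direct control dependency" wording of the claim.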

7. The method for analyzing the impact of configuration items on software system performance according to claim 2, wherein the training process of the random forest classification model comprises:

dividing the feature vectors corresponding to the configuration items of multiple software systems into a training set and a test set;

training the random forest classification model on the training set using the random forest algorithm.

8. The method for analyzing the impact of configuration items on software system performance according to claim 2, wherein the configuration item dependency detector correcting the classification result of the random forest classification model comprises:

when a first configuration item of the software system depends on a second configuration item, if the random forest classification model determines that the first configuration item affects the performance of the software system and the second configuration item does not, the configuration item dependency detector corrects the second configuration item to be one that affects software system performance.
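The correction rule of claim 8 can be sketched as a small post-processing step over the classifier output. The function name, the dict-of-booleans prediction format, and the example configuration keys are assumptions for illustration. The sketch applies the rule once against the original random forest predictions; whether corrections should also propagate along chains of dependencies is not stated in the claim and is left out here.

```python
def correct_predictions(rf_predictions, depends_on):
    """Apply the dependency-based correction to random forest predictions.

    rf_predictions: {configuration item: True if predicted to affect performance}
    depends_on:     iterable of (first, second) pairs, meaning first depends on second
    """
    corrected = dict(rf_predictions)  # leave the classifier output untouched
    for first, second in depends_on:
        if rf_predictions[first] and not rf_predictions[second]:
            # The dependent item matters for performance, so the item it
            # depends on is corrected to performance-affecting as well.
            corrected[second] = True
    return corrected


# Hypothetical example: compression.level only takes effect when
# compression.enabled is set, so it depends on compression.enabled.
rf_predictions = {
    "compression.level": True,
    "compression.enabled": False,
    "log.dir": False,
}
depends_on = [("compression.level", "compression.enabled")]
corrected = correct_predictions(rf_predictions, depends_on)
```

Here `compression.enabled` is flipped to performance-affecting, while `log.dir` keeps its original prediction because no performance-affecting item depends on it.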

9. The method for analyzing the impact of configuration items on software system performance according to any one of claims 1 to 8, characterized in that, before identifying the dependencies between each of the performance operations and each configuration item of the software system, the method further comprises:

extracting configuration item information of the software system, wherein the configuration item information includes at least the names and number of the configuration items and the API used when the configuration items are loaded into the software system.

10. A system for analyzing the impact of configuration items on software system performance, comprising:

A performance operation identification module, configured to identify and mark all performance operations in the software system according to a code pattern preset by the software system, where the performance operations are time-intensive operations and/or space-intensive operations that affect the performance of the software system;

a dependency identification module, configured to identify the dependency between each of the performance operations and each configuration item of the software system, and obtain a performance operation set corresponding to each of the configuration items, wherein each performance operation in the performance operation set has a dependency on the configuration item;

a feature vector building module, configured to build a feature vector corresponding to each of the configuration items according to the performance operation set;

a configuration item set determination module, configured to input the feature vector corresponding to each of the configuration items into a trained qualitative performance impact model, determine whether the configuration item affects the performance of the software system, and obtain a set of configuration items that affect the performance of the software system, wherein the qualitative performance impact model is obtained by training with feature vectors corresponding to configuration items of multiple software systems.