CN115408700A

CN115408700A - Open source component detection method based on binary program modularization

Info

Publication number: CN115408700A
Application number: CN202210863358.1A
Authority: CN
Inventors: 刘胜利; 郭茜茜; 刘明; 杨启超; 王家丰; 张璨; 邵文强; 吕思欧
Original assignee: PLA Information Engineering University
Current assignee: PLA Information Engineering University
Priority date: 2022-07-20
Filing date: 2022-07-20
Publication date: 2022-11-29

Abstract

The invention belongs to the technical field of open source component detection, and particularly relates to an open source component detection method based on binary program modularization. According to the method, through extracting the calling and address information of functions in the binary program, a coagulation algorithm for processing hierarchical community structure analysis of an undirected graph is improved into a BCM suitable for a directed graph, and the binary program module is divided by using the BCM. On the basis of binary program modularization, all modules are subjected to feature matching with the open source component, the specific module where the component is located is positioned, the multiplexing of the current file granularity is refined to the positioning detection of the module granularity, the search range of software security analysis tasks such as open source vulnerability detection is narrowed, and the efficiency of downstream tasks is improved. Experiments show that the method improves the detection precision of the open source assembly, greatly reduces the false alarm rate, and is improved by 60.12 percent compared with the existing open source assembly detection method B2SFinder with better file granularity; the detection efficiency is still maintained after the modular analysis is added.

Description

Open source component detection method based on binary program modularization

技术领域technical field

本发明属于开源组件检测技术领域，具体涉及一种基于二进制程序模块化的开源组件检测方法。The invention belongs to the technical field of open source component detection, in particular to an open source component detection method based on binary program modularization.

背景技术Background technique

在现代软件开发过程中，随着开源软件(Open-Source Software，OSS)的日益普及，软件重用成为一种普遍现象，开发人员往往利用开源软件提供的丰富功能来缩短开发周期，将更多的时间花在个性化开发上。近年来，开源软件的数量正以指数级的速度增长。截至目前，Github上有超过4400万个存储库，大量的开源软件给软件开发带来了极大的便利。然而，OSS使用不当会导致潜在的安全风险。含有过时的OSS的软件更有可能被利用。例如，一个名为Heartbleed的安全漏洞在OpenSSL(一个流行的加密软件库)的1.0.1g版本之前的1.0.1版本中被发现。对于使用该库的易受攻击的1.0.1版本的软件，攻击者可以窃取诸如用户名和密码之类的私人信息，它影响了许多著名的软件，例如4.2.0到4.2.2版本的LibreOffice软件和VMware Workstation 10。In the process of modern software development, with the increasing popularity of open source software (Open-Source Software, OSS), software reuse has become a common phenomenon. Developers often use the rich functions provided by open source software to shorten the development cycle and integrate more Time is spent on personalization development. In recent years, the amount of open source software has grown exponentially. As of now, there are more than 44 million repositories on Github, and a large number of open source software brings great convenience to software development. However, improper use of OSS can lead to potential security risks. Software with outdated OSS is more likely to be exploited. For example, a security flaw called Heartbleed was found in version 1.0.1 of OpenSSL, a popular cryptographic software library, before version 1.0.1g. For software using the vulnerable version 1.0.1 of this library, attackers can steal private information such as usernames and passwords, it affects many well-known software such as LibreOffice software from version 4.2.0 to 4.2.2 and VMware Workstation 10.

商用软件(Commercial Off-The-Shelf，COTS)通常是以闭源的方式进行发布，无法直接获取源码进行软件安全分析，为了检测开源组件漏洞，在二进制程序中识别复用的开源软件尤为重要，它在各种软件安全工程相关任务中发挥着重要作用，如恶意软件检测、漏洞搜索、逆向工程等。目前的工作多数通过提取目标二进制程序和源码中的特征，选择合适的特征匹配算法进行开源组件的识别。但是，已有的工作在两个方面存在不足，分别是特征选择和开源组件在二进制程序中的识别粒度。Commercial Off-The-Shelf (COTS) is usually released in a closed-source manner, and the source code cannot be directly obtained for software security analysis. In order to detect vulnerabilities in open source components, it is particularly important to identify reused open source software in binary programs. It plays an important role in various software security engineering related tasks, such as malware detection, vulnerability searching, reverse engineering, etc. Most of the current work extracts the features in the target binary program and source code, and selects an appropriate feature matching algorithm to identify open source components. However, existing works are deficient in two aspects, feature selection and recognition granularity of open source components in binary programs.

首先，特征选择方面，有些方法提取的特征太少或者易受编译优化影响。多数方法依赖字符串和导出函数名等语法特征，但是一些商业软件为了隐藏其软件组成而去除字符串。因此，为了克服特征的单一性，控制流图成为新的选择。但是由于编译优化的影响，控制流很容易发生改变，故控制流图不能作为合适的特征进行匹配。First of all, in terms of feature selection, some methods extract too few features or are easily affected by compilation optimization. Most methods rely on syntactic features such as strings and exported function names, but some commercial software strips strings in order to hide its software composition. Therefore, in order to overcome the singleness of features, control flow graph becomes a new choice. However, due to the influence of compilation optimization, the control flow is easy to change, so the control flow graph cannot be matched as a suitable feature.

其次，在二进制程序中开源组件的识别粒度方面，现有的方法是从文件粒度检测开源组件，即以整个二进制程序为粒度单元，通过直接映射、层次匹配等算法，对提取的特征进行匹配，然后计算特征相似度，判断整个程序中是否复用开源组件。针对二进制程序的开源组件识别，会有一些下游任务的开展，例如漏洞检测，是否包含高危漏洞；是否受供应链投毒影响；推动供应商更新重要依赖组件的版本等。基于目前的方法，若进行下游任务开源漏洞的搜索，其范围为整个二进制程序，对于时间和工作量都是一个较高的要求。Secondly, in terms of the recognition granularity of open source components in binary programs, the existing method is to detect open source components from the file granularity, that is, the entire binary program is used as the granular unit, and the extracted features are matched through algorithms such as direct mapping and hierarchical matching. Then calculate the feature similarity to judge whether open source components are reused in the entire program. For the identification of open source components of binary programs, some downstream tasks will be carried out, such as vulnerability detection, whether it contains high-risk vulnerabilities; whether it is affected by supply chain poisoning; pushing suppliers to update versions of important dependent components, etc. Based on the current method, if the downstream task is to search for open source vulnerabilities, the scope is the entire binary program, which is a high requirement for time and workload.

发明内容Contents of the invention

针对开源组件检测方法存在的缺陷和问题，本发明提供一种基于二进制程序模块化的开源组件检测方法。Aiming at the defects and problems existing in the open source component detection method, the present invention provides an open source component detection method based on binary program modularization.

本发明解决其技术问题所采用的方案是：一种基于二进制程序模块化的开源组件检测方法，包括二进制程序模块化和开源组件识别，其中二进制程序模块化包括以下内容：The solution adopted by the present invention to solve the technical problem is: a binary program modularization-based open source component detection method, including binary program modularization and open source component identification, wherein the binary program modularization includes the following content:

(1)对二进制程序进行函数地址提取和函数调用图创建，通过函数地址和函数有向调用关系的提取构造模块化步骤的输入图；(1) Extract the function address and create the function call graph for the binary program, and construct the input graph of the modularization step through the extraction of the function address and the function-directed call relationship;

(2)使用基于模块度改进的BCM算法对二进制程序模块进行划分；(2) Use the improved BCM algorithm based on modularity to divide the binary program modules;

开源组件识别包括以下内容：Open source component identification includes the following:

(1)特征选择与提取：针对二进制程序的每个模块，提取每个模块中的字符类型及函数的复杂分支序列作为提取特征，其中字符类型包括字符串字面量、字符串数组和导出函数名；对于函数的复杂分支序列，将switch/case、if/else的分支序列作为选择的特征；(1) Feature selection and extraction: For each module of the binary program, the complex branch sequences of character types and functions in each module are extracted as extraction features, where the character types include string literals, string arrays, and exported function names ; For the complex branch sequence of the function, the branch sequence of switch/case, if/else is used as the selected feature;

(2)模块粒度的开源组件识别：将二进制程序每个模块的提取特征与源码特征进行比对，检测出开源组件所在的二进制模块。(2) Open source component identification at module granularity: compare the extracted features of each module of the binary program with the source code features, and detect the binary modules where the open source components are located.

上述的基于二进制程序模块化的开源组件检测方法，对二进制程序模块划分包括以下步骤：In the above-mentioned open source component detection method based on binary program modularization, the division of binary program modules includes the following steps:

Ⅰ、初始假设每个函数都是一个独立的模块，对任意相邻的节点i和节点j，计算将节点i加入其邻居节点j所在模块C时对应的模块度增量ΔQ，Ⅰ. Initially assume that each function is an independent module. For any adjacent node i and node j, calculate the corresponding modularity increment ΔQ when node i is added to the module C where its neighbor node j is located.

式中：S_i,in表示节点i与模块C内部函数连边权重之和；m是网络中所有边权重之和；W_c是模块C内部所有边的权重和；S_c是所有与模块C内部的函数相关联的边的权重和；In the formula: S _{i, in} represents the sum of the weights of nodes i and the internal functions of module C; m is the sum of the weights of all edges in the network; W _c is the sum of the weights of all edges in module _C ; The sum of the weights of the edges associated with the internal function;

然后计算节点i与所有邻居节点的模块度增量，选出其中最大的一个；当该值为正时，把节点i加人相应的邻居节点所在的模块；否则，节点i留在原模块中；Then calculate the modularity increment of node i and all neighbor nodes, and select the largest one; when the value is positive, add node i to the module where the corresponding neighbor node is located; otherwise, node i stays in the original module;

重复进行，直至不再出现合并现象，划分出第一层模块；Repeat until the merging phenomenon no longer occurs, and the first layer of modules is divided;

Ⅱ、基于Ⅰ形成的模块结果，重复Ⅰ中的方法对新的模块结果进行模块划分，得到第二层模块，其中模块之间连边的权重是两个模块之间所有节点连边的权重和；Ⅱ. Based on the module results formed in Ⅰ, repeat the method in Ⅰ to divide the new module results into modules, and obtain the second layer of modules, where the weight of the edges between the modules is the sum of the weights of the edges of all nodes between the two modules ;

重复直至网络模块度不再增加，完成二进制程序模块划分。Repeat until the network modularity does not increase any more, and the binary program module division is completed.

上述的基于二进制程序模块化的开源组件检测方法，从二进制模块中提取字符串的过程为：通过模块中每个函数的地址获取该函数的所有片段，遍历每个函数片段中的所有指令，遍历该指令的数据引用地址，当该地址存放的是字符串时，将字符串的地址及值添加到字符串列表中。The above-mentioned open source component detection method based on binary program modularization, the process of extracting strings from binary modules is as follows: obtain all fragments of the function through the address of each function in the module, traverse all instructions in each function fragment, traverse The data reference address of the instruction, when the address stores a string, add the address and value of the string to the string list.

上述的基于二进制程序模块化的开源组件检测方法，从二进制程序中提取导出函数名的过程为：首先需要获取模块中的所有函数名，然后通过遍历整个二进制程序中的导出函数名的方式，保留模块中的导出函数名。The above-mentioned open source component detection method based on binary program modularization, the process of extracting the exported function name from the binary program is as follows: firstly, it is necessary to obtain all the function names in the module, and then by traversing the exported function names in the entire binary program, retain The name of the exported function in the module.

上述的基于二进制程序模块化的开源组件检测方法，开源组件的对比流程为：遍历待匹配的特征类型，获取模块和开源组件的该类型特征；遍历获取的特征，将匹配的特征添加到匹配特征列表中，当匹配的特征得分超过阈值，将该类型特征添加到匹配的特征类型列表中，此列表非空，则该模块复用开源组件。The above-mentioned open source component detection method based on binary program modularization, the comparison process of open source components is: traverse the feature types to be matched, obtain the features of this type of modules and open source components; traverse the acquired features, and add the matching features to the matching features In the list, when the matched feature score exceeds the threshold, the type of feature is added to the matched feature type list. If the list is not empty, the module reuses open source components.

上述的基于二进制程序模块化的开源组件检测方法，在开源组件识别时，The above-mentioned open source component detection method based on binary program modularization, when identifying open source components,

a、对于字符类型的特征，当二进制程序模块与开源组件中的字符特征一致时，则判定特征为匹配；a. For character type features, when the binary program module is consistent with the character features in the open source component, the feature is determined to be a match;

b、对于函数的复杂分支序列，通过最长公共子序列长度匹配if/else特征，当最长公共子序列长度超过设定阈值则判定为匹配。b. For the complex branch sequence of the function, the if/else feature is matched by the length of the longest common subsequence. When the length of the longest common subsequence exceeds the set threshold, it is determined as a match.

本发明的有益效果：本发明基于二进制程序函数具备层次化的特点，提取函数地址及调用关系，进行二进制程序模块化；基于特征的模块粒度开源组件检测，依据特征选择标准，提取二进制模块及源码的特征，对测试数据集进行开源组件检测，精度比B2SFinde提高60.12％，极大的降低了误报率，提高特征提取效率。Beneficial effects of the present invention: the present invention is based on the hierarchical characteristics of binary program functions, extracts function addresses and calling relationships, and performs binary program modularization; feature-based module granularity open source component detection, and extracts binary modules and source codes according to feature selection criteria The features of open source components are detected on the test data set, and the accuracy is 60.12% higher than that of B2SFinde, which greatly reduces the false positive rate and improves the efficiency of feature extraction.

本发明在评估模块粒度检测效率及COTS中的开源组件应用时，结果表明即使增加模块化过程，依旧保持检测效率；而且本发明能够在COTS中检测到开源组件的复用，并且定位具体模块，提高软件安全分析下游任务的效率。When the present invention evaluates the detection efficiency of module granularity and the application of open source components in COTS, the results show that even if the modularization process is added, the detection efficiency is still maintained; and the present invention can detect the reuse of open source components in COTS, and locate specific modules, Improve the efficiency of downstream tasks in software security analysis.

附图说明Description of drawings

图1为本发明整体框架示意图。Figure 1 is a schematic diagram of the overall framework of the present invention.

图2为开源组件定位模块数量的频率统计图。Figure 2 is a frequency statistical diagram of the number of open source component positioning modules.

图3为开源组件复用频率统计图。Figure 3 is a statistical chart of the reuse frequency of open source components.

具体实施方式Detailed ways

下面结合附图和实施例对本发明进一步说明。The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

实施例1：本实施例提供的基于二进制程序模块化的开源组件检测方法，如图1所示，整体分为两个阶段，分别为二进制程序模块化与基于特征的模块粒度开源组件识别。在模块化阶段，提取二进制程序的函数直接调用信息，并以图的形式表示这些信息，并使用基于模块度进行改进的BCM方法，对二进制程序中的函数进行聚类。在模块粒度的开源组件识别阶段，提取字符串、导出函数名、字符串数组、函数中复杂的分支序列特征，将二进制程序的各个模块与开源组件进行匹配，识别二进制模块与开源组件的对应关系。Embodiment 1: The open source component detection method based on binary program modularization provided by this embodiment, as shown in FIG. 1 , is divided into two stages as a whole, namely binary program modularization and feature-based module granularity open source component identification. In the modularization stage, the direct function call information of the binary program is extracted, and the information is represented in the form of a graph, and the functions in the binary program are clustered using the improved BCM method based on modularity. In the open source component recognition stage at module granularity, extract character strings, exported function names, string arrays, and complex branch sequence features in functions, match each module of the binary program with the open source component, and identify the correspondence between the binary module and the open source component .

第一阶段：二进制程序模块化，主要包括以下内容The first stage: binary program modularization, mainly includes the following contents

1、属性选择及图的创建1. Attribute selection and graph creation

在二进制程序分析的工作中，常见的图有调用图(call graph，CG)、控制流图(control-flow graph，CFG)和数据流图(data flow graph，DFG)。In the work of binary program analysis, common graphs include call graph (call graph, CG), control-flow graph (control-flow graph, CFG) and data flow graph (data flow graph, DFG).

调用图、控制流图都是表示有向传递的图，但侧重点和节点不同。调用图表示的是整个程序中函数之间调用关系的图，是程序内函数调用的流向，流向节点是函数，边表示调用关系。而控制流图是表示一个函数内的程序执行流的图，表示函数中语句跳转流向，流向节点是程序语句，边表示执行流。数据流图表示一系列操作之间的数据依赖关系的图，图中的边表示从一个操作的结果到另一个操作输入的数据流，一旦一条指令的所有输入数据值都被消耗完，它就会执行。当一条指令执行时，它产生一个新的数据，这个数据值被传播到其他连接的指令。二进制程序分析的工作人员利用较多的是控制流图和数据流图，便于更好的理解函数。Both the call graph and the control flow graph are graphs representing directed transfer, but the emphasis is different from the nodes. The call graph represents the call relationship between functions in the entire program, and it is the flow direction of function calls in the program. The flow direction nodes are functions, and the edges represent the call relationship. The control flow graph is a graph that represents the program execution flow within a function, and represents the statement jump flow in the function. The flow direction nodes are program statements, and the edges represent the execution flow. A data flow graph represents a graph of data dependencies between a series of operations. The edges in the graph represent the data flow from the result of one operation to the input of another operation. Once all the input data values of an instruction are consumed, it is will execute. When an instruction executes, it generates a new data value, and this data value is propagated to other connected instructions. Binary program analysis staff use more control flow graphs and data flow graphs to better understand functions.

但是针对二进制程序的模块化是全局的分析，在这种情况下，控制流图和数据流图则不具备适用性，因此，选择函数调用图，对整个二进制程序具备全局的分析。函数调用图表示一个程序的不同函数如何相互调用，相互调用的函数一般会共同协作完成同一功能，因此更大的概率属于同一个模块。通过函数地址及函数有向调用关系的提取构造模块化步骤的输入图。But the modularization of binary programs is a global analysis. In this case, the control flow graph and data flow graph are not applicable. Therefore, the function call graph is selected to have a global analysis of the entire binary program. The function call graph shows how different functions of a program call each other. The functions that call each other generally cooperate to complete the same function, so they have a greater probability of belonging to the same module. The input graph of the modularization step is constructed by extracting the function address and the directed calling relationship of the function.

2、二进制程序模块划分2. Binary program module division

(1)模块质量度量(1) Module Quality Metrics

模块度Q是常用的一种衡量社团划分质量的标准，其基本思想是把划分社团后的网络与相应的零模型进行比较，以度量社团划分的质量。所谓与一个网络对应的零模型，就是指与该网络具有某些相同的性质，如相同的边数或者相同的度分布等，而在其他方面完全随机的随机图模型。二进制程序模块划分是针对函数的聚类，可以类比网络图中的社团检测。因此，本文以社团检测中的模块度Q作为模块划分效果的度量指标，模块度定义为：Modularity Q is a commonly used standard to measure the quality of community division. Its basic idea is to compare the network after community division with the corresponding null model to measure the quality of community division. The so-called null model corresponding to a network refers to a random graph model that has some of the same properties as the network, such as the same number of edges or the same degree distribution, and is completely random in other respects. Binary program module division is clustering for functions, which can be compared to community detection in network graphs. Therefore, this paper uses the modularity Q in community detection as a measure of the effect of module division, and the modularity is defined as:

其中，i和j表示节点；m是网络中所有边权重之和；A_ij表示节点i和j之间的边权重；k_i表示与节点i相连的边权重和；c_i表示节点i所属社团；c_j表示节点j所属社团；δ(c_i,c_j)表示节点i和节点j是否属于一个社团，属于值为1，否则为0。Among them, i and j represent nodes; m is the sum of all edge weights in the network; A _ij represents the edge weight between nodes i and j; k _i represents the sum of edge weights connected to node i; c _i represents the community to which node i belongs ;c _j indicates the community to which node j belongs; δ( _ci ,c _j ) indicates whether node i and node j belong to the same community, and the belonging value is 1, otherwise it is 0.

(2)二进制程序模块化方法(2) Binary program modularization method

网络的社团结构具有每个社团内部节点之间连接较为紧密，各个社团之间的连接相对来说比较稀疏的特点。程序“高内聚，低耦合”的模块与复杂网络的“社团”非常相似，并且社团检测技术的发展较成熟。因此本实施例使用社团检测的方法进行二进制程序的模块化。但是一些社团检测算法需要预先知道社团个数信息，例如谱分析、基于机器学习的社团检测算法，而二进制程序的模块数量是未知的。基于模块度概念的算法，不需要提前知道社团个数。The community structure of the network has the characteristics that the internal nodes of each community are relatively closely connected, and the connections between various communities are relatively sparse. The "high cohesion, low coupling" module of the program is very similar to the "community" of the complex network, and the development of community detection technology is relatively mature. Therefore, this embodiment uses the community detection method to modularize the binary program. However, some community detection algorithms need to know the number of communities in advance, such as spectral analysis, community detection algorithms based on machine learning, and the number of modules of binary programs is unknown. The algorithm based on the concept of modularity does not need to know the number of communities in advance.

本实施例通过提取的二进制函数地址和函数有向调用信息，对适用于无向图的基于模块度概念的凝聚算法进行改进，得到有向图检测的BCM方法，对二进制程序进行模块化。软件中实现的各个功能模块是具有层次性的，在源码文件目录中表现为各个文件夹中的文件，选择层次化社团结构分析的算法契合了软件功能实现的层次特性。In this embodiment, the agglomeration algorithm based on the concept of modularity applicable to undirected graphs is improved through the extracted binary function addresses and function directed call information, and a BCM method for directed graph detection is obtained to modularize binary programs. Each functional module implemented in the software is hierarchical, and it is represented as a file in each folder in the source code file directory. The algorithm for selecting hierarchical community structure analysis fits the hierarchical characteristics of software function realization.

具体的，使用BCM算法划分二进制程序模块分为两个阶段：Specifically, using the BCM algorithm to divide binary program modules into two stages:

Ⅰ、初始时假设每个函数都是一个独立的模块；对任意相邻的节点i和节点j，计算将节点i加入其邻居节点j所在模块(记为模块C)时对应的模块度增量ΔQ，Ⅰ. Initially assume that each function is an independent module; for any adjacent node i and node j, calculate the corresponding modularity increment when adding node i to the module where its neighbor node j is located (denoted as module C) ΔQ,

其中，S_i,in表示节点i与模块C内部函数连边权重之和；m是网络中所有边权重之和；W_c是模块C内部所有边的权重和；S_c是所有与模块C内部的函数相关联的边的权重和。Among them, S _{i, in} represents the sum of the weights of nodes i and the internal functions of module C; m is the sum of all edge weights in the network; W _c is the weight sum of all edges inside module C; S _c is the sum of all edges connected to module C The weight sum of the edges associated with the function of .

计算节点i与所有邻居节点的模块度增量，然后选出其中最大的一个。当该值为正时，把节点i加人相应的邻居节点所在的模块；否则，节点i留在原模块中。这种模块合并过程重复进行，直到不再出现合并现象，这样就划分出了第一层模块。Calculate the modularity increments of node i and all neighbor nodes, and then select the largest one. When the value is positive, add node i to the module where the corresponding neighbor node is located; otherwise, node i stays in the original module. This module merging process is repeated until no more merging occurs, thus dividing the first layer of modules.

Ⅱ、基于过程Ⅰ形成一个新的模块结果，模块之间连边的权重是两个模块之间所有节点连边的权重和。然后重复Ⅰ中的方法对新的模块结果进行模块划分，得到第二层模块。以此类推，直到网络模块度Q不再增加为止。Ⅱ. A new module result is formed based on the process Ⅰ. The weight of the connection between the modules is the sum of the weights of all the node connections between the two modules. Then repeat the method in I to divide the new module results into modules to obtain the second layer of modules. And so on, until the network modularity Q no longer increases.

二进制程序模块化示例算法如表1所示。The binary program modularization example algorithm is shown in Table 1.

初始化各个模块的模块度为-1(第2行)，然后迭代进行节点移动，获得移动后的模块度(第5-6行)，若模块度增加，则合并模块(第7-9行)，直到模块度不再增加(第10-11行)，停止迭代过程。Initialize the modularity of each module to -1 (line 2), and then iteratively move the nodes to obtain the modularity after movement (lines 5-6). If the modularity increases, merge the modules (lines 7-9) , until the modularity no longer increases (lines 10-11), stop the iterative process.

表1二进制程序模块化示例算法Table 1 Binary program modularization example algorithm

利用社团检测技术进行二进制程序的模块划分，很好的解决了目前二进制程序模块聚类的问题，同时为新的模块粒度上的开源组件检测提供了基础。Using community detection technology to divide binary programs into modules solves the current problem of binary program module clustering and provides a basis for open source component detection at the new module granularity.

第二阶段：开源组件识别Phase 2: Open Source Component Identification

目前针对二进制程序的开源组件识别粒度为文件级别，解决的问题是检测整个二进制程序是否复用开源组件，未准确定位开源组件的位置。本实施例将二进制程序模块一一与开源组件匹配，定位开源组件所在模块，将文件粒度的识别细化到模块粒度，为软件安全工作中的下游任务缩小分析范围。主要内容如下。At present, the recognition granularity of open source components for binary programs is at the file level. The problem to be solved is to detect whether the entire binary program reuses open source components, and the location of open source components is not accurately located. In this embodiment, the binary program modules are matched with the open source components one by one, the modules of the open source components are located, the identification of the file granularity is refined to the module granularity, and the analysis scope is narrowed for the downstream tasks in the software security work. The main contents are as follows.

1、特征选择与提取1. Feature selection and extraction

现有的方法常用特征包括字符串、导出函数名、整型常量、全局数组、调用图(callgraph，CG)和控制流图(control-flow graph，CFG)等。绝大多数的开源组件提取特征之后，几乎不存在枚举数组。在开源组件检测的过程中，通过整型数组检测到开源组件的比率接近于零。整型常量明显受到编译的影响，在编译和优化过程中，大量与源代码无关的立即数，如内存偏移量和堆栈偏移量，被作为噪声引入二进制代码中。调用图和控制流图也会因为编译优化而发生改变。Commonly used features of existing methods include strings, exported function names, integer constants, global arrays, call graphs (callgraph, CG), and control-flow graphs (control-flow graph, CFG). After most open source components extract features, there is almost no enumeration array. In the process of detecting open source components, the ratio of detecting open source components through integer arrays is close to zero. Integer constants are obviously affected by compilation. During compilation and optimization, a large number of immediate values that have nothing to do with the source code, such as memory offsets and stack offsets, are introduced into the binary code as noise. The call graph and control flow graph will also change due to compilation optimizations.

本实施例排除以上对于检测没有实际价值、徒增特征提取工作量的全局数组和易受编译优化影响的整型常量、调用图与控制流图特征，选择字符串、字符串数组、导出函数名及函数的复杂分支序列。提取的特征如表2所示。This embodiment excludes the above-mentioned global arrays that have no practical value for detection and will increase the workload of feature extraction, and integer constants, call graphs, and control flow graph features that are easily affected by compilation optimization, and select strings, string arrays, and exported function names. and complex branching sequences of functions. The extracted features are shown in Table 2.

表2特征选择Table 2 Feature selection

对于开源组件检测，现有的一些方法是将源码编译为二进制程序，然后进行二进制比对。但是会产生减少字符串等特征，从而导致较低的匹配率的编译设置和时间消耗问题。For the detection of open source components, some existing methods are to compile the source code into a binary program, and then perform binary comparison. But it will produce features such as reducing strings, resulting in lower matching rate compilation settings and time consumption problems.

本实施例直接针对源码分析，在提取特征之前结合编译信息，只从与编译相关的文件中提取特征。一方面，减少了特征提取工作量，另一方面，剔除了不会被二进制程序复用的无关代码，例如test等文件。This embodiment is directly aimed at source code analysis, combining compilation information before extracting features, and extracting features only from files related to compilation. On the one hand, it reduces the workload of feature extraction. On the other hand, it eliminates irrelevant codes that will not be reused by binary programs, such as test files.

(1)字符串及数组、导出函数名(1) Strings and arrays, export function names

针对二进制程序的每个模块，通过模块中每个函数的地址，获取该函数的所有指令地址，遍历指令的数据引用，提取所引用的字符串。For each module of the binary program, through the address of each function in the module, obtain all instruction addresses of the function, traverse the data references of the instructions, and extract the referenced strings.

以字符串提取算法为例，来描述从二进制模块中提取特征的过程。如表3所示。Take the string extraction algorithm as an example to describe the process of extracting features from binary modules. as shown in Table 3.

表3模块字符串提取算法流程Table 3 Module string extraction algorithm flow

通过模块中每个函数的地址获取该函数的所有片段(第3-4行)，遍历每个函数片段中的所有指令，遍历该指令的数据引用地址(第5-6行)，当该地址存放的是字符串时，将字符串的地址及值添加到字符串列表中(第7-13行)。Obtain all fragments of the function through the address of each function in the module (lines 3-4), traverse all instructions in each function fragment, and traverse the data reference address of the instruction (lines 5-6), when the address When storing a string, add the address and value of the string to the string list (lines 7-13).

从整个二进制程序中可以获取所有导出函数名，提取模块中的导出函数名与此不同，首先需要获取模块中的所有函数名，然后通过遍历整个二进制程序中的导出函数名的方式，保留模块中的导出函数名。针对给定的源代码文件，通过clang将源码解析成抽象语法树(Abstract Syntax Tree，AST)，把对源码的特征提取转为对AST的操作，从而提取源码中字符串、字符串数组及函数名。但是字符串特征很容易删除，特别是作为日志消息和错误消息。当软件中不包含可以匹配的有用字符串特征时，需要考虑其它特征的选择。All exported function names can be obtained from the entire binary program, which is different from extracting the exported function names in the module. First, you need to obtain all the function names in the module, and then by traversing the exported function names in the entire binary program, keep them in the module. The exported function name. For a given source code file, the source code is parsed into an abstract syntax tree (Abstract Syntax Tree, AST) by clang, and the feature extraction of the source code is converted into an operation on the AST, thereby extracting strings, string arrays and functions in the source code name. But string features are easy to remove, especially as log messages and error messages. When the software does not contain useful string features that can be matched, the selection of other features needs to be considered.

(2)函数的复杂分支序列(2) Complex branching sequences of functions

在二进制相似性比对工作中常用的调用图和控制流图很容易受到编译优化影响，为达到补充抗编译优化的代码特征的目的，这两种类型的特征排除在选择范围外。通过分析发现，函数中的switch/case、if/else结构在编译过程中是稳定的，因此，可以将switch/case、if/else的分支序列作为选择的特征，它们包含函数中比较、跳转和分支的相对完整信息。Call graphs and control flow graphs commonly used in binary similarity comparison work are easily affected by compilation optimization. In order to achieve the purpose of supplementing code features that are resistant to compilation optimization, these two types of features are excluded from the selection range. Through analysis, it is found that the switch/case and if/else structures in the function are stable during the compilation process. Therefore, the branch sequence of switch/case and if/else can be used as the selected feature, which includes comparison and jump in the function and relatively complete information about branches.

对于二进制模块，if/else特征是通过分析它的比较和跳转指令来提取，利用模块中函数跳转表提取switch/case特征。在clang解析源码之后，分析中间代码的分支序列，提取特征。For the binary module, the if/else feature is extracted by analyzing its comparison and jump instructions, and the switch/case feature is extracted by using the function jump table in the module. After clang parses the source code, it analyzes the branch sequence of the intermediate code and extracts features.

2、模块粒度的开源组件识别2. Open source component identification at module granularity

如图1所示，开源组件的检测阶段是将二进制程序的每个模块与开源组件源码进行匹配。具体过程是从二进制程序各个模块和源码中提取字符串、导出函数名和函数的复杂分支序列特征，将每个模块特征都与源码特征比对，检测出开源组件所在的二进制模块。As shown in Figure 1, the detection stage of open source components is to match each module of the binary program with the source code of the open source components. The specific process is to extract strings, export function names, and complex branch sequence features of functions from each module and source code of the binary program, compare each module feature with the source code feature, and detect the binary module where the open source component is located.

对于字符类型的特征，当二进制程序模块与开源组件中的字符特征一致时，则认为特征是匹配的。For character-type features, when the binary program module is consistent with the character features in the open source component, the features are considered to match.

函数的复杂分支序列是通过语义对其进行特征匹配。通过最长公共子序列长度匹配if/else特征，当最长公共子序列长度超过设定阈值则判定为匹配。将源码添加默认分支的switch/case无序列表与模块中的switch/case进行序列比对。每种特征的匹配判定涉及的阈值根据经验确定，当特征的匹配分数超过阈值，则判定该模块复用开源组件。A complex branching sequence of functions is characterized by semantically matching them. The if/else feature is matched by the length of the longest common subsequence, and it is determined as a match when the length of the longest common subsequence exceeds the set threshold. Sequence alignment of the switch/case unordered list with the default branch added to the source code and the switch/case in the module. The threshold involved in the matching judgment of each feature is determined based on experience. When the matching score of the feature exceeds the threshold, it is determined that the module reuses open source components.

各个模块与开源组件具体的匹配流程示例如表4所示。An example of the specific matching process between each module and the open source component is shown in Table 4.

表4模块粒度开源组件监测算法示例Table 4 Example of module granular open source component monitoring algorithm

遍历待匹配的特征类型(第4行)，获取模块和开源组件的该类型特征(第6-7行)，遍历获取的特征，将匹配的特征添加到匹配特征列表中(第8-11行)，当匹配的特征得分超过阈值，将该类型特征添加到匹配的特征类型列表中，此列表非空，则该模块复用开源组件(第12-15行)。Traverse the feature types to be matched (line 4), obtain the features of this type of modules and open source components (lines 6-7), traverse the acquired features, and add the matched features to the matching feature list (lines 8-11 ), when the matched feature score exceeds the threshold, add this type of feature to the matched feature type list, if the list is not empty, then the module reuses open source components (lines 12-15).

实施例2：本实施例通过实验评估模块粒度的开源组件识别，全局数组特征对于检测结果的影响，检测效率，并且对实际的商用软件(COTS)开展检测分析。使用IDAPro反汇编二进制程序，利用IDAPython提取函数调用信息和二进制程序各个模块特征，通过clang和llvm工具，提取源代码特征。Embodiment 2: This embodiment evaluates the open source component identification of module granularity, the impact of global array features on detection results, and detection efficiency through experiments, and carries out detection and analysis on actual commercial software (COTS). Use IDAPro to disassemble the binary program, use IDAPython to extract function call information and the features of each module of the binary program, and use clang and llvm tools to extract the source code features.

1、数据集1. Dataset

模块粒度的开源组件识别评估实验中的二进制程序利用已知复用关系的二进制程序作为测试数据集，包括BCFinder、ISRD和ModX中的测试程序。源码从Github等托管平台获取。二进制程序测试数据集以及与其具备复用关系的开源组件源码描述如表5所示Open Source Component Identification at Module Granularity The binary programs in the evaluation experiments use binary programs with known reuse relationships as test data sets, including the test programs in BCFinder, ISRD and ModX. The source code is obtained from hosting platforms such as Github. The binary program test data set and the source code description of the open source components that have a reuse relationship with it are shown in Table 5

表5用于准确率评估的数据集Table 5 Datasets used for accuracy evaluation

模块粒度的COTS检测应用中所涉及的部分软件数据集如表6所示。Some software datasets involved in COTS detection applications at module granularity are shown in Table 6.

表6部分COTSTable 6 Part COTS

2、对比方法2. Comparison method

实验从目前存在的开源组件检测工作中选择已开放源码的B2SFinder和腾讯安全科恩实验室正式发布的在线软件成分分析平台BinaryAI作为对比方法。B2SFinder在目前文件粒度的二进制程序匹配开源组件源码工作中，选取特征较为全面，结果优于其它方法。BinaryAI是二进制文件软件成分分析(Software component analysis，SCA)的智能分析平台。将B2SFinder和BinaryAI与本文模块粒度的开源组件识别结果进行对比。The experiment selects the open source B2SFinder and BinaryAI, an online software component analysis platform officially released by Tencent Security Keen Lab, as comparison methods from the existing open source component detection work. B2SFinder selects more comprehensive features in the current work of matching binary programs with file granularity to the source code of open source components, and the results are better than other methods. BinaryAI is an intelligent analysis platform for software component analysis (SCA) of binary files. Compare B2SFinder and BinaryAI with the recognition results of open source components at module granularity in this paper.

3、模块粒度的检测评估3. Detection and evaluation of module granularity

(1)评估指标(1) Evaluation indicators

实验选择precision、Recall和F1-Score作为评估指标，定义如公式所示。The experiment selects precision, Recall and F1-Score as evaluation indicators, defined as shown in the formula.

其中，TP指被正确匹配为复用开源组件的数量，TN被正确匹配为非复用开源组件的数量，FN指被错误匹配为复用开源组件的数量，TP被正确匹配为非复用开源组件的数量。Among them, TP refers to the number of correctly matched open source components, TN is correctly matched as the number of non-reusable open source components, FN refers to the number of wrongly matched open source components, and TP is correctly matched as non-reusable open source components The number of components.

(2)模块粒度的开源组件检测效果评估(2) Evaluation of the detection effect of open source components at module granularity

针对实验测试数据集，进行模块粒度的检测评估，将本文模块粒度的检测BSCA-M与3种方法进行比对，分别是未进行模块化的检测BSCA、目前二进制程序匹配开源组件源码工作中较优的B2SFinder、BinaryAI方法，结果如表7所示。For the experimental test data set, the module granularity detection and evaluation are carried out, and the module granularity detection BSCA-M in this paper is compared with three methods, namely, the detection BSCA without modularization, and the comparison between the current binary program matching source code of open source components. The results of the excellent B2SFinder and BinaryAI methods are shown in Table 7.

表7检测效果评估Table 7 Detection effect evaluation

方法method precisionprecision recallrecall F1-ScoreF1-Score BSCA-MBSCA-M 82.35％82.35% 63.64％63.64% 71.8％71.8% B2SFinderB2SFinder 60％60% 81.82％81.82% 69.23％69.23% BinaryAIBinaryAI 84.62％84.62% 50％50% 62.86％62.86%

从表中可知，本发明模块粒度检测开源组件的精度为82.35％，接近于BinaryAI，与B2SFinder达到的仅为51.43％的精度相比，本文精度明显提高了约60.12％。原因在于本发明模块粒度的检测，极大地减少假阳例，降低了误报率。召回率处于B2SFinder与BinaryAI之间，很好地平衡了两种方法的召回率差值过大的现象。It can be seen from the table that the accuracy of the module granularity detection open source component of the present invention is 82.35%, which is close to BinaryAI. Compared with the accuracy of only 51.43% achieved by B2SFinder, the accuracy of this paper is significantly improved by about 60.12%. The reason is that the detection of module granularity in the present invention greatly reduces false positive cases and lowers false alarm rate. The recall rate is between B2SFinder and BinaryAI, which balances the phenomenon that the difference between the recall rates of the two methods is too large.

整体评估结果来看，本发明方法的F1-Score达到71.8％，分别比BinaryAI和B2SFinder提高14.22％、3.71％，并且将二进制程序中是否包含开源组件的检测细化为定位组件所在的具体二进制模块。From the overall evaluation results, the F1-Score of the method of the present invention reaches 71.8%, which is 14.22% and 3.71% higher than BinaryAI and B2SFinder respectively, and the detection of whether the binary program contains open source components is refined to locate the specific binary module where the component is located .

(3)全局数组特征的影响评估(3) Impact assessment of global array features

本发明模块化的开源组件检测方法未选择整型数组、枚举数组特征，B2SFinder的特征选择包括全局数组。为评估全局数组特征对于检测结果及效率的影响，需要排除模块化的过程，因此，将BSCA(未进行模块化)与B2SFinder进行对比分析，结果如表8所示。The modularized open-source component detection method of the present invention does not select integer arrays and enumeration array features, and the feature selection of B2SFinder includes global arrays. In order to evaluate the impact of global array features on detection results and efficiency, the process of modularization needs to be excluded. Therefore, BSCA (without modularization) and B2SFinder were compared and analyzed, and the results are shown in Table 8.

表8全局数组特征影响评估Table 8 Impact assessment of global array features

方法method precisionprecision recallrecall F1-ScoreF1-Score 时耗time consuming BSCABSCA 64％64% 72.73％72.73% 69.33％69.33% 920.38920.38 B2SFinderB2SFinder 60％60% 81.82％81.82% 69.23％69.23% 1151.011151.01

从表中可见，未进行模块化的BSCA与B2SFinder相比，召回率微低，但是精度稍高，主要原因则是B2SFinder中一部分假阳例是通过此类全局数组的匹配到的，降低了检测精度。F1-Score趋于相等，表明去除掉B2SFinder中的整型数组、枚举数组，对于F1-Score几乎没有影响。It can be seen from the table that compared with B2SFinder, BSCA without modularization has a slightly lower recall rate, but slightly higher precision. The main reason is that some false positive cases in B2SFinder are matched by such global arrays, which reduces the detection rate. precision. F1-Score tends to be equal, indicating that removing the integer array and enumeration array in B2SFinder has little effect on F1-Score.

针对测试中涉及到的所有开源组件进行特征提取，将未提取与提取此类全局数组的效率进行对比。从表中可知，BSCA相对于B2SFinder，由于未选择整型和枚举数组特征，对于开源组件的特征提取时耗明显降低20％。Feature extraction was performed for all open source components involved in the test, comparing the efficiency of no extraction with extraction of such global arrays. It can be seen from the table that compared with B2SFinder, BSCA significantly reduces the feature extraction time consumption of open source components by 20% due to the fact that integer and enumerated array features are not selected.

因此，本实施例去除全局整型数组、枚举数组的选择，减少了针对全局数组的特征提取工作量，而且在特征匹配的阶段，避免了对于无效特征的比对。Therefore, this embodiment removes the selection of the global integer array and the enumeration array, reduces the feature extraction workload for the global array, and avoids the comparison of invalid features in the stage of feature matching.

(4)效率评估(4) Efficiency evaluation

对本文模块粒度的检测BSCA-M的两个阶段，即模块化过程和开源组件检测的时间与B2SFinder的开源组件检测进行对比分析，结果如表9所示，单位为秒(s)。The two stages of module granularity detection BSCA-M in this paper, namely the modularization process and open source component detection time, are compared with B2SFinder's open source component detection. The results are shown in Table 9, and the unit is second (s).

表9效率评估Table 9 Efficiency Evaluation

方法method 模块化Modular 开源组件检测Open source component detection 总和sum BSCA-MBSCA-M 63.40263.402 1356.5271356.527 1419.9291419.929 B2SFinderB2SFinder —— 1255.6821255.682 1255.6821255.682

从表中数据可知，本发明方法BSCA-M与B2SFinder相比，整个流程耗时较为接近。虽然本发明方法包含针对二进制程序的模块化阶段，此过程增加了提取二进制程序信息和利用社团检测技术划分二进制模块的时间，并且在检测阶段，对每个模块提取特征并进行匹配，但是依旧保持了效率。It can be seen from the data in the table that, compared with the method BSCA-M of the present invention and B2SFinder, the time consumption of the whole process is relatively close. Although the method of the present invention includes a modularization stage for binary programs, this process increases the time for extracting binary program information and using community detection technology to divide binary modules, and in the detection stage, features are extracted and matched for each module, but it still remains efficiency.

(5)开源组件定位模块数量及频率分析(5) Quantity and frequency analysis of open source component positioning modules

对二进制程序模块粒度的开源组件检测结果进行分析发现，复用的开源组件存在定位不止一个模块的情况。最理想的情况是复用开源组件的函数被划分到同一模块，存在只有一个模块与开源组件匹配。针对所有的测试二进制程序进行开源组件定位模块数量及相应的频率统计，如图2所示。Analysis of the detection results of open source components at the granularity of binary program modules reveals that more than one module is located in multiplexed open source components. The ideal situation is that the functions that reuse open source components are divided into the same module, and there is only one module that matches the open source component. For all test binary programs, the number of open source component positioning modules and the corresponding frequency statistics are performed, as shown in Figure 2.

从图中可知，开源组件定位模块数量为1的频率为57.14％，超过3的仅占21.42％。因此，二进制程序的模块化结果整体较好，将协作完成同一功能的函数划分到同一个模块，复用同一组件的函数不会被错误划分到其他模块，在开源组件检测阶段，则可以将复用的开源组件准确定位到同一个模块中，达到较好的模块粒度检测结果。It can be seen from the figure that the frequency of the number of open source component positioning modules is 1 is 57.14%, and the frequency of more than 3 is only 21.42%. Therefore, the modularization result of the binary program is generally better. The functions that cooperate to complete the same function are divided into the same module, and the functions that reuse the same component will not be mistakenly divided into other modules. In the open source component detection stage, the complex The open source components used are accurately positioned in the same module, achieving better module granularity detection results.

4、模块粒度的COTS开源组件检测应用4. COTS open source component detection application with module granularity

商用软件的发布通常是以闭源的方式，当软件中复用了包含漏洞的开源代码，软件面临着一定的安全风险，故对商用软件进行开源组件检测，定位组件所在的具体模块，实验选取一些常用的商用软件进行检测，部分COTS复用的开源组件列表如表10所示。Commercial software is usually released in a closed-source manner. When open-source code containing vulnerabilities is reused in the software, the software faces certain security risks. Therefore, open-source component detection is performed on commercial software to locate the specific module where the component is located. The experimental selection Some commonly used commercial software is used for detection, and the list of some open source components reused by COTS is shown in Table 10.

表10COTS复用的开源组件Table 10 COTS reused open source components

根据对COTS模块粒度的开源组件检测，分析发现一些开源组件在COTS中被频繁复用，结合模块粒度的检测评估中复用的开源组件，对此类开源组件被复用的频率进行统计，结果如图3所示。According to the open source component detection of COTS module granularity, the analysis found that some open source components are frequently reused in COTS. Combining with the open source components reused in the detection and evaluation of module granularity, the frequency of reuse of such open source components was counted. The result As shown in Figure 3.

图中显示了被频繁复用的前5种开源组件，其中xz在这些组件中被复用频率最高，占比达到50％，其次是zlib和libpng，为16.67％。因此，当组件中包含了漏洞，本发明模块粒度的开源组件检测可以缩小漏洞在复用这类组件的大量程序中的搜索范围，只需要在对应模块中进行漏洞检测。The figure shows the top 5 frequently reused open source components, among which xz is the most frequently reused among these components, accounting for 50%, followed by zlib and libpng, accounting for 16.67%. Therefore, when a component contains a vulnerability, the open-source component detection at the module granularity of the present invention can narrow the search scope of the vulnerability in a large number of programs that reuse such components, and only need to perform vulnerability detection in the corresponding module.

Claims

1. A method for detecting open source components based on binary program modularization, characterized in that: comprise binary program modularization and open source component identification, wherein binary program modularization includes the following:

(1) Extract the function address and create the function call graph for the binary program, and construct the input graph of the modularization step through the extraction of the function address and the function-directed call relationship;

(2) Use the improved BCM algorithm based on modularity to divide the binary program modules;

Open source component identification includes the following:

(1) Feature selection and extraction: For each module of the binary program, the complex branch sequences of character types and functions in each module are extracted as extraction features, where the character types include string literals, string arrays, and exported function names ; For the complex branch sequence of the function, the branch sequence of switch/case, if/else is used as the selected feature;

(2) Open source component identification at module granularity: compare the extracted features of each module of the binary program with the source code features, and detect the binary modules where the open source components are located.

2. the open-source component detection method based on binary program modularization according to claim 1, is characterized in that: the binary program module division comprises the following steps:

Ⅰ. Initially assume that each function is an independent module. For any adjacent node i and node j, calculate the corresponding modularity increment ΔQ when node i is added to the module C where its neighbor node j is located.

In the formula: S _{i, in} represents the sum of the edge weights between node i and the internal function of module C; m is the sum of all edge weights in the network; W _c is the weight sum of all edges inside module _C ; The sum of the weights of the edges associated with the internal function;

Then calculate the modularity increment of node i and all neighbor nodes, and select the largest one; when the value is positive, add node i to the module where the corresponding neighbor node is located; otherwise, node i stays in the original module;

Repeat until the merging phenomenon no longer occurs, and the first layer of modules is divided;

Ⅱ. Based on the module results formed in Ⅰ, repeat the method in Ⅰ to divide the new module results into modules, and obtain the second layer of modules, where the weight of the edges between the modules is the sum of the weights of the edges of all nodes between the two modules ;

Repeat until the network modularity does not increase any more, and the binary program module division is completed.

3. the open-source component detection method based on binary program modularization according to claim 1, is characterized in that: the process of extracting character string from binary module is: obtain all fragments of this function by the address of each function in the module, Traverse all instructions in each function segment, traverse the data reference address of the instruction, and when the address stores a string, add the address and value of the string to the string list.

4. The open source component detection method based on binary program modularization according to claim 1, characterized in that: the process of extracting and exporting function names from the binary program is as follows: firstly all function names in the module need to be obtained, and then by traversing the entire The method of exporting function names in the binary program retains the exporting function names in the module.

5. The open source component detection method based on binary program modularization according to claim 1, characterized in that: the comparison process of the open source component is: traverse the feature type to be matched, and obtain the type feature of the module and the open source component; traverse and obtain feature, add the matching feature to the matching feature list, when the matching feature score exceeds the threshold, add this type of feature to the matching feature type list, if the list is not empty, the module reuses open source components.

6. The open source component detection method based on binary program modularization according to claim 5, characterized in that: when the open source component is identified,

a. For character type features, when the binary program module is consistent with the character features in the open source component, the feature is determined to be a match;

b. For the complex branch sequence of the function, the if/else feature is matched by the length of the longest common subsequence. When the length of the longest common subsequence exceeds the set threshold, it is determined as a match.