CN107203468B

CN107203468B - AST-based software version evolution comparative analysis method

Info

Publication number: CN107203468B
Application number: CN201710258269.3A
Authority: CN
Inventors: 吴艳霞; 刘志江; 李晓松; 孙彬; 卢文祥
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2017-04-19
Filing date: 2017-04-19
Publication date: 2020-09-25
Anticipated expiration: 2037-04-19
Also published as: CN107203468A

Abstract

The invention provides an AST-based software version evolution comparative analysis method. Compare and analyze the source code of the two software versions through the Diff command of Unix, and divide the changed source code into three source code blocks; obtain the AST information of the source code of the two software versions through a syntax analyzer, and analyze and obtain each source code block The AST information corresponding to a line of code; by combining the acquired syntax node information into one combination information, the adjacent code lines with the same identification are combined into modules; by comparing and analyzing the identification information of the two source blocks in units of modules, according to the module The evolution situation is classified and analyzed, and various evolution situations are processed by preprocessing and secondary processing methods, and the evolution information is stored in the database and converted into HTML code. Compared with other software version evolution comparative analysis methods, the present invention can improve the accuracy of acquiring evolution information and improve the efficiency of software version evolution analysis.

Description

A comparative analysis method of software version evolution based on AST

技术领域technical field

本发明涉及的是一种软件版本演化对比分析方法，具体地说是一种以Diff为基础的AST 软件版本演化对比分析方法。The invention relates to a comparative analysis method for software version evolution, in particular to a Diff-based AST software version evolution comparative analysis method.

背景技术Background technique

软件版本演化分析，其本质都是针对软件源码进行差异分析，针对软件演化分析的方法和源码差异分析方法，介绍本发明的技术背景。The essence of software version evolution analysis is to perform difference analysis on software source code, and the technical background of the present invention is introduced for the method of software evolution analysis and source code difference analysis method.

软件演化分为动态演化和静态演化，动态演化也称在线演化，是指在软件执行期间演化，静态演化更多的是对软件版本源码差异内容的一个理解和展示，接下来阐述的软件版本演化只是针对静态演化。主要是介绍的是与源码差异分析方法相关的，静态演化相关的软件版本演化分析方法。Software evolution is divided into dynamic evolution and static evolution. Dynamic evolution, also known as online evolution, refers to evolution during software execution. Static evolution is more about an understanding and display of the differences in the source code of software versions. The software version evolution explained next Just for static evolution. It mainly introduces the software version evolution analysis method related to the source code difference analysis method and static evolution.

目前，在软件版本演化分析方面，根据其分析的粒度主要分为四个分类进行研究，第一个分类是针对源码的增加、删除、修改和未改的代码行数进行进一步研究。第二个分类是针对增加、删除、修改和未改的类函数等语法结构进一步研究，通过将软件版本的变化分为47 种来理解软件演化的过程，到底是什么决定了软件演化的走向，其种类划分的依据是类函数等语法结构的属性参数和语句的种类，如函数的参数改变了是一种变化的种类，语句则有条件、循环和else语句等。其中比较有代表性的是一种改进的基于AST的软件演化分析方法，该方法是基于AST编辑距离(Levenshtein Distance，LD)的方法的改进，基于AST编辑距离 (Levenshtein Distance，LD)方法通常被用于输入字符串的快速模糊匹配、英文辅助，写作等领域，是一种经典而广为使用的方法。早在1996年的时候有人将LD算法的思想迁移到源码的对比，Chawathe提出了基于AST的最短编辑距离算法分析分层结构化代码，假设分层结构化代码都是对称的，类函数等模块的语法结点都是一样的，只是具体属性有所变化。基于此方法思想，提出了一种改进的基于AST的软件演化分析方法，提高了算法的效率和准确性，将算法的适用范围拓宽了，不只限于分层结构化代码。At present, in terms of software version evolution analysis, it is mainly divided into four categories for research according to the granularity of its analysis. The first category is further research on the addition, deletion, modification and unmodified lines of code of source code. The second category is to further study the grammatical structures such as added, deleted, modified and unchanged class functions. By dividing the changes of software versions into 47 types to understand the process of software evolution, what determines the direction of software evolution? The classification of its types is based on the attribute parameters of grammatical structures such as functions and the types of statements. For example, if the parameters of a function change, it is a type of change, and statements include conditionals, loops, and else statements. One of the more representative ones is an improved AST-based software evolution analysis method, which is an improvement of the method based on AST edit distance (Levenshtein Distance, LD). The method based on AST edit distance (Levenshtein Distance, LD) is usually It is a classic and widely used method for fast fuzzy matching of input strings, English assistance, writing and other fields. As early as 1996, some people migrated the idea of LD algorithm to the source code comparison, Chawathe proposed the shortest edit distance algorithm based on AST to analyze hierarchically structured code, assuming that the hierarchically structured code is symmetrical, and the modules such as class functions The syntax nodes are the same, but the specific properties have changed. Based on this method idea, an improved AST-based software evolution analysis method is proposed, which improves the efficiency and accuracy of the algorithm, and broadens the scope of application of the algorithm, which is not limited to hierarchical structured codes.

第三个分类是针对改变和未变的代码进行视图显示，然后进行进一步研究，从文本和软件工程两个角度分析，通过完善Diff的工作，构造了一个VCN(Visual CodeNavigator)，通过视图显示演化信息，可视化的理解版本演化规律，但是VCN只支持C/C++。第四个分类是针对软件源码的模型进行研究，模型映射与比较可以用于软件复用和软件版本演化分析研究虽然能很好的满足软件开发者的需求，但是需要对软件版本的模型进行获取分析，有的软件版本无法获取模型，总的来说这个方法操作难度较大。针对每一个方向都有各种实现方式，每一种方式也有各自的优缺点。The third category is to display the changed and unchanged code in view, and then conduct further research, analyze from the perspective of text and software engineering, and construct a VCN (Visual CodeNavigator) by improving the work of Diff, and display the evolution through the view. Information, visual understanding of version evolution rules, but VCN only supports C/C++. The fourth category is to study the model of software source code. Model mapping and comparison can be used for software reuse and software version evolution analysis. Although it can well meet the needs of software developers, it is necessary to obtain the model of the software version. Analysis, some software versions cannot obtain the model, in general, this method is difficult to operate. There are various implementation methods for each direction, and each method also has its own advantages and disadvantages.

从这些研究者的研究内容看出，虽然软件版本演化分析算法和源码的差异分析算法种类繁多，但是大多是从语义的角度切入，基于AST的版本演化对比分析已经成为了源码分析的热点。相对于原有的静态源码分析，针对整个的源码版本的整体架构的分析算法，演化分析的准确性和效率都不高。From the research content of these researchers, although there are many kinds of software version evolution analysis algorithms and source code difference analysis algorithms, most of them are from the perspective of semantics. AST-based version evolution comparative analysis has become a hot spot in source code analysis. Compared with the original static source code analysis, the analysis algorithm for the overall structure of the entire source code version, the accuracy and efficiency of evolution analysis are not high.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于提供一种能提高获取演化信息的准确率，并能提高软件版本演化分析效率的基于AST的软件版本演化对比分析方法。The purpose of the present invention is to provide an AST-based software version evolution comparison analysis method that can improve the accuracy of acquiring evolution information and improve the software version evolution analysis efficiency.

本发明的目的是这样实现的：The object of the present invention is achieved in this way:

步骤一：通过Unix的Diff命令对两个软件版本源码进行比对分析，根据分析结果文件 Patch获取代码行的增加、删除、改变的情况，将变化的源码分为三种源码块；Step 1: Compare and analyze the source code of the two software versions through the Diff command of Unix, and obtain the addition, deletion and change of code lines according to the analysis result file Patch, and divide the changed source code into three source code blocks;

步骤二：通过语法分析器获取两个软件版本源码的AST信息，针对第一步分析出的三种源码块，分析获取每一种源码块的每一行代码对应的AST信息；Step 2: Obtain the AST information of the source code of the two software versions through the syntax analyzer, and analyze and obtain the AST information corresponding to each line of code of each source block for the three source code blocks analyzed in the first step;

步骤三：通过将获取的语法结点信息组合成一个组合信息，使用该组合信息标识对应行，将标识一样的相邻代码行组合成模块；Step 3: by combining the acquired syntax node information into a combination information, using the combination information to identify the corresponding line, and combining the adjacent code lines with the same identification into a module;

步骤四：通过以模块为单位对比分析两个源码块的标识信息，根据模块演化情况进行分类分析，使用预处理、二次处理方法处理各种演化情况，并将演化信息存入数据库中，并转换成HTML代码。Step 4: By comparing and analyzing the identification information of the two source code blocks in units of modules, classifying and analyzing the evolution of the modules, using preprocessing and secondary processing methods to process various evolutionary situations, and storing the evolution information in the database, and Convert to HTML code.

本发明还可以包括：The present invention can also include:

1、步骤一中所述的将变化的源码分为三种源码块具体包括：增加、删除、改变，将变化的源码分类成增加的源码块、删除的源码块和改变的源码块存储，以此为基础进行进一步的分析处理。1. Dividing the changed source code into three source code blocks described in step 1 specifically includes: adding, deleting, and changing, and classifying the changed source code into added source code blocks, deleted source code blocks, and changed source code blocks. This is the basis for further analysis.

2、步骤二中所述的分析获取每一种源码块的每一行代码对应的AST信息具体包括：2. The analysis described in step 2 obtains the AST information corresponding to each line of code of each source block specifically including:

1)使用广度优先的搜索算法，根据行号逐个匹配ANTLR生成的抽象语法树结点；1) Use the breadth-first search algorithm to match the abstract syntax tree nodes generated by ANTLR one by one according to the line number;

2)如果行号在结点的所属的范围内，逐个匹配此结点的子结点，直到找到完全匹配的行号结点；2) If the line number is within the range of the node, match the child nodes of this node one by one until a completely matching line number node is found;

3)使用语法分析器获取当前行的语法元素以及上层语法元素，直到该源码文件的顶层类或者函数、结构体、枚举类和共用体。3) Use the syntax analyzer to obtain the syntax elements of the current line and the upper-level syntax elements, up to the top-level classes or functions, structures, enumeration classes and unions of the source code file.

3、步骤三所述的组合信息的获取，是使用标签系统标识，它是类、函数、结构体、枚举类和共用体模块的一个唯一标识，使用多个标识组合成一个标识代码行的组合信息，标签系统设计的方法是：3. The acquisition of the combination information described in step 3 is to use the label system identification, which is a unique identification of the class, function, structure, enumeration class and common body module, and uses multiple identifications to combine into one identification code line. Combining information, the approach to labeling system design is:

a)模块的关键字是模块的名字加上相应的属性信息字符；a) The keyword of the module is the name of the module plus the corresponding attribute information characters;

b)模块的标识是使用多个字值对组合而成的，每两个字值对之间都使用一个“@”字符来分隔；b) The identification of the module is composed of multiple word-value pairs, and each two word-value pairs are separated by an "@" character;

c)模块的层级标识之间使用“#”来分隔；c) Use "#" to separate the level identifiers of modules;

d)全局变量，预编译处理和普通代码行的关键字为stmSource，值为tempty。d) The keyword for global variables, precompiled processing and normal code lines is stmSource, and the value is tempty.

4、步骤三中所述的组合成模块，是根据标签系统对Diff划分的源码块，进一步细分模块，具体划分模块的步骤如下：4. The combination module described in step 3 is the source block divided by the Diff according to the labeling system, and the modules are further subdivided. The specific steps of dividing the modules are as follows:

1)针对Diff分析出的变化的源码块，根据行号获取每一行代码的标签信息；1) For the changed source block analyzed by Diff, obtain the label information of each line of code according to the line number;

2)比对相邻行代码的标签信息，根据比对情况分为两种情况，一种是完整模块，此模块的开始行必须和它之前一行的标签不一样，并且模块中间的每一行代码的标签是一样的，另外此模块的结束行必须和它之后的一行的标签也不一样，否则就不是一个完整模块；2) Compare the label information of adjacent lines of code. According to the comparison, there are two cases. One is a complete module. The starting line of this module must be different from the label of the previous line, and each line of code in the middle of the module must be different. The label is the same, and the end line of this module must be different from the label of the line after it, otherwise it is not a complete module;

3)根据分析出的两种情况，填充模块信息数据结构，所述信息数据结构包括模块的开始行、结束行、模块的类型、模块的标签信息。3) Fill in the module information data structure according to the two situations analyzed, the information data structure includes the start line of the module, the end line, the type of the module, and the label information of the module.

5、步骤四中所述的根据模块演化情况进行分类分析具体包括：进行类、函数信息的深入匹配分析，将两个不同版本的对应信息和信息的固定格式进行对比，如果两者都存在，而且有所改变，就是改变演化，如果旧版本没有，存在增加演化，如果新版本没有，存在删除演化，如函数模块的函数参数演化，函数模块的返回值演化。5. The classification analysis according to the evolution of the module described in step 4 specifically includes: performing in-depth matching analysis of class and function information, and comparing the corresponding information of two different versions with the fixed format of the information, if both exist, And there is a change, that is, change evolution. If the old version does not have it, there is an incremental evolution. If the new version does not, there is a deletion evolution, such as the function parameter evolution of the function module, and the return value evolution of the function module.

6、步骤四中所述的使用预处理、二次处理方法处理各种演化情况是一个递归过程，即每一次处理都需要预处理，所述预处理是一次正常对不同版本的同名源码文件所有变化进行分析并将分析的结果进行存储，如果分析到Diff分析有误，就进入二次处理，如果二次处理也再遇到Diff分析错误，然后再进行预处理、二次处理，直到分析正确。6. The use of preprocessing and secondary processing methods to process various evolution situations described in step 4 is a recursive process, that is, preprocessing is required for each processing, and the preprocessing is a normal process for different versions of source code files of the same name. Changes are analyzed and the results of the analysis are stored. If the analysis finds that the Diff analysis is incorrect, it will enter the secondary processing. If the secondary processing also encounters Diff analysis errors, then pre-processing and secondary processing will be performed until the analysis is correct. .

随着软件行业的蓬勃发展，软件越来越复杂，软件版本迭代次数越来越多，开发者接触到的源代码越来越多，同时这些软件架构也越来越复杂，软件演化的规律也就愈加模糊，这就增加了开发者理解和维护软件的难度。本发明提出了一种以Diff为基础的AST软件版本演化对比分析方法，获取类函数等模块的演化信息，为软件版本演化的理解提供依据，并提高了软件版本演化分析的准确率和效率。本发明的核心是面向软件版本变化处的源码，将变化处源码的层次和关系更加精确、高效的体现出来，并迁移到软件版本演化，致力于给用户提供一个指导，减少软件版本理解和维护的难度。With the vigorous development of the software industry, the software is becoming more and more complex, the number of software version iterations is increasing, and developers are exposed to more and more source codes. At the same time, these software architectures are becoming more and more complex, and the laws of software evolution are also It becomes more obscure, which increases the difficulty for developers to understand and maintain the software. The invention proposes a Diff-based AST software version evolution comparative analysis method, which obtains the evolution information of modules such as class functions, provides a basis for the understanding of the software version evolution, and improves the accuracy and efficiency of the software version evolution analysis. The core of the present invention is the source code facing the software version change, which more accurately and efficiently reflects the level and relationship of the source code at the change place, and migrates to the software version evolution, and is committed to providing a guide for users and reducing software version understanding and maintenance. difficulty.

本发明的效果主要体现在：The effect of the present invention is mainly reflected in:

现有的主流的软件版本演化对比分析方法都是基于AST的，传统的软件演化分析大都是基于抽象语法树的相似度匹配，并以此为基础进行语法和语义信息的分析。但是相似就存在误差，而且相似度的匹配需要与所有相似的对象匹配，效率不高。The existing mainstream software version evolution comparative analysis methods are all based on AST, and the traditional software evolution analysis is mostly based on the similarity matching of abstract syntax trees, and based on this, the analysis of syntax and semantic information is carried out. However, there are errors in similarity, and the matching of similarity needs to be matched with all similar objects, which is not efficient.

现有的比较高效的软件版本演化分析方法是，基于编辑距离(LevenshteinDistance，LD) 算法的一种改进的基于AST的软件演化分析方法。LD通常被用于输入字符串的快速模糊匹配、英文辅助，写作等领域，是一种经典而广为使用的方法。将LD算法的思想迁移到源码的对比，基于AST的最短编辑距离算法分析分层结构化代码，假设分层结构化代码都是对称的，类函数等模块的语法结点都是一样的，只是具体属性有所变化。基于此方法思想，提出了一种改进的基于AST的软件演化分析方法，提高了算法的效率和准确性，将算法的适用范围拓宽了，不只限于分层结构化代码。An existing relatively efficient software version evolution analysis method is an improved AST-based software evolution analysis method based on the edit distance (Levenshtein Distance, LD) algorithm. LD is usually used for fast fuzzy matching of input strings, English assistance, writing and other fields. It is a classic and widely used method. Migrate the idea of LD algorithm to the comparison of source code, and analyze hierarchically structured code based on the shortest edit distance algorithm of AST. It is assumed that the hierarchically structured code is symmetrical, and the syntax nodes of modules such as class functions are the same, but Specific properties vary. Based on this method idea, an improved AST-based software evolution analysis method is proposed, which improves the efficiency and accuracy of the algorithm, and broadens the scope of application of the algorithm, which is not limited to hierarchical structured codes.

将LD算法从字符串的匹配应用到AST，其基本思想是一样的。基于AST的最短编辑距离算法将问题分为三个步骤:Applying the LD algorithm from string matching to AST, the basic idea is the same. The AST-based shortest edit distance algorithm divides the problem into three steps:

1)根据自定义规则，将AST转换成需要的树。1) Convert the AST into the required tree according to the custom rules.

2)找到树T1和T2的结点间的合适匹配集合。2) Find a suitable set of matches between the nodes of trees T1 and T2.

3)根据匹配集合，找到最小将树T1转化为T2的编辑距离。3) According to the matching set, find the minimum edit distance to convert the tree T1 to T2.

第一步需要将AST转换成自己分析时所需的树，那么源码中所有的模块信息都要根据树的信息对应存储，每一个结点的匹配都需要遍历对应的存储信息，工作量很大，在大规模软件版本的演化分析中代价很高。也有的研究者将每一个结点映射成一个Hash值，然后再进行匹配，虽然每一个结点的匹配过程会有一定的优化，但是基于AST的最短编辑举例算法还是必须将每一个结点映射存储，分析时根据树的编辑距离对每一个结点确定演化情况。自然存储的信息越多，遍历AST时对比分析的效率就越低。而本发明是根据变化的代码行定位获取 AST信息，无需遍历整个AST树。The first step is to convert the AST into the tree required for your own analysis, then all the module information in the source code must be stored according to the information of the tree, and the matching of each node needs to traverse the corresponding storage information, which is a lot of work. , which is expensive in the evolutionary analysis of large-scale software releases. Some researchers also map each node into a Hash value, and then perform matching. Although the matching process of each node will be optimized to a certain extent, the shortest editing example algorithm based on AST still must map each node. Storage and analysis determine the evolution of each node according to the edit distance of the tree. The more information that is naturally stored, the less efficient the comparative analysis will be when traversing the AST. In the present invention, the AST information is obtained according to the location of the changed code lines, without traversing the entire AST tree.

第二步是根据结点间的距离来匹配两个结点，结点间的距离是通过结点的结构和结点的类型分析，以及结合结点间的距离公式计算结点间的距离值，最后根据阈值确定结点是否匹配，其中距离公式和阈值是一个不稳定因素，很有可能导致误匹配和遗漏匹配的情况。The second step is to match two nodes according to the distance between the nodes. The distance between the nodes is analyzed by the structure of the node and the type of the node, and the distance value between the nodes is calculated by combining the distance formula between the nodes. , and finally determine whether the nodes are matched according to the threshold, in which the distance formula and the threshold are unstable factors, which are likely to lead to false matching and missing matching.

本发明是根据Diff分析源码的变化，使用语法分析器获取的源码AST信息修正Diff分析的结果，从语义上分析和修正Diff的错误分析，如将一个改变的代码情况分析成增加和删除。使用Diff分析能准确分析出文本上的变化，使接下来的分析基础非常高效准确。结合 AST分析，设计并使用了标签系统，能高效的标识类函数等模块信息，可以根据本发明使用的方法能高效的分析出类、函数、结构体、共用体等模块结构的演化信息。According to the change of Diff analysis source code, the present invention uses the source code AST information obtained by the syntax analyzer to correct the result of Diff analysis, and analyzes and corrects the error analysis of Diff from semantics, such as analyzing a changed code situation into addition and deletion. Using Diff analysis can accurately analyze the changes in the text, making the subsequent analysis basis very efficient and accurate. Combined with AST analysis, a tag system is designed and used, which can efficiently identify module information such as class functions, and can efficiently analyze the evolution information of module structures such as classes, functions, structures, and unions according to the method used in the present invention.

根据上述分析以及实验表明，本发明和其他软件版本演化对比分析方法相比，能提高获取演化信息的准确率，并能提高软件版本演化分析的效率。能有效分析出类函数等模块结构的演化信息，弥补了Diff分析的不足，对git，cvs等源码版本管理工具有更详细的指导作用，对阅读源码、维护系统方面都有很好的指导作用。开发者可以借此方法来指导先期探索，加快理解工程，提升开发效率。进一步的发展，可以借此方法完成功能模块的复用。According to the above analysis and experiments, the present invention can improve the accuracy of acquiring evolution information and improve the efficiency of software version evolution analysis compared with other software version evolution comparative analysis methods. It can effectively analyze the evolution information of module structure such as class functions, make up for the lack of Diff analysis, and provide more detailed guidance for source code version management tools such as git and cvs, and has a good guiding role in reading source code and maintaining the system. . Developers can use this method to guide early exploration, speed up understanding of engineering, and improve development efficiency. In further development, the reuse of functional modules can be completed by this method.

附图说明Description of drawings

图1为系统架构图Diff源码版本对比例子图；Figure 1 is a comparison example of the Diff source code version of the system architecture diagram;

图2为源码版本V1和V2的Diff分析正常格式结果图；Figure 2 shows the normal format result of Diff analysis of source code versions V1 and V2;

图3为add函数源码图；Figure 3 is the source code diagram of the add function;

图4为源码对应的AST图；Figure 4 is the AST diagram corresponding to the source code;

图5为语法树节点对应的行号图；Fig. 5 is the line number diagram corresponding to the syntax tree node;

图6为类函数等模块标签信息的获取流程图；Fig. 6 is the acquisition flow chart of module label information such as class function;

图7为模块信息表图；Fig. 7 is the module information table diagram;

图8为软件版本演化差异分析流程图；Figure 8 is a flowchart of software version evolution difference analysis;

图9为配置文件configure.txt内容图；Figure 9 is the content diagram of the configuration file configure.txt;

图10为测试环境结构图。Figure 10 is a structural diagram of the test environment.

具体实施方式Detailed ways

下面举例对本发明做更详细的描述。The present invention will be described in more detail with examples below.

本发明的基于AST的软件版本演化对比分析方法，主要包括如下几个步骤：The AST-based software version evolution comparative analysis method of the present invention mainly includes the following steps:

步骤一：通过Unix的Diff命令对两个软件版本源码进行比对分析，根据分析结果文件 Patch获取代码行的增加、删除、改变的情况，将变化的源码分为三种源码块。Step 1: Compare and analyze the source code of the two software versions through the Diff command of Unix. According to the analysis result file Patch, obtain the addition, deletion and change of the code line, and divide the changed source code into three source code blocks.

步骤二：通过语法分析器获取两个软件版本源码的AST信息，针对第一步分析出的三种源码块，分析获取每一种源码块的每一行代码对应的AST信息。Step 2: Obtain the AST information of the source code of the two software versions through the syntax analyzer, and analyze and obtain the AST information corresponding to each line of code of each source code block for the three source code blocks analyzed in the first step.

步骤三：通过将获取的语法结点信息组合成一个信息，使用该组合信息标识对应行，将标识一样的相邻代码行组合成模块。Step 3: Combine the acquired syntax node information into one piece of information, use the combined information to identify corresponding lines, and combine adjacent code lines with the same identification into modules.

步骤四：通过以模块为单位对比分析两个源码块的标识信息，根据模块演化情况进行分类分析，使用预处理、二次处理等方法灵活处理各种演化情况，并将演化信息存入数据库中，并转换成HTML代码；Step 4: Compare and analyze the identification information of the two source code blocks by module, classify and analyze the evolution of the modules, use preprocessing, secondary processing and other methods to flexibly handle various evolutions, and store the evolution information in the database. , and convert it into HTML code;

步骤一将变化的源码分为三种源码块，增加、删除、改变，将变化的源码分类成增加的源码块，删除的源码块和改变的源码块存储，以此为基础进行进一步的分析处理，使处理流程更加合理、清晰。Step 1 Divide the changed source code into three source code blocks, add, delete, and change, and classify the changed source code into added source code blocks, deleted source code blocks and changed source code blocks. , to make the processing flow more reasonable and clear.

步骤二获取的每一行代码对应的AST信息，本发明的获取的代码行对应的AST信息的方法，会将使用语法分析器获取的源码的AST的冗余信息去除，并无需遍历整个抽象语法树。The AST information corresponding to each line of code obtained in step 2, the method of the present invention for obtaining the AST information corresponding to the code line will remove the redundant information of the AST of the source code obtained by using the syntax analyzer, without traversing the entire abstract syntax tree .

具体的获取步骤如下：The specific acquisition steps are as follows:

2)如果行号在结点的所属的范围内，会逐个匹配此结点的子结点，直到找到完全匹配的行号结点；2) If the line number is within the range of the node, the child nodes of this node will be matched one by one until a completely matching line number node is found;

3)使用语法分析器获取当前行的语法元素以及上层语法元素，直到该源码文件的顶层类或者函数的类、函数、结构体、枚举类和共用体。3) Use the syntax analyzer to obtain the syntax elements of the current line and the upper-level syntax elements, up to the top-level class or function class, function, structure, enumeration class and union of the source file.

步骤三获取标识代码行的组合信息，是使用本发明设计的标签系统标识，它是类、函数、结构体、枚举类和共用体模块的一个唯一标识。使用多个标识组合成一个标识代码行的组合信息，标签系统是进一步细分模块的标准，设计的原则是：In step 3, the combination information of identifying code lines is obtained by using the label system identification designed by the present invention, which is a unique identification of class, function, structure, enumeration class and common body module. Using multiple identifiers to combine information into a single identifier code line, the labeling system is a standard for further subdividing modules, and the design principles are:

1)模块的关键字是模块的名字加上相应的属性信息字符，如类的名字，使用“className”关键字。1) The keyword of the module is the name of the module plus the corresponding attribute information characters, such as the name of the class, using the "className" keyword.

2)模块的标识是使用多个字值对(关键字+“@”+值)组合而成的，每两个字值对之间都使用一个“@”字符来分隔。2) The identification of the module is composed of multiple word-value pairs (keyword + "@" + value), and each two word-value pairs are separated by an "@" character.

3)模块的层级标识之间使用“#”来分隔，如类中的函数，className@math#function Name@add。3) Use "#" to separate the hierarchical identifiers of modules, such as functions in a class, className@math#function Name@add.

4)全局变量，预编译处理和普通代码行的关键字为stmSource，值为tempty4) Global variables, the keyword for precompiled processing and ordinary code lines is stmSource, and the value is tempty

步骤三组合成的模块，是根据标签系统对Diff划分的源码块，进一步细分模块，如类模块，函数模块。具体划分模块的步骤如下：The modules formed in the third step are the source blocks divided by the Diff according to the labeling system, and the modules are further subdivided, such as class modules and function modules. The specific steps of dividing the modules are as follows:

1)针对Diff分析出的变化的源码块，根据行号获取每一行代码的标签信息。1) For the changed source block analyzed by Diff, the label information of each line of code is obtained according to the line number.

2)比对相邻行代码的标签信息，根据比对情况分为两种情况，一种是完整模块，此模块的开始行必须和它之前一行的标签不一样，并且模块中间的每一行代码的标签是一样的，另外此模块的结束行必须和它之后的一行的标签也不一样，否则就不是一个完整模块。2) Compare the label information of adjacent lines of code. According to the comparison, there are two cases. One is a complete module. The starting line of this module must be different from the label of the previous line, and each line of code in the middle of the module must be different. The label is the same, and the end line of this module must be different from the label of the line after it, otherwise it is not a complete module.

3)根据第二步分析出的两种情况，填充模块信息数据结构，如模块的开始行，结束行，模块的类型，模块的标签信息。3) Fill in the module information data structure according to the two situations analyzed in the second step, such as the start line of the module, the end line, the type of the module, and the label information of the module.

步骤四进行类、函数等信息的深入匹配分析，将两个不同版本的对应的信息和信息的固定格式进行对比，如果两者都存在，而且有所改变，就是改变演化，如果旧版本没有，存在增加演化，如果新版本没有，存在删除演化，如函数模块的函数参数演化，函数模块的返回值演化。Step 4: Carry out in-depth matching analysis of information such as classes and functions, and compare the corresponding information of two different versions with the fixed format of the information. If both exist and have changed, it is a change and evolution. If the old version does not, There is incremental evolution, if there is no new version, there is deletion evolution, such as function parameter evolution of function modules, and return value evolution of function modules.

步骤四所描述的预处理和二次处理是一个递归过程，就是每一次处理都需要预处理，预处理是一次正常对不同版本的同名源码文件所有变化进行分析并将分析的结果进行存储。如果分析到Diff分析有误，就会进入二次处理，二次处理也可能遇到Diff分析错误，然后再进行预处理、二次处理，直到分析正确，预处理需要将分析到的数据分别存储。The preprocessing and secondary processing described in step 4 is a recursive process, that is, each processing requires preprocessing. The preprocessing is a normal analysis of all changes in the source code files of the same name of different versions and the analysis results are stored. If the analysis finds that the Diff analysis is wrong, it will enter the secondary processing, and the secondary processing may also encounter Diff analysis errors, and then perform preprocessing and secondary processing until the analysis is correct. The preprocessing needs to store the analyzed data separately. .

由于对许多文件对比软件只能区分文件夹和文件变化的粗粒度信息，显示哪几行有增删改，因此鉴于此种情况，本发明着眼于基于AST的软件版本演化分析算法研究，主要研究的内容是准确定位存在演化信息的AST结点，分析获取类函数等模块的属性信息，并对演化情况进行分类，分析各种演化情况，提高演化分析的准确率和效率。本发明的研究由以下几部分内容组成：Since many file comparison software can only distinguish coarse-grained information about changes in folders and files, and show which lines have been added, deleted, or modified, in view of this situation, the present invention focuses on the research on AST-based software version evolution analysis algorithm. The content is to accurately locate the AST nodes with evolution information, analyze and obtain the attribute information of modules such as class functions, classify the evolution situation, analyze various evolution situations, and improve the accuracy and efficiency of evolution analysis. The research of the present invention consists of the following parts:

1)如何准确定位存在演化信息的AST结点，本发明通过Diff能准确定位变化的行号，根据行号能准确定位源码AST存在演化的结点，但是Diff分析出的结果存在错误，需要修正，如将一个改变的代码情况分析成增加和删除。1) How to accurately locate the AST node with evolution information, the present invention can accurately locate the changed line number through Diff, and can accurately locate the evolution node of the source code AST according to the line number, but the result analyzed by Diff has errors and needs to be corrected , such as analyzing a changed code situation into additions and deletions.

2)如何分析获取类函数等模块的属性信息，本发明根据源码的AST结点信息，分析每一类模块的结点特点，并获取属性信息。2) How to analyze and obtain the attribute information of modules such as class functions, the present invention analyzes the node characteristics of each type of module according to the AST node information of the source code, and obtains the attribute information.

3)演化情况的分类及分析，本发明根据模块的属性信息，对演化情况分类，然后根据其特点设计相应的方法。3) Classification and analysis of the evolution situation, the present invention classifies the evolution situation according to the attribute information of the module, and then designs a corresponding method according to its characteristics.

由于演化情况的多种多样，具体的方法是由解决多个具体问题的具体方法组合实现的。整个方法流程主要涉及以下几个部分：Due to the variety of evolution situations, specific methods are realized by a combination of specific methods that solve multiple specific problems. The whole method process mainly involves the following parts:

1)Diff分析结果划分1) Division of Diff analysis results

Diff是基于LCS算法实现的，用来对比分析新旧两个版本文件的异同。操作方法是在系统终端界面中，输入：$Diff<旧版本文件><新版本文件>。Diff的文本显示不太友好，为了便于说明，举例C++文件如附图1所示。Diff is implemented based on the LCS algorithm and is used to compare and analyze the similarities and differences between the old and new versions of the files. The operation method is to enter: $Diff<old version file><new version file> in the system terminal interface. The text display of Diff is not very friendly. For the convenience of explanation, an example C++ file is shown in Figure 1.

Diff有三种显示格式：正常格式，上下文格式和合并格式，下面会介绍本发明使用的正常格式。Diff has three display formats: normal format, contextual format and merged format. The normal format used in the present invention will be introduced below.

现在对V1和V2进行比较：$Diff V1 V2，此时，Diff就会显示正常格式的结果如附图2 所示。Now compare V1 and V2: $Diff V1 V2, at this point, Diff will display the result in normal format as shown in Figure 2.

每一个显示部分都分为三部分：第一行是一个提示，用来说明变动位置，第一个"3"，表示V1的第3行有变化；"c"(change)是一种改变模式的缩写，另外还有两个模式，分别为 "a"(addition)代表增加和"d"(deletion)代表删除；第二个"3"，表示前面显示的V1的第3行变成V2的第3行。第二行分成两个部分，开头的小于号，表示要从V1当中删除第4 行，"return 0；"表示该行的内容。第三行用来分割V1和V2，第四行，类似于第二行，前面的大于号表示V2增加了该行，后面的"return 1；"表示该行的内容。Each display part is divided into three parts: the first line is a prompt to explain the change position, the first "3" means that the third line of V1 has changed; "c" (change) is a change mode The abbreviation of , and there are two other modes, namely "a" (addition) for increase and "d" (deletion) for deletion; the second "3" means that the 3rd line of V1 shown earlier becomes V2's Line 3. The second line is divided into two parts. The less than sign at the beginning indicates that the fourth line should be deleted from V1, and "return 0;" indicates the content of this line. The third line is used to separate V1 and V2. The fourth line is similar to the second line. The greater-than sign in front means that V2 has added the line, and the following "return 1;" means the content of the line.

对两个源码版本进行一次Diff会生成一个Patch文件，该Patch文件包含了新旧源码变化的对应的行号，而此处的行号，本发明会将其对应成一个源码块。根据Diff分析的行号对应的增加、删除、改变变化，总共有三种源码块，第一种是删除的源码块，第二种是增加的源码块，第三种是改变的源码块。由于Diff是针对文本行进行的LCS的处理，因此没有语义信息，而且这种模块的划分，不会包含任何语义信息。另外对于一个模块的位置变化，大多会分析成增加和删除的改变。因此需要提出一种新的模块划分方法，这种方法需要体现语义信息，而且能进行更加高层次的演化分析，例如函数的参数变了，类的继承关系变了。Diffing the two source code versions once will generate a Patch file, and the Patch file contains the corresponding line numbers of the old and new source code changes, and the line numbers here are corresponding to a source code block in the present invention. According to the corresponding addition, deletion and change of the line number analyzed by Diff, there are three kinds of source blocks in total, the first is the deleted source block, the second is the added source block, and the third is the changed source block. Since Diff is LCS processing for text lines, there is no semantic information, and the division of this module will not contain any semantic information. In addition, for the position change of a module, most of them will be analyzed as changes in addition and deletion. Therefore, it is necessary to propose a new module division method, which needs to reflect semantic information, and can carry out higher-level evolution analysis, for example, the parameters of functions have changed, and the inheritance relationship of classes has changed.

2)标签系统设计2) Label system design

标签系统是类、函数、结构体、枚举类和共用体模块的一个唯一标识，根据标签系统针对Diff划分的源码块，进一步细分模块，如类模块，函数模块。因此标签系统是进一步细分模块的标准，这个标签系统设计的原则是：The labeling system is a unique identification of classes, functions, structures, enumeration classes and common body modules. According to the source blocks divided by the labeling system for Diff, modules are further subdivided, such as class modules and function modules. Therefore, the labeling system is the standard for further subdividing modules. The principles of this labeling system design are:

(1)模块的关键字是模块的名字加上相应的属性信息字符，如类的名字，使用“className”关键字。(1) The keyword of the module is the name of the module plus the corresponding attribute information characters, such as the name of the class, using the "className" keyword.

(2)模块的标识是使用多个字值对(关键字+“@”+值)组合而成的，每两个字值对之间都使用一个“@”字符来分隔。(2) The identification of the module is composed of multiple word-value pairs (keyword + "@" + value), and each two word-value pairs are separated by an "@" character.

(3)模块的层级标识之间使用“#”来分隔，如类中的函数，className@math#function Name@add。(3) Use "#" to separate the hierarchical identifiers of modules, such as functions in a class, className@math#function Name@add.

全局变量，预编译处理和普通代码行的关键字为stmSource，值为temptyThe keyword for global variables, precompiled processing and normal code lines is stmSource, and the value is tempty

接下来就是关于各个模块的各个关键字的介绍，具体如下：Next is the introduction of each keyword of each module, as follows:

(1)类的关键字(1) Keywords of the class

类模板：classTemplate@template，这里的template是为了凑字值对，没有实际值。Class template: classTemplate@template, the template here is to make up word-value pairs and has no actual value.

类模板参数：classTemplateParameter@(typenameT)，模板参数是T。Class template parameter: classTemplateParameter@(typenameT), the template parameter is T.

类名：className@Math，一个名为Math的类。Class name: className@Math, a class named Math.

类继承：classRelation@public Circle，本类继承自Circle类。Class inheritance: classRelation@public Circle, this class inherits from the Circle class.

(2)函数的关键字(2) Keyword of function

函数模板：functionTemplate@template，这里的template是为了凑字值对，没有实际值。Function template: functionTemplate@template, the template here is to make up word-value pairs and has no actual value.

函数模板参数：functionTemplateParameter@(typenameT)，模板参数是T。Function template parameter: functionTemplateParameter@(typenameT), the template parameter is T.

内联函数：inlineFunction@inline，这里的inline是为了凑字值对，没有实际值。Inline function: inlineFunction@inline, the inline here is to make up word value pairs, there is no actual value.

函数的返回值：functionReturn@int，这个函数的返回值是int。The return value of the function: functionReturn@int, the return value of this function is int.

函数名：functionName@Add，一个名为Add的函数。Function name: functionName@Add, a function named Add.

函数参数：functionParameters@(inta)，此函数的参数是一个名为a的int类型。Function parameters: functionParameters@(inta), the parameter of this function is an int type named a.

其它的模块，如结构体、枚举类和共用体的标识都比较简单，这里就不一一赘述了。The identification of other modules, such as structures, enumeration classes and unions, is relatively simple, so I won't go into details here.

3)基于AST的标签系统信息的获取3) Acquisition of AST-based labeling system information

本发明是以Diff分析为基础的，Diff分析出的变化是针对源码的具体行的，而本发明是对每一行代码进行类函数等模块的标签标识的，其关键字信息值的获取是关键。本发明有别于其它的基于AST的演化分析是由于本发明是根据行号定位AST的关键字信息，无需遍历整个抽象语法树，这种方式增加了大型版本的软件版本演化分析的能力，提高了效率和准确率。ANTLR分析出的抽象语法树的每一个节点是有对应的行号信息的，如附图3中源代码对应的ANTLR生成的AST如附图4所示，该图中存在很多冗余信息，尤其是具体的变量和语句。将AST的冗余信息去除，处理过的语法树的行号信息如附图5所示。The present invention is based on Diff analysis, the changes analyzed by Diff are for specific lines of source code, and the present invention is to mark each line of code with labels of modules such as class functions, and the acquisition of its keyword information value is the key . The present invention is different from other evolution analysis based on AST because the present invention locates the keyword information of AST according to the line number, without traversing the entire abstract syntax tree. efficiency and accuracy. Each node of the abstract syntax tree analyzed by ANTLR has corresponding line number information. As shown in Figure 3, the AST generated by ANTLR corresponding to the source code in Figure 3 is shown in Figure 4. There is a lot of redundant information in this figure, especially are specific variables and statements. The redundant information of the AST is removed, and the line number information of the processed syntax tree is shown in FIG. 5 .

本发明使用广度优先的搜索算法，根据行号逐个匹配ANTLR生成的抽象语法树结点，如果行号在结点的所属的范围内，会逐个匹配此结点的子结点，直到找到完全匹配的行号结点。如本例中，如果要找行号2所对应的标签信息，从根结点开始匹配，行号2的结点在根结点所属范围1～3之间，会继续搜索根结点的子结点，从左边第一个子结点开始，直到搜索到函数体的结点完全匹配。然后获取函数体结点的父结点的第一个子结点信息，如函数名字，函数返回值类型，函数参数。最终将获取的信息加上预先定义关键字的标签信息，整个完整信息如下：The invention uses the breadth-first search algorithm to match the abstract syntax tree nodes generated by ANTLR one by one according to the line number. If the line number is within the range of the node, it will match the child nodes of this node one by one until a complete match is found. line number node. In this example, if you want to find the label information corresponding to line number 2, start the match from the root node. The node of line number 2 is in the range 1 to 3 to which the root node belongs, and will continue to search for the children of the root node. Node, starting from the first child node on the left, until the node of the function body is completely matched. Then obtain the first child node information of the parent node of the function body node, such as function name, function return value type, and function parameters. Finally, the obtained information is added with the label information of the pre-defined keywords. The complete information is as follows:

functionReturn@int@functionName@add@functionParameters@(inta,intb)functionReturn@int@functionName@add@functionParameters@(inta,intb)

上述例子的语法树结点信息是经过处理的，实际的结点信息比较复杂，获取AST信息的关键结点信息的具体流程如附图6所示。The syntax tree node information of the above example is processed, and the actual node information is relatively complex. The specific process of obtaining the key node information of the AST information is shown in FIG. 6 .

其中的结点名都是AST中的表达，它和实际模块结点信息一一对应，本发明需要获取类、函数、结构体、枚举类和共用体模块的AST信息，然后才能使用标签系统。本发明只举例介绍类函数的详细标签信息获取，其它模块的标签信息获取，与此类似。附图6中的流程图首先会根据上述定位代码行结点信息的方法获取代码行的结点信息，然后根据结点的名字信息和子结点信息的内容获取类函数模块的具体信息，最后循着AST树的父结点一直遍历到根节点，将遍历到的模块信息组合成标签信息。The node names are all expressions in the AST, which correspond one-to-one with the actual module node information. The invention needs to obtain the AST information of the class, function, structure, enumeration class and common body module, and then the label system can be used. The present invention only introduces the acquisition of the detailed label information of the class function by example, and the acquisition of the label information of other modules is similar. The flowchart in the accompanying drawing 6 first obtains the node information of the code line according to the above-mentioned method of locating the node information of the code line, then obtains the specific information of the class function module according to the name information of the node and the content of the child node information, and finally follows The parent node of the AST tree is traversed all the way to the root node, and the traversed module information is combined into label information.

4)基于标签系统的分模块处理4) Sub-module processing based on label system

当能获取每一行代码的标签之后，本发明就能将每一行代码按标签进行分类，分模块。为此本发明设计了一个数据结构来存储模块信息，具体模块信息如附图7所示，其中的名字标签，就是语法标签去掉了一些语法信息的结果，只有由每一个模块的名字标签组合在一起的字符串。其中Type信息表示的是整个完整模块的类型，如一个完整的函数，而标号7则表示这块的代码只有模块的一部分，例如只是函数其中的某一块有演化，有可能是函数中增加了一行代码。有了模块信息之后就能分模块处理，具体的步骤如下：After the label of each line of code can be obtained, the present invention can classify each line of code according to the label and divide it into modules. For this reason, the present invention designs a data structure to store module information. The specific module information is shown in Figure 7. The name tag is the result of removing some syntax information from the syntax tag. Only the name tag of each module is combined in the strings together. The Type information indicates the type of the entire complete module, such as a complete function, and the label 7 indicates that the code of this block is only a part of the module. For example, only a certain block of the function has evolved. It may be that a line has been added to the function. code. After the module information is available, it can be processed in modules. The specific steps are as follows:

第一步，针对Diff分析出的变化的源码块，根据行号获取每一行代码的标签信息，这里有两种标签，一种是名字标签，一种是语法标签。The first step is to obtain the label information of each line of code according to the line number of the changed source block analyzed by Diff. There are two kinds of labels, one is the name label, and the other is the syntax label.

第二步，比对相邻行代码的两种标签信息，根据比对情况分为两种情况，一种是完整模块，就是附图7中Type为1-5的整个模块的信息，此模块的开始行必须和它之前一行的标签不一样，并且模块中间的每一行代码的标签是一样的，另外此模块的结束行必须和它之后的一行的标签也不一样，否则就不是一个完整模块，就是Type为7的类型。The second step is to compare the two label information of adjacent lines of code. According to the comparison, it is divided into two cases. One is the complete module, which is the information of the entire module whose Type is 1-5 in Figure 7. This module The start line must be different from the label of the line before it, and the label of each line of code in the middle of the module must be the same, and the end line of this module must be different from the label of the line after it, otherwise it is not a complete module , which is the type of Type 7.

第三步，根据第二步分析出的两种情况，填充模块信息数据结构，如模块的开始行，结束行，模块的类型，模块的名字标签和语法标签。The third step is to fill in the module information data structure according to the two situations analyzed in the second step, such as the start line of the module, the end line, the type of the module, the name tag of the module and the syntax tag.

5)演化分类步骤5) Evolutionary classification steps

从类演化的数据结构信息可以看出，模块的演化分类是以类和函数等模块的语法信息进行分类的，如类的修饰符演化，类的继承关系演化，类的内部一些模块和非模块演化次数的一些统计。根据Diff分析的增加、删除和改变，这里的增加、删除和改变不是最终的，需要经过分模块化处理修正之后进一步分析，分模块后的分类处理过程如下：From the data structure information of class evolution, it can be seen that the evolution classification of modules is classified by the grammatical information of modules such as classes and functions, such as the evolution of modifiers of classes, the evolution of inheritance relationships of classes, and some internal modules and non-modules of classes. Some statistics on the number of evolutions. According to the additions, deletions and changes of the Diff analysis, the additions, deletions and changes here are not final, and need to be further analyzed after being corrected by sub-module processing. The classification processing process after sub-modules is as follows:

第一步，如果是一个模块增加，则获取软件新版本关于此模块的标签的最底层的标签，如className@class Sphere#functionName@add，这里最底层的标签是函数add，只需获取 functionName的值，就说明存在一个函数增加的演化信息，并且说明类Sphere中存在一个函数增加的演化信息，这两个演化的统计方式是不一样的。一个是实际演化的模块统计演化关系，一个从层次归属体现演化关系。The first step, if a module is added, get the bottommost label of the new version of the software about the label of this module, such as className@class Sphere#functionName@add, where the bottommost label is the function add, just get the functionName value, it means that there is a function-added evolution information, and there is a function-added evolution information in the class Sphere. The statistical methods of these two evolutions are different. One is the statistical evolution relationship of the actual evolution module, and the other reflects the evolution relationship from the hierarchical attribution.

第二步，如果是一个模块删除，同上，只不过是获取软件旧版本的信息，增加改成删除。In the second step, if a module is deleted, the same as above, it is just to obtain the information of the old version of the software, and change it to delete.

第三步，如果是一个模块的模块体改变了，这里不论是增加，删除，还是改变，都记录成最底层模块的一个内容改变。如函数体中增加了一行，记录此函数多了一次内容改变。In the third step, if the module body of a module changes, whether it is added, deleted, or changed, it is recorded as a content change of the bottom-level module. If a line is added to the function body, one more content change is recorded for this function.

第四步，如果是一个模块属性改变，这里的属性指的是除了标识系统中的名字之外其它关于模块的描述，如函数的返回值，参数，是否内联都是此函数模块的属性。此处需要对比软件新旧版本关于此模块的最底层标签，方法是使用一个关于底层模块所有信息的完整关键字组成的标签，如函数的如下： functionTemplate@functionParameter@inlineF-unction@functionReturn@functionName@f unctionParameters，假设从第一个标签关键字functionTemplate开始，如果新旧软件版本的标签都有此关键字，就对比此关键字的值，如果是一样的，说明此属性是一样的，如果不一样，就记录此属性的改变信息，如果新旧标签有一个没有此关键字，就记录一次属性增加或删除。The fourth step, if a module attribute is changed, the attribute here refers to the description of the module other than the name in the identification system, such as the return value of the function, the parameters, and whether it is inlined or not are the attributes of the function module. Here, it is necessary to compare the lowest-level label of this module between the old and new versions of the software. The method is to use a label composed of complete keywords about all the information about the underlying module, such as the function as follows: functionTemplate@functionParameter@inlineF-unction@functionReturn@functionName@f unctionParameters, assuming starting from the first tag keyword functionTemplate, if the tags of the old and new software versions have this keyword, compare the value of this keyword, if it is the same, it means that this attribute is the same, if not, then Record the change information of this attribute. If the old and new tags have one without this keyword, record the addition or deletion of the attribute once.

在软件版本的演化过程中会分析两个不同的软件版本源码，本发明将待分析的两个软件版本源码分别标记为VI和V2，附图8描述了软件版本演化分析的整个过程。其中第一块是针对同名同路径的源码文件进行Diff分析，获取Patch文件，逐行处理Patch文件，对每一种Diff变化，进行分类处理，并分模块处理类函数等模块的相关演化，最终通过Diff-AST算法的处理将分析的结果存储到数据库并转换存储成HTML页面，既可以通过数据库查看演化信息，也可以通过浏览器查看。第二块是针对文件和文件夹的移动操作，使用MD5的加密标签处理，然后将移动的文件和文件夹的名字和路径转换成一张HTML的表，并将数据存入数据库。Two different software version source codes will be analyzed in the evolution process of the software version. The present invention marks the two software version source codes to be analyzed as VI and V2 respectively. Figure 8 describes the entire process of software version evolution analysis. The first block is to perform Diff analysis on source files with the same name and the same path, obtain Patch files, process Patch files line by line, classify each Diff change, and process the related evolution of modules such as class functions in modules. Through the processing of the Diff-AST algorithm, the analysis results are stored in the database and converted into HTML pages. The evolution information can be viewed through the database or through the browser. The second block is for the moving operation of files and folders, using MD5 encrypted label processing, and then converting the names and paths of the moved files and folders into an HTML table, and storing the data in the database.

Diff分析的结果显示有三种方式，本发明使用正常格式的Diff，会将分析的结果存储到一个patch文件中。Diff-AST算法会逐行分析patch文件，使用正在表达式“[0-9]+(,[0-9]+)？a[0-9]+(,[0-9]+)？.*”匹配增加变化，删除变化和改变变化是相似的。接下来需要根据分析到的V1，V2变化的行号，通过开源语法分析器ANTLR获取V1和V2的 AST，基于AST去分析类函数等模块的语法语义信息，构建模块的标签信息，然后根据标签进行分模块处理，最终针对各种演化类型特点进行相应的算法设计和实现，并将数据通过 Hibernate持久化技术存储到MySQL数据库中，为了直观的对软件版本演化信息进行查看，也会将分析到的演化信息转换成JSON格式的信息存储到HTML页面信息。The results of the Diff analysis are displayed in three ways. The present invention uses the normal format Diff, and the analysis results are stored in a patch file. The Diff-AST algorithm analyzes the patch file line by line, using the regular expression "[0-9]+(,[0-9]+)?a[0-9]+(,[0-9]+)?. *" matches add change, delete change and change change are similar. Next, it is necessary to obtain the AST of V1 and V2 through the open source parser ANTLR according to the line numbers of the analyzed V1 and V2 changes, analyze the syntax and semantic information of modules such as class functions based on the AST, and construct the label information of the module, and then according to the label Perform sub-module processing, and finally design and implement corresponding algorithms according to the characteristics of various evolution types, and store the data in the MySQL database through the Hibernate persistence technology. In order to view the software version evolution information intuitively, it will also analyze the The evolution information is converted into JSON-formatted information and stored into HTML page information.

本实施既可以在Eclipse软件中进行，也可以在Linux的终端环境中进行。在Eclipse 软件中进行是使用工程的源码进行分析，在Linux的终端环境中进行是使用Eclispe将工程源码生成的Diff-AST.jar包文件进行分析。这里只介绍Linux的终端环境中进行的情况，在 Linux的终端环境中的当前目录下需要一个配置文件configure.txt和软件版本演化分析工具Diff-AST.jar，配置文件configure.txt中的内容如附图9所示。This implementation can be performed either in Eclipse software or in a Linux terminal environment. In the Eclipse software, the source code of the project is used for analysis, and in the Linux terminal environment, the Diff-AST.jar package file generated by the project source code is analyzed by Eclispe. Only the situation in the Linux terminal environment is introduced here. In the current directory of the Linux terminal environment, a configuration file configure.txt and the software version evolution analysis tool Diff-AST.jar are required. The content in the configuration file configure.txt is as follows Figure 9 shows.

configure.txt中的第1到4行显示软件版本演化分析的新旧版本源码的目录和输出文件patch和HTML的目录，第5到8行显示的数据库连接的信息，本实施使用的是MySQL数据库，其中的信息包括数据库的名字，数据库的驱动，数据库的用户名和密码，设计使用配置文件configure.txt提高了软件版本演化分析工具的使用的灵活性。Lines 1 to 4 in configure.txt show the source code directory of the old and new versions of the software version evolution analysis and the output file patch and HTML directory, and lines 5 to 8 show the database connection information. The MySQL database is used in this implementation. The information includes the name of the database, the driver of the database, the user name and password of the database, and the design and use of the configuration file configure.txt improves the flexibility of using the software version evolution analysis tool.

在终端中运行命令$java-jar Diff-AST.jar configure.txt会在相应的reportPath目录下生成HTML文件，在HTML文件的根目录下有一个index.html，可以使用浏览器打开，查看软件版本演化的详细信息，如增删改的文件的数量以及源码详细的演化信息，具体的实施环境结构如附图10所示。Running the command $java-jar Diff-AST.jar configure.txt in the terminal will generate an HTML file in the corresponding reportPath directory. There is an index.html in the root directory of the HTML file, which can be opened with a browser to view the software version Detailed evolution information, such as the number of files added, deleted and modified, and detailed evolution information of source code, the specific implementation environment structure is shown in Figure 10.

Claims

1. An AST-based software version evolution comparative analysis method is characterized in that:

the method comprises the following steps: comparing and analyzing two software version source codes through a Diff command of Unix, acquiring the conditions of addition, deletion and change of code lines according to an analysis result file Patch, and dividing the changed source codes into three source code blocks, namely an added source code block, a deleted source code block and a changed source code block;

step two: AST information of two software version source codes is obtained through a syntax analyzer, and the AST information corresponding to each line of codes of each source code block is obtained through analysis aiming at the three source code blocks analyzed in the first step;

step three: obtaining grammar node information by analyzing the AST information obtained in the second step to combine the grammar node information into a combined information, using the combined information to identify each line of the three source code blocks in the first step, and combining adjacent code lines with the same identification into a module; the acquisition of the combined information uses a label system identifier which is a unique identifier of a class, a function, a structural body, an enumeration class and a shared body module, and the combined information of an identification code line is formed by combining a plurality of identifiers, wherein the label system design method comprises the following steps:

a) the key words of the module are the name of the module plus corresponding attribute information characters;

b) the identification of the module is formed by combining a plurality of agreed word value pairs, and each two word value pairs are separated by an '@' character;

c) the hierarchy identification of modules are separated by "#";

d) the global variable, the key words of the precompiled processing and common code line are stmSource, and the value is tempty;

step four: comparing and analyzing the identification information of the two source code blocks by taking the module formed in the step three as a unit, performing classification analysis according to the module evolution condition, processing various evolution conditions by using a preprocessing method and a secondary processing method, storing the evolution information into a database, and converting the evolution information into an HTML code; the method for processing various evolution conditions by using the preprocessing and the secondary processing is a recursive process, namely, each processing needs preprocessing, the preprocessing is to analyze all changes of the homonymous source code files of different versions normally and store the analysis result, if the Diff analysis is analyzed to be wrong, secondary processing is carried out, if the secondary processing also encounters the Diff analysis error, then the preprocessing and the secondary processing are carried out until the analysis is correct.

2. The AST-based software version evolution comparative analysis method of claim 1, wherein the analyzing and obtaining AST information corresponding to each line of codes of each source block in the second step specifically comprises:

1) matching the abstract syntax tree nodes generated by the ANTLR one by one according to the line numbers by using a breadth-first search algorithm;

2) if the row number is in the range of the node, matching the sub-nodes of the node one by one until a completely matched row number node is found, otherwise, not matching;

3) and using a syntax analyzer to obtain syntax elements of the current line and upper-layer syntax elements until the top-layer class of the currently analyzed source code file or the function, the structure body, the enumerated class and the shared body.

3. The AST-based software version evolution comparative analysis method of claim 2, wherein the combination module in step three is a source code block divided by Diff according to a label system, and the module is further subdivided, and the specific division module comprises the following steps:

1) for the changed source code blocks analyzed by Diff, acquiring the label information of each line of codes according to the line number;

2) comparing the label information of the codes of the adjacent lines, and dividing the information into two conditions according to the comparison condition, wherein one condition is a complete module, the label of the starting line of the module is different from the label of the previous line, the label of each line of codes in the middle of the module is the same, and the label of the ending line of the module is different from the label of the next line; otherwise, the other is not a complete module;

3) and filling a module information data structure according to the two analyzed conditions, wherein the information data structure comprises a starting line and an ending line of the module, the type of the module and the label information of the module.

4. The AST-based software version evolution comparative analysis method as claimed in claim 1, wherein the classifying and analyzing according to the module evolution in step four specifically comprises: deep matching analysis of class and function information is carried out, corresponding information of two different versions is compared with a fixed format of the information, and if both versions exist and are changed, evolution is changed; if the old version does not exist, there is an incremental evolution; if the new version does not exist, there is a deletion evolution.