Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a method and a device for testing the back end of a website after Bug repair. The invention provides a new technique, PatchID. Its core idea is that the dynamic program behavior of a passing test case is the same in the buggy program and in a correctly patched program, while the dynamic program behavior of a failing test case differs between the two. The method first constructs, from the buggy program and the test set, a dynamic behavior expression snapshot that captures the program error, then generates new test cases to enhance the original test set, and finally reads the same snapshot from the patched program and judges whether the patch is overfitted according to whether the value of the snapshot changes once the patch is applied.
In a first aspect, the invention provides a method for inspecting the back end of a website after Bug repair, which comprises the following steps:
Step 1, searching for the dynamic behavior expression snapshot with the maximum suspicion for the website back-end code after Bug repair;
Step 1-1, acquiring a dynamic behavior expression snapshot of each test case;
Run the back-end website before Bug repair together with its corresponding test set t_o. First construct the Boolean expressions required by the snapshots to obtain a Boolean expression set B_bug, collect the abstract program state of each test case during execution, and then evaluate each Boolean expression in B_bug to generate the dynamic behavior expression snapshot of each test case in the test set;
The dynamic behavior expression snapshot is represented, based on the patch similarity principle, as the tuple:
snapshot = ⟨l, b, t_i, v_i⟩ (1)
where l denotes the unique location identifier of each statement, b denotes the Boolean expression, t_i denotes the test case with unique serial number i in the test set, and v_i denotes the actual value of b for test case t_i during execution of the buggy program;
Step 1-2, calculating the suspicion of each dynamic behavior expression snapshot, where the calculation formula is defined as follows:
where ed_s denotes the expression-dependence variable (obtained by syntactic analysis of expression dependence) and dy_s denotes the dynamic-analysis variable (obtained by dynamic analysis);
The suspicion of each snapshot is determined by 1) ed_s and 2) dy_s: ed_s increases with the number of occurrences of b in the preceding and succeeding statements, and dy_s increases the more often b is evaluated in failing test cases and the less often it is evaluated in passing test cases.
Step 1-3, selecting the dynamic behavior expression snapshot with the maximum suspicion and denoting it s_max; this snapshot is the dynamic behavior expression snapshot of the Bug. This yields the snapshot set s_bug corresponding to the test set t_o;
Step 2, performing data enhancement on the test set t_o to obtain the enhanced test set t_e;
Randomly generate a number of new test cases with the Evosuite tool, replace the test set in step 1-1 with these new test cases, and repeat the computation of step 1-1 to obtain the dynamic behavior expression snapshot of each new test case, denoted s_new. If s_new is the same as s_max, add the corresponding test case to the test set t_o; otherwise discard it. This finally yields the enhanced test set t_e;
Step 3, identifying overfitting patches and classifying the identified overfitting patches, comprising the following steps:
Step 3-1, obtaining the location l_patch to be monitored from the patch adopted for Bug repair;
Because the Bug location before repair of the website back end cannot be monitored directly in the patch, a location l_patch in the patch must be selected to monitor the Boolean expression b of the Bug's dynamic behavior expression. Whatever the repair operation, the program can only exhibit correct behavior after the repair operation has finished. Therefore, define the first statement that differs between the Bug and the patch as start_s and the last differing statement as end_s, and select the monitoring location by the following rules:
1) If start_s is a block statement (for, while, or if) and end_s lies inside start_s, then l_patch is the statement immediately following the block;
2) If start_s is not a block statement, check whether end_s is the last statement; if it is not, l_patch is the statement following end_s; if it is, l_patch = end_s;
Step 3-2, running the back-end website after Bug repair together with the enhanced test set t_e, obtaining the abstract program state of each test case in t_e at l_patch, and from it the dynamic behavior expression snapshots, finally obtaining the snapshot set s_patch corresponding to t_e, where s_patch uses the same Boolean expressions b as s_bug;
Step 3-3, comparing the two sets s_bug and s_patch by the serial numbers of the test cases in t_e to obtain N_f, the number of failing test cases whose value v is the same in s_bug and s_patch, and N_p, the number of passing test cases whose value v differs between s_bug and s_patch;
The type of the patch is identified according to the following equation (3):
where "correct" indicates a correct patch, i.e. the patch repairs the incorrect behavior without destroying the original correct behavior; A indicates an A-type overfitting patch, which neither completely repairs the incorrect behavior nor destroys the original correct behavior; B indicates a B-type overfitting patch, i.e. the patch repairs the original incorrect behavior but destroys the original correct behavior, called a regression error; and AB indicates an AB-type overfitting patch, i.e. the patch neither repairs the incorrect behavior nor preserves the original correct behavior.
In a second aspect, there is provided an inspection apparatus comprising:
a maximum-suspicion snapshot searching module, used for searching for the dynamic behavior expression snapshot with the maximum suspicion in the website back-end code after Bug repair;
a test data enhancement module, used for performing data enhancement on the test set t_o; and
an identification and classification module for overfitting patches.
In a third aspect, a computer readable storage medium is provided, on which a computer program is stored which, when executed in a computer, causes the computer to perform the method.
In a fourth aspect, a computing device is provided, including a memory having executable code stored therein and a processor, which when executing the executable code, implements the method.
The beneficial effects of the invention are specifically:
1. The invention re-interprets patch similarity from the perspective of program invariants and program expressions, and provides a tuple representation for calculating patch similarity, used for identifying and subdividing overfitting patches in automatic patch generation.
2. The invention identifies 63 overfitting patches and 15 correct patches in the classical Java dataset Defects4J, and the experimental data show that the method outperforms existing comparable methods. Because the technique can subdivide overfitting patches, developers can more quickly turn an overfitting patch into a correct one.
Detailed Description
The invention is described in detail below, in the context of automatic software repair, with reference to the accompanying drawings. The overall flow of the invention is shown in figure 1 of the accompanying drawings; the specific steps are as follows:
Step 1, searching for the dynamic behavior expression snapshot with the maximum suspicion for the website back-end code after Bug repair;
Step 1-1, acquiring a dynamic behavior expression snapshot of each test case;
Run the back-end website before Bug repair together with its corresponding test set t_o. First construct the Boolean expressions required by the snapshots to obtain a Boolean expression set B_bug, collect the abstract program state of each test case during execution, and then evaluate each Boolean expression in B_bug to generate the dynamic behavior expression snapshot of each test case in the test set;
The Boolean expressions are formed by combining variables of the same type pairwise with relational symbols (<, ≤, >, ≥).
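As a minimal sketch of this construction (the variable names, the callable-based encoding, and the restriction to a single int type are illustrative assumptions, not the invention's actual implementation), the pairwise combination of same-typed variables with relational operators can look like:

```python
import itertools

# Relational operators used to combine variables of the same type.
OPERATORS = {
    "<":  lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    ">":  lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
}

def build_boolean_expressions(typed_vars):
    """Combine every pair of same-typed variables with each relational
    operator, yielding (name, evaluator) pairs for the set B_bug."""
    exprs = []
    for _type, names in typed_vars.items():
        for x, y in itertools.combinations(names, 2):
            for op, fn in OPERATORS.items():
                exprs.append((f"{x} {op} {y}",
                              lambda state, x=x, y=y, fn=fn: fn(state[x], state[y])))
    return exprs

# Hypothetical abstract program state with two int variables
# observed at one statement.
exprs = build_boolean_expressions({"int": ["i", "n"]})
state = {"i": 3, "n": 10}
values = {name: fn(state) for name, fn in exprs}
```

Evaluating every expression in B_bug against the abstract state recorded for each test case then yields the per-test-case snapshots described above.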
The dynamic behavior expression snapshot is represented, based on the patch similarity principle, as the tuple:
snapshot = ⟨l, b, t_i, v_i⟩ (1)
where l denotes the unique location identifier of each statement, b denotes the Boolean expression, t_i denotes the test case with unique serial number i in the test set, and v_i denotes the actual value of b for test case t_i during execution of the buggy program;
After a correct patch is applied, a passing test case yields the same Boolean expressions and values as before, while a failing test case should yield different values. Suppose that in a buggy program every passing test case makes a Boolean expression b evaluate to false and every failing test case makes b evaluate to true. Then determining whether a patch is overfitted need not rely solely on observing the program's output; it can be done by comparing the value of b at a statement before and after the patch is applied. The value of b should agree with the buggy program when a passing test case runs on the patch, and should differ from the buggy program when a failing test case runs on it.
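The tuple of equation (1) and the per-test-case collection described above can be sketched as follows (the program is modeled as a plain callable and the location string is hypothetical):

```python
from collections import namedtuple

# snapshot = <l, b, t_i, v_i>: statement location l, Boolean expression b,
# test-case serial number i, and the value v of b observed when t_i runs.
Snapshot = namedtuple("Snapshot", ["l", "b", "i", "v"])

def take_snapshots(program, expression, location, test_cases):
    """Run each test case and record the value of the monitored Boolean
    expression at the given location (program is a callable stand-in
    for instrumented execution)."""
    return [Snapshot(location, expression, i, program(t))
            for i, t in enumerate(test_cases)]

# Hypothetical buggy program: b is "x > 0", monitored at Foo.java:42.
buggy = lambda t: t > 0
snaps = take_snapshots(buggy, "x > 0", "Foo.java:42", [-1, 0, 5])
```

Comparing the v components of such snapshots between the buggy and patched runs is exactly the comparison described in the paragraph above.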
Step 1-2, calculating the suspicion of each dynamic behavior expression snapshot, where the calculation formula is defined as follows:
where ed_s denotes the expression-dependence variable (obtained by syntactic analysis of expression dependence) and dy_s denotes the dynamic-analysis variable (obtained by dynamic analysis);
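The suspicion computation can be sketched as follows. Equation (2) itself is not reproduced in this text, so combining ed_s and dy_s by a simple smoothed product is an assumption made only for illustration; the monotonic behavior, however, matches the description: more evaluations of b in failing test cases raise the score, more evaluations in passing test cases lower it, and more occurrences of b in surrounding statements raise ed_s.

```python
def suspicion(ed_s, b_values_failed, b_values_passed):
    """Suspicion of a snapshot. ed_s counts occurrences of b in the
    statements before and after the monitored location; dy_s grows with
    the number of evaluations of b in failing test cases and shrinks
    with the number in passing ones. The product form and the +1
    smoothing are assumptions standing in for equation (2)."""
    n_failed = len(b_values_failed)   # evaluations of b in failing cases
    n_passed = len(b_values_passed)   # evaluations of b in passing cases
    dy_s = n_failed / (n_passed + 1)
    return ed_s * dy_s
```

Under this sketch, a snapshot evaluated often by failing cases and rarely by passing cases scores highest, which is the snapshot selected as s_max in step 1-3.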
Step 1-3, selecting the dynamic behavior expression snapshot with the maximum suspicion and denoting it s_max; this snapshot is the dynamic behavior expression snapshot of the Bug. This yields the snapshot set s_bug corresponding to the test set t_o;
Step 2, performing data enhancement on the test set t_o to obtain the enhanced test set t_e;
Randomly generate a number of new test cases with the Evosuite tool, replace the test set in step 1-1 with these new test cases, and repeat the computation of step 1-1 to obtain the dynamic behavior expression snapshot of each new test case, denoted s_new. If s_new is the same as s_max, add the corresponding test case to the test set t_o; otherwise discard it. This finally yields the enhanced test set t_e;
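A sketch of the enhancement loop, with a seeded random generator standing in for Evosuite and the snapshot reduced to a single Boolean value (both simplifications for illustration only):

```python
import random

def augment_test_set(t_o, s_max, snapshot_of, generate_case, budget=100):
    """Data enhancement of the original test set t_o: keep a randomly
    generated case only if its snapshot s_new matches the most
    suspicious snapshot s_max; otherwise discard it."""
    t_e = list(t_o)
    for _ in range(budget):
        case = generate_case()          # stands in for Evosuite here
        if snapshot_of(case) == s_max:  # s_new == s_max: keep the case
            t_e.append(case)
    return t_e

# Hypothetical setup: the snapshot is just the sign of the input, and
# s_max marks cases that drive the monitored expression to True.
random.seed(0)
t_e = augment_test_set(
    t_o=[1, 2],
    s_max=True,
    snapshot_of=lambda t: t > 0,
    generate_case=lambda: random.randint(-10, 10),
    budget=50,
)
```

Every retained case reproduces the suspicious dynamic behavior, so the enhanced set t_e exercises the Bug's snapshot more densely than t_o alone.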
Step 3, identifying overfitting patches and classifying the identified overfitting patches, comprising the following steps:
Step 3-1, obtaining the location l_patch to be monitored from the patch adopted for Bug repair;
Because the Bug location before repair of the website back end cannot be monitored directly in the patch, a location l_patch in the patch must be selected to monitor the Boolean expression b of the Bug's dynamic behavior expression. A patch generally applies insert, delete, replace, or update operations to the buggy program. Whatever the repair operation, the program can only exhibit correct behavior after the repair operation has finished, so define the first statement that differs between the Bug and the patch as start_s and the last differing statement as end_s, and select the monitoring location by the following rules:
1) If start_s is a block statement (for, while, or if) and end_s lies inside start_s, then l_patch is the statement immediately following the block;
2) If start_s is not a block statement, check whether end_s is the last statement; if it is not, l_patch is the statement following end_s; if it is, l_patch = end_s;
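The two location-selection rules can be sketched over a toy statement list (the dict-based statement encoding with "kind", "block_end", and "next" fields is an illustrative stand-in for a real AST):

```python
def select_monitor_location(start_s, end_s, stmts):
    """Choose the monitored location l_patch from the first and last
    differing statements start_s / end_s, following the two rules."""
    BLOCK_KINDS = {"if", "while", "for"}
    s = stmts[start_s]
    if s["kind"] in BLOCK_KINDS and start_s <= end_s <= s["block_end"]:
        # Rule 1: end_s lies inside the block opened at start_s;
        # l_patch is the statement immediately after the block.
        return stmts[s["block_end"]]["next"]
    # Rule 2: start_s is not a block statement; take the statement
    # after end_s, unless end_s is the last statement of the program.
    nxt = stmts[end_s]["next"]
    return nxt if nxt is not None else end_s

# Toy program: statement 0 opens an if-block covering statements 0-2,
# statements 3-4 follow it; statement 4 is the last one.
stmts = [
    {"kind": "if",     "block_end": 2, "next": 1},
    {"kind": "assign", "block_end": 1, "next": 2},
    {"kind": "assign", "block_end": 2, "next": 3},
    {"kind": "assign", "block_end": 3, "next": 4},
    {"kind": "assign", "block_end": 4, "next": None},
]
```

Monitoring after the differing region guarantees that the repair operation has fully executed before b is evaluated, which is the premise stated above.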
Step 3-2, running the back-end website after Bug repair together with the enhanced test set t_e, obtaining the abstract program state of each test case in t_e at l_patch, and from it the dynamic behavior expression snapshots, finally obtaining the snapshot set s_patch corresponding to t_e, where s_patch uses the same Boolean expressions b as s_bug;
Step 3-3, comparing the two sets s_bug and s_patch by the serial numbers of the test cases in t_e to obtain N_f, the number of failing test cases whose value v is the same in s_bug and s_patch, and N_p, the number of passing test cases whose value v differs between s_bug and s_patch;
The type of the patch is identified according to the following equation (3):
where "correct" indicates a correct patch, i.e. the patch repairs the incorrect behavior without destroying the original correct behavior; A indicates an A-type overfitting patch, which neither completely repairs the incorrect behavior nor destroys the original correct behavior; B indicates a B-type overfitting patch, i.e. the patch repairs the original incorrect behavior but destroys the original correct behavior, called a regression error; and AB indicates an AB-type overfitting patch, i.e. the patch neither repairs the incorrect behavior nor preserves the original correct behavior.
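The classification by N_f and N_p can be sketched as below; the threshold form stands in for equation (3), which is not reproduced in this text, but the four outcomes follow directly from the definitions of the patch types above:

```python
def classify_patch(n_f, n_p):
    """Classify a patch from N_f (failing test cases whose value of b
    is unchanged by the patch, i.e. incorrect behavior not repaired)
    and N_p (passing test cases whose value of b changed, i.e. correct
    behavior destroyed)."""
    if n_f == 0 and n_p == 0:
        return "correct"          # repaired, nothing broken
    if n_p == 0:
        return "A-Overfitting"    # not (fully) repaired, nothing broken
    if n_f == 0:
        return "B-Overfitting"    # repaired, but a regression introduced
    return "AB-Overfitting"       # not repaired and correct behavior broken
```

For example, a patch under which three failing test cases still reproduce the buggy values of b while no passing case changes would be classified as A-Overfitting.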
The present invention has been experimentally verified on two datasets. The first dataset is Defects4J, which consists of patches generated by 6 APR tools on Defects4J. The second, the Java+JML dataset, was created by Nilizadeh et al.
At present, defecets J proposed by Just is the most widely used Java program data set in the field of automatic program repair. The Defects4J has 17 projects so far, which contain 835 Defects. Each program bug in the dataset contains at least one test case that can trigger it. The method uses the 6 most commonly used items in the dataset, namely Chart, time, math, lang, closure and Mockito, wherein Chart is an item specially displaying icons, time is an item for date and Time processing, math is a scientifically calculated item, lang is a set of additional methods for operating JDK classes, closure is an optimized compiler for Javascript, and Mockito is a simulation framework for unit testing. The number of bug contained in each item is shown in table 1 below.
Table 1: Defects4J Projects
Project | Number of bugs
Chart | 26
Time | 26
Math | 106
Lang | 64
Closure | 174
Mockito | 38
Total | 434
The method uses 6 existing repair tools to generate candidate patches on the Defects4J dataset. These 6 automatic bug-repair tools are jGenProg, Nopol2015, Nopol2017, ACS, HDRepair, and jKali. jGenProg is the Java version of GenProg, a heuristic-search repair tool based on a genetic algorithm. Nopol is a repair technique for conditional-statement errors in Java programs that applies different repair strategies depending on the type of the faulty statement: if the located faulty code is a conditional statement, the patch Nopol generates usually modifies the original condition; if it is a non-conditional statement, the repair adds a new condition to skip execution of the current statement. The dataset includes the Nopol2015 and Nopol2017 versions. ACS is a high-precision condition-synthesis tool that extracts patch templates for repair based on statistical analysis. HDRepair is a repair tool based on statistical analysis, and jKali is a re-implementation of Kali for Java, a repair tool that works by deleting functionality.
The Java+JML dataset proposed by Nilizadeh is the first verified, publicly available Java program dataset. It consists of four parts: correct programs, mutated faulty programs, test suites, and APR-generated patches. The programs in this dataset carry JML specifications for experimental evaluation. The dataset implements various classical algorithms and data structures such as bubble sort, factorial, and queues. They are all small programs with formal specifications written in JML and can therefore be regarded as programs with an oracle. The test suites are created with an AFL-based fuzzing tool and are divided, according to the number of generated test cases, into Small and Medium. The faulty programs are created with PITest, a Java program mutation tool, which injects a single fault into each Java program. PITest generates faults by changing control conditions, changing assignment expressions, deleting method calls, and changing return values. The APR-generated repair patches were obtained with the following repair tools: ARJA-E, Cardumen, jGenProg, jKali, jMutRepair, Kali-A, and Nopol.
Experimental results:
Performance on Defects4J: 220 patches were generated on the Defects4J dataset by the APR tools, and these 220 patches were tested to judge whether they are overfitting patches. A total of 166 patches produced running results; the remaining patches were terminated for exceeding the set execution time limit and gave no final result. Of the 166 patches, the method gives a determination of whether each patch is an overfitting patch for 157 of them, the other 9 patches excepted. The specific patch determination results are shown in Table 2.
Table 2: Defects4J Dataset
Tables 3 and 4 show the results of the method broken down by defect-repair tool and by project, respectively. As the tables show, PatchID successfully filtered 78 of the 157 patches, of which 63 were overfitting and 15 were correct. PatchID further divided the 63 overfitting patches into the three categories: 50 A-Overfitting patches, followed by 8 B-Overfitting patches and 5 AB-Overfitting patches.
Table 3: Results by APR Tool
Tool | Correct | Overfitting | Correct detected | Overfitting detected | A | B | AB
Nopol2015 | 5 | 20 | 2 (40%) | 10 (50%) | 9 | 0 | 1
Nopol2017 | 3 | 68 | 2 (66.66%) | 36 (52.94%) | 25 | 8 | 3
HDRepair | 4 | 5 | 3 (75%) | 1 (20%) | 1 | 0 | 0
ACS | 11 | 6 | 7 (63.63%) | 1 (16.66%) | 1 | 0 | 0
jKali | 1 | 14 | 0 | 8 (57.14%) | 8 | 0 | 0
jGenprog | 6 | 14 | 1 (16.67%) | 7 (50%) | 6 | 0 | 1
Total | 30 | 127 | 15 (50%) | 63 (49.61%) | 50 | 8 | 5
"Correct detected" / "Overfitting detected" denote the number of patches from the "Correct" / "Overfitting" columns that the method of the invention classified correctly.
A = A-Overfitting Patch, B = B-Overfitting Patch, AB = AB-Overfitting Patch
Table 4: Results by Project
Project | Correct | Overfitting | Correct detected | Overfitting detected | A | B | AB
Lang | 6 | 10 | 2 (33.33%) | 3 (50%) | 3 | 0 | 0
Math | 16 | 49 | 8 (50%) | 22 (44.90%) | 20 | 1 | 1
Chart | 3 | 21 | 1 (33.33%) | 12 (57.14%) | 10 | 0 | 2
Time | 2 | 10 | 2 (100%) | 6 (60%) | 5 | 1 | 0
Closure | 2 | 37 | 1 (50%) | 20 (54.05%) | 12 | 6 | 2
Mockito | 1 | 0 | 1 (100%) | 0 | 0 | 0 | 0
Total | 30 | 127 | 15 (51.85%) | 63 (49.61%) | 50 | 8 | 5
Overfitting patches: from Table 3 we can see that PatchID works well on the four repair tools Nopol2015, Nopol2017, jKali, and jGenprog (worst success rate 50%), but poorly on ACS and HDRepair (best success rate 20%). We also find that, among the overfitting patches produced by these 6 tools, patches that fail to fix the program's original error are the most numerous, while patches that break the program's original correct behavior are fewer. However, Nopol2015 and Nopol2017 (the two tools that repair bugs by modifying conditional statements) together produced 12 patches that damage the program's correct behavior, while among the other tools only jGenprog produced one AB-Overfitting patch. We hypothesize that modifying program conditional statements makes it comparatively easy to introduce new errors.
Broken down by project, the success rate of overfitting-patch identification is relatively stable, ranging from 43% to 60%. PatchID has the highest success rate on the Time project, 60%, and the lowest on the Math project, only 44.90%. The project with the most patches that break the program's original correct behavior is Closure, with a total of 8 such patches.
Correct patches: of the 157 patches, 30 are correct patches, and PatchID correctly judges 15 of them, a success rate of 50%. This is an encouraging result: to the best of our knowledge, no existing tool reaches such a high success rate. On the patches generated by Nopol2017, HDRepair, and ACS, the success rate of PatchID exceeds 60%, with a maximum of 75%. By project, the success rates of the remaining projects are not low apart from Lang and Chart; of particular note are the success rates of up to 100% on the Mockito and Time projects.
Compared with Xiong's results on this dataset, our method identified one more overfitting patch than Xiong's method, but Xiong's method identified no correct patches while PatchID identified 15. Out of the 220 patches, Xiong's method identified 62 in total and PatchID identified 78. Xiong, however, raised the recognition success rate to 56.3% through a pruning strategy, while PatchID's is 49.7%. Furthermore, Xiong's method can only judge patches from the four projects Chart, Lang, Math, and Time, whereas PatchID handles patches from all 6 projects. In terms of versatility, PatchID is therefore more broadly applicable.
Performance on the Java+JML dataset: we selected from the Java+JML dataset 236 overfitting patches based on the Medium test suite and 336 overfitting patches based on the Small test suite, all judged to be overfitting by the JML specification. There are also 21 false-negative patches (patches that the JML specification mistakenly considers overfitted although the repaired program is correct). The PatchID algorithm was run on these 593 patches in total, of which 380 produced running results; the specific figures are shown in Table 5.
Table 5: Java+JML Dataset
Patch type | Collected | Validated
Medium | 236 | 144
Small | 336 | 221
False negatives | 21 | 15
Total | 593 | 380
From the data in Table 6 it can be seen that PatchID achieves a success rate of 50% on the Medium-based patches, 41.62% on the Small-based patches, and 33.33% on the false-negative patches.
From the perspective of the overfitting classification, PatchID identified no B-Overfitting patches on this dataset; apart from 4 AB-Overfitting patches, all other overfitting patches are of type A-Overfitting.
As the success rates for Medium and Small make apparent, the success rate decreases as the number of test cases in the test suite decreases. This indicates that a weak test suite can reduce the success rate of PatchID.
Table 6: Results by Patch Type
Patch type | Correct detected | Overfitting detected | A | B | AB
Medium | 72 (50%) | 72 (50%) | 72 | 0 | 0
Small | 129 (58.37%) | 92 (41.62%) | 88 | 0 | 4
False negatives | 5 (33.33%) | 10 (66.67%) | 10 | 0 | 0
Total | 206 | 174 | 170 | 0 | 4