CN115687117B

CN115687117B - A method for modeling abnormal behavior of sorting similarity software and computer-readable medium

Info

Publication number: CN115687117B
Application number: CN202211345882.6A
Authority: CN
Inventors: 谢晓园; 宋壹; 张熙灏; 刘泉明
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2025-07-01
Anticipated expiration: 2042-10-31
Also published as: CN115687117A

Abstract

The present invention provides a method for modeling abnormal behavior of software based on sorting similarity and a computer-readable medium. The method of the present invention uses a multi-fault program data set and multiple external indicators to evaluate the effectiveness of the fault characterization function in the fault separation process, calculates a fitness score for the fault sorting characterization based on the evaluation result, and evolves the fault characterization function population based on the fitness score through the crossover, replication and mutation operations of the genetic programming algorithm until the termination condition is reached, and the fault sorting characterization individual with the highest fitness obtained by evolution is used as a fault classification model for fault separation of real multi-fault programs. The present invention provides a solution to the problem that there is currently no available automatic generation method for fault sorting characterization for fault separation, and the problem that a mapping relationship between fault sorting characterization and failed test case characterization capability cannot be established, and the method of the present invention can realize efficient automatic generation of fault sorting characterization and be used for fault separation in actual production.

Description

Sequencing similarity software abnormal behavior modeling method and computer readable medium

Technical Field

The invention relates to the technical field of computers, in particular to a sequencing similarity software abnormal behavior modeling method and a computer readable medium.

Background

The software defect positioning technology aims at finding out the source causing the abnormal behavior of the software, and has important significance for guaranteeing and improving the quality of the software and enhancing the effectiveness of the software repair technology. In the software testing process, the fact that the actual output and the expected output of the test case are different indicates that at least one defect is hidden in the software, and a developer must repair the defect rapidly to restore the normal function of the software. However, the premise of repairing the defect is to locate the defect, and only an internal root causing abnormal behavior of software execution can be accurately found, so that an effective basis can be provided for the subsequent repairing process. For this reason, many software defect localization methods, such as localization methods based on program mutation technology, localization methods based on program spectrum technology, etc., are proposed successively. Although these methods have all proven to have good defect localization effects, they are almost all based on a single fault assumption, i.e. assuming that an erroneous procedure contains only one defect. With the increasing size of software and the increasing complexity of software systems, an erroneous program is almost impossible to contain only one defect. In other words, the multi-fault environment is the normal state faced by the software debugging task in the real environment.

The simultaneous existence of multiple faults can seriously reduce the effectiveness of the traditional single fault location method, so that the multi-fault software fault location technology becomes one of the hot spots in the current fault location research field. The technology firstly utilizes a risk assessment function for estimating the suspicious degree of a program element in single fault location to generate an ordered list representation for each failure test case, and establishes a connection between the failure test case and a software fault, so that the failure test cases are divided according to fault sources to isolate the failure test cases caused by different faults, the mutual influence among the software faults is weakened, then a single fault location method is applied to each failure test case in parallel, and the multi-fault location problem is disassembled into a plurality of single fault location problems. The effectiveness of the fault division process determines the accuracy of parallel single fault positioning, and the risk assessment function used has a great influence on the effectiveness of fault division.

However, the risk assessment functions used in fault classification of the existing multi-fault localization technology are designed for single-fault localization application scenarios. The factors to be considered in the multi-fault isolation and single-fault localization are not exactly the same, so the optimal risk assessment function in the single-fault scenario does not necessarily have good multi-fault isolation performance. Intuitively, if the risk assessment function has stronger fault feature extraction capability, the generated ordered list can more accurately simulate the failure test cases, so that a better parallel debugging effect is obtained. Therefore, it is very important to use a risk assessment function that can correctly represent a fault in a multi-fault scenario.

According to the purpose of use, the fault characterization function (risk assessment function) can be designed manually or by using an automatic method with the purpose of generating the fault test case characterization which has strong characterization capability and is suitable for the partitioning algorithm as a target, so that the effectiveness of the fault partitioning process is improved, and the accuracy of multi-fault positioning is further improved.

However, the existing method for measuring the characterization capability of the failed test cases is still lacking, the characterization of the failed test cases can only be indirectly evaluated through the effectiveness of the fault division process, on the other hand, after the used risk evaluation function is determined, the execution information of each failed test case and all successful test cases is combined, the suspicious degree is calculated for all program elements, the suspicious degree is ranked according to the calculated suspicious degree, and the finally obtained ranking list can be used as the characterization of the failed test cases. The complex token generation mode and the indirect evaluation method lead to difficulty in determining the mapping relation between the fault token function and the fault test case token capacity, so that the fault token function cannot be trained by using methods such as an artificial neural network and the like. Meanwhile, if a method of manually designing the fault characterization function is adopted, the obtained fault characterization function has far less characterization capability than ideal because the design can only be performed by human intuition, and the design efficiency is far less than that of automatic generation. How to design a reliable fault-oriented fault-dividing fault-characterization function generation method is an important problem.

Disclosure of Invention

In order to solve the technical problems, the invention provides a sequencing similarity software abnormal behavior modeling method and a computer readable medium.

The invention provides a software abnormal behavior modeling method based on sequencing similarity, which comprises the following specific steps:

Step 1, introducing a plurality of multi-fault programs, randomly generating a plurality of fault characterization functions by using a genetic programming algorithm, taking each fault characterization function as each individual in a population, and constructing an initial population by the plurality of fault characterization functions;

Step 2, generating a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, clustering and dividing the sequencing list characterization of all failure test cases generated by each multi-failure program by each failure characterization function, obtaining a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, evaluating the validity score of each external index corresponding to the classification result generated by each multi-failure program by each failure characterization function according to a plurality of external indexes, merging the validity scores of a plurality of external indexes corresponding to the classification result generated by each multi-failure program by each failure characterization function, obtaining an adaptation score when each failure characterization function is applied to failure separation of each multi-failure program, and obtaining the adaptation score of each individual of the initial population according to the adaptation score when each failure characterization function is applied to failure separation of a plurality of multi-failure programs;

Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, creating each child individual by using the parent population according to the crossover rate, the replication rate and the mutation rate, taking all the created child individuals as the child population, and calculating the fitness of each individual in the child population through the step 2;

step 4, repeatedly executing the step 2 and the step 3 until the termination condition appointed by the user is reached, and taking the fault characterization function with the highest individual fitness in the last generation population reaching the termination condition as a fault classification model;

And 5, generating a sequencing list representation for each failure test case in a multi-fault program needing fault division by using a fault classification model, calculating the distance between each pair of sequencing list representations, and clustering the failure test cases by combining the distance between each pair of sequencing list representations to realize modeling of software abnormal behaviors and division of a plurality of faults.

Preferably, each fault-characterization function described in step1 is specifically defined as follows:

the fault characterization function REF is:

Suspiciousness_k,f,i＝REF_k(Spectrum_f,i)

Spectrum_f,i＝{NCF_f,i,NCS_i,NUF_f,i,NUS_i}

The NCF _f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS _i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF _f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS _i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF _f,i,NCS_i,NUF_f,i,NUS_i is used as the Spectrum information Spectrum _f,i,Sspiciousness_k,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF _k;

the initial population is constructed in the step 1:

A plurality of fault characterization functions are randomly generated by using a genetic programming algorithm and are used as an initial population together:

Wherein, the Population _p represents the p generation Population, the initial Population constructed is Population ₁, the pop is the Population individual number or Population scale, Representing a kth fault characterization function in the p-th generation population;

Preferably, in step 2, each fault characterization function generated in combination with each fault characterization function in the initial population is applied to the ordered list characterization of each failure test case generated by each multi-fault program, and specifically as follows:

each multi-fault program comprises a plurality of failure test cases and a plurality of success test cases;

the plurality of failure test cases are specifically defined as follows:

{Ftc₁,Ftc₂,Ftc₃,...,Ftc_fail}

Wherein fail is the number of failed test cases, ftc _f represents the f failed test cases in each multi-fault program, and the value of f is between 1 and fail;

The plurality of successful test cases are specifically defined as follows:

{Stc₁,Stc₂,Stc₃,...,Stc_success};

Wherein success is the number of successful test cases, stc _s represents the s-th successful test case in each multi-fault program, and the value of s is between 1 and success;

collecting coverage information of each failed test case and each successful test case at each program statement in each multi-fault program;

calculating a spectrum value of each failure test case at each corresponding program statement in each multi-fault program by using each failure test case and coverage information of each successful test case at each program statement in each multi-fault program;

The Spectrum value Spectrum _f,i collected by the f-th failure test case of each multi-failure program at the i-th program statement is:

Spectrum_f,i＝{NCF_f,i,NCS_i,NUF_f,i,NUS_i}

i has a value between 1 and K;

The ith fault characterization function in the p-th generation population is used for calculating the suspicious degree of the ith program statement calculated by each multi-fault program by using the f-th fault test case in combination with the spectrum value collected by the f-th fault test case of each multi-fault program at the ith program statement, wherein the suspicious degree of the ith program statement calculated by the f-th fault test case is as follows:

The value of p is between 1 and end, end is algebra of the last generation population, the value of K is between 1 and pop, the value of f is between 1 and fail, and the value of i is between 1 and K;

according to the suspicion calculation method, the suspicion calculated by each multi-fault program by using the kth fault characterization function and the f failure test case in the p generation population is characterized in that:

Rep_Suspiciousness_p,k,f

={Suspiciousness_p,k,f,1,Suspiciousness_p,k,f,2,...,Suspiciousness_p,k,f,is}

wherein is the number of program statements of each multi-fault program;

Sequencing program sentences according to the suspicious degree of each program sentence in the suspicious degree characterization calculated by using a kth fault characterization function and an f-th failure test case in the p-th generation population of each multi-fault program, and obtaining a sequencing list generated for the f-th failure test case of each multi-fault program by using the kth fault characterization function in the p-th generation population, wherein the sequencing list is characterized by:

Rep_Ranking_p,k,f＝{Index_p,k,f,1,Index_p,k,f,2,...,Index_p,k,f,is}

Index _p,k,f,rank represents the actual Index position of the rank-bit program statement in each multi-fault program when the suspicious degree is sorted in descending order, which is calculated according to the corresponding k-th fault characterization function and f-th failure test case in the p-th generation population of the program statement;

step 2, clustering and dividing the ordered list characterization of all the failure test cases generated by each multi-fault program by each fault characterization function to obtain ordered list characterization classification results of the plurality of failure test cases generated by each multi-fault program by each fault characterization function, wherein the ordered list characterization classification results are as follows;

Result_p,k＝{numOfClusters_p,k,Clusters_p,k}

Wherein Result _p,k is the sorted list characterization classification Result of the kth fault characterization function in the p-th generation population applied to the multiple failure test cases generated by each multi-fault program, numOfClusters _p,k is the cluster number predicted by using the kth fault characterization function in the p-th generation population, and when the cluster number is not equal to the actual fault number of each multi-fault program, the subsequent evaluation step of the current fault characterization function on the current multi-fault program is skipped, and the fitness score of the current fault characterization function on the corresponding multi-fault version is set to 0;

Clusters _p,k is a cluster number allocated to each failure test case of each multi-failure program after clustering by using the kth failure characterization function in the p generation population, and has

Clusters_p,k＝{Group_p,k,1,Group_p,k,2,...,Group_p,k,fail}

Wherein Group _p,k,f is the number of the corresponding class of the ordered list representation generated for the f-th failure test case of each multi-failure program by using the k-th failure representation function in the p-th generation population;

and step 2, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-fault procedure by each fault characterization function according to a plurality of external indexes, wherein the effectiveness score is as follows:

Score_PR_p,k＝PR(Clusters_p,k,Oracle)

Score_RR_p,k＝RR(Clusters_p,k,Oracle)

Score_FMI_p,k＝FMI(Clusters_p,k,Oracle)

Score_JC_p,k＝JC(Clusters_p,k,Oracle)

The Oracle is the real correspondence between each failure test case and a failure source in each multi-failure program, score_PR _p,k is the validity Score of the classification result corresponding to PR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_RR _p,k is the validity Score of the classification result corresponding to RR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_FMI _p,k is the validity Score of the classification result corresponding to FMI index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, and score_JC _p,k is the validity Score of the classification result corresponding to JC index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program;

PR is the precision rate, RR is the recall rate, FMI is Fowlkes-Mallows index, JC is the Jaccard coefficient, the four external indexes are respectively input by using a cluster number distributed for each failure test case of each multi-failure program after clustering division by using a kth failure characterization function in a p-th generation population and a real corresponding relation between each failure test case and a failure source in each multi-failure program, the effectiveness score of the clustering is output, the score value is within a [0,1] interval, and the larger the value is, the higher the effectiveness of the clustering process is represented;

The calculation of four external indexes requires establishing the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, calculating the effectiveness score of all possible corresponding relations for each external index, and taking the highest score as the calculation result of the index;

And step 2, merging the validity scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score when each fault characterization function is applied to fault separation of each multi-fault program, wherein the fitness score is as follows:

When each external index is calculated for the classification result generated by applying each fault characterization function to each multi-fault program, the effectiveness score under the corresponding relation with the highest effectiveness score is required to be taken as the calculation result, the current index is called as the corresponding relation for voting, and the vote of each corresponding relation is:

wherein, the Represents whether Score PR _p,k votes for the d-th correspondence,Represents whether Score _ RR _p,k votes for the d-th correspondence,Represents whether Score _ FMI _p,k votes for the d-th correspondence,Representing whether score_jc _p,k is a d-th corresponding relation Vote, vote _p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;

The total number of obtained tickets with the most corresponding relation is as follows:

Vote_p,k,most＝Max(Vote_p,k,1,Vote_p,k,2,...,Vote_p,k,V)

Vote _p,k,most is the total number of obtained tickets of the corresponding relation of the most obtained tickets of the classification result generated by each multi-fault program in the application of the kth fault characterization function in the p-th generation population in the effectiveness index evaluation process, and is determined by the number of fault sources contained in the multi-fault program;

The method for calculating the fitness score when the kth fault characterization function in the p-th generation population is applied to fault separation of each multi-fault program comprises the following steps:

FitnessScore_p,k＝TotalMetrics_p,k*PenaltyFactor_p,k

Wherein FitnessScore _p,k is the fitness score of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault procedure;

TotalMetrics_p,k＝Score_PR_p,k+Score_RR_p,k+Score_FMI_p,k+Score_JC_p,k

TotalMetrics _p,k is the sum of four external index validity scores of the classification result generated by each multi-fault procedure by applying the kth fault characterization function in the p generation population;

PenaltyFactor_p,k＝1-0.05*(4-Vote_p,k,most)

Wherein PenaltyFactor _p,k is the fitness penalty factor of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault program, and is determined by Vote _p,k,most;

and step 2, obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:

the individual fitness of each fault-characterization function in the initial population is:

Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version _ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,Applying the kth fault characterization function in the p-th generation population to the fitness score of fault separation of the verth multi-fault program;

preferably, in step 3, the selecting a parent population from the population according to the fitness of each individual in the initial population is:

fitness from a population according to each individual using roulette algorithm Selecting individuals with different pop-profiles to form a parent population:

Wherein pop is the Population size, the description is the ratio of the next parent Population to the parent Population, the popularizing _p represents the p-th generation Population, The fitness score representing the kth fault-characterization function in the p-th generation Population is Roulette which is a roulette algorithm, father _popularization _p+1 represents a parent Population selected from the popularization _p and used for generating the popularization _p+1, the value of p is between 1 and end, and the value of k is between 1 and pop;

step 3, creating offspring individuals by using the parent population according to the crossover rate, the replication rate and the mutation rate, wherein the offspring individuals are as follows:

creating each child individual using parent Population Father _population _p+1 according to crossover rate, replication rate, and mutation rate

Determining each sub-generation individual according to the crossover rate, the replication rate and the mutation rateThe creation mode of (a):

Wherein c, o and m are crossover, replication and mutation rates, and there are

c+o+m=1

0≤c,o,m≤1

Random is a random number in a [0,1] interval, method is a generation mode of current offspring individuals, crossover represents generation through cross, copy represents generation through Copy, mutate represents generation through variation, and an individual creation mode is determined according to the value of random and the size of cross rate, replication rate and variation rate;

The creation of offspring individuals is:

wherein, the The method is characterized in that a kth fault characterization function of a child Population of a p-th generation Population is realized, the Create_ Crossover is a crossover algorithm, the REF _r1,REF_r2 is two different individuals randomly selected from a parent Population Father _population _p+1, the Create_copy and the Create_ Mutate are replication and mutation algorithms, and the REF _r is one individual randomly selected from a parent Population Father _population _p+1;

Each individual in the population possesses, in addition to its own attribute as a fault-characterizing function, an age attribute:

wherein, the For the age of the kth individual in the p-th generation population,For all of the individuals in the initial population,For newly created individuals in a offspring population, method isCorresponding creation means, REF _r is created by a replication algorithmRandomly selecting individuals for replication from the parent population, age representing Age;

when an individual's age reaches 3, it will not be selectable as a parent Population, defining Survive _population _p as a surviving Population consisting of all individuals less than 3 in the p-th generation Population;

Namely, the selection method of the parent Population Father _population _p+1 of the p-th generation Population is modified as follows:

wherein the kth fault characterization function of the p-th generation population Is the individual in Survive _population _p if and only if

When generating offspring individualsStopping generating when the number reaches the Population scale pop, and taking the pop individuals as next generation Population _p+1;

step 3, evaluating fitness of the offspring population by using the method in step 2 and continuing to evolve the next generation population until the termination condition is reached:

Before a user-defined termination condition SC (such as iterated fixed times or obtained fault representation functions meeting requirements) is met, continuously using the method in the step 2 to evaluate the fitness score of each fault representation function in the current p-th generation Population position _p, and using the method in the step 3 to generate a child Population position _p+1;

Preferably, in step 4, the fault characterization function with the highest individual fitness in the last generation population that reaches the termination condition is used as a fault classification model as follows:

the Population _end completes fitness evaluation for the last generation and reaches the termination condition SC Population, For the highest fitness score in the placement _end The characterization function can be used for generating the fault characterization of the real multi-fault program and separating the fault;

Preferably, in the multi-fault procedure requiring fault division using the fault classification model in step 5, generating the ordered list for each failed test case is characterized as follows:

Using the kth fault-characterization function with highest fitness score, i.e. the fault classification model, selected in step 4 from the Population _end of the last generation for which fitness evaluation is completed and termination conditions SC are reached Generating an ordered list for each failed test case of the target multi-fault program according to the method in step 2 characterizes Rep_ranking _end,k,f,Rep_Ranking_end,k,f as using a fault classification modelThe method comprises the steps of representing an ordered list generated for an f-th failure test case of a target multi-fault program;

step 5, calculating the distance between each pair of ordered list characterization is as follows:

the preselected Distance Metric function distance_metric is:

Distance_fo,ft＝Distance_Metric(Rep_Ranking_end,k,fo,Rep_Ranking_end,k,ft)

wherein, the two failure test cases fo and ft of the Rep_ranking _end,k,fo,Rep_Ranking_end,k,fo as the target multi-failure program are based on the failure separation model The generated ordered list is characterized in that Distance _fo,ft is the Distance between failed test cases fo and ft obtained by calculation by using a Distance measurement function distance_metric;

Step 5, clustering the failure test cases into clusters by using the calculated distance;

The pre-selected clustering strategy includes:

numOfClusters_end,k

=Predict(Distance_f1,f1,Distance_f1,f2,...,Distance_fo,ft,...,Distance_fail,fail)

And:

Clusters_end,k＝Cluster(Distance_f1,f1,Distance_f1,f2,...,Distance_fo,ft,...,Distance_fail,fail)

Wherein Distance _fo,ft is the Distance between the failure test cases fo and ft calculated by using a Distance measurement function distance_metric, prediction is a function of the number of predicted clusters, numOfClusters _end,k is the number of clusters predicted by using a fault classification model, cluster is a clustering algorithm, clusters _end,k is the Cluster number allocated to each failure test case of the target multi-fault program after clustering by using a fault classification model, and there are

Clusters_end,k＝{Group_end,k,1,Group_end,k,2,...,Group_end,k,fail}

Wherein Group _end,k,f represents the number of the corresponding class for the ordered list generated by using the fault classification model for the f-th failure test case of the target multi-fault program;

The prediction and Cluster both take as input the distance between every two failed test cases fo and ft in the target multi-failure program.

The invention also provides a computer readable medium storing a computer program for execution by an electronic device, which when run on the electronic device causes the electronic device to execute the steps of the ordering similarity software abnormal behavior modeling method.

The invention aims to provide a method for automatically generating fault sequencing characterization for a fault separation scene, which comprises the steps of firstly generating a random initial fault sequencing characterization population by using a genetic programming algorithm, then applying population individuals to a multi-fault program dataset appointed by a user for fault separation, generating a sequencing list characterization for fault test cases in the multi-fault program by using each fault sequencing characterization, calculating distances among the characterizations, dividing the fault test case characterizations according to fault sources by using a clustering algorithm, evaluating the effectiveness of a dividing process by using a plurality of external indexes, calculating the adaptability degree of the fault sequencing characterization to the multi-fault program fault separation according to evaluation results, counting the overall performance of the fault sequencing characterization on the dataset to obtain the adaptability degree score of the fault sequencing characterization, finally selecting a parent population according to the adaptability degree score of each fault sequencing characterization individual in the population, generating a child population by crossing, copying and mutation operations according to preset crossing rate, copying and mutation operation, continuously generating the child population by the same method until the method reaches a termination condition, and taking the fault characterization with high adaptability degree in the last generation of the population as an output fault characterization individual with the final generation of the complete adaptability degree evaluation as the fault sequencing result for the fault separation of the multi-fault program. The invention finally forms a complete software abnormal behavior modeling method based on sequencing similarity.

The software abnormal behavior modeling method based on sequencing similarity provides a fault characterization function generating method which realizes automation by means of a genetic programming algorithm and comprises a plurality of interfaces capable of performing manual intervention to improve flexibility, and solves the problem that no fault sequencing characterization automatic generating method oriented to fault separation is available at present. Based on the automatic generation method of the fault characterization function, the invention solves the problem that the mapping relation from the fault sequencing characterization to the fault test case characterization capability cannot be established by using a genetic programming algorithm, realizes the automation of the generation of the fault characterization function, and improves the adaptability of the generated fault characterization function to the fault separation application of multiple fault programs of different types. Finally, the invention also adds a plurality of interfaces which can be adjusted manually in the process of the method, so that the method can meet the requirements of various types. The invention finally forms a complete software abnormal behavior modeling method based on sequencing similarity, and a user can realize efficient automatic fault sequencing characterization generation by using the method and is used for fault separation work in actual production.

Drawings

FIG. 1 is a flow chart of a method of an embodiment of the present invention;

FIG. 2 is a tree structure diagram of a fault characterization function generated by a genetic programming algorithm of an embodiment of the present invention;

FIG. 3 is an algorithm schematic of the overall flow of the method of the embodiment of the invention;

FIG. 4 is an algorithm diagram of a process for adaptively evaluating individuals of a fault-characterization function population according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In particular, the method according to the technical solution of the present invention may be implemented by those skilled in the art using computer software technology to implement an automatic operation flow, and a system apparatus for implementing the method, such as a computer readable storage medium storing a corresponding computer program according to the technical solution of the present invention, and a computer device including the operation of the corresponding computer program, should also fall within the protection scope of the present invention.

FIG. 1 is a flow chart of the method of the present invention. The following describes a method for modeling abnormal behaviors of sequencing similarity software according to the technical scheme of the method in the embodiment of the invention with reference to fig. 1 to 4, and the method specifically comprises the following steps:

Each fault characterization function described in step 1 is specifically defined as follows:

the fault characterization function REF is:

Suspiciousness_k,f,i＝REF_k(Spectrum_f,i)

Spectrum_f,i＝{NCF_f,i,NCS_i,NUF_f,i,NUS_i}

The NCF _f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS _i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF _f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS _i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF _f,i,NCS_i,NUF_f,i,NUS_i is used as the Spectrum information Spectrum _f,i,Suspiciousness_k,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF _k;

in one embodiment, the fault-characterization function may be generated in a tree form by a genetic programming algorithm, where one of the generated fault-characterization functions REF is shown in fig. 2, where rectangular nodes represent operators combined with its lower child nodes, and circular nodes represent program spectral values and constants involved in the operation, and the fault-characterization function shown in fig. 2 is:

the initial population is constructed in the step 1:

In one embodiment, since a population of 160 individuals of the fault-characterization function can generate individuals with high effectiveness for fault isolation under the premise of controllable population evolution cost, the population size can be designated as 160, i.e., pop=160 is set.

Step 2, generating an ordered list representation of each failure test case generated by each multi-failure program by combining each failure representation function in the initial population, wherein the ordered list representation is specifically as follows:

the plurality of failure test cases are specifically defined as follows:

{Ftc₁,Ftc₂,Ftc₃,...,Ftc_fail}

The plurality of successful test cases are specifically defined as follows:

{Stc₁,Stc₂,Stc₃,...,Stc_success};

In one embodiment, if a large number of successful test cases irrelevant to the fault module exist in the test suite, the number of the successful test cases can be reduced by screening the successful test cases, and the adaptability evaluation efficiency in the population evolution process is improved.

Spectrum_f,i＝{NCF_f,i,NCS_i,NUF_f,i,NUS_i}

i has a value between 1 and K;

Rep_Suspiciousness_p,k,f

={Suspiciousness_p,k,f,1,Suspiciousness_p,k,f,2,...,Suspiciousness_p,k,f,is}

wherein is the number of program statements of each multi-fault program;

Rep_Ranking_p,k,f＝{Index_p,k,f,1,Index_p,k,f,2,...,Index_p,k,f,is}

Result_p,k＝{numOfClusters_p,k,Clusters_p,k}

Clusters_p,k＝{Group_p,k,1,Group_p,k,2,...,Group_p,k,fail}

Score_PR_p,k＝PR(Clusters_p,k,Oracle)

Score_RR_p,k＝RR(Clusters_p,k,Oracle)

Score_FMI_p,k＝FMI(Clusters_p,k,Oracle)

Score_JC_p,k＝JC(Clusters_p,k,Oracle)

PR and RR are single-case indexes, and the indexes aim at comparing whether the classification result of each failure test case is consistent with the real classification. For each fault class, the division situation of one failure test case can be divided into four types, including true positive (the failure test case is divided into the class by the fault separation frame, and the failure test case actually belongs to the class), false positive (the failure test case is divided into the class by the fault separation frame, but the failure test case actually does not belong to the class), true negative (the failure test case is not divided into the class by the fault separation frame, and the failure test case actually does not belong to the class), and false negative (the failure test case is not divided into the class by the fault separation frame, but the failure test case actually belongs to the class). Let TP, FP, TN, FN be the sum of the occurrence times of the four cases of true yang, false yang, true yin and false yin respectively, and the calculation formulas of PR and RR are as follows:

FMI and JC are paired indexes, and the indexes aim to compare whether the classification condition of each pair of failed test cases is consistent with the condition under the real classification. For each fault class, the partition consistency of a pair of fault test cases can be divided into four cases, including true identity (both fault test cases are/are not partitioned into the class by the fault isolation framework and are/are not actually partitioned into the class by the fault isolation framework), false identity (both fault test cases are/are not partitioned into the class by the fault isolation framework, but actually one fault test case is in the class, the other is not in the class), false identity (one fault test case is partitioned into the class by the fault isolation framework, and the other is not partitioned into the class, but actually both fault test cases are/are not in the class), true identity (one fault test case is partitioned into the class by the fault isolation framework, and the other is not partitioned into the class, and actually one fault test case is in the class, and the other is not in the class). Let SS, SD, DS, DD be the sum of the occurrence times of the four cases of true identity, false dissimilarity and true dissimilarity, and the calculation formulas of FMI and JC are as follows:

In one embodiment, the training dataset includes 2,3,4,5 fault procedures for which there may be 2,6,24,120 correspondences between the failure test case class and the true fault class, respectively, i.e., V takes 2,6,24,120 values, respectively.

Vote_p,k,most＝Max(Vote_p,k,1,Vote_p,k,2,...,Vote_p,k,V)

FitnessScore_p,k＝TotalMetrics_p,k*PenaltyFactor_p,k

TotalMetrics_p,k＝Score_PR_p,k+Score_RR_p,k+Score_FMI_p,k+Score_JC_p,k

PenaltyFactor_p,k＝1-0.05*(4-Vote_p,k,most)

The fitness score calculating method can be adjusted according to the requirement of fault sequencing characterization, if the ideal value of the clustering external indexes is more concerned than whether four indexes can be consistent in the corresponding relation between the clustering cluster and the real fault class, the influence of PenaktyFactor can be weakened, the influence is lower than 0.05, if the authenticity of clustering validity assessment is required to be ensured, the fitness score calculating method is not adopted, and four external index values on the corresponding relation can be added to be taken as the fitness score after the corresponding relation between the clustering cluster and the real fault class is determined.

In one embodiment, since 600 multi-fault procedures may more generally simulate multi-fault procedures of different fault types, different program types, and different numbers of faults, suitable as training data sets generated by the fault ordering characterization, the number of versions may be set to 600, i.e., nv=600.

In one embodiment, the fitness score evaluation flow of the individual fault-characterization function is shown in fig. 4.

step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, wherein the parent population is:

In one embodiment, since selecting half of the individuals from the population as parent population can ensure the efficiency of the evolution process and prevent local optima from being generated too quickly, the ratio of parent population to population can be set to be 0.5, i.e. report=0.5.

The parent population individual selection algorithm is not unique and can be flexibly adjusted to algorithms other than Roulette according to the requirement.

Wherein c, o and m are crossover, replication and mutation rates, and there are

c+o+m=1

0≤c,o,m≤1

In one embodiment, the cross rate, replication rate and mutation rate are set to 0.7,0.2 and 0.1 respectively, so that population stability can be guaranteed to have stronger evolution of the failure test case characterization generating capability direction, and the capability of jumping out of local optimum is provided, and therefore the cross rate, replication rate and mutation rate can be set to 0.7,0.2 and 0.1, namely c=0.7, o=0.2 and m=0.1.

The setting of the crossover rate, the replication rate and the mutation rate can be adjusted according to actual conditions on the basis of meeting the limiting conditions, if the population cannot continuously improve the overall characterization capacity along with the evolution process, the replication rate can be properly improved, excellent individuals of the previous generation population are reserved, if the population evolves to be locally optimal, the replication rate can be properly reduced and the mutation rate can be improved without change of the optimal individuals of the continuous multiple generation population, and the breakthrough of the local optimal is realized by the mutated excellent individuals.

In one embodiment, as the replication of the population optimal individuals easily causes the population to fall into local optimization, a mutation rate of 0.5 can be additionally added under the condition that the offspring population individuals are confirmed to be generated by replicating the parent population optimal individuals, so that mutation operation is applied to the offspring population individuals before replicating the optimal individuals, and the capability of the population evolution to break through the local optimization is improved.

The creation of offspring individuals is:

The selection of the termination condition can be set as evolution fixed algebra if the time cost of population evolution is low, and can be flexibly determined by a user according to the evaluation condition of each generation of fitness if each generation of population evolution consumes more time.

And step 4, taking a fault characterization function with the highest individual fitness in the final generation population reaching the termination condition as a fault classification model, wherein the fault classification model comprises the following steps:

In the step 5, in the multi-fault procedure requiring fault division, generating an ordered list for each failed test case by using a fault classification model is characterized in that:

the preselected Distance Metric function distance_metric is:

Distance_fo,ft＝Distance_Metric(Rep_Ranking_end,k,fo,Rep_Ranking_end,k,ft)

In one embodiment, to more accurately measure the similarity between ordered list representations, a distance measurement function such as Jaccard distance, as proposed for Ranking, may be used.

The pre-selected clustering strategy includes:

numOfClusters_end,k

And:

Clusters_end,k＝{Group_end,k,1,Group_end,k,2,...,Group_end,k,fail}

Particular embodiments of the present invention also provide a computer readable medium.

The computer readable medium is a server workstation;

The server workstation stores a computer program executed by the electronic device, and when the computer program runs on the electronic device, the electronic device is caused to execute the steps of the ordering similarity software abnormal behavior modeling method in the embodiment of the invention.

It should be understood that parts of the specification not specifically set forth herein are all prior art.

It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, but rather to limit the scope of the claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims

1. The software abnormal behavior modeling method based on the ordering similarity is characterized by comprising the following steps of:

Step 2, generating an ordered list representation of each failure test case generated by each multi-failure program by combining each failure representation function in the initial population, carrying out cluster division to obtain ordered list representation classification results of a plurality of failure test cases generated by each multi-failure program by each failure representation function, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-failure program by each failure representation function, and obtaining the fitness score of each multi-failure program when each failure representation function is applied to failure separation of each multi-failure program and the fitness of each individual in the initial population;

2. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein each fault-characterization function in step 1 is specifically defined as follows:

the fault characterization function REF is:

Suspiciousness_k,f,i＝REF_k(Spectrum_f,i)

Spectrum_f,i＝{NCF_f,i,NCS_i,NUF_f,i,NUS_i}

the initial population is constructed in the step 1:

Wherein, the Population _p represents the p generation Population, the initial Population constructed is Population ₁, the pop is the Population individual number or Population scale, Representing the kth fault-characterization function in the p-th generation population.

3. The method for modeling abnormal behavior of software based on sequencing similarity according to claim 1, wherein step 2 generates a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, specifically as follows:

the plurality of failure test cases are specifically defined as follows:

{Ftc₁,Ftc₂,Ftc₃,...,Ftc_fail}

The plurality of successful test cases are specifically defined as follows:

{Stc₁,Stc₂,Stc₃,...,Stc_success};

Spectrum_f,i＝{NCF_f,i,NCS_i,NUF_f,i,NUS_i}

i has a value between 1 and K;

Rep_Suspiciousness_p,k,f

={Suspiciousness_p,k,f,1,Suspiciousness_p,k,f,2,...,Suspiciousness_p,k,f,is}

wherein is the number of program statements of each multi-fault program;

Rep_Ranking_p,k,f＝{Index_p,k,f,1,Index_p,k,f,2,...,Index_p,k,f,is}

Index _p,k,f,rank represents the actual Index position of the rank-order program statement in each multi-fault program when the suspicion degree obtained by calculation of the kth fault characterization function and the f-th failure test case in the p-th generation population is ordered according to the program statement.

4. The method for modeling abnormal software behavior based on sequencing similarity according to claim 1, wherein the step 2 is characterized in that the clustering is performed to obtain a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, and the sequencing list characterization classification result is specifically as follows:

Result_p,k＝{numOfClusters_p,k,Clusters_p,k}

Clusters_p,k＝{Group_p,k,1,Group_p,k,2,...,Group_p,k,fail}

Wherein Group _p,k,f is the number of the corresponding class of the ordered list generated for the f-th failure test case of each multi-failure program by using the k-th failure characterization function in the p-th generation population.

5. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein the step 2 of evaluating the validity score of each external index corresponding to the classification result generated by each multi-fault procedure according to each fault characterization function is:

Score_PR_p,k＝PR(Clusters_p,k,Oracle)

Score_RR_p,k＝RR(Clusters_p,k,Oracle)

Score_FMI_p,k＝FMI(Clusters_p,k,Oracle)

Score_JC_p,k＝JC(Clusters_p,k,Oracle)

The calculation of four external indexes requires the establishment of the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, and for each external index, the effectiveness score under all possible corresponding relations is calculated, and the highest score is taken as the calculation result of the index.

6. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein the step 2 of obtaining the fitness score of each fault characterization function when applied to fault isolation of each multi-fault program is specifically as follows:

Combining the effectiveness scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score of each fault characterization function when the fault characterization function is applied to fault separation of each multi-fault program;

step 2, obtaining fitness scores of each individual in the initial population, wherein the fitness scores are specifically as follows:

And obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function when the fault characterization function is applied to fault separation of a plurality of multi-fault procedures.

7. The method for modeling abnormal behavior of software based on ranking similarity according to claim 6, wherein,

The validity score of the multiple external indexes corresponding to the classification result generated by each multi-fault program by combining each fault characterization function is obtained, and the fitness score of each fault characterization function when applied to fault separation of each multi-fault program is:

wherein, the Represents whether Score PR _p,k votes for the d-th correspondence,Represents whether Score _ RR _p,k votes for the d-th correspondence,Represents whether Score-FMI _p,k votes for the d-th correspondence,Representing whether score_jc _p,k is a d-th corresponding relation Vote, vote _p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;

Vote_p,k,nost＝Max(Vote_p,k,1,Vote_p,k,2,...,Vote_p,k,V)

FitnessScore_p,k＝TotalMetrics_p,k*PenaltyFactor_p,k

TotalMetrics_p,k＝Score_PR_p,k+Score_RR_p,k+Score_FMI_p,k+Score_JC_p,k

PenaltyFactor_p,k＝1-0.05*(4-Vote_p,k,most)

The fitness of each individual of the initial population is obtained according to the fitness score when each fault characterization function is applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:

Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version _ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,And (3) applying the k fault characterization function in the p generation population to the fitness score of fault separation of the third multi-fault program.

8. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein,

creating each child using parent Population Father-Population _p+1 based on crossover, replication and mutation rates

Wherein c, o and m are crossover, replication and mutation rates, and there are

c+o+m=1

0≤c,o,m≤1

The creation of offspring individuals is:

And (3) continuously using the method in the step (2) to evaluate the fitness score of each fault characterization function in the current p-th generation Population position _p before the user-defined termination condition SC is met, the iteration is carried out for a fixed number of times or the fault characterization function meeting the requirement is obtained, and using the method in the step (3) to generate a child Population position _p+1.

9. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein,

the preselected Distance Metric function distance_metric is:

Distance_fo,ft＝Distance_Metric(Rep_Ranking_end,k,fo,Rep_Ranking_end,k,ft)

The pre-selected clustering strategy includes:

numOfClusters_end,k

And:

Clusters_end,k＝{Group_end,k,1,Group_end,k,2,...,Group_end,k,fail}

10. A computer readable medium, characterized in that it stores a computer program for execution by an electronic device, which computer program, when run on the electronic device, causes the electronic device to perform the steps of the method according to any one of claims 1-9.