Disclosure of Invention
In order to solve the technical problems, the invention provides a sequencing similarity software abnormal behavior modeling method and a computer readable medium.
The invention provides a software abnormal behavior modeling method based on sequencing similarity, which comprises the following specific steps:
Step 1, introducing a plurality of multi-fault programs, randomly generating a plurality of fault characterization functions by using a genetic programming algorithm, taking each fault characterization function as each individual in a population, and constructing an initial population by the plurality of fault characterization functions;
Step 2, generating a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, clustering and dividing the sequencing list characterization of all failure test cases generated by each multi-failure program by each failure characterization function, obtaining a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, evaluating the validity score of each external index corresponding to the classification result generated by each multi-failure program by each failure characterization function according to a plurality of external indexes, merging the validity scores of a plurality of external indexes corresponding to the classification result generated by each multi-failure program by each failure characterization function, obtaining an adaptation score when each failure characterization function is applied to failure separation of each multi-failure program, and obtaining the adaptation score of each individual of the initial population according to the adaptation score when each failure characterization function is applied to failure separation of a plurality of multi-failure programs;
Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, creating each child individual by using the parent population according to the crossover rate, the replication rate and the mutation rate, taking all the created child individuals as the child population, and calculating the fitness of each individual in the child population through the step 2;
step 4, repeatedly executing the step 2 and the step 3 until the termination condition appointed by the user is reached, and taking the fault characterization function with the highest individual fitness in the last generation population reaching the termination condition as a fault classification model;
And 5, generating a sequencing list representation for each failure test case in a multi-fault program needing fault division by using a fault classification model, calculating the distance between each pair of sequencing list representations, and clustering the failure test cases by combining the distance between each pair of sequencing list representations to realize modeling of software abnormal behaviors and division of a plurality of faults.
Preferably, each fault-characterization function described in step1 is specifically defined as follows:
the fault characterization function REF is:
Suspiciousnessk,f,i=REFk(Spectrumf,i)
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
The NCF f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF f,i,NCSi,NUFf,i,NUSi is used as the Spectrum information Spectrum f,i,Sspiciousnessk,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF k;
the initial population is constructed in the step 1:
A plurality of fault characterization functions are randomly generated by using a genetic programming algorithm and are used as an initial population together:
Wherein, the Population p represents the p generation Population, the initial Population constructed is Population 1, the pop is the Population individual number or Population scale, Representing a kth fault characterization function in the p-th generation population;
Preferably, in step 2, each fault characterization function generated in combination with each fault characterization function in the initial population is applied to the ordered list characterization of each failure test case generated by each multi-fault program, and specifically as follows:
each multi-fault program comprises a plurality of failure test cases and a plurality of success test cases;
the plurality of failure test cases are specifically defined as follows:
{Ftc1,Ftc2,Ftc3,...,Ftcfail}
Wherein fail is the number of failed test cases, ftc f represents the f failed test cases in each multi-fault program, and the value of f is between 1 and fail;
The plurality of successful test cases are specifically defined as follows:
{Stc1,Stc2,Stc3,...,Stcsuccess};
Wherein success is the number of successful test cases, stc s represents the s-th successful test case in each multi-fault program, and the value of s is between 1 and success;
collecting coverage information of each failed test case and each successful test case at each program statement in each multi-fault program;
calculating a spectrum value of each failure test case at each corresponding program statement in each multi-fault program by using each failure test case and coverage information of each successful test case at each program statement in each multi-fault program;
The Spectrum value Spectrum f,i collected by the f-th failure test case of each multi-failure program at the i-th program statement is:
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
i has a value between 1 and K;
The ith fault characterization function in the p-th generation population is used for calculating the suspicious degree of the ith program statement calculated by each multi-fault program by using the f-th fault test case in combination with the spectrum value collected by the f-th fault test case of each multi-fault program at the ith program statement, wherein the suspicious degree of the ith program statement calculated by the f-th fault test case is as follows:
The value of p is between 1 and end, end is algebra of the last generation population, the value of K is between 1 and pop, the value of f is between 1 and fail, and the value of i is between 1 and K;
according to the suspicion calculation method, the suspicion calculated by each multi-fault program by using the kth fault characterization function and the f failure test case in the p generation population is characterized in that:
Rep_Suspiciousnessp,k,f
={Suspiciousnessp,k,f,1,Suspiciousnessp,k,f,2,...,Suspiciousnessp,k,f,is}
wherein is the number of program statements of each multi-fault program;
Sequencing program sentences according to the suspicious degree of each program sentence in the suspicious degree characterization calculated by using a kth fault characterization function and an f-th failure test case in the p-th generation population of each multi-fault program, and obtaining a sequencing list generated for the f-th failure test case of each multi-fault program by using the kth fault characterization function in the p-th generation population, wherein the sequencing list is characterized by:
Rep_Rankingp,k,f={Indexp,k,f,1,Indexp,k,f,2,...,Indexp,k,f,is}
Index p,k,f,rank represents the actual Index position of the rank-bit program statement in each multi-fault program when the suspicious degree is sorted in descending order, which is calculated according to the corresponding k-th fault characterization function and f-th failure test case in the p-th generation population of the program statement;
step 2, clustering and dividing the ordered list characterization of all the failure test cases generated by each multi-fault program by each fault characterization function to obtain ordered list characterization classification results of the plurality of failure test cases generated by each multi-fault program by each fault characterization function, wherein the ordered list characterization classification results are as follows;
Resultp,k={numOfClustersp,k,Clustersp,k}
Wherein Result p,k is the sorted list characterization classification Result of the kth fault characterization function in the p-th generation population applied to the multiple failure test cases generated by each multi-fault program, numOfClusters p,k is the cluster number predicted by using the kth fault characterization function in the p-th generation population, and when the cluster number is not equal to the actual fault number of each multi-fault program, the subsequent evaluation step of the current fault characterization function on the current multi-fault program is skipped, and the fitness score of the current fault characterization function on the corresponding multi-fault version is set to 0;
Clusters p,k is a cluster number allocated to each failure test case of each multi-failure program after clustering by using the kth failure characterization function in the p generation population, and has
Clustersp,k={Groupp,k,1,Groupp,k,2,...,Groupp,k,fail}
Wherein Group p,k,f is the number of the corresponding class of the ordered list representation generated for the f-th failure test case of each multi-failure program by using the k-th failure representation function in the p-th generation population;
and step 2, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-fault procedure by each fault characterization function according to a plurality of external indexes, wherein the effectiveness score is as follows:
Score_PRp,k=PR(Clustersp,k,Oracle)
Score_RRp,k=RR(Clustersp,k,Oracle)
Score_FMIp,k=FMI(Clustersp,k,Oracle)
Score_JCp,k=JC(Clustersp,k,Oracle)
The Oracle is the real correspondence between each failure test case and a failure source in each multi-failure program, score_PR p,k is the validity Score of the classification result corresponding to PR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_RR p,k is the validity Score of the classification result corresponding to RR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_FMI p,k is the validity Score of the classification result corresponding to FMI index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, and score_JC p,k is the validity Score of the classification result corresponding to JC index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program;
PR is the precision rate, RR is the recall rate, FMI is Fowlkes-Mallows index, JC is the Jaccard coefficient, the four external indexes are respectively input by using a cluster number distributed for each failure test case of each multi-failure program after clustering division by using a kth failure characterization function in a p-th generation population and a real corresponding relation between each failure test case and a failure source in each multi-failure program, the effectiveness score of the clustering is output, the score value is within a [0,1] interval, and the larger the value is, the higher the effectiveness of the clustering process is represented;
The calculation of four external indexes requires establishing the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, calculating the effectiveness score of all possible corresponding relations for each external index, and taking the highest score as the calculation result of the index;
And step 2, merging the validity scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score when each fault characterization function is applied to fault separation of each multi-fault program, wherein the fitness score is as follows:
When each external index is calculated for the classification result generated by applying each fault characterization function to each multi-fault program, the effectiveness score under the corresponding relation with the highest effectiveness score is required to be taken as the calculation result, the current index is called as the corresponding relation for voting, and the vote of each corresponding relation is:
wherein, the Represents whether Score PR p,k votes for the d-th correspondence,Represents whether Score _ RR p,k votes for the d-th correspondence,Represents whether Score _ FMI p,k votes for the d-th correspondence,Representing whether score_jc p,k is a d-th corresponding relation Vote, vote p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;
The total number of obtained tickets with the most corresponding relation is as follows:
Votep,k,most=Max(Votep,k,1,Votep,k,2,...,Votep,k,V)
Vote p,k,most is the total number of obtained tickets of the corresponding relation of the most obtained tickets of the classification result generated by each multi-fault program in the application of the kth fault characterization function in the p-th generation population in the effectiveness index evaluation process, and is determined by the number of fault sources contained in the multi-fault program;
The method for calculating the fitness score when the kth fault characterization function in the p-th generation population is applied to fault separation of each multi-fault program comprises the following steps:
FitnessScorep,k=TotalMetricsp,k*PenaltyFactorp,k
Wherein FitnessScore p,k is the fitness score of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault procedure;
TotalMetricsp,k=Score_PRp,k+Score_RRp,k+Score_FMIp,k+Score_JCp,k
TotalMetrics p,k is the sum of four external index validity scores of the classification result generated by each multi-fault procedure by applying the kth fault characterization function in the p generation population;
PenaltyFactorp,k=1-0.05*(4-Votep,k,most)
Wherein PenaltyFactor p,k is the fitness penalty factor of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault program, and is determined by Vote p,k,most;
and step 2, obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:
the individual fitness of each fault-characterization function in the initial population is:
Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,Applying the kth fault characterization function in the p-th generation population to the fitness score of fault separation of the verth multi-fault program;
preferably, in step 3, the selecting a parent population from the population according to the fitness of each individual in the initial population is:
fitness from a population according to each individual using roulette algorithm Selecting individuals with different pop-profiles to form a parent population:
Wherein pop is the Population size, the description is the ratio of the next parent Population to the parent Population, the popularizing p represents the p-th generation Population, The fitness score representing the kth fault-characterization function in the p-th generation Population is Roulette which is a roulette algorithm, father _popularization p+1 represents a parent Population selected from the popularization p and used for generating the popularization p+1, the value of p is between 1 and end, and the value of k is between 1 and pop;
step 3, creating offspring individuals by using the parent population according to the crossover rate, the replication rate and the mutation rate, wherein the offspring individuals are as follows:
creating each child individual using parent Population Father _population p+1 according to crossover rate, replication rate, and mutation rate
Determining each sub-generation individual according to the crossover rate, the replication rate and the mutation rateThe creation mode of (a):
Wherein c, o and m are crossover, replication and mutation rates, and there are
c+o+m=1
0≤c,o,m≤1
Random is a random number in a [0,1] interval, method is a generation mode of current offspring individuals, crossover represents generation through cross, copy represents generation through Copy, mutate represents generation through variation, and an individual creation mode is determined according to the value of random and the size of cross rate, replication rate and variation rate;
The creation of offspring individuals is:
wherein, the The method is characterized in that a kth fault characterization function of a child Population of a p-th generation Population is realized, the Create_ Crossover is a crossover algorithm, the REF r1,REFr2 is two different individuals randomly selected from a parent Population Father _population p+1, the Create_copy and the Create_ Mutate are replication and mutation algorithms, and the REF r is one individual randomly selected from a parent Population Father _population p+1;
Each individual in the population possesses, in addition to its own attribute as a fault-characterizing function, an age attribute:
wherein, the For the age of the kth individual in the p-th generation population,For all of the individuals in the initial population,For newly created individuals in a offspring population, method isCorresponding creation means, REF r is created by a replication algorithmRandomly selecting individuals for replication from the parent population, age representing Age;
when an individual's age reaches 3, it will not be selectable as a parent Population, defining Survive _population p as a surviving Population consisting of all individuals less than 3 in the p-th generation Population;
Namely, the selection method of the parent Population Father _population p+1 of the p-th generation Population is modified as follows:
wherein the kth fault characterization function of the p-th generation population Is the individual in Survive _population p if and only if
When generating offspring individualsStopping generating when the number reaches the Population scale pop, and taking the pop individuals as next generation Population p+1;
step 3, evaluating fitness of the offspring population by using the method in step 2 and continuing to evolve the next generation population until the termination condition is reached:
Before a user-defined termination condition SC (such as iterated fixed times or obtained fault representation functions meeting requirements) is met, continuously using the method in the step 2 to evaluate the fitness score of each fault representation function in the current p-th generation Population position p, and using the method in the step 3 to generate a child Population position p+1;
Preferably, in step 4, the fault characterization function with the highest individual fitness in the last generation population that reaches the termination condition is used as a fault classification model as follows:
the Population end completes fitness evaluation for the last generation and reaches the termination condition SC Population, For the highest fitness score in the placement end The characterization function can be used for generating the fault characterization of the real multi-fault program and separating the fault;
Preferably, in the multi-fault procedure requiring fault division using the fault classification model in step 5, generating the ordered list for each failed test case is characterized as follows:
Using the kth fault-characterization function with highest fitness score, i.e. the fault classification model, selected in step 4 from the Population end of the last generation for which fitness evaluation is completed and termination conditions SC are reached Generating an ordered list for each failed test case of the target multi-fault program according to the method in step 2 characterizes Rep_ranking end,k,f,Rep_Rankingend,k,f as using a fault classification modelThe method comprises the steps of representing an ordered list generated for an f-th failure test case of a target multi-fault program;
step 5, calculating the distance between each pair of ordered list characterization is as follows:
the preselected Distance Metric function distance_metric is:
Distancefo,ft=Distance_Metric(Rep_Rankingend,k,fo,Rep_Rankingend,k,ft)
wherein, the two failure test cases fo and ft of the Rep_ranking end,k,fo,Rep_Rankingend,k,fo as the target multi-failure program are based on the failure separation model The generated ordered list is characterized in that Distance fo,ft is the Distance between failed test cases fo and ft obtained by calculation by using a Distance measurement function distance_metric;
Step 5, clustering the failure test cases into clusters by using the calculated distance;
The pre-selected clustering strategy includes:
numOfClustersend,k
=Predict(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
And:
Clustersend,k=Cluster(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
Wherein Distance fo,ft is the Distance between the failure test cases fo and ft calculated by using a Distance measurement function distance_metric, prediction is a function of the number of predicted clusters, numOfClusters end,k is the number of clusters predicted by using a fault classification model, cluster is a clustering algorithm, clusters end,k is the Cluster number allocated to each failure test case of the target multi-fault program after clustering by using a fault classification model, and there are
Clustersend,k={Groupend,k,1,Groupend,k,2,...,Groupend,k,fail}
Wherein Group end,k,f represents the number of the corresponding class for the ordered list generated by using the fault classification model for the f-th failure test case of the target multi-fault program;
The prediction and Cluster both take as input the distance between every two failed test cases fo and ft in the target multi-failure program.
The invention also provides a computer readable medium storing a computer program for execution by an electronic device, which when run on the electronic device causes the electronic device to execute the steps of the ordering similarity software abnormal behavior modeling method.
The invention aims to provide a method for automatically generating fault sequencing characterization for a fault separation scene, which comprises the steps of firstly generating a random initial fault sequencing characterization population by using a genetic programming algorithm, then applying population individuals to a multi-fault program dataset appointed by a user for fault separation, generating a sequencing list characterization for fault test cases in the multi-fault program by using each fault sequencing characterization, calculating distances among the characterizations, dividing the fault test case characterizations according to fault sources by using a clustering algorithm, evaluating the effectiveness of a dividing process by using a plurality of external indexes, calculating the adaptability degree of the fault sequencing characterization to the multi-fault program fault separation according to evaluation results, counting the overall performance of the fault sequencing characterization on the dataset to obtain the adaptability degree score of the fault sequencing characterization, finally selecting a parent population according to the adaptability degree score of each fault sequencing characterization individual in the population, generating a child population by crossing, copying and mutation operations according to preset crossing rate, copying and mutation operation, continuously generating the child population by the same method until the method reaches a termination condition, and taking the fault characterization with high adaptability degree in the last generation of the population as an output fault characterization individual with the final generation of the complete adaptability degree evaluation as the fault sequencing result for the fault separation of the multi-fault program. The invention finally forms a complete software abnormal behavior modeling method based on sequencing similarity.
The software abnormal behavior modeling method based on sequencing similarity provides a fault characterization function generating method which realizes automation by means of a genetic programming algorithm and comprises a plurality of interfaces capable of performing manual intervention to improve flexibility, and solves the problem that no fault sequencing characterization automatic generating method oriented to fault separation is available at present. Based on the automatic generation method of the fault characterization function, the invention solves the problem that the mapping relation from the fault sequencing characterization to the fault test case characterization capability cannot be established by using a genetic programming algorithm, realizes the automation of the generation of the fault characterization function, and improves the adaptability of the generated fault characterization function to the fault separation application of multiple fault programs of different types. Finally, the invention also adds a plurality of interfaces which can be adjusted manually in the process of the method, so that the method can meet the requirements of various types. The invention finally forms a complete software abnormal behavior modeling method based on sequencing similarity, and a user can realize efficient automatic fault sequencing characterization generation by using the method and is used for fault separation work in actual production.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In particular, the method according to the technical solution of the present invention may be implemented by those skilled in the art using computer software technology to implement an automatic operation flow, and a system apparatus for implementing the method, such as a computer readable storage medium storing a corresponding computer program according to the technical solution of the present invention, and a computer device including the operation of the corresponding computer program, should also fall within the protection scope of the present invention.
FIG. 1 is a flow chart of the method of the present invention. The following describes a method for modeling abnormal behaviors of sequencing similarity software according to the technical scheme of the method in the embodiment of the invention with reference to fig. 1 to 4, and the method specifically comprises the following steps:
Step 1, introducing a plurality of multi-fault programs, randomly generating a plurality of fault characterization functions by using a genetic programming algorithm, taking each fault characterization function as each individual in a population, and constructing an initial population by the plurality of fault characterization functions;
Each fault characterization function described in step 1 is specifically defined as follows:
the fault characterization function REF is:
Suspiciousnessk,f,i=REFk(Spectrumf,i)
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
The NCF f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF f,i,NCSi,NUFf,i,NUSi is used as the Spectrum information Spectrum f,i,Suspiciousnessk,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF k;
in one embodiment, the fault-characterization function may be generated in a tree form by a genetic programming algorithm, where one of the generated fault-characterization functions REF is shown in fig. 2, where rectangular nodes represent operators combined with its lower child nodes, and circular nodes represent program spectral values and constants involved in the operation, and the fault-characterization function shown in fig. 2 is:
the initial population is constructed in the step 1:
A plurality of fault characterization functions are randomly generated by using a genetic programming algorithm and are used as an initial population together:
Wherein, the Population p represents the p generation Population, the initial Population constructed is Population 1, the pop is the Population individual number or Population scale, Representing a kth fault characterization function in the p-th generation population;
In one embodiment, since a population of 160 individuals of the fault-characterization function can generate individuals with high effectiveness for fault isolation under the premise of controllable population evolution cost, the population size can be designated as 160, i.e., pop=160 is set.
Step 2, generating a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, clustering and dividing the sequencing list characterization of all failure test cases generated by each multi-failure program by each failure characterization function, obtaining a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, evaluating the validity score of each external index corresponding to the classification result generated by each multi-failure program by each failure characterization function according to a plurality of external indexes, merging the validity scores of a plurality of external indexes corresponding to the classification result generated by each multi-failure program by each failure characterization function, obtaining an adaptation score when each failure characterization function is applied to failure separation of each multi-failure program, and obtaining the adaptation score of each individual of the initial population according to the adaptation score when each failure characterization function is applied to failure separation of a plurality of multi-failure programs;
Step 2, generating an ordered list representation of each failure test case generated by each multi-failure program by combining each failure representation function in the initial population, wherein the ordered list representation is specifically as follows:
each multi-fault program comprises a plurality of failure test cases and a plurality of success test cases;
the plurality of failure test cases are specifically defined as follows:
{Ftc1,Ftc2,Ftc3,...,Ftcfail}
Wherein fail is the number of failed test cases, ftc f represents the f failed test cases in each multi-fault program, and the value of f is between 1 and fail;
The plurality of successful test cases are specifically defined as follows:
{Stc1,Stc2,Stc3,...,Stcsuccess};
Wherein success is the number of successful test cases, stc s represents the s-th successful test case in each multi-fault program, and the value of s is between 1 and success;
In one embodiment, if a large number of successful test cases irrelevant to the fault module exist in the test suite, the number of the successful test cases can be reduced by screening the successful test cases, and the adaptability evaluation efficiency in the population evolution process is improved.
Collecting coverage information of each failed test case and each successful test case at each program statement in each multi-fault program;
calculating a spectrum value of each failure test case at each corresponding program statement in each multi-fault program by using each failure test case and coverage information of each successful test case at each program statement in each multi-fault program;
The Spectrum value Spectrum f,i collected by the f-th failure test case of each multi-failure program at the i-th program statement is:
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
i has a value between 1 and K;
The ith fault characterization function in the p-th generation population is used for calculating the suspicious degree of the ith program statement calculated by each multi-fault program by using the f-th fault test case in combination with the spectrum value collected by the f-th fault test case of each multi-fault program at the ith program statement, wherein the suspicious degree of the ith program statement calculated by the f-th fault test case is as follows:
The value of p is between 1 and end, end is algebra of the last generation population, the value of K is between 1 and pop, the value of f is between 1 and fail, and the value of i is between 1 and K;
according to the suspicion calculation method, the suspicion calculated by each multi-fault program by using the kth fault characterization function and the f failure test case in the p generation population is characterized in that:
Rep_Suspiciousnessp,k,f
={Suspiciousnessp,k,f,1,Suspiciousnessp,k,f,2,...,Suspiciousnessp,k,f,is}
wherein is the number of program statements of each multi-fault program;
Sequencing program sentences according to the suspicious degree of each program sentence in the suspicious degree characterization calculated by using a kth fault characterization function and an f-th failure test case in the p-th generation population of each multi-fault program, and obtaining a sequencing list generated for the f-th failure test case of each multi-fault program by using the kth fault characterization function in the p-th generation population, wherein the sequencing list is characterized by:
Rep_Rankingp,k,f={Indexp,k,f,1,Indexp,k,f,2,...,Indexp,k,f,is}
Index p,k,f,rank represents the actual Index position of the rank-bit program statement in each multi-fault program when the suspicious degree is sorted in descending order, which is calculated according to the corresponding k-th fault characterization function and f-th failure test case in the p-th generation population of the program statement;
step 2, clustering and dividing the ordered list characterization of all the failure test cases generated by each multi-fault program by each fault characterization function to obtain ordered list characterization classification results of the plurality of failure test cases generated by each multi-fault program by each fault characterization function, wherein the ordered list characterization classification results are as follows;
Resultp,k={numOfClustersp,k,Clustersp,k}
Wherein Result p,k is the sorted list characterization classification Result of the kth fault characterization function in the p-th generation population applied to the multiple failure test cases generated by each multi-fault program, numOfClusters p,k is the cluster number predicted by using the kth fault characterization function in the p-th generation population, and when the cluster number is not equal to the actual fault number of each multi-fault program, the subsequent evaluation step of the current fault characterization function on the current multi-fault program is skipped, and the fitness score of the current fault characterization function on the corresponding multi-fault version is set to 0;
Clusters p,k is a cluster number allocated to each failure test case of each multi-failure program after clustering by using the kth failure characterization function in the p generation population, and has
Clustersp,k={Groupp,k,1,Groupp,k,2,...,Groupp,k,fail}
Wherein Group p,k,f is the number of the corresponding class of the ordered list representation generated for the f-th failure test case of each multi-failure program by using the k-th failure representation function in the p-th generation population;
and step 2, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-fault procedure by each fault characterization function according to a plurality of external indexes, wherein the effectiveness score is as follows:
Score_PRp,k=PR(Clustersp,k,Oracle)
Score_RRp,k=RR(Clustersp,k,Oracle)
Score_FMIp,k=FMI(Clustersp,k,Oracle)
Score_JCp,k=JC(Clustersp,k,Oracle)
The Oracle is the real correspondence between each failure test case and a failure source in each multi-failure program, score_PR p,k is the validity Score of the classification result corresponding to PR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_RR p,k is the validity Score of the classification result corresponding to RR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_FMI p,k is the validity Score of the classification result corresponding to FMI index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, and score_JC p,k is the validity Score of the classification result corresponding to JC index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program;
PR is the precision rate, RR is the recall rate, FMI is Fowlkes-Mallows index, JC is the Jaccard coefficient, the four external indexes are respectively input by using a cluster number distributed for each failure test case of each multi-failure program after clustering division by using a kth failure characterization function in a p-th generation population and a real corresponding relation between each failure test case and a failure source in each multi-failure program, the effectiveness score of the clustering is output, the score value is within a [0,1] interval, and the larger the value is, the higher the effectiveness of the clustering process is represented;
PR and RR are single-case indexes, and the indexes aim at comparing whether the classification result of each failure test case is consistent with the real classification. For each fault class, the division situation of one failure test case can be divided into four types, including true positive (the failure test case is divided into the class by the fault separation frame, and the failure test case actually belongs to the class), false positive (the failure test case is divided into the class by the fault separation frame, but the failure test case actually does not belong to the class), true negative (the failure test case is not divided into the class by the fault separation frame, and the failure test case actually does not belong to the class), and false negative (the failure test case is not divided into the class by the fault separation frame, but the failure test case actually belongs to the class). Let TP, FP, TN, FN be the sum of the occurrence times of the four cases of true yang, false yang, true yin and false yin respectively, and the calculation formulas of PR and RR are as follows:
FMI and JC are paired indexes, and the indexes aim to compare whether the classification condition of each pair of failed test cases is consistent with the condition under the real classification. For each fault class, the partition consistency of a pair of fault test cases can be divided into four cases, including true identity (both fault test cases are/are not partitioned into the class by the fault isolation framework and are/are not actually partitioned into the class by the fault isolation framework), false identity (both fault test cases are/are not partitioned into the class by the fault isolation framework, but actually one fault test case is in the class, the other is not in the class), false identity (one fault test case is partitioned into the class by the fault isolation framework, and the other is not partitioned into the class, but actually both fault test cases are/are not in the class), true identity (one fault test case is partitioned into the class by the fault isolation framework, and the other is not partitioned into the class, and actually one fault test case is in the class, and the other is not in the class). Let SS, SD, DS, DD be the sum of the occurrence times of the four cases of true identity, false dissimilarity and true dissimilarity, and the calculation formulas of FMI and JC are as follows:
The calculation of four external indexes requires establishing the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, calculating the effectiveness score of all possible corresponding relations for each external index, and taking the highest score as the calculation result of the index;
And step 2, merging the validity scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score when each fault characterization function is applied to fault separation of each multi-fault program, wherein the fitness score is as follows:
When each external index is calculated for the classification result generated by applying each fault characterization function to each multi-fault program, the effectiveness score under the corresponding relation with the highest effectiveness score is required to be taken as the calculation result, the current index is called as the corresponding relation for voting, and the vote of each corresponding relation is:
wherein, the Represents whether Score PR p,k votes for the d-th correspondence,Represents whether Score _ RR p,k votes for the d-th correspondence,Represents whether Score _ FMI p,k votes for the d-th correspondence,Representing whether score_jc p,k is a d-th corresponding relation Vote, vote p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;
In one embodiment, the training dataset includes 2,3,4,5 fault procedures for which there may be 2,6,24,120 correspondences between the failure test case class and the true fault class, respectively, i.e., V takes 2,6,24,120 values, respectively.
The total number of obtained tickets with the most corresponding relation is as follows:
Votep,k,most=Max(Votep,k,1,Votep,k,2,...,Votep,k,V)
Vote p,k,most is the total number of obtained tickets of the corresponding relation of the most obtained tickets of the classification result generated by each multi-fault program in the application of the kth fault characterization function in the p-th generation population in the effectiveness index evaluation process, and is determined by the number of fault sources contained in the multi-fault program;
The method for calculating the fitness score when the kth fault characterization function in the p-th generation population is applied to fault separation of each multi-fault program comprises the following steps:
FitnessScorep,k=TotalMetricsp,k*PenaltyFactorp,k
Wherein FitnessScore p,k is the fitness score of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault procedure;
TotalMetricsp,k=Score_PRp,k+Score_RRp,k+Score_FMIp,k+Score_JCp,k
TotalMetrics p,k is the sum of four external index validity scores of the classification result generated by each multi-fault procedure by applying the kth fault characterization function in the p generation population;
PenaltyFactorp,k=1-0.05*(4-Votep,k,most)
Wherein PenaltyFactor p,k is the fitness penalty factor of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault program, and is determined by Vote p,k,most;
The fitness score calculating method can be adjusted according to the requirement of fault sequencing characterization, if the ideal value of the clustering external indexes is more concerned than whether four indexes can be consistent in the corresponding relation between the clustering cluster and the real fault class, the influence of PenaktyFactor can be weakened, the influence is lower than 0.05, if the authenticity of clustering validity assessment is required to be ensured, the fitness score calculating method is not adopted, and four external index values on the corresponding relation can be added to be taken as the fitness score after the corresponding relation between the clustering cluster and the real fault class is determined.
And step 2, obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:
the individual fitness of each fault-characterization function in the initial population is:
Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,Applying the kth fault characterization function in the p-th generation population to the fitness score of fault separation of the verth multi-fault program;
In one embodiment, since 600 multi-fault procedures may more generally simulate multi-fault procedures of different fault types, different program types, and different numbers of faults, suitable as training data sets generated by the fault ordering characterization, the number of versions may be set to 600, i.e., nv=600.
In one embodiment, the fitness score evaluation flow of the individual fault-characterization function is shown in fig. 4.
Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, creating each child individual by using the parent population according to the crossover rate, the replication rate and the mutation rate, taking all the created child individuals as the child population, and calculating the fitness of each individual in the child population through the step 2;
step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, wherein the parent population is:
fitness from a population according to each individual using roulette algorithm Selecting individuals with different pop-profiles to form a parent population:
Wherein pop is the Population size, the description is the ratio of the next parent Population to the parent Population, the popularizing p represents the p-th generation Population, The fitness score representing the kth fault-characterization function in the p-th generation Population is Roulette which is a roulette algorithm, father _popularization p+1 represents a parent Population selected from the popularization p and used for generating the popularization p+1, the value of p is between 1 and end, and the value of k is between 1 and pop;
In one embodiment, since selecting half of the individuals from the population as parent population can ensure the efficiency of the evolution process and prevent local optima from being generated too quickly, the ratio of parent population to population can be set to be 0.5, i.e. report=0.5.
The parent population individual selection algorithm is not unique and can be flexibly adjusted to algorithms other than Roulette according to the requirement.
Step 3, creating offspring individuals by using the parent population according to the crossover rate, the replication rate and the mutation rate, wherein the offspring individuals are as follows:
creating each child individual using parent Population Father _population p+1 according to crossover rate, replication rate, and mutation rate
Determining each sub-generation individual according to the crossover rate, the replication rate and the mutation rateThe creation mode of (a):
Wherein c, o and m are crossover, replication and mutation rates, and there are
c+o+m=1
0≤c,o,m≤1
Random is a random number in a [0,1] interval, method is a generation mode of current offspring individuals, crossover represents generation through cross, copy represents generation through Copy, mutate represents generation through variation, and an individual creation mode is determined according to the value of random and the size of cross rate, replication rate and variation rate;
In one embodiment, the cross rate, replication rate and mutation rate are set to 0.7,0.2 and 0.1 respectively, so that population stability can be guaranteed to have stronger evolution of the failure test case characterization generating capability direction, and the capability of jumping out of local optimum is provided, and therefore the cross rate, replication rate and mutation rate can be set to 0.7,0.2 and 0.1, namely c=0.7, o=0.2 and m=0.1.
The setting of the crossover rate, the replication rate and the mutation rate can be adjusted according to actual conditions on the basis of meeting the limiting conditions, if the population cannot continuously improve the overall characterization capacity along with the evolution process, the replication rate can be properly improved, excellent individuals of the previous generation population are reserved, if the population evolves to be locally optimal, the replication rate can be properly reduced and the mutation rate can be improved without change of the optimal individuals of the continuous multiple generation population, and the breakthrough of the local optimal is realized by the mutated excellent individuals.
In one embodiment, as the replication of the population optimal individuals easily causes the population to fall into local optimization, a mutation rate of 0.5 can be additionally added under the condition that the offspring population individuals are confirmed to be generated by replicating the parent population optimal individuals, so that mutation operation is applied to the offspring population individuals before replicating the optimal individuals, and the capability of the population evolution to break through the local optimization is improved.
The creation of offspring individuals is:
wherein, the The method is characterized in that a kth fault characterization function of a child Population of a p-th generation Population is realized, the Create_ Crossover is a crossover algorithm, the REF r1,REFr2 is two different individuals randomly selected from a parent Population Father _population p+1, the Create_copy and the Create_ Mutate are replication and mutation algorithms, and the REF r is one individual randomly selected from a parent Population Father _population p+1;
Each individual in the population possesses, in addition to its own attribute as a fault-characterizing function, an age attribute:
wherein, the For the age of the kth individual in the p-th generation population,For all of the individuals in the initial population,For newly created individuals in a offspring population, method isCorresponding creation means, REF r is created by a replication algorithmRandomly selecting individuals for replication from the parent population, age representing Age;
when an individual's age reaches 3, it will not be selectable as a parent Population, defining Survive _population p as a surviving Population consisting of all individuals less than 3 in the p-th generation Population;
Namely, the selection method of the parent Population Father _population p+1 of the p-th generation Population is modified as follows:
wherein the kth fault characterization function of the p-th generation population Is the individual in Survive _population p if and only if
When generating offspring individualsStopping generating when the number reaches the Population scale pop, and taking the pop individuals as next generation Population p+1;
step 3, evaluating fitness of the offspring population by using the method in step 2 and continuing to evolve the next generation population until the termination condition is reached:
Before a user-defined termination condition SC (such as iterated fixed times or obtained fault representation functions meeting requirements) is met, continuously using the method in the step 2 to evaluate the fitness score of each fault representation function in the current p-th generation Population position p, and using the method in the step 3 to generate a child Population position p+1;
The selection of the termination condition can be set as evolution fixed algebra if the time cost of population evolution is low, and can be flexibly determined by a user according to the evaluation condition of each generation of fitness if each generation of population evolution consumes more time.
Step 4, repeatedly executing the step 2 and the step 3 until the termination condition appointed by the user is reached, and taking the fault characterization function with the highest individual fitness in the last generation population reaching the termination condition as a fault classification model;
And step 4, taking a fault characterization function with the highest individual fitness in the final generation population reaching the termination condition as a fault classification model, wherein the fault classification model comprises the following steps:
the Population end completes fitness evaluation for the last generation and reaches the termination condition SC Population, For the highest fitness score in the placement end The characterization function can be used for generating the fault characterization of the real multi-fault program and separating the fault;
And 5, generating a sequencing list representation for each failure test case in a multi-fault program needing fault division by using a fault classification model, calculating the distance between each pair of sequencing list representations, and clustering the failure test cases by combining the distance between each pair of sequencing list representations to realize modeling of software abnormal behaviors and division of a plurality of faults.
In the step 5, in the multi-fault procedure requiring fault division, generating an ordered list for each failed test case by using a fault classification model is characterized in that:
Using the kth fault-characterization function with highest fitness score, i.e. the fault classification model, selected in step 4 from the Population end of the last generation for which fitness evaluation is completed and termination conditions SC are reached Generating an ordered list for each failed test case of the target multi-fault program according to the method in step 2 characterizes Rep_ranking end,k,f,Rep_Rankingend,k,f as using a fault classification modelThe method comprises the steps of representing an ordered list generated for an f-th failure test case of a target multi-fault program;
step 5, calculating the distance between each pair of ordered list characterization is as follows:
the preselected Distance Metric function distance_metric is:
Distancefo,ft=Distance_Metric(Rep_Rankingend,k,fo,Rep_Rankingend,k,ft)
wherein, the two failure test cases fo and ft of the Rep_ranking end,k,fo,Rep_Rankingend,k,fo as the target multi-failure program are based on the failure separation model The generated ordered list is characterized in that Distance fo,ft is the Distance between failed test cases fo and ft obtained by calculation by using a Distance measurement function distance_metric;
In one embodiment, to more accurately measure the similarity between ordered list representations, a distance measurement function such as Jaccard distance, as proposed for Ranking, may be used.
Step 5, clustering the failure test cases into clusters by using the calculated distance;
The pre-selected clustering strategy includes:
numOfClustersend,k
=Predict(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
And:
Clustersend,k=Cluster(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
Wherein Distance fo,ft is the Distance between the failure test cases fo and ft calculated by using a Distance measurement function distance_metric, prediction is a function of the number of predicted clusters, numOfClusters end,k is the number of clusters predicted by using a fault classification model, cluster is a clustering algorithm, clusters end,k is the Cluster number allocated to each failure test case of the target multi-fault program after clustering by using a fault classification model, and there are
Clustersend,k={Groupend,k,1,Groupend,k,2,...,Groupend,k,fail}
Wherein Group end,k,f represents the number of the corresponding class for the ordered list generated by using the fault classification model for the f-th failure test case of the target multi-fault program;
The prediction and Cluster both take as input the distance between every two failed test cases fo and ft in the target multi-failure program.
Particular embodiments of the present invention also provide a computer readable medium.
The computer readable medium is a server workstation;
The server workstation stores a computer program executed by the electronic device, and when the computer program runs on the electronic device, the electronic device is caused to execute the steps of the ordering similarity software abnormal behavior modeling method in the embodiment of the invention.
It should be understood that parts of the specification not specifically set forth herein are all prior art.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, but rather to limit the scope of the claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.