[go: up one dir, main page]

CN115687117B - A method for modeling abnormal behavior of sorting similarity software and computer-readable medium - Google Patents

A method for modeling abnormal behavior of sorting similarity software and computer-readable medium Download PDF

Info

Publication number
CN115687117B
CN115687117B CN202211345882.6A CN202211345882A CN115687117B CN 115687117 B CN115687117 B CN 115687117B CN 202211345882 A CN202211345882 A CN 202211345882A CN 115687117 B CN115687117 B CN 115687117B
Authority
CN
China
Prior art keywords
fault
population
program
failure
characterization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211345882.6A
Other languages
Chinese (zh)
Other versions
CN115687117A (en
Inventor
谢晓园
宋壹
张熙灏
刘泉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202211345882.6A priority Critical patent/CN115687117B/en
Publication of CN115687117A publication Critical patent/CN115687117A/en
Application granted granted Critical
Publication of CN115687117B publication Critical patent/CN115687117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

本发明提供了一种排序相似性软件异常行为建模方法及计算机可读介质。本发明方法使多故障程序数据集和多项外部指标评估故障表征函数应用于故障分离过程的有效性,依据评估结果为故障排序表征计算适应度得分,基于适应度得分,通过基因编程算法的交叉、复制和变异操作演化故障表征函数种群直到达成终止条件,将演化得到具有最高适应度的故障排序表征个体作为故障分类模型,用于真实多故障程序的故障分离。本发明提供了解决了目前暂无可用面向故障分离的故障排序表征自动化生成方法的问题、无法建立故障排序表征到失败测试用例表征能力的映射关系的问题,且利用本发明方法可实现高效的自动化故障排序表征生成,并用于实际生产中的故障分离工作。

The present invention provides a method for modeling abnormal behavior of software based on sorting similarity and a computer-readable medium. The method of the present invention uses a multi-fault program data set and multiple external indicators to evaluate the effectiveness of the fault characterization function in the fault separation process, calculates a fitness score for the fault sorting characterization based on the evaluation result, and evolves the fault characterization function population based on the fitness score through the crossover, replication and mutation operations of the genetic programming algorithm until the termination condition is reached, and the fault sorting characterization individual with the highest fitness obtained by evolution is used as a fault classification model for fault separation of real multi-fault programs. The present invention provides a solution to the problem that there is currently no available automatic generation method for fault sorting characterization for fault separation, and the problem that a mapping relationship between fault sorting characterization and failed test case characterization capability cannot be established, and the method of the present invention can realize efficient automatic generation of fault sorting characterization and be used for fault separation in actual production.

Description

Sequencing similarity software abnormal behavior modeling method and computer readable medium
Technical Field
The invention relates to the technical field of computers, in particular to a sequencing similarity software abnormal behavior modeling method and a computer readable medium.
Background
The software defect positioning technology aims at finding out the source causing the abnormal behavior of the software, and has important significance for guaranteeing and improving the quality of the software and enhancing the effectiveness of the software repair technology. In the software testing process, the fact that the actual output and the expected output of the test case are different indicates that at least one defect is hidden in the software, and a developer must repair the defect rapidly to restore the normal function of the software. However, the premise of repairing the defect is to locate the defect, and only an internal root causing abnormal behavior of software execution can be accurately found, so that an effective basis can be provided for the subsequent repairing process. For this reason, many software defect localization methods, such as localization methods based on program mutation technology, localization methods based on program spectrum technology, etc., are proposed successively. Although these methods have all proven to have good defect localization effects, they are almost all based on a single fault assumption, i.e. assuming that an erroneous procedure contains only one defect. With the increasing size of software and the increasing complexity of software systems, an erroneous program is almost impossible to contain only one defect. In other words, the multi-fault environment is the normal state faced by the software debugging task in the real environment.
The simultaneous existence of multiple faults can seriously reduce the effectiveness of the traditional single fault location method, so that the multi-fault software fault location technology becomes one of the hot spots in the current fault location research field. The technology firstly utilizes a risk assessment function for estimating the suspicious degree of a program element in single fault location to generate an ordered list representation for each failure test case, and establishes a connection between the failure test case and a software fault, so that the failure test cases are divided according to fault sources to isolate the failure test cases caused by different faults, the mutual influence among the software faults is weakened, then a single fault location method is applied to each failure test case in parallel, and the multi-fault location problem is disassembled into a plurality of single fault location problems. The effectiveness of the fault division process determines the accuracy of parallel single fault positioning, and the risk assessment function used has a great influence on the effectiveness of fault division.
However, the risk assessment functions used in fault classification of the existing multi-fault localization technology are designed for single-fault localization application scenarios. The factors to be considered in the multi-fault isolation and single-fault localization are not exactly the same, so the optimal risk assessment function in the single-fault scenario does not necessarily have good multi-fault isolation performance. Intuitively, if the risk assessment function has stronger fault feature extraction capability, the generated ordered list can more accurately simulate the failure test cases, so that a better parallel debugging effect is obtained. Therefore, it is very important to use a risk assessment function that can correctly represent a fault in a multi-fault scenario.
According to the purpose of use, the fault characterization function (risk assessment function) can be designed manually or by using an automatic method with the purpose of generating the fault test case characterization which has strong characterization capability and is suitable for the partitioning algorithm as a target, so that the effectiveness of the fault partitioning process is improved, and the accuracy of multi-fault positioning is further improved.
However, the existing method for measuring the characterization capability of the failed test cases is still lacking, the characterization of the failed test cases can only be indirectly evaluated through the effectiveness of the fault division process, on the other hand, after the used risk evaluation function is determined, the execution information of each failed test case and all successful test cases is combined, the suspicious degree is calculated for all program elements, the suspicious degree is ranked according to the calculated suspicious degree, and the finally obtained ranking list can be used as the characterization of the failed test cases. The complex token generation mode and the indirect evaluation method lead to difficulty in determining the mapping relation between the fault token function and the fault test case token capacity, so that the fault token function cannot be trained by using methods such as an artificial neural network and the like. Meanwhile, if a method of manually designing the fault characterization function is adopted, the obtained fault characterization function has far less characterization capability than ideal because the design can only be performed by human intuition, and the design efficiency is far less than that of automatic generation. How to design a reliable fault-oriented fault-dividing fault-characterization function generation method is an important problem.
Disclosure of Invention
In order to solve the technical problems, the invention provides a sequencing similarity software abnormal behavior modeling method and a computer readable medium.
The invention provides a software abnormal behavior modeling method based on sequencing similarity, which comprises the following specific steps:
Step 1, introducing a plurality of multi-fault programs, randomly generating a plurality of fault characterization functions by using a genetic programming algorithm, taking each fault characterization function as each individual in a population, and constructing an initial population by the plurality of fault characterization functions;
Step 2, generating a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, clustering and dividing the sequencing list characterization of all failure test cases generated by each multi-failure program by each failure characterization function, obtaining a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, evaluating the validity score of each external index corresponding to the classification result generated by each multi-failure program by each failure characterization function according to a plurality of external indexes, merging the validity scores of a plurality of external indexes corresponding to the classification result generated by each multi-failure program by each failure characterization function, obtaining an adaptation score when each failure characterization function is applied to failure separation of each multi-failure program, and obtaining the adaptation score of each individual of the initial population according to the adaptation score when each failure characterization function is applied to failure separation of a plurality of multi-failure programs;
Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, creating each child individual by using the parent population according to the crossover rate, the replication rate and the mutation rate, taking all the created child individuals as the child population, and calculating the fitness of each individual in the child population through the step 2;
step 4, repeatedly executing the step 2 and the step 3 until the termination condition appointed by the user is reached, and taking the fault characterization function with the highest individual fitness in the last generation population reaching the termination condition as a fault classification model;
And 5, generating a sequencing list representation for each failure test case in a multi-fault program needing fault division by using a fault classification model, calculating the distance between each pair of sequencing list representations, and clustering the failure test cases by combining the distance between each pair of sequencing list representations to realize modeling of software abnormal behaviors and division of a plurality of faults.
Preferably, each fault-characterization function described in step1 is specifically defined as follows:
the fault characterization function REF is:
Suspiciousnessk,f,i=REFk(Spectrumf,i)
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
The NCF f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF f,i,NCSi,NUFf,i,NUSi is used as the Spectrum information Spectrum f,i,Sspiciousnessk,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF k;
the initial population is constructed in the step 1:
A plurality of fault characterization functions are randomly generated by using a genetic programming algorithm and are used as an initial population together:
Wherein, the Population p represents the p generation Population, the initial Population constructed is Population 1, the pop is the Population individual number or Population scale, Representing a kth fault characterization function in the p-th generation population;
Preferably, in step 2, each fault characterization function generated in combination with each fault characterization function in the initial population is applied to the ordered list characterization of each failure test case generated by each multi-fault program, and specifically as follows:
each multi-fault program comprises a plurality of failure test cases and a plurality of success test cases;
the plurality of failure test cases are specifically defined as follows:
{Ftc1,Ftc2,Ftc3,...,Ftcfail}
Wherein fail is the number of failed test cases, ftc f represents the f failed test cases in each multi-fault program, and the value of f is between 1 and fail;
The plurality of successful test cases are specifically defined as follows:
{Stc1,Stc2,Stc3,...,Stcsuccess};
Wherein success is the number of successful test cases, stc s represents the s-th successful test case in each multi-fault program, and the value of s is between 1 and success;
collecting coverage information of each failed test case and each successful test case at each program statement in each multi-fault program;
calculating a spectrum value of each failure test case at each corresponding program statement in each multi-fault program by using each failure test case and coverage information of each successful test case at each program statement in each multi-fault program;
The Spectrum value Spectrum f,i collected by the f-th failure test case of each multi-failure program at the i-th program statement is:
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
i has a value between 1 and K;
The ith fault characterization function in the p-th generation population is used for calculating the suspicious degree of the ith program statement calculated by each multi-fault program by using the f-th fault test case in combination with the spectrum value collected by the f-th fault test case of each multi-fault program at the ith program statement, wherein the suspicious degree of the ith program statement calculated by the f-th fault test case is as follows:
The value of p is between 1 and end, end is algebra of the last generation population, the value of K is between 1 and pop, the value of f is between 1 and fail, and the value of i is between 1 and K;
according to the suspicion calculation method, the suspicion calculated by each multi-fault program by using the kth fault characterization function and the f failure test case in the p generation population is characterized in that:
Rep_Suspiciousnessp,k,f
={Suspiciousnessp,k,f,1,Suspiciousnessp,k,f,2,...,Suspiciousnessp,k,f,is}
wherein is the number of program statements of each multi-fault program;
Sequencing program sentences according to the suspicious degree of each program sentence in the suspicious degree characterization calculated by using a kth fault characterization function and an f-th failure test case in the p-th generation population of each multi-fault program, and obtaining a sequencing list generated for the f-th failure test case of each multi-fault program by using the kth fault characterization function in the p-th generation population, wherein the sequencing list is characterized by:
Rep_Rankingp,k,f={Indexp,k,f,1,Indexp,k,f,2,...,Indexp,k,f,is}
Index p,k,f,rank represents the actual Index position of the rank-bit program statement in each multi-fault program when the suspicious degree is sorted in descending order, which is calculated according to the corresponding k-th fault characterization function and f-th failure test case in the p-th generation population of the program statement;
step 2, clustering and dividing the ordered list characterization of all the failure test cases generated by each multi-fault program by each fault characterization function to obtain ordered list characterization classification results of the plurality of failure test cases generated by each multi-fault program by each fault characterization function, wherein the ordered list characterization classification results are as follows;
Resultp,k={numOfClustersp,k,Clustersp,k}
Wherein Result p,k is the sorted list characterization classification Result of the kth fault characterization function in the p-th generation population applied to the multiple failure test cases generated by each multi-fault program, numOfClusters p,k is the cluster number predicted by using the kth fault characterization function in the p-th generation population, and when the cluster number is not equal to the actual fault number of each multi-fault program, the subsequent evaluation step of the current fault characterization function on the current multi-fault program is skipped, and the fitness score of the current fault characterization function on the corresponding multi-fault version is set to 0;
Clusters p,k is a cluster number allocated to each failure test case of each multi-failure program after clustering by using the kth failure characterization function in the p generation population, and has
Clustersp,k={Groupp,k,1,Groupp,k,2,...,Groupp,k,fail}
Wherein Group p,k,f is the number of the corresponding class of the ordered list representation generated for the f-th failure test case of each multi-failure program by using the k-th failure representation function in the p-th generation population;
and step 2, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-fault procedure by each fault characterization function according to a plurality of external indexes, wherein the effectiveness score is as follows:
Score_PRp,k=PR(Clustersp,k,Oracle)
Score_RRp,k=RR(Clustersp,k,Oracle)
Score_FMIp,k=FMI(Clustersp,k,Oracle)
Score_JCp,k=JC(Clustersp,k,Oracle)
The Oracle is the real correspondence between each failure test case and a failure source in each multi-failure program, score_PR p,k is the validity Score of the classification result corresponding to PR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_RR p,k is the validity Score of the classification result corresponding to RR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_FMI p,k is the validity Score of the classification result corresponding to FMI index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, and score_JC p,k is the validity Score of the classification result corresponding to JC index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program;
PR is the precision rate, RR is the recall rate, FMI is Fowlkes-Mallows index, JC is the Jaccard coefficient, the four external indexes are respectively input by using a cluster number distributed for each failure test case of each multi-failure program after clustering division by using a kth failure characterization function in a p-th generation population and a real corresponding relation between each failure test case and a failure source in each multi-failure program, the effectiveness score of the clustering is output, the score value is within a [0,1] interval, and the larger the value is, the higher the effectiveness of the clustering process is represented;
The calculation of four external indexes requires establishing the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, calculating the effectiveness score of all possible corresponding relations for each external index, and taking the highest score as the calculation result of the index;
And step 2, merging the validity scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score when each fault characterization function is applied to fault separation of each multi-fault program, wherein the fitness score is as follows:
When each external index is calculated for the classification result generated by applying each fault characterization function to each multi-fault program, the effectiveness score under the corresponding relation with the highest effectiveness score is required to be taken as the calculation result, the current index is called as the corresponding relation for voting, and the vote of each corresponding relation is:
wherein, the Represents whether Score PR p,k votes for the d-th correspondence,Represents whether Score _ RR p,k votes for the d-th correspondence,Represents whether Score _ FMI p,k votes for the d-th correspondence,Representing whether score_jc p,k is a d-th corresponding relation Vote, vote p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;
The total number of obtained tickets with the most corresponding relation is as follows:
Votep,k,most=Max(Votep,k,1,Votep,k,2,...,Votep,k,V)
Vote p,k,most is the total number of obtained tickets of the corresponding relation of the most obtained tickets of the classification result generated by each multi-fault program in the application of the kth fault characterization function in the p-th generation population in the effectiveness index evaluation process, and is determined by the number of fault sources contained in the multi-fault program;
The method for calculating the fitness score when the kth fault characterization function in the p-th generation population is applied to fault separation of each multi-fault program comprises the following steps:
FitnessScorep,k=TotalMetricsp,k*PenaltyFactorp,k
Wherein FitnessScore p,k is the fitness score of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault procedure;
TotalMetricsp,k=Score_PRp,k+Score_RRp,k+Score_FMIp,k+Score_JCp,k
TotalMetrics p,k is the sum of four external index validity scores of the classification result generated by each multi-fault procedure by applying the kth fault characterization function in the p generation population;
PenaltyFactorp,k=1-0.05*(4-Votep,k,most)
Wherein PenaltyFactor p,k is the fitness penalty factor of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault program, and is determined by Vote p,k,most;
and step 2, obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:
the individual fitness of each fault-characterization function in the initial population is:
Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,Applying the kth fault characterization function in the p-th generation population to the fitness score of fault separation of the verth multi-fault program;
preferably, in step 3, the selecting a parent population from the population according to the fitness of each individual in the initial population is:
fitness from a population according to each individual using roulette algorithm Selecting individuals with different pop-profiles to form a parent population:
Wherein pop is the Population size, the description is the ratio of the next parent Population to the parent Population, the popularizing p represents the p-th generation Population, The fitness score representing the kth fault-characterization function in the p-th generation Population is Roulette which is a roulette algorithm, father _popularization p+1 represents a parent Population selected from the popularization p and used for generating the popularization p+1, the value of p is between 1 and end, and the value of k is between 1 and pop;
step 3, creating offspring individuals by using the parent population according to the crossover rate, the replication rate and the mutation rate, wherein the offspring individuals are as follows:
creating each child individual using parent Population Father _population p+1 according to crossover rate, replication rate, and mutation rate
Determining each sub-generation individual according to the crossover rate, the replication rate and the mutation rateThe creation mode of (a):
Wherein c, o and m are crossover, replication and mutation rates, and there are
c+o+m=1
0≤c,o,m≤1
Random is a random number in a [0,1] interval, method is a generation mode of current offspring individuals, crossover represents generation through cross, copy represents generation through Copy, mutate represents generation through variation, and an individual creation mode is determined according to the value of random and the size of cross rate, replication rate and variation rate;
The creation of offspring individuals is:
wherein, the The method is characterized in that a kth fault characterization function of a child Population of a p-th generation Population is realized, the Create_ Crossover is a crossover algorithm, the REF r1,REFr2 is two different individuals randomly selected from a parent Population Father _population p+1, the Create_copy and the Create_ Mutate are replication and mutation algorithms, and the REF r is one individual randomly selected from a parent Population Father _population p+1;
Each individual in the population possesses, in addition to its own attribute as a fault-characterizing function, an age attribute:
wherein, the For the age of the kth individual in the p-th generation population,For all of the individuals in the initial population,For newly created individuals in a offspring population, method isCorresponding creation means, REF r is created by a replication algorithmRandomly selecting individuals for replication from the parent population, age representing Age;
when an individual's age reaches 3, it will not be selectable as a parent Population, defining Survive _population p as a surviving Population consisting of all individuals less than 3 in the p-th generation Population;
Namely, the selection method of the parent Population Father _population p+1 of the p-th generation Population is modified as follows:
wherein the kth fault characterization function of the p-th generation population Is the individual in Survive _population p if and only if
When generating offspring individualsStopping generating when the number reaches the Population scale pop, and taking the pop individuals as next generation Population p+1;
step 3, evaluating fitness of the offspring population by using the method in step 2 and continuing to evolve the next generation population until the termination condition is reached:
Before a user-defined termination condition SC (such as iterated fixed times or obtained fault representation functions meeting requirements) is met, continuously using the method in the step 2 to evaluate the fitness score of each fault representation function in the current p-th generation Population position p, and using the method in the step 3 to generate a child Population position p+1;
Preferably, in step 4, the fault characterization function with the highest individual fitness in the last generation population that reaches the termination condition is used as a fault classification model as follows:
the Population end completes fitness evaluation for the last generation and reaches the termination condition SC Population, For the highest fitness score in the placement end The characterization function can be used for generating the fault characterization of the real multi-fault program and separating the fault;
Preferably, in the multi-fault procedure requiring fault division using the fault classification model in step 5, generating the ordered list for each failed test case is characterized as follows:
Using the kth fault-characterization function with highest fitness score, i.e. the fault classification model, selected in step 4 from the Population end of the last generation for which fitness evaluation is completed and termination conditions SC are reached Generating an ordered list for each failed test case of the target multi-fault program according to the method in step 2 characterizes Rep_ranking end,k,f,Rep_Rankingend,k,f as using a fault classification modelThe method comprises the steps of representing an ordered list generated for an f-th failure test case of a target multi-fault program;
step 5, calculating the distance between each pair of ordered list characterization is as follows:
the preselected Distance Metric function distance_metric is:
Distancefo,ft=Distance_Metric(Rep_Rankingend,k,fo,Rep_Rankingend,k,ft)
wherein, the two failure test cases fo and ft of the Rep_ranking end,k,fo,Rep_Rankingend,k,fo as the target multi-failure program are based on the failure separation model The generated ordered list is characterized in that Distance fo,ft is the Distance between failed test cases fo and ft obtained by calculation by using a Distance measurement function distance_metric;
Step 5, clustering the failure test cases into clusters by using the calculated distance;
The pre-selected clustering strategy includes:
numOfClustersend,k
=Predict(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
And:
Clustersend,k=Cluster(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
Wherein Distance fo,ft is the Distance between the failure test cases fo and ft calculated by using a Distance measurement function distance_metric, prediction is a function of the number of predicted clusters, numOfClusters end,k is the number of clusters predicted by using a fault classification model, cluster is a clustering algorithm, clusters end,k is the Cluster number allocated to each failure test case of the target multi-fault program after clustering by using a fault classification model, and there are
Clustersend,k={Groupend,k,1,Groupend,k,2,...,Groupend,k,fail}
Wherein Group end,k,f represents the number of the corresponding class for the ordered list generated by using the fault classification model for the f-th failure test case of the target multi-fault program;
The prediction and Cluster both take as input the distance between every two failed test cases fo and ft in the target multi-failure program.
The invention also provides a computer readable medium storing a computer program for execution by an electronic device, which when run on the electronic device causes the electronic device to execute the steps of the ordering similarity software abnormal behavior modeling method.
The invention aims to provide a method for automatically generating fault sequencing characterization for a fault separation scene, which comprises the steps of firstly generating a random initial fault sequencing characterization population by using a genetic programming algorithm, then applying population individuals to a multi-fault program dataset appointed by a user for fault separation, generating a sequencing list characterization for fault test cases in the multi-fault program by using each fault sequencing characterization, calculating distances among the characterizations, dividing the fault test case characterizations according to fault sources by using a clustering algorithm, evaluating the effectiveness of a dividing process by using a plurality of external indexes, calculating the adaptability degree of the fault sequencing characterization to the multi-fault program fault separation according to evaluation results, counting the overall performance of the fault sequencing characterization on the dataset to obtain the adaptability degree score of the fault sequencing characterization, finally selecting a parent population according to the adaptability degree score of each fault sequencing characterization individual in the population, generating a child population by crossing, copying and mutation operations according to preset crossing rate, copying and mutation operation, continuously generating the child population by the same method until the method reaches a termination condition, and taking the fault characterization with high adaptability degree in the last generation of the population as an output fault characterization individual with the final generation of the complete adaptability degree evaluation as the fault sequencing result for the fault separation of the multi-fault program. The invention finally forms a complete software abnormal behavior modeling method based on sequencing similarity.
The software abnormal behavior modeling method based on sequencing similarity provides a fault characterization function generating method which realizes automation by means of a genetic programming algorithm and comprises a plurality of interfaces capable of performing manual intervention to improve flexibility, and solves the problem that no fault sequencing characterization automatic generating method oriented to fault separation is available at present. Based on the automatic generation method of the fault characterization function, the invention solves the problem that the mapping relation from the fault sequencing characterization to the fault test case characterization capability cannot be established by using a genetic programming algorithm, realizes the automation of the generation of the fault characterization function, and improves the adaptability of the generated fault characterization function to the fault separation application of multiple fault programs of different types. Finally, the invention also adds a plurality of interfaces which can be adjusted manually in the process of the method, so that the method can meet the requirements of various types. The invention finally forms a complete software abnormal behavior modeling method based on sequencing similarity, and a user can realize efficient automatic fault sequencing characterization generation by using the method and is used for fault separation work in actual production.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a tree structure diagram of a fault characterization function generated by a genetic programming algorithm of an embodiment of the present invention;
FIG. 3 is an algorithm schematic of the overall flow of the method of the embodiment of the invention;
FIG. 4 is an algorithm diagram of a process for adaptively evaluating individuals of a fault-characterization function population according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In particular, the method according to the technical solution of the present invention may be implemented by those skilled in the art using computer software technology to implement an automatic operation flow, and a system apparatus for implementing the method, such as a computer readable storage medium storing a corresponding computer program according to the technical solution of the present invention, and a computer device including the operation of the corresponding computer program, should also fall within the protection scope of the present invention.
FIG. 1 is a flow chart of the method of the present invention. The following describes a method for modeling abnormal behaviors of sequencing similarity software according to the technical scheme of the method in the embodiment of the invention with reference to fig. 1 to 4, and the method specifically comprises the following steps:
Step 1, introducing a plurality of multi-fault programs, randomly generating a plurality of fault characterization functions by using a genetic programming algorithm, taking each fault characterization function as each individual in a population, and constructing an initial population by the plurality of fault characterization functions;
Each fault characterization function described in step 1 is specifically defined as follows:
the fault characterization function REF is:
Suspiciousnessk,f,i=REFk(Spectrumf,i)
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
The NCF f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF f,i,NCSi,NUFf,i,NUSi is used as the Spectrum information Spectrum f,i,Suspiciousnessk,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF k;
in one embodiment, the fault-characterization function may be generated in a tree form by a genetic programming algorithm, where one of the generated fault-characterization functions REF is shown in fig. 2, where rectangular nodes represent operators combined with its lower child nodes, and circular nodes represent program spectral values and constants involved in the operation, and the fault-characterization function shown in fig. 2 is:
the initial population is constructed in the step 1:
A plurality of fault characterization functions are randomly generated by using a genetic programming algorithm and are used as an initial population together:
Wherein, the Population p represents the p generation Population, the initial Population constructed is Population 1, the pop is the Population individual number or Population scale, Representing a kth fault characterization function in the p-th generation population;
In one embodiment, since a population of 160 individuals of the fault-characterization function can generate individuals with high effectiveness for fault isolation under the premise of controllable population evolution cost, the population size can be designated as 160, i.e., pop=160 is set.
Step 2, generating a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, clustering and dividing the sequencing list characterization of all failure test cases generated by each multi-failure program by each failure characterization function, obtaining a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, evaluating the validity score of each external index corresponding to the classification result generated by each multi-failure program by each failure characterization function according to a plurality of external indexes, merging the validity scores of a plurality of external indexes corresponding to the classification result generated by each multi-failure program by each failure characterization function, obtaining an adaptation score when each failure characterization function is applied to failure separation of each multi-failure program, and obtaining the adaptation score of each individual of the initial population according to the adaptation score when each failure characterization function is applied to failure separation of a plurality of multi-failure programs;
Step 2, generating an ordered list representation of each failure test case generated by each multi-failure program by combining each failure representation function in the initial population, wherein the ordered list representation is specifically as follows:
each multi-fault program comprises a plurality of failure test cases and a plurality of success test cases;
the plurality of failure test cases are specifically defined as follows:
{Ftc1,Ftc2,Ftc3,...,Ftcfail}
Wherein fail is the number of failed test cases, ftc f represents the f failed test cases in each multi-fault program, and the value of f is between 1 and fail;
The plurality of successful test cases are specifically defined as follows:
{Stc1,Stc2,Stc3,...,Stcsuccess};
Wherein success is the number of successful test cases, stc s represents the s-th successful test case in each multi-fault program, and the value of s is between 1 and success;
In one embodiment, if a large number of successful test cases irrelevant to the fault module exist in the test suite, the number of the successful test cases can be reduced by screening the successful test cases, and the adaptability evaluation efficiency in the population evolution process is improved.
Collecting coverage information of each failed test case and each successful test case at each program statement in each multi-fault program;
calculating a spectrum value of each failure test case at each corresponding program statement in each multi-fault program by using each failure test case and coverage information of each successful test case at each program statement in each multi-fault program;
The Spectrum value Spectrum f,i collected by the f-th failure test case of each multi-failure program at the i-th program statement is:
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
i has a value between 1 and K;
The ith fault characterization function in the p-th generation population is used for calculating the suspicious degree of the ith program statement calculated by each multi-fault program by using the f-th fault test case in combination with the spectrum value collected by the f-th fault test case of each multi-fault program at the ith program statement, wherein the suspicious degree of the ith program statement calculated by the f-th fault test case is as follows:
The value of p is between 1 and end, end is algebra of the last generation population, the value of K is between 1 and pop, the value of f is between 1 and fail, and the value of i is between 1 and K;
according to the suspicion calculation method, the suspicion calculated by each multi-fault program by using the kth fault characterization function and the f failure test case in the p generation population is characterized in that:
Rep_Suspiciousnessp,k,f
={Suspiciousnessp,k,f,1,Suspiciousnessp,k,f,2,...,Suspiciousnessp,k,f,is}
wherein is the number of program statements of each multi-fault program;
Sequencing program sentences according to the suspicious degree of each program sentence in the suspicious degree characterization calculated by using a kth fault characterization function and an f-th failure test case in the p-th generation population of each multi-fault program, and obtaining a sequencing list generated for the f-th failure test case of each multi-fault program by using the kth fault characterization function in the p-th generation population, wherein the sequencing list is characterized by:
Rep_Rankingp,k,f={Indexp,k,f,1,Indexp,k,f,2,...,Indexp,k,f,is}
Index p,k,f,rank represents the actual Index position of the rank-bit program statement in each multi-fault program when the suspicious degree is sorted in descending order, which is calculated according to the corresponding k-th fault characterization function and f-th failure test case in the p-th generation population of the program statement;
step 2, clustering and dividing the ordered list characterization of all the failure test cases generated by each multi-fault program by each fault characterization function to obtain ordered list characterization classification results of the plurality of failure test cases generated by each multi-fault program by each fault characterization function, wherein the ordered list characterization classification results are as follows;
Resultp,k={numOfClustersp,k,Clustersp,k}
Wherein Result p,k is the sorted list characterization classification Result of the kth fault characterization function in the p-th generation population applied to the multiple failure test cases generated by each multi-fault program, numOfClusters p,k is the cluster number predicted by using the kth fault characterization function in the p-th generation population, and when the cluster number is not equal to the actual fault number of each multi-fault program, the subsequent evaluation step of the current fault characterization function on the current multi-fault program is skipped, and the fitness score of the current fault characterization function on the corresponding multi-fault version is set to 0;
Clusters p,k is a cluster number allocated to each failure test case of each multi-failure program after clustering by using the kth failure characterization function in the p generation population, and has
Clustersp,k={Groupp,k,1,Groupp,k,2,...,Groupp,k,fail}
Wherein Group p,k,f is the number of the corresponding class of the ordered list representation generated for the f-th failure test case of each multi-failure program by using the k-th failure representation function in the p-th generation population;
and step 2, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-fault procedure by each fault characterization function according to a plurality of external indexes, wherein the effectiveness score is as follows:
Score_PRp,k=PR(Clustersp,k,Oracle)
Score_RRp,k=RR(Clustersp,k,Oracle)
Score_FMIp,k=FMI(Clustersp,k,Oracle)
Score_JCp,k=JC(Clustersp,k,Oracle)
The Oracle is the real correspondence between each failure test case and a failure source in each multi-failure program, score_PR p,k is the validity Score of the classification result corresponding to PR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_RR p,k is the validity Score of the classification result corresponding to RR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_FMI p,k is the validity Score of the classification result corresponding to FMI index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, and score_JC p,k is the validity Score of the classification result corresponding to JC index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program;
PR is the precision rate, RR is the recall rate, FMI is Fowlkes-Mallows index, JC is the Jaccard coefficient, the four external indexes are respectively input by using a cluster number distributed for each failure test case of each multi-failure program after clustering division by using a kth failure characterization function in a p-th generation population and a real corresponding relation between each failure test case and a failure source in each multi-failure program, the effectiveness score of the clustering is output, the score value is within a [0,1] interval, and the larger the value is, the higher the effectiveness of the clustering process is represented;
PR and RR are single-case indexes, and the indexes aim at comparing whether the classification result of each failure test case is consistent with the real classification. For each fault class, the division situation of one failure test case can be divided into four types, including true positive (the failure test case is divided into the class by the fault separation frame, and the failure test case actually belongs to the class), false positive (the failure test case is divided into the class by the fault separation frame, but the failure test case actually does not belong to the class), true negative (the failure test case is not divided into the class by the fault separation frame, and the failure test case actually does not belong to the class), and false negative (the failure test case is not divided into the class by the fault separation frame, but the failure test case actually belongs to the class). Let TP, FP, TN, FN be the sum of the occurrence times of the four cases of true yang, false yang, true yin and false yin respectively, and the calculation formulas of PR and RR are as follows:
FMI and JC are paired indexes, and the indexes aim to compare whether the classification condition of each pair of failed test cases is consistent with the condition under the real classification. For each fault class, the partition consistency of a pair of fault test cases can be divided into four cases, including true identity (both fault test cases are/are not partitioned into the class by the fault isolation framework and are/are not actually partitioned into the class by the fault isolation framework), false identity (both fault test cases are/are not partitioned into the class by the fault isolation framework, but actually one fault test case is in the class, the other is not in the class), false identity (one fault test case is partitioned into the class by the fault isolation framework, and the other is not partitioned into the class, but actually both fault test cases are/are not in the class), true identity (one fault test case is partitioned into the class by the fault isolation framework, and the other is not partitioned into the class, and actually one fault test case is in the class, and the other is not in the class). Let SS, SD, DS, DD be the sum of the occurrence times of the four cases of true identity, false dissimilarity and true dissimilarity, and the calculation formulas of FMI and JC are as follows:
The calculation of four external indexes requires establishing the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, calculating the effectiveness score of all possible corresponding relations for each external index, and taking the highest score as the calculation result of the index;
And step 2, merging the validity scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score when each fault characterization function is applied to fault separation of each multi-fault program, wherein the fitness score is as follows:
When each external index is calculated for the classification result generated by applying each fault characterization function to each multi-fault program, the effectiveness score under the corresponding relation with the highest effectiveness score is required to be taken as the calculation result, the current index is called as the corresponding relation for voting, and the vote of each corresponding relation is:
wherein, the Represents whether Score PR p,k votes for the d-th correspondence,Represents whether Score _ RR p,k votes for the d-th correspondence,Represents whether Score _ FMI p,k votes for the d-th correspondence,Representing whether score_jc p,k is a d-th corresponding relation Vote, vote p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;
In one embodiment, the training dataset includes 2,3,4,5 fault procedures for which there may be 2,6,24,120 correspondences between the failure test case class and the true fault class, respectively, i.e., V takes 2,6,24,120 values, respectively.
The total number of obtained tickets with the most corresponding relation is as follows:
Votep,k,most=Max(Votep,k,1,Votep,k,2,...,Votep,k,V)
Vote p,k,most is the total number of obtained tickets of the corresponding relation of the most obtained tickets of the classification result generated by each multi-fault program in the application of the kth fault characterization function in the p-th generation population in the effectiveness index evaluation process, and is determined by the number of fault sources contained in the multi-fault program;
The method for calculating the fitness score when the kth fault characterization function in the p-th generation population is applied to fault separation of each multi-fault program comprises the following steps:
FitnessScorep,k=TotalMetricsp,k*PenaltyFactorp,k
Wherein FitnessScore p,k is the fitness score of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault procedure;
TotalMetricsp,k=Score_PRp,k+Score_RRp,k+Score_FMIp,k+Score_JCp,k
TotalMetrics p,k is the sum of four external index validity scores of the classification result generated by each multi-fault procedure by applying the kth fault characterization function in the p generation population;
PenaltyFactorp,k=1-0.05*(4-Votep,k,most)
Wherein PenaltyFactor p,k is the fitness penalty factor of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault program, and is determined by Vote p,k,most;
The fitness score calculating method can be adjusted according to the requirement of fault sequencing characterization, if the ideal value of the clustering external indexes is more concerned than whether four indexes can be consistent in the corresponding relation between the clustering cluster and the real fault class, the influence of PenaktyFactor can be weakened, the influence is lower than 0.05, if the authenticity of clustering validity assessment is required to be ensured, the fitness score calculating method is not adopted, and four external index values on the corresponding relation can be added to be taken as the fitness score after the corresponding relation between the clustering cluster and the real fault class is determined.
And step 2, obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:
the individual fitness of each fault-characterization function in the initial population is:
Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,Applying the kth fault characterization function in the p-th generation population to the fitness score of fault separation of the verth multi-fault program;
In one embodiment, since 600 multi-fault procedures may more generally simulate multi-fault procedures of different fault types, different program types, and different numbers of faults, suitable as training data sets generated by the fault ordering characterization, the number of versions may be set to 600, i.e., nv=600.
In one embodiment, the fitness score evaluation flow of the individual fault-characterization function is shown in fig. 4.
Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, creating each child individual by using the parent population according to the crossover rate, the replication rate and the mutation rate, taking all the created child individuals as the child population, and calculating the fitness of each individual in the child population through the step 2;
step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, wherein the parent population is:
fitness from a population according to each individual using roulette algorithm Selecting individuals with different pop-profiles to form a parent population:
Wherein pop is the Population size, the description is the ratio of the next parent Population to the parent Population, the popularizing p represents the p-th generation Population, The fitness score representing the kth fault-characterization function in the p-th generation Population is Roulette which is a roulette algorithm, father _popularization p+1 represents a parent Population selected from the popularization p and used for generating the popularization p+1, the value of p is between 1 and end, and the value of k is between 1 and pop;
In one embodiment, since selecting half of the individuals from the population as parent population can ensure the efficiency of the evolution process and prevent local optima from being generated too quickly, the ratio of parent population to population can be set to be 0.5, i.e. report=0.5.
The parent population individual selection algorithm is not unique and can be flexibly adjusted to algorithms other than Roulette according to the requirement.
Step 3, creating offspring individuals by using the parent population according to the crossover rate, the replication rate and the mutation rate, wherein the offspring individuals are as follows:
creating each child individual using parent Population Father _population p+1 according to crossover rate, replication rate, and mutation rate
Determining each sub-generation individual according to the crossover rate, the replication rate and the mutation rateThe creation mode of (a):
Wherein c, o and m are crossover, replication and mutation rates, and there are
c+o+m=1
0≤c,o,m≤1
Random is a random number in a [0,1] interval, method is a generation mode of current offspring individuals, crossover represents generation through cross, copy represents generation through Copy, mutate represents generation through variation, and an individual creation mode is determined according to the value of random and the size of cross rate, replication rate and variation rate;
In one embodiment, the cross rate, replication rate and mutation rate are set to 0.7,0.2 and 0.1 respectively, so that population stability can be guaranteed to have stronger evolution of the failure test case characterization generating capability direction, and the capability of jumping out of local optimum is provided, and therefore the cross rate, replication rate and mutation rate can be set to 0.7,0.2 and 0.1, namely c=0.7, o=0.2 and m=0.1.
The setting of the crossover rate, the replication rate and the mutation rate can be adjusted according to actual conditions on the basis of meeting the limiting conditions, if the population cannot continuously improve the overall characterization capacity along with the evolution process, the replication rate can be properly improved, excellent individuals of the previous generation population are reserved, if the population evolves to be locally optimal, the replication rate can be properly reduced and the mutation rate can be improved without change of the optimal individuals of the continuous multiple generation population, and the breakthrough of the local optimal is realized by the mutated excellent individuals.
In one embodiment, as the replication of the population optimal individuals easily causes the population to fall into local optimization, a mutation rate of 0.5 can be additionally added under the condition that the offspring population individuals are confirmed to be generated by replicating the parent population optimal individuals, so that mutation operation is applied to the offspring population individuals before replicating the optimal individuals, and the capability of the population evolution to break through the local optimization is improved.
The creation of offspring individuals is:
wherein, the The method is characterized in that a kth fault characterization function of a child Population of a p-th generation Population is realized, the Create_ Crossover is a crossover algorithm, the REF r1,REFr2 is two different individuals randomly selected from a parent Population Father _population p+1, the Create_copy and the Create_ Mutate are replication and mutation algorithms, and the REF r is one individual randomly selected from a parent Population Father _population p+1;
Each individual in the population possesses, in addition to its own attribute as a fault-characterizing function, an age attribute:
wherein, the For the age of the kth individual in the p-th generation population,For all of the individuals in the initial population,For newly created individuals in a offspring population, method isCorresponding creation means, REF r is created by a replication algorithmRandomly selecting individuals for replication from the parent population, age representing Age;
when an individual's age reaches 3, it will not be selectable as a parent Population, defining Survive _population p as a surviving Population consisting of all individuals less than 3 in the p-th generation Population;
Namely, the selection method of the parent Population Father _population p+1 of the p-th generation Population is modified as follows:
wherein the kth fault characterization function of the p-th generation population Is the individual in Survive _population p if and only if
When generating offspring individualsStopping generating when the number reaches the Population scale pop, and taking the pop individuals as next generation Population p+1;
step 3, evaluating fitness of the offspring population by using the method in step 2 and continuing to evolve the next generation population until the termination condition is reached:
Before a user-defined termination condition SC (such as iterated fixed times or obtained fault representation functions meeting requirements) is met, continuously using the method in the step 2 to evaluate the fitness score of each fault representation function in the current p-th generation Population position p, and using the method in the step 3 to generate a child Population position p+1;
The selection of the termination condition can be set as evolution fixed algebra if the time cost of population evolution is low, and can be flexibly determined by a user according to the evaluation condition of each generation of fitness if each generation of population evolution consumes more time.
Step 4, repeatedly executing the step 2 and the step 3 until the termination condition appointed by the user is reached, and taking the fault characterization function with the highest individual fitness in the last generation population reaching the termination condition as a fault classification model;
And step 4, taking a fault characterization function with the highest individual fitness in the final generation population reaching the termination condition as a fault classification model, wherein the fault classification model comprises the following steps:
the Population end completes fitness evaluation for the last generation and reaches the termination condition SC Population, For the highest fitness score in the placement end The characterization function can be used for generating the fault characterization of the real multi-fault program and separating the fault;
And 5, generating a sequencing list representation for each failure test case in a multi-fault program needing fault division by using a fault classification model, calculating the distance between each pair of sequencing list representations, and clustering the failure test cases by combining the distance between each pair of sequencing list representations to realize modeling of software abnormal behaviors and division of a plurality of faults.
In the step 5, in the multi-fault procedure requiring fault division, generating an ordered list for each failed test case by using a fault classification model is characterized in that:
Using the kth fault-characterization function with highest fitness score, i.e. the fault classification model, selected in step 4 from the Population end of the last generation for which fitness evaluation is completed and termination conditions SC are reached Generating an ordered list for each failed test case of the target multi-fault program according to the method in step 2 characterizes Rep_ranking end,k,f,Rep_Rankingend,k,f as using a fault classification modelThe method comprises the steps of representing an ordered list generated for an f-th failure test case of a target multi-fault program;
step 5, calculating the distance between each pair of ordered list characterization is as follows:
the preselected Distance Metric function distance_metric is:
Distancefo,ft=Distance_Metric(Rep_Rankingend,k,fo,Rep_Rankingend,k,ft)
wherein, the two failure test cases fo and ft of the Rep_ranking end,k,fo,Rep_Rankingend,k,fo as the target multi-failure program are based on the failure separation model The generated ordered list is characterized in that Distance fo,ft is the Distance between failed test cases fo and ft obtained by calculation by using a Distance measurement function distance_metric;
In one embodiment, to more accurately measure the similarity between ordered list representations, a distance measurement function such as Jaccard distance, as proposed for Ranking, may be used.
Step 5, clustering the failure test cases into clusters by using the calculated distance;
The pre-selected clustering strategy includes:
numOfClustersend,k
=Predict(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
And:
Clustersend,k=Cluster(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
Wherein Distance fo,ft is the Distance between the failure test cases fo and ft calculated by using a Distance measurement function distance_metric, prediction is a function of the number of predicted clusters, numOfClusters end,k is the number of clusters predicted by using a fault classification model, cluster is a clustering algorithm, clusters end,k is the Cluster number allocated to each failure test case of the target multi-fault program after clustering by using a fault classification model, and there are
Clustersend,k={Groupend,k,1,Groupend,k,2,...,Groupend,k,fail}
Wherein Group end,k,f represents the number of the corresponding class for the ordered list generated by using the fault classification model for the f-th failure test case of the target multi-fault program;
The prediction and Cluster both take as input the distance between every two failed test cases fo and ft in the target multi-failure program.
Particular embodiments of the present invention also provide a computer readable medium.
The computer readable medium is a server workstation;
The server workstation stores a computer program executed by the electronic device, and when the computer program runs on the electronic device, the electronic device is caused to execute the steps of the ordering similarity software abnormal behavior modeling method in the embodiment of the invention.
It should be understood that parts of the specification not specifically set forth herein are all prior art.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, but rather to limit the scope of the claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (10)

1. The software abnormal behavior modeling method based on the ordering similarity is characterized by comprising the following steps of:
Step 1, introducing a plurality of multi-fault programs, randomly generating a plurality of fault characterization functions by using a genetic programming algorithm, taking each fault characterization function as each individual in a population, and constructing an initial population by the plurality of fault characterization functions;
Step 2, generating an ordered list representation of each failure test case generated by each multi-failure program by combining each failure representation function in the initial population, carrying out cluster division to obtain ordered list representation classification results of a plurality of failure test cases generated by each multi-failure program by each failure representation function, evaluating the effectiveness score of each external index corresponding to the classification result generated by each multi-failure program by each failure representation function, and obtaining the fitness score of each multi-failure program when each failure representation function is applied to failure separation of each multi-failure program and the fitness of each individual in the initial population;
Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, creating each child individual by using the parent population according to the crossover rate, the replication rate and the mutation rate, taking all the created child individuals as the child population, and calculating the fitness of each individual in the child population through the step 2;
step 4, repeatedly executing the step 2 and the step 3 until the termination condition appointed by the user is reached, and taking the fault characterization function with the highest individual fitness in the last generation population reaching the termination condition as a fault classification model;
And 5, generating a sequencing list representation for each failure test case in a multi-fault program needing fault division by using a fault classification model, calculating the distance between each pair of sequencing list representations, and clustering the failure test cases by combining the distance between each pair of sequencing list representations to realize modeling of software abnormal behaviors and division of a plurality of faults.
2. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein each fault-characterization function in step 1 is specifically defined as follows:
the fault characterization function REF is:
Suspiciousnessk,f,i=REFk(Spectrumf,i)
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
The NCF f,i is the number of times that the ith program statement of each multi-fault program is covered by the f-th failed test case, the NCS i is the number of times that the ith program statement of each multi-fault program is covered by the successful test case, the NUF f,i is the number of times that the ith program statement of each multi-fault program is not covered by the f-th failed test case, the NUS i is the number of times that the ith program statement of each multi-fault program is not covered by the successful test case, the value of i is between 1 and K, the K is the number of program statements in each multi-fault program, the NCF f,i,NCSi,NUFf,i,NUSi is used as the Spectrum information Spectrum f,i,Suspiciousnessk,f,i of the f-th failed test case of each multi-fault program at the ith program statement together, and the probability of a fault source exists at the ith program statement of each multi-fault program calculated by combining the Spectrum information of the f-th failed test case at the ith program statement with the fault characterization function REF k;
the initial population is constructed in the step 1:
A plurality of fault characterization functions are randomly generated by using a genetic programming algorithm and are used as an initial population together:
Wherein, the Population p represents the p generation Population, the initial Population constructed is Population 1, the pop is the Population individual number or Population scale, Representing the kth fault-characterization function in the p-th generation population.
3. The method for modeling abnormal behavior of software based on sequencing similarity according to claim 1, wherein step 2 generates a sequencing list characterization of each failure test case generated by each multi-failure program by combining each failure characterization function in the initial population, specifically as follows:
each multi-fault program comprises a plurality of failure test cases and a plurality of success test cases;
the plurality of failure test cases are specifically defined as follows:
{Ftc1,Ftc2,Ftc3,...,Ftcfail}
Wherein fail is the number of failed test cases, ftc f represents the f failed test cases in each multi-fault program, and the value of f is between 1 and fail;
The plurality of successful test cases are specifically defined as follows:
{Stc1,Stc2,Stc3,...,Stcsuccess};
Wherein success is the number of successful test cases, stc s represents the s-th successful test case in each multi-fault program, and the value of s is between 1 and success;
collecting coverage information of each failed test case and each successful test case at each program statement in each multi-fault program;
calculating a spectrum value of each failure test case at each corresponding program statement in each multi-fault program by using each failure test case and coverage information of each successful test case at each program statement in each multi-fault program;
The Spectrum value Spectrum f,i collected by the f-th failure test case of each multi-failure program at the i-th program statement is:
Spectrumf,i={NCFf,i,NCSi,NUFf,i,NUSi}
i has a value between 1 and K;
The ith fault characterization function in the p-th generation population is used for calculating the suspicious degree of the ith program statement calculated by each multi-fault program by using the f-th fault test case in combination with the spectrum value collected by the f-th fault test case of each multi-fault program at the ith program statement, wherein the suspicious degree of the ith program statement calculated by the f-th fault test case is as follows:
The value of p is between 1 and end, end is algebra of the last generation population, the value of K is between 1 and pop, the value of f is between 1 and fail, and the value of i is between 1 and K;
according to the suspicion calculation method, the suspicion calculated by each multi-fault program by using the kth fault characterization function and the f failure test case in the p generation population is characterized in that:
Rep_Suspiciousnessp,k,f
={Suspiciousnessp,k,f,1,Suspiciousnessp,k,f,2,...,Suspiciousnessp,k,f,is}
wherein is the number of program statements of each multi-fault program;
Sequencing program sentences according to the suspicious degree of each program sentence in the suspicious degree characterization calculated by using a kth fault characterization function and an f-th failure test case in the p-th generation population of each multi-fault program, and obtaining a sequencing list generated for the f-th failure test case of each multi-fault program by using the kth fault characterization function in the p-th generation population, wherein the sequencing list is characterized by:
Rep_Rankingp,k,f={Indexp,k,f,1,Indexp,k,f,2,...,Indexp,k,f,is}
Index p,k,f,rank represents the actual Index position of the rank-order program statement in each multi-fault program when the suspicion degree obtained by calculation of the kth fault characterization function and the f-th failure test case in the p-th generation population is ordered according to the program statement.
4. The method for modeling abnormal software behavior based on sequencing similarity according to claim 1, wherein the step 2 is characterized in that the clustering is performed to obtain a sequencing list characterization classification result of a plurality of failure test cases generated by each multi-failure program by each failure characterization function, and the sequencing list characterization classification result is specifically as follows:
Resultp,k={numOfClustersp,k,Clustersp,k}
Wherein Result p,k is the sorted list characterization classification Result of the kth fault characterization function in the p-th generation population applied to the multiple failure test cases generated by each multi-fault program, numOfClusters p,k is the cluster number predicted by using the kth fault characterization function in the p-th generation population, and when the cluster number is not equal to the actual fault number of each multi-fault program, the subsequent evaluation step of the current fault characterization function on the current multi-fault program is skipped, and the fitness score of the current fault characterization function on the corresponding multi-fault version is set to 0;
Clusters p,k is a cluster number allocated to each failure test case of each multi-failure program after clustering by using the kth failure characterization function in the p generation population, and has
Clustersp,k={Groupp,k,1,Groupp,k,2,...,Groupp,k,fail}
Wherein Group p,k,f is the number of the corresponding class of the ordered list generated for the f-th failure test case of each multi-failure program by using the k-th failure characterization function in the p-th generation population.
5. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein the step 2 of evaluating the validity score of each external index corresponding to the classification result generated by each multi-fault procedure according to each fault characterization function is:
Score_PRp,k=PR(Clustersp,k,Oracle)
Score_RRp,k=RR(Clustersp,k,Oracle)
Score_FMIp,k=FMI(Clustersp,k,Oracle)
Score_JCp,k=JC(Clustersp,k,Oracle)
The Oracle is the real correspondence between each failure test case and a failure source in each multi-failure program, score_PR p,k is the validity Score of the classification result corresponding to PR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_RR p,k is the validity Score of the classification result corresponding to RR index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, score_FMI p,k is the validity Score of the classification result corresponding to FMI index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program, and score_JC p,k is the validity Score of the classification result corresponding to JC index generated by applying the kth failure characterization function in the p-th generation population to each multi-failure program;
PR is the precision rate, RR is the recall rate, FMI is Fowlkes-Mallows index, JC is the Jaccard coefficient, the four external indexes are respectively input by using a cluster number distributed for each failure test case of each multi-failure program after clustering division by using a kth failure characterization function in a p-th generation population and a real corresponding relation between each failure test case and a failure source in each multi-failure program, the effectiveness score of the clustering is output, the score value is within a [0,1] interval, and the larger the value is, the higher the effectiveness of the clustering process is represented;
The calculation of four external indexes requires the establishment of the corresponding relation between each class cluster and a real fault source in class cluster numbers distributed for each failure test case of each multi-fault program after clustering division by using a kth fault characterization function in a p generation population, and for each external index, the effectiveness score under all possible corresponding relations is calculated, and the highest score is taken as the calculation result of the index.
6. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein the step 2 of obtaining the fitness score of each fault characterization function when applied to fault isolation of each multi-fault program is specifically as follows:
Combining the effectiveness scores of the multiple external indexes corresponding to the classification results generated by each multi-fault program by each fault characterization function to obtain the fitness score of each fault characterization function when the fault characterization function is applied to fault separation of each multi-fault program;
step 2, obtaining fitness scores of each individual in the initial population, wherein the fitness scores are specifically as follows:
And obtaining the fitness of each individual of the initial population according to the fitness score of each fault characterization function when the fault characterization function is applied to fault separation of a plurality of multi-fault procedures.
7. The method for modeling abnormal behavior of software based on ranking similarity according to claim 6, wherein,
The validity score of the multiple external indexes corresponding to the classification result generated by each multi-fault program by combining each fault characterization function is obtained, and the fitness score of each fault characterization function when applied to fault separation of each multi-fault program is:
When each external index is calculated for the classification result generated by applying each fault characterization function to each multi-fault program, the effectiveness score under the corresponding relation with the highest effectiveness score is required to be taken as the calculation result, the current index is called as the corresponding relation for voting, and the vote of each corresponding relation is:
wherein, the Represents whether Score PR p,k votes for the d-th correspondence,Represents whether Score _ RR p,k votes for the d-th correspondence,Represents whether Score-FMI p,k votes for the d-th correspondence,Representing whether score_jc p,k is a d-th corresponding relation Vote, vote p,k,d represents the total number of votes obtained by the kth fault characterization function in the p-th generation population at the d-th corresponding relation, the value of d is between 1 and V, and V is the number of possible corresponding relations existing in the multi-fault program;
The total number of obtained tickets with the most corresponding relation is as follows:
Votep,k,nost=Max(Votep,k,1,Votep,k,2,...,Votep,k,V)
Vote p,k,most is the total number of obtained tickets of the corresponding relation of the most obtained tickets of the classification result generated by each multi-fault program in the application of the kth fault characterization function in the p-th generation population in the effectiveness index evaluation process, and is determined by the number of fault sources contained in the multi-fault program;
The method for calculating the fitness score when the kth fault characterization function in the p-th generation population is applied to fault separation of each multi-fault program comprises the following steps:
FitnessScorep,k=TotalMetricsp,k*PenaltyFactorp,k
Wherein FitnessScore p,k is the fitness score of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault procedure;
TotalMetricsp,k=Score_PRp,k+Score_RRp,k+Score_FMIp,k+Score_JCp,k
TotalMetrics p,k is the sum of four external index validity scores of the classification result generated by each multi-fault procedure by applying the kth fault characterization function in the p generation population;
PenaltyFactorp,k=1-0.05*(4-Votep,k,most)
Wherein PenaltyFactor p,k is the fitness penalty factor of the kth fault characterization function in the p-th generation population when applied to fault separation of each multi-fault program, and is determined by Vote p,k,most;
The fitness of each individual of the initial population is obtained according to the fitness score when each fault characterization function is applied to fault separation of a plurality of multi-fault programs, wherein the fitness of each individual of the initial population is as follows:
the individual fitness of each fault-characterization function in the initial population is:
Wherein the method comprises the steps of For the fitness of the kth fault-characterization function in the p-th generation population, version ver represents the verst multiple-fault-procedure, nv is the number of multiple-fault-procedures included in the multiple-fault-procedures introduced in step 1,And (3) applying the k fault characterization function in the p generation population to the fitness score of fault separation of the third multi-fault program.
8. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein,
Step 3, selecting a parent population from the population according to the fitness of each individual in the initial population, wherein the parent population is:
fitness from a population according to each individual using roulette algorithm Selecting individuals with different pop-profiles to form a parent population:
Wherein pop is the Population size, the description is the ratio of the next parent Population to the parent Population, the popularizing p represents the p-th generation Population, The fitness score representing the kth fault-characterization function in the p-th generation Population is Roulette which is a roulette algorithm, father _popularization p+1 represents a parent Population selected from the popularization p and used for generating the popularization p+1, the value of p is between 1 and end, and the value of k is between 1 and pop;
step 3, creating offspring individuals by using the parent population according to the crossover rate, the replication rate and the mutation rate, wherein the offspring individuals are as follows:
creating each child using parent Population Father-Population p+1 based on crossover, replication and mutation rates
Determining each sub-generation individual according to the crossover rate, the replication rate and the mutation rateThe creation mode of (a):
Wherein c, o and m are crossover, replication and mutation rates, and there are
c+o+m=1
0≤c,o,m≤1
Random is a random number in a [0,1] interval, method is a generation mode of current offspring individuals, crossover represents generation through cross, copy represents generation through Copy, mutate represents generation through variation, and an individual creation mode is determined according to the value of random and the size of cross rate, replication rate and variation rate;
The creation of offspring individuals is:
wherein, the The method is characterized in that a kth fault characterization function of a child Population of a p-th generation Population is realized, the Create_ Crossover is a crossover algorithm, the REF r1,REFr2 is two different individuals randomly selected from a parent Population Father _population p+1, the Create_copy and the Create_ Mutate are replication and mutation algorithms, and the REF r is one individual randomly selected from a parent Population Father _population p+1;
Each individual in the population possesses, in addition to its own attribute as a fault-characterizing function, an age attribute:
wherein, the For the age of the kth individual in the p-th generation population,For all of the individuals in the initial population,For newly created individuals in a offspring population, method isCorresponding creation means, REF r is created by a replication algorithmRandomly selecting individuals for replication from the parent population, age representing Age;
when an individual's age reaches 3, it will not be selectable as a parent Population, defining Survive _population p as a surviving Population consisting of all individuals less than 3 in the p-th generation Population;
Namely, the selection method of the parent Population Father _population p+1 of the p-th generation Population is modified as follows:
wherein the kth fault characterization function of the p-th generation population Is the individual in Survive _population p if and only if
When generating offspring individualsStopping generating when the number reaches the Population scale pop, and taking the pop individuals as next generation Population p+1;
step 3, evaluating fitness of the offspring population by using the method in step 2 and continuing to evolve the next generation population until the termination condition is reached:
And (3) continuously using the method in the step (2) to evaluate the fitness score of each fault characterization function in the current p-th generation Population position p before the user-defined termination condition SC is met, the iteration is carried out for a fixed number of times or the fault characterization function meeting the requirement is obtained, and using the method in the step (3) to generate a child Population position p+1.
9. The method for modeling abnormal behavior of software based on ranking similarity according to claim 1, wherein,
And step 4, taking a fault characterization function with the highest individual fitness in the final generation population reaching the termination condition as a fault classification model, wherein the fault classification model comprises the following steps:
the Population end completes fitness evaluation for the last generation and reaches the termination condition SC Population, For the highest fitness score in the placement end The characterization function can be used for generating the fault characterization of the real multi-fault program and separating the fault;
In the step 5, in the multi-fault procedure requiring fault division, generating an ordered list for each failed test case by using a fault classification model is characterized in that:
Using the kth fault-characterization function with highest fitness score, i.e. the fault classification model, selected in step 4 from the Population end of the last generation for which fitness evaluation is completed and termination conditions SC are reached Generating an ordered list for each failed test case of the target multi-fault program according to the method in step 2 characterizes Rep_ranking end,k,f,Rep_Rankingend,k,f as using a fault classification modelThe method comprises the steps of representing an ordered list generated for an f-th failure test case of a target multi-fault program;
step 5, calculating the distance between each pair of ordered list characterization is as follows:
the preselected Distance Metric function distance_metric is:
Distancefo,ft=Distance_Metric(Rep_Rankingend,k,fo,Rep_Rankingend,k,ft)
wherein, the two failure test cases fo and ft of the Rep_ranking end,k,fo,Rep_Rankingend,k,fo as the target multi-failure program are based on the failure separation model The generated ordered list is characterized in that Distance fo,ft is the Distance between failed test cases fo and ft obtained by calculation by using a Distance measurement function distance_metric;
Step 5, clustering the failure test cases into clusters by using the calculated distance;
The pre-selected clustering strategy includes:
numOfClustersend,k
=Predict(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
And:
Clustersend,k=Cluster(Distancef1,f1,Distancef1,f2,...,Distancefo,ft,...,Distancefail,fail)
Wherein Distance fo,ft is the Distance between the failure test cases fo and ft calculated by using a Distance measurement function distance_metric, prediction is a function of the number of predicted clusters, numOfClusters end,k is the number of clusters predicted by using a fault classification model, cluster is a clustering algorithm, clusters end,k is the Cluster number allocated to each failure test case of the target multi-fault program after clustering by using a fault classification model, and there are
Clustersend,k={Groupend,k,1,Groupend,k,2,...,Groupend,k,fail}
Wherein Group end,k,f represents the number of the corresponding class for the ordered list generated by using the fault classification model for the f-th failure test case of the target multi-fault program;
The prediction and Cluster both take as input the distance between every two failed test cases fo and ft in the target multi-failure program.
10. A computer readable medium, characterized in that it stores a computer program for execution by an electronic device, which computer program, when run on the electronic device, causes the electronic device to perform the steps of the method according to any one of claims 1-9.
CN202211345882.6A 2022-10-31 2022-10-31 A method for modeling abnormal behavior of sorting similarity software and computer-readable medium Active CN115687117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211345882.6A CN115687117B (en) 2022-10-31 2022-10-31 A method for modeling abnormal behavior of sorting similarity software and computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211345882.6A CN115687117B (en) 2022-10-31 2022-10-31 A method for modeling abnormal behavior of sorting similarity software and computer-readable medium

Publications (2)

Publication Number Publication Date
CN115687117A CN115687117A (en) 2023-02-03
CN115687117B true CN115687117B (en) 2025-07-01

Family

ID=85046155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211345882.6A Active CN115687117B (en) 2022-10-31 2022-10-31 A method for modeling abnormal behavior of sorting similarity software and computer-readable medium

Country Status (1)

Country Link
CN (1) CN115687117B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118690142A (en) * 2024-08-23 2024-09-24 清华大学 A method and system for identifying key network nodes based on large language model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106527394A (en) * 2016-10-25 2017-03-22 天津大学 ECPS (Electric Cyber-Physical System) cascading fault risk assessment method of considering multiple information factors
CN113627088A (en) * 2021-08-23 2021-11-09 上海交通大学 Machine performance degradation evaluation method and system based on gene programming and data fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190127411A (en) * 2018-05-04 2019-11-13 한국과학기술원 Method and apparatus for fault localization using code and change metrics
CN111124884B (en) * 2019-11-20 2021-10-26 北京航空航天大学 Multi-fault decoupling and fault positioning method based on genetic algorithm
CN112559374A (en) * 2020-12-24 2021-03-26 深圳壹账通智能科技有限公司 Test case sequencing method and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106527394A (en) * 2016-10-25 2017-03-22 天津大学 ECPS (Electric Cyber-Physical System) cascading fault risk assessment method of considering multiple information factors
CN113627088A (en) * 2021-08-23 2021-11-09 上海交通大学 Machine performance degradation evaluation method and system based on gene programming and data fusion

Also Published As

Publication number Publication date
CN115687117A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
WO2022246843A1 (en) Software project risk assessment method and apparatus, computer device, and storage medium
EP4075281A1 (en) Ann-based program test method and test system, and application
CN105760295A (en) Multi-defect positioning method based on search algorithm
CN109710514B (en) A solution and system for tie-breaking in test case prioritization
CN108563555A (en) Failure based on four objective optimizations changes code prediction method
Li et al. A novel reliability estimation method of multi-state system based on structure learning algorithm
CN116089504B (en) Relational form data generation method and system
CN115687117B (en) A method for modeling abnormal behavior of sorting similarity software and computer-readable medium
Mahammad et al. Machine learning approach to predict asthma prevalence with decision trees
CN112527573B (en) An interface testing method, device and storage medium
Lu et al. Machine learning methodologies to predict the results of simulation-based fault injection
US10803218B1 (en) Processor-implemented systems using neural networks for simulating high quantile behaviors in physical systems
WO2017001885A2 (en) Method of generating a model of an object
CN115344386A (en) Cloud Simulation Computing Resource Prediction Method, Device and Equipment Based on Sorting Learning
CN114741304A (en) Staged error positioning method based on graph neural network
CN114330859A (en) Optimization method, system and equipment for real-time quality control
CN115576851B (en) A software multi-fault clustering location method and device combined with dynamic slicing
CN119088683A (en) Method for constructing software reliability evaluation model
Guo et al. First, debug the test oracle
CN111445006A (en) Method and system for predicting number of submission times of developer codes in open source community
CN117390952A (en) Fire risk assessment method and system based on parameter modeling
El Mandouh et al. Guiding functional verification regression analysis using machine learning and big data methods
Xiao et al. A systematic literature review on test case prioritization and regression test selection
Shin Case study: Real-world machine learning application for hardware failure detection.
CN118070246B (en) A predictive maintenance method for IoT devices in smart factories

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant