Disclosure of Invention
The embodiments of the present application provide a program classification model training method, a program classification method, and a program classification device, which improve the accuracy of identifying programs whose category is unknown.
In a first aspect, an embodiment of the present application provides a method for training a program classification model. First, a plurality of sample programs are received, where a sample program is a program whose category has been labeled in advance, and the plurality of sample programs belong to at least two different categories. The at least two different categories may include a normal-program category and a malicious-program category; alternatively, the at least two different categories include at least two different categories of malicious programs. Second, a sample program is selected from the plurality of sample programs, and the following processing is performed to obtain the feature value of at least one candidate fusion feature of the selected sample program, until every sample program in the plurality of sample programs has been processed: acquiring a feature value of each static feature and a feature value of each dynamic feature of the selected sample program according to a preset static feature set including at least one static feature and a preset dynamic feature set including at least one dynamic feature; and obtaining a feature value of at least one candidate fusion feature of the selected sample program according to the feature value of at least one static feature of the selected sample program, the feature value of at least one dynamic feature, and at least one fusion operation rule, where the feature value of each candidate fusion feature in the at least one candidate fusion feature is obtained based on a corresponding fusion operation rule, and a fusion operation rule indicates that a fusion operation is performed on the feature value of a specified static feature in the preset static feature set and the feature value of a specified dynamic feature in the preset dynamic feature set.
Third, for a first candidate fusion feature in the at least one candidate fusion feature, the following processing is performed, and so on, to obtain an evaluation value of each candidate fusion feature: determining an evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category of each sample program, where the magnitude of the evaluation value represents how effective the first candidate fusion feature is for distinguishing the category to which a sample program belongs. Then, a target fusion feature is selected from the at least one candidate fusion feature according to the evaluation value of each candidate fusion feature, where the evaluation value of the target fusion feature represents a greater degree of validity than the evaluation values of the other candidate fusion features in the at least one candidate fusion feature. Finally, a program classification model is obtained by training according to the feature value of the target fusion feature in each sample program.
The static features reflect structural characteristics of the selected sample program, for example, dynamic link library files, optional header information, resource section information, section header information, data directory table information, image file header, additional information, abnormal structure fields, entry point information, executable code segments, and the like. If the selected sample program contains specific values for a static feature, those values may be used as the feature value of the static feature; this applies, for example, to the optional header information, resource section information, section header information, data directory table information, image file header, and so on. If the sample program itself has no specific value for a static feature, the corresponding feature value is determined according to the actual behavior of the sample program; for example, if a dynamic link library file is loaded, the feature value is 1, and otherwise it is 0. The dynamic features reflect behaviors of the selected sample program during running, for example, a parameter model of the sample program and/or at least one interface called by the sample program during running, where the parameter model is extracted from the parameters used by the sample program during running. Assuming that the at least one dynamic feature includes a third dynamic feature, the feature value of the third dynamic feature of the selected sample program may be the frequency of the third dynamic feature, where the frequency is the ratio between the number of times the third dynamic feature appears in the selected sample program and the total number of dynamic features included in the preset dynamic feature set.
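As an illustration, the frequency of a dynamic feature described above is a simple ratio. The sketch below assumes a hypothetical representation in which the sample program's run-time trace is a list of observed dynamic-feature names; the function name and data layout are illustrative and not part of the claimed method.

```python
def dynamic_feature_frequency(trace, feature, dynamic_feature_set):
    """Frequency of `feature`: its number of occurrences in the sample
    program's trace, divided by the total number of dynamic features in
    the preset dynamic feature set."""
    return trace.count(feature) / len(dynamic_feature_set)

# e.g. a feature appearing twice in the trace, with a preset dynamic
# feature set of four features, yields a frequency of 2 / 4 = 0.5
```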
In the method and device, the program classification model is trained based on feature values of fusion features, and programs are subsequently classified by that model. Because a dynamic feature is not tied to the form of the program, and because a static feature can compensate for insufficient feature extraction when the program has an anti-virtual-machine capability, training the program classification model on feature values of fusion features improves the accuracy of program category identification compared with the prior art.
Optionally, determining the evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category of each sample program includes: according to the category to which each sample program belongs, computing statistics over the feature values of the first candidate fusion feature in the sample programs of each category, to obtain a statistical value of the first candidate fusion feature in each category, such as the median, mean, or variance of those feature values; and then determining the evaluation value of the first candidate fusion feature according to the statistical value of the first candidate fusion feature in each category. The evaluation value may be, for example, the ratio, difference, or variance between the per-category statistical values of the first candidate fusion feature.
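For example, taking the mean as the per-category statistic, the evaluation value could be the gap between the largest and smallest category means. This is a minimal sketch under that assumption; as the text notes, the median, variance, or a ratio could be substituted.

```python
from statistics import mean

def evaluation_value(values_by_category):
    """values_by_category maps each category to the list of feature values
    of the first candidate fusion feature in that category's samples."""
    means = [mean(v) for v in values_by_category.values()]
    # a larger gap between category means indicates the feature is more
    # effective for distinguishing the categories
    return max(means) - min(means)
```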
In practical applications, there are various implementations of obtaining the feature value of the at least one candidate fusion feature of the selected sample program according to the feature value of the at least one static feature of the selected sample program, the feature value of the at least one dynamic feature, and the at least one fusion operation rule.
As one possible implementation manner, the at least one candidate fusion feature includes a first candidate fusion feature, and a feature value of the first candidate fusion feature is obtained based on a corresponding first fusion operation rule. The first fusion operation rule indicates that mathematical operations, such as multiplication, addition, subtraction, and the like, are performed on the feature value of the first static feature in the preset static feature set and the feature value of the first dynamic feature in the preset dynamic feature set. Wherein the first static feature comprises one or more static features and the first dynamic feature comprises one or more dynamic features.
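The first fusion operation rule can be sketched as follows, assuming the designated static and dynamic feature values are already numeric; the choice of multiplication as the default is illustrative.

```python
def fuse_mathematical(static_value, dynamic_value, op="mul"):
    # first fusion rule sketch: an arithmetic operation over the value of
    # one designated static feature and one designated dynamic feature
    ops = {
        "mul": lambda s, d: s * d,
        "add": lambda s, d: s + d,
        "sub": lambda s, d: s - d,
    }
    return ops[op](static_value, dynamic_value)
```

For instance, multiplying a 0/1 static feature (a dynamic link library is loaded) by a dynamic-feature frequency yields a fused value that is nonzero only when both signals are present.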
As another possible implementation manner, the at least one candidate fusion feature includes a second candidate fusion feature, and a feature value of the second candidate fusion feature is obtained based on a corresponding second fusion operation rule. The second fusion operation rule indicates that a logical operation, such as an AND operation, an OR operation, or a NAND operation, is performed on the feature value of the second static feature in the preset static feature set and the feature value of the second dynamic feature in the preset dynamic feature set. Wherein the second static feature comprises one or more static features and the second dynamic feature comprises one or more dynamic features.
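The second fusion operation rule can be sketched similarly, treating the two designated feature values as booleans; the 0/1 encoding is an assumption, since the text does not fix one.

```python
def fuse_logical(static_value, dynamic_value, op="and"):
    # second fusion rule sketch: a logical operation (AND, OR, NAND) over
    # one designated static feature value and one dynamic feature value
    s, d = bool(static_value), bool(dynamic_value)
    results = {"and": s and d, "or": s or d, "nand": not (s and d)}
    return int(results[op])  # encode the fused feature value back as 0/1
```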
As another possible implementation manner, the at least one candidate fusion feature includes a third candidate fusion feature, and a feature value of the third candidate fusion feature is obtained based on a corresponding third fusion operation rule. The third fusion operation rule indicates determining, from the preset static feature set and the preset dynamic feature set, the features that are themselves the same and whose feature values are the same, and calculating the feature value of the third candidate fusion feature according to the total number of such features.
Optionally, calculating the feature value of the third candidate fusion feature according to the total number of features that are themselves the same and whose feature values are the same includes: first, determining the maximum of a first number and a second number, where the first number is the total number of static features included in the preset static feature set, and the second number is the total number of dynamic features included in the preset dynamic feature set; and then calculating the ratio between the total number of such features and that maximum, and using the ratio as the feature value of the third candidate fusion feature. Of course, this ratio is not the only way to calculate the feature value of the third candidate fusion feature; those skilled in the art can design other calculations according to specific situations.
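Under the assumption that each feature set is stored as a name-to-value mapping, the third rule's ratio can be sketched as:

```python
def third_fusion_value(static_features, dynamic_features):
    """static_features / dynamic_features: {feature name: feature value}.
    Counts features present in both sets with identical values, then
    divides by the larger of the two set sizes."""
    shared = sum(
        1
        for name, value in static_features.items()
        if name in dynamic_features and dynamic_features[name] == value
    )
    return shared / max(len(static_features), len(dynamic_features))
```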
In addition, the three implementations above do not limit the technical solution of the present application; a person skilled in the art can design implementations according to actual situations.
Optionally, to enable the program classification model to classify a program of unknown category more accurately, training the program classification model according to the feature value of the target fusion feature in each sample program includes: training the program classification model according to the feature value of the target fusion feature in each sample program, the feature value of at least one static feature of each sample program, and the feature value of at least one dynamic feature of each sample program.
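Assembling the combined training input described above amounts to concatenating the three groups of feature values per sample. This sketch assumes each group has already been computed and ordered consistently across samples:

```python
def build_training_vector(fused_values, static_values, dynamic_values):
    # one training vector per sample program: the target-fusion feature
    # values first, followed by the raw static and dynamic feature values
    return list(fused_values) + list(static_values) + list(dynamic_values)
```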
In a second aspect, an embodiment of the present application provides a program classification method. First, a target program is acquired. Second, a feature value of each static feature and a feature value of each dynamic feature of the target program are acquired according to a preset static feature set including at least one static feature and a preset dynamic feature set including at least one dynamic feature. Third, a feature value of at least one target fusion feature of the target program is acquired, where the feature value of the at least one target fusion feature is obtained based on a corresponding fusion operation rule, and the fusion operation rule indicates that a fusion operation is performed on the feature value of a specified static feature in the preset static feature set and the feature value of a specified dynamic feature in the preset dynamic feature set. Finally, the feature value of the at least one target fusion feature of the target program is input into the program classification model to obtain a classification result of the target program, for example, a result classifying the target program as a normal program or a malicious program, or a result classifying the target program into one of a plurality of malicious-program categories.
The static features embody structural characteristics of the target program, such as dynamic link library files, optional header information, resource section information, section header information, data directory table information, image file header, additional information, abnormal structure fields, entry point information, executable code segments, and the like. The feature values of the static features may be extracted from the target program or derived from its behavior. The dynamic features are behavior characteristics of the target program during running, such as a parameter model and a preset interface. If the at least one dynamic feature of the target program includes a parameter model and a preset interface, acquiring the feature values of the dynamic features of the target program includes: acquiring the preset interfaces called by the target program during running and the parameters used; extracting a parameter model from the used parameters; and selecting a third dynamic feature from the at least one dynamic feature of the target program and using the frequency of the third dynamic feature as its feature value, and so on, to obtain the feature values of all dynamic features of the target program, where the frequency of the third dynamic feature is the ratio between the number of times the third dynamic feature appears in the target program and the total number of dynamic features included in the preset dynamic feature set.
In the method and device, the feature value of a target fusion feature, obtained by performing a fusion operation on the feature value of a specified static feature and the feature value of a specified dynamic feature of the target program, is input into the program classification model, rather than only the feature value of a static feature or only the feature value of a dynamic feature. This combines two advantages: dynamic features can identify a program whose form has changed, while static features, which reflect the structural characteristics of the program, can identify a program that has an anti-virtual-environment capability. The accuracy with which the program classification model classifies the target program is thereby effectively improved.
In practical applications, there may be a plurality of implementation manners for obtaining the feature value of the at least one target fusion feature of the target program based on the corresponding fusion operation rule.
As one possible implementation manner, the at least one target fusion feature includes a first target fusion feature, and a feature value of the first target fusion feature is obtained based on a corresponding first fusion operation rule. The first fusion operation rule indicates that mathematical operations, such as multiplication, addition, subtraction, and the like, are performed on the feature value of the first static feature in the preset static feature set and the feature value of the first dynamic feature in the preset dynamic feature set. Wherein the first static feature comprises one or more static features and the first dynamic feature comprises one or more dynamic features.
As another possible implementation manner, the at least one target fusion feature includes a second target fusion feature, and a feature value of the second target fusion feature is obtained based on a corresponding second fusion operation rule. The second fusion operation rule indicates that a logical operation, such as an AND operation, an OR operation, or a NAND operation, is performed on the feature value of the second static feature in the preset static feature set and the feature value of the second dynamic feature in the preset dynamic feature set. Wherein the second static feature comprises one or more static features and the second dynamic feature comprises one or more dynamic features.
As another possible implementation manner, the at least one target fusion feature includes a third target fusion feature, and a feature value of the third target fusion feature is obtained based on a corresponding third fusion operation rule. The third fusion operation rule indicates determining, from the preset static feature set and the preset dynamic feature set, the features that are themselves the same and whose feature values are the same, and calculating the feature value of the third target fusion feature according to the total number of such features.
Optionally, calculating the feature value of the third target fusion feature according to the total number of features that are themselves the same and whose feature values are the same includes: determining the maximum of a first number and a second number, where the first number is the total number of static features included in the preset static feature set, and the second number is the total number of dynamic features included in the preset dynamic feature set; and calculating the ratio between the total number of such features and that maximum, and using the ratio as the feature value of the third target fusion feature. Of course, this ratio is not the only way to calculate the feature value of the third target fusion feature; those skilled in the art can design other calculations according to specific situations.
In addition, the three implementation manners do not limit the technical scheme of the present application, and a person skilled in the art can design the implementation manners according to actual situations.
Optionally, there are a plurality of target programs. If the category of a target program cannot be identified through the program classification model, the method further includes: clustering the plurality of target programs according to the feature value of at least one target fusion feature of each of the plurality of target programs, to obtain the category of each target program.
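One way to realize this clustering fallback is a small k-means over the fusion-feature vectors of the unclassified programs. The sketch below uses only the standard library and seeds the centroids from the first k vectors; the choice of k-means and of k = 2 is an assumption, since the text does not fix a clustering algorithm.

```python
import math

def cluster_programs(vectors, k=2, iters=10):
    """Group programs the classifier could not label, using the feature
    values of their target fusion features as coordinates."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # assign each program to its nearest centroid
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        # move each centroid to the mean of its assigned members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```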
In a third aspect, an embodiment of the present application further provides a device for training a program classification model, where the device includes:
a receiving unit, configured to receive a plurality of input sample programs, where a sample program is a program whose category has been labeled in advance, and the plurality of sample programs belong to at least two different categories;
a first processing unit, configured to select a sample program from the plurality of sample programs, and perform the following processing to obtain a feature value of at least one candidate fusion feature of the selected sample program until each sample program of the plurality of sample programs is processed:
acquire a feature value of each static feature and a feature value of each dynamic feature of the selected sample program according to a preset static feature set including at least one static feature and a preset dynamic feature set including at least one dynamic feature, where the static features reflect structural characteristics of the selected sample program, and the dynamic features reflect behaviors of the selected sample program during running; and
obtain a feature value of at least one candidate fusion feature of the selected sample program according to the feature value of at least one static feature of the selected sample program, the feature value of at least one dynamic feature, and at least one fusion operation rule, where the feature value of each candidate fusion feature in the at least one candidate fusion feature is obtained based on a corresponding fusion operation rule, and a fusion operation rule indicates that a fusion operation is performed on the feature value of a specified static feature in the preset static feature set and the feature value of a specified dynamic feature in the preset dynamic feature set;
a second processing unit, configured to, for a first candidate fusion feature of the at least one candidate fusion feature, perform the following processing, and so on, to obtain an evaluation value of each candidate fusion feature: determining an evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category of each sample program, wherein the size of the evaluation value represents the validity degree of the first candidate fusion feature for distinguishing the category to which the sample program belongs;
a selection unit, configured to select a target fusion feature from the at least one candidate fusion feature according to the evaluation value of each candidate fusion feature, where the evaluation value of the target fusion feature represents a greater degree of validity than the evaluation values of the other candidate fusion features in the at least one candidate fusion feature; and
a training unit, configured to train a program classification model according to the feature value of the target fusion feature in each sample program.
Optionally, determining the evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category of each sample program includes:
according to the category to which each sample program belongs, computing statistics over the feature values of the first candidate fusion feature in the sample programs of each category, to obtain a statistical value of the first candidate fusion feature in each category; and determining the evaluation value of the first candidate fusion feature according to the statistical value of the first candidate fusion feature in each category.
Optionally, the at least one candidate fusion feature includes a first candidate fusion feature, and the feature value of the first candidate fusion feature is obtained based on a corresponding first fusion operation rule;
each fusion operation rule indicates that performing fusion operation on the feature values of the specified static features in the preset static feature set and the feature values of the specified dynamic features in the preset dynamic feature set comprises:
the first fusion operation rule instructs to perform a mathematical operation on a feature value of a first static feature in the preset static feature set and a feature value of a first dynamic feature in the preset dynamic feature set.
Optionally, the at least one candidate fusion feature includes a second candidate fusion feature, and a feature value of the second candidate fusion feature is obtained based on a corresponding second fusion operation rule;
each fusion operation rule indicates that performing fusion operation on the feature values of the specified static features in the preset static feature set and the feature values of the specified dynamic features in the preset dynamic feature set comprises:
the second fusion operation rule instructs to perform a logical operation on the feature value of the second static feature in the preset static feature set and the feature value of the second dynamic feature in the preset dynamic feature set.
Optionally, the at least one candidate fusion feature includes a third candidate fusion feature, and a feature value of the third candidate fusion feature is obtained based on a corresponding third fusion operation rule;
each fusion operation rule indicates that performing fusion operation on the feature values of the specified static features in the preset static feature set and the feature values of the specified dynamic features in the preset dynamic feature set comprises:
the third fusion operation rule indicates determining, from the preset static feature set and the preset dynamic feature set, the features that are themselves the same and whose feature values are the same, and calculating the feature value of the third candidate fusion feature according to the total number of such features.
In a fourth aspect, an embodiment of the present application further provides a program classifying device, where the device includes:
a program acquisition unit configured to acquire a target program;
a first feature value acquiring unit, configured to acquire a feature value of each static feature and a feature value of each dynamic feature of the target program according to a preset static feature set including at least one static feature and a preset dynamic feature set including at least one dynamic feature, where the static features embody structural characteristics of the target program, and the dynamic features are behavior characteristics of the target program during running;
a second feature value obtaining unit, configured to obtain a feature value of at least one target fusion feature of the target program, where the feature value of the at least one target fusion feature of the target program is obtained based on a corresponding fusion operation rule, and the fusion operation rule indicates that a fusion operation is performed on a feature value of a specified static feature in the preset static feature set and a feature value of a specified dynamic feature in the preset dynamic feature set;
and a classification unit, configured to input the feature value of the at least one target fusion feature of the target program into the program classification model to obtain a classification result of the target program.
Optionally, the at least one target fusion feature includes a first target fusion feature, and a feature value of the first target fusion feature is obtained based on a corresponding first fusion operation rule;
the fusion operation rule indicates that executing fusion operation on the feature value of the specified static feature in the preset static feature set and the feature value of the specified dynamic feature in the preset dynamic feature set comprises:
the first fusion operation rule instructs to perform a mathematical operation on a feature value of a first static feature in the preset static feature set and a feature value of a first dynamic feature in the preset dynamic feature set.
Optionally, the at least one target fusion feature includes a second target fusion feature, and a feature value of the second target fusion feature is obtained based on a corresponding second fusion operation rule;
the fusion operation rule indicates that executing fusion operation on the feature value of the specified static feature in the preset static feature set and the feature value of the specified dynamic feature in the preset dynamic feature set comprises:
the second fusion operation rule instructs to perform a logical operation on the feature value of the second static feature in the preset static feature set and the feature value of the second dynamic feature in the preset dynamic feature set.
Optionally, the at least one target fusion feature includes a third target fusion feature, and a feature value of the third target fusion feature is obtained based on a corresponding third fusion operation rule;
the fusion operation rule indicates that executing fusion operation on the feature value of the specified static feature in the preset static feature set and the feature value of the specified dynamic feature in the preset dynamic feature set comprises:
the third fusion operation rule indicates determining, from the preset static feature set and the preset dynamic feature set, the features that are themselves the same and whose feature values are the same, and calculating the feature value of the third target fusion feature according to the total number of such features.
Detailed Description
In order to improve the accuracy of identifying programs of unknown category, the embodiments of the present application provide a program classification model training method and device, and a program classification method and device.
The program classification model training method and device and the program classification method and device provided by the embodiments of the present application can be applied, for example, to the application scenarios shown in fig. 1 and fig. 2.
Fig. 1 is a schematic diagram of an enterprise network architecture. In fig. 1, the enterprise network architecture includes a security device 101, a network access device 102 such as a firewall or a security gateway, a switch 103 connected to the network access device 102, and a plurality of hosts 104 connected to the switch 103. The security device 101 is connected to the network access device 102 and may be, for example, an Intrusion Prevention System (IPS) device or a Unified Threat Management (UTM) device. The security device 101 is configured to train a program classification model, receive a test sample sent by the network access device 102 or by client software installed on an intranet host 104, and output the category to which the test sample belongs.
Fig. 2 is a schematic diagram of a cloud network architecture. In fig. 2, the cloud network architecture may include a security device 201 on the core network side and a plurality of firewall devices 202 in the access network. The security device 201 may include modules such as a cloud sandbox, which are used to train a program classification model, receive a test sample from a device 202 on which a firewall is deployed, and output the category to which the test sample belongs.
The following describes in detail the program classification model training method provided by the embodiments of the present application with reference to the accompanying drawings. The method may be performed by the security device 101 in fig. 1 or the security device 201 in fig. 2. The workflow of the security device 101 and the security device 201 mainly includes a training phase and a testing phase. In the training phase, the input of the security device 101 and the security device 201 is a training set, and the output is the generated program classification model. The training set includes a plurality of training samples, where a training sample is a sample program whose category has been labeled in advance. The security device 101 and the security device 201 generate the program classification model according to the training set and a predetermined machine learning algorithm. In the testing phase, the input of the security device 101 and the security device 201 is a test sample, and the output is the category to which the test sample belongs, where a test sample is a sample program whose category is unknown. In the testing phase, the security device 101 and the security device 201 classify the test sample according to the generated program classification model.
Referring to fig. 3, the figure is a schematic flowchart of a program classification model training method provided in the embodiment of the present application.
The program classification model training method provided by the embodiment of the application can comprise the following steps:
S101: an input of a plurality of sample programs is received.
In the embodiment of the present application, a sample program is a program whose category has been labeled in advance, and the plurality of sample programs belong to at least two different categories. For example, the at least two different categories may include two categories: normal programs and malicious programs. As another example, the at least two different categories include at least two different categories of malicious programs, such as worms (worm), trojans (trojan), downloaders (downloader), backdoors, and the like. The malicious program may be, for example, a Portable Executable (PE) file.
S102: one sample program is selected from the plurality of sample programs, and S1021 and S1022 are executed to obtain a feature value of at least one candidate fusion feature of the selected sample program until each of the plurality of sample programs is processed.
That is, each of the plurality of sample programs has at least one candidate fusion feature, and the feature values of the candidate fusion features are calculated as described in S1021 and S1022.
S1021: acquire the feature value of each static feature and the feature value of each dynamic feature of the selected sample program according to a preset static feature set including at least one static feature and a preset dynamic feature set including at least one dynamic feature.
In the embodiment of the application, a preset static feature set and a preset dynamic feature set are predetermined, where the preset static feature set includes at least one static feature and the preset dynamic feature set includes at least one dynamic feature. The selected sample program is then analyzed to obtain its feature value for each static feature in the preset static feature set and for each dynamic feature in the preset dynamic feature set. That is, the preset static feature set is the same for every sample program; only the feature values of the static features differ from program to program. Similarly, the preset dynamic feature set is the same for every sample program; only the feature values of the dynamic features differ.
The static features reflect structural features of the sample program. For example, the static features include one or more of the following: a dynamic link library (DLL) file, image optional header information, resource family information, section header family information, data directory table information, an image file header, additional (overlay) information, an exception structure field, entry point information, and an executable code segment.
Each of the static features described above may include one or more static features. For example, the dynamic link library files may include ADVAPI32.DLL, AWFAXP32.DLL, AWFXAB32.DLL, and the like. The exception structure field refers to a field in which an exception may occur, and includes, for example, a deprecated field, a default field, a reserved field, a structure field, and the like.
If the sample program contains a specific value for a static feature, that value may be used as the feature value of the static feature; examples include the image optional header information, resource family information, section header family information, data directory table information, and image file header. If the sample program itself has no such specific value, the corresponding feature value is determined according to the actual behavior of the sample program. For example, if a dynamic link library file is loaded, the feature value is 1; otherwise it is 0. For another example, if the entry point information is in the executable code segment, the feature value is 0; otherwise it is 1. For another example, if the additional information exists, the feature value is 1; otherwise it is 0.
Optionally, in order to improve the classification effectiveness of the static features, some specific N-gram features may also be used as static features. The N-gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the content of a text byte by byte, forming a sequence of byte fragments of length N, where each byte fragment is called a gram. In the embodiments of the present application, N is greater than or equal to 2.
For example, to extract an N-gram feature from an executable code segment, a sliding-window operation with a size of 4 is performed over the bytes of the executable code segment, forming byte fragment sequences of length 4; each byte fragment sequence can be regarded as one static feature.
If the N-gram feature can be extracted from the selected sample program, the feature value of the N-gram feature can be 1; otherwise it is 0.
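As an illustration of the sliding-window extraction described above, the following minimal Python sketch (the function names are hypothetical, not part of the embodiment) forms 4-grams from an executable code segment and derives the 1/0 feature value:

```python
def extract_ngrams(code_bytes: bytes, n: int = 4) -> set:
    # Slide a window of size n over the bytes; each fragment is one gram.
    return {code_bytes[i:i + n] for i in range(len(code_bytes) - n + 1)}

def ngram_feature_value(code_bytes: bytes, gram: bytes) -> int:
    # Feature value is 1 if the N-gram occurs in the sample, otherwise 0.
    return 1 if gram in extract_ngrams(code_bytes, len(gram)) else 0
```

For example, `extract_ngrams(b"abcdef")` yields the three 4-grams `abcd`, `bcde`, and `cdef`.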
The static features and the feature values of the static features of the selected sample program are described above, and the dynamic features and the feature values of the dynamic features of the selected sample program are described below.
The dynamic features reflect behaviors of the sample program during running, including process operations, file operations, network operations, registry operations, and the like. To obtain the feature values of the dynamic features, the selected sample program may be run in a sandbox; this both yields the dynamic feature values and prevents a malicious sample program from affecting the system.
In the embodiment of the present application, the dynamic characteristics may include an Application Programming Interface (API) called by the sample program during the running process and/or parameters used by the sample program. The API is used for indicating the type of the operation behavior of the sample program, such as file creation, registry modification and the like, and the parameter represents the object of the operation behavior, such as a file path, a registry path and the like. Each run behavior corresponds to an API and at least one parameter.
In order to improve the generalization capability of the trained classification model and prevent overfitting, the embodiment of the application abstracts parameter models from the concrete parameters.
For example, the file path "c:\users\zhangsan\appdata\local\temp\" (a user's temporary directory) may be abstracted into the parameter model "c:\users\*\appdata\[roaming|local]\[te|t]mp\", where "*" indicates that any content may appear there; [roaming|local] indicates that either roaming or local appearing in that position counts as a hit, since roaming and local have essentially the same meaning in an operating system and are used to store content released by applications; and [te|t]mp indicates that either temp or tmp appearing there counts as a hit, since different operating systems may name temporary directories differently.
For another example, the registry path "hklm\software\microsoft\windows\currentversion\runonce" may be abstracted into the parameter model "hk[cu|lm]\software\microsoft\windows\currentversion\runonce", where hk[cu|lm] indicates that either hkcu or hklm appearing there counts as a hit, so that the model matches different registry root keys.
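The parameter-model notation above ("*" for arbitrary content, "[a|b]" for alternatives) can be interpreted mechanically. The following Python sketch is one hypothetical way to compile such a model into a regular expression; the helper name and translation rules are assumptions, not part of the embodiment:

```python
import re

def model_to_regex(model):
    # Escape everything, then restore the two notation elements:
    # "*" becomes ".*" (any content), "[a|b]" becomes "(?:a|b)" (alternatives).
    pattern = re.escape(model)
    pattern = pattern.replace(re.escape("*"), ".*")
    pattern = re.sub(r"\\\[(.*?)\\\]",
                     lambda m: "(?:" + m.group(1).replace(r"\|", "|") + ")",
                     pattern)
    return re.compile(pattern, re.IGNORECASE)
```

With this sketch, the model "hk[cu|lm]\software\..." matches both the hkcu and hklm variants of the same registry path.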
Since neither an API nor a parameter model is a number, the APIs and parameter models may be assigned code numbers for ease of description. For example, if the preset dynamic feature set includes 150 APIs, the 150 APIs may be represented by code numbers 1-150, one for each API. If it includes 400 parameter models, the 400 parameter models may be represented by code numbers 1-400, one for each parameter model.
In the embodiment of the present application, a dynamic feature may be an API, a parameter model, or a combination of an API and a parameter model.
If the dynamic feature is an API, the frequency of occurrence of the API for the selected sample program may be used as the feature value of that API. The frequency of an API equals the ratio of the number of times the API appears during the running of the selected sample program to the total number of APIs in the preset dynamic feature set. For example, if the API with code number 1 among the 150 APIs included in the preset dynamic feature set appears 30 times during the running of the selected sample program, the feature value corresponding to the API with code number 1 is 0.2 (30/150).
If the dynamic feature is a parameter model, the frequency of occurrence of the parameter model for the selected sample program may be used as the feature value of that parameter model. The frequency of a parameter model equals the ratio of the number of times the parameter model appears in the selected sample program to the total number of parameter models in the preset dynamic feature set. For example, if the parameter model with code number 3 among the 400 parameter models included in the preset dynamic feature set appears 40 times during the running of the selected sample program, the feature value corresponding to the parameter model with code number 3 is 0.1 (40/400).
If the dynamic features are combinations of APIs and parameter models, each API may be combined with each parameter model, i.e., one combination includes one API and one parameter model. Such a combination can be expressed, for example, as "API code_parameter model code". For example, the dynamic feature identified as "2_5" represents the combination of the API with code number 2 and the parameter model with code number 5. If the dynamic feature is a combination of an API and a parameter model, the feature value of the dynamic feature is the frequency of occurrence of that combination among all combinations in the preset dynamic feature set (for convenience of description, hereinafter referred to as the frequency of the dynamic feature). This frequency equals the ratio of the number of occurrences of the combination to the total number of all combinations. For example, assuming there are 1000 combinations of APIs and parameter models in the preset dynamic feature set, and the combination identified as "2_5" occurs 10 times, the frequency of the combination "2_5" is 0.01 (10/1000), i.e., the feature value of the dynamic feature "2_5" is 0.01.
Of course, the feature value of a dynamic feature may be something other than its frequency. For example, if a certain dynamic feature can be obtained from the selected sample program, the feature value of the dynamic feature is 1; if not, the feature value of the dynamic feature is 0.
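The frequency computations in the running examples above can be sketched as follows. This is a minimal illustration; the helper name and the encoding of observed features (code numbers or "API_model" strings) are assumptions:

```python
from collections import Counter

def dynamic_feature_values(observed, feature_set_size):
    # frequency = occurrences of a dynamic feature during the run, divided by
    # the total number of features (APIs, parameter models, or combinations)
    # in the preset dynamic feature set.
    counts = Counter(observed)
    return {feat: n / feature_set_size for feat, n in counts.items()}
```

With the numbers from the text, an API seen 30 times against a 150-API set gets feature value 0.2, and the combination "2_5" seen 10 times against 1000 combinations gets 0.01.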
After the feature values of the static features and the feature values of the dynamic features of the selected sample program are acquired, S1022 may be executed.
S1022: obtain the feature value of at least one candidate fusion feature of the selected sample program according to the feature value of at least one static feature, the feature value of at least one dynamic feature, and at least one fusion operation rule of the selected sample program.
In the conventional technique, the features input to a program classification model are either static features or dynamic features. As mentioned above, although static features can reflect the structural features of a program, once the designer of the program changes the program's form, the program classification model can no longer identify the program's category. Dynamic features reflect the behavioral characteristics of a program during running; even if the program's form changes, the behavioral characteristics remain the same, which overcomes the problem of static features. However, dynamic features must be extracted by running the program in a virtual environment such as a sandbox. If the program contains an anti-virtual-machine function (for example, it detects that it is running in a sandbox and then skips certain commands), the sandbox cannot extract all the dynamic features. Therefore, the category of a program cannot be accurately determined from dynamic features alone.
To overcome this technical problem, in the embodiment of the present application, the feature value of a static feature and the feature value of a dynamic feature are fused to obtain the feature value of a fused feature, and the program classification model is trained based on the feature values of the fused features; the fused features are likewise used when the model subsequently classifies programs. The fused feature retains the advantage of the dynamic feature, namely insensitivity to the program's form, while the static feature compensates for insufficient dynamic feature extraction when the program carries an anti-virtual-machine function. Therefore, compared with the prior art, training the program classification model according to the feature values of fused features improves the accuracy of program category identification.
In specific implementation, after the feature values of the static features and dynamic features of the selected sample program are obtained, the feature value of at least one candidate fusion feature of the selected sample program is obtained according to the feature value of at least one static feature, the feature value of at least one dynamic feature, and at least one fusion operation rule of the selected sample program. A candidate fusion feature with a high degree of effectiveness in distinguishing the category to which a sample program belongs is then selected from the at least one candidate fusion feature as a target fusion feature, which is finally input to the program classification model.
In practical application, there are various implementation manners for obtaining the feature value of the at least one candidate fusion feature of the selected sample program according to the feature value of the at least one static feature, the feature value of the at least one dynamic feature, and the at least one fusion operation rule of the selected sample program.
As one possible implementation manner, the at least one candidate fusion feature includes a first candidate fusion feature, and a feature value of the first candidate fusion feature is obtained based on a corresponding first fusion operation rule. The first fusion operation rule indicates that mathematical operations, such as multiplication, addition, subtraction, and the like, are performed on the feature value of the first static feature in the preset static feature set and the feature value of the first dynamic feature in the preset dynamic feature set. Wherein the first static feature comprises one or more static features and the first dynamic feature comprises one or more dynamic features.
For example, assume the first static feature of the sample program is the executable code segment, and its feature value is the binary entropy value of the executable code segment. The first dynamic feature of the sample program is reading the application directory file, and its feature value is the frequency of the corresponding API and parameter-model combination (for simplicity, the frequency of reading the application directory file). Then the first candidate fusion feature resulting from the fusion operation on the executable code segment and the reading of the application directory file may have a feature value equal to the product of the binary entropy value of the executable code segment and the frequency of reading the application directory file.
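A minimal sketch of this multiplication rule follows. The Shannon byte-entropy computation is an assumption; the embodiment does not specify the exact entropy definition:

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    # Shannon entropy over byte values, in bits (assumed entropy definition).
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())

def fuse_by_product(static_value: float, dynamic_value: float) -> float:
    # First fusion rule: multiply a static feature value by a dynamic one.
    return static_value * dynamic_value
```

For instance, an executable code segment with byte entropy 1.0 and a read-application-directory frequency of 0.2 yields a fused feature value of 0.2.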
As another possible implementation manner, the at least one candidate fusion feature includes a second candidate fusion feature, and a feature value of the second candidate fusion feature is obtained based on a corresponding second fusion operation rule. The second fusion operation rule indicates that a logical operation, such as an and operation, or operation, nand operation, or the like, is performed on the feature value of the second static feature in the preset static feature set and the feature value of the second dynamic feature in the preset dynamic feature set. Wherein the second static feature comprises one or more static features and the second dynamic feature comprises one or more dynamic features.
For example, the second static feature of the sample program is the additional information, and its feature value is determined according to whether the additional information exists in the sample program: if the additional information exists, the feature value of the second static feature is 1; otherwise it is 0. The second dynamic feature of the sample program is a networking operation: if the sample program performs a networking operation, the feature value of the second dynamic feature is 1; otherwise it is 0. The fusion operation may then be an AND operation on the feature value of the second static feature and the feature value of the second dynamic feature; that is, if the sample program has both additional information and a networking operation, the feature value of the second candidate fusion feature is 1, and otherwise it is 0.
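The AND-type fusion of the two binary feature values above reduces to a trivial sketch (the argument names are hypothetical):

```python
def fuse_by_and(has_overlay: int, has_network_op: int) -> int:
    # Second fusion rule (AND variant): 1 only when both binary
    # feature values are 1, i.e. both behaviors are present.
    return 1 if has_overlay and has_network_op else 0
```

The other logical variants named in the rule (OR, NAND) would be analogous one-line substitutions.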
As another possible implementation manner, the at least one candidate fusion feature includes a third candidate fusion feature, and the feature value of the third candidate fusion feature is obtained based on a corresponding third fusion operation rule. The third fusion operation rule indicates that features that are identical and have identical feature values are determined from the preset static feature set and the preset dynamic feature set, and the feature value of the third candidate fusion feature is calculated according to the total number of such features.
For example, assume the preset static feature set of the sample program includes 150 APIs and the preset dynamic feature set includes 50 APIs. After the feature values of the static features and dynamic features of the sample program are obtained, suppose there are 40 APIs that appear in both sets with the same feature value; the feature value of the third candidate fusion feature may then be obtained from this total number of identical features with identical feature values, namely 40.
Optionally, calculating the feature value of the third candidate fusion feature according to the total number of identical features with identical feature values may include: first, determining the maximum of a first value and a second value, where the first value is the total number of static features in the preset static feature set and the second value is the total number of dynamic features in the preset dynamic feature set; and second, calculating the ratio of the total number of identical features with identical feature values to that maximum, and taking the ratio as the feature value of the third candidate fusion feature.
For example, in the above example, since the static feature set includes more APIs than the dynamic feature set, the maximum of the first value and the second value is 150, and the feature value of the third candidate fusion feature is approximately 0.27 (40/150).
Of course, it should be understood that this ratio is not the only way to calculate the feature value of the third candidate fusion feature; those skilled in the art can design alternatives according to specific situations.
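The third rule's ratio computation can be sketched as follows, representing each feature set as a mapping from a feature identifier to its feature value (a hypothetical encoding):

```python
def overlap_ratio(static_features: dict, dynamic_features: dict) -> float:
    # Count features present in both sets with identical feature values,
    # then divide by the size of the larger set (third fusion rule).
    shared = sum(1 for key, value in static_features.items()
                 if dynamic_features.get(key) == value)
    return shared / max(len(static_features), len(dynamic_features))
```

With 150 static APIs, 50 dynamic APIs, and 40 shared features with matching values, the result is 40/150, about 0.27 as in the example above.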
In addition, the three implementation manners do not limit the technical scheme of the present application, and a person skilled in the art can design the implementation manners according to actual situations.
After the feature value of at least one candidate fusion feature is obtained, S103 may be executed in the embodiment of the present application.
S103: for a first candidate fusion feature of the at least one candidate fusion feature, perform the following processing, and repeat it for each candidate fusion feature, so as to obtain an evaluation value of each candidate fusion feature: determine the evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category to which each sample program belongs.
Since not every candidate fusion feature can effectively distinguish the category of a sample program, in order to improve the training efficiency of the program classification model, in the embodiment of the present application a target fusion feature is automatically selected from the at least one candidate fusion feature according to the evaluation values. The magnitude of an evaluation value represents how effective the corresponding candidate fusion feature is at distinguishing the category to which a sample program belongs. Based on the evaluation values, target fusion features that effectively distinguish the categories can be selected from the candidate fusion features.
Specifically, the feature values of the first candidate fusion feature may be aggregated over the sample programs of each category, according to the category to which each sample program belongs, so as to obtain a statistical value of the first candidate fusion feature for each category, for example the median, mean, or variance of its feature values. The evaluation value of the first candidate fusion feature is then determined according to its statistical values in the respective categories; the evaluation value may be, for example, a ratio, difference, or variance across the per-category statistical values.
For example, assume the at least two categories include normal programs and malicious programs, the static features of the sample programs include the executable code segment, and the dynamic features include reading the application directory file, writing system files, deleting system files, reading application version information from the registry, and adding a self-starting item to the registry. The binary entropy value of the executable code segment of each sample program is then multiplied by the feature values of the five dynamic features respectively, to obtain the feature values of five candidate fusion features of that sample program. These feature values are aggregated over the plurality of sample programs separately for the two categories, normal programs and malicious programs, to obtain the statistical values (such as the median) in the two categories. The evaluation value of each candidate fusion feature can then be obtained from the statistical values of the five candidate fusion features in the two categories; for example, the evaluation value of each candidate fusion feature may be the ratio of the difference between the normal-program statistic and the malicious-program statistic to the larger of the two. Referring to table 1, the table represents the calculation of the evaluation values of these five candidate fusion features in a scenario where the sample programs include two categories, normal programs and malicious programs.
TABLE 1
Note: "*" indicates multiplication.
Assuming the at least two categories include three categories of malicious programs, such as trojans, downloaders, and worms, with the static features and dynamic features of the sample programs as exemplified above, the evaluation value of each of the five candidate fusion features may be the variance of the statistical values of the three categories. See table 2, which shows the calculation of the evaluation values of the five candidate fusion features described above in a scenario where the sample programs include the three categories of trojans, downloaders, and worms.
TABLE 2
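The two evaluation schemes described above, the median-based ratio for two categories (table 1) and the variance across per-category statistics for three categories (table 2), can be sketched as follows. The use of the median and of the population variance are assumptions consistent with the examples, not prescribed by the embodiment:

```python
import statistics

def binary_evaluation(normal_values, malicious_values):
    # Ratio of the difference between the class medians to the larger median
    # (table-1 style evaluation; absolute value taken as a simplification).
    m_n = statistics.median(normal_values)
    m_m = statistics.median(malicious_values)
    return abs(m_n - m_m) / max(m_n, m_m)

def multiclass_evaluation(per_class_values):
    # Variance of the per-category statistical values (table-2 style).
    medians = [statistics.median(v) for v in per_class_values.values()]
    return statistics.pvariance(medians)
```

For example, if the normal-program median of a fused feature is 1.0 and the malicious-program median is 0.1, the binary evaluation value is 0.9.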
For another example, assume the at least two categories include normal programs and malicious programs, the static features of the sample programs include the additional information, and the feature value of this static feature is the ratio of the size of the additional information to the size of the entire file; the dynamic features include network operation behavior, registry auto-start item operations, writing executable files, and loading system DLLs. The ratio of the size of the additional information to the size of the entire file of each sample program is then multiplied by the feature values of the four dynamic features respectively, to obtain the feature values of four candidate fusion features of that sample program. These feature values are aggregated over the plurality of sample programs separately for the two categories, normal programs and malicious programs, to obtain the statistical values (such as the median) in the two categories. The evaluation value of each candidate fusion feature can then be obtained from the statistical values of the four candidate fusion features in the two categories; for example, the evaluation value of each candidate fusion feature may be the difference between the normal-program statistic and the malicious-program statistic. Referring to table 3, the table represents the calculation of the evaluation values of these four candidate fusion features in a scenario where the sample programs include two categories, normal programs and malicious programs.
TABLE 3
Assuming the at least two categories include three categories of malicious programs, such as trojans, downloaders, and worms, with the static features and dynamic features of the sample programs as exemplified above, the evaluation value of each of the four candidate fusion features may be the variance of the statistical values of the three categories. See table 4, which shows the calculation of the evaluation values of the above four candidate fusion features in a scenario where the sample programs include the three categories of trojans, downloaders, and worms.
TABLE 4
S104: select a target fusion feature from the at least one candidate fusion feature according to the evaluation value of each candidate fusion feature.
In the embodiment of the present application, the evaluation value of the selected target fusion feature indicates a higher degree of effectiveness than the evaluation values of the other candidate fusion features in the at least one candidate fusion feature. The number of target fusion features may be one or more.
In a binary-classification scenario, i.e., when the sample programs include two categories, normal programs and malicious programs, take table 1 as an example: the evaluation value of a candidate fusion feature is the ratio of the difference between the median for normal programs and the median for malicious programs to the larger of the two medians. Empirically, the executable code segment of a normal program is more uniform than that of a malicious program, so the binary entropy value of a normal program's executable code segment is larger than that of a malicious program's; moreover, relative to malicious programs, normal programs read the application directory file and read application version information from the registry more frequently. Therefore, in the scenario of table 1, the higher the ratio of a candidate fusion feature, the more effectively that candidate fusion feature distinguishes the category to which a sample program belongs; conversely, the lower the ratio, the less effective the candidate fusion feature is in distinguishing the category.
Therefore, in practical applications, a threshold value may be designed, and candidate fusion features having evaluation values greater than or equal to the threshold value may be used as target fusion features.
For example, the threshold value is 0.6. In table 1, the evaluation values corresponding to the first candidate fusion feature, the fourth candidate fusion feature and the fifth candidate fusion feature are respectively 0.888, 0.8864 and 0.6418, which are all greater than 0.6, so that these three candidate fusion features can be used as target fusion features.
Taking table 3 as an example, the evaluation value of each candidate fusion feature is the difference between the normal-program statistic and the malicious-program statistic; a larger difference indicates that the candidate fusion feature distinguishes the categories more effectively. Empirically, a malicious program may append executable code in the additional information, causing a high ratio of the size of the additional information to the overall file size, and a malicious program is also more likely to perform network operations. So in the scenario of table 3, the evaluation value of the first candidate fusion feature is high, up to 0.145. Assuming the threshold is 0.05, only the first candidate fusion feature in table 3 exceeds the threshold, so it can be used as the target fusion feature.
In a multi-classification scenario, i.e., when the sample programs include multiple categories of malicious programs, take table 2 as an example: the evaluation value of each candidate fusion feature may be the variance of the statistical values of the multiple categories. The larger the variance, the more effectively the candidate fusion feature distinguishes the category to which a sample program belongs. Thus, if the threshold is 20, the evaluation value (51.014) of the second candidate fusion feature and the evaluation value (38.592) of the fifth candidate fusion feature in table 2 are both above the threshold, so these two candidate fusion features can be set as target fusion features. Their evaluation values are higher because the frequency with which a worm writes system files and adds auto-start items to the registry is high relative to the other two categories.
Since table 4 is similar to table 2, the process of selecting the target fusion feature is not described in detail here.
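The selection logic in the two scenarios above can be sketched as follows. This is a minimal illustration with made-up numbers; the feature names, the use of the absolute difference for the two-class case, and population variance for the multi-class case are assumptions, since the exact definitions come only from the tables.

```python
from statistics import pvariance

def select_two_class(stats, threshold):
    """Two-class case (table 3): the evaluation value is the difference between
    the normal-program statistic and the malicious-program statistic."""
    return [name for name, (normal, malicious) in stats.items()
            if abs(normal - malicious) > threshold]

def select_multi_class(stats, threshold):
    """Multi-class case (table 2): the evaluation value is the variance of the
    per-category statistics."""
    return [name for name, values in stats.items()
            if pvariance(values) > threshold]

# Made-up statistics loosely echoing the tables in the text.
two_class = {"feature_1": (0.150, 0.005), "feature_2": (0.020, 0.010)}
multi_class = {"feature_1": (1.0, 2.0, 3.0), "feature_2": (10.0, 2.0, 25.0)}

targets_a = select_two_class(two_class, 0.05)    # only feature_1 exceeds 0.05
targets_b = select_multi_class(multi_class, 20)  # only feature_2's variance exceeds 20
```

In both cases, the candidate fusion features whose evaluation values exceed the threshold become the target fusion features.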
After the target fusion feature is obtained, S105 may be performed.
S105: and training to obtain a program classification model according to the characteristic value of the target fusion characteristic in each sample program.
In the embodiment of the present application, the trained classification model is a model for classifying a program whose category is unknown. S101 to S105 describe the process of training a program classification model; when the program classification model works, the feature value of a target fusion feature of a category-unknown program is input, and the category to which the program belongs is output. The specific steps are described in detail below.
The program classification model may be trained by a machine-learning algorithm, for example, the Random Forest (RF) algorithm or the Artificial Neural Network (ANN) algorithm.
As mentioned above, because the target fusion features combine the advantages of static features with those of dynamic features, training the program classification model on the feature values of the target fusion features of the sample programs achieves a better training effect than the conventional technique of training only on static features or only on dynamic features; that is, the resulting program classification model classifies category-unknown programs more accurately.
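As one concrete possibility for step S105, the model could be a scikit-learn random forest trained on the fusion-feature values. The feature matrix, labels, and hyperparameters below are placeholders, not values from the embodiment.

```python
from sklearn.ensemble import RandomForestClassifier

# Each row is the target-fusion-feature vector of one sample program;
# labels: 0 = normal program, 1 = malicious program (placeholder data).
X_train = [[0.145, 0.90], [0.150, 0.80], [0.010, 0.10], [0.005, 0.20]]
y_train = [1, 1, 0, 0]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classify a category-unknown program from its fusion-feature values.
pred = int(model.predict([[0.140, 0.85]])[0])
```

The same interface works for the binary and the multi-class models described later; only the label set changes.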
In addition, to make the program classification model classify category-unknown programs more accurately, the model may be trained on the feature value of the target fusion feature, the feature values of the static features, and the feature values of the dynamic features of each sample program.
Referring to fig. 4, a flowchart of a program classification method provided in an embodiment of the present application is shown.
The program classification method provided by the embodiment of the application comprises the following steps:
S201: and acquiring the target program.
In the embodiment of the present application, the target program is a program whose category is unknown. The target program can be acquired in various ways, such as downloading from an open source website, and the like.
S202: and acquiring a characteristic value of each static characteristic and a characteristic value of each dynamic characteristic of the target program according to a preset static characteristic set comprising at least one static characteristic and a preset dynamic characteristic set comprising at least one dynamic characteristic.
In an embodiment of the present application, a preset static feature set and a preset dynamic feature set may be predetermined, where the preset static feature set includes at least one static feature, and the preset dynamic feature set includes at least one dynamic feature. The static features embody the structural characteristics of the target program, and the dynamic features embody the behavior of the target program in the running process. The preset static feature set may be the preset static feature set used in the training stage described above, and the preset dynamic feature set may likewise be the preset dynamic feature set used in the training stage. For the types and acquisition manners of the static and dynamic features, refer to the description of the static and dynamic features of the sample program in fig. 3, which is not repeated here.
S203: the method comprises the steps of obtaining a characteristic value of at least one target fusion characteristic of a target program, wherein the characteristic value of the at least one target fusion characteristic of the target program is obtained based on a corresponding fusion operation rule, and the fusion operation rule indicates that fusion operation is performed on the characteristic value of a specified static characteristic in a preset static characteristic set and the characteristic value of a specified dynamic characteristic in a preset dynamic characteristic set.
In the embodiment of the application, the feature value of the target fusion feature is obtained by executing fusion operation according to the feature value of the specified static feature in the preset static feature set and the feature value of the specified dynamic feature in the preset dynamic feature set. For a specific fusion operation, please refer to the description related to the fusion operation executed on the feature value of the specified static feature and the feature value of the specified dynamic feature of the sample program in fig. 3, which is not described herein again.
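The fusion operation rule itself is not pinned down in this excerpt. One plausible rule, shown purely as an assumption, multiplies a normalized static feature (the ratio of the additional-information size to the overall file size) by a dynamic feature (whether network operations were observed):

```python
def fuse(static_value: float, dynamic_value: float) -> float:
    """Hypothetical fusion operation rule: the product of the specified
    static feature value and the specified dynamic feature value.
    The actual rule is defined by the embodiment, not by this sketch."""
    return static_value * dynamic_value

# Static feature: the additional information occupies 30% of the file.
overlay_ratio = 12288 / 40960           # 0.3
# Dynamic feature: 1.0 because network operations were observed at run time.
network_flag = 1.0

fused_value = fuse(overlay_ratio, network_flag)
```

Under this rule the fused value is high only when both the structural anomaly and the run-time behavior are present, which matches the intuition given for table 3.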
S204: and inputting the characteristic value of at least one target fusion characteristic of the target program into the program classification model to obtain a classification result of the target program.
According to the method and the device, the feature value of the target fusion feature, obtained by performing a fusion operation on the feature value of a specified static feature and the feature value of a specified dynamic feature of the target program, is input into the program classification model, rather than only static feature values or only dynamic feature values. This combines two advantages: dynamic features can identify a program whose form has changed, while static features, which reflect the structural characteristics of the program, can identify a program equipped with an anti-virtual-environment function. The accuracy with which the program classification model classifies the target program is thereby effectively improved.
The technical solution provided by the embodiment of the present application is described below by taking an application scenario as an example, and completely describing a process from training of a program classification model to classification of a target program.
The program classification method provided by the embodiment of the application comprises the following steps:
S301: a plurality of sample programs with pre-calibrated categories are obtained, where the sample programs include normal programs and malicious programs.
S302: s102 to S105 are executed on a plurality of sample programs to obtain a first program classification model.
The first program classification model is a binary classification model that can classify a target program as a normal program or a malicious program.
S303: and executing S102 to S105 on the malicious programs in the plurality of sample programs to obtain a second program classification model.
The malicious programs in the plurality of sample programs include a plurality of malicious program categories, so the second program classification model is a multi-classification model which can determine the target program as one of the plurality of malicious program categories.
S304: and acquiring the target program with unknown category.
S305: and executing S202 and S203 to the target program to obtain the characteristic value of the target fusion characteristic of the target program.
S306: and inputting the characteristic value of the target fusion characteristic of the target program into the first program classification model to obtain a classification result that the target program is a malicious program or a normal program.
S307: and when the classification result of the target program is the malicious program, inputting the characteristic value of the target fusion characteristic of the target program into the second program classification model to obtain the classification result of the target program which is a specific one of a plurality of malicious program classes.
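The two-stage decision of S306 and S307 can be sketched with stub models standing in for the trained first and second program classification models. The label encoding (0 = normal, 1 = malicious) and the category name are assumptions for illustration only.

```python
class StubModel:
    """Stands in for a trained classifier exposing a scikit-learn-style predict()."""
    def __init__(self, answer):
        self.answer = answer

    def predict(self, rows):
        return [self.answer for _ in rows]

def classify(fusion_features, first_model, second_model):
    """S306: binary decision; S307: refine a malicious verdict into a category."""
    if first_model.predict([fusion_features])[0] == 0:  # 0 = normal (assumed encoding)
        return "normal"
    return second_model.predict([fusion_features])[0]   # a malicious-program category

result = classify([0.145, 0.9], StubModel(1), StubModel("worm"))
```

The second model is consulted only when the first model's verdict is "malicious", which keeps the multi-class model off the normal-program path.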
S308: if the second program classification model cannot determine the class to which the target program belongs, that is, the target program does not belong to the multiple malicious program classes in the second program classification model, clustering a plurality of target programs according to the characteristic value of at least one target fusion characteristic of each target program in the plurality of target programs, and obtaining the respective classes of the plurality of target programs in a clustering manner.
There are various clustering algorithms, such as the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, and the present application is not specifically limited in this respect.
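Using scikit-learn's DBSCAN as one such algorithm, the clustering of S308 might look like the following. The feature vectors and the `eps`/`min_samples` values are illustrative, not taken from the embodiment.

```python
from sklearn.cluster import DBSCAN

# Fusion-feature vectors of target programs that the second (multi-class)
# model could not assign to any known malicious-program category.
unknown_programs = [
    [0.10, 0.90], [0.11, 0.88], [0.12, 0.91],  # plausibly one new category
    [0.80, 0.10], [0.82, 0.12], [0.79, 0.11],  # plausibly another
]

labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(unknown_programs)
# labels assigns a cluster id to each program; -1 would mark noise points.
```

Each resulting cluster can then be labeled as a new category and fed back into retraining, as S309 describes.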
S309: if the category of the target program is obtained through clustering, the category of the target program can be labeled, and the second program classification model is trained again, so that the new second program classification model can identify the program of the new category.
Correspondingly, referring to fig. 5, an embodiment of the present application further provides an apparatus for training a program classification model, where the apparatus includes:
a receiving unit 501, configured to receive a plurality of input sample programs, where a sample program is a program whose category has been calibrated in advance, and the plurality of sample programs belong to at least two different categories;
a first processing unit 502, configured to select a sample program from the multiple sample programs, and perform the following processing to obtain a feature value of at least one candidate fusion feature of the selected sample program until each sample program of the multiple sample programs is processed:
acquiring a characteristic value of each static characteristic and a characteristic value of each dynamic characteristic of the selected sample program according to a preset static characteristic set comprising at least one static characteristic and a preset dynamic characteristic set comprising at least one dynamic characteristic, wherein the static characteristic reflects the structural characteristics of the selected sample program, and the dynamic characteristic reflects the behavior of the selected sample program in the operation process;
obtaining a feature value of at least one candidate fusion feature of the selected sample program according to the feature value of at least one static feature, the feature value of at least one dynamic feature and at least one fusion operation rule of the selected sample program, where the feature value of each candidate fusion feature of the at least one candidate fusion feature is obtained based on the corresponding fusion operation rule, and the fusion operation rule indicates that a fusion operation is performed on the feature value of a specified static feature in the preset static feature set and the feature value of a specified dynamic feature in the preset dynamic feature set;
a second processing unit 503, configured to perform the following processing for a first candidate fusion feature of the at least one candidate fusion feature, and so on, to obtain an evaluation value of each candidate fusion feature: determining an evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category of each sample program, where the magnitude of the evaluation value represents how effectively the first candidate fusion feature distinguishes the category to which a sample program belongs;
a selecting unit 504, configured to select a target fusion feature from the at least one candidate fusion feature according to the evaluation value of each candidate fusion feature, where the effectiveness degree indicated by the evaluation value of the target fusion feature is greater than that indicated by the evaluation values of the other candidate fusion features in the at least one candidate fusion feature;
and the training unit 505 is configured to train to obtain a program classification model according to the feature value of the target fusion feature in each sample program.
The specific work flow of the apparatus shown in fig. 5 can be referred to the related description in the foregoing embodiment of the program classification model training method.
Referring to fig. 6, an embodiment of the present application further provides a program classifying device, where the device includes:
a program acquisition unit 601 for acquiring a target program;
a first feature value obtaining unit 602, configured to obtain a feature value of each static feature and a feature value of each dynamic feature of the target program according to a preset static feature set including at least one static feature and a preset dynamic feature set including at least one dynamic feature; the static characteristics are characteristics which embody the structural characteristics of the target program, and the dynamic characteristics are behavior characteristics which embody the target program in the running process;
a second feature value obtaining unit 603, configured to obtain a feature value of at least one target fusion feature of the target program, where the feature value of the at least one target fusion feature of the target program is obtained based on a corresponding fusion operation rule, and the fusion operation rule indicates that a fusion operation is performed on a feature value of a specified static feature in the preset static feature set and a feature value of a specified dynamic feature in the preset dynamic feature set;
the classifying unit 604 is configured to input a feature value of at least one target fusion feature of the target program into the program classification model, so as to obtain a classification result of the target program.
The specific work flow of the apparatus shown in fig. 6 can be referred to the related description in the foregoing embodiment of the program classification method.
Referring to fig. 7, an embodiment of the present application further provides a program classification model training apparatus, including:
a processor 710, a memory 720, and a network interface 730, where the processor 710, the memory 720, and the network interface 730 are interconnected by a bus 740.
The memory 720 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), or a compact disc read-only memory (CD-ROM).
The processor 710 may be one or more Central Processing Units (CPUs), and in the case that the processor 710 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The network interface 730 may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) interface or a Gigabit Ethernet (GE) interface; the network interface 730 may also be a wireless interface.
The network interface 730 is configured to receive a plurality of input sample programs, where a sample program is a program whose category has been calibrated in advance, and the plurality of sample programs belong to at least two different categories.
A memory 720 for storing program code;
a processor 710 for reading the program code stored in the memory 720, performing the following operations:
selecting a sample program from the plurality of sample programs, and performing the following processing to obtain a feature value of at least one candidate fusion feature of the selected sample program until each sample program in the plurality of sample programs is processed:
acquiring a characteristic value of each static characteristic and a characteristic value of each dynamic characteristic of the selected sample program according to a preset static characteristic set comprising at least one static characteristic and a preset dynamic characteristic set comprising at least one dynamic characteristic, wherein the static characteristic reflects the structural characteristics of the selected sample program, and the dynamic characteristic reflects the behavior of the selected sample program in the operation process;
obtaining a feature value of at least one candidate fusion feature of the selected sample program according to the feature value of at least one static feature, the feature value of at least one dynamic feature and at least one fusion operation rule of the selected sample program, where the feature value of each candidate fusion feature of the at least one candidate fusion feature is obtained based on the corresponding fusion operation rule, and the fusion operation rule indicates that a fusion operation is performed on the feature value of a specified static feature in the preset static feature set and the feature value of a specified dynamic feature in the preset dynamic feature set;
for a first candidate fusion feature in the at least one candidate fusion feature, performing the following processing, and so on, to obtain an evaluation value of each candidate fusion feature: determining an evaluation value of the first candidate fusion feature according to the feature value of the first candidate fusion feature in each sample program and the category of each sample program, where the magnitude of the evaluation value represents how effectively the first candidate fusion feature distinguishes the category to which a sample program belongs;
selecting a target fusion feature from the at least one candidate fusion feature according to the evaluation value of each candidate fusion feature, where the effectiveness degree indicated by the evaluation value of the target fusion feature is greater than that indicated by the evaluation values of the other candidate fusion features in the at least one candidate fusion feature;
and training to obtain a program classification model according to the characteristic value of the target fusion characteristic in each sample program.
The implementation of the device shown in fig. 7 can be seen in the relevant description in fig. 3.
Referring to fig. 8, an embodiment of the present application further provides a program classification device, including:
a processor 810, a memory 820, and a network interface 830, the processor 810, the memory 820, and the network interface 830 being interconnected by a bus 840.
The memory 820 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), or a compact disc read-only memory (CD-ROM).
The processor 810 may be one or more Central Processing Units (CPUs), and in the case that the processor 810 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The network interface 830 may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) interface or a Gigabit Ethernet (GE) interface; the network interface 830 may also be a wireless interface.
A network interface 830 for acquiring a target program;
a memory 820 for storing program code;
a processor 810 for reading the program code stored in the memory 820 and performing the following operations:
acquiring a characteristic value of each static characteristic and a characteristic value of each dynamic characteristic of the target program according to a preset static characteristic set comprising at least one static characteristic and a preset dynamic characteristic set comprising at least one dynamic characteristic; the static characteristics are characteristics which embody the structural characteristics of the target program, and the dynamic characteristics are behavior characteristics which embody the target program in the running process;
acquiring a characteristic value of at least one target fusion characteristic of a target program, wherein the characteristic value of the at least one target fusion characteristic of the target program is obtained based on a corresponding fusion operation rule, and the fusion operation rule indicates that fusion operation is performed on the characteristic value of a specified static characteristic in a preset static characteristic set and the characteristic value of a specified dynamic characteristic in a preset dynamic characteristic set;
and inputting the characteristic value of at least one target fusion characteristic of the target program into the program classification model to obtain a classification result of the target program.
The implementation of the device shown in fig. 8 can be seen in the relevant description in fig. 4.
Embodiments of the present application also provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the above method for training a program classification model.
Embodiments of the present application also provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the above program classification method.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or various other media capable of storing program code.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.