
CN112395612B - Malicious file detection method and device, electronic equipment and storage medium - Google Patents

Malicious file detection method and device, electronic equipment and storage medium

Info

Publication number
CN112395612B
CN112395612B (application number CN201910755713.1A)
Authority
CN
China
Prior art keywords
sample
file
target
api
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910755713.1A
Other languages
Chinese (zh)
Other versions
CN112395612A (en)
Inventor
程强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201910755713.1A priority Critical patent/CN112395612B/en
Priority to PCT/CN2020/108614 priority patent/WO2021027831A1/en
Publication of CN112395612A publication Critical patent/CN112395612A/en
Application granted granted Critical
Publication of CN112395612B publication Critical patent/CN112395612B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 - Detecting local intrusion or implementing counter-measures
    • G06F 21/56 - Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F 21/562 - Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a malicious file detection method and apparatus, an electronic device, and a storage medium, and relates to the technical field of network security. The malicious file detection method comprises: encoding the acquired API behaviors and API behavior parameters of a target file to obtain a target coding set corresponding to the target file; vectorizing the target coding set to obtain a target behavior vector; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in a black-and-white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distances between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set. The disclosed method, apparatus, electronic device and storage medium retain the behavior characteristics, improve the richness of the training input, and reduce the false alarm rate of the machine learning model.

Description

Malicious file detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method and apparatus for detecting malicious files, an electronic device, and a storage medium.
Background
Major network security events such as Operation Aurora, Stuxnet, Night Dragon, and the RSA SecurID token seed theft brought into public view a class of attacks characterized by advanced attack methods, long duration, and clearly defined targets, internationally referred to as Advanced Persistent Threat (APT) attacks. Such attacks use traditional viruses and Trojans as the attack means and deliver the initial compromise through social engineering channels such as e-mail, sending the user a carefully constructed file that exploits a 0-day vulnerability. Once the user opens the file, the vulnerability is triggered, attack code is injected into the user's system, and other viruses, Trojans, and the like are subsequently downloaded to enable long-term latency. Traditional firewalls, enterprise antivirus software, and the like have very limited ability to detect and block such malicious files or code that carry no known signatures.
APT attack detection and defense technology has become a research hotspot of next-generation network security, and the key technical difficulty is how to rapidly detect attacks that exploit unknown vulnerabilities. A series of studies at home and abroad have proposed various methods, among which dynamic behavior analysis based on files or samples is representative. This technology dynamically analyzes, in a controlled environment such as a sandbox or a virtual machine, the behavior of suspicious sample files entering the protected system during the malicious-code implantation stage of an APT attack, identifies malicious behaviors and attack code, blocks the implantation of malicious code, and prevents subsequent destructive behavior. Because it can detect and block an attack before the attack enters the network, the protected system is shielded from the attack. The judgment of whether a code file is malicious relies on a behavior feature library that stores malicious behavior features extracted through manual code analysis, and the update speed and accuracy of the feature library rules determine the success rate of malicious code detection.
Because malicious code mutates rapidly, to meet practical detection requirements people have tried, through machine learning, to train models on large numbers of malicious samples so that the behavior patterns of malware are learned and the maliciousness of software is judged automatically. However, conventional machine learning methods are strongly tied to the distribution of the samples: sample imbalance leads to poor detection accuracy, the methods are highly sensitive to numerical data types, and semantically rich behavior features are difficult to distinguish and identify.
Disclosure of Invention
In order to solve the technical problems, the embodiment of the application is realized as follows:
The embodiment of the application provides a malicious file detection method, which comprises the following steps:
Coding the acquired API behaviors and API behavior parameters of the target file to obtain a target coding set corresponding to the target file;
vectorizing the target coding set to obtain a target behavior vector;
Determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set;
and if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to the different types of black samples in the black-and-white sample set.
Optionally, the encoding the obtained API behaviors and API behavior parameters of the target file to obtain a target encoding set corresponding to the target file includes:
Coding the API behaviors to obtain a first coding set;
coding the API behavior parameters to obtain a second coding set;
And carrying out unified dimension combination on the first encoding set and the second encoding set to obtain the normalized target encoding set.
Optionally, the API behavior parameter is a directory path, and the encoding the API behavior parameter to obtain a second encoding set includes:
Carrying out directory layering on the API behavior parameters;
coding the API behavior parameters after directory layering to obtain the second coding set;
and when the path length of the API behavior parameters exceeds a preset length, adjusting the path length of the API behavior parameters to the preset length and then performing the directory layering.
Optionally, the encoding the API behaviors to obtain a first encoding set includes:
Hexadecimal encoding is carried out on the API behaviors to obtain the first encoding set with the preset encoding length;
the step of encoding the API behavior parameters to obtain a second encoding set comprises the following steps:
carrying out hash coding on the API behavior parameters to obtain the second coding set;
The step of performing unified dimension combination on the first encoding set and the second encoding set to obtain the normalized target encoding set includes:
and converting the codes in the second code set into hexadecimal codes, and combining the codes in the first code set with the codes in the converted second code set in a one-to-one correspondence manner to obtain the target code set.
Optionally, the determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set includes:
Calculating a first average distance between the target behavior vector and a sample behavior vector corresponding to a black sample in the black-and-white sample set;
Calculating a second average distance between the target behavior vector and a sample behavior vector corresponding to a white sample in the black and white sample set;
And when the first average distance is greater than or equal to the second average distance, judging that the target file is a malicious file.
Optionally, the determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to the different kinds of black samples in the black-and-white sample set includes:
Calculating a third average distance between the target behavior vector and a sample behavior vector corresponding to different kinds of black samples in the black-and-white sample set;
When a third average distance that does not exceed a preset critical value exists, selecting the malicious category of the black sample corresponding to the minimum value among the third average distances as the malicious category of the target file;
otherwise, dividing the malicious category of the target file into new malicious categories.
Optionally, the method further comprises:
and acquiring the API behaviors and the API behavior parameters after the external analysis engine runs the target file.
Optionally, the API behavior is loading a system DLL file, writing a temporary file, or modifying a registry.
Optionally, the method further comprises:
Acquiring sample API behaviors and sample API behavior parameters of a sample file in a training sample set, wherein the sample file comprises a black sample file and a white sample file, the black sample file comprises at least one of a virus, a Trojan, a worm and ransomware, and the white sample file is a normal file;
coding the acquired sample API behaviors and the sample API behavior parameters to obtain a sample coding set corresponding to the training sample set;
determining the weight corresponding to each code according to the frequency with which each code in the sample coding set occurs within the same sample file and across different sample files;
And carrying out vectorization processing on a sample code set corresponding to the sample file in the training sample set according to the weight corresponding to each code, and obtaining sample behavior vectors in the black and white sample set.
Optionally, the type of the sample file is a PE file, a PDF file, or a text file.
The embodiment of the application also provides a malicious file detection device, which comprises:
The coding module is used for coding the acquired API behaviors and API behavior parameters of the target file to obtain a target coding set corresponding to the target file;
The vectorization module is used for carrying out vectorization processing on the target coding set to obtain a target behavior vector;
a determining module for determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set, and
And if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to the different types of black samples in the black-and-white sample set.
The embodiment of the application also provides an electronic device, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the bus;
a memory for storing a computer program;
A processor for executing a program stored in the memory to implement any of the method steps described above.
Embodiments of the present application also provide a computer-readable storage medium having stored therein a computer program which, when executed by a processor, performs any of the above-described method steps.
The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:
The scheme provided by the embodiments of the present application retains the behavior characteristics and improves the richness of the training input, thereby reducing the false alarm rate of the machine learning model.
While distinguishing whether the target file is malicious, the scheme provided by the embodiments of the present application can also identify the category of the target file and, through the distances, discover malicious files of new categories.
Compared with traditional schemes that only support analysis models for executable PE files, the scheme provided by the embodiments of the present application also supports other file types such as Word and PDF.
Compared with malicious file detection methods based on deep learning networks, which are more complex, the scheme provided by the embodiments of the present application reduces weight and parameter tuning and lessens the dependence of the behavior-count-based statistical model on the sample distribution.
In addition, the scheme provided by the embodiments of the present application generalizes better in the presence of sample imbalance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a flowchart of a malicious file detection method according to a preferred embodiment of the present application.
FIG. 2 is a flowchart of another malicious file detection method according to the preferred embodiment of the present application.
Fig. 3 is a block diagram of an electronic device according to a preferred embodiment of the application.
Fig. 4 is a block diagram illustrating a malicious file detecting apparatus according to a preferred embodiment of the present application.
Reference numerals: 100 - electronic device; 110 - processor; 120 - internal bus; 130 - network interface; 140 - memory; 150 - malicious file detection apparatus; 151 - encoding module; 152 - vectorization module; 153 - determination module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, which is a flowchart of a malicious file detection method according to an embodiment of the present application, the method is applied to an electronic device and is used to detect malicious files such as viruses, Trojans, worms, and ransomware. The flow shown in fig. 1 is described in detail below.
Step S101, the acquired API behaviors and API behavior parameters of the target file are encoded, and a target encoding set corresponding to the target file is obtained.
In the embodiment of the present application, the API behavior may be, but is not limited to, starting a process under an operating system such as Windows, Linux, or Unix, loading a system DLL file, writing a temporary file, or modifying a registry. The API behavior parameters refer to the parameters contained in the command, such as a directory path. The type of the target file may be, but is not limited to, a PE file, a PDF file, a text file, and the like. The API behaviors of the same target file correspond one-to-one with the API behavior parameters.
Before the API behaviors and API behavior parameters of the target file are encoded, the target file to be detected is first run by an external analysis engine, which may be, but is not limited to, a sandbox, a virtual machine, and the like. After the target file has been run, its API behaviors and API behavior parameters are acquired and then encoded and subjected to unified dimension combination to obtain the target coding set corresponding to the target file.
During encoding, each API behavior and its corresponding API behavior parameter are encoded separately, and the resulting codes are then subjected to unified dimension combination. Specifically, hexadecimal encoding is performed on the API behaviors with a preset encoding length to obtain a first coding set of the preset length. At the same time, the API behavior parameters are encoded to obtain a second coding set. The codes in the second coding set are then converted into hexadecimal codes consistent with the coding format of the first coding set, and the codes in the first coding set are combined one-to-one with the codes in the converted second coding set to obtain the normalized target coding set.
In the embodiment of the present application, hexadecimal coding is adopted for the API behavior coding; it can be appreciated that binary, octal, decimal coding, or the like may be adopted in other embodiments. Whatever numeral system is used, if the codes in the first coding set differ in type from the codes in the second coding set, the codes in the second coding set need to be converted into the same type as those in the first coding set, or vice versa. If the codes in the two sets are already of the same type, no conversion is necessary.
The API behavior parameters may be encoded by, but not limited to, hashing, etc., and in the embodiment of the present application, hash encoding is used to encode the API behavior parameters.
For convenience of explanation, one API behavior and its corresponding API behavior parameter of the target file are taken as an example. Assuming that the first coding set corresponding to the API behavior contains the hexadecimal code 0200, and the second coding set corresponding to the API behavior parameter contains the decimal code 67574613, the code in the second coding set can be converted into the hexadecimal code 4071B55, and the code in the combined target coding set is the hexadecimal code 02004071B55.
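For illustration only, a minimal Python sketch of this encode-and-combine step is given below; the behavior-code table, the choice of MD5 as the hash, the 8-hex-digit parameter width, and all function names are assumptions rather than part of the embodiment.

```python
import hashlib

# Hypothetical lookup table: each API behavior is assigned a fixed-length
# (here 4-digit) hexadecimal code, as in the "0200" example above.
BEHAVIOR_CODES = {"write_temp_file": "0200"}

def encode_behavior(api_behavior: str) -> str:
    """Return the preset-length hexadecimal code of an API behavior."""
    return BEHAVIOR_CODES[api_behavior]

def encode_parameter(api_parameter: str, hex_digits: int = 8) -> str:
    """Hash the API behavior parameter and render it as a fixed-width
    hexadecimal code (the example above converts a decimal hash to hex)."""
    digest = hashlib.md5(api_parameter.encode("utf-8")).hexdigest()
    return digest[:hex_digits].upper()

def combine(api_behavior: str, api_parameter: str) -> str:
    """One-to-one combination of a behavior code and its parameter code,
    yielding one entry of the normalized target coding set."""
    return encode_behavior(api_behavior) + encode_parameter(api_parameter)

# e.g. combine("write_temp_file", "c:/system/host") -> "0200" + 8 hex digits
```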
Furthermore, in order to improve the accuracy of target file detection, the scheme provided by the embodiment of the present application additionally presets the path length of the API behavior parameters; when the path length of an API behavior parameter exceeds the preset length, the path length is adjusted to the preset length during encoding before directory layering is performed. The path length of an API behavior parameter can be adjusted by appending a fixed tail parameter "undefine".
For example, if the preset longest coding path is c:/system and a certain API behavior parameter is c:/system/host/, the API behavior parameter can be adjusted to c:/system/undefine and then encoded. In this way the resulting codes have a uniform length, which facilitates feature extraction, avoids low feature discrimination caused by overly broad feature descriptions, and thus improves the accuracy of subsequent malicious file detection.
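As a hedged sketch of this path-length adjustment, the following assumes that "path length" means directory depth and that the fixed tail parameter is the literal string "undefine" from the example; everything else is illustrative.

```python
def layer_directory(path: str, preset: str = "c:/system") -> list:
    """Split a directory path into layers; if the path is longer than the
    preset longest coding path, truncate it to the preset prefix plus the
    fixed tail parameter 'undefine' before layering."""
    preset_parts = preset.rstrip("/").split("/")
    parts = path.rstrip("/").split("/")
    if len(parts) > len(preset_parts):
        parts = preset_parts + ["undefine"]   # c:/system/host/ -> c:/system/undefine
    return parts

# layer_directory("c:/system/host/") -> ['c:', 'system', 'undefine']
# layer_directory("c:/system")       -> ['c:', 'system']
```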
Step S102, vectorization processing is carried out on the target coding set, and target behavior vectors are obtained.
In the embodiment of the present application, a sample code set is established in advance according to the API behaviors and API behavior parameters of the black and white samples in the black-and-white sample set, and each code in the sample code set corresponds to a respective weight. A black sample includes at least one of a virus, a Trojan, a worm and ransomware, and a white sample file is a normal file.
When the target coding set is vectorized, each code in the target coding set is assigned the weight of that code, and codes that do not appear in the sample code set are assigned a weight of 0; in this way the target behavior vector corresponding to the target coding set is obtained.
For example, suppose the target coding set is {A1, A2, B1, A3, C1, A4}, the weights corresponding to the codes A1, A2, A3 and A4 in the sample code set are a1, a2, a3 and a4 respectively, and the codes B1 and C1 do not appear in the sample code set. The target behavior vector obtained after vectorizing the target coding set is then (a1, a2, 0, a3, 0, a4).
It will be appreciated that in other embodiments, the weight assigned to codes that do not appear in the sample code set may be another value, such as 1.
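A minimal sketch of this vectorization step, assuming the per-code weights have already been computed (see step S203 below) and that unseen codes receive weight 0; the dictionary contents and names are illustrative only.

```python
# Hypothetical per-code weights taken from the sample code set.
WEIGHTS = {"A1": 0.7, "A2": 0.3, "A3": 0.5, "A4": 0.2}

def vectorize(target_coding_set: list) -> list:
    """Replace each code of the target coding set by its weight; codes that
    do not appear in the sample code set contribute 0."""
    return [WEIGHTS.get(code, 0.0) for code in target_coding_set]

# vectorize(["A1", "A2", "B1", "A3", "C1", "A4"]) -> [0.7, 0.3, 0.0, 0.5, 0.0, 0.2]
```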
Step S103, determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set.
The black-and-white sample set comprises sample behavior vectors corresponding to black samples and sample behavior vectors corresponding to white samples. A black sample includes at least one of a virus, a Trojan, a worm and ransomware, and a white sample file is a normal file; in the embodiment of the present application, the black samples include viruses, Trojans, worms, and ransomware, so as to ensure that various types of malicious files can be detected. The sample behavior vectors are obtained by encoding the API behaviors and API behavior parameters of the black and white samples in the black-and-white sample set and then vectorizing them, a process consistent with the encoding and vectorization of the API behaviors and API behavior parameters of the target file.
When determining whether the target file is a malicious file, a first distance between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black-and-white sample set and a second distance between the target behavior vector and the sample behavior vectors corresponding to all white samples are first calculated. The first distance and the second distance may be, but are not limited to, an average distance or the median of a plurality of distances. In the embodiment of the present application, both are average distances: the first average distance between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black-and-white sample set is calculated, and the second average distance between the target behavior vector and the sample behavior vectors corresponding to all white samples is calculated.
The distance between the target behavior vector and a sample behavior vector may be computed by, but is not limited to, Euclidean distance, cosine similarity, and the like. In the embodiment of the present application, cosine similarity is adopted.
Assuming that the target behavior vector is J_x and the sample behavior vector corresponding to a certain black sample is J_k, the distance between the target behavior vector J_x and the sample behavior vector J_k corresponding to the black sample can be expressed as d_k = (J_x · J_k) / (|J_x| · |J_k|). Calculating the distances between the target behavior vector and the sample behavior vectors corresponding to all black samples yields a distance list d_1, d_2, ..., d_B, and averaging the values in the distance list gives the first average distance between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black-and-white sample set, which can be expressed as (d_1 + d_2 + ... + d_B) / B. Likewise, the second average distance between the target behavior vector and the sample behavior vectors corresponding to all white samples in the black-and-white sample set can be obtained.
The first average distance is then compared with the second average distance. If the first average distance is smaller than the second average distance, the target file is judged to be a normal file and the detection ends. If the first average distance is greater than or equal to the second average distance, the target file is judged to be a malicious file.
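The black/white decision of step S103 might be sketched as follows, using the cosine measure given above as the "distance"; the use of numpy and all names are assumptions for illustration.

```python
import numpy as np

def cosine_measure(j_x: np.ndarray, j_k: np.ndarray) -> float:
    """The measure d_k = (J_x . J_k) / (|J_x| * |J_k|) used as the distance
    between the target vector and one sample vector in this embodiment."""
    return float(np.dot(j_x, j_k) / (np.linalg.norm(j_x) * np.linalg.norm(j_k)))

def is_malicious(target: np.ndarray, black_vectors: list, white_vectors: list) -> bool:
    """Compare the first (black) and second (white) average distances;
    per the embodiment, greater-or-equal means the file is judged malicious."""
    first_avg = np.mean([cosine_measure(target, b) for b in black_vectors])
    second_avg = np.mean([cosine_measure(target, w) for w in white_vectors])
    return first_avg >= second_avg
```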
Step S104, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different kinds of black samples in the black-and-white sample set.
If the target file corresponding to the target behavior vector is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black-and-white sample set.
Specifically, a third average distance between the target behavior vector and the sample behavior vectors corresponding to each kind of black sample (virus, Trojan, worm, ransomware) in the black-and-white sample set is first calculated. The third average distance between the target behavior vector and the sample behavior vectors corresponding to viruses is denoted D_α, that corresponding to Trojans is denoted D_β, that corresponding to worms is denoted D_θ, and that corresponding to ransomware is denoted D_μ. The third average distances D_α, D_β, D_θ and D_μ are then each compared with a preset critical value. If all of D_α, D_β, D_θ and D_μ exceed the critical value, the difference between the target behavior vector and the sample behavior vectors corresponding to every kind of black sample is large, and the target file corresponding to the target behavior vector is classified as a new kind of malicious file outside viruses, Trojans, worms and ransomware. If one or more of D_α, D_β, D_θ and D_μ do not exceed the preset critical value, the malicious category of the black sample corresponding to the minimum value is selected as the malicious category of the target file.
For example, assuming a critical value s, if s < D_α < D_β < D_θ < D_μ, the target file is classified as a new kind of malicious file outside viruses, Trojans, worms and ransomware. If D_α < D_β < D_θ < D_μ < s, the target file is judged to be a malicious file whose category is virus.
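Continuing the sketch, step S104 could be expressed as below; the category labels, the default threshold value, and the reuse of the cosine measure from the previous sketch are assumptions for illustration.

```python
import numpy as np

def classify(target: np.ndarray, black_by_category: dict, s: float = 0.8) -> str:
    """black_by_category maps a category name (e.g. 'virus', 'trojan', 'worm',
    'ransomware') to its list of sample behavior vectors.  Returns the category
    whose third average distance is smallest, provided it does not exceed the
    critical value s; otherwise the file is assigned to a new category."""
    def d(j_x, j_k):  # the cosine measure d_k used as the distance above
        return float(np.dot(j_x, j_k) / (np.linalg.norm(j_x) * np.linalg.norm(j_k)))

    third_avg = {cat: float(np.mean([d(target, v) for v in vectors]))
                 for cat, vectors in black_by_category.items()}
    if all(value > s for value in third_avg.values()):
        return "new_category"          # every D_* exceeds the critical value
    return min(third_avg, key=third_avg.get)
```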
Fig. 2 is a flowchart of another malicious file detection method according to an embodiment of the present application. The flow shown in fig. 2 will be described in detail.
Step S201, sample API behaviors and sample API behavior parameters of a sample file in a training sample set are obtained.
In the embodiment of the present application, the sample files include black sample files and white sample files; a black sample file includes at least one of a virus, a Trojan, a worm and ransomware, and a white sample file is a normal file. The sample file may be, but is not limited to, a PE file, a PDF file, a text file, or the like.
Before detecting the target file, a training sample set for detecting whether the target file is a malicious file or not and a malicious category needs to be established. Specifically, first, sample files in a training sample set are run through an external analysis engine, and sample API behaviors and sample API behavior parameters of each sample file are obtained. Wherein the external analysis engine may be, but is not limited to, a sandbox, virtual machine, etc.
Furthermore, when the sample API behaviors and sample API behavior parameters of the sample files in the training sample set are obtained, behaviors whose sample API behavior and corresponding sample API behavior parameter are both identical can be merged into a non-repeated set, which effectively avoids data redundancy and reduces the amount of computation.
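As an illustration of this merging step, a sketch under the assumption that each behavior is represented as an (API behavior, API behavior parameter) pair; the example values in the comment are hypothetical.

```python
def merge_behaviors(records: list) -> set:
    """Merge behaviors whose sample API behavior and sample API behavior
    parameter are both identical into a non-repeated set."""
    return {(api_behavior, api_parameter) for api_behavior, api_parameter in records}

# merge_behaviors([("load_system_dll", "c:/system/a.dll"),
#                  ("load_system_dll", "c:/system/a.dll"),
#                  ("write_temp_file", "c:/temp/x.tmp")])
# -> two distinct (behavior, parameter) pairs
```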
Step S202, coding the acquired sample API behaviors and sample API behavior parameters to obtain a sample coding set corresponding to the training sample set.
Specifically, hexadecimal encoding is performed on the corresponding sample API behaviors aiming at each sample file, and the encoding length is preset to obtain a third encoding set with a preset length. And meanwhile, coding corresponding sample API behavior parameters to obtain a fourth coding set. And then, the codes in the fourth code set are converted into hexadecimal codes consistent with the codes in the third code set, and the codes in the third code set and the codes in the fourth code set after conversion are combined in a one-to-one correspondence manner to obtain a normalized sample code set.
In the embodiment of the present application, hexadecimal coding is adopted for the sample API behavior coding; it is understood that binary, octal or decimal coding and the like may be adopted in other embodiments. When another numeral system is adopted, if the codes in the third coding set differ in type from the codes in the fourth coding set, the codes in the fourth coding set likewise need to be converted into the same type as those in the third coding set, or vice versa. The sample API behavior parameters may be encoded by, but not limited to, hashing and the like; in the embodiment of the present application, hash coding is used to encode the sample API behavior parameters.
Step S203, determining the weight corresponding to each code according to the frequency with which each code in the sample code set occurs within the same sample file and across different sample files.
The weight corresponding to each code may be determined by, but not limited to, the TF-IDF algorithm, the TextRank algorithm, and the like; the embodiment of the present application adopts the TF-IDF algorithm. Specifically, the more frequently the behavior corresponding to a code occurs within the same sample file, the higher the weight given to that code; conversely, the more different sample files the behavior corresponding to a code occurs in, the lower the weight given to that code.
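A rough sketch of such a TF-IDF-style weighting is given below, treating each sample file as a document and each code as a term; the exact formula and the way a single global weight per code is aggregated are not specified in the embodiment, so the variant here is only an assumption.

```python
import math
from collections import Counter

def tfidf_weights(sample_coding_sets: list) -> dict:
    """sample_coding_sets holds one list of codes per sample file.
    A code occurring often within a file gets a higher weight; a code
    occurring in many different files gets a lower weight."""
    n_files = len(sample_coding_sets)
    doc_freq = Counter()                      # number of files containing each code
    for codes in sample_coding_sets:
        doc_freq.update(set(codes))
    weights = {}
    for codes in sample_coding_sets:
        term_freq = Counter(codes)
        for code, count in term_freq.items():
            tf = count / len(codes)
            idf = math.log(n_files / (1 + doc_freq[code])) + 1
            # keep the largest tf*idf observed for the code across sample files
            weights[code] = max(weights.get(code, 0.0), tf * idf)
    return weights
```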
Step S204, vectorization processing is carried out on the sample code set corresponding to the sample file in the training sample set according to the weight corresponding to each code, and sample behavior vectors in the black and white sample set are obtained.
Step S205, the acquired API behaviors and API behavior parameters of the target file are encoded, and a target encoding set corresponding to the target file is obtained.
Step S206, vectorization processing is carried out on the target coding set to obtain a target behavior vector.
Step S207, determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set.
Step S208, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different kinds of black samples in the black-and-white sample set.
In summary, the malicious file detection method provided by the embodiments of the present application encodes and normalizes the API behaviors and API behavior parameters of the target file and then converts them into the target behavior vector of the target file, determines whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set, and, when the target file is a malicious file, determines its malicious category according to the distances between the target behavior vector and the sample behavior vectors of different kinds of black samples in the black-and-white sample set. Because the behavior characteristics are retained, the richness of the training input is improved, so the accuracy of malicious file detection is improved and the false alarm rate of the machine learning model is reduced. At the same time, according to the distances between the target behavior vector and the sample behavior vectors corresponding to the various kinds of black samples, the scheme can not only distinguish whether the target file is malicious but also identify its category and discover new kinds of malicious files. Secondly, the scheme provided by the embodiments of the present application has good file-type extensibility: compared with traditional schemes that only support analysis models for executable PE files, it also supports other file types such as Word and PDF. Compared with deep-learning-network-based malicious file detection methods, which are more complex, the scheme reduces weight and parameter tuning, lessens the dependence of the behavior-count-based statistical model on the sample distribution, and generalizes better in the presence of sample imbalance. In addition, the method provided by the embodiments of the present application ensures that the obtained codes have a uniform length, which facilitates feature extraction, avoids low feature discrimination caused by overly broad feature descriptions, and further improves the accuracy of malicious file detection. Finally, when the training sample set is established, behaviors with identical sample API behaviors and sample API behavior parameters are merged into a non-repeated set, which effectively avoids data redundancy and reduces the amount of computation.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Fig. 3 is a block diagram of an electronic device 100 according to an embodiment of the present application. Referring to fig. 3, at the hardware level the electronic device 100 includes a processor 110 and, optionally, an internal bus 120, a network interface 130, and a memory 140. The memory 140 may include volatile memory, such as random-access memory (RAM), and may further include non-volatile memory, such as at least one disk storage device. Of course, the electronic device 100 may also include hardware required for other services.
The processor 110, the network interface 130, and the memory 140 may be interconnected by the internal bus 120, which may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 3, but this does not mean that there is only one bus or one type of bus.
The memory 140 is used to store a program. Specifically, the program may include program code comprising computer operating instructions. The memory 140 may include memory and non-volatile storage and provides instructions and data to the processor 110.
The processor 110 reads the corresponding computer program from the non-volatile storage into memory and then runs it, forming the malicious file detection apparatus 150 at the logical level. The processor 110 executes the program stored in the memory 140 and is specifically configured to perform the following operations:
The method comprises the steps of obtaining an API behavior of a target file, carrying out vectorization conversion on the obtained API behavior and API behavior parameters of the target file to obtain a target behavior vector corresponding to the target file, determining whether the target file is a malicious file according to the distance between the target behavior vector and a sample behavior vector in a black-and-white sample set, and determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black-and-white sample set when the target file is the malicious file.
The method performed by the malicious file detection apparatus 150 disclosed in the embodiment of fig. 3 of the present application may be applied to the processor 110, or may be implemented by the processor 110. The processor 110 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuitry in hardware or by instructions in the form of software in the processor 110. The processor 110 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like, or may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or executed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied as being executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 140, and the processor 110 reads the information in the memory 140 and, in combination with its hardware, performs the steps of the above method.
The electronic device 100 may also execute the methods of fig. 1 and fig. 2, and implement the functions of the malicious file detection apparatus 150 in the embodiments shown in fig. 1 and fig. 2, which are not described herein again.
Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device 100 of the present application, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or a logic device.
The embodiments of the present application also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the methods of the embodiments shown in fig. 1 and 2, and in particular to perform the operations of:
The method comprises the steps of obtaining an API behavior of a target file, carrying out vectorization conversion on the obtained API behavior and API behavior parameters of the target file to obtain a target behavior vector corresponding to the target file, determining whether the target file is a malicious file according to the distance between the target behavior vector and a sample behavior vector in a black-and-white sample set, and determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black-and-white sample set when the target file is the malicious file.
Fig. 4 is a block diagram illustrating a malicious file detection apparatus 150 according to an embodiment of the present application. Referring to fig. 4, in a software implementation, the malicious file detection apparatus 150 may include:
The encoding module 151 is configured to encode the obtained API behaviors and API behavior parameters of the target file, to obtain a target encoding set corresponding to the target file.
It will be appreciated that the encoding module 151 may be configured to perform step S101 or step S205 described above.
The vectorization module 152 is configured to perform vectorization processing on the target encoding set to obtain a target behavior vector.
It will be appreciated that the vectorization module 152 may be configured to perform step S102 or step S206 described above.
The determining module 153 is configured to determine whether the target file is a malicious file according to a distance between the target behavior vector and a sample behavior vector in a black-and-white sample set. And when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black-and-white sample set.
It is understood that the determination module 153 may be used to perform steps S103 and S104 or steps S207 and S208 described above.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant parts, reference may be made to the description of the method embodiments.

Claims (10)

1. A malicious file detection method, comprising:
encoding the API behaviors to obtain a first encoding set, wherein the encoding the API behaviors to obtain the first encoding set comprises the following steps:
Hexadecimal encoding is carried out on the API behaviors to obtain the first encoding set with the preset encoding length;
Encoding the API behavior parameters to obtain a second encoding set, wherein the encoding the API behavior parameters to obtain the second encoding set includes:
carrying out hash coding on the API behavior parameters to obtain the second coding set;
wherein the API behavior parameters are directory paths, and the encoding the API behavior parameters to obtain the second encoding set comprises:
Carrying out directory layering on the API behavior parameters;
encoding the API behavior parameters after directory layering to obtain the second encoding set;
when the path length of the API behavior parameters exceeds a preset length, adjusting the path length of the API behavior parameters to the preset length and then performing the directory layering;
Converting the codes in the second encoding set into hexadecimal codes, and combining the codes in the first encoding set and the codes in the converted second encoding set in a one-to-one correspondence manner to obtain a target coding set;
vectorizing the target coding set to obtain a target behavior vector;
Determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set;
and if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to the different types of black samples in the black-and-white sample set.
2. The method of claim 1, wherein the determining whether the target file is a malicious file based on the distance of the target behavior vector from the sample behavior vectors in the black and white sample set comprises:
Calculating a first average distance between the target behavior vector and a sample behavior vector corresponding to a black sample in the black-and-white sample set;
Calculating a second average distance between the target behavior vector and a sample behavior vector corresponding to a white sample in the black and white sample set;
And when the first average distance is greater than or equal to the second average distance, judging that the target file is a malicious file.
3. The method of claim 1, wherein the determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to the different types of black samples in the black-and-white sample set comprises:
Calculating a third average distance between the target behavior vector and a sample behavior vector corresponding to different kinds of black samples in the black-and-white sample set;
When a third average distance that does not exceed a preset critical value exists, selecting the malicious category of the black sample corresponding to the minimum value among the third average distances as the malicious category of the target file;
otherwise, dividing the malicious category of the target file into new malicious categories.
4. The method according to claim 1, wherein the method further comprises:
and acquiring the API behaviors and the API behavior parameters after the external analysis engine runs the target file.
5. The method of claim 1, wherein the API behavior is loading a system DLL file, writing a temporary file, or modifying a registry.
6. The method according to claim 1, wherein the method further comprises:
Acquiring sample API behaviors and sample API behavior parameters of a sample file in a training sample set, wherein the sample file comprises a black sample file and a white sample file, the black sample file comprises at least one of a virus, a Trojan, a worm and ransomware, and the white sample file is a normal file;
coding the acquired sample API behaviors and the sample API behavior parameters to obtain a sample coding set corresponding to the training sample set;
determining the weight corresponding to each code according to the frequency with which each code in the sample coding set occurs within the same sample file and across different sample files;
And carrying out vectorization processing on a sample code set corresponding to the sample file in the training sample set according to the weight corresponding to each code, and obtaining sample behavior vectors in the black and white sample set.
7. The method of claim 6, wherein the sample file is of a type of PE file, PDF file, or text file.
8. A malicious file detection apparatus, characterized in that the malicious file detection apparatus comprises:
the coding module is configured to code the API behaviors to obtain a first coding set, where the coding the API behaviors to obtain the first coding set includes:
Hexadecimal encoding is carried out on the API behaviors to obtain the first encoding set with the preset encoding length;
Encoding the API behavior parameters to obtain a second encoding set, wherein the encoding the API behavior parameters to obtain the second encoding set includes:
carrying out hash coding on the API behavior parameters to obtain the second coding set;
wherein the API behavior parameters are directory paths, and the encoding the API behavior parameters to obtain the second encoding set comprises:
Carrying out directory layering on the API behavior parameters;
encoding the API behavior parameters after directory layering to obtain the second encoding set;
when the path length of the API behavior parameters exceeds a preset length, adjusting the path length of the API behavior parameters to the preset length and then performing the directory layering;
Converting the codes in the second encoding set into hexadecimal codes, and combining the codes in the first encoding set and the codes in the converted second encoding set in a one-to-one correspondence manner to obtain a target coding set;
The vectorization module is used for carrying out vectorization processing on the target coding set to obtain a target behavior vector;
a determining module for determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black-and-white sample set, and
And if so, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to the different types of black samples in the black-and-white sample set.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the bus;
a memory for storing a computer program;
a processor for executing a program stored on a memory, implementing the method steps of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1 to 7.
CN201910755713.1A 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium Active CN112395612B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910755713.1A CN112395612B (en) 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium
PCT/CN2020/108614 WO2021027831A1 (en) 2019-08-15 2020-08-12 Malicious file detection method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910755713.1A CN112395612B (en) 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112395612A CN112395612A (en) 2021-02-23
CN112395612B (en) 2025-08-19

Family

ID=74570249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910755713.1A Active CN112395612B (en) 2019-08-15 2019-08-15 Malicious file detection method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112395612B (en)
WO (1) WO2021027831A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343219B (en) * 2021-05-31 2023-03-07 烟台中科网络技术研究所 Automatic and efficient high-risk mobile application program detection method
CN113449301A (en) * 2021-06-22 2021-09-28 深信服科技股份有限公司 Sample detection method, device, equipment and computer readable storage medium
CN113704761B (en) * 2021-08-31 2024-06-28 上海观安信息技术股份有限公司 Malicious file detection method and device, computer equipment and storage medium
CN114065196B (en) * 2021-09-30 2025-08-12 奇安信科技集团股份有限公司 Java memory horse detection method and device, electronic equipment and storage medium
CN113901453B (en) * 2021-10-12 2025-09-23 奇安信科技集团股份有限公司 Sample threat assessment method, device, electronic device and storage medium
CN114006766B (en) * 2021-11-04 2024-08-06 杭州安恒信息安全技术有限公司 Network attack detection method, device, electronic device and readable storage medium
CN114297645B (en) * 2021-12-03 2022-09-27 深圳市木浪云科技有限公司 Method, device and system for identifying ransomware families in a cloud backup system
US20240362328A1 (en) * 2023-04-25 2024-10-31 Delta Electronics, Inc. Detection method and detection system for ransomware
CN116861428B (en) * 2023-09-04 2023-12-08 北京安天网络安全技术有限公司 Malicious detection method, device, equipment and medium based on associated files
CN116910756B (en) * 2023-09-13 2024-01-23 北京安天网络安全技术有限公司 Detection method for malicious PE (polyethylene) files
CN118427824B (en) * 2024-07-04 2024-11-05 北京安天网络安全技术有限公司 Batch scanning method, device, equipment and medium combined with external sample

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361286A (en) * 2014-12-01 2015-02-18 西安邮电大学 Trojan judgment method based on dynamic code sequence tracking analysis
CN107590388A (en) * 2017-09-12 2018-01-16 南方电网科学研究院有限责任公司 Malicious code detection method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8245295B2 (en) * 2007-07-10 2012-08-14 Samsung Electronics Co., Ltd. Apparatus and method for detection of malicious program using program behavior
US9398034B2 (en) * 2013-12-19 2016-07-19 Microsoft Technology Licensing, Llc Matrix factorization for automated malware detection
CN104700033B (en) * 2015-03-30 2019-01-29 北京瑞星网安技术股份有限公司 Virus detection method and device
CN104866763B (en) * 2015-05-28 2019-02-26 天津大学 Permission-based Android malware hybrid detection method
CN106469276B (en) * 2015-08-19 2020-04-07 阿里巴巴集团控股有限公司 Type identification method and device of data sample
CN106960153B (en) * 2016-01-12 2021-01-29 阿里巴巴集团控股有限公司 Virus type identification method and device
US10972495B2 (en) * 2016-08-02 2021-04-06 Invincea, Inc. Methods and apparatus for detecting and identifying malware by mapping feature data into a semantic space
CN107122659A (en) * 2017-03-29 2017-09-01 中国科学院信息工程研究所 Method for quickly locating malicious code or vulnerabilities in Android application software
CN109145605A (en) * 2018-08-23 2019-01-04 北京理工大学 Android malware family clustering method based on the SinglePass algorithm

Also Published As

Publication number Publication date
CN112395612A (en) 2021-02-23
WO2021027831A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
CN112395612B (en) Malicious file detection method and device, electronic equipment and storage medium
US20220371621A1 (en) Stateful rule generation for behavior based threat detection
US10430586B1 (en) Methods of identifying heap spray attacks using memory anomaly detection
US10986103B2 (en) Signal tokens indicative of malware
RU2680738C1 (en) Cascade classifier for the computer security applications
US9798981B2 (en) Determining malware based on signal tokens
US10216934B2 (en) Inferential exploit attempt detection
US11379581B2 (en) System and method for detection of malicious files
JP6341964B2 (en) System and method for detecting malicious computer systems
CN109933986B (en) Malicious code detection method and device
CN116204882A (en) Android malware detection method and device based on heterogeneous graph
CN112580044B (en) System and method for detecting malicious files
Ndagi et al. Machine learning classification algorithms for adware in android devices: a comparative evaluation and analysis
CN117579395B (en) Method and system for scanning network security vulnerabilities by applying artificial intelligence
CN111143853B (en) Application security assessment method and device
CN112149126B (en) System and method for determining trust level of file
CN112395603B (en) Vulnerability attack identification method, device and computer equipment based on instruction execution sequence characteristics
CN111143843A (en) Malicious application detection method and device
Liu et al. Android malware detection based on multi-features
KR102174393B1 (en) Malicious code detection device
CN116029388A (en) Incremental learning-based challenge sample detection method, system, device and medium
CN113765852B (en) Data packet detection method, system, storage medium and computing device
CN111240696A (en) Method for extracting similar modules of mobile malicious program
CN111639340A (en) Malicious application detection method and device, electronic equipment and readable storage medium
CN112528329A (en) Detection method for maliciously obtaining user position privacy and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant