WO2008098495A1

WO2008098495A1 - Method and device for determining object file

Info

Publication number: WO2008098495A1
Application number: PCT/CN2008/070223
Authority: WO
Inventors: Jie Bai; Wei Li; Zhengyu Lu
Original assignee: Individual
Current assignee: Individual
Priority date: 2007-02-14
Filing date: 2008-01-31
Publication date: 2008-08-21
Anticipated expiration: 2009-08-14
Also published as: CN101013445A; CN100485691C

Abstract

A method for determining an object file is disclosed, based on a storage unit that stores a universal set of characteristic character strings. The method includes: selecting, in the storage unit, the set of characteristic character strings used to determine an object file (11), the set comprising at least one characteristic character string for determining the object file; looking up, according to a rule, the characteristic character strings of the set in the file to be detected and obtaining a lookup result (12); and judging whether the lookup result satisfies a first condition (13) and, if so, determining that the file to be detected is the object file (14). A device for determining an object file is also provided.

Description

Method and apparatus for determining a target file

Technical field

The present invention relates to a method and apparatus for determining whether a data set is the desired data set.

Background

To judge whether an obtained data set is the desired data set, to judge whether an obtained file is the expected file, or to find target data or a target data set within a large amount of raw data, at least one feature data string is usually determined in advance. The obtained data set is then treated as the object under detection: by searching it for the feature data string, one can learn whether it is the expected data set. However, when the feature data strings are numerous and discrete, two problems arise. On the one hand, if the detected data set is large, or the feature data strings are numerous and/or long, the search consumes more time and becomes less practical. On the other hand, if the positions at which the feature data strings may appear in the detected data set are fixed in advance, the probability of a successful search becomes very small.

For example, searching a facial image database for the image represented by a composite sketch is an attempt to determine whether the database contains the target portrait; searching a suspicious program for the characteristic instructions or data of a virus is an attempt to determine whether that program is a virus. Because the image represented by a composite sketch, or the characteristic instructions, involve many discrete pieces of feature data, existing methods struggle to search quickly and effectively and to determine the target data set. A further case arises in practice. Suppose the universal feature set of a virus contains 10 features, that is, the union of the feature sets of all known variants of the virus has 10 features, while the feature set of each specific virus or variant may contain only part of that universal set. In this situation it is even harder to identify program samples that carry virus features, and therefore hard to judge effectively whether a suspicious program is a virus.

Summary of the invention

The problem to be solved by the present invention is to provide a method and apparatus for determining a target file that can judge a detected data set quickly and accurately, and thereby judge quickly and accurately whether a detected file is the target file.

The method for determining a target file provided by an embodiment of the present invention is based on a storage unit that stores a universal set of feature strings, and includes: selecting, in the storage unit, a set of feature strings used to determine the target file, the set containing at least one feature string for determining the target file;

searching the file to be detected for the strings in the set according to a rule to obtain a search result; and judging whether the search result satisfies a first condition and, if it does, determining that the file to be detected is the target file.

The strings in the set may be searched for in the file to be detected, according to the rule, by the following steps:

Select one feature string from the set at a time until every feature string in the set has been selected once. For each selected feature string, scan the file to be detected and obtain the valid position of the feature string in that file. The positions of all feature strings in the file to be detected are taken as the search result.

Alternatively, the strings in the set may be searched for by the following steps: select a not-yet-selected feature string from the set, repeating until the accumulated result satisfies a first rule. For each selected feature string, scan the file to be detected, obtain the valid position of the feature string in that file, and accumulate the valid positions to obtain an accumulated result. When the search of the file to be detected ends, the accumulated result is taken as the search result.

The first rule is: in the accumulated result, the total number of valid positions at which feature strings were found in the file to be detected reaches a set value; or, in the accumulated result, the number of feature strings validly found in the file to be detected reaches a set value; or, in the accumulated result, the positional relationship of the feature strings validly found in the file to be detected satisfies a set order feature and/or interval feature.

The first condition is: in the search result, the number of feature strings validly found in the file to be detected reaches a set value; or, in the search result, the positional relationship of the feature strings validly found in the file to be detected satisfies a set order feature and/or interval feature.

The method further includes determining a feature character for each feature string and a second rule for constructing the corresponding feature string from that feature character, and scanning the file to be detected as follows: search the file for the feature character until the whole file has been searched; for each feature character found, construct the corresponding feature string according to the second rule; if the construction succeeds, take the position of the successfully constructed feature string as a valid position. The construction is deemed successful if the characters of the string constructed under the second rule are identical to those of the feature string being searched for, or if the proportion of identical characters reaches a set value.
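One way to picture the feature-character scan is the Python sketch below. It is only an illustration under stated assumptions: the text does not define the second rule, so the sketch assumes the feature character sits at a known index (`anchor_index`) inside the feature string and that "constructing" the string means cutting the same-length window around each hit. The function name, parameters, and the use of 0-based offsets are all hypothetical.

```python
def find_valid_positions(data: str, feature: str, anchor_index: int,
                         min_ratio: float = 1.0) -> list[int]:
    """Return 0-based offsets where `feature` is rebuilt around its feature character."""
    anchor = feature[anchor_index]          # the assumed feature character
    positions = []
    hit = data.find(anchor)
    while hit != -1:
        begin = hit - anchor_index          # where the whole string would start
        if begin >= 0:
            candidate = data[begin:begin + len(feature)]
            if len(candidate) == len(feature):
                # Proportion of characters identical to the feature string.
                same = sum(a == b for a, b in zip(candidate, feature))
                if same / len(feature) >= min_ratio:
                    positions.append(begin)
        hit = data.find(anchor, hit + 1)    # continue until the file is exhausted
    return positions
```

With `min_ratio=1.0` only exact reconstructions count; lowering it accepts windows in which a set proportion of characters is identical, which corresponds to the second success criterion described above.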

The apparatus for determining a target file provided by an embodiment of the present invention includes a storage unit that stores a universal set of feature strings, and further includes:

a feature string selection unit, which selects in the storage unit a set of feature strings used to determine the target file, the set containing at least one feature string for determining the target file;

a file scanning unit, configured to search the file to be detected for the strings in the set according to a rule and to obtain a search result;

a judging unit, configured to judge whether the result satisfies a first condition; and

a target file determining unit, which determines that the file to be detected is the target file when the result satisfies the first condition.

With the method and apparatus provided by the embodiments of the present invention, a set of feature strings for determining the target file is selected, the strings in the set are searched for in the file to be detected according to a rule, and, once the search result is obtained, it is judged whether the result satisfies the first condition in order to determine whether the file to be detected is the target file. Because search rules and judgment conditions are used in the determination, they can constrain how feature strings are searched for, for example by fuzzy search or by tailoring the search to the nature of the detected file and the purpose of detection. This makes the search and judgment more targeted, so the detected data set can be judged quickly and accurately.

Other advantages of the invention are described in detail in the following text.

Brief description of the drawings

Figure 1 is a flow chart of a first embodiment of the method of the present invention;

Figure 2 is a flow chart, used in the embodiment of Figure 1, of searching the file to be detected for the set of feature strings;

Figure 3 is a structure diagram of the database that stores feature strings in the embodiment of Figure 1;

Figure 4 is a block diagram of an embodiment of the apparatus of the present invention.

Detailed description

In practice, methods for determining whether a file (or a data set) is the target file being sought are very widely used. By searching for features, such a method can determine whether a file is a copy of another file, whether a file has been infected by a virus, and so on. A copy of a file contains features of the source file, and an infected file contains features of the infection. These features are typically large in volume, uncertain, and highly dispersed, so it is hard to know which of them will appear in the file to be detected, and harder still to know in what form and at what position. For example, a suspicious program may contain an instruction feature that deletes data, but which parameters that instruction carries, and which additional conditions are needed to conclude that the program is a virus, are uncertain. Consequently, using existing methods to find large numbers of such suspicious programs quickly either takes a great deal of time or has a low detection success rate, or both.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Figure 1 is a flow chart of the first embodiment of the method of the present invention. In this embodiment, because any data or file can be converted or reduced to a character string under some encoding rule, for example an ASCII-encoded string, the embodiment is based on searching for strings. The starting point of Figure 1 is a file to be detected that has been obtained in advance.

According to Figure 1, step 11 first determines the set of feature strings. This is a preprocessing step and the basis of the subsequent steps. In many applications the feature strings are also called "fingerprints". The set contains the feature strings used to determine the target file and is the basis for finding identical or similar files. In simple cases a single feature string is sometimes enough to judge whether the file to be detected is the target file, but more often many feature strings are needed; the set therefore contains at least one feature string. In this embodiment the set is held in a database in the form of a two-dimensional table that stores the feature strings and their associated auxiliary data; the database could of course be replaced by another data structure, such as a linear queue.

A typical database structure is shown in Figure 3. The fields "number", "feature string" and "length" are the most basic; if the selected feature strings all have the same length, "length" can be omitted. The other fields support further improvements to the embodiment of Figure 1 or to other embodiments and can be regarded as auxiliary data that improves the performance of the method.

Note that when step 11 selects from the database the set of feature strings used to determine the target file, it is choosing the feature strings to be used in one particular kind of target-file judgment, and these differ for target files of different natures. For example, to judge whether file B is a copy of file A, the universal set of features must first be extracted from file A; then, for the specific judgment, the set of feature strings to be used is selected or determined from that universal set. Likewise, to judge whether a file is a virus, or a particular virus, the universal set of virus features must be extracted in advance, and the set of feature strings for the specific judgment is then selected or determined from it.

There are many ways to extract the universal set of features from a file A, or to obtain the universal set of features of a virus or of a particular virus; similar methods are described in detail in other patents of the present applicant. A database with the structure of Figure 3 stores such a universal set of feature strings for detecting one kind of target file. To store the universal sets for all kinds of target files, it is enough to add a field that identifies the number or name of the target file each feature string is used to detect. Detection of different items within one kind of target file can be managed with a feature-group field, and so on.
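The table described above might be modeled as in the following Python sketch. Figure 3 is not reproduced in the text, so everything beyond the "number", "feature string" and "length" fields is an assumption: the `target` and `group` field names stand in for the target-file identifier and the feature-group field mentioned above, and `select_feature_set` is a hypothetical helper for the selection done in step 11.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureRecord:
    number: int            # "number" field
    feature_string: str    # "feature string" field
    target: str = ""       # assumed: which target file this string detects
    group: str = ""        # assumed: feature-group field

    @property
    def length(self) -> int:
        # "length" field; derived here, so it never disagrees with the string
        return len(self.feature_string)


def select_feature_set(universal_set: list[FeatureRecord], target: str,
                       group: Optional[str] = None) -> list[FeatureRecord]:
    """Pick, from the universal set, the records used for one specific judgment."""
    return [r for r in universal_set
            if r.target == target and (group is None or r.group == group)]
```

Deriving `length` from the string rather than storing it is a design choice of this sketch; a stored column, as in the described table, would work equally well.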

The feature information of typical kinds of target files therefore needs to be analyzed and obtained in advance. In simple cases, such as finding copies, the source file can be analyzed automatically, for example by multi-pattern matching. In complex cases, such as determining the feature strings of virus files or virus-infected files, manual involvement may also be needed. Whether the task is analyzing the feature information of illegal network access or of a virus, there are many methods and processes of analysis, the resulting feature information can have many kinds of attributes, and the analysis can be carried out manually or by computer. The feature strings determined from the feature information can be selected and stored according to actual conditions and needs.

This embodiment gives an example of using computer tools to analyze a virus and obtain its feature strings. The example uses tools such as DEBUG and PROVIEW on a dedicated test system. A dedicated test system is used because the object under analysis may be a virus, which could well continue to spread or even activate while being analyzed. Specifically, DEBUG or another disassembler can be used to print the virus code as a disassembled program listing, from which the code segments carrying feature information are identified. Dynamic analysis can also be used to examine the components of the virus, which system calls or operations it uses, and what modes of operation and control flow it adopts, and so to work out the virus's sequence of operating behaviors. The feature strings corresponding to this feature information or to these operating behaviors are stored in a database so that they can later be used to identify and judge whether other programs to be detected are viruses.

A feature string should generally satisfy conditions such as uniqueness and being as short as possible. In determining the set of feature strings, step 11 mainly takes into account the discreteness of the feature data and the uncertainty of its position in the file to be detected, and decomposes the feature data into different, smaller subsets of feature characters, namely the feature strings.

Based on the set determined in step 11, step 12 searches the file to be detected for the strings in the set according to a rule and obtains a search result. The rule differs for different target files; it is a search strategy drawn up with search speed, search accuracy, and the nature of the target file in mind. The simplest rule is to search the file to be detected for every feature string in the set. In this embodiment, the rule is applied as follows: one feature string is selected from the set at a time until every feature string has been selected once; for each selected feature string, the file to be detected is scanned and the valid position of the feature string in that file is obtained; the positions of all feature strings in the file are taken as the search result. See Figure 2 for details.

Step 13 judges whether the search result satisfies a preset condition. If it does, step 14 determines that the file to be detected is the target file and subsequent processing follows; otherwise the judgment ends at step 15.
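The flow of steps 12 to 15 can be sketched roughly as follows. This is a minimal Python illustration, not the patented implementation: the function name `is_target_file`, the use of exact substring search, and the choice of "at least `min_found` feature strings found" as the preset condition are assumptions made for the example.

```python
def is_target_file(data: str, feature_set: list[str], min_found: int) -> bool:
    """Search every feature string, then test an assumed count-based condition."""
    # Step 12: look up each feature string; str.find returns -1 when absent.
    search_result = {feature: data.find(feature) for feature in feature_set}
    # Step 13: assumed condition, enough feature strings were found.
    found = sum(1 for offset in search_result.values() if offset != -1)
    # Step 14 corresponds to True, step 15 to False.
    return found >= min_found
```

A condition based on order or interval features would replace the count in step 13 without changing the rest of the flow.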

A specific example of scanning the file to be detected in step 12 is shown in Figure 2. Step 21 reads a not-yet-selected feature string from the set, and step 22 scans the file to be detected with it, that is, searches the file for the feature string. Step 23 judges whether the feature string has been found: if the portion scanned so far cannot be confirmed to contain it, the flow returns to step 22 to continue scanning; if step 23 finds that the scanned portion contains the feature string, step 24 records the scan result. In this example the record is the valid position of the feature string in the file to be detected, which includes the offset of the feature string in the file, its length, and so on. A valid position is the position at which the feature string is completely contained in the file to be detected. If the feature string is only partly contained in the file, or is not found at all, it has no valid position. For the sake of later management and judgment, however, "empty" is still recorded in the valid-position field of the management table to mark that the feature string was not found in the file to be detected.

For example, suppose the scan result table has three fields, "feature string number", "start position" and "length", and consider feature string ACDS (number 1), feature string DFS (number 2), and the file to be detected QWERTACDSYASDFGHZXCVB. Scanning for ACDS gives the table entry "1" "6" "4" (start positions are counted from 1); scanning for DFS gives "2" "0" "3", where the "0" in the middle means "empty", that is, DFS was not found.
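The scan result table in this example can be reproduced with a few lines of Python. The sketch assumes exact substring search and records only the first occurrence; the function name `scan_table` and the tuple layout are hypothetical, while the 1-based start position and the use of 0 for "empty" follow the example above.

```python
def scan_table(data: str, features: dict[int, str]) -> list[tuple[int, int, int]]:
    """Return rows of (feature string number, start position, length)."""
    rows = []
    for number, feature in features.items():
        index = data.find(feature)                 # 0-based, -1 if absent
        start = index + 1 if index != -1 else 0    # 1-based; 0 marks "empty"
        rows.append((number, start, len(feature)))
    return rows


rows = scan_table("QWERTACDSYASDFGHZXCVB", {1: "ACDS", 2: "DFS"})
# rows == [(1, 6, 4), (2, 0, 3)]
```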

In this example, step 23 also includes a further check, namely whether the scan has finished. If it has, the scan operation terminates and the flow goes to step 26 to determine and obtain the final search result (this branch is obvious and is therefore not drawn in Figure 2).

Step 25 judges whether every feature string in the set has been scanned. If not, the flow returns to step 21 to select the next unselected feature string and continue scanning; otherwise step 26 determines and obtains the final search result. The feature strings may be selected in any manner, provided no feature string is selected twice.

In the embodiment of Figure 1, finding the valid position of every feature string in the file to be detected may not be strictly necessary, and in some cases it reduces search efficiency. For example, the set of feature strings for a virus may contain 8 feature strings, yet a conclusion about whether the file to be detected is the target file can already be drawn once 4 of them are found with a certain order and interval feature. To address this, the second embodiment of the invention provides another way of searching the file to be detected for the strings in the set. The method selects a not-yet-selected feature string from the set; for each selected feature string it scans the file to be detected, obtains the valid position of the feature string in that file, and accumulates the valid positions into an accumulated result. If the accumulated result satisfies a predetermined rule, a definite judgment about the file can already be made from what has been accumulated, so the search can end and the accumulated result is taken as the search result for further processing. In other words, once the accumulated result satisfies the predetermined rule the search can stop early, without selecting every feature string in the set, which makes searching the file for the feature strings in the set more efficient.
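The early-stopping search of the second embodiment might look like the sketch below. It is an illustration only: the rule is passed in as a callable, and the sample rule `at_least` (stop once a set number of feature strings has been found) is just one of the rules the text allows. The names `accumulate_until` and `at_least`, the use of exact substring search, and the use of `None` for "empty" are assumptions.

```python
from typing import Callable, Optional

Accumulated = dict[str, Optional[int]]


def accumulate_until(data: str, features: list[str],
                     rule: Callable[[Accumulated], bool]) -> Accumulated:
    """Search feature strings one by one and stop as soon as `rule` is satisfied."""
    accumulated: Accumulated = {}
    for feature in features:
        index = data.find(feature)
        # Record every outcome, including "empty" (None) when nothing is found.
        accumulated[feature] = index if index != -1 else None
        if rule(accumulated):
            break                      # enough evidence: end the search early
    return accumulated


def at_least(count: int) -> Callable[[Accumulated], bool]:
    """Sample rule: at least `count` feature strings were validly found."""
    def rule(accumulated: Accumulated) -> bool:
        return sum(1 for pos in accumulated.values() if pos is not None) >= count
    return rule
```

For instance, `accumulate_until("xxAAyyBBzzCC", ["AA", "QQ", "BB", "CC"], at_least(2))` stops after "BB" and never scans for "CC".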

In the second embodiment, even when the search for a feature string yields nothing, the outcome should still be recorded in the accumulated result for management or other purposes such as statistics; the result is recorded as "empty", meaning not found. If there is a definite result, the position parameters of the feature string in the file to be detected are recorded.

In the second embodiment, for a given feature string either only its first valid position in the file to be detected or all of its valid positions may be recorded, depending on actual needs; this matters when the feature strings in a set have definite position or relationship features. For example, suppose the universal set of feature strings of a virus is A1, A2, A3, A4, A5. Based on this universal set, a program to be detected is considered a virus when it exhibits one of the feature-string combinations and orders described in feature set 1 below, and is considered a particular virus when it exhibits one of those described in feature set 2 below.

Feature set 1 is: {A1 A3 A4, A1 A3 A4 A5, A1 A1 A3 A4 A5 A1, A2 A4 A5, A1 A3 A5, A3 A4 A5, A1 A4 A5, A1 A2 A3 A4 A5};

Feature set 2 is: {A1 A3 A5, A3 A4 A5, A1 A4 A5, A1 A2 A3 A4 A5}.

Clearly, for feature set 1 a feature string may have to appear several times in the file to be detected before it can be judged whether the program is a virus, so in that case all valid positions of the feature string in the file must be recorded. For feature set 2, a single occurrence of each feature string in the file is enough to make the judgment; to improve detection efficiency, only the first valid position of each feature string needs to be recorded for this kind of detection. A valid position includes at least an offset parameter; if the feature strings differ in length, a length parameter is also needed; in other embodiments it may further include parameters describing the permitted dispersion of the individual characters of the feature string, and so on.

Searching the file to be detected for the feature strings in the set by exact matching can reduce search accuracy in some cases. For example, the feature string of a virus program may change in variants of that virus, which increases the likelihood of misjudging a program to be detected. Likewise, in copies of a text file made by different copiers, a feature string determined from the original text file may also change. Such changes may be caused by differences in the parameters carried, or by other reasons. The feature string of a virus is selected after careful analysis of the virus body as the most representative string, one sufficient to distinguish the virus from other viruses and from its own variants, so that virus programs can be separated from normal non-virus programs and non-virus programs are not treated as viruses. Nevertheless, because virus programs mutate, the specific composition of the feature string may change, for example because different parameters are carried. Thus, if the program to be detected is a virus program or a variant of one, its feature information may contain one to several "fuzzy" bytes. In this case, as long as the bytes other than the "fuzzy" bytes match exactly, the feature string can still be recognized, and it can then be judged whether the program to be detected is a virus program. For example, given the feature string "E9 7C 00 10 ? 37 CB", both "E9 7C 00 10 27 37 CB" and "E9 7C 00 10 9C 37 CB" are recognized. As another example, "E9 7C 37 CB" can match "E9 7C 00 37 CB", "E9 7C 00 11 37 CB" and "E9 7C 00 11 22 37 CB", but does not match "E9 7C 00 11 22 33 44 37 CB", because the substring between 7C and 37 exceeds the length permitted by the interval feature. The search therefore needs to be a fuzzy search performed according to predetermined rules.
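As an illustration only, the single-byte wildcard and the bounded interval in the examples above can be sketched in Python with a byte-level regular expression. The "*" gap token, the function name compile_signature and the default limit of 3 intervening bytes are assumptions made for this sketch; the original text does not state how a gap is written or what the actual limit is.

```python
import re

def compile_signature(pattern: str, max_gap: int = 3):
    """Compile a space-separated hex signature into a bytes regex.

    "?" stands for exactly one arbitrary ("fuzzy") byte.
    "*" (hypothetical token) stands for a gap of 0..max_gap arbitrary bytes.
    """
    parts = []
    for token in pattern.split():
        if token == "?":
            parts.append(b".")
        elif token == "*":
            parts.append(b".{0,%d}" % max_gap)
        else:
            # A fixed byte: escape it so it is matched literally.
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that "." also matches the byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)

wildcard = compile_signature("E9 7C 00 10 ? 37 CB")
gapped = compile_signature("E9 7C * 37 CB", max_gap=3)
```

With these assumptions, wildcard.search(...) finds both "E9 7C 00 10 27 37 CB" and "E9 7C 00 10 9C 37 CB", while gapped.search(...) accepts up to three bytes between 7C and 37 and rejects the five-byte gap of the last example.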

In the second embodiment, the predetermined rule supporting the fuzzy search is: the cumulative result formed by the positional relationships of the feature strings validly found in the file to be detected satisfies a set order feature, a set interval feature, or both. The order feature covers the combination and/or arrangement pattern of the feature strings. The order feature and the interval feature differ for target files of different purposes and different natures. The interval feature reflects the discrete relationship between the constituent characters of a feature string; for example, the gap between constituent characters of a feature string should not exceed a certain number of characters, which depends on what is actually required for target files of different purposes and natures.

In other embodiments, the predetermined rule supporting the fuzzy search may also be: in the cumulative result, the total number of valid positions of the found feature strings in the file to be detected reaches a set value; or, in the cumulative result, the number of feature strings validly found in the file to be detected reaches a set value. Which of these is used as the rule is again decided by actual needs. Such rules may differ for target files of different purposes and natures, and the various rules need to be determined in advance. The "purpose" here is often equivalent to a "requirement", for example deciding whether a file to be detected is the target file with a required probability, so as to strike an overall balance between efficiency, false positive rate and miss rate. The "nature" also affects how different target files are determined. For example, the regularity in the distribution and/or composition of the feature strings of one kind of file may differ from that of another kind; a typical difference is in how dispersed the strings are, which may result from different parameters being carried or different object data being operated on.
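The alternative rules described above can be sketched as simple predicates over a cumulative result. This is a minimal sketch under stated assumptions: the cumulative result is modelled as a dictionary from a feature string's name to the offsets at which it was found, and the interval feature is interpreted as a maximum distance between the offsets of consecutive feature strings. The names Hits, count_rule, positions_rule and order_and_interval_rule are hypothetical.

```python
from typing import Dict, List, Sequence

# Cumulative result: feature string name -> offsets found so far.
Hits = Dict[str, List[int]]

def count_rule(hits: Hits, min_features: int) -> bool:
    """At least min_features distinct feature strings were found."""
    return sum(1 for offsets in hits.values() if offsets) >= min_features

def positions_rule(hits: Hits, min_positions: int) -> bool:
    """The total number of valid positions reaches min_positions."""
    return sum(len(offsets) for offsets in hits.values()) >= min_positions

def order_and_interval_rule(hits: Hits, order: Sequence[str], max_gap: int) -> bool:
    """The named feature strings occur in the given order, each one
    starting at most max_gap bytes after the previous one."""
    reachable = None  # offsets at which the chain built so far can end
    for name in order:
        offsets = hits.get(name, [])
        if reachable is None:
            candidates = list(offsets)
        else:
            candidates = [o for o in offsets
                          if any(p < o <= p + max_gap for p in reachable)]
        if not candidates:
            return False
        reachable = candidates
    return True
```

Keeping every reachable offset, rather than only the earliest one, avoids rejecting a file in which an early occurrence of the first feature string is isolated but a later occurrence starts a valid chain.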

Because the fuzzy search described above does not require an exact match, it is better suited to cases where the feature string varies within the file to be detected. Because different fuzzy rules can be chosen for specific search requirements, this embodiment can further improve, relative to the first embodiment, the efficiency, miss rate and accuracy of determining whether a file to be detected is the target file.

In the two embodiments above, whether the file to be detected is the target file can be determined only after step 13 has judged whether the search result satisfies the preset condition. The condition may be: in the search result, the number of feature strings validly found in the file to be detected reaches a set value; or, in the search result, the positional relationship of the feature strings validly found in the file to be detected satisfies a set order feature and/or interval feature; and so on. Note that the "condition" here may be the same as, or different from, the "rule" described earlier. For a relatively simple file to be detected they may be the same; step 13 can then draw its conclusion directly from the judgment made in the preceding step, or step 13 can simply be omitted. For a more complex file to be detected they may differ; in that case the earlier "rule" preferably focuses on efficiency, while the "condition" of step 13 preferably focuses on accuracy, which improves the overall performance of the method embodiment. As noted above, there may also be several such conditions depending on actual needs.

Although the first and second embodiments can address efficiency and reliability through fuzzy search, when the feature strings are long the method of scanning the file to be detected can still be improved to further raise the efficiency of searching for feature strings.

The third embodiment of the present invention provides such an improvement. Unlike the first and second embodiments, the third embodiment adds a step: determining the feature characters of each feature string and a second rule for constructing the corresponding feature string from those feature characters. This step is also in effect an "initialization" step. In practice, every feature string has a part that is relatively the most stable. For a destructive instruction such as a delete instruction, for example, the instruction itself usually does not change; what changes is usually the position, within the feature string, of the parameters carried or of the object data operated on. Therefore, by taking the most stable part of each feature string as its feature characters, and because the feature characters are few, candidate feature strings can be located quickly by searching for the feature characters. The feature characters can therefore also be called an "anchor". For different feature strings, which specific characters are suitable as feature characters is usually not fixed, and neither is their position within the feature string. Therefore, after the feature characters of a feature string have been determined, the second rule for constructing the corresponding feature string from the feature characters must also be determined according to the position of the feature characters in the feature string and the composition of the feature string itself. This embodiment thus aims to provide a method of quickly locating a feature string through its "anchor" and quickly "assembling" the feature string.

For example, feature string 1 is (ASCII codes written in hexadecimal): 23 4E 6F 55 77 09 0A 9D 34 8C. If "6F 55" is used as the feature characters, the rule for constructing the corresponding feature string from them may be "2, 6", meaning that the 2 characters "23 4E" before the feature characters "6F 55" and the 6 characters "77 09 0A 9D 34 8C" after them make up the corresponding feature string. If "23 4E" is used as the feature characters, the rule may be "0, 8", meaning that 0 characters before the feature characters "23 4E" (that is, no characters) and the 8 characters "6F 55 77 09 0A 9D 34 8C" after them make up the corresponding feature string, and so on. In this way, a feature string can be located quickly in the file to be detected through its feature characters, and it can then be judged quickly whether the located string is the feature string being sought.
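The "before, after" rule in this example can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name build_candidate and its parameters are hypothetical. Given the offset at which the feature characters (the anchor) were found, the rule determines which slice of the file should be compared with the feature string.

```python
from typing import Optional, Tuple

def build_candidate(data: bytes, anchor_offset: int, anchor_len: int,
                    before: int, after: int) -> Optional[Tuple[int, bytes]]:
    """Apply a "before, after" rule around an anchor found at anchor_offset.

    Returns (start offset, candidate bytes), or None when the rule would
    reach outside the file, in which case construction fails.
    """
    start = anchor_offset - before
    end = anchor_offset + anchor_len + after
    if start < 0 or end > len(data):
        return None
    return start, data[start:end]

feature = bytes.fromhex("234E6F5577090A9D348C")
data = b"\x00\x00" + feature + b"\xff"

# Anchor "6F 55" with rule "2, 6".
offset = data.find(bytes.fromhex("6F55"))
candidate = build_candidate(data, offset, 2, before=2, after=6)

# Anchor "23 4E" with rule "0, 8" yields the same candidate.
offset2 = data.find(bytes.fromhex("234E"))
candidate2 = build_candidate(data, offset2, 2, before=0, after=8)
```

Both rules yield the same ten-byte candidate starting at offset 2, which can then be compared with feature string 1.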

Therefore, in the third embodiment, the file to be detected is scanned according to the following steps:

The feature characters are searched for in the file to be detected until the whole file has been searched. For each occurrence of the feature characters found, the corresponding feature string is constructed according to the second rule; if the construction succeeds, the position of the successfully constructed feature string is taken as a valid position. This way of scanning is suitable for finding all positions of a feature string in the file to be detected. Where only the position of the first occurrence of the feature string is needed, the file to be detected can be scanned as follows: the feature characters are searched for in the file to be detected; if they are found, the corresponding feature string is constructed according to the second rule; if the construction succeeds, the position of the successfully constructed feature string is taken as the valid position and the search ends.
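The two scanning variants above can be sketched in one function. This is a sketch under assumptions: the anchor is taken to be a contiguous slice of the feature string, construction is considered successful on an exact match, and the name scan_with_anchor and its parameters are hypothetical.

```python
from typing import List

def scan_with_anchor(data: bytes, feature: bytes, anchor_start: int,
                     anchor_len: int, first_only: bool = False) -> List[int]:
    """Return the start offsets of feature in data, located via its anchor.

    anchor_start and anchor_len select the anchor inside feature; the
    "before, after" rule is derived from them.
    """
    anchor = feature[anchor_start:anchor_start + anchor_len]
    before = anchor_start
    after = len(feature) - anchor_start - anchor_len
    valid = []
    i = data.find(anchor)
    while i != -1:
        start, end = i - before, i + anchor_len + after
        # Construction succeeds only if the slice exists and matches.
        if start >= 0 and end <= len(data) and data[start:end] == feature:
            valid.append(start)
            if first_only:
                break
        i = data.find(anchor, i + 1)
    return valid
```

With first_only=True the loop stops at the first successfully constructed feature string, which corresponds to the second variant described above.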

In the third embodiment, successful construction of a feature string so far refers to exact matching, that is, the characters of the feature string constructed according to the second rule are identical to the characters of the feature string on which the search is based. To achieve fuzzy matching, construction may instead be deemed successful when the proportion of characters of the constructed feature string that are the same as the characters of the feature string on which the search is based reaches a set value. Going further, more specific parameters describing the discrete relationship between characters may be added to the second rule used in fuzzy matching. Many other criteria for successful construction are possible; as noted above, the specific criterion can be determined according to specific needs.
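The proportion-based criterion can be illustrated as follows. This is a sketch only; the position-by-position comparison, the function names and the 0.8 default threshold are assumptions, since the original text does not specify how the proportion is computed or what the set value is.

```python
def match_ratio(candidate: bytes, feature: bytes) -> float:
    """Fraction of positions at which candidate and feature agree."""
    if not feature or len(candidate) != len(feature):
        return 0.0
    same = sum(a == b for a, b in zip(candidate, feature))
    return same / len(feature)

def built_successfully(candidate: bytes, feature: bytes,
                       threshold: float = 0.8) -> bool:
    """Fuzzy criterion: construction succeeds when the ratio reaches threshold."""
    return match_ratio(candidate, feature) >= threshold
```

Setting threshold to 1.0 reduces this criterion to the exact-match case.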

For the exact matching described above, comparing characters one by one may take considerable time. The embodiments of the present invention allow a further improvement: after a feature string has been constructed, the checksum of the feature string newly constructed from the feature characters is calculated and compared with the checksum calculated in advance for the feature string on which the search is based. The comparison between the newly constructed feature string and the feature string on which the search is based can thereby be completed more quickly. In practice, a feature string is usually about 50 bytes long. If the storage used for the checksum has enough bits, comparison errors can be practically eliminated; this is very easy to achieve on a modern computer, and the technique is not limited to checksums.
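One way to read the checksum comparison above is as a lookup of a precomputed fingerprint, which lets a single constructed candidate be checked against many stored feature strings at once. The sketch below uses SHA-256 from the Python standard library as a stand-in for the unspecified checksum; that choice, and the names fingerprint, build_index and identify, are assumptions of this sketch.

```python
import hashlib
from typing import Dict, Optional

def fingerprint(data: bytes) -> bytes:
    """Stand-in for the checksum: a 256-bit digest of the bytes."""
    return hashlib.sha256(data).digest()

def build_index(features: Dict[str, bytes]) -> Dict[bytes, str]:
    """Precompute the fingerprint of every stored feature string."""
    return {fingerprint(value): name for name, value in features.items()}

def identify(index: Dict[bytes, str], candidate: bytes) -> Optional[str]:
    """Return the name of the feature string whose fingerprint equals
    that of the newly constructed candidate, or None."""
    return index.get(fingerprint(candidate))
```

A wide digest makes accidental collisions between different byte strings negligible in practice, which corresponds to the remark about using enough bits.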

FIG. 4 is a block diagram of an embodiment of the apparatus of the present invention, which includes:

a storage unit 41 that stores the complete set of feature strings; a feature string selection unit 42 that selects, in the storage unit 41, a set of feature strings for determining the target file, the set containing at least one feature string for determining the target file;

a file scanning unit 43, configured to search the file to be detected for the feature strings in the set according to a rule and obtain a search result; a judging unit 44, configured to judge whether the result satisfies a first condition; and

a target file determining unit 45, configured to determine that the file to be detected is the target file when the result satisfies the first condition.

The storage unit 41 stores the complete set of feature strings and usually also stores other auxiliary data.

The set of feature strings contains the feature strings used to determine the target file and is the basis for finding identical or similar files. In this embodiment the set is held in a database in the form of a two-dimensional table; the database can of course be replaced by another data structure, such as a linear queue. A typical database structure is shown in FIG. 3. The fields "number", "feature string" and "length" are the most basic; the other fields are auxiliary data that improve the performance of the apparatus of the embodiment. In general, a feature string should satisfy the following conditions: uniqueness, a length as short as possible, and so on.

Once the kind of target file to be detected has been determined, the feature string selection unit 42 selects from the storage unit 41 the feature strings used to determine the target file and forms the feature string set.

The file scanning unit 43 operates directly on the file being scanned: it searches the file to be detected for the feature strings in the set according to a rule and obtains a search result. The rule differs for different target files; it is a search strategy drawn up mainly with regard to search speed, search accuracy and the nature of the target file being sought. In this embodiment, the rule uses the following steps to search the file to be detected for the feature strings in the set: one feature string in the set is selected at a time until every feature string in the set has been selected once; for each selected feature string, the file to be detected is scanned to obtain the valid positions of the feature string in the file; and the positions of all feature strings in the file to be detected are taken as the search result.

The judging unit 44 judges whether the search result satisfies the preset condition; if it does, the target file determining unit 45 determines that the file to be detected is the target file.

A typical flow by which the file scanning unit 43 scans the file to be detected is as follows. A feature string that has not yet been selected is read from the set of feature strings, and the file to be detected is scanned with it, that is, the file is searched for the presence of that feature string. It is then judged whether the feature string has been found. While it cannot be established that the portion scanned so far contains the feature string, scanning continues. If the portion scanned so far is judged to contain the feature string, the result of the scan is recorded; in this example, the valid position of the feature string in the file to be detected is recorded, the valid position including the offset of the feature string in the file, its length, and so on.

This example also includes a further check, namely whether the scan has finished; if it has, the scanning operation is terminated and the final search result is determined and obtained (this is self-evident).

It is also checked whether every feature string in the feature string set has been scanned for; if not, the next feature string that has not yet been selected is chosen and scanning continues.
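The typical flow of the file scanning unit 43 described above can be sketched as a double loop: an outer loop over the feature string set and an inner loop over the file. This is an illustrative sketch using exact matching; the function name scan_file and the representation of the search result as a dictionary of (offset, length) pairs are assumptions.

```python
from typing import Dict, List, Tuple

def scan_file(data: bytes,
              feature_set: Dict[str, bytes]) -> Dict[str, List[Tuple[int, int]]]:
    """For every feature string, record each valid position as (offset, length)."""
    result: Dict[str, List[Tuple[int, int]]] = {}
    for name, feature in feature_set.items():   # select each string once
        positions = []
        offset = data.find(feature)
        while offset != -1:                      # until the file is exhausted
            positions.append((offset, len(feature)))
            offset = data.find(feature, offset + 1)
        result[name] = positions
    return result
```

The returned dictionary plays the role of the search result that is then passed to the judging unit 44.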

In the embodiment shown in FIG. 4, having the file scanning unit 43 find the valid positions of every feature string in the file to be detected may not be strictly necessary, and in some cases it reduces search efficiency. The second embodiment of the apparatus of the present invention offers another method. A feature string that has not yet been selected is chosen from the set; for each selected feature string, the file to be detected is scanned to obtain the valid positions of the feature string in the file; the valid positions are accumulated to obtain a cumulative result. If the cumulative result satisfies a predetermined rule, a definite judgment about the file to be detected can already be made on the basis of the current cumulative result, so the search of the file can end and the cumulative result is taken as the search result for further processing.
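The early-terminating scan of the second embodiment can be sketched as follows. This is a minimal sketch: only the first valid position of each feature string is recorded, the rule is passed in as a callable, and the names scan_until_rule and rule are hypothetical.

```python
from typing import Callable, Dict, List

def scan_until_rule(data: bytes, features: List[bytes],
                    rule: Callable[[Dict[bytes, int]], bool]) -> Dict[bytes, int]:
    """Accumulate first valid positions, stopping once rule is satisfied."""
    cumulative: Dict[bytes, int] = {}
    for feature in features:
        offset = data.find(feature)
        if offset != -1:
            cumulative[feature] = offset
        if rule(cumulative):
            break  # a definite judgment is already possible
    return cumulative

# Example rule: two distinct feature strings found.
found = scan_until_rule(b"..AB..CD..EF", [b"AB", b"CD", b"EF"],
                        lambda hits: len(hits) >= 2)
```

In this example the third feature string is never searched for, because the rule is already satisfied after the second one.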

In the second embodiment of the apparatus, for the valid positions of a particular feature string in the file to be detected, either the first valid position or all valid positions may be recorded. Which is chosen depends on actual needs, for example on whether the feature strings in a set have definite positional or relational features.

Searching the file to be detected for the feature strings in the set by exact matching can reduce search accuracy in some cases.

In the second embodiment of the apparatus, the predetermined rule supporting the fuzzy search is: the cumulative result formed by the positional relationships of the feature strings validly found in the file to be detected satisfies a set order feature, a set interval feature, or both. The order feature covers the combination and/or arrangement pattern of the feature strings. The order feature and the interval feature differ for target files of different purposes and different natures.

In other embodiments, the predetermined rule supporting the fuzzy search may also be: in the cumulative result, the total number of valid positions of the found feature strings in the file to be detected reaches a set value; or, in the cumulative result, the number of feature strings validly found in the file to be detected reaches a set value. Which of these is used as the rule is again decided by actual needs. Such rules may differ for target files of different purposes and natures, and the various rules need to be determined in advance. In the two embodiments above, whether the file to be detected is the target file can be determined only after the judging unit 44 has judged whether the search result satisfies the preset condition. The condition may be: in the search result, the number of feature strings validly found in the file to be detected reaches a set value; or, in the search result, the positional relationship of the feature strings validly found in the file to be detected satisfies a set order feature and/or interval feature; and so on.

The third embodiment of the apparatus of the present invention provides an improvement that raises the efficiency of searching for feature strings. Unlike the first and second embodiments of the apparatus, in the third embodiment the feature string selection unit 42 determines the feature characters of each feature string and the second rule for constructing the corresponding feature string from those feature characters. Based on the feature characters and the second rule, the file to be detected can then be scanned according to the steps described above for the third embodiment of the method.

For the exact matching described above, after a feature string has been constructed, the checksum of the feature string newly constructed from the feature characters can be calculated and compared with the checksum calculated in advance for the feature string on which the search is based. The comparison between the newly constructed feature string and the feature string on which the search is based can thereby be completed more quickly. This approach is very easy to implement, and in practice the comparison is not limited to checksums.

From the above description, persons working in the field of the invention can make various changes and modifications without departing from the technical idea of the invention. For example, a method of identifying harmful programs may have the following features: converting instructions and/or parameters that damage the system into feature strings, and taking all of the feature strings as a feature string set, the set containing at least one feature string for determining a harmful program;

searching the suspicious program for the feature strings in the set according to a rule and obtaining a search result; and

judging whether the search result satisfies a first condition and, if so, determining that the suspicious program is a harmful program.

In other words, identifying a harmful program amounts to treating a program that is not known to be harmful or harmless as a file whose content is already available, and checking whether it contains feature strings with some destructive effect, such as destructive instructions and/or parameters. With the solution of the present invention, this search can be given a certain degree of fuzziness, which increases the accuracy of the judgment and reduces false positives and misses. For example, a suspicious program entering the system through the network can be scanned; if predetermined instructions and/or parameters that damage the system are found in it, that is, if behavior harmful to the system is found in the suspicious program, it can be determined to be a harmful program and further measures can be taken against it.

As a further example, consider judging whether a normal program has been attacked by a virus program. One sign of an attack is that the program contains certain destructive instructions and/or parameters, and the method of the present invention can clearly make this judgment. (If only the program's data has been damaged, conventional means such as a checksum can easily detect whether the program has been damaged.) As yet another example, to identify a program, it is only necessary to turn its identifying features into feature strings and form a feature string set; the program can then easily be identified with the solution of the present invention. And so on.

Therefore, the technical scope of the present invention is not limited to the content of the specification; there are many specific technical applications determined by its claims. The above is only a preferred embodiment of the present invention. It should be noted that persons of ordinary skill in the art can make a number of improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the present invention.

Claims


1. A method for determining a target file, based on a storage unit storing a complete set of feature strings, wherein:

a feature string set for determining the target file is selected in the storage unit, the set including at least one feature string for determining the target file;

the feature strings in the set are searched for in the file to be detected according to a rule, and a search result is obtained; and

it is judged whether the search result satisfies a first condition and, if so, the file to be detected is determined to be the target file.

2. The method according to claim 1, wherein the following steps are used to search the file to be detected for the feature strings in the set according to the rule:

one feature string in the set is selected at a time until every feature string in the set has been selected once; for each selected feature string, the file to be detected is scanned to obtain the valid positions of the feature string in the file to be detected; and the positions of all the feature strings in the file to be detected are taken as the search result.

3. The method according to claim 1, wherein the following steps are used to search the file to be detected for the feature strings in the set according to the rule:

until a cumulative result satisfies a first rule, a feature string that has not yet been selected is chosen from the set; for each selected feature string, the file to be detected is scanned to obtain the valid positions of the feature string in the file to be detected; the valid positions are accumulated to obtain the cumulative result; and, after the search of the file to be detected ends, the cumulative result is taken as the search result.

4. The method according to claim 3, wherein the first rule is: in the cumulative result, the total number of valid positions of the found feature strings in the file to be detected reaches a set value; or, in the cumulative result, the number of feature strings validly found in the file to be detected reaches a set value; or, in the cumulative result, the positional relationship of the feature strings validly found in the file to be detected satisfies a set order feature and/or interval feature.

5. The method according to claim 1, 2, 3 or 4, wherein the first condition is: in the search result, the number of feature strings validly found in the file to be detected reaches a set value; or, in the search result, the positional relationship of the feature strings validly found in the file to be detected satisfies a set order feature and/or interval feature.

6. The method according to claim 5, further comprising: determining the feature characters of each feature string and a second rule for constructing the corresponding feature string from the feature characters; and scanning the file to be detected according to the following steps:

the feature characters are searched for in the file to be detected until the whole file has been searched; for each occurrence of the feature characters found, the corresponding feature string is constructed according to the second rule; and, if the feature string is successfully constructed, the position of the successfully constructed feature string is taken as a valid position.

7. The method according to claim 5, further comprising: determining the feature characters of each feature string and a second rule for constructing the corresponding feature string from the feature characters; and scanning the file to be detected according to the following steps:

the feature characters are searched for in the file to be detected; if the feature characters are found, the corresponding feature string is constructed according to the second rule; and, if the feature string is successfully constructed, the position of the successfully constructed feature string is taken as the valid position and the search ends.

8. The method according to claim 7, wherein the feature string is determined to be successfully constructed if the characters of the feature string constructed according to the second rule are identical to the characters of the feature string on which the search is based, or if the proportion of characters that are the same reaches a set value.

9. The method according to claim 6, wherein the feature string is determined to be successfully constructed if the characters of the feature string constructed according to the second rule are identical to the characters of the feature string on which the search is based, or if the proportion of characters that are the same reaches a set value.

10. An apparatus for determining a target file, comprising a storage unit that stores a complete set of feature strings, and further comprising:

a feature string selection unit, configured to select, in the storage unit, a set of feature strings for determining the target file, the set including at least one feature string for determining the target file;

a file scanning unit, configured to search the file to be detected for the feature strings in the set according to a rule and obtain a search result;

a judging unit, configured to judge whether the result satisfies a first condition; and

a target file determining unit, configured to determine that the file to be detected is the target file when the result satisfies the first condition.