HK1197940A

HK1197940A - Fuzzy whitelisting anti-malware systems and methods

Info

Publication number: HK1197940A
Application number: HK14111446.1A
Authority: HK
Inventors: I．弗拉德．托凡; V．索林．杜代亚; D．维罗埃尔．卡尼亚
Original assignee: 比特梵德知识产权管理有限公司
Priority date: 2011-11-02
Filing date: 2012-09-05
Publication date: 2015-02-27

Description

Fuzzy whitelisting anti-malware systems and methods

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of the filing date of United States provisional patent application No. 61/554,859, filed on November 2, 2011, which is hereby incorporated by reference in its entirety.

Background

The present invention relates to systems and methods for protecting users from malware, and in particular to software whitelisting.

Malicious software, also known as malware, affects a large number of computer systems around the world. In many of its forms (e.g., computer viruses, worms, Trojan horses, and rootkits), malware presents a serious risk to millions of computer users, making them susceptible to data loss, identity theft, and productivity loss, among others.

Computer programs dedicated to malware scanning employ various methods of detecting and eliminating malware from a user computer system. Such methods include behavior-based techniques and content-based techniques. Behavior-based approaches may involve allowing a suspected program to execute in an isolated virtual environment, identifying malicious behavior and blocking execution of an offending program. In content-based approaches, the content of a suspect file is typically compared to a database of known malware-identifying signatures. If a known malware signature is found in a suspect file, the file is marked as malicious.

Other methods of combating malware employ application whitelisting, which involves maintaining a list of software and behaviors allowed on the user's computer system, and blocking the execution of all other applications. Such methods are particularly effective against polymorphic malware, which can randomly modify its malware-identifying signatures, thereby rendering conventional content-based methods ineffective.

Some whitelisting applications employ hash values to identify and ensure the integrity of whitelisted software. Cryptographic hashes may be created for files or groups of files that are affiliated with whitelisted applications and stored for reference. The respective application is then authenticated by comparing the stored hash with a new hash generated at runtime.

The performance of anti-malware whitelisting methods may depend on the ability to maintain and update a whitelist database in an efficient and flexible manner.

Disclosure of Invention

According to one aspect, a method comprises: performing, at a client computer system, an initial malware scan of a plurality of target objects of the client computer system; and in response to a preliminary determination by the initial malware scan that the target object is suspected to be malicious: generating, at the client computer system, a plurality of target hashes of the target object, each target hash representing a distinct code block of the target object, each distinct code block consisting of a sequence of processor instructions of the target object; sending the plurality of target hashes from the client computer system to a server computer system connected to the client computer system via a wide area network; and receiving, at the client computer system, a server-side indicator from the server computer system of whether the target object is malicious. The server-side indicator is generated by the server computer system by: retrieving, for at least one target hash of the plurality of target hashes, a plurality of reference hashes of a reference object, the reference object being selected from a group of whitelisted objects according to the target hash, and determining a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes when the plurality of target hashes are not identical to the plurality of reference hashes; and designating the target object as non-malicious when the similarity score exceeds a predetermined threshold.

According to another aspect, a method comprises: receiving, at a server computer system via a wide area network, a plurality of target hashes of a target object of a client computer system connected to the server computer system; generating, at the server computer system, a server-side indicator of whether the target object is malicious; and sending the server-side indicator of whether the target object is malicious to the client computer system. The plurality of target hashes are generated at the client computer system in response to a preliminary determination by the client computer system that the target object is suspected of being malicious, the preliminary determination resulting from an initial malware scan of a plurality of target objects of the client computer system. Generating, at the server computer system, a server-side indicator of whether the target object is malicious comprises: retrieving, for at least one target hash of the plurality of target hashes, a plurality of reference hashes of a reference object, the reference object being selected from a group of whitelisted objects according to the target hash, and determining a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes when the plurality of target hashes are not identical to the plurality of reference hashes; and designating the target object as non-malicious when the similarity score exceeds a predetermined threshold.

According to another aspect, a method comprises: receiving, at a server computer system, a plurality of target hashes of a target object, each target hash representing a distinct code block of the target object, each distinct code block consisting of a sequence of processor instructions of the target object; for at least one target hash of the plurality of target hashes, employing the server computer system to: retrieve a plurality of reference hashes of a reference object, the reference object being selected from a set of whitelisted objects according to the target hash, and determine a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes when the plurality of target hashes are not identical to the plurality of reference hashes; and employing the server computer system to mark the target object as non-malicious when the similarity score exceeds a predetermined threshold.

According to another aspect, a computer system includes at least one processor programmed to: receive a plurality of target hashes, each target hash representing a distinct code block of a target object, each distinct code block consisting of a sequence of processor instructions of the target object; for at least one target hash of the plurality of target hashes: retrieve a plurality of reference hashes of a reference object, the reference object being selected from a set of whitelisted objects according to the target hash, and determine a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes when the plurality of target hashes are not identical to the plurality of reference hashes; and mark the target object as non-malicious when the similarity score exceeds a predetermined threshold.

According to another aspect, a non-transitory computer-readable storage medium encodes instructions that, when executed on a processor, cause the processor to perform the steps of: receiving a plurality of target hashes, each target hash representing a distinct code block of a target object, each distinct code block consisting of a sequence of processor instructions of the target object; retrieving, for at least one target hash of the plurality of target hashes, a plurality of reference hashes of a reference object selected from a group of whitelisted objects according to the target hash; and, when the plurality of target hashes is not identical to the plurality of reference hashes, determining a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes. When the similarity score exceeds a predetermined threshold, the target object is marked as non-malicious.

According to another aspect, a computer system comprises: means for receiving a plurality of target hashes, each target hash representing a distinct code block of a target object, each distinct code block consisting of a sequence of processor instructions of the target object; means for retrieving a plurality of reference hashes of a reference object selected from a set of whitelisted objects according to a selected target hash of the plurality of target hashes; means for determining a similarity score from a hash count common to both the plurality of target hashes and the plurality of reference hashes; and means for marking the target object as non-malicious according to the similarity score.

According to another aspect, a method comprises: receiving, at a server computer system, a plurality of target hashes, each target hash representing a distinct data block of a target object, each distinct data block consisting of a sequence of processor instructions of the target object; in response to receiving the plurality of target hashes, retrieving a plurality of reference hashes representing a whitelisted data object; and, when the plurality of target hashes are not identical to the plurality of reference hashes, marking the target object as non-malicious when the plurality of target hashes share a majority of items with the plurality of reference hashes.
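The fuzzy comparison recited in the aspects above can be illustrated with a minimal Python sketch. The text specifies only that the similarity score is determined according to the count of hashes common to the target and reference hash sets; the normalization by the larger set size, the function names, the default threshold of 0.5, and the "unknown" label are assumptions made for illustration.

```python
def similarity_score(target_hashes, reference_hashes):
    """Share of hashes common to both sets (the normalization is an assumption)."""
    target, reference = set(target_hashes), set(reference_hashes)
    if not target or not reference:
        return 0.0
    common = len(target & reference)  # hash count common to both sets
    return common / max(len(target), len(reference))


def classify(target_hashes, reference_hashes, threshold=0.5):
    """Mark the target as non-malicious when it is close enough to a whitelisted object."""
    target, reference = set(target_hashes), set(reference_hashes)
    if target == reference:
        # Identical hash sets: treated here as an exact whitelist match (assumption).
        return "non-malicious"
    score = similarity_score(target, reference)
    return "non-malicious" if score > threshold else "unknown"
```

With these assumptions, a target sharing three of four hashes with a whitelisted reference scores 0.75 and is marked non-malicious, which corresponds to the "majority of items" criterion of the last aspect.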

Drawings

The foregoing aspects and advantages of the invention will become better understood upon reading the following detailed description and upon reference to the drawings in which:

FIG. 1 shows an exemplary anti-malware system according to some embodiments of the present invention.

FIG. 2 illustrates an exemplary hardware configuration of a client computer system according to some embodiments of the invention.

FIG. 3 shows an exemplary hardware configuration of an anti-malware server system according to some embodiments of the present invention.

FIG. 4 shows a diagram of an exemplary anti-malware application executing on a client computer system in accordance with some embodiments of the present invention.

FIG. 5 shows an exemplary application executing on an anti-malware server system, according to some embodiments of the invention.

FIG. 6 illustrates an exemplary sequence of steps performed by the client anti-malware application of FIG. 4 in accordance with some embodiments of the present invention.

FIG. 7 shows an example of code normalization according to some embodiments of the invention.

FIG. 8 shows an exemplary memory representation of processor instructions, according to some embodiments of the invention.

FIG. 9 shows an exemplary code block and an exemplary opcode pattern corresponding to the code block, according to some embodiments of the invention.

FIG. 10 illustrates an exemplary code snippet including a plurality of code blocks and an exemplary Object Data Indicator (ODI) corresponding to the code snippet, according to some embodiments of the invention.

FIG. 11 shows an exemplary sequence of steps performed by the server anti-malware application of FIG. 5, according to some embodiments of the invention.

Detailed Description

In the following description, it is understood that all recited connections between structures may be direct operative connections or indirect operative connections through intervening structures. A set of elements includes one or more elements. Any recitation of an element is understood to refer to at least one element. A plurality of elements includes at least two elements. Unless otherwise required, the described method steps need not be performed in the particular order illustrated. A first element (e.g., data) derived from a second element encompasses a first element equal to the second element, as well as a first element generated by processing the second element and optionally other data. Making a determination or decision according to a parameter encompasses making the determination or decision according to the parameter and optionally according to other data. Unless otherwise specified, an indicator of some quantity/data may be the quantity/data itself or an indicator different from the quantity/data itself. The computer programs described in some embodiments of the invention may be stand-alone software entities or sub-entities (e.g., subroutines, code objects) of other computer programs. Unless otherwise specified, a target object is a file or process residing on a client computer system. An identifier of a target object comprises data that allows selective identification and retrieval of the target object itself, not just as part of a larger data structure such as the complete memory of a client computer system. Unless otherwise specified, an Object Data Indicator (ODI) of a target object includes characteristics of the target object data (e.g., code blocks, opcode patterns, hashes) that facilitate determining whether the target object is malicious (e.g., infected with malware). Unless otherwise specified, a hash is the output of a hash function. A hash function is a mathematical transformation that maps a sequence of symbols (e.g., characters, bits) into a shorter sequence of numbers or a bit string. A target hash is a hash calculated on the data of a target object. Unless otherwise specified, the term whitelisted is understood to mean trusted to be clean (i.e., free of malware). A first set is identical to a second set when all elements of the first set are contained in the second set and all elements of the second set are contained in the first set. Computer-readable media encompass non-transitory media such as magnetic, optical, and semiconductor storage media (e.g., hard drives, optical disks, flash memory, DRAM), as well as communication links such as conductive cables and fiber optic links. According to some embodiments, the disclosure provides, among other things, a computer system comprising hardware (e.g., one or more processors) programmed to perform the methods described herein, and a computer-readable medium encoding instructions to perform the methods described herein.

The following description illustrates embodiments of the invention by way of example and not necessarily by way of limitation.

FIG. 1 shows an exemplary malware detection system 10 according to some embodiments of the present invention. The system 10 includes a set of anti-malware (AM) server systems 20 a-20 c and a set of client computer systems 30 a-30 b. Client computer systems 30 a-30 b may represent end-user computers, each having a processor, memory, and storage, and running an operating system such as Linux, among others. Some client computer systems 30 a-30 b may represent mobile computing and/or telecommunications devices, such as tablet PCs and mobile phones. In some embodiments, the client computer systems 30 a-30 b may represent individual customers, or several client computer systems may belong to the same customer. In some embodiments, one of the systems 30 a-30 b may be a server computer (e.g., a mail server), in which case a malware detection service may be used to identify malware present in emails or other messages sent to multiple clients and to take appropriate action (e.g., remove or quarantine items infected with malware) before delivering the messages to the clients. Network 12 connects client computer systems 30 a-30 b with anti-malware server systems 20 a-20 c. The network 12 may be a wide area network, such as the Internet. Portions of the network 12, for example, the portions that interconnect the client computer systems 30 a-30 b, may also include a Local Area Network (LAN).

Fig. 2 shows an exemplary hardware configuration of the client computer system 30. In some embodiments, system 30 includes processor 24, memory unit 26, a set of input devices 28, a set of output devices 32, a set of storage devices 34, and a communication interface controller 36, all connected by a set of buses 38.

In some embodiments, processor 24 comprises a physical device (e.g., a multi-core integrated circuit) configured to perform computations and/or logical operations on a set of signals and/or data. In some embodiments, such logical operations are delivered to processor 24 in the form of a sequence of processor instructions (e.g., machine code or other type of software). Memory unit 26 may include a volatile computer-readable medium (e.g., RAM) that stores data/signals accessed or generated by processor 24 during execution of instructions. Input device 28 may include a computer keyboard and mouse, among other things, that allows a user to introduce data and/or instructions into system 30. The output device 32 may include a display device, such as a monitor. In some embodiments, the input device 28 and the output device 32 may share a common piece of hardware, as in the case of a touch screen device. Storage 34 includes a computer-readable medium that enables the non-volatile storage, reading, and writing of software instructions and/or data. Exemplary storage devices 34 include magnetic and optical disks and flash memory devices, as well as removable media, such as CD and/or DVD disks and drives. Communication interface controller 36 enables system 30 to connect to a computer network and/or other machine/computer systems. The exemplary communication interface controller 36 comprises a network adapter. Bus 38 collectively represents a plurality of system, peripheral and chipset buses and/or all other circuitry that enable intercommunication of devices 24-36 of computer system 30. For example, bus 38 may include a north bridge bus connecting processor 24 to memory 26 and/or a south bridge bus connecting processor 24 to devices 28-36, among others.

Fig. 3 shows a hardware configuration of an exemplary AM server system 20 of systems 20 a-20 c according to some embodiments of the invention. The AM server system 20 may be a computer system that includes a server processor 124, a server memory 126, a set of server storage devices 134, and a server communication interface controller 136, all connected to each other via a set of server buses 138. Although some details of the hardware configuration may differ between server system 20 and client computer system 30, the range of devices 124, 126, 134, 136, and 138 may be similar to the range of devices 24, 26, 34, 36, and 38, respectively, described above.

The client computer system 30 may include a client anti-malware (AM) application 40 and a client-side cache 56, as shown in FIG. 4. In some embodiments, the client AM application 40 may be a standalone application, or may be an anti-malware module of a security suite having anti-virus, firewall, anti-spam, and other modules. The client AM application 40 may include an active AM scanner 42, a static AM scanner 44, an emulator 46 connected to the static AM scanner 44, a code normalization engine 48 connected to the scanners 42 and 44, a client AM communication manager 52, and a hash engine 54 connected to the communication manager 52 and the code normalization engine 48.

In some embodiments, the client AM application 40 is configured to conduct the client-side portion of a client-server collaborative scan to detect malware stored on a computer-readable medium forming part of the client computer system 30 (e.g., memory, hard drive) or on a computer-readable medium connected to the system 30 (e.g., memory card, external hard drive, network device). As part of the client-server collaborative scan, the client AM application 40 is configured to send target Object Data Indicators (ODI) 100 to AM server systems 20 a-20 c and to receive scan reports 50 from the systems 20 a-20 c.

The target objects scanned by the AM application 40 include computer files and processes. Each process may include a set of loaded memory modules (i.e., the loaded images of a target executable file and of the dynamic-link libraries it references), as well as any additional files corresponding to the loaded memory modules. A target object may be considered malware if it contains at least a portion of a malware entity (e.g., a virus, worm, Trojan).

In some embodiments, the ODI 100 includes a plurality of code block indicators, each indicating a distinct code block of the target object. Exemplary contents and formats of the ODI 100 will be discussed in detail with respect to FIGS. 7-9.

In some embodiments, the scan report 50 includes an identifier (e.g., tag, file ID) of the target object, a malware status indicator (e.g., infected, clean, unknown) of the target object, and/or a set of identifiers of malware agents infecting the target object, such as names of individual malware agents (e.g., win32.word.download.gen), malware class indicators (viruses, rootkits, etc.), or pointers to the respective agents in a malware knowledge base. In some embodiments, a single scan report may be compiled for a batch of target objects.

In some embodiments, the client communication manager 52 is configured to manage communications with the AM server systems 20 a-20 c. For example, manager 52 may establish a connection over network 12, send/receive data to/from AM servers 20 a-20 c, maintain a list of scanning transactions in progress, and associate each target ODI 100 with the AM server performing the server-side scanning.

The active AM scanner 42 and the static AM scanner 44 enable the client AM application 40 to run a preliminary anti-malware scan of the target object, as shown in more detail below. If the preliminary scan detects malicious content, the offending target object is reported directly to the user without having to undergo a client-server scan, thus saving time and computer resources. In some embodiments, file target objects are handled by the static AM scanner 44, while process target objects are handled by the active AM scanner 42. In some embodiments, the static AM scanner 44 may use the emulator 46 to unpack a file and execute it in a protected environment, separate from main memory. The scanners 42, 44 may use behavior-based methods, various heuristics, content-based methods (e.g., signature matching), or a combination thereof to determine whether the target object is malware. Examples of heuristic criteria for determining whether a target object is malicious include, among others, the relative sizes of the various sections of the Portable Executable (PE) file of the target object, the information density of each section, the presence of particular flags and flag groups in the PE header, information about the packer/protector (if any), and the presence of particular text patterns within the executable.

The client AM application 40 may employ the code normalization engine 48 and the hash engine 54 to generate the target ODI 100. The operation of the code normalization engine 48 will be discussed below with respect to FIG. 7. The hash engine 54 is configured to receive opcode patterns and generate a hash of each respective opcode pattern, as shown with respect to FIGS. 8-9. In some embodiments, a hash is the output of a hash function, i.e., a mathematical transformation that maps a sequence of symbols (e.g., characters, bits) into a sequence of numbers or a bit string. Exemplary hash functions employed by hash engine 54 include Cyclic Redundancy Check (CRC), Message Digest (MD), and Secure Hash Algorithm (SHA) functions, among others. An exemplary hash is a 4-byte CRC32.
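As an illustration of the hashing step, the following Python sketch computes a 4-byte CRC32 value for each opcode pattern using the standard zlib module. The function name and the sample byte sequences are hypothetical, and the sketch assumes each opcode pattern is already available as a byte string; it is not the hash engine 54 itself.

```python
import zlib


def hash_opcode_patterns(opcode_patterns):
    """Return one CRC32 hash (an unsigned 32-bit integer) per opcode pattern."""
    return [zlib.crc32(pattern) & 0xFFFFFFFF for pattern in opcode_patterns]


# Hypothetical opcode patterns, one byte string per code block.
patterns = [bytes([0x55, 0x8B, 0x83]), bytes([0xE8, 0x85, 0x74])]
hashes = hash_opcode_patterns(patterns)
```

Each resulting value fits in 4 bytes, so the set of hashes for a target object can be packed compactly before being sent to the server.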

Some embodiments of the client-side cache 56 include, at any given time, a repository of ODIs corresponding to target objects residing on the respective client system 30 that have already been scanned for malware. In some embodiments, the cache 56 may include a set of hashes of target object ODIs; each ODI received from client system 30 may be hashed, with duplicate hashes removed, and the resulting hash stored as a unique indicator of the respective ODI. The cache 56 accelerates malware scanning. If the ODI of a target object, or a hash thereof, is found in the client cache 56, indicating that the respective target object has been scanned at least once, the malware status of the target object may be retrieved directly from the cache 56 and reported to the user, which is much faster than performing a new scan of the target object. For each ODI, some embodiments of the cache 56 may include an object identifier (e.g., tag, file ID) and an indicator of the malware status of the respective target object.
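A cache of this kind can be sketched in Python as a dictionary keyed by a hash of the ODI. The class name, the method names, and the use of SHA-256 as the ODI-level hash are assumptions made for illustration; the sketch also assumes that an ODI is a sequence of 4-byte hashes.

```python
import hashlib


class ClientCache:
    """Maps a hash of an ODI to (object identifier, malware status)."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(odi_hashes):
        # Serialize the 4-byte hashes in order and hash the result (SHA-256 is an assumption).
        data = b"".join(h.to_bytes(4, "big") for h in odi_hashes)
        return hashlib.sha256(data).hexdigest()

    def store(self, odi_hashes, object_id, status):
        # A dictionary holds each key once, so duplicate ODI hashes are not stored twice.
        self._entries[self._key(odi_hashes)] = (object_id, status)

    def lookup(self, odi_hashes):
        """Return (object_id, status) if the ODI was scanned before, else None."""
        return self._entries.get(self._key(odi_hashes))


cache = ClientCache()
cache.store([0x11111111, 0x22222222], "file-1", "clean")
```

A lookup that returns a stored entry lets the application report the cached status without a new scan; a result of None means the target object still has to be scanned.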

FIG. 5 shows an exemplary application executing on AM server system 20, according to some embodiments of the invention. In some embodiments, the system 20 includes a server AM application 60, a server-side cache 68, a whitelist database 65, a malware database 66, and an outbreak database 67, all of which are connected to the server AM application 60.

In some embodiments, the server AM application 60 is configured to perform a plurality of malware detection transactions with the client computer systems 30 a-30 b. For each such transaction, the server AM application 60 is configured to conduct the server-side portion of a collaborative scan to detect malware residing on the respective client computer system, as described in detail below. As part of a client-server transaction, the application 60 receives the target ODI 100 from a client computer system and transmits the scan report 50 to the respective client computer system. The server AM application 60 may include a server AM communication manager 62 and a code comparator 64 connected to the communication manager 62.

In some embodiments, server communication manager 62 is configured to manage communications with client computer systems 30 a-30 b. For example, manager 62 may establish a connection over network 12, send/receive data to/from clients, maintain a list of scanning transactions in progress, and associate each target ODI 100 with the originating client computer system 30 a-30 b. The code comparator 64 is configured to calculate a similarity score indicative of a degree of similarity between the target object and a set of reference objects stored in the databases 65-67, as described in detail below.

In some embodiments, the server-side cache 68 includes a repository of ODIs for target objects that have been scanned for malware, which are received from the various client computer systems 30 a-30 b during a previous client-server collaborative scan. As discussed further below, if the ODI of a target object is found in the server cache 68 (which indicates that the respective target object has been scanned at least once), the malware status of the target object (e.g., clean, infected, etc.) may be retrieved from the cache 68 without performing a new scan of the target object. Some embodiments of the server cache 68 may store, along with the target ODI, the malware status (e.g., clean, infected) of the respective target object.

Databases 65-67 are maintained as repositories of current malware-related knowledge. In some embodiments, each database 65-67 includes a set of data indicators corresponding to a set of reference objects (files and processes) of known malware status. In some embodiments, databases 65-67 store data in the form of opcode pattern hashes (described further below with respect to FIGS. 7-10). The whitelist database 65 contains a set of hashes computed for objects that are trusted to be clean (i.e., whitelisted items). The malware database 66 includes malware-identifying hashes computed for objects known to be malware. In some embodiments, the outbreak database 67 includes hashes computed for objects of unknown malware status (i.e., objects not yet identified as either malware or clean).

In some embodiments, all opcode pattern hashes stored in databases 65-67 are of the same size (e.g., 4 bytes). They may be stored sequentially in the memory and/or computer-readable media of server systems 20 a-20 c. In some embodiments, a second data structure including object identifiers (e.g., file IDs, also represented as 4-byte numerical values) is stored along with the reference hash set. Each hash is related to the file ID of the object it was computed for by means of a bidirectional mapping stored in the memory of the respective AM server. This allows the server AM application to selectively retrieve reference hashes in order to determine whether a target object received from a client computer system is similar to any of the reference objects stored in the databases 65-67. The databases 65-67 are kept up to date by adding target object data received from the client computer systems 30 a-30 b, as described further below.
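One way to picture the bidirectional mapping is as a pair of dictionaries, one from file ID to hashes and one from hash to file IDs. The Python sketch below is an in-memory illustration only; the class and method names are hypothetical, and the text does not specify the actual storage layout used by the server.

```python
from collections import defaultdict


class ReferenceIndex:
    """Bidirectional mapping between reference hashes and file IDs."""

    def __init__(self):
        self._hashes_by_file = defaultdict(set)
        self._files_by_hash = defaultdict(set)

    def add(self, file_id, hashes):
        """Register the opcode pattern hashes of one reference object."""
        for h in hashes:
            self._hashes_by_file[file_id].add(h)
            self._files_by_hash[h].add(file_id)

    def files_containing(self, h):
        """File IDs of all reference objects that contain hash h."""
        return set(self._files_by_hash.get(h, ()))

    def hashes_of(self, file_id):
        """All reference hashes of the object with the given file ID."""
        return set(self._hashes_by_file.get(file_id, ()))


index = ReferenceIndex()
index.add(7, [10, 20])
index.add(8, [20, 30])
```

Given a target hash, `files_containing` selects candidate reference objects, and `hashes_of` then retrieves the full reference hash set of each candidate for comparison against the target hashes.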

FIG. 6 shows an exemplary sequence of steps performed by the client AM application 40, according to some embodiments of the present invention. In step 202, application 40 selects a target object to be scanned for malware. In some embodiments, the target object may be specified directly or indirectly by the user (on-demand scanning). For example, a user may instruct the AM application 40 to scan a particular file, the contents of a particular folder, or the contents stored on a particular computer-readable medium (e.g., CD-ROM, flash memory device). Other exemplary target objects are selected during on-access scanning, where the application 40 is configured to scan particular types of files or processes before they are read/loaded/transmitted. In some embodiments, a set of target objects may be compiled for the purpose of a scheduled scan of the client computer system running application 40. An exemplary set of target objects residing on the client system may include executables from the WINDIR folder, executables from the WINDIR/system32 folder, executables of currently running processes, Dynamic Link Libraries (DLLs) imported by currently running processes, and executables of all installed system services, among others. In some embodiments, the target objects may also include files/processes targeted by malware programs of interest (e.g., malware programs considered most widespread and active at the time the respective malware scan is initiated).

In some embodiments, an identifier (e.g., a file ID) is used to uniquely identify the corresponding target object. The identifier includes data that allows selective identification of the target object itself (e.g., a file or process), and not just as part of a larger structure (e.g., the complete memory of the respective client computer system). Exemplary target object identifiers include file paths and memory addresses, among others. The identifier also allows the client AM application 40 to selectively retrieve target objects in order to compute the target ODI 100, and to unambiguously perform client-server scanning transactions for multiple target objects.

In step 204 (FIG. 6), the client AM application 40 may run a preliminary anti-malware scan of the target object. In some embodiments, file target objects are handled by the static AM scanner 44, while process target objects are handled by the active AM scanner 42. The scanners 42, 44 may use behavior-based methods (e.g., emulation), various heuristics (e.g., the geometry of the target object's portable executable header), content-based methods (e.g., signature matching), or a combination thereof to determine whether the target object is malware. In some embodiments, the scanners 42, 44 may generate an indicator of the malware status of the target object. Exemplary status indicators include malicious, suspected malicious, and clean, among others.

In some embodiments, a target object may be suspected of being malicious when it has some characteristics in common with known malicious objects, but not enough to be considered malware. Exemplary suspicious features include the presence of particular values/value pairs within the PE header of the target object, the presence of particular code sequences within the target object (e.g., code that checks whether the target object is executing within a virtual environment), and the presence of malware-identifying text patterns (signatures), such as common passwords and names and/or path indicators of anti-malware software, among others. Other suspicious features may include specific malware-identifying behavioral patterns of the target object.

In some embodiments, the scanners 42, 44 calculate malware scores for respective target objects, where each malware-identifying feature may be given a particular weight. When the malware score exceeds a first threshold, the respective target object may be suspected to be malicious; when the score exceeds a second, higher threshold, the target object may be marked as malware. An exemplary target object containing strings specific to the IRC protocol, names of antivirus programs, common passwords, and code sequences specific to exploits may receive a relatively high malware score, and may therefore be marked as malware, while another exemplary target object containing only the names of some anti-malware applications may receive a relatively low score, but may still be suspected of being malicious.
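The weighted scoring with two thresholds can be illustrated by a minimal sketch. The feature names, weights, and threshold values below are hypothetical and chosen only for illustration; the description does not specify them.

```python
# Hypothetical feature weights and thresholds; the description gives no values.
FEATURE_WEIGHTS = {
    "irc_protocol_strings": 30,
    "antivirus_names": 15,
    "common_passwords": 25,
    "exploit_code_sequence": 40,
}
SUSPECT_THRESHOLD = 10   # first (lower) threshold
MALWARE_THRESHOLD = 80   # second (higher) threshold


def malware_score(features):
    """Sum the weights of the malware-identifying features found in an object."""
    return sum(FEATURE_WEIGHTS.get(name, 0) for name in features)


def preliminary_status(features):
    """Map a score to a status indicator: malicious, suspect, or clean."""
    score = malware_score(features)
    if score > MALWARE_THRESHOLD:
        return "malicious"
    if score > SUSPECT_THRESHOLD:
        return "suspect"
    return "clean"
```

With these assumed values, an object showing all four features is marked malicious, while an object containing only antivirus names scores low yet remains suspect.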

In step 206, application 40 determines whether the target object is malicious from the preliminary malware scan. If not, operation of application 40 proceeds to step 210, described below. If so, in step 208, the AM application 40 marks the target object as malware and updates the client-side cache 56 in step 230 accordingly. Next, the client AM application 40 outputs the results of the malware scan in step 232.

In some embodiments, step 232 may include issuing an alert (e.g., a pop-up window) to notify the user that the respective client computer system may be infected. Alternatively, application 40 may record the malware scan in a system log. Some embodiments of AM application 40 may display to the user a scan report that includes, among other things, the name of the target object (or object identifier), an indicator of the type of malware detected, and additional information about the corresponding malware (e.g., a possible cleanup method).

In step 210, the client AM application 40 may determine from the results of the preliminary scan (see step 204 above) whether the target object is suspected to be malicious. If so, operation proceeds to step 212, discussed below. If not, then in step 228 application 40 may mark the target object as non-malicious (clean) and proceed to step 230.

In step 212, when the target object is a file, the application 40 may load the target file in the protected environment provided by the emulator 46 to remove any packing and/or encryption layers protecting the code of the target object. When the target object is a process, the operation of application 40 may skip step 212 because the target object has already been loaded into system memory.

In step 214, the code normalization engine 48 performs code normalization of the target object. A compiler may generate different machine code from the same block of source code depending on the compilation parameters used, in particular due to code optimization. Additional code variations may be introduced by protectors/polymorphic malware. In some embodiments, code normalization includes transforming the set of processor instructions forming the target object into a standardized set of processor instructions, to remove variations of computer code introduced by compilation and/or other polymorphisms. An exemplary code normalization operation may proceed as follows:

1. The compiler used to build the target object is detected based on specific characteristics of the target object. When the compiler is known, the location of the object-specific code within the memory image of the target object is determined. When the compiler cannot be determined, the target areas for code extraction are selected so as to cover as many potential object-specific code locations as possible (e.g., entry point, start of first section, start of all sections, etc.).

2. The code disassembly begins at the location found in the previous step. In some embodiments, the code disassembly follows a code branch (e.g., JMP/Jxx/CALL in x86 code). The disassembled instructions are processed in sequence. As part of the normalization process, some instructions are left unchanged and others are altered. Exemplary modifications include:

a. replacing register IDs based on the order in which they appear within the function block;

b. eliminating constant values and offsets;

c. replacing a PUSH followed by a POP sequence with a MOV instruction;

d. replacing a sequence that sets the value of a variable/register/memory address to 0 (e.g., XOR < item >, < item >) with MOV < item >, 0;

e. replacing the addition/subtraction of 1 or 2 with one or two INC/DEC instructions, respectively;

f. replacing the JZ/JNZ instructions with the JE/JNE instructions, respectively;

g. removing function prologues and epilogues;

h. removing the CMP, MOV, and TEST instruction classes;

i. removing no-operations (ADD and SUB of 0, NOP, etc.).
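A subset of the modifications listed above (a, d, e for the value 1 only, f, and i) can be sketched as follows. The tuple representation of instructions, the register list, and the placeholder names R0, R1, ... are assumptions made for this illustration, not part of the described embodiments.

```python
# Assumed register set; a real normalizer would cover the full architecture.
REGISTERS = {"EAX", "EBX", "ECX", "EDX", "ESI", "EDI"}
JUMP_ALIASES = {"JZ": "JE", "JNZ": "JNE"}


def normalize(block):
    """Normalize a list of (mnemonic, [operands]) tuples from one function block."""
    register_names = {}  # original register -> placeholder, in order of appearance
    normalized = []
    for mnemonic, operands in block:
        mnemonic = mnemonic.upper()
        operands = [op.upper() for op in operands]
        is_add_sub = mnemonic in ("ADD", "SUB") and len(operands) == 2
        # Rule i: drop no-operations.
        if mnemonic == "NOP" or (is_add_sub and operands[1] == "0"):
            continue
        # Rule f: use a single spelling for equivalent conditional jumps.
        mnemonic = JUMP_ALIASES.get(mnemonic, mnemonic)
        if mnemonic == "XOR" and len(operands) == 2 and operands[0] == operands[1]:
            # Rule d: zeroing idiom becomes MOV item, 0.
            mnemonic, operands = "MOV", [operands[0], "0"]
        elif is_add_sub and operands[1] == "1":
            # Rule e (value 1 only): ADD/SUB of 1 becomes INC/DEC.
            mnemonic, operands = ("INC" if mnemonic == "ADD" else "DEC"), [operands[0]]
        # Rule a: rename registers by order of first appearance.
        renamed = []
        for op in operands:
            if op in REGISTERS:
                register_names.setdefault(op, f"R{len(register_names)}")
                op = register_names[op]
            renamed.append(op)
        normalized.append((mnemonic, renamed))
    return normalized
```

Under these assumptions, two fragments that differ only in register allocation, such as `xor eax, eax; add ebx, 1` and `xor edx, edx; add esi, 1`, normalize to the same sequence.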

FIG. 7 shows an example of code normalization according to some embodiments of the invention. A piece of code disassembled from an exemplary target object includes a function block 70. In some embodiments, a function block begins with a PUSH EBP; MOV EBP, ESP instruction sequence and ends with a POP EBP instruction. Each line of code (processor instruction) from the function block 70 is modified according to the rule listed to its right, to produce a corresponding normalized function block 72.

In step 216 (FIG. 6), the client AM application 40 computes an Object Data Indicator (ODI) for the target object. In some embodiments, the ODI includes a plurality of code block indicators, each indicating a distinct code block of the target object. An exemplary code block indicator includes an opcode pattern of the respective code block.

In some embodiments, a code block comprises a sequence of consecutive processor instructions extracted from the normalized code of the target object. In some embodiments, each code block includes a predetermined, code-independent number of instructions. Alternatively, the count of instructions within a code block varies within a predetermined range. An exemplary code block includes between 5 and 50 consecutive instructions. In some embodiments, the size of the code block (e.g., the number of instructions) is substantially smaller than the size of the function block, so that a function block may include more than one code block. In some embodiments, a code block begins at the beginning of a function block or at a CALL instruction. An exemplary code block 74 is shown in FIG. 7.

In some embodiments, step 216 includes separating the target object into code blocks and extracting a set of opcode indicators from each such code block. FIG. 8 shows an exemplary memory representation of a processor instruction 80 (illustrated for the x86, 32-bit family of processors). In some embodiments, each processor instruction is stored in memory as a sequence of bytes comprising a set of instruction fields, such as a prefix field 82a, a pair of opcode fields 82b-82c, a Mod/Reg/R/M field 82d, and a displacement/data field 82e. In some embodiments, the opcode fields 82b-82c encode the instruction type (e.g., MOV, PUSH, etc.), while the fields 82a and 82d-82e encode various instruction parameters (e.g., register names, memory addresses, etc.). In some embodiments (e.g., the x86 format), the byte size and content of the instruction fields are instruction-dependent, and thus instructions of the x86 architecture have varying lengths. The instruction illustrated in FIG. 8 (XOR CL, 12H) includes only the first opcode byte (10000000 for XOR), the Mod/Reg/R/M byte (11110001 for register CL), and the displacement/data byte (00010010 is binary for 12H), while other instructions may include both opcode fields, or other combinations of prefix, opcode, Mod/Reg/R/M, and/or data fields.
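The three bytes of the example instruction can be taken apart with a few bit operations. The sketch below is illustrative only; the helper name `split_modrm` is an assumption. For opcode byte 80h, the 3-bit Reg field selects the operation (6 selects XOR), and with Mod equal to 11 the R/M field names a register (1 is CL for 8-bit operands).

```python
def split_modrm(byte):
    """Split a Mod/Reg/R/M byte into its 2-bit, 3-bit, and 3-bit fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


# XOR CL, 12H as three bytes: opcode, Mod/Reg/R/M, immediate data.
instruction = bytes([0b10000000, 0b11110001, 0b00010010])
opcode, modrm, immediate = instruction
mod, reg, rm = split_modrm(modrm)

print(f"opcode={opcode:02X}h mod={mod:02b} reg={reg} rm={rm} data={immediate:02X}h")
```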

Fig. 9 shows an exemplary opcode pattern 90 corresponding to the code block 74. In some embodiments, the opcode pattern 90 is a data structure (e.g., a sequence of bytes, a list, etc.) that includes a set of opcode indicators 92, each opcode indicator corresponding to a processor instruction of the normalized code block 74. The exemplary opcode indicator 92 includes the contents of an opcode field of the corresponding processor instruction, in which case the opcode pattern 90 includes a sequence of instruction types that make up the corresponding code block. In the embodiment illustrated in FIG. 9, each opcode indicator 92 comprises a combination of an opcode byte and a parameter byte (for example, the opcode indicator for instruction PUSH EDX is 52 in hexadecimal).

FIG. 10 illustrates a fragment of normalized code and an exemplary ODI 100 of the fragment, according to some embodiments of the invention. The ODI 100 includes a plurality of code block indicators 104a-104c, each providing a digest (e.g., fingerprint, signature) of a respective code block 74a-74c. Exemplary code block indicators 104a-104c include the respective opcode patterns 90a-90c. In some embodiments, the code block indicators 104a-104c include hashes of the opcode patterns 90a-90c, respectively, as illustrated in FIG. 10. In addition to the code block indicators 104a-104c, some embodiments of the ODI 100 may also include an object identifier 102 (e.g., a file ID) that identifies the respective target object, and/or a set of object characteristic indicators 106 of the target object. Exemplary object characteristic indicators include the file size (e.g., 130 kB), an indicator of the file type (e.g., whether the file is an executable, a DLL, etc.), the memory address of the target object, and a set of values indicating the results of a set of anti-malware heuristic tests (e.g., whether the target object shows particular malware-specific behavior or content), among others. In some embodiments, the object characteristic indicators 106 may be calculated by the AM scanners 42-44, for example during the preliminary scan of the target object (step 204).

For simplicity, the remainder of the description will assume that the code block indicators 104a-104c include hashes of the opcode patterns 90a-90c. Execution of step 216 (FIG. 6) then continues as follows. The client AM application 40 may separate the target object into several distinct code blocks (illustrated by code blocks 74a-74c in FIG. 10). For each code block 74a-74c, the application 40 may proceed to calculate an opcode pattern 90a-90c, respectively, as shown in FIG. 9. The application 40 may then invoke the hash engine 54 to compute the hashes of the opcode patterns 90a-90c, to generate the respective code block indicators (i.e., target hashes) 104a-104c. Hash engine 54 may employ a hashing algorithm such as a Cyclic Redundancy Check (CRC), Message Digest (MD), or Secure Hash Algorithm (SHA), among others.
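The computation of step 216 can be sketched as below. This is a simplified illustration under stated assumptions: each normalized instruction is represented by a single opcode byte, blocks have a fixed size of 10 instructions, a trailing shorter block is kept, SHA-1 is the chosen hash, and the dictionary layout of the ODI is hypothetical.

```python
import hashlib

BLOCK_SIZE = 10  # assumed instructions per code block (the text suggests 5 to 50)


def split_into_blocks(opcodes, size=BLOCK_SIZE):
    """Cut a bytes sequence of opcodes into consecutive fixed-size code blocks."""
    return [opcodes[i:i + size] for i in range(0, len(opcodes), size)]


def block_hash(opcode_pattern):
    """Hash one opcode pattern (bytes) into a code block indicator."""
    return hashlib.sha1(opcode_pattern).hexdigest()


def compute_odi(object_id, opcodes):
    """Build a dict standing in for the ODI: an identifier plus target hashes."""
    blocks = split_into_blocks(opcodes)
    return {"object_id": object_id, "hashes": [block_hash(b) for b in blocks]}
```

For example, `compute_odi("C:/example.exe", opcode_bytes)` on 25 opcode bytes yields three target hashes, the last one covering the remaining five instructions.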

After computing the target ODI 100, in step 218 (FIG. 6), the client AM application 40 performs a lookup of the ODI in the client-side cache 56. If the ODI matches a cache record (cache hit), indicating that the respective target object has already been scanned for malware at least once, application 40 proceeds to step 220 to mark the target object (e.g., as clean or malware) according to the cache record, and then proceeds to step 232 discussed above.

If the target ODI 100 does not match any record in the client-side cache 56, then in step 222 the application 40 may invoke the client AM communication manager 52 to initiate a client-server scan transaction. The communication manager 52 transmits the target ODI 100 to the AM servers 20a-20c, and in step 224 receives the scan report 50 from the servers 20a-20c. In some embodiments, each ODI may form part of a distinct client-server scan transaction, or multiple ODIs may be transmitted simultaneously within the same transaction (batch processing).

In step 226, application 40 determines from scan report 50 whether the target object is whitelisted (clean). If so, the target object is marked as non-malicious (step 228). If the target object is malicious according to scan report 50, application 40 marks the target object as malware (step 208).
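The client-side control flow of FIG. 6 may be summarized by the following sketch. The function and parameter names are hypothetical, the cache is modeled as a plain dictionary keyed by the ODI, and `query_server` stands in for the client-server scan transaction of steps 222-226.

```python
def client_scan(status, odi_key, cache, query_server):
    """Return 'malware' or 'clean' for one target object.

    status       -- result of the preliminary scan: 'malicious', 'suspect' or 'clean'
    odi_key      -- hashable stand-in for the target ODI (e.g., a tuple of hashes)
    cache        -- dict standing in for the client-side cache
    query_server -- callable returning the server verdict for an ODI
    """
    if status == "malicious":        # steps 206 and 208
        return "malware"
    if status != "suspect":          # steps 210 and 228
        return "clean"
    if odi_key in cache:             # steps 218 and 220: cache hit
        return cache[odi_key]
    verdict = query_server(odi_key)  # steps 222 to 226
    cache[odi_key] = verdict         # step 230: remember the result
    return verdict
```

In this sketch only suspect objects reach the server, and a second scan of the same suspect object is answered from the cache without any network traffic.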

Fig. 11 shows an exemplary sequence of steps performed by server AM application 60 (Fig. 5), according to some embodiments of the invention. In step 302, the server AM communication manager 62 receives the target ODI 100 from the client computer system 30. In step 304, the application 60 performs a lookup of the ODI 100 in the server-side cache 68. If the ODI matches a cache record (cache hit), indicating that the respective target object has already been scanned for malware at least once, application 60 proceeds to step 306 to mark the target object (e.g., as clean or malware) according to the cache record. In step 308, communication manager 62 compiles scan report 50 and transmits report 50 to the respective client computer system 30.

If no record of the ODI 100 is found in the server-side cache 68, then in step 310 the server AM application 60 filters the hashes of the ODI 100 to produce a relevant subset of hashes. In some embodiments, hashes of opcode patterns that are not object-specific may be discarded from the ODI 100 to improve the performance of malware scanning. Such non-specific opcode patterns correspond, for example, to unpacker code (e.g., installers, self-extractors) and/or library code, or are present in both clean and malware objects.

In step 312, for each hash of the ODI 100, the server AM application 60 may query the whitelist database 65 to retrieve a set of whitelisted reference objects containing the respective hash. In some embodiments, a heap-based algorithm is used to rank the retrieved reference objects according to their similarity to the target object.
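One way to picture the retrieval and ranking of step 312 is sketched below. The whitelist database is modeled as an inverted index (a dictionary mapping each hash to the identifiers of the objects that contain it), and ranking uses the count of shared hashes with Python's `heapq`; both choices are assumptions, since the description does not detail the heap-based algorithm.

```python
import heapq
from collections import Counter


def candidate_references(target_hashes, index, limit=3):
    """Return up to `limit` (object_id, shared_count) pairs, most shared first.

    index -- dict mapping a hash to the set of whitelisted object ids containing it
    """
    shared = Counter()
    for h in set(target_hashes):
        for object_id in index.get(h, ()):
            shared[object_id] += 1
    # Ties are broken by object id so that the ranking is deterministic.
    return heapq.nlargest(limit, shared.items(), key=lambda item: (item[1], item[0]))
```

Only objects sharing at least one hash with the target are ever considered, which keeps the later similarity computation limited to a small set of candidates.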

In step 314, the server AM application 60 calls the code comparator 64 to calculate a similarity score that characterizes how similar the target object is to each whitelisted reference object retrieved in step 312. In some embodiments, the similarity score is calculated according to the following formula:

S=100*C/max(N_T, N_R) [1]

where C represents the number (count) of hashes common to both the target object and the respective reference object, N_T represents the number (count) of hashes of the target ODI remaining after the filtering discussed in step 310 above, and N_R represents the number (count) of hashes of the reference object.

Alternative embodiments may calculate the similarity score according to a formula such as:

S=200*C/(N_T+N_R) [2]

or

S=50*(C/N_T+C/N_R) [3]
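The similarity scores can be compared side by side with a short sketch. The first score uses the C/max(N_T, N_R) form recited in the claims, scaled by 100 here as an assumption so that all three scores fall in the range 0 to 100. Treating the hash collections as sets (duplicates counted once) is also an assumption.

```python
def similarity_scores(target_hashes, reference_hashes):
    """Return three similarity scores (0 to 100) for two collections of hashes."""
    target, reference = set(target_hashes), set(reference_hashes)
    if not target or not reference:
        return 0.0, 0.0, 0.0
    c = len(target & reference)          # hashes common to both objects
    n_t, n_r = len(target), len(reference)
    s_max = 100 * c / max(n_t, n_r)      # C/max(N_T, N_R), scaled to 0..100
    s_sum = 200 * c / (n_t + n_r)        # formula [2]
    s_avg = 50 * (c / n_t + c / n_r)     # formula [3]
    return s_max, s_sum, s_avg
```

For a target with 4 hashes and a reference with 5 hashes, 3 of them shared, the three scores are 60, about 66.7, and 67.5, so all three variants would pass an exemplary threshold of 50.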

In step 316, the application 60 compares the similarity score (e.g., formula [1]) to a predetermined threshold. When the similarity score exceeds the threshold, indicating that the target object is similar to at least one whitelisted object, some embodiments of the server AM application 60 may mark the target object as non-malicious (clean) in step 318. An exemplary value of the whitelisting threshold is 50, indicating that the target object is whitelisted when it shares more than 50% of its opcode patterns with a whitelisted object.

Next, step 320 updates the whitelist database 65 with the record of the current target object, and step 322 updates the server-side cache 68 with an indicator of the record of the target object and the scan results (e.g., clean).

When the whitelist similarity score (step 316) does not exceed the threshold, indicating that the target object is not sufficiently similar to any known whitelisted object, the server AM application 60 proceeds to step 324, where the target ODI 100 is compared to a set of records of malware objects. In some embodiments, the hash set of the ODI 100 is further filtered to remove all hashes matching records from the whitelist database 65 (see step 312 above), thus preserving a subset of hashes that are not found in any known whitelisted object. For each such unrecognized hash of the target object, the code comparator 64 may query the malware and/or outbreak databases 66-67 to retrieve a set of malware objects that contain the respective hash. In step 326, the code comparator 64 may then proceed to calculate a malware similarity score that indicates how similar the target object is to each such malware object. In some embodiments, the code comparator 64 calculates the malware similarity score using any of the formulas [1]-[3] described above.

Step 328 compares the malware similarity score to a preset threshold. When the malware similarity score exceeds the threshold, indicating that the target object is similar to at least one malware object stored in databases 66-67, the target object is marked as malware in step 330. An exemplary threshold for classification as malware is 70 (i.e., the target object shares at least 70% of its opcode patterns with a known malware object). Next, the malware and/or outbreak databases 66-67 are updated to include a record of the target object. The server-side cache 68 is updated to include a record of the target object and an indicator of its malware status (e.g., infected), and a scan report is compiled and transmitted to the client computer system (step 308).

When the malware similarity score does not exceed the threshold, indicating that the target object is not similar to a known malware object, some embodiments of the server AM application may mark the target object as whitelisted/non-malicious (step 318), and update the whitelist database 65 accordingly.
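The server-side decision of steps 316 through 330 can be condensed into the following sketch. The databases are modeled as lists of hash sets, the score uses formula [2], and computing the malware score on the unrecognized subset of hashes is an interpretation of the text rather than a confirmed detail; all function names are hypothetical.

```python
WHITELIST_THRESHOLD = 50  # exemplary value from the description
MALWARE_THRESHOLD = 70    # exemplary value from the description


def score(target, reference):
    """Formula [2]: 200*C/(N_T+N_R), for two sets of hashes."""
    if not target or not reference:
        return 0.0
    return 200 * len(target & reference) / (len(target) + len(reference))


def server_verdict(target_hashes, whitelist, malware):
    """Classify a target from its hashes; whitelist/malware are lists of hash sets."""
    target = set(target_hashes)
    # Steps 314 to 318: similar enough to any whitelisted object means clean.
    if any(score(target, ref) > WHITELIST_THRESHOLD for ref in whitelist):
        return "clean"
    # Step 324: keep only the hashes not found in any whitelisted object.
    unrecognized = target - set().union(*whitelist)
    # Steps 326 to 330: similar enough to any malware object means malware.
    if any(score(unrecognized, ref) > MALWARE_THRESHOLD for ref in malware):
        return "malware"
    # Otherwise the target is treated as non-malicious.
    return "clean"
```

Note that in this flow an object resembling neither database ends up whitelisted, which mirrors the default described in the text.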

The target ODI 100 may also trigger malware outbreak alerts. In some embodiments, server AM application 60 counts the reference objects from the outbreak database 67 that are similar to the target object and that have been received by AM server systems 20a-20c within a predetermined time frame (e.g., the last 6 hours). When the count exceeds a threshold (e.g., 10), a malware outbreak is assumed, and the target object and all reference objects similar to it are marked as infected. The malware and/or outbreak databases 66-67 are then updated accordingly.
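The outbreak test reduces to counting recent, similar sightings. In the sketch below, the outbreak database is modeled simply as a list of arrival timestamps (in seconds) of objects already judged similar to the target; that representation and the function name are assumptions, while the 6-hour window and the threshold of 10 are the exemplary values from the text.

```python
def is_outbreak(similar_sightings, now, window_seconds=6 * 3600, threshold=10):
    """Return True when more than `threshold` similar objects arrived recently.

    similar_sightings -- arrival times (seconds) of objects similar to the target
    now               -- current time, in the same units
    """
    recent = [t for t in similar_sightings if 0 <= now - t <= window_seconds]
    return len(recent) > threshold
```

Eleven similar objects reported within the last six hours would raise an alert under these values, while ten would not.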

The exemplary systems and methods described above allow an anti-malware system to maintain a flexible whitelist database and use the whitelist database to improve malware detection performance.

In conventional whitelisting applications, the hash of a target object (a computer file or process) is compared to a set of hashes corresponding to whitelisted objects (objects that are trusted to be clean). If the hash of the target object matches a whitelisted hash, indicating that the target object is identical to at least one of the whitelisted objects, then the target object is trusted and, for example, allowed to execute. Due to the specific mathematical properties of hash functions, conventional whitelisting does not tolerate changes in the code of whitelisted objects: if two objects differ by as little as 1 bit, the hashes of the two objects no longer match. At the same time, legitimate computer files and processes may show significant variation, for example due to differences between compilers or between successive versions of the same software.

Some embodiments of the systems and methods described above allow the anti-malware system to account for benign differences between data objects, such as differences introduced by compilers and other polymorphisms. The target object is split into a number of code blocks, and a hash is computed for each code block. The resulting set of target hashes is then compared to a database of hashes corresponding to code blocks extracted from whitelisted objects. A target object may be marked as whitelisted (trusted) if it has a significant number of hashes in common with a whitelisted object. Objects that are slightly different from known whitelisted objects may still receive whitelisted status. By allowing a certain degree of mismatch between the hash sets of distinct objects, some embodiments of the invention increase the efficiency of whitelisting without reducing data security to an unacceptable degree.

The size of the code block may be determined according to several criteria. Small code blocks (e.g., a few processor instructions each) may result in a large number of hashes per target object, which may increase the storage and processing load of the anti-malware server and slow down scanning. On the other hand, small code blocks provide a significant degree of flexibility: if two objects are only slightly different, the difference will be picked up by only a small fraction of the hashes, resulting in a high similarity score. Large code blocks (e.g., hundreds of processor instructions) produce on average fewer (e.g., a few) hashes per target object, and are thus advantageous from a storage and processing perspective. However, large code blocks suffer from the same disadvantage as conventional hashing: small differences between two objects may be picked up by a large fraction of the hashes, resulting in a low similarity score. Testing revealed an optimal code block size of between 5 and 50 processor instructions, and in some embodiments, in particular about 5 to 15 (e.g., about 10) instructions.
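The trade-off between block sizes can be made concrete with a small experiment. The sketch below is illustrative only: it assumes one opcode byte per instruction, SHA-1 block hashes, and the 200*C/(N_T+N_R) similarity score, and it models a single substituted instruction (an inserted instruction would shift fixed block boundaries and is not covered here).

```python
import hashlib


def block_hashes(opcodes, size):
    """Hash consecutive blocks of `size` opcode bytes; return the set of hashes."""
    return {hashlib.sha1(opcodes[i:i + size]).hexdigest()
            for i in range(0, len(opcodes), size)}


def similarity(a, b, size):
    """Similarity score 200*C/(N_T+N_R) for two opcode sequences."""
    hashes_a, hashes_b = block_hashes(a, size), block_hashes(b, size)
    return 200 * len(hashes_a & hashes_b) / (len(hashes_a) + len(hashes_b))


original = bytes(range(200))                           # 200 distinct "instructions"
modified = original[:105] + b"\x00" + original[106:]   # one instruction changed

print(similarity(original, modified, size=10))    # 19 of 20 blocks still match
print(similarity(original, modified, size=100))   # only 1 of 2 blocks matches
```

With blocks of 10 instructions the single change costs one hash out of twenty, whereas with blocks of 100 instructions it costs half of the hashes.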

The exemplary systems and methods described above allow an anti-malware system to carry out collaborative client-server scanning transactions, and to evaluate a target object's malware status according to the results of a server-side scan of the target object. Performing a portion of the malware scan on a remote anti-malware server has a number of advantages over performing a local scan of the target object on the client computer system.

The proliferation of malware agents and of software in general has contributed to a steady increase in the size of whitelist and malware hash databases, which can amount to several megabytes to several gigabytes of data. The exemplary methods and systems described above allow the hash databases to be stored on the anti-malware server, thus avoiding regular, data-heavy software updates from a corporate server to a large number of customers.

By performing a significant portion of the malware scan centrally on the server, the systems and methods described above allow for the timely incorporation of hashes of newly detected malware and new legitimate software. In contrast, in conventional malware detection where scans are primarily assigned to client computer systems, information collection on new security threats and newly whitelisted software may involve indirect methods, taking a significantly longer time to reach the anti-malware software producer.

The size of the files exchanged between the client and the anti-malware server system described above is kept to a minimum. Instead of sending a complete target object from a client to a server for server-side scanning, the exemplary methods and systems described above are configured to exchange hashes, which can amount to a few bytes to several kilobytes per target object, thus significantly reducing network traffic.

It will be clear to a person skilled in the art that the above embodiments may be varied in many ways without departing from the scope of the invention. Accordingly, the scope of the invention should be determined by the appended claims and their legal equivalents.

Claims

1. A method, comprising:

performing, at a client computer system, an initial malware scan of a plurality of target objects of the client computer system; and

in response to a preliminary determination by the initial malware scan that the target object is suspected to be malicious:

generating, at the client computer system, a plurality of target hashes of the target object, each target hash representing a distinct code block of the target object, each distinct code block consisting of a sequence of processor instructions of the target object;

sending the plurality of target hashes from the client computer system to a server computer system connected to the client computer system via a wide area network; and

receiving, at the client computer system, a server-side indicator from the server computer system of whether the target object is malicious, wherein the server-side indicator is generated by the server computer system by:

retrieving, for at least one target hash of the plurality of target hashes, a plurality of reference hashes of a reference object, the reference object being selected from a group of whitelisted objects according to the target hash, and determining a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes when the plurality of target hashes are not identical to the plurality of reference hashes; and

designating the target object as non-malicious when the similarity score exceeds a predetermined threshold.

2. The method of claim 1, wherein generating, by the computer server system, the server-side indicator comprises:

generating a filtered target hash set of the target object by filtering out all target hashes that appear in a database of clean hashes from the plurality of target hashes of the target object when the similarity score does not exceed the predetermined threshold; and

comparing the set of filtered target hashes to a database of malware-identifying hashes that are specific to malware.

3. The method of claim 1, wherein generating, by the computer server system, the server-side indicator comprises:

comparing the set of filtered target hashes to a database of burst detection hashes specific to unknown objects reported by a plurality of disparate client computer systems connected to the server computer system over a predetermined recent period of time.

4. A method, comprising:

receiving, at a server computer system via a wide area network, a plurality of target hashes of target objects of a client computer system connected to the server computer system, wherein the plurality of target hashes were generated at the client computer system in response to a preliminary determination by the client computer system that the target objects are suspected of being malicious, the preliminary determination resulting from an initial malware scan of a plurality of target objects of the client computer system;

generating, at the server computer system, a server-side indicator of whether the target object is malicious by:

for at least one target hash of the plurality of target hashes, retrieving a plurality of reference hashes of a reference object, the reference object being selected from a group of whitelisted objects according to the target hash, and when the plurality of target hashes are not identical to the plurality of reference hashes, determining a similarity score according to a hash count common to both the plurality of target hashes and the plurality of reference hashes, and

designating the target object as non-malicious when the similarity score exceeds a predetermined threshold; and sending the server-side indicator of whether the target object is malicious to the client computer system.

5. The method of claim 4, wherein generating, by the computer server system, the server-side indicator comprises:

6. The method of claim 4, wherein generating, by the computer server system, the server-side indicator comprises:

7. A method, comprising:

receiving, at a server computer system, a plurality of target hashes of a target object, each target hash representing a distinct code block of the target object, each distinct code block consisting of a sequence of processor instructions of the target object;

for at least one target hash of the plurality of target hashes, employing the server computer system to:

retrieving a plurality of reference hashes of a reference object, the reference object being selected from a group of whitelisted objects according to the target hash, and

determining a similarity score from a hash count common to both the target hashes and the reference hashes when the target hashes are not identical to the reference hashes; and

when the similarity score exceeds a predetermined threshold, employing the server computer system to mark the target object as non-malicious.

8. The method of claim 7, wherein the target hash comprises a hash of an opcode pattern, the opcode pattern comprising a sequence of instruction indicators, each instruction indicator indicating a processor instruction of the distinct code block.

9. The method of claim 7, wherein the sequence of processor instructions consists of between 5 and 50 consecutive processor instructions.

10. The method of claim 9, wherein the sequence of processor instructions consists of between 5 and 15 consecutive processor instructions.

11. The method of claim 7, wherein the processor instruction sequence begins with a CALL instruction.

12. The method of claim 7, further comprising:

performing a code normalization process on the target object to produce a normalized target object, wherein each distinct code block consists of a sequence of processor instructions of the normalized target object; and

applying a hash function to the distinct code block to generate the target hash.

13. The method of claim 7, wherein the similarity score is determined as a function of:

C/max(N_T, N_R)

wherein C represents the hash count common to both the plurality of target hashes and the plurality of reference hashes, and N_T and N_R represent the cardinality of the plurality of target hashes and the cardinality of the plurality of reference hashes, respectively.

14. The method of claim 7, wherein the similarity score is determined as a function of:

C/(N_T+N_R)

wherein C represents the hash count common to both the plurality of target hashes and the plurality of reference hashes, and N_T and N_R represent the cardinality of the plurality of target hashes and the cardinality of the plurality of reference hashes, respectively.

15. The method of claim 7, wherein the similarity score is determined as a function of:

C/N_T+C/N_R

16. The method of claim 7, wherein the target object comprises a computer file.

17. The method of claim 7, wherein the target object comprises a computer process.

18. A computer system comprising at least one processor programmed to:

receive a plurality of target hashes, each target hash representing a distinct code block of a target object, each distinct code block consisting of a sequence of processor instructions of the target object;

for at least one target hash of the plurality of target hashes:

when the similarity score exceeds a predetermined threshold, mark the target object as non-malicious.

19. The system of claim 18, wherein the target hash comprises a hash of an opcode pattern, the opcode pattern comprising a sequence of instruction indicators, each instruction indicator indicating a processor instruction of the distinct code block.

20. The system of claim 18, wherein the sequence of processor instructions consists of between 5 and 50 consecutive processor instructions.

21. The system of claim 20, wherein the sequence of processor instructions consists of between 5 and 15 consecutive processor instructions.

22. The system of claim 18 wherein the processor instruction sequence begins with a CALL instruction.

23. The system of claim 18, wherein the processor is further programmed to:

performing a code normalization process on the target object to generate a normalized target object, wherein each distinct piece of code consists of a sequence of computer instructions for the normalized target object; and

24. The system of claim 18, wherein the similarity score is determined as a function of:

C/max(N_T, N_R)
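As a hedged illustration, the score C/max(N_T, N_R) may be computed over sets of hashes as shown below. The function name `max_normalized_score`, the handling of empty inputs, and the integer stand-ins for hashes are assumptions of this sketch.

```python
def max_normalized_score(target_hashes, reference_hashes):
    """Return C / max(N_T, N_R) for two collections of hashes."""
    target = set(target_hashes)
    reference = set(reference_hashes)
    larger = max(len(target), len(reference))
    if larger == 0:
        return 0.0                      # assumption: empty inputs score zero
    return len(target & reference) / larger


# Hypothetical example: 8 target hashes, 10 reference hashes, 7 in common.
target = set(range(8))         # stand-ins for hashes 0..7
reference = set(range(1, 11))  # stand-ins for hashes 1..10
score = max_normalized_score(target, reference)
```

This form ranges from 0 to 1 and equals 1 only when the two sets are identical, which makes a threshold straightforward to interpret.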

25. The system of claim 18, wherein the similarity score is determined as a function of:

C/(N_T+N_R)

26. The system of claim 18, wherein the similarity score is determined as a function of:

C/N_T+C/N_R

27. The system of claim 18, wherein the target object comprises a computer file.

28. The system of claim 18, wherein the target object comprises a computer process.

29. A computer system, comprising:

means for receiving a plurality of target hashes, each target hash representing a distinct code block of a target object, each distinct code block consisting of a sequence of processor instructions of the target object;

means for retrieving a plurality of reference hashes of a reference object selected from a set of whitelisted objects according to a selected target hash of the plurality of target hashes;

means for determining a similarity score from a hash count common to both the plurality of target hashes and the plurality of reference hashes; and

means for tagging the target object as non-malicious according to the similarity score.
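The four means recited above may be read as a single pipeline, sketched below under stated assumptions. The lookup structures `whitelist_index` (hash to names of whitelisted objects containing it) and `whitelist` (name to set of reference hashes), the function name `classify`, the choice of C/max(N_T, N_R) as the score, and the 0.7 threshold are all hypothetical and are not taken from the claim.

```python
def classify(target_hashes, whitelist_index, whitelist, threshold=0.7):
    """Tag a target as 'non-malicious' or 'unknown' from its hashes."""
    target = set(target_hashes)
    for selected in sorted(target):                      # selected target hash
        for name in whitelist_index.get(selected, ()):   # reference object lookup
            reference = whitelist[name]
            common = len(target & reference)             # count of common hashes
            score = common / max(len(target), len(reference))
            if score > threshold:
                return "non-malicious", name, score
    return "unknown", None, 0.0


# Hypothetical whitelist with one reference object of five hashes.
whitelist = {"app_v1": {"h1", "h2", "h3", "h4", "h5"}}
whitelist_index = {}
for name, hashes in whitelist.items():
    for h in hashes:
        whitelist_index.setdefault(h, []).append(name)

# A target sharing four of five hashes with the reference object.
result = classify({"h1", "h2", "h3", "h4", "h9"}, whitelist_index, whitelist)
```

Selecting the reference object through a hash that the target already contains avoids comparing the target against every whitelisted object.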

30. A non-transitory computer readable storage medium encoding instructions that, when executed on a processor, cause the processor to perform the steps of:

for at least one target hash of the plurality of target hashes:

31. A method, comprising:

receiving, at a server computer system, a plurality of target hashes, each target hash representing a distinct code block of a target object of a client computer system connected to the server computer system, each distinct code block consisting of a sequence of processor instructions of the target object;

in response to receiving the plurality of target hashes, employing the server computer system to retrieve a plurality of reference hashes representing whitelisted data objects; and

in response to determining that the plurality of target hashes are not identical to the plurality of reference hashes and that the plurality of target hashes share a majority of items with the plurality of reference hashes, marking the target object as non-malicious.
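The final determination, in which the target hashes are not identical to the reference hashes yet share a majority of items with them, may be sketched as a simple predicate. The function name `is_fuzzy_whitelist_match` is hypothetical, and the reading of "majority" as more than half of the larger set is an assumption of this sketch. Identical sets return False here only because the claim addresses the non-identical case.

```python
def is_fuzzy_whitelist_match(target_hashes, reference_hashes):
    """Return True when the sets differ but share a majority of their items."""
    target = set(target_hashes)
    reference = set(reference_hashes)
    if target == reference:
        return False  # identical sets fall outside the non-identical case
    common = len(target & reference)
    # Assumption: "majority" means more than half of the larger set.
    return common > max(len(target), len(reference)) / 2


# Hypothetical example: three of four hashes shared, one differing.
near_match = is_fuzzy_whitelist_match({"a", "b", "c", "d"}, {"a", "b", "c", "e"})
```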