HK1108241B

HK1108241B - Method and system for identifying the content of files in a network

Info

Publication number: HK1108241B
Application number: HK07112783.9A
Authority: HK
Inventors: 克里斯托弗．德斯皮格尔
Original assignee: Veritas Technologies Llc
Priority date: 2003-12-24
Filing date: 2004-12-24
Publication date: 2013-07-05

Description

Method and system for identifying file content in network

Technical Field

The present invention relates to a method and system for controlling the content of a computer file, for example a computer file containing text or graphical data, and a method for updating such a content recognition system. More particularly, a method and system for checking and managing the security status and content of computer files on local computing devices in a network environment and updating such a checking and management system is described.

Background

Computers are widely used in the world today. Often, especially in commercial environments, they are interconnected into small or large networks. Since software and data are often an important part of a company's investment, it is important to protect the individual computing devices, the workstations, and the entire network from viruses, trojan horses, worms, and other malware. Another problem is the large number of files containing objectionable content, such as explicit adult content. These files are often received on local computing devices uninvited and unwanted.

To address security issues related to viruses, protection systems called virus detection programs have been developed. Some examples of conventional virus detection programs are Norton AntiVirus, McAfee VirusScan, PC-cillin, and Kaspersky Anti-Virus. Most of these conventional virus protection software packages can be configured so that they always run in the background of the computing device and provide continuous protection. These virus protection systems compare the code of new or modified software to fingerprints of known viruses (e.g., portions of code introduced into files by viruses). Other virus protection systems compare the code of all data available on the computing device. This uses a significant amount of central processing unit (CPU) time, which limits the ability of the computing device to perform other tasks. In addition, the operating principle of these virus detection programs makes them reactive rather than proactive, since the fingerprint of a virus must be known before the virus scanner can identify it. This means that the fingerprint database needs to be updated regularly to protect against newer viruses. Thus, the security state of a computer depends not only on external factors, such as whether the vendor of the virus protection software package has a correct fingerprint of a new virus available, but also on the user's diligence in making updates on a regular basis. If updates are provided centrally and automatically from a server, network capacity is reduced because these virus updates must be sent to multiple workstations.

In a network environment, the problem of updating such a fingerprint database becomes more important, because responsibility is given to all users, each of whom has to update the database of their virus detection program. Alternatively, the virus scan may be performed by a central server, so that new fingerprints need to be updated only on the central server. However, this means that large amounts of data need to be transferred periodically through the network, using large amounts of expensive network bandwidth and, depending on the number of clients of the server, potentially leaving insufficient network or server capacity for other activities.

To limit the amount of CPU time used, other techniques have been proposed to speed up the virus scanning process. These techniques typically include hashing the contents of the file. Hashing is an example of an application of a "one-way function". A one-way function is an algorithm that, once applied in one direction, is almost impossible to reverse. The one-way function produces a value, such as a hash value, computed from the contents of a file, and can uniquely fingerprint that file if the function is complex enough to avoid producing the same value for different files. The uniqueness of the hash value depends on the type of hash function used, i.e., the size of the digest formed and the quality of the function. A good hash function has the fewest collisions in the table, i.e., the chance of assigning the same hash value to different files is minimal. As mentioned before, this is also determined by the size of the calculated digest, i.e., the size of the hash value. For example, if a 128-bit digest is used, then the number of possible different values that can be obtained is 2¹²⁸.
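As an illustration of the fingerprinting idea described above, the following is a minimal Python sketch and not the implementation of any system discussed here. The function names `stream_digest`, `file_digest` and `digest_space` are hypothetical, and SHA-256 is an assumed default; the text does not name a specific hash function.

```python
import hashlib
import io


def stream_digest(stream, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a binary stream, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)  # the algorithm choice is an assumption of this sketch
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def file_digest(path, algorithm="sha256"):
    """Fingerprint a file on disk by hashing its full contents."""
    with open(path, "rb") as f:
        return stream_digest(f, algorithm)


def digest_space(bits):
    """Number of distinct values that a digest of the given bit length can take."""
    return 2 ** bits


# Identical content gives identical fingerprints; a one-byte change gives a different one.
a = stream_digest(io.BytesIO(b"example file content"))
b = stream_digest(io.BytesIO(b"example file content"))
c = stream_digest(io.BytesIO(b"example file content!"))
```

Here `a == b` while `a != c`, and `digest_space(128)` evaluates the 2¹²⁸ figure quoted in the text.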

Hashing is known to be used for virus checking, possibly in a network environment. Generally, a hash value of a selected application running on the local computer is computed, a stored hash value is retrieved at the local computer from a database on a secure computer (which may be a secure part of the local computer or a network server), and the two values are compared. If they match, the application is executed; if they do not match, a security operation is executed. The security operation may include loading a virus scanner onto the local computer. It may also include alerting network management personnel. Furthermore, hashing is also known to be used to differentiate access to software from different workstations, and as a way to check whether software is licensed.

It is also known to use hashing in a method of identifying rogue software on a computer system or device. The method is generally applicable in a network environment. A hash value of the application software to be executed is calculated, transmitted to the server, and compared with a previously stored value. One of the essential features is that the method uses a database on a server, which is a server with a large number of clients. Thus, the database is built by adding information by different clients so that most application software and their corresponding fingerprints have been stored in the database. The database is built by checking the authenticity of the application software with the owner of the application software. If this is not possible, the system can also give heuristic results, evaluating the occurrence of the application on the local computer from other clients.

Methods of sending electronic files by e-mail, in which the message includes the file content and a message content identifier, are known. The message is delivered to the client or not, depending on the message content identifier. This method can be used to organize e-mail transfers, but it has the disadvantage of focusing on e-mail transfers, so it does not allow all files in the network to be secured.

Also known are methods of monitoring e-mail messages, thereby protecting computer systems from virus attacks and Unsolicited Commercial E-mail (UCE). Such a system is preferably installed at a mail server or at an internet service provider. It checks a specific part of an e-mail by calculating a digest and comparing the digest with stored digest values of previously received e-mails. In this way, it is determined whether the e-mail has an approved digest, whether the e-mail is UCE, or whether it contains an e-mail worm. A disadvantage of this system is that it focuses on e-mail viruses and spam; it does not allow checking of all data files or executable files that may be infected, for example, by files copied from an external storage device, such as a floppy disk or a CD-ROM, or by trojan horses.

It is known to control the execution of software on different workstations according to certain policy rules of a network server, thereby obtaining an improved computer security system by classifying the software. This classification may be based on several forms of data, one of which is, for example, a hash value of the software data. This is typically accomplished by calculating a hash value of the program (if the program is selected for loading and execution) and comparing the hash value to a trusted value to determine the rules to execute. The classification may also be based on a hash value of the content, a digital signature, a file system or network path, or a URL range.

The above-mentioned methods and systems describe the use of a hash function to check whether the application software is authentic or to control the execution of the application software. However, the problem of virus scanning all new files in the network with a conventional virus scanner, and thus of updating the fingerprint database of the conventional virus scanner on each local computer, is not discussed. One of the drawbacks of virus detection systems and data monitoring systems is that they are typically only able to provide protection against viruses or malware once these have been discovered, the fingerprint is known, and the local database in the network or on a local computing device of the network has been updated. This means that a considerable period of time passes between the first spread of a virus or malware and the moment the virus detection system or data monitoring system is able to detect and fight it. Generally, when an important update or upgrade of a virus detection system or data monitoring system is performed, either the entire system, e.g., the network, is rescanned, which is time-consuming and consumes computing power, or the system is not rescanned at all, leaving possible virus infections or malware in the system.

Disclosure of Invention

It is an object of the present invention to provide a system and method for identifying the content of a new file on a local computing device in a network. It is another object of the present invention to provide a method of updating or upgrading a content recognition device. Advantages of the invention include one or more of the following:

a) providing a high degree of reliability while limiting the necessity of updating the information required by the content identification program on each local computing device.

b) providing high efficiency and high security in a network system.

Another advantage of the present invention is that if the present invention is used as a virus checker, the security level is further improved because the fingerprint database of conventional virus scanners does not have to be updated on each local computing device.

Another particular advantage of the invention is that the content of a new file is identified only once for the entire network.

Another particular advantage of the present invention is that the total processor (CPU) processing time and network traffic in the network is reduced.

Another particular advantage of the present invention is that when a virus identification device, a malware identification device or a content identification device is upgraded or updated, the updated or upgraded version is effectively used to actively search for "contaminated" content. This may provide network security even for data created in the period between the appearance of a "contaminant", i.e., a virus, malware or impermissible content, and the moment the identification means became able to detect it. Because similar files can be easily identified and processed in the same way based on the data available in the metadata repository when a contaminated file is detected, cleanup of the network can be performed efficiently while reducing CPU and network time.

Another advantage of the present invention is that files do not have to be sent to the central server for inspection, but instead can be inspected locally while still using the central virus inspection device, thereby avoiding the risk of damaging the file during transmission to and from the central server.

At least one of the above objects and at least one of the advantages are obtained with a method and a system for content identification in a network according to the present invention.

A method of identifying the content of a data file in a network environment is used in a network having at least one local computing device linked to the remainder of the network environment including a central infrastructure. The method and system include calculating a reference value for a new file on one of the at least one local computing device using a one-way function, transmitting the calculated reference value to the central infrastructure, and comparing the calculated reference value to reference values previously stored within the remainder of the network environment.

The method further comprises, after the comparing, if the calculated reference value and the previously stored reference value are found to match, determining that the content of the new file has been identified and retrieving the corresponding content attribute; or if the calculated reference value and any previously stored reference value are found not to match, determining that the content of the new file has not been identified and then sharing the new file on the local computing device with the central infrastructure, the central infrastructure identifying the content of the new file by remotely identifying the content via a network environment, determining content attributes corresponding to the content of the new file and storing a copy of the content attributes, after the determination, triggering an operation on the local computing device in accordance with the content attributes.
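The compare-then-identify flow described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class `CentralInfrastructure`, the function `on_new_file`, the byte-pattern "scanner" and the action names are all hypothetical stand-ins, and SHA-256 is an assumed one-way function.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CentralInfrastructure:
    """Toy stand-in for the central infrastructure and its stored reference values."""
    known: dict = field(default_factory=dict)  # reference value -> content attributes
    scans: int = 0                             # how often content was actually identified

    def identify(self, data: bytes) -> list:
        """Placeholder content identification; a real engine would scan the file."""
        self.scans += 1
        return ["virus"] if b"EVIL" in data else ["clean"]

    def check(self, reference: str, fetch_file) -> list:
        """Return content attributes, identifying the file only when the value is unknown."""
        if reference in self.known:              # match: content already identified
            return self.known[reference]
        attributes = self.identify(fetch_file())  # no match: file is shared or sent, then identified
        self.known[reference] = attributes        # store a copy of the content attributes
        return attributes


def on_new_file(data: bytes, central: CentralInfrastructure) -> str:
    """Local side: compute the reference value, ask the centre, then trigger an operation."""
    reference = hashlib.sha256(data).hexdigest()
    attributes = central.check(reference, lambda: data)
    return "quarantine" if "virus" in attributes else "allow"


central = CentralInfrastructure()
first = on_new_file(b"quarterly report", central)
second = on_new_file(b"quarterly report", central)  # same content seen elsewhere in the network
third = on_new_file(b"payload EVIL payload", central)
```

In this sketch the second call reuses the stored attributes, so the file content is identified only once for the whole network, which is the saving the method aims at.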

In a method of identifying the content of a data file in a network environment, the reference value may be a hash value. The previously stored reference values may be stored in a central infrastructure. In a method and system of identifying content of a data file in a network environment, identifying content of a new file may include scanning the new file for viruses using an antivirus checker device on a central infrastructure.

The method may further comprise transmitting the new file from the local computing device to the central infrastructure prior to performing said identifying of the content of said new file. Further, it may include storing a copy of the new file on a central infrastructure. Storing a copy of the new file on the central infrastructure may be accomplished by transferring the copy from the local computing device to the central infrastructure. The address at which the file is stored may be stored along with the hash value to enable a quick trace of the copy of the file stored on the central infrastructure.

In the method of the present invention, triggering an operation on the local computing device in accordance with the content attribute may include replacing a new file on the local computing device with a copy of a previous version of the new file. Additionally, triggering an operation on the local computing device based on the content attributes may further include replacing a new file on the local computing device with another version of the new file restored from the remainder of the network environment.

The method of the present invention may further comprise sharing a new file on a local computing device with a central infrastructure prior to performing said identifying of the content of said new file, whereby said identifying of the content of said new file is performed by remotely identifying said content via a network environment. The method may include checking the operation of a local agent on a local computing device.

Further, an operation may be triggered on the local computing device after the content attributes corresponding to the new file are transferred to the local computing device.

In a method of identifying content of a data file in a network environment, identifying content of a new file may include one or more of scanning for adult content, scanning for self-advertising messages or unsolicited commercial e-mail (UCE), and scanning for copyrighted information. Scanning may be performed using a scanning device on the central infrastructure. The method also relates to a method and system for providing a content firewall whereby one local computing device is connected to an external network, such as the internet, and the one local computing device is also connected to the network environment formed by the remaining local computing devices. The one local computing device thereby connects the network environment with an external network and is the only computing device directly connected to a source external to the network environment. The local computing device thus functions as a content firewall protecting the network environment from attacks originating from places in the external network. The local computing device may function as a content firewall that operates in a promiscuous manner, i.e., the local computing device functions as a content firewall that looks at all traffic passing through, performs hashing and comparison functions, and contacts the proxy to enforce policies.

The method relates in particular to a method of checking the security status of a network and its components. In this embodiment, a method of determining the security status of a data file in a network environment is used in a network having at least one local computing device linked to the remainder of the network environment including a central infrastructure. The method comprises calculating a reference value for a new file on one of said at least one local computing device using a one-way function, transmitting said calculated reference value to said central infrastructure, comparing said calculated reference value with reference values previously stored in the remainder of the network environment, determining, after the comparison, that the security status of the file has been checked if the calculated reference value and the previously stored reference value are found to correspond, and retrieving the corresponding security status; or if the calculated reference value and any previously stored reference values are found to be non-matching, then determining that the security status of the new file has not been identified, then the central infrastructure checks the security status of the new file, determines a security status corresponding to the new file, stores a copy of the security status, and then, after the determination, triggers an operation on the local computing device based on the security status of the new file. The operation may be, for example, disabling access to the file by the user of the local computing device and other users in the network, or restoring an infected file.

The above-described method may be triggered by an operation performed on the local agent. The triggering operation performed on the local agent may be, for example, running an application or opening a file.

The invention also relates to a method of changing a system for identifying the content of a file in a network environment, said network environment comprising means for computing a one-way function, at least one local computing device linked to the remainder of the network environment comprising a central infrastructure, and means for identifying the content, said method comprising changing said means for identifying the content or said means for computing a one-way function, scanning the remainder of the network environment for reference values computed using the one-way function, for each reference value, requesting a file corresponding to said reference value from said network environment, sending the file to the means for identifying the content, identifying the content of said file, determining content attributes corresponding to the content of the file and storing a copy of said content attributes, sending the content attributes to each local computing device comprising the file, and after the sending, triggering an action on the local computing device based on the content attribute.

The invention also relates to a method of changing a system for identifying the content of a file in a network environment, said network environment comprising means for computing a one-way function, at least one local computing device linked to the remainder of the network environment comprising a central infrastructure, said remainder comprising a stored database, and means for identifying content, said method comprising changing said means for identifying content or said means for computing a one-way function, scanning the remainder of the network environment for reference values computed using the one-way function, for each reference value, requesting a file corresponding to said reference value from said network environment, identifying the content of said file, determining content attributes corresponding to the content of the file and storing a copy of said content attributes, sending the content attributes to each local computing device comprising the file, and after the sending, triggering an action on the local computing device based on the content attribute. Scanning the remainder of the network environment for the reference value calculated using the one-way function may include scanning a stored database for the reference value calculated using the one-way function. The request of the file corresponding to the reference value from the network environment may be followed by a transmission of the file to a device that identifies content. Alternatively, files may be shared, and the identification of the content may be made over a network. Sharing may occur under a secure connection and may be limited to between the local computing device and the central infrastructure. The change of the system to identify the content of the file in the network environment may be triggered by the introduction of a new one-way function to calculate the reference value, and also by the update of the means to identify the content of the file. 
In the method, scanning the remainder of the network environment for the reference values calculated using the one-way function may include scanning only for reference values generated after a predetermined date. The predetermined date may be related to the creation date of the virus or malware for which the change was made. Sending the content attributes to each local computing device containing the file may include identifying each local computing device containing the file using a stored database and sending the content attributes to the identified local computing devices. The method may also be used to scan only a portion of the hash keys in the remainder of the network environment, e.g., the hash keys of files whose contents were identified after a certain date, to minimize the operations to be performed. The date of the previous content identification may be retrieved from the content attributes. For each of the identified local computing devices that is not connected to the network, sending the content attributes to the identified local computing device may include creating an entry in a waiting list, and sending the content attributes to the identified local computing device in accordance with the entry on the waiting list when the local computing device is reconnected to the network. Requesting the file corresponding to the reference value from the network environment may include creating an entry in a waiting list if none of the local computing devices having the file corresponding to the reference value is connected to the network, and requesting the file corresponding to the reference value from the local computing device according to the entry when the local computing device is reconnected to the network.
The method may also include identifying whether the content attributes correspond to unwanted content and, if so, identifying a local computing device that introduced the unwanted content into the network first, based on data stored in the database.
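The re-identification procedure after an update of the identification means, including the date filter and the waiting list for disconnected devices, could look roughly like the Python sketch below. All names (`Record`, `Network`, `rescan`), the integer dates and the tuple layout of waiting-list entries are assumptions made for illustration; the text does not prescribe any data layout.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    attributes: list    # content attributes from the previous identification
    identified_on: int  # date of the previous identification (here: a day number)
    devices: set        # local computing devices known to hold the file


@dataclass
class Network:
    online: dict                                      # device name -> {reference value: file bytes}
    waiting_list: list = field(default_factory=list)  # deferred work for offline devices
    delivered: list = field(default_factory=list)     # (device, reference value, attributes)


def rescan(database, network, identify, today, since=None):
    """Re-identify stored reference values with the changed identification means."""
    for reference, record in database.items():
        if since is not None and record.identified_on < since:
            continue  # only values handled after the predetermined date are revisited
        holders = [d for d in record.devices if d in network.online]
        if not holders:
            # No connected device has the file: request it once a holder reconnects.
            network.waiting_list.append(("request", reference))
            continue
        data = network.online[holders[0]][reference]
        record.attributes = identify(data)
        record.identified_on = today
        for device in sorted(record.devices):
            if device in network.online:
                network.delivered.append((device, reference, tuple(record.attributes)))
            else:
                # Device is offline: send the attributes when it reconnects.
                network.waiting_list.append(("send", device, reference))


database = {
    "h1": Record(["clean"], 10, {"pc1", "pc2"}),
    "h2": Record(["clean"], 2, {"pc1"}),
    "h3": Record(["clean"], 12, {"pc3"}),
}
network = Network(online={"pc1": {"h1": b"EVIL", "h2": b"ok"}})
rescan(database, network, lambda d: ["virus"] if b"EVIL" in d else ["clean"], today=20, since=5)
```

After the call, `h1` is re-identified and its new attributes are delivered to the connected device, `h2` is skipped by the date filter, and the waiting list holds one pending send and one pending file request.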

The reference value may be a hash value. The means for identifying content may be an anti-virus checker means, a means for scanning adult content, a means for scanning self-advertising messages, or a means for scanning copyrighted information. Triggering an operation on the local computing device based on the content attributes may include replacing a file on the local computing device with another version of the file restored from the remainder of the network environment, or may include replacing the file with a copy of a previous version of the file, or may include putting the file in isolation or removing the file.

The present invention also relates to a computer program product performing any of the above methods when executed on a network. The invention also relates to a system for identifying the content of a file in a network environment comprising at least one local computing device linked to the remainder of the network environment comprising a central infrastructure, said remainder comprising a stored database, whereby the system comprises means for calculating a reference value for a new file on said local computing device using a one-way function, means for transferring said calculated reference value to said central infrastructure, and means for comparing said calculated reference value with a previously stored reference value from the database. The system further comprises means for determining whether the content of the new file has been identified based on a comparison of said calculated reference value with a reference value previously stored in said remaining portion, means located on the central infrastructure for identifying the content of the new file if the new file has not been identified so as to assign a content attribute, means for storing said content attribute in said remaining portion, means for triggering an operation on said local computing device based on the content attribute of said new file.

In a system according to the invention, the means for identifying the content of a file may comprise an anti-virus checker means on said central infrastructure. Further, the system may comprise means for storing a copy of the new file in the remaining portion. The means for identifying the content of the file may include one or more of means for scanning adult content, means for scanning self-advertising messages, and means for scanning copyrighted information.

The invention also relates to a machine readable data storage device storing a computer program product which, when executed on a network, performs any of the above methods. Furthermore, the present invention also relates to the transmission of a computer program product performing any of the above methods.

Particularly preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with features of the independent claims and with features of other dependent claims as appropriate and not merely as explicitly set out in the claims.

While there have been constant improvements, changes and developments in the methods of virus scanning and content identification of data files, the principles of the present invention represent substantially novel improvements, including departures from prior practices, thereby providing a more efficient, stable and reliable method of this nature.

These and other characteristics, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the principles of the invention. The following description is for the purpose of illustration only and is not intended to limit the scope of the present invention. The reference numbers quoted below refer to the attached drawings.

Drawings

FIG. 1 is a schematic diagram of a computer network.

FIG. 2 is a schematic diagram of a central infrastructure and its basic software components.

FIG. 3 is a schematic diagram of a content identification process driven by the local agent.

FIG. 4 is a schematic diagram of a content identification process driven by the metadata repository.

FIG. 5 is a schematic diagram of a computer network to which the content firewall system and method may be applied.

The same reference numbers in different drawings identify the same or similar elements.

Detailed Description

The present invention will be described with respect to particular embodiments and with reference to certain drawings, but the invention is not limited thereto and is limited only by the claims. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes. Where the term "comprising" is used in the description and claims, it does not exclude other elements or steps.

Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

In this description, the terms "file," "program," "computer file," "computer program," "data file," and "data" are used interchangeably, and the use of any one of them may imply the others, depending on the context. The terms "hash" and "hashing" will be used as examples of the application of a one-way function, but the invention is not limited to a particular form of one-way function.

The term "computing device" should be broadly construed to include any device capable of computing and/or executing an algorithm. The computing device may be any one of a laptop, a workstation, a personal computer, a PDA, a smart phone, a router, a network printer, or any other device having a processor and capable of connecting to a network, such as a facsimile device or a copier, or any special purpose electronic device, such as a so-called "hardware firewall" or a modem.

The method and system for protecting and controlling a network by identifying the content of each new file in the network can be used on any type of network. The network may be a private network, which may be a virtual private network, a Local Area Network (LAN) or a Wide Area Network (WAN). The network may also be part of a public wide area network, such as the Internet. If part of a public wide area network is used, the method and system for identifying the content of each file may be provided remotely by a service provider using an ASP or XSP business model, in which a central infrastructure is provided to paying customers operating local computing devices. An exemplary network 10 is shown in FIG. 1, which shows several local computing devices 50a, 50b. For the method of protecting and controlling the network 10 according to the present invention, there is no limit to the number of local computing devices 50 connected to the network 10. In a commercial environment, the number of local computing devices 50 typically ranges from a few to hundreds. The method and system for identifying the contents of each new file present in network 10 may be used with many different operating systems, such as Microsoft DOS, Apple Macintosh OS, OS/2, Unix, and DataCenter-Technologies' operating systems.

To provide a fast method of protecting files and identifying their content, the method and system according to the present invention determine the hash values of new files present on the local computing device 50, compare them to previously stored hash values and file information on the central server, and, for files that are new to the network 10, determine the content using a content recognition engine on the central infrastructure 100. The content attributes describing the content of the new file are then sent to the local computing device 50, where appropriate action is taken. Alternatively, the content attributes are not sent to the local computing device 50, and the appropriate operations are instead triggered from the central infrastructure 100. A new file is typically a file in which new content has been generated on the local computing device 50, or an external file that has been received. The term "file" may refer to data, as well as application software (also referred to as software).

Identification of the content of the file or data may be accomplished by sending the file or data to the central infrastructure 100, examining the file or data at the central infrastructure 100, or by sharing the file or data locally so that the central infrastructure 100 can remotely identify the content of the file or data. The sharing may be implemented in a secure environment. The sharing may be limited between the local computing device 50 having the file or data and the central infrastructure 100.

The central infrastructure 100 contains a database, also referred to as a metadata repository 110, the metadata repository 110 containing a record of each hash value calculated for a file already present on one of the local computing devices 50. In addition to hash values, the record contains many other fields. In these fields, file source information is stored. The file source information corresponding to a particular hash value includes the file name, a list of local computing devices 50 on which the file corresponding to the hash value resides, a path to the file on the file system of the local computing device 50, and the last modified date. An example of file source information for a particular file is given in table 1.

Filename	Myexampleword.doc
Path	c:data
Assetname	Pcmarketing001
ModDate	23/4/2002

TABLE 1

In another field, a list of content attributes identifying the type of content enclosed by the file is stored. The content attributes may indicate, for example but not limited to, files containing viruses, copyrighted MP3 audio files, copyrighted video files, pictures that may contain adult content, unsolicited advertising messages (SPAM), hoaxes (HOAX), files that contain explicit lyrics, or files that contain multiple pieces of executable code.
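By way of illustration only, a record of the metadata repository 110 might be modelled as follows. This is a minimal sketch: the class names `FileSource` and `MetadataRecord`, the field names, the placeholder hash string and the attribute label `"virus-free"` are assumptions made for the example and are not prescribed by the description.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FileSource:
    """Where one copy of a file lives (the four fields of Table 1)."""
    filename: str
    path: str
    assetname: str
    mod_date: str


@dataclass
class MetadataRecord:
    """One repository record, keyed by the hash value of the file."""
    hash_value: str
    sources: List[FileSource] = field(default_factory=list)
    content_attributes: Set[str] = field(default_factory=set)


# Placeholder hash string; a real record would hold the computed digest.
record = MetadataRecord(hash_value="example-hash")
record.sources.append(
    FileSource("Myexampleword.doc", "c:data", "Pcmarketing001", "23/4/2002")
)
# Hypothetical attribute label, chosen only for this sketch.
record.content_attributes.add("virus-free")
```

One record thus holds a single hash value, every known location of the corresponding file, and the content attributes shared by all of those copies.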

The central infrastructure 100 also contains a content recognition engine 120. The content recognition engine 120 may be an application software 130 or a set of application software 130a, 130b, 130c, 130d that uses the content of a file to determine what content the file contains. These applications can be varied:

-a virus scanner: this is a software that scans the content of existing files and compares it with a database of known fingerprints of viruses. It may be conventional virus scanning software, such as Norton AntiVirus from Symantec Corporation, McAfee from Network Associates Technologies Inc., PC-cillin from Trend Micro, Kaspersky Anti-Virus from Kaspersky Lab, or F-Secure Anti-Virus from F-Secure Corporation.

-an adult content scanner for pictures: it is a software that scans the content of the existing files for the presence of shades, colors and textures that may represent adult content. Scanning pictures for adult content is known. Adult content may be determined by, for example, the amount of nudity displayed. Skin tones have hue and saturation values in a specific range. Thus, if the image is scanned, the number of pixels having skin tone characteristics can be determined and compared to the total number of pixels. The ratio of skin tone pixels to the total number of pixels gives an estimate of how likely the image is to contain adult content. A threshold is typically introduced so that the image can be classified according to its likely adult content. In a similar manner, video may be classified by dividing the video into its individual frames and classifying each frame as described above.

-a scanner of internet content rating: it is a software that scans objects for adult content according to PICS, the Platform for Internet Content Selection labeling system. On a voluntary basis, an internet content provider may provide an internet object with a PICS rating that indicates the adult content in the internet object. The PICS rating is stored in the metadata of the object. This data is generally not visible to viewers of internet objects. The rating system is well known. An example of a scanner for internet content rating is provided in the Netscape web browser for scanning the content of web pages.

-a scanner for scanning objects for explicit lyrics that may indicate adult content. Such scanners are known for use with text files and audio files. An audio file is first converted into a text file. The text file is then scanned and compared to a database containing explicit lyrics.

-a SPAM engine: this is a piece of software that scans the content of email messages for the presence of so-called SPAM. Algorithms for identifying SPAM are known. These algorithms are generally based on decomposing the text of an email message, associating statistical information with the text using a statistical analysis program, and coupling a neural network engine to the statistical analysis program to identify unwanted messages based on statistical indicators.
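The skin tone ratio test described above for the picture scanner can be sketched as follows. This is only a hedged illustration: the hue and saturation window, the 0.4 threshold, the attribute labels and the function names are assumptions chosen for the example, and a practical scanner would also use the shading and texture cues mentioned above.

```python
import colorsys

# Assumed skin-tone window in HSV space; illustrative values only.
MAX_HUE = 50 / 360
MIN_SAT, MAX_SAT = 0.2, 0.7


def is_skin_tone(rgb):
    """Return True if an (r, g, b) pixel (0-255 each) falls in the window."""
    r, g, b = (channel / 255 for channel in rgb)
    hue, sat, _value = colorsys.rgb_to_hsv(r, g, b)
    return hue <= MAX_HUE and MIN_SAT <= sat <= MAX_SAT


def skin_ratio(pixels):
    """Ratio of skin-tone pixels to the total number of pixels."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(is_skin_tone(p) for p in pixels) / len(pixels)


def classify(pixels, threshold=0.4):
    """Apply the threshold to obtain a (hypothetical) content attribute."""
    if skin_ratio(pixels) >= threshold:
        return "possible-adult-content"
    return "no-flag"
```

For video, the same `classify` function could be applied frame by frame, as the description suggests.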

Other examples of application software that may be used in the content recognition engine 120 are, for example, an engine that scans for copyrighted content, an engine that compares the content of a file to a database of copyrighted information, and the like. In some implementations, an operator may play the role of the content recognition engine 120 by manually tagging the file with content recognition attributes. When the content recognition engine 120 is activated, it takes as input a file from the local agent and generates a set of attributes representing the detected content.

The content recognition engine 120 also allows checking whether data on the local computing devices 50 complies with rules for licensing data on the network or on these local computing devices 50. These rules may be different for different local computing devices 50.

Thus, the content recognition engine 120 will be configured as a piece of software that aggregates the functionality of a set of third party engines.

In another embodiment of the present invention, a system and method are described according to the above embodiments, whereby the record corresponding to a particular hash value stored in the metadata repository 110 further comprises a field storing the location, on the central infrastructure 100, of the file corresponding to the hash value. In this embodiment, copies of all of the different files present on the local computing devices 50 in the network 10 may be stored on the central infrastructure 100. Thus, the central infrastructure 100 of this embodiment may also include a large amount of storage space. This storage is preferably a secure part of the central infrastructure 100 that is not directly connected to the network 10, so that if a file on the local computing device 50 is corrupted, for example by a virus, an identical copy of the file that was present on the local computing device 50 can be used.

A hash value of the file is calculated using a hash function. The hash function is generally a one-way function, i.e. given a known digest, reconstruction of the original data is at least computationally prohibitive. Different types of hash functions may be used: MD5, available from RSA Data Security Inc.; SHA-1; RIPEMD; HAVAL, designed at the University of Wollongong; Snefru, a Xerox secure hash function; and the like. The most commonly used hash functions are MD5 and SHA-1. The MD5 algorithm takes a message of arbitrary length as input and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages with the same message digest, or to produce a message having a given, pre-specified target message digest. The MD5 algorithm is intended for digital signature applications, where large files must be 'compressed' in a secure manner before being encrypted with a private key under a public key cryptosystem. The MD5 algorithm is designed to be fairly fast on a 32-bit machine. In addition, the MD5 algorithm does not require any large substitution tables; the algorithm can be coded fairly compactly. An alternative hash function, SHA-1 (Secure Hash Algorithm 1), produces a 160-bit hash. Newer versions of this algorithm also provide bit lengths of 256 and 512.
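As a minimal sketch of how a local agent might compute such a hash value, the following uses the `hashlib` module of the Python standard library. The helper names `hash_stream` and `hash_file` and the chunk size are assumptions for the example; the description does not prescribe any particular implementation.

```python
import hashlib
import io


def hash_stream(stream, algorithm="md5", chunk_size=65536):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_file(path, algorithm="md5"):
    """Hash the file at `path` (hypothetical helper for a local agent)."""
    with open(path, "rb") as handle:
        return hash_stream(handle, algorithm)


# The hexadecimal digests have 32 and 40 characters,
# i.e. 128 bits for MD5 and 160 bits for SHA-1.
md5_hex = hash_stream(io.BytesIO(b"abc"), "md5")
sha1_hex = hash_stream(io.BytesIO(b"abc"), "sha1")
```

Reading in chunks keeps the memory use of the agent small regardless of file size, which fits the stated aim of limiting the load on the local computing device 50.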

In the above-mentioned embodiments describing the method and system of protecting and/or controlling network 10, the local agent is installed on local computing device 50. A local agent is software that runs on the local computing device 50 and executes certain algorithms and programs. The local agent on the local computing device 50 is typically triggered when new content is generated on the local computing device 50. To avoid unnecessary hash value calculations and data transfers, a policy is established that determines which operations will trigger the local agent and which operations will not. For example, if a text document is being created, the file need not be checked each time the document is stored. The policy for such documents should preferably be to check the document when it is stored and closed. Some examples of operations that may trigger the local agent to initiate the content recognition process are opening or receiving email messages, opening or receiving email attachments, running executable files, and running files with .dll or .pif extensions. Applying the policy thus avoids repeated checking and scanning of documents, which limits the number of unnecessary hash computations and content identification operations and thereby limits unnecessary use of CPU time and network traffic. The method and system of content identification is not limited by the type of application that generates the file.
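The trigger policy described above could be expressed, purely as an illustrative sketch, as a small decision function. The event names (`"save"`, `"save_and_close"`, `"run"` and so on) and the function name `should_trigger` are hypothetical labels introduced for this example only.

```python
import os

# Hypothetical event names; the description only lists the operations in prose.
TRIGGER_EVENTS = {
    "save_and_close",
    "open_email",
    "receive_email",
    "open_attachment",
    "receive_attachment",
    "run_executable",
}
CHECKED_EXTENSIONS = {".dll", ".pif"}


def should_trigger(event, filename):
    """Decide whether an operation should start the content recognition process."""
    if event in TRIGGER_EVENTS:
        return True
    extension = os.path.splitext(filename)[1].lower()
    # Running a file with a listed extension also triggers the agent.
    return event == "run" and extension in CHECKED_EXTENSIONS
```

Under this sketch an intermediate `"save"` of a document being edited does not trigger a hash computation, whereas `"save_and_close"` does.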

The content recognition process may be triggered by a local agent on the local computing device 50 or may be triggered by the central infrastructure 100. The latter generally occurs where new algorithms or tools are used for content recognition. Such new algorithms or tools may be optimized algorithms and tools, or tools that have not been previously installed. Some examples of such tools (not limited to these functions) are a virus check, checking if the file is a copyrighted MP3 audio file, checking if the file is a copyrighted video file, checking if the file is a picture that may contain adult content, checking if the file is tagged as SPAM or HOAX, checking if the file contains explicit lyrics, or checking if the file contains a number of pieces of executable code that are copyrighted. Updates of these tools may affect the state of the file and thus, in principle, the corresponding records in the metadata repository 110. Thus, depending on the type of update of the content recognition engine 120, it may be worthwhile to update the corresponding records.

In one embodiment, the method relates to virus checking in a network environment. The network 10 to which this method may be applied is the same as the network 10 described in relation to the previous embodiment. The local agent calculates a hash value of the new file on the local computing device 50. The new file may contain new content generated on the local computing device 50 or may be an external file received on the local computing device 50. The hash value and corresponding file information for the new file are then sent to the central infrastructure 100, also referred to as a server, where the hash value is compared to the previously stored hash values corresponding to files already present on the different local computing devices 50 of the network 10. This comparison allows checking whether the file is new throughout the network 10. Alternatively, the hash value may first be compared to a local database of hash values and file information corresponding to files present on the particular local computing device 50, and then, if the file is found not to be already present on the local computing device 50, the hash value and corresponding file information may be exchanged with the central infrastructure 100 so that it can be checked whether the file is new throughout the network 10. Although the file information and hash values transmitted for each new file already represent only a small portion of the network traffic when compared with conventional central virus checkers, this alternative may further reduce the network traffic for virus checking. If the hash value is identified as new on the network 10, the metadata repository agent triggers the local agent to transfer the file corresponding to the new hash value from the local computing device 50 to the central infrastructure 100. The transfer of the file may be done in a secure manner, i.e. the file is transferred such that it cannot be affected by a virus present on the network connection, or such that, if it contains a virus, the virus cannot spread throughout the network 10. For this purpose, known secure transport routes, tunnels and/or known session encryption/decryption techniques may be used. In an alternative embodiment, files or data may be shared with the central infrastructure 100 so that the virus checker can inspect the files or data remotely. A conventional virus checker installed and updated on the central infrastructure 100 then checks the file for viruses. The virus checker may be any conventional virus checker, such as Norton AntiVirus from Symantec Corporation, McAfee from Network Associates Technologies Inc., PC-cillin from Trend Micro, Kaspersky Anti-Virus from Kaspersky Lab, or F-Secure Anti-Virus from F-Secure Corporation.

A particular advantage of the above-described embodiments of the present invention is that the virus scanning software does not have to be updated at each local agent; updates are limited to the virus scanning software of the central infrastructure 100. In this way, the security level of the network 10 is significantly improved, as security does not rely on the punctuality of different users of the network 10 in updating their virus scanning software. If the scanned file does not contain any virus, it will be marked as a virus-free file in the metadata repository 110. If a virus is found in a file, the file will be marked as dangerous. A query will be made to the metadata repository 110 to find all files within the network 10 that have the same compromised hash key. The result is a list of files and paths, together with the asset names of the devices on which the files are located. This information may be used to perform operations that eliminate the discovered virus from all local computing devices 50, i.e., all workstations, in the entire network 10. In this way, active virus scanning can be performed on the other local computing devices 50 based on virus detection on the first local computing device 50. According to the policy defined for virus checking, the virus engine will notify the agents installed on the infected systems to remove the file, if possible by replacing it with a repaired version delivered by the virus engine located on the central infrastructure 100, or with a previous version of the file that is not yet infected. The latter is facilitated by searching the metadata repository 110 for previous versions of the file, or by searching for uninfected versions on another local computing device 50. If an uninfected version cannot be retrieved from another local computing device 50 or from the metadata repository 110 residing on the central infrastructure 100, the virus scanner should have features that allow it to store a new, disinfected copy of the file on the central infrastructure 100. These advantages also exist for other content identification packages.
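The query for all copies of an infected file can be illustrated as follows. This is a minimal sketch that assumes the metadata repository is held as an in-memory dictionary keyed by hash value; the function name `record_scan_result`, the dictionary layout and the attribute labels are assumptions for the example.

```python
def record_scan_result(repository, hash_value, infected):
    """Mark a hash as virus-free or dangerous and list every copy to clean up.

    `repository` maps a hash value to a dict with "sources" (a list of dicts
    holding assetname, path and filename) and "attributes" (a set of labels).
    """
    record = repository[hash_value]
    record["attributes"] = {"virus"} if infected else {"virus-free"}
    if not infected:
        return []
    # Every device holding a file with the compromised hash must act on it.
    return [(s["assetname"], s["path"], s["filename"]) for s in record["sources"]]
```

A single positive scan thus yields the full list of affected devices without any of them having to scan the file locally.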

In an alternative embodiment, if a file with a new hash value has been identified in the network 10, the file may be automatically shared locally, and a remote checker may then inspect the file across the network 10 by means of the file share, rather than the file being transmitted to the central infrastructure. Content tagging is still performed by the server. To improve security, access to the shared files is limited to the server. In addition, a Java applet may be transferred to the local agent to allow examination of other files.

The foregoing embodiment is an improvement over a central virus checker that scans local computing devices 50 over network 10. Such scanning is only possible if the local drives, e.g., C:\ and D:\, are shared. In addition to the security risk posed by sharing, the local user can easily change the local sharing settings, preventing the remote checker from checking the files. The present invention avoids this situation at least in part because changing the sharing settings in the network 10 does not affect the operation of computing the hash value for the new file and sending it to the central infrastructure 100.

Another advantage is that the present invention saves CPU time on the local computing device 50 because the CPU does not have to keep on checking for viruses, but the CPU only needs to compute the one-way function. The invention also saves network time: the management server does not have to update the virus checkers on the local computing devices 50 with virus updates because only a single central virus checker is used and updated.

Fig. 3 illustrates a method 200 of content recognition processing triggered by a local agent on the local computing device 50, according to the above-mentioned embodiment. The different steps that occur in this process, both at the local computing device 50 and at the central infrastructure 100, are discussed below.

The content recognition process builds on the constant scanning of the local agent for new data or applications on the local computing device 50. The scanning of data and applications is limited by policy rules that determine when the local agent should be triggered, as described above. If a "new" file is detected, a method of protecting and controlling the network 10 by content recognition of the new file is initiated. This is step 210. The method 200 then proceeds to step 212.

At step 212, a hash value for the "new" file is calculated using a hash function, such as MD5 or SHA-1. The calculation is performed by utilizing some of the CPU time of the local computing device. However, the amount of CPU time used is significantly less than the CPU time required if a conventional virus checker were used to check files on the local computing device 50. The method 200 then proceeds to step 214.

At step 214, the hash value and file source information are transmitted from the local agent to the central infrastructure 100 of the network 10. Such transfer may be a secure transfer, if desired, to prevent a virus located on the network connection from changing the file source information or hash key during the transfer of the data. Such secure transmission may be accomplished by known secure transmission routes, via tunnels, or using known session encryption/decryption techniques.

At step 216, the hash value is compared to data already present in the metadata repository 110. Since, as previously mentioned, the metadata repository 110 stores the hash value and file source information for every file that is already present on the network 10 and is therefore not "new", it is possible to check whether the file already exists in the network 10. Thus, if the hash value is identified as new, this means that the file is "new" for the entire network 10. If the file is new, the method 200 proceeds to step 218. If the hash value is not new, this means that the file already exists on one of the local computing devices 50 in the network 10. In this case, there is already a content attribute describing the content of the file. The method 200 then proceeds to step 224.

In step 218, the metadata repository agent triggers the local agent to transfer the file corresponding to the new hash value from the local computing device 50 to the central infrastructure 100. The transfer of the file may be done in a secure manner, i.e. the file may be transferred such that it cannot be affected by a virus present on the network connection, or such that if the file contains a virus, the virus cannot spread throughout the network 10. For this purpose, known secure transport routes, tunnels and/or known session encryption/decryption techniques may be used. The method 200 then proceeds to step 220.

At step 220, the file is loaded into the content recognition engine 120 and the file is processed. For this process, the CPU time of the central infrastructure 100 is used. As previously mentioned, the content recognition engine 120 may include a conventional virus checker, a means of checking for picture information, a means of checking for SPAM, and the like. This may be a repetitive operation in which multiple content recognition engines are invoked in turn. The method 200 then proceeds to step 222.

In step 222, for the file, content attributes identifying the contents of the file are determined. These content attributes are then stored in the metadata repository 110, allowing the state of the file to be identified if the file is deemed "new" on another local computing device 50 in future operations. The method 200 then proceeds to step 224. Depending on the embodiment used, a next step may include storing the file on the central infrastructure 100 and adding the path to the file to the metadata repository 110. This step is not shown in fig. 3.

In step 224, the content attributes are sent to the local agent. Based on the content attributes, the local agent performs appropriate actions in accordance with policy rules set for the content attributes. This is done in step 226. The appropriate operation may be, for example, deleting a file if it is infected, or replacing the file with a previous version that was not infected. In one embodiment, the execution of the appropriate action based on the policy rules is triggered by the agent of the metadata repository 110, such that step 224 may be omitted.
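The flow of method 200 can be summarized in a short sketch. It is an illustration under stated assumptions rather than the implementation of the described system: the class `CentralInfrastructure`, the function `handle_new_file`, the in-memory dictionary standing in for the metadata repository 110 and the `scans` counter are hypothetical, and the file transfer of step 218 is modelled as a plain function argument.

```python
import hashlib


class CentralInfrastructure:
    """Toy stand-in for the central infrastructure 100."""

    def __init__(self, engine):
        self.engine = engine      # callable: bytes -> set of attribute labels
        self.repository = {}      # hash value -> content attributes
        self.scans = 0            # counts invocations of the engine

    def lookup(self, hash_value):
        # Step 216: is this hash already known to the network?
        return self.repository.get(hash_value)

    def scan(self, hash_value, data):
        # Steps 218-222: receive the file, recognize content, store attributes.
        self.scans += 1
        attributes = self.engine(data)
        self.repository[hash_value] = attributes
        return attributes


def handle_new_file(data, central):
    """Local agent side of method 200 for one new file."""
    hash_value = hashlib.sha1(data).hexdigest()    # step 212
    attributes = central.lookup(hash_value)        # steps 214-216
    if attributes is None:                         # file is new to the network
        attributes = central.scan(hash_value, data)
    return attributes                              # step 224
```

When the same content later appears on another device, `lookup` returns the stored attributes and the engine is not invoked again, which is the behaviour the method aims for.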

The content policy is a policy that determines what should be done with the file based on the content attributes determined by the content recognition engine 120. Content policies may include various operations, such as deleting a file, deleting a file and replacing the file with a previous version, copying a file to another computing device while keeping a copy on the originating computing device, transferring a file to another computing device while deleting the original file on the originating computing device, logging the existence of a file, changing a property of a file, such as hiding the file or making it read-only, making a file unreadable, making a file unexecutable, and so forth. The content policy will be enforced by the local agent, for example, when the content attributes are received from the central infrastructure 100. The content policies for the agent will be downloaded by the agent from the central policy infrastructure to the local computing device 50.
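A content policy of this kind might be represented as a simple mapping from content attributes to operations. The following is a sketch only: the attribute labels, the operation names and the function `actions_for` are assumptions for the example, and the operations are returned as names rather than executed.

```python
# Hypothetical policy table: content attribute -> ordered list of operations.
CONTENT_POLICY = {
    "virus": ["delete", "restore_previous_version"],
    "spam": ["log"],
    "adult-content": ["make_unreadable", "log"],
}


def actions_for(attributes, policy=CONTENT_POLICY):
    """Collect the operations for a set of attributes, without duplicates."""
    actions = []
    for attribute in sorted(attributes):       # sorted for a stable order
        for action in policy.get(attribute, []):
            if action not in actions:
                actions.append(action)
    return actions
```

Because the table is plain data, a different table could be downloaded for each local computing device 50, consistent with policies that differ between devices.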

Fig. 4 shows a method 300 of content recognition processing triggered by the content recognition engine 120 according to the above-mentioned embodiment. The different steps that occur on the local computing device 50 and on the central infrastructure 100 during this process are discussed below.

This process is typically used where new algorithms or tools are used for content recognition. Such new algorithms or tools may be optimized algorithms and tools, or tools that were not previously installed. As previously mentioned, this may be governed by a policy: whether the content recognition process is triggered may depend on the type of new algorithms and tools being used for content recognition.

The method 300 is initiated by a change to the content recognition engine 120, such as by providing a new algorithm or tool to the content recognition engine 120. A typical example is an update of the fingerprint database used in a virus checker or content recognition engine, which takes place once a virus or malicious data has been generated, has been recognized, and a fingerprint for use in the virus checker or content recognition engine has been produced. Since there is a considerable period of time between the generation of the virus and the time when the virus checker or content recognition engine is able to detect the virus or malicious data, during which the network is not secure, it is advantageous to have a system that allows for an active check in an efficient manner, i.e. a check of the files produced during this time interval. In conventional systems, the entire network typically needs to be rescanned, requiring a significant amount of CPU time and network bandwidth; otherwise the system is left in an unsafe state.

When triggered, in a first step 302 of the method 300, the metadata repository 110 is scanned for a hash value corresponding to the hash key. The method 300 then proceeds to step 304.

At step 304, a file corresponding to the hash key is requested. The file may be requested from a central storage on central infrastructure 100 or may be requested from local computing device 50. The local computing device 50 then allows the central infrastructure 100 to upload the corresponding file. The path to the file corresponding to the hash value may be obtained from the record corresponding to each hash value. If the records store different paths that each correspond to a copy of the corresponding file, then the agent on central infrastructure 100 retrieves one copy of the file by, for example, scanning the paths listed in the records until a local computing device 50 is found that is then connected to network 10 and allows upload of the file. The method 300 then proceeds to step 306.

Once the file is retrieved, the file is sent to the content recognition engine 120. This is done in step 306. The upgraded content recognition engine 120 then scans the content of the file and generates content attributes corresponding to the file. The method 300 then proceeds to step 308.

In step 308, the content attributes are stored in the metadata repository 110 to allow future security steps to immediately identify the content of the file. The method 300 then proceeds to step 310.

In step 310, the content attributes are sent to each local agent residing on a local computing device 50 that stores the corresponding file. The paths may be found in the record of the corresponding hash key stored in the metadata repository 110. In this step, the content attributes are sent for each file whose path is mentioned in the record of the corresponding hash key. If, at the time of the check, the local computing device 50 is disconnected from the network, a waiting list may be created so that the necessary files are checked once the computer is reconnected to the network. The waiting list may be created during the step of providing content attributes for certain files, as well as during the step of requesting files to identify their content. The list may be created by the central infrastructure or at a local distribution point located further downstream in the network. Disconnection of the local computing device 50 occurs particularly frequently when the local computing device 50 is a portable computing device, such as a laptop computer. In this manner, the security of a disconnected local computing device 50 that is part of the network is also ensured. The method 300 proceeds to step 312.

In step 312, the local agent on the corresponding local computing device 50 executes the policy corresponding to the content attributes. This policy may vary from one local computing device 50 to another.
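The steps of method 300, including the waiting list for disconnected devices, can be sketched as follows. The sketch makes several assumptions that are not part of the description: the repository is an in-memory dictionary, each source is an `(assetname, path)` tuple, `fetch` and `engine` are caller-supplied callables, and `online` is the set of devices currently connected.

```python
def rescan(repository, fetch, engine, online):
    """Re-evaluate every known hash after the recognition engine was updated.

    Returns (notifications, waiting_list): attributes to send now, and
    (assetname, hash_value) pairs to handle once a device reconnects.
    """
    notifications, waiting_list = [], []
    for hash_value, record in repository.items():              # step 302
        reachable = [s for s in record["sources"] if s[0] in online]
        if not reachable:
            # No connected device can supply the file: defer the request.
            waiting_list.extend((asset, hash_value) for asset, _ in record["sources"])
            continue
        data = fetch(*reachable[0])                            # step 304
        record["attributes"] = engine(data)                    # steps 306-308
        for asset, _path in record["sources"]:                 # step 310
            if asset in online:
                notifications.append((asset, hash_value, record["attributes"]))
            else:
                waiting_list.append((asset, hash_value))
    return notifications, waiting_list
```

Each distinct file is fetched and scanned once, regardless of how many devices hold a copy; only the resulting attributes are distributed.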

One major advantage of embodiments of the present invention is that new files need only be scanned once for the entire network 10. If the same copy of the file is used, installed, opened, or stored and closed on another local computing device 50, the file will be recognized by the central infrastructure 100 as known to the network 10, thus avoiding the need to re-examine the contents of the file. This is particularly advantageous if the present invention is used in a network 10 having a large number of local computing devices 50.

The method of an embodiment may also be implemented on a network having a central infrastructure 100, a number of distribution points, each consisting of a computing device, and a number of local computing devices 50 for each of the distribution points. In this way, at least a portion of the processing steps, such as creating a waiting list or actively searching, may be performed by an agent on the computing device of the distribution point. The distribution points may correspond to physically isolated areas in the network.

When operating, the method and system of identifying the content of a new file may optionally include periodically checking the 'heartbeat' of the local agent, i.e., it may be checked whether the local agent is still running on the local computing device 50. This can prevent the user from shutting down the agent locally and thereby making the local computing device 50 vulnerable. If the local agent has been turned off, an alert may be sent to a network administrator. Further, an alert message may be sent to the local computing device 50, alerting a user of the local computing device 50. The network administrator may also place the local computing device 50 in quarantine so that it cannot harm other local computing devices 50 in the network 10. In addition, the central agent may also attempt to restart the local agent.

In a similar manner, the method and system for identifying the content of a new file may optionally periodically check whether the local computing device 50 is still connected to the network 10. If the local computing device 50 is no longer connected to the network 10, the local agent may also be operable to store the hash key for the new file in a waiting list to be checked immediately when the network connection is restored. Meanwhile, the corresponding file may be brought into an isolated state, or the file may be prevented from being executed depending on the type of the file.

The embodiments described above may be used as content firewalls for different computing devices connected to external networks. For each input/output file, input/output message or input/output data frame, the content firewall calculates a hash value, checks whether it is new, checks whether it is tagged with specific content, and enforces policies related to the specific content.

In another embodiment, a different structure using the present invention as a content firewall is described. A schematic diagram of a computer network in which the method and system may be used is shown in fig. 5. Only one reconfigurable firewall electronic device 50, for example a local computing device that may take the form of a dedicated reconfigurable firewall electronic device, is directly connected to an external network 400, such as the internet. The remaining local computing devices 410 are not directly connected to the external network 400; they are grouped in a network environment and reach the external network 400 only through their connection to the reconfigurable firewall electronic device 50. The external network may be any available network. The purpose of the content firewall, represented by the reconfigurable firewall electronic device 50, is to protect the network environment containing the remaining local computing devices 410 from attacks originating from locations and/or devices in the external network. The reconfigurable firewall electronic device 50 either contains a local copy of the metadata repository or uses a high-speed secure network connection to the central infrastructure 100, which is part of the internal network. This allows fast queries of the metadata repository. In operation, the reconfigurable firewall electronic device 50, functioning as a content firewall, performs the following operations. A hash value of the incoming file, incoming message or incoming data frame is calculated. The calculated hash value is then compared with the metadata repository, either stored locally or reached over the high-speed secure network, to determine whether the incoming file, message or data frame is new. It is further checked whether the file, message or data frame is tagged as having specific content. A policy related to that specific content is then enforced.
The policy may be to pass the item through to its final destination, drop it, log it, place it in quarantine, etc. The system requires sufficient CPU computing power so as not to significantly slow down network traffic.
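
The sequence of operations performed by the content firewall can be illustrated with a minimal Python sketch. It is not part of the described system: the use of SHA-256 as the hash function, the names `Action`, `POLICY` and `filter_payload`, the example content tags, and the choice to quarantine content that is new are all assumptions made for illustration.

```python
import hashlib
from enum import Enum


class Action(Enum):
    PASS = "pass"
    DROP = "drop"
    LOG = "log"
    QUARANTINE = "quarantine"


# Hypothetical policy table: content tag -> action to enforce.
POLICY = {
    "clean": Action.PASS,
    "virus": Action.DROP,
    "adult": Action.QUARANTINE,
}


def filter_payload(payload, metadata):
    """Hash an incoming file, message or frame and choose an action.

    metadata maps hash values to content tags (a stand-in for the
    metadata repository, held locally or reached over a fast link).
    """
    digest = hashlib.sha256(payload).hexdigest()
    tag = metadata.get(digest)  # None means the payload is new
    if tag is None:
        # Assumption: new content is held back until it has been identified.
        return digest, Action.QUARANTINE
    # Tags without an explicit policy entry are only logged (assumption).
    return digest, POLICY.get(tag, Action.LOG)
```

Because the lookup is a single dictionary access per payload, most of the cost in such a sketch lies in the hash calculation, which is consistent with the remark above that the firewall device needs sufficient CPU power.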

This is a very secure and manageable setup, provided that none of the local computing devices connected to the network is equipped with a removable-media device, i.e. a device that allows unscanned content to be opened or executed on the computing device.

In another embodiment of the present invention, a similar architecture is provided in which the present invention is used as a content firewall in promiscuous mode. The content firewall then looks at all traffic passing by, performs the hashing and comparison functions, and contacts the local agent to enforce policies. The advantage of this approach is that there is no single point of failure and no longer a bottleneck; furthermore, no resources are used on the local computing device to compute the hash value. In addition, no bandwidth is used to contact the central metadata repository. The drawback is that the local agent needs to be installed on all computing devices of the internal network.
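
The division of work in promiscuous mode can be pictured with the following Python sketch. It is illustrative only: the classes `LocalAgent` and `PassiveMonitor`, the tag names, and the use of SHA-256 are assumptions, and the monitor is modelled as receiving copies of traffic rather than forwarding it.

```python
import hashlib


class LocalAgent:
    """Hypothetical agent on a local computing device; it records what it is told to enforce."""

    def __init__(self):
        self.enforced = []

    def enforce(self, digest, tag):
        self.enforced.append((digest, tag))


class PassiveMonitor:
    """Observes copies of traffic; it never sits in the delivery path."""

    def __init__(self, metadata, agents):
        self.metadata = metadata  # local copy: hash value -> content tag
        self.agents = agents      # device name -> LocalAgent

    def observe(self, destination, payload):
        # Hashing happens here, not on the destination device.
        digest = hashlib.sha256(payload).hexdigest()
        tag = self.metadata.get(digest)
        if tag is not None and tag != "clean":
            # Enforcement is delegated to the agent on the receiving device.
            self.agents[destination].enforce(digest, tag)
        return digest
```

Since delivery does not depend on `observe` returning, a failure of the monitor in this sketch would leave traffic flowing, which mirrors the absence of a single point of failure noted above.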

The methods and systems described in the various embodiments may also include the step of identifying or reporting secondary information regarding the presence of viruses or malicious data. From the information provided in the metadata repository 110, the local computing device 50 through which viruses or malicious data entered the network can be identified. This may be based, for example, on information about the path and the date of modification or creation. Further, based on information provided in the metadata repository 110, such as the file type, more information about how the virus works can be obtained. The metadata repository also allows identification of how viruses or malicious data spread across the network. The information thus obtained may be stored and/or used to further improve the security of the network. If this information is stored for many events, an overall analysis, such as a statistical analysis, may be performed to indicate weaknesses in the security of the network, i.e. to indicate the local computing devices 50 that are vulnerable to viruses or malicious data. This can be done automatically. Adjusted security measures may then be taken, such as performing a regular full check of the local computing device 50, or giving the local computing device 50 only limited access to external sources, such as the internet.
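
As an illustration of this kind of analysis, the Python sketch below finds the device on which a file first appeared and counts such events per device. The record layout `Sighting` and the function names are hypothetical; the only assumption about the metadata repository is that it records, per file, the reference value, the device, the path and a creation date.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    digest: str   # reference value of the file
    device: str   # local computing device holding the file
    path: str
    created: str  # ISO 8601 date, so string order equals time order


def first_introducer(sightings, digest):
    """Return the device with the earliest sighting of the digest, or None."""
    hits = [s for s in sightings if s.digest == digest]
    return min(hits, key=lambda s: s.created).device if hits else None


def incidents_per_device(sightings, bad_digests):
    """Count, per device, how many unwanted files appeared there first."""
    return Counter(
        dev
        for d in bad_digests
        if (dev := first_introducer(sightings, d)) is not None
    )
```

Devices with a high count in the result would be candidates for the adjusted security measures mentioned above.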

The information held in the metadata repository may be used for recovery purposes, so that when the local computing device 50 fails, all necessary information, such as file paths, may be obtained from the metadata repository. When the local computing device 50 or a portion of it can no longer be accessed, at least a portion of the lost information can be recovered from the information in the metadata repository, files stored on the central infrastructure, and/or files stored elsewhere in the network.
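
A recovery plan of this kind could be derived as in the following Python sketch. The function `recovery_plan`, the row layout `(device, path, digest)` and the `"central"` marker are assumptions introduced for illustration.

```python
def recovery_plan(device, file_index, central_store):
    """Map each path of a failed device to a place where a copy survives.

    file_index:    list of (device, path, digest) rows from the metadata repository
    central_store: set of digests for which a copy is held on the central infrastructure
    A value of None means no surviving copy is known.
    """
    # Other devices that hold a file with the same reference value.
    holders = {}
    for dev, _path, digest in file_index:
        if dev != device:
            holders.setdefault(digest, dev)

    plan = {}
    for dev, path, digest in file_index:
        if dev != device:
            continue
        if digest in central_store:
            plan[path] = "central"
        else:
            plan[path] = holders.get(digest)
    return plan
```

The sketch prefers the central copy and falls back to another device holding an identical file, which matches the order in which the sources are listed above but is otherwise a design choice of the example.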

According to the above-described embodiments, the invention comprises a computer program product providing the functionality of any of the methods according to the invention when executed on a computing device. Furthermore, the invention comprises a data carrier, such as a CD-ROM or diskette, which stores the computer program product in machine-readable form and which, when executed on a computing device, performs at least one of the methods of the invention. Nowadays, such software is typically provided over the Internet, and thus the present invention also includes transmitting the computer program product according to the present invention over a local area network or a wide area network.

Claims

1. A method for identifying content of a file in a network environment, the network environment including at least one local computing device linked with a remainder of the network environment including a central infrastructure, the method comprising:

-calculating a reference value for a new file on one of the at least one local computing device using a one-way function,

-transmitting the calculated reference value to the central infrastructure,

-comparing said calculated reference value with reference values previously stored in the remaining part of said network environment,

-after said comparing,

-if said calculated reference value and a previously stored reference value are found to match, determining that the content of said new file has been identified and retrieving a corresponding content attribute, said content attribute identifying the type of content enclosed by the file; or

-if no match is found between the calculated reference value and any previously stored reference value, determining that the content of the new file has not been identified and subsequently sharing the new file on the local computing device to the central infrastructure, the central infrastructure identifying the content of the new file by remotely identifying the content via the network environment, determining content attributes corresponding to the content of the new file, and storing a copy of the content attributes,

-after the determination, triggering an operation on the local computing device in accordance with the content attribute;

wherein triggering an operation on the local computing device in accordance with the content attributes comprises: replacing the new file on the local computing device with another version of the new file restored from the remainder of the network environment.

2. The method of claim 1, wherein triggering an operation on the local computing device based on the content attributes is performed after transferring the content attributes corresponding to the new file to the local computing device.

3. A method as claimed in any preceding claim, wherein said identifying the content of the new file comprises:

one or more of scanning for viruses, scanning for adult content, scanning for self-promotional advertising messages, and scanning for copyrighted information, using a scanning device installed on the central infrastructure.

4. The method of claim 1, further comprising storing a copy of the new file on a central infrastructure.

5. A system for identifying the content of a file in a network environment, the network environment including at least one local computing device linked to a remainder of the network environment including a central infrastructure, the remainder including a stored database, the system comprising:

-means for calculating a reference value for a new file on the local computing device using a one-way function,

-means for transmitting the calculated reference value to the central infrastructure,

-means for comparing said calculated reference value with a previously stored reference value from a database,

the system further comprises:

-means for determining whether the content of the new file has been identified on the basis of a comparison of said calculated reference value and a reference value previously stored in said remaining portion,

-means for sharing a new file on the local computing device to the central infrastructure,

-means, located on said central infrastructure, for remotely identifying the content of a new file, if not already identified, over said network, for assigning a content attribute identifying the type of content enclosed by the file, and means for storing said content attribute in said remaining part of said database, and

-means for triggering an operation on the local computing device according to a content attribute of the new file.

6. The system of claim 5, further comprising means for storing a copy of the new file in said remaining portion of said database.

7. A method for changing a system for identifying file content in a network environment, the network environment including means for computing a one-way function, at least one local computing device linked to a remainder of the network environment including a central infrastructure, the remainder including a stored database, and means for identifying content, the method comprising:

-changing the means for identifying content or the means for computing a one-way function,

-scanning the rest of the network environment for a reference value calculated with a one-way function,

-for each of said reference values,

-requesting a file corresponding to the reference value from the network environment,

-identifying the content of said file, determining content attributes corresponding to the content of the file, and storing a copy of said content attributes, said content attributes identifying the type of content enclosed by the file,

-sending the content attributes to each local computing device containing the file,

-after sending, triggering an operation on the local computing device in accordance with the content attributes.

8. The method of claim 7, wherein said scanning the remainder of the network environment for reference values computed using a one-way function comprises:

scanning the remainder of the network environment for a reference value calculated using a one-way function, wherein the reference value is generated after a predetermined date.

9. The method of claim 7 or 8, wherein the method further comprises:

for each of said reference values, a file is sent to said means for identifying content.

10. The method of claim 7 or 8, wherein the method further comprises:

for each of the reference values, sharing the file to the means for identifying content and remotely identifying the content of the file over the network.

11. The method of claim 7, wherein said sending the content attributes to each local computing device containing the file comprises:

-identifying, using a stored database, each local computing device containing the file,

-sending the content property to the identified local computing device.

12. The method of claim 11, wherein sending the content attributes to the identified local computing device comprises:

creating an entry in a waitlist for each of the identified local computing devices that is not connected to the network, and sending content attributes to the identified local computing devices according to the entries on the waitlist when the local computing devices reconnect to the network.

13. The method of claim 7, wherein if no local computing device having a file corresponding to the reference value is connected to the network, requesting the file corresponding to the reference value from the network environment comprises: creating an entry in a waiting list according to which a file corresponding to the reference value is requested from the local computing device when the local computing device reconnects to the network.

14. The method of claim 7, wherein the method further comprises:

identifying whether the content attributes correspond to unwanted content and, if so, identifying a local computing device that introduced the unwanted content into the network first, based on data stored in the database.

15. The method of claim 7, wherein triggering an operation on the local computing device based on the content attributes comprises: replacing the new file on the local computing device with another version of the new file restored from the remainder of the network environment.
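
By way of illustration only, and without limiting the claims, the update method of claims 7 and 11 to 13 can be sketched in Python as follows. The function `reidentify`, the callables `fetch`, `identify` and `notify`, and the tuple layout of the waiting-list entries are hypothetical names chosen for this sketch.

```python
def reidentify(index, online, fetch, identify, notify):
    """Re-run content identification for every stored reference value.

    index:  dict mapping reference value -> set of devices holding the file
    online: set of devices currently connected to the network
    fetch(device, ref), identify(data), notify(device, ref, attr):
        hypothetical callables supplied by the caller
    Returns (attributes, waitlist).
    """
    attributes, waitlist = {}, []
    for ref, devices in index.items():
        sources = sorted(devices & online)
        if not sources:
            # No holder is reachable: queue the file request until one reconnects.
            waitlist.append(("request", ref, None))
            continue
        attr = identify(fetch(sources[0], ref))
        attributes[ref] = attr
        for dev in sorted(devices):
            if dev in online:
                notify(dev, ref, attr)
            else:
                # Offline holder: deliver the attribute when it reconnects.
                waitlist.append(("send", ref, dev))
    return attributes, waitlist
```

In this sketch the waiting list holds two kinds of entry, one for files that could not yet be requested and one for content attributes that could not yet be delivered, so that both can be processed when the devices concerned reconnect.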