CN1294514C - Efficient computer file backup system and method

Info

Publication number: CN1294514C
Application number: CNB028161971A
Authority: CN
Inventors: K·德斯皮格莱尔
Original assignee: Datact Technologies N V
Current assignee: Gen Digital Inc
Priority date: 2001-08-20
Filing date: 2002-03-08
Publication date: 2007-01-10
Anticipated expiration: 2022-03-08
Also published as: AU2002304842A1; HK1069651A1; US7752171B2; WO2003019412A3; WO2003019412A2; CN1543617A; EP1419457A2; EP1419457B1; US20040236803A1; JP4446738B2; US7254596B2; JP2005501342A; US20080034021A1

Abstract

A system and method for efficiently backing up and restoring computer files to a central storage system. A hashing key is computed for each file to be backed up on a target computer. The hashing key is compared to a list of hashing keys stored locally to see whether the local file has previously been backed up. If the hashing key is not listed locally, it is compared to a list of hashing keys of centrally backed-up files. The file is backed up only if the hashing key is present in neither the local nor the central list. Backed-up files may be renamed to their hashing key for further efficiencies.

Description

Efficient computer file backup system and method

The present invention relates generally to a method for backing up and restoring data files and programs on a computer system. More particularly, it relates to an efficient method for determining whether a file or program has previously been backed up, or whether a backup copy of that file already exists, and then backing up only those files and programs that have not previously been backed up and for which no backup copy exists. The system and method thereby enable efficient use of bandwidth for local or remote backup of the files of computers and/or computer systems.

Traditional methods for backing up computer programs and data files often consume large amounts of expensive network bandwidth and excessive processor (CPU) time. Currently, many backup processes back up the entire program and data repository of a computer or computer system, which duplicates backed-up files and programs and requires large amounts of network bandwidth and excessive storage media (such as tape or compact disc (CD)).

The networks of many organizations often include data centers ("server farms") for storing and managing large amounts of Internet-accessible data. Data centers often include several computer systems, such as Internet servers, employee workstations, file servers, and so on. Such data centers often have scalability problems with traditional backup systems: the available bandwidth and storage are not sufficient for large-scale backup of a data center environment. A system that is scalable and able to grow with the organization would be beneficial.

Some savings in bandwidth and storage media can be achieved with incremental backup methods, which back up only those files that have changed or been updated. However, these methods do not solve the problem that duplicate files residing on different computers on one network, or even on different networks, are often still backed up in duplicate, consuming large amounts of storage media.

For example, data files are often shared among many people, and duplicate copies reside on many different computers, resulting in many copies of the same file across one or more computer networks. Further, computers often use duplicate program and data files to run operating systems and applications. For example, in a network running Microsoft Windows®, each computer may have duplicate operating system files and programs. Backing up an entire network using traditional methods can result in many duplicate backups of those files and programs, causing excessive waste of storage media. A means of eliminating the duplication of backed-up files and programs would be desirable, with the potential benefit of more efficient use of storage media, processing time, and network bandwidth.

Further, traditional backup methods implemented by organizations often use many computer servers to perform the backup, often to tape media, which results in distributed storage of the data backups as well as duplication and waste of both media and processor time.

Still further, a distributed backup process typically entails the need to store many backup tapes, or other similar backup media, and requires a method of keeping track of the multiple media. Such a system is often difficult to restore, especially if incremental backup procedures are used. The correct storage media must be located and loaded in the correct order. Tape restoration is a tedious, time-consuming process. Often the restore process is so inefficient and error-prone that it fails, resulting in data loss and even loss of productivity, because programs must be reinstalled and data must be rebuilt. Organizations using computer systems would benefit from a more efficient and easier-to-use backup system that leads to a more effective and more easily implemented restore process.

The present invention relates to improvements in backup technology. More specifically, it creates a solution for large-scale server backup in Internet data center and enterprise data center environments, and thereby provides a solution for disaster recovery and data protection.

The present invention is an improved system and method that uses hash keys of file content for more efficient and more effective backup of computer files and computer programs.

The first step in the process is to scan the file system on the target machine (the computer system to be backed up) and create a hash key, a unique digital code, for each file to be backed up. In a preferred embodiment, to reduce processing time, hash keys are created only for files whose modification date attribute is more recent than the last backup.

The resulting hash keys are stored in a local database (a database on the target computer, for example) for further comparison in the current and in future backup sessions. The local database also includes the full path of each backed-up file.

The stored hash keys are checked against the previous hash key entries in the local database. In this way, the hash keys are used to check each local file to determine whether it was previously backed up on the target system. Hash keys not found in the local database key list are passed to the next step of the process.

The hash keys not found in the local hash key database are checked against the hash keys of the files stored on the central storage server. This check determines whether a particular file already exists on the central storage server. The file may exist there as a backup from another server or system, or as the result of a previous backup operation.

The decision whether to back up is made file by file rather than, for instance, block by block. This strongly reduces the number of comparisons and the size of the local database, and is very well suited to server farms, where not only data blocks but often complete files are duplicated across multiple servers.

Brief description of the drawings

Figure 1 is a block diagram showing the main steps of a backup process according to an aspect of the present invention;

Figure 2 is a block diagram showing the backup decision-making process according to an aspect of the present invention;

Figure 3 is a block diagram showing one implementation of a system for carrying out the method of the present invention;

Figure 4 is a block diagram showing a more detailed implementation of the backup subsystem of the present invention.

Traditionally, whether an incremental or a full backup of a computer, server, or system is performed, backup solutions greatly increase network traffic and can use enormous storage capacity. The present invention uses content hash keys to make intelligent decisions about whether to back up certain data, and uses central storage capacity to provide more efficient and more effective backup storage and restore activities.

The present invention is a system and method that uses hash keys of file content for more efficient and more effective backup of computer files and computer programs. In this description, the terms "file", "program", "computer file", "computer program", "data file" and "data" are used interchangeably, and, depending on the context, the use of any one of them may imply the others.

The present invention uses a hashing mechanism to check whether a file is unique within the backup system. Only files that are unique and not yet backed up are stored on the central storage system, which makes efficient use of network bandwidth and storage media. The process matches newly created content keys against all previously generated hash keys (using local and/or central lists) to make backup decisions, which yields a holistic approach to backup and makes the restore function more effective and less cumbersome. By reducing duplication in both network traffic and backup file storage, the resulting method has minimal bandwidth consumption and minimal storage capacity usage. This is especially useful for backing up operating system files and commonly used applications.

Figure 1 provides an overview of the method for one implementation of the backup process according to the invention. The first step in the process, shown by box 10, is to scan the file system on the target computer/system (the individual computer or computer system to be backed up) and to create a content hash key, for example in 32 or 64 byte mode, as shown in box 12. The hash key is a unique digital code for each file to be backed up: it is unique for each unique file, and it is the same for identical copies of a file. In this way, the hash key becomes a unique identifier for the file and any identical copies of it. Therefore, if two files have the same hash code, they are identical and can and will be treated as such. An industry-standard hashing process such as MD5 can be used.
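As a minimal sketch of the content hashing step, the Python function below computes an MD5 digest over a file's bytes, so identical copies yield the same key regardless of their name or location. The function name `content_hash_key` and the chunk size are illustrative assumptions and do not come from the patent text.

```python
import hashlib


def content_hash_key(path, chunk_size=65536):
    """Return the MD5 hex digest of a file's contents.

    The file is read in chunks so that large files do not have to fit
    in memory. `chunk_size` is an arbitrary illustrative value.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

MD5 is the algorithm the description names. It is no longer considered collision-resistant, so an implementation written today might substitute `hashlib.sha256` while keeping the same structure.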

The resulting hash keys are stored in local database 404 (Figure 3) for further comparison in the current and in future backup sessions. This is represented by box 14 in Figure 1. The path and/or filename of the file corresponding to each hash key is stored together with the hash key.

One improvement to this process is to append the hash key to the computer file itself. Files that have already been hashed can then bypass the hashing process, which saves further computer processing. However, not all files can have a key appended in this way, so this improvement may not be feasible for all computer file types.

The stored hash keys are checked against the previous hash key entries in the local database 404, as indicated by box 16 in Figure 1. In this way, the hash keys are used to check whether each local file has previously been backed up on the target system. Hash keys not found in the local database are passed to the next step of the process. Only those files that have not been recorded as recently backed up, or at least recently processed, need to undergo further processing, which makes efficient use of computer resources.

The hash keys not found in the local hash key database are now checked against the files stored in the central database 408, as shown in box 18 in Figure 1. The path and/or filename of the file corresponding to each hash key is stored together with each hash key stored in the local database. The hash key is used to determine whether the corresponding file already exists on the central storage server 400 and therefore does not need to be backed up. The file may exist there as a backup from a different target computer 300 or even from a different target network. The principle is to store a single copy of each unique file within the central storage system, no matter how many different target computers contain that same, identical file.

If the central database contains no match for a given hash key, that hash key is added to the central database 408, and the corresponding file is uploaded (box 20 in Figure 1) to the central storage system 400 (box 22), which manages the files and the hash key list. The server can keep a record of the process (see log archive box 22a). If desired, the files to be archived are encrypted for security reasons (box 24) and compressed to reduce storage media requirements (box 28). For example, an encryption key can be generated by taking the hash key and transforming it with a known but secure algorithm.
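The optional key derivation and compression steps might be sketched as follows. The description does not name the transformation or the compression method, so HMAC-SHA-256 with a site secret and zlib are assumptions chosen for illustration, as are all function and parameter names. The cipher itself is omitted because the text does not specify one.

```python
import hashlib
import hmac
import zlib


def derive_encryption_key(hash_key: str, site_secret: bytes) -> bytes:
    """Derive a 32-byte encryption key from a file's hash key.

    HMAC-SHA-256 is an assumed stand-in for the "known but secure
    algorithm" mentioned in the text; `site_secret` is hypothetical.
    """
    return hmac.new(site_secret, hash_key.encode("ascii"), hashlib.sha256).digest()


def compress_for_storage(data: bytes) -> bytes:
    """Compress file contents before archiving (zlib is an assumption)."""
    return zlib.compress(data, 9)


def decompress_from_storage(blob: bytes) -> bytes:
    """Reverse compress_for_storage during a restore."""
    return zlib.decompress(blob)
```

Because the key is derived from the hash key, identical files produce identical keys, which is consistent with storing a single copy of each unique file.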

Finally, a dispatching process is performed (box 30 in Figure 1). Based on the hash key, the dispatching process decides to which location the file is to be dispatched and on which storage device (32a, 32b, 32c, 32d...32n) it should be stored. The storage devices may be centrally located to increase efficiency, but the invention can also use distributed or even remotely located devices. The hash key can be used to dispatch files to different locations in the storage network.
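One simple way to dispatch (schedule) a file by its hash key is to reduce the key modulo the number of devices. The text does not say how the decision is made, so this rule and the name `pick_storage_device` are assumptions for illustration only.

```python
def pick_storage_device(hash_key: str, devices: list) -> str:
    """Choose a storage device deterministically from a hex hash key.

    Modulo placement is an assumed policy; the source only states that
    the decision is based on the hash key.
    """
    if not devices:
        raise ValueError("at least one storage device is required")
    # Interpret the hex digest as an integer and map it onto the device list.
    return devices[int(hash_key, 16) % len(devices)]
```

The same key always maps to the same device, so a restore can locate a file from its key alone. A drawback of plain modulo placement is that changing the number of devices remaps most keys.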

In a preferred embodiment, the stored files are renamed using the hash key as the file name. This makes file retrieval simpler and faster. On restore, the original file name is recovered by cross-referencing the hash key with the file name and/or file path on the machine being restored.
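Renaming stored files to their hash key amounts to a content-addressed store. The sketch below, under the assumption that the store is a plain directory and the path-to-key cross-reference is a dictionary, shows both directions. `store_by_hash`, `restore_by_path`, and `catalog` are hypothetical names.

```python
import hashlib
import shutil
from pathlib import Path


def store_by_hash(src, store_dir, catalog):
    """Copy `src` into `store_dir` under its MD5 key and record the mapping."""
    src = Path(src)
    key = hashlib.md5(src.read_bytes()).hexdigest()
    target = Path(store_dir) / key
    if not target.exists():
        # Identical content is stored only once, whatever its original name.
        shutil.copyfile(src, target)
    catalog[str(src)] = key
    return key


def restore_by_path(original_path, store_dir, catalog):
    """Look up the key for a path and copy the stored file back to it."""
    key = catalog[str(original_path)]
    shutil.copyfile(Path(store_dir) / key, original_path)
```

Here `catalog` plays the role of the records that pair each hash key with a full path, and the directory listing of `store_dir` plays the role of the central key list.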

The flowchart in Figure 2 shows the file backup decision-making process in more detail. The local scan is shown by the steps in box 100. Files are scanned in step 102, and a hash key is formed in step 104. In a preferred embodiment, hash keys are computed only for files whose modification or creation date attribute is more recent than the date of the last backup. Each hash key is compared with the locally stored list of hash keys in the local database 404. The local database 404 contains a record for each file that has previously been backed up, and each record includes the hash key and the full path and name of the file (step 106). Files with a match are not backed up (step 110), while files whose hash keys do not match the local list (step 106) require further processing (the steps in box 200). At least for each non-matching file, a new record containing the hash key and the full path and name of the file is stored in the local database. The hash keys of the non-matching files are collected for forwarding (step 108) and forwarded for comparison with the centrally stored key list in central database 408 (step 202). If a key matches a previously centrally stored hash key (step 204), the file is not backed up (step 210). Only when there is no match (step 204) is the file backed up. The hash key is then stored in the central database 408, and the file may undergo the processing described above (that is, encryption and compression) before being backed up or archived to storage.
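The two-stage decision of Figure 2 can be sketched in a few lines. This is an illustration under simplifying assumptions: the local database is a dictionary from path to hash key, the central key list is a set, and the name `decide_backups` is hypothetical.

```python
def decide_backups(scanned, local_db, central_keys):
    """Return (to_upload, skipped) for a scan result.

    scanned:      dict mapping file path -> hash key (steps 102-104)
    local_db:     dict mapping file path -> hash key of earlier backups
    central_keys: set of hash keys already held centrally
    Both local_db and central_keys are updated in place.
    """
    to_upload, skipped, pending = [], [], []
    known_local = set(local_db.values())

    # Local comparison (box 100): a locally known key means no backup is needed.
    for path, key in scanned.items():
        if key in known_local:
            skipped.append(path)           # step 110
        else:
            known_local.add(key)
            pending.append((path, key))    # collected for forwarding, step 108
        local_db[path] = key               # record the key with the full path

    # Central comparison (box 200): only unknown keys lead to an upload.
    for path, key in pending:
        if key in central_keys:
            skipped.append(path)           # step 210
        else:
            central_keys.add(key)
            to_upload.append(path)
    return to_upload, skipped
```

For instance, with a file already known locally, one already held centrally, and two new files with identical content, only one of the two new files is uploaded.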

The above process can be improved further by keeping historical copies of the files and historical copies of the hash lists 404, 408, so that any individual machine can be restored to the state of its file system at a given moment in the past. Implementing this improvement obviously requires additional storage media in the central storage system 400 in order to keep these "snapshots" at suitable points in time. The only limit on how far back one can go in the archived file system is the amount of storage dedicated to the task. Hence, if historical snapshots of computer file systems are not wanted for a particular implementation, capital expense can be saved by not implementing this feature of the invention.

Restoring files with the system is essentially done by reversing the process. Because each target computer 300 or system has a local database 404 that contains records of the hash keys of the processed files, those hash keys can be used to identify the files that need to be restored on the target computer 300, at the paths indicated in the records. A backup copy of the local database should also be stored on a different machine, or even backed up centrally, so that the list of hash keys and corresponding paths is available for rebuilding the file system of a crashed machine.

The system restores the file system of the crashed machine by restoring each file listed in the local computer's database 404; the files stored in the central storage system 400 correspond to their hash keys. Further, the local database 404 itself may be stored in the central storage system 400 in order to preserve a record of the state of the computer's file system, or this local database may be backed up in the central storage system 400.

Similarly, if this feature is implemented, restoring a computer system to a previous historical file system state only requires obtaining the local database for that moment and then restoring the file system's files according to that historical local database. The historical local databases can be stored locally, centrally, or preferably in both locations.

The hash code itself can be used to ensure the integrity of files during the backup and restore processes. Running the hashing process on the backed-up and/or restored file produces a hash code that can be compared with the original hash code. If the two keys are not identical, a file error is raised and the integrity of the file cannot be guaranteed. If they are identical, the integrity of the file is assured.
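The integrity check described above reduces to recomputing the hash and comparing it with the recorded key. A minimal sketch, assuming MD5 keys in hexadecimal form and a hypothetical function name:

```python
import hashlib


def verify_integrity(data: bytes, expected_key: str) -> bool:
    """Return True if the MD5 digest of `data` equals the recorded hash key."""
    return hashlib.md5(data).hexdigest() == expected_key.lower()
```

A caller would treat a `False` result as a file error and, for example, request the file again rather than accept the restored copy.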

Figure 3 shows a possible high-level overview of one implementation of a system for practicing the method according to the invention. The target computer or target system 300 is the system to be backed up. The backup agent 402 can run either on the target system or on a server of which the target system is a client. The backup agent can also be run remotely. The backup agent 402 implements the file scanning and hashing functions discussed above. It also uses the local database 404, which contains a record for each file that has previously been backed up, and performs the local comparison operation (box 100 in Figure 2) to determine whether the files on the target 300 have previously been backed up.

For greater efficiency, or to avoid overhead on the target computer, the backup agent 402 can run on a dedicated server that is optimized for this function. The backup agent 402 may also include the restore function, or a separate module can implement it. The backup agent 402 and/or the restore agent can use a World Wide Web (web) interface to allow remote management of file backup for the target system over a wide area network (WAN) such as the Internet, or locally over a local area network (LAN) or other network. Alternatively or in addition, the backup server 406 discussed below can be managed through the same or a similar web interface. This allows the backup and/or restore operations to be controlled remotely from wherever access to the agent 402 and/or the server 406 is available.

The central storage system 400 is used to implement the centralized backup functions, including the centralized comparison operation in box 200 of Figure 2. Although it is described as a centralized system, it will be understood that the functions and/or components described for this centralized system may be distributed or located remotely, depending on the desired implementation of the invention.

The backup and restore server 406 directs the centralized backup operation. The server 406 receives from the agent 402 a list of hash keys representing the files not listed in the local key list. The server 406 then compares this list of unmatched keys with the key list (of previously backed-up files) stored in the central hash key database 408. It will be understood that this database can, if desired, be stored on one or more of the storage devices 414 discussed below. If the file is not currently backed up on the central devices 414, there will be no match with the hash keys contained in the central key database 408, which means that the corresponding file needs to be backed up. In that case, the server 406 obtains the corresponding file from the agent 402, or alternatively the server may fetch the file itself. It renames the file to its hash key and forwards the renamed file to the encryption and compression module 410 (if encryption and/or compression is required), which implements the encryption and compression steps described above. It will be understood that the encryption and/or compression module can run on the server 406 or, if desired, on a separate computer/server.

The encrypted and compressed file is then forwarded to the file dispatcher 412, which directs the file to the appropriate storage device 414a, 414b...414n based on the hash key or on some other indicator of where the file should be stored. These storage devices 414 may be located centrally or distributed, as desired.

To restore a unique file, the target server 300 requests the hash key for that file from the local database (on the target server) and retrieves the file from the central storage server 406 under that name.

The centralized backup system 400 can be located remotely from the target system 300 or locally. The backup system 400 can be provided remotely by a service provider using an ASP or XSP business model, in which the central system is offered to paying clients who run the target system 300. Such a system can use a public WAN such as the Internet to provide network connectivity between the central system and the target clients. Alternatively, a private network (WAN, LAN, and so on) can connect the two systems. A virtual private network (VPN) over a public network can also be used. Furthermore, a client may wish to implement such a system locally in order to ensure local control and autonomy, especially where the information to be stored is particularly sensitive, valuable and/or proprietary. Where such considerations are not a priority, however, a more cost-effective service can be marketed in which the central system is provided by a service provider. In that case, Internet connectivity may be economical, and a web-based management system, as described above, would also be useful and is easily accommodated by the invention.

A system using the invention can be implemented with a self-service model, which allows client network administrators to back up and restore the client systems. In that case, the network administrator would access the service through an interface such as the web-based implementation described above. Alternatively, centralized management can be implemented to relieve clients of backup responsibilities. Such a system would be useful for IDC server farms and in combination with the DataCenter Technologies operating system. In addition, the system can make use of numerous other open standards, such as XML/SOAP, HTTP, and FTP.

Figure 4 shows a more detailed potential implementation of the backup subsystem from the system overview given in Figure 3, showing the various components of the client and of the system server. This figure corresponds to the more detailed description, given below, of one potential implementation of the method of the invention.

In this more detailed potential implementation of the system, a user accesses a GUI to configure a backup job with an attached schedule. The backup job contains a selection of the files/directories to be backed up, OS-specific backup options, and schedule options. When the backup is run manually or triggered by the schedule:

(I)文件系统扫描产生目标服务器300上现有的、并且将被作为“当前_备份”表存储在本地数据库404中的文件。为这一表中的每一文件，存储所述文件的位置、属性和最后修改时间。(1) A file system scan generates files that are existing on the target server 300 and will be stored in the local database 404 as a "current_backup" table. For each file in this table, store the file's location, attributes and last modification time.

(II)接下来，将所述表“当前_备份”与存储有先前备份历史的、数据库404中的表“先前_备份”相比较。比较结果会是已经改变了最后修改时间的文件。(II) Next, the table "current_backup" is compared with the table "previous_backup" in the database 404 which stores the previous backup history. The result of the comparison will be files that have changed their last modification time.

(III)产生所述改变文件的内容校验和、并将其存储在本地数据库404中的“当前_备份”表中。(III) Generate the content checksum of the changed file and store it in the “current_backup” table in the local database 404 .

(IV)然后对照在中央存储服务器400上的中央数据库408中物理地驻留的、校验和的全局库，校验这些校验和。这一校验的结果集合是遗漏的校验和的列表。(IV) These checksums are then verified against a global repository of checksums physically residing in the central database 408 on the central storage server 400 . The result set of this check is a list of missing checksums.

(V) These missing checksums represent the files that need to be transferred to the central storage server 400. Each file with a missing checksum goes through a backup procedure that includes data synchronization with the storage server, physical transfer of its content, compression, encryption, and integrity checks during the different phases, in order to guarantee successful receipt of the file.

(VI) When the file has been backed up successfully, it is marked as successfully backed up in the local database 404.

(VII) After the backup process, data synchronization between the client and the storage server 400 produces a central backup history for all target servers (clients).
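Steps (I) through (VI) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of SHA-256 as the content checksum, the in-memory dictionaries standing in for the local database 404 and the central database 408, and the treatment of new files as "changed" are all assumptions made for illustration. Encryption, integrity checks, and the synchronization of step (VII) are omitted.

```python
import hashlib
import os
import tempfile
import zlib


def scan(root):
    """Step I: record location and last modification time of every file."""
    table = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            table[path] = {"mtime": os.stat(path).st_mtime_ns, "checksum": None}
    return table


def changed_files(current, previous):
    """Step II: files that are new or whose modification time differs."""
    return [p for p, rec in current.items()
            if p not in previous or previous[p]["mtime"] != rec["mtime"]]


def content_checksum(path):
    """Step III: checksum of the file content (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_backup(root, previous, central_store):
    """Steps I-VI, with a dict standing in for the central storage server."""
    current = scan(root)
    # Unchanged files keep the checksum recorded in the previous history.
    for path, rec in current.items():
        if path in previous and previous[path]["mtime"] == rec["mtime"]:
            rec["checksum"] = previous[path]["checksum"]
    transferred = []
    for path in changed_files(current, previous):
        key = content_checksum(path)
        current[path]["checksum"] = key
        if key not in central_store:        # Step IV: checksum is missing
            with open(path, "rb") as fh:    # Step V: compress and transfer
                central_store[key] = zlib.compress(fh.read())
            transferred.append(path)
    # Step VI: the returned table serves as the next "previous_backup".
    return current, transferred


def demo():
    """Back up three files, two with identical content, then back up again."""
    with tempfile.TemporaryDirectory() as root:
        for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"other")]:
            with open(os.path.join(root, name), "wb") as fh:
                fh.write(data)
        central = {}
        history, sent = run_backup(root, {}, central)
        _history, sent_again = run_backup(root, history, central)
        return len(central), len(sent), len(sent_again)
```

In this sketch, two files with identical content produce the same checksum, so only one of them is transferred, and a second run over unchanged files transfers nothing.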

The restore process can be performed in several ways, depending on where the backup history is stored. By default, a restore is performed from the history stored in the local database 404. The operator selects a subset of a previous backup set of files. This list contains, for each file, the original location, the content key, and the file attributes. Based on this information, the agent can retrieve the file from the repository, decompress and decrypt the content, restore the file to its original location, and then restore the attributes of the restored file.
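The default restore path can be sketched as follows. This is an illustrative assumption rather than the agent's actual code: the record layout (`location`, `content_key`, `mode`, `mtime_ns`), the dictionary standing in for the repository, and the choice of permission bits and modification time as the restored attributes are all hypothetical, and decryption is omitted.

```python
import os
import zlib


def restore_file(record, repository):
    """Restore one file from a history record and a key-addressed repository."""
    # Fetch the stored content by its content key and decompress it.
    data = zlib.decompress(repository[record["content_key"]])
    # Write the content back to the original location.
    os.makedirs(os.path.dirname(record["location"]), exist_ok=True)
    with open(record["location"], "wb") as fh:
        fh.write(data)
    # Attributes are restored only after the content has been written.
    os.chmod(record["location"], record["mode"])
    os.utime(record["location"], ns=(record["mtime_ns"], record["mtime_ns"]))
    return record["location"]
```

Restoring attributes last matters because writing the content would otherwise overwrite the restored modification time.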

A second way to restore files is to obtain the backup history from a snapshot file. This is a plain-text file that is created during the backup process and contains a list of files. Alongside each file's original location, it stores the content key and the file attributes recorded during the backup. When such a file is supplied to the agent on a client computer, the agent can restore the files as described above.
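The text does not specify the snapshot file layout beyond "plain text". As a sketch, assume one line per file with tab-separated fields for original location, content key, and attributes; the field order, the separator, and the function names below are hypothetical.

```python
import io


def write_snapshot(records, stream):
    """Write one tab-separated line per backed-up file."""
    for rec in records:
        fields = [rec["location"], rec["content_key"], rec["attributes"]]
        stream.write("\t".join(fields) + "\n")


def read_snapshot(stream):
    """Parse a snapshot back into records an agent could restore from."""
    records = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        location, content_key, attributes = line.split("\t")
        records.append({"location": location,
                        "content_key": content_key,
                        "attributes": attributes})
    return records


def roundtrip(records):
    """Write records to an in-memory snapshot and read them back."""
    buffer = io.StringIO()
    write_snapshot(records, buffer)
    buffer.seek(0)
    return read_snapshot(buffer)
```

This simple layout would break on paths that contain tab or newline characters; a real format would need escaping or a structured encoding.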

A snapshot file can also be created from the backup history stored in the central database 408, which resides on the central storage server 400.

Claims

1. A method for determining whether a specific file on a target computer (300) should be backed up to a central storage system (400), the method comprising the steps of:

calculating a specific hash key from the content of the specific file;

verifying whether the specific hash key is already present in a local database (404), wherein, for each computer file on the target computer (300) that has previously been backed up, the local database contains a record, the record comprising:

a file hash key calculated from the computer file; and

a local file path on the target computer to which the computer file should be restored, the file path being associated with the file hash key;

if the specific hash key is not present in the local database, backing up the specific file by performing the following steps:

A. creating a backup file, the backup file being a copy of the specific file;

B. renaming the backup file to the specific hash key;

C. storing the renamed backup file in the central storage system (400); and

D. storing a new record in the local database (404), the new record comprising the specific hash key and the specific path on the target computer (300) to which the specific file should be restored; and

if the specific hash key is present in the local database (404), not backing up the specific file to the central storage system (400).

2. the method for claim 1 further comprises step:

Verify that described specific hash key is whether Already in according at least one central database (408) that has been backed up in the file hash key that the computer documents in the described centralized storage system (400) derives; And

And if only if, and described specific hash key is not present in described at least one central database (408) of described centralized storage system (400), just described specific file is backed up.

3. The method of claim 2, wherein the target computer (300) is connected to a LAN, and wherein the central storage system (400) is connected to the LAN via a WAN.

4. The method of any one of claims 1 to 3, wherein a plurality of target computers (300) are connected to the central storage system (400), and wherein the specific file is not backed up if it is already present in the central storage system as a result of a backup from any of the target computers.

5. The method of claim 4, wherein the location of the renamed backup file in the central storage system (400) depends on the specific hash key.

6. The method of claim 5, wherein the central storage system comprises a plurality of storage devices (414a, 414b, 414n).

7. A method for restoring a specific file to a target computer (300), the method comprising the steps of:

requesting, from a local database (404) that stores a previously calculated hash key for each computer file that has been backed up, the specific hash key corresponding to the specific file;

requesting, from the local database (404), the specific path location associated with the specific hash key;

using the specific hash key to retrieve a backup file from a central storage server (400), the backup file being a copy of the specific file; and

saving the backup file to the specific path location on the target computer (300),

wherein the name under which the backup file is stored on the central storage server depends on the hash key.

8. The method of claim 7, wherein the location at which the backup file is stored on the central storage server depends on the hash key.

9. A system for backing up a specific file on a target computer (300), comprising:

means for calculating a specific hash key from the content of the specific file;

means for verifying whether the specific hash key is already present in a local database (404), wherein, for each computer file on the target computer (300) that has previously been backed up, the local database contains a record, the record comprising:

a file hash key calculated from the computer file; and

means for backing up the specific file if the specific hash key is not present in the local database, the backing up comprising the following steps:

B. renaming the backup file to the specific hash key;

D. storing a new record in the local database (404), the new record comprising the specific hash key and the specific path on the target server to which the specific file should be restored;

wherein the specific file is not backed up if the specific hash key is present in the local database (404).

10. The system of claim 9, wherein the system is a central storage system and further comprises:

means for verifying whether the specific hash key is already present in at least one central database (408) of the central storage system (400), the central database (408) comprising file hash keys derived from computer files that have been backed up in the central storage system, wherein the means for backing up is operated only if the specific hash key is not present in the at least one central database.

11. The system of claim 10, wherein the target computer is connected to a LAN, and wherein the central storage system is connected to the LAN via a WAN.

12. The system of any one of claims 9 to 11, wherein a plurality of target computers are connected to the central storage system, and wherein the specific file is not backed up if it is already present in the central storage system as a result of a backup from any of the target computers.

13. The system of claim 12, wherein the location of the renamed backup file in the central storage system (400) depends on the specific hash key.

14. The system of any one of claims 9 to 11, further comprising:

means for restoring the specific file from the central storage system (400) to the target computer (300), comprising:

means for requesting, from the central database, the specific hash key corresponding to the specific file;

means for requesting, from one of the local database and the central database, the specific path location associated with the specific hash key;

means for using the specific hash key to retrieve the backup file from the system; and

means for saving the backup file to the specific path location on the target computer.

15. The system of claim 14, wherein, during the backup, if the specific hash key is not present in the local database (404), the backup file is renamed to the specific hash key before being stored in the central storage system, and wherein, during the restore, the backup file is renamed to the name of the specific file before being saved to the target computer.