WO2008118168A1 - Method and system for backup storage of electronic data - Google Patents
Method and system for backup storage of electronic data Download PDFInfo
- Publication number
- WO2008118168A1 WO2008118168A1 PCT/US2007/064810 US2007064810W WO2008118168A1 WO 2008118168 A1 WO2008118168 A1 WO 2008118168A1 US 2007064810 W US2007064810 W US 2007064810W WO 2008118168 A1 WO2008118168 A1 WO 2008118168A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- data file
- stored
- file
- storage
- Prior art date
Links
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1458—Management of the backup or restore process
- G06F11/1464—Management of the backup or restore process for networked environments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/08—Network architectures or network communication protocols for network security for authentication of entities
- H04L63/0823—Network architectures or network communication protocols for network security for authentication of entities using certificates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/875—Monitoring of systems including the internet
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/18—Error detection or correction; Testing, e.g. of drop-outs
- G11B20/1833—Error detection or correction; Testing, e.g. of drop-outs by adding special lists or symbols to the coded information
Definitions
- the present Invention relates generally to storage of electronic data, and more particularly to guaranteeing backup storage of a data file and/or backup storing the data file such that it may be recreated even where part of the data file is lost.
- insurance policies that provide data "loss protection” typically cover only the cost of data recovery and business interruption. That is, when data is lost, an insurance policy may cover the time and expense of retrieving the data from electronic backup facilities, creating electronic data from paper archives, restoring damaged electronic storage media, and the cost of lost computing resources during recovery. However, insurance policies typically will not cover the value of lost data that cannot be recovered. Those policies that apparently cover the value of unrecoverable data have such high premiums and/or restrictive claim payments that these policies do not offer meaningful insurance for unrecoverable data.
- a data owner may lose on site data and request retrieval of such data from the off site storage. Where the off site facility is unable to provide the requested data, the data owner has no recourse but to simply accept the data as lost. While the data owner may switch to a different data backup provider, this does not compensate the data owner for costs associated with the lost data.
- one object of the present invention is to address the above and/or other problems relating to backup storage of electronic data.
- Another object of the present invention is to guarantee backup storage of electronic data.
- Still another object of the present invention is to provide a method and system for insuring against data loss.
- Yet another object of the present invention is to increase reliability of backup data storage facilities.
- a further object of the present invention is to provide a method and system for backup storing a data file such that the data file can be reconstructed in substantially (or bit for bit) identical form despite loss of part of the data file.
- One embodiment of the invention includes a method for guaranteeing retrieval of stored data.
- the method includes receiving, from a sender, a data file being stored to be stored at a data custodian and creating a storage certificate associated with the data file being stored, the storage certificate including a partial representation of the data file being stored and an electronic signature of the data custodian.
- the data file being stored is stored in a storage unit at the data custodian, the storage certificate is returned to the sender to verify storage of the data file being stored at the data custodian.
- Another embodiment includes a method for backup storing data files.
- the method of this embodiment includes receiving, from a sender, a data file being stored to be stored at a data custodian, and creating N data chunks from the data file such that the data file can be recreated using ⁇ N of the data chunks.
- the N data chunks are in a plurality of storage locations.
- Figure 1 is a system for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention
- Figure 2 is a flow chart representative of a method for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention
- Figure 3 is a system for backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention
- Figure 4 is a flow chart representative of a method for backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention
- Figure 5 is a detailed flow chart representative of a method for receiving a data file from a client, guaranteeing storage of the data, and backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention
- Figure 6 is a detailed flow chart representative of a method for retrieving a data file guaranteed and stored according to the process shown in Figure 5, in accordance with an embodiment of the invention
- Figure 7 is a detailed flow chart representative of the packer step of Figure 5 in accordance with an embodiment of the present invention.
- Figure 8 is a detailed flow chart representative of the receiver to decoder step of
- Figure 9 illustrates a computer that may be used to implement the present invention. DETAILED DESCRIPTION OF THE INVENTION
- FIG. 1 a system for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention.
- the system of Figure 1 includes workstations 101, laptops 103, servers 105, local area network (LAN) server 107, firewall server 109, Internet 111, firewall server 113, offsite data vault 115 and mirrored site (or additional data vault) 117.
- LAN local area network
- FIG. 1 there could be additional backup storage facilities and additional copies of the data may be stored there or alternately more sophisticated parceling of data as illustrated in Figure 3 may be employed.
- workstations 101, laptops 103, servers 105, and local area network (LAN) server 107 make up the resources of a business (or household) that uses a backup data storage system in accordance with an embodiment of the present invention.
- Firewall server 113, offsite data vault 115 and mirrored site 117 make up the backup storage facility in the embodiment of Figure 1.
- Workstations 101, and laptops 103 are well known and are operated by users that generate and store data files that are essential to successful operation and survival of a business, for example.
- Servers 105 may be dedicated computer systems for providing various functions used during operation of the business.
- the workstations 101, laptops 103 and servers 105 are preferably programmed with any suitable Web browser software that permits these resources to retrieve Web pages via the Internet 111 from remote computers or servers such as the firewall server 113, offsite data vault 115 and mirrored site 117 in order to provide guaranteed data backup to users of these systems.
- the Web browser software may also be used to transmit information provided by the workstations 101, laptops 103 and servers 105 to remote computers such as the server 113, and storage devices 115 and 117.
- the workstations 101, laptops 103 and servers 105 could be equipped with freestanding software which emulates a browser to transmit archives or makes use of dedicated file servers for this purpose.
- File servers exist, which can create an illusion that storage resources which in fact reside somewhere else on the network, appear to reside locally.
- files which customers save from their local systems In the customary way actually are saved on such a file server and are automatically guaranteed in the amount predetermined for all files saved to that volume, folder, or directory of this file server.
- these systems could also use erasure codes, error correcting codes, clustering, and high availability technologies to make them as reliable as desired.
- a plurality of machines each having locally attached disks can present those disks as block devices via iSCSI or the network block device mechanism for example.
- Such a system can support the illusion to a client system that the disk resources on it appear to be directly attached to the client.
- a collection of client systems can then utilize those blocks, preferably so that each client has exclusive access to any given block device and those block devices accessed by a given client can be combined to form a RAID 5 volume.
- the failure of any given server system causes no failure of the client system.
- These client systems may then in turn export their file systems built on these block devices so that they become file servers over the network to clients of their own.
- another file server can obtain access to those blocks and recover that file system and resume service and client access to the file system also fails over.
- Software tools that enable this include Heartbeat, for example.
- the delimiter pair become ' ⁇ p> ! and ' ⁇ /p> ⁇
- the delimiter pair consists of ' ⁇ table> ! and ' ⁇ /table>'.
- Such delimiters enclose text which may in turn include other delimted text.
- the decorating or annotating with functions is done by including "attributes" so that we can include that information in the opening delimiter.
- our table has a width of 5 and a height of 3
- the closing delimiter ' ⁇ /table>' is unchanged.
- This mechanism allows for extremely rich data structures to be transported.
- the data types that occur in programming language can be expressed by XML.
- a web services description language file contains the information necessary to make the translation.
- software tools exist that can take an existing function or method and generate this WSDL file.
- Software tools also exist which can take this WSDL file and create a proxy function or method which supports the illusion that functionality that is in fact remote, is just an ordinary function or method running locally.
- the implementation language for the actual service need not be the same as the implementation language for the proxy object.
- These XML enabled software interfaces are called web services. In this way a programmer using their preferred programming language can make use of program code implemented in an entirely different language without knowing anything about that other language.
- workstations 101, laptops 103 and servers 105 are connected in a LAN by way of LAN server 107, which coordinates communication among business resources and provides other network functionality such as compression and/or encryption of data.
- the business resources are also connected to Internet 111 by way of firewall computer 109 to enable communication with external resources such as the data storage facility.
- firewall server 109 filters data entering and exiting the LAN to protect against unauthorized access to the business resources.
- Workstations 101, laptops 103, servers 105, local area network (LAN) server 107, and firewall server 109 communicate with each other and with external resources using any suitable protocol for communicating directly or via the Internet 111. More specifically, these devices are configured to interface with the firewall server 113, offsite data vault 115 and/or mirrored site 117 to store backup data by user command or automatically, to obtain storage certificates, to retrieve backup data and to make claims to the storage facility in accordance with embodiments of the present invention.
- Workstations 101, laptops 103, servers 105, local area network (LAN) server 107, and firewall server 109 may be implemented as the general purpose computer system 9001 of Figure 9, for example.
- Offsite storage vault 115 and mirrored site 117 are also connected to the Internet 111, and thus the business resources, via firewall 113.
- Firewall server 113 filters data entering and exiting the backup data storage system to protect against unauthorized access to backup data storage resources.
- the Internet 111 includes various networks and gateways for linking together various computer networks and computers such as those shown in Figure 1.
- Offsite data vault 115 and mirrored site 117 are storage devices that store electronic information including data files from business resources such as the workstations 101, laptops 103 and servers 105, for example.
- Firewall server 113, offsite data vault 115 and/or mirrored site 117 may be configured as a Web server programmed to receive, store, and/or transmit various types of information, including, requested backup data files from the business resources. These devices can also be configured to provide storage certificates and satisfy claims in accordance with embodiments of the present invention. [0039] It is to be understood that the system in Figure 1 is for exemplary purposes only, as many variations of the specific hardware and software used to implement the present invention will be readily apparent to one having ordinary skill in the art. For example, the functionality of the firewall server 109 and LAN server 107 may be combined in a single device.
- a single computer e.g., the computer system 9001 of Figure 9
- two or more programmed computers may be substituted for any one of the devices shown in Figure 1.
- Principles and advantages of distributed processing such as redundancy and replication, may also be implemented as desired to increase the robustness and performance of the system, for example.
- other means of tendering information to the custodian are possible such as satellite links, wireless communication, electromagnetic communication including microwave, maser, laser, spread spectrum etc.
- Figure 2 is a flow chart representative of a method for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention.
- the method of Figure 2 may be performed on the system of Figure 1, for example.
- the method begins with receiving a data file at a data custodian in step 201.
- the data file may include digital information or any information that may be represented without loss in a digital form, that is to be stored in a data backup storage of the data custodian.
- the data custodian may be a storage device for providing backup storage services.
- the data file may be accounting information generated by server 103 and sent to the data vault 115 via the Internet 111.
- the data file (being stored) received in step 201 Is an encrypted data file.
- asymmetric cryptographic systems which use a pair of keys, called the public key and the private key, with the property that what is encrypted using the private key may be unencrypted using the public key, permitting the public key holder to verify that the sender of an encrypted message was in possession of the private key.
- Knowledge of the public key provides no useful information for obtaining the matching private key.
- Symmetric cryptographic systems also exist which use a single key.
- Dutch auctions exist, under conditions where there is the ability to fulfill multiple bids, for example, with multiple identical Items; they are conducted by collecting bids, and choosing one bid as the accepted bid value.
- the bidder who bid that accepted bid value does not receive an Item, but rather all bidders whose bids were higher than the accepted bid price each receive an item or items at the accepted bid value.
- This mechanism has the useful property that It is provably fair, in the sense that everyone pays the exact same price, everyone who gets an item gets it for less than they committed themselves to pay, everybody who does get an item committed themselves for more than everyone who does not, and no collusion on the part of the winning bidders could possibly lower the price that they pay.
- an embodiment of the present invention may conduct a real-time dutch auction, to prioritize the servicing of file restoration requests, ensuring that those customers whose needs are most pressing, as measured by the value they place on priority recovery, are given that priority.
- the same dutch auction mechanism may be used for setting other pricing such as the fees charged for data storage, and fees paid for levels of guarantee.
- the data file (being stored) received in 201 is accompanied by a digitally signed bid specifying the value of the file and the offered fee.
- the data file being stored can be received at the custodian via the Internet or other digital communications means.
- the data file may be received as an attachment in an e-mail, or the data file being stored can be uploaded to the data custodian.
- the data file may be manually sent to the data custodian by a user, or automatically sent to the data custodian according to a predetermined backup storage plan or policy or as triggered by saving the file to a file server volume or directory having a predetermined value associated with it.
- the customer requests that a onetime or recurring schedule be set up, so that the data custodian polls the client software running on the customer premises, to respond by sending an archive, where the value is predetermined, or available for retrieval from some pre-arranged place.
- data files can be transferred at a convenient time when Internet traffic is relatively low.
- client software running on the customer premises can generate data chunks (as will be discussed below) and submit the chunks directly to the data storage facilities, and upload the file.
- the data custodian computes each chunk, but rather than sending it to the data storage facility, computes the hash of the chunk, and compares it with the hash of the chunk computed by the data storage facility. If all (or sufficiently many) of these hashes agree, the file has been properly stored and the certificate is issued. [0049] Once the data file being stored is received at the data custodian, the data custodian creates a storage certificate in step 203.
- the storage certificate includes a partial representation of the data file received by the data custodian, the declared value of the data file, as well as an electronic signature of the data custodian.
- the partial representation is a unique representation of the data file being stored that is created from the data file being stored and can be used to authenticate the data file being stored if presented in the future.
- the partial representation of the data file is typically a small fraction of the size of the original data file being stored and cannot be used to recreate the data file being stored.
- the certificate may contain other information as well such as the size of the file, a time stamp recording when the file coverage took effect, an expiration for coverage, a confirmation of the fees being charged, labels for the tracking convenience of the customer, etc.
- the partial representation is a hash value of the data file being stored computed according to a predetermined algorithm.
- hash functions exist, which take as input a file and produce as output a small piece of data, called its hash value, in such a way that it is not feasible to analyze that small piece of data so as to recover the original file.
- These functions have the additional property that any change in the file, however small, results in a completely different hash value.
- the hash value cannot be used to create the original data file that created the hash value
- the hash value can be used to authenticate a file presented as the original data file.
- MD5 and SHAl are widely used hash functions. Source code implementing these functions is available as open source in any standard Linux distribution.
- the electronic signature created in step 203 is associated with the data custodian and functions as a written signature to identify the custodian as the sender, and to possibly legally bind the data custodian as having received the data file being stored.
- the electronic signature is a cryptographic digital signature.
- Digital signatures exist, whereby a document may be reliably attributed to the owner of a public key in possession of the private key, by encrypting the hash value of the document using the private key, for example.
- the general public in possession of the public key may decrypt the encrypted hash value using that public key. When that original hash value is recovered it proves that the encryption truly was done by the private key and therefore could only have been done by the unique possessor of the private key.
- step 205 the data file being stored is stored in a storage unit of the data custodian.
- the data file being stored can be stored in the data vault 115 or the mirrored site 117 or both.
- storage can be spread among a number of sites as will be discussed more fully in connection with Figures 3, 4, 5, 6, 7, and 8.
- the storage certificate is returned to the sender of the data file.
- the storage certificate can be returned to the sender via e-mail or downloading the storage certificate to the sender.
- the certificate may be placed on a website for inspection by the sender or communicated in any other mode suitable for transmitting digital information.
- the data custodian provides a storage certificate to the sender of the data file as a receipt that the data file being stored has been received and stored at the data custodian.
- This storage certificate can be used to obtain the stored data file in the future.
- the data custodian can receive a request for a requested data file from a requestor.
- the request can include a data storage certificate and other optional information such as a file label.
- the storage certificates may be maintained by the data custodian, in which case the request might simply be some piece of information enabling the data custodian to retrieve that certificate.
- the data custodian processes the request in order to determine whether the requested data file can be provided by the data custodian.
- the data custodian first obtains an electronic signature from the storage certificate and determines whether the electronic signature of the storage certificate corresponds to a signature of the custodian. Specifically, the data custodian determines whether the electronic signature of the storage certificate matches its electronic signature previously issued in a data storage certificate. This verification can be performed by anyone in an algorithmic fashion by transforming the certificate using the public key of the data custodian as previously discussed.
- the data custodian retrieves a file from storage having the requested file label and computes a hash value for the retrieved file based on a predetermined algorithm used by the data custodian to issue storage certificates. The data custodian then compares the hash value of the retrieved file with the hash value of the stored data file to determine If a match exists. Where the hash value of the retrieved data file matches the hash value of the stored data file, the data custodian returns the retrieved data file to the requestor as the requested data file, via any communications means suitable for digital communication.
- a check is performed through the accounting system to be sure that payments are current before honoring a request. Or alternately, some different charge may apply for honoring the request under those circumstances.
- the data custodian takes corrective action. Specifically, where the storage certificate includes the data custodian's electronic signature, but the data custodian cannot retrieve a stored file that can generate the hash value included in the storage certificate, then this indicates that the data custodian has lost the data file being stored now requested. Thus, the data custodian can pay the requestor a predetermined dollar amount for lost data.
- the data owner submits their data, probably encrypted, for storage.
- the data custodian generates a hash value for the data, and creates an electronic storage certificate, which is then digitally signed. If the data owner claims a loss, the data owner can exhibit the certificate to all the world without in any way compromising the confidentiality of their data.
- the data custodian may prove to all the world that the original data has been restored, by exhibiting the (encrypted) data, which all the world can verify has the hash value referenced in the storage certificate. Cancellation due to failure to pay, claim payment, or failure of claim payment can all be proved in the usual manners for financial transactions.
- a request to cancel coverage can be acknowledged with another digitally signed instrument.
- the information submitted is not encrypted but is a common file which may have been downloaded from the internet or purchased from a vendor. In this case no additional copy of the information need be stored.
- the certificate can be granted immediately upon verification that the hash value of the transmitted file is the hash value of a file already stored.
- Figure 3 is a system for backup storing a data file such that the data file can be reconstructed in substantially (or even bit for bit) Identical form despite loss of part of the data file in accordance with an embodiment of the present invention.
- the system includes a data file being stored 301 and a plurality of storage units 303-317. As seen in Figure 3, the data file being stored 301 is separated into "chunks," and each chunk is sent to a respective data storage unit 303-317.
- the storage file 301 can be reconstructed as file 350 from the remaining chunks stored in storage units 303, 307, 309, 313, 315 and 317.
- the reconstructed file 350 is bit for bit identical to the original storage file 301.
- files can be stored with greater reliability. A method for accomplishing this reconstruction is discussed in connection with Figure 4.
- Figure 4 is a flow chart representative of a method for backup storing a data file such that the data file can be reconstructed in bit for bit identical (for example) form despite loss of part of the data file in accordance with an embodiment of the present invention.
- the method begins with receiving a data file at a data custodian in step 401.
- the data file may include digital information that is to be stored in a data backup storage of the data custodian, and the data custodian may be a storage device for providing backup storage services.
- the data file being stored received in step 401 is an encrypted data file such as that discussed in relation to asymmetric cryptographic systems above, or alternately a file may be returned from which the original file may be algorithmically derived, as for example a compressed version of the file.
- the storage file is an unencrypted data file.
- the data file being stored can be received at the custodian via the Internet, as an attachment in an e-mail, or can be uploaded to the data custodian or transmitted in any other manner appropriate for digital communication.
- step 401 may be performed by a data custodian software application on the sender's computer system. That is, while step 401 includes receiving a data file at the data custodian, such data file may be processed by client software at the user and divided into data chunks (as discussed below) before ever being remotely transmitted from the user.
- client software could be running on systems owned by the user, or on systems owned by the user but administered by the data custodian, or by systems owned by the systems custodian but located on the user's premises.
- Step 403 may be performed by use of error correcting codes or erasure codes.
- Error correcting codes exist, which enable messages that have been altered in transit to be reconstructed.
- a simple example is a repetition code, which simply sends the same message multiple times.
- An error is detected by comparing the multiple copies, and errors may be corrected by a comparison process to reach a consensus. This is the technique people use when trying to talk in a noisy environment; they repeat themselves as necessary to be understood over the noise. While the message is eventually communicated, it may take a very long time.
- step 403 of Figure 4 can be performed by use of error correcting codes such as Berger Code, Check digit, Chipkill (an application of ECC techniques to volatile system memory), Constant weight control, Convolutional codes, Differential space-time codes (related to space-time block codes), Dual modular redundancy (subset of N-modular redundancy, related to triple modular redundancy), Erasure codes (superset of Fountain codes), Forward error correction, Group code, Golay code, Goppa code, Hadamard code, Hagelbarger code, Hamming code, Lexicographic code, Longitudinal redundancy check, Low-density parity-check code, LT codes (which are near optimal rateless erasure correcting codes), m of n codes, Online codes (which are an example of rateless erasure codes), Parity bit, Raptor codes (which are high speed -near real time- fountain codes), Reed-Solomon error correction, Reed-Muller code, Repeat accumulate code, Sparse graph code,
- error correcting codes
- step 403 includes use of versions of the sparse matrix method of Gallager which comprises a family of codes.
- a particular code is constructed by choosing a number M of message pieces, each of Identical size, and constructing a number C of check pieces, each of the same size as each of the message pieces.
- Each check piece is generated from some collection of message pieces where the number of message pieces may vary from check piece to check piece.
- the manner in which the check piece is generated Is by forming an exclusive "or" of the message pieces. Decoding proceeds via "belief propagation" described below for the case of erasure codes.
- step 403 utilizes the fact that a polynomial having degree N has at most N roots. It is an easily proved standard fact from algebra that this implies that if you know the value of a degree N polynomial at any N points, you know the coefficients of the polynomial.
- Algorithms for generating the values and recovering the coefficients include Neville's Algorithm, and Newtons Interpolation Formula which may be found in Steer and Bulirsch, Introduction to Numerical Analysis.
- Neville's Algorithm and Newtons Interpolation Formula which may be found in Steer and Bulirsch, Introduction to Numerical Analysis.
- a more efficient, but more complicated approach is the Berlekamp-Massey Algorithm which are standard topics in books on algebraic codes.
- Step 403 may also be performed by use of erasure codes, which are closely related to error correcting codes and are also able to reconstruct a message where portions of the message have been lost in transmission.
- Preferred erasure codes are the Gallager family whose encoding process is described above. Decoding proceeds by finding first those check pieces constructed from just one message piece, then using that knowledge to extricate those message pieces from the remaining check pieces. That is, a check piece constructed from just one message piece coincides with that message piece. The extrication is readily done because applying an exclusive "or" a second time decodes the effect of applying an exclusive "or” the first time. Thus applying the exclusive "or" operation to the message piece and any check piece that includes it, yields a new check piece which excludes that message piece.
- the data chunks are created, they are sent to a plurality of custodian locations in step 405.
- the data chunks are preferably each sent to different storage locations to increase reliability of storage, but at least two different storage locations is all that is required according to the present invention. As noted above, in Figure 3, where one or more of the storage locations loses a data chunk for some reason, the original data file can be recreated from the remaining data chunks that have not been lost.
- Figure 5 is a detailed flow chart representative of a method for receiving a data file from a client, guaranteeing storage of the data, and backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention.
- Figure 5 shows stage 1 of receiving data from a Client (sender).
- the process begins uploading a data file in step 501.
- the data file includes a value of the data file and may also include a name, a label or other information for the convenience of the customer.
- the value may be a specific monetary value, or a rating of the importance or other valuation of the data file.
- step 503 the data file is saved at the data custodian, and the data custodian processes the file to record the value, path and any label or other information associated with the file.
- step 503 is performed at a packer manager system of the data custodian, which may be a general purpose computer such as that described in Figure 9.
- the value of step 505 may be the value received in step 501, or a value that is calculated based on a rating of the importance of the data as received in step 501, or a pre-assigned value where the file has been stored to a location bearing an agreed upon per file valuation.
- step 507 the data custodian sends a message to the client thanking the client for uploading the file and informing the client to expect an e-mail with label information.
- the client could receive synchronous notification and/or synchronous receipt of the storage certificate.
- Stage 2 begins with step 511 where the value, path and label associated with the data file are received.
- step 513 a code for encoding the data file is chosen based on a value and size of the data file. As noted with respect to Fig. 4, the code may be an error code or an erasure code, the reliability and efficiency of which can be chosen based on the value of the data file.
- Data chunks are then created in step 515 using the code selected in step 513.
- Step 511 also branches to step 517 where a hash value is computed for the original data file.
- step 519 a subset (principal chunks) of the chunks in step 515 are combined to recreate the data file, and a hash value is recomputed for the recreated data file. This recomputed hash value is then compared with the hash value computed in step 517 to check the accuracy of the recreation process using chunks.
- the recreated file will normally match bit for bit the original file. The only way this can fail to happen is if some transient hardware failure has corrupted the computation. That transient error could have occurred in the course of creating the chunks in step 515 or in the computation of the hash in step 517.
- step 521 If the files and hashes both differ in step 521 then the correct execution of step 515 is to be doubted. Thus, the chunks are erased in step 523 and the data chunks are recreated In step 515. Where the files are the same but the hashes differ, there must have been a computational error in computing the hash either at step 517 or at step 519. In this case the process returns to step 517 to recompute those hashes. It is virtually Impossible for the files to differ and yet the hashes to be the same as a result of a transient computational error.
- step 519 determines that the recomputed hash value is the same as the originally computed hash value, which should nearly always be the case, the process continues to step 525 where a check syndrome is performed.
- the check syndrome is a piece of information computed from the encoded information which reveals the presence of errors. In the case of linear codes, in the absence of error the syndrome will normally be zero. For example, a parity or check sum code adds an extra bit of Information to a message in such a way that the total number of bits is even. The syndrome for this code is then computed by counting the number of 'Ts".
- step 515 If that number is even, the syndrome is O 5 if that number is odd, the syndrome is "1" which reveals that there has been an error in transmission because the total number of "Is'" was arranged to be even not odd. Where the result of the check syndrome is bad, an error must have occurred in the execution of step 515. Thus, the process returns to step 523 where the chunks are erased and new chunks are created in step 515, Where the result of the check syndrome is good, the process flows to step 527 where packer messages are created and each given a data chunk for delivery to a distinct data center. In step 529, the packer messages are sent to distinct data centers Pl-Pn, as will be further discussed in Fig. 7 below. This may be done in order that the processes of sending the chunks to the various data centers may be performed concurrently.
- a single process could sequentially send the various chunks to the various data centers or a single process employing threads could use multiple threads to achieve concurrency. Not all choices of erasure or error correcting codes possess syndromes in which case there would be nothing to be done at step 525.
- a collation process is performed on the success or failure of the packer functions in storing the data chunks to the data centers based on messages received back from the data centers.
- the data centers send success or failure messages to their respective packer which in turn reports back to the packer manager which collates the success or failure of the message storage process in step 531.
- This collation process may conclude success based on the individual success of all of the packers or alternately on a sufficient number or a sufficient subset of successful packer operations. Under normal conditions all of those operations will succeed. If however, a particular remote data center is down or under conditions of network breakdown or congestion or transient hardware failure in a packer process, failure may result. As seen in step 533, where the collation process determines a failure, the customer is notified of such failure in step 535. This may initiate the customer resending the data file for storage. In an alternate embodiment we might return to step 527 to retry some subset of the failed packer operations.
- step 539 a certificate that guarantees the data for the declared, computed, or implied value is issued and sent to the customer as part of a notification in step 541.
- the certificate can include a hash value for the data file and a digital signature associated with the data custodian.
- the certificate might also Include other Information of value to the customer or the custodian such as the size of the file and the time of issue, time of expiration, etc.
- the customer notification could be a synchronous response coinciding with step 507.
- Fig. 6 is a detailed flow chart representative of a method for retrieving a data file guaranteed and stored according to the process shown in Fig. 5.
- the claim process begins in step 1 when the customer submits a claim. More specifically, in step 601, the customer submits a certificate with a hash value and signature to the data custodian. It could also include a label and other information discussed previously.
- the signature is verified to determine whether the signature is good or bad, as shown in step 605. Where the signature is determined to be bad, the process flows to step 611 where messages are sent to a customer to provide a valid certificate.
- step 607 a message is sent to the claims department of the data center, and a confirmation Is sent to the customer in step 609 to inform the customer that the data storage certificate was good and to expect an e-mail relating to the process claim.
- the message could be sent synchronously in step 625 rather than asynchronously via e-mail.
- the message to the claims department would normally be an automated message which could be a client-server interaction, a function call, a remote method invocation, a web service or any other form of digital communication.
- step 613 the message generated in step 607 is sent to the retriever activator in step 617.
- step 619 a retriever function is performed to retrieve the data chunk from a respective data center. Details of the retriever process flow 619 will be discussed with respect to Fig. 8 below.
- step 621 at least a subset of the end data chunks are recomblned and decoded to recover the original data file, as discussed above. In one embodiment only those chunks required to reconstruct the file are retrieved. In another embodiment an attempt is made to retrieve all chunks and then check to see if there has been success at retrieving sufficiently many chunks to recover the file. Under normal conditions, all of the original chunks should be available.
- a data center may be congested, overloaded, or suffering from some transient or permanent hardware failure.
- a given chunk may be received correctly from a given data center, a chunk may be received but corrupted, this will be apparent if a hash of the chunk is included as a part of the file name of the chunk or is stored in some other way. If the computed hash of the chunk falls to match its original hash then that chunk is corrupted. Yet another possibility is that the file was not found indicating some other process failure. Another possibility is that no response was received from the data center because of network congestion or because the server is down. This would be a soft failure referenced in step 631.
- step 623 a hash value of the retrieved data file is verified by comparing this hash value with the hash value received from the customer in step 601.
- step 625 a link is e-mailed to the customer informing the customer where they can download the original data file.
- the file could be synchronously returned to the customer or communicated in any way suitable for digital communication.
- step 627 an attempt to recompute the hash value is made, to guard against the unlikely event that some transient processing error is responsible.
- step 625 where an e-mail is sent to the customer as before.
- step 627 fails, a message is generated to initiate dollar payment on the customer's claim in step 629. That is, where the hash value cannot be verified for the retrieved data file, the data custodian has lost the data stored for the customer.
- a further attempt at retrieving and decoding could be initiated to guard against the possibility that some transient error corrupted the process in steps 617, 619, and 621.
- step 621 the data custodian is unable to decode the retrieved data chunks
- the process continues to step 631 where it is determined whether a soft failure has occurred.
- a soft failure results, for example, where one or more data chunks cannot be received, but the message can still be decoded using the remaining data chunks, or by using chunks from systems that were temporarily out of service but which still retain the missing information.
- the process returns to the retriever function 617 and decoder function 621.
- no soft failure this indicates that the data file has been lost by the data custodian, and the process flows to step 629 where a procedure is initiated for dollar payment on the claim, or payment in some other currency or store of value.
- FIG. 7 is a detailed flow chart representative of the packer step of Fig. 5 in accordance with an embodiment of the invention.
- each packer process may perform its function in parallel with all of the others.
- step 701 a Hash 1 value is computed for each of the incoming data chunks.
- the Hash 1 value for the respective data chunks is saved at the data custodian for later processing, and the data chunks are sent in step 703 to the data center 705 which this packer is responsible for.
- step 707 the data chunks are each stored at their respective data centers, and a chunk Hash 2 is computed for the stored data chunk.
- the chunk Hash 2 is received at the data custodian in step 709, and a comparison of the Hash 1 and Hash 2 values for a particular data chunk is performed. Under normal circumstances, these two hashes will agree. If they do not agree, either there has been a transient computational error in computing one or both of those hashes or there has been a corruption in transmission.
- the Hash 1 value equals the Hash 2 value
- the process continues to step 711 where that data chunk is deleted and successful storage of the data chunk is reported to the packer manager.
- the extra precautions associated with recomputing and comparing these hash values might be dispensed with.
- Hash 1 value is not equal to Hash 2 value
- an error condition exists. There are three possibilities: one possibility is that there was a transient computational error in the computing of one of the two hash values or there was a transmission error resulting in a corruption of the chunk at the data storage center, or there was an error in the transmission of Hash 2 from the data center to the packer.
- the process continues to step 713 where a Hash 3 value is recomputed and this Hash 3 value is compared to Hash 2 value for the data chunk. Hash 3 should be the same as Hash 1. If it is not then an error occurred either at this step 713 or previously at step 701.
- Hash 3 value is equal to the Hash 2 value, this means it is overwhelmingly probable that the error occurred at step 701 and that the chunk was in fact correctly transmitted in which case the process proceeds to step 711 where the data chunk is deleted having been safely stored and success is reported to the packer manager.
- a Hash 3 value also does not equal Hash 2 value, it is highly likely that some transmission error did occur. Therefore, an erase message is sent to the data center in step 715 and the data chunks are re-sent to the packers data center in steps 703. The recomputed Hash 3 replaces the originally computed Hash 1.
- step 717 Where the data center is unable to save the data chunk in step 717, for example because its disks are full, the process continues to 719 where the data center reports failure back to the packer manager. In addition, if the packer process times out in its attempt to contact the data center, it may report a soft failure. It should be noted that an embodiment of the invention might place greater trust in the reliability of the computing hardware and the network infrastructure and dispense with some of the safeguard steps.
- FIG. 8 is a detailed flow chart representative of a receiver process flow 619 of Fig. 6 in accordance with an embodiment of the invention. This process may be executed concurrently for each of the data storage centers possessing chunks of the desired file. As seen in Fig. 8, the process begins with the process manager at step 801 requesting data chunks derived from a file having a particular hash value from the data centers (815). Where the data center can find a chunk labeled with that hash value the data center returns the requested data chunk and the packer manager receives the data chunk in step 803. Each chunk is labeled with the hash value of the file it derives from as well as its own hash value. In step 805 the hash value of the chunk is verified.
- Recovery proceeds by returning to step 801 where a request can again be made to a data center step 815 for a data chunk associated with a particular hash value.
- the verification step 805 is successful, the chunk is saved in the package manager and a report is sent to the decoder indicating successful retrieval of a data chunk.
- the process continues to step 811 where a hard failure is reported to the decoder.
- the process continues to step 813 where a soft failure is reported to the decoder. This timeout most likely indicates a network congestion situation or a temporary server outage.
- the data storage facility By storing a chunk hash with the chunk, for example by incorporating it into the name of the file containing the chunk, the data storage facility is able to continually verify the integrity of its data storage and detect any file corruption that may occur. In addition, this process will uncover as early as possible any indication of incipient failure of a disk allowing it to be replaced before any data loss occurs.
- This cost is not simply the value of the file, but incorporates two other factors: the probability of losing the file, and the time value of money. Paying out money in the future is less costly than paying out money now. Thus, the cost is obtained by integrating together, for all possible future times T, the product of a time value of money factor exp(-IT) times a probability factor representing the probability at that future time T of receiving a claim and not being able to reproduce the file. That probability of having lost the file depends on the particular encoding chosen. It will be readily recognized that the interest rate I may itself be varying with time or may be a stochastic process.
- any one owner of information could ill afford to incur only this rational economic cost described above, because the consequences of data loss are so catastrophic.
- owners of information pool their risks, that barrier to economic efficiency is removed. For example, if a homeowner could not purchase home insurance, the homeowner would be compelled to go to ludicrous extremes to prevent his or her house from ever burning down. Because homeowners can purchase fire insurance and thus pool their risk with other homeowners, they take only the most practical of precautions.
- any message can be regarded as a sequence of characters where the characters are chosen from a collection of at least two possible characters. A conventional choice for two such characters are: 0 and 1.
- any message may be encoded as a sequence of O's and 1 's. Any given code requires breaking the message up Into pieces of a fixed size, which means that the message must have a size that Is a multiple of that fixed size. If the message has a size which is not a multiple of that fixed size, it may be necessary to pad it to make it the right size to enable it to be split up into the number of pieces needed by that code.
- Accelerometers exist, which measure the accelerations experienced by them, and hence by the objects to which they are attached.
- Nonvolatile memory devices exist, such as bubble memory, on one technical extreme, to litmus paper, which remembers by permanently changing color. By combining nonvolatile memory with accelerometers we may permanently record the greatest shock that some object has experienced over it's lifespan.
- An important source of unreliability in data storage is the damage that disks may suffer, after being fabricated, before being installed in a computer. Static discharge during installation is a potential source of serious damage, as is mechanical shock in transit. By attaching the device described in the previous paragraph to a disk drive during the manufacturing process, disks which have experienced unacceptable abuse levels in transit can be identified to better control the probabilities of disk failure.
- probability models can be developed which allow for estimating, according to the level of shock a disk has experienced, what the impact of that event will be on it's mean time before failure.
- Reduction of probability is readily achieved through independence. If an event has a 10% chance of occurring then there is only a 1% chance of two independent occurrences of that event. Thus, achieving maximal levels of reliability is facilitated greatly by achieving Independence of failure events.
- a disk that fails may cause other disks to fail because they are electrically connected.
- To achieve independence we can isolate the data connections by converting the voltage signal into an optical signal, which is then converted back to a voltage signal. Circuitry which limits current and/or electrical potential may also be employed. We can isolate the power connections bringing independent DC lines to each disk.
- Embodiments of the present invention may be implemented using a conventional general purpose computer or micro-processor programmed according to the teachings of the present invention, as will be apparent to those skilled in the computer art. Appropriate software can readily be prepared by programmers of ordinary skill based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.
- a non-limiting example of a computer 100 as shown in Figure 9 may implement the method of the present invention, wherein the computer housing 102 houses a motherboard 104 which contains a CPU 106, memory 108 (e.g., DRAM, ROM, EPROM, EEPROM, SRAM, SDRAM, and Flash RAM), and other optical special purpose logic devices (e.g., ASICS) or configurable logic devices (e.g., GAL and reprogrammable FPGA).
- the computer 100 also includes plural input devices, (e.g., keyboard 122 and mouse 124), and a display card 110 for controlling a monitor 120. Additionally, the computer 100 may include a floppy disk drive 114; other removable media devices (e.g.
- the compact disc 119, tape, and removable magneto-optical media (not shown)); and a hard disk 112 or other fixed high density media drives, connected using an appropriate device bus (e.g., a SCSI bus, an Enhanced IDE bus, an Ultra DMA bus, a SATA bus, a fibre channel storage network, or an IP network using network block devices or iSCSI ).
- the computer may also include a compact disc reader 118, a compact disc reader/writer unit (not shown), or a compact disc jukebox (not shown), which may be connected to the same device bus or to another device bus.
- the system includes at least one computer readable medium.
- Examples of computer readable media are compact discs 119, hard disks 112, floppy disks, tape, magneto-optical disks, PROMs (e.g., EPROM 3 EEPROM, Flash EPROM), DRAM, SRAM, SDRAM, etc.
- the present invention includes software for controlling both the hardware of the computer 100 and for enabling the computer to interact with a human user.
- Such software may include, but is not limited to, device drivers, operating systems and user applications, such as development tools.
- Such computer readable media further includes the computer program product of the present invention for performing the inventive method herein disclosed.
- the computer code devices of the present invention can be any interpreted or executable code mechanism, including but not limited to, scripts, interpreters, dynamically linked libraries, Java classes, and complete executable programs. Moreover, parts of the processing of the present invention may be distributed for better performance, reliability, and/or cost. [00101] The invention may also be implemented by the preparation of application specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method for guaranteeing retrieval of stored data includes receiving, from a sender, a data file being stored at a data custodian and creating a storage certificate associated with the data file, the storage certificate including a partial representation of the data file being stored and an electronic signature of the data custodian. The data file being stored is stored in a storage unit at the data custodian, and the storage certificate is returned to the sender to verify storage of the data file being stored at the data custodian. A method for backup storing data files includes receiving a data file being stored at a data custodian, and creating N data chunks from the data file such that the data file can be recreated using < N of the data chunks. The N data chunks are in a plurality of storage locations.
Description
TITLE OF THE INVENTION Method and System for Backup Storage of Electronic Data
BACKGROUND OF INVENTION
FIELD OF THE INVENTION
[0001] The present Invention relates generally to storage of electronic data, and more particularly to guaranteeing backup storage of a data file and/or backup storing the data file such that it may be recreated even where part of the data file is lost.
DISCUSSION OF THE BACKGROUND
[0002] Data loss, perhaps more than any other kind of loss, has catastrophic consequences for businesses. 72% of companies who experience serious data loss go out of business. 43% never re-open their doors and 29% close within two years of the data loss [I]. Those businesses that do make an attempt to go on spend 11 billion dollars each year in data recovery efforts [2]. None of this includes the vastly greater sums of money spent attempting to prevent data loss, nor does It Include the costs associated with downtime, loss of revenue and lost customer goodwill. As for the value of all of this lost data, 43% of all lost or stolen data events are each valued at 5 million dollars or more. Most companies value 100MB of data at more than 1 million dollars [3]. This means that a single 250GB disk costing less than $200 would hold more than 2.5 billion dollars worth of data.
[0003] Further, the regulatory consequences of data loss go far beyond the monetary value of the data Itself. New rules of civil procedure require companies to ensure that data loss will not occur, and to be able to prove that the data has not been altered or tampered with [4]. In
addition specific industries have additional regulatory requirements to ensure availability and security of data. These include the Sarbanes-Oxley Act, HIPAA, NASD, and SEC 17a-4. Regulations such as SEC 17a-4 may even require that a company's records be maintained and downloaded by a reliable third-party [5]. The rapid growth of digital data storage is outstripping companies' ability to manage and secure all of this data. In 2002 five Wall Street firms were each slapped with a fine of 1.65 million, for a total 8.3 million dollars, by the SEC for being unable to produce internal emails which should have been preserved on their systems for several years [6].
[§004] There are several causes for data loss, including: hardware or system malfunction (44%), user error (32%) software program malfunction (14%), viruses (7%), and natural disasters (3%) [7]. Further, the odds of having a data loss incident are estimated at 6%, which is comparable to the estimated 7% chance for an auto accident [8]. However, while the probability of a car accident has led to legal requirements for car insurance, data loss remains essentially uninsurable.
[0005] Specifically, insurance policies that provide data "loss protection" typically cover only the cost of data recovery and business interruption. That is, when data is lost, an insurance policy may cover the time and expense of retrieving the data from electronic backup facilities, creating electronic data from paper archives, restoring damaged electronic storage media, and the cost of lost computing resources during recovery. However, insurance policies typically will not cover the value of lost data that cannot be recovered. Those policies that apparently cover the value of unrecoverable data have such high premiums and/or restrictive claim payments that these policies do not offer meaningful insurance for unrecoverable data.
[0006] As insurance for loss of extremely valuable data is essentially unavailable, businesses currently invest heavily in self-insurance against data loss. For example, many businesses
and individuals use online data backup services such as those provided by Ibackup, Ivault, Amerivault, and iDataBackup. In these services, data that is generated, stored, communicated etc. by business resources can be manually and/or automatically "backed up" at an offsite storage facility. However these conventional backup storage systems may be error prone depending on the backup system custodian's expertise and resources, which may be unknown to the data owner seeking backup protection. Similarly, the backup storage system itself may have a data loss event. Further, conventional data backup systems do not guarantee data stored therein. For example, a data owner may lose on site data and request retrieval of such data from the off site storage. Where the off site facility is unable to provide the requested data, the data owner has no recourse but to simply accept the data as lost. While the data owner may switch to a different data backup provider, this does not compensate the data owner for costs associated with the lost data.
[0007] Due to the criticality of data to a business' survival and the inadequacy of existing data backup systems, businesses spend large amounts of money, such as on redundant backup systems for example, to increase protection against data loss. That is, while companies may be very good at estimating the value of their data, they do not typically possess the actuarial sophistication to rationally assess the level of protection they need. Therefore, they are simultaneously overexposed as evidenced by the data loss numbers, and on the other hand over protected by simply throwing money at their archival needs.
SUMMARY OF THE INVENTION
[0008] Accordingly, one object of the present invention is to address the above and/or other problems relating to backup storage of electronic data.
[0009] Another object of the present invention is to guarantee backup storage of electronic data.
[0010] Still another object of the present invention is to provide a method and system for insuring against data loss.
[0011] Yet another object of the present invention is to increase reliability of backup data storage facilities.
[0012] A further object of the present invention is to provide a method and system for backup storing a data file such that the data file can be reconstructed in substantially (or bit for bit) identical form despite loss of part of the data file.
[§§13] Any of these and/or other objects may be achieved by embodiments of the present invention.
[0014] One embodiment of the invention includes a method for guaranteeing retrieval of stored data. The method includes receiving, from a sender, a data file being stored to be stored at a data custodian and creating a storage certificate associated with the data file being stored, the storage certificate including a partial representation of the data file being stored and an electronic signature of the data custodian. The data file being stored is stored in a storage unit at the data custodian, the storage certificate is returned to the sender to verify storage of the data file being stored at the data custodian.
[0015] Another embodiment includes a method for backup storing data files. The method of this embodiment includes receiving, from a sender, a data file being stored to be stored at a data custodian, and creating N data chunks from the data file such that the data file can be recreated using < N of the data chunks. The N data chunks are in a plurality of storage locations.
BRIEF DESCRIPTION QF THE FIGURES
[§016] The above-described and/or other advantages of the invention will become more apparent and more readily appreciated from the following detailed description of the
exemplary embodiments of the invention taken in conjunction with the accompanying drawings, where:
[§§17] Figure 1 is a system for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention;
[0018] Figure 2 is a flow chart representative of a method for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention;
[0019] Figure 3 is a system for backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention;
[§020] Figure 4 is a flow chart representative of a method for backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention;
[0021] Figure 5 is a detailed flow chart representative of a method for receiving a data file from a client, guaranteeing storage of the data, and backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention;
[0022] Figure 6 is a detailed flow chart representative of a method for retrieving a data file guaranteed and stored according to the process shown in Figure 5, in accordance with an embodiment of the invention;
[0023] Figure 7 is a detailed flow chart representative of the packer step of Figure 5 in accordance with an embodiment of the present invention;
(0024] Figure 8 is a detailed flow chart representative of the receiver to decoder step of
Figure 6 in accordance with an embodiment of the present invention; and
[0025] Figure 9 illustrates a computer that may be used to implement the present invention.
DETAILED DESCRIPTION OF THE INVENTION
(§026] As noted in the background section above, insurance against data loss is essentially unavailable, and existing data backup systems are error prone and do not compensate a data owner for unrecoverable data. The present inventor has recognized that a significant factor contributing to these deficiencies is the moral hazards associated with data loss. As noted above, 32% of data loss is caused by user error. As it is very difficult to distinguish between intentional and inadvertent user errors, one could present a claim for loss of data that has in fact been secretly stashed. Similarly, other causes of data loss such as hardware and software failure are difficult to verify.
[0027] In considering these factors, the present inventor recognized that to address the data loss problem, what is needed is an effective means for transferring risk of data loss from the data owner, and simultaneously controlling that risk. The inventor discovered that this can best be accomplished, and perhaps can only be accomplished, by combining the functions of data custodian and risk bearer. This solves the moral risk problem of the risk bearer, because if the data owner presents a claim, the custodian risk bearer merely produces the data, frustrating any false claim of loss, but also leaving the honest data owner whole. The data owner might, however, falsely claim that the data restored has been corrupted. The data custodian might, conversely, falsely restore corrupted data, or even completely fabricated data, claiming to have discharged its obligation. Thus, what is further needed is a means whereby everyone can verify compliance with the agreement that transfers the risk of data loss. One embodiment of the present invention meets these needs by providing a system and method for guaranteeing backup storage of a data file.
[§§28] Referring now to the drawings, and more particularly to Figure 1 thereof, there is shown a system for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention. The system of Figure 1 includes workstations 101,
laptops 103, servers 105, local area network (LAN) server 107, firewall server 109, Internet 111, firewall server 113, offsite data vault 115 and mirrored site (or additional data vault) 117. In other embodiments there could be additional backup storage facilities and additional copies of the data may be stored there or alternately more sophisticated parceling of data as illustrated in Figure 3 may be employed. In the embodiment of Figure 1, workstations 101, laptops 103, servers 105, and local area network (LAN) server 107 make up the resources of a business (or household) that uses a backup data storage system in accordance with an embodiment of the present invention. Firewall server 113, offsite data vault 115 and mirrored site 117 make up the backup storage facility in the embodiment of Figure 1. [0029] Workstations 101, and laptops 103 are well known and are operated by users that generate and store data files that are essential to successful operation and survival of a business, for example. Servers 105 may be dedicated computer systems for providing various functions used during operation of the business. The workstations 101, laptops 103 and servers 105 are preferably programmed with any suitable Web browser software that permits these resources to retrieve Web pages via the Internet 111 from remote computers or servers such as the firewall server 113, offsite data vault 115 and mirrored site 117 in order to provide guaranteed data backup to users of these systems. The Web browser software may also be used to transmit information provided by the workstations 101, laptops 103 and servers 105 to remote computers such as the server 113, and storage devices 115 and 117. In an alternate embodiment the workstations 101, laptops 103 and servers 105 could be equipped with freestanding software which emulates a browser to transmit archives or makes use of dedicated file servers for this purpose.
[0030] File servers exist, which can create an illusion that storage resources which in fact reside somewhere else on the network, appear to reside locally. In combination with the concept of guaranteed file storage, in another embodiment of the invention, files which
customers save from their local systems In the customary way actually are saved on such a file server and are automatically guaranteed in the amount predetermined for all files saved to that volume, folder, or directory of this file server. Moreover, these systems could also use erasure codes, error correcting codes, clustering, and high availability technologies to make them as reliable as desired. In particular, in any data storage application a plurality of machines each having locally attached disks can present those disks as block devices via iSCSI or the network block device mechanism for example. This means that such a system can support the illusion to a client system that the disk resources on it appear to be directly attached to the client. A collection of client systems can then utilize those blocks, preferably so that each client has exclusive access to any given block device and those block devices accessed by a given client can be combined to form a RAID 5 volume. In this way, the failure of any given server system causes no failure of the client system. These client systems may then in turn export their file systems built on these block devices so that they become file servers over the network to clients of their own. In the event of failure of a file server, another file server can obtain access to those blocks and recover that file system and resume service and client access to the file system also fails over. Software tools that enable this include Heartbeat, for example.
[0031] In normal everyday usage, text is structured using parentheses, curly braces, square brackets, angle brackets, quotation marks, etc. This allows for text inside text inside text. This recursion is modeled in computer science by the concept of "tree". [0032] The Extensible Markup Language (XML) exists and greatly expands this notion thereby allowing for sophisticated tree structuring of text information. More precisely, it Is a labeled tree decorated with functions taking finite sets of (certain) strings to strings. This Is accomplished by replacing the limited set of ordinary text delimiters with an unlimited collection constructed by using the delimiter pair '< > ' and '</>' to enclose practically any
name you may choose. Thus, if you choose the name 'p', the delimiter pair become '<p>! and '</p>\ If you choose "table" for the name, the delimiter pair consists of '<table>! and '</table>'. Such delimiters enclose text which may in turn include other delimted text. The decorating or annotating with functions is done by including "attributes" so that we can include that information in the opening delimiter. Thus, if our table has a width of 5 and a height of 3, we might replace '<table>' with '<table width = "5" height = "3">5 . The closing delimiter '</table>' is unchanged.
[0033] This mechanism allows for extremely rich data structures to be transported. In particular the data types that occur in programming language can be expressed by XML. In this way the information passing which is characteristic of function calls and object method invocations can be wrapped up in an XML document, allowing remote function calls and method invocations across system boundaries and across programming language boundaries. A web services description language file (WSDL) contains the information necessary to make the translation. Thus, software tools exist that can take an existing function or method and generate this WSDL file. Software tools also exist which can take this WSDL file and create a proxy function or method which supports the illusion that functionality that is in fact remote, is just an ordinary function or method running locally. In particular, the implementation language for the actual service need not be the same as the implementation language for the proxy object. These XML enabled software interfaces are called web services. In this way a programmer using their preferred programming language can make use of program code implemented in an entirely different language without knowing anything about that other language.
[0034] By combining the concept of XML with the concept of online storage in one embodiment, it can be made easy and inexpensive for companies to incorporate embodiments
of the Invention Into their internal automated processes. By combining the concept of WSDL with the concept of online storage, it can be made even easier and even less expensive. [0035] In the embodiment of Figure 1, workstations 101, laptops 103 and servers 105 are connected in a LAN by way of LAN server 107, which coordinates communication among business resources and provides other network functionality such as compression and/or encryption of data. The business resources are also connected to Internet 111 by way of firewall computer 109 to enable communication with external resources such as the data storage facility. As Is known In the art, firewall server 109 filters data entering and exiting the LAN to protect against unauthorized access to the business resources. [0036] Workstations 101, laptops 103, servers 105, local area network (LAN) server 107, and firewall server 109 communicate with each other and with external resources using any suitable protocol for communicating directly or via the Internet 111. More specifically, these devices are configured to interface with the firewall server 113, offsite data vault 115 and/or mirrored site 117 to store backup data by user command or automatically, to obtain storage certificates, to retrieve backup data and to make claims to the storage facility in accordance with embodiments of the present invention. Workstations 101, laptops 103, servers 105, local area network (LAN) server 107, and firewall server 109 may be implemented as the general purpose computer system 9001 of Figure 9, for example.
[0037] Offsite storage vault 115 and mirrored site 117 are also connected to the Internet 111, and thus the business resources, via firewall 113. Firewall server 113 filters data entering and exiting the backup data storage system to protect against unauthorized access to backup data storage resources. The Internet 111 includes various networks and gateways for linking together various computer networks and computers such as those shown in Figure 1. [0038] Offsite data vault 115 and mirrored site 117 are storage devices that store electronic information including data files from business resources such as the workstations 101,
laptops 103 and servers 105, for example. Firewall server 113, offsite data vault 115 and/or mirrored site 117 may be configured as a Web server programmed to receive, store, and/or transmit various types of information, including, requested backup data files from the business resources. These devices can also be configured to provide storage certificates and satisfy claims in accordance with embodiments of the present invention. [0039] It is to be understood that the system in Figure 1 is for exemplary purposes only, as many variations of the specific hardware and software used to implement the present invention will be readily apparent to one having ordinary skill in the art. For example, the functionality of the firewall server 109 and LAN server 107 may be combined in a single device. To implement these variations as well as other variations, a single computer (e.g., the computer system 9001 of Figure 9) may be programmed to perform the special purpose functions of two or more of the devices shown in Figure 1. On the other hand, two or more programmed computers may be substituted for any one of the devices shown in Figure 1. Principles and advantages of distributed processing, such as redundancy and replication, may also be implemented as desired to increase the robustness and performance of the system, for example. Further, as would be understood by one of ordinary skill in the art, other means of tendering information to the custodian are possible such as satellite links, wireless communication, electromagnetic communication including microwave, maser, laser, spread spectrum etc.
[0040] Figure 2 is a flow chart representative of a method for guaranteeing backup storage of a data file in accordance with an embodiment of the present invention. The method of Figure 2 may be performed on the system of Figure 1, for example. As seen in Figure 2, the method begins with receiving a data file at a data custodian in step 201. The data file may include digital information or any information that may be represented without loss in a digital form, that is to be stored in a data backup storage of the data custodian. Further, the data custodian
may be a storage device for providing backup storage services. Thus, referring to Figure 1, for example, the data file may be accounting information generated by server 103 and sent to the data vault 115 via the Internet 111.
[0041] In one embodiment the data file (being stored) received in step 201 Is an encrypted data file. For example, asymmetric cryptographic systems exist, which use a pair of keys, called the public key and the private key, with the property that what is encrypted using the private key may be unencrypted using the public key, permitting the public key holder to verify that the sender of an encrypted message was in possession of the private key. Knowledge of the public key provides no useful information for obtaining the matching private key. Symmetric cryptographic systems also exist which use a single key. [0042] Dutch auctions exist, under conditions where there is the ability to fulfill multiple bids, for example, with multiple identical Items; they are conducted by collecting bids, and choosing one bid as the accepted bid value. However, the bidder who bid that accepted bid value does not receive an Item, but rather all bidders whose bids were higher than the accepted bid price each receive an item or items at the accepted bid value. This mechanism has the useful property that It is provably fair, in the sense that everyone pays the exact same price, everyone who gets an item gets it for less than they committed themselves to pay, everybody who does get an item committed themselves for more than everyone who does not, and no collusion on the part of the winning bidders could possibly lower the price that they pay.
[0043] It is possible for panic conditions to arise. If a hurricane should destroy a particular storage location, for example, customers might seek reassurance that their files are nonetheless still safe, by requesting retrieval of those files. This could create a congestion situation, which would interfere with the recovery efforts of customers who happened to have the misfortune of actually experiencing data loss at that time. To alleviate this danger, and to
ensure a fair process, an embodiment of the present invention may conduct a real-time dutch auction, to prioritize the servicing of file restoration requests, ensuring that those customers whose needs are most pressing, as measured by the value they place on priority recovery, are given that priority.
[0044] The same dutch auction mechanism may be used for setting other pricing such as the fees charged for data storage, and fees paid for levels of guarantee.
[0045] In another embodiment, the data file (being stored) received in 201 is accompanied by a digitally signed bid specifying the value of the file and the offered fee.
[0046] Further, the data file being stored can be received at the custodian via the Internet or other digital communications means. For example, the data file may be received as an attachment in an e-mail, or the data file being stored can be uploaded to the data custodian.
The data file may be manually sent to the data custodian by a user, or automatically sent to the data custodian according to a predetermined backup storage plan or policy or as triggered by saving the file to a file server volume or directory having a predetermined value associated with it.
[0047] In yet another embodiment, the customer requests that a onetime or recurring schedule be set up, so that the data custodian polls the client software running on the customer premises, to respond by sending an archive, where the value is predetermined, or available for retrieval from some pre-arranged place. In this way, data files can be transferred at a convenient time when Internet traffic is relatively low.
[0048] In another embodiment, client software running on the customer premises can generate data chunks (as will be discussed below) and submit the chunks directly to the data storage facilities, and upload the file. The data custodian computes each chunk, but rather than sending it to the data storage facility, computes the hash of the chunk, and compares it
with the hash of the chunk computed by the data storage facility. If all (or sufficiently many) of these hashes agree, the file has been properly stored and the certificate is issued. [0049] Once the data file being stored is received at the data custodian, the data custodian creates a storage certificate in step 203. The storage certificate includes a partial representation of the data file received by the data custodian, the declared value of the data file, as well as an electronic signature of the data custodian. The partial representation is a unique representation of the data file being stored that is created from the data file being stored and can be used to authenticate the data file being stored if presented in the future. However, the partial representation of the data file is typically a small fraction of the size of the original data file being stored and cannot be used to recreate the data file being stored. The certificate may contain other information as well such as the size of the file, a time stamp recording when the file coverage took effect, an expiration for coverage, a confirmation of the fees being charged, labels for the tracking convenience of the customer, etc. [0050] It may be desirable for customers to be able to use erasure code software in house in which case they may not care that they can recover a particular hash, but only a sufficient number of them to regenerate some original file from that information. More complicated Boolean expressions could also conceivably be desirable to express a guarantee. Thus, embodiments of the invention can also issue certificates for some specified k not exceeding n, where a system receives, possibly at distinct locations, n files, and it is guaranteed that the system will return on demand at least k of the n, using the same hash mechanism for verifying those k files as authentic. Using the usual Boolean operators of "and" and "or" evermore sophisticated conditions can be recursively generated.
[0051] In one embodiment, the partial representation is a hash value of the data file being stored computed according to a predetermined algorithm. Specifically, hash functions exist, which take as input a file and produce as output a small piece of data, called its hash value, in
such a way that it is not feasible to analyze that small piece of data so as to recover the original file. These functions have the additional property that any change in the file, however small, results in a completely different hash value. Thus, although the hash value cannot be used to create the original data file that created the hash value, the hash value can be used to authenticate a file presented as the original data file. Specifically MD5 and SHAl are widely used hash functions. Source code implementing these functions is available as open source in any standard Linux distribution.
[0052] The electronic signature created in step 203 is associated with the data custodian and functions as a written signature to identify the custodian as the sender, and to possibly legally bind the data custodian as having received the data file being stored. In one embodiment, the electronic signature is a cryptographic digital signature. Digital signatures exist, whereby a document may be reliably attributed to the owner of a public key in possession of the private key, by encrypting the hash value of the document using the private key, for example. The general public in possession of the public key may decrypt the encrypted hash value using that public key. When that original hash value is recovered it proves that the encryption truly was done by the private key and therefore could only have been done by the unique possessor of the private key. Computer source code implementing key generation, encryption, and decryption is freely available as open source code from www.gpg.org. [0053] In step 205, the data file being stored is stored in a storage unit of the data custodian. Returning to the example of Figure 1, the data file being stored can be stored in the data vault 115 or the mirrored site 117 or both. In another embodiment of the invention storage can be spread among a number of sites as will be discussed more fully in connection with Figures 3, 4, 5, 6, 7, and 8.
[0054] In step 207, the storage certificate is returned to the sender of the data file. The storage certificate can be returned to the sender via e-mail or downloading the storage
certificate to the sender. In another embodiment the certificate may be placed on a website for inspection by the sender or communicated in any other mode suitable for transmitting digital information.
[0055] Thus, in the method of Figure 2, the data custodian provides a storage certificate to the sender of the data file as a receipt that the data file being stored has been received and stored at the data custodian. This storage certificate can be used to obtain the stored data file in the future. Specifically, in one embodiment, the data custodian can receive a request for a requested data file from a requestor. The request can include a data storage certificate and other optional information such as a file label. In yet another embodiment of the invention, the storage certificates may be maintained by the data custodian, in which case the request might simply be some piece of information enabling the data custodian to retrieve that certificate. The data custodian processes the request in order to determine whether the requested data file can be provided by the data custodian. In processing the request, the data custodian first obtains an electronic signature from the storage certificate and determines whether the electronic signature of the storage certificate corresponds to a signature of the custodian. Specifically, the data custodian determines whether the electronic signature of the storage certificate matches its electronic signature previously issued in a data storage certificate. This verification can be performed by anyone in an algorithmic fashion by transforming the certificate using the public key of the data custodian as previously discussed.
[0056] Where the electronic signature of the storage certificate corresponds to the electronic signature of a previously issued storage certificate, the data custodian retrieves a file from storage having the requested file label and computes a hash value for the retrieved file based on a predetermined algorithm used by the data custodian to issue storage certificates. The data custodian then compares the hash value of the retrieved file with the hash value of the
stored data file to determine If a match exists. Where the hash value of the retrieved data file matches the hash value of the stored data file, the data custodian returns the retrieved data file to the requestor as the requested data file, via any communications means suitable for digital communication.
[§057] In another embodiment of the invention a check is performed through the accounting system to be sure that payments are current before honoring a request. Or alternately, some different charge may apply for honoring the request under those circumstances. [0058] However, where the hash value of the retrieved data file does not match the hash value of the stored data file, the data custodian takes corrective action. Specifically, where the storage certificate includes the data custodian's electronic signature, but the data custodian cannot retrieve a stored file that can generate the hash value included in the storage certificate, then this indicates that the data custodian has lost the data file being stored now requested. Thus, the data custodian can pay the requestor a predetermined dollar amount for lost data.
[0059] Thus, by combining the insurance concept of transfer of risk with the concept of online storage, stored backup data can be guaranteed in accordance with one embodiment of the invention. For example, the data owner submits their data, probably encrypted, for storage. The data custodian generates a hash value for the data, and creates an electronic storage certificate, which is then digitally signed. If the data owner claims a loss, the data owner can exhibit the certificate to all the world without in any way compromising the confidentiality of their data. The data custodian may prove to all the world that the original data has been restored, by exhibiting the (encrypted) data, which all the world can verify has the hash value referenced in the storage certificate. Cancellation due to failure to pay, claim payment, or failure of claim payment can all be proved in the usual manners for financial
transactions. A request to cancel coverage can be acknowledged with another digitally signed instrument.
[0060] In another embodiment of the invention the information submitted is not encrypted but is a common file which may have been downloaded from the internet or purchased from a vendor. In this case no additional copy of the information need be stored. The certificate can be granted immediately upon verification that the hash value of the transmitted file is the hash value of a file already stored.
[§061] In another embodiment, rather than assembling the file from the chunks when a certificate is presented to retrieve a file, the signature is verified and current payment status is verified, and then as appropriate, permission is granted to the presenter or some appropriate agent of the presenter to directly retrieve the chunks from the data storage facilities for restoration. If that recovery should be claimed to have failed, fallback to another embodiment is possible.
[0062] As discussed in the background, the present inventor also recognized that conventional backup storage systems are error prone and may suffer a data loss event. Thus, the present inventor also recognized the need for a more reliable backup data storage system. Figure 3 is a system for backup storing a data file such that the data file can be reconstructed in substantially (or even bit for bit) Identical form despite loss of part of the data file in accordance with an embodiment of the present invention. The system includes a data file being stored 301 and a plurality of storage units 303-317. As seen in Figure 3, the data file being stored 301 is separated into "chunks," and each chunk is sent to a respective data storage unit 303-317. As also seen in the example of Figure 3, when a disaster causes data to be lost In data storage units 305 and 311, then the storage file 301 can be reconstructed as file 350 from the remaining chunks stored in storage units 303, 307, 309, 313, 315 and 317. The reconstructed file 350 is bit for bit identical to the original storage file 301. Thus, files can be
stored with greater reliability. A method for accomplishing this reconstruction is discussed in connection with Figure 4.
[0063] Figure 4 is a flow chart representative of a method for backup storing a data file such that the data file can be reconstructed in bit for bit identical (for example) form despite loss of part of the data file in accordance with an embodiment of the present invention. As seen in Figure 4, the method begins with receiving a data file at a data custodian in step 401. As with the embodiment of Figure 2, the data file may include digital information that is to be stored in a data backup storage of the data custodian, and the data custodian may be a storage device for providing backup storage services. In one embodiment the data file being stored received in step 401 is an encrypted data file such as that discussed in relation to asymmetric cryptographic systems above, or alternately a file may be returned from which the original file may be algorithmically derived, as for example a compressed version of the file. In yet another embodiment the storage file is an unencrypted data file. Further, the data file being stored can be received at the custodian via the Internet, as an attachment in an e-mail, or can be uploaded to the data custodian or transmitted in any other manner appropriate for digital communication.
[0064] Still further, step 401 may be performed by a data custodian software application on the sender's computer system. That is, while step 401 includes receiving a data file at the data custodian, such data file may be processed by client software at the user and divided into data chunks (as discussed below) before ever being remotely transmitted from the user. This client software could be running on systems owned by the user, or on systems owned by the user but administered by the data custodian, or by systems owned by the systems custodian but located on the user's premises.
[0065] Once the data file being stored is received at the data custodian in step 401, N data chunks are created from the data file such that the data file can be recreated using less than N
of the data chunks in step 403. Step 403 may be performed by use of error correcting codes or erasure codes. Error correcting codes exist, which enable messages that have been altered in transit to be reconstructed. A simple example is a repetition code, which simply sends the same message multiple times. An error is detected by comparing the multiple copies, and errors may be corrected by a comparison process to reach a consensus. This is the technique people use when trying to talk in a noisy environment; they repeat themselves as necessary to be understood over the noise. While the message is eventually communicated, it may take a very long time. This is also the method employed in mirroring. In a preferred embodiment of the present invention use is made of much more efficient error correcting codes or erasure codes that allow reconstructing of the data file with less than all of the data chunks, and where no complete copy of the file exists in any one place.
[0066] Specifically, step 403 of Figure 4 can be performed by use of error correcting codes such as Berger Code, Check digit, Chipkill (an application of ECC techniques to volatile system memory), Constant weight control, Convolutional codes, Differential space-time codes (related to space-time block codes), Dual modular redundancy (subset of N-modular redundancy, related to triple modular redundancy), Erasure codes (superset of Fountain codes), Forward error correction, Group code, Golay code, Goppa code, Hadamard code, Hagelbarger code, Hamming code, Lexicographic code, Longitudinal redundancy check, Low-density parity-check code, LT codes ( which are near optimal rateless erasure correcting codes), m of n codes, Online codes ( which are an example of rateless erasure codes), Parity bit, Raptor codes (which are high speed -near real time- fountain codes), Reed-Solomon error correction, Reed-Muller code, Repeat accumulate code, Sparse graph code, Space-time code, Space time trellis code, Tornado codes ( which are optimal Fountain codes), Triple modular redundancy, Turbo code, BCH code, and Walsh code (used in cellular telephony for its high noise immunity, not just its ECC capabilities). As would be understood by one of ordinary
skill in the art, other known error codes may be used. Further information, extensive references, and links for source code Implementing many of these codes may be found in the error correcting codes entry at www.wlkipedia.org.
[0067] In a preferred embodiment, step 403 includes use of versions of the sparse matrix method of Gallager which comprises a family of codes. A particular code is constructed by choosing a number M of message pieces, each of Identical size, and constructing a number C of check pieces, each of the same size as each of the message pieces. Each check piece is generated from some collection of message pieces where the number of message pieces may vary from check piece to check piece. The manner in which the check piece is generated, Is by forming an exclusive "or" of the message pieces. Decoding proceeds via "belief propagation" described below for the case of erasure codes. This exclusive "or" is generated in a bit wise fashion using the standard rules for modulo 2 arithmetic: even plus even equals even, even plus odd equals odd, odd plus odd equals even where 0 is even and 1 is odd. [0068] In another embodiment step 403 utilizes the fact that a polynomial having degree N has at most N roots. It is an easily proved standard fact from algebra that this implies that if you know the value of a degree N polynomial at any N points, you know the coefficients of the polynomial. Thus, If you calculate the values of a degree N polynomial at N+L places and store those N+L results in distinct places, the loss of any L of those values leaves you with N values left from which it is possible to recover the coefficients of the polynomial. These facts persist not just with ordinary numbers but also with the finite fields. Calculations in these finite fields may be efficiently performed by computing exponentials and logarithms in the usual way for finite fields and then non-trivial multiplication consists of adding exponents while non-trivial addition can be reduced to adding 1 which is readily done by combining the exponential and log tables. Computation in Galois Fields which are the finite fields is a standard topic in any book on algebraic coding theory. Algorithms for generating the values
and recovering the coefficients include Neville's Algorithm, and Newtons Interpolation Formula which may be found in Steer and Bulirsch, Introduction to Numerical Analysis. A more efficient, but more complicated approach is the Berlekamp-Massey Algorithm which are standard topics in books on algebraic codes.
[0069] Step 403 may also be performed by use of erasure codes, which are closely related to error correcting codes and are also able to reconstruct a message where portions of the message have been lost in transmission. Preferred erasure codes are the Gallager family whose encoding process is described above. Decoding proceeds by finding first those check pieces constructed from just one message piece, then using that knowledge to extricate those message pieces from the remaining check pieces. That is, a check piece constructed from just one message piece coincides with that message piece. The extrication is readily done because applying an exclusive "or" a second time decodes the effect of applying an exclusive "or" the first time. Thus applying the exclusive "or" operation to the message piece and any check piece that includes it, yields a new check piece which excludes that message piece. This process is applied over and over again until all of the message pieces have been recovered. While these decoding processes are not guaranteed to succeed, the probability of success can be made as large as desired and nearly all choices of codes in this family will be good ones. [0070] Once the data chunks are created, they are sent to a plurality of custodian locations in step 405. The data chunks are preferably each sent to different storage locations to increase reliability of storage, but at least two different storage locations is all that is required according to the present invention. As noted above, in Figure 3, where one or more of the storage locations loses a data chunk for some reason, the original data file can be recreated from the remaining data chunks that have not been lost. Thus, where a user requests to retrieve a previously stored data file from the data custodian, the data custodian then obtains
the data chunks from enough of the different storage locations to recreate the file bit for bit (for example) even where some of the data chunks have been lost. [0071] In a preferred embodiment of the invention, the data custodian can use both data storage certificates and data chunks. Figure 5 is a detailed flow chart representative of a method for receiving a data file from a client, guaranteeing storage of the data, and backup storing a data file such that the data file can be reconstructed in bit for bit identical form despite loss of part of the data file in accordance with an embodiment of the present invention. In particular, Figure 5 shows stage 1 of receiving data from a Client (sender). |0072] In the discussion of Figure 5 it will be noted that there are many steps which exist to safeguard the integrity of the process of the data storage and erasure coding process. In other embodiments of the invention, less care may be taken with more reliance placed on the freedom of the underlying hardware platform from transient computational error and on the reliability of communications among the various data storage centers. [0073] In the embodiment of Fig. 5, the process begins uploading a data file in step 501. The data file includes a value of the data file and may also include a name, a label or other information for the convenience of the customer. The value may be a specific monetary value, or a rating of the importance or other valuation of the data file. In step 503, the data file is saved at the data custodian, and the data custodian processes the file to record the value, path and any label or other information associated with the file. In one embodiment, step 503 is performed at a packer manager system of the data custodian, which may be a general purpose computer such as that described in Figure 9. The value of step 505 may be the value received in step 501, or a value that is calculated based on a rating of the importance of the data as received in step 501, or a pre-assigned value where the file has been stored to a location bearing an agreed upon per file valuation. In step 507, the data custodian sends a message to the client thanking the client for uploading the file and informing the
client to expect an e-mail with label information. In another embodiment the client could receive synchronous notification and/or synchronous receipt of the storage certificate. [0074] As seen in fig. 5, the process continues from step 505 along a process flow line to Stage 2 of distributing the data file. Stage 2 begins with step 511 where the value, path and label associated with the data file are received. In step 513, a code for encoding the data file is chosen based on a value and size of the data file. As noted with respect to Fig. 4, the code may be an error code or an erasure code, the reliability and efficiency of which can be chosen based on the value of the data file. Data chunks are then created in step 515 using the code selected in step 513.
[0075] Step 511 also branches to step 517 where a hash value is computed for the original data file. In step 519, a subset (principal chunks) of the chunks in step 515 are combined to recreate the data file, and a hash value is recomputed for the recreated data file. This recomputed hash value is then compared with the hash value computed in step 517 to check the accuracy of the recreation process using chunks. At step 521 the recreated file will normally match bit for bit the original file. The only way this can fail to happen is if some transient hardware failure has corrupted the computation. That transient error could have occurred in the course of creating the chunks in step 515 or in the computation of the hash in step 517. If the files and hashes both differ in step 521 then the correct execution of step 515 is to be doubted. Thus, the chunks are erased in step 523 and the data chunks are recreated In step 515. Where the files are the same but the hashes differ, there must have been a computational error in computing the hash either at step 517 or at step 519. In this case the process returns to step 517 to recompute those hashes. It is virtually Impossible for the files to differ and yet the hashes to be the same as a result of a transient computational error. [0076] Where step 519 determines that the recomputed hash value is the same as the originally computed hash value, which should nearly always be the case, the process
continues to step 525 where a check syndrome is performed. The check syndrome is a piece of information computed from the encoded information which reveals the presence of errors. In the case of linear codes, in the absence of error the syndrome will normally be zero. For example, a parity or check sum code adds an extra bit of Information to a message in such a way that the total number of bits is even. The syndrome for this code is then computed by counting the number of 'Ts". If that number is even, the syndrome is O5 if that number is odd, the syndrome is "1" which reveals that there has been an error in transmission because the total number of "Is'" was arranged to be even not odd. Where the result of the check syndrome is bad, an error must have occurred in the execution of step 515. Thus, the process returns to step 523 where the chunks are erased and new chunks are created in step 515, Where the result of the check syndrome is good, the process flows to step 527 where packer messages are created and each given a data chunk for delivery to a distinct data center. In step 529, the packer messages are sent to distinct data centers Pl-Pn, as will be further discussed in Fig. 7 below. This may be done in order that the processes of sending the chunks to the various data centers may be performed concurrently.
[0077] In yet another embodiment a single process could sequentially send the various chunks to the various data centers or a single process employing threads could use multiple threads to achieve concurrency. Not all choices of erasure or error correcting codes possess syndromes in which case there would be nothing to be done at step 525. [0078] In the embodiment of Fig. 5, a collation process is performed on the success or failure of the packer functions in storing the data chunks to the data centers based on messages received back from the data centers. Thus, as seen in the flow from step 529 to 531, the data centers send success or failure messages to their respective packer which in turn reports back to the packer manager which collates the success or failure of the message storage process in step 531. This collation process may conclude success based on the individual success of all
of the packers or alternately on a sufficient number or a sufficient subset of successful packer operations. Under normal conditions all of those operations will succeed. If however, a particular remote data center is down or under conditions of network breakdown or congestion or transient hardware failure in a packer process, failure may result. As seen in step 533, where the collation process determines a failure, the customer is notified of such failure in step 535. This may initiate the customer resending the data file for storage. In an alternate embodiment we might return to step 527 to retry some subset of the failed packer operations. However, where the collation process determines success in step 537, the process continues to step 539 where a certificate that guarantees the data for the declared, computed, or implied value is issued and sent to the customer as part of a notification in step 541. As noted above, the certificate can include a hash value for the data file and a digital signature associated with the data custodian. The certificate might also Include other Information of value to the customer or the custodian such as the size of the file and the time of issue, time of expiration, etc. In yet another embodiment the customer notification could be a synchronous response coinciding with step 507.
[0079] Fig. 6 is a detailed flow chart representative of a method for retrieving a data file guaranteed and stored according to the process shown in Fig. 5. As seen in Fig. 6, the claim process begins in step 1 when the customer submits a claim. More specifically, in step 601, the customer submits a certificate with a hash value and signature to the data custodian. It could also include a label and other information discussed previously. In step 603, the signature is verified to determine whether the signature is good or bad, as shown in step 605. Where the signature is determined to be bad, the process flows to step 611 where messages are sent to a customer to provide a valid certificate. Where the signature is determined to be good, the process flows to step 607 where a message is sent to the claims department of the data center, and a confirmation Is sent to the customer in step 609 to inform the customer that
the data storage certificate was good and to expect an e-mail relating to the process claim. Alternately, the message could be sent synchronously in step 625 rather than asynchronously via e-mail. The message to the claims department would normally be an automated message which could be a client-server interaction, a function call, a remote method invocation, a web service or any other form of digital communication.
[0080] As seen by process flow 613, the message generated in step 607 is sent to the retriever activator in step 617. In step 619, a retriever function is performed to retrieve the data chunk from a respective data center. Details of the retriever process flow 619 will be discussed with respect to Fig. 8 below. In step 621, at least a subset of the end data chunks are recomblned and decoded to recover the original data file, as discussed above. In one embodiment only those chunks required to reconstruct the file are retrieved. In another embodiment an attempt is made to retrieve all chunks and then check to see if there has been success at retrieving sufficiently many chunks to recover the file. Under normal conditions, all of the original chunks should be available.
[0081] However, if a data center has failed or under conditions of network failure or congestion, certain data centers may be congested, overloaded, or suffering from some transient or permanent hardware failure. Several possibilities occur: a given chunk may be received correctly from a given data center, a chunk may be received but corrupted, this will be apparent if a hash of the chunk is included as a part of the file name of the chunk or is stored in some other way. If the computed hash of the chunk falls to match its original hash then that chunk is corrupted. Yet another possibility is that the file was not found indicating some other process failure. Another possibility is that no response was received from the data center because of network congestion or because the server is down. This would be a soft failure referenced in step 631.
[0082] Where a retrieved data file is successfully recombined and decoded, the process continues to step 623 where a hash value of the retrieved data file is verified by comparing this hash value with the hash value received from the customer in step 601. Where the hash value message is verified, the process continues to step 625 where a link is e-mailed to the customer informing the customer where they can download the original data file. Alternately the file could be synchronously returned to the customer or communicated in any way suitable for digital communication. In the unlikely event that the hash value fails the verification step 623, the process flows to step 627 where an attempt to recompute the hash value is made, to guard against the unlikely event that some transient processing error is responsible.
[0083] Where the recompute hash value step is successful, the process continues to step 625 where an e-mail is sent to the customer as before. However, where the recompute step 627 fails, a message is generated to initiate dollar payment on the customer's claim in step 629. That is, where the hash value cannot be verified for the retrieved data file, the data custodian has lost the data stored for the customer. In yet another embodiment a further attempt at retrieving and decoding could be initiated to guard against the possibility that some transient error corrupted the process in steps 617, 619, and 621. Similarly, where in step 621, the data custodian is unable to decode the retrieved data chunks, the process continues to step 631 where it is determined whether a soft failure has occurred. A soft failure results, for example, where one or more data chunks cannot be received, but the message can still be decoded using the remaining data chunks, or by using chunks from systems that were temporarily out of service but which still retain the missing information. Thus, where a soft failure is determined, the process returns to the retriever function 617 and decoder function 621. However, where no soft failure occurs, this indicates that the data file has been lost by the
data custodian, and the process flows to step 629 where a procedure is initiated for dollar payment on the claim, or payment in some other currency or store of value. [0084] Figure. 7 is a detailed flow chart representative of the packer step of Fig. 5 in accordance with an embodiment of the invention. There may be more than one packer and in a preferred embodiment there may be as many packers under the direction of the packer manager as there are data storage facilities to facilitate concurrency. Thus, each packer process may perform its function in parallel with all of the others. As seen in this figure, in step 701, a Hash 1 value is computed for each of the incoming data chunks. The Hash 1 value for the respective data chunks is saved at the data custodian for later processing, and the data chunks are sent in step 703 to the data center 705 which this packer is responsible for. In step 707, the data chunks are each stored at their respective data centers, and a chunk Hash 2 is computed for the stored data chunk. The chunk Hash 2 is received at the data custodian in step 709, and a comparison of the Hash 1 and Hash 2 values for a particular data chunk is performed. Under normal circumstances, these two hashes will agree. If they do not agree, either there has been a transient computational error in computing one or both of those hashes or there has been a corruption in transmission. In the normal case where the Hash 1 value equals the Hash 2 value, the process continues to step 711 where that data chunk is deleted and successful storage of the data chunk is reported to the packer manager. In another embodiment of the invention the extra precautions associated with recomputing and comparing these hash values might be dispensed with.
[0085] Where Hash 1 value is not equal to Hash 2 value, an error condition exists. There are three possibilities: one possibility is that there was a transient computational error in the computing of one of the two hash values or there was a transmission error resulting in a corruption of the chunk at the data storage center, or there was an error in the transmission of Hash 2 from the data center to the packer. To recover, the process continues to step 713
where a Hash 3 value is recomputed and this Hash 3 value is compared to Hash 2 value for the data chunk. Hash 3 should be the same as Hash 1. If it is not then an error occurred either at this step 713 or previously at step 701. Where the Hash 3 value is equal to the Hash 2 value, this means it is overwhelmingly probable that the error occurred at step 701 and that the chunk was in fact correctly transmitted in which case the process proceeds to step 711 where the data chunk is deleted having been safely stored and success is reported to the packer manager. However, where a Hash 3 value also does not equal Hash 2 value, it is highly likely that some transmission error did occur. Therefore, an erase message is sent to the data center in step 715 and the data chunks are re-sent to the packers data center in steps 703. The recomputed Hash 3 replaces the originally computed Hash 1. Where the data center is unable to save the data chunk in step 717, for example because its disks are full, the process continues to 719 where the data center reports failure back to the packer manager. In addition, if the packer process times out in its attempt to contact the data center, it may report a soft failure. It should be noted that an embodiment of the invention might place greater trust in the reliability of the computing hardware and the network infrastructure and dispense with some of the safeguard steps.
[0086] Figure 8 is a detailed flow chart representative of a receiver process flow 619 of Fig. 6 in accordance with an embodiment of the invention. This process may be executed concurrently for each of the data storage centers possessing chunks of the desired file. As seen in Fig. 8, the process begins with the process manager at step 801 requesting data chunks derived from a file having a particular hash value from the data centers (815). Where the data center can find a chunk labeled with that hash value the data center returns the requested data chunk and the packer manager receives the data chunk in step 803. Each chunk is labeled with the hash value of the file it derives from as well as its own hash value. In step 805 the hash value of the chunk is verified. In the rare cases where the hash value of
the retrieved data chunk does not match the expected hash value, this indicates that the chunk has become corrupted, most likely in transit. Recovery proceeds by returning to step 801 where a request can again be made to a data center step 815 for a data chunk associated with a particular hash value. Where the verification step 805 is successful, the chunk is saved in the package manager and a report is sent to the decoder indicating successful retrieval of a data chunk. Where a data chunk cannot be found, the process continues to step 811 where a hard failure is reported to the decoder. Where a timeout occurs, the process continues to step 813 where a soft failure is reported to the decoder. This timeout most likely indicates a network congestion situation or a temporary server outage. So it may be worthwhile if needed to reattempt retrieval. It will be readily recognized that the orders of some of these steps can be rearranged, for example, a chunk could be saved before verifying its hash, removing it if it is found to be corrupt.
[0087] By storing a chunk hash with the chunk, for example by incorporating it into the name of the file containing the chunk, the data storage facility is able to continually verify the integrity of its data storage and detect any file corruption that may occur. In addition, this process will uncover as early as possible any indication of incipient failure of a disk allowing it to be replaced before any data loss occurs.
[0088] As is understood from the above discussion, a variety of encoding choices may be used in accordance with embodiments of the invention. Given a file, for each possible choice of encoding using either erasure codes or error correcting codes, there is a cost of using that encoding, which includes any computational cost, together with the extra storage required to contain the extra information required to check for errors. In the case of a repetition code, for example, to repeat the data 5 times means 5 times the cost of disk space required for storage. While more efficient codes exist, all codes require some additional storage space to hold the redundant information needed to replace any information lost by an error or erasure.
[0089] There is also, for that file, a cost of losing the file. This cost is not simply the value of the file, but incorporates two other factors: the probability of losing the file, and the time value of money. Paying out money in the future is less costly than paying out money now. Thus, the cost is obtained by integrating together, for all possible future times T, the product of a time value of money factor exp(-IT) times a probability factor representing the probability at that future time T of receiving a claim and not being able to reproduce the file. That probability of having lost the file depends on the particular encoding chosen. It will be readily recognized that the interest rate I may itself be varying with time or may be a stochastic process.
[0090] Various algorithms exist for finding minima which can be applied to this total cost, including calculating a derivative and finding a zero by section or Newton's method (Stoer & Bulirsch, Introduction to Numerical Analysis). In a preferred embodiment of the invention, the encoding method is chosen which minimizes this total cost, and thus the greatest economic benefit is derived.
[0091] However, any one owner of information could ill afford to incur only this rational economic cost described above, because the consequences of data loss are so catastrophic. According to one embodiment of the invention, if owners of information pool their risks, that barrier to economic efficiency is removed. For example, if a homeowner could not purchase home insurance, the homeowner would be compelled to go to ludicrous extremes to prevent his or her house from ever burning down. Because homeowners can purchase fire insurance and thus pool their risk with other homeowners, they take only the most practical of precautions.
[0092] In addition to the pooling of data loss risk, there is an economy of scale in that the large data sets in the pool make possible the use of the most efficient of the erasure and error correcting codes such as raptor and online codes.
[0093] An additional source of economic benefit is obtained by compiling a database of hashes for widely used files such as files downloaded from the Internet. This feature ensures that such files will only be backed up once and thus eliminates the redundant backup of files. It enables more economical backup by adding to the archive only proprietary files, while at the same time, being assured of the ability to retrieve both proprietary and non-proprietary files. Technologies exist for finding files that are similar and compressing away the redundant Information. For example, hashes may be computed for individual portions of a file. Those portions sharing a common hash value are overwhelmingly likely to be identical and need only be saved once. Incorporating this element Into the Invention further increases the efficiencies in space, bandwidth and cost.
[§094] Further, size normalization may be used In accordance with an embodiment of the present invention. Any message can be regarded as a sequence of characters where the characters are chosen from a collection of at least two possible characters. A conventional choice for two such characters are: 0 and 1. In particular, any message may be encoded as a sequence of O's and 1 's. Any given code requires breaking the message up Into pieces of a fixed size, which means that the message must have a size that Is a multiple of that fixed size. If the message has a size which is not a multiple of that fixed size, it may be necessary to pad it to make it the right size to enable it to be split up into the number of pieces needed by that code. We may pad such files by adding enough 0 characters to obtain a conforming size. It will be readily recognized that other characters could be chosen as padding. To recover the original message, those extra O's must be removed. However, removing them may result in also removing any trailing O's that happened to originally be part of the message. In one embodiment, this can be detected because the hash of the recovered file will fail to match the original file's hash. We restore the file to its original condition by simply adding zeros back until the hash matches the one recorded on the certificate that guarantees the data, which will
require an expected number of two tries. Alternately, the size in the certificate can be used to determine which trailing O's are part of the original file, versus those which are additional padding that were added. In this way, there is no need for extra storage for recording the padding information.
[0095] Accelerometers exist, which measure the accelerations experienced by them, and hence by the objects to which they are attached. Nonvolatile memory devices exist, such as bubble memory, on one technical extreme, to litmus paper, which remembers by permanently changing color. By combining nonvolatile memory with accelerometers we may permanently record the greatest shock that some object has experienced over it's lifespan. [§§96] An important source of unreliability in data storage is the damage that disks may suffer, after being fabricated, before being installed in a computer. Static discharge during installation is a potential source of serious damage, as is mechanical shock in transit. By attaching the device described in the previous paragraph to a disk drive during the manufacturing process, disks which have experienced unacceptable abuse levels in transit can be identified to better control the probabilities of disk failure. In addition, probability models can be developed which allow for estimating, according to the level of shock a disk has experienced, what the impact of that event will be on it's mean time before failure. [0097] Reduction of probability is readily achieved through independence. If an event has a 10% chance of occurring then there is only a 1% chance of two independent occurrences of that event. Thus, achieving maximal levels of reliability is facilitated greatly by achieving Independence of failure events. A disk that fails may cause other disks to fail because they are electrically connected. To achieve independence , we can isolate the data connections by converting the voltage signal into an optical signal, which is then converted back to a voltage signal. Circuitry which limits current and/or electrical potential may also be employed. We can isolate the power connections bringing independent DC lines to each disk. To make this
approach economical we can connect a number of power leads coming from a given power supply to disks that are part of distinct computer systems. If a power supply should fail, causing damage to the attached disks at most one disk in each system would be affected and the RAID protection would keep any of those systems from being harmed. [§098] Embodiments of the present invention may be implemented using a conventional general purpose computer or micro-processor programmed according to the teachings of the present invention, as will be apparent to those skilled in the computer art. Appropriate software can readily be prepared by programmers of ordinary skill based on the teachings of the present disclosure, as will be apparent to those skilled in the software art. [0099] A non-limiting example of a computer 100 as shown in Figure 9 may implement the method of the present invention, wherein the computer housing 102 houses a motherboard 104 which contains a CPU 106, memory 108 (e.g., DRAM, ROM, EPROM, EEPROM, SRAM, SDRAM, and Flash RAM), and other optical special purpose logic devices (e.g., ASICS) or configurable logic devices (e.g., GAL and reprogrammable FPGA). The computer 100 also includes plural input devices, (e.g., keyboard 122 and mouse 124), and a display card 110 for controlling a monitor 120. Additionally, the computer 100 may include a floppy disk drive 114; other removable media devices (e.g. compact disc 119, tape, and removable magneto-optical media (not shown)); and a hard disk 112 or other fixed high density media drives, connected using an appropriate device bus (e.g., a SCSI bus, an Enhanced IDE bus, an Ultra DMA bus, a SATA bus, a fibre channel storage network, or an IP network using network block devices or iSCSI ). The computer may also include a compact disc reader 118, a compact disc reader/writer unit (not shown), or a compact disc jukebox (not shown), which may be connected to the same device bus or to another device bus.
[§§100] As stated above, the system includes at least one computer readable medium. Examples of computer readable media are compact discs 119, hard disks 112, floppy disks, tape, magneto-optical disks, PROMs (e.g., EPROM3 EEPROM, Flash EPROM), DRAM, SRAM, SDRAM, etc. Stored on any one or on a combination of computer readable media, the present invention includes software for controlling both the hardware of the computer 100 and for enabling the computer to interact with a human user. Such software may include, but is not limited to, device drivers, operating systems and user applications, such as development tools. Such computer readable media further includes the computer program product of the present invention for performing the inventive method herein disclosed. The computer code devices of the present invention can be any interpreted or executable code mechanism, including but not limited to, scripts, interpreters, dynamically linked libraries, Java classes, and complete executable programs. Moreover, parts of the processing of the present invention may be distributed for better performance, reliability, and/or cost. [00101] The invention may also be implemented by the preparation of application specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.
[00102] Obviously, numerous modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.
Claims
1. A method for guaranteeing retrieval of stored data, comprising: receiving, from a sender, a data file being stored to be stored at a data custodian; creating a storage certificate associated with said data file being stored, the storage certificate including a partial representation of the data file being stored and an electronic signature of the data custodian; storing the data file being stored in a storage unit at the data custodian; and returning said storage certificate to the sender to verify storage of the data file being stored at the data custodian.
2. The method of claim 1, wherein said receiving comprises receiving the data file being stored via the Internet.
3. The method of claim 1, wherein said receiving comprises receiving an encrypted data file being stored.
4. The method of claim 1, wherein said receiving comprises receiving the data file being stored via e-mail or uploading the data file to the data custodian.
5. The method of claim 1, wherein said partial representation comprises a unique representation of the data file being stored that can be used to authenticate the data file being stored but cannot be used to recreate the data file being stored.
6. The method of claim 1, wherein said partial representation comprises a hash value of the data file being stored computed according to a predetermined algorithm.
7. The method of claim 1, wherein said electronic signature comprises a digital signature.
8. The method of claim 1, wherein said returning comprises returning the storage certificate to the sender via e-mail or downloading the storage certificate to the sender.
9. The method of claim 6, further comprising: receiving at the custodian, a request for a requested data file, the request being received from a requestor and including a requested file label and a data request certificate; processing the request to determine whether the requested data file can be provided by the custodian.
10. The method of claim 9, wherein said processing comprises: obtaining an electronic signature from the data request certificate; and determining whether the electronic signature of the data request certificate corresponds to the electronic signature of the data storage certificate.
11. The method of claim 10, wherein said processing further comprises: where the electronic signature of the data request certificate corresponds to the electronic signature of the data storage certificate, retrieving a file from storage having the requested file label; and computing a hash value for the retrieved file based on said predetermined algorithm.
12. The method of claim 11, wherein said processing further comprises: comparing the hash value of the retrieved file with the hash value of the stored data file; where the hash value of the retrieved data file matches the hash value of the stored data file, returning the retrieved data file to the requestor as the requested data file; and where the hash value of the retrieved data file does not match the hash value of the stored data file, taking corrective action.
13. The method of claim 12, wherein said corrective action comprises compensating the requestor for lost data.
14. A computer program product storing computer program instructions which when executed by a computer cause performance of the method recited in Claim 1.
15. A system for guaranteeing backup storage of a data file comprising: a memory configured to store data for guaranteeing backup storage of a data file; a storage unit configured to store data files; and a processor configured to: receive, from a sender, a data file being stored to be stored In said storage unit; create a storage certificate associated with said data file being stored, the storage certificate including a partial representation of the data file being stored and an electronic signature associated with the storage unit; store the data file being stored In said storage unit; and return said storage certificate to the sender to verify storage of the data file being stored In the storage unit.
16. The system of claim 15, wherein said processor creates the storage certificate by computing a hash value of the data file being stored according to a predetermined algorithm.
17. The system of claim 16, wherein said processor is further configured to: receive a request for a requested data file, the request being received from a requestor and including a requested file label and a data request certificate; and process the request to determine whether the requested data file can be provided to the requestor,
18. The system of claim 17, wherein the processor processes the request by: obtaining an electronic signature from the data request certificate; and determining whether the electronic signature of the data request certificate corresponds to the electronic signature of the data storage certificate.
19. The system of claim 18, wherein said processor processes the request by: where the electronic signature of the data request certificate corresponds to the electronic signature of the data storage certificate, retrieving a file from storage having the requested file label; and computing a hash value for the retrieved file based on said predetermined algorithm.
20. The system of claim 19, wherein said processor processes the request by: comparing the hash value of the retrieved file with the hash value of the stored data file; where the hash value of the retrieved data file matches the hash value of the stored data file, returning the retrieved data file to the requestor as the requested data file; and where the hash value of the retrieved data file does not match the hash value of the stored data file, Initiating payment to the requestor of a predetermined dollar amount for lost data.
21. A method for backup storing data files, comprising: receiving, from a sender, a data file being stored to be stored at a data custodian; creating N data chunks from the data file such that the data file can be recreated using < N of the data chunks; and storing the N data chunks in a plurality of storage locations.
22. The method of claim 21, wherein said receiving comprises receiving the data file being stored via the Internet.
23. The method of claim 21, wherein said receiving comprises receiving an encrypted data file being stored.
24. The method of claim 21, wherein said receiving comprises receiving the data file being stored via e-mail or uploading the data file to the data custodian.
25. The method of claim 21, wherein said creating comprises creating the N data chunks using a predetermined error code process or a predetermined erasure code process.
26. The method of claim 25, wherein said creating comprises creating the N data chunks using a sparse matrix method of a Gallager code family.
27. The method of claim 25, wherein said creating comprises selecting an encoding method based on a determination of the total cost of losing the data file.
28. The method of claim 21, further comprising: receiving at the custodian, a request for a requested data file, the request being received from a requestor and including a requested file label; processing the request to determine whether the requested data file can be provided by the data custodian.
29. The method of claim 28, wherein said processing comprises: retrieving at least some of the N data chunks from their respective storage locations; and decoding the retrieved data chunks to determine whether the data file can be recreated.
30. The method of claim 29, wherein said processing further comprises, where the data file can be recreated, providing the recreated data file to said sender.
31. The method of claim 29, wherein said processing further comprises, where the data file can not be recreated, taking corrective action.
32. The method of claim 31, wherein said corrective action comprises paying the requestor a predetermined dollar amount for lost data.
33. The method of claim 21, further comprising: creating a storage certificate associated with said data file being stored, the storage certificate including a partial representation of the data file being stored and an electronic signature of the data custodian; storing the data file being stored in a storage unit at the data custodian; and returning said storage certificate to the sender to verify storage of the data file being stored at the data custodian.
34. The method of claim 32, wherein said partial representation comprises a unique representation of the data file being stored that can be used to authenticate the data file being stored but cannot be used to recreate the data file being stored.
35. The method of claim 32, wherein said partial representation comprises a hash value of the data file being stored computed according to a predetermined algorithm.
36. The method of claim 32, wherein said electronic signature comprises a digital signature.
37. A computer program product storing computer program instructions which when executed by a computer cause performance of the method recited in Claim 21.
38. A system for providing reliable backup storage of a data file comprising: a memory configured to store data for reliable backup storage of a data file; a storage unit configured to store data files; and a processor configured to: receive, from a sender, a data file being stored to be stored at a data custodian; create N data chunks from the data file such that the data file can be recreated using < N of the data chunks; and store the N data chunks in a plurality of storage locations.
39. The system of claim 38, wherein said processor creates the N data chunks using a predetermined error code process or a predetermined erasure code process.
40. The system of claim 39, wherein said processor is further configured to: receive a request for a requested data file, the request being received from a requestor and including a requested file label; and process the request to determine whether the requested data file can be provided by the data custodian.
41. The system of claim 28, wherein said processor processes the request by: retrieving at least some of the N data chunks from their respective storage locations; and decoding the retrieved data chunks to determine whether the data file can be recreated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2007/064810 WO2008118168A1 (en) | 2007-03-23 | 2007-03-23 | Method and system for backup storage of electronic data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2007/064810 WO2008118168A1 (en) | 2007-03-23 | 2007-03-23 | Method and system for backup storage of electronic data |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008118168A1 true WO2008118168A1 (en) | 2008-10-02 |
Family
ID=39788778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/064810 WO2008118168A1 (en) | 2007-03-23 | 2007-03-23 | Method and system for backup storage of electronic data |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008118168A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2478438A4 (en) * | 2009-09-16 | 2013-03-13 | Varaani Works Oy | STORAGE METHOD AND SERVER FOR DATA REDUNDANCY |
US8453257B2 (en) | 2009-08-14 | 2013-05-28 | International Business Machines Corporation | Approach for securing distributed deduplication software |
US9832769B2 (en) | 2009-09-25 | 2017-11-28 | Northwestern University | Virtual full duplex network communications |
CN111277651A (en) * | 2020-01-20 | 2020-06-12 | 国网江苏招标有限公司 | Remote bidding method and system |
US10922173B2 (en) | 2018-03-13 | 2021-02-16 | Queen's University At Kingston | Fault-tolerant distributed digital storage |
CN113196702A (en) * | 2018-11-16 | 2021-07-30 | 先进信息技术公司 | System and method for distributed data storage and transfer using blockchains |
CN115459882A (en) * | 2022-09-05 | 2022-12-09 | 福建天晴在线互动科技有限公司 | Remote file repair method and system |
CN118331516A (en) * | 2024-06-17 | 2024-07-12 | 北京城建智控科技股份有限公司 | Data processing method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020138572A1 (en) * | 2000-12-22 | 2002-09-26 | Delany Shawn P. | Determining a user's groups |
US20050114653A1 (en) * | 1999-07-15 | 2005-05-26 | Sudia Frank W. | Certificate revocation notification systems |
US20050138110A1 (en) * | 2000-11-13 | 2005-06-23 | Redlich Ron M. | Data security system and method with multiple independent levels of security |
US20050204273A1 (en) * | 2004-02-06 | 2005-09-15 | Samsung Electronics Co., Ltd. | Apparatus and method for encoding and decoding a space-time low density parity check code with full diversity gain |
-
2007
- 2007-03-23 WO PCT/US2007/064810 patent/WO2008118168A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050114653A1 (en) * | 1999-07-15 | 2005-05-26 | Sudia Frank W. | Certificate revocation notification systems |
US20050138110A1 (en) * | 2000-11-13 | 2005-06-23 | Redlich Ron M. | Data security system and method with multiple independent levels of security |
US20020138572A1 (en) * | 2000-12-22 | 2002-09-26 | Delany Shawn P. | Determining a user's groups |
US20050204273A1 (en) * | 2004-02-06 | 2005-09-15 | Samsung Electronics Co., Ltd. | Apparatus and method for encoding and decoding a space-time low density parity check code with full diversity gain |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8453257B2 (en) | 2009-08-14 | 2013-05-28 | International Business Machines Corporation | Approach for securing distributed deduplication software |
EP2478438A4 (en) * | 2009-09-16 | 2013-03-13 | Varaani Works Oy | STORAGE METHOD AND SERVER FOR DATA REDUNDANCY |
US9832769B2 (en) | 2009-09-25 | 2017-11-28 | Northwestern University | Virtual full duplex network communications |
US10922173B2 (en) | 2018-03-13 | 2021-02-16 | Queen's University At Kingston | Fault-tolerant distributed digital storage |
CN113196702A (en) * | 2018-11-16 | 2021-07-30 | 先进信息技术公司 | System and method for distributed data storage and transfer using blockchains |
CN111277651A (en) * | 2020-01-20 | 2020-06-12 | 国网江苏招标有限公司 | Remote bidding method and system |
CN111277651B (en) * | 2020-01-20 | 2024-04-09 | 国网江苏招标有限公司 | Remote bidding method and system |
CN115459882A (en) * | 2022-09-05 | 2022-12-09 | 福建天晴在线互动科技有限公司 | Remote file repair method and system |
CN118331516A (en) * | 2024-06-17 | 2024-07-12 | 北京城建智控科技股份有限公司 | Data processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8171102B2 (en) | Smart access to a dispersed data storage network | |
US8868969B2 (en) | Method and apparatus for rebuilding data in a dispersed data storage network | |
WO2008118168A1 (en) | Method and system for backup storage of electronic data | |
US9996413B2 (en) | Ensuring data integrity on a dispersed storage grid | |
Chen et al. | NCCloud: A network-coding-based storage system in a cloud-of-clouds | |
US8352782B2 (en) | Range based rebuilder for use with a dispersed data storage network | |
US8880940B2 (en) | Pessimistic data reading in a dispersed storage network | |
JP6162239B2 (en) | Data storage application programming interface | |
US7240236B2 (en) | Fixed content distributed data storage using permutation ring encoding | |
US10693640B2 (en) | Use of key metadata during write and read operations in a dispersed storage network memory | |
US20140344228A1 (en) | Multiple Revision Mailbox | |
AU2021302873B2 (en) | Secure secret recovery | |
US7729926B1 (en) | Methods and apparatus for backing up and restoring data | |
US11169973B2 (en) | Atomically tracking transactions for auditability and security | |
US8239706B1 (en) | Data retrieval system and method that provides retrieval of data to any point in time | |
US12411736B2 (en) | Data reconstruction in a storage network and methods for use therewith | |
Chen et al. | Towards server-side repair for erasure coding-based distributed storage systems | |
US9971802B2 (en) | Audit record transformation in a dispersed storage network | |
Goodson et al. | Efficient consistency for erasure-coded data via versioning servers | |
Jaikar et al. | Verifying Data Integrity in Cloud | |
More et al. | Efficient and Secure Data Dynamic Operations in Cloud Computing | |
Shrividhya et al. | Erasure correcting to enhance data security in cloud data storage | |
Curtmola et al. | ◾ Availability, Recovery, and Auditing across Data Centers | |
Patil et al. | Data Integrity Check In Cloud Using Dispersal Code | |
Dhanalakshmi et al. | Remote Data integrity in cloud security services |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07759269 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07759269 Country of ref document: EP Kind code of ref document: A1 |