
HK1105695B - Performing a preemptive reconstruct of a fault-tolerant raid array - Google Patents


Info

Publication number
HK1105695B
HK1105695B (application HK07113951.3A)
Authority
HK
Hong Kong
Prior art keywords: data, disk, disks, redundant array, array
Prior art date
Application number
HK07113951.3A
Other languages
Chinese (zh)
Other versions
HK1105695A1 (en)
Inventor
Paul Ashmore (保罗‧阿什莫尔)
Original Assignee
Dot Hill Systems Corporation (达西系统股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 10/911,110 (US7313721B2)
Application filed by 达西系统股份有限公司
Publication of HK1105695A1
Publication of HK1105695B

Description

Performing a preemptive reconstruct of a fault-tolerant disk array
Cross reference to related applications
This application claims priority to the following U.S. provisional applications, the entire contents of which are incorporated herein by reference for all purposes.
This application is related to the following U.S. applications:
Application No. (Docket No.)    Filing Date    Title
60/581556 (CHAP0119)    6/21/04    Preemptive reconstruct for redundant disk arrays
Technical Field
The present invention relates generally to redundant array of disks (RAID) controllers and, more particularly, to improving the availability of data on the arrays of storage devices controlled by such controllers.
Background
Computer systems have for years included disk controllers capable of striping data across entire groups, or arrays, of multiple physical disks, such that the controller presents a single logical disk to the computer operating system. To illustrate striping, assume a striped array of four physical disks, each having a capacity of 100 GB, and assume the array is striped with a stripe size, or chunk size, of eight sectors, or 4 KB. In this example, on the first physical disk the controller stores the first, fifth, ninth, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively; on the second physical disk the controller stores the second, sixth, tenth, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively; on the third physical disk the controller stores the third, seventh, eleventh, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively; and on the fourth physical disk the controller stores the fourth, eighth, twelfth, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively.
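The block-to-disk mapping in the striping example above can be sketched in a few lines of Python (an illustration added here, not part of the patent; the disk count and stripe size follow the four-disk, 4 KB example):

```python
STRIPE_SIZE = 4096   # 4 KB stripe: eight 512-byte sectors
NUM_DISKS = 4        # the four-disk striped array of the example

def locate_block(logical_block: int) -> tuple[int, int]:
    """Map a 0-based logical 4 KB block number to (disk index, block offset on that disk)."""
    disk = logical_block % NUM_DISKS      # blocks rotate across the disks in order
    offset = logical_block // NUM_DISKS   # each disk holds every 4th block of the logical disk
    return disk, offset

# Logical blocks 0, 4, 8, ... land on disk 0; blocks 1, 5, 9, ... on disk 1; and so on.
print(locate_block(0))   # (0, 0)
print(locate_block(5))   # (1, 1)
```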
One advantage of striping is that it provides a logical disk with a larger storage capacity than the maximum capacity of the individual physical disks. In the example above, the result is a logical disk with a storage capacity of 400 GB.
Perhaps a more important benefit of striping is improved performance. In a random I/O environment, such as a multi-user file server, database server, or transaction processing server, the performance improvement is achieved primarily by selecting a stripe size such that a typical read I/O request requires access to substantially only one disk of the array. Each disk in the array can then seek to a different cylinder simultaneously to satisfy a different I/O request, thereby exploiting the multiple spindles of the array. In a high data throughput environment, such as a video-on-demand server, the performance improvement is generally achieved by selecting a stripe size such that a typical read I/O request spans all the disks of the array, so that the controller can read the disks in parallel while keeping all the disks on the same cylinder. In this environment, the spindles of the various disks of the array are typically synchronized.
However, a problem with a striped array of disks is that the reliability of the array as a whole is lower than the reliability of each of its individual disks. This is because if the data stored on one disk becomes unavailable due to a failure of that disk, then from the computer's perspective all the data of the logical disk is unavailable, since it is not acceptable for the controller to return only a portion of the requested data. Disk reliability is typically measured in terms of Mean Time Between Failures (MTBF); as the number of disks in a striped (RAID 0) array increases, the MTBF of the array decreases, possibly to a level that is unacceptable for many applications.
To address this problem, the notion of redundancy was introduced into arrays of disks. In a redundant array of disks, additional, or redundant, disks are added to the array. The redundant disks do not increase the storage capacity of the logical disk; instead, redundant data is stored on one or more disks of the array so that, even if one of the disks of the array fails, the controller can still supply the requested logical disk data to the computer. Thus, when the array is in a redundant state, i.e., when none of the disks of the array has failed, the array is fault-tolerant because it can tolerate one disk failure and still provide the user data. The primary forms of redundant data are mirror data and parity data, and in many configurations the MTBF of a redundant array of disks may be greater than the MTBF of a single, non-redundant disk.
RAID is an acronym for Redundant Array of Inexpensive Disks, a name coined by Patterson, Gibson, and Katz of the University of California, Berkeley, in their 1987 paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)." The late 1980s witnessed the widespread adoption of RAID systems, which have since become the dominant form of mass storage for server-class computing environments. The original RAID paper defined five different forms of redundant arrays of disks, referred to as RAID levels 1 through 5, respectively. Other forms have been developed since then, and a purely striped array has come to be called RAID level 0. The various RAID levels and their associated performance and reliability characteristics are well known in the art, but are briefly discussed here to facilitate an understanding of the problems solved by the present invention.
RAID level 1 employs disk mirroring. A RAID 1 array comprises a pair of disks. Each time the computer issues a write command to the RAID controller for a RAID 1 logical disk, the RAID controller writes the data to both disks of the pair, thereby maintaining mirror copies of the data on the pair of disks. Each time the computer issues a read command to the RAID controller for a RAID 1 logical disk, the RAID controller reads from only one of the disks. If one disk of a RAID 1 array fails, the data may be read from the other disk of the array. An extension of RAID 1 is RAID 10, which comprises a striped array of mirrored disk pairs; RAID 10 provides the reliability advantages of RAID 1 along with the performance advantages of RAID 0.
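The mirroring behavior just described, writes going to both disks of the pair while reads survive a single-disk failure, can be modeled with a toy sketch (illustrative only; `Raid1Pair` is a hypothetical name, not from the patent):

```python
class Raid1Pair:
    """Toy model of a two-disk mirror: writes go to both disks, reads to one."""
    def __init__(self, size: int):
        self.disks = [bytearray(size), bytearray(size)]
        self.failed = [False, False]

    def write(self, addr: int, data: bytes) -> None:
        for d in self.disks:                      # maintain mirror copies on both disks
            d[addr:addr + len(data)] = data

    def read(self, addr: int, length: int) -> bytes:
        # Read from the first surviving disk; the mirror still serves data after a failure.
        disk = 0 if not self.failed[0] else 1
        return bytes(self.disks[disk][addr:addr + length])

pair = Raid1Pair(16)
pair.write(0, b"user")
pair.failed[0] = True                # disk 0 fails
assert pair.read(0, 4) == b"user"    # data still available from the mirror disk
```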
RAID level 4 employs parity striping. A RAID 4 array requires at least three physical disks. For example, assume a four-disk RAID 4 array with a stripe size of 4 KB, three of the disks being data disks and the fourth being a parity disk. In this example, on the first data disk the controller stores the first, fourth, seventh, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively; on the second data disk the controller stores the second, fifth, eighth, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively; and on the third data disk the controller stores the third, sixth, ninth, etc. 4 KB blocks of the logical disk on the first, second, third, etc. groups of eight sectors, respectively. The controller stores the parity (the binary XOR, or exclusive-OR) of the first 4 KB blocks of the three data disks in the first 4 KB block of the parity disk, the binary XOR of the second 4 KB blocks of the three data disks in the second 4 KB block of the parity disk, the binary XOR of the third 4 KB blocks of the three data disks in the third 4 KB block of the parity disk, and so on. Thus, whenever the controller writes to one or more of the data disks, it must also compute the parity of all the data in the associated blocks of all the data disks and write the parity to the associated block of the parity disk. When the controller reads data, it reads only from the data disks, not from the parity disk.
If one of the data disks of the RAID 4 array fails, the data of the failed data disk may be reconstructed by reading the data from the remaining data disks and the parity disk and binary-XORing that data together. This characteristic of the binary XOR operation is what gives a parity-redundant array of disks its redundancy, enabling the RAID controller to return user data to the computer even when a data disk has failed.
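The XOR property that makes parity reconstruction possible can be demonstrated directly (a minimal sketch added for illustration, not code from the patent):

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # the three data disks' blocks in a stripe
parity = xor_blocks(d0, d1, d2)          # the block written to the parity disk

# Data disk 1 fails: its block is recovered by XORing the survivors with the parity.
recovered = xor_blocks(d0, d2, parity)
assert recovered == d1
```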
RAID level 5 is similar to RAID level 4, except that it has no dedicated parity disk; rather, for each stripe, the parity is stored on a different disk of the array, so that the parity is distributed across all the disks. That is, the parity disk rotates for each stripe along the array. RAID level 5 improves write performance in a random I/O environment by eliminating the write bottleneck of a single parity drive.
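The rotating-parity placement can be sketched as follows (one possible rotation scheme, shown only for illustration; real RAID 5 implementations choose among several standard layouts):

```python
NUM_DISKS = 4   # disks in the RAID 5 array

def parity_disk_for_stripe(stripe: int) -> int:
    """Rotate the parity disk backward by one for each successive stripe,
    so parity writes are spread evenly across all disks (illustrative layout)."""
    return (NUM_DISKS - 1 - stripe) % NUM_DISKS

# Parity rotates: stripe 0 -> disk 3, stripe 1 -> disk 2, stripe 2 -> disk 1, ...
assert [parity_disk_for_stripe(s) for s in range(5)] == [3, 2, 1, 0, 3]
```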
As described above, when a disk of a redundant array fails, the array is no longer fault-tolerant, i.e., the array cannot tolerate the failure of a second disk. An exception to this rule is a RAID level that provides multiple redundancy, such as RAID level 6, which is similar to RAID 5 but provides two-dimensional parity, so that a RAID 6 array can tolerate two disk failures and continue to provide user data. That is, a RAID 6 array with one failed disk is no longer fully redundant but is still fault-tolerant; once two disks of the RAID 6 array have failed, the array is no longer fault-tolerant.
To restore a redundant array of disks from a non-fault-tolerant (or non-fully-redundant) state to a fault-tolerant (or fully redundant) state, the array must be rebuilt. In particular, the data of the failed disk must be reconstructed and written to a new disk that is included in the array in its place. For a parity-redundant array, reconstructing the data of the failed disk comprises reading the data from the remaining disks and binary-XORing that data together. For a mirrored redundant array, reconstructing the data of the failed disk comprises simply reading the data from the failed disk's mirror disk. Once the RAID controller has reconstructed the data, written it to the new disk, and logically replaced the failed disk with the new disk in the array, the array is restored to a fault-tolerant (or fully redundant) state, i.e., the array has been rebuilt.
When a disk failure occurs, most RAID controllers notify the system administrator in some manner so that the administrator can cause the redundant array to be rebuilt. This typically requires the administrator to physically replace the failed disk with a new disk and to instruct the RAID controller to perform the rebuild. Some RAID controllers reduce the amount of time during which the redundant array of disks is non-fault-tolerant (or not fully redundant) by automatically performing a rebuild of the array in response to a disk failure. Typically, when the administrator initially configures the redundant arrays of a system, the administrator also configures one or more spare disks connected to the RAID controller, so that the RAID controller can automatically use a spare as the new disk of the array when a disk failure occurs.
Other RAID controllers may predict that a disk of the array will fail by detecting non-fatal errors issued by the disk, i.e., errors that do not cause a disk failure. Such a RAID controller notifies the system administrator that the disk is experiencing errors so that the administrator can initiate a rebuild of the array. However, a disadvantage of performing the rebuild is that, because the rebuild removes the error-generating disk from the array, the array is non-fault-tolerant (or not fully redundant) during the rebuild; the consequences are more severe if another disk of the array fails during the rebuild.
Therefore, there is a need for a RAID controller that performs a rebuild of an array having a disk that is predicted to fail without the array entering a non-fault-tolerant state, i.e., a controller that leaves the array in a fault-tolerant (or fully redundant) state during the rebuild.
Disclosure of Invention
It is an object of the present invention to provide an apparatus for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant, comprising: means for determining whether a number of errors issued by one of the disks of the redundant array has exceeded an error threshold; means for reading data from a second of the disks of the redundant array and writing the data to a spare disk in response to the determination, wherein the data of the second of the disks of the redundant array is a mirror copy of the data of the one of the disks; means for replacing the one of the disks of the redundant array with the spare disk after the reading and writing are completed; and means for writing second data to the one of the disks, in response to a user write request comprising the second data, while the reading and writing are being performed, thereby maintaining the fault tolerance of the redundant array.
It is another object of the present invention to provide an apparatus for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant, comprising: means for determining whether a number of errors issued by one of the disks of the redundant array has exceeded an error threshold; means for generating the data of the one of the disks from second data read from two or more others of the disks of the redundant array, and writing the generated data to a spare disk, in response to the determination; means for replacing the one of the disks of the redundant array with the spare disk after the generating and writing are completed; and means for writing third data to the one of the disks of the redundant array, in response to a user write request comprising the third data, while the generating and writing are being performed, thereby maintaining the fault tolerance of the redundant array.
It is yet another object of the present invention to provide a method for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant. The method includes determining whether a number of errors issued by one of the disks of the redundant array has exceeded an error threshold. The method also includes reading data from a second of the disks of the redundant array and writing the data to a spare disk in response to determining that the error threshold has been exceeded, wherein the data of the second of the disks of the redundant array is a mirror copy of the data of the one of the disks. The method further comprises replacing the one of the disks of the redundant array with the spare disk after completing the reading and writing. The method also includes writing second data to the one of the disks of the redundant array, in response to a user write request comprising the second data, while performing the reading and writing, thereby maintaining the fault tolerance of the redundant array.
It is yet another object of the present invention to provide a method for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant. The method includes determining whether a number of errors issued by one of the disks of the redundant array has exceeded an error threshold. The method also includes generating the data of the one of the disks from second data read from two or more others of the disks of the redundant array, and writing the generated data to a spare disk, in response to determining that the error threshold has been exceeded. The method also includes replacing the one of the disks of the redundant array with the spare disk after completing the generating and writing. The method also includes writing third data to the one of the disks of the redundant array, in response to a user write request comprising the third data, while performing the generating and writing, thereby maintaining the fault tolerance of the redundant array.
One advantage of the present invention is that it performs the rebuild without placing the array in a non-fault-tolerant state, i.e., the array remains fault-tolerant during the preemptive reconstruct, thereby reducing the likelihood of data loss for the redundant array of disks. That is, during a preemptive reconstruct the redundant array can tolerate a disk failure without data loss, whereas during a conventional reconstruct the array cannot tolerate a disk failure without data loss. Furthermore, for a parity-redundant array, copying the data from the critical disk to the spare disk typically takes less time than the well-known reconstruction of a failed disk from the other disks of the array, since the copy involves reading only one disk rather than reading multiple disks and performing exclusive-OR operations. Finally, in one embodiment, by automatically performing the preemptive reconstruct in response to the error threshold being reached, the time during which the critical disk remains part of the array is reduced, because the time required for a user to receive notification, determine how to proceed, and initiate the preemptive reconstruct is eliminated. In addition, the chance of user error is reduced.
Drawings
FIG. 1 is a computer network including a RAID controller according to the present invention.
FIGS. 2A and 2B are flow diagrams illustrating operation of the RAID controller of FIG. 1 to perform a preemptive reconstruct of a redundant array according to the present invention.
FIGS. 3A and 3B are flow diagrams illustrating operation of the RAID controller of FIG. 1 to perform a preemptive reconstruct of a redundant array according to alternative embodiments of the present invention.
FIGS. 4A and 4B are flow diagrams illustrating operation of the RAID controller of FIG. 1 to perform a preemptive reconstruct of a redundant array according to alternative embodiments of the present invention.
FIG. 5 is a computer system including a RAID controller according to an alternate embodiment of the present invention.
FIG. 6 is a software RAID controller according to an alternative embodiment of the present invention.
Detailed Description
Referring now to FIG. 1, a computer network 100 including a RAID controller 102 in accordance with the present invention is shown. The network 100 includes one or more host computers 104 coupled to a RAID controller 102 and a plurality of disks 142, or disk drives 142, coupled to the RAID controller 102.
In one embodiment, the disks 142 comprise hard disk drives; however, the disks 142 may comprise any permanent storage device, such as tape drives or optical drives. The disks 142 are connected to the RAID controller 102 by storage device transport media 112. The transport media 112 and the protocols carried on them may include, but are not limited to, Fibre Channel (FC), Advanced Technology Attachment (ATA), Serial Advanced Technology Attachment (SATA), Small Computer System Interface (SCSI), Serial Attached SCSI (SAS), HIPPI, Enterprise System Connection (ESCON), Fibre Connection (FICON), Ethernet, Infiniband, or combinations thereof. The disks 142 and the RAID controller 102 may communicate using stacked protocols, such as SCSI over Fibre Channel or Internet SCSI (iSCSI).
The disks 142 are grouped into redundant arrays 114. FIG. 1 shows three redundant arrays 114: one comprising four disks 142, another comprising three disks 142, and a third comprising six disks 142. A redundant array 114 may be configured according to any of the well-known RAID levels, such as RAID level 1, 2, 3, 4, 5, 6, 10, or 50. Furthermore, the present invention is not limited to the presently known RAID levels; redundant arrays of disks 114 configured according to RAID levels developed hereafter may also be employed. Likewise, non-redundant arrays, such as RAID 0 arrays, may also be employed with the present invention. Finally, the present invention may be employed with redundant arrays whether singly redundant (i.e., arrays that can tolerate only a single disk failure) or multiply redundant (i.e., arrays that can tolerate multiple disk failures, such as RAID 6 arrays).
The plurality of disks 142 of a redundant array 114 is presented by the RAID controller 102 to the host computers 104 as a single logical disk. When a host computer 104 requests the RAID controller 102 to write user data to the logical disk, the RAID controller 102 writes the user data to one or more of the disks 142 of the redundant array 114 and also writes redundant data to one or more other disks 142 of the redundant array 114. The redundant data is typically a mirror copy of the user data, or parity data calculated from the user data in some manner, according to the various RAID levels. Because the redundant data is written in addition to the user data, after the user data and redundant data have been written to the redundant array 114, the RAID controller 102 can still supply the user data when requested by a host computer 104, even if one of the disks 142 of the redundant array 114 has failed. In the case of a mirrored redundant array 114, the RAID controller 102 simply reads the data of the failed disk 142 from the failed disk's mirror disk 142. In the case of a parity-based redundant array 114, the RAID controller 102 reads the data stored on the non-failed disks 142 and computes the parity of that data to regenerate the data of the failed disk 142.
In addition, some of the disks 142 are configured as spare disks 116; FIG. 1 shows two spare disks 116 connected to the RAID controller 102. A spare disk 116 is not part of a redundant array 114; rather, it is available to the RAID controller 102 to automatically replace a disk 142 of a redundant array 114, whether in the case of a conventional reconstruct performed in response to the failure of a disk 142 of the redundant array 114, or in the case of a preemptive reconstruct performed in response to non-fatal errors of a disk 142.
The host computers 104 may include, but are not limited to, workstations, personal computers, notebook computers, personal digital assistants (PDAs), file servers, print servers, enterprise servers, mail servers, web servers, database servers, departmental servers, and the like. In the embodiment of FIG. 1, the host computers 104 are connected to the RAID controller 102 by host transport media 108. The host transport media 108 and the protocols carried on them may include, but are not limited to, Fibre Channel, Ethernet, Infiniband, TCP/IP, Small Computer System Interface (SCSI), HIPPI, Token Ring, Arcnet, FDDI, LocalTalk, ESCON, FICON, ATM, SAS, SATA, and the like, and combinations thereof. The RAID controller 102 receives I/O requests from the host computers 104 via the host transport media 108 and transfers user data between the host computers 104 and the redundant arrays 114 of disks 142 via the host transport media 108. The host transport media 108 may be part of a network including links, switches, routers, hubs, and the like.
The RAID controller 102 includes a memory 124, a microprocessor 126, a management controller 128, a buffer memory 132, a bus bridge 134, a host interface adapter 136, and a storage device interface adapter 138. As shown, in one embodiment, each of the host interface adapter 136, the storage device interface adapter 138, and the microprocessor 126 is coupled to the bus bridge 134 via an associated local bus. In one embodiment, the local buses comprise high-speed local buses, including but not limited to PCI-X, CompactPCI, or PCI Express buses. In one embodiment, the bus bridge 134 also includes a memory controller for controlling the buffer memory 132. In one embodiment, the buffer 132 and the bus bridge 134 are connected by a double data rate (DDR) memory bus. The bus bridge 134 enables each of the microprocessor 126, the host interface adapter 136, and the storage device interface adapter 138 to communicate with one another and to transfer data to and from the buffer 132. In one embodiment, the microprocessor 126 comprises a Pentium III microprocessor coupled to its local bus via a second bus bridge, commonly referred to as a north bridge.
The microprocessor 126 is also coupled to a memory 124, which stores program instructions and data executed and used by the microprocessor 126. In particular, the memory 124 stores a critical disk error threshold 122, or error threshold 122, used in performing the preemptive reconstruct of a redundant array 114 as described herein. In one embodiment, the memory 124 stores a separate error threshold 122 for each of the disks 142 coupled to the RAID controller 102. In one embodiment, the error threshold 122 is specified by the user; in another embodiment, the error threshold 122 is predetermined. Although FIG. 1 shows the error threshold 122 stored in the memory 124 directly coupled to the microprocessor 126, the error threshold 122 may be stored in any memory accessible by the RAID controller 102, such as the buffer 132, a memory of the management controller 128, or storage on the disks 142.
In one embodiment, the management controller 128 comprises an Advanced Micro Devices Elan(TM) microcontroller, connected to the local bus through a third bus bridge, commonly referred to as a south bridge. In one embodiment, the management controller 128 is also coupled to a memory that stores program instructions and data executed and used by the management controller 128. The management controller 128 is coupled to a management transport medium 106 for performing input and output with a user. The management transport medium 106 may include, but is not limited to, RS-232, Ethernet, Fibre Channel, and Infiniband.
The management controller 128 receives user input for configuring and managing the RAID controller 102 and, in particular, may receive the error threshold 122. In another embodiment, the error threshold 122 is provided by the user via a host computer 104. In one embodiment, the management controller 128 receives user input via a serial interface, such as an RS-232 interface. In one embodiment, the management controller 128 receives user input via an Ethernet interface and provides web-based configuration and management utilities. In addition to its configuration and management functions, the management controller 128 also performs monitoring functions, such as monitoring the temperature, presence, and status of critical components of the RAID controller 102 or of the disks 142, such as fans or power supplies.
The storage device interface adapter 138 interfaces the RAID controller 102 to the storage device transport media 112; in one embodiment, the storage device interface adapter 138 includes two ports for interfacing to two storage device transport media 112. The host interface adapter 136 interfaces the RAID controller 102 to the host transport media 108; in one embodiment, the host interface adapter 136 includes two ports for interfacing to two host transport media 108. The storage device interface adapter 138 implements the protocols necessary to enable the redundant arrays 114, and in particular the disks 142 comprising the redundant arrays 114, to communicate with the RAID controller 102. For example, in one embodiment the storage device interface adapter 138 comprises a JNIC-1560 Milano dual-channel Fibre Channel to PCI-X controller developed by JNI Corporation, which implements the Fibre Channel protocol for transferring Fibre Channel packets between the disks 142 and the RAID controller 102. In another embodiment, the storage device interface adapter 138 comprises an ISP2312 dual-channel Fibre Channel to PCI-X controller produced by QLogic Corporation. The storage device interface adapter 138 includes a direct memory access controller (DMAC) for transferring data directly between the storage device transport media 112 and the buffer 132 via the bus bridge 134.
The host interface adapter 136 implements the protocols necessary to enable the host computers 104 to communicate with the RAID controller 102 over the host transport media 108. For example, in one embodiment the host interface adapter 136 comprises a JNIC-1560 Milano dual-channel Fibre Channel to PCI-X controller, which implements the Fibre Channel protocol for transferring Fibre Channel packets between the host computers 104 and the RAID controller 102; in another embodiment, the host interface adapter 136 comprises a QLogic ISP2312. The host interface adapter 136 includes a direct memory access controller (DMAC) for transferring data directly between the host transport media 108 and the buffer 132 via the bus bridge 134.
The microprocessor 126 receives I/O requests of the host computers 104 from the host interface adapter 136 and processes the requests. Processing the requests may include numerous functions. For example, the logical block numbers and data block counts specified in an I/O request to write data to the redundant array 114 do not correspond directly to the appropriate physical block numbers and counts on the disks 142 comprising the redundant array 114. Hence, the logical block numbers specified in the host I/O request must be translated into the appropriate physical block numbers, and the disks 142 to be used in performing the one or more data transfers between the RAID controller 102 and the disks 142 comprising the redundant array 114 must be determined. This translation function is performed by the microprocessor 126. In one embodiment, the microprocessor 126 performs the translation according to well-known RAID techniques. After performing the translation, the microprocessor 126 programs the storage device interface adapter 138 to perform the data transfers between the disks 142 and the buffer 132. In addition, the microprocessor 126 programs the host interface adapter 136 to perform the data transfers between the host computers 104 and the buffer 132. Thus, when processing a host I/O request to write data from a host computer 104 to the redundant array 114, the microprocessor 126 programs the host interface adapter 136 to transfer the data from the host computer 104 to the buffer 132; after the data is received into the buffer 132, the microprocessor 126 programs the storage device interface adapter 138 to transfer the data from the buffer 132 to the translated appropriate physical block numbers of the disks 142 comprising the redundant array 114.
Conversely, when processing a host I/O request to read data from the redundant array 114 to a host computer 104, the microprocessor 126 programs the storage device interface adapter 138 to transfer the data from the translated appropriate physical block numbers of the disks 142 comprising the redundant array 114 to the buffer 132. After the data is received into the buffer 132, the microprocessor 126 programs the host interface adapter 136 to transfer the data from the buffer 132 to the host computer 104. The microprocessor 126 also performs the function of managing the allocation of portions of the buffer 132 used to perform the data transfers. In one embodiment, the microprocessor 126 also manages the buffer 132 as a cache memory for caching portions of the data buffered in the buffer 132, to improve I/O performance between the redundant arrays 114 and the host computers 104 according to well-known caching techniques. In one embodiment, the microprocessor 126 performs the exclusive-OR operations required to generate the parity data used as redundant data in certain RAID levels, such as RAID level 5. In another embodiment, the microprocessor 126 programs a dedicated XOR circuit to perform the XOR operations on the user data to generate the redundant parity data.
In one embodiment, the microprocessor 126, buffer 132, bus bridge 134, and management controller 128 are included on a first circuit board that provides a local bus backplane for connecting to a second circuit board that includes the host interface adapter 136 and the storage device interface adapter 138. In another embodiment, the management controller 128 is included on a separate circuit board from the circuit board that includes the other elements of the RAID controller 102. In one embodiment, the local bus backplane is passive and hot-pluggable.
The RAID controller 102 shown in FIG. 1 may be, for example, a stand-alone RAID controller coupled to the host computers 104 to provide network-attached storage (NAS) or to serve as part of a storage area network (SAN). However, the present invention is not limited to a stand-alone RAID controller; rather, the RAID controller 102 may be connected to the host computers 104 in other manners, including but not limited to the alternate embodiments shown in FIGS. 5 and 6.
Preferably, the RAID controller 102 is configured to perform a preemptive reconstruct of a redundant array 114 having a potentially failing disk 142, performing the reconstruct while the redundant array 114 is still fault-tolerant, i.e., while the failing disk 142 is still included in the redundant array 114 and operational. Generally speaking, according to one embodiment, the preemptive reconstruct comprises the RAID controller 102 determining, based on an error threshold, that one of the disks 142 in the redundant array 114 (the "critical disk") is likely to fail; in response, copying the data from the critical disk 142 to the spare disk 116; and, after completing the copy, replacing the critical disk 142 in the redundant array 114 with the spare disk 116. While the RAID controller 102 is copying the data from the critical disk 142 to the spare disk 116, the RAID controller 102 continues to write user data to the critical disk 142 in response to user I/O requests. If the user data is written to an address of the critical disk 142 whose data has already been copied to the spare disk 116 (e.g., an address below a high water mark), the RAID controller 102 also writes the user data to the spare disk 116.
Referring now to FIG. 2, a flowchart illustrating operation of the RAID controller 102 of FIG. 1 to perform a preemptive reconstruct of a redundant array 114 according to the present invention is shown. Flow begins at block 202.
At block 202, a user configures the redundant array 114 of disks 142 and the at least one spare disk 116 of FIG. 1. In one embodiment, configuring the redundant array 114 includes specifying a RAID level of the redundant array 114 and the number of disks 142 and/or the storage capacity of the redundant array 114. In one embodiment, configuring the at least one spare disk 116 includes designating which of the disks 142 available to the RAID controller 102 are spare disks 116 for use in reconstructing a redundant array 114. In one embodiment, each spare disk 116 is dedicated to a particular redundant array 114, while in another embodiment the spare disks 116 are configured globally for use in any redundant array 114 requiring a spare disk 116. Flow proceeds to block 204.
At block 204, the user specifies the error threshold 122 of FIG. 1, and the RAID controller 102 stores the error threshold 122. In one embodiment, a critical error is either a correctable error or an uncorrectable error due to a media error. A correctable error is an error that the disk 142 incurs when attempting to read or write its media, but that the disk 142 is able to correct, such as by retrying the operation or by remapping the sector causing the error to another sector, such as a spare sector on the media of the disk 142. In one embodiment, the disk 142 reports a correctable error by returning a CHECK CONDITION status defined in the SCSI specification and a SCSI sense key of RECOVERED ERROR (0x01). An uncorrectable error due to a media error is an error that the disk 142 incurs when attempting to read or write data on its media and that the disk 142 cannot correct. In one embodiment, the disk 142 reports an uncorrectable error due to a media error by returning a CHECK CONDITION status and a SCSI sense key of MEDIUM ERROR (0x03). Commonly, the RAID controller 102 is able to correct an uncorrectable error due to a media error by remapping the sector causing the error, either by explicitly instructing the disk 142 to remap the sector, or by the RAID controller 102 itself reserving spare sectors on the disk 142 and performing the remapping using the reserved spare sectors. The SCSI error codes are given by way of example to illustrate the manner in which the disks 142 may report critical errors; however, the present invention is not limited to a particular disk 142 protocol or manner of reporting critical errors, but may be employed with a variety of protocols.
In one embodiment, the error threshold 122 may comprise a combination of error thresholds. For example, a user may specify one threshold 122 for the number of correctable errors and another threshold 122 for the number of uncorrectable errors due to media errors, as described below with respect to block 212. Flow proceeds to block 206.
At block 206, the RAID controller 102 issues read and write commands to the disks 142 in the redundant array 114 in response to I/O requests from the host computers 104. That is, the RAID controller 102 performs normal operations on the redundant array 114. Flow proceeds to block 208.
At block 208, while the RAID controller 102 is performing normal operations on the redundant array 114, the RAID controller 102 also maintains a count of the number of critical errors generated by the disks 142 in response to the read and write commands issued to them. In one embodiment, the RAID controller 102 maintains separate counts of correctable errors and of uncorrectable errors due to media errors reported by the disks 142. Additionally, during periods of inactivity, such as when the disks 142 of the redundant array 114 are not being read or written in response to user I/O requests, the RAID controller 102 polls each disk 142 for information regarding the number of critical errors and includes the polled information in the critical error counts. Many disks maintain industry-standard Self-Monitoring, Analysis and Reporting Technology (SMART) data, which includes critical error counts. In one embodiment, polling the disks for error information includes polling the disks for SMART data. Flow proceeds to block 212.
At block 212, the RAID controller 102 determines that the number of critical errors for one of the disks 142 in the redundant array 114, referred to as the critical disk 142, has reached the error threshold 122, either in response to a normal read or write command, or in response to the polling for critical error information per block 208. In one embodiment, the RAID controller 102 determines that the critical disk 142 has reached its error threshold 122, thereby triggering the preemptive reconstruct, when the sum of the number of correctable errors and uncorrectable errors due to media errors reaches the error threshold 122. In another embodiment, the RAID controller 102 determines that the critical disk 142 has reached its error threshold 122, thereby triggering the preemptive reconstruct, when the critical disk 142 has generated a number of correctable errors that reaches a correctable error threshold 122, or when the critical disk 142 has generated a number of uncorrectable errors due to media errors that reaches an uncorrectable-error threshold 122, or both. Flow proceeds to block 214.
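The error counting of block 208 and the threshold test of block 212 might be sketched as follows. The class name and structure are illustrative assumptions; only the sense-key values (0x01, 0x03) come from the SCSI codes given in the text:

```python
class DiskErrorTracker:
    """Per-disk critical error counts and the threshold test that triggers
    a preemptive reconstruct (hypothetical sketch of blocks 208-212)."""

    RECOVERED_ERROR = 0x01   # SCSI sense key for a correctable error
    MEDIUM_ERROR = 0x03      # SCSI sense key for an uncorrectable media error

    def __init__(self, correctable_threshold, uncorrectable_threshold,
                 combined_threshold=None):
        self.correctable = 0
        self.uncorrectable = 0
        self.correctable_threshold = correctable_threshold
        self.uncorrectable_threshold = uncorrectable_threshold
        self.combined_threshold = combined_threshold

    def record(self, sense_key):
        """Count a critical error reported by the disk."""
        if sense_key == self.RECOVERED_ERROR:
            self.correctable += 1
        elif sense_key == self.MEDIUM_ERROR:
            self.uncorrectable += 1

    def triggers_reconstruct(self):
        # One embodiment sums both kinds of error against a single
        # threshold; the other compares each count against its own.
        if self.combined_threshold is not None:
            return (self.correctable + self.uncorrectable
                    >= self.combined_threshold)
        return (self.correctable >= self.correctable_threshold
                or self.uncorrectable >= self.uncorrectable_threshold)
```

A controller would keep one such tracker per disk 142, folding in both command-completion errors and SMART-polled counts.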
At block 214, the RAID controller 102 selects one of the spare disks 116 to be the target of the background copy, which is part of the automatic preemptive reconstruct. Flow proceeds to block 216.
At block 216, the RAID controller 102 performs a background copy of the data on the critical disk 142 to the spare disk 116 selected at block 214. The RAID controller 102 performs the copy while the redundant array 114 is still fault-tolerant and the critical disk 142 is still part of the redundant array 114 performing its normal function, i.e., while the RAID controller 102 is still reading and writing the critical disk 142 in response to user I/O requests. In one embodiment, the RAID controller 102 reads the data from the beginning address of the critical disk 142 and writes it to the corresponding address of the spare disk 116, and proceeds sequentially to the end of the critical disk 142 and spare disk 116. As the RAID controller 102 sequentially copies the data, it maintains a high water mark of the copy. That is, the high water mark is the address of the last data copied from the critical disk 142 to the spare disk 116. Flow proceeds to block 218.
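The sequential background copy with a high water mark might be sketched as below; the generator form, chunk size, and byte-array disks are simplifying assumptions (a real controller copies asynchronously while continuing to serve I/O):

```python
def background_copy(critical, spare, chunk=4):
    """Sequentially copy the critical disk to the spare disk (sketch of
    block 216). Yields the high water mark after each chunk: the address
    just past the last data copied, which concurrent user writes can
    consult to decide whether to mirror a write to the spare."""
    high_water_mark = 0
    while high_water_mark < len(critical):
        end = min(high_water_mark + chunk, len(critical))
        spare[high_water_mark:end] = critical[high_water_mark:end]
        high_water_mark = end
        yield high_water_mark
```

Driving the generator to completion copies the whole disk; the yielded marks trace the copy's progress.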
At block 218, as mentioned at block 216, during the background copy the RAID controller 102 continues to issue read and write commands to the critical disk 142 in response to normal user I/O requests. If a write command specifies an address on the critical disk 142 whose corresponding address on the spare disk 116 has already been written as part of the background copy, the RAID controller 102 also writes the user data to the spare disk 116. Otherwise, the RAID controller 102 writes the user data only to the critical disk 142, since the user data will eventually be copied to the spare disk 116 by the background copy per block 216. In one embodiment, if the write command specifies an address below the high water mark, the RAID controller 102 also writes the user data to the spare disk 116; otherwise, the RAID controller 102 writes only to the critical disk 142. Read commands are issued only to the critical disk 142 and not to the spare disk 116. Flow proceeds to decision block 222.
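The write-routing rule of block 218 reduces to a single comparison against the high water mark. A minimal sketch, assuming byte-addressed disks and an illustrative function name:

```python
def handle_user_write(address, data, critical, spare, high_water_mark):
    """Route one user write arriving during the background copy (sketch of
    block 218). The critical disk is always written; the spare is written
    only when the address lies below the high water mark, i.e. has already
    been copied, because the copy itself will pick up everything above
    the mark later."""
    critical[address] = data
    if address < high_water_mark:
        spare[address] = data
```

Reads, by contrast, go only to the critical disk, so no comparable routing decision is needed on the read path.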
At decision block 222, the RAID controller 102 determines whether a disk 142 failed during the background copy of block 216. A disk failure may comprise one of various conditions, including but not limited to the following. A disk 142 failure may comprise a hard failure, i.e., the disk 142 fails in such a way that it cannot be read or written. A disk 142 failure may comprise the disk 142 generating an error from which the disk 142 cannot recover and from which the RAID controller 102 cannot recover, such as by remapping a damaged sector reported by the disk 142. A disk 142 failure may comprise a condition in which the disk 142 does not respond to a command within a predetermined time, i.e., a command timeout. A disk 142 failure may comprise the disk 142 providing an error code indicating a hardware error, such as a SCSI sense key of NOT READY or HARDWARE ERROR, or a SATA drive not ready (RDY not set), write fault (WFT set), data address mark not found (NDAM bit set), track 0 not found (NTKO bit set), ID not found (IDNF bit set), or sector marked bad by the host (BBK bit set) condition. A disk 142 failure may comprise the disk providing an error code indicating that a command was aborted after the command was retried, such as a SCSI sense key of ABORTED COMMAND (0x0B) or a SATA command aborted (ABRT bit set) condition. A disk 142 failure may comprise the disk 142 providing an error code indicating a media error a predetermined number of times, such as a MEDIUM ERROR sense key or a SATA uncorrectable data (UNC bit set) condition. A disk 142 failure may comprise a user physically removing the disk 142 from the redundant array 114, or logically removing it via software control. If a disk 142 has failed during the background copy, flow proceeds to block 232; otherwise, flow proceeds to block 224.
At block 224, upon completion of the background copy of block 216 from the critical disk 142 to the spare disk 116, the RAID controller 102 quiesces I/O operations to the disks 142 in the redundant array 114. That is, the RAID controller 102 completes all outstanding read and write commands to the disks 142 in the redundant array 114 and queues I/O requests received from the host computers 104 without issuing reads and writes to the disks 142 in response to those I/O requests. In one embodiment, the RAID controller 102 continues to perform I/O operations to other redundant arrays 114, i.e., redundant arrays other than the redundant array 114 that includes the critical disk 142. Flow proceeds to block 226.
At block 226, the RAID controller 102 replaces the critical disk 142 in the redundant array 114 with the spare disk 116, which now has a copy of the data of the critical disk 142. That is, the RAID controller 102 logically removes the critical disk 142 from the redundant array 114 and includes the spare disk 116 in the redundant array 114, such that the queued I/O requests and any subsequent I/O requests are translated into reads and writes of the spare disk 116 (which is no longer a spare disk 116 but is now part of the redundant array 114) rather than of the critical disk 142 (which is no longer part of the redundant array 114). Flow proceeds to block 228.
At block 228, the RAID controller 102 unquiesces I/O operations to the redundant array 114. That is, the RAID controller 102 begins issuing read and write commands to the disks 142 of the redundant array 114 in response to the queued I/O requests and any subsequent I/O requests; in particular, it issues read and write commands to the disk 116 that was the target of the background copy (which is no longer a spare disk 116 but is now part of the redundant array 114), and does not issue read and write commands to the critical disk 142 (which is no longer part of the redundant array 114). Flow ends at block 228.
At block 232, the RAID controller 102 quiesces I/O operations to the redundant array 114, because the RAID controller 102 detected at decision block 222 that a disk 142 failure occurred during the background copy. Flow proceeds to decision block 234.
At decision block 234, the RAID controller 102 determines whether the spare disk 116 has failed, and if so, flow proceeds to block 248, otherwise, flow proceeds to block 236.
At block 236, the RAID controller 102 removes the failed disk 142 from the redundant array 114, because one of the disks 142 of the redundant array 114 has failed. Now, in the case of a singly-redundant array, such as a RAID level 1, 2, 3, 4, 5, 10, or 50 array, the redundant array 114 is no longer fault-tolerant. In the case of a multiply-redundant array, such as a RAID level 6 array, the redundant array 114 may still be fault-tolerant, although no longer fully redundant. Flow proceeds to block 238.
At block 238, the RAID controller 102 selects the spare disk 116 that was the target of the background copy of block 216 for use in a normal rebuild, i.e., rebuilding the data of the failed disk 142 by reading the one or more remaining disks 142 in the redundant array 114, which is no longer fully redundant and, in the case of an originally singly-redundant array 114, is no longer fault-tolerant. Flow proceeds to decision block 242.
At decision block 242, the RAID controller 102 determines whether the disk 142 that failed during the background copy is the critical disk 142. If so, flow proceeds to block 244; otherwise, flow proceeds to block 254.
At block 244, the RAID controller 102 commences a normal rebuild of the redundant array 114 at the high water mark of the background copy. That is, the RAID controller 102 begins reading the remaining disks 142 in the array 114 at the high water mark, rebuilding the data of the failed disk 142, and writing the rebuilt data to the spare disk 116 at the high water mark. Because block 244 is reached due to a failure of the critical disk 142 during the background copy, only the data above the high water mark needs to be rebuilt. Advantageously, the present invention thereby provides a potentially faster rebuild than a normal rebuild of the entire spare disk 116 would require. Finally, once the RAID controller 102 has rebuilt the entire spare disk 116, e.g., after block 246, the RAID controller 102 includes the spare disk 116 in the redundant array 114. Flow proceeds to block 246.
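Resuming the rebuild at the high water mark can be sketched as follows for a parity-redundant array; the function name, byte-array disks, and byte-wise XOR are illustrative assumptions:

```python
def rebuild_above_mark(surviving_disks, spare, high_water_mark):
    """Normal rebuild of a failed critical disk, resumed at the high water
    mark (sketch of block 244). Data below the mark already reached the
    spare via the background copy, so only addresses at or above the mark
    are regenerated by XOR-ing the surviving members of the array."""
    for addr in range(high_water_mark, len(spare)):
        value = 0
        for disk in surviving_disks:
            value ^= disk[addr]
        spare[addr] = value
```

The further the background copy progressed before the failure, the smaller the range this loop must cover, which is the source of the speed advantage the text describes.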
At block 246, the RAID controller 102 unquiesces I/O operations to the array 114. While the normal rebuild is being performed, I/O write operations are performed by writing to the spare disk 116 being rebuilt as part of the array 114, and I/O read operations are performed by reading from the one or more remaining disks 142 of the original redundant array 114 and, in the case of a parity-redundant array 114, reconstructing the data to be provided to the host computer 104. Flow ends at block 246.
At block 248, because the spare disk 116 has failed, the RAID controller 102 selects a different spare disk 116 as the target of the background copy of the preemptive reconstruct. In one embodiment, the RAID controller 102 automatically selects the different spare disk 116 from a pool of spare disks 116 previously configured by the user. If no spare disk 116 is available, the user is prompted to attach a new spare disk 116 to the RAID controller 102. Flow proceeds to block 252.
At block 252, the RAID controller 102 unquiesces I/O operations to the redundant array 114, and flow returns to block 216 to perform the background copy of the preemptive reconstruct to the newly selected spare disk 116.
At block 254, the RAID controller 102 commences a normal rebuild of the redundant array 114 at the beginning of the spare disk 116. That is, the RAID controller 102 begins reading the remaining disks 142 in the array 114 at the beginning of the disks, rebuilding the data of the failed disk 142, and writing the rebuilt data to the spare disk 116 at its beginning. Because block 254 is reached due to a failure of a disk 142 other than the critical disk 142 during the background copy, all of the data of the failed disk 142 needs to be rebuilt. Finally, once the RAID controller 102 has rebuilt the entire spare disk 116, e.g., after block 246, the RAID controller 102 includes the spare disk 116 in the redundant array 114. Flow proceeds to block 246.
Referring now to FIG. 3, a flowchart illustrating operation of the RAID controller of FIG. 1 to perform a preemptive reconstruct of a redundant array according to an alternate embodiment of the present invention is shown. FIG. 3 is similar to FIG. 2 and like reference numerals designate like elements. However, FIG. 3 includes block 302 replacing block 202 and block 316 replacing block 216.
At block 302, the user configures a mirrored redundant array 114 of disks 142, such as but not limited to a RAID 1 or RAID 10 redundant array 114, and the at least one spare disk 116 of FIG. 1.
At block 316, rather than performing a copy of the data on the critical disk 142 to the spare disk 116 as in block 216 of FIG. 2, the RAID controller 102 performs a background copy of the data on the mirror disk 142 of the critical disk 142 to the spare disk 116 selected at block 214.
In other respects, FIG. 3 is similar to FIG. 2. As may be observed, a potential advantage of the embodiment of FIG. 3 over the embodiment of FIG. 2 is that it reduces, during the preemptive reconstruct, the number of accesses to the critical disk 142, which may be close to failing, thereby potentially reducing the likelihood of a failure of the critical disk 142.
Referring now to FIG. 4, a flowchart illustrating operation of the RAID controller of FIG. 1 to perform a preemptive reconstruct of a redundant array according to an alternate embodiment of the present invention is shown. FIG. 4 is similar to FIG. 2 and like reference numerals designate like elements. However, FIG. 4 includes block 402 replacing block 202, and block 416 replacing block 216.
At block 402, the user configures a parity-redundant array 114 of disks 142, such as but not limited to a RAID 3, 4, 5, 6, or 50 redundant array 114, and the at least one spare disk 116 of FIG. 1.
At block 416, rather than performing a copy of the data on the critical disk 142 to the spare disk 116 as at block 216 of FIG. 2, the RAID controller 102 reads the data from the other disks 142 in the redundant array 114 and performs an exclusive-OR of the data to generate the data of the critical disk 142, which it then writes to the spare disk 116 selected at block 214.
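This parity-based variant of the background copy never reads the critical disk at all. A minimal sketch under the same illustrative assumptions as the earlier examples (byte-array disks, byte-wise XOR):

```python
def regenerate_without_reading(other_disks, spare):
    """Sketch of block 416: derive the critical disk's contents entirely
    from the other members of a parity-redundant array, so the failing
    disk is never read during the preemptive reconstruct."""
    for addr in range(len(spare)):
        value = 0
        for disk in other_disks:
            value ^= disk[addr]
        spare[addr] = value
```

The trade-off, as the text notes below, is that every address now requires reads of all the other disks rather than a single read of the critical disk.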
In other respects, FIG. 4 is similar to FIG. 2. As may be observed, a potential advantage of the embodiment of FIG. 4 over the embodiment of FIG. 2 is that it reduces, during the preemptive reconstruct, the number of accesses to the critical disk 142, which may be close to failing, thereby potentially reducing the likelihood of a failure of the critical disk 142. It is noted that in this embodiment the preemptive reconstruct may take longer than in the embodiments that copy from the critical disk 142 or from its mirror disk 142.
Referring now to FIG. 5, a computer system 500 including the RAID controller 102 according to an alternate embodiment of the present invention is shown. FIG. 5 is similar to FIG. 1, and like reference numerals designate like elements. However, in the embodiment of FIG. 5, the RAID controller 102 comprises a host bus adapter-type RAID controller included in the host computer 104, which host computer 104 includes a microprocessor 502 and a memory 506 each coupled to a chipset 504. The microprocessor 502 executes programs stored in the memory 506, such as an operating system and application programs, including application programs that access data on the redundant array 114 via the RAID controller 102. The chipset 504 provides a local bus 508 by which the RAID controller 102 is coupled to the microprocessor 502 and the memory 506, to enable the RAID controller 102 to transfer data between itself and the memory 506, and to enable the microprocessor 502 to issue I/O requests to the RAID controller 102. The local bus 508 may include but is not limited to a PCI, PCI-X, CompactPCI, or PCI Express bus. In one embodiment, the RAID controller 102 is integrated onto the motherboard of the host computer 104. In another embodiment, the RAID controller 102 comprises an add-in controller card inserted into a slot of the local bus 508. The RAID controller 102 of FIG. 5 is configured to perform a preemptive reconstruct according to the present invention as described herein.
In one embodiment, the RAID controller 102 performs a manual preemptive reconstruct of the redundant array 114 rather than an automatic preemptive reconstruct. That is, rather than automatically performing the preemptive reconstruct in response to block 212 without user input, the RAID controller 102 performs the preemptive reconstruct of FIGS. 2, 3, and 4, beginning at block 214, in response to user input. The user provides input indicating the critical disk 142 in the redundant array 114, and the RAID controller 102 performs the remaining steps of the preemptive reconstruct in response to the user input. In one embodiment, when the RAID controller 102 determines at block 212 that the critical disk 142 has reached its error threshold 122, the RAID controller 102 reports this condition to the user, and the user responsively provides the input to initiate the manual preemptive reconstruct of the redundant array 114.
Referring now to FIG. 6, a software-based RAID controller 102 according to an alternate embodiment of the present invention is shown. FIG. 6 is similar to FIG. 1, and like reference numerals designate like elements. However, in the embodiment of FIG. 6, the RAID controller 102 is comprised of components of the host computer 104, and is commonly referred to as a software RAID controller. In the RAID controller 102 of FIG. 6, the host interface adapter 136 interfaces the host computer 104 to a network of computers. In the FIG. 6 embodiment, the microprocessor 126 of the RAID controller 102 may comprise the microprocessor of the host computer 104; the memory 124 and buffer 132 of the RAID controller 102 may comprise system memory of the host computer 104; the storage device interface adapters 138 of the RAID controller 102 may comprise one or more host bus adapters of the host computer 104; the bus bridge 134 may comprise a chipset of the host computer 104; and the host interface adapter 136 may comprise a network interface card of the host computer 104. The error threshold 122 may be stored in the memory 124 of the host computer 104. In particular, the programming instructions that perform the functions of the RAID controller 102 may be part of the operating system and/or system firmware, such as a ROM BIOS. The microprocessor 126 executes the programmed instructions to perform the RAID controller 102 functions, and in particular to perform the preemptive reconstruct of a disk 142 that has reached its error threshold 122, as described herein.
While the arrays of disks of the above-described embodiments are redundant arrays, another embodiment contemplates performing the preemptive reconstruct of a non-redundant array of disks. In particular, the preemptive reconstruct may be performed on a RAID level 0 array of disks; RAID level 0 is a non-redundant RAID level in which the data written to the array is striped across the plurality of disks of the array. Nevertheless, once a disk in the array reaches the error threshold, the data may be copied from the critical disk to a spare disk, and thereafter the critical disk may be replaced in the striped array with the spare disk that now contains a copy of the critical disk's data. Furthermore, although embodiments have been described with respect to RAID levels employing mirrored redundancy and parity redundancy, the preemptive reconstruct may also be performed on redundant arrays of disks employing other data redundancy techniques, such as RAID 2, which employs Hamming code error-correcting code (ECC) redundancy, among others.
Also, although the objects, features, and advantages of the present invention have been described in detail, other embodiments are encompassed by the invention. In addition to implementations of the invention using hardware, the invention can be embodied in computer-readable code (e.g., computer-readable program code, data, etc.) embodied in a computer-usable (e.g., readable) medium. The computer code enables the function or the fabrication, or both, of the invention disclosed herein, for example, through the use of general programming languages (e.g., C, C++, JAVA, and the like); GDSII databases; hardware description languages (HDL) including Verilog HDL, VHDL, Altera HDL (AHDL), and the like; or other programming and/or circuit (i.e., schematic) capture tools available in the art. The computer code can be disposed in any known computer-usable (e.g., readable) medium, including semiconductor memory, magnetic disk, and optical disk (e.g., CD-ROM, DVD-ROM, and the like), and as a computer data signal embodied in a computer-usable (e.g., readable) transmission medium (e.g., carrier wave or any other medium including digital, optical, or analog-based medium). As such, the computer code can be transmitted over communication networks, including the Internet and intranets. It is understood that the invention can be embodied in computer code and transformed to hardware as part of the production of integrated circuits. Also, the invention may be embodied as a combination of hardware and computer code.
Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant, the method comprising:
determining whether the number of errors issued by one of the disks in the redundant array has exceeded an error threshold;
reading data from a second disk in the redundant array and writing the data to a spare disk in response to the determination, wherein the data of the second disk in the redundant array is a mirror copy of the data of one of the disks;
replacing one of the disks in the redundant array with the spare disk after completing the reading and the writing; and
writing second data to one of the disks while reading and writing in response to a user write request including the second data, thereby maintaining fault tolerance of the redundant array.
2. The method of claim 1, further comprising:
the second data is also written to the spare disk, provided that the second data is destined for a location previously written to the spare disk in the reading and the writing.
3. A method for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant, the method comprising:
determining whether the number of errors issued by one of the disks in the redundant array has exceeded an error threshold;
generating data for one of the disks from second data read from two or more of the other disks in the redundant array and writing the generated data to the spare disk in response to the determination;
replacing one of the disks in the redundant array with the spare disk after the generating and the writing are completed; and
when the generating and writing occurs, writing third data to one of the disks in the redundant array in response to a user write request including the third data, thereby maintaining fault tolerance of the redundant array.
4. The method of claim 3, wherein the step of generating the data comprises:
reading the second data from the two or more of the other disks in the redundant array; and
performing a binary exclusive-or operation of the second data to generate the data for one of the disks.
5. The method of claim 3, further comprising:
the third data is also written to the spare disk, provided that the destination of the third data is the location previously written to the spare disk in the generating and the writing.
6. An apparatus for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant, comprising:
means for determining whether a number of errors issued by one of the disks in the redundant array has exceeded an error threshold;
means for reading data from a second disk in the redundant array and writing the data to a spare disk in response to the determination, wherein the data of the second disk in the redundant array is a mirror copy of the data of one of the disks;
means for replacing one of the disks in the redundant array with the spare disk after the reading and the writing are completed; and
means for writing second data to one of the disks in response to a user write request including the second data while the reading and the writing, thereby maintaining fault tolerance of the redundant array.
7. The apparatus of claim 6, further comprising:
means for writing the second data to the spare disk if the second data is destined for a location previously written to the spare disk in the reading and the writing.
8. An apparatus for performing a preemptive reconstruct of a redundant array of disks while the redundant array is still fault-tolerant, comprising:
means for determining whether a number of errors issued by one of the disks in the redundant array has exceeded an error threshold;
means for generating data for one of the disks from second data read from two or more of the other disks in the redundant array and writing the generated data to the spare disk in response to the determination;
means for replacing one of the disks in the redundant array with the spare disk after the generating and the writing are completed; and
means for writing third data to one of the disks in the redundant array in response to a user write request including the third data while the generating and writing, thereby maintaining fault tolerance of the redundant array.
9. The apparatus of claim 8, wherein the means for generating the data comprises:
means for reading the second data from the two or more of the other disks in the redundant array; and
means for performing a binary exclusive-or operation on the second data to generate the data for one of the disks.
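The parity path of claims 8 and 9 regenerates the suspect disk's data by XOR-ing the corresponding stripes read from the other array members, RAID-5 style. A minimal sketch, with stripes modeled as `bytes` and all function names (`regenerate_stripe`, `preemptive_reconstruct`) chosen for illustration rather than taken from the patent:

```python
from functools import reduce

def regenerate_stripe(other_stripes):
    """Claim 9: binary exclusive-OR of the stripe data read from two or
    more other disks recovers the corresponding stripe of the suspect disk."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*other_stripes))

def preemptive_reconstruct(other_disks, spare):
    """Claim 8: fill the spare, stripe by stripe, with data regenerated
    from the surviving members; the suspect disk keeps serving I/O, so the
    array stays fault-tolerant until the spare takes its place."""
    for i, stripes in enumerate(zip(*other_disks)):
        spare[i] = regenerate_stripe(list(stripes))
```

This works because XOR is its own inverse: the parity stripe was computed as the XOR of all data stripes, so XOR-ing the parity with the remaining data stripes yields exactly the missing disk's data.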
10. The apparatus of claim 8, further comprising:
means for writing the third data to the spare disk if the destination of the third data is a location previously written to the spare disk during the generating and the writing.
HK07113951.3A 2004-08-04 2005-07-11 Performing a preemptive reconstruct of a fault-tolerant raid array HK1105695B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US10/911,110 2004-08-04
US10/911,110 US7313721B2 (en) 2004-06-21 2004-08-04 Apparatus and method for performing a preemptive reconstruct of a fault-tolerant RAID array
PCT/US2005/024397 WO2006019643A1 (en) 2004-08-04 2005-07-11 Performing a preemptive reconstruct of a fault-tolerant raid array

Publications (2)

Publication Number Publication Date
HK1105695A1 (en) 2008-02-22
HK1105695B (en) 2010-06-25

Similar Documents

Publication Publication Date Title
CN100561442C (en) Perform a pre-rebuild of a fault-tolerant disk array
US9037795B1 (en) Managing data storage by provisioning cache as a virtual device
US7788541B2 (en) Apparatus and method for identifying disk drives with unreported data corruption
US7447938B1 (en) System and method for reducing unrecoverable media errors in a disk subsystem
US8839028B1 (en) Managing data availability in storage systems
CN101154174B (en) Using file system information in raid data reconstruction and migration
US9697087B2 (en) Storage controller to perform rebuilding while copying, and storage system, and control method thereof
US8065558B2 (en) Data volume rebuilder and methods for arranging data volumes for improved RAID reconstruction performance
US10896088B2 (en) Metadata recovery mechanism for page storage
US7017107B2 (en) Storage array employing scrubbing operations at the disk-controller level
EP2703991B1 (en) Scalable storage protection
US9003227B1 (en) Recovering file system blocks of file systems
US7062704B2 (en) Storage array employing scrubbing operations using multiple levels of checksums
US6243827B1 (en) Multiple-channel failure detection in raid systems
TWI451257B (en) Method and apparatus for protecting the integrity of cached data in a direct-attached storage (das) system
US20140215147A1 (en) Raid storage rebuild processing
US9558068B1 (en) Recovering from metadata inconsistencies in storage systems
CN101523353A (en) Method for optimized reconstruction and copyback of failed drives in the presence of a global hot spare disk
CN1965298A (en) Method, system, and program for managing parity RAID data reconstruction
US8239645B1 (en) Managing mirroring in data storage system having fast write device and slow write device
RU2750645C1 (en) Method for data storage in redundant array of independent disks with increased fault tolerance
US7174476B2 (en) Methods and structure for improved fault tolerance during initialization of a RAID logical unit
CN102147714B (en) A kind of management method of network store system and device
HK1105695B (en) Performing a preemptive reconstruct of a fault-tolerant raid array