JP2008016028A

JP2008016028A - Distributed storage system with wide area replication

Info

Publication number: JP2008016028A
Application number: JP2007173265A
Authority: JP
Inventors: Stephen J Sicola
Original assignee: Seagate Technology LLC
Current assignee: Seagate Technology LLC
Priority date: 2006-06-30
Filing date: 2007-06-29
Publication date: 2008-01-24

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus and method that let the entire storage array share the I/O command processing load evenly in a distributed storage system with wide-area replication capability.

SOLUTION: A virtual engine 200, connectable to a remote device over a network, passes access commands between the remote device and a storage space. A plurality of intelligent storage elements (ISEs) 108-1, 108-2 replicate, from a first ISE to a second ISE, a block of data large enough to process a not-yet-processed access command. This replication runs independently of the access commands being passed at the same time between the virtual engine 200 and the first ISE.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates generally to the field of distributed data storage systems. More particularly, but not by way of limitation, it relates to an apparatus and method for wide-area replication of read-only data within a distributed storage system.

Computer networking spread rapidly once the data transfer rates of the Industry Standard Architecture (ISA) could no longer keep pace with the data access rate of the Intel (registered trademark) 80386 processor. Local area networks (LANs) then evolved into storage area networks (SANs) by consolidating the data storage capacity in the network. Users realized tremendous benefits by consolidating the equipment in the SAN together with the associated data that equipment handles. For example, they could manage an order of magnitude more storage than was possible with direct-attached storage, and at a manageable cost.

More recently, the trend has been toward a network-centric approach to controlling the data storage subsystems. Storage was consolidated earlier, and the systems that control storage functions have now followed the same path, moving from the servers into the network itself. For example, host-based software can delegate maintenance and management tasks to intelligent switches or to a specialized network storage services platform. Appliance-based solutions eliminate the need for software running in the hosts, and they operate within computers placed as nodes in the enterprise. In either case, intelligent network solutions can centralize storage allocation routines, backup routines, and fault-tolerance schemes independently of the hosts.

Moving the intelligence from the hosts to the network solves some problems. It does not solve the inherent difficulty of changing how virtual storage is presented to the hosts. For example, data stored under the direction of one controller might be accessed more efficiently from the storage space of another controller. In that case, the host or the network must first copy the data so that it is available to the other controller.

What is needed is an intelligent data storage subsystem with four properties:

- it self-deterministically allocates its data storage capacity;
- it manages that capacity;
- it protects that capacity;
- it presents that capacity to the network as virtual storage space, so that I/O performance is optimized system-wide.

This virtual storage space can be provisioned as multiple storage volumes. In a distributed computing environment, such intelligent storage elements share the I/O command processing load as evenly as possible across the storage array. Embodiments of the present invention are directed to this solution.

Embodiments of the present invention are generally directed to a distributed storage system with wide-area replication capability.
In some embodiments, a data storage system is provided that has a virtual engine connectable to a remote device over a network, for passing access commands between the remote device and a storage space. A plurality of intelligent storage elements (ISEs) replicate data from a first ISE to a second ISE. This replication is independent of access commands being passed at the same time between the virtual engine and the first ISE.

In some embodiments, a method is provided for processing access commands between a virtual engine and an ISE while simultaneously replicating data from that ISE to another storage space.
In some embodiments, a data storage device is provided that has a plurality of ISEs, each individually addressable by a virtual engine, and means for replicating data between the intelligent storage elements.
These and other features and advantages that characterize the present invention will become apparent upon reading the following detailed description and reviewing the associated drawings.

FIG. 1 is an illustrative computer system 100 in which embodiments of the present invention are useful. One or more hosts 102 are networked to one or more network attached servers (NAS) 104 via a local area network (LAN) and/or wide area network (WAN) 106. Preferably, the LAN/WAN 106 uses Internet Protocol (IP) networking infrastructure for communicating over the World Wide Web.

The hosts 102 access applications that reside in the servers 104. These applications routinely need data stored on one or more of a number of intelligent storage elements (ISEs) 108, so a SAN 110 connects the servers 104 to the ISEs 108 for access to the stored data. Each ISE 108 provides blocks of data storage capacity 109, with enterprise-class or desktop-class storage media, for storing data over various selected communication protocols such as Serial ATA and Fibre Channel.

FIG. 2 is a simplified diagram of the computer system 100 of FIG. 1. The hosts 102 interact with each other, and with a pair of the ISEs 108 (denoted A and B, respectively), via the network or fabric 110.

Each ISE 108 includes dual redundant controllers 112 (denoted A1, A2 and B1, B2). The controllers 112 preferably operate on the data storage capacity 109 as a set of data storage devices characterized as a redundant array of independent drives (RAID). The controllers 112 and the data storage capacity 109 preferably use a fault-tolerant arrangement, which has two consequences:

- the controllers 112 use parallel, redundant links;
- at least some of the user data stored by the system 100 is stored in redundant format within at least one set of the data storage capacity 109.

The sites can be physically separate. For example:

- the A host computer 102 and the A ISE 108 can be at a first site;
- the B host computer 102 and the B ISE 108 can be at a second site;
- the C host computer 102 can be at a third site.

This arrangement is merely illustrative and not limiting. All entities on the distributed computer system are connected by some type of computer network.

FIG. 3 illustrates an ISE 108 constructed in accordance with embodiments of the present invention. A shelf 114 defines cavities that receive and engage the controllers 112 in electrical connection with a midplane 116. The shelf 114 is supported, in turn, within a cabinet (not shown).

The shelf 114 also receives and engages a pair of multiple disk assemblies (MDAs) 118 on the same side of the midplane 116. Three dual components connect to the opposite side of the midplane 116:

- dual batteries 122 that provide emergency power;
- dual AC power supplies 124;
- dual interface modules 126.

Preferably, the dual components are configured to operate either or both of the MDAs 118 simultaneously. This provides backup protection if one component fails.

FIG. 4 is an enlarged, partially exploded isometric view of an MDA 118 constructed in accordance with embodiments of the present invention. The MDA 118 has an upper partition 130 and a lower partition 132, each supporting five data storage devices 128. The partitions 130, 132 align the data storage devices 128 for connection with a common circuit board 134. The circuit board 134 has a connector 136 that engages the midplane 116 (FIG. 3). A cover 138 provides electromagnetic interference shielding.

Two related applications cover the MDA. Both are assigned to the assignee of the present invention and incorporated herein by reference:

- patent application 10/884,605, "Carrier Device and Method for a Multiple Disc Array," is the subject of this illustrative embodiment of the MDA 118;
- patent application 10/817,378, of the same title, covers another illustrative embodiment of the MDA.

As described later, in an alternative equivalent embodiment the MDA 118 can be housed in a sealed enclosure.

FIG. 5 is an isometric view of an illustrative data storage device 128, in the form of a rotating-media disc drive, suitable for use with embodiments of the present invention. The following discussion uses a rotating spindle with a moving data storage medium. Alternative equivalent embodiments use a non-rotating-media device, such as a solid-state memory device.

A data storage disc 140 is rotated by a motor 142 to present data storage locations of the disc 140 to a read/write head ("head") 143. The head 143 is supported at the distal end of a rotary actuator 144, which moves the head 143 radially between the inner and outer tracks of the disc 140. The head 143 is electrically connected to a circuit board 145 by a flex circuit 146. The circuit board 145 receives and sends the control signals that control the functions of the data storage device 128. A connector 148, electrically connected to the circuit board 145, connects the data storage device 128 to the circuit board 134 (FIG. 4) of the MDA 118.

FIG. 6 is a diagram of an ISE 108 constructed in accordance with embodiments of the present invention. The controller 112 operates together with an intelligent storage processor (ISP) 150 to provide managed reliability of data integrity. The ISP 150 can reside in the controller 112, in the MDA 118, or elsewhere within the ISE 108.

Aspects of managed reliability include invoking reliable data storage formats such as RAID strategies. Consider a system that selectively employs a selected one of a plurality of different RAID formats. Such a system stores data more robustly, and its firmware algorithms can be optimized in two ways:

- to reduce the complexity of the software used to manage the MDA 118;
- to recover relatively quickly from storage fault conditions.

These aspects of the multiple-RAID-format system are described in patent application 10/817,264, "Storage Media Data Structure and Method," which is assigned to the assignee of the present invention and incorporated herein by reference.

Managed reliability can also include scheduling diagnostic and correction routines based on monitored usage of the system. Data recovery is performed by copying and reconstructing data. The ISP 150 is integrated with the MDA 118 so as to facilitate "self-healing" of the overall data storage capacity without data loss. These aspects of managed reliability are disclosed in patent application 10/817,617, "Managed Reliability Storage System and Method," which is assigned to the assignee of the present invention and incorporated herein by reference.

Other aspects of managed reliability include responsiveness to predictive failure indications in relation to predetermined rules. These are disclosed, for example, in patent application 11/040,410, "Deterministic Preventive Recovery From a Predicted Failure in a Distributed Storage System," which is assigned to the assignee of the present invention and incorporated herein by reference.

FIG. 7 is a diagram of an ISP circuit board 152 on which a pair of redundant ISPs 150 reside. The ISPs 150 interface the data storage capacity 109 to the SAN fabric 110. Each ISP 150 can manage various storage services such as routing, volume management, and data migration and replication. The ISPs 150 divide the ISP circuit board 152 into two ISP subsystems 154, 156, which are coupled by a bus 158.

The ISP subsystem 154 includes the following:

- the ISP 150 denoted "B," connected to the SAN fabric 110 by a link 160 and to the data storage capacity 109 by a link 162;
- a policy processor 164 that executes a real-time operating system.

The ISP 150 and the policy processor 164 communicate over a bus 166, and both communicate with a memory 168.

FIG. 8 is a diagram of an illustrative ISP subsystem 154 constructed in accordance with embodiments of the present invention. The ISP 150 includes a number of function controllers (170-180). These communicate with list managers 182, 184 through a cross point switch (CPS) 186 message crossbar.

Each function controller (170-180) can generate a CPS message in response to a given condition. It sends that message through the CPS 186 to a list manager 182, 184 in order to access a memory module or to invoke an ISP 150 action. Likewise, responses from the list managers 182, 184 can be sent through the CPS 186 to any of the function controllers (170-180). The arrangement of FIG. 8 and the associated description are illustrative and do not limit the contemplated embodiments of the present invention.

The policy processor 164 can be programmed to carry out desired operations through the ISP 150. For example, the policy processor 164 can communicate with the list managers 182, 184 (that is, send and receive messages) through the CPS 186. Responses to the policy processor 164 can serve as interrupts that signal the reading of memory 168 registers.

FIG. 9 is a diagram illustrating the flexibility of the ISE 108. Through its intelligent controllers 112, the ISE 108 can communicate with a host 102 in any of a plurality of preselected communication protocols, such as FC, iSCSI, or SAS. The ISE 108 can be programmed to ascertain the abstraction level of a host command. It then maps a virtual storage volume to the physical storage 109 associated with that command.

For purposes of the present invention, the term "virtual storage volume" means a logical entity that generally corresponds to a logical abstraction of physical storage. A "virtual storage volume" can include, for example, an entity that is treated (logically) as though it were either of the following:

- consecutively addressed blocks in a fixed block architecture;
- records in a count-key-data architecture.

A virtual storage volume can physically reside on more than one storage element.
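The following minimal Python sketch, which is not taken from the source, makes the definition concrete. It shows how a virtual storage volume might present one consecutively addressed block space while its blocks physically sit on more than one storage element. The `Extent` and `VirtualVolume` names, the element labels, and the end-to-end extent layout are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    element: str    # hypothetical storage element label, e.g. "ISE-1"
    start_lba: int  # first physical block of this extent on that element
    length: int     # number of blocks in the extent


class VirtualVolume:
    """Consecutively addressed virtual blocks backed by extents laid end to end."""

    def __init__(self, extents):
        self.extents = list(extents)
        self.size = sum(e.length for e in self.extents)

    def resolve(self, vlba):
        """Translate a virtual block address into (element, physical LBA)."""
        if not 0 <= vlba < self.size:
            raise ValueError(f"virtual LBA {vlba} is outside a {self.size}-block volume")
        offset = vlba
        for extent in self.extents:
            if offset < extent.length:
                return extent.element, extent.start_lba + offset
            offset -= extent.length
        raise AssertionError("unreachable: the size check guarantees a match")


# A 150-block volume that spans two storage elements.
vol = VirtualVolume([Extent("ISE-1", 1000, 100), Extent("ISE-2", 0, 50)])
print(vol.resolve(99), vol.resolve(100))  # ('ISE-1', 1099) ('ISE-2', 0)
```

The host sees only addresses 0 through 149. Where each block physically lives is decided entirely by the extent list, which the storage side can change without altering the host's view.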

FIG. 10 is a diagram of the types of data management services that the ISE 108 can perform independently of any host 102. Each of these can be controlled locally:

- RAID management, for fault-tolerant data integrity, with data striped across a desired number of data storage devices 128₁, 128₂, 128₃, ..., 128ₙ;
- virtualization services, which allocate or deallocate memory capacity to logical entities;
- application routines, such as the managed reliability schemes described above and data replication between logical volumes within the same ISE 108.

For purposes of this description and the claims, the terms "move" and "movement" mean transferring data from a source to a destination so that the source data is eliminated as part of completing the move. This contrasts with "replicate" and "replication," in which the data is copied from the source to the destination and given a different name at the destination. In the present invention, replication is used to efficiently create read-only copies of data somewhere other than where the data's integrity is maintained.
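A toy in-memory store can illustrate the difference between moving and replicating. This is a hypothetical Python sketch, not ISE firmware. The dictionary store, the `(data, read_only)` tuple, and the function names are assumptions. They show only that a move removes the source, while a replication leaves the source intact and creates a differently named, read-only copy.

```python
def move(store, src, dst):
    """Move: the data appears at dst, and the source copy is gone once the move completes."""
    if dst in store:
        raise ValueError(f"destination {dst!r} already exists")
    store[dst] = store.pop(src)


def replicate(store, src, dst):
    """Replicate: copy src to a differently named dst, flagged read-only; src is untouched."""
    if dst == src or dst in store:
        raise ValueError("a replica needs a new name at the destination")
    data, _read_only = store[src]
    store[dst] = (data, True)


# store maps a hypothetical volume name -> (data bytes, read_only flag)
store = {"LUN-0": (b"payload", False)}
replicate(store, "LUN-0", "LUN-0-replica")
move(store, "LUN-0", "LUN-1")
print(sorted(store))  # ['LUN-0-replica', 'LUN-1']
```

After both calls, the writable data lives only under its new name. The replica keeps the same bytes but is marked read-only, so integrity is still maintained in one place.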

FIG. 11 shows an embodiment of the data storage system 100 in which a virtual engine 200 communicates with a remote host device 102 via the SAN 106. The virtual engine 200 passes access commands (I/O commands) between the host device 102 and a selected one of a plurality of ISEs 108. Each ISE 108 has two ports (202, 204 and 206, 208, respectively) that the virtual engine 200 can uniquely address for passing access commands.

The following describes how this embodiment replicates data between the ISEs 108 independently of, and simultaneously with, the processing of host access commands. This lets data replication proceed without creating a data transfer bottleneck. In addition, the data transfer rate used for replication can be varied to optimize the effect on the application performance of the system 100.
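The text states only that the replication transfer rate can be varied to limit the effect on application performance; it gives no policy. As one hypothetical example, the rate could be scaled by the bandwidth left over by host I/O, with a floor so that replication still makes progress. The function name, the linear rule, and the 10% floor below are all assumptions.

```python
def replication_rate(max_rate_mbps, host_utilization, floor_fraction=0.1):
    """Return a hypothetical replication bandwidth budget in MB/s.

    host_utilization: fraction (0.0 to 1.0) of the ISE's bandwidth in use by host commands.
    floor_fraction:   minimum share of max_rate_mbps kept so replication never stalls.
    """
    if not 0.0 <= host_utilization <= 1.0:
        raise ValueError("host_utilization must lie between 0.0 and 1.0")
    headroom = 1.0 - host_utilization
    return max_rate_mbps * max(headroom, floor_fraction)


for load in (0.0, 0.25, 0.95):
    print(load, replication_rate(400, load))
```

A linear rule is the simplest choice. A real controller might instead react to measured command latency or queue depth.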

In ISE 108-1, the ISP 150 creates a logical volume 210 on a physical data pack 212 of data storage devices 128. For illustration, assume that 40% of the storage capacity of the data pack 212 has been allocated to a logical disc 214 in the logical volume 210. Also assume that the data pack 212, and every other data pack described below, contains eight data storage devices 128 for storing data plus two spare data storage devices 128. FIG. 11 further shows the following allocations:

- in ISE 108-1, the other data pack 216 has allocated 93% to a logical disc 218;
- in ISE 108-2, the data pack 220 has allocated 30% to a logical disc 224;
- in ISE 108-2, the data pack 222 has allocated 40% to a logical disc 226.
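A little arithmetic turns the allocation percentages in this example into capacities. The sketch below is hypothetical. The text gives no drive size, so 500 GB per drive is assumed. The two spares are treated as holding no allocatable capacity, and RAID overhead is ignored.

```python
DATA_DRIVES_PER_PACK = 8  # from the example; the two spare drives are excluded
DRIVE_CAPACITY_GB = 500   # assumption: per-drive capacity is not stated in the text


def pack_capacity_gb(drive_gb=DRIVE_CAPACITY_GB):
    """Allocatable capacity of one data pack, ignoring RAID overhead."""
    return DATA_DRIVES_PER_PACK * drive_gb


def logical_disc_gb(percent, drive_gb=DRIVE_CAPACITY_GB):
    """Capacity of a logical disc, given the percentage of its pack allocated to it."""
    return pack_capacity_gb(drive_gb) * percent // 100


# data pack reference numeral -> percentage allocated to its logical disc (FIG. 11)
allocations = {212: 40, 216: 93, 220: 30, 222: 40}
sizes = {pack: logical_disc_gb(pct) for pack, pct in allocations.items()}
print(sizes)  # {212: 1600, 216: 3720, 220: 1200, 222: 1600}
```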

The virtual engine 200 created a logical volume 224 from the logical disc 214. In response to a request for storage space from the host, it also created the logical disc 226 and mapped it to the host 102.

Under certain predefined conditions, it is advantageous for a controller other than the unit master controller to process access commands for a particular LUN (logical unit number). Mirroring copies all of the data in a predetermined array for redundancy. This embodiment works differently: it directs a narrowly scoped, read-only copy of the data to a target storage space, under conditions in which the access command is likely to be processed with relatively higher performance there.

As an illustrative example, suppose an access command arrives for data in a LUN while controller A in ISE 108-1 is busy streaming sequential data access commands. Controller A could interrupt the streaming or wait until it finishes. Instead, it can replicate a read-only copy of all or part of the LUN to a different storage space, such as ISE 108-2 as shown in FIG. 12, and the concurrent command is processed there.

In general, the unit master controller can look across the system to identify another storage space that has relatively higher processing capability for an outstanding access command. That storage space can be located in any of three places:

- internally, within the unit master controller itself;
- externally, in another controller within the same ISE 108;
- across the wide area, in another ISE 108.

In each case, the comparative processing capability can be judged against any number of desired parameters. Examples include the availability of resources for processing the access command, a qualitative or quantitative comparison of I/O command queue depths, and a comparison of managed reliability factors.
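One way to express this comparison is sketched below in Python. The text lists these parameters as examples but does not say how to weigh them, so the scoring rule here is hypothetical. It skips candidates flagged by managed reliability, prefers the shortest I/O queue, and breaks ties on free resources. All field and function names are also hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageSpace:
    name: str
    scope: str             # "internal", "external", or "wide-area" relative to the unit master
    queue_depth: int       # outstanding I/O commands
    free_resources: float  # hypothetical 0.0-1.0 measure of idle controller capacity
    reliability_ok: bool   # False if managed reliability has flagged this space


def pick_target(candidates) -> Optional[StorageSpace]:
    """Pick the healthy candidate with the shallowest queue, then the most free resources."""
    healthy = [c for c in candidates if c.reliability_ok]
    if not healthy:
        return None
    return min(healthy, key=lambda c: (c.queue_depth, -c.free_resources))


candidates = [
    StorageSpace("controller A (ISE 108-1)", "internal", 32, 0.1, True),
    StorageSpace("controller B (ISE 108-1)", "external", 4, 0.5, True),
    StorageSpace("ISE 108-2", "wide-area", 4, 0.9, True),
    StorageSpace("ISE 108-3", "wide-area", 0, 1.0, False),
]
print(pick_target(candidates).name)  # ISE 108-2
```

The idle ISE 108-3 is skipped because it has been flagged by managed reliability. Controller B and ISE 108-2 tie on queue depth, and the tie goes to ISE 108-2 because it has more free resources.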

As will be appreciated, this embodiment replicates only a snapshot copy of the data to another storage space, for the purpose of processing a particular access command. Data integrity is maintained in the source LUN.
It will also be appreciated that the illustrative embodiment of FIG. 12 replicates the entire LUN to another storage space. If only part of the LUN is needed to process the outstanding access command, only that part need be replicated. Similarly, the copy can contain any of the following:

- all or part of a LUN;
- all or part of a LUN together with its respective allocated sub-LUNs;
- all or part of a sub-LUN.

In general, the replicated data should be limited to what is needed to process the outstanding access command.
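Limiting the replica to what the outstanding commands need amounts to merging the block ranges those commands touch. The hypothetical Python helper below is not described in the source. It merges (start LBA, block count) pairs into the smallest set of non-overlapping ranges, and those ranges would then be the only blocks copied.

```python
def ranges_to_replicate(commands):
    """Merge (start_lba, block_count) pairs into sorted, non-overlapping
    (start_lba, end_lba_exclusive) ranges; adjacent ranges are coalesced."""
    spans = sorted((start, start + count) for start, count in commands if count > 0)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


outstanding = [(100, 50), (120, 80), (500, 10)]
ranges = ranges_to_replicate(outstanding)
print(ranges, sum(end - start for start, end in ranges))  # [(100, 200), (500, 510)] 110
```

Here three commands reduce to 110 blocks in two ranges, rather than a copy of the whole LUN.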

FIG. 13 is a flowchart of example steps for carrying out a method 250 of wide-area replication in accordance with embodiments of the present invention. The method begins at block 252, with the system in its normal I/O processing mode. Block 254 determines whether the last I/O command has been processed. If so, the method ends.

Block 256 determines whether conditions exist that favor considering wide-area replication. Suppose the load is currently shared relatively evenly among the ISEs, so that a search would likely conclude that the unit controller's own storage space is best anyway. In that case, valuable resources are conserved by not searching for the optimal storage space each time an access command is stored.

As those skilled in the art will recognize, other conditions can favor invoking the wide-area replication mode. These include, but are not limited to, the following:

- managed reliability activity involving a particular ISE 108;
- a controller being occupied with sequential data processing;
- a particular ISE 108 being currently idle, and therefore having a relatively large resource base available.

If the determination at block 256 is yes, block 258 determines whether any access commands are candidates for wide-area replication. Here too, it is advantageous to set a threshold, so that only access commands likely to degrade system performance are examined rather than every access command. Two examples of likely candidates follow:

- the unit master controller is occupied with sequential data processing, as described above, so an outstanding concurrent access command is a good candidate;
- the unit master controller is engaged in a short-stroking routine, so an outstanding access command needs to be taken out of that routine.

If the determination at block 258 is yes, block 260 determines the optimal storage space for processing the outstanding access command. As described above, this generally means finding the storage space with relatively higher processing capability for that command. "Relatively higher processing capability" can be judged by resource availability, outstanding I/O command queue depth, and managed reliability, without being limited to these. The target storage space can be internal, external, or wide-area with respect to the unit master controller.

Finally, at block 262, the unit master controller replicates a read-only copy of the data to the storage space determined at block 260. As explained above, the replicated data block is preferably only as large as needed to process the outstanding access command. For example, the replicated data can include any of the following:

- all or part of a LUN;
- all or part of a LUN and its respective sub-LUNs;
- all or part of a sub-LUN.
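The flow of FIG. 13 can be summarized as a loop. The Python sketch below is a hypothetical rendering of blocks 252 through 262. The callback names (`conditions_favor_replication`, `is_candidate`, `best_space`, `replicate`, `process`) are placeholders for the policies the text leaves open. Servicing the command at the target after replication is also an assumption.

```python
from collections import deque


def run_method_250(commands, conditions_favor_replication, is_candidate,
                   best_space, replicate, process):
    """Hypothetical driver loop mirroring blocks 252-262 of FIG. 13."""
    queue = deque(commands)
    while queue:                                   # block 254: end after the last command
        cmd = queue.popleft()
        if conditions_favor_replication() and is_candidate(cmd):  # blocks 256 and 258
            target = best_space(cmd)               # block 260
            if target is not None:
                replicate(cmd, target)             # block 262: read-only copy of needed data
                process(cmd, target)
                continue
        process(cmd, "unit master")                # block 252: normal I/O processing


log = []
run_method_250(
    ["seq-1", "rand-7", "seq-2"],
    conditions_favor_replication=lambda: True,     # e.g. the controller is busy streaming
    is_candidate=lambda cmd: cmd.startswith("rand"),
    best_space=lambda cmd: "ISE 108-2",
    replicate=lambda cmd, target: log.append(("replicate", cmd, target)),
    process=lambda cmd, target: log.append(("process", cmd, target)),
)
print(log)
```

Because `best_space` is evaluated per command, two successive candidate commands can be sent to different ISEs, each with its own replica.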

As those skilled in the art will appreciate, this method differs from data mirroring, which stores a complete copy of the source data in a predetermined location. This embodiment can replicate data to a second ISE 108 to process a first access command. It can then replicate the same data to a different, third ISE 108 to process a second access command.

Finally, FIG. 14 is a view similar to FIG. 4, in which the plurality of data storage devices 128 and the circuit board 134 are housed in a sealed enclosure formed by a base 190 and a sealed cover 192 attached to it. Sealing the data storage devices 128 that form the MDA 118A within this enclosure has various advantages; for example, the placement of the data storage devices 128 cannot change from a preselected optimal placement. Also, when the number, size, and type of the data storage devices 128 can be clearly defined, such an arrangement allows the manufacturer of the MDA 118A to tune the system for optimal performance.

Sealing the MDA 118A also enables the manufacturer to maximize the reliability and fault tolerance of the group of storage media inside it, while requiring little or no servicing over the life of the MDA 118A. This is achieved by optimizing the drives in a multiple-spindle arrangement. Design optimization reduces cost, improves performance and reliability, and generally extends the life of the data in the MDA 118A. Furthermore, the design of the MDA 118A itself nearly eliminates rotational vibration and provides a highly efficient cooling environment. This is the subject of pending U.S. patent application 11/145,404, "Storage Array with Enhanced RVI," which is assigned to the assignee of the present application. As a result, the internal storage media can be manufactured at low cost without sacrificing the reliability, performance, or capacity of the MDA 118. Sealing the MDA 118A in this way eliminates single points of failure and makes both the isolation from rotational vibration and the cooling efficiency nearly complete. The MDA 118A can therefore be designed around the characteristics of the disk media, lowering cost while improving reliability and performance.

In summary, a self-contained ISE is provided for a distributed storage system, comprising a plurality of rotatable spindles. Each spindle supports a storage medium adjacent to a respective independently moveable actuator that stores data to and retrieves data from the storage medium. The ISE further includes an intelligent storage processor (ISP) that maps virtual storage volumes to the plurality of media for use by remote devices of the distributed storage system.
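
One simple way an ISP could map a virtual storage volume onto several media is round-robin striping. The sketch below assumes a RAID-0-style layout with a fixed stripe unit; the function name and parameters are hypothetical, and the description does not specify the ISP's actual mapping.

```python
def map_virtual_block(vlba: int, spindles: int, stripe_blocks: int) -> tuple:
    """Map a virtual-volume block to (spindle index, block on that spindle).

    Assumes consecutive stripe units of `stripe_blocks` blocks are laid out
    round-robin across `spindles` media (RAID-0-style striping).
    """
    if vlba < 0 or spindles < 1 or stripe_blocks < 1:
        raise ValueError("invalid address or geometry")
    stripe_unit, offset = divmod(vlba, stripe_blocks)
    spindle = stripe_unit % spindles
    row = stripe_unit // spindles  # full rows of stripe units before this one
    return spindle, row * stripe_blocks + offset


# Four spindles, eight-block stripe units.
for vlba in (0, 9, 35):
    print(vlba, map_virtual_block(vlba, spindles=4, stripe_blocks=8))
```

Because the mapping is a bijection, every virtual block lands on exactly one physical block, which is what lets the ISP present the media to remote devices as a single volume.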

In some embodiments, the plurality of spindles and media of the ISE are housed within a common sealed housing. Preferably, the ISP allocates memory within the virtual storage volume so as to store data in a fault-tolerant manner, such as a RAID scheme. The ISP can also carry out a managed reliability scheme during data storage, such as autonomously initiating deterministic preventive recovery steps upon detecting a predicted storage failure. Preferably, the ISE is formed from a plurality of data storage devices, each having a disk stack made up of two or more data storage disks.
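
As an example of storing data in a fault-tolerant manner such as a RAID scheme, the sketch below shows single-parity protection of one stripe: the parity block is the bytewise XOR of the data blocks, so any one lost member can be rebuilt from the survivors. This illustrates the parity arithmetic used by RAID 5 (which additionally rotates the parity position from stripe to stripe); it is not presented as the ISP's own implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recover the member at `lost_index` by XOR-ing every surviving member."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"\x01\x02", b"\x10\x20", b"\xff\x00"])
print(stripe[-1].hex())          # ee22 (parity block)
print(rebuild(stripe, 1).hex())  # 1020 (recovered second data block)
```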

In another embodiment, an ISE for use in a distributed storage system comprises a plurality of self-contained discrete data storage devices and an ISP that communicates with the data storage devices to fetch commands received from remote devices and to associate the related memory with them. Preferably, the ISP maps a virtual storage volume to the plurality of data storage devices for use by one or more remote devices of the distributed storage system. As before, the data storage devices and media may be housed in a common sealed housing. Preferably, the ISP allocates memory within the virtual storage volume so as to store data in a fault-tolerant manner, such as a RAID scheme. In addition, upon detecting a predicted storage failure, the ISP autonomously initiates deterministic preventive recovery steps within the data storage devices.

In another embodiment, a distributed storage system is provided comprising a host, a back-end storage subsystem that communicates with the host over a network, and means for virtualizing the self-contained storage capacity independently of the host.

The means for virtualizing may be characterized by a plurality of discrete, individually accessible data storage units. It may be characterized by mapping virtual blocks of storage capacity associated with the plurality of data storage units. It may be characterized by sealably housing the plurality of data storage units and their associated controls. It may be characterized by storing data in a fault-tolerant manner, such as, but not limited to, a RAID scheme. It may be characterized by autonomously initiating deterministic preventive recovery steps upon detecting a predicted storage failure. It may also be characterized by a multiple-spindle data storage array.

For purposes herein, the term "means for virtualizing" expressly does not contemplate previously attempted solutions that place the system intelligence for mapping the data storage space anywhere other than within the respective data storage subsystems. For example, "means for virtualizing" does not contemplate using a storage manager to control the functions of the data storage subsystems, nor placing a manager or switch in the SAN fabric or in the host.

Alternatively, this embodiment features a data storage system comprising a virtual engine connected to a remote device over a network for passing access commands between the remote device and a storage space. The data storage system further comprises a plurality of intelligent storage elements (ISEs), each uniquely addressable by the virtual engine for passing access commands. The ISEs transfer data from a first ISE to a second ISE concurrently with, and independently of, the passing of access commands between the virtual engine and the first ISE.
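
The point that inter-ISE transfer runs concurrently with, and independently of, host access commands can be sketched with a background thread. In this sketch the two ISEs are hypothetical in-memory dictionaries (block number to data), and the shared lock models the first ISE serializing access to its media one operation at a time; neither task waits for the other to finish its work.

```python
import threading


def serve_and_replicate(first_ise: dict, second_ise: dict, host_reads: list) -> list:
    """Serve host reads from `first_ise` while a background thread copies
    its blocks to `second_ise`. Returns the data returned to the host."""
    media_lock = threading.Lock()  # one media operation at a time on the first ISE

    def replicate():
        for lba in sorted(first_ise):
            with media_lock:
                data = first_ise[lba]
            second_ise[lba] = data  # write lands on the second ISE

    worker = threading.Thread(target=replicate)
    worker.start()

    served = []
    for lba in host_reads:  # foreground access commands from the virtual engine
        with media_lock:
            served.append(first_ise[lba])

    worker.join()  # wait for the background copy before returning (for the demo)
    return served


source = {lba: lba * 2 for lba in range(1000)}
replica = {}
print(serve_and_replicate(source, replica, [5, 500, 999]))  # [10, 1000, 1998]
print(replica == source)  # True
```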

In some embodiments, each ISE has a plurality of rotatable spindles, each supporting a storage medium adjacent to a respective independently moveable actuator that stores data to and retrieves data from the storage medium. The plurality of spindles and media may be housed in a common sealed housing.

Each ISE has a processor for mapping and managing virtual storage volumes across the plurality of media. Each ISE processor preferably allocates memory within the virtual storage capacity so as to store data in a fault-tolerant manner, such as a selected one of a plurality of different redundant array of independent drives (RAID) schemes.

Each ISE processor may autonomously carry out deterministic preventive recovery steps upon detecting a storage failure. In doing so, each ISE processor may allocate a second virtual storage capacity upon detecting the storage failure. In some embodiments, each ISE processor allocates the second virtual storage capacity in a different ISE.
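
A minimal sketch of the preventive recovery described above, assuming a hypothetical error-rate threshold as the failure predictor: when the threshold is crossed, a second virtual volume is allocated, preferably in a different ISE, and the data is copied there before the failure occurs. The `Ise` class, the threshold value, and the `.recovered` naming are illustrative assumptions, and retiring the original volume is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Ise:
    """Hypothetical ISE with a free-space counter and named volumes."""
    name: str
    free_blocks: int
    volumes: dict = field(default_factory=dict)  # volume name -> list of blocks


def preventive_recovery(home, others, volume, error_rate, threshold=0.01):
    """Copy `volume` to a newly allocated second volume if a failure is predicted.

    Prefers another ISE with enough free space; falls back to space inside
    `home`. Returns the ISE holding the new copy, or None if no action was needed.
    """
    if error_rate < threshold:
        return None  # no failure predicted
    data = home.volumes[volume]
    needed = len(data)
    target = next((ise for ise in others if ise.free_blocks >= needed), None)
    if target is None:
        if home.free_blocks < needed:
            raise RuntimeError("no space for a second virtual volume")
        target = home
    target.free_blocks -= needed                        # allocate second volume
    target.volumes[volume + ".recovered"] = list(data)  # copy ahead of the failure
    return target


ise1 = Ise("ISE-1", free_blocks=10, volumes={"vol0": [1, 2, 3]})
ise2 = Ise("ISE-2", free_blocks=2)
ise3 = Ise("ISE-3", free_blocks=100)
print(preventive_recovery(ise1, [ise2, ise3], "vol0", error_rate=0.05).name)  # ISE-3
```

ISE-2 is skipped because it lacks room for the three-block volume, so the second volume is allocated in ISE-3, matching the embodiment in which the second virtual storage capacity lives in a different ISE.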

This embodiment can further be characterized as a method of processing access commands between a virtual engine and an intelligent storage element while simultaneously transferring data from the intelligent storage element to another storage space.

The processing step may be characterized by the intelligent storage element mapping and managing a virtual storage volume onto self-contained physical storage. Preferably, the transferring step is characterized by the intelligent storage element autonomously initiating deterministic preventive recovery steps upon detecting a storage failure.

The transferring step may be characterized by the intelligent storage element allocating a second virtual storage volume upon detecting a storage failure. In some embodiments, the transferring step is characterized by allocating the second virtual storage volume in physical storage that the virtual engine addresses differently in the processing step. For example, the transferring step may be characterized by allocating the second virtual storage volume within the intelligent storage element. Alternatively, the transferring step may be characterized by allocating the second virtual storage volume outside the intelligent storage element, that is, within a second intelligent storage element.

The processing step may be characterized by allocating memory and storing the data in a fault-tolerant manner. The processing step may also be characterized by moving a data transfer element and a storage medium relative to each other while transferring data within a common sealed housing.

Alternatively, this embodiment features a data storage system having a plurality of intelligent storage elements individually addressable by a virtual engine, and means for moving data between the intelligent storage elements. For the purposes of this description and the claims, the term "means for moving," in relation to the structures described herein and their equivalents, means copying data from one logical unit to another without interrupting the normal I/O command processing associated with host access commands.

Alternatively, this embodiment features a data storage system comprising a virtual engine connectable to a remote device over a network for passing access commands between the remote device and a storage space. A plurality of ISEs replicate data from a first ISE to a second ISE independently of access commands being simultaneously passed between the virtual engine and the first ISE. Each ISE has a plurality of rotatable spindles, each supporting a storage medium adjacent to a respective independently moveable actuator that stores data to and retrieves data from the storage medium.

The replicated data generally forms a read-only copy whose block size is just sufficient to process an outstanding access command. That is, the ISE may replicate all or only part of a LUN, all or part of a LUN and each of its sub-LUNs, or all or only part of a sub-LUN.

The first ISE (the unit master) identifies, from among the plurality of ISEs, a second ISE as having relatively high processing capability for processing access commands associated with the replicated data. The processing capability may be determined with respect to resource availability, outstanding I/O command queue depth, and managed reliability.

Alternatively, this embodiment features a method of processing access commands between a virtual engine and an ISE while simultaneously replicating data from the ISE to another storage space. In general, the method contemplates replicating only the data needed to process an outstanding access command. In some embodiments, the replicating step is characterized by replicating only a portion of a LUN in the first ISE to a read-only copy in the second ISE. Alternatively, the replicating step is characterized by replicating an entire LUN in the first ISE to a read-only copy in the second ISE. Alternatively, the replicating step is characterized by replicating a LUN in the first ISE and each of its sub-LUNs to a read-only copy in the second ISE. Alternatively, the replicating step is characterized by replicating only a sub-LUN in the first ISE to a read-only copy in the second ISE.

The processing step is characterized by a first ISE of the plurality of ISEs identifying, from among the plurality of ISEs, a second ISE as having relatively high processing capability for processing access commands associated with the replicated data. The processing capability may be determined with respect to resource availability, outstanding I/O command queue depth, and managed reliability. Depending on the processing capability so determined, replication is performed internally, externally, or over a wide area with respect to the first ISE.
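
The three placements named above can be pinned down with a small classification helper. The rule used here, in which internal means another storage space inside the first ISE, external means a different ISE at the same site, and wide-area means an ISE at a different site, is an assumption for illustration; the description does not define the boundaries this precisely.

```python
from enum import Enum


class Placement(Enum):
    INTERNAL = "internal"    # another storage space inside the first ISE
    EXTERNAL = "external"    # a different ISE at the same site
    WIDE_AREA = "wide-area"  # an ISE at a remote site


def classify(first_ise: str, first_site: str, target_ise: str, target_site: str) -> Placement:
    """Hypothetical rule relating a replication target to the first (unit master) ISE."""
    if target_ise == first_ise:
        return Placement.INTERNAL
    if target_site == first_site:
        return Placement.EXTERNAL
    return Placement.WIDE_AREA


print(classify("ISE-1", "site-A", "ISE-4", "site-B").value)  # wide-area
```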

The method is further characterized by selecting the rate at which data is transferred according to the desired performance of the target storage space. In general, replication directs a read-only copy to any of a plurality of target storage spaces. For example, the method may replicate data from a first ISE to a second ISE to process a first outstanding access command, and then replicate the same data to another ISE to process a subsequent access command.
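
The rate selection can be illustrated with a simple, hypothetical policy: transfer just fast enough for the read-only copy to be ready when the target is needed, but cap the rate at the share of link bandwidth not reserved for host I/O. The parameter names, the deadline notion, and the 50 percent host share are assumptions; the description states only that the rate is chosen according to the desired performance of the target.

```python
MiB = 1 << 20


def select_copy_rate(copy_bytes: int, deadline_s: float,
                     link_bytes_per_s: float, host_share: float = 0.5) -> float:
    """Hypothetical replication-rate policy, in bytes per second.

    Go just fast enough to finish `copy_bytes` within `deadline_s`, but never
    use more than (1 - host_share) of the link, leaving the rest for host I/O.
    """
    if deadline_s <= 0:
        raise ValueError("deadline must be positive")
    if not 0.0 <= host_share < 1.0:
        raise ValueError("host_share must be in [0, 1)")
    needed = copy_bytes / deadline_s
    ceiling = link_bytes_per_s * (1.0 - host_share)
    return min(needed, ceiling)


print(select_copy_rate(64 * MiB, 2.0, 100 * MiB) / MiB)    # 32.0: deadline-driven
print(select_copy_rate(1024 * MiB, 2.0, 100 * MiB) / MiB)  # 50.0: capped by host share
```

When the cap binds, as in the second call, the copy takes longer than the deadline (about 20.5 s here), which a controller might weigh against choosing a different target storage space.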

Alternatively, this embodiment features a data storage system having a plurality of ISEs individually addressable by a virtual engine and means for replicating data between the intelligent storage elements. For the purposes of this description and within the meaning of the claims, the term "means for replicating" requires the structure disclosed herein, by which a unit master controller evaluates system-wide performance parameters and accordingly directs replication of data to process outstanding access commands.

It will be understood that, although numerous characteristics and advantages of various embodiments of the present invention have been set forth in the foregoing description, together with details of their structure and function, this detailed description is illustrative only. Changes may be made in detail, especially in matters of the structure and arrangement of parts within the principles of the present invention, to the full extent indicated by the broad general meaning of the terms in which the appended claims are expressed. For example, particular elements may vary depending on the particular processing environment without departing from the spirit and scope of the present invention.

In addition, although the embodiments described herein are directed to a data storage array, those skilled in the art will appreciate that the claimed subject matter is not so limited and that various other processing systems can be used without departing from the spirit and scope of the present invention.

This application is a continuation-in-part of U.S. application 11/145,403, filed June 3, 2005, and assigned to the assignee of the present application.

FIG. 1 is a diagram of a computer system in which embodiments of the present invention are useful.
FIG. 2 is a simplified diagram of the computer system of FIG. 1.
FIG. 3 is an exploded isometric view of an intelligent storage element constructed in accordance with an embodiment of the present invention.
FIG. 4 is a partially exploded isometric view of the multiple disk array of the intelligent storage element of FIG. 3.
FIG. 5 is an exemplary data storage device used in the multiple disk array of FIG. 4.
FIG. 6 is a functional block diagram of the intelligent storage element of FIG. 3.
FIG. 7 is a functional block diagram of the intelligent storage processor circuit board of the intelligent storage element of FIG. 3.
FIG. 8 is a functional block diagram of the intelligent storage processor of the intelligent storage element of FIG. 3.
FIG. 9 is a functional block diagram representation of the command fetching and associated memory mapping services performed by the intelligent storage element of FIG. 3.
FIG. 10 is a functional block diagram of other exemplary data services performed by the intelligent storage element of FIG. 3.
FIG. 11 is a diagram illustrating a method of wide-area replication in accordance with an embodiment of the present invention.
FIG. 12 is a diagram illustrating a method of wide-area replication in accordance with an embodiment of the present invention.
FIG. 13 is a flowchart of steps for carrying out a method of wide-area replication in accordance with an embodiment of the present invention.
FIG. 14 is a view similar to FIG. 3, showing the data storage devices and circuit board housed in a sealed enclosure.

Explanation of symbols

102 Remote device (host)
108 Intelligent storage element (ISE)
109 Storage space
200 Virtual engine

Claims

1. A data storage system, comprising:
a virtual engine connectable to a remote device over a network for passing access commands between the remote device and a storage space; and
a plurality of intelligent storage elements (ISEs) that replicate data from a first ISE to a second ISE independently of access commands being simultaneously passed between the virtual engine and the first ISE.

2. The data storage system of claim 1, wherein each ISE comprises a plurality of rotatable spindles, each spindle supporting a storage medium adjacent to a respective independently moveable actuator that stores data to and retrieves data from the storage medium.

3. The data storage system of claim 2, wherein the ISEs replicate only a portion of a logical unit number (LUN) in the first ISE to a read-only copy in the second ISE.

4. The data storage system of claim 2, wherein the ISEs replicate an entire LUN in the first ISE to a read-only copy in the second ISE.

5. The data storage system of claim 2, wherein the ISEs replicate a LUN in the first ISE and each of its sub-LUNs to a read-only copy in the second ISE.

6. The data storage system of claim 2, wherein the ISEs replicate only a sub-LUN in the first ISE to a read-only copy in the second ISE.

7. The data storage system of a preceding claim, wherein the first ISE identifies the second ISE, from among the plurality of ISEs, as having relatively high processing capability for processing access commands associated with the replicated data.

8. The data storage system of claim 7, wherein the processing capability is determined with respect to a parameter selected from the group consisting of resource availability, outstanding I/O command queue depth, and managed reliability.

9. A method comprising processing access commands between a virtual engine and an ISE while simultaneously replicating data from the ISE to another storage space.

10. The method of claim 9, wherein the replicating step replicates only a portion of a LUN in a first ISE to a read-only copy in a second ISE.

11. The method of claim 9, wherein the replicating step replicates an entire LUN in a first ISE to a read-only copy in a second ISE.

12. The method of claim 9, wherein the replicating step replicates a LUN in a first ISE and each of its sub-LUNs to a read-only copy in a second ISE.

13. The method of claim 9, wherein the replicating step replicates only a sub-LUN in a first ISE to a read-only copy in a second ISE.

14. The method of claim 9, wherein the processing step is characterized by a first ISE of a plurality of ISEs identifying, from among the plurality of ISEs, a second ISE as having relatively high processing capability for processing access commands associated with the replicated data.

15. The method of claim 14, wherein the processing capability is determined with respect to a parameter selected from the group consisting of resource availability, outstanding I/O command queue depth, and managed reliability.

16. The method of claim 9, wherein the processing step replicates data within the ISE.

17. The method of claim 9, wherein the processing step replicates data outside the ISE.

18. The method of claim 9, wherein the processing step replicates data in a second ISE.

19. The method of claim 9, wherein the processing step selects a rate at which data is transferred according to a desired performance of a target storage space.

20. The method of claim 18, wherein the processing step replicates data in a third ISE that is different from the second ISE.

21. A data storage system, comprising:
a plurality of ISEs individually addressable by a virtual engine; and
means for replicating data between the intelligent storage elements.