JP2019128960A

JP2019128960A - Data storage system, and method for accessing objects of key-value pair

Info

Publication number: JP2019128960A
Application number: JP2019007690A
Authority: JP
Inventors: 亮ソク奇; Yang Seok Ki
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2018-01-19
Filing date: 2019-01-21
Publication date: 2019-08-01
Anticipated expiration: 2039-01-21
Also published as: CN110058804A; TW201933121A; KR102460568B1; KR20190088873A; CN110058804B; JP7140688B2; TWI750425B

Abstract

【課題】ＫＶ＿ＳＳＤの特性に適合した、相異なるオブジェクト（以下、ＯＢと略）を効率的に格納する高信頼性のデータストレージシステムを提供する。【解決手段】キーバリューペアの複数のＯＢを格納する複数のデータストレージデバイス（以下、ＤＳＤと略）と、複数のＯＢの各々のＯＢサイズに基づいて、データの複製方式及び消去コーディング方式を含む相異なるデータ信頼性方式を適用する仮想ストレージレイヤを含む。複数のＯＢは、各々、第１サイズと第１サイズより大きい第２サイズを有する第１ＯＢと第２ＯＢを含み、仮想ストレージレイヤは、第１ＯＢを小型ＯＢに分類してデータ複製方式を適用し、複数のＤＳＤの中の何れか単数又は複数に亘って格納し、一方、第２ＯＢを巨大ＯＢに分類し、同一サイズの１つ又はそれ以上のチャンクに分割して消去コーディング方式を適用して複数のＤＳＤに亘って単数又は複数のチャンクを分散して格納する。【選択図】図１An object of the present invention is to provide a highly reliable data storage system that efficiently stores different objects (hereinafter abbreviated as OBs) that are adapted to the characteristics of KV_SSD. A plurality of data storage devices (hereinafter, abbreviated as DSD) for storing a plurality of OBs of a key-value pair, and a data replication method and an erasure coding method based on the OB size of each of the plurality of OBs. Includes virtual storage layers that apply different data reliability schemes. The plurality of OBs each include a first OB and a second OB having a first size and a second size larger than the first size, and the virtual storage layer classifies the first OB into a small OB and applies a data replication method, A single OB or a plurality of DSDs are stored, while the second OB is classified into a huge OB, divided into one or more chunks of the same size, and an erasure coding method is applied to store the plurality of OBs. Singly or multiple chunks are distributed and stored across the DSDs. [Selection diagram] Fig. 1

Description

本発明は、データストレージシステムに関するものであり、さらに詳細にはデータストレージシステムにおいて特に大型キーバリューオブジェクトにアクセスする方法に関する。 The present invention relates to data storage systems, and more particularly to methods for accessing large key value objects in data storage systems.

データの信頼性は、データストレージシステムの核心的な要件である。従来のブロックデバイスを使用するデータの信頼性は、消去コーディング（ｅｒａｓｕｒｅｃｏｄｉｎｇ）及びＲＡＩＤ（ＲｅｄｕｎｄａｎｔＡｒｒａｙｏｆＩｎｄｅｐｅｎｄｅｎｔＤｉｓｋｓ）のような多様なデータ複製（ｒｅｐｌｉｃａｔｉｏｎ）技術を使用して具現又は研究されている。ＲＡＩＤは、データストレージデバイスのセット上にデータを分散（ｓｐｒｅａｄ）（又は、複製（ｒｅｐｌｉｃａｔｅ））して、特定のドライブの永久的なデータの損失を防止する。ＲＡＩＤは大きく２つのカテゴリに分類される。即ち、データの完全なミラーイメージ（ｍｉｒｒｏｒｉｍａｇｅ）が第２ドライブで維持されるカテゴリと、又は、パリティブロックがデータブロックに追加されているのでフェイル（ｆａｉｌ）状況でフェイルされたデータブロックを回復できるカテゴリと、である。
消去コーディングは、高レベルのフェイルを容認できる強力なデータ保護及び回復を提供する複雑なアルゴリズムを使用してパリティ（ｐａｒｉｔｙ）に類似した一連（ｂｕｎｃｈ）のブロックを追加する。例えば、消去コーディングは、物理的ドライブを仮想化して複数の物理的ドライブに亘り分散されて速い復元を達成できる仮想ドライブを生成できる。ＲＡＩＤを使用したデータの複製は、大型オブジェクトを複製するためには余りにも高価につき、消去コーディングは、小型オブジェクトに対してはストレージ空間の浪費を招きかねない。 Data reliability is a core requirement of data storage systems. The reliability of data using a conventional block device is implemented or studied using various data replication techniques such as erasure coding and RAID (Redundant Array of Independent Disks). RAID spreads (or replicates) data over a set of data storage devices to prevent permanent data loss for a particular drive. RAID is roughly classified into two categories. That is, a category in which a complete mirror image of the data is maintained in the second drive, or a parity block is added to the data block, so that the failed data block can be recovered in a fail situation. It is a category.
Erasure coding adds a bunch of blocks similar to parity using a complex algorithm that provides strong data protection and recovery that can tolerate high levels of failure. For example, erasure coding can create virtual drives that can be physical drive virtualized and distributed across multiple physical drives to achieve fast recovery. Replicating data using RAID is too expensive to replicate large objects, and erasure coding can waste storage space for small objects.

キーバリュー・ソリッドステートドライブ（ＫＶ＿ＳＳＤ：ｋｅｙ−ｖａｌｕｅｓｏｌｉｄ−ｓｔａｔｅｄｒｉｖｅ）は、ＨＤＤ（ｈａｒｄｄｉｓｋｄｒｉｖｅｓ）及びＳＳＤ（ｓｏｌｉｄ−ｓｔａｔｅｄｒｉｖｅｓ）などの従来のブロックデバイスと比較して、相異なるインタフェース及びセマンティクス（ｓｅｍａｎｔｉｃｓ）を有する新しいタイプのストレージである。ＫＶ＿ＳＳＤは、キーバリューペアのデータバリューを直接的に格納できる。ＫＶ＿ＳＳＤに格納されたデータバリューは、アプリケーション及びデータの特性に応じて、巨大にも小型にもなり得る。従って、パフォーマンスの隘路及び空間の制約なしに、異なるサイズを有するオブジェクトを効率的に格納する効率的なデータの信頼性モデルが必要である。 Key-value solid-state drives (KV_SSD) have different interfaces and semantics (compared to traditional block devices such as hard disk drives (HDD) and solid-state drives (SSD)). storage) is a new type of storage. KV_SSD can directly store the data value of the key-value pair. The data values stored on the KV_SSD can be huge or small, depending on the application and data characteristics. Therefore, there is a need for an efficient data reliability model that efficiently stores objects of different sizes without performance bottlenecks and space constraints.

特許文献１：米国登録特許第９４１７９６３Ｂ２号公報
特許文献２：米国登録特許第９５９４６３３Ｂ２号公報
特許文献３：米国登録特許第９６３９２６８Ｂ２号公報 Patent Literature 1: US Patent No. 9417963B2 Patent Literature 2: US Patent No. 9594633B2 Patent Literature 3: US Patent No. 9539268B2

従って本発明の目的は、従来のメモリタイプとは異なるインタフェース及びセマンティクス（ｓｅｍａｎｔｉｃｓ）を有するＫＶ＿ＳＳＤの特性に適合するように、パフォーマンスの隘路及び空間の制約なしに、異なるオブジェクトを効率的に格納する信頼性の高いモデルを具現するデータストレージシステムを提供する、ことにある。 The object of the present invention is thus to efficiently store different objects without performance bottlenecks and space constraints, so as to conform to the characteristics of KV_SSD having interfaces and semantics different from conventional memory types. It is an object of the present invention to provide a data storage system that embodies a high quality model.

一実施例によると、データストレージシステムは、キーバリューペアの複数のオブジェクトを格納する複数のデータストレージデバイスと、前記複数のオブジェクトの各々のオブジェクトサイズに基づいて、データの複製方式及び消去コーディング方式を含む相異なるデータ信頼性方式を適用する仮想ストレージレイヤと、を含み、前記複数のオブジェクトは、第１サイズを有する第１オブジェクト及び前記第１サイズよりも大きい第２サイズを有する第２オブジェクトを含み、前記仮想ストレージレイヤは、前記第１オブジェクトを小型オブジェクトとして分類して前記データ複製方式を適用し、前記小型オブジェクトを前記複数のデータストレージデバイスの中の何れか１つ、又はそれ以上に亘って格納し、前記仮想ストレージレイヤは、前記第２オブジェクトを巨大オブジェクトとして分類して前記巨大オブジェクトを同一サイズの１つ又はそれ以上のチャンクに分割し、前記消去コーディング方式を適用して前記複数のデータストレージデバイスに亘って前記１つ又はそれ以上のチャンクを分散して格納する。 According to one embodiment, a data storage system includes a plurality of data storage devices that store a plurality of objects of a key-value pair, and a data duplication method and an erasure coding method based on the object size of each of the plurality of objects. A virtual storage layer that applies different data reliability schemes, wherein the plurality of objects include a first object having a first size and a second object having a second size larger than the first size. The virtual storage layer classifies the first object as a small object and applies the data replication scheme, and the small object is spread over any one or more of the plurality of data storage devices. Store the virtual storage layer The second object is classified as a giant object, the giant object is divided into one or more chunks of the same size, and the erasure coding scheme is applied to the one or more data storage devices. Distributed and store more chunks.

他の実施例によると、キーバリューペアのオブジェクトにアクセスする方法は、キーバリューペアの複数のオブジェクトを受信するステップと、ここで、前記複数のオブジェクトは、第１サイズを有する第１オブジェクト及び前記第１サイズよりも大きい第２サイズを有する第２オブジェクトを包含し、前記第１オブジェクトを小型オブジェクトとして分類するステップと、前記小型オブジェクトにデータ複製方式を適用するステップと、前記小型オブジェクトを複数のデータストレージデバイスの何れか１つ又はそれ以上に亘って格納するステップと、前記第２オブジェクトを巨大オブジェクトとして分類するステップと、前記巨大オブジェクトを同一サイズの１つ又はそれ以上のチャンクに分割するステップと、前記巨大オブジェクトに消去コーディング方式を適用するステップと、前記１つ又はそれ以上のチャンクを前記複数のデータストレージデバイスに亘って分散（ｄｉｓｔｒｉｂｕｔｅｄｌｙ）して格納するステップと、を包含する。 According to another embodiment, a method of accessing an object of a key-value pair comprises: receiving a plurality of objects of a key-value pair, wherein the plurality of objects comprises a first object having a first size; Including a second object having a second size greater than the first size, classifying the first object as a small object, applying a data replication scheme to the small object, and Storing over any one or more of the data storage devices, classifying the second object as a giant object, and dividing the giant object into one or more chunks of the same size And the huge object Comprising applying a coding scheme, and storing said one or more distributed across the chunks to the plurality of data storage devices (distributedly) to a removed by.

イベントの具現及び組み合わせの多様な新しい記述を含む前述された、そして、他の適切な特徴は、添付された図面を参照し、さらに詳細に説明され、特許請求の範囲において示されるであろう。本文で記載された特定のシステム及び方法は、記述のためだけに図示され、本発明はこれに制限されない。本文に記載された理論及び特徴は、本発明の思想から逸脱せずに、多様な実施例に具現されうることが当業者によって理解されるはずである。 The foregoing and other suitable features, including various new descriptions of event implementations and combinations, will be described in further detail with reference to the accompanying drawings and set forth in the claims. The particular systems and methods described herein are illustrated for the purpose of description only and the invention is not limited thereto. It should be understood by those skilled in the art that the theory and features described herein can be embodied in various embodiments without departing from the spirit of the invention.

オブジェクトのサイズを判別して、小型オブジェクトにはデータ複製方式を、大型オブジェクトにはチャンク分割の上、消去コーディング方式を適用するので、異なるサイズを有する複数のオブジェクトを効率的に管理・格納できる。 By determining the size of the object, the data replication method is applied to the small object, and the erasure coding method is applied to the large object after dividing the chunk, so that a plurality of objects having different sizes can be efficiently managed and stored.

本発明の詳細な仕様の一部として含まれた添付された図面は、今回の好ましい実施例を図解し、上述の概括的記述及び以下に提供される好ましい実施例の詳細な記述と共に、ここに記述された本発明の原理の説明と教示に資するであろう。 The accompanying drawings, which are included as part of the detailed specification of the present invention, illustrate the presently preferred embodiments, and together with the general description given above and the detailed description of the preferred embodiments provided below. It will serve to explain and teach the principles of the invention described.

図１は、一実施例による例示的なデータストレージデバイスに格納されたオブジェクトの概念図を示す。FIG. 1 shows a conceptual diagram of objects stored in an exemplary data storage device according to one embodiment. 図２は、一実施例によるユーザキーを含む例示的な内部キーを示す。FIG. 2 shows an exemplary internal key including a user key according to one embodiment. 図３は、一実施例による、グループの特徴を使用するオブジェクトの回収の実施例を示す。FIG. 3 shows an example of object retrieval using group features according to one embodiment. 図４は、一実施例による、グループの特徴なきオブジェクト回収の実施例を示す。FIG. 4 shows an example of group-free object collection according to one embodiment. 図５は、一実施例による専用パリティデバイスのない消去コーディングの実施例を示す。FIG. 5 illustrates an example of erasure coding without a dedicated parity device according to one embodiment. 図６は、一実施例による１つ又はそれ以上の専用パリティデバイスを使用した消去コーディングの実施例を示す。FIG. 6 shows an example of erasure coding using one or more dedicated parity devices according to one embodiment. 図７は、一実施例による、パリティ装置なしで、１つ又はそれ以上のデータストレージデバイスを通した小型オブジェクトの例示的な複製方式を示す。FIG. 7 illustrates an exemplary replication scheme of small objects through one or more data storage devices without a parity device, according to one embodiment. 図８は、一実施例によるオブジェクトをライト（ｗｒｉｔｅ）する（書込む）例示的なフローチャートを示す。FIG. 8 shows an exemplary flow chart for writing an object according to one embodiment. 図９は、一実施例によるオブジェクトをライトする例示的なフローチャートの一部のサブプロセスを示す。FIG. 9 illustrates some sub-processes of an exemplary flowchart for writing an object according to one embodiment. 図１０は、一実施例によるオブジェクトをリード（ｒｅａｄ）する（読出す）例示的なフローチャートを示す。FIG. 10 shows an exemplary flow chart for reading an object according to one embodiment. 図１１は、一実施例によるオブジェクトをリードする例示的なフローチャートの一部のサブプロセスを示す。FIG. 11 illustrates some sub-processes of an exemplary flow chart leading object according to one embodiment. 図１２は、一実施例によるオブジェクトをリードする例示的なフローチャートの別の一部のサブプロセスを示す。FIG. 12 illustrates another portion of the sub-process of the exemplary flow chart leading an object according to one embodiment.

図面の全体に亘って図面の簡潔性のために、図面は必ずしも実寸に合わせて描かれず、類似の構造又は機能の要素は、一般的に類似の参照番号により表示される。図面は、本文に記載された多様な実施例の記述を容易にすることのみ意図される。図面は、本文に開示された教示の全ての思想を記述せず、特許請求の範囲を制限しない。 For the sake of brevity throughout the drawings, the drawings are not necessarily drawn to scale, and elements of similar structure or function are generally indicated by similar reference numerals. The drawings are only intended to facilitate the description of the various embodiments described in the text. The drawings do not describe all the ideas of the teachings disclosed in the text and do not limit the scope of the claims.

本文に記載された特徴及び教示の各々は、他の特徴及び教示と共に、又は個別に使用されて、相異なるサイズを有するオブジェクトを効率的に格納するデータストレージシステム及びデータストレージシステムにオブジェクトを格納する方法を提供できる。多様な付加的な特徴及び教示を個別に、又は組み合わせて使用する代表的な実施例は、添付された図面を参照して、さらに詳細に記述される。この詳細な記述は、単に当業者に本発明の思想を実施できるように教示するためのものであり、特許請求の範囲を制限する意図はない。従って、詳細な説明で上記に開示された特徴の組み合わせは、最も広い意味で、本発明を実施するのに必須ではない可能性があり、単に本発明の格別に代表的な実施例を記述するために教示される。 Each of the features and teachings described herein may be used in conjunction with other features and teachings or individually to store objects in data storage systems and data storage systems that efficiently store objects having different sizes. We can provide a way. Representative embodiments that use various additional features and teachings separately or in combination will be described in further detail with reference to the accompanying drawings. This detailed description is merely to teach one of ordinary skill in the art to practice the inventive concepts, and is not intended to limit the scope of the claims. Thus, the combination of features disclosed above in the detailed description may, in the broadest sense, not be essential to the practice of the present invention, and merely describes a particularly representative embodiment of the present invention. To be taught.

以下の説明では、単純に説明の便宜のために、特定の命名法が本発明の全体的な理解を助けるために使用される。しかし、このような特定の記述が本開示の教示を具現するのに必須でないことは、当業者には明白であろう。 In the following description, certain nomenclature is used to aid in the overall understanding of the present invention for the convenience of explanation. However, it will be apparent to those skilled in the art that such specific descriptions are not essential to embody the teachings of the present disclosure.

本文の詳細な説明の一部は、コンピュータメモリ内のデータビットに対する演算のアルゴリズム及びシンボリックな表現の形で提供される。このようなアルゴリズミックな説明及び表現は、データ処理技術分野における当業者が自分の研究内容を同じ技術分野における他の当業者に効果的に伝達するために使用される。アルゴリズムは、本文では、しかし一般的にも、意図した結果を導出する一貫性のある一連のステップとして表現される。ステップは、物理量を物理的に操作しなければならないステップである。一般的に、しかし必須ではないが、このような物理量は、格納され、放送（一斉通知、ｂｒｏａｄｃａｓｔ）され、組み合わせられ、比較され、異なるように変形されうる電気又は磁気信号の形態を有する。主に、共通的に普通に使用されているとの理由により、このような信号をビット、バリュー、要素、シンボル、特徴、用語、数字などで示されるのが便利であることが度々実証されている。 Part of the detailed description in the text is provided in the form of algorithmic and symbolic representations of operations on data bits in computer memory. Such algorithmic descriptions and representations are used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is expressed in the text, but also generally, as a consistent series of steps leading to an intended result. The steps are steps in which physical quantities have to be physically manipulated. Generally, but not necessarily, such physical quantities have the form of electrical or magnetic signals that can be stored, broadcast (broadcast), combined, compared, and transformed differently. It has often been proven convenient to refer to such signals as bits, values, elements, symbols, features, terms, numbers, etc., mainly because they are commonly used in common. There is.

しかし、これら及び類似の用語は全て、適切な物理量と関連されるべきであり、斯かる物理量に適用される便利なラベルに過ぎない。以下の説明で、明らかにするように、特に別な意味と言及されない限り、詳細な説明の全般にわたって、「処理（ｐｒｏｃｅｓｓｉｎｇ）」、「計算（ｃｏｍｐｕｔｉｎｇ）」、「演算（ｃａｌｃｕｌａｔｉｎｇ）」、「決定（ｄｅｔｅｒｍｉｎｉｎｇ）」、「表示（ｄｉｓｐｌａｙｉｎｇ）」などのような用語を使用する説明は、コンピュータシステム又はコンピュータシステムのレジスタ及びメモリ内で物理的（電子的）量として表現されたデータを、コンピュータシステムのメモリ、レジスタ又は他の情報ストレージ内において物理量として類似に表現される相異なるデータに変形して放送する類似の電子コンピューティングデバイス、放送又は表示デバイスの動作及び処理を示す。 However, these and similar terms should all be associated with the appropriate physical quantities and are only convenient labels applied to such physical quantities. As will become apparent in the following description, unless otherwise stated, the terms "processing", "computing", "calculating" and "determination" are used throughout the detailed description. Descriptions using terms such as "determining", "displaying", etc. refer to data expressed as physical (electronic) quantities in the computer system or in the register and memory of the computer system, Fig. 6 illustrates the operation and processing of a similar electronic computing device, broadcast or display device that transforms and broadcasts different data that are similarly represented as physical quantities in a memory, register or other information storage.

さらに、代表的な実施例及び従属請求項における多様な特徴は、本発明の有用な実施例を追加的に提供するために、必ずしも特定的ではなく且つ明示的に列挙されていない方式で組み合わされる。
全てのバリューの範囲、又は、エンティティ（ｅｎｔｉｔｙ）のグループの指称（表示、ｉｎｄｉｃａｔｉｏｎ）は、本来の開示目的のために、同時に、請求される構成要件（Ｓｕｂｊｅｃｔ＿ｍａｔｔｅｒ）を制限する目的のために、全ての可能な中間値又は中間オブジェクトを開示している、ことをここに明記する。図面に示された構成要素の形状及び寸法（ｄｉｍｅｎｓｉｏｎ）は、単に本発明が具現される方式の理解を助けるために設計されており、実施例で図示された寸法及び形状に本発明を含む限定する意図はない、こともここに明記する。 Furthermore, the various features of the exemplary embodiments and subclaims are combined in a manner not necessarily specific and explicitly recited to additionally provide useful embodiments of the present invention. .
The range of all values, or the designation of a group of entities, is all for the purpose of limiting the simultaneously claimed configuration requirements (Subject_matter) for the purpose of the original disclosure. It is specified here that it discloses the possible intermediate values or intermediate objects of. The shape and dimensions of the components shown in the drawings are designed merely to assist in understanding the manner in which the present invention is embodied, and are limited to the dimensions and shapes illustrated in the examples, including the present invention. There is no intention to do, it is also stated here.

本発明の開示は、大型キーバリューオブジェクトをデータストレージシステムに格納するデータストレージシステム及び方法を記述する。本発明のデータストレージシステムは、１つ又はそれ以上のデータストレージデバイスに高い信頼性を有するデータを格納できる。特に、本発明のデータストレージシステムは、オブジェクトを、そのサイズに基づき異なる形で格納して、コスト及びストレージ空間を減らし、高い信頼性を達成できる。大型オブジェクトと関連されたデータは、小型のピース（ｓｍａｌｌｐｉｅｃｅｓ）に分割（ｓｐｌｉｔ）されることができ、１つ又はそれ以上のデータストレージデバイスに格納されることができる。ここで、データストレージデバイスに格納されたデータがキーバリューペア（ｋｅｙ−ｖａｌｕｅ＿ｐａｉｒ）と関連されたデータアレイの場合、データストレージデバイスは、キーバリュー・ソリッドステートドライブに（ＫＶ＿ＳＳＤｓ：ｋｅｙ−ｖａｌｕｅｓｏｌｉｄｓｔａｔｅｄｒｉｖｅｓ）と称され得る。 The present disclosure describes a data storage system and method for storing large key value objects in a data storage system. The data storage system of the present invention can store highly reliable data in one or more data storage devices. In particular, the data storage system of the present invention can store objects differently based on their size, reducing cost and storage space, and achieving high reliability. Data associated with large objects can be split into small pieces and stored in one or more data storage devices. Here, if the data stored in the data storage device is a data array associated with a key-value pair (key-value_pair), the data storage device is a key-value solid state drive (KV_SSDs: key-value solid state drives May be referred to as

オブジェクト（例えば、キーバリューペアのバリュー）は、複数の同じサイズのチャンク（ｃｈｕｎｋｓ）、即ち複数のピースに（ｍｕｌｔｉｐｌｅｐｉｅｃｅｓ）に分割（ｓｐｌｉｔ）されることができる。チャンクのサイズは、ランタイム中に動的にオブジェクト単位で決定されることができる。その大きさに応じて、オブジェクトは、相異なる個数及びサイズのチャンクを包含できる。 An object (e.g., the value of a key-value pair) can be split into a plurality of chunks of the same size, i.e., multiple pieces. Chunk size can be determined on a per-object basis dynamically during runtime. Depending on its size, an object can contain different numbers and sizes of chunks.

グループ（ｇｒｏｕｐ）は、目標データの信頼性（ｔａｒｇｅｔｄａｔａｒｅｌｉａｂｉｌｉｔｙ）を具現するためのデータストレージデバイスのセットとして定義される。グループは、ボックス（例えば、シャーシ（ｃｈａｓｓｉｓ）又はラック（ｒａｃｋ））内の又はボックスを介した１つ又はそれ以上のデータストレージデバイスを含むことができ、階層的又は非階層的に構造化されることができる。 A group is defined as a set of data storage devices for implementing target data reliability. A group may include one or more data storage devices in or via a box (eg, chassis or rack), and may be hierarchically or non-hierarchically structured be able to.

本発明のデータストレージシステムは、１つ又はそれ以上のデータストレージデバイスのグループ化を管理し、全体的に単一の仮想ストレージユニットとしてユーザアプリケーションに対してグループを提供する仮想ストレージレイヤを含む。仮想ストレージレイヤは、１つ又はそれ以上のデータストレージデバイスを制御する複数のドライバを管理するために使用されることができる。仮想ストレージレイヤが管理するデータストレージデバイスの個数は、信頼性の目標（ｒｅｌｉａｂｉｌｉｔｙｔａｒｇｅｔ）に基づいて構成されることができる。
消去コーディング（ｅｒａｓｕｒｅｃｏｄｉｎｇ）に対し、データストレージデバイスの全体の個数は、Ｐ個のフェイルを容認するためには、データデバイス（Ｄ）とパリティデバイス（Ｐ）の合計であり得る。
複製（ｒｅｐｌｉｃａｔｉｏｎ）に対し、Ｐ個のフェイルを容認するデータストレージデバイスの全体の個数は、Ｐ＋１であり得る。データストレージデバイスのストレージ容量は、消去コーディング又は複製においてほぼ類似になる可能性がある。仮想ストレージのストレージ容量は、グループ内の全データストレージデバイスのデータストレージ空間の合計によって決定されることができる。 The data storage system of the present invention includes a virtual storage layer that manages the grouping of one or more data storage devices and provides groups to user applications as a single virtual storage unit as a whole. The virtual storage layer can be used to manage multiple drivers that control one or more data storage devices. The number of data storage devices managed by the virtual storage layer can be configured based on a reliability target.
For erasure coding, the total number of data storage devices may be the sum of the data device (D) and the parity device (P) in order to accept P failures.
For replication, the total number of data storage devices that tolerate P failures may be P + 1. The storage capacities of data storage devices may be substantially similar in erasure coding or replication. The storage capacity of virtual storage can be determined by the sum of data storage space of all data storage devices in the group.

一実施例によると、仮想ストレージレイヤは、１つ又はそれ以上のデータストレージデバイスのグループをステートレス方式（ｓｔａｔｅｌｅｓｓｍａｎｎｅｒ）で管理できる。つまり、仮想ストレージレイヤは、オブジェクトとデータストレージデバイスとの間の如何なるマッピング情報もキー情報も維持しないし、且つ維持する必要がない。ただし、仮想ストレージレイヤは、オブジェクトの個数、利用可能なストレージ容量、及び／又はそれらと類似した種類の、１つ又はそれ以上のデータストレージデバイスにおいて必須のメタデータを、ランタイムで動的にキャッシュ且つ維持できる。 According to one embodiment, the virtual storage layer can manage one or more groups of data storage devices in a stateless manner. That is, the virtual storage layer does not maintain or need to maintain any mapping information or key information between objects and data storage devices. However, the virtual storage layer dynamically caches the required metadata at one or more data storage devices of the number of objects, available storage capacity, and / or similar types, It can be maintained.

ＫＶ＿ＳＳＤのようなデータストレージデバイスは、オブジェクト及び処理可能なオブジェクトに対するオペレーションに対する具現に特有の制限を有する。データストレージシステムの仮想ストレージレイヤは、各データストレージデバイスがサポート可能な最小バリュー及び最大バリューのサイズを認識できており、最小バリュー及び最大バリューのサイズを決定できる。 Data storage devices such as KV_SSD have implementation specific limitations on operations on objects and processable objects. The virtual storage layer of the data storage system can recognize the size of the minimum value and the maximum value that each data storage device can support, and can determine the size of the minimum value and the maximum value.

例えば、ＶＭＩＮ_ｉが第ｉ番目のＫＶ＿ＳＳＤの最小バリューのサイズである場合、仮想ストレージの最小バリューサイズ（ＶＭＩＮ＿ＶＳ）は、グループ内の個々のＫＶ＿ＳＳＤの全ての最小バリューのサイズの中の最大バリューによって定義されることができる。

同様に、ＶＭＡＸ_ｉが第ｉ番目のＫＶ＿ＳＳＤの最大バリューのサイズである場合、仮想ストレージの最大バリューのサイズ（ＶＭＡＸ＿ＶＳ）は、グループ内の個々のＫＶ＿ＳＳＤの全ての最大バリューのサイズの中の最小バリューによって定義されることができる。

一実施例で、ＲＳ（Ｒｅｅｄ−Ｓｏｌｏｍｏｎ）コードのようなＭＤＳ（ｍａｘｉｍｕｍｄｉｓｔａｎｃｅｓｅｐａｒａｂｌｅ）アルゴリズムが、消去コーディング（ｅｒａｓｕｒｅｃｏｄｉｎｇ）のために使用されることができる。 For example, if VMIN _i is the size of the minimum value of the ith KV_SSD, then the minimum value size of virtual storage (VMIN_VS) is defined by the maximum value among the sizes of all the minimum values of the individual KV_SSDs in the group Can be done.

Similarly, if VMAX _i is the size of the largest value of the ith KV_SSD, then the size of the largest value of virtual storage (VMAX_VS) is the smallest value among the sizes of all the largest values of the individual KV_SSDs in the group. Can be defined by

In one embodiment, a maximum distance separate (MDS) algorithm such as an RS (Reed-Solomon) code may be used for erasure coding.

本発明の一実施例によると、本発明のシステム及び方法は、オブジェクトのサイズに基づいて複製及び消去コーディングの両方を活用する（ｌｅｖｅｒａｇｅ）ハイブリッド型データ信頼性メカニズム（ｈｙｂｒｉｄｄａｔａｒｅｌｉａｂｉｌｉｔｙｍｅｃｈａｎｉｓｍ）を提供する。データの複製（ｒｅｐｌｉｃａｔｉｏｎ）は、小型のオブジェクト（ｓｍａｌｌｏｂｊｅｃｔｓ）に対して速く、計算負荷が軽い可能性がある。ただし、データの複製は、消去コーディングと比較するとき、大型オブジェクト（ｌａｒｇｅｏｂｊｅｃｔｓ）に対して、より多くのストレージ空間を消費し得る。消去コーディングは、大型オブジェクトに対して、データの複製よりもストレージ空間の消費が通常、少ないが、複数のデータストレージデバイスに影響を及ぼしながら、このような大型オブジェクトを復元できる。
しかし、複数のデータストレージデバイスから複数のチャンクを必要とするので、消去コーディングは、一般的に、重い計算を伴って、小型のオブジェクトを復元するのに多くの時間が消費されることができる。これは、データの消去コーディングを遂行した後に、データを複製する非階層的な方式である。代わりに、本発明のシステム及び方法は、オブジェクトのサイズに基づいて、データの複製と消去コーディングとの間の選択について柔軟な決定を提供する。言い換えると、本発明のシステム及び方法は、データの信頼性判断、つまり、オブジェクトのサイズだけでなく、オブジェクトを格納するための全体空間のオーバヘッドに基づいて、ランタイムでのデータ複製又は消去コーディングの何れを適用するべきかを決定できる。 According to one embodiment of the present invention, the system and method of the present invention provide a hybrid data reliability mechanism that leverages both replication and erasure coding based on the size of the object. . Data replication can be fast for small objects and can be computationally intensive. However, data replication may consume more storage space for large objects when compared to erasure coding. Erasing coding typically consumes less storage space for large objects than data replication, but can restore such large objects while affecting multiple data storage devices.
However, since multiple chunks are required from multiple data storage devices, erasure coding generally can be time consuming to restore small objects with heavy computations. This is a non-hierarchical way of duplicating data after performing erasure coding of the data. Instead, the system and method of the present invention provide flexible decisions about the choice between data replication and erasure coding based on the size of the object. In other words, the system and method of the present invention can either be data replication or erasure coding at runtime based on data reliability determination, ie not only the size of the object but also the overhead of the overall space for storing the object. Can decide to apply.

一実施例によると、本発明のシステム及び方法では、オブジェクトがアップデートされる際に、リード・モディファイ・ライトのオーバヘッド（ｒｅａｄ−ｍｏｄｉｆｙ−ｗｒｉｔｅｏｖｅｒｈｅａｄ）を伴わない。ブロックデバイスに対する従来の消去コーディング又はデータの複製は、部分的アップデートが存在するときでも、多くのオーバヘッドを伴う。もしデータの小さな部分がアップデートされる場合でも、消去コーディングに対するブロックの全てがリードされてアップデートされなければならないし、アップデート後にパリティブロックは、再演算され、データストレージデバイスに再書込み（ライトバック）されなければならない。言い換えると、１つのブロックが多重のオブジェクトで共有されることができるので、オブジェクトのアップデートは、一連のリード・モディファイ・ライトのプロセスを必要とする。
対照的に、本発明のシステム及び方法は、オブジェクト及びその特性、例えば、サイズに基づく信頼性を提供できる。オブジェクトがアップデートされる場合、リード・モディファイ・ライトのオーバヘッドが発生することなしに、アップデートされたオブジェクトだけがオーバライト（上書き、ｏｖｅｒ−ｗｒｉｔｅ）を必要とする。 According to one embodiment, the system and method of the present invention does not involve read-modify-write overhead when an object is updated. Conventional erasure coding or data replication for block devices involves a lot of overhead, even when partial updates exist. Even if a small portion of the data is updated, all of the blocks for erasure coding must be read and updated, and after the update the parity block is recomputed and rewritten (write back) to the data storage device There must be. In other words, an object update requires a series of read-modify-write processes because one block can be shared by multiple objects.
In contrast, the systems and methods of the present invention can provide reliability based on an object and its characteristics, eg, size. When an object is updated, only the updated object needs to be overwritten (write-over) without the overhead of read-modify-write.

本発明の一実施例によると、本発明のシステム及び方法は、統合フレームワーク（ｕｎｉｆｉｅｄｆｒａｍｅｗｏｒｋ）の元に広範囲なオブジェクトのサイズをサポートする。オブジェクトがデータストレージデバイスのストレージ容量の限界を超えて、非常に大きな（又は巨大な）場合には、データストレージデバイスは、オブジェクトを全体として格納できない可能性がある。この場合、オブジェクトは、小型のピース（ｐｉｅｃｅ）（以下、チャンク（ｃｈｕｎｋ）と称される。）に分割される必要がある。本発明の開示は、データストレージデバイスの容量に基づいて、オブジェクトがチャンクに分割され、分割されたチャンクから復元（ｒｅｂｕｉｌｄ）する方法を論議する。例えば、本発明のシステム及び方法は、以下でより詳細に論議されるように、データストレージデバイスのグループ特徴サポート（ｇｒｏｕｐｆｅａｔｕｒｅｓｕｐｐｏｒｔ）に基づいて多様な分割及び復元メカニズムを提供する。 According to one embodiment of the present invention, the system and method of the present invention support a wide range of object sizes under a unified framework. If the object is very large (or huge) beyond the storage capacity limit of the data storage device, the data storage device may not be able to store the object as a whole. In this case, the object needs to be divided into small pieces (hereinafter referred to as chunks). The present disclosure discusses how objects are divided into chunks and rebuilt from the divided chunks based on the capacity of the data storage device. For example, the systems and methods of the present invention provide a variety of partitioning and restoration mechanisms based on group feature support for data storage devices, as discussed in more detail below.

本発明のシステム及び方法は、固定されたブロックに基づかないオブジェクトの信頼性を提供する。データの複製及び消去コーディングは、オブジェクトのサイズに基づいてオブジェクトを分岐する（ｂｉｆｕｒｃａｔｅ）ことにより、シングルディスクグループに対するオブジェクトの目標信頼性（ｔａｒｇｅｔｒｅｌｉａｂｉｌｉｔｙ）を具現するようにミックスして組み合わされる。このような方法は、階層的構造（つまり、消去コーディングされたオブジェクトに対する複製）を含む従来のデータの信頼性技法とは異なる。本発明のシステム及び方法は、スペース（空間）の効率性を一次的関心事として有し、性能は、特定のオブジェクトに適した信頼性のメカニズムを決定するための二次的要素である。オブジェクトのストレージは、ステートレス（ｓｔａｔｅｌｅｓｓ）である。複製又は消去コーディングの何れか１つのために、追加の情報が格納される必要はない。オブジェクトのサイズと無関係にアップデートについてリード・モディファイ・ライトのオーバヘッドが要求されない。 The system and method of the present invention provide object reliability that is not based on fixed blocks. Data replication and erasure coding is mixed and combined to embody the object's target reliability for a single disk group by bifurcating the object based on the size of the object. Such methods are different from conventional data reliability techniques that include hierarchical structure (i.e., duplicates to erasure coded objects). The system and method of the present invention have space efficiency as a primary concern, and performance is a secondary factor to determine the reliability mechanism that is appropriate for a particular object. The storage of objects is stateless. No additional information needs to be stored for any one of duplication or erasure coding. No read-modify-write overhead is required for updates regardless of the size of the object.

本文で、オブジェクトは、入力及び出力（Ｉ／Ｏ：ｉｎｐｕｔａｎｄｏｕｔｐｕｔ）オペレーションの間において固定されたバリュー（ｆｉｘｅｄｖａｌｕｅ）を有する静的データ（ｓｔａｔｉｃｄａｔａ）を示す。オブジェクトは、キーバリューペアのキーと関連されることができる。このような場合、オブジェクトは、キーバリューペアのバリューに対応する。オブジェクトは、同じサイズの複数のチャンクに分割されることができ、チャンクのサイズは、ランタイム中において動的にオブジェクト単位ベースに決定されることができる。各オブジェクトは、一つ又はそれ以上のデータストレージデバイスに格納されるとき、相異なる個数及び相異なるサイズのチャンクを包含できる。 In the text, an object indicates static data having a fixed value (fixed value) between input and output (I / O) operations. An object can be associated with a key of a key-value pair. In such a case, the object corresponds to the value of the key-value pair. Objects can be divided into multiple chunks of the same size, and the size of the chunks can be determined dynamically on an object-by-object basis during runtime. Each object, when stored on one or more data storage devices, can contain different numbers and different sized chunks.

データの信頼性なしに、１つ又はそれ以上のデータストレージデバイス（例えば、ＫＶ＿ＳＳＤ）に亘って一度に全て格納されるオブジェクトのデータのチャンクの最小セットは、本文でスライス（ｓｌｉｃｅ）と称する。例えば、スライスは、１つ又はそれ以上のデータストレージデバイスに分割して格納されたオブジェクトの複数個のチャンクと対応する。データの信頼性（例えば、複製又は消去コーディング）を具現するオブジェクトのデータのチャンクの最小セットは、「バンド（ｂａｎｄ）」と称する。バンド内のチャンクの個数は、パリティチャンクのためスライスのチャンクの個数よりも多い可能性がある。グループ内の１つ又はそれ以上のデータストレージデバイスの中の１つのデータストレージデバイスに格納されたオブジェクトのチャンクのセットはスプリット（ｓｐｌｉｔ）と称されることができる。 Without data reliability, the smallest set of chunks of object data stored all at once across one or more data storage devices (eg, KV_SSD) is referred to in the text as a slice. For example, a slice corresponds to a plurality of chunks of an object stored separately in one or more data storage devices. The smallest set of data chunks of an object that embodies data reliability (e.g., duplication or erasure coding) is called a "band". The number of chunks in the band may be more than the number of chunks of a slice because of parity chunks. A set of chunks of objects stored on one data storage device among one or more data storage devices in a group can be referred to as a split.

スライスは、データの複製（つまり、原本の（ｏｒｉｇｉｎａｌ）オブジェクトの複製されたコピー）に対しては、（目標オブジェクトの）１つのチャンクのみを包含できる。一方、スライスは、消去コーディングに対しては、目標オブジェクトを含むＤ個のチャンクを包含できる。一方、バンドは、データ複製に対しては、Ｐ個の複製チャンク及び１つの原本チャンクを含むことができる一方、消去コーディングに対しては、Ｐ個のパリティチャンク及びＤ個のデータチャンクを含むことができる。スライス又はバンドの個数は、スプリット内のチャンクの総数に対応する。
スプリットサイズ（つまり、１つのデータデバイスに格納されたチャンクの個数）及びバンドサイズ（つまり、１つ又はそれ以上のデータストレージデバイスに格納された、１つのオブジェクトに含まれるデータチャンク及びパリティチャンクの個数の合計）は、原本のオブジェクト及びチャンクのサイズに応じて可変されることができる。例えば、スプリットサイズは、データの複製に対しては原本のサイズであり得るが、スプリットサイズは、消去コーディングに対してはバンド当たりチャンク個数で全チャンク個数を除算して得られるサイズである。 A slice can contain only one chunk (of the target object) for duplication of data (i.e., duplicate copies of the original object). On the other hand, a slice can contain D chunks containing target objects for erasure coding. On the other hand, a band may contain P duplicate chunks and one original chunk for data duplication, while it contains P parity chunks and D data chunks for erasure coding. Can do. The number of slices or bands corresponds to the total number of chunks in the split.
Split size (ie, the number of chunks stored in one data device) and band size (ie, the number of data chunks and parity chunks contained in one object stored in one or more data storage devices ) Can be varied according to the size of the original object and the chunk. For example, the split size may be the original size for data replication, while the split size is the size obtained by dividing the total number of chunks by the number of chunks per band for erasure coding.

図１は、一実施例による、例示的なデータストレージシステムに格納されたオブジェクトの概念図を示す。オブジェクト１１０は、複数のチャンクに分割されることができる。例えば、オブジェクト１１０は、３Ｓ個のチャンク（Ｃｈｕｎｋ＿１〜Ｃｈｕｎｋ＿３Ｓ）に分けられる（但し、Ｓは自然数）。チャンクの総数３Ｓは、例示的なものであり、オブジェクト１１０のチャンクの総数は、相異なる数であり得、オブジェクト１１０は、３の倍数個に分割されなくてもよい。オブジェクト１１０は、以下で詳細に説明されるように、非常に大きな（ｖｅｒｙｌａｒｇｅ）、即ち巨大な（ｈｕｇｅ）オブジェクトとして分類されることができる。分割（ｓｐｌｉｔ）されたチャンクは、１つ又はそれ以上のデータストレージデバイスに亘る仮想ストレージ内に分散（ｄｉｔｒｉｂｕｔｅ）されることができる。本発明の実施例で、仮想ストレージ空間は、Ｄ個のデータストレージデバイス（Ｄｉｓｋ＿１〜Ｄｉｓｋ＿Ｄ）及びＰ個のパリティデバイス（Ｄｉｓｋ＿Ｄ＋１〜Ｄｉｓｋ＿Ｄ＋Ｐ）を含む。本発明の実施例で、Ｓ及びＤは同一である。 FIG. 1 illustrates a conceptual diagram of objects stored in an exemplary data storage system, according to one embodiment. Object 110 may be divided into multiple chunks. For example, the object 110 is divided into 3S chunks (Chunk_1 to Chunk_3S) (where S is a natural number). The total number of chunks 3S is exemplary, and the total number of chunks of the object 110 may be different numbers, and the object 110 may not be divided into multiples of three. Object 110 can be classified as a very large or huge object, as described in detail below. The split chunks can be distributed in virtual storage across one or more data storage devices. In the embodiment of the present invention, the virtual storage space includes D data storage devices (Disk_1 to Disk_D) and P parity devices (Disk_D + 1 to Disk_D + P). In the embodiment of the present invention, S and D are identical.

一実施例によると、データストレージシステムの仮想ストレージレイヤは、スプリット第１（優先）方式（スプリット優先分散）、又はバンド第１（優先）方式（バンド優先分散）の何れかによりチャンクを分散させることができる。
スプリット第１方式では、仮想ストレージレイヤは、仮想ストレージ１２０において、チャンク（Ｃｈｕｎｋ＿１、Ｃｈｕｎｋ＿２、Ｃｈｕｎｋ＿３）を第１データストレージデバイス（Ｄｉｓｋ＿１）に格納し、チャンク（Ｃｈｕｎｋ＿４、Ｃｈｕｎｋ＿５、Ｃｈｕｎｋ＿６）を第２データストレージデバイス（Ｄｉｓｋ＿２）に格納し、同様にチャンク（Ｃｈｕｎｋ＿３Ｄ−２、Ｃｈｕｎｋ＿３Ｄ−１、Ｃｈｕｎｋ＿３Ｄ）を、第Ｄデータストエージデバイス（Ｄｉｓｋ＿Ｄ）に格納する。バンド１５０は、チャンク（Ｃｈｕｎｋ＿１、Ｃｈｕｎｋ＿４、．．．、Ｃｈｕｎｋ＿３Ｄ−２）及びパリティチャンク（Ｐａｒｉｔｙ＿１〜Ｐａｒｉｔｙ＿Ｐ）を含む。
バンド優先方式では、仮想ストレージレイヤは、仮想ストレージ１２１において、チャンク（Ｃｈｕｎｋ＿１〜Ｃｈｕｎｋ＿Ｄ）をデータストレージデバイス（Ｄｉｓｋ＿１〜Ｄｉｓｋ＿Ｄ）に各々格納し、チャンク（Ｃｈｕｎｋ＿Ｄ＋１〜Ｃｈｕｎｋ＿２Ｄ）をデータストレージデバイス（Ｄｉｓｋ＿１〜Ｄｉｓｋ＿Ｄ）に各々格納し、同様にチャンク（Ｃｈｕｎｋ＿２Ｄ＋１〜Ｃｈｕｎｋ＿３Ｄ）をデータストレージデバイス（Ｄｉｓｋ＿１〜Ｄｉｓｋ＿Ｄ）に各々格納する。スプリット１５１は、データのチャンク（Ｃｈｕｎｋ＿１、Ｃｈｕｎｋ＿Ｄ＋１、Ｃｈｕｎｋ＿２Ｄ＋１）を含む。パリティチャンク（Ｐａｒｉｔｙ＿１〜Ｐａｒｉｔｙ＿Ｐ）はパリティディスク（Ｄｉｓｋ＿Ｄ＋１〜Ｄｉｓｋ＿Ｄ＋Ｐ）に格納される。
本実施形態では、説明の便宜のために、仮想ストレージ１２０及び仮想ストレージ１２１の両方が、バンド優先方式でパリティチャンクを格納する場合が示されるが、パリティチャンクの格納は、本発明の思想から逸脱せずにスプリット優先方式で遂行されうることが理解されるであろう。 According to one embodiment, the virtual storage layer of the data storage system distributes the chunks by either a split first (prioritized) method (split first distribution) or a first band (first priority) method (band first distribution). Can do.
In the split first method, the virtual storage layer stores chunks (Chunk_1, Chunk_2, Chunk_3) in the first data storage device (Disk_1) in the virtual storage 120, and chunks (Chunk_4, Chunk_5, Chunk_6) as the second data storage. The data is stored in the device (Disk_2), and the chunks (Chunk_3D-2, Chunk_3D-1, Chunk_3D) are similarly stored in the Dth data storage device (Disk_D). The band 150 includes chunks (Chunk_1, Chunk_4,..., Chunk_3D-2) and parity chunks (Parity_1 to Parity_P).
In the band priority method, in the virtual storage 121, the virtual storage layer stores chunks (Chunk_1 to Chunk_D) in data storage devices (Disk_1 to Disk_D) and chunks (Chunk_D + 1 to Chunk_2D) as data storage devices (Disk_1 to Disk_D) Similarly, chunks (Chunk_2D + 1 to Chunk_3D) are stored in the data storage devices (Disk_1 to Disk_D), respectively. The split 151 includes chunks of data (Chunk_1, Chunk_D + 1, Chunk_2D + 1). The parity chunks (Parity_1 to Parity_P) are stored in parity disks (Disk_D + 1 to Disk_D + P).
In the present embodiment, for convenience of explanation, both virtual storage 120 and virtual storage 121 are shown to store parity chunks in the band priority system, but storage of parity chunks deviates from the concept of the present invention. It will be appreciated that it may be performed in a split-first manner without the need.

Ｉ／Ｏオペレーションは、どのようなチャンク分散方式が使用されているかに関係なしに、バンド優先方式で実行されることができる。このような場合で、チャンク（Ｃｈｕｎｋ＿１、Ｃｈｕｎｋ＿４、．．．、Ｐａｒｉｔｙ＿Ｐ）に対するＩ／Ｏオペレーションは、スプリット優先方式でも並列にアクセスされる。 The I / O operations can be performed in a band first manner regardless of what chunk distribution method is used. In such a case, I / O operations for chunks (Chunk_1, Chunk_4,..., Parity_P) are accessed in parallel even in the split priority scheme.

一実施例によると、消去コーディングは、オブジェクトのサイズに基づいてオブジェクトに適用される。サイズ（ＳＺ_Ｏｉ）を有する第ｉ番目のオブジェクト（Ｏ_ｉ）に対し、デバイス当たりチャンクの数、即ち、スプリット（ＮＣ_Ｏｉ）は、下の数学式３によって定義される。複製に対し、デバイス当たりチャンクの個数は１であり得る（つまり、ＮＣ_Ｏｉ＝１）。

According to one embodiment, erasure coding is applied to an object based on the size of the object. For the i-th object (O _i ) having size (SZ _Oi ), the number of chunks per device, ie split (NC _Oi ), is defined by Equation 3 below. For replication, the number of chunks per device may be one (ie, NC _Oi = 1).

デバイス当たりチャンクの数（ＮＣ_Ｏｉ）は、最大チャンクサイズ（ＶＭＡＸ_ＶＳ）が使用されている場合には、データディスクに（つまり、Ｄｉｓｋ＿１〜Ｄｉｓｋ＿Ｄとして参照されたデータストレージデバイス）に亘って、オブジェクト（Ｏ_ｉ）を格納するためのスプリット当たりのチャンクの最小の個数である。オブジェクトのサイズがデータストレージデバイスの割り当て又は整列の単位で整列されない場合、バンド内のオブジェクトのために割り当てられた追加の空間が０（ｚｅｒｏｓ）でパディング（ｐａｄｄｅｄ）されることができる。 The number of chunks per device (NC _Oi ) is an object (across data disks (ie, data storage devices referenced as Disk_1-Disk_D)) if the maximum chunk size (VMAX _VS ) is used. O _i ) is the minimum number of chunks per split to store. If the size of the object is not aligned with the unit of data storage device allocation or alignment, the additional space allocated for the object in the band can be padded with zeros.

最大のチャンクサイズが使用される場合は、多すぎるパディングを使用して、ストレージ空間が浪費になる傾向がある。従って、オブジェクト（Ｏ_ｉ）の実際のチャンクサイズは数学式４によって、より厳密に決定される。ＳＺ_ｍｅｔａは、追加のメタデータがチャンク当たりデータと共に格納される場合のメタデータのサイズである。データストレージデバイスがグループの特徴をサポートしている場合、グループ識別子（ｇｒｏｕｐ＿ＩＤ：ｇｒｏｕｐ＿ｉｄｅｎｔｉｆｉｅｒ）及びチャンクの総数のような、一部のメタデータが、各チャンクに格納されることができる。データストレージデバイスがグループの特徴をサポートしていない場合、格納されるメタデータはない（つまり、ＳＺ_ｍｅｔａ＝０）。複製に対し、実際のチャンクサイズは、その原本のオブジェクトのサイズと同一であり得る（つまり、Ｃ_Ｏｉ＝ＳＺ_Ｏｉ）。

If the largest chunk size is used, storage space tends to be wasted using too much padding. Thus, the actual chunk size of the object (O _i ) is more strictly determined by Equation 4. SZ _meta is the size of the metadata if additional metadata is stored with the data per chunk. If the data storage device supports group features, some metadata may be stored in each chunk, such as a group identifier (group_ID) and the total number of chunks. If the data storage device does not support the group feature, there is no metadata stored (ie, SZ _meta = 0). For replication, the actual chunk size may be identical to the size of the original object (ie, C _Oi = SZ _Oi ).

数学式４によるＣ_Ｏｉは、ＶＭＩＮ_ＶＳとＶＭＡＸ_ＶＳとの間の範囲にあるが、ＶＭＡＸ_ＶＳに近いチャンクサイズを決定する。数学式４によって決定されたチャンクサイズは、Ｉ／Ｏ帯域幅を最大化しながら、Ｉ／Ｏオペレーションの回数を最小化できる。各データストレージデバイスが、格納するデータの総量（つまり、スプリットサイズ）は、数学式５によって定義される。

最終的に、データストレージデバイスに亘って一度にライト（書込み）されるデータの全体の総量、即ち、バンドサイズは、数学式６によって定義される。複製についてＤは１と同じである。

C _Oi according to Equation 4 is in the range between VMIN _VS and VMAX _VS , but determines a chunk size close to VMAX _VS. The chunk size determined by Equation 4 can minimize the number of I / O operations while maximizing the I / O bandwidth. The total amount of data (ie, split size) that each data storage device stores is defined by Equation 5.

Finally, the total amount of data written at one time across the data storage device, ie, the band size, is defined by Equation 6. For replication D is the same as 1.

前述されたように、データストレージデバイスは、データストレージデバイスが、格納できるオブジェクトのサイズに対する制限を有する。一部のデータストレージデバイスは、非常に大きな大型オブジェクト又は非常に小さい小型のオブジェクトをサポートできない可能性がある。相異なるサイズを有するオブジェクトの信頼性の高い効率的な格納を達成するために、本発明のデータストレージシステムは、格納されるオブジェクトのサイズに基づいて、相異なるデータの信頼性の方式を適用できる。一実施例によれば、本発明のデータストレージシステムの仮想ストレージレイヤは、オブジェクトのサイズに基づいてオブジェクトを４つのタイプに、即ち、巨大（ｈｕｇｅ）、大型（ｌａｒｇｅ）、中型（ｍｅｄｉｕｍ）、及び小型（ｓｍａｌｌ）に分類する。
オブジェクトを格納するのに複数のバンドが使用される場合、オブジェクトが巨大であるものに分類される。オブジェクトを格納するのに１つのバンドがほぼ全体的に使用される場合、オブジェクトが大型であるものに分類される。オブジェクトを格納するのにバンドの小さな部分（ｓｍａｌｌｆｒａｃｔｉｏｎ）が使用される場合、オブジェクトが小型であるものに分類される。最後に、オブジェクトが小型又は大型に分類できる場合、オブジェクトが中型であるものに分類される。従って、相異なるサイズを有するチャンクが同じ仮想ストレージだけでなく、仮想ストレージを形成する個々のデータストレージデバイスに共存できる。 As mentioned above, the data storage device has a limit on the size of objects that the data storage device can store. Some data storage devices may not be able to support very large large objects or very small small objects. In order to achieve reliable and efficient storage of objects having different sizes, the data storage system of the present invention can apply different data reliability schemes based on the size of the stored objects . According to one embodiment, the virtual storage layer of the data storage system of the present invention divides objects into four types based on the size of the objects: huge, large, medium, and Classified as small.
When multiple bands are used to store an object, the object is classified as huge. If a band is used almost entirely to store an object, the object is classified as being large. If a small fraction of the band is used to store the object, the object is classified as being small. Finally, if the object can be classified as small or large, the object is classified as medium. Therefore, chunks having different sizes can coexist not only in the same virtual storage but also in individual data storage devices forming the virtual storage.

オブジェクトに対する複製の空間オーバヘッドが、オブジェクトの消去コーディングの空間オーバヘッドよりも小さい場合、オブジェクトは、小型であるものに分類される。この場合、読み取り（リード）に対して、より良いパフォーマンスが提供され、複雑な消去コーディング方式よりもアップデートの管理が容易であるので、複製方式の選択が適切である。この選択は、アプリケーションのメタデータが小さくなる傾向がある状況では合理的である。一実施例で、サイズＳＺ_Ｏｉを有する小型オブジェクト（Ｏ_ｉ）は、数学式７を満足する。

An object is classified as small if the space overhead of replication for the object is smaller than the space overhead of erasure coding of the object. In this case, the choice of replication scheme is appropriate as it provides better performance for reads and is easier to manage updates than complex erasure coding schemes. This choice is reasonable in situations where application metadata tends to be small. In one embodiment, a small object (O _i ) having size SZ _Oi satisfies mathematical equation 7.

オブジェクトに対する消去コーディングの空間オーバヘッドがオブジェクトに対するデータ複製の空間オーバヘッドよりも小さい場合、オブジェクトは、大型のものに分類される。この場合に、より小さな空間を有するので、消去コーディング方式の選択が適切である。特に、大型のオブジェクトは、数学式８を満足する。

大型のオブジェクトは、図１と類似した構成になれるが、１つのバンドだけ、即ち、１つのスプリット内には１つのチャンクだけ包含できる。 An object is classified as large if the space overhead of erasure coding for the object is smaller than the space overhead of data replication for the object. In this case, the choice of the erasure coding scheme is appropriate as it has less space. In particular, large objects satisfy mathematical equation 8.

A large object can be configured similar to FIG. 1, but can contain only one band, ie, one chunk in one split.

オブジェクトがスプリット内で１つ以上のチャンクを含む場合、オブジェクトは巨大なものに分類される。この場合、消去コーディングが適切である。特に、巨大オブジェクトは、数学式９を満足する。

巨大オブジェクトは、図１と類似するように構成されることができるが、スプリット内に複数のチャンク又は複数のバンドを包含できる。 If an object contains one or more chunks in a split, the object is classified as huge. In this case, erasure coding is appropriate. In particular, huge objects satisfy the mathematical expression 9.

Giant objects can be configured to be similar to FIG. 1, but can include multiple chunks or multiple bands in a split.

小型又は大型の何れにも分類されることができないオブジェクトのサイズの範囲が存在できる。即ち、数学式１０を満足するオブジェクトは、中型のものに分類される。

このような場合には、データの複製又は消去コーディングの中の何れか１つが使用されることができる。性能がより重要であり、オブジェクトが頻繁にアップデートされる場合、データ複製がよりよい選択であり得る。この場合、中型のオブジェクトは、小型のものに分類されることができる。空間効率性がより重要な場合には、消去コーディングが使用されることができる。このような場合、中型のオブジェクトは、大型のものに分類されることができる。 There can be a range of object sizes that can not be classified as either small or large. That is, objects satisfying mathematical expression 10 are classified as medium-sized objects.

In such a case, any one of data duplication or erasure coding can be used. If performance is more important and objects are updated frequently, data replication may be a better choice. In this case, the medium-sized object can be classified as a small object. If space efficiency is more important, erasure coding can be used. In such cases, medium-sized objects can be classified as large ones.

仮想ストレージレイヤは、オブジェクトを格納し、スプリットデータチャンクを使用してオブジェクトを再建して巨大オブジェクトを回収するために、巨大オブジェクトを小型データチャンクに分割（ｓｐｌｉｔ）する必要があり得る。このため、ユーザ（例えば、ホストコンピュータ上で駆動するするユーザアプリケーション）キーから生成された内部キー（ｉｎｔｅｒｎａｌｋｅｙ）が、仮想ストレージレイヤをステートレスにするのに使用されることができる。
一実施例によると、仮想ストレージレイヤは、チャンクを分配する内部使用のために、デバイスによりサポートされたキー空間（ｄｅｖｉｃｅ−ｓｕｐｐｏｒｔｅｄｋｅｙｓｐａｃｅ）中の数バイトを保留し、キー空間の残りの部分をユーザに開放する。この場合、ユーザ特定のオブジェクトキーは、オブジェクトの１つ又はそれ以上の分割されたチャンクのための内部キーのグループを代表する。 The virtual storage layer may need to split large objects into smaller data chunks in order to store the objects and use split data chunks to reconstruct the objects and retrieve the large objects. Thus, an internal key generated from a user (eg, a user application running on a host computer) key can be used to make the virtual storage layer stateless.
According to one embodiment, the virtual storage layer reserves a few bytes in a device-supported key space supported by the device for internal use to distribute chunks and reserves the rest of the key space. Open to users. In this case, the user-specific object key represents a group of internal keys for one or more split chunks of the object.

図２は、一実施形態による、ユーザのキーを含む例示的な内部キーを示している。内部キー２００は、第１部分のユーザキー２０１及び第２部分のバンド識別子（ＩＤ；ｉｄｅｎｔｉｆｉｅｒ）２０２を含む。内部キー２００は、対応するオブジェクトに対するチャンクの部分又はチャンクの全グループを識別するために使用されることができる。この場合で、オブジェクトは、キーバリューペアのキーとしての内部キー２００を含むキーバリューペアのバリューに対応する。現在の実施例で、仮想ストレージレイヤ及び／又はデータデバイスでサポートする最大キーの長さ（ｍａｘｉｍｕｍｋｅｙｌｅｎｇｔｈ）はＬであり、グループの特定（ｇｒｏｕｐｓｐｅｃｉｆｉｃａｔｉｏｎ）のために予約されたバイトの個数はＧである。仮想ストレージレイヤは、ユーザが、使用できる最大キー長さがＬ−Ｇであることを通知する。 FIG. 2 illustrates an exemplary internal key that includes the user's key, according to one embodiment. The internal key 200 includes a first portion user key 201 and a second portion band identifier (ID) 202. The internal key 200 can be used to identify a portion of a chunk for the corresponding object or an entire group of chunks. In this case, the object corresponds to the value of the key-value pair including the internal key 200 as the key of the key-value pair. In the present embodiment, the maximum key length supported by the virtual storage layer and / or data device is L, and the number of bytes reserved for group specification is G. It is. The virtual storage layer informs the user that the maximum key length available is L-G.

小型又は大型オブジェクトに対し、ＧバイトのバンドＩＤ（２０２）は、基本的に（ｂｙ＿ｄｅｆａｕｌｔ）０でパディングできる。巨大オブジェクトに対し、仮想ストレージレイヤは、オブジェクトのバンドの個数を計算できる。個々のバンドは、内部キー２０１を使用して識別されることができる。バンドはスプリット優先又はバンド優先方式に応じて１つずつオブジェクトを格納するために割り当てられた１つ又はそれ以上のデータストレージデバイスに書込みされることができる。 For small or large objects, the G-byte band ID (202) can be padded with (by_default) 0 basically. For large objects, the virtual storage layer can calculate the number of bands in the object. Individual bands can be identified using internal key 201. Bands can be written to one or more data storage devices assigned to store objects one by one according to a split priority or band priority scheme.

一実施例によると、データストレージデバイスは、グループの特徴（ｇｒｏｕｐｆｅａｔｕｒｅ）をサポートできる。仮想ストレージレイヤは、ユーザキー２０１に基づいて、グループを特定することにより、データストレージデバイスに格納されたスプリットを識別できる。この場合で、追加のメタデータが要求されない可能性がある（ＳＺ＿ｍｅｔａ＝０）。仮想ストレージレイヤは、「ｄｏ＿ｎｏｔ＿ｃａｒｅ」ビット（任意（ａｒｂｉｔｒａｒｙ）データのビット、例えば、０ｘＦＦ）で満たされたバンドＩＤ（２０２）及びユーザキー２０１を、全てのデータストレージデバイスに通知することで、オブジェクトに対する全てのチャンクを回収できる。
バンドＩＤが「ｄｏ＿ｎｏｔ＿ｃａｒｅ」である場合、バンドＩＤフィールドは無視される。データストレージデバイスがグループの特徴を効率的に具現することが仮定されることができる。例えば、トライ木構造（ｔｒｉｅｓｔｒｕｃｔｕｒｅ）は、ユーザキー２０１の与えられた所定の接頭部（ｐｒｅｆｉｘ）を有するオブジェクトを容易に識別できる一方で、ハッシュテーブルは、メタデータフィールドが固定されている場合にのみ、ユーザキーを使用してハッシュバケットからオブジェクトを発見できる。仮想ストレージレイヤは、デバイスごとにバンドＩＤ２０２に基づいて、返されたオブジェクトチャンクを昇順（ａｓｃｅｎｄｉｎｇｏｒｄｅｒ）に整列させて、バンド及びオブジェクトを再構成して、ユーザキー２０１と共にオブジェクトを返す。 According to one embodiment, a data storage device can support group features. The virtual storage layer can identify the split stored in the data storage device by specifying the group based on the user key 201. In this case, additional metadata may not be required (SZ_meta = 0). The virtual storage layer sends a band ID (202) and user key 201 filled with "do_not_care" bits (arbitrary data bits, for example, 0xFF) to all data storage devices, thereby for the object. All chunks can be collected.
If the band ID is "do_not_care", the band ID field is ignored. It can be assumed that the data storage device efficiently embodies the features of the group. For example, a trie structure can easily identify an object having a given prefix given a user key 201, while a hash table can be used when the metadata field is fixed. You can only discover objects from hash buckets using a user key. The virtual storage layer arranges the returned object chunks in ascending order based on the band ID 202 for each device, reconstructs the band and the object, and returns the object together with the user key 201.

図３は、一実施例による、グループの特徴を使用するオブジェクトの回収の例を示す。オブジェクトを格納するように割り当てられたデータストレージデバイス（つまり、Ｄｉｓｋ＿１〜Ｄｉｓｋ＿Ｄ及びＤｉｓｋ＿Ｄ＋１〜Ｄｉｓｋ＿Ｐ）の各々は、グループの特徴（ｇｒｏｕｐｆｅａｔｕｒｅ）をサポートする。この場合で、バンドＩＤ（３０２）は、バンドＩＤ（３０２）を無視するように「ｄｏ＿ｎｏｔ＿ｃａｒｅ」と設定される。仮想ストレージレイヤは、ユーザキー３０１を使用してバンド３５０に属するチャンク（つまり、データのチャンク１、２、．．．、Ｄと、パリティチャンク１〜Ｐ）を収集し、バンド３５０からのチャンク（Ｃｈｕｎｋ＿１〜Ｃｈｕｎｋ＿Ｄ）を含む第１スライスを再構成する。
その後で、仮想ストレージレイヤは、残りのチャンクを収集し、残りのスライスを順番に再構成する。全てのスライスが構成された場合、仮想ストレージレイヤは、スライスからのデータのチャンク（Ｃｈｕｎｋ＿１〜３Ｄ）を含むオブジェクト３１０を再構成する。消去コーディングの場合で、仮想ストレージレイヤは、パリティチャンク（ｐａｒｉｔｙｃｈｕｎｋ１〜Ｐ）からパリティブロックをさらに再構成する。現在の実施例で、チャンクを分割することについてバンド優先方式を示しているが、本発明の思想から逸脱せずに、スプリット優先方式がチャンク分割及びオブジェクト回収に適用できることが理解されるであろう。 FIG. 3 illustrates an example of object retrieval using group features, according to one embodiment. Each of the data storage devices assigned to store the objects (ie, Disk_1 to Disk_D and Disk_D + 1 to Disk_P) supports a group feature. In this case, the band ID (302) is set as "do_not_care" so as to ignore the band ID (302). The virtual storage layer uses the user key 301 to collect the chunks belonging to the band 350 (that is, data chunks 1, 2,..., D and parity chunks 1 to P) and Reconstruct the first slice containing Chunk_1 to Chunk_D).
Thereafter, the virtual storage layer collects the remaining chunks and reconstructs the remaining slices in order. If all slices have been configured, the virtual storage layer reconstructs an object 310 that contains chunks of data (Chunk_1 to 3D) from the slices. In the case of erasure coding, the virtual storage layer further reconstructs parity blocks from parity chunks (parity chunks 1 to P). While the present embodiment shows a band priority scheme for splitting chunks, it will be appreciated that the split priority scheme can be applied to chunk splitting and object retrieval without departing from the spirit of the invention. .

図４は、一実施例による、グループの特徴がないオブジェクトの回収の例を示す。この場合で、オブジェクトを格納するために割り当てられたデータストレージデバイスがグループの特徴をサポートしていないので、仮想ストレージレイヤは、大型又は巨大オブジェクトに追加のメタデータを添付する（つまり、ＳＺ＿ｍｅｔａ≠０）。各チャンクは、１バイトの長さを有するバンドＩＤ４０２によって識別されることができる。現在の実施例では、バンドの個数が１バイトの長さに適合できるように、３つのバンド（Ｂａｎｄ＿１、Ｂａｎｄ＿２、Ｂａｎｄ＿３）が存在することができる。
仮想ストレージレイヤは、バンドＩＤ４０２を使用して１つずつスライスを構成できる。まず、仮想ストレージレイヤは、バンドＩＤが０（ＢＩＤ＝０）であるバンド０と共にユーザキー４０１を、全てのデータストレージデバイスに放送する。仮想ストレージレイヤは、全てのデータストレージデバイスからバンド０（Ｂａｎｄ＿０）に対するチャンクを受信し、受信されたチャンクの中のバンド０（Ｂａｎｄ＿０）に属するチャンクからバンド情報を回収する。受信されたバンドの情報をベースに、仮想ストレージレイヤは、バンドの数を認知して、オブジェクトを回収する。オブジェクトが大型の場合は、１つのバンドのみが存在できるので、仮想ストレージレイヤは、そのバンド内のチャンクからオブジェクト全体を再構成できる。
巨大オブジェクトに対しては１つより多くのチャンクが存在するので、仮想ストレージレイヤは、複数のバンド（例えば、Ｂａｎｄ＿１とＢａｎｄ＿２）を１つずつ回収することを必要とする。この場合で、仮想ストレージレイヤは、全てのチャンクが回収されるまで、バンドＩＤを調節する（例えば、ＢＩＤ＝１又はＢＩＤ＝２）ことにより、回収要請を放送する。仮想ストレージレイヤが、全てのスライスを構成した場合、仮想ストレージレイヤは、オブジェクト４１０を再構成できる。小型オブジェクトは、デバイスがグループの特徴をサポートしているか否かに無関係に、メタデータを有していない可能性がある。チャンクサイズをチェックすることで、仮想ストレージレイヤは、数学式７及び数学式１０を使用して、オブジェクト４１０が小型であるか否かを決定できる。 FIG. 4 shows an example of the collection of objects without group features according to one embodiment. In this case, the virtual storage layer attaches additional metadata to the large or huge objects, since the data storage devices assigned to store the objects do not support group features (ie, SZ_meta ≠ 0). ). Each chunk can be identified by a band ID 402 having a length of 1 byte. In the current embodiment, there can be three bands (Band_1, Band_2, Band_3) so that the number of bands can be adapted to a length of 1 byte.
The virtual storage layer can configure slices one by one using the band ID 402. First, the virtual storage layer broadcasts the user key 401 to all data storage devices together with band 0 whose band ID is 0 (BID = 0). The virtual storage layer receives chunks for band 0 (Band_0) from all data storage devices, and collects band information from the chunks belonging to band 0 (Band_0) in the received chunks. Based on the received band information, the virtual storage layer recognizes the number of bands and collects the object. If the object is large, only one band can exist, so the virtual storage layer can reconstruct the entire object from the chunks in that band.
Since there are more than one chunk for a huge object, the virtual storage layer needs to retrieve multiple bands (eg, Band_1 and Band_2) one by one. In this case, the virtual storage layer broadcasts the collection request by adjusting the band ID (for example, BID = 1 or BID = 2) until all chunks are collected. If the virtual storage layer has configured all slices, then the virtual storage layer can reconfigure the object 410. Small objects may not have metadata, regardless of whether the device supports the features of the group. By checking the chunk size, the virtual storage layer can use Equation 7 and Equation 10 to determine if the object 410 is small.

巨大オブジェクトを１つ又はそれ以上のデータストレージデバイスに書込むために、巨大オブジェクトは、同じサイズのＮＣ_Ｏｉ × Ｄ個のチャンクに（つまり、ＮＣ_ｏｉ個のスライス）に分割されることができる。最後のデータチャンク（例えば、図５でオブジェクト５１０ａのデータ４ａ（Ｄａｔａ＿４ａ）及びオブジェクト５１０ｂのデータ４ｂ（Ｄａｔａ＿４ｂ））は、配列の要求を考慮して、０（ｚｅｒｏ）にパディングされることができ、Ｐ個のパリティチャンクはスライス当たりＤ個のデータのチャンクから生成されることができる。 In order to write huge objects to one or more data storage devices, the large objects can be divided into NC _Oi × D chunks of the same size (ie, _NCOi slices). The last data chunk (for example, data 4a (Data_4a) of object 510a and data 4b (Data_4b) of object 510b in FIG. 5) can be padded to 0 (zero) in consideration of an array request, P parity chunks can be generated from D chunks of data per slice.

図５は、一実施例による、専用のパリティデバイスがない消去コーディングの例を示す。図６は、一実施例による、１つ又はそれ以上の専用パリティデバイスを使用した消去コーディングの例を示す。各バンド当たりＤ個のデータチャンク及びＰ個のパリティチャンクを含む（Ｄ＋Ｐ）個のチャンクの全体は、１つ又はそれ以上のデータストレージデバイスに分散されてＮＣ_Ｏｉ個のバンドの全てが書込みされる。パリティチャンクは、Ｄ＋Ｐ個のデバイスの一部（例えば、図５でＳＳＤ４乃至ＳＳＤ６）に分散して格納されるか、又はＰ個の専用デバイス（例えば、図６のＳＳＤ５及びＳＳＤ６）に格納されることができる。
主データストレージデバイスは、データストレージデバイス上でバンドＩＤなしでユーザキーのハッシュバリュー（以下、「Ｈａｓｈ（ｕｓｅｒ＿ｋｅｙ）」（ハッシュ（ユーザキー））と称される）を使用して選択されることができる。（Ｄ＋Ｐ）個のデバイスの全部又は一部は、図５の実施例で選択されることができ、Ｄ個のデバイスは、図６の実施例で選択されることができる。スタートデバイスは、専用のパリティデバイスがない場合、Ｈａｓｈ（ｕｓｅｒ＿ｋｅｙ）％（Ｄ＋Ｐ）によって決定でき、又は、専用のパリティデバイスが存在する場合、Ｈａｓｈ（ｕｓｅｒｋｅｙ）％Ｄによって決定できる。
後続のチャンクは、次のデバイス（例えば、（Ｈａｓｈ（ｕｓｅｒ＿ｋｅｙ）＋１）％（Ｄ＋Ｐ）、（Ｈａｓｈ（ｕｓｅｒｋｅｙ）＋２）％（Ｄ＋Ｐ）、．．．、（Ｈａｓｈ（ｕｓｅｒｋｅｙ）＋Ｄ＋Ｐ−１）％（Ｄ＋Ｐ）、又は、（Ｈａｓｈ（ｕｓｅｒ＿ｋｅｙ）＋１）％Ｄ、（Ｈａｓｈ（ｕｓｅｒ＿ｋｅｙ）＋２）％Ｄ、．．．、（Ｈａｓｈ（ｕｓｅｒ＿ｋｅｙ）＋Ｄ− １）％Ｄ）の順に順次的に書込みされることができる。このようなオペレーションは、バンド当たり動作であり、仮想ストレージレイヤは、オブジェクトのチャンクを書込むために、全てのＮＣ_Ｏｉ個のバンドに対して、このような手順を繰り返す。ユーザキーのハッシュバリューは、各オブジェクト当たり一度計算されることを必要とする。 FIG. 5 shows an example of erasure coding without a dedicated parity device according to one embodiment. FIG. 6 illustrates an example of erasure coding using one or more dedicated parity devices according to one embodiment. The total of (D + P) chunks, including D data chunks and P parity chunks per band, is distributed to one or more data storage devices and all NC _Oi bands are written . The parity chunks may be stored distributed on some of the D + P devices (e.g. SSD4 to SSD6 in FIG. 5) or stored on P dedicated devices (e.g. SSD5 and SSD6 in FIG. 6) be able to.
The primary data storage device is selected using the hash value of the user key (hereinafter referred to as “Hash (user_key)” (hash (user key)) without band ID on the data storage device it can. All or some of the (D + P) devices can be selected in the embodiment of FIG. 5, and D devices can be selected in the embodiment of FIG. The start device can be determined by Hash (user_key)% (D + P) if there is no dedicated parity device, or by Hash (user key)% D if there is a dedicated parity device.
Subsequent chunks are assigned to the next device (eg, (Hash (user_key) +1)% (D + P), (Hash (user key) +2)% (D + P), ..., (Hash (user key) + D + P-1)% (D + P) or (Hash (user_key) +1)% D, (Hash (user_key) +2)% D,..., (Hash (user_key) + D-1)% D) It can be sequentially written sequentially. Such an operation is a per band operation, and the virtual storage layer repeats such a procedure for all NC _Oi bands to write a chunk of the object. The hash value of the user key needs to be calculated once for each object.

データストレージデバイスがグループの特徴をサポートしていない場合、チャンクは、図４に図示されたように、バンドの全体の個数及びバンドＩＤに対する追加のメタデータを含む。バンドの個数は、数学式３によって決定されることができる。バンドのチャンクは、メタデータとして（ｂａｎｄＩＤ、ＮＣ_Ｏｉ）のペアを包含できる。 If the data storage device does not support the group feature, the chunk includes additional metadata for the total number of bands and band ID, as illustrated in FIG. The number of bands can be determined by Equation 3. Band chunks can contain (band ID, NC _Oi ) pairs as metadata.

図５を参照すると、仮想ストレージレイヤは、オブジェクト５１０ａのデータチャンク（Ｄａｔａ＿１ａ〜Ｄａｔａ＿４ａ）及びパリティチャンク（Ｐａｒｉｔｙ＿１ａとＰａｒｉｔｙ＿２ａ）を、データストレージデバイス（ＳＳＤ１〜ＳＳＤ６）を通して格納する。仮想ストレージレイヤは、他のオブジェクト５１０ｂのデータチャンク（Ｄａｔａ＿１ｂ〜Ｄａｔａ＿４ｂ）及びパッリティチャンク（Ｐａｒｉｔｙ＿１ｂとＰａｒｉｔｙ＿２ｂ）をデータストレージデバイス（ＳＳＤ１〜ＳＳＤ６）を通して格納する。スタートデバイス（例えば、オブジェクト５１０ａに対するＳＳＤ１及びオブジェクト５１０ｂに対するＳＳＤ６）は、前述されたように、ユーザキーのハッシュバリューによって決定されることができる。専用パリティデバイスがないので、仮想ストレージレイヤは、データチャンク及びパリティチャンクを区別せずに、データのチャンク及びパリティチャンクをデータストレージデバイス（ＳＳＤ１〜ＳＳＤ６）に分散できる。現在の実施例で、ＳＳＤ４及びＳＳＤ６は、データチャンク及びパリティチャンクを全て含む。 Referring to FIG. 5, the virtual storage layer stores data chunks (Data_1a to Data_4a) and parity chunks (Parity_1a and Parity_2a) of the object 510a through data storage devices (SSD1 to SSD6). The virtual storage layer stores data chunks (Data_1b to Data_4b) and parity chunks (Parity_1b and Parity_2b) of the other objects 510b through the data storage devices (SSD1 to SSD6). The start device (eg, SSD1 for object 510a and SSD6 for object 510b) can be determined by the hash value of the user key, as described above. Because there is no dedicated parity device, the virtual storage layer can distribute data chunks and parity chunks to data storage devices (SSD1 to SSD6) without distinguishing between data chunks and parity chunks. In the current embodiment, SSD 4 and SSD 6 contain all data chunks and parity chunks.

図６を参照すると、仮想ストレージレイヤは、オブジェクト６１０ａのデータチャンク（Ｄａｔａ＿１ａ〜Ｄａｔａ＿４ａ）及びパリティチャンク（Ｐａｒｉｔｙ＿１ａとＰａｒｉｔｙ＿２ａ）を、データストレージデバイス（ＳＳＤ１〜ＳＳＤ６）を通して格納する。同様に、仮想ストレージレイヤは、オブジェクト６１０ｂのデータチャンク（Ｄａｔａ＿１ｂ〜Ｄａｔａ＿４ｂ）及びパリティチャンクに（Ｐａｒｉｔｙ＿１ｂ及びＰａｒｉｔｙ＿２ｂ）をデータストレージデバイス（ＳＳＤ１〜ＳＳＤ６）に亘って格納する。ＳＳＤ５及びＳＳＤ６はパリティデバイスに割り当てられたので、ＳＳＤ５及びＳＳＤ６はパリティチャンクだけを含む。 Referring to FIG. 6, the virtual storage layer stores the data chunk (Data_1a to Data_4a) and the parity chunk (Parity_1a and Parity_2a) of the object 610a through the data storage devices (SSD1 to SSD6). Similarly, the virtual storage layer stores (Parity_1b and Parity_2b) in the data chunk (Data_1b to Data_4b) and the parity chunk of the object 610b across the data storage devices (SSD1 to SSD6). Since SSD5 and SSD6 were assigned to parity devices, SSD5 and SSD6 contain only parity chunks.

大型のオブジェクトを１つ又はそれ以上のデータストレージデバイスに書込むために、大型のオブジェクトは、同じサイズのＮＣ_Ｏｉ×Ｄ個のチャンクに分割されることができる。大型オブジェクトは、オブジェクトに対して１つのバンド（つまり、ＮＣＯｉ＝１）であり得る点だけを除けば、巨大オブジェクトと同じように管理される。 To write a large object to one or more data storage devices, the large object can be divided into NC _Oi × D chunks of the same size. Large objects are managed like large objects, except that they can be one band to the object (ie, NCOi = 1).

小型のオブジェクトを格納するために、該オブジェクトに対して（Ｐ＋１）個の複製オブジェクトが生成されることができる。パディングを含む整列を考慮すると、複製オブジェクトは（Ｐ＋１）個のデバイスに分散されることができる。主デバイスは、（Ｄ＋Ｐ）デバイス上で（但し、Ｄ＝１）、ユーザキーのハッシュバリューを使用して選択されることができる。Ｐ個の複製オブジェクトは、ストレージの組織、性能などのような多様な要因をベースに決定論的に選択されることができる。たとえば、複製オブジェクトは（Ｈａｓｈ（ｋｅｙ）＋１）％（Ｄ＋Ｐ）、（Ｈａｓｈ（ｋｅｙ）＋２）％（Ｄ＋Ｐ）、．．．、（Ｈａｓｈ（ｋｅｙ）＋Ｐ）％（Ｄ＋Ｐ）、又は、専用のパリティデバイスの有無と関係なしに、相異なるノード、又はラックに単純に格納されることができる。専用パリティデバイス又はノードが存在する場合、複製オブジェクトは（Ｈａｓｈ（ｋｅｙ）＋１）％Ｄ、（Ｈａｓｈ（ｋｅｙ）＋２）％Ｄ、．．．、（Ｈａｓｈ（ｋｅｙ）＋Ｐ）％Ｄに格納されることができる。デバイスの容量と関係なしに、小型のオブジェクトは、メタデータを包含しない。 In order to store small objects, (P + 1) duplicate objects can be generated for the objects. Given the alignment, including padding, duplicate objects can be distributed to (P + 1) devices. The main device can be selected on the (D + P) device (where D = 1) using the hash value of the user key. The P replicated objects can be selected deterministically based on various factors such as storage organization, performance, and the like. For example, the duplicate object is (Hash (key) +1)% (D + P), (Hash (key) +2)% (D + P),. . . , (Hash (key) + P)% (D + P), or can be simply stored in different nodes or racks with or without dedicated parity devices. If a dedicated parity device or node exists, then the duplicate object is (Hash (key) +1)% D, (Hash (key) +2)% D,. . . , (Hash (key) + P)% D can be stored. Small objects do not contain metadata, regardless of device capacity.

図７は、一実施例による、パリティデバイスなしで、１つ又はそれ以上のデータストレージデバイスに亘る小型オブジェクトの例示的な複製方式を示す。仮想ストレージレイヤは、データストレージデバイス（ＳＳＤ１、ＳＳＤ２、ＳＳＤ３）を通して小型オブジェクト７１０ａ（Ｏｂｊｅｃｔ＿１）を格納できる。仮想ストレージレイヤは、小型オブジェクト７１０ｂ（Ｏｂｊｅｃｔ＿２）チャンクをデータストレージデバイス（ＳＳＤ３、ＳＳＤ４、及びＳＳＤ５）に亘って格納できる。小型オブジェクト（７１０ａ、７１０ｂ）は、より小さなデータチャンクに分割されない。オブジェクト（７１０ａ、７１０ｂ）を格納するためのスタートデバイスは、対応するユーザキーのハッシュバリューによって決定できる。現在の実施例で、オブジェクト７１０ａに対するスタートデバイスはＳＳＤ１であり、オブジェクト７１０ｂに対するスタートデバイスはＳＳＤ３である。現在の実施例で、オブジェクト（７１０ａ、７１０ｂ）の各々に対する複製オブジェクトの全体の個数は３である（つまり、Ｐ＝２）。 FIG. 7 illustrates an exemplary replication scheme of small objects across one or more data storage devices without a parity device, according to one embodiment. The virtual storage layer can store the small object 710a (Object_1) through the data storage devices (SSD1, SSD2, SSD3). The virtual storage layer can store small object 710b (Object_2) chunks across data storage devices (SSD3, SSD4, and SSD5). Small objects (710a, 710b) are not divided into smaller data chunks. The start device for storing the objects (710a, 710b) can be determined by the hash value of the corresponding user key. In the current embodiment, the start device for object 710a is SSD1, and the start device for object 710b is SSD3. In the current embodiment, the total number of duplicate objects for each of the objects (710a, 710b) is three (ie, P = 2).

図８は、一実施例による、オブジェクトを書込む例示的なフローチャートを示す。データストレージシステムの仮想ストレージレイヤは、オブジェクトを書込むためのライト（ｗｒｉｔｅ）要請を受信する（８０１）。ユーザ（又はホストコンピュータ上で駆動するユーザアプリケーション）から受信されたライト（ｗｒｉｔｅ）要請は、ユーザのキーを含むことができる。仮想ストレージレイヤは、オブジェクトが巨大又は大型か否かを、例えば、数学式８及び数学式９を使用して判別する（８０２）。大型又は巨大オブジェクトに対し、仮想ストレージレイヤはスプリット当たり、そしてバンド当たりチャンクサイズ及びチャンクの個数を、例えば、数学式３及び数学式４を使用して決定し（８１１）、データストレージデバイスに亘って１つ又はそれ以上のバンドにデータを書込み（８１２）、ライト（ｗｒｉｔｅ）の手順を完了する（８１５）。 FIG. 8 illustrates an exemplary flowchart for writing an object, according to one embodiment. The virtual storage layer of the data storage system receives a write request to write an object (801). The write request received from the user (or the user application running on the host computer) can include the user's key. The virtual storage layer determines whether the object is huge or large using, for example, Equation 8 and Equation 9 (802). For large or large objects, the virtual storage layer determines the chunk size per chunk and per band and the number of chunks, eg, using Equation 3 and Equation 4 (811), across data storage devices Write data to one or more bands (812) and complete the write procedure (815).

仮想ストレージレイヤが、オブジェクトは大型でも巨大でもないと判断した場合、仮想ストレージレイヤは、オブジェクトが小型であるかを、例えば、数学式７を使用して判別する（８０３）。小型オブジェクトに対し、仮想ストレージレイヤは、分散ポリシー（ｄｉｓｔｒｉｂｕｔｉｏｎｐｏｌｉｃｙ）に基づいて、原本のデータ及び複製データを含むデータを格納するための１つ又はそれ以上のストレージデバイスを決定し（８１３）、バンドの１つ又はそれ以上のデバイスに亘ってデータを書込み（８１４）、ライト（ｗｒｉｔｅ）手順を完了する（８１５）。例えば、仮想ストレージレイヤがバンド優先ポリシー（複数のデータストレージデバイスに亘ってデータを分散すること）を使用できる。仮想ストレージレイヤは、ユーザのキーのハッシュバリューを使用してスタートデバイスを決定できる。 If the virtual storage layer determines that the object is neither large nor large, the virtual storage layer determines 803 whether the object is small using, for example, Equation (7). For small objects, the virtual storage layer determines (813) one or more storage devices for storing data, including original data and replicated data, based on the distribution policy (813), Write data across one or more devices (814) and complete the write procedure (815). For example, a virtual storage layer can use a band priority policy (distributing data across multiple data storage devices). The virtual storage layer can determine the start device using the hash value of the user's key.

仮想ストレージレイヤが、オブジェクトは巨大でも大型でも小型でもないと判断した場合、仮想ストレージレイヤは、オブジェクトを中型のものとして処理し（８０４）、分散ポリシーに基づいて、原本のデータ及び複製データを含むデータを格納するための１つ又はそれ以上のデータストレージデバイスを決定し（８１３）、バンドの１つ又はそれ以上のデバイスにデータを書込み（８１４）、ライト（ｗｒｉｔｅ）手順を完了する（８１５）。 If the virtual storage layer determines that the object is neither large nor large nor small, the virtual storage layer treats the object as medium sized (804) and includes original data and replicated data based on the distribution policy Determine one or more data storage devices for storing data (813), write data to one or more devices in the band (814), and complete the write procedure (815) .

データストレージデバイスを通した１つ又はそれ以上のバンドへのデータ書込みの手順（８１２）は、幾つかのサブプロセス（ｓｕｂ−ｐｒｏｃｅｓｓｅｓ）を含むことができる。
図９を参照すると、まず、仮想ストレージレイヤは、書込みのためのスライスが存在するかを判別する（８２０）。書込みのためのスライスが存在しない場合、仮想ストレージレイヤは、エラーログを格納し、オペレーションを終了し、オブジェクトを書込むための要請を放送したホストコンピュータ上で動作するユーザアプリケーション及び／又はホストコンピュータのオペレーティングシステムに知らせる。書込みのためのスライスが存在する場合、仮想ストレージレイヤは、オブジェクトのスライスを作成し、スライスのチャンクに対する内部キーを生成する（８２１）。
仮想ストレージレイヤは、これが書込みのための最後のスライスであるかをチェックする（８２２）。最後のスライスに対し、仮想ストレージレイヤは、スライスがバンドを一杯埋めることができない場合、データのチャンクにパディング（ｐａｄｄｉｎｇｓ）を追加し、消去コーディングアルゴリズムを使用して、バンドに対するＰ個のパリティチャンクを計算する（８２４）。最後のスライス以外の他のスライスに対して、仮想ストレージレイヤはパディング手順８２３を省略している。仮想ストレージレイヤは、分散ポリシーに基づいて、データ及びパリティチャンクに対するデバイスをさらに決定する（８２５）。例えば、分散ポリシーは、スプリット優先又はバンド優先ポリシーの何れか１つであり得る。仮想ストレージレイヤは、８２１で生成された内部キーと共にバンド内のデータ及びパリティチャンクを１つ又はそれ以上のストレージデバイスに書込む（８２６）。 The procedure (812) of writing data to one or more bands through the data storage device can include several sub-processes.
Referring to FIG. 9, first, the virtual storage layer determines whether there is a slice for writing (820). If a slice for writing does not exist, the virtual storage layer stores an error log, completes the operation, and broadcasts a request for writing an object. The user application and / or the host computer operating on the host computer Inform the operating system. If a slice for writing exists, the virtual storage layer creates a slice of the object and generates an internal key for the chunk of the slice (821).
The virtual storage layer checks if this is the last slice for writing (822). For the last slice, the virtual storage layer adds paddings to the chunks of data if the slice can not fill the band full, and uses an erasure coding algorithm to P parity chunks for the band Calculate (824). The virtual storage layer omits the padding procedure 823 for other slices other than the last slice. The virtual storage layer further determines devices for data and parity chunks based on the distribution policy (825). For example, the distribution policy may be any one of split priority or band priority policy. The virtual storage layer writes 826 the data and parity chunks in the band along with the internal key generated at 821 to one or more storage devices.

仮想ストレージレイヤは、ユーザキー及びバリューサイズなどのオブジェクトのメタデータを維持していないので、仮想ストレージレイヤはリード（ｒｅａｄ）するべきオブジェクトが小型であるか、大型であるか、又は巨大であるかを認識できない。従って、仮想ストレージレイヤは、オブジェクトのユーザキーを使用して、仮想ストレージの全てのデータストレージデバイス（例えば、（Ｄ＋Ｐ）個のデータストレージデバイス）にリード要請を放送し、オブジェクトのサイズの分類に基づいてオブジェクトを再構成する適切な方法を決定する。データストレージデバイスによってサポートされるグループの特徴に応じて、仮想ストレージレイヤは、リード及びライトのオペレーションを異なる方式で処理できる。 Because the virtual storage layer does not maintain object metadata, such as user keys and value sizes, does the virtual storage layer have small, large, or large objects to read? Can not recognize Accordingly, the virtual storage layer broadcasts a read request to all data storage devices (eg, (D + P) data storage devices) of the virtual storage using the user key of the object, and based on the classification of the object size. Determine the appropriate way to reconstruct the object. Depending on the characteristics of the groups supported by the data storage device, the virtual storage layer can handle read and write operations in different ways.

巨大オブジェクトは、データストレージデバイスによるグループの特徴のサポートに基づいて相異なる方法でリードされることができる。仮想ストレージのデータストレージデバイスがグループの特徴をサポートしている場合は、仮想ストレージレイヤは、ＢＩＤ＝「ｄｏ＿ｎｏｔ＿ｃａｒｅ」と共にユーザキーを全てのデータストレージデバイスに放送する。それに応答して、放送に応答したエラーがない場合は、（Ｄ＋Ｐ）個のデータストレージデバイスの各々は、そのスプリットのＮＣ_Ｏｉ個のチャンクを返す。以後に、仮想ストレージレイヤは、受信されたチャンクを、それらのバンドＩＤ当たりＮＣ_Ｏｉ個のバンドの全体に分類し、バンドＩＤの昇順にバンドを整列させる。
仮想ストレージレイヤが各バンド当たり同一サイズのＤ個のチャンクを受信している限り、仮想ストレージレイヤは、バンドを再構成できる。バンドに対して受信されたチャンクの全体の個数が、データストレージデバイスの個数（Ｄ）よりも小さいか、又はチャンクのサイズが同じでない場合は、エラーが発生する。このエラーは、全てのデータストレージデバイスが「ＮＯＴ＿ＥＸＩＳＴ」（不存在）エラーを返す場合という不在オブジェクトのリード（ａｒｅａｄｏｆｎｏｎ−ｅｘｉｓｔｉｎｇｏｂｊｅｃｔ）、又は復旧できないエラー（ｕｎｒｅｃｏｖｅｒａｂｌｅｅｒｒｏｒ）によって発生されることができる。仮想ストレージレイヤは、バンドを１つずつ再構成し、その後に、バンドからオブジェクトを再構成する。 Giant objects can be leaded in different ways based on the support of group features by the data storage device. If the virtual storage data storage device supports the group feature, the virtual storage layer broadcasts the user key to all the data storage devices with BID = “do_not_care”. In response, if there is no error in response to the broadcast, each of the (D + P) data storage devices returns the NC _Oi chunks for that split. Thereafter, the virtual storage layer classifies the received chunks into a total of NC _Oi bands per their band ID, and arranges the bands in ascending order of band IDs.
As long as the virtual storage layer receives D chunks of the same size for each band, the virtual storage layer can reconfigure the bands. If the total number of chunks received for the band is less than the number of data storage devices (D) or the chunk sizes are not the same, an error occurs. This error may be generated by a read of non-existent object, or by an unrecoverable error if all data storage devices return a "NOT_EXIST" error. it can. The virtual storage layer reassembles bands one by one and then reassembles objects from the bands.

仮想ストレージのデータストレージデバイスがグループの特徴をサポートしていない場合、仮想ストレージレイヤは巨大オブジェクトをリードするために、ＢＩＤ＝０と共にユーザキーを全てのデータストレージデバイスに放送する。それに応答して、（Ｄ＋Ｐ）個のデータストレージデバイスの各々は、第１バンド（Ｂａｎｄ＿０）から１つのチャンクを返す。仮想ストレージレイヤは、受信されたチャンクの一部と共に格納されたメタデータをチェックすることで、オブジェクトに対するバンドの個数を識別できる。仮想ストレージレイヤは、以後に、バンドＩＤの昇順で１つずつ全てのバンドを再構成して、バンドを使用してオブジェクトを再構成する。幾つかの実施例で、仮想ストレージレイヤは、残りのバンドの全てを非同期的に要請することで、グループのサポートされている場合と同様に、オブジェクトを再構成できる。 If the virtual storage data storage device does not support the group feature, the virtual storage layer broadcasts the user key with BID = 0 to all data storage devices to read the huge object. In response, each of the (D + P) data storage devices returns one chunk from the first band (Band_0). The virtual storage layer can identify the number of bands for an object by checking the metadata stored with part of the received chunks. The virtual storage layer subsequently reconstructs all the bands one by one in ascending order of band ID and reconstructs the object using the bands. In some embodiments, the virtual storage layer can reconstruct objects as if the group is supported by requesting all of the remaining bands asynchronously.

大型のオブジェクトのリードの手順は、バンドが単一である点を除けば、巨大オブジェクトのリードの手順と同様である。データストレージデバイスがグループの特徴をサポートしている場合は、仮想ストレージレイヤは、ＢＩＤ＝「ｄｏ＿ｎｏｔ＿ｃａｒｅ」と共にユーザキーを全てのデータストレージデバイスに放送する。それに応答して、（Ｄ＋Ｐ）のデータストレージデバイスの各々は、そのスプリットの１つのチャンクを返す。
仮想ストレージレイヤが同一サイズのＤ個のチャンクを受信している限り、仮想ストレージレイヤは、バンド、即ちオブジェクトを再構成できる。バンドに対して受信されたチャンクの全体の個数が、データチャンクの数（Ｄ）よりも小さいか、又はチャンクのサイズが同じでない場合は、エラーが発生する。このエラーは、全てのデータストレージデバイスが「ＮＯＴ−ＥＸＩＳＴ」（不存在）エラーを返す場合という不在オブジェクト（ｎｏｎ−ｅｘｉｓｔｉｎｇｏｂｊｅｃｔ）のリード、又は復旧できないエラー（ｕｎｒｅｃｏｖｅｒａｂｌｅｅｒｒｏｒ）によって発生されることができる。 The procedure for reading large objects is similar to the procedure for reading large objects, except that the band is single. If the data storage device supports the group feature, the virtual storage layer broadcasts the user key to all data storage devices with BID = “do_not_care”. In response, each of the (D + P) data storage devices returns one chunk of that split.
As long as the virtual storage layer receives D chunks of the same size, the virtual storage layer can reconfigure the band or object. If the total number of chunks received for the band is less than the number of data chunks (D) or the chunk sizes are not the same, an error occurs. This error can be caused by a read of a non-existing object where all data storage devices return a “NOT-EXIST” (non-existing) error, or an unrecoverable error. .

仮想ストレージのデータストレージデバイスがグループの特徴をサポートしていない場合、仮想ストレージレイヤは大型オブジェクトをリードするために、ＢＩＤ＝０と共にユーザキーを全てのデータストレージデバイスに放送する。それに応答して、（Ｄ＋Ｐ）個のデータストレージデバイスの各々は、バンド（ＢＩＤ＝０）で、１つのチャンクを返す。仮想ストレージレイヤは、受信されたチャンクと共に格納されたメタデータをチェックすることにより、大型のオブジェクトに対して１つのバンドのみが存在することを識別し、バンドを再構成することで、オブジェクトを再構成できる。 If the virtual storage data storage device does not support the group feature, the virtual storage layer broadcasts the user key with BID = 0 to all data storage devices to read large objects. In response, each of the (D + P) data storage devices returns one chunk in the band (BID = 0). The virtual storage layer identifies the existence of only one band for large objects by checking the metadata stored with the received chunks, and reconstructs the objects by reconstructing the bands. It can be configured.

小型オブジェクトのリードの手順は、複製に依存するという点を除けば、大型オブジェクトのリードの手順と同様である。データストレージデバイスがグループの特徴をサポートする場合、仮想ストレージレイヤはＢＩＤ＝「ｄｏ＿ｎｏｔ＿ｃａｒｅ」と共にユーザキーを全てのデータストレージデバイスに放送する。それに応答して、（Ｄ＋Ｐ）のデータストレージデバイスの各々は、応答を返す。オブジェクトが小型であるので、幾つかのデータストレージデバイスは、「ＮＯＴ＿ＥＸＩＳＴ」のエラーを返すことができる一方で、他のデータストレージデバイスは、チャンクを返す。仮想ストレージレイヤは、１つ又はそれ以上のデータストレージデバイスから受信されたチャンクを使用して、オブジェクトを再構成できる有効なチャンクを受信する。全てのデータストレージデバイスが「ＮＯＴ＿ＥＸＩＳＴ」エラーを返す場合、リード要請によって識別されるオブジェクトは、不在オブジェクトであるか、又は復旧できないエラーが発生することができる。 The procedure for reading small objects is similar to the procedure for reading large objects except that it depends on duplication. If the data storage device supports group features, the virtual storage layer broadcasts the user key to all data storage devices with BID = “do_not_care”. In response, each of the (D + P) data storage devices returns a response. Because the object is small, some data storage devices can return a “NOT_EXIST” error, while other data storage devices return chunks. The virtual storage layer uses the chunks received from one or more data storage devices to receive valid chunks that can reconfigure the object. If all data storage devices return a "NOT_EXIST" error, the object identified by the read request may be an absent object or an unrecoverable error may occur.

仮想ストレージのデータストレージデバイスがグループの特徴を提供していない場合、仮想ストレージレイヤは小型オブジェクトをリードするために、ＢＩＤ＝０と共にユーザキーを全てのデータストレージデバイスに放送する。それに応答して、（Ｄ＋Ｐ）のデータストレージデバイスの各々は、応答を返す。仮想ストレージレイヤは、任意のデータストレージデバイスから受信された、幾つかの有効なチャンクを使用して、オブジェクトが小型であるかを識別できる。小型オブジェクトは、追加のメタデータを維持しない。 If the virtual storage data storage device does not provide group features, the virtual storage layer broadcasts the user key with BID = 0 to all data storage devices to lead small objects. In response, each of the (D + P) data storage devices returns a response. The virtual storage layer can use several valid chunks received from any data storage device to identify if the object is small. Small objects do not maintain additional metadata.

図１０は、一実施例による、オブジェクトをリードする例示的なフローチャートを示す。仮想ストレージレイヤは、オブジェクトへのリード要請を受信する（９０１）。ユーザ（又はホストコンピュータ上で動作するユーザアプリケーション）から受信されたリード要請は、ユーザのキーを含むことができる。仮想ストレージレイヤは、オブジェクトの１つ又はそれ以上のチャンクを格納する、１つ又はそれ以上のデータストレージデバイスによってグループの特徴がサポートされるかを判別する（９０２）。グループの特徴がサポートされている場合、仮想ストレージレイヤは、ＢＩＤ＝「ｄｏ＿ｎｏｔ＿ｃａｒｅ」を有する内部キーと共にリード要請を全てのデータデバイスに放送し（９０３）、そうでない場合は、ＢＩＤ＝０を全てのデータストレージデバイスに設定する（９０４）。仮想ストレージレイヤは、データストレージデバイスの何れか１つからチャンクを受信する（９０５）。受信されたチャンクにエラーがない場合（９０６）、仮想ストレージレイヤは、グループの特徴がサポートされているかを判別し（９０７）、グループの特徴がサポートされていない場合、チャンクのメタデータを回収する（９０８）、グループの特徴がサポートされている場合、仮想ストレージレイヤは、オブジェクトのサイズを判別し続ける。 FIG. 10 shows an exemplary flow chart for leading an object, according to one embodiment. The virtual storage layer receives a read request for an object (901). The read request received from a user (or a user application operating on a host computer) may include the user's key. The virtual storage layer determines (902) whether the group feature is supported by one or more data storage devices that store one or more chunks of objects. If group features are supported, the virtual storage layer broadcasts a read request to all data devices with an internal key with BID = "do_not_care" (903), otherwise BID = 0 for all. Set to the data storage device (904). The virtual storage layer receives chunks from any one of the data storage devices (905). If there is an error in the received chunk (906), the virtual storage layer determines if the group feature is supported (907), and if the group feature is not supported, it recovers the chunk metadata (908) If the group feature is supported, the virtual storage layer continues to determine the size of the object.

仮想ストレージレイヤが、オブジェクトが巨大又は大型であると判断した場合（９０９）、仮想ストレージレイヤは、全体のオブジェクトが再構成されたかをさらにチェックし（９１０）、１つ又はそれ以上のデータストレージデバイスから受信されたチャンクを使用して、全体のオブジェクトが再構成されるまでスライスを１つずつ再構成し続けて（９１２）、リードの手順を完了する（９１３）。小型オブジェクトに対しては、仮想ストレージレイヤは、１つ又はそれ以上のデータストレージデバイスから受信されたチャンクを使用してオブジェクトを再構成する（９１１）。 If the virtual storage layer determines that the object is huge or large (909), the virtual storage layer further checks if the entire object has been reconstructed (910), one or more data storage devices Continue to restructure slices one by one (912), using the chunks received from C, until the entire object is rebuilt, to complete the read procedure (913). For small objects, the virtual storage layer restructures the object using chunks received from one or more data storage devices (911).

小型オブジェクトを再構成する手順（９１１）は、幾つかのサブプロセスを含むことができる。
図１１を参照すると、仮想ストレージレイヤは、受信されたチャンクが小型であるかを、まずチェックする（９２１）。仮想ストレージレイヤは、小型オブジェクトに対する小型チャンクを予想するが、大きなチャンクを受信した場合、仮想ストレージレイヤは、エラーを発生する（９２４）。仮想ストレージレイヤが、受信されたチャンクが有効であると判断した場合（９２２）、仮想ストレージレイヤは、受信されたチャンクを使用して、小型のオブジェクトを再構成し（９２３）、そうでない場合には、仮想ストレージレイヤは、他のデータストレージデバイスからチャンクを受信する（９２５）。 The procedure (911) for reconstructing small objects may include several sub-processes.
Referring to FIG. 11, the virtual storage layer first checks 921 if the received chunk is small. The virtual storage layer anticipates small chunks for small objects, but if large chunks are received, the virtual storage layer generates an error (924). If the virtual storage layer determines that the received chunk is valid (922), the virtual storage layer uses the received chunk to reconstruct a small object (923), otherwise The virtual storage layer receives chunks from other data storage devices (925).

スライスを再構成する手順（９１２）は、幾つかのサブプロセスを含むことができる。
図１２を参照すると、仮想ストレージレイヤは、スライスを再構成するＤ個のチャンクが全て受信されたかをチェックする（９３１）。もしそうなら、仮想ストレージレイヤには、Ｄ個のチャンクと共に消去コーディングアルゴリズムを使用してスライスを再構成する（９３５）。もしそうでない場合、仮想ストレージレイヤは、現在のバンドの全てのチャンクが受信されたかをチェックする（９３２）。バンドの全てのチャンクが受信された場合、受信されたチャンクのうち少なくとも１つが有効でないことが有り得るので、仮想ストレージレイヤは、エラーを生成する（９３６）。受信されるチャンクがさらに存在する場合、仮想ストレージレイヤは、他のデータストレージデバイスからのそれらを受信し続けて（９３３）、スライスを再構成する全てのＤ個のチャンクを受信するまで手順を繰り返す。受信されたチャンクが大型オブジェクトのものでも、巨大オブジェクトのものでもない場合（たとえば、チャンクが小型オブジェクトのものである場合）、仮想ストレージレイヤは、エラーを発生する（９３６）。 The procedure for reconstructing a slice (912) may include several sub-processes.
Referring to FIG. 12, the virtual storage layer checks if all D chunks for reconfiguring a slice have been received (931). If so, the virtual storage layer reconstructs the slice using the erasure coding algorithm with D chunks (935). If not, the virtual storage layer checks if all chunks of the current band have been received (932). If all chunks of the band have been received, the virtual storage layer generates an error (936) because at least one of the received chunks may not be valid. If there are more chunks to be received, the virtual storage layer continues to receive them from other data storage devices (933) and repeats the procedure until it has received all D chunks that reconfigures the slice . If the received chunk is not for a large object or a large object (eg, if the chunk is for a small object), the virtual storage layer generates an error (936).

一実施例によると、データストレージシステムは、キーバリューのペアの複数のオブジェクトを格納する複数のデータストレージデバイスと、前記複数のオブジェクトの各々のオブジェクトサイズに基づいて消去コーディング方式及びデータ複製方式を含む相異なるデータ信頼性方式を適用する仮想ストレージレイヤを含む。前記複数のオブジェクトは、第１サイズを有する第１オブジェクト及び前記第１のサイズよりも大きい第２サイズを有する第２オブジェクトを含む。
前記仮想ストレージレイヤは、前記第１オブジェクトを小型オブジェクトとして分類して前記データ複製方式を適用し、前記小型オブジェクトを前記複数のデータストレージデバイスの何れか１つ又はそれ以上に亘って格納する。
前記仮想ストレージレイヤは、前記第２オブジェクトを巨大オブジェクトとして分類して前記巨大オブジェクトを同一サイズの１つ又はそれ以上のチャンクに分割し、前記消去コーディング方式を適用して前記１つ又はそれ以上のチャンクを前記複数のデータストレージデバイスに亘って分散して格納する。 According to one embodiment, a data storage system includes a plurality of data storage devices that store a plurality of objects of key-value pairs, and an erasure coding scheme and a data replication scheme based on the object size of each of the plurality of objects. It includes virtual storage layers that apply different data reliability schemes. The plurality of objects includes a first object having a first size and a second object having a second size larger than the first size.
The virtual storage layer classifies the first object as a small object, applies the data replication method, and stores the small object across any one or more of the plurality of data storage devices.
The virtual storage layer classifies the second object as a giant object, divides the giant object into one or more chunks of the same size, and applies the erasure coding scheme to the one or more chunks. Chunks are distributed and stored across the plurality of data storage devices.

前記分散方式は、前記仮想ストレージレイヤが前記巨大オブジェクトの前記１つ又はそれ以上のチャンクを前記複数のデータストレージデバイスの各々に宛てて分割（ｓｐｌｉｔ）し、前記複数のデータストレージデバイスに亘って前記データストレージデバイスの各々に前記分割した１つ又はそれ以上のチャンクを格納するスプリット優先分散方式である場合がある。 The distribution scheme is such that the virtual storage layer splits the one or more chunks of the giant object to each of the plurality of data storage devices, and the plurality of data storage devices are spread over the plurality of data storage devices. There may be a split priority distribution scheme that stores one or more of the split chunks in each of the data storage devices.

前記分散方式は、上述のスプリット優先分散と異なり、前記仮想ストレージレイヤが前記巨大ブジェクトの前記１つ又はそれ以上のチャンクが、前記１つ又はそれ以上のデータストレージデバイスに全て格納されるまで、前記複数のデータストレージデバイスの各々に、前記巨大オブジェクトの１つのチャンクを格納するバンド優先分散方式である場合がある。 The distribution scheme is different from the split priority distribution described above, in the virtual storage layer until the one or more chunks of the giant project are all stored in the one or more data storage devices. In each of a plurality of data storage devices, there may be a band-first distribution scheme in which one chunk of the huge object is stored.

前記仮想ストレージレイヤは、第３サイズを有する第３オブジェクトを大型オブジェクトとしてさらに分類でき、前記大型オブジェクトを同一サイズの１つ又はそれ以上のチャンクに分割し、前記消去コーディング方式を適用して、前記複数のデータストレージデバイスに亘って前記分割した１つ又はそれ以上のチャンクを分散して格納する。前記第３サイズは、前記第１サイズより大きく、前記第２サイズより小さい。前記大型オブジェクトは、１つのバンドだけ、即ち、１つのスプリット内には１つのチャンクだけを含む。 The virtual storage layer can further classify a third object having a third size as a large object, divide the large object into one or more chunks of the same size, apply the erasure coding scheme, and The divided one or more chunks are distributed and stored across multiple data storage devices. The third size is larger than the first size and smaller than the second size. The large object contains only one band, ie only one chunk in one split.

前記仮想ストレージレイヤは、第４サイズを有する第４オブジェクトを中型オブジェクトとしてさらに分類できる。前記第４サイズは、前記第１サイズより大きく、前記第３サイズより小さい。前記仮想ストレージレイヤは、前記データ複製方式及び前記消去コーディング方式の何れか１つを適用する。 The virtual storage layer can further classify a fourth object having a fourth size as a medium-sized object. The fourth size is larger than the first size and smaller than the third size. The virtual storage layer applies any one of the data replication scheme and the erasure coding scheme.

前記のオブジェクトの各々は、ユーザキーを使用して識別されることができる。前記仮想ストレージレイヤは、前記巨大オブジェクトに対するバンド識別子及び前記ユーザキーを含む内部キーを生成できる。前記バンド識別子は、複数のバンドの中で各々のバンドを識別するために使用される。前記複数のバンドの各々は、前記複数のデータストレージデバイスに亘って分散された１つ以上のチャンクのうちの何れか１つのチャンクを含む。 Each of the above objects can be identified using a user key. The virtual storage layer may generate an internal key including a band identifier for the giant object and the user key. The band identifier is used to identify each band among a plurality of bands. Each of the plurality of bands includes any one of one or more chunks distributed across the plurality of data storage devices.

前記の仮想ストレージレイヤは、前記ユーザキーのハッシュバリューを使用して前記複数のデータストレージデバイスのうち前記１つ又はそれ以上のチャンクの１番目のチャンクをライトするか、又はリードするためのスタートデータストレージデバイスを識別できる。 The virtual storage layer may use start data for writing or reading a first chunk of the one or more chunks of the plurality of data storage devices using a hash value of the user key. Identify storage devices.

前記複数のデータストレージデバイスは、前記巨大オブジェクトに関連されたパリティチャンクを格納する、１つ又はそれ以上の専用パリティデータストレージデバイスを包含できる。 The plurality of data storage devices may include one or more dedicated parity data storage devices that store parity chunks associated with the large object.

前記複数のデータストレージデバイスは、グループの特徴をサポートでき、前記仮想ストレージレイヤは、前記巨大オブジェクトをリードするとき、前記バンド識別子を任意データのビットに設定して、前記複数のデータストレージデバイスに向かって放送できる。 The plurality of data storage devices can support the feature of the group, and the virtual storage layer sets the band identifier to a bit of arbitrary data when reading the giant object, and goes to the plurality of data storage devices Can be broadcast.

前記複数のデータストレージデバイスは、グループの特徴をサポートしていない可能性があり、その場合、前記仮想ストレージレイヤは、前記巨大オブジェクトをリードするとき、前記バンド識別子をユニークな（固有の、ｕｎｉｑｕｅ）バンド識別子に設定して前記複数のデータストレージデバイスに向かって放送できる。 The plurality of data storage devices may not support a group feature, in which case the virtual storage layer identifies the band identifier as unique when reading the giant object. A band identifier can be set and broadcast to the plurality of data storage devices.

前記複数のデータストレージデバイスの各々は、キーバリュー・ソリッドステートドライブ（ＫＶ＿ＳＳＤ：ｋｅｙ−ｖａｌｕｅｓｏｌｉｄ−ｓｔａｔｅｄｒｉｖｅ）であり得る。 Each of the plurality of data storage devices may be a key-value solid-state drive (KV_SSD).

他の実施例によると、キーバリューペアのオブジェクトにアクセスする方法は、キーバリューペアの複数のオブジェクトを受信するステップと、ここで、前記複数のオブジェクトは、第１サイズを有する第１オブジェクト及び前記第１サイズより大きい第２サイズを有する第２オブジェクトを包含し、前記第１オブジェクトを小型オブジェクトとして分類するステップと、前記小型オブジェクトにデータ複製方式を適用するステップと、前記小型オブジェクトを複数のデータストレージデバイスの何れれか１つ又はそれ以上に亘って格納するステップと、前記第２オブジェクトを巨大オブジェクトとして分類するステップと、前記巨大オブジェクトを同一サイズの１つ又はそれ以上のチャンクに分割するステップと、前記巨大オブジェクトに消去コーディング方式を適用するステップと、前記複数のデータストレージデバイスに亘って前記１つ又はそれ以上のチャンクを分散して格納するステップと、を包含する。 According to another embodiment, a method of accessing an object of a key-value pair comprises: receiving a plurality of objects of a key-value pair, wherein the plurality of objects comprises a first object having a first size; Including a second object having a second size greater than the first size, classifying the first object as a small object, applying a data replication scheme to the small object, and converting the small object to a plurality of data Storing over any one or more of the storage devices, classifying the second object as a giant object, and dividing the giant object into one or more chunks of the same size And the huge object Comprising applying a coded scheme, and storing said plurality of data the over the storage device one or more chunks dispersing the, the.

前記の方法は、前記オブジェクトへのライト（書込み）要請を受信するステップと、前記オブジェクトが前記巨大オブジェクトであるか否かを判別するステップと、前記巨大オブジェクトに対するチャンクサイズ及びチャンクの個数を判別するステップと、前記チャンクサイズ及び前記チャンク個数に基づいて、前記複数のデータストレージデバイスに、前記巨大オブジェクトの前記１つ又はそれ以上のチャンクをライトするステップと、をさらに包含できる。 The method comprises the steps of receiving a write request to the object, determining whether the object is the giant object, and determining a chunk size and number of chunks for the giant object. The method may further comprise the steps of: writing the one or more chunks of the giant object to the plurality of data storage devices based on the chunk size and the number of chunks.

前記方法は、前記１つ又はそれ以上のチャンクのうち前記複数のデータストレージデバイスの各々に対して１つのチャンクを含むスライスを生成し、前記スライスに含まれた前記１つ又はそれ以上のチャンクの各々に対して、バンド識別子が添付されたユーザキーを使用して内部キーを生成するステップと、前記消去コーディング方式を使用して、前記スライスと対応する前記１つ又はそれ以上のチャンク、及び１つ又はそれ以上のパリティチャンクを含むバンドを生成するステップと、スプリット優先方式及びバンド優先方式を含む分散方式の何れかに基づいて、前記バンドを格納する前記複数のデータストレージデバイスを判別するステップと、及び前記内部キーと共に前記のバンドに、前記１つ又はそれ以上のチャンクをライトするステップと、をさらに包含できる。 The method generates a slice including one chunk for each of the plurality of data storage devices of the one or more chunks, and the one or more chunks included in the slice. Generating, for each, an internal key using a user key with a band identifier attached, and the one or more chunks corresponding to the slice, using the erasure coding scheme, and Generating a band including one or more parity chunks; determining the plurality of data storage devices storing the band based on one of a split priority scheme and a distributed scheme including a band priority scheme; And write the one or more chunks to the band along with the internal key. Tsu and up, the further can include.

前記方法は、前記オブジェクトへのライト（書込み）要請を受信するステップと、前記オブジェクトが前記小型オブジェクトであるか否かを判別するステップと、スプリット優先方式及びバンド優先方式を含む分散方式の何れかに基づいて、前記複数のデータストレージデバイスのうち前記小型オブジェクトの前記１つ又はそれ以上のチャンクを格納するべき部分集合を決定するステップと、前記複数のデータストレージデバイスの前記部分集合に前記小型オブジェクトの前記１つ又はそれ以上のチャンクをライトするステップと、をさらに包含できる。 The method includes any one of a step of receiving a write request to the object, a step of determining whether the object is the small object, and a distributed method including a split priority method and a band priority method. Determining a subset to store the one or more chunks of the small object of the plurality of data storage devices based on the plurality of data storage devices; and the small object in the subset of the plurality of data storage devices Writing the one or more chunks of

前記の方法は、前記ユーザキーを含む前記オブジェクトへのリード（読出し）要請を受信するステップと、前記複数のデータストレージデバイスがグループの特徴をサポートしているかを判別するステップと、前記内部キーと共に前記リード要請を前記複数のデータストレージデバイスに向かって放送するステップと、をさらに包含できる。 The method includes receiving a read request to the object including the user key; determining whether the plurality of data storage devices support group features; and the internal key Broadcasting the read request towards the plurality of data storage devices.

前記のグループの特徴がサポートされている場合、前記内部キーの前記バンド識別子は、任意データのビットに設定されることができる。 If the group feature is supported, the band identifier of the internal key may be set to an arbitrary data bit.

前記のグループの特徴がサポートされていない場合、前記の内部キーの前記バンド識別子はユニークなバンド識別子に設定されることができる。 If the group feature is not supported, the band identifier of the internal key may be set to a unique band identifier.

前記方法は、前記複数のデータストレージデバイスの各々から少なくとも１つのチャンクを受信するステップと、前記消去コーディング方式を使用して前記複数のデータストレージデバイスから受信された前記少なくとも１つのチャンクからスライスを再構成するステップと、をさらに包含できる。 The method comprises the steps of receiving at least one chunk from each of the plurality of data storage devices, and re-slicing slices from the at least one chunk received from the plurality of data storage devices using the erasure coding scheme. And B. further comprising the steps of:

前述した例示的な実施例は、相異なるサイズを有するオブジェクトを効率的に格納できるデータストレージシステム及びデータストレージシステムにオブジェクトを格納する方法を具現する多様な実施例を説明するためのものである。記載された例示的な実施例からの多様な変形は、当業者によって容易に行われることができる。本発明の技術的思想から意図された事項は、以下の特許請求の範囲で記載される。 The above-described exemplary embodiments are intended to illustrate various embodiments embodying data storage systems capable of efficiently storing objects having different sizes and methods of storing objects in the data storage system. Various modifications from the described exemplary embodiments can be readily made by those skilled in the art. The matters intended from the technical idea of the present invention are described in the following claims.

本発明は、パフォーマンスの制限や空間の制約なしに、相異なるサイズを有するオブジェクトを効率的に格納する、高い信頼性を有するデータストレージシステムに有用である。 The present invention is useful for highly reliable data storage systems that efficiently store objects of different sizes without performance limitations or space limitations.

１１０，３１０，４１０，５１０，６１０，７１０オブジェクト
１２０、１２１，３２１仮想ストレージ
１５１スプリット
２００内部キー
２０１，３０１，４０１ユーザキー
２０２，４０２バンド識別子
３０２バンド識別子（＝‘ｄｏ＿ｎｏｔ＿ｃａｒｅ’の場合）
３５０バンド 110, 310, 410, 510, 610, 710 Object 120, 121, 321 Virtual storage 151 Split 200 Internal key 201, 301, 401 User key 202, 402 Band identifier 302 Band identifier (= 'do_not_care')
350 bands

Claims

A data storage system,
Multiple data storage devices that store multiple objects of key-value pairs;
A virtual storage layer applying different data reliability schemes, including a data replication scheme and an erasure coding scheme, based on the object size of each of the plurality of objects;
The plurality of objects include a first object having a first size and a second object having a second size greater than the first size,
The virtual storage layer classifies the first object as a small object, applies the data replication scheme, stores the small object across any one or more of the plurality of data storage devices,
The virtual storage layer classifies the second object as a giant object, divides the giant object into one or more chunks of the same size, and applies the erasure coding scheme to the plurality of data storage devices. A data storage system, characterized in that it stores the one or more chunks in a distributed manner throughout.

The virtual storage layer stores the one or more chunks of the giant object across the plurality of data storage devices in a split priority manner;
In the split priority scheme, the virtual storage layer stores some of the one or more chunks of the giant object in any one of the plurality of storage devices, and the giant object The other one (others) of the one or more chunks is stored in another one of the plurality of storage devices. Data storage system as described in.

The virtual storage layer stores the one or more chunks of the giant object across the plurality of data storage devices in a band first manner;
The band priority scheme is configured such that the virtual storage layer stores one chunk of the huge object until the virtual storage layer stores all the one or more chunks of the huge object in the plurality of data storage devices. The data storage system according to claim 1, wherein the data storage system is a method of storing data in each of the data storage devices.

The virtual storage layer further classifies a third object having a third size as a large object, divides the large object into one or more chunks of the same size, applies the erasure coding scheme, Storing one or more chunks distributed among the plurality of data storage devices,
The third size is larger than the first size and smaller than the second size, and the large object includes only one band, that is, only one chunk in one split. The data storage system according to claim 1.

The virtual storage layer further classifies a fourth object having a fourth size as a medium-sized object,
The fourth size is larger than the first size and smaller than the third size,
5. The data storage system according to claim 4, wherein the virtual storage layer applies any one of the data duplication method and the erasure coding method.

Each of the plurality of objects is identified using a user key;
The virtual storage layer generates an internal key including a band identifier for the giant object and the user key,
The band identifier is used to identify each band among a plurality of bands, and each of the bands is distributed over the plurality of data storage devices. The data storage system of claim 1, comprising any one of the one or more chunks.

The virtual storage layer uses a hash value of the user key to read or write a first chunk of the one or more chunks, and a start data storage device among the plurality of data storage devices The data storage system according to claim 6, characterized in that

The data storage system of claim 1, wherein the plurality of data storage devices comprises one or more dedicated parity data storage devices that store parity chunks associated with the large objects. .

The plurality of data storage devices support group features, and the virtual storage layer sets the band identifier to a bit of arbitrary data toward the plurality of data storage devices when reading the giant object. 7. A data storage system according to claim 6, characterized in that it is broadcasted.

Without supporting the group feature of the plurality of data storage devices, the virtual storage layer sets the band identifier to a unique band identifier toward the plurality of data storage devices when reading the giant object. 7. A data storage system according to claim 6, characterized in that it is broadcast.

The data storage system according to claim 1, wherein each of the plurality of data storage devices is a key-value solid-state drive (KV_SSD: key-value solid-state drive).

A method of accessing an object of a key-value pair,
Receiving a plurality of objects of the key-value pair;
Here, the plurality of objects include a first object having a first size and a second object having a second size larger than the first size,
Classifying the first object as a small object;
Applying a data replication scheme to the small object;
Storing the small object across any one or more of a plurality of data storage devices;
Classifying the second object as a giant object;
Dividing the giant object into one or more chunks of the same size;
Applying an erasure coding scheme to the giant object;
Storing the one or more chunks in a distributed manner across the plurality of data storage devices.

Receiving a write request to the object;
Determining whether the object is the giant object;
Determining a chunk size and the number of chunks for the huge object;
The method according to claim 12, further comprising writing the one or more chunks of the huge object to the plurality of data storage devices based on the chunk size and the number of chunks. The method described.

Generating a slice containing one chunk for each of the plurality of data storage devices of the one or more chunks, and for each of the one or more chunks included in the slice Generating an internal key using a user key with a band identifier attached,
Generating a band comprising the one or more chunks corresponding to the slice and one or more parity chunks using erasure coding;
Determining a plurality of data storage devices to store the band based on any of a split priority scheme and a distributed scheme including a band priority scheme;
The method of claim 13, further comprising: writing the one or more chunks in the band with the internal key.

Receiving a write request for the object;
Determining whether the object is the small object;
Determining a subset of the plurality of data storage devices to store the one or more chunks of the small object based on any of a distributed scheme including a split priority scheme and a band priority scheme;
The method of claim 12, further comprising writing the one or more chunks of the small object to the subset of the plurality of data storage devices.

Receiving a read request for the object including the user key;
Determining whether the plurality of data storage devices support group features;
Broadcasting the read request with the internal key towards the plurality of data storage devices.

The method according to claim 16, wherein the band identifier of the internal key is set to an arbitrary data bit if the group feature is supported.

The method according to claim 16, wherein if the group feature is not supported, the band identifier of the internal key is set to a unique band identifier.

Receiving at least one chunk from each of the plurality of data storage devices;
The method of claim 16, further comprising reconstructing a slice from at least one chunk received from the plurality of data storage devices using the erasure coding scheme.

The method of claim 12, wherein each of the plurality of data storage devices is a key-value solid-state drive (KV_SSD).