TW201935243A

TW201935243A - SSD, distributed data storage system and method for leveraging key-value storage

Info

Publication number: TW201935243A
Application number: TW107137684A
Authority: TW
Inventors: 提摩太Ｃ. 比森; 安納席塔雪耶斯坦; 崔昌皓
Original assignee: 南韓商三星電子股份有限公司
Priority date: 2018-02-06
Filing date: 2018-10-25
Publication date: 2019-09-01
Also published as: US11392544B2; CN110119425A; JP2019139759A; US20190243906A1; KR102779227B1; TWI778157B; JP7437117B2; KR20190095089A

Abstract

A solid-state drive (SSD) includes: a plurality of data blocks; a plurality of flash channels and a plurality of ways for accessing the plurality of data blocks; and an SSD controller that configures a block size of the plurality of data blocks. A data file is stored in the SSD with one or more key-value pairs, and each of the one or more key-value pairs has a block identifier as its key and block data as its value. The size of the data file is equal to the block size or a multiple of the block size.

Description

System and method for using key-value storage in a distributed file system to efficiently store data and metadata

The present disclosure relates generally to key-value storage devices and, more particularly, to a system and method for leveraging key-value storage to efficiently store data and metadata in a distributed file system.

In a traditional data storage node, key-value mappings (for example, a mapping from a block identifier (ID) to data content) are typically stored using an existing file system on the data storage node. This is because the underlying storage device does not natively support the key-value interface required by the data storage node. As a result, an additional software layer, typically a file system, is required to provide the key-value interface. Adding the file system introduces memory overhead and processor overhead.

A file system residing between the data storage node and the actual data storage device forces the data storage device to incur additional inefficiencies, such as overprovisioning and higher write amplification, and requires more central processing unit (CPU) cycles to perform tasks such as garbage collection in a resource-limited device environment.

According to one embodiment, a solid-state drive (SSD) includes: a plurality of data blocks; a plurality of flash channels and a plurality of ways for accessing the plurality of data blocks; and an SSD controller that configures a block size of the plurality of data blocks. A data file is stored in the SSD with one or more key-value pairs, and each key-value pair has a block identifier as its key and block data as its value. The size of the data file is equal to the block size or a multiple of the block size.

According to another embodiment, a distributed data storage system includes: a client; a name node including a first key-value (KV) solid-state drive (SSD); and a data node including a second KV SSD, wherein the second KV SSD includes a plurality of data blocks, a plurality of flash channels and a plurality of ways for accessing the plurality of data blocks, and an SSD controller that configures a block size of the plurality of data blocks. The client sends the name node a create file request that includes a file identifier for storing a data file, and sends the name node an allocate command to allocate one or more of the plurality of data blocks that are associated with the data file. The name node returns to the client the block identifiers of the one or more data blocks and the data node identifier of the data node assigned to store the one or more data blocks. The client sends a block store command to the data node to store the one or more data blocks. The second KV SSD stores the one or more data blocks as key-value pairs, and at least one of the key-value pairs has the block identifier as its key and block data as its value. The size of the data file is equal to the block size or a multiple of the block size.

According to yet another embodiment, a method includes: sending a create file request from a client to a name node, wherein the create file request includes a file identifier for storing a data file; storing the file identifier as a key-value pair in a first key-value (KV) solid-state drive (SSD) of the name node, wherein the file identifier is stored as the key of the key-value pair and the value associated with the key is empty; sending an allocate command from the client to the name node to allocate one or more data blocks associated with the data file; assigning, at the name node, a block identifier to at least one of the one or more data blocks and assigning a data node to store the one or more data blocks; returning the block identifier and a data node identifier of the data node from the name node to the client; sending a write block request from the client to the data node, wherein the write block request includes the block identifier and content; and saving the one or more data blocks as key-value pairs in a second KV SSD of the data node. The second KV SSD of the data node includes one or more data blocks having a block size. At least one of the key-value pairs has a block identifier as its key and block data as its value. The size of the data file is equal to the block size or a multiple of the block size.
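The method steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class and function names (`NameNode`, `DataNode`, `store_file`) are hypothetical, each KV SSD is modelled as a Python dictionary, block placement is round-robin, and the name node is assumed to record the block placement in the value that was empty at file creation.

```python
import itertools


class NameNode:
    """Hypothetical name node; its first KV SSD is modelled as a dict."""

    def __init__(self, data_node_ids):
        self.kv_ssd = {}  # file identifier -> list of (block ID, data node ID)
        self._block_ids = itertools.count(1)
        self._placement = itertools.cycle(data_node_ids)  # assumed policy

    def create_file(self, file_id):
        # The file identifier is the key; the value starts out empty.
        self.kv_ssd[file_id] = []

    def allocate_block(self, file_id):
        # Assign a block identifier and a data node, then return both.
        block_id = next(self._block_ids)
        data_node_id = next(self._placement)
        self.kv_ssd[file_id].append((block_id, data_node_id))
        return block_id, data_node_id


class DataNode:
    """Hypothetical data node; its second KV SSD is modelled as a dict."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.kv_ssd = {}  # block identifier -> block data

    def write_block(self, block_id, content):
        if len(content) != self.block_size:
            raise ValueError("block content must match the configured block size")
        self.kv_ssd[block_id] = content


def store_file(name_node, data_nodes, file_id, data, block_size):
    """Client side: create the file, allocate blocks, write each block."""
    if len(data) == 0 or len(data) % block_size != 0:
        raise ValueError("file size must be the block size or a multiple of it")
    name_node.create_file(file_id)
    for offset in range(0, len(data), block_size):
        block_id, data_node_id = name_node.allocate_block(file_id)
        data_nodes[data_node_id].write_block(block_id, data[offset:offset + block_size])
    return name_node.kv_ssd[file_id]
```

With a 4-byte block size, storing an 8-byte file yields two key-value pairs on the data node, one per block identifier.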

The above and other preferred features, including various novel details of implementation and combinations of events, will now be described more particularly with reference to the accompanying figures and are pointed out in the claims. It will be understood that the particular systems and methods described herein are shown by way of illustration only and not as limitations. As will be understood by those skilled in the art, the principles and features described herein may be employed in various and numerous embodiments without departing from the scope of the present disclosure.

Each of the features and teachings disclosed herein can be used separately or in conjunction with other features and teachings to provide a system and method for leveraging key-value storage to efficiently store data and metadata in a distributed file system. Representative examples that use many of these additional features and teachings, both separately and in combination, are described in further detail with reference to the attached figures. This detailed description is merely intended to teach a person skilled in the art further details for practicing aspects of the present teachings and is not intended to limit the scope of the claims. Therefore, combinations of features disclosed in this detailed description may not be necessary to practice the teachings in the broadest sense, and are instead taught merely to describe particularly representative examples of the present teachings.

In the description below, specific nomenclature is set forth for purposes of explanation only, to provide a thorough understanding of the present disclosure. However, it will be apparent to one skilled in the art that these specific details are not required to practice the teachings of the present disclosure.

Some portions of the detailed description herein are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are used by those skilled in the data processing arts to effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions using terms such as "processing," "computing," "calculating," "determining," "displaying," or the like refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates data represented as physical (electronic) quantities within the computer system's registers and memories and transforms that data into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices.

Moreover, the various features of the representative examples and the dependent claims may be combined in ways that are not specifically and explicitly enumerated in order to provide additional useful embodiments of the present teachings. It is also expressly noted that all value ranges or indications of groups of entities disclose every possible intermediate value or intermediate entity for the purpose of original disclosure, as well as for the purpose of restricting the claimed subject matter. It is also expressly noted that the dimensions and shapes of the components shown in the figures are designed to help in understanding how the present teachings are practiced, and are not intended to limit the dimensions and shapes shown in the examples.

The present disclosure describes a system and method that addresses inefficiencies incurred by distributed file systems such as the Hadoop Distributed File System (HDFS). By storing data directly in a data storage device, the present system and method eliminates the need for a key-value file system that uses a file name as a block identifier and the file's data content (or a portion of it) as the value. A data storage device that can store data directly as key-value pairs is referred to herein as a key-value (KV) solid-state drive (SSD), or KV SSD for short. The KV SSD supports key-value storage in which a block identifier is the key and the data is the value. The present system and method can provide an efficient and simplified key-value data storage system that includes one or more KV SSDs capable of directly storing data as key-value pairs. As a result, the present key-value data storage system can consume less energy and fewer resources while providing a faster, simpler, and scalable data storage solution.

According to one embodiment, the KV SSD can, in conjunction with a data storage node, implement a file system that stores data in key-value pairs. By using one or more KV SSDs that can directly store key-value data, the present key-value data storage system can eliminate the need for a file system in the data storage node. The data storage node can pass information about its behavior down to the KV SSD so that the KV SSD can optimize its internal data structures and resources for the workload specified by the data storage node. In addition, in-memory mapping tables can be offloaded to the KV SSD, which provides persistent data through the key-value interface between the data storage node and the KV SSD.

According to one embodiment, the present key-value data storage system can support existing file systems such as HDFS. In particular, file systems that are optimized for large data blocks can benefit from the present key-value data storage system. For example, the metadata (or hash table) of the KV SSD is amortized over large block sizes (for example, 10 megabytes (MB) to 100 MB).
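The amortization effect can be illustrated with simple arithmetic. The sketch below counts how many index entries a device needs when one entry is kept per block; the 1 TiB capacity and the 4 KiB comparison size are assumptions chosen for illustration, while 64 MiB falls within the large-block range mentioned above.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB


def index_entries(capacity_bytes, block_bytes):
    """Number of key-to-location entries when one entry is kept per block."""
    return capacity_bytes // block_bytes


small_block_entries = index_entries(TIB, 4 * KIB)    # assumed page-sized mapping
large_block_entries = index_entries(TIB, 64 * MIB)   # large distributed-FS block
reduction_factor = small_block_entries // large_block_entries
```

Under these assumptions the index shrinks from 268,435,456 entries to 16,384 entries, a factor of 16,384.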

A distributed file system such as HDFS has immutable data blocks that do not need to be moved around, because the value for a key does not change. This minimizes the internal write amplification factor (WAF) for the data and metadata stored in the KV SSD. In addition, the present key-value data storage system can reduce the CPU overhead associated with updating hash table values.

The present key-value data storage system has a simplified flash translation layer (FTL) while improving performance and resource utilization. When a KV SSD is used with an immutable distributed storage system such as HDFS, metadata overhead can be reduced. Because the content for a key cannot change in such a distributed file system, a KV SSD that stores key-value pairs never has to mark a value as stale and point the key to new value content on the flash media of the KV SSD. In other words, the KV SSD does not need to support overwrites. In addition, blocks in a distributed file system such as HDFS have a fixed size, so the KV SSD does not have to handle dynamically sized values, which makes managing the value locations simpler. For example, when all blocks have a fixed size, a data structure based on direct indexing can be used. With these simplifications in the distributed file system, FTL management of key/value tuples can be simplified.
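A direct-index structure for fixed-size, write-once blocks might look like the following sketch. It is not the FTL of the disclosure; the class name `FixedSlotIndex`, the slot allocation order, and the byte-offset addressing are all hypothetical simplifications.

```python
class FixedSlotIndex:
    """Hypothetical direct index: every value occupies exactly one fixed-size slot."""

    def __init__(self, block_size, num_slots):
        self.block_size = block_size
        # Free slot numbers, kept so that pop() hands out the lowest one first.
        self.free_slots = list(range(num_slots - 1, -1, -1))
        self.slot_of = {}  # key -> slot number

    def put(self, key):
        # Write-once semantics: an existing key is never redirected to new content.
        if key in self.slot_of:
            raise KeyError("overwrites are not supported")
        if not self.free_slots:
            raise RuntimeError("device full")
        slot = self.free_slots.pop()
        self.slot_of[key] = slot
        return slot * self.block_size  # physical offset follows from the slot number

    def locate(self, key):
        # No per-value length or extent list is needed when sizes are fixed.
        return self.slot_of[key] * self.block_size

    def delete(self, key):
        # The whole slot becomes reusable at once.
        self.free_slots.append(self.slot_of.pop(key))
```

Because every value has the same size, the physical location is a single multiplication on the slot number, and no stale-value tracking is required.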

A distributed file system may keep its metadata in the memory of a single data storage node, which limits the scalability of the metadata. The present key-value data storage system can eliminate the memory restrictions on metadata management that other distributed file systems may face.

The present key-value data storage system can achieve high throughput for workloads that are not latency-oriented. Because HDFS uses such large block sizes, and because its data sets can exceed memory capacity, a page cache may not improve data storage and management performance significantly. Therefore, even though the KV SSD does not support a page cache, the KV SSD does not degrade the performance of a KV-enabled data node.

The centralized cache management feature of HDFS provides a mechanism to explicitly tell a data node to cache certain blocks off-heap. This feature can be implemented in the KV-enabled data node, allowing it to still benefit from memory-based caching without having to make policy decisions about which blocks to cache.

The present key-value data storage system enables high parallelism in read and write operations. Because the latency of each individual data block is of lesser importance, and because HDFS exhibits a high degree of parallelism by reading and writing a large number of data blocks, there is no need to stripe a command (for example, a read command or a write command) and send it to many channels on the KV SSD. Each data block can be written to and read from a single channel or a single chip/die of the KV SSD, so that throughput is obtained from the inherent parallelism of the workload. This also simplifies the FTL of the KV SSD and reduces the complexity of the lookup process. The parallelism can also be applied to multiple channels or chips/dies according to the context of an erase block of the KV SSD. In turn, this can minimize or eliminate SSD overprovisioning by matching the SSD block/page size to the block size of the distributed file system (for example, HDFS). As a result, the present key-value data storage system can increase throughput when the device erase blocks are aligned to the fixed block size of the data issued to the KV SSD, because aligned erase blocks and data sizes require less synchronization across flash channels. By offloading the metadata mapping to the KV SSD, the memory in the metadata node is no longer a bottleneck in the distributed storage system.

FIG. 1A shows a block diagram of a prior-art distributed data storage system. A client application 101 has a file 105 to store in a data node 121 of the distributed data storage system 100A. The file 105 includes two data blocks, Ω and Σ. After writing the file 105 to the data node 121, the client 101 stores the metadata associated with the file 105 in a block map 115 of a name node (or metadata node) 111. In the context of HDFS, the name node 111 is referred to as the master, and the data node 121 is referred to as a slave. The master may store the metadata of all HDFS files in an HDFS directory structure. Although some of the examples described below refer to HDFS, it is understood that other file systems optimized for large amounts of data may be used without departing from the scope of the present disclosure.

The name node 111 maintains the block map 115, which contains mapping information between the file 105, including its block identifiers, and the data node 121 that stores the blocks of the file 105. In this example, the blocks Ω and Σ have the block identifiers "11" and "99", respectively. When the client 101 needs to access the file 105 (or the data blocks Ω and Σ), the client 101 communicates with the name node 111 to identify, based on the association information stored in the block map 115, the blocks associated with the file 105 and the data node 121 (DN 1) from which the file 105 (or its data blocks) is to be accessed.
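The block map lookup in this example can be pictured as a simple dictionary. The key name `"file105"` and the function `locate_blocks` are hypothetical labels introduced only for illustration; the block identifiers "11" and "99" and the data node "DN 1" come from the example above.

```python
# Hypothetical in-memory picture of block map 115: file -> [(block ID, data node)].
block_map = {
    "file105": [("11", "DN 1"), ("99", "DN 1")],
}


def locate_blocks(file_name):
    """Return the block identifiers of a file and the data node holding each block."""
    return list(block_map[file_name])


locations = locate_blocks("file105")
```

The client would then contact the listed data node for each block identifier.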

The data node 121 includes a local file system with a directory structure (for example, the ext4 file system of Linux) and stores each block as a file in a directory. The file name may be the unique block identifier ("11" or "99") of the corresponding block of the file 105, and the content of the file is the block data. Because blocks must be stored as files, the data node 121 requires an additional software layer (for example, the local ext4 file system), additional memory (for example, the directory entry cache (dentry cache) of Linux), and CPU processing to translate per-block key-value operations into file system operations (for example, Portable Operating System Interface (POSIX) and file system-specific command processing). The file system overhead also includes metadata management. The controller logic of the SSD 140 must perform additional processing to maintain consistency of the block map 115. The HDFS block size may not be aligned with the internal page/block mapping of the SSD. This can increase the internal WAF of the SSD and the amount of overprovisioned space, resulting in more frequent garbage collection and a higher total cost of ownership (TCO).

FIG. 1B shows a block diagram of an example distributed data storage system including a key-value storage device, according to one embodiment. The client application 101 stores the file 105 in a data node 221 of the distributed data storage system 100B. For example, the distributed data storage system 100B is an HDFS. The data node 221, which includes a key-value SSD 150, can store the data blocks Ω and Σ directly. In contrast to the data node 121, which includes the conventional SSD 140, the data node 221 does not need a local file system (for example, ext4), because the data blocks of the file 105 are stored directly in the KV SSD 150 as key-value pairs.

The KV SSD 150 provides the data node 221 with an interface for communicating with the client application 101, and this interface enables data blocks to be stored directly as key-value pairs. The data node 221 therefore does not require a local file system layer and may not incur the memory overhead and CPU overhead of the conventional data node 121 of FIG. 1A.

According to one embodiment, the distributed data storage system 100B allows the client application 101 and the data node 221 to exchange information. This process is referred to as a registration process or a configuration process. During the registration process, the data node 221 can inform the client application 101 that it has one or more KV SSDs that can store data blocks as key-value pairs. After the registration process is complete, the client application 101 knows that it can issue KV SSD-specific input/output (I/O) device commands to the KV SSD 150 included in the data node 221 (for example, /dev/kvssd1, where kvssd1 is the id of the data node 221). This simplifies the I/O path between the client application 101 and the data node 221. Rather than relying on a local file system to create files and write data blocks to them, the data node 221 can issue a "put" command to store each data block as a key-value pair. Reading a stored key-value pair from the KV SSD 150 is similar: the data node 221 can issue a "get" command directly to the KV SSD 150 to retrieve the data block associated with a key, rather than retrieving the data block through a file system interface. Deletion may follow a similar process.
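The put, get, and delete path described here can be sketched against a simulated device. This does not model a real KV SSD driver or command set: `SimulatedKVSSD` and `KVDataNode` are hypothetical classes, the device is a dictionary, and encoding a block identifier as an 8-byte big-endian key is an assumption.

```python
class SimulatedKVSSD:
    """Hypothetical stand-in for a KV SSD; not a real driver interface."""

    def __init__(self, device_path):
        self.device_path = device_path  # e.g. "/dev/kvssd1", kept for illustration only
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store[key]

    def delete(self, key):
        del self._store[key]


class KVDataNode:
    """Data node that issues key-value commands instead of file system calls."""

    def __init__(self, ssd):
        self.ssd = ssd

    @staticmethod
    def _key(block_id):
        # Assumed key encoding: the block identifier as 8 big-endian bytes.
        return block_id.to_bytes(8, "big")

    def store_block(self, block_id, data):
        self.ssd.put(self._key(block_id), data)  # no file create/open/write

    def read_block(self, block_id):
        return self.ssd.get(self._key(block_id))  # no path lookup

    def delete_block(self, block_id):
        self.ssd.delete(self._key(block_id))
```

A data node built this way stores block 11 with one `put` call and reads it back with one `get` call, with no directory or file name in between.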

During the registration process, the KV SSD 150 may be provided with information about the behavior of the distributed data storage system 100B. Based on that behavior, the flash translation layer (FTL) of the KV SSD 150 can be optimized specifically for the distributed data storage system 100B.

The SSD controller of the KV SSD 150 can write and read data in stripes across different memory chips (for example, NAND chips) to speed up write and read operations. The distributed data storage system 100B (for example, HDFS) can send many I/O requests in parallel and can tolerate long latencies as long as throughput is high. Striping reduces latency at the cost of added complexity in the SSD controller. According to one embodiment, the FTL of the KV SSD 150 can be optimized, based on the information about the distributed data storage system 100B, to read and write large blocks to a single channel. In this case, the FTL of the KV SSD 150 does not stripe data across multiple chips over multiple channels, but instead executes simultaneous read and write operations in parallel to achieve high throughput.
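The difference between striping one block across channels and placing each whole block on a single channel can be sketched as follows. The function names, the round-robin placement, and the 16 KiB stripe unit used in the usage note are assumptions, not details from the disclosure.

```python
def channels_for_striped_block(block_bytes, stripe_unit_bytes, num_channels):
    """Channels touched by ONE block when it is striped in fixed-size units."""
    units = -(-block_bytes // stripe_unit_bytes)  # ceiling division
    return sorted({unit % num_channels for unit in range(units)})


def channel_for_whole_block(write_sequence_number, num_channels):
    """Single channel that receives an entire block (assumed round-robin placement)."""
    return write_sequence_number % num_channels


def channels_busy(concurrent_blocks, num_channels):
    """Channels kept busy when several whole blocks are written concurrently."""
    return sorted({channel_for_whole_block(i, num_channels)
                   for i in range(concurrent_blocks)})
```

With 8 channels, one striped 64 MiB block in 16 KiB units touches all 8 channels and needs cross-channel coordination, whereas 8 concurrent whole-block writes also keep all 8 channels busy while each block stays on one channel.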

A distributed file system such as HDFS can be optimized for data-centric and data-intensive applications that frequently read the stored data. In this case, data read operations are far more frequent than data write operations. Some of these distributed file systems provide write-once semantics and use large block sizes. In contrast, the KV SSD 150 can support dynamic block sizes and frequent updates to data blocks.

According to another embodiment, the KV SSD 150 can be optimized so that it does not incur the garbage collection that might otherwise be required when an internal file system is used (for example, with the SSD 140 of FIG. 1A). Garbage collection is the process of moving valid pages out of blocks that also contain stale pages so that those blocks can be erased and rewritten. It is a costly process that can contribute to write amplification, I/O non-determinism, and drive wear leveling. Once optimized, the FTL of the KV SSD 150 can use the same granularity for write operations and erase operations. When a block is deleted, it can be marked for erasure immediately, so garbage collection is no longer needed. The optimized FTL of the KV SSD 150 can improve performance and durability while eliminating garbage collection and simplifying the FTL.
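Why matching the write and erase granularity removes garbage collection work can be shown with a toy count of valid pages. The sketch assumes an erase block is described by a list that names the owning key of each page; the function name and this representation are hypothetical.

```python
def pages_to_relocate(erase_block_pages, deleted_key):
    """Valid pages that must be copied elsewhere before the erase block can be erased.

    erase_block_pages: one entry per flash page, naming the key that owns the page
    (None marks an unused page).
    """
    return sum(1 for owner in erase_block_pages
               if owner is not None and owner != deleted_key)


# Misaligned case: two values share one erase block, so deleting "a"
# leaves the pages of "b" to be moved by garbage collection.
shared_block = ["a", "a", "b", "b"]

# Aligned case: one value fills the whole erase block, so deleting it
# leaves nothing to move and the block can be erased immediately.
aligned_block = ["a", "a", "a", "a"]
```

In the shared case two pages must be relocated; in the aligned case none must be, which is the situation the optimized FTL aims for.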

According to one embodiment, the KV SSD 150 supports dynamic page and block sizes. For example, the KV SSD 150 can adjust the block size of the blocks it stores based on the HDFS configuration. During configuration, the distributed data storage system 100B can inform the KV SSD 150 that only aligned, fixed-size write operations will be issued to the KV SSD 150, and the KV SSD 150 configures its block size accordingly. Alternatively, the KV SSD 150 can expose its erase block size (or its possible erase block sizes) and require the distributed data storage system 100B to configure the block size accordingly. In either case, the block sizes of the KV SSD 150 and the distributed data storage system 100B can be configured with respect to each other.
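The two configuration directions can be sketched as two small functions. Both selection rules are assumptions made for illustration: the first simply accepts a host write size that the drive supports, and the second picks the largest exposed erase block size that evenly divides the host's preferred block size.

```python
MIB = 1024 * 1024


def configure_from_host(host_write_size, supported_block_sizes):
    """Host announces its fixed, aligned write size; the drive adopts it if supported."""
    if host_write_size not in supported_block_sizes:
        raise ValueError("drive cannot be configured for this write size")
    return host_write_size


def configure_from_drive(preferred_host_block_size, erase_block_sizes):
    """Drive exposes its erase block sizes; the host picks one that keeps writes aligned.

    Assumed policy: choose the largest erase block size that divides the
    host's preferred block size evenly.
    """
    candidates = [size for size in erase_block_sizes
                  if preferred_host_block_size % size == 0]
    if not candidates:
        raise ValueError("no exposed erase block size aligns with the host block size")
    return max(candidates)
```

For a host that prefers 64 MiB blocks and a drive that exposes 4 MiB, 16 MiB, and 24 MiB erase blocks, the second function selects 16 MiB under the assumed policy.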

According to one embodiment, the distributed data storage system 100B may configure the KV SSD 150 to allow or disallow block updates. For example, the distributed data storage system 100B may pass an additional parameter (referred to herein as an update flag) to the KV SSD 150. Using the update flag, the SSD controller of the KV SSD 150 can configure itself to provision additional flash memory blocks and threads to handle the garbage collection associated with block update requests received from the client application 101. By disallowing block updates (e.g., update flag = false), the distributed data storage system 100B can achieve a large increase in throughput owing to the parallelism across different flash memory channels or dies. When every new write uses a new key, the KV SSD 150 does not need to synchronize across channels or dies to check whether a written block is an overwrite. In this case, the data node 221 may set the block update flag to false.
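As a minimal sketch of the update-flag behavior described above, the following toy in-memory store rejects any write to an existing key when updates are disabled, so every accepted write is known to be a new block. The class name `KVStore` and the parameter `allow_updates` are hypothetical and are not taken from the disclosure or from any real device API.

```python
class KVStore:
    """Toy in-memory stand-in for a KV SSD (hypothetical, not a device API)."""

    def __init__(self, allow_updates: bool):
        self.allow_updates = allow_updates  # models the "update flag"
        self._data = {}

    def store(self, key, value):
        # With updates disabled a key may be written only once, so the
        # store never has to handle an overwrite of an existing block.
        if not self.allow_updates and key in self._data:
            raise KeyError(f"block {key!r} exists and updates are disabled")
        self._data[key] = value

    def retrieve(self, key):
        return self._data[key]
```

With `allow_updates=False`, a second `store` call on the same key raises an error instead of silently replacing the block.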

FIG. 2A illustrates a system configuration of an exemplary SSD. Referring to FIG. 1A, the distributed data storage system 100A configures the SSD 140 by mounting its local file system (e.g., Linux ext4) at "/mnt/fs". Files stored in the SSD 140 are accessed through the mounted file system of the SSD 140.

FIG. 2B illustrates an exemplary system configuration of a key-value SSD, according to one embodiment. Referring to FIG. 1B, the distributed data storage system 100B configures the KV SSD 150 based on information about the KV SSD 150 received during a registration process. For example, the distributed data storage system 100B may configure the storage type of the KV SSD 150 as key-value SSD (KV SSD) and set the input/output path of the KV SSD 150 to "dev/kvssd". The KV SSD 150 may also be configured with the block update flag set to false, the block size set to 64 MB, and the alignment flag set to true.
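The configuration values listed above can be pictured as a small record. This is only an illustration: the class `KVSSDConfig` and its field names are hypothetical, and 64 MB is taken here to mean 64 × 1024 × 1024 bytes, which the source does not specify.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class KVSSDConfig:
    """Hypothetical container for the settings named in the text."""
    storage_type: str
    io_path: str
    block_update: bool
    block_size: int
    aligned: bool


# Values as given for the KV SSD 150 in the FIG. 2B example.
config = KVSSDConfig(
    storage_type="KV SSD",
    io_path="dev/kvssd",
    block_update=False,
    block_size=64 * MB,
    aligned=True,
)
```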

When the KV SSD 150 is configured to disable cross-channel I/O operations by setting the erase block size of the KV SSD 150 and the data size of the distributed data storage system 100B to be equal and aligned, the distributed data storage system 100B can perform lock-less I/O operations across all channels or dies in the KV SSD 150. For example, the KV SSD 150 uses a simple hash function (e.g., address mod 10) to determine, among all possible channels, the channel to which an I/O operation should be routed. In this case, all I/O operations for a given address are consistently routed to the same flash memory channel. Where each channel is served by a serial processing unit, all I/O operations routed to that channel are ordered without any cross-channel locking. The distributed data storage system 100B can therefore achieve full parallelism across I/O threads without synchronization.
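The routing rule can be sketched in a few lines. The function names `route_channel` and `dispatch` are hypothetical, and the channel count of 10 simply mirrors the "address mod 10" example; a real controller would use its own channel count and hash.

```python
from collections import defaultdict

NUM_CHANNELS = 10  # mirrors the "address mod 10" example in the text


def route_channel(address: int, num_channels: int = NUM_CHANNELS) -> int:
    """Return the channel that serves every I/O for this address."""
    return address % num_channels


def dispatch(addresses, num_channels: int = NUM_CHANNELS) -> dict:
    """Group I/O addresses into per-channel queues, preserving arrival order."""
    queues = defaultdict(list)
    for addr in addresses:
        queues[route_channel(addr, num_channels)].append(addr)
    return dict(queues)
```

Because the same address always lands in the same queue, a single serial worker per queue sees operations on that address in order, which is why no lock shared between channels is needed in this model.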

The present KV SSD can achieve parallelism according to how it treats erase blocks as garbage collection units. FIG. 3 illustrates an exemplary SSD channel and die architecture that achieves channel-level parallelism, according to one embodiment. The KV SSD can use channel-level parallelism to improve its I/O performance. In this example, the KV SSD has N channels and M ways, where N and M are integers equal to or greater than 1. The data size of the distributed file system (e.g., HDFS) is set equal to the block size of the KV SSD or to a multiple of the block size. The block size of the KV SSD is the product of the size of the erase unit in a chip (or die) and the number of ways in the KV SSD, that is, block size = erase unit size * number of ways. The data can be striped across the chips within the same channel. For example, the KV SSD has an erase unit size of 6 MB and 8 ways, and the data size of the distributed file system is set to 48 MB to fit in a data group of the KV SSD. With channel-level parallelism no garbage collection occurs, because deleting 48 MB of data resets 8 complete erase blocks in the same channel. However, channel-level parallelism does not take full advantage of the potential parallelism that the multiple channels can provide.
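The arithmetic in the channel-level example can be checked with a short sketch. The helper names `block_size` and `fits` are hypothetical, and megabytes are assumed to be binary (1024 × 1024 bytes).

```python
MB = 1024 * 1024  # assumption: binary megabytes


def block_size(erase_unit: int, num_ways: int) -> int:
    """Channel-level block size: erase unit size times number of ways."""
    return erase_unit * num_ways


def fits(data_size: int, blk: int) -> bool:
    """True if the data size equals the block size or a multiple of it."""
    return data_size > 0 and data_size % blk == 0


# The example from the text: 6 MB erase units and 8 ways.
blk = block_size(6 * MB, 8)
```

Here `blk` is 48 MB, so a 48 MB or 96 MB file-system block satisfies the alignment condition while a 50 MB block does not.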

FIG. 4 illustrates an exemplary SSD channel and die architecture that achieves way-level parallelism, according to one embodiment. The KV SSD can use way-level parallelism to improve its I/O performance. In this case, the data size of the distributed file system (e.g., HDFS) is set equal to the block size of the KV SSD or to a multiple of the block size. The block size of the KV SSD is the product of the size of the erase unit and the number of channels, that is, block size = erase unit size * number of channels. The garbage collection unit and the data size are multiples of the block size. The data can be striped across the chips within the same way. Because the striping spans all channels, channel parallelism can be fully exploited.
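One way to picture way-level striping is to cut a block of erase unit size * number of channels into one erase-unit-sized stripe per channel. The function below is an illustrative sketch under that assumption; the name `stripe_across_channels` and the stripe-to-channel ordering are hypothetical.

```python
def stripe_across_channels(data: bytes, erase_unit: int, num_channels: int):
    """Split one block into per-channel stripes of erase-unit size.

    Stripe i is the piece that would be written through channel i.
    """
    if len(data) != erase_unit * num_channels:
        raise ValueError("data must be exactly erase_unit * num_channels bytes")
    return [
        data[i * erase_unit:(i + 1) * erase_unit]
        for i in range(num_channels)
    ]
```

Each stripe can then be written through a different channel at the same time, which is the source of the additional parallelism compared with the channel-level layout.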

FIG. 5 illustrates an exemplary SSD channel and die architecture that achieves die/chip-level parallelism, according to one embodiment. In this example, the number of SSD channels is N, and the number of dies is M, which is greater than N. Die/chip-level parallelism provides the highest parallelism across the channels and chips in the KV SSD. In this case, the block size is equal to the erase unit, and garbage collection is performed in erase units. The data size is aligned to a multiple of the erase unit. Die/chip-level parallelism is similar to consistent hashing with virtual nodes: each virtual node corresponds to an erase block unit, and each physical node corresponds to a channel.
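The analogy can be illustrated loosely as a two-step mapping: a key is hashed to a die (the "virtual node"), and each die belongs to a channel (the "physical node"). The sketch below uses plain modular hashing rather than a full consistent-hashing ring, and both the function names and the die-to-channel assignment (`die % num_channels`) are assumptions made only for illustration.

```python
import hashlib


def die_for_key(key: str, num_dies: int) -> int:
    """Hash a key to a die index (plays the role of a virtual node)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_dies


def channel_for_die(die: int, num_channels: int) -> int:
    """Assumed wiring: dies are attached to channels round-robin."""
    return die % num_channels
```

With M dies and N channels (M greater than N), many dies share each channel, much as many virtual nodes map to one physical node.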

A distributed file system (e.g., HDFS) maintains metadata to manage the location of data. For example, for each file it maintains a list of all the blocks that make up the file. A replicated distributed storage system maintains a separate mapping that lists the locations of all the nodes that store a given block (or file). In some distributed data storage systems, these mapping tables are kept in the memory of a single node, which limits the scalability of the system. For example, when the metadata node that stores these mapping tables does not have enough memory to store additional mapping data, no more blocks or files can be added. A file system can be used on top of a data storage device to store these mappings, but the file system introduces additional overhead.

The present KV SSD can store data directly in key-value pairs, without a local file system, by persistently keeping the file-to-block-list mapping and the block-to-node-list mapping. The node responsible for storing metadata is therefore not limited by its memory capacity and does not incur the overhead of an additional file system. Because this mapping information is stored directly on the KV SSD, it can be kept in a single mapping table indexed by file. A single lookup in the KV SSD can then retrieve all of the data blocks. The single mapping table makes the metadata more scalable (only one mapping table) and more efficient (one lookup).
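A minimal sketch of such a single, file-indexed table follows. A Python dictionary stands in for the KV SSD, and the names `file_map`, `append_block`, and `lookup` are hypothetical.

```python
# fileID -> list of (blockID, [data nodes holding that block])
file_map = {}


def append_block(file_id, block_id, data_nodes):
    """Record one block of a file together with the nodes that store it."""
    file_map.setdefault(file_id, []).append((block_id, list(data_nodes)))


def lookup(file_id):
    """One lookup returns every block of the file and where each one lives."""
    return file_map[file_id]
```

Compared with keeping a file-to-block table and a separate block-to-node table, a reader needs only the one `lookup` call to learn both the blocks and their locations.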

Reading a file stored in the KV SSD is similar to using a regular hash map or a similar data structure. The data structure can be a library linked directly to the KV SSD. For example, a client application issues a file retrieve operation to read a file using its file ID. The KV SSD returns the block list of the file as a binary large object (blob), and the metadata node can map the blob back to the format in which the block list was originally written. The block list also contains the list of nodes on which each block in the block list is stored. The metadata node can then pass the list of blocks and the associated nodes back to the client application, which issues the reads for the blocks. In this scheme, the metadata node still needs to hold the mapping for each lookup in its memory in order to pass the list back to the client application; however, it does not need to keep all of the mapping information in memory. For example, a cache of recently read files can provide a trade-off between scalability and efficiency.
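The cache of recently read files mentioned above could take many forms; the sketch below assumes a least-recently-used policy, which the source does not specify. The class name `MappingCache` and the `device_reads` counter are hypothetical, and a plain dictionary stands in for the KV SSD.

```python
from collections import OrderedDict


class MappingCache:
    """Small LRU cache of file mappings in front of a backing KV store."""

    def __init__(self, backing: dict, capacity: int):
        self.backing = backing      # stand-in for the KV SSD
        self.capacity = capacity
        self._cache = OrderedDict()
        self.device_reads = 0       # counts lookups that reach the device

    def get(self, file_id):
        if file_id in self._cache:
            self._cache.move_to_end(file_id)  # mark as most recently used
            return self._cache[file_id]
        self.device_reads += 1
        value = self.backing[file_id]
        self._cache[file_id] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)   # evict least recently used
        return value
```

A small capacity keeps memory use bounded (scalability), while repeated reads of hot files avoid a device lookup (efficiency).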

FIG. 6A illustrates an exemplary mapping scheme of an exemplary SSD (e.g., the SSD 140 shown in FIGS. 1A and 2A). The edit log is a log of all metadata operations performed on the file and block mapping tables. The edit log is persisted to disk. This is necessary because the file and block mapping tables reside only in memory; if the name node crashes, it rebuilds the file and block mapping tables in memory by reading the persisted edit log. FIG. 6B illustrates an exemplary mapping scheme of a KV SSD (e.g., the KV SSD 150 shown in FIGS. 1B and 2B), according to one embodiment. With the SSD 140, the file mapping table and the block mapping table are stored in memory, and the client application uses this mapping information to retrieve the data blocks associated with a file. The KV SSD 150, by contrast, stores a file mapping table containing a single mapping that includes both the block list and the node list.

FIGS. 7, 8 and 9 are diagrams illustrating exemplary I/O processes for creating and writing a file, reading a file, and deleting a file in a distributed file system (e.g., HDFS). The description of each figure discusses the differences from, and the advantages over, the prior art processes.

FIG. 7 illustrates an exemplary process of creating and storing a file in a KV SSD of a distributed file system, according to one embodiment. The distributed file system includes a client 710, a name node (or metadata node) 720 including a KV SSD (kv1) 730, and a data node 740 including a KV SSD (kv2) 750. To create a new file, the client 710 sends a request 761 (createFile(fileID)) with a file ID (fileID) to the name node 720. The name node 720 registers the file ID internally by sending a key-value store command 762 (kv.store(fileID, "")), which stores the file ID in the KV SSD 730 as a value-less key of a key-value pair. The KV SSD 730 responds by sending a completion message 763 back to the name node 720, and the name node 720 then responds to the client 710 with a completion message 764. After the name node 720 has responded, the client 710 sends an allocate command 765 (allocateBlock(fileID)) to allocate a block for the file. In response, the name node 720 assigns a block ID (blockID) and a data node (e.g., the data node 740) to store the block, and sends a response 766 back to the client 710. The name node 720 may assign monotonically increasing IDs as block IDs. Using the block ID, the client 710 sends a block write request 767 (writeBlock(blockID, content)) with the data content of the block to the data node 740. In response to the block write request 767, the data node 740 issues a key-value store command 768 (kv.store(blockID, content)) with the passed-in parameters (block ID and content) to the KV SSD 750. In a traditional data node with a conventional SSD, a write to the data node would require a write to the file system followed by a write to the underlying storage medium of the SSD. After storing the block, the KV SSD 750 responds to the data node 740 with a completion message 769, and the data node 740 responds to the client 710 with a completion message 770. The client 710 then sends a commit write command 771 (commitWrite(fileID, blockID)) to the name node 720 to commit the block ID and data node tuple to the associated file. The name node 720 sends an append command 772 (kv.append(fileID, blockID + dataNode)) to the KV SSD 730. The append command 772 is a single direct operation on the KV SSD 730, rather than store operations on two separate mappings (i.e., the file-block mapping and the block-data node mapping) as in a traditional distributed storage system.
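The message sequence of FIG. 7 can be approximated with toy in-memory objects. Everything below is a hedged sketch: the classes `KV`, `NameNode`, and `DataNode`, the helper `write_file`, and the use of an empty list (instead of the empty value "") as the initial value are all illustrative assumptions, and completion messages are modeled simply as method returns.

```python
class KV:
    """Toy in-memory key-value store standing in for a KV SSD."""

    def __init__(self):
        self.d = {}

    def store(self, key, value):
        self.d[key] = value

    def append(self, key, item):
        self.d[key] = self.d[key] + [item]

    def retrieve(self, key):
        return self.d[key]


class NameNode:
    def __init__(self, kv, data_node_id):
        self.kv = kv
        self.data_node_id = data_node_id
        self.next_block_id = 0

    def create_file(self, file_id):           # createFile -> kv.store
        self.kv.store(file_id, [])            # empty value for a new file

    def allocate_block(self, file_id):        # allocateBlock
        self.next_block_id += 1               # monotonically increasing ID
        return self.next_block_id, self.data_node_id

    def commit_write(self, file_id, block_id):  # commit -> single kv.append
        self.kv.append(file_id, (block_id, self.data_node_id))


class DataNode:
    def __init__(self, kv):
        self.kv = kv

    def write_block(self, block_id, content):  # writeBlock -> kv.store
        self.kv.store(block_id, content)


def write_file(name_node, data_nodes, file_id, content):
    """Client side: create, allocate, write the block, then commit."""
    name_node.create_file(file_id)
    block_id, node_id = name_node.allocate_block(file_id)
    data_nodes[node_id].write_block(block_id, content)
    name_node.commit_write(file_id, block_id)
    return block_id
```

Note that the commit step touches a single key on the name node's store, which is the point the paragraph makes about replacing two separate mapping updates.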

FIG. 8 illustrates an exemplary process of reading a file stored in a KV SSD of a distributed file system, according to one embodiment. To read a file stored in the data node 740, the client 710 sends a read file request 861 (openFile(fileID)) with a file ID (fileID) to the name node 720. Using the file ID, the name node 720 sends a retrieve command 862 (kv.retrieve(fileID)) to the KV SSD 730, and the KV SSD 730 returns mapping information 863 that maps the blocks associated with the file ID to data nodes. The name node 720 forwards the mapping information 864 to the client 710. Using a block ID contained in the block-data node mapping information, the client 710 sends a block read command 865 (readBlock(blockID)) to the data node 740. The data node 740 sends a block retrieve command 866 (kv.retrieve(blockID)) to the KV SSD 750 to retrieve the block content. The KV SSD 750 sends the content 867 of the requested block back to the data node 740, and the data node 740 forwards the retrieved block content 868 to the client 710. The basic difference from a traditional read operation is that the name node 720 issues a single direct KV SSD read operation to retrieve the block-data node mapping, rather than searching in-memory hash tables for the file-to-block list and the block-to-data-node list. In addition, the data node 740 sends the request to retrieve data directly to the KV SSD, bypassing any intermediate storage software (e.g., a file system).
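The read path can be sketched with plain dictionaries standing in for the two KV SSDs. The function name `read_file` and the layout of the mapping (a list of (blockID, nodeID) pairs) are illustrative assumptions.

```python
def read_file(name_kv: dict, data_kvs: dict, file_id) -> bytes:
    """Fetch the mapping with one lookup, then read each block by its ID.

    name_kv  : stand-in for the name node's KV SSD (fileID -> mapping)
    data_kvs : nodeID -> stand-in for that data node's KV SSD
    """
    mapping = name_kv[file_id]                # models kv.retrieve(fileID)
    chunks = []
    for block_id, node_id in mapping:
        # models readBlock(blockID) -> kv.retrieve(blockID) on the data node
        chunks.append(data_kvs[node_id][block_id])
    return b"".join(chunks)
```

For example, a file mapped to blocks 1 and 2 on two different nodes is reassembled by concatenating the two retrieved values in list order.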

FIG. 9 illustrates an exemplary process of deleting a file in a KV SSD of a distributed file system, according to one embodiment. The client 710 sends a file delete command 961 (deleteFile(fileID)) with a file ID to the name node 720. The name node 720 sends a key-value retrieve command 962 (kv.retrieve(fileID)) to the KV SSD 730 to retrieve the mapping information of the file, and the KV SSD 730 returns mapping information 963 that maps the blocks associated with the file ID to data nodes. The name node 720 may temporarily cache the block ID mapping for the subsequent asynchronous deletion of the associated blocks. The name node 720 sends a key-value delete command 964 (kv.delete(fileID)) to the KV SSD 730, and the KV SSD 730 sends a completion message 965 to the name node 720 after deleting the file ID and the associated mapping. When retrieving the mapping information associated with the file ID, the name node 720 retrieves the block-data node tuples for the file to be deleted from the KV SSD 730. This differs from a traditional distributed storage system, which would involve lookups in multiple in-memory hash tables. Instead, the name node 720 deletes the file-based key, which contains the block-data node mapping, from the KV SSD 730. The name node 720 returns control to the client 710 and asynchronously sends block delete commands to the data node 740 to delete the corresponding blocks from the KV SSD 750. Note that the file deletion process shown in FIG. 9 assumes that the block size of the distributed file system is equal to, or a multiple of, the erase block size of the KV SSD 750. This minimizes the overhead of moving from in-memory operations to KV SSD-based operations.
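The delete path can be sketched in the same style, again with dictionaries standing in for the KV SSDs. The function name `delete_file` is hypothetical, and the block deletions that the text describes as asynchronous are run inline here purely to keep the example short.

```python
def delete_file(name_kv: dict, data_kvs: dict, file_id):
    """Remove a file's key on the name node, then delete its blocks.

    Returns the (blockID, nodeID) pairs that were scheduled for deletion.
    """
    mapping = name_kv[file_id]      # models kv.retrieve(fileID)
    pending = list(mapping)         # cached for the later block deletes
    del name_kv[file_id]            # models kv.delete(fileID)
    # In the described process this loop runs asynchronously, after
    # control has already returned to the client.
    for block_id, node_id in pending:
        data_kvs[node_id].pop(block_id, None)
    return pending
```

Deleting the single file key removes the block list and the node list together, since both live in the same value.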

According to one embodiment, a solid state drive (SSD) includes: a plurality of data blocks; a plurality of flash memory channels and a plurality of ways for accessing the plurality of data blocks; and an SSD controller that configures a block size of the plurality of data blocks. A data file is stored in the SSD with one or more key-value pairs, and each key-value pair has a block identifier as a key and block data as a value. The size of the data file is equal to the block size or a multiple of the block size.

The SSD may be used in a distributed file system including the Hadoop Distributed File System (HDFS).

The SSD controller may be further configured to enable or disable block updates based on a block update flag.

The SSD controller may be further configured to align the data file with the plurality of data blocks based on an alignment flag.

The block size may be determined based on the erase unit of the SSD multiplied by the number of flash memory channels.

The block size may be determined based on the erase unit of the SSD multiplied by the number of ways.

The block size may be equal to the erase unit of the SSD.

The SSD may store a file mapping table that includes a first mapping of the file to one or more data blocks, among the plurality of data blocks, that are associated with the file, and a second mapping of at least one of the one or more data blocks to a data node including the SSD.

According to another embodiment, a distributed data storage system includes: a client; a name node including a first key-value (KV) solid state drive (SSD); and a data node including a second KV SSD, wherein the second KV SSD includes a plurality of data blocks, a plurality of flash memory channels and a plurality of ways for accessing the plurality of data blocks, and an SSD controller that configures a block size of the plurality of data blocks. The client sends to the name node a create file request including a file identifier for storing a data file, and sends to the name node an allocate command to allocate one or more data blocks, among the plurality of data blocks, associated with the data file. The name node returns to the client block identifiers of the one or more data blocks and a data node identifier of the data node assigned to store the one or more data blocks. The client sends a block store command to the data node to store the one or more data blocks. The second KV SSD stores the one or more data blocks as key-value pairs, and at least one key-value pair has the block identifier as a key and block data as a value. The size of the data file is equal to the block size or a multiple of the block size.

The distributed data storage system may employ the Hadoop Distributed File System (HDFS).

The second KV SSD may store a file mapping table that includes a first mapping of the data file to one or more data blocks associated with the file and a second mapping of at least one of the one or more data blocks to the data node.

According to yet another embodiment, a method includes: sending a create file request from a client to a name node, wherein the create file request includes a file identifier for storing a data file; storing the file identifier as a key-value pair in a first key-value (KV) solid state drive (SSD) of the name node, wherein the file identifier is stored as the key of the key-value pair and the value associated with the key is empty; sending an allocate command from the client to the name node to allocate one or more data blocks associated with the data file; assigning, at the name node, a block identifier to at least one of the one or more data blocks and assigning a data node to store the one or more data blocks; returning the block identifier and a data node identifier of the data node from the name node to the client; sending a write block request from the client to the data node, wherein the write block request includes the block identifier and content; and storing the one or more data blocks as key-value pairs in a second KV SSD of the data node. The second KV SSD of the data node includes one or more data blocks having a block size. At least one key-value pair has the block identifier as a key and block data as a value. The size of the data file is equal to the block size or a multiple of the block size.

The client, the name node and the data node may be nodes in the Hadoop Distributed File System (HDFS).

The method may further include setting a block update flag to enable or disable block updates.

The method may further include setting an alignment flag to align the data file with the plurality of data blocks of the second KV SSD of the data node.

The method may further include: sending a write commit command from the client to the name node, the write commit command including the file identifier and the block identifier; and performing a single direct append operation to append the file identifier, the block identifier and the data node at the name node.

The method may further include: sending a read file request from the client to the name node to read the data file; returning to the client the block identifier and the data node identifier of at least one data block, among the one or more data blocks, associated with the data file; sending a block read command from the client to the data node to retrieve the one or more data blocks stored in the second KV SSD of the data node; and returning the block data identified by the block identifier from the data node to the client.

The method may further include: sending a file delete command from the client to the name node, the file delete command including the file identifier; returning to the client the block identifier and the data node identifier of at least one data block, among the one or more data blocks, associated with the data file; sending a key-value delete command from the name node to the first KV SSD of the name node, the key-value delete command including the file identifier of the data file; sending a block delete command from the name node to the data node, the block delete command including a list of the one or more data blocks; and deleting the one or more data blocks stored in the second KV SSD of the data node.

The second KV SSD may store a file mapping table that includes a first mapping of the file to one or more data blocks associated with the file and a second mapping of at least one of the one or more data blocks to the data node.

The above exemplary embodiments have been described to illustrate various ways of implementing a system and method for using key-value storage to efficiently store data and metadata in a distributed file system. Various modifications to, and departures from, the disclosed exemplary embodiments will occur to those of ordinary skill in the art. The subject matter that is intended to fall within the scope of the present disclosure is set forth in the following claims.

100A, 100B‧‧‧distributed data storage system

101‧‧‧client application/client

105‧‧‧file

111, 720‧‧‧name node

115‧‧‧block mapping

121, 221, 740‧‧‧data node

140‧‧‧SSD

150‧‧‧key-value SSD/KV SSD

710‧‧‧client

730‧‧‧KV SSD (kv1)/KV SSD

750‧‧‧KV SSD (kv2)/KV SSD

761‧‧‧request

762, 768‧‧‧key-value store command

763, 764, 769, 770, 773, 774, 965‧‧‧completion message

765‧‧‧allocate command

766‧‧‧response

767‧‧‧block write request

771‧‧‧commit write command

772‧‧‧append command

861‧‧‧read file request

862‧‧‧retrieve command

863, 864, 963‧‧‧mapping information

865‧‧‧block read command

866‧‧‧block retrieve command

867‧‧‧content

868‧‧‧block content

961‧‧‧file delete command

962‧‧‧key-value retrieve command

964‧‧‧key-value delete command

966, 967‧‧‧block delete command

968, 969‧‧‧return

970‧‧‧asynchronous message

The accompanying drawings, which are included as part of this specification, illustrate the presently preferred embodiments and, together with the general description given above and the detailed description of the preferred embodiments given below, serve to explain and teach the principles described herein.

FIG. 1A illustrates a block diagram of a prior art distributed data storage system.

FIG. 1B illustrates a block diagram of an exemplary distributed data storage system including a key-value storage device, according to one embodiment.

FIG. 2A illustrates a system configuration of an exemplary SSD.

FIG. 2B illustrates an exemplary system configuration of a key-value SSD, according to one embodiment.

FIG. 3 illustrates an exemplary SSD channel and die architecture achieving channel-level parallelism, according to one embodiment.

FIG. 4 illustrates an exemplary SSD channel and die architecture achieving way-level parallelism, according to one embodiment.

FIG. 5 illustrates an exemplary SSD channel and die architecture achieving die/chip-level parallelism, according to one embodiment.

FIG. 6A illustrates an exemplary mapping scheme for an exemplary SSD.

FIG. 6B illustrates an exemplary mapping scheme for a KV SSD, according to one embodiment.

FIG. 7 illustrates an exemplary process of creating and storing a file in a KV SSD of a distributed file system, according to one embodiment.

FIG. 8 illustrates an exemplary process of reading a file stored in a KV SSD of a distributed file system, according to one embodiment.

FIG. 9 illustrates an exemplary process of deleting a file in a KV SSD of a distributed file system, according to one embodiment.

The figures are not necessarily drawn to scale, and elements of similar structures or functions are generally represented by like reference numerals for illustrative purposes throughout the figures. The figures are only intended to facilitate the description of the various embodiments described herein. The figures do not describe every aspect of the teachings disclosed herein and do not limit the scope of the claims.

Claims

A solid-state drive comprising: a plurality of data blocks; a plurality of flash memory channels and a plurality of ways for accessing the plurality of data blocks; and a solid-state drive controller configured to configure a block size of the plurality of data blocks; wherein a data file is stored in the solid-state drive with one or more key-value pairs, and at least one key-value pair has a block identifier as a key and block data as a value, and wherein a size of the data file is equal to the block size or a multiple of the block size.

The solid-state drive of claim 1, wherein the solid-state drive is used in a distributed file system including a Hadoop distributed file system (HDFS).

The solid-state drive of claim 1, wherein the solid-state drive controller is further configured to enable or disable block updates based on a block update flag.

The solid-state drive of claim 1, wherein the solid-state drive controller is further configured to align the data file with the plurality of data blocks based on an alignment flag.

The solid-state drive of claim 1, wherein the block size is determined based on an erase unit of the solid-state drive multiplied by the number of flash memory channels.

The solid-state drive of claim 1, wherein the block size is determined based on an erase unit of the solid-state drive multiplied by the number of ways.

The solid-state drive of claim 1, wherein the block size is equal to an erase unit of the solid-state drive.
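As a rough illustration of the three block-size options recited above (erase unit times the number of flash memory channels, erase unit times the number of ways, or the erase unit itself), the following Python sketch computes a candidate block size and checks whether a file size is a whole multiple of it. The function names, the 4 MiB erase unit, and the count of 8 channels are hypothetical values chosen for illustration; they do not come from the disclosure.

```python
def block_size(erase_unit: int, multiplier: int = 1) -> int:
    """Return the erase unit times a channel or way count.

    multiplier=1 corresponds to a block size equal to the erase unit.
    """
    if erase_unit <= 0 or multiplier <= 0:
        raise ValueError("erase_unit and multiplier must be positive")
    return erase_unit * multiplier


def is_aligned(file_size: int, blk: int) -> bool:
    """True when the file size equals the block size or a multiple of it."""
    return file_size > 0 and file_size % blk == 0


MIB = 1024 * 1024
ERASE_UNIT = 4 * MIB   # hypothetical erase unit
CHANNELS = 8           # hypothetical number of flash memory channels

blk = block_size(ERASE_UNIT, CHANNELS)   # 32 MiB under these assumptions
```

Under these assumed values, a 64 MiB file occupies exactly two blocks, while a 48 MiB file would not satisfy the size condition.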

The solid-state drive of claim 1, wherein the solid-state drive stores a file mapping table, and the file mapping table includes a first mapping of a file to one or more data blocks, among the plurality of data blocks, that are associated with the file, and a second mapping of at least one of the one or more data blocks to a data node including the solid-state drive.

A distributed data storage system comprising: a client; a name node including a first key-value solid-state drive; and a data node including a second key-value solid-state drive, wherein the second key-value solid-state drive includes a plurality of data blocks, a plurality of flash memory channels and a plurality of ways for accessing the plurality of data blocks, and a solid-state drive controller configured to configure a block size of the plurality of data blocks, wherein the client sends to the name node a create file request including a file identifier for storing a data file, and sends an allocate command to the name node to allocate, among the plurality of data blocks, one or more data blocks associated with the data file, wherein the name node returns to the client a block identifier of the one or more data blocks and a data node identifier of the data node assigned to store the one or more data blocks, wherein the client sends a block store command to the data node to store the one or more data blocks, wherein the second key-value solid-state drive stores the one or more data blocks as key-value pairs, and at least one key-value pair has the block identifier as a key and block data as a value, and wherein a size of the data file is equal to the block size or a multiple of the block size.

The distributed data storage system of claim 9, wherein the distributed data storage system employs a Hadoop distributed file system (HDFS).

The distributed data storage system of claim 9, wherein the second key-value solid-state drive stores a file mapping table, and the file mapping table includes a first mapping of the data file to one or more data blocks associated with the data file and a second mapping of at least one of the one or more data blocks to a data node.

A method comprising: sending a create file request from a client to a name node, wherein the create file request includes a file identifier for storing a data file; storing the file identifier as a key-value pair in a first key-value solid-state drive of the name node, wherein the file identifier is stored as the key of the key-value pair and the value associated with the key is empty; sending an allocate command from the client to the name node to allocate one or more data blocks associated with the data file; assigning, at the name node, a block identifier to at least one of the one or more data blocks and assigning a data node to store the one or more data blocks; returning the block identifier and a data node identifier of the data node from the name node to the client; sending a write block request from the client to the data node, wherein the write block request includes the block identifier and content; and storing the one or more data blocks as key-value pairs in a second key-value solid-state drive of the data node, wherein the second key-value solid-state drive of the data node includes one or more data blocks having a block size, wherein at least one of the key-value pairs has a block identifier as a key and block data as a value, and wherein a size of the data file is equal to the block size or a multiple of the block size.

The method of claim 12, wherein the client, the name node, and the data node are nodes in a Hadoop distributed file system (HDFS).

The method of claim 12, further comprising: setting a block update flag to enable or disable block updates.

The method of claim 12, further comprising: setting an alignment flag to align the data file with a plurality of data blocks of the second key-value solid-state drive of the data node.

The method of claim 12, further comprising: sending a write commit command from the client to the name node, the write commit command including the file identifier and the block identifier; and appending, in a single direct operation at the name node, the file identifier, the block identifier, and the data node.

The method of claim 16, further comprising: sending a read file request from the client to the name node to read the data file; returning to the client the block identifier and the data node identifier of at least one data block, among the one or more data blocks, that is associated with the data file; sending a block read command from the client to the data node to retrieve the one or more data blocks stored in the second key-value solid-state drive of the data node; and returning the block data identified by the block identifier from the data node to the client.

The method of claim 17, further comprising: sending a file delete command from the client to the name node, the file delete command including the file identifier; returning to the client the block identifier and the data node identifier of at least one data block, among the one or more data blocks, that is associated with the data file; sending a key-value delete command from the name node to the first key-value solid-state drive of the name node, the key-value delete command including the file identifier of the data file; sending a block delete command from the name node to the data node, the block delete command including a list of the one or more data blocks; and deleting the one or more data blocks stored in the second key-value solid-state drive of the data node.

The method of claim 12, wherein the second key-value solid-state drive stores a file mapping table, and the file mapping table includes a first mapping of the data file to one or more data blocks associated with the data file and a second mapping of at least one of the one or more data blocks to the data node.
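The create, allocate, write, commit, read, and delete steps recited in the method claims above can be sketched roughly as follows. This is a minimal in-memory model and not the claimed implementation: the class and method names (`KVSSD`, `NameNode`, `DataNode`, `write_file`, `read_file`), the `blk_N` identifier format, and the round-robin choice of data node are all assumptions made for illustration, and a Python dictionary stands in for each key-value solid-state drive. In particular, `commit_write` emulates the append with a read-modify-write on the dictionary, whereas the claims describe a single direct operation.

```python
import itertools


class KVSSD:
    """Dictionary-backed stand-in for a key-value solid-state drive."""

    def __init__(self):
        self._pairs = {}

    def store(self, key, value):
        self._pairs[key] = value

    def retrieve(self, key):
        return self._pairs[key]

    def delete(self, key):
        self._pairs.pop(key, None)


class DataNode:
    """Stores blocks as key-value pairs: block identifier -> block data."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.kv = KVSSD()

    def write_block(self, block_id, content):
        self.kv.store(block_id, content)

    def read_block(self, block_id):
        return self.kv.retrieve(block_id)

    def delete_blocks(self, block_ids):
        for block_id in block_ids:
            self.kv.delete(block_id)


class NameNode:
    """Keeps file identifier -> [(block identifier, data node identifier)]."""

    def __init__(self, data_nodes):
        self.kv = KVSSD()
        self.data_nodes = {node.node_id: node for node in data_nodes}
        self._next_block = itertools.count(1)
        # Hypothetical placement policy: rotate through the data nodes.
        self._next_node = itertools.cycle(sorted(self.data_nodes))

    def create_file(self, file_id):
        # The file identifier is the key; the value starts out empty.
        self.kv.store(file_id, [])

    def allocate(self, file_id):
        # Assign a block identifier and a data node for one block.
        return f"blk_{next(self._next_block)}", next(self._next_node)

    def commit_write(self, file_id, block_id, node_id):
        # Emulated append of (block, data node) to the file's mapping.
        mapping = self.kv.retrieve(file_id)
        self.kv.store(file_id, mapping + [(block_id, node_id)])

    def lookup(self, file_id):
        return self.kv.retrieve(file_id)

    def delete_file(self, file_id):
        mapping = self.kv.retrieve(file_id)
        self.kv.delete(file_id)  # key-value delete on the name node's drive
        blocks_by_node = {}
        for block_id, node_id in mapping:
            blocks_by_node.setdefault(node_id, []).append(block_id)
        for node_id, block_ids in blocks_by_node.items():
            self.data_nodes[node_id].delete_blocks(block_ids)


def write_file(name_node, file_id, chunks):
    """Client side: create the file, then allocate, write, commit per block."""
    name_node.create_file(file_id)
    for chunk in chunks:
        block_id, node_id = name_node.allocate(file_id)
        name_node.data_nodes[node_id].write_block(block_id, chunk)
        name_node.commit_write(file_id, block_id, node_id)


def read_file(name_node, file_id):
    """Client side: fetch the mapping, then read each block from its node."""
    return b"".join(
        name_node.data_nodes[node_id].read_block(block_id)
        for block_id, node_id in name_node.lookup(file_id)
    )
```

For brevity the client reaches data nodes through the name node's node table rather than over a network, and block sizes are not enforced. Writing two chunks with `write_file(nn, "/a.txt", [b"hello ", b"world"])` and calling `read_file(nn, "/a.txt")` returns `b"hello world"`; `nn.delete_file("/a.txt")` then removes both the file entry on the name node and the block entries on the data nodes.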