CN112131140B

CN112131140B - SSD-based key-value separation storage method supporting efficient storage space management

Info

Publication number: CN112131140B
Application number: CN202011018307.6A
Authority: CN
Inventors: 王冲; 刘莉; 张扬; 周可; 牛中盈; 滕海; 李春花; 张洲; 王颖
Original assignee: Aerospace Science And Technology Network Information Development Co ltd; Huazhong University of Science and Technology; Beijing Institute of Computer Technology and Applications
Current assignee: Aerospace Science And Technology Network Information Development Co ltd; Huazhong University of Science and Technology; Beijing Institute of Computer Technology and Applications
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2023-07-14
Anticipated expiration: 2040-09-24
Also published as: CN112131140A

Abstract

The present invention relates to an SSD-based key-value separation storage method that supports efficient storage space management, including: dividing the value storage space into segments of equal length, constructing a segment manager to manage the invalidation and valid states of all data segments, and providing data for each Create a value storage invalidation offset set and a key storage invalidation offset set for each segment, and perform available segment caching and semi-invalid segment caching. The value storage invalidation offset set is used to record the invalidation value metadata discarded in the compression operation of the key storage , space reclamation with auxiliary value storage; the key storage invalidation offset set is used to record the offsets in the reclaimed data segment that still exist in the key storage after passive garbage collection, these locations do not need to be recycled, so if the key storage These offsets are collected in , and discarded directly. The present invention builds an efficient value storage space manager by collecting the invalid key-value pairs discarded in the downward compression operation in the key storage part, realizes light garbage collection operation, and further reduces the GC operation in the value storage to the foreground write operation of the system Impact.

Description

SSD-based key-value separation storage method supporting efficient storage space management

技术领域technical field

本发明属于计算机存储系统，具体涉及基于SSD的支持高效存储空间管理的键值分离存储方法。The invention belongs to a computer storage system, and in particular relates to an SSD-based key-value separation storage method supporting efficient storage space management.

背景技术Background technique

持久键值存储在现代数据密集型存储系统和应用中起着至关重要的作用，例如消息传递、电子商务、搜索索引和广告等。日志结构合并树(Log-structuredmerge tree,LSM-tree)是1996年提出的一种基于磁盘的写优化的数据结构，通过将随机写转换成顺序写来获取可观的写性能，并通过保证磁盘内数据有序来提供可靠的查询性能，目前是主流的持久化键值存储采用的数据结构之一。自2006年Google发表分布式键值存储系统Bigtable的论文，并在后续开源了单机键值存储引擎LevelDB之后，Facebook基于LevelDB进行优化，提出了开源单机键值存储引擎RocksDB，Hadoop生态系统中也基于Bigtable开源实现了HBase。Persistent key-value stores play a vital role in modern data-intensive storage systems and applications, such as messaging, e-commerce, search indexing, and advertising. Log-structured merge tree (LSM-tree) is a disk-based write-optimized data structure proposed in 1996. It obtains considerable write performance by converting random writes into sequential writes, and guarantees Data is ordered to provide reliable query performance, which is currently one of the data structures used by mainstream persistent key-value storage. Since Google published a paper on the distributed key-value storage system Bigtable in 2006, and subsequently open-sourced the stand-alone key-value storage engine LevelDB, Facebook has optimized it based on LevelDB and proposed an open-source stand-alone key-value storage engine RocksDB. The Hadoop ecosystem is also based on Bigtable implements HBase open source.

基于LSM-tree的键值存储系统整体由内存组件和磁盘组件两部分构成，两个部件内数据都是有序的。写入的数据先缓存在内存组件中，当内存组件的数据量达到一定阈值时，将数据批量持久化到磁盘组件，与磁盘组件中存在键范围覆盖的数据进行合并排序，该操作称之为压缩(Compaction)。压缩操作将数据进行合并排序，在保证数据之间有序的同时回收无效空间。随着数据量的增长，基于LSM-tree的键值存储系统需要大量的压缩操作来维护数据组织，从而提供良好的查询性能。为了分摊压缩操作的开销，现代基于LSM-tree的键值存储系统将磁盘组件设计成了多层结构。然而，大量的压缩操作会严重影响系统的写性能并造成严重的写放大。写放大定义为执行的总写入数据量与键值存储中实际写入的数据量的比值。因此，出现了大量的优化基于LSM-tree的键值存储系统，如HyperLevelDB、bLSM、LSM-trie、Wisckey、TRIAD、PebblesDB、SILK等。The key-value storage system based on LSM-tree is composed of memory components and disk components as a whole, and the data in the two components are in order. The written data is first cached in the memory component. When the data volume of the memory component reaches a certain threshold, the data is persisted to the disk component in batches, and merged and sorted with the data covered by the key range in the disk component. This operation is called Compression. The compression operation merges and sorts the data, and reclaims invalid space while ensuring the order of the data. As the amount of data grows, LSM-tree-based key-value storage systems require a large number of compression operations to maintain data organization and thus provide good query performance. In order to amortize the overhead of compression operations, modern LSM-tree-based key-value storage systems design disk components into a multi-layer structure. However, a large number of compaction operations will seriously affect the system's write performance and cause severe write amplification. Write amplification is defined as the ratio of the total amount of written data performed to the amount of data actually written in the key-value store. Therefore, there have been a large number of optimized LSM-tree-based key-value storage systems, such as HyperLevelDB, bLSM, LSM-trie, Wisckey, TRIAD, PebblesDB, SILK, etc.

Wisckey是一种键值分离的持久化键值存储系统，包括值存储和键存储两部分。值存储是一个可重用日志，用于存放键值对数据；键存储是一个基于LSM-tree的键值存储系统，用于存储数据索引。Wisckey是一种SSD优化的数据布局，适用于值较大的应用场景，通过键值分离的方式避免了压缩操作带来的严重写性能影响和写放大。当值存储空间被消耗完以后，Wisckey需要对值存储进行空间回收，通过消除失效数据来回收可用空间，该操作称之为垃圾回收。垃圾回收操作的具体步骤是1)在值存储中连续读取批量数据，2)在键存储中逐个查找这些数据的键，判断数据的有效性，3)将有效数据重写入值存储，4)将索引数据更新到键存储。垃圾回收操作开销非常大，而在写密集环境下，频繁的垃圾回收将造成写性能严重下降和大量的数据重写。Wisckey is a persistent key-value storage system with key-value separation, including two parts: value storage and key storage. Value storage is a reusable log for storing key-value pair data; key storage is an LSM-tree-based key-value storage system for storing data indexes. Wisckey is an SSD-optimized data layout, suitable for application scenarios with large values. It avoids severe write performance impact and write amplification caused by compression operations through key-value separation. When the value storage space is exhausted, Wisckey needs to reclaim the value storage space and reclaim the available space by eliminating invalid data. This operation is called garbage collection. The specific steps of the garbage collection operation are 1) read batches of data continuously in the value storage, 2) look up the keys of these data one by one in the key storage, and judge the validity of the data, 3) rewrite the valid data into the value storage, 4 ) to update the index data to the key store. Garbage collection operations are very expensive, and in a write-intensive environment, frequent garbage collections will cause severe write performance degradation and a large amount of data rewriting.

发明内容Contents of the invention

本发明的目的在于提供一种基于SSD的支持高效存储空间管理的键值分离存储方法，用于解决现有技术垃圾对系统写性能的具有不良影响的问题。The purpose of the present invention is to provide an SSD-based key-value separation storage method supporting efficient storage space management, which is used to solve the problem of bad influence of garbage on system writing performance in the prior art.

本发明一种基于SSD的支持高效存储空间管理的键值分离存储方法，其中，包括：将值存储空间划分为等长的段，构建段管理器以管理所有数据段的失效和有效状态，为每个段建立值存储失效偏移集和键存储失效偏移集，值存储空间存在三种类型的段，分别为：满失效段、部分失效段和有效段；所述满失效段表示段内所有数据都为无效数据；所述部分失效段表示段内部分数据为无效数据；所述有效段表示段内所有数据均为有效数据；进行可用段缓存和半失效段缓存，所述可用段缓存用于缓存所述满失效段的起始偏移量和所述被动垃圾回收后已经回收空间的段的起始偏移量，所述半失效段缓存用于缓存失效数据的数量占段内数据总量一半及以上的所述部分失效段，所述值存储偏移有效位图用于标记值存储内数据段中数据的有效性；所述键存储偏移有效位图用于标记被动垃圾回收后，数据段原先标记为有效数据的位置，在采集键存储内丢弃的失效偏移时判断是否为已回收位置。The present invention is an SSD-based key-value separation storage method that supports efficient storage space management, which includes: dividing the value storage space into segments of equal length, and constructing a segment manager to manage the invalidation and valid states of all data segments, for Each segment establishes a value storage invalidation offset set and a key storage invalidation offset set. There are three types of segments in the value storage space, namely: full invalid segment, partial invalid segment and valid segment; All data is invalid data; the partial invalidation segment indicates that part of the data in the segment is invalid data; the valid segment indicates that all data in the segment is valid data; the available segment cache and the semi-invalid segment cache are performed, and the available segment cache It is used to cache the starting offset of the full invalidation segment and the starting offset of the segment whose space has been reclaimed after the passive garbage collection, and the semi-invalid segment cache is used to cache invalid data to account for the data in the segment For the partially invalid segments that are half or more of the total amount, the value storage offset valid bitmap is used to mark the validity of the data in the data segment in the value storage; the key storage offset valid bitmap is used to mark passive garbage collection Finally, the position of the data segment originally marked as valid data is judged to be a recovered position when collecting the discarded invalidation offset in the key storage.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，段大小与文件系统的页大小对齐，并且大于单个键值对大小。According to an embodiment of the SSD-based key-value separation storage method supporting efficient storage space management of the present invention, the segment size is aligned with the page size of the file system and is larger than the size of a single key-value pair.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，对于失效偏移，从键存储的压缩操作中不断地采集值存储中失效数据的偏移量。According to an embodiment of the SSD-based key-value separation storage method supporting efficient storage space management of the present invention, for the invalidation offset, the offset of invalidated data in the value storage is continuously collected from the compression operation of the key storage.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，根据段状态和值存储空间使用状态，分为主动垃圾回收和被动垃圾回收；所述主动垃圾回收包括：当将缓存中的数据刷入SSD时，如果存在所述满失效段，则将缓存中的数据写入到所述满失效段中；所述被动垃圾回收包括：当值存储空间用完后，触发空间回收操作，选取批量数据段，通过将段中的键和偏移量与键存储中的键值对进行匹配来判断数据的有效性，丢弃失效数据，将有效数据重写入值存储中，并将重写数据的键和新的偏移写入键存储中，将回收后的段的起始偏移量放入可用段缓存。According to an embodiment of the SSD-based key-value separation storage method supporting high-efficiency storage space management of the present invention, it is divided into active garbage collection and passive garbage collection according to the segment state and value storage space usage state; the active garbage collection Including: when brushing the data in the cache into the SSD, if there is the full failure segment, then writing the data in the cache into the full failure segment; the passive garbage collection includes: when the value storage space is used up Finally, the space reclamation operation is triggered, the batch data segment is selected, the validity of the data is judged by matching the key and offset in the segment with the key-value pair in the key storage, the invalid data is discarded, and the valid data is rewritten into the value In the storage, write the key and new offset of the rewritten data into the key storage, and put the starting offset of the reclaimed segment into the available segment cache.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，将值存储空间划分为主数据区和预留数据区，在所述被动垃圾回收中产生的有效数据重写到预留数据区；所述主数据区用于存储新数据；所述预留数据区用于存储在被动垃圾回收中产生的有效数据。According to an embodiment of the SSD-based key-value separation storage method supporting high-efficiency storage space management of the present invention, the value storage space is divided into a main data area and a reserved data area, and the effective data generated in the passive garbage collection Data is rewritten to the reserved data area; the main data area is used to store new data; the reserved data area is used to store valid data generated in passive garbage collection.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，主数据区需比预留数据区大。根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，主数据区需比预留数据区大。According to an embodiment of the SSD-based key-value separation storage method supporting efficient storage space management of the present invention, the main data area needs to be larger than the reserved data area. According to an embodiment of the SSD-based key-value separation storage method supporting efficient storage space management of the present invention, the main data area needs to be larger than the reserved data area.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，段管理器中记录值存储中数据段的有效性状态元数据，使得在恢复存储系统时能获取段的状态，将段管理器中的元数据定期持久化到磁盘。According to an embodiment of the SSD-based key-value separation storage method supporting efficient storage space management of the present invention, wherein the segment manager records the validity state metadata of the data segment in the value storage, so that it can be obtained when the storage system is restored The state of the segment, which periodically persists metadata from the segment manager to disk.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，段状态日志记录满失效段偏移、半失效段偏移及值存储失效偏移集，每次被动垃圾回收后段的偏移和键存储失效偏移集。According to an embodiment of the SSD-based key-value separation storage method supporting high-efficiency storage space management of the present invention, the segment state log records the full invalidation segment offset, the half-invalid segment offset and the value storage invalidation offset set, each time Set of offsets and keystore invalidation offsets for segments after passive garbage collection.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，从键存储中的压缩操作中采集所丢弃的无效键值对，从中提取值存储中失效数据的偏移和长度，通过偏移和段粒度，以及偏移百分比的段粒度，得到所属段的起始偏移量和段内偏移，将段内偏移和长度组合为失效偏移数据插入该段的值存储失效偏移集。According to an embodiment of the SSD-based key-value separation storage method supporting high-efficiency storage space management of the present invention, the discarded invalid key-value pairs are collected from the compression operation in the key storage, and the invalid data in the value storage is extracted therefrom. Offset and length, through the offset and segment granularity, and the segment granularity of the offset percentage, the starting offset and intra-segment offset of the segment to which it belongs are obtained, and the intra-segment offset and length are combined into invalid offset data and inserted into the The value of the segment stores the set of invalidation offsets.

根据本发明的基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例，其中，当将写缓存中的数据刷回值存储时，写入位置的判断包括：如果可用段缓存中段的总容量大于写缓存大小，则选取候选段，获得起始偏移，将写缓存数据刷回候选段中；如果无可用段，在主数据区中获取写入位置，若该位置非主数据区尾部，写入该位置；否则触发被动垃圾回收操作，在完成空间回收后，选取候选段，获得起始偏移，将写缓存数据刷回候选段中。According to an embodiment of the SSD-based key-value separation storage method supporting high-efficiency storage space management of the present invention, when the data in the write cache is flushed back to value storage, the judgment of the write position includes: if the segment in the segment cache is available If the total capacity is greater than the size of the write cache, select the candidate segment, obtain the starting offset, and flush the write cache data back to the candidate segment; if there is no available segment, obtain the write location in the main data area, if the location is not the main data At the end of the area, write to this location; otherwise, a passive garbage collection operation is triggered. After the space recovery is completed, the candidate segment is selected, the starting offset is obtained, and the write cache data is flushed back to the candidate segment.

本发明提供了一种基于SSD的支持高效存储空间管理的键值分离存储方法，通过在键存储部分采集向下压缩操作(Compaction)中丢弃的失效键值对，构建一个高效的值存储空间管理器，实现轻量地垃圾回收(Garbage Collection,GC)操作，进一步减轻值存储中GC操作对系统前台写操作的影响。在键存储中，Compaction操作中丢弃的键值对的值是值存储中失效数据的偏移量。为了有效利用采集到的失效偏移，将值存储空间划分为等长的段，以段为单位管理失效偏移，并以段为单位进行值存储空间的GC操作。本发明在触发密集GC的工作负载下，能够提升写性能并减少写放大。The present invention provides an SSD-based key-value separation storage method that supports efficient storage space management, and constructs an efficient value storage space management by collecting invalid key-value pairs discarded in the downward compression operation (Compaction) in the key storage part The device implements lightweight garbage collection (Garbage Collection, GC) operations, and further reduces the impact of GC operations in value storage on foreground write operations in the system. In the key store, the value of the key-value pair discarded in the Compaction operation is the offset of the invalid data in the value store. In order to effectively utilize the collected invalidation offsets, the value storage space is divided into segments of equal length, the invalidation offsets are managed in units of segments, and the GC operation of the value storage space is performed in units of segments. The present invention can improve write performance and reduce write amplification under the workload that triggers intensive GC.

附图说明Description of drawings

图1为Wisckey的系统架构和读/写流程图；Figure 1 is the system architecture and read/write flowchart of Wisckey;

图2为SRKV的原型架构图；Figure 2 is a prototype architecture diagram of SRKV;

图3为值存储空间的组织结构图；Figure 3 is an organizational chart of the value storage space;

图4所示为被动垃圾回收操作过程图；Figure 4 is a diagram of the passive garbage collection operation process;

图5为段管理器中单个段的值存储失效偏移集和键存储失效偏移集在执行两种垃圾回收操作后的变化示例图；Figure 5 is an example diagram of the change of the value storage invalidation offset set and the key storage invalidation offset set of a single segment in the segment manager after two garbage collection operations are performed;

图6为写缓存数据写入位置判断过程图。FIG. 6 is a process diagram of judging the writing location of write cache data.

具体实施方式Detailed ways

为使本发明的目的、内容、和优点更加清楚，下面结合附图和实施例，对本发明的具体实施方式作进一步详细描述。In order to make the purpose, content, and advantages of the present invention clearer, the specific implementation manners of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments.

Wisckey基于LSM-tree的持久化键值存储，如LevelDB的压缩操作会造成严重的读/写放大和写性能下降问题，而提出的一种面向SSD优化的键值分离的持久化键值存储系统，适用于键值对较大的场景。Wisckey包括值存储和键存储两部分。值存储是一个可重用日志，用于存放键值对数据；键存储是一个基于LSM-tree的键值存储系统，用于存储数据索引。图1为Wisckey的系统架构和读/写流程图，如图1所示，Wisckey's persistent key-value storage based on LSM-tree, such as LevelDB's compression operation will cause serious read/write amplification and write performance degradation, and proposes a persistent key-value storage system for SSD-optimized key-value separation , suitable for scenarios with large key-value pairs. Wisckey includes two parts: value storage and key storage. Value storage is a reusable log for storing key-value pair data; key storage is an LSM-tree-based key-value storage system for storing data indexes. Figure 1 is the system architecture and read/write flowchart of Wisckey, as shown in Figure 1,

写数据时，先将数据写入值存储，再将索引数据写入键存储，具体的步骤如下：When writing data, first write the data into the value storage, and then write the index data into the key storage. The specific steps are as follows:

将数据写入写缓存，如果写缓存已满则先将写缓存中的数据刷回磁盘，执行(2)-(3)步，再将数据写入写缓存；Write the data into the write cache. If the write cache is full, first flush the data in the write cache back to the disk, perform steps (2)-(3), and then write the data into the write cache;

将写缓存中的数据刷回磁盘，提取数据的键、偏移量和数据的长度，生成索引数据<键，<偏移，长度>>；Flush the data in the write cache back to the disk, extract the key, offset and length of the data, and generate index data <key, <offset, length>>;

将批量索引数据写入键存储的写缓存(称之为Memtable)中，若Memtable已满则进行(4)-(5)步，在该步之后为由写数据产生的系统后台操作；Write the batch index data into the write cache (called Memtable) of the key storage. If the Memtable is full, perform steps (4)-(5). After this step, the system background operation generated by writing data;

若Memtable已满，则将Memtable转变为不可变Memtable，并生成一个新的Memtable以接收数据；If the Memtable is full, convert the Memtable to an immutable Memtable and generate a new Memtable to receive data;

将不可变Memtable刷回LSM-tree磁盘组件的第一层(称之为L0层)，如果L0层的文件数量达到阈值则触发压缩操作，进行第(6)步；Brush the immutable Memtable back to the first layer of the LSM-tree disk component (called L0 layer), if the number of files in the L0 layer reaches the threshold, the compression operation is triggered, and step (6) is performed;

选择L0层中存在键范围覆盖的所有文件与L1层存在键范围覆盖的文件进行合并排序，并将结果写回L1层；如果L1层的大小达到阈值，则触发L1层的压缩操作，一直往下层压缩数据。Select all the files with key range coverage in the L0 layer and the files with the key range coverage in the L1 layer to merge and sort, and write the results back to the L1 layer; if the size of the L1 layer reaches the threshold, the compression operation of the L1 layer is triggered. The lower layer compresses the data.

读数据时，先在键存储中查找键，获取键所对应数据的偏移和长度后，去值存储中读取数据。在键存储中查找数据的顺序为：1)Memtable，2)不可变Memtable，3)L0层存在键范围覆盖的所有文件，按先新后旧的顺序查找这些文件，4)L1层及以下的层次中存在键范围覆盖的最多一个文件。按序查找键，如果找到，返回其值，包括数据偏移和数据长度；如果找不到则返回找不到该键。When reading data, first look up the key in the key storage, obtain the offset and length of the data corresponding to the key, and then read the data in the value storage. The order of searching data in the key storage is: 1) Memtable, 2) immutable Memtable, 3) all files covered by the key range in the L0 layer, and search for these files in the order of new and old, 4) L1 layer and below At most one file covered by a key range exists in the hierarchy. Search the key in order, if found, return its value, including data offset and data length; if not found, return that the key cannot be found.

当值存储空间被消耗完以后，Wisckey需要对值存储进行空间回收，通过消除失效数据来回收可用空间，该操作称之为垃圾回收。垃圾回收操作的具体步骤如下：When the value storage space is exhausted, Wisckey needs to reclaim the value storage space and reclaim the available space by eliminating invalid data. This operation is called garbage collection. The specific steps of the garbage collection operation are as follows:

在值存储日志中从垃圾回收操作的起始位置连续读取批量数据(如，64MB)；Continuously read bulk data (eg, 64MB) from the beginning of the garbage collection operation in the value store log;

在键存储中逐个查找这些数据的键，匹配该键在两个存储中的偏移，如果相同则表示数据有效，不同则表示数据无效；Find the keys of these data one by one in the key storage, and match the offset of the key in the two storages. If they are the same, it means that the data is valid, and if they are different, it means that the data is invalid;

将有效数据重写入值存储日志的写数据起始位置；Rewrite valid data to the start position of write data in the value storage log;

生成重写数据的索引数据，将索引数据批量更新到键存储。Generate index data for rewritten data, and update index data to key storage in batches.

依据上述的方法，本发明基于SSD的支持高效存储空间管理的键值分离存储方法的一实施例中，包括：According to the above method, an embodiment of the SSD-based key-value separation storage method supporting efficient storage space management of the present invention includes:

段粒度包括：Segment granularities include:

将值存储日志的存储空间划分为等长的段，段大小应该与文件系统的页大小对齐(即4KB)，并且必须大于单个键值对大小，为了适应变长键值大小的应用环境，将段粒度对齐写缓存大小(如1MB)。Divide the storage space of the value storage log into segments of equal length. The segment size should be aligned with the page size of the file system (that is, 4KB), and must be larger than the size of a single key-value pair. In order to adapt to the application environment of variable-length key-value size, set Segment granularity aligns the write cache size (eg 1MB).

段元数据包括：Segment metadata includes:

在内存中维护所有段的元数据信息，包括各个段的起始偏移量、段内键值对数量、段内失效偏移集；Maintain the metadata information of all segments in memory, including the starting offset of each segment, the number of key-value pairs in the segment, and the failure offset set in the segment;

采集基于LSM-tree的键存储中的压缩操作所丢弃的无效键值对，从中提取值，获取值存储中失效数据的偏移和长度，根据数据偏移得到所属段的起始偏移量，将段内偏移和长度组合为失效偏移插入该段的失效偏移集。Collect the invalid key-value pairs discarded by the compression operation in the LSM-tree-based key storage, extract the value from them, obtain the offset and length of the invalid data in the value storage, and obtain the starting offset of the segment according to the data offset, Combines the in-segment offset and length as a failure offset to insert into the failure offset set for that segment.

主动垃圾回收包括：Active garbage collection includes:

当一个段内所有数据都失效时，在即将到来的写缓存刷回磁盘操作中，将数据刷回该段。When all data in a segment is invalid, the data is flushed back to the segment in the upcoming write cache flush operation.

被动垃圾回收包括：Passive garbage collection includes:

当值存储日志空间用完后，触发所述被动垃圾回收操作，回收空间阈值如64MB，选择满足该容量的部分失效段，通过与键存储中的键值对匹配判断数据的有效性，重写有效数据，这将产生两个问题：1)重写的数据不对齐段大小，因此，在实现SRKV时，应该将值存储空间划分为两个部分，一部分用于存储新数据，即所述主数据区，另一部分用于存储被动垃圾回收操作后的重写数据，即所述预留数据区；2)这些数据在键存储中的索引数据已失效，然而还未被丢弃，在后续压缩过程中会被采集，因此，在实现SRKV时还需要在段元数据中增加在键存储中的该段存在的失效偏移集。最后，生成重写数据的索引数据写入键存储。When the log space of the value storage is used up, the passive garbage collection operation is triggered, and the recovery space threshold is such as 64MB, and some invalid segments that meet the capacity are selected, and the validity of the data is judged by matching the key-value pairs in the key storage, and rewritten valid data, this will cause two problems: 1) the rewritten data is not aligned with the segment size, therefore, when implementing SRKV, the value storage space should be divided into two parts, one part is used to store new data, that is, the main The data area, the other part is used to store the rewritten data after the passive garbage collection operation, that is, the reserved data area; 2) the index data of these data in the key storage has been invalidated, but has not been discarded yet, in the subsequent compression process will be collected, therefore, when implementing SRKV, it is also necessary to add the failure offset set of the segment in the key storage to the segment metadata. Finally, the index data that generates the rewritten data is written to the key store.

图2为SRKV的原型架构图，如图2所示，SRKV的键值分离存储架构：1)在键存储和值存储之间增加了段管理器以维护所有段的状态信息，管理和使用段以实现高效的存储空间管理；2)将原先的可重用日志划分为主数据区和预留数据区；3)增加段状态日志，定期持久化段状态变更，用于在恢复数据库时恢复段状态。Figure 2 is the prototype architecture diagram of SRKV. As shown in Figure 2, the key-value separation storage architecture of SRKV: 1) A segment manager is added between the key storage and the value storage to maintain the status information of all segments, manage and use segments In order to achieve efficient storage space management; 2) divide the original reusable log into the main data area and the reserved data area; 3) increase the segment status log, and periodically persist the status change of the segment, which is used to restore the segment status when restoring the database .

SRKV的具体实现如下：The specific implementation of SRKV is as follows:

构建段管理器以维护和管理主数据区内所有数据段的元数据和状态，包括段粒度、段元数据数组、可用段缓存和半失效段缓存。根据主数据区容量和段粒度，可以获得段的数量。各个段的元数据包括段内数据总量、值存储失效偏移集和键存储失效偏移集。Build a segment manager to maintain and manage the metadata and status of all data segments in the main data area, including segment granularity, segment metadata array, available segment cache, and half-invalid segment cache. According to the capacity of the main data area and the granularity of the segment, the number of segments can be obtained. The metadata of each segment includes the total amount of data in the segment, the value store invalidation offset set, and the key store invalidation offset set.

图3为值存储空间的组织结构图，如图3所示，将值存储空间划分为主数据区和预留数据区。主数据区用于存储新数据，以段为单位进行回收和重用；预留数据区用于存储被动垃圾回收操作中产生的有效数据，设定头/尾两个指针，按传统的日志追加写和循环空间回收。为了提供足够的时间去累积失效偏移，主数据区需比预留数据区大，如按7:3划分值存储空间。Fig. 3 is an organizational structure diagram of the value storage space. As shown in Fig. 3, the value storage space is divided into a main data area and a reserved data area. The main data area is used to store new data, which is recycled and reused in units of segments; the reserved data area is used to store valid data generated in passive garbage collection operations, set the head/tail pointers, and write additionally according to the traditional log and recycling space recycling. In order to provide enough time to accumulate failure offsets, the main data area must be larger than the reserved data area, such as dividing the value storage space by 7:3.

段状态日志记录满失效段偏移、半失效段偏移及值存储失效偏移集，每次被动垃圾回收后段的偏移和键存储失效偏移集。The segment status log records the offset of the full invalid segment, the offset of the half invalid segment, and the invalid offset set of the value storage, and the offset of the segment and the invalid offset set of the key storage after each passive garbage collection.

从键存储中的压缩操作中采集所丢弃的无效键值对，从中提取值存储中失效数据的偏移和长度，通过哈希方法：1)偏移/段粒度，2)偏移％段粒度，得到所属段的起始偏移量和段内偏移，将段内偏移和长度组合为失效偏移数据插入该段的值存储失效偏移集。Collect the discarded invalid key-value pairs from the compression operation in the key store, and extract the offset and length of the invalid data in the value store from it, through the hash method: 1) offset/segment granularity, 2) offset% segment granularity , get the start offset and intra-segment offset of the segment to which it belongs, and combine the intra-segment offset and length into invalidation offset data and insert the value of this segment to store the invalidation offset set.

图4所示为被动垃圾回收操作过程图，如图4所示，随着不断累积的失效偏移，段管理器中出现了三种状态的段：1)满失效段，2)部分失效段，3)有效段。当一个段的值存储失效偏移集中的失效偏移数等于该段内数据总量时，判断为满失效段时，将该段的起始偏移放入可用段缓存中。等到下一次刷回写缓存数据时，触发主动垃圾回收操作。刷回写缓存数据时，如果发现值存储主数据区空间已用完，触发被动垃圾回收操作，从半失效段缓存中选取满足回收空间阈值(如64MB)的部分失效段；若获取段的总容量小于被动垃圾回收的阈值，则通过遍历段元数据数据来选择剩余的部分失效段；读取段内所有数据并进行遍历，跳过在值存储失效偏移集中出现的位置，逐个获取键、段内偏移和数据长度，将段内偏移和长度放入键存储失效偏移集中，在键存储中对该键进行查找，若找到的键值对中的值的偏移和长度与值存储中的相同，则判断该数据有效；将有效数据重写到预留数据区的尾指针处，并生成新的索引数据写入键存储中；将空间回收后的段的起始偏移放入可用段缓存中，并重置其元数据：将段内数据总量清零，将值存储失效偏移集清空。图5为段管理器中单个段的值存储失效偏移集和键存储失效偏移集在执行两种垃圾回收操作后的变化示例图，如图5。Figure 4 shows the operation process diagram of passive garbage collection. As shown in Figure 4, with the continuous accumulation of invalidation offsets, segments in three states appear in the segment manager: 1) full invalid segment, 2) partially invalid segment , 3) Effective segment. When the number of invalidation offsets in the value storage invalidation offset set of a segment is equal to the total amount of data in the segment, when it is judged as a full invalidation segment, the starting offset of the segment is put into the available segment cache. Wait until the next time the write cache data is flushed back to trigger an active garbage collection operation. When flashing back the write cache data, if it is found that the space in the main data area of the value storage is exhausted, a passive garbage collection operation is triggered, and some invalid segments that meet the recovery space threshold (such as 64MB) are selected from the semi-invalid segment cache; If the capacity is less than the threshold of passive garbage collection, the remaining invalid segments are selected by traversing the segment metadata data; all data in the segment is read and traversed, skipping the positions that appear in the value storage invalidation offset set, and obtaining keys, Intra-segment offset and data length, put the intra-segment offset and length into the key storage invalidation offset set, search the key in the key storage, if the offset and length of the value in the found key-value pair are equal to the value If it is the same in the storage, it is judged that the data is valid; the valid data is rewritten to the end pointer of the reserved data area, and new index data is generated and written into the key storage; the starting offset of the segment after space recovery is placed in Enter the available segment cache and reset its metadata: clear the total amount of data in the segment, and clear the value storage invalidation offset set. Figure 5 is an example diagram of changes in the value storage invalidation offset set and the key storage invalidation offset set of a single segment in the segment manager after two garbage collection operations are performed, as shown in Figure 5 .

图6为写缓存数据写入位置判断过程图，如图6所示，两种垃圾回收策略使得写缓存的刷回操作也发生了变化，如果可用段缓存不为空，则选择一个段的起始偏移，将写缓存数据刷回该段；如果无可用段，在主数据区中获取写入位置，若该位置非主数据区尾部，写入该位置；否则触发被动垃圾回收操作，在完成空间回收后，选取一个段起始偏移，将数据刷回该位置。Figure 6 is a diagram of the process of judging the write location of write cache data. As shown in Figure 6, the two garbage collection strategies make the refresh operation of the write cache also changed. If the available segment cache is not empty, select the starting point of a segment If there is no available segment, get the write position in the main data area, and if the position is not the end of the main data area, write to this position; otherwise, trigger passive garbage collection operation, in After the space reclamation is completed, select a segment start offset, and flush the data back to this location.

SRKV的数据库实例恢复过程包括：The SRKV database instance recovery process includes:

键存储采用基于LSM-tree的键值存储，自带恢复功能；The key storage adopts LSM-tree-based key-value storage with built-in recovery function;

系统正常退出的值存储恢复：Value store recovery for normal system exits:

从段状态日志中读出所有段状态，保留各个段最新的状态；从键存储中获取预留数据区的头/尾指针。Read all the segment status from the segment status log, keep the latest status of each segment; get the head/tail pointer of the reserved data area from the key storage.

遇系统、设备故障导致系统崩溃的值存储恢复：In case of system or equipment failure, the value storage recovery of the system crash:

从段状态日志中读出所有段状态，保留各个段最新的状态；获取最后的段操作，如果是刷回写缓存操作未完成，读取该段内的键与键存储中的数据进行匹配，若找不到且该段为满失效段，认为数据未写回，丢失；若偏移不同，则表示值存储为新数据，未更新索引数据，进行索引数据更新操作；如果是被动垃圾回收操作未完成，获取所有候选块，重新进行GC操作，从键存储中获取预留数据区的头指针，将有效数据写入该位置。Read all the segment status from the segment status log, and keep the latest status of each segment; get the last segment operation, if the write cache operation is not completed, read the key in the segment and match the data in the key storage, If it cannot be found and the segment is a full invalid segment, it is considered that the data has not been written back and is lost; if the offset is different, it means that the value is stored as new data, and the index data has not been updated, and the index data update operation is performed; if it is a passive garbage collection operation Not completed, get all candidate blocks, re-run the GC operation, get the head pointer of the reserved data area from the key storage, and write valid data to this location.

本发明针对写密集环境，尤其是更新密集型应用下的键值分离存储的高效存储空间管理技术，并基于该技术实现一个新的键值分离存储SRKV。SRKV在现有的包含键存储和值存储两部分的键值分离存储架构上，在两者之间增加了段管理器，以实现从键存储中采集失效偏移，并辅助值存储中的GC操作。基于段状态，设计主动GC和被动GC两种空间回收策略。由于被动GC时重写的数据无法对齐段大小，因此将值存储划分为主数据区和预留数据区，主数据区存储新数据，预留数据区存储被动GC后重写的数据。The present invention aims at efficient storage space management technology of key-value separation storage in write-intensive environment, especially update-intensive applications, and implements a new key-value separation storage SRKV based on the technology. SRKV adds a segment manager between the existing key-value separation storage architecture that includes key storage and value storage to collect invalidation offsets from the key storage and assist GC in the value storage operate. Based on the segment state, two space recovery strategies, active GC and passive GC, are designed. Since the rewritten data during passive GC cannot be aligned with the segment size, the value storage is divided into the main data area and the reserved data area. The main data area stores new data, and the reserved data area stores the data rewritten after passive GC.

本发明通过构建一个以数据段为粒度回收和重用值存储空间的键值分离存储(Segment-conscious recycledkey-value separation store，SRKV)，在现有的包含键存储和值存储两部分的键值分离存储架构上，在键存储和值存储之间增加了段管理器，以实现从键存储中采集失效偏移，并辅助值存储中的GC操作；将值存储空间分为主数据区域和预留数据区域，以区分存储热数据和垃圾回收操作后的重写数据。本发明在触发密集垃圾回收操作的工作负载下，能够提升写性能并减少写放大。The present invention builds a key-value separation store (Segment-conscious recycled key-value separation store, SRKV) that recycles and reuses the value storage space at the granularity of the data segment. In terms of storage architecture, a segment manager is added between key storage and value storage to collect invalidation offsets from key storage and assist GC operations in value storage; the value storage space is divided into main data areas and reserved Data area to distinguish between storing hot data and rewriting data after garbage collection operations. The present invention can improve write performance and reduce write amplification under the workload that triggers intensive garbage collection operations.

以上所述仅是本发明的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明技术原理的前提下，还可以做出若干改进和变形，这些改进和变形也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention, it should be pointed out that for those of ordinary skill in the art, without departing from the technical principle of the present invention, some improvements and modifications can also be made. It should also be regarded as the protection scope of the present invention.

Claims

1. The SSD-based key value separation storage method supporting efficient storage space management is characterized by comprising the following steps: dividing a value storage space into equal-length segments, constructing a segment manager to manage failure and valid states of all data segments, and establishing a value storage failure offset set and a key storage failure offset set for each segment, wherein the value storage space has three types of segments which are respectively: full failure section, partial failure section and effective section, wherein the full failure section represents that all data in the section are invalid data; the partial failure section indicates that partial data in the section is invalid data; the effective section indicates that all data in the section are effective data;

performing available segment cache and semi-invalidation segment cache, wherein the available segment cache is used for caching the initial offset of the full invalidation segment and the initial offset of the segment of the reclaimed space after passive garbage reclamation, the semi-invalidation segment cache is used for caching the partial invalidation segment of which the quantity of invalidation data accounts for half or more of the total quantity of data in the segment, and the value storage offset valid bitmap is used for marking the validity of the data in the data segment in the value storage; the key storage offset valid bitmap is used for marking the position of the data segment originally marked as valid data after passive garbage collection, and judging whether the position is a recovered position or not when the invalid offset discarded in the key storage is collected;

wherein,,

according to the segment state and the value storage space use state, the method is divided into active garbage collection and passive garbage collection;

the active garbage collection includes: when the data in the cache is brushed into the SSD, if the full failure section exists, the data in the cache is written into the full failure section;

the passive waste recycling includes: when the value storage space is used up, triggering space recovery operation, selecting batch data segments, judging the validity of data by matching keys and offset in the segments with key value pairs in the key storage, discarding invalid data, rewriting valid data into the value storage, writing the key of the rewritten data and new offset into the key storage, and putting the initial offset of the recovered segments into an available segment cache;

dividing a value storage space into a main data area and a reserved data area, and rewriting effective data generated in the passive garbage collection into the reserved data area; the main data area is used for storing new data; the reserved data area is used for storing effective data generated in passive garbage collection.

2. The SSD-based key value split storage method supporting efficient storage space management of claim 1, wherein a segment size is aligned with a page size of a file system and is larger than a single key value pair size.

3. The SSD-based key value separation storage method supporting efficient storage space management of claim 1, wherein for failure offsets, offsets of failure data in the value store are continuously collected from compression operations of the key store.

4. The SSD-based key value separation storage method supporting efficient storage space management of claim 1, wherein the main data area is larger than the reserved data area.

5. The SSD-based key value separation storage method supporting efficient storage space management of claim 1, wherein validity state metadata of the data segments in the value store is recorded in the segment manager so that the state of the segments can be acquired when the storage system is restored, the metadata in the segment manager being periodically persisted to the disk.

6. The SSD-based key value separation storage method supporting efficient storage space management of claim 1, wherein the segment state log records a full failure segment offset, a half failure segment offset, and a value storage failure offset set, the offset of the post segment and the key storage failure offset set each time passive garbage collection.

7. The SSD-based key value separation storage method supporting efficient storage space management of claim 6, wherein discarded invalid key value pairs are collected from compression operations in the key store, wherein offsets and lengths of invalid data in the value store are extracted therefrom, wherein a starting offset and an intra-segment offset of the segment of interest are obtained by the offsets and the segment granularity, and the segment granularity of the offset percentage, and wherein the intra-segment offsets and the lengths are combined into a value store invalidation offset set in which invalidation offset data is inserted into the segment.

8. The SSD-based key value separation storage method supporting efficient storage space management of claim 1, wherein when the data in the write cache is swished back to the value storage, the determining of the write location includes: if the total capacity of the middle section of the available section buffer is larger than the write buffer, selecting a candidate section, obtaining a starting offset, and brushing write buffer data back to the candidate section; if no segment is available, a writing position is obtained in the main data area, and if the writing position is not at the tail part of the main data area, the writing position is written; otherwise, triggering passive garbage collection operation, selecting a candidate segment after space collection is completed, obtaining initial offset, and brushing write cache data back into the candidate segment.