
CN115794819B - Data writing method and electronic equipment - Google Patents

Data writing method and electronic equipment

Info

Publication number
CN115794819B
Authority
CN
China
Prior art keywords
data
key
database
interface
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211457616.2A
Other languages
Chinese (zh)
Other versions
CN115794819A (en)
Inventor
石林灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd
Priority to CN202211457616.2A
Publication of CN115794819A
Application granted
Publication of CN115794819B
Legal status: Active
Anticipated expiration


Abstract


An embodiment of the present application discloses a data writing method and electronic device, which are applied to the first node of an etcd cluster, the first node being the master node of the etcd cluster, the method comprising: receiving a data write request including a first key-value pair; when conditions are met, writing the first key-value pair into memory by calling the data write interface of a database adaptation layer to call the data write interface of a database engine, the database adaptation layer being used to encapsulate the interface provided by the database engine into an interface identical to that of boltdb, the database engine being based on an LSM-Tree structure, and the database engine being the database engine used by the etcd storage layer. In an embodiment of the present application, a database engine based on an LSM-Tree structure can be encapsulated by the database adaptation layer, so that the first node can write data by calling the data write interface of the database adaptation layer, thereby improving the write performance of etcd.

Description

Data writing method and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data writing method and an electronic device.
Background
Etcd is a highly available distributed key-value (KV) store that can be used for micro-service registration and discovery, shared configuration, distributed lock or consistency guarantee, distributed data queues, distributed notification and coordination, cluster election, etc.
Currently, etcd's storage layer uses the boltdb database engine, and boltdb internally uses a B+ tree as the data structure for storing data. Due to the structural characteristics of the B+ tree, when a large amount of data is written to etcd, the B+ tree inside boltdb is subject to frequent adjustments (e.g., rebalancing, splitting, or merging of tree nodes), and these adjustments may reduce etcd's write performance.
Disclosure of Invention
The embodiment of the application discloses a data writing method and electronic equipment, which can improve the writing performance of etcd.
The first aspect discloses a data writing method, which may be applied to a first node of an etcd cluster, to a module (e.g., a chip) in the first node, or to a logic module or software capable of implementing all or part of the functions of the first node. The first node is a master node of the etcd cluster, and the description below takes application to the first node as an example. The data writing method may comprise: receiving a data write request, wherein the data write request comprises a first key-value pair; and, when a condition is met, calling a data write interface of a database adaptation layer, which in turn calls a data write interface of a database engine, to write the first key-value pair into memory. The database adaptation layer is used to encapsulate the interface provided by the database engine into the same interface as boltdb, the database engine is based on an LSM-Tree (log-structured merge tree) structure, and the database engine is the one used by the etcd storage layer.
In the embodiment of the application, the database engine based on the LSM-Tree structure can be packaged through the database adaptation layer, so that the externally provided storage interface is unchanged (namely the storage interface is the same as that provided by boltdb), and thus other modules in the etcd can not be influenced. However, when the data writing interface packaged by the database adaptation layer is called to write data, a database engine based on an LSM-Tree structure is actually used for data storage at the bottom layer. Since the LSM-Tree is a layered, ordered and disk-oriented data structure, random write operations can be converted into sequential write operations of the disk, and therefore, the first node can improve the write performance of etcd by writing data in the manner described above.
As a possible implementation manner, calling the data write interface of the database engine by calling the data write interface of the database adaptation layer to write the first key-value pair into memory comprises: calling the data write interface of the database adaptation layer and passing in the first key-value pair; within that interface, the database adaptation layer calling the data write interface of the database engine and passing the first key-value pair through; and, within the database engine's data write interface, the database engine writing the first key-value pair into memory.
In the embodiment of the application, when the first node writes data, it can call the data write interface of the database adaptation layer and pass in the key-value pair; the adaptation layer then calls the data write interface of the underlying database engine, and the database engine writes the first key-value pair into memory. Since the first node writes the data directly into memory, and memory writes are fast, the writing efficiency of etcd can be further improved.
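The call-through pattern described above can be sketched in Go. This is a minimal illustration, not the patent's actual implementation: `Engine`, `Adapter`, and `Tx` are hypothetical names, and the in-memory `mapEngine` stands in for a real LSM engine such as rocksdb; the point is that callers use a boltdb-style `Update`/`View` transactional API while the adaptation layer forwards to the engine's own interface.

```go
package main

import "fmt"

// Engine is the interface an LSM-based engine is assumed to expose.
// These names are illustrative, not the real rocksdb Go bindings.
type Engine interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, bool)
}

// mapEngine is a toy in-memory stand-in for the real engine.
type mapEngine struct{ m map[string][]byte }

func newMapEngine() *mapEngine { return &mapEngine{m: map[string][]byte{}} }

func (e *mapEngine) Put(k, v []byte) error {
	e.m[string(k)] = v
	return nil
}

func (e *mapEngine) Get(k []byte) ([]byte, bool) {
	v, ok := e.m[string(k)]
	return v, ok
}

// Adapter wraps the engine behind a boltdb-style transactional API, so
// callers written against boltdb's Update/View pattern need not change.
type Adapter struct{ eng Engine }

// Tx is handed to the caller's closure, mimicking boltdb's transaction object.
type Tx struct{ eng Engine }

func (tx *Tx) Put(key, value []byte) error   { return tx.eng.Put(key, value) }
func (tx *Tx) Get(key []byte) ([]byte, bool) { return tx.eng.Get(key) }

// Update forwards the write closure to the underlying engine.
func (a *Adapter) Update(fn func(tx *Tx) error) error { return fn(&Tx{eng: a.eng}) }

// View forwards a read-only closure the same way.
func (a *Adapter) View(fn func(tx *Tx) error) error { return fn(&Tx{eng: a.eng}) }

func main() {
	db := &Adapter{eng: newMapEngine()}
	_ = db.Update(func(tx *Tx) error { return tx.Put([]byte("hello"), []byte("world")) })
	_ = db.View(func(tx *Tx) error {
		v, _ := tx.Get([]byte("hello"))
		fmt.Println(string(v)) // world
		return nil
	})
}
```

Because only `Adapter` knows about `Engine`, swapping the storage backend leaves every caller of the boltdb-shaped API untouched, which is the design goal the patent describes.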
As one possible implementation, the database engine is rocksdb.
As one possible implementation, invoking the data write interface of the database engine by invoking the data write interface of the database adaptation layer to write the first key-value pair into memory includes invoking, through the database adaptation layer, the database engine's data write interface to write the first key-value pair into a mutable memory table in memory.
As a possible implementation manner, the method further comprises: converting the mutable memory table into an immutable memory table when the data volume of the mutable memory table is larger than a first threshold; and flushing the immutable memory table to a disk of the first node to generate an SSTable (sorted string table).
In the embodiment of the application, the first node invokes the data write interface of the database engine by invoking the data write interface of the database adaptation layer, and can thereby write the first key-value pair into the mutable memory table in memory. After the mutable memory table grows to a certain data volume, it can be converted into an immutable memory table, which can then be flushed to the disk of the first node to generate an SSTable. In this way, during the data writing process, sequential writing can be ensured without modifying data in previously written SSTables, so writing efficiency can be improved. In addition, since the keys and values in an SSTable may be arbitrary byte arrays, and the size of an SSTable may be configured according to the sizes of the keys and values, variable-length storage can be implemented.
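The freeze-and-flush step above can be sketched as follows. This is a toy model, not rocksdb's implementation: `memTable`, `sstEntry`, and the threshold value are illustrative. It shows the two properties the text relies on: an immutable table rejects further writes, and a flush emits keys in sorted order so the resulting SSTable can be scanned and merged sequentially.

```go
package main

import (
	"fmt"
	"sort"
)

// memTable is a mutable in-memory table; on reaching a size threshold it is
// frozen (made immutable) and flushed to disk as an SSTable.
type memTable struct {
	data      map[string]string
	immutable bool
}

const flushThreshold = 3 // the "first threshold" from the text, toy value

// sstEntry mimics one sorted key/value row of an SSTable on disk.
type sstEntry struct{ Key, Value string }

// put accepts a write only while the table is still mutable.
func (t *memTable) put(k, v string) bool {
	if t.immutable {
		return false
	}
	t.data[k] = v
	return true
}

// freeze converts the mutable table into an immutable one.
func (t *memTable) freeze() { t.immutable = true }

// flush writes the immutable table out as a sorted run (an SSTable):
// keys come out in order, enabling sequential disk writes and merges.
func (t *memTable) flush() []sstEntry {
	keys := make([]string, 0, len(t.data))
	for k := range t.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]sstEntry, 0, len(keys))
	for _, k := range keys {
		out = append(out, sstEntry{k, t.data[k]})
	}
	return out
}

func main() {
	t := &memTable{data: map[string]string{}}
	for _, kv := range [][2]string{{"b", "2"}, {"a", "1"}, {"c", "3"}} {
		t.put(kv[0], kv[1])
	}
	if len(t.data) >= flushThreshold {
		t.freeze()
		fmt.Println(t.flush()) // entries come out key-sorted: a, b, c
	}
}
```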
As one possible implementation manner, the disk comprises a plurality of SSTable layers, and the method further comprises merging part or all of the SSTables of a first layer into the next layer when the data volume or the number of SSTables of the first layer is larger than a second threshold.
In the embodiment of the application, the SSTables on the disk are divided into a plurality of layers, and each layer can comprise a plurality of SSTable files. When the data volume or the number of SSTables of a layer reaches a preset threshold, the first node can merge part or all of the SSTables of that layer into the next layer; such layered storage facilitates layered querying (i.e., reading) after data is written.
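The threshold-triggered merge can be sketched as below. This is a minimal sketch under stated assumptions: `level`, `compact`, and the threshold are illustrative names, tables are assumed to be appended in write order so later tables are newer, and real LSM compaction (key-range overlap, tombstones) is omitted.

```go
package main

import "fmt"

// level models one SSTable layer on disk; when the number of tables in a
// layer exceeds a threshold, its tables are merged into the next layer.
type level struct{ tables []map[string]string }

const levelThreshold = 2 // the "second threshold" from the text, toy value

// compact merges all tables of levels[i] into one table of levels[i+1] when
// the table count exceeds the threshold. Tables are assumed to be appended
// in write order, so later (newer) tables win on key conflicts.
func compact(levels []level, i int) {
	if len(levels[i].tables) <= levelThreshold || i+1 >= len(levels) {
		return
	}
	merged := map[string]string{}
	for _, t := range levels[i].tables { // oldest to newest
		for k, v := range t {
			merged[k] = v // newer value overwrites older
		}
	}
	levels[i].tables = nil
	levels[i+1].tables = append(levels[i+1].tables, merged)
}

func main() {
	levels := []level{
		{tables: []map[string]string{{"a": "1"}, {"a": "2", "b": "3"}, {"c": "4"}}},
		{},
	}
	compact(levels, 0)
	fmt.Println(len(levels[0].tables), levels[1].tables[0]["a"]) // 0 2
}
```

Merging whole tables downward is what keeps each layer's table count bounded, so a later read only has to consult a small number of sorted runs per layer.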
As one possible implementation, the method further comprises: receiving a data read request, wherein the data read request comprises a second key; and calling the database engine through the database adaptation layer to read the data corresponding to the second key.
As a possible implementation manner, calling the database engine through the database adaptation layer to read the data corresponding to the second key comprises: calling the database engine through the database adaptation layer to read the data corresponding to the second key from the mutable memory table; when the data is not found in the mutable memory table, reading it from the immutable memory table; when the data is not found in the immutable memory table, reading it from a block cache; and when the data is not found in the block cache, reading it from the SSTables on the disk.
In the embodiment of the application, since the data in the mutable memory table is the newest, the probability of reading the data from the mutable memory table is the highest, and the probability of reading it from the immutable memory table is the next highest; the data in the block cache is older, so the probability of reading from the block cache is lower; and the data in the SSTables on the disk is the oldest, so the probability of reading from the disk is the lowest. Therefore, when the first node reads data, it can read sequentially from the mutable memory table, the immutable memory table, the block cache, and the SSTables on the disk, which can greatly improve reading efficiency. In addition, the mutable memory table, the immutable memory table, and the block cache are all read from memory, and memory reads are faster, so reading efficiency can be further improved.
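The tiered read order above can be sketched as a simple first-hit-wins lookup. The tiers are modeled as plain maps for illustration only; `lookup` is a hypothetical name and real engines use sorted structures, bloom filters, and binary search rather than map probes.

```go
package main

import "fmt"

// lookup reads a key through the tiers in the order described above:
// mutable memtable, immutable memtable, block cache, then SSTables on disk.
// The first hit wins, so the newest copy of the data is always returned.
func lookup(key string, tiers []map[string]string) (value string, tier int, ok bool) {
	for i, t := range tiers {
		if v, found := t[key]; found {
			return v, i, true
		}
	}
	return "", -1, false
}

func main() {
	memTable := map[string]string{"a": "newest"}
	immutable := map[string]string{"a": "older", "b": "2"}
	blockCache := map[string]string{"c": "3"}
	ssTables := map[string]string{"d": "4"}
	tiers := []map[string]string{memTable, immutable, blockCache, ssTables}

	v, tier, _ := lookup("a", tiers)
	fmt.Println(v, tier) // newest 0: the mutable memtable shadows older copies
}
```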
As a possible implementation manner, the first node comprises a quota module and a key-value server; the method further comprises: performing a quota check through the quota module; performing speed-limit, authentication, and packet-size checks through the key-value server when the quota is not exceeded; and determining that the condition is met when the speed-limit, authentication, and packet-size checks all pass.
In the embodiment of the application, after the first node receives the data write request, quota, speed-limit, authentication, packet-size, and similar checks can be performed. The first node stops processing the request when a check fails and continues processing only when the checks pass. This saves the first node's processing resources, prevents an excessive number of data write requests from being accepted, avoids crashing the whole processing system, and improves the stability of the system.
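The check ordering above can be sketched as a short fail-fast pipeline. All names here (`node`, `writeRequest`, `checkWrite`) and the specific fields are illustrative assumptions, not etcd's actual request types; the point is only that the first failing check stops processing.

```go
package main

import (
	"errors"
	"fmt"
)

// writeRequest carries just enough fields to drive the checks described
// above; real etcd requests are richer.
type writeRequest struct {
	authed bool
	size   int // packet size in bytes
}

// node models the first node's quota accounting and limits.
type node struct {
	used, quota int64
	maxSize     int
}

// checkWrite mirrors the order above: quota first, then speed limit, auth,
// and packet-size checks; processing stops at the first failing check.
func (n *node) checkWrite(req writeRequest, inFlight, rateLimit int) error {
	if n.used >= n.quota {
		return errors.New("quota exceeded")
	}
	if inFlight > rateLimit {
		return errors.New("too many requests") // speed limit
	}
	if !req.authed {
		return errors.New("authentication failed")
	}
	if req.size > n.maxSize {
		return errors.New("request too large")
	}
	return nil // all checks passed: the write condition is met
}

func main() {
	n := &node{used: 10, quota: 100, maxSize: 1 << 20}
	fmt.Println(n.checkWrite(writeRequest{authed: true, size: 512}, 1, 100))
}
```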
A second aspect discloses an electronic device, which may be a first node of an etcd cluster. The electronic device may comprise a processor, a memory, and a communication interface, the communication interface being configured to receive information from and output information to electronic devices other than the electronic device, and the processor invoking a computer program stored in the memory to implement the data writing method provided in the first aspect and any one of its possible implementations.
A third aspect discloses a computer-readable storage medium having stored thereon a computer program or computer instructions which, when run, implement the data writing method as disclosed in the above aspects.
A fourth aspect discloses a chip comprising a processor for executing a program stored in a memory, which when executed causes the chip to perform the data writing method disclosed in the above aspects.
As a possible implementation, the memory is located off-chip.
A fifth aspect discloses a computer program product comprising computer program code which, when run, causes the data writing method disclosed in the above aspects to be performed.
It will be appreciated that the electronic device provided in the second aspect, the computer-readable storage medium provided in the third aspect, the chip provided in the fourth aspect, and the computer program product provided in the fifth aspect are all configured to perform the data writing method provided in the first aspect and any possible implementation manner thereof. Therefore, for the advantages they achieve, reference may be made to the advantages of the corresponding method, which are not repeated here.
Drawings
The drawings used in the following description are presented to more clearly illustrate the technical solutions of the embodiments of the present application. It is apparent that the drawings described below show only some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of an etcd architecture according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a scenario of etcd writing data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a free list scenario disclosed in an embodiment of the present application;
FIG. 4 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another etcd architecture disclosed in an embodiment of the present application;
FIG. 6 is a flow chart of a method for writing data according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another scenario of etcd writing data disclosed in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiment of the application discloses a data writing method and electronic equipment (such as a first node), which can improve the writing performance of etcd. The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
In order to better understand the embodiments of the present application, the related art of the embodiments of the present application will be described first.
Etcd is a highly available distributed key-value (KV) store that can be used for micro-service registration and discovery, shared configuration, distributed locks or consistency guarantees, distributed data queues, distributed notification and coordination, cluster election, etc. Thus, etcd finds wide application in various distributed scenarios; for example, etcd may serve as the core store of Kubernetes (i.e., k8s) and may be used to save network configuration and state information of objects in k8s clusters, and so on.
Kubernetes is an open-source platform for automated deployment, scaling, and operation of container clusters. It provides functions such as disaster recovery, horizontal scaling (elastic scaling), load balancing, version rollback, and storage orchestration, and is widely used in the construction and management of computer clusters.
Among them, the implementation of etcd mainly depends on the Raft (replicated and fault-tolerant) algorithm, the multi-version concurrency control (MVCC) mechanism, boltdb, etc.
Raft is a consensus algorithm that is easy to understand and offers strong safety. Raft provides a general method for deploying replicated finite state machines across a cluster of computers and ensures that the nodes within a cluster remain consistent across state transitions.
MVCC is a concurrency-control technique commonly used by database management systems. MVCC is intended to address the problem that, under read-write locks, long-running read operations can starve write operations. Under the MVCC mechanism, each transaction reads a historical snapshot of a data item, depending on the isolation level implemented. Moreover, a write operation does not overwrite an existing data item but creates a new version, which does not become visible until the operation is committed. Furthermore, snapshot isolation lets a transaction see the data state as of its start.
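The create-a-new-version-instead-of-overwriting behavior can be sketched as below. This is a minimal illustration of the MVCC idea, not etcd's actual store: `versionedStore`, `Put`, and `GetAt` are hypothetical names, and a "snapshot" is modeled simply as the revision counter at the moment the snapshot is taken.

```go
package main

import "fmt"

// versionedStore keeps every committed version of a key instead of
// overwriting in place, as MVCC does.
type versionedStore struct {
	rev  int64
	data map[string][]version
}

type version struct {
	rev   int64
	value string
}

func newVersionedStore() *versionedStore {
	return &versionedStore{data: map[string][]version{}}
}

// Put commits a new version of key; it never overwrites an existing one.
func (s *versionedStore) Put(key, value string) int64 {
	s.rev++
	s.data[key] = append(s.data[key], version{s.rev, value})
	return s.rev
}

// GetAt returns the value visible to a snapshot taken at revision snap:
// the newest version whose revision is <= snap.
func (s *versionedStore) GetAt(key string, snap int64) (string, bool) {
	vs := s.data[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].rev <= snap {
			return vs[i].value, true
		}
	}
	return "", false
}

func main() {
	s := newVersionedStore()
	s.Put("hello", "v1") // rev 1
	snap := s.rev        // snapshot taken here
	s.Put("hello", "v2") // rev 2, invisible to the earlier snapshot
	v, _ := s.GetAt("hello", snap)
	fmt.Println(v) // v1
}
```

Because older versions are retained, readers holding a snapshot never block writers, which is exactly the starvation problem MVCC is designed to avoid.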
Boltdb is a KV database engine implemented in golang (i.e., the Go language). boltdb provides only the most basic storage functions: it does not support network connections or complex structured query language (SQL) queries, stores all data of a single database in a single file, and reads and writes that data file through an application programming interface (API) to achieve data persistence. It should be noted that boltdb's internal implementation adopts a B+ Tree structure, where B stands for balance; the B+ Tree is a balanced tree.
Referring to fig. 1, fig. 1 is a schematic diagram of an etcd architecture according to an embodiment of the present application. As shown in fig. 1, the overall architecture of etcd mainly includes 5 layers: the client (Client) layer, the API network layer (i.e., the API layer), the Raft layer, the functional logic layer (i.e., the logic layer), and the storage layer.
The Client layer may include client libraries (e.g., the clientv2 and clientv3 API client libraries) and tools such as etcdctl. The client libraries provide a simple, easy-to-use API and support load balancing, automatic failover among nodes, and the like.
The API layer mainly comprises the communication protocol used by clients to access server nodes and the communication protocol between server nodes. A client accessing a server node may use the gRPC (Google remote procedure call) API and the V2/V3 versions of the HTTP (hypertext transfer protocol) API. Communication between server nodes can adopt Raft HTTP, the HTTP protocol used when functions such as data replication and leader election are implemented between server nodes through the Raft algorithm.
The Raft layer mainly comprises core modules such as leader election (LeaderElection), log replication (LogReplication), read index (ReadIndex), membership (Membership), and learner (Learner), which can ensure data consistency among server nodes, improve service availability, and so on.
The logic layer mainly comprises functional modules such as the key-value server (KVServer) module, the authentication (Auth) module, the quota (Quota) module, the apply module, the lease (Lease) module, the compression (Compactor) module, the maintenance (Maintenance) module, and the tree index (treeIndex) module, which can be used for operations such as quota checking, speed limiting, and authentication, and can implement the storage and management of data.
The storage layer mainly comprises a write-ahead log (WAL) module, a snapshot (Snapshot) module, a boltdb module, and the like. The WAL module ensures durable storage of data: in etcd, all data modifications are written to the WAL log on disk before being committed. The boltdb module may be used to save cluster metadata and user-written data. The Snapshot module can be used to save storage space; in etcd, a snapshot can be taken every 10,000 records by default.
It should be noted that etcd may include an MVCC module, and the MVCC module may include the treeIndex module and the boltdb module described above. The MVCC mechanism may be implemented by the treeIndex module and the boltdb module. The treeIndex module is implemented based on an in-memory B-tree library; it stores keys and related version-number information (including historical version numbers), and the version-number information can include attributes such as the global version number corresponding to the key and the number of modifications. The value corresponding to a key is stored in the boltdb module, so the memory requirement is relatively low. It should be appreciated that modification, writing, and deletion of key-value pairs each generate a new version number (revision); accordingly, the version number is made globally monotonically increasing.
Specifically, the procedure of writing data to etcd is described below. Referring to fig. 2, fig. 2 is a schematic diagram of a scenario of etcd writing data according to an embodiment of the present application.
As shown in fig. 2, when etcd writes data (i.e., when a write transaction is executed), the index may first be obtained and updated (i.e., the revision) according to the written key; if the key does not exist, the revision is obtained by incrementing the current largest currentRevision (i.e., the current version number). Accordingly, a key may correspond to one or more version numbers. For example, the key for writing data in fig. 2 is "hello", and the current version number is "revision{2,0}". It should be appreciated that the version number may include two values: the first may be a globally incremented major version number, and the second may be a sub-version number within a transaction, incremented within one transaction.
Then, etcd may store in boltdb a structure whose key is the version-number information and whose value is the original key-value (i.e., the key-value carried by the write request). For example, a structure with key "revision{2,0}" and value equal to the original key-value may be written to boltdb. Meanwhile, when data is stored in boltdb, etcd may also write the data into a buffer so that a subsequent read request can obtain the latest data.
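The revision-keyed write path described above can be sketched as follows. This is a simplified model, not etcd's actual types: `mvccStore`, `treeIndex`, `store`, and `buffer` are illustrative names for the index (key to revisions), the backend (revision to original key-value), and the read buffer.

```go
package main

import "fmt"

// revision mirrors etcd's two-part version number: a global main revision
// and a sub revision within one transaction.
type revision struct{ Main, Sub int64 }

type kv struct{ Key, Value string }

// mvccStore sketches the write path described above: the index maps a user
// key to its revision history, while the store itself is keyed by revision.
type mvccStore struct {
	currentRev int64
	treeIndex  map[string][]revision // key -> revision history
	store      map[revision]kv       // revision -> original key-value
	buffer     map[string]kv         // write buffer so reads see latest data
}

func newMVCCStore() *mvccStore {
	return &mvccStore{
		treeIndex: map[string][]revision{},
		store:     map[revision]kv{},
		buffer:    map[string]kv{},
	}
}

// Write allocates the next revision for key, stores {revision -> key-value}
// in the backend, and also fills the buffer for subsequent reads.
func (s *mvccStore) Write(key, value string) revision {
	s.currentRev++ // increment from the current largest revision
	rev := revision{Main: s.currentRev, Sub: 0}
	s.treeIndex[key] = append(s.treeIndex[key], rev)
	s.store[rev] = kv{key, value}
	s.buffer[key] = kv{key, value}
	return rev
}

func main() {
	s := newMVCCStore()
	r := s.Write("hello", "world")
	fmt.Printf("revision{%d,%d} -> %v\n", r.Main, r.Sub, s.store[r])
}
```

Keying the backend by revision rather than by user key is what lets every historical version of a key coexist, while the buffer keeps the newest value cheaply reachable.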
It should be noted that, to improve writing performance, etcd does not commit the transaction during the above writing process, so the data is only updated in the in-memory data structures managed by boltdb. Thereafter, to persist the data to disk, transactions can be committed asynchronously by a backend goroutine. Typically, this asynchronous mechanism commits batched transactions once every 100 ms by default.
It can be understood that when etcd reads data (i.e., when a read transaction is executed), the read request submitted by the client queries the value by key, but data in boltdb must be queried by version number. Therefore etcd first obtains the version number of the key from the treeIndex module and then searches the buffer for the data according to the version number; if the data is found (i.e., a hit), it is returned directly, and if not, etcd queries the related data from the boltdb module according to the version number.
It should be appreciated that boltdb is a compact database implementation that uses memory mapping (mmap) to map the pages of the disk (i.e., the physical pages of the disk) to pages of memory, enabling zero-copy access to data, indexed by the B+ tree. boltdb's write-transaction implementation is ingenious: concurrency control is achieved using meta copy (metadata copy) and freelist (i.e., free list) mechanisms, and page management is done through copy-on-write (COW). Through copy-on-write, boltdb can achieve lock-free read concurrency but not lock-free write concurrency, so boltdb's random-write performance is poor. Since boltdb internally uses the B+ tree as the data structure for storing data, the leaf nodes store the actual key-value data. The basic unit of data storage is one page, whose size defaults to 4 kilobytes (KB). When data is deleted, boltdb does not directly return the freed disk space (i.e., the pages corresponding to the deleted data) to the system but keeps those pages around, forming a pool of released pages for subsequent use; this page pool is generally called the freelist. Fig. 3 shows 10 consecutive pages numbered 42-51: pages 43, 45, 46, and 50 are in use (i.e., non-free), while pages 42, 44, 47, 48, 49, and 51 are free for subsequent use. As can be seen, these free pages are discrete rather than contiguous, so during subsequent page use, sequential writing is not possible, resulting in lower writing efficiency. For example, when 4 consecutive pages are needed to store data, it may be impossible to find 4 contiguous free pages, and only 4 discrete pages, such as pages 42, 44, 47, and 48, can be written.
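The freelist fragmentation problem can be sketched with the exact page numbers from fig. 3. `allocate` is a hypothetical helper, not boltdb's real freelist code; it first looks for a contiguous run of free page IDs and otherwise falls back to discrete pages, which is what forces random rather than sequential writes.

```go
package main

import (
	"fmt"
	"sort"
)

// allocate looks for n contiguous page IDs in the freelist; if none exist,
// it falls back to n discrete pages, losing sequential-write locality.
func allocate(free map[int]bool, n int) (pages []int, contiguous bool) {
	ids := make([]int, 0, len(free))
	for id := range free {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	// First try to find a contiguous run of n pages.
	for i := 0; i+n <= len(ids); i++ {
		if ids[i+n-1]-ids[i] == n-1 {
			return ids[i : i+n], true
		}
	}
	if len(ids) < n {
		return nil, false // not enough free pages at all
	}
	return ids[:n], false // discrete fallback
}

func main() {
	// Free pages from the fig. 3 example: 42, 44, 47, 48, 49, 51.
	free := map[int]bool{42: true, 44: true, 47: true, 48: true, 49: true, 51: true}
	p, ok := allocate(free, 4)
	fmt.Println(p, ok) // [42 44 47 48] false (no 4 contiguous free pages exist)
}
```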
In addition, due to the structural characteristics of the B+ tree, when a large amount of data is written, the B+ tree structure inside boltdb is frequently adjusted (e.g., rebalancing, splitting, or merging of tree nodes), further causing a significant reduction in writing performance. Moreover, for data updates, the B+ tree modifies the corresponding value in place at the original data, which increases the probability of random writes. Therefore, for write-intensive service scenarios (i.e., scenarios where data must be written frequently), boltdb cannot meet the writing requirements: writing efficiency is low and writes may even block, which has a large impact on the service.
To solve the above problem, in the embodiment of the present application, while keeping the overall etcd architecture unchanged, the MVCC module can be functionally extended and a boltdb adaptation layer (i.e., a database adaptation layer) added to introduce rocksdb. The functions of rocksdb can be encapsulated through the boltdb adaptation layer, so that the externally provided KV storage interface is unchanged while rocksdb is actually used for data storage at the bottom layer. rocksdb uses the LSM-Tree (log-structured merge tree) as its data structure for storing data, which can convert random write operations into sequential disk writes, thereby improving the writing efficiency of etcd.
For a better understanding of the embodiments of the present application, a system architecture used in the embodiments of the present application is described below.
Referring to fig. 4, fig. 4 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 4, the system architecture may include an etcd cluster 401, a network 402, a first electronic device 403, and a second electronic device 404. The etcd cluster 401 may include multiple computing nodes (i.e., etcd nodes), such as node a 4011, node b 4012, node c 4013, and node N 4014 shown in fig. 4. It should be appreciated that, to improve the availability and fault tolerance of etcd clusters, N is typically an integer greater than or equal to 3.
Note that each node in the etcd cluster 401 may be an electronic device having data processing, data transceiving, and data storage capabilities, for example, a server such as a blade server, a high-density server, or a rack server.
The etcd cluster 401 (i.e., node a-node N), the first electronic device 403, and the second electronic device 404 may each be connected to a network 402, enabling communication with each other through the network 402. The N nodes in the etcd cluster 401 may also communicate over a network.
It should be appreciated that the Raft protocol (i.e., algorithm) defines the node states in the etcd cluster, which mainly include three states: follower, candidate (i.e., competitor), and leader (i.e., cluster leader). At any one time, each node is in exactly one of these states. At etcd start-up, a node's state is typically follower by default, in which it synchronizes the log received from the leader. The leader node is produced by node election, is unique, has the privilege of distributing the log to be synchronized, and needs to broadcast a heartbeat to the follower nodes at regular intervals to maintain its leader identity. Candidate is the state a node enters while a leader is being elected; a candidate may initiate a leader election. It should be appreciated that the leader node may also be referred to as the master node.
It should be noted that a data read request can be served by any node (such as node a 4011 or node b 4012), because the data stored on each node is strongly consistent, whereas a data write request must be processed by the leader node. Thus, if the node contacted by the etcd client initiating the data write request is a follower node, the request needs to be forwarded to the leader node. When the leader node receives the data write request, it persists the request as a proposal to the WAL log and broadcasts it to the follower nodes. Then, once the proposal has been persisted by a majority of the cluster's nodes, the content of the proposal can be persisted (i.e., applied).
In the embodiment of the present application, the electronic devices such as the first electronic device 403 and the second electronic device 404 may send data operation instructions, for example, instructions such as data addition, data deletion, data modification, data query, etc., to the etcd cluster 401. Accordingly, the node in the etcd cluster 401 (such as the node a 4011 shown in fig. 4) may receive the data operation instruction from the electronic device, and then may perform corresponding processing, and return the processing result to the electronic device. It should be understood that the electronic devices such as the first electronic device 403 and the second electronic device 404 may include an etcd client, and the electronic devices such as the first electronic device 403 and the second electronic device 404 may send a data operation instruction to the etcd cluster 401 through the etcd client.
In some embodiments, the first electronic device 403 and the second electronic device 404 may be a mobile phone, a tablet computer, a notebook computer, a smart car, a smart wearable device, a server, etc., which are not limited herein.
In the embodiment of the present application, after the etcd cluster 401 receives the data writing request, the data writing method provided in the embodiment of the present application may be executed, so that the data writing performance may be greatly improved, and specifically, see the description in the following method embodiment.
It should be noted that the system architecture shown in fig. 4 is only exemplary, and is not limited to the configuration thereof. In other embodiments of the application, the system architecture shown in fig. 4 may include more or fewer devices or modules than illustrated, and is not limited to only including the etcd cluster 401, network 402, first electronic device 403, and second electronic device 404 shown in fig. 4.
Referring to fig. 5, fig. 5 is a schematic diagram of an etcd architecture according to an embodiment of the present application. As shown in fig. 5, while keeping the overall structure of the original etcd unchanged, the boltdb module of the original MVCC module is replaced by a rocksdb module. In order not to affect other modules, a boltdb adaptation layer (i.e., an adaptation module) is further added. Through the boltdb adaptation layer, rocksdb may be packaged to look like boltdb, so that the API interface provided externally is unchanged (i.e., the same as the interface provided externally by boltdb) while the underlying storage is performed through rocksdb.
Here, rocksdb is a database engine that provides key-value storage and read-write functions. Internally, rocksdb adopts an LSM-Tree structure, which is a layered, ordered, disk-oriented data structure. The core idea of the LSM-Tree is to fully exploit the fact that sequential disk writes are far faster than random writes, and to design and optimize around this characteristic to improve write performance: all writes to this structure are performed as appends, with no in-place deletion or modification.
It should be appreciated that there are different object concept definitions in boltdb and rocksdb, so that in interface packaging, an associated logical concept mapping is required to map boltdb logical concepts to rocksdb logical concepts. For example, db (database) objects in boltdb may be mapped to db objects in rocksdb, bucket in boltdb may be mapped to columnfamily (column family) in rocksdb, and so on. Specific mappings of boltdb logical concepts and the logical concepts of rocksdb are shown in table 1 below:
TABLE 1

boltdb logical concept     rocksdb logical concept
DB                         DB
Bucket                     ColumnFamily
Cursor                     Iterator
Transaction                Transaction
Snapshot                   Checkpoint
Put / Get / Delete         Put / Get / Delete
As can be seen from table 1 above, the logical concepts DB, Bucket, Cursor, Transaction, Snapshot, and the like in boltdb can be mapped to the logical concepts DB, ColumnFamily, Iterator, Transaction, Checkpoint, and the like in rocksdb, respectively. Also, the Put, Get, and Delete operations in boltdb may be mapped to the Put, Get, and Delete operations in rocksdb. It should be appreciated that rocksdb's transaction support is more complete than boltdb's; thus, a user may choose whether to use transactions based on the particular usage scenario.
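The correspondence described above can be expressed as a simple lookup; the dictionary below is an illustrative sketch of the mapping, not a data structure from either library:

```python
# Illustrative mapping of boltdb logical concepts to rocksdb
# logical concepts, following the description of Table 1.
BOLTDB_TO_ROCKSDB = {
    "DB": "DB",
    "Bucket": "ColumnFamily",
    "Cursor": "Iterator",
    "Transaction": "Transaction",
    "Snapshot": "Checkpoint",
    "Put": "Put",
    "Get": "Get",
    "Delete": "Delete",
}

def map_concept(boltdb_name):
    """Return the rocksdb concept corresponding to a boltdb concept."""
    return BOLTDB_TO_ROCKSDB[boltdb_name]

assert map_concept("Bucket") == "ColumnFamily"
assert map_concept("Snapshot") == "Checkpoint"
```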
Based on the mapping of the related concepts in boltdb and rocksdb, an interface adaptation layer (i.e., the boltdb adaptation layer) that is fully compatible with the boltdb API can be developed based on the rocksdb API, for example, API interfaces such as Put (i.e., write; data may be added or updated), Get (i.e., read or query), Delete, and Snapshot.
It should be understood that, in order to keep the externally provided API interface unchanged, the boltdb adaptation layer needs to further encapsulate the interface provided externally by rocksdb, so that the API interface provided externally is the same as the interface provided externally by boltdb. In this way, data can be read, written, and deleted following the original boltdb API calling conventions without changing other modules. For example, suppose boltdb's externally provided read interface is Get (parameter 1: key), while rocksdb's externally provided read interface is Get (parameter 1: key, parameter 2: column family). The Get interface of boltdb may include only parameter 1, i.e., the key of the data to be read, whereas the Get interface of rocksdb may include two parameters: parameter 1 is the key of the data to be read, and parameter 2 is the related column family. Therefore, in order to keep the externally provided read interface unchanged, the boltdb adaptation layer needs to further encapsulate the Get interface provided by rocksdb so that the external interface is Get (parameter 1: key). In one possible implementation, when the Get (parameter 1: key) interface is invoked externally, the boltdb adaptation layer may convert the call into a call to the Get interface of rocksdb; since both include the parameter key, parameter 1 may be passed through directly, and since the Get interface of boltdb does not include a counterpart of parameter 2, the boltdb adaptation layer may fill in parameter 2 automatically, e.g., with the default column family. Thus, by wrapping the rocksdb API, etcd may invoke rocksdb through the boltdb adaptation layer in the same way it would invoke boltdb. Similarly, similar packaging may be performed for the Put, Delete, and other interfaces. It should be understood that the above encapsulation of the Get interface is merely exemplary and not limiting.
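The Get wrapping described above can be sketched as follows; both classes are hypothetical stand-ins (neither is the real boltdb or rocksdb API), showing only how the adaptation layer fills in the missing column-family parameter with a default:

```python
class FakeRocksdb:
    """Stand-in for rocksdb: Get takes a key AND a column family."""
    def __init__(self):
        self._cfs = {"default": {}}
    def put(self, key, value, cf="default"):
        self._cfs.setdefault(cf, {})[key] = value
    def get(self, key, cf):
        return self._cfs.get(cf, {}).get(key)

class BoltdbAdapter:
    """Exposes boltdb's one-parameter Get(key); automatically fills
    in the column-family parameter that boltdb callers never supply."""
    DEFAULT_CF = "default"
    def __init__(self, engine):
        self._engine = engine
    def get(self, key):
        # parameter 1 (key) is passed through; parameter 2 is defaulted
        return self._engine.get(key, cf=self.DEFAULT_CF)
    def put(self, key, value):
        self._engine.put(key, value, cf=self.DEFAULT_CF)

db = BoltdbAdapter(FakeRocksdb())
db.put(b"k1", b"v1")
assert db.get(b"k1") == b"v1"
assert db.get(b"missing") is None
```

Callers keep the boltdb-style one-argument Get, while the engine underneath receives the two-argument call.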
It should be noted that, in other embodiments of the present application, some data structures may be redefined when packaging the API interface of rocksdb, or the externally provided API interface may be changed.
For example, DB and Tx may be defined, and their data structure definitions are specifically as follows:
Based on the above-defined data structures, the relevant interfaces can be defined, as shown in table 2 below:
TABLE 2
It can be understood that, through the above concept mapping and interface definition, when etcd calls the related interface, data can be written, read, and so on through rocksdb, so that data writing efficiency can be improved, thereby meeting the write requirements of write-intensive business scenarios.
It should be appreciated that in other embodiments of the present application, the boltdb adaptation layer is not limited to rocksdb and may adapt other database engines (such as LevelDB), so that different database engines may be used flexibly for different application scenarios, thereby improving the data writing and reading performance of etcd. In one possible implementation, an abstract interface not bound to any storage engine may be added in the MVCC module of etcd to implement basic operations such as KV-store insertion, deletion, query, and snapshot, making it convenient for a user to select a suitable database engine according to the usage scenario.
It should be noted that the etcd architecture shown in fig. 5 is only exemplary, and is not limited to this configuration. In other embodiments of the application, more or fewer components than shown in FIG. 5 may be included, or certain components may be combined, or certain components may be split, or different arrangements of components may be included. The components shown in fig. 5 may be implemented in hardware, software, or a combination of software and hardware.
Based on the above system architecture, please refer to fig. 6, fig. 6 is a flow chart of a data writing method according to an embodiment of the present application. As shown in fig. 6, the data writing method may include, but is not limited to, the following steps:
601. the first node receives a data write request, the data write request including a first key-value pair.
When the first electronic device or the second electronic device needs to write data (including inserting data or updating data) into the etcd cluster, it may send a data write request to the first node. Accordingly, the first node may receive the data write request. The first node may be the leader node in the database cluster, and the data write request may include a first key-value pair, which may include a key and a value.
It should be appreciated that in some embodiments, when the first electronic device or the second electronic device needs to write data into the etcd cluster, it may also send a data write request to other follower nodes in the etcd cluster, and after the follower node receives the data write request, the follower node may forward the data write request to the first node. Accordingly, the first node may receive a data write request from the follower node. It should also be appreciated that when the first electronic device or the second electronic device needs to write data into the etcd cluster, the etcd node may be selected by a load balancing algorithm, and then a gRPC call for a data write request (i.e., a PUT request) may be initiated to the etcd node. Accordingly, the etcd node may intercept the data write request through a gRPC interceptor.
It should be noted that in some embodiments, when the first electronic device or the second electronic device needs to delete data from the etcd cluster, it may send a data write request for deleting data to the first node. This is because rocksdb also handles deletion by appending a record to memory, the record being used to mark the related data that needs to be deleted. Accordingly, a data write request for deleting data may include only a key and no value.
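The append-only deletion described above can be illustrated with a minimal sketch; the `Memtable` class and its tombstone marker are hypothetical, showing only that a delete is recorded as an appended marker rather than an in-place removal:

```python
class Memtable:
    """Append-only in-memory table: a delete appends a tombstone."""
    TOMBSTONE = object()

    def __init__(self):
        self._records = []              # (key, value) append-only log

    def put(self, key, value):
        self._records.append((key, value))

    def delete(self, key):              # no value: just a marker
        self._records.append((key, Memtable.TOMBSTONE))

    def get(self, key):
        for k, v in reversed(self._records):    # newest record wins
            if k == key:
                return None if v is Memtable.TOMBSTONE else v
        return None

m = Memtable()
m.put("a", 1)
m.delete("a")
assert m.get("a") is None
assert len(m._records) == 2   # the delete appended; nothing was removed
```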
602. When the condition is satisfied, the first node writes the first key-value pair into the memory through the database adaptation layer.
After the first node receives the data write request, a pre-check (including quota, speed limit, authentication, and the like) is needed. If the pre-check fails, the first node does not continue processing; if the pre-check passes, the first node may generate a corresponding write-proposal log entry based on the data write request and broadcast the log entry to all follower nodes. After a majority of the nodes of the etcd cluster have completed persisting the log entry, the first key-value pair may be written into memory through the database adaptation layer. It should be appreciated that satisfying the condition may refer to passing the quota, speed limit, authentication, and packet size checks, and completion of log-entry persistence by a majority of the cluster's nodes. In order to avoid data loss, the first node may also write the first key-value pair into the WAL log on disk when the condition is satisfied. It should further be understood that, before the first node writes the first key-value pair into memory through the database adaptation layer, the index may be obtained and updated according to the key of the first key-value pair; if the key does not exist, a revision is obtained by incrementing the current maximum revision (currentRevision). Then, the first node may store a structure composed of the versioned key and the value, as the first key-value pair's information, into memory.
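The revision handling mentioned above can be sketched as follows; this is a simplified illustration only (real etcd revision semantics are richer), where a key absent from the index is assigned a revision by incrementing the current maximum revision:

```python
def next_revision(current_revision, key_index, key):
    """Simplified sketch: if the key has no index entry, allocate a
    new revision by incrementing the current maximum revision."""
    if key in key_index:
        return key_index[key], current_revision
    new_rev = current_revision + 1
    key_index[key] = new_rev
    return new_rev, new_rev

idx = {}
rev, cur = next_revision(10, idx, "foo")      # new key: 10 -> 11
assert (rev, cur) == (11, 11)
rev2, cur2 = next_revision(cur, idx, "foo")   # existing key keeps its entry
assert rev2 == 11
```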
Specifically, after the first node receives the data write request, a Quota check may be performed by the Quota module, that is, checking whether the sum of the current etcd db size plus the key-value size of the data write request (i.e., the size of the first key-value pair) exceeds the quota (quota-backend-bytes). If the quota is exceeded, the current storage space is insufficient, and the first key-value pair cannot be written successfully. If the quota is not exceeded, the first node may proceed to the next step, that is, the speed limit, authentication (i.e., determining whether the data write request is legitimate by using the Auth module), and packet size checks through the KVServer module. If the speed limit, authentication, or packet size check does not pass, the first node may refuse to write the first key-value pair; if the speed limit, authentication, and packet size checks all pass, the first node may package the content of the data write request into a proposal message through the KVServer module and submit the proposal message to the Raft module (i.e., the Raft layer described above). Thereafter, the first node may generate a log entry corresponding to the proposal through the Raft module and may broadcast the log entry to the other follower nodes in the etcd cluster, where the log entry encapsulates the content of the proposal. After a majority of the nodes of the etcd cluster complete the persistence of the log entry, the state corresponding to the proposal may change to committed. Then, the first node may execute the proposal in the committed state through the Apply module.
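The quota check described above amounts to a simple size comparison; the sketch below is illustrative (the function name is hypothetical, and the 2 GiB figure mirrors etcd's default quota-backend-bytes):

```python
def quota_check(db_size, kv_size, quota_backend_bytes):
    """Reject the write if the current db size plus the incoming
    key-value size would exceed the backend quota."""
    return db_size + kv_size <= quota_backend_bytes

QUOTA = 2 * 1024**3   # etcd's default backend quota is 2 GiB

assert quota_check(QUOTA - 100, 80, QUOTA)        # fits: write proceeds
assert not quota_check(QUOTA - 100, 200, QUOTA)   # would exceed: rejected
```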
Specifically, when the condition is satisfied, the Apply module may first determine whether the proposal has already been executed. If so, no processing is required; if not, the Apply module may perform the operation of persisting the content of the proposal through the MVCC module of the first node. Here, the Apply module needs to call the boltdb-style interface (i.e., the interface provided externally by the adaptation layer) to perform the data write, that is, call the API provided by the database adaptation layer (e.g., the Put API) to write the first key-value pair into memory; in other words, it calls the data write interface of the database adaptation layer, which in turn calls the data write interface of rocksdb to write the first key-value pair into memory. It should be appreciated that the Apply module needs to pass in the first key-value pair when invoking the data write interface of the database adaptation layer, so that the database adaptation layer may then pass the first key-value pair into the data write interface of rocksdb when invoking it. Accordingly, it will be appreciated that the data write interface of the database adaptation layer invoked by the Apply module, and the parameters passed in, are functions and parameters matching the interface of boltdb.
The process by which the first node writes the first key-value pair into memory through the database adaptation layer will be described below with reference to fig. 7. As shown in fig. 7, the writing of data mainly includes two steps. The first step is to sequentially write the first key-value pair into the WAL log on disk. The second step is to write the first key-value pair into the active memtable (i.e., the active memory table) in memory, i.e., to write the first key-value pair into the active memtable through the database adaptation layer. Specifically, the first node may write the first key-value pair into the active memtable in memory by calling the data write interface provided by the database adaptation layer. It should be noted that, since the first node may return the write result to the client immediately after writing the first key-value pair into the active memtable, the writing efficiency is high.
It will be appreciated that, as data is written, the data stored in the active memtable will grow. When the amount of data in the active memtable reaches a preset threshold (i.e., is greater than the first threshold), the first node may freeze it in memory, turning it into an immutable memtable (i.e., an immutable memory table), and at the same time may create a new memtable as the new active memtable. After the active memtable is converted into an immutable memtable, the first node may write (flush) the immutable memtable in memory to the Level 1 layer of the SSTable (sorted string table) on a disk (such as a solid-state drive, a mechanical hard drive, etc.), generating a corresponding SSTable, so as to implement persistent storage.
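The freeze-and-flush behavior can be sketched as follows; the class name, the threshold value, and the counting of entries (rather than bytes) are all simplifications for illustration:

```python
FIRST_THRESHOLD = 4   # illustrative memtable size limit (entries)

class LsmStore:
    def __init__(self):
        self.active = {}        # active memtable
        self.immutables = []    # frozen memtables awaiting flush
        self.sstables = []      # "on-disk" sorted runs

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) > FIRST_THRESHOLD:
            self.immutables.append(self.active)   # freeze in memory
            self.active = {}                      # new active memtable

    def flush(self):
        for imm in self.immutables:               # persist as SSTable
            self.sstables.append(sorted(imm.items()))
        self.immutables.clear()

s = LsmStore()
for i in range(5):
    s.put(f"k{i}", i)
assert len(s.immutables) == 1 and len(s.active) == 0
s.flush()
assert len(s.sstables) == 1 and s.sstables[0][0] == ("k0", 0)
```

Note that the flushed run is sorted by key, matching the "sorted string table" property.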
It should be noted that the SSTables on the disk are divided into multiple levels, and each level may include multiple SSTable files; as shown in fig. 7, there are three levels (i.e., Level 1, Level 2, and Level 3). It should be appreciated that each level has 10 times the data capacity of the previous level, e.g., Level 1 is 10MB, Level 2 is 100MB, and Level 3 is 1000MB. Thus, when the data amount or the number of SSTables of a level reaches a preset threshold, the first node may merge part or all of the SSTables of that level into the next level (i.e., compaction). For example, in the case where the data amount or the number of SSTables of the first level is greater than the second threshold, the first node may merge part or all of the SSTables of the first level into the next level. It should be appreciated that the first level may be any level other than the last level (e.g., Level 1 or Level 2 described above).
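The level sizing and compaction rule can be sketched as follows; the capacities and threshold handling are illustrative simplifications, not rocksdb's actual compaction policy:

```python
# Each level holds 10x the previous level's capacity (values in MB).
LEVEL_CAPACITY = {1: 10, 2: 100, 3: 1000}

def maybe_compact(levels, level, second_threshold):
    """Merge a level's SSTables into the next level once its total
    data amount exceeds the threshold for that level."""
    if level + 1 in levels and sum(levels[level]) > second_threshold:
        levels[level + 1].extend(levels[level])   # merge down
        levels[level].clear()

# Each list holds SSTable sizes (MB) at that level.
levels = {1: [6, 5], 2: [40], 3: []}
maybe_compact(levels, 1, LEVEL_CAPACITY[1])   # 11 MB > 10 MB -> compact
assert levels[1] == [] and sorted(levels[2]) == [5, 6, 40]
maybe_compact(levels, 2, LEVEL_CAPACITY[2])   # 51 MB <= 100 MB -> no-op
assert sorted(levels[2]) == [5, 6, 40]
```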
It can be seen that during the rocksdb write process, the first node first writes the key-value pair to the active memtable; then, after the active memtable reaches a certain data size, it is converted into an immutable memtable, which can then be written to disk, generating an SSTable. In this way, during the data writing process, sequential writing can be guaranteed, and the data in previously generated SSTables is never modified.
It should be appreciated that an SSTable is a persistent, ordered, and immutable key-value storage structure whose keys and values can be arbitrary byte arrays. Moreover, the size of an SSTable can be configured according to the sizes of the keys and values, so variable-length storage can be realized.
The data reading process in the embodiment of the present application is described below with reference to fig. 7. As is clear from the data writing process above, the latest data is generally stored in the active memtable. Therefore, after receiving a data read request from a client, the corresponding version number may first be looked up according to the key (such as a second key) carried in the data read request. Then, when reading the data, the data may first be read from the active memtable according to the version number; if the data is not found in the active memtable, it may be read from the immutable memtable; and if the data is not found in the immutable memtable, it may be read from the block cache (i.e., BlockCache). If the data is not found in BlockCache, the data is stored on the disk, so it may be read from the SSTables on disk following a top-down rule: first from the SSTables in Level 1; if not found in Level 1, then from the SSTables in Level 2; and if not found in Level 2, then from the SSTables in Level 3. It should be understood that BlockCache may cache data from SSTables, and with a larger memory, more SSTable data may be cached, thereby improving the reading efficiency.
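The lookup order described above can be condensed into a short sketch; the data structures here are plain dictionaries standing in for the real memtables, block cache, and SSTable levels:

```python
def read(key, active, immutable, block_cache, levels):
    """Look up a key in the order described above: active memtable,
    immutable memtable, block cache, then SSTable levels top-down."""
    for source in (active, immutable, block_cache, *levels):
        if key in source:
            return source[key]
    return None

active    = {"a": 1}
immutable = {"b": 2}
cache     = {"c": 3}
levels    = [{"d": 4}, {"d": 40, "e": 5}]   # Level 1 shadows Level 2

assert read("a", active, immutable, cache, levels) == 1
assert read("d", active, immutable, cache, levels) == 4   # Level 1 wins
assert read("e", active, immutable, cache, levels) == 5
assert read("zz", active, immutable, cache, levels) is None
```

The top-down order is what makes newer data shadow older copies of the same key at deeper levels.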
In the above method flow, the first node can convert random write operations into sequential disk writes through the database adaptation layer, so that write efficiency can be greatly improved, especially for solid-state drives (SSD) and the like, which are sensitive to write amplification, and mechanical hard drives (HDD) and the like, which are sensitive to random writes.
It should be noted that, the related information (i.e., the same information or similar information) and the related description in the above different embodiments may refer to each other.
It should be understood that fig. 6 illustrates the above processing flow by taking the first node as the execution body, but the present application does not limit the execution body. For example, the execution body in fig. 6 may also be a chip, a system-on-a-chip, or a processor that supports the first node in implementing the method, or may be a logic module or software that can implement all or part of the functionality of the first node.
Based on the above system architecture, please refer to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 800 may include, among other things, a processor 801, a communication interface 802, and a memory 803. The processor 801, the communication interface 802, and the memory 803 may be connected to each other or to each other through a bus 804.
By way of example, memory 803 is used to store computer programs and data for electronic device 800, and memory 803 may include, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM), among others. The communication interface 802 is used to support communication by the electronic device 800, such as receiving or transmitting data.
By way of example, the processor 801 may be a central processing unit (central processing unit, CPU), complex programmable logic device, general purpose processor, digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, transistor logic device, hardware component, or any combination thereof. A processor may also be a combination that performs a computational function, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so forth.
In an embodiment, the electronic device 800 may be the first node described above, and the processor 801 may be configured to read the program stored in the memory 803 and perform the operations performed by the first node in the method embodiment shown in fig. 6; reference is made to the related description above, which is not repeated here. It should be appreciated that in one embodiment, the electronic device 800 may also be another follower node in the etcd cluster.
It should be noted that the electronic device 800 shown in fig. 8 is merely an implementation manner of the embodiment of the present application, and in practical application, the electronic device 800 may further include more or fewer components, which is not limited herein.
The embodiment of the application also discloses a computer readable storage medium, wherein the instructions are stored, and the instructions are executed to execute the method in the embodiment of the method.
The embodiment of the application also discloses a computer program product comprising instructions which, when executed, perform the method of the above method embodiment.
It will be apparent that the embodiments described above are only some, but not all, embodiments of the application. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application. The terms "first", "second", "third", and the like in the description, claims, and drawings are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprising", "including", and "having", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, article, or apparatus comprising a series of steps or elements may also include steps or elements not listed, or other steps or elements inherent to such a process, method, article, or apparatus. It should be understood that, in the threshold comparisons above, equality may be attached to either side of the comparison; for example, a comparison of being greater than a threshold may be changed to being greater than or equal to the threshold, and a comparison of being less than or equal to a threshold may be changed to being less than the threshold, which is not limited herein.
It is understood that only some, but not all, of the details relating to the application are shown in the accompanying drawings. It should be appreciated that some example embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
As used in this specification, the terms "component," "module," "system," "unit," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a unit may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or being distributed between two or more computers. Furthermore, these units may be implemented from a variety of computer-readable media having various data structures stored thereon. The units may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., second unit data from another unit interacting with a local system, distributed system, and/or across a network).
The foregoing embodiments are provided to illustrate the technical solutions of the present application in further detail and are not intended to limit its scope; any modifications, equivalents, improvements, and the like made based on the teachings of the application shall fall within the scope of protection of the application.

Claims (8)

1. A data writing method, characterized in that the method is applied to a first node of an etcd cluster, where the first node is a master node of the etcd cluster, and the method includes:
receiving a data writing request, wherein the data writing request comprises a first key value pair;
when a condition is satisfied, writing the first key-value pair into a variable memory table in a memory by calling a data write interface of a database adaptation layer to call a data write interface of a database engine, wherein the database adaptation layer is used to encapsulate the interface provided by the database engine into an interface identical to that of boltdb, the database engine is based on an LSM-Tree structure, and the database engine is the database engine used by the etcd storage layer;
converting the variable memory table into an immutable memory table when the data amount of the variable memory table is greater than a first threshold;
and writing the immutable memory table into a disk of the first node to generate a sorted string table (SSTable).
2. The method of claim 1, wherein invoking the data write interface of the database engine to write the first key-value pair into the variable memory table in memory by invoking the data write interface of the database adaptation layer comprises:
calling the data write interface of the database adaptation layer and passing the first key-value pair into it;
calling, through the database adaptation layer, the data write interface of the database engine and passing in the first key-value pair received by the data write interface of the database adaptation layer;
and writing, by the database engine, the first key-value pair passed into its data write interface into the variable memory table in the memory.
3. The method of claim 1, wherein the database engine is rocksdb.
5. The method of claim 1, wherein the disk comprises multiple levels of SSTables, the method further comprising:
merging part or all of the SSTables of a first level into the next level when the data amount or the number of SSTables of the first level is greater than a second threshold.
5. The method according to any one of claims 1-4, further comprising:
receiving a data read request, the data read request including a second key;
and calling the database engine to read the data corresponding to the second key through the database adaptation layer.
6. The method of claim 5, wherein invoking the database engine through the database adaptation layer to read the data corresponding to the second key comprises:
calling, through the database adaptation layer, the database engine to read the data corresponding to the second key from the variable memory table;
reading the data corresponding to the second key from the immutable memory table when the data is not read from the variable memory table;
reading the data corresponding to the second key from a block cache when the data is not read from the immutable memory table;
and reading the data corresponding to the second key from the SSTable on the disk when the data is not read from the block cache.
7. The method of any of claims 1-6, wherein the first node includes a quota module and a key-value server, the method further comprising:
performing a quota check by the quota module;
performing speed limit, authentication, and packet size checks through the key-value server when the quota is not exceeded;
and determining that the condition is satisfied when the speed limit, authentication, and packet size checks pass.
8. An electronic device comprising a processor, a memory, and a communication interface for receiving information from and outputting information to other electronic devices than the electronic device, the processor invoking a computer program stored in the memory to implement the method of any of claims 1-7.
CN202211457616.2A 2022-11-21 Data writing method and electronic equipment Active CN115794819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211457616.2A CN115794819B (en) 2022-11-21 Data writing method and electronic equipment

Publications (2)

Publication Number Publication Date
CN115794819A CN115794819A (en) 2023-03-14
CN115794819B true CN115794819B (en) 2025-09-16

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110515957A (en) * 2019-09-02 2019-11-29 深圳市网心科技有限公司 Method, system, device and readable storage medium for blockchain data storage
CN114415954A (en) * 2022-01-04 2022-04-29 烽火通信科技股份有限公司 Optimization method and device for Ceph object storage metadata processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant