
CN120524534B - A destruction-resistant storage system and method for extreme scenarios - Google Patents

A destruction-resistant storage system and method for extreme scenarios

Info

Publication number
CN120524534B (application CN202511033160.0A)
Authority
CN (China)
Prior art keywords
data, nodes, node, sub, super
Legal status
Active
Application number
CN202511033160.0A
Other languages
Chinese (zh)
Other versions
CN120524534A (en)
Inventor
丁春燕
纪佳
Current Assignee
Aerospace One System Jiangsu Information Technology Co ltd
Original Assignee
Aerospace One System Jiangsu Information Technology Co ltd
Application filed by Aerospace One System Jiangsu Information Technology Co., Ltd.
Priority to CN202511033160.0A
Publication of CN120524534A
Application granted
Publication of CN120524534B

Landscapes

  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


This invention discloses a destruction-resistant storage system and method for extreme scenarios. The system comprises a supernode cluster, a data node cluster, edge nodes, and a relay server; the supernode cluster, data node cluster, and edge nodes communicate and transfer data via gRPC and REST APIs. The invention achieves destruction resistance in extreme scenarios: the system continues to operate normally, with no significant performance degradation, even when half of the data nodes are destroyed.

Description

Destruction-resistant storage system and method for extreme scenarios
Technical Field
The invention relates to the technical field of data storage, and in particular to a destruction-resistant storage system and method for extreme scenarios.
Background
Remote disaster-recovery schemes in the traditional storage field generally synchronize dual campuses with a remote site, evolving on that basis into two-site three-center or three-site five-center deployments. Such schemes can handle ordinary natural-disaster scenarios but cannot resist a comprehensive malicious attack.
The defects of the prior art mainly comprise the following aspects:
(1) In disaster recovery, a centralized storage system with a small number of nodes cannot survive extreme destruction scenarios; if a large number of nodes is introduced instead, the impossibility result described by the CAP theorem means that management cost rises rapidly and performance drops sharply.
(2) In data security, the bottom layer of a traditional storage system contains no data-security mechanism; security is handled at the application, file-system, network, or physical layer, so it is difficult to prevent the storage owner (who is not the data owner) from tampering with or reading user data.
(3) In performance, traditional storage suffers long-tail response-latency problems caused by network jitter, resource contention, abrupt load changes, and hardware or software faults; meanwhile, the cross-site recovery indicators RTO and RPO depend heavily on the deployment environment and hardware cost and cannot meet the requirements of extreme scenarios.
Disclosure of Invention
To solve these problems, the invention provides a storage system and method that achieve destruction resistance in extreme scenarios. Exploiting the geographic dispersion and random distribution of massive numbers of nodes, the data nodes, edge nodes, super nodes, and relay servers jointly form a semi-decentralized system that continues to work normally, with no significant performance difference, even when half of the nodes are destroyed.
In order to achieve the above object, the present invention is realized by the following technical scheme:
The invention relates to a destruction-resistant storage system for extreme scenarios, comprising:
the super node cluster, which stores the partitioned metadata (split into sub-databases and sub-tables) across regions, manages data nodes, edge nodes, users, and keys, and runs a background synchronization service;
the data node cluster, which stores encrypted user data across regions, provides upload and download functions, and receives task scheduling from the super nodes;
the edge node, which runs on the user end, provides user interfaces, key management, and an S3 back-end service, and locally stores the user's metadata;
the relay server, which, when the super nodes, data nodes, and edge nodes cannot communicate directly, establishes P2P communication among them according to a P2P connection request from an edge node;
In a further refinement of the invention, the super nodes, data nodes, and edge nodes communicate and transfer data directly via gRPC and REST APIs.
In a further refinement, the metadata is stored on the super nodes in the form of a metadata database, which is divided into a node sub-library, a user sub-library, a global metadata sub-library, and a user metadata sub-library, wherein:
The node sub-library is used for storing ids, public key lists and states of super nodes and data nodes;
the user sub-library is used for storing id, key ciphertext and state of the user;
The global metadata sub-database is used for storing information used for describing the encrypted storage objects, including information of file objects, data blocks and data fragments;
The user metadata sub-library is used for storing ciphertext describing file names, dates, sizes, and permissions, the ciphertext of random keys, and fragment information, and comprises three tables: buckets, files, and file objects;
the node sub-library and the user sub-library are independent sub-libraries and are stored on all super nodes, and background synchronization is carried out when the super nodes run;
the global metadata sub-library is split into sub-tables according to the ids of the virtual super nodes corresponding to the super nodes;
and the user metadata sub-library is split into sub-libraries according to the ids of the virtual super nodes corresponding to the super nodes.
A further refinement of the invention is that each super node is constituted by a cluster of co-located master-slave systems.
In a further refinement, the task scheduling includes a data reconstruction task: the data nodes send their state to the super nodes periodically; a data node that fails to report its state before the timeout is marked as being in a warning state; when the number of data nodes in the warning state reaches a set threshold, the data reconstruction task is triggered; and when the data reconstruction task starts, the corresponding data nodes are marked as damaged.
In a further refinement, the user data is compressed, encrypted, fragmented, and ULRC-encoded on the edge node, then stored in a scattered manner on the data nodes.
The invention also discloses a destruction-resistant storage method for extreme scenarios, comprising the following steps:
the edge node fragments the user data and, after computing check fragments with an erasure-correcting code, stores the fragments in a scattered manner on data nodes across regions;
adopting a read-write-separated multi-active architecture, the metadata is stored in the form of a metadata database scattered across super nodes in different regions, while the edge node keeps a local copy of its metadata as a fast index;
the super nodes manage the data nodes and edge nodes and perform task scheduling, including data reconstruction.
In a further refinement, the storage of user data comprises:
computing the hash value of the file content on the edge node and sending a check request to the corresponding super node to test whether a file with that hash already exists; if so, performing file deduplication;
compressing the user data and splitting it into multiple data blocks; computing, for each block in turn, its hash value and a deduplication key DDK; sending a query to the super node to test whether an identical block already exists; if so, deduplicating; if not, generating a random key DEK and encrypting the data block with it;
fragmenting the encrypted data block and computing check fragments with the ULRC code;
and storing the data-block fragments and the corresponding check fragments in a scattered manner on the assigned data nodes.
The beneficial effects of the invention are as follows:
1. the system still works normally, without data loss, even when half of the data nodes are damaged;
2. the invention exploits the characteristics of user-data storage to build a database with the WORM (write once, read many) property, effectively preventing data tampering;
3. space utilization is high: the bottom-layer storage uses an advanced erasure-coding technique operating on ciphertext, giving much better storage-resource utilization than the traditional replica approach;
4. data leakage is prevented by a strong encryption mechanism: no plaintext exists anywhere in the storage network, important data carries no leakage risk, and privacy protection is strong;
5. storage expansion is theoretically unlimited and does not affect performance; management cost does not rise sharply as the storage scale grows.
Drawings
FIG. 1 is a schematic diagram of a system architecture in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hash ring of a supernode in an embodiment of the present invention;
FIG. 3 is a flow chart of file upload in an embodiment of the present invention;
FIG. 4 is a file download flow diagram in an embodiment of the invention;
FIG. 5 is a flow chart of data reconstruction in an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make its objects, technical solutions, and advantages more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
As shown in FIG. 1, the destruction-resistant storage system for extreme scenarios in this embodiment includes a super node cluster, a data node cluster, edge nodes, and a relay server; communication and data transfer among the super node cluster, the data node cluster, and the edge nodes use gRPC and REST APIs.
The super nodes adopt a read-write-separated multi-active architecture: the metadata is split into sub-databases and sub-tables stored on super nodes across regions, and each sub-database and sub-table runs in master-slave mode on different nodes, together forming the overall multi-active architecture. Meanwhile, each super node is itself a dual-active cluster built in master-slave mode. The super nodes mainly provide the following functions:
management of the data nodes: registration, allocation, and task scheduling.
User and key management: user registration, renaming, and deletion; key addition, deletion, and modification; and password changes. In this embodiment each user holds a login key pair (asymmetric), a signature key pair (asymmetric), and an encryption key (symmetric). The login key (ULK) is used for login and asymmetric encryption; its private key is kept locally by the user, while the public key and the ciphertext of the private key (encrypted with the encryption key) are stored in the user metadata sub-library on the super nodes. The signature key (USK) signs sensitive data in transit; like the login key, its private key is kept locally, and the public key and private-key ciphertext are stored in the user information on the super nodes. The encryption key (UEK) encrypts the data key DEK randomly generated at file upload, as well as the private keys of the other, asymmetric keys; the UEK itself is stored only locally, i.e., on the edge node. The login key and signature key may be lost or set to expire, and data can still be recovered through the UEK after they are lost; the UEK, however, cannot be recovered once lost and can only be rebuilt. Replacing the UEK is equivalent to creating a new user with the same id, and the old user's id may then be cleared.
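The key relationships above (a per-upload DEK wrapped by the user's UEK into the stored UDEK) can be sketched as follows. This is a toy illustration, not the patent's cipher: the XOR-keystream wrap and the `xor_wrap` helper are assumptions for demonstration only; a real system would use an authenticated scheme such as AES key wrap or AES-GCM.

```python
import hashlib
import os

def xor_wrap(key: bytes, wrapping_key: bytes, label: bytes) -> bytes:
    """Toy key wrap: XOR the key with a SHA-256-derived keystream.

    XOR is its own inverse, so calling this twice with the same
    wrapping key recovers the original key.
    """
    stream = hashlib.sha256(wrapping_key + label).digest()
    return bytes(a ^ b for a, b in zip(key, stream))

# UEK: the user's symmetric encryption key, held only on the edge node.
uek = os.urandom(32)
# DEK: random data key generated for one upload.
dek = os.urandom(32)

# UDEK = DEK wrapped with the UEK; only this ciphertext leaves the edge node.
udek = xor_wrap(dek, uek, b"UDEK")

# Unwrapping with the same UEK recovers the DEK.
assert xor_wrap(udek, uek, b"UDEK") == dek
```

Losing the UEK makes `udek` permanently undecipherable, which mirrors the text's statement that data cannot be recovered after the UEK is lost.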
Metadata management: storage and management of global Objects, Blocks, and Shards, and of user-related buckets, files, and objects.
Synchronization management of the metadata databases among the super nodes.
The data nodes form a fully decentralized peer-to-peer storage architecture; a massive number of data nodes make up an ultra-large-scale storage cluster. User data is encrypted and stored scattered across data nodes in different regions, and the high fault tolerance of erasure coding guarantees the destruction resistance of the whole system. The main functions include:
storing the user's data fragments, including data-block fragments and check fragments, and providing upload and download functions;
and receiving task scheduling from the super nodes, taking charge of tasks such as data reconstruction, verification, early warning, and state query.
The edge node runs on the user end and mainly provides user interfaces, key management, and an S3 back-end service, implementing compression, chunking, encryption, deduplication, and error-correction coding.
The relay server assists connection setup and establishes P2P communication when super nodes, data nodes, and edge nodes sit on ordinary consumer Internet connections or behind routers and therefore cannot communicate directly by IP address. The relay server has a fixed, independent public IP address, and the super nodes, data nodes, and edge nodes communicate with it directly. It identifies all super nodes, data nodes, and edge nodes by node ID and maintains an internal node mapping table; on receiving a P2P connection request from an edge node (client), it locates both communicating parties by the node IDs in the request and notifies and assists them in establishing a connection.
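The relay's node-mapping behaviour described above can be modelled in a few lines. The `RelayServer` class and its method names are illustrative assumptions, not part of the patent; a real relay would also drive the NAT-traversal handshake.

```python
class RelayServer:
    """Minimal model of the relay's node mapping table: every node
    registers its reachable address under its node ID, and a P2P
    connection request looks up both endpoints."""

    def __init__(self):
        self.table = {}  # node ID -> (ip, port)

    def register(self, node_id: str, addr: tuple) -> None:
        """Called by super nodes, data nodes, and edge nodes on startup."""
        self.table[node_id] = addr

    def connect(self, src_id: str, dst_id: str) -> tuple:
        """Locate both parties of a P2P request so they can be notified
        and assisted in establishing a direct connection."""
        return self.table[src_id], self.table[dst_id]
```

Because every node talks to the relay's fixed public address, the lookup works even when neither endpoint is directly reachable from the other.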
The storage principle of the invention is to compress and chunk user data into data blocks, fragment the encrypted data blocks, compute the corresponding check fragments with an erasure-correcting code, and store the data fragments (data-block fragments plus check fragments) scattered across data nodes in many places. This fully exploits the geographic dispersion, Byzantine fault tolerance, and high-concurrency transfer capability of massive-scale storage, providing both the theoretical basis for destruction resistance in extreme scenarios and good performance.
The error-correction algorithm ULRC adopted in this embodiment is a hierarchical scheme based on RS codes. The bottom layer uses Cauchy RS codes with a default original-data-fragment count k of 64 and a total fragment count n of 128, both configurable. On top of this, LRC coding groups the fragments and computes additional local check fragments, and the original data fragments and check fragments are stored together, scattered across the data nodes. During download, the complete data can be reconstructed from only half of the fragments, which is what allows normal operation in the extreme case where half of the data nodes are destroyed, while also thoroughly avoiding data-security and network long-tail problems.
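A minimal sketch of the grouped local-parity idea behind the LRC layer, using plain XOR parity as a stand-in for the Cauchy RS codes the patent actually specifies (so this toy version tolerates only one erasure per group; the function names and the group size are assumptions):

```python
from functools import reduce

def xor(shards):
    """Byte-wise XOR of equal-length shards."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*shards))

def lrc_encode(data_shards, group_size=4):
    """Compute one local XOR check shard per group of data shards."""
    groups = [data_shards[i:i + group_size]
              for i in range(0, len(data_shards), group_size)]
    return [xor(g) for g in groups]

def lrc_recover(group, parity, lost_index):
    """Rebuild one lost shard from its local group plus the group parity."""
    survivors = [s for i, s in enumerate(group) if i != lost_index]
    return xor(survivors + [parity])

# Eight 16-byte data shards, two groups of four:
shards = [bytes([i]) * 16 for i in range(8)]
parity = lrc_encode(shards)

# Lose shard 2 of the first group and rebuild it locally:
assert lrc_recover(shards[:4], parity[0], lost_index=2) == shards[2]
```

The point of the local groups is exactly what this shows: a single lost fragment is rebuilt from its small group alone, without touching the other half of the cluster.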
The number of data fragments lies between 128 and 255, far smaller than the total number of data nodes. When data nodes are destroyed uniformly at random, the number of fragments lost from the specific group of data nodes holding a single piece of data follows a hypergeometric distribution whose expectation matches the loss fraction of the whole system, so the destruction resistance of a single piece of data is consistent with that of the system as a whole.
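The hypergeometric argument can be checked numerically. Assuming (as the text implies) that the n fragments of one file sit on n distinct, uniformly chosen nodes, the file is lost only when more than n − k of its fragment-holding nodes are among the destroyed nodes:

```python
from math import comb

def loss_probability(total_nodes, destroyed, shards, min_needed):
    """P(file lost): probability that more than shards - min_needed of the
    file's shard-holding nodes fall among the destroyed nodes, summed
    over the tail of the hypergeometric distribution."""
    max_tolerable = shards - min_needed
    p = 0.0
    for lost in range(max_tolerable + 1, shards + 1):
        p += (comb(destroyed, lost)
              * comb(total_nodes - destroyed, shards - lost)
              / comb(total_nodes, shards))
    return p

# e.g. 10 000 nodes, 40% destroyed, n = 128 fragments, k = 64 needed:
p = loss_probability(10_000, 4_000, 128, 64)
```

`math.comb` returns 0 when the lower argument exceeds the upper, which handles the boundary cases (no nodes destroyed, all nodes destroyed) without special-casing.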
The data stored in this embodiment falls into user data and metadata describing the distribution and related attributes of the user data. User data is compressed, encrypted, fragmented, and ULRC-encoded on the edge node, then stored scattered on the data nodes, as shown in FIG. 3. The storage method specifically comprises the following steps:
the edge node computes a hash value of the whole file content (the file Cid) and sends a check request to the super node asking whether a file with this hash already exists, for file deduplication; if it exists, the corresponding metadata is stored directly;
the edge node compresses the user data and splits it into multiple data blocks (Blocks); for each block in turn it computes the block's hash value (the block Cid) and a deduplication key DDK, and asks the super node whether an identical block already exists, deduplicating if so; if not, it generates a random key DEK and encrypts the compressed block with it, producing an EBlock. The deduplication key DDK is obtained by a sha256 computation over the block's Cid and checksum;
the edge node splits the encrypted data block (EBlock) into 16K fragments and computes check fragments with ULRC coding;
the edge node asks the super node to allocate data nodes and sends the data fragments to the allocated nodes, completing the actual write operation, i.e., the storage of the user data.
The key DEK is encrypted with the user encryption key (UEK) to obtain UDEK, which is stored in the block field of the file-object table of the user metadata sub-library;
the key DEK is also encrypted with the deduplication key DDK to produce DDEK, which is stored in the DDEK field of the data-block table of the global metadata sub-library.
After all blocks have been transferred, the edge node reports completion to the super node and asks the corresponding super nodes to store the associated metadata; at the same time it stores the metadata locally as a fast index. The metadata includes the date, size, attributes, user, key ciphertext, and the data-node locations of the fragments.
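The hashing, chunking, and DDK steps of the upload flow above can be sketched as follows. The 4 MB block size and the use of CRC-32 as the checksum are assumptions for illustration — the patent fixes neither, saying only that the DDK is a sha256 over the block's Cid and checksum.

```python
import hashlib
import zlib

CHUNK = 4 * 1024 * 1024  # assumed block size; not specified in the patent

def upload_pipeline(data: bytes):
    """Compute the file Cid, then per-block Cid and deduplication key DDK."""
    file_cid = hashlib.sha256(data).hexdigest()  # whole-file hash for dedup check
    compressed = zlib.compress(data)
    blocks = [compressed[i:i + CHUNK] for i in range(0, len(compressed), CHUNK)]
    records = []
    for block in blocks:
        block_cid = hashlib.sha256(block).hexdigest()
        checksum = f"{zlib.crc32(block):08x}"
        # DDK = sha256 over the block's Cid and checksum, as described above.
        ddk = hashlib.sha256((block_cid + checksum).encode()).hexdigest()
        records.append((block_cid, ddk))
    return file_cid, records
```

Because every value here is derived deterministically from the content, two users uploading the same file produce identical Cids and DDKs, which is what makes super-node-side deduplication possible without seeing plaintext keys.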
In this embodiment, user-data download, i.e., the file download flow, is shown in FIG. 4 and comprises:
the edge node asks the super node to read the file's Object information and obtains the data-block list from it;
the edge node asks the super node for the data-fragment list of a specified block id;
the edge node sends read requests to all the data nodes listed in the fragment-list information;
the edge node receives data fragments; once the minimum number required for decoding is reached, it starts ULRC decoding, and after decoding succeeds it discards all fragments still being downloaded, effectively removing the long-tail problem;
after successful decoding, the edge node merges the computed fragments into a data block, decrypts UDEK to recover the key DEK, and decrypts the block with DEK; once all data blocks have been downloaded and decrypted, they are concatenated into the file, completing the user request.
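The long-tail cut-off is the key performance point of the download flow: with n fragment requests in flight and any k of them sufficient to decode, completion time is governed by the k-th fastest response rather than the slowest. A minimal model (the function name is an illustrative assumption):

```python
def completion_time(latencies, k):
    """With parallel fragment requests where any k responses suffice to
    decode, the download completes at the k-th fastest response;
    stragglers beyond that point are simply discarded."""
    return sorted(latencies)[k - 1]

# Six fragment requests, one 5-second straggler; k = 4 suffices to decode,
# so the straggler never matters:
assert completion_time([0.1, 0.2, 0.3, 0.4, 0.5, 5.0], k=4) == 0.4
```

Without the erasure-code slack (i.e., if all six responses were required), the same request would take the full 5 seconds — this order-statistics effect is what the text means by removing the long tail.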
The data node's storage consists of several files or raw partitions, with 16K as the storage unit. A K/V database records the mapping from the hash of a data fragment's content to its offset address: K is the hash, V is the storage-unit address, and reads and writes of data fragments go through this mapping table for address translation.
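The 16K-unit layout with a hash-to-address K/V map can be sketched as below. `ShardStore` and the in-memory byte area are illustrative assumptions standing in for the files/raw partitions and the K/V database of the text; fragments are assumed to be at most one 16K unit, as in the upload flow.

```python
import hashlib

UNIT = 16 * 1024  # 16K storage unit

class ShardStore:
    """Toy model of a data node's layout: a flat byte area of 16K units
    plus a K/V map from fragment-content hash to unit address."""

    def __init__(self):
        self.area = bytearray()
        self.kv = {}  # K: fragment hash, V: storage-unit address

    def write(self, shard: bytes) -> str:
        key = hashlib.sha256(shard).hexdigest()
        if key not in self.kv:  # content-addressed: identical fragments stored once
            self.kv[key] = len(self.area) // UNIT
            self.area += shard.ljust(UNIT, b"\0")  # pad to a full 16K unit
        return key

    def read(self, key: str) -> bytes:
        offset = self.kv[key] * UNIT  # address translation via the mapping table
        return bytes(self.area[offset:offset + UNIT])
```

Content addressing also gives the WORM property mentioned earlier for free: a unit, once written, is never overwritten, because any changed content hashes to a different key.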
In this embodiment the data nodes send their state to the super node periodically; a data node that has not reported its state before the timeout is marked as being in a warning state. When the number of warning-state data nodes reaches a specified threshold, a reconstruction task is scheduled (the warning state can still be revoked before the task is triggered); when the reconstruction task starts, the corresponding data nodes are marked as damaged. As shown in FIG. 5, the reconstruction flow comprises:
step 1, the super node receives the early-warning information, scans the data-fragment information held by the damaged data nodes, and builds a rebuild-shard table, i.e., a table of fragment hash values to rebuild;
step 2, N data nodes are allocated as reconstruction nodes, and the fragments to be rebuilt are distributed evenly among them;
step 3, each reconstruction node downloads the specified data block and computes the lost data fragments; if only one fragment is lost, it stores the fragment locally and sends the result to the super node;
step 4, if more than one fragment is lost, the data of the additional fragments is uploaded to other data nodes;
step 5, the reconstruction state is reported to the super node, which rewrites the node information of the fragments;
steps 3 to 5 are repeated until all data fragments have been rebuilt.
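Step 2's even distribution of the rebuild-shard table over the N reconstruction nodes can be as simple as a round-robin assignment. This is a sketch under that assumption — the patent does not specify the assignment policy, only that the distribution is even.

```python
def assign_rebuild(lost_shards, rebuild_nodes):
    """Spread the rebuild-shard table evenly over the allocated
    reconstruction nodes via round-robin; list sizes differ by at most 1."""
    plan = {node: [] for node in rebuild_nodes}
    for i, shard in enumerate(lost_shards):
        plan[rebuild_nodes[i % len(rebuild_nodes)]].append(shard)
    return plan

# Ten lost fragments spread over three reconstruction nodes: 4 + 3 + 3.
plan = assign_rebuild(list(range(10)), ["n1", "n2", "n3"])
```

Even assignment matters here because each reconstruction node must download a full data block per lost fragment, so a skewed plan would recreate the very long-tail behaviour the system is designed to avoid.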
Metadata of a user's own files is also stored locally on the edge node, so it can be looked up quickly on the spot and serves as a flexible supplement for extreme destruction resistance.
For the destruction resistance of the metadata database on the super nodes, this embodiment exploits the characteristics of user-data storage to build a database with the WORM (write once, read many) property. The database is split into multiple sub-libraries and sub-tables by virtual-super-node id and distributed among the super nodes for management: each super node holds the master library and master tables belonging to itself, plus several slave libraries or slave tables belonging to other super nodes (of which it is thus a slave super node); master libraries and tables are read-write, slave libraries and tables read-only. Metadata maintained on a super node is synchronized in the background by the super nodes holding slaves of it. Concretely, all super nodes, data nodes, and edge nodes are placed on a consistent hash ring according to their ids; each data node and edge node is managed by the nearest super node in the clockwise direction, and the metadata database formed from the metadata of those nodes exists as an independent sub-library, serving as the master library and master tables (with read-write permission) managed by that super node.
The background synchronization service runs at start-up: following a locally configured policy, it queries the latest update time of the data in each slave library and slave table, sends a synchronization request to the other super nodes with that time as the starting point, fetches the other super nodes' master-library records newer than that time, and stores them.
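Since every record carries a timestamp field, the background synchronization reduces to "fetch everything newer than my latest". A sketch of that query, with dict-shaped records as an assumption about the record format:

```python
def sync_request(slave_rows, master_rows):
    """Background sync step: find the latest timestamp in the local slave
    copy, then return every master record strictly newer than it."""
    since = max((row["timestamp"] for row in slave_rows), default=0)
    return [row for row in master_rows if row["timestamp"] > since]
```

Because master libraries are write-once and single-writer, a per-table high-water-mark timestamp is sufficient; no conflict resolution between super nodes is needed.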
To illustrate further, as shown in FIG. 2, data nodes and edge nodes find and attach to virtual super nodes in the clockwise direction; in FIG. 2, data node D1 is managed by one virtual super node, while D2 and D3 are managed by another.
In the hash ring, 1024 virtual super nodes are predefined, and the global metadata sub-library and the user metadata sub-library are each evenly split into 1024 sub-libraries. Since the number of actual super nodes may be smaller than the number of virtual ones, each super node corresponds to several virtual super nodes; for example, in FIG. 2 super node SnodeA corresponds to virtual nodes Snode1, Snode3, and Snode5, and super node SnodeB to virtual super nodes Snode2, Snode4, and Snode6.
When a super node is added, some virtual super nodes are reassigned to the new super node.
When a super node is deleted, the virtual super nodes it managed are released to other super nodes for management.
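The clockwise-lookup rule on the consistent hash ring can be sketched with a sorted point list and a binary search. The 4-byte ring positions, the id naming, and the `HashRing` class are assumptions for illustration; only the clockwise nearest-virtual-supernode rule comes from the text.

```python
import bisect
import hashlib

def ring_pos(node_id: str) -> int:
    """Map a node id to a position on a 2^32-point ring."""
    return int.from_bytes(hashlib.sha256(node_id.encode()).digest()[:4], "big")

class HashRing:
    """Clockwise lookup: a data or edge node is managed by the nearest
    virtual super node at or after its ring position, wrapping around."""

    def __init__(self, virtual_ids):
        self.points = sorted((ring_pos(v), v) for v in virtual_ids)
        self.keys = [p for p, _ in self.points]

    def manager(self, node_id: str) -> str:
        i = bisect.bisect_left(self.keys, ring_pos(node_id)) % len(self.points)
        return self.points[i][1]

# 16 virtual super nodes standing in for the patent's 1024:
ring = HashRing([f"vsnode-{i}" for i in range(16)])
owner = ring.manager("D1")  # deterministic for a given id
```

Adding or removing a real super node only reassigns whole virtual nodes, so the ring positions (and hence the sub-library split) never move — which is exactly why the 1024-way partitioning can stay fixed while the set of physical super nodes changes.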
The metadata libraries fall into four categories: Nodes (node sub-library), Users (user sub-library), UisMeta (global metadata sub-library), and UserMeta (user metadata sub-library); every record contains a timestamp field to ease application-level synchronization.
The node sub-library records the id, public-key list, state, and similar information of the super nodes and data nodes. Its data is small and changes little; it is an independent sub-library stored on all super nodes, each of which uses the same node sub-library (though the data may momentarily differ). When a super node is initialized, it copies the node sub-library from an existing super node and synchronizes it in the background while running.
The user sub-library records each user's id, key ciphertext, and state. The super nodes use the same user sub-library, synchronized in the background.
The global metadata sub-database is used for storing information used for describing the encrypted storage object, including information of file objects, data blocks and data fragments.
Object: a file object recording the file's plaintext hash value, size, data-block list, reference count, and so on.
Block: a data block recording the block's plaintext hash value, the root hash of the Merkle tree over its data fragments, the ciphertext of the random key, and the anti-tamper code parameters.
Shard: a data fragment recording the fragment's hash, the id of its data block, and the id of its data node.
The global metadata sub-library is split into 1024 sub-tables according to the ids of the virtual super nodes corresponding to the super nodes, one per virtual super node. Each sub-table stores the Objects and Blocks whose ids fall in its range; a super node writes only the sub-tables it manages itself (master tables) and treats the others (slave tables) as read-only.
The user metadata sub-library stores ciphertext describing file names, dates, sizes, and permissions, the ciphertext of random keys, and fragment information, the latter including the number of data fragments, the coding algorithm, and similar details. It comprises three tables: buckets, files, and file objects:
bucket: records the bucket id, name, date, and similar information;
file: records the file name, bucket id, and version list; every modification of a file creates a new version, and each version holds the object id of the corresponding storage object;
file object: a reference to the file object in the global metadata library, with an added UDEK field storing the DEK ciphertext.
Because it is logically independent, the user metadata sub-library is partitioned into 1024 sub-libraries by user id, each managed by one of the 1024 virtual super nodes. As with the global metadata sub-library, a super node writes only the sub-libraries under its own management; the others are read-only.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
While the foregoing describes embodiments of the present invention, it should be understood that the description is merely illustrative and is not intended to limit the scope of the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention falls within its scope.

Claims (7)

1. A destruction-resistant storage system for extreme scenarios, comprising:
a super node cluster, which stores the partitioned metadata (split into sub-databases and sub-tables) across regions, manages data nodes, edge nodes, users, and keys, and runs a background synchronization service;
a data node cluster, which stores encrypted user data across regions, provides upload and download functions, and receives task scheduling from the super nodes;
an edge node, which runs on the user end, provides user interfaces, key management, and an S3 back-end service, and locally stores the user's metadata;
and a relay server, which, when the super nodes, data nodes, and edge nodes cannot communicate directly, establishes P2P communication among them according to a P2P connection request from an edge node;
The metadata is stored on the super node in the form of a metadata database, and the metadata database is divided into a node sub-database, a user sub-database, a global metadata sub-database and a user metadata sub-database;
The node sub-library is used for storing ids, public key lists and states of super nodes and data nodes;
the user sub-library is used for storing id, key ciphertext and state of the user;
The global metadata sub-database is used for storing information used for describing the encrypted storage objects, including information of file objects, data blocks and data fragments;
The user metadata sub-library is used for storing ciphertext and fragment information comprising names, dates, sizes, authorities and random keys of descriptive files and comprises three tables, namely a storage bucket, files and file objects;
the node sub-library and the user sub-library are independent sub-libraries and are stored on all super nodes, and background synchronization is carried out when the super nodes run;
The global metadata sub-database performs sub-table according to the id of the virtual super node corresponding to the super node;
and the user metadata sub-database is used for carrying out database separation according to the id of the virtual super node corresponding to the super node.
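As an illustrative sketch (not part of the claims), the virtual-super-node partitioning described in claim 1 can be expressed as a hash-based routing function. The ring size `NUM_VIRTUAL_SUPER_NODES`, the table/database naming scheme, and the use of SHA-256 are assumptions for illustration; the patent does not fix them.

```python
import hashlib

NUM_VIRTUAL_SUPER_NODES = 64  # assumed ring size, not specified in the claims

def virtual_super_node_id(key: str) -> int:
    """Map an object key or user id to a virtual super node id by hashing."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VIRTUAL_SUPER_NODES

def global_metadata_table(object_key: str) -> str:
    """Claim 1: the global metadata sub-database is split into tables
    per virtual super node id (hypothetical naming scheme)."""
    return f"global_metadata_{virtual_super_node_id(object_key)}"

def user_metadata_database(user_id: str) -> str:
    """Claim 1: the user metadata sub-database is split into separate
    databases per virtual super node id (hypothetical naming scheme)."""
    return f"user_metadata_{virtual_super_node_id(user_id)}"
```

Because the mapping depends only on the hash of the key, any super node can route a metadata lookup without consulting a central directory.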
2. The destruction-resistant storage system for extreme scenarios according to claim 1, wherein the super nodes, data nodes and edge nodes use gRPC and REST APIs for direct communication and data transfer.
3. The destruction-resistant storage system for extreme scenarios according to claim 1, wherein each super node is composed of a locally deployed master-slave cluster.
4. The destruction-resistant storage system for extreme scenarios according to claim 1, wherein the task scheduling comprises a data rebuilding task; the data nodes periodically report their status to the super node; a data node that fails to report its status within a timeout is marked with a warning status; when the number of data nodes in the warning status reaches a set threshold, the data rebuilding task is triggered; and when the data rebuilding task begins, the corresponding data nodes are marked with a destroyed status.
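As an illustrative sketch (not part of the claims), the heartbeat-and-threshold trigger of claim 4 can be modeled as follows. The timeout of 30 seconds, the threshold of 3 nodes, and the status names are assumptions; the patent only specifies that both values are configurable.

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # assumed seconds of silence before a node is marked WARNING
REBUILD_THRESHOLD = 3      # assumed number of WARNING nodes that triggers rebuilding

OK, WARNING, DESTROYED = "ok", "warning", "destroyed"

class SuperNodeMonitor:
    """Toy model of the super node's data-node status tracking."""

    def __init__(self):
        self.last_seen = {}  # data node id -> timestamp of last heartbeat
        self.status = {}     # data node id -> status string

    def heartbeat(self, node_id, now=None):
        """A data node periodically reports its status (claim 4)."""
        now = time.time() if now is None else now
        self.last_seen[node_id] = now
        self.status[node_id] = OK

    def sweep(self, now=None):
        """Mark silent nodes WARNING; past the threshold, start rebuilding
        and mark the affected nodes DESTROYED. Returns the node ids whose
        fragments must be rebuilt (empty if no rebuild was triggered)."""
        now = time.time() if now is None else now
        for node_id, seen in self.last_seen.items():
            if self.status[node_id] == OK and now - seen > HEARTBEAT_TIMEOUT:
                self.status[node_id] = WARNING
        warned = [n for n, s in self.status.items() if s == WARNING]
        if len(warned) >= REBUILD_THRESHOLD:
            for node_id in warned:
                self.status[node_id] = DESTROYED
            return warned
        return []
    ```

Deferring the rebuild until several nodes are in the warning state avoids triggering expensive fragment reconstruction on a single transient network hiccup.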
5. The destruction-resistant storage system for extreme scenarios according to claim 1, wherein the user data is compressed, encrypted, fragmented and ULRC-encoded at the edge node, and then stored in a distributed manner on the data nodes.
6. A destruction-resistant storage method for extreme scenarios based on the system of any one of claims 1 to 5, comprising:
fragmenting, by the edge node, the user data, computing check fragments with an erasure code, and storing the fragments in a distributed manner on cross-region data nodes;
adopting a read-write-separated multi-active architecture, storing the metadata in the form of a metadata database distributed across cross-region super nodes, and locally storing the edge node's metadata as a fast index;
and managing, by the super node, the data nodes and the edge nodes, and performing task scheduling including data rebuilding.
7. The destruction-resistant storage method for extreme scenarios according to claim 6, wherein storing the user data comprises:
computing a hash value of the file content at the edge node, sending a check request to the corresponding super node to check whether a file with that hash value already exists, and if so, deduplicating the file;
compressing the user data and splitting it into a plurality of data blocks, sequentially computing each data block's hash value and deduplication key DDK, sending a query to the super node to determine whether a duplicate data block exists, deduplicating if so, and otherwise generating a random key DEK and encrypting the data block with the random key DEK;
fragmenting the encrypted data block to obtain data block fragments, and computing check fragments with the ULRC code;
and storing the data block fragments and the corresponding check fragments in a distributed manner on the corresponding data nodes.
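As an illustrative sketch (not part of the claims), the upload pipeline of claim 7 can be traced end to end. The block size, the fragment count, the XOR keystream "cipher", and the single XOR parity fragment are all placeholders chosen for a self-contained example: a real implementation would use an authenticated cipher such as AES-GCM and the actual ULRC code, neither of which is reproduced here.

```python
import hashlib
import os
import zlib

BLOCK_SIZE = 1024  # assumed block size; the patent does not fix one
K = 4              # data fragments per block (stand-in for the ULRC parameters)

def ddk(block: bytes) -> bytes:
    """Deduplication key DDK: content hash used for block-level dedup lookups."""
    return hashlib.sha256(block).digest()

def xor_keystream_encrypt(block: bytes, dek: bytes) -> bytes:
    """Placeholder cipher: XOR with a SHA-256 keystream derived from the DEK.
    Symmetric, so applying it twice recovers the plaintext."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(dek + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(block, stream))

def fragment_with_parity(ciphertext: bytes):
    """Split into K equal fragments plus one XOR parity fragment
    (a single-parity stand-in for the ULRC check fragments)."""
    step = -(-len(ciphertext) // K)  # ceiling division
    frags = [ciphertext[i * step:(i + 1) * step].ljust(step, b"\0") for i in range(K)]
    parity = bytearray(step)
    for frag in frags:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return frags, bytes(parity)

def store_file(data: bytes, known_file_hashes: set, known_block_ddks: set):
    """Claim 7 pipeline: file dedup -> compress -> split into blocks ->
    block dedup via DDK -> encrypt with a random DEK -> fragment with
    check fragments. Scattering to data nodes is elided."""
    file_hash = hashlib.sha256(data).hexdigest()
    if file_hash in known_file_hashes:
        return []  # whole-file deduplication
    compressed = zlib.compress(data)
    stored = []
    for off in range(0, len(compressed), BLOCK_SIZE):
        block = compressed[off:off + BLOCK_SIZE]
        key = ddk(block)
        if key in known_block_ddks:
            continue  # block-level deduplication
        known_block_ddks.add(key)
        dek = os.urandom(32)  # random per-block data encryption key DEK
        frags, parity = fragment_with_parity(xor_keystream_encrypt(block, dek))
        stored.append((key, frags, parity))
    known_file_hashes.add(file_hash)
    return stored
```

Note the ordering: the DDK is computed over the plaintext block before encryption, so identical blocks deduplicate across uploads even though each stored block is encrypted under its own random DEK.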
CN202511033160.0A 2025-07-25 2025-07-25 A destruction-resistant storage system and method for extreme scenarios Active CN120524534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511033160.0A CN120524534B (en) 2025-07-25 2025-07-25 A destruction-resistant storage system and method for extreme scenarios


Publications (2)

Publication Number Publication Date
CN120524534A CN120524534A (en) 2025-08-22
CN120524534B true CN120524534B (en) 2025-09-26

Family

ID=96741923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511033160.0A Active CN120524534B (en) 2025-07-25 2025-07-25 A destruction-resistant storage system and method for extreme scenarios

Country Status (1)

Country Link
CN (1) CN120524534B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10860622B1 (en) * 2015-04-06 2020-12-08 EMC IP Holding Company LLC Scalable recursive computation for pattern identification across distributed data processing nodes

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US11893126B2 (en) * 2019-10-14 2024-02-06 Pure Storage, Inc. Data deletion for a multi-tenant environment
CN112799789B (en) * 2021-03-22 2023-08-11 腾讯科技(深圳)有限公司 Node cluster management method, device, equipment and storage medium
CN113326006B (en) * 2021-06-17 2023-09-29 上海天玑科技股份有限公司 Distributed block storage system based on erasure codes
CN114880112A (en) * 2022-03-31 2022-08-09 深圳清华大学研究院 1.5-dimensional graph partitioning method for sensing degrees of three types of vertexes and application
CN114880272B (en) * 2022-03-31 2024-06-07 深圳清华大学研究院 Optimization method and application of communication of vertex sets with global high degree
CN117806873A (en) * 2023-12-28 2024-04-02 航天壹进制(江苏)信息科技有限公司 Multi-node concurrent massive file backup and restoration method and system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant