CN117076565A

CN117076565A - Data storage methods, devices, non-volatile storage media and electronic equipment

Info

Publication number: CN117076565A
Application number: CN202311064306.9A
Authority: CN
Inventors: 于子烨; 徐嘉禛; 钱璞昕; 雷经纬
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-08-22
Filing date: 2023-08-22
Publication date: 2023-11-17

Abstract

The invention discloses a data storage method and device, a non-volatile storage medium, and electronic equipment, and relates to the field of big data, financial technology, or other related fields. The method includes: obtaining the data to be stored and the respective capacity information of M data distributed storage spaces; assigning a weight to each of the M data distributed storage spaces based on its capacity information, to obtain M storage weights corresponding one-to-one to the M data distributed storage spaces; and storing the data to be stored in the M data distributed storage spaces based on the M storage weights. The invention solves the technical problem of uneven distribution of storage tasks among the storage nodes in a distributed database cluster.

Description

Data storage method and device, nonvolatile storage medium and electronic equipment

Technical Field

The present invention relates to the field of big data, the field of financial science and technology, or other related fields, and in particular, to a data storage method, a data storage device, a nonvolatile storage medium, and an electronic device.

Background

The data distribution scheme adopted by existing massively parallel distributed databases distributes data to different nodes. When a client sends a storage request for target data, the distributed database distributes the target data evenly across the storage nodes, on the principle that each node stores basically the same amount of data. This provides balanced distribution and storage expandability while ensuring the availability and consistency of the data. However, this approach cannot flexibly adjust the overall capacity of the database and easily wastes capacity.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a data storage method, a data storage device, a nonvolatile storage medium and electronic equipment, which are used for at least solving the technical problem of unbalanced storage task allocation of each storage node in a distributed database cluster.

According to an aspect of an embodiment of the present invention, there is provided a data storage method including: acquiring data to be stored and capacity information of each of M data distributed storage spaces; assigning a weight to each of the M data distributed storage spaces according to its capacity information, to obtain M storage weights corresponding one-to-one to the M data distributed storage spaces; and storing the data to be stored into the M data distributed storage spaces according to the M storage weights.

Optionally, storing the data to be stored in the M data distributed storage spaces according to the M storage weights includes: performing data slicing on the data to be stored to obtain N data blocks; and storing the N data blocks into the M data distributed storage spaces according to the M storage weights, wherein, among the M data distributed storage spaces, a storage space with a larger storage weight stores a larger number of data blocks.

Optionally, storing the N data blocks in the M data distributed storage spaces according to the M storage weights includes: determining N randomized values corresponding one-to-one to the N data blocks; dividing the total value range into M sub value ranges corresponding one-to-one to the M storage weights, wherein the size of each of the M sub value ranges is positively correlated with the corresponding storage weight; determining the correspondence between the N randomized values and the M sub value ranges, and determining, according to the correspondence, the data distributed storage space corresponding to each of the N data blocks; and storing each of the N data blocks into its corresponding data distributed storage space.

Optionally, determining the N randomized values corresponding one-to-one to the N data blocks includes: performing a hash operation on each of the N data blocks to generate N hash values corresponding one-to-one to the N data blocks, wherein the N randomized values comprise the N hash values.

Optionally, determining the correspondence between the N randomized values and the M sub value ranges includes: determining a divisor b of a modulo operation under the condition that the total value range is [0, a], wherein a and b are integers greater than 0, and b is greater than a; taking each of the N hash values modulo the divisor b to obtain N moduli corresponding one-to-one to the N hash values; and determining the correspondence between the N moduli and the M sub value ranges, and thereby determining the correspondence between the N randomized values, which correspond respectively to the N moduli, and the M sub value ranges.

Optionally, acquiring capacity information of each of the M data distributed storage spaces includes: acquiring data storage state information of each of the M data distributed storage spaces; and determining the available capacity of each of the M data distributed storage spaces according to the data storage state information, wherein the capacity information comprises the available capacity.

Optionally, the M data distributed storage spaces include any one of: M data distributed storage nodes; M data directories in a plurality of data distributed storage nodes.

Optionally, storing the data to be stored in the M data distributed storage spaces according to the M storage weights includes: acquiring respective real-time capacity information of the M data distributed storage spaces in the process of storing the data to be stored into the M data distributed storage spaces according to the M storage weights; according to the real-time capacity information, respectively distributing weights to the M data distributed storage spaces, and obtaining M real-time weights corresponding to the M data distributed storage spaces; and storing the data to be stored into the M data distributed storage spaces according to the M real-time weights.

According to another aspect of an embodiment of the present invention, there is provided a data storage device including: the acquisition module is used for acquiring the data to be stored and the capacity information of each of the M data distributed storage spaces; the distribution module is used for distributing weights to the M data distributed storage spaces according to the respective capacity information of the M data distributed storage spaces to obtain M storage weights corresponding to the M data distributed storage spaces; and the storage module is used for storing the data to be stored into the M data distributed storage spaces according to the M storage weights.

According to still another aspect of the embodiment of the present invention, there is provided a nonvolatile storage medium, which includes a stored program, where the program controls a device in which the nonvolatile storage medium is located to execute any one of the data storage methods described above when running.

According to a further aspect of an embodiment of the present invention, there is provided an electronic device, including one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any one of the above data storage methods.

In the embodiment of the application, the data to be stored and the capacity information of each of the M data distributed storage spaces are acquired, the weights are respectively distributed for the M data distributed storage spaces according to the capacity information of each of the M data distributed storage spaces, the M storage weights corresponding to each of the M data distributed storage spaces are obtained, then the data to be stored is stored in the M data distributed storage spaces according to the M storage weights, and the purpose of flexibly adjusting the storage positions of the data according to the capacity of each storage space in a distributed database is achieved, so that the technical effect of improving the storage space utilization rate in the distributed database is realized, and the technical problem of unbalanced storage task distribution of each storage node in a distributed database cluster is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:

FIG. 1 is a flow chart of a data storage method provided in accordance with an embodiment of the present application;

FIG. 2 is a schematic diagram of a data storage system provided in accordance with an alternative embodiment of the present application;

FIG. 3 is a flow chart of a method for data storage during entry of a data table provided in accordance with an alternative embodiment of the present application;

FIG. 4 is a flow chart of database capacity expansion and contraction provided in accordance with an alternative embodiment of the present application;

FIG. 5 is a block diagram of a data storage device provided in accordance with an embodiment of the present application;

FIG. 6 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.

It should be noted that, related information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by a user or sufficiently authorized by each party. For example, an interface is provided between the system and the relevant user or institution, before acquiring the relevant information, the system needs to send an acquisition request to the user or institution through the interface, and acquire the relevant information after receiving the consent information fed back by the user or institution.

In the related art, a distributed database system ensures data balance across nodes, meaning that the amount of data stored in each node is basically the same. This property gives rise to the bucket principle, under which the usable capacity of the cluster is bounded by its smallest node: the capacity configuration of every storage node in the distributed database cluster must be consistent, the overall capacity of the database cannot be adjusted flexibly, and capacity is easily wasted. This approach also imposes hard requirements on the disk capacity configuration of the distributed database nodes, which limits the expandability of the database system and makes it difficult to fully utilize the resources already at hand.

To solve the problems that the storage nodes of a large-scale distributed database are constrained by the bucket principle, that storage resources are easily wasted, and that expansion is limited, the application provides a data storage method that adaptively distributes the data to be stored to different storage locations in the distributed database so as to maximize the storage capacity of the distributed database.

The present application will be described with reference to preferred implementation steps, and fig. 1 is a flowchart of a data storage method according to an embodiment of the present application, where the method may be applied to a distributed database, as shown in fig. 1, and includes the following steps:

Step S101, acquiring the data to be stored and the capacity information of each of the M data distributed storage spaces.

The data to be stored may be a data table, which is composed of a plurality of data columns. The M data distributed storage spaces may be storage spaces in a distributed database, where each storage space may be distributed on different devices, or may be distributed on different disk locations in the same device.

As an alternative embodiment, the M data distributed storage spaces include any one of the following: M data distributed storage nodes; M data directories in a plurality of data distributed storage nodes. In one alternative embodiment, the data to be stored may be distributed reasonably across the data distributed storage nodes of the distributed database, in which case the M data distributed storage spaces are M data distributed storage nodes. In another alternative embodiment, the data to be stored may be further distributed reasonably among the data directories under each storage node in the distributed database, so as to keep the amount of data stored under each data directory reasonable, in which case the M data distributed storage spaces may be M data directories. For example, if the numbers of data directories included in the respective storage nodes are $M_1, M_2, \ldots, M_n$, the number of data directories in the distributed database is $M = M_1 + M_2 + \cdots + M_n$. Based on this alternative embodiment, the granularity of data allocation when storing data in the distributed database can be flexibly selected.

In a distributed database, a data directory is a specific directory or folder in a storage node that is used for storing data. It is a structure on the storage node for organizing and managing storage; specifically, a data directory is typically a folder created on a disk of the storage node, and it stores the data blocks or data fragments that the node is responsible for managing. Each data block may include part of the data in a data table or an entire data table; how data blocks are divided depends on service requirements. Each storage node has its own data directories for storing the portion of the data it manages.

The roles of the data directory include:

Storing data: a data directory is the location on a storage node where data is actually stored. It provides a specific file system path for storing and reading data blocks.

Organizing and managing data: the data directory may organize data according to certain rules and structures, for example by classifying and storing data blocks according to their numbers or hash values, so as to facilitate management and lookup of the data.

Data backup and redundancy: data directories typically have some redundancy mechanism, such as replication or distributed replication, to ensure the reliability and fault tolerance of the data.

A disk is a storage medium in a storage node, and a data directory is typically a folder or directory created on the disk. The data in the data directory is stored in the physical space of the disk. A storage node may have multiple disks, each with one or more data directories for storing different data blocks or data fragments.

As an alternative embodiment, the capacity information of each of the M data distributed storage spaces may be obtained by: acquiring data storage state information of each of the M data distributed storage spaces; and determining the available capacity of each of the M data distributed storage spaces according to the data storage state information, wherein the capacity information comprises the available capacity. In this alternative embodiment, the storage state information of a data distributed storage space may include the total storage capacity and the used storage capacity of the storage space, and the difference between the total storage capacity and the used storage capacity is the available capacity of the storage space. When storage locations are allocated for the data to be stored, each storage space can be analyzed according to this alternative embodiment to obtain its capacity information, and the storage locations can then be allocated reasonably based on the available capacity of each storage space, thereby avoiding the bucket effect.
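The available-capacity calculation described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the `StorageState` record, its field names, and the sample byte counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageState:
    """Hypothetical storage state record for one storage space (sizes in bytes)."""
    name: str
    total_bytes: int
    used_bytes: int


def available_capacity(state: StorageState) -> int:
    """Available capacity = total capacity - used capacity, floored at zero."""
    return max(state.total_bytes - state.used_bytes, 0)


# Illustrative values only.
states = [
    StorageState("node-a", total_bytes=1000, used_bytes=400),
    StorageState("node-b", total_bytes=4000, used_bytes=1000),
]
capacity_info = {s.name: available_capacity(s) for s in states}
```

The resulting `capacity_info` mapping plays the role of the capacity information that the later weight assignment consumes.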

Step S102, assigning a weight to each of the M data distributed storage spaces according to the capacity information of the M data distributed storage spaces, to obtain M storage weights corresponding one-to-one to the M data distributed storage spaces.

In this step, the storage weight represents the storage capacity of each storage space: a storage space with a large storage capacity has a large storage weight and can therefore store a larger share of the data to be stored, while a storage space with a smaller storage weight stores a smaller share. In this way, the amount of data stored in each data distributed storage space matches the capacity of that storage space.
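One simple way to realize such weights is to make each weight proportional to the available capacity of its storage space. The sketch below assumes this proportional rule; the function name `storage_weights` and the sample capacities are hypothetical.

```python
def storage_weights(available: list[int]) -> list[float]:
    """Assumed rule: weight_i = available_i / sum of all available capacities."""
    total = sum(available)
    if total <= 0:
        raise ValueError("no available capacity in any storage space")
    return [capacity / total for capacity in available]


# Illustrative available capacities for M = 3 storage spaces.
weights = storage_weights([600, 3000, 2400])
```

With this rule the weights sum to 1, so each weight can be read directly as the share of the data that the corresponding storage space should receive.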

As an alternative embodiment, the data to be stored may be stored in the M data distributed storage spaces according to the M storage weights by: performing data slicing on the data to be stored to obtain N data blocks; and storing the N data blocks into the M data distributed storage spaces according to the M storage weights, wherein, among the M data distributed storage spaces, a storage space with a larger storage weight stores a larger number of data blocks.

Optionally, when data slicing is performed, the data to be stored can be split evenly by data volume to obtain N data blocks of the same or similar size. In that case, the more data blocks a storage space stores, the more data it stores.
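Even slicing can be illustrated with a short sketch. It assumes the data to be stored is available as a list of rows and that block sizes may differ by at most one row; `slice_rows` is a hypothetical helper, not a name from the embodiment.

```python
def slice_rows(rows: list, n_blocks: int) -> list[list]:
    """Split rows into n_blocks contiguous blocks whose sizes differ by at most one."""
    if n_blocks <= 0:
        raise ValueError("n_blocks must be positive")
    base, extra = divmod(len(rows), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        size = base + (1 if i < extra else 0)  # the first `extra` blocks get one more row
        blocks.append(rows[start:start + size])
        start += size
    return blocks


blocks = slice_rows(list(range(10)), 3)
```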

As an alternative embodiment, the N data blocks may be stored in the M data distributed storage spaces according to the M storage weights by: determining N randomized values corresponding one-to-one to the N data blocks; dividing the total value range into M sub value ranges corresponding one-to-one to the M storage weights, wherein the size of each of the M sub value ranges is positively correlated with the corresponding storage weight; determining the correspondence between the N randomized values and the M sub value ranges, and determining, according to the correspondence, the data distributed storage space corresponding to each of the N data blocks; and storing each of the N data blocks into its corresponding data distributed storage space.

The N randomized values corresponding one-to-one to the N data blocks may be randomized values within the total value range. Because the values are randomized, they can be regarded as approximately uniformly distributed within the total value range, and because they correspond one-to-one to the N data blocks, the N data blocks can be regarded as uniformly distributed along the line segment that represents the total value range. The total value range may then be divided into M sub value ranges according to the M storage weights. Optionally, each sub value range is a continuous range of values, that is, a line segment within the total value range whose length is proportional to the storage weight corresponding to that sub value range. Each of the N data blocks then falls into exactly one of the M sub value ranges, and each sub value range corresponds to one data distributed storage space, so a data block that falls into a given sub value range is stored in the data distributed storage space corresponding to that sub value range. In this way, all the data blocks can be distributed according to the capacity of each storage space; for example, more data blocks are stored in a storage space with a larger available capacity.
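The division into sub value ranges and the lookup of the sub value range that contains a given value can be sketched as below. The sketch makes several assumptions that the text does not fix: the total value range is treated as the integers 0 to `range_size - 1`, weights are given as integers, and each sub value range is represented by its exclusive upper bound. The function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def sub_range_bounds(weights: list[int], range_size: int) -> list[int]:
    """Exclusive upper bound of each space's sub-range inside 0..range_size-1."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Cumulative weights scaled to the range give contiguous, proportional segments.
    return [range_size * c // total_weight for c in accumulate(weights)]


def space_for_value(value: int, bounds: list[int]) -> int:
    """Index of the storage space whose sub-range contains `value`."""
    if not 0 <= value < bounds[-1]:
        raise ValueError("value lies outside the total value range")
    return bisect_right(bounds, value)


# Three hypothetical storage spaces with weights 1 : 5 : 4 over the values 0..99.
bounds = sub_range_bounds([1, 5, 4], range_size=100)
counts = [0, 0, 0]
for v in range(100):
    counts[space_for_value(v, bounds)] += 1
```

Representing the sub value ranges by cumulative bounds keeps the lookup to a single binary search, and uniformly distributed values land in each storage space in proportion to its weight.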

As an alternative embodiment, the N randomized values corresponding one-to-one to the N data blocks may be determined by: performing a hash operation on each of the N data blocks to generate N hash values corresponding one-to-one to the N data blocks, wherein the N randomized values comprise the N hash values.

It should be noted that hash-based data distribution is a technique for distributing data to different storage nodes. The basic idea is to map the data to a position in a fixed-size hash space by computing a hash of the data, and then to allocate the data to the corresponding storage node according to the hash value.

Specifically, hash-based data distribution techniques typically include the following steps:

1. Calculating a hash value: for the data to be distributed, a hash value is calculated by a hash function. The hash function converts the data into a hash code of fixed length.

2. Mapping to the hash space: the calculated hash value is mapped into a hash space, typically a fixed-size hash table or a similar data structure.

3. Assigning to a storage node: the data is distributed to the corresponding storage node according to its hash value. In general, a storage node may be a physical server, a node in a distributed system, or a storage area in cloud storage.

Hash-based data distribution can achieve uniform distribution and efficient lookup of data. Because different data generally yield different hash values, the data hot spot problem (certain data being concentrated on the same storage node) is avoided, and the load balance and performance of the system are improved. In addition, by calculating and mapping the hash value, the storage node holding a piece of data can be located quickly, which enables efficient lookup of and access to the data.
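As an illustration of the hash calculation step, the sketch below derives an integer hash value from the bytes of a data block. The choice of SHA-256 and the truncation to 64 bits are assumptions made for the example; the text does not name a specific hash function, and `block_hash` is a hypothetical helper.

```python
import hashlib


def block_hash(block: bytes) -> int:
    """Map a data block to an integer in [0, 2**64) using the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:8], "big")


# Placeholder byte strings standing in for serialized data blocks.
sample_blocks = (b"rows 0-999", b"rows 1000-1999", b"rows 2000-2999")
hash_values = [block_hash(b) for b in sample_blocks]
```

A cryptographic hash is used here only because it is available in the standard library and deterministic across runs; any well-distributed hash function would serve the same purpose.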

As an alternative embodiment, the correspondence between the N randomized values and the M sub value ranges may be determined by: determining a divisor b of a modulo operation under the condition that the total value range is [0, a], wherein a and b are integers greater than 0, and b is greater than a; taking each of the N hash values modulo the divisor b to obtain N moduli corresponding one-to-one to the N hash values; and determining the correspondence between the N moduli and the M sub value ranges, and thereby determining the correspondence between the N randomized values, which correspond respectively to the N moduli, and the M sub value ranges.

This alternative embodiment provides a simplified method for distributing data blocks based on storage weights. Because hash values are large numbers, a modulo operation can be applied to convert them into smaller moduli, so that the total value range can also be a small numerical range that a computer can process quickly. The divisor b of the modulo operation fixes the value range of the N moduli, that is, any one of the N moduli lies in [0, b], and the total value range can accordingly be determined as [0, a] so that every modulus falls into the total value range.
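The modulo step can be sketched end to end as follows. The concrete values are assumptions chosen for the example: a = 99, and the divisor b is taken as a + 1 so that the residues 0 to a exactly cover the total value range. The sub-range bounds, the SHA-256 hash, and the function names are likewise hypothetical.

```python
import hashlib
from bisect import bisect_right

A = 99       # assumed upper end of the total value range [0, A]
B = A + 1    # assumed divisor b: residues 0..A then cover the whole range

# Hypothetical exclusive upper bounds of three sub-ranges (weights 10% / 50% / 40%).
BOUNDS = [10, 60, 100]


def modulus_of(block: bytes, divisor: int = B) -> int:
    """Hash the block, then reduce the large hash value modulo the divisor."""
    hash_value = int.from_bytes(hashlib.sha256(block).digest(), "big")
    return hash_value % divisor


def storage_space_for(block: bytes) -> int:
    """Index of the storage space whose sub-range contains the block's modulus."""
    return bisect_right(BOUNDS, modulus_of(block))


# Placeholder blocks; count how many land in each storage space.
blocks = [f"block-{i}".encode() for i in range(5000)]
counts = [0, 0, 0]
for block in blocks:
    counts[storage_space_for(block)] += 1
```

Over many blocks, the counts per storage space approach the 10% / 50% / 40% split implied by the sub-range sizes.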

Step S103, storing the data to be stored into M data distributed storage spaces according to M storage weights.

As an alternative embodiment, the process of storing the data to be stored in the M data distributed storage spaces may further include the following steps: acquiring real-time capacity information of each of M data distributed storage spaces; according to the real-time capacity information, respectively distributing weights to the M data distributed storage spaces to obtain M real-time weights corresponding to the M data distributed storage spaces; and storing the data to be stored into M data distributed storage spaces according to the M real-time weights.

With this alternative embodiment, the distributed storage process can be adjusted dynamically. For example, once part of the data to be stored has been stored in the M data distributed storage spaces, the remaining capacity of each of the M data distributed storage spaces changes, both because that part of the data has been stored there and because other data may be stored in the M data distributed storage spaces at the same time. Therefore, while data is being stored, the current real-time capacity information of the M data distributed storage spaces can be obtained. When the remaining capacity of a node is insufficient to hold the portion of the data to be stored that is assigned to it, the weights are reassigned according to the current real-time capacity to obtain M real-time weights, and the data to be stored is stored in the M data distributed storage spaces according to the M real-time weights in the same manner as in the alternative embodiments above.
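The dynamic adjustment can be illustrated with a small in-memory simulation. It is a sketch under stated assumptions, not the embodiment's procedure: capacities are tracked as plain integers rather than queried from real nodes, weights are recomputed only when the chosen storage space cannot hold the current block, and storage spaces too small for that block are given zero weight. `RANGE_SIZE`, `bounds_from_capacity`, and `store_blocks` are hypothetical names.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate

RANGE_SIZE = 1000  # assumed size of the total value range


def bounds_from_capacity(capacity: list[int]) -> list[int]:
    """Sub-range upper bounds proportional to the given capacities."""
    total = sum(capacity)
    if total <= 0:
        raise RuntimeError("no storage space can hold the block")
    return [RANGE_SIZE * c // total for c in accumulate(capacity)]


def store_blocks(blocks: list[bytes], free_bytes: list[int]) -> tuple[list[int], list[int]]:
    """Place each block; re-weight from real-time free capacity when a target is too full."""
    free = list(free_bytes)
    bounds = bounds_from_capacity(free)
    placement = []
    for block in blocks:
        value = int.from_bytes(hashlib.sha256(block).digest(), "big") % RANGE_SIZE
        target = bisect_right(bounds, value)
        if free[target] < len(block):
            # Real-time weights: spaces that cannot hold this block get weight zero.
            usable = [f if f >= len(block) else 0 for f in free]
            bounds = bounds_from_capacity(usable)
            target = bisect_right(bounds, value)
        free[target] -= len(block)
        placement.append(target)
    return placement, free


# 30 placeholder blocks of 10 bytes each, two storage spaces with 50 and 400 free bytes.
blocks = [bytes([i]) * 10 for i in range(30)]
placement, remaining = store_blocks(blocks, [50, 400])
```

In this example the smaller storage space can hold at most five blocks; once it is full, the recomputed bounds route all remaining blocks to the larger storage space.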

In the above steps, the data to be stored and the capacity information of each of the M data distributed storage spaces are obtained, the weights are respectively allocated to the M data distributed storage spaces according to the capacity information of each of the M data distributed storage spaces, the M storage weights corresponding to each of the M data distributed storage spaces are obtained, and then the data to be stored is stored in the M data distributed storage spaces according to the M storage weights, so that the purpose of flexibly adjusting the storage positions of the data according to the capacity of each storage space in the distributed database is achieved, the technical effect of improving the storage space utilization rate in the distributed database is achieved, and the technical problem of unbalanced allocation of storage tasks of each storage node in the distributed database cluster is further solved. The application can support devices with different storage capacities through self-adaptive data distribution, can better adapt to the data storage requirements under different scenes, and improves the stability, flexibility and expandability of the distributed database system.

Based on the foregoing embodiment and the optional embodiments, the present application provides an optional implementation, and fig. 2 is a schematic diagram of a data storage system provided according to an optional implementation of the present application, and the data storage system is specifically described below according to fig. 2.

As shown in fig. 2, the foregoing embodiment and the optional embodiments may be applied to a distributed database data distribution system adaptive to different single-node capacities, where the system includes the following 6 modules:

Module 1: capacity information collection for individual nodes of the distributed database. This module collects the storage capacity information of each node in the distributed database, including details such as the total capacity of the node's disks and the disk capacity of each available data directory in the node. This information is used for the subsequent capacity summary and weight assignment.

Module 2: capacity summary, calculation, and analysis across all nodes of the distributed database. This module aggregates and analyzes the storage capacity information of each node in the distributed database to determine the total capacity of the system and the capacity information of each node. The module can update the capacity information at each granularity in real time as node capacities change, so as to ensure adaptive distribution of subsequent data and full utilization of the storage capacity.

Module 3: available-capacity weight assignment for each node of the distributed database. Based on the result of the capacity summary, calculation, and analysis across all nodes, this module assigns an available-capacity weight value to each node according to the ratio of the node's disk capacity to the total capacity of the database. If a node contains multiple data directories, the module further refines the available-capacity weight value of each data directory according to its share of the capacity. The module can update the capacity weight values at each granularity in real time as node capacities change, so as to ensure adaptive distribution of subsequent data and full utilization of the storage capacity.
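The two-level weight assignment performed by this module can be sketched as follows, assuming that a node's capacity is the sum of the capacities of its data directories and that weights are plain capacity ratios. The node names, directory paths, and capacities are hypothetical.

```python
def directory_weights(nodes: dict[str, dict[str, int]]) -> dict[tuple[str, str], float]:
    """Node weight = node capacity / total capacity; refined per directory by capacity share."""
    total = sum(sum(dirs.values()) for dirs in nodes.values())
    if total <= 0:
        raise ValueError("the cluster has no available capacity")
    weights = {}
    for node, dirs in nodes.items():
        node_capacity = sum(dirs.values())
        if node_capacity == 0:
            # A node without capacity contributes zero-weight directories.
            weights.update({(node, d): 0.0 for d in dirs})
            continue
        node_weight = node_capacity / total
        for directory, capacity in dirs.items():
            weights[(node, directory)] = node_weight * capacity / node_capacity
    return weights


# Illustrative capacities per data directory.
cluster = {
    "node-a": {"/data1": 100, "/data2": 300},
    "node-b": {"/data1": 600},
}
weights = directory_weights(cluster)
```

Here `node-a` receives a node-level weight of 0.4, which is then split 1 : 3 between its two data directories.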

Module 4: data table size information collection. This module collects the size information of the data tables to be put in storage, the distribution columns of each data table, the total disk storage capacity required once the data tables are put in storage, and so on. This information is used for the subsequent data slicing and adaptive distributed storage.

Module 5: weight-based data table slicing. This module slices the data table according to the available-capacity weight assignment result of the distributed database and the size information of the data table to be put in storage. It computes hash values of the distribution column data, applies the capacity weight of each node, and slices the data table in proportion to capacity. The module monitors node capacity in real time and dynamically adjusts the slicing strategy according to capacity changes and the actual size of each data table to be put in storage, thereby ensuring adaptive distribution of the data.

Module 6: distributed data storage. This module stores the data produced by the adaptive slicing strategy on each node and provides read and write services for the data. Each node stores only the data assigned to it, and the nodes exchange and synchronize data over the network to ensure the consistency and reliability of the data. The module can adaptively adjust the distributed storage strategy as the slicing result changes, so as to ensure full utilization of the capacities of the different nodes in the distributed database.

FIG. 3 is a flowchart of a data storage method when a data table is put into storage, according to an alternative embodiment of the present invention, as shown in FIG. 3, the process of storing the data table into a distributed database may include the following steps:

Step 201: the database capacity information collection module collects and records the available capacity of each node of the massively parallel distributed database.

Step 202: the database node capacity analysis and calculation module aggregates the capacity of each node of the distributed database and calculates the total available capacity of the distributed database.

Step 203: the database capacity weight distribution module determines the weight value of the available capacity of each node according to the ratio of the capacity of each node to the total capacity of the distributed database.

Step 204: the data table size collection module collects the distribution column information and table size information of the data table to be stored, and calculates in advance the total capacity the table will require once stored.

Step 205: the calculated size of the data table to be stored is compared with the total available capacity of the database to judge whether the remaining available capacity of the database can accommodate the table.

Step 206: if it is determined in step 205 that the available capacity of the database can accommodate the data table, weights are assigned according to the capacity of each node, a weighted hash calculation is performed on the data table to be stored, and the data is sliced.

Step 207: the data sliced under the adaptive capacity strategy is stored in a distributed manner on the corresponding data nodes.

Step 208: if it is determined in step 205 that the available capacity of the database cannot accommodate the data table, the storage flow is terminated and an error is reported: the database capacity is insufficient.
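By way of non-limiting illustration, steps 201 to 208 can be sketched as follows. The function names, the use of SHA-256 as the hash, the 10,000-slot bucket range, and the in-memory representation of a table as a list of rows are all assumptions made for this sketch and are not specified by the embodiment.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


class InsufficientCapacityError(Exception):
    """Raised when the table does not fit in the available capacity (Step 208)."""


def capacity_weights(available):
    """Steps 202-203: weight of each node = its capacity / total capacity."""
    total = sum(available.values())
    if total <= 0:
        raise InsufficientCapacityError("database capacity is insufficient")
    return {node: cap / total for node, cap in available.items()}


def pick_node(key, weights, buckets=10_000):
    """Step 206: hash the distribution-column value and map it to a node.

    The hash is reduced modulo `buckets`, and the bucket range is split
    into consecutive intervals whose widths follow the node weights.
    """
    nodes = list(weights)
    bounds = list(accumulate(weights[n] * buckets for n in nodes))
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % buckets
    # Guard against floating-point rounding in the last cumulative bound.
    index = min(bisect_right(bounds, slot), len(nodes) - 1)
    return nodes[index]


def store_table(rows, row_size, available, key_of=lambda row: row[0]):
    """Steps 204-208 for a table given as a list of rows."""
    required = len(rows) * row_size                       # Step 204
    if required > sum(available.values()):                # Step 205
        raise InsufficientCapacityError("database capacity is insufficient")
    weights = capacity_weights(available)                 # Step 206
    placement = {node: [] for node in available}
    for row in rows:
        placement[pick_node(key_of(row), weights)].append(row)  # Step 207
    return placement
```

In this sketch a node with three times the available capacity of another receives roughly three times as many rows, and a table that exceeds the total available capacity is rejected before any row is placed.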

FIG. 4 is a flowchart of a database capacity expansion and contraction process according to an alternative embodiment of the present invention. As shown in FIG. 4, expanding or contracting a distributed database may be treated as equivalent to storing data tables into the distributed database: the data tables are taken out of the distributed database, the database is expanded or contracted, and the tables that were taken out are then stored in it again. Specifically, this alternative embodiment includes the following steps:

Step 301: the database capacity information collection module collects in advance the planned change to the total capacity of the database, which typically comes from external input.

Step 302: the collected capacity planning change is compared with the total capacity of the existing database to judge whether the total capacity is planned to increase.

Step 303: if the determination result in step 302 is yes, it is determined that the distributed database will perform the capacity expansion operation, and the subsequent data distribution flow is executed with capacity expansion as a target.

Step 304: the database capacity information collection module collects the available capacity of each node after capacity expansion and sums it to obtain the total available capacity of the distributed database.

Step 305: the database capacity weight distribution module determines a new weight value for the available capacity of each node according to the ratio of the new capacity of each node to the total capacity of the distributed database.

Step 306: the data table size collection module collects and calculates the sizes of all existing tables in the database and sums them to obtain the total size of all tables in the database.

Step 307: in the expansion flow, a weighted hash calculation is performed on all existing data tables in the database with reference to the new capacity weight of each node, and the data is sliced. In the capacity reduction flow, this step is executed only if the determination in step 313 is yes.

Step 308: after the expansion or reduction of the database is completed, the data sliced according to the new weights is adaptively distributed to the data nodes of the database.

Step 309: if the result of the determination in step 302 is no, it is determined that the distributed database will perform the capacity reduction operation, and the subsequent data distribution process is performed with the capacity reduction as the target.

Step 310: the database capacity information collection module collects the available capacity of each node after the capacity reduction and sums it to obtain the total available capacity of the distributed database.

Step 311: the database capacity weight distribution module determines a new weight value for the available capacity of each node according to the ratio of the new capacity of each node to the total capacity of the distributed database.

Step 312: the data table size collection module collects and calculates the sizes of all existing tables in the database and sums them to obtain the total size of all tables in the database.

Step 313: the total capacity of the database after the planned capacity reduction is compared with the total size of all existing tables to judge whether the reduced database can accommodate them.

Step 314: if the determination in step 313 is no, the capacity reduction process is terminated and an error is reported: the capacity of the database after capacity reduction is insufficient to store the current data.
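By way of non-limiting illustration, the decision logic of steps 301 to 314 can be sketched as a single planning function. The name `plan_resize`, its parameters, and the use of `ValueError` to report an error are assumptions of this sketch; the re-slicing of the existing tables (steps 307 and 308) would then use the returned weights.

```python
def plan_resize(current_total, planned_capacities, stored_total):
    """Decide between expansion and reduction and compute the new weights.

    current_total:      total capacity of the existing database
    planned_capacities: mapping node -> capacity after the planned change
                        (Steps 304 and 310)
    stored_total:       combined size of all tables already in the database
                        (Steps 306 and 312)
    """
    new_total = sum(planned_capacities.values())
    if new_total <= 0:
        raise ValueError("planned total capacity must be positive")
    # Steps 302, 303 and 309: an increase means expansion, otherwise reduction.
    mode = "expand" if new_total > current_total else "shrink"
    # Steps 313 and 314: a reduced database must still hold the existing tables.
    if mode == "shrink" and stored_total > new_total:
        raise ValueError(
            "capacity after reduction is insufficient to store the current data"
        )
    # Steps 305 and 311: new weight = new node capacity / new total capacity.
    new_weights = {n: c / new_total for n, c in planned_capacities.items()}
    return mode, new_weights
```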

By the alternative embodiments, at least the following advantages can be achieved:

(1) The capacity utilization rate of the database is improved: traditional distributed database storage is based on hash distribution and is constrained by the bucket principle, which requires every node to have the same storage capacity. Data therefore cannot be distributed according to the actual capacity of each node, so storage capacity is wasted on some nodes while other nodes run short. The adaptive data distribution of the invention breaks this constraint through weight allocation, distributes data according to the actual capacity of each node, and improves the utilization rate of storage capacity, thereby saving storage cost.

(2) The flexibility of the database is improved: conventional distributed databases require the storage capacity of each node to be the same, which limits the range of available devices. The invention can distribute according to the actual capacity of each node, and support more devices with different capacities, so that the distributed database can more flexibly select various storage devices with different capacities, and the cost of selecting the devices is reduced.

(3) The availability of the database is improved: the invention can adaptively acquire and distribute the storage capacity information of each node, and can adaptively and dynamically slice and store data according to the size of the data table and the capacity information of each node, thereby improving the data storage availability of the database.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present application.

According to an embodiment of the present application, there is further provided an apparatus for implementing the above data storage method. FIG. 5 is a block diagram of a data storage apparatus according to an embodiment of the present application. As shown in FIG. 5, the data storage apparatus includes an obtaining module 51, an allocation module 52 and a storage module 53, which are described in detail below.

The obtaining module 51 is configured to obtain the data to be stored and the respective capacity information of the M data distributed storage spaces;

The allocation module 52 is connected to the obtaining module 51, and is configured to allocate weights to the M data distributed storage spaces according to the capacity information of the M data distributed storage spaces, so as to obtain M storage weights corresponding to the M data distributed storage spaces;

The storage module 53 is connected to the allocation module 52, and is configured to store the data to be stored in the M data distributed storage spaces according to the M storage weights.
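By way of non-limiting illustration, the three modules and the connections between them can be sketched as follows. The class and method names are assumptions of this sketch, and the weighted random choice used by the storage module is only one possible way of storing "according to the M storage weights"; it is not the hash-based scheme of the later embodiments.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ObtainingModule:
    """Module 51: supplies the data to be stored and the per-space capacities."""
    blocks: list
    capacities: dict

    def obtain(self):
        return self.blocks, self.capacities


class AllocationModule:
    """Module 52: turns capacity information into M storage weights."""

    def allocate(self, capacities):
        total = sum(capacities.values())
        return {space: cap / total for space, cap in capacities.items()}


@dataclass
class StorageModule:
    """Module 53: places each block in a space chosen according to the weights."""
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def store(self, blocks, weights):
        spaces = list(weights)
        space_weights = [weights[s] for s in spaces]
        placement = {space: [] for space in spaces}
        for block in blocks:
            chosen = self.rng.choices(spaces, weights=space_weights)[0]
            placement[chosen].append(block)
        return placement


def run_pipeline(obtaining):
    """Chain the modules in the order 51 -> 52 -> 53."""
    blocks, capacities = obtaining.obtain()
    weights = AllocationModule().allocate(capacities)
    return StorageModule().store(blocks, weights)
```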

Here, the obtaining module 51, the allocation module 52 and the storage module 53 correspond to steps S101 to S103 of the data storage method. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments.

The data storage device provided by the embodiment of the application solves the problem of unbalanced storage task allocation of each storage node in the distributed database cluster in the related technology, thereby achieving the technical effect of improving the utilization rate of the storage space in the distributed database.

The data storage device includes a processor and a memory. The obtaining module 51, the allocation module 52, the storage module 53, and the like are stored in the memory as program units, and the processor executes these program units to realize the corresponding functions.

The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters the storage location of the data is flexibly adjusted according to the capacity of each storage space in the distributed database.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.

The embodiment of the invention provides a computer-readable storage medium having a program stored thereon, which when executed by a processor, implements a data storage method.

The embodiment of the invention provides a processor, which is used for running a program, wherein the program runs to execute a data storage method.

FIG. 6 is a schematic diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 6, the embodiment of the present invention provides an electronic device that includes a processor, a memory, and a program stored in the memory and executable on the processor. The processor implements the following steps when executing the program: acquiring the data to be stored and the respective capacity information of M data distributed storage spaces; allocating weights to the M data distributed storage spaces according to their respective capacity information, to obtain M storage weights corresponding to the M data distributed storage spaces; and storing the data to be stored into the M data distributed storage spaces according to the M storage weights. The device herein may be a server, PC, PAD, cell phone, etc.

The following steps may also be implemented when the processor executes the program: storing data to be stored in M data distributed storage spaces according to M storage weights, including: performing data slicing on data to be stored to obtain N data blocks; and respectively storing the N data blocks into M data distributed storage spaces according to the M storage weights, wherein the number of the data blocks stored in the storage space with larger corresponding storage weight in the M data distributed storage spaces is larger.
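By way of non-limiting illustration, the slicing into N data blocks and the requirement that a space with a larger weight receives more blocks can be sketched as follows. Fixed-size slicing and the largest-remainder rounding used in `block_quotas` are assumptions of this sketch; the embodiment does not prescribe how the block counts are derived.

```python
def slice_data(data: bytes, block_size: int):
    """Cut the data to be stored into N consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_quotas(n_blocks, weights):
    """Split n_blocks among the spaces in proportion to their storage weights.

    Each space first receives the integer part of its exact share; the
    blocks left over go to the spaces with the largest fractional parts.
    """
    total = sum(weights.values())
    exact = {s: n_blocks * w / total for s, w in weights.items()}
    quotas = {s: int(x) for s, x in exact.items()}
    leftover = n_blocks - sum(quotas.values())
    by_remainder = sorted(weights, key=lambda s: exact[s] - quotas[s], reverse=True)
    for space in by_remainder[:leftover]:
        quotas[space] += 1
    return quotas
```

For example, eight blocks and weights in the ratio 1:3 give quotas of two and six blocks respectively.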

The following steps may also be implemented when the processor executes the program: storing the N data blocks into the M data distributed storage spaces respectively according to the M storage weights includes: determining N randomized values in one-to-one correspondence with the N data blocks; dividing the total value range into M sub-value ranges in one-to-one correspondence with the M storage weights, wherein the sizes of the M sub-value ranges are positively correlated with the corresponding storage weights; determining the correspondence between the N randomized values and the M sub-value ranges, and determining, according to the correspondence, the data distributed storage space corresponding to each of the N data blocks; and storing the N data blocks into their corresponding data distributed storage spaces.
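By way of non-limiting illustration, dividing the total value range into M sub-value ranges and looking up the range that contains a randomized value can be sketched as follows. The sketch assumes an integer total range written as the half-open interval [0, total_size) and sub-ranges whose widths are directly proportional to the weights, which is one way of satisfying the positive correlation described above.

```python
from bisect import bisect_right


def split_range(total_size, weights):
    """Return (space, start, end) triples of half-open ranges covering [0, total_size)."""
    spaces = list(weights)
    weight_sum = sum(weights.values())
    bounds, running = [], 0.0
    for space in spaces:
        running += weights[space]
        bounds.append(round(total_size * running / weight_sum))
    bounds[-1] = total_size  # make sure the last range closes the total range
    starts = [0] + bounds[:-1]
    return list(zip(spaces, starts, bounds))


def locate(value, ranges):
    """Return the storage space whose sub-value range contains `value`."""
    ends = [end for _, _, end in ranges]
    if not 0 <= value < ends[-1]:
        raise ValueError("value lies outside the total value range")
    return ranges[bisect_right(ends, value)][0]
```

With a total range of 100 values and weights in the ratio 1:3, the first space owns values 0 to 24 and the second owns values 25 to 99.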

The following steps may also be implemented when the processor executes the program: determining N randomized values in one-to-one correspondence with the N data blocks includes: performing a hash operation on each of the N data blocks to generate N hash values in one-to-one correspondence with the N data blocks, wherein the N randomized values comprise the N hash values.
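By way of non-limiting illustration, the hash operation on each data block can be sketched as follows. The choice of SHA-256 and of keeping only the first eight bytes of the digest are assumptions of this sketch; the embodiment does not name a particular hash function.

```python
import hashlib


def block_hash(block: bytes) -> int:
    """Hash one data block to an integer in [0, 2**64)."""
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:8], "big")


def randomized_values(blocks):
    """Return one hash value per block, in the same order as the blocks."""
    return [block_hash(block) for block in blocks]
```

Because the hash is deterministic, identical blocks always receive the same randomized value and are therefore always mapped to the same sub-value range.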

The following steps may also be implemented when the processor executes the program: determining the correspondence between the N randomized values and the M sub-value ranges includes: in the case where the total value range is [0, a], determining a divisor b of the modulo operation, wherein a and b are integers greater than 0 and b is greater than a; taking each of the N hash values modulo the divisor b to obtain N moduli in one-to-one correspondence with the N hash values; and determining the correspondence between the N moduli and the M sub-value ranges, and thereby the correspondence between the N randomized values respectively corresponding to the N moduli and the M sub-value ranges.
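By way of non-limiting illustration, the modulo step can be sketched as follows. The sketch assumes b = a + 1, the smallest integer greater than a, so that every modulus falls inside [0, a]; the embodiment only requires that b be greater than a. The sub-value ranges are given with inclusive bounds, and the function names are hypothetical.

```python
def to_modulus(hash_value: int, a: int) -> int:
    """Reduce a hash value into the total value range [0, a]."""
    b = a + 1  # assumed divisor: smallest integer greater than a
    return hash_value % b


def assign(hash_values, a, sub_ranges):
    """Map each hash value to a storage space via its modulus.

    sub_ranges: list of (space, low, high) with inclusive bounds that
    together cover [0, a].
    """
    result = []
    for value in hash_values:
        modulus = to_modulus(value, a)
        for space, low, high in sub_ranges:
            if low <= modulus <= high:
                result.append(space)
                break
        else:
            raise ValueError(f"no sub-value range contains modulus {modulus}")
    return result
```

With a = 99 and sub-value ranges [0, 24] and [25, 99], the hash values 24, 125 and 1099 yield the moduli 24, 25 and 99 and are assigned to the first, second and second space respectively.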

The following steps may also be implemented when the processor executes the program: acquiring capacity information of each of M data distributed storage spaces, including: acquiring data storage state information of each of M data distributed storage spaces; and determining the available capacity of each of the M data distributed storage spaces according to the data storage state information, wherein the capacity information comprises the available capacity.
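By way of non-limiting illustration, deriving the available capacity from data storage state information can be sketched as follows. The sketch assumes that the state information consists of a total size and a used size in bytes, and, for the case where a storage space is a data directory, that the state can be read from the file system with `shutil.disk_usage`; both are assumptions rather than details given by the embodiment.

```python
import shutil
from dataclasses import dataclass


@dataclass
class StorageState:
    """Assumed form of the data storage state information of one storage space."""
    total_bytes: int
    used_bytes: int


def available_capacity(state: StorageState) -> int:
    """Available capacity = total capacity minus the capacity already used."""
    return max(state.total_bytes - state.used_bytes, 0)


def directory_state(path: str) -> StorageState:
    """Read the state of the file system that holds a data directory."""
    usage = shutil.disk_usage(path)
    return StorageState(total_bytes=usage.total, used_bytes=usage.used)
```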

The following steps may also be implemented when the processor executes the program: the M data distributed storage spaces include any one of the following: M data distributed storage nodes; M data directories in a plurality of data distributed storage nodes.

The following steps may also be implemented when the processor executes the program: storing data to be stored in M data distributed storage spaces according to M storage weights, including: in the process of storing data to be stored into M data distributed storage spaces according to M storage weights, acquiring respective real-time capacity information of the M data distributed storage spaces; according to the real-time capacity information, respectively distributing weights to the M data distributed storage spaces to obtain M real-time weights corresponding to the M data distributed storage spaces; and storing the data to be stored into M data distributed storage spaces according to the M real-time weights.
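By way of non-limiting illustration, re-deriving the weights from real-time capacity information while the data is being stored can be sketched as follows. The sketch assumes integer capacities in bytes, blocks given as byte strings, SHA-256 as the hash, and a refresh of the weights before every block; the refresh interval is not specified by the embodiment.

```python
import hashlib


def store_with_live_weights(blocks, capacities):
    """Place blocks one by one, refreshing the capacity weights before each block."""
    remaining = dict(capacities)
    placement = {space: [] for space in remaining}
    for block in blocks:
        size = len(block)
        # Real-time capacity information: only spaces that can still hold the block.
        candidates = {s: free for s, free in remaining.items() if free >= max(size, 1)}
        if not candidates:
            raise RuntimeError("insufficient capacity for block")
        total = sum(candidates.values())
        # Each candidate owns a sub-range of [0, total) as wide as its free
        # capacity, so its real-time weight is free / total.
        point = int.from_bytes(hashlib.sha256(block).digest()[:8], "big") % total
        running = 0
        for space, free in candidates.items():
            running += free
            if point < running:
                break
        placement[space].append(block)
        remaining[space] -= size
    return placement, remaining
```

As a space fills up, its sub-range shrinks, so later blocks are steered toward the spaces that still have the most free capacity.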

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: acquiring the data to be stored and the respective capacity information of M data distributed storage spaces; allocating weights to the M data distributed storage spaces according to their respective capacity information, to obtain M storage weights corresponding to the M data distributed storage spaces; and storing the data to be stored into the M data distributed storage spaces according to the M storage weights.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: storing data to be stored in M data distributed storage spaces according to M storage weights, including: performing data slicing on data to be stored to obtain N data blocks; and respectively storing the N data blocks into M data distributed storage spaces according to the M storage weights, wherein the number of the data blocks stored in the storage space with larger corresponding storage weight in the M data distributed storage spaces is larger.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: storing the N data blocks into the M data distributed storage spaces respectively according to the M storage weights includes: determining N randomized values in one-to-one correspondence with the N data blocks; dividing the total value range into M sub-value ranges in one-to-one correspondence with the M storage weights, wherein the sizes of the M sub-value ranges are positively correlated with the corresponding storage weights; determining the correspondence between the N randomized values and the M sub-value ranges, and determining, according to the correspondence, the data distributed storage space corresponding to each of the N data blocks; and storing the N data blocks into their corresponding data distributed storage spaces.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: determining N randomized values in one-to-one correspondence with the N data blocks includes: performing a hash operation on each of the N data blocks to generate N hash values in one-to-one correspondence with the N data blocks, wherein the N randomized values comprise the N hash values.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: determining the correspondence between the N randomized values and the M sub-value ranges includes: in the case where the total value range is [0, a], determining a divisor b of the modulo operation, wherein a and b are integers greater than 0 and b is greater than a; taking each of the N hash values modulo the divisor b to obtain N moduli in one-to-one correspondence with the N hash values; and determining the correspondence between the N moduli and the M sub-value ranges, and thereby the correspondence between the N randomized values respectively corresponding to the N moduli and the M sub-value ranges.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: acquiring capacity information of each of M data distributed storage spaces, including: acquiring data storage state information of each of M data distributed storage spaces; and determining the available capacity of each of the M data distributed storage spaces according to the data storage state information, wherein the capacity information comprises the available capacity.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: the M data distributed storage spaces include any one of the following: M data distributed storage nodes; M data directories in a plurality of data distributed storage nodes.

The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: storing data to be stored in M data distributed storage spaces according to M storage weights, including: in the process of storing data to be stored into M data distributed storage spaces according to M storage weights, acquiring respective real-time capacity information of the M data distributed storage spaces; according to the real-time capacity information, respectively distributing weights to the M data distributed storage spaces to obtain M real-time weights corresponding to the M data distributed storage spaces; and storing the data to be stored into M data distributed storage spaces according to the M real-time weights.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. A method of data storage, comprising:

acquiring capacity information of data to be stored and M data distributed storage spaces;

according to the capacity information of each of the M data distributed storage spaces, respectively distributing weights to each of the M data distributed storage spaces to obtain M storage weights corresponding to each of the M data distributed storage spaces;

and storing the data to be stored into the M data distributed storage spaces according to the M storage weights.

2. The method of claim 1, wherein storing the data to be stored in the M data distributed storage spaces according to the M storage weights comprises:

performing data slicing on the data to be stored to obtain N data blocks;

and respectively storing the N data blocks into the M data distributed storage spaces according to the M storage weights, wherein the number of the data blocks stored in the storage space with larger corresponding storage weight in the M data distributed storage spaces is larger.

3. The method of claim 2, wherein storing the N data blocks into the M data distributed storage spaces, respectively, according to the M storage weights, comprises:

determining N randomized values corresponding to the N data blocks one by one;

dividing the total value range into M sub value ranges which are in one-to-one correspondence with the M storage weights, wherein the range sizes of the M sub value ranges are positively correlated with the corresponding storage weights;

determining the corresponding relation between the N randomized values and the M sub-value ranges, and determining the data distributed storage spaces corresponding to the N data blocks according to the corresponding relation;

and respectively storing the N data blocks into the data distributed storage spaces corresponding to the N data blocks.

4. The method of claim 3, wherein determining N randomized values for the N data blocks in a one-to-one correspondence comprises:

and respectively carrying out hash operation on the N data blocks to generate N hash values corresponding to the N data blocks one by one, wherein the N randomized values comprise the N hash values.

5. The method of claim 4, wherein determining the correspondence between the N randomized values and the M sub-value ranges comprises:

determining a divisor b of the modulo operation under the condition that the total value range is [0, a], wherein a and b are integers greater than 0, and b is greater than a;

taking each of the N hash values modulo the divisor b to obtain N moduli corresponding to the N hash values one by one;

and determining the corresponding relation between the N moduli and the M sub-value ranges, and determining the corresponding relation between the N randomized values respectively corresponding to the N moduli and the M sub-value ranges.

6. The method according to any one of claims 1 to 5, wherein acquiring the capacity information of each of the M data distributed storage spaces includes:

acquiring data storage state information of each of the M data distributed storage spaces;

and determining the available capacity of each of the M data distributed storage spaces according to the data storage state information, wherein the capacity information comprises the available capacity.

7. The method of any one of claims 1 to 5, wherein the M data distributed storage spaces comprise any one of:

M data distributed storage nodes; M data directories in a plurality of data distributed storage nodes.

8. The method according to any one of claims 1 to 5, wherein storing the data to be stored in the M data distributed storage spaces according to the M storage weights comprises:

acquiring respective real-time capacity information of the M data distributed storage spaces in the process of storing the data to be stored into the M data distributed storage spaces according to the M storage weights;

according to the real-time capacity information, respectively distributing weights to the M data distributed storage spaces, and obtaining M real-time weights corresponding to the M data distributed storage spaces;

and storing the data to be stored into the M data distributed storage spaces according to the M real-time weights.

9. A data storage device, comprising:

the acquisition module is used for acquiring the data to be stored and the capacity information of each of the M data distributed storage spaces;

the distribution module is used for distributing weights to the M data distributed storage spaces according to the respective capacity information of the M data distributed storage spaces to obtain M storage weights corresponding to the M data distributed storage spaces;

and the storage module is used for storing the data to be stored into the M data distributed storage spaces according to the M storage weights.

10. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored program, wherein the program, when run, controls a device in which the non-volatile storage medium is located to perform the data storage method according to any one of claims 1 to 8.

11. An electronic device comprising one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data storage method of any of claims 1-8.