[go: up one dir, main page]

CN108021702A - Classification storage method, device, OLAP database system and medium based on LSM-tree - Google Patents

Classification storage method, device, OLAP database system and medium based on LSM-tree Download PDF

Info

Publication number
CN108021702A
CN108021702A CN201711437794.8A CN201711437794A CN108021702A CN 108021702 A CN108021702 A CN 108021702A CN 201711437794 A CN201711437794 A CN 201711437794A CN 108021702 A CN108021702 A CN 108021702A
Authority
CN
China
Prior art keywords
file
merging
basic
files
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711437794.8A
Other languages
Chinese (zh)
Inventor
李超勇
牟宇航
马如悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201711437794.8A priority Critical patent/CN108021702A/en
Publication of CN108021702A publication Critical patent/CN108021702A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/219Managing data history or versioning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a kind of classification storage method based on LSM tree, it is characterised in that is stored using the file of the tree structure tissue database based on LSM tree, the file for importing database is merged operation;The union operation includes accumulation union operation and basic union operation;The accumulation union operation includes:First kind file is merged into operation, the first kind file includes having imported database, not merged operation, and meets the file of accumulation merging condition;The basic union operation includes:The file of Second Type file and the accumulation union operation generation is merged into operation, the Second Type file includes having imported database, not merged operation, and is unsatisfactory for the file of accumulation merging condition.Embodiment provided by the invention improves the efficiency of data merging, ensure that performance requirement of the storage engines in the case of large-scale concurrent read-write.

Description

Hierarchical storage method and device based on LSM-tree, OLAP database system and medium
Technical Field
The invention relates to the technical field of information, in particular to a hierarchical storage method and device based on an LSM-tree, an OLAP database system and a medium.
Background
In the prior art, an OLAP (Online Analytical Processing) database storage organization is mainly based on hdfs (Hadoop Distributed File System) block storage at present, and the database is only responsible for File formats, such as Hive's Orcfile, Spark's partial, and the like. Taking Hive as an example, Hive is a data warehouse infrastructure established on Hadoop; hive has no special data storage format and establishes no index for data; all data in Hive is stored in hdfs, and the Hive contains the following data models: table (Table), External Table (External Table), Partition (Partition), Bucket (Bucket); tables in Hive are similar in concept to tables in the database, and each Table has a corresponding directory in Hive to store data. The conventional OLAP database lacks an efficient storage organization, and the reading and writing performance of data is limited based on hdfs.
The LSM-Tree (Log-Structured Merge-Tree) is mainly used in a KV (Key-Value) system, and is mainly used in OLAP for KV analog OLAP operation, and data scanning and aggregation are completed by a large number of KV operations. The KV database is a storage database using key values, which is a NoSQL (non-relational database) model in which data is organized, indexed, and stored in the form of key-value pairs. Most of the existing LSM-Tree-based realization is based on KV storage, and a storage organization of a relation model is not provided; and the LSM-Tree of KV often realizes the compact (merging) only into one layer, and for an OLAP database, the IO peak caused by the compact is overlapped.
Disclosure of Invention
Embodiments of the present invention provide a hierarchical storage method and apparatus based on an LSM-tree, an OLAP database system, and a medium, so as to at least solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a hierarchical storage method based on an LSM-tree, where the method organizes file storage of a database by using a tree structure based on the LSM-tree, and merges files imported into the database; the merge operation comprises an accumulation merge operation and a basic merge operation; the cumulative merge operation includes: merging the first type files, wherein the first type files comprise files which are imported into a database, do not undergo merging operation and meet the accumulation merging condition; the basic merge operation includes: and merging the second type file and the file generated by the accumulative merging operation, wherein the second type file comprises the file which is imported into the database, is not subjected to the merging operation and does not meet the accumulative merging condition.
With reference to the first aspect, in a first implementation manner of the first aspect, the cumulative merging condition includes that a size of the file is equal to or smaller than a preset file size threshold.
With reference to the first aspect, the present invention, in a second implementation manner of the first aspect, further includes: performing the cumulative merging operation under the condition that the number of files which are imported into the database and are not subjected to the merging operation is greater than or equal to a preset first file number threshold value; or, the cumulative merging operation is performed when the number of files of the first type of file is greater than or equal to a preset first file number threshold.
With reference to the first aspect, the present invention, in a third embodiment of the first aspect, further includes: and performing the basic merging operation when the number of files generated by the accumulative merging operation is greater than or equal to a preset second file number threshold value.
With reference to the first aspect, in a fourth implementation manner of the first aspect, the basic merge operation is performed when a ratio of a total number of bytes of the accumulated version file to a total number of bytes of the basic version file is greater than or equal to a preset ratio threshold; the accumulated version files comprise all files generated by the accumulated merging operation in the existing database, and the basic version files comprise all files generated by the basic merging operation in the existing database.
With reference to the first aspect, the present invention, in a fifth implementation manner of the first aspect, further includes: and performing the basic merging operation once every preset time threshold.
With reference to the first aspect, the present invention, in a sixth implementation manner of the first aspect, further includes: and triggering the basic merging operation at the preset time point.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, the fifth implementation manner of the first aspect, or the sixth implementation manner of the first aspect, the method further includes: the value range of the file size threshold is 3-5G; and/or the value range of the first file number threshold is 5-8; and/or the value range of the second file number threshold is 5-8; and/or the value range of the proportional threshold is 30% -50%; and/or the value range of the time threshold is 7 days to 10 days.
In a second aspect, an embodiment of the present invention provides a hierarchical storage apparatus based on an LSM-tree, where the apparatus is configured to: organizing file storage of a database by adopting an LSM-tree-based tree structure, and merging files imported into the database; the merge operation comprises an accumulation merge operation and a basic merge operation; the device comprises an accumulation merging unit and a basic merging unit; the accumulation merge unit is configured to: merging the first type files, wherein the first type files comprise files which are imported into a database, do not undergo merging operation and meet the accumulation merging condition; the basic merging unit is used for: and merging the second type file and the file generated by the accumulative merging operation, wherein the second type file comprises the file which is imported into the database, is not subjected to the merging operation and does not meet the accumulative merging condition.
With reference to the second aspect, in a first implementation manner of the second aspect, the cumulative merging condition includes that the size of the file is smaller than or equal to a preset file size threshold.
With reference to the second aspect, in a second implementation manner of the second aspect, the accumulation and combination unit is further configured to: performing the cumulative merging operation under the condition that the number of files which are imported into the database and are not subjected to the merging operation is greater than or equal to a preset first file number threshold value; or, the cumulative merging operation is performed when the number of files of the first type of file is greater than or equal to a preset first file number threshold.
With reference to the second aspect, in a third embodiment of the second aspect, the basic merging unit is further configured to: and performing the basic merging operation when the number of files generated by the accumulative merging operation is greater than or equal to a preset second file number threshold value.
With reference to the second aspect, in a fourth embodiment of the second aspect, the basic merging unit is further configured to: performing the basic merging operation under the condition that the ratio of the total byte number of the accumulated version file to the total byte number of the basic version file is greater than or equal to a preset ratio threshold; the accumulated version files comprise all files generated by the accumulated merging operation in the existing database, and the basic version files comprise all files generated by the basic merging operation in the existing database.
With reference to the second aspect, in a fifth implementation manner of the second aspect, the basic merging unit is further configured to: and performing the basic merging operation once every preset time threshold.
With reference to the second aspect, in a sixth implementation manner of the second aspect, the basic merging unit is further configured to: and triggering the basic merging operation at the preset time point.
With reference to the second aspect, the first embodiment of the second aspect, the second embodiment of the second aspect, the third embodiment of the second aspect, the fourth embodiment of the second aspect, the fifth embodiment of the second aspect, or the sixth embodiment of the second aspect, the method further includes: the value range of the file size threshold is 3-5G; and/or the value range of the first file number threshold is 5-8; and/or the value range of the second file number threshold is 5-8; and/or the value range of the proportional threshold is 30% -50%; and/or the value range of the time threshold is 7 days to 10 days.
In one possible design, the hierarchical LSM-tree-based storage device includes a processor and a memory, the memory is used for storing a program that supports the hierarchical LSM-tree-based storage device to execute the hierarchical LSM-tree-based storage method in the first aspect, and the processor is configured to execute the program stored in the memory.
In a third aspect, an embodiment of the present invention provides an OLAP database system, where the OLAP database system includes: one or more processors; storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of the first aspects described above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to any one of the first aspect.
One of the above technical solutions has the following advantages or beneficial effects: the data is converted from random writing to sequential writing by adopting a hierarchical storage strategy based on an LSM-tree; meanwhile, a new version (version) is generated in the writing process of each batch, the reading of the existing data is not influenced in the writing process, the read-write separation is realized, and the concurrency control is realized. In addition, by adopting a hierarchical storage strategy based on the LSM-tree, an optimal compromise is obtained among query efficiency, merging frequency and disk IO, and the reading and writing efficiency of the OLAP database storage engine is improved. The adoption of the compact (merging) strategy constructed by the embodiment of the invention improves the efficiency of data compact and ensures the performance requirement of the storage engine under the condition of large-scale concurrent reading and writing.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is an overall framework diagram of a hierarchical storage method based on an LSM-tree according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a two-level data organization strategy according to a preferred embodiment of the hierarchical storage method based on the LSM-tree provided in the present invention;
FIG. 3 is an overall block diagram of an LSM-tree based hierarchical storage device according to an embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
The embodiment of the invention provides a hierarchical storage method based on an LSM-tree. FIG. 1 is an overall framework diagram of a hierarchical storage method based on an LSM-tree according to an embodiment of the present invention. The hierarchical storage method based on the LSM-tree of the embodiment of the invention adopts the tree structure based on the LSM-tree to organize the file storage of the database, and carries out merging operation on the files imported into the database; the merge operation comprises an accumulation merge operation and a basic merge operation; as shown in fig. 1, in step S110, the cumulative merging operation includes: merging the first type files, wherein the first type files comprise files which are imported into a database, do not undergo merging operation and meet the accumulation merging condition; step S120, the basic merging operation includes: and merging the second type file and the file generated by the accumulative merging operation, wherein the second type file comprises the file which is imported into the database, is not subjected to the merging operation and does not meet the accumulative merging condition.
Specifically, the LSM-Tree structure performs delay and batch processing on index changes, and through some algorithm, for example, a merging and sorting-like manner, merging operations can be efficiently performed to generate new version files. The purpose of the compact operation is to sort the internal data and sort the randomly written out-of-order data into ordered data. In the LSM-Tree engine, the compact can adopt a multi-path merging mode to sequentially read a plurality of data files and merge and sort the data files. The LSM-Tree structure generally comprises a series of delay placement mechanisms, which are the concepts of structured merge trees, and the core idea of the LSM-Tree structure is that data needs to be merged without updating every time, and the data are merged and added into the Tree structure in a merging and sorting mode after a certain amount of data are accumulated.
The embodiment of the invention adopts the data structure of the LSM-tree to construct the storage organization of the database, and adopts the LSM-tree to delay and update the indexes in batches by a hierarchical storage method. The disk data compact (merge) is divided into a cumulative compact (cumulative merge) and a base compact (basic merge); in one embodiment, the storage is divided into two levels of policy, a cumulant compact is responsible for merging newly imported version files, and a base compact is responsible for merging files generated by cumulant.
The embodiment of the invention realizes that one LSM-Tree reads and writes data, and the data is converted from random writing to sequential writing, because the combination is carried out in a mode of index one-way traversal, and the combined index is also added to the Tree structure in a sequential mode, so that the internal data of the leaf nodes of the Tree structure are organized in order, and the efficiency of retrieval and query is increased. The LSM-Tree scheme is that after the latest updating operation is accumulated to a certain degree, batch merging operation is performed, so that random writing is changed into batch sequential updating. Meanwhile, a new version (version) is generated in the writing process of each batch, the reading of the existing data is not influenced in the writing process, and the read-write separation is realized. The read operation is independent of the write operation, so there is no contention between the two operations. Specifically, a new version is generated when data is written in each batch, and when data is read, the current version is acquired first, and the data is read according to the current version, so that read-write separation is realized, and the LSM-Tree realizes concurrency control through the multi-version mechanism.
The database in the prior art lacks an efficient storage organization, and for example, an OLAP database is limited in write performance due to reading of data caused by hdfs. As described above, by using the composition (merging) strategy constructed in the embodiment of the present invention, the hierarchical storage strategy improves the efficiency of data composition, and changes from random writing to sequential writing and read-write separation, thereby ensuring the performance requirements of the storage engine under the condition of large-scale concurrent read-write.
In most cases, the compact process and the write-in thread compete for IO resources of the disk, and if the two things are performed simultaneously, even the random IO times are additionally increased; moreover, to improve the query efficiency, the frequency of merging operations needs to be increased, which results in more frequent disk IO, and the frequent disk IO may affect the query efficiency. Therefore, how to balance and efficiently complete the process of disk transactions is a key to how well a storage engine can be implemented. In the prior art, the LSM-Tree realization compact is often divided into only one layer, and in this case, the basic version file is read during each query; because the composition has only one layer, the basic version file is large, a lot of IO is consumed if the composition is merged with the basic version file every operation, and the query efficiency cannot be guaranteed if the composition is not merged with the basic version file, so for the OLAP database, the IO peak caused by the composition (merging) overlaps. The embodiment of the invention adopts the grading strategies of accumulation combination and basic combination, so that the accumulated version file generated by accumulation combination is much smaller than the basic version file, the IO consumption is greatly reduced during operation, and meanwhile, the accumulated version file is orderly organized after combination, and the query efficiency can be ensured. Therefore, the embodiment of the invention obtains an optimal compromise among query efficiency, merging frequency and disk IO through the hierarchical storage strategy.
According to an embodiment of the LSM-tree-based hierarchical storage method of the present invention, the cumulative merging condition includes that the size of the file is smaller than or equal to a preset file size threshold. The file size threshold may be determined according to the size of the amount of data, such as: the value range of the file size threshold can be set to be 3-5G.
Preferably, the file size threshold is set to 4G. In this embodiment, only files less than or equal to 4G are cumulatively combined, and files greater than 4G are not cumulatively combined, and basic combining is directly reserved. As described above, the size of the accumulated version file is controlled within a certain range, so that the consumption of IO can be reduced and the query efficiency can be ensured.
According to an embodiment of the LSM-tree-based hierarchical storage method of the present invention, the method further includes: performing the cumulative merging operation under the condition that the number of files which are imported into the database and are not subjected to the merging operation is greater than or equal to a preset first file number threshold value; or, the cumulative merging operation is performed when the number of files of the first type of file is greater than or equal to a preset first file number threshold. The value range of the first file number threshold is 5-8. Carrying out accumulation merging operation when the number of newly introduced files reaches a certain number, counting the number of files which are introduced into the database and are not subjected to merging operation in one embodiment, and carrying out accumulation merging when the number of newly introduced files reaches a first file number threshold value; in another embodiment, the number of files of the first type of file is counted, that is, only small files (the size is smaller than or equal to a preset file size threshold) in the newly imported files are counted, and when the number of small files reaches the first file number threshold, the small files are accumulated and merged.
Fig. 2 is a schematic diagram of a two-level data organization strategy according to a preferred embodiment of the LSM-tree-based hierarchical storage method provided by the present invention. The Base File in FIG. 2 is a Base version File, i.e., a File generated by all basic merge operations in an existing database, such as the files numbered from 0-80 as shown; cumulant Deltas is the Cumulative delta, as illustrated as files numbered from 81-87; deltas is the delta, i.e., the newly imported file (the file that has been imported into the database, and for which no merge operation has been performed), as illustrated by the files with data numbers from 81-87. The storage strategy illustrated in FIG. 2 is as follows: when 7 files (numbered 81-87) are newly imported successively, the cumulant Deltas are gradually increased from 81-84 to 81-86 and finally accumulated to 81-87. If the first file number threshold is set to 5, the number of Deltas files reaches 5 (81-84) after the file with the number of 84 is imported, and at this time, the cumulative merge operation is performed to merge the 5 files 81-84 into one large file, i.e., the cumulative version file.
According to an embodiment of the LSM-tree-based hierarchical storage method of the present invention, the method further includes: and performing the basic merging operation when the number of files generated by the accumulative merging operation is greater than or equal to a preset second file number threshold value. The value range of the second file number threshold is 5-8. If the second file number threshold is set to 5, when the number of files generated by the cumulative merging operation reaches 5, the basic merging operation is performed. A merge process may be provided which, when the number of files is sufficient, will merge the files into a smaller but larger file. The strategy of hierarchical storage is to synthesize small files into medium-sized files and then synthesize the medium-sized files into large files.
According to one embodiment of the hierarchical storage method based on the LSM-tree, the basic merging operation is performed under the condition that the ratio of the total byte number of the accumulated version file to the total byte number of the basic version file is greater than or equal to a preset ratio threshold; the accumulated version files comprise all files generated by the accumulated merging operation in the existing database, and the basic version files comprise all files generated by the basic merging operation in the existing database. Preferably, the value range of the ratio threshold is 30% -50%. When the accumulated version file reaches a certain amount, basic merging operation is carried out, namely the orderliness of data is increased, the merging operation cannot be too frequent, the read-write performance of a system is improved, and the optimal compromise is obtained among the query efficiency, the merging frequency and the disk IO.
According to an embodiment of the LSM-tree-based hierarchical storage method of the present invention, the method further includes: and performing the basic merging operation once every preset time threshold. That is, after the preset time threshold time after the last basic merging operation, the current basic merging operation is performed. The value range of the time threshold may be determined according to the size of the data volume, for example, the value range of the time threshold is set to 7 days to 10 days. Preferably, the time threshold may be set to 7 days, and the basic combining operation is performed 7 days after the last basic combining operation.
According to an embodiment of the LSM-tree-based hierarchical storage method of the present invention, the method further includes: and triggering the basic merging operation at the preset time point. Preferably, the preset trigger time point can be set in the evening time period, the reading and writing operations are less in the evening time period under the normal condition, the system resources are sufficient, and the time period is suitable for basic merging operation.
In another aspect, an embodiment of the present invention provides a hierarchical storage device based on an LSM-tree, and fig. 3 is an overall framework diagram of the hierarchical storage device based on the LSM-tree according to the embodiment of the present invention. The apparatus is for: organizing file storage of a database by adopting an LSM-tree-based tree structure, and merging files imported into the database; the merge operation comprises an accumulation merge operation and a basic merge operation; as shown in fig. 3, the apparatus includes an accumulation combining unit 100 and a basic combining unit 200; the accumulation merge unit 100 is configured to: merging the first type files, wherein the first type files comprise files which are imported into a database, do not undergo merging operation and meet the accumulation merging condition; the basic merging unit 200 is configured to: and merging the second type file and the file generated by the accumulative merging operation, wherein the second type file comprises the file which is imported into the database, is not subjected to the merging operation and does not meet the accumulative merging condition.
According to an embodiment of the LSM-tree based storage apparatus of the present invention, the cumulative merge condition includes that the size of the file is equal to or smaller than a preset file size threshold.
According to an embodiment of the LSM-tree based hierarchical storage apparatus of the present invention, the accumulation merge unit 100 is further configured to: performing the cumulative merging operation under the condition that the number of files which are imported into the database and are not subjected to the merging operation is greater than or equal to a preset first file number threshold value; or, the cumulative merging operation is performed when the number of files of the first type of file is greater than or equal to a preset first file number threshold.
According to an embodiment of the LSM-tree based hierarchical storage apparatus of the present invention, the basic merge unit 200 is further configured to: and performing the basic merging operation when the number of files generated by the accumulative merging operation is greater than or equal to a preset second file number threshold value.
According to an embodiment of the LSM-tree based hierarchical storage apparatus of the present invention, the basic merge unit 200 is further configured to: performing the basic merging operation under the condition that the ratio of the total byte number of the accumulated version file to the total byte number of the basic version file is greater than or equal to a preset ratio threshold; the accumulated version files comprise all files generated by the accumulated merging operation in the existing database, and the basic version files comprise all files generated by the basic merging operation in the existing database.
According to an embodiment of the LSM-tree based hierarchical storage apparatus of the present invention, the basic merge unit 200 is further configured to: and performing the basic merging operation once every preset time threshold.
According to an embodiment of the LSM-tree based hierarchical storage apparatus of the present invention, the basic merge unit 200 is further configured to: and triggering the basic merging operation at the preset time point.
According to an embodiment of the hierarchical storage device based on the LSM-tree of the present invention, the hierarchical storage device further includes: the value range of the file size threshold is 3-5G; and/or the value range of the first file number threshold is 5-8; and/or the value range of the second file number threshold is 5-8; and/or the value range of the proportional threshold is 30% -50%; and/or the value range of the time threshold is 7 days to 10 days.
In one possible design, the hierarchical LSM-tree-based storage device includes a processor and a memory, the memory is used for storing a program that supports the hierarchical LSM-tree-based storage device to execute the hierarchical LSM-tree-based storage method in the first aspect, and the processor is configured to execute the program stored in the memory.
An embodiment of the present invention further provides an OLAP database system, where the OLAP database system includes: one or more processors; storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of the first aspects described above.
An embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to any one of the first aspect.
One of the above technical solutions has the following advantages or beneficial effects: the data is converted from random writing to sequential writing by adopting a hierarchical storage strategy based on an LSM-tree; meanwhile, a new version (version) is generated in the writing process of each batch, the reading of the existing data is not influenced in the writing process, the read-write separation is realized, and the concurrency control is realized. In addition, by adopting a hierarchical storage strategy based on the LSM-tree, an optimal compromise is obtained among query efficiency, merging frequency and disk IO, and the reading and writing efficiency of the OLAP database storage engine is improved. The adoption of the compact (merging) strategy constructed by the embodiment of the invention improves the efficiency of data compact and ensures the performance requirement of the storage engine under the condition of large-scale concurrent reading and writing.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments. The embodiment of the device corresponds to the embodiment of the method, so that the description of the embodiment of the device is relatively simple, and the related description can refer to the description of the embodiment of the method.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (18)

1. A hierarchical storage method based on LSM-tree is characterized in that a tree structure based on LSM-tree is adopted to organize the file storage of a database, and the files imported into the database are merged;
the merge operation comprises an accumulation merge operation and a basic merge operation;
the cumulative merge operation includes: merging the first type files, wherein the first type files comprise files which are imported into a database, do not undergo merging operation and meet the accumulation merging condition;
the basic merge operation includes: and merging the second type file and the file generated by the accumulative merging operation, wherein the second type file comprises the file which is imported into the database, is not subjected to the merging operation and does not meet the accumulative merging condition.
2. The method of claim 1, wherein the cumulative merge condition comprises a size of the file being less than or equal to a preset file size threshold.
3. The method of claim 1, further comprising:
performing the cumulative merging operation under the condition that the number of files which are imported into the database and are not subjected to the merging operation is greater than or equal to a preset first file number threshold value; or,
and under the condition that the file number of the first type file is greater than or equal to a preset first file number threshold value, performing the accumulation and combination operation.
4. The method of claim 1, further comprising: and performing the basic merging operation when the number of files generated by the accumulative merging operation is greater than or equal to a preset second file number threshold value.
5. The method of claim 1, further comprising: performing the basic merging operation under the condition that the ratio of the total byte number of the accumulated version file to the total byte number of the basic version file is greater than or equal to a preset ratio threshold; the accumulated version files comprise all files generated by the accumulated merging operation in the existing database, and the basic version files comprise all files generated by the basic merging operation in the existing database.
6. The method of claim 1, further comprising: and performing the basic merging operation once every preset time threshold.
7. The method of claim 1, further comprising: and triggering the basic merging operation at the preset time point.
8. The method according to any one of claims 1-7, further comprising:
the value range of the file size threshold is 3-5G; and/or the presence of a gas in the gas,
the value range of the first file number threshold is 5-8; and/or the presence of a gas in the gas,
the value range of the second file number threshold is 5-8; and/or the presence of a gas in the gas,
the value range of the proportional threshold is 30% -50%; and/or the presence of a gas in the gas,
the time threshold value ranges from 7 days to 10 days.
9. A hierarchical storage device based on an LSM-tree is characterized in that,
the apparatus is for: organizing file storage of a database by adopting an LSM-tree-based tree structure, and merging files imported into the database; the merge operation comprises an accumulation merge operation and a basic merge operation; the device comprises an accumulation merging unit and a basic merging unit;
the accumulation merge unit is configured to: merging the first type files, wherein the first type files comprise files which are imported into a database, do not undergo merging operation and meet the accumulation merging condition;
the basic merging unit is used for: and merging the second type file and the file generated by the accumulative merging operation, wherein the second type file comprises the file which is imported into the database, is not subjected to the merging operation and does not meet the accumulative merging condition.
10. The apparatus of claim 9, wherein the cumulative merge condition comprises a size of the file being less than or equal to a preset file size threshold.
11. The apparatus of claim 9, wherein the accumulation merge unit is further configured to:
performing the cumulative merging operation under the condition that the number of files which are imported into the database and are not subjected to the merging operation is greater than or equal to a preset first file number threshold value; or,
and under the condition that the file number of the first type file is greater than or equal to a preset first file number threshold value, performing the accumulation and combination operation.
12. The apparatus of claim 9, wherein the basic merging unit is further configured to: and performing the basic merging operation when the number of files generated by the accumulative merging operation is greater than or equal to a preset second file number threshold value.
13. The apparatus of claim 9, wherein the basic merging unit is further configured to: performing the basic merging operation under the condition that the ratio of the total byte number of the accumulated version file to the total byte number of the basic version file is greater than or equal to a preset ratio threshold; the accumulated version files comprise all files generated by the accumulated merging operation in the existing database, and the basic version files comprise all files generated by the basic merging operation in the existing database.
14. The apparatus of claim 9, wherein the basic merging unit is further configured to: and performing the basic merging operation once every preset time threshold.
15. The apparatus of claim 9, wherein the basic merging unit is further configured to: and triggering the basic merging operation at the preset time point.
16. The apparatus of any one of claims 9-15, further comprising:
the value range of the file size threshold is 3-5G; and/or the presence of a gas in the gas,
the value range of the first file number threshold is 5-8; and/or the presence of a gas in the gas,
the value range of the second file number threshold is 5-8; and/or the presence of a gas in the gas,
the value range of the proportional threshold is 30% -50%; and/or the presence of a gas in the gas,
the time threshold value ranges from 7 days to 10 days.
17. An OLAP database system, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-8.
18. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201711437794.8A 2017-12-26 2017-12-26 Classification storage method, device, OLAP database system and medium based on LSM-tree Pending CN108021702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711437794.8A CN108021702A (en) 2017-12-26 2017-12-26 Classification storage method, device, OLAP database system and medium based on LSM-tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711437794.8A CN108021702A (en) 2017-12-26 2017-12-26 Classification storage method, device, OLAP database system and medium based on LSM-tree

Publications (1)

Publication Number Publication Date
CN108021702A true CN108021702A (en) 2018-05-11

Family

ID=62071033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711437794.8A Pending CN108021702A (en) 2017-12-26 2017-12-26 Classification storage method, device, OLAP database system and medium based on LSM-tree

Country Status (1)

Country Link
CN (1) CN108021702A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101197A (en) * 2018-08-15 2018-12-28 郑州云海信息技术有限公司 A kind of file stores and accesses method, apparatus, file memory and system
CN110716690A (en) * 2018-07-12 2020-01-21 阿里巴巴集团控股有限公司 Data recovery method and system
CN111831622A (en) * 2020-03-31 2020-10-27 北京嘀嘀无限科技发展有限公司 Data index generation method, apparatus, electronic device and readable storage medium
CN111881092A (en) * 2020-06-22 2020-11-03 武汉绿色网络信息服务有限责任公司 A method and device for merging files based on cassandra database
CN113204564A (en) * 2021-05-20 2021-08-03 山东英信计算机技术有限公司 Database high-frequency SQL query method, system and storage medium
CN113297141A (en) * 2021-05-06 2021-08-24 维沃移动通信有限公司 File merging method and device, electronic equipment and storage medium
CN114398378A (en) * 2022-03-25 2022-04-26 北京奥星贝斯科技有限公司 Method and apparatus for determining index cost
CN115994120A (en) * 2023-03-23 2023-04-21 北京飞轮数据科技有限公司 Data file merging method, device, electronic equipment and computer readable medium
WO2024235197A1 (en) * 2023-05-17 2024-11-21 抖音视界有限公司 Operation method for file, and electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009076010A1 (en) * 2007-12-06 2009-06-18 Microsoft Corporation Document merge
CN103577454A (en) * 2012-08-01 2014-02-12 华为技术有限公司 Method and device for merging files
CN104809237A (en) * 2015-05-12 2015-07-29 百度在线网络技术(北京)有限公司 LSM-tree (The Log-Structured Merge-Tree) index optimization method and LSM-tree index optimization system
CN106599247A (en) * 2016-12-19 2017-04-26 北京奇虎科技有限公司 Method and device for merging data file in LSM-tree structure
CN106855861A (en) * 2015-12-09 2017-06-16 北京金山安全软件有限公司 File merging method and device and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009076010A1 (en) * 2007-12-06 2009-06-18 Microsoft Corporation Document merge
CN103577454A (en) * 2012-08-01 2014-02-12 华为技术有限公司 Method and device for merging files
CN104809237A (en) * 2015-05-12 2015-07-29 百度在线网络技术(北京)有限公司 LSM-tree (The Log-Structured Merge-Tree) index optimization method and LSM-tree index optimization system
CN106855861A (en) * 2015-12-09 2017-06-16 北京金山安全软件有限公司 File merging method and device and electronic equipment
CN106599247A (en) * 2016-12-19 2017-04-26 北京奇虎科技有限公司 Method and device for merging data file in LSM-tree structure

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716690A (en) * 2018-07-12 2020-01-21 阿里巴巴集团控股有限公司 Data recovery method and system
CN110716690B (en) * 2018-07-12 2023-02-28 阿里巴巴集团控股有限公司 Data recovery method and system
CN109101197A (en) * 2018-08-15 2018-12-28 郑州云海信息技术有限公司 A kind of file stores and accesses method, apparatus, file memory and system
CN111831622A (en) * 2020-03-31 2020-10-27 北京嘀嘀无限科技发展有限公司 Data index generation method, apparatus, electronic device and readable storage medium
CN111881092A (en) * 2020-06-22 2020-11-03 武汉绿色网络信息服务有限责任公司 A method and device for merging files based on cassandra database
CN113297141A (en) * 2021-05-06 2021-08-24 维沃移动通信有限公司 File merging method and device, electronic equipment and storage medium
CN113204564A (en) * 2021-05-20 2021-08-03 山东英信计算机技术有限公司 Database high-frequency SQL query method, system and storage medium
CN113204564B (en) * 2021-05-20 2023-02-28 山东英信计算机技术有限公司 Database high-frequency SQL query method, system and storage medium
CN114398378A (en) * 2022-03-25 2022-04-26 北京奥星贝斯科技有限公司 Method and apparatus for determining index cost
CN115994120A (en) * 2023-03-23 2023-04-21 北京飞轮数据科技有限公司 Data file merging method, device, electronic equipment and computer readable medium
WO2024235197A1 (en) * 2023-05-17 2024-11-21 抖音视界有限公司 Operation method for file, and electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN108021702A (en) Classification storage method, device, OLAP database system and medium based on LSM-tree
CN110149803B (en) Data storage method, system and terminal equipment
CN105117417B (en) A kind of memory database Trie tree indexing means for reading optimization
CN105320775A (en) Data access method and apparatus
US10642814B2 (en) Signature-based cache optimization for data preparation
CN104160398B (en) Content structuring method and system used in large object data
US20130080393A1 (en) System and Method for Storing Stream Data in Distributed Relational Tables with Data Provenance
CN111190895B (en) Organization method, device and storage medium of column-type storage data
Xanthakis et al. Parallax: Hybrid key-value placement in lsm-based key-value stores
US12339823B2 (en) Data storage device and storage control method based on log-structured merge tree
US10642815B2 (en) Step editor for data preparation
Amur et al. Design of a write-optimized data store
CN110109894A (en) Implementation method, device, storage medium and the equipment of non-relational database
CN102542057A (en) High dimension data index structure design method based on solid state hard disk
CN115576947A (en) Data management method and device, combined library, electronic equipment and storage medium
CN102169497B (en) Method and device for managing metadata through bitmaps
CN114579617A (en) Data query method and device, computer equipment and storage medium
Carter et al. Nanosecond indexing of graph data with hash maps and VLists
Roumelis et al. Bulk-loading and bulk-insertion algorithms for xBR+-trees in solid state drives
Alam et al. Performance of point and range queries for in-memory databases using radix trees on GPUs
Morfonios et al. Supporting the data cube lifecycle: the power of ROLAP
CN118170820A (en) Hierarchical heterogeneous IoT time series data retrieval system and method for NVM
CN118796859A (en) A database access method and device based on MyBatis framework
CN113495901B (en) Quick retrieval method for variable-length data blocks
CN110110034A (en) A kind of RDF data management method, device and storage medium based on figure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20180511