Disclosure of Invention
Aiming at the technical bottleneck, the application provides a novel hard disk garbage recycling method. More specifically, the application provides a GC recycling method based on a hash caching mechanism, which aims to improve the overall performance of garbage recycling by optimizing the concurrent processing efficiency of metadata reverse check.
In order to achieve the above object, the present application provides the following technical solutions:
The first aspect of the application provides a method for recycling hard disk garbage, which comprises the following steps:
s1, constructing an index container hash tree, wherein the hash tree takes a triplet of an object identifier, a data key and an attribute key as a hash key, and takes recording interval information and a corresponding storage object identifier as hash values, and the recording interval information comprises a recording interval range and a recording type;
s2, generating a hash key according to metadata information of the storage object to be recovered, and searching record interval information corresponding to the metadata information through the hash tree;
s3, judging the validity of the data based on the record interval information obtained by searching in the previous step and the storage object identification of the current storage object to be recovered;
And S4, migrating the effective data to the new storage object according to the judgment result of the previous step, and releasing the garbage space in the original storage object.
Optionally, in the method of the present application, the hash value in step S1 includes a lower limit, an upper limit, a record type and a corresponding storage object identifier of a record interval, where if the data is of an array type, the lower limit and the upper limit of the record interval are real boundaries of the array interval, and if the data is of a single value type, the lower limit and the upper limit of the record interval are both single-value values.
Optionally, in the method of the present application, the following determination rule is executed in step S3 to determine the validity of the data:
S31, if the recording section information exists and matches the recording section of the metadata information:
When the storage object identifier corresponding to the recording interval information is larger than the storage object identifier of the current storage object to be recovered, judging that the data corresponding to the recording interval information is junk data;
When the storage object identifier corresponding to the record interval information is smaller than or equal to the storage object identifier of the current storage object to be recovered, re-initiating the reverse check and updating the storage object identifier in the hash tree;
S32, if the recording section information exists but intersects or overlaps with the recording section of the metadata information:
When the storage object identifier corresponding to the recording interval information is larger than the storage object identifier of the current storage object to be recovered, judging that the data of the crossed or covered part is junk data;
when the storage object identifier corresponding to the record interval information is smaller than or equal to the storage object identifier of the current storage object to be recovered, reinitiating the rechecking to determine the data validity;
and S33, if the recorded interval information does not exist, directly initiating reverse check to determine the validity of the data.
Optionally, in the method of the present application, the step S31 of reinitiating the inverse search and updating the storage object identifier in the hash tree includes obtaining the latest storage object identifier through the inverse search, and if the latest storage object identifier is greater than the storage object identifier of the current storage object to be recovered, updating the storage object identifier corresponding to the record in the hash tree to the latest storage object identifier.
Optionally, the method of the present application further comprises:
and S5, sequencing the recorded interval information in the reverse checking result from small to large according to the interval lower limit so as to optimize the processing efficiency of the subsequent interval.
Optionally, the method of the present application further comprises:
and S6, when the crossed or covered record sections exist in the reverse checking result, splitting the record sections according to the principle that the sections with large storage object identifiers cover the sections with small storage object identifiers.
Optionally, the method of the present application further comprises:
and S7, when the continuous record interval with consistent storage object identification exists in the check result, merging the record interval into a larger record interval.
Optionally, in the method of the present application, the hash tree in step S1 is a cache hash tree based on a Least Recently Used (LRU) policy, and is used for preferentially retaining the record interval information of the high-frequency access, so as to reduce the number of the anti-check requests.
A second aspect of the present application provides a hard disk waste recycling device, the device comprising:
the hash tree construction module is used for constructing an index container hash tree, wherein the hash tree takes a triplet of an object identifier, a data key and an attribute key as a hash key, and takes recorded interval information and a corresponding storage object identifier as hash values, and the recorded interval information comprises a recorded interval range and a recorded type;
The hash searching module is used for generating a hash key according to the metadata information of the storage object to be recovered, and searching record interval information corresponding to the metadata information through the hash tree;
The data validity judging module is used for searching the obtained record interval information and the storage object identification of the current storage object to be recovered based on the hash searching module and judging the data validity according to a preset judging rule;
And the data processing module is used for migrating the effective data to the new storage object according to the judging result of the data effectiveness judging module and releasing the garbage space in the original storage object.
The device realizes the steps of the hard disk garbage recycling method during operation.
Further, the device of the application further comprises:
The recording interval sequencing module is used for sequencing the recording interval information in the reverse checking result from small interval lower limit to large interval lower limit so as to optimize the subsequent interval processing efficiency;
The recording interval splitting module is used for splitting the recording interval according to the principle that the interval with large storage object identification covers the interval with small storage object identification when the crossed or covered recording interval exists in the reverse checking result;
And the recording interval merging module is used for merging the continuous recording intervals with consistent storage object identifiers into larger recording intervals when the back check results exist.
A third aspect of the application provides an electronic device comprising a memory and a processor;
A memory for storing a computer program;
and the processor is used for executing the computer program to realize the steps of the hard disk garbage collection method.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned hard disk garbage collection method.
In summary, the hard disk garbage recycling method based on hash cache provided by the application realizes performance optimization and stability improvement in a plurality of technical layers by introducing the reverse-check hash cache mechanism, and has the following specific advantages:
(1) The reverse check rpc request overhead is remarkably reduced, namely, the reverse check hash cache is used for pre-storing or dynamically caching the metadata key information, so that a large number of rpc reverse check requests are prevented from being repeatedly initiated, the processing load of network communication and a server side is effectively reduced, and the communication link time delay in the reverse check process is reduced.
(2) The limitation of the anti-checking concurrency capability is broken through, namely the anti-checking concurrency amount in the traditional scheme is limited by the processing capability and resource competition of rpc requests, the scheme localizes the metadata information of high-frequency anti-checking through the hash cache, the validity judgment of more data objects can be processed in batches by single anti-checking operation, the large-scale data verification can be completed without depending on high-concurrency rpc requests, and the concurrency processing efficiency is remarkably improved.
(3) The garbage data judging efficiency is accelerated, the response time of the data validity judgment is greatly shortened due to the quick query characteristic of the hash cache, the quick screening of the data in the target object can be completed in a short time, the valid data and the garbage data are clearly distinguished, and an efficient data basis is provided for the subsequent GC moving operation.
(4) The method has the advantages of improving the overall efficiency of GC recovery and guaranteeing the stability of cluster capacity, obviously reducing the time occupation ratio of the reverse check stage in the GC recovery process through the cooperative optimization of the links, effectively shortening the single GC cycle, improving the release rate of garbage space, simultaneously, avoiding the problem of shortage of cluster capacity caused by space recovery lag due to the rapid and efficient garbage recovery capability, and ensuring the long-term stable availability of the storage cluster capacity.
In conclusion, the scheme optimizes the whole link of the reverse checking process through the hash caching mechanism, achieves multi-dimensional improvement from communication overhead and concurrency capability to judging efficiency, and provides a more efficient and stable garbage recycling solution for a storage system based on the ROW technology.
Additional features and advantages of the application will be set forth in the description which follows, or may be learned by practice of the application as set forth hereinafter. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It should be understood that the embodiments described are merely some, but not all, embodiments of the application. All other embodiments, which can be made by a person skilled in the art without creative efforts, are included in the protection scope of the present application based on the embodiments of the present application.
The term "comprising" and any variations thereof (e.g., "comprising," "including," etc.) are intended to be interpreted as open-ended terms, meaning "including, but not limited to," i.e., the recited list is not an exhaustive list, but may also include other non-explicitly recited list. The term "based on" should be understood as "based at least in part on," i.e., the basis or condition referred to may not be the only factor, but may also relate to other relevant factors. The term "one embodiment" should be understood as "at least one embodiment," i.e., the described embodiment is not the only possible implementation, and other similar embodiments are possible.
In the present disclosure, the terms "a" and "an" and "the" are used to modify a related element or feature, which is presented by way of illustration and not limitation. Unless the context clearly indicates otherwise, "a" shall be understood as "at least one" and "a plurality" as "at least two". Those skilled in the art should reasonably interpret these terms in terms of the semantic and logical relationships of the context to ensure that they encompass the possibilities of "one or more".
Fig. 2 shows an overall implementation flow of the hard disk garbage collection method provided by the application, which comprises the following steps:
s1, constructing an index container hash tree, wherein the hash tree takes a triplet of an object identifier, a data key and an attribute key as a hash key, and takes recording interval information and a corresponding storage object identifier as hash values, and the recording interval information comprises a recording interval range and a recording type;
s2, generating a hash key according to metadata information of the storage object to be recovered, and searching record interval information corresponding to the metadata information through the hash tree;
s3, judging the validity of the data based on the record interval information obtained by searching in the previous step and the storage object identification of the current storage object to be recovered;
And S4, migrating the effective data to the new storage object according to the judgment result of the previous step, and releasing the garbage space in the original storage object.
In order to more clearly illustrate the technical solution of the present application, the following description will be further given by way of examples of specific scenarios.
Assuming that 4 disks are arranged under each node in A3-node cluster, the topological structure of the whole cluster is shown in fig. 1, an index pool shown in the figure is mainly used for storing metadata information corresponding to user data, and a data pool is used for storing actual data of a user; user data (including single value, sv, and array value arrayvalue types, wherein array values are generally described in units of fixed size, e.g., 512 bytes, e.g., [0,3] total of 4 groups of 512=2kbytes) is first mapped to an index object when written, mapped by the index object to a corresponding master node (one copy is selected as a master node from 3 copies for redundancy, and the other two copies are slave nodes), e.g., index_ objA1/B1/C1 master node is node0, index_ objA2/B2/C2 master node is node1, index_ objA3/B3/C13 master node is node2, then aggregated writing is applied to the master node, the respective storage objects are assumed to be 3 copies redundancy, the storage objects of each application are sequentially obj0 (tgt 0, tgt4, tgt 8), obj1 (tgt 1, tgt5, tgt 9), obj 2/C1/C2 are aggregated, and the requests are written together by 3/C1/C3/C2/C1 are aggregated together, i.e., the requests are written together by 3/C1/C3/C2. After the aggregation is completed, metadata information (such as index container cont, index obj object/dkey/akey/recx, back end storage address) of the index object is recorded and stored in metadata space of the index pool, meanwhile, the metadata information is recorded in obj0 to be used for locating a storage position of the index object in obj0, after the object repeatedly overwrites a new obj storage object, data in an old obj storage object becomes garbage data, when the garbage amount of the storage object reaches a GC recycling threshold (such as static threshold 80%), recycling of the garbage space can be completed through a GC garbage recycling algorithm, specifically, index metadata information recorded in the obj storage object is read and used for inverse searching (under a distributed scene, the index object is assumed to be a 3-copy object, namely, to be searched on 3 copy fragments of the object, 3 times of rpc read requests are sent, as shown in fig. 5), then, an inverse searching result (such as index container index/dkey/494) is the metadata information of the index object is changed into garbage data, and if the garbage recycling result is more effective than the data of the index object is more than the index container/dkey, and if the index object is more effective, and the index object is more effective than the storage data is required to be stored, and the index is more effective than the index data is effectively stored.
In summary, because the communication time and concurrency of the anti-checking request rpc are limited, the bottleneck of GC recovery is limited by the anti-checking step, and the method for improving GC recovery efficiency based on the anti-checking cache is provided by combining the write-in feature of ROW (the ID of the storage object of the later application), and fig. 3 shows the flow of storing the GC of the object (the judgment of LRU hash cache is added), the specific implementation steps of the method are as follows:
1. Constructing an index container hash tree, namely creating a corresponding LRU hash table according to the index container, taking obj, dkey, akey triples as keys of the hash, wherein the value comprises recx [ low, high ] and a storage object ID, if the data represented by the metadata are of an array type, the low and the high are real interval boundaries of recx, and if the data are sv data, the low and the high are values of sv, and the constructed hash tree is shown in figure 4.
2. Hash tree searching, namely generating a Hash key (obj, dkey, akey) according to index metadata information (index container, obj, dkey, akey, recx), and searching recx intervals on the index container Hash tree.
3. Section consistency judging rules, namely if recx of the reverse check of the storage object exists in the hash table and the sections are consistent, judging rules are as follows:
3.1 If the object ID of the check recx is smaller than the object ID of the hash cache, it can be directly determined as garbage, if the obj with the object ID of 100 is GC, and if one piece recx of metadata is recorded on the hash tree after the check, and the object ID of the metadata is 150, the recx is necessarily garbage data on the object 100;
3.2 If the storage object ID is greater than or equal to the ID of the storage object in the cache, the inverse checking is restarted, and the hash cache is updated, if the obj with the storage object ID of 100 is GC, and if the storage object ID of one segment recx in metadata is 90 after the inverse checking record on the hash tree, whether recx is garbage or data on the storage object 100 cannot be judged, and at the moment, the inverse checking needs to be initiated to confirm, if the storage object with the inverse checking result is 120, the storage object is updated to the hash cache by 120.
4. Section cross coverage judgment rules are that if recx of storage object check and recx in cache are inconsistent or cross exists, judgment is carried out according to the following rules:
4.1 The intersecting or covering part judges that garbage is generated when the storage object ID is smaller than the cache storage object ID, if the obj with the storage object ID of 100 is doing GC, one section recx [2,5] in metadata searches the corresponding cache { recx [1,8], objId =101 } through a hash table, and the storage object ObjId in the cache is larger, so that recx [2,5] on the 100 objects can be judged to be garbage;
4.2 When the storage object ID is larger than the cache storage object ID, the cross or overlay part needs to check whether the storage object ID is garbage or data, for example, if obj with the storage object ID of 100 is doing GC, a segment recx [2,5] in metadata searches the corresponding cache { recx [1,8], objId =90 ] through the hash table, and because the storage object ObjId =100 doing GC is larger, the check is needed to check whether the storage object ID is garbage or data.
5. If recx of the storage object check is not in the cache, directly initiating check confirmation.
6. Recx section ordering rules, namely the inverse checking result is ordered from small to large according to recx, the section coverage part is split, and if the storage values of two recx sections under one akey on the hash tree are { recx [2,5], objId =100 }, { recx [6,9], objId =90 }, the recx on the hash tree is { recx [2,5], objId =100 }, { recx [6,9], objId =90 }, respectively.
7. And recx, according to the rule that the storage object ID is large and small in coverage, as shown in figure 6, if the storage values of two recx intervals under one akey on the hash tree are { recx [2,5], objId =100 }, { recx [3,9], objId =90 }, respectively, recx on the hash tree is split into { recx [2,5], objId =100 }, { recx [6,9], objId =90 }, according to the rule that the storage object ID is large and small in coverage.
8. If the storage object IDs are identical when the recx section merging rules recx are continuous, the storage object IDs can be merged into one larger recx section, and if the storage values of the two recx sections on the hash tree under one akey are { recx [2,5], objId =100 }, { recx [6,9], objId =100 }, the storage object IDs can be merged into one recx, namely { recx [2,9], objId =100 }.
Fig. 7 shows a hard disk garbage recycling device according to the present application, where the device includes:
the hash tree construction module is used for constructing an index container hash tree, wherein the hash tree takes a triplet of an object identifier, a data key and an attribute key as a hash key, and takes recorded interval information and a corresponding storage object identifier as hash values, and the recorded interval information comprises a recorded interval range and a recorded type;
The hash searching module is used for generating a hash key according to the metadata information of the storage object to be recovered, and searching record interval information corresponding to the metadata information through the hash tree;
The data validity judging module is used for searching the obtained record interval information and the storage object identification of the current storage object to be recovered based on the hash searching module and judging the data validity according to a preset judging rule;
And the data processing module is used for migrating the effective data to the new storage object according to the judging result of the data effectiveness judging module and releasing the garbage space in the original storage object.
The method for recycling the hard disk garbage comprises the steps of realizing the method for recycling the hard disk garbage disclosed by the application when the device is operated.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In the figures, each block may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As shown in FIG. 8, an embodiment of the application also discloses an electronic device comprising a processor 310, a communication interface 320, a memory 330 for storing a processor executable computer program and a communication bus 340. Wherein the processor 310, the communication interface 320 and the memory 330 perform communication with each other through the communication bus 340. The processor 310 implements the steps of the hard disk garbage collection method described above by running the executable computer program.
It will be appreciated that the electronic device may contain, in addition to a memory and a processor, input devices (e.g., a keyboard), output devices (e.g., a display), and other communication modules. These input devices, output devices, and other communication modules communicate with the processor through an I/O interface (i.e., an input/output interface).
The operations of the present application may be implemented by writing computer program code using one or more programming languages, or a combination thereof. The programming languages include, but are not limited to, the following types:
object-oriented programming languages, such as Java, smalltalk, C ++, etc.;
Conventional procedural programming languages, such as the "C" language or similar programming languages.
The execution of the program code includes, but is not limited to:
executing entirely on the user's computer;
Partially executing on the user computer and partially executing on the remote computer;
executing as a stand-alone software package;
Executing entirely on a remote computer or server.
In a scenario involving a remote computer, the remote computer may be connected to the user computer through any type of network, including, but not limited to, a Local Area Network (LAN) or a Wide Area Network (WAN). In addition, the remote computer may also be connected to an external computer through an Internet service provider, for example, using the Internet.
Further, the application also discloses a computer readable storage medium, which when the instructions in the computer readable storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the steps of the hard disk garbage collection method disclosed by the application.
In the context of the present application, a computer-readable storage medium refers to a tangible medium capable of storing computer program code and related data. Specific examples include, but are not limited to, the following:
(1) Portable computer magnetic disks-removable magnetic storage media such as floppy disks.
(2) The hard disk comprises fixed storage devices such as a mechanical hard disk, a solid state hard disk and the like.
(3) Random Access Memory (RAM) is a volatile storage medium used for temporarily storing data and program code.
(4) Read Only Memory (ROM), a nonvolatile storage medium for storing fixed programs and data.
(5) Erasable programmable read-only Memory (EPROM) or Flash Memory (Flash Memory), a non-volatile storage medium that supports multiple erases and programming.
(6) Optical fiber storage devices-storage media based on optical fiber technology.
(7) Portable compact disc read-only memory (CD-ROM), a read-only medium that stores data in the form of an optical disc.
(8) Optical storage devices such as DVD, blu-ray disc, etc. based on optical principles.
(9) Magnetic storage devices such as magnetic tape, magnetic disk, and the like based on magnetic principles.
(10) Any suitable combination of the above, for example, combining multiple storage media to meet different storage requirements.
These computer readable storage media may be used to store program code and related data described in the present application to support program execution and persistent storage of data.
In particular, the processes described in the flowcharts may be implemented as computer software programs according to embodiments of the present application. For example, embodiments of the present application relate to a computer program product comprising a computer program loaded on a non-transitory computer readable medium. The computer program contains program code for performing the hard disk garbage collection method disclosed herein. The above-described functions defined in the embodiments of the present application can be achieved when the computer program is executed by a processing apparatus.
While the above discussion contains several specific implementation details, these should not be construed as limiting the scope of the present application. The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure to which the present application relates is not limited to the solutions formed by the specific combinations of features described above. Meanwhile, the application also covers other technical schemes formed by any combination of the technical features or the equivalent features without departing from the conception disclosed above.
It will be further understood by those skilled in the art that various changes may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for elements thereof, without departing from the spirit and scope of the technical solutions of the embodiments of the present application. Such modifications and substitutions do not depart from the essence of the corresponding technical solutions from the core spirit and scope of the technical solutions of the embodiments of the present application.