CN108897817B

CN108897817B - Data storage method, detection method and system, storage medium and computer equipment

Info

Publication number: CN108897817B
Application number: CN201810637185.5A
Authority: CN
Inventors: 白帆
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-06-20
Filing date: 2018-06-20
Publication date: 2023-04-07
Anticipated expiration: 2038-06-20
Also published as: CN108897817A

Abstract

The invention discloses a data storage method which is used for storing data to be stored into an inverted index storage structure. The inverted index storage structure includes nodes that include data fields and pointer fields for storing multiple sets of data. The data storage method comprises the following steps: judging whether the data to be stored can be completely written into the data field of the current node or not; if yes, writing a null pointer in the pointer field of the current node; if not, generating a next node; writing a pointer pointing to the next node in the pointer field of the current node; writing the rest data to be stored into the data field of the next node; and making the next node be the current node and entering the judging step. The invention also discloses a content similarity detection method and system, a storage medium and computer equipment. The data storage method, the content similarity detection method and system, the computer readable storage medium and the computer device of the invention utilize the data field of one node to store a plurality of groups of data, thereby reducing the number of nodes and further reducing the storage space required by the pointer.

Description

Data storage method, detection method and system, storage medium and computer equipment

Technical Field

The present invention relates to storage technologies, and in particular, to a data storage method, a content similarity detection system, a non-volatile computer-readable storage medium, and a computer device.

Background

In order to achieve fast finding of the required content in a large number of documents, databases typically associate the content in the documents with the documents by building an index table. However, the conventional index table has a problem of large storage space.

Disclosure of Invention

The embodiment of the invention provides a data storage method, a content similarity detection system, a nonvolatile computer readable storage medium and computer equipment.

The data storage method of the embodiment of the invention is used for storing data to be stored in an inverted index storage structure, the inverted index storage structure comprises at least one node, each node comprises a data domain and a pointer domain, the data domain is used for storing multiple groups of data, the node comprises a current node, and the data storage method comprises the following steps:

judging whether the data to be stored can be completely written into the data domain of the current node or not;

when the data to be stored can be completely written into the data field of the current node, writing a null pointer into the pointer field of the current node;

when the data field of the current node can not store all the data to be stored, generating a next node of the inverted index storage structure;

writing a pointer pointing to the next node in the pointer field of the current node;

writing the remaining data to be stored into the data field of the next node; and

and setting the next node as the current node and entering the step of judging whether the data to be stored can be completely written into the data domain of the current node.

The content similarity detection method is used for an inverted index storage structure, the inverted index storage structure comprises characteristic information and at least one node, and each node comprises a data field and a pointer field. The data field is used for storing a plurality of groups of data, each group of data corresponds to the information of one document, and the pointer field is used for storing a pointer. The content similarity detection method comprises the following steps:

acquiring the number of the feature information with the same number of the two documents according to the corresponding relation between the feature information and the information of the documents in the inverted index storage structure;

judging whether the number is greater than or equal to a preset number;

when the number is larger than or equal to the preset number, judging that the two documents are similar;

and when the number is smaller than the preset number, judging that the two documents are not similar.

The content similarity detection system of the embodiment of the invention is used for an inverted index storage structure, the inverted index storage structure comprises characteristic information and at least one node, and each node comprises a data field and a pointer field. The data field is used for storing a plurality of groups of data, each group of data corresponds to the information of one document, and the pointer field is used for storing a pointer. The content similarity detection system comprises an acquisition module, a first judgment module, a second judgment module and a third judgment module. The acquisition module is used for acquiring the number of the feature information identical to the two documents according to the corresponding relation between the feature information and the information of the documents in the inverted index storage structure. The first judging module is used for judging whether the number is larger than or equal to a preset number. The second judging module is used for judging that the two documents are similar when the number is larger than or equal to the preset number. The third judging module is used for judging that the two documents are not similar when the number is less than the preset number.

One or more non-transitory computer-readable storage media embodying computer-executable instructions that, when executed by one or more processors, cause the processors to perform the above-described data storage methods and/or the above-described content similarity detection methods.

The computer device according to an embodiment of the present invention includes a memory and a processor, where the memory stores computer-readable instructions, and the instructions, when executed by the processor, cause the processor to execute the data storage method and/or the content similarity detection method.

The data storage method, the content similarity detection system, the computer readable storage medium and the computer device of the embodiment of the invention utilize the data domain of one node to store a plurality of groups of data, thereby reducing the number of nodes and further reducing the storage space required by the pointer.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a block diagram of an inverted index storage structure in accordance with certain implementations of the invention.

Fig. 2 is a schematic structural view of a conventional chain structure.

Fig. 3 is a schematic structural diagram of a conventional array structure.

FIG. 4 is a block diagram of an inverted index storage structure in accordance with certain implementations of the invention.

FIG. 5 is a block diagram of an inverted index storage structure in accordance with certain implementations of the invention.

FIG. 6 is a flow chart illustrating a data storage method according to some embodiments of the invention.

Fig. 7 is a flow chart illustrating a content similarity detection method according to some embodiments of the present invention.

FIG. 8 is a schematic diagram of a content similarity detection system in accordance with certain embodiments of the present invention.

Fig. 9 is a schematic view of an application scenario of the content similarity detection method according to some embodiments of the present invention.

FIG. 10 is a schematic illustration of a computer-readable storage medium of some embodiments of the inventions.

FIG. 11 is a schematic diagram of a computer device of some embodiments of the invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, features defined as "first" and "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; may be mechanically connected, may be electrically connected or may be in communication with each other; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

The following disclosure provides many different embodiments or examples for implementing different features of the invention. To simplify the disclosure of the present invention, specific example components and arrangements are described below. Of course, they are merely examples and are not intended to limit the present invention. Moreover, the present invention may repeat reference numerals and/or reference letters in the various examples, which have been repeated for purposes of simplicity and clarity and do not in themselves dictate a relationship between the various embodiments and/or configurations discussed. In addition, the present invention provides examples of various specific processes and materials, but one of ordinary skill in the art may recognize applications of other processes and/or uses of other materials.

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention.

Referring to fig. 1, an inverted index storage structure 100 according to an embodiment of the present invention includes at least one node 10. Each node 10 includes a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple sets of data. The pointer field 14 is used to store pointers.

The inverted index storage structure 100 according to the embodiment of the present invention stores multiple sets of data by using the data field 12 of one node 10, where each set of data may correspond to information of one document, so as to reduce the number of nodes 10, and further reduce the storage space required by a pointer.

The inverted index storage structure 100 according to the embodiment of the present invention further includes feature information 20, the feature information 20 is, for example, a word, and the correspondence between the feature information 20 and data (information of a document) can be obtained by the inverted index storage structure 100.

Referring to fig. 2, the conventional inverted index storage structure generally adopts a conventional chain structure. The data field of each node of the traditional chain structure can only store one group of data, that is, only information of one document can be stored, and when documents in the database are more, information of documents corresponding to each feature information is more, for example, if feature information "people" appears in one thousand documents, one thousand nodes (each node records information of one document) need to be formed by using the chain structure, and then a pointer of the pointer field is used to form association, so that the corresponding relationship between the feature information "people" and the documents is obtained. However, since the pointer needs a certain storage space (for example, the storage space of the pointer is 1kb in a 32-bit system, and the storage space of the pointer is 2kb in a 64-bit system), when the number of nodes is large, the pointer occupies a large amount of storage space, so that the index table formed by the conventional chain structure occupies a large amount of storage space.

Referring to fig. 1 again, the inverted index storage structure 100 according to the embodiment of the present invention utilizes the data field 12 of one node 10 to store multiple sets of data, i.e. store information of multiple documents, and compared to the conventional chain structure, the inverted index storage structure 100 stores the same document information corresponding to the same feature information 20, and needs fewer nodes 10, so that the storage space required by the pointers of the nodes 10 can be reduced.

Referring to fig. 3, the conventional inverted index storage structure also adopts a conventional array structure. The conventional array structure stores all data in a preset continuous fixed-length storage space, so that the data in the conventional array structure can be directly acquired without using a pointer for jumping, namely, the storage space required by the pointer is not spent. However, when the data to be stored is larger than the continuous fixed-length storage space, the conventional array structure cannot perform dynamic data expansion, and therefore, when the document of the database grows in real time, the conventional array structure cannot meet the service requirement.

Referring again to fig. 1, the inverted index storage structure 100 according to the embodiment of the present invention uses pointers to jump, so that after the storage space of the data field 12 of one node 10 is full, the pointer can be used to jump to the next node 10, and the data field 12 of the next node 10 is used as the storage space to continue storing data.

Referring to FIG. 4, in some embodiments, the inverted index storage structure 100 includes a single node 10. When the number of documents corresponding to the feature information 20 is small, the corresponding inverted index storage structure 100 may only need a single node 10 to store the corresponding document information.

Referring to fig. 5, in some embodiments, the inverted index storage structure 100 includes a plurality of nodes 10, the data field 12 of each node 10 can store the same number of sets of data, and the pointer of the current node 10 is used to point to the next node 10 or the pointer of the current node 10 is a null pointer.

In one embodiment, the data field 12 of each node 10 is capable of storing 2 sets of data, and if the characteristic information 20 "people" appears in one thousand documents, then the index table is formed with only five hundred nodes 10 required for the characteristic information 20, which is a five hundred nodes reduction over the one thousand nodes required by the conventional chain structure. The number of data sets that can be stored in the data field 12 of each node 10 is, for example, 2 data sets, or 4 data sets, or 1024 data sets, or 2048 data sets, and the like, and is not particularly limited herein. It should be noted that when the number of the groups is large, for example, larger than a set value, the use requirement of the high-frequency feature (the feature information 20 appearing in many documents) can be preferably satisfied, that is, only a small number of nodes 10 are needed to associate the high-frequency feature with the corresponding document; when the number of sets is small, for example, smaller than the set value, the usage requirement of the low frequency feature (the feature information 20 appearing in fewer documents) can be preferably satisfied, i.e., the storage space of the data field 12 of the node 10 is prevented from being too large and not full, which causes waste.

When there is a subsequent node 10 for the current node 10, the pointer of the current node 10 is used to point to the next node 10, and in particular, the pointer may be used to point to the storage address of the next node 10. When the current node 10 is the last node 10, the pointer of the current node 10 may be a null pointer (null). After the data field 12 of the current node 10 is full, a new node 10 may be formed after the current node 10, and the pointer of the current node 10 is changed from the original null pointer to point to the new node 10.

In some embodiments, the inverted index storage structure 100 includes a plurality of nodes 10, the pointer of the current node 10 is used to point to the next node 10 or the pointer of the current node 10 is a null pointer, and the number of sets of data that can be stored in the data field 12 of the node 10 is positively correlated to the sequence of the nodes 10.

Specifically, the number of sets of data that can be stored in the data field 12 of the first node 10 of the plurality of nodes 10 may be relatively small, for example, 2 sets of data, and thus, the usage requirement of the low-frequency feature may be met, and waste of storage space caused by opening up a larger storage space for the data field 12 of the node 10 at the beginning may be avoided. As the sequence of the nodes 10 increases, the number of sets of data that can be stored in the corresponding data field 12 may gradually increase, for example, according to a power of 2, so that the data field 12 of the node 10 may quickly form a larger storage space to meet the usage requirement of the high-frequency feature. In one embodiment, a first node 10 may be capable of storing 2 sets of data, a second node 10 may be capable of storing 4 sets of data, a third node 10 may be capable of storing 8 sets of data, a fourth node 10 may be capable of storing 16 sets of data, and so on. If the feature information 20 "people" appears in one thousand documents, then when the index table is formed, the feature information 20 only needs nine nodes 10 (2 +4+8+16+32+64+128+256+512 + 1022 =1022 >) which are reduced by nine hundred nodes compared with the one thousand nodes required by the conventional chain structure.

The pointer of the current node 10 is used to point to the next node 10 when there is a subsequent node 10 of the current node 10, and specifically, the pointer may be used to point to the storage address of the next node 10. When the current node 10 is the last node 10, the pointer of the current node 10 may be a null pointer (null). After the data field 12 of the current node 10 is full, a new node 10 may be formed after the current node 10, and the pointer of the current node 10 is changed from the original null pointer to point to the new node 10.

Referring to fig. 1 again, in some embodiments, the inverted index storage structure 100 includes a plurality of nodes 10, where the pointer of the current node 10 is used to point to the next node 10 or the pointer of the current node 10 is a null pointer, when the number of sets of data that can be stored in the data field 12 of the node 10 is less than the preset number of sets, the number of sets of data that can be stored in the data field 12 of the node 10 is positively correlated with the sequence of the nodes 10, and when the number of sets of data that can be stored in the data field 12 of the node 10 is equal to the preset number of sets, the number of sets of data that can be stored in the data field 12 of the subsequent node 10 of the node 10 is equal to the preset number of sets.

Specifically, the number of sets of data that can be stored in the data field 12 of the first node 10 of the plurality of nodes 10 may be relatively small, for example, 2 sets of data, and thus, the usage requirement of the low-frequency feature may be met, and waste of storage space caused by opening up a larger storage space for the data field 12 of the node 10 at the beginning may be avoided. As the order of the nodes 10 increases, the number of sets of data that can be stored by the corresponding data fields 12 may gradually increase, for example, by a power of 2, so that the data fields 12 of the nodes 10 may quickly form a larger storage space to meet the usage requirements of the high frequency features. When the number of data groups that can be stored in the data field 12 of the node 10 increases to the preset number of data groups, the number of data groups that can be stored in the data field 12 of all nodes 10 subsequent to the node 10 may be the preset number of data groups, so as to avoid that the number of data groups that can be stored in the data field 12 increases infinitely and does not converge, and thus the storage space is too large and the storage space is wasted. In one embodiment, a first node 10 may be capable of storing 2 sets of data, a second node 10 may be capable of storing 4 sets of data, a third node 10 may be capable of storing 8 sets of data, a fourth node 10 may be capable of storing 16 sets of data, and so on until the number of sets of data that may be stored by the node 10 is a predetermined number of sets. If the feature information 20 "people" appears in one thousand documents, then when the index table is formed, the feature information 20 only needs nine nodes 10 (2 +4+8+16+32+64+128+256+512 + 1022 =1022 >) which are reduced by nine hundred nodes compared with the one thousand nodes required by the conventional chain structure.

In some embodiments, the predetermined number of sets is 4096. When the system memory page of the operating environment of the inverted index storage structure 100 is 4096kb (Linux default pagesize), the capacity of the data field 12 is a multiple of the system memory page, so that it is more efficient to write and read data in the data field 12 in batches.

Of course, in other embodiments, the number of preset groups may also be 1024, 2048, 8192, and the like. In addition, the preset number of groups may also be set by a user according to a requirement, and is not limited herein.

In some embodiments, the information of the document includes a document number (docid). Specifically, the storage space occupied by docid is small, generally 1kb, and the storage space of the inverted index storage structure 100 can be reduced by only including the document number in the information of the document.

Of course, in other embodiments, the information of the document may also include the number of occurrences (TF) of the feature information in the document, and the positions of the document where the feature information has occurred.

Referring to fig. 6, the data storage method according to the embodiment of the present invention may be used to store data to be stored in the inverted index storage structure 100 according to any of the above embodiments. The inverted index storage structure 100 includes at least one node 10, each node 10 including a data field 12 and a pointer field 14. A data field 12 of a node 10 for storing a plurality of sets of data, the node 10 comprising a current node 10, the data storage method comprising:

011: judging whether the data to be stored can be completely written into the data field 12 of the current node 10;

012: when the data to be stored can be written into the data field 12 of the current node 10, writing a null pointer into the pointer field 14 of the current node 10;

013: when the data field 12 of the current node 10 cannot store all the data to be stored, generating a next node 10 of the inverted index storage structure 100;

014: writing a pointer pointing to the next node 10 in the pointer field 12 of the current node 10;

015: writing the remaining data to be stored into the data field 12 of the next node 10; and

016: let the next node 10 be the current node 10 and go to step 011.

The data storage method of the embodiment of the invention utilizes the data field 12 of one node 10 to store a plurality of groups of data, wherein each group of data can correspond to the information of one document, thereby reducing the number of nodes 10 and further reducing the storage space required by a pointer.

Specifically, when the inverted index storage structure 100 is constructed, the first node 10 may be generated for the inverted index storage structure 100 first. In the process of storing the data to be stored, the last node 10 of the inverted index storage structure 100 may be made to be the current node 10, and it is determined whether the data to be stored may be completely written into the data domain 12 of the current node 10, that is, it is determined whether the remaining storage space of the data domain 12 of the current node 10 is greater than or equal to the storage space required by the data to be stored, if yes, it is indicated that the data to be stored may be completely written into the data domain 12 of the current node 10, then all the data to be stored is written into the data domain 12 of the current node 10, and it is determined that the data to be stored is completely written; if not, it indicates that the data field 12 of the current node 10 cannot store all the data to be stored, then part of the data to be stored may be written into the remaining storage space of the data field 12 of the current node 10, and then a next node 10 is generated, and a pointer pointing to the next node 10 is written into the pointer field 12 of the current node 10, and then the remaining data to be stored is written into the next node 10, where the next node 10 is substantially the last node 10 of the inverted index storage structure 100, so that the next node 10 may be the current node and the data storage method according to the embodiment of the present invention may be executed in a loop to write all the data to be stored into the inverted index storage structure 100.

Referring to fig. 7, the content similarity detection method according to the embodiment of the present invention may be applied to the inverted index storage structure 100 according to any of the above embodiments. The inverted index storage structure 100 includes at least one node 10 and characteristic information 20, each node 10 including a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple sets of data, each set of data corresponding to information of a document. The pointer field 14 is used to store pointers. The content similarity detection method comprises the following steps:

02: acquiring the number of the feature information 20 with the same number of the two documents according to the corresponding relation between the feature information 20 and the information of the documents in the inverted index storage structure 100;

04: judging whether the number is greater than or equal to a preset number;

06: when the number is larger than or equal to the preset number, judging that the two documents are similar;

08: and when the number is less than the preset number, judging that the two documents are not similar.

Referring to fig. 8, the content similarity detection system 300 according to the embodiment of the present invention may be used in the inverted index storage structure 100 according to any of the above embodiments. The inverted index storage structure 100 includes at least one node 10 and characteristic information 20, each node 10 including a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple sets of data, each set of data corresponding to information of a document. The pointer field 14 is used to store pointers. The content similarity detection system 300 includes an obtaining module 310, a first determining module 320, a second determining module 330, and a third determining module 340. The obtaining module 310 is configured to obtain the number of the feature information 20 that is the same for two documents according to the corresponding relationship between the feature information 20 and the information of the documents in the inverted index storage structure 100. The first determining module 320 is configured to determine whether the number is greater than or equal to a preset number. The second judging module 330 is configured to judge that the two documents are similar when the number is greater than or equal to a preset number. The third determining module 340 is configured to determine that the two documents are not similar when the number is smaller than the preset number.

That is to say, the content similarity detection method according to the embodiment of the present invention may be implemented by the content similarity detection system 300 according to the embodiment of the present invention, wherein step 02 may be implemented by the obtaining module 310, step 04 may be implemented by the first determining module 320, step 06 may be implemented by the second determining module 330, and step 08 may be implemented by the third determining module 340.

The content similarity detection method and the content similarity detection system 300 of the embodiment of the invention can judge whether two documents are similar in a smaller storage space by using the characteristics of small storage space and high expansibility of the inverted index storage structure 100, thereby providing reliable technical support for protecting original works.

In certain embodiments, step 02 may be: selecting a document as a document to be analyzed, obtaining feature information 20 of the document to be analyzed, obtaining document information corresponding to each feature information 20 according to the feature information 20 of the document to be analyzed, traversing the document information corresponding to each feature information 20 of the document to be analyzed, adding one to the number of the same feature information of the document (except the document to be analyzed) and the document to be analyzed, and finally counting the number of the documents corresponding to the largest number of the feature information 20 which is the same as the document to be analyzed. Referring to fig. 9, in an embodiment, the feature information 20 of the document to be analyzed includes, for example, a, B, C, and D, the document to be analyzed, the document 1, the document 2, and the document 3 corresponding to the feature information a, the document to be analyzed, the document 1, and the document 2 corresponding to the feature information B, the document to be analyzed, the document 1, and the document 1 corresponding to the feature information C, and the document to be analyzed, and the document 1 corresponding to the feature information D. Traversing the feature information 20, it can be known from the feature information a that the same feature information of the document to be analyzed as the document 1 is 1, the same feature information as the document 2 is 1, and the same feature information as the document 3 is 1, from the feature information B that the same feature information of the document to be analyzed as the document 1 is 2, the same feature information as the document 2 is 2, and the same feature information as the document 3 is 1, from the feature information C that the same feature information of the document to be analyzed as the document 1 is 3, the same feature information as the document 2 is 2, and the same feature information as the document 3 is 1, from the feature information D that the same feature information of the document to be analyzed as the document 1 is 4, the same feature information as the document 2 is 2, and the same feature information as the document 3 is 1, and thus, the document with the largest number of the same feature information 20 as the document to be analyzed can be obtained as the document 1, and the number of the same feature information 20 is 4.

Referring to fig. 10, an embodiment of the invention further provides a computer-readable storage medium 500. One or more non-transitory computer-readable storage media 500 containing computer-executable instructions that, when executed by one or more processors 600, cause the processors 600 to perform the data storage method of any of the above embodiments and/or the content similarity detection method of any of the above embodiments.

For example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs a data storage method as described in the following steps:

011: judging whether the data to be stored can be completely written into the data domain 12 of the current node 10;

014: writing a pointer to the next node 10 in the pointer field 12 of the current node 10;

016: let the next node 10 be the current node 10 and proceed to step 011.

For another example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs the content similarity detection method as described in the following steps:

04: judging whether the number is larger than or equal to a preset number or not;

Referring to fig. 11, a computer device 700 is further provided according to an embodiment of the present invention. The computer device 700 includes a memory 720 and a processor 740, wherein the memory 720 stores computer-readable instructions, and when the computer-readable instructions are executed by the processor 740, the processor 740 executes the data storage method of any one of the above embodiments and/or the content similarity detection method of any one of the above embodiments.

For example, when the computer readable instructions are executed by the processor 740, the processor 740 performs a data storage method as described in the following steps:

013: when the data field 12 of the current node 10 cannot store all the data to be stored, generating the next node 10 of the inverted index storage structure 100;

016: let the next node 10 be the current node 10 and proceed to step 011.

As another example, when the computer readable instructions are executed by the processor 740, the processor 740 performs the content similarity detection method as described in the following steps:

08: and when the number is smaller than the preset number, judging that the two documents are not similar.

In the description herein, references to the description of the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be performed by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for performing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried out in the above method may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be executed in the form of hardware or in the form of a software functional module. The integrated module, if executed in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that changes, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A data storage method, configured to store data to be stored in an inverted index storage structure, where the inverted index storage structure includes at least one node, each node includes a data field and a pointer field, the data field is used to store multiple sets of data, and the node includes a current node, and the data storage method includes:

when the data domain of the current node can not store all the data to be stored, generating a next node of the inverted index storage structure;

2. The data storage method of claim 1, wherein said inverted index storage structure comprises a plurality of said nodes, and wherein said data fields of each of said nodes store the same number of sets of said data.

3. The data storage method of claim 1, wherein the inverted index storage structure comprises a plurality of nodes, and the data fields of the nodes store sets of the data that are positively correlated with the order of the nodes.

4. The data storage method according to claim 1, wherein the inverted index storage structure comprises a plurality of nodes, when the number of sets of data stored in the data field of a node is smaller than a preset number of sets, the number of sets of data stored in the data field of the node is positively correlated with the order of the nodes, and when the number of sets of data stored in the data field of the node is equal to the preset number of sets, the number of sets of data stored in the data field of a subsequent node of the node is the preset number of sets.

5. The data storage method of claim 4, wherein the predetermined number of sets is 4096.

6. The data storage method according to claim 1, wherein each set of the data corresponds to information of a document, and the information of the document includes a number of the document.

7. A content similarity detection method, used in an inverted index storage structure, where the inverted index storage structure includes feature information and at least one node, each node includes a data field and a pointer field, the data field is used to store multiple sets of data, each set of data corresponds to information of a document, and the pointer field is used to store a pointer, and the content similarity detection method includes:

judging whether the number is greater than or equal to a preset number;

8. The content similarity detection method according to claim 7, wherein the inverted index storage structure comprises a plurality of nodes, and the data field of each node stores the same number of sets of data.

9. The content similarity detection method according to claim 7, wherein the inverted index storage structure comprises a plurality of nodes, and the data fields of the nodes store data whose number of sets positively correlates with the sequence of the nodes.

10. The content similarity detection method according to claim 7, wherein the inverted index storage structure includes a plurality of nodes, when the number of sets of data stored in the data field of the node is smaller than a preset number of sets, the number of sets of data stored in the data field of the node is positively correlated with the sequence of the node, and when the number of sets of data stored in the data field of the node is equal to the preset number of sets, the number of sets of data stored in the data field of a node subsequent to the node is the preset number of sets.

11. The content similarity detection method according to claim 10, wherein the preset number of groups is 4096.

12. The content similarity detection method according to claim 7, wherein the information of the document includes a number of the document.

13. A content similarity detection system, configured to store an inverted index, where the inverted index storage structure includes feature information and at least one node, and each node includes a data field and a pointer field, where the data field is used to store multiple sets of data, each set of data corresponds to information of a document, and the pointer field is used to store a pointer, and the content similarity detection system includes:

the acquisition module is used for acquiring the number of the feature information which is the same with the two documents according to the corresponding relation between the feature information and the information of the documents in the inverted index storage structure;

the first judging module is used for judging whether the number is greater than or equal to a preset number;

a second judging module, configured to judge that the two documents are similar when the number is greater than or equal to the preset number;

a third judging module, configured to judge that the two documents are dissimilar when the number is smaller than the preset number.

14. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the data storage method of any of claims 1 to 6 and/or the content similarity detection method of any of claims 7-12.

15. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the data storage method of any one of claims 1 to 6 and/or the content similarity detection method of any one of claims 7 to 12.