
CN112464639A - Search text folding processing system and method thereof - Google Patents

Search text folding processing system and method thereof

Info

Publication number
CN112464639A
Authority
CN
China
Prior art keywords
search
folding
texts
text
fingerprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011465449.7A
Other languages
Chinese (zh)
Inventor
张校源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Information Technology Co Ltd
Original Assignee
Shanghai Eisoo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eisoo Information Technology Co Ltd
Priority to CN202011465449.7A
Publication of CN112464639A

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12Fingerprints or palmprints

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a search text folding processing system and a method thereof. The system comprises a fingerprint creating module, a database and a folding module. The fingerprint creating module is connected to a local storage end and creates a document fingerprint for each text in the local storage end; it is also connected to the database so that the document fingerprints and the corresponding texts are stored in the database. The folding module folds the search texts according to the document fingerprints to obtain folded text data, and transmits the folded text data to the search engine end for display on a search page. Compared with the prior art, the invention can fold search texts that are identical or highly similar, so that a user can view as many different search texts as possible on the same page.

Description

Search text folding processing system and method thereof
Technical Field
The invention relates to the technical field of text analysis, in particular to a search text folding processing system and a method thereof.
Background
At present, when a user retrieves search texts from a local search engine end, the results often contain many duplicate texts that are identical or highly similar. As a result, the user cannot quickly obtain a complete view of all the search texts and must scroll down or turn pages many times to see the different texts that were found.
In the prior art, similarity is therefore calculated between different search texts so that the user can know in advance whether identical or highly similar texts exist among them. At present, similarity is mostly calculated by methods such as edit distance, the Jaccard coefficient, TF or TF-IDF, and word2vec. The edit distance is the minimum number of editing operations (insertion, replacement and deletion) required to convert one character string into another; the larger the edit distance, the more different the two strings are. The Jaccard coefficient is the size of the intersection of two texts divided by the size of their union; the larger the value, the more similar the texts are. With TF or TF-IDF, the texts are vectorized and the cosine of the two text vectors is calculated; the larger the value, the more similar the texts are. With word2vec, each word is converted into a vector by a trained model and the cosine is then calculated; again, the larger the value, the more similar the texts are. These methods are only suitable for judging the similarity of a small number of short texts, and their efficiency is low when the similarity of a large number of long texts must be calculated. In addition, although similarity calculation lets the user know whether search texts are identical or similar, the user still has to scroll down or turn pages many times to view more search texts.
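As a minimal sketch of two of the prior-art measures described above, the following Python computes a Jaccard coefficient and an edit distance. The function names and the whitespace tokenisation are assumptions made for illustration and do not come from the original text.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over whitespace-separated tokens (assumed tokenisation)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and replacements turning s into t."""
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]
```

The edit distance routine visits every pair of characters, so its cost grows with the product of the two text lengths, which is consistent with the efficiency concern raised above for large numbers of long texts.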
Disclosure of Invention
The present invention is directed to provide a system and a method for folding search texts, which can fold search texts with the same or high similarity, so as to facilitate users to view different search texts on the same page as much as possible.
The purpose of the invention can be realized by the following technical scheme: a search text folding processing system comprises a fingerprint creating module, a database and a folding module, wherein the input end of the fingerprint creating module is connected to a local storage end and is used for creating document fingerprints corresponding to texts in the local storage end;
the output end of the fingerprint creating module is connected with the database so as to store the document fingerprint and the corresponding text in the database;
the input end of the folding module is connected with a database, the database is connected with a search engine end, and the database outputs corresponding search texts and corresponding document fingerprints to the folding module according to search instructions of the search engine end;
the output end of the folding module is connected with the search engine end, and the folding module is used for folding the search text according to the document fingerprint to obtain folded text data and transmitting the folded text data to the search engine end to be displayed on the search page.
Further, the folding module comprises a similarity calculation unit and a folding sorting unit which are sequentially connected, wherein the similarity calculation unit is connected with the database and is used for calculating the similarity between the searched texts according to the document fingerprints so as to construct a similarity matrix;
and the folding sorting unit is connected with the search engine end and is used for folding the search texts meeting the folding conditions according to the input sequence of the search texts by combining the similarity matrix and a preset folding threshold value to obtain folded text data, and transmitting the folded text data to the search engine end for display.
A search text folding processing method comprises the following steps:
s1, the fingerprint creating module acquires all stored texts from a local storage end, and performs word segmentation, hash calculation, weighting, merging, dimension reduction and displacement calculation on each text in sequence to obtain document fingerprints corresponding to each text;
s2, the fingerprint creating module stores the document fingerprint and the corresponding text in a database;
s3, according to the search condition input by the user, the search engine end completes the corresponding search operation, and obtains a plurality of search texts and corresponding document fingerprints from the database;
s4, the folding module acquires all search texts and corresponding document fingerprints from the database in sequence, folds the search texts meeting the folding conditions by calculating the similarity between every two search texts and combining a preset folding threshold value according to the input sequence of the search texts to obtain folded text data, and transmits the folded text data to the search engine end;
and S5, the search engine end displays the received folded text data on a search page.
Further, the specific process of the fingerprint creation module performing word segmentation in step S1 is as follows: and performing word segmentation on the text to obtain effective feature vectors, and setting a corresponding weight for each feature vector.
Further, the specific process of the fingerprint creation module performing hash calculation in step S1 is as follows: and calculating the hash value of each feature vector through a hash function, wherein the hash value is specifically an n-bit signature consisting of binary numbers 0 and 1.
Further, the specific process of the weighting processing performed by the fingerprint creating module in step S1 is as follows: on the basis of the hash value and the weight of the feature vector, a weighting result is calculated for each feature vector, wherein if a hash element of the hash value is 1, the weight of the feature vector is multiplied by 1; and if the hash element is 0, the weight of the feature vector is multiplied by -1, so as to obtain the weighting result corresponding to each feature vector.
Further, the specific process of the fingerprint creating module performing the dimension reduction process in step S1 is as follows: and accumulating the weighted results of all the feature vectors, and reducing the dimension to obtain a 64-bit accumulated result.
Further, the specific process of the fingerprint creating module performing the displacement calculation process in step S1 is as follows: a displacement calculation is performed once every 16 bits; if the accumulated value is greater than or equal to 0, the result is shifted to the right by one bit and 1 is added; if the accumulated value is less than 0, the result is shifted to the right by one bit without adding 1, thereby obtaining the integer fingerprint value corresponding to the text, namely the document fingerprint.
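The fingerprint creation sub-steps above can be sketched end to end as a simhash-style routine. This is only an illustration under stated assumptions: whitespace tokenisation, term frequency as the weight, MD5 as the hash function, and the name `simhash` are all hypothetical choices. The sketch also appends each bit by shifting the accumulator to the left, which is the usual simhash construction, and does not reproduce the "every 16 bits" grouping described in the text.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width, matching the 64-bit accumulated result


def simhash(text: str) -> int:
    """Return an integer fingerprint for text (illustrative sketch only)."""
    # Word segmentation and weights: term frequency is an assumed weighting.
    weights = Counter(text.lower().split())
    totals = [0] * BITS
    for token, weight in weights.items():
        # Hash calculation: MD5 truncated to 64 bits is an assumed hash function.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        signature = int.from_bytes(digest[:8], "big")
        # Weighting and merging: add the weight for a 1 bit, subtract it for a 0 bit.
        for i in range(BITS):
            bit = (signature >> (BITS - 1 - i)) & 1
            totals[i] += weight if bit else -weight
    # Displacement: build the integer one bit at a time.
    fingerprint = 0
    for total in totals:
        fingerprint = (fingerprint << 1) | (1 if total >= 0 else 0)
    return fingerprint
```

Because the weights are accumulated per token, the word order of the text does not affect the fingerprint in this sketch.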
Further, the step S4 specifically includes the following steps:
s41, the folding module acquires all search texts and corresponding document fingerprints from the database in sequence;
s42, circularly traversing each search text in sequence, and calculating an exclusive OR value of two document fingerprints by adopting an integer exclusive OR operation mode to serve as the similarity corresponding to the two search texts, wherein the smaller the similarity value is, the more similar the two search texts are;
s43, arranging and combining all the similarity in sequence to obtain a similarity matrix;
s44, sequentially judging whether each similarity element meets folding conditions or not according to each similarity element in the similarity matrix and combining a preset folding threshold, if so, folding the two search texts corresponding to the similarity element according to the incoming sequence of the search texts to obtain folded text data, otherwise, not folding the two search texts corresponding to the similarity element;
and S45, the folding module transmits the folded text data to the search engine terminal.
Further, the folding conditions are specifically: the similarity element value is less than or equal to the folding threshold.
Compared with the prior art, the invention has the following advantages:
First, the invention provides a search text folding processing system and a method thereof. A fingerprint creating module acquires all stored texts from a local storage end and creates a document fingerprint for each text. After the search engine end acquires search texts from the database, a folding module performs similarity calculation, folding judgment and folding processing on the search texts according to the document fingerprints to obtain folded text data, and outputs the folded text data to the search engine end for page display. Search texts that are identical or highly similar can thus be folded, and the user can view as many different search texts as possible on the same page without scrolling down or turning pages many times.
Second, the invention calculates the similarity between search texts and makes the folding judgment on the basis of document fingerprints. Unlike the traditional approach, which creates fingerprints by character string splicing and counts the differing digits of two character strings by Hamming distance, the invention obtains document fingerprints in integer form by digital displacement and calculates the distance between two document fingerprints by integer XOR operation. Compared with the traditional method, this has higher calculation efficiency and is well suited to similarity judgment for massive numbers of long texts. Compared with the traditional edit distance or Jaccard distance calculation, performing similarity calculation and folding judgment with document fingerprints handles highly similar texts better and achieves a better text folding effect.
Third, when folding is performed, the folding module follows the order in which the database outputs the searched documents, namely the input order of the search texts, so that the folded text data is displayed to the user in the same order. Identical or highly similar texts are therefore both folded and displayed in a definite order, the user can see them more clearly, and more different search texts that meet the user's search conditions can be found once the folded text data is displayed.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic flow diagram of the process of the present invention;
FIG. 3 is a schematic diagram of the working principle of the present invention;
FIG. 4 is a diagram illustrating a process of creating a document fingerprint according to an embodiment;
FIG. 5 is a diagram illustrating a process of text folding in an embodiment;
Reference numerals in the figures: 1, fingerprint creating module; 2, database; 3, folding module; 4, search engine end; 5, local storage end; 31, similarity calculation unit; 32, folding sorting unit.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
A search text folding processing system, as shown in FIG. 1, comprises a fingerprint creating module 1, a database 2 and a folding module 3, wherein an input end of the fingerprint creating module 1 is connected to a local storage end 5 and is used for creating document fingerprints corresponding to texts in the local storage end 5;
the output end of the fingerprint creating module 1 is connected with the database 2 so as to store the document fingerprints and the corresponding texts in the database 2;
the input end of the folding module 3 is connected with the database 2, the database 2 is connected with the search engine end 4, and the database 2 outputs corresponding search texts and corresponding document fingerprints to the folding module 3 according to search instructions of the search engine end 4;
the output end of the folding module 3 is connected to the search engine end 4, and the folding module 3 is used for folding the search text according to the document fingerprint to obtain folded text data and transmitting the folded text data to the search engine end 4 to be displayed on the search page.
The folding module 3 comprises a similarity calculation unit 31 and a folding sorting unit 32 which are connected in sequence, wherein the similarity calculation unit 31 is connected with the database 2 and is used for calculating the similarity between the searched texts according to the document fingerprints so as to construct a similarity matrix;
the folding sorting unit 32 is connected to the search engine terminal 4, and is configured to fold the search texts meeting the folding conditions according to the input sequence of the search texts by combining the similarity matrix and a preset folding threshold value, to obtain folded text data, and transmit the folded text data to the search engine terminal 4 for display.
Applying the above system to practice, a search text folding processing method can be implemented, as shown in fig. 2, including the following steps:
s1, the fingerprint creating module 1 obtains all the stored texts from the local storage 5, and performs word segmentation, hash calculation, weighting, merging, dimension reduction, and displacement calculation on each text in sequence to obtain the document fingerprint corresponding to each text, wherein the specific process of word segmentation is as follows: performing word segmentation on the text to obtain effective feature vectors, and setting a corresponding weight for each feature vector;
the specific process of the hash calculation processing is as follows: calculating the hash value of each feature vector through a hash function, wherein the hash value is specifically an n-bit signature consisting of binary numbers 0 and 1;
the specific process of the weighting processing is as follows: on the basis of the hash value and the weight of the feature vector, a weighting result is calculated for each feature vector, wherein if a hash element of the hash value is 1, the weight of the feature vector is multiplied by 1; if the hash element is 0, the weight of the feature vector is multiplied by -1, so as to obtain the weighting result corresponding to each feature vector;
the specific process of the dimension reduction processing is as follows: the weighting results of all the feature vectors are accumulated and reduced in dimension to obtain a 64-bit accumulated result;
the specific process of the displacement calculation processing is as follows: a displacement calculation is performed once every 16 bits; if the accumulated value is greater than or equal to 0, the result is shifted to the right by one bit and 1 is added; if the accumulated value is less than 0, the result is shifted to the right by one bit without adding 1, so as to obtain the integer fingerprint value corresponding to the text, namely the document fingerprint;
s2, the fingerprint creating module 1 stores the document fingerprint and the corresponding search text in the database 2;
s3, according to the search condition input by the user, the search engine end 4 completes the corresponding search operation, and obtains a plurality of search texts and corresponding document fingerprints from the database 2;
s4, the folding module 3 sequentially acquires all search texts and corresponding document fingerprints from the database 2, folds the search texts meeting the folding conditions by calculating the similarity between every two search texts and combining a preset folding threshold value according to the input sequence of the search texts to obtain folded text data, and transmits the folded text data to the search engine terminal 4, specifically, the folding module 3 firstly sequentially acquires all the search texts and corresponding document fingerprints from the database 2;
then, circularly traversing each search text in sequence, and calculating an exclusive OR value of the two document fingerprints by adopting an integer exclusive OR operation mode to serve as the similarity corresponding to the two search texts, wherein the smaller the similarity value is, the more similar the two search texts are;
then all the similarity degrees are arranged and combined in sequence to obtain a similarity degree matrix;
then, according to each similarity element in the similarity matrix, combining a preset folding threshold value, sequentially judging whether each similarity element meets a folding condition (the numerical value of the similarity element is smaller than or equal to the folding threshold value), if so, folding the two search texts corresponding to the similarity element according to the incoming sequence of the search texts to obtain folded text data, otherwise, not folding the two search texts corresponding to the similarity element;
finally, the folding module 3 transmits the folded text data to the search engine terminal 4;
s5, the search engine terminal 4 displays the received folded text data on the search page.
The technical scheme is used for solving the problem that texts with the same or higher similarity are repeated in the search texts aiming at a large number of search texts. By applying the technical scheme, the texts with the same or high similarity can be effectively folded, so that a user can quickly view the searched different texts without pulling down or turning pages.
As shown in fig. 3, the technical scheme mainly includes two steps:
the first step is as follows: creating document fingerprint metadata and storing;
the second step is that: the search text is collapsed.
(I) Creating and storing fingerprints (as shown in FIG. 4)
Creating a fingerprint is divided into 5 steps: word segmentation, hash calculation, weighting, merging with dimension reduction, and displacement calculation. The specific process is as follows:
(1) Word segmentation
Given a sentence, word segmentation is performed to obtain effective feature vectors, and a weight is set for each feature vector, for example on a scale of 5 levels. (For a given text, a feature vector may be a word in the text, and its weight may be the number of times the word appears, a word segmentation weight coefficient, or the importance of the word.) For example, given the sentence "natural language processing is an important direction in the field of computer science", word segmentation splits it into individual feature words, each of which then carries a weight: natural language (5), processing (4), is (1), one (1), important (3), direction (3), of (1), in the field (3), of (1), computer science (2). The number in brackets represents the importance of the word in the whole sentence; the larger the number, the more important the word.
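One of the weighting options mentioned above, the number of times a word appears, can be sketched as follows. The function name `weighted_features` and the whitespace tokenisation are assumptions; a Chinese-language deployment would need a real word segmenter, which is not shown here.

```python
from collections import Counter
from typing import Dict


def weighted_features(text: str) -> Dict[str, int]:
    """Map each feature word to its occurrence count (assumed weighting)."""
    tokens = text.lower().split()  # assumed tokenisation: split on whitespace
    return dict(Counter(tokens))
```

Other weightings named in the text, such as an importance level from 1 to 5, would replace the counting step without changing the later steps.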
(2) Hash calculation
The hash value of each feature vector is calculated through a hash function, the hash value being an n-bit signature consisting of the binary digits 0 and 1. For example, the hash value of "natural language" is "101011 …"; in this way, the text string becomes a series of digits.
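A minimal sketch of turning a feature word into an n-bit signature is given below. The text does not name the hash function, so the use of MD5 and the function name `feature_hash` are assumptions for illustration; the signature produced for "natural language" will not match the example value "101011 …" above.

```python
import hashlib


def feature_hash(feature: str, n_bits: int = 64) -> str:
    """Return an n-bit signature of 0s and 1s for a feature word (n_bits <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()  # 128-bit digest
    value = int.from_bytes(digest, "big") >> (128 - n_bits)  # keep the top n bits
    return format(value, "0{}b".format(n_bits))  # zero-padded binary string
```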
(3) Weighting
On the basis of the hash value, all the feature vectors are weighted, namely W = Hash × weight: when a hash element is 1, the weight is multiplied by 1; when a hash element is 0, the weight is multiplied by -1. For example, weighting the hash value "101011 …" of "natural language" (weight 5) gives W("natural language") = 5 -5 5 -5 5 5 …, and weighting the hash value "100101 …" of "processing" (weight 4) gives W("processing") = 4 -4 -4 4 -4 4 …. All the feature vectors are weighted in this way to obtain their corresponding weighting results.
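The weighting rule can be checked against the two examples in this step with a short sketch. The function name `weight_signature` is hypothetical, and only the first six bits shown in the text are used.

```python
from typing import List


def weight_signature(signature: str, weight: int) -> List[int]:
    """Multiply the weight by 1 for each '1' bit and by -1 for each '0' bit."""
    return [weight if bit == "1" else -weight for bit in signature]


natural_language = weight_signature("101011", 5)  # [5, -5, 5, -5, 5, 5]
processing = weight_signature("100101", 4)        # [4, -4, -4, 4, -4, 4]
```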
(4) Merging and dimension reduction
The weighting results of all the feature vectors are accumulated to form a single sequence.
Taking the two feature vectors in step (3) as an example, adding the weighting result of "processing", 4 -4 -4 4 -4 4 …, to the weighting result of "natural language", 5 -5 5 -5 5 5 …, position by position gives 4+5, -4+(-5), -4+5, 4+(-5), -4+5, 4+5, …, so the dimension-reduced result is 9 -9 1 -1 1 9 ….
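The position-by-position accumulation in this step amounts to a column-wise sum, sketched below with the two weighted vectors from step (3). The function name `merge_weighted` is an assumption.

```python
from typing import List


def merge_weighted(vectors: List[List[int]]) -> List[int]:
    """Sum the weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


merged = merge_weighted([
    [4, -4, -4, 4, -4, 4],  # "processing", weight 4
    [5, -5, 5, -5, 5, 5],   # "natural language", weight 5
])
# merged == [9, -9, 1, -1, 1, 9]
```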
(5) Displacement calculation, which differs from the traditional method
The accumulated result of the n-bit signatures obtained in step (4) has 64 bits in total, and a displacement operation is performed every 16 bits: if the accumulated value is greater than or equal to 0, the result is shifted to the right by one bit and 1 is added; if the accumulated value is less than 0, the result is shifted to the right by one bit without adding 1. The integer fingerprint value of the text is thus obtained and used as the fingerprint of the document data. For example, performing the displacement operation on the "9 -9 1 -1 1 9 …" calculated in step (4) yields the integer 45623145689452513 as the data fingerprint of the text.
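A hedged sketch of building the integer fingerprint from the accumulated values follows. It assumes the usual simhash construction, in which the accumulator is shifted left by one bit per position and 1 is added when the accumulated value is non-negative; the shift direction and the "every 16 bits" grouping stated in the text are not reproduced, and the function name `to_fingerprint` is hypothetical. Only the six positions shown in step (4) are used, so the result is not the 64-bit integer quoted above.

```python
from typing import List


def to_fingerprint(accumulated: List[int]) -> int:
    """Pack the signs of the accumulated values into one integer."""
    fingerprint = 0
    for value in accumulated:
        fingerprint <<= 1      # make room for the next bit
        if value >= 0:
            fingerprint += 1   # non-negative value contributes a 1 bit
    return fingerprint


example = to_fingerprint([9, -9, 1, -1, 1, 9])  # bits 101011, i.e. 43
```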
Fingerprint storage: a document fingerprint is stored in a database as a type of metadata for text.
(II) folding search text (as shown in FIG. 5)
After the search engine finds the texts that meet the requirements, the corresponding text metadata, which includes the fingerprint data of each text, is retrieved as well, and the texts are folded using the fingerprint data.
1. Construction of similarity matrix using document fingerprints
After the document fingerprint data is searched out, each text is circularly traversed by utilizing the XOR operation of integers, and a similarity matrix between every two texts is calculated.
For example, 5 text fingerprint data are: [45236542,5689456,231456,25634879,888565623], calculating the exclusive or value of two fingerprint data by using integer exclusive or operation as the similarity of two texts, and calculating to obtain a 5 × 5 similarity matrix, which is expressed as: [[0,2,9,3,5],[2,0,1,4,6],[9,1,0,2,5],[3,4,2,0,3],[5,6,5,3,0]].
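The pairwise matrix construction can be sketched as below. The text describes the similarity as the XOR of two integer fingerprints; this sketch additionally assumes that the distance is the number of 1 bits in that XOR (the Hamming distance), and the function names are hypothetical. The small fingerprints are made up for illustration, and the sketch is not claimed to reproduce the example matrix above.

```python
from typing import List


def fingerprint_distance(a: int, b: int) -> int:
    """Count the differing bits of two integer fingerprints."""
    return bin(a ^ b).count("1")


def similarity_matrix(fingerprints: List[int]) -> List[List[int]]:
    """Build the symmetric matrix of pairwise fingerprint distances."""
    return [[fingerprint_distance(a, b) for b in fingerprints]
            for a in fingerprints]


matrix = similarity_matrix([0b1010, 0b1000, 0b0101])
# matrix == [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
```

Smaller entries mean more similar texts, and the diagonal is always 0 because each fingerprint is compared with itself.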
2. Folding texts according to the folding threshold parameter
In this embodiment, the folding threshold is selected from the range 2 to 5 (inclusive). The smaller the folding threshold, the higher the similarity required for texts to be folded together; conversely, the larger the folding threshold, the lower the similarity of the texts that are folded together. In addition, folding follows the order of the incoming texts, so when the incoming order differs, the order of the texts in the returned folded data also differs.
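One possible reading of order-preserving folding is sketched below, using the 5 × 5 similarity matrix from the example above. The grouping strategy is an assumption, since the text does not spell it out: each text not yet folded becomes a group head in incoming order, and every later unfolded text whose distance to that head is less than or equal to the threshold is folded under it. The function name `fold` is hypothetical.

```python
from typing import List


def fold(matrix: List[List[int]], threshold: int) -> List[List[int]]:
    """Group text indices in incoming order; each group starts with its head."""
    folded = set()
    groups = []
    for i in range(len(matrix)):
        if i in folded:
            continue  # already folded under an earlier text
        group = [i]
        for j in range(i + 1, len(matrix)):
            if j not in folded and matrix[i][j] <= threshold:
                group.append(j)
                folded.add(j)
        groups.append(group)
    return groups


matrix = [[0, 2, 9, 3, 5],
          [2, 0, 1, 4, 6],
          [9, 1, 0, 2, 5],
          [3, 4, 2, 0, 3],
          [5, 6, 5, 3, 0]]
groups_at_2 = fold(matrix, 2)  # [[0, 1], [2, 3], [4]]
groups_at_5 = fold(matrix, 5)  # [[0, 1, 3, 4], [2]]
```

With this strategy a larger threshold folds more texts under the first heads, and changing the incoming order changes which text becomes the head of each group.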
3. Displaying on the search page according to the folded text effect
After the text is folded by utilizing the fingerprint data, the folded data of the returned text is displayed on a search page and is displayed in sequence, so that not only are the same or highly similar files folded, but also the files are displayed in a certain sequence, a user can see the text with the same or high similarity more clearly, and more different texts which accord with the search result can be found after the text is folded.
To verify the validity of the technical solution, the embodiment selects the first data and the second data for testing, where the first data is specifically:
100 actual articles, whose contents were modified manually to produce a number of similar texts: 10 identical texts, 10 highly similar texts with small changes (about 10% of the content deleted or modified), and 10 texts with large changes (more than 50% of the content deleted or modified); data one is used to test the accuracy of text folding;
the second data specifically comprises:
1000 document fingerprints generated by simulation (they do not correspond to actual texts, so highly similar texts do not necessarily exist among them; they serve only to test efficiency); these 1000 fingerprints are used to run text folding and test the folding efficiency.
Data one was tested first. Following the technical process, the fingerprint data of each document is created first (it is stored in the database as one item of metadata and is available when a search engine is used to search the text), and the fingerprint data is then used for folding to test the folding effect. The test results are shown in Table 1:
TABLE 1
[Table 1: folding test results for data one; the table is an image in the original publication and is not reproduced here]
The test results for data one, in which similar texts were produced manually, show that identical or highly similar texts are folded under the same group, that texts with low similarity are essentially never folded into the same group, and that the keywords extracted from texts folded together are essentially the same. Tests across several scenarios therefore indicate that the technical scheme reliably achieves the text folding effect.
Data two was then tested. About 20 texts are displayed per page during searching, so 1000 texts occupy about 50 pages; 1000 fingerprint data were therefore generated by simulation. The test results are shown in Table 2:
TABLE 2
Test documents | Time | Remarks
1000 | 0.4 s | 1000 fingerprint data generated by simulation
According to the test results for the second data, folding 1000 retrieved texts using their fingerprints takes 0.4 s, which satisfies the timing requirements of the application scenario and shows that the technical scheme folds efficiently.
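A timing experiment of this kind can be sketched as follows. This is only an illustrative harness, not the test program used for Table 2: the function names `simulate_fingerprints` and `time_pairwise`, the 64-bit fingerprint width, the fixed random seed, and the use of a bit count on the XOR result are all assumptions, and the elapsed time will depend on the machine and language used.

```python
import random
import time


def simulate_fingerprints(n, seed=0):
    """Generate n random 64-bit integers standing in for document fingerprints."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(n)]


def time_pairwise(fps):
    """Build the full pairwise distance matrix and return it with the elapsed time."""
    start = time.perf_counter()
    n = len(fps)
    # Assumption: distance = number of 1 bits in the XOR of two fingerprints.
    matrix = [[bin(fps[i] ^ fps[j]).count("1") for j in range(n)]
              for i in range(n)]
    elapsed = time.perf_counter() - start
    return matrix, elapsed


fps = simulate_fingerprints(1000)
matrix, elapsed = time_pairwise(fps)
print(f"{len(fps)} fingerprints compared in {elapsed:.3f} s")
```

Because the fingerprints are random, almost no pair falls under a small folding threshold, which matches the remark that the simulated data test efficiency rather than accuracy.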
In summary, the invention creates fingerprints based on the idea of the simhash algorithm, but differs from the conventional approach in two important respects. First, the conventional method builds a fingerprint as a character string such as 0111100110 by string concatenation, whereas the invention takes the integer obtained by bit shifting as the document fingerprint; an integer fingerprint obtained by shifting is efficient to create, easy to store, and faster to compare than a string fingerprint obtained by concatenation. Second, the conventional comparison method computes the Hamming distance, that is, the number of differing positions between two character strings, as the distance between two fingerprints, whereas this method computes the distance through an integer XOR operation and derives the similarity between documents from it; this is considerably more efficient and gives the same result as the Hamming distance calculation. In addition, judging text similarity from fingerprint data and folding accordingly is faster than directly using edit distance, Jaccard distance, and the like, works better on highly similar texts, better matches the original purpose of text folding, and effectively folds identical or highly similar texts.
Besides the two points above, the fingerprint storage method and the text folding scheme have the following two features: 1. Fingerprint storage. In the literature that could be found, a plain document's fingerprint is stored as a character string (such as '1010100011') and then compared; this method stores document fingerprints as integers, which reduces the space occupied. 2. Text folding is performed in a definite order, and the returned data are ordered as well: the displayed texts are arranged in a definite order, and the texts folded under each displayed text are also in a definite order. Most existing search results directly display every text that meets the search conditions, so many identical texts appear and the user must scroll down or turn pages to reach the text of interest. With the technical scheme provided by the invention, identical or highly similar texts are folded under one text, so that the user can conveniently and quickly view multiple distinct search texts that meet the search conditions.
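The equivalence between the string comparison and the integer comparison can be illustrated with a short sketch. It assumes that the distance is taken as the number of 1 bits in the XOR result; the text states only that an integer XOR operation is used, so the bit-count step and the function names `string_hamming` and `int_distance` are assumptions made for illustration.

```python
def string_hamming(a: str, b: str) -> int:
    """Conventional approach: count differing positions in two bit strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def int_distance(x: int, y: int) -> int:
    """Integer approach: XOR the fingerprints, then count the 1 bits.

    Assumption: the bit count is what makes the XOR result agree with
    the Hamming distance.
    """
    return bin(x ^ y).count("1")


# The example string from the text and a hypothetical variant of it.
s1 = "0111100110"
s2 = "0101100111"
i1, i2 = int(s1, 2), int(s2, 2)

print(string_hamming(s1, s2), int_distance(i1, i2))  # both print 2
```

The integer form needs one XOR over the whole fingerprint instead of a character-by-character loop, which is where the claimed speed and storage advantage comes from.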
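The ordered folding described in point 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `similarity_matrix` and `fold`, the sample titles, the 8-bit fingerprints, and the threshold of 3 are all hypothetical, and the distance is again assumed to be the bit count of the XOR result.

```python
from typing import Dict, List


def similarity_matrix(fps: List[int]) -> List[List[int]]:
    """Pairwise distances; a smaller value means more similar texts."""
    n = len(fps)
    return [[bin(fps[i] ^ fps[j]).count("1") for j in range(n)]
            for i in range(n)]


def fold(texts: List[str], fps: List[int], threshold: int) -> List[Dict]:
    """Fold texts in their incoming order.

    The first text of each group is displayed; later texts whose distance
    to it is within the threshold are folded under it, keeping their order.
    """
    sim = similarity_matrix(fps)
    already_folded = set()
    groups = []
    for i, text in enumerate(texts):
        if i in already_folded:
            continue
        group = {"shown": text, "folded": []}
        for j in range(i + 1, len(texts)):
            if j not in already_folded and sim[i][j] <= threshold:
                already_folded.add(j)
                group["folded"].append(texts[j])
        groups.append(group)
    return groups


texts = ["A", "A copy", "B", "A edit"]
fps = [0b1011_0000, 0b1011_0000, 0b0100_1111, 0b1011_0001]
groups = fold(texts, fps, threshold=3)
print(groups)
```

With these sample values, "A copy" and "A edit" are folded under "A" and "B" is shown on its own, so both the displayed texts and the folded texts keep the order in which the search returned them.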

Claims (10)

1. A search text folding processing system is characterized by comprising a fingerprint creating module (1), a database (2) and a folding module (3), wherein the input end of the fingerprint creating module (1) is connected to a local storage end (5) and is used for creating document fingerprints corresponding to texts in the local storage end (5);
the output end of the fingerprint creating module (1) is connected with the database (2) so as to store the document fingerprints and the corresponding texts in the database (2);
the input end of the folding module (3) is connected with the database (2), the database (2) is connected with the search engine end (4), and the database (2) outputs corresponding search texts and corresponding document fingerprints to the folding module (3) according to search instructions of the search engine end (4);
the output end of the folding module (3) is connected to the search engine end (4), the folding module (3) is used for folding the search text according to the document fingerprint to obtain folded text data, and the folded text data are transmitted to the search engine end (4) to be displayed on the search page.
2. The folding processing system for the search text according to claim 1, wherein the folding module (3) comprises a similarity calculation unit (31) and a folding sorting unit (32) which are connected in sequence, the similarity calculation unit (31) is connected with the database (2) and is used for calculating the similarity between the search texts according to the document fingerprints to construct a similarity matrix;
the folding sorting unit (32) is connected with the search engine end (4) and is used for folding the search texts meeting the folding conditions according to the incoming sequence of the search texts by combining the similarity matrix and a preset folding threshold value to obtain folding text data, and transmitting the folding text data to the search engine end (4) for displaying.
3. A search text folding method applying the search text folding processing system of claim 1, characterized by comprising the steps of:
S1, the fingerprint creating module acquires all stored texts from the local storage end, and performs word segmentation, hash calculation, weighting, merging, dimension reduction and displacement calculation on each text in sequence to obtain the document fingerprint corresponding to each text;
S2, the fingerprint creating module stores the document fingerprints and the corresponding texts in the database;
S3, according to the search condition input by the user, the search engine end completes the corresponding search operation and obtains a plurality of search texts and the corresponding document fingerprints from the database;
S4, the folding module acquires all search texts and the corresponding document fingerprints from the database in sequence, calculates the similarity between every two search texts, folds the search texts meeting the folding condition according to a preset folding threshold and the incoming sequence of the search texts to obtain folded text data, and transmits the folded text data to the search engine end;
S5, the search engine end displays the received folded text data on a search page.
4. The method according to claim 3, wherein the specific process of the fingerprint creation module performing word segmentation in step S1 is as follows: performing word segmentation on the text to obtain effective feature vectors, and setting a corresponding weight for each feature vector.
5. The method for folding search text according to claim 4, wherein the hash calculation process performed by the fingerprint creation module in step S1 includes: calculating the hash value of each feature vector through a hash function, wherein the hash value is specifically an n-bit signature consisting of the binary digits 0 and 1.
6. The method for folding search text according to claim 5, wherein the specific process of weighting by the fingerprint creation module in step S1 is as follows: on the basis of the hash value and in combination with the weight of the feature vector, calculating a weighting result for each feature vector, wherein if a hash element of the hash value is 1, the weight of the feature vector is multiplied by 1, and if a hash element of the hash value is 0, the weight of the feature vector is multiplied by -1, so as to obtain the weighting result corresponding to each feature vector.
7. The method according to claim 6, wherein the specific process of the fingerprint creation module performing dimension reduction in step S1 is as follows: accumulating the weighting results of all the feature vectors, and reducing the dimension to obtain a 64-bit accumulated result.
8. The method according to claim 7, wherein the specific process of the fingerprint creation module performing the displacement calculation in step S1 is as follows: performing one displacement calculation every 16 bits, wherein if the value is greater than or equal to 0, the fingerprint is shifted right by one bit and 1 is added, and if the value is less than 0, the fingerprint is shifted right by one bit without adding 1, thereby obtaining the integer value of the fingerprint corresponding to the search text, namely the document fingerprint.
9. The method for folding search text according to claim 3, wherein the step S4 specifically includes the following steps:
S41, the folding module acquires all document fingerprints and corresponding search texts from the database in sequence;
S42, circularly traversing each search text in sequence, and calculating an exclusive OR value of two document fingerprints by adopting an integer exclusive OR operation mode to serve as the similarity corresponding to the two search texts, wherein the smaller the similarity value is, the more similar the two search texts are;
S43, arranging and combining all the similarities in sequence to obtain a similarity matrix;
S44, sequentially judging, for each similarity element in the similarity matrix and in combination with a preset folding threshold, whether the similarity element meets the folding condition; if so, folding the two search texts corresponding to the similarity element according to the incoming sequence of the search texts to obtain folded text data; otherwise, not folding the two search texts corresponding to the similarity element;
S45, the folding module transmits the folded text data to the search engine end.
10. The method for folding search text according to claim 9, wherein the folding condition is specifically: the similarity element value is less than or equal to the folding threshold.
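The fingerprint creation steps recited in claims 4 to 8 follow the general shape of simhash and can be sketched as below. This is a hedged illustration only: whitespace tokenization, term frequency as the weight, the first 8 bytes of an MD5 digest as the 64-bit hash, and the names `hash64` and `simhash` are all assumptions, and the sketch uses the conventional left-shift construction of the integer rather than reproducing the right shift and 16-bit grouping recited in claim 8.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width, matching the 64-bit result in claim 7


def hash64(token: str) -> int:
    """Assumed hash function: first 8 bytes of MD5 as a 64-bit integer."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    # Word segmentation and weighting (assumption: whitespace tokens,
    # weight = term frequency; a real system would use a proper segmenter).
    weights = Counter(text.lower().split())

    # Weighting and merging: add the weight where the hash bit is 1,
    # subtract it where the hash bit is 0, accumulating per bit position.
    acc = [0] * BITS
    for token, weight in weights.items():
        h = hash64(token)
        for i in range(BITS):
            bit = (h >> (BITS - 1 - i)) & 1
            acc[i] += weight if bit else -weight

    # Dimension reduction and shifting: one output bit per position,
    # 1 if the accumulated value is >= 0, otherwise 0.
    fingerprint = 0
    for value in acc:
        fingerprint = (fingerprint << 1) | (1 if value >= 0 else 0)
    return fingerprint


fp_a = simhash("search text folding with document fingerprints")
fp_b = simhash("search text folding with document fingerprints")
print(fp_a == fp_b, bin(fp_a ^ fp_b).count("1"))  # identical texts: True 0
```

Because the result is a single integer, it can be stored in an integer column of the database alongside the text and compared with one XOR at search time.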
CN202011465449.7A 2020-12-14 2020-12-14 Search text folding processing system and method thereof Pending CN112464639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011465449.7A CN112464639A (en) 2020-12-14 2020-12-14 Search text folding processing system and method thereof


Publications (1)

Publication Number Publication Date
CN112464639A true CN112464639A (en) 2021-03-09

Family

ID=74804073


Country Status (1)

Country Link
CN (1) CN112464639A (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210309)