CN106569989A

CN106569989A - Deduplication method and apparatus for short text

Info

Publication number: CN106569989A
Application number: CN201610915522.3A
Authority: CN
Inventors: 李苗苗
Original assignee: Beijing Intelligent Housekeeper Technology Co Ltd
Current assignee: Beijing Intelligent Housekeeper Technology Co Ltd
Priority date: 2016-10-20
Filing date: 2016-10-20
Publication date: 2017-04-19

Abstract

An embodiment of the invention discloses a deduplication method for short text. The method comprises: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string; obtaining a text sub-string according to the weights corresponding to the keywords, wherein the text sub-string comprises a threshold number of keywords; and removing repeated items of the text sub-string. By extracting the keywords of the text string, the technical scheme provided by the embodiment generalizes the original text string, which improves the generalization capability and efficiency of deduplication; at the same time, the amount of computation is small, and deduplication among multiple text strings is achieved.

Description

Duplication eliminating method and device for short text

Technical Field

The embodiment of the invention relates to the technical field of text processing, in particular to a duplication eliminating method and device for a short text.

Background

Text deduplication refers to removing identical characters, words, or semantically similar components from a text string. With the continuous development of internet technology, a large number of short message streams have appeared. The number of such messages is huge, but each message is generally short; such messages are usually called short texts. Specifically, a short text is a text of very short length, generally within 200 characters, such as short messages sent by mobile phones over a mobile communication network, instant messages sent through instant messaging software, weblog comments, internet news comments, and the like.

Current text deduplication methods mainly comprise text hashing methods and similarity comparison methods. Text hashing methods include consistent (exact-match) hashing and locality-sensitive hashing: consistent hashing has no generalization capability and its judgment condition is too strict, while locality-sensitive hashing is better suited to relatively long texts such as web pages. Similarity comparison methods require pairwise comparison, and the amount of computation is too large for massive volumes of text. Short texts are generally very short and their sample features are very sparse, so effective language features are difficult to extract accurately. Short texts are also highly real-time and extremely numerous, so short text processing has higher efficiency requirements than long text processing. Moreover, short texts are concise in expression, often contain misspellings and irregular usage, are noisy, carry limited usable information, and suffer from severe word sparsity. As a result, directly applying a long-text deduplication method to short texts gives poorer results.

Disclosure of Invention

In view of this, the present invention provides a deduplication method and apparatus for short text, which solve the problem that the judgment conditions in text deduplication are too strict, and improve the generalization capability and efficiency of short text deduplication.

In a first aspect, an embodiment of the present invention provides a method for removing duplicate texts, where the method includes: acquiring text string information of a short text; segmenting words of the text string, and obtaining keywords of the text string according to word segmentation information of the text string; obtaining a text sub-string according to the weight corresponding to the key words, wherein the text sub-string comprises a threshold number of key words; and removing repeated items of the text substring.

In a second aspect, an embodiment of the present invention provides a deduplication apparatus for short texts, where the apparatus includes: the acquiring unit is used for acquiring text string information of the short text; the extraction unit is connected with the acquisition unit and is used for segmenting the text string and obtaining the key words of the text string according to the segmentation information of the text string; the processing unit is connected with the extraction unit and used for obtaining a text sub-string according to the weight corresponding to the key words, wherein the text sub-string comprises a threshold number of key words; and the operation unit is connected with the processing unit and used for removing repeated items of the text substring.

In the embodiment of the invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of the short text, a text sub-string is obtained according to the weight information of the keywords, and repeated items in the text sub-string are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, keeps the amount of computation small, and achieves deduplication within one text string or among a plurality of text strings.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 is a flowchart of a method for removing duplicate texts according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a deduplication method for short text in the second embodiment of the present invention;

FIG. 3 is a flowchart of a method for removing duplicate short texts according to a third embodiment of the present invention;

FIG. 4 is a block diagram of a deduplication apparatus for short text in a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should also be noted that, for the convenience of description, only some but not all of the matters related to the present invention are shown in the drawings. It should be further noted that, for convenience of description, examples related to the present invention are shown in the following embodiments, which are used only for illustrating the principles of the embodiments of the present invention and are not meant to limit the embodiments of the present invention, and the specific values of the examples may vary according to different application environments and parameters of the apparatus or the components.

The method and the device for removing duplicate short texts in the embodiments of the present invention can run on a terminal equipped with an operating system such as Windows (an operating system platform developed by Microsoft Corporation), Android (an operating system platform developed by Google for portable mobile intelligent devices), iOS (an operating system platform developed by Apple for portable mobile intelligent devices), Windows Phone (an operating system platform developed by Microsoft Corporation for portable mobile intelligent devices), and the like. The terminal can be any one of a desktop computer, a notebook computer, a mobile phone, a palmtop computer, a tablet computer, a digital camera, a digital video camera, and the like.

Example one

Fig. 1 is a flowchart of a method for removing a duplicate of a short text according to a first embodiment of the present invention, where the method is implemented by an apparatus having a document processing function, and the apparatus may be implemented by software and/or hardware, and is typically a user terminal device, such as a mobile phone, a computer, and so on. In this embodiment, the generalization relationship refers to a relationship between a general description and a specific description of an element, and the specific description is established on the basis of the general description and is extended. Generalization refers to the manipulation of an element to make it more generalized. The method for removing the duplicate of the short text in the embodiment comprises the following steps: step S110, step S120, step S130, and step S140.

Step S110, text string information of the short text is acquired.

Specifically, the user inputs a text string to be processed, and information of the text string is obtained. Optionally, the information of the text string may include, but is not limited to, the name of the text string, the content of the text string, the length of the text string, and the semantics of the words in the text string. Optionally, the name of the text string may be S.

Step S120, performing word segmentation on the text string, and obtaining the keywords of the text string according to the word segmentation information of the text string.

Specifically, the text string is segmented. Word segmentation is a basic step of information processing; its main task is to have a computer automatically split sentences and identify the independent words. Optionally, the word segmentation algorithm may be the shortest path method, which calculates the shortest path from one node to all other nodes and is mainly characterized by expanding outwards layer by layer from the starting point until the end point is reached. For example, for the text string S "I want to go to the Industrial and Commercial Bank", the word segmentation result using the shortest path method is: "I / want / go to / Industrial and Commercial Bank". The word segmentation information of the text string is then processed to obtain the keyword information of the text string. The keyword information may include, but is not limited to, verbs, nouns and adjectives that express the actual meaning of the text string. For the text string S "I want to go to the Industrial and Commercial Bank", the keyword information is: "I / go to / Industrial and Commercial Bank".
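
As an illustration of the shortest path idea, the following minimal sketch treats every position in the string as a node and every dictionary word (or single character) as an edge, then picks the segmentation that uses the fewest words. The function name `segment_shortest_path`, the toy dictionary, and the rule that single characters are always allowed are assumptions made for this example; the patent does not specify the dictionary or the edge costs.

```python
def segment_shortest_path(text, dictionary):
    """Split text into the fewest pieces, where a piece is a dictionary word
    or a single character (assumption: every edge has cost 1)."""
    n = len(text)
    # best[i] = (number of words needed to cover text[:i], start of the last word)
    best = [(0, 0)] + [(float("inf"), 0)] * n
    max_len = max(1, max((len(w) for w in dictionary), default=1))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # An edge exists for each dictionary word and for each single character.
            if piece in dictionary or end - start == 1:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    # Walk back along the stored predecessors to recover the path.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


toy_dictionary = {"short", "text", "shorttext", "dedup"}  # hypothetical dictionary
print(segment_shortest_path("shorttextdedup", toy_dictionary))
```

With this toy dictionary the two-word path through "shorttext" is preferred over the three-word path through "short" and "text", because it is the shorter path through the word graph.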

Step S130, obtaining a text sub-string according to the weights corresponding to the keywords, wherein the text sub-string comprises a threshold number of keywords.

Specifically, the weight of each keyword is calculated and the system sets a number threshold; using the weight of each keyword as the selection criterion, the threshold number of keywords is selected as the text sub-string.

Step S140, removing repeated items of the text sub-string.

Specifically, after a series of generalization operations such as word segmentation and keyword extraction are performed on the text string, the corresponding text sub-string is obtained, and the repeated items in the text sub-string are then removed. The repeated items may include, but are not limited to, identical characters or words in the text strings and characters or words with similar semantics. Optionally, a consistent hash algorithm and a locality-sensitive hash algorithm are applied in combination. A consistent hash algorithm, such as the Message-Digest Algorithm 5 (MD5) or the MurmurHash algorithm, is applied to the generalized text string, and the generated hash value serves as the unique identifier of the text string. A locality-sensitive hash algorithm, such as the SimHash algorithm, further calculates similarity through the Hamming distance between the generated hash values to judge whether two texts are the same or of the same type. The Hamming distance is, in information coding, the number of positions at which the corresponding digits of two valid codewords differ; two text strings are considered to be the same when the Hamming distance is less than 3. By applying the consistent hash algorithm and the locality-sensitive hash algorithm in combination, the deduplication operation is performed on the text substrings according to the generated hash values and Hamming distances.
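
The combination of an exact hash and a SimHash comparison can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names, the 64-bit fingerprint width, the use of MD5 as the per-token hash inside SimHash, and equal token weights are all assumptions. Only the "Hamming distance less than 3" rule comes from the text.

```python
import hashlib


def exact_fingerprint(text):
    """Exact-match identifier: identical strings always share the same MD5 digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def simhash(tokens, bits=64):
    """SimHash fingerprint of a token list (assumption: every token has weight 1)."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(tokens_a, tokens_b, max_distance=3):
    """Same exact hash, or SimHash fingerprints closer than max_distance bits."""
    if exact_fingerprint(" ".join(tokens_a)) == exact_fingerprint(" ".join(tokens_b)):
        return True
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) < max_distance
```

The exact hash handles the cheap, common case of identical generalized strings, and the SimHash comparison is only needed when the exact hashes differ.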

Example two

Fig. 2 is a flowchart of a method for removing duplicate short texts in the second embodiment of the present invention. This embodiment further explains step S120, step S130, and step S140 on the basis of the first embodiment. In step S120, obtaining the keywords of the text string according to the word segmentation information of the text string includes: removing stop words from the word segmentation information, and performing normalization processing. In step S130, the factors influencing the weight of the keywords include at least the frequency of each keyword and/or the inverse document frequency, and the text sub-string comprising a threshold number of keywords includes: removing the keywords in the text string whose weights are smaller than a preset weight threshold; or selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords; and connecting two or more keywords in the text string into a phrase through a preset separator or segmentation string. In step S140, removing the repeated items of the text sub-string includes: if there is one text sub-string, removing repeated items within the text sub-string; and if there are two or more text sub-strings, removing repeated items among the text sub-strings. Specifically, the method for removing duplicate short texts in this embodiment includes: step S210, step S220, step S230, step S240, and step S250.

Step S210, text string information of the short text is acquired.

Step S220, performing word segmentation on the text string, removing stop words in the word segmentation information, and performing normalization processing to obtain keywords of the text string.

Specifically, in information retrieval, in order to save storage space and improve search efficiency, certain characters or words, namely stop words, are automatically filtered out before or after natural language data (or text) is processed. Stop words are entered manually rather than generated automatically; the entered stop words form a stop word list, and stop words are removed by means of this list. The stop word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary words and function words such as modal particles. Stop word removal and normalization are applied to the word segmentation result of the text string to obtain the keyword information of the text string. For example, for a text string meaning "I like to drink latte" that ends with a modal particle, removing the stop-word particle yields "I like to drink latte". Normalization operations include, but are not limited to, unifying full-width and half-width characters to half-width, unifying upper and lower case to lower case, unifying numerals to Arabic numerals, reducing English word forms to their root word, and the like. For example, a full-width or capitalized rendering of "I like Shakespeare's 'Hamlet'" is normalized to its half-width, lowercase rendering; "Industrial and Commercial Bank" and its abbreviated name are both normalized to "Industrial and Commercial Bank"; a year written out in Chinese numerals is normalized to "2008"; and "does, do, doing, and did" are normalized to "do".
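
A minimal sketch of stop word removal plus two of the normalization operations (full-width to half-width, and lower-casing) might look like the following. The stop word list and the function names are hypothetical; numeral unification and reduction of English words to their roots are deliberately left out, since they need language-specific resources that the text does not describe.

```python
import unicodedata

# Hypothetical, manually maintained stop word list (punctuation plus a few function words).
STOP_WORDS = {",", ".", "!", "?", "the", "a", "of"}


def normalize(token):
    """Fold full-width forms to half-width with NFKC, then lower-case."""
    return unicodedata.normalize("NFKC", token).lower()


def extract_keywords(tokens):
    """Normalize each segmented token and drop anything in the stop word list."""
    normalized = (normalize(t) for t in tokens)
    return [t for t in normalized if t and t not in STOP_WORDS]


print(normalize("ＡＢＣ１２３"))  # abc123
print(extract_keywords(["The", "Ｂank", "，", "of", "China"]))  # ['bank', 'china']
```

Normalizing before the stop word lookup means the list only needs to contain half-width, lowercase entries.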

Step S230, removing the keywords in the text string whose weights are smaller than a preset weight threshold, or selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords, to obtain the text sub-string, wherein the factors influencing the weight of the keywords include at least the frequency of each keyword and/or the inverse document frequency.

Specifically, a text sub-string is obtained by processing the text string, and a weight threshold Q and a threshold number G are preset. The keywords of the text string are obtained by performing word segmentation, stop word removal and normalization on the text string, and the factors influencing the weight of the keywords include at least the frequency of each keyword and/or the inverse document frequency. Specifically, the term frequency (TF) of each keyword represents how often the word appears in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the number of all documents used for the statistics and n is the number of documents containing the word. IDF measures how distinctive a word is; the more distinctive the word, the lower the degree of repetition between two text strings. IDF is obtained from statistics over massive external resources. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. Optionally, the weight of each keyword is obtained with TF-IDF from the frequency and the inverse document frequency of each keyword of the text string; according to these weights, either the keywords whose weight is smaller than the preset weight threshold Q are removed and the remaining keywords are taken as the text sub-string, or the threshold number G of keywords in the text string are selected as the text sub-string.
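
The weighting and the two selection rules can be sketched as below, using IDF = log(t/n) as defined above. Several details are assumptions for illustration only: TF is taken as the relative frequency within the text string, the weight is the product TF × IDF, words missing from the document-frequency table are treated as occurring in one document, and the function names and sample counts are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(keywords, doc_freq, total_docs):
    """Weight of each keyword as TF * IDF, with IDF = log(t / n)."""
    counts = Counter(keywords)
    weights = {}
    for word, count in counts.items():
        tf = count / len(keywords)
        n = doc_freq.get(word, 1)  # assumption: unseen words appear in one document
        weights[word] = tf * math.log(total_docs / n)
    return weights


def filter_by_weight(keywords, weights, q):
    """First rule: drop keywords whose weight is below the weight threshold Q."""
    return [w for w in keywords if weights[w] >= q]


def top_keywords(keywords, weights, g):
    """Second rule: keep the G highest-weighted keywords, in their original order."""
    unique = list(dict.fromkeys(keywords))
    chosen = set(sorted(unique, key=lambda w: weights[w], reverse=True)[:g])
    return [w for w in unique if w in chosen]


keywords = ["go", "bank", "bank", "now"]
doc_freq = {"go": 500, "bank": 10, "now": 1000}  # hypothetical corpus statistics
weights = tfidf_weights(keywords, doc_freq, total_docs=1000)
print(filter_by_weight(keywords, weights, q=0.1))  # ['go', 'bank', 'bank']
print(top_keywords(keywords, weights, g=2))        # ['go', 'bank']
```

In this toy corpus "now" appears in every document, so its IDF is log(1) = 0 and both rules discard it.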

Step S240, connecting two or more keywords in the text string into a phrase by a preset separator or a segmentation string.

Specifically, two or more keywords in the text string are obtained, and the phrase formed by connecting them through a preset separator or segmentation string is used as the text sub-string. The preset separator or segmentation string may include, but is not limited to, a space, an enumeration comma, and the like. For example, when the text string content is "Tianan company gate industry", it becomes "Tianan gate industry" after word segmentation, stop word removal and normalization, and "Tianan" and "gate industry" remain distinct keywords once the preset separator is added. If the preset separator or segmentation string is not added, the system may automatically recognize the more common "Tianan gate" in the concatenated keywords, which changes the semantics of the text string.
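
The effect of the separator can be shown with a very small sketch. The function name and the choice of a single space as the preset separator are assumptions; the point is only that keyword boundaries survive joining, so two different keyword lists cannot collapse into the same string.

```python
SEPARATOR = " "  # assumption: a single space serves as the preset separator


def join_keywords(keywords, separator=SEPARATOR):
    """Connect keywords into one phrase while keeping their boundaries visible."""
    return separator.join(keywords)


# Without a separator, two different keyword lists become indistinguishable.
print("".join(["ab", "c"]) == "".join(["a", "bc"]))              # True
print(join_keywords(["ab", "c"]) == join_keywords(["a", "bc"]))  # False
```

Because the later hash is computed over the joined phrase, preserving the boundaries also keeps the hash values of such keyword lists distinct.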

Step S250, removing repeated items of the text substrings, and if the number of the text substrings is one, removing the repeated items in the text substrings; and if the text substrings are two or more, removing repeated items among the text substrings.

Specifically, the operation of removing repeated items is performed on the text substring obtained by adding a preset separator or a segmentation string to the keyword. If the number of the text substrings is one, removing repeated items in the text substrings, and if the number of the text substrings is two or more, respectively performing the operations of the steps S210 to S240 on each text string to remove the repeated items between the two or more text strings. The de-duplication items may include, but are not limited to, keywords with the same or similar hash values, and whether the keywords are the same or similar texts is determined by calculating similarity through hamming distance.
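
The two cases (one text sub-string versus several) can be sketched as follows. For brevity this illustration only removes exactly identical items and keeps the first occurrence; the near-duplicate check by Hamming distance is omitted. The function names and the space separator are assumptions.

```python
def dedupe_within(substring, separator=" "):
    """Remove repeated keywords inside one text sub-string, keeping first occurrences."""
    return separator.join(dict.fromkeys(substring.split(separator)))


def dedupe_between(substrings):
    """Keep only the first of any group of identical text sub-strings."""
    return list(dict.fromkeys(substrings))


def remove_repeats(substrings):
    """One sub-string: deduplicate inside it. Two or more: deduplicate among them."""
    if len(substrings) == 1:
        return [dedupe_within(substrings[0])]
    return dedupe_between(substrings)


print(remove_repeats(["go bank go"]))                         # ['go bank']
print(remove_repeats(["go bank", "go bank", "drink latte"]))  # ['go bank', 'drink latte']
```

`dict.fromkeys` is used because it removes duplicates while preserving insertion order, so the surviving items stay in their original sequence.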

In the embodiment of the invention, text sub-strings are obtained by performing word segmentation, stop word removal, normalization, addition of preset separators or segmentation strings, and similar operations on the text strings. For a single text sub-string, repeated items within it are removed; for two or more text sub-strings, repeated items among them are removed. Carrying out a series of generalization operations before the hash value is calculated further broadens the generalization of the algorithm, improves deduplication efficiency, and achieves deduplication of one or more texts.

Example three

Fig. 3 is a flowchart of a method for removing duplicate short texts in a third embodiment of the present invention. As a preferred embodiment, this embodiment describes the operation of removing duplicates between two text strings on the basis of the first embodiment and the second embodiment. Specifically, the method for removing duplicate short texts in this embodiment includes: step S310, step S320, step S330, step S340, step S350, step S360, and step S370.

Step S310, information of the first text string and the second text string is acquired.

Step S320, performing word segmentation on the first text string to obtain word segmentation information of the first text string, and performing word segmentation on the second text string to obtain word segmentation information of the second text string.

Step S330, performing stop word removal and normalization on the segmented words of the first text string to obtain keyword information of the first text string; and performing stop word removal and normalization on the segmented words of the second text string to obtain the keyword information of the second text string.

Specifically, in information retrieval, in order to save storage space and improve search efficiency, certain characters or words, namely stop words, are automatically filtered out before or after natural language data (or text) is processed. Stop words are entered manually rather than generated automatically; the entered stop words form a stop word list, and stop words are removed by means of this list. Preferably, the stop word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary words and function words such as modal particles. Stop word removal and normalization are applied to the word segmentation result of the first text string and the word segmentation result of the second text string respectively to obtain the keyword information of the first text string and the second text string. The normalization operation includes, but is not limited to, unifying full-width and half-width characters to half-width, unifying upper and lower case to lower case, unifying numerals to Arabic numerals, reducing English word forms to their root word, and the like. For example, a full-width or capitalized rendering of "I like Shakespeare's 'Hamlet'" is normalized to its half-width, lowercase rendering; "Industrial and Commercial Bank" and its abbreviated name are both normalized to "Industrial and Commercial Bank"; a year written out in Chinese numerals is normalized to "2008"; and "does, do, doing, and did" are normalized to "do".

For example, suppose the first text string S1 is "I want to go to" followed by the abbreviated name of the Industrial and Commercial Bank, and the second text string S2 is "I go to the Industrial and Commercial Bank". Word segmentation using the shortest path method splits S1 and S2 into their individual words. Stop word removal and normalization are then applied to the first text string S1 and the second text string S2, giving S1: "I want to go to the Industrial and Commercial Bank" and S2: "I go to the Industrial and Commercial Bank".

Step S340, acquiring a first weight of each keyword of the first text string according to the frequency of each keyword of the first text string and the inverse document frequency; and acquiring a second weight of each keyword of the second text string according to the frequency of each keyword of the second text string and the inverse document frequency.

Specifically, the term frequency (TF) of each keyword represents how often the word appears in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the number of all documents used for the statistics and n is the number of documents containing the word. IDF measures how distinctive a word is; the more distinctive the word, the lower the degree of repetition between two text strings. IDF is obtained from statistics over massive external resources. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. The first weight of each keyword of the first text string is obtained with TF-IDF from the frequency and the inverse document frequency of each keyword of the first text string, and the second weight of each keyword of the second text string is obtained with TF-IDF from the frequency and the inverse document frequency of each keyword of the second text string.

Step S350, removing the keywords of which the first weights of the keywords of the first text string are smaller than a preset weight threshold value, and taking the rest keywords as the first text sub-string; or selecting a threshold number of keywords as the first text substring according to the weight of each keyword. Removing the keywords of which the second weights of the keywords of the second text string are smaller than a preset weight threshold, and taking the rest keywords as the second text sub string; or selecting a threshold number of keywords as the second text substring according to the weight of each keyword.

Specifically, a weight threshold Q is preset; the keywords of the first text string whose first weight is smaller than Q are removed, and the remaining keywords are used as the first text sub-string. Alternatively, a threshold number G is preset, and G keywords are selected according to their weights as the first text sub-string. Likewise, the keywords of the second text string whose second weight is smaller than Q are removed, and the remaining keywords are used as the second text sub-string; or G keywords are selected according to their weights as the second text sub-string.

Step S360, acquiring two or more words in the keyword information of the first text string, and taking a phrase formed by connecting the two or more words in the keyword information of the first text string through a preset separator or a segmentation string as a first text sub-string; and acquiring two or more words in the keyword information of the second text string, and taking a phrase formed by connecting the two or more words in the keyword information of the second text string through a preset separator or a segmentation string as a second text sub-string.

Specifically, keyword information of the first text string is acquired, wherein the keyword information comprises two or more words, and a phrase formed by connecting two or more words in the keywords of the first text string through a preset separator or segmentation string is used as the first text sub-string. Keyword information of the second text string is acquired, wherein the keyword information comprises two or more words, and a phrase formed by connecting two or more words in the keywords of the second text string through a preset separator or segmentation string is used as the second text sub-string. The preset separator or segmentation string may include, but is not limited to, a space, an enumeration comma, and the like. For example, when the content of the first text string or the second text string is "Tianan company gate industry", it becomes "Tianan gate industry" after word segmentation, stop word removal and normalization, and "Tianan" and "gate industry" remain distinct keywords once the preset separator is added. If the preset separator or segmentation string is not added, the system may automatically recognize the more common "Tianan gate" in the concatenated keywords, which changes the semantics of the text string.

Step S370, performing a deduplication operation on the first text sub-string and the second text sub-string.

The embodiment of the invention provides a preferable scheme for removing the duplication between two text strings. A series of analysis and processing are carried out on the text string before the hash value is calculated, and finally the text string is represented as the hash value to carry out duplication elimination operation, so that the problem that the traditional hash algorithm judges too strictly is solved, the capacity of generalizing the original string is achieved, and the duplication elimination efficiency is improved.
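
To make the contrast with hashing the raw string concrete, here is a minimal end-to-end sketch for two text strings. It is a simplification under stated assumptions: whitespace splitting stands in for a real word segmenter, the stop word list is hypothetical, the weighting step is skipped, and only the exact (MD5) hash is compared.

```python
import hashlib
import unicodedata

STOP_WORDS = {"i", "want", "to", "the"}  # hypothetical stop word list


def generalized_key(text):
    """Hash the text only after segmentation, normalization and stop word removal."""
    # Whitespace splitting is a stand-in for a real word segmenter.
    tokens = unicodedata.normalize("NFKC", text).lower().split()
    keywords = [t.strip("!.,?") for t in tokens]
    keywords = [t for t in keywords if t and t not in STOP_WORDS]
    return hashlib.md5(" ".join(keywords).encode("utf-8")).hexdigest()


s1 = "I want to go to the Bank"
s2 = "i go to bank!"
raw_equal = hashlib.md5(s1.encode("utf-8")).digest() == hashlib.md5(s2.encode("utf-8")).digest()
print(raw_equal)                                  # False: the raw strings differ
print(generalized_key(s1) == generalized_key(s2))  # True: both reduce to "go bank"
```

Hashing the raw strings treats the two sentences as unrelated, while hashing the generalized keyword phrase identifies them as duplicates.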

Example four

Fig. 4 is a block diagram of an apparatus for removing duplicate short texts according to a fourth embodiment of the present invention. The apparatus is suitable for executing the short text deduplication method provided in the first to third embodiments of the present invention, and specifically includes: an acquisition unit 410, an extraction unit 420, a processing unit 430 and an operation unit 440.

An acquisition unit 410, configured to obtain text string information of a short text;

an extraction unit 420, connected to the acquisition unit 410, and configured to perform word segmentation on the text string and obtain keywords of the text string according to word segmentation information of the text string;

a processing unit 430, connected to the extraction unit 420, and configured to obtain a text sub-string according to the weights corresponding to the keywords, where the text sub-string includes a threshold number of keywords;

and an operation unit 440, connected to the processing unit 430, and configured to remove repeated items of the text sub-string.

Further, in the processing unit 430, the factors affecting the weight of the keywords include at least the frequency of each keyword and/or the inverse document frequency, and the processing unit 430 is specifically configured to: remove the keywords in the text string whose weights are smaller than a preset weight threshold; or select a threshold number of keywords in the text string according to the weights corresponding to the keywords; and connect two or more keywords in the text string into a phrase through a preset separator or segmentation string.

Further, the extracting unit 420 is specifically configured to: and removing stop words in the word segmentation information, and performing normalization processing.

Further, the operation unit 440 is specifically configured to: if the number of the text substrings is one, removing repeated items in the text substrings; and if the text substrings are two or more, removing repeated items among the text substrings.
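
The chain of four connected units can be pictured as a small pipeline object. This is only a structural sketch: the class name `DedupApparatus`, the callable signatures, and the trivial lambdas used to fill each unit are assumptions for illustration, and the units of the patent may equally be realized in hardware.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DedupApparatus:
    acquire: Callable[[], List[str]]           # acquisition unit 410
    extract: Callable[[str], List[str]]        # extraction unit 420
    process: Callable[[List[str]], str]        # processing unit 430
    operate: Callable[[List[str]], List[str]]  # operation unit 440

    def run(self) -> List[str]:
        """Pass each acquired text string through the units in their connected order."""
        strings = self.acquire()
        substrings = [self.process(self.extract(s)) for s in strings]
        return self.operate(substrings)


apparatus = DedupApparatus(
    acquire=lambda: ["Go Bank", "go bank", "drink latte"],
    extract=lambda s: s.lower().split(),              # placeholder segmentation
    process=lambda keywords: " ".join(keywords),      # placeholder substring builder
    operate=lambda subs: list(dict.fromkeys(subs)),   # placeholder exact deduplication
)
print(apparatus.run())  # ['go bank', 'drink latte']
```

Keeping each unit behind a plain callable makes it easy to swap in a real segmenter, a TF-IDF selector, or a SimHash-based operation unit without changing the wiring.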

Obviously, those skilled in the art should understand that the above products can perform the methods provided by any embodiments of the present invention, and have corresponding functional modules and beneficial effects for performing the methods.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A deduplication method for short text, comprising:

acquiring text string information of a short text;

segmenting words of the text string, and obtaining keywords of the text string according to word segmentation information of the text string;

obtaining a text sub-string according to the weight corresponding to the key words, wherein the text sub-string comprises a threshold number of key words;

and removing repeated items of the text substring.

2. The method of claim 1, wherein the text sub-string includes a threshold number of keywords comprising:

removing the keywords of which the weights of the keywords in the text string are smaller than a preset weight threshold; or,

and selecting a threshold number of keywords in the text string according to the weight corresponding to the keywords.

3. The method for removing duplicate texts according to claim 1, wherein obtaining the keywords of the text strings according to the segmentation information of the text strings comprises:

and removing stop words in the word segmentation information, and performing normalization processing.

4. The method for removing duplicates of a short text according to claim 1, wherein according to the weight corresponding to said keyword, a text sub-string is obtained, said text sub-string comprising a threshold number of keywords further comprising:

factors influencing the weight of the keywords at least comprise the frequency of each keyword and/or the inverse document frequency.

5. The method for removing duplicates of a short text according to claim 1, wherein a text sub-string is obtained according to the weight corresponding to said keyword, said text sub-string comprising a threshold number of keywords, further comprising:

and connecting two or more keywords in the text string into phrases through preset separators or segmentation strings.

6. The method of claim 1, wherein removing duplicates of said text sub-string comprises:

if the number of the text substrings is one, removing repeated items in the text substrings; and if the text substrings are two or more, removing repeated items among the text substrings.

7. A deduplication apparatus for short text, comprising:

the acquiring unit is used for acquiring text string information of the short text;

the extraction unit is connected with the acquisition unit and is used for segmenting the text string and obtaining the key words of the text string according to the segmentation information of the text string;

the processing unit is connected with the extraction unit and used for obtaining a text sub-string according to the weight corresponding to the key words, wherein the text sub-string comprises a threshold number of key words;

and the operation unit is connected with the processing unit and used for removing repeated items of the text substring.

8. The apparatus according to claim 7, wherein the processing unit is configured to influence the keyword weights at least according to a frequency of each keyword and/or an inverse document frequency, and the processing unit is specifically configured to:

selecting a threshold number of keywords in the text string according to the weight corresponding to the keywords;

9. The apparatus according to claim 7, wherein the extracting unit is specifically configured to:

10. The short text deduplication device of claim 7, wherein the operation unit is specifically configured to: