CN106569989A - De-weighting method and apparatus for short text - Google Patents
- Publication number
- CN106569989A (CN201610915522.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- string
- keywords
- text string
- sub
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention discloses a deduplication method for short text. The method comprises: obtaining text string information of the short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information; obtaining a text sub-string according to the weights corresponding to the keywords, wherein the text sub-string comprises a threshold number of keywords; and removing repeated items of the text sub-string. According to the technical scheme provided by the embodiment, extracting the keywords of the text string generalizes the original text string, which improves the generalization capability and efficiency of deduplication; meanwhile, the amount of calculation is small, and deduplication among multiple text strings is realized.
Description
Technical Field
The embodiment of the invention relates to the technical field of text processing, in particular to a deduplication method and apparatus for short text.
Background
Text deduplication refers to removing identical characters or words, or semantically similar components, from a text string. With the continuous development of internet technology, a large number of short message streams have appeared. The number of messages is huge, but each message is generally short; such messages are usually called short texts. Specifically, a short text is a text of very short length, generally within 200 characters, such as a mobile phone short message sent through a mobile communication network, an instant message sent through instant messaging software, a comment on a weblog, or a comment on internet news.
Current text deduplication methods mainly comprise text hashing methods and similarity comparison methods. Text hashing methods comprise consistent (exact-match) hashing and locality-sensitive hashing. Consistent hashing has no generalization and its judgment condition is too strict; locality-sensitive hashing is more suitable for relatively long texts such as web pages. Similarity comparison methods require pairwise comparison, and the amount of calculation is too large to handle massive volumes of text. Short texts are generally very short and their sample features are very sparse, so effective language features are difficult to extract accurately. Short texts are also highly real-time and extremely numerous, so short text processing has higher efficiency requirements than long text processing. In addition, short texts are concise in expression, contain misspellings and nonstandard usage, and are noisy; the available information is limited and word sparsity is severe. As a result, directly applying a long-text deduplication method to short texts gives degraded results.
Disclosure of Invention
In view of this, the present invention provides a deduplication method and apparatus for short text, which solve the problem of overly strict judgment conditions in text deduplication and improve the generalization capability and efficiency of short text deduplication.
In a first aspect, an embodiment of the present invention provides a deduplication method for short text, where the method includes: acquiring text string information of a short text; performing word segmentation on the text string, and obtaining keywords of the text string according to the word segmentation information of the text string; obtaining a text sub-string according to the weights corresponding to the keywords, wherein the text sub-string comprises a threshold number of keywords; and removing repeated items of the text sub-string.
In a second aspect, an embodiment of the present invention provides a deduplication apparatus for short text, where the apparatus includes: an acquisition unit, configured to acquire text string information of a short text; an extraction unit, connected to the acquisition unit and configured to perform word segmentation on the text string and obtain keywords of the text string according to the word segmentation information of the text string; a processing unit, connected to the extraction unit and configured to obtain a text sub-string according to the weights corresponding to the keywords, wherein the text sub-string comprises a threshold number of keywords; and an operation unit, connected to the processing unit and configured to remove repeated items of the text sub-string.
In the embodiment of the invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of the short text, a text sub-string is acquired according to the weight information of the keywords, and repeated items in the text sub-string are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, keeps the amount of calculation small, and realizes deduplication within one text string or among a plurality of text strings.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a flowchart of a method for removing duplicate texts according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a deduplication method for short text in the second embodiment of the present invention;
FIG. 3 is a flowchart of a method for removing duplicate short texts according to a third embodiment of the present invention;
FIG. 4 is a block diagram of a deduplication apparatus for short text in a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should also be noted that, for the convenience of description, only some but not all of the matters related to the present invention are shown in the drawings. It should be further noted that, for convenience of description, examples related to the present invention are shown in the following embodiments, which are used only for illustrating the principles of the embodiments of the present invention and are not meant to limit the embodiments of the present invention, and the specific values of the examples may vary according to different application environments and parameters of the apparatus or the components.
The deduplication method and apparatus for short text in the embodiment of the present invention can run on a terminal equipped with an operating system such as Windows (an operating system platform developed by Microsoft Corporation), Android (an operating system platform developed by Google for portable mobile intelligent devices), iOS (an operating system platform developed by Apple for portable mobile intelligent devices), or Windows Phone (an operating system platform developed by Microsoft Corporation for portable mobile intelligent devices). The terminal can be any one of a desktop computer, a notebook computer, a mobile phone, a palmtop computer, a tablet computer, a digital camera, a digital video camera, and the like.
Example one
FIG. 1 is a flowchart of a deduplication method for short text according to a first embodiment of the present invention. The method is executed by an apparatus having a document processing function; the apparatus may be implemented in software and/or hardware and is typically a user terminal device, such as a mobile phone or a computer. In this embodiment, a generalization relationship refers to the relationship between a general description and a specific description of an element, where the specific description builds on and extends the general description. Generalization refers to operating on an element to make it more general. The deduplication method for short text in this embodiment comprises the following steps: step S110, step S120, step S130, and step S140.
Step S110, text string information of the short text is acquired.
Specifically, the user inputs a text string to be processed, and the information of the text string is obtained. Optionally, the information of the text string may include, but is not limited to, the name of the text string, the content of the text string, the length of the text string, and the semantics of the words in the text string. Optionally, the text string may be named S.
Step S120, performing word segmentation on the text string, and obtaining the keywords of the text string according to the word segmentation information of the text string.
Specifically, word segmentation is performed on the text string. Word segmentation is a basic step of information processing; its main task is to have a computer automatically split sentences and identify independent words. Optionally, the word segmentation algorithm may be the shortest path method. The shortest path method calculates the shortest path from one node to all other nodes; its main characteristic is that it expands outwards layer by layer from the starting point until the end point is reached. Optionally, for the text string S "I want to go to the Industrial and Commercial Bank", the word segmentation result using the shortest path method is: I / want / go to / Industrial and Commercial Bank. The word segmentation information of the text string is then processed to obtain the keyword information of the text string. The keyword information may include, but is not limited to, verbs, nouns and adjectives that express the actual meaning of the text string. For the text string S "I want to go to the Industrial and Commercial Bank", the keyword information is: I / go to / Industrial and Commercial Bank.
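As an illustration only, the shortest path idea in step S120 can be sketched as finding the segmentation that uses the fewest dictionary words, which is a shortest path through the word graph of the sentence. The sketch below uses dynamic programming rather than layer-by-layer expansion. The function name, the toy dictionary, and the Chinese sentence (an assumed rendering of the translated example) are assumptions and are not taken from the patent.

```python
def segment_shortest_path(text, dictionary):
    """Split text into the fewest words; single characters are always allowed."""
    n = len(text)
    # best[i] = (number of words, previous cut position) for the prefix text[:i]
    best = [(0, -1)] + [(float("inf"), -1)] * n
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in dictionary or end - start == 1:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    # Walk back along the recorded cut positions to recover the words.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


# Hypothetical toy dictionary; a real segmenter would use a large lexicon.
DICTIONARY = {"工商", "银行", "工商银行"}
tokens = segment_shortest_path("我想去工商银行", DICTIONARY)
print(tokens)
```

The four-word path through "工商银行" beats the five-word path through "工商" and "银行", so the bank name stays in one piece.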
Step S130, obtaining a text sub-string according to the weights corresponding to the keywords, wherein the text sub-string comprises a threshold number of keywords.
Specifically, the weight of each keyword is calculated, the system sets a number threshold, and a threshold number of keywords are selected as the text sub-string, using the weight corresponding to each keyword as the basis for selection.
Step S140, removing repeated items of the text sub-string.
Specifically, after a series of generalization operations such as word segmentation and keyword extraction are performed on the text string, a corresponding text sub-string is obtained, and the repeated items in the text sub-string are then removed. The repeated items may include, but are not limited to, identical characters or words in the text string, and characters or words with similar semantics. Optionally, a consistent hashing algorithm and a locality-sensitive hashing algorithm are applied in combination. A consistent hashing algorithm, such as the Message-Digest Algorithm 5 (MD5) or MurmurHash, operates on the generalized text string, and the generated hash value is the unique identifier of the text string. A locality-sensitive hashing algorithm, such as SimHash, further calculates similarity through the Hamming distance to judge whether the texts behind the generated hash values are the same or of the same type. The Hamming distance is the number of corresponding bit positions at which two valid codewords differ; two text strings are considered the same when the Hamming distance is less than 3. By applying the consistent hashing algorithm and the locality-sensitive hashing algorithm in combination, the deduplication operation is performed on the text sub-strings according to the generated hash values and Hamming distances.
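The combination of an exact hash and a SimHash fingerprint described in step S140 might look like the following minimal sketch. It assumes an unweighted 64-bit SimHash built from MD5 digests of the keywords; the function names and the 64-bit width are assumptions, and only the "Hamming distance less than 3" rule comes from the text.

```python
import hashlib


def exact_key(keywords, sep=" "):
    """MD5 digest of the joined keywords; equal digests mean identical sub-strings."""
    return hashlib.md5(sep.join(keywords).encode("utf-8")).hexdigest()


def simhash(keywords, bits=64):
    """Unweighted SimHash: every keyword votes on each bit of the fingerprint."""
    votes = [0] * bits
    for word in keywords:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(kw_a, kw_b, max_distance=3):
    """Exact match first, then near match when the Hamming distance is below 3."""
    if exact_key(kw_a) == exact_key(kw_b):
        return True
    return hamming(simhash(kw_a), simhash(kw_b)) < max_distance
```

The exact hash is cheap and catches identical sub-strings; the fingerprint comparison is the fallback for sub-strings that differ slightly, for example in keyword order.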
In the embodiment of the invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of the short text, a text sub-string is acquired according to the weight information of the keywords, and repeated items in the text sub-string are removed. This generalizes the original text string, improves the generalization capability and efficiency of deduplication, keeps the amount of calculation small, and realizes deduplication within one text string or among a plurality of text strings.
Example two
FIG. 2 is a flowchart of a deduplication method for short text in the second embodiment of the present invention. This embodiment further explains step S120, step S130, and step S140 on the basis of the first embodiment. In step S120, obtaining the keywords of the text string according to the word segmentation information of the text string includes: removing stop words from the word segmentation information, and performing normalization processing. In step S130, the factors influencing the weight of a keyword at least include the frequency of the keyword and/or the inverse document frequency, and obtaining a text sub-string comprising a threshold number of keywords includes: removing the keywords in the text string whose weights are smaller than a preset weight threshold, or selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords; and connecting two or more keywords in the text string into a phrase through a preset separator or segmentation string. In step S140, removing the repeated items of the text sub-string includes: if there is one text sub-string, removing repeated items within the text sub-string; and if there are two or more text sub-strings, removing repeated items among the text sub-strings. Specifically, the deduplication method for short text in this embodiment includes: step S210, step S220, step S230, step S240, and step S250.
Step S210, text string information of the short text is acquired.
Step S220, performing word segmentation on the text string, removing stop words in the word segmentation information, and performing normalization processing to obtain keywords of the text string.
Specifically, in information retrieval, in order to save storage space and improve search efficiency, certain characters or words, namely stop words, are automatically filtered out before or after natural language data (or text) is processed. Stop words are all entered manually rather than generated automatically; the entered stop words form a stop word list, and stop words are removed by means of this list. The stop word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary words and function words such as modal particles. Stop word removal and normalization are performed on the word segmentation result of the text string to obtain the keyword information of the text string. For example, for a text string meaning "I like to drink latte" that ends with a modal particle, removing the stop word (the particle) yields "I like to drink latte". Normalization operations include, but are not limited to, unifying full-width and half-width characters into half-width characters, unifying upper and lower case into lower case, unifying numerals into Arabic numerals, and reducing inflected English word forms to their root word. Optionally, "I like Shakespeare's Hamlet" written with full-width characters is normalized to its half-width form; the full name and the abbreviated name of the Industrial and Commercial Bank are both normalized to "Industrial and Commercial Bank"; a year written in Chinese numerals is normalized to "2008"; and "does, do, doing, and did" are normalized to "do".
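A minimal sketch of stop word removal and normalization follows, under stated assumptions: the stop word list, the alias table, and the stem table are hypothetical stand-ins for the manually maintained resources the text describes, and Unicode NFKC normalization is used here as one way to fold full-width characters into half-width ones.

```python
import unicodedata

# Hypothetical, manually maintained tables (not from the patent).
STOP_WORDS = {"的", "了", "啊", ",", "!", "?"}
ALIASES = {"工行": "工商银行"}
STEMS = {"does": "do", "doing": "do", "did": "do"}


def normalize(token):
    """Fold width and case, then map aliases and inflected forms to one form."""
    token = unicodedata.normalize("NFKC", token)  # full-width -> half-width
    token = token.lower()
    token = ALIASES.get(token, token)
    return STEMS.get(token, token)


def extract_keywords(tokens):
    """Normalize every token, then drop empty tokens and stop words."""
    normalized = (normalize(t) for t in tokens)
    return [t for t in normalized if t and t not in STOP_WORDS]


print(extract_keywords(["我", "喜欢", "喝", "拿铁", "啊"]))
```

Normalizing before filtering means that a full-width comma is folded to "," first and is then caught by the same stop word entry as the half-width comma.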
Step S230, removing the keywords in the text string whose weights are smaller than a preset weight threshold, or selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords, to obtain the text sub-string. The factors influencing the weight of a keyword at least comprise the frequency of the keyword and/or the inverse document frequency.
Specifically, a text sub-string is obtained by processing the text string, and a weight threshold Q and a threshold number G are preset. The keywords of the text string are obtained by performing word segmentation, stop word removal and normalization on the text string, and the factors influencing the weight of a keyword at least comprise the frequency of the keyword and/or the inverse document frequency. Specifically, the term frequency (TF) of a keyword represents how often the word appears in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the number of all documents used for statistics and n is the number of documents containing the word. IDF measures how distinctive a word is; the more distinctive the words, the lower the degree of repetition between two text strings. IDF is obtained through statistics over massive external resources. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. Optionally, the weight of each keyword is obtained with TF-IDF from the term frequency and the inverse document frequency of each keyword of the text string. According to these weights, the keywords whose weight is smaller than the preset weight threshold Q are removed and the remaining keywords are taken as the text sub-string; or a threshold number G of keywords in the text string are selected as the text sub-string.
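The weighting and selection in step S230 can be sketched as follows. The function names, the natural-log base, and the fallback of treating unseen words as appearing in one document are assumptions; the text only specifies IDF = log(t/n) and the two selection rules (weight threshold Q or threshold number G).

```python
import math
from collections import Counter


def tfidf_weights(keywords, doc_freq, total_docs):
    """Weight of each keyword = TF * IDF, with IDF = log(t / n)."""
    counts = Counter(keywords)
    weights = {}
    for word, count in counts.items():
        tf = count / len(keywords)
        n = doc_freq.get(word, 1)  # assumption: unseen words count as one document
        weights[word] = tf * math.log(total_docs / n)
    return weights


def select_by_threshold(keywords, weights, q):
    """Keep keywords whose weight is at least Q, in original order."""
    return [w for w in keywords if weights[w] >= q]


def select_top_g(keywords, weights, g):
    """Keep the G highest-weighted keywords, in original order."""
    top = set(sorted(weights, key=weights.get, reverse=True)[:g])
    return [w for w in keywords if w in top]


# Made-up document frequencies over a hypothetical 1000-document corpus.
doc_freq = {"i": 900, "go": 500, "bank": 10}
weights = tfidf_weights(["i", "go", "bank"], doc_freq, total_docs=1000)
print(select_top_g(["i", "go", "bank"], weights, g=2))
```

With these made-up counts the very common word "i" gets the lowest weight, so both selection rules drop it and keep "go" and "bank".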
Step S240, connecting two or more keywords in the text string into a phrase by a preset separator or a segmentation string.
Specifically, two or more keywords in the text string are obtained, and a phrase formed by connecting the two or more keywords through a preset separator or a preset segmentation string is used as the text sub-string. The preset separator or segmentation string may include, but is not limited to, a space, an enumeration comma, and the like. Optionally, when the content of the text string is the company name "Tianan Gate Industry", the keywords after word segmentation, stop word removal and normalization are "Tianan" and "gate industry", and after the preset separator is added the text sub-string is "Tianan / gate industry". If the preset separator or segmentation string is not added, the system may automatically recognize the more common word "Tiananmen" in the concatenated characters, which changes the semantics of the text string.
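The effect of the separator can be shown with a short sketch. The Chinese keywords are an assumed reconstruction of the translated example, and `join_keywords` is a hypothetical helper name: without a separator, two different segmentations collapse into the same character sequence, so the word boundaries are lost.

```python
def join_keywords(keywords, sep):
    """Connect keywords into one phrase using the given separator."""
    return sep.join(keywords)


intended = ["天安", "门业"]   # "Tianan" + "gate industry"
misread = ["天安门", "业"]    # "Tiananmen" + "industry"

# Without a separator both segmentations produce the same string.
print(join_keywords(intended, "") == join_keywords(misread, ""))

# With a separator the intended word boundary is preserved.
print(join_keywords(intended, " "), "|", join_keywords(misread, " "))
```

Any string that cannot occur inside a keyword can serve as the separator; a space is used here only for readability.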
Step S250, removing repeated items of the text substrings, and if the number of the text substrings is one, removing the repeated items in the text substrings; and if the text substrings are two or more, removing repeated items among the text substrings.
Specifically, the operation of removing repeated items is performed on the text sub-string obtained by adding a preset separator or segmentation string to the keywords. If there is one text sub-string, repeated items within the text sub-string are removed. If there are two or more text strings, the operations of steps S210 to S240 are performed on each text string, and repeated items between the two or more resulting text sub-strings are then removed. The repeated items may include, but are not limited to, keywords with the same or similar hash values; whether they are the same or similar text is determined by calculating similarity through the Hamming distance.
In the embodiment of the invention, the text sub-strings are obtained by performing operations such as word segmentation, stop word removal, normalization, and addition of preset separators or segmentation strings on the text strings. For one text sub-string, repeated items within the text sub-string are removed; for two or more text sub-strings, repeated items among the text sub-strings are removed. A series of generalization operations is carried out before the hash value is calculated, so that the degree of generalization of the algorithm is further widened, the deduplication efficiency is improved, and the deduplication of one or more texts is realized.
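The two cases in step S250 can be sketched as below, assuming exact matching only (MD5 of the joined keywords); near-duplicate matching by Hamming distance is left out for brevity. The function names are hypothetical.

```python
import hashlib


def dedupe_within(keywords):
    """One sub-string: drop repeated keywords, keeping first occurrences in order."""
    return list(dict.fromkeys(keywords))


def dedupe_across(substrings, sep=" "):
    """Several sub-strings: keep the first of each group with an identical hash."""
    seen, kept = set(), []
    for keywords in substrings:
        key = hashlib.md5(sep.join(keywords).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(keywords)
    return kept


print(dedupe_within(["go", "bank", "go"]))
print(dedupe_across([["i", "go", "bank"], ["i", "go", "bank"], ["i", "like", "latte"]]))
```

Keeping a set of seen hashes avoids pairwise comparison, so the cost grows linearly with the number of sub-strings.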
Example three
FIG. 3 is a flowchart of a deduplication method for short text in a third embodiment of the present invention. As a preferred embodiment, this embodiment describes the operation of removing duplicates between two text strings, on the basis of the first embodiment and the second embodiment. Specifically, the deduplication method for short text in this embodiment includes: step S310, step S320, step S330, step S340, step S350, step S360, and step S370.
Step S310, information of the first text string and the second text string is acquired.
Step S320, performing word segmentation on the first text string to obtain word segmentation information of the first text string, and performing word segmentation on the second text string to obtain word segmentation information of the second text string.
Step S330, performing stop word removal and normalization on the segmented words of the first text string to obtain keyword information of the first text string; and performing stop word removal and normalization on the segmented words of the second text string to obtain the keyword information of the second text string.
Specifically, in information retrieval, in order to save storage space and improve search efficiency, certain characters or words, namely stop words, are automatically filtered out before or after natural language data (or text) is processed. Stop words are all entered manually rather than generated automatically; the entered stop words form a stop word list, and stop words are removed by means of this list. Preferably, the stop word list includes, but is not limited to, punctuation marks, mathematical symbols, and Chinese auxiliary words and function words such as modal particles. Stop word removal and normalization are performed on the word segmentation result of the first text string and on the word segmentation result of the second text string to obtain the keyword information of the first text string and the second text string. The normalization operation includes, but is not limited to, unifying full-width and half-width characters into half-width characters, unifying upper and lower case into lower case, unifying numerals into Arabic numerals, and reducing inflected English word forms to their root word. Optionally, "I like Shakespeare's Hamlet" written with full-width characters is normalized to its half-width form; the full name and the abbreviated name of the Industrial and Commercial Bank are both normalized to "Industrial and Commercial Bank"; a year written in Chinese numerals is normalized to "2008"; and "does, do, doing, and did" are normalized to "do".
Optionally, the first text string S1 is "I want to go to" followed by the abbreviated name of the Industrial and Commercial Bank, and the second text string S2 is "I go to the Industrial and Commercial Bank". The result of word segmentation using the shortest path method is, for the first text string S1: I / want / go to / (abbreviated bank name); and for the second text string S2: I / go to / Industrial and Commercial Bank. Stop word removal and normalization are then performed on the first text string S1 and the second text string S2, giving for the first text string S1: I / want / go to / Industrial and Commercial Bank; and for the second text string S2: I / go to / Industrial and Commercial Bank.
Step S340, acquiring a first weight of each keyword of the first text string according to the frequency of each keyword of the first text string and the reverse document frequency; and acquiring a second weight of each keyword of the second text string according to the frequency of each keyword of the second text string and the reverse document frequency.
Specifically, the term frequency (TF) of a keyword represents how often the word appears in the text string. The inverse document frequency (IDF) is IDF = log(t/n), where t is the number of all documents used for statistics and n is the number of documents containing the word. IDF measures how distinctive a word is; the more distinctive the words, the lower the degree of repetition between two text strings. IDF is obtained through statistics over massive external resources. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. A first weight of each keyword of the first text string is acquired with TF-IDF from the term frequency and the inverse document frequency of each keyword of the first text string, and a second weight of each keyword of the second text string is acquired with TF-IDF from the term frequency and the inverse document frequency of each keyword of the second text string.
Step S350, removing the keywords of which the first weights of the keywords of the first text string are smaller than a preset weight threshold value, and taking the rest keywords as the first text sub-string; or selecting a threshold number of keywords as the first text substring according to the weight of each keyword. Removing the keywords of which the second weights of the keywords of the second text string are smaller than a preset weight threshold, and taking the rest keywords as the second text sub string; or selecting a threshold number of keywords as the second text substring according to the weight of each keyword.
Specifically, a preset weight threshold Q is preset, keywords with the first weight smaller than Q in the keywords in the first text string are removed, and the rest keywords are used as the first text substring; or presetting a threshold number G, and selecting the keywords with the preset threshold number G according to the weight. Removing the keywords of which the second weight is smaller than Q in the keywords in the second text string, and taking the rest keywords as second text substrings; or selecting keywords with a preset threshold number G according to the weight.
Step S360, acquiring two or more words in the keyword information of the first text string, and taking a phrase formed by connecting the two or more words through a preset separator or a segmentation string as the first text sub-string; and acquiring two or more words in the keyword information of the second text string, and taking a phrase formed by connecting the two or more words in the keyword information of the second text string through a preset separator or a segmentation string as the second text sub-string.
Specifically, the keyword information of the first text string, which comprises two or more words, is acquired, and a phrase formed by connecting two or more words in the keywords of the first text string through a preset separator or a preset segmentation string is used as the first text sub-string; the keyword information of the second text string, which comprises two or more words, is acquired, and a phrase formed by connecting two or more words in the keywords of the second text string through a preset separator or a preset segmentation string is used as the second text sub-string. The preset separator or segmentation string may include, but is not limited to, a space, an enumeration comma, and the like. Optionally, when the content of the first text string or the second text string is the company name "Tianan Gate Industry", the keywords after word segmentation, stop word removal and normalization are "Tianan" and "gate industry", and after the preset separator is added the text sub-string is "Tianan / gate industry". If the preset separator or segmentation string is not added, the system may automatically recognize the more common word "Tiananmen" in the concatenated characters, which changes the semantics of the text string.
Step S370, performing the deduplication operation on the first text sub-string and the second text sub-string.
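Steps S330 to S370 for two text strings can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: the tokens are given already segmented, "ICBC" stands in for the abbreviated bank name, the alias table is hypothetical, and the weights are made-up values rather than TF-IDF scores computed from a corpus.

```python
import hashlib

ALIASES = {"icbc": "industrial and commercial bank"}  # hypothetical alias table
WEIGHTS = {  # made-up weights standing in for TF-IDF scores
    "i": 0.2,
    "want": 0.05,
    "go": 0.3,
    "industrial and commercial bank": 0.9,
}
Q = 0.1  # preset weight threshold


def substring_key(tokens, sep="|"):
    """Normalize, drop keywords lighter than Q, join with a separator, hash."""
    normalized = [ALIASES.get(t.lower(), t.lower()) for t in tokens]
    kept = [t for t in normalized if WEIGHTS.get(t, 0.0) >= Q]
    return hashlib.md5(sep.join(kept).encode("utf-8")).hexdigest()


s1 = ["I", "want", "go", "ICBC"]
s2 = ["I", "go", "Industrial and Commercial Bank"]

# The raw strings differ, but the generalized sub-strings hash to the same value.
print(substring_key(s1) == substring_key(s2))
```

Hashing the raw strings would treat S1 and S2 as distinct; generalizing first is what lets an exact hash recognize them as duplicates.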
The embodiment of the invention provides a preferable scheme for removing the duplication between two text strings. A series of analysis and processing are carried out on the text string before the hash value is calculated, and finally the text string is represented as the hash value to carry out duplication elimination operation, so that the problem that the traditional hash algorithm judges too strictly is solved, the capacity of generalizing the original string is achieved, and the duplication elimination efficiency is improved.
Example four
Fig. 4 is a block diagram of an apparatus for removing duplicate short texts according to a fourth embodiment of the present invention. The apparatus is suitable for executing the short text deduplication method provided in the first to third embodiments of the present invention, and specifically includes: an acquisition unit 410, an extraction unit 420, a processing unit 430 and an operation unit 440.
An obtaining unit 410, configured to obtain text string information of a short text;
the extracting unit 420 is connected to the obtaining unit 410, and is configured to perform word segmentation on the text string, and obtain a keyword of the text string according to word segmentation information of the text string;
the processing unit 430 is connected to the extracting unit 420, and is configured to obtain a text sub-string according to the weight corresponding to the keyword, where the text sub-string includes a threshold number of keywords;
and the operation unit 440 is connected to the processing unit 440 and is used for removing repeated items of the text sub-string.
Further, in the processing unit 440, the factors affecting the weight of the keywords at least include the frequency of each keyword and/or the inverse document frequency, and the processing unit 440 is specifically configured to: removing the keywords of which the weights of the keywords in the text string are smaller than a preset weight threshold; or selecting the keywords with the threshold number in the text string according to the weight corresponding to the keywords. And connecting two or more keywords in the text string into phrases through preset separators or segmentation strings.
Further, the extracting unit 420 is specifically configured to: remove stop words from the word segmentation information and perform normalization processing.
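The stop-word removal and normalization can be illustrated with the sketch below. It assumes the text has already been segmented into tokens by some word segmenter, and it treats normalization as Unicode NFKC folding, lowercasing and punctuation stripping. The stop-word list and function names are hypothetical, and the embodiment may normalize differently.

```python
import re
import unicodedata

# Hypothetical stop-word list; a real system would load a language-specific one.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "for", "and"}


def normalize(token: str) -> str:
    """Fold width and case variants, then strip non-word characters."""
    token = unicodedata.normalize("NFKC", token).lower()
    return re.sub(r"[^\w]", "", token)


def extract_keywords(tokens: list[str]) -> list[str]:
    """Normalize each segmented token and drop stop words and empty tokens."""
    keywords = []
    for token in tokens:
        norm = normalize(token)
        if norm and norm not in STOP_WORDS:
            keywords.append(norm)
    return keywords


print(extract_keywords(["Cheap", "flights", "to", "Paris", "!"]))
```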
Further, the operation unit 440 is specifically configured to: if there is one text sub-string, remove repeated items within the text sub-string; and if there are two or more text sub-strings, remove repeated items among the text sub-strings.
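The two cases handled by the operation unit can be sketched as follows. This is one possible reading: "repeated items" within a single sub-string are taken to be repeated keywords, and "repeated items" among several sub-strings are taken to be identical sub-strings. The function names and the `_` separator are assumptions.

```python
def dedupe_within(substring: str, sep: str = "_") -> str:
    """Single sub-string: keep the first occurrence of each keyword, preserving order."""
    seen = set()
    kept = []
    for item in substring.split(sep):
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return sep.join(kept)


def dedupe_across(substrings: list[str]) -> list[str]:
    """Two or more sub-strings: keep the first occurrence of each distinct sub-string."""
    seen = set()
    kept = []
    for sub in substrings:
        if sub not in seen:
            seen.add(sub)
            kept.append(sub)
    return kept


print(dedupe_within("cheap_flights_cheap_paris"))
print(dedupe_across(["flights_paris", "hotel_rome", "flights_paris"]))
```

Using a set for membership checks keeps both operations linear in the number of items, which is consistent with the low computation cost the embodiment claims.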
In the embodiment of the invention, generalization operations such as word segmentation and keyword extraction are performed on the text string of the short text, a text sub-string is acquired according to the weight information of the keywords, and repeated items in the text sub-string are removed. In this way the original text string is generalized, the generalization capability and efficiency of de-duplication are improved, the amount of calculation is small, and de-duplication is achieved within one text string or among a plurality of text strings.
Those skilled in the art should understand that the above apparatus can perform the method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for performing the method.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A deduplication method for short text, comprising:
acquiring text string information of a short text;
segmenting words of the text string, and obtaining keywords of the text string according to word segmentation information of the text string;
obtaining a text sub-string according to the weight corresponding to the key words, wherein the text sub-string comprises a threshold number of key words;
and removing repeated items of the text substring.
2. The method of claim 1, wherein the text sub-string comprising a threshold number of keywords comprises:
removing the keywords in the text string whose weights are smaller than a preset weight threshold; or,
selecting a threshold number of keywords in the text string according to the weights corresponding to the keywords.
3. The deduplication method for short text according to claim 1, wherein obtaining the keywords of the text string according to the word segmentation information of the text string comprises:
removing stop words from the word segmentation information, and performing normalization processing.
4. The deduplication method for short text according to claim 1, wherein obtaining a text sub-string according to the weight corresponding to the keywords, the text sub-string comprising a threshold number of keywords, further comprises:
the factors influencing the weight of the keywords comprise at least the frequency of each keyword and/or the inverse document frequency.
5. The method for removing duplicates of a short text according to claim 1, wherein a text sub-string is obtained according to the weight corresponding to said keyword, said text sub-string comprising a threshold number of keywords, further comprising:
and connecting two or more keywords in the text string into phrases through preset separators or segmentation strings.
6. The method of claim 1, wherein removing duplicates of said text sub-string comprises:
if the number of the text substrings is one, removing repeated items in the text substrings; and if the text substrings are two or more, removing repeated items among the text substrings.
7. A deduplication apparatus for short text, comprising:
the acquiring unit is used for acquiring text string information of the short text;
the extraction unit is connected with the acquisition unit and is used for segmenting the text string and obtaining the key words of the text string according to the segmentation information of the text string;
the processing unit is connected with the extraction unit and used for obtaining a text sub-string according to the weight corresponding to the key words, wherein the text sub-string comprises a threshold number of key words;
and the operation unit is connected with the processing unit and used for removing repeated items of the text substring.
8. The apparatus according to claim 7, wherein the factors affecting the keyword weights in the processing unit include at least the frequency of each keyword and/or the inverse document frequency, and the processing unit is specifically configured to:
remove the keywords in the text string whose weights are smaller than a preset weight threshold; or,
select a threshold number of keywords in the text string according to the weights corresponding to the keywords;
and connect two or more keywords in the text string into phrases through preset separators or segmentation strings.
9. The apparatus according to claim 7, wherein the extracting unit is specifically configured to:
and removing stop words in the word segmentation information, and performing normalization processing.
10. The short text deduplication device of claim 7, wherein the operation unit is specifically configured to:
if the number of the text substrings is one, removing repeated items in the text substrings; and if the text substrings are two or more, removing repeated items among the text substrings.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610915522.3A CN106569989A (en) | 2016-10-20 | 2016-10-20 | De-weighting method and apparatus for short text |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN106569989A true CN106569989A (en) | 2017-04-19 |
Family ID: 58533112
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610915522.3A Pending CN106569989A (en) | 2016-10-20 | 2016-10-20 | De-weighting method and apparatus for short text |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106569989A (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107066623A (en) * | 2017-05-12 | 2017-08-18 | 湖南中周至尚信息技术有限公司 | A kind of article merging method and device |
| CN107977347A (en) * | 2017-12-04 | 2018-05-01 | 海南云江科技有限公司 | A kind of topic De-weight method and computing device |
| CN108536676A (en) * | 2018-03-28 | 2018-09-14 | 广州华多网络科技有限公司 | Data processing method, device, electronic equipment and storage medium |
| CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A text data processing method, device and device |
| CN110348539A (en) * | 2019-07-19 | 2019-10-18 | 知者信息技术服务成都有限公司 | Short text correlation method of discrimination |
| WO2021109850A1 (en) * | 2019-12-03 | 2021-06-10 | 世强先进(深圳)科技股份有限公司 | Method and system for deduplicating and storing pdf files |
| CN114282511A (en) * | 2021-10-26 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Text duplicate removal method and device, electronic equipment and storage medium |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101694658A (en) * | 2009-10-20 | 2010-04-14 | 浙江大学 | Method for constructing webpage crawler based on repeated removal of news |
| CN102289523A (en) * | 2011-09-20 | 2011-12-21 | 北京金和软件股份有限公司 | Method for intelligently extracting text labels |
| CN102682085A (en) * | 2012-04-18 | 2012-09-19 | 北京十分科技有限公司 | Method for removing duplicated web page |
| CN103646029A (en) * | 2013-11-04 | 2014-03-19 | 北京中搜网络技术股份有限公司 | Similarity calculation method for blog articles |
| CN104156452A (en) * | 2014-08-18 | 2014-11-19 | 中国人民解放军国防科学技术大学 | Method and device for generating webpage text summarization |
| CN105468713A (en) * | 2015-11-19 | 2016-04-06 | 西安交通大学 | A short text classification method based on multi-model fusion |
| CN105893551A (en) * | 2016-03-31 | 2016-08-24 | 上海智臻智能网络科技股份有限公司 | Method and device for processing data and knowledge graph |
| CN105989033A (en) * | 2015-02-03 | 2016-10-05 | 北京中搜网络技术股份有限公司 | Information duplication eliminating method based on information fingerprints |
Non-Patent Citations (1)
| Title |
|---|
| 龚静: "《中文文本聚类研究》", 31 March 2012, 北京:中国传媒大学出版社 * |
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107066623A (en) * | 2017-05-12 | 2017-08-18 | 湖南中周至尚信息技术有限公司 | A kind of article merging method and device |
| CN107977347A (en) * | 2017-12-04 | 2018-05-01 | 海南云江科技有限公司 | A kind of topic De-weight method and computing device |
| CN107977347B (en) * | 2017-12-04 | 2021-12-21 | 海南云江科技有限公司 | Topic duplication removing method and computing equipment |
| CN108536676A (en) * | 2018-03-28 | 2018-09-14 | 广州华多网络科技有限公司 | Data processing method, device, electronic equipment and storage medium |
| CN108536676B (en) * | 2018-03-28 | 2020-10-13 | 广州华多网络科技有限公司 | Data processing method and device, electronic equipment and storage medium |
| CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A text data processing method, device and device |
| CN110032730B (en) * | 2019-02-18 | 2023-09-05 | 创新先进技术有限公司 | A text data processing method, device and equipment |
| CN110348539A (en) * | 2019-07-19 | 2019-10-18 | 知者信息技术服务成都有限公司 | Short text correlation method of discrimination |
| CN110348539B (en) * | 2019-07-19 | 2021-05-07 | 知者信息技术服务成都有限公司 | Short text relevance judging method |
| WO2021109850A1 (en) * | 2019-12-03 | 2021-06-10 | 世强先进(深圳)科技股份有限公司 | Method and system for deduplicating and storing pdf files |
| CN114282511A (en) * | 2021-10-26 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Text duplicate removal method and device, electronic equipment and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN102799647B (en) | Method and device for webpage reduplication deletion | |
| CN106569989A (en) | De-weighting method and apparatus for short text | |
| CN107168954B (en) | Text keyword generation method and device, electronic equipment and readable storage medium | |
| CN101315622B (en) | System and method for detecting file similarity | |
| US8983826B2 (en) | Method and system for extracting shadow entities from emails | |
| CN113688954A (en) | Method, system, equipment and storage medium for calculating text similarity | |
| CN105138523A (en) | Method and device for determining semantic keywords in text | |
| US9817812B2 (en) | Identifying word collocations in natural language texts | |
| WO2015179643A1 (en) | Systems and methods for generating summaries of documents | |
| CN103646080A (en) | Microblog duplication-eliminating method and system based on reverse-order index | |
| CN110134777B (en) | Question duplication eliminating method and device, electronic equipment and computer readable storage medium | |
| CN107168966B (en) | Search engine index construction method and device | |
| CN110858217A (en) | Method and device for detecting microblog sensitive topics and readable storage medium | |
| CN112926297B (en) | Method, apparatus, device and storage medium for processing information | |
| CN114328885B (en) | Information processing method, device and computer readable storage medium | |
| CN110222192A (en) | Corpus method for building up and device | |
| CN110019820B (en) | Method for detecting time consistency of complaints and symptoms of current medical history in medical records | |
| CN108073708A (en) | Information output method and device | |
| CN115577082A (en) | Document keyword extraction method and device, electronic equipment and storage medium | |
| CN110442696A (en) | Inquiry processing method and device | |
| CN114372461B (en) | Hidden keyword extraction method, terminal equipment and storage medium | |
| CN113408660B (en) | Book clustering method, device, equipment and storage medium | |
| CN111930949B (en) | Search string processing method and device, computer readable medium and electronic equipment | |
| CN113806483A (en) | Data processing method and device, electronic equipment and computer program product | |
| CN111783433A (en) | A text retrieval error correction method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170419 |