CN102467548B

CN102467548B - Method and system for identifying new words

Info

Publication number: CN102467548B
Application number: CN201010547509.XA
Authority: CN
Inventors: 严浩; 方高林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Tencent Computer Systems Co Ltd
Priority date: 2010-11-15
Filing date: 2010-11-15
Publication date: 2015-09-16
Anticipated expiration: 2030-11-15
Also published as: CN102467548A

Abstract

The invention discloses a method for identifying new words. The method comprises: performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrences of the extracted candidate new words in an evaluation data set, evaluating the credibility of the new words, and identifying the new words whose credibility exceeds a preset value. The invention also discloses a system for identifying new words, in which a new word recognition unit counts the occurrences of the extracted candidate new words in the evaluation data set, evaluates the credibility of the new words, and identifies the new words whose credibility exceeds a preset value. With the method and system of the invention, new words can be identified on a small-scale data set or a frequently updated data set.

Description

Method and system for identifying new words

Technical Field

The invention relates to a new word recognition technology in the field of internet information processing, in particular to a method and a system for recognizing new words on a small-scale data set.

Background

In the field of Chinese processing, because of the characteristics of Chinese, words are not naturally separated like English with spaces, so Chinese word segmentation is an important basic technology. However, with the rapid development of the internet in the information age, the language is continuously updated on the network platform, so that a great amount of new words are created. The appearance of new words causes the appearance of excessive single words or fine-grained words in the word segmentation result, which affects the accuracy of word segmentation, and researches show that nearly 60% of word segmentation errors are caused by inaccurate new word identification. Therefore, accurate recognition of new words plays an important role in improving word segmentation effect.

The existing new word recognition technology is mainly based on statistical methods: candidate strings are extracted from a data set of a certain scale according to statistical principles, and linguistic knowledge such as word formation rules is then used to filter out noise strings that are not new words, so as to identify the new words. This technology has the following defect. Statistical principles require that the amount of data to be analyzed be large and that the randomness of data occurrence be small. The prior art is therefore suitable only for data sets of a certain scale: only when the data set from which new words are extracted is large enough can sufficient statistical information be obtained to ensure the recall rate and the identification accuracy of new word extraction. For small-scale data sets and/or frequently updated data sets, these requirements are not met and the statistical principles cannot be applied well, so the existing new word recognition technology is not suitable for such scenarios.

Disclosure of Invention

In view of this, the present invention provides a method and a system for recognizing new words, which can recognize new words on a small-scale data set or a data set with frequent updates.

The technical scheme of the invention is realized as follows:

a method of identifying new words, the method comprising:

performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings;

filtering the candidate data strings according to a set screening strategy to extract candidate new words;

and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with the credibility exceeding a preset value.

Wherein the screening strategies comprise primary screening strategies and advanced screening strategies; wherein,

the primary screening strategy is to filter according to a word-building rule;

the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;

the filtering the candidate data string according to the set screening policy specifically includes: and adopting the primary screening strategy to carry out coarse screening and filtering on the candidate data strings, and adopting the advanced screening strategy to carry out fine screening and filtering on the candidate data strings after the coarse screening and filtering.

Wherein the screening policies comprise advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;

the filtering the candidate data string according to the set screening policy specifically includes: and screening and filtering the candidate data strings by adopting the high-level screening strategy.

The counting of the occurrences of the extracted candidate new words in the evaluation data set specifically includes: counting the frequency with which the candidate new words appear in the evaluation data set as a whole; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within book title marks; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within quotation marks; or counting the frequency with which the candidate new words, taken as a whole, are queried in a general search engine.

The evaluating the credibility of the new word specifically includes: evaluating the credibility of the new words by the formula

$$\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$$

and identifying the new words with high credibility; wherein,

$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; $\alpha_i$ represents the weighting coefficients.

Wherein, the method also comprises: and performing relevance measurement of the new words and the evaluation data set on the identified new words with high credibility.

A system for recognizing new words, the system comprising: a candidate data string extraction unit, a candidate new word extraction unit and a new word recognition unit; wherein,

the candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set or the frequently updated data set and extracting candidate data strings;

the candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting candidate new words;

and the new word recognition unit is used for counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words and recognizing the new words with the credibility exceeding a preset value.

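Wherein the screening strategies comprise a primary screening strategy and an advanced screening strategy; wherein,

the primary screening strategy is to filter according to a word-building rule;

the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;

the candidate new word extraction unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.

Alternatively, the screening strategies comprise the advanced screening strategy, and the candidate new word extraction unit is further configured to screen and filter the candidate data strings by using the advanced screening strategy and extract the candidate new words.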

The new word recognition unit is further configured, when performing the counting, to count the frequency with which the candidate new words appear in the evaluation data set as a whole; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within book title marks; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within quotation marks; or to count the frequency with which the candidate new words, taken as a whole, are queried in a general search engine.

Wherein the new word recognition unit is further configured, when performing the evaluation, to evaluate the credibility of the new words by the formula

$$\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$$

and to identify the new words with high credibility; wherein,

$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; $\alpha_i$ represents the weighting coefficients.

Wherein, this system still includes: and the measuring unit is used for measuring the relevance between the new words and the evaluation data set for the identified new words with high credibility.

The method comprises the steps of carrying out word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with high credibility. By adopting the method and the device, new word recognition can be performed on a small-scale data set or a data set with frequent updating.

Drawings

FIG. 1 is a schematic flow chart of a first implementation of the method of the present invention;

FIG. 2 is a schematic flow chart of a second embodiment of the method of the present invention;

FIG. 3 is a diagram illustrating evaluation results according to the present invention;

FIG. 4 is another schematic diagram of the evaluation results of the present invention.

Detailed Description

The method comprises the steps of carrying out word segmentation on data in a small-scale data set and/or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with the credibility exceeding a preset value.

In order to make the technical solution and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings by way of examples.

The first method embodiment: a method for recognizing new words, as shown in fig. 1, including the following steps:

Step 101: Perform word segmentation on data in the small-scale data set or the frequently updated data set, and extract candidate data strings.

Step 102: Filter the candidate data strings according to the set screening strategy, and extract candidate new words.

Step 103: Count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the new words, and identify the new words whose credibility exceeds a preset value.

The second method embodiment: a method for recognizing new words, as shown in fig. 2, mainly including the following steps:

Step 201: Perform word segmentation on the data in the small-scale data set and/or the frequently updated data set, and extract all binary or ternary data strings as candidate data strings.
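Step 201 can be illustrated with a minimal sketch. It assumes that "binary or ternary data strings" means every run of two or three adjacent tokens produced by the word segmenter; the function name `extract_candidates` and the sample tokens are hypothetical and do not come from the patent.

```python
from collections import Counter
from typing import Iterable, List

def extract_candidates(tokens: List[str], sizes: Iterable[int] = (2, 3)) -> Counter:
    """Count every run of 2 or 3 adjacent tokens as a candidate data string."""
    counts: Counter = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Tokens as a (hypothetical) word segmenter might return them for one sentence.
tokens = ["越野", "车", "市场", "很", "火"]
candidates = extract_candidates(tokens)
```

Each candidate is kept as a tuple of tokens rather than a joined string, so that later filters can still inspect the head, middle and tail words separately.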

Step 202: Filter the candidate data strings according to the set screening strategy, and extract candidate new words.

Here, the screening policy includes a primary screening policy and an advanced screening policy.

The primary screening strategy is to filter according to a word formation rule, namely: the candidate data strings are firstly filtered for the first time according to the word formation rule of the new word, such as the part of speech rule and the word rule, so that most of obvious noise candidate data strings are eliminated.

The high-level screening strategy is used for fine screening so as to obtain more accurate candidate new words, thereby providing good basis for the credibility of the subsequent new words. Advanced screening strategies include the following categories:

a1. Filtering by head word dictionary and tail word dictionary. If the first word of the candidate data string appears in the head word dictionary (e.g., "compares"), the candidate data string should be deleted; if the last word of the candidate data string appears in the tail word dictionary (e.g., "one"), the candidate data string should also be deleted.

a2. Filtering by triples and the middle words of triple combinations. In new word recognition, examining the middle word of a triple can be used to determine whether a candidate data string is a garbage string, for example: "tiger by him" and "ocean by us".

a3. Filtering by 2+1 mode or 3+1 mode tail words. Words in the 2+1 mode, such as "wutaishan", "peste temple", "beijing city" and "off-road vehicle", and words in the 3+1 mode, such as "wolfo vehicle", "asia european continent", "srilanka", "markov" and "australia", share a particular property: their tail words are mostly suffix words composed of morphemes. These suffix words are collected to form a tail word dictionary (trisuffix). If the candidate data string is judged to belong to the 2+1 mode or the 3+1 mode and its tail word is in the tail word dictionary, the candidate data string is a valid string; otherwise, the candidate data string is an invalid string and is deleted.

a4. Filtering by word formation rules. Candidate data strings that belong to feature strings can be judged by part of speech, and those that obviously violate the part-of-speech rules are deleted. The rules include, for example, a head word part-of-speech filtering rule, a tail word part-of-speech filtering rule, a head word collocation filtering rule and a tail word collocation filtering rule.

The head word part-of-speech filtering rule restricts the head word (or prefix) of the candidate data string; if the rule is matched, the candidate is deleted. The filtered parts of speech include: time words, prepositions, suffix components, interjections, quantifiers, auxiliary words, adverbs, and the like.

The tail word part-of-speech filtering rule restricts the tail word (or suffix) of the candidate data string; if the rule is matched, the candidate is deleted. The filtered parts of speech include: time words, adverbs, interjections, prefix components, prepositions, conjunctions, and the like.

The head word collocation filtering rule: if the first two words are numeral + quantifier, they are filtered out. For example, "one thousand to one beauty story" is obtained after filtration.

The tail word collocation filtering rule: if the last two words are numeral + quantifier, or preposition + noun, or adverb + verb, they are filtered out. For example, for "a piece of granite huge stone rushes toward me quickly", "granite huge stone" is obtained after filtering.

Alternatively, the screening strategy may include only the advanced screening strategy.
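The dictionary-based filters a1 to a3 can be sketched as simple predicates over a candidate. This is only an illustration under stated assumptions: the dictionary contents are invented stand-ins (the specific characters are guesses chosen to mirror the translated examples), all function names are hypothetical, and applying the three filters together is a design choice of this sketch, since the text lists them as alternatives.

```python
from typing import List, Set, Tuple

Candidate = Tuple[str, ...]

# Illustrative dictionaries only; real ones would be compiled from a corpus.
HEAD_CHARS: Set[str] = {"比"}               # a candidate may not start with these
TAIL_CHARS: Set[str] = {"一"}               # a candidate may not end with these
BAD_MIDDLE: Set[str] = {"被"}               # middle words that mark a garbage triple
SUFFIX_CHARS: Set[str] = {"山", "市", "车"}  # stands in for the "trisuffix" dictionary

def passes_head_tail(cand: Candidate) -> bool:
    """a1: reject if the first or last character is in the head/tail dictionary."""
    text = "".join(cand)
    return text[0] not in HEAD_CHARS and text[-1] not in TAIL_CHARS

def passes_middle_word(cand: Candidate) -> bool:
    """a2: reject a triple whose middle word marks it as a garbage string."""
    return not (len(cand) == 3 and cand[1] in BAD_MIDDLE)

def passes_suffix_pattern(cand: Candidate) -> bool:
    """a3: a 2+1 or 3+1 string is valid only if its tail is a known suffix."""
    if len(cand) == 2 and len(cand[0]) in (2, 3) and len(cand[1]) == 1:
        return cand[1] in SUFFIX_CHARS
    return True  # not a 2+1 / 3+1 string, so this rule does not apply

def advanced_filter(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that pass all three dictionary-based filters."""
    return [c for c in cands
            if passes_head_tail(c) and passes_middle_word(c) and passes_suffix_pattern(c)]

cands = [("越野", "车"), ("老虎", "被", "他"), ("比", "赛场"), ("北京", "的")]
kept = advanced_filter(cands)
```

The part-of-speech rules of a4 would need a tagger and are left out of this sketch.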

Step 203: Count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the new words, and identify the new words whose credibility exceeds a preset value as new words with high credibility.

The evaluation data set can be selected according to application requirements. The preferred mode of the invention is to use a general search engine directly: the candidate new words are submitted to the general search engine (such as Baidu or Google) and the returned results are captured as the evaluation data set. The benefit is that the approach is simple to implement and can evaluate the credibility of the new words automatically, without manual operation.

The evaluation criterion is mainly to count the occurrence of the extracted candidate new word as a whole, for example, the more frequently the extracted candidate new word occurs as a whole, the higher the credibility of the extracted candidate new word as a new word. In addition, for special appearance situations, such as the extracted candidate new word is separated from the adjacent text by punctuation, the credibility of the new word is also increased. Specifically, the statistics of the appearance of the extracted candidate new word as a whole includes the following:

(1) the frequency of occurrence of the candidate new words as a whole;

(2) the frequency with which the candidate new words, as a whole, are separated from the adjacent text by characters such as spaces, symbols and punctuation;

(3) the frequency of occurrence of the candidate new words in book title marks;

(4) the frequency of occurrence of candidate new words in quotation marks;

(5) the frequency with which the candidate new words are queried in the search engine.
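The first four statistics can be sketched with regular expressions over the captured evaluation text. This is a rough illustration, not the patented implementation: the separator set, the feature names and the function `count_features` are assumptions, and statistic (5) is omitted because it depends on query data from an external search engine.

```python
import re
from typing import Dict

# Characters treated as separators around a candidate (an illustrative set).
SEPARATORS = r"\s,.;:!?，。；：！？、()（）"

def count_features(word: str, text: str) -> Dict[str, int]:
    """Count how often `word` occurs as a whole and in special contexts."""
    w = re.escape(word)
    return {
        # (1) every occurrence of the candidate as a whole
        "whole": len(re.findall(w, text)),
        # (2) occurrences delimited on both sides by a separator or text boundary
        "separated": len(re.findall(
            rf"(?:^|[{SEPARATORS}]){w}(?=$|[{SEPARATORS}])", text)),
        # (3) occurrences inside book title marks
        "in_title_marks": len(re.findall(rf"《{w}》", text)),
        # (4) occurrences inside quotation marks
        "in_quotes": len(re.findall(rf'[“"]{w}[”"]', text)),
    }

text = "新款越野车上市。《越野车》杂志称，“越野车”很火，越野车，销量高"
features = count_features("越野车", text)
```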

Here, based on the counted occurrences of the extracted candidate new word as a whole, the formula used for evaluating the credibility of the candidate new word may be:

$$\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$$

wherein $w$ represents a candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various statistical frequency data, namely the 5 types of frequency data above; and $\alpha_i$ are the weighting coefficients used in calculating the confidence of the new word.
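The scoring and thresholding step can be sketched as follows, assuming that the confidence score is the weighted sum of the frequency statistics that the symbol definitions suggest. The feature names, weights and threshold are invented for illustration, and `score` and `identify` are hypothetical helper names.

```python
from typing import Dict, List

def score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the frequency statistics f_i(w) with coefficients alpha_i."""
    return sum(weights[name] * value for name, value in features.items())

def identify(candidates: Dict[str, Dict[str, float]],
             weights: Dict[str, float],
             threshold: float) -> List[str]:
    """Return the candidates whose score exceeds the preset value."""
    return [w for w, f in candidates.items() if score(f, weights) > threshold]

weights = {"whole": 0.5, "in_quotes": 2.0}          # assumed alpha_i
candidates = {
    "越野车": {"whole": 4, "in_quotes": 1},          # score 0.5*4 + 2.0*1 = 4.0
    "车市场": {"whole": 2, "in_quotes": 0},          # score 0.5*2 + 2.0*0 = 1.0
}
new_words = identify(candidates, weights, threshold=3.0)
```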

To determine the weighting coefficients $\alpha_i$, in addition to the automatic evaluation described above, a portion of the candidate new words can be evaluated manually, and the manually labeled entries can then be used as training data to obtain the values of $\alpha_i$ by a machine learning method. Various specific methods can be used. For example, the task can be treated as a logistic regression problem: the manual judgment of each entry is taken as the dependent variable, each kind of frequency data obtained by statistics is normalized and taken as an independent variable, and the values of $\alpha_i$ are obtained by solving the regression. For the specific procedure, reference can be made to the literature on logistic regression, for example Chapter 12 of Regression Analysis by Example (3rd edition, Chinese translation), or Section 4.3.2 of Pattern Recognition and Machine Learning.
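A minimal sketch of the logistic regression idea follows, written in plain Python with batch gradient descent. The training data, learning rate and epoch count are invented, the names `train_weights` and `predict` are hypothetical, and the intercept term is an assumption of this sketch that the text does not mention.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_weights(X: Sequence[Sequence[float]], y: Sequence[int],
                  lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic-regression coefficients by batch gradient descent."""
    n = len(X[0])
    alpha, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one labeled entry.
            err = sigmoid(bias + sum(a * v for a, v in zip(alpha, xi))) - yi
            grad_b += err
            for j, v in enumerate(xi):
                grad[j] += err * v
        alpha = [a - lr * g / len(X) for a, g in zip(alpha, grad)]
        bias -= lr * grad_b / len(X)
    return alpha, bias

# Normalized frequency features; label 1 = judged a new word, 0 = not.
X = [[0.9, 0.8], [0.8, 0.6], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
alpha, bias = train_weights(X, y)

def predict(x: Sequence[float]) -> float:
    """Estimated probability that an entry with features x is a new word."""
    return sigmoid(bias + sum(a * v for a, v in zip(alpha, x)))
```

The learned `alpha` values would then play the role of the weighting coefficients in the confidence score.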

Examples of the evaluation results for some entries are shown in fig. 3 and fig. 4, which are labeled as the Good class and the Bad class respectively; the labels are the results of manual judgment. The Good class contains entries that were manually judged to be good and should be recognized as new words; the Bad class contains entries that were manually judged to be poor and should not be recognized as new words. The scores in fig. 3 and fig. 4 are the results of the automatic evaluation of the present invention. The automatic evaluation results are highly consistent with the manual judgments, which demonstrates the effectiveness of the method. The automatic evaluation of the invention ensures the effectiveness and accuracy of the evaluation while being more efficient than manual judgment.

Step 204: Measure, for the identified new words with high credibility, the relevance between the new words and the evaluation data set.

Here, advertisement library text is an example of a small-scale data set and/or a frequently updated data set. When the relevance between the new words and the evaluation data set is measured for new words extracted from advertisement library text, the degree to which a new word is advertising-related needs to be evaluated; this may also be referred to as measuring the advertising performance of the new word. The preferred mode of the invention is again to use the results returned by general search engines (such as Baidu and Google) directly: the number of advertisements on the page returned by the search engine is used to measure the advertising performance of the new word, on the assumption that the greater the number of advertisements, the higher the advertising performance weight of the new word. The benefit is that the approach is simple to implement and can measure the advertising performance of the new words automatically, without manual operation.
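The advertising performance measure can be sketched very simply, assuming the advertisement count for each new word has already been obtained from a search results page. How those counts are fetched is outside this sketch, and the function name and the sample counts are hypothetical.

```python
from typing import Dict, List, Tuple

def rank_by_ad_count(ad_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Order new words so that more advertisements means a higher rank."""
    return sorted(ad_counts.items(), key=lambda kv: kv[1], reverse=True)

# Invented counts of advertisements shown on the results page for each word.
ad_counts = {"word_a": 2, "word_b": 5, "word_c": 0}
ranking = rank_by_ad_count(ad_counts)
```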

In summary, when the method and the device are used for identifying the new words from the internet text data, the method and the device are suitable for scenes in which the new words are identified on small-scale data sets and/or frequently updated data sets. Moreover, the credibility and the advertising performance of the new words are evaluated by using a general search engine, so that the automatic evaluation of the credibility and the advertising performance of the new words is realized.

A system for recognizing new words comprises: a candidate data string extraction unit, a candidate new word extraction unit and a new word recognition unit. The candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set and/or the frequently updated data set and extracting candidate data strings. The candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting the candidate new words. The new word recognition unit is used for counting the occurrences of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words and recognizing the new words whose credibility exceeds a preset value.

Here, the screening policy includes a primary screening policy and an advanced screening policy. Wherein, the primary screening strategy is to filter according to the word-building rule. Advanced screening strategies include: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule. The candidate new word extracting unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.

Alternatively, the screening strategy may include only the advanced screening strategy. Advanced screening strategies include: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule. The candidate new word extraction unit is further configured to screen and filter the candidate data strings by using the advanced screening strategy to extract candidate new words.

Here, the new word recognition unit is further configured, when performing the counting, to count the frequency with which the candidate new words appear in the evaluation data set as a whole; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within book title marks; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within quotation marks; or to count the frequency with which the candidate new words, taken as a whole, are queried in a general search engine.

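Here, the new word recognition unit is further configured, when performing the evaluation, to evaluate the credibility of the new words by the formula

$$\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$$

and to identify the new words with high credibility; wherein $w$ represents the candidate new word, $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word, $f_i(w)$ represents the various frequency data obtained by statistics, and $\alpha_i$ represents the weighting coefficients.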

Here, the system further includes: and the measuring unit is used for measuring the relevance of the new words and the evaluation data set for the identified new words with high credibility.
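The three-unit structure of the system can be pictured as a small pipeline of interchangeable callables. This is an architectural sketch only: the class name `NewWordSystem`, its field names and the toy stand-in functions are assumptions for illustration, not the units as actually built.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class NewWordSystem:
    # Candidate data string extraction unit: text -> candidate strings
    extract_strings: Callable[[str], List[str]]
    # Candidate new word extraction unit: candidate strings -> candidate new words
    extract_candidates: Callable[[List[str]], List[str]]
    # New word recognition unit: candidate new words -> recognized new words
    recognize: Callable[[List[str]], List[str]]

    def run(self, text: str) -> List[str]:
        """Pass the text through the three units in order."""
        return self.recognize(self.extract_candidates(self.extract_strings(text)))

# Toy stand-ins for the three units, used only to show the data flow.
system = NewWordSystem(
    extract_strings=lambda text: text.split(),
    extract_candidates=lambda strings: [s for s in strings if len(s) > 1],
    recognize=lambda cands: [c for c in cands if c in {"blog", "vlog"}],
)
result = system.run("a vlog is a video blog")
```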

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A method for recognizing a new word, the method comprising:

counting the occurrences of the extracted candidate new words in the evaluation data set, wherein the counting comprises at least two of the following: counting the frequency with which the candidate new words appear in the evaluation data set as a whole; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within book title marks; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within quotation marks; and counting the frequency with which the candidate new words, taken as a whole, are queried in a general search engine;

evaluating the credibility of the new words by the formula

$$\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$$

and identifying the new words whose credibility exceeds a preset value, wherein $w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the at least two kinds of frequency data obtained by the counting; and $\alpha_i$ represents the weighting coefficients.

2. The method of claim 1, wherein the screening strategies comprise a primary screening strategy and an advanced screening strategy; wherein,

the primary screening strategy is to filter according to a word-building rule;

3. The method of claim 1, wherein the screening policies include advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;

4. The method of claim 3, further comprising: and performing relevance measurement of the new words and the evaluation data set on the identified new words with high credibility.

5. A system for recognizing new words, the system comprising: a candidate data string extraction unit, a candidate new word extraction unit and a new word recognition unit; wherein,

the new word recognition unit is used for counting the occurrences of the extracted candidate new words in the evaluation data set, wherein the counting comprises at least two of the following: counting the frequency with which the candidate new words appear in the evaluation data set as a whole; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within book title marks; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear within quotation marks; and counting the frequency with which the candidate new words, taken as a whole, are queried in a general search engine;

and for evaluating the credibility of the new words by the formula

$$\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$$

and identifying the new words whose credibility exceeds a preset value, wherein $w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the at least two kinds of frequency data obtained by the counting; and $\alpha_i$ represents the weighting coefficients.

6. The system of claim 5, wherein the screening policies include a primary screening policy and an advanced screening policy; wherein,

the primary screening strategy is to filter according to a word-building rule;

7. The system of claim 5, wherein the screening policies include advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;

8. The system of claim 7, further comprising: and the measuring unit is used for measuring the relevance between the new words and the evaluation data set for the identified new words with high credibility.