CN102467548B - A method and system for recognizing new words
- Publication number
- CN102467548B (application CN201010547509.XA)
- Authority
- CN
- China
- Prior art keywords
- candidate
- new words
- word
- words
- filtering
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a method for recognizing new words. The method comprises: performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrences of the extracted candidate new words in an evaluation data set, evaluating the credibility of each candidate, and identifying as new words those whose credibility exceeds a preset value. The invention also discloses a system for recognizing new words, in which a new word recognition unit counts the occurrences of the extracted candidate new words in the evaluation data set, evaluates their credibility, and identifies as new words those whose credibility exceeds the preset value. With the method and system of the invention, new words can be recognized on small-scale data sets and on frequently updated data sets.
Description
Technical Field
The invention relates to a new word recognition technology in the field of internet information processing, in particular to a method and a system for recognizing new words on a small-scale data set.
Background
In Chinese language processing, words are not separated by spaces as they are in English, so Chinese word segmentation is an important basic technology. With the rapid development of the internet, however, language on network platforms changes continuously and a large number of new words are created. New words cause the segmenter to emit too many single characters or overly fine-grained words, which lowers segmentation accuracy; research shows that nearly 60% of word segmentation errors are caused by inaccurate new word recognition. Accurate recognition of new words therefore plays an important role in improving word segmentation.
Existing new word recognition technology is mainly statistical: candidate strings are extracted from a data set of a certain scale according to statistical principles, and linguistic knowledge such as word formation rules is then used to filter out noise strings that are not new words. This approach has the following defect. Statistical methods require a large volume of data with little randomness in how the data occurs, so the prior art is suitable only for data sets of a certain scale: only then is the statistical information sufficient to ensure the recall and accuracy of new word extraction. Small-scale data sets and frequently updated data sets do not meet these requirements, so the existing new word recognition technology is not suitable for them.
Disclosure of Invention
In view of this, the present invention provides a method and a system for recognizing new words, which can recognize new words on a small-scale data set or a data set with frequent updates.
The technical scheme of the invention is realized as follows:
A method for recognizing new words, the method comprising:
performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings;
filtering the candidate data strings according to a set screening strategy to extract candidate new words; and
counting the occurrences of the extracted candidate new words in an evaluation data set, evaluating the credibility of the candidate new words, and identifying as new words those whose credibility exceeds a preset value.
Wherein the screening strategies comprise primary screening strategies and advanced screening strategies; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering the candidate data string according to the set screening policy specifically includes: and adopting the primary screening strategy to carry out coarse screening and filtering on the candidate data strings, and adopting the advanced screening strategy to carry out fine screening and filtering on the candidate data strings after the coarse screening and filtering.
Wherein the screening policies comprise advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
filtering the candidate data strings according to the set screening strategy specifically includes: screening and filtering the candidate data strings with the advanced screening strategy.
Counting the occurrences of the extracted candidate new words in the evaluation data set specifically includes: counting the frequency with which a candidate new word appears in the evaluation data set as a whole; or counting the frequency with which, when it appears as a whole, it is separated from the adjacent text by other characters; or counting the frequency with which, when it appears as a whole, it appears inside book-title marks; or counting the frequency with which, when it appears as a whole, it appears inside quotation marks; or counting the frequency with which, when it appears as a whole, it is queried in a general search engine.
Evaluating the credibility of the new words specifically includes: evaluating the credibility of the candidate new words by the formula $\mathrm{Score}(w) = \sum_i \alpha_i f_i(w)$, and identifying the new words with high credibility; wherein,
$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; $\alpha_i$ represents the weighting coefficients.
The method further comprises: measuring, for each identified new word with high credibility, the relevance between the new word and the evaluation data set.
A system for recognizing new words, the system comprising: a candidate data string extraction unit, a candidate new word extraction unit and a new word recognition unit; wherein,
the candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set or the frequently updated data set and extracting candidate data strings;
the candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting candidate new words;
and the new word recognition unit is used for counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words and recognizing the new words with the credibility exceeding a preset value.
Wherein the screening strategies comprise primary screening strategies and advanced screening strategies; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the candidate new word extracting unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.
Wherein the screening policies comprise advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the candidate new word extraction unit is further configured to screen and filter the candidate data strings with the advanced screening strategy and to extract the candidate new words.
The new word recognition unit is further configured, when counting, to: count the frequency with which a candidate new word appears in the evaluation data set as a whole; or count the frequency with which, when it appears as a whole, it is separated from the adjacent text by other characters; or count the frequency with which, when it appears as a whole, it appears inside book-title marks; or count the frequency with which, when it appears as a whole, it appears inside quotation marks; or count the frequency with which, when it appears as a whole, it is queried in a general search engine.
The new word recognition unit is further configured, when evaluating, to evaluate the credibility of the candidate new words by the formula $\mathrm{Score}(w) = \sum_i \alpha_i f_i(w)$, and to identify the new words with high credibility; wherein,
$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; $\alpha_i$ represents the weighting coefficients.
The system further comprises: a measuring unit configured to measure, for each identified new word with high credibility, the relevance between the new word and the evaluation data set.
The method comprises the steps of carrying out word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with high credibility. By adopting the method and the device, new word recognition can be performed on a small-scale data set or a data set with frequent updating.
Drawings
FIG. 1 is a schematic flow chart of a first implementation of the method of the present invention;
FIG. 2 is a schematic flow chart of a second embodiment of the method of the present invention;
FIG. 3 is a diagram illustrating evaluation results according to the present invention;
FIG. 4 is another schematic diagram of the evaluation results of the present invention.
Detailed Description
The method comprises the steps of carrying out word segmentation on data in a small-scale data set and/or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with the credibility exceeding a preset value.
In order to make the technical solution and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings by way of examples.
Method embodiment 1: a method for recognizing new words, as shown in FIG. 1, comprising the following steps:
Step 101: perform word segmentation on data in the small-scale data set or the frequently updated data set, and extract candidate data strings.
Step 102: filter the candidate data strings according to the set screening strategy, and extract candidate new words.
Step 103: count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the candidate new words, and identify as new words those whose credibility exceeds a preset value.
Method embodiment 2: a method for recognizing new words, as shown in FIG. 2, mainly comprising the following steps:
Step 201: perform word segmentation on the data in the small-scale data set and/or the frequently updated data set, and extract all binary or ternary data strings as candidate data strings.
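Step 201 can be sketched as follows. This is a minimal illustration that assumes "binary or ternary data strings" means every run of two or three adjacent units produced by the word segmenter; the function name `candidate_strings` and the sample tokens are hypothetical, and the segmenter itself is out of scope here.

```python
from typing import Iterable, List, Tuple


def candidate_strings(tokens: List[str],
                      sizes: Iterable[int] = (2, 3)) -> List[Tuple[str, ...]]:
    """Return every run of `n` adjacent tokens, for each n in `sizes`."""
    candidates = []
    for n in sizes:
        # range() is empty when the token list is shorter than n.
        for start in range(len(tokens) - n + 1):
            candidates.append(tuple(tokens[start:start + n]))
    return candidates


# Placeholder tokens standing in for the output of a word segmenter.
tokens = ["a", "b", "c", "d"]
for candidate in candidate_strings(tokens):
    print(candidate)
```

Keeping each candidate as a tuple of tokens, rather than a joined string, preserves the token boundaries that the later screening rules inspect.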
Step 202: filter the candidate data strings according to the set screening strategy, and extract candidate new words.
Here, the screening strategy includes a primary screening strategy and an advanced screening strategy.
The primary screening strategy filters according to word formation rules: the candidate data strings are first filtered according to the word formation rules for new words, such as part-of-speech rules and word rules, which eliminates most of the obviously noisy candidate data strings.
The advanced screening strategy performs fine screening to obtain more accurate candidate new words, which provides a good basis for the subsequent evaluation of new word credibility. Advanced screening strategies include the following categories:
a1. Head-word dictionary and tail-word dictionary filtering. If the first word of the candidate data string appears in the head-word dictionary (e.g., "compare"), the candidate data string is deleted; if the last word of the candidate data string appears in the tail-word dictionary (e.g., "one"), the candidate data string is also deleted.
a2. Filtering by triples and by the intermediate words of triple combinations. In new word recognition, finding a Chinese word inside a triple can be used to determine whether a candidate data string is a garbage string, for example: "tiger by him" and "ocean by us".
a3. 2+1 mode or 3+1 mode tail-word filtering. Words in the 2+1 mode, such as "wutaishan", "peste temple", "beijing city" and "off-road vehicle", and words in the 3+1 mode, such as "wolfo vehicle", "asia european continent", "srilanka", "markov" and "australia", share a particular property: the tail word is usually a suffix formed from a morpheme. These suffixes are collected into a word-tail dictionary (trisuffix). If a candidate data string is judged to belong to the 2+1 mode or the 3+1 mode and its tail word is in the word-tail dictionary, the candidate data string is a valid string; otherwise, it is an invalid string and is deleted.
a4. Word formation rule filtering. Candidate data strings that are feature strings can be judged by part of speech, and those that clearly violate the part-of-speech rules are deleted, using for example a head-word part-of-speech filtering rule, a tail-word part-of-speech filtering rule, a head-word collocation filtering rule and a tail-word collocation filtering rule.
The head-word part-of-speech filtering rule restricts the head word (or prefix) of a candidate data string; if the rule is matched, the candidate is deleted. The filtered parts of speech include: time words, prepositions, successor elements, interjections, quantifiers, auxiliary words, adverbs, and the like.
The tail-word part-of-speech filtering rule restricts the tail word (or suffix) of a candidate data string; if the rule is matched, the candidate is deleted. The filtered parts of speech include: time words, adverbs, interjections, antecedent elements, prepositions, conjunctions, and the like.
The head-word collocation filtering rule: if the first two words are numeral + quantifier, they are filtered. For example, "one thousand to one beauty story" is obtained after filtering.
The tail-word collocation filtering rule: if the last two words are numeral + quantifier, preposition + noun, or adverb + verb, they are filtered. For example, from "a piece of granite huge stone rushes toward me quickly", "granite huge stone" is obtained after filtering.
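The four part-of-speech rules in a4 can be sketched as a single predicate over a tagged candidate. This is only an illustration: the tag names are hypothetical stand-ins for whatever tag set the part-of-speech tagger uses, and the sketch assumes that a candidate matching any rule is simply dropped, although the collocation examples in the text could also be read as trimming the offending words.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (word, part-of-speech tag)

# Hypothetical tag names mirroring the categories listed in the text.
HEAD_BLOCKED = {"time", "prep", "succ", "interj", "quant", "aux", "adv"}
TAIL_BLOCKED = {"time", "adv", "interj", "ante", "prep", "conj"}
HEAD_PAIRS = {("num", "quant")}
TAIL_PAIRS = {("num", "quant"), ("prep", "noun"), ("adv", "verb")}


def passes_pos_rules(candidate: Tagged) -> bool:
    """Return False when the head or tail of the candidate matches a rule."""
    tags = [tag for _, tag in candidate]
    if not tags:
        return False
    # Single-word rules on the first and last word.
    if tags[0] in HEAD_BLOCKED or tags[-1] in TAIL_BLOCKED:
        return False
    # Collocation rules on the first two and last two words.
    if len(tags) >= 2:
        if (tags[0], tags[1]) in HEAD_PAIRS:
            return False
        if (tags[-2], tags[-1]) in TAIL_PAIRS:
            return False
    return True


print(passes_pos_rules([("granite", "noun"), ("boulder", "noun")]))
print(passes_pos_rules([("one", "num"), ("piece", "quant"), ("stone", "noun")]))
```

Keeping the rules in plain sets makes it easy to adjust the categories without touching the predicate.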
Filtering the candidate data strings according to the set screening strategy specifically includes: performing coarse screening on the candidate data strings with the primary screening strategy, and then performing fine screening with the advanced screening strategy on the candidate data strings that survive the coarse screening.
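The coarse-then-fine arrangement, together with the dictionary rules a1 and a3, might look like the sketch below. All dictionary entries are illustrative assumptions rather than the patent's actual dictionaries, `has_no_digits` is a hypothetical placeholder for the word-formation rules of the primary stage, and the sketch assumes that the 2+1 / 3+1 rule leaves candidates of other shapes untouched.

```python
from typing import Callable, List, Sequence, Set, Tuple

Candidate = Tuple[str, ...]          # adjacent tokens from the segmenter
Rule = Callable[[Candidate], bool]   # True means the candidate survives

# Illustrative dictionary entries only (assumptions, not the patent's data).
HEAD_CHARS: Set[str] = {"比"}
TAIL_CHARS: Set[str] = {"一"}
SUFFIX_CHARS: Set[str] = {"山", "市", "车"}


def passes_head_tail(candidate: Candidate) -> bool:
    """a1: reject when the first or last character is in a dictionary."""
    text = "".join(candidate)
    if not text:
        return False
    return text[0] not in HEAD_CHARS and text[-1] not in TAIL_CHARS


def passes_suffix_pattern(candidate: Candidate) -> bool:
    """a3: a 2+1 or 3+1 candidate survives only if its tail is a known suffix."""
    lengths = tuple(len(token) for token in candidate)
    if lengths in {(2, 1), (3, 1)}:
        return candidate[-1] in SUFFIX_CHARS
    return True  # assumption: other shapes are left to the other rules


def has_no_digits(candidate: Candidate) -> bool:
    """Placeholder coarse rule standing in for the word-formation rules."""
    return not any(ch.isdigit() for ch in "".join(candidate))


def screen(candidates: Sequence[Candidate],
           coarse: Sequence[Rule],
           fine: Sequence[Rule]) -> List[Candidate]:
    """Apply every coarse rule first, then every fine rule to the survivors."""
    survivors = [c for c in candidates if all(rule(c) for rule in coarse)]
    return [c for c in survivors if all(rule(c) for rule in fine)]


candidates = [("五台", "山"), ("比", "较好"), ("北京", "人"), ("第", "一"), ("2010", "年")]
kept = screen(candidates, [has_no_digits], [passes_head_tail, passes_suffix_pattern])
print(kept)
```

Passing an empty `coarse` list gives the variant that uses only the advanced screening strategy.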
Alternatively, the screening strategy may include only the advanced screening strategy.
In that case, filtering the candidate data strings according to the set screening strategy specifically includes: screening and filtering the candidate data strings with the advanced screening strategy.
Step 203: count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the candidate new words, and identify those whose credibility exceeds a preset value as new words with high credibility.
The evaluation data set can be selected according to application requirements. The preferred mode of the invention is to submit the candidate new words directly to a general search engine (such as Baidu or Google) and capture the returned results as the evaluation data set. This is simple to implement and allows the credibility of new words to be evaluated automatically, without manual work.
The evaluation criterion is mainly how often the extracted candidate new word occurs as a whole: the more frequently it occurs as a whole, the higher its credibility as a new word. Special occurrence patterns, such as the candidate being separated from the adjacent text by punctuation, also increase its credibility. Specifically, the statistics on the occurrence of the extracted candidate new word as a whole include the following:
(1) the frequency of occurrence of the candidate new words as a whole;
(2) the frequency of the whole candidate new words and the adjacent texts separated by characters such as blanks, symbols, punctuations and the like;
(3) the frequency of occurrence of the candidate new words inside book-title marks;
(4) the frequency of occurrence of candidate new words in quotation marks;
(5) the frequency with which the candidate new words are queried in the search engine.
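The five statistics might be gathered from the captured evaluation text roughly as follows. This is a sketch under several assumptions: substring counting stands in for "appearing as a whole", the book-title marks are taken to be 《 and 》, the quotation-mark sets are illustrative, and the search-engine query frequency is passed in as `query_count` because it cannot be derived from the text itself. The function name and the sample sentence are hypothetical.

```python
import re
from typing import Dict

OPEN_QUOTES = "“‘\"'"
CLOSE_QUOTES = "”’\"'"


def occurrence_stats(word: str, text: str, query_count: int = 0) -> Dict[str, int]:
    """Collect the five frequency features for one candidate new word."""
    w = re.escape(word)
    # (2) delimited: no word character directly before or after the candidate.
    delimited = re.findall(rf"(?<!\w){w}(?!\w)", text)
    # (4) quoted: an opening quote, the candidate, then a closing quote.
    quoted = re.findall(
        f"[{re.escape(OPEN_QUOTES)}]{w}[{re.escape(CLOSE_QUOTES)}]", text)
    return {
        "whole": text.count(word),            # (1)
        "delimited": len(delimited),          # (2)
        "title": text.count(f"《{word}》"),    # (3)
        "quoted": len(quoted),                # (4)
        "queried": query_count,               # (5) supplied externally
    }


sample = "“给力”一词很给力,《给力》是新词。 给力 !"
print(occurrence_stats("给力", sample, query_count=7))
```

Note that occurrences inside quotation marks or book-title marks are also counted as delimited, so the features overlap; the weighting coefficients are what balance their contributions.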
Here, based on the counted occurrences of the extracted candidate new word as a whole, the credibility of the candidate new word may be evaluated with the formula $\mathrm{Score}(w) = \sum_i \alpha_i f_i(w)$, where $w$ represents a candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the statistical frequency data, namely the five types of frequency data above; and $\alpha_i$ are the weighting coefficients used in calculating the confidence of the new word.
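As a small worked example, and assuming the scoring formula is a weighted sum of the frequency data as the definitions of the weighting coefficients and frequency terms suggest, the score and the threshold test could be computed as below. The weights, frequencies and threshold are made-up values for illustration only.

```python
from typing import Sequence


def confidence_score(freqs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the frequency features: sum_i alpha_i * f_i(w)."""
    if len(freqs) != len(weights):
        raise ValueError("freqs and weights must have the same length")
    return sum(alpha * f for alpha, f in zip(weights, freqs))


def is_new_word(freqs: Sequence[float], weights: Sequence[float],
                threshold: float) -> bool:
    """A candidate is accepted when its score exceeds the preset value."""
    return confidence_score(freqs, weights) > threshold


# Made-up numbers: five frequency features and five weighting coefficients.
freqs = [4, 3, 1, 1, 7]
weights = [0.5, 1.0, 2.0, 2.0, 0.1]
print(confidence_score(freqs, weights))
print(is_new_word(freqs, weights, threshold=5.0))
```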
To determine the weighting coefficients $\alpha_i$, in addition to the automatic evaluation above, some of the candidate new words can be evaluated manually, and the manually labeled entries can then be used as training data for a machine learning method that yields the values of $\alpha_i$. Various specific methods can be used. For example, the task can be treated as a logistic regression problem: the manual judgment of each entry is the dependent variable, the normalized frequency data are the independent variables, and the values of $\alpha_i$ are obtained by solving the regression. For the specific procedure, refer to material on logistic regression, for example Chapter 12 of the 3rd Chinese edition of "Regression Analysis by Example", or Section 4.3.2 of "Pattern Recognition and Machine Learning".
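The logistic regression approach can be sketched with plain batch gradient descent. Everything concrete here is an assumption for illustration: min-max scaling is one possible normalization, the learning rate and epoch count are arbitrary, and the four training rows with their Good (1) and Bad (0) labels are made-up toy data, not results from the patent.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale each feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]


def predict(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def fit_weights(rows: Sequence[Sequence[float]], labels: Sequence[int],
                lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the log loss."""
    n = len(rows[0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(x, weights, bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Toy data: two frequency features per entry, manual label 1 = Good, 0 = Bad.
raw = [[40, 12], [35, 9], [3, 0], [5, 1]]
labels = [1, 1, 0, 0]
features = normalize(raw)
weights, bias = fit_weights(features, labels)
print(weights, bias)
```

The fitted coefficients play the role of the weighting coefficients, and the fitted bias corresponds to a decision threshold on the weighted sum.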
FIG. 3 and FIG. 4 show example evaluation results for some entries. The entries in FIG. 3 and FIG. 4 are labeled Good and Bad respectively, which are the results of manual judgment: a Good entry was judged manually as one that should be recognized as a new word, and a Bad entry as one that should not. The scores in FIG. 3 and FIG. 4 were obtained by the automatic evaluation of the invention. The automatic evaluation results are highly consistent with the manual judgments, which demonstrates the effectiveness of the method. The automatic evaluation maintains the validity and accuracy of the evaluation while being more efficient than manual judgment.
Step 204: measure, for each identified new word with high credibility, the relevance between the new word and the evaluation data set.
Here, advertisement library text is a small-scale and/or frequently updated data set. When measuring the relevance between the evaluation data set and new words extracted from advertisement library text, the degree of advertising performance of each new word needs to be evaluated; this may also be called the new word advertising performance measurement. The preferred mode of the invention again uses the results returned by a general search engine (such as Baidu or Google) directly: the number of advertisements on the result page returned by the search engine measures the advertising performance of the new word, on the assumption that the more advertisements there are, the higher the advertising performance weight of the new word. This is simple to implement and allows the advertising performance of new words to be measured automatically, without manual work.
In summary, when the method and the device are used for identifying the new words from the internet text data, the method and the device are suitable for scenes in which the new words are identified on small-scale data sets and/or frequently updated data sets. Moreover, the credibility and the advertising performance of the new words are evaluated by using a general search engine, so that the automatic evaluation of the credibility and the advertising performance of the new words is realized.
A system for recognizing new words comprises a candidate data string extraction unit, a candidate new word extraction unit and a new word recognition unit. The candidate data string extraction unit is configured to perform word segmentation on data in the small-scale data set and/or the frequently updated data set and to extract candidate data strings. The candidate new word extraction unit is configured to filter the candidate data strings according to the set screening strategy and to extract the candidate new words. The new word recognition unit is configured to count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the candidate new words, and identify as new words those whose credibility exceeds a preset value.
Here, the screening policy includes a primary screening policy and an advanced screening policy. Wherein, the primary screening strategy is to filter according to the word-building rule. Advanced screening strategies include: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule. The candidate new word extracting unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.
Additionally, the screening policies may include advanced screening policies. Advanced screening strategies include: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule. The candidate new word extracting unit is further configured to perform screening and filtering on the candidate data strings by using the advanced screening strategy to extract candidate new words.
Here, the new word recognition unit is further configured, when counting, to: count the frequency with which a candidate new word appears in the evaluation data set as a whole; or count the frequency with which, when it appears as a whole, it is separated from the adjacent text by other characters; or count the frequency with which, when it appears as a whole, it appears inside book-title marks; or count the frequency with which, when it appears as a whole, it appears inside quotation marks; or count the frequency with which, when it appears as a whole, it is queried in a general search engine.
Here, the new word recognition unit is further configured, when evaluating, to evaluate the credibility of the candidate new words by the formula $\mathrm{Score}(w) = \sum_i \alpha_i f_i(w)$, and to identify the new words with high credibility; wherein,
$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; $\alpha_i$ represents the weighting coefficients.
Here, the system further includes: a measuring unit configured to measure, for each identified new word with high credibility, the relevance between the new word and the evaluation data set.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (8)
1. A method for recognizing a new word, the method comprising:
performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings;
filtering the candidate data strings according to a set screening strategy to extract candidate new words;
counting the occurrences of the extracted candidate new words in an evaluation data set, wherein the counting uses at least two of the following: the frequency with which a candidate new word appears in the evaluation data set as a whole; the frequency with which, when it appears as a whole, it is separated from the adjacent text by other characters; the frequency with which, when it appears as a whole, it appears inside book-title marks; the frequency with which, when it appears as a whole, it appears inside quotation marks; and the frequency with which, when it appears as a whole, it is queried in a general search engine;
evaluating the credibility of the candidate new words by the formula $\mathrm{Score}(w) = \sum_i \alpha_i f_i(w)$, and identifying as new words those whose credibility exceeds a preset value, wherein $w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the at least two types of frequency data obtained by the counting; and $\alpha_i$ represents the weighting coefficients.
2. The method of claim 1, wherein the screening strategies comprise a primary screening strategy and an advanced screening strategy; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering of the candidate data strings according to the set screening strategy specifically comprises: performing coarse screening and filtering on the candidate data strings using the primary screening strategy, and performing fine screening and filtering on the coarsely screened candidate data strings using the advanced screening strategy.
3. The method of claim 1, wherein the screening strategy comprises an advanced screening strategy; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering of the candidate data strings according to the set screening strategy specifically comprises: screening and filtering the candidate data strings using the advanced screening strategy.
4. The method of claim 3, further comprising: performing, for the identified new words with high credibility, a relevance measurement between the new words and the evaluation data set.
5. A system for recognizing new words, the system comprising: a candidate data string extraction unit, a candidate new word extraction unit, and a new word recognition unit; wherein,
the candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set or the frequently updated data set and extracting candidate data strings;
the candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting candidate new words;
the new word recognition unit is configured to count occurrences of the extracted candidate new words in an evaluation data set, wherein the counting comprises at least two of the following: counting the frequency with which a candidate new word appears as a whole in the evaluation data set; counting the frequency with which the candidate new word, when appearing as a whole in the evaluation data set, is separated from adjacent text by separator characters; counting the frequency with which the candidate new word, when appearing as a whole in the evaluation data set, appears within book title marks; counting the frequency with which the candidate new word, when appearing as a whole in the evaluation data set, appears within quotation marks; and counting the frequency with which the candidate new word is queried in a general search engine;
and is further configured to evaluate the credibility of the candidate new words by a formula (not reproduced in this text), and to identify as new words those candidates whose credibility exceeds a preset value, wherein $w$ represents the candidate new word; $\mathrm{score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the at least two frequency statistics obtained by the counting; and $\alpha_i$ represents the weighting coefficients.
6. The system of claim 5, wherein the screening strategies comprise a primary screening strategy and an advanced screening strategy; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the candidate new word extraction unit is further configured to perform coarse screening and filtering on the candidate data strings using the primary screening strategy, to perform fine screening and filtering on the coarsely screened candidate data strings using the advanced screening strategy, and to extract the candidate new words.
7. The system of claim 5, wherein the screening strategy comprises an advanced screening strategy; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
and the candidate new word extraction unit is further configured to screen and filter the candidate data strings using the advanced screening strategy and to extract the candidate new words.
8. The system of claim 7, further comprising: a measuring unit, configured to measure, for the identified new words with high credibility, the relevance between the new words and the evaluation data set.
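The counting step recited in claims 1 and 5 can be illustrated with a short sketch. It is not the patented implementation: the function name `count_statistics`, the separator set, the set of quotation-mark pairs, and the reading that a candidate must be delimited on both sides to count as "separated" are all assumptions made for illustration. Search-engine query frequency is left out because it depends on external query logs.

```python
import re
from typing import Dict, Iterable

# Characters treated as separators between a candidate and adjacent text.
# This set is an illustrative assumption; the source does not enumerate it.
SEPARATORS = " \t\n,.;:!?，。；：！？、"

# Quotation-mark pairs checked around a candidate (also an assumption).
QUOTE_PAIRS = (("“", "”"), ('"', '"'), ("「", "」"))


def count_statistics(candidate: str, documents: Iterable[str]) -> Dict[str, int]:
    """Count how often `candidate` occurs as a whole, and in which contexts."""
    stats = {"whole": 0, "separated": 0, "book_title": 0, "quoted": 0}
    pattern = re.compile(re.escape(candidate))
    for text in documents:
        for match in pattern.finditer(text):
            stats["whole"] += 1
            # Neighbouring characters; empty string at a text boundary.
            before = text[match.start() - 1] if match.start() > 0 else ""
            after = text[match.end()] if match.end() < len(text) else ""
            # A text boundary is treated like a separator.
            left_sep = before == "" or before in SEPARATORS
            right_sep = after == "" or after in SEPARATORS
            if left_sep and right_sep:
                stats["separated"] += 1
            if before == "《" and after == "》":
                stats["book_title"] += 1
            if (before, after) in QUOTE_PAIRS:
                stats["quoted"] += 1
    return stats
```

The resulting counts could serve as the frequency statistics $f_i(w)$ that feed the confidence score; any normalization of the counts is left open here.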
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010547509.XA CN102467548B (en) | 2010-11-15 | 2010-11-15 | A kind of recognition methods of neologisms and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102467548A (en) | 2012-05-23 |
CN102467548B (en) | 2015-09-16 |
Family
ID=46071191
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010547509.XA Active CN102467548B (en) | 2010-11-15 | 2010-11-15 | A kind of recognition methods of neologisms and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102467548B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8589164B1 (en) * | 2012-10-18 | 2013-11-19 | Google Inc. | Methods and systems for speech recognition processing using search query information |
CN103870459B (en) * | 2012-12-07 | 2017-10-27 | 阿里巴巴集团控股有限公司 | The recognition methods of faced sensing string and device |
CN103955453B (en) * | 2014-05-23 | 2017-09-29 | 清华大学 | A kind of method and device for finding neologisms automatic from document sets |
CN105096933B (en) * | 2015-05-29 | 2017-06-20 | 百度在线网络技术(北京)有限公司 | The generation method and device and phoneme synthesizing method and device of dictionary for word segmentation |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1641634A (en) * | 2004-01-15 | 2005-07-20 | 中国科学院计算技术研究所 | Chinese new word and expression detecting method and its detecting system |
CN101119334A (en) * | 2007-09-21 | 2008-02-06 | 腾讯科技(深圳)有限公司 | Method, system and equipment for obtaining neology |
CN101196904A (en) * | 2007-11-09 | 2008-06-11 | 清华大学 | News keyword abstraction method based on word frequency and multi-component grammar |
CN101464898A (en) * | 2009-01-12 | 2009-06-24 | 腾讯科技(深圳)有限公司 | Method for extracting feature word of text |
CN101706807A (en) * | 2009-11-27 | 2010-05-12 | 清华大学 | Method for automatically acquiring new words from Chinese webpages |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7917355B2 (en) * | 2007-08-23 | 2011-03-29 | Google Inc. | Word detection |
US8180630B2 (en) * | 2008-06-06 | 2012-05-15 | Zi Corporation Of Canada, Inc. | Systems and methods for an automated personalized dictionary generator for portable devices |
Also Published As
Publication number | Publication date |
---|---|
CN102467548A (en) | 2012-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110457688B (en) | Error correction processing method and device, storage medium and processor | |
US7424421B2 (en) | Word collection method and system for use in word-breaking | |
JP6813591B2 (en) | Modeling device, text search device, model creation method, text search method, and program | |
CN101464898B (en) | Method for extracting feature word of text | |
CN106294320B (en) | A kind of terminology extraction method and system towards academic paper | |
US9361362B1 (en) | Synonym generation using online decompounding and transitivity | |
EP2515242A2 (en) | Incorporating lexicon knowledge to improve sentiment classification | |
CN103902619B (en) | A kind of network public-opinion monitoring method and system | |
CN111027323A (en) | Entity nominal item identification method based on topic model and semantic analysis | |
WO2009035863A2 (en) | Mining bilingual dictionaries from monolingual web pages | |
KR102376489B1 (en) | Text document cluster and topic generation apparatus and method thereof | |
WO2017091985A1 (en) | Method and device for recognizing stop word | |
US20150006563A1 (en) | Transitive Synonym Creation | |
CN102467548B (en) | A kind of recognition methods of neologisms and system | |
CN113157903A (en) | Multi-field-oriented electric power word stock construction method | |
CN111125299A (en) | Dynamic word bank updating method based on user behavior analysis | |
CN102567371A (en) | Method for automatically filtering stop words | |
CN115858787B (en) | A Hotspot Extraction and Mining Method Based on Problem Appeal Information in Road Transportation | |
KR100435442B1 (en) | Method And System For Summarizing Document | |
KR20080024530A (en) | Community specific expression detection device and method | |
CN113076740A (en) | Synonym mining method and device in government affair service field | |
Kešelj et al. | A SUFFIX SUBSUMPTION-BASED APPROACH TO BUILDING STEMMERS AND LEMMATIZERS FOR HIGHLY INFLECTIONAL LANGUAGES WITH SPARSE RESOURCES. | |
CN110377845A (en) | Collaborative filtering recommending method based on the semi-supervised LDA in section | |
CN101989281B (en) | Clustering method and device | |
KR101614551B1 (en) | System and method for extracting keyword using category matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C41 | Transfer of patent application or patent right or utility model | ||
TR01 | Transfer of patent right | Effective date of registration: 2015-12-29; Patentee after: Shenzhen Tencent Computer System Co., Ltd.; Address after: The South Road in Guangdong province Shenzhen city Fiyta building 518057 floor 5-10 Nanshan District high tech Zone; Patentee before: Tencent Technology (Shenzhen) Co., Ltd.; Address before: Shenzhen Futian District City, Guangdong province 518044 Zhenxing Road, SEG Science Park 2 East Room 403 |