
CN102467548B - Method and system for identifying new words - Google Patents

Method and system for identifying new words

Info

Publication number
CN102467548B
Authority
CN
China
Prior art keywords
candidate
new words
word
words
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010547509.XA
Other languages
Chinese (zh)
Other versions
CN102467548A (en)
Inventor
严浩
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for identifying new words. The method comprises: performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrences of the extracted candidate new words in an evaluation data set, evaluating the credibility of the new words, and identifying the new words whose credibility exceeds a preset value. The invention also discloses a system for identifying new words, in which a new word recognition unit counts the occurrences of the extracted candidate new words in the evaluation data set, evaluates the credibility of the new words, and identifies the new words whose credibility exceeds the preset value. With the method and system of the invention, new words can be identified on small-scale data sets and on frequently updated data sets.

Description

Method and system for identifying new words
Technical Field
The invention relates to a new word recognition technology in the field of internet information processing, in particular to a method and a system for recognizing new words on a small-scale data set.
Background
In Chinese language processing, words are not naturally separated by spaces as they are in English, so Chinese word segmentation is an important basic technology. With the rapid development of the internet, language on network platforms is continuously updated and a large number of new words are created. New words cause the segmentation result to contain too many single characters or overly fine-grained words, which reduces segmentation accuracy; research shows that nearly 60% of word segmentation errors are caused by inaccurate new word identification. Accurate recognition of new words therefore plays an important role in improving word segmentation.
Existing new word recognition technology is mainly statistical: candidate strings are extracted from a data set of a certain scale based on statistical principles, and linguistic knowledge such as word formation rules is then used to filter out noise strings that are not new words, leaving the new words. This technology has the following defect. Statistical principles require that the amount of data to be analyzed is large and that the randomness of data occurrence is small. The prior art is therefore only suitable for data sets of a certain scale: only when the data set used for extracting new words is large enough can sufficient statistical information be obtained to ensure the recall and accuracy of new word extraction. Small-scale data sets and/or frequently updated data sets do not meet these requirements, so the existing new word recognition technology is not suitable for them.
Disclosure of Invention
In view of this, the present invention provides a method and a system for recognizing new words, which can recognize new words on a small-scale data set or a data set with frequent updates.
The technical scheme of the invention is realized as follows:
a method of identifying new words, the method comprising:
performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings;
filtering the candidate data strings according to a set screening strategy to extract candidate new words;
and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with the credibility exceeding a preset value.
Wherein the screening strategies comprise primary screening strategies and advanced screening strategies; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering the candidate data string according to the set screening policy specifically includes: and adopting the primary screening strategy to carry out coarse screening and filtering on the candidate data strings, and adopting the advanced screening strategy to carry out fine screening and filtering on the candidate data strings after the coarse screening and filtering.
Wherein the screening policies comprise advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering the candidate data string according to the set screening policy specifically includes: and screening and filtering the candidate data strings by adopting the high-level screening strategy.
The counting of the occurrences of the extracted candidate new words in the evaluation data set specifically includes: counting the frequency with which the candidate new words appear in the evaluation data set as a whole; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside book title marks (《》); or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside quotation marks; or counting the frequency with which the candidate new words are queried in a general search engine.
The evaluating the credibility of the new word specifically includes: evaluating the credibility of the new words by the formula $\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$, and identifying the new words with high credibility; wherein,
$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; and $\alpha_i$ represents the weighting coefficients.
Wherein, the method also comprises: and performing relevance measurement of the new words and the evaluation data set on the identified new words with high credibility.
A system for recognizing new words, the system comprising: the device comprises a candidate data string extraction unit, a candidate new word extraction unit and a new word identification unit; wherein,
the candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set or the frequently updated data set and extracting candidate data strings;
the candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting candidate new words;
and the new word recognition unit is used for counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words and recognizing the new words with the credibility exceeding a preset value.
Wherein the screening strategies comprise primary screening strategies and advanced screening strategies; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the candidate new word extracting unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.
Wherein the screening policies comprise advanced screening policies; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
and the candidate new word extraction unit is further used for adopting the high-level screening strategy to screen and filter the candidate data strings and extracting the candidate new words.
The new word recognition unit is further used, when performing the counting, for counting the frequency with which the candidate new words appear in the evaluation data set as a whole; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside book title marks (《》); or counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside quotation marks; or counting the frequency with which the candidate new words are queried in a general search engine.
Wherein the new word recognition unit is further configured, when performing the evaluation, to evaluate the credibility of the new words by the formula $\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$, and to identify the new words with high credibility; wherein,
$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; and $\alpha_i$ represents the weighting coefficients.
Wherein, this system still includes: and the measuring unit is used for measuring the relevance between the new words and the evaluation data set for the identified new words with high credibility.
The method comprises the steps of carrying out word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with high credibility. By adopting the method and the device, new word recognition can be performed on a small-scale data set or a data set with frequent updating.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of the method of the present invention;
FIG. 2 is a schematic flow chart of a second embodiment of the method of the present invention;
FIG. 3 is a diagram illustrating evaluation results according to the present invention;
FIG. 4 is another schematic diagram of the evaluation results of the present invention.
Detailed Description
The method comprises the steps of carrying out word segmentation on data in a small-scale data set and/or a frequently updated data set, and extracting candidate data strings; filtering the candidate data strings according to a set screening strategy to extract candidate new words; and counting the occurrence condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words, and identifying the new words with the credibility exceeding a preset value.
In order to make the technical solution and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings by way of examples.
Method embodiment one: a method for recognizing new words, as shown in FIG. 1, includes the following steps:
Step 101: perform word segmentation on data in the small-scale data set or the frequently updated data set, and extract candidate data strings.
Step 102: filter the candidate data strings according to the set screening strategy, and extract candidate new words.
Step 103: count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the new words, and identify the new words whose credibility exceeds a preset value.
Method embodiment two: a method for recognizing new words, as shown in FIG. 2, mainly includes the following steps:
Step 201: perform word segmentation on the data in the small-scale data set and/or the frequently updated data set, and extract all binary or ternary data strings as candidate data strings.
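The extraction in step 201 can be sketched as follows. This is a minimal illustration, assuming the word segmenter has already produced a list of tokens; the function name `extract_candidates` and the romanized sample tokens are hypothetical and not part of the patent.

```python
from typing import List, Sequence


def extract_candidates(tokens: Sequence[str], sizes: Sequence[int] = (2, 3)) -> List[str]:
    """Join every run of 2 or 3 adjacent segmented tokens into one candidate data string."""
    candidates: List[str] = []
    for n in sizes:
        # Slide a window of n tokens over the segmented text.
        for i in range(len(tokens) - n + 1):
            candidates.append("".join(tokens[i:i + n]))
    return candidates


# Hypothetical segmenter output in which a new word was split into single units.
tokens = ["wu", "tai", "shan"]
print(extract_candidates(tokens))
```

Every binary and ternary string is kept at this stage; rejecting noise is left to the screening in step 202.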
Step 202: filter the candidate data strings according to the set screening strategy, and extract candidate new words.
Here, the screening policy includes a primary screening policy and an advanced screening policy.
The primary screening strategy is to filter according to a word formation rule, namely: the candidate data strings are firstly filtered for the first time according to the word formation rule of the new word, such as the part of speech rule and the word rule, so that most of obvious noise candidate data strings are eliminated.
The advanced screening strategy performs fine screening to obtain more accurate candidate new words, which provides a good basis for the subsequent credibility evaluation. Advanced screening strategies include the following categories:
a1. Filtering by the head-word dictionary and the tail-word dictionary. If the first word of the candidate data string appears in the head-word dictionary (e.g., "compares"), the candidate data string is deleted; if the last word of the candidate data string appears in the tail-word dictionary (e.g., "one"), the candidate data string is also deleted.
a2. Filtering by the middle word of triples and of longer combinations. In new word recognition, the middle word of a triple can be used to judge whether a candidate data string is a garbage string, for example: "tiger by him" and "ocean by us".
a3. Filtering by the tail word in the 2+1 mode or the 3+1 mode. Words in the 2+1 mode, such as "wutaishan", "peste temple", "beijing city" and "off-road vehicle", and words in the 3+1 mode, such as "wolfo vehicle", "asia european continent", "srilanka", "markov" and "australia", share a particular property: their tail words are mostly suffixes formed from morphemes. These suffixes are collected to form a word-tail dictionary (trisuffix). If a candidate data string is judged to belong to the 2+1 mode or the 3+1 mode and its tail word is in the word-tail dictionary, the candidate data string is a valid string; otherwise, it is an invalid string and is deleted.
a4. Filtering by word formation rules. Candidate data strings that are feature strings can be judged by part of speech. Candidate data strings that clearly violate the part-of-speech rules are deleted using, for example, a head-word part-of-speech filtering rule, a tail-word part-of-speech filtering rule, a head-word collocation filtering rule and a tail-word collocation filtering rule.
The head-word part-of-speech filtering rule restricts the head word (or prefix) of a candidate data string; if the head word matches the rule, the candidate is deleted. The filtered parts of speech include: time words, prepositions, suffix components, interjections, quantifiers, auxiliary words, adverbs, and the like.
The tail-word part-of-speech filtering rule restricts the tail word (or suffix) of a candidate data string; if the tail word matches the rule, the candidate is deleted. The filtered parts of speech include: time words, adverbs, interjections, prefix components, prepositions, conjunctions, and the like.
The head-word collocation filtering rule: if the first two words are a numeral followed by a quantifier, they are filtered out. For example, "one thousand to one beauty story" is obtained after filtration.
The tail-word collocation filtering rule: if the last two words are a numeral followed by a quantifier, a preposition followed by a noun, or an adverb followed by a verb, they are filtered out. For example, filtering "a piece of granite huge stone rushes toward me quickly" yields "granite huge stone".
The filtering the candidate data string according to the set screening policy specifically includes: and adopting the primary screening strategy to carry out coarse screening and filtering on the candidate data strings, and adopting the advanced screening strategy to carry out fine screening and filtering on the candidate data strings after the coarse screening and filtering.
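Two of the advanced screening rules, the head/tail dictionary check and the 2+1 or 3+1 tail-word check, can be sketched as below. This is an illustrative sketch under stated assumptions: a candidate is modeled as a tuple of single units, a 3-unit candidate is treated as the 2+1 mode and a 4-unit candidate as the 3+1 mode, and candidates of other lengths skip the tail-word check. The dictionary contents and all names (`HEAD_DICT`, `TAIL_DICT`, `SUFFIX_DICT`, `fine_screen`) are hypothetical; the patent does not specify how the mode is judged or what the dictionaries contain.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical dictionary contents, for illustration only.
HEAD_DICT: Set[str] = {"bi"}            # units that must not start a new word
TAIL_DICT: Set[str] = {"yi"}            # units that must not end a new word
SUFFIX_DICT: Set[str] = {"shan", "shi"}  # word-tail (suffix) dictionary


def head_tail_ok(units: Sequence[str]) -> bool:
    """Dictionary rule: reject if the first unit is in the head dictionary
    or the last unit is in the tail dictionary."""
    return units[0] not in HEAD_DICT and units[-1] not in TAIL_DICT


def suffix_mode_ok(units: Sequence[str]) -> bool:
    """Tail-word rule: a 2+1 or 3+1 candidate is valid only if its tail unit
    is a known suffix. Assumption: length 3 means 2+1, length 4 means 3+1."""
    if len(units) in (3, 4):
        return units[-1] in SUFFIX_DICT
    return True


def fine_screen(candidates: Sequence[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Keep only the candidates that pass both rules."""
    return [c for c in candidates if head_tail_ok(c) and suffix_mode_ok(c)]


candidates = [
    ("wu", "tai", "shan"),  # kept: tail unit is a known suffix
    ("bi", "tai", "shan"),  # dropped: head unit is in the head dictionary
    ("tai", "shan", "yi"),  # dropped: tail unit is in the tail dictionary
    ("wu", "tai"),          # kept: not a 2+1 or 3+1 candidate
]
print(fine_screen(candidates))
```

In a two-stage setup, the coarse part-of-speech screening would run before `fine_screen`, so that the dictionary lookups only see candidates that survived the cheaper check.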
Alternatively, the screening policy may include only the advanced screening policy.
In that case, the filtering the candidate data string according to the set screening policy specifically includes: screening and filtering the candidate data strings using the advanced screening strategy.
Step 203: count the occurrences of the extracted candidate new words in the evaluation data set, evaluate the credibility of the new words, and identify the new words whose credibility exceeds a preset value as new words with high credibility.
The evaluation data set can be selected according to application requirements, and the preferred mode of the invention is to directly utilize a general search engine to send the candidate new words into the general search engine (such as Baidu and Google) and capture the returned result as the evaluation data set. The benefits of this are: the method is simple and convenient to realize, and can automatically evaluate the credibility of the new word without manual operation.
The evaluation criterion is mainly to count the occurrence of the extracted candidate new word as a whole, for example, the more frequently the extracted candidate new word occurs as a whole, the higher the credibility of the extracted candidate new word as a new word. In addition, for special appearance situations, such as the extracted candidate new word is separated from the adjacent text by punctuation, the credibility of the new word is also increased. Specifically, the statistics of the appearance of the extracted candidate new word as a whole includes the following:
(1) the frequency of occurrence of the candidate new words as a whole;
(2) the frequency with which the candidate new word as a whole is separated from the adjacent text by characters such as spaces, symbols and punctuation;
(3) the frequency of occurrence of the candidate new words inside book title marks (《》);
(4) the frequency of occurrence of candidate new words in quotation marks;
(5) the frequency with which the candidate new words are queried in the search engine.
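The first four statistics above can be gathered from the evaluation text with regular expressions, as in the sketch below. This is only an illustration: the separator set, the accepted quotation marks, and the name `count_features` are assumptions, and statistic (5) is omitted because it has to come from a search engine's own query data rather than from the text.

```python
import re
from typing import Dict


def count_features(word: str, text: str) -> Dict[str, int]:
    """Count how often `word` occurs in `text` under four of the listed conditions."""
    w = re.escape(word)
    # Assumed separator set: whitespace plus common ASCII and Chinese punctuation.
    sep = r"[\s,.;:!?，。；：！？、]"
    return {
        # (1) every occurrence of the word as a whole
        "whole": len(re.findall(w, text)),
        # (2) occurrences bounded on both sides by a separator or the text edge
        "separated": len(re.findall(rf"(?:^|{sep}){w}(?={sep}|$)", text)),
        # (3) occurrences inside book title marks
        "book_title": len(re.findall(rf"《{w}》", text)),
        # (4) occurrences inside straight or curly double quotation marks
        "quoted": len(re.findall(rf'["“]{w}["”]', text)),
    }


sample = 'Wutaishan is high. "Wutaishan" and 《Wutaishan》 and xWutaishanx'
print(count_features("Wutaishan", sample))
```

When the evaluation data set is built from search engine results, `text` would be the captured result pages for the candidate.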
Here, based on the counted occurrences of the extracted candidate new word as a whole, the formula used for evaluating the credibility of the candidate new word may be: $\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$, where $w$ represents a candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency statistics, namely the 5 types of frequency data above; and $\alpha_i$ are the weighting coefficients used in calculating the confidence of the new word.
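A minimal sketch of the scoring step follows, assuming the confidence score is a weighted sum of the frequency statistics and that a candidate is accepted when its score strictly exceeds the preset value. The function names, the sample weights and the threshold are hypothetical.

```python
from typing import Sequence


def score(freqs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the frequency statistics f_i(w) with coefficients alpha_i."""
    if len(freqs) != len(weights):
        raise ValueError("freqs and weights must have the same length")
    return sum(a * f for a, f in zip(weights, freqs))


def is_new_word(freqs: Sequence[float], weights: Sequence[float], threshold: float) -> bool:
    """Accept the candidate only if its score exceeds the preset value."""
    return score(freqs, weights) > threshold


# Hypothetical statistics for one candidate and hypothetical weights.
freqs = [1.0, 2.0, 3.0]
weights = [0.5, 0.5, 1.0]
print(score(freqs, weights), is_new_word(freqs, weights, threshold=4.0))
```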
To determine the weighting coefficients $\alpha_i$, in addition to the automatic evaluation described above, a subset of the candidate new words can be evaluated manually, and the manually labeled entries can then be used as training data for a machine learning method that yields the values of $\alpha_i$. Various specific methods can be used. For example, the task can be treated as a logistic regression problem: the manual judgment of each entry is taken as the dependent variable, the normalized frequency statistics are taken as the independent variables, and the values of $\alpha_i$ are obtained by solving with a corresponding method. For the specific procedure, refer to material on logistic regression, for example Chapter 12 of Regression Analysis by Example (Chinese 3rd edition) or Section 4.3.2 of Pattern Recognition and Machine Learning.
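The logistic regression approach can be sketched with plain batch gradient descent, as below. This is one possible solver chosen for illustration, not the procedure prescribed by the patent; the bias term, the learning rate, the epoch count and the tiny labeled sample are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Probability that an entry with normalized frequencies x is a good new word."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_weights(X: Sequence[Sequence[float]], y: Sequence[int],
                lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the log loss."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi  # gradient of the log loss w.r.t. the logit
            for j in range(n):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Hypothetical normalized frequencies with manual labels (1 = good, 0 = bad).
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 0 + 1, 0, 0]
w, b = fit_weights(X, y)
print(w, b)
```

The learned `w` plays the role of the weighting coefficients; in practice far more labeled entries would be needed than in this toy sample.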
FIG. 3 and FIG. 4 show examples of the evaluation results for some entries. The entries in FIG. 3 and FIG. 4 are labeled as the Good class and the Bad class respectively, according to manual judgment. The Good class contains entries that were manually judged to be good and should be recognized as new words; the Bad class contains entries that were manually judged to be bad and should not be recognized as new words. The scores in FIG. 3 and FIG. 4 are the results of the automatic evaluation of the present invention. The automatic evaluation results are highly consistent with the manual evaluation results, which demonstrates the effectiveness of the method. The automatic evaluation of the invention ensures the effectiveness and accuracy of the evaluation while being more efficient than manual judgment.
Step 204: measure the relevance between the identified high-credibility new words and the evaluation data set.
Here, advertisement library text is an example of a small-scale data set and/or a frequently updated data set. When the relevance between a new word extracted from advertisement library text and the evaluation data set is measured, the degree of advertising performance of the new word needs to be evaluated; this may also be referred to as new word advertising performance measurement. The preferred mode of the invention is again to use the results returned by general search engines (such as Baidu and Google) directly, namely: the number of advertisements on the page returned by the search engine is used to measure the advertising performance of the new word, on the assumption that the more advertisements there are, the higher the advertising performance weight of the new word. The benefits are that the method is simple to implement and can measure the advertising performance of the new word automatically, without manual operation.
In summary, when the method and the device are used for identifying the new words from the internet text data, the method and the device are suitable for scenes in which the new words are identified on small-scale data sets and/or frequently updated data sets. Moreover, the credibility and the advertising performance of the new words are evaluated by using a general search engine, so that the automatic evaluation of the credibility and the advertising performance of the new words is realized.
A system for recognizing new words, the system comprising: the device comprises a candidate data string extraction unit, a candidate new word extraction unit and a new word identification unit. The candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set and/or the frequently updated data set and extracting candidate data strings. And the candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting the candidate new words. The new word recognition unit is used for counting the appearance condition of the extracted candidate new words in the evaluation data set, evaluating the credibility of the new words and recognizing the new words with the credibility exceeding a preset value.
Here, the screening policy includes a primary screening policy and an advanced screening policy. Wherein, the primary screening strategy is to filter according to the word-building rule. Advanced screening strategies include: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule. The candidate new word extracting unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.
Additionally, the screening policies may include advanced screening policies. Advanced screening strategies include: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule. The candidate new word extracting unit is further configured to perform screening and filtering on the candidate data strings by using the advanced screening strategy to extract candidate new words.
Here, the new word recognition unit is further configured, when performing the counting, to count the frequency with which the candidate new words appear in the evaluation data set as a whole; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside book title marks (《》); or to count the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside quotation marks; or to count the frequency with which the candidate new words are queried in a general search engine.
Here, the new word recognition unit is further configured, when performing the evaluation, to evaluate the credibility of the new words by the formula $\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$, and to identify the new words with high credibility; wherein,
$w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the various frequency data obtained by statistics; and $\alpha_i$ represents the weighting coefficients.
Here, the system further includes: and the measuring unit is used for measuring the relevance of the new words and the evaluation data set for the identified new words with high credibility.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (8)

1. A method for recognizing a new word, the method comprising:
performing word segmentation on data in a small-scale data set or a frequently updated data set, and extracting candidate data strings;
filtering the candidate data strings according to a set screening strategy to extract candidate new words;
counting the occurrences of the extracted candidate new words in the evaluation data set, wherein the counting comprises at least two of the following: counting the frequency with which the candidate new words appear in the evaluation data set as a whole; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, are separated from the adjacent text by characters; counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside book title marks (《》); counting the frequency with which the candidate new words, when appearing in the evaluation data set as a whole, appear inside quotation marks; and counting the frequency with which the candidate new words are queried in a general search engine;
evaluating the credibility of the new words by the formula $\mathrm{Score}(w) = \sum_{i} \alpha_i f_i(w)$, and identifying the new words whose credibility exceeds a preset value, wherein $w$ represents the candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the at least two types of frequency data obtained by the counting; and $\alpha_i$ represents the weighting coefficients.
2. The method of claim 1, wherein the screening strategies comprise a primary screening strategy and an advanced screening strategy; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering of the candidate data strings according to the set screening strategy specifically comprises: performing coarse screening and filtering on the candidate data strings using the primary screening strategy, and performing fine screening and filtering on the coarse-screened candidate data strings using the advanced screening strategy.
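The coarse-then-fine screening of claim 2 can be pictured as two filters applied in sequence. The sketch below is only an illustration: the word-building rule (a length bound plus a CJK-only check) and the head and tail character sets are stand-ins chosen for the example, not rules taken from the patent.

```python
def primary_filter(candidates, min_len=2, max_len=4):
    """Coarse screen with a stand-in word-building rule:
    keep strings of 2 to 4 characters made only of CJK ideographs."""
    return [
        c
        for c in candidates
        if min_len <= len(c) <= max_len
        and all("\u4e00" <= ch <= "\u9fff" for ch in c)
    ]


def advanced_filter(candidates, bad_heads, bad_tails):
    """Fine screen: drop strings whose first character is in the head-word
    dictionary or whose last character is in the tail-word dictionary."""
    return [
        c for c in candidates if c[0] not in bad_heads and c[-1] not in bad_tails
    ]


def screen(candidates, bad_heads, bad_tails):
    """Apply the coarse screen first, then the fine screen to its output."""
    return advanced_filter(primary_filter(candidates), bad_heads, bad_tails)
```

Running the cheap rule first keeps the dictionary lookups of the fine screen off strings that could never be words.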
3. The method of claim 1, wherein the screening strategy comprises an advanced screening strategy; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the filtering of the candidate data strings according to the set screening strategy specifically comprises: screening and filtering the candidate data strings using the advanced screening strategy.
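The claims do not spell out what the "2+1 mode or 3+1 mode tail words" filter checks. One possible reading, sketched below purely as an assumption, is that a candidate formed by a known two- or three-character word plus a single trailing character is discarded when that trailing character is on a list of noise tails. The lexicon, the tail list, and the function names are hypothetical.

```python
def is_n_plus_one_noise(candidate, lexicon, noise_tails):
    """True if candidate is a known 2- or 3-character word followed by
    exactly one character from the noise-tail list (assumed reading)."""
    if len(candidate) < 3:
        return False
    stem, tail = candidate[:-1], candidate[-1]
    return len(stem) in (2, 3) and stem in lexicon and tail in noise_tails


def advanced_tail_filter(candidates, lexicon, noise_tails):
    """Keep only the candidates that do not match the 2+1 / 3+1 noise pattern."""
    return [
        c for c in candidates if not is_n_plus_one_noise(c, lexicon, noise_tails)
    ]
```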
4. The method of claim 3, further comprising: measuring, for the identified new words with high credibility, the relevance between the new words and the evaluation data set.
5. A system for recognizing new words, the system comprising: a candidate data string extraction unit, a candidate new word extraction unit, and a new word recognition unit; wherein,
the candidate data string extraction unit is used for performing word segmentation on data in the small-scale data set or the frequently updated data set and extracting candidate data strings;
the candidate new word extraction unit is used for filtering the candidate data strings according to the set screening strategy and extracting candidate new words;
the new word recognition unit is used for counting the appearance of the extracted candidate new words in the evaluation data set, wherein the counting comprises at least two of the following: counting the frequency with which the candidate new word appears as a whole in the evaluation data set; counting the frequency with which the candidate new word, when it appears as a whole in the evaluation data set, is separated from the adjacent text by characters; counting the frequency with which the candidate new word, when it appears as a whole in the evaluation data set, appears within book title marks; counting the frequency with which the candidate new word, when it appears as a whole in the evaluation data set, appears within quotation marks; and counting the frequency with which the candidate new word, when it appears as a whole in the evaluation data set, is queried in a general search engine;
and for evaluating the credibility of the new words by the formula $\mathrm{Score}(w) = \sum_i \alpha_i f_i(w)$, and identifying the new words whose credibility exceeds a preset value, wherein $w$ represents a candidate new word; $\mathrm{Score}(w)$ represents the final confidence score of the candidate new word; $f_i(w)$ represents the at least two items of frequency data obtained by the counting; and $\alpha_i$ represents the weighting coefficients.
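The three units of claim 5 can be read as three small components chained together. The following Python sketch is one hypothetical arrangement: the segmenter, the keep predicate, and the feature functions are supplied by the caller, candidate data strings are assumed to be formed by joining adjacent segments, and the score is assumed to be a weighted sum of the features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CandidateStringExtractor:
    """Candidate data string extraction unit."""
    segment: Callable[[str], List[str]]  # hypothetical segmenter

    def extract(self, texts):
        strings = set()
        for text in texts:
            tokens = self.segment(text)
            # Assumption: a candidate string joins two adjacent segments.
            strings.update(a + b for a, b in zip(tokens, tokens[1:]))
        return strings


@dataclass
class CandidateWordExtractor:
    """Candidate new word extraction unit."""
    keep: Callable[[str], bool]  # hypothetical screening predicate

    def extract(self, strings):
        return sorted(s for s in strings if self.keep(s))


@dataclass
class NewWordRecognizer:
    """New word recognition unit."""
    features: Dict[str, Callable[[str, List[str]], float]]
    weights: Dict[str, float]
    threshold: float

    def recognize(self, candidates, evaluation_set):
        recognized = []
        for word in candidates:
            # Weighted sum of the frequency features (assumed form).
            score = sum(
                self.weights[name] * feature(word, evaluation_set)
                for name, feature in self.features.items()
            )
            if score > self.threshold:
                recognized.append((word, score))
        return recognized
```

Keeping the segmenter, the screening predicate, and the feature functions as injected callables lets each unit be swapped or tested on its own.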
6. The system of claim 5, wherein the screening strategy comprises a primary screening strategy and an advanced screening strategy; wherein,
the primary screening strategy is to filter according to a word-building rule;
the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the candidate new word extracting unit is further configured to perform coarse screening and filtering on the candidate data strings by using the primary screening strategy, perform fine screening and filtering on the candidate data strings after the coarse screening and filtering by using the advanced screening strategy, and extract candidate new words.
7. The system of claim 5, wherein the screening strategy comprises an advanced screening strategy; the advanced screening strategy comprises: filtering according to the head word dictionary and the tail word dictionary; or filtering according to the triples and the intermediate words combined by the triples; or filtering according to 2+1 mode or 3+1 mode tail words; or filtering according to the word forming rule;
the candidate new word extraction unit is further configured to screen and filter the candidate data strings using the advanced screening strategy and to extract the candidate new words.
8. The system of claim 7, further comprising: a measuring unit configured to measure, for the identified new words with high credibility, the relevance between the new words and the evaluation data set.
CN201010547509.XA 2010-11-15 2010-11-15 A kind of recognition methods of neologisms and system Active CN102467548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010547509.XA CN102467548B (en) 2010-11-15 2010-11-15 A kind of recognition methods of neologisms and system

Publications (2)

Publication Number Publication Date
CN102467548A CN102467548A (en) 2012-05-23
CN102467548B true CN102467548B (en) 2015-09-16

Family

ID=46071191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010547509.XA Active CN102467548B (en) 2010-11-15 2010-11-15 A kind of recognition methods of neologisms and system

Country Status (1)

Country Link
CN (1) CN102467548B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8589164B1 (en) * 2012-10-18 2013-11-19 Google Inc. Methods and systems for speech recognition processing using search query information
CN103870459B (en) * 2012-12-07 2017-10-27 阿里巴巴集团控股有限公司 The recognition methods of faced sensing string and device
CN103955453B (en) * 2014-05-23 2017-09-29 清华大学 A kind of method and device for finding neologisms automatic from document sets
CN105096933B (en) * 2015-05-29 2017-06-20 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of dictionary for word segmentation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1641634A (en) * 2004-01-15 2005-07-20 中国科学院计算技术研究所 Chinese new word and expression detecting method and its detecting system
CN101119334A (en) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system and equipment for obtaining neology
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN101706807A (en) * 2009-11-27 2010-05-12 清华大学 Method for automatically acquiring new words from Chinese webpages

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
US8180630B2 (en) * 2008-06-06 2012-05-15 Zi Corporation Of Canada, Inc. Systems and methods for an automated personalized dictionary generator for portable devices

Also Published As

Publication number Publication date
CN102467548A (en) 2012-05-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20151229

Address after: Floors 5-10, Fiyta Building, South Road, High-tech Zone, Nanshan District, Shenzhen, Guangdong Province, 518057

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: Room 403, East Block 2, SEG Science Park, Zhenxing Road, Futian District, Shenzhen, Guangdong Province, 518044

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.