CN1460948A

CN1460948A - Method and device for amending or improving words application

Info

Publication number: CN1460948A
Application number: CN03138209A
Authority: CN
Inventors: P. J. Whitelock; P. G. Edmonds
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-05-22
Filing date: 2003-05-22
Publication date: 2003-12-10
Anticipated expiration: 2023-05-22
Also published as: JP4278090B2; JP2004005641A; GB0211727D0; CN1273915C; GB2388940A

Abstract

A database (3) is provided containing associations between words together with likelihood values, which provide a measure of how likely each association is to be correct or idiomatic. The likelihood values are based on the frequencies of associations obtained by analysing large quantities of text, for example text written by native speakers. To check a text segment for possibly wrong or unnatural usage of one or more of its words, the text is first analysed (11) to determine the associations between its words. The likelihood values of the associations in the analysed text are determined from the database (3). A plausibility value is calculated (14) for each word in the analysed text by combining the likelihood values of the associations in which the word appears. Another database (4) is indexed by words and contains, for each index word, the set of words with which it is easily confused. Each confusable word is selected (13, 16) in turn and substituted for the index word in its associations. The likelihood values of these new associations are determined and the plausibility value of the confusable word is calculated (14). In an error detection embodiment, confusable words are tried (23, 24) for those words whose plausibility falls below a threshold, and the confusable words with increased plausibility are reported (25, 26) to the user. In a context-sensitive thesaurus embodiment, confusable words may be tried for all words, and those whose plausibility value exceeds a second threshold may be reported.

Description

Method and apparatus for modifying or improving the use of words

Technical field

The present invention relates to a method and apparatus for correcting or improving the choice and use of words in natural language text. The invention also relates to a computer program for programming a computer to perform such a method, a storage medium containing such a program, and a computer programmed with such a program.

Background art

Choosing which words to use is central to writing or speaking in a language. To help with this choice, native writers use a thesaurus, while language learners generally use a bilingual dictionary. However, native writers find that a thesaurus does not give detailed information about which synonym is appropriate in a given context, and learners may select the wrong translation from a bilingual dictionary for lack of general ability or knowledge, or may misspell one word as another.

In an annotated corpus of English learner writing (Nicholls, 1999, "The Cambridge Learner Corpus - Error Coding and Analysis for Writing Dictionaries and other books for English Learners", Summer Workshop on Learner Corpora, Cambridge University Press), wrong use of a verb or preposition is the most common type of error after spelling and punctuation errors. For example, a writer might use "associate to" instead of "associate with", "loose one's temper" instead of "lose one's temper", or "wins me at tennis" instead of "beats me at tennis".

The present invention makes it possible to detect these and other types of error and to propose corrections for them. It can handle real-word spelling errors (such as lose/loose) as well as errors of other kinds.

Looking up a word such as "make" in a thesaurus, a writer will find a large number of synonyms. These may be grouped into clusters that share a central meaning. One cluster may include synonyms such as "create", "construct" and "establish", but the writer will not find there usages such as "creates a diversion", "constructs a model" or "establishes a relationship".

The present invention makes it possible to offer these synonyms as suggestions in response to inputs such as "make a diversion", "make a model" or "make a relationship".

The present invention makes use of dependencies, or associations, formed by the relationship between two words or phrases that co-occur (not necessarily adjacently) in a piece of written or spoken language, hereinafter referred to as text. An association may have a strength or likelihood measured on the basis of the frequency with which it occurs in a large body of text. A word in a text may have a plausibility value based on the likelihood values of the associations in which the word takes part. A word that is implausible in the text is likely to be wrong or unnatural in its context.

US Patents 4,916,614, 4,942,526 and 5,406,480 disclose the creation and use of co-occurrence information in syntactic analysis and translation.

The techniques disclosed in each of US Patents 4,674,065, 4,868,750, 5,258,909, 5,537,317, 5,659,771, 5,799,269 and 5,907,839 use a list of sets of commonly confused words, such as "hear" and "here", or "to" and "too". The occurrence of such a word in a text indicates a potential error. The patents go on to describe different methods of correcting the errors.

US Patent 4,674,065 discloses a technique using a system of rules that describe the different contexts in which confusable words are used, in order to distinguish between them.

US Patents 4,868,750, 5,537,317 and 5,799,269 disclose systems that assign probabilities to part-of-speech sequences. The probability of a sequence containing a confusable word can be compared with the probability of the sequence containing the word with which it is confused. If the latter is greater than the former, a possible error is reported.

US Patent 5,258,909 discloses a system that assigns probabilities to word sequences, assigns probabilities to one word being misspelled as another, and combines these probabilities to determine whether a word has been misspelled as another word.

US Patents 5,659,771 and 5,907,839 disclose a system that associates words with features representing their context and uses a machine learning algorithm to compute a function from the feature values of a particular member of a confusable set. When a member of a confusable set appears in a text, the function is used to classify it as correct or incorrect.

Chodorow and Leacock, "An unsupervised method for detecting grammatical errors" (Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2002, pp. 140-147), disclose a system that detects errors using an n-gram model of contiguous words. The system can detect previously unseen category-changing and category-preserving errors but, because the model is contiguous, it covers only a very limited span. The correction of errors is not discussed.

US Patent 5,999,896 discloses a system that identifies possible errors in word usage from the failure of a parser and resolves them by finding confusable words that allow a subsequent parse to succeed.

Summary of the invention

According to a first aspect of the invention, there is provided a method of correcting or improving the choice of a first word or phrase in a written or spoken text segment in a first language comprising a plurality of words, the method comprising the steps of:

(a) providing a first database of associations between words or phrases in the first language, each association having at least one associated likelihood value based on the frequency of occurrence of the association in text of the first language;

(b) analysing the text segment to establish a first association between the first word or phrase and a second word or phrase of the text segment, at least one first likelihood value for that association, and a first plausibility value for the first word or phrase based on the at least one likelihood value;

(c) providing a second database in which each entry associates at least one word or phrase with a set of words or phrases with which it can be confused;

(d) selecting or deriving from the second database a confusable word or phrase as a candidate replacement for the first word or phrase in the text segment;

(e) deriving a second plausibility value for the confusable word or phrase based on the likelihood value in the first database of a second association, the second association being formed between the confusable word or phrase and another word or phrase of the text segment; and

(f) selectively providing an indication of the confusable word or phrase based on the calculated plausibility values.
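Steps (a) to (f) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the claimed implementation: the names `LIKELIHOOD`, `CONFUSABLE`, `UNSEEN`, `plausibility`, `substitute` and `suggest` are hypothetical, the database contents are toy values (only the value -143.05 for "associate to" is taken from the worked T-score example in the description), and taking the minimum likelihood as the plausibility of a word is an assumed way of combining values.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (first lemma, relation, second lemma)

# (a) Toy first database. Only -143.05 comes from the description's worked
# example; the other two values are invented for illustration.
LIKELIHOOD: Dict[Triple, float] = {
    ("associate_V", "padv", "to_PREP"): -143.05,
    ("associate_V", "padv", "with_PREP"): 50.0,
    ("associate_V", "padv", "for_PREP"): -20.0,
}

# (c) Toy second database: word -> words it can be confused with (invented).
CONFUSABLE: Dict[str, List[str]] = {"to_PREP": ["for_PREP", "with_PREP"]}

UNSEEN = 0.0  # assumed likelihood value for an association not in the database


def plausibility(word: str, triples: List[Triple]) -> float:
    """Combine the likelihood values of the associations containing `word`.

    Using the minimum is an assumption made for this sketch.
    """
    values = [LIKELIHOOD.get(t, UNSEEN) for t in triples if word in (t[0], t[2])]
    return min(values) if values else UNSEEN


def substitute(word: str, candidate: str, triples: List[Triple]) -> List[Triple]:
    """Replace `word` by `candidate` in every association, keeping the relation."""
    return [
        (candidate if a == word else a, rel, candidate if b == word else b)
        for a, rel, b in triples
    ]


def suggest(word: str, triples: List[Triple]) -> List[Tuple[str, float]]:
    first = plausibility(word, triples)                   # step (b)
    found = []
    for candidate in CONFUSABLE.get(word, []):            # step (d)
        new_triples = substitute(word, candidate, triples)
        second = plausibility(candidate, new_triples)     # step (e)
        if second > first:                                # step (f)
            found.append((candidate, second))
    # Report candidates in descending order of second plausibility value.
    return sorted(found, key=lambda pair: pair[1], reverse=True)


segment = [("associate_V", "padv", "to_PREP")]
print(suggest("to_PREP", segment))
```

With these toy values, "with" is ranked above "for" as a replacement for "to" after "associate".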

The likelihood value of each association in the first database may also be based on the frequencies of occurrence of other associations that each contain one of its words or phrases with the same dependency relation.

The likelihood value of each association in the first database may also be based on the frequency of occurrence of all other associations having the same dependency relation.

The likelihood value of each association in the first database may comprise at least one of mutual information, T-score, Z-score, Yule's Q coefficient and log-likelihood.

In step (e), the other word or phrase may be the second word or phrase, and the dependency relation of the second association may be the same as that of the first association.

Step (b) may comprise establishing a plurality of first associations for a plurality of first words or phrases in the text segment, and steps (d), (e) and (f) may be performed for each first association.

Step (b) may comprise establishing associations between non-adjacent words or phrases of the text.

Step (d) may comprise selecting each confusable word or phrase of a set of words or phrases, and steps (e) and (f) may be performed for each confusable word or phrase.

Step (f) may comprise providing the indications in descending order of second plausibility value.

Steps (d), (e) and (f) may be performed if the first plausibility value is less than a first threshold.

Step (f) may comprise providing the indication when the or each second plausibility value exceeds a second threshold.

Step (f) may comprise providing the indication if the second plausibility value is greater than the first plausibility value.

Step (b) may comprise calculating the first plausibility value by means of a function learned by a machine learning technique from an annotated corpus of learner errors and their associated plausibility values.

The method may comprise replacing the first word in the text segment with the confusable word.

The method may comprise generating the text segment by translation from a second language.

The method may comprise generating the text segment from a printed document by optical character recognition.

According to a second aspect of the invention, there is provided a computer program for programming a computer to perform a method according to the first aspect of the invention.

According to a third aspect of the invention, there is provided a storage medium containing a program according to the second aspect of the invention.

The medium may comprise a computer-readable medium.

According to a fourth aspect of the invention, there is provided a computer containing a program according to the second aspect of the invention.

According to a fifth aspect of the invention, there is provided an apparatus for correcting or improving the choice of a first word or phrase in a written or spoken text segment in a first language comprising a plurality of words, the apparatus comprising:

a first database of associations between words or phrases in the first language, each association having at least one associated likelihood value based on the frequency of occurrence of the association in a large quantity of text of the first language;

an analyser for analysing the text segment to establish a first association between the first word or phrase and a second word or phrase of the text segment, at least one first likelihood value for that association, and a first plausibility value for the first word or phrase based on the at least one likelihood value;

a second database in which each entry associates at least one word or phrase with a set of words or phrases with which it can be confused;

means for selecting or deriving from the second database a confusable word or phrase as a candidate replacement for the first word or phrase in the text segment;

means for deriving a second plausibility value for the confusable word or phrase based on the likelihood value in the first database of a second association, the second association being formed between the confusable word or phrase and another word or phrase of the text segment; and

means for selectively providing an indication (25, 26) of the confusable word or phrase based on the calculated plausibility values.

By making use of the likelihood of associations between words, it is possible to provide a technique that represents an improvement over known systems which use only the probabilities of part-of-speech sequences, since such known systems cannot detect and correct the very common category-preserving errors.

An improvement over the use of contiguous N-grams is achieved because dependency grammar captures dependencies between words that are not adjacent but nevertheless directly affect each other's choice. In principle, N-grams could be extended to cover such dependencies, but in practice this leads to data sparseness problems. Using associations concentrates the data used for calculating the statistical likelihood values into linguistically meaningful units. Dependency fragments of three elements are almost always sufficient to obtain useful statistics, whereas even N-grams of four elements miss many cases of likely or unlikely word combinations.

An important consequence of restricting the statistics to linguistically meaningful entities is that the probability values are easier to interpret in the way that error detection requires. To see this, consider the significance of the transition probability between adjacent words in a contiguous bigram model. Within a constituent, for example between "big" and "dog" in "a big dog", the transition probability can be compared directly with the transition probabilities of similar adjective-noun sequences. But the transition probability between "dog" and "a" in "give the dog a bone" is a rather uninteresting (and low) probability: that of a constituent ending in "dog" being followed by another constituent beginning with "a". The probability of interest, namely that a constituent headed by "give" has a second object headed by "bone", is not represented and cannot be compared with that of a possible alternative such as "give the dog a clone".

That is, in a contiguous N-gram model, a low transition probability may indicate an unlikelihood that is linguistically interesting or one that is not. If a system based on contiguous N-grams treated every low probability as a trigger for error processing, it would find a large number of possible "errors", many of which are not real errors. Processing these is costly and carries the risk of classifying false errors as real ones.

This is why no known technique uses low transition probability as a trigger for error processing. Known techniques rely instead on the occurrence in the text of a word known to be confusable, and then consider the relative likelihood of the original sequence and the sequence obtained with the replacement word.

In the present technique, by contrast, unlikelihood is a much more reliable indicator of error.

Any unlikely association can trigger error processing, and only an unlikely association does so. Of course, an unlikely association is not always the result of an error; in the present technique, however, such false triggers are far fewer.

Moreover, when the presence in a text of an element of some confusable set is the only trigger for error processing (as in many known techniques), adding elements to a confusable set increases both the number of times error processing is triggered and the computational cost of evaluating each element.

When the likelihood of an association, and the plausibility of a word derived from it, is the trigger for error processing (as in the present invention), a very wide range of error types can be recognised. The notion of confusability is not limited to high-frequency confusions of spelling and pronunciation.

In known techniques that use a learning algorithm and also use the presence of a known confusable word as the trigger for error processing, there is no way of detecting that a word is a possible error other than by applying the learning algorithm to classify it. Moreover, like the known N-gram-based techniques, such learning systems do not benefit fully from concentrating the data into linguistically meaningful units.

The present technique represents an improvement over the known technique based on parse failure, because parse failure is a very crude mechanism for detecting word errors (especially those involving the substitution of a word with the same part of speech). By contrast, the present technique provides a very fine-grained quantitative measure of the likelihood of even very small sentence fragments, and includes parse failure, signalled for example by a missing attachment, as a particularly extreme case of unlikelihood. In addition, parse success, a crude criterion for an error having been corrected, can be replaced by a fine-grained quantitative measure of the improvement obtained.

Brief description of the drawings

The invention will be further described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 is a block schematic diagram of an apparatus constituting an embodiment of the invention;

Figure 2 is a diagram illustrating the dependency structure of the sentence "Love is the most important condition to marriage";

Figure 3 shows part of the first database, which associates likelihood values with associations; and

Figure 4 (comprising Figures 4a and 4b) illustrates an embodiment of the invention as an error detector and corrector.

Detailed description of the embodiments

Methods and apparatus are provided for detecting errors and unnatural expressions in a user's writing and for suggesting ways in which the language usage can be improved. The techniques may also be used as a context-sensitive thesaurus, which suggests expressions similar in meaning to a given input expression in its context. A statistical dependency model of word combinations is used as the basis for error detection and for checking replacements. This solves several problems of known schemes, which use either a contiguous N-gram model or an unanalysed set of features. Moreover, the techniques make it possible to offer a much wider range of candidates for replacement. Error detection does not rely on detecting particular words that are prone to misuse, so errors that have not been encountered before can be detected and corrected.

The method uses two types of relation between words. One type holds between two words at different positions in a single sentence. These are dependency relations, such as 'subject of', 'object of' and 'modifier', and are exemplified in Figure 2, which shows the result of analysing the sentence "Love is the most important condition for marriage". Words are represented by their base form and part of speech, i.e. as lemmas, so that "is" appears as "be_V". The subject of this verb is identified as "love_N" and its object as "condition_N". The latter is specified by "the_DET" and modified by "important_ADJ". "Most_ADV" is identified as an adverb modifying "important_ADJ". "For_PREP" is identified as a preposition modifying "condition_N", and "marriage_N" as the object of the preposition "for_PREP". A triple consisting of two lemmas and the dependency relation linking them is called an association.

The other type of relation comprises relations that define "possible substitutes", i.e. relations between the alternative choices of word for a given position in a sentence. The following are some examples of substitution relations:

Thesaurus relations, such as synonymy, antonymy, hyponymy and hypernymy;

Spelling errors that result in another word of the language, such as "loose" for "lose", a special case of which is homophony, i.e. words that sound the same but are spelled differently, such as "pane" and "pain";

Etymology, i.e. words formed in different ways from a single root (such as "interested" and "interesting", or "safe" and "safety");

Interlingual confusability, i.e. words that are alternative translations of a single word in another language (for example, the French "marquer" may be translated as either "mark" or "brand");

False friends, where a word is not a possible translation of its cognate (for example, "possible" and "actual", given respectively as the correct and incorrect translations of the French "actual"); and

Insertion or deletion errors, such as "he rang (at) the doorbell" or "we paid (for) our meals", which can also be regarded as substitution of or by an empty word.

When the use of a word w in a sentence is identified as inappropriate, i.e. wrong or otherwise unidiomatic, each member of a set of words called the confusable set of w, C(w), may be considered as a possible replacement. The confusable set of w is drawn from the words related to w, subject to the proviso that the actual membership may vary with the user's native language, the user's level of proficiency in the language being written, and other factors.

Dependency relations are a widely used way of representing sentence structure. Many variants are found, and the differences between them are largely unimportant for the present technique. A dependency relation links two words called the dependent and the head. In a typical scheme, no dependent may depend on more than a single head, but a head may have any number of dependents; other constraints, such as the prohibition of cycles, ensure that the relations in a single sentence form a tree. In this specification, a dependency between two words in a sentence (also called an association) is represented as a triple:

<first lemma, relation, second lemma>

Here a lemma is a term such as 'chase_V', which stands for all forms of the verb "to chase", i.e. chase, chased, chasing.
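The triple representation can be illustrated with the analysis of "Love is the most important condition for marriage" described above. The following Python sketch is an assumption-laden illustration only: the class name `Association`, the short relation labels (`subj`, `obj`, `det`, `mod`, `pmod`, `pobj`) and the convention that the first lemma is the head are all hypothetical choices, not taken from Figure 2.

```python
from typing import List, NamedTuple


class Association(NamedTuple):
    first: str     # first lemma, here taken to be the head
    relation: str  # dependency relation (label names are invented)
    second: str    # second lemma, here taken to be the dependent


# The associations described in the text for the example sentence.
FIGURE_2: List[Association] = [
    Association("be_V", "subj", "love_N"),
    Association("be_V", "obj", "condition_N"),
    Association("condition_N", "det", "the_DET"),
    Association("condition_N", "mod", "important_ADJ"),
    Association("important_ADJ", "mod", "most_ADV"),
    Association("condition_N", "pmod", "for_PREP"),
    Association("for_PREP", "pobj", "marriage_N"),
]


def associations_of(lemma: str, associations: List[Association]) -> List[Association]:
    """Return every association in which `lemma` takes part, as head or dependent."""
    return [a for a in associations if lemma in (a.first, a.second)]


def each_dependent_has_one_head(associations: List[Association]) -> bool:
    """Check the single-head constraint (this does not check for cycles)."""
    dependents = [a.second for a in associations]
    return len(dependents) == len(set(dependents))


for a in associations_of("condition_N", FIGURE_2):
    print(a)
```

A word such as "condition_N" takes part in several associations at once, which is why the plausibility of a word has to combine more than one likelihood value.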

An association may be associated with a measure of its strength or likelihood. The frequency of an association, i.e. the number of times it is seen in a parsed corpus, is only a crude way of assessing its strength. A more accurate measure is the degree to which the frequency of the association deviates from the frequency that would be expected from the frequencies of its components. Several such measures are disclosed in the literature and have been applied in word segmentation, parsing, translation, information retrieval and lexicography; see, for example, K. Kageura, 1999, "Bigram Statistics Revisited: a Comparative Examination of some Statistical Measures in Morphological Analysis of Japanese Kanji Sequences", Journal of Quantitative Linguistics, vol. 6, no. 2, pp. 149-166, and Evert et al., "Methods for the Qualitative Evaluation of Lexical Association Measures", Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Toulouse, 2001, pp. 188-195, which give comparative evaluations of several measures on particular tasks. In those applications, generally only associations that are significantly more likely than expected are of interest. The present technique, however, is also concerned with associations that are significantly less likely than expected. The detection of such an association in a text often indicates ungrammatical or unidiomatic language usage.

A word that occurs in one or more unlikely associations can then be replaced in turn by each member of its confusable set, and the plausibility of the result of each such replacement can be computed. If one or more members of the confusable set lead to a sufficiently increased plausibility, those members can be suggested as replacements.
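The reporting policy described here, together with the optional thresholds mentioned in the summary, might be sketched as follows. This is a hypothetical illustration: the function name `report`, the parameter names `trigger_threshold` and `min_gain`, and the numeric values in the example are assumptions, and the plausibility values are taken as already computed.

```python
from typing import Dict, List, Tuple


def report(
    first_plausibility: float,
    candidates: Dict[str, float],
    trigger_threshold: float,
    min_gain: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return confusable words worth suggesting, best first.

    `candidates` maps each member of the confusable set to the plausibility
    obtained after substituting it for the original word.
    """
    # Error processing is triggered only for implausible words.
    if first_plausibility >= trigger_threshold:
        return []
    # Keep candidates whose plausibility is sufficiently increased.
    better = [
        (word, value)
        for word, value in candidates.items()
        if value - first_plausibility > min_gain
    ]
    return sorted(better, key=lambda pair: pair[1], reverse=True)


# Invented values: "to" is implausible after "associate", "with" is not.
print(report(-143.05, {"with_PREP": 50.0, "at_PREP": -150.0}, trigger_threshold=-10.0))
```

For a context-sensitive thesaurus, the trigger test could be dropped so that candidates are tried for every word and reported when they exceed an absolute threshold instead.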

作为一个预备的步骤，依照相关语法分析大量母语口语文本以建立词语组合的可能值数据库。可使用任何适当的语法分析程序，适当的实例公开在M.考林斯的“Three Generative Lexicalised Models for StatisticalParsing”，EACL的ACL/第8会议的第35届年会论文集，马德里，1997年，和斯里特和坦普利的“Parsing English with a Link Grammar”，CMU-CS-91-196，卡内基-梅隆大学，计算机科学系，1991。该分析器甚至不必是一个如一般想象的语法分析程序，但可以使用有限状态或增加了记录相关性机制的相似技术。As a preliminary step, a large amount of spoken native language text is analyzed according to the relevant grammar to build a database of possible values for word combinations. Any suitable parsing program may be used, suitable examples are disclosed in M. Collins, "Three Generative Lexicalised Models for Statistical Parsing", Proceedings of the 35th Annual Meeting of the ACL/8th Conference of EACL, Madrid, 1997, and Slitter and Templey, "Parsing English with a Link Grammar", CMU-CS-91-196, Carnegie Mellon University, Department of Computer Science, 1991. The parser doesn't even have to be a parser as you might imagine, but can use finite state or similar techniques that add record dependency mechanisms.

The frequency of each type of association is counted, a likelihood value is computed for each association according to one or more statistical measures (such as mutual information, T value and log-likelihood), and the results are stored in a table. Figure 3 shows some entries in such a database.

In Figure 3, the first column shows the association itself. The column headed 'freq' gives the number of times the association occurs in the parsed corpus (here the British National Corpus, of about 80 million words). The remaining columns are, respectively, mutual information, T value, Yule's Q coefficient and log-likelihood. Each of these is a different measure calculated from the frequencies of the following four items:

<first term, relation, second term>

<first term, relation, *>

<*, relation, second term>

<*, relation, *>

Here '*' stands for any term. This parameterisation is disclosed in D. Lin, "Automatic Retrieval and Clustering of Similar Words", COLING-ACL 98, Montreal, Canada, August 1998. The different measures have different ranges and are sensitive in different ways to the precise values of the four parameters. In each case, however, the value is related to the likelihood of the association: a positive value indicates that the combination is more likely than chance, and a negative value that it is less likely.
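The way the four frequencies feed a likelihood measure can be illustrated with a short Python sketch. It is an illustration only: the `Counts` container and the function names are hypothetical, the mutual-information form is the pointwise formulation commonly used with such triples, and Yule's Q is taken from the 2x2 contingency table implied by the four counts. The description does not spell out either formula, so both are assumptions. The counts are those quoted in this description for <associate_V padv to_PREP>.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """The four frequencies behind one association (field names are illustrative)."""
    pair: int      # f(<first term, relation, second term>)
    left: int      # f(<first term, relation, *>)
    right: int     # f(<*, relation, second term>)
    relation: int  # f(<*, relation, *>)


def mutual_information(c: Counts) -> float:
    """Pointwise mutual information (natural log); negative means rarer than chance."""
    return math.log((c.pair * c.relation) / (c.left * c.right))


def yules_q(c: Counts) -> float:
    """Yule's Q from the 2x2 contingency table; lies between -1 and +1."""
    a = c.pair
    b = c.left - c.pair                          # first term with another second term
    k = c.right - c.pair                         # second term with another first term
    d = c.relation - c.left - c.right + c.pair   # neither term
    return (a * d - b * k) / (a * d + b * k)


example = Counts(pair=25, left=7680, right=1020531, relation=10587833)
print(mutual_information(example), yules_q(example))  # both negative for this association
```

Both measures come out negative here, in line with the statement that a negative value marks a combination less likely than chance.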

For example, the T value of <associate_V padv to_PREP> is calculated as:

$t_{\mathrm{associate\_V} \cdot \mathrm{padv} \cdot \mathrm{to\_PREP}} = \frac{F / f(\mathrm{padv}) - \left( f(\mathrm{associate\_V} \cdot \mathrm{padv}) \, f(\mathrm{padv} \cdot \mathrm{to\_PREP}) \right) / f(\mathrm{padv})^{2}}{\sqrt{F / f(\mathrm{padv})^{2}}}$

$t_{\mathrm{associate\_V} \cdot \mathrm{padv} \cdot \mathrm{to\_PREP}} = \frac{25 / 10587833 - (7680 \times 1020531) / 10587833^{2}}{\sqrt{25 / 10587833^{2}}} = -143.050$

where F = f(associate_V·padv·to_PREP) = 25 is the frequency of the complete association, f(associate_V·padv) = 7680, f(padv·to_PREP) = 1020531 and f(padv) = 10587833.
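The arithmetic of this example can be checked with a few lines of Python. The sketch uses the count form of the T value, (F - E) / sqrt(F), where E = f(associate_V·padv) f(padv·to_PREP) / f(padv) is the frequency expected by chance; with the counts quoted above this reproduces the stated figure of about -143.05. The function and parameter names are hypothetical.

```python
import math


def t_value(pair: int, left: int, right: int, relation: int) -> float:
    """T value in count form: (observed - expected) / sqrt(observed); pair must be > 0."""
    expected = left * right / relation
    return (pair - expected) / math.sqrt(pair)


t = t_value(pair=25, left=7680, right=1020531, relation=10587833)
print(round(t, 2))  # -143.05
```

The expected frequency is about 740, against an observed frequency of 25, which is why the value is so strongly negative.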

To obtain high-quality estimates of the likelihood of word combinations, the parsing of the native-speaker corpus needs to be as accurate and as wide in coverage as possible. Accurate parsing, however, itself requires high-quality estimates of word-combination likelihoods, which creates a circularity. This circularity can be resolved by an iterative, or bootstrapping, approach, which relies on the following properties of the parsing algorithm.

Each individual association in a sentence has an associated priority value. The priority value is a measure of the confidence that such an association exists between the two words in the sentence. It is a function both of sentence-specific factors, such as part-of-speech probabilities and the separation of the words, and of language-wide factors, such as the strength of association between the words;

The parser returns a set of associations that together satisfy the axioms of dependency structure (that is, associations do not cross, each word is a dependent of at most one node, and so on); this set is not, however, required to form a single connected tree;

The relative contributions of the sentence-specific and language-wide factors to the priority value can be varied by suitable parameter settings;

A threshold can be set so that only associations whose priority value exceeds it are returned.

The iterative nature of the parsing is illustrated by the analysis of a very simple phrase, "world title fight".

Grammatically, "title" must modify "fight", but it is unclear whether "world" modifies "title" or "fight". In English grammar, every noun in a sequence of nouns except the last may modify any noun to its right. In this example, knowledge of the strength of particular word combinations leads to the conclusion that "world" modifies "title". In other examples, such as "plastic baby pants", the first noun modifies not the noun immediately following it but the last noun.

A complete parse will yield the associations:

1. <title_N, mod_of, fight_N>

2. <world_N, mod_of, title_N>

In the first iteration of parsing the native-speaker corpus, no likelihood values for associations between particular words are available, so the language-wide factors make no contribution to the priority value. The priority threshold is set high, so that, for example, words whose part of speech is ambiguous, or words that are far apart, are not associated, and confidence in the correctness of the returned associations is high. In this example, only association 1 will be returned: the penultimate noun in a sequence must modify the last noun, irrespective of language-wide factors. In the absence of language-wide information, however, neither association 2 nor the incorrect <world_N, mod_of, fight_N> will have a priority value high enough to be returned in this instance. Associations from other instances in the corpus of "world title" (and "world fight") that are not followed by a further noun will nevertheless be returned.

Likelihood values are then computed from these high-confidence associations. Subsequent iterations can use these language-wide factors in determining priority values, so the priority threshold can be lowered. This increases the number of associations returned (the coverage of the parse) and allows more accurate likelihood statistics to be computed. In this example, the relative frequencies and/or likelihoods of <world, mod_of, title> and <world, mod_of, fight> will now cause the former to be preferred over the latter. Further iterations then continue to increase the contribution of the language-wide factors to the priority value and to lower the priority threshold. In this way the coverage and reliability of the likelihood data are progressively improved.
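The iteration can be pictured with a toy Python sketch built on the "world title fight" example. Everything numeric in it is invented for illustration: the sentence scores, the two-step schedule of weights and thresholds, and the use of a simple relative frequency as a stand-in for the likelihood value. The additive form of the priority value is likewise an assumption, since the text says only that it is a function of both kinds of factor.

```python
from collections import Counter

# (sentence_id, modifier, head, sentence_score); all scores are invented.
CANDIDATES = [
    (0, "title", "fight", 0.9),  # "world title fight": penultimate noun modifies the last
    (0, "world", "title", 0.5),  # ambiguous attachment of "world"
    (0, "world", "fight", 0.5),
    (1, "world", "title", 0.9),  # unambiguous "world title" elsewhere in the corpus
    (2, "world", "title", 0.9),
    (3, "world", "fight", 0.9),  # unambiguous "world fight"
]


def parse(candidates, strengths, weight, threshold):
    """Keep, for each (sentence, modifier), the best attachment at or above the threshold."""
    best = {}
    for sent_id, modifier, head, score in candidates:
        priority = score + weight * strengths.get((modifier, head), 0.0)
        key = (sent_id, modifier)
        if priority >= threshold and priority > best.get(key, (None, float("-inf")))[1]:
            best[key] = ((modifier, head), priority)
    return [assoc for assoc, _ in best.values()]


def bootstrap(candidates, schedule):
    """Re-parse with a rising language-wide weight and a falling threshold."""
    strengths, returned = {}, []
    for weight, threshold in schedule:
        returned = parse(candidates, strengths, weight, threshold)
        counts = Counter(returned)
        totals = Counter(modifier for modifier, _ in returned)
        # Stand-in likelihood: relative frequency of the head given the modifier.
        strengths = {(m, h): n / totals[m] for (m, h), n in counts.items()}
    return returned, strengths


associations, strengths = bootstrap(CANDIDATES, schedule=[(0.0, 0.8), (0.5, 0.7)])
```

In the first pass only the high-confidence attachments are returned; in the second pass the statistics gathered from them let "world" in sentence 0 attach to "title" rather than "fight".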

After each iteration of parsing the native-speaker corpus, the likelihood value of each type of association is determined and entered into the database.

Once a sufficiently accurate database has been prepared, or obtained by whatever means, it can be used in the present invention. The text to be checked undergoes a single iteration of such a parsing process. The contribution of the language-wide factors to this parse may be reduced, since these factors, namely the likelihood values of the associations, are taken into account in the next stage.

The likelihood value of each association in the text is then determined by look-up in the native-speaker database. Associations not seen in the original native-speaker corpus are assigned a likelihood value by assuming that they have a low frequency. In a typical embodiment, all associations found with a frequency of 1 in the native-speaker corpus are discarded, which greatly reduces the volume of data. An association not found in the database is then assumed to have a frequency in the range from 0 to 2, the optimum value being determined experimentally, and its likelihood value is computed accordingly.
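A minimal Python sketch of this look-up follows. It assumes the T value (in count form) as the likelihood measure and an unseen-association frequency of 0.5, a value chosen arbitrarily inside the stated 0 to 2 range; the function names, the dictionary layout and the count for "at_PREP" are all hypothetical.

```python
import math

UNSEEN_FREQ = 0.5  # assumed; the text only says the value lies between 0 and 2


def prune(raw_counts):
    """Drop associations seen only once, as in the typical embodiment."""
    return {assoc: n for assoc, n in raw_counts.items() if n > 1}


def t_value(pair, left, right, relation):
    """Count form of the T value; pair must be positive."""
    return (pair - left * right / relation) / math.sqrt(pair)


def likelihood(assoc, pair_counts, left, right, relation, unseen=UNSEEN_FREQ):
    """Look up an association, substituting the assumed frequency if it is absent."""
    w1, rel, w2 = assoc
    pair = pair_counts.get(assoc, unseen)
    return t_value(pair, left[(w1, rel)], right[(rel, w2)], relation[rel])


raw = {("associate_V", "padv", "to_PREP"): 25, ("associate_V", "padv", "at_PREP"): 1}
pair_counts = prune(raw)
left = {("associate_V", "padv"): 7680}
right = {("padv", "to_PREP"): 1020531, ("padv", "at_PREP"): 400000}  # second count is invented
relation = {"padv": 10587833}

seen = likelihood(("associate_V", "padv", "to_PREP"), pair_counts, left, right, relation)
unseen = likelihood(("associate_V", "padv", "at_PREP"), pair_counts, left, right, relation)
```

Because the marginal frequencies of the individual terms are still known, an unseen association receives a likelihood value that reflects how surprising its absence is, rather than a single fixed penalty.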

Associations with a low (for example, negative) likelihood value are indicators of a possible error. The likelihood values of the associations in which a word participates are combined into a plausibility value for that word. Implausible words are replaced by members of their confusable sets to see whether the resulting plausibility improves.

Figure 4 shows an embodiment of the invention as an error detector and corrector. Input text is supplied in step 10 and analysed in step 11, for example by parsing. In step 12, the likelihoods of the associations in the input text are determined. In step 13, the first word of the text is selected, and in step 14 the plausibility of that word is computed. Step 15 checks whether all words of the input text have been processed; if not, the next word is taken in step 16 and step 14 is repeated.

Once plausibility values have been computed for all words in the text, the words are sorted in ascending order of plausibility in step 17. The least plausible word is selected in step 18; if step 19 finds that its plausibility is not below a first threshold, the method terminates in step 20. Otherwise, the confusable set of the word is obtained in step 21 and the first confusable word is selected in step 22. In step 23 the word is replaced by the confusable word, and in step 24 the plausibility of the confusable word in context is computed. If step 25 detects an improvement in plausibility greater than a second threshold, the confusable word is reported to the user in step 26.

Step 27 checks whether all confusable words have been tried; if not, the next confusable word is selected in step 28 and control returns to step 23. Otherwise, step 29 determines whether all words in the text have been processed; if not, step 30 fetches the next word and returns control to step 19. Otherwise, the method ends in step 31.

In this embodiment, for each word w_i (1≤i≤n, where n is the length of the sentence), we determine the set D(w_i) of associations in which it participates. We then apply to each D(w_i) a function that maps the likelihood values of the set of associations to a single value, called the "plausibility" λ(w_i) of the word. The words are sorted by plausibility. If the plausibility of the least plausible word, w_λmin, is below a threshold, we attempt to find a correction. We substitute for w_λmin each word c_j(w_λmin) in turn (1≤j≤m, where m is the number of confusable words in C(w_λmin)) and compute λ(c_j(w_λmin)). Those confusable words whose substitution improves the plausibility of the word may be offered to the user. The confusable words may be displayed in descending order of the improvement they produce.
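The loop described above can be sketched in Python as follows. The sketch makes several assumptions that are not in the text: the combining function is taken to be the minimum of the likelihood values, a word with no association receives an arbitrary low value, each word is assumed to occur once in the sentence, and the likelihood table holds one value quoted in this description (-143.05) alongside one invented value (+50.0). All names are hypothetical.

```python
NULL_ATTACHMENT = -100.0  # assumed low value for a word with no association


def plausibility(word, assocs, likelihood, combine=min):
    """lambda(w): combine the likelihood values of the associations containing the word."""
    values = [likelihood(a) for a in assocs if word in (a[0], a[2])]
    return combine(values) if values else NULL_ATTACHMENT


def substitute(assoc, old, new):
    """Replace a word in an association, leaving the relation untouched."""
    first, rel, second = assoc
    return (new if first == old else first, rel, new if second == old else second)


def suggest(words, assocs, likelihood, confusables, threshold_1, threshold_2):
    """Return (word, replacement, gain) triples, largest gain first."""
    suggestions = []
    for w in sorted(words, key=lambda x: plausibility(x, assocs, likelihood)):
        base = plausibility(w, assocs, likelihood)
        if base >= threshold_1:
            break  # every remaining word is at least this plausible
        for c in confusables.get(w, []):
            trial = [substitute(a, w, c) for a in assocs]
            gain = plausibility(c, trial, likelihood) - base
            if gain > threshold_2:
                suggestions.append((w, c, gain))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


TABLE = {
    ("associate_V", "padv", "to_PREP"): -143.05,  # value quoted in this description
    ("associate_V", "padv", "with_PREP"): 50.0,   # invented for illustration
}
result = suggest(
    words=["associate_V", "to_PREP"],
    assocs=[("associate_V", "padv", "to_PREP")],
    likelihood=lambda a: TABLE.get(a, 0.0),
    confusables={"to_PREP": ["with_PREP"]},
    threshold_1=0.0,
    threshold_2=0.0,
)
```

With these toy values the only suggestion is to replace "to_PREP" by "with_PREP". Taking the minimum makes a single bad association enough to flag a word; a mean or a learned function would be an equally valid reading of the text.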

Members of a confusable set may be associated with a confusability value indicating the likelihood of the confusion. For example, from an error-annotated learner corpus we can obtain counts of how often each word is wrongly used in place of another; real-word errors of pronunciation and/or spelling may be associated with a value based on edit distance; and confusable words based on semantic relatedness may be associated with a value based on path distance in a hierarchical network.

If such information is used, suggestions can be presented in a more helpful order by combining the confusability and the improvement in plausibility into a single score, the substitution score σ(w_i→c_j(w_i)).
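The text does not say how the two quantities are combined. One simple possibility, shown here purely as an assumption, is a weighted sum of a plausibility gain and a confusability value that have both been scaled to the range 0 to 1; the weight `alpha` and the candidate figures are invented.

```python
def substitution_score(gain: float, confusability: float, alpha: float = 0.5) -> float:
    """Hypothetical sigma: weighted sum of plausibility gain and confusability."""
    return alpha * gain + (1.0 - alpha) * confusability


# candidate -> (scaled plausibility gain, confusability); figures are made up
candidates = {"with_PREP": (0.6, 0.9), "by_PREP": (0.7, 0.1)}
ranked = sorted(candidates, key=lambda c: substitution_score(*candidates[c]), reverse=True)
print(ranked)  # ['with_PREP', 'by_PREP']
```

Here "by_PREP" yields the slightly larger gain, but "with_PREP" is ranked first because it is far more likely to have been confused with the original word.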

In interaction with the user, the suggestions initially offered are those that improve w_λmin by replacing the word with a member of its confusable set. If the user accepts one of these words, the effect of the substitution is propagated to the other words associated with it, and the computation is repeated for the new w_λmin. Propagation may involve re-attaching the substituted word to a different word from the one to which the original word was attached.

Associations that are unlikely in isolation may be likely as part of a larger structure, and vice versa. For example, "by accident" is a very strong collocation, whereas "by the accident" is rather unlikely and should be considered a possible error. There are, however, larger and probably correct structures that contain the latter, such as "horrified by the accident".

Conversely, "a knowledge" in isolation is a typical learner error, whereas "a knowledge of" is an acceptable expression. Yet "learn a knowledge of" is again an error.

Such cases can be handled by computing likelihood values for dependency subgraphs of three or more elements linked by two or more associations. Experimental observation indicates that more than three elements are not needed in most cases. In the cases above, the likelihood of the four-element phrases can be traced back to the likelihood of smaller units. For example, "horrified by the accident" is likely because "horrified by" is such a strong collocation, whereas "learn a knowledge of" is unlikely because "knowledge" is an unlikely object of "learn", irrespective of the other elements.

The likelihood value of a three-element subgraph can be calculated in different ways. One way is to treat two of the elements, together with the association between them, as a phrasal unit, and then to compute the likelihood measure between this phrasal unit and the third element in exactly the same way as in the two-element case.

The likelihood values of two-element and three-element associations can likewise be combined into a plausibility value according to different schemes. We can weight the contribution of three-element phrases more heavily than that of two-element phrases (a smoothing scheme), or use only two-element phrases if the three-element phrases containing them fail to meet some constraint on their frequency and/or likelihood (a back-off scheme). The parameters of these schemes can be determined empirically or by a learning process in which the features learned are not the presence or absence of particular words in the context but the strength and frequency of combinations.
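The two schemes can be contrasted in a short Python sketch. The weight of 0.7, the minimum frequency of 5 and the likelihood values in the usage lines are invented; the text leaves such parameters to experiment or learning.

```python
def smoothed(pair_value: float, triple_value: float, triple_weight: float = 0.7) -> float:
    """Smoothing: both values contribute, the three-element value more heavily."""
    return triple_weight * triple_value + (1.0 - triple_weight) * pair_value


def backed_off(pair_value: float, triple_value: float,
               triple_freq: int, min_freq: int = 5) -> float:
    """Back-off: use the three-element value only if it is frequent enough."""
    return triple_value if triple_freq >= min_freq else pair_value


# "by the accident" (unlikely pair) inside "horrified by the accident" (likely triple)
pair_value, triple_value = -3.0, 4.0
print(smoothed(pair_value, triple_value))                    # close to 1.9
print(backed_off(pair_value, triple_value, triple_freq=12))  # 4.0
print(backed_off(pair_value, triple_value, triple_freq=2))   # -3.0
```

Smoothing always blends the two values, whereas back-off switches between them, so a sparsely attested three-element phrase has no influence at all under the second scheme.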

To increase the range of errors that can be detected and corrected, a number of extensions can be made to the basic method.

Computing the plausibility of a word may include a term representing the absence of any attachment of that word to another word. Except in the case of a finite verb (or some other part of speech in lists and headings), which can be the root of a dependency tree, an unattached word always indicates an error (or a parsing failure). It is therefore appropriate to assign a very low likelihood value to the null attachment, which will trigger error processing.

To determine a correction, the method then needs to be extended as follows.

Provided that, as described above, the parse of the text to be corrected is not strongly influenced by the language-wide factors in the priority value, words will generally be attached if their parts of speech are appropriate. Conversely, if a word is not attached, the error will generally not be corrected by substituting a word with the same part of speech.

The error may be not a substitution but an omission. For example, a noun cannot be attached as an object to an intransitive verb. In many such cases the error can be corrected by inserting a preposition. Insertion may be appropriate even where a noun is attached to a verb with which it is only weakly associated. In either case, the insertion must be accompanied by the creation of new associations, whose likelihood will determine whether the error has been corrected.

The absence of an attachment may also be caused by a substitution error that changes the category of a word. If the confusable set of a word of one category contains a word of another category, the substitution may need to be accompanied by a local reanalysis of the input. For example, if a learner writes "get out of the building safety", the sequence "building safety" may be analysed as an (unlikely) noun phrase. If the confusable set of the noun "safety" includes the adverb "safely", reanalysis will be needed to establish that the latter is a modifier of the verb "get out", whose object is "building" rather than "safety".

The method can also be used as a context-sensitive thesaurus, for example by not applying a threshold to the plausibility value of each word. In that case, all words are candidates for substitution regardless of their plausibility. Likewise, a substitution need not improve plausibility. Possible substitutes may be offered if, for example, their plausibility value exceeds a threshold.

The method may be performed by any suitable apparatus, but in practice it is most likely to be performed by a computer programmed with a program that controls it to perform the method. Figure 1 illustrates a suitable computer system 100, based on a central processing unit (CPU) 1 acting as a controller. The CPU 1 is provided with a program memory 2, for example in the form of a disk drive containing a storage medium such as a magnetic or optical disk, which in turn holds the program controlling the CPU 1. A first database 3, stored for example on a disk, contains the associations and their likelihood values. A second database 4, stored for example on the same or another disk, contains the confusable sets. A read/write or random-access memory (RAM) 5 is provided in the usual way to hold temporary values of parameters.

The CPU 1 is provided with an input interface 6, which allows the input of text to be checked for errors, unnatural expressions and the like. For example, the text may be entered manually via a keyboard or may already be in machine-readable form (for example on a magnetic or optical disk). The CPU 1 is also provided with an output interface 7, which allows the user to monitor the output of the method. In addition, to allow interaction with the method, the interfaces 6 and 7 enable the user to enter data, commands and so on and to monitor the operation of the method. For example, when a choice of confusable words that improve the plausibility value is offered, these may be displayed on a display forming part or all of the output interface 7, and the user may select one of the confusable words by suitable operation of a keyboard and/or mouse forming part or all of the input interface 6.

A database is provided containing associations between words together with likelihood values associated with them, which provide a measure of the likelihood of such associations in correct or idiomatic usage. The likelihood values are based on the frequencies of occurrence of the associations obtained by analysing a large quantity of text, for example text written by native speakers. To check a text segment for possibly erroneous or unnatural use of one or more of its words, the text is first analysed to determine the associations between its words. The likelihoods of the associations in the analysed text are determined from the database. A plausibility value is computed for each word in the analysed text by combining the likelihood values of the associations in which the word occurs. The word is used to index a further database, which contains sets of words with which the index word is easily confused. Each confusable word is selected in turn and substituted for the index word in the index word's associations. The likelihood values of these new associations are determined and the plausibility value of the confusable word is computed. In an error-detection embodiment, confusable words are tried for those words whose plausibility falls below a threshold, and confusable words that increase the plausibility are reported to the user. In a context-sensitive thesaurus embodiment, confusable words may be tried for all words, and those whose plausibility value exceeds a second threshold may be reported.

Although only an embodiment in which the invention is applied to English has been described above, the invention is not limited to English and can be applied to other languages.

An English text segment may be generated by translation from a language other than English (for example, Japanese).

A text segment may be generated by reading the text of a printed document using an optical character recognition system.

According to the present invention, there are provided a method and an apparatus for detecting errors and unnatural expressions in a user's writing and for suggesting ways in which such usage can be improved.

According to the invention, it is possible to detect errors and unnatural expressions in a user's writing and to suggest corrections for them. Real-word spelling errors can be handled, as can errors of various other types.

Claims

1. A method of amending or improving the selection of a first word or phrase in a written or spoken text segment comprising a group of words of a first language, characterised by comprising the steps of:

(a) providing a first database (3) of associations between words or phrases of the first language, wherein each association is associated with at least one likelihood value based on the frequency of occurrence of that association in a large quantity of text in the first language;

(b) analysing (14) the text segment to establish a first association between the first word or phrase and a second word or phrase of the text segment, at least one first likelihood value corresponding to the first association, and a first plausibility value corresponding to the first word or phrase and based on the at least one likelihood value;

(c) providing a second database (4) in which each of at least some words or phrases is linked to a set of words or phrases with which it may be confused;

(d) selecting (22) from the second database (4), or computing, a confusable word or phrase as a candidate substitute for the first word or phrase in the text segment;

(e) deriving (23, 24) a second plausibility value for the confusable word or phrase based on the likelihood value in the first database (3) of a second association, the second association being formed between the confusable word or phrase and another word or phrase in the text segment; and

(f) selectively providing an indication (25, 26) of the confusable word or phrase based on the computed plausibility values.
2. The method of claim 1, characterised in that the likelihood value of each association in the first database (3) is also based on the frequency of occurrence of each other association that comprises one of the words or phrases and has the same relation.
3. The method of claim 1, characterised in that the likelihood value of each association in the first database (3) is also based on the frequency of occurrence of all other associations having the same relation.
4. The method of claim 1, characterised in that the likelihood value of each association in the first database (3) comprises at least one of mutual information, T value, Yule's Q coefficient and log-likelihood.
5. The method of claim 1, characterised in that, in step (e), the other word or phrase is the second word or phrase, and the relation of the second association is the same as the relation of the first association.
6. The method of claim 1, characterised in that step (b) comprises establishing a set of first associations for a set of first words or phrases in the text segment, and steps (d), (e) and (f) are performed for each first association.
7. The method of claim 1, characterised in that step (b) comprises establishing associations between non-adjacent words or phrases in the text.
8. The method of claim 1, characterised in that step (d) comprises selecting each confusable word or phrase of the set for a word or phrase, and steps (e) and (f) are performed for each confusable word or phrase.
9. The method of claim 8, characterised in that step (f) comprises providing the indications in descending order of second plausibility value.
10. The method of claim 1, characterised in that steps (d), (e) and (f) are performed if the first plausibility value is less than a first threshold (19).
11. The method of claim 1, characterised in that step (f) comprises providing the indication when the or each second plausibility value exceeds a second threshold (25).
12. The method of claim 1, characterised in that step (f) comprises providing the indication (26) if the second plausibility value is greater than the first plausibility value (25).
13. The method of claim 1, characterised in that step (b) comprises computing (14) the first plausibility value by means of a function learned by machine-learning techniques from an error-annotated learner corpus and its associated likelihood values.
14. The method of claim 1, characterised in that the first word in the text segment is replaced (23) by the confusable word.
15. The method of claim 1, characterised in that the text segment is generated by translation from a second language.
16. The method of claim 1, characterised in that the text segment is generated from a printed document by optical character recognition.
17. A computer program for programming a computer to perform the method of claim 1.
18. A storage medium containing the program of claim 17.
19. The medium of claim 18, comprising a computer-readable medium.
20. A computer containing the program of claim 17.
21. An apparatus for amending or improving the selection of a first word or phrase in a written or spoken text segment comprising a group of words of a first language, characterised by comprising:

a first database (3) of associations between words or phrases of the first language, wherein each association is associated with at least one likelihood value based on the frequency of occurrence of that association in a large quantity of text in the first language;

a controller (1) for analysing (14) the text segment to determine a first association between the first word or phrase and a second word or phrase of the text segment, at least one first likelihood value corresponding to the first association, and a first plausibility value corresponding to the first word or phrase and based on the at least one likelihood value; and

a second database (4) in which each of at least some words or phrases is linked to a set of words or phrases with which it may be confused;

wherein:

the controller (1) selects (22) from the second database (4), or computes, a confusable word or phrase as a candidate substitute for the first word or phrase in the text segment;

the controller (1) derives (23, 24) a second plausibility value for the confusable word or phrase based on the likelihood value in the first database (3) of a second association, the second association being formed between the confusable word or phrase and another word or phrase in the text segment; and

the controller (1) selectively provides an indication (25, 26) of the confusable word or phrase based on the computed plausibility values.