WO2008145055A1

WO2008145055A1 - The method for obtaining restriction word information, optimizing output and the input method system

Info

Publication number: WO2008145055A1
Application number: PCT/CN2008/071064
Authority: WO
Inventors: Jieyong Lv
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2007-05-25
Filing date: 2008-05-23
Publication date: 2008-12-04
Anticipated expiration: 2009-11-25
Also published as: CN101055588A; CN100483417C

Abstract

The method for obtaining restriction word information includes the following steps: obtaining characteristic information for a target word, and judging whether the characteristic information meets a preset condition. If it does, the target word is determined to be a restriction word and the related restriction information is recorded. The restriction information is used to restrict the ranking of the word when it is output on its own. By presetting a lexicon that includes restriction word information, the embodiments of the invention judge whether an output candidate meets the preset condition for applying the restriction information and, based on the result, control whether a candidate carrying restriction word information is displayed and output. The user can therefore obtain more effective output without additional operations, the character output process of the input method system is greatly optimized, and the intelligence of the input method system is improved.

Description

Method for obtaining restriction word information, method for optimizing output, and input method system

This application claims priority to Chinese Patent Application No. 200710099644.0, filed with the Chinese Patent Office on May 25, 2007 and entitled "Method for obtaining restriction word information, method for optimizing output, and input method system", the entire contents of which are incorporated herein by reference.

Technical Field

The present invention relates to the field of computer character input data processing, and in particular to a method and apparatus for obtaining restriction word information, a method for updating an input method lexicon, a method for optimizing output, and an input method system.

Background

With the spread and development of computer and Internet technology, users with different professional fields, interests, and usage habits place ever higher demands on the intelligence of input method systems.

In the prior art, techniques have emerged that build an input method lexicon by statistically analyzing and filtering a vast and varied Internet corpus. The resulting Internet lexicon can contain many new words that cannot be obtained from earlier closed corpora (such as the Modern Chinese Dictionary, news, and newspapers), which can greatly improve input efficiency. However, precisely because of the complexity of the Internet corpus, some of the words selected from it by word frequency statistics are defective from the standpoint of linguistics or input habits.

For example, for the pinyin string "liangjiang" entered by a user, the generally available candidates include "两江" and "良将". With an Internet lexicon, the candidates may also include "量将", because "量将" appears quite frequently in web pages. However, it generally appears at the junction of several words in a sentence (serving to link them), for example in "旅客量将超过" ("passenger volume will exceed"). Including "量将" in the input method lexicon can make the input method more intelligent (achieving better intelligent word composition) and can improve input efficiency in some cases. But because "量将" rarely occurs as a word on its own, it may also inconvenience the user by increasing the number of candidates to choose from, which reduces input efficiency.

Therefore, a technical problem that those skilled in the art urgently need to solve is how to find, in a lexicon, words with such linguistic or usage-habit defects and restrict them during input, so as to further improve input efficiency.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a method and apparatus for obtaining restriction word information that can pick out, from a large vocabulary, words with linguistic or usage-habit defects, thereby improving the user's input experience.

Another object of the present invention is to provide a method for updating an input method lexicon, a method for optimizing output, and an input method system that can restrict certain words under certain circumstances during actual input, so that the input method becomes more intelligent without additional user operations.

To solve the above technical problem, the present invention discloses a method for obtaining restriction word information, which may include: obtaining a target word; obtaining feature information corresponding to the target word; and judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, determining that the target word is a restriction word and recording the related restriction information, the restriction information being used to restrict the ranking of the word when it is output on its own.

Preferably, the feature information is: the feature value of the first character of the target word as a word-initial character in a preset corpus, and the feature value of the last character of the target word as a word-final character in the preset corpus; and the preset condition used for the judgment is: whether at least one of these feature values falls within a preset range.

Preferably, the feature information is: the feature values, in a preset corpus, of the linguistic collocation relationships among the single-character words and/or multi-character words contained in the target word; and the preset condition used for the judgment is: whether at least one of these feature values falls within a preset range.

Preferably, the feature information is: the feature value of the target word being entered on its own by users of an input method application; and the preset condition used for the judgment is: whether this feature value falls within a preset range.

Preferably, the feature information includes: the feature value of the first character of the target word as a word-initial character in a preset corpus; the feature value of the last character of the target word as a word-final character in the preset corpus; and the general word frequency of the target word; and the preset condition used for the judgment is: whether the ratio of at least one of these feature values to the general word frequency of the target word falls within a preset range.

Preferably, the feature information includes: the feature values, in a preset corpus, of the linguistic collocation relationships among the single-character words and/or multi-character words contained in the target word; and the general word frequency of the target word; and the preset condition used for the judgment is: whether the ratio of at least one of these feature values to the general word frequency of the target word falls within a preset range.

Preferably, the feature information is: the feature value of the target word being entered on its own by users of an input method application; and the general word frequency of the target word; and the preset condition used for the judgment is: whether the ratio of this feature value to the general word frequency of the target word falls within a preset range.

Preferably, the feature information is: the user ranking position of the target word among the candidate words for the same input code; and the original ranking position of the target word among the candidate words for the same input code; where the user ranking position is related to the feature value of the target word being entered on its own by users of the input method application, and the original ranking position is related to the general word frequency of the target word; and the preset condition used for the judgment is: whether the difference between the user ranking position and the original ranking position falls within a preset range.

Preferably, before the step of obtaining the feature information, the method further includes a step of optimizing and filtering the target words. Preferably, the restriction information includes: weights that restrict the standalone output of the restriction word in each preset scenario.

Preferably, the restriction information further includes: a linguistic collocation parameter of the restriction word in a preset corpus; the linguistic collocation parameter is used to restrict the ranking of the word when it is output through intelligent word composition.

Preferably, the method may further include: generating a lexicon or word list that contains the restriction words and their related restriction information; or generating a lexicon that contains the restriction words and their related restriction information together with general words.

According to another embodiment of the present invention, a method for obtaining restriction word information is also disclosed, including: obtaining a target word; obtaining a linguistic collocation parameter of the target word in a preset corpus; and judging whether the linguistic collocation parameter meets a preset condition and, if so, recording restriction information for the target word, the restriction information including the corresponding linguistic collocation parameter and being used to restrict the ranking of the word when it is output through intelligent word composition.

Preferably, the linguistic collocation parameter is a single general parameter; or the linguistic collocation parameter includes sub-parameters for each preset scenario.

According to another embodiment of the present invention, a method for updating a lexicon is also disclosed, including: obtaining a target word; obtaining feature information corresponding to the target word; judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, determining that the target word is a restriction word and recording the related restriction information, the restriction information being used to restrict the ranking of the word when it is output on its own and/or when it is output through intelligent word composition; and adding the restriction word and its related restriction information to the existing lexicon of the input method.

Preferably, the adding is: judging whether the restriction word already exists in the original lexicon and, if so, recording only its related restriction information in the existing lexicon of the input method; or the adding is: recording the restriction word and its related restriction information directly in the existing lexicon of the input method, overwriting the original entry if the entry is a duplicate; or the adding is: storing the restriction words and their related restriction information as a restriction word list, with the restriction word list and the existing lexicon of the input method cooperating to rank the candidates.
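The three ways of adding restriction words can be sketched as follows. This is a minimal illustration only: the dictionary layout (a `freq` field and a `restriction` field per entry) and all function names are assumptions made for the example and are not specified in the text.

```python
# Hypothetical entry layout: word -> {"freq": int, "restriction": float}

def annotate_existing(lexicon, restricted):
    """Mode 1: attach restriction info only to words already in the lexicon."""
    for word, entry in restricted.items():
        if word in lexicon:
            lexicon[word]["restriction"] = entry["restriction"]

def overwrite_entries(lexicon, restricted):
    """Mode 2: write every restriction word; a duplicate replaces the original entry."""
    for word, entry in restricted.items():
        lexicon[word] = dict(entry)

def lookup_restriction(word, restriction_table):
    """Mode 3: the lexicon is left untouched and a separate table is consulted
    when candidates are ranked. Returns None for unrestricted words."""
    entry = restriction_table.get(word)
    return entry["restriction"] if entry else None
```

Mode 3 keeps the restriction data independent of the lexicon, which would make it easier to update or remove separately, at the cost of a second lookup during ranking.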

Preferably, the restriction word has restriction information for each preset scenario.

According to another embodiment of the present invention, an apparatus for obtaining restriction word information is also disclosed, including:

a target word obtaining unit, configured to obtain a target word;

a feature information obtaining unit, configured to obtain feature information corresponding to the target word;

a restriction information obtaining unit, configured to judge whether the feature information, or a result calculated from it, meets a preset condition and, if so, to determine that the target word is a restriction word and record the related restriction information, the restriction information being used to restrict the ranking of the word when it is output on its own and/or when it is output through intelligent word composition.

According to another embodiment of the present invention, a method for optimizing output is also disclosed, including: receiving user input and converting it; obtaining output candidates; judging whether an output candidate meets the preset condition for applying restriction information; and, if so, extracting the restriction information corresponding to that output candidate and ranking the candidates according to the restriction information.

Preferably, the preset condition for applying restriction information is: whether the output candidate is a word output on its own; or the preset condition for applying restriction information is: whether the output candidate is a case of intelligent word composition.

Preferably, the restriction information is obtained through the following steps: obtaining a target word; obtaining feature information corresponding to the target word; and judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, recording the related restriction information for the target word.

Preferably, when it is necessary to judge whether the output candidate is a word output on its own, this is done through the following steps: judging whether the output candidate contains only one element and is longer than one output character, an element being a word stored in the preset lexicon; and, if so, determining that the output candidate is a word output on its own.
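The standalone-word test described above can be sketched in a few lines. The representation of a candidate as a list of lexicon elements (`segments`) is an assumption made for illustration; the text does not say how candidates are stored.

```python
def is_standalone_word(segments):
    """Return True if the candidate is a single lexicon element that is
    longer than one character.

    segments: the lexicon entries (strings) that make up one output
    candidate, in order. This list form is hypothetical.
    """
    return len(segments) == 1 and len(segments[0]) > 1
```

For instance, `is_standalone_word(["量将"])` holds, whereas a composed candidate such as `["旅客", "量将", "超过"]` or a single character such as `["量"]` does not qualify.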

According to another embodiment of the present invention, an input method system is also disclosed, including an input interface unit and a display unit, the input method system further including:

a lexicon that includes restriction information for its entries, the restriction information being used to restrict the ranking of a word when it is output on its own and/or when it is output through intelligent word composition;

a candidate obtaining unit, configured to obtain output candidates according to the user's input; a judging unit, configured to judge whether an output candidate meets the preset condition for applying restriction information; and a candidate ranking unit, configured to extract, when the preset condition is met, the restriction information corresponding to that output candidate and to rank the candidates according to the restriction information.

Preferably, the judging unit further includes: a subunit for judging whether an output candidate contains only one element, an element being a word stored in the preset lexicon; a subunit for judging whether the length of the output candidate is greater than one output character; and a subunit for determining that the output candidate is a word output on its own when both of the above conditions are met.

Preferably, the input interface unit, the display unit, and the lexicon of the input method system are located in the same computing device; or the input interface unit and the display unit are located in a first computing device and the lexicon is located in a second computing device, in which case the input method system obtains the relevant information from the second computing device according to the user's input and displays the corresponding words on the first computing device.

Compared with the prior art, the embodiments of the present invention have the following advantages:

The embodiments of the present invention preset an input method lexicon that includes restriction word information. When the user enters text, it is judged whether an output candidate meets the preset condition for applying restriction information, and the result controls whether a candidate carrying restriction word information is displayed and output. More effective output can thus be obtained without additional user operations (for example, in practice the restriction word "量将" is not shown among the candidates when it would be output on its own, but still takes part in word composition in other cases). This greatly optimizes the character output process of the input method system and improves its intelligence.

Brief Description of the Drawings

FIG. 1 is a flowchart of the steps of Embodiment 1 of a method for obtaining restriction word information according to the present invention;

FIG. 2 is a flowchart of the steps of Embodiment 2 of a method for obtaining restriction word information according to the present invention;

FIG. 3 is a flowchart of the steps of an embodiment of a method for updating an input method lexicon according to the present invention;

FIG. 4 is a structural block diagram of an embodiment of an apparatus for obtaining restriction word information according to the present invention;

FIG. 5 is a flowchart of the steps of an embodiment of a method for optimizing output according to the present invention;

FIG. 6 is a schematic diagram of the word lattice of a pinyin network segmentation method;

FIG. 7 is a structural block diagram of an embodiment of an input method system.

Detailed Description

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Referring to FIG. 1, Embodiment 1 of a method for obtaining restriction word information is shown, which may include:

Step 101: obtain a target word.

The target words may be obtained from the Internet, that is, by statistical analysis and filtering directly from an Internet corpus (for example, a collection of web pages or a collection of search keywords), or from an existing lexicon. The present invention imposes no restriction here, as long as a set of target words can be obtained; the size of the set can be chosen by those skilled in the art according to actual needs.

Preferably, an optimization step may also be applied to the resulting set of target words, using certain attributes of the target words to remove some of them and so narrow the range further. For example, words whose Internet word frequency or lexicon word frequency is less than or equal to a preset threshold can be removed from the set, as can words that can be determined not to be restriction words (for example, common words found in a dictionary). This optimization step can of course also be carried out while the target word set is being obtained.
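The optional optimization step can be sketched as a simple filter. The function and parameter names below are hypothetical, and both removal rules are taken directly from the two examples given above (low frequency, and words already known to be ordinary).

```python
def prefilter_targets(word_freqs, min_freq, known_ordinary_words):
    """Narrow the target-word set before feature extraction.

    word_freqs: dict mapping each candidate target word to its Internet
        or lexicon word frequency (assumed representation).
    min_freq: words with frequency <= min_freq are dropped.
    known_ordinary_words: words that can be ruled out as restriction
        words, e.g. common dictionary words.
    """
    return {
        word: freq
        for word, freq in word_freqs.items()
        if freq > min_freq and word not in known_ordinary_words
    }
```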

Step 102: obtain feature information corresponding to the target word.

Step 103: judge whether the feature information, or a result calculated from it, meets a preset condition; if so, determine that the target word is a restriction word and record the related restriction information, which is used to restrict the candidate ranking of the word when it is output on its own.

For example, restriction words such as "量将" and "上一" do not appear among the candidates when they would be output on their own, but are not restricted when output through intelligent word composition with other words. As a concrete example: when "liangjiang" is entered, the first candidate in the preliminary output based on word frequency is "量将", but because it carries a restriction mark it is removed from the candidates and not displayed. When "lvkeliangjiangchaoguo" is entered, the candidate "旅客量将超过" is output, and in this case the word "量将" does not need to be restricted.
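The "liangjiang" example can be sketched as a small filter over a preliminary candidate list. The `RESTRICTED` set, the `(text, segments)` candidate representation, and the function name are all assumptions made for this illustration; only the behavior (hide a restriction word when it stands alone, keep it inside a composed phrase) comes from the text.

```python
# Hypothetical set of words that carry a restriction mark.
RESTRICTED = {"量将", "上一"}

def visible_candidates(candidates):
    """Drop restriction words that would be output on their own.

    candidates: list of (text, segments) pairs in preliminary order,
    where segments lists the lexicon elements that make up the text.
    """
    shown = []
    for text, segments in candidates:
        standalone = len(segments) == 1 and len(text) > 1
        if standalone and text in RESTRICTED:
            continue  # restricted as a standalone word: not displayed
        shown.append(text)
    return shown
```

Under these assumptions, the candidates for "liangjiang" lose "量将" while "两江" and "良将" keep their order, and the composed candidate "旅客量将超过" passes through unchanged.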

The restriction words and restriction information obtained in this embodiment may be stored directly in a separate lexicon (or word list), for example by generating a lexicon (or word list) dedicated to storing the restriction words and their related restriction information. They may also be combined with general words into an input method lexicon, for example by generating a lexicon that contains the restriction words, their related restriction information, and general words. They may also be added directly to the existing lexicon of the input method.

The restriction information may take the form of a flag (for example, marking the restriction word in the lexicon with 0 or 1) or of a specific numeric value (for example, a two-digit decimal between 0 and 1) that is used to adjust the ranking of the candidates; not displaying a candidate at all is simply the extreme case. The resulting restriction words and restriction information may be changed manually by the user according to actual needs, or updated automatically by a server.
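One way a numeric restriction value could adjust the ranking is sketched below. The scoring rule `freq * (1 - weight)` is purely an assumption chosen for illustration; the text only says that the value adjusts the ranking and that hiding the candidate is the extreme case. The function name and dictionary inputs are likewise hypothetical.

```python
def rank_standalone(candidates, restriction):
    """Rank standalone candidates, demoting or hiding restricted words.

    candidates: dict word -> base frequency.
    restriction: dict word -> weight in [0, 1]; 0 (or absent) means
        unrestricted, 1 means never shown on its own.
    """
    scored = []
    for word, freq in candidates.items():
        weight = restriction.get(word, 0.0)
        if weight >= 1.0:
            continue  # extreme case: the candidate is not displayed
        # Assumed scoring rule: scale the frequency down by the weight.
        scored.append((freq * (1.0 - weight), word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored]
```

A 0/1 flag is then just the special case where every weight is either 0.0 or 1.0.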

In this embodiment, the judgment condition varies with the feature information obtained. Several examples of steps 102 and 103 are given below. The preset corpus may be any corpus. The feature values may be obtained by statistics or directly from experience or existing knowledge, and may be any kind of numeric value, such as a probability or a frequency. It should be noted that the feature information and judgment conditions described below are only examples; those skilled in the art may define more complex feature information and judgment conditions as needed, and the present invention imposes no restriction here.

Example 1

The feature information is: the feature value of the first character of the target word as a word-initial character in a preset corpus, and the feature value of the last character of the target word as a word-final character in the preset corpus.

The preset condition used for the judgment is: whether at least one of these feature values falls within a preset range. That is, if either the word-initial feature value or the word-final feature value falls within the preset range, the target word can be determined to be a restriction word.

For example, the character "量" in "量将" rarely appears at the beginning of a word. If the word-initial frequency of "量" is less than or equal to a preset threshold, "量将" can be judged to be a restriction word. For a target word of three or more characters, it is of course also possible to examine the feature value of a character at some position in the word, measured by how often it occupies the same position within words in the preset corpus.
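Example 1 can be sketched with raw counts as the feature values. The tiny corpus, the use of counts rather than probabilities, and the function names are assumptions for illustration; the "at least one value within the preset range" rule is modeled here as "count less than or equal to a threshold", as in the "量" example.

```python
from collections import Counter

def position_counts(corpus_words):
    """Count how often each character starts or ends a multi-character word."""
    head, tail = Counter(), Counter()
    for word in corpus_words:
        if len(word) >= 2:
            head[word[0]] += 1
            tail[word[-1]] += 1
    return head, tail

def is_restricted_by_position(word, head, tail, threshold):
    """True if the first character is rare as a word head, or the last
    character is rare as a word tail (count <= threshold)."""
    return head[word[0]] <= threshold or tail[word[-1]] <= threshold
```

With a toy corpus such as `["数量", "质量", "将军", "将来", "即将", "大军", "未来"]`, the character "量" never starts a word, so "量将" is flagged at threshold 0 while "将军" is not.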

Example 2

The feature information is: the feature values, in a preset corpus, of the linguistic collocation relationships among the single-character words and/or multi-character words contained in the target word.

The preset condition used for the judgment is: whether at least one of these feature values falls within a preset range.

The linguistic collocation relationships may include several kinds of matching relationship, such as word-to-word collocation parameters, word-to-part-of-speech collocation parameters, and part-of-speech-to-part-of-speech collocation parameters. Those skilled in the art may select or combine these relationships according to actual needs.

For example, in the word "是玩", "是" is immediately followed by a verb, a collocation that is rare linguistically. Its collocation feature value (that is, the feature value of the "是 + verb" collocation) is therefore found to be less than or equal to the preset threshold, and "是玩" can be judged to be a restriction word.
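A word-to-part-of-speech collocation check along the lines of the "是 + verb" example might look like the sketch below. The lookup tables (`pos_of`, `colloc_value`), their contents, and the function name are hypothetical; a real system would derive the collocation values from a tagged corpus.

```python
def is_restricted_by_collocation(parts, pos_of, colloc_value, threshold):
    """Check each adjacent (word, next part of speech) pair in the target word.

    parts: the sub-words of the target word, in order.
    pos_of: dict sub-word -> part-of-speech tag (assumed to be given).
    colloc_value: dict (left word, right POS) -> collocation feature value.
    Returns True if any pair has a value <= threshold; pairs missing
    from the table are treated as having value 0.
    """
    for left, right in zip(parts, parts[1:]):
        key = (left, pos_of[right])
        if colloc_value.get(key, 0.0) <= threshold:
            return True
    return False
```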

Example 3

The feature information is: the feature value of the target word being entered on its own by users of an input method application. The preset condition used for the judgment is: whether this feature value falls within a preset range.

The standalone-input value may be a statistic for a single user or for a group of users, and may be obtained by monitoring users' input behavior.

For example, users rarely enter the word "是玩" on its own, so when the measured feature value (for example, the standalone input frequency) is less than or equal to a preset threshold, "是玩" can be judged to be a restriction word.

In the following examples, a general word frequency is introduced into the judgment condition to further improve the accuracy with which restriction words are identified. The general word frequency may be an Internet word frequency or a lexicon word frequency. Points that the following examples share with the preceding ones are not repeated; see the descriptions above.

Example 4

The feature information includes: the feature value of the first character of the target word as a word-initial character in a preset corpus; the feature value of the last character of the target word as a word-final character in the preset corpus; and the general word frequency of the target word.

The preset condition used for the judgment is: whether the ratio of at least one of these feature values to the general word frequency of the target word falls within a preset range.

例 5 Example 5

所述特征信息包括：该目标词中所包含的各个单字词和 /或多字词的语言学搭配关系在预设语料库内的特征值；以及该目标词的通用词频； The feature information includes: a feature value of a linguistic collocation relationship of each single word and/or a multi-word word included in the target word in a preset corpus; and a general word frequency of the target word;

Example 6

The feature information is: the user ranking position of the target word among the candidate words for the same input code; and the original ranking position of the target word among the candidate words for the same input code. The user ranking is related to the feature value of the target word being entered on its own by users in the input method application, and the original ranking is related to the general word frequency of the target word. In the simple case, the user ranking can be regarded as related to the user lexicon, and the original ranking as related to the system lexicon.

The preset condition used for the judgment is: whether the difference between the user ranking position and the original ranking position falls within a preset range.
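The rank comparison of Example 6 can be illustrated with a minimal sketch. The function names, the score dictionaries, and the `min_drop` parameter are hypothetical, and the sign convention (a word is flagged when its user position is at least `min_drop` places lower than its original position) is an assumption; the text above only requires that the difference fall within a preset range.

```python
def rank_of(word, scores):
    """Return the 1-based position of word when candidates are sorted by descending score."""
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return ordered.index(word) + 1


def is_restricted_by_rank(word, user_scores, system_scores, min_drop=2):
    """Flag word if it ranks at least min_drop places lower for users than in the system lexicon.

    user_scores and system_scores map each candidate for the same input code
    to an illustrative frequency; both are assumed to contain word.
    """
    drop = rank_of(word, user_scores) - rank_of(word, system_scores)
    return drop >= min_drop
```

With made-up scores in which a word is first in the system lexicon but third in the user ranking, the drop is 2 and the word is flagged.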

Example 7

The feature information is: the feature value of the target word being entered on its own by users in the input method application; and the general word frequency of the target word.

The preset condition used for the judgment is: whether the ratio of this feature value to the general word frequency of the target word falls within a preset range.

One specific implementation of Example 7 is as follows:

Count the general word frequency f_web of each word;

Count, from the input records of the user population, the frequency f_user with which each word is entered on its own;

Compute alpha = f_user/f_web, and determine words whose alpha is far below the normal level to be restricted words;

Or, compute alpha = f_user/f_web, and determine words whose alpha is far below the normal level and whose f_user is below a certain threshold to be restricted words;

Or, compute alpha = f_user/f_web, and determine words whose alpha is far below the normal level and whose f_web is above a certain threshold to be restricted words.

Here, alpha is the computed result, f_web is the general word frequency of a word, and f_user is the feature word frequency of that word.

Specifically, the alpha value can be computed for every target word, and the words sorted by alpha in ascending order. A word whose alpha ranks at the top of this list (for example, in the first 5%) and whose own word frequency is high (for example, greater than 10000) is regarded as a restricted word.
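The procedure above can be sketched as follows. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, "own word frequency" is read as f_web, and the defaults 0.05 and 10000 are simply the example values mentioned in the text.

```python
def find_restricted_words(f_web, f_user, top_fraction=0.05, min_web_freq=10000):
    """Return words whose alpha = f_user / f_web is among the smallest and whose f_web is high.

    f_web maps each word to its general word frequency; f_user maps each word
    to the frequency with which users enter it on its own (0 if absent).
    """
    # alpha is undefined for words with zero general frequency, so skip them
    alphas = {w: f_user.get(w, 0) / f for w, f in f_web.items() if f > 0}
    # ascending order: the smallest alpha values come first
    ranked = sorted(alphas, key=lambda w: (alphas[w], w))
    # keep at least one word so that tiny vocabularies still yield a head of the list
    cutoff = max(1, int(len(ranked) * top_fraction))
    return [w for w in ranked[:cutoff] if f_web[w] > min_web_freq]
```

Requiring a high f_web keeps rare words out of the result: a word with almost no corpus occurrences would otherwise get an unstable alpha from very little data.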

It should be noted that the judgment conditions in the above examples may also be combined. In short, those skilled in the art can set a wide variety of determination methods as needed, which cannot be listed one by one here.

In a preferred embodiment of the present invention, the restriction information may include: a weight restricting standalone output of the restricted word in each preset scenario. That is, the restricted word may carry different restriction information for different application scenarios rather than a single general value. For example, the user's application scenario is determined from the program in which the input method is currently being used; when the user is typing in Word, the restriction information value for that preset scenario (for example, a working-language environment) is used.

Further, the restriction information may also include: linguistic collocation parameters of the restricted word in the preset corpus, which are used to restrict the ranking of the word in intelligent phrase composition output. That is, some restricted words need to be restricted both when output on their own and when output as part of an intelligently composed phrase. For example, the word "上一" needs to be restricted when output on its own and should appear among the candidates as rarely as possible; when "上一" and "里" are output as an intelligently composed phrase, the output is likewise restricted according to the collocation relationship, so that the combination of "上一" and "里" appears among the candidates as rarely as possible.

The restriction information may include all linguistic collocation parameters of the word in the preset corpus (for example, part-of-speech collocation parameters), or only the collocation parameters that are needed. For example, an output-restriction threshold is set, and a linguistic collocation parameter is saved if it is less than or equal to the threshold.

It should be noted that the preset corpus information may be Internet corpus information and/or user input record corpus information. The Internet corpus information can be obtained by crawling a large number of web pages from the Internet with a web spider. The user input record corpus may include direct information and indirect information: for example, records of the characters entered by users can serve as direct information, and distribution statistics of the characters entered by users can serve as indirect information. Of course, the preset corpus information may also be set by those skilled in the art according to need or experience, and the present invention does not limit this.

Referring to FIG. 2, Embodiment 2 of a method for obtaining restricted word information is shown, which may include:

Step 201: obtain a target word;

Step 202: obtain the linguistic collocation parameters of the target word in a preset corpus;

Step 203: judge whether the linguistic collocation parameters meet a preset condition; if so, record restriction information for the target word. The restriction information includes the corresponding linguistic collocation parameters and is used to restrict the ranking of the word in intelligent phrase composition output.

For example, the collocation parameter of "上一" with locative words is very low. This collocation parameter is recorded in the restriction information of "上一"; then, if a candidate produced during intelligent phrase composition is a combination of "上一" and a locative word, that candidate is removed.

As another example, the collocation parameter of "讲" with verbs is below a predetermined threshold. This collocation parameter is recorded in the restriction information of "讲"; then, if a candidate is a combination of "讲" and a verb, "讲" is removed from the sequence for intelligent phrase composition.
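The two examples above can be sketched as a filter over composed candidates. Everything named here is hypothetical: the function, the shape of the restriction table (word to part-of-speech to collocation value), the part-of-speech labels, and the numeric values are assumptions made only for illustration.

```python
def filter_compositions(candidates, restrictions, pos_of, threshold=0.01):
    """Drop composed candidates whose left word has a recorded low collocation value.

    candidates is a list of (left_word, right_word) pairs; restrictions maps a
    restricted word to {part_of_speech: collocation_value}; pos_of maps a word
    to its part of speech.
    """
    kept = []
    for left, right in candidates:
        params = restrictions.get(left, {})
        value = params.get(pos_of.get(right))
        # a recorded value at or below the threshold means the pairing is restricted
        if value is not None and value <= threshold:
            continue
        kept.append((left, right))
    return kept
```

Pairs with no recorded collocation parameter pass through unchanged, so only combinations that were explicitly recorded as restricted are removed.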

Preferably, the linguistic collocation parameter may be a single general parameter, or it may include sub-parameters for each preset scenario. The linguistic collocation parameters may include word-to-word collocation parameters, word-to-part-of-speech collocation parameters, part-of-speech-to-part-of-speech collocation parameters, and so on. The value used to express a linguistic collocation parameter may be an adjacent co-occurrence frequency, a co-occurrence probability, a connection strength value, or the like. These values can be obtained statistically from any preset corpus, or taken directly from existing experience or knowledge.

It should be noted that the above screening step removes qualifying restricted words from the sequence for intelligent phrase composition, which shrinks the search space and improves the efficiency of intelligent phrase composition.

Referring to FIG. 3, an embodiment of a method for updating an input method lexicon is shown, which may include:

Step 301: obtain a target word;

Step 302: obtain the feature information corresponding to the target word;

Step 303: judge whether the feature information or a result computed from it meets a preset condition; if so, determine that the target word is a restricted word and record the related restriction information. The restriction information is used to restrict the ranking of the word when it is output on its own, and/or to restrict its ranking in intelligent phrase composition output;

Step 304: add the restricted word and its related restriction information to the existing lexicon of the input method.

This embodiment is applicable where the server side obtains restricted word information and then promptly updates it into the existing lexicon of the input method. The updated restriction information may include the restriction information obtained in the foregoing embodiments of FIG. 2 and FIG. 3; that is, it may include information for restricting the ranking of the word when output on its own, information for restricting its ranking in intelligent phrase composition output, or both. For example, the restriction information may include a weight restricting standalone output of the restricted word in each preset scenario.

Of course, the restriction information may instead be added to the server-side lexicon, and the new lexicon then released as a whole update. The specific update transmission method is not detailed here.

The addition described in step 304 can be performed in various ways, for example:

The addition is: judge whether the restricted word already exists in the original lexicon; if it does, record only its related restriction information in the existing lexicon of the input method;

Or, the addition is: record the restricted word and its related restriction information directly in the existing lexicon of the input method, overwriting the original entry if the entry is a duplicate;

Or, the addition is: store the restricted words and their related restriction information as an independent restricted word list, which works together with the existing lexicon of the input method to rank candidates.

Referring to FIG. 4, an embodiment of an apparatus for obtaining restricted word information is shown, which may include:

a target word obtaining unit 401, configured to obtain a target word;

a feature information obtaining unit 402, configured to obtain the feature information corresponding to the target word;

a restriction information obtaining unit 403, configured to judge whether the feature information or a result computed from it meets a preset condition and, if so, to determine that the target word is a restricted word and record the related restriction information, the restriction information being used to restrict the ranking of the word when it is output on its own, and/or to restrict its ranking in intelligent phrase composition output.

Referring to FIG. 5, an embodiment of a method for optimizing output is shown, which may include:

Step 501: receive user input information and convert the input information;

The input information may include an encoded character string, and may also include handwriting input and voice input, because these input modes also need a lexicon to rank candidates. That is, the present invention can be applied to input method platforms of all input modes, including keyboard symbols, handwriting, voice input, and so on. Because the information conversion in these input modes is well known, it is not described here.

For example, when the user types, the input method system segments the encoded character string entered by the user. Taking the segmentation of a pinyin string as a brief example: segmenting a pinyin string usually yields several segmentation schemes. For example, the pinyin string "fangan" can be segmented as "fang'an" or as "fan'gan", among others. Of course, the segmentation may use any prior-art method, and the present invention does not limit this.
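The ambiguity in the "fangan" example can be reproduced with a minimal sketch that enumerates every way of splitting a string into known syllables. The function name and the small syllable table are assumptions for illustration; a real input method would use a full pinyin syllable inventory and, as noted above, may use any segmentation method.

```python
def segment(pinyin, syllables):
    """Return every split of pinyin into a sequence of syllables from the given set."""
    if not pinyin:
        return [[]]  # one way to segment the empty string: no syllables
    results = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in syllables:
            # extend each segmentation of the remainder with this leading syllable
            for rest in segment(pinyin[end:], syllables):
                results.append([head] + rest)
    return results


# Illustrative subset of pinyin syllables, not a complete table
SYLLABLES = {"fa", "fan", "fang", "an", "gan"}
```

With this table, `segment("fangan", SYLLABLES)` yields exactly the two schemes named in the text; the prefix "fa" is tried but leads nowhere because "ngan" cannot be segmented.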

Step 502: obtain output candidates;

Taking a pinyin lattice segmentation method as an example, obtaining output candidates from the segmented string amounts to automatically converting the continuous input pinyin stream into the corresponding text stream. Specifically, a given continuous pinyin stream A is split by some pinyin segmentation algorithm into a pinyin sequence A1 A2 ... Am, where the set of homophone words corresponding to each pinyin Ai can be represented by a column of nodes Wi1 Wi2 ... Win. The candidate homophone words for the pinyin sequence A1 A2 ... Am can then be represented by m columns of nodes. Clearly, the candidate homophone words of a pinyin sequence form a candidate homophone matrix. Connecting adjacent nodes with directed edges forms a word lattice. The word lattice is the state space of the Chinese character input problem, so pinyin-to-character conversion becomes the problem of searching for an optimal path in the word lattice.

For example, the input pinyin stream "zheshiyizhipiaoliangdemao" is segmented into the pinyin sequence "zhe'shi'yi'zhi'piaoliang'de'mao"; the word lattice corresponding to this pinyin sequence is shown in FIG. 6.

Then the system's language rule base is queried and rule matching is performed: all nodes in adjacent columns that match a language rule are recursively bundled into language element nodes, forming an element lattice. The element lattice is the new state space for pinyin-to-character conversion. Using the Viterbi dynamic programming algorithm, the probability values of the system's bigram statistics library and bigram learning library are combined by weighting to compute the probabilities of the candidate words in the element lattice, and the candidate with the highest probability is output as the conversion result.
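The search described above can be illustrated with a simplified Viterbi pass over a word lattice. This sketch makes several assumptions that are not in the source: the rule-based bundling into language elements is omitted, the two bigram libraries are combined by linear interpolation with a hypothetical `weight`, unseen bigrams receive a small `floor` probability, and every column is non-empty.

```python
import math


def viterbi(columns, bigram_sys, bigram_user, weight=0.7, floor=1e-6):
    """Return the most probable word path through a lattice of candidate columns.

    columns is a list of lists of candidate words, one list per pinyin unit;
    bigram_sys and bigram_user map (previous_word, word) to a probability.
    """
    def score(prev, word):
        p = (weight * bigram_sys.get((prev, word), 0.0)
             + (1 - weight) * bigram_user.get((prev, word), 0.0))
        return math.log(max(p, floor))

    # best maps each word in the current column to (log probability, path so far)
    best = {"<s>": (0.0, [])}
    for column in columns:
        nxt = {}
        for word in column:
            nxt[word] = max(
                ((s + score(prev, word), path + [word])
                 for prev, (s, path) in best.items()),
                key=lambda item: item[0],
            )
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]
```

Because only the best path into each node is kept per column, the cost grows with the product of adjacent column sizes rather than with the number of full paths.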

Of course, those skilled in the art may obtain the output candidates by any method, and the present invention does not limit this.

Step 503: judge whether the preset condition for applying restriction information is met;

Step 504: if so, extract the restriction information corresponding to the output candidates, and rank and display the candidates according to the restriction information.

Ranking the candidates according to the restriction information can be done by directly setting the display position or order, or by adjusting word frequencies (including but not limited to weighting up and weighting down). The most extreme case is to remove a candidate so that it is not displayed.
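A minimal sketch of the frequency-adjustment approach follows. The function name, the interpretation of the restriction information as a multiplicative weight, and the `drop_below` cutoff are all assumptions for illustration; the frequencies in the usage note are made up.

```python
def rank_candidates(candidates, restrictions, drop_below=0.0):
    """Rank (word, frequency) candidates after applying restriction weights.

    restrictions maps a restricted word to a weight; unrestricted words keep
    weight 1.0. A weight at or below drop_below removes the candidate, which
    corresponds to the most extreme form of restriction.
    """
    ranked = []
    for word, freq in candidates:
        weight = restrictions.get(word, 1.0)
        if weight <= drop_below:
            continue  # extreme case: the candidate is not displayed at all
        ranked.append((word, freq * weight))
    # highest adjusted frequency first
    ranked.sort(key=lambda item: -item[1])
    return [word for word, _ in ranked]
```

For instance, with candidates `[("量将", 900), ("良将", 500), ("两江", 800)]`, a weight of 0.5 for "量将" moves it from first to last, and a weight of 0.0 removes it.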

When a word has restriction information that restricts its standalone output, the preset condition for applying restriction information may be: whether the output candidate is a word output on its own. The restriction information can be obtained by the following steps: obtain a target word; obtain the feature information corresponding to the target word; judge whether the feature information or a result computed from it meets a preset condition; if so, record the related restriction information for the target word.

When a word has restriction information that restricts its output in composed phrases, the preset condition for applying restriction information may be: whether the output candidate is a case of intelligent phrase composition. The restriction information can be obtained by the following steps: obtain a target word; obtain the linguistic collocation parameters of the target word in a preset corpus; judge whether the linguistic collocation parameters meet a preset condition; if so, record restriction information for the target word, the restriction information including the corresponding linguistic collocation parameters and being used to restrict the ranking of the word in intelligent phrase composition output.

Preferably, whether an output candidate is a word output on its own can be judged by the following steps:

For the encoded string entered by the user, first obtain all possible output candidates. Then judge whether an output candidate contains only one element and is longer than one output character, where an element is a word stored in the preset lexicon; if so, the output candidate is determined to be a word output on its own. Whether a candidate contains only one element can be determined by querying the lexicon through ID mapping, or by counting the element IDs the candidate contains.

The one output character may be a character of a different byte length or other length in different input method systems; for example, for Chinese, Japanese, or Korean input methods, one output character is a character of 2 bytes. The length can be judged by reading a length parameter preset in the lexicon, which may be stored against the word ID in the attributes of the corresponding entry, or by directly obtaining the length of the output candidate. Other prior-art methods are also feasible, and the present invention does not limit this.

For example, when the user enters the encoded string "liangjiangzong", the possible candidates obtained after pinyin lattice segmentation of the string are: 两江总, 量将, 两江, 良将, and so on. Suppose each candidate can be represented as <entry 1, attribute 1>, <entry 2, attribute 2>; or as <ID of entry 1, attribute 1>, <ID of entry 2, attribute 2>.

For instance, the candidate "两江总" can be represented as <两江, p1>, <总, p2>, and the candidate "量将" as <量将, q1>.

<量将, q1> contains only one element and is longer than one output character, so its attribute q1 is then checked for a restriction information flag. Because it carries a restriction information flag (for example, a non-zero tag), this candidate is not output. The attribute q1 may also include a length parameter.

The candidates finally output are therefore: 两江总, 两江, 良将.
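The worked example above can be sketched as follows. The data layout (each candidate as a list of `(entry, attributes)` pairs, with attributes held in a dictionary carrying `tag` and optional `length`) and the function names are assumptions for illustration, and length is measured in characters rather than bytes.

```python
def is_standalone(candidate):
    """Return True if the candidate is a single lexicon element longer than one character.

    candidate is a list of (entry, attributes) pairs, one per lexicon element.
    """
    if len(candidate) != 1:
        return False
    entry, attrs = candidate[0]
    # prefer a length parameter stored with the entry, else measure the entry itself
    return attrs.get("length", len(entry)) > 1


def filter_candidates(candidates):
    """Drop standalone candidates that carry a non-zero restriction tag."""
    kept = []
    for cand in candidates:
        if is_standalone(cand) and cand[0][1].get("tag", 0) != 0:
            continue  # restricted word output on its own: not displayed
        kept.append("".join(entry for entry, _ in cand))
    return kept


CANDIDATES = [
    [("两江", {"tag": 0}), ("总", {"tag": 0})],  # composed of two elements
    [("量将", {"tag": 1, "length": 2})],          # standalone and flagged
    [("两江", {"tag": 0})],
    [("良将", {"tag": 0})],
]
```

Applied to `CANDIDATES`, the filter keeps 两江总, 两江, and 良将, matching the final output listed in the text.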

In general, a candidate that is not output on its own is a composed-phrase output, so the above process can also be used to judge the intelligent phrase composition case.

Of course, when the user has entered only two syllables, the above judgment can be skipped and the candidate judged directly to be standalone output, because two syllables are generally not a case of intelligent phrase composition. For example, for an encoded string entered by the user that needs no segmentation, the output candidate obtained is judged to be a word output on its own; or, an output candidate for which the encoded string entered by the user corresponds to a single entry in the lexicon is determined to be a word output on its own.

Referring to FIG. 7, an embodiment of an input method system is shown, which may include an input interface unit 701 and a display unit 702, as well as:

a lexicon 703, which includes restriction information; the restriction information may be any of the restriction information described above and may be stored in various ways, for example as a word list within the lexicon, or by flagging the corresponding entries in the lexicon;

a candidate obtaining unit 704, configured to obtain output candidates according to the user's input information;

a judging unit 705, configured to judge whether an output candidate meets the preset condition for applying restriction information;

a candidate ranking unit 706, configured to, when the preset condition is met, extract the restriction information corresponding to the output candidate and rank the candidates according to the restriction information.

The lexicon 703 may include entry information and restricted word information; that is, restricted word information can be recorded in the existing lexicon for words that meet the preset condition. In another preferred case, the lexicon 703 includes a base lexicon and a restricted word list, the restricted word list being a list of words that have restricted word information. In this case, the words that meet the preset condition and their restriction information can be stored separately as a restricted word list, and this list together with the base lexicon makes up the input method lexicon of this embodiment. Of course, it is also feasible for those skilled in the art to preset the input method lexicon by other prior-art methods, and the present invention does not limit this.

Preferably, when a word has restriction information that restricts its standalone output, the preset condition for applying restriction information may be: whether the output candidate is a word output on its own. The judging unit may further include:

a subunit for judging whether an output candidate contains only one element, where an element is a word stored in the preset lexicon;

a subunit for judging whether the length of the output candidate is greater than one output character; and a subunit for determining that the output candidate is a word output on its own when it meets both of the above conditions.

When a word has restriction information that restricts its output in composed phrases, the preset condition for applying restriction information may be: whether the output candidate is a case of intelligent phrase composition. This can be judged by the method described above: a candidate that does not meet the standalone conditions is a case of intelligent phrase composition.

The above input method system may be an ordinary input method system, in which, for example, the input interface unit, the display unit, and the lexicon are located in the same computing device. It may also be a network input method system, in which, for example, the input interface unit and the display unit are located in a first computing device and the lexicon in a second computing device; the input method system obtains the corresponding information from the second computing device according to the information entered by the user and displays the corresponding word candidates on the first computing device.

Because the foregoing embodiments are all based on the same concept of the present invention, each description focuses on what differs from the others; for what they have in common, see the corresponding parts of this specification.

The method and apparatus for obtaining restricted word information, the method for updating a lexicon, the method for optimizing output, and the input method system provided by the present invention have been described in detail above. The description is intended only to help in understanding the method of the present invention and its core idea. For those of ordinary skill in the art, the specific implementation and scope of application may vary in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A method for obtaining restricted word information, comprising:

obtaining a target word;

obtaining feature information corresponding to the target word;

judging whether the feature information or a result computed from it meets a preset condition, and if so, determining that the target word is a restricted word and recording related restriction information, wherein the restriction information is used to restrict the ranking of the word when it is output on its own.

2. The method of claim 1 wherein:

The feature information is: the feature value of the word-initial character of the target word as a word-initial character in a preset corpus, and the feature value of the word-final character of the target word as a word-final character in the preset corpus;

The preset condition used for the judgment is: whether at least one of the feature values falls within a preset range.

3. The method of claim 1 wherein:

The feature information is: a feature value of a linguistic collocation relationship of each single word and/or multi-word included in the target word in a preset corpus;

4. The method of claim 1 wherein:

The feature information is: the feature value of the target word being entered on its own by users in the input method application; the preset condition used for the judgment is: whether the feature value falls within a preset range.

5. The method of claim 1 wherein:

The feature information includes: the feature value of the word-initial character of the target word as a word-initial character in a preset corpus; the feature value of the word-final character of the target word as a word-final character in the preset corpus; and the general word frequency of the target word;

The preset condition used for the judgment is: whether, among the above feature values, the ratio of at least one feature value to the general word frequency of the target word falls within a preset range.

6. The method of claim 1 wherein:

The feature information includes: the feature values, in a preset corpus, of the linguistic collocation relationships among the single-character words and/or multi-character words contained in the target word; and the general word frequency of the target word;

The preset condition used for the judgment is: whether, among the feature values, the ratio of at least one feature value to the general word frequency of the target word falls within a preset range.

7. The method of claim 1 wherein:

The feature information is: the feature value of the target word being entered on its own by users in the input method application; and the general word frequency of the target word;

The preset condition used for the judgment is: whether the ratio of the feature value to the general word frequency of the target word falls within a preset range.

8. The method of claim 1 wherein:

The feature information is: the user ranking position of the target word among the candidate words for the same input code; and the original ranking position of the target word among the candidate words for the same input code; wherein the user ranking is related to the feature value of the target word being entered on its own by users in the input method application, and the original ranking is related to the general word frequency of the target word;

The preset condition used for the judgment is: whether the difference between the user ranking position and the original ranking position falls within a preset range.

9. The method according to any one of claims 1-8, further comprising, before the step of obtaining the feature information, a step of optimizing the selection of the target word.

10. The method according to any one of claims 1-8, wherein the restriction information comprises: a weight of the restriction word when it is output separately in each preset scenario.

11. The method according to any one of claims 1-8, wherein:

The restriction information further includes: a linguistic collocation parameter of the restriction word in a preset corpus; the linguistic collocation parameter is used to limit the ordering of the word in intelligent word-grouping output.

12. The method of any of claims 1-8, further comprising: generating a word list or vocabulary, the word list or vocabulary including the restriction words and their related restriction information;

Alternatively, generating a vocabulary, the vocabulary including the restriction words and their related restriction information, as well as general words.

13. A method for obtaining restricted word information, characterized by comprising:

Obtaining a target word;

Obtaining the linguistic collocation parameters of the target word in the preset corpus;

Determining whether the linguistic collocation parameter meets a preset condition, and if yes, recording restriction information of the target word, the restriction information including the corresponding linguistic collocation parameter; the restriction information is used to limit the ordering of the word in intelligent word-grouping output.

14. The method of claim 13 wherein:

The linguistic collocation parameter is a general parameter;

Alternatively, the linguistic collocation parameter includes sub-parameters for each preset scenario.

15. A method of updating a vocabulary, characterized by comprising:

Obtaining a target word;

Obtaining corresponding feature information of the target word;

Determining whether the feature information or its corresponding calculation result meets a preset condition, and if yes, determining that the target word is a restriction word and recording related restriction information, wherein the restriction information is used to limit the ordering of the word when it is output separately, and/or to limit the ordering of the word in intelligent word-grouping output;

Adding the restriction words and their related restriction information to the existing vocabulary of the input method.

16. The method of claim 15 wherein:

The adding is: determining whether the restriction word already exists in the original vocabulary, and if so, recording only the related restriction information into the existing vocabulary of the input method;

Or the adding is: directly recording the restriction word and its related restriction information into the existing vocabulary of the input method, overwriting the original entry if the word is already present;

Alternatively, the adding is: storing the restriction words and their related restriction information as a separate restriction vocabulary, the restriction vocabulary and the input method vocabulary being used together to perform candidate sorting.
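The three alternatives for adding restriction words in claim 16 can be contrasted in a short Python sketch. It assumes the vocabulary is a dictionary mapping each word to a dictionary of attributes. The function name, the mode names, and the `"restriction"` key are hypothetical, and creating a new entry when the word is absent in the first mode is an assumption, since the claim only describes the case where the word already exists.

```python
def add_restriction(vocab, restriction_vocab, word, info, mode):
    """Record restriction info for `word` using one of three illustrative strategies."""
    if mode == "annotate":
        # Keep any existing entry and attach only the restriction information.
        # Assumption: a missing word gets a fresh entry.
        vocab.setdefault(word, {})["restriction"] = info
    elif mode == "overwrite":
        # Write the word and its restriction information directly,
        # replacing whatever entry was there before.
        vocab[word] = {"restriction": info}
    elif mode == "separate":
        # Leave the input method vocabulary untouched and store the word
        # in a separate restriction vocabulary consulted during sorting.
        restriction_vocab[word] = info
    else:
        raise ValueError(f"unknown mode: {mode}")


vocab = {"w1": {"freq": 100}}
restriction_vocab = {}
add_restriction(vocab, restriction_vocab, "w1", {"weight": 0.1}, "annotate")
print(vocab["w1"])  # existing frequency kept, restriction info added
```

The first strategy preserves existing attributes such as the word frequency, the second trades them for simplicity, and the third keeps the original vocabulary unchanged at the cost of a second lookup during candidate sorting.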

17. The method of claim 15, wherein the restriction word has restriction information in each of the preset scenarios.

18. An apparatus for obtaining restricted word information, comprising:

a target word obtaining unit, configured to acquire a target word;

a feature information acquiring unit, configured to acquire feature information corresponding to the target word;

a restriction information acquiring unit, configured to determine whether the feature information or a corresponding calculation result thereof meets a preset condition, and if yes, determine that the target word is a restriction word and record related restriction information, where the restriction information is used to limit the ordering of the word when it is output separately, and/or to limit the ordering of the word in intelligent word-grouping output.

19. A method of optimizing output, comprising:

Receiving user input information, and converting the input information;

Obtaining output candidates;

Determining whether an output candidate meets a preset condition for applying the restriction information;

If yes, extracting the restriction information corresponding to the output candidate, and sorting the candidates according to the restriction information.
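The steps of claim 19 can be sketched in Python as a single sorting pass. The claim does not specify how the restriction information alters the ordering, so this sketch assumes it is a multiplicative weight applied to a candidate's base score. The function name, the `applies` callback standing in for the preset condition, and the sample scores are all hypothetical.

```python
def sort_candidates(candidates, restriction_weights, applies):
    """Sort (word, base_score) pairs, applying restriction weights where allowed.

    candidates: list of (word, base_score) pairs from the conversion step
    restriction_weights: word -> weight (assumed form of the restriction info)
    applies: callable deciding whether the preset condition holds for a word
    """
    def adjusted_score(candidate):
        word, score = candidate
        if word in restriction_weights and applies(word):
            score *= restriction_weights[word]
        return score

    ranked = sorted(candidates, key=adjusted_score, reverse=True)
    return [word for word, _ in ranked]


candidates = [("w1", 100), ("w2", 80)]
weights = {"w1": 0.1}  # "w1" is a restriction word in this sample

print(sort_candidates(candidates, weights, lambda w: True))   # condition met
print(sort_candidates(candidates, weights, lambda w: False))  # condition not met
```

When the condition is met, the restriction word "w1" falls behind "w2"; when it is not met, the original frequency-based order is kept.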

20. The method of claim 19, wherein:

The preset condition for applying the restriction information is: whether the output candidate is a separately output word;

Alternatively, the preset condition for applying the restriction information is: whether the output candidate is produced by intelligent word grouping.

21. The method according to claim 19, wherein the restriction information is obtained by the following steps:

Obtaining a target word;

Obtaining corresponding feature information of the target word;

Determining whether the feature information or its corresponding calculation result meets a preset condition, and if so, recording relevant restriction information for the target word.

22. The method according to claim 20, wherein when it is required to determine whether the output candidate is a separately output word, the following steps are performed:

Determining whether an output candidate includes only one element and has a length greater than one output character, the element being a word stored in the preset vocabulary;

If so, it is determined that the output candidate is a separately output word.
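The two-part test of claim 22 is small enough to sketch directly. The sketch assumes an output candidate is represented as a list of its elements, each element being a word string, and that the length in output characters equals the string length. The function name and the sample vocabulary are hypothetical.

```python
def is_separately_output_word(elements, vocabulary):
    """True if the candidate is exactly one vocabulary word longer than one character."""
    if len(elements) != 1:
        return False  # built from several elements, so not a separately output word
    word = elements[0]
    return word in vocabulary and len(word) > 1


vocabulary = {"北京", "大学", "大"}

print(is_separately_output_word(["北京"], vocabulary))          # one element, two characters
print(is_separately_output_word(["大"], vocabulary))            # one element, one character
print(is_separately_output_word(["北京", "大学"], vocabulary))  # two elements
```

Only the first candidate qualifies; the single-character candidate and the two-element candidate both fail one of the two conditions.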

23. An input method system, comprising an input interface unit and a display unit, wherein the input method system further comprises:

a vocabulary, the vocabulary including restriction information for words in the vocabulary; the restriction information is used to limit the ordering of a word when it is output separately, and/or to limit the ordering of the word in intelligent word-grouping output;

a candidate obtaining unit, configured to obtain output candidates according to the input information of the user;

a determining unit, configured to determine whether an output candidate meets a preset condition for applying the restriction information; and

a candidate sorting unit, configured to extract the restriction information corresponding to the output candidate when the preset condition is met, and to sort the candidates according to the restriction information.

24. The input method system according to claim 23, wherein the determining unit further comprises:

a subunit for determining whether an output candidate includes only one element; wherein the element is a word stored in a preset vocabulary;

a subunit for determining whether the length of the output candidate is greater than one output character; and a subunit for determining that the output candidate is a separately output word when the output candidate meets both of the above conditions.

25. The input method system according to claim 24, wherein the input interface unit, the display unit, and the vocabulary of the input method system are located in the same computing device;

Alternatively, the input interface unit and the display unit of the input method system are located in a first computing device and the vocabulary is located in a second computing device, and the input method system obtains corresponding information from the second computing device according to the information input by the user and displays the corresponding word on the first computing device.