
WO2008145055A1 - The method for obtaining restriction word information, optimizing output and the input method system - Google Patents

The method for obtaining restriction word information, optimizing output and the input method system

Info

Publication number
WO2008145055A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
information
output
restriction
restriction information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2008/071064
Other languages
French (fr)
Chinese (zh)
Inventor
Jieyong Lv
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Publication of WO2008145055A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/018: Input/output arrangements for oriental characters
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/274: Converting codes to words; Guess-ahead of partial word inputs

Definitions

  • the present invention relates to the field of computer character input data processing, and more particularly to a method and apparatus for acquiring restricted word information, a method for updating an input method vocabulary, a method for optimizing output, and an input method system.
  • the resulting Internet vocabulary can contain many new words that are not available through previous closed corpus information (such as modern Chinese dictionaries, news, newspapers, etc.), which can greatly improve people's input efficiency.
  • It is precisely because of the complexity of the Internet corpus that some words derived from word frequency statistics have deficiencies with respect to linguistics or input habits.
  • For example, the candidates offered by an Internet thesaurus may include the word "quantity", because this word appears very frequently in Internet pages. However, it generally appears at the junction of multiple words in a sentence (where it expresses a linking relationship), for example in "the amount of passengers will exceed".
  • Including the word "quantity" in the input lexicon can increase the intelligence of the input method (a better intelligent group word effect) and in some cases improve input efficiency. However, because the word rarely appears on its own as a separate word, it may also trouble the user by increasing the number of candidates to choose from, which reduces input efficiency.
  • The technical problem to be solved by the present invention is to provide a method and apparatus for acquiring restriction word information, which can find, from a large vocabulary, words that have deficiencies in linguistics or usage habits, thereby improving the user's input experience.
  • Another object of the present invention is to provide a method for updating an input method vocabulary, a method for optimizing output, and an input method system, which can restrict certain words in certain cases during actual input, thereby improving the intelligence of the input method without increasing the user's operations.
  • The present invention discloses a method for acquiring restriction word information, which may specifically include: acquiring a target word; acquiring feature information corresponding to the target word; determining whether the feature information or its corresponding calculation result meets a preset condition, and if so, determining that the target word is a restriction word and recording related restriction information, where the restriction information is used to limit the ordering of the word when it is output separately.
  • the feature information is: a feature value of the single word at the beginning of the target word serving as a prefix in the preset corpus, and a feature value of the single word at the end of the target word serving as a suffix in the preset corpus;
  • the preset condition for determining is: whether at least one of the feature values is in a preset range.
  • the feature information is: a feature value of a linguistic collocation relationship, in a preset corpus, of each single word and/or multi-word word included in the target word; the preset condition for determining is: whether at least one of the above feature values is in the preset range.
  • the feature information is: a feature value of the target word being input separately by the user in the input method application;
  • the preset condition for determining is: whether the feature value belongs to a preset range.
  • the feature information includes: a feature value of the single word at the beginning of the target word serving as a prefix in the preset corpus; a feature value of the single word at the end of the target word serving as a suffix in the preset corpus; and the general word frequency of the target word; the preset condition for the judgment is: whether the ratio of at least one of the above feature values to the general word frequency of the target word belongs to a preset range.
  • the feature information includes: a feature value of a linguistic collocation relationship of each single word and/or a multi-word included in the target word in a preset corpus; and a general word frequency of the target word;
  • the preset condition for judging is: whether the ratio of at least one of the above feature values to the general word frequency of the target word belongs to a preset range.
  • the feature information is: a feature value of the target word being input separately by the user in the input method application; and a general word frequency of the target word; the preset condition for the judgment is: whether the ratio of the feature value to the general word frequency of the target word belongs to the preset range.
  • the feature information is: user-sorted position information of the target words in each candidate word encoded for the same input; and original sorted position information of the target words in each candidate word encoded for the same input;
  • the user ranking information is related to a feature value that the target word is separately input by the user in the input method application;
  • the original sorting information is related to a general word frequency of the target word;
  • the preset condition for determining is: whether the difference between the user sorting position information and the original sorting position information belongs to a preset range.
  • before the feature information obtaining step, the method further includes: an optimal selection step for the target word.
  • the restriction information includes: a weight of the restriction word that is separately output in each preset scenario.
  • the restriction information further includes: a linguistic collocation parameter of the restriction word in a preset corpus; the linguistic collocation parameter is used to limit the ordering of the word during intelligent group word output.
  • the method may further include: generating a word list or thesaurus that includes the restriction words and related restriction information; or generating a thesaurus that includes the restriction words and their related restriction information as well as general words.
  • a method for acquiring restriction word information includes: acquiring a target word; obtaining a linguistic collocation parameter of the target word in a preset corpus; determining whether the linguistic collocation parameter meets a preset condition, and if so, recording restriction information of the target word, the restriction information including the corresponding linguistic collocation parameter; the restriction information is used to limit the ordering of the word during intelligent group word output.
  • the linguistic collocation parameter is a general parameter; or the linguistic collocation parameter includes a sub-parameter for each preset scene.
  • a method for updating a thesaurus includes: acquiring a target word; acquiring feature information corresponding to the target word; determining whether the feature information or its corresponding calculation result meets a preset condition, and if so, determining that the target word is a restriction word and recording related restriction information, where the restriction information is used to limit the ordering of the word when it is output separately and/or to limit the ordering of the word during intelligent group word output; and adding the restriction word and its related restriction information to the existing vocabulary of the input method.
  • the adding is: determining whether the restriction word already exists in the original thesaurus, and if so, recording only the related restriction information into the existing thesaurus of the input method; or, the adding is: directly recording the restriction word and its related restriction information into the existing vocabulary of the input method, overwriting the original entry if the word is repeated; or, the adding is: storing the restriction word and its related restriction information as an independent restricted vocabulary, where the restricted vocabulary and the existing vocabulary of the input method are used together to complete candidate ordering.
  • the restriction word has restriction information in each preset scenario.
  • an apparatus for acquiring restriction word information including:
  • a target word obtaining unit configured to acquire a target word
  • a feature information acquiring unit configured to acquire feature information corresponding to the target word
  • a restriction information acquiring unit configured to determine whether the feature information or its corresponding calculation result meets a preset condition, and if so, determine that the target word is a restriction word and record related restriction information, where the restriction information is used to limit the ordering of the word when it is output separately and/or to limit the ordering of the word during intelligent group word output.
  • a method for optimizing an output includes: receiving user input information and converting the input information; obtaining output candidates; determining whether an output candidate meets a preset condition for applying restriction information; and if so, extracting the restriction information corresponding to the output candidate and sorting the candidates according to the restriction information.
  • the preset condition for applying restriction information is: whether the output candidate is a separately output word; or the preset condition for applying restriction information is: whether the output candidate is an intelligent group word output.
  • the restriction information is obtained by: acquiring a target word; acquiring feature information corresponding to the target word; determining whether the feature information or its corresponding calculation result meets a preset condition, and if so, recording related restriction information for the target word.
  • whether an output candidate is a separately output word is determined by the following steps: determining whether the output candidate includes only one element and has a length greater than one output character, where an element is a word stored in a preset vocabulary; if so, determining that the output candidate is a separately output word.
  • an input method system including an input interface unit and a display unit, the input method system further comprising:
  • a vocabulary that includes restriction information for words, where the restriction information is used to limit the ordering of a word when it is output separately and/or to limit the ordering of the word during intelligent group word output;
  • a candidate obtaining unit, configured to obtain output candidates according to the input information of the user; a determining unit, configured to determine whether an output candidate meets a preset condition for applying restriction information; and a candidate sorting unit, configured to, when the preset condition is met, extract the restriction information corresponding to the output candidate and sort the candidates according to the restriction information.
  • the preset condition for applying restriction information is: whether the output candidate is a separately output word; or the preset condition for applying restriction information is: whether the output candidate is an intelligent group word output.
  • the determining unit further includes: a subunit for determining whether an output candidate includes only one element, where an element is a word stored in a preset vocabulary; a subunit for determining whether the length of the output candidate is greater than one output character; and a subunit for determining that the output candidate is a separately output word when both judgment conditions are met.
  • the input interface unit, the display unit, and the vocabulary of the input method system are located in the same computing device; or the input interface unit and the display unit of the input method system are located in the first computing device, and the vocabulary is located in the second In the computing device, the input method system acquires corresponding information from the second computing device according to the information input by the user, and displays the corresponding word in the first computing device.
  • the embodiment of the invention has the following advantages:
  • The embodiment of the invention presets an input method vocabulary that includes restriction word information. When the user inputs, it determines whether an output candidate meets the preset condition for applying restriction information, and then controls whether candidates with restriction word information are displayed and output. Output thus becomes more efficient without increasing the user's operations (for example, in practice the restriction word "quantity" is not displayed among the candidates when it would be output separately, but still participates in group words in other cases), which greatly optimizes the character output process of the input method system and improves its intelligence.
  • FIG. 1 is a flow chart of steps of Embodiment 1 of a method for acquiring restriction word information according to the present invention
  • FIG. 2 is a flow chart of steps of Embodiment 2 of a method for acquiring restriction word information according to the present invention
  • FIG. 3 is an update of the present invention
  • FIG. 4 is a structural block diagram of an embodiment of an apparatus for obtaining restriction information according to the present invention
  • FIG. 5 is a flow chart of steps of an embodiment of a method for optimizing output according to the present invention
  • FIG. 6 is a schematic diagram of a word grid of a pinyin network segmentation method
  • Figure 7 is a block diagram showing the structure of an embodiment of an input method system.
  • Step 101 Acquire a target word
  • The target word can be obtained from the Internet, that is, directly from an Internet corpus (for example, a collection of Internet web pages or a set of search keywords), or from an existing vocabulary. The present invention does not limit this, as long as a set of target words can be obtained; those skilled in the art can set the size of the set according to actual needs.
  • An optimization step may further be included, in which certain attributes of the target words are used to remove some words and narrow the scope further. For example, words whose Internet word frequency or vocabulary word frequency is less than or equal to a preset threshold are removed from the set; words that do not need to be restricted (e.g., general vocabulary in the dictionary) are also removed from the set.
  • the optimization step described above can also be completed in the process of acquiring the target word set.
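  • The optimization step can be pictured as a simple filter over the target word set. The following is a minimal sketch only: the function name `filter_target_words`, the frequency table, and the threshold value are illustrative assumptions, not part of the described method.

```python
def filter_target_words(word_freq, min_freq, exempt_words):
    """Narrow the target word set before feature extraction.

    word_freq    -- hypothetical mapping of word -> Internet or vocabulary word frequency
    min_freq     -- words with frequency <= min_freq are dropped
    exempt_words -- words that do not need to be restricted (e.g. general dictionary words)
    """
    return {
        word
        for word, freq in word_freq.items()
        if freq > min_freq and word not in exempt_words
    }


# Toy data, invented for illustration only.
freqs = {"quantity": 52000, "passenger": 31000, "zzz": 3}
targets = filter_target_words(freqs, min_freq=100, exempt_words={"passenger"})
print(targets)
```

  • The same filter could equally run while the target word set is being collected, as the text notes.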
  • Step 102 Acquire feature information corresponding to the target word
  • Step 103 Determine whether the feature information or its corresponding calculation result meets a preset condition, and if so, determine that the target word is a restriction word and record related restriction information, where the restriction information is used to limit the ordering of the word among the candidates when it is output separately.
  • For example, the restriction words "quantity", "previous", etc. do not appear among the candidates when output separately, but are not limited when output together with other words as intelligent group words.
  • For example, the first candidate pre-output according to word frequency information is "quantity", but since it carries the restriction information mark, it is removed from the candidates and not displayed; when "lvkeliangjiangchaoguo" is input, the output candidate is "The passenger volume will exceed", and in this case the word "quantity" does not need to be restricted.
  • The restriction words and their restriction information obtained in this embodiment may be stored directly in a separate word list (or thesaurus), for example by generating a word list (or thesaurus) dedicated to storing the restriction words and their related restriction information; they may also be combined with general words to generate an input method vocabulary, for example a vocabulary that includes the restriction words, their related restriction information, and general words; or they may be added to the existing vocabulary of the input method.
  • The restriction information may take the form of a flag (for example, each restriction word in the lexicon is marked with 0 or 1) or a specific numerical value (for example, a decimal between 0 and 1 with two decimal places) used to adjust the ordering of the candidates; not displaying the word at all is the extreme case.
  • The obtained restriction words and their restriction information can be changed manually by the user according to actual needs, or updated and modified automatically by the server.
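  • How a flag or a numerical restriction value might affect candidate ordering can be sketched as follows. This is an assumption-laden illustration: the text does not say how the value is combined with the base score, so multiplying the score by the weight, the `hide_at` cut-off, and the candidate words themselves are all hypothetical.

```python
def rank_candidates(candidates, restriction, hide_at=0.0):
    """Order separately output candidates using restriction weights.

    candidates  -- list of (word, base_score) pairs, higher score ranks first
    restriction -- hypothetical mapping of word -> weight in [0, 1]; words
                   without an entry are treated as unrestricted (weight 1.0)
    hide_at     -- weights at or below this value hide the word entirely,
                   which is the extreme case of lowering its rank
    """
    ranked = []
    for word, score in candidates:
        weight = restriction.get(word, 1.0)
        if weight <= hide_at:
            continue  # the candidate is not displayed at all
        ranked.append((word, score * weight))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked]


# Toy candidates for one input code; scores are invented.
cands = [("quantity", 0.9), ("bright", 0.5), ("two", 0.4)]
print(rank_candidates(cands, {"quantity": 0.0}))  # flag-style: hidden
print(rank_candidates(cands, {"quantity": 0.5}))  # weight-style: demoted
```

  • A 0/1 flag is then just the special case where the weight is either 0.0 or 1.0.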
  • the corresponding judgment conditions may differ depending on the obtained feature information. Steps 102 and 103 will be described below by way of a plurality of examples.
  • the preset corpus may be any corpus; the eigenvalues may be obtained by statistics, or may be directly obtained according to experience or existing knowledge; the eigenvalues may be various values, such as probability or frequency. It should be noted that the feature information and the judgment conditions described below are merely examples, and those skilled in the art can set more complicated feature information and judgment conditions as needed, and the present invention does not limit this.
  • the feature information is: a feature value of the single word at the beginning of the target word serving as a prefix in the preset corpus, and a feature value of the single word at the end of the target word serving as a suffix in the preset corpus;
  • the preset condition for determining is: whether at least one of the feature values is in the preset range. That is, if either the initial feature value or the final feature value is within the preset range, the target word can be determined to be a restriction word.
  • For example, the word "quantity" rarely appears at the beginning of a word; if the frequency with which "quantity" appears word-initially is less than or equal to the preset threshold, then "quantity" is determined to be a restriction word.
  • For words composed of three or more single words, it is also possible to use the feature value of a single word at a given position in the target word appearing at the same position in words of the preset corpus.
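  • The prefix and suffix feature values of this example can be sketched as relative frequencies over a segmented corpus. Everything here is an assumption made for illustration: the text allows probabilities or frequencies, and the functions `position_features` and `is_restricted`, the use of Latin letters as stand-ins for single words, and the threshold are all hypothetical.

```python
def position_features(char, corpus_words):
    """Return (prefix_ratio, suffix_ratio) for a single word (character).

    Among multi-character corpus words that contain `char`, compute the
    fraction that start with it and the fraction that end with it.
    """
    containing = [w for w in corpus_words if char in w and len(w) > 1]
    if not containing:
        return 0.0, 0.0
    prefix = sum(w.startswith(char) for w in containing) / len(containing)
    suffix = sum(w.endswith(char) for w in containing) / len(containing)
    return prefix, suffix


def is_restricted(word, corpus_words, threshold):
    """Restrict the word if its first character is rarely word-initial,
    or its last character is rarely word-final, in the corpus."""
    prefix, _ = position_features(word[0], corpus_words)
    _, suffix = position_features(word[-1], corpus_words)
    return prefix <= threshold or suffix <= threshold


# Toy corpus: "b" mostly appears at the end of words, rarely at the start.
corpus = ["ab", "cb", "db", "ba"]
print(position_features("b", corpus))
print(is_restricted("b", corpus, threshold=0.3))
```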
  • the feature information is: a feature value of a linguistic collocation relationship of each single word and/or multi-word included in the target word in a preset corpus;
  • the preset condition for determining is: whether at least one of the feature values is present in the preset range.
  • The linguistic collocation relationship may include a collocation parameter between a word and a word, between a word and a part of speech, between a part of speech and a part of speech, and the like. Those skilled in the art can select or combine these collocation relationships according to actual needs.
  • the feature information is: a feature value of the target word being input separately by the user in the input method application; the preset condition for determining is: whether the feature value belongs to a preset range.
  • The feature value for separate input may be a statistical value for a single user or for a user group, and may be obtained by monitoring user input behavior.
  • a general word frequency is introduced in the judgment condition, and the general word frequency can be the Internet word frequency or the word library word frequency.
  • the feature information includes: a feature value of the single word at the beginning of the target word serving as a prefix in the preset corpus; a feature value of the single word at the end of the target word serving as a suffix in the preset corpus; and the general word frequency of the target word;
  • the preset condition for the judgment is: whether the ratio of the at least one feature value to the target word common word frequency among the above feature values belongs to a preset range.
  • the feature information includes: a feature value of a linguistic collocation relationship of each single word and/or a multi-word word included in the target word in a preset corpus; and a general word frequency of the target word;
  • the preset condition for the judgment is: whether the ratio of the at least one feature value to the target word common word frequency among the above feature values belongs to a preset range.
  • the feature information is: user sorting position information of the target word among the candidate words for the same input code; and original sorting position information of the target word among the candidate words for the same input code; wherein the user sorting information is related to the feature value of the target word being input separately by the user in the input method application, and the original sorting information is related to the general word frequency of the target word; in a simple case, the user sorting information can be considered to be related to the user vocabulary information, and the original sorting information to the system vocabulary information.
  • the preset condition for determining is: whether the difference between the user sorting position information and the original sorting position information belongs to a preset range.
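  • The sorting-position comparison can be sketched as below. This is a hypothetical illustration: the text only says the difference must fall in a preset range, so treating "dropped by at least `min_drop` positions" as that range, and the candidate lists themselves, are assumptions.

```python
def rank_drop(word, original_order, user_order):
    """How many positions the word falls when moving from the original
    (general word frequency) order to the user (separate input) order."""
    return user_order.index(word) - original_order.index(word)


def is_restricted_by_rank(word, original_order, user_order, min_drop):
    """Assumed preset range: the word drops by at least min_drop positions."""
    return rank_drop(word, original_order, user_order) >= min_drop


# Candidates for the same input code, in two orderings (invented data).
original = ["quantity", "bright", "two"]   # by general word frequency
by_user = ["bright", "two", "quantity"]    # by what users actually pick
print(rank_drop("quantity", original, by_user))
print(is_restricted_by_rank("quantity", original, by_user, min_drop=2))
```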
  • the feature information is: a feature value of the target word being input separately by the user in the input method application; and a general word frequency of the target word;
  • the preset condition for determining is: whether the ratio of the feature value to the target word common word frequency belongs to a preset range.
  • A specific implementation of Example 7 is as follows:
  • calculate alpha = f_user / f_web, and identify words whose alpha is far below normal and whose f_web value is greater than a certain threshold as restriction words.
  • alpha is the calculation result,
  • f_web is the general word frequency of a word,
  • f_user is the feature word frequency of the word (its frequency of separate input by users).
  • For each word, the corresponding alpha value can be calculated, and the words sorted by alpha from small to large. A word whose alpha value ranks at the top, for example within the first 5%, and whose general word frequency is high, for example greater than 10000, is considered a restriction word.
  • the restriction information may include: a weight of the restriction word that is separately output in each preset scenario. That is, the restriction word can have restriction information in different application scenarios, and does not have only one general restriction information.
  • The input method determines the user's application scenario from the current program; for example, when the user is typing in Word, the restriction information value for the corresponding preset scene (for example, a work-terminology environment) is used.
  • The restriction information may further include: a linguistic collocation parameter of the restriction word in the preset corpus, which is used to limit the ordering of the word during intelligent group word output. That is, certain restriction words need to be restricted both when output separately and when output as part of an intelligent group word. For example, the word "previous" needs to be limited when output separately and should appear among the candidates as little as possible; the intelligent group word output of "previous" together with "in" should likewise be restricted according to the collocation relationship, so that collocations such as "previous" and "in" appear among the candidates as little as possible.
  • The restriction information may include all linguistic collocation parameters (for example, part-of-speech collocation parameters) of the word in the preset corpus, or may save only the required collocation parameters. For example, a threshold is set for restricted output, and a linguistic collocation parameter is saved only if it is less than or equal to the threshold.
  • The preset corpus information may be Internet corpus information and/or user input record corpus information.
  • The Internet corpus information may be obtained by crawling massive numbers of web pages from the Internet with a web spider; the user input record corpus may include direct information and indirect information; for example, the character records input by users may be used as direct information, and statistics on the distribution of characters input by users may be used as indirect information.
  • The preset corpus information may also be set by a person skilled in the art according to needs or experience, and the present invention does not limit this.
  • Referring to FIG. 2, Embodiment 2 of the method for acquiring restriction word information is shown, which may include:
  • Step 201 Acquire a target word;
  • Step 202 Obtain a linguistic collocation parameter of the target word in a preset corpus
  • Step 203 Determine whether the linguistic collocation parameter meets a preset condition, and if so, record the restriction information of the target word, where the restriction information includes the corresponding linguistic collocation parameter and is used to limit the ordering of the word during intelligent group word output.
  • For example, the value of the collocation parameter between "previous" and locative words is very low, so this collocation parameter is recorded in the restriction information of "previous"; then, during intelligent group word output, if a candidate is a collocation of "previous" with a locative word, that candidate is removed.
  • the linguistic collocation parameter may be a general parameter; or the linguistic collocation parameter may also include sub-parameters for each preset scene.
  • the linguistic collocation parameters may include collocation parameters of words and words, collocation parameters of words and part of speech, collocation parameters of part of speech and part of speech, and the like.
  • The linguistic collocation parameter may be expressed as an adjacent co-occurrence frequency, a co-occurrence probability, a connection strength value, or the like; these values may be obtained from any preset corpus, or directly from existing experience or knowledge.
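  • Applying a recorded word-to-part-of-speech collocation parameter during intelligent group word output can be sketched as follows. The data layout (a nested mapping), the part-of-speech labels, the threshold, and the function name `violates_collocation` are illustrative assumptions; the text leaves all of these open.

```python
def violates_collocation(candidate, collocation, threshold):
    """Check whether a group word candidate contains a restricted collocation.

    candidate   -- list of (word, part_of_speech) elements in output order
    collocation -- hypothetical restriction info: word -> {following
                   part of speech -> collocation value (e.g. co-occurrence
                   probability)}
    threshold   -- pairs whose recorded value is <= threshold are restricted
    """
    for (left_word, _), (_, right_pos) in zip(candidate, candidate[1:]):
        value = collocation.get(left_word, {}).get(right_pos)
        if value is not None and value <= threshold:
            return True
    return False


# Invented restriction info: "previous" almost never precedes a locative word.
info = {"previous": {"locative": 0.01}}
bad = [("previous", "determiner"), ("in", "locative")]
good = [("previous", "determiner"), ("month", "noun")]
kept = [c for c in (bad, good) if not violates_collocation(c, info, 0.05)]
print(kept)
```

  • Removing the candidate, as here, is the strictest choice; lowering its rank instead would be an equally plausible reading of "limit the ordering".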
  • Step 301 Obtain a target word.
  • Step 302 Acquire feature information corresponding to the target word
  • Step 303 Determine whether the feature information or its corresponding calculation result meets a preset condition, and if so, determine that the target word is a restriction word and record related restriction information, where the restriction information is used to limit the ordering of the word when it is output separately and/or to limit the ordering of the word during intelligent group word output;
  • Step 304 Add the restriction word and its related restriction information to an existing vocabulary of the input method.
  • This embodiment can be applied to: the server obtains the restriction word information, and then updates it to the existing vocabulary of the input method in time.
  • The updated restriction information may include the restriction information obtained by the foregoing embodiments of FIG. 2 and FIG. 3; that is, it may include information for limiting the ordering of a word when it is output separately, and may also include information for limiting the ordering of the word during intelligent group word output; the two can exist separately or together.
  • the restriction information may include: a weight of the restriction word that is separately outputted in each preset scenario.
  • step 304 can be in various ways, for example,
  • the adding is: determining whether the restricted word already exists in the original thesaurus, and if so, recording only the relevant restriction information into the existing thesaurus of the input method;
  • the adding is: directly recording the restriction word and its related restriction information into an existing vocabulary of the input method, and if the vocabulary is repeated, overwriting the original vocabulary;
  • the adding is: storing the restricted words and their related restriction information as an independent restricted vocabulary, and the restricted vocabulary and the input lexicon are used to collaboratively perform candidate ordering.
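  • The three ways of performing step 304 can be sketched with dictionaries standing in for the vocabularies. The entry layout (`freq`, `restriction`) and the function names are hypothetical, and the behaviour when the word is absent in the first option is an assumption, since the text does not specify it.

```python
def attach_restriction(lexicon, word, info):
    """Option 1: if the word already exists, record only its restriction info."""
    if word in lexicon:
        lexicon[word]["restriction"] = info
    else:
        # Assumption: the text does not cover this case; add a new entry.
        lexicon[word] = {"freq": 0, "restriction": info}


def overwrite_entry(lexicon, word, entry):
    """Option 2: record word and info directly, overwriting any duplicate."""
    lexicon[word] = dict(entry)


def lookup_restriction(word, restricted_lexicon):
    """Option 3: an independent restricted vocabulary, consulted alongside
    the main vocabulary when candidates are ordered."""
    return restricted_lexicon.get(word)


lexicon = {"quantity": {"freq": 52000}}
attach_restriction(lexicon, "quantity", 0.0)
print(lexicon["quantity"])

other = {"quantity": {"freq": 52000}}
overwrite_entry(other, "quantity", {"freq": 100, "restriction": 0.0})
print(other["quantity"])

print(lookup_restriction("quantity", {"quantity": 0.0}))
```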
  • an apparatus for acquiring a restriction word information is shown, which may specifically include: a target word obtaining unit 401, configured to acquire a target word;
  • the feature information acquiring unit 402 is configured to acquire feature information corresponding to the target word
  • the restriction information obtaining unit 403 is configured to determine whether the feature information or its corresponding calculation result meets a preset condition, and if so, determine that the target word is a restriction word and record related restriction information, where the restriction information is used to limit the ordering of the word when it is output separately and/or to limit the ordering of the word during intelligent group word output.
  • The method may include: Step 501: Receive user input information and convert the input information.
  • The input information may include a code string, and may also include handwritten input information or voice input information, since those input modes also need the lexicon for candidate ranking. That is, the present invention can be applied to input method platforms for various input modes, including keyboard symbols, handwriting, and voice. Because the information conversion process in these input modes is well known, it is not described here.
  • The input method system segments the code string entered by the user.
  • Segmentation of a pinyin code string is briefly described as an example.
  • A pinyin code string can be segmented in more than one way; for example, the pinyin code string "fangan" can be segmented as "fang'an" or as "fan'gan", and so on.
  • The segmentation method may be any method in the prior art; the present invention does not restrict it.
  • Step 502: Obtain output candidates.
  • Obtaining output candidates from the segmented code string amounts to automatically converting the continuous pinyin stream that was entered into a corresponding word stream.
  • The process is as follows. A given continuous pinyin stream A is divided by some pinyin stream segmentation algorithm into a pinyin sequence A1 A2 ... Am, where each pinyin Ai corresponds to a group of homophones that can be represented as a column of nodes Wi1 Wi2 ... Win. The candidate homophones for the pinyin sequence A1 A2 ... Am can then be represented by m columns of nodes.
  • The candidate homophones corresponding to a pinyin sequence form a candidate homophone matrix. Connecting adjacent nodes with directed edges forms a word grid.
  • The word grid constitutes the state space of the Chinese character input problem. The word conversion problem thus becomes the problem of searching for an optimal path in the word grid.
  • For example, the pinyin stream "zheshiyizhipiaoliangdemao" is segmented into the pinyin sequence "zhe'shi'yi'zhi'piaoliang'de'mao".
  • The word grid corresponding to this pinyin sequence is shown in FIG. 6.
  • Step 503: Judge whether an output candidate meets the preset condition for applying the restriction information.
  • Step 504: If so, extract the restriction information corresponding to the output candidate, and rank and display the candidates according to the restriction information.
  • Ranking the candidates according to the restriction information may be implemented by directly setting a display position or order, or by modifying a word frequency (including but not limited to raising or lowering its weight); in the most extreme case the word is removed from the candidates and not displayed.
  • The preset condition for applying the restriction information may be: whether the output candidate is a word output on its own.
  • The restriction information may be obtained through the following steps: obtaining a target word; obtaining feature information corresponding to the target word; and judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, recording the related restriction information for the target word.
  • The preset condition for applying the restriction information may be: whether the output candidate is a case of intelligent word composition.
  • The restriction information may be obtained through the following steps: obtaining a target word; obtaining a linguistic collocation parameter of the target word in a preset corpus; and judging whether the linguistic collocation parameter meets a preset condition and, if so, recording restriction information for the target word, where the restriction information includes the corresponding linguistic collocation parameter and is used to restrict the ranking of the word when it is output through intelligent word composition.
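  • Whether the output candidate is a word output on its own can be judged through the following steps: judge whether the output candidate contains only one element and is longer than one output character, where an element is a word stored in the preset lexicon; if so, determine that the output candidate is a word output on its own.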
  • Whether a candidate contains only one element may be judged by querying the lexicon through ID mapping, or by counting the element IDs the candidate contains.
  • "One output character" may have a different byte length, or be a character of another length, in different input method systems.
  • For example, one output character may be a 2-byte character.
  • The length may be judged by reading a length parameter preset in the lexicon, which may be stored in the attributes of the entry corresponding to the word ID, or by obtaining the length of the output candidate directly; other prior-art methods may also be used, and the present invention does not restrict this.
  • Each candidate can be expressed as <entry 1, attribute 1>, <entry 2, attribute 2>, or as <ID of entry 1, attribute 1>, <ID of entry 2, attribute 2>.
  • The candidate "two rivers total" can be expressed as: <two rivers, p1>, <total, p2>;
  • the candidate "quantity will" can be expressed as: <quantity will, q1>.
  • <quantity will, q1> contains only one element and is longer than one output character, so the judgment continues with whether its attribute q1 contains a restriction information flag. Because it does (for example, the tag is not 0), the candidate is not output.
  • The length parameter may also be included in the attribute q1.
  • the candidate for the final output is: Liangjiang, Liangjiang, Liangjiang.
  • An input method system is shown, which may specifically include an input interface unit 701 and a display unit 702, as well as:
  • a lexicon 703 that includes restriction information, where the restriction information may be any of the kinds of restriction information described above and may be present in various forms, for example as a word list within the lexicon, or as marks on the corresponding entries in the lexicon;
  • a candidate obtaining unit 704, configured to obtain output candidates according to the user's input information; a judging unit 705, configured to judge whether an output candidate meets the preset condition for applying the restriction information;
  • a candidate ranking unit 706, configured to, when the preset condition is met, extract the restriction information corresponding to the output candidate and rank the candidates according to the restriction information.
  • The lexicon 703 may include entry information and restriction word information; that is, for words that meet the preset condition, the restriction information may be recorded in the existing lexicon.
  • The lexicon 703 includes a basic lexicon and a restriction lexicon, where the restriction lexicon is a lexicon carrying restriction word information.
  • The words that meet the preset condition and their corresponding restriction information can be stored independently as a restriction lexicon, and the restriction lexicon and the basic lexicon together make up the input method lexicon of this embodiment.
  • The preset condition for applying the restriction information may be: whether the output candidate is a word output on its own.
  • The judging unit may further include:
  • a subunit for judging whether an output candidate contains only one element, where an element is a word stored in a preset lexicon.
  • The preset condition for applying the restriction information may be: whether the output candidate is a case of intelligent word composition.
  • The same judging method may be used here: a candidate that does not meet the judging conditions is a case of intelligent word composition.
  • The above input method system may be an ordinary input method system; for example, the input interface unit, the display unit, and the lexicon of the input method system are located in the same computing device.
  • The input method system may also be a network input method system; for example, the input interface unit and the display unit of the input method system are located in a first computing device and the lexicon is located in a second computing device. The input method system obtains the corresponding information from the second computing device according to the information entered by the user, and displays the corresponding candidate words on the first computing device.
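The word-grid search described under Step 502 can be illustrated with a small sketch. The text only says that an optimal path is searched for in the grid; the Viterbi-style dynamic programming below, the additive scoring, and all scores and example characters are assumptions chosen for illustration, not details taken from the patent.

```python
def best_path(columns, unigram, bigram):
    """Viterbi-style search for the highest-scoring path through a word grid.

    columns: one list of homophone candidates per pinyin segment
    unigram: dict word -> node score (hypothetical values)
    bigram:  dict (previous word, word) -> score of the directed edge
    """
    # best[w] = (score of the best path ending in w, that path)
    best = {w: (unigram.get(w, 0.0), [w]) for w in columns[0]}
    for column in columns[1:]:
        nxt = {}
        for w in column:
            # pick the predecessor that maximizes accumulated score plus edge score
            prev, (score, path) = max(
                best.items(),
                key=lambda item: item[1][0] + bigram.get((item[0], w), 0.0),
            )
            total = score + bigram.get((prev, w), 0.0) + unigram.get(w, 0.0)
            nxt[w] = (total, path + [w])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]


# Hypothetical grid for the pinyin sequence "zhe'shi'mao".
columns = [["这", "着"], ["是", "事"], ["猫", "毛"]]
unigram = {"这": 3.0, "着": 1.0, "是": 3.0, "事": 2.0, "猫": 1.0, "毛": 1.5}
bigram = {("这", "是"): 2.0, ("是", "猫"): 1.0}
path = best_path(columns, unigram, bigram)
```

Each column holds the homophones of one pinyin segment, and the edge scores stand in for whatever language-model evidence a real input method would use.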

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The method for obtaining restriction word information includes the following steps: obtaining feature information corresponding to a target word, and judging whether the feature information meets a preset condition. If it does, the target word is determined to be a restriction word and the related restriction information is recorded. The restriction information is used to restrict the ranking of the word when it is output on its own. In the embodiments of the invention, an input method lexicon that includes restriction word information is preset; whether an output candidate meets the preset condition is judged, and the result determines whether candidates carrying restriction word information are displayed and output. The user can thus obtain more effective output without additional operations, the character output process of the input method system is greatly optimized, and the intelligence of the input method system is improved.

Description

Method for Obtaining Restriction Word Information, Method for Optimizing Output, and Input Method System

This application claims priority to Chinese Patent Application No. 200710099644.0, filed with the Chinese Patent Office on May 25, 2007 and entitled "Method for Obtaining Restriction Word Information, Method for Optimizing Output, and Input Method System", the entire contents of which are incorporated herein by reference.

Technical Field

The present invention relates to the field of computer character-input data processing, and in particular to a method and an apparatus for obtaining restriction word information, a method for updating an input method lexicon, a method for optimizing output, and an input method system.

Background Art

With the spread and development of computer and Internet technology, users from different professional fields, with different interests and usage habits, place ever higher demands on the intelligence of input method systems.

In the prior art, techniques have emerged that build an input method lexicon by statistically analyzing and filtering a large and heterogeneous Internet corpus. The resulting Internet lexicon can contain many new words that cannot be obtained from earlier closed corpora (such as modern Chinese dictionaries, news, and newspapers), which can greatly improve input efficiency. However, precisely because the Internet corpus is so complex, some of the words selected from it by word-frequency statistics are defective from the standpoint of linguistics or of input habits.

For example, for the pinyin code string "liangjiang" entered by a user, the candidates generally available include "两江" and "良将". With an Internet lexicon, the candidates may also include "量将", because "量将" occurs quite frequently in Internet web pages. It generally occurs, however, at the junction of several words within a sentence (expressing a linking relationship), as in "旅客量将超过" ("passenger volume will exceed"). Including "量将" in the input method lexicon can indeed make the input method more intelligent (achieving better intelligent word composition) and can improve input efficiency in some cases. But because "量将" rarely occurs as a word on its own, it may also inconvenience the user by increasing the number of candidates to choose from and lowering input efficiency.

Therefore, a technical problem that those skilled in the art urgently need to solve is how to find such words with linguistic or usage-habit defects in a lexicon and restrict them during input, so as to further improve input efficiency.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a method and an apparatus for obtaining restriction word information that can find, among a large number of words, those with linguistic or usage-habit defects, thereby improving the user's input experience.

Another object of the present invention is to provide a method for updating an input method lexicon, a method for optimizing output, and an input method system that can restrict certain words in certain situations during actual input, thereby making the input method more intelligent without requiring additional user operations.

To solve the above technical problem, the present invention discloses a method for obtaining restriction word information, which may specifically include: obtaining a target word; obtaining feature information corresponding to the target word; and judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, determining that the target word is a restriction word and recording the related restriction information, where the restriction information is used to restrict the ranking of the word when it is output on its own.

Preferably, the feature information is: a feature value of the target word's first character as a word-initial character in a preset corpus, and a feature value of the target word's last character as a word-final character in the preset corpus; and the preset condition used for the judgment is: whether at least one of these feature values falls within a preset range.

Preferably, the feature information is: feature values, in a preset corpus, of the linguistic collocation relationships of the single-character words and/or multi-character words contained in the target word; and the preset condition used for the judgment is: whether at least one of these feature values falls within a preset range.

Preferably, the feature information is: a feature value of the target word being entered on its own by users in input method applications; and the preset condition used for the judgment is: whether this feature value falls within a preset range.

Preferably, the feature information includes: a feature value of the target word's first character as a word-initial character in a preset corpus; a feature value of the target word's last character as a word-final character in the preset corpus; and the general word frequency of the target word; and the preset condition used for the judgment is: whether the ratio of at least one of these feature values to the general word frequency of the target word falls within a preset range.

Preferably, the feature information includes: feature values, in a preset corpus, of the linguistic collocation relationships of the single-character words and/or multi-character words contained in the target word; and the general word frequency of the target word; and the preset condition used for the judgment is: whether the ratio of at least one of these feature values to the general word frequency of the target word falls within a preset range.

Preferably, the feature information is: a feature value of the target word being entered on its own by users in input method applications; and the general word frequency of the target word; and the preset condition used for the judgment is: whether the ratio of this feature value to the general word frequency of the target word falls within a preset range.

Preferably, the feature information is: user ranking position information of the target word among the candidate words for the same input code; and original ranking position information of the target word among the candidate words for the same input code; where the user ranking information is related to the feature value of the target word being entered on its own by users in input method applications, and the original ranking information is related to the general word frequency of the target word; and the preset condition used for the judgment is: whether the difference between the user ranking position information and the original ranking position information falls within a preset range.
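The ratio-based and rank-based conditions above can be sketched as two small predicates. This is only an illustration under stated assumptions: the function names, the closed interval used for the "preset range", and all numbers are hypothetical, since the text does not fix how the feature values are computed or what the range is.

```python
def ratio_in_range(feature_values, word_freq, low, high):
    """Return True if at least one feature/frequency ratio lies in [low, high]."""
    if word_freq <= 0:
        return False  # no usable general word frequency, so no ratio can be formed
    return any(low <= value / word_freq <= high for value in feature_values)


def rank_shift_in_range(user_rank, original_rank, low, high):
    """Return True if (user rank - original rank) lies in [low, high]."""
    return low <= user_rank - original_rank <= high


# Hypothetical numbers: word-initial and word-final feature values 12 and 900,
# general word frequency 1000, preset range [0.0, 0.05].
is_restricted = ratio_in_range([12, 900], 1000, low=0.0, high=0.05)
```

With these made-up values the word-initial ratio is 0.012, which falls inside the assumed range, so the target word would be marked as a restriction word.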

Preferably, before the step of obtaining the feature information, the method further includes a step of optimizing and screening the target words.

Preferably, the restriction information includes: weights that restrict the standalone output of the restriction word in each preset scenario.

Preferably, the restriction information further includes: a linguistic collocation parameter of the restriction word in a preset corpus, where the linguistic collocation parameter is used to restrict the ranking of the word when it is output through intelligent word composition.

Preferably, the method may further include: generating a lexicon or word list that includes the restriction words and their related restriction information; or generating a lexicon that includes the restriction words and their related restriction information as well as general words.

According to another embodiment of the present invention, a method for obtaining restriction word information is also disclosed, including: obtaining a target word; obtaining a linguistic collocation parameter of the target word in a preset corpus; and judging whether the linguistic collocation parameter meets a preset condition and, if so, recording restriction information for the target word, where the restriction information includes the corresponding linguistic collocation parameter and is used to restrict the ranking of the word when it is output through intelligent word composition.

Preferably, the linguistic collocation parameter is a single general parameter; or the linguistic collocation parameter includes sub-parameters for each preset scenario.

According to another embodiment of the present invention, a method for updating a lexicon is also disclosed, including: obtaining a target word; obtaining feature information corresponding to the target word; judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, determining that the target word is a restriction word and recording the related restriction information, where the restriction information is used to restrict the ranking of the word when it is output on its own and/or to restrict the ranking of the word when it is output through intelligent word composition; and adding the restriction word and its related restriction information to the existing lexicon of the input method.

Preferably, the adding is: judging whether the restriction word already exists in the original lexicon and, if it does, recording only its related restriction information in the existing lexicon of the input method; or the adding is: recording the restriction word and its related restriction information directly in the existing lexicon of the input method, overwriting the original entry if the entry is duplicated; or the adding is: storing the restriction word and its related restriction information as a restriction word list, where the restriction word list and the existing lexicon of the input method cooperate to rank the candidates.

Preferably, the restriction word has restriction information for each preset scenario.
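The three ways of adding restriction information can be sketched as follows. The dictionary layout, the function names, and the rule that the separate list takes precedence are assumptions made for illustration; the text does not prescribe a storage format or a precedence order.

```python
def add_if_present(lexicon, word, restriction):
    """Option 1: record the restriction only if the word is already in the lexicon."""
    if word in lexicon:
        lexicon[word]["restriction"] = restriction
        return True
    return False


def add_or_overwrite(lexicon, word, freq, restriction):
    """Option 2: write the entry directly, overwriting a duplicate entry."""
    lexicon[word] = {"freq": freq, "restriction": restriction}


def restriction_for(word, lexicon, restriction_list):
    """Option 3: consult a separate restriction word list alongside the lexicon."""
    entry = lexicon.get(word, {})
    # Assumption: the separate list wins over a flag stored in the lexicon entry.
    return restriction_list.get(word, entry.get("restriction"))


# Hypothetical lexicon contents and restriction payloads.
lexicon = {"量将": {"freq": 90}, "两江": {"freq": 60}}
added = add_if_present(lexicon, "量将", {"standalone": "hide"})
skipped = add_if_present(lexicon, "上一", {"standalone": "hide"})
add_or_overwrite(lexicon, "上一", 30, {"standalone": "hide"})
from_list = restriction_for("量将", lexicon, {"量将": {"standalone": "demote"}})
```

Option 3 leaves the existing lexicon untouched, which may suit the server-side update scenario, at the cost of one extra lookup during candidate ranking.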

According to another embodiment of the present invention, an apparatus for obtaining restriction word information is also disclosed, including:

a target word obtaining unit, configured to obtain a target word;

a feature information obtaining unit, configured to obtain feature information corresponding to the target word; and

a restriction information obtaining unit, configured to judge whether the feature information, or a result calculated from it, meets a preset condition and, if so, to determine that the target word is a restriction word and record the related restriction information, where the restriction information is used to restrict the ranking of the word when it is output on its own and/or to restrict the ranking of the word when it is output through intelligent word composition.

According to another embodiment of the present invention, a method for optimizing output is also disclosed, including: receiving user input information and converting the input information; obtaining output candidates; judging whether an output candidate meets the preset condition for applying restriction information; and, if so, extracting the restriction information corresponding to the output candidate and ranking the candidates according to the restriction information.

Preferably, the preset condition for applying the restriction information is: whether the output candidate is a word output on its own; or the preset condition for applying the restriction information is: whether the output candidate is a case of intelligent word composition.

Preferably, the restriction information is obtained through the following steps: obtaining a target word; obtaining feature information corresponding to the target word; and judging whether the feature information, or a result calculated from it, meets a preset condition and, if so, recording the related restriction information for the target word.

Preferably, when it is necessary to judge whether the output candidate is a word output on its own, this is done through the following steps: judging whether an output candidate contains only one element and has a length greater than one output character, where an element is a word stored in a preset lexicon; and, if so, determining that the output candidate is a word output on its own.
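The two-part test for a word output on its own can be written as a short predicate. This is a sketch, assuming a candidate is represented as the list of lexicon entries it is built from and that one output character corresponds to one Python string character; both representations are illustrative choices, not taken from the text.

```python
def is_standalone_word(elements):
    """Return True if the candidate is a single lexicon entry longer than one character.

    elements: the lexicon entries (strings) that make up one output candidate.
    """
    if len(elements) != 1:
        return False  # several entries means the candidate was composed
    return len(elements[0]) > 1  # single characters are not treated as words


standalone = is_standalone_word(["量将"])
composed = is_standalone_word(["旅客", "量将", "超过"])
single_char = is_standalone_word(["量"])
```

A candidate that fails this test because it contains several entries is the intelligent word composition case, so the same check can serve both preset conditions.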

According to another embodiment of the present invention, an input method system is also disclosed, including an input interface unit and a display unit, the input method system further including:

a lexicon that includes restriction information for entries, where the restriction information is used to restrict the ranking of a word when it is output on its own and/or to restrict the ranking of the word when it is output through intelligent word composition;

a candidate obtaining unit, configured to obtain output candidates according to the user's input information;

a judging unit, configured to judge whether an output candidate meets the preset condition for applying the restriction information; and

a candidate ranking unit, configured to, when the preset condition is met, extract the restriction information corresponding to the output candidate and rank the candidates according to the restriction information.

Preferably, the preset condition for applying the restriction information is: whether the output candidate is a word output on its own; or the preset condition for applying the restriction information is: whether the output candidate is a case of intelligent word composition.

Preferably, the judging unit further includes: a subunit for judging whether an output candidate contains only one element, where an element is a word stored in a preset lexicon; a subunit for judging whether the length of the output candidate is greater than one output character; and a subunit for determining, when the output candidate meets both of these conditions, that it is a word output on its own.

Preferably, the input interface unit, the display unit, and the lexicon of the input method system are located in the same computing device; or the input interface unit and the display unit of the input method system are located in a first computing device and the lexicon is located in a second computing device, and the input method system obtains the corresponding information from the second computing device according to the information entered by the user and displays the corresponding words on the first computing device.

Compared with the prior art, the embodiments of the present invention have the following advantages:

In the embodiments of the present invention, an input method lexicon that includes restriction word information is preset. When the user enters text, it is judged whether an output candidate meets the preset condition for applying the restriction information, and the result controls whether candidates carrying restriction word information are displayed and output. More effective output can thus be obtained without additional user operations (for example, in practice, the restriction word "量将" is not shown among the candidates when it would be output on its own, but still takes part in word composition in other cases), which greatly optimizes the character output process of the input method system and improves its intelligence.

Brief Description of the Drawings

FIG. 1 is a flowchart of the steps of Embodiment 1 of a method for obtaining restriction word information according to the present invention;

FIG. 2 is a flowchart of the steps of Embodiment 2 of a method for obtaining restriction word information according to the present invention;

FIG. 3 is a flowchart of the steps of an embodiment of a method for updating an input method lexicon according to the present invention;

FIG. 4 is a structural block diagram of an embodiment of an apparatus for obtaining restriction word information according to the present invention;

FIG. 5 is a flowchart of the steps of an embodiment of a method for optimizing output according to the present invention;

FIG. 6 is a schematic diagram of the word grid of a pinyin network segmentation method;

FIG. 7 is a structural block diagram of an embodiment of an input method system.

Detailed Description

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Referring to FIG. 1, Embodiment 1 of a method for obtaining restriction word information is shown, which may specifically include:

Step 101: Obtain a target word.

The target word can be obtained from the Internet, that is, by statistical analysis and filtering directly from an Internet corpus (for example, a collection of Internet web pages or a collection of search keywords), or it can be obtained from an existing lexicon. The present invention places no restriction on this, as long as a set of target words can be obtained; the size of the set can be chosen by those skilled in the art according to actual needs.

Preferably, an optimization step may also be applied to the obtained set of target words, using certain attributes of the target words to remove some words and thus narrow the set further. For example, words whose Internet word frequency or lexicon word frequency is less than or equal to a preset threshold are removed from the set; words that can be determined not to be restriction words (for example, general vocabulary found in a dictionary) are removed from the set; and so on. Of course, this optimization step can equally be performed while the set of target words is being obtained.
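The optional screening step can be sketched as a simple filter. The function name, the data layout, and the example frequencies are hypothetical; the only rules taken from the text are that words at or below a frequency threshold are dropped and that words known not to be restriction words are dropped.

```python
def prefilter_targets(targets, freq, min_freq, general_words):
    """Narrow the target set before feature extraction.

    Drops words whose frequency is less than or equal to min_freq and words
    already known not to be restriction words (for example, dictionary vocabulary).
    """
    return [
        word
        for word in targets
        if freq.get(word, 0) > min_freq and word not in general_words
    ]


# Hypothetical frequencies and dictionary contents.
targets = ["量将", "两江", "上一", "罕见词"]
freq = {"量将": 900, "两江": 600, "上一": 700, "罕见词": 2}
remaining = prefilter_targets(targets, freq, min_freq=5, general_words={"两江"})
```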

Step 102: acquire feature information corresponding to the target word;

Step 103: determine whether the feature information, or a result calculated from it, meets a preset condition; if so, determine that the target word is a restriction word and record the related restriction information, the restriction information being used to restrict the ranking of the word among the candidates when the word is output on its own.

For example, restriction words such as "量将" and "上一" do not appear among the candidates when output on their own, but are not restricted when output together with other words through intelligent phrase composition. A specific example: when "liangjiang" is input, the first candidate pre-output according to word frequency information is "量将"; because it carries a restriction information mark, it is removed from the candidates and not displayed. When "lvkeliangjiangchaoguo" is input, the candidate "旅客量将超过" ("passenger volume will exceed") is output, and in this case the word "量将" does not need to be restricted.

The restriction words and their restriction information obtained in this embodiment may be stored directly in an independent lexicon (or word list), for example by generating a lexicon (or word list) dedicated to storing the restriction words and their related restriction information. They may also be combined with general words to generate an input method lexicon, for example a lexicon that includes the restriction words, their related restriction information, and general words. They may also be added directly to the existing lexicon of an input method.

The restriction information may take the form of a flag (for example, marking the restriction word in the lexicon with 0 or 1) or of a specific numerical value (for example, a two-decimal-place number from 0 to 1) used to adjust the ranking of the candidates; not displaying the candidate at all is the extreme case. The resulting restriction words and their restriction information may be changed manually by the user or updated automatically by a server, as actual needs require.
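As a rough sketch of how such restriction information might adjust candidate ranking, the following treats the flag and the decimal value uniformly as a weight in [0, 1]. The scoring rule (frequency scaled by one minus the weight, with a weight of 1 hiding the word) is an assumption chosen for illustration; the source does not prescribe a formula.

```python
def rank_candidates(candidates, restriction):
    """Order standalone candidates by frequency, adjusted by restriction weight.

    candidates: list of (word, frequency) pairs.
    restriction: word -> weight in [0, 1]; a flag of 1 is the weight 1.0.
    Assumed rule: weight 1.0 hides the word, smaller weights only demote it.
    """
    scored = []
    for word, freq in candidates:
        weight = restriction.get(word, 0.0)
        if weight >= 1.0:
            continue  # extreme case: the candidate is not displayed at all
        scored.append((freq * (1.0 - weight), word))
    scored.sort(key=lambda pair: -pair[0])
    return [word for _, word in scored]


# Hypothetical frequencies for the candidates of "liangjiang".
candidates = [("量将", 900), ("两江", 800), ("良将", 500)]
hidden = rank_candidates(candidates, {"量将": 1.0})
demoted = rank_candidates(candidates, {"量将": 0.50})
```

With a flag of 1 the word "量将" disappears from the list; with a weight of 0.50 it is kept but moves behind the other two candidates.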

In this embodiment, the judgment condition varies with the feature information obtained. Steps 102 and 103 are illustrated below with several examples. The preset corpus may be any corpus; the feature values may be obtained by statistics or directly from experience or existing knowledge; and the feature values may be any kind of numerical value, such as a probability or a frequency. Note that the feature information and judgment conditions described below are only examples; a person skilled in the art may define more complex feature information and judgment conditions as needed, and the present invention does not limit this.

Example 1

The feature information is: the feature value of the first character of the target word as a word-initial character in the preset corpus, and the feature value of the last character of the target word as a word-final character in the preset corpus;

The preset condition used for the judgment is: whether at least one of the above feature values falls within a preset range. That is, if either the word-initial feature value or the word-final feature value is within the preset range, the target word can be determined to be a restriction word.

For example, the character "量" in "量将" rarely appears at the beginning of a word; if the word-initial frequency of "量" is less than or equal to a preset threshold, "量将" can be determined to be a restriction word. For a target word consisting of three or more characters, it is also possible to evaluate the feature value of a character at some position inside the word as occurring at the same position inside words of the preset corpus.
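Example 1 can be sketched as follows. The choice of feature value (the share of a character's occurrences that are word-initial or word-final), the tiny corpus, and the threshold are all assumptions made for illustration; the source allows any probability or frequency.

```python
from collections import Counter


def position_ratios(corpus_words):
    """Return (head_ratio, tail_ratio) per character.

    head_ratio[ch]: fraction of occurrences of ch that start a word.
    tail_ratio[ch]: fraction of occurrences of ch that end a word.
    corpus_words is assumed to be a list of non-empty segmented words.
    """
    total, head, tail = Counter(), Counter(), Counter()
    for word in corpus_words:
        total.update(word)
        head[word[0]] += 1
        tail[word[-1]] += 1
    head_ratio = {ch: head[ch] / n for ch, n in total.items()}
    tail_ratio = {ch: tail[ch] / n for ch, n in total.items()}
    return head_ratio, tail_ratio


def is_restriction_word(word, head_ratio, tail_ratio, threshold):
    """Flag the word if its first or last character is rare in that position."""
    return (head_ratio.get(word[0], 0.0) <= threshold
            or tail_ratio.get(word[-1], 0.0) <= threshold)


# Hypothetical segmented corpus.
corpus = ["数量", "质量", "流量", "重量", "量子", "将来", "将军", "即将"]
head_ratio, tail_ratio = position_ratios(corpus)
flagged = is_restriction_word("量将", head_ratio, tail_ratio, threshold=0.25)
```

In this toy corpus "量" starts a word in only one of its five occurrences, so "量将" is flagged. Characters never seen in the corpus get a ratio of 0 and are therefore also flagged, which is a design choice of this sketch rather than something stated in the source.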

Example 2

The feature information is: the feature values, in the preset corpus, of the linguistic collocation relations among the single-character words and/or multi-character words contained in the target word;

The preset condition used for the judgment is: whether at least one of the above feature values falls within a preset range.

The linguistic collocation relations may include several kinds of matching relation, such as word-to-word collocation parameters, word-to-part-of-speech collocation parameters, and part-of-speech-to-part-of-speech collocation parameters. A person skilled in the art may select or combine these matching relations according to actual needs.

For example, in the word "是玩", "是" is immediately followed by a verb, a collocation that is linguistically rare. Its collocation feature value (that is, the feature value of the "是 + verb" collocation) is therefore less than or equal to the preset threshold, and "是玩" can be determined to be a restriction word.

Example 3

The feature information is: the feature value of the target word being input on its own by users in an input method application;

The preset condition used for the judgment is: whether this feature value falls within a preset range.

The standalone user input may be a statistic for a single user or for a group of users, and may also be obtained by monitoring user input behavior.

For example, users rarely input the word "是玩" on its own, so when the measured feature value (for example, the standalone input frequency) is less than or equal to the preset threshold, "是玩" can be determined to be a restriction word.

In the following examples, a general word frequency is introduced into the judgment condition to further improve the accuracy with which restriction words are determined; the general word frequency may be an Internet word frequency or a lexicon word frequency. Aspects of the following examples that are similar to the preceding ones are not repeated; see the descriptions above.

Example 4

The feature information includes: the feature value of the first character of the target word as a word-initial character in the preset corpus; the feature value of the last character of the target word as a word-final character in the preset corpus; and the general word frequency of the target word;

The preset condition used for the judgment is: whether the ratio of at least one of the above feature values to the general word frequency of the target word falls within a preset range.

Example 5

The feature information includes: the feature values, in the preset corpus, of the linguistic collocation relations among the single-character words and/or multi-character words contained in the target word; and the general word frequency of the target word;

The preset condition used for the judgment is: whether the ratio of at least one of the above feature values to the general word frequency of the target word falls within a preset range.

Example 6

The feature information is: the user-ranked position of the target word among the candidate words for the same input code, and the original ranked position of the target word among the candidate words for the same input code. The user ranking is related to the feature value of the target word being input on its own by users in the input method application, and the original ranking is related to the general word frequency of the target word. In the simple case, the user ranking can be regarded as derived from the user lexicon, and the original ranking as derived from the system lexicon.

The preset condition used for the judgment is: whether the difference between the user-ranked position and the original ranked position falls within a preset range.
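The rank comparison of Example 6 might look like the sketch below. It assumes that both rankings are obtained by sorting the same candidate set by descending frequency and that the preset range is "dropped by at least `min_drop` positions"; the frequencies and names are hypothetical.

```python
def rank_of(word, freq_by_word):
    """0-based position of word when candidates are sorted by descending frequency."""
    ordered = sorted(freq_by_word, key=lambda w: -freq_by_word[w])
    return ordered.index(word)


def is_restriction_word(word, system_freq, user_freq, min_drop):
    """Flag a word that ranks much lower for users than in the system lexicon."""
    drop = rank_of(word, user_freq) - rank_of(word, system_freq)
    return drop >= min_drop


# Hypothetical frequencies for the candidates of one input code.
system_freq = {"量将": 900, "两江": 800, "良将": 500}  # original ranking source
user_freq = {"量将": 2, "两江": 40, "良将": 15}        # standalone user input counts
flagged = is_restriction_word("量将", system_freq, user_freq, min_drop=2)
```

Here "量将" is first in the original ranking but last in the user ranking, a drop of two positions, so it is flagged.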

Example 7

The feature information is: the feature value of the target word being input on its own by users in an input method application; and the general word frequency of the target word;

The preset condition used for the judgment is: whether the ratio of this feature value to the general word frequency of the target word falls within a preset range.

One specific implementation of Example 7 is as follows:

Count the general word frequency f_web of each word;

From the input records of the user population, count the frequency f_user with which each word is input on its own;

Compute alpha = f_user/f_web, and determine words whose alpha is far below the normal level to be restriction words;

Alternatively, compute alpha = f_user/f_web, and determine words whose alpha is far below the normal level and whose f_user is below a certain threshold to be restriction words;

Alternatively, compute alpha = f_user/f_web, and determine words whose alpha is far below the normal level and whose f_web is above a certain threshold to be restriction words.

Here, alpha is the calculation result, f_web is the general word frequency of a word, and f_user is the feature word frequency of that word.

Specifically, the alpha value of every target word can be computed and the words sorted by alpha in ascending order. A word whose alpha ranks at the top of this list (for example, in the first 5%) and whose own word frequency is high (for example, greater than 10000) is considered a restriction word.
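The procedure of Example 7 can be sketched directly. The function name, the sample counts, and the reading of "its own word frequency" as f_web are assumptions; the 5% and 10000 figures are the example values given in the text (the demonstration uses 25% only because the sample has four words).

```python
def find_restriction_words(f_web, f_user, top_fraction=0.05, min_web_freq=10000):
    """Return words with the lowest alpha = f_user / f_web and a high f_web.

    f_web: word -> general word frequency.
    f_user: word -> standalone input frequency from user records.
    """
    alpha = {w: f_user.get(w, 0) / f_web[w] for w in f_web if f_web[w] > 0}
    ranked = sorted(alpha, key=alpha.get)  # ascending alpha
    cutoff = max(1, int(len(ranked) * top_fraction))
    return [w for w in ranked[:cutoff] if f_web[w] > min_web_freq]


# Hypothetical counts, for illustration only.
f_web = {"量将": 52000, "两江": 30000, "良将": 12000, "旅客": 80000}
f_user = {"量将": 5, "两江": 2400, "良将": 900, "旅客": 7200}
restricted = find_restriction_words(f_web, f_user, top_fraction=0.25)
```

"量将" is common on the web but almost never typed on its own, so its alpha is by far the smallest and it is the only word returned.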

Note that the judgment conditions in the above examples may also be used in combination. A person skilled in the art may define many kinds of determination method as needed, and they cannot all be listed here.

In a preferred embodiment of the present invention, the restriction information may include: a weight restricting standalone output of the restriction word in each preset scenario. That is, the restriction word may carry restriction information for different application scenarios rather than a single general value. For example, the user's application scenario is determined from the program in which the input method is currently running; when the user is typing in Word, the restriction information value for the corresponding preset scenario (for example, a work-language environment) is used.
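A per-scenario lookup of this kind could be organized as below. The scenario names, the program-to-scenario table, and the fallback to a general value are all hypothetical and are only meant to show the shape of the data.

```python
DEFAULT_SCENARIO = "general"

# Hypothetical mapping from the foreground program to a preset scenario.
SCENARIO_BY_PROGRAM = {"winword.exe": "work", "qq.exe": "chat"}


def scenario_for(program_name):
    """Pick the preset scenario for the program the user is typing in."""
    return SCENARIO_BY_PROGRAM.get(program_name.lower(), DEFAULT_SCENARIO)


def restriction_weight(word, scenario, weights_by_word):
    """Return the scenario-specific weight, falling back to the general one."""
    per_scenario = weights_by_word.get(word, {})
    return per_scenario.get(scenario, per_scenario.get(DEFAULT_SCENARIO, 0.0))


# Hypothetical restriction information for one restriction word.
weights = {"上一": {"work": 0.9, "general": 0.5}}
in_word = restriction_weight("上一", scenario_for("WINWORD.EXE"), weights)
elsewhere = restriction_weight("上一", scenario_for("notepad.exe"), weights)
```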

Further, the restriction information may also include: linguistic collocation parameters of the restriction word in the preset corpus, used to restrict the ranking of the word when it is output through intelligent phrase composition. That is, some restriction words need to be restricted both when output on their own and when output through intelligent phrase composition. For example, the word "上一" needs to be restricted when output on its own so that it appears among the candidates as little as possible; and when "上一" and "里" are composed, the composition is likewise restricted according to the collocation relation so that it appears among the candidates as little as possible.

The restriction information may include all linguistic collocation parameters of the word in the preset corpus (for example, part-of-speech collocation parameters), or only the collocation parameters that are needed. For example, an output-restriction threshold is set, and a linguistic collocation parameter is saved only if it is less than or equal to that threshold.

Note that the preset corpus information may be Internet corpus information and/or user input record corpus information. The Internet corpus information may be obtained by using a web spider to crawl a large number of web pages from the Internet. The user input record corpus may include direct and indirect information: for example, records of the characters a user has input can serve as direct information, and statistics on the distribution of those characters can serve as indirect information. The preset corpus information may also be configured by a person skilled in the art according to need or experience, and the present invention does not limit this.

Referring to FIG. 2, Embodiment 2 of a method for acquiring restriction word information is shown, which may include:

Step 201: acquire a target word;

Step 202: acquire the linguistic collocation parameters of the target word in the preset corpus;

Step 203: determine whether the linguistic collocation parameters meet a preset condition; if so, record restriction information for the target word, the restriction information including the corresponding linguistic collocation parameters and being used to restrict the ranking of the word when it is output through intelligent phrase composition.

For example, the collocation parameter value of "上一" with locative words is very low. This collocation parameter is recorded in the restriction information of "上一"; then, if a candidate produced during intelligent phrase composition is a combination of "上一" and a locative word, that candidate is removed.

As another example, the collocation parameter of "讲" with verbs is less than a predetermined threshold. This collocation parameter is recorded in the restriction information of "讲"; then, if a candidate is a combination of "讲" and a verb, "讲" is removed from the intelligent phrase composition sequence.
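The two examples above amount to pruning composed candidates whose adjacent words form a restricted collocation. A minimal sketch, assuming the restriction information has already been reduced to "word followed by part of speech" pairs and that a part-of-speech table is available (both tables here are hypothetical):

```python
def allowed(sequence, pos_of, restricted_next_pos):
    """Return False if any adjacent pair in the sequence is a restricted collocation.

    sequence: list of words in one composed candidate.
    pos_of: word -> part of speech.
    restricted_next_pos: restriction word -> parts of speech it should not precede.
    """
    for left, right in zip(sequence, sequence[1:]):
        if pos_of.get(right) in restricted_next_pos.get(left, ()):
            return False
    return True


# Hypothetical part-of-speech table and recorded restriction information.
pos_of = {"里": "locative", "次": "measure", "玩": "verb", "课": "noun"}
restricted_next_pos = {"上一": {"locative"}, "讲": {"verb"}}

paths = [["上一", "里"], ["上一", "次"], ["讲", "玩"], ["讲", "课"]]
kept = [p for p in paths if allowed(p, pos_of, restricted_next_pos)]
```

Only "上一次" and "讲课" survive; the combinations with a locative word and with a verb are removed before ranking.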

Preferably, the linguistic collocation parameter may be a single general parameter, or it may include sub-parameters for each preset scenario. The linguistic collocation parameters may include word-to-word, word-to-part-of-speech, and part-of-speech-to-part-of-speech collocation parameters, among others. Their values may be expressed as adjacent co-occurrence frequencies, co-occurrence probabilities, connection strength values, and so on; these values may be obtained by statistics from any preset corpus or directly from existing experience or knowledge.

Note that the above filtering step removes qualifying restriction words from the intelligent phrase composition sequence, which reduces the search space during composition and improves its efficiency.

Referring to FIG. 3, an embodiment of a method for updating an input method lexicon is shown, which may specifically include:

Step 301: acquire a target word;

Step 302: acquire feature information corresponding to the target word;

Step 303: determine whether the feature information, or a result calculated from it, meets a preset condition; if so, determine that the target word is a restriction word and record the related restriction information, the restriction information being used to restrict the ranking of the word when it is output on its own and/or when it is output through intelligent phrase composition;

Step 304: add the restriction word and its related restriction information to the existing lexicon of the input method.

This embodiment is applicable where a server obtains restriction word information and then promptly updates the existing lexicon of the input method with it. The updated restriction information may include the restriction information obtained in the embodiments of FIG. 2 and FIG. 3 above; that is, it may include information for restricting the ranking of the word when output on its own, information for restricting its ranking when output through intelligent phrase composition, or both. For example, the restriction information may include: a weight restricting standalone output of the restriction word in each preset scenario.

Alternatively, the restriction information may be added to a server-side lexicon, and the new lexicon then published as a whole. The specific update transmission method is not detailed here.

The addition in step 304 may be performed in various ways, for example:

The addition is: determine whether the restriction word already exists in the original lexicon; if it does, record only its related restriction information in the existing lexicon of the input method;

Alternatively, the addition is: record the restriction word and its related restriction information directly in the existing lexicon of the input method, overwriting the original entry if the entry is duplicated;

Alternatively, the addition is: store the restriction words and their related restriction information as an independent restriction word list, the restriction word list and the existing lexicon of the input method cooperating to rank the candidates.

Referring to FIG. 4, an embodiment of an apparatus for acquiring restriction word information is shown, which may specifically include:

a target word acquisition unit 401, configured to acquire a target word;

a feature information acquisition unit 402, configured to acquire feature information corresponding to the target word;

a restriction information acquisition unit 403, configured to determine whether the feature information, or a result calculated from it, meets a preset condition, and if so, to determine that the target word is a restriction word and record the related restriction information, the restriction information being used to restrict the ranking of the word when it is output on its own and/or when it is output through intelligent phrase composition.

Referring to FIG. 5, an embodiment of a method for optimizing output is shown, which may specifically include:

Step 501: receive user input information and convert the input information;

The input information may include a coded character string, handwritten input, or voice input, since all of these input modes rely on a lexicon to rank candidates. That is, the present invention can be applied to input method platforms for various input modes, including keyboard symbols, handwriting, and voice input. Because the information conversion processes for these input modes are well known, they are not described here.

For example, when the user types, the input method system segments the coded character string that was input. Taking the segmentation of a pinyin coded string as a simple illustration: segmenting a pinyin string generally yields several segmentation schemes. For example, the pinyin coded string "fangan" can be segmented as "fang'an" or as "fan'gan". The segmentation may use any existing method, and the present invention does not limit this.

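To illustrate how one pinyin string yields several segmentation schemes, the sketch below enumerates every split of "fangan" against a syllable table. The recursive enumeration and the four-entry table are assumptions for the example; the source leaves the segmentation method open.

```python
def segmentations(code, syllables):
    """Return every way to split code into syllables from the given set."""
    if not code:
        return [[]]
    results = []
    for end in range(1, len(code) + 1):
        head = code[:end]
        if head in syllables:
            # Extend each segmentation of the remainder with this syllable.
            for rest in segmentations(code[end:], syllables):
                results.append([head] + rest)
    return results


# A tiny, hypothetical syllable table; a real one lists all valid pinyin syllables.
SYLLABLES = {"fan", "fang", "an", "gan"}
splits = sorted("'".join(s) for s in segmentations("fangan", SYLLABLES))
```

Both "fan'gan" and "fang'an" are produced, matching the two schemes mentioned in the text.
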
Step 502: obtain output candidates;

Taking a pinyin network segmentation method as an example, obtaining output candidates from the segmented coded string amounts to automatically converting the continuous input pinyin stream into the corresponding text stream. Specifically: a given continuous pinyin stream A can be segmented by some pinyin stream segmentation algorithm into a pinyin sequence A1 A2 ... Am, where the set of homophone words corresponding to each pinyin Ai can be represented by a column of nodes Wi1 Wi2 ... Wi3. The pinyin sequence A1 A2 ... Am therefore corresponds to m columns of candidate homophone nodes, which together form a candidate homophone word matrix. Connecting adjacent nodes with directed edges yields a word lattice. The word lattice constitutes the state space of the Chinese character input problem, so the pinyin-to-character conversion problem becomes the problem of searching for an optimal path in the word lattice.

For example, the input pinyin stream "zheshiyizhipiaoliangdemao" is segmented into the pinyin sequence "zhe'shi'yi'zhi'piaoliang'de'mao"; the word lattice corresponding to this sequence is shown in FIG. 6.

Then the system's language rule base is queried and rules are matched: all nodes in adjacent columns that match some language rule are recursively bound into language element nodes, forming an element lattice. The element lattice constitutes a new state space for the pinyin-to-character conversion. Using the Viterbi dynamic programming algorithm, the probability values from the system's bigram statistics library and bigram learning library are combined by weighting to compute the probabilities of the candidate words in the element lattice, and the candidate with the highest probability is output as the conversion result.
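The path search described above can be sketched as a Viterbi pass over the columns of a word lattice. This simplified version works on the plain word lattice and omits the rule-based binding into element nodes; the linear interpolation weight, the probability floor, the "<s>" start marker, and all probabilities are assumptions made for illustration.

```python
import math


def viterbi(columns, bigram_sys, bigram_user, user_weight=0.3, floor=1e-6):
    """Return the most probable word path through the lattice columns.

    columns: list of candidate-word lists, one per pinyin segment.
    bigram_sys / bigram_user: (previous word, word) -> probability from the
    statistics library and the learning library respectively.
    """
    def score(prev, word):
        # Weighted combination of the two bigram sources, in log space.
        p = ((1 - user_weight) * bigram_sys.get((prev, word), 0.0)
             + user_weight * bigram_user.get((prev, word), 0.0))
        return math.log(max(p, floor))

    # best[word] = (log probability of best path ending in word, that path)
    best = {w: (score("<s>", w), [w]) for w in columns[0]}
    for column in columns[1:]:
        nxt = {}
        for word in column:
            prev = max(best, key=lambda p: best[p][0] + score(p, word))
            total, path = best[prev]
            nxt[word] = (total + score(prev, word), path + [word])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical lattice and bigram probabilities.
columns = [["这", "着"], ["是", "时"], ["猫", "毛"]]
bigram_sys = {("<s>", "这"): 0.6, ("<s>", "着"): 0.1, ("这", "是"): 0.7,
              ("这", "时"): 0.2, ("着", "是"): 0.05, ("是", "猫"): 0.3,
              ("是", "毛"): 0.05}
bigram_user = {("是", "猫"): 0.5}
best_path = viterbi(columns, bigram_sys, bigram_user)
```

Each column keeps only the best path into each of its nodes, so the cost grows with the number of edges between adjacent columns rather than with the number of complete paths.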

Of course, a person skilled in the art may use any method to obtain the output candidates, and the present invention does not limit this.

Step 503: determine whether a preset condition for applying restriction information is met;

Step 504: if so, extract the restriction information corresponding to the output candidates, and rank and display the candidates according to the restriction information.

Ranking the candidates according to the restriction information may be implemented by directly setting the display position or order, or by adjusting the word frequency (including but not limited to weighting it up or down); the most extreme case is removing the word from the candidates so that it is not displayed.

When a word has restriction information that restricts standalone output, the preset condition for applying restriction information may be: whether the output candidate is a word output on its own. The restriction information may be obtained through the following steps: acquire a target word; acquire feature information corresponding to the target word; determine whether the feature information, or a result calculated from it, meets a preset condition; and if so, record the related restriction information for the target word.

When a word has restriction information that restricts composed output, the preset condition for applying restriction information may be: whether the output candidate is a case of intelligent phrase composition. The restriction information may be obtained through the following steps: acquire a target word; acquire the linguistic collocation parameters of the target word in the preset corpus; determine whether the linguistic collocation parameters meet a preset condition; and if so, record restriction information for the target word, the restriction information including the corresponding linguistic collocation parameters and being used to restrict the ranking of the word when it is output through intelligent phrase composition.

Preferably, when it is necessary to determine whether the output candidate is a word output on its own, this can be done through the following steps:

For the coded character string input by the user, first obtain all possible output candidates; then determine whether an output candidate contains only one element and is longer than 1 output character, an element being a word stored in the preset lexicon; if so, determine that the output candidate is a word output on its own. Whether a candidate contains only one element can be determined by querying the lexicon through ID mapping, or by counting the element IDs the candidate contains.

The 1 output character may be a character of a different byte length or other length in different input method systems; for example, for Chinese, Japanese, or Korean input methods, 1 output character is a character of 2 bytes. The length may be determined by reading a length parameter preset in the lexicon, which may be stored against the word ID in the attributes of the corresponding entry, or by directly obtaining the length of the output candidate; other existing methods are also feasible, and the present invention does not limit this.

例如, 对于用户输入编码字符串 "liangjiangzong" 的情况而言, 针对 该编码字符串做完拼音网络切分之后, 得到的各个可能的候选项为: 两江 总、 量将、 两江、 良将等等。 其中, 假设每个候选项可以表示为<词条 1 , 属性 1>、 <词条 2, 属性 2> ; 或者, <词条 1的 ID, 属性 1>、 <词条 2 的 ID, 属性 2>。  For example, for the case where the user inputs the encoded string "liangjiangzong", after completing the Pinyin network segmentation for the encoded string, the possible candidates are: Liangjiang Total, Volume, Liangjiang, Liangjiang, and so on. Here, it is assumed that each candidate can be expressed as <entry 1, attribute 1>, <entry 2, attribute 2>; or, <ID of the entry 1, attribute 1>, <ID of the entry 2, attribute 2 >.

比如, 对于候选项 "两江总", 就可以表示为: <两江 pl>、 <总 p2>; 对于候选项 "量将", 就可以表示为: <量将 ql>;  For example, for the candidate "two rivers total", it can be expressed as: <two rivers pl>, <total p2>; for the candidate "quantity", it can be expressed as: <quantity will be ql>;

The candidate <量将, q1> contains only one element and is longer than one output character, so it is further determined whether its attribute q1 contains a restriction information flag. Because it carries such a flag (for example, tag is non-zero), the candidate is not output. The attribute q1 may also include the length parameter.

That is, the candidates finally output are: 两江总, 两江, 良将.
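The "liangjiangzong" example can be reproduced with a short sketch. The representation of a candidate as a list of (entry, attributes) pairs follows the text, but the attribute key `tag` as a dictionary field and the tag values assigned to entries other than 量将 are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# One element of a candidate: (entry text, attributes). The "tag" key is hypothetical.
Element = Tuple[str, Dict[str, int]]


def filter_candidates(candidates: List[List[Element]]) -> List[str]:
    """Drop candidates that are separately output words carrying a restriction flag."""
    kept = []
    for elements in candidates:
        text = "".join(entry for entry, _ in elements)
        # A separately output word: one element, longer than one output character.
        standalone = len(elements) == 1 and len(text) > 1
        restricted = standalone and elements[0][1].get("tag", 0) != 0
        if not restricted:
            kept.append(text)
    return kept


candidates = [
    [("两江", {"tag": 0}), ("总", {"tag": 0})],  # two elements: not a separate output
    [("量将", {"tag": 1})],                      # one element, flagged: suppressed
    [("两江", {"tag": 0})],
    [("良将", {"tag": 0})],
]
print(filter_candidates(candidates))  # ['两江总', '两江', '良将']
```

In this sketch a flagged candidate is simply not output, as in the example; the restriction information could equally be used to lower its ranking rather than remove it.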

In general, a candidate that is not output separately is output through intelligent phrase composition, so the above process can also be used to identify the intelligent phrase composition case.

Of course, when the user has entered only two syllables, the above determination may be skipped and the candidate determined directly to be a separate output, because two syllables generally do not amount to intelligent phrase composition. For example, for a coded character string entered by the user that does not need to be segmented, the output candidate obtained is determined to be a separately output word; or, an output candidate for which the entered coded character string corresponds to a single entry in the lexicon is determined to be a separately output word.

Referring to FIG. 7, an embodiment of an input method system is shown, which may specifically include an input interface unit 701 and a display unit 702, as well as:

Lexicon 703: the lexicon includes restriction information, which may be any of the kinds of restriction information described above. The restriction information may also exist in various forms, for example as a word list within the lexicon, or as marks attached to the corresponding entries in the lexicon.

Candidate obtaining unit 704, configured to obtain output candidates according to the user's input information;

Judging unit 705, configured to determine whether an output candidate meets the preset condition for applying restriction information;

Candidate sorting unit 706, configured to, when the preset condition is met, extract the restriction information corresponding to the output candidate and sort the candidates according to that restriction information.
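One way the judging unit and the candidate sorting unit might cooperate is sketched below. The scoring rule is an assumption chosen purely for illustration: the text says only that candidates are sorted according to the restriction information, so the `base_score` and `standalone_penalty` fields and the multiplicative penalty are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RankedCandidate:
    text: str
    base_score: float                           # hypothetical score, e.g. from word frequency
    num_elements: int                           # number of lexicon elements in the candidate
    standalone_penalty: Optional[float] = None  # hypothetical restriction weight in [0, 1]


def sort_candidates(candidates: List[RankedCandidate]) -> List[RankedCandidate]:
    """Sort candidates, demoting restricted words only when they are output separately."""

    def score(c: RankedCandidate) -> float:
        # Preset condition: the candidate is a separately output word.
        standalone = c.num_elements == 1 and len(c.text) > 1
        if standalone and c.standalone_penalty is not None:
            return c.base_score * (1.0 - c.standalone_penalty)
        return c.base_score

    return sorted(candidates, key=score, reverse=True)


ranked = sort_candidates([
    RankedCandidate("量将", base_score=0.9, num_elements=1, standalone_penalty=0.8),
    RankedCandidate("良将", base_score=0.5, num_elements=1),
])
print([c.text for c in ranked])  # ['良将', '量将']
```

A weight of 1.0 in this sketch pushes a restricted word to the bottom of the list, which approximates the "not output" behaviour of the earlier example.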

The lexicon 703 may include entry information and restriction word information; that is, restriction word information may be recorded in an existing lexicon for the words that meet the preset condition. In another preferred case, the lexicon 703 includes a basic lexicon and a restriction word list, the restriction word list being a word list that records the words having restriction word information. In this case, the words that meet the preset condition and their restriction information can be stored independently as a restriction word list, and the restriction word list and the basic lexicon together form the input method lexicon of this embodiment. Of course, it is also feasible for those skilled in the art to preset the input method lexicon by other prior-art methods, and the present invention places no limitation on this.
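The second arrangement, a basic lexicon paired with an independent restriction word list, can be sketched as two mappings consulted together at lookup time. The dictionary layout, the `freq` and `tag` keys, and the sample values are hypothetical.

```python
from typing import Dict, Optional

# Basic lexicon: entry text -> ordinary entry attributes (hypothetical layout and values).
basic_lexicon: Dict[str, Dict[str, int]] = {
    "两江": {"freq": 120},
    "量将": {"freq": 3},
}

# Restriction word list, stored independently of the basic lexicon.
restriction_word_list: Dict[str, Dict[str, int]] = {
    "量将": {"tag": 1},
}


def lookup(word: str,
           lexicon: Dict[str, Dict[str, int]],
           restrictions: Dict[str, Dict[str, int]]) -> Optional[Dict[str, int]]:
    """Return the entry's attributes merged with any restriction information."""
    entry = lexicon.get(word)
    if entry is None:
        return None
    merged = dict(entry)                       # copy so the basic lexicon stays unchanged
    merged.update(restrictions.get(word, {}))  # add restriction info when present
    return merged
```

Keeping the restriction word list separate means it can be replaced or updated without rewriting the basic lexicon, at the cost of one extra lookup per candidate.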

Preferably, when a word has restriction information limiting its separate output, the preset condition for applying restriction information may be: whether the output candidate is a separately output word. The judging unit may further include:

a subunit for determining whether an output candidate contains only one element, where an element is a word or character stored in the preset lexicon;

a subunit for determining whether the length of the output candidate is greater than one output character; and

a subunit for determining that the output candidate is a separately output word when it satisfies both of the above conditions.

When a word has restriction information limiting its output in phrase composition, the preset condition for applying restriction information may be: whether the output candidate belongs to the intelligent phrase composition case. This may also be determined by the method described above: a candidate that does not satisfy the conditions for a separately output word belongs to the intelligent phrase composition case.

The above input method system may be an ordinary input method system, in which, for example, the input interface unit, the display unit, and the lexicon are located in the same computing device. It may also be a network input method system, in which, for example, the input interface unit and the display unit are located in a first computing device and the lexicon is located in a second computing device; the input method system obtains the corresponding information from the second computing device according to the information entered by the user and displays the corresponding word candidates on the first computing device.

Since the foregoing embodiments are all based on the same inventive concept, each description focuses on the differences from the others; for the similarities, refer to the corresponding parts of this specification.

The method and apparatus for obtaining restriction word information, the method for updating a lexicon, the method for optimizing output, and the input method system provided by the present invention have been described in detail above. The description is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for obtaining restriction word information, characterized by comprising:
obtaining a target word;
obtaining feature information corresponding to the target word;
determining whether the feature information, or a calculation result derived from it, meets a preset condition and, if so, determining that the target word is a restriction word and recording related restriction information, the restriction information being used to restrict the ranking of the word when it is output separately.

2. The method of claim 1, characterized in that:
the feature information is: a feature value of the first character of the target word as a word-initial character in a preset corpus, and a feature value of the last character of the target word as a word-final character in the preset corpus;
the preset condition used for the determination is: whether at least one of the feature values falls within a preset range.

3. The method of claim 1, characterized in that:
the feature information is: feature values, in a preset corpus, of the linguistic collocation relationships of the single-character words and/or multi-character words contained in the target word;
the preset condition used for the determination is: whether at least one of the feature values falls within a preset range.

4. The method of claim 1, characterized in that:
the feature information is: a feature value of the target word being input separately by users in an input method application;
the preset condition used for the determination is: whether the feature value falls within a preset range.

5. The method of claim 1, characterized in that:
the feature information includes: a feature value of the first character of the target word as a word-initial character in a preset corpus; a feature value of the last character of the target word as a word-final character in the preset corpus; and the general word frequency of the target word;
the preset condition used for the determination is: whether the ratio of at least one of the feature values to the general word frequency of the target word falls within a preset range.

6. The method of claim 1, characterized in that:
the feature information includes: feature values, in a preset corpus, of the linguistic collocation relationships of the single-character words and/or multi-character words contained in the target word; and the general word frequency of the target word;
the preset condition used for the determination is: whether the ratio of at least one of the feature values to the general word frequency of the target word falls within a preset range.

7. The method of claim 1, characterized in that:
the feature information is: a feature value of the target word being input separately by users in an input method application; and the general word frequency of the target word;
the preset condition used for the determination is: whether the ratio of the feature value to the general word frequency of the target word falls within a preset range.

8. The method of claim 1, characterized in that:
the feature information is: user ranking position information of the target word among the candidate words for the same input code; and original ranking position information of the target word among the candidate words for the same input code; wherein the user ranking information is related to the feature value of the target word being input separately by users in an input method application, and the original ranking information is related to the general word frequency of the target word;
the preset condition used for the determination is: whether the difference between the user ranking position information and the original ranking position information falls within a preset range.

9. The method of any one of claims 1 to 8, characterized by further comprising, before the step of obtaining the feature information: a step of optimized screening of target words.

10. The method of any one of claims 1 to 8, characterized in that the restriction information includes: weights of the restriction word that restrict its separate output in each preset scenario.

11. The method of any one of claims 1 to 8, characterized in that:
the restriction information further includes: a linguistic collocation parameter of the restriction word in a preset corpus;
the linguistic collocation parameter is used to restrict the ranking of the word in intelligent phrase composition output.

12. The method of any one of claims 1 to 8, characterized by further comprising:
generating a lexicon or word list that includes the restriction words and their related restriction information;
or, generating a lexicon that includes the restriction words and their related restriction information, as well as general words.

13. A method for obtaining restriction word information, characterized by comprising:
obtaining a target word;
obtaining a linguistic collocation parameter of the target word in a preset corpus;
determining whether the linguistic collocation parameter meets a preset condition and, if so, recording restriction information of the target word, the restriction information including the corresponding linguistic collocation parameter and being used to restrict the ranking of the word in intelligent phrase composition output.

14. The method of claim 13, characterized in that:
the linguistic collocation parameter is a general parameter;
or, the linguistic collocation parameter includes sub-parameters for each preset scenario.

15. A method for updating a lexicon, characterized by comprising:
obtaining a target word;
obtaining feature information corresponding to the target word;
determining whether the feature information, or a calculation result derived from it, meets a preset condition and, if so, determining that the target word is a restriction word and recording related restriction information, the restriction information being used to restrict the ranking of the word when it is output separately and/or to restrict the ranking of the word in intelligent phrase composition output;
adding the restriction word and its related restriction information to the existing lexicon of the input method.

16. The method of claim 15, characterized in that:
the adding is: determining whether the restriction word already exists in the original lexicon and, if so, recording only its related restriction information in the existing lexicon of the input method;
or, the adding is: recording the restriction word and its related restriction information directly in the existing lexicon of the input method, overwriting the original entry if the entry is duplicated;
or, the adding is: storing the restriction word and its related restriction information as a restriction word list, the restriction word list and the existing lexicon of the input method being used together to complete candidate sorting.

17. The method of claim 15, characterized in that the restriction word has restriction information for each preset scenario.

18. An apparatus for obtaining restriction word information, characterized by comprising:
a target word obtaining unit, configured to obtain a target word;
a feature information obtaining unit, configured to obtain feature information corresponding to the target word;
a restriction information obtaining unit, configured to determine whether the feature information, or a calculation result derived from it, meets a preset condition and, if so, to determine that the target word is a restriction word and record related restriction information, the restriction information being used to restrict the ranking of the word when it is output separately and/or to restrict the ranking of the word in intelligent phrase composition output.

19. A method for optimizing output, characterized by comprising:
receiving user input information and converting the input information;
obtaining output candidates;
determining whether an output candidate meets a preset condition for applying restriction information;
if so, extracting the restriction information corresponding to the output candidate and sorting the candidates according to the restriction information.

20. The method of claim 19, characterized in that:
the preset condition for applying restriction information is: whether the output candidate is a separately output word;
or, the preset condition for applying restriction information is: whether the output candidate belongs to the intelligent phrase composition case.

21. The method of claim 19, characterized in that the restriction information is obtained by the following steps:
obtaining a target word;
obtaining feature information corresponding to the target word;
determining whether the feature information, or a calculation result derived from it, meets a preset condition and, if so, recording related restriction information for the target word.

22. The method of claim 20, characterized in that, when it is to be determined whether the output candidate is a separately output word, the determination is made by the following steps:
determining whether an output candidate contains only one element and is longer than one output character, an element being a word or character stored in a preset lexicon;
if so, determining that the output candidate is a separately output word.

23. An input method system, comprising an input interface unit and a display unit, characterized in that the input method system further comprises:
a lexicon, the lexicon including restriction information for entries, the restriction information being used to restrict the ranking of a word when it is output separately and/or to restrict the ranking of the word in intelligent phrase composition output;
a candidate obtaining unit, configured to obtain output candidates according to the user's input information;
a judging unit, configured to determine whether an output candidate meets a preset condition for applying restriction information;
a candidate sorting unit, configured to, when the preset condition is met, extract the restriction information corresponding to the output candidate and sort the candidates according to the restriction information.

24. The system of claim 23, characterized in that:
the preset condition for applying restriction information is: whether the output candidate is a separately output word;
or, the preset condition for applying restriction information is: whether the output candidate belongs to the intelligent phrase composition case.

25. The input method system of claim 23, characterized in that the judging unit further comprises:
a subunit for determining whether an output candidate contains only one element, an element being a word or character stored in a preset lexicon;
a subunit for determining whether the length of the output candidate is greater than one output character; and
a subunit for determining that the output candidate is a separately output word when it satisfies both of the above conditions.

26. The input method system of claim 24, characterized in that the input interface unit, the display unit, and the lexicon of the input method system are located in the same computing device;
or, the input interface unit and the display unit of the input method system are located in a first computing device and the lexicon is located in a second computing device, and the input method system obtains the corresponding information from the second computing device according to the information entered by the user and displays the corresponding words on the first computing device.
PCT/CN2008/071064 2007-05-25 2008-05-23 The method for obtaining restriction word information, optimizing output and the input method system Ceased WO2008145055A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710099644.0 2007-05-25
CNB2007100996440A CN100483417C (en) 2007-05-25 2007-05-25 Method for catching limit word information, optimizing output and input method system

Publications (1)

Publication Number Publication Date
WO2008145055A1 true WO2008145055A1 (en) 2008-12-04

Family

ID=38795424

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/071064 Ceased WO2008145055A1 (en) 2007-05-25 2008-05-23 The method for obtaining restriction word information, optimizing output and the input method system

Country Status (2)

Country Link
CN (1) CN100483417C (en)
WO (1) WO2008145055A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111381684A (en) * 2018-12-28 2020-07-07 北京搜狗科技发展有限公司 Method and device for shielding gray self-made phrase
CN112083814A (en) * 2020-08-28 2020-12-15 的卢技术有限公司 A thesaurus generation method based on AI and cloud computing

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100483417C (en) * 2007-05-25 2009-04-29 北京搜狗科技发展有限公司 Method for catching limit word information, optimizing output and input method system
US8407236B2 (en) 2008-10-03 2013-03-26 Microsoft Corp. Mining new words from a query log for input method editors
CN102141868B (en) * 2010-01-28 2013-08-14 北京搜狗科技发展有限公司 Method for quickly operating information interaction page, input method system and browser plug-in
CN102193639B (en) * 2010-03-04 2014-03-12 阿里巴巴集团控股有限公司 Method and device of statement generation
CN102495679A (en) * 2011-12-01 2012-06-13 上海量明科技发展有限公司 Composite spelling input method, word bank and system thereof
CN103365875B (en) * 2012-03-29 2018-05-11 百度在线网络技术(北京)有限公司 A kind of method and apparatus for being used to provide contact object in current application
CN107589855B (en) * 2012-05-29 2021-05-28 阿里巴巴集团控股有限公司 Method and device for recommending candidate words according to geographic positions
CN103869998B (en) * 2012-12-11 2018-05-01 百度国际科技(深圳)有限公司 A kind of method and device being ranked up to candidate item caused by input method
CN106156056B (en) * 2015-03-27 2020-03-06 联想(北京)有限公司 Text mode learning method and electronic equipment
CN105094368B (en) * 2015-07-24 2018-05-15 上海二三四五网络科技有限公司 A kind of control method and control device that frequency modulation sequence is carried out to candidates of input method
CN105955495A (en) * 2016-04-29 2016-09-21 百度在线网络技术(北京)有限公司 Information input method and device
CN107390896B (en) * 2017-07-21 2019-12-03 深圳市鹰硕技术有限公司 A kind of the dictionary management method and device of input method
CN107424461B (en) * 2017-08-01 2019-12-03 深圳市鹰硕技术有限公司 Information screen method and system
CN108509555B (en) * 2018-03-22 2021-07-23 武汉斗鱼网络科技有限公司 Search term determination method, device, device and storage medium
CN108733831B (en) * 2018-05-25 2022-05-17 腾讯音乐娱乐科技(深圳)有限公司 Method and device for processing word stock

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1206871A (en) * 1997-07-25 1999-02-03 上海欧姆龙计算机有限公司 Automatic logging method and device for phonetic words relation table in Chinese character input system
CN1369776A (en) * 2001-02-15 2002-09-18 英业达股份有限公司 A method of adjusting word frequency
JP2006050160A (en) * 2004-08-03 2006-02-16 Sharp Corp Chinese input device, Chinese input program, and Chinese input recording medium
CN1783066A (en) * 2004-11-29 2006-06-07 佛山市顺德区瑞图万方科技有限公司 Method for establishing associated input system and correspondent associated input system and method
CN1920827A (en) * 2006-08-23 2007-02-28 北京搜狗科技发展有限公司 Method for obtaining newly encoded character string, input method system and word stock generation device
CN1954315A (en) * 2004-03-16 2007-04-25 Google公司 Systems and methods for translating chinese pinyin to chinese characters
CN101055588A (en) * 2007-05-25 2007-10-17 北京搜狗科技发展有限公司 Method for catching limit word information, optimizing output and input method system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HEYES H. ET AL.: "The System for Inputting the Chinese Sentences According To Analyzing the Parsing of the Phrases Based on the Pinyin", COMPUTER AND APPLICATION, February 1999 (1999-02-01), pages 28 - 30 *

Also Published As

Publication number Publication date
CN101055588A (en) 2007-10-17
CN100483417C (en) 2009-04-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08757493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08757493

Country of ref document: EP

Kind code of ref document: A1