HK1128541B - System and method for pinyin checking with confusable pinyin recognition - Google Patents
- Publication number
- HK1128541B (HK09108175.1A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- pinyin
- storage unit
- word
- index
- chinese character
Description
Technical Field
The invention relates to pinyin checking technology, and in particular to pinyin checking with confusable pinyin recognition.
Background
With the rapid development of science and technology, computers have gradually entered every corner of society, and their wide use has become an inevitable trend in the development of modern society. However, because computers were invented and first applied mainly in the West, their popularization in China inevitably met obstacles, the most important being language and script. Since computers generally display and operate with English letters, most Chinese people find it very difficult to operate a computer in English. The use and popularization of computers in China have therefore been limited by the bottleneck of square-shaped Chinese characters.
To remove this obstacle, many input schemes have been designed in China since the 1970s; seven or eight hundred have been reported in journals. They include shape codes, phonetic codes, shape-phonetic codes, numeric codes and so on, such as the five-stroke input method (national patent office patent number CN85100837A). These coded input methods have two prominent disadvantages. First, codes are input rather than "words", and a conversion step lies between code and word; the operator must learn the coding before operating, which hinders popularization. Second, Chinese characters are input one at a time by code, and most single characters are not meaningful words on their own, so input stays at a low level.
To solve these problems, the country has promoted input methods based on the "Chinese Pinyin Scheme", such as the double-spelling method (national patent office patent number CN87100313A). Because letters rather than codes are input, no conversion between codes and words is needed. Although such input may not be as fast as some coding schemes, it is more scientific as a means of input.
However, input methods based on the "Chinese Pinyin Scheme" have several defects. Although a set of orthographic rules has been compiled through ten years of experiments and popularization, it remains imperfect: the rate of duplicate codes is too high when text is input into a computer, and a settled vocabulary is difficult to establish. To address this, spelling error correction techniques have been proposed.
Spelling correction is an indispensable function in general-purpose application software that processes text data. Besides word processors, such software includes databases and spreadsheets; spelling correction reduces input errors in written text and in the text data held in databases.
Spelling correction is also widely used in search engines, mainly to correct input errors and guide users to the intended query. The main technique in use today is pinyin-based correction. For example, when a homophone of "apple" with the pinyin "ping guo" is entered on Baidu, the results page prompts "Did you mean: apple".
Another application of spelling correction is in pinyin input methods, which recommend possible words when a user enters a pinyin string that does not exist.
However, the spelling correction techniques above can only recommend words with the same pronunciation and cannot recommend words with a confusable pronunciation. For example, they recommend "apple" (ping guo) for a homophone with the pinyin "ping guo", but not for an input with the pinyin "pin guo". Because many regions have their own dialects, pronunciation is often inexact, which produces many confusable sounds, such as the confusion of retroflex and flat-tongue sounds and of front and back nasal sounds in the Zhejiang area. In such cases input errors still occur, and the input facility is neither as intelligent nor as user-friendly as it could be.
Disclosure of Invention
The invention aims to provide a pinyin checking system and method with confusable pinyin recognition, to solve the technical problems that the prior art cannot use pronunciation similarity to correct errors a user may make during Chinese input, and that input is error-prone because regional dialects are confused with Mandarin.
A pinyin checking system with confusable pinyin recognition comprises a file storage space and a pinyin checking processing unit. The file storage space comprises a lexicon storage unit, a Chinese character pinyin storage unit and a Chinese character confusable pinyin storage unit; the pinyin checking processing unit comprises a Chinese character pinyin index processing subunit, a lexicon pinyin index processing subunit and a Chinese character confusable pinyin index processing subunit.
The system further comprises an index storage space, the index storage space comprising:
a Chinese character pinyin index file: stores the index structure used to obtain, from the Chinese character pinyin storage unit, the pronunciation of a given Chinese character;
a Chinese character confusable pinyin index file: stores the index structure used to find, in the Chinese character confusable pinyin storage unit, the confusable pinyin of a given pinyin;
a lexicon pinyin index file: stores the index structure used to find, in the lexicon storage unit, all words corresponding to a given pinyin.
In particular, the words in the lexicon storage unit are ordered by the hash value of each word's pronunciation, in ascending or descending order;
the lexicon pinyin index file further comprises a pinyin hash value index subfile and a list address index subfile, wherein
the pinyin hash value index subfile stores, in ascending or descending order of pinyin hash value, the list address in the list address index subfile that corresponds to each hash value;
the list address index subfile stores, for each list address, the number of words sharing the same pinyin and the storage address of each of those words in the lexicon storage unit.
The lexicon pinyin index processing subunit further comprises:
a hash calculation subunit: for calculating the hash value of a word's pinyin;
a hash value index processing subunit: for finding, in the pinyin hash value index subfile, the list address corresponding to the calculated hash value;
a list address processing subunit: for finding, at that list address in the list address index subfile, the number of words and the storage address of each word in the lexicon storage unit;
a lexicon processing subunit: for fetching the corresponding words from the lexicon storage unit using the storage addresses found by the list address processing subunit.
Based on this system, a pinyin checking method with confusable pinyin recognition is provided, comprising the steps of:
(1) setting up a lexicon storage unit for storing words, a Chinese character pinyin storage unit for storing the pinyin of Chinese characters, and a Chinese character confusable pinyin storage unit for storing confusable pinyin;
(2) receiving a keyword input by a user, and looking up the corresponding pinyin in the Chinese character pinyin storage unit;
(3) receiving the pinyin found in the Chinese character pinyin storage unit, and looking up the corresponding confusable pinyin in the Chinese character confusable pinyin storage unit;
(4) receiving the pinyin provided by step (2) and step (3), and searching the lexicon storage unit to obtain the corresponding words.
Setting up the lexicon storage unit in step (1) further comprises: ordering the words in the lexicon storage unit by the hash value of each word's pronunciation, in ascending or descending order.
Step (1) further comprises:
setting up a pinyin hash value index subfile: storing, in ascending or descending order of pinyin hash value, the list address in the list address index subfile that corresponds to each hash value;
setting up a list address index subfile: storing, for each list address, the number of words sharing the same pinyin and the storage address of each of those words in the lexicon storage unit.
Searching the lexicon storage unit in step (4) to obtain the corresponding words further comprises:
calculating the hash value of each word's pinyin;
finding, in the pinyin hash value index subfile, the list address corresponding to the calculated hash value;
finding, at that list address in the list address index subfile, the number of words and the storage address of each word in the lexicon storage unit;
and fetching the corresponding words from the lexicon storage unit using the storage addresses so found.
Preferably, setting up the Chinese character pinyin storage unit in step (1) further comprises:
taking the Chinese character as the key of a binary tree and its pinyin as the value, and, if the character is a polyphone, adding a record to the binary tree for each pronunciation;
setting up the Chinese character confusable pinyin storage unit in step (1) further comprises:
taking each pinyin as the key of a binary tree and its confusable pinyin as the value, and, if there are several confusable pinyins, adding a record to the binary tree for each.
The advantage of the invention is that, by introducing confusable pinyin recognition, it addresses the confusion between regional dialects and Mandarin. Errors in a user's Chinese input are corrected using pronunciation similarity, such as the similarity between retroflex and flat-tongue sounds and between front and back nasal sounds, so that spelling correction is more intelligent and user-friendly and the accuracy of Chinese input is improved.
Drawings
Fig. 1 is a schematic structural diagram of a pinyin checking system with confusable pinyin recognition according to a first embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the lexicon pinyin index processing subunit of the present invention;
Fig. 3 is a schematic diagram of a second pinyin checking system with confusable pinyin recognition according to the present invention;
Fig. 4 is a schematic diagram of the structure of the lexicon pinyin index sub-files used by the pinyin checking method with confusable pinyin recognition of the present invention;
Fig. 5 is a flowchart of the pinyin checking method with confusable pinyin recognition of the present invention;
Fig. 6 is a schematic diagram of an application example of the lexicon pinyin index sub-file structure used by the pinyin checking method with confusable pinyin recognition of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Please refer to Fig. 1, which is a schematic structural diagram of a pinyin checking system with confusable pinyin recognition according to a first embodiment of the present invention. It comprises a file storage space 100 and a pinyin checking processing unit 200. The file storage space 100 stores the word entries and the pinyin and confusable pinyin corresponding to the Chinese characters. The pinyin checking processing unit 200 is mainly used to annotate the input keyword with pinyin and to look up its confusable pinyin, so as to obtain the corresponding words.
The file storage space 100 is usually a memory, or a storage area allocated in memory. Functionally, it is divided into a Chinese character pinyin storage unit 110, a Chinese character confusable pinyin storage unit 120 and a lexicon storage unit 130.
The Chinese character pinyin storage unit 110 stores the standard pinyin corresponding to each Chinese character, in a fixed format. The general storage format is "Chinese character: pinyin"; if a character is a polyphone, a "," is placed between the pinyins, for example "apple: ping" and "Sheng: sheng, cheng". Each storage location holds only one Chinese character and its corresponding pinyin.
The Chinese character pinyin storage unit 110 may store the pinyin of each Chinese character sequentially in dictionary order, using one additional storage location for each further pronunciation of a polyphone. Stored this way, however, looking up the pinyin of a character is slow. In this embodiment of the present invention, the Chinese character pinyin storage unit 110 therefore stores the characters and their pinyin in a binary tree: the Chinese character is the key and its pinyin is the value. If the character is a polyphone, each pronunciation is inserted once, so a character with two pronunciations has two records. Stored this way, the pinyin corresponding to a Chinese character can be retrieved more quickly.
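The key/value storage with repeated keys can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `MultiMap` and its methods are hypothetical, a sorted list searched with `bisect` stands in for the balanced binary tree, and the characters 苹 and 盛 are assumed to be the originals behind the "apple" and "Sheng" examples.

```python
import bisect


class MultiMap:
    """Sorted (key, value) records in which a key may appear more than once.

    Hypothetical stand-in for a balanced binary tree: lookups use binary
    search, so finding a key takes O(log n) comparisons.
    """

    def __init__(self):
        self._records = []  # always kept sorted by (key, value)

    def insert(self, key, value):
        # One record per pronunciation, so a polyphone yields several records.
        bisect.insort(self._records, (key, value))

    def find_all(self, key):
        # (key,) sorts before every (key, value), so this is the first match.
        i = bisect.bisect_left(self._records, (key,))
        values = []
        while i < len(self._records) and self._records[i][0] == key:
            values.append(self._records[i][1])
            i += 1
        return values


pinyin_store = MultiMap()
pinyin_store.insert("苹", "ping")
pinyin_store.insert("盛", "sheng")  # polyphone: first pronunciation
pinyin_store.insert("盛", "cheng")  # polyphone: second pronunciation

print(pinyin_store.find_all("盛"))  # both records for the polyphone
```

Note that this sketch returns the values of one key in sorted order rather than insertion order; a tree-based container may order equal keys differently.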
The Chinese character confusable pinyin storage unit 120 stores the confusable pinyin corresponding to each pinyin in the Chinese character pinyin storage unit, in a fixed format. The general storage format is "pinyin: confusable pinyin"; if a pinyin has several confusable pinyins, a "," is placed between them. Because dialects exist in many regions, the mispronunciations they cause generally fall into two types: retroflex versus flat-tongue sounds, and front versus back nasal sounds. Most of the entries stored in the Chinese character confusable pinyin storage unit 120 are therefore confusions of these two types, such as "ping: pin" and "sheng: shen, seng, sen".
The Chinese character confusable pinyin storage unit 120 may store the confusable pinyin of each pinyin sequentially in a fixed order, using one additional storage location for each further confusable pinyin. Stored this way, however, looking up confusable pinyin is slow. In this embodiment of the present invention, the Chinese character confusable pinyin storage unit 120 therefore stores each pinyin and its confusable pinyin in a binary tree: the pinyin is the key and its confusable pinyin is the value. If a pinyin has several confusable pinyins, each is inserted once, so several records are stored for that pinyin. Stored this way, the confusable pinyin corresponding to a pinyin can be retrieved more quickly. The contents of the Chinese character confusable pinyin storage unit can be freely configured by the user according to actual needs.
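The text leaves the confusable pinyin table to user configuration. As one possible way to populate it, the two confusion types named above can be turned into substitution rules. The sketch below is an assumption, not the method of the invention: the function names and the rule set (zh/z, ch/c, sh/s for the initial, and a final "n" versus "ng") are hypothetical.

```python
# Hypothetical rule set: retroflex initials and their flat-tongue partners.
INITIAL_PAIRS = [("zh", "z"), ("ch", "c"), ("sh", "s")]


def initial_variants(syllable):
    """Return the syllable plus its retroflex/flat-tongue counterpart, if any."""
    for retroflex, flat in INITIAL_PAIRS:
        if syllable.startswith(retroflex):
            return [syllable, flat + syllable[len(retroflex):]]
    for retroflex, flat in INITIAL_PAIRS:
        if syllable.startswith(flat):
            return [syllable, retroflex + syllable[len(flat):]]
    return [syllable]


def final_variants(syllable):
    """Return the syllable plus its front/back nasal counterpart, if any."""
    if syllable.endswith("ng"):
        return [syllable, syllable[:-1]]   # back nasal -> front nasal
    if syllable.endswith("n"):
        return [syllable, syllable + "g"]  # front nasal -> back nasal
    return [syllable]


def confusables(syllable):
    """Every spelling reachable by applying either rule, excluding the input."""
    found = []
    for a in initial_variants(syllable):
        for b in final_variants(a):
            if b != syllable and b not in found:
                found.append(b)
    return found


print(confusables("sheng"))
print(confusables("ping"))
```

The rules are purely mechanical, so some outputs may not be valid pinyin syllables; a real table would need to be filtered against the syllable inventory or edited by hand.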
The lexicon storage unit 130 stores the words that serve as candidates; it is essentially the set of all candidate words. The words are stored in a fixed order, which may be dictionary order or another order. To make lookup convenient, the address at which each word is stored, such as its absolute storage address, may be recorded in advance. The present invention may instead record the offset between the address at which a word is stored and the first address of the lexicon storage unit 130, so that once the storage address information of a word is obtained, the word can be found quickly and reading is faster.
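Offset addressing can be illustrated with a short Python sketch. The record layout is an assumption made only for this example (the text does not specify one): each word is stored as a one-byte length followed by its UTF-8 bytes, and the function names are hypothetical.

```python
def build_lexicon(words):
    """Pack words into one buffer and record each word's offset from the start.

    Assumed layout: 1 length byte + UTF-8 bytes per word (max 255 bytes each).
    """
    buffer = bytearray()
    offsets = {}
    for word in words:
        offsets[word] = len(buffer)  # offset relative to the first address
        data = word.encode("utf-8")
        buffer.append(len(data))
        buffer += data
    return bytes(buffer), offsets


def read_word(buffer, offset):
    """Read one word directly at a known offset, without scanning the buffer."""
    length = buffer[offset]
    return buffer[offset + 1:offset + 1 + length].decode("utf-8")


lexicon, offsets = build_lexicon(["苹果", "香蕉", "浙江"])
print(offsets)
print(read_word(lexicon, offsets["香蕉"]))
```

Because offsets are relative to the first address, the same index remains valid wherever the buffer is loaded in memory.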
The pinyin checking processing unit 200 is mainly used to perform the spelling check on the input keyword. It is usually implemented by a programmed processor. Logically, the pinyin checking processing unit 200 can be further divided into a Chinese character pinyin index processing subunit 210, a Chinese character confusable pinyin index processing subunit 220, and a lexicon pinyin index processing subunit 230.
The Chinese character pinyin index processing subunit 210 receives a keyword input by a user and looks up the corresponding pinyin in the Chinese character pinyin storage unit 110. It may scan the Chinese character pinyin storage unit 110 sequentially to find the pinyin, but that is too slow. When the Chinese character pinyin storage unit 110 stores the correspondence between Chinese characters and pinyin in a binary tree, the Chinese character pinyin index processing subunit 210 may instead perform the search using a multimap. multimap is a container of the C++ standard library (namespace std) organized as a balanced binary tree of keys, so that the value corresponding to a key can be obtained quickly. Several records are allowed to have the same key.
Specifically, during storage the Chinese character is used as the multimap key and the pinyin corresponding to the character is used as the multimap value. For a polyphone, each pronunciation is inserted once. For example, "apple" has one record in the multimap, namely <apple, ping>, whereas "Sheng" is a polyphone and has two records in the multimap, <Sheng, sheng> and <Sheng, cheng>.
When the Chinese character pinyin index processing subunit 210 works, it first obtains the input keyword and converts it into multimap keys, and then uses the multimap to look up the values of the balanced binary tree in the Chinese character pinyin storage unit 110, obtaining the pinyin corresponding to the keyword. This whole process is called the pinyin annotation process.
The Chinese character confusable pinyin index processing subunit 220 looks up confusable pinyin in the Chinese character confusable pinyin storage unit 120 according to the pinyin provided by the Chinese character pinyin index processing subunit 210. The confusable pinyin covers retroflex versus flat-tongue sounds and front versus back nasal sounds. The processing principle of the Chinese character confusable pinyin index processing subunit 220 is similar to that of the Chinese character pinyin index processing subunit 210 and is not repeated here.
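Taken together, the two lookups turn a keyword into a set of candidate pinyin strings. A minimal Python sketch follows, under stated assumptions: plain dictionaries stand in for the two storage units, the function names are hypothetical, and the keyword 品果 is an invented example of an input whose pinyin is "pin guo".

```python
from itertools import product

# Assumed sample contents of storage units 110 and 120.
HANZI_PINYIN = {"品": ["pin"], "果": ["guo"], "盛": ["sheng", "cheng"]}
CONFUSABLE = {"pin": ["ping"], "ping": ["pin"]}


def annotate(keyword):
    """Pinyin annotation: one space-separated reading per combination of
    per-character pronunciations (polyphones multiply the readings)."""
    per_char = [HANZI_PINYIN[ch] for ch in keyword]  # KeyError if unknown
    return [" ".join(combo) for combo in product(*per_char)]


def with_confusables(reading):
    """Expand one reading with every confusable pinyin of each syllable.
    The original reading is always the first result."""
    options = [[s] + CONFUSABLE.get(s, []) for s in reading.split()]
    return [" ".join(combo) for combo in product(*options)]


for reading in annotate("品果"):
    print(with_confusables(reading))
```

The number of candidates grows multiplicatively with the number of syllables that have confusable variants, which is one reason a fast lexicon lookup matters in the next stage.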
Referring to Fig. 2, it is a schematic structural diagram of the lexicon pinyin index processing subunit of the present invention.
The lexicon pinyin index processing subunit 230 receives the pinyin provided by the Chinese character pinyin index processing subunit 210 and the Chinese character confusable pinyin index processing subunit 220, and searches the lexicon storage unit to obtain the corresponding words. This subunit is described in detail below, so its description is omitted here.
The system of the present invention further includes an index storage space 300 for storing index information. The index storage space 300 includes:
Chinese character pinyin index file 310: stores the index information used to obtain, from the Chinese character pinyin storage unit 110, the pronunciation of a given Chinese character. In general, the Chinese character pinyin index file 310 stores the index rule for finding a pronunciation in the Chinese character pinyin storage unit 110, the storage address information of the Chinese character pinyin storage unit 110, and the like. The index rule generally refers to the order in which the lookup is performed. The Chinese character pinyin index file 310 may be stored in a storage area allocated in memory, or may be placed in the Chinese character pinyin index processing subunit 210; in other words, the Chinese character pinyin index file 310 may logically be omitted.
Chinese character confusable pinyin index file 320: stores the index information used to find, in the Chinese character confusable pinyin storage unit 120, the confusable pinyin of a given pinyin. The index information includes the index rule and the address information of the Chinese character confusable pinyin storage unit 120. Similarly, the Chinese character confusable pinyin index file 320 may be stored in a storage area allocated in memory, or may be placed in the Chinese character confusable pinyin index processing subunit 220.
Lexicon pinyin index file 330: stores the index information used to find, in the lexicon storage unit, all words corresponding to a given pinyin. The following description focuses on the lexicon pinyin index file 330; it describes a preferred embodiment and is not intended to limit the present invention.
The lexicon storage unit 130 may order its words by the hash value of each word's pronunciation, in ascending or descending order.
The lexicon pinyin index file 330 further comprises a pinyin hash value index sub-file 410 and a list address index sub-file 420, wherein
pinyin hash value index sub-file 410: stores, in ascending or descending order of pinyin hash value, the list address in the list address index sub-file 420 that corresponds to each hash value;
list address index sub-file 420: stores, for each list address, the number of words sharing the same pinyin and the storage address of each of those words in the lexicon storage unit 130.
The lexicon pinyin index file 330 is described below using an application example.
Please refer to Fig. 4, which is a schematic diagram of an application of the lexicon pinyin index file 330. The pinyin hash value index sub-file 410 stores the correspondence between hash values and list addresses. Words whose pinyin hash values are the same share the same list address, so a list address can be found from a hash value. The list address information may be the absolute storage address of the list, an offset address, or another form of address.
The list address index sub-file 420 stores the number of words having the same hash value and the storage address of each of those words in the lexicon storage unit 130.
To work with the lexicon pinyin index file 330, the lexicon pinyin index processing subunit 230 further includes a hash calculation subunit 231, a hash value index processing subunit 232, a list address processing subunit 233, and a lexicon processing subunit 234:
Hash calculation subunit 231: calculates the hash value of a word's pinyin using a hash algorithm. The hash value of each word's pinyin forms the basic information of that pinyin.
Hash value index processing subunit 232: finds the list address corresponding to the calculated hash value.
List address processing subunit 233: finds, at that list address in the list address index sub-file 420, the number of words and the storage address of each word in the lexicon storage unit 130.
Lexicon processing subunit 234: fetches the corresponding words from the lexicon storage unit 130 using the storage addresses found by the list address processing subunit 233.
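The division of labour among the four subunits can be sketched in Python. Everything named here is an assumption for illustration: the text does not specify the hash algorithm, so `zlib.crc32` is swapped in; the sub-files are modelled as in-memory lists rather than files; and the function names are hypothetical.

```python
import bisect
import zlib


def hash_pinyin(pinyin):
    """Role of subunit 231. CRC32 is an assumed stand-in hash."""
    return zlib.crc32(pinyin.encode("ascii"))


def build_index(entries):
    """Build both sub-files from (word_address, pinyin) pairs."""
    groups = {}
    for address, pinyin in entries:
        groups.setdefault(hash_pinyin(pinyin), []).append(address)
    hash_index = []  # sub-file 410: (hash, list address), ascending by hash
    list_file = []   # sub-file 420: a count, then that many word addresses
    for h in sorted(groups):
        hash_index.append((h, len(list_file)))
        list_file.append(len(groups[h]))
        list_file.extend(groups[h])
    return hash_index, list_file


def find_list_address(hash_index, h):
    """Role of subunit 232: binary search over the sorted hash values."""
    i = bisect.bisect_left(hash_index, (h,))
    if i < len(hash_index) and hash_index[i][0] == h:
        return hash_index[i][1]
    return None


def read_addresses(list_file, list_address):
    """Role of subunit 233: read the count, then the word addresses."""
    count = list_file[list_address]
    return list_file[list_address + 1:list_address + 1 + count]


def lookup(pinyin, hash_index, list_file, lexicon):
    """Role of subunit 234: fetch the words stored at the found addresses."""
    list_address = find_list_address(hash_index, hash_pinyin(pinyin))
    if list_address is None:
        return []
    return [lexicon[a] for a in read_addresses(list_file, list_address)]


# Word labels follow the translated example; addresses are offsets.
lexicon = {20: "apple", 30: "rubber", 35: "banana", 40: "zhejiang"}
entries = [(20, "ping guo"), (30, "xiang jiao"),
           (35, "xiang jiao"), (40, "zhe jiang")]
hash_index, list_file = build_index(entries)
print(lookup("xiang jiao", hash_index, list_file, lexicon))
```

Keeping sub-file 410 sorted by hash value is what makes the binary search possible. This sketch does not handle two different pinyin strings that happen to share a hash value; a fuller version would compare the stored pinyin as well.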
Based on the pinyin checking system with confusable pinyin recognition described above, the invention provides a pinyin checking method with confusable pinyin recognition. Referring to Fig. 5, it includes:
S1: setting up a lexicon storage unit for storing words, a Chinese character pinyin storage unit for storing the pinyin of Chinese characters, and a Chinese character confusable pinyin storage unit for storing confusable pinyin.
Setting up the lexicon storage unit in step S1 further comprises: ordering the words in the lexicon storage unit by the hash value of each word's pronunciation, in ascending or descending order.
Setting up the Chinese character pinyin storage unit further comprises:
taking the Chinese character as the key of a binary tree and its pinyin as the value, and, if the character is a polyphone, adding a record to the binary tree for each pronunciation;
setting up the Chinese character confusable pinyin storage unit in step S1 further comprises:
taking each pinyin as the key of a binary tree and its confusable pinyin as the value, and, if there are several confusable pinyins, adding a record to the binary tree for each.
Step S1 further includes:
setting up a pinyin hash value index subfile: storing, in ascending or descending order of pinyin hash value, the list address in the list address index subfile that corresponds to each hash value;
setting up a list address index subfile: storing, for each list address, the number of words sharing the same pinyin and the storage address of each of those words in the lexicon storage unit.
S2: receiving a keyword input by a user, and looking up the corresponding pinyin in the Chinese character pinyin storage unit. The keyword is converted into multimap keys, and the multimap is then used to look up the values of the balanced binary tree in the Chinese character pinyin storage unit, obtaining the pinyin corresponding to the keyword. If there are several pinyins, they are separated by spaces.
S3: receiving the pinyin found in the Chinese character pinyin storage unit, and looking up the corresponding confusable pinyin in the Chinese character confusable pinyin storage unit. The confusable pinyin covers retroflex versus flat-tongue sounds and front versus back nasal sounds. Using a multimap, each pinyin provided by the Chinese character pinyin index processing subunit is taken as a multimap key, and the values of the balanced binary tree are looked up in the Chinese character confusable pinyin storage unit to obtain the confusable pinyin corresponding to each pinyin.
S4: receiving the pinyin provided by step S2 and step S3, and searching the lexicon storage unit to obtain the corresponding words.
Searching the lexicon storage unit for the corresponding words in step S4 further includes:
calculating the hash value of each word's pinyin;
finding, in the pinyin hash value index subfile, the list address corresponding to the calculated hash value;
finding, at that list address in the list address index subfile, the number of words and the storage address of each word in the lexicon storage unit;
and fetching the corresponding words from the lexicon storage unit using the storage addresses so found. The storage address information is the offset of the word's address from the first address.
The flow described above is illustrated below with a specific example.
Please refer to Fig. 6, which is a schematic diagram of an application example of the lexicon pinyin index sub-file structure used by the pinyin checking method with confusable pinyin recognition of the present invention.
Assume that the lexicon storage unit 130 stores the words "apple", "guo", "rubber", "banana" and "zhejiang", and that the corresponding storage address information is offset address information: relative to the first address PBase of the lexicon storage unit 130, the offsets of "apple", "guo", "rubber", "banana" and "zhejiang" are "20", "25", "30", "35" and "40" respectively.
The pinyin hash value index sub-file 410 stores, for hash(ping guo), hash(pin guo), hash(xiang jiao) and hash(zhe jiang), the corresponding address in the list address index sub-file 420. Each address is an offset from the first address of the list address index sub-file 420; the offsets corresponding to hash(ping guo), hash(pin guo), hash(xiang jiao) and hash(zhe jiang) are "10", "12", "14" and "17" respectively.
In the list address index sub-file 420, the location at offset "10" stores the number of words whose pinyin is "ping guo", which is 1, and the storage address of that word in the lexicon storage unit 130 (offset 20). The location at offset "12" stores the number of words whose pinyin is "pin guo", which is 1, and the storage address of that word in the lexicon storage unit 130 (offset 25). The location at offset "14" stores the number of words whose pinyin is "xiang jiao", which is 2, and the storage addresses of those two words in the lexicon storage unit 130 (offsets 30 and 35). The location at offset "17" stores the number of words whose pinyin is "zhe jiang", which is 1, and the storage address of that word in the lexicon storage unit 130 (offset 40).
When the Chinese character confusable pinyin storage unit is set up, "ping" is recorded as a confusable pinyin of "pin".
Suppose a user wants to input "apple" but, because of inaccurate pronunciation, enters a word whose pinyin is "pin guo". The Chinese character pinyin storage unit is searched first, giving the pinyins "pin" and "guo"; searching the Chinese character confusable pinyin storage unit then gives "ping" as the confusable pinyin of "pin". Next, the hash values of "ping guo" and "pin guo" are calculated and looked up in the pinyin hash value index sub-file 410, giving the list addresses at offsets 10 and 12 respectively. Looking these up in the list address index sub-file 420 gives the storage addresses in the lexicon storage unit 130 (offsets 20 and 25), from which the corresponding words "apple" and "guo" are read. The user is then asked whether one of these words was intended, which reduces spelling errors.
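The example can be traced in Python with the offsets given in the text. The modelling choices are assumptions: the pinyin string itself stands in for its hash value, the list address index sub-file is treated as a flat array in which each entry occupies one unit (which is consistent with the offsets 10, 12, 14 and 17), units 0 to 9 are unused padding, and the function names are hypothetical.

```python
from itertools import product

# Lexicon storage unit 130: offset from PBase -> word (labels as in the text).
LEXICON = {20: "apple", 25: "guo", 30: "rubber", 35: "banana", 40: "zhejiang"}

# Sub-file 410: the pinyin string stands in for hash(pinyin) in this sketch.
HASH_INDEX = {"ping guo": 10, "pin guo": 12, "xiang jiao": 14, "zhe jiang": 17}

# Sub-file 420 as a flat array: [count, address, ...] starting at offset 10.
LIST_FILE = [0] * 10 + [1, 20, 1, 25, 2, 30, 35, 1, 40]

# Storage unit 120: "ping" is recorded as a confusable pinyin of "pin".
CONFUSABLE = {"pin": ["ping"]}


def candidate_spellings(syllables):
    """The input reading plus every variant built from confusable pinyin."""
    options = [[s] + CONFUSABLE.get(s, []) for s in syllables]
    return [" ".join(combo) for combo in product(*options)]


def check(syllables):
    """Return (pinyin, list address, word addresses, words) per candidate hit."""
    trace = []
    for pinyin in candidate_spellings(syllables):
        list_address = HASH_INDEX.get(pinyin)
        if list_address is None:
            continue
        count = LIST_FILE[list_address]
        addresses = LIST_FILE[list_address + 1:list_address + 1 + count]
        trace.append((pinyin, list_address, addresses,
                      [LEXICON[a] for a in addresses]))
    return trace


for step in check(["pin", "guo"]):
    print(step)
```

Running the trace for "pin guo" visits list addresses 12 and 10 and word offsets 25 and 20, matching the values in the example.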
The above discloses only specific embodiments of the present invention; the invention is not limited to them, and any variation that a person skilled in the art can conceive falls within the scope of the present invention.
Claims (6)
1. A pinyin checking system with confusing voice recognition, which is used for obtaining entries of candidate objects corresponding to input keywords according to the input keywords, and is characterized by comprising a file storage space and a pinyin checking and processing unit, wherein,
the file storage space includes:
a word stock storage unit for storing words as candidate objects,
a Chinese character pinyin storage unit for storing the standard pinyin corresponding to each Chinese character, and
a Chinese character confusion tone storage unit for storing the confusable pinyin corresponding to each pinyin in the Chinese character pinyin storage unit;
the pinyin checking and processing unit comprises:
a Chinese character pinyin index processing subunit, configured to receive the keywords input by the user and search for the corresponding pinyin in the Chinese character pinyin storage unit;
a Chinese character confusion tone index processing subunit, configured to receive the pinyin found in the Chinese character pinyin storage unit and search for the corresponding confusable pinyin in the Chinese character confusion tone storage unit;
a word stock pinyin index processing subunit, configured to receive the pinyin provided by the Chinese character pinyin index processing subunit and the Chinese character confusion tone index processing subunit, and to search for the corresponding words in the word stock storage unit;
wherein the system further comprises an index storage space, and the index storage space comprises:
a word stock pinyin index file, used for storing index information for finding all corresponding words in the word stock storage unit according to the pinyin,
wherein the words in the word stock storage unit are sorted in ascending or descending order of the hash value of the word pronunciation;
the word stock pinyin index file further comprises a pinyin hash value index subfile and a list address index subfile, wherein,
the pinyin hash value index subfile is used for storing, in ascending or descending order of the hash value of the pinyin, the list address in the list address index subfile corresponding to each hash value;
the list address index subfile is used for storing, for each list address, the number of words having the same pinyin and the corresponding storage address information of those words in the word stock storage unit.
2. The system of claim 1, wherein the word stock pinyin index processing subunit further comprises:
a hash calculation subunit, used for calculating the hash value of the word pinyin;
a hash value index processing subunit, used for finding, in the pinyin hash value index subfile, the list address corresponding to the calculated hash value;
a list address processing subunit, used for finding, from the list address in the list address index subfile, the corresponding number of words and the storage address information of each word in the word stock storage unit;
a word stock processing subunit, used for finding the corresponding words in the word stock storage unit according to the storage address information found by the list address processing subunit.
3. The system of claim 1, further comprising:
a Chinese character pinyin index file, used for storing index information for finding the pronunciation in the Chinese character pinyin storage unit according to the Chinese character;
a Chinese character confusion tone index file, used for storing index information for finding the confusable pinyin in the Chinese character confusion tone storage unit according to the pinyin.
4. A pinyin checking method with confusable pinyin recognition, characterized by comprising the following steps:
(1) setting a word stock storage unit for storing words, a Chinese character pinyin storage unit for storing the pinyin of Chinese characters, and a Chinese character confusion tone storage unit for storing confusable pinyin, wherein setting the word stock storage unit in step (1) further comprises: sorting the words in the word stock storage unit in ascending or descending order of the hash value of the word pronunciation; and step (1) further comprises:
setting a pinyin hash value index subfile: sequentially saving the corresponding list address of each hash value in the list address index subfile from small to large or from large to small according to the hash value of the pinyin;
setting a list address index sub-file: storing the number of words with the same pinyin corresponding to each list address and the corresponding storage address information of the words in a word bank storage unit;
(2) receiving key words input by a user, and searching corresponding pinyin in the Chinese character pinyin storage unit;
(3) receiving the pinyin sent by the Chinese character pinyin storage unit, and searching the corresponding confusing pinyin in the Chinese character confusing sound storage unit;
(4) receiving the pinyin provided in step (2) and step (3) respectively, and searching the word stock storage unit to obtain the corresponding words, wherein searching the word stock storage unit in step (4) further comprises:
calculating the hash value of each word pinyin;
finding, in the pinyin hash value index subfile, the list address corresponding to the calculated hash value;
finding, from the list address in the list address index subfile, the corresponding number of words and the storage address information of each word in the word stock storage unit;
and finding the corresponding words in the word stock storage unit according to the storage address information found.
5. The method of claim 4, wherein the storage address information is an offset of an address from a first address.
6. The method of claim 4, wherein
setting the Chinese character pinyin storage unit in step (1) further comprises:
taking the Chinese characters as the keys of a binary tree and the pinyin as the values of the binary tree, and, if a Chinese character is a polyphone, adding corresponding records to the binary tree;
setting the Chinese character confusion tone storage unit in step (1) further comprises:
taking each pinyin as a key of a binary tree and the confusable pinyin of that pinyin as the value, and, if a plurality of confusable pinyins exist, adding corresponding records to the binary tree.
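The binary-tree storage recited in claim 6 can be illustrated with a small Python sketch. This is one possible reading, not the patented implementation: it uses an unbalanced binary search tree, and it records a polyphone's extra pronunciations as additional values on the same key, which is an assumption about what "adding corresponding records" means. The names `Node`, `insert` and `find` are hypothetical, and the sample characters are chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    key: str
    values: List[str] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Insert (key, value); an existing key gains one more record."""
    if root is None:
        return Node(key, [value])
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    elif value not in root.values:
        root.values.append(value)  # polyphone or additional confusable pinyin
    return root


def find(root, key):
    """Return every value recorded for key, or an empty list if absent."""
    while root is not None:
        if key == root.key:
            return list(root.values)
        root = root.left if key < root.key else root.right
    return []


# Chinese character -> pinyin; the polyphone 行 gets two records.
pinyin_tree = None
for char, pinyin in [("苹", "ping"), ("果", "guo"), ("行", "xing"), ("行", "hang")]:
    pinyin_tree = insert(pinyin_tree, char, pinyin)

# Pinyin -> confusable pinyin, using the pair given in the description.
confusion_tree = insert(None, "pin", "ping")

print(find(pinyin_tree, "行"))
print(find(confusion_tree, "pin"))
```

The same tree type serves both storage units because each is a key-to-values mapping; only the meaning of the key (a character or a pinyin) differs.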
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2007101494831A CN101388012B (en) | 2007-09-13 | 2007-09-13 | Phonetic check system and method with easy confusion tone recognition |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1128541A1 (en) | 2009-10-30 |
| HK1128541B (en) | 2012-10-19 |