CN100476800C - A method and system for segmenting index words - Google Patents
A method and system for segmenting index words Download PDFInfo
- Publication number
- CN100476800C CN100476800C CNB2007101230513A CN200710123051A CN100476800C CN 100476800 C CN100476800 C CN 100476800C CN B2007101230513 A CNB2007101230513 A CN B2007101230513A CN 200710123051 A CN200710123051 A CN 200710123051A CN 100476800 C CN100476800 C CN 100476800C
- Authority
- CN
- China
- Prior art keywords
- character
- participle
- unit
- english
- stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 69
- 239000000872 buffer Substances 0.000 claims description 24
- 230000003044 adaptive effect Effects 0.000 claims description 6
- 238000012545 processing Methods 0.000 claims description 3
- 230000008878 coupling Effects 0.000 claims 4
- 238000010168 coupling process Methods 0.000 claims 4
- 238000005859 coupling reaction Methods 0.000 claims 4
- 230000011218 segmentation Effects 0.000 abstract description 158
- 238000010586 diagram Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000000737 periodic effect Effects 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Images
Landscapes
- Document Processing Apparatus (AREA)
Abstract
Description
技术领域 technical field
本发明涉及信息索引领域,特别涉及一种切分索引分词的方法及系统。The invention relates to the field of information indexing, in particular to a method and system for segmenting index words.
背景技术 Background technique
现有信息检索系统已经日益普及,大到网络搜索引擎,小到特定应用信息检索系统。当需要进行汉字信息的处理时,信息检索系统就会遇到如何分词的问题。Existing information retrieval systems have become increasingly popular, ranging from web search engines to application-specific information retrieval systems. When it is necessary to process Chinese character information, the information retrieval system will encounter the problem of how to segment words.
目前的分词算法有很多种,其中n元语法分词是一种不需要词典的机械分词方法,实现容易。但是该分词方法冗余度大,不能解决单字分词问题。There are many word segmentation algorithms at present, among which n-gram word segmentation is a mechanical word segmentation method that does not require a dictionary and is easy to implement. However, this word segmentation method has a large degree of redundancy and cannot solve the problem of single-character word segmentation.
二元分词方法是将句子中任意出现的两个紧邻的字都分出来,建立倒排索引。例如:句子“从上述实现步骤来看”会分出“从上、上述、述实、实现、现步、步骤、骤来、来看”等几个词。从上述分出来的分词可以看出,如“述实”、“现步”等分词并没有实际意义。而且该方法也不能解决单字分词的问题,不能对英文词进行划分。The binary word segmentation method is to separate any two adjacent words that appear in a sentence, and build an inverted index. For example: the sentence "From the perspective of the above steps of realization" can be divided into several words such as "From above, above, stating the facts, realizing, present step, step, step, and seeing". It can be seen from the participle separated above that participles such as "reporting facts" and "present situation" have no practical meaning. Moreover, this method cannot solve the problem of word segmentation, and cannot divide English words.
最大匹配分词方法是一种按照最长词优先的原则匹配分词的方法。例如:句子“从上述实现步骤来看”可能被分为“从、上述、实现步骤、来看”等几个词。这种方法分出的词比较少,但不一定是最短的,而且也不一定准确。由于这种分词方法没有一定量的冗余词,可能会导致查全率下降,在某些应用场合体验不好。The maximum matching word segmentation method is a method of matching word segmentation according to the longest word first principle. For example: the sentence "from the above realization steps" may be divided into several words such as "from, above, realization steps, view". The number of words separated by this method is relatively small, but it is not necessarily the shortest, and it is not necessarily accurate. Since this word segmentation method does not have a certain amount of redundant words, it may lead to a decrease in the recall rate, and the experience is not good in some applications.
基于统计或语义分析的分词方法,需要解决歧义消解问题。这种分词方法得到的结果不一定是全面的,但比较准确。但由于这种分词方法实现起来比较麻烦,而且复杂的分析过程必然会从一定程度影响分词效率,这种分词方法不适用于小型特定应用信息检索系统。Word segmentation methods based on statistical or semantic analysis need to solve the problem of disambiguation. The results obtained by this word segmentation method are not necessarily comprehensive, but relatively accurate. However, because this word segmentation method is cumbersome to implement, and the complex analysis process will inevitably affect the word segmentation efficiency to a certain extent, this word segmentation method is not suitable for small specific application information retrieval systems.
发明内容 Contents of the invention
本发明的目的是提供一种切分索引分词的方法,该方法能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强用户体验。The purpose of the present invention is to provide a method for segmenting index word segmentation, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single-word word segmentation, and enhance user experience.
本发明的目的还提供一种切分索引分词的系统,该系统能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强用户体验。The object of the present invention is also to provide a system for segmenting index word segmentation, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single word segmentation, and enhance user experience.
为解决上述技术问题,本发明实施例提供一种切分索引分词的方法,包括以下步骤:In order to solve the above technical problems, an embodiment of the present invention provides a method for segmenting index words, including the following steps:
读取字符流;read character stream;
识别所述字符流,确定汉字、英文字符或数字以及不可识别字符;Identify the character stream, determine Chinese characters, English characters or numbers and unrecognizable characters;
将已经确定的汉字、英文字符或数字与预先建立的词典树比较,确定匹配的分词;Compare the determined Chinese characters, English characters or numbers with the pre-established dictionary tree to determine the matching word segmentation;
将所述英文字符或数字进行ASCII码Wild match(通用模糊匹配),确定英文字符串或者数字串的分词;Carry out ASCII code Wild match (universal fuzzy matching) with described English character or numeral, determine the participle of English character string or numerical string;
将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流顺序进行排序;Sorting the above-mentioned matched word segmentation and the word segmentation and unrecognizable characters of the English character string or numeric string according to the sequence of the character stream;
按所述排序后的分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流。The character stream is divided according to the sorted word segmentation order and the length of each word segment and the above-mentioned unrecognizable characters.
优选地,所述词典树为预先建立的trie字符树数据结构。Preferably, the dictionary tree is a pre-established trie character tree data structure.
优选地,所述词典树为预先建立的二进制流词典结构。Preferably, the dictionary tree is a pre-established binary stream dictionary structure.
优选地,所述识别所述字符流后,将所述字符流存储在内部字符缓冲区。Preferably, after the character stream is identified, the character stream is stored in an internal character buffer.
优选地,在所述字符流存储在内部字符缓冲区之前,将所述字符流进行统一字符的处理。Preferably, before the character stream is stored in the internal character buffer, the character stream is subjected to uniform character processing.
优选地,所述确定汉字、英文字符或数字以及不可识别字符后,去掉所述字符流中的标点符号。Preferably, after the determination of Chinese characters, English characters or numbers and unrecognizable characters, the punctuation marks in the character stream are removed.
优选地,所述词典树在预先建立时去除无意义的单字。Preferably, the dictionary tree removes meaningless words during pre-establishment.
优选地,按所述排序后的分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流后进一步包括:Preferably, after dividing the character stream according to the word segmentation order after the sorting and the length of each word segmentation and the above-mentioned unrecognizable characters, further comprising:
定期统计接收到的关键词的频率;Regularly count the frequency of keywords received;
将频率高于预定数值的关键词添加到所述词典树中。Keywords with a frequency higher than a predetermined value are added to the dictionary tree.
本发明实施例提供一种切分索引分词的系统,该系统包括:An embodiment of the present invention provides a system for segmenting index words, the system includes:
读取单元,用于读取字符流;A reading unit for reading a character stream;
字符流识别单元,用于将所述读取单元读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符;A character stream identification unit, configured to identify the character stream read by the reading unit, to determine Chinese characters, English characters or numbers, and unrecognizable characters;
词典树单元,预先存储词组和短语的词典树的数据结构单元;The dictionary tree unit, the data structure unit of the dictionary tree storing phrases and phrases in advance;
比较单元,用于将所述字符流识别单元确定的汉字、英文字符或数字与所述词典树单元预先建立的词典树比较,确定匹配的分词;The comparison unit is used to compare the Chinese characters, English characters or numbers determined by the character stream identification unit with the dictionary tree previously established by the dictionary tree unit, and determine the matching word segmentation;
通用模糊匹配单元,用于将所述比较单元比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词;The general fuzzy matching unit is used to carry out the general fuzzy matching of ASCII codes to the English characters or numbers compared by the comparison unit to determine the word segmentation of English character strings or digital strings;
分词管理单元,将所述比较单元和所述通用模糊匹配单元确定的分词以及所述字符流识别单元确定的不可识别字符按所述读取单元读取的字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度;The word segmentation management unit sorts the word segmentation determined by the comparison unit and the general fuzzy matching unit and the unrecognizable characters determined by the character stream identification unit according to the sequence of the character stream read by the reading unit, and records each The above word segmentation and the length of the above unrecognizable characters;
分词划分单元,将所述读取单元读取的字符流,按照所述分词管理单元记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分。The word segmentation unit divides the character stream read by the reading unit according to the word segmentation sequence recorded by the word segmentation management unit and the length of each word segment and the above-mentioned unrecognizable characters.
本发明实施例还提供一种切分索引分词的系统,该系统包括:The embodiment of the present invention also provides a system for segmenting index and word segmentation, the system includes:
读取单元,用于读取字符流;A reading unit for reading a character stream;
字符流识别单元,用于将所述读取单元读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符;A character stream identification unit, configured to identify the character stream read by the reading unit, to determine Chinese characters, English characters or numbers, and unrecognizable characters;
内部字符缓冲区单元,用于存储所述字符流识别单元识别的字符流;An internal character buffer unit for storing the character stream identified by the character stream identification unit;
词典树单元,预先存储词组和短语的词典树的数据结构单元;The dictionary tree unit, the data structure unit of the dictionary tree storing phrases and phrases in advance;
比较单元,用于将所述字符流识别单元确定的汉字、英文字符或数字与所述词典树单元预先建立的词典树比较,确定匹配的分词;The comparison unit is used to compare the Chinese characters, English characters or numbers determined by the character stream identification unit with the dictionary tree previously established by the dictionary tree unit, and determine the matching word segmentation;
通用模糊匹配单元,用于将所述比较单元比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词;The general fuzzy matching unit is used to carry out the general fuzzy matching of ASCII codes to the English characters or numbers compared by the comparison unit to determine the word segmentation of English character strings or digital strings;
分词管理单元,将所述比较单元和所述通用模糊匹配单元确定的分词以及所述字符流识别单元确定的不可识别字符按所述内部字符缓冲区单元存储的所述字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度;The word segmentation management unit sorts the word segmentation determined by the comparison unit and the general fuzzy matching unit and the unrecognizable characters determined by the character stream identification unit according to the sequence of the character stream stored in the internal character buffer unit, and Record the length of each of the above-mentioned word segmentation and the above-mentioned unrecognizable characters;
分词划分单元,将所述内部字符缓冲区单元存储的字符流,按照所述分词管理单元记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分;The word segmentation unit divides the character stream stored in the internal character buffer unit according to the word segmentation sequence recorded by the word segmentation management unit and the length of each word segmentation and the above-mentioned unrecognizable characters;
词典自适应单元,由预先建立的统计模块统计关键词的出现频率,将所述出现频率高于预定数值的关键词添加到所述词典树单元。The dictionary adaptive unit uses a pre-established statistical module to count the frequency of occurrence of keywords, and adds keywords whose frequency of occurrence is higher than a predetermined value to the dictionary tree unit.
本发明所述切分索引分词的方法能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强了用户的体验。The method for segmenting index words in the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and single word segmentation, and enhance user experience.
本发明实施例所述切分索引分词的方法,包括读取字符流;识别所述字符流中字符,确定汉字、英文字符或数字以及不可识别字符;将已经确定的汉字、英文字符或数字与预先建立的词典树相比较,确定匹配的分词;将英文字符或数字进行ASCII码Wild match,确定英文字符串或者数字串的分词;将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流中顺序进行排序;按所述分词和所述英文字符串或者数字串排序的顺序划分所述字符流。The method for segmenting an index word in the embodiment of the present invention includes reading a character stream; identifying characters in the character stream, determining Chinese characters, English characters or numbers and unrecognizable characters; combining the determined Chinese characters, English characters or numbers with Compare the pre-established dictionary trees to determine the matching word segmentation; perform ASCII code Wild match on English characters or numbers to determine the word segmentation of the English string or number string; compare the above matched word segmentation with the word segmentation of the English string or number string And the unrecognizable characters are sorted according to the order in the character stream; the character stream is divided according to the order in which the word segmentation and the English character string or number string are sorted.
由于在划分分词前,所有的汉字、英文字符或数字均与预先建立的词典树相比较,避免了无效词组或者短语的出现,而且保证了适当的冗余词。在单字可以作为一个词或者具有实际意义的时候,词典树中会按照正常的词组处理,所以可以实现单字作为一个分词的划分。并且本发明增加了ASCII码Wild match的过程,有效的确定英文字符串以及数字串的分词。Because all Chinese characters, English characters or numbers are compared with a pre-established dictionary tree before dividing and segmenting, avoiding the occurrence of invalid phrases or phrases, and ensuring appropriate redundant words. When a word can be used as a word or has practical meaning, it will be processed as a normal phrase in the dictionary tree, so the division of a word as a participle can be realized. And the present invention has increased the process of ASCII code Wild match, effectively determines the participle of English character string and number string.
因此,本发明所述切分索引分词的方法能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强了用户的体验。Therefore, the method for segmenting index words in the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single-word word segmentation, and enhance user experience.
附图说明 Description of drawings
图1为本发明所述方法一种实施方式流程图;Fig. 1 is a flow chart of an embodiment of the method of the present invention;
图2为本发明所述trie字符树结构示意图;Fig. 2 is the trie character tree structural representation of the present invention;
图3本发明所述二进制流的词典树结构示意图;Fig. 3 is a schematic diagram of a dictionary tree structure of a binary stream according to the present invention;
图4为本发明第一种切分索引分词系统结构图;Fig. 4 is the structural diagram of the first segmentation index word segmentation system of the present invention;
图5为本发明第二种切分索引分词系统结构图。FIG. 5 is a structural diagram of the second segmentation index word segmentation system of the present invention.
具体实施方式 Detailed ways
本发明提供一种切分索引分词的方法,该方法能够同时解决分词准确、一定量的冗余词以及单字划分的问题,增强用户体验。The invention provides a method for segmenting index words, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and word division, and enhance user experience.
为了使本技术领域的技术人员更好地理解本发明方案,下面结合附图和具体实施方式对本发明作进一步的详细说明。In order to enable those skilled in the art to better understand the solution of the present invention, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.
参见图1,该图为本发明所述方法一种实施方式流程图。Referring to Fig. 1, this figure is a flowchart of an embodiment of the method of the present invention.
S10、读取字符流。S10. Read the character stream.
读取的字符流中,可能包含汉字也可能包含英文字符、数字以及不可识别字符。The read character stream may contain Chinese characters or English characters, numbers and unrecognizable characters.
S20、识别所述字符流,确定汉字、英文字符或数字以及不可识别字符。S20. Identify the character stream, and determine Chinese characters, English characters or numbers, and unrecognizable characters.
将步骤S10读取的字符流进行字符识别,确定所述字符流中字符具体为汉字或者英文字符或者数字或者不可识别字符。Perform character recognition on the character stream read in step S10, and determine that the characters in the character stream are specifically Chinese characters or English characters or numbers or unrecognizable characters.
由于对所述字符流中的字符进行识别,可以很容易实现对多种字符集的切分。Since the characters in the character stream are recognized, the segmentation of various character sets can be easily realized.
S30、将已经确定的汉字、英文字符或数字与预先建立的词典树比较,确定匹配的分词。S30. Compare the determined Chinese characters, English characters or numbers with the pre-established dictionary tree, and determine the matching word segmentation.
词典树可以是常用的汉语词或者短语的集合,也可以有英文字符和汉字的混合词。The dictionary tree can be a collection of commonly used Chinese words or phrases, or a mixture of English characters and Chinese characters.
将步骤S20确定的汉字、英文字符或数字与预先建立的词典树相比较,确定匹配的分词,并将该分词对应的字符串建立倒排索引。倒排索引为现有技术,在此不再详述。Comparing the Chinese characters, English characters or numbers determined in step S20 with the pre-established dictionary tree, determining the matching word segmentation, and building an inverted index for the character string corresponding to the word segmentation. The inverted index is a prior art and will not be described in detail here.
对于一个字符流进行分词时,如下的字符串会被建立倒排索引。When performing word segmentation for a character stream, the following strings will be built with an inverted index.
1、该字符流中所有存在于所述词典树中的子串。1. All substrings existing in the dictionary tree in the character stream.
2、单个汉字建立倒排索引,实现了一元分词。2. Create an inverted index for a single Chinese character and realize a one-element word segmentation.
3、全英文字符组成的英文单词或者英文字符串。3. English words or English character strings composed of all English characters.
4、全数字字符组成的数字字符串。4. A numeric string composed of all numeric characters.
S40、将所述英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。S40. Perform ASCII universal fuzzy matching on the English characters or numbers to determine word segmentation of the English character string or number string.
ASCII码通用模糊匹配用于保证分词的完整性,当英文字符串或者数字串自成一词时,可以被标记为一个分词。The general fuzzy matching of ASCII codes is used to ensure the integrity of the word segmentation. When an English string or number string forms a word by itself, it can be marked as a word segmentation.
假如需要切分的字符流是:“useU盘”,所述词典树中存在“U盘”这个词。Wild match扫描遇到第一个英文字符“u”时,一直向后扫描直至这个字符串中最后一个英文字符,查找得到字符流中完整的英文字符串“useU”,确定为一个英文字符串的分词。If the character stream that needs to be segmented is: "useU disk", the word "U disk" exists in the dictionary tree. When the wild match scan encounters the first English character "u", it scans backward until the last English character in the string, finds the complete English string "useU" in the character stream, and determines it as an English string Participle.
S50、将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流顺序进行排序。S50. Sorting the above-mentioned matched word segments, word segments of the English character string or number string, and unrecognizable characters according to the sequence of the character stream.
在整个字符流分词排序过程中,汉字分词都是按照汉字字符在字符流中的位置有序排序的。但当遇到英文字符串或数字串时,产生的分词有可能出现无序甚至重复的现象,此时按所述字符流中字符的顺序进行排序。这样最终得到的分词都是按照各分词在原字符流中的顺序有序排列的,不需要专门调用排序模块进行处理。In the entire character stream word segmentation sorting process, the Chinese character word segmentation is sorted in an orderly manner according to the position of the Chinese characters in the character stream. However, when encountering English character strings or numeric strings, the generated word segmentation may appear out of order or even repeated, and at this time, sort according to the order of the characters in the character stream. In this way, the finally obtained word segmentations are arranged in an orderly manner according to the order of each word segmentation in the original character stream, and there is no need to call a sorting module for processing.
S60、按所述分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流。S60. Divide the character stream according to the order of word segmentation and the length of each word segment and the aforementioned unrecognizable characters.
步骤S50已经将上述各分词进行有序的排列,此时只需按所述各分词顺序以及所述各分词和上述不可识别字符的长度划分所述字符流即可。Step S50 has arranged the above-mentioned participles in an orderly manner. At this time, it is only necessary to divide the character stream according to the order of each participle, the length of each participle and the above-mentioned unrecognizable characters.
由于在划分分词前,所有的汉字、英文字符串或数字均与预先建立的词典树相比较,避免了无效词组或者短语的出现,而且保证了适当的冗余词。在单字可以作为一个词或者具有实际意义的时候,所述词典树中会按照正常的词组处理,因此能够实现单字作为一个分词的划分。并且本发明增加了ASCII码通用模糊匹配的过程,有效的确定英文字符串以及数字串的分词。因此,本发明所述切分索引分词的方法能够同时解决分词准确、一定量的冗余词以及单字划分的问题,增强了用户的体验。Since all Chinese characters, English character strings or numbers are compared with a pre-established dictionary tree before dividing and segmenting, the occurrence of invalid phrases or phrases is avoided, and appropriate redundant words are guaranteed. When a word can be used as a word or has actual meaning, it will be processed as a normal phrase in the dictionary tree, so the division of a word as a word segment can be realized. Moreover, the present invention adds a general fuzzy matching process of ASCII codes to effectively determine the word segmentation of English character strings and digital strings. Therefore, the method for segmenting index words in the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and word division, and enhance user experience.
现有二元分词、三元分词等分词方法虽然不需要词典树,但是生成的词比较多,很多词是没有意义的,而且无法对中英文混合词的划分。本发明实施例所述方法应用了词典树,由于分词的划分是以词典树为依据,有效的避免了无意义词的划分。Although existing word segmentation methods such as binary word segmentation and trigram word segmentation do not require a dictionary tree, they generate more words, many of which are meaningless, and cannot classify mixed words in Chinese and English. The method described in the embodiment of the present invention uses a dictionary tree, and since the division of word segmentation is based on the dictionary tree, the division of meaningless words is effectively avoided.
本发明实施例所述方法根据实际数据,最终产生的索引比二元分词多一倍,但实际上是由于单个汉字被进行分词,能够实现一元分词。本发明优选方案,在建立词典树时将无意义的单字或者不需要搜索的单字去掉。According to the actual data, the method described in the embodiment of the present invention finally generates twice as many indexes as the binary word segmentation, but in fact, because a single Chinese character is segmented, the unary word segmentation can be realized. In the preferred solution of the present invention, meaningless words or words that do not need to be searched are removed when the dictionary tree is established.
从效率上看,本发明实施例所述方法比n元语法分词要稍慢,因为增加了词典树查找匹配的过程。但本发明实施例所述方法在搜索的时候,词典树中的词将会以最短路径的形式进行分词,从而大大减少了检索索引的次数,对搜索性能有很大改善。In terms of efficiency, the method described in the embodiment of the present invention is slightly slower than the n-gram word segmentation, because the process of searching for a match in the dictionary tree is added. However, when the method described in the embodiment of the present invention is searching, the words in the dictionary tree will be segmented in the form of the shortest path, thereby greatly reducing the number of index retrieval times and greatly improving the search performance.
本发明实施例所述方法与最大匹配比较,由于采用的是全切分,查全率就可以达到100%,比最大匹配更全面。这在小型信息检索中能够满足要求。Compared with the maximum matching, the method described in the embodiment of the present invention adopts full segmentation, and the recall rate can reach 100%, which is more comprehensive than the maximum matching. This is sufficient for small information retrieval.
本发明实施例所述方法的分词效率是比较快的。经过粗略测试,在linux下,语料为一般网页,词典树为海量分词词典树,大约有20多万个单词,占用内存80M左右。通过本发明实施例所述方法进行分词,一秒钟能处理3至4M的数据。The word segmentation efficiency of the method described in the embodiment of the present invention is relatively fast. After a rough test, under linux, the corpus is a general web page, and the dictionary tree is a massive word segmentation dictionary tree, with about 200,000 words and about 80M of memory. Word segmentation is performed by the method described in the embodiment of the present invention, and 3 to 4M data can be processed in one second.
建立索引有很多过程,分词只是第一步,相对于其他过程来说,一秒能处理3至4M的数据已经很快。因为其他过程例如合并索引等需要硬盘I/O,速度相对较慢。因此,在整个索引过程中本发明所述分词方法不会造成瓶颈现象。There are many processes in indexing, and word segmentation is only the first step. Compared with other processes, it is very fast to process 3 to 4M data per second. Because other processes such as merging indexes require hard disk I/O, the speed is relatively slow. Therefore, the word segmentation method of the present invention will not cause bottleneck phenomenon in the whole indexing process.
我们以“useU盘”这个字符流为例,具体说明分词过程。Let's take the character stream of "useU disk" as an example to illustrate the word segmentation process in detail.
由于每个字符都会被查找,上述第一个英文字符被标记为这个英文字符串分词的开始,并且标记“u”是这个英文字符串的一个分词的开始。Since each character will be searched, the first English character above is marked as the start of a word segment of this English string, and the marker "u" is the start of a word segment of this English string.
“u”的下一个位置字符为“s”,词典中查不到“us”,不是分词的结束,不输出任何分词。The character at the next position of "u" is "s", and "us" cannot be found in the dictionary, and it is not the end of the participle, so no participle is output.
依此类推,当遇到第二个“U”时,发现字典里面有词“U盘”,说明这个位置也是一个分词的开始。By analogy, when the second "U" is encountered, it is found that there is a word "U disk" in the dictionary, indicating that this position is also the beginning of a word segmentation.
同时回溯,把前面标记过的分词开始位置第一个“u”和当前位置组成一个分词“use”输出。并标记当前位置是分词的开始,接下来依次进行扫描。At the same time, backtracking, the first "u" at the start position of the previously marked participle and the current position form a participle "use" output. And mark the current position as the beginning of word segmentation, and then scan in turn.
最终输出的分词为:“use”“useU”“U盘”。The final output word segmentation is: "use" "useU" "U disk".
数字串的通用模糊匹配过程与上述英文字符串的过程形似,具体数字串的通用模糊匹配过程在此不再详述。The general fuzzy matching process of the numeric string is similar to the process of the above-mentioned English string, and the specific general fuzzy matching process of the numeric string will not be described in detail here.
本发明所述方法的优选实施方式,包括以下步骤:A preferred embodiment of the method of the present invention comprises the following steps:
S10、读取字符流。S10. Read the character stream.
S20、识别所述字符流,确定汉字、英文字符或数字以及不可识别字符。S20. Identify the character stream, and determine Chinese characters, English characters or numbers, and unrecognizable characters.
S21、将识别后的字符流存储在内部字符缓冲区。S21. Store the recognized character stream in an internal character buffer.
所述字符流存储在内部字符缓冲区之前,还可以将所述字符流进行统一字符的处理。Before the character stream is stored in the internal character buffer, the character stream can also be processed with unified characters.
S30、将已经确定的汉字、英文字符或数字与预先建立的词典树比较,确定匹配的分词。S30. Compare the determined Chinese characters, English characters or numbers with the pre-established dictionary tree, and determine the matching word segmentation.
S40、将英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。S40. Perform general fuzzy matching of ASCII codes on the English characters or numbers to determine word segmentation of the English character string or number string.
S50、将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流顺序进行排序。S50. Sorting the above-mentioned matched word segments, word segments of the English character string or number string, and unrecognizable characters according to the sequence of the character stream.
S60、将所述内部字符缓冲区存储的字符流按所述分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流。S60. Divide the character stream stored in the internal character buffer according to the sequence of word segmentation and the length of each word segment and the aforementioned unrecognizable characters.
所述步骤S60后还可以包括:After the step S60, it may also include:
S61、定期统计接收到的关键词的频率。S61. Regularly count the frequencies of the received keywords.
定期统计接收到的关键词的频率可以由预先建立的统计模块进行统计。统计模块用于统计关键词频率,这种统计模块可以采用现有的用户统计关键词的通用统计模块。具体统计模块的工作原理为公知常识在此不再详述。Periodic statistics The frequency of the received keywords can be counted by a pre-established statistics module. The statistical module is used to count the frequency of keywords, and this statistical module can adopt the existing general statistical module for counting keywords by users. The working principle of the specific statistical module is common knowledge and will not be described in detail here.
S62、将频率高于预定数值的关键词添加到所述词典树中。S62. Add keywords with frequencies higher than a predetermined value to the dictionary tree.
用户可以根据自身行业或者具体需求,定期统计用户的输入关键字,将频率高于预定数值的关键词添加到所述词典树中。预定数值可以根据实际需要进行设定。According to the user's own industry or specific needs, the user can regularly count the input keywords of the user, and add keywords with a frequency higher than a predetermined value to the dictionary tree. The predetermined value can be set according to actual needs.
用户可以将统计后的高频关键词加入到所述词典树中,然后再重新建立索引。这样所述词典树就可以具有自适应性,对检索速度将是一个很大的提升。The user can add the counted high-frequency keywords into the dictionary tree, and then rebuild the index. In this way, the dictionary tree can be adaptive, which will greatly improve the retrieval speed.
本发明实施例所述词典树可以为预先建立的map映射表或者hash_map(哈希表)的数据结构。所述词典树采用map映射表或者hash_map数据结构时,需要查找两次数据,而且无法得知分词在哪个字符终止,需要遍历整个句子才能查找到分词。The dictionary tree in the embodiment of the present invention may be a pre-established map mapping table or a hash_map (hash table) data structure. When the dictionary tree adopts a map mapping table or a hash_map data structure, it is necessary to search for data twice, and it is impossible to know which character the participle ends at, and it is necessary to traverse the entire sentence to find the participle.
本发明所述词典树优选情况,采用预先建立的trie字符树数据结构。采用trie字符树能够比较好的满足要求。所述词典树采用trie字符树可以在一次查找过程中查找到多个词,而且容易找到分词的终止位置。关于trie字符树目前已经有很多成熟的模型,本文不再赘述。The preferred situation of the dictionary tree of the present invention adopts a pre-established trie character tree data structure. Using the trie character tree can better meet the requirements. The trie character tree used in the dictionary tree can find multiple words in one search process, and it is easy to find the end position of the word segmentation. There are already many mature models about the trie character tree, so this article will not repeat them.
参见图2,该图为本发明所述trie字符树结构示意图。Referring to Fig. 2, this figure is a schematic diagram of the trie character tree structure of the present invention.
图2中每个方框代表一个节点,对应表示一个汉字或字符。带“*”标记的方框,表示从根节点到该节点的路径所组成的汉字集或者字符串对应于词典树中的一个词或短语。方框中的数字是对该节点的编号。Each box in Figure 2 represents a node, which corresponds to a Chinese character or character. The box marked with "*" indicates that the Chinese character set or character string formed by the path from the root node to this node corresponds to a word or phrase in the dictionary tree. The number in the box is the number of the node.
例如识别的字符流是“中华人是伟大的”。从“中”开始遍历,直至“是”失配时,就得到两个词“中华”和“中华人”。由于以“中”字开头的词,到达“人”字后没有具体分词对应,因此就不需要再向后扫描。因此采用trie字符树的词典树查找过程是比较高效的。For example, the recognized character flow is "Chinese people are great". Start traversal from "中" until "is" is mismatched, and two words "中华" and "中华人" are obtained. Since the words beginning with the word "中" have no specific word segmentation corresponding to the word "人", there is no need to scan backwards. Therefore, the dictionary tree search process using the trie character tree is more efficient.
字符串“中华人是伟大的”,由前向后扫描,查找以第一个位置开始在词典树中的所有词,过程如下:The string "Chinese are great", scan from front to back, find all words starting with the first position in the dictionary tree, the process is as follows:
A、第一个字是“中”,在trie字符树中匹配到节点1。A. The first character is "中", which matches node 1 in the trie character tree.
B、继续查看第二个字是“华”,在节点1中找“中”的子节点,匹配到节点2。B. Continue to check that the second word is "Hua", find the child node of "中" in node 1, and match to node 2.
C、继续查看第三个字是“人”,在节点2中找“华”的子节点,匹配到节点3。C. Continue to check that the third word is "人", find the child node of "Hua" in node 2, and match to node 3.
D、继续查看第四个字是“是”,在节点3中“人”的子节点,没有表示“是”的,于是退出。D. Continue to check that the fourth word is "yes". In the child node of "person" in node 3, there is no "yes", so exit.
这样从“中”开始,查找到“中华”,“中华人”这样两个分词。因为在词典中节点2(华)和节点3(人)被标记了词尾“*”。从根节点1到2和3的两条路径,就可以分别用1-2和1-2-3表示“中华”和“中华人”这两个具体分词。In this way, starting from "中", two participle words such as "Zhonghua" and "Chinese" are found. Because in the dictionary, node 2 (Hua) and node 3 (person) are marked with suffix "*". From the two paths from the root node 1 to 2 and 3, 1-2 and 1-2-3 can be used to represent the two specific participle of "China" and "Chinese".
由上述查找的过程可以看出,最大匹配的次数在最糟的情况下等于词典树中最长词的长度。但实际上这个长度的平均值很小,不可能有很长的词。因此该算法可以看作常数级的算法,用复杂度函数O(1)表示。It can be seen from the above search process that the maximum number of matches is equal to the length of the longest word in the dictionary tree in the worst case. But in fact the average value of this length is very small, it is impossible to have very long words. Therefore, the algorithm can be regarded as a constant-level algorithm, represented by a complexity function O(1).
复杂度函数O(1)从算法过程来看,主要是一个顺序扫描识别字符和在词典树中查找的过程。在前面介绍的词典树中查找一个字符串,查找的次数等于该位置处的最长词的长度。由于实际词典树中的词长度都是有限的,因此平均查找次数可以是一个常数c。一般常数c<5。当遇到英文字符或数字字符时需要回溯找分词,但实际上英文字符串开头的分词都是一些很特殊的词,出现几率很小。这部分的影响对整个分词的复杂度影响可以忽略不计。Complexity function O(1) From the perspective of the algorithm process, it is mainly a process of sequentially scanning and recognizing characters and searching in the dictionary tree. Find a string in the dictionary tree introduced earlier, the number of searches is equal to the length of the longest word at that position. Since the length of words in the actual dictionary tree is limited, the average search times can be a constant c. The general constant c<5. When encountering English characters or numeric characters, it is necessary to look back to find the participle, but in fact the participles at the beginning of the English string are some very special words, and the probability of occurrence is very small. The impact of this part on the complexity of the entire word segmentation is negligible.
识别字符的过程也是常数复杂度的,对于一个长度为n的字符流,整个算法复杂度为O(k×n)。其中k为一常数因子,是线性复杂度。k一般平均为10左右。The process of character recognition is also of constant complexity. For a character stream with a length of n, the complexity of the entire algorithm is O(k×n). Where k is a constant factor, which is the linear complexity. k is generally around 10 on average.
本发明优选方式,保存两个字符缓冲区,分别用于存储输入字符流和处理后的输出字符流以及分出的分词的位置、长度信息。实际应用中,很少存在无分界符的长句子,所以可以将上述字符缓冲区容量设定为定值。当上述字符缓冲区装不下时,先返回该字符缓冲区中的分词集合,再填充该分词缓冲区的容量,最后重新进行分词。我们用空间复杂度表示程序需要使用的内存与输入数据规模即句子长度的函数关系,该函数是定值,不会随句子长度而增加。因此空间复杂度是常数,与字符流长度无关。In the preferred mode of the present invention, two character buffers are saved, which are respectively used to store the input character stream, the processed output character stream and the position and length information of the segmented words. In practical applications, long sentences without delimiters rarely exist, so the above-mentioned character buffer capacity can be set to a constant value. When the above-mentioned character buffer cannot fit, first return the word segmentation set in the character buffer, then fill the capacity of the word segmentation buffer, and finally perform word segmentation again. We use space complexity to represent the functional relationship between the memory required by the program and the size of the input data, that is, the length of the sentence. This function is a fixed value and will not increase with the length of the sentence. Therefore the space complexity is constant regardless of the length of the character stream.
字符缓冲区容量优选方式采用1024bit,当然可以根据具体需要设定其它数值。The preferred mode of the character buffer capacity is 1024bit, and of course other values can be set according to specific needs.
本发明所述词典树还可以采用二进制流的词典树,这样可以保证每个节点的子节点数量较少。二进制流的词典树是直接以数组下标定位的。并且英文字符串也可以放入二进制流的词典树中,从而实现了中英文混合词的划分。The dictionary tree of the present invention can also adopt a binary stream dictionary tree, which can ensure that the number of child nodes of each node is small. The dictionary tree of the binary stream is directly located by the array subscript. And English strings can also be put into the dictionary tree of the binary stream, thus realizing the division of Chinese and English mixed words.
参见图3,该图为本发明所述二进制流的词典树结构示意图。Referring to FIG. 3 , it is a schematic diagram of the dictionary tree structure of the binary stream in the present invention.
本发明所述二进制流的词典树结构的每个节点不是一个汉字,而是根据二进制数据拆成的几个bit。Each node of the dictionary tree structure of the binary stream in the present invention is not a Chinese character, but several bits split according to the binary data.
假设是4个bit,一个汉字包括两个字节,这样就需要4个节点。一个ASCII码字符需要2个节点。Assuming it is 4 bits, a Chinese character includes two bytes, so 4 nodes are needed. An ASCII character requires 2 nodes.
例如“中”的二进制表示为0xd0d6,那么按照bit位,从低位到高位拆分成为6d 0d四个数据块,则“中”字的树可以表示为图3所示形式。For example, the binary representation of "中" is 0xd0d6, then according to the bits, it is split into four data blocks of 6d 0d from low to high, then the tree of "中" can be expressed as shown in Figure 3.
“6”对应第一节点31;第一个“d”对应第二节点32;“0”对应第三节点33;第二个“d”对应第四节点34。"6" corresponds to the
本发明所述方法采用二进制流的词典树时,读取的字符串被看成二进制流。When the method of the present invention adopts the dictionary tree of the binary stream, the read character string is regarded as the binary stream.
例如查找“中华人”,它们的二进制格式为0xd0d60xaabb 0xcbc8,转成4bit表示的数字流为6d 0d b b a a c b c 8。当进行查找时,依次进行匹配。6匹配第一节点31,第一个d匹配第二节点32,0匹配第三节点33,第二个d匹配第四节点34。第四节点34是词尾的标记,则该次查找过程中6d 0d合起来是一个词,再还原成二进制格式0xd0d6即得到词“中”。这样每个节点的子节点最多有16个,由于4bit只能表示16种不同的数字,从而大大节省了空间。二进制流的词典树是对tries树的一种改进实现方式。For example, if you search for "Chinese", their binary format is 0xd0d60xaabb 0xcbc8, and the digital stream converted into 4bit is 6d 0d b b a a c b c 8. When doing a lookup, the matches are done sequentially. 6 matches the
当进行索引分词时,从前到后扫描该字符串,可以滤掉标点符号,在词典树中查找包含于该字符串中的所有词并输出。全英文字符组成的串或者全数字组成的串被认为是一个单词,也作为一个分词输出。在索引分词的时候具有一定量的冗余词。When performing index word segmentation, the string is scanned from front to back, punctuation marks can be filtered out, and all words contained in the string are searched in the dictionary tree and output. A string composed of all English characters or a string composed of all numbers is considered as a word and is also output as a word segmentation. There are a certain amount of redundant words when indexing word segmentation.
本发明还提供一种切分索引分词的系统,该系统能够同时解决分词准确、一定量的冗余词以及单字的分词问题,增强用户体验。The present invention also provides a system for segmenting and indexing word segmentation, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and word segmentation of single characters, and enhance user experience.
参见图4,该图为本发明第一种切分索引分词系统结构图。Referring to FIG. 4 , this figure is a structural diagram of the first segmentation index word segmentation system of the present invention.
本发明第一种切分索引分词系统,该系统包括:The first segmentation index word segmentation system of the present invention, the system includes:
读取单元41,用于读取字符流。The
字符流识别单元42,用于将所述读取单元41读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符。The character
由于读取单元41读取的字符流可能包含汉字、英文字符或数字以及不可识别字符,为了便于进行分词的划分,需要对其进行识别。Since the character stream read by the
词典树单元43,预先存储词组和短语的词典树的数据结构单元。The
词典树单元43是很重要的一个单元,可以有效的减少多余的冗余词。The
比较单元44,用于将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。The
通用模糊匹配单元45,用于将所述比较单元44比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。The general
分词管理单元46,将所述比较单元44和所述通用模糊匹配单元45确定的分词以及所述字符流识别单元42确定的不可识别字符按所述读取单元41读取的字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度。The word
分词划分单元47,将所述读取单元41读取的字符流,按照所述分词管理单元46记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分。The
由于本发明实施例所述系统的比较单元44将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。这样可以有效避免无效词组或者短语的出现,保证一定量的冗余词。在单字可以作为一个词或者具有实际意义的时候,词典树单元43中会按照正常的词组处理,所以可以实现单字作为一个分词的分词。并且本发明实施例所述系统增加了通用模糊匹配单元45,对所述比较单元44比较后的英文或数字进行ASCII码通用模糊匹配,有效的确定英文字符串以及数字串的分词。因此,本发明第一种切分索引分词系统能够同时解决分词准确、一定量的冗余词以及单字的分词问题,增强了用户的体验。Because the
参见图5,该图为本发明第二种切分索引分词系统结构图。Referring to FIG. 5 , it is a structural diagram of the second segmentation index word segmentation system of the present invention.
本发明第二种切分索引分词系统相对于第一种切分索引分词系统增加了内部字符缓冲区单元48和词典自适应单元49。Compared with the first segmentation index word segmentation system of the present invention, the second segmentation index word segmentation system adds an internal
读取单元41,用于读取字符流。The
字符流识别单元42,用于将所述读取单元41读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符。The character
所述内部字符缓冲区单元48用于存储所述字符流识别单元42识别的字符流,为分词划分单元47提供分词文本。The internal
当然字符流识别单元42还可以将识别的字符流进行字符统一后,才发送到所述内部字符缓冲区单元48中保存。Of course, the character
词典树单元43,预先存储词组和短语的词典树的数据结构单元。The
比较单元44,用于将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。The
通用模糊匹配单元45,用于将所述比较单元44比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。The general
分词管理单元46,将所述比较单元44和所述通用模糊匹配单元45确定的分词以及所述字符流识别单元42确定的不可识别字符按所述内部字符缓冲区单元48存储的所述字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度。The word
所述分词划分单元47,将所述内部字符缓冲区单元48存储的字符流,按照所述分词管理单元46记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分。The
词典自适应单元49,由预先建立的统计模块50统计关键词的出现频率,将所述出现频率高于预定数值的关键词添加到所述词典树单元43。The
预先建立的统计模块50用于统计关键词频率。统计模块50可以采用现有的用户统计关键词的通用统计模块。具体统计模块的工作原理为公知常识在此不再详述。The pre-established
词典自适应单元49根据所述统计模块50统计的关键词频率进行自适应操作。当所述统计模块50统计的关键词频率大于预定数值时,所述词典自适应单元49将该关键词添加到所述词典树单元43中。The dictionary
由于本发明实施例所述第二种切分索引分词系统,比较单元44将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。这样有效避免了无效词组或者短语的出现,而且保证了一定量的冗余词。在单字可以作为一个词或者具有实际意义时,词典树单元43可以按照正常的词组处理,所以可以实现单字作为一个分词的划分。并且本发明实施例所述第二种切分索引分词系统具有通用模糊匹配单元45,对所述比较单元44比较后的英文字符或数字进行ASCII码通用模糊匹配,有效的确定英文字符串以及数字串的分词。因此,本发明第二种切分索引分词系统能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强了用户的体验。Due to the second segmentation index word segmentation system described in the embodiment of the present invention, the
以上所述仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention, it should be pointed out that, for those of ordinary skill in the art, without departing from the principle of the present invention, some improvements and modifications can also be made, and these improvements and modifications can also be made. It should be regarded as the protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB2007101230513A CN100476800C (en) | 2007-06-22 | 2007-06-22 | A method and system for segmenting index words |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB2007101230513A CN100476800C (en) | 2007-06-22 | 2007-06-22 | A method and system for segmenting index words |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN101071420A CN101071420A (en) | 2007-11-14 |
| CN100476800C true CN100476800C (en) | 2009-04-08 |
Family
ID=38898647
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CNB2007101230513A Active CN100476800C (en) | 2007-06-22 | 2007-06-22 | A method and system for segmenting index words |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN100476800C (en) |
Families Citing this family (34)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101770478B (en) * | 2008-12-26 | 2013-04-24 | 高德信息技术有限公司 | Data retrieval method, data retrieval engine and embedded terminal |
| CN102455845B (en) * | 2010-10-14 | 2015-02-18 | 北京搜狗科技发展有限公司 | Character entry method and device |
| ES2577938T3 (en) * | 2010-11-10 | 2016-07-19 | Rakuten, Inc. | Related word recording device, information processing device, related word recording method, program for related word recording device, and storage medium |
| CN102012897B (en) * | 2010-12-02 | 2014-09-17 | 无敌科技(西安)有限公司 | Word-by-word comparison method for realizing high hit rate |
| CN102331999A (en) * | 2011-07-22 | 2012-01-25 | 大连亿创天地科技发展有限公司 | A search method and system for medical industry search box |
| CN102779163A (en) * | 2012-06-18 | 2012-11-14 | 青岛禧泰房产数据技术有限公司 | Quantization searching method and quantization searching system |
| CN103198146B (en) * | 2013-04-19 | 2015-05-27 | 中国科学院计算技术研究所 | Real-time event filtering method and real-time event filtering system oriented to network stream data |
| CN104268137A (en) * | 2013-07-31 | 2015-01-07 | 深圳市华傲数据技术有限公司 | Method and device for matching pharmaceutical name data |
| CN104462105B (en) | 2013-09-16 | 2019-01-22 | 腾讯科技(深圳)有限公司 | Chinese word cutting method, device and server |
| CN103870537B (en) * | 2013-12-03 | 2017-02-01 | 山东金质信息技术有限公司 | Intelligent word segmentation method for standard retrieval |
| CN103778200B (en) * | 2014-01-09 | 2017-08-08 | 中国科学院计算技术研究所 | A kind of message information source abstracting method and its system |
| CN106294371B (en) * | 2015-05-15 | 2019-08-16 | 阿里巴巴集团控股有限公司 | Character string codomain cutting method and device |
| CN104881503A (en) * | 2015-06-24 | 2015-09-02 | 郑州悉知信息技术有限公司 | Data processing method and device |
| CN105184053B (en) * | 2015-08-13 | 2018-09-07 | 易保互联医疗信息科技(北京)有限公司 | A kind of automatic coding and system of Chinese medical service item information |
| CN105095665B (en) * | 2015-08-13 | 2018-07-06 | 易保互联医疗信息科技(北京)有限公司 | A kind of natural language processing method and system of Chinese medical diagnosis on disease information |
| CN106202464B (en) * | 2016-07-18 | 2019-12-17 | 上海轻维软件有限公司 | data identification method based on mutation backtracking algorithm |
| CN106227661B (en) * | 2016-07-22 | 2019-01-08 | 腾讯科技(深圳)有限公司 | Data processing method and device |
| CN108009153A (en) * | 2017-12-08 | 2018-05-08 | 北京明朝万达科技股份有限公司 | A kind of searching method and system based on search statement cutting word result |
| CN108197313B (en) * | 2018-02-01 | 2021-06-25 | 中国计量大学 | A space-optimized dictionary indexing method through a 16-bit Trie tree |
| CN108388635B (en) * | 2018-02-24 | 2021-08-03 | 杭州朗和科技有限公司 | Data search method, apparatus, medium and computing device |
| CN110362650A (en) * | 2018-04-09 | 2019-10-22 | 深圳企业云科技股份有限公司 | Precisely participle realizes the search method of file full-text search |
| CN108664468A (en) * | 2018-05-02 | 2018-10-16 | 武汉烽火普天信息技术有限公司 | A kind of name recognition methods and device based on dictionary and semantic disambiguation |
| CN109657738B (en) * | 2018-10-25 | 2024-04-30 | 平安科技(深圳)有限公司 | Character recognition method, device, equipment and storage medium |
| CN109582972B (en) * | 2018-12-27 | 2023-05-16 | 信雅达科技股份有限公司 | Optical character recognition error correction method based on natural language recognition |
| CN113033193B (en) * | 2021-01-20 | 2024-04-16 | 山谷网安科技股份有限公司 | Mixed Chinese text word segmentation method based on C++ language |
| CN112765318A (en) * | 2021-01-20 | 2021-05-07 | 阅尔基因技术(苏州)有限公司 | Natural language processing method and system for infertility clinical phenotype information |
| CN113627722B (en) * | 2021-07-02 | 2024-04-02 | 湖北美和易思教育科技有限公司 | Simple answer scoring method based on keyword segmentation, terminal and readable storage medium |
| CN113836917B (en) * | 2021-09-28 | 2023-07-18 | 广州华多网络科技有限公司 | Text word segmentation processing method and device, equipment and medium thereof |
| CN114398880A (en) * | 2021-12-06 | 2022-04-26 | 北京思特奇信息技术股份有限公司 | System and method for optimizing Chinese word segmentation |
| CN114021564B (en) * | 2022-01-06 | 2022-04-01 | 成都无糖信息技术有限公司 | Segmentation word-taking method and system for social text |
| CN115391495B (en) * | 2022-10-28 | 2023-01-24 | 强企宝典(山东)信息科技有限公司 | Method, device and equipment for retrieving keywords in Chinese context |
| CN116226362B (en) * | 2023-05-06 | 2023-07-18 | 湖南德雅曼达科技有限公司 | A Word Segmentation Method to Improve the Accuracy of Searching Hospital Names |
| CN118734850B (en) * | 2024-06-27 | 2025-04-04 | 浪潮智慧科技有限公司 | Chinese word segmentation method, device and medium based on hash table and binary search tree |
| CN119862880B (en) * | 2024-12-04 | 2025-10-31 | 天翼云科技有限公司 | A word segmentation method and word segmenter for variable-length text based on openGemini |
-
2007
- 2007-06-22 CN CNB2007101230513A patent/CN100476800C/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN101071420A (en) | 2007-11-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN100476800C (en) | A method and system for segmenting index words | |
| KR101157693B1 (en) | Multi-stage query processing system and method for use with tokenspace repository | |
| US8392175B2 (en) | Phrase-based document clustering with automatic phrase extraction | |
| JP3143079B2 (en) | Dictionary index creation device and document search device | |
| CN1664818B (en) | The neologisms collection method split for word and system | |
| CN103198149B (en) | Method and system for query error correction | |
| US10389378B2 (en) | Computer product, information processing apparatus, and information search apparatus | |
| CN111444330A (en) | Method, device and equipment for extracting short text keywords and storage medium | |
| Hon et al. | Space-efficient frameworks for top-k string retrieval | |
| CN100483417C (en) | Method for catching limit word information, optimizing output and input method system | |
| CN107784110B (en) | A kind of index establishment method and apparatus | |
| CN105528411B (en) | Device and method for full-text retrieval of ship equipment interactive electronic technical manual | |
| CN105138514B (en) | It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method | |
| CN102103416B (en) | Chinese character input method and device | |
| CN101432686A (en) | Efficient storage and search of word lists and other text | |
| CN109885641B (en) | Method and system for searching Chinese full text in database | |
| US8682900B2 (en) | System, method and computer program product for documents retrieval | |
| CN115204147A (en) | Data feature fingerprint construction and similarity measurement method and index | |
| CN117763077A (en) | Data query method and device | |
| CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method | |
| Lipták et al. | BAT-LZ out of hell | |
| CN105938469A (en) | Encoding storage method, text storage data structure and text compression storage and statistical output method | |
| CN110209765A (en) | A method and device for searching keywords according to semantics | |
| CN114154494A (en) | Disambiguation word segmentation method, system, device and storage medium | |
| CN101299212B (en) | A Word Retrieval Method Based on Bitmap Compressed Key Tree |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| C41 | Transfer of patent application or patent right or utility model | ||
| TR01 | Transfer of patent right |
Effective date of registration: 20151222 Address after: The South Road in Guangdong province Shenzhen city Fiyta building 518057 floor 5-10 Nanshan District high tech Zone Patentee after: Shenzhen Tencent Computer System Co., Ltd. Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District Patentee before: Tencent Technology (Shenzhen) Co., Ltd. |