[go: up one dir, main page]

CN100476800C - A method and system for segmenting index words - Google Patents

A method and system for segmenting index words Download PDF

Info

Publication number
CN100476800C
CN100476800C CNB2007101230513A CN200710123051A CN100476800C CN 100476800 C CN100476800 C CN 100476800C CN B2007101230513 A CNB2007101230513 A CN B2007101230513A CN 200710123051 A CN200710123051 A CN 200710123051A CN 100476800 C CN100476800 C CN 100476800C
Authority
CN
China
Prior art keywords
character
participle
unit
english
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2007101230513A
Other languages
Chinese (zh)
Other versions
CN101071420A (en
Inventor
王启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CNB2007101230513A priority Critical patent/CN100476800C/en
Publication of CN101071420A publication Critical patent/CN101071420A/en
Application granted granted Critical
Publication of CN100476800C publication Critical patent/CN100476800C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a segmentation index segmentation method. Including the following steps: read the character stream; identification described the character stream to identify Chinese characters and English characters, as well as an identification number or character; already identified Chinese and English characters or Digital and pre-built 1.1 tree comparison, the sub-set match words; English characters or figures generic fuzzy matching ASCII codes to determine English string or string of digital-term matching the above mentioned English words and string or digital string of words and non-recognition of characters referred to the character stream by order of ranking; The words and figures mentioned in the English string or strings of the sort described in the order of the character stream. The invention also openly segmentation Index segmentation system. The invention provides a cut-word indexing method and system can simultaneously address the precise words, a certain amount of redundant words and word-term problems, enhance the user experience.

Description

一种切分索引分词的方法及系统 A method and system for segmenting index words

技术领域 technical field

本发明涉及信息索引领域,特别涉及一种切分索引分词的方法及系统。The invention relates to the field of information indexing, in particular to a method and system for segmenting index words.

背景技术 Background technique

现有信息检索系统已经日益普及,大到网络搜索引擎,小到特定应用信息检索系统。当需要进行汉字信息的处理时,信息检索系统就会遇到如何分词的问题。Existing information retrieval systems have become increasingly popular, ranging from web search engines to application-specific information retrieval systems. When it is necessary to process Chinese character information, the information retrieval system will encounter the problem of how to segment words.

目前的分词算法有很多种,其中n元语法分词是一种不需要词典的机械分词方法,实现容易。但是该分词方法冗余度大,不能解决单字分词问题。There are many word segmentation algorithms at present, among which n-gram word segmentation is a mechanical word segmentation method that does not require a dictionary and is easy to implement. However, this word segmentation method has a large degree of redundancy and cannot solve the problem of single-character word segmentation.

二元分词方法是将句子中任意出现的两个紧邻的字都分出来,建立倒排索引。例如:句子“从上述实现步骤来看”会分出“从上、上述、述实、实现、现步、步骤、骤来、来看”等几个词。从上述分出来的分词可以看出,如“述实”、“现步”等分词并没有实际意义。而且该方法也不能解决单字分词的问题,不能对英文词进行划分。The binary word segmentation method is to separate any two adjacent words that appear in a sentence, and build an inverted index. For example: the sentence "From the perspective of the above steps of realization" can be divided into several words such as "From above, above, stating the facts, realizing, present step, step, step, and seeing". It can be seen from the participle separated above that participles such as "reporting facts" and "present situation" have no practical meaning. Moreover, this method cannot solve the problem of word segmentation, and cannot divide English words.

最大匹配分词方法是一种按照最长词优先的原则匹配分词的方法。例如:句子“从上述实现步骤来看”可能被分为“从、上述、实现步骤、来看”等几个词。这种方法分出的词比较少,但不一定是最短的,而且也不一定准确。由于这种分词方法没有一定量的冗余词,可能会导致查全率下降,在某些应用场合体验不好。The maximum matching word segmentation method is a method of matching word segmentation according to the longest word first principle. For example: the sentence "from the above realization steps" may be divided into several words such as "from, above, realization steps, view". The number of words separated by this method is relatively small, but it is not necessarily the shortest, and it is not necessarily accurate. Since this word segmentation method does not have a certain amount of redundant words, it may lead to a decrease in the recall rate, and the experience is not good in some applications.

基于统计或语义分析的分词方法,需要解决歧义消解问题。这种分词方法得到的结果不一定是全面的,但比较准确。但由于这种分词方法实现起来比较麻烦,而且复杂的分析过程必然会从一定程度影响分词效率,这种分词方法不适用于小型特定应用信息检索系统。Word segmentation methods based on statistical or semantic analysis need to solve the problem of disambiguation. The results obtained by this word segmentation method are not necessarily comprehensive, but relatively accurate. However, because this word segmentation method is cumbersome to implement, and the complex analysis process will inevitably affect the word segmentation efficiency to a certain extent, this word segmentation method is not suitable for small specific application information retrieval systems.

发明内容 Contents of the invention

本发明的目的是提供一种切分索引分词的方法,该方法能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强用户体验。The purpose of the present invention is to provide a method for segmenting index word segmentation, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single-word word segmentation, and enhance user experience.

本发明的目的还提供一种切分索引分词的系统,该系统能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强用户体验。The object of the present invention is also to provide a system for segmenting index word segmentation, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single word segmentation, and enhance user experience.

为解决上述技术问题,本发明实施例提供一种切分索引分词的方法,包括以下步骤:In order to solve the above technical problems, an embodiment of the present invention provides a method for segmenting index words, including the following steps:

读取字符流;read character stream;

识别所述字符流,确定汉字、英文字符或数字以及不可识别字符;Identify the character stream, determine Chinese characters, English characters or numbers and unrecognizable characters;

将已经确定的汉字、英文字符或数字与预先建立的词典树比较,确定匹配的分词;Compare the determined Chinese characters, English characters or numbers with the pre-established dictionary tree to determine the matching word segmentation;

将所述英文字符或数字进行ASCII码Wild match(通用模糊匹配),确定英文字符串或者数字串的分词;Carry out ASCII code Wild match (universal fuzzy matching) with described English character or numeral, determine the participle of English character string or numerical string;

将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流顺序进行排序;Sorting the above-mentioned matched word segmentation and the word segmentation and unrecognizable characters of the English character string or numeric string according to the sequence of the character stream;

按所述排序后的分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流。The character stream is divided according to the sorted word segmentation order and the length of each word segment and the above-mentioned unrecognizable characters.

优选地,所述词典树为预先建立的trie字符树数据结构。Preferably, the dictionary tree is a pre-established trie character tree data structure.

优选地,所述词典树为预先建立的二进制流词典结构。Preferably, the dictionary tree is a pre-established binary stream dictionary structure.

优选地,所述识别所述字符流后,将所述字符流存储在内部字符缓冲区。Preferably, after the character stream is identified, the character stream is stored in an internal character buffer.

优选地,在所述字符流存储在内部字符缓冲区之前,将所述字符流进行统一字符的处理。Preferably, before the character stream is stored in the internal character buffer, the character stream is subjected to uniform character processing.

优选地,所述确定汉字、英文字符或数字以及不可识别字符后,去掉所述字符流中的标点符号。Preferably, after the determination of Chinese characters, English characters or numbers and unrecognizable characters, the punctuation marks in the character stream are removed.

优选地,所述词典树在预先建立时去除无意义的单字。Preferably, the dictionary tree removes meaningless words during pre-establishment.

优选地,按所述排序后的分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流后进一步包括:Preferably, after dividing the character stream according to the word segmentation order after the sorting and the length of each word segmentation and the above-mentioned unrecognizable characters, further comprising:

定期统计接收到的关键词的频率;Regularly count the frequency of keywords received;

将频率高于预定数值的关键词添加到所述词典树中。Keywords with a frequency higher than a predetermined value are added to the dictionary tree.

本发明实施例提供一种切分索引分词的系统,该系统包括:An embodiment of the present invention provides a system for segmenting index words, the system includes:

读取单元,用于读取字符流;A reading unit for reading a character stream;

字符流识别单元,用于将所述读取单元读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符;A character stream identification unit, configured to identify the character stream read by the reading unit, to determine Chinese characters, English characters or numbers, and unrecognizable characters;

词典树单元,预先存储词组和短语的词典树的数据结构单元;The dictionary tree unit, the data structure unit of the dictionary tree storing phrases and phrases in advance;

比较单元,用于将所述字符流识别单元确定的汉字、英文字符或数字与所述词典树单元预先建立的词典树比较,确定匹配的分词;The comparison unit is used to compare the Chinese characters, English characters or numbers determined by the character stream identification unit with the dictionary tree previously established by the dictionary tree unit, and determine the matching word segmentation;

通用模糊匹配单元,用于将所述比较单元比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词;The general fuzzy matching unit is used to carry out the general fuzzy matching of ASCII codes to the English characters or numbers compared by the comparison unit to determine the word segmentation of English character strings or digital strings;

分词管理单元,将所述比较单元和所述通用模糊匹配单元确定的分词以及所述字符流识别单元确定的不可识别字符按所述读取单元读取的字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度;The word segmentation management unit sorts the word segmentation determined by the comparison unit and the general fuzzy matching unit and the unrecognizable characters determined by the character stream identification unit according to the sequence of the character stream read by the reading unit, and records each The above word segmentation and the length of the above unrecognizable characters;

分词划分单元,将所述读取单元读取的字符流,按照所述分词管理单元记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分。The word segmentation unit divides the character stream read by the reading unit according to the word segmentation sequence recorded by the word segmentation management unit and the length of each word segment and the above-mentioned unrecognizable characters.

本发明实施例还提供一种切分索引分词的系统,该系统包括:The embodiment of the present invention also provides a system for segmenting index and word segmentation, the system includes:

读取单元,用于读取字符流;A reading unit for reading a character stream;

字符流识别单元,用于将所述读取单元读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符;A character stream identification unit, configured to identify the character stream read by the reading unit, to determine Chinese characters, English characters or numbers, and unrecognizable characters;

内部字符缓冲区单元,用于存储所述字符流识别单元识别的字符流;An internal character buffer unit for storing the character stream identified by the character stream identification unit;

词典树单元,预先存储词组和短语的词典树的数据结构单元;The dictionary tree unit, the data structure unit of the dictionary tree storing phrases and phrases in advance;

比较单元,用于将所述字符流识别单元确定的汉字、英文字符或数字与所述词典树单元预先建立的词典树比较,确定匹配的分词;The comparison unit is used to compare the Chinese characters, English characters or numbers determined by the character stream identification unit with the dictionary tree previously established by the dictionary tree unit, and determine the matching word segmentation;

通用模糊匹配单元,用于将所述比较单元比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词;The general fuzzy matching unit is used to carry out the general fuzzy matching of ASCII codes to the English characters or numbers compared by the comparison unit to determine the word segmentation of English character strings or digital strings;

分词管理单元,将所述比较单元和所述通用模糊匹配单元确定的分词以及所述字符流识别单元确定的不可识别字符按所述内部字符缓冲区单元存储的所述字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度;The word segmentation management unit sorts the word segmentation determined by the comparison unit and the general fuzzy matching unit and the unrecognizable characters determined by the character stream identification unit according to the sequence of the character stream stored in the internal character buffer unit, and Record the length of each of the above-mentioned word segmentation and the above-mentioned unrecognizable characters;

分词划分单元,将所述内部字符缓冲区单元存储的字符流,按照所述分词管理单元记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分;The word segmentation unit divides the character stream stored in the internal character buffer unit according to the word segmentation sequence recorded by the word segmentation management unit and the length of each word segmentation and the above-mentioned unrecognizable characters;

词典自适应单元,由预先建立的统计模块统计关键词的出现频率,将所述出现频率高于预定数值的关键词添加到所述词典树单元。The dictionary adaptive unit uses a pre-established statistical module to count the frequency of occurrence of keywords, and adds keywords whose frequency of occurrence is higher than a predetermined value to the dictionary tree unit.

本发明所述切分索引分词的方法能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强了用户的体验。The method for segmenting index words in the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and single word segmentation, and enhance user experience.

本发明实施例所述切分索引分词的方法,包括读取字符流;识别所述字符流中字符,确定汉字、英文字符或数字以及不可识别字符;将已经确定的汉字、英文字符或数字与预先建立的词典树相比较,确定匹配的分词;将英文字符或数字进行ASCII码Wild match,确定英文字符串或者数字串的分词;将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流中顺序进行排序;按所述分词和所述英文字符串或者数字串排序的顺序划分所述字符流。The method for segmenting an index word in the embodiment of the present invention includes reading a character stream; identifying characters in the character stream, determining Chinese characters, English characters or numbers and unrecognizable characters; combining the determined Chinese characters, English characters or numbers with Compare the pre-established dictionary trees to determine the matching word segmentation; perform ASCII code Wild match on English characters or numbers to determine the word segmentation of the English string or number string; compare the above matched word segmentation with the word segmentation of the English string or number string And the unrecognizable characters are sorted according to the order in the character stream; the character stream is divided according to the order in which the word segmentation and the English character string or number string are sorted.

由于在划分分词前,所有的汉字、英文字符或数字均与预先建立的词典树相比较,避免了无效词组或者短语的出现,而且保证了适当的冗余词。在单字可以作为一个词或者具有实际意义的时候,词典树中会按照正常的词组处理,所以可以实现单字作为一个分词的划分。并且本发明增加了ASCII码Wild match的过程,有效的确定英文字符串以及数字串的分词。Because all Chinese characters, English characters or numbers are compared with a pre-established dictionary tree before dividing and segmenting, avoiding the occurrence of invalid phrases or phrases, and ensuring appropriate redundant words. When a word can be used as a word or has practical meaning, it will be processed as a normal phrase in the dictionary tree, so the division of a word as a participle can be realized. And the present invention has increased the process of ASCII code Wild match, effectively determines the participle of English character string and number string.

因此,本发明所述切分索引分词的方法能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强了用户的体验。Therefore, the method for segmenting index words in the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single-word word segmentation, and enhance user experience.

附图说明 Description of drawings

图1为本发明所述方法一种实施方式流程图;Fig. 1 is a flow chart of an embodiment of the method of the present invention;

图2为本发明所述trie字符树结构示意图;Fig. 2 is the trie character tree structural representation of the present invention;

图3本发明所述二进制流的词典树结构示意图;Fig. 3 is a schematic diagram of a dictionary tree structure of a binary stream according to the present invention;

图4为本发明第一种切分索引分词系统结构图;Fig. 4 is the structural diagram of the first segmentation index word segmentation system of the present invention;

图5为本发明第二种切分索引分词系统结构图。FIG. 5 is a structural diagram of the second segmentation index word segmentation system of the present invention.

具体实施方式 Detailed ways

本发明提供一种切分索引分词的方法,该方法能够同时解决分词准确、一定量的冗余词以及单字划分的问题,增强用户体验。The invention provides a method for segmenting index words, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and word division, and enhance user experience.

为了使本技术领域的技术人员更好地理解本发明方案,下面结合附图和具体实施方式对本发明作进一步的详细说明。In order to enable those skilled in the art to better understand the solution of the present invention, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

参见图1,该图为本发明所述方法一种实施方式流程图。Referring to Fig. 1, this figure is a flowchart of an embodiment of the method of the present invention.

S10、读取字符流。S10. Read the character stream.

读取的字符流中,可能包含汉字也可能包含英文字符、数字以及不可识别字符。The read character stream may contain Chinese characters or English characters, numbers and unrecognizable characters.

S20、识别所述字符流,确定汉字、英文字符或数字以及不可识别字符。S20. Identify the character stream, and determine Chinese characters, English characters or numbers, and unrecognizable characters.

将步骤S10读取的字符流进行字符识别,确定所述字符流中字符具体为汉字或者英文字符或者数字或者不可识别字符。Perform character recognition on the character stream read in step S10, and determine that the characters in the character stream are specifically Chinese characters or English characters or numbers or unrecognizable characters.

由于对所述字符流中的字符进行识别,可以很容易实现对多种字符集的切分。Since the characters in the character stream are recognized, the segmentation of various character sets can be easily realized.

S30、将已经确定的汉字、英文字符或数字与预先建立的词典树比较,确定匹配的分词。S30. Compare the determined Chinese characters, English characters or numbers with the pre-established dictionary tree, and determine the matching word segmentation.

词典树可以是常用的汉语词或者短语的集合,也可以有英文字符和汉字的混合词。The dictionary tree can be a collection of commonly used Chinese words or phrases, or a mixture of English characters and Chinese characters.

将步骤S20确定的汉字、英文字符或数字与预先建立的词典树相比较,确定匹配的分词,并将该分词对应的字符串建立倒排索引。倒排索引为现有技术,在此不再详述。Comparing the Chinese characters, English characters or numbers determined in step S20 with the pre-established dictionary tree, determining the matching word segmentation, and building an inverted index for the character string corresponding to the word segmentation. The inverted index is a prior art and will not be described in detail here.

对于一个字符流进行分词时,如下的字符串会被建立倒排索引。When performing word segmentation for a character stream, the following strings will be built with an inverted index.

1、该字符流中所有存在于所述词典树中的子串。1. All substrings existing in the dictionary tree in the character stream.

2、单个汉字建立倒排索引,实现了一元分词。2. Create an inverted index for a single Chinese character and realize a one-element word segmentation.

3、全英文字符组成的英文单词或者英文字符串。3. English words or English character strings composed of all English characters.

4、全数字字符组成的数字字符串。4. A numeric string composed of all numeric characters.

S40、将所述英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。S40. Perform ASCII universal fuzzy matching on the English characters or numbers to determine word segmentation of the English character string or number string.

ASCII码通用模糊匹配用于保证分词的完整性,当英文字符串或者数字串自成一词时,可以被标记为一个分词。The general fuzzy matching of ASCII codes is used to ensure the integrity of the word segmentation. When an English string or number string forms a word by itself, it can be marked as a word segmentation.

假如需要切分的字符流是:“useU盘”,所述词典树中存在“U盘”这个词。Wild match扫描遇到第一个英文字符“u”时,一直向后扫描直至这个字符串中最后一个英文字符,查找得到字符流中完整的英文字符串“useU”,确定为一个英文字符串的分词。If the character stream that needs to be segmented is: "useU disk", the word "U disk" exists in the dictionary tree. When the wild match scan encounters the first English character "u", it scans backward until the last English character in the string, finds the complete English string "useU" in the character stream, and determines it as an English string Participle.

S50、将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流顺序进行排序。S50. Sorting the above-mentioned matched word segments, word segments of the English character string or number string, and unrecognizable characters according to the sequence of the character stream.

在整个字符流分词排序过程中,汉字分词都是按照汉字字符在字符流中的位置有序排序的。但当遇到英文字符串或数字串时,产生的分词有可能出现无序甚至重复的现象,此时按所述字符流中字符的顺序进行排序。这样最终得到的分词都是按照各分词在原字符流中的顺序有序排列的,不需要专门调用排序模块进行处理。In the entire character stream word segmentation sorting process, the Chinese character word segmentation is sorted in an orderly manner according to the position of the Chinese characters in the character stream. However, when encountering English character strings or numeric strings, the generated word segmentation may appear out of order or even repeated, and at this time, sort according to the order of the characters in the character stream. In this way, the finally obtained word segmentations are arranged in an orderly manner according to the order of each word segmentation in the original character stream, and there is no need to call a sorting module for processing.

S60、按所述分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流。S60. Divide the character stream according to the order of word segmentation and the length of each word segment and the aforementioned unrecognizable characters.

步骤S50已经将上述各分词进行有序的排列,此时只需按所述各分词顺序以及所述各分词和上述不可识别字符的长度划分所述字符流即可。Step S50 has arranged the above-mentioned participles in an orderly manner. At this time, it is only necessary to divide the character stream according to the order of each participle, the length of each participle and the above-mentioned unrecognizable characters.

由于在划分分词前,所有的汉字、英文字符串或数字均与预先建立的词典树相比较,避免了无效词组或者短语的出现,而且保证了适当的冗余词。在单字可以作为一个词或者具有实际意义的时候,所述词典树中会按照正常的词组处理,因此能够实现单字作为一个分词的划分。并且本发明增加了ASCII码通用模糊匹配的过程,有效的确定英文字符串以及数字串的分词。因此,本发明所述切分索引分词的方法能够同时解决分词准确、一定量的冗余词以及单字划分的问题,增强了用户的体验。Since all Chinese characters, English character strings or numbers are compared with a pre-established dictionary tree before dividing and segmenting, the occurrence of invalid phrases or phrases is avoided, and appropriate redundant words are guaranteed. When a word can be used as a word or has actual meaning, it will be processed as a normal phrase in the dictionary tree, so the division of a word as a word segment can be realized. Moreover, the present invention adds a general fuzzy matching process of ASCII codes to effectively determine the word segmentation of English character strings and digital strings. Therefore, the method for segmenting index words in the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and word division, and enhance user experience.

现有二元分词、三元分词等分词方法虽然不需要词典树,但是生成的词比较多,很多词是没有意义的,而且无法对中英文混合词的划分。本发明实施例所述方法应用了词典树,由于分词的划分是以词典树为依据,有效的避免了无意义词的划分。Although existing word segmentation methods such as binary word segmentation and trigram word segmentation do not require a dictionary tree, they generate more words, many of which are meaningless, and cannot classify mixed words in Chinese and English. The method described in the embodiment of the present invention uses a dictionary tree, and since the division of word segmentation is based on the dictionary tree, the division of meaningless words is effectively avoided.

本发明实施例所述方法根据实际数据,最终产生的索引比二元分词多一倍,但实际上是由于单个汉字被进行分词,能够实现一元分词。本发明优选方案,在建立词典树时将无意义的单字或者不需要搜索的单字去掉。According to the actual data, the method described in the embodiment of the present invention finally generates twice as many indexes as the binary word segmentation, but in fact, because a single Chinese character is segmented, the unary word segmentation can be realized. In the preferred solution of the present invention, meaningless words or words that do not need to be searched are removed when the dictionary tree is established.

从效率上看,本发明实施例所述方法比n元语法分词要稍慢,因为增加了词典树查找匹配的过程。但本发明实施例所述方法在搜索的时候,词典树中的词将会以最短路径的形式进行分词,从而大大减少了检索索引的次数,对搜索性能有很大改善。In terms of efficiency, the method described in the embodiment of the present invention is slightly slower than the n-gram word segmentation, because the process of searching for a match in the dictionary tree is added. However, when the method described in the embodiment of the present invention is searching, the words in the dictionary tree will be segmented in the form of the shortest path, thereby greatly reducing the number of index retrieval times and greatly improving the search performance.

本发明实施例所述方法与最大匹配比较,由于采用的是全切分,查全率就可以达到100%,比最大匹配更全面。这在小型信息检索中能够满足要求。Compared with the maximum matching, the method described in the embodiment of the present invention adopts full segmentation, and the recall rate can reach 100%, which is more comprehensive than the maximum matching. This is sufficient for small information retrieval.

本发明实施例所述方法的分词效率是比较快的。经过粗略测试,在linux下,语料为一般网页,词典树为海量分词词典树,大约有20多万个单词,占用内存80M左右。通过本发明实施例所述方法进行分词,一秒钟能处理3至4M的数据。The word segmentation efficiency of the method described in the embodiment of the present invention is relatively fast. After a rough test, under linux, the corpus is a general web page, and the dictionary tree is a massive word segmentation dictionary tree, with about 200,000 words and about 80M of memory. Word segmentation is performed by the method described in the embodiment of the present invention, and 3 to 4M data can be processed in one second.

建立索引有很多过程,分词只是第一步,相对于其他过程来说,一秒能处理3至4M的数据已经很快。因为其他过程例如合并索引等需要硬盘I/O,速度相对较慢。因此,在整个索引过程中本发明所述分词方法不会造成瓶颈现象。There are many processes in indexing, and word segmentation is only the first step. Compared with other processes, it is very fast to process 3 to 4M data per second. Because other processes such as merging indexes require hard disk I/O, the speed is relatively slow. Therefore, the word segmentation method of the present invention will not cause bottleneck phenomenon in the whole indexing process.

我们以“useU盘”这个字符流为例,具体说明分词过程。Let's take the character stream of "useU disk" as an example to illustrate the word segmentation process in detail.

由于每个字符都会被查找,上述第一个英文字符被标记为这个英文字符串分词的开始,并且标记“u”是这个英文字符串的一个分词的开始。Since each character will be searched, the first English character above is marked as the start of a word segment of this English string, and the marker "u" is the start of a word segment of this English string.

“u”的下一个位置字符为“s”,词典中查不到“us”,不是分词的结束,不输出任何分词。The character at the next position of "u" is "s", and "us" cannot be found in the dictionary, and it is not the end of the participle, so no participle is output.

依此类推,当遇到第二个“U”时,发现字典里面有词“U盘”,说明这个位置也是一个分词的开始。By analogy, when the second "U" is encountered, it is found that there is a word "U disk" in the dictionary, indicating that this position is also the beginning of a word segmentation.

同时回溯,把前面标记过的分词开始位置第一个“u”和当前位置组成一个分词“use”输出。并标记当前位置是分词的开始,接下来依次进行扫描。At the same time, backtracking, the first "u" at the start position of the previously marked participle and the current position form a participle "use" output. And mark the current position as the beginning of word segmentation, and then scan in turn.

最终输出的分词为:“use”“useU”“U盘”。The final output word segmentation is: "use" "useU" "U disk".

数字串的通用模糊匹配过程与上述英文字符串的过程形似,具体数字串的通用模糊匹配过程在此不再详述。The general fuzzy matching process of the numeric string is similar to the process of the above-mentioned English string, and the specific general fuzzy matching process of the numeric string will not be described in detail here.

本发明所述方法的优选实施方式,包括以下步骤:A preferred embodiment of the method of the present invention comprises the following steps:

S10、读取字符流。S10. Read the character stream.

S20、识别所述字符流,确定汉字、英文字符或数字以及不可识别字符。S20. Identify the character stream, and determine Chinese characters, English characters or numbers, and unrecognizable characters.

S21、将识别后的字符流存储在内部字符缓冲区。S21. Store the recognized character stream in an internal character buffer.

所述字符流存储在内部字符缓冲区之前,还可以将所述字符流进行统一字符的处理。Before the character stream is stored in the internal character buffer, the character stream can also be processed with unified characters.

S30、将已经确定的汉字、英文字符或数字与预先建立的词典树比较,确定匹配的分词。S30. Compare the determined Chinese characters, English characters or numbers with the pre-established dictionary tree, and determine the matching word segmentation.

S40、将英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。S40. Perform general fuzzy matching of ASCII codes on the English characters or numbers to determine word segmentation of the English character string or number string.

S50、将上述匹配的分词和所述英文字符串或者数字串的分词以及不可识别字符,按所述字符流顺序进行排序。S50. Sorting the above-mentioned matched word segments, word segments of the English character string or number string, and unrecognizable characters according to the sequence of the character stream.

S60、将所述内部字符缓冲区存储的字符流按所述分词顺序以及所述每个分词和上述不可识别字符的长度划分所述字符流。S60. Divide the character stream stored in the internal character buffer according to the sequence of word segmentation and the length of each word segment and the aforementioned unrecognizable characters.

所述步骤S60后还可以包括:After the step S60, it may also include:

S61、定期统计接收到的关键词的频率。S61. Regularly count the frequencies of the received keywords.

定期统计接收到的关键词的频率可以由预先建立的统计模块进行统计。统计模块用于统计关键词频率,这种统计模块可以采用现有的用户统计关键词的通用统计模块。具体统计模块的工作原理为公知常识在此不再详述。Periodic statistics The frequency of the received keywords can be counted by a pre-established statistics module. The statistical module is used to count the frequency of keywords, and this statistical module can adopt the existing general statistical module for counting keywords by users. The working principle of the specific statistical module is common knowledge and will not be described in detail here.

S62、将频率高于预定数值的关键词添加到所述词典树中。S62. Add keywords with frequencies higher than a predetermined value to the dictionary tree.

用户可以根据自身行业或者具体需求,定期统计用户的输入关键字,将频率高于预定数值的关键词添加到所述词典树中。预定数值可以根据实际需要进行设定。According to the user's own industry or specific needs, the user can regularly count the input keywords of the user, and add keywords with a frequency higher than a predetermined value to the dictionary tree. The predetermined value can be set according to actual needs.

用户可以将统计后的高频关键词加入到所述词典树中,然后再重新建立索引。这样所述词典树就可以具有自适应性,对检索速度将是一个很大的提升。The user can add the counted high-frequency keywords into the dictionary tree, and then rebuild the index. In this way, the dictionary tree can be adaptive, which will greatly improve the retrieval speed.

本发明实施例所述词典树可以为预先建立的map映射表或者hash_map(哈希表)的数据结构。所述词典树采用map映射表或者hash_map数据结构时,需要查找两次数据,而且无法得知分词在哪个字符终止,需要遍历整个句子才能查找到分词。The dictionary tree in the embodiment of the present invention may be a pre-established map mapping table or a hash_map (hash table) data structure. When the dictionary tree adopts a map mapping table or a hash_map data structure, it is necessary to search for data twice, and it is impossible to know which character the participle ends at, and it is necessary to traverse the entire sentence to find the participle.

本发明所述词典树优选情况,采用预先建立的trie字符树数据结构。采用trie字符树能够比较好的满足要求。所述词典树采用trie字符树可以在一次查找过程中查找到多个词,而且容易找到分词的终止位置。关于trie字符树目前已经有很多成熟的模型,本文不再赘述。The preferred situation of the dictionary tree of the present invention adopts a pre-established trie character tree data structure. Using the trie character tree can better meet the requirements. The trie character tree used in the dictionary tree can find multiple words in one search process, and it is easy to find the end position of the word segmentation. There are already many mature models about the trie character tree, so this article will not repeat them.

参见图2,该图为本发明所述trie字符树结构示意图。Referring to Fig. 2, this figure is a schematic diagram of the trie character tree structure of the present invention.

图2中每个方框代表一个节点,对应表示一个汉字或字符。带“*”标记的方框,表示从根节点到该节点的路径所组成的汉字集或者字符串对应于词典树中的一个词或短语。方框中的数字是对该节点的编号。Each box in Figure 2 represents a node, which corresponds to a Chinese character or character. The box marked with "*" indicates that the Chinese character set or character string formed by the path from the root node to this node corresponds to a word or phrase in the dictionary tree. The number in the box is the number of the node.

例如识别的字符流是“中华人是伟大的”。从“中”开始遍历,直至“是”失配时,就得到两个词“中华”和“中华人”。由于以“中”字开头的词,到达“人”字后没有具体分词对应,因此就不需要再向后扫描。因此采用trie字符树的词典树查找过程是比较高效的。For example, the recognized character flow is "Chinese people are great". Start traversal from "中" until "is" is mismatched, and two words "中华" and "中华人" are obtained. Since the words beginning with the word "中" have no specific word segmentation corresponding to the word "人", there is no need to scan backwards. Therefore, the dictionary tree search process using the trie character tree is more efficient.

字符串“中华人是伟大的”,由前向后扫描,查找以第一个位置开始在词典树中的所有词,过程如下:The string "Chinese are great", scan from front to back, find all words starting with the first position in the dictionary tree, the process is as follows:

A、第一个字是“中”,在trie字符树中匹配到节点1。A. The first character is "中", which matches node 1 in the trie character tree.

B、继续查看第二个字是“华”,在节点1中找“中”的子节点,匹配到节点2。B. Continue to check that the second word is "Hua", find the child node of "中" in node 1, and match to node 2.

C、继续查看第三个字是“人”,在节点2中找“华”的子节点,匹配到节点3。C. Continue to check that the third word is "人", find the child node of "Hua" in node 2, and match to node 3.

D、继续查看第四个字是“是”,在节点3中“人”的子节点,没有表示“是”的,于是退出。D. Continue to check that the fourth word is "yes". In the child node of "person" in node 3, there is no "yes", so exit.

这样从“中”开始,查找到“中华”,“中华人”这样两个分词。因为在词典中节点2(华)和节点3(人)被标记了词尾“*”。从根节点1到2和3的两条路径,就可以分别用1-2和1-2-3表示“中华”和“中华人”这两个具体分词。In this way, starting from "中", two participle words such as "Zhonghua" and "Chinese" are found. Because in the dictionary, node 2 (Hua) and node 3 (person) are marked with suffix "*". From the two paths from the root node 1 to 2 and 3, 1-2 and 1-2-3 can be used to represent the two specific participle of "China" and "Chinese".

由上述查找的过程可以看出,最大匹配的次数在最糟的情况下等于词典树中最长词的长度。但实际上这个长度的平均值很小,不可能有很长的词。因此该算法可以看作常数级的算法,用复杂度函数O(1)表示。It can be seen from the above search process that the maximum number of matches is equal to the length of the longest word in the dictionary tree in the worst case. But in fact the average value of this length is very small, it is impossible to have very long words. Therefore, the algorithm can be regarded as a constant-level algorithm, represented by a complexity function O(1).

复杂度函数O(1)从算法过程来看,主要是一个顺序扫描识别字符和在词典树中查找的过程。在前面介绍的词典树中查找一个字符串,查找的次数等于该位置处的最长词的长度。由于实际词典树中的词长度都是有限的,因此平均查找次数可以是一个常数c。一般常数c<5。当遇到英文字符或数字字符时需要回溯找分词,但实际上英文字符串开头的分词都是一些很特殊的词,出现几率很小。这部分的影响对整个分词的复杂度影响可以忽略不计。Complexity function O(1) From the perspective of the algorithm process, it is mainly a process of sequentially scanning and recognizing characters and searching in the dictionary tree. Find a string in the dictionary tree introduced earlier, the number of searches is equal to the length of the longest word at that position. Since the length of words in the actual dictionary tree is limited, the average search times can be a constant c. The general constant c<5. When encountering English characters or numeric characters, it is necessary to look back to find the participle, but in fact the participles at the beginning of the English string are some very special words, and the probability of occurrence is very small. The impact of this part on the complexity of the entire word segmentation is negligible.

识别字符的过程也是常数复杂度的,对于一个长度为n的字符流,整个算法复杂度为O(k×n)。其中k为一常数因子,是线性复杂度。k一般平均为10左右。The process of character recognition is also of constant complexity. For a character stream with a length of n, the complexity of the entire algorithm is O(k×n). Where k is a constant factor, which is the linear complexity. k is generally around 10 on average.

本发明优选方式,保存两个字符缓冲区,分别用于存储输入字符流和处理后的输出字符流以及分出的分词的位置、长度信息。实际应用中,很少存在无分界符的长句子,所以可以将上述字符缓冲区容量设定为定值。当上述字符缓冲区装不下时,先返回该字符缓冲区中的分词集合,再填充该分词缓冲区的容量,最后重新进行分词。我们用空间复杂度表示程序需要使用的内存与输入数据规模即句子长度的函数关系,该函数是定值,不会随句子长度而增加。因此空间复杂度是常数,与字符流长度无关。In the preferred mode of the present invention, two character buffers are saved, which are respectively used to store the input character stream, the processed output character stream and the position and length information of the segmented words. In practical applications, long sentences without delimiters rarely exist, so the above-mentioned character buffer capacity can be set to a constant value. When the above-mentioned character buffer cannot fit, first return the word segmentation set in the character buffer, then fill the capacity of the word segmentation buffer, and finally perform word segmentation again. We use space complexity to represent the functional relationship between the memory required by the program and the size of the input data, that is, the length of the sentence. This function is a fixed value and will not increase with the length of the sentence. Therefore the space complexity is constant regardless of the length of the character stream.

字符缓冲区容量优选方式采用1024bit,当然可以根据具体需要设定其它数值。The preferred mode of the character buffer capacity is 1024bit, and of course other values can be set according to specific needs.

本发明所述词典树还可以采用二进制流的词典树,这样可以保证每个节点的子节点数量较少。二进制流的词典树是直接以数组下标定位的。并且英文字符串也可以放入二进制流的词典树中,从而实现了中英文混合词的划分。The dictionary tree of the present invention can also adopt a binary stream dictionary tree, which can ensure that the number of child nodes of each node is small. The dictionary tree of the binary stream is directly located by the array subscript. And English strings can also be put into the dictionary tree of the binary stream, thus realizing the division of Chinese and English mixed words.

参见图3,该图为本发明所述二进制流的词典树结构示意图。Referring to FIG. 3 , it is a schematic diagram of the dictionary tree structure of the binary stream in the present invention.

本发明所述二进制流的词典树结构的每个节点不是一个汉字,而是根据二进制数据拆成的几个bit。Each node of the dictionary tree structure of the binary stream in the present invention is not a Chinese character, but several bits split according to the binary data.

假设是4个bit,一个汉字包括两个字节,这样就需要4个节点。一个ASCII码字符需要2个节点。Assuming it is 4 bits, a Chinese character includes two bytes, so 4 nodes are needed. An ASCII character requires 2 nodes.

例如“中”的二进制表示为0xd0d6,那么按照bit位,从低位到高位拆分成为6d 0d四个数据块,则“中”字的树可以表示为图3所示形式。For example, the binary representation of "中" is 0xd0d6, then according to the bits, it is split into four data blocks of 6d 0d from low to high, then the tree of "中" can be expressed as shown in Figure 3.

“6”对应第一节点31;第一个“d”对应第二节点32;“0”对应第三节点33;第二个“d”对应第四节点34。"6" corresponds to the first node 31 ; the first "d" corresponds to the second node 32 ; "0" corresponds to the third node 33 ; and the second "d" corresponds to the fourth node 34 .

本发明所述方法采用二进制流的词典树时,读取的字符串被看成二进制流。When the method of the present invention adopts the dictionary tree of the binary stream, the read character string is regarded as the binary stream.

例如查找“中华人”,它们的二进制格式为0xd0d60xaabb 0xcbc8,转成4bit表示的数字流为6d 0d b b a a c b c 8。当进行查找时,依次进行匹配。6匹配第一节点31,第一个d匹配第二节点32,0匹配第三节点33,第二个d匹配第四节点34。第四节点34是词尾的标记,则该次查找过程中6d 0d合起来是一个词,再还原成二进制格式0xd0d6即得到词“中”。这样每个节点的子节点最多有16个,由于4bit只能表示16种不同的数字,从而大大节省了空间。二进制流的词典树是对tries树的一种改进实现方式。For example, if you search for "Chinese", their binary format is 0xd0d60xaabb 0xcbc8, and the digital stream converted into 4bit is 6d 0d b b a a c b c 8. When doing a lookup, the matches are done sequentially. 6 matches the first node 31, the first d matches the second node 32, 0 matches the third node 33, and the second d matches the fourth node 34. The 4th node 34 is the mark of end of a word, then 6d 0d is a word together in this search process, then restores to binary form 0xd0d6 and promptly obtains word " in ". In this way, each node has a maximum of 16 child nodes, and since 4bit can only represent 16 different numbers, the space is greatly saved. The dictionary tree of the binary stream is an improved implementation of the tries tree.

当进行索引分词时,从前到后扫描该字符串,可以滤掉标点符号,在词典树中查找包含于该字符串中的所有词并输出。全英文字符组成的串或者全数字组成的串被认为是一个单词,也作为一个分词输出。在索引分词的时候具有一定量的冗余词。When performing index word segmentation, the string is scanned from front to back, punctuation marks can be filtered out, and all words contained in the string are searched in the dictionary tree and output. A string composed of all English characters or a string composed of all numbers is considered as a word and is also output as a word segmentation. There are a certain amount of redundant words when indexing word segmentation.

本发明还提供一种切分索引分词的系统,该系统能够同时解决分词准确、一定量的冗余词以及单字的分词问题,增强用户体验。The present invention also provides a system for segmenting and indexing word segmentation, which can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words and word segmentation of single characters, and enhance user experience.

参见图4,该图为本发明第一种切分索引分词系统结构图。Referring to FIG. 4 , this figure is a structural diagram of the first segmentation index word segmentation system of the present invention.

本发明第一种切分索引分词系统,该系统包括:The first segmentation index word segmentation system of the present invention, the system includes:

读取单元41,用于读取字符流。The reading unit 41 is used to read the character stream.

字符流识别单元42,用于将所述读取单元41读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符。The character stream identifying unit 42 is configured to identify the character stream read by the reading unit 41 to determine Chinese characters, English characters or numbers and unrecognizable characters.

由于读取单元41读取的字符流可能包含汉字、英文字符或数字以及不可识别字符,为了便于进行分词的划分,需要对其进行识别。Since the character stream read by the reading unit 41 may contain Chinese characters, English characters or numbers, and unrecognizable characters, in order to facilitate word segmentation, it needs to be recognized.

词典树单元43,预先存储词组和短语的词典树的数据结构单元。The dictionary tree unit 43 is a data structure unit of a dictionary tree for storing phrases and phrases in advance.

词典树单元43是很重要的一个单元,可以有效的减少多余的冗余词。The dictionary tree unit 43 is a very important unit, which can effectively reduce redundant redundant words.

比较单元44,用于将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。The comparison unit 44 is configured to compare the Chinese characters, English characters or numbers determined by the character stream identification unit 42 with the dictionary tree pre-established by the dictionary tree unit 43, and determine the matching word segmentation.

通用模糊匹配单元45,用于将所述比较单元44比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。The general fuzzy matching unit 45 is used to perform ASCII general fuzzy matching on the English characters or numbers compared by the comparison unit 44 to determine the word segmentation of the English character string or the digital string.

分词管理单元46,将所述比较单元44和所述通用模糊匹配单元45确定的分词以及所述字符流识别单元42确定的不可识别字符按所述读取单元41读取的字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度。The word segmentation management unit 46 sorts the word segmentation determined by the comparison unit 44 and the general fuzzy matching unit 45 and the unrecognizable characters determined by the character stream identification unit 42 according to the sequence of the character stream read by the reading unit 41 , and record the length of each of the above word segmentations and the above unrecognizable characters.

分词划分单元47,将所述读取单元41读取的字符流,按照所述分词管理单元46记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分。The word segmentation unit 47 divides the character stream read by the reading unit 41 according to the word segmentation order recorded by the word segmentation management unit 46 and the length of each word segment and the above-mentioned unrecognizable characters.

由于本发明实施例所述系统的比较单元44将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。这样可以有效避免无效词组或者短语的出现,保证一定量的冗余词。在单字可以作为一个词或者具有实际意义的时候,词典树单元43中会按照正常的词组处理,所以可以实现单字作为一个分词的分词。并且本发明实施例所述系统增加了通用模糊匹配单元45,对所述比较单元44比较后的英文或数字进行ASCII码通用模糊匹配,有效的确定英文字符串以及数字串的分词。因此,本发明第一种切分索引分词系统能够同时解决分词准确、一定量的冗余词以及单字的分词问题,增强了用户的体验。Because the comparison unit 44 of the system according to the embodiment of the present invention compares the Chinese characters, English characters or numbers determined by the character stream identification unit 42 with the dictionary tree pre-established by the dictionary tree unit 43, and determines the matching word segmentation. This can effectively avoid the occurrence of invalid phrases or phrases, and ensure a certain amount of redundant words. When a single character can be used as a word or has practical meaning, it will be processed according to a normal phrase in the dictionary tree unit 43, so the word segmentation of a single character as a word segmentation can be realized. And the system of the embodiment of the present invention adds a general fuzzy matching unit 45, which performs ASCII code general fuzzy matching on the English or numbers compared by the comparison unit 44, and effectively determines the word segmentation of English character strings and digital strings. Therefore, the first segmentation index word segmentation system of the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and word segmentation of single characters, thereby enhancing user experience.

参见图5,该图为本发明第二种切分索引分词系统结构图。Referring to FIG. 5 , it is a structural diagram of the second segmentation index word segmentation system of the present invention.

本发明第二种切分索引分词系统相对于第一种切分索引分词系统增加了内部字符缓冲区单元48和词典自适应单元49。Compared with the first segmentation index word segmentation system of the present invention, the second segmentation index word segmentation system adds an internal character buffer unit 48 and a dictionary adaptive unit 49 .

读取单元41,用于读取字符流。The reading unit 41 is used to read the character stream.

字符流识别单元42,用于将所述读取单元41读取的字符流进行识别,确定汉字、英文字符或数字以及不可识别字符。The character stream identifying unit 42 is configured to identify the character stream read by the reading unit 41 to determine Chinese characters, English characters or numbers and unrecognizable characters.

所述内部字符缓冲区单元48用于存储所述字符流识别单元42识别的字符流,为分词划分单元47提供分词文本。The internal character buffer unit 48 is used to store the character stream identified by the character stream identification unit 42 and provide word segmentation text for the word segmentation unit 47 .

当然字符流识别单元42还可以将识别的字符流进行字符统一后,才发送到所述内部字符缓冲区单元48中保存。Of course, the character stream identification unit 42 can also unify the identified character streams before sending them to the internal character buffer unit 48 for storage.

词典树单元43,预先存储词组和短语的词典树的数据结构单元。The dictionary tree unit 43 is a data structure unit of a dictionary tree for storing phrases and phrases in advance.

比较单元44,用于将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。The comparison unit 44 is configured to compare the Chinese characters, English characters or numbers determined by the character stream identification unit 42 with the dictionary tree pre-established by the dictionary tree unit 43, and determine the matching word segmentation.

通用模糊匹配单元45,用于将所述比较单元44比较后的英文字符或数字进行ASCII码通用模糊匹配,确定英文字符串或者数字串的分词。The general fuzzy matching unit 45 is used to perform ASCII general fuzzy matching on the English characters or numbers compared by the comparison unit 44 to determine the word segmentation of the English character string or the digital string.

分词管理单元46,将所述比较单元44和所述通用模糊匹配单元45确定的分词以及所述字符流识别单元42确定的不可识别字符按所述内部字符缓冲区单元48存储的所述字符流顺序进行排序,并记录每个上述分词和上述不可识别字符的长度。The word segmentation management unit 46, the word segmentation determined by the comparison unit 44 and the general fuzzy matching unit 45 and the unrecognizable characters determined by the character stream identification unit 42 are stored according to the character stream stored in the internal character buffer unit 48 Sequentially sort, and record the length of each of the above-mentioned word segmentation and the above-mentioned unrecognizable characters.

所述分词划分单元47,将所述内部字符缓冲区单元48存储的字符流,按照所述分词管理单元46记录的分词顺序以及所述每个分词和上述不可识别字符的长度进行划分。The word segmentation unit 47 divides the character stream stored in the internal character buffer unit 48 according to the word segmentation order recorded by the word segmentation management unit 46 and the length of each word segment and the above-mentioned unrecognizable characters.

词典自适应单元49,由预先建立的统计模块50统计关键词的出现频率,将所述出现频率高于预定数值的关键词添加到所述词典树单元43。The dictionary adaptation unit 49 uses a pre-established statistics module 50 to count the frequency of occurrence of keywords, and adds keywords whose frequency of occurrence is higher than a predetermined value to the dictionary tree unit 43 .

预先建立的统计模块50用于统计关键词频率。统计模块50可以采用现有的用户统计关键词的通用统计模块。具体统计模块的工作原理为公知常识在此不再详述。The pre-established statistical module 50 is used to count the frequency of keywords. The statistics module 50 may adopt an existing general statistics module for user statistics keywords. The working principle of the specific statistical module is common knowledge and will not be described in detail here.

词典自适应单元49根据所述统计模块50统计的关键词频率进行自适应操作。当所述统计模块50统计的关键词频率大于预定数值时,所述词典自适应单元49将该关键词添加到所述词典树单元43中。The dictionary adaptive unit 49 performs an adaptive operation according to the keyword frequency counted by the statistical module 50 . When the keyword frequency counted by the statistical module 50 is greater than a predetermined value, the dictionary adaptation unit 49 adds the keyword to the dictionary tree unit 43 .

由于本发明实施例所述第二种切分索引分词系统,比较单元44将所述字符流识别单元42确定的汉字、英文字符或数字与所述词典树单元43预先建立的词典树比较,确定匹配的分词。这样有效避免了无效词组或者短语的出现,而且保证了一定量的冗余词。在单字可以作为一个词或者具有实际意义时,词典树单元43可以按照正常的词组处理,所以可以实现单字作为一个分词的划分。并且本发明实施例所述第二种切分索引分词系统具有通用模糊匹配单元45,对所述比较单元44比较后的英文字符或数字进行ASCII码通用模糊匹配,有效的确定英文字符串以及数字串的分词。因此,本发明第二种切分索引分词系统能够同时解决分词准确、一定量的冗余词以及单字分词的问题,增强了用户的体验。Due to the second segmentation index word segmentation system described in the embodiment of the present invention, the comparison unit 44 compares the Chinese characters, English characters or numbers determined by the character stream identification unit 42 with the dictionary tree previously established by the dictionary tree unit 43, and determines Matching participle. This effectively avoids the occurrence of invalid phrases or phrases, and ensures a certain amount of redundant words. When a single character can be used as a word or has actual meaning, the dictionary tree unit 43 can process it as a normal phrase, so the division of a single character as a word segmentation can be realized. And the second kind of segmentation index word segmentation system described in the embodiment of the present invention has general fuzzy matching unit 45, carries out ASCII code general fuzzy matching to the English character or numeral after described comparison unit 44 comparison, effectively determines English character string and numeral participle of the string. Therefore, the second segmentation index word segmentation system of the present invention can simultaneously solve the problems of accurate word segmentation, a certain amount of redundant words, and single word segmentation, and enhance user experience.

以上所述仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention, it should be pointed out that, for those of ordinary skill in the art, without departing from the principle of the present invention, some improvements and modifications can also be made, and these improvements and modifications can also be made. It should be regarded as the protection scope of the present invention.

Claims (10)

1, a kind of method of cutting index participle is characterized in that, may further comprise the steps:
Read character stream;
Discern described character stream, determine Chinese character, English character or numeral and unrecognizable character;
Chinese character, English character or numeral of having determined and the lexicographic tree of setting up are in advance compared, determine the participle of coupling;
Described English character or numeral are carried out the general fuzzy matching of ASCII character, determine the participle of English character string or numeric string;
With the participle of above-mentioned coupling and the participle and the unrecognizable character of described English character string or numeric string, sort in proper order by described character stream;
Divide described character stream by the order of the participle after the described ordering and the length of described each participle and above-mentioned unrecognizable character.
2, the method for cutting index participle according to claim 1 is characterized in that, described lexicographic tree is the trie character tree data structure of setting up in advance.
3, the method for cutting index participle according to claim 1 is characterized in that, described lexicographic tree is the binary stream dictionary configuration of setting up in advance.
4, according to the method for the arbitrary described cutting index participle of claim 1 to 3, it is characterized in that, behind the described character stream of described identification, described character stream is stored in inner character buffer.
5, the method for cutting index participle according to claim 4 is characterized in that, described character stream is stored in before the inner character buffer, described character stream is unified the processing of character.
6, the method for cutting index participle according to claim 5 is characterized in that, behind described definite Chinese character, English character or numeral and the unrecognizable character, removes the punctuation mark in the described character stream.
According to the method for the arbitrary described cutting index participle of claim 1 to 3, it is characterized in that 7, described lexicographic tree is removed insignificant individual character when setting up in advance.
8, according to the method for the arbitrary described cutting index participle of claim 1 to 3, it is characterized in that, further comprise after dividing described character stream by the length of the order of the participle after the described ordering and described each participle and above-mentioned unrecognizable character:
Regularly add up the frequency of the keyword that receives;
The keyword that frequency is higher than predetermined value adds in the described lexicographic tree.
9, a kind of system of cutting index participle is characterized in that, this system comprises:
Reading unit is used to read character stream;
The character stream recognition unit is used for the character stream that described reading unit reads is discerned, and determines Chinese character, English character or numeral and unrecognizable character;
The lexicographic tree unit is stored the data structure cell of the lexicographic tree of phrase and phrase in advance;
Comparing unit, the Chinese character, English character or the lexicographic tree digital and that described lexicographic tree unit is set up in advance that are used for described character stream recognition unit is determined are compared, and determine the participle of coupling;
General fuzzy matching unit is used for English character or the numeral of described comparing unit after relatively carried out the general fuzzy matching of ASCII character, determines the participle of English character string or numeric string;
The participle administrative unit, with described comparing unit and definite participle and the definite unrecognizable character of described character stream recognition unit of described general fuzzy matching unit, the character stream that reads by described reading unit sorts in proper order, and writes down the length of each above-mentioned participle and above-mentioned unrecognizable character;
The participle division unit with the character stream that described reading unit reads, is divided according to the branch word order of described participle management unit records and the length of described each participle and above-mentioned unrecognizable character.
10, a kind of system of cutting index participle is characterized in that, this system comprises:
Reading unit is used to read character stream;
The character stream recognition unit is used for the character stream that described reading unit reads is discerned, and determines Chinese character, English character or numeral and unrecognizable character;
Inner character buffer location is used to store the character stream of described character stream recognition unit identification;
The lexicographic tree unit is stored the data structure cell of the lexicographic tree of phrase and phrase in advance;
Comparing unit, the Chinese character, English character or the lexicographic tree digital and that described lexicographic tree unit is set up in advance that are used for described character stream recognition unit is determined are compared, and determine the participle of coupling;
General fuzzy matching unit is used for English character or the numeral of described comparing unit after relatively carried out the general fuzzy matching of ASCII character, determines the participle of English character string or numeric string;
The participle administrative unit, with described comparing unit and definite participle and the definite unrecognizable character of described character stream recognition unit of described general fuzzy matching unit, described character stream by described inner character buffer location storage sorts in proper order, and writes down the length of each above-mentioned participle and above-mentioned unrecognizable character;
The participle division unit with the character stream of described inner character buffer location storage, is divided according to the branch word order of described participle management unit records and the length of described each participle and above-mentioned unrecognizable character;
The dictionary adaptive unit, by the frequency of occurrences of the statistical module counts keyword of setting up in advance, the keyword that the described frequency of occurrences is higher than predetermined value adds described lexicographic tree unit to.
CNB2007101230513A 2007-06-22 2007-06-22 A method and system for segmenting index words Active CN100476800C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101230513A CN100476800C (en) 2007-06-22 2007-06-22 A method and system for segmenting index words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101230513A CN100476800C (en) 2007-06-22 2007-06-22 A method and system for segmenting index words

Publications (2)

Publication Number Publication Date
CN101071420A CN101071420A (en) 2007-11-14
CN100476800C true CN100476800C (en) 2009-04-08

Family

ID=38898647

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101230513A Active CN100476800C (en) 2007-06-22 2007-06-22 A method and system for segmenting index words

Country Status (1)

Country Link
CN (1) CN100476800C (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770478B (en) * 2008-12-26 2013-04-24 高德信息技术有限公司 Data retrieval method, data retrieval engine and embedded terminal
CN102455845B (en) * 2010-10-14 2015-02-18 北京搜狗科技发展有限公司 Character entry method and device
ES2577938T3 (en) * 2010-11-10 2016-07-19 Rakuten, Inc. Related word recording device, information processing device, related word recording method, program for related word recording device, and storage medium
CN102012897B (en) * 2010-12-02 2014-09-17 无敌科技(西安)有限公司 Word-by-word comparison method for realizing high hit rate
CN102331999A (en) * 2011-07-22 2012-01-25 大连亿创天地科技发展有限公司 A search method and system for medical industry search box
CN102779163A (en) * 2012-06-18 2012-11-14 青岛禧泰房产数据技术有限公司 Quantization searching method and quantization searching system
CN103198146B (en) * 2013-04-19 2015-05-27 中国科学院计算技术研究所 Real-time event filtering method and real-time event filtering system oriented to network stream data
CN104268137A (en) * 2013-07-31 2015-01-07 深圳市华傲数据技术有限公司 Method and device for matching pharmaceutical name data
CN104462105B (en) 2013-09-16 2019-01-22 腾讯科技(深圳)有限公司 Chinese word cutting method, device and server
CN103870537B (en) * 2013-12-03 2017-02-01 山东金质信息技术有限公司 Intelligent word segmentation method for standard retrieval
CN103778200B (en) * 2014-01-09 2017-08-08 中国科学院计算技术研究所 A kind of message information source abstracting method and its system
CN106294371B (en) * 2015-05-15 2019-08-16 阿里巴巴集团控股有限公司 Character string codomain cutting method and device
CN104881503A (en) * 2015-06-24 2015-09-02 郑州悉知信息技术有限公司 Data processing method and device
CN105184053B (en) * 2015-08-13 2018-09-07 易保互联医疗信息科技(北京)有限公司 A kind of automatic coding and system of Chinese medical service item information
CN105095665B (en) * 2015-08-13 2018-07-06 易保互联医疗信息科技(北京)有限公司 A kind of natural language processing method and system of Chinese medical diagnosis on disease information
CN106202464B (en) * 2016-07-18 2019-12-17 上海轻维软件有限公司 data identification method based on mutation backtracking algorithm
CN106227661B (en) * 2016-07-22 2019-01-08 腾讯科技(深圳)有限公司 Data processing method and device
CN108009153A (en) * 2017-12-08 2018-05-08 北京明朝万达科技股份有限公司 A kind of searching method and system based on search statement cutting word result
CN108197313B (en) * 2018-02-01 2021-06-25 中国计量大学 A space-optimized dictionary indexing method through a 16-bit Trie tree
CN108388635B (en) * 2018-02-24 2021-08-03 杭州朗和科技有限公司 Data search method, apparatus, medium and computing device
CN110362650A (en) * 2018-04-09 2019-10-22 深圳企业云科技股份有限公司 Precisely participle realizes the search method of file full-text search
CN108664468A (en) * 2018-05-02 2018-10-16 武汉烽火普天信息技术有限公司 A kind of name recognition methods and device based on dictionary and semantic disambiguation
CN109657738B (en) * 2018-10-25 2024-04-30 平安科技(深圳)有限公司 Character recognition method, device, equipment and storage medium
CN109582972B (en) * 2018-12-27 2023-05-16 信雅达科技股份有限公司 Optical character recognition error correction method based on natural language recognition
CN113033193B (en) * 2021-01-20 2024-04-16 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language
CN112765318A (en) * 2021-01-20 2021-05-07 阅尔基因技术(苏州)有限公司 Natural language processing method and system for infertility clinical phenotype information
CN113627722B (en) * 2021-07-02 2024-04-02 湖北美和易思教育科技有限公司 Simple answer scoring method based on keyword segmentation, terminal and readable storage medium
CN113836917B (en) * 2021-09-28 2023-07-18 广州华多网络科技有限公司 Text word segmentation processing method and device, equipment and medium thereof
CN114398880A (en) * 2021-12-06 2022-04-26 北京思特奇信息技术股份有限公司 System and method for optimizing Chinese word segmentation
CN114021564B (en) * 2022-01-06 2022-04-01 成都无糖信息技术有限公司 Segmentation word-taking method and system for social text
CN115391495B (en) * 2022-10-28 2023-01-24 强企宝典(山东)信息科技有限公司 Method, device and equipment for retrieving keywords in Chinese context
CN116226362B (en) * 2023-05-06 2023-07-18 湖南德雅曼达科技有限公司 A Word Segmentation Method to Improve the Accuracy of Searching Hospital Names
CN118734850B (en) * 2024-06-27 2025-04-04 浪潮智慧科技有限公司 Chinese word segmentation method, device and medium based on hash table and binary search tree
CN119862880B (en) * 2024-12-04 2025-10-31 天翼云科技有限公司 A word segmentation method and word segmenter for variable-length text based on openGemini

Also Published As

Publication number Publication date
CN101071420A (en) 2007-11-14

Similar Documents

Publication Publication Date Title
CN100476800C (en) A method and system for segmenting index words
KR101157693B1 (en) Multi-stage query processing system and method for use with tokenspace repository
US8392175B2 (en) Phrase-based document clustering with automatic phrase extraction
JP3143079B2 (en) Dictionary index creation device and document search device
CN1664818B (en) The neologisms collection method split for word and system
CN103198149B (en) Method and system for query error correction
US10389378B2 (en) Computer product, information processing apparatus, and information search apparatus
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
Hon et al. Space-efficient frameworks for top-k string retrieval
CN100483417C (en) Method for catching limit word information, optimizing output and input method system
CN107784110B (en) A kind of index establishment method and apparatus
CN105528411B (en) Device and method for full-text retrieval of ship equipment interactive electronic technical manual
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN102103416B (en) Chinese character input method and device
CN101432686A (en) Efficient storage and search of word lists and other text
CN109885641B (en) Method and system for searching Chinese full text in database
US8682900B2 (en) System, method and computer program product for documents retrieval
CN115204147A (en) Data feature fingerprint construction and similarity measurement method and index
CN117763077A (en) Data query method and device
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
Lipták et al. BAT-LZ out of hell
CN105938469A (en) Encoding storage method, text storage data structure and text compression storage and statistical output method
CN110209765A (en) A method and device for searching keywords according to semantics
CN114154494A (en) Disambiguation word segmentation method, system, device and storage medium
CN101299212B (en) A Word Retrieval Method Based on Bitmap Compressed Key Tree

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20151222

Address after: The South Road in Guangdong province Shenzhen city Fiyta building 518057 floor 5-10 Nanshan District high tech Zone

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.