CN101488127B

CN101488127B - Bit mark character string fuzzy retrieval method for grouping character and labellng with bit

Info

Publication number: CN101488127B
Application number: CN200510057491.4A
Authority: CN
Inventors: 徐文新
Original assignee: Individual
Current assignee: Individual
Priority date: 2005-01-17
Filing date: 2005-09-13
Publication date: 2015-01-07
Anticipated expiration: 2025-09-13
Also published as: WO2006074586A1; CN101488127A; CN1645374A

Abstract

本发明公开了一种对字符分组并以位标记的字符串检索模糊方法，方法包括：分全部字符元为n组，并用有n个位(bit)的数据W来记录组成字符串S的字符P元信息。如果S的字符元P₁属于j组，则将W的第j个位设置为1；P2属于k组，则将W的第k个位设置为1…，全部字符元标记完成后的W，称为S的“位值”，通过对S_a的W_a、与S_b的W_b进行比较，可以判断S_b不包含、或可能包含S_a的所有字符元；对有可能的S_b，再用字符比较方法判断是否包含S_a。基于以上方法，按逻辑代数，可以推衍出多种标记和比较位值的方法。该方法可以提高检索速度数倍，但需要首先对字符串进行标记，所以适用于检索范围比较固定的字符串查找。The invention discloses a character string retrieval fuzzy method for grouping characters and marking them with bits. The method includes: dividing all character elements into n groups, and using data W with n bits to record the characters constituting the character string S P meta information. If the character element P ₁ of S belongs to group j, set the jth bit of W to 1; if P2 belongs to group k, set the kth bit of W to 1..., W after all character elements are marked, It is called the "bit value" of S. By comparing W _a of S _a with W _b of S _b , it can be judged that S _b does not contain or may contain all the character elements of S _a ; for possible S _b , Then use the character comparison method to judge whether S _a is included. Based on the above methods, according to logic algebra, various methods of marking and comparing bit values can be derived. This method can increase the retrieval speed several times, but it needs to mark the string first, so it is suitable for string search with a relatively fixed retrieval range.

Description

A Fuzzy Retrieval Method of String Characters Grouped and Marked by Bits

技术领域technical field

本发明涉及一种字符串模糊检索方法，快速有效地确定目的字符串可能存在的范围，再使用已有的字符匹配方法定位目的字符串，适用于检索范围比较固定的字符串检索领域。The invention relates to a character string fuzzy search method, which quickly and effectively determines the possible range of a target character string, and then uses an existing character matching method to locate the target character string, and is suitable for the string search field where the search range is relatively fixed.

背景技术Background technique

通常的字符串模糊检索采用逐位比较方式进行，如判断字符串S＝“bdopfqew”中是否包含字符f，计算机以主串S的第一个字符b与f进行比较，再以S的第二个字符d与f进行比较，以此类推，直到S的第5个字符与f相同，匹配成功，这是最简单的情形。如果子串T长度为2个字符以上，简单的模式匹配算法，即BF算法，是以S₁与T₁比较，若不同则以S₂与T₁比较，依次类推，直到S的某一个字符S_i与T₁相同，再将它们之后的S_i+1与T₁₊₁进行比较，若也相同，则继续往下比较，当S的某一个字符S_i+n与T_1+n不同时，则返回，再以S_i+1与T₁作新一轮比较，重复以上过程，直到T中的字符全部比完，则匹配成功，否则匹配失败。随着检索关键词T的长度增加，字符匹配的复杂程度相应增加。改进后的模式匹配算法，即KMP算法，对小字符集的拼音文字来说，避免了回溯，但对字符集大、单字符频度低的汉字字符串而言，实质意义不大。简而言之，BF算法与KMP算法均是对主串和子串的字符进行逐位比较。The usual character string fuzzy search is carried out in a bit-by-bit comparison mode, such as judging whether the character string S="bdopfqew" contains character f, the computer compares the first character b and f of the main string S, and then uses the second character of S The first character d is compared with f, and so on, until the fifth character of S is the same as f, and the match is successful. This is the simplest case. If the length of the substring T is more than 2 characters, the simple pattern matching algorithm, that is, the BF algorithm, compares S ₁ with T ₁ , if they are different, compares S ₂ with T ₁ , and so on, until a certain character of S S _i is the same as T ₁ , then compare S _i+1 with T ₁₊₁ after them, if they are also the same, continue to compare, when a certain character S _i+n of S is different from T _1+n At the same time, return, and then make a new round of comparison with S _i+1 and T ₁ , repeat the above process until all the characters in T are compared, then the match is successful, otherwise the match fails. As the length of the retrieval keyword T increases, the complexity of character matching increases accordingly. The improved pattern matching algorithm, that is, the KMP algorithm, avoids backtracking for pinyin characters with small character sets, but it has little practical significance for Chinese character strings with large character sets and low frequency of single characters. In short, both the BF algorithm and the KMP algorithm compare the characters of the main string and the substring bit by bit.

2004年本人提出“质数代换字符串检索技术”，该方法可以提高字符串模糊检索的速度2-3倍，但对于长字符串实施该方法，需要较多的空间存贮质数乘积值。为了提高字符串模糊检索的速度，并减少对存贮空间的需求，本发明提出用数据的n个位(bit)来标记字符串的组成信息，标记后之后的数据称该字符串的“位值”，对两个字符串的位值进行比较，并结合通常的字符逐位比较法，实现字符串的模糊检索。测试表明，速度是一般的字符逐位比较模糊检索的数倍乃至十几倍以上。In 2004, I proposed "Prime Number Substitution String Retrieval Technology". This method can improve the speed of string fuzzy retrieval by 2-3 times, but implementing this method for long strings requires more space to store the prime number product value. In order to improve the speed of character string fuzzy retrieval and reduce the demand for storage space, the present invention proposes to mark the composition information of character string with n bits (bit) of data, and the data after marking is called "bit" of the character string. value", compare the bit values of two strings, and combine the usual character-by-bit comparison method to realize the fuzzy retrieval of strings. Tests show that the speed is several times or even more than ten times faster than that of general character-by-digit fuzzy retrieval.

发明内容Contents of the invention

本发明是一种字符串检索技术，目的是提高字符串模糊检索的速度。以一个位(bit)对应若干个字符元，以n个位对应全部字符元，也就是分全部字符元为n组，并用一个数据的n个均为0的位，记为W_F，来标记组成字符串的字符元信息。如果若干个字符串S的一个字符元P₁属于第n组，则相应地将W的第n个位标记为1，类似地，根据S其它字符元P₂、P₃、P₄…所属的组对W进行标记，完成全部字符元标记后的W，记录有S的信息，称为S的“位值”，这种方式称为1标记。根据逻辑代数的原理，也可用一个数据的n个均为1的位，记为来标记组成字符串的字符元信息。如果S的一个字符元P属于第n组，则相应地将数据的第n个位标记为0，这种方式称为0标记。通过对S_a的“位值”W_a、与S_b的“位值”W_b、进行比较，可以判断S_b“不包含”、“包含”或“可能包含”S_b的所有字符元。例如，对W_a与W_b进行位蕴含运算，如果所有的位都有蕴含关系，则S_b包含或可能包含S_a的“所有字符元”。如果需要，再用通常的字符逐位比较方法判断S_b是否包含“S_a”。以下简称本发明方法为“位标记检索”、“位标记字符串检索”“位标记字符串检索技术”。The invention is a character string retrieval technology, and the purpose is to improve the speed of character string fuzzy retrieval. One bit (bit) corresponds to several character elements, and n bits correspond to all character elements, that is, all character elements are divided into n groups, and n bits of a data are all 0, marked as W _F . Metainformation about the characters that make up the string. _If a character element P ₁ of several character _strings _S belongs to the nth group, the nth bit of W is marked as 1 accordingly, and similarly, according to the The group marks W, and after completing all character meta-marks, W records the information of S, which is called the "bit value" of S, and this method is called 1 mark. According to the principle of logic algebra, it is also possible to use n bits of a data that are all 1, which is recorded as To mark the meta-information of the characters that make up the string. If a character element P of S belongs to the nth group, the data The nth bit of is marked as 0, which is called 0 marking. Through the "bit value" W _a of S _a , and the "bit value" W _b of S _b , By comparison, it can be judged that S _b "does not contain", "contains" or "may contain" all the character elements of S _b . For example, the bit implication operation is performed on W _a and W _b , if all bits have an implication relationship, then S _b contains or may contain "all characters" of S _a . If necessary, the usual character-by-bit comparison method is used to determine whether S _b contains "S _a ". Hereinafter referred to as the present invention method is " bit mark search ", " bit mark character string search "" bit mark character string search technology ".

测试表明，位标记检索可以显著提高字符串的模糊检索速度。速度优势之外，位标记检索的另一特点是多个关键词查询同单个关键词查询一样方便。位标记既可用于通常意义的检索，即判断数据库字符串是否包含关键词，也可以作“逆检索”，判断关键词是否包含数据库字符串，可用于语音输入、拼音输入及汉语分词中，匹配基本句型或词语。Tests show that bit tag retrieval can significantly speed up fuzzy retrieval of strings. In addition to the advantage of speed, another feature of bit tag retrieval is that multiple keyword queries are as convenient as single keyword queries. The bit mark can be used for retrieval in the usual sense, that is, to judge whether the database string contains keywords, or as a "reverse search", to judge whether the keywords contain database strings, and can be used in voice input, pinyin input, and Chinese word segmentation, matching Basic sentence patterns or words.

作为基本方法的拓展，如果可用于标记的位数n，是字符串的平均长度m的2倍以上，可以用数个位(bit)的组合对应一组字符元进行标记，以提高筛选效率。As an extension of the basic method, if the number of digits n available for marking is more than twice the average length m of the string, a combination of several bits can be used to mark a group of characters to improve screening efficiency.

位标记字符串检索技术作为一种字符串算法，需要首先对检索范围的字符串进行位标记，所以适用于检索范围比较固定的字符串查找领域。As a string algorithm, the bit-marked string retrieval technology needs to bit-mark the strings in the search range first, so it is suitable for the field of string search where the search range is relatively fixed.

1.基本方法1. Basic method

位标记字符串检索实施中，可以用“位”或“位的组合”对应通常意义的字符，如：a、b、A、B、∑#、∞、ま、：、中、国；根据需要也可以用位对应汉字偏旁，如：扌、罒，以至笔画，如丿、丶等；对于拼音文字，如英语字符串“day and night”，可以用位对应单词，如day、and、night，或者用位对应字母组合，如ay、ai，在汉语拼音输入或语音输入中可以代表汉语拼音的声母、韵母或音节，对于其它语言可以代表音标，如θ、siη。为了便于说明，称位或位的组合所对应的字符单位为“字符元”，记为P。In the implementation of bit-marked character string retrieval, "bit" or "combination of bits" can be used to correspond to characters in common meaning, such as: a, b, A, B, ∑#, ∞, ま, :, 中, 国; according to needs, digits can also be used to correspond to radicals of Chinese characters, such as: 扌, 罒, and even strokes, such as 丿, , etc.; for pinyin characters, such as the English string "day and night", digits can be used to correspond to words , such as day, and, night, or use bit-corresponding letter combinations, such as ay, ai, which can represent the initials, finals or syllables of Chinese Pinyin in Chinese Pinyin input or voice input, and can represent phonetic symbols for other languages, such as θ, siη. For ease of description, the character unit corresponding to a bit or a combination of bits is referred to as a "character element", denoted as P.

位标记字符串检索的方法，容易在拼音文字字符串检索中实现。设以一个32位，各个位均为0的数据0000，0000，0000，0000，0000，0000，0000，0000，记为W_F，以W_F自左至右(也可以是自右至左)的第1至第26个位对应26个英文字母，其它6位可能不考虑，也可以对应标点符号。单词big包含b、g、i，所以将相应的2、7、9等3个位标记为“1”，则数据成为0100，0010，1000，0000，0000，0000，0000，0000，称为big的位值，记为W_a，这种标记方式为“1”标记。The method for bit-marked character string retrieval is easy to implement in phonetic character string retrieval. Let a 32-bit data 0000, 0000, 0000, 0000, 0000, 0000, 0000, 0000 be recorded as W _F , with W _F from left to right (or from right to left) The 1st to 26th digits of the corresponding to 26 English letters, and the other 6 digits may not be considered, and may also correspond to punctuation marks. The word big contains b, g, and i, so the corresponding 3 bits such as 2, 7, and 9 are marked as "1", and the data becomes 0100, 0010, 1000, 0000, 0000, 0000, 0000, 0000, which is called big The bit value of is denoted as W _a , and this marking method is a "1" mark.

同样，bigger包含b、e、g、i、r，故将W_F的2、5、7、9、18等5个位标记为“1”，则数据成为0100，1010，1000，0000，0100，0000，0000，0000，称为bigger的位值，记为W_b。Similarly, bigger includes b, e, g, i, r, so the 5 bits 2, 5, 7, 9, and 18 of W _F are marked as "1", and the data becomes 0100, 1010, 1000, 0000, 0100 , 0000, 0000, 0000, called the bit value of bigger, denoted as W _b .

可以看出，所有big为1的位，bigger也必为1；但bigger为1的位，big可能是1，也可能是0，如第5、18个位；bigger为0的位，big必为0，如第1、3、4、6、8、10等21个位。将W_a与W_b进行“位蕴含”运算，结果是所有的位均为“1”：1111，1111，1111，1111，1111，1111，1111，1111，记为W_T，如果是无符号长整数，就是4294967295。可以用公式表示为：W_a→W_b＝W_T，字符串bigger包含big。It can be seen that for all bits whose big is 1, bigger must also be 1; but for bits where bigger is 1, big may be 1 or 0, such as the 5th and 18th bits; for bits where bigger is 0, big must be It is 0, such as the 1st, 3rd, 4th, 6th, 8th, 10th and other 21 bits. Perform "bit implication" operation on W _a and W _b , the result is that all bits are "1": 1111, 1111, 1111, 1111, 1111, 1111, 1111, 1111, recorded as W _T , if it is an unsigned long An integer is 4294967295. It can be expressed as: W _a →W _b =W _T , and the character string bigger includes big.

如果biggest的位值记为W_c，同样有：W_a→W_c＝W_T，字符串biggest包含big。If the bit value of biggest is denoted as W _c , there is also: W _a →W _c =W _T , and the character string biggest contains big.

如果将BIG的位值记为W_A，将big的位值W_a与之进行“位蕴含”运算，同样的有：W_a→W_A＝W_T，字符串BIG等于big。这里未考虑字母大小写，仅说明两个相同的字符串的位值进行“位蕴含”运算，所有的位均为1。If the bit value of BIG is recorded as W _A , and the bit value W _a of big is subjected to a "bit implication" operation, the same is: W _a → W _A = W _T , and the string BIG is equal to big. The case of letters is not considered here, and it only shows that the bit values of two identical strings are subjected to the "bit implication" operation, and all bits are 1.

如果digger的位值记为W_d，如果将big的位值W_a与之进行“位蕴含”运算，结果为1011，1111，1111，1111，1111，1111，1111，1111，不等于W_T，即：W_a→W_d≠W_T，字符串digger不包含big。If the bit value of digger is recorded as W _d , if the bit value W _a of big is used for "bit implication" operation, the result is 1011, 1111, 1111, 1111, 1111, 1111, 1111, 1111, which is not equal to W _T , That is: W _a → W _d ≠ W _T , the string digger does not contain big.

也就是说，通过对两个字符串的位值W进行“位蕴含”运算，如果结果不等于W_T，则后项所对应的字符串“不包含”(“不多于”及“不等于”，下同)前项所对应的字符串。That is to say, by performing the "bit implication" operation on the bit value W of the two strings, if the result is not equal to W _T , then the string corresponding to the latter item "does not contain"("not more than" and "not equal to ", the same below) the character string corresponding to the preceding item.

上面的方案未考虑字母的大小写，如果要区分大小写，则需要52个位，这也可以实现。但是每个字符元对应一个位，并不总是可行的。如，GBK范围有21000个汉字，用一个位对应一个汉字，存贮位值的数据会占用很多空间，其中有相当多的是无意义的空白，而读取数据及位比较运算用时会无意义地增加，实施中，根据需要可用一个“位”对应多个汉字。一个易行的方案是将GBK范围的21000个汉字及其它符号，根据编码分为32组，以一个长整型数据来存贮字符串的位值，设有字符串“大漠孤烟直长河落日圆”，则：The above scheme does not consider the case of letters, if you want to be case sensitive, you need 52 bits, which can also be achieved. But it's not always possible to have one bit per character element. For example, there are 21,000 Chinese characters in the GBK range, and one bit corresponds to one Chinese character. The data storing the bit value will take up a lot of space, and quite a lot of them are meaningless blanks, and it will be meaningless when reading data and bit comparison operations. In practice, one "bit" can be used to correspond to multiple Chinese characters as required. An easy solution is to divide the 21,000 Chinese characters and other symbols in the GBK range into 32 groups according to the encoding, use a long integer data to store the bit value of the string, and set the string "Great Desert Dust Straight Long River Falls" Yen", then:

汉字Chinese character 大big 漠desert 孤solitary 烟cigarette 直straight 长long 河river 落fall 日day 圆round 编码coding 4632346323 5035050350 4755447554 5370853708 5496154961 4598845988 4782747827 4989249892 5141351413 5445054450 组Group 2020 1515 33 1313 1818 55 2020 55 22twenty two 1919 汉字Chinese character 霜Frost 冷cold 长long 河river 落fall 日day 天sky 涯Career 编码coding 5213852138 4938049380 4598845988 4782747827 4989249892 5141351413 5246052460 5370053700 组Group 1111 55 55 2020 55 22twenty two 1313 55

表中的编码是用excel中函数code获取，其中“大”“河”同在20组，“长”“落”同在5组，所以只将8个相应的位置为1，则“大漠孤烟直长河落日圆”的位值W_h为：The codes in the table are obtained by using the function code in excel, among which "big" and "river" are in 20 groups, and "long" and "fall" are in 5 groups, so only 8 corresponding positions are set to 1, then "desert lonely The bit value W _h of "Smoky Straight River Sets the Sun" is:

0010，1000，0000，1010，0111，0100，0000，00000010, 1000, 0000, 1010, 0111, 0100, 0000, 0000

同样可以得到“长河落日”的W_i“位值”为：0000，1000，0000，0000，0001，0100，0000，0000，以W_i与W_h作位“位蕴含”运算：W_i→W_h＝W_T，两个字符串存在包含关系。Similarly, the "bit value" of W _i of "Long River Sunset" can be obtained as: 0000, 1000, 0000, 0000, 0001, 0100, 0000, 0000, using W _i and W _h as bit "bit implication" operation: W _i → W _h = W _T , there is an inclusion relationship between the two character strings.

“霜冷长河”中“霜”为第11组，而“冷”与“长”均为第5组，则“位值”W_j为：0000，1000，0010，0000，0001，0000，0000，0000，以W_j与W_h作位“位蕴含”运算：W_j→W_h≠W_T，而两个字符串也不存在包含关系。也就是说，用一个“位”对应多个汉字，得到位值，对两个字符串的位值进行“位蕴含”运算，如果结果不等于W_T，则后项所对应的字符串“不包含”前项所对应的字符串的“全部字符元”，也就必然不包含该“字符串”。对于拼音文字，同理可用一个位对应多个单词，判断一个字符串“不包含”、“可能包含”另一个字符串的字符元。"Frost" in "Frost Cold River" is the 11th group, and "cold" and "long" are both the 5th group, then the "bit value" W _j is: 0000, 1000, 0010, 0000, 0001, 0000, 0000 , 0000, W _j and W _h are used as the "bit implication" operation: W _j → W _h ≠ W _T , and there is no inclusion relationship between the two strings. That is to say, use one "bit" to correspond to multiple Chinese characters to obtain the bit value, and perform the "bit implication" operation on the bit values of the two strings. If the result is not equal to W _T , then the string corresponding to the latter item "is not Contains "all character elements" of the character string corresponding to the preceding item, so the "character string" must not be included. For pinyin characters, similarly, one bit can be used to correspond to multiple words, and it is judged that a character string "does not contain" or "may contain" the character elements of another character string.

但是，对两个字符串的位值进行“位蕴含”运算，如果结果等于W_T，后项所对应的字符串也只是“可能包含”前项所对应的字符串。However, if the "bit implication" operation is performed on the bit values of two character strings, if the result is equal to W _T , the character string corresponding to the later term only "may contain" the character string corresponding to the former term.

如果gibber的位值记为W_e，将big的位值W_a与之进行“位蕴含”运算，同样有：If the bit value of gibber is denoted as W _e , and the bit value W _a of big is used for the "bit implication" operation, it also has:

W_a→W_e＝W_T W _a →W _e ＝ W _T

但字符串gibber不包含big。But the string gibber does not contain big.

“落日天涯”的“位值”W_k为：0000，1000，0000，1000，0000，0100，0000，0000，以之与W_h作位“位蕴含”运算：The "bit value" W _k of "Sunset Tianya" is: 0000, 1000, 0000, 1000, 0000, 0100, 0000, 0000, and W _h is used as a bit "bit implication" operation:

W_k→W_h＝W_T W _k →W _h ＝W _T

但“大漠孤烟直长河落日圆”不包含“落日天涯”。However, "the lonely smoke in the desert goes straight to the long river and the sun sets" does not include "the setting sun ends in the sky".

也就是说，位标记字符串检索与一般字符串逐位比较法是有区别的。即使两个字符串的位值存在位蕴含关系，两个字符串也不一定存在包含关系。如果以一个“位”对应“一个字符元”，两个字符串的位值存在位蕴含关系，则后项对应的字符串“包含”前项对应的字符串的全部字符元。但由于未考虑字符元的排列，不能肯定两个字符串是否存在包含关系。如果标记是以一个“位”对应“一组字符元”，一组字符元之中有多个字符元，则后项对应的字符串“有可能”包含前项对应的字符串的全部字符元，也“有可能”不包含前项对应的字符串的全部字符元，更不能肯定两个字符串是否存在包含关系。但根据需要，可再用字符逐位比较方法判断两个字符串是否存在包含关系。That is to say, there is a difference between bit-marked string retrieval and general string bit-by-bit comparison. Even if the bit values of the two strings have a bit implication, the two strings do not necessarily have an inclusion relationship. If a "bit" corresponds to "a character element", and there is a bit implication relationship between the bit values of the two character strings, then the character string corresponding to the latter item "contains" all the character elements of the character string corresponding to the former item. However, because the arrangement of characters is not considered, it is not sure whether there is a containment relationship between the two strings. If the mark is a "bit" corresponding to a "group of characters", and there are multiple characters in a group of characters, then the string corresponding to the latter item "may" contain all the characters of the string corresponding to the previous item , it is also "possible" not to contain all the characters of the string corresponding to the preceding item, and it is not certain whether the two strings have an inclusion relationship. However, according to the need, the character-by-character comparison method can be used to judge whether there is a containment relationship between the two character strings.

位标记字符串检索，进行位运算的前项及后项的位值，均可以是一个字符串的字符元的标记信息，也可以是多个字符串所有字符元的标记信息，统称为“若干个字符串”S的位值。“若干个字符串”是S的同位概念，就是说，S指一个或多个字符串。Bit-marked string retrieval, the bit value of the previous item and the next item of the bit operation can be the tag information of the character elements of a string, or the tag information of all the character elements of multiple strings, collectively referred to as "several The bit value of a string "S. "Several character strings" is the same concept of S, that is, S refers to one or more character strings.

设对数据库进行位标记字符串检索，字符串记录S₁、S₂、S₃、…的位值分别为W₁、W₂、W₃、…。检索关键词S_t的位值记为W_t，以W_t与W_n进行位蕴含运算，结果为W_T，则字符串S_n包含或可能包含S_t的所有字符元。由符合条件的S_n构成的记录集，记为R_i。检索所用时间称之为T_i。通常需要在R_i中用字符串逐位比较法得到最终结果集R_z，检索所用时间称之为T_z。Assuming that bit-marked string retrieval is performed on the database, the bit values of string records S ₁ , S ₂ , S ₃ , ... are W ₁ , W ₂ , W ₃ , ... respectively. The bit value of the retrieval keyword S _t is denoted as W _t , and the bit implication operation is performed with W _t and W _n , and the result is W _T , then the string S _n contains or may contain all the characters of S _t . A record set composed of qualified S _n is denoted as R _i . The time taken for retrieval is called T _i . Usually, it is necessary to use the bit-by-bit comparison of character strings in R _i to obtain the final result set R _z , and the retrieval time is called T _z .

因此位标记字符串检索方法总的用时是T_i+T_z，如果希望使T_i低，对32位的处理器来说，最好存贮位值的数据就是32位。对于某次字符串模糊检索来说，R_z是一定的，降低T_z的方法是尽可能使初步结果集R_i最小。存贮位值的位数n越少，一个位对应的字符元越多，筛选效果越差，R_i也会越大；字符串平均长度越长，R_i也会越大。影响R_i大小还有另一因素：检索关键词越长，R_i也越小。Therefore, the total time spent by the bit-marked character string retrieval method is T _i +T _z . If T _i is expected to be low, for a 32-bit processor, the best data to store the bit value is 32 bits. For a string fuzzy retrieval, R _z is certain, and the way to reduce T _z is to minimize the initial result set R _i as much as possible. The fewer the number of bits n for storing bit values, the more characters one bit corresponds to, the worse the filtering effect, and the larger R _i will be; the longer the average length of the string, the larger R _i will be. There is another factor affecting the size of R _i : the longer the search keyword, the smaller R _i is.

设标记所用位数为n，字符串位值的“1”为m个，检索关键词位值的“1”为k个，则位比较的筛选概率可以用下式计算，其值越小越好：Assuming that the number of digits used for the mark is n, the "1" of the character string value is m, and the "1" of the search key value is k, then the screening probability of bit comparison can be calculated by the following formula, and the smaller the value, the better good:

${P P}_{Ii II} = = \frac{{C C}_{m m}^{k k}}{{C C}_{n no}^{k k}} = = \frac{\frac{m m!!}{k k!! ((m m - - k k))!!}}{\frac{n no!!}{k k!! ((n no - - k k))!!}} = = \frac{m m!! ((n no - - k k))!!}{n no!! ((m m - - k k))!!}$

n为32，m和k部分取值的筛选概率计算如下：n is 32, and the screening probabilities for the values of m and k are calculated as follows:

mm k＝1k=1 k＝2k=2 k＝3k=3 k＝4k=4 24twenty four 0.750.75 0.5564516130.556451613 0.4080650.408065 0.2954950.295495 22twenty two 0.68750.6875 0.4657258060.465725806 0.3104840.310484 0.203420.20342 2020 0.6250.625 0.3830645160.383064516 0.2298390.229839 0.1347330.134733 1818 0.56250.5625 0.3084677420.308467742 0.1645160.164516 0.0850950.085095 1616 0.50.5 0.2419354840.241935484 0.1129030.112903 0.0506120.050612 1414 0.43750.4375 0.1834677420.183467742 0.0733870.073387 0.0278360.027836 1212 0.3750.375 0.1330645160.133064516 0.0443550.044355 0.0137650.013765 1010 0.31250.3125 0.0907258060.090725806 0.0241940.024194 0.005840.00584 88 0.250.25 0.0564516130.056451613 0.011290.01129 0.0019470.001947

影响筛选概率的三个因素中，用户检索时所用关键词长度决定k的大小，记录字符串长度决定m的大小，但记录字符串长度、检索时所用关键词长度是不可控制的，可控制的因素是标记所用的位数n，而且长度一定的字符串标记之后“1”的个数m、k的大小，也受n影响，所以n与m保持足够的比例很重要。Among the three factors that affect the screening probability, the length of the keyword used by the user for retrieval determines the size of k, and the length of the record string determines the size of m, but the length of the record string and the length of the keyword used for retrieval are uncontrollable and controllable The factor is the number of digits n used for marking, and the number m and k of "1"s after a string of a certain length are also affected by n, so it is important to maintain a sufficient ratio between n and m.

几点建议：A few suggestions:

1.32位的cpu，用长整数进行标记，最利于位值比较，如果是无符号整数，则W为32个位，如果是有符号位整数，则只有31个位便于标记。如果字符串平均长度大于16个字符元，记录条数多的数据库，在SQL SEVER 2000中，可用数据类型bigint的63个位进行标记，相应地，将字符元分为63组。当然，对于32位处理器，用bigint存贮位值，读取及进行位比较必然用更多时间。实际上，任何数据库中便于进行位运算的任何数据类型均可以用来标记字符串，如果用无符号位的数据类型自然更佳。对于64位cpu，当然应采用64个位来标记字符串，以充分利用cpu的性能，提高“位值”的离散度。1. The 32-bit CPU is marked with a long integer, which is most conducive to bit value comparison. If it is an unsigned integer, W is 32 bits. If it is a signed integer, only 31 bits are convenient for marking. If the average length of the string is greater than 16 characters and the number of records is large, in SQL Server 2000, 63 bits of the data type bigint can be used for marking, and the characters are divided into 63 groups accordingly. Of course, for a 32-bit processor, using bigint to store bit values, it will take more time to read and compare bits. In fact, any data type that is convenient for bit operations in any database can be used to mark strings, and it is naturally better to use unsigned data types. For a 64-bit CPU, of course, 64 bits should be used to mark the character string, so as to fully utilize the performance of the CPU and improve the discreteness of the "bit value".

2.如果cpu是32位，字符串之间长度相差大的数据库，可将记录分为两个表，短字符串表用32个位标记，长字符串表用64位标记，查询中用union命令将两表查询结果联合起来。2. If the cpu is 32 bits, and the length of the strings varies greatly, the records can be divided into two tables. The short string table is marked with 32 bits, and the long string table is marked with 64 bits. Union is used in the query The command combines the query results of two tables.

3.字符元的分组，以频率均衡为最佳，还要兼顾标记位值的速度。按汉字内码进行模运算，取余数分组，容易实现，但不是最优分组，而且模运算除法运算速度慢。可以考虑取汉字内码的某5个位将汉字分为32组，速度会更快，效果可能更好。3. For the grouping of character elements, the frequency balance is the best, and the speed of marking bit values should also be taken into consideration. Carrying out modulo operation according to the internal code of Chinese characters, taking the remainder and grouping is easy to realize, but it is not optimal grouping, and the operation speed of modulo operation and division is slow. It can be considered to divide Chinese characters into 32 groups by taking a certain 5 bits of the internal code of Chinese characters, the speed will be faster and the effect may be better.

通常意义的检索之外，用位标记方法可以对数据库字符串进行“逆检索”，即以数据库字符串S_n的W_n值为前项，以检索关键词S_t的W_t值为后项进行位蕴含运算，根据结果是否所有的位为“1”，即W_n→W_t＝W_T，可筛选出数据库中那些组成字符元可能为检索关键词S_t所包含的S_n。In addition to retrieval in the usual sense, the database string can be "reversely retrieved" by using the bit mark method, that is, the W _n value of the database string S _n is the previous item, and the W _t value of the search keyword S _t is the subsequent item Carry out the bit implication operation, and according to whether all the bits are "1" in the result, that is, W _n →W _t =W _T , it is possible to filter out those constituent characters in the database that may be included in the retrieval keyword S _t S _n .

自然，逆检索的筛选概率的计算要反过来。设数据库及关键词均用“1”标记，标记所用位数为n，位标记后S_n的位值W_n的“1”为m个，检索关键词标记后的“1”为k个，则检索关键词S_t包含或可能包含其所有字符元的概率可以用下式计算：Naturally, the calculation of the screening probability for reverse retrieval is reversed. Assuming that the database and keywords are all marked with "1 _" , the number of digits used for the mark is n, the number of "1"s in the value Wn of S _n after the bit mark is m, and the number of "1" after the search keyword mark is k. Then the probability that the retrieval keyword S _t contains or may contain all its characters can be calculated by the following formula:

${P P}_{Ii II} = = \frac{{C C}_{k k}^{m m}}{{C C}_{n no}^{m m}} = = \frac{\frac{k k!!}{m m!! ((k k - - m m))!!}}{\frac{n no!!}{m m!! ((n no - - m m))!!}} = = \frac{k k!! ((n no - - m m))!!}{n no!! ((k k - - m m))!!}$

2.测试分析2. Test analysis

用上述方法对一个以汉字为主，有部分英文及数字的数据库进行测试：记录数267,000余条，字符量3,473,000余，字符串平均长度12.989，以长整数标记位值，编程语言为VB，操作系统为Window xp，CPU赛扬800Hz，内存256M，技嘉810主板，硬盘40G。检索整体用时，记为T。显然，任何一次位标记检索，都必须读取全部记录的位值，以关键词的位值与之进行比较，所用的时间可以认为是一个常量，称为T₀，但不易直接测得。位值比较后符合条件的记录集，记为R₁，对R₁进行逐位比较检索用时T₁也不易直接测得，但与R₁大小有关，R₁可以得到。影响检索用时的另一个因素是最终结果集R₂的大小。根据120次测试所得的T、R₁、R₂，回归分析得到如下方程：Use the above method to test a database with Chinese characters as the main part and some English and numbers: the number of records is more than 267,000, the number of characters is more than 3,473,000, the average length of the string is 12.989, and the bit value is marked with a long integer. The programming language is VB. The system is Window xp, CPU Celeron 800Hz, memory 256M, Gigabyte 810 motherboard, hard disk 40G. The overall retrieval time is denoted as T. Apparently, for any bit tag retrieval, the bit values of all records must be read and compared with the bit values of keywords. The time used can be considered as a constant, called T ₀ , but it is not easy to measure directly. The qualified record set after bit value comparison is denoted as R ₁ , and T ₁ is not easy to measure directly when comparing and searching R ₁ bit by bit, but it is related to the size of R ₁ , and R ₁ can be obtained. Another factor that affects retrieval time is the size of the final result set R ₂ . According to T, R ₁ , R ₂ obtained from 120 tests, regression analysis obtained the following equation:

$\overset{^^}{T T} = = 0.268 0.268 + + 0.000,008,625 0.000,008,625 {R R}_{11} + + 0.000,0270,2 0.000,0270,2 {R R}_{22}$

调整后的判定系数 Adjusted coefficient of determination

回归模型的显著性检验F＝5265.814Significance test of regression model F＝5265.814

大于F_0.01(2,117)＝4.791Greater than F _0.01 (2,117)＝4.791

常量的显著性检验t₀＝86.610Significance test of constant t ₀ =86.610

R₁的显著性检验t₁＝74.263Significance test of R ₁ t ₁ =74.263

R₂的显著性检验t₂＝25.489Significance test of R ₂ t ₂ =25.489

均大于t_0.01(117)＝2.6185are greater than t _0.01 (117)＝2.6185

可见，回归方程可信度很高。其中的常量，也就是读取全部记录的位值，以关键词的位值与之进行比较所用的时间，即T₀，为0.268秒。同时进行的120次字符串逐位比较方法平均用时2.1739秒。可见位值比较所用的时间T₀仅为通常逐位比较方法用时的八分之一。但是，位标记字符串检索的整体用时还与R₁、R₂相关，该测试中n＝31，字符串平均长度12.989，标记后有重叠现象，“1”的平均个数m约在11至12，这里按12计算，假如上面数据库字符串的字符元分布正常，标记后的“1”分布均衡，按概率可以计算得到R₁，最终结果集R₂与筛选效果无关，这里指定为240，则理想状态下关键词标记后k为1至12的检索用时，可按回归方程计算如下：It can be seen that the reliability of the regression equation is very high. The constant, that is, the time taken to read the bit values of all records and compare them with the bit values of keywords, that is, T ₀ , is 0.268 seconds. The average time for 120 string-by-bit comparisons at the same time is 2.1739 seconds. It can be seen that the time T ₀ used for bit value comparison is only one-eighth of the time used by the usual bit-by-bit comparison method. However, the overall time of bit-marked character string retrieval is also related to R ₁ and R _2. In this test, n=31, the average length of the character string is 12.989, and there is overlap after the mark. The average number m of "1" is about 11 to 12. Here, it is calculated by 12. If the distribution of characters in the above database string is normal, and the marked "1" is evenly distributed, R ₁ can be calculated according to the probability. The final result set R ₂ has nothing to do with the screening effect. Here, it is specified as 240. In an ideal state, the retrieval time when k is 1 to 12 after keyword tagging can be calculated according to the regression equation as follows:

kk R₁ R ₁ R₂ R ₂ TT 11 103354.8103354.8 240240 1.165921.16592 22 37896.7737896.77 240240 0.6013440.601344 33 13067.8513067.85 240240 0.3871950.387195 44 4200.3814200.381 240240 0.3107130.310713

55 1244.5571244.557 240240 0.2852190.285219 66 335.0732335.0732 240240 0.2773750.277375 77 80.4175680.41756 240240 0.2751780.275178 88 16.7536616.75366 240240 0.2746290.274629 99 2.913682.91368 240240 0.274510.27451 1010 0.397320.39732 240240 0.2744880.274488 1111 0.037840.03784 240240 0.2744850.274485 1212 0.0018920.001892 240240 0.2744850.274485

可见，与通常字符串检索方法平均用时2.1739秒相比，位标记字符串检索有很大的性能优势。但不同的硬件、以及标记时所用的数据类型尤其是n的大小、数据库字符元的分布状况、标记时字符元的分组状况，均影响位标记字符串检索的用时，所以此回归方程只有参考意义。It can be seen that compared with the average time of 2.1739 seconds for the usual string retrieval method, bit-marked string retrieval has a great performance advantage. However, different hardware, as well as the data type used in marking, especially the size of n, the distribution of database characters, and the grouping of characters in marking, all affect the time spent on bit-marked string retrieval, so this regression equation is only a reference .

英语中仅有26个字母，如果字符串很长，每条记录含每一个字母，则位标记检索会失去筛选作用。但对一个记录数242,000，字符量7,493,000的英文数据库数据进行测试，表明位标记字符串检索对拼音文字通常也是可行的。该数据库字符串数据类型为varchar，字段长56，字符串平均长30.846，用1个长整数的26个bit标记26个字母，不分大小写，忽略空格及其它字符。由于字母英文字母的使用频率不均衡，标记后统计仅有3213052个“1”，每个记录“1”的平均数为13.2266，一方面说明标记有大量的重叠发生，另一方面说明并非每条记录包含每个字母。There are only 26 letters in English. If the string is very long and every record contains every letter, the bit mark search will lose its filtering effect. However, a test on an English database with 242,000 records and 7,493,000 characters shows that bit-marked string retrieval is usually also feasible for pinyin characters. The database string data type is varchar, the field length is 56, and the average string length is 30.846. 26 letters are marked with 26 bits of a long integer, regardless of case, spaces and other characters are ignored. Due to the unbalanced use frequency of English letters, there are only 3,213,052 "1"s after marking, and the average number of "1"s in each record is 13.2266. A record contains each letter.

对该数据库进行120次字符逐位比较检索，平均用时为4.573秒，同时进行位标记检索，根据120次的T、R1、R2，回归分析得到如下方程：120 character-by-character comparison searches were performed on the database, and the average time was 4.573 seconds. At the same time, bit mark searches were performed. According to 120 times of T, R1, R2, regression analysis obtained the following equation:

$\overset{^^}{T T} = = 0.265 0.265 + + 0.000,019,36 0.000,019,36 {R R}_{11} + + 0.000,0362 0.000,0362,, 33 {R R}_{22}$

调整后的判定系数 Adjusted coefficient of determination

回归模型的显著性检验F＝55405.88Significance test of regression model F＝55405.88

大于F_0.01(2，117)＝4.791Greater than F _0.01 (2, 117) = 4.791

常量的显著性检验t₀＝36.400Significance test of constant t ₀ =36.400

R₁的显著性检验t₁＝121.733Significance test of R ₁ t ₁ =121.733

R₂的显著性检验t₂＝77.409Significance test of R ₂ t ₂ =77.409

均大于t_0.01(117)＝2.6185are greater than t _0.01 (117)＝2.6185

下面为该数据库的字母分布统计表：The following is the letter distribution statistics table of the database:

字母letter 含该字母的记录数the number of records containing the letter 百分比percentage 字母letter 含该字母的记录数the number of records containing the letter 百分比percentage tt 227167227167 0.9351360.935136 hh 114263114263 0.4703650.470365 rr 222338222338 0.9152570.915257 uu 109376109376 0.4502480.450248 ee 221713221713 0.9126850.912685 gg 9733997339 0.4006970.400697 aa 213249213249 0.8778420.877842 ff 8345283452 0.3435310.343531 cc 210116210116 0.8649450.864945 WW 7122871228 0.2932110.293211 nno 204225204225 0.8406950.840695 bb 6706567065 0.2760740.276074 oo 201082201082 0.8277570.827757 ythe y 6069860698 0.2498640.249864 ii 195734195734 0.8057420.805742 vv 5935159351 0.2443190.244319 sthe s 195639195639 0.8053510.805351 kk 4737947379 0.1950360.195036 ll 169753169753 0.6987910.698791 xx 1515015150 0.0623650.062365 dd 159202159202 0.6553570.655357 qq 72587258 0.0298780.029878 mm 125449125449 0.5164130.516413 jj 63516351 0.0261440.026144 pp 122168122168 0.5029060.502906 zz 63076307 0.0259630.025963

从上表可见，该数据库中t、r、e三个字母频率最高，以tree为检索关键词，实际检索得到含tree的记录7条，而R₁为188491，用时3.655秒，同时进行的逐位比较检索用时4.465秒。如按上表t、r、e三个字母的频率，计算可得：It can be seen from the above table that the three letters t, r, and e have the highest frequency in this database. Using tree as the retrieval keyword, 7 records containing tree were actually retrieved, and R ₁ was 188491, which took 3.655 seconds. The bit comparison retrieval took 4.465 seconds. For example, according to the frequency of the three letters t, r, and e in the above table, the calculation can be obtained:

R1＝242，924*0.935136*0.915257*0.912685＝242，924*0.781158＝189762R1＝242, 924*0.935136*0.915257*0.912685＝242, 924*0.781158＝189762

再按回归方程计算检索用时为：Then calculate the retrieval time according to the regression equation:

$\begin{matrix} \overset{^^}{T T} = = 0.265 0.265 + + 0.000,019,36 0.000,019,36 * * 189762189762 + + 0.000,0362,3 0.000,0362,3 * * 77 \\ = = 3.939 3.939 \end{matrix}$

可见，即使是tree这样完全由高频字母构成的单词，位标记检索仍有性能优势。实际上，英文中大多数单词超过4个字母，如果其中有一个低频字母，位标记的筛选作用就很好，相当于“短板效应”。列举三十次测试数据以供参考：It can be seen that even for words such as tree, which are completely composed of high-frequency letters, bit tag retrieval still has performance advantages. In fact, most words in English have more than 4 letters. If there is a low-frequency letter among them, the screening effect of bit marks is very good, which is equivalent to the "short board effect". List thirty test data for reference:

3.位标记的重叠概率3. Overlap probability of bit marks

拼音文字中字母数量小，所以字母重复出现的频率高，如bigger，biggest等单词有两个g，但仅将第7个“位”标记为1。对于汉字字符串来说，一个字符串中同一个汉字重复出现的机率低，但对汉字分组，一个字符串中的数个汉字可能会属于同一组，如上面的分组中，“大”“河”同在20组，“长”“落”同在5组，标记后重叠于一个“位”。即使分组实现字符元频率均衡，也有字符串标记重叠的问题。设用来标记的位数为n，字符元的个数为m。其中完全不发生重叠的概率是：The number of letters in pinyin text is small, so the frequency of repeated letters is high, such as bigger, biggest and other words have two g, but only the seventh "bit" is marked as 1. For Chinese character strings, the probability of the same Chinese character appearing repeatedly in a string is low, but for grouping Chinese characters, several Chinese characters in a string may belong to the same group, such as in the grouping above, "big" "he "The same in 20 groups, "long" and "drop" in the same 5 groups, overlapping in a "bit" after the mark. Even if the grouping achieves character frequency equalization, there is the problem of overlapping string tokens. Suppose the number of digits used for marking is n, and the number of character elements is m. where the probability of no overlap at all is:

$P P = = \frac{{A A}_{n no}^{m m}}{{n no}^{m m}} = = \frac{n no ((n no - - 11)) \cdot &Center Dot; \cdot \cdot \cdot &Center Dot; ((n no - - m m + + 11))}{{n no}^{m m}}$

显然，n一定的时候，m每增加1，分子新增项就递减1，而分母新增项不变，完全不发生重叠的概率越来越低。设以32个位进行标记，字符串长m为7，完全不发生重叠的概率：Obviously, when n is constant, every time m increases by 1, the new items in the numerator will decrease by 1, while the new items in the denominator will remain unchanged, and the probability of no overlap at all is getting lower and lower. Assuming that 32 bits are used for marking, and the string length m is 7, the probability of no overlap at all:

${P P}_{((77))} = = \frac{{A A}_{3232}^{77}}{3232^{77}} = = \frac{3232 ((3232 - - 11)) \cdot &Center Dot; \cdot &Center Dot; \cdot &Center Dot; ((3232 - - 77 + + 11))}{3232^{77}} = = \frac{3232 * * 3131 * * 3030 * * 2929 * * 2828 * * 2727 * * 2626}{3232 * * 3232 * * 3232 * * 3232 * * 3232 * * 3232 * * 3232} = = \frac{16,963,914,240 16,963,914,240}{34,359,738,368 34,359,738,368} = = 0.4937 0.4937$

下表是以32个位进行标记，字符串长m为1至24时，完全不发生重叠的概率：The following table is marked with 32 bits, and when the string length m is 1 to 24, the probability of no overlap at all:

重叠就意味着信息的损失，一定程度的重叠是不可避免的，也是可以接受的，但重叠的比例太高，会影响位标记检索的性能，所以发生重叠的概率也值得关注。以n个位标记，字符串长为m个字符元，重叠为k个“1”的概率计算公式是：Overlap means the loss of information. A certain degree of overlap is inevitable and acceptable, but if the overlap ratio is too high, it will affect the performance of bit label retrieval, so the probability of overlap is also worthy of attention. Marked by n bits, the length of the string is m characters, and the probability calculation formula for overlapping k "1"s is:

$P P = = \frac{{C C}_{n no}^{k k} \cdot &Center Dot; A A}{{n no}^{m m}}$

其中 $A = \underset{m_{1} + m_{2} + \cdot \cdot \cdot + m_{k} = m}{Σ} \frac{m!}{m_{1}! m_{2}! \cdot \cdot \cdot m_{k}!}$ in $A = \underset{m_{1} + m_{2} + &Center Dot; &Center Dot; &Center Dot; + m_{k} = m}{Σ} \frac{m!}{m_{1}! m_{2}! &Center Dot; &Center Dot; &Center Dot; m_{k}!}$

这里的求和是取所有满足m₁+m₂+…+m_k＝m的正整数解，其组数为 The summation here is to take all positive integer solutions satisfying m ₁ +m ₂ +...+m _k =m, and the number of groups is

设以32个位标记，字符串长为7个字符元，重叠为3个位，即m₁+m₂+m₃＝7，有15组，概率计算如下：Assuming that 32 bits are marked, the length of the string is 7 characters, and the overlap is 3 bits, that is, m ₁ +m ₂ +m ₃ =7, there are 15 groups, and the probability calculation is as follows:

7！7! m₁ m ₁ m₂ m ₂ m₃ m ₃ m₁！m ₁ ! m₂！ _m2 ! m₃！ _m3 ! 7！/(m₁！*m₂！*m₃！)7! /(m ₁ !*m ₂ !*m ₃ !) 50405040 11 11 55 11 11 120120 4242 50405040 11 22 44 11 22 24twenty four 105105 50405040 11 33 33 11 66 66 140140 50405040 11 44 22 11 24twenty four 22 105105 50405040 11 55 11 11 120120 11 4242 50405040 22 11 44 22 11 24twenty four 105105 50405040 22 22 33 22 22 66 210210 50405040 22 33 22 22 66 22 210210 50405040 22 44 11 22 24twenty four 11 105105 50405040 33 11 33 66 11 66 140140 50405040 33 22 22 66 22 22 210210 50405040 33 33 11 66 66 11 140140 50405040 44 11 22 24twenty four 11 22 105105 50405040 44 22 11 24twenty four 22 11 105105 50405040 55 11 11 120120 11 11 4242 18061806

$P P = = \frac{{C C}_{3232}^{33} \cdot \cdot A A}{3232^{77}} = = \frac{49604960 * * 18061806}{34,359,738,368 34,359,738,368} = = 0.000261 0.000261$

以32个位标记，字符串长为7个字符元，标记后k为1-7的概率如下表：It is marked with 32 bits, and the length of the string is 7 characters. The probability of k after marking is 1-7 is as follows:

kk 排列数Number of permutations 排列百分比rank percentage 百分比累计cumulative percentage k*排列数k* number of permutations 77 1696391424016963914240 0.4937148840.493714884 0.4937148840.493714884 1.18747E+111.18747E+11 66 1370162304013701623040 0.3987697140.398769714 0.8924845980.892484598 8220973824082209738240 55 33831168003383116800 0.0984616580.098461658 0.9909462560.990946256 1691558400016915584000 44 302064000302064000 0.0087912190.008791219 0.9997374750.999737475 12082560001208256000 33 89577608957760 0.0002607050.000260705 0.999998180.99999818 2687328026873280 22 6249662496 1.81887E-061.81887E-06 0.9999999990.999999999 124992124992 11 3232 9.31323E-109.31323E-10 11 3232 2.19108E+112.19108E+11

标记后“1”的加权平均个数为6.376881392。The weighted average number of "1" after marking is 6.376881392.

4.位标记相关的逻辑代数原理及0标记4. Logical algebra principles related to bit marks and 0 marks

字符串S的位值W的某一个位为“1”，就是意味着命题“字符串S有某个字符元或某组字符元中的一个”为真。对于两个字符串S_a和S_b，W_a的某k个位为1，就是意味着命题“字符串S_a有某k个字符元或某k组字符元中的一个”为真；如果W_b所有相应的某k个位为1，就是意味着命题“字符串S_b有某k个字符元或某k组字符元中的一个”也为真。自然，W_b可能有其它某m-k个位为1，而W_a这些某m-k个位为0。这种关系用逻辑代数公式就表示为：W_a→W_b＝W_T，直观易解。但是并非所有的编程语言和数据库系统都有“位蕴含”运算符，所以应用中需要按逻辑代数原理进行变换。A certain bit of the bit value W of the string S is "1", which means that the proposition "the string S has a certain character element or one of a certain group of character elements" is true. For two strings S _a and S _b , certain k bits of W _a are 1, which means that the proposition "string S _a has certain k characters or one of certain k groups of characters" is true; if All corresponding certain k bits of W _b are 1, which means that the proposition "string S _b has certain k characters or one of certain k groups of characters" is also true. Naturally, some mk bits of W _b may be 1, while some mk bits of W _a are 0. This relationship is expressed by the logic algebra formula: W _a → W _b = W _T , which is intuitive and easy to understand. But not all programming languages and database systems have "bit implication" operators, so the application needs to be transformed according to the principle of logical algebra.

从逻辑代数运算定理可知，若a→b＝1，则同样的，对于n个位的位值也有：若W_a→W_b＝W_T，则Theorems of operation from logic algebra It can be seen that if a→b=1, then Similarly, for the bit value of n bits: if W _a →W _b ＝W _T , then

${\overset{&OverBar; &OverBar;}{W W}}_{a a} | | {W W}_{b b} = = {W W}_{T T} . .$

也可以得到：Also get:

W_a|W_b＝W_b W _a |W _b =W _b

W_a&W_b＝W_a W _a &W _b =W _a

从逻辑运算定理 $W_{a} &RightArrow; W_{b} &DoubleLeftRightArrow; {\overset{&OverBar;}{W}}_{b} &RightArrow; {\overset{&OverBar;}{W}}_{a}$ 可知，Theorems of Logical Operations $W_{a} &Right Arrow; W_{b} &DoubleLeftRightArrow; {\overset{&OverBar;}{W}}_{b} &Right Arrow; {\overset{&OverBar;}{W}}_{a}$ It can be seen that

W_a→W_b＝W_T，则 W _a →W _b ＝W _T , then

所以可用一个各“位”均为1的数据W_T来标记位值，如果字符串包含某一字符元，则将相应的位标记为0，这种方式称为0标记。如：Therefore, a data W _T whose "bit" is 1 can be used to mark the bit value. If the string contains a certain character element, the corresponding bit is marked as 0. This method is called 0 mark. like:

big的位值为10111101011111111111111111111111bit value of big 1011110101111111111111111111111

bigger的位值为10110101011111111011111111111111bit value of bigger 10110101011111111011111111111111

显然 ${\overset{&OverBar;}{W}}_{b} &RightArrow; {\overset{&OverBar;}{W}}_{a} = W_{T} .$ obviously ${\overset{&OverBar;}{W}}_{b} &Right Arrow; {\overset{&OverBar;}{W}}_{a} = W_{T} .$

也可以得到：Also get:

${\overset{&OverBar; &OverBar;}{W W}}_{b b} | | {\overset{&OverBar; &OverBar;}{W W}}_{a a} = = {\overset{&OverBar; &OverBar;}{W W}}_{a a}$

用真值表比较直观：It is more intuitive to use a truth table:

W_a W _a W_b w _b W_a→W_b＝W_T W _a →W _b ＝W _T W_a&W_b＝W_a W _a &W _b =W _a W_a|W_b＝W_b W _a |W _b =W _b 00 00 11 00 00 00 11 11 00 11 11 00 00 00 11 11 11 11 11 11

以上说明的是，如何运用最常见的位运算符进行位值比较，筛选概率的原理是相通的。当然，0标记时，m、k均指0的个数。但各编程语言和数据库提供的位运算符数量不一，其它的位运算符的具体操作，可按逻辑代数原理推出。运用逻辑代数原理，进行等价变换，得出公式当然应以简约适用为目的，而不是越变越复杂。What has been explained above is how to use the most common bit operators to compare bit values, and the principle of screening probability is the same. Of course, when 0 is marked, both m and k refer to the number of 0s. However, the number of bit operators provided by each programming language and database is different, and the specific operations of other bit operators can be deduced according to the principles of logical algebra. Using the principles of logical algebra and carrying out equivalent transformations, the formulas obtained should of course be simple and applicable, rather than becoming more and more complicated.

另外一点是，如果编程语言或数据库系统提供了“位”的置1或置0的功能，则位“标记”可直接运用。如果数据库系统编程语言未提供位置1，则可用位的“或”(or)运算，进行1标记。Another point is that if the programming language or database system provides the function of setting "bit" to 1 or 0, then the bit "flag" can be used directly. If the database system programming language does not provide a position of 1, the bit "or" (or) operation can be used to mark 1.

首先对字符元进行分组，如果用32个位标记，则对第n组赋值2^32-n。从二进制来看，则该值自左至右第n个位为1，而其余位为0，称之为“基本位值”。将一个字符串所有字符元的“基本位值”进行“或”(or)运算，即可得到该字符串的位值。当然，也可以对第n组“基本位值”赋值2^n-1，从二进制来看，则自右至左的第n个位为1，而其余位为0。实际上，在特定数据库中，对记录字符串及检索关键词进行标记，字符元与位有固定的对应即可，而不论是自右至左，还是自左至右，或者其它秩序。First group the character elements, if 32 bits are used to mark, then assign 2 ^32-n to the nth group. From the binary point of view, the nth bit of the value from left to right is 1, and the rest of the bits are 0, which is called "basic bit value". The bit value of the string can be obtained by performing an "or" (or) operation on the "basic bit values" of all the character elements of a string. Of course, it is also possible to assign 2 ^n-1 to the nth group of "basic bit values". From a binary point of view, the nth bit from right to left is 1, and the remaining bits are 0. In fact, in a specific database, record character strings and search keywords are marked, and characters and bits have a fixed correspondence, regardless of whether it is from right to left, from left to right, or in other order.

如果是进行0标记，则先给第n组字符元，赋以第n个位为0而其余位为1的“基本位值”，再将特定字符串所有字符元的“基本位值”进行“和”(and)运算，即可得到该字符串的位值 If it is to mark 0, first give the nth group of characters a "basic bit value" in which the nth bit is 0 and the rest of the bits are 1, and then the "basic bit value" of all the character elements of the specific string is assigned "And" (and) operation, you can get the bit value of the string

至于标记重叠概率计算方法，0标记与1标记是相通的。As for the calculation method of the marker overlap probability, the 0 marker and the 1 marker are interlinked.

5.多位标记与分组标记5. Multi-bit mark and group mark

1.多位标记1. Multi-bit flag

在单“位”(bit)标记的基础上，下面说明多“位”(bit)标记。如果存贮位值的位数n较多，而字符串的长度少，可用j个位的组合对应一组字符元进行标记，则共有个位的组合，相应地分字符元为组，如果S的一个字符元P属于第r组，则将W中构成第r个位的组合的j个位标记为1。j为1时即上述的单“位”(bit)标记。On the basis of single "bit" (bit) flags, multi-"bit" (bit) flags are described below. If there are more digits n for storing bit values, but the length of the character string is small, the combination of j bits can be used to mark a group of character elements, and there are a total of The combination of ones, the corresponding sub-characters are group, if a character element P of S belongs to the rth group, mark the j bits in W that constitute the combination of the rth bit as 1. When j is 1, it is the above-mentioned single "bit" (bit) mark.

设n为32，j为2，则将字符元分为组，即为496组，用“1”标记，若字符元属于第1组，则将W的1、2两个“位”标记为“1”；若字符元属于第2组，则将W的1、3两个“位”标记为1，若字符元属于第3组，则将W的1、4两个“位”标记为1，…。Let n be 32 and j be 2, then divide the characters into Group, that is, 496 groups, marked with "1", if the character element belongs to the first group, the two "bits" of W 1 and 2 are marked as "1"; if the character element belongs to the second group, then W The two "bits" 1 and 3 of W are marked as 1, and if the character element belongs to the third group, the two "bits" of W 1 and 4 are marked as 1, ....

假设汉字“前”字的位值第2、5两个位为1，而“景”字的位值第23、29两个位为1，则词语“前景”有2、5、23、29四个位为1，又设“远”的位值第2、23两个位为1，“大”的位值第5、29两个位为1，则词语“远大”也是2、5、23、29四个位为1，用“前景”的位值检索，结果中会出现“远大”，反之亦然，即检索结果中会出现含“不相关组字符元”的字符串，但由于每一组所含的汉字较单“位”标记为少，所以可以提高筛选效果的。采用多位标记的目的是提高筛选效果，其筛选概率计算与单位标记原理是相同的。下面计算字符串长为4个字符元，检索关键词长为2个字符元，j为1、2、3、4时的筛选概率，其值越低越好：Assuming that the 2nd and 5th bits of the bit value of the Chinese character "before" are 1, and the 23rd and 29th bits of the bit value of the "Jing" word are 1, then the word "foreground" has 2, 5, 23, and 29 Four bits are 1, set the 2nd, 23rd two bits of the bit value of " far " to be 1 again, the 5th, 29th two bits of " big " bit value are 1, then word " great " is also 2,5, The four bits 23 and 29 are 1, and if you use the bit value of "foreground" to search, "Yuanda" will appear in the result, and vice versa, that is, a character string containing "unrelated group characters" will appear in the search result, but due to Each group contains fewer Chinese characters than the single "bit" mark, so the screening effect can be improved. The purpose of using multiple markers is to improve the screening effect, and the calculation of the screening probability is the same as that of unit markers. The following calculates the screening probability when the length of the string is 4 characters, the length of the search keyword is 2 characters, and j is 1, 2, 3, or 4. The lower the value, the better:

j为2，不考虑重叠，则字符串标记后m为4*2＝8个1，检索关键词标记后k为2*2＝4个1。j is 2, regardless of overlapping, then m after character string marking is 4*2=8 1s, and k after retrieval keyword marking is 2*2=4 1s.

j为3，即字符元分为组，即4960组，不考虑重叠，则字符串标记后m为4*3＝12个1，检索关键词标记后k为2*3＝6个1。j is 3, that is, the character element is divided into Groups, that is, 4960 groups, regardless of overlapping, m after character string marking is 4*3=12 1s, and k after retrieval keyword marking is 2*3=6 1s.

j为4，即字符元分为组，即35960组，不考虑重叠，则字符串标记后m为4*4＝16个1，检索关键词标记后k为2*4＝8个1。j is 4, that is, the character element is divided into Groups, that is, 35960 groups, regardless of overlapping, m after character string marking is 4*4=16 1s, and k after retrieval keyword marking is 2*4=8 1s.

${P P}_{11} = = \frac{{C C}_{m m}^{k k}}{{C C}_{n no}^{k k}} = = \frac{\frac{m m!!}{k k!! ((m m - - k k))!!}}{\frac{n no!!}{k k!! ((n no - - k k))!!}} = = \frac{m m!! ((n no - - k k))!!}{n no!! ((m m - - k k))!!} = = \frac{44!! ((3232 - - 22))!!}{3232!! ((44 - - 22))!!} = = \frac{44 * * 33}{3232 * * 3131} = = 0.012097 0.012097$

${P P}_{22} = = \frac{\frac{88!!}{44!! ((88 - - 44))!!}}{\frac{3232!!}{44!! ((3232 - - 44))!!}} = = \frac{\frac{88!!}{44!!}}{\frac{3232!!}{2828!!}} = = \frac{88 * * 77 * * 66 * * 55 * *}{3232 * * 3131 * * 3030 * * 2929} = = 0.00194661 0.00194661$

${P P}_{33} = = \frac{\frac{1212!!}{66!! ((1212 - - 66))!!}}{\frac{3232!!}{66!! ((3232 - - 66))!!}} = = \frac{\frac{1212!!}{66!!}}{\frac{3232!!}{2626!!}} = = \frac{1212 * * 1111 * * 1010 * * 99 * * 88 * * 77 * *}{3232 * * 3131 * * 3030 * * 2929 * * 2828 * * 2727} = = 0.00101965 0.00101965$

${P P}_{44} = = \frac{\frac{1616!!}{88!! ((1616 - - 88))!!}}{\frac{3232!!}{88!! ((3232 - - 88))!!}} = = \frac{\frac{1616!!}{88!!}}{\frac{3232!!}{24 twenty four!!}} = = \frac{1616 * * 1515 * * 1414 * * 1313 * * 1212 * * 1111 * * 1010 * * 99}{3232 * * 3131 * * 3030 * * 2929 * * 2828 * * 2727 * * 2626 * * 2525} = = 0.00122358 0.00122358$

如果检索关键词长度为1、2个字符元时，多位标记尤其有意义。但在存贮位值n为一定时，j值不是越大越好，如上面计算显示，在n为32，m为4时，j为3效果最佳，继续增加，不能提高筛选效果，而且标记时重叠的比例会增加，标记操作也更复杂。如果对长字符串进行多位标记，n必须足够大。Multi-digit tags are especially meaningful if the search keyword length is 1 or 2 characters. But when the storage bit value n is constant, the value of j is not as big as possible. As the above calculation shows, when n is 32 and m is 4, j is the best effect when it is 3. If it continues to increase, the screening effect cannot be improved, and the mark When the overlap ratio increases, the labeling operation is more complicated. n must be sufficiently large if multi-bit tokenizing long strings.

2.分组标记：2. Group marking:

如将一个无符号长整数的32位分17和15两组，分别进行标记，字符串长m为4，不考虑标记重叠的问题，标记后为4个1，设关键词长为2，标记后为2个1，则筛选概率为：For example, divide the 32 bits of an unsigned long integer into two groups of 17 and 15, and mark them separately. The length of the string m is 4. Regardless of the problem of overlapping marks, there are four 1s after the mark, and the length of the keyword is set to 2. If there are two 1s after that, the screening probability is:

$P P = = \frac{44!! ((1717 - - 22))!!}{1717!! ((44 - - 22))!!} \times \times \frac{44!! ((1515 - - 22))!!}{1515!! ((44 - - 22))!!} = = \frac{44 * * 33}{1717 * * 1616} \times \times \frac{44 * * 33}{1515 * * 1414} = = \frac{1212}{272272} \times \times \frac{1212}{210210} = = 0.002521 0.002521$

显然，也可提高筛选效果。当然，实行这种优化，同样是标记的位n要数倍于m。这种优化方式，用于长字符串时可作变通，即用两种不同的字符元分组方式，用两个数据分别标记得到两组W、W’。检索时，先以其中一组的标记方法得到关键词的W_t，与数据库的W_n值进行比较，筛选出R₁后。在R₁中，以第二组的标记方法得到关键词的W’_t，与数据库的W’_n值进行比较得到R₂，再在R₂中用通常的逐位比较法得到最终结果集R_z。与上一种优化方法相比，同样是增加位值的存贮空间，这种优化方式不必读取全部的W’_n，与W’_t进行比较。Obviously, the screening effect can also be improved. Certainly, to carry out this kind of optimization, the bit n of the mark must be several times m as well. This optimization method can be adapted when used for long strings, that is, two different character grouping methods are used, and two data are respectively marked to obtain two groups of W and W'. When searching, first obtain the W _t of the keyword with one of the marking methods, compare it with the W _n value of the database, and filter out R ₁ . In _R1 , the _W't of the keyword is obtained by the marking method of the second group, compared with the _W'n value of the database to obtain _R2 , and then in _R2 , the final result set R is obtained by the usual bit-by-bit comparison method _z . Compared with the previous optimization method, it also increases the storage space of the bit value. This optimization method does not need to read all W' _n for comparison with W' _t .

0标记、1标记都可以进行多位标记、分组标记，分组标记也可以同多位标记结合，不同的组也可以分别是0标记、1标记。当然，适用为佳。0标记、多位标记与分组标记均可用于逆检索。Both 0 mark and 1 mark can carry out multi-bit mark and group mark, group mark can also be combined with multi-bit mark, and different groups can also be 0 mark and 1 mark respectively. Of course, it is better to apply. 0 mark, multi-bit mark and group mark can all be used for inverse retrieval.

位标记字符串检索的速度快于质数代换字符检索的速度，但字符元种类多时，用一个位对应一个字符元，不易实施；如用多位标记，检索结果中，不包含“目的组字符元”而仅包含“不相关组字符元”的字符串又会混杂进来。用一个质数代换一个字符元，可避免检索结果中出现这种混杂现象，应用中，可先用位标记字符串检索作初步筛选，再用质数代换字符检索作二次检索，如需要，再用通常的字符串逐位比较方法得到最终结果。The bit-marked character string retrieval speed is faster than the prime number substitution character retrieval speed, but when there are many types of character elements, it is not easy to implement with one bit corresponding to a character element; if multi-bit tags are used, the search results do not contain "target group characters". meta" and strings containing only "unrelated group characters meta" are mixed in. Substituting a character element with a prime number can avoid such a mixed phenomenon in the retrieval results. In the application, you can first use the bit-marked character string retrieval for primary screening, and then use the prime number substitution character retrieval for secondary retrieval. If necessary, Then use the usual string bit by bit comparison method to get the final result.

6.位标记字符串检索技术在语音输入及汉语拼音输入中的应用：6. Application of bit-marked string retrieval technology in speech input and Chinese pinyin input:

汉语的同音字、同音词多，故汉语语音输入及拼音输入中需要按拼音筛选适当的字词，位标记方法“逆检索”可用于这个方面。There are many homophones and homophones in Chinese, so it is necessary to filter suitable words by pinyin in Chinese phonetic input and pinyin input, and the bit mark method "reverse search" can be used for this aspect.

不考虑声调，汉语有400多个音节，不便于用一个位对应一个音节，可用8个字节的64个位对应汉语拼音的声母及韵母。设有汉语字词搭配及基本句型数据库，其中有基本句型“他毕业了”，拼音为“tabiyele”，标记后位值W₁是：Regardless of the tone, Chinese has more than 400 syllables, and it is not convenient to correspond to a syllable with one bit, and 64 bits of 8 bytes can be used to correspond to the initial consonant and the final vowel of the Chinese phonetic alphabet. There is a database of Chinese word collocation and basic sentence patterns, including the basic sentence pattern "he graduated", the pinyin is "tabiyele", and the post-mark value W ₁ is:

1000，0101，0000，0000，0000，0001，1011，0000，0000，0000，0000，0000，0000，0000，0000，00001000, 0101, 0000, 0000, 0000, 0001, 1011, 0000, 0000, 0000, 0000, 0000, 0000, 0000, 0000, 0000

如果语音转换或者拼音输入“tazaojiubiyele”，标记后位值W_t是：If the phonetic conversion or pinyin input "tazaojiubiyele", the post-mark value W _t is:

1000，0101，0001，0010，0000，0001，1011，0000，1000，0000，1000，0000，0000，0000，0000，00001000, 0101, 0001, 0010, 0000, 0001, 1011, 0000, 1000, 0000, 1000, 0000, 0000, 0000, 0000, 0000

W₁→W_t＝W_T W ₁ →W _t =W _T

则“他毕业了”是“tazaojiubiyele”可参考的基本句型，处理后可以得到“他zaojiu毕业了”，其中“毕业”为谓语，是动词，而词库中拼音为“zaojiu”的词有“早就”“造就”“枣酒”，如果语法、语义、词频等其它方面能起到辅助作用，则可进一步转换成“他早就毕业了”。由于声韵配合的多样性，以位值比较得到的初步结果中，拼音如“tebayile”之类的句型也会出现。可在位标记字符串的基础上，用一个质数对应一个音节的质数代换方法作二次筛选，得到更接近的结果。Then "he graduated" is the basic sentence pattern that "tazaojiubiyele" can refer to. After processing, "he zaojiu graduated" can be obtained, where "graduated" is a predicate and a verb, and the words with the pinyin of "zaojiu" in the thesaurus are "Early" and "bringing up" "jujube wine", if other aspects such as grammar, semantics, and word frequency can play a supporting role, it can be further converted into "he has graduated a long time ago". Due to the diversity of phonetic and rhyme coordination, sentence patterns such as "tebayile" in pinyin will also appear in the preliminary results obtained by place value comparison. On the basis of the bit-marked character string, a prime number substitution method in which a prime number corresponds to a syllable can be used for secondary screening to obtain a closer result.

如果字词搭配及基本句型中有“楼”与量词“栋”的搭配“栋楼”，又有“楼”与经常修饰它的形容词“高”的搭配“高楼”，可按它们的声母韵母进行标记。设语音转换或者拼音输入“zhedongxiezilouhengao”，同样按声母韵母进行标记，以数据库全部字词搭配及基本句型的位值W_n与之进行比较，可筛选出“栋楼”“高楼”等搭配作参考，处理后可以得到“zhe栋xiezi楼hen高”，其中的xiezi或xiezilou可从词库中得到“写字”或“写字楼”，如果语法、语义、词频等其它方面能起到辅助作用，可将“zhedongxiezilouhengao”转换成“这栋写字楼很高”。类似地可以得到汉语的其它搭配及相应的位值。行政区划如，“省市”“省县”；数字如，“七八”“万千”；量词“斤两”“元角”；乃至姓氏的组合“张李”“张刘”。如果输入“hubeishengxianningshi”，按照上述方法，可以得到“hubei省xianning市”。如果输入“zhanglonglihu”，可以得到“张long李hu”。如果输入“qiwansanqian”，可以得到“七万san千”。同样，可在位标记字符串的基础上，用一个质数对应一个音节的质数代换方法作二次筛选，得到更接近的结果。If there is a collocation "building" between "lou" and the quantifier "dong" in the word collocation and basic sentence patterns, and there is also a collocation "high building" between "lou" and the adjective "high" that often modifies it, you can press their initials The vowels are marked. If voice conversion or pinyin input "zhedongxiezilouhengao" is also marked according to the consonant and final consonant, all word collocations in the database and the bit value W _n of the basic sentence pattern are compared with it, and collocations such as "buildings" and "high buildings" can be screened out. For reference, after processing, "zhe building xiezi building hen height" can be obtained, where xiezi or xiezilou can get "writing" or "office building" from the lexicon. If other aspects such as grammar, semantics, and word frequency can play an auxiliary role, you can Convert "zhedongxiezilouhengao" to "this office building is very tall". Other Chinese collocations and corresponding place values can be obtained similarly. Administrative divisions such as "provinces" and "provinces and counties"; numbers such as "Qiba" and "Wanqian"; quantifiers "jinliang" and "yuanjiao"; and even the combination of surnames "Zhang Li" and "Zhang Liu". If you input "hubeishengxianningshi", you can get "Xianning City, Hubei Province" according to the above method. If you input "zhanglonglihu", you can get "Zhang longlihu". If you enter "qiwansanqian", you can get "seven thousand three thousand". Similarly, on the basis of the bit-marked character string, a prime number substitution method in which a prime number corresponds to a syllable can be used for secondary screening to obtain a closer result.

另外，由于汉语声母、韵母的使用频率是不均衡的，单位标记时，以8个字节的数据的64个“位”对应汉语拼音的声母、韵母，以某些音节如li检索时，由于1与i均是高频声母、韵母，筛选效果可能不好，可以考虑用双位标记，即以n为32，j为2，即496组位，对应汉语的400个音节，并对数据库作测算，使之尽可能均衡。当然，按400个音节用双位标记，初步结果中肯定会出现“非目的”句型和搭配，可在位标记检索筛选的基础上，用一个质数对应一个音节的质数代换方法作二次筛选，得到更接近的结果。In addition, because the frequency of use of Chinese initials and finals is unbalanced, when the unit is marked, 64 "bits" of 8-byte data correspond to the initials and finals of Chinese Pinyin. When searching for some syllables such as li, due to Both 1 and i are high-frequency initials and finals, and the screening effect may not be good. Double-digit marks can be considered, that is, n is 32, j is 2, that is, 496 groups, corresponding to 400 syllables in Chinese, and the database Calculate and make it as balanced as possible. Of course, if 400 syllables are marked with double digits, there will definitely be "non-purpose" sentence patterns and collocations in the preliminary results. On the basis of the search and screening of digit markers, a prime number substitution method of one prime number corresponding to one syllable can be used for secondary Filter to get closer results.

汉语分词中也可运用此项技术，将词库中的词语、参考搭配及句型，按拼音标记位值，关键词(实际是句子)也按拼音标记位值，进行位蕴含运算，得到R_i，可以有效地缩小查找范围；当然也可以将词库中的词语、参考搭配及句型，按内码的分组标记位值，关键词也按同样的分组标记位值，进行位蕴含运算，同样可以有效地缩小查找范围，再辅以其它方法完成对关键词(也就是句子)的分解。This technology can also be used in Chinese word segmentation, and the words, reference collocations and sentence patterns in the thesaurus are marked with pinyin, and keywords (actually sentences) are also marked with pinyin, and the bit implication operation is performed to obtain R _i , can effectively narrow down the search scope; of course, words, reference collocations and sentence patterns in the thesaurus can also be grouped and marked with bit values according to internal codes, and keywords can also be marked with the same grouped bit values to perform bit implication operations. It is also possible to effectively narrow down the scope of search, and then complete the decomposition of keywords (that is, sentences) with the aid of other methods.

英语中由于连读、语音流变现象多，不似汉语音节分明清晰，也可用位标记字符串检索技术筛选参考词汇、短语。设数据库有短语in the morning，按其中的元音、辅音i、n、m、n、i、η进行标记得到位值。设语音输入转换后，其中的the所对应的比较模糊，可以剔除，以其它比较响亮的i、n、m、n、i、η进行标记得到位值，以之与数据库参考词汇、短语的位值进行比较，可以得短语in the morning，及其它读音含i、n、m、n、i、η的短语或单词。在此基础上，用质数代换音节的方法进一步缩小范围，然后再根据上下文，以语法、语义、词频作辅助，选择in the morning。In English, there are many phenomena of continuous reading and phonetic rheology, which are not as clear as Chinese syllables. Bit-marked string retrieval technology can also be used to filter reference words and phrases. Suppose the database has a phrase in the morning, according to the vowel, consonant i, n, m, n, i, n are marked to obtain the bit value. After the voice input is converted, the corresponding to the It is relatively fuzzy and can be eliminated, and other relatively loud i, n, m, n, i, n are marked to obtain the bit value, and compared with the bit value of the database reference vocabulary and phrase, the phrase in the morning can be obtained, and other pronunciations contain i, n, m, Phrases or words of n, i, η. On this basis, the method of substituting syllables with prime numbers further narrows the scope, and then according to the context, with the assistance of grammar, semantics, and word frequency, choose in the morning.

以上叙述了位标记字符串检索的方法及相关的概率计算公式、逻辑代数原理，并侧重于数据库中的字符串记录的筛选。从编程角度看，数据库的数据结构是多种多样的，各种数据结构也不只用于数据库。但位标记字符串检索技术作为一种字符串算法，可用于各种数据结构的字符串查找，如大数组中的字符串模糊查找。至于小数组之字符串模糊查找，则未必有需要，因为字符串的标记也需时间及空间。The method of bit-marked character string retrieval and related probability calculation formulas and logical algebra principles are described above, and the emphasis is placed on the screening of character string records in the database. From a programming point of view, the data structure of the database is diverse, and various data structures are not only used in the database. However, as a string algorithm, bit-marked string retrieval technology can be used for string search in various data structures, such as string fuzzy search in large arrays. As for the fuzzy search of character strings in small arrays, it may not be necessary, because the marking of character strings also requires time and space.

具体实施方式Detailed ways

本发明在汉字字符串数据库及英文字符串数据库模糊检索中已得到良好的实现，下面是对SQL SERVER2000中的汉字字符串数据库，以“1”进行标记的vb6.0代码，其它编程语言、数据库的位标记字符串模糊检索可参照实施。The present invention has obtained good realization in the fuzzy retrieval of Chinese character string database and English character string database, below is to the Chinese character string database in SQL SERVER2000, with the vb6.0 code that " 1 " is marked, other programming language, database The fuzzy retrieval of bit-marked strings can refer to the implementation.

1.建立数据库1. Create a database

设数据库shuku有表biao，其中有字段shuming，数据类型为nvarchar，长度为40。另建立字段wei，数据类型为“长整型”，也就是4个字节，有32个位，其中一“位”为符号位，其余31个位可以用于标记。Suppose the database shuku has a table biao, which has a field shuming, the data type is nvarchar, and the length is 40. Another field wei is created, the data type is "long integer", that is, 4 bytes, with 32 bits, one of which is a sign bit, and the remaining 31 bits can be used for marking.

2 vb6.0中没有直接将“位”置为“1”或“0”的命令，故利用位的“或”运算对数据库字符串作“标记”2 There is no command to directly set the "bit" to "1" or "0" in vb6.0, so use the "or" operation of the bit to "mark" the database string

dim shuzu(30)As Longdim shuzu(30)As Long

‘定义一个31个元素的长整型数组。'Define an array of long integers with 31 elements.

shuzu(0)＝1shuzu(0)=1

For x＝1 To 30For x＝1 To 30

shuzu(x)＝2*shuzu(x-1)shuzu(x)=2*shuzu(x-1)

NextNext

‘对长整型数组的31个元素赋值，从1、2、4、8、16至1073741824，从二进制来看，有一位为1，而其余位为0，也就是“基本位值”。'Assign values to the 31 elements of the long integer array, from 1, 2, 4, 8, 16 to 1073741824. From the binary point of view, one bit is 1, while the rest are 0, which is the "basic bit value".

Dim biaostr As StringDim biaostr As String

‘存贮当前处理的字符串'Store the currently processed string

Dim weizhi As LongDim weizhi As Long

‘存贮字符串的位值'Store the bit value of the string

Dim weizhilin As LongDim weizhilin As Long

‘存贮一个字符的基本位值'Store the basic bit value of a character

Dim x As IntegerDim x As Integer

biaors.MoveFirstbiaors. MoveFirst

‘移至数据库记录集biaors的第一个记录'Move to the first record of the database recordset biaors

Dodo

weizhilin＝0weizhilin=0

weizhi＝0weizhi=0

With biaorsWith biases

biaostr＝.Fields(″shuming″)biaostr=.Fields("shuming")

End WithEnd With

‘读入一个记录之字符串，赋与字符串变量biaostr'Read the string of a record and assign it to the string variable biaostr

For x＝1 To Len(biaostr)For x＝1 To Len(biaostr)

index＝Abs(Asc W(Mid(biaostr，x，1))Mod 31)index＝Abs(Asc W(Mid(biaostr, x, 1))Mod 31)

‘从字符串变量biaostr中取一个字符，并将该字符内码，以31为模作运算，再取绝对值，并赋与index，也就是对字符元进行分组。'Take a character from the string variable biaostr, and use the internal code of the character to operate modulo 31, then take the absolute value, and assign it to index, that is, to group the character elements.

weizhilin＝shuzu(index)weizhilin=shuzu(index)

‘将数组shuzu(index)值赋与weizhilin，是1、2、4、8、16至1073741824之一。'Assign the array shuzu (index) value to weizhilin, which is one of 1, 2, 4, 8, 16 to 1073741824.

weizhi＝weizhi Or weizhilinweizhi＝weizhi Or weizhilin

‘将一个字符的“基本位值”weizhilin值与weizhi作位的“或”运算。'The "or" operation of "basic bit value" weizhilin value and weizhi of a character.

NextNext

‘循环结束，获得当前字符串的“位值”weizhi'The end of the loop, get the "bit value" of the current string

With biaorsWith biases

.Fields(″wei″)＝weizhi.Fields("wei")＝weizhi

End WithEnd With

biaors.Updatebiaors.Update

‘将“位值”weizhi存贮进当前记录的字段wei中'Store the "bit value" weizhi into the field wei of the current record

biaors.MoveNextbiaors.MoveNext

‘处理下一个记录'Process the next record

Loop While Not biaors.EOFLoop While Not biaors.EOF

3.利用位的“和”运算进行数据库字符串模糊检索3. Use the "and" operation of bits to perform fuzzy retrieval of database strings

Dim shuzu(30)As LongDim shuzu(30)As Long

shuzu(0)＝1.shuzu(0)=1.

For x＝1 To 30For x＝1 To 30

shuzu(x)＝2*shuzu(x-1)shuzu(x)=2*shuzu(x-1)

NextNext

‘定义一个31个元素的长整型数组，赋值从1、2、4、8、16至1073741824，与“标记”之数组完成一致。'Define a 31-element long integer array, assigned values from 1, 2, 4, 8, 16 to 1073741824, which is consistent with the array of "mark".

Dim weizhi As LongDim weizhi As Long

‘存贮一个字符串的位值'Store the bit value of a string

Dim weizhilin As LongDim weizhilin As Long

‘存贮一个字符的位值'Store the bit value of a character

Dim textstr As StringDim textstr As String

‘存贮检索字符串'store search string

Dim x As IntegerDim x As Integer

weizhilin＝0weizhilin=0

weizhi＝0weizhi=0

textstr＝Text1.Texttextstr=Text1.Text

‘Text1.Text为文本框中的检索关键词'Text1.Text is the search keyword in the text box

For x＝1 To Len(biaostr)For x＝1 To Len(biaostr)

index＝Abs(Asc W(Mid(textstr，x，1))Mod 31)index=Abs(Asc W(Mid(textstr,x,1))Mod 31)

weizhilin＝shuzu(index)weizhilin=shuzu(index)

weizhi＝weizhi Or weizhilinweizhi＝weizhi Or weizhilin

NextNext

‘对检索关键词进行标记得到“位值”weizhi，方法与数据库字符串“标记”方法一致。'Mark the search keywords to get the "bit value" weizhi, the method is consistent with the database string "marking" method.

strQuery＝″select*from(SELECT*FROM biao WHERE(wei&″&weizhi&″)＝″&weizhi&″)DERIVEDTBL WHERE(shuming like′％″&textstr&″％′)″strQuery = "select*from(SELECT*FROM biao WHERE(wei&"&weizhi&")="&weizhi&")DERIVEDTBL WHERE(shuming like′%″&textstr&″%′)″

‘SQL SERVER 2000中没有位蕴含运算符，这里将检索关键词的“位值”weizhi与数据库各记录的“位值”wei作“和”(and)运算，筛选出“和”(and)运算结果等于关键词的“位值”weizhi的记录(也可以用(wei|″&weizhi&″)＝wei)，得到初步结果集R_i，再以通常字符串模糊检索方式作二次检索，得到最终结果R_z。这是SQL SERVER 2000的查询语句，其它数据库可能略有不同。'There is no bit implication operator in SQL SERVER 2000. Here, the "bit value" weizhi of the search keyword and the "bit value" weizhi of each record in the database are used for "and" (and) operation, and the "and" (and) operation is selected The result is equal to the record of the "bit value" weizhi of the keyword (you can also use (wei|"&weizhi&")=wei) to get the preliminary result set R _i , and then perform a secondary search with the usual string fuzzy search method to get the final result R _z . This is the query statement of SQL SERVER 2000, other databases may be slightly different.

Adodc 1.RecordSource＝strQueryAdodc 1.RecordSource＝strQuery

Adodc 1.RefreshAdodc 1. Refresh

‘执行检索'Execute the search

DataList 1.ListField＝″shuming″DataList 1.ListField="shuming"

DataList l.ReFillDataList l.ReFill

‘在列表框中显示当前检索结果。'Display the current search results in the list box.

Claims

1. A character string fuzzy search method, characterized in that:

Use the data W with n bits (bits) that are all 0 to mark the character meta-information that constitutes the string, and take j bits as a combination, then there are The combination of ones, correspondingly divides all characters into Group, with a combination of bits corresponding to a group of character elements, marked; if a character element P ₁ of the string S belongs to the rth group, then the j bits of W that constitute the combination of the rth bit are marked as 1, Similarly, W is marked according to the group to which other character elements P ₂ , P ₃ , P ₄ ... of S belong, and after all character elements are marked, W has the character meta information of S recorded, which is called 1 of the string S tag bit value;

Use data with n bits all 1 To mark the meta-information of the characters that make up the character string, taking the j bits as a combination, there are a total of The combination of ones, correspondingly divides all characters into Group, with a combination of bits corresponding to a group of character elements, marked; if a character element P ₁ of the string S belongs to the rth group, then the The j bits that constitute the combination of the rth bit are marked as 0, similarly, according to the group pairs to which other character elements P ₂ , P ₃ , P ₄ ... of S belong markup, after completing all character meta-tags The character meta information of S is recorded, which is called the 0 mark bit value of the string S;

Note that the 1 flag value of the string S _a is W _a , and the 0 flag value of S _a is Record the value of the 1 mark bit of the string S _b as W _b , record the value of the 0 mark bit of S _b Use the following method for string comparison:

1. compare with W _a and W _b , if all W _a is the position of 1, the corresponding bit of W _b is also 1, then S _b includes or may include all character elements of S _a ;

II. to and compare if all bit of 1, The corresponding bit is also 1, then S _b contains or may contain all the characters of S _a ;

III. With W _a and To compare, if All the corresponding bits with W _a are not 1 at the same time, then S _b contains or may contain all the characters of S _a ;

IV. to Compare with W _b if All the corresponding bits of W _b are not 0 at the same time, then S _b contains or may contain all the characters of S _a .

2. The method according to claim 1, characterized in that: for W _a , with W _b , The comparison is realized with bit operators: write W _T as the bit value with all bits being 1, then method I can use W _a → W _b = W _T , W _a | W _b = W _b , W _a & W _b = W _a to fulfill.

3. The method according to claim 1, characterized in that: for W _a , with W _b , Comparing, if not meeting the judging criteria, then S _b does not contain all character elements of S _a ; if meeting the judging criteria, when each group of character elements is marked with a single bit and there is only one character element in each group of character elements , then S _b contains all the characters of S _a , and it is not sure whether S _b contains S _a ; if it meets the judgment standard, when each group of characters is marked with a single bit and there are multiple characters in each group of characters , or when multiple bits are used to mark each group of character elements, then S _b may contain all the character elements of S _a , and it is not sure whether S _b contains S _a ; according to needs, use the character-by-character comparison method to judge Does S _b contain S _a .

4. according to the described method of claim 1, it is characterized in that: character element comprises Chinese character radical, stroke; Letter, initial consonant, simple or compound vowel of Chinese phonetic alphabet, syllable; Letter, syllable, word of other languages; The phonetic symbol of other language; Or they combination.

5. according to the described method of claim 1, it is characterized in that: character string S comprises one or more character strings, when comprising a character string, its bit value W and Mark the character meta-information of a string, when multiple strings are included, its bit value W and Tokenize all character meta information for multiple strings.

6. according to the described method of claim 1, it is characterized in that: search key word S _t has 1 flag value W _t and 0 flag value And the string S _n has 1 flag value W _n and 0 flag value By W _t , with W _n , Comparing: It is possible to screen out all the S _n characters that contain or may contain the search keyword S _t . If necessary, use the character-by-character comparison method to determine whether S _n contains S _t or not, and also filter out the search keyword S _t S _n that contains or may contain all its character elements, if necessary, can use the character-by-bit comparison method to judge whether S _t contains S _n .