JPH03125263A

JPH03125263A - Retrieving method using phrase index for information retrieving system

Info

Publication number: JPH03125263A
Application number: JP1263067A
Authority: JP
Inventors: Mikizo Kasugai; 春日井　幹三
Original assignee: TOKAI TV HOSO KK; Tokai Television Broadcasting Co Ltd
Current assignee: TOKAI TV HOSO KK; Tokai Television Broadcasting Co Ltd
Priority date: 1989-10-11
Filing date: 1989-10-11
Publication date: 1991-05-28
Also published as: JPH0587865B2

Abstract

PURPOSE: To shorten the scanning time of the real data and to improve the character retrieval speed by selecting candidate real data via a character index and a phrase index. CONSTITUTION: When information is retrieved, the phrases formed from combinations of characters in the search string are extracted. The key values corresponding to each phrase are obtained, and a logical AND is taken between the same-position bits of the bit strings that correspond to those key values in the phrase index. The data numbers indicated by the resulting on-bits identify the data that contain the phrases of the search string. That is, a prescribed combination of characters is expressed as an AND of several keys. The total number of keys can thus be cut down to several thousand, even though the permutations of any two characters among the several thousand characters used in Japanese text number in the tens of millions. As a result, the retrieval time for combinations of characters can be shortened.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to an information retrieval system that searches a database consisting of a large amount of character information, in Japanese or a similar language, for data matching given search conditions.

(Prior Art) As a free-word information retrieval system, the "Japanese Information Retrieval System" of Japanese Patent Application No. Sho 61-055683 has been proposed. That system has only a character index, which holds, for each individual character, the character type and the data-number designation bits of the real data containing that character. For every character type in the character string specified as the search condition, it performs a logical operation between the bit strings that indicate the data numbers in the index. The result gives the data numbers of the real data that contain all the character types of the specified string. The system then reads and scans that real data, determines whether it matches the character string specified as the search condition, and thereby retrieves the target real data.
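
The prior-art flow described above (character index, bitwise AND, then a scan of the candidates) can be sketched as follows. This is only an illustrative sketch, not the cited system's implementation: the function names are hypothetical, Python integers stand in for the bit strings, and the two sample records are the ones used later in this description.

```python
from collections import defaultdict


def build_char_index(records):
    """Map each character to an int used as a bit string; bit n is set
    when the record numbered n contains that character."""
    index = defaultdict(int)
    for number, text in records.items():
        for ch in set(text):
            index[ch] |= 1 << number
    return index


def on_bits(bits):
    """Return the positions of the 1 bits, i.e. the data numbers."""
    return [n for n in range(bits.bit_length()) if (bits >> n) & 1]


def char_candidates(index, query):
    """AND the bit strings of every distinct character in the query."""
    chars = set(query)
    if not chars:
        return 0
    bits = -1  # all ones
    for ch in chars:
        bits &= index.get(ch, 0)
    return bits


def search(records, index, query):
    """Scan only the candidate records for the literal query string."""
    return [n for n in on_bits(char_candidates(index, query)) if query in records[n]]


# Sample records (assumed here; they match the example used in the embodiment).
records = {123: "情報の蓄積と検索", 456: "東海テレビ情報検索システム"}
index = build_char_index(records)
```

With these records, both data numbers survive the character-index stage for the query 「情報検索」, and only the scan of the real data separates them, which is the cost the invention aims to reduce.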

The above method has no index for combinations of characters. Therefore, when searching for the character string 「情報検索」 ("information retrieval"), for example, the system must first retrieve the data containing the four characters 「情」, 「報」, 「検」, and 「索」, and then search within that data for the character string 「情報検索」.

(Problem to Be Solved by the Invention) In such a conventional method, the logical operation on the character index satisfies the necessary condition that the data contain at least all the characters of the specified string. Because combinations of characters are not taken into account, however, the real data must always be scanned, and when the number of candidate real data items is large the search takes an extremely long time.

(Means and Operation for Solving the Problem) As a means of solving the above problem, the present invention provides a retrieval method using a collocation index in an information retrieval system, characterized as follows. When information is registered, a data number is assigned to the registered data; collocations formed by combinations of characters appearing in the registered data are extracted; each collocation is expressed as a combination of a plurality of key values; and, in a collocation index that holds each key value together with a data-number designation bit string corresponding to that key value, the collocation is registered by turning on the bit corresponding to the data number in the bit string of each of the plurality of key values. When information is retrieved, collocations formed by combinations of characters contained in the character string given as the search condition are extracted; the key values corresponding to each collocation are obtained; a logical AND is taken between the same-position bits of the bit strings corresponding to the plurality of key values in the collocation index; and the data containing the collocations of the search string are retrieved from the data numbers indicated by the resulting on-bits.

The invention also addresses the above problem by a retrieval method using a collocation index in an information retrieval system, characterized in that a collocation number is assigned to each collocation according to the kanji code table; the collocation number is reduced modulo (MOD) each of several integers, for which a plurality of prime numbers are chosen; and a type mark is added to each result, converting it into the character number of a single kanji, the results serving as the plurality of key values.

The method of the present invention expresses a predetermined combination of characters as the logical AND of several keys. Although the permutations of, for example, any two characters among the several thousand characters used in Japanese text number in the tens of millions, the total number of keys can be reduced to several thousand, so that an index on character combinations can be handled in exactly the same way as a conventional character index.

(Embodiment) Next, an embodiment of the present invention is described with reference to the drawings.

Fig. 3 shows how kanji are defined in the JIS code. Kanji are divided by frequency of use into Level 1 and Level 2, and each kanji is represented by two code numbers.

Let the first code number be called the "hen" (扁) and the second the "tsukuri" (旁). Level 1 kanji have 32 hen code numbers, from (30) to (4F) in hexadecimal, and Level 2 kanji likewise have 36 hen code numbers, from (50) to (73). Each of these 68 hen values is combined with the 94 tsukuri code numbers from (21) to (7E), so the total number of kanji positions is their product, 6,392 (including the 43 undefined positions at the end of Level 1).
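
The code layout just described can be turned into a serial character number as in the sketch below. The function names are hypothetical. As an assumption of this sketch, Python's standard `euc_jp` codec is used to obtain the two JIS code bytes of a character, since EUC-JP stores each JIS X 0208 byte with its high bit set; this lookup is not part of the patent text.

```python
LEVEL1_ROWS = range(0x30, 0x50)   # 32 first-byte values
LEVEL2_ROWS = range(0x50, 0x74)   # 36 first-byte values
CELLS = range(0x21, 0x7F)         # 94 second-byte values

TOTAL_POSITIONS = (len(LEVEL1_ROWS) + len(LEVEL2_ROWS)) * len(CELLS)


def char_number(first, second):
    """Serial number (1-based) of the kanji position with the given code bytes."""
    if not (0x30 <= first <= 0x73 and 0x21 <= second <= 0x7E):
        raise ValueError("not a Level 1 / Level 2 kanji position")
    return (first - 0x30) * 94 + (second - 0x21) + 1


def char_number_of(ch):
    """Look up the JIS code bytes of a character via EUC-JP (an assumption
    of this sketch) and return its serial number."""
    hi, lo = ch.encode("euc_jp")
    return char_number(hi - 0x80, lo - 0x80)
```

For example, `char_number(0x30, 0x21)` is 1 and `char_number(0x73, 0x7E)` is 6,392, matching the first and last positions of the table.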

The method of the present invention can be applied to combinations of any number of characters, but to keep the explanation simple, the case of collocations of two kanji is described below with reference to Fig. 3.

There are also various ways of numbering character combinations; the case of serial numbering is described here.

Furthermore, there are various ways of determining the keys when a collocation is expressed as the logical AND of several numerical values (keys); the case of taking the number modulo (MOD) several integers is described here.

First, following the table of Fig. 3, serial numbers are assigned so that the first character 「亜」 is No. 1, the next character 「唖」 is No. 2, and so on; the last character, 「禽」, is then No. 6,392. On this basis the two-kanji collocations are numbered in order as well: 「亜亜」 is No. 1, 「亜唖」 is No. 2, ..., 「亜禽」 is No. 6,392, 「唖亜」 is No. 6,393, ..., and the last, 「禽禽」, is No. 40,857,664.

In general, if the number of the first character is P and the number of the second character is S, the collocation number is

N = (P - 1) × 6,392 + S    ... (1)

From N, keys are obtained by taking it modulo (MOD) several predetermined integers, and the collocation is defined by those keys.
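
Equation (1) is small enough to check directly. The sketch below uses a hypothetical function name and takes the character numbers as plain integers; the sample values are the ones quoted in this description.

```python
CHARSET_SIZE = 6392  # number of kanji positions in Level 1 and Level 2


def collocation_number(p, s):
    """Equation (1): serial number of the two-character collocation whose
    first character has number p and second character has number s."""
    if not (1 <= p <= CHARSET_SIZE and 1 <= s <= CHARSET_SIZE):
        raise ValueError("character numbers must lie in 1..6392")
    return (p - 1) * CHARSET_SIZE + s


# Character numbers 1,396 and 2,527 are those given for 「情」 and 「報」.
n_joho = collocation_number(1396, 2527)
```

The numbering is a base-6,392 positional encoding of the pair, so every pair receives a distinct number between 1 and 6,392 × 6,392 = 40,857,664.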

The above is explained in more detail below through the operation of this embodiment at data storage time and at search time.

Fig. 1 explains the collocation index registration performed when data is stored in this embodiment, taking the data 「情報の蓄積と検索」 ("storage and retrieval of information") as an example.

Suppose the data 「情報の蓄積と検索」 has been input, assigned a data number (123 in this example), and stored in the real data section (Process 1). All characters contained in the data 「情報の蓄積と検索」 are then extracted (Process 2; hiragana are omitted here).

Next, in the character index bit string of each extracted character, the bit corresponding to the data number is set to logical "1" (the circled portions), registering the data in the character index (Process 3).

Next, every combination of adjacent characters contained in the data 「情報の蓄積と検索」 is extracted as a collocation. This gives seven pairs, 「情報／報の／の蓄／蓄積／積と／と検／検索」, although Fig. 1 omits the pairs that are not kanji-kanji combinations (Process 4).

Next, key values are determined so that each of these seven pairs is expressed as the logical AND of several keys (numerical values). The pair 「情報」 is taken as the example here.

First, the character numbers of 「情」 and 「報」 are obtained from the JIS code shown in Fig. 3: 「情」 is No. 1,396 and 「報」 is No. 2,527. From equation (1) above, the collocation number N of 「情報」 is therefore N = (1,396 - 1) × 6,392 + 2,527 = 8,919,367.

Next, keys are obtained by taking N modulo (MOD) several predetermined integers. The number of keys and the values of the integers can be chosen in many ways; here the number of keys is four and the integers are, for example, the four primes (751, 743, 739, 733). Then

MOD(751) = remainder of (8,919,367 ÷ 751) = 491
MOD(743) = remainder of (8,919,367 ÷ 743) = 395
MOD(739) = remainder of (8,919,367 ÷ 739) = 376
MOD(733) = remainder of (8,919,367 ÷ 733) = 223

which gives the four values (491, 395, 376, 223).
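
The four remainders can be reproduced with a one-line helper. The name `residues` is hypothetical; the moduli and the collocation number are the ones given in the text.

```python
MODULI = (751, 743, 739, 733)  # the four primes used in the example


def residues(n, moduli=MODULI):
    """Remainders of the collocation number n for each modulus, in order."""
    return tuple(n % m for m in moduli)


joho_residues = residues(8919367)
```

Each remainder is smaller than its modulus, so every value lies below 751; this is what allows the next step to place each of the four key types in its own range of 752 values.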

If these values were used directly as the four keys, a separate type mark would be required (it is needed to tell the four keys apart). Instead, the following calculation places each type in one of four ranges of 752 values, obtained by dividing the 3,008 Level 1 kanji positions into four.

First key:  MOD(751) + 752 × 0 = 491   → 「亀」
Second key: MOD(743) + 752 × 1 = 1,147 → 「軸」
Third key:  MOD(739) + 752 × 2 = 1,880 → 「寵」
Fourth key: MOD(733) + 752 × 3 = 2,479 → 「弊」

The key values (491, 1,147, 1,880, 2,479) obtained in this way are then regarded as character numbers and translated by the JIS code of Fig. 3, giving 「亀／軸／寵／弊」.
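
The range-offset step can be sketched as below, together with the simpler thousands-digit marking that the description mentions as an alternative. The function names are hypothetical, and the sketch stops at the numeric key values; it does not perform the final conversion of a key value into a kanji.

```python
MODULI = (751, 743, 739, 733)
KEY_RANGE = 752  # 3,008 Level 1 positions split into four ranges


def typed_keys(n):
    """Key k is the k-th remainder shifted into the k-th range of 752 values,
    so the range itself acts as the type mark."""
    return tuple(n % m + KEY_RANGE * k for k, m in enumerate(MODULI))


def thousands_marked_keys(n):
    """Alternative marking: the thousands digit (1..4) carries the key type."""
    return tuple(1000 * (k + 1) + n % m for k, m in enumerate(MODULI))


joho_keys = typed_keys(8919367)
joho_keys_alt = thousands_marked_keys(8919367)
```

Because the largest remainder is 750, the four ranges 0..751, 752..1,503, 1,504..2,255, and 2,256..3,007 never overlap, so a key value alone identifies which modulus produced it.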

The final conversion to kanji is a device for handling these keys in the same way as a conventional character index; it is not essential. (For example, expressing the type in the thousands digit, giving (1491, 2395, 3376, 4223), is the most straightforward method.) Either way, the collocation 「情報」 can thus be expressed by four keys, for example a combination of four kanji (Process 4).

Then, in the collocation index bit strings of the four keys (kanji) of the collocation 「情報」, the bit corresponding to the data number is set to logical "1" in each, registering the collocation in the index. Four keys are likewise calculated for each of 「蓄積」 and 「検索」 and registered in the collocation index (Process 5).
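
Registration (Processes 4 and 5) can be sketched as follows. All names are hypothetical. As assumptions of this sketch, the JIS code bytes are obtained through Python's `euc_jp` codec, a Python integer stands in for each bit string, and only kanji-kanji pairs are indexed, as in Fig. 1.

```python
from collections import defaultdict

CHARSET_SIZE = 6392
MODULI = (751, 743, 739, 733)
KEY_RANGE = 752


def kanji_number(ch):
    """Serial number (1..6392) of a Level 1/2 kanji, or None for anything else."""
    try:
        raw = ch.encode("euc_jp")  # assumption: EUC-JP = JIS bytes + 0x80
    except UnicodeEncodeError:
        return None
    if len(raw) != 2:
        return None
    first, second = raw[0] - 0x80, raw[1] - 0x80
    if not (0x30 <= first <= 0x73 and 0x21 <= second <= 0x7E):
        return None
    return (first - 0x30) * 94 + (second - 0x21) + 1


def collocation_keys(pair):
    """Four typed key values of a kanji pair, or None if it is not kanji-kanji."""
    p, s = kanji_number(pair[0]), kanji_number(pair[1])
    if p is None or s is None:
        return None
    n = (p - 1) * CHARSET_SIZE + s
    return tuple(n % m + KEY_RANGE * k for k, m in enumerate(MODULI))


def register(index, number, text):
    """Turn on bit `number` in the bit string of every key of every
    adjacent kanji pair in `text`."""
    for i in range(len(text) - 1):
        keys = collocation_keys(text[i:i + 2])
        if keys is None:
            continue
        for key in keys:
            index[key] |= 1 << number


collocation_index = defaultdict(int)
register(collocation_index, 123, "情報の蓄積と検索")
```

After this call the index holds at most twelve entries (four keys for each of the three kanji pairs), each with bit 123 turned on.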

Next, searching with the collocation index is described.

Fig. 2 explains the search operation of this embodiment, taking the character string 「情報検索」 ("information retrieval") as the example search condition.

Assume that the real data section already holds many data items, including data No. 123 「情報の蓄積と検索」, as shown in the real data section of Process 1 in Fig. 1, and data No. 456 「東海テレビ情報検索システム」 ("Tokai Television information retrieval system").

When the character string 「情報検索」 is input as the search condition (Process 1), the character index bit strings of the four characters 「情」, 「報」, 「検」, and 「索」 are first ANDed at the same bit positions. Suppose this operation yields a bit string in which only bits 123 and 456 are "1". It follows that only data No. 123 and No. 456 contain the four characters 「情／報／検／索」. No other data can contain the character string 「情報検索」, so all other data may be dropped from the search candidates (Process 2).

Next, the three collocations 「情報」, 「報検」, and 「検索」 contained in the character string 「情報検索」 are extracted, and four keys are calculated for each by the same calculation used at registration. As a result, the keys of 「情報」 are 「亀／軸／寵／弊」; by the same calculation, the keys of 「報検」 are 「翫／後／凍／ＮＪ」 and the keys of 「検索」 are 「窟／湘／瀞／彼」 (Process 3).

Then the bit string obtained from the AND operation on the character index is ANDed, at the same bit positions, with the collocation index bit strings of these 12 keys. Suppose this yields a bit string in which only bit 456 is "1"; then the only data that can contain the character string 「情報検索」 is No. 456 (Process 4).

Data No. 456 is then read from the real data section, and once it is confirmed that it actually contains the character string 「情報検索」, it is established as data matching the search condition (Process 5).
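
Search Processes 1 to 5 can be sketched end to end as below. This is a minimal illustration under the same assumptions as before: hypothetical names, Python integers as bit strings, JIS bytes taken from the `euc_jp` codec, and only kanji-kanji pairs indexed. It is not presented as the system's actual implementation.

```python
from collections import defaultdict

MODULI = (751, 743, 739, 733)


def kanji_number(ch):
    """Serial number (1..6392) of a Level 1/2 kanji, else None."""
    try:
        raw = ch.encode("euc_jp")  # assumption: EUC-JP = JIS bytes + 0x80
    except UnicodeEncodeError:
        return None
    if len(raw) != 2 or not (0xB0 <= raw[0] <= 0xF3 and 0xA1 <= raw[1] <= 0xFE):
        return None
    return (raw[0] - 0xB0) * 94 + (raw[1] - 0xA1) + 1


def pair_keys(text):
    """Yield the four typed keys of every adjacent kanji pair in text."""
    for a, b in zip(text, text[1:]):
        p, s = kanji_number(a), kanji_number(b)
        if p is not None and s is not None:
            n = (p - 1) * 6392 + s
            yield from (n % m + 752 * k for k, m in enumerate(MODULI))


def build_indexes(records):
    char_index, collocation_index = defaultdict(int), defaultdict(int)
    for number, text in records.items():
        for ch in set(text):
            char_index[ch] |= 1 << number
        for key in pair_keys(text):
            collocation_index[key] |= 1 << number
    return char_index, collocation_index


def on_bits(bits):
    return [n for n in range(bits.bit_length()) if (bits >> n) & 1]


def search(records, char_index, collocation_index, query):
    """Return (character-stage candidates, collocation-stage candidates, hits)."""
    if not query:
        return [], [], []
    bits = -1  # all ones
    for ch in set(query):                      # Process 2: character index AND
        bits &= char_index.get(ch, 0)
    char_stage = on_bits(bits)
    for key in pair_keys(query):               # Processes 3-4: collocation index AND
        bits &= collocation_index.get(key, 0)
    phrase_stage = on_bits(bits)
    hits = [n for n in phrase_stage if query in records[n]]  # Process 5: scan
    return char_stage, phrase_stage, hits


records = {123: "情報の蓄積と検索", 456: "東海テレビ情報検索システム"}
char_index, collocation_index = build_indexes(records)
char_stage, phrase_stage, hits = search(records, char_index, collocation_index, "情報検索")
```

Data No. 456 always survives the collocation stage because it was registered with all three pairs of the query; whether other candidates survive depends on accidental key coincidences, which is why the final scan is kept in this sketch.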

In this way, by searching the collocation index together with the character index, the present invention narrows the candidate real data to far fewer items than a search using the character index alone.

Next we explain why so few collocation index entries can handle such an enormous number of combinations. As noted above, there are about 40 million possible collocations of two Level 1 or Level 2 kanji. Since 350 cubed, for example, already exceeds 40 million, the values (keys) on three coordinate axes with a scale of about 350 each suffice to place every collocation at a unique coordinate.

The converse does not hold, however: a coordinate being "1" does not guarantee that the corresponding collocation is present. Several collocations may interfere with one another and happen to mark that coordinate. This appears as search noise and lowers the precision of the search result. In such a case, roughly doubling the scale and adding one more coordinate axis, for about 300 billion coordinates, suppresses this search noise to a considerable degree.
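
The counting argument can be checked numerically, as sketched below with hypothetical names. The uniqueness remark relies on the Chinese remainder theorem, a standard result brought in here as an aid; the patent text itself argues only from the number of coordinates.

```python
from math import prod

MODULI = (751, 743, 739, 733)
PAIR_COUNT = 6392 ** 2           # all two-kanji collocations

three_axes = 350 ** 3            # three axes with a scale of about 350
four_axes = prod(MODULI)         # four axes with a scale of about 750
smallest_three = prod(sorted(MODULI)[:3])


def coordinates(n):
    """Position of collocation number n on the four coordinate axes."""
    return tuple(n % m for m in MODULI)


# Two different collocation numbers can share a single coordinate,
# which is the source of the interference described above.
a, b = 8919367, 8919367 + 751
shares_first_axis = coordinates(a)[0] == coordinates(b)[0]
```

Since the moduli are distinct primes and even the product of the three smallest exceeds the number of pairs, no two collocation numbers share all four coordinates; search noise can arise only when the bits set by several different collocations together cover all the keys of a collocation that is absent.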

In the above example, data No. 123 was eliminated by the logical operation for the following reason. Since data No. 123 is 「情報の蓄積と検索」, bit 123 is naturally "1" in the collocation index bit strings of the eight keys corresponding to 「情報」 and 「検索」. The probability that bit 123 is also "1" in all four collocation index bit strings of the keys corresponding to 「報検」, however, is extremely small.

The above example dealt with a search condition consisting of the single character string 「情報検索」. More complex conditions can be handled just as easily by changing the logical operations performed between the same bit positions of the character index and collocation index bit strings: for example, data containing all of several strings, such as 「情報」 AND 「検索」; data containing any of several strings, such as 「情報」 OR 「検索」; or mixtures of the two.

The above example described collocations of kanji with kanji, but a collocation index for other combinations, such as kanji with hiragana or katakana with katakana, can easily be built by the same method.

Likewise, although the above example described combinations of two characters, the method can also be applied to combinations of three or more characters.

Furthermore, in the above example the number of collocation index entries for kanji-kanji pairs is 2,966, the sum of the four primes (751, 743, 739, 733), and when converted to characters they fit within the Level 1 range. In other words, the more than 40 million kanji-kanji combinations are expressed within the range of Level 1 characters. Since the precision of the logical operation is determined by the number of coordinate axes (keys) and the range of values each can take (key values), the precision can easily be raised further as needed.

Although the above description concerns information retrieval in Japanese, the present invention is even better suited to information retrieval in Chinese, which is written entirely in kanji.

(Effects of the Invention) According to the present invention, using both a character index and a collocation index narrows the candidate real data to fewer items than the conventional method, which narrows the candidates with a character index alone. The time spent scanning real data is therefore shortened and the search speed is improved.

Moreover, the precision of this logical operation is extremely high. Search noise may occasionally creep in and add superfluous data to the candidates, but data matching the search condition are never missed. Exploiting this property, the scan of the real data can in practice be omitted. In that case the search result can be presented from the logical operations alone, which shortens the search time markedly.

[Brief Description of the Drawings]

Fig. 1 is a diagram explaining the operation of registering the character index and the collocation index in one embodiment of the present invention. Fig. 2 is a diagram explaining the search operation of this embodiment. Fig. 3 is a diagram showing how kanji are defined in the JIS code, as background for explaining the present invention.

Claims

(1) A retrieval method using a collocation index in an information retrieval system, characterized in that: when information is registered, a data number is assigned to the registered data, collocations formed by combinations of characters appearing in the registered data are extracted, each collocation is expressed as a combination of a plurality of key values, and, in a collocation index having each key value and a data-number designation bit string corresponding to that key value, the collocation is registered by turning on the bit corresponding to the data number for each of the plurality of key values; and when information is retrieved, collocations formed by combinations of characters contained in the character string given as the search condition are extracted, the key values corresponding to each collocation are obtained, a logical AND is taken between the same-position bits of the bit strings corresponding to the plurality of key values in the collocation index, and data containing the collocations of the character string given as the search condition are retrieved from the data numbers indicated by the resulting on-bits.

(2) The retrieval method using a collocation index in an information retrieval system according to claim 1, characterized in that a collocation number is assigned to each collocation according to the kanji code table, the collocation number is reduced modulo (MOD) each of several integers for which a plurality of prime numbers are chosen, and a type mark is added to each result, converting it into the character number of a single kanji, the results serving as the plurality of key values.