JPH09120401A - Thesaurus preparing device

Info

Publication number: JPH09120401A
Application number: JP7275264A
Authority: JP
Inventors: Hitoshi Sakamoto; 仁坂本; Miki Sasaki; 美樹佐々木; Tokuji Ikeno; 篤司池野
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-10-24
Filing date: 1995-10-24
Publication date: 1997-05-06

Abstract

PROBLEM TO BE SOLVED: To automatically prepare a thesaurus from input materials. SOLUTION: A morpheme analysis part 2 reads a character string, analyzes it into a string of words, prepares a list giving the part of speech of each word, and outputs the list to a cooccurrence extraction part 3. The cooccurrence extraction part 3 analyzes the list, extracts cooccurrence data, and outputs the data to a cooccurrence storage part 4. The cooccurrence storage part 4 stores the cooccurrence data, together with the appearance frequency of each kind of data, in a storage device. A semantic distance calculation part 6 calculates the semantic distances between the words by a quantifying means from the data stored by the cooccurrence storage part 4 and outputs the results to a thesaurus preparation part 7. The thesaurus preparation part 7 divides the words into groups on the basis of the semantic distances received from the semantic distance calculation part 6 and further divides those groups into subgroups. After the operation is completed for all the words, the thesaurus preparation part 7 outputs the hierarchically grouped words as a thesaurus.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an apparatus that automatically creates a thesaurus in which words are grouped, and is applicable, for example, to natural language processing systems such as machine translation.

[0002]

[Description of the Related Art] Natural language processing systems, such as machine translation systems that translate from Japanese to English or from English to Japanese, and Japanese input front-end processors (FEPs) that convert voice input, kana input, or romaji input into mixed kana-kanji text, have resolved or reduced the polysemy (ambiguity) that arises from simple syntactic analysis by assigning a semantic code to each word in the dictionary in advance and referring to that code.

[0003] For example, unless the semantic codes are consulted, the Japanese expression "豚は肉は食べない" (literally, "pig, meat, does not eat") admits not only the interpretation "the pig does not eat meat" but also the meaningless interpretation "meat does not eat the pig".

[0004] Furthermore, when a person assigns semantic codes by intuition alone, it can happen that, for example, from the expression "体に肉がついてきた" ("flesh has been put on the body", that is, weight has been gained), the person judges that "肉" (meat, flesh) has a meaning close to body weight and forgets to assign it the semantic code for food. In that case, a judgment that refers to the semantic codes leaves only the meaningless interpretation "meat does not eat the pig".

[0005] For this reason, methods of automatically generating a more appropriate semantic code system, so that semantic processing based on semantic codes always succeeds, have been studied (reference: 杉村領一, 柿ヶ原康二, 石川雅彦, 川越陸, 青山昇一, "Automatic Generation of a Semantic Code System", Information Processing Society of Japan, Natural Language Processing 78-4, July 19, 1990).

[0006]

[Problems to Be Solved by the Invention] However, conventional attempts to automate thesaurus creation build the thesaurus from usage examples collected by hand. Compared with assigning semantic codes directly by hand, this reduces the influence of differences in interpretation among workers, but which examples are extracted still depends on the worker, so worker-dependent bias could not be avoided. Here, a thesaurus is a systematized collection in which words are gathered by similar concept, organized, and developed into superordinate concepts.

[0007] In addition, because collecting usage examples for creating a thesaurus requires a great deal of work, thesauri dedicated to specific documents or materials were not created; instead, a single thesaurus was made to cover a wide variety of documents and materials. As a result, such a thesaurus could not reflect, for example, that in text about "gaining weight" or "losing weight" the word "肉" is closer to "body weight" than to food, and semantic processing often failed.

[0008]

[Means for Solving the Problems] The invention of claim 1 is a thesaurus creating apparatus that includes morphological analysis means for analyzing an input sentence, decomposing it into words, and specifying the part of speech of each word, and that groups the words to create a thesaurus, characterized by including the following means.

[0009] That is, the apparatus includes: co-occurrence extraction storage means for extracting co-occurrence data from the information analyzed by the morphological analysis means and storing, together with the co-occurrence data, the appearance frequency of each type of co-occurrence data; semantic distance calculation means for calculating, from the co-occurrence data and the per-type appearance frequencies stored in the co-occurrence extraction storage means, semantic distance information between first constituent words of the co-occurrence data and/or between second constituent words of the co-occurrence data; and thesaurus generation means for grouping the first constituent words and/or the second constituent words on the basis of the semantic distance information.

[0010] The invention of claim 2 is the thesaurus creating apparatus according to claim 1, further including frequency management means for managing distribution information on the appearance frequency of each type of co-occurrence data stored in the co-occurrence extraction storage means, characterized in that, when a certain condition is satisfied, the co-occurrence extraction storage means deletes the information on co-occurrence data with a low appearance frequency on the basis of the information obtained from the frequency management means.

[0011] In the inventions of claims 1 and 2, the co-occurrence extraction storage means extracts the co-occurrence data and records the appearance frequency of each type of co-occurrence data, the semantic distance calculation means calculates the semantic distance information between first constituent words and/or between second constituent words of the co-occurrence data for all combinations, and the thesaurus generation means groups the first constituent words and/or the second constituent words on the basis of that semantic distance information. The entire process, from morphological analysis of the input sentence to thesaurus generation, can thereby be performed automatically.

[0012] Further, in the invention of claim 2, the frequency management means manages the distribution of the appearance frequency of each type of co-occurrence data, and when a certain condition is satisfied, the co-occurrence extraction storage means deletes the information on co-occurrence data with a low appearance frequency on the basis of the information obtained from the frequency management means. This saves the resources used for storing co-occurrences.

[0013]

[Embodiments of the Invention] Embodiments of the thesaurus creating apparatus according to the present invention are described in detail below with reference to the drawings.

[0014] (First Embodiment) FIG. 1 shows the functional configuration of the thesaurus creating apparatus according to the first embodiment. In practice, the thesaurus creating apparatus of the first embodiment is realized on a computer system such as a workstation or personal computer equipped with a large-capacity auxiliary storage device; a description of its hardware configuration is omitted.

[0015] Functionally, as shown in FIG. 1, the thesaurus creating apparatus of the first embodiment consists of a control unit 1, a morphological analysis unit 2, a co-occurrence extraction unit 3, a co-occurrence storage unit 4, a semantic distance calculation unit 6, and a thesaurus generation unit 7.

[0016] The control unit 1 receives the location from which the input character string is to be read and the location to which the output thesaurus is to be written, and controls and operates each unit in the thesaurus creating apparatus.

[0017] The morphological analysis unit 2 reads the character string from the read source, analyzes it into a sequence of words, and specifies the part of speech of each word.

[0018] The co-occurrence extraction unit 3 extracts, from the morphological analysis result produced by the morphological analysis unit 2, each portion in which three words, a noun, a particle, and a verb, appear in this order. Such a portion is hereinafter called a co-occurrence.

[0019] The co-occurrence storage unit 4 stores in a storage device the co-occurrences extracted by the co-occurrence extraction unit 3 and the appearance frequency of each co-occurrence.

[0020] The semantic distance calculation unit 6 reads the information that the co-occurrence storage unit 4 stored in the storage device and calculates, by a quantification method, the semantic closeness between the words that form co-occurrences.

[0021] The thesaurus generation unit 7 creates groups of semantically close words on the basis of the closeness values obtained by the semantic distance calculation unit 6 and outputs them to the specified write destination of the thesaurus.

[0022] The operation of the thesaurus creating apparatus of the first embodiment is described below.

[0023] On receiving the read source of the input character string, the control unit 1 sends that read source to the morphological analysis unit 2 and starts the series of processes, from the morphological analysis unit 2 through the co-occurrence storage unit 4, for that character string.

[0024] The morphological analysis unit 2 reads the character string from the read source, analyzes it into a sequence of words, creates a list in which the part of speech of each word is specified, and outputs the list to the co-occurrence extraction unit 3.

[0025] The co-occurrence extraction unit 3 analyzes the list received from the morphological analysis unit 2 and, wherever three words, a noun, a particle, and a verb, appear in this order, outputs that partial list to the co-occurrence storage unit 4 as a co-occurrence.
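The extraction step in paragraph [0025] can be sketched as a sliding window over the tagged word list. This is only an illustrative sketch: the `(surface, part_of_speech)` tuple layout, the tag strings `"noun"`, `"particle"`, and `"verb"`, and the function name `extract_cooccurrences` are assumptions, since the text does not specify the analyzer's output format.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); layout is assumed


def extract_cooccurrences(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return every (noun, particle, verb) triple found in consecutive positions."""
    pattern = ("noun", "particle", "verb")  # hypothetical tag names
    triples = []
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if tuple(pos for _, pos in window) == pattern:
            triples.append(tuple(word for word, _ in window))
    return triples


# Hypothetical analyzer output for the sentence "豚が肉を食べる"
tokens = [("豚", "noun"), ("が", "particle"), ("肉", "noun"),
          ("を", "particle"), ("食べる", "verb")]
print(extract_cooccurrences(tokens))
```

Only strictly adjacent noun, particle, verb sequences match here, which follows the wording of the text; anything looser (for example, skipping modifiers) would be a different design.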

[0026] The co-occurrence storage unit 4 searches the storage device for each co-occurrence received from the co-occurrence extraction unit 3. If the co-occurrence is already stored, the unit increments the appearance frequency of the retrieved record by one and updates the record; if it is not stored, the unit stores it in the storage device as a new record with an appearance frequency of 1. When it has finished processing all the input data, the co-occurrence storage unit 4 notifies the control unit 1 that processing is complete.
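The update rule in paragraph [0026] (increment an existing record, otherwise create one with frequency 1) maps directly onto a counting dictionary. The sketch below keeps the records in memory with `collections.Counter`; the in-memory store and the name `store_cooccurrences` are assumptions for illustration, whereas the text describes records held in a storage device.

```python
from collections import Counter
from typing import Iterable, Tuple

Cooccurrence = Tuple[str, str, str]  # (noun, particle, verb)


def store_cooccurrences(store: Counter, items: Iterable[Cooccurrence]) -> None:
    """Add each co-occurrence to the store, counting how often it has appeared."""
    for item in items:
        # Counter treats a missing key as 0, so a new record starts at 1
        # and an existing record is incremented by one.
        store[item] += 1


store: Counter = Counter()
store_cooccurrences(store, [("肉", "を", "食べる"),
                            ("肉", "を", "食べる"),
                            ("水", "を", "飲む")])
print(store)
```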

[0027] On receiving the write destination of the output thesaurus, the control unit 1 starts the series of processes from the semantic distance calculation unit 6 through the thesaurus generation unit 7 and, at the same time, sends the write destination to the thesaurus generation unit 7.

[0028] The semantic distance calculation unit 6 first calculates, by a method similar to the chi-square (χ²) calculation, how far the co-occurrence frequencies actually stored in the storage device by the co-occurrence storage unit 4 deviate from the theoretical co-occurrence frequencies that would be expected if each word appeared at random with no semantic relationship to the others. It then calculates the semantic closeness between words by comparing the distributions of the words (word groups) with which they co-occur. These calculation methods are described below.

[0029] First, the co-occurrence degree Wmn (where m identifies the type of the first constituent word and n identifies the type of the second constituent word) is calculated for every combination of a first constituent word and a second constituent word forming a co-occurrence. After that, the semantic distance Dij between two types of first constituent words, or between two types of second constituent words, is obtained for every combination (i and j each identify a different type of first constituent word, or a different type of second constituent word). In this embodiment, the first constituent word is a noun and the second constituent word is a verb.

[0030] Let T be the total number of co-occurrence data items, Cm the number of input co-occurrence data items whose first constituent word is word m, Cn the number of input co-occurrence data items whose second constituent word is word n, and Cmn the number of input co-occurrence data items whose first constituent word is word m and whose second constituent word is word n. The co-occurrence degree Wmn between word m and word n, which the semantic distance calculation unit 6 obtains first, can then be expressed by equation (1) below.

[0031]

[Equation 1]

Here, wmn in equation (1) is given by equation (2) below.

[0032]

[Equation 2]

Equation (2) is based on the following reasoning. The number (appearance frequency) Cmn of input co-occurrence data items whose first constituent word is word m and whose second constituent word is word n is very small compared with the total number T of co-occurrence data items, so Cmn can be regarded as following a Poisson distribution.

[0033] The expected number of co-occurrence data items whose first constituent word is word m and whose second constituent word is word n is given by equation (3) below.

[0034]

[Equation 3]

Because the count follows a Poisson distribution, its standard deviation is given by equation (4) below.

[0035]

[Equation 4]

Therefore, the amount by which the actual count Cmn of co-occurrence data items with word m as the first constituent word and word n as the second constituent word deviates from its expected value, expressed as a multiple of the standard deviation (the co-occurrence degree), is given by equation (5) below.

[0036]

[Equation 5]

The probability that the appearance frequency is 0 is given by equation (6) below.

[0037]

[Equation 6]

Because the count follows a Poisson distribution, the probability that the actual appearance frequency is 0 is high even when the expected value is of moderate size, and this probability varies with the expected value. Moreover, the Poisson distribution is not symmetric about its expected value. For these reasons, the co-occurrence degree of equation (2) above incorporates a correction that takes into account the probability of an appearance frequency of 0.
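The bodies of equations (3) through (6) do not survive in this text. Under the assumptions the description states (words pair at random with no semantic relationship, and Cmn follows a Poisson distribution), the standard forms would be as follows. This is a reconstruction offered as a reading aid, not the patent's own typesetting; the symbols E_mn, σ_mn, and z_mn are labels introduced here. Equations (1), (2), (7), and (8) cannot be recovered from the surrounding text and are not attempted.

```latex
\begin{align}
E_{mn} &= \frac{C_m \, C_n}{T}
  && \text{expected count, cf. (3)} \\
\sigma_{mn} &= \sqrt{\frac{C_m \, C_n}{T}}
  && \text{Poisson standard deviation, cf. (4)} \\
z_{mn} &= \frac{C_{mn} - C_m C_n / T}{\sqrt{C_m C_n / T}}
  && \text{deviation in standard-deviation units, cf. (5)} \\
P(C_{mn} = 0) &= e^{-C_m C_n / T}
  && \text{probability of zero occurrences, cf. (6)}
\end{align}
```

The first line follows from independence: a data item has first word m with probability Cm/T and second word n with probability Cn/T, so T items are expected to yield T(Cm/T)(Cn/T) such pairs. The remaining lines use the Poisson properties that the variance equals the mean and that the zero-count probability is the exponential of minus the mean.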

[0038] With this reasoning, the co-occurrence degree between the first constituent word m and the second constituent word n is quantified as wmn according to equation (2), but the range of values that wmn can take is quite wide. Equation (1) therefore converts wmn into the co-occurrence degree Wmn, whose range of values is narrow. Equation (1) is also designed so that Wmn ranges from -1 to 1.
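The uncorrected quantities behind equation (2), namely the expected count, its Poisson standard deviation, the deviation in standard-deviation units, and the zero-occurrence probability, can be computed from stored counts as sketched below. Several points are assumptions: the function name `poisson_deviation`, the reduction of each co-occurrence to a `(noun, verb)` pair with the particle dropped, and the independence form Cm·Cn/T for the expected count. The zero-frequency correction of equation (2) and the rescaling to -1 to 1 of equation (1) are not reproduced in the text and are deliberately left out.

```python
import math
from typing import Dict, Tuple


def poisson_deviation(pair_counts: Dict[Tuple[str, str], int],
                      m: str, n: str) -> Tuple[float, float, float, float]:
    """Return (expected, std, deviation, p_zero) for noun m and verb n."""
    total = sum(pair_counts.values())                             # T
    c_m = sum(c for (a, _), c in pair_counts.items() if a == m)   # Cm
    c_n = sum(c for (_, b), c in pair_counts.items() if b == n)   # Cn
    c_mn = pair_counts.get((m, n), 0)                             # Cmn
    if total == 0 or c_m == 0 or c_n == 0:
        raise ValueError("word not observed; the deviation is undefined")
    expected = c_m * c_n / total             # expected count under independence
    std = math.sqrt(expected)                # Poisson: variance equals the mean
    deviation = (c_mn - expected) / std      # deviation in standard deviations
    p_zero = math.exp(-expected)             # Poisson probability of a zero count
    return expected, std, deviation, p_zero


# Hypothetical counts, for illustration only
counts = {("pig", "eat"): 4, ("meat", "grow"): 4}
print(poisson_deviation(counts, "pig", "eat"))
```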

[0039] The semantic distance Dij between word i and word j obtained by the semantic distance calculation unit 6 can be expressed by equation (7) below. Equation (7) gives the semantic distance Dij between two types of first constituent words i and j. The equation for the semantic distance Dij between two types of second constituent words i and j is almost the same as equation (7), so it is omitted. The explanation of the equation's meaning is likewise given for equation (7).

[0040]

[Equation 7]

Here, the summation Σ is taken over the second constituent words k for which at least one of the co-occurrence degrees Wik and Wjk is positive.

[0041] Equation (7) is a modification of the semantic similarity dij given by equation (8) below.

[0042]

[Equation 8]

In this embodiment, as stated in the proviso to equation (7), the semantic distance between the first constituent words i and j is captured by considering the set K of second constituent words k that have a reasonably strong co-occurrence degree Wik with the first constituent word i of some input co-occurrence data, or a reasonably strong co-occurrence degree Wjk with another first constituent word j. For each element of K there is a value of the co-occurrence degree Wik for word i and a value of the co-occurrence degree Wjk for word j. That is, for a given second constituent word k there exist a co-occurrence degree Wik with word i and a co-occurrence degree Wjk with word j, and there are as many such pairs of values as there are elements in K. Equation (8) computes, over these pairs, a correlation-coefficient-like value dij between one member of each pair (the data for word i) and the other member (the data for word j). Therefore, if dij is large, the relationship (semantic similarity) between the first constituent words i and j, as mediated by the second constituent words k, is high; conversely, if dij is small, that relationship (semantic similarity) is low.

[0043] As described above, equation (7) is obtained by modifying equation (8). In equation (7), 1 is added to prevent the denominator from becoming 0. The part corresponding to equation (8) is subtracted from 1 so that the value conforms to the notion implied by the word "distance", namely that a smaller value means closer. Furthermore, equation (7) divides by a predetermined number so that the distance Dij takes values in the range 0 to 1. Constraining the range in this way makes relative comparison of distances easy.

[0044] As described above, equation (7) yields the semantic distance Dij between two words i and j serving as first constituent words, on the basis of their co-occurrence degrees Wik and Wjk with the second constituent words k. Although the equation is omitted, the semantic distance Dij between two words i and j serving as second constituent words can likewise be obtained on the basis of their co-occurrence degrees Wki and Wkj with the first constituent words k.

[0045] The semantic distance calculation unit 6 reads the information that the co-occurrence storage unit 4 stored in the storage device, calculates the semantic distance Dij using the calculation method above, and sends to the thesaurus generation unit 7 a list whose elements are triples of word i, word j, and semantic distance Dij. The semantic distance calculation unit 6 does not calculate the semantic distance between identical words or between words whose appearance frequency is extremely low. The latter are skipped because, when the appearance frequency is extremely low, the calculated figure is meaningless.

[0046] The thesaurus generation unit 7 divides the words into groups so that the number of groups is the minimum within a predetermined range for the number of groups per level (groups never overlap). A group here is one in which the semantic distance Dab between a word a and any word b belonging to the same group always satisfies Dab < Dac with respect to the semantic distance Dac between a and any word c not belonging to that group. The unit then successively divides those groups in the same way, again minimizing the number of groups within the per-level range, and performs this subdivision recursively until subdivision has been repeated for the predetermined number of levels or until groups can no longer be formed within the per-level range. In addition to the condition Dab < Dac, the unit checks whether the grouping is possible within a predetermined range for the number of constituent words per group. In this grouping, the semantic distance between identical words, which the semantic distance calculation unit 6 did not calculate, is treated as 0, and the semantic distance between words with an extremely low appearance frequency is treated as 1.
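The group condition of paragraph [0046] (every in-group distance Dab is smaller than every distance Dac from a group member to an outsider) and the two default distances (0 for identical words, 1 for uncalculated pairs) can be checked as sketched below. The data layout, a dictionary keyed by unordered word pairs, and the names `distance` and `satisfies_group_condition` are assumptions for illustration; the words and distance values in the example are hypothetical and are not taken from Table 1.

```python
from typing import Dict, FrozenSet, Set

Distances = Dict[FrozenSet[str], float]  # unordered word pair -> semantic distance


def distance(dist: Distances, a: str, b: str) -> float:
    """Look up Dab, applying the default values described in the text."""
    if a == b:
        return 0.0                              # identical words: distance 0
    return dist.get(frozenset((a, b)), 1.0)     # uncalculated pair: distance 1


def satisfies_group_condition(group: Set[str], population: Set[str],
                              dist: Distances) -> bool:
    """True if Dab < Dac for all a, b in the group (a != b) and all c outside it."""
    outside = population - group
    for a in group:
        for b in group:
            if a == b:
                continue
            for c in outside:
                if not distance(dist, a, b) < distance(dist, a, c):
                    return False
    return True


# Hypothetical distances; the pair (w2, w4) is left out and defaults to 1.0
dist = {frozenset(("w1", "w2")): 0.1, frozenset(("w3", "w4")): 0.2,
        frozenset(("w1", "w3")): 0.8, frozenset(("w1", "w4")): 0.9,
        frozenset(("w2", "w3")): 0.7}
words = {"w1", "w2", "w3", "w4"}
print(satisfies_group_condition({"w1", "w2"}, words, dist))
print(satisfies_group_condition({"w1", "w3"}, words, dist))
```

This sketch only validates a proposed group. Searching for the partition with the fewest groups, and the per-group word-count limits, are separate steps that the text describes but that are not implemented here.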

[0047] This grouping is explained with reference to FIG. 2, taking as an example the operation of the thesaurus generation unit 7 when grouping the words in Table 1. Table 1 shows the semantic distances between the words cucumber, tomato, pumpkin, melon, apple, watermelon, and peach. Grouping down to level N-1 is assumed to have been carried out recursively already, and level N is about to be grouped. Here, N is a positive integer.

[0048]

[Table 1]

Here, grouping is performed with the number of groups per level ≥ 2, the number of constituent words per group ≥ 2, and the number of levels of subdivision ≤ M (where M is a positive integer and N < M).

[0049] Let cucumber be a and tomato be b. Then Dab = 0.2, and Dab < Dac holds whichever of melon, apple, watermelon, or peach is taken as c. With pumpkin as b instead, Dab = 0.3, and Dab < Dac again holds whichever of melon, apple, watermelon, or peach is taken as c. Furthermore, with tomato as a and pumpkin as b, Dab = 0.3, and Dab < Dac holds whichever of melon, apple, watermelon, or peach is taken as c. Therefore, cucumber, tomato, and pumpkin belong to the same group (group 1), and melon, apple, watermelon, and peach do not belong to this group. Since the group has three constituent words, the condition that the number of constituent words per group ≥ 2 is also satisfied. Similarly, whichever of melon, apple, watermelon, and peach are taken as a and b, and whichever of cucumber, tomato, and pumpkin is taken as c, Dab < Dac holds. Therefore, melon, apple, watermelon, and peach belong to the same group (group 2), and cucumber, tomato, and pumpkin do not belong to this group. Since the group has four constituent words, the condition that the number of constituent words per group ≥ 2 is also satisfied (level N: step 201).

[0050] In this way, the words of Table 1 could be divided into two groups; as described later, group 2 of level N can be divided further into two groups. However, because grouping is performed so as to minimize the number of groups within the predetermined range for the number of groups per level, level N is divided into two groups rather than three.

【００５１】When group creation is finished, words that do not belong to any group are given the identification number 0 (in the example of Table 1, there are no such words). For words that belong to a group, each resulting group is checked to see whether it can be subdivided further. If it can, the group is passed on as a new population and recursively divided into groups. If it cannot, an identification number corresponding to the group is attached to each word. Group 1 consists of three words and cannot be subdivided any further, so identification numbers are assigned as [cucumber, 1], [tomato, 1], [pumpkin, 1]. Because the number of groups per layer ≥ 2 and the number of constituent words per group ≥ 2, further subdivision requires at least four words (2 × 2), and N + 1 ≤ M must also hold. Group 2 can be subdivided (number of words in the group = 4, N + 1 ≤ M), so the partial word list shown in Table 2 is created and passed on for recursive grouping (Nth layer: steps 202 to 207).
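
The subdivision test described in this paragraph can be sketched as a small predicate. This is only an illustrative sketch: the function name and parameter names are hypothetical, and the defaults of 2 simply mirror the lower bounds quoted in the text.

```python
def can_subdivide(num_words: int, layer: int, max_layers: int,
                  min_groups: int = 2, min_words_per_group: int = 2) -> bool:
    """Return True if a group found in `layer` (N) may be split again in layer N + 1.

    Assumed reading of the text: a split needs at least
    min_groups * min_words_per_group words, and the next layer
    must not exceed the maximum number of layers M (N + 1 <= M).
    """
    enough_words = num_words >= min_groups * min_words_per_group
    within_depth = layer + 1 <= max_layers
    return enough_words and within_depth


# Group 1 (3 words) cannot be split; group 2 (4 words) can, while N + 1 <= M.
print(can_subdivide(3, layer=1, max_layers=3))
print(can_subdivide(4, layer=1, max_layers=3))
```

With these defaults, a three-word group is always a leaf, and a four-word group is a leaf only when the layer limit has been reached.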

【００５２】[0052]

[Table 2] The following describes grouping in the (N+1)th layer when the list of Table 2 is passed on from the Nth layer.

【００５３】In Table 2, if melon is a and watermelon is b, then Dab = 0.1, and Dab < Dac holds whether c is apple or peach. Likewise, if apple is a and peach is b, then Dab = 0.1, and Dab < Dac holds whether c is melon or watermelon. Since each of these groups has two constituent words, the condition that the number of constituent words per group ≥ 2 is also satisfied. Thus melon and watermelon form one group (group 1), and apple and peach form another (group 2). All four words fall into one of the groups, so no word is left ungrouped. Because both groups consist of only two words, neither can be subdivided further, so identification numbers corresponding to the groups are assigned as [melon, 1], [watermelon, 1], [apple, 2], [peach, 2] ((N+1)th layer: steps 201 to 205).
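
The grouping condition Dab < Dac used above can be checked mechanically. The following is a minimal sketch, not the device's actual procedure: the helper names are hypothetical, and since Table 2 is not reproduced here, only the two within-pair distances of 0.1 come from the text. The cross-pair distance of 0.5 is an assumed placeholder.

```python
from itertools import permutations


def distance(dist, x, y):
    """Look up the symmetric semantic distance between two words."""
    return dist[frozenset((x, y))]


def is_valid_group(group, population, dist, min_words=2):
    """Check that every pair (a, b) inside the group is closer than
    any pair (a, c) where c is a word of the population outside the group."""
    if len(group) < min_words:
        return False
    outsiders = [w for w in population if w not in group]
    return all(
        distance(dist, a, b) < distance(dist, a, c)
        for a, b in permutations(group, 2)
        for c in outsiders
    )


words = ["melon", "watermelon", "apple", "peach"]
dist = {
    frozenset(("melon", "watermelon")): 0.1,  # value given in the text
    frozenset(("apple", "peach")): 0.1,       # value given in the text
    # Assumed cross-pair distances (not from the source):
    frozenset(("melon", "apple")): 0.5,
    frozenset(("melon", "peach")): 0.5,
    frozenset(("watermelon", "apple")): 0.5,
    frozenset(("watermelon", "peach")): 0.5,
}

print(is_valid_group({"melon", "watermelon"}, words, dist))
print(is_valid_group({"melon", "apple"}, words, dist))
```

Under these assumed distances, {melon, watermelon} and {apple, peach} satisfy the condition, while a mixed pair such as {melon, apple} does not.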

【００５４】Even if each group in the (N+1)th layer had four or more constituent words, the groups could not be subdivided when N + 2 > M.

【００５５】Returning to the Nth layer, the identification number of the Nth layer is attached in front of the identification numbers assigned in the (N+1)th layer, giving [melon, 21], [watermelon, 21], [apple, 22], [peach, 22]. Although this does not apply to the example of Table 2, if no group at all could be created in the (N+1)th layer, the result is not [melon, 20], [watermelon, 20], [apple, 20], [peach, 20]. Instead, the identification number 0 assigned in the (N+1)th layer is deleted, giving [melon, 2], [watermelon, 2], [apple, 2], [peach, 2] (Nth layer: step 208).
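
The way identification numbers are combined on returning from the (N+1)th layer can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, identification numbers are modelled as digit strings, and the rule for a mix of grouped and ungrouped words (keeping a trailing 0 for the ungrouped ones) is my reading of the text rather than something it states explicitly.

```python
def combine_ids(parent_id: int, child_ids: dict[str, int]) -> dict[str, str]:
    """Prefix each child-layer ID with the parent group's ID.

    If the child layer produced no group at all (every child ID is 0),
    the 0 is dropped and only the parent ID remains.
    """
    no_groups = all(cid == 0 for cid in child_ids.values())
    return {
        word: str(parent_id) if no_groups else f"{parent_id}{cid}"
        for word, cid in child_ids.items()
    }


child = {"melon": 1, "watermelon": 1, "apple": 2, "peach": 2}
print(combine_ids(2, child))

ungrouped = {"melon": 0, "watermelon": 0, "apple": 0, "peach": 0}
print(combine_ids(2, ungrouped))
```

The first call reproduces the 21/22 numbering of the example; the second reproduces the case in which the 0 is deleted and every word keeps only the parent number 2.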

【００５６】As shown in FIG. 3, the word and identification-number pairs obtained by the above operations represent a systematized set of groups, formed by grouping words that are close in semantic distance and then subdividing those groups.

【００５７】After the operation is completed for all words, the thesaurus generation unit 7 outputs the word and identification-number pairs grouped by the above operation to the thesaurus output destination received from the control unit 1, and notifies the control unit 1 that processing has ended.

【００５８】As described above, according to the first embodiment, the morphological analysis unit reads the input sentence from its read source, decomposes it into words, and identifies the part of speech of each word; the co-occurrence extraction unit extracts co-occurrences; the co-occurrence storage unit stores the co-occurrences and the appearance frequency for each co-occurrence data type; the semantic distance calculation unit calculates the semantic distance between words from the co-occurrences and those appearance frequencies; and the thesaurus generation unit groups the words based on this semantic distance and outputs the word and identification-number pairs to the thesaurus output destination. A thesaurus can therefore be obtained automatically from source material.

【００５９】(Second Embodiment) The functional configuration of the thesaurus creation device in the second embodiment, shown in FIG. 4, is that of the first embodiment with a frequency management unit 5 added. Parts identical or corresponding to those in FIG. 1 are given the same reference numerals. The units other than the co-occurrence storage unit 4 and the semantic distance calculation unit 6 function in the same way as in the first embodiment, so their description is omitted.

【００６０】The co-occurrence storage unit 4 stores the co-occurrences extracted by the co-occurrence extraction unit 3, together with their appearance frequencies, in a storage device, and informs the frequency management unit 5 whenever an appearance frequency changes. When the storage device approaches its capacity limit, the unit refers to information from the frequency management unit 5 and deletes co-occurrences with low appearance frequency from the storage device.

【００６１】The frequency management unit 5 divides appearance frequencies into 16 layers (1, 2, 3, 4, 5, 6, 7, 8, 9, 10 to less than 15, 15 to less than 20, 20 to less than 25, 25 to less than 30, 30 to less than 40, 40 to less than 50, and 50 or more) and keeps track of how many records are recorded in each layer.
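
The 16-layer partition of appearance frequencies can be expressed as a lookup over the lower bound of each layer. The sketch below is illustrative only: the names `LAYER_LOWER_BOUNDS` and `frequency_layer` are hypothetical, and the 0-based layer index is a convention chosen for the example.

```python
from bisect import bisect_right

# Lower bound of each of the 16 frequency layers listed above:
# 1..9 individually, then [10,15), [15,20), [20,25), [25,30), [30,40), [40,50), 50+.
LAYER_LOWER_BOUNDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50]


def frequency_layer(freq: int) -> int:
    """Return the 0-based layer index for an appearance frequency >= 1."""
    if freq < 1:
        raise ValueError("appearance frequency must be at least 1")
    # bisect_right counts how many lower bounds are <= freq.
    return bisect_right(LAYER_LOWER_BOUNDS, freq) - 1


for f in (1, 9, 10, 14, 15, 49, 50):
    print(f, frequency_layer(f))
```

Keeping the bounds in a sorted list makes it straightforward to change the boundaries or the number of layers without touching the lookup logic.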

【００６２】The semantic distance calculation unit 6 uses a quantification method to calculate, from the appearance frequency of each co-occurrence, the semantic closeness of the words forming that co-occurrence, based on information from the co-occurrence storage unit 4 and the frequency management unit 5.

【００６３】The operation of the thesaurus creation device in this embodiment is described below.

【００６４】On receiving the read source of the character string that is the input data, the control unit 1 sends that read source to the morphological analysis unit 2 and starts the series of processes from the morphological analysis unit 2 through the frequency management unit 5 for that character string.

【００６５】The morphological analysis unit 2 and the co-occurrence extraction unit 3 operate in the same way as in the first embodiment, so their description is omitted.

【００６６】The co-occurrence storage unit 4 searches the storage device for the co-occurrence input from the co-occurrence extraction unit 3. If it is already stored, the appearance frequency of the retrieved record is incremented by one and the record is updated. If it is not stored, the input co-occurrence is stored in the storage device as a new record with an appearance frequency of 1, and the available capacity of the storage device is checked. When the available capacity falls below a predetermined first reference amount, the co-occurrence storage unit 4 obtains the frequency distribution of the co-occurrence records from the frequency management unit 5 and calculates up to which frequency, starting from frequency 1, records must be deleted to raise the available capacity to at least a predetermined second reference amount (second reference amount > first reference amount). Based on the result, it deletes the co-occurrence records of those frequencies from the storage device and then takes in the next record. In addition, each time a co-occurrence record in the storage device is added, updated, or deleted, the co-occurrence storage unit 4 sends the change information to the frequency management unit 5.
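
The cutoff calculation in this paragraph can be sketched from the per-layer record counts alone. The sketch makes several assumptions that the text does not state: all records are taken to occupy the same `record_size`, deletion is done in whole layers starting from frequency 1, and the function and parameter names are hypothetical.

```python
def pruning_cutoff(layer_counts, free_now, target_free, record_size):
    """Return the index of the highest layer that must be deleted so that
    the free space reaches `target_free` (the second reference amount).

    Layers are scanned from the lowest frequency upward. Returns None if
    even deleting every record would not free enough space. The caller is
    expected to invoke this only when free_now is below the first
    reference amount, so at least one layer is always considered.
    """
    freed = 0
    for layer, count in enumerate(layer_counts):
        freed += count * record_size
        if free_now + freed >= target_free:
            return layer
    return None


# 100 records with frequency 1, 40 with frequency 2, 10 with frequency 3.
counts = [100, 40, 10] + [0] * 13
print(pruning_cutoff(counts, free_now=50, target_free=1500, record_size=10))
```

Because the second reference amount is larger than the first, a single pruning pass leaves some headroom before the next pass is triggered.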

【００６７】When the frequency management unit 5 receives change information (addition, update, or deletion) from the co-occurrence storage unit 4, it manages how many co-occurrence records are recorded in each of the 16 appearance-frequency layers (1, 2, 3, 4, 5, 6, 7, 8, 9, 10 to less than 15, 15 to less than 20, 20 to less than 25, 25 to less than 30, 30 to less than 40, 40 to less than 50, and 50 or more). When referenced by the co-occurrence storage unit 4, it sends the number of records in each layer to the co-occurrence storage unit 4. When all input data has been processed, the frequency management unit 5 notifies the control unit 1 that processing has ended.
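
The bookkeeping performed by the frequency management unit can be sketched as a small histogram class. This is an assumed design for illustration: the class and method names are hypothetical, and an update is modelled as a frequency going from `old_freq` to `old_freq + 1`, as in the increment described for the co-occurrence storage unit.

```python
from bisect import bisect_right


class FrequencyManager:
    """Tracks how many co-occurrence records fall into each frequency layer."""

    # Lower bounds of the 16 layers: 1..9, 10-14, 15-19, 20-24, 25-29, 30-39, 40-49, 50+.
    BOUNDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50]

    def __init__(self):
        self.counts = [0] * len(self.BOUNDS)

    def _layer(self, freq):
        return bisect_right(self.BOUNDS, freq) - 1

    def on_add(self):
        """A new record always starts with frequency 1."""
        self.counts[self._layer(1)] += 1

    def on_update(self, old_freq):
        """A record's frequency was incremented from old_freq to old_freq + 1."""
        self.counts[self._layer(old_freq)] -= 1
        self.counts[self._layer(old_freq + 1)] += 1

    def on_delete(self, freq):
        """A record with the given frequency was removed."""
        self.counts[self._layer(freq)] -= 1


fm = FrequencyManager()
fm.on_add()
fm.on_add()
fm.on_update(1)   # one record moves from frequency 1 to frequency 2
print(fm.counts[:3])
```

Only the per-layer counts are kept, so the memory used by this bookkeeping stays constant regardless of how many co-occurrence records are stored.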

【００６８】On receiving the output destination of the thesaurus that is the output data, the control unit 1 starts the series of processes from the semantic distance calculation unit 6 through the thesaurus generation unit 7 and, at the same time, sends the thesaurus output destination to the thesaurus generation unit 7.

【００６９】The semantic distance calculation unit 6 refers to information from the co-occurrence storage unit 4 and the frequency management unit 5, calculates the semantic distance Dij using the same calculation method as in the first embodiment, and sends a list whose elements are triples of word i, word j, and semantic distance Dij to the thesaurus generation unit 7. Except that it refers to information from the co-occurrence storage unit 4 in addition to the frequency management unit 5 when calculating the semantic distance, the semantic distance calculation unit 6 operates in the same way as in the first embodiment, so its description is omitted.

【００７０】The operation of the thesaurus generation unit 7 is exactly the same as in the first embodiment, so its description is omitted.

【００７１】As described above, according to the second embodiment, the frequency management unit manages the distribution of co-occurrence appearance frequencies, and the co-occurrence storage unit deletes information on co-occurrences with low appearance frequency. This reduces the storage capacity needed to accumulate co-occurrences and enables high-speed processing.

【００７２】Accordingly, the thesaurus creation device can be realized even on a low-speed computer system with little storage capacity.

【００７３】(Other Embodiments) Some examples of other embodiments in which part of the first and/or second embodiment is modified are given in (a) to (d) below.

【００７４】(a) The data sent from the semantic distance calculation unit 6 to the thesaurus generation unit 7 may be limited to data with a small semantic distance. In this case, faster processing can be expected.

【００７５】(b) When calculating the semantic distance, the semantic distance calculation unit 6 may exclude a co-occurrence from the calculation if the appearance frequency of that co-occurrence data type is less than a predetermined ratio of the total appearance frequency over all co-occurrence data types. Alternatively, co-occurrences whose appearance frequency is below a predetermined value may be excluded from the calculation. In this way, an appropriate thesaurus for the frequent words in a specific document or material can be obtained even on a device with relatively limited resources such as storage capacity and computation speed.
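
The two exclusion rules in (b) can be sketched as a single filter. This is an illustrative sketch only: the function name, the parameter names, and the representation of co-occurrence data as a dictionary from (noun, verb) pairs to appearance frequencies are all assumptions, and the sample frequencies are made up for the example.

```python
def filter_cooccurrences(freqs, min_ratio=None, min_count=None):
    """Drop co-occurrences that are too rare to be used in the distance calculation.

    freqs:     mapping from a co-occurrence (e.g. a (noun, verb) pair) to its frequency
    min_ratio: if given, drop entries whose share of the total frequency is below it
    min_count: if given, drop entries whose absolute frequency is below it
    """
    total = sum(freqs.values())
    kept = {}
    for pair, n in freqs.items():
        if min_ratio is not None and total > 0 and n / total < min_ratio:
            continue
        if min_count is not None and n < min_count:
            continue
        kept[pair] = n
    return kept


# Made-up sample data for illustration.
sample = {("melon", "eat"): 8, ("apple", "eat"): 1, ("peach", "peel"): 1}
print(filter_cooccurrences(sample, min_ratio=0.2))
print(filter_cooccurrences(sample, min_count=1))
```

Either threshold may be used on its own, matching the "alternatively" in the text; passing both applies them together.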

【００７６】(c) Although extraction of noun-verb co-occurrences has been described, noun-noun co-occurrences, noun-adjective co-occurrences, and the like may also be extracted and processed.

【００７７】(d) A group has been defined as one in which the semantic distance Dab between a word a and a word b in the same group always satisfies Dab < Dac with respect to the semantic distance Dac between a and a word c outside the group. However, groups may be defined by other statistical methods.

【００７８】In the second embodiment, appearance frequencies are divided into 16 layers (1, 2, 3, 4, 5, 6, 7, 8, 9, 10 to less than 15, 15 to less than 20, 20 to less than 25, 25 to less than 30, 30 to less than 40, 40 to less than 50, and 50 or more), and the number of co-occurrence records in each layer is managed. However, the boundaries between layers and the number of layers can be set arbitrarily, and may be varied according to the amount of input text.

【００７９】Furthermore, although the first and second embodiments process Japanese co-occurrence data, the invention can also be applied to other languages.

【００８０】[0080]

[Effects of the Invention] As described above, according to the present invention, the semantic distance between constituent words is obtained from co-occurrence data extracted from source material, and words are grouped based on this semantic distance. A thesaurus can therefore be obtained automatically, and natural language processing can be made more appropriate by performing semantic processing of the material using a thesaurus created in advance from that same material.

[Brief description of the drawings]

FIG. 1 is a functional configuration diagram showing the functions of the thesaurus creation device according to the first embodiment.

FIG. 2 is a flowchart of the grouping process according to the first embodiment.

FIG. 3 is a conceptual diagram of the thesaurus according to the first embodiment.

FIG. 4 is a functional configuration diagram showing the functions of the thesaurus creation device according to the second embodiment.

[Explanation of symbols]

2: Morphological analysis unit, 3: Co-occurrence extraction unit, 4: Co-occurrence storage unit, 5: Frequency management unit, 6: Semantic distance calculation unit, 7: Thesaurus generation unit

Claims

[Claims]

1. A thesaurus creation device that analyzes an input sentence, decomposes the input sentence into words, identifies the part of speech of each word, and groups the words to create a thesaurus, the device comprising: co-occurrence extraction and storage means for extracting co-occurrence data from the resulting information and storing the appearance frequency for each co-occurrence data type together with the co-occurrence data; semantic distance calculation means for calculating, from the appearance frequency for each co-occurrence data type stored in the co-occurrence extraction and storage means, semantic distance information between first constituent words forming the co-occurrence data and/or between second constituent words forming the co-occurrence data; and thesaurus generation means for grouping the first constituent words and/or the second constituent words based on the semantic distance information.

2. The thesaurus creation device according to claim 1, further comprising frequency management means for managing distribution information on the appearance frequency for each co-occurrence data type stored in the co-occurrence extraction and storage means, wherein, when a certain condition is satisfied, the co-occurrence extraction and storage means deletes information on co-occurrence data with a low appearance frequency based on information obtained from the frequency management means.