JP2000507008A

JP2000507008A - Systems, software and methods for locating information in a collection of text-based information sources

Info

Publication number: JP2000507008A
Application number: JP9532080A
Authority: JP
Inventors: レヴィ，ユヴァル; マルグリス，ハイム; アラッド，アイアリス
Original assignee: フレア・テクノロジーズ・リミテッド
Priority date: 1996-04-04
Filing date: 1997-04-04
Publication date: 2000-06-06
Also published as: CA2250694A1; EP0934569A2; WO1997038376A3; WO1997038376A2

Abstract

(57)【要約】テキスト・ベース型情報ソースのコレクションに含まれる情報を処理するシステムは、入力単語の連想的で且つ言語学的拡張を用い、そこにおいて連想的拡張が最初に実行され、そして関連した形態論的及び音声学的規則に従って同時的言語学的拡張が続く。システムは、各言語におけるテキストの大きな本体を分析することにより処理されるべき各言語に対して言語学的知識ベースを自動的に発生し更新する。システムはまた、探索されるべきテキスト・ベース型情報ソースのコレクションを自動的にインデックスする。大きな柔軟性、高精度及び低ノイズ出力を与える二次元（２Ｄ）拡張マトリックスを用いて、サポートされた言語における単語又は用語を拡張するための方法が提供される。２Ｄ拡張マトリックスは、シソーラス、セーブされた問合せのデータベース及び他の連想された情報ソースを利用する連想的次元であって、単語が意味及び関係により他の単語と関係付けられる当該連想的次元と、認識文法を利用する言語的次元であって、単語が形態論的及び音声学的バリエーションに対する組合された規則により他の単語と関連付けられる当該言語学的次元とを含む。

(57) Abstract: A system for processing information contained in a collection of text-based information sources uses associative and linguistic expansion of input words, in which the associative expansion is performed first and is followed by a simultaneous linguistic expansion according to the relevant morphological and phonetic rules. The system automatically generates and updates a linguistic knowledge base for each language to be processed by analyzing a large body of text in that language. The system also automatically indexes the collection of text-based information sources to be searched. A method is provided for expanding words or terms in the supported languages using a two-dimensional (2D) expansion matrix that provides great flexibility, high precision and low-noise output. The 2D expansion matrix includes an associative dimension, which uses a thesaurus, a database of saved queries and other associated information sources, and in which a word is related to other words by meaning and relationship; and a linguistic dimension, which uses a recognition grammar, and in which a word is related to other words by combined rules for morphological and phonetic variation.

Description

【発明の詳細な説明】テキスト・ベース型情報ソースのコレクションの中の情報を捜し出すためのシステム、ソフトウエア及び方法背景１．発明の分野本発明は一般に情報検索の分野に関する。詳細には、本発明は、テキスト・ベース型情報ソースのコレクションにおいて、ユーザ入力の問合せに関連した情報を見つけるための情報管理システム及び計算機言語システムに関する。２．関連技術の説明情報時代において、巨大な容量の情報を効率的に管理し必要とされる情報を迅速に見つける能力は、全ての人間の努力における推進力となった。情報管理システムの開発の初期においては、大きな容量の自由形式のテキスト・ドキュメント及び他のテキスト・ベース型情報ソースを処理する能力は、著しく制限された。従って、情報の専門家は、データが受け取られ、記億され、参照され得る仕方を厳しく制御することに基づいて種々のタイプのデータベース管理システム及び探索システムを開発した。しかしながら、そのようなシステムにより処理されねばならない情報の容量及び性質が拡大したので、通常のデータベース管理システムは後れを取らないようにすることができなくなった。通常のデータベース管理システムにおいては、データは厳しく構造化された環境の中に記憶されている。そのようなシステムは、例えば、レコードのテーブル又はスプレッドシート・モデルに基づき得る。そのようなシステムは、フラットであり得るか、又はデータベースのレコードが相互に連想する仕方に関して合理的であり得る。しかしながら、通常のデータベース管理システムは、一般に、そこにおいて１つ又は複数のフィールドが探索可能であり得る、即ちキーフィールドである構造化されたレコードを必要とする。更に、そのようなキーフィールドは用語、例えば数体系、ラベル等を、既知の問合せ値、即ち番号、ラベル等の組合わせにより探索を容易にする一貫した要領で用いることが望ましい。一般のテキスト・ベース型情報ソース内の情報を捜し出すため、いわゆるフルテキスト探索が開発された。コンピュータ・システムに記憶されている英語言語のドキュメントのようなテキスト・ベース型情報ソースのコレクション（collec tion）のフルテキスト探索は、ユーザが関連のドキュメントにおいて用いられていることが既知である用語を含む問合せを書き込むことを可能にする。ドキュメントのコレクションは、最初フルにインデックスされ、インデックスにおけるドキュメントの単語は問合せの用語と比較される。このタイプのシステムの最も単純な形式においては、問合せ用語とインデックス・エントリとの正確な一致が、関連のドキュメントを識別するため見つけられねばならない。つづりの間違い、単語の異形等は、全ての関連のドキュメントを見つけるのを妨げがちである。ワイルドカード化（wild-carding）と呼ばれる技術を用いてこの問題を部分的に多少解決するため用いられ得るが、ワイルドカード化を用いた場合、「ノイズ」と呼ばれる多くの無関係なドキュメントがしばしば出てくる。ワイルドカード化の使用の例は、ユーザ問合せ用語が“ｃｏｍｐｕｔｅ（計算する）”、“ｃｏｍｐｕｔｅｒ(コンピュータ)”、“ｃｏｍｐｕｔｉｎｇ(計算すること)”、“ｃｏｍｐｕｔａｔｉｏｎ(計算)”等の概念に対する“ｃｏｍｐｕｔ^*”のような関連の用語の単語語幹（ここで“^*”は除外された用語の部分を示す。）であることを識別したもののみを含む。高度化（sophistication）のより高いレベルを有する最新且つ通常のフルテキスト探索システムが開発されてきた。例えば、Ｇ．ピンカス(Ｐｉｎｋａｎｓ，Ｇ.)の修士論文「ＮａｔｕｒａｌＬａｎｇｕａｇｅＦｕｌｌ−ＴｅｘｔＲｅｔｒｉｅｖａｌＳｙｓｔｅｍ」(イスラエル大学、１９８５年)は、単純なワイルドカード化より一層ノイズ・フリーに追加の関連用語を含むようユーザの問合せを自動的に拡張するシステムを開示する。このピンカスのシステムは、（１）問合せ単語及びブール演算子から構成されたユーザ問合せを受け取り、（２）該問合せを言語学的に、即ち形態論的及び音声学の情報の事前処理されたデータベースを参照することにより拡張し、（３）該問合せを連想的に、即ち連想的副問合せ（sub-query）のデータベースを参照することにより拡張し、（４）上記ステップ（２）及び（３）の結果を併合する。形態論的拡張は、問合せの用語の挿入辞に到達し、一方音声学的拡張は母音を誤ってつづることにより発生され得る用語に到達する(例えば、ｒｅｃｉｅｖｅ→ｒｅｃｅｉｖｅ)。連想的拡張は、（例えば、頭字語“ＵＳＡ”からそのフルの言い方を連想するため）特定の問合せ一単語と連想付けされている副問合せの形でユーザにより予め定義されたように
問合せに関連した用語に到達する。そして、人は、単語“ＵＳＡ”と、ブール代数の“ＡＮＤ”演算子を次の４つの単語、即ち、１単語距離の付近に制限された、“Ｕｎｉｔｅｄ”、“Ｓｔａｔｅｓ”、“ｏｆ”、“Ａｍｅｒｉｃａ” に適用する問合せとの間の連想を作る。こうして、包括的な拡張された問合せが、ユーザの元の問合せに概念的に関連し得る多くの異なる単語をカバーするため発生される。実行されるべき形態論的拡張のレベル及び音声学的拡張のレベルにおける幾つかのバリエーションは、拡張パラメータの選択によりユーザに対して利用可能である。しかしながら、この形態論的及び音声学的拡張のプロセスは、多くの不効率を被る。それは、形態論的語幹（stems）及び音素のような異なる「単語語幹(word -bases)」間の基本的相違を認識するのに失敗する。従って、それは、双方のメカニズムにより影響される多くの関連の言語学的並べ換え（linguisticpermutat ions）を見落とし、同時に、それは、双方のメカニズムを組み合わせる組合わせの効果に起因した、大きな量のノイズ、即ち偽りの肯定（false-positive）を発生する。更に、このプロセスはまた、全く、単一の単語を認識して拡張することに制限され、そして、連想的拡張と言語学的拡張との間の相互作用さえ、全く、双方の結果のつまらない併合に制限され、従って相互のフィードバックを許す概念的な基礎を共有しない(例えば、問合せ−単語“ａｉｒｐｌａｎｅ”は、(“ａｉｒｐｌａｎｅ”、又は“ａｉｒｐｌａｎｅｓ”、又は“ａｉｒｃｒａｆｔ”) へ拡張するが、しかし“ａｉｒｃｒａｆｔｓ”へ拡張しない。)。通常のシステムにおいて、問合せの拡張は、処理されるべき言語の専門家により開発された１組の言語学的規則に依存する。入力言語の出来る限り多い特性がいずれのテキスト・ベース型情報ソースを処理する前に説明されねばならなかったので、当該１組の言語学的規則は大量で且つ相対的に柔軟性がなかった。処理されるべき各言語に対する言語学的規則の開発は、非常に労働集約的な且つ多くの時間を必要とする仕事であった。最後に、テキスト・ソースの手動のインデックシングを必要とする通常のシステム、並びにテキスト・ソースを自動的にインデックスする通常のシステムは既知である。通常のインデックスは、テキスト・ソースの中に見つけられた単語を、該単語が見つけられた位置に単純にマップする。手動のフルテキスト・インデックシングは、極端に時間を消費し、且つ誤りが生じがちである。キーワード・インデックシングは、主観的であり、またやや誤りが生じがちである。本発明の概要従って、従来技術に関する前述の問題を解決することが本発明の一般的目的である。従来技術の問題を解決する本発明の局面は、少なくとも、テキスト・ベース型情報ソースのコレクションに含まれる情報を処理するためのシステム、ソフトウエア及び方法を含む。当該システムは、コンピュータ又はデータ・プロセッサと、該コンピュータ又はデータ・プロセッサにより指定された順序で実行されたとき所望の情報処理のタスクを実行する１つ又は複数のソフトウエア・モジュール、ユニット又は機能として構造化されたソフトウエアとを含む。１つ又は複数のソフトウエア・モジュール、ユニット又は機能が、そのようなライブラリを知っている要領で書かれるソフトウエア・プログラムにより参照され得るコンパイル時間かランタイム・ライブラリ・エントリかのいずれかの通常の要領で利用可能にされ得る。本発明は更に、問合せ−概念を処理し、大きな柔軟性、高精度及び低ノイズ出力を供給する拡張マトリックスを用いて、当該処理された問合せ−概念を拡張され／改善された問合せに変換する。本発明の一局面に従って、テキスト・ベース型情報ソースのコレクションを受け取る入力を有し、言語学的知識ベースを生成する自動言語学的知識ベース発生器と、テキスト・ベース型情報ソースのコレクション及び言語学的知識ベースを受け取る入力を有し、当該受け取られたテキスト・ベース型情報のインデックスを生成し、更に、インデックス発生器に対する入力を表し且つインデックスと言語学的知識ベースとの間の相関を維持するため言語学的知識ベースを更新する前記インデックス発生器と、オペレータにより作られた問合せ、言語学的知識ベース、インデックス及びシソーラスを受け取る入力を有し、問合せに関連したテキスト・ベース型情報ソースのコレクションにおける位置のリストを生成する問合せプロセッサとを備えるテキスト・ベース型情報処理システムが提供され得る。該テキスト・ベース型情報処理システムは、非常に多くの修正及び変更を受けやすい。例えば、自動言語学的知識ベース発生器、自動イン
デックス発生器及び問合せプロセッサは種々の方法で具体化され得る。本発明の別の局面に従って、テキストベース型情報処理システムにおいて、自動言語学的知識ベース発生器は、用語の入力ストリームを受け取り、個々の用語を生成するパーサと、前記パーサからの個々の用語を受け取るよう接続され、個々の用語の各々が属する言語を示す出力を生成する言語認識器と、個々の用語を受け取るよう、更に前記言語認識器の出力により示された言語に対する言語学的規則を受け取るよう接続され、正規化された用語を生成する正規化器と、適法の個々の用語を受け取るよう接続され、前記言語学的知識ベースに記憶されるエントリを生成する言語学的拡張器とを備え得る。本発明の更に別の局面に従って、テキストベース型情報処理システムにおいて、自動インデックス器は、用語の入力ストリームを受け取り、個々の用語を生成するパーサと、前記パーサからの個々の用語を受け取るよう接続され、個々の用語の各々が属する言語を示す出力を生成する言語認識器と、個々の用語を受け取るよう、更に前記言語認識器の出力により示された言語に対する言語学的規則を受け取るよう接続され、正規化された用語を生成する正規化器と、適法の個々の用語を受け取るよう接続され、用語が先にインデックスされていなかったときインデックスに記憶されたエントリを生成し、また用語が先にインデックスされてしまったとき既存のインデックス・エントリを修正するインデックス・エントリ発生器とを備え得る。最後に、本発明の更に別の局面に従って、テキストベース型情報処理システムにおいて、言語における用語を拡張するための拡張装置は、用語を受け取る入力を有し、且つ当該用語と、連想的拡張器がシソーラスを参照することにより見つけられた少なくとも１つの関連した用語とを表す出力を有する前記連想的拡張器と、前記連想的拡張器の出力に接続された入力を有し、且つ前記言語学的拡張器の入力と、前記言語学的拡張器の入力に言語学的に関連し且つ言語のための言語学的知識ベースを参照することにより見つけられた少なくとも１つの用語とを表す出力を有する前記言語学的拡張器とを備え得る。上記に記載した正規化器は２つのユニットから構成され得る。第１の正規化器ユニットは、個々の用語及び言語学的規則を受け取るよう接続され、違法の文字（illegal characters）が除去された用語を生成し得て、そして、第２の正規化器ユニットは、違法の文字が除去された用語及び言語学的規則を受け取るよう接続され得て、違法の文字が除去されてしまった用語に言語学的規則を適用することにより見つけられた単語語幹を含む正規化された用語を生成する。本発明は、添付の図面と関係して、本発明の少なくとも１つの例示的実施形態の詳細な説明を読むことにより一層良く理解されるであろう。図面の簡単な説明図面において、類似の参照番号は類似の構成要素を示す。図１は、本発明が実施され得るコンピュータ又はデータ処理システムの概略ブロック図である。図２は、図２のメモリの概略ブロック図である。図３は、自動言語学的知識ベース発生のフローチャートである。図４は、自動インデックス発生のフローチャートである。図５は、問合せ拡張のフローチャートである。図６は、図３から図５に図示された特徴を含む情報検索システムのフローチャートである。詳細な説明以下の詳細な説明を一層良く理解するため、以下の定義に言及する。この説明においては、「言語」は、音表象の意味（symbolic meaning）を有する、トークンのいずれの組織化された体系であると見なされる。テキスト・ベース型情報システムにより扱われる最も共通のタイプの言語は単語又は単語の組合わせ、即ち用語から構成される人間の自然言語であり、それは人間により特定の意味を有するよう理解されるので、便宜上、トークンは、以下で「単語(words)」又は「用語(terms)」と称される。従って、用語「単語」及び「用語」は、単語フレーズが実際そのサブユニットから独立した意味を有するトークンであり得るそれらの事例における単語フレーズと、単語又は単語フレーズが特定の文脈／意義（impo rtance）を有するそれらの事例におけるキーワードと、頭字語及び切り詰めたもの（short-cut）のような人工的単語とを包含することを意図する。「単語語幹(w ordbase)」は、単語が用いられる文脈に対して適切に単語の語根の意味又はことば（speech）の部分を修正する単語の全ての接頭辞及び接尾辞を除去した後に残る単語の基体部分である。本明細書に用いられる用語「シソーラス(thesaurus) 
」は、用語、単語及び／又は単語語幹のデータベースを参照し、そこにおいて各用語、単語又は単語語幹は、データベースの中で、形態論的近接、音声学的類似性、類似の意味(類義語)、ほぼ反対の意味(反意語)、より広い意味、より狭い意味、特定の文脈における関連した用語等の定義された関係を有する他の用語、単語及び単語語幹を連想する。データベースは、そこに記憶されている用語、単語及び／又は単語語幹に基づいて誘導（navigate）され又は探索され得る。ここで考慮される言語は、単語がうけ得る形態論的及び音声学的バリエーションに対する既知の言語学的規則を有する。例えば、言語の形態論的規則は、複数が単語の形を変えることにより即ち英語では最後の“ｓ”を加えることにより単数の名詞から形成される仕方を定義し得る。これに対して、音声学的規則は、ユーザのつづり間違いから生じるつづりにおける共通のバリエーションを表し得る。テーブル、ファイル又はデータベースは、そのような言語学的規則のリストを保持するためソフトウエア・プログラムにより用いられ得る。一般に、言語はまた、言語の言語学的規則に従わない単語を含む。例えば、動詞の過去時制を発生するための英語言語の形態論的規則は、動詞“ｔｏｇｏ” に適用しないで、それはばかげた“ｇｏｅｄ”とは違う“ｗｅｎｔ”となる。従って、規則に対する例外は、ソフトウエア・プログラムにより例外のテーブルの中に保持され得て、そのため該規則に従わない単語は、規則に従う単語と同じように正確に処理され得る。本発明の文脈においては、特に自然言語において、しかしまた一般的に言語において、意味を生成する単語語幹のバリエーションの効率的で適応可能で且つ柔軟な表示を生成するため、言語学的規則と、例外又は不規則な形式の１つ又は複数のテーブルとをテキストの情報の大きな本体に適用することにより「言語学的知識ベース」が開発されている。「言語学的知識ベース」は、単語語幹及び関連する単語のテーブル、リスト又はデータベースである。関連する単語は、言語に対する言語学的規則の下で分析されるとき同じ単語語幹を有すると決定されたそれらの単語である。本発明は、コンピュータ・システム及びデータ処理システムの文脈において構成される。そのようなシステムの全体像は、図１のブロック図と関係して与えられる。コンピュータ・システム又はデータ処理システムは、一般に、プロセッサ１０１、メモリ１０３、１つ又は複数の入力装置１０５及び１つ又は複数の出力装置１０７を含み、これら全ては相互接続機構１０９を介して相互接続されている。この基本計画の多くの変形が可能である。例えば、実行可能なシステムは、入力装置１０５及び出力装置１０７がなく、外部装置（図示せず）によりメモリ１０３との対話を通して全体的に通信し得る。また、分散コンピュータ・システム及びデータ処理システムは、この基本計画内に入るように企図される。相互接続機構１０９は、パーソナル・コンピュータの内部システム・バスであってもよく、またインターネットであってもよく、それらを介してプロセッサ１０１は遠隔のメモリ１０３に記憶されているデータベースと対話する。他の変形は当業者には明らかであろう。メモリ１０３は、この説明に有用な２つのカテゴリ、即ち長期間メモリ（これはまた不揮発性メモリと呼ばれる。）と、短期間メモリ（これはまた揮発性メモリと呼ばれる。）とに分類される。これら２つのタイプのメモリは、図２に示されるように、コンピュータ・システム及びデータ処理システムにおいてしばしば用いられている。集積回路ランダム・アクセス・メモリ（ＲＡＭ）のような揮発性メモリ２０１は、そのような揮発性メモリ２０１が最も容易に実現される技術が速いプロセッサ１０１をサポートするのに望ましいような速いアクセス時間を生じるので、プロセッサ１０１に物理的に密接に近接してしばしば用いられる。不揮発性メモリ２０３は、類似の容量の揮発性メモリより安価に構成することができるので、大量のデータを長期間記憶しておくためしばしば用いられる。不揮発性メモリ２０３は、磁気又は光学ディスク又はテープ記憶装置としてしばしば実行され、該記億装置は、異なるコンピュータ又はデータ処理システム間でのデータ及びソフトウエア・プログラムの交換の更なる利点を提供する。それとして、不揮発性メモリ２０３は、命令を表す信号が記録されるソフトウエア・プロダクト・ディスクであり得て、該命令は、プロセッサ１０１により実行されるときコンピュータ又はデータ処理システムが特定の目的の機能を実行するようにさせる。本発明のソフトウエアを具現化したものが、製造業者による販売のため、保存のため、揮発性メモリ２０１を介するプロセッサ１０１によるアクセスのた
め等のためそのような不揮発性メモリ２０３に記録され得る。本発明の種々の局面に従って、テキスト・ベース型情報ソースのコレクションにわたり探索し情報を捜し出すためのシステムが構成され得る。本発明の種々の局面に従って、言語学的知識ベースが最初に発生される。次いで、テキスト・ベース型情報ソースのコレクションがインデックスされる。ユーザは、次に、捜し求める情報を定義する問合せを入力する。該問合せは、シソーラス及び言語学的知識ベースを用いて、選択された連想的及び言語学的な規則に従って拡張される。最後に、情報は、種々の拡張された問合せ用語と一致する情報が識別される。シソーラス、言語学的知識ベース及びインデックスは、１つ又は複数のコンピュータ・ファイルに記憶され得て、該コンピュータ・ファイルに対してシステムはメモリ１０３を介するアクセスを有する。言語学的知識ベースの自動発生、インデックスの自動発生、及び問合せ拡張と関係する本発明の局面が次に詳細に記載される。Ｉ．言語学的知識ベースの自動発生本発明の一局面に従って、適切なデータ処理システムにより実行されるときテキスト・ベース型情報ソースの入力本体から言語学的知識ベースを自動的に発生する図３に示されるようなソフトウエアが提供される。例えば、本発明のこの局面に従って、英語言語のドキュメントのコレクションが、英語言語の言語学的知識ベースを発生するため処理され得る。言語、例えば英語のための言語学的規則に対する例外のリスト３０２を含む言語学的規則３０１の小さい組（集合）が、最初に、テキスト・ベース型情報の大きい本体の統計的分析により発生される。この小さい組の規則は以下のものを含む。即ち、・言語における不規則な単語及び単語語幹のリスト、即ち前述した例外のリスト・言語における適法の文字（legal characters）文字、即ち言語のアルファベット、及び、言語における適法の文宇の位置、例えば単語内の特定の位置のみに現れることができる文字に関する特別の規則を指定する単語正規化テーブル・言語における適法の接頭辞及び適法の接尾辞を指定する接頭辞及び接尾辞リスト・言語における通常の単語及び適正な名前の双方に対する文字対音声（letter -to-sound）規則次いで、例外のリスト３０２を含むこの言語学的規則の組３０１を用いて、テキスト・ベース型情報ソースの本体３０３を分析し、テキスト・ベース型情報ソースの本体３０３から特に適合した言語学的知識ベース３０５を発生する。テキスト・ベース型情報ソースの本体３０３は、例えば、将来の問合せがなされるのを予想される開発の努力（endeavor）の特定の分野からのソースであるよう選択され得る。これは、言語学的知識ベースが開発の努力のその特定の分野の特定のもの（specifics）に一層良く対処することができることをもたらす。それから言語学的知識ベース３０５が導出されるテキスト・ベース型情報ソースの本体３０３は、最終的に探索されることになるソースの同じ本体でなくてもよい。しかしながら、探索されるべきテキスト・ベース型情報ソースの本体から言語学的知識ベース３０５を自動的に発生することは、そのように生成された言語学的知識ベース３０５が探索されるべきテキスト・ベース型情報ソースの本体に特に良く適合されるという利点を有する。言語学的知識ベース３０５の自動発生は、以下のとおり進行する。テキスト・ベース型情報ソースの本体３０３は、システムに対するテキストの入力ストリーム３０４を形成する。この入力ストリーム３０４は、最初に、固定の単語認識規則か、１つ又は複数の言語に対して特有の単語認識規則かのいずれかに従って単語及び用語３０７に分解（parse）される。次いで、該入力ストリームから分解された単語の各々の言語は３０９で認識される。一旦単語の言語が認識されてしまうと、単語は３１１で言語に対する言語学的規則３０１に従って正規化される。既知の不規則単語が既に不規則単語３０２のリストにあり、従って更に処理する必要がないので、不規則単語もまたこの時点で認識され得る。システムはまた、潜在的な新しい不規則単語として、あるルールベースの判断基準に合うそれらの単語を識別し得る。それらの先に未知の不規則単語は、それらが不規則単語のリストに追加されるべきかの決定のため人間のオペレータに対して識別され得る。規則的な単語は、言語学的知識ベース３０５に追加される前に３１３で言語学的に拡張され、それにより単語語幹はテキスト・ベース型情報ソースの本体３０３からの関連した単語のリストと共に言語学的知識ベース３０５に記憶される。言語学的拡張３１３が以下に一層詳細に説明される。入力ストリーム３０４を文及び単語に分解するステップ３０７
は、以下の疑似コードに従って生じる。正規化は次のとおり実行される。正規化は、がらくた文字（garbage characters ）を入力ストリーム３０４の単語から識別して除去する。最後に、新しいキーが、言語学的知識ベース３０５に以下の手順により加えられる。上記の疑似コードにおいて示されたような分析を受けた２つの有効な認識タイプは形態論的であるし音声学的である。記載される実施形態の形態論的分析器は、以下の手順に従って機能する。形態論的分析器は、入力単語のため識別された言語における有効な接頭辞及び接尾辞のリストを受け取る。音声学的分析器は、文字対音声規則に基づいて各里語を単語の音声学的表示に変換する。類似の又は同じ音声学的表示を有する単語同士は、それらの音声学的形態論により関連していると見なされ得る。上記のプロセスが最初に提示されたテキスト・ベース型情報ソースの本体３０３に対して完了してしまうとき、テキスト・ベース型情報ソースの本体３０３におけるテキストの言語に対する言語学的知識ベース３０５は、自動的に発生されてしまう。新しいテキスト・ベース型情報ソースがシステムに加えられるとき、それらはまた上記のように処理される。こうして、新しい情報の言語学的知識ベース３０５への追加により、並びに新しい情報に従って言語学的知識ベース３０５の中の個々のエントリの内容を訂正する言語訂正機構により、新しいソースは言語学的知識ベース３０５の知識及び正確さを増大させる。言語訂正機構を具体化する学習手順は次のとおりである。システムが言語学的知識ベース３０５と新しく提示されたテキスト・ソースとの間の不一致を検出したとき、影響を受けた単語語幹、及び関連した単語のリストは、自動的に、又は人間オペレータの指図で、新しく提示された情報及び上記の手順に従って再分析され更新され得る。こうして、システムは、処理された各々の言語について常に学習し、影響を受けた言語学的知識ベースを更新する。ＩＩ．言語学的知識ベースと相関したインデックスの自動発生言語学的知識ベース３０５に加えて、図４に示される本発明の別の局面に従った検索システムは、インデックス４０１を自動的に発生し、それによりテキスト・ベース型情報は、インデックス４０１を参照することにより見つけられ得る。言語学的知識ベース３０５を更新し、そのため言語学的知識ベース３０５の内容がテキスト・ベース型情報ソースの本体３０３に含まれる関連の用語を表し、従ってインデックス４０１と相関付けられることにより、インデックス４０１の自動発生が達成される。インデックス４０１は、テキスト・ベース型情報ソースの本体３０３に実際に見つけられる単語をテキスト・ベース型情報ソースの本体３０３内の位置と単純に関連付ける。該位置は階層的に定義されているのが好ましい。例えば、該位置は、ドキュメント番号、セクション番号、文番号及び位置番号により階層的に表され得る。他の階層的位置識別スキームが、当業者により適合されるのが分かるように用いられ得る。本発明の好適実施形態に従って、インデックス４０１は言語学的知識ベース３０５により支援される。インデックス４０１は、テキスト・ベース型情報ソースの本体３０３の中に実際に生じる単語及び用語のみを含む。言語学的知識ベース３０５は、テキスト・ベース型情報ソースの本体３０３の中に実際に生じる単語から導出された単語語幹を関連した単語のリストと関連付ける。以下で説明される検索の間、システムは、言語学的知識ベース３０５からのエントリを検索し、次いで該エントリは１つ又は複数のインデックス・エントリを参照するため用いられる。インデックス４０１の自動発生は次のとおり進行する。テキスト・ベース型情報ソースの本体３０３は、インデックシング(indexing)・サブシステムに対するテキストの入力ストリーム３０４を形成する。この入力ストリーム３０４は、最初に、単語認識規則に従って単語及び用語３０７に分解される。次いで、入力ストリームから分解された単語の各々の言語は、３０９で認識される。一旦単語の言語が３０９で認識されてしまうと、単語は、言語に対する言語学的規則３０１に従って３１１で正規化され得る。次いで、インデックス・エントリは、４０３で、新しく正規化された単語の各々に対して発生される。正規化された単語が既にエントリをインデックス４０１の中に有する場合、単語の現在の発生の位置が先のエントリに加えられる。上記のプロセスと実質的に同時に、言語学的知識ベース３０５は、インデックス４０１に生成された新しく且つ修正されたエントリと相関されるのを連続的に維持される。各正規化された単語は、その単語語幹４０５へと単語の言語の言語学的規則に従って縮小される。次いで、単語
語幹及び関連した単語は、存在しない場合、言語学的知識ベース・ファイル４０７に追加される。ユーザはまた、関連した単語が単語語幹の色々のタイプの拡張を含むことを明記（specify）する。拡張が含まれる場合、単語語幹の拡張が、単語語幹及び関連した単語を言語学的知識ベース・ファイル３０５に記億する前に実行される。テキスト・ベース型情報ソースの本体３０３のインデックシングが完了したとき、言語学的知識ベース３０５は、インデックス４０１と相関させられ、テキスト・ベース型情報ソースの本体３０３に含まれる関連の用語を表す。ＩＩＩ．問合せの拡張問合せの拡張は、図５に示される、本発明の第３の局面に従って実行される。問合せは２以上の単語又は用語を含み得るので、単語の認識は最初に上記のように実行される。単語認識のタスクにより識別された単語及び用語は更に正規化され得る。即ち、それらは、所望ならば、語幹形式に変換され得る。シソーラス及び言語学的規則を参照することにより、つづりの誤りは除去され、頭字語及び切り詰めたものの異なる語彙形式が認識され得る等々である。次いで、問合せにおいて各認識された単語が、２Ｄ拡張マトリックスを用いて拡張され得る。２Ｄ拡張マトリックスは、その中に入力単語が表され得る拡張空間を定義する一つの方法である。この空間の次元は、連想的で且つ言語学的である。連想的次元は、処理されるべき言語において単語／単語一語幹の意味に基づいている。本発明の記載される実施形態において、連想的次元は、単語及び用語をそれらの類義語、より広い用語、より狭い用語及び他の関係と関連付ける１つ又は複数のシソーラス５０１により定義される。各シソーラス５０１は、概念的に関連した用語を加えた用語のデータベースを含む。シソーラスは用語により探索可能である。こうして、各シソーラス・エントリは、探索可能な用語のリストであるエントリ・キーを含む。各エントリ・キーは、類義語、より広い用語、より狭い用語、関連した用語、反意語等のようなエントリ・キーに概念的に関連付けられた１つ又は複数の用語を各エントリ・キーと連想付けせしめる。連想のいずれの１つ又は複数のカテゴリを含むかは任意である。更に、各エントリ用語は、通常の辞書定義及び使用ガイド、並びにその中にエントリ・キーが必要なとき変換され得る問合せストリングを各エントリ用語と任意に連想付けせしめる。こうして、シソーラスはエントリのリストであり、そこにおいて各エントリは実質的に次のとおりの構造を有する。即ち、・キーワード：(自然言語フレーズ又は用語の形で)エントリ・キーとして用いられる。・記述(任意)：キーワードの意味及び使用の記述（百科事典におけるのと同様）・問合せ：必要なときキーワードが変換される基礎をなすフルテキスト問合せ言語における完全な問合せステートメント（任意）(例えば、キーワード“ＵＳＡ"→問合せ“ＵｎｉｔｅｄＡＮＤＳｔａｔｅｓＡＮＤｏｆＡＮＤＡｍｅｒｉｃａ"）キーワードの完全な問合せステートメントへの変換が明示的に供給されない場合、デフォルト変換がキーワードに付与される。・関係・類義語：概念又は記述子を備えるキーワードと類義語のキーワードのリスト・より広い用語・より狭い用語・連想・その他これらの特徴の全ては、連想的拡張が所望の効果を有しているかを決定するためオペレータにより用いられ得る。この拡張空間の言語学的次元は、処理されるべき言語の言語学的知識ベース３０５に基づいている。前述したように、言語学的知識ベースは、テキスト・ベース型情報ソースの実際のコーパス（corpora）から、手作りの言語辞書とは無関係に、且つ「適法の」又は「適正な」単語に制限されずに、自動的に作られる。本発明のこの実施形態において、形態論及び音声学の言語学的拡張の文法がサポートされている。拡張タスクは、２Ｄ拡張を実質的に２つの主ステップで実行する。最初に、連想的拡張が、ステップ５０３で実行され、そこにおいて入力問合せ５０５の各入力単語が、入力単語に対する関連を定義せしめた単語を含む単語のリスト５０７に拡張される。連想された単語は、シソーラス５０１を参照することにより見つけられる。単語の拡張されたリスト５０７は、入力で言語学的拡張５０９が形態論的次元及び音声学的次元の双方において同時に実行される当該入力となる。形態論的及び音声学的拡張は、言語学的知識ベース３０５を参照することにより制御される。言語学的拡張５０９は、ユーザにより供給される拡張パラメータ５１１により制御され、形態論的次元及び音声学的次元の双方がその次元の拡張無しからその次元のフル拡張までの範囲を変動する程度の形態論的拡張及び音声学的拡張を含む。形態論的拡張及び音声学的拡張を単一の言語学的拡張ステ
ップ５０９として実行することにより、形態論及び音声学に対する拡張方法は、理性的に関連付けられ得る。拡張次元同士間の関係は、言語のための言語学的知識ベース３０５において定義される。こうして、形態論的拡張のための規則は、入力単語又は拡張された結果の音声学的特性に応じて変化する形態論的バリエーションを定義し得る。その結果、より小さいノイズしか拡張された出力に発生しない。それは、形態論的次元及び音声学的次元を単一の言語学的平面として関連させることは、言語学的規則の全体性の下で音声学的に許容できない形態論的異形を排除し、そしてその逆の関係であるからである。ＩＶ．完全なテキスト検索システム前述したソフトウエアを用いて、検索システムは、図６に示されるように構成され得て、該検索システムは、テキスト・ベース型情報ソースのコレクション内の情報を効率的に且つ正確に捜し出すことができる。簡単には、そのようなシステムは、テキスト・ベース型情報ソースの１つ又は複数のコレクション３０３ａ及び３０３ｂへのアクセスを与えられる。少なくとも１つのグループのテキスト・ベース型情報ソース３０３ａは、自動言語学的知識ベース発生ソフトウエア６０１に供給され、該自動言語学的知識ベース発生ソフトウエア６０１は、前述したように言語学的知識ベース３０５を発生する。テキスト・ベース型情報ソース３０３ｂは、インデックシング・サブシステム６０３に供給され、該インデックシング・サブシステム６０３は、テキスト・ベース型情報ソース３０３ｂの中の単語のインデックス４０１を生成し、該インデックス４０１においてインデックス４０１内の各エントリは、前述したように、単語と、上記コレクションの中の単語の位置との関係を定義する。１つ又は複数の言語のためシステムがシソーラス５０１及び言語学的知識ベース３０５を有する当該１つ又は複数の言語の中の正規化された単語を用いて、インデックス４０１が発生されるのが好ましい。インデックシング・サブシステム６０３は、システムによりサポートされた言語の１つに適合する形式を有する単語を認識するためのモジュールを含み得て、更に、システムによりサポートされた各言語に対する適切な正規化するモジュールを含み得る。単語は、それらの言語において前述したように正規化され、インデックス４０１に現れる異常なエントリの数を低減する。システムは更に、ユーザ問合せ５０５を、ユーザにより求められた情報を表している１つ又は複数の単語の形で受け取る。問合せ単語は、前述したように、２Ｄ拡張マトリックスを用いて６０５で拡張される。該問合せは、最初に、問合せ単語の言語に適切なシソーラスを参照することにより元の問合せ単語に関連した単語を含むよう連想的に拡張される。次いで、連想的に拡張された問合せは、形態論的次元と音声学的次元の双方において同時に言語学的に拡張される。各次元における拡張の程度は、ユーザにより問合せを供給されたパラメータ５１１によって指定される。拡張の程度は、例えば、ユーザにより拡張パラメータ５１１のチェックリストを各問合せ用語に添付することにより指定され得る。最後に、フルに拡張された問合せ６０７の用語は、６０９で、インデックス４０１の中のエントリと比較され、関連の位置６１１をテキスト・ベース型情報ソースのコレクション３０３ｂ内に見出す。テキスト・ベース型情報ソースのコレクション３０３ｂの中の関連の位置６１１は、元の問合せ用語のいずれかを必ずしも含まない。前述した処理により、見つけられた位置６１１は、元の問合せ用語と、連想的及び言語学的の拡張プロセスにより生成された関連した用語とのうちの１つを含むであろう。見つけられた位置６１１は、多くの「ノイズ」の位置を含まないであろう。それは、音声学的に無意味な結果を発生するため形態論的規則を適用する問題を避ける、またその逆の形態論的に無意味な結果を発生するため音声学的規則を適用する問題を避ける相互依存的に形態論的言語学的規則及び音声学的言語学的規則が同時に適用される要領で、言語学的拡張プロセスが前述したように実行されるからである。前述したようなシステムにおいて、テキスト・ベース型情報ソースは、コンピュータ・システムに記憶されたテキスト・ドキュメントであり得る。この場合、インデックシング・システムが、ドキュメント番号、セクション番号、文番号及び文内の位置により位置を階層的に参照することは便利である。更に、自由にフォーマットされたドキュメントは、前述のシステムにより処理され得る。ある従来技術のシステムにおいてなされているように、特定の方法でドキュメントを構造化すること、又は分類又はキーワードを手動で生成する必要はない。それは、本発明のシステムは、単語が発生する言語の
規則に従って、単語をインデックスし、問合せを操作するからである。フレーズが特定の言語において単一の単語又は用語として取り扱われることが望まれる場合、そのフレーズは、そのようにシソーラスにおいて概念的エントリとして定義され得る。他の全ての点で、単語としてそのように定義されたフレーズは、単純に言語における単語として取り扱われる。しかしながら、受け入れられたキーワードの長いリストを宣言することは不必要である。それは、インデックシングのプロセス及び問合せ拡張は、求められた情報を合理的に表すユーザ問合せに対して正確で相対的にノイズのない一致を発生するからである。こうして本発明の少なくとも１つの例示的実施形態を記載したが、種々の代替、修正及び改良が当業者には容易に生じるであろう。そのような代替、修正及び改良は、本発明の精神及び範囲内にあることを意図するものである。従って、前述の記載は、例示のみであり、制限として意図されたものではない。本発明は、請求の範囲及びその均等物において定義されるようにのみ制限されるものである。

DETAILED DESCRIPTION OF THE INVENTION

Systems, software and methods for locating information in a collection of text-based information sources

Background

1. FIELD OF THE INVENTION

The present invention relates generally to the field of information retrieval. In particular, it relates to information management systems and computational linguistics systems for finding information relevant to a user-entered query in a collection of text-based information sources.

2. Description of the Related Art

In the information age, the ability to manage enormous volumes of information efficiently and to find needed information quickly has become a driving force in every field of human endeavor. Early in the development of information management systems, the ability to handle large volumes of free-form text documents and other text-based information sources was severely limited. Information professionals therefore developed various types of database management and search systems based on tight control over the way data could be received, stored and referenced. However, as the volume and nature of the information that such systems must process have expanded, conventional database management systems have been unable to keep up.

In a conventional database management system, data is stored in a tightly structured environment. Such a system may be based, for example, on a table of records or on a spreadsheet model.
Such a system may be flat, or it may be relational with respect to how the records of the database relate to one another. However, conventional database management systems generally require structured records in which one or more fields are searchable, that is, key fields. Further, such key fields should use terms (for example, numbering schemes or labels) in a consistent manner that facilitates searching by combinations of known query values, such as numbers or labels.

So-called full-text search was developed to locate information within general text-based information sources. A full-text search of a collection of text-based information sources, such as English-language documents stored in a computer system, allows a user to write a query containing terms that are known to be used in the relevant documents. The collection of documents is first fully indexed, and the words of the documents in the index are then compared with the terms of the query. In the simplest form of this type of system, an exact match between a query term and an index entry must be found in order to identify a relevant document. Misspellings, word variants and the like tend to prevent all of the relevant documents from being found. A technique called wildcarding can partially remedy this problem, but wildcarding often returns many irrelevant documents, referred to as "noise". In one example of wildcarding, the user's query term contains only the identified word stem of the related terms, such as "comput*" for the concepts "compute", "computer", "computing" and "computation" (where "*" marks the omitted part of the term).

More recent conventional full-text search systems with a higher level of sophistication have also been developed. For example, the master's thesis "Natural Language Full-Text Retrieval System" (Pinkans, G., University of Israel, 1985) is directed to having the user's query include additional related terms with less noise than simple wildcarding.
That thesis discloses a system for automatically expanding a user's query. The Pinkans system (1) receives a user query composed of query words and Boolean operators; (2) expands the query linguistically, that is, by reference to a preprocessed database of morphological and phonetic information; (3) expands the query associatively, that is, by reference to a database of associative sub-queries; and (4) merges the results of steps (2) and (3). Morphological expansion reaches the inflections of a query term, while phonetic expansion reaches terms that can arise from misspelling vowels (for example, recieve → receive). Associative expansion reaches terms related to the query as predefined by the user in the form of a sub-query associated with a particular query word (for example, to associate the acronym "USA" with its full form). A person might thus create an association between the word "USA" and a query that applies the Boolean "AND" operator to the four words "United", "States", "of" and "America", restricted to a proximity of one word distance. In this way, a comprehensive expanded query is generated to cover many different words that may be conceptually related to the user's original query. Some variation in the level of morphological expansion and the level of phonetic expansion to be performed is available to the user through the selection of expansion parameters.

However, this process of morphological and phonetic expansion suffers from a number of inefficiencies. It fails to recognize the fundamental differences between different kinds of "word-bases", such as morphological stems and phonemes. It therefore overlooks many relevant linguistic permutations that are affected by both mechanisms, while at the same time generating a large amount of noise, that is, false positives, owing to the combinatorial effect of combining both mechanisms.
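The four steps of the prior-art process described above can be sketched roughly as follows. This is only an illustrative sketch: the three lookup tables, their contents and the function names are assumptions made for the example, and the associative entry for "USA" is flattened to a plain set of words rather than a Boolean sub-query with a proximity limit.

```python
# Toy lookup tables. All entries are illustrative assumptions, not data from the thesis.
MORPHOLOGICAL = {"receive": {"receives", "received", "receiving"}}
PHONETIC = {"recieve": {"receive"}}  # misspelled vowels -> intended spelling
ASSOCIATIVE = {"USA": {"United", "States", "of", "America"}}


def expand_linguistically(word):
    """Step (2): consult the morphological and phonetic tables independently."""
    return MORPHOLOGICAL.get(word, set()) | PHONETIC.get(word, set())


def expand_associatively(word):
    """Step (3): consult the table of user-defined associations."""
    return ASSOCIATIVE.get(word, set())


def expand_query(words):
    """Step (4): merge both expansions of each word, with no feedback between them."""
    return {w: {w} | expand_linguistically(w) | expand_associatively(w) for w in words}


# Step (1) is reduced here to a plain list of query words (no Boolean operators).
expanded = expand_query(["recieve", "USA"])
```

In this toy setup the misspelling "recieve" reaches "receive" but none of its inflected forms, because each table is consulted only once and the results are simply merged. That is one way to picture the overlooked permutations mentioned above.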
Furthermore, this process is restricted entirely to recognizing and expanding single words, and even the interaction between associative and linguistic expansion is limited to a trivial merging of the two results. The two expansions therefore share no conceptual basis that would allow mutual feedback (for example, the query word "airplane" expands to "airplane", "airplanes" or "aircraft", but not to "aircrafts").

In conventional systems, query expansion relies on a set of linguistic rules developed by experts in the language to be processed. Because as many characteristics of the input language as possible had to be accounted for before any text-based information source was processed, the set of linguistic rules was large and relatively inflexible. Developing linguistic rules for each language to be processed was a very labor-intensive and time-consuming task.

Finally, conventional systems that require manual indexing of text sources, as well as conventional systems that index text sources automatically, are known. A conventional index simply maps the words found in a text source to the locations at which they were found. Manual full-text indexing is extremely time-consuming and error-prone. Keyword indexing is subjective and also somewhat error-prone.

Overview of the present invention

Accordingly, it is a general object of the present invention to solve the foregoing problems of the prior art. Aspects of the invention that solve those problems include, at least, systems, software and methods for processing information contained in a collection of text-based information sources. The system includes a computer or data processor, and software structured as one or more software modules, units or functions that perform the desired information processing tasks when executed by the computer or data processor in a specified order.
The one or more software modules, units or functions may be made available in any conventional manner, either as compile-time or as run-time library entries that can be referenced by a software program written with knowledge of such a library. The invention further processes a query-concept and transforms the processed query-concept into an expanded and improved query using an expansion matrix that provides great flexibility, high precision and low-noise output.

In accordance with one aspect of the invention, a text-based information processing system may be provided that comprises: an automatic linguistic knowledge base generator having an input that receives a collection of text-based information sources, and producing a linguistic knowledge base; an index generator having inputs that receive the collection of text-based information sources and the linguistic knowledge base, the index generator producing an index of the received text-based information and also updating the linguistic knowledge base to reflect the input to the index generator and to maintain a correlation between the index and the linguistic knowledge base; and a query processor having inputs that receive a query made by an operator, the linguistic knowledge base, the index and a thesaurus, the query processor producing a list of locations in the collection of text-based information sources that are relevant to the query. The text-based information processing system is susceptible to numerous modifications and variations. For example, the automatic linguistic knowledge base generator, the automatic index generator and the query processor may be embodied in various ways.

In accordance with another aspect of the invention, in a text-based information processing system, the automatic linguistic knowledge base generator may comprise a parser that receives an input stream of terms and produces individual terms.
The automatic linguistic knowledge base generator may further comprise: a language recognizer, connected to receive the individual terms from the parser, that produces an output indicating the language to which each individual term belongs; a normalizer, connected to receive the individual terms and the linguistic rules for the language indicated by the output of the language recognizer, that produces normalized terms; and a linguistic expander, connected to receive legal individual terms, that produces entries stored in the linguistic knowledge base.

In accordance with yet another aspect of the invention, in a text-based information processing system, the automatic indexer may comprise: a parser that receives an input stream of terms and produces individual terms; a language recognizer, connected to receive the individual terms from the parser, that produces an output indicating the language to which each individual term belongs; a normalizer, connected to receive the individual terms and the linguistic rules for the language indicated by the output of the language recognizer, that produces normalized terms; and an index entry generator, connected to receive legal individual terms, that produces an entry stored in the index when a term has not previously been indexed and modifies the existing index entry when the term has previously been indexed.

Finally, in accordance with yet another aspect of the invention, in a text-based information processing system, an expansion device for expanding a term in a language may comprise an associative expander having an input that receives the term and an output representing the term together with at least one related term that the associative expander has found by reference to a thesaurus.
The expansion device may further comprise a linguistic expander having an input connected to the output of the associative expander, and an output representing the input of the linguistic expander together with at least one term that is linguistically related to that input and is found by reference to a linguistic knowledge base for the language.

The normalizer described above may be composed of two units. A first normalizer unit may be connected to receive the individual terms and the linguistic rules, and may produce terms from which illegal characters have been removed. A second normalizer unit may be connected to receive the linguistic rules and the terms from which illegal characters have been removed, and produces normalized terms that include the word stems found by applying the linguistic rules to those terms.

The invention will be better understood from a reading of the detailed description of at least one exemplary embodiment of the invention in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES

In the drawings, like reference numerals denote like elements.
FIG. 1 is a schematic block diagram of a computer or data processing system in which the invention may be practiced.
FIG. 2 is a schematic block diagram of the memory of FIG. 1.
FIG. 3 is a flowchart of automatic linguistic knowledge base generation.
FIG. 4 is a flowchart of automatic index generation.
FIG. 5 is a flowchart of query expansion.
FIG. 6 is a flowchart of an information retrieval system that includes the features illustrated in FIGS. 3 to 5.

Detailed description

To aid understanding of the following detailed description, reference is made to the following definitions. In this description, a "language" is considered to be any organized system of tokens having symbolic meaning. The most common type of language handled by text-based information systems is human natural language, which is composed of words or combinations of words, that is, terms, which humans understand to have particular meanings. For convenience, tokens are therefore referred to below as "words" or "terms".
Accordingly, the terms "word" and "term" are intended to encompass word phrases, in those instances where a word phrase may itself be a token having a meaning independent of its subunits; keywords, in those instances where a word or word phrase has a particular context or importance; and artificial words such as acronyms and short-cuts. A "word stem" (word-base) is the base part of a word that remains after removing all of the prefixes and suffixes that modify the word's root meaning or part of speech to suit the context in which the word is used. As used herein, the term "thesaurus" refers to a database of terms, words and/or word stems in which each term, word or word stem is associated with other terms, words and word stems that have a defined relationship to it, such as morphological proximity, phonetic similarity, similar meaning (synonyms), nearly opposite meaning (antonyms), broader meaning, narrower meaning, or relatedness within a particular context. The database can be navigated or searched on the basis of the terms, words and/or word stems stored in it.

The languages considered here have known linguistic rules for the morphological and phonetic variations that words can undergo. For example, a morphological rule of a language may define how a plural is formed from a singular noun by changing the form of the word, which in English means adding a final "s". A phonetic rule, by contrast, may represent common variations in spelling that result from users' misspellings. A table, file or database may be used by a software program to hold a list of such linguistic rules.

In general, a language also includes words that do not follow its linguistic rules. For example, the English morphological rule for forming the past tense of a verb does not apply to the verb "to go", whose past tense is "went" rather than the nonsensical "goed".
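As a minimal sketch of how regular morphological rules and irregular forms might coexist in software, the fragment below consults a small table of irregular forms before applying a regular rule. The table contents, the function names and the deliberately crude rules are assumptions made for illustration; they are not the rule set of the described system.

```python
# Hypothetical table of irregular past-tense forms (illustrative entries only).
IRREGULAR_PAST = {"go": "went", "be": "was", "have": "had"}


def past_tense(verb):
    """Return a past-tense form, checking the irregular table before the regular rule."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"  # e.g. "compute" -> "computed"
    return verb + "ed"     # crude regular rule; ignores consonant doubling and the like


def plural(noun):
    """Apply the simplest regular English plural rule: append a final 's'."""
    return noun + "s"
```

With these toy tables, `past_tense("go")` yields "went" instead of "goed", while `past_tense("walk")` falls through to the regular rule.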
Thus, exceptions to the rules can be kept by a software program in a table of exceptions, so that words that do not follow the rules can be handled just as reliably as words that do.

In the context of the present invention, a "linguistic knowledge base" is developed by applying linguistic rules and one or more tables of exceptions or irregular forms to a large body of textual information, in order to produce an efficient, adaptable and flexible representation of the meaning-bearing variations of word stems, in languages generally and in natural languages in particular. A "linguistic knowledge base" is a table, list or database of word stems and related words. Related words are those words that have been determined to have the same word stem when analyzed under the linguistic rules for the language.

The invention is configured in the context of a computer system or data processing system. An overview of such a system is given in connection with the block diagram of FIG. A computer system or data processing system generally includes a processor 101, a memory 103, one or more input devices 105, and one or more output devices 107, all of which are interconnected via an interconnection mechanism 109. Many variants of this general scheme are possible. For example, a viable system may have no input device 105 and no output device 107, and communicate entirely through interaction of an external device (not shown) with memory 103. Distributed computer systems and data processing systems are also contemplated to fall within this general scheme. Interconnection mechanism 109 may be the internal system bus of a personal computer, or may be the Internet, through which processor 101 interacts with a database stored in a remote memory 103. Other variations will be apparent to those skilled in the art.

Memory 103 falls into two categories useful for this description: long-term memory (also referred to as non-volatile memory) and short-term memory (also referred to as volatile memory).
These two types of memory are often used together in computer systems and data processing systems, as shown in FIG. Volatile memory 201, such as integrated circuit random access memory (RAM), provides fast access times and is therefore the memory most readily used to support a fast processor 101. For that reason it is often placed in close physical proximity to the processor 101. Since non-volatile memory 203 can be built at a lower cost than volatile memory of similar capacity, non-volatile memory 203 is often used for storing large amounts of data for long periods of time. Non-volatile memory 203 is often implemented as magnetic or optical disk or tape storage, which provides the additional benefit that data and software programs can be exchanged between different computers or data processing systems. As such, non-volatile memory 203 can be a software product disk on which are recorded signals representing instructions that, when executed by processor 101, cause a computer or data processing system to function as a special purpose computer. An implementation of the software of the present invention may be recorded in such non-volatile memory 203 for sale by a manufacturer, for storage, for access by processor 101 via volatile memory 201, and the like.

In accordance with various aspects of the present invention, a system for searching and locating information across a collection of text-based information sources can be configured as follows. A linguistic knowledge base is first generated. The collection of text-based information sources is then indexed. The user then enters a query that defines the information sought. The query is expanded according to selected associative and linguistic rules using a thesaurus and the linguistic knowledge base. Finally, information is identified that matches the various expanded query terms.
The thesaurus, linguistic knowledge base, and index may be stored in one or more computer files, to which the system has access via memory 103. Aspects of the present invention relating to automatic generation of linguistic knowledge bases, automatic generation of indexes, and query expansion will now be described in detail.

I. Automatic generation of linguistic knowledge base

In accordance with one aspect of the present invention, there is provided software, as shown in FIG. 3, which when executed by a suitable data processing system automatically generates a linguistic knowledge base from an input body of text-based information sources. For example, according to this aspect of the invention, a collection of English language documents may be processed to generate an English language linguistic knowledge base.

A small set of linguistic rules 301, including a list of exceptions 302 to the linguistic rules for the language, e.g., English, is first generated by statistical analysis of a large body of text-based information. This small set of rules includes:
• a list of irregular words and word stems in the language, i.e., the list of exceptions mentioned above;
• a word normalization table that specifies the legal characters in the language, i.e., the alphabet of the language, and special rules on where legal characters may be placed, e.g., characters that can appear only at specific positions within words;
• a prefix and suffix list that specifies the legal prefixes and legal suffixes in the language; and
• letter-to-sound rules for both words and legal names.

This set of linguistic rules 301, including the list of exceptions 302, is used to analyze the body 303 of text-based information sources. A specially adapted linguistic knowledge base 305 is then generated from the body 303 of text-based information sources.
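One way to picture the small rule set 301 is as a per-language record holding the four components listed above. The following is a minimal sketch under stated assumptions: the class name `LinguisticRules`, its field names, and the sample English entries are all hypothetical, and the position-specific character rules of the normalization table are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class LinguisticRules:
    """Hypothetical container for the rule set 301 of one language."""
    language: str
    exceptions: dict[str, str] = field(default_factory=dict)       # irregular word -> stem (list 302)
    legal_chars: frozenset[str] = frozenset()                        # alphabet of the language
    prefixes: tuple[str, ...] = ()                                   # legal prefixes
    suffixes: tuple[str, ...] = ()                                   # legal suffixes
    letter_to_sound: dict[str, str] = field(default_factory=dict)   # letters -> sound symbol


# Sample data, invented for illustration.
ENGLISH = LinguisticRules(
    language="en",
    exceptions={"went": "go", "mice": "mouse"},
    legal_chars=frozenset("abcdefghijklmnopqrstuvwxyz'-"),
    prefixes=("un", "re"),
    suffixes=("ing", "ed", "s"),
    letter_to_sound={"ph": "f", "ck": "k"},
)


def is_legal(word: str, rules: LinguisticRules) -> bool:
    """True if the word is non-empty and uses only legal characters."""
    return bool(word) and all(c in rules.legal_chars for c in word)
```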
The body 303 of text-based information sources may be selected, for example, to consist of sources from a particular field of endeavor about which queries are expected in the future. The resulting linguistic knowledge base is then better able to address the specifics of that particular field. The body 303 of text-based information sources from which the linguistic knowledge base 305 is derived need not be the same body of sources that will ultimately be searched. However, automatically generating the linguistic knowledge base 305 from the body of text-based information sources to be searched has the advantage that the linguistic knowledge base 305 so generated is particularly well adapted to that body of sources.

Automatic generation of the linguistic knowledge base 305 proceeds as follows. The body 303 of text-based information sources forms an input stream 304 of text for the system. This input stream 304 is first parsed at 307 into words and terms, according to either fixed word recognition rules or word recognition rules specific to one or more languages. The language of each word parsed from the input stream is then recognized at 309. Once the language of a word has been recognized, the word is normalized at 311 according to the linguistic rules 301 for that language. Irregular words can also be recognized at this point: known irregular words are already in the list of irregular words 302 and thus need not be processed further. The system may also identify words that meet certain rule-based criteria as potential new irregular words. These previously unknown irregular words can be presented to a human operator, who decides whether they should be added to the irregular word list.
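The sequence of parsing (307), language recognition (309) and normalization (311) can be sketched as a small pipeline. This is an illustration under stated assumptions, not the procedure of the described embodiment: the rule data, the whitespace-based parser, and the very crude language guess are all hypothetical simplifications.

```python
import re

# Hypothetical per-language rule data; contents are illustrative assumptions.
RULES = {
    "en": {"alphabet": set("abcdefghijklmnopqrstuvwxyz"), "irregular": {"went", "mice"}},
    "de": {"alphabet": set("abcdefghijklmnopqrstuvwxyzäöüß"), "irregular": {"ging"}},
}


def parse(stream: str) -> list[str]:
    """Step 307: split the input stream into candidate words (fixed rule: whitespace)."""
    return stream.split()


def recognize_language(word: str) -> str:
    """Step 309: crude guess; German-specific letters imply German, otherwise English."""
    return "de" if re.search("[äöüß]", word.lower()) else "en"


def normalize(word: str, lang: str) -> str:
    """Step 311: lowercase the word and drop characters outside the language's alphabet."""
    return "".join(c for c in word.lower() if c in RULES[lang]["alphabet"])


def process(stream: str) -> list[tuple[str, str, bool]]:
    """Return (language, normalized word, is_known_irregular) for each word."""
    results = []
    for raw in parse(stream):
        lang = recognize_language(raw)
        word = normalize(raw, lang)
        if word:  # skip tokens that consisted only of garbage characters
            results.append((lang, word, word in RULES[lang]["irregular"]))
    return results
```

Words flagged as known irregular forms would skip further rule-based processing, in line with the description above.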
The regular words are linguistically expanded at 313 before being added to the linguistic knowledge base 305, so that word stems are stored in the linguistic knowledge base 305 together with a list of related words from the body 303 of text-based information sources. Linguistic expansion 313 is described in more detail below.

The step 307 of parsing the input stream 304 into sentences and words occurs according to the following pseudo code.

[pseudo code not reproduced]

Normalization is performed as follows. Normalization identifies and removes garbage characters from words in the input stream 304. Finally, a new key is added to the linguistic knowledge base 305 by the following procedure.

[pseudo code not reproduced]

The two valid recognition types analyzed as shown in the pseudo code above are morphological and phonetic. The morphological analyzer of the described embodiment functions according to the following procedure. The morphological analyzer receives a list of the valid prefixes and suffixes in the language identified for the input word. The phonetic analyzer converts each word into a phonetic representation of the word based on the letter-to-sound rules. Words having similar or identical phonetic representations may be considered related by their phonetic form.

When the above process has been completed for the body 303 of text-based information sources originally presented, the linguistic knowledge base 305 for the language of the text in that body 303 has been automatically generated. When new text-based information sources are added to the system, they are processed in the same way. Thus, new sources increase the knowledge and accuracy of the linguistic knowledge base 305, both by adding new information to it and through a linguistic correction mechanism that corrects the contents of individual entries in the linguistic knowledge base 305 according to the new information.
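As a rough illustration of the two analyzers, the sketch below strips one valid prefix and one valid suffix to obtain a stem, and rewrites letters into a toy phonetic key. It is not the pseudo code of the described embodiment; the affix lists, the letter-to-sound pairs, the minimum stem length, and all function names are assumptions.

```python
from collections import defaultdict

PREFIXES = ("un", "re")                                     # hypothetical valid prefixes
SUFFIXES = ("ing", "ed", "s")                               # hypothetical valid suffixes
LETTER_TO_SOUND = (("ph", "f"), ("ck", "k"), ("c", "k"))    # toy letter-to-sound rules


def stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, never shortening below min_len."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[: -len(suffix)]
            break
    return word


def phonetic_key(word: str) -> str:
    """Apply the letter-to-sound rules in order to get a comparable phonetic form."""
    for letters, sound in LETTER_TO_SOUND:
        word = word.replace(letters, sound)
    return word


def build_knowledge_base(words):
    """Group words by stem: each stem maps to the related words seen in the corpus."""
    kb = defaultdict(set)
    for word in words:
        kb[stem(word)].add(word)
    return kb
```

With these toy rules, "index", "indexed", "indexing" and "reindexed" all fall under the stem "index", and the misspelling "fonetik" receives the same phonetic key as "phonetic".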
The learning procedure that embodies the linguistic correction mechanism is as follows. When the system detects a discrepancy between the linguistic knowledge base 305 and a newly presented text source, the affected word stem and its list of related words can be re-analyzed and updated according to the newly presented information and the procedure described above, either automatically or as directed by a human operator. Thus, the system continually learns about each language processed and updates the affected linguistic knowledge base.

II. Automatic index generation correlated with linguistic knowledge base

In addition to the linguistic knowledge base 305, the search system according to another aspect of the present invention, shown in FIG. 4, automatically generates an index 401, so that text-based information can be found by referring to the index 401. As the index is automatically generated, the linguistic knowledge base 305 is updated so that its contents represent the relevant terms contained in the body 303 of text-based information sources and are thus correlated with the index 401.

The index 401 simply associates words that are actually found in the body 303 of text-based information sources with locations in that body 303. Preferably, the locations are defined hierarchically. For example, a location can be represented hierarchically by document number, section number, sentence number, and position number. Other hierarchical location identification schemes may be used, as will be appreciated by those skilled in the art.

In accordance with a preferred embodiment of the present invention, index 401 is supported by linguistic knowledge base 305. The index 401 contains only those words and terms that actually occur in the body 303 of text-based information sources.
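An index of this kind can be sketched as a mapping from each word to a list of hierarchical locations. The `Location` tuple, the nested-list document layout, and the whitespace tokenization are assumptions chosen to keep the example short; normalization of the words is omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    """Hierarchical location: document, section, sentence, position in sentence."""
    document: int
    section: int
    sentence: int
    position: int


def build_index(documents):
    """documents: list of documents, each a list of sections, each a list of sentence strings."""
    index = defaultdict(list)
    for d, sections in enumerate(documents, start=1):
        for s, sentences in enumerate(sections, start=1):
            for n, sentence in enumerate(sentences, start=1):
                for p, word in enumerate(sentence.lower().split(), start=1):
                    # A word seen again just gains another location in its entry.
                    index[word].append(Location(d, s, n, p))
    return index
```

Because only words that actually occur are ever inserted, the index holds no entry for a word absent from the sources.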
The linguistic knowledge base 305 associates word stems, derived from words that actually occur in the body 303 of text-based information sources, with a list of related words. During the search described below, the system retrieves an entry from the linguistic knowledge base 305, which is then used to reference one or more index entries.

Automatic generation of the index 401 proceeds as follows. The body 303 of text-based information sources forms an input stream 304 of text for the indexing subsystem. This input stream 304 is first parsed at 307 into words and terms according to word recognition rules. The language of each word parsed from the input stream is then recognized at 309. Once the language of a word has been recognized at 309, the word may be normalized at 311 according to the linguistic rules 301 for that language. An index entry is then generated at 403 for each newly normalized word. If the normalized word already has an entry in index 401, the location of the current occurrence of the word is added to the existing entry.

At substantially the same time as the above process, the linguistic knowledge base 305 is continuously kept correlated with the new and modified entries created in the index 401. Each normalized word is reduced at 405 to its word stem according to the linguistic rules of the language of the word. The word stem and related words are then added at 407 to the linguistic knowledge base file, if not already present. The user may also specify that the related words include various types of word stem expansions. If expansions are to be included, word stem expansion is performed before the word stems and related words are stored in the linguistic knowledge base file 305.

When the indexing of the body 303 of text-based information sources is completed, the linguistic knowledge base 305 is correlated with the index 401 and represents the relevant terms contained in the body 303 of text-based information sources.

III. Query expansion

Query expansion is performed according to the third aspect of the present invention, shown in FIG. Since the query may include more than one word or term, word recognition is first performed as described above. The words and terms identified by the word recognition task can be further normalized; that is, they can be converted to stem form if desired. By referring to the thesaurus and the linguistic rules, spelling errors can be eliminated, acronyms and other abbreviated or variant lexical forms can be recognized, and so on. Each recognized word in the query may then be expanded using a 2D expansion matrix.

A 2D expansion matrix is one way to define an expansion space in which input words can be represented. The dimensions of this space are associative and linguistic. The associative dimension is based on the meaning of the word or word stem in the language to be processed. In the described embodiment of the invention, the associative dimension is defined by one or more thesauruses 501 that associate words and terms with their synonyms, broader terms, narrower terms, and other relationships.

Each thesaurus 501 includes a database of terms together with conceptually related terms. The thesaurus is searchable by term. Thus, each thesaurus entry includes an entry key, and the entry keys form the list of searchable terms. Each entry associates with its entry key one or more terms conceptually related to that key, such as synonyms, broader terms, narrower terms, related terms, antonyms, etc. Which categories of association are included is optional. In addition, each entry may optionally associate with its entry key a conventional dictionary definition and usage guide, as well as a query string into which the entry key can be converted when needed. Thus, a thesaurus is a list of entries, where each entry has a structure substantially as follows.
• Keyword: used as the entry key (in the form of a natural language phrase or term).
• Description (optional): a description of the meaning and use of the keyword (as in an encyclopedia).
• Query (optional): a complete query statement in the underlying full-text query language, into which the keyword is translated when needed (e.g., keyword "USA" -> query "United AND States AND of AND America"). If the conversion of a keyword into a complete query statement is not explicitly provided, a default conversion is applied to the keyword.
• Relationships:
  • Synonyms: a list of keywords that are synonymous or nearly synonymous with the concept or descriptor
  • Broader terms
  • Narrower terms
  • Associations

All of these features can be used by an operator to decide whether associative expansion has the desired effect.

The linguistic dimension of this expansion space is based on the linguistic knowledge base 305 of the language to be processed. As mentioned above, the linguistic knowledge base is created automatically from the actual corpus of text-based information sources, independently of any hand-crafted dictionary of the language, and is not restricted to "legal" or "valid" words. In this embodiment of the invention, morphological and phonetic linguistic expansions are supported.

The expansion task performs the 2D expansion in essentially two main steps. First, an associative expansion is performed in step 503, in which each input word of the input query 505 is expanded into a list of words 507 containing words that have a defined association with the input word. The associated words are found by referring to the thesaurus 501. The expanded list of words 507 is then the input to the linguistic expansion 509, which is performed simultaneously in both the morphological and phonetic dimensions. Morphological and phonetic expansions are controlled by reference to the linguistic knowledge base 305.
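The two main steps can be sketched as two functions applied in sequence: associative expansion against a thesaurus, followed by linguistic expansion against a knowledge base. The data and function names below are hypothetical, and the knowledge base lookup is a simplification that stands in for the combined morphological and phonetic expansion; it does not model how the two dimensions constrain each other.

```python
# Hypothetical thesaurus 501 and linguistic knowledge base 305; contents are illustrative.
THESAURUS = {
    "car": {"synonyms": ["automobile"], "narrower": ["sedan"]},
}
KNOWLEDGE_BASE = {  # word stem -> related words seen in the corpus
    "car": {"car", "cars"},
    "automobile": {"automobile", "automobiles"},
}


def associative_expand(word, relations=("synonyms",)):
    """Step 503: the word itself plus thesaurus terms for the chosen relation types."""
    entry = THESAURUS.get(word, {})
    expanded = [word]
    for relation in relations:
        expanded.extend(entry.get(relation, []))
    return expanded


def linguistic_expand(words):
    """Step 509 (simplified): add the related words stored under each word's entry."""
    result = set()
    for word in words:
        result.add(word)
        result |= KNOWLEDGE_BASE.get(word, set())
    return result


def expand_query(word, relations=("synonyms",)):
    """Associative expansion first, then linguistic expansion of every resulting word."""
    return linguistic_expand(associative_expand(word, relations))
```

Running the associative step first means the linguistic step also produces variants of the associated words, for example "automobiles" from the query word "car".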
The linguistic expansion 509 is controlled by expansion parameters 511 supplied by the user, which set the degree of morphological and phonetic expansion, each ranging from no expansion in that dimension to full expansion in that dimension. By performing the morphological and phonetic expansions as a single linguistic expansion step 509, the expansion methods for morphology and phonetics can be intelligently linked. The relationships between the expansion dimensions are defined in the linguistic knowledge base 305 for the language. Thus, rules for morphological expansion may define morphological variations that differ depending on the phonetic properties of the input word or of the expanded result. As a result, less noise occurs in the expanded output. This is because linking the morphological and phonetic dimensions as a single linguistic plane eliminates morphological variants that are phonetically unacceptable under the linguistic rules as a whole, and vice versa.

IV. Complete text search system

Using the software described above, a search system can be configured as shown in FIG. 6 that efficiently and accurately finds information in a collection of text-based information sources. Briefly, such a system is provided with access to one or more collections 303a and 303b of text-based information sources. At least one collection of text-based information sources 303a is provided to automatic linguistic knowledge base generation software 601, which automatically generates the linguistic knowledge base 305 as described above. The text-based information sources 303b are provided to an indexing subsystem 603, which generates an index 401 of the words in the text-based information sources 303b, in which each entry defines the relationship between a word and the position of the word in the collection, as described above.
Preferably, the index 401 is generated using normalized words in the one or more languages for which the system has a thesaurus 501 and a linguistic knowledge base 305. The indexing subsystem 603 may include a module for recognizing words whose form conforms to one of the languages supported by the system, and may further include modules that perform the proper normalization for each supported language. The words are normalized in those languages as described above, reducing the number of unusual entries that appear in the index 401.

The system further receives the user query 505 in the form of one or more words representing the information sought by the user. The query words are expanded at 605 using a 2D expansion matrix, as described above. The query is first associatively expanded to include words related to the original query words, by referencing a thesaurus appropriate to the language of the query words. The associatively expanded query is then linguistically expanded in both the morphological and phonetic dimensions simultaneously. The degree of expansion in each dimension is specified by parameters 511 supplied with the query by the user, for example by the user attaching a checklist of expansion parameters 511 to each query term.

Finally, the terms of the fully expanded query 607 are compared at 609 with the entries in the index 401 to find the relevant locations 611 in the text-based information source collection 303b. A relevant location 611 does not necessarily include any of the original query terms. With the process described above, each location 611 found will include either one of the original query terms or one of the related terms generated by the associative and linguistic expansion process. The locations 611 found will not include many "noise" locations.
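The comparison at 609 amounts to looking up every term of the expanded query in the index and collecting the locations found. The sketch below assumes a plain dictionary index and an already expanded set of query terms; both are invented for illustration.

```python
def find_locations(expanded_terms, index):
    """Step 609: collect index locations for every expanded term, sorted, without duplicates."""
    hits = set()
    for term in expanded_terms:
        hits.update(index.get(term, ()))
    return sorted(hits)


# Hypothetical index 401: normalized word -> (document, section, sentence, position) tuples.
INDEX = {
    "automobiles": [(1, 1, 2, 3)],
    "cars": [(2, 1, 1, 5)],
    "boat": [(3, 1, 1, 1)],
}

# Suppose the query word "car" has already been expanded into these terms.
EXPANDED = {"car", "cars", "automobile", "automobiles"}
LOCATIONS = find_locations(EXPANDED, INDEX)
```

In this example neither location found contains the original query word "car"; both are reached through terms produced by the expansion, which matches the observation above that a relevant location need not include an original query term.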
Noise locations are largely avoided because the linguistic expansion process described above applies morphological and phonetic linguistic rules simultaneously and interdependently. This avoids both the problem of applying morphological rules that produce phonetically meaningless results and the converse problem of applying phonetic rules that produce morphologically meaningless results.

In such a system, the text-based information sources may be text documents stored on a computer system. In this case, it is convenient for the indexing system to refer to positions hierarchically by document number, section number, sentence number and position in the sentence. In addition, freely formatted documents can be processed by the system described above. There is no need to structure the documents in a particular way, or to manually generate classifications or keywords, as in some prior art systems. This is because the system of the present invention indexes words and manipulates queries according to the rules of the language in which the words occur. If it is desired that a phrase be treated as a single word or term in a particular language, the phrase may be so defined as a conceptual entry in a thesaurus. In all other respects, phrases so defined are simply treated as words in the language. However, it is unnecessary to declare a long list of accepted keywords, because the indexing process and query expansion produce accurate, relatively noise-free matches to user queries that reasonably represent the information sought.

Having thus described at least one exemplary embodiment of the invention, various alterations, modifications and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only and is not intended as limiting.
The invention is limited only as defined in the following claims and equivalents thereof.

[Written Amendment] Article 184-8, Paragraph 1 of the Patent Act
[Submission date] May 8, 1998 (1998.5.8)
[Details of amendment]

Claims

1. A text-based information processing system comprising:
a linguistic knowledge base that defines relationships between word stems and words according to a set of linguistic rules;
a thesaurus that defines conceptual relationships between different words; and
a query processor,
the query processor having:
an associative expander having an input for receiving a word, and an output representing the word and at least one associated word found by reference to the thesaurus; and
a linguistic expander having an input connected to the output of the associative expander, and an output representing the input of the linguistic expander and at least one word that is linguistically related to the input of the linguistic expander and found by reference to the linguistic knowledge base.

2. The system of claim 1, further comprising an automatic linguistic knowledge base generator having an input for receiving a collection of text-based information sources and generating the linguistic knowledge base.

3. The system of claim 2, wherein the automatic linguistic knowledge base generator further comprises:
a parser that receives an input stream of words and generates individual words;
a language recognizer connected to receive the individual words from the parser, and generating an output indicating the language to which each individual word belongs;
a normalizer connected to receive the individual words, further connected to receive linguistic rules for the language indicated by the output of the language recognizer, and generating normalized words; and
a linguistic expander having an input for receiving the normalized words, generating an output representing the input of the linguistic expander and at least one word linguistically related to the input of the linguistic expander, and updating the linguistic knowledge base.

4. The system of claim 3, wherein the linguistic expander further comprises:
a morphological word analyzer having an input for receiving words from the text-based information sources, and an output identifying the word stem, with prefixes and suffixes removed, that corresponds to the word at the input; and
a phonetic word analyzer having an input for receiving words from the text-based information sources, and an output identifying words having similar phonetic representations.

5. The system of claim 4, wherein the normalizer further comprises:
a first normalizer unit connected to receive individual words and linguistic rules, and generating words with illegal characters removed; and
a second normalizer unit connected to receive the words with illegal characters removed and the linguistic rules, and generating normalized words that include word stems found by applying the linguistic rules to the words with illegal characters removed.

6. The system of claim 1, further comprising an index generator having an input for receiving a collection of text-based information sources and the linguistic knowledge base, generating at an output an index of the received text-based information, and further updating the linguistic knowledge base so as to represent the input to the index generator and to maintain a correlation between the index and the linguistic knowledge base.

7. The system of claim 6, wherein the index generator further comprises:
a parser that receives an input stream of words of the text-based information sources and generates individual words;
a language recognizer connected to receive the individual words from the parser, and generating an output indicating the language to which each individual word belongs;
a normalizer connected to receive the individual words, further connected to receive linguistic rules for the language indicated by the output of the language recognizer, and generating normalized words; and
an entry updater having an output that updates index entries.

8. The system of claim 7, wherein the normalizer further comprises:
a first normalizer unit connected to receive individual words and linguistic rules, and generating words with illegal characters removed; and
a second normalizer unit connected to receive the words with illegal characters removed and the linguistic rules, and generating normalized terms that include word stems found by applying the linguistic rules to the words with illegal characters removed.

9. The system of claim 1, further comprising a comparator having an input for receiving the output of the linguistic expander, and an output identifying text-based information sources that match the output of the linguistic expander, as found by reference to an index.

10. The system of claim 9, wherein the output of the comparator identifies text-based information sources hierarchically.

11. The system of claim 1, wherein the automatic linguistic knowledge base generator further comprises:
a parser that receives an input stream of words of text-based information sources and generates individual words;
a language recognizer connected to receive the individual words from the parser, and generating an output indicating the language to which each individual word belongs;
a normalizer connected to receive the individual words, further connected to receive linguistic rules for the language indicated by the output of the language recognizer, and generating normalized words; and
a linguistic expander connected to receive the normalized words, and generating entries to be stored in the linguistic knowledge base.

12. The system of claim 11, wherein the normalizer further comprises:
a first normalizer unit connected to receive individual words and linguistic rules, and generating words with illegal characters removed; and
a second normalizer unit connected to receive the words with illegal characters removed and the linguistic rules, and generating normalized words that include word stems found by applying the linguistic rules to the words with illegal characters removed.

Continuation of front page
(81) Designated states: EP (AT, BE, CH, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), CA, JP

Claims

1. A text-based information processing system comprising:
   an automatic linguistic knowledge base generator having an input to receive a collection of text-based information sources, for generating a linguistic knowledge base;
   an index generator having inputs to receive the collection of text-based information sources and the linguistic knowledge base, which generates an index of the text-based information received at its inputs and also updates the linguistic knowledge base to maintain a correlation between the index and the linguistic knowledge base; and
   a query processor having inputs to receive operator-generated queries, the linguistic knowledge base, the index and a thesaurus, for generating a list of locations in the collection of text-based information sources that are relevant to the query.

2. In a text-based information processing system, an automatic linguistic knowledge base generator comprising:
   a parser that receives an input stream of terms and generates individual terms;
   a language recognizer connected to receive the individual terms from the parser, which produces an output indicating the language in which each individual term is expressed;
   a normalizer connected to receive the individual terms and further connected to receive linguistic rules for the language indicated by the output of the language recognizer, which generates normalized terms; and
   a linguistic expander connected to receive the normalized terms, which generates entries to be stored in the linguistic knowledge base.

3. The system of claim 2, wherein the normalizer further comprises:
   a first normalizer unit connected to receive the individual terms and the linguistic rules, which generates terms with illegal characters removed; and
   a second normalizer unit connected to receive the terms with illegal characters removed and the linguistic rules, which generates normalized terms including a word stem found by applying the linguistic rules to the terms from which the illegal characters have been removed.

4. In a text-based information processing system, an automatic indexing device comprising:
   a parser that receives an input stream of terms and generates individual terms;
   a language recognizer connected to receive the individual terms from the parser, which produces an output indicating the language in which each individual term is expressed;
   a normalizer connected to receive the individual terms and further connected to receive linguistic rules for the language indicated by the output of the language recognizer, which generates normalized terms; and
   an index generator having inputs to receive a collection of text-based information sources and a linguistic knowledge base, which generates an index of the text-based information received at its inputs and also updates the linguistic knowledge base to maintain a correlation between the index and the linguistic knowledge base.

5. The system of claim 4, wherein the normalizer further comprises:
   a first normalizer unit connected to receive the individual terms and the linguistic rules, which generates terms with illegal characters removed; and
   a second normalizer unit connected to receive the terms with illegal characters removed and the linguistic rules, which generates normalized terms including a word stem found by applying the linguistic rules to the terms from which the illegal characters have been removed.

6. In a text-based information processing system, an expansion device for expanding terms in a language, comprising:
   an associative expander having an input to receive a term and an output representing at least one associated term found by the associative expander by reference to a thesaurus; and
   a linguistic expander having an input connected to the output of the associative expander and an output representing at least one word linguistically related to the input of the linguistic expander.
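
The two-stage arrangement recited in claim 6, where the output of an associative expander feeds a linguistic expander, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the thesaurus entries, the suffix list standing in for morphological rules, and the function names `associative_expand`, `linguistic_expand` and `expand` are hypothetical and do not come from the specification.

```python
# Illustrative sketch only: toy data and names are hypothetical.

# Toy thesaurus standing in for the associative information source.
THESAURUS = {
    "car": ["automobile", "vehicle"],
}

# Toy suffix list standing in for morphological rules of one language.
SUFFIXES = ["", "s"]


def associative_expand(term, thesaurus=THESAURUS):
    """Return the term followed by its thesaurus associates."""
    return [term] + list(thesaurus.get(term, []))


def linguistic_expand(term, suffixes=SUFFIXES):
    """Return simple morphological variants of a single term."""
    return [term + suffix for suffix in suffixes]


def expand(term):
    """Run associative expansion first, then expand each result linguistically.

    The result maps each associated term (one axis) to its
    linguistic variants (the other axis).
    """
    return {
        associated: linguistic_expand(associated)
        for associated in associative_expand(term)
    }


if __name__ == "__main__":
    for associated, variants in expand("car").items():
        print(associated, variants)
```

Keeping the two stages as separate functions mirrors the claim structure: the linguistic stage only ever sees what the associative stage produced, so either stage could be replaced (for example, with a real thesaurus or a real rule set) without changing the other.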