JP2022069790A

JP2022069790A - Information processor, information processing method, and program

Info

Publication number: JP2022069790A
Application number: JP2020178643A
Authority: JP
Inventors: 雄介松田; Yusuke Matsuda; 直之福田; Naoyuki Fukuda
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2022-05-12
Anticipated expiration: 2040-10-26
Also published as: JP7670951B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing system, an information processor, an information processing method and a program which can improve retrieval accuracy so as to obtain a retrieval result close to the intention of a retrieval user, by using a category of document data and field information in calculation of a retrieval score. SOLUTION: In an information processing system (document retrieval system), a weight set for each field included in a document is stored for each category of document, and a score of the document is calculated based on the weight set for the field and the relation between the field and a retrieval word. SELECTED DRAWING: Figure 9

Description

本発明は、情報処理装置、情報処理方法、プログラムに関する。 The present invention relates to an information processing apparatus, an information processing method, and a program.

Conventional full-text search based only on word frequency does not consider the importance or meaning of words. As a result, a document that hits on an important but infrequent word may not appear near the top of the search results, and a document that hits on a word with the same surface form but a different nuance may appear in the search results.

特許文献１には、文書データのフィールド情報を検索スコアの計算に用いて、ユーザの検索意図に近い検索結果を得るための技術について開示されている。 Patent Document 1 discloses a technique for obtaining a search result close to the user's search intention by using the field information of the document data for the calculation of the search score.

Japanese Unexamined Patent Publication No. 2005-063468

特許文献１には、文書データのフィールド情報を検索スコアの計算に用いて、ユーザの検索意図に近い検索結果を得るための技術が記載されている。 Patent Document 1 describes a technique for obtaining a search result close to the user's search intention by using the field information of the document data for the calculation of the search score.

However, the user must input the score reflection ratio for each field every time a search is performed, which becomes cumbersome when the number of fields is large. In addition, content for which no field has been prepared in advance as document metadata is not stored in any field. Furthermore, since there is no concept of document categories, neither field extraction nor score calculation can be performed according to each category.

そのため、文書データにカテゴリ情報を付与し、カテゴリごとにフィールド抽出情報を定義することが望まれる。 Therefore, it is desirable to add category information to the document data and define field extraction information for each category.

そこで、本発明は、文書データのカテゴリとフィールド情報を検索スコアの計算に用いて、検索ユーザの意図に近い検索結果を得られるよう検索精度の向上を行うことを目的とする。 Therefore, an object of the present invention is to improve the search accuracy so that a search result close to the intention of the search user can be obtained by using the category and field information of the document data in the calculation of the search score.

The information processing system of the present invention comprises: storage means for storing, for each document category, a weight set for each field included in a document; and calculation means for calculating a score of the document based on the weight set for the field and the relationship between the field and a search term.

The information processing method of the present invention comprises: a storage step of storing, for each document category, a weight set for each field included in a document; and a calculation step of calculating a score of the document based on the weight set for the field and the relationship between the field and a search term.

The program of the present invention causes a computer to function as: storage means for storing, for each document category, a weight set for each field included in a document; and calculation means for calculating a score of the document based on the weight set for the field and the relationship between the field and a search term.

本発明によれば、文書データのカテゴリとフィールド情報を検索スコアの計算に用いて、検索ユーザの意図に近い検索結果を得られるよう検索精度の向上を行うことが可能となる。 According to the present invention, it is possible to improve the search accuracy so that a search result close to the intention of the search user can be obtained by using the category and field information of the document data in the calculation of the search score.

FIG. 1 is a diagram showing the system configuration of the information processing system.
FIG. 2 is a diagram showing the hardware configuration of the information processing apparatus.
FIG. 3 is a flowchart showing the flow of processing in this embodiment.
FIG. 4 is a flowchart showing an example of the document registration process in the embodiment of the present invention.
FIG. 5 is a flowchart showing an example of the field extraction process in the embodiment of the present invention.
FIG. 6 is a flowchart showing an example of the field extraction process by keyword in the embodiment of the present invention.
FIG. 7 is a flowchart showing an example of the field extraction process by pattern in the embodiment of the present invention.
FIG. 8 is a flowchart showing an example of the field extraction process by morphological analysis in the embodiment of the present invention.
FIG. 9 is a flowchart showing an example of the search process in the embodiment of the present invention.
FIG. 10 is a flowchart showing an example of the update process for the search session statistical information in the embodiment of the present invention.
FIG. 11 is a flowchart showing an example of the field weight update process in the embodiment of the present invention.
FIG. 12 is a diagram showing an example of the extraction definition list screen in the embodiment of the present invention.
FIG. 13 is a diagram showing an example of the extraction definition detail screen in the embodiment of the present invention.
FIG. 14 is a diagram explaining the distance between a field name and a keyword in the embodiment of the present invention.
FIG. 15 is a diagram showing an example of the field weight update process in the embodiment of the present invention.
FIG. 16 is a diagram showing an example of the table of search session statistical information in the embodiment of the present invention.
FIG. 17 is a diagram showing an example of the calculation of the field score in the embodiment of the present invention.

以下、図面を参照して、本発明の実施形態を詳細に説明する。なお、以下に説明する実施形態は、本発明を具体的に実施した場合の一例を示すもので、特許請求の範囲に記載した構成の一例である。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The embodiments described below are examples of specific implementations of the present invention, and are examples of the configurations described in the claims.

FIG. 1 is a diagram showing an example of the system configuration of a document retrieval system according to an embodiment of the present invention. The document retrieval system 2000 of the embodiment comprises an information processing apparatus 100, a document DB 107, and a client PC 108. The information processing apparatus 100 comprises a document registration processing unit 101, a document search processing unit 102, a morphological analysis dictionary 103, a registered document index 104, an extraction definition DB 105, and search session statistical information 106, and is communicably connected via a network to the external document DB 107 and to the client PC used when registering documents in the document retrieval system.

文書登録処理部１０１では、ユーザから受け付けた文書に係る処理を実行する機能部である。具体的には、テキスト抽出処理やカテゴリ付与やフィールドの抽出処理を行い、検索インデックスを作成し、登録文書インデックス１０４に格納するなどの処理を行う。 The document registration processing unit 101 is a functional unit that executes processing related to a document received from a user. Specifically, text extraction processing, category assignment, and field extraction processing are performed, a search index is created, and processing such as storage in the registered document index 104 is performed.

The document search processing unit 102 is a functional unit that searches indexed documents using a search term received from the user. On receiving a search term, it calculates a text score and a field score for each indexed document, adds the two together, and reflects the sum in the search result.

形態素解析辞書１０３は、形態素解析を行う際に使用される辞書である。 The morphological analysis dictionary 103 is a dictionary used when performing morphological analysis.

The registered document index 104 is a DB that stores the search indexes for the text and for each field extracted from the documents to be registered. The document search processing unit 102 performs its processing using this DB.

抽出定義ＤＢ１０５は、カテゴリ毎に定義づけられる抽出定義を記憶しておくＤＢである。本抽出定義ＤＢに記憶される当該カテゴリの抽出定義として設定された抽出方式により、フィールドの抽出を行う。抽出方式は、キーワードによる抽出を行うか、パターンによる抽出を行うか、形態素解析による抽出などがある。 The extraction definition DB 105 is a DB that stores the extraction definitions defined for each category. The field is extracted by the extraction method set as the extraction definition of the category stored in the present extraction definition DB. Extraction methods include extraction by keywords, extraction by patterns, and extraction by morphological analysis.

The search session statistical information 106 is a DB that holds the users' search session statistics. The statistics are updated as users search and are used when updating the field weights of the extraction definitions.

文書ＤＢ１０７は、文書が記憶されているＤＢである。クラウドサービスなどの外部ＤＢも含まれる。 The document DB 107 is a DB in which a document is stored. External DBs such as cloud services are also included.

クライアントＰＣ１０８は、ユーザから文書登録を受付ける際に使用される。 The client PC 108 is used when accepting a document registration from a user.

図２は、本発明の実施形態における情報処理装置のハードウェア構成の一例を示すブロック図である。 FIG. 2 is a block diagram showing an example of the hardware configuration of the information processing apparatus according to the embodiment of the present invention.

As shown in FIG. 2, in the information processing apparatus, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, a storage device 204, an input controller 205, a voice input controller 206, a video controller 207, a memory controller 208, and a communication I/F controller 209 are connected via a system bus 200.

ＣＰＵ２０１は、システムバス２００に接続される各デバイスやコントローラを統括的に制御する。 The CPU 201 comprehensively controls each device and controller connected to the system bus 200.

The ROM 202 or the external memory 213 holds the control programs executed by the CPU 201, such as the BIOS (Basic Input/Output System) and the OS (Operating System), as well as the computer-readable and executable program for realizing this information processing method and the various data it requires (including data tables).

ＲＡＭ２０３は、ＣＰＵ２０１の主メモリ、ワークエリア等として機能する。ＣＰＵ２０１は、処理の実行に際して必要なプログラム等をＲＯＭ２０２あるいは外部メモリ２１３からＲＡＭ２０３にロードし、ロードしたプログラムを実行することで各種動作を実現する。 The RAM 203 functions as a main memory, a work area, and the like of the CPU 201. The CPU 201 realizes various operations by loading a program or the like necessary for executing the process from the ROM 202 or the external memory 213 into the RAM 203 and executing the loaded program.

入力コントローラ２０５は、キーボード２１０や不図示のマウス等のポインティングデバイス等の入力装置からの入力を制御する。入力装置がタッチパネルの場合、ユーザがタッチパネルに表示されたアイコンやカーソルやボタンに合わせて押下（指等でタッチ）することにより、各種の指示を行うことができることとする。 The input controller 205 controls input from an input device such as a keyboard 210 or a pointing device such as a mouse (not shown). When the input device is a touch panel, the user can give various instructions by pressing (touching with a finger or the like) the icon, the cursor, or the button displayed on the touch panel.

また、タッチパネルは、マルチタッチスクリーンなどの、複数の指でタッチされた位置を検出することが可能なタッチパネルであってもよい。 Further, the touch panel may be a touch panel such as a multi-touch screen that can detect a position touched by a plurality of fingers.

The video controller 207 controls display on an external output device such as the display 212. The display includes the display of a notebook computer that is integrated with the main body. The external output device is not limited to a display and may be, for example, a projector. A device that can accept the touch operations described above also serves as an input device.

なおビデオコントローラ２０７は、表示制御を行うためのビデオメモリ（ＶＲＡＭ）を制御することが可能で、ビデオメモリ領域としてＲＡＭ２０３の一部を利用することもできるし、別途専用のビデオメモリを設けることも可能である。 The video controller 207 can control a video memory (VRAM) for display control, can use a part of the RAM 203 as a video memory area, or can separately provide a dedicated video memory. It is possible.

The memory controller 208 controls access to the external memory 213. As the external memory, an external storage device (hard disk) or flexible disk (FD) that stores a boot program, various applications, font data, user files, edit files, and various data, or a CompactFlash (registered trademark) memory connected to a PCMCIA card slot via an adapter, can be used.

通信Ｉ／Ｆコントローラ２０９は、ネットワークを介して外部機器と接続・通信するものであり、ネットワークでの通信制御処理を実行する。例えば、ＴＣＰ／ＩＰを用いた通信やＩＳＤＮなどの電話回線、および携帯電話の３Ｇ回線を用いた通信が可能である。 The communication I / F controller 209 connects to and communicates with an external device via a network, and executes communication control processing on the network. For example, communication using TCP / IP, a telephone line such as ISDN, and communication using a 3G line of a mobile phone are possible.

尚、ＣＰＵ２０１は、例えばＲＡＭ２０３内の表示情報用領域へアウトラインフォントの展開（ラスタライズ）処理を実行することにより、ディスプレイ２１２上での表示を可能としている。また、ＣＰＵ２０１は、ディスプレイ２１２上の不図示のマウスカーソル等でのユーザ指示を可能とする。 The CPU 201 enables display on the display 212 by, for example, executing an outline font expansion (rasterization) process in the display information area in the RAM 203. Further, the CPU 201 enables a user instruction with a mouse cursor or the like (not shown) on the display 212.

本発明を実現するための後述する各種プログラムは、外部メモリ２１３に記憶されており、必要に応じてＲＡＭ２０３にロードされることによりＣＰＵ２０１によって実行されるものである。さらに上記プログラムの実行時に用いられる定義ファイル及び各種情報テーブル等も外部メモリ２１３に格納されており、これらについての詳細な説明も後述する。 Various programs described later for realizing the present invention are stored in the external memory 213, and are executed by the CPU 201 by being loaded into the RAM 203 as needed. Further, a definition file and various information tables used when executing the above program are also stored in the external memory 213, and detailed explanations thereof will be described later.

Next, the flow of processing in the present invention will be described with reference to FIG. 3.

ステップＳ３０１では、事前設定として、カテゴリ毎のフィールド抽出定義情報（フィールド重みを含む）とデフォルトの抽出定義（カテゴリが設定されていないファイルやフィールド重みセットで指定しなかったフィールドに使う抽出定義）の設定を受け付ける。フィールド抽出定義情報とは、抽出定義詳細画面１３００に示すように、カテゴリ毎に、フィールド名と当該フィールドを抽出する方法と抽出定義が対応付けられた情報である。例えば図１３に示す抽出定義情報によれば、「工事概要」というカテゴリの文書については、「事務所」や「病院」といったキーワードにより抽出されるフィールドを「建物用途」というフィールドとして抽出することが可能となる。 In step S301, as presets, the field extraction definition information (including the field weight) for each category and the default extraction definition (extraction definition used for the file in which the category is not set or the field not specified in the field weight set) are set. Accept settings. The field extraction definition information is information in which the field name, the method for extracting the field, and the extraction definition are associated with each category, as shown in the extraction definition detail screen 1300. For example, according to the extraction definition information shown in FIG. 13, for a document in the category of "construction outline", the field extracted by keywords such as "office" and "hospital" can be extracted as the field of "building use". It will be possible.

設定された抽出定義情報は、抽出定義ＤＢ１０５に保存される。 The set extraction definition information is stored in the extraction definition DB 105.
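The extraction definition information described above can be pictured as a per-category table of field definitions. The following is a minimal sketch under stated assumptions, not the layout used by the extraction definition DB 105: the class name `FieldDefinition`, the key `DEFAULT_CATEGORY`, the helper `get_definitions`, and all weight values are hypothetical, while the category, field names, and keywords are the English renderings of the FIG. 13 example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class FieldDefinition:
    """One row of an extraction definition (names here are hypothetical)."""
    name: str                          # field name, e.g. "building use"
    method: str                        # "keyword", "pattern", or "morphological"
    definition: Union[List[str], str]  # keyword list, regex, or part-of-speech tag
    weight: float = 1.0                # field weight; values below are made up


DEFAULT_CATEGORY = "__default__"  # hypothetical key for the default definition

extraction_definitions: Dict[str, List[FieldDefinition]] = {
    "construction outline": [
        FieldDefinition("building use", "keyword",
                        ["office", "hospital", "restaurant", "parking lot", "hotel"], 2.0),
        FieldDefinition("related law", "pattern", r".+(法|条例)", 1.5),
        FieldDefinition("address", "morphological", "名詞-固有名詞-地域", 1.0),
    ],
    DEFAULT_CATEGORY: [
        FieldDefinition("address", "morphological", "名詞-固有名詞-地域", 1.0),
    ],
}


def get_definitions(category: Optional[str]) -> List[FieldDefinition]:
    """Return the definitions for a category, falling back to the default."""
    return extraction_definitions.get(category or DEFAULT_CATEGORY,
                                      extraction_definitions[DEFAULT_CATEGORY])


print([d.name for d in get_definitions("construction outline")])
print([d.name for d in get_definitions(None)])
```

The fallback in `get_definitions` mirrors the role of the default extraction definition for documents that have no category.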

In step S302, the document registration process is executed for the document (search target document) received from the user. In the document registration process, the text of the search target document is extracted, a category is assigned, the fields of the search target document are extracted, and search indexes are built for the text and the fields. The details of the document registration process will be described later with reference to FIG. 4.

In step S303, the document search process is executed based on the search term received from the user. In the document search process, a search is performed using the search indexes built in step S302. The details of the document search process will be described later with reference to FIG. 9.

次に図４～図８のフローチャートを用いて、本発明の実施形態における文書登録処理部が実行する文書登録処理について説明する。 Next, the document registration process executed by the document registration processing unit according to the embodiment of the present invention will be described with reference to the flowcharts of FIGS. 4 to 8.

図４のフローチャートは、文書登録処理部１０１において文書を登録する処理を示すフローチャートである。 The flowchart of FIG. 4 is a flowchart showing a process of registering a document in the document registration processing unit 101.

ステップＳ４０１では、登録対象となる文書全てに対して処理が終了したかどうかを判定する。処理が終了していれば（Ｓ４０１のＹｅｓ）該フローチャートの処理を終了し、処理の終了していない文書が残っていれば（Ｓ４０１のＮｏ）ステップＳ４０２に進む。 In step S401, it is determined whether or not the processing is completed for all the documents to be registered. If the processing is completed (Yes in S401), the processing of the flowchart is completed, and if there is a document that has not been processed (No in S401), the process proceeds to step S402.

ステップＳ４０２では該文書に対してテキスト抽出処理を行う。該テキスト抽出処理は一般に開示されている技術により実現されるものであり、どのような技術・方法を用いても構わない。 In step S402, a text extraction process is performed on the document. The text extraction process is realized by a technique generally disclosed, and any technique / method may be used.

ステップＳ４０３では該文書に対するカテゴリ付与を行う。カテゴリとは、その文書がいかなるタイプの文書であるかを分類するために付与され、本実施例であれば工事概要、注文書、議事録などがカテゴリの分類例である。ここでのカテゴリ付与は計算機によって自動で行ってもよいし、ユーザによって手動で行っても構わない。 In step S403, a category is assigned to the document. The category is given to classify what type of document the document is, and in the case of this embodiment, the construction outline, the purchase order, the minutes, etc. are the classification examples of the category. The category assignment here may be performed automatically by a computer or manually by the user.

In step S404, a field extraction process is performed. The field extraction process will be described later with reference to FIG. 5.

ステップＳ４０５ではステップＳ４０２で抽出したテキスト及びステップＳ４０４で抽出した各フィールドに対する検索インデックスの作成を行い登録文書インデックス１０４に格納する。検索インデックスとは、図９で示す文書検索処理の処理時に使用する検索インデックスである。 In step S405, a search index is created for the text extracted in step S402 and each field extracted in step S404, and stored in the registered document index 104. The search index is a search index used at the time of processing the document search process shown in FIG.

図５のフローチャートは、文書からフィールドを抽出する処理を示すフローチャートである。 The flowchart of FIG. 5 is a flowchart showing a process of extracting a field from a document.

ステップＳ５０１では、ステップＳ４０３で付与された該文書のカテゴリを取得する。 In step S501, the category of the document given in step S403 is acquired.

ステップＳ５０２では、抽出定義ＤＢ１０５からステップＳ３０１で設定された該カテゴリの抽出定義情報を取得する。ステップＳ５０１でカテゴリが取得できなかった場合はデフォルトの抽出定義を取得する。 In step S502, the extraction definition information of the category set in step S301 is acquired from the extraction definition DB 105. If the category cannot be acquired in step S501, the default extraction definition is acquired.

ステップＳ５０３では、該抽出定義に定義された全てのフィールドに対して抽出処理が終了したかどうかを判断する。終了していれば（Ｓ５０３のＹｅｓ）該フローチャートを終了し、そうでなければ（Ｓ５０３のＮｏ）処理をステップＳ５０４に進める。 In step S503, it is determined whether or not the extraction process is completed for all the fields defined in the extraction definition. If it is completed (Yes in S503), the flowchart is terminated, and if not (No in S503), the process proceeds to step S504.

ステップＳ５０４では、該抽出定義情報に設定された処理対象のフィールドの抽出方式に応じて、処理を分岐する。例えば、図１３の例では、「住所」のフィールドについては形態素解析により抽出することを意味している。抽出方式が「キーワード」であればステップＳ５０５に、「パターン」であればステップＳ５０６に、「形態素解析」であればステップＳ５０７に処理を進める。 In step S504, the process is branched according to the extraction method of the field to be processed set in the extraction definition information. For example, in the example of FIG. 13, it means that the field of "address" is extracted by morphological analysis. If the extraction method is "keyword", the process proceeds to step S505, if the extraction method is "pattern", the process proceeds to step S506, and if the extraction method is "morphological analysis", the process proceeds to step S507.

In step S505, the extraction process using keywords is performed. The details of the extraction process using keywords will be described later using the flowchart of FIG. 6.

ステップＳ５０６では、パターンによる抽出処理を行う。パターンによる抽出処理の詳細は、図７のフローチャートを用いて後述する。 In step S506, the extraction process using the pattern is performed. The details of the extraction process using the pattern will be described later using the flowchart of FIG. 7.

In step S507, the extraction process by morphological analysis is performed. The details of the extraction process by morphological analysis will be described later using the flowchart of FIG. 8.

In step S508, the extracted field is recorded as a field of the document. At this time, it is recorded in association with the field name in the extraction definition information.

図６のフローチャートは、文書からキーワード方式でフィールドを抽出する処理を示すフローチャートである。 The flowchart of FIG. 6 is a flowchart showing a process of extracting a field from a document by a keyword method.

ステップＳ６０１では、該抽出定義の全てのキーワードを処理したかどうかを判断する。全て処理していれば（Ｓ６０１のＹｅｓ）該フローチャートを終了し、そうでなければ（Ｓ６０１のＮｏ）処理をステップＳ６０２に進める。 In step S601, it is determined whether or not all the keywords of the extraction definition have been processed. If all the processes have been performed (Yes in S601), the flowchart is terminated, and if not (No in S601), the process proceeds to step S602.

ステップＳ６０２では、該抽出定義から未処理のキーワードを取得する。 In step S602, an unprocessed keyword is acquired from the extraction definition.

ステップＳ６０３では、該文書に対するキーワードマッチを実行する。このキーワードマッチにはどのような手法を用いても構わない。 In step S603, a keyword match for the document is executed. Any method may be used for this keyword match.

ステップＳ６０４では、該文書に該キーワードが存在するかどうかを判断する。存在しない場合（Ｓ６０４のＮｏ）処理をステップＳ６０１に進め、存在する場合（Ｓ６０４のＹｅｓ）処理をステップＳ６０５に進める。 In step S604, it is determined whether or not the keyword exists in the document. If it does not exist (No in S604), the process proceeds to step S601, and if it exists (Yes in S604), the process proceeds to step S605.

In step S605, it is determined whether or not the field name exists near the keyword that was found. As shown in FIG. 13, an extraction definition for the keyword-based extraction process is registered with the field name and the keywords associated with each other. If the field name associated with the keyword exists near the detected keyword (step S605: YES), the process proceeds to step S606; if it does not (step S605: NO), the process returns to step S601.

Here, the distance between the keyword (Value) and the field name (Key) will be described specifically with reference to FIG. 14.

As shown in FIG. 13, the field name "building use" is associated with five keywords: office, hospital, restaurant, parking lot, and hotel. The keywords found in the document shown in FIG. 14 are therefore V1 "hospital", V2 "office", and V3 "parking lot". Of these, V1 and V2 are on the same line as the key K1 "building use" and can be said to be close to it. V3, on the other hand, is five lines away from K1; it is far from the key and is considered to be used in a context different from "building use". Therefore, erroneous extraction can be prevented by taking the distance between the keyword and the field name into account during keyword extraction and excluding keywords that are far from the field name.

Returning to the description of FIG. 6.

ステップＳ６０６では、ステップＳ６０５で抽出されたキーワードを該文書のフィールドの値として記録する。 In step S606, the keyword extracted in step S605 is recorded as the value of the field of the document.
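The keyword extraction of steps S601 to S606 can be sketched as follows. This is an illustrative sketch only: the source does not specify how "near" is measured, so the line-based distance and the `max_line_distance` threshold are assumptions, and the function name `extract_by_keyword` and the sample document are hypothetical (the sample mimics the FIG. 14 situation in English).

```python
from typing import List


def extract_by_keyword(lines: List[str], field_name: str,
                       keywords: List[str], max_line_distance: int = 0) -> List[str]:
    """Return the keywords that occur close enough to a line containing the field name."""
    # Lines on which the field name (the "Key") appears.
    key_lines = [i for i, line in enumerate(lines) if field_name in line]
    values: List[str] = []
    for keyword in keywords:                                   # S601 / S602
        for i, line in enumerate(lines):                       # S603
            if keyword not in line:                            # S604
                continue
            # S605: is the field name within the allowed line distance?
            if any(abs(i - j) <= max_line_distance for j in key_lines):
                values.append(keyword)                         # S606
                break
    return values


document = [
    "Building use: hospital, office",
    "Floors: 5",
    "",
    "",
    "",
    "Notes: the parking lot is shared with the neighbouring site",
]
keywords = ["office", "hospital", "restaurant", "parking lot", "hotel"]
print(extract_by_keyword(document, "Building use", keywords))
# → ['office', 'hospital']
```

"parking lot" is found in the document but is rejected at the S605 check because it lies five lines from the field name, which corresponds to the treatment of V3 in the explanation above.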

図７は、正規表現パターンによるフィールドの抽出処理を示すフローチャートである。 FIG. 7 is a flowchart showing a field extraction process using a regular expression pattern.

In step S701, the regular expression pattern set in the extraction definition information is acquired. As an example of a regular expression pattern, for the field name 1302 "related law" in FIG. 13, the extraction definition 1304 is ".+(法|条例)". This is a regular expression for detecting character strings that end with "法" (law) or "条例" (ordinance); with this condition, strings such as "建築基準法" (Building Standards Law) and "騒音対策条例" (Noise Countermeasure Ordinance) can be extracted.

ステップＳ７０２では、該文書に対してステップＳ７０１で取得した正規表現のパターンマッチを行う。 In step S702, pattern matching of the regular expression acquired in step S701 is performed on the document.

In step S703, it is determined whether or not all of the portions matched in step S702 have been processed. If all matched portions have been processed (Yes in S703), the processing of the flowchart ends; otherwise (No in S703), the process proceeds to step S704.

In step S704, the matched portion is recorded as the value of the field of the document. Regular expression features such as groups and named groups may also be used so that only a part of the matched portion is used as the value of the field.
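The pattern-based extraction of steps S701 to S704 can be sketched with Python's standard `re` module. The function name `extract_by_pattern` and the sample text are hypothetical. Note one assumption in particular: because ".+" is greedy and would also swallow any text preceding the law name on the same line, this sketch applies the pattern to text in which each candidate term sits on its own line; the source does not say how the match is bounded.

```python
import re
from typing import List, Union


def extract_by_pattern(text: str, pattern: str,
                       group: Union[int, str] = 0) -> List[str]:
    """Return every match of the pattern (S702), optionally only one group of it (S704)."""
    return [m.group(group) for m in re.finditer(pattern, text)]


# One candidate term per line; "." does not match a newline by default.
text = "建築基準法\n騒音対策条例\n担当者"

print(extract_by_pattern(text, r".+(法|条例)"))
# → ['建築基準法', '騒音対策条例']

# Using a named group to keep only part of the match as the field value.
print(extract_by_pattern(text, r"(?P<title>.+)(法|条例)", group="title"))
# → ['建築基準', '騒音対策']
```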

図８は、形態素解析で得られた品詞によるフィールドの抽出処理を示すフローチャートである。 FIG. 8 is a flowchart showing a field extraction process using part of speech obtained by morphological analysis.

In step S801, the extraction definition is acquired. For example, in this embodiment, the part-of-speech sequence defined in the extraction definition 1304 for the field name 1302 "address" in FIG. 13 is acquired. In this case, the extraction definition is given as a sequence of [noun - proper noun - region]. That is, it means extracting a run of words that belong to the region category of proper nouns among nouns, so that a character string such as "東京都港区港南" (Konan, Minato-ku, Tokyo) is extracted.

ステップＳ８０２では、該文書に形態素解析を実行する。 In step S802, morphological analysis is performed on the document.

ステップＳ８０３では、ステップＳ８０１で取得した抽出定義に合致する品詞の並びがあるかどうかを判断する。品詞の並びがない場合（Ｓ８０３のＮｏ）該フローチャートの処理を終了し、そうでない場合（Ｓ８０３のＹｅｓ）処理をステップＳ８０４に進める。 In step S803, it is determined whether or not there is a sequence of part of speech that matches the extraction definition acquired in step S801. If there is no sequence of part of speech (No in S803), the process of the flowchart is terminated, and if not (Yes in S803), the process proceeds to step S804.

ステップＳ８０４では、マッチした部分を該文書のフィールドの値として記録する。 In step S804, the matched portion is recorded as the value of the field of the document.
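The part-of-speech-based extraction of steps S801 to S804 can be sketched as collecting runs of consecutive tokens that carry the target tag. This is a sketch under assumptions: the token list below is a hand-written stand-in for the output of a morphological analyzer (the actual analyzer, its tokenization, and its tag set are not given in the source), and the function name `extract_by_pos` is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)


def extract_by_pos(tokens: List[Token], target_pos: str) -> List[str]:
    """Join each run of consecutive tokens tagged with target_pos into one field value."""
    values: List[str] = []
    run: List[str] = []
    for surface, pos in tokens:
        if pos == target_pos:
            run.append(surface)            # still inside a matching run
        elif run:
            values.append("".join(run))    # run ended: record it (S804)
            run = []
    if run:                                # a run that reaches the end of the text
        values.append("".join(run))
    return values


# Hypothetical analyzer output for a sentence meaning "The location is Konan, Minato-ku, Tokyo."
tokens = [
    ("所在地", "名詞-一般"),
    ("は", "助詞-係助詞"),
    ("東京都", "名詞-固有名詞-地域"),
    ("港区", "名詞-固有名詞-地域"),
    ("港南", "名詞-固有名詞-地域"),
    ("です", "助動詞"),
]
print(extract_by_pos(tokens, "名詞-固有名詞-地域"))
# → ['東京都港区港南']
```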

続けて、図９、図１７を用いて、本発明の実施形態における文書検索処理部が実行する処理について説明する。 Subsequently, with reference to FIGS. 9 and 17, the processing executed by the document retrieval processing unit in the embodiment of the present invention will be described.

FIG. 9 is a flowchart showing a process in which the document search processing unit 102 receives a search term from a user as input and searches the indexed documents.

ステップＳ９０１では、ユーザからの検索語を取得する。 In step S901, the search term from the user is acquired.

ステップＳ９０２では、インデックス済みの全文書に対して文書スコアが未計算の文書が存在するかどうかを判断する。文書スコアが未計算の文書が存在する場合（Ｓ９０２のＹｅｓ）処理をステップＳ９０３に進め、そうでない場合（Ｓ９０２のＮｏ）処理をステップＳ９０８に進める。 In step S902, it is determined whether or not there is a document whose document score has not been calculated for all the indexed documents. If there is a document whose document score has not been calculated (Yes in S902), the process proceeds to step S903, and if not (No in S902), the process proceeds to step S908.

ステップＳ９０３では、文書スコア未計算の文書を取得する。 In step S903, a document whose document score has not been calculated is acquired.

ステップＳ９０４では、該文書の本文に対する検索スコアを計算する。検索スコアとは、検索語との関連度合いを数値で表した値である。本文に対する検索スコアを、本文スコアと呼ぶ。なお、本実施例においては、本文スコアは公知の検索スコア算出方法により算出される値とする。 In step S904, the search score for the body of the document is calculated. The search score is a numerical value indicating the degree of association with the search term. The search score for the text is called the text score. In this embodiment, the text score is a value calculated by a known search score calculation method.

ステップＳ９０５では、フィールドスコアが未計算のフィールドが存在するかどうかを判断する。存在する場合（Ｓ９０５のＹｅｓ）処理をステップＳ９０６に進め、そうでない場合（Ｓ９０５のＮｏ）処理をステップＳ９０７に進める。 In step S905, it is determined whether or not there is a field for which the field score has not been calculated. If such a field exists (Yes in S905), the process proceeds to step S906; otherwise (No in S905), the process proceeds to step S907.

ステップＳ９０６では、フィールドスコア未計算のフィールドを取得し、該フィールドに対する検索スコアを計算する。このスコアをフィールドスコアと呼ぶ。 In step S906, a field for which the field score has not been calculated is acquired, and the search score for the field is calculated. This score is called the field score.

フィールドスコアの計算の方法の一例を、図１７を用いて説明する。ユーザから「ＡＡＡ株式会社大阪」という検索語を受け付けた場合について説明する。 An example of the method of calculating the field score will be described with reference to FIG. 17. The case where the search term "AAA Co., Ltd. Osaka" is accepted from the user will be described.

図１７Ａは、大阪府警担当者議事録というタイトルの文書を示した図で、当該文書をフィールド毎に分け、各フィールドの値と重みが対応付けられている。図１７Ｂは、○○プロジェクト概要というタイトルの文書を示した図で、図１７Ａと同様に、フィールド毎に値と重みとが対応付けてある。なお、重みは、当該文書のカテゴリによって定まる値である。なお、図17において各フィールドの値として示している内容は、説明の為に抽出定義に合致しない文字列も含めて示しているが、ステップS506、ステップS604、ステップS704で説明した通り、各フィールドの値として登録されるのは、抽出定義に合致した文字列である。 FIG. 17A shows a document titled "Minutes of Osaka Prefectural Police Department"; the document is divided into fields, and the value of each field is associated with its weight. FIG. 17B shows a document titled "XX project outline", in which, as in FIG. 17A, a value and a weight are associated with each field. The weight is a value determined by the category of the document. For the sake of explanation, the contents shown as field values in FIG. 17 include character strings that do not match the extraction definition; however, as explained for steps S506, S604, and S704, what is actually registered as the value of each field is the character string that matches the extraction definition.

まず、検索語の出現回数をフィールド毎にカウントする。 First, the number of occurrences of the search term is counted for each field.

図１７Ａの文書であれば、タイトルフィールド１８０３には「大阪」は１回出現、人名フィールド１８０４には「大阪」は０回出現、本文フィールド１８０５には「大阪」は３回出現している。そして、各フィールドでの検索語の出現回数をフィールド毎に設定されている重みとをかけてフィールドスコアを求める。 In the document of FIG. 17A, "Osaka" appears once in the title field 1803, zero times in the personal name field 1804, and three times in the text field 1805. The number of occurrences of the search term in each field is then multiplied by the weight set for that field to obtain the field score.

タイトルフィールド１８０３に設定されている重みは１８０６に示すように２で大阪は１回出現なので、１×２＝２となる。同様に、人名フィールド１８０４は０×５＝０、本文フィールド１８０５は３×１＝３となる。これらの合計値（２＋０＋３＝５）が「大阪府警担当者議事録」という文書のフィールドスコアとして算出される。 The weight set in the title field 1803 is 2 as shown in 1806, and Osaka appears once, so 1 × 2 = 2. Similarly, the personal name field 1804 has 0 × 5 = 0, and the text field 1805 has 3 × 1 = 3. The total value (2 + 0 + 3 = 5) is calculated as the field score of the document "Minutes of Osaka Prefectural Police Department".

同様に図１７Ｂの、○○プロジェクト概要．ＰＤＦのフィールドスコアを計算すると、会社名フィールドで、検索語ＡＡＡ株式会社が１回出現しているので１×５＝５、住所フィールドで大阪が１回出現しているので１×５＝５、本文フィールドでＡＡＡ株式会社と大阪がそれぞれ１回ずつ出現しているので２×１＝２となる。これらの合計値（５＋５＋２＝１２）が○○プロジェクト概要．ＰＤＦのフィールドスコアとして算出される。 Similarly, when the field score of "XX project outline.PDF" in FIG. 17B is calculated, the search term "AAA Co., Ltd." appears once in the company name field, giving 1 × 5 = 5; "Osaka" appears once in the address field, giving 1 × 5 = 5; and "AAA Co., Ltd." and "Osaka" each appear once in the text field, giving 2 × 1 = 2. The total of these values (5 + 5 + 2 = 12) is calculated as the field score of "XX project outline.PDF".
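
The arithmetic above can be reproduced with a short sketch. It assumes that the search term is split into the two words "ＡＡＡ株式会社" and "大阪" and that occurrences are counted by plain substring matching. The field values below are hypothetical placeholders chosen only so that the occurrence counts agree with the text; the weights are the ones stated for FIG. 17.

```python
from typing import Dict, List, Tuple

# field name -> (field value, field weight)
Fields = Dict[str, Tuple[str, float]]


def field_score(fields: Fields, terms: List[str]) -> float:
    """Sum over all fields of (occurrences of the search terms) x (weight)."""
    total = 0.0
    for value, weight in fields.values():
        occurrences = sum(value.count(term) for term in terms)
        total += occurrences * weight
    return total


terms = ["ＡＡＡ株式会社", "大阪"]

# Hypothetical contents for the document of FIG. 17A.
doc_a: Fields = {
    "title": ("大阪府警担当者議事録", 2),
    "person": ("山田", 5),
    "body": ("大阪 ... 大阪 ... 大阪", 1),
}

# Hypothetical contents for the document of FIG. 17B.
doc_b: Fields = {
    "company": ("ＡＡＡ株式会社", 5),
    "address": ("大阪市", 5),
    "body": ("ＡＡＡ株式会社 大阪支店", 1),
}

score_a = field_score(doc_a, terms)  # 1*2 + 0*5 + 3*1
score_b = field_score(doc_b, terms)  # 1*5 + 1*5 + 2*1
print(score_a, score_b)
```

With these inputs the two documents score 5 and 12, matching the totals given in the description.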

ステップＳ９０７では、ステップＳ９０４で算出した該文書の本文スコアと、ステップＳ９０６で算出した該文書のフィールドスコアを合算する。この値を文書スコアと呼ぶ。 In step S907, the text score of the document calculated in step S904 and the field score of the document calculated in step S906 are added up. This value is called the document score.

なお、本実施例においては、本文スコアとフィールドスコアとを合算したスコアを文書スコアとしたが、各フィールドの重みを考慮したスコアであるフィールドスコアのみを用いても良い。 In this embodiment, the score obtained by adding the text score and the field score is used as the document score, but only the field score, which is the score considering the weight of each field, may be used.

ステップＳ９０８では、文書スコアの降順で検索結果をユーザに示す。なお、本実施例では検索語との関連性が強い文書の文書スコアが高くなる計算方法を用いたため、降順で検索結果をユーザに示したが、検索語との関連性が強い文書の文書スコアが小さくなる算出方法を用いる場合は、昇順により表示する。すなわち、検索語との関連性が強い文書が検索結果の上位に表示されるようソートして表示する。 In step S908, the search results are shown to the user in descending order of the document score. In this embodiment, since a calculation method is used in which the document score of a document having a strong relevance to the search term is high, the search results are shown to the user in descending order, but the document score of the document having a strong relevance to the search term is shown. When a calculation method in which is smaller is used, it is displayed in ascending order. That is, documents that are strongly related to the search term are sorted and displayed so that they are displayed at the top of the search results.
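
Steps S907 and S908 can be sketched as below. The body scores are arbitrary placeholder numbers, since the embodiment only says that the body score comes from a known scoring method; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredDocument:
    doc_id: str
    body_score: float   # S904: from any known scoring method (placeholder values here)
    field_score: float  # S906: weighted per-field score

    @property
    def document_score(self) -> float:
        # S907: document score = body score + field score
        return self.body_score + self.field_score


def rank(docs: List[ScoredDocument], higher_is_better: bool = True) -> List[str]:
    """S908: order the results so the most relevant document comes first."""
    ordered = sorted(docs, key=lambda d: d.document_score, reverse=higher_is_better)
    return [d.doc_id for d in ordered]


docs = [
    ScoredDocument("A", body_score=1.5, field_score=5.0),
    ScoredDocument("B", body_score=1.0, field_score=12.0),
]
print(rank(docs))
```

The `higher_is_better` flag corresponds to the remark that an ascending sort is used when the scoring method assigns smaller values to more relevant documents.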

以上のように、抽出定義情報で「人名」や「会社名」や「住所」など、当該カテゴリの文書を特徴付けるフィールドに対して大きな重みを設定し、設定されたフィールド毎の重みを考慮して検索スコアを算出することで、検索語が同じ数だけ含まれる文書であっても、よりユーザ（検索者）の意図に合った（ユーザが探し求めている）文書を上位に表示することが可能となる。 As described above, a large weight is set in the extraction definition information for fields that characterize documents of the category, such as "person name", "company name", and "address", and the search score is calculated taking the weight set for each field into account. As a result, even among documents that contain the same number of search terms, the document that better matches the intention of the user (searcher), that is, the document the user is looking for, can be displayed higher.

ステップＳ９０９では、検索セッション統計情報更新処理を行う。図１０のフローチャートを用いて後述する。 In step S909, the search session statistical information update process is performed. It will be described later using the flowchart of FIG. 10.

ステップＳ９１０では、フィールド重み更新処理を行う。図１１のフローチャートを用いて後述する。 In step S910, the field weight update process is performed. It will be described later using the flowchart of FIG. 11.

図１０は、ユーザの検索セッションでの統計情報を更新する処理を示すフローチャートである。なお、検索セッションとはユーザが検索結果を取得して、該検索結果を破棄するまでの期間のことを言う。 FIG. 10 is a flowchart showing a process of updating statistical information in a user's search session. The search session is a period from when the user acquires the search result to when the search result is discarded.

ステップＳ１００１では、検索セッション統計情報テーブル図１５の１４００の初期化を行う。該検索結果に含まれる全ての文書情報について、文書ＩＤ、カテゴリを設定しセッション閲覧数を０に設定する。 In step S1001, the search session statistical information table (1400 in FIG. 15) is initialized. For every document included in the search result, the document ID and the category are set, and the number of session views is set to 0.

ステップＳ１００２では、検索セッションが終了しているかどうかを判断する。終了している場合（Ｓ１００２のＹｅｓ）該フローチャートの処理を終了し、そうでない場合（Ｓ１００２のＮｏ）ステップＳ１００３に処理を進める。 In step S1002, it is determined whether or not the search session has ended. If it is finished (Yes in S1002), the process of the flowchart is finished, and if not (No in S1002), the process proceeds to step S1003.

ステップＳ１００３では、ユーザが検索結果の文書を選択したかどうかを判断する。選択していない場合（Ｓ１００３のＮｏ）処理をステップＳ１００２に進め、そうでない場合（Ｓ１００３のＹｅｓ）は処理をステップＳ１００４に進める。 In step S1003, it is determined whether or not the user has selected a document in the search result. If no document is selected (No in S1003), the process returns to step S1002; otherwise (Yes in S1003), the process proceeds to step S1004.

ステップＳ１００４では、ユーザが選択した文書の情報を取得する。 In step S1004, the information of the document selected by the user is acquired.

ステップＳ１００５では、検索セッション統計情報テーブルの該文書のエントリを更新する。この場合、該テーブルのセッション閲覧数に１を加える。 In step S1005, the entry of the document in the search session statistics table is updated. In this case, 1 is added to the number of session views of the table.
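
The bookkeeping of steps S1001 to S1005 can be sketched as follows. The table is modeled here as a plain dictionary keyed by document ID; this layout, the document IDs, and the function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple


def init_session_table(hits: Iterable[Tuple[str, str]]) -> Dict[str, dict]:
    """S1001: one entry per hit document, with the view counter set to 0."""
    return {
        doc_id: {"category": category, "session_views": 0}
        for doc_id, category in hits
    }


def record_selection(table: Dict[str, dict], doc_id: str) -> None:
    """S1004-S1005: add 1 to the view counter of the selected document."""
    table[doc_id]["session_views"] += 1


# Hypothetical search result: (document ID, category).
hits = [
    ("D001", "construction outline"),
    ("D002", "purchase order"),
    ("D003", "minutes"),
]
table = init_session_table(hits)

# Hypothetical selections made before the session ends (S1002/S1003 loop).
selections: List[str] = ["D001", "D001"]
for doc_id in selections:
    record_selection(table, doc_id)

print(table["D001"]["session_views"])
```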

図１１は、検索セッション統計情報を利用して抽出定義のフィールド重みを更新する処理を示すフローチャートである。検索の情報に応じてフィールド重みを更新していくことで、より検索精度が向上していくことが見込まれる。 FIG. 11 is a flowchart showing a process of updating the field weight of the extraction definition using the search session statistical information. By updating the field weights according to the search information, it is expected that the search accuracy will be further improved.

ステップＳ１１０１では、検索セッション統計情報テーブル図１５の１４００を取得する。なお、ここで取得するのは検索セッションの終了した検索セッション統計情報テーブルのみである。 In step S1101, the search session statistical information table (1400 in FIG. 15) is acquired. Only tables whose search session has ended are acquired here.

ステップＳ１１０２では、ヒット文書のカテゴリごとに閲覧数を集計する。ヒット文書とは、検索処理部により検索された文書である。検索の結果ヒットした文書をユーザが閲覧したかを集計することで、次回の検索精度を上げるために利用される。 In step S1102, the number of views is totaled for each category of hit documents. The hit document is a document searched by the search processing unit. It is used to improve the accuracy of the next search by totaling whether the user has viewed the document that was hit as a result of the search.

ステップＳ１１０３では、ステップＳ１１０２で集計したカテゴリの中に未処理のカテゴリがあるかどうかを判断する。未処理のカテゴリがある場合（Ｓ１１０３のＹｅｓ）処理をステップＳ１１０４に進め、そうでない場合（Ｓ１１０３のＮｏ）処理をステップＳ１１０９に進める。 In step S1103, it is determined whether or not there is an unprocessed category among the categories aggregated in step S1102. If there is an unprocessed category (Yes in S1103), the process proceeds to step S1104; otherwise (No in S1103), the process proceeds to step S1109.

ステップＳ１１０４では、未処理のカテゴリの抽出定義を取得する。 In step S1104, the extraction definition of the unprocessed category is acquired.

ステップＳ１１０５では、検索語に含まれる未処理のフィールド情報（当該カテゴリの抽出定義として設定されたフィールドのうち、検索語として用いられたワードが該当するフィールドであって、未処理のフィールド）があるかどうかを判断する。未処理のフィールド情報がある場合（Ｓ１１０５のＹｅｓ）処理をステップＳ１１０６へ進め、そうでない場合（Ｓ１１０５のＮｏ）処理をステップＳ１１０３に進める。 In step S1105, it is determined whether or not the search term contains unprocessed field information, that is, an unprocessed field that is among the fields set in the extraction definition of the category and to which a word used as the search term corresponds. If there is unprocessed field information (Yes in S1105), the process proceeds to step S1106; otherwise (No in S1105), the process proceeds to step S1103.

ステップＳ１１０６では、該カテゴリのセッション閲覧数が０より大きいかどうかを判断する。０より大きい場合（Ｓ１１０６のＹｅｓ）処理をステップＳ１１０７に進め、そうでない場合（Ｓ１１０６のＮｏ）処理をステップＳ１１０８に進める。 In step S1106, it is determined whether or not the number of session views of the category is larger than 0. If it is larger than 0 (Yes in S1106), the process proceeds to step S1107, and if not (No in S1106), the process proceeds to step S1108.

ステップＳ１１０７では、該カテゴリのセッション閲覧数が０より大きく、該フィールドが検索に貢献できたと考え、該フィールドのフィールド重みを（セッション閲覧数）×０．０１だけ加算する。この計算式はあくまでも一例であり、その他の計算方法を用いても構わない。 In step S1107, it is considered that the number of session views of the category is larger than 0 and the field can contribute to the search, and the field weight of the field is added by (session view number) × 0.01. This calculation formula is just an example, and other calculation methods may be used.

ステップＳ１１０８では、該カテゴリのセッション閲覧数が０であり、該フィールドが検索に貢献していないと考え、該フィールドのフィールド重みを０．０１だけ減算する。この計算式はあくまでも一例であり、その他の計算方法を用いても構わない。 In step S1108, it is considered that the number of session views of the category is 0 and the field does not contribute to the search, and the field weight of the field is subtracted by 0.01. This calculation formula is just an example, and other calculation methods may be used.

ステップＳ１１０９では、不要となった該検索セッション統計情報テーブルを破棄する。 In step S1109, the search session statistical information table that is no longer needed is discarded.

ここで、図１１を用いて、フィールド重み更新処理の一例を説明する。まず、検索語に「住所」と「建物用途」を含む検索語が使われたとし、検索セッション終了時の検索セッション統計情報テーブルが図１６の１４００であったとする。また、カテゴリ「工事概要」「注文書」「議事録」の抽出定義がそれぞれ、１５００、１６００、１７００であったとする。またフィールド重み更新式はセッション閲覧数が０より大きい場合は（セッション閲覧数）×０．０１を加算、０の場合は０．０１の減算とする。この場合、テーブル１４００より検索結果のカテゴリごとのセッション閲覧数は、工事概要が２、注文書と議事録が０となる。工事概要のフィールド重みの更新は、フィールド「住所」（図１６の１５０１）と「建物用途」（図１６の１５０２）が両方とも定義されていることから、２×０．０１＝０．０２が加算され、更新後のフィールド重みはそれぞれ３．０２と２．０２となる。注文書のフィールド重みの更新は「住所」（図１６の１６０１）のみが定義されていることから、０．０１の減算となり、更新後のフィールド重みは０．０９となる。議事録のフィールド重みの更新は「住所」「建物用途」ともに定義されていないため行われない。 Here, an example of the field weight update process of FIG. 11 will be described. First, assume that a search term containing "address" and "building use" is used, and that the search session statistical information table at the end of the search session is 1400 in FIG. 16. Also assume that the extraction definitions of the categories "construction outline", "purchase order", and "minutes" are 1500, 1600, and 1700, respectively. The field weight update formula adds (number of session views) × 0.01 when the number of session views is larger than 0, and subtracts 0.01 when it is 0. In this case, from table 1400, the number of session views per category of the search results is 2 for the construction outline and 0 for the purchase order and the minutes. For the construction outline, both the "address" field (1501 in FIG. 16) and the "building use" field (1502 in FIG. 16) are defined, so 2 × 0.01 = 0.02 is added to each, and the updated field weights are 3.02 and 2.02, respectively. For the purchase order, only "address" (1601 in FIG. 16) is defined, so 0.01 is subtracted and the updated field weight is 0.09. The field weights of the minutes are not updated because neither "address" nor "building use" is defined.
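
The worked example can be checked with the following sketch of steps S1102 to S1108. `Decimal` is used so that the 0.01 steps stay exact. The initial weights 3, 2, and 0.1 are inferred from the updated values stated in the example; the document IDs, the split of the two views across two documents, and the "person" field of the minutes are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List

STEP = Decimal("0.01")


def update_field_weights(
    definitions: Dict[str, Dict[str, Decimal]],
    session_rows: List[dict],
    query_fields: List[str],
) -> None:
    """Adjust, in place, the weights of the fields that the query touched."""
    views: Dict[str, int] = defaultdict(int)
    for row in session_rows:                      # S1102: views per category
        views[row["category"]] += row["session_views"]
    for category, count in views.items():         # S1103: each category
        fields = definitions.get(category, {})    # S1104: its extraction definition
        for name in query_fields:                 # S1105: fields used in the query
            if name not in fields:
                continue                          # field not defined: no update
            if count > 0:                         # S1106
                fields[name] += count * STEP      # S1107
            else:
                fields[name] -= STEP              # S1108


definitions = {
    "construction outline": {"address": Decimal("3"), "building use": Decimal("2")},
    "purchase order": {"address": Decimal("0.1")},
    "minutes": {"person": Decimal("5")},
}
session_rows = [
    {"doc_id": "D001", "category": "construction outline", "session_views": 1},
    {"doc_id": "D002", "category": "construction outline", "session_views": 1},
    {"doc_id": "D003", "category": "purchase order", "session_views": 0},
    {"doc_id": "D004", "category": "minutes", "session_views": 0},
]
update_field_weights(definitions, session_rows, ["address", "building use"])
print(definitions)
```

After the call, the construction outline weights are 3.02 and 2.02, the purchase order weight is 0.09, and the minutes definition is unchanged, as in the description.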

図１２は、現在定義されている抽出定義の確認と、追加、削除を行う画面である。抽出定義一覧画面１２００は抽出定義追加ボタン１２０１、一括削除ボタン１２０２、チェックボックス１２０３、編集ボタン１２０４、個別削除ボタン１２０５からなる。 FIG. 12 is a screen for confirming, adding, and deleting the currently defined extraction definition. The extraction definition list screen 1200 includes an extraction definition addition button 1201, a batch deletion button 1202, a check box 1203, an edit button 1204, and an individual deletion button 1205.

抽出定義追加ボタン１２０１は、押下することで抽出定義詳細画面（図１３）に遷移し、新規に抽出定義を作成するためのものである。 The extraction definition addition button 1201 transitions to the extraction definition detail screen (FIG. 13) by pressing the button, and is for creating a new extraction definition.

一括削除ボタン１２０２は、押下することでチェックボックス１２０３が有効になっている全ての抽出定義を一括削除するものである。 The batch delete button 1202 deletes all the extraction definitions for which the check box 1203 is enabled by pressing the button 1202.

チェックボックス１２０３は、有効にすることで一括削除ボタン１２０２を用いて一括削除を行えるようにするためのものである。 The check box 1203 is for enabling batch deletion by using the batch deletion button 1202.

編集ボタン１２０４は、押下することで抽出定義詳細画面（図１３）に遷移し、選択した抽出定義を編集するためのものである。 The edit button 1204 transitions to the extraction definition detail screen (FIG. 13) by pressing the button, and edits the selected extraction definition.

個別削除ボタン１２０５は、押下することで選択した抽出定義を削除するためのものである。 The individual deletion button 1205 is for deleting the selected extraction definition by pressing the button.

図１３は、抽出定義の詳細の追加、確認、編集を行う画面である。抽出定義詳細画面１３００は、カテゴリ名テキストボックス１３０１、フィールド名テキストボックス１３０２、抽出方式プルダウンリスト１３０３、抽出定義テキストボックス１３０４、フィールド重みテキストボックス１３０５、フィールド削除ボタン１３０６、抽出定義フィールド追加ボタン１３０７からなる。 FIG. 13 is a screen for adding, confirming, and editing the details of an extraction definition. The extraction definition detail screen 1300 consists of a category name text box 1301, a field name text box 1302, an extraction method pull-down list 1303, an extraction definition text box 1304, a field weight text box 1305, a field deletion button 1306, and an extraction definition field addition button 1307.

なお、抽出定義一覧画面１２００の抽出定義追加ボタン１２０１を押下して本画面に遷移した場合は、カテゴリ名テキストボックス１３０１は空欄で、フィールド名テキストボックス１３０２、抽出方式プルダウンリスト１３０３、抽出定義テキストボックス１３０４、フィールド重みテキストボックス１３０５は初期状態では表示されていない。また、抽出定義一覧画面１２００の編集ボタンから本画面に遷移した場合、該抽出定義の内容がカテゴリ名テキストボックス１３０１、フィールド名テキストボックス１３０２、抽出方式プルダウンリスト１３０３、抽出定義テキストボックス１３０４、フィールド重みテキストボックス１３０５に表示される。 When this screen is reached by pressing the extraction definition addition button 1201 on the extraction definition list screen 1200, the category name text box 1301 is blank, and the field name text box 1302, the extraction method pull-down list 1303, the extraction definition text box 1304, and the field weight text box 1305 are not displayed in the initial state. When this screen is reached from the edit button on the extraction definition list screen 1200, the contents of the extraction definition are displayed in the category name text box 1301, the field name text box 1302, the extraction method pull-down list 1303, the extraction definition text box 1304, and the field weight text box 1305.

カテゴリ名テキストボックス１３０１は、この抽出定義につける名称を設定するためのものである。 The category name text box 1301 is for setting a name to be given to this extraction definition.

フィールド名テキストボックス１３０２は、フィールドの名称を設定するためのものである。 The field name text box 1302 is for setting the name of the field.

抽出方式プルダウンリスト１３０３は、抽出方式を選択するためのものである。ここでは「キーワード」「パターン」「形態素解析」から選択する。 The extraction method pull-down list 1303 is for selecting an extraction method. Here, select from "keyword", "pattern", and "morphological analysis".

抽出定義テキストボックス１３０４は、抽出の定義を設定するためのものである。抽出方式が「キーワード」の場合は抽出するキーワードのリスト、「パターン」の場合は正規表現パターン、「形態素解析」の場合は抽出したい形態素の並びを設定する。 Extract definition text box 1304 is for setting the definition of extraction. If the extraction method is "keyword", set the list of keywords to be extracted, if it is "pattern", set the regular expression pattern, and if it is "morphological analysis", set the sequence of morphemes to be extracted.
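
How the three extraction methods might be dispatched can be sketched as follows. Only the "keyword" and "pattern" methods are shown, because the "morphological analysis" method needs an external analyzer. The function names, the sample text, and the sample definitions are hypothetical.

```python
import re
from typing import List, Union

Definition = Union[List[str], str]


def extract(text: str, method: str, definition: Definition) -> List[str]:
    """Return the strings in text that match the extraction definition."""
    if method == "keyword":
        # definition is a list of keywords; keep those that occur in the text
        return [keyword for keyword in definition if keyword in text]
    if method == "pattern":
        # definition is a regular expression; keep every full match
        return [m.group(0) for m in re.finditer(definition, text)]
    raise ValueError(f"unsupported extraction method: {method}")


text = "Order No. PO-2020-0042, building use: office building"

order_numbers = extract(text, "pattern", r"PO-\d{4}-\d{4}")
building_uses = extract(text, "keyword", ["office building", "warehouse"])
print(order_numbers, building_uses)
```

The matched strings would then be registered as the value of the corresponding field.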

フィールド重みテキストボックス１３０５は、フィールド重みを設定するためのものである。 The field weight text box 1305 is for setting the field weight.

フィールド削除ボタン１３０６は、押下することで該フィールドの抽出定義を削除するためのものである。 The field deletion button 1306 is for deleting the extraction definition of the field by pressing the field deletion button 1306.

抽出定義フィールド追加ボタン１３０７は、押下することで空欄のフィールド名テキストボックス１３０２、抽出方式プルダウンリスト１３０３、抽出定義テキストボックス１３０４、フィールド重みテキストボックス１３０５、フィールド削除ボタン１３０６が最下行に追加され新しいフィールドの定義ができるようになる。 When the extraction definition field addition button 1307 is pressed, a blank field name text box 1302, extraction method pull-down list 1303, extraction definition text box 1304, field weight text box 1305, and field deletion button 1306 are added to the bottom line and a new field is added. Can be defined.

このようにして、カテゴリごとに抽出定義を設定することにより検索精度の向上が見込まれる。例えば、登録文書内の建築設計書の工事概要と注文書を比較した場合、工事概要の住所（建設場所）の情報は地形や適用される自治体の条例が異なるなど非常に重要な項目であるが、注文書の住所は特に重要な情報でないため、工事概要ではフィールド重みを高め（例えば３）に、注文書では低め（例えば０．１）に設定することで、同じフィールドでのカテゴリごと重要度の違いを表現できる。このように設定することで、住所で検索を行った場合、検索スコアが高めになる工事概要が検索結果上位に、検索スコアが低めになる注文書は検索下位に表示されることが見込め、検索ユーザの意図に沿った検索結果となりやすい。 In this way, it is expected that the search accuracy will be improved by setting the extraction definition for each category. For example, when comparing the construction outline of the building design document in the registration document with the order form, the information of the address (construction location) of the construction outline is a very important item such as the topography and the applicable local ordinances are different. Since the address of the purchase order is not particularly important information, by setting the field weight higher (for example, 3) in the construction outline and lower (for example, 0.1) in the purchase order, the importance of each category in the same field is set. Can express the difference between. By setting in this way, when searching by address, it is expected that the construction outline with a high search score will be displayed at the top of the search results, and the purchase order with a low search score will be displayed at the bottom of the search. Search results are likely to be in line with the user's intentions.
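
The effect of per-category weights can be illustrated with the example values above (3 for the construction outline, 0.1 for the purchase order). The dictionary layout and the function name are assumptions; the single occurrence per document is a hypothetical input.

```python
from typing import Dict

# category -> field name -> weight (example values from the description)
FIELD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "construction outline": {"address": 3.0},
    "purchase order": {"address": 0.1},
}


def weighted_hits(category: str, field: str, occurrences: int) -> float:
    """Contribution of one field to the field score, using the category's weight."""
    return occurrences * FIELD_WEIGHTS[category][field]


# The same address appears once in a document of each category.
outline = weighted_hits("construction outline", "address", 1)
order = weighted_hits("purchase order", "address", 1)
print(outline > order)
```

Although both documents contain the address the same number of times, the construction outline receives the larger contribution and is ranked higher.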

仮にカテゴリごとにフィールド重みを設定しなかった場合、住所で検索した場合、どのカテゴリの文書でも住所を重視するよう設定した場合、工事概要と注文書の両方が検索結果に混在することになり利便性が低下すると考えられる。 If field weights were not set per category, that is, if the address were set to be emphasized for documents of every category, a search by address would mix both construction outlines and purchase orders in the search results, which is considered to reduce convenience.

図１５の検索セッション統計情報テーブル１４００は、検索セッションの統計情報を保持するためのテーブルであり、文書ＩＤ１４０１、カテゴリ１４０２、セッション閲覧数１４０３の項目からなる。 The search session statistical information table 1400 of FIG. 15 is a table for holding the statistical information of a search session, and consists of the items document ID 1401, category 1402, and number of session views 1403.

文書ＩＤ１４０１には、検索でヒットした文書を特定するための項目であり、ヒットした文書のＩＤが登録される。 The document ID 1401 is an item for identifying a document hit by the search, and the ID of the hit document is registered in it.

カテゴリ１４０２には、該文書のカテゴリが登録される。 The category of the document is registered in the category 1402.

セッション閲覧数１４０３には、ユーザが検索セッション中に該文書を閲覧した回数を記録する。 The session viewing number 1403 records the number of times the user has viewed the document during the search session.

以上、本実施形態について示したが、本発明は、例えば、システム、装置、方法、プログラムもしくは記録媒体等としての実施態様をとることが可能である。具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although the present embodiment has been described above, the present invention can be implemented as, for example, a system, an apparatus, a method, a program, a recording medium, or the like. Specifically, it may be applied to a system composed of a plurality of devices, or may be applied to a device composed of one device.

また、本発明におけるプログラムは、図３～図１１に示すフローチャートの処理方法をコンピュータが実行可能なプログラムであり、本発明の記憶媒体は図３～図１１の処理方法をコンピュータが実行可能なプログラムが記憶されている。 Further, the program in the present invention is a program in which a computer can execute the processing methods of the flowcharts shown in FIGS. 3 to 11, and the storage medium of the present invention is a program in which the computer can execute the processing methods in FIGS. 3 to 11. Is remembered.

以上のように、前述した実施形態の機能を実現するプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたプログラムを読み出し、実行することによっても本発明の目的が達成されることは言うまでもない。 As described above, a recording medium recording a program that realizes the functions of the above-described embodiment is supplied to the system or device, and the computer (or CPU or MPU) of the system or device stores the program in the recording medium. Needless to say, the object of the present invention can be achieved by reading and executing.

この場合、記録媒体から読み出されたプログラム自体が本発明の新規な機能を実現することになり、そのプログラムを記録した記録媒体は本発明を構成することになる。 In this case, the program itself read from the recording medium realizes the novel function of the present invention, and the recording medium on which the program is recorded constitutes the present invention.

プログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＤＶＤ－ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク等を用いることが出来る。 Recording media for supplying programs include, for example, flexible disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, DVD-ROMs, magnetic tapes, non-volatile memory cards, ROMs, EEPROMs, and silicon. A disc or the like can be used.

また、コンピュータが読み出したプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, by executing the program read by the computer, not only the function of the above-described embodiment is realized, but also the OS (operating system) or the like running on the computer is actually realized based on the instruction of the program. Needless to say, there are cases where a part or all of the processing is performed and the processing realizes the functions of the above-described embodiment.

さらに、記録媒体から読み出されたプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, after the program read from the recording medium is written in the memory provided in the function expansion board inserted in the computer or the function expansion unit connected to the computer, the function expansion board is based on the instruction of the program code. It goes without saying that there are cases where the CPU or the like provided in the function expansion unit performs a part or all of the actual processing, and the processing realizes the functions of the above-described embodiment.

また、本発明は、複数の機器から構成されるシステムに適用しても、ひとつの機器から成る装置に適用しても良い。また、本発明は、システムあるいは装置にプログラムを供給することによって達成される場合にも適用できることは言うまでもない。この場合、本発明を達成するためのプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Further, the present invention may be applied to a system composed of a plurality of devices or a device composed of one device. It goes without saying that the present invention can also be applied when it is achieved by supplying a program to a system or an apparatus. In this case, by reading the recording medium containing the program for achieving the present invention into the system or device, the system or device can enjoy the effect of the present invention.

さらに、本発明を達成するためのプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 Further, by downloading and reading a program for achieving the present invention from a server, database, or the like on a network by a communication program, the system or device can enjoy the effect of the present invention. It should be noted that the present invention also includes all the configurations in which each of the above-described embodiments and modifications thereof are combined.

２０００文書検索システム 2000 Document retrieval system
１００情報処理装置 100 Information processing device
１０１文書登録処理部 101 Document registration processing unit
１０２文書検索処理部 102 Document retrieval processing unit
１０３形態素解析辞書 103 Morphological analysis dictionary
１０４登録文書インデックス 104 Registered document index
１０５抽出定義ＤＢ 105 Extraction definition DB
１０６検索セッション統計情報 106 Search session statistics
１０７文書ＤＢ 107 Document DB
１０８クライアントＰＣ 108 Client PC

Claims

A storage means for storing the weights set in each field contained in the document for each document category,
A calculation means for calculating the score of the document based on the weight set in the field and the relationship between the field and the search term.
An information processing system characterized by being equipped with.

The information processing system according to claim 1, wherein the calculation means calculates a score of the document based on a weight set in the field and the number of search terms included in the field.

The information processing system according to claim 1 or 2, further comprising a display means for displaying search results for the search term based on a score calculated by the calculation means.

The information processing system according to claim 3, further comprising an adjusting means for adjusting the weight of the field related to the category of the document based on the browsing record for the document displayed on the display means.

A reception means for accepting conditions for extracting fields from a document,
A field extraction means for extracting a field from a document to be searched according to the conditions accepted by the reception means,
The information processing system according to any one of claims 1 to 4, further comprising the reception means and the field extraction means.

It is an information processing method in an information processing system provided with a storage means for storing weights set in each field included in a document for each document category.
An information processing method characterized in that a calculation means of the information processing system includes a calculation step of calculating a score of the document based on a weight set in the field and a relationship between the field and a search term.

A program for causing a computer to function as a storage means for storing the weights set in each field contained in the document for each document category, and
as a calculation means for calculating the score of the document based on the weight set in the field and the relationship between the field and the search term.