JP2009245179A

JP2009245179A - Document retrieval support device

Info

Publication number: JP2009245179A
Application number: JP2008091266A
Authority: JP
Inventors: Naoshi Kono; 尚士河野
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2008-03-31
Filing date: 2008-03-31
Publication date: 2009-10-22

Abstract

[Problem] To reduce the user's burden associated with document search.
[Solution] A document search support apparatus extracts document files containing a designated search term as candidate documents. When a document file similar in content to the target document is selected from the candidate documents as a similar document, the apparatus extracts the terms contained in the similar document as candidate terms and ranks each candidate term according to its rarity in a predetermined group of document files. Candidate terms ranked at or above a predetermined rank are then presented to the user as suitable search terms for finding the target document.
[Selected drawing] Figure 1

Description

The present invention relates to document search technology.

With the spread of computers and advances in network technology, the exchange of electronic information over networks has become widespread. Many clerical tasks that used to be paper-based are being replaced by network-based processing. Digitization and advances in network technology have sharply lowered the cost of acquiring information. Under these circumstances, techniques for finding a desired document file among a large number of document files are becoming increasingly important.
Patent Document 1: JP 2002-015001 A

In a typical document search method, the user first enters one or more words as “search terms”. Usually, words believed to be contained in the document file being sought (hereinafter the “target document”) are chosen as search terms. The user identifies the target document by trying various search expressions that combine multiple search terms. With this method, the choice of search expression, in other words which search terms are combined and how, determines whether the target document is found. However, composing search expressions depends heavily on personal skills such as experience and intuition, which places a heavy burden on the user. Moreover, it is difficult to compose a suitable search expression without some idea of the words that the target document is certain to contain.

Another document search method is a technique called concept search (see, for example, Patent Document 1). Concept search finds document files whose content is related to a natural-language sentence entered by the user. Because the target document can be searched for by entering an outline of it in natural language, the user is freed from the burden of composing search expressions. However, when the search fails to find the target document, the user cannot tell which part of the natural-language sentence should be changed, or how, to obtain better results. This makes it hard to learn from a failed search when attempting the next one.

The present invention was made in view of the above problems, and its main object is to provide a technique that reduces the user's burden associated with document search and makes document search more efficient.

One aspect of the present invention relates to a document search support apparatus.
This apparatus acquires a document file similar in content to the target document as a similar document, extracts the terms contained in the similar document as candidate terms, and ranks each candidate term according to its rarity in a predetermined group of document files. Candidate terms ranked at or above a predetermined rank are then presented to the user as suitable search terms for finding the target document.

There need not be only one similar document; there may be several. When candidate terms are extracted from multiple similar documents, the terms contained in any one of the similar documents may be extracted as candidate terms, or only the terms common to all of the similar documents may be extracted.
A “term” may be a so-called “morpheme” or a combination of morphemes. For example, the two terms “user” and “interface” may additionally be extracted from the term “user interface”.

Any combination of the components described above, and any expression of the present invention as a method, system, recording medium, or computer program, is also valid as an aspect of the present invention.

According to the present invention, the user's burden in document search can be reduced and efficient document search can be achieved.

FIG. 1 is a schematic diagram of document search in this embodiment.
Assume a situation in which a desired published patent application is to be found in a patent database as the “target document”. The patent database may be an existing database such as the “Patent Digital Library” (特許電子図書館). Hereinafter, the group of document files that is stored in a document database such as a patent database and is subject to search is called the “corpus”.

Suppose the target document is a published patent application for an invention relating to “stereoscopic vision”. Terms likely to appear in the target document include “virtual reality”, “virtual environment”, “virtual space”, “coordinate system”, “stereoscopic vision”, “three-dimensional”, “vision”, and so on. Suppose the user first enters “solid” (立体), which seems certain to appear in the target document, as a search term. Hereinafter, a search term explicitly entered by the user is called a “designated search term”.

S1. Search process:
Published patent applications containing the designated search term “solid” are detected as “candidate documents”. The number of candidate documents containing the character string “solid” is usually enormous, and many of them concern technologies other than virtual reality. From this group of candidate documents, the user selects as “similar documents” those candidate documents that appear highly similar in content to the target document. For example, candidate documents that contain descriptions of virtual reality are selected. Suppose the user skims about 20 to 30 candidate documents and selects three of them as similar documents.

S2. Extraction process:
The terms contained in each similar document are extracted as “candidate terms”. Candidate terms contained in the similar documents are likely to be contained in the target document as well.

S3. Ranking process:
The candidate terms are ranked according to their importance. A rank based on importance is called an “importance rank”. The importance of a candidate term is calculated from the number and positions of its occurrences in the similar documents, its rarity across the whole corpus, and similar factors. The method of calculating importance is described in detail with reference to FIG. 3. The higher the importance of a candidate term, the more likely it is to be a useful search term for identifying the target document. Candidate terms whose importance rank is at or above a predetermined rank, for example within the top 20, are presented as candidates for new search terms. Hereinafter, a search term presented as a result of the ranking process is called a “suggested search term”.

For example, suppose the candidate term with importance rank 1 is “parallax”. In this case, the user can run a new search with a search expression of the form “designated search term * suggested search term” (where “*” denotes logical conjunction, that is, an AND condition). In this example, the search expression “solid * parallax” is entered. The search process (S1) based on this expression narrows the results to candidate documents containing both “solid” and “parallax”. The subsequent steps are the same as before.

With this processing method, the number of candidate documents can be narrowed down gradually by repeatedly adding designated search terms and selecting similar documents. Because each new designated search term is chosen from the top 20 suggested search terms, the user does not have to think up every designated search term of the search expression from scratch. As a result, the burden of devising search terms and search expressions is greatly reduced.

In conventional document search methods, it often happens that after one search expression has been composed and a search executed, the expression, and in particular the search terms it contains, turns out to be inappropriate, so that an entirely different search expression must be composed from scratch and the search run again. The same is true of concept search: after a natural-language sentence has been written and a search executed, the sentence often turns out to be inappropriate, and an entirely different sentence must be written from scratch.

In contrast, the document search method of this embodiment narrows down the number of candidate documents gradually while adding search terms in a principled way. The search expression therefore does not have to be rebuilt from scratch over and over. For example, if the results obtained with the search expression “solid * parallax” are unsatisfactory, the user can adopt “virtual”, which has importance rank 2, in place of “parallax”, which has importance rank 1, and search again with the expression “solid * virtual”. Alternatively, the search expression “solid * (parallax + virtual)” may be used (where “+” denotes logical disjunction, that is, an OR condition). Gradually narrowing the search range, starting from a general designated search term such as “solid”, makes it easier to reach the target document.
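The AND/OR narrowing described above can be illustrated with a small sketch. It assumes, purely for illustration, that a search expression is represented as a list of OR-groups that are combined with AND, and that matching is plain substring search; the corpus contents and the function names `matches` and `search` are hypothetical and are not part of the original disclosure.

```python
def matches(text, expression):
    """Return True if the text satisfies the expression.

    `expression` is a list of OR-groups: every group must match (AND),
    and a group matches if any one of its terms occurs in the text (OR).
    """
    return all(any(term in text for term in group) for group in expression)


def search(corpus, expression):
    """Return the ids of all documents in the corpus that match the expression."""
    return [doc_id for doc_id, text in corpus.items() if matches(text, expression)]


# Hypothetical toy corpus used only for this illustration.
corpus = {
    "doc1": "solid image displayed with parallax",
    "doc2": "solid object placed in a virtual space",
    "doc3": "solid fuel composition",
    "doc4": "virtual keyboard layout",
}

broad = search(corpus, [["solid"]])                            # solid
narrow = search(corpus, [["solid"], ["parallax"]])             # solid * parallax
either = search(corpus, [["solid"], ["parallax", "virtual"]])  # solid * (parallax + virtual)
```

Each added AND group can only shrink the result set, while widening an OR group can only grow it, which is why swapping or OR-combining suggested search terms lets the user adjust the candidate set without starting over.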

Note that similar documents do not necessarily have to be selected from the candidate documents; a given similar document may be input directly. For example, when considering a patent application for an idea contained in a draft paper or a design document, inputting that draft paper or design document as a similar document makes it easier to judge whether an invention close in content to the idea has already been filed.

FIG. 2 is a functional block diagram of the document search support apparatus 100.
In terms of hardware, each block shown here can be implemented by elements such as a computer's CPU and by mechanical devices; in terms of software, it is implemented by a computer program or the like. The figure depicts functional blocks realized through their cooperation. Those skilled in the art will therefore understand that these functional blocks can be implemented in various forms by combinations of hardware and software.

The document search support apparatus 100 includes an IF (interface) unit 110, a data processing unit 140, and a data holding unit 160.
The IF unit 110 handles the interface with the user and with the corpus. The data processing unit 140 executes various kinds of data processing based on data acquired from the IF unit 110 and the data holding unit 160. The data processing unit 140 also serves as an interface between the IF unit 110 and the data holding unit 160. The data holding unit 160 is a storage area for holding various data.

IF unit 110:
The IF unit 110 includes a document acquisition unit 112, a search expression acquisition unit 120, a presentation unit 126, and a search execution unit 132. The document acquisition unit 112 and the search execution unit 132 mainly handle the interface with the corpus, while the search expression acquisition unit 120 and the presentation unit 126 mainly handle the interface with the user.

The document acquisition unit 112 acquires various document files. It includes a similar document acquisition unit 114, a dissimilar document acquisition unit 116, and a candidate document acquisition unit 118. The candidate document acquisition unit 118 acquires from the corpus the candidate documents identified as a result of the search process (S1). The similar document acquisition unit 114 acquires from the corpus the similar documents that the user selected from among the candidate documents. The similar document acquisition unit 114 may also acquire similar documents directly from the user. The dissimilar document acquisition unit 116 acquires document files that are dissimilar in content to the target document as “dissimilar documents”. The dissimilar document acquisition unit 116 may likewise acquire dissimilar documents directly from the user. Dissimilar documents are described in detail with reference to FIG. 5 and FIG. 6.

The search expression acquisition unit 120 accepts input of various search expressions for finding the target document. A search expression contains one or more designated search terms. The search expression acquisition unit 120 includes a search term acquisition unit 122, which extracts the designated search terms contained in a search expression.

The presentation unit 126 presents various information to the user. It includes a search term presentation unit 128 and a search expression presentation unit 130. The search term presentation unit 128 displays on screen the suggested search terms as well as the “related search terms” and “excluded search terms” described later. The search expression presentation unit 130 additionally presents to the user suitable search expressions that contain suggested search terms and the like. The search expressions presented by the search expression presentation unit 130 are described later with reference to FIG. 4 and FIG. 7.
The search execution unit 132 executes a document search according to the specified search expression. The function of the search execution unit 132 may be implemented by a device other than the document search support apparatus 100.

Data holding unit 160:
The data holding unit 160 includes a related table holding unit 162, which holds a “related table”. In the related table, one term is associated with another term. For example, a term is associated with its synonyms. Details of the related table are described later with reference to FIG. 4.

Data processing unit 140:
The data processing unit 140 includes a term extraction unit 142, a related processing unit 148, and a rank setting unit 154. The term extraction unit 142 extracts terms from document files. It includes a candidate term extraction unit 144 and an unsuitable term extraction unit 146. The candidate term extraction unit 144 extracts the terms in similar documents as candidate terms. The unsuitable term extraction unit 146 extracts the terms in dissimilar documents as “unsuitable terms”.

The rank setting unit 154 calculates the importance of the candidate terms and assigns an importance rank to each candidate term. Candidate terms within a predetermined rank, for example within the top 20, are shown to the user as suggested search terms by the search term presentation unit 128. Similarly, the rank setting unit 154 calculates the importance of the unsuitable terms and assigns a rank to each unsuitable term. The rank based on the importance of an unsuitable term is called its “unsuitability rank”.

The related processing unit 148 handles processing concerning related search terms. It includes a related search term identification unit 150 and a related table update unit 152. When a suggested search term has been identified, the related search term identification unit 150 identifies the terms associated with that suggested search term in the related table as related search terms. The related table update unit 152 updates the related table based on search expressions entered by the user. For example, suppose the user enters a search expression containing the two designated search terms “context switch” and “task switch”. In this case, the related table update unit 152 associates “context switch” with “task switch” and registers the pair in the related table.
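A minimal sketch of the related table update and lookup follows. It assumes that the table is a symmetric mapping from a term to a set of associated terms and that every pair of designated search terms in one search expression is associated; the source only gives the two-term example, so both choices, along with the names `update_related_table` and `related_terms`, are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def update_related_table(table, designated_terms):
    """Associate every pair of designated search terms with each other (assumed symmetric)."""
    for first, second in combinations(designated_terms, 2):
        table[first].add(second)
        table[second].add(first)
    return table


def related_terms(table, suggested_term):
    """Return the related search terms registered for a suggested search term."""
    return sorted(table.get(suggested_term, set()))


# The user enters a search expression containing these two designated search terms.
table = defaultdict(set)
update_related_table(table, ["context switch", "task switch"])

# Later, when "context switch" is identified as a suggested search term,
# "task switch" is returned as its related search term.
related = related_terms(table, "context switch")
```

Using `table.get` for lookups avoids creating empty entries for terms that were never registered.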

FIG. 3 is a conceptual diagram showing the processing steps up to the identification of suggested search terms.
First, suppose the user enters the character string “natural language” as a designated search term. The search term acquisition unit 122 acquires the character string “natural language” as the designated search term, and the search execution unit 132 detects published patent applications containing the character string “natural language” as candidate documents. The candidate document acquisition unit 118 acquires the candidate documents from the corpus. Suppose the user selects published patent applications A to C from among them as similar documents. The similar document acquisition unit 114 acquires the similar documents either from the corpus or from the group of candidate documents already acquired.

Various candidate terms are extracted from similar document A, such as “heteronym” (異音語), “input sentence”, “corpus”, “learning data”, “classifier”, and “feature vector”. Terms that occur in similar document A no more than a predetermined number of times, for example three times or fewer, may be excluded from extraction. Various candidate terms are extracted from similar documents B and C in the same way. However, general terms registered in a dictionary in advance are not extracted. For example, conjunctions such as “however” and “or”, and Japanese particles such as “は”, “に”, and “から”, do not become candidate terms. Terms that are common in patent specifications, such as “said”, “apparatus”, “preferred”, and “modification”, may also be excluded.
In this embodiment, only candidate terms that appear in all of the similar documents, that is, in every one of similar documents A to C, are subject to the subsequent analysis. For example, “heteronym”, extracted from similar document A, is assumed to appear at least once in each of similar documents B and C. Even if the term “recall” appears frequently in similar document A, if “recall” is not extracted from similar document B, it is excluded from the calculation of the individual importance and overall importance described below.
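The extraction rules above can be sketched as follows. The sketch assumes that each similar document has already been split into terms (morphological analysis is outside its scope), that the general-term dictionary is a small hypothetical set, and that the optional frequency filter is controlled by a `min_count` parameter; none of these names come from the source.

```python
from collections import Counter

# Hypothetical dictionary of general terms that never become candidate terms.
GENERAL_TERMS = {"however", "or", "said", "apparatus", "preferred", "modification"}


def extract_terms(tokens, general_terms=GENERAL_TERMS, min_count=1):
    """Extract candidate terms from one tokenized similar document.

    General terms are dropped, and terms occurring fewer than `min_count`
    times are dropped (min_count=4 would exclude terms seen 3 times or fewer).
    """
    counts = Counter(token for token in tokens if token not in general_terms)
    return {term for term, count in counts.items() if count >= min_count}


def common_candidates(similar_docs, **kwargs):
    """Keep only the candidate terms extracted from every similar document."""
    term_sets = [extract_terms(tokens, **kwargs) for tokens in similar_docs]
    return set.intersection(*term_sets) if term_sets else set()


doc_a = ["heteronym", "corpus", "recall", "recall", "however", "classifier"]
doc_b = ["heteronym", "classifier", "or", "corpus"]
doc_c = ["corpus", "heteronym", "apparatus"]

candidates = common_candidates([doc_a, doc_b, doc_c])
```

In this toy input, “recall” is frequent in document A but absent from B and C, so it is dropped, mirroring the “recall” example in the text.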

The rank setting unit 154 calculates the individual importance of each candidate term for each similar document. A candidate term that appears in the “claims” is likely to be a strong indicator of the content of that similar document. A candidate term that appears in the “application form”, on the other hand, is considered a much weaker indicator. Points are therefore assigned according to where a candidate term appears. In this embodiment, a candidate term is assigned 0 points for each occurrence in the “application form”, 2.0 points for each occurrence in the “claims” or the “abstract”, 1.8 points for each occurrence in the “title of the invention”, and 1.0 point for each occurrence in the “best mode for carrying out the invention”. If the candidate term “heteronym” (異音語) appears 5 times in the “claims” of similar document A and 22 times in its “best mode for carrying out the invention”, the individual importance of “heteronym” in similar document A is 32.0 (= 2.0 × 5 + 1.0 × 22). For similar document A shown in the figure, the terms are ranked as follows: 1st “learning data” (42.8), 2nd “heteronym” (32.0), 3rd “corpus” (29.8), and so on. The more often a candidate term appears in important sections such as the “claims”, the higher its individual importance. The individual importance of the candidate terms is calculated for similar documents B and C as well.
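The point scheme above amounts to a weighted occurrence count. The following sketch reproduces the worked example; the section keys and the function name `individual_importance` are labels chosen for illustration, while the point values are the ones stated in the text.

```python
# Points per occurrence, by the section in which the candidate term appears.
SECTION_POINTS = {
    "application_form": 0.0,
    "claims": 2.0,
    "abstract": 2.0,
    "title": 1.8,
    "best_mode": 1.0,
}


def individual_importance(occurrences, points=SECTION_POINTS):
    """Sum points-per-occurrence over the sections in which a term appears.

    `occurrences` maps a section key to the number of times the term
    appears in that section of one similar document.
    """
    return sum(points[section] * count for section, count in occurrences.items())


# The candidate term appears 5 times in the claims and 22 times in the best mode.
score = individual_importance({"claims": 5, "best_mode": 22})  # 2.0 * 5 + 1.0 * 22
```

Occurrences in the application form contribute nothing, so a term that appears only there receives an individual importance of zero.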

Next, the individual importance values of each candidate term are combined to calculate its overall importance. The calculation follows the idea of the “vector space method”. That is, a rare candidate term that seldom appears in a predetermined group of document files is considered a strong indicator of the content of the similar documents, so the calculation is adjusted to make the overall importance of rare candidate terms relatively high. Conversely, a commonplace candidate term that appears in many of the document files in this group is unlikely to be a strong indicator of the content of the similar documents, so the calculation makes the overall importance of such general terms relatively low.

In this embodiment, the overall importance of each candidate term is calculated by

$$\text{overall importance of candidate term } W_i = (n_1 + n_2 + n_3)\cdot\log\{(Dn/Rn) + 1\}$$

Here Dn is the total number of document files in the corpus. For example, it may be the total number of document files in the patent database. Rn is the number of document files in the corpus that contain the candidate term Wi. The factor log{(Dn/Rn) + 1} grows as the number Rn of document files containing Wi becomes smaller relative to the total number Dn of document files. In other words, the rarer the candidate term Wi, the larger this factor. n_1, n_2, and n_3 are the individual importance values of Wi in similar documents A, B, and C, respectively. For example, the candidate term “feature vector” is contained in all of similar documents A to C. For “feature vector”, n_1 = 18.4, n_2 = 12.2, and n_3 = 4.3, so n_1 + n_2 + n_3 = 34.9.
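The formula can be sketched directly in code. The source does not state the base of the logarithm, so the natural logarithm is assumed here; the corpus sizes used in the example (Dn = 1000, Rn = 10 or 500) are hypothetical values chosen only to show the effect of rarity, and the function name `overall_importance` is illustrative.

```python
import math


def overall_importance(individual_scores, total_docs, docs_with_term):
    """Compute (n_1 + n_2 + ...) * log(Dn / Rn + 1).

    individual_scores: individual importance of the term in each similar document
    total_docs:        Dn, the total number of document files in the corpus
    docs_with_term:    Rn, the number of corpus document files containing the term
    The natural logarithm is an assumption; the source does not give the base.
    """
    if docs_with_term <= 0:
        raise ValueError("the term must occur in at least one corpus document")
    rarity = math.log(total_docs / docs_with_term + 1)
    return sum(individual_scores) * rarity


# Individual importance of "feature vector" in similar documents A, B, and C.
feature_vector_scores = [18.4, 12.2, 4.3]

# Hypothetical corpus statistics: the same term scored as rare and as common.
rare = overall_importance(feature_vector_scores, total_docs=1000, docs_with_term=10)
common = overall_importance(feature_vector_scores, total_docs=1000, docs_with_term=500)
```

With identical individual scores, the rare case yields the larger overall importance, which is the behavior the text attributes to the log{(Dn/Rn) + 1} factor.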

The higher the individual importance of a candidate term, the higher its overall importance. Likewise, the rarer a candidate term, in the sense of appearing in only a small fraction of the document files in the corpus, the higher its overall importance. The overall importance of the candidate terms extracted from the similar documents is calculated in this way, and the terms are then ranked. For similar documents A to C in the figure, the ranking is: 1st “feature vector” (91.4), 2nd “learning data” (84.8), 3rd “similarity” (81.2), and so on. The search expression presentation unit 130 displays the candidate terms within the top 20 on screen as “suggested search terms”. The user selects the next designated search term from among these suggested search terms. For example, search expressions such as “natural language * feature vector” or “natural language * (feature vector + learning data)” are possible next search expressions. Hereinafter, the overall importance is simply called “importance”.
Note that the extraction and selection of candidate terms and the calculation of importance based on occurrence count, position, and rarity do not have to be executed in the order described above. For example, the terms common to similar documents A to C may first be extracted as candidate terms, the dictionary-registered general terms removed from them, the candidate terms narrowed further according to occurrence count, and narrowed again to rare candidate terms for which Dn/Rn is at least a predetermined value, before the individual importance and overall importance are calculated.
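The final ranking and cut-off can be sketched as a sort followed by a slice. The first three scores below are the ones quoted from the figure; the fourth entry and the function name `suggest_terms` are hypothetical additions for illustration.

```python
def suggest_terms(importance_by_term, top_n=20):
    """Rank candidate terms by descending importance and keep the top `top_n`."""
    ranked = sorted(importance_by_term.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:top_n]]


importance = {
    "learning data": 84.8,
    "feature vector": 91.4,
    "similarity": 81.2,
    "input sentence": 40.0,  # hypothetical value, not taken from the source
}

top_three = suggest_terms(importance, top_n=3)
```

The default of 20 follows the example cut-off in the text; any other predetermined rank can be passed as `top_n`.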

FIG. 4 is a data structure diagram of the related table.
The related table is stored in the related table holding unit 162 and is a table in which multiple terms are registered in association with one another. The basic term column 170 lists each term. In a typical document search for published patent applications about a “spoon”, not only “spoon” (スプーン) but also terms close in meaning to it are often used as search terms. For example, the synonyms “さじ”, “サジ”, and “匙” (the native Japanese word for spoon written in hiragana, katakana, and kanji) are candidates, and the search expression “spoon + さじ + サジ + 匙” is composed. Likewise, for the term “virtual reality”, the synonyms “artificial reality” and “mixed reality”, the loanword spellings “ヴァーチャルリアリティ” and “バーチャルリアリティ”, and the abbreviation “VR” are also candidate search terms. To avoid missing relevant documents, it is necessary to consider not just one search term but also other search terms related to it.

The synonym column 172 lists synonyms of the terms in the basic term column 170. Synonyms are registered in a dictionary in advance. For example, the term スプーン (“spoon”) is associated with the synonyms さじ, サジ and 匙, which are three spellings of the native Japanese word saji. When スプーン is listed as a suggested search term, the related search term specifying unit 150 refers to the related table and identifies さじ, サジ and 匙 as “related search terms”. The search term presenting unit 128 displays on the screen not only the suggested search terms but also their related search terms.
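One possible in-memory shape for the related table and the lookup performed by the related search term specifying unit 150 is sketched below. The class and function names (`RelatedEntry`, `related_search_terms`) are hypothetical, and returning only the synonym column is an assumption made to match the “spoon” example; the source does not specify the storage format.

```python
from dataclasses import dataclass, field

@dataclass
class RelatedEntry:
    """One row of the related table, keyed by a basic term (column 170)."""
    synonyms: list = field(default_factory=list)       # synonym column 172
    related_words: list = field(default_factory=list)  # related word column 174

# Illustrative contents taken from the examples in the description.
RELATED_TABLE = {
    "スプーン": RelatedEntry(synonyms=["さじ", "サジ", "匙"], related_words=["皿"]),
    "仮想現実": RelatedEntry(synonyms=["人工現実", "複合現実", "VR"]),
}

def related_search_terms(table, suggested_term):
    """Return the related search terms registered for a suggested search term.

    Assumption: only the synonym column is consulted here.
    """
    entry = table.get(suggested_term)
    return list(entry.synonyms) if entry else []
```

A term with no row in the table simply yields an empty list, so the suggested search term would be shown on its own.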

For example, suppose that フォーク (“fork”) is entered as the designated search term and, as a result, スプーン (“spoon”) is identified as a suggested search term. Because さじ, サジ and 匙 (three spellings of saji, “spoon”) are associated with スプーン in the related table, these terms are joined by an OR condition (the + sign) to form a “search set”. The search expression presenting unit 130 generates the search expression “フォーク * search set”, that is, “フォーク * (スプーン + さじ + サジ + 匙)”, and displays it on the screen. If フオーク (a variant katakana spelling of “fork”) is registered as a synonym of フォーク, the search expression presenting unit 130 also includes the related search terms for the designated search term, not only those for the suggested search term, and generates and displays the search expression “(フォーク + フオーク) * (スプーン + さじ + サジ + 匙)”.

Further, suppose that entering the search expression “(フォーク + フオーク) * (スプーン + さじ + サジ + 匙)” results in 信号 (“signal”) being identified as a suggested search term, with シグナル (the loanword for “signal”) and メッセージ (“message”) identified as its related search terms. In this case, the search set “信号 + シグナル + メッセージ”, which joins the three terms by an OR condition, is generated, and the search expression “(フォーク + フオーク) * (スプーン + さじ + サジ + 匙) * (信号 + シグナル + メッセージ)” is generated and displayed on the screen. The same applies thereafter.
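The way search sets are joined can be illustrated with a small sketch. It assumes a plain dictionary of synonyms and ASCII `+` (OR) and `*` (AND) operators; the function names `search_set` and `build_expression` are hypothetical and not taken from the source.

```python
def search_set(term, synonyms):
    """Join a term and its registered synonyms with OR (+)."""
    terms = [term] + [s for s in synonyms.get(term, []) if s != term]
    if len(terms) == 1:
        return term  # no synonyms registered: no parentheses needed
    return "(" + "+".join(terms) + ")"

def build_expression(terms, synonyms):
    """Join the search set of every term with AND (*)."""
    return "*".join(search_set(t, synonyms) for t in terms)

# Synonyms from the fork / spoon / signal example in the description.
SYNONYMS = {
    "フォーク": ["フオーク"],
    "スプーン": ["さじ", "サジ", "匙"],
    "信号": ["シグナル", "メッセージ"],
}

expression = build_expression(["フォーク", "スプーン", "信号"], SYNONYMS)
```

Each round of suggestions then only appends one more search set to the expression, which matches the step-by-step narrowing described above.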

The related word column 174 lists related words. For example, suppose that a search expression containing both “spoon” (スプーン) and “dish” (皿), such as “スプーン + 皿” or “スプーン * 皿”, is entered. The related table update unit 152 then registers 皿 as a related word of スプーン, or スプーン as a related word of 皿, in the related table. The related table update unit 152 may also record a predetermined number of past search expressions and register in the related table those terms that are combined especially often. For example, if at least a predetermined number, say 20, of 1,000 search expressions entered in the past contain both 皿 and スプーン, the two terms may be associated. スプーン and 皿 may also be associated when a synonym of 皿, such as 食器 (“tableware”) or “dish”, rather than 皿 itself, appears together with スプーン in the same search expression. Registering related words in this way makes it possible to build a suitable related table from search expressions written by experienced searchers. For example, suppose that experienced searchers who enter “virtual reality” (仮想現実) as a designated search term often join its synonym “artificial reality” (人工現実) to it with an OR condition.
In that case, even an inexperienced user can learn that the term “artificial reality” exists as a concept similar to “virtual reality”, and can more easily write a search expression with few omissions.
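The frequency-based registration of related words could look roughly like the sketch below. It assumes the search history is a list of expression strings using `+`, `*` and parentheses, uses the figures from the example (the 1,000 most recent expressions, a threshold of 20), and ignores synonym expansion and excluded terms. The names `terms_in` and `frequent_pairs` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def terms_in(expression):
    """Split a search expression into its terms (operators and parentheses dropped)."""
    parts = re.split(r"[+*()＋＊（）]", expression)
    return {p.strip() for p in parts if p.strip()}

def frequent_pairs(history, window=1000, min_count=20):
    """Return term pairs that co-occur in at least min_count of the last
    `window` search expressions; these pairs would be registered as related words."""
    counts = Counter()
    for expression in history[-window:]:
        for pair in combinations(sorted(terms_in(expression)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_count}
```

Pairs are stored in sorted order, so “spoon with dish” and “dish with spoon” are counted as the same combination.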

FIG. 5 is a conceptual diagram for explaining how overall importance is adjusted based on inapplicable words.
The user may select not only similar documents but also dissimilar documents from among the candidate documents. A dissimilar document is a document file regarded as dissimilar in content to the target document. As with similar documents, dissimilar documents need not be selected from the candidate documents; a given dissimilar document may be input directly. Alternatively, every candidate document that was not selected as a similar document may be treated as a dissimilar document. The inapplicable word extraction unit 146 extracts the terms contained in the dissimilar documents as “inapplicable words”. The rank setting unit 154 calculates the inappropriateness of each inapplicable word using the same algorithm as for the importance of candidate terms.

A candidate term that is also an inapplicable word is likely to be closely tied to the content of the dissimilar documents, so it is not necessarily suitable for characterizing the similar documents. The rank setting unit 154 excludes from the suggested search terms any inapplicable word, identified from the dissimilar documents, that ranks within the top 20 by inappropriateness. In the figure, the candidate term “similarity”, ranked third in importance in the similar documents, is also the inapplicable word ranked fourth in inappropriateness in the dissimilar documents, so it is excluded from the suggested search terms.

As a variation, when a candidate term is also ranked as an inapplicable word, a predetermined value, for example 10 points, may be deducted from its importance. In the figure, the importance of “similarity” would be adjusted from 81.2 to 71.2. In this way, the rank setting unit 154 either adjusts the importance of a candidate term that was also extracted as an inapplicable word so that its importance rank falls, or excludes that candidate term from the suggested search terms. Adjusting importance ranks based on dissimilar documents makes it more likely that only candidate terms that properly characterize the similar documents are listed as suggested search terms.
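Both behaviours, exclusion and point deduction, can be sketched in a few lines. The function name `adjust_importance` and its parameters are hypothetical. The importance scores are the ones quoted for FIG. 3; the inapplicable-word list is assumed for illustration except that “similarity” sits at rank 4 and “execution time” is an inapplicable word.

```python
def adjust_importance(importance, inapplicable_ranked,
                      top_n=20, penalty=10.0, exclude=False):
    """Lower or drop candidate terms that are also high-ranked inapplicable words.

    importance: dict mapping candidate term -> importance score.
    inapplicable_ranked: inapplicable words, most inappropriate first.
    Returns (term, score) pairs sorted by descending score.
    """
    top_inapplicable = set(inapplicable_ranked[:top_n])
    adjusted = {}
    for term, score in importance.items():
        if term in top_inapplicable:
            if exclude:
                continue          # drop from the suggested search terms
            score -= penalty      # variation: deduct a fixed number of points
        adjusted[term] = score
    return sorted(adjusted.items(), key=lambda item: item[1], reverse=True)

importance = {"feature vector": 91.4, "learning data": 84.8, "similarity": 81.2}
# Placeholder words fill ranks 2 and 3 so that "similarity" lands at rank 4.
inapplicable = ["execution time", "placeholder 2", "placeholder 3", "similarity"]

penalized = adjust_importance(importance, inapplicable)
excluded = adjust_importance(importance, inapplicable, exclude=True)
```

With the deduction variant “similarity” stays in the list at a lower score, while with `exclude=True` it disappears from the suggestions altogether.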

The rank setting unit 154 may check, for each candidate term within the top 20 by importance, whether it is included in the set of inapplicable words. For example, if the top-ranked candidate term “corpus” also appears in the dissimilar documents, in other words if it is in the set of inapplicable words, “corpus” may be excluded from the suggested search terms or its importance may be reduced. A search with “corpus” as the designated search term would retrieve dissimilar documents as well as similar ones, so “corpus”, although judged effective at characterizing the similar documents, may not be the best search term. From this standpoint, candidate terms that were also extracted as inapplicable words may be adjusted so that their rank falls or so that they drop out of the ranking.

Inapplicable words, particularly those ranked high in inappropriateness, may also be identified as “excluded search terms”. For example, the inapplicable word “execution time” is effective at characterizing the dissimilar documents, so the target document is likely to be a document file that does not contain it. Entering a search expression such as “natural language * (− execution time)”, where the minus sign “(−)” means exclusion, therefore targets document files that contain the term “natural language” and do not contain the term “execution time”. Setting excluded search terms makes it easier to remove dissimilar documents from the search range.

However, if the inapplicable word “execution time” is also in the set of candidate terms, using it as an excluded search term would exclude similar documents as well. The search term presenting unit 128 may therefore present to the user, as suitable “excluded search terms”, only those inapplicable words that are not in the set of candidate terms, in other words those that do not appear in the similar documents.
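A minimal sketch of this filtering, together with appending the result to a search expression, might look as follows. The helper names and the ASCII rendering `*(-term)` of the exclusion notation are assumptions made for illustration.

```python
def excluded_search_terms(inapplicable_ranked, candidate_terms, top_n=20):
    """Pick high-ranked inapplicable words that never occur as candidate terms,
    so that excluding them cannot remove similar documents."""
    candidates = set(candidate_terms)
    return [w for w in inapplicable_ranked[:top_n] if w not in candidates]

def with_exclusions(base_expression, excluded):
    """Append each excluded search term to the expression as *(-term)."""
    return base_expression + "".join(f"*(-{term})" for term in excluded)

safe = excluded_search_terms(
    ["execution time", "similarity"],   # inapplicable words, ranked
    ["similarity", "feature vector"],   # candidate terms from similar documents
)
expression = with_exclusions("natural language", safe)
```

Here “similarity” is not offered as an excluded search term because it also appears in the similar documents.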

FIG. 6 is a screen view of a search engine that applies the document search method of this embodiment.
The description so far has assumed searching a patent database for published patent gazettes, but the document search method of this embodiment is also applicable to general web search. The document search support apparatus 100 sends a web page to a client terminal and causes the client terminal to display the search screen 200. When a search expression is entered in the search expression input field 202 and the search button 204 is clicked, the search results are listed in the result display area 208. In the figure, the string “one click” (ワンクリック) has been entered in the search expression input field 202. When the search button 204 is clicked, the URIs (Uniform Resource Identifiers) of web pages containing the string “one click” are listed in the result display area 208. Up to this point, the user interface is the same as that of a typical search engine.

A check box 206 is displayed next to each page. In the check box 206 the user enters “A” for a page that appears similar in content to the target document, “B” for a page that appears somewhat similar, and “X” for a dissimilar page. Web pages marked “A” or “B” become similar documents, from which candidate terms are extracted. The importance of each candidate term is calculated by weighting the individual importance of candidate terms extracted from A-rated similar documents more heavily than that of candidate terms extracted from B-rated similar documents. Web pages marked “X” become dissimilar documents, from which inapplicable words are extracted. Suppose that “authentication” is identified as the top-ranked suggested search term. The search expression presenting unit 130 may then display the search expression “one click * authentication” in the search expression input field 202.
Clicking the search expression suggestion button 210 displays a search expression suggestion screen 212, which proposes several search expressions based on the “A”, “B” and “X” inputs.
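The A/B weighting can be illustrated as below. The text only states that terms from A-rated pages count more than terms from B-rated pages; combining them as a weighted sum, the weights 1.0 and 0.5, the sample scores and the name `overall_importance` are all assumptions made for this sketch.

```python
def overall_importance(individual, grades, weights=None):
    """Combine per-document individual importance into one score per term.

    individual: dict page id -> {term: individual importance}.
    grades: dict page id -> "A", "B" or "X" as entered in the check boxes.
    Pages graded "X" (dissimilar) or left ungraded contribute nothing here.
    """
    if weights is None:
        weights = {"A": 1.0, "B": 0.5}  # assumed weights; A counts more than B
    totals = {}
    for page_id, term_scores in individual.items():
        weight = weights.get(grades.get(page_id), 0.0)
        if weight == 0.0:
            continue
        for term, score in term_scores.items():
            totals[term] = totals.get(term, 0.0) + weight * score
    return totals

scores = overall_importance(
    {"p1": {"authentication": 10.0},
     "p2": {"authentication": 4.0, "patent": 6.0},
     "p3": {"fraud": 9.0}},
    {"p1": "A", "p2": "B", "p3": "X"},
)
```

In this sketch the “X” pages are skipped; their terms would instead feed the inapplicable-word ranking.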

FIG. 7 is a screen view of the search expression suggestion screen 212.
When the search expression suggestion button 210 is clicked, the search expression suggestion screen 212 shown in the figure is displayed as a dialog box. Suppose that the search results for the designated search term “one click”, together with the user's marking of similar and dissimilar documents, have yielded the suggested search terms “authentication” and “patent” and the excluded search term “fraud”. For simplicity, related search terms are not considered here. As shown in the figure, the search expression presenting unit 130 presents search expressions in various combinations, such as “one click * authentication”, “one click * (− fraud)” and “one click * patent * (− fraud)”, on the search expression suggestion screen 212. The user selects a search expression with the radio button next to it and clicks the re-search button 214. The search execution unit 132 then runs the web search again using the selected search expression. In this way, once the user has typed “one click”, the search can be continued with mouse clicks alone.
A ranked list of candidate terms and inapplicable words may be displayed instead of search expressions.
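One simple way to enumerate such combinations is sketched below. It pairs at most one suggested search term with at most one excluded search term; that restriction, the ASCII notation and the name `propose_expressions` are assumptions, since the source only says that “various combinations” are presented.

```python
from itertools import product

def propose_expressions(designated, suggested, excluded):
    """List search expressions that extend the designated search term with
    an optional suggested term and an optional excluded term."""
    proposals = []
    for s, e in product([None] + list(suggested), [None] + list(excluded)):
        if s is None and e is None:
            continue  # skip the unchanged original expression
        parts = [designated]
        if s is not None:
            parts.append(s)
        if e is not None:
            parts.append(f"(-{e})")
        proposals.append("*".join(parts))
    return proposals

proposals = propose_expressions("one click", ["authentication", "patent"], ["fraud"])
```

Each entry in `proposals` would correspond to one radio-button row on the suggestion screen.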

FIG. 8 is a flowchart showing the search process.
When a search expression is entered, the search term acquisition unit 122 acquires the designated search terms (S10). If the search expression contains multiple designated search terms, the related table update unit 152 updates the related table (S12). Here, assume that a search expression containing only the designated search term “virtual reality” is entered. The search execution unit 132 searches the corpus for candidate documents that match the search expression, in other words candidate documents containing the string “virtual reality”, and the candidate document acquisition unit 118 extracts the candidate documents from the corpus (S14). If the user ends the search (Y in S16), the search results are displayed on the screen and the process ends (S32).

If the search is not finished (N in S16), the similar document acquisition unit 114 accepts the user's selection of similar documents and acquires them from the candidate documents (S18). The candidate term extraction unit 144 extracts candidate terms from the similar documents (S20). The dissimilar document acquisition unit 116 accepts the user's selection of dissimilar documents and acquires them from the candidate documents (S22). The inapplicable word extraction unit 146 extracts inapplicable words from the dissimilar documents (S24). The rank setting unit 154 calculates the individual importance of each candidate term, then the (overall) importance, and then determines the importance rank of the candidate terms with reference to the set of inapplicable words (S26). The search term presenting unit 128 displays the suggested search terms on the screen (S28). The search expression presenting unit 130 displays search expressions containing the suggested search terms on the screen (S30).

For example, if the designated search term is “virtual reality” (仮想現実), the top-ranked suggested search term is スプーン (“spoon”) and the second-ranked suggested search term is フォーク (“fork”), search expressions such as “(仮想現実 + 人工現実 + 複合現実 + ...) * (スプーン + さじ + サジ + 匙 + ...)”, “(仮想現実 + 人工現実 + 複合現実 + ...) * (スプーン + さじ + サジ + 匙 + ...) * (フォーク + フオーク + ...)” or “(仮想現実 + 人工現実 + 複合現実 + ...) * (スプーン + さじ + サジ + 匙 + ... + フォーク + フオーク + ...)” are generated. Excluded search terms, or search expressions containing both suggested and excluded search terms, may also be displayed in addition to the suggested search terms.

The user refers to the proposed search expressions and selects a designated search term from among the suggested search terms. Any related search term may also be selected as a designated search term. A new search expression is then created from the selected suggested and related search terms. Alternatively, a search expression proposed by the document search support apparatus 100 may be adopted as is. The processing from S10 onward is then repeated.

The document search support apparatus 100 has been described above based on the embodiment.
With the document search support apparatus 100, even without entering a search expression that narrows the candidate documents sufficiently, the user can identify appropriate search terms from the suggested and related search terms by selecting similar and dissimilar documents. The number of candidate documents can then be narrowed gradually by adding suggested and related search terms to the search expression.

In addition, the importance of each candidate term, and hence its importance rank, is calculated from factors such as its position in the similar documents, its rarity in the corpus and its frequency in the dissimilar documents, which makes it easier to identify suggested search terms that are effective for locating the target document. Furthermore, defining relationships between terms in the related table helps prevent search omissions. With the related table, the document search support apparatus 100 can identify not only a search term but also its associated terms as related search terms.

The present invention has been described above based on the embodiment. The embodiment is illustrative; those skilled in the art will understand that various modifications of the combinations of its components and processing steps are possible, and that such modifications also fall within the scope of the present invention.

FIG. 1 is a schematic diagram of document search in this embodiment.
FIG. 2 is a functional block diagram of the document search support apparatus.
FIG. 3 is a conceptual diagram showing the process up to identifying suggested search terms.
FIG. 4 is a data structure diagram of the related table.
FIG. 5 is a conceptual diagram for explaining how overall importance is adjusted based on inapplicable words.
FIG. 6 is a screen view of a search engine that applies the document search method of this embodiment.
FIG. 7 is a screen view of the search expression suggestion screen.
FIG. 8 is a flowchart showing the search process.

Explanation of symbols

100 document search support apparatus, 110 IF unit, 112 document acquisition unit, 114 similar document acquisition unit, 116 dissimilar document acquisition unit, 118 candidate document acquisition unit, 120 search expression acquisition unit, 122 search term acquisition unit, 126 presentation unit, 128 search term presenting unit, 130 search expression presenting unit, 132 search execution unit, 140 data processing unit, 142 term extraction unit, 144 candidate term extraction unit, 146 inapplicable word extraction unit, 148 related processing unit, 150 related search term specifying unit, 152 related table update unit, 154 rank setting unit, 160 data holding unit, 162 related table holding unit, 170 basic term column, 172 synonym column, 174 related word column, 200 search screen, 202 search expression input field, 204 search button, 208 result display area, 212 search expression suggestion screen, 214 re-search button.

Claims

A document search support apparatus for supporting retrieval of a document file, comprising:
a similar document acquisition unit that acquires a similar document selected by a user as a document file similar in content to a desired document file;
a candidate term extraction unit that extracts terms included in the similar document as candidate terms;
a rank setting unit that ranks the candidate terms extracted from the similar document so that a candidate term with a lower appearance frequency in a predetermined document file group is ranked higher; and
a search term presenting unit that presents candidate terms at or above a predetermined rank as search terms suitable for searching for the desired document file.

The document search support apparatus according to claim 1, further comprising:
a designated search term acquisition unit that acquires a search term specified by the user as a designated search term; and
a candidate document acquisition unit that acquires, as a candidate document, a document file including the designated search term from a predetermined document file group,
wherein the similar document acquisition unit acquires, as the similar document, a candidate document designated by the user from among one or more candidate documents.

The document search support apparatus according to claim 1, wherein the rank setting unit adjusts ranks of candidate terms based on one or both of an appearance position and an appearance frequency in a similar document.

The document search support apparatus according to any one of claims 1 to 3, wherein the rank setting unit ranks only those extracted candidate terms whose number of appearances in the similar document is equal to or greater than a predetermined threshold.

A dissimilar document obtaining unit that obtains a dissimilar document selected by the user as a dissimilar document file in content with the desired document file;
5. The search term presenting unit presents a candidate term that is not included in the dissimilar document among the candidate terms positioned at the predetermined order or more, to the user. Document search support device.

A dissimilar document acquisition unit that acquires a dissimilar document selected by a user as a dissimilar document file in content with the desired document file;
A non-applicable word extraction unit that extracts terms included in the dissimilar document as non-applicable words; and
The rank setting unit also sets ranks of non-applicable words so that ranks of non-applicable words with a lower appearance frequency in a predetermined document file group have higher ranks, and ranks of candidate terms that are also non-applicable words of a predetermined rank or higher are lowered. 5. The document search support apparatus according to claim 1, wherein the ranking of candidate terms is adjusted.

A dissimilar document acquisition unit that acquires a dissimilar document selected by a user as a dissimilar document file in content with the desired document file;
A non-applicable word extraction unit that extracts terms included in the dissimilar document as non-applicable words; and
The rank setting unit also sets ranks of non-applicable words extracted from dissimilar documents so that ranks of non-applicable words with a lower appearance frequency in a predetermined document file group are higher.
5. The search word presenting unit presents an inapplicable word of a predetermined rank or higher to a user as an excluded search word that is a search word that should not be included in the desired document file. Document retrieval support device described in 1.

The document search support apparatus according to claim 7, wherein the search term presenting unit presents, as the excluded search terms, those inapplicable words at or above the predetermined rank that are not included in the similar document.

An association table holding unit that holds an association table that associates one term with another term;
A related search term identifying unit that identifies candidate terms positioned above the predetermined rank as suggested search terms, and identifies terms associated with the proposed search terms in the related table as related search terms,
9. The search word presenting unit presents the related search word to the user as a search word suitable for searching the desired document file in addition to the suggested search word. The document search support apparatus according to any one of the above.

The document search support apparatus according to claim 9, wherein the related table holding unit holds the related table as a table in which terms having synonym relations are registered in association with each other.

10. A related table update unit that associates the plurality of terms and registers them in the related table when a plurality of terms are included as search terms in a search expression designated by a user. 10. The document search support device according to 10.

The document search support apparatus according to claim 9, further comprising a search expression presenting unit that generates, for each of a plurality of suggested search terms, a search term set in which the suggested search term and its related search terms are joined by an OR condition, and that presents to the user, as a search expression suitable for searching for the desired document file, a search expression in which the plurality of search term sets are joined by an AND condition.

A search support program for supporting retrieval of a document file, the program causing a computer to execute:
a process of acquiring a similar document selected by a user as a document file similar in content to a desired document file;
a process of extracting terms included in the similar document as candidate terms;
a process of ranking the candidate terms extracted from the similar document so that a candidate term with a lower appearance frequency in a predetermined document file group is ranked higher; and
a process of presenting candidate terms at or above a predetermined rank as search terms suitable for searching for the desired document file.