
JP2005099972A - Concept search method and system - Google Patents


Info

Publication number
JP2005099972A
Authority
JP
Japan
Prior art keywords
sentence
search
document
keyword
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2003330940A
Other languages
Japanese (ja)
Other versions
JP4385697B2 (en)
Inventor
Atsushi Sakata
淳 坂田
Jugo Noda
十悟 野田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd
Priority to JP2003330940A
Publication of JP2005099972A
Application granted
Publication of JP4385697B2
Anticipated expiration
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: A concept search is executed without judging whether the information contained in a document is necessary for the user, so search accuracy may deteriorate when the concept search is performed with a very long document. The necessity of the information contained in the document can be judged from feature terms, but that judgment may be difficult. The object is to provide a concept search method that improves search accuracy over a conventional concept search by performing a more accurate search based on the information the user needs.

SOLUTION: Keywords that characterize a document are extracted from the document input as a search condition. The characteristic keywords and the individual sentences obtained by decomposing the document are classified into equivalence classes, associated with each other, and held. A sentence representing the sentence group in each equivalence class is presented as an important sentence for the user to choose. The concept search is then carried out using the keyword group associated with the equivalence class indicated by the chosen important sentence, together with the weights of those keywords.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a concept search in which electronic documents are searched using a search condition registered by a user and documents satisfying the condition are returned to the user. In particular, it relates to a document search method and system having an inter-document fitness calculation function that scans the contents of an electronic document to calculate its degree of fitness to the document the user registered as the search condition.

In recent years, large volumes of electronic documents (hereinafter referred to as texts) have come to be delivered to users continuously through e-mail, electronic news, and similar channels. The number of information sources that publish information over the WWW (World Wide Web) is also increasing rapidly, and the volume of text collected from these sources by information-gathering robots and the like has become enormous. As a result, there is a growing need to search these texts for those that contain the information the user really wants.

In conventional search systems, the user assembles a search expression from the words thought necessary for the search, following a certain syntax, and enters it to perform the search. However, it is difficult for a user unfamiliar with searching to enter appropriate words for obtaining the desired information, or to assemble a complex search expression that retrieves only the necessary information and filters out the unnecessary. For this reason, Patent Document 1 proposes a technique (hereinafter referred to as concept search) in which, instead of assembling a search expression, the user inputs a document containing the desired information (hereinafter referred to as a seed document). In this technique, the words needed for the search (hereinafter referred to as feature terms) are automatically extracted from the seed document, appropriate weights are assigned to the extracted feature terms, and the degree of fitness of each candidate text is calculated. Texts whose fitness exceeds a certain value are returned as the search result.

However, when the seed document is long, it may contain not only the information the user wants but also unnecessary concepts. The search then deviates from the information the user wants, so the user is not satisfied with the result and tries to search again after making some adjustment. Typically, the user looks through the retrieved documents, finds a document (or passage) that captures the desired concept better than the seed document, and inputs it for a new search. Repeating this operation brings the results closer to what the user wants. Alternatively, the user adjusts the weights of the feature terms, raising the weights of terms considered necessary and lowering those of terms considered unnecessary, and then searches again.

Other related literature relevant to the present invention is described below.
Non-Patent Document 1 describes a technique for clustering (automatic classification) in which the number of occurrences of keywords for each classification item is compared with the number of occurrences of keywords in each document, and each document is assigned to the closest classification item.

Non-Patent Document 2 describes a technique for extracting important sentences from a document based on keyword weights.

Patent Document 1: JP-A-11-143902

Non-Patent Document 1: Automatic document clustering system gnmz (http://icrouton.as.wakwak.ne.jp/pub/kks/cnamazu.html)
Non-Patent Document 2: "Technology for automatically summarizing texts, Part 1: Extracting important sentences in texts", bit (computer science magazine), February issue, Kyoritsu Shuppan, pp. 37-42, February 2000

A concept search is executed without judging whether the information contained in a document is necessary or unnecessary for the user. Consequently, when a concept search is performed with a long document, the conventional method alone may yield reduced search accuracy. Preventing this loss of accuracy requires the following two things.

1) A means for classifying the information contained in a document
2) A means for letting the user accurately judge whether information is necessary
In the prior art, the means for 2) are the selection of feature terms and the adjustment of their weights. However, feature terms alone, stripped of their surrounding context, can make it difficult to judge whether they represent the information the user wants.

An object of the present invention is to provide a concept search method whose search accuracy is improved over a conventional concept search by executing a more accurate search based on the information the user needs.

To achieve this object, the processing described in the following steps automatically presents candidate seed documents (passages) suitable for the search, even if the user does not know the feature terms best suited to it. The user selects the passages considered appropriate from among these candidates, which solves the above problems and raises search accuracy. The procedure is as follows.

Step 1: Extract the feature terms and their weights from the input seed document.

Step 2: Decompose the input seed document into sentences.

Step 3: Treat each sentence obtained in Step 2 as one document, classify the sentences into equivalence classes using the feature terms and weights extracted in Step 1 (hereinafter referred to as clustering), and determine the feature terms that represent each equivalence class from among the feature terms extracted in Step 1.

Step 4: Within each equivalence class obtained in Step 3, select the sentence that best characterizes that class (hereinafter referred to as the important sentence).

Step 5: Present the important sentences selected in Step 4 to the user and have the user choose the sentences that contain the necessary information.

Step 6: Execute the concept search using the feature terms, and their weights, that represent the equivalence classes from Step 3 corresponding to the sentences chosen in Step 5.

Because the concept search is executed using only the information the user needs, a more accurate search can be performed. As a result, search accuracy improves over a conventional concept search.

FIG. 1 shows the system configuration of the present invention. The concept search device 30000 is a client-server search system that communicates with a client 10000 via a network 20000. To perform a search, the user inputs a seed document 40000 at the client 10000. The client 10000 transmits the input seed document 40000 to the concept search device 30000 through the network 20000, and the processing of the present invention is executed.

The concept search device 30000 consists of the following components.

  • Feature term extraction unit 31000: extracts feature terms and their weights from the seed document 40000 using the document information DB 38000.
  • Similarity calculation unit 32000: calculates the similarity between the documents in the document DB 37000 and the seed document 40000 using the feature terms and their weights.
  • Sentence decomposition unit 33000: decomposes the seed document 40000 into sentences.
  • Clustering unit 34000: classifies the sentences of the seed document 40000 into equivalence classes of similar content using the feature terms and their weights.
  • Important sentence extraction unit 35000: takes the sentences that the clustering unit 34000 classified into an equivalence class as input and outputs the important sentence with respect to the search condition (the feature terms extracted by the feature term extraction unit and their weights).
  • Output screen data generation unit 36000: receives the results of the similarity calculation unit 32000 and the important sentence extraction unit 35000 and generates the screen data to be sent to the client 10000.

The concept search procedure of the present invention is described below following the processing flow of FIG. 13.

Step 60000: On the client screen shown in FIG. 2, the user enters a document 41000, whose content is illustrated in FIG. 3, into the seed document input box 11000. When the user then presses the search start button 12000, the client 10000 transmits the input data as the seed document 40000 to the concept search device 30000 via the network 20000.

Step 61000: The concept search device 30000 receives the seed document 40000 transmitted from the client 10000 and inputs it to the feature term extraction unit 31000. Using the document information DB 38000, the feature term extraction unit 31000 extracts the feature terms 42000 and their weights 43000 shown in FIG. 4. The method of Patent Document 1 can be used as the feature term extraction algorithm, in which case the document information DB 38000 is assumed to store the data that Patent Document 1 requires. Alternatively, morphological analysis may be used to extract the feature terms, and the number of times each feature term appears in the seed document may be used as its weight.
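The alternative mentioned at the end of Step 61000 (weights taken from occurrence counts) can be sketched roughly as follows. This is only an illustration under stated assumptions: a regular-expression tokenizer and a small stop-word list stand in for morphological analysis, the sample text is English rather than Japanese, and the names `extract_feature_terms` and `STOP_WORDS` are hypothetical and do not come from the patent.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would rely on morphological
# analysis (part-of-speech filtering) instead of this stand-in.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def extract_feature_terms(seed_document: str, top_n: int = 10) -> dict[str, int]:
    """Return {feature term: weight}, where the weight is the number of
    times the term occurs in the seed document."""
    tokens = re.findall(r"[a-z0-9]+", seed_document.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return dict(counts.most_common(top_n))


if __name__ == "__main__":
    seed = ("The search engine ranks documents. "
            "The search index stores documents and terms.")
    print(extract_feature_terms(seed, top_n=3))
```

The resulting mapping plays the role of the feature terms 42000 and weights 43000 in the later sketches.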

Step 62000: Next, the concept search device 30000 inputs the seed document 40000 to the sentence decomposition unit 33000 and obtains the sentence group 44000 of the seed document 40000, shown in FIG. 6. The sentence decomposition unit 33000 operates according to the processing flow shown in FIG. 12. When processing relatively free-form text such as daily business reports, sentence lengths tend to be irregular, so the sentence decomposition unit 33000 may be adjusted to combine several short sentences into one, or to cut a long sentence at a word boundary so that it does not exceed a certain length.
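One possible form of the length adjustment described in Step 62000 is sketched below. It assumes space-delimited text (Japanese would need a word segmenter to find word boundaries), and the function name `normalize_lengths` and the thresholds `min_len` and `max_len` are hypothetical; the patent does not specify concrete values.

```python
def normalize_lengths(sentences: list[str], min_len: int = 20,
                      max_len: int = 60) -> list[str]:
    """Merge short sentences with their successors, then cut long ones
    at word boundaries so that no piece exceeds max_len characters
    (a single word longer than max_len is kept whole)."""
    # Pass 1: accumulate sentences until the buffer reaches min_len.
    merged: list[str] = []
    buffer = ""
    for s in sentences:
        buffer = f"{buffer} {s}" if buffer else s
        if len(buffer) >= min_len:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # trailing short fragment is kept as is

    # Pass 2: split over-long sentences between words.
    result: list[str] = []
    for s in merged:
        chunk = ""
        for word in s.split():
            candidate = f"{chunk} {word}" if chunk else word
            if chunk and len(candidate) > max_len:
                result.append(chunk)
                chunk = word
            else:
                chunk = candidate
        if chunk:
            result.append(chunk)
    return result
```

Merging before splitting keeps the two adjustments independent, so a merged sentence that becomes too long is still cut in the second pass.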

Step 63000: The feature terms 42000 and their weights 43000 extracted in the steps above, together with the sentence group 44000, are input to the clustering unit 34000. Each sentence is classified into an equivalence class, yielding the feature term groups 45000 that represent each class and the classified sentence groups 46000 shown in FIG. 7. Various document clustering methods exist; the method of Non-Patent Document 1 can be used here. The similarity between sentences may be calculated from the term frequency, the product of the term frequency and the uniqueness of the term, or the product of the term frequency and the number of distinct terms. FIG. 7 shows the clustering result when the number of classes is 3. Any information required is obtained from the document information DB 38000. The feature term group 45000 representing a classified sentence group 46000 consists of all the feature terms 42000 extracted in the earlier step that appear in that sentence group. Only sentences containing at least one of the feature terms shown in FIG. 4 are subject to clustering.
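To make the clustering in Step 63000 concrete, here is a rough sketch. It is not the method of Non-Patent Document 1: it substitutes a simple greedy (leader-style) clustering over weighted term-frequency vectors compared by cosine similarity. The names `sentence_vector`, `cosine`, `cluster_sentences`, and the `threshold` parameter are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence: str, term_weights: dict[str, float]) -> dict[str, float]:
    """Weighted term-frequency vector restricted to the feature terms."""
    counts = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {t: counts[t] * w for t, w in term_weights.items() if counts[t]}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster_sentences(sentences: list[str], term_weights: dict[str, float],
                      threshold: float = 0.5) -> list[dict]:
    """Assign each sentence to the most similar existing cluster, or open
    a new cluster when no similarity reaches the threshold."""
    clusters: list[dict] = []
    for s in sentences:
        vec = sentence_vector(s, term_weights)
        if not vec:
            continue  # sentences without any feature term are not clustered
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["sentences"].append(s)
            best["terms"].update(vec)  # feature terms present in this class
            for t, x in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + x
        else:
            clusters.append({"centroid": dict(vec), "sentences": [s], "terms": set(vec)})
    return clusters
```

In this sketch each cluster's `terms` set corresponds to the feature term group 45000 (all feature terms appearing in the class), and the length of `sentences` corresponds to the per-class sentence count shown to the user later.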

Step 64000: Next, the important sentence of each classification group is determined. The important sentence extraction unit 35000 does this using the similarity calculation unit 32000. The feature term group 45000 of the classification group and the corresponding weights, obtained from the table of FIG. 4, serve as the search condition, and each sentence in the group is input to the similarity calculation unit 32000. Data is obtained from the document information DB 38000 as needed. The similarity calculation unit 32000 may compute the similarity of each sentence according to Patent Document 1, or may use the method of Non-Patent Document 2, in which case the document information DB 38000 is assumed to store the data that Non-Patent Document 2 requires. FIG. 9 shows the result of taking, for each classification group, the sentence with the highest of the computed similarities 47000 as the important sentence 48000 of that group.
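The selection in Step 64000 amounts to scoring every sentence of a classification group against the group's feature terms and keeping the best one. The sketch below uses a plain sum of term weights as the score; this is an assumed stand-in, not the similarity of Patent Document 1 or the method of Non-Patent Document 2, and the function names are hypothetical.

```python
import re


def sentence_score(sentence: str, term_weights: dict[str, float]) -> float:
    """Sum the weights of the feature terms occurring in the sentence
    (repeated occurrences count each time)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return sum(term_weights.get(t, 0) for t in tokens)


def important_sentence(sentences: list[str], group_terms: set[str],
                       all_weights: dict[str, float]) -> str:
    """Return the sentence of one classification group that scores highest
    against that group's feature terms."""
    weights = {t: all_weights[t] for t in group_terms}
    return max(sentences, key=lambda s: sentence_score(s, weights))


if __name__ == "__main__":
    group = ["The index is large.", "Search the index, then search again."]
    print(important_sentence(group, {"search", "index"}, {"search": 3, "index": 1}))
```

When several sentences tie, `max` returns the first of them, so the earliest sentence in the group wins.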

Step 65000: When the important sentences 48000 are input to the output screen data generation unit 36000, the concept search device 30000 transmits the necessary output data to the client 10000 via the network 20000, and the client 10000 displays it on the search result screen 13000 (FIG. 10).

Step 66000: The search result screen 13000 displays the important sentence 48000 of each classification group together with the number of sentences classified into that group, shown as the score 15000 of the group. A higher score indicates that the seed document 40000 contains more of the concept of that classification group. The detail view button 18000 lets the user browse all the sentences of a classification group, as shown in FIG. 8. Following the on-screen instructions, the user ticks the check boxes 17000 of the required concepts and presses the search execution button 14000, whereupon the selection number 49000 is transmitted. In FIG. 10, "2" is selected, so the client 10000 transmits "2" as the selection number 49000 to the concept search device 30000.

Step 67000: After receiving the selection number 49000, the concept search device 30000 obtains the feature term group 45000 corresponding to the selection number 49000 from the table shown in FIG. 9, and obtains the weights 43000 of those terms from the table shown in FIG. 4.

The feature term group 45000 and its weights 43000 are input to the similarity calculation unit 32000, which computes the similarity to each document stored in the document DB 37000. As one example, the similarity calculation method of Patent Document 1 is used here.

Step 68000: After the similarities have been calculated, the document titles are input to the output screen data generation unit 36000 in descending order of similarity. The output screen data generation unit 36000 transmits the output data to the client 10000.
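Steps 67000 and 68000 together score each stored document against the selected feature terms and return the titles in descending order of similarity. A minimal sketch follows, assuming an in-memory mapping of titles to text in place of the document DB 37000 and a weighted term-frequency sum in place of the similarity of Patent Document 1; `rank_documents` and `min_score` are hypothetical names.

```python
import re
from collections import Counter


def rank_documents(documents: dict[str, str], term_weights: dict[str, float],
                   min_score: float = 0.0) -> list[str]:
    """Return document titles sorted by descending score, where the score
    is the weighted count of the selected feature terms in each document.
    Documents scoring at or below min_score are dropped."""
    scored = []
    for title, text in documents.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        score = sum(w * counts[t] for t, w in term_weights.items())
        if score > min_score:
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


if __name__ == "__main__":
    db = {"A": "search index search", "B": "index", "C": "nothing relevant"}
    print(rank_documents(db, {"search": 2, "index": 1}))
```

The `min_score` cut-off mirrors the idea from the background section that only documents whose fitness exceeds a certain value are returned.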

Step 69000: The client 10000 displays the search result screen shown in FIG. 11. When the search condition save button 19000 is pressed, the search condition is saved on the client with the selected sentence as the key. This spares the user the trouble of inventing a suitable name for the saved condition and, when the condition is reused later, lets the user decide from the sentence itself whether it is applicable.
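Saving the search condition under the selected sentence, as in Step 69000, could look roughly like the following. The JSON file format, the file path argument, and the function names `save_search_condition` and `load_search_condition` are assumptions for illustration; the patent only states that the selected sentence serves as the key.

```python
import json
from pathlib import Path


def save_search_condition(store_path: str, selected_sentence: str,
                          term_weights: dict[str, float]) -> None:
    """Store the feature terms and weights under the selected sentence."""
    path = Path(store_path)
    profiles = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    profiles[selected_sentence] = term_weights
    path.write_text(json.dumps(profiles, ensure_ascii=False, indent=2),
                    encoding="utf-8")


def load_search_condition(store_path: str, selected_sentence: str) -> dict[str, float]:
    """Look up a saved search condition by its key sentence."""
    profiles = json.loads(Path(store_path).read_text(encoding="utf-8"))
    return profiles[selected_sentence]
```

Because the key is the sentence itself, listing the keys of the stored file gives the user a readable menu of saved conditions without any separate naming step.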

The sentence decomposition process shown in FIG. 12 is described next.

Step 50000: The text operation unit 33100 reads the document to be processed into the working storage area 33200. It sets the operation start character position 33300 and the operation character position 33400 to the initial position (the first character of the document), sets the sentence count 33500 to 0, and proceeds to step 50010.

Step 50010: The text operation unit 33100 determines whether the character at the operation character position 33400 in the document read into the working storage area 33200 is the delimiter (。). If it is, processing continues at step 50020; if not, processing proceeds to step 50060.

Step 50020: The characters from the operation start character position 33300 to the operation character position 33400 are copied from the working storage area 33200 to the sentence group storage area 33600, and processing proceeds to step 50030.

Step 50030: The sentence count 33500 is incremented by 1, and processing proceeds to step 50040.

Step 50040: The operation character position 33400 is incremented by 1, and processing proceeds to step 50050.

Step 50050: The operation start character position 33300 is set to the value of the operation character position 33400, and processing proceeds to step 50060.

Step 50060: The operation character position 33400 is incremented by 1, and processing proceeds to step 50070.

Step 50070: If the value of the operation character position 33400 is less than the number of characters in the document being processed, processing returns to step 50010. If it equals the number of characters, processing proceeds to step 50080.

Step 50080: The characters from the operation start character position 33300 to the operation character position 33400 are copied to the sentence group storage area 33600. The sentence count 33500 is incremented by 1, each sentence in the sentence group storage area is output, and processing ends.
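The intent of the flow in steps 50000 to 50080 (scan the text once and cut after every 。) can be sketched as below. This is not a literal transcription of the flowchart: the variable `start` plays the role of the operation start character position 33300, `pos` that of the operation character position 33400, and, as an assumption of this sketch, an empty trailing segment is not emitted as a sentence. The function name `split_sentences` is hypothetical.

```python
def split_sentences(text: str, delimiter: str = "。") -> list[str]:
    """Split text into sentences, keeping the delimiter at the end of each."""
    sentences: list[str] = []
    start = 0  # first character of the sentence currently being scanned
    for pos, ch in enumerate(text):
        if ch == delimiter:
            # Copy start..pos inclusive into the sentence list.
            sentences.append(text[start:pos + 1])
            start = pos + 1
    if start < len(text):
        # Remaining characters after the last delimiter form a final sentence.
        sentences.append(text[start:])
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れ。明日は雨。残り"):
        print(sentence)
```

The number of returned sentences corresponds to the sentence count 33500, and the returned list to the contents of the sentence group storage area 33600.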

FIG. 1 shows the system configuration of the present invention.
FIG. 2 shows the client screen in an embodiment of the present invention.
FIG. 3 shows the seed document that is input in the embodiment of the present invention.
FIG. 4 shows the feature terms and their weights.
FIG. 5 shows the sentence decomposition unit in the embodiment of the present invention.
FIG. 6 shows the sentence group output by the sentence decomposition unit.
FIG. 7 shows the classified sentence groups in the embodiment of the present invention.
FIG. 8 shows the classified sentence groups and the importance of each sentence in the embodiment of the present invention.
FIG. 9 shows each classification and its important sentence in the embodiment of the present invention.
FIG. 10 shows the concept selection screen in the embodiment of the present invention.
FIG. 11 shows the search result screen in the embodiment of the present invention.
FIG. 12 shows the processing flow of the sentence decomposition unit in the embodiment of the present invention.
FIG. 13 shows the overall processing flow in the embodiment of the present invention.

Explanation of symbols

10000: client, 20000: network, 30000: concept search device,
31000: feature term extraction unit, 32000: similarity calculation unit,
33000: sentence decomposition unit, 34000: clustering unit, 35000: important sentence extraction unit,
36000: output screen data generation unit, 37000: document DB, 38000: document information DB, 40000: seed document

Claims (7)

1. A concept search method for searching for similar documents by inputting a document as a search condition, the method comprising:
extracting keywords that characterize the document from the document input as the search condition;
decomposing the document into individual sentences;
classifying the characteristic keywords and the decomposed individual sentences into equivalence classes and holding them in association with each other;
taking a sentence that represents the sentence group in each equivalence class as an important sentence, presenting the important sentence in place of the equivalence class, and having the user select from the presented important sentences; and
performing concept search processing using the keyword group associated with the equivalence class indicated by the selected important sentence and the weights of the keyword group.

2. The concept search method according to claim 1, wherein, when the document is decomposed into sentences, the lengths of the sentences are made as uniform as possible by treating two sentences as one sentence when a sentence is short and, when a sentence is too long, extracting part of it as a sentence by cutting it midway at a word boundary.

3. The concept search method according to claim 1, wherein the similarity calculation used to classify the sentences into equivalence classes uses one of: the frequency of occurrence of a keyword in each sentence; the product of the keyword's frequency of occurrence and its uniqueness; or the product of the keyword's frequency of occurrence and the number of keyword types.

4. The concept search method according to claim 1, wherein, when the important sentence is offered for selection in place of the equivalence class, the number of sentences classified into the equivalence class is presented beside the important sentence.

5. The concept search method according to claim 1, wherein, when the user is asked to select an important sentence in place of the equivalence class, each sentence in the equivalence class can be browsed.

6. The concept search method according to claim 1, wherein, when the keyword group representing the equivalence class indicated by the important sentence and the weights are retained as a search profile, they are stored with the important sentence as the key.

7. A concept search system for searching for similar documents by inputting a document as a search condition, the system comprising:
means for extracting keywords that characterize the document from the document input as the search condition;
means for decomposing the document into individual sentences;
means for classifying the characteristic keywords and the decomposed individual sentences into equivalence classes and holding them in association with each other;
means for taking a sentence that represents the sentence group in each equivalence class as an important sentence, presenting the important sentence in place of the equivalence class, and having the user select from the presented important sentences; and
means for performing concept search processing using the keyword group associated with the equivalence class indicated by the selected important sentence and the weights of the keyword group.
JP2003330940A 2003-09-24 2003-09-24 Concept search method and system Expired - Fee Related JP4385697B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003330940A JP4385697B2 (en) 2003-09-24 2003-09-24 Concept search method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003330940A JP4385697B2 (en) 2003-09-24 2003-09-24 Concept search method and system

Publications (2)

Publication Number Publication Date
JP2005099972A 2005-04-14
JP4385697B2 2009-12-16

Family

ID=34459723

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
JP2003330940A Expired - Fee Related JP4385697B2 (en) 2003-09-24 2003-09-24 Concept search method and system

Country Status (1)

Country Link
JP (1) JP4385697B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012216239A (en) * 2012-07-12 2012-11-08 Toshiba Corp Information processing apparatus, program, and method of information retrieval
JP2019057279A (en) * 2017-09-18 2019-04-11 TATA Consultancy Services Limited Method and system for inference data mining

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11175541A (en) * 1997-12-09 1999-07-02 Toshiba Corp Natural language search input device, input method, and recording medium recording input processing program
JP2002222210A (en) * 2001-01-25 2002-08-09 Hitachi Ltd Document search system, document search method, and search server
JP2002228575A (en) * 2001-02-01 2002-08-14 Asahi Eng Co Ltd Corrosion diagnosis system for tank steel plate
JP2002358323A (en) * 2001-03-30 2002-12-13 Just Syst Corp Search request sentence generation device, search result presentation device, search request sentence generation method, search result presentation method, search request sentence generation program, search result presentation program, data search device, data search method, and data search program
JP2003108584A (en) * 2001-09-28 2003-04-11 Casio Comput Co Ltd Information retrieval system and program

Also Published As

Publication number Publication date
JP4385697B2 (en) 2009-12-16

Similar Documents

Publication Publication Date Title
CN118839021B (en) A dynamic relevance enhanced retrieval generation system and method driven by intelligent knowledge graph
KR101715432B1 (en) Word pair acquisition device, word pair acquisition method, and recording medium
US7272558B1 (en) Speech recognition training method for audio and video file indexing on a search engine
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
JP6355840B2 (en) Stopword identification method and apparatus
CN106708929B (en) Video program searching method and device
KR102376489B1 (en) Text document cluster and topic generation apparatus and method thereof
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
CN111611356A (en) Information searching method and device, electronic equipment and readable storage medium
JP4349875B2 (en) Document filtering apparatus, document filtering method, and document filtering program
JP2000357170A (en) Apparatus for retrieving information using document reference reason
CN106570196A (en) Video program searching method and device
JP2005301856A (en) Document search method, document search program, and document search apparatus for executing the same
CN120296146A (en) Government document citation retrieval method, device, equipment and medium based on big model
CN109948154B (en) Character acquisition and relationship recommendation system and method based on mailbox names
CN103226601B (en) A kind of method and apparatus of picture searching
JP2004334766A (en) Word classification device, word classification method, and word classification program
CN116738979A (en) Power grid data search method, system and electronic equipment based on core data identification
JP4005343B2 (en) Information retrieval system
JP4212347B2 (en) Document search apparatus, program, and recording medium
JPH1145257A (en) Web document search support apparatus and computer-readable recording medium storing a program for causing a computer to function as the apparatus
CN119493778A (en) A method, system, device and storage medium for compressing multimodal weight files
JP4385697B2 (en) Concept search method and system
Oliveira et al. A concept-based ILP approach for multi-document summarization exploring centrality and position
CN111159393B (en) A text generation method based on LDA and D2V for summary extraction

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20051006

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20060421

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090119

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090127

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090325

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090609

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090805

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20090908

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20090921

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20121009

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20131009

Year of fee payment: 4

LAPS Cancellation because of no payment of annual fees