JPS61248160A

JPS61248160A - Document information registering system

Info

Publication number: JPS61248160A
Application number: JP60088517A
Authority: JP
Inventors: Tetsuo Machida; 哲夫町田; Kuniaki Tabata; 邦晃田畑; Masatoshi Hino; 樋野　匡利; Kunihiro Nomura; 訓弘野村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1985-04-26
Filing date: 1985-04-26
Publication date: 1986-11-05

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Field of Application of the Invention] The present invention relates to a file system for storing academic papers and similar documents, and in particular to a document information registration method suited to automating the input of search keywords and to automatically generating summary sentences as auxiliary search information.

[Background of the invention]

When coded text that has been keyed in for printing or archiving, or created with a word processor, is accumulated and stored, it has conventionally been necessary to enter search keywords separately from the main text at the time the main text is entered. Furthermore, because keywords alone often cannot single out the desired text during a search, a summary of the main text is frequently used as auxiliary search information. This summary also has to be written and entered separately from the main text, which makes data registration cumbersome.

For the case where document information is registered as images, the inventors previously devised a method in which auxiliary search information corresponding to the summary described above (guide images, image indexes) is created automatically by reducing or cropping the original document information; at search time, the auxiliary information of the document information matching a specified keyword is displayed as a list, and the desired document information is identified from that list (Japanese Examined Patent Publications Sho 56-53788 and Sho 57-8499, Japanese Patent Application Sho 59-201666). However, all of these apply to registration as images, and they have the drawback that they cannot be applied to coded data such as the output of a word processor.

[Purpose of the invention]

An object of the present invention is to simplify registration work, when document information is registered as coded data, by automatically extracting search keywords and a summary serving as auxiliary search information from the main text data and registering them.

[Summary of the invention]

In general, the kinds of words that serve as search keywords in document retrieval are limited. For academic papers and patent information in particular, keywords are strictly defined, and special codes are often attached to them.

Furthermore, in academic papers and patent information, the sentences that explain the subject (so-called topic sentences) always contain these keywords.

The present invention exploits these properties: (1) words that can serve as keywords are automatically extracted from the main text, and (2) the sentences containing the keywords extracted in (1) are extracted and re-edited in order of appearance to form a summary. This automates the data registration work that consists of keyword input and of writing and entering a summary.

[Embodiments of the invention]

The present invention is described in detail below with reference to an embodiment.

FIG. 1 shows an overview of the processing of the document information registration method according to the present invention. Reference numeral 10 in the figure denotes the automatic keyword extraction and automatic summary creation processing of the invention. This embodiment holds a list of words that can serve as keywords, namely keyword table 101, and checks whether each word extracted from main text file 102 matches a word stored in keyword table 101. If it matches, the word is registered in keyword file 103 as a keyword of the document. Where symbols, codes, or the like associated with keywords are defined, such as classification codes in patent information, these are also stored in keyword table 101 and registered in keyword file 103 together with the keyword. In addition, the sentence containing the keyword is extracted from main text file 102 and stored in summary sentence file 104. If a word extracted from main text file 102 does not match any word stored in keyword table 101, it is regarded as unable to be a keyword of the document information, and processing moves on to extraction of the next word.

The above processing is repeated for every word in main text file 102, which completes the extraction of all keywords and the creation of the summary.
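The loop described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `register_document`, the use of a Python set for keyword table 101, and the simplified English-only word and sentence splitting are all assumptions made for the example.

```python
import re

def register_document(text, keyword_table):
    """Return (keywords, summary_sentences) for one document.

    keyword_table is a hypothetical stand-in for keyword table 101;
    the two returned lists stand in for keyword file 103 and
    summary sentence file 104.
    """
    keywords = []
    summary = []
    # Simplified sentence split: break after each period.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]
    for sentence in sentences:
        # Simplified word extraction: runs of ASCII letters, lower-cased.
        for word in re.findall(r"[A-Za-z]+", sentence.lower()):
            if word not in keyword_table:
                continue  # not a keyword candidate; move to the next word
            if word not in keywords:
                keywords.append(word)
            if sentence not in summary:
                summary.append(sentence)  # kept in order of appearance
    return keywords, summary

KEYWORD_TABLE = {"image", "retrieval"}
text = "The image is stored as code. The cost is low. Retrieval uses an index."
keywords, summary = register_document(text, KEYWORD_TABLE)
```

Sentences that contain no keyword (here, the second one) are left out of the summary, which is the behaviour the text describes.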

FIG. 2 illustrates the automatic extraction processing with a concrete example. Reference numeral 102 in the figure is the main text file. The word "image" in the first line of the text matches an entry in the keyword table, so "image" is registered in keyword file 103.

At the same time, the sentence containing this "image" (the one beginning "In the present invention, ...") is stored in summary sentence file 104.

FIG. 3 shows the flow of the processing described above. Reference numeral 201 in the figure is the process that extracts words from main text file 102.

In English text, words can be identified as character strings delimited by blanks or by symbols such as periods and commas. In Japanese text, in addition to the rules used for English, words are extracted as character strings delimited by particles, conjunctions, suffixes, conjugated endings, and the like.
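For the English case, word extraction (process 201) might look like the sketch below. The function name `extract_words` is hypothetical, and treating semicolons, colons, tabs, and newlines as delimiters is an assumption beyond the blanks, periods, and commas named in the text. Japanese segmentation needs morphological analysis and is not attempted here.

```python
import re

# Delimiters: blanks, periods and commas as in the text, plus a few
# extra separators assumed for this example.
_DELIMITERS = re.compile(r"[ \t\n.,;:]+")

def extract_words(text):
    """Split English text into words at the delimiter characters."""
    return [w for w in _DELIMITERS.split(text) if w]

words = extract_words("Images are stored, indexed, and retrieved.")
```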

Furthermore, since every word that can serve as a keyword is a noun, the number of executions of the subsequent check process 202 can be reduced by determining the part of speech and extracting only nouns.

Process 202 checks whether the extracted word is present in keyword table 101. Because keyword table 101 generally has a large number of entries, check process 202 needs to be made fast by a method such as binary search.
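One way to keep the check in process 202 fast is a binary search over a sorted keyword table, sketched here with Python's standard `bisect` module. The function name `in_keyword_table` and the sample entries are assumptions for illustration; the source does not specify the table layout.

```python
from bisect import bisect_left

def in_keyword_table(sorted_table, word):
    """Binary-search a sorted list of keywords for an exact match."""
    i = bisect_left(sorted_table, word)
    return i < len(sorted_table) and sorted_table[i] == word

# The table must be sorted once up front for binary search to be valid.
table = sorted(["image", "retrieval", "file", "keyword"])
found = in_keyword_table(table, "image")
missing = in_keyword_table(table, "summary")
```

Each lookup takes a number of comparisons logarithmic in the table size. A hash-based set would also work; binary search is shown because it is the kind of method the text mentions.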

Process 203 selects from keyword table 101, as described earlier, the symbol or number (hereinafter called the key code) that has been defined for a keyword, where one exists.

The key code is stored in keyword table 101 in association with its keyword.

Process 204 in FIG. 3 stores the keyword and key code selected in processes 202 and 203 in keyword file 103. Keyword file 103 carries the same code that is entered in main text file 102 as the identifier of the document information, and the keywords and key codes are stored following that code. Process 204 therefore stores the keyword and key code selected in processes 202 and 203 in the empty area following the identification code that corresponds to the main text data currently being processed.
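Processes 203 and 204 can be pictured with the sketch below, in which keyword table 101 maps each keyword to its key code (or `None` when no code is defined) and keyword file 103 is a dictionary keyed by the document identifier. The names `store_keyword`, `keyword_table`, `keyword_file`, the identifier `DOC-0001`, and the code `C01` are all hypothetical, and skipping duplicate entries is an assumption the source does not state.

```python
def store_keyword(keyword_file, doc_id, keyword, key_code=None):
    """Append (keyword, key_code) after the document's identifier."""
    entries = keyword_file.setdefault(doc_id, [])
    entry = (keyword, key_code)
    if entry not in entries:  # assumed: do not store the same keyword twice
        entries.append(entry)

# Hypothetical keyword table 101: keyword -> key code (None if undefined).
keyword_table = {"image": "C01", "retrieval": None}

keyword_file = {}  # hypothetical keyword file 103
for word in ["image", "retrieval", "image"]:
    # Process 203: look up the key code; process 204: store the pair.
    store_keyword(keyword_file, "DOC-0001", word, keyword_table[word])
```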

Process 205 extracts the sentence containing the keyword as a summary sentence. A sentence ends at a period, at a semicolon, or before a conjunction that follows a comma.

The span from one such end point to the next is extracted as a single sentence. In addition, conjunctions are removed so that the sentence can serve as a summary sentence.
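The boundary rule of process 205 can be approximated for English text as in the sketch below. The conjunction list and the function names `split_sentences` and `sentences_with` are assumptions, and the sketch is deliberately naive: for example, it would also split at the period inside a decimal number.

```python
import re

CONJUNCTIONS = ("and", "but", "or", "so")  # assumed list for this example
_ALT = "|".join(CONJUNCTIONS)

# A boundary is a period or semicolon, or a comma followed by a conjunction.
_BOUNDARY = re.compile(r"[.;]\s*|,\s+(?=(?:%s)\b)" % _ALT)
# A conjunction left at the start of a piece is dropped for the summary.
_LEADING_CONJ = re.compile(r"^(?:%s)\b\s*" % _ALT, re.IGNORECASE)

def split_sentences(text):
    """Split text at the boundaries and strip leading conjunctions."""
    pieces = []
    for piece in _BOUNDARY.split(text):
        piece = _LEADING_CONJ.sub("", piece.strip())
        if piece:
            pieces.append(piece)
    return pieces

def sentences_with(text, keyword):
    """Return the sentences that contain the keyword as a whole word."""
    return [s for s in split_sentences(text) if keyword in s.lower().split()]

text = ("The image is reduced, and the index is stored. "
        "Users search by keyword; results are listed.")
parts = split_sentences(text)
hits = sentences_with(text, "image")
```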

Process 206 stores the sentence extracted by the above processing in summary sentence file 104. Like keyword file 103, summary sentence file 104 carries the identification code corresponding to the main text data, and the sentence extracted in process 205 is stored in the empty area that follows it.

Process 207 determines when the above processing ends. Once every word in the main text has been processed, the automatic extraction of keywords and the automatic creation of the summary are complete.

In the embodiment above, every keyword candidate in the main text is extracted. In that case, a word used merely for convenience of explanation in the main text may happen to be held in keyword table 101 as a keyword candidate, and it would then also be extracted as a keyword of the document. To exclude such words, possible approaches include adopting as true keywords only those candidates that appear at least a specified number of times, or adopting as keywords only those candidates that syntactic analysis shows to be used as a subject or object.
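The first of these refinements, keeping only candidates that occur at least a specified number of times, could be sketched as follows. The function name `frequent_keywords` and the default threshold of 2 are assumptions; the source gives no specific count.

```python
from collections import Counter

def frequent_keywords(words, keyword_table, min_count=2):
    """Keep keyword candidates that occur at least min_count times."""
    counts = Counter(w for w in words if w in keyword_table)
    # Counter preserves first-seen order, so results follow the text order.
    return [w for w, n in counts.items() if n >= min_count]

words = ["image", "file", "image", "index", "file", "image", "retrieval"]
table = {"image", "file", "retrieval"}
kept = frequent_keywords(words, table, min_count=2)
```

Here "retrieval" appears only once and is dropped, while "index" is ignored because it is not in the keyword table at all.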

FIG. 4 shows the display screen used when searching for documents registered by the document information registration method according to the present invention.

Three items of document information have been selected for the keyword "image". To identify the desired one among them, the summaries automatically created by the document information registration method of the invention are displayed as a guide screen. By reading these summaries, the searcher can identify the document information that is actually needed.
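The retrieval step behind such a guide screen might be sketched as below. The dictionary layouts, the function name `search`, and the sample identifiers and summaries are hypothetical; the source describes only the screen, not the data structures used to build it.

```python
def search(keyword, keyword_file, summary_file):
    """Return (doc_id, summary) pairs for documents that carry the keyword."""
    hits = []
    for doc_id, keywords in keyword_file.items():
        if keyword in keywords:
            # Join the stored summary sentences for display as a guide.
            hits.append((doc_id, " ".join(summary_file.get(doc_id, []))))
    return hits

# Hypothetical contents of keyword file 103 and summary sentence file 104.
keyword_file = {"D1": ["image", "file"], "D2": ["retrieval"], "D3": ["image"]}
summary_file = {
    "D1": ["The image is stored as code."],
    "D2": ["Retrieval uses an index."],
    "D3": ["Each image is reduced.", "The image index is listed."],
}
results = search("image", keyword_file, summary_file)
```

The searcher would then read the returned summaries and pick the document that is wanted.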

[Effect of the invention]

According to the present invention, when document information such as academic papers and patent information is registered, only the main text data needs to be entered; keywords are then extracted from it automatically, and the writing of a summary, which has conventionally been regarded as a human intellectual activity, is also carried out automatically. This not only simplifies the input of document information but also makes it easy to identify the desired information during a search, even when keywords alone cannot single it out, by using the automatically created summary as auxiliary search information. Furthermore, the spread of document information registration and retrieval systems has often been held back by the difficulty of input, and the present invention can remove this obstacle.

[Brief explanation of drawings]

FIG. 1 is a diagram showing the overall configuration of the document information registration method according to the present invention; FIG. 2 is a diagram showing an example of the automatic keyword extraction and automatic summary creation processing; FIG. 3 is a diagram showing the flow of that processing; and FIG. 4 is a diagram showing an example of a guide screen that uses the summary automatically created by the invention as auxiliary search information.
10: automatic keyword and summary creation section; 101: keyword table; 102: main text file; 103: keyword file; 104: summary sentence file; 201: word extraction processing section; 202: keyword check section; 203: key code selection section; 204: keyword registration section; 205: sentence extraction processing section; 206: summary sentence registration section; 207: end determination section.

Claims

[Claims] 1. A document information registration method characterized in that, in a file system that stores document information, search keywords are automatically extracted from the main text, and the extraction results are used to automatically generate a summary sentence as auxiliary search information. 2. The document information registration method according to claim 1, wherein the summaries are displayed as a list and the desired information is selected from among them.