JP2007122403A

JP2007122403A - Automatic extraction device, extraction method and extraction program for document title and related information

Info

Publication number: JP2007122403A
Application number: JP2005313615A
Authority: JP
Inventors: Seiso Cho; 正操張; Mosho Son; 茂松孫; Tsuguaki Ryu; 紹明劉
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-10-28
Filing date: 2005-10-28
Publication date: 2007-05-17

Abstract

Problem to be solved: To provide a title extraction device that automatically and accurately extracts a document title.
Solution: A document title extraction device includes a title candidate sentence extraction unit 32 that extracts a plurality of title candidate sentences from a text document input from a document input unit 30; a feature amount extraction unit 34 that extracts a feature amount for each of the extracted title candidate sentences; a title determination unit 36 that determines the document title from among the title candidate sentences based on the extracted feature amounts; and an output unit 38 that outputs the extraction result. The feature amount includes at least similarity information, which is a function of the similarity between a title candidate sentence and a plurality of sentences in the document.
[Selected drawing] Figure 2

Description

The present invention relates to a document title extraction device that automatically extracts a document title from a document read by a scanner or the like.

Devices that read ordinary paper documents with an optical scanner or the like and extract the document title from the digitized image data are coming into practical use. For example, Patent Document 1 relates to a title extraction device that easily extracts the title portion from a document image obtained by converting a document into image data. In that device, a rectangular region circumscribing each region of connected black pixels in the document image is extracted as a character rectangle; adjacent character rectangles are then merged, and the rectangular region circumscribing them is extracted as a character-string rectangle. Next, a title-likelihood score is computed for each character-string rectangle from attributes such as underline, frame, and ruled-line attributes, together with the position of the character-string rectangle in the document image and its positional relationship to other rectangles, and the character-string rectangle with the highest score is extracted as the title rectangle.

Patent Document 2 relates to a title extraction device that identifies the character codes in a character-string rectangle cut out of a document image and extracts the title using techniques such as the confidence of the character code identification, natural language analysis means that analyzes how title-like the string is linguistically, statistical information on word endings, centering, underlining, specific fonts, and the size of the character rectangle.

Non-Patent Document 1 discloses a technique that uses regular-expression patterns to extract addresses, city names, URLs, and times from a technical paper, and then takes the portions at the beginning of the paper that were not extracted as the author and title.

Non-Patent Document 2 discloses a technique that, for the opening portion of a document, uses linguistic features (number of words, line position, ratio of words to non-words, ratio of uppercase to lowercase initial letters, ratio of digits) as sentence features and determines the title with an SVM.

Patent Document 1: JP-A-9-134406
Patent Document 2: JP 2000-148788 A
Non-Patent Document 1: E. Berkowitz, M. Elkhadiri, T. Sahouri and M. Abraham. 2004. Intelligent Content Based Title and Author Name Extraction from Formatted Documents. Proceedings Fifteenth Midwest Artificial Intelligence and Cognitive Science Conference. Pages 119-124.
Non-Patent Document 2: Hui Han, C. Giles, E. Manavoglu, Hongyuan Zha, Zhenyue Zhang and E. Fox. 2003. Automatic Document Metadata Extraction using Support Vector Machines. ACM/IEEE Joint Conference on Digital Libraries. Pages 37-48.

However, because the title extraction device of Patent Document 1 extracts titles from non-standard-format documents using layout features of line regions, its extraction rate is insufficient. Patent Document 2 determines the title using several title attributes, but in a document containing many short character-string rectangles, many of those short rectangles carry title attributes, so misjudgment is likely.

In addition, the techniques disclosed in Non-Patent Documents 1 and 2 depend on the structure of the document, so they are difficult to apply to documents other than technical papers, and they cannot extract the correct title when little information is available at the beginning of the document.

The present invention solves the above conventional problems. Its object is to provide a title extraction device, extraction method, and extraction program that do not necessarily depend on the layout or content range of a document and that make full use of linguistic knowledge. Information such as the length of a title candidate sentence, the rank of the similarity between the candidate sentence and other sentences, author, organization name, title keywords, the distance between the candidate sentence and the author, title-prohibited keywords, postal code, and symbols is used as the feature amount of the title candidate sentence, and a classifier (for example, an SVM) uses this feature amount to determine whether the sentence is a title. This makes maximum use of attributes specific to titles and allows the document title and its related information to be extracted with high accuracy by a flexible determination method.

A document title extraction device according to the present invention includes title candidate sentence extraction means for extracting a plurality of title candidate sentences from a document; feature amount extraction means for extracting a feature amount for each of the extracted title candidate sentences; determination means for determining the document title from among the title candidate sentences based on the extracted feature amounts; and output means for outputting the determination result. The feature amount includes at least similarity information, which is a function of the similarity between a title candidate sentence and a plurality of sentences in the document.

Preferably, the similarity information includes rank information indicating the degree of similarity between the title candidate sentence and the plurality of sentences in the document. The similarity information is calculated using vector information of substrings selected from the title candidate sentence and vector information of substrings selected from a sentence in the document. The vector information is calculated from the frequency of occurrence of N-grams (N is a natural number of 2 or more) selected from the title candidate sentence and the frequency of occurrence of N-grams selected from the sentence in the document. Using such similarity information makes it possible to extract and determine the document title with high accuracy by exploiting linguistic information without morphological analysis.

Further, when the vector information is calculated from N-gram frequencies, the vector information is corrected if it contains a predetermined prohibited N-gram. Excluding character strings that cannot be a title, or are unlikely to be one, improves the accuracy of determining and extracting the document title.

Alternatively, the similarity information may be calculated from the edit distance between the title candidate sentence and a sentence in the document, or from the length of the longest common character string between them.

Further, when a title candidate sentence contains a predetermined title keyword, the feature amount may include title keyword information indicating the position and frequency of that keyword. Likewise, when a title candidate sentence contains a predetermined title-prohibited keyword, the feature amount may include title-prohibited keyword information indicating the position and frequency of that keyword. Including these varied features in the feature amount of the title candidate sentence improves title determination accuracy.

The determination means for determining the document title selects the most suitable title candidate sentence based on the feature amounts of the title candidate sentences. Preferably, the feature amounts are classified by an SVM (support vector machine) to make the determination. The output means includes, for example, a display device such as a display, and outputs the determined title sentence and related information. The related information is, for example, the author or organization name.

The document title extraction device may further include input means for inputting an image document and text document extraction means for extracting a text document from the input image document, in which case the title candidate sentence extraction means extracts title candidate sentences from the extracted text document. The input means for inputting the image document includes a scanner that optically reads the image document, and text data is extracted by OCR or the like from the image document data read by the scanner. Preferably, the title candidate sentence extraction means extracts title candidate sentences from a fixed candidate target range starting at the beginning of the text document. This is because text that can serve as a title is usually found near the beginning of a document.

The feature amount may also include layout information obtained from the input image document. Using this information improves the accuracy of document title determination.

A method for extracting a title from a document according to the present invention includes the steps of: extracting a plurality of title candidate sentences from the document; extracting, for every title candidate sentence, a feature amount that includes similarity information between the title candidate sentence and a plurality of sentences in the document; determining the document title from among the title candidate sentences based on the extracted feature amounts; and outputting the determination result. A program for extracting a title from a document according to the present invention likewise includes the steps of: extracting a plurality of title candidate sentences from the document; extracting, for every title candidate sentence, a feature amount that includes similarity information between the title candidate sentence and a plurality of sentences in the document; determining the document title from among the title candidate sentences based on the extracted feature amounts; and outputting the determination result.

According to the document title extraction device of the present invention, a feature amount is extracted for each title candidate sentence, and that feature amount includes similarity information that is a function of the similarity between the title candidate sentence and a plurality of sentences in the document. The document title and related attributes can therefore be extracted with high accuracy by a flexible determination method that makes full use of linguistic knowledge, without necessarily depending on the document layout, image information, or content range. Preferably, extraction uses an SVM, which makes the result less susceptible to incomplete application of discrimination rules and to OCR misrecognition; this is well suited to automatically extracting titles and related information from scanned text documents (text documents produced by OCR). Using an SVM also allows the extraction performance of the system (broader extraction range, higher extraction accuracy) to be improved through learning.

Hereinafter, the best mode for carrying out the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a document title extraction device according to an embodiment of the present invention. The title extraction device 10 includes an input device 12, a display device 14, a main storage device 16, a storage device 18, a central processing unit (CPU) 20, and a bus 22 connecting them.

The input device 12 includes a keyboard for entering information by key operation, an optical reader (scanner) that optically reads documents printed on a manuscript, an input interface that receives data from external devices or external memory, and the like. The display device 14 includes a display or similar device that shows the title extracted from the document and its related information. The main storage device 16 includes ROM or RAM and stores the programs for extracting title candidate sentences from a document, extracting their feature amounts, and determining the document title, as well as processed data. The storage device 18 includes a mass storage device such as a hard disk and stores image document data optically read by the scanner, the various dictionary databases used when extracting feature amounts, and the like. The CPU (Central Processing Unit) 20 controls each unit according to the programs stored in the main storage device 16.

FIG. 2 is a functional block diagram of the document title extraction device. The document input unit 30 inputs the text of a document. The text may be, for example, text data received through the input interface, or text data extracted by OCR (a character recognition device) from image document data optically read by the scanner. It may of course be obtained by other means as well.

The title candidate sentence extraction unit 32 extracts title candidate sentences, that is, sentences that could be the title, from the input text. It takes a predetermined range from the beginning of the input text document as the candidate target range, and treats each portion of the text in that range delimited by specific symbols and line breaks as a title candidate sentence.

FIG. 3 shows the operation flow of the title candidate sentence extraction unit 32. The unit sets the first α% of the input document as the candidate target range (step S101). α is an integer, for example 50. Next, the unit takes the portions of the text in the candidate target range that are delimited by the symbols (;.?!=~@#$%^&*_|\n；。？！…) and line breaks as title candidate sentences (step S102). Finally, the resulting title candidate sentences are stored as a set in a storage device or the like (step S103).
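Steps S101 to S103 can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the function name `extract_title_candidates` is hypothetical, and measuring the α% range in characters (rather than bytes or lines) is an assumption, since the text does not say which unit is used.

```python
import re

# Delimiters listed for step S102 (ASCII and full-width symbols plus newline).
DELIMITERS = ";.?!=~@#$%^&*_|\n；。？！…"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def extract_title_candidates(text: str, alpha: int = 50) -> list[str]:
    """Return title candidate sentences from the first alpha percent of text."""
    # Step S101: candidate target range (counted in characters, an assumption).
    cutoff = len(text) * alpha // 100
    target = text[:cutoff]
    # Step S102: split at the delimiter symbols and line breaks.
    pieces = _SPLIT_RE.split(target)
    # Step S103: keep the non-empty pieces as the candidate set.
    return [p.strip() for p in pieces if p.strip()]
```

For a 65-character input with `alpha=50`, only the first 32 characters are split, so body text in the second half of the document never becomes a candidate.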

FIG. 4 shows an example in which the input document is in Japanese. FIG. 4(a) is the input document, text read by a scanner or the like; FIG. 4(b) is an example in which the first 50% of the input document has been extracted as the candidate target range; and FIG. 4(c) shows the set of title candidate sentences obtained by splitting the candidate target range at symbols and line breaks.

Returning to FIG. 2, the extracted title candidate sentences are supplied to the feature amount extraction unit 34. The feature amount extraction unit 34 extracts, from every title candidate sentence, the feature amount used to determine the similarity of the title candidate sentence. As shown in FIG. 5, the feature amount consists of nine elements: candidate sentence length 40, similarity rank information 41, author information 42, organization name information 43, title keyword information 44, author position information 45, title-prohibited keyword 46, postal code 47, and number of symbols 48.

FIG. 6 explains how each component of the feature amount is calculated. The "candidate sentence length" 40 is the length of the title candidate sentence in bytes. It is expressed, for example, as the length in bytes divided by the constant 150.
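As a small illustration of the length feature, the sketch below divides the byte length by 150. The function name `length_feature` is hypothetical, and the default Shift_JIS encoding is an assumption: the text gives the unit as bytes but does not name a character encoding.

```python
def length_feature(candidate: str, encoding: str = "shift_jis",
                   scale: float = 150.0) -> float:
    """Candidate sentence length in bytes divided by the constant 150."""
    # The byte count depends on the encoding, which the source does not specify.
    return len(candidate.encode(encoding)) / scale
```

With Shift_JIS, each kanji occupies two bytes, so a five-character candidate such as 知的財産権 yields 10 / 150.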

The "similarity rank information" 41 is obtained as follows. The similarity between the title candidate sentence and each other sentence in the document is calculated, and the highest value is taken as the similarity of that title candidate sentence. All title candidates are then sorted in descending order of similarity and assigned ranks 1 through M (M is the number of title candidate sentences). The similarity rank information is expressed as 1 / (similarity rank).
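The ranking step can be sketched as follows. The helper names are hypothetical, the similarity function `sim` is passed in by the caller, and breaking ties by original order is an assumption, since the text does not address tied similarities.

```python
from typing import Callable


def best_similarity(candidate: str, sentences: list[str],
                    sim: Callable[[str, str], float]) -> float:
    """Highest similarity between a candidate and the other sentences."""
    others = [s for s in sentences if s != candidate]
    return max((sim(candidate, s) for s in others), default=0.0)


def similarity_rank_features(best_sims: list[float]) -> list[float]:
    """Map each candidate's best similarity to 1 / rank (rank 1 = most similar)."""
    # Sort candidate indices by similarity, descending; ties keep input order.
    order = sorted(range(len(best_sims)), key=lambda i: best_sims[i], reverse=True)
    features = [0.0] * len(best_sims)
    for rank, idx in enumerate(order, start=1):
        features[idx] = 1.0 / rank
    return features
```

The most similar candidate receives 1.0, the next 0.5, and so on down to 1/M.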

The similarity can be obtained by any of the following methods.
Method 1: Obtain the similarity between sentences (or the distance between sentences) using VSM (vector space model) vector features of the title candidate sentence. The VSM vector features may be the TF (TF/IDF) of words, or a function of TF and IDF. Alternatively, the character string may be split into N-grams, and the TF (TF/IDF) of the N-grams, or a function of TF and IDF, may be used. Any published method of calculating the similarity or distance between vectors may also be used.
Method 2: Obtain the distance between sentences from the edit distance between the character strings.
Method 3: Obtain the similarity between sentences from the length of the longest common character string between the two strings.
Method 4: Any other published method.

In this embodiment, as described later, the character string is split into 2-grams and the similarity between the 2-gram vector features is calculated.
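Methods 2 and 3 in the list above can be illustrated with standard dynamic-programming routines. This is a sketch under assumptions: Levenshtein distance is used as the edit distance, and the "longest common character string" is read as the longest common contiguous substring; the text does not fix either choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (Method 2)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Deletion, insertion, or substitution, whichever is cheapest.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b (Method 3)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            # Extend the common run ending at the previous characters, or reset.
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[-1])
        prev = curr
    return best
```

A smaller edit distance, or a longer common substring, indicates that the two sentences are more similar.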

For the "author information" 42, a flag is set to "1" when the title candidate sentence contains an author, and to "0" otherwise. For example, a published proper-name extraction technique or personal-name extraction technique can be used. FIG. 7 is a surname dictionary listing Japanese surnames found in telephone directories and the like, ranked by number of occurrences. The title candidate sentence may be compared with the surname dictionary of FIG. 7, with the flag set to "1" on a hit and "0" otherwise. Similarly, a dictionary of Japanese given names may be prepared and compared with the title candidate sentence, with the flag set to "1" on a hit and "0" otherwise. The flag may also be set to "1" only when both a surname and a given name are hit. A hit need not be an exact match of the character string; a partial match such as a prefix match or suffix match also counts.
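A dictionary lookup of this kind might look like the sketch below. The dictionary entries are illustrative placeholders rather than the contents of FIG. 7, the function name `author_flag` is hypothetical, and substring containment is used as one possible reading of "partial match".

```python
# Illustrative placeholder dictionaries (not the dictionary of FIG. 7).
SURNAMES = frozenset({"佐藤", "鈴木", "高橋"})
GIVEN_NAMES = frozenset({"太郎", "花子"})


def author_flag(candidate: str,
                surnames: frozenset = SURNAMES,
                given_names: frozenset = GIVEN_NAMES,
                require_both: bool = False) -> int:
    """Return 1 if the candidate appears to contain an author name, else 0."""
    # Substring containment stands in for the partial-match rule (an assumption).
    has_surname = any(s in candidate for s in surnames)
    has_given = any(g in candidate for g in given_names)
    if require_both:
        return int(has_surname and has_given)
    return int(has_surname or has_given)
```

The `require_both` switch corresponds to the stricter variant in which the flag is set only when both a surname and a given name are found.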

For the "organization name information" 43, a flag is set to "1" when the title candidate sentence contains organization name information, and to "0" otherwise. For example, the title candidate sentence is compared with an organization name dictionary in which organization names have been registered in advance; the flag is set to "1" on a hit and "0" otherwise. A hit need not be an exact match of the character string; a partial match such as a prefix match or suffix match also counts.

The "title keyword information" 44 indicates whether the title candidate sentence contains a predetermined title keyword, and is the sum of the position of the title keyword and the frequency with which the title keyword appears. Title keywords are registered in advance, for example in a title keyword dictionary. For the "author position information" 45, the title candidate sentences are numbered from 1 in ascending order of their position in the document. Suppose an author first appears in the i-th title candidate sentence. Then the "author position information" is 1 for the title candidate sentences numbered 1 through i+3, and 0 for the other candidate sentences.
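The author position rule can be sketched as follows, taking the per-candidate author flags as input. The function name is hypothetical, and returning all zeros when no candidate contains an author is an assumption; the text does not cover that case.

```python
def author_position_features(author_flags: list[int], window: int = 3) -> list[int]:
    """Set 1 for candidates numbered 1..i+window, where i is the first author hit."""
    try:
        first = author_flags.index(1) + 1  # 1-based number i of the first author hit
    except ValueError:
        # No author found anywhere: assumed to yield 0 for every candidate.
        return [0] * len(author_flags)
    limit = first + window
    return [1 if n <= limit else 0 for n in range(1, len(author_flags) + 1)]
```

With the author first appearing in candidate 2, candidates 1 through 5 receive 1 and later candidates receive 0.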

The "title-prohibited keyword" 46 indicates whether the title candidate sentence contains a predetermined keyword that is prohibited in titles, and is the sum of the position of the title-prohibited keyword and the frequency with which it appears. Character strings that are never used in titles, or are unlikely to be used, are registered in a dictionary in advance as title-prohibited keywords, and the candidate is checked against that dictionary.

For the "postal code" 47, six consecutive digits are treated as a postal code. A flag is set to "1" when the title candidate sentence contains a postal code, and to "0" otherwise. The "number of symbols" 48 is the number of occurrences of "，", "．", and "；" in the title candidate sentence.
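These two features reduce to a regular-expression check and a character count, as sketched below. The function names are hypothetical. Counting the ASCII forms of the three symbols alongside the full-width forms is an assumption, as is treating any run of six digits inside a longer number as a hit.

```python
import re

# Six consecutive digits; \d also matches full-width digits in Python str patterns.
_POSTAL_RE = re.compile(r"\d{6}")
# Full-width symbols from the text, plus ASCII equivalents (an assumption).
_COUNTED_SYMBOLS = frozenset("，．；,.;")


def postal_code_flag(candidate: str) -> int:
    """Return 1 if the candidate contains six consecutive digits, else 0."""
    return int(_POSTAL_RE.search(candidate) is not None)


def symbol_count(candidate: str) -> int:
    """Count commas, periods, and semicolons in the candidate."""
    return sum(1 for ch in candidate if ch in _COUNTED_SYMBOLS)
```

A line such as an address or an author list typically triggers one or both features, which helps separate it from a title line.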

Although FIG. 5 shows the candidate sentence feature amount as consisting of nine elements, it is not limited to this. Title extraction needs to include at least the second element, the "similarity rank information", which may be combined with other information as appropriate. For example, the feature amount may consist of the "similarity rank information" and the fifth element, the "title keyword information", or of the "similarity rank information" and the seventh element, the "title-prohibited keyword information". Other linguistic information, such as address information, may of course be added. Furthermore, for an image document read by a scanner, document layout information (such as the positional relationship of candidate sentences) and image information (such as character size, color, and typeface) can be obtained, and these can also be added to the feature amount.

Returning again to FIG. 2, the feature amounts of all title candidate sentences extracted by the feature amount extraction unit 34 are supplied to the title determination unit 36. The title determination unit 36 consists of a determination classifier built by learning. The classifier may use any published classification technique; as a specific example, SVM (Support Vector Machine) classification can be used. An SVM engine is described, for example, in the paper "Support Vector Machine によるテキスト分類" (Text Classification by Support Vector Machine), 1998, 自然言語処理 (Natural Language Processing), 128-24.
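To make the determination step concrete, the sketch below scores each candidate with a linear decision function f(x) = w·x + b, the form a linear-kernel SVM takes once trained. The weights and bias are hypothetical inputs assumed to come from prior training, and selecting the candidate with the highest decision value is an assumption; the text does not state how one title is chosen among the candidates.

```python
def decision_value(features: list[float], weights: list[float], bias: float) -> float:
    """Linear decision function w . x + b of a trained linear classifier."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def pick_title(candidates: list[str], feature_vectors: list[list[float]],
               weights: list[float], bias: float) -> tuple[str, float]:
    """Return the candidate with the highest decision value and that value."""
    scores = [decision_value(f, weights, bias) for f in feature_vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In practice the weights would be learned from labeled documents, and a non-linear kernel could replace the dot product.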

When the title determination unit 36 has extracted the title, the extraction result is supplied to the extraction result output unit 38. The extraction result output unit 38 causes the display device 14 to display the extracted title. Related information such as the author may be displayed at the same time.

Next, the method of calculating the similarity used in the feature amount is described. First, every run of two consecutive characters (2-gram) is extracted from the title candidate sentence, from left to right. For example, if the title candidate sentence is 「知的財産権」 ("intellectual property rights"), the 2-gram strings 「知的」, 「的財」, 「財産」, and 「産権」 are cut out. The 2-gram vector feature of the title candidate sentence is written A = (β1, β2, ..., βN), and the 2-gram vector feature of another sentence in the document is written B = (β'1, β'2, ..., β'N). The similarity sim(A, B) between the title candidate sentence and every other sentence in the document is then calculated as the cosine between the two vectors:

$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{N} \beta_i \beta'_i}{\sqrt{\sum_{i=1}^{N} \beta_i^2}\,\sqrt{\sum_{i=1}^{N} {\beta'_i}^2}}$$
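The 2-gram extraction and vector comparison can be sketched as follows. The cosine measure follows the closing summary of this embodiment, which names the Cos value between vectors as the inter-sentence similarity. Raw 2-gram counts are used as the vector components here, which is a simplification: the embodiment uses a corrected frequency #'(x) whose formula is not reproduced in this text. The function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(sentence: str) -> list[str]:
    """All runs of two consecutive characters, left to right."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine between the 2-gram count vectors of two sentences."""
    va, vb = Counter(bigrams(a)), Counter(bigrams(b))
    # Only 2-grams present in both sentences contribute to the dot product.
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Sentences shorter than two characters have no 2-grams.
    return dot / (norm_a * norm_b)
```

Because the comparison works on character pairs, a single misrecognized character changes at most two 2-grams, leaving the rest of the vector intact.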

FIG. 8 shows the flow for calculating the 2-gram vector feature. First, every run of two consecutive characters (2-gram) is extracted from the title candidate sentence, left to right (step S201). Next, the appearance frequency #(x) of every 2-gram is obtained (step S202). Next, the prohibited 2-gram dictionary 50, in which 2-grams whose use is prohibited are registered in advance, is consulted; if a prohibited 2-gram is present, the number of dimensions of the vector feature is corrected (step S203). Finally, the vector features A and B are generated using the corrected 2-gram frequency #'(x) (step S204).
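The flow of steps S201 to S204 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "correcting the number of dimensions" is read as dropping prohibited 2-grams, raw counts stand in for the corrected frequency #'(x), and the prohibited 2-gram 「の保」 is a made-up entry.

```python
from collections import Counter
from math import sqrt


def bigrams(sentence):
    """Step S201: every run of two consecutive characters, left to right."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def bigram_vector(sentence, forbidden=frozenset()):
    """Steps S202 to S204: count each 2-gram, then drop prohibited ones.

    Assumption: dropping a prohibited 2-gram is how the dimension count is
    "corrected", and raw counts are used in place of #'(x).
    """
    counts = Counter(bigrams(sentence))
    return {gram: n for gram, n in counts.items() if gram not in forbidden}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(gram, 0) for gram, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


candidate = bigram_vector("知的財産権")
other = bigram_vector("知的財産権の保護", forbidden={"の保"})
print(bigrams("知的財産権"))
print(round(cosine(candidate, other), 3))
```

Storing the vectors as dictionaries keeps only the 2-grams that actually occur, which avoids fixing the dimension N in advance.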

FIG. 9 shows the method for calculating the 2-gram frequency #'(x). It uses the following quantities:
MI(x, y): the mutual information of 2-grams x and y;
#(x): the number of times 2-gram x appears in the document;
N: the total number of occurrences of all 2-grams;
#(x, y): the number of times x and y co-occur in the document.
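The formula of FIG. 9 is not reproduced in this text, so the way MI(x, y) enters #'(x) cannot be recovered here. For orientation only, the standard pointwise mutual information written with the symbols above, assuming that both #(x) and #(x, y) are normalized by the same total N, is:

```latex
\mathrm{MI}(x, y)
  = \log \frac{P(x, y)}{P(x)\,P(y)}
  = \log \frac{\#(x, y) / N}{\bigl(\#(x)/N\bigr)\bigl(\#(y)/N\bigr)}
  = \log \frac{N \cdot \#(x, y)}{\#(x)\,\#(y)}
```

This is the textbook definition and is not claimed to be the exact expression used in the figure.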

As described above, the document title extraction apparatus of this embodiment extracts similarity features from the title candidate sentences and extracts and determines the document title based on those features. By combining linguistic information with statistically based decision criteria, it can extract the document title and related information with high accuracy. Because the title and related information are extracted purely from the content of the document, extraction is highly versatile and does not necessarily depend on the document layout, image information, or subject area. Keyword information, abstract content, and domain-specific information for a paper are not required, so the scope of title extraction does not depend on the domain. Furthermore, morphological analysis is not used: two-character substrings are extracted from the title candidate sentence, the mutual information between all two-character strings present in a sentence is used as that sentence's vector, and the Cos value between vectors is used as the similarity between sentences. As a result, slight OCR misrecognition has little effect on title determination, which makes the method well suited to extracting titles from scanned image documents. Information such as the length of the title candidate sentence, the similarity rank, author, organization name, title keywords, the distance between the candidate sentence and the author, title-prohibited keywords, postal code, and symbols is used as the feature amount of the sentence, and a classifier (for example, an SVM) decides whether the sentence is a title, which makes highly accurate title extraction possible.

FIG. 10 is a block diagram showing a second embodiment of the document title extraction apparatus of the present invention. The second embodiment modifies the document input unit 30 shown in FIG. 2. The image document input unit 60 inputs an image document, for example by means of a scanner, and outputs the image document data to the layout and image information extraction unit 62. The layout and image information extraction unit 62 extracts layout information and image information from the image document data. The layout information includes, for example, the positional relationship of the title candidate sentences, and the image information includes the character size, color, typeface, and the like. Known techniques can be used to extract the layout information and image information; they are disclosed in, for example, JP-A-9-134406 and JP-A-2000-148788.

Next, the text information extraction unit 64 extracts text information from the image information, for example by OCR. A known OCR technique or a commercially available OCR product can be used. The extracted text information is supplied to the title candidate sentence extraction unit 32. In the second embodiment, the feature amount extraction unit 34 can also include the layout information and image information obtained by the layout and image information extraction unit 62 when it extracts the feature amount of a title candidate sentence.

In the second embodiment, an image document is read by a scanner or the like, and the document title is extracted automatically from the image document that was read. In addition, adding the layout information contained in the image document data to the feature amount of each title candidate sentence further improves the accuracy of title determination.

Next, an example in which the document title extraction apparatus of the present invention is applied to Chinese documents is described. For a Chinese document as well, as shown in FIG. 2, a text document is input by the document input unit 30, and title candidate sentences are extracted from the document by the title candidate sentence extraction unit 32. The feature amount extraction unit 34 is adapted to Chinese for features such as author names and organization names, as described below.

FIG. 11 shows a character dictionary for Chinese surnames and a character dictionary for Chinese given names. The method described here is limited to Chinese personal names. Author extraction consists of identifying the surname and identifying the given name of a Chinese personal name, and the following criteria can be used.

Few Chinese personal names have four or more characters, so if the character string of the title candidate sentence has four or more characters, it is judged not to be a personal name.
Two-character surnames are very rare in Chinese personal names, so it is checked whether the first two characters of the title candidate sentence form a two-character surname. If they do, the candidate character string can be judged to be a surname.
A personal-name judgment value is calculated. First, a list of the characters that appear in Chinese surnames, with their appearance frequencies (called the surname character dictionary), and a list of the characters that appear in given names (called the given-name character dictionary) are prepared. Both dictionaries are sorted in descending order of appearance frequency. The surname character dictionary is then divided into three groups, A, B, and C.
Group A: the surname character dictionary is scanned from the top, and the characters scanned while the cumulative appearance frequency stays within 95% of the total form group A.
Group B: the surname character dictionary is scanned from the top, and the characters scanned while the cumulative appearance frequency stays within 99% of the total form group B.
Group C: the set of all characters in the surname character dictionary is group C. In other words, a character that falls in the remaining 1% belongs to group C, outside A and B.

Similarly, the given-name character dictionary is divided into three groups, D, E, and F. The judgment values for the surname and the given name are denoted M and N, respectively.
If the leading part of the candidate character string is a surname contained in set A, M = SA;
if the leading part of the candidate character string is a surname contained in set B, M = SB;
if the leading part of the candidate character string is a surname contained in set C, M = SC;
if the leading part of the candidate character string is not a surname contained in set C, M = 0;
if the final part of the candidate character string is a given-name character contained in set D, N = SD;
if the final part of the candidate character string is a given-name character contained in set E, N = SE;
if the final part of the candidate character string is a given-name character contained in set F, N = SF;
if the final part of the candidate character string is not a given-name character contained in set F, N = 0.

If M + N > threshold, the candidate character string is judged to be a personal name. SA, SB, SC, SD, SE, and SF are constants satisfying SA > SB > SC and SD > SE > SF.
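The judgment-value rule can be sketched as below. Everything concrete here is an assumption made for illustration: the function names, the tiny dictionaries with made-up frequencies, the constant values, the threshold, and the reading of "leading part" and "final part" as the first and last character. The groups are treated as disjoint tiers, and the two-character surname check is left out.

```python
def split_groups(freq_sorted, cuts=(95, 99)):
    """Split a frequency-sorted [(char, freq), ...] list into three tiers.

    A character goes to tier 0 while the cumulative frequency is within
    cuts[0] percent of the total, to tier 1 while within cuts[1] percent,
    and to tier 2 otherwise (integer arithmetic avoids rounding issues).
    """
    total = sum(freq for _, freq in freq_sorted)
    tiers = (set(), set(), set())
    running = 0
    for char, freq in freq_sorted:
        running += freq
        if running * 100 <= cuts[0] * total:
            tiers[0].add(char)
        elif running * 100 <= cuts[1] * total:
            tiers[1].add(char)
        else:
            tiers[2].add(char)
    return tiers


def tier_score(char, tiers, scores):
    """Return the constant of the first tier containing char, else 0."""
    for tier, score in zip(tiers, scores):
        if char in tier:
            return score
    return 0


def is_person_name(text, surname_tiers, given_tiers,
                   surname_scores=(3, 2, 1), given_scores=(3, 2, 1),
                   threshold=3):
    """Apply the length rule, then test M + N > threshold."""
    if len(text) < 2 or len(text) >= 4:
        return False
    m = tier_score(text[0], surname_tiers, surname_scores)   # M: SA, SB, SC or 0
    n = tier_score(text[-1], given_tiers, given_scores)      # N: SD, SE, SF or 0
    return m + n > threshold


# Made-up frequencies, for illustration only.
surname_tiers = split_groups([("王", 60), ("李", 35), ("欧", 4), ("昝", 1)])
given_tiers = split_groups([("伟", 50), ("明", 45), ("芳", 4), ("燚", 1)])

print(is_person_name("王明", surname_tiers, given_tiers))
print(is_person_name("昝燚", surname_tiers, given_tiers))
```

Because SA > SB > SC and SD > SE > SF, a string built from common surname and given-name characters clears the threshold easily, while a string of rare characters does not.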

Next, the method for extracting organization names is described. Published proper-noun extraction or personal-name extraction techniques can be used here, together with the following criteria.

Length check. If the input character string is four characters or shorter, the title candidate sentence is judged not to be an organization name, and processing ends.
It is checked whether the character string of the title candidate sentence contains an organization name. FIG. 12 shows an example of a Chinese organization name dictionary. If the title candidate sentence contains an organization name from the dictionary, A is added to the judgment value.
It is judged whether the character string of the title candidate sentence is the full name of an organization. If it is, B is added to the judgment value. This is also done by matching against the organization name dictionary.
It is checked whether the character string of the title candidate sentence contains an organization name keyword. If it does, C is added to the judgment value. This check tests whether an organization name keyword (for example, 「大学」, "university") appears at a specific position in the character string (for example, at the end).
If, after these steps, judgment value > threshold holds, the character string of the title candidate sentence is judged to be an organization name. A, B, and C are constants.
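The organization-name criteria above can be sketched as a simple additive score. The function name, the dictionary entries, the keyword list, the values of the constants A, B, and C, and the threshold are all assumptions chosen for illustration, and the keyword is checked only at the end of the string.

```python
def is_organization_name(text, org_names, org_keywords,
                         a=2, b=3, c=1, threshold=2):
    """Score a candidate string against the organization-name criteria."""
    # Length check: strings of four characters or fewer are rejected.
    if len(text) <= 4:
        return False
    score = 0
    # +A if the string contains a dictionary organization name.
    if any(name in text for name in org_names):
        score += a
    # +B if the whole string is the full name of a dictionary organization.
    if text in org_names:
        score += b
    # +C if an organization keyword appears at the end of the string.
    if any(text.endswith(keyword) for keyword in org_keywords):
        score += c
    return score > threshold


# Hypothetical dictionary and keyword list.
ORG_NAMES = {"中国科学院"}
ORG_KEYWORDS = {"大学", "研究所"}

print(is_organization_name("中国科学院计算技术研究所", ORG_NAMES, ORG_KEYWORDS))
print(is_organization_name("某某技术大学", ORG_NAMES, ORG_KEYWORDS))
```

With these sample constants, a keyword match alone does not clear the threshold; a dictionary match is also needed.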

FIG. 13 shows a Chinese 2-gram title keyword dictionary and a 2-gram title-prohibited keyword dictionary. These dictionaries are consulted when the title keyword information 44 and the title-prohibited keyword information 46 (see FIG. 5) are extracted as part of the feature amount of a title candidate sentence. For example, 「本病」, shown at the top of the title-prohibited keywords, means "this disease", and a character string like this is judged not to be used in document titles.

Next, an example of extracting title candidate sentences from a Chinese sample document is described. FIG. 14(a) shows the title candidate sentences extracted from the Chinese sample document (with the Japanese meaning in parentheses). FIG. 14(b) shows the feature amounts of these title candidate sentences. In the figure, the number before ":" is the dimension number of the feature amount. The dimensions are the nine elements shown in FIG. 5: 1 is the length of the candidate sentence, 2 is the similarity rank, 3 is author information, 4 is organization name information, 5 is title keyword information, 6 is author position information, 7 is title-prohibited keyword information, 8 is the postal code, and 9 is the number of symbols. The number after ":" is the value of that dimension. The text after "#" is the corresponding title candidate sentence.

For example, for the second title candidate sentence from the top, the similarity rank in dimension 2 is "1", which means that its 2-gram vector similarity to the other title candidate sentences of the document is the highest, and the title keyword information in dimension 5 is "1", which means that it contains a title keyword.
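A line in the "dimension:value ... # sentence" layout described for FIG. 14(b) could be read as in the sketch below. The sample line and its values are invented for illustration and are not taken from the figure; the function name and the handling of tokens without ":" are likewise assumptions.

```python
FEATURE_NAMES = {
    1: "candidate length",
    2: "similarity rank",
    3: "author information",
    4: "organization name information",
    5: "title keyword information",
    6: "author position information",
    7: "title-prohibited keyword information",
    8: "postal code",
    9: "number of symbols",
}


def parse_feature_line(line):
    """Split 'dim:value ... # sentence' into a feature dict and the sentence."""
    features_part, _, sentence = line.partition("#")
    features = {}
    for token in features_part.split():
        if ":" not in token:
            continue  # skip anything that is not a dim:value pair
        index, _, value = token.partition(":")
        features[int(index)] = float(value)
    return features, sentence.strip()


# Invented sample line.
features, sentence = parse_feature_line("1:12 2:1 3:0 4:0 5:1 6:2 7:0 8:0 9:0 # 示例标题")
for index in sorted(features):
    print(FEATURE_NAMES[index], features[index])
print(sentence)
```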

The feature amounts of the title candidate sentences obtained in this way are classified by the SVM. FIG. 15 shows the result of classifying the feature amounts of FIG. 14(b) with the SVM; the data in the first column, enclosed by a broken line, represent the distance to the positive-class decision boundary. As this result shows, the candidate closest to the positive-class decision boundary is "工?段?量工作管理初探" (a preliminary study of measurement work management in a track maintenance section), and this candidate sentence is extracted as the document title.

Preferred embodiments of the present invention have been described in detail above. The present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the invention set out in the claims.

The document title extraction apparatus according to the present invention can be used, as a document information extraction method based on linguistic knowledge, to extract document titles in various languages. Furthermore, paper documents can be digitized in real time, much like making a copy, and indexed automatically without depending on their layout, image information, or subject area. The apparatus is therefore well suited to a general-purpose scan-and-index system.

FIG. 1 shows the hardware configuration that runs the document title extraction apparatus according to an embodiment of the present invention.
FIG. 2 shows the functional blocks of the document title extraction apparatus according to this embodiment.
FIG. 3 illustrates the operation flow of the title candidate sentence extraction unit.
FIG. 4 shows an example of title candidate sentences extracted from a Japanese input document.
FIG. 5 illustrates the feature amounts extracted by the candidate sentence similarity feature amount extraction unit.
FIG. 6 illustrates how the feature amounts are obtained.
FIG. 7 shows an example of a Japanese surname dictionary.
FIG. 8 shows the flow for calculating the 2-gram vector feature.
FIG. 9 illustrates the method for calculating the 2-gram frequency #'(x).
FIG. 10 shows the functional blocks of the document title extraction apparatus according to the second embodiment of the present invention.
FIG. 11 shows the dictionaries of Chinese surnames and given names.
FIG. 12 shows a dictionary of Chinese organization names.
FIG. 13 shows the Chinese 2-gram title keyword dictionary and the 2-gram title-prohibited keyword dictionary.
FIG. 14(a) shows the title candidate sentences of the Chinese sample document, and FIG. 14(b) shows the feature amounts of the title candidate sentences.
FIG. 15 shows the result of classifying the feature amounts of the title candidate sentences shown in FIG. 14 with the SVM.

Explanation of symbols

10: Document title extraction apparatus
12: Input device
14: Display device
16: Main storage device
18: Storage device
20: CPU
30: Document input unit
32: Title candidate sentence extraction unit
34: Feature amount extraction unit
36: Title determination unit
38: Extraction result output unit
60: Image document input unit
62: Layout and image information extraction unit
64: Text document extraction unit

Claims

A document title extraction apparatus comprising:
title candidate sentence extraction means for extracting a plurality of title candidate sentences from a document;
feature amount extraction means for extracting a feature amount of each of the extracted title candidate sentences;
determination means for determining a document title from among the plurality of title candidate sentences based on the extracted feature amounts; and
output means for outputting the determination result,
wherein the feature amount includes at least similarity information that is a function of the similarity between a title candidate sentence and a plurality of sentences in the document.

The document title extraction apparatus according to claim 1, wherein the similarity information includes rank information indicating a degree of similarity between a title candidate sentence and a plurality of sentences in the document.

The document title extraction apparatus, wherein the similarity information is calculated using vector information of a sub-character string selected from the title candidate sentence and vector information of a sub-character string selected from a sentence in the document.

The vector information is calculated based on an appearance frequency of N (N is a natural number greater than or equal to 2) grams selected from title candidate sentences and an appearance frequency of N grams selected from sentences in the document. 3. The document title extraction device according to any one of 3.

The document title extraction apparatus according to claim 4, wherein, when the vector information is calculated based on the appearance frequency of the N-grams, the vector information is corrected if a predetermined use-prohibited N-gram is included.

The document title extraction apparatus according to claim 1, wherein the similarity information is calculated based on an edit distance between a title candidate sentence and a sentence in the document.

The document title extraction apparatus according to claim 1, wherein the similarity information is calculated based on a length of a title candidate sentence and a maximum common character string of sentences in the document.

The document title extracting apparatus according to claim 1, wherein when the title candidate sentence includes a predetermined title keyword, the feature amount includes title keyword information indicating a position and an appearance frequency of the keyword.

The document title extraction apparatus according to claim 1, wherein, when the title candidate sentence includes a predetermined use-prohibited title keyword, the feature amount includes use-prohibited title keyword information indicating the position and appearance frequency of that keyword.

The document title extraction device according to claim 1, wherein the determination unit classifies the feature amount of each title candidate sentence by SVM and extracts an optimal title candidate sentence from the classification result.

The document title extraction apparatus according to claim 1, wherein the output unit outputs the determined title sentence and related information.

The document title extraction apparatus according to claim 1, further comprising an input unit for inputting an image document and a text document extraction unit for extracting a text document from the input image document, wherein the title candidate sentence extraction unit extracts title candidate sentences from the extracted text document.

The document title extracting apparatus according to claim 12, wherein the title candidate sentence extracting unit extracts a title candidate sentence in a certain candidate target range from the beginning of the text document.

The document title extracting apparatus according to claim 13, further comprising means for extracting layout information from the image document, wherein the feature amount includes the extracted layout information.

A document title extraction method for extracting a title from a document, comprising the steps of:
extracting a plurality of title candidate sentences from the document;
extracting, for every title candidate sentence, a feature amount including similarity information that is a function of the similarity between the title candidate sentence and a plurality of sentences in the document;
determining a document title from among the plurality of title candidate sentences based on the extracted feature amounts; and
outputting the determination result.

The document title extraction method according to claim 15, wherein the similarity information includes rank information indicating a degree of similarity between a title candidate sentence and a plurality of sentences in the document.

The document title extraction method according to claim 15 or 16, wherein the similarity information is calculated using vector information of a sub-character string selected from the title candidate sentence and vector information of a sub-character string selected from a sentence in the document.

16. The vector information is calculated based on an appearance frequency of N (N is a natural number greater than or equal to 2) grams selected from title candidate sentences and an appearance frequency of N grams selected from sentences in a document. The document title extraction method according to any one of 17.

The document title extraction method according to claim 18, wherein, when the vector information is calculated based on the appearance frequency of the N-grams, the vector information is corrected if a predetermined use-prohibited N-gram is included.

The document title extraction method according to claim 15, wherein the similarity information is calculated from an edit distance between a title candidate sentence and a sentence in the document.

The document title extraction method according to claim 15, wherein the similarity information is calculated based on a length of a title candidate sentence and a maximum common character string of sentences in the document.

The document title extraction method according to claim 15, wherein the feature amount includes title keyword information indicating a position and an appearance frequency of a keyword when a predetermined title keyword is included in the title candidate sentence.

The document title extraction method according to claim 15, wherein, when the title candidate sentence includes a predetermined use-prohibited title keyword, the feature amount includes use-prohibited title keyword information indicating the position and appearance frequency of that keyword.

A document title extraction program for extracting a title from a document, the program comprising the steps of:
extracting a plurality of title candidate sentences from the document;
extracting, for every title candidate sentence, a feature amount including similarity information that is a function of the similarity between the title candidate sentence and a plurality of sentences in the document;
determining a document title from among the plurality of title candidate sentences based on the extracted feature amounts; and
outputting the determination result.

The document title extraction program according to claim 24, wherein the similarity information is calculated using the appearance frequency of N-grams (N being a natural number of 2 or more) selected from the title candidate sentence and the appearance frequency of N-grams selected from a sentence in the document.