JP2009199302A - Program, device, and method for analyzing document

Info

Publication number: JP2009199302A
Application number: JP2008039911A
Authority: JP
Inventors: Yasuaki Arakawa; Fumihiro Moriya; Kohei Yazaki; Keisuke Yoshizaki
Original assignee: NETSTAR Inc
Current assignee: NETSTAR Inc
Priority date: 2008-02-21
Filing date: 2008-02-21
Publication date: 2009-09-03
Anticipated expiration: 2028-02-21
Also published as: JP4959603B2

Abstract

Problem to be solved: To improve the accuracy of document category classification by determining the main parts of a document within its layout structure.
Solution: A category feature learning unit 2 extracts words from learning target documents and calculates a feature amount for each word in each category. A layout feature learning unit 3 calculates an importance for each partial structure of the layout based on those per-category feature amounts. The magnitude of the importance indicates how important each partial structure is within the document. A category probability determination unit 4 uses the importance of each partial structure to identify the parts of a document that matter for category classification, and then determines the category using the feature amounts and the importances.
Selected drawing: Figure 3

Description

The present invention relates to a program, an apparatus, and a method for analyzing the content and layout structure of structured documents.

A known technique classifies documents into categories automatically by tallying statistics such as word frequencies when analyzing document content.

One use of such a document classification technique is to determine whether the content of a web page falls into a category that is inappropriate for minors. This enables filtering that restricts the browsing of such web pages. Here, a web page means a document published on the Internet. Such a document contains HTML layout information and text information, as well as various other information embedded in it.

When web pages are automatically classified into categories according to predetermined criteria, the layout structure of a page matters in addition to its content. Patent Document 1 therefore discloses a technique that classifies documents by considering their appearance features in addition to their content features.

Patent Document 1: JP 2000-268040 A

Conventional techniques determine, for each category, a layout structure that is characteristic of that category. As a result, even though the correspondence between a particular layout structure and a category can be identified, it is not possible to determine where in a document its main subject matter is located.

For example, even if a web page has the usual layout structure of an electronic bulletin board, it is difficult to determine whether the page deals with sports topics or with game topics.

The invention according to claim 1 is a document analysis program for documents having a layout structure, the program causing a computer to execute: an element extraction process that extracts a plurality of constituent elements from a learning target document having a layout structure; a category feature learning process that calculates, for each constituent element extracted from the learning target document, a feature amount in the category of the learning target document; and a layout feature learning process that calculates, based on the learning target document and the feature amounts, an importance for each of a plurality of partial structures included in the layout structure of the learning target document.
The invention according to claim 2 is the document analysis program according to claim 1, further causing the computer to execute a category determination process that determines the category of a determination target document having a layout structure, based on the feature amounts calculated by the category feature learning process, the importances calculated by the layout feature learning process, and the layout structure of the learning target document.
The invention according to claim 3 is the document analysis program according to claim 2, wherein the category determination process outputs, for each of a plurality of categories, the probability that the determination target document belongs to that category.
The invention according to claim 4 is the document analysis program according to claim 1, further causing the computer to execute a document search process that searches a plurality of search target documents having a layout structure for a predetermined keyword and calculates the degree of relevance between each search target document and the keyword, based on the layout structure of the learning target document and the importances calculated by the layout feature learning process.
The invention according to claim 5 is the document analysis program according to any one of claims 1 to 3, wherein the document is image data.
The invention according to claim 6 is the document analysis program according to claim 5, wherein the constituent elements extracted from the image data by the element extraction means are pixels.
The invention according to claim 7 is the document analysis program according to any one of claims 1 to 4, wherein the document is a document written in HTML.
The invention according to claim 8 is the document analysis program according to any one of claims 1 to 4 or 7, wherein the constituent elements are words.
The invention according to claim 9 is the document analysis program according to any one of claims 1 to 4 or 7, wherein the constituent elements are words and co-occurrence relationships between the words.
The invention according to claim 10 is the document analysis program according to claim 8 or 9, wherein the words are extracted by morphological analysis.
The invention according to claim 11 is the document analysis program according to any one of claims 1 to 10, wherein the feature amount is calculated by the TF-IDF method.
The invention according to claim 12 is the document analysis program according to any one of claims 1 to 11, wherein the layout feature learning process creates a new document by extracting the common portion of the layout structures of a plurality of learning target documents having similar layout structures, and calculates an importance for each partial structure in the layout structure of the new document.
The invention according to claim 13 is the document analysis program according to any one of claims 1 to 12, wherein the layout structure has a tree structure.
The invention according to claim 14 is the document analysis program according to claim 13, wherein, when calculating the importance of a partial structure positioned higher in the tree structure, the layout feature learning process excludes the lower-positioned partial structures from the higher-positioned partial structure.
The invention according to claim 15 is the document analysis program according to claim 13, wherein, when calculating the importance of a partial structure positioned higher in the tree structure, the layout feature learning process includes the lower-positioned partial structures in the higher-positioned partial structure.
The invention according to claim 16 is a document analysis apparatus for documents having a layout structure, comprising: an element extraction unit that extracts a plurality of constituent elements from a learning target document having a layout structure; a category feature learning unit that calculates, for each constituent element extracted from the learning target document, a feature amount in the category of the learning target document; and a layout feature learning unit that calculates, based on the learning target document and the feature amounts, an importance for each of a plurality of partial structures included in the layout structure of the learning target document.
The invention according to claim 17 is the document analysis apparatus according to claim 16, further comprising a category determination unit that determines the category of a determination target document having a layout structure, based on the feature amounts calculated by the category feature learning unit, the importances calculated by the layout feature learning unit, and the layout structure of the learning target document.
The invention according to claim 18 is the document analysis apparatus according to claim 16, further comprising a document search unit that searches a plurality of search target documents having a layout structure for a predetermined keyword and calculates the degree of relevance between each search target document and the keyword, based on the layout structure of the learning target document and the importances calculated by the layout feature learning unit.
The invention according to claim 19 is a document analysis method for documents having a layout structure, comprising: an element extraction step of extracting a plurality of constituent elements from a learning target document having a layout structure; a category feature learning step of calculating, for each constituent element extracted from the learning target document, a feature amount in the category of the learning target document; and a layout feature learning step of calculating, based on the learning target document and the feature amounts, an importance for each of a plurality of partial structures included in the layout structure of the learning target document.
The invention according to claim 20 is the document analysis method according to claim 19, further comprising a category determination step of determining the category of a determination target document having a layout structure, based on the feature amounts calculated in the category feature learning step, the importances calculated in the layout feature learning step, and the layout structure of the learning target document.
The invention according to claim 21 is the document analysis method according to claim 19, further comprising a document search step of searching a plurality of search target documents having a layout structure for a predetermined keyword and calculating the degree of relevance between each search target document and the keyword, based on the layout structure of the learning target document and the importances calculated in the layout feature learning step.

According to the present invention, rather than linking a layout structure directly to a category, information on whether each part is important for category determination can be associated with the layout structure. That is, for a document having a layout structure, the part corresponding to the subject of the document can be identified, which improves the accuracy of category classification.

- First Embodiment -
A document classification apparatus according to one embodiment of the present invention is described with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of the document classification apparatus of this embodiment. The document classification apparatus 1 shown in the figure receives as input a document written in HTML (hereinafter, an HTML document). It executes category determination processing on the input HTML document according to a procedure described later. The result of the category determination processing is output as the probability, for each of a plurality of categories, that the input HTML document belongs to that category.

Before the document classification apparatus 1 can classify documents into categories, it must be trained. Training in the document classification apparatus 1 consists of two steps: category feature learning and layout feature learning. For training, a number of documents whose categories are already known are prepared; category feature learning is performed first, followed by layout feature learning.

The document classification apparatus 1 comprises a category feature learning unit 2, a layout feature learning unit 3, a category probability determination unit 4, and a data storage unit 5. The category feature learning unit 2 includes a learning target document input unit 21, an element extraction unit 22, and a category feature calculation unit 23, and performs category feature learning.

The layout feature learning unit 3 includes a learning target document input unit 31, a layout analysis unit 32, an element extraction unit 33, and a layout feature analysis unit 34, and performs layout feature learning. The category probability determination unit 4 includes a determination target document input unit 41, a category determination unit 42, an element extraction unit 43, and a category score calculation unit 44, and performs category determination processing on HTML documents.

The data storage unit 5 holds a category feature table 51, a layout feature table 52, a removal tag table 53, a removal attribute table 54, a similar tag table 55, and a category table 56. The category feature table 51 records the results of category feature learning. The layout feature table 52 records the results of layout feature learning. The removal tag table 53, the removal attribute table 54, and the similar tag table 55 hold the data used for normalizing HTML documents (described in detail later). The category table 56 holds the list of categories used in category feature learning, layout feature learning, and category determination processing.

Next, the HTML documents on which the document classification apparatus 1 performs category determination processing are described with reference to the drawings. FIG. 2 shows an example of an HTML document. An HTML document is made up of HTML elements, each running from a start tag to an end tag. For example, in the HTML document 61 of FIG. 2(a), the description (62) that starts with “<P>” and ends with “</P>” is one HTML element.

HTML elements form a tree structure, so one HTML element may be positioned below another. For example, the description (64) contained in the HTML element 62, which starts with “<EM>” and ends with “</EM>”, is one HTML element positioned below the HTML element 62. FIG. 2(b) represents the HTML elements that make up the HTML document 61 of FIG. 2(a) as a tree. The HTML element 64 is positioned below the HTML element 62.

In the following description, the part of an HTML element that remains after removing the element's own start tag and end tag is called the content of the HTML element. For example, in FIG. 2(a), the description enclosed by the dotted line 63 is the content of the HTML element 62. Alternatively, the content of an HTML element may be defined so as not to include the content of the HTML elements positioned below it. In that case, the content of the HTML element 62 is the description enclosed by the dotted line 63 with the HTML element 64 removed.
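The two definitions of element content can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Node`, `TreeBuilder`, `parse`, and `content_text` are hypothetical, the input is assumed to be well formed (every start tag has a matching end tag), and the function returns the content with nested tags already stripped, which is the form used later for word extraction.

```python
from html.parser import HTMLParser


class Node:
    """One HTML element: tag name, parent, and its parts in document order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects only
        self.parts = []     # mixed list: str (own text) or Node (child element)


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes well-formed input (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current.parts.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumption: the end tag closes the element currently open.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.parts.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_text(node, include_descendants=True):
    """Text content of an element, with or without lower-positioned elements."""
    pieces = []
    for part in node.parts:
        if isinstance(part, str):
            pieces.append(part)
        elif include_descendants:
            pieces.append(content_text(part, True))
    return "".join(pieces)
```

With the input `<P>Hello <EM>big</EM> world</P>`, calling `content_text` on the `p` node covers the nested `em` text by default, and omits it when `include_descendants=False` is passed.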

In this embodiment, the HTML tags are removed from the content of an HTML element and the remaining text is split into words; these words are the elements used for category determination. That is, in this embodiment, one element is one word. The text is split into words using a known morphological analysis technique.

In addition to treating each word as one element, two adjacent words may also be treated together as one element. In that case, two adjacent words yield three elements: each of the two words as a separate element, and the two words combined as one element. Two words that are not adjacent, or three or more words, may also be treated as one element. These methods allow category determination that takes co-occurrence between words into account.
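The element extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_elements` is hypothetical, elements are represented as tuples of words, and the word list is assumed to have been produced already (the embodiment uses morphological analysis, which is not reproduced here).

```python
def extract_elements(words, use_pairs=False):
    """Return single-word elements, optionally followed by adjacent-word pairs.

    `words` is an already tokenized list of words (assumption: tokenization,
    e.g. by morphological analysis, happens before this function is called).
    """
    elements = [(word,) for word in words]
    if use_pairs:
        # Each pair of adjacent words becomes one additional element.
        elements += [(a, b) for a, b in zip(words, words[1:])]
    return elements
```

For the two adjacent words "soccer" and "match", enabling `use_pairs` yields three elements: the two single words and the combined pair, matching the count given in the text.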

Next, the procedure for training and category determination using the document classification apparatus 1 is described with reference to the drawings. FIG. 3 shows the procedure for using the document classification apparatus 1. First, category feature learning, indicated by 65, is performed. HTML documents whose categories are already known are prepared as learning targets. When a learning target document and the category to which it belongs are given to the category feature learning unit 2, the learning result is recorded in the category feature table 51. This procedure is repeated for a plurality of learning target documents.

After category feature learning, layout feature learning, indicated by 66, is performed. Layout feature learning uses the same learning target documents as category feature learning. As in category feature learning, when a learning target document and the category to which it belongs are given to the layout feature learning unit 3, the learning result is recorded in the layout feature table 52. This procedure is repeated for a plurality of learning target documents. Layout feature learning uses the contents of the category feature table 51, the removal tag table 53, the removal attribute table 54, and the similar tag table 55.

Once category feature learning and layout feature learning are complete, category determination, indicated by 67, can be performed. When an HTML document to be categorized is given to the category probability determination unit 4, a category determination result 68 is obtained. The category determination result 68 indicates the probability that the target document belongs to each category. The category determination result 68 shown in FIG. 3 is only an example.

Next, the processing performed by the category feature learning unit 2 is described with reference to the drawings. FIG. 4 is a block diagram showing the category feature learning unit 2 in detail. The learning target document input unit 21 removes the HTML tags from a learning target document and outputs the result to the element extraction unit 22. The element extraction unit 22 decomposes its input into elements and outputs all of the resulting elements to the category feature calculation unit 23. The category feature calculation unit 23 updates the contents of the category feature table 51 based on the given elements and the category to which the learning target document belongs.

FIG. 5 shows the data structure of the category feature table 51. The category feature table 51 contains an element 69, a category 70, an appearance frequency 71, and a feature amount 72. Each record of the category feature table 51 records how often an element appears in a given category and how strongly that element characterizes the category. A smaller feature amount 72 means that the element is a general one used in documents of any category; a larger feature amount 72 means that the element is a characteristic one that tends to appear in a particular category. The feature amount 72 is calculated using the known TF-IDF (Term Frequency-Inverse Document Frequency) method.
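The text names TF-IDF but does not state which variant is used, so the following Python sketch is only one possible reading. It assumes that each category is treated as a single "document", that term frequency is the element count divided by the total count in the category, and that the inverse document frequency is smoothed as log(N / df) + 1. The function name `tfidf_by_category` and the dictionary layout are hypothetical.

```python
import math
from collections import defaultdict


def tfidf_by_category(frequency):
    """Compute a TF-IDF style feature amount per (element, category).

    `frequency` maps (element, category) -> appearance frequency.
    Assumption: each category plays the role of one document.
    """
    categories = {category for (_, category) in frequency}
    totals = defaultdict(int)  # total element occurrences per category
    spread = defaultdict(int)  # number of categories in which an element occurs
    for (element, category), count in frequency.items():
        totals[category] += count
        if count > 0:
            spread[element] += 1

    features = {}
    for (element, category), count in frequency.items():
        if count == 0:
            features[(element, category)] = 0.0
            continue
        tf = count / totals[category]
        # Smoothed IDF (an assumed variant): elements found in every
        # category get idf = 1 instead of 0.
        idf = math.log(len(categories) / spread[element]) + 1.0
        features[(element, category)] = tf * idf
    return features
```

Under these assumptions, an element that appears in only one category receives a larger feature amount there than an equally frequent element that appears in every category, which is the qualitative behavior the table description calls for.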

FIG. 6 is a flowchart explaining the processing of category feature learning. First, in step S1, a learning target document and the category to which it belongs are received. In step S2, all HTML tags are removed from the learning target document. In step S3, the document with its HTML tags removed is decomposed into elements, and all of the elements are extracted. The subsequent steps operate on the elements extracted in step S3.

In step S4, one unselected element is selected. In step S5, it is determined whether the category feature table 51 contains a record whose “element” column holds the selected element. If such a record exists, the determination in step S5 is affirmative and the process proceeds to step S7. If the determination in step S5 is negative, the process proceeds to step S6. In step S6, records whose “element” column holds the selected element are added to the category feature table 51, one for every category. That is, as many records as there are categories are added, each holding the selected element in the “element” column and one of the categories in the “category” column. The “appearance frequency” of each added record is set to 0 and its “feature amount” to 1.

In step S7, the category feature table 51 is searched for the record whose “element” column holds the selected element and whose “category” column holds the category to which the learning target document belongs, and 1 is added to the “appearance frequency” of that record. In step S8, the category feature table 51 is searched for the records whose “element” column holds the selected element, and the “feature amount” of each record found is updated according to a predetermined method. In step S9, it is determined whether all of the elements extracted in step S3 have been selected. If an unselected element remains, the determination in step S9 is negative and the process returns to step S4. If the determination in step S9 is affirmative, category feature learning ends.
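Steps S4 to S9 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the table is modeled as a dictionary keyed by (element, category), the names `learn_category_features` and `update_feature` are hypothetical, and the "predetermined method" of step S8 is left as an optional callback because the text does not spell it out at this point.

```python
def learn_category_features(table, categories, elements, doc_category,
                            update_feature=None):
    """Update the category feature table for one learning target document.

    table: dict mapping (element, category) -> {"frequency", "feature"}
    elements: elements already extracted from the document (steps S1 to S3)
    update_feature: optional callback for step S8 (hypothetical placeholder)
    """
    for element in elements:  # S4 selects each element once; S9 ends the loop
        # S5: does any record for this element exist?
        if not any((element, c) in table for c in categories):
            # S6: add one record per category, frequency 0 and feature 1.
            for category in categories:
                table[(element, category)] = {"frequency": 0, "feature": 1.0}
        # S7: count the occurrence in the document's own category.
        table[(element, doc_category)]["frequency"] += 1
        # S8: recompute the feature amounts of this element's records.
        if update_feature is not None:
            update_feature(table, element, categories)
    return table
```

Calling the function once per learning target document accumulates the appearance frequencies across documents, as the repeated procedure in the text describes.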

Next, the processing performed by the layout feature learning unit 3 is described with reference to the drawings. FIG. 7 is a block diagram showing the layout feature learning unit 3 in detail. The learning target document input unit 31 normalizes the HTML document to be learned (described in detail later) and outputs the result to the layout analysis unit 32. For each HTML element that makes up the input HTML document, the layout analysis unit 32 outputs the content of that HTML element to the element extraction unit 33, one element after another. In return, it obtains the importance of each HTML element from the layout feature analysis unit 34.

The element extraction unit 33 decomposes the content of the input HTML element into elements and outputs all of the resulting elements to the layout feature analysis unit 34. Based on the given elements and the category to which the learning target document belongs, the layout feature analysis unit 34 obtains the feature amount of each element from the category feature table 51. From these feature amounts, it calculates the importance of the HTML element that the layout analysis unit 32 output, and returns that importance to the layout analysis unit 32. On receiving the importance, the layout analysis unit 32 goes on to output the content of another HTML element and obtain its importance. By repeating this procedure, the layout analysis unit 32 eventually obtains an importance for every HTML element in the learning target document. The layout analysis unit 32 then updates the layout feature table 52 based on the normalized HTML document and the importance of each HTML element it contains.

FIG. 8 is a diagram showing the data structure of the layout feature table 52. The layout feature table 52 holds a plurality of tables of the kind shown in FIG. 8, each created from one HTML document. Among the data stored in a table, the HTML element 73 and the attribute value 74 reflect the original HTML document as it is. The identifier 75 is a unique identifier assigned by the layout analysis unit 32. The importance 76 is a numerical value indicating how important the HTML element is for category determination; the larger the value, the greater its influence on category determination.

FIG. 9 is a flowchart showing the processing procedure for layout feature learning. First, in step S10, a learning target document and the category to which it belongs are received. In step S11, normalization processing (described in detail later) is performed on the learning target document. In step S12, each HTML element constituting the learning target document is given an identifier that is unique within the document and an initial importance of 1.

In step S13, one unselected HTML element is selected from the learning target document. In step S14, the content of the HTML element selected in step S13 is decomposed into elements, which are extracted. In step S15, one unselected element is selected from the elements extracted in step S14. In step S16, the category feature table 51 is searched for a record whose "element" column contains the selected element and whose "category" column contains the category to which the learning target document belongs. In step S17, the "feature amount" is taken from the record found in step S16, and the importance of the HTML element selected in step S13 is multiplied by it.

In step S18, it is determined whether all the elements extracted in step S14 have been selected. If they have, an affirmative determination is made in step S18 and the process proceeds to step S19. If a negative determination is made in step S18, the process returns to step S15. In step S19, it is determined whether all the HTML elements constituting the learning target document have been selected. If an unselected HTML element remains, a negative determination is made in step S19 and the process returns to step S13. If an affirmative determination is made in step S19, layout feature learning ends.
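The loop of steps S12 to S19 can be sketched as follows. This is a minimal illustration rather than the implementation described in the publication: the dictionary `features` is a hypothetical stand-in for the category feature table 51, `html_elements` maps each identifier to the elements already extracted from that HTML element, and the handling of an element with no matching record (it is skipped here) is an assumption, since the text does not specify it.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for category feature table 51:
# (element, category) -> feature amount.
CategoryFeatures = Dict[Tuple[str, str], float]


def learn_layout_importance(
    html_elements: Dict[str, List[str]],
    category: str,
    features: CategoryFeatures,
) -> Dict[str, float]:
    """Return the importance of each HTML element, keyed by identifier."""
    importance: Dict[str, float] = {}
    for identifier, elements in html_elements.items():  # S13 / S19 loop
        value = 1.0                                      # S12: initial importance
        for element in elements:                         # S15 / S18 loop
            feature = features.get((element, category))  # S16: table lookup
            if feature is not None:                      # assumption: skip if absent
                value *= feature                         # S17: multiply
        importance[identifier] = value
    return importance


features = {("soccer", "sports"): 2.0, ("goal", "sports"): 1.5}
html_elements = {"e1": ["soccer", "goal"], "e2": ["menu"]}
print(learn_layout_importance(html_elements, "sports", features))
```

Because the importance is a running product that starts at 1, an HTML element whose content has no entry in the table keeps the neutral value 1 in this sketch.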

Next, the HTML document normalization processing executed in step S11 of FIG. 9 will be described with reference to the drawings. FIG. 10 is a flowchart showing the normalization procedure. In step S21, unknown HTML tags are deleted from the HTML document. In step S22, missing or omitted end tags and attributes are supplemented. In step S23, the HTML tags listed in the removal tag table 53 are deleted. In step S24, HTML tags that are to be treated as identical are replaced with a single HTML tag according to the similar tag table 55. In step S25, the attributes listed in the removal attribute table 54 are deleted. In step S26, whitespace and line feed characters are removed and added. Specifically, redundant whitespace and line feed characters are first deleted. Line feed characters are then added so that each HTML tag starts at the beginning of a line, and likewise so that the content of each HTML element starts at the beginning of a line.
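A rough sketch of steps S23 to S26 is given below. It is regex-based and deliberately simplified: steps S21 and S22 are omitted because they need a real HTML parser, the three table contents are invented for illustration, and the sketch assumes that "deleting a tag" removes only the tag itself and leaves its content in place, which the text does not state explicitly. Attribute values containing `<` or `>` are not handled.

```python
import re

# Hypothetical table contents, for illustration only.
REMOVED_TAGS = {"font", "br"}              # removal tag table 53
REMOVED_ATTRS = {"style", "onclick"}       # removal attribute table 54
SIMILAR_TAGS = {"b": "strong", "i": "em"}  # similar tag table 55

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")
ATTR_RE = re.compile(r'\s+([a-zA-Z-]+)\s*=\s*"[^"]*"')


def _strip_attrs(attrs: str) -> str:
    """S25: drop attributes listed in the removal attribute table."""
    def keep_or_drop(match: re.Match) -> str:
        return "" if match.group(1).lower() in REMOVED_ATTRS else match.group(0)
    return ATTR_RE.sub(keep_or_drop, attrs)


def normalize(html: str) -> str:
    def rewrite_tag(match: re.Match) -> str:
        slash, name, attrs = match.group(1), match.group(2).lower(), match.group(3)
        if name in REMOVED_TAGS:                 # S23: delete listed tags
            return ""
        name = SIMILAR_TAGS.get(name, name)      # S24: unify similar tags
        attrs = _strip_attrs(attrs).rstrip()
        return f"\n<{slash}{name}{attrs}>\n"     # S26: each tag on its own line

    text = TAG_RE.sub(rewrite_tag, html)
    # S26: collapse redundant whitespace and drop empty lines.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


print(normalize('<div style="x" id="a"><b>Hello   world</b><br></div>'))
```

Putting every tag and every run of content on its own line is what makes the later line-by-line comparison of layout structures possible.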

Next, the processing performed by the category probability determination unit 4 will be described with reference to the drawings. FIG. 11 is a block diagram showing the category probability determination unit 4 in detail. The determination target document input unit 41 searches the tables held in the layout feature table 52 for the table whose HTML element structure is most similar to that of the determination target HTML document, and outputs that table to the category score calculation unit 44. It also normalizes the determination target HTML document as described above and outputs it to the category determination unit 42. The category determination unit 42 manages the category scores. A category score is a numerical value assigned to each category, with an initial value of 0. The higher the category score of a category, the higher the probability that the determination target document belongs to it. For each HTML element constituting the input HTML document, the category determination unit 42 outputs the content of that HTML element to the element extraction unit 43, repeating this for every HTML element. In return, it receives category score update information from the category score calculation unit 44.

The element extraction unit 43 decomposes the content of the input HTML element into elements and outputs all of the obtained elements to the category score calculation unit 44. Based on the given elements, the table supplied by the determination target document input unit 41, and the category feature table 51, the category score calculation unit 44 outputs to the category determination unit 42 the category score to be added to each category. Having received the category scores, the category determination unit 42 outputs the content of the next HTML element and acquires its category scores in turn. By repeating this procedure, the category scores for all categories are eventually calculated. The category determination unit 42 then calculates, from the category scores, the probability that the determination target document belongs to each category, and outputs it as the determination result.

FIG. 12 is a flowchart showing the processing procedure for category determination. First, in step S27, every category is given an initial category score of 0. In step S28, the category determination target document is normalized according to the procedure described with FIG. 10. In step S29, the table whose HTML element structure is most similar to the category determination target document is acquired from the tables recorded in the layout feature table 52. Concretely, the similarity is measured by comparing, line by line, the HTML document obtained by deleting everything other than HTML tags from the category determination target document with the HTML document from which each table in the layout feature table 52 was created, using a known document comparison algorithm (for example, an algorithm that finds the longest common substring by dynamic programming). The table with the highest degree of identity in this comparison is acquired. The degree of identity may be judged by the proportion of differing parts or by the edit distance. In the following description, the table acquired by this procedure is called the comparison target table.
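The table selection of step S29 might look like the following sketch. It assumes that each document has already been reduced to a list of tag-only lines. Note the substitution: the sketch computes the longest common subsequence of lines by dynamic programming, as line-oriented diff tools commonly do, which may differ from the exact algorithm intended by the text. The score `2 * LCS / (len(a) + len(b))` is likewise only one possible measure of identity, and the function and table names are hypothetical.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two line lists (DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a: List[str], b: List[str]) -> float:
    """Assumed identity measure: 1.0 for identical line lists, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def pick_comparison_table(target: List[str], tables: Dict[str, List[str]]) -> str:
    """Return the name of the table whose source document is most similar."""
    return max(tables, key=lambda name: similarity(target, tables[name]))


target = ["<html>", "<body>", "<div>", "<p>", "</p>", "</div>", "</body>", "</html>"]
tables = {
    "news": ["<html>", "<body>", "<div>", "<p>", "</p>", "<p>", "</p>",
             "</div>", "</body>", "</html>"],
    "shop": ["<html>", "<body>", "<table>", "<tr>", "</tr>", "</table>",
             "</body>", "</html>"],
}
print(pick_comparison_table(target, tables))
```

An edit-distance-based measure, which the text also permits, could be swapped in by replacing `similarity` alone.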

In step S30, one unselected HTML element is selected from the determination target HTML document. In step S31, the content of the HTML element selected in step S30 is decomposed into elements, which are extracted. In step S32, one unselected element is selected from the elements extracted in step S31. In step S33, it is determined whether the category feature table 51 contains a record whose "element" column contains the selected element. If such a record exists, an affirmative determination is made in step S33 and the process proceeds to step S34. If a negative determination is made in step S33, the process proceeds to step S38. In step S34, it is determined whether the HTML element selected in step S30 exists in the comparison target table. If the same HTML element exists in the comparison target table, an affirmative determination is made in step S34 and the process proceeds to step S35. If a negative determination is made in step S34, the process proceeds to step S37.

In step S35, the importance is acquired from the HTML element in the comparison target table that is identical to the HTML element selected in step S30. In step S36, the following processing is performed for each category. First, the category feature table 51 is searched for a record whose "element" column contains the element selected in step S32 and whose "category" column contains the category being processed, and the "feature amount" is acquired from it. Next, the importance acquired in step S35 is multiplied by this "feature amount", and the result is added to the category score of the category being processed.

In step S37, the following processing is performed for each category. First, the category feature table 51 is searched for a record whose "element" column contains the element selected in step S32 and whose "category" column contains the category being processed, and the "feature amount" is acquired from it. Next, this "feature amount" is added to the category score of the category being processed. Unlike step S36, step S37 does not multiply by the importance.

In step S38, it is determined whether all the elements extracted in step S31 have been selected. If they have, an affirmative determination is made in step S38 and the process proceeds to step S39. If a negative determination is made in step S38, the process returns to step S32. In step S39, it is determined whether all the HTML elements in the determination target HTML document have been selected. If an unselected HTML element remains, a negative determination is made in step S39 and the process returns to step S30. If an affirmative determination is made in step S39, the category determination processing ends.

Dividing each category score obtained by the above category determination processing by the sum of all category scores yields the category determination result 68 (FIG. 3).
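Steps S27 to S39 and the final division can be sketched as below. Several points are assumptions made for illustration: `features` is a hypothetical stand-in for the category feature table 51, HTML elements of the target document and of the comparison target table are matched by a shared structural key (the text only says "the same HTML element"), a missing (element, category) record contributes 0, and an all-zero total yields zero probabilities.

```python
from typing import Dict, List, Tuple


def score_categories(
    html_elements: Dict[str, List[str]],
    categories: List[str],
    features: Dict[Tuple[str, str], float],
    comparison_table: Dict[str, float],
) -> Dict[str, float]:
    scores = {category: 0.0 for category in categories}  # S27: initial score 0
    known = {element for (element, _category) in features}
    for key, elements in html_elements.items():          # S30 / S39 loop
        for element in elements:                         # S32 / S38 loop
            if element not in known:                     # S33 negative: skip
                continue
            # S34/S35: use the learned importance when the HTML element is in
            # the comparison target table; S37 (plain addition) is weight 1.
            weight = comparison_table.get(key, 1.0)
            for category in categories:                  # S36 / S37
                feature = features.get((element, category), 0.0)
                scores[category] += weight * feature
    return scores


def to_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Divide each category score by the sum of all category scores."""
    total = sum(scores.values())
    if total == 0:
        return {category: 0.0 for category in scores}
    return {category: score / total for category, score in scores.items()}


features = {("soccer", "sports"): 2.0, ("soccer", "news"): 0.5, ("stock", "news"): 3.0}
page = {"body/h1": ["soccer"], "body/div": ["stock", "unknown"]}
scores = score_categories(page, ["sports", "news"], features, {"body/h1": 4.0})
print(scores, to_probabilities(scores))
```

Treating the unweighted addition of step S37 as a weight of 1 keeps the two branches in a single code path without changing the arithmetic.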

The document classification method according to the first embodiment described above provides the following effects.
(1) Locations that are important for category determination are identified by means of layout features. As a result, even when handling HTML documents with a universal layout structure, the locations where differences between categories are likely to appear are emphasized, so the accuracy of category determination improves.
(2) Specific HTML tags and attributes can be deleted in advance by means of the removal tag table and the removal attribute table. As a result, HTML tags and the like that do not contribute to the features of each category are not included in the layout features, so the accuracy of category determination improves.
(3) Elements in HTML elements that do not exist in the comparison target table are not ignored but are simply added to the category score. As a result, category determination can take into account characteristic elements that appear in places outside the common part of the layout structure, so the accuracy of category determination improves.

In the first embodiment described above, the determination target document is an HTML document, and so-called text data is processed. In the second embodiment described in detail below, the determination target document is an image.

--Second embodiment--
In the present embodiment, the learning target documents and the determination target document are image data. A plurality of images to which categories have been assigned are used as learning target documents, and characteristics such as the following are learned as category features: images in the person category have skin color near the center of the image, and images in the sunset category have a large amount of red in the upper part of the image. Layout features are learned next. For example, for an image in the sunset category that is divided into upper and lower parts, the layout feature learned is that the upper part of the image is important.

The category of a given image is determined from the category features and layout features learned by the above processing. For example, when the determination target image is divided into upper and lower parts, the learning results described above indicate that the upper part of the image is important, so the comparison result for the upper part is given more weight than that for the lower part.

The document classification method according to the second embodiment described above provides the following effect in addition to those of the first embodiment. The category determination target document is image data, and the learning results for category features and layout features are used. As a result, category determination using layout features can also be performed on documents other than text data.

In the embodiments described above, layout analysis is used to classify documents automatically. The layout analysis method of the present invention can also be applied to document search. In the third embodiment described in detail below, documents are searched using the results of layout analysis.

--Third embodiment--
A document search apparatus according to an embodiment of the present invention will be described with reference to the drawings. FIG. 13 is a block diagram showing the overall configuration of the document search apparatus according to the present embodiment. When a keyword is input, the document search apparatus 101 shown in the figure executes search processing based on that keyword on the group of documents contained in the search target document DB 157. Specifically, it identifies the documents in which the keyword is used and orders them by the strength of their relationship to the keyword. The search result is output as a plurality of ordered documents.

Before the document search apparatus 101 is used, learning must be performed in advance, as with the document classification apparatus 1 of the first embodiment. The details of the learning are the same as for the document classification apparatus 1 and are therefore omitted.

The document search apparatus 101 comprises a category feature learning unit 102, a layout feature learning unit 103, a document search unit 104, and a data storage unit 105. The category feature learning unit 102 and the layout feature learning unit 103 are the same as the corresponding units of the document classification apparatus 1 of the first embodiment.

The data storage unit 105 includes a search target document DB 157 in addition to the contents of the data storage unit 5 of the first embodiment. The search target document DB 157 stores the plurality of documents to be searched.

The document search unit 104 includes a keyword input unit 141 and a score calculation unit 142. On receiving a search keyword, the keyword input unit 141 first retrieves the group of documents containing the keyword from the search target document DB 157. It then sends each retrieved document, together with the keyword, to the score calculation unit 142.

On receiving the keyword and a document to be processed, the score calculation unit 142 acquires from the layout feature table 152 the table whose HTML element structure is most similar to the document to be processed, in the same way as the determination target document input unit 41 of the first embodiment. Next, it examines where the keyword appears in the document to be processed. For each occurrence of the keyword, the importance in the layout feature table 152 that corresponds to the place of occurrence is added to the score, and the final score is returned to the keyword input unit 141.

After the above processing has been executed for every document in the document group, the keyword input unit 141 sorts the document group by score and outputs it as the search result. In this way, a search result based on the layout structure is obtained.
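The search flow of the keyword input unit 141 and the score calculation unit 142 can be sketched as follows. The sketch assumes that the most similar layout table has already been chosen for each document and is supplied in `tables`, that each occurrence of the keyword adds the importance of the HTML element in which it appears, and that a location missing from the table adds 0. The last point is an assumption, as the text does not cover that case, and all names are hypothetical.

```python
from typing import Dict, List

Document = Dict[str, List[str]]  # structural key -> elements (e.g. words)


def score_document(keyword: str, document: Document,
                   table: Dict[str, float]) -> float:
    """Sum the importance of every location where the keyword occurs."""
    score = 0.0
    for key, elements in document.items():
        occurrences = elements.count(keyword)
        score += occurrences * table.get(key, 0.0)  # assumption: 0 if unknown
    return score


def search(keyword: str, documents: Dict[str, Document],
           tables: Dict[str, Dict[str, float]]) -> List[str]:
    """Return names of documents containing the keyword, best score first."""
    hits = {
        name: doc for name, doc in documents.items()
        if any(keyword in elements for elements in doc.values())
    }
    scored = [(score_document(keyword, doc, tables[name]), name)
              for name, doc in hits.items()]
    return [name for _score, name in
            sorted(scored, key=lambda pair: pair[0], reverse=True)]


layout = {"title": 5.0, "body": 1.0}
documents = {
    "a": {"title": ["python"], "body": ["java"]},
    "b": {"title": ["java"], "body": ["python", "python"]},
    "c": {"title": ["java"], "body": ["java"]},
}
tables = {name: layout for name in documents}
print(search("python", documents, tables))
```

In this example document "a" ranks above "b" even though "b" contains the keyword twice, because the single occurrence in "a" falls in a location with higher importance.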

The document search method according to the third embodiment described above provides the following effect. For the group of documents containing the keyword, the documents are not simply ordered by the number of times the keyword appears; instead, the keyword is weighted based on the layout structure of each document. As a result, documents that use the keyword in their main subject portion rank higher in the search result, so search accuracy improves.

The following modifications are also within the scope of the present invention.
(1) Instead of recording one table per learning target document in the layout feature table, one table may be recorded for a group of learning target documents that have similar structures. In this case, only the HTML elements common to the documents are recorded in the table. The common HTML elements can be extracted with the same technique used to determine the similarity of HTML documents. The importance of each HTML element may be taken as the average across the HTML documents.

(2) Things other than words and co-occurring words may be used as document elements. For example, two adjacent characters may be treated as one element without performing morphological analysis, or the distance between words may be treated as one element. An image embedded in a document may also be treated as one element.

(3) The feature amounts recorded in the category feature table may be calculated by a technique other than the TF-IDF method.

(4) The document subjected to category determination may be in a format other than HTML, as long as it has some kind of structure. For example, it may be a document created with a word processor, or an image. When an image is the determination target, the elements may be pixel positions and colors, or the image may be converted into a markup-language format such as Scalable Vector Graphics (SVG) and handled in that form.

(5) For documents containing text, patterns that appear in the text may be used for layout feature learning. For example, when a phrase such as "Preface" or "Introduction" appears, what follows can be regarded as an independent part of the layout structure.

(6) The category of a document used for layout feature learning need not be known in advance. In this case, the category obtained by classifying the learning target document with a known automatic document classification technique and the results of category feature learning is used as the category to which the learning target document belongs.

(7) The layout analysis of the present invention can be used for purposes other than category determination and keyword search. For example, in document summarization, portions of low importance can be heavily abridged while portions of high importance are retained.

As long as the features of the present invention are not impaired, the present invention is not limited to the above embodiments, and other forms conceivable within the scope of the technical idea of the present invention are also included in the scope of the present invention.

FIG. 1 is a block diagram showing the overall configuration of the document classification apparatus according to the first embodiment.
FIG. 2 is a diagram showing an example of an HTML document processed by the first embodiment.
FIG. 3 is a diagram showing the procedure for using the document classification apparatus according to the first embodiment.
FIG. 4 is a block diagram for explaining the processing of the category feature learning unit.
FIG. 5 is a diagram showing the data structure of the category feature table.
FIG. 6 is a flowchart showing the processing procedure for category feature learning.
FIG. 7 is a block diagram for explaining the processing of the layout feature learning unit.
FIG. 8 is a diagram showing the data structure of the layout feature table.
FIG. 9 is a flowchart showing the processing procedure for layout feature learning.
FIG. 10 is a flowchart showing the processing procedure for normalizing an HTML document.
FIG. 11 is a block diagram for explaining the processing of the category probability determination unit.
FIG. 12 is a flowchart showing the processing procedure for category determination.
FIG. 13 is a block diagram showing the overall configuration of the document search apparatus according to the third embodiment.

Explanation of symbols

1 Document classification apparatus
2 Category feature learning unit
3 Layout feature learning unit
4 Category probability determination unit
5 Data storage unit
51 Category feature table
52 Layout feature table

Claims

A document analysis program for analyzing a document having a layout structure, the program causing a computer to execute:
an element extraction process for extracting a plurality of components from a learning target document having a layout structure;
a category feature learning process for calculating, for each of the components extracted from the learning target document, a feature amount in the category of the learning target document; and
a layout feature learning process for calculating, based on the learning target document and the feature amount, an importance for each of a plurality of partial structures included in the layout structure of the learning target document.

The document analysis program according to claim 1, further causing the computer to execute:
a category determination process for determining a category of a determination target document having a layout structure, based on the feature amount calculated by the category feature learning process, the importance calculated by the layout feature learning process, and the layout structure of the learning target document.

The document analysis program according to claim 2,
wherein the category determination process outputs a probability that the determination target document belongs to each of a plurality of categories.

The document analysis program according to claim 1, further causing the computer to execute:
a document search process for searching a plurality of search target documents having a layout structure for a predetermined keyword, and calculating, for each of the search target documents, a degree of association with the keyword based on the layout structure of the learning target document and the importance calculated by the layout feature learning process.

The document analysis program according to any one of claims 1 to 3,
wherein the document is image data.

The document analysis program according to claim 5,
wherein the component extracted from the image data by the element extraction process is a pixel.

The document analysis program according to any one of claims 1 to 4,
wherein the document is a document described in HTML (HyperText Markup Language).

The document analysis program according to any one of claims 1 to 4 or 7,
wherein the component is a word.

The document analysis program according to any one of claims 1 to 4 or 7,
wherein the component is a word and a co-occurrence relationship of the word.

The document analysis program according to claim 8 or 9,
wherein the word is extracted by morphological analysis.

The document analysis program according to any one of claims 1 to 10,
wherein the feature amount is calculated by a TF-IDF (Term Frequency-Inverse Document Frequency) method.

In the document analysis program as described in any one of Claims 1-11,
The layout feature learning process creates a new document by extracting a common part of the layout structure from a plurality of documents to be learned having a similar layout structure, and for each partial structure in the layout structure of the new document A document analysis program characterized by calculating the importance of.

The document analysis program according to any one of claims 1 to 12,
A document analysis program characterized in that the layout structure includes a tree structure.

The document analysis program according to claim 13,
A document analysis program characterized in that, when calculating the importance of a partial structure positioned at a higher level in the tree structure, the layout feature learning process excludes the partial structures positioned at a lower level from that higher-level partial structure.

The document analysis program according to claim 13,
A document analysis program characterized in that, when calculating the importance of a partial structure positioned at a higher level in the tree structure, the layout feature learning process includes the partial structures positioned at a lower level in that higher-level partial structure.
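
The difference between the two preceding claims (excluding versus including lower-level partial structures) can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the `importance` function, and the rule that a partial structure's score is the sum of the feature amounts of its words are all hypothetical and are not taken from the claims.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One partial structure in a tree-shaped layout (hypothetical model)."""
    tag: str
    words: list = field(default_factory=list)      # words directly inside this node
    children: list = field(default_factory=list)   # lower-level partial structures


def importance(node, feature, include_lower):
    """Score a node from the feature amounts of its words.

    include_lower=False: only the node's own words count (lower levels excluded).
    include_lower=True:  words of all descendant nodes are counted as well.
    """
    score = sum(feature.get(word, 0.0) for word in node.words)
    if include_lower:
        score += sum(importance(child, feature, True) for child in node.children)
    return score
```

For example, a `body` node that directly holds only a menu word but contains an article `div` scores low in the excluding variant and high in the including variant.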

An apparatus for analyzing a document having a layout structure,
An element extraction unit that extracts a plurality of components from a learning target document having a layout structure;
A category feature learning unit that calculates a feature amount in a category of the learning target document for each of the components extracted from the learning target document;
A layout feature learning unit that calculates, for each of a plurality of partial structures included in the layout structure of the learning target document, the importance of that partial structure based on the learning target document and the feature amount;
A document analysis apparatus comprising:

The document analysis apparatus according to claim 16, further comprising:
A document analysis apparatus comprising a category determination unit that determines the category of a determination target document having a layout structure, based on the feature amount calculated by the category feature learning unit, the importance calculated by the layout feature learning unit, and the layout structure of the learning target document.
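
One way to picture a category determination that combines feature amounts with layout importance is the sketch below. It assumes, purely for illustration, that a document is given as a mapping from partial-structure paths to word lists, that each category has a word-to-feature-amount table, and that the score is an importance-weighted sum; the name `determine_category` and this scoring rule are hypothetical and are not stated in the claims.

```python
def determine_category(parts, importance, features):
    """Return (best_category, scores) for one determination target document.

    parts:      {path: [words]} for each partial structure of the document
    importance: {path: weight} learned per partial structure
    features:   {category: {word: feature amount}}
    """
    scores = {}
    for category, feature in features.items():
        scores[category] = sum(
            importance.get(path, 0.0) * sum(feature.get(w, 0.0) for w in words)
            for path, words in parts.items()
        )
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch, words found in a low-importance partial structure such as a navigation bar contribute little, so they are less likely to pull the document into the wrong category.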

The document analysis apparatus according to claim 16, further comprising:
A document analysis apparatus comprising a document search unit that searches a plurality of search target documents having a layout structure for a predetermined keyword and calculates, for each of the search target documents, a degree of association with the keyword based on the layout structure of the learning target document and the importance calculated by the layout feature learning unit.
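
The degree of association in this claim could, for example, be computed by weighting each keyword hit by the importance of the partial structure it occurs in. The sketch below assumes exactly that; the names `association` and `search`, the path-keyed dictionaries, and the use of raw occurrence counts are hypothetical choices, not the method fixed by the claim.

```python
def association(keyword, parts, importance):
    """Importance-weighted count of keyword occurrences in one document.

    parts:      {path: [words]} for each partial structure
    importance: {path: weight} learned per partial structure
    """
    return sum(
        importance.get(path, 0.0) * words.count(keyword)
        for path, words in parts.items()
    )


def search(keyword, documents, importance):
    """Return names of matching documents, highest association first.

    documents: {name: parts}, with parts as in `association`.
    """
    scored = [
        (association(keyword, parts, importance), name)
        for name, parts in documents.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

A document that mentions the keyword in a high-importance partial structure is ranked above one that mentions it only in a low-importance one.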

A method for analyzing a document having a layout structure,
An element extraction step for extracting a plurality of components from a learning target document having a layout structure;
A category feature learning step for calculating a feature amount in the category of the learning target document for each of the components extracted from the learning target document;
A layout feature learning step of calculating, for each of a plurality of partial structures included in the layout structure of the learning target document, the importance of that partial structure based on the learning target document and the feature amount;
A document analysis method comprising:

The document analysis method according to claim 19, further comprising:
A document analysis method comprising a category determination step of determining the category of a determination target document having a layout structure, based on the feature amount calculated in the category feature learning step, the importance calculated in the layout feature learning step, and the layout structure of the learning target document.

The document analysis method according to claim 19, further comprising:
A document analysis method comprising a document search step of searching a plurality of search target documents having a layout structure for a predetermined keyword and calculating, for each of the search target documents, a degree of association with the keyword based on the layout structure of the learning target document and the importance calculated in the layout feature learning step.