JP2010061587A - Similar document determination device, similarity determination method and program therefor

Info

Publication number: JP2010061587A
Application number: JP2008229104A
Authority: JP
Inventors: Masakazu Hasegawa; 雅一長谷川; Mitsuaki Tsunakawa; 光明綱川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2008-09-05
Filing date: 2008-09-05
Publication date: 2010-03-18

Abstract

[Problem] To easily determine similarity between document data such as PowerPoint files.
[Solution] A similarity determination device performs similarity determination between document data by focusing on the document title, page titles, and page text of the document data. In doing so, the device determines whether the document data share the same document title, the proportion of pages with the same page title, and the proportion of identical page text, and based on these results determines which similarity pattern the similar document data falls under. The determination result is recorded in similar document information and used for searching similar document data.
[Selected drawing] Figure 1

Description

The present invention relates to a technique for retrieving similar document data.

Conventionally, the following methods exist for retrieving the document data from which a given piece of document data was derived: (1) using CVS (Concurrent Versions System) or the like to retrieve the source document data (see Non-Patent Document 1); and (2) using the tags contained in structured documents such as SGML (Standard Generalized Markup Language) or XML (Extensible Markup Language) to retrieve document data similar to the given document data (see Non-Patent Document 2). Another technique for retrieving similar document data uses the frequency of occurrence of words in the document data (see Non-Patent Document 3).
Non-Patent Document 1: CVS (Concurrent Versions System), [online], [retrieved July 30, 2008], Internet, <URL: http://www.linkclub.or.jp/~tumibito/soft-an/cvs/cvs-man/cvs-ja_1.html#SEC1>
Non-Patent Document 2: Tomita et al., Full-text search system that can rank structured documents, IEICE Technical Report 2000-DBS-122, pp. 361-368, IEICE, July 2000
Non-Patent Document 3: Hoashi et al., Retrieval Expression Expansion Method Using Word Contribution in Similarity Between Documents, IPSJ Journal vol. 40, No. SIG8 (TOD4), pp. 63-73, November 1999

There is a need, for document data that is created with presentation software such as PowerPoint (registered trademark) and is structured into a document title, page titles, and page text, to retrieve the document data from which it was derived (the derivation-source document data). However, retrieving document data such as PowerPoint (registered trademark) files with the CVS (Concurrent Versions System) described in Non-Patent Document 1 presupposes that the document data was created under CVS in advance. When the technique described in Non-Patent Document 2 is used, document data such as PowerPoint (registered trademark) files must first be converted into structured documents such as XML, and the tags used across the document data must also be standardized. Furthermore, the technique described in Non-Patent Document 3 is ill-suited to searching document data such as PowerPoint (registered trademark) files, in which many words do not necessarily appear. An object of the present invention is therefore to solve these problems, to make similarity determination between document data such as PowerPoint (registered trademark) files easy, and to make it easy to retrieve the document data from which a given piece of document data was derived.

To solve the above problems, when the similar document determination device receives one or more pieces of document data via its input unit, a document analysis unit extracts the document title, page titles, and page text from each piece of document data and creates document analysis information for it. When new document data (the document data to be compared) is then received, the document analysis unit likewise extracts its document title, page titles, and page text, and a determination unit performs similarity determination focusing on these three elements. Specifically, the determination unit judges whether the document data indicated in the document analysis information is similar document data matching any similarity pattern, where each pattern consists of one or a combination of: (1) whether the document title is the same as that of the new document data, (2) the proportion of pages whose page title is the same as a page title of the new document data, and (3) the proportion of page text that is the same as page text of the new document data. When the determination unit judges that the document data is similar document data matching one of the similarity patterns, it outputs a determination result that includes that pattern.
By performing similarity determination focused on the document title, page titles, and page text in this way, the device can determine whether pieces of document data are similar without converting them into structured documents such as XML.

The document analysis unit of the device extracts the first page title of the input document data as the document title of that document data. Document analysis information with a document title can therefore be created even when the document data contains no content carrying a document-title attribute.

When the first page of the input document data has no page title, the document analysis unit extracts the page text of the first page as the document title and creates the document analysis information. This makes it more likely that the document analysis information contains a document title, which in turn makes it easier for the determination unit to judge similarity between document data using that title.

When the document data contains a page without a page title, the document analysis unit extracts the page text of that page as its page title and creates the document analysis information. This makes it more likely that the document analysis information contains page titles, which in turn makes it easier for the determination unit to judge similarity between document data using those page titles.

Based on the determination result from the determination unit, a similar document information creation unit of the device creates similar document information that associates, for each identifier of input document data, the identifier of a similarity pattern with the identifiers of the similar document data matching that pattern, and stores it in the storage unit. As a result, when a similar document search unit receives a search request whose search condition includes at least one of a similarity pattern identifier and a document data identifier, it can retrieve from the similar document information the pairs of document data and similar document data that satisfy the condition. Because a display processing unit displays the search result, the user of the device can check the group of document data (similar document data) matching a given similarity pattern, or the group of document data (similar document data) matching a similarity pattern with respect to given document data.

A rearrangement processing unit of the device refers to the last edit date and time of the document data indicated in the document analysis information and sorts the pairs of document data and similar document data retrieved from the similar document information in ascending or descending order of the last edit date and time of the similar document data in each pair. Where several retrieved pairs share the same similar document data but differ in the paired document data, it sorts those pairs in ascending or descending order of the last edit date and time of the document data in each pair. For example, by sorting the similar document data oldest first and then sorting the document data paired with each piece of similar document data oldest first, the display processing unit can display a group of document data derived from a derivation-source document in an order close to the order in which it was derived. The user of the device can thus more easily see in what order each piece of document data in the group was derived from its source.

The document analysis unit selects document data from a document storage unit that stores one or more pieces of document data, and extracts the document title of the selected document data together with the page title and page text of each of its pages. The determination unit judges similarity between the selected document data and the document data indicated in the document analysis information already stored in the storage unit, and the similar document information creation unit creates the similar document information. The document analysis unit then creates document analysis information for the document data whose similar document information has been created and stores it in the storage unit. In this way the device can build up document analysis information while performing similarity determination on the document data.

The document analysis unit also selects the document data stored in the document storage unit in order from the oldest last edit date and time, and extracts the document title of the selected document data together with the page title and page text of each of its pages. The determination unit therefore creates similar document information starting from the document data with the oldest last edit date and time, comparing each against the document analysis information already stored in the storage unit. While old document data is being processed, little document analysis information has yet been stored, so the determination unit compares against only a small amount of it. Once newer document data is being processed, much document analysis information has been stored, so the determination unit compares against a large amount of it.
If the group of document data accumulated in the document storage unit was derived from one piece of document data, the original (oldest) document data should itself have relatively few similar documents, while newer document data should have relatively many. The determination unit can therefore perform similarity determination efficiently between pieces of document data that are likely to be in an actual derivation relationship. The rearrangement processing unit sorts the pairs of document data and similar document data retrieved from the similar document information in order from the oldest last edit date and time of the similar document data in each pair. Where several retrieved pairs share the same similar document data but differ in the paired document data, it sorts those pairs in order from the oldest last edit date and time of the document data in each pair. The user of the device can thus more easily see in what order each piece of document data in the group was derived from its source.
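The oldest-first processing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Doc` record, the `build_incrementally` function, and the `is_similar` callback are hypothetical names, and the similarity test is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Doc:
    doc_id: str            # hypothetical document identifier
    last_edited: datetime  # last edit date and time
    title: str             # extracted document title


def build_incrementally(stored_docs, is_similar):
    """Process documents oldest first, comparing each one only against
    the documents whose analysis information has already been stored."""
    analyzed = []       # stands in for the stored document analysis information
    similar_pairs = []  # (selected document id, similar document id)
    for doc in sorted(stored_docs, key=lambda d: d.last_edited):
        for earlier in analyzed:
            if is_similar(doc, earlier):
                similar_pairs.append((doc.doc_id, earlier.doc_id))
        # The analysis information is stored only after the comparison.
        analyzed.append(doc)
    return similar_pairs
```

With this ordering the oldest document is compared against nothing and each later document only against older ones, which matches the assumption that similar documents of a derived file are found among its predecessors.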

According to the present invention, similarity determination between document data such as PowerPoint (registered trademark) files is made easy, and the document data from which a given piece of document data was derived can be retrieved easily.

<Overview>
The best mode for carrying out the present invention (hereinafter, the embodiment) is described below. First, an outline of the processing of the similar document determination device of this embodiment is described with reference to FIGS. 1 and 2, which are conceptual diagrams showing that outline.

The document data handled by the similar document determination device is, for example, document data created with PowerPoint (registered trademark) or the like, and contains content such as a document title, the page title of each page, and page text. This content may also include graphics and the like, but similarity determination ignores graphic content and considers only content consisting of text. Whether a piece of content is a document title, a page title, or page text is determined from the attribute information attached to that content.

The similar document determination device extracts the document title, page titles, page text, and so on from a piece of document data and from each piece of document data to be compared with it. Focusing on the document title, page titles, and page text, the device then determines how similar the pieces of document data are. For example, the device determines whether the document title of document data A in FIG. 1 is the same as the document title of document data B. It also determines whether document data B contains page titles that are the same as page titles of document data A and, if so, whether their proportion is at or above a predetermined threshold. Likewise, it determines whether document data B contains page text that is the same as page text of document data A and, if so, whether its proportion is at or above a predetermined threshold. In these determinations the pages compared need not be the same pages of the respective document data: it suffices that the same page title or page text appears on any page of the document data as a whole.
The similarity determination result is recorded in similar document information (described later) together with the identification information of the similarity pattern (similarity pattern identifier). The similar document information, detailed later, is used to search for similar document data with a similarity pattern identifier as the key.
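The page-independent comparison above can be sketched as a set lookup. This is only an illustration under stated assumptions: the function name `shared_ratio` is hypothetical, the denominator (the number of items in the first document) is an assumption because the text does not say what the proportion is taken over, and the threshold value is invented since the text only says "a predetermined threshold".

```python
def shared_ratio(items_a, items_b):
    """Fraction of document A's items (page titles or page texts) that
    also appear on any page of document B, regardless of page position."""
    if not items_a:
        return 0.0
    items_in_b = set(items_b)
    hits = sum(1 for item in items_a if item in items_in_b)
    return hits / len(items_a)


# Hypothetical value; the source does not state the threshold.
THRESHOLD = 0.5

titles_a = ["Intro", "Method", "Results", "Summary"]
titles_b = ["Summary", "Intro", "Other"]
is_similar_by_page_title = shared_ratio(titles_a, titles_b) >= THRESHOLD
```

The same function applies unchanged to page text, because both comparisons only ask whether an item occurs somewhere in the other document.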

The similar document determination device extracts the document title, the page title of each page, the page text, and so on from document data as follows. As shown in FIG. 2, for the document title the device extracts the first page title on the first page (page 0) of the document data. When extracting a page title, if the page has a page title, that page title is extracted as is. If the page has no page title, the first page text on that page is extracted as the page title. Because the device extracts information that can be presumed to be the document title and, for pages without a page title, information that can be regarded as the page title, the document title and page titles of the document data can be extracted reliably. This makes it easier for the device to determine similarity between document data.
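The fallback rules above can be sketched as follows. The page model is an assumption made for illustration: each page is a list of `(attribute, text)` pairs, and the attribute names `"page_title"` and `"page_text"` are hypothetical stand-ins for the attribute information carried by real presentation files. The fallback of the document title to the first page's page text follows the summary given earlier in this description.

```python
def page_title_of(page):
    """Return the page's title, or its first page text when no content
    on the page carries the page-title attribute."""
    for attr, text in page:
        if attr == "page_title":
            return text
    for attr, text in page:
        if attr == "page_text":
            return text
    return None  # the page has no text content at all


def document_title_of(pages):
    """Use the title found on the first page (page 0) as the document title."""
    return page_title_of(pages[0]) if pages else None


pages = [
    [("page_title", "Plan 2008"), ("page_text", "draft")],
    [("page_text", "Budget"), ("page_text", "details")],  # no page title
    [],                                                   # graphics only
]
```

Here `document_title_of(pages)` picks up the title of page 0, and the second page falls back to its first page text.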

<Configuration>
Next, the configuration of the similar document determination device is described with reference to FIG. 3, which is a block diagram showing the configuration of the device of this embodiment.

The similar document determination device 10 is broadly divided into an input/output unit 11, a processing unit 12, and a storage unit 13. The input/output unit 11 accepts input of the document data to be processed by the processing unit 12 and outputs display screens and the like showing the processing results of the processing unit 12 to a display device 20 or the like. The processing unit 12 extracts the document title, page titles, and page text from document data and determines similarity between document data with reference to similarity pattern information (described later). The storage unit 13 stores the information that the processing unit 12 refers to when determining similarity, the similar document information (described later) that results from the determination, and so on. When the similar document determination device 10 is realized by program execution, the storage unit 13 also stores the program that implements the functions of the device.

The input/output unit 11 consists of an input/output interface. The processing unit 12 is realized by program execution on a CPU (Central Processing Unit) provided in the similar document determination device 10, by a dedicated circuit, or the like. The storage unit 13 consists of storage media such as RAM (Random Access Memory), ROM (Read Only Memory), an HDD (Hard Disk Drive), and flash memory.

The input/output unit 11 accepts input of the document data to be processed by the similar document determination device 10 and outputs the processing results of the processing unit 12 to the outside. For example, it outputs a display screen showing the processing results of the processing unit 12 to the display device 20.

The processing unit 12 includes a document analysis unit 121, a determination unit 122, a similar document information creation unit 123, a similar document search unit 124, a rearrangement processing unit 125, and a display processing unit 126.

The document analysis unit 121 creates document analysis information for the input document data. It determines, from the attribute information of each piece of content in the document data, whether the content represents a document title, a page title, or page text, and based on the result extracts the content as one of these. As described with reference to FIG. 2, the document analysis unit 121 extracts, from the content of the first page (page 0) of the document data, the content whose attribute is page title as the document title. When extracting page titles from the second page (page 1) onward, if a page has content whose attribute is page title, the document analysis unit 121 extracts that content as the page title. If the page has no such content, it extracts the first page text on the page (content whose attribute is page text) as the page title. The processing procedure is detailed later. Besides this content, the document analysis unit 121 also extracts information such as the last edit date and time of the document data.

The determination unit 122 refers to the similarity pattern information stored in a similarity pattern information storage unit 133 (described later) and determines the similarity pattern between a piece of document data (the selected document data) and the document data to be compared with it (the comparison target document data). Specifically, the determination unit 122 determines whether the document data indicated in the document analysis information is similar document data matching any of the similarity patterns indicated in the similarity pattern information. When it determines that the document data matches one of the similarity patterns (that is, is similar document data), it outputs a determination result that includes that pattern. Document data that matches none of the similarity patterns indicated in the similarity pattern information is not judged to be similar document data of the selected document data.

For example, the determination unit 122 determines whether the document title of the comparison target document data is the same as the document title of the selected document data. It also determines the proportion of pages in the comparison target document data whose page title is the same as a page title of the selected document data. If the selected document data and the comparison target document data have the same document title, the similarity pattern of the comparison target document data is determined to be "1". If the proportion of pages whose page title is shared between the selected document data and the comparison target document data is at or above a predetermined threshold, the similarity pattern of the comparison target document data is determined to be "2". If the comparison target document data matches none of the similarity patterns indicated in the similarity pattern information, it is not judged to be similar document data of the selected document data. The determination unit 122 then outputs the determination result.
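The two example patterns above can be sketched as follows. This is an illustration under assumptions, not the patent's determination logic: the dictionary layout, the function name `matching_patterns`, the default threshold, and the choice to return every matching pattern rather than a single one are all hypothetical, and the proportion is taken over the selected document's page titles as an assumed denominator.

```python
def matching_patterns(selected, candidate, threshold=0.5):
    """Return the identifiers of the example patterns that the candidate
    matches with respect to the selected document; an empty list means
    the candidate is not treated as a similar document."""
    patterns = []
    # Pattern "1": both documents have the same document title.
    if selected["title"] == candidate["title"]:
        patterns.append("1")
    # Pattern "2": the share of identical page titles reaches the threshold.
    selected_titles = selected["page_titles"]
    if selected_titles:
        candidate_titles = set(candidate["page_titles"])
        shared = sum(1 for t in selected_titles if t in candidate_titles)
        if shared / len(selected_titles) >= threshold:
            patterns.append("2")
    return patterns


doc_a = {"title": "Plan", "page_titles": ["Intro", "Cost", "Schedule", "Risks"]}
doc_b = {"title": "Plan", "page_titles": ["Intro", "Cost"]}
doc_c = {"title": "Other", "page_titles": ["Intro", "Cost", "Schedule"]}
doc_d = {"title": "X", "page_titles": ["Y"]}
```

Returning a list sidesteps the question of which pattern takes priority when several hold, which the passage above does not settle.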

Based on the determination result from the determination unit 122, the similar document information creation unit 123 creates similar document information that associates the document identifier (identification information) of the comparison target document data, the determined similarity pattern, and the document identifier (similar document identifier) of the document data matching that pattern (the similar document data). The similar document information is detailed later.

When the similar document search unit 124 receives identification information of a similarity pattern via the input/output unit 11, it retrieves from the similar document information one or more pairs of identifiers of document data and similar document data that match the similarity pattern with that identification information. For example, the similar document search unit 124 retrieves from the similar document information the document identifiers of the document data matching similarity pattern "1" and of the similar document data of that document data.
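The lookup described above can be sketched with a flat list of records. The record layout (`SimilarRecord`) and the `search` function are hypothetical; the optional document-identifier filter reflects the search condition mentioned earlier in the description, which may include a pattern identifier, a document identifier, or both.

```python
from typing import NamedTuple


class SimilarRecord(NamedTuple):
    doc_id: str          # document identifier
    pattern_id: str      # similarity pattern identifier
    similar_doc_id: str  # identifier of the similar document


def search(records, pattern_id=None, doc_id=None):
    """Return the records that satisfy every condition that was given."""
    return [
        r for r in records
        if (pattern_id is None or r.pattern_id == pattern_id)
        and (doc_id is None or r.doc_id == doc_id)
    ]


records = [
    SimilarRecord("B", "1", "A"),
    SimilarRecord("C", "2", "A"),
    SimilarRecord("C", "1", "B"),
]
pattern_1_pairs = [(r.doc_id, r.similar_doc_id) for r in search(records, pattern_id="1")]
```

A real store would more likely be a database table keyed on these columns; the list comprehension only shows the filtering semantics.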

The rearrangement processing unit 125 sorts the pairs of document data and similar document data retrieved from the similar document information into a predetermined order. For example, the rearrangement processing unit 125 sorts these pairs in order from the oldest last edit date and time of the similar document data in each pair.
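The ordering can be sketched as a two-level sort key: first the last edit date and time of the similar document, then, for pairs sharing the same similar document, the last edit date and time of the paired document, as the summary earlier in the description explains. The function name `sort_pairs`, the pair layout, and the `last_edited` mapping are assumptions made for illustration.

```python
from datetime import datetime


def sort_pairs(pairs, last_edited, newest_first=False):
    """Sort (document id, similar document id) pairs by the similar
    document's last edit time, then by the document's last edit time."""
    return sorted(
        pairs,
        key=lambda pair: (last_edited[pair[1]], last_edited[pair[0]]),
        reverse=newest_first,
    )


last_edited = {
    "A": datetime(2008, 1, 1),
    "B": datetime(2008, 2, 1),
    "C": datetime(2008, 3, 1),
}
pairs = [("C", "B"), ("C", "A"), ("B", "A")]
oldest_first = sort_pairs(pairs, last_edited)
```

Sorted oldest first, the pairs whose similar document is the original "A" come first, followed by later derivations, which approximates the derivation order.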

The display processing unit 126 displays, on the display device 20 or the like, a display screen showing the search results retrieved by the similar document search unit 124 or those search results as sorted by the rearrangement processing unit 125.

The storage unit 13 includes, in predetermined areas, a document storage unit 131, a document analysis information storage unit 132, a similar pattern information storage unit 133, and a similar document information storage unit 134.

The document storage unit 131 stores one or more pieces of document data input from the input/output unit 11.

The document analysis information storage unit 132 stores the document analysis information created by the document analysis unit 121. This document analysis information indicates the document identifier, document title, page titles, and page texts of document data. FIG. 4 illustrates the document analysis information of FIG. 3. As shown in FIG. 4, the document analysis information includes, for example, document file information indicating the document title and the like of document data, page title information indicating the page title of each page of the document data, and page text information indicating the page text of each page of the document data. The document file information indicates, for each document identifier, the document title extracted from that document data, its last edit date and time, and so on. The document file information may further include information such as the document file name and storage folder name of the document data. The page title information indicates a document identifier, a page identifier, and the page title of that page. Likewise, the page text information indicates a document identifier, a page identifier, and the page text of that page. The document file information, page title information, and page text information may also be combined into a single piece of information.

The similar pattern information storage unit 133 of FIG. 3 stores similar pattern information. As illustrated in Table 1 below, the similar pattern information indicates, for each similar pattern identifier, the condition under which comparison target document data is determined to match that similar pattern. When a threshold on the number of pages or on the number of page texts is used in determining whether comparison target document data matches a similar pattern, the similar pattern information also includes that threshold. For example, in the similar pattern information shown in Table 1, similar pattern identifier "1" indicates the condition that the two pieces of document data have the same document title. Similar pattern identifier "2" indicates the condition that the proportion of pages in the comparison target document data having the same page title as a page title of the selected document data is 75% or more. Similar pattern identifier "3" indicates the condition that, counting the pages of the comparison target document data in which 50% or more of the page texts are the same as page texts of the selected document data, that page count is 75% or more of the total number of pages of the comparison target document data. The similar pattern information is set via the input/output unit 11.
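The conditions described for Table 1 can be pictured as a small configuration structure. The following is a minimal sketch in Python, not the patent's implementation: the names `SimilarPattern`, `SIMILAR_PATTERNS`, and `matches_pattern` are hypothetical, and the ratios are assumed to be computed by the caller, since the choice of denominator is defined by the patterns themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimilarPattern:
    """One row of the similar pattern information (hypothetical layout)."""
    pattern_id: int
    kind: str                                # "doc_title", "page_title" or "page_text"
    page_threshold: Optional[float] = None   # required ratio of matching pages
    text_threshold: Optional[float] = None   # required ratio of matching texts per page


# Example values quoted in the description of Table 1.
SIMILAR_PATTERNS = [
    SimilarPattern(1, "doc_title"),
    SimilarPattern(2, "page_title", page_threshold=0.75),
    SimilarPattern(3, "page_text", page_threshold=0.75, text_threshold=0.50),
]


def matches_pattern(pattern, same_doc_title, title_page_ratio, text_page_ratio):
    """Return True if the comparison results satisfy the pattern's condition.

    title_page_ratio: share of pages whose page title matches.
    text_page_ratio: share of pages whose page texts match at or above the
    per-page text threshold (assumed to be computed by the caller).
    """
    if pattern.kind == "doc_title":
        return same_doc_title
    if pattern.kind == "page_title":
        return title_page_ratio >= pattern.page_threshold
    if pattern.kind == "page_text":
        return text_page_ratio >= pattern.page_threshold
    raise ValueError(f"unknown pattern kind: {pattern.kind}")
```

Keeping the thresholds in data rather than in code mirrors the statement that the similar pattern information is set via the input/output unit 11.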

The similar document information storage unit 134 stores similar document information. As shown in Table 2, the similar document information indicates, for each document identifier of document data (selected document data), a similar pattern identifier and the document identifier (similar document identifier) of similar document data that matches the similar pattern of that identifier. The similar document information may further include the number of pages of the selected document data that match the similar pattern. The similar document information is used as an index when the similar document search unit 124 searches for similar document data.

<Processing procedure>
Next, the processing procedure of the similar document determination apparatus 10 is described using flowcharts. First, the procedure by which the similar document determination apparatus 10 creates document analysis information and similar document information is described with reference to FIG. 5. FIG. 5 is a flowchart showing the processing procedure of the similar document determination apparatus of FIG. 3.

First, on receiving input of document data via the input/output unit 11, the similar document determination apparatus 10 stores the received document data in the document storage unit 131 (S1). The document analysis unit 121 then sorts the document data stored in the document storage unit 131 by last edit date and time (S2), and sets (selects) the document data with the oldest last edit date and time (S3).

Next, the document analysis unit 121 extracts page titles and page texts from the set document data (S4). Using the page titles and page texts extracted in S4, the document analysis unit 121 creates document analysis information (see FIG. 4) for this document data (S5) and stores it in the document analysis information storage unit 132. Details of the procedure for creating the document analysis information are described later.

Next, the determination unit 122 performs similarity determination between the selected document data and the document data whose document analysis information is already registered in the document analysis information storage unit 132 (S6). Details of this similarity determination process are also described later. After the similarity determination for the document data set in S3, the document analysis unit 121 stores the document analysis information of this document data in the document analysis information storage unit 132.

Note that when the document analysis unit 121 sets the document data with the oldest last edit date and time in S3, no document analysis information is yet stored in the document analysis information storage unit 132. The processing of S6 is therefore skipped, and the created document analysis information is stored as-is in the document analysis information storage unit 132.

After S6, if not all document data in the document storage unit 131 has been processed (No in S7), the document analysis unit 121 sets the document data with the next-oldest last edit date and time from the document data sorted in S2 (S8) and returns to S4. If all document data in the document storage unit 131 has been processed (Yes in S7), the process ends.

In this way, the similar document determination apparatus 10 selects the document data stored in the document storage unit 131 in order from the oldest last edit date and time, and performs similarity determination against the document data indicated in the document analysis information. This allows similar document information to be created efficiently for document data that is likely to be in a derivation relationship. For example, when the document storage unit 131 stores a plurality of pieces of document data derived from a certain original piece of document data, the original is expected to have the oldest last edit date and time. The order in which document data was derived from the original is then close to the chronological order of last edit dates and times, oldest first. Therefore, compared with performing similarity determination on every combination of the document data stored in the document storage unit 131 and creating similar document information from the results, similar document information can be created efficiently for document data that is likely to actually be in a derivation relationship (that is, similar).
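The oldest-first loop of S2 through S8 can be sketched as follows. This is an illustrative reading only: `build_similarity_pairs` and the caller-supplied predicate `is_similar` are hypothetical names, and the in-memory list `registered` merely stands in for the document analysis information storage unit 132.

```python
from datetime import datetime


def build_similarity_pairs(documents, is_similar):
    """Compare each document only against documents with older edit times.

    documents: list of (doc_id, last_edited) tuples.
    is_similar: caller-supplied predicate taking (selected_id, registered_id).
    Returns a list of (selected_id, similar_id) pairs.
    """
    registered = []   # stands in for the document analysis information store
    pairs = []
    for doc_id, _ in sorted(documents, key=lambda d: d[1]):   # S2: oldest first
        for older_id in registered:                           # S6: nothing to do when empty
            if is_similar(doc_id, older_id):
                pairs.append((doc_id, older_id))
        registered.append(doc_id)                             # register analysis information
    return pairs
```

With this ordering, every recorded pair points from a newer document to an older one, which matches the assumed direction of derivation.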

Next, the document analysis information creation process shown in S5 is described using FIG. 7, with reference to FIGS. 3, 5, and 6. FIG. 6 shows an example of the structure of document data in this embodiment. FIG. 7 is a flowchart showing details of the document analysis information creation process of FIG. 5. Here, the document analysis unit 121 of the similar document determination apparatus 10 generates a document title from the document data. In addition, when the document analysis unit 121 could not extract a page title in S4 of FIG. 5, it generates the page title here. In the following description, page numbers in document data are assigned in the order "0, ..., N" starting from the first page of the document data, as shown in FIG. 6. Page text numbers are assigned within each page of the document data in the order "0, 1, ..., N", starting from, for example, the content (text box) located at the top of the page. However, on the first page of the document data, the page text numbers start from "-1" so that its first page text can be distinguished from the first page text of other pages.

First, the document analysis unit 121 of the similar document determination apparatus 10 of FIG. 3 sets the start page number of the document data set in S3 or S8 of FIG. 5 to 0 (S11). It also sets the start page text number (the page text number on the start page) to -1 (S12).

If this page has a page title (Yes in S13) and the page number is 0 (Yes in S16), the document analysis unit 121 extracts the content being processed as the document title of this document data (S17). It also extracts this content as the page title of this page (S18). In other words, the page title of the first page of the document data is extracted as the document title.

If, on the other hand, this page has a page title (Yes in S13) but the page number is not 0 (No in S16), that is, the page is not the first page, the document analysis unit 121 extracts this content as the page title of this page (S18). The document title and page titles extracted here are accumulated in the document file information, page title information, and so on of the document analysis information.

Then, if the page text number of this content is 0 (Yes in S19), the document analysis unit 121 extracts this content as page text of that page (S20). The document analysis unit 121 accumulates the extracted page text in the page text information (see FIG. 4) and so on of the document analysis information for this document data.

On the other hand, when this page has no page title (No in S13), or when the page text number is not 0 in S19 (No in S19), the document analysis unit 121 sets the start page text number to 0 (S14). If it then confirms that the page text number of the content being processed is 0 and that the page has no page title (Yes in S15), it proceeds to S16. If, in S15, the page text number of the content being processed is 1 or greater, or the page has a page title (No in S15), it proceeds to S20 and extracts the content being processed as page text. That is, in S15, the document analysis unit 121 determines whether the content being processed is to be extracted as a document title or page title, or as page text.

After S20, if the document analysis unit 121 has finished processing all page texts of the page being processed (Yes in S21) and has finished processing all pages of the document data being processed (Yes in S22), the process ends. If, after S20, the page being processed still has unprocessed content (No in S21), the document analysis unit 121 increments the start page text number by 1 (S24) and returns to S15; that is, it moves on to the next content on the same page. If all page texts of the page being processed have been processed (Yes in S21) but the document data being processed still has unprocessed pages (No in S22), the document analysis unit 121 increments the start page number by 1 (S23) and returns to S13; that is, it moves on to the next page in the same document data.

In this way, the document analysis unit 121 of the similar document determination apparatus 10 generates a document title from the document data, extracts page titles and page texts, and creates document analysis information. The created document analysis information is stored in the document analysis information storage unit 132.
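The outcome of this analysis can be approximated with a much shorter routine. The sketch below is a simplified reading of the flowchart, not a step-by-step reproduction of S11 through S24: the input layout (a list of page dictionaries with an optional "title" and ordered "contents") and the function name `analyze_document` are assumptions made for illustration.

```python
def analyze_document(doc_id, pages):
    """Build a simplified document analysis record.

    pages: list of dicts, each with an optional "title" (str or None) and
    "contents" (text boxes ordered from the top of the page).
    """
    page_titles = {}   # page number -> page title
    page_texts = {}    # page number -> list of page texts
    doc_title = None
    for page_no, page in enumerate(pages):     # page numbers start at 0
        contents = list(page.get("contents", []))
        title = page.get("title")
        if title is None and contents:
            # No explicit title: use the topmost content as the page title.
            title = contents.pop(0)
        if page_no == 0:
            doc_title = title                  # first page title doubles as document title
        page_titles[page_no] = title
        page_texts[page_no] = contents
    return {
        "doc_id": doc_id,
        "doc_title": doc_title,
        "page_titles": page_titles,
        "page_texts": page_texts,
    }
```

The returned dictionary plays the role of the document file information, page title information, and page text information combined into one record, which the description allows as an option.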

Next, the processing of S6 in FIG. 5 is described using FIG. 8. FIG. 8 is a flowchart showing details of the similarity determination process of FIG. 5. Described here is the process in which the similar document determination apparatus 10 of FIG. 3 performs similarity determination between the selected document data and the document data whose document analysis information is already registered in the document analysis information storage unit 132, and creates similar document information based on the result.

First, the determination unit 122 of FIG. 3 searches the document file information of the document analysis information for the document identifier of document data having the same document title as the selected document data (S31). If there is a matching document identifier (Yes in S32), the similar document information creation unit 123 creates similar document information indicating this document identifier. That is, the similar document information creation unit 123 registers, in the similar document information, the document identifier of the selected document data, similar pattern identifier "1" (same document title), and the document identifier of the similar document data that matches this similar pattern identifier (S33). The determination unit 122 then sets the first page of the selected document data (S34).

If, on the other hand, the document analysis information contains no document identifier of document data having the same document title as the selected document data (No in S32), S33 is skipped and the process proceeds to S34. After S34, the determination unit 122 searches the page title information of the document analysis information for the document identifier of document data having the same page title as the currently set page (S35). If there is a matching document identifier (Yes in S36), the similar document information creation unit 123 registers this document identifier in a page title similarity list (not shown) (S37). The page title similarity list is a list for counting the number of pieces of document data having the same page title and, for each such piece of document data, the number of pages having the same page title. This list is stored in a predetermined area of the storage unit 13.

If, on the other hand, the document analysis information contains no document identifier of document data having the same page title as the currently set page of the selected document data (No in S36), S37 is skipped, and the determination unit 122 sets the first page text of the selected document data (S38). The determination unit 122 then searches the page text information of the document analysis information for the document identifier of document data having the same page text as the page text of the currently set page (S39). If there is a matching document identifier (Yes in S40), the similar document information creation unit 123 registers this document identifier in a similarity list (not shown) (S41). The similarity list is a list for counting the document identifiers of document data having the same page text and the number of identical page texts in that document data. This list is also stored in a predetermined area of the storage unit 13.

If the document analysis information contains no document identifier of document data having the same page text as the page text of the currently set page of the selected document data (No in S40), and all page texts in the currently set page have been processed (Yes in S42), the process proceeds to S44, which is described later. If the currently set page of the selected document data still has unprocessed page text (No in S42), the next page text is set (S43) and the process returns to S39. Through this processing, for the page being processed among the pages of the selected document data, the document identifiers of document data having the same page text as page text in that page are registered in the similarity list.

When processing of all page texts in the currently set page is finished in S42 (Yes in S42), the similarity list contains the document identifier of each piece of document data having the same page text as a page text in the currently set page, registered as many times as that document data contains the same page text. For example, suppose page A of the selected document data contains page texts A and B. If the comparison target document data with document identifier "A002" contains N page texts identical to page text A and M page texts identical to page text B, the document identifier "A002" is registered (N+M) times in the similarity list. This shows that, for document identifier "A002", there are (N+M) page texts identical to page texts of page A of the selected document data.

When processing of one page of the selected document data is finished, the determination unit 122, in S44, obtains the number of page texts for each document identifier registered in the similarity list and registers, in a page text similarity list, the document identifiers whose number of page texts is equal to or greater than the page text threshold in the similar pattern information. For example, when page A of the selected document data has 6 page texts and the threshold is 50%, the determination unit 122 registers in the page text similarity list those document identifiers in the similarity list for which the number of page texts identical to page texts of page A is 3 or more. The threshold used is the page text threshold of the similar pattern relating to page text (similar pattern identifier "3") among the similar patterns indicated in the similar pattern information (see Table 1). By executing this processing for each page of the selected document data, the determination unit 122 finds the document identifiers of document data similar to the selected document data on a page-by-page basis.
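The per-page counting of S39 through S44 can be sketched with a counter keyed by document identifier. This is a minimal illustration under assumptions: `similar_docs_for_page` is a hypothetical name, and `text_index` is an assumed lookup table standing in for the page text information, with one list entry per occurrence of a page text in a document.

```python
from collections import Counter


def similar_docs_for_page(page_texts, text_index, text_threshold=0.5):
    """Return document identifiers that share enough page texts with one page.

    page_texts: page texts of the currently set page of the selected document.
    text_index: dict mapping a page text to a list of document identifiers,
        one entry per occurrence (stands in for the page text information).
    """
    similarity_list = Counter()                  # S39-S41: one count per match
    for text in page_texts:
        similarity_list.update(text_index.get(text, []))
    required = len(page_texts) * text_threshold  # e.g. 6 page texts * 50% -> 3
    # S44: keep identifiers whose count reaches the page text threshold.
    return sorted(d for d, n in similarity_list.items() if n >= required)
```

In the example from the text, a page with 6 page texts and a 50% threshold keeps only those identifiers counted 3 or more times.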

After S44, if the selected document data has unprocessed pages (No in S45), the determination unit 122 sets the next page (S46) and returns to S35. If all pages of the selected document data have been processed (Yes in S45), the determination unit 122 obtains, from the page title similarity list, the number of similar pages (pages having the same page title) for each document identifier, and registers in the similar document information each document identifier whose number of similar pages is equal to or greater than the threshold indicated in the similar pattern information, together with that number of similar pages (S47).

For example, consider the case where the selected document data has 10 pages and the page threshold for the similar pattern relating to page titles (similar pattern identifier "2" in Table 1) is 80%. In this case, when the page title similarity list contains a document identifier whose number of similar pages (pages having the same page title) is 8 or more, the determination unit 122 determines the document data of that document identifier to be similar document data of similar pattern identifier "2". The similar document information creation unit 123 therefore registers, in the similar document information, the document identifier of the document data matching this similar pattern (for example, "A002"), the document identifier of the selected document data (for example, "A001"), similar pattern identifier "2", and the number of similar pages, "8 pages" (see Table 2).
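The page title check of S35 through S47 follows the same counting idea at the level of whole pages. The sketch below assumes a hypothetical lookup table `title_index` from page title to the set of documents containing a page with that title; the function name `title_similar_docs` is likewise an illustration, and the 80% default is the example value used above.

```python
from collections import Counter


def title_similar_docs(selected_titles, title_index, page_threshold=0.8):
    """Return {doc_id: similar_page_count} for documents meeting the threshold.

    selected_titles: page titles of the selected document, one per page.
    title_index: dict mapping a page title to the set of identifiers of
        documents containing a page with that title (hypothetical layout).
    """
    page_title_list = Counter()                       # S35-S37: page title similarity list
    for title in selected_titles:
        page_title_list.update(title_index.get(title, ()))
    required = len(selected_titles) * page_threshold  # 10 pages * 80% -> 8
    # S47: keep identifiers whose similar page count reaches the threshold.
    return {d: n for d, n in page_title_list.items() if n >= required}
```

For a 10-page selected document, a document sharing 8 page titles is returned with a similar page count of 8, matching the values registered in the example.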

After this, the determination unit 122 obtains, from the page text similarity list, the number of similar pages for each document identifier, and registers in the similar document information each document identifier whose number of similar pages is equal to or greater than the threshold indicated in the similar pattern information, together with that number of similar pages (S48).

For example, suppose the selected document data has 10 pages. In this case, when the page text similarity list contains a document identifier whose number of similar pages (pages whose page text matched at or above the page text threshold) is 8 or more, the determination unit 122 determines the document data of that document identifier to be similar document data of similar pattern identifier "3". The similar document information creation unit 123 therefore registers, in the similar document information, the document identifier of the document data matching this similar pattern (for example, "A003"), the document identifier of the selected document data (for example, "A001"), similar pattern identifier "3", and the number of similar pages, "8 pages" (see Table 2). The process then ends.

In this way, the similar document determination apparatus 10 of FIG. 3 performs similarity determination between the selected document data and the document data whose document analysis information is already registered in the document analysis information storage unit 132, and creates similar document information based on the result. The created similar document information is used for processing such as searching for similar document data.

Although the determination unit 122 is described here as determining which of the three similar patterns indicated in the similar document information applies to document data, the invention is not limited to this. For example, various similar patterns may be defined in the similar pattern information by combining values of the page threshold for page titles, the page threshold for page texts, and the page text threshold, and the determination unit 122 may determine which of those similar patterns the document data matches.

Next, the search process and display process for similar document data using the similar document information are described. FIG. 9 is a flowchart showing the search process and display process for similar document data performed by the similar document determination apparatus of FIG. 3. The example described here is one in which the similar document determination apparatus 10, on receiving a similar pattern identifier, searches the similar document information for pairs of document data and similar document data matching that similar pattern identifier, and displays those pairs sorted in chronological order.

First, the similar document search unit 124 of the similar document determination apparatus 10 of FIG. 3 receives a similar pattern identifier via the input/output unit 11 (S51). The similar document search unit 124 then searches the similar document information (Table 2) for pairs of document identifier and similar document identifier that match the input similar pattern identifier (S52). That is, using the input similar pattern identifier as a key, the similar document search unit 124 obtains from the similar document information the pairs consisting of the document identifier of document data matching that similar pattern identifier and the similar document identifier of its similar document data.

Next, the similar document search unit 124 looks up, in the document file information of the document analysis information, the last edit date and time of the document data corresponding to each of the obtained document identifiers and similar document identifiers (S53). Among the retrieved pairs, it sets the similar document identifier whose document data has the oldest last edit date and time (S54). The display processing unit 126 then displays the set similar document identifier on the display screen (S55). Next, the rearrangement processing unit 125 sorts the group of document identifiers paired with the similar document identifier set in S54 by the last edit date and time of the corresponding document data (S56) and sets the document identifier of the document data with the oldest last edit date and time (S57). The display processing unit 126 then displays the similar document identifier set in S54 together with the document identifier set in S57 (S58). That is, among the sets of similar document data registered in the similar document information, it displays the similar document identifier of the oldest similar document data matching the similar pattern and the document identifier of the document data paired with it. The number of similar pages registered in the similar document information may also be displayed at this time.

After that, the similar document search unit 124 sets, from the document identifiers sorted in S56, the document identifier with the next-oldest last edit date and time (S59). If any of the document identifiers sorted in S56 remain unprocessed (No in S60), the process returns to S58. When all of the document identifiers sorted in S56 have been processed (Yes in S60), the similar document search unit 124 sets, from the pairs of document identifier and similar document identifier retrieved in S52, the pair whose similar document data has the next-oldest last edit date and time after the similar document identifier set in S54 (S61). If any of the pairs retrieved in S52 remain unprocessed (No in S62), the process returns to S55. When all of the pairs retrieved in S52 have been processed (Yes in S62), the process ends.
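The net effect of steps S53 to S62 is a two-level ordering: similar documents oldest first and, under each one, its paired documents oldest first. The following is a minimal sketch of that ordering only, not of the step-by-step loop in FIG. 9. The function name `display_order`, the identifiers, and the integer timestamps are made up for illustration.

```python
from collections import defaultdict


def display_order(pairs, last_edited):
    """Order (document id, similar document id) pairs for display.

    pairs:       iterable of (doc_id, similar_doc_id) tuples (result of S52)
    last_edited: mapping from identifier to a sortable last edit time (S53)
    Returns rows of (similar_doc_id, doc_id), oldest similar document first,
    and oldest paired document first within each similar document.
    """
    groups = defaultdict(list)
    for doc_id, sim_id in pairs:
        groups[sim_id].append(doc_id)

    rows = []
    for sim_id in sorted(groups, key=lambda s: last_edited[s]):          # S54, S61
        for doc_id in sorted(groups[sim_id], key=lambda d: last_edited[d]):  # S56 to S59
            rows.append((sim_id, doc_id))
    return rows


# Hypothetical data: smaller numbers mean older edits.
pairs = [("B003", "B001"), ("A003", "A002"), ("A001", "A002")]
last_edited = {"A002": 1, "B001": 2, "A001": 3, "B003": 4, "A003": 5}
for row in display_order(pairs, last_edited):
    print(row)
```

Grouping by similar document identifier first keeps all documents paired with the same similar document on adjacent rows, which is what the display in FIG. 10 relies on.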

In this way, the similar document determination apparatus 10 can search the document data stored in the document storage unit 131 for the document identifiers of document data that match a given similar pattern. For example, it can retrieve groups of document data that share the same document title, or groups of document data that share the same page titles at or above a given ratio. The similar document determination apparatus 10 also refers to the last edit date and time of the retrieved document data and arranges the similar document data in order from the oldest last edit date and time; when several document data have the same document data as their similar document data, it further arranges those document data from oldest to newest and displays them. As a result, for a group of document data in a derivation relationship, the oldest document data (the document data most likely to be the derivation source of the group) is displayed first, and the document data likely to have been derived from it are displayed in an order close to the order of derivation.

An example of the display screen displayed by the display processing unit 126 will now be described. FIG. 10 shows an example of the display screen displayed by the display processing unit of FIG. 3. As shown in FIG. 10, the display screen includes the input similar pattern identifier, the similar document identifiers retrieved using that similar pattern identifier as a key, the document identifiers paired with them, the number of similar pages, and the last edit date and time of the similar document data and of the document data. The similar document identifiers are displayed in order from the oldest last edit date and time, and the document identifiers paired with each similar document identifier are likewise displayed in order from the oldest last edit date and time.

For example, the screen shown in FIG. 10 indicates that, among the pairs of document identifier and similar document identifier matching similar pattern identifier "1", the pair whose similar document data has the oldest last edit date and time has the similar document identifier "A002", with a last edit date and time of "2008.08.21.20:00:50". Among the document identifiers paired with this similar document identifier, the document data with the oldest last edit date and time has the document identifier "A001", the number of similar pages is "10 (pages)", and its last edit date and time is "2008.08.02.10:05:30". The screen also indicates that the pair with the next-oldest similar document data is the one with the similar document identifier "B001", whose last edit date and time is "2008.08.03.16:04:00". Among the document identifiers paired with this similar document identifier, the document data with the oldest last edit date and time has the document identifier "B003", the number of similar pages is "5 (pages)", and its last edit date and time is "2008.08.04.13:13:00".

Because the display processing unit 126 displays such a screen, the user can check the document data matching the specified similar pattern in chronological order, that is, in an order close to the order in which the document data were derived.

The document identifiers and similar document identifiers on this display screen may be linked to the corresponding document data and similar document data stored in the document storage unit 131. When a document identifier or similar document identifier is selected on the display screen, the display processing unit 126 may display the document data corresponding to that identifier. This makes it easier for the user to check document data that are in a similarity relationship (and presumed to be in a derivation relationship).

The present invention is not limited to the embodiment described above. For example, in the similarity determination process of S6 in FIG. 5, the similar document information is created for the document data stored in the document storage unit 131 in order from the oldest last edit date and time, but it may instead be created in order from the newest. In addition, although the similarity determination process shown in FIG. 8 uses document data stored in the document storage unit 131 as the selected document data, other document data may be used. That is, when the similar document determination apparatus 10 receives input of new document data, the document analysis unit 121 may extract the document title, page titles, and page text of that document data by the procedure described above, and the determination unit 122 may perform similarity determination against the document analysis information stored in the document analysis information storage unit 132. The display processing unit 126 may then display the similarity determination result on the display screen. In this way, the similar document determination apparatus 10 can perform similarity determination on a variety of document data.

When the document analysis unit 121 extracts the document title from document data and the first page has no page title, it may extract the first page text of that page as the document title. In this way, the similar document determination apparatus 10 can reliably extract a document title from the document data.
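The fallback described above can be sketched as follows. This is a sketch under assumptions: the page representation (a dict with an optional `"title"` and a list `"texts"`) and the function name `extract_document_title` are hypothetical and do not reflect the actual structure of a presentation file.

```python
from typing import Dict, List, Optional


def extract_document_title(pages: List[Dict]) -> Optional[str]:
    """Return the first page's title, falling back to its first page text."""
    if not pages:
        return None
    first = pages[0]
    if first.get("title"):
        # Normal case: the page title of the first page is the document title.
        return first["title"]
    texts = first.get("texts") or []
    # Fallback: no page title on the first page, so use its first page text.
    return texts[0] if texts else None


pages = [{"title": "", "texts": ["Quarterly report", "Draft"]},
         {"title": "Summary", "texts": ["..."]}]
print(extract_document_title(pages))
```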

Furthermore, the similar document information creation unit 123 may create the similar document information for all of the document data stored in the document storage unit 131. That is, the determination unit 122 may perform similarity determination on every combination of two document data in the document storage unit 131 and create the similar document information from the results. In this way, the similar document determination apparatus 10 can avoid failing to create similar document information for document data that are actually in a similarity relationship.

Furthermore, when searching the similar document information, the similar document search unit 124 may use as a key not only the similar pattern identifier but also the document identifier, the similar document identifier, the last edit date and time of the document data or of the similar document data, or any combination of these. In this way, the similar document determination apparatus 10 can retrieve a wide variety of similar document data.

The similar document determination apparatus 10 according to this embodiment can be realized by a program that causes a computer to execute the processing described above, and the program can be provided stored on a computer-readable storage medium (a CD-ROM or the like).

<Experimental results>
Next, with reference to FIGS. 11 and 12, an evaluation experiment is described that assessed how well derived document data can be detected by the similarity determination of the similar document determination apparatus 10. FIG. 11(a) shows the file names of the 18 document data used in the experiment, and FIG. 11(b) shows the correct derivation relationships among the document data of (a). FIG. 12 shows the evaluation experiment data for the similar document determination apparatus of this embodiment.

The document data used were the 18 PowerPoint (registered trademark) files shown in FIG. 11(a), and the similar patterns used were the 15 similar patterns shown in FIG. 12. #16 in FIG. 12 is a comparative example, a method using a tense index. The similar document determination apparatus 10 applied each of the 15 similar patterns and performed similarity determination between document data (that is, determined whether the document data can be presumed to be in a derivation relationship). The actual derivation relationships between the document data (the correct answers) are as shown in FIG. 11(b). In FIG. 11(b), the value inside each circle is the ID of a file shown in FIG. 11(a), and files (document data) connected by "→" are actually in a derivation relationship. Among the 18 document data, there were 22 pairs in a derivation relationship.

In the following description, the number of correct answers is the number of pairs of document data that are actually in a derivation relationship, and the number of extracted derivations is the number of pairs of document data that the similar document determination apparatus 10 determined to be in a derivation relationship. The number of extracted correct answers is the number of pairs, among the extracted derivations, that are actually in a derivation relationship. The total number of correct answers is the number of pairs of document data actually in a derivation relationship across all the document data processed. Precision and recall were calculated as precision = number of extracted correct answers / number of extracted derivations, and recall = number of extracted correct answers / total number of correct answers.
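The two measures defined above can be computed as in the following sketch. The function name `precision_recall` and the sample pairs are made up for illustration and are not the experimental data of FIG. 12; each pair is assumed to be written as (source ID, derived ID).

```python
def precision_recall(extracted, correct):
    """Compute precision and recall over sets of (source_id, derived_id) pairs.

    extracted: pairs the apparatus judged to be in a derivation relationship
    correct:   pairs actually in a derivation relationship
    """
    extracted, correct = set(extracted), set(correct)
    hits = len(extracted & correct)  # number of extracted correct answers
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical example: 4 extracted pairs, 5 correct pairs, 2 in common.
extracted = {(1, 2), (2, 3), (4, 5), (6, 7)}
correct = {(1, 2), (2, 3), (3, 4), (5, 6), (8, 9)}
print(precision_recall(extracted, correct))  # (0.5, 0.4)
```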

As shown in FIG. 12, the similar patterns used here fall broadly into four categories, determination rules 1, 2, 3, and 4. Determination rule 1 judges document data to be in a derivation relationship (to be similar document data) when the document data being compared contains a page identical to a page of the selected document data. Determination rule 2 judges document data to be in a derivation relationship (to be similar document data) when it contains the same text as the selected document data. For determination rule 2, a page text threshold and a page threshold (see Table 1) are set, and nine variations were defined by combining the page text threshold (25% to 75%) with the page threshold (25% to 75%). Determination rule 3 judges document data to be derived document data (similar document data) when it has the same page titles as the selected document data. For determination rule 3, three variations were defined according to the page threshold for page titles (25% to 75%). Determination rule 4 judges document data to be derived document data (similar document data) when it has the same document title as the selected document data. The "combination" of #15 judges document data to be derived document data (similar document data) when it satisfies both determination rule 1 (#1) and the variation of determination rule 2 with page text 25% or more match and pages 25% or more match (#2). The "tense index" of #16 splits each file name into a character string and a number and judges an older document whose character string matches to be derived document data (similar document data).
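The count of 15 similar patterns follows from 1 (rule 1) + 9 (rule 2) + 3 (rule 3) + 1 (rule 4) + 1 (combination). The sketch below enumerates them under two assumptions that are not stated outright in the text: that the threshold levels are 25%, 50%, and 75%, and that the rule 2 variations are numbered with the page threshold varying slowest. Both are inferred only because they agree with the three numbered examples the description gives (#2, #5, #10, and #12); the function and key names are hypothetical.

```python
from itertools import product

LEVELS = (25, 50, 75)  # assumed threshold levels, in percent


def build_patterns():
    """Enumerate similar patterns #1 to #15 as {number: (rule, parameters)}."""
    patterns = {1: ("rule1", {})}
    n = 2
    # Rule 2: nine combinations of page threshold and page text threshold.
    for page_pct, text_pct in product(LEVELS, LEVELS):
        patterns[n] = ("rule2", {"page_text_pct": text_pct, "page_pct": page_pct})
        n += 1
    # Rule 3: three page thresholds for matching page titles.
    for page_pct in LEVELS:
        patterns[n] = ("rule3", {"page_pct": page_pct})
        n += 1
    patterns[n] = ("rule4", {})  # same document title
    n += 1
    patterns[n] = ("combination", {"requires": (1, 2)})  # #1 and #2 together
    return patterns


patterns = build_patterns()
print(len(patterns))  # 15
print(patterns[5])    # ('rule2', {'page_text_pct': 25, 'page_pct': 50})
```

The #16 tense index is left out because it is the comparative method, not one of the 15 similar patterns.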

FIG. 12 shows the precision, recall, total number of correct answers, number of extracted derivations, and number of extracted correct answers obtained with each similar pattern in this experiment. As shown in FIG. 12, every similar pattern achieved higher recall than the #16 tense index, which demonstrates the effectiveness of the similar document determination apparatus 10 of this embodiment. Using #10 of determination rule 2 (page text 75% or more match, pages 75% or more match) gave a high precision (0.773), and using #2 of determination rule 2 (page text 25% or more match, pages 25% or more match) gave a high recall (0.800). For these 18 document data, therefore, #10 of determination rule 2 is effective when high precision is wanted, and #2 of determination rule 2 is effective when high recall is wanted. In addition, #5 of determination rule 2 (page text 25% or more match, pages 50% or more match) and #12 of determination rule 3 (page titles 50% or more match) gave balanced precision and recall, confirming that either of these similar patterns is effective when a determination balancing precision and recall is wanted.

FIG. 1 is a conceptual diagram showing an outline of the processing of the similar document determination apparatus of this embodiment.
FIG. 2 is a conceptual diagram showing an outline of the processing of the similar document determination apparatus of this embodiment.
FIG. 3 is a block diagram showing the configuration of the similar document determination apparatus of this embodiment.
FIG. 4 is a diagram illustrating the document analysis information of FIG. 3.
FIG. 5 is a flowchart showing the processing procedure of the similar document determination apparatus of FIG. 3.
FIG. 6 is a diagram showing an example of the structure of document data in this embodiment.
FIG. 7 is a flowchart showing details of the document analysis information creation process of FIG. 5.
FIG. 8 is a flowchart showing details of the similarity determination process of FIG. 5.
FIG. 9 is a flowchart showing the search processing and display processing for similar document data performed by the similar document determination apparatus of FIG. 3.
FIG. 10 is a diagram showing an example of the display screen displayed by the display processing unit of FIG. 3.
FIG. 11(a) is a diagram showing the file names of the 18 document data used in the experiment, and FIG. 11(b) is a diagram showing the correct derivation relationships among the document data of (a).
FIG. 12 is a diagram showing evaluation experiment data for the similar document determination apparatus of this embodiment.

Explanation of symbols

10 Similar document determination apparatus
11 Input/output unit (input unit)
12 Processing unit
13 Storage unit
20 Display device
121 Document analysis unit
122 Determination unit
123 Similar document information creation unit
124 Similar document search unit
125 Rearrangement processing unit
126 Display processing unit
131 Document storage unit
132 Document analysis information storage unit
133 Similar pattern information storage unit
134 Similar document information storage unit

Claims

1. A similar document determination apparatus comprising:
an input unit that receives input of one or more document data;
a document analysis unit that extracts, from the structure of the input document data, the document title of the document data and the page title and page text of each page in the document data, creates document analysis information in which the extracted document title, page titles, and page texts are associated with identification information of the source document data, and stores the document analysis information in a storage unit;
the storage unit, which stores the document analysis information; and
a determination unit that, when new document data is received via the input unit and the document analysis unit has extracted the document title of the new document data and the page title and page text of each page in it, determines, for each document data indicated in the document analysis information, whether that document data is similar document data matching any similar pattern, each similar pattern consisting of any one or a combination of (1) whether the document title is the same as the document title of the new document data, (2) the ratio of the number of pages having the same page title as a page title of the new document data, and (3) the ratio of page text that is the same as the page text of the new document data, and that, when the document data is determined to be similar document data matching one of the similar patterns, outputs a determination result including that similar pattern.

2. The similar document determination apparatus according to claim 1, wherein the document analysis unit extracts the page title of the first page of the document data as the document title of the document data and creates the document analysis information.

3. The similar document determination apparatus according to claim 1, wherein, when the first page of the document data has no page title, the document analysis unit extracts the page text of the first page of the document data as the document title of the document data and creates the document analysis information.

4. The similar document determination apparatus according to any one of claims 1 to 3, wherein, when the document data contains a page without a page title, the document analysis unit extracts the page text of that page as its page title and creates the document analysis information.

5. The similar document determination apparatus according to claim 1, further comprising:
a similar document information creation unit that, based on the determination result of the determination unit, creates similar document information in which identification information of the determined similar pattern is associated with identification information of the similar document data matching that similar pattern, and stores the similar document information in the storage unit;
a similar document search unit that, upon receiving via the input unit a search request indicating a search condition including at least one of identification information of a similar pattern and identification information of document data, searches the similar document information for sets of document data and similar document data satisfying the search condition; and
a display processing unit that displays the search result.

6. The similar document determination apparatus according to claim 5, wherein
the document analysis unit further extracts the last edit date and time of each of the document data and includes it in the document analysis information,
the similar document determination apparatus further comprises a rearrangement processing unit that, referring to the last edit date and time of the document data indicated in the document analysis information, sorts the sets of document data and similar document data retrieved from the similar document information in oldest-first or newest-first order of the last edit date and time of the similar document data in each set and, when the retrieved sets include a plurality of different document data paired with the same similar document data, sorts those sets in oldest-first or newest-first order of the last edit date and time of the document data, and
the display processing unit displays the sorted sets of document data and similar document data.

7. The similar document determination apparatus according to claim 6, wherein
the storage unit includes a document storage unit that stores the one or more document data,
the document analysis unit selects document data stored in the document storage unit and extracts the document title of the selected document data and the page title and page text of each page in that document data,
the determination unit determines, for each document data indicated in the document analysis information already stored in the storage unit, which similar pattern that document data matches with respect to the selected document data, each similar pattern consisting of any one or a combination of (1) whether the document title is the same as the document title of the selected document data, (2) the ratio of the number of pages having the same page title as a page title of the selected document data, and (3) the ratio of page text that is the same as the page text of the selected document data, and outputs the determination result, and
the document analysis unit creates document analysis information in which the document title, page titles, and page texts extracted from the selected document data are associated with identification information of the selected document data, and stores the document analysis information in the storage unit.

8. The similar document determination apparatus according to claim 7, wherein
the document analysis unit selects the document data stored in the document storage unit in order from the oldest last edit date and time, and extracts the document title of the selected document data and the page title and page text of each page in that document data, and
the rearrangement processing unit sorts the sets of document data and similar document data retrieved from the similar document information in descending order of the last edit date and time of the similar document data in each set and, when the retrieved sets include a plurality of different document data paired with the same similar document data, sorts those sets in order from the oldest last edit date and time of the document data in each set.

9. A similar document determination method in which a similar document determination apparatus that performs similarity determination of document data executes:
a step of receiving input of one or more document data;
a step of extracting, from the structure of the input document data, the document title of the document data and the page title and page text of each page in the document data, and creating document analysis information in which the extracted document title, page titles, and page texts are associated with identification information of the source document data;
a step of storing the created document analysis information in a storage unit;
a step of, when new document data is received, extracting the document title of the new document data and the page title and page text of each page in it; and
a step of determining, for each document data indicated in the document analysis information, whether that document data is similar document data matching any similar pattern, each similar pattern consisting of any one or a combination of (1) whether the document title is the same as the document title of the new document data, (2) the ratio of the number of pages having the same page title as a page title of the new document data, and (3) the ratio of page text that is the same as the page text of the new document data, and, when the document data is determined to be similar document data matching one of the similar patterns, outputting a determination result including that similar pattern.

10. A program for causing a computer serving as the similar document determination apparatus to execute the similar document determination method according to claim 9.