JP2012018510A

JP2012018510A - Document processor, document processing method, document processing program, and computer readable recording medium recorded with document processing program

Info

Publication number: JP2012018510A
Application number: JP2010154764A
Authority: JP
Inventors: Takayuki Tamura; 孝之田村
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-07-07
Filing date: 2010-07-07
Publication date: 2012-01-26
Anticipated expiration: 2030-07-07
Also published as: JP5464082B2

Abstract

PROBLEM TO BE SOLVED: To enable extraction of original documents, including documents that make legitimate citations, from documents on the Internet.
SOLUTION: A document processing apparatus includes: partial character string generation means for generating, for each document, partial character strings each forming part of a character string contained in a plurality of documents; unique partial character string determination means for determining, as a unique partial character string, any generated partial character string that is not contained in any document other than the one from which it was generated; and unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between its total number of partial character strings and its number of unique partial character strings, as determined by the unique partial character string determination means, is within a predetermined range.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a document processing apparatus that extracts the target documents to be analyzed from documents on the Internet, for example documents such as blogs.

With the growth of the Internet, the amount of available document data has increased dramatically. This data now includes many documents in which individuals, through blogs and similar media, voluntarily describe their interests or state opinions on social events. Collecting and analyzing such documents is therefore expected to make it possible to grasp social trends and consumer behavior comprehensively and in real time, something that previously required recruiting respondents and conducting questionnaires.
Digital data, on the other hand, is as easy to quote, edit, modify, and republish as it is to obtain, and documents on the Internet are said to contain a great deal of such secondary information. When original primary information is mixed with secondary information derived from it, problems arise: storage efficiency falls because similar data is stored redundantly, and result lists become harder to scan because similar results are presented repeatedly in response to a search query. Systems have therefore been proposed that extract partial character strings from each document and manage, for each partial character string, a list of the documents in which it appears, so that documents containing overlapping portions can be presented (see, for example, Patent Documents 1 and 2).

Patent Document 1: JP 2008-33728 A
Patent Document 2: JP 2008-511081 A (published Japanese translation of a PCT application)

Some blogs are aimed less at the article body than at delivering the advertisements attached to it. To produce articles with as little effort as possible, the authors of such advertising articles often generate them automatically using computer programs that fetch and quote other blog articles. When blog articles of this kind, called spam blogs, exist in large numbers, statistical processing of all document data for the purpose of grasping the social trends and consumer behavior mentioned above is distorted: automatically quoted information appears more often, so the measured frequencies diverge greatly from how often a topic is actually discussed.

Conventional document processing apparatuses, for example that of Patent Document 1, go no further than detecting documents that may be duplicates; final actions such as deleting a document must be carried out by the user of the apparatus after judging the content of the document. This is because blogs sometimes quote legitimately when commenting on primary information such as news articles, so uniformly deleting every document in which duplication is detected is inappropriate, and the user must scrutinize each article individually. The technique of Patent Document 1 is intended to detect duplication that arises unintentionally in managing a database, and cannot cope with situations in which data is intentionally quoted or copied by others.

Patent Document 2 assumes documents on the Internet, but it is a technique for determining, for a pair of documents, whether duplication exists at the level of whole documents. Consequently, for a document synthesized by cutting out parts of two or more documents from a document set, duplication cannot be detected in units of the excerpted parts, and such synthesized documents cannot be excluded.

The present invention has been made to solve the problems described above. From a document set, it excludes, as documents generated by automatic citation, those documents in which at least a certain proportion of the partial character strings also appear in other documents, thereby making it possible to extract original document data, including documents that make legitimate citations.

To solve the problems described above, a document processing apparatus according to the present invention comprises: partial character string generation means for generating, for each document, partial character strings each forming part of a character string contained in a plurality of documents; unique partial character string determination means for determining, as a unique partial character string, any partial character string generated by the partial character string generation means that is not contained in any document other than the document from which it was generated; and unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

A document processing method according to the present invention comprises: a partial character string generation step in which partial character string generation means generates, for each document, partial character strings each forming part of a character string contained in a plurality of documents; a unique partial character string determination step in which unique partial character string determination means determines, as a unique partial character string, any partial character string generated by the partial character string generation means that is not contained in any document other than the document from which it was generated; and an unnecessary document detection step in which unnecessary document detection means detects, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

A document processing program according to the present invention causes a computer to function as: partial character string generation means for generating, for each document, partial character strings each forming part of a character string contained in a plurality of documents; unique partial character string determination means for determining, as a unique partial character string, any partial character string generated by the partial character string generation means that is not contained in any document other than the document from which it was generated; and unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

A computer-readable recording medium according to the present invention has recorded on it a document processing program that causes a computer to function as: partial character string generation means for generating, for each document, partial character strings each forming part of a character string contained in a plurality of documents; unique partial character string determination means for determining, as a unique partial character string, any partial character string generated by the partial character string generation means that is not contained in any document other than the document from which it was generated; and unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

According to the present invention, partial character strings are generated for each document from the character strings contained in a plurality of documents, and the proportion of unique partial character strings specific to each document is obtained. Documents in which this proportion is low are detected as documents of low usefulness that consist mainly of quotations from one or more other documents. Because these documents can then be removed, a document set suited to statistical processing can be obtained.
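The idea summarized above can be illustrated with a minimal Python sketch. It is not the implementation described in the embodiments: it keeps the substrings themselves in memory rather than hash values, it assumes the documents are already split into words, and the names `k_word_substrings`, `detect_unnecessary`, and `min_unique_ratio`, as well as the threshold value and the strict `<` comparison, are assumptions chosen for illustration.

```python
from collections import defaultdict


def k_word_substrings(words, k):
    """Return every run of k consecutive words as a tuple."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def detect_unnecessary(docs, k=5, min_unique_ratio=0.5):
    """Return ids of documents whose share of unique substrings is too low.

    docs maps a document id to its list of words. The threshold and the
    strict "<" comparison are illustrative assumptions.
    """
    owners = defaultdict(set)  # substring -> ids of documents containing it
    per_doc = {}
    for doc_id, words in docs.items():
        subs = k_word_substrings(words, k)
        per_doc[doc_id] = subs
        for sub in subs:
            owners[sub].add(doc_id)

    flagged = set()
    for doc_id, subs in per_doc.items():
        if not subs:
            continue  # fewer than k words: nothing to judge in this sketch
        unique = sum(1 for sub in subs if len(owners[sub]) == 1)
        if unique / len(subs) < min_unique_ratio:
            flagged.add(doc_id)
    return flagged
```

A document stitched together from pieces of two others shares most of its substrings with them, so its unique ratio is low, while the source documents keep the substrings that were not copied.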

A block diagram showing an example of a document processing apparatus according to Embodiment 1 of the present invention.
A diagram showing an example of the detailed storage format of the acquired document data 5.
A diagram showing the detailed storage format of the partial character string table 6.
A diagram showing the detailed storage format of the document attribute table 7.
A diagram explaining in detail how the partial character string table 6 is stored in a computer system.
A schematic flowchart showing the operation of the unnecessary document removal means 3.
A flowchart showing the details of step S1.
A flowchart showing the details of the operation of step S16.
A flowchart showing the details of the operation of step S19.
A flowchart showing the details of step S2.
A block diagram of a document processing apparatus according to a second embodiment of the present invention.
A flowchart showing the operation of the unnecessary URL removal means 14.
A block diagram of a document processing apparatus according to a third embodiment of the present invention.
A diagram showing the frequent q-word string table 90, which stores q-word strings, for the case q = 3.
A flowchart showing the part of the operation of the unnecessary document removal means 3 that is added in Embodiment 3 relative to Embodiment 1.
A flowchart showing the details of the part of the operation of step S3 that corresponds to a specific q (q > 1).

Embodiment 1.
FIG. 1 is a block diagram showing an example of a document processing apparatus according to Embodiment 1 of the present invention.
In FIG. 1, the document processing apparatus 1 includes document acquisition means 2, which acquires documents from external servers such as servers A to C, and unnecessary document removal means 3, which removes unnecessary documents, together with a document URL list 4, acquired document data 5, a partial character string table 6, and a document attribute table 7. The unnecessary document removal means 3 includes partial character string generation means 8, which generates partial character strings from the character strings in a document; unique partial character string determination means 9, which determines which of the generated partial character strings are unique, that is, do not appear in any other document (unique partial character strings); and unnecessary document detection means 10, which detects unnecessary documents from the ratio between the number of unique partial character strings and the total number of partial character strings. The document URL list 4 holds a list of addresses, for example URLs (Uniform Resource Locators), that identify the documents on external servers to be acquired by the document acquisition means 2. The acquired document data 5 stores the document data acquired by the document acquisition means 2. The partial character string table 6 stores the intermediate state of processing by the unnecessary document removal means 3. The document attribute table 7 stores the results of processing by the unnecessary document removal means 3.

The document processing apparatus 1 can be realized using a general-purpose computer system that has hardware such as a CPU, RAM, a magnetic disk device, and a network interface, plus operating system software that controls the hardware, together with a program that defines the operation of the CPU. In this case, the document acquisition means 2 and the unnecessary document removal means 3 are realized as programs that are read from the magnetic disk device into RAM and executed by the CPU, while the document URL list 4, the acquired document data 5, the partial character string table 6, and the document attribute table 7 are each realized as a dedicated storage area in RAM or on the magnetic disk device.

FIG. 2 is a diagram showing an example of the detailed storage format of the acquired document data 5.
The acquired document data 5 consists of a plurality of entries, each of which stores the document URL 51, the acquisition date and time 52, and the document content 53 of one document. The document URL 51 is one of the addresses of the documents to be acquired that were stored in the document URL list 4, and the acquisition date and time 52 is the date and time at which the document acquisition means 2 actually acquired the document data. The document content 53 is the content data of the document acquired by the document acquisition means 2. The address of a document on an external server is uniquely identified by the document URL 51, but the content acquired from the same document URL 51 at different times may differ. The acquired document data 5 may therefore contain several entries that share a document URL 51 but have different acquisition dates and times 52. No particular constraint is placed on where each entry is stored within the acquired document data 5.

FIG. 3 is a diagram showing the detailed storage format of the partial character string table 6.
The partial character string table 6 consists of a plurality of entries, each of which stores the hash value 61, the document URL 62, the acquisition date and time 63, and the duplication flag 64 for one partial character string. A partial character string here is a character string of a fixed number of words taken from the document content 53 of the acquired document data 5 shown in FIG. 2. The number of words is denoted k (for example, k = 5).

The hash value 61 stores a hash value calculated for the partial character string corresponding to the entry by a known one-way hash function such as CRC (Cyclic Redundancy Code) or SHA-256 (Secure Hash Algorithm, 256-bit). In practice such a hash value can be regarded as corresponding one-to-one with the original partial character string, so it is used as the key that uniquely identifies an entry in the partial character string table 6.
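As a hedged illustration of how such a key might be computed, the following Python sketch hashes a k-word partial character string with SHA-256 from the standard `hashlib` module. The function name `substring_hash` and the choice to join words with the U+001F unit separator are assumptions made for this example; the description does not specify how the words are serialized before hashing.

```python
import hashlib


def substring_hash(words):
    """Return the SHA-256 hex digest of a k-word partial character string.

    Joining with U+001F is an assumption of this sketch; it keeps
    ("ab", "c") and ("a", "bc") from serializing to the same bytes.
    """
    joined = "\x1f".join(words)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

The digest is a fixed-length string regardless of how long the words are, which is what makes it convenient as a table key.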

To preserve the uniqueness of the hash value 61, the storage position of each entry in the partial character string table 6 must be determined uniquely from the hash value 61. However, the acquired document data 5 may contain several entries, differing in document URL 51 or acquisition date and time 52, that contain the same partial character string, so the combination of document URL 62 and acquisition date and time 63 is not uniquely determined by the partial character string. Each entry in the partial character string table 6 is therefore given a duplication flag 64, and the uniqueness of each entry is maintained as follows.

The document URL 62 and the acquisition date and time 63 specify one combination of the document URL 51 and the acquisition date and time 52 of an entry in the acquired document data 5 from which the partial character string corresponding to the entry was extracted. The combination used is the document URL 51 and acquisition date and time 52 of the document with the oldest acquisition date and time among the documents that contain the partial character string. Alternatively, when the document set to be analyzed was collected by some means other than the document acquisition means 2, the document URL 51 and acquisition date and time 52 are replaced by the identification information of the document in which the partial character string was first generated and the date and time at which that document was created or updated.

The duplication flag 64 stores, for example, 0 when the partial character string corresponding to the entry is contained only in the document specified by the combination of the document URL 62 and the acquisition date and time 63, and 1 otherwise.
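One way to maintain an entry under these rules can be sketched in Python as follows. This is an assumption-laden illustration rather than the procedure of step S16, which the description gives separately: the names `SubstringEntry` and `register` are hypothetical, the table is a plain in-memory `dict` keyed by hash value, and a substring repeated within the same document is treated as still unique to that document.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SubstringEntry:
    doc_url: str           # document URL 62
    acquired_at: datetime  # acquisition date and time 63
    duplicate: int = 0     # duplication flag 64 (0 = seen in one document only)


def register(table, hash_value, doc_url, acquired_at):
    """Record that the document (doc_url, acquired_at) contains hash_value."""
    entry = table.get(hash_value)
    if entry is None:
        table[hash_value] = SubstringEntry(doc_url, acquired_at)
        return
    if (entry.doc_url, entry.acquired_at) == (doc_url, acquired_at):
        return  # repeated inside the same document: flag is unchanged
    entry.duplicate = 1
    if acquired_at < entry.acquired_at:
        # keep the document with the oldest acquisition date and time
        entry.doc_url, entry.acquired_at = doc_url, acquired_at
```

Because only one (URL, date) pair is kept per hash value, the table stays at one entry per distinct partial character string however many documents share it.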

FIG. 4 is a diagram showing the detailed storage format of the document attribute table 7.
The document attribute table 7 consists of a plurality of entries, each of which stores the document URL 71, the acquisition date and time 72, the partial character string count 73, the unique partial character string count 74, and the removal flag 75 of one document. The entries of the document attribute table 7 correspond one-to-one with the entries of the acquired document data 5, and the document URL 71 and acquisition date and time 72 correspond to the document URL 51 and acquisition date and time 52 of the acquired document data 5, respectively. The storage position of each entry in the document attribute table 7 must be determined uniquely by the document URL 71 and the acquisition date and time 72. The partial character string count 73 is the total number of partial character strings contained in the document content 53 of the corresponding entry of the acquired document data 5, and the unique partial character string count 74 is the number of those partial character strings that are not contained in the document content 53 of any other entry. The removal flag 75 stores 1 when the document corresponding to the entry was generated by automatic citation and is regarded as unnecessary, and 0 otherwise.
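For readers who find a concrete record easier to follow, the entry layout of FIG. 4 might be modeled in Python as below. The class name `DocumentAttributes` and the `unique_ratio` helper are assumptions for illustration; the default values mirror the initial values that step S13 assigns later in the description, and the table itself is assumed to be a `dict` keyed by the (URL, acquisition date and time) pair.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DocumentAttributes:
    doc_url: str                     # document URL 71
    acquired_at: datetime            # acquisition date and time 72
    substring_count: int             # partial character string count 73
    unique_substring_count: int = 0  # unique partial character string count 74
    removal_flag: int = 1            # removal flag 75 (1 = regarded as unnecessary)

    @property
    def unique_ratio(self):
        """Share of this document's substrings that no other document contains."""
        if self.substring_count <= 0:
            return 0.0
        return self.unique_substring_count / self.substring_count


def add_entry(table, entry):
    """Store an entry under the key (doc_url, acquired_at)."""
    table[(entry.doc_url, entry.acquired_at)] = entry
```

Keying on the pair rather than the URL alone allows two snapshots of the same URL, taken at different times, to coexist in the table.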

FIG. 5 is a diagram explaining in detail how the partial character string table 6 is stored in a computer system.
In FIG. 5, the document processing apparatus 1 is shown as a computer system made up of a CPU 11, a RAM 12, and a magnetic disk device 13.

Because the partial character string table 6 is accessed randomly according to the hash value 61, it is desirable to store it in the RAM 12. However, the number of entries in the partial character string table 6, that is, the number of distinct partial character strings, is very large, and the RAM 12 may not have enough capacity to hold the whole table. The partial character string table 6 is therefore divided into several pieces: partial character string table piece (0), which is stored in the RAM 12, and partial character string table pieces (1), (2), (3), and so on, which are stored on the magnetic disk device 13 as needed. Partial character string table piece (0) in the RAM 12 is realized by a known search structure keyed on the hash value 61, such as a red-black tree or a hash table. In each partial character string table piece (j) (j = 1, 2, ...) on the magnetic disk device 13, the entries are arranged in order of hash value 61.
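The benefit of keeping the on-disk pieces in hash order can be sketched in Python. The sketch below is illustrative only: the tab-separated file layout and the names `spill_piece`, `iter_piece`, and `merged_entries` are assumptions, and combining entries that share a hash value across pieces (which the integration in step S19 would need to do) is deliberately left out.

```python
import heapq


def spill_piece(piece, path):
    """Write an in-memory piece (hash -> record string) to disk in hash order."""
    with open(path, "w", encoding="utf-8") as f:
        for hash_value in sorted(piece):
            f.write(f"{hash_value}\t{piece[hash_value]}\n")


def iter_piece(path):
    """Yield (hash, record) pairs from one on-disk piece, in stored order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            hash_value, record = line.rstrip("\n").split("\t", 1)
            yield hash_value, record


def merged_entries(paths):
    """Stream entries from several sorted pieces in overall hash order."""
    return heapq.merge(*(iter_piece(p) for p in paths))
```

Because every piece is already sorted, the merge reads each file sequentially and holds only one entry per piece in memory at a time, which suits a table too large for RAM.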

The document attribute table 7 is likewise accessed randomly, according to the document URL 71 and the acquisition date and time 72. In most cases it is expected to fit in the RAM 12, so it can be realized as a red-black tree or hash table in the RAM 12 keyed on the document URL 71 and the acquisition date and time 72, with its contents written out to the magnetic disk device 13 when processing ends.

Each entry of the acquired document data 5 is written once and read once, and the data is large, so it is stored on the magnetic disk device 13.

Next, the operation of the document processing apparatus 1 will be described.
The document acquisition means 2 reads a document URL from the document URL list 4, connects to the external server named in the URL, and requests the document data by sending the URL in accordance with the well-known HTTP (Hypertext Transfer Protocol). The document acquisition means 2 then receives the response from the external server and appends an entry to the end of the acquired document data 5, with the URL as the document URL 51, the current time as the acquisition date and time 52, and the received content as the document content 53. This process is repeated for every URL in the document URL list 4. The documents need not be acquired one at a time; several documents may be acquired in parallel to shorten the time required.
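The acquisition loop might look like the following Python sketch, which uses `urllib.request.urlopen` from the standard library. The names `AcquiredDocument`, `http_fetch`, and `acquire_all`, the 10-second timeout, and the use of UTC timestamps are assumptions for illustration; error handling and the parallel acquisition mentioned above are omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.request import urlopen


@dataclass
class AcquiredDocument:
    url: str               # document URL 51
    acquired_at: datetime  # acquisition date and time 52
    content: bytes         # document content 53


def http_fetch(url):
    """Request the document at url over HTTP and return the response body."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def acquire_all(urls, fetch=http_fetch):
    """Fetch every URL in order and append one entry per document."""
    acquired = []
    for url in urls:
        content = fetch(url)
        acquired.append(AcquiredDocument(url, datetime.now(timezone.utc), content))
    return acquired
```

Passing the fetch function as a parameter is a design choice of this sketch: it lets the loop be exercised with a stub instead of a live server.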

Next, the operation of the unnecessary document removal means 3 will be described using flowcharts.
FIG. 6 is a schematic flowchart showing the operation of the unnecessary document removal means 3.
First, in step S1, the unnecessary document removal means 3 processes every entry stored in the acquired document data 5 and sets up the partial character string table 6 and the document attribute table 7.
Next, in step S2, the unnecessary document removal means 3 updates the document attribute table 7 on the basis of each entry in the partial character string table 6 and stores the final processing result for each document.

The details of the operation in step S1 of FIG. 6 are described below using a flowchart.
FIG. 7 is a flowchart showing the details of step S1.

First, in step S11, the unnecessary document removal means 3 causes the partial character string generation means 8 to initialize to 0 the variable f, which represents the number of partial character string table pieces on the magnetic disk device 13.

Next, in step S12, the partial character string generation means 8 selects one unprocessed entry from the acquired document data 5 as the processing target and applies known morphological analysis to its document content 53, dividing the single character string into a sequence of words. Here, the document data identified by the pair of the document URL 51 and the acquisition date and time 52 of the entry is denoted d, the number of elements in the word sequence obtained by morphological analysis of the document content 53 of document d is denoted nd, and the word sequence is denoted W1, W2, ..., Wnd. The entries of the acquired document data 5 may be processed in any order; for example, they may be processed in order starting from the first entry.

Next, in step S13, the partial character string generation means 8 creates the entry of the document attribute table 7 that corresponds to the document URL 51 and acquisition date and time 52 of document d. It sets the document URL 71 and the acquisition date and time 72 to the document URL 51 and the acquisition date and time 52, respectively, the partial character string count 73 to nd - k + 1 (where k is the number of words in a partial character string), the unique partial character string count 74 to 0, and the removal flag 75 to 1.

次に、ステップＳ１４において、部分文字列生成手段８は、文書ｄにおける部分文字列を識別する変数ｉを０に設定し、続くステップＳ１５で変数ｉに１を加える。 Next, in step S14, the partial character string generation means 8 sets a variable i for identifying a partial character string in the document d to 0, and adds 1 to the variable i in a subsequent step S15.

Next, in step S16, the partial character string generation means 8 updates the partial character string table piece, by a procedure described later, based on the hash value S_i of the i-th partial character string, which consists of the k consecutive words (W_i, W_{i+1}, ..., W_{i+k-1}) starting at the i-th word of the morphological analysis result of document d.
In step S17, the variable i is compared with n_d - k + 1. If the two are equal, the process proceeds to step S18; otherwise, it returns to step S15.

Next, in step S18, the partial character string generation means 8 determines whether all entries of the acquired document data 5 have been processed. If unprocessed entries remain, the process returns to step S12; if all entries have been processed, it proceeds to step S19.

Finally, in step S19, the partial character string generation means 8 merges the f + 1 partial character string table pieces by a procedure described later, generates a single partial character string table 6, and ends the process.
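The enumeration of partial character strings in steps S12 to S17 can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for morphological analysis, and a truncated SHA-1 digest stands in for the hash function, neither of which is specified in the description. The name `k_word_hashes` is hypothetical.

```python
import hashlib

def k_word_hashes(words, k):
    """Return the hash S_i of every run of k consecutive words.

    A document with n_d words yields n_d - k + 1 hashes (none if n_d < k).
    """
    hashes = []
    for i in range(len(words) - k + 1):
        # Join with a separator that cannot occur inside a word (assumption).
        chunk = "\x1f".join(words[i:i + k])
        digest = hashlib.sha1(chunk.encode("utf-8")).digest()
        # Keep 64 bits of the digest; the width is an arbitrary choice.
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes

# Whitespace splitting is a stand-in for morphological analysis.
words = "the quick brown fox jumps".split()
hashes = k_word_hashes(words, k=3)
print(len(hashes))  # n_d - k + 1 = 5 - 3 + 1
```

The length of the returned list corresponds to the partial character string count 73 set in step S13.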

Next, the details of the operation in step S16 of FIG. 7 are described with reference to a flowchart.
FIG. 8 is a flowchart showing the details of the operation in step S16.
First, in step S101, the partial character string generation means 8 calculates the hash value S_i of the i-th partial character string (W_i, W_{i+1}, ..., W_{i+k-1}).

Next, in step S102, the partial character string generation means 8 searches partial character string table piece (0) and determines whether an entry whose hash value 61 equals S_i exists. If such an entry exists, the process proceeds to step S107; otherwise, it proceeds to step S103.

Next, in step S103, the partial character string generation means 8 checks the current number of entries in partial character string table piece (0) and determines whether it is less than a predetermined value. If it is less than the predetermined value, the process proceeds to step S106; otherwise, it proceeds to step S104. Here, the predetermined value means, for example, the upper limit on the number of entries of partial character string table piece (0) that fits in the capacity of the RAM 12.

Next, in step S104, because no new entry can be created in partial character string table piece (0), the partial character string generation means 8 writes partial character string table piece (0) in the RAM 12 out to a new partial character string table piece (f + 1) on the magnetic disk device 13 and empties partial character string table piece (0). The entries of table piece (0) are output in order of hash value 61. This is straightforward if table piece (0) is implemented as a tree structure; if it is implemented as a hash table, all entries can be sorted before they are written out.

Subsequently, in step S105, the partial character string generation means 8 adds 1 to the variable f, which represents the number of partial character string table pieces on the magnetic disk device 13.

Next, in step S106, the partial character string generation means 8 allocates a new entry in partial character string table piece (0), sets its hash value 61 to S_i, sets its document URL 62 and acquisition date/time 63 to document URL 51 and acquisition date/time 52 of the document d currently being processed, sets its duplication flag 64 to 0, and ends the process.

Meanwhile, in step S107, the unnecessary document removal means 3 uses the unique partial character string determination means 9 to examine document URL 62 and acquisition date/time 63 of the existing entry in partial character string table piece (0) and determines whether they match document URL 51 and acquisition date/time 52 of document d, respectively. A match indicates that the partial character string occurs more than once within the same document, which does not suggest quotation between documents, so the process ends. Otherwise, the process proceeds to step S108.

Next, in step S108, the unique partial character string determination means 9 sets the duplication flag of the existing entry to 1, recording that the partial character string occurs in more than one document.

Subsequently, in step S109, the unique partial character string determination means 9 determines whether acquisition date/time 63 of the existing entry is newer than acquisition date/time 52 of document d. If it is newer, the process proceeds to step S110; otherwise, the process ends.

Finally, in step S110, the unique partial character string determination means 9 sets document URL 62 and acquisition date/time 63 of the existing entry to document URL 51 and acquisition date/time 52 of document d, respectively, and ends the process.
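Steps S102 to S110 can be sketched as a single update function. This is a simplified illustration, not the patented implementation: a Python dict stands in for table piece (0), a list of sorted snapshots stands in for the pieces written to the magnetic disk device, and acquisition dates/times are plain integers. The names `update_piece`, `piece`, `spilled` and `max_entries` are hypothetical.

```python
def update_piece(piece, spilled, max_entries, h, url, fetched):
    """Apply steps S102-S110 to the in-memory piece for one hash value h."""
    entry = piece.get(h)
    if entry is None:                                  # S102: no entry with this hash
        if len(piece) >= max_entries:                  # S103: the piece is full
            spilled.append(sorted(piece.items()))      # S104: write out in hash order
            piece.clear()                              # S105: len(spilled) plays the role of f
        piece[h] = {"url": url, "fetched": fetched, "dup": 0}   # S106
        return
    if (entry["url"], entry["fetched"]) == (url, fetched):      # S107: same document
        return
    entry["dup"] = 1                                   # S108: seen in several documents
    if entry["fetched"] > fetched:                     # S109: stored entry is newer
        entry["url"], entry["fetched"] = url, fetched  # S110: keep the oldest document

piece, spilled = {}, []
update_piece(piece, spilled, 2, 42, "http://a.example/", 20100701)
update_piece(piece, spilled, 2, 42, "http://b.example/", 20100601)  # older copy elsewhere
update_piece(piece, spilled, 2, 7, "http://b.example/", 20100601)
update_piece(piece, spilled, 2, 9, "http://b.example/", 20100601)   # piece full: spill first
```

After these calls the entry for hash 42 is attributed to the older document and carries a duplication flag of 1, and the first two entries have been moved to `spilled` in hash order.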

Next, the details of the operation in step S19 of FIG. 7 are described with reference to a flowchart.
FIG. 9 is a flowchart showing the details of the operation in step S19.
First, in step S21, the unnecessary document removal means 3 uses the partial character string generation means 8 to examine the first entry of each partial character string table piece to be merged and finds the minimum hash value. It then obtains the first entry from every table piece whose first entry has that minimum hash value and removes that first entry from each such table piece.

Next, in step S22, the partial character string generation means 8 determines whether exactly one entry was obtained. If so, the process proceeds to step S26; otherwise, it proceeds to step S23.

Next, in step S23, the unnecessary document removal means 3 uses the unique partial character string determination means 9 to determine whether document URL 62 and acquisition date/time 63 are identical across all of the obtained entries. If they are identical, the process proceeds to step S25; otherwise, it proceeds to step S24.

Next, in step S24, because it is clear that multiple documents correspond to the same hash value, the unique partial character string determination means 9 sets duplication flag 64 to 1, sets hash value 61, document URL 62 and acquisition date/time 63 to those of the entry with the smallest (oldest) acquisition date/time 63 among the obtained entries, outputs the result, and proceeds to step S27.

Meanwhile, in step S25, because all of the obtained entries correspond to the same document, the unique partial character string determination means 9 sets duplication flag 64 to 1 if any of the obtained entries has duplication flag 64 set to 1 and to 0 otherwise, sets hash value 61, document URL 62 and acquisition date/time 63 to those of any one of the obtained entries, outputs the result, and proceeds to step S27.

In step S26, the partial character string generation means 8 outputs the single obtained entry as it is and proceeds to step S27.

Finally, in step S27, the partial character string generation means 8 determines whether all of the partial character string table pieces to be merged are empty. If they are, the process ends; otherwise, it returns to step S21 and repeats.
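The merge of steps S21 to S27 amounts to a k-way merge of hash-sorted pieces. The sketch below uses Python's `heapq.merge` and `itertools.groupby` in place of the explicit "take the minimum first entry" loop; it assumes each piece is an in-memory list of `(hash, url, fetched, dup)` tuples sorted by hash, which is a simplification of pieces stored on disk. The name `merge_pieces` is hypothetical.

```python
import heapq
from itertools import groupby

def merge_pieces(pieces):
    """Merge hash-sorted lists of (hash, url, fetched, dup) tuples into one table."""
    merged = []
    stream = heapq.merge(*pieces, key=lambda e: e[0])
    # S21: groupby yields, in turn, all entries that share the current minimum hash.
    for h, group in groupby(stream, key=lambda e: e[0]):
        entries = list(group)
        docs = {(url, fetched) for _, url, fetched, _ in entries}
        if len(docs) > 1:
            # S24: several documents share the hash; keep the oldest, flag as duplicate.
            _, url, fetched, _ = min(entries, key=lambda e: e[2])
            merged.append((h, url, fetched, 1))
        else:
            # S25 / S26: one document only; carry over a duplication flag if any is set.
            _, url, fetched, _ = entries[0]
            merged.append((h, url, fetched, max(e[3] for e in entries)))
    return merged

pieces = [
    [(1, "a", 10, 0), (5, "a", 10, 0)],
    [(1, "b", 5, 0), (7, "b", 5, 1)],
    [(5, "a", 10, 0)],
]
table = merge_pieces(pieces)
```

Because every input is already sorted by hash, the merge reads each piece sequentially, which is the property that makes the on-disk version practical.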

Note that, when the number of partial character string table pieces (f) is very large, the process of FIG. 9 can also be applied in stages so that the number of table pieces decreases gradually: a predetermined number of table pieces (for example, 16) are merged into an intermediate table piece, and the intermediate table pieces are then merged in turn. This suppresses degradation of input/output performance on the magnetic disk device 13.
This concludes the detailed description of the operation of step S1 in FIG. 6.

Next, the detailed operation of step S2 in FIG. 6 is described with reference to a flowchart.
FIG. 10 is a flowchart showing the details of step S2.

First, steps S31 and S32 are repeated so that all entries of the partial character string table 6 are processed in order.
To begin, in step S31, the unnecessary document removal means 3 uses the unnecessary document detection means 10 to select one unprocessed entry of the partial character string table 6 and adds 1 to the unique partial character string count 74 of the document attribute table 7 entry designated by document URL 62 and acquisition date/time 63 of the selected entry. Here, even when the duplication flag of the entry is set to 1, the partial character string corresponding to the entry is treated as a unique partial character string of the document with the oldest acquisition date/time among the documents that share it. This is because, if every document containing the duplicate were made a removal target, the quoted source document would also be removed, and its content would be regarded as never having existed.

When the document set to be analyzed was collected without the document acquisition means 2 and therefore lacks the acquisition date/time information, the date and time at which each document was created or updated may be used instead: among the documents sharing a partial character string, the document in which the string was first produced may treat the partial character string corresponding to the entry as a unique partial character string.

Next, in step S32, the unnecessary document detection means 10 checks whether unprocessed entries remain. If all entries of the partial character string table 6 have been processed, the process proceeds to step S33; otherwise, it returns to step S31.

Subsequently, steps S33 and S34 are repeated so that all entries of the document attribute table 7 are processed in order.
To begin, in step S33, the unnecessary document detection means 10 selects one unprocessed entry of the document attribute table 7 and, if the ratio of the unique partial character string count 74 to the partial character string count 73 is at or above a predetermined value (for example, 60% or more), sets removal flag 75 of that entry to 0.

Next, in step S34, the unnecessary document detection means 10 checks whether unprocessed entries remain. If all entries of the document attribute table 7 have been processed, the process ends; otherwise, it returns to step S33.
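Steps S31 to S34 can be sketched as two passes over in-memory tables. The data layout here is an assumption made for illustration: the partial character string table is a list of `(hash, url, fetched, dup)` tuples, and the document attribute table is a dict keyed by `(url, fetched)`. The function name `flag_documents` is hypothetical; the 60% threshold is the example value given in the description.

```python
def flag_documents(substring_table, attributes, threshold=0.6):
    """Count unique partial strings per document, then clear the removal flag
    of every document whose unique ratio reaches the threshold."""
    # S31-S32: each table entry is credited to the (oldest) document it names,
    # regardless of its duplication flag.
    for _, url, fetched, _ in substring_table:
        attributes[(url, fetched)]["unique"] += 1
    # S33-S34: documents start with remove = 1 and are cleared if original enough.
    for attr in attributes.values():
        if attr["total"] > 0 and attr["unique"] / attr["total"] >= threshold:
            attr["remove"] = 0
    return attributes

attributes = {
    ("u1", 1): {"total": 5, "unique": 0, "remove": 1},
    ("u2", 2): {"total": 5, "unique": 0, "remove": 1},
}
substring_table = [
    (10, "u1", 1, 0), (11, "u1", 1, 1), (12, "u1", 1, 0), (13, "u1", 1, 0),
    (14, "u2", 2, 0),
]
flag_documents(substring_table, attributes)
```

In this example document u1 owns four of its five partial character strings and is kept, while u2 owns only one of five and stays flagged for removal.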

The above description assumes that the document attribute table 7 resides in the RAM 12 and supports efficient random access. If the data volume of the document attribute table 7 exceeds the capacity of the RAM 12, the following approach can be used. In step S13 of FIG. 7, each new entry is appended to the end of the document attribute table 7 on the magnetic disk device 13, and before the process of FIG. 10 the entries of the document attribute table 7 are sorted by document URL 71 and acquisition date/time 72 to form an input-side document attribute table. Likewise, the entries of the partial character string table 6 are sorted by document URL 62 and acquisition date/time 63.

In step S31, the entries of the input-side document attribute table are processed from the beginning. As long as entries whose document URL and acquisition date/time match are read from the partial character string table 6, 1 is added to the unique partial character string count 74 of the current document attribute table entry. When an entry whose document URL and acquisition date/time do not match is read from the partial character string table 6, the current document attribute table entry is appended to the output document attribute table.

Also, although the above description treats a partial character string as a sequence of k words, a partial character string may instead be a sequence of k' characters, without word segmentation by morphological analysis.

As described above, according to Embodiment 1 of the present invention, partial character strings are generated for each document from the character strings contained in a plurality of documents, and the proportion of unique partial character strings specific to each document is determined. Documents with a low proportion of unique partial character strings are detected as low-value documents consisting mainly of quotations from one or more other documents, and making these documents removable yields a document set suitable for statistical processing.

In addition, treating a partial character string that appears in multiple documents as a unique partial character string of the document with the oldest acquisition date/time prevents the removal of the document that is most likely to be the quoted source.

Furthermore, allowing partial character strings that overlap with other documents up to a certain proportion prevents the removal of documents that quote legitimately and add their own original description.

Embodiment 2.
Embodiment 1 above described a document processing apparatus that obtains a document set suitable for statistical processing by removing documents with a low proportion of unique partial character strings as low-value documents consisting mainly of quotations from one or more other documents. Embodiment 2 describes a document processing apparatus that, when documents are repeatedly acquired from the same document URL, removes from the acquisition targets those document URLs that are unlikely to yield statistically useful documents, thereby avoiding the acquisition of unnecessary documents and making document acquisition from external servers more efficient.

FIG. 11 is a configuration diagram of a document processing apparatus according to the second embodiment of the present invention.
In FIG. 11, the components from the document processing apparatus 1 through the unnecessary document detection means 10 correspond to the components with the same reference numerals in FIG. 1; the unnecessary URL removal means 14 is the part added in Embodiment 2 relative to Embodiment 1. The unnecessary URL removal means 14 operates after the unnecessary document removal means 3.

FIG. 12 is a flowchart showing the operation of the unnecessary URL removal means 14.
The unnecessary URL removal means 14 processes all entries of the document attribute table 7 in order, grouping together the entries that have the same document URL 71.

First, in step S41, the unnecessary URL removal means 14 selects one unprocessed document URL 71 from the document attribute table 7 and obtains all entries corresponding to that document URL 71. If the number of obtained entries exceeds a predetermined value, and the proportion of obtained entries whose removal flag 75 is set to 1 (the removal rate) also exceeds a predetermined value, the document URL is considered likely to yield only duplicate documents. The document URL is therefore deleted from the document URL list 4 so that documents are no longer acquired from it from the next time onward.

Subsequently, in step S42, the unnecessary URL removal means 14 determines whether all document URLs in the document attribute table 7 have been processed. If not, the process returns to step S41; if so, the process ends.

Here the removal rate is calculated over multiple versions of the same document URL. Alternatively, the removal rate may be calculated for each host name in the URLs, or for each upper-level domain name (for example, foo.com when the URL is http://blog.foo.com/), and a group of URLs that produces many duplicate documents may be removed from the document URL list 4 all at once.
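The removal-rate test of step S41, together with the per-host variant mentioned above, can be sketched as follows. The description gives no concrete thresholds, so `min_entries` and `max_rate` are hypothetical parameters, as are the function name `keys_to_drop` and the simplified layout in which the document attribute table maps `(url, fetched)` to a removal flag. Host extraction uses the standard library's `urllib.parse.urlsplit`.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def keys_to_drop(attributes, min_entries, max_rate, key=lambda url: url):
    """Return the grouping keys (URLs by default) whose removal rate is too high.

    attributes maps (url, fetched) to the removal flag (0 or 1) of that version.
    """
    flags = defaultdict(list)
    for (url, _fetched), remove in attributes.items():
        flags[key(url)].append(remove)
    return {k for k, f in flags.items()
            if len(f) > min_entries and sum(f) / len(f) > max_rate}

attributes = {
    ("http://blog.foo.com/a", 1): 1,
    ("http://blog.foo.com/a", 2): 1,
    ("http://blog.foo.com/a", 3): 1,
    ("http://blog.foo.com/b", 1): 0,
    ("http://bar.example/x", 1): 1,
}
per_url = keys_to_drop(attributes, min_entries=2, max_rate=0.5)
per_host = keys_to_drop(attributes, min_entries=2, max_rate=0.5,
                        key=lambda url: urlsplit(url).hostname)
```

Requiring a minimum number of entries keeps a URL from being dropped on the evidence of a single flagged version.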

As described above, according to Embodiment 2 of the present invention, when documents are repeatedly acquired from the same document URL, document URLs that are unlikely to yield statistically useful documents are removed from the acquisition targets. This avoids the acquisition of unnecessary documents and makes document acquisition from external servers more efficient.

Embodiment 3.
Embodiment 2 above described a document processing apparatus that avoids the acquisition of unnecessary documents and makes document acquisition from external servers more efficient by removing from the acquisition targets those document URLs that are unlikely to yield statistically useful documents. Embodiment 3 describes a document processing apparatus that obtains a document set suitable for statistical processing by detecting and removing documents that contain an unnaturally large number of frequently occurring phrases.

FIG. 13 is a configuration diagram of a document processing apparatus according to the third embodiment of the present invention.
In FIG. 13, the components from the document processing apparatus 1 through the unnecessary document detection means 10 correspond to the components with the same reference numerals in FIG. 1; the longest frequent partial character string table 15 is the part added in Embodiment 3 relative to Embodiment 1.

First, the storage format of the longest frequent partial character string table 15 is described in detail. Whereas the partial character string table 6 uses sequences of a fixed number k of words as partial character strings, the longest frequent partial character string table 15 contains partial character strings ranging from a single word to a sequence of Q words (for example, Q = 10). If a partial character string that appears in at least a predetermined number of documents (for example, 500 or more), among the partial character strings contained in the document content 53 of the acquired document data 5, is defined as a frequent partial character string, then the longest frequent partial character string table 15 is a table that stores the longest of the frequent partial character strings.

In the following, such a longest frequent partial character string is called a longest frequent q-word string (where q is the number of words). A longest frequent q-word string is a sequence of q words (a q-word string) that appears in at least the predetermined number of documents and for which no longest frequent (q+1)-word string containing it exists.

To generate such a longest frequent partial character string table 15, a frequent q-word-string table is used that stores candidates for the longest frequent partial character strings as q-word strings.
FIG. 14 shows a frequent q-word-string table 90 that stores q-word strings for the case q = 3.
The frequent q-word-string table 90 consists of a word string 91 and an occurrence count 92. It is placed in the RAM 12 and has a structure that can be searched using the word string 91 as a unique key. The occurrence count 92 holds the number of documents in which the corresponding word string 91 appears.

Next, the operation of the unnecessary document removal means 3 using the longest frequent partial character string table 15 is described with reference to a flowchart.
FIG. 15 is a flowchart showing the part of the operation of the unnecessary document removal means 3 that Embodiment 3 adds to Embodiment 1.
Steps S3 to S5 are executed after steps S1 and S2 of FIG. 6.

First, in step S3, the unnecessary document removal means 3 uses the unnecessary document detection means 10 to determine, by a process described later and in the order q = 1, 2, 3, ..., Q, the number of documents in which each q-word string (a single word when q = 1) contained in the document content 53 of the acquired document data 5 appears, and stores the q-word strings that appear in at least a predetermined number of documents in the frequent q-word-string table 90.

Subsequently, in step S4, the unnecessary document detection means 10 takes, in the order q = Q, Q-1, ..., 2, 1, each entry of the frequent q-word-string table and deletes from the corresponding frequent word-string tables the entries for every partial character string contained in that q-word string, that is, its (q-1)-word strings, (q-2)-word strings, ..., and single words. The entries that remain in each frequent q-word-string table are the longest frequent q-word strings, and together they form the longest frequent partial character string table.

Finally, in step S5, the unnecessary document detection means 10 refers to the longest frequent partial character string table generated in step S4 and determines, for each document, how many different longest frequent partial character strings occur together in the document content 53 of the acquired document data 5. For every document that contains at least a predetermined number of different longest frequent partial character strings (for example, seven or more), it sets removal flag 75 to 1 in the corresponding entry of the document attribute table 7.
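Steps S4 and S5 can be sketched as follows, assuming the frequent q-word strings are already available as a dict that maps each q from 1 to Q to a set of word tuples. Both function names and this data layout are hypothetical, chosen only for illustration.

```python
def longest_frequent(frequent):
    """S4: drop every frequent string that is contained in a longer frequent string.

    frequent maps q (1..Q) to a set of q-word tuples.
    """
    longest = {q: set(grams) for q, grams in frequent.items()}
    for q in sorted(frequent, reverse=True):            # q = Q, Q-1, ..., 1
        for gram in frequent[q]:
            for size in range(1, q):                    # every shorter contiguous run
                for start in range(q - size + 1):
                    longest[size].discard(gram[start:start + size])
    return set().union(*longest.values())

def count_kinds(words, longest):
    """S5: number of different longest frequent strings occurring in one document."""
    max_q = max((len(g) for g in longest), default=0)
    found = set()
    for q in range(1, max_q + 1):
        for i in range(len(words) - q + 1):
            gram = tuple(words[i:i + q])
            if gram in longest:
                found.add(gram)
    return len(found)

frequent = {
    1: {("a",), ("b",), ("c",), ("z",)},
    2: {("a", "b"), ("b", "c")},
    3: {("a", "b", "c")},
}
longest = longest_frequent(frequent)
kinds = count_kinds(["x", "a", "b", "c", "z"], longest)
```

A document would then have its removal flag set to 1 when `kinds` reaches the chosen limit, seven in the example from the description.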

Next, the operation of step S3 in FIG. 15 is described with reference to a flowchart.
FIG. 16 is a flowchart showing the details of the part of step S3 that corresponds to one particular value of q (q > 1).

First, in step S51, the unnecessary document removal means 3 selects an unprocessed document d from the acquired document data 5 and obtains the sequence of words W_1, W_2, ..., W_{n_d} that makes up its document content 53. This can be done either by repeating the same morphological analysis as in step S12 of FIG. 7 or by saving and reusing the result of the earlier execution of step S12.

Next, in step S52, the unnecessary document detection means 10 takes up an unprocessed q-word string (W_i, W_{i+1}, ..., W_{i+q-1}) from that word sequence as the processing target.

Next, in step S53, the unnecessary document detection means 10 determines whether all q words W_j (j = i, i+1, ..., i+q-1) are present in the frequent word table (the frequent 1-word-string table) 90. If they are, the process proceeds to step S54; otherwise, the q-word string cannot be frequent, so the process skips to step S56.

Subsequently, in step S54, the unnecessary document detection means 10 determines whether both the (q-1)-word string obtained by dropping the rightmost word of the q-word string, (W_i, W_{i+1}, ..., W_{i+q-2}), and the (q-1)-word string obtained by dropping the leftmost word, (W_{i+1}, ..., W_{i+q-1}), are present in the frequent (q-1)-word-string table 90. If both are present, the process proceeds to step S55; otherwise, the q-word string cannot be frequent, so the process skips to step S56.

次に、ステップＳ５５において、不要文書検出手段１０は、当該ｑ単語列（Ｗｉ、Ｗｉ＋１、．．．、Ｗｉ＋ｑ−１）に対応する頻出ｑ単語列テーブルのエントリにおいて、出現回数９２に１を加える。 Next, in step S55, the unnecessary document detection means 10 adds 1 to the appearance count 92 in the frequent q word string table entry corresponding to the q word string (Wi, Wi + 1,..., Wi + q−1). .

次に、ステップＳ５６において、不要文書検出手段１０は、文書ｄの全てのｑ単語列が全て処理されたか判定し、処理されていれば、ステップＳ５７に進み、そうでなければ、ステップＳ５２に戻って処理を繰り返す。 Next, in step S56, the unnecessary document detection unit 10 determines whether all the q word strings of the document d have been processed. If they have been processed, the process proceeds to step S57. Otherwise, the process returns to step S52. Repeat the process.

続いて、ステップＳ５７において、不要文書検出手段１０は、取得文書データ７の全ての文書が全て処理されたか判定し、処理されていれば、ステップＳ５８に進み、そうでなければ、ステップＳ５１に戻って処理を繰り返す。 Subsequently, in step S57, the unnecessary document detection unit 10 determines whether all the documents of the acquired document data 7 have been processed. If all the documents have been processed, the process proceeds to step S58. If not, the process returns to step S51. Repeat the process.

Finally, in step S58, the unnecessary document detection means 10 examines each entry of the frequent q-word-string table 90 in turn and deletes every entry whose occurrence count 92 is less than a predetermined value. The entries that remain are the frequent q-word strings.
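
The counting loop of steps S53 to S58 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `count_frequent_q_strings`, its parameters, and the use of in-memory sets and a `Counter` in place of table 90 are assumptions. It also assumes that each document has already been split into a list of words and that an entry is created the first time a q-word string passes both checks.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Set, Tuple

WordString = Tuple[str, ...]


def count_frequent_q_strings(
    docs: Iterable[Sequence[str]],
    q: int,
    frequent_words: Set[str],
    frequent_prev: Set[WordString],
    min_count: int,
) -> Dict[WordString, int]:
    """Return the q-word strings that occur at least min_count times.

    frequent_words plays the role of the frequent 1-word-string table and
    frequent_prev the role of the frequent (q-1)-word-string table
    (hypothetical in-memory stand-ins for table 90).
    """
    counts = Counter()
    for words in docs:                        # outer loop, closed by step S57
        for i in range(len(words) - q + 1):   # inner loop, closed by step S56
            gram = tuple(words[i:i + q])
            # Step S53: every word of the q-word string must be frequent.
            if not all(w in frequent_words for w in gram):
                continue
            # Step S54: the strings without the right end and without the
            # left end must both be frequent (q-1)-word strings.
            if gram[:-1] not in frequent_prev or gram[1:] not in frequent_prev:
                continue
            # Step S55: add 1 to the occurrence count of this q-word string.
            counts[gram] += 1
    # Step S58: delete entries whose count is below the threshold.
    return {g: c for g, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    docs = [
        ["cheap", "pills", "online", "now"],
        ["buy", "cheap", "pills", "online"],
        ["cheap", "pills", "today"],
    ]
    words = {"cheap", "pills", "online"}
    prev = {(w,) for w in words}  # for q = 2, the (q-1)-word strings are single words
    print(count_frequent_q_strings(docs, 2, words, prev, min_count=2))
```

The two early `continue` statements correspond to the skips to step S56: a q-word string that contains an infrequent word or an infrequent (q-1)-word string is never counted, which keeps the table small.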

As described above, according to Embodiment 3 of the present invention, documents that contain an unnaturally large number of frequently occurring phrases are detected and removed, which yields a document set suitable for statistical processing.
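
As a rough illustration of this filtering, the following Python sketch flags a document when it contains at least a given number of distinct frequent word strings. The criterion (counting kinds of frequent strings against a threshold), the function names `is_unnecessary` and `filter_documents`, and the parameter `min_kinds` are assumptions made for the example, not details taken from the embodiment.

```python
from typing import List, Sequence, Set, Tuple

WordString = Tuple[str, ...]


def is_unnecessary(
    words: Sequence[str],
    frequent: Set[WordString],
    q: int,
    min_kinds: int,
) -> bool:
    """True if the document contains min_kinds or more distinct frequent q-word strings."""
    grams = {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}
    return len(grams & frequent) >= min_kinds


def filter_documents(
    docs: List[Sequence[str]],
    frequent: Set[WordString],
    q: int,
    min_kinds: int,
) -> List[Sequence[str]]:
    """Keep only the documents that are not flagged as unnecessary."""
    return [d for d in docs if not is_unnecessary(d, frequent, q, min_kinds)]


if __name__ == "__main__":
    frequent = {("cheap", "pills"), ("pills", "online")}
    docs = [
        ["buy", "cheap", "pills", "online"],  # two kinds of frequent strings
        ["cheap", "pills", "today"],          # one kind
    ]
    print(filter_documents(docs, frequent, q=2, min_kinds=2))
```

Counting distinct strings rather than total occurrences means that a document repeating a single popular phrase is treated differently from one that packs in many different popular phrases. Which of the two behaviours is wanted depends on the threshold design.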

In addition, an author who automatically generates spam blogs can easily identify such frequently occurring phrases as topic words by using, for example, the search query rankings that search engines publish on the Internet. To automatically generate Web pages that search engine users are likely to find, such an author may use these topic words as the search target keywords set in the Web page. In that case, the body text of the Web page may consist of random character strings instead of automatic quotations from other documents. Even in such a case, Embodiment 3 of the present invention makes it possible to remove these automatically generated documents by using the frequent words (topic words) set as search target keywords as a clue.

Description of symbols: 1 document processing apparatus, 2 document acquisition means, 3 unnecessary document removal means, 4 document URL list, 5 acquired document data, 6 partial character string table, 7 document attribute table, 8 partial character string generation means, 9 unique partial character string determination means, 10 unnecessary document detection means, 11 CPU, 12 RAM, 13 magnetic disk device, 14 unnecessary URL removal means, 15 longest frequent partial character string table.

Claims

A document processing apparatus comprising:
partial character string generating means for generating, for each document, partial character strings each forming part of a character string included in a plurality of documents;
unique partial character string determination means for determining, as a unique partial character string, any partial character string generated by the partial character string generating means that is not included in any document other than the document from which it was generated; and
unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

The document processing apparatus according to claim 1, further comprising a partial character string table that stores, in association with one another, each partial character string generated by the partial character string generating means, identification information of the document in which the partial character string was first generated, and a duplication flag indicating whether the partial character string is duplicated between different documents,
wherein, when the duplication flag stored in the partial character string table indicates duplication, the unique partial character string determination means determines that the partial character string is a unique partial character string in the document in which it was first generated and is not a unique partial character string in the other documents.

The document processing apparatus according to claim 1, further comprising:
document acquisition means for acquiring, via a network, a document on an external server specified by a URL (Uniform Resource Locator), and storing the acquired document data together with the URL and the acquisition date and time; and
a partial character string table that stores, in association with one another, each partial character string generated by the partial character string generating means, the URL of the document with the oldest acquisition date and time among the documents including the partial character string, and a duplication flag indicating whether the partial character string is duplicated between documents that differ in acquisition date and time,
wherein, when the duplication flag stored in the partial character string table indicates duplication, the unique partial character string determination means determines that the partial character string is a unique partial character string in the document with the oldest acquisition date and time and is not a unique partial character string in the other documents.

The document processing apparatus according to claim 3, further comprising:
a document URL list that stores, as acquisition target URLs, the URLs from which the document acquisition means acquires documents; and
unnecessary URL removing means for deleting acquisition target URLs stored in the document URL list based on the URLs of the unnecessary documents detected by the unnecessary document detection means.

The document processing apparatus according to claim 4, wherein, when a document detected by the unnecessary document detection means is repeatedly detected as an unnecessary document at a predetermined number or more of acquisition dates and times, the unnecessary URL removing means deletes the URL of that unnecessary document from the acquisition target URLs stored in the document URL list.

The document processing apparatus according to claim 4, wherein, when a predetermined ratio or more of the documents whose URLs share a host name or higher-level domain name with the URL of an unnecessary document detected by the unnecessary document detection means are detected as unnecessary documents, the unnecessary URL removing means deletes all URLs having that host name or higher-level domain name from the acquisition target URLs stored in the document URL list.

The document processing apparatus according to claim 2, wherein the partial character string generating means stores the partial character string table in a RAM (Random Access Memory); when the partial character string table reaches a predetermined capacity corresponding to the capacity of the RAM, writes the partial character string table stored in the RAM to a magnetic disk device as a fragment of the partial character string table and empties the partial character string table stored in the RAM; and, after the detection of unnecessary documents has been carried out for all of the acquired document data, integrates the fragments of the partial character string table stored in the RAM and the magnetic disk device based on the partial character strings.

The document processing apparatus according to claim 1, wherein the unnecessary document detection means obtains, based on the number of occurrences of partial character strings within a predetermined length, longest frequent partial character strings, each being a frequent partial character string that appears in a predetermined number or more of documents and is not included in any longer frequent partial character string, and detects, as an unnecessary document, a document that includes a predetermined number of kinds or more of the longest frequent partial character strings.

A document processing method comprising:
a partial character string generation step in which partial character string generating means generates, for each document, partial character strings each forming part of a character string included in a plurality of documents;
a unique partial character string determination step in which unique partial character string determination means determines, as a unique partial character string, any partial character string generated by the partial character string generating means that is not included in any document other than the document from which it was generated; and
an unnecessary document detection step in which unnecessary document detection means detects, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

A document processing program for causing a computer to function as:
partial character string generating means for generating, for each document, partial character strings each forming part of a character string included in a plurality of documents;
unique partial character string determination means for determining, as a unique partial character string, any partial character string generated by the partial character string generating means that is not included in any document other than the document from which it was generated; and
unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.

A computer-readable recording medium on which is recorded a document processing program for causing a computer to function as:
partial character string generating means for generating, for each document, partial character strings each forming part of a character string included in a plurality of documents;
unique partial character string determination means for determining, as a unique partial character string, any partial character string generated by the partial character string generating means that is not included in any document other than the document from which it was generated; and
unnecessary document detection means for detecting, as an unnecessary document, a document for which the ratio between the total number of partial character strings of the document and the number of unique partial character strings determined by the unique partial character string determination means is within a predetermined range.