JP2000235574A - Document processing device

Info

Publication number: JP2000235574A
Application number: JP11036890A
Authority: JP
Inventors: Atsuyuki Goto; 淳之後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1999-02-16
Filing date: 1999-02-16
Publication date: 2000-08-29

Abstract

(57)【要約】【課題】話題が混在した新聞記事などの電子化文書か
ら、話題ごとに文書を分割する。【解決手段】電子化文書を、段落に分けて段落間の関
連度を求め、この関連度を例えばマトリクスに表示し
て、任意番目の行と任意番目の列と対角成分とで囲まれ
る三角形領域とし、この三角形領域内の関連度の合計値
を求め、この合計値より分割点を求める。例えば、三角
形領域内の関連度の合計値とこの三角形の列を１辺とし
任意番目の行を１辺とする矩形領域内の関連度の合計値
を求め、これら三角形領域の合計値と矩形領域の合計値
の比を求め、この比の値に基づいて文書を分割する。 (57) [Abstract] [Problem] To divide an electronic document in which several topics are mixed, such as a newspaper article, into one part per topic. [Solution] The electronic document is divided into paragraphs and the degree of relevance between paragraphs is computed. The relevance values are arranged, for example, in a matrix, and a triangular region bounded by an arbitrary row, an arbitrary column, and the diagonal is taken. The sum of the relevance values inside this triangular region is computed, and a division point is determined from that sum. For example, the sum over the triangular region and the sum over the rectangular region that has the triangle's column as one side and the arbitrary row as another side are both computed, their ratio is taken, and the document is divided based on the value of that ratio.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、文書処理装置に関
し、より詳しくは、意味ブロック毎に文書を分割する文
書処理装置に関し、文書の要約や複数の記事などが混在
した文書を意味毎に分割する場合に適用して好適な文書
分割に関する。 [Technical field of the invention] The present invention relates to a document processing apparatus, and more particularly to a document processing apparatus that divides a document into semantic blocks. It concerns document division suited to summarizing documents and to splitting, by meaning, a document in which several articles are mixed.

【０００２】[0002]

【従来の技術】近年、パーソナルコンピュータによる文
書作成の機会の増加、インターネットの急速な普及によ
り身の回りに電子化文書が氾濫している。このような電
子化文書をすべて閲覧して、有効な情報を提供する文書
を探索することは時間的な制限により不可能に近い。 [Prior art] In recent years, the growing use of personal computers for writing documents and the rapid spread of the Internet have flooded us with electronic documents. Browsing all of these documents to find the ones that provide useful information is close to impossible given limited time.

【０００３】このような問題を解決するために、全文検
索などにより、必要な文書を検索する方法もあるが、検
索された文書をある程度、読解しなければ、検索された
文書が本当に意図した情報を提供してくれるかどうか判
断するのは難しい。 [0003] One way to address this problem is to retrieve the necessary documents by full-text search or a similar method. However, unless the retrieved documents are read to some extent, it is difficult to judge whether they really provide the intended information.

【０００４】一方、こうした状況を解決する手段とし
て、文書の要約技術がある。文書の要約を全文検索の検
索結果として表示すれば、検索された文書をいちいち開
いて内容を吟味する必要がなくなるので、検索の効率が
飛躍的に向上する。 Document summarization is one means of resolving this situation. If document summaries are displayed as the results of a full-text search, there is no need to open each retrieved document and examine its contents, so search efficiency improves dramatically.

【０００５】しかしながら、いろいろな話題が１つの紙
面に混在した新聞記事などから要約を提示するには、話
題ごとに要約を提供する必要がある。 However, to present summaries of material such as a newspaper page, where various topics are mixed on a single page, a summary must be provided for each topic.

【０００６】[0006]

【発明が解決しようとする課題】本発明は、上述のごと
き実情に鑑みてなされたもので、電子化文書を意味ブロ
ック毎に分割することを目的とするものである。 [Problem to be solved by the invention] The present invention has been made in view of the circumstances described above, and its object is to divide an electronic document into semantic blocks.

【０００７】[0007]

【課題を解決するための手段】請求項１の発明は、電子
化された文書を段落に分割し、上記段落から抽出された
キーワードに基づいて段落間の関連度を計算し、段落の
数を次元とする正方行列において、該正方行列の対角成
分を境として片側の領域の各成分に上記関連度を入れ、
該関連度を入れた前記片側領域において、任意番目の行
（又は列）と、任意番目の列（又は行）と、対角成分
と、で囲まれる三角形領域内の関連度の合計値を求め、
該関連度の合計値に基づいて文書の分割点を求めるよう
にしたものである。 [Means for solving the problem] The invention of claim 1 divides an electronic document into paragraphs; calculates the relevance between paragraphs based on keywords extracted from the paragraphs; in a square matrix whose dimension is the number of paragraphs, places the relevance values in the components of the region on one side of the diagonal; within that one-sided region, computes the sum of the relevance values inside a triangular region bounded by an arbitrary row (or column), an arbitrary column (or row), and the diagonal; and determines a document division point based on that sum.

【０００８】請求項２の発明は、請求項１に記載された
発明において、前記三角形領域内の合計値と、前記三角
形領域の列（又は行）に対応しかつ該三角形領域を除く
矩形領域内の関連度の合計値の関係から文書の分割点を
求めるようにしたものである。 The invention of claim 2 is the invention of claim 1 in which the document division point is determined from the relationship between the sum within the triangular region and the sum of the relevance values within a rectangular region that corresponds to the column (or row) of the triangular region and excludes the triangular region.

【０００９】請求項３の発明は、請求項２に記載された
発明において、前記三角形領域内の合計値と、前記矩形
領域内の関連度の合計値との比から極値を求めることに
より、文書の分割点を求めるようにしたものである。 The invention of claim 3 is the invention of claim 2 in which the document division point is determined by finding an extremum of the ratio between the sum within the triangular region and the sum of the relevance values within the rectangular region.

【００１０】請求項４の発明は、請求項１乃至３のいず
れかに記載された発明において、隣接する段落における
行（又は列）の関連度の合計をそれぞれ求め、各合計値
を比較して話題転換点を求めるようにしたものである。 The invention of claim 4 is the invention of any one of claims 1 to 3 in which the sum of the relevance values of the row (or column) is computed for each of two adjacent paragraphs, and the sums are compared to find a topic turning point.

【００１１】請求項５の発明は、請求項１乃至３のいず
れかに記載された発明において、前記分割点における行
（又は列）の関連度の合計値が所定値以下の時、該行
（又は列）を孤立段落とするようにしたものである。 The invention of claim 5 is the invention of any one of claims 1 to 3 in which, when the sum of the relevance values of the row (or column) at the division point is at or below a predetermined value, the paragraph of that row (or column) is treated as an isolated paragraph.

【００１２】[0012]

【発明の実施の形態】上述のように、本発明を使用すれ
ば、新聞記事などの複数の話題が混在した文書でも、話
題ごとに要約を提供することが可能になる。以下に、文
書要約機能を有する文書処理装置において、表１に示す
テキストの文書を複数の意味ブロックに分割する例につ
いて説明する。 [Embodiments of the invention] As described above, using the present invention makes it possible to provide a summary for each topic even for a document in which several topics are mixed, such as a newspaper article. Below, an example is described in which a document processing apparatus with a document summarization function divides the text document shown in Table 1 into several semantic blocks.

【００１３】本発明は、ある文書を関連する意味のまと
まりに分割する（意味ブロック抽出）もので、以下に説
明する意味ブロックの抽出は、文書の重要文抽出の前処
理であり、異なる話題の文章を話題の切れ目で分割する
ことを意味している。たとえば、いろいろな内容の文章
の寄せ集めである新聞の１面から重要文を抽出する際に
有効な処理である。意味ブロック抽出を行い、新聞の１
面から１つの記事を切り出し、そこから重要文を抽出し
た方が、複数の記事から直接、重要文を抽出する方より
良い結果が得られる。例えば、テキスト例に入力された
電子化文書を表１に示すように段落毎に分割する。 The present invention divides a document into groups of related meaning (semantic block extraction). The semantic block extraction described below is a preprocessing step for extracting important sentences from a document, and it means splitting text on different topics at the topic boundaries. It is effective, for example, when extracting important sentences from the front page of a newspaper, which is a collection of texts on various subjects. Performing semantic block extraction, cutting one article out of the front page, and extracting important sentences from that article gives better results than extracting important sentences directly from several articles. For example, the electronic document entered as the text example is divided into paragraphs as shown in Table 1.

【００１４】[0014]

【表１】 [Table 1]

【００１５】次の手順により意味ブロックを抽出する。１．段落関連度マトリクスの作成（１．１）．入力文書を段落毎に解析し、キーワードを
抽出する。キーワードは基本的には名詞であり、出現頻
度が高いものほどキーワード性が高いものとする。ま
た、キーワード同士の部分文字列の一致も考慮して、出
現頻度を計算する。 Semantic blocks are extracted by the following procedure. 1. Creating the paragraph relevance matrix. (1.1) The input document is analyzed paragraph by paragraph and keywords are extracted. Keywords are basically nouns, and the more frequently a noun appears, the more strongly it counts as a keyword. Partial string matches between keywords are also taken into account when calculating the frequency of appearance.
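Step (1.1) can be sketched as follows. This is only an illustration under stated assumptions: the nouns are assumed to have been extracted already (for Japanese text, by some morphological analyzer that is not shown), and the substring rule used here, where a keyword also collects the counts of longer keywords that contain it, is one possible reading of "partial string matches". The function name `keyword_frequencies` is hypothetical.

```python
from collections import Counter

def keyword_frequencies(nouns):
    """Score each noun by its frequency, also counting nouns that contain it.

    `nouns` is the list of nouns found in one paragraph (assumed input).
    The substring rule is an assumption, not the patent's exact rule.
    """
    raw = Counter(nouns)
    scores = {}
    for keyword in raw:
        # A keyword matches itself and any longer keyword that contains it.
        scores[keyword] = sum(count for other, count in raw.items()
                              if keyword in other)
    return scores

scores = keyword_frequencies(["政府", "米政府", "連邦政府", "予算"])
```

With this rule the short keyword 政府 is reinforced by 米政府 and 連邦政府, while the longer keywords keep their own counts.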

【００１６】（１．２）．段落番号をインデックスとし
た２次元配列（正方行列）を用意し、段落と段落のキー
ワードの重複度を計算して、配列要素（成分）に代入す
る。例えば、表１のテキスト例に対する段落関連度マト
リクスは表２に示すようになる。 (1.2) A two-dimensional array (square matrix) indexed by paragraph number is prepared; the degree of keyword overlap between each pair of paragraphs is calculated and assigned to the corresponding array element (component). For example, the paragraph relevance matrix for the text example in Table 1 is as shown in Table 2.
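A minimal sketch of step (1.2) follows. The text does not give the formula for the degree of keyword overlap, so the measure used here (the sum of the smaller of the two counts over shared keywords) is an assumption, as are the names `overlap` and `relevance_matrix`. Only the upper triangle is filled, in line with the symmetric treatment of R(n, m) described in the text.

```python
from collections import Counter

def overlap(a: Counter, b: Counter) -> int:
    """Assumed overlap measure: sum of min counts over shared keywords."""
    return sum((a & b).values())

def relevance_matrix(paragraph_keywords):
    """Build an n x n matrix with relevance values above the diagonal only."""
    n = len(paragraph_keywords)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle, diagonal excluded
            matrix[i][j] = overlap(paragraph_keywords[i], paragraph_keywords[j])
    return matrix

paragraphs = [Counter({"a": 2, "b": 1}), Counter({"a": 1}), Counter({"c": 1})]
m = relevance_matrix(paragraphs)
```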

【００１７】[0017]

【表２】 [Table 2]

【００１８】図１は、本発明による段落関連マトリクス
を説明するための概要図で、以下、表２と共に説明す
る。意味ブロック抽出に使用するのは、２次元配列（正
方行列）の上半分の三角形領域であり、対角成分は使用
しない。つまり、段落番号（ｎ）と段落番号（ｍ）の関
連度Ｒ（ｎ，ｍ）の計算に対して、段落番号の並びを考
慮していない。すなわち、次の理由（ａ），（ｂ）によ
り、Ｒ（ｎ，ｍ）とＲ（ｍ，ｎ）の値は同じであるとみ
なしている。（ａ）不必要な計算をしない。Ｒ（ｎ，ｍ）とＲ（ｍ，
ｎ）の値を異なるものとしたら、段落関連度マトリクス
の下半分を新たに計算しなければならなくなる。（ｂ）意味ブロック抽出において、Ｒ（ｎ，ｍ）とＲ
（ｍ，ｎ）の値を別にする論理的根拠がない。Ｒ（ｎ，
ｍ）は、別な見方をすれば、意味空間上における段落ｎ
と段落ｍの一種の距離であるとみなせる。従って、段落
ｎから段落ｍを計った距離と段落ｍから段落ｎを計った
距離は同じであるべきであり、そのように定義されなけ
れば、意味ブロック抽出過程において不都合が生じる。 FIG. 1 is a schematic diagram for explaining the paragraph relevance matrix according to the present invention, and is described below together with Table 2. Semantic block extraction uses the upper triangular half of the two-dimensional array (square matrix); the diagonal components are not used. In other words, the calculation of the relevance R(n, m) between paragraph n and paragraph m does not take the order of the paragraph numbers into account: R(n, m) and R(m, n) are regarded as equal, for the following reasons (a) and (b). (a) It avoids unnecessary calculation. If R(n, m) and R(m, n) were allowed to differ, the lower half of the paragraph relevance matrix would also have to be calculated. (b) In semantic block extraction there is no logical basis for distinguishing R(n, m) from R(m, n). Seen another way, R(n, m) can be regarded as a kind of distance between paragraph n and paragraph m in a semantic space. The distance measured from paragraph n to paragraph m should therefore equal the distance measured from paragraph m to paragraph n, and unless it is defined that way, problems arise in the semantic block extraction process.

【００１９】２．段落関連度マトリクスから三角形（対
角線上の成分を含まない）を切り出す三角形が意味ブロックを表す。すなわち、三角形を切り
出すことが意味ブロックの抽出を意味する。表２の段落
関連度マトリクスには、キーワードノイズがあるため意
味ブロックである三角形がないように見えるが、以下に
説明する方式によれば、三角形が切り出せる。 2. Cutting triangles (excluding the components on the diagonal) out of the paragraph relevance matrix. A triangle represents a semantic block; that is, cutting out a triangle amounts to extracting a semantic block. Because of keyword noise, the paragraph relevance matrix in Table 2 does not appear to contain any triangles corresponding to semantic blocks, but the method described below makes it possible to cut them out.

【００２０】（２．１）．面積比の計算段落関連度マトリクス（ｎ×ｎ）において、上述の如く
片側領域（三角形の領域）において、第１行と、任意番
目の列（ｉ列）と、対角線とで囲まれる三角形内の関連
度の総和をＤ_i（三角形の面積内の関連度総和）、前記
三角形の列に対応しかつ該三角形を除く矩形領域内の関
連度の総和をＲ_i（矩形の面積内の関連度総和）とす
る。マトリクス上のすべての段落番号ｉに対して、Ｄ_i
とＲ_iを計算する。この結果、Ｄ₁，Ｄ₂，…，Ｄ_n，
Ｒ₁，Ｒ₂，…，Ｒ_nが求まる。 (2.1) Calculating the area ratio. In the n × n paragraph relevance matrix, within the one-sided (triangular) region described above, let $D_i$ be the sum of the relevance values inside the triangle bounded by the first row, an arbitrary column (column i), and the diagonal (the relevance sum over the triangle's area), and let $R_i$ be the sum of the relevance values inside the rectangular region that corresponds to the column of that triangle and excludes the triangle (the relevance sum over the rectangle's area). $D_i$ and $R_i$ are calculated for every paragraph number i in the matrix, giving $D_1, D_2, \ldots, D_n$ and $R_1, R_2, \ldots, R_n$.
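The sums in (2.1) can be sketched as below. The region boundaries are an interpretation: $D_i$ is taken as the sum over pairs of paragraphs both within 1..i, and $R_i$ as the sum over pairs with one paragraph in 1..i and the other in i+1..n, which agrees with the later statement that the ratio is infinite at the first paragraph and 0 at the last. The function name and the sample matrix are hypothetical, and the naive summation is for clarity, not speed.

```python
def triangle_and_rectangle_sums(matrix):
    """Return (D, R): lists of D_i and R_i for i = 1..n (list index i - 1).

    `matrix` holds relevance values above the diagonal (0-based indices).
    """
    n = len(matrix)
    d_sums, r_sums = [], []
    for i in range(1, n + 1):
        # Triangle: pairs (a, b) with a < b, both among the first i paragraphs.
        d_sums.append(sum(matrix[a][b] for b in range(i) for a in range(b)))
        # Rectangle: rows of the first i paragraphs, columns of the remaining ones.
        r_sums.append(sum(matrix[a][b] for a in range(i) for b in range(i, n)))
    return d_sums, r_sums

# Two blocks (paragraphs 1-2 and 3-4) with one noise entry between 1 and 3.
sample = [
    [0, 5, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 4],
    [0, 0, 0, 0],
]
d, r = triangle_and_rectangle_sums(sample)
```

For this sample the sums are D = [0, 5, 6, 10] and R = [6, 1, 4, 0], so the ratio $R_i / D_i$ dips at paragraph 2, the end of the first block.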

【００２１】（２．２）．極小点の探索すべてのＤ_iとＲ_iに対してＲ_i／Ｄ_iを計算し、式（１）
を満たす点を探す。Ｒ_i／Ｄ_i≧Ｒ_i+1／Ｄ_i+1＜Ｒ_i+2／Ｄ_i+2 …式（１）この時、ｉ＋１番目の段落が分割候補点になる。なお、
比を逆にとれば、分割候補点となる極値は、極大点とな
る。 (2.2) Searching for minimum points. $R_i / D_i$ is calculated for every pair $D_i$, $R_i$, and points satisfying Equation (1) are sought: $R_i / D_i \ge R_{i+1} / D_{i+1} < R_{i+2} / D_{i+2}$ ... (1). The (i+1)-th paragraph is then a division candidate point. If the ratio is taken the other way round, the extremum that marks a division candidate is a maximum instead.
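The search in (2.2) can be sketched as follows, assuming the sums $D_i$ and $R_i$ are already available as lists. Treating the ratio as infinite when $D_i$ is 0 follows the later remark that the ratio is infinite where the triangle's area is 0; the function name `split_candidates` is hypothetical.

```python
import math

def split_candidates(d_sums, r_sums):
    """Return 1-based paragraph numbers that satisfy Equation (1)."""
    ratio = [r / d if d > 0 else math.inf for d, r in zip(d_sums, r_sums)]
    # ratio[k] belongs to paragraph k + 1; a candidate is a local minimum.
    return [k + 1 for k in range(1, len(ratio) - 1)
            if ratio[k - 1] >= ratio[k] < ratio[k + 1]]

candidates = split_candidates([0, 5, 6, 10], [6, 1, 4, 0])
```

Here the ratios are inf, 0.2, 0.667, 0, so paragraph 2 is the only candidate.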

【００２２】（２．３）．話題転換係数式（１）で求めた段落ｉに対して、式（２）の計算をす
る。（Ｒ_i+2／Ｄ_i+2）／（Ｒ_i+1／Ｄ_i+1） …式（２）この値は、意味ブロックの話題の転換の度合いを表して
いる。仮に、この値を話題転換係数と呼ぶことにする。
話題転換係数が１に十分近い場合は、式（１）で求めた
分割点により区切られて出来る２つの意味ブロックは互
いに似たような話題を題材にしていることを意味してい
る。 (2.3) Topic conversion coefficient. For the paragraph i found with Equation (1), Equation (2) is calculated: $(R_{i+2} / D_{i+2}) / (R_{i+1} / D_{i+1})$ ... (2). This value expresses the degree to which the topic changes between semantic blocks, and is provisionally called the topic conversion coefficient. When the coefficient is sufficiently close to 1, the two semantic blocks produced by the division point from Equation (1) deal with similar topics.

【００２３】話題転換係数が１から離れるほど、２つの
意味ブロックは、異なる話題を題材にしていることにな
る。従って、話題転換係数が１より明らかに大きい場合
は、段落ｉ＋１は、意味の切れ目になる。つまり、段落
番号が１〜ｉ＋１の意味ブロックと段落番号がｉ＋２〜
ｎの意味ブロックに分かれる。 The further the topic conversion coefficient is from 1, the more the two semantic blocks deal with different topics. Therefore, when the coefficient is clearly greater than 1, paragraph i+1 is a semantic boundary: the document splits into a semantic block of paragraphs 1 to i+1 and a semantic block of paragraphs i+2 to n.

【００２４】（２．４）．上記話題転換係数というの
は、図２，図３の曲線の極小点における曲線の傾き（微
分係数）を意味する。Ｒ_iの値が急に大きくなり、逆に
Ｄ_iの値が急に小さくなると極小点の曲線の傾きが大き
くなる。つまり極小点Ｐを境として話題の内容が大きく
異なることを意味する。話題の転換があいまいに推移す
ると（論旨があいまいな文章や小説など）、極小点が存
在しても、話題転換係数が１に近くなり、意味ブロック
としての三角形が切り出せなくなる。 (2.4) The topic conversion coefficient corresponds to the slope (differential coefficient) of the curves in FIGS. 2 and 3 at the minimum point. When the value of $R_i$ rises sharply and, conversely, the value of $D_i$ falls sharply, the slope of the curve at the minimum point becomes large; that is, the content of the topic differs greatly on either side of the minimum point P. When the topic shifts gradually (as in texts with a vague line of argument, or novels), the topic conversion coefficient is close to 1 even if a minimum point exists, and the triangle can no longer be cut out as a semantic block.

【００２５】このような場合、話題転換係数の値を適切
に決めることにより、適切な意味ブロックを抽出できる
ようになる。実験の結果、話題転換係数を１．１に設定
しても、良好な結果を得ることができた。話題転換係数
の導入により、誤った意味ブロックの抽出を防止でき
る。 In such cases, suitable semantic blocks can be extracted by choosing an appropriate value for the topic conversion coefficient. In experiments, good results were obtained even with the topic conversion coefficient set to 1.1. Introducing the topic conversion coefficient prevents erroneous semantic blocks from being extracted.
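The topic conversion coefficient of Equation (2) and its use as a filter can be sketched as below. The default threshold 1.1 is the experimental value mentioned in the text; treating it as a strict lower bound, and the handling of a zero ratio at the candidate point, are assumptions. Both function names are hypothetical.

```python
import math

def topic_conversion_coefficient(ratio_at_candidate, ratio_after):
    """Equation (2): the ratio just after the candidate over the ratio at it."""
    if ratio_at_candidate == 0:
        # Assumed convention: a rise from exactly 0 counts as an unbounded change.
        return math.inf if ratio_after > 0 else 1.0
    return ratio_after / ratio_at_candidate

def is_topic_boundary(ratio_at_candidate, ratio_after, threshold=1.1):
    """Accept a candidate only when the coefficient exceeds the threshold."""
    return topic_conversion_coefficient(ratio_at_candidate, ratio_after) > threshold

sharp = is_topic_boundary(0.2, 4 / 6)   # coefficient is about 3.3
gentle = is_topic_boundary(1.0, 1.05)   # coefficient is 1.05
```

A candidate whose ratio barely rises afterwards, as in the second call, is rejected, which corresponds to a topic that drifts rather than changes.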

【００２６】（２．５）．意味ブロック抽出における孤
立段落の判定段落iが、意味ブロックの先頭にあって、矩形の面積
（Ｒ_i）の成分の合計値が所定値以下例えば０の場合、
孤立段落になる。表２の段落関連度マトリクスを見れば
わかるように、この段落は、どの段落とも関連していな
い。本意味ブロック抽出ロジックの性格上、孤立段落
（どの段落とも関連せずに孤立して存在する段落）は、
必ず意味ブロックの先頭にくる（後述の『意味ブロック
の切り出し方法に対する理論的な背景』と表３の『アル
ゴリズムの実装』を参照）。このような場合は、意味ブ
ロックの開始段落番号をＳ、終了段落番号をＥとする
と、［Ｓ，Ｅ］→［Ｓ，Ｓ］＋［Ｓ＋１，Ｅ］ …式（３）のように、段落内を分割して処理する。 (2.5) Identifying isolated paragraphs during semantic block extraction. When paragraph i is at the head of a semantic block and the sum of the components in the rectangle's area ($R_i$) is at or below a predetermined value, for example 0, the paragraph is an isolated paragraph. As the paragraph relevance matrix in Table 2 shows, such a paragraph is related to no other paragraph. By the nature of this extraction logic, an isolated paragraph (one that exists on its own, unrelated to any other paragraph) always comes at the head of a semantic block (see the theoretical background to the method of cutting out semantic blocks and the algorithm implementation in Table 3, both described later). In that case, with S the starting paragraph number of the semantic block and E the ending paragraph number, the block is split and processed as in Equation (3): $[S, E] \to [S, S] + [S+1, E]$ ... (3).
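The isolated-paragraph rule of Equation (3) can be sketched as follows. For the head paragraph S of a block, the rectangle sum is taken here to be the row of S restricted to columns S+1..E, which is an interpretation; indices are 0-based and inclusive, and the function name `split_off_isolated` is hypothetical.

```python
def split_off_isolated(matrix, start, end, threshold=0):
    """Apply [S, E] -> [S, S] + [S+1, E] when the head paragraph is isolated.

    `matrix` holds relevance values above the diagonal.
    """
    head_row_sum = sum(matrix[start][b] for b in range(start + 1, end + 1))
    if start < end and head_row_sum <= threshold:
        return [(start, start), (start + 1, end)]
    return [(start, end)]

sample = [
    [0, 0, 0],
    [0, 0, 3],
    [0, 0, 0],
]
with_isolated_head = split_off_isolated(sample, 0, 2)
without_isolated_head = split_off_isolated(sample, 1, 2)
```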

【００２７】この孤立段落は、意味ブロック［Ｓ＋１，
Ｅ］とは関連していないが、直前の段落と関連している
可能性があるので、厳密な結果を求める場合は、その接
続可能性を検査する必要がある。ただし、孤立段落は重
要文を含む可能性が低いので、重要文抽出処理の前処理
として意味ブロックを抽出する場合には無視できる。 This isolated paragraph is unrelated to the semantic block [S+1, E], but it may be related to the paragraph immediately before it, so when an exact result is required that possible connection must be checked. However, an isolated paragraph is unlikely to contain an important sentence, so it can be ignored when semantic blocks are extracted as preprocessing for important-sentence extraction.

【００２８】３．意味ブロックである三角形の切リ出し
方法に対する理論的な背景（３．１）．三角形と矩形の面積比により、三角形を切
り出す根拠表２の段落関連度マトリクスにおいて、相互に関連度の
高い段落の集まりは、段落関連度マトリクス上で三角形
を構成するのは明らかである。しかし、実際は、互いに
関連のない段落同士でも、同じようなキーワードを持つ
場合があるので、段落関連度マトリクス上にノイズとな
って表れる。 3. Theoretical background to the method of cutting out the triangles that are semantic blocks. (3.1) Why triangles can be cut out using the area ratio of triangle to rectangle. In the paragraph relevance matrix of Table 2, a set of paragraphs that are highly related to one another clearly forms a triangle in the matrix. In practice, however, even unrelated paragraphs may share similar keywords, and this shows up as noise in the paragraph relevance matrix.

【００２９】そうしたノイズは、矩形上に表れる。表１
のテキスト例では、段落１〜段落５までが１つの意味ブ
ロックを形成し、段落６〜段落１０までが別な意味ブロ
ックを形成する。しかし、段落１と段落６では、「ワシ
ントン」，「米政府」，「連邦政府」などのキーワード
が共通に出現し、互いに関連があるように見える。この
場合、こうしたキーワードはノイズであり、非対角成分
である矩形に表れている。表２の（１，６）の要素（成
分）である９という値がそうである。こうしたノイズを
取り除いて、三角形を切り出すには、三角形と矩形の面
積比に注目すれば良い。 Such noise appears in the rectangles. In the text example of Table 1, paragraphs 1 to 5 form one semantic block and paragraphs 6 to 10 form another. However, keywords such as "Washington", "US government", and "federal government" appear in both paragraph 1 and paragraph 6, making the two paragraphs seem related. In this case these keywords are noise, and they show up in the rectangle, that is, in the off-diagonal part. The value 9 in element (component) (1, 6) of Table 2 is an example. To remove such noise and cut out the triangles, it suffices to look at the area ratio of triangle to rectangle.

【００３０】ノイズがないと仮定すると、図１に示す段
落関連度マトリクスから容易に理解されるように、（矩
形の面積）÷（三角形の面積）の値は、段落の始まりで
無限大になり（Ｄの面積が０）、段落の終わりで、０
（Ｒの面積が０）になる。 Assuming there is no noise, as can be readily seen from the paragraph relevance matrix shown in FIG. 1, the value of (area of the rectangle) ÷ (area of the triangle) is infinite at the start of the paragraphs (where the area D is 0) and 0 at the end of the paragraphs (where the area R is 0).

【００３１】図２は、（矩形の面積）÷（三角形の面
積）の値を概念的な曲線（当然ながら、実際には、離散
曲線になる）で表したもので（この例では、ｎ個の段落
が２つの意味ブロックに分割されたことを意味してい
る）、図２において、極小点Ｐの値が０（Ｒの値が０）
になり、段落番号１おける値が無限大、段落番号ｎにお
ける値が０になることを表している。 FIG. 2 shows the value of (area of the rectangle) ÷ (area of the triangle) as a conceptual curve (in practice, of course, it is a discrete curve). In this example the n paragraphs are divided into two semantic blocks. In FIG. 2 the value at the minimum point P is 0 (R is 0), the value at paragraph number 1 is infinite, and the value at paragraph number n is 0.

【００３２】図３は、ノイズがある場合で、この場合
は、意味ブロックの分割点における極小点Ｐの値が、ノ
イズの分だけ、Ｙ軸方向に増加する。ただし、段落番号
１における値は無限大、段落ｎにおける値は０のままで
ある。注目すべきは、ノイズがある場合でもない場合で
も、曲線の形状は図２，図３とも似たようなものになる
点である。すなわち、面積比の値が無限大から始まり、
極小点と極大点を交互に持ちながら、次第に０に減衰し
ていくのである。 FIG. 3 shows the case with noise. Here the value at the minimum point P, the division point between semantic blocks, is raised along the Y axis by the amount of the noise. The value at paragraph number 1 is still infinite and the value at paragraph n is still 0. What matters is that the curve has a similar shape in FIG. 2 and FIG. 3, with or without noise: the area ratio starts at infinity and gradually decays to 0 while passing alternately through minima and maxima.

【００３３】（３．２）．極小点が分割点になる理由表２の段落関連マトリクスから、意味ブロック（三角
形）内では次の関係が成立する。Ｒ₁≦Ｒ₂≦…＜Ｒ_k-1＜Ｒ_k＞Ｒ_k+1＞…＞Ｒ_m（Ｒm→０） …式（４）Ｄ₁≦Ｄ₂≦…＜Ｄ_m …式（５）段落番号１〜ｍの間に、分割点はないとする。この時意
味ブロック内では、 ∀ｉ∈［ｌ，ｍ］ …式（６）の定条件のもとで、Ｒ_i／Ｄ_i≧Ｒ_i+1／Ｄ_i+1 …式（７）が成立することを示す（この説明は数学的な証明ではな
いので厳密さに欠ける）。 (3.2) Why a minimum point is a division point. From the paragraph relevance matrix of Table 2, the following relations hold within a semantic block (triangle): $R_1 \le R_2 \le \cdots < R_{k-1} < R_k > R_{k+1} > \cdots > R_m$ (with $R_m \to 0$) ... (4) and $D_1 \le D_2 \le \cdots < D_m$ ... (5). Suppose there is no division point between paragraph numbers 1 and m. The following shows that, within the semantic block, under the condition $\forall i \in [1, m]$ ... (6), the relation $R_i / D_i \ge R_{i+1} / D_{i+1}$ ... (7) holds (this explanation is not a mathematical proof and is not rigorous).

【００３４】[0034]

【数１】 (Equation 1)

【００３５】となる。このＸ，Ｙは、段落関連度マトリ
クスにおいて、図４に示すように、それぞれ、列がｉの
Ｙ方向の成分、行がｉのＸ方向の成分を指す。（Ｙ−
Ｘ）が０以上の場合は、式（８）の符号は正になる。
（Ｙ−Ｘ）が負の場合に、Ｄ_i＞（Ｒ_i・Ｙ）／（Ｘ−Ｙ） …式（９）を満たす段落関連度マトリクスの要素（成分）の状態を
示す。式（９）が成立すると、式（８）の符号が負にな
り、ｉが極小点になる。すなわち、式（９）を満たす段
落関連度マトリクスの要素（成分）が存在した場合は、
分割点がないと仮定した意味ブロックに分割点が存在す
ることになる。そうした場合について、以下（ａ），
（ｂ）の２つに分けて検討してみる。 The expression above is Equation (8). Here X and Y denote, in the paragraph relevance matrix, the components of column i in the Y direction and the components of row i in the X direction, respectively, as shown in FIG. 4. When (Y − X) is 0 or greater, the sign of Equation (8) is positive. When (Y − X) is negative, consider the state of the matrix elements (components) that satisfy $D_i > (R_i \cdot Y) / (X - Y)$ ... (9). When Equation (9) holds, the sign of Equation (8) becomes negative and i becomes a minimum point. That is, if elements (components) of the paragraph relevance matrix satisfy Equation (9), a division point exists in a semantic block that was assumed to have none. This situation is examined below in two cases, (a) and (b).
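Equation (8) itself is not reproduced in this text. The following is a reconstruction, not the published equation, that is consistent with Equations (7) and (9) and with the stated sign behavior. It assumes that moving from paragraph i to i+1 adds a column sum Y to the triangle and replaces that same Y in the rectangle by a row sum X, that is, $D_{i+1} = D_i + Y$ and $R_{i+1} = R_i + X - Y$.

```latex
% Assumed bookkeeping (not stated explicitly in the source):
%   D_{i+1} = D_i + Y,   R_{i+1} = R_i + X - Y
\begin{align*}
\frac{R_i}{D_i} - \frac{R_{i+1}}{D_{i+1}}
  &= \frac{R_i}{D_i} - \frac{R_i + X - Y}{D_i + Y} \\
  &= \frac{R_i (D_i + Y) - D_i (R_i + X - Y)}{D_i (D_i + Y)} \\
  &= \frac{R_i Y + D_i (Y - X)}{D_i (D_i + Y)}
\end{align*}
% If Y - X >= 0 the numerator is non-negative, so relation (7) holds.
% If Y - X < 0 the numerator is negative exactly when
%   D_i > R_i Y / (X - Y), which is condition (9).
```

Under these assumptions the sign of the difference is decided by the numerator alone, since the denominator is positive whenever $D_i > 0$.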

【００３６】（ａ）．Ｙ／（Ｘ−Ｙ）＞１すなわち
Ｘ＞Ｙ＞Ｘ／２の場合上記式（９）において、Ｒ_iの係数が１より大きい時
に、式（９）が成立するので、Ｄ_iの面積（重み：関連
度の合計値）は、Ｒ_iの面積（重み：関連度の合計値）
より大きいと言える。すなわち、三角形の面積（重み）
より、矩形の面積（重み）より大きい場合に式（９）が
成立する。 (a) The case $Y / (X - Y) > 1$, that is, $X > Y > X / 2$. In Equation (9), the inequality holds while the coefficient of $R_i$ is greater than 1, so the area of $D_i$ (weight: sum of relevance values) can be said to be larger than the area of $R_i$ (weight: sum of relevance values). That is, Equation (9) holds when the area (weight) of the triangle is larger than the area (weight) of the rectangle.

【００３７】Ｙの値がＸの値と近い状態で、三角形の重
みが周辺部より重い場合は、同じ話題を扱う１つの意味
ブロック内で、話題があいまいに展開していると考えら
れる。そうした場合は、式（８）が負になり、ｉが極小
点になる。この極小点で分割される前後の意味ブロック
は同じ話題について記述されている意味ブロックで、こ
の極小点における話題転換係数は１にかなり近い値とな
る。 When the value of Y is close to the value of X and the weight of the triangle is heavier than that of its surroundings, the topic can be considered to develop vaguely within a single semantic block dealing with one topic. In that case Equation (8) becomes negative and i becomes a minimum point. The semantic blocks before and after this minimum point describe the same topic, and the topic conversion coefficient at this minimum point is very close to 1.

【００３８】（ｂ）．Ｙ／（Ｘ−Ｙ）≦１すなわち
０≦Ｙ≦Ｘ／２の場合Ｙが０の場合は、段落ｉは孤立段落となる。この時、式
（８）の符号は負になり極小点が存在する。また、Ｙが
Ｘ／２に近くなるほど、ａの状況に近づく。まとめる
と、分割点がないと仮定した意味ブロック内でも、孤立
段落が存在したり、話題が曖昧に展開したりすると、分
割点が発生する。しかし、すでに説明したように、話題
転換係数の導入と、孤立段落の処理により、こうした状
況を救済することが可能になる。 (b) The case $Y / (X - Y) \le 1$, that is, $0 \le Y \le X / 2$. When Y is 0, paragraph i is an isolated paragraph. The sign of Equation (8) is then negative and a minimum point exists. The closer Y is to X/2, the closer the situation is to case (a). In summary, even within a semantic block assumed to have no division point, a division point arises if an isolated paragraph exists or the topic develops ambiguously. As already explained, however, introducing the topic conversion coefficient and handling isolated paragraphs make it possible to remedy these situations.

【００３９】４．アルゴリズムの実装意味ブロックの抽出アルゴリズムの実装は極めて簡単で
あり、実際のコードも非常に短いので、その例を表３に
疑似コードで示す（ただし、意味ブロックの統合、孤立
段落処理は、省略している）。ただし、Ｄ_iとＲ_iの計算
の高速化のために、段落関連度マトリクスは２次元配列
ではなく、まったく別個のデータ構造になっている。こ
の疑似コードでは、与えられた段落群から意味の切れ目
を１つ見つけ、段落を２つの意味段落群に分けている。
意味の切れ目が見つからない場合は、意味ブロック抽出
処理は終了する。意味の切れ目があった場合には、分割
した最初の段落群の意味の切れ目を探索するために、再
帰的に自分自身を呼び出す。同様に分割した後の段落群
の意味の切れ目を探索するために、再帰的に呼び出す。
意味の切れ目の判断条件は、『三角形の切り出し方法』
で説明した通りである。 4. Implementation of the algorithm. The semantic block extraction algorithm is very simple to implement and the actual code is very short, so an example is shown as pseudocode in Table 3 (merging of semantic blocks and isolated-paragraph handling are omitted). To speed up the calculation of $D_i$ and $R_i$, the paragraph relevance matrix is held not as a two-dimensional array but as an entirely different data structure. The pseudocode finds one semantic boundary in the given group of paragraphs and splits the group into two. If no semantic boundary is found, semantic block extraction ends. If one is found, the routine calls itself recursively to search for a boundary in the first of the two groups, and likewise calls itself recursively to search the second group. The conditions for judging a semantic boundary are as described under the method of cutting out triangles.
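The pseudocode of Table 3 is not reproduced in this text. The recursive procedure described above can nevertheless be sketched as follows; this is not the published pseudocode. It assumes a plain two-dimensional list rather than the faster data structure the text mentions, recomputes the sums naively, takes the first qualifying minimum as the boundary, uses the threshold 1.1 from the experiments, and omits block merging and isolated-paragraph handling. All function names are hypothetical, and ranges are 0-based and inclusive.

```python
import math

def ratios(matrix, start, end):
    """R_i / D_i for each paragraph i in [start, end], restricted to that range."""
    out = []
    for i in range(start, end + 1):
        d = sum(matrix[a][b] for b in range(start, i + 1) for a in range(start, b))
        r = sum(matrix[a][b] for a in range(start, i + 1)
                for b in range(i + 1, end + 1))
        out.append(r / d if d > 0 else math.inf)
    return out

def find_break(matrix, start, end, threshold=1.1):
    """Return the last paragraph of the first block, or None if no boundary."""
    q = ratios(matrix, start, end)
    for k in range(1, len(q) - 1):
        if q[k - 1] >= q[k] < q[k + 1]:                        # Equation (1)
            coeff = q[k + 1] / q[k] if q[k] > 0 else math.inf  # Equation (2)
            if coeff > threshold:
                return start + k
    return None

def extract_blocks(matrix, start, end, threshold=1.1):
    """Recursively split [start, end] into semantic blocks."""
    cut = find_break(matrix, start, end, threshold)
    if cut is None:
        return [(start, end)]
    return (extract_blocks(matrix, start, cut, threshold)
            + extract_blocks(matrix, cut + 1, end, threshold))

# Two blocks of three paragraphs each, plus one noise entry at (0, 3).
sample = [[0] * 6 for _ in range(6)]
sample[0][1], sample[0][2], sample[1][2] = 4, 3, 5
sample[3][4], sample[3][5], sample[4][5] = 4, 2, 6
sample[0][3] = 1
blocks = extract_blocks(sample, 0, 5)
```

The recursion terminates because a boundary is never the first or last paragraph of a range, so both halves are strictly smaller than the range they came from.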

【００４０】[0040]

【表３】 [Table 3]

【００４１】[0041]

【発明の効果】本発明によれば、新聞記事など複数の話
題が混在した文書でも、話題ごとに要約を提供すること
が可能になる。新聞記事などのように複数の記事を有す
るものから重要文を抽出する場合、本発明による意味ブ
ロック抽出を行ってから、該当する記事を切り出し、そ
こから重要文を抽出した方が、複数の記事から直接重要
文を抽出する方よりも良い結果が得られる。また、本発
明によれば、同じ文書に日本語と英語などのように２つ
以上の言語が混在した場合でも、日本語で記述された部
分に分割することが可能になる。上記話題転換係数の導
入により、誤った意味ブロックの抽出を防止できる。 [Effects of the invention] According to the present invention, a summary can be provided for each topic even for a document in which several topics are mixed, such as newspaper articles. When extracting important sentences from material containing several articles, such as a newspaper, performing the semantic block extraction of the present invention, cutting out the relevant article, and extracting important sentences from it gives better results than extracting important sentences directly from the several articles. Furthermore, according to the present invention, even when two or more languages such as Japanese and English are mixed in the same document, the document can be divided into the parts written in Japanese. Introducing the topic conversion coefficient prevents erroneous semantic blocks from being extracted.

[Brief description of the drawings]

【図１】段落関連度マトリクスを説明するための概要
図である。FIG. 1 is a schematic diagram for explaining a paragraph relevance matrix.

【図２】三角形領域と矩形領域の関係を説明するため
の図である。FIG. 2 is a diagram for explaining a relationship between a triangular area and a rectangular area.

【図３】三角形領域と矩形領域の関係（ノイズがある
場合）を説明する図である。FIG. 3 is a diagram illustrating a relationship between a triangular area and a rectangular area (when there is noise).

【図４】段落関連度マトリクスを説明する図である。FIG. 4 is a diagram illustrating a paragraph relevance matrix.

Claims

[Claims]

1. A document processing apparatus that divides an electronic document into paragraphs; calculates the relevance between paragraphs based on keywords extracted from the paragraphs; in a square matrix whose dimension is the number of paragraphs, places the relevance values in the components of the region on one side of the diagonal of the square matrix; within that one-sided region, obtains the sum of the relevance values in a triangular region bounded by an arbitrary row (or column), an arbitrary column (or row), and the diagonal components; and obtains a document division point based on that sum.

2. The document processing apparatus according to claim 1, wherein the document division point is obtained from the relationship between the sum within the triangular region and the sum of the relevance values within a rectangular region that corresponds to the column (or row) of the triangular region and excludes the triangular region.

3. The document processing apparatus according to claim 2, wherein the document division point is obtained by finding an extremum of the ratio between the sum within the triangular region and the sum of the relevance values within the rectangular region.

4. The document processing apparatus according to any one of claims 1 to 3, wherein the sum of the relevance values of the row (or column) is obtained for each of two adjacent paragraphs, and the sums are compared to obtain a topic turning point.

5. The document processing apparatus according to any one of claims 1 to 3, wherein, when the sum of the relevance values of the row (or column) at the division point is at or below a predetermined value, the paragraph of that row (or column) is treated as an isolated paragraph.