JP2000148752A

JP2000148752A - Document structure recognition method and apparatus, and storage medium storing document structure recognition program

Info

Publication number: JP2000148752A
Application number: JP10317948A
Authority: JP
Inventors: Takaaki Hasegawa; 隆明長谷川; Shinichiro Takagi; 伸一郎高木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1998-11-09
Filing date: 1998-11-09
Publication date: 2000-05-30

Abstract

(57) [Abstract]
[Problem] To provide a document structure recognition method and apparatus, and a storage medium storing a document structure recognition program, that can analyze the structure of text containing arbitrary, domain-independent bullet items by focusing on a specific character string pattern in a document and on the length from the beginning to the end of the line containing that pattern, and by considering the relationship between the length of the whole line and the length of the specific character string pattern.
[Solution] A document to be recognized is input. The document is pattern-matched line by line against bullet patterns held in advance, and bullet candidates that match the bullet patterns are generated. For each line of the document, the length of the character string from the first character on the line to the last character on the line is measured. When the generated bullet candidates contain blanks, one bullet candidate is selected from among them using the measured length. Blanks are removed from the selected bullet candidate to obtain the bullet label and content information, and that label and content information is attached as tags to the selected bullet candidate and output.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document structure recognition method and apparatus and to a storage medium storing a document structure recognition program. More particularly, it relates to a document structure recognition method and apparatus, and a storage medium storing a document structure recognition program, for recognizing the document structure of an electronic document transmitted via a network or read in by OCR.

[0002]

[Prior Art] In conventional methods of analyzing document structure, bullet items are identified by preparing in advance words and symbols that can represent predetermined labeled bullet items, searching the document for them, and judging any matching character string to be a bullet item.

[0003]

[Problems to Be Solved by the Invention] However, the conventional method of searching a document using predetermined words and symbols that represent bullet items has two problems: domain-dependent keywords must be prepared, and a bullet item cannot be recognized when it matches no keyword or when blanks are inserted between its characters.

[0004] The present invention has been made in view of the above points. Its object is to provide a document structure recognition method and apparatus, and a storage medium storing a document structure recognition program, that can analyze the structure of text containing arbitrary, domain-independent bullet items by focusing on a specific character string pattern in a document and on the length from the beginning to the end of the line containing that pattern, and by considering the relationship between the length of the whole line and the length of the specific character string pattern.

[0005]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention (claim 1) is a document structure recognition method for analyzing the structure of a document containing arbitrary bullet items, in which: a document to be recognized is input (step 1); the document is pattern-matched line by line against bullet patterns held in advance, and bullet candidates that match the bullet patterns are generated (step 2); for each line of the document, the length of the character string from the first character on the line to the last character on the line is measured (step 3); when the generated bullet candidates contain blanks, one bullet candidate is selected from among them using the measured length (step 4); blanks are removed from the selected bullet candidate to obtain the bullet label and content information (step 5); and the label and content information is attached as tags to the selected bullet candidate and output (step 6).

[0006] In the present invention (claim 2), the bullet patterns held in advance are patterns that use character types to match bullet items both with and without blanks. In the present invention (claim 3), when bullet candidates are generated, the input document is pattern-matched as exhaustively as possible against patterns of character strings whose characters are separated by blanks, and a plurality of candidates are generated.
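The exhaustive candidate generation of claim 3 can be illustrated with a minimal sketch. It is a simplification, not the patent's implementation: it ignores the stored character-type patterns and simply splits a line at every run of blanks, treating both the ASCII space and the full-width space (U+3000) as blanks. The name `generate_candidates` is hypothetical.

```python
import re

# Assumption: a "blank" is an ASCII space or a full-width space (U+3000).
BLANK_RUN = re.compile(r"[ \u3000]+")


def generate_candidates(line):
    """Return every (label, content) split of a line at a run of blanks."""
    text = line.strip(" \u3000")
    if not text:
        return []
    candidates = []
    for match in BLANK_RUN.finditer(text):
        # Everything before the run is the label, everything after is content.
        candidates.append((text[:match.start()], text[match.end():]))
    # The whole line as a label with no content is also a candidate.
    candidates.append((text, ""))
    return candidates
```

A line with two runs of blanks therefore yields three candidates, and a later step has to choose among them.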

[0007] In the present invention (claim 4), when the length of the character string is measured, the length of the character string on each line of the document, from the start of the line excluding leading blanks to the end of the line and including any blanks in between, is measured in bytes. In the present invention (claim 5), when a plurality of bullet candidates are obtained, the length of the line that matched the bullet pattern is compared with the length of the next line, where that next line does not match a bullet pattern and either starts at the same position or is indented. If the matched line is longer, or if the next line is empty, half the length of the matched line is taken as the upper limit on the bullet length. If the matched line is shorter, the length of the matched line is taken as the upper limit. The longest bullet candidate under that limit is then selected.
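The byte-length measurement of claim 4 and the upper-limit rule of claim 5 might be sketched as follows. Several points are assumptions rather than statements from the source: the function names are hypothetical, the text is assumed to be in a double-byte encoding such as Shift_JIS (the source does not name one), half of an odd length is rounded down, equal line lengths are treated like the "shorter" case, and "the longest candidate" is read as the candidate whose label is longest while still fitting the limit.

```python
def line_length_bytes(line, encoding="shift_jis"):
    """Byte length from the first to the last non-blank character of a line."""
    # Interior blanks are counted; only leading and trailing blanks are dropped.
    return len(line.strip(" \t\r\n\u3000").encode(encoding))


def label_upper_limit(matched_len, next_len=None):
    """Upper limit on the bullet length for a matched line.

    next_len is None when the next line is empty; otherwise it is the byte
    length of an aligned or indented next line that matched no bullet pattern.
    """
    if next_len is None or matched_len > next_len:
        return matched_len // 2  # matched line longer, or next line empty
    return matched_len           # matched line shorter (or equal, by assumption)


def choose_candidate(candidates, limit, encoding="shift_jis"):
    """Pick the (label, content) pair with the longest label within the limit."""
    def label_len(cand):
        return len(cand[0].encode(encoding))

    fitting = [c for c in candidates if label_len(c) <= limit]
    return max(fitting, key=label_len) if fitting else None
```

With a 30-byte matched line followed by an empty line, `label_upper_limit(30)` gives 15, which is the figure used in the worked example of the description.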

[0008] In the present invention (claim 6), when the next line, which does not match a bullet pattern, starts at the same position as the line that matched the bullet pattern or is indented relative to it, that next line is included in the content of the bullet item as part of the bullet item's extent. FIG. 2 is a diagram showing the basic configuration of the present invention. The present invention (claim 7) is a document structure recognition apparatus for analyzing the structure of a document containing arbitrary bullet items, comprising: document input means 1 for inputting a document to be recognized; pattern storage means 2 for holding bullet patterns; candidate generation means 3 for pattern-matching, line by line, the document input by the document input means 1 against the bullet patterns held in advance in the pattern storage means 2 and generating bullet candidates that match the bullet patterns; length measurement means 4 for measuring, for each line of the candidates generated by the candidate generation means 3, the length of the character string from the first character on the line to the last character on the line; candidate determination means 5 for selecting, when the bullet candidates generated by the candidate generation means 3 contain blanks, one bullet candidate from among them using the measured length; bullet extraction means 6 for removing blanks from the bullet candidate selected by the candidate determination means 5 to obtain the bullet label and content information; and output means 7 for attaching the label and content information obtained by the bullet extraction means 6 as tags to the bullet candidate selected by the candidate determination means 5 and outputting the result.

[0009] In the present invention (claim 8), the pattern storage means 2 holds patterns that use character types to match bullet items both with and without blanks. In the present invention (claim 9), the candidate generation means 3 includes pattern matching means that pattern-matches the input document as exhaustively as possible against patterns of character strings whose characters are separated by blanks and generates a plurality of candidates.

[0010] In the present invention (claim 10), the length measurement means 4 includes means for measuring in bytes, for each line of the document, the length of the character string from the start of the line excluding leading blanks to the end of the line, including any blanks in between. In the present invention (claim 11), the candidate determination means 5 includes means that, when a plurality of bullet candidates are obtained, compares the length of the line that matched a bullet pattern in the pattern storage means with the length of the next line, where that next line does not match a bullet pattern and either starts at the same position or is indented; takes half the length of the matched line as the upper limit on the bullet length if the matched line is longer or the next line is empty, or the length of the matched line as the upper limit if the matched line is shorter; and selects the longest bullet candidate under that limit.

[0011] In the present invention (claim 12), the bullet extraction means 6 includes means for including the next line in the content of the bullet item, as part of the bullet item's extent, when that next line does not match a bullet pattern and starts at the same position as the matched line or is indented relative to it. The present invention (claim 13) is a storage medium storing a document structure recognition program for analyzing the structure of a document containing arbitrary bullet items, the program comprising: a document input process for inputting a document to be recognized; a candidate generation process for pattern-matching, line by line, the document input by the document input process against bullet patterns held in advance in pattern storage means and generating bullet candidates that match the bullet patterns; a length measurement process for measuring, for each line of the candidates generated by the candidate generation process, the length of the character string from the first character on the line to the last character on the line; a candidate determination process for selecting, when the bullet candidates generated by the candidate generation process contain blanks, one bullet candidate from among them using the measured length; a bullet extraction process for removing blanks from the bullet candidate selected by the candidate determination process to obtain the bullet label and content information; and an output process for attaching the label and content information obtained by the bullet extraction process as tags to the bullet candidate selected by the candidate determination process and outputting the result.

[0012] In the present invention (claim 14), the candidate generation process includes a pattern matching process that pattern-matches the input document as exhaustively as possible against patterns of character strings whose characters are separated by blanks and generates a plurality of candidates. In the present invention (claim 15), the length measurement process includes a process that measures in bytes, for each line of the document, the length of the character string from the start of the line excluding leading blanks to the end of the line, including any blanks in between.

[0013] In the present invention (claim 16), the candidate determination process includes a process that, when a plurality of bullet candidates are obtained, compares the length of the line that matched a bullet pattern of the pattern storage process with the length of the next line, where that next line does not match a bullet pattern and either starts at the same position or is indented; takes half the length of the matched line as the upper limit on the bullet length if the matched line is longer or the next line is empty, or the length of the matched line as the upper limit if the matched line is shorter; and selects the longest bullet candidate under that limit.

[0014] In the present invention (claim 17), the bullet extraction process includes a process for including the next line in the content of the bullet item, as part of the bullet item's extent, when that next line does not match a bullet pattern and starts at the same position as the matched line or is indented relative to it. As described above, according to the present invention, taking as an example the recognition of the structure of an e-mail announcing an event, patterns built from character types that match bullet items are stored in advance. When a received e-mail document is input, pattern matching is performed on the input document using the stored bullet patterns, and bullet candidates that match the patterns are generated. The length from the beginning to the end of the line is then measured for the line that matches a pattern and for the next line, provided that the next line is not empty and starts at the same position or is indented. One of the candidates is selected as the bullet item on the basis of the measured lengths of the matched line and the next line, the label and content information of the selected bullet item is extracted, and that label and content information is added to the document as tags and output.

[0015] This makes it possible to analyze the structure of text containing arbitrary, domain-independent bullet items by focusing on a specific character string pattern in a document and on the length from the beginning to the end of the line containing that pattern, and by considering the relationship between the length of the whole line and the length of the specific character string pattern.

[0016]

[Embodiments of the Invention] FIG. 3 shows the configuration of the document structure recognition apparatus of the present invention. The apparatus shown in the figure comprises: a document input unit 1 for inputting a document whose structure is to be recognized; a pattern storage unit 2 for storing patterns built from character types that match bullet items (patterns that match bullet items with or without blanks); a bullet candidate generation unit 3 that performs pattern matching on the input document using the bullet patterns in the pattern storage unit 2 and generates bullet candidates that match the patterns; a line length measurement unit 4 that measures the length of a line from its beginning to its end; a bullet determination unit 5 that determines the length of the bullet label from the length of the pattern-matched line measured by the line length measurement unit 4 and, using that length as a criterion, selects one of the candidates as the bullet item; a bullet extraction unit 6 that extracts the label and content information of the selected bullet item; a tagged document output unit 7 that adds the bullet label and content information to the document as tags and outputs the document; and a control unit 8 that controls the above components.

[0017] Next, the operation of the above configuration is described. FIG. 4 is a flowchart for explaining the operation of the document structure recognition apparatus of the present invention.
Step 101) Under instruction from the control unit 8, a document whose structure is to be recognized is input from the document input unit 1.

[0018] Step 102) It is determined whether the final line of the input document has been reached. If not, the process proceeds to step 103; if so, it proceeds to step 109.
Step 103) Under the control of the control unit 8, the bullet candidate generation unit 3 reads the bullet patterns from the pattern storage unit 2.
Step 104) The bullet candidate generation unit 3 performs pattern matching and, for each line that matches a bullet pattern, generates as many bullet candidates as possible.

[0019] Step 105) It is determined whether the bullet candidate generation unit 3 generated a plurality of bullet candidates. If so, the process proceeds to step 106; otherwise it proceeds to step 107.
Step 106) When there are a plurality of bullet candidates, the line length measurement unit 4 measures the length of the line from its beginning to its end and passes the measured length to the bullet determination unit 5. The length is measured in bytes over the character string from the start of the line excluding leading blanks to the end of the line, including any blanks in between. The bullet determination unit 5 then selects one bullet item on the basis of the length, and the process proceeds to step 108.

[0020] Step 107) If only one bullet candidate was obtained in step 105, that candidate is determined to be the bullet item, and the process proceeds to step 108.
Step 108) The bullet extraction unit 6 attaches the subject of the bullet item and additional information to the bullet item and extracts the information as a bullet item.

[0021] Step 109) The tagged document output unit 7 repeats the above processing up to the final line and then inserts the bullet label and content information into the input document as tags and outputs a tagged document that indicates the document structure.
Next, the operation of the bullet determination unit 5 in step 106 is described.

[0022] FIG. 5 is a flowchart for explaining the operation of the bullet determination unit of the present invention.
Step 201) The bullet determination unit 5 obtains the length, measured by the line length measurement unit 4, of the matched line from its beginning to its end.
Step 202) If the next line is not empty and does not match a bullet pattern, the process proceeds to step 203; otherwise it proceeds to step 206.

[0023] Step 203) If the next line starts at the same position or is indented, the process proceeds to step 204; otherwise it proceeds to step 206.
Step 204) When the condition of step 203 is satisfied, the length of the next line is measured.

[0024] Step 205) The length of the line that matched the pattern is compared with the length of the next line. If the matched line is longer than the next line, or if the next line is empty or does not match a bullet pattern, the process proceeds to step 206; otherwise it proceeds to step 207.
Step 206) A bullet label whose length is at most half the length of the matched line is selected, and the process ends.

[0025] Step 207) If the line that matched the pattern is shorter, the bullet label is selected with the length of that line taken as the bullet label length, and the process ends.
Next, the operation of the bullet extraction unit 6 is described. FIG. 6 is a flowchart for explaining the operation of the bullet extraction unit of the present invention.

[0026] Step 301) It is determined whether the bullet label contains blanks. If so, the process proceeds to step 302; otherwise it proceeds to step 305.
Step 302) The blanks in the label are deleted.
Step 303) It is determined whether the bullet determination unit 5 set the bullet label length to half the line length. If so, the process proceeds to step 304; otherwise it proceeds to step 305.

[0027] Step 304) The blanks in the content part of the bullet item are deleted.
Step 305) Finally, the label and content information of the bullet item is extracted.
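Steps 301 to 305 can be sketched as a short function. This is an illustration under assumptions, not the patent's code: the name `extract_item` and the boolean `halved` (true when the bullet determination unit used half the line length) are hypothetical, and a blank is taken to be an ASCII space or a full-width space (U+3000).

```python
import re

# Assumption: a "blank" is an ASCII space or a full-width space (U+3000).
BLANKS = re.compile(r"[ \u3000]+")


def extract_item(label, content, halved):
    """Return the cleaned (label, content) pair for a selected bullet candidate."""
    if BLANKS.search(label):                   # step 301: label contains blanks?
        label = BLANKS.sub("", label)          # step 302: delete blanks in the label
        if halved:                             # step 303: was half the line length used?
            content = BLANKS.sub("", content)  # step 304: delete blanks in the content
    return label, content                      # step 305: extract label and content
```

Following the flowchart, blanks in the content are removed only when the label itself was blank-separated and the half-length limit applied; in every other case the content is returned unchanged.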

[0028]

[Example] An example of the present invention is described below with reference to the drawings. FIG. 7 shows an example of the bullet patterns stored in the pattern storage unit in one example of the present invention. In the figure, [CJKAS] denotes the character types kanji (C), hiragana (J), katakana (K), alphabet (A), and blank (S), respectively, and {1,n} denotes repetition of up to n times. In the example in the figure, the following patterns are stored as a set under <bullet pattern>: "<prefix pattern>, <label pattern>, <suffix pattern>, <content pattern>", "<label pattern>, <suffix pattern>, <content pattern>", "<prefix pattern>, <label pattern>, <suffix pattern>", "<label pattern>, <suffix pattern>", "<prefix pattern>, <label pattern>", "<label pattern>", and so on.
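The character-type notation can be made concrete with a small sketch that maps each character to one of the codes C, J, K, A, S and then applies a `[CJKAS]{1,n}` pattern to the resulting code string. The Unicode ranges, the `?` code for any other character, and the names `char_type`, `type_string`, and `matches_label_types` are assumptions for illustration; the contents of FIG. 7 are not reproduced in this text.

```python
import re


def char_type(ch):
    """Map one character to a character-type code."""
    if ch in " \u3000":
        return "S"  # blank (ASCII or full-width space)
    if "\u4e00" <= ch <= "\u9fff":
        return "C"  # kanji (CJK Unified Ideographs block)
    if "\u3040" <= ch <= "\u309f":
        return "J"  # hiragana
    if "\u30a0" <= ch <= "\u30ff":
        return "K"  # katakana
    if ch.isascii() and ch.isalpha():
        return "A"  # alphabet
    return "?"      # any other character (assumed code, not in the source)


def type_string(text):
    """Return the character-type code string for a piece of text."""
    return "".join(char_type(ch) for ch in text)


def matches_label_types(text, n):
    """True if the text is 1 to n characters of types C, J, K, A or S."""
    return re.fullmatch(r"[CJKAS]{1,%d}" % n, type_string(text)) is not None
```

Matching on type codes rather than on specific words is what lets the patterns stay independent of any particular domain vocabulary.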

[0029] FIG. 8 shows an example of an input e-mail document in one example of the present invention. It is assumed that an empty line is inserted on every other line from below the title "冠婚葬祭" (ceremonial occasions) up to the line before "備考" (remarks). When the e-mail document shown in FIG. 8 is input from the document input unit 1, the processing of the bullet candidate generation unit 3 matches the third line against a blank-separated bullet pattern, and three bullet candidates are generated. The line length measurement unit 4 finds the line length to be 30 bytes. The bullet determination unit 5 recognizes that the next line is empty, so the candidate is selected with an upper limit on the bullet label length of 15 bytes, which is half the line length, and the bullet extraction unit 6 takes "種別" (type) as the bullet label.
Because this label as written contains two blanks, the blanks are deleted from the label to obtain "種別". Furthermore, because the bullet determination unit 5 set the bullet length with half the line length as the upper limit, the blanks in the content part of the bullet item are also deleted, giving "結婚" (marriage) as the content. The tagged document output unit 7 attaches this information to the document as tags and outputs it.

[0030] By the same method, only "所属" (affiliation) matches as a bullet candidate on the fifth line, so it is taken as the bullet item. The seventh and eleventh lines are handled in the same way as the third line, and the ninth and thirteenth lines in the same way as the fifth line. Thus, when the e-mail shown in FIG. 8 is the input document, the document shown in FIG. 9 is output.
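The final tagging step might look like the sketch below. The tag names `item`, `label`, and `content` and the function name `tag_item` are purely hypothetical, since the output format of FIG. 9 is not reproduced in this text; only the idea of wrapping the extracted label and content in tags comes from the description.

```python
from xml.sax.saxutils import escape


def tag_item(label, content):
    """Wrap an extracted label and content in (hypothetical) markup tags."""
    # escape() protects &, < and > so the tagged output stays well-formed.
    return "<item><label>{}</label><content>{}</content></item>".format(
        escape(label), escape(content)
    )


print(tag_item("種別", "結婚"))
```

For the third line of the worked example, the label "種別" and the content "結婚" would each end up inside their own tag.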

[0031] The above example was described on the basis of the configuration shown in FIG. 3. However, the present invention can also be realized easily by implementing each component of the document structure recognition apparatus shown in FIG. 3 as a program, storing the program on a disk device connected to the computer used as the document structure recognition apparatus or on a portable storage medium such as a floppy disk (registered trademark) or CD-ROM, and installing it when the present invention is to be carried out.

[0032] The present invention is not limited to the above example, and various modifications and applications are possible within the scope of the claims.

[0033]

[Effects of the Invention] As described above, according to the present invention, by using the length of the character string on a line written as a bullet item and the position of the start of that line, it is possible to recognize the structure of a document containing arbitrary bullet items whose characters are separated by blanks, without preparing domain-dependent keywords.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the basic configuration of the present invention.

FIG. 3 is a configuration diagram of the document structure recognition apparatus of the present invention.

FIG. 4 is a flowchart for explaining the operation of the document structure recognition apparatus of the present invention.

FIG. 5 is a flowchart for explaining the operation of the bullet determination unit of the present invention.

FIG. 6 is a flowchart for explaining the operation of the bullet extraction unit of the present invention.

FIG. 7 is an example of the bullet patterns stored in the pattern storage unit in one example of the present invention.

FIG. 8 is an example of an input e-mail document in one example of the present invention.

FIG. 9 is an example of an output document in one example of the present invention.

[Explanation of symbols]

1 document input means, document input unit
2 pattern storage means, pattern storage unit
3 candidate generating means, bullet candidate generation unit
4 length measuring means, line length measurement unit
5 candidate determining means, bullet determination unit
6 bullet extraction means, bullet extraction unit
7 output means, tagged document output unit

Claims

1. A document structure recognition method for analyzing the structure of a document containing arbitrary bullets, characterized by: inputting a document to be recognized; performing line-by-line pattern matching between the document and bullet patterns held in advance, and generating bullet candidates that match the bullet patterns; measuring, for one line of the document, the length of the character string from the line head where characters exist to the line end where characters exist; when a generated bullet candidate contains blanks, determining one bullet candidate from among the obtained bullet candidates using the length of the character string; deleting the blanks from the determined bullet candidate to obtain bullet label and content information; and attaching the information on the content of the label as a tag to the determined bullet candidate and outputting the result.

2. The document structure recognition method according to claim 1, wherein patterns that use character types to match bullets with or without blanks are stored as the bullet patterns held in advance.

3. The document structure recognition method according to claim 1, wherein, when generating the bullet candidates, a plurality of candidates are generated by pattern matching the input document, as far as possible, against character string patterns written with separating blanks.

4. The document structure recognition method according to claim 1, wherein, when measuring the length of the character string, the length of the character string including blanks, from the line head excluding the blanks at the beginning of the line to the line end, is measured in bytes for one line of the document.
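The byte-length measurement in claim 4 can be sketched as follows. This is only an illustrative sketch: the function name `line_length_bytes`, the set of characters treated as blanks, and the default Shift_JIS encoding are assumptions, since the claim does not name an encoding or define blanks precisely.

```python
# Characters treated as blanks: ASCII space, tab, full-width space.
# This set is an assumption; the claim only says "blanks".
BLANKS = " \t\u3000"


def line_length_bytes(line: str, encoding: str = "shift_jis") -> int:
    """Length in bytes from the first to the last non-blank character.

    Interior blanks are counted; leading and trailing blanks are not.
    The encoding is an assumed parameter, not taken from the source.
    """
    body = line.rstrip("\r\n").strip(BLANKS)
    return len(body.encode(encoding))
```

With a double-byte encoding, a full-width character contributes two bytes and an ASCII space one byte, so the measure roughly tracks the displayed width of the line in a fixed-pitch font.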

5. The document structure recognition method according to claim 1, wherein, when there are a plurality of obtained bullet candidates, the length of the line matching the bullet pattern is compared with the length of the next line that does not match the bullet pattern and whose line-head position is the same or indented; when the line matching the bullet pattern is longer, or when the next line is blank, one-half of the length of the matched line is set as the upper-limit length of the bullet; when the line matching the bullet pattern is shorter, the length of the matched line is set as the upper-limit length of the bullet; and the largest bullet candidate is determined.
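The upper-limit rule in claim 5 can be sketched as below. The names `label_upper_limit` and `pick_candidate` are hypothetical. Two points are assumptions rather than statements of the claim: equal line lengths are treated like the "shorter" case, and "the largest bullet candidate" is read as the longest candidate that does not exceed the upper limit.

```python
from typing import Optional, Sequence


def label_upper_limit(matched_len: int, next_len: int) -> float:
    """Upper-limit length of a bullet for a line that matched a bullet pattern.

    matched_len: length of the matched line.
    next_len: length of the next non-matching line with the same or a
              deeper line-head position; 0 stands for a blank line.
    """
    if next_len == 0 or matched_len > next_len:
        # Matched line is longer, or the next line is blank: half the line.
        return matched_len / 2
    # Matched line is shorter (equality is handled here by assumption).
    return float(matched_len)


def pick_candidate(candidate_lens: Sequence[int],
                   matched_len: int,
                   next_len: int) -> Optional[int]:
    """Return the longest candidate length within the upper limit, if any."""
    limit = label_upper_limit(matched_len, next_len)
    fitting = [n for n in candidate_lens if n <= limit]
    return max(fitting) if fitting else None
```

For example, for a 40-byte matched line followed by a blank line, the limit is 20 bytes, so among candidates of 6, 14, and 30 bytes the 14-byte candidate would be chosen under this reading.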

6. The document structure recognition method according to claim 1, wherein, when the line-head position of the next line that does not match the bullet pattern is the same as, or indented relative to, that of the line matching the bullet pattern, the next line is included in the content of the bullet as being within the range of the bullet.
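The range rule in claim 6 can be sketched as below. The helper names and the `matches_bullet` callback are hypothetical. The claim speaks only of "the next line"; applying the rule repeatedly to following lines, stopping at a blank line, and measuring the line-head position in characters rather than bytes are all assumptions of this sketch.

```python
from typing import Callable, List

# Characters treated as blanks (assumed set).
BLANKS = " \t\u3000"


def indent_of(line: str) -> int:
    """Number of leading blank characters (line-head position, in characters)."""
    return len(line) - len(line.lstrip(BLANKS))


def bullet_range(lines: List[str], start: int,
                 matches_bullet: Callable[[str], bool]) -> int:
    """Return the index one past the last line belonging to the bullet at start.

    A following line is included while it does not match a bullet pattern
    and its line head is at the same position as, or indented relative to,
    the matched line.
    """
    base = indent_of(lines[start])
    end = start + 1
    while end < len(lines):
        nxt = lines[end]
        is_blank = not nxt.strip(BLANKS + "\r\n")
        if is_blank or matches_bullet(nxt) or indent_of(nxt) < base:
            break
        end += 1
    return end
```

The content of the bullet at `lines[start]` is then `lines[start:end]`, with the label removed from the first line by whatever extraction step is in use.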

7. A document structure recognition apparatus for analyzing the structure of a document containing arbitrary bullets, comprising: document input means for inputting a document to be recognized; pattern storage means for holding bullet patterns; candidate generating means for performing line-by-line pattern matching between the document input by the document input means and the bullet patterns held in the pattern storage means, and generating bullet candidates that match the bullet patterns; length measuring means for measuring, for one line of a candidate generated by the candidate generating means, the length of the character string from the line head where characters exist to the line end where characters exist; candidate determining means for determining, when a bullet candidate contains blanks, one bullet candidate from among the obtained bullet candidates using the length of the character string; bullet extraction means for deleting the blanks from the bullet candidate determined by the candidate determining means to obtain bullet label and content information; and output means for attaching the information on the content of the label obtained by the bullet extraction means as a tag to the bullet candidate determined by the candidate determining means and outputting the result.

8. The document structure recognition apparatus according to claim 7, wherein the pattern storage means stores patterns that use character types to match bullets with or without blanks.

9. The document structure recognition apparatus according to claim 7, wherein the candidate generating means includes pattern matching means for generating a plurality of candidates by pattern matching the input document, as far as possible, against character string patterns written with separating blanks.

10. The document structure recognition apparatus according to claim 7, wherein the length measuring means includes means for measuring in bytes, for one line of the document, the length of the character string including blanks, from the line head excluding the blanks at the beginning of the line to the line end.

11. The document structure recognition apparatus according to claim 7, wherein the candidate determining means includes means for: when there are a plurality of obtained bullet candidates, comparing the length of the line matching a bullet pattern of the pattern storage means with the length of the next line that does not match the bullet pattern and whose line-head position is the same or indented; when the line matching the bullet pattern is longer, or when the next line is blank, setting one-half of the length of the matched line as the upper-limit length of the bullet; when the line matching the bullet pattern is shorter, setting the length of the matched line as the upper-limit length of the bullet; and determining the largest bullet candidate.

12. The document structure recognition apparatus according to claim 7, wherein the bullet extraction means includes means for including the next line that does not match the bullet pattern in the content of the bullet, as being within the range of the bullet, when the line-head position of that next line is the same as, or indented relative to, that of the line matching the bullet pattern.

13. A storage medium storing a document structure recognition program for analyzing the structure of a document containing arbitrary bullets, the program comprising: a document input process for inputting a document to be recognized; a candidate generation process for performing line-by-line pattern matching between the document input by the document input process and bullet patterns held in pattern storage means, and generating bullet candidates that match the bullet patterns; a length measurement process for measuring, for one line of a candidate generated by the candidate generation process, the length of the character string from the line head where characters exist to the line end where characters exist; a candidate determination process for determining, when a bullet candidate generated by the candidate generation process contains blanks, one bullet candidate from among the obtained bullet candidates using the length of the character string; a bullet extraction process for deleting the blanks from the bullet candidate determined by the candidate determination process to obtain bullet label and content information; and an output process for attaching the information on the content of the label obtained by the bullet extraction process as a tag to the bullet candidate determined by the candidate determination process and outputting the result.

14. The storage medium storing the document structure recognition program according to claim 13, wherein the candidate generation process includes a pattern matching process for generating a plurality of candidates by pattern matching the input document, as far as possible, against character string patterns written with separating blanks.

15. The storage medium storing the document structure recognition program according to claim 13, wherein the length measurement process includes a process for measuring in bytes, for one line of the document, the length of the character string including blanks, from the line head excluding the blanks at the beginning of the line to the line end.

16. The storage medium storing the document structure recognition program according to claim 13, wherein the candidate determination process includes a process for: when there are a plurality of obtained bullet candidates, comparing the length of the line matching a bullet pattern of the pattern storage means with the length of the next line that does not match the bullet pattern and whose line-head position is the same or indented; when the line matching the bullet pattern is longer, or when the next line is blank, setting one-half of the length of the matched line as the upper-limit length of the bullet; when the line matching the bullet pattern is shorter, setting the length of the matched line as the upper-limit length of the bullet; and determining the largest bullet candidate.

17. The storage medium storing the document structure recognition program according to claim 13, wherein the bullet extraction process includes a process for including the next line that does not match the bullet pattern in the content of the bullet, as being within the range of the bullet, when the line-head position of that next line is the same as, or indented relative to, that of the line matching the bullet pattern.