JP4504702B2 - Document processing apparatus, document processing method, and document processing program

Info

Publication number: JP4504702B2
Application number: JP2004050165A
Authority: JP
Inventors: 慶久大黒
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2004-02-25
Filing date: 2004-02-25
Publication date: 2010-07-14
Anticipated expiration: 2024-02-25
Also published as: JP2005242579A

Description

The present invention relates to a document processing apparatus, a document processing method, and a document processing program that extract the shape features of character lines by focusing on the aggregated features that represent the arrangement of the in-line rectangles of each character line, with the aim of obtaining a search key that characterizes a document without performing character recognition.

Techniques have been proposed for extracting and outputting character lines from the circumscribed rectangles of the character components in a document image. These techniques extract character lines by applying multiple constraints to features (size, spacing, and so on) that relate to the shape and position of the circumscribed rectangles of characters (see, for example, Patent Documents 1 to 5).

Patent Document 1: International Publication No. 00/62243 pamphlet
Patent Document 2: Japanese Patent Laid-Open No. 11-143879
Patent Document 3: Japanese Patent Laid-Open No. 11-219407
Patent Document 4: Japanese Patent Laid-Open No. 9-231317
Patent Document 5: Japanese Patent Laid-Open No. 8-161430

However, with the above prior art, determining character lines requires manually tuning the multiple constraints on the circumscribed rectangles to their optimum values. Moreover, although these techniques can judge how likely a region is to be a character line, they cannot obtain features relating to the content of the character line.

To solve the problems described above, an object of the present invention is to provide a document processing apparatus, a document processing method, and a document processing program that extract features representing the arrangement of the in-line rectangles of a character line image and quantize them into a fixed number of levels to generate symbols, thereby making it possible to extract the features of a character line without character recognition and to search efficiently on the content of character lines.

To solve the problems described above and achieve the object, the document processing apparatus according to claim 1 of the present invention is an apparatus that performs predetermined image processing on an input document image, extracts image features, and performs document processing, the apparatus comprising: in-line height symbol generating means for quantizing the height, within the line, of the start point of an in-line rectangle of a character line image extracted from the document image into a fixed number of levels to generate symbols of a fixed number of types; line height estimating means for estimating the line height of the character line image; in-line rectangle height ratio symbol generating means for quantizing the ratio of the height of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; in-line rectangle width ratio symbol generating means for quantizing the ratio of the width of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; black pixel density symbol generating means for quantizing the black pixel density within the rectangle into a fixed number of levels to generate symbols of a fixed number of types; appearance frequency totaling means for converting the features representing the arrangement of the in-line rectangles in a test line into a symbol sequence and totaling the appearance frequencies of the symbols in the test line; and similarity determining means for determining similarity based on the symbol appearance frequencies totaled by the appearance frequency totaling means and the ordering tendencies of symbol sequences learned in advance.

According to the invention of claim 1, features representing the arrangement of the in-line rectangles of a character line image are extracted and quantized into a fixed number of levels to generate symbols. This makes it possible to extract the features of a character line without character recognition and to search efficiently on the content of character lines.

The document processing apparatus according to claim 2 is the invention according to claim 1, further comprising representative symbol generating means for extracting one or more representative features from among the plurality of features representing the arrangement of the in-line rectangles and generating symbols of a fixed number of types based on them.

According to the invention of claim 2, the processing for measuring and recording in-line rectangle features that are unnecessary for a given search target can be omitted, enabling more efficient document processing.

The document processing apparatus according to claim 3 is the invention according to claim 2, further comprising distance ratio symbol generating means for quantizing the ratio of the distance between the rectangle of interest and an adjacent rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types.

According to the invention of claim 3, the features of a character line can be defined in more detail, enabling stricter determination of character lines.

The document processing apparatus according to claim 4 is the invention according to claim 1, comprising: distance calculating means for calculating the distance between the end point of the in-line rectangle of interest and the start point of the adjacent in-line rectangle; and symbol sequence converting means for comparing the distance calculated by the distance calculating means with the line height and, when their ratio exceeds a certain value, inserting a blank symbol when converting the arrangement of the in-line rectangles into a symbol sequence.

According to the invention of claim 4, including information about the blank portions within a character line when defining its features enables even more accurate determination of character lines.
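The gap test of claim 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `insert_blank_symbols`, the `(xs, xe, symbol)` tuple layout, the blank marker `"_"`, and the threshold value 0.5 are all hypothetical, since the text only says the ratio must exceed "a certain value".

```python
def insert_blank_symbols(rects, line_height, gap_ratio=0.5, blank="_"):
    """Build a symbol sequence, adding a blank symbol at wide gaps.

    rects: list of (xs, xe, symbol) tuples sorted left to right
           (hypothetical layout: start x, end x, already-quantized symbol).
    gap_ratio: assumed threshold on gap / line_height.
    """
    sequence = []
    for i, (xs, xe, symbol) in enumerate(rects):
        sequence.append(symbol)
        if i + 1 < len(rects):
            next_xs = rects[i + 1][0]
            gap = next_xs - xe  # end of this rectangle to start of the next
            if gap / line_height > gap_ratio:
                sequence.append(blank)
    return sequence
```

Dividing the gap by the line height keeps the test independent of scan resolution, which matches the way the other features are normalized.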

The document processing apparatus according to claim 5 is the invention according to claim 2, further comprising in-line rectangle arrangement symbol generating means for vector-quantizing the plurality of features representing the arrangement of an in-line rectangle, with each feature assigned to one dimension of a multi-dimensional vector, to generate symbols of a fixed number of types that indicate the arrangement of the in-line rectangle.

According to the invention of claim 5, the features representing the arrangement of the rectangles are vector-quantized and converted into a symbol sequence. This makes it possible to extract the features of a character line without character recognition and to search efficiently on the content of character lines.
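One common way to realize the vector quantization of claim 5 is nearest-codeword assignment, sketched below in Python. The codebook, the squared Euclidean distance, and the name `vector_quantize` are assumptions made for illustration; the text does not say how the codebook is built or which distance is used.

```python
def vector_quantize(feature, codebook):
    """Return the index of the codebook vector nearest to `feature`.

    feature:  tuple of per-rectangle features, one per dimension
              (for example start height, height ratio, width ratio, density).
    codebook: list of reference vectors of the same length (assumed given).
    The returned index serves as the symbol for the rectangle.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda i: squared_distance(feature, codebook[i]))
```

With a codebook of K vectors, every rectangle maps to one of K symbols, so the symbol alphabet stays fixed regardless of how many features are combined.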

The document processing apparatus according to claim 6 is the invention according to claim 2, further comprising within-range similarity determining means for totaling the symbol information of the character lines generated by the representative symbol generating means over a specific range, such as the entire document or an entire predetermined region, and determining the similarity between a training document and a test document.

According to the invention of claim 6, a search on the content of the character lines within a predetermined region becomes possible.

The document processing apparatus according to claim 7 is the invention according to claim 1, comprising: character input means; a character font set; document image acquiring means for rendering character text input from the character input means into character fonts based on the character font set to obtain a character image; and in-line rectangle symbol generating means for generating a character string image from the character string of the character text input from the character input means and generating the in-line rectangle symbols of that character string image.

According to the invention of claim 7, a text search over a document image becomes possible without the character recognition that was conventionally performed. The character pattern dictionary conventionally required for character recognition is therefore unnecessary.

The document processing apparatus according to claim 8 is the invention according to claim 7, further comprising: symbol associating means for associating with each character, from the in-line rectangle symbols generated in advance character by character, the symbols that represent the arrangement of the rectangles within that character; and rectangle symbol converting means for converting the character string of the input text input from the character input means into a rectangle symbol sequence.

According to the invention of claim 8, a text character string can be converted directly into in-line rectangle symbols without going through a character image, which makes the processing more efficient.

The document processing apparatus according to claim 9 is an apparatus that performs predetermined image processing on an input document image, extracts image features, and performs document processing, the apparatus comprising: in-line height symbol generating means for quantizing the height, within the line, of the start point of an in-line rectangle of a character line image extracted from the document image into a fixed number of levels to generate symbols of a fixed number of types; line height estimating means for estimating the line height of the character line image; in-line rectangle height ratio symbol generating means for quantizing the ratio of the height of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; in-line rectangle width ratio symbol generating means for quantizing the ratio of the width of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; black pixel density symbol generating means for quantizing the black pixel density within the rectangle into a fixed number of levels to generate symbols of a fixed number of types; training line learning means for converting the features representing the arrangement of the in-line rectangles in a training line into a symbol sequence and learning its tendencies; evaluation value calculating means for converting the features representing the arrangement of the in-line rectangles in a test line into a symbol sequence and calculating an evaluation value for the test line using the learning result of the training line learning means; and similarity determining means for comparing the learning result of the training line learning means with the evaluation value of the test line calculated by the evaluation value calculating means and determining their similarity.

According to the invention of claim 9, features representing the arrangement of the in-line rectangles of a character line image are extracted and quantized into a fixed number of levels to generate symbols. This makes it possible to extract the features of a character line without character recognition and to search efficiently on the content of character lines.

The document processing apparatus according to claim 10 is a document processing apparatus that determines the similarity of character line images extracted from a document image, the apparatus comprising: in-line start point height quantizing means for quantizing the height, within the line, of the start point of an in-line rectangle of the character line image into a fixed number of levels; in-line rectangle height quantizing means for quantizing the height of the in-line rectangle of the character line image into a fixed number of levels; in-line rectangle width quantizing means for quantizing the width of the in-line rectangle of the character line image into a fixed number of levels; symbol generating means for generating a symbol from the arrangement determined by the start point height quantized by the in-line start point height quantizing means, the rectangle height quantized by the in-line rectangle height quantizing means, and the rectangle width quantized by the in-line rectangle width quantizing means; appearance frequency totaling means for totaling the appearance frequencies of the symbols generated by the symbol generating means for a test line; and similarity determining means for determining similarity based on the symbol appearance frequencies totaled by the appearance frequency totaling means and the ordering tendencies of symbol sequences learned in advance.

According to the invention of claim 10, features representing the arrangement of the in-line rectangles of a character line image are extracted and quantized into a fixed number of levels to generate symbols. This makes it possible to extract the features of a character line without character recognition and to search efficiently on the content of character lines.

The document processing method according to claim 11 is a method performed by a document processing apparatus that performs predetermined image processing on an input document image, extracts image features, and performs document processing, in which the document processing apparatus executes: an in-line height symbol generating step of quantizing the height, within the line, of the start point of an in-line rectangle of a character line image extracted from the document image into a fixed number of levels to generate symbols of a fixed number of types; a line height estimating step of estimating the line height of the character line image; an in-line rectangle height ratio symbol generating step of quantizing the ratio of the height of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; an in-line rectangle width ratio symbol generating step of quantizing the ratio of the width of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; a black pixel density symbol generating step of quantizing the black pixel density within the rectangle into a fixed number of levels to generate symbols of a fixed number of types; an appearance frequency totaling step of converting the features representing the arrangement of the in-line rectangles in a test line into a symbol sequence and totaling the appearance frequencies of the symbols in the test line; and a similarity determining step of determining similarity based on the symbol appearance frequencies in the test line totaled in the appearance frequency totaling step and the ordering tendencies of symbol sequences learned in advance.

The document processing program according to claim 12 is a program that performs predetermined image processing on an input document image, extracts image features, and performs document processing, the program causing a computer to execute: an in-line height symbol generating step of quantizing the height, within the line, of the start point of an in-line rectangle of a character line image extracted from the document image into a fixed number of levels to generate symbols of a fixed number of types; a line height estimating step of estimating the line height of the character line image; an in-line rectangle height ratio symbol generating step of quantizing the ratio of the height of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; an in-line rectangle width ratio symbol generating step of quantizing the ratio of the width of the in-line rectangle to the line height into a fixed number of levels to generate symbols of a fixed number of types; a black pixel density symbol generating step of quantizing the black pixel density within the rectangle into a fixed number of levels to generate symbols of a fixed number of types; an appearance frequency totaling step of converting the features representing the arrangement of the in-line rectangles in a test line into a symbol sequence and totaling the appearance frequencies of the symbols in the test line; and a similarity determining step of determining similarity based on the symbol appearance frequencies in the test line totaled in the appearance frequency totaling step and the ordering tendencies of symbol sequences learned in advance.

The document processing apparatus, document processing method, and document processing program according to the present invention extract features representing the arrangement of the in-line rectangles of a character line image and quantize them into a fixed number of levels to generate symbols. This has the effect that the features of a character line can be extracted without character recognition and that a search on the content of character lines can be performed efficiently.

Preferred embodiments of the document processing apparatus, document processing method, and document processing program according to the present invention are described in detail below with reference to the accompanying drawings.

(Hardware configuration of the document processing apparatus)
First, the hardware configuration of the document processing apparatus according to the embodiment of the present invention is described. FIG. 1 is a diagram showing the hardware configuration of this document processing apparatus. The apparatus comprises a CPU 101, a ROM 102, a RAM 103, an HDD (hard disk drive) 104, an HD (hard disk) 105, an FDD (flexible disk drive) 106, an FD 112, a display 107, a network board 108, a keyboard 109, a mouse 110, and a scanner 111, all connected by a bus 100.

The CPU 101 controls the entire apparatus. The ROM 102 stores a basic input/output program. The RAM 103 is used as the work area of the CPU 101. The HDD 104 controls the reading and writing of data on the HD 105 under the control of the CPU 101. The HD 105 stores the data written under the control of the HDD 104. The FDD 106 controls the reading and writing of data on the FD (flexible disk) 112 under the control of the CPU 101. The FD 112 is removable and stores the data written under the control of the FDD 106. The display 107 displays a cursor, menus, windows, and various data such as characters and images. The network board 108 connects to a network 114 via a communication cable 113. The keyboard 109 is used to input various information. The mouse 110 is used to move, select, open, and close the cursor, menus, and windows displayed on the display 107. The scanner 111 optically reads characters and images.

(Functional configuration of the document processing apparatus)
Next, the functional configuration of the document processing apparatus according to the embodiment of the present invention is described. FIG. 2 is a block diagram showing the functional configuration of this document processing apparatus. The apparatus comprises an image input unit 201, a rectangle extraction unit 202, a line cutout unit 203, a symbol generation unit 204, an appearance frequency totaling unit 205, a determination unit 206, and a display unit 207.

The image input unit 201 inputs the document image to be identified. The rectangle extraction unit 202 extracts rectangles from the document image input from the image input unit 201. The line cutout unit 203 cuts out in-line rectangles from the rectangles extracted by the rectangle extraction unit 202. The symbol generation unit 204 extracts, from the in-line rectangles cut out by the line cutout unit 203, features representing the arrangement of the rectangles, such as the height of the start point of each in-line rectangle, the rectangle size (height and width), the black pixel density, and the distance to the adjacent rectangle, and quantizes them to generate symbols. The appearance frequency totaling unit 205 applies to the symbol sequence generated by the symbol generation unit 204 a trigram table learned in advance, language by language, from training in-line rectangle symbol data, and calculates and totals the appearance probability of the symbol sequence for each language. From the totals produced by the appearance frequency totaling unit 205, the determination unit 206 judges that the language with the highest appearance probability is the language of the line being matched. The display unit 207 displays the input image and the progress and results of each process.
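The roles of the appearance frequency totaling unit 205 and the determination unit 206 can be sketched in Python as below. This is an illustration under assumptions rather than the apparatus itself: the function names, the use of log probabilities, and the add-one smoothing are all introduced here, since the text says only that a per-language trigram table is applied and the most probable language is chosen.

```python
import math
from collections import Counter

def train_trigrams(sequences):
    """Count symbol trigrams and their two-symbol contexts over training lines."""
    trigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            trigrams[tuple(seq[i:i + 3])] += 1
            contexts[tuple(seq[i:i + 2])] += 1
    return trigrams, contexts

def log_probability(seq, model, n_symbols):
    """Log probability of a symbol sequence under one trigram model.

    Add-one smoothing is an assumption, used so that trigrams unseen in
    training do not yield a probability of zero.
    """
    trigrams, contexts = model
    total = 0.0
    for i in range(len(seq) - 2):
        tri = tuple(seq[i:i + 3])
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + n_symbols))
    return total

def identify_language(seq, models, n_symbols):
    """Return the language whose model gives the sequence the highest probability."""
    return max(models, key=lambda lang: log_probability(seq, models[lang], n_symbols))
```

Here `models` is a hypothetical dictionary mapping a language name to the result of `train_trigrams`, and `n_symbols` is the size of the quantized symbol alphabet.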

The function of the image input unit 201 can be realized by the scanner 111 shown in FIG. 1. The functions of the rectangle extraction unit 202, the line cutout unit 203, the symbol generation unit 204, the appearance frequency totaling unit 205, and the determination unit 206 can be realized by the CPU 101 shown in FIG. 1. The function of the display unit 207 can be realized by the display 107 shown in FIG. 1.

Because the document processing apparatus of the present invention includes communication means (the network board 108), it can be connected to a network. For example, as shown in FIG. 3, connecting multiple document processing apparatuses to the network 114 allows data to be exchanged among them. Furthermore, if each functional unit of the document processing apparatus is provided with communication means, connecting the functional units to the network 114 allows the apparatus to be operated from a remote location.

The operation of the document processing apparatus according to the embodiment of the present invention is described in detail below. As an example, consider searching a document image such as the one shown in FIG. 4 for a specific line image. The query line image need not be identical to the one in the document image: the resolution may differ, as long as the shape matches as a partial line. An exact image match is not required.

For the document image input from the image input unit 201 (see FIG. 4), the rectangle extraction unit 202 extracts the circumscribed rectangles of the black pixels, as shown in FIG. 5. The line cutout unit 203 then performs the line cutout process on the circumscribed rectangles extracted by the rectangle extraction unit 202. The line cutout process connects neighboring circumscribed rectangles of FIG. 5 to one another and grows them into lines (see FIG. 6). Because this process can be carried out by well-known methods, its description is omitted.

Next, the process of quantizing the features that represent the arrangement of the rectangles, and the process of generating symbols from the quantized features, are described. These processes are performed by the symbol generation unit 204.

First, the quantization of the features representing the arrangement of the rectangles is described. FIG. 7-1 and FIG. 7-2 show examples of the arrangement of in-line rectangles. Comparing the in-line rectangles of the European text in FIG. 7-1 with those of the Asian text in FIG. 7-2 shows that, regardless of the language, the way the in-line rectangles are arranged changes with the content of the character line. Extracting the circumscribed rectangles of the characters therefore captures their rough features. That is, even without identifying the characters themselves, the image features of a character line can be captured simply by obtaining, for example, the start point (Xs, Ys) and end point (Xe, Ye) of each rectangle as shown in FIG. 8 and deriving from them the features that represent the arrangement of the circumscribed rectangles of the character image.

A single rectangle in a line is uniquely defined by measuring the height of its start point, its size (width and height), and the black pixel density within it. These measurements are used to define the arrangement of the in-line rectangles. Conveniently, the in-line rectangles have already been obtained during line segmentation, so no additional feature extraction is needed to characterize the character line.

The following shows an example of defining the arrangement of in-line rectangles based on the height of the start point. FIG. 9 illustrates the method for quantizing the features that represent the arrangement of in-line rectangles. When the document is not known in advance, the line height varies, so the start-point height of each in-line rectangle is normalized by the following equation to make the processing independent of the line height.

YsRate = ys / H ... (1)
(where ys is the height of the start point of the in-line rectangle and H is the line height.)

Since 0 < YsRate ≤ 1, YsRate is easily quantized into a fixed number of levels. For example, to quantize it into N levels, use
YsVal = INT(YsRate * (N - 1)) ... (2)
(where INT() truncates the fractional part.)
The levels are labeled 0 to (N - 1). Note that if the document is skewed when it is scanned, the character lines are skewed as well, as in FIG. 10. With extreme skew, line segmentation fails, but with slight skew the lines can still be segmented by using the blank space between them.
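Equations (1) and (2) can be sketched in a few lines of Python. The function name and the argument checks are illustrative assumptions; only the two formulas come from the text.

```python
def quantize_start_height(ys: float, line_height: float, levels: int) -> int:
    """Quantize the start-point height into `levels` stages (labels 0 .. levels-1)."""
    if line_height <= 0 or levels < 2:
        raise ValueError("line_height must be positive and levels at least 2")
    ys_rate = ys / line_height            # Equation (1): YsRate = ys / H
    return int(ys_rate * (levels - 1))    # Equation (2): INT() truncates the fraction


# A start point 30 px high in a 40 px line, quantized into 15 levels.
stage = quantize_start_height(ys=30, line_height=40, levels=15)
```

Because the ratio is taken against the line height, the same stage results for the same glyph shape at any scan resolution.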

However, when the start-point height of the in-line rectangles is used, even a slight skew of the line strongly affects the result. In FIG. 10, the distance from the end point to the start point of the in-line rectangles becomes spread evenly over the line height, and the clear concentration of frequencies at two positions that characterizes Western character lines can no longer be observed. A baseline is therefore determined, and the height from the baseline to the start point of each in-line rectangle is measured. The baseline is a straight line through the end points of the in-line rectangles; specifically, it is the regression line of the distribution of the end-point coordinates. Methods for obtaining a regression line are well known and are not described here; see, for example, I. Guttman and S. S. Wilks, 「工科系のための統計概論」 (an introduction to statistics for engineering; Baifukan).
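The baseline fit can be sketched as an ordinary least-squares line through the end points. This is a minimal illustration under stated assumptions: the function names are hypothetical, image y coordinates are assumed to grow downward, and the baseline is evaluated at the x coordinate of the start point, which the text does not specify.

```python
def fit_baseline(end_points):
    """Least-squares line y = a*x + b through the (xe, ye) end points."""
    n = len(end_points)
    if n < 2:
        raise ValueError("need at least two end points")
    mean_x = sum(x for x, _ in end_points) / n
    mean_y = sum(y for _, y in end_points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in end_points)
    if sxx == 0:
        raise ValueError("end points share a single x coordinate")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in end_points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def height_above_baseline(xs, ys, a, b):
    """Vertical distance from the baseline up to the start point (xs, ys)."""
    return (a * xs + b) - ys  # assumes y grows downward, as in image coordinates


# End points lying on a slightly skewed line y = 0.1*x + 50.
a, b = fit_baseline([(10, 51), (20, 52), (30, 53)])
ys_height = height_above_baseline(20, 30, a, b)
```

The resulting height can then be substituted for ys in Equation (1), so that a small skew no longer smears the start-point distribution.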

Through the above processing, the start-point height of the in-line rectangles can be quantized. Similarly, when the height of the in-line rectangle is used as a feature of the character line image, the following applies with reference to FIG. 9.
HeightRate = h / H ... (3)
HeightVal = INT((HeightRate * (N - 1)) + 0.5) ... (4)
(where h is the height of the rectangle and INT() truncates the fractional part.)
The levels are labeled 0 to (N - 1).

When the width of the rectangle is used, the following applies.
WidthRate = w / H ... (5)
WidthVal = INT((WidthRate * (N - 1)) + 0.5) ... (6)
(where w is the width of the rectangle and INT() truncates the fractional part.)
The levels are labeled 0 to (N - 1).
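Equations (3) to (6) share one form, so a single helper covers both height and width. The sketch below is illustrative; the function name is hypothetical, and the clamp to N - 1 is an assumption added here because a rectangle wider than the line height would otherwise exceed the stated label range.

```python
def quantize_rounded(value: float, line_height: float, levels: int) -> int:
    """Quantize value/H with rounding, as in Equations (4) and (6)."""
    if line_height <= 0 or levels < 2:
        raise ValueError("line_height must be positive and levels at least 2")
    rate = value / line_height                 # Equations (3) and (5)
    stage = int(rate * (levels - 1) + 0.5)     # adding 0.5 before truncation rounds
    return min(stage, levels - 1)              # assumption: clamp to the top label


height_val = quantize_rounded(20, 40, levels=8)   # rectangle height h = 20
width_val = quantize_rounded(60, 40, levels=2)    # rectangle width w = 60 > H
```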

An in-line rectangle is the circumscribed rectangle of a component of a character, obtained without regard to what the character is. Even when the arrangement of in-line rectangles is the same, however, European characters have simple structures, so the black pixel density within their rectangles is low, whereas Asian characters have complex structures, so the density is high. Within Asian scripts, too, it is easy to see that the structurally simple hiragana and katakana have a low black pixel density while kanji have a high one. The black pixel density of a rectangle can thus serve as a feature for distinguishing characters. The black pixel density (= number of black pixels in the rectangle / total number of pixels in the rectangle) is therefore quantized in the same way and defined in a fixed number of levels. The above shows that the arrangement of in-line rectangles can be defined by multiple measurements, which together define a single, independent in-line rectangle.
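The density feature can be sketched as follows. The bitmap representation (rows of 0/1 values, 1 meaning black) and the function names are assumptions for illustration, and since the text only says the density is quantized "in the same way", the rounding form of Equation (4) is assumed.

```python
def black_pixel_density(bitmap):
    """Black pixels divided by all pixels inside the rectangle."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        raise ValueError("empty rectangle")
    black = sum(sum(row) for row in bitmap)   # 1 = black pixel, 0 = white pixel
    return black / total


def quantize_density(density: float, levels: int) -> int:
    """Quantize a density in [0, 1] into labels 0 .. levels-1 (rounding assumed)."""
    return min(int(density * (levels - 1) + 0.5), levels - 1)


density = black_pixel_density([[1, 0], [1, 1]])
density_val = quantize_density(density, levels=4)
```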

Some of the measurements that define an in-line rectangle are unnecessary, depending on what is being searched. For example, if the lines to be searched are all Latin-script lines, the black pixel density is probably unnecessary. In Latin-script lines every character is of roughly the same structural complexity, so the black pixel densities of the in-line rectangles are nearly equal and do not help characterize the rectangles. Thus, depending on the nature of the set of query lines and searched lines, some features do not affect discrimination and need not be used. Of the multiple measurements, only those sufficient to distinguish matching lines from non-matching lines need be used, which improves processing efficiency.

The difference in in-line rectangle arrangement between European and Asian character lines also appears in the distance to the adjacent rectangle, as shown in FIG. 7. In Western character lines, the distance to the adjacent rectangle is usually positive, and rectangles rarely overlap. In Asian character lines, by contrast, overlap with the adjacent rectangle is frequently observed. In addition, each language has characters that are distinctive with respect to the distance to adjacent rectangles: letters such as "i" and "j" have a dot directly above a rectangle, German umlauts have two dots above a rectangle, and the Spanish ñ (eñe) has a thin rectangle above a rectangle. Quantizing this feature allows the arrangement of in-line rectangles to be defined in more detail. Specifically, for each rectangle shown in FIG. 11,
RightDistanceRate = d / H ... (7)
(where d is the distance between rectangles.)
RightDistanceVal = INT_PLUS((RightDistanceRate * (N - 1)) + 0.5) ... (8)
(where INT_PLUS() converts the value to a positive number and truncates the fractional part.)
are computed, and the ratio of the distance between the rectangle of interest and its adjacent rectangle is quantized into a fixed number of levels. The levels are labeled 0 to (N - 1). This allows the features of character lines containing many Asian characters to be defined in more detail, so that character lines can be judged more strictly.
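Equations (7) and (8) might be sketched as below. The text does not spell out how INT_PLUS makes a value positive; this sketch assumes that negative arguments (overlapping rectangles) are clamped to zero, and taking the absolute value would be another possible reading. The function name and the clamp to N - 1 are also assumptions.

```python
def quantize_right_distance(d: float, line_height: float, levels: int) -> int:
    """Quantize the distance to the right-hand neighbor into labels 0 .. levels-1."""
    if line_height <= 0 or levels < 2:
        raise ValueError("line_height must be positive and levels at least 2")
    rate = d / line_height                      # Equation (7)
    arg = rate * (levels - 1) + 0.5             # argument of INT_PLUS in Equation (8)
    stage = int(max(arg, 0.0))                  # assumption: negatives become 0
    return min(stage, levels - 1)               # assumption: clamp to the top label


gap_val = quantize_right_distance(10, 40, levels=8)       # ordinary gap
overlap_val = quantize_right_distance(-5, 40, levels=8)   # overlapping neighbor
```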

Next, the generation of symbols from the quantized features representing the rectangle arrangement is described. Here, the several kinds of measurements for one in-line rectangle are combined into a single symbol, so that each in-line rectangle corresponds to one symbol. For example, three kinds of information are combined: the start-point height, the height, and the width of the rectangle. Suppose that, in the processing above, the start-point height (ys/H) is quantized into 15 levels, the rectangle height (h/H) into 8 levels, and the rectangle width (w/H) into 2 levels. Then, as shown in FIG. 12, the start-point height, having 15 levels, can be represented in 4 bits; the rectangle height, having 8 levels, in 3 bits; and the rectangle width, having 2 levels, in 1 bit. Since
4 bits + 3 bits + 1 bit = 8 bits
all of the information fits in the bits of a single byte. The number of distinct symbols that combine these three kinds of information is
15 levels × 8 levels × 2 levels = 240
Needless to say, the kinds of information combined and the storage area and storage size used to hold them are not fixed; information suited to characterizing the character lines to be identified is selected as appropriate.
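The one-byte packing can be sketched as follows. The bit order (start-point height in the high 4 bits, then height, then width) is an assumption, since FIG. 12 is not reproduced in this text, and the function names are hypothetical.

```python
YS_LEVELS, HEIGHT_LEVELS, WIDTH_LEVELS = 15, 8, 2   # levels from the example above


def pack_symbol(ys_val: int, height_val: int, width_val: int) -> int:
    """Pack three quantized features into one byte: yyyy hhh w (assumed layout)."""
    if not (0 <= ys_val < YS_LEVELS
            and 0 <= height_val < HEIGHT_LEVELS
            and 0 <= width_val < WIDTH_LEVELS):
        raise ValueError("quantized value out of range")
    return (ys_val << 4) | (height_val << 1) | width_val


def unpack_symbol(symbol: int):
    """Recover (ys_val, height_val, width_val) from a packed symbol."""
    return symbol >> 4, (symbol >> 1) & 0b111, symbol & 0b1


largest = pack_symbol(14, 7, 1)
```

With this layout the 240 rectangle symbols occupy the values 0 to 239, which leaves 240 to 255 unused within the byte.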

The presence of blanks within a character line also characterizes the line. This is an especially important feature for Latin-script lines, where words are customarily separated by spaces. A blank within a line can be detected by comparing the distance between an in-line rectangle and its adjacent rectangle with the line height. For example, in FIG. 13, a threshold is set for the ratio of the inter-rectangle distance to the line height (a/H, b/H, c/H). Each ratio is compared with the threshold, and a blank is judged to be present when the ratio exceeds the threshold. When a blank is detected, a symbol denoting a blank (for example, sSPC) is inserted. In the earlier example, there are 240 symbols for rectangle arrangement information while the storage size is 1 byte, so 16 (= 256 - 240) additional special symbols can be defined. The blank symbol sSPC is assigned to one of these 16.
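Blank insertion can be sketched as a single pass over the rectangles of a line. The value 240 for sSPC and the threshold of 0.3 are placeholders chosen for illustration; the text only says that sSPC is one of the 16 spare values and that some threshold is set.

```python
S_SPC = 240  # assumption: any one of the 16 spare byte values (240 .. 255)


def insert_blank_symbols(symbols, gaps, line_height, threshold=0.3):
    """Insert S_SPC wherever gap / line_height exceeds the threshold.

    symbols[i] is the symbol of rectangle i; gaps[i] is the distance between
    rectangle i and rectangle i + 1.
    """
    if len(gaps) != max(len(symbols) - 1, 0):
        raise ValueError("expected one gap between each pair of neighbors")
    out = []
    for i, symbol in enumerate(symbols):
        out.append(symbol)
        if i < len(gaps) and gaps[i] / line_height > threshold:
            out.append(S_SPC)
    return out


sequence = insert_blank_symbols([1, 2, 3], gaps=[2, 20], line_height=40)
```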

Alternatively, if the features representing the arrangement of a rectangle are regarded as the dimensions of a multidimensional vector, each rectangle can be converted into a single vector and then vector-quantized. As is well known, vector quantization finds a small number of representative vectors for a large variety of vector data. By labeling the representative vectors in order, a sequence of vectors can be converted into a simple one-dimensional sequence of symbols. For details on vector quantization, see Allen Gersho and Robert M. Gray, 「ベクトル量子化と情報圧縮」 ("Vector Quantization and Information Compression"), translated by Saburo Tazaki et al. (Corona Publishing).

Once the rectangles have been converted into a symbol sequence in this way, the ordering tendency of the sequence can be learned, as described earlier. For example, three-dimensional vectors describing rectangle arrangement are obtained from training data, and 240 representative vectors are derived from them. This set of representative vectors is called a codebook. The IDs that distinguish the 240 vectors in the codebook are the symbols. The arrangement of each in-line rectangle in the character line to be identified is converted into a three-dimensional vector, the codebook vector most similar to it is selected, and the ID of that vector becomes the symbol of the rectangle.
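The codebook lookup can be sketched as a nearest-neighbor search. Squared Euclidean distance is assumed here as the similarity measure, which the text does not specify, and the three-entry codebook is a made-up stand-in for the 240 trained vectors.

```python
def nearest_symbol(vector, codebook):
    """Return the index (symbol ID) of the codebook vector closest to `vector`."""
    if not codebook:
        raise ValueError("codebook is empty")

    def squared_distance(candidate):
        return sum((a - b) ** 2 for a, b in zip(vector, candidate))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


# Hypothetical codebook of (ys/H, h/H, w/H) vectors.
codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.2, 0.9)]
symbol = nearest_symbol((0.9, 0.8, 1.0), codebook)
```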

Through the above steps, the rectangles in a line are converted into symbols (labels) drawn from a fixed set. The actual arrangement of in-line rectangles can therefore be regarded as a simple symbol sequence, as shown in FIG. 13. Recording the ordering tendency of the symbol sequence is then equivalent to recording the ordering tendency of the in-line rectangles. Once converted into symbol sequences, lines can be searched with general search techniques, just as in text search; that is, one looks for exact matches between symbol sequences. However, the measured features of the character rectangles vary with reading errors in the character line image, so the symbol sequences may differ even for identical character line images. Searching only for exact matches of symbol sequences may therefore fail to find the same character line image.

The document processing apparatus of the present invention therefore computes the similarity of symbol ordering tendencies rather than exact matches of symbol sequences. Specifically, trigram tables learned in advance for each language from training in-line rectangle symbol data are applied to the converted symbol sequence, and the occurrence probability of the symbol sequence is computed and tallied for each language. This processing is performed by the appearance frequency counting unit 205 and is detailed below.

The n-gram model is one way to record ordering tendencies. It is a language model proposed by Claude Elwood Shannon. Suppose that the occurrence of a symbol in a sequence is influenced by the n immediately preceding symbols (n being a natural number). A stochastic process whose current state depends on the previous n inputs is called an n-th order Markov process, and the n-gram model is accordingly also called an (n-1)-th order Markov model. The case n = 3 is called a trigram and is widely used.

Specifically, the model is given by Equation (9). Then, following Equation (10), the occurrence frequencies of symbol triples are counted from the training symbol sequence data.
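Equations (9) and (10) are not reproduced in this text, so the sketch below shows only the counting step the paragraph describes: tallying every run of three consecutive symbols. The function name is an illustrative assumption.

```python
from collections import Counter


def count_trigrams(symbols):
    """Count each run of three consecutive symbols in the sequence."""
    return Counter(zip(symbols, symbols[1:], symbols[2:]))


tally = count_trigrams([1, 2, 3, 1, 2, 3])
```

A sequence shorter than three symbols simply yields an empty tally.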

In addition, the trigrams are ranked by occurrence frequency. Table 1 shows an example of a trigram tally.

Computing a trigram tally such as Table 1 for a character line corresponds to obtaining (learning) the features of that line. The arrangement of in-line rectangles in the character line to be searched for is converted into a symbol sequence in the same way as during learning, and its trigram tally is then computed.

The method of computing the similarity between character lines from trigram tallies clearly applies not only to individual character lines but also to judging similarity per region (a set of character lines) or per document. In the regions to be compared, line segmentation is applied, each line is converted into a symbol sequence, and the rectangle trigrams are tallied per region (that is, the symbolized character-line information is tallied as trigrams over a specific range such as an entire document or an entire region). The rank correlation coefficient computed from the trigram tallies then serves as a criterion for judging the similarity between regions.

Finally, the trigram tally above is compared with the trigram tallies learned from the character lines to be searched, and the most similar one is selected. That is, the language showing the highest occurrence frequency is judged to be the language of the line being matched. This processing is performed by the determination unit 206 and is detailed below.

First, because the number of in-line rectangles in a line differs between the query line and the searched lines, the occurrence frequencies themselves cannot be compared directly. The similarity of trigram tally tables is therefore judged with the rank correlation coefficient given by the following equation. Methods for computing the rank correlation coefficient are well known and are not described here; see, for example, Takashi Yanagawa, 「ノンパラメトリック法」 ("Nonparametric Methods"; Baifukan).
Rxy = 1 - (6 * Σ(Rxi - Ryi)^2) / (n * (n^2 - 1)) ... (11)
(where n is the number of data items and Rxi and Ryi are the ranks of the data.)

The rank correlation coefficient between the trigram tallies of the query line and each searched line is then computed, and the line whose coefficient is closest to 1 is selected. The rank correlation coefficient may also be tested statistically, and if the largest coefficient is not significant, the search may be judged to have no match.
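Equation (11) applied to two trigram tallies can be sketched as below. Several choices are assumptions not stated in the text: ranks are taken over the union of trigrams seen in either tally, a trigram missing from one tally counts as frequency 0, and tied frequencies share the mean of their ranks. The function names are hypothetical.

```python
def average_ranks(counts, keys):
    """Rank keys by descending count; tied counts share the mean of their ranks."""
    ordered = sorted(keys, key=lambda k: -counts.get(k, 0))
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and counts.get(ordered[j], 0) == counts.get(ordered[i], 0):
            j += 1
        mean_rank = (i + 1 + j) / 2          # mean of the ranks i+1 .. j
        for key in ordered[i:j]:
            ranks[key] = mean_rank
        i = j
    return ranks


def rank_correlation(counts_x, counts_y):
    """Rank correlation coefficient of Equation (11) between two tallies."""
    keys = sorted(set(counts_x) | set(counts_y))
    n = len(keys)
    if n < 2:
        raise ValueError("need at least two distinct trigrams")
    rx = average_ranks(counts_x, keys)
    ry = average_ranks(counts_y, keys)
    d2 = sum((rx[k] - ry[k]) ** 2 for k in keys)
    return 1 - (6 * d2) / (n * (n ** 2 - 1))


same_order = rank_correlation({"a": 5, "b": 3, "c": 1}, {"a": 50, "b": 30, "c": 10})
reversed_order = rank_correlation({"a": 5, "b": 3, "c": 1}, {"a": 1, "b": 3, "c": 5})
```

Because only ranks enter the formula, tallies from lines of very different lengths remain comparable, which is the reason the text gives for not comparing raw frequencies.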

The processing so far is summarized in FIG. 14. The in-line rectangles of the image to be matched against are converted into symbols in advance (step S1401), trigrams are tallied within a predetermined region (step S1402), and a trigram occurrence frequency table is created (step S1403). Likewise, the in-line rectangles of the image being matched are converted into symbols (step S1404), trigrams are tallied within a predetermined region (step S1405), and a trigram occurrence frequency table is created (step S1406). Finally, the table created in step S1403 is compared with the table created in step S1406 and the rank correlation coefficient is computed (step S1407), whereby the language showing the highest occurrence probability can be judged to be the language of the line being matched.

The discussion so far has concerned matching character line images against each other, but if a character line image can be created from text data, lines containing specified characters can be searched for. Font data can be used to obtain a character image from text data. For example, vector data such as a TrueType font is rendered to produce character bitmap data (an image), and the specified character string (text) is thereby converted into a symbol sequence. For this purpose, the symbol generation unit 204 must additionally have a character font set and a function that renders the fonts for the given text, generates a character string image from the text string, and converts it into in-line rectangle symbols. Once both the query string and the searched strings have been converted into in-line rectangle symbol sequences, the portions where the symbol sequences match exactly are found, as in ordinary text search. This makes text search over document images possible without character recognition. Clearly, the character pattern dictionary required for character recognition is unnecessary.
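The exact-match step can be sketched as a plain subsequence search over symbol lists. The function name is hypothetical, and the symbol values are made up for illustration.

```python
def find_exact_matches(query, target):
    """Return every start index at which `query` occurs in `target` (both lists)."""
    m = len(query)
    if m == 0:
        return []
    return [i for i in range(len(target) - m + 1) if target[i:i + m] == query]


positions = find_exact_matches([2, 3], [1, 2, 3, 2, 3])
```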

Generating a character image first and then converting it into in-line rectangle symbols has the advantage that, as long as the font sets are available, variation in the in-line rectangle symbols due to font differences can be taken into account. However, this conversion requires not only font data for every character but also computation to generate the character images. It is therefore preferable to provide a further function that prepares the rectangle symbol conversion result for each character in advance, associates each character with the symbols representing the arrangement of the rectangles within it, and converts the character string of the input text into a rectangle symbol sequence. If the converted in-line rectangle symbols for each character are computed and recorded in advance in this way, a text string can be converted into in-line rectangle symbols without going through a character image. FIG. 15 shows the correspondence between character codes and rectangle symbol conversion results. Because a character may contain more than one rectangle, a single character may be converted into a sequence of several symbols.
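The table-driven conversion can be sketched with a dictionary from characters to precomputed symbol lists. The table contents below are invented placeholders, not values from FIG. 15, and the function name is hypothetical.

```python
# Hypothetical table: "i" has two rectangles (stem and dot), so it maps to two symbols.
CHAR_TO_SYMBOLS = {
    "a": [83],
    "i": [33, 87],
    "t": [41],
}


def text_to_symbols(text, table):
    """Convert a text string into a rectangle symbol sequence via a lookup table."""
    symbols = []
    for ch in text:
        if ch not in table:
            raise KeyError(f"no rectangle symbols recorded for {ch!r}")
        symbols.extend(table[ch])   # a character may contribute several symbols
    return symbols


query_symbols = text_to_symbols("ait", CHAR_TO_SYMBOLS)
```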

(Document processing procedure)
The document processing procedure using the document processing apparatus of the present invention is described below. FIG. 16 is a flowchart showing this procedure. First, the image input unit 201 inputs the document image to be identified (step S1601). Next, the rectangle extraction unit 202 extracts rectangles from the input document image (step S1602). The line segmentation unit 203 then segments the in-line rectangles from the rectangles extracted by the rectangle extraction unit 202 (step S1603). The symbol generation unit 204 extracts the features representing the rectangle arrangement from the in-line rectangles segmented by the line segmentation unit 203 and quantizes them to generate symbols (step S1604). The appearance frequency counting unit 205 applies, to each symbol sequence generated by the symbol generation unit 204, the trigram tables learned in advance for each language from training in-line rectangle symbol data, and computes and tallies the occurrence frequency of the symbol sequence for each language (step S1605). Finally, from the tallies produced by the appearance frequency counting unit 205, the determination unit 206 judges the language showing the highest occurrence frequency to be the language of the line being matched (step S1606).

(Rectangle arrangement symbol generation procedure)
Next, the rectangle arrangement symbol generation of step S1604 is described in more detail. FIG. 17 is a flowchart showing this procedure. First, the line height of the characters in the document image is estimated (step S1701). Next, the in-line rectangles are classified according to the position of their start points relative to the line height (step S1702). The features representing the arrangement of each classified in-line rectangle (start-point height, rectangle size (height and width), black pixel density, distance to the adjacent rectangle, and so on) are then measured (step S1703). The features representing the rectangle arrangement are quantized to generate symbols (step S1704). Finally, the symbol sequence generated in step S1704 is recorded (step S1705).

With the above processing, the language of a target line can be classified by extracting features that represent the arrangement of in-line rectangles (features obtained in the course of line segmentation, rather than through a separate line-feature extraction step). As a result, language identification can be performed at high speed, and a criterion for selecting the document processing best suited to the language can be obtained from the identification result. Highly accurate document processing can thus be realized.

As described above, the document processing apparatus, document processing method, and document processing program according to the present invention extract the features representing the arrangement of in-line rectangles in a character line image and quantize them into a fixed number of levels to generate symbols. This makes it possible to extract the features of character lines without character recognition and to search efficiently on the content of character lines.

The document processing method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program may also be embodied in a transmission medium that can be distributed over a network such as the Internet.

As described above, the document processing apparatus, document processing method, and document processing program according to the present invention are useful for character identification processing that requires efficient extraction of the shape features of character lines, and are particularly suited to character recognition apparatus and the like.

A diagram showing the hardware configuration of the document processing apparatus according to an embodiment of the present invention.
A block diagram showing the functional configuration of the document processing apparatus according to the embodiment of the present invention.
A diagram showing an example of a network configuration using the document processing apparatus according to the embodiment of the present invention.
A diagram showing an example of a document image input to the document processing apparatus.
A diagram showing an example of circumscribed rectangles of black pixels obtained from a document image.
A diagram for explaining the line segmentation process.
A diagram showing an example of the arrangement of in-line rectangles.
A diagram showing an example of the arrangement of in-line rectangles.
A diagram for explaining an example of setting coordinates for a rectangle.
A diagram for explaining a method of quantizing the features representing the arrangement state of in-line rectangles.
A diagram showing an example of symbols generated from the features representing the arrangement state of in-line rectangles.
A diagram for explaining the blank symbol insertion process based on the distance between rectangles.
A diagram for explaining quantization of the distance between rectangles.
A diagram for explaining the blank symbol insertion process based on the distance between rectangles.
A flowchart showing the procedure of document image matching using rectangle trigrams.
A table for explaining direct conversion from character codes to rectangle symbols.
A flowchart showing the procedure of document processing.
A flowchart showing the procedure of the rectangle arrangement state symbol generation process.

Explanation of symbols

100 Bus
101 CPU
102 ROM
103 RAM
104 HDD (hard disk drive)
105 HD (hard disk)
106 FDD (flexible disk drive)
107 Display
108 Network board
109 Keyboard
110 Mouse
111 Scanner
112 FD (flexible disk)
113 Communication cable
114 Network
201 Image input unit
202 Rectangle extraction unit
203 Line segmentation unit
204 Symbol generation unit
205 Appearance frequency totaling unit
206 Determination unit
207 Display unit

Claims

A document processing apparatus that performs predetermined image processing on an input document image, extracts image features, and performs document processing, the apparatus comprising:
in-line height symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the in-line height of the start point of an in-line rectangle of a character line image extracted from the document image;
line height estimating means for estimating the line height of the character line image;
in-line rectangle height ratio symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the height of the in-line rectangle to the line height;
in-line rectangle width ratio symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the width of the in-line rectangle to the line height;
black pixel density symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the black pixel density within the rectangle;
appearance frequency totaling means for converting the features representing the arrangement state of the in-line rectangles in a test line into a symbol series, and totaling the appearance frequency of the in-line rectangle symbols in the test line; and
similarity determination means for determining similarity based on the appearance frequency of the symbols totaled by the appearance frequency totaling means and the arrangement tendency of a symbol series learned in advance.

The document processing apparatus according to claim 1, further comprising representative symbol generating means for extracting one or more representative features from a plurality of features representing the arrangement state of the in-line rectangles, and generating a fixed type of symbol based on the extracted features.

The document processing apparatus further comprising distance ratio symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the distance between the rectangle of interest and an adjacent rectangle to the line height.

The document processing apparatus according to claim 1, further comprising:
distance calculation means for calculating the distance between the end point of an in-line rectangle of interest and the start point of the adjacent in-line rectangle; and
symbol series conversion means for comparing the distance calculated by the distance calculation means with the line height, inserting a blank symbol when the ratio exceeds a certain value, and converting the arrangement state of the in-line rectangles into a symbol series.

The document processing apparatus according to claim 2, further comprising in-line rectangle arrangement state symbol generating means for vector-quantizing a plurality of features representing the arrangement state of the in-line rectangles, with each feature corresponding to one dimension of a multi-dimensional vector, and generating a fixed type of symbol indicating the arrangement state of the in-line rectangles.

The document processing apparatus according to claim 2, further comprising within-range similarity determination means for totaling the symbol information of the character lines generated by the representative symbol generating means over a specific range, such as the entire document or an entire predetermined region, and determining the similarity between a training document and a test document.

The document processing apparatus according to claim 1, further comprising:
character input means;
a character font set;
document image acquisition means for rendering the character text input from the character input means using the character font set to obtain a character image; and
in-line rectangle symbol generating means for generating a character string image from a character string of the character text input from the character input means, and generating in-line rectangle symbols of the character string image.

The document processing apparatus according to claim 7, further comprising:
symbol association means for associating, with each character, a symbol representing the arrangement state of the rectangles within that character, using in-line rectangle symbols generated in advance for each character; and
rectangle symbol conversion means for converting a character string of the input text input from the character input means into a rectangle symbol series.

A document processing apparatus that performs predetermined image processing on an input document image, extracts image features, and performs document processing, the apparatus comprising:
in-line height symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the in-line height of the start point of an in-line rectangle of a character line image extracted from the document image;
line height estimating means for estimating the line height of the character line image;
in-line rectangle height ratio symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the height of the in-line rectangle to the line height;
in-line rectangle width ratio symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the width of the in-line rectangle to the line height;
black pixel density symbol generating means for generating a fixed type of symbol by quantizing, into fixed stages, the black pixel density within the rectangle;
training line learning means for converting the features representing the arrangement state of the in-line rectangles in a training line into a symbol series and learning its tendency;
evaluation value calculating means for converting the features representing the arrangement state of the in-line rectangles in a test line into a symbol series, and calculating an evaluation value using the learning result of the training line obtained by the training line learning means; and
similarity determination means for comparing the learning result of the training line obtained by the training line learning means with the evaluation value of the test line calculated by the evaluation value calculating means, and determining similarity.

A document processing apparatus for determining similarity between character line images extracted from a document image, the apparatus comprising:
in-line start point height quantization means for quantizing, into fixed stages, the in-line height of the start point of an in-line rectangle of a character line image;
in-line rectangle height quantization means for quantizing, into fixed stages, the height of the in-line rectangle of the character line image;
in-line rectangle width quantization means for quantizing, into fixed stages, the width of the in-line rectangle of the character line image;
symbol generating means for generating a symbol from the arrangement state determined based on the start point height quantized by the in-line start point height quantization means, the height of the in-line rectangle quantized by the in-line rectangle height quantization means, and the width of the in-line rectangle quantized by the in-line rectangle width quantization means;
appearance frequency totaling means for totaling the appearance frequency of the symbols generated for a test line by the symbol generating means; and
similarity determination means for determining similarity based on the appearance frequency of the symbols totaled by the appearance frequency totaling means and the arrangement tendency of a symbol series learned in advance.

A document processing method performed by a document processing apparatus that performs predetermined image processing on an input document image, extracts image features, and performs document processing, wherein the document processing apparatus performs:
an in-line height symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the in-line height of the start point of an in-line rectangle of a character line image extracted from the document image;
a line height estimating step of estimating the line height of the character line image;
an in-line rectangle height ratio symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the height of the in-line rectangle to the line height;
an in-line rectangle width ratio symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the width of the in-line rectangle to the line height;
a black pixel density symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the black pixel density within the rectangle;
an appearance frequency totaling step of converting the features representing the arrangement state of the in-line rectangles in a test line into a symbol series, and totaling the appearance frequency of the symbols in the test line; and
a similarity determination step of determining similarity based on the appearance frequency of the symbols in the test line totaled in the appearance frequency totaling step and the arrangement tendency of a symbol series learned in advance.

A document processing program that performs predetermined image processing on an input document image, extracts image features, and performs document processing, the program causing a computer to execute:
an in-line height symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the in-line height of the start point of an in-line rectangle of a character line image extracted from the document image;
a line height estimating step of estimating the line height of the character line image;
an in-line rectangle height ratio symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the height of the in-line rectangle to the line height;
an in-line rectangle width ratio symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the ratio of the width of the in-line rectangle to the line height;
a black pixel density symbol generation step of generating a fixed type of symbol by quantizing, into fixed stages, the black pixel density within the rectangle;
an appearance frequency totaling step of converting the features representing the arrangement state of the in-line rectangles in a test line into a symbol series, and totaling the appearance frequency of the symbols in the test line; and
a similarity determination step of determining similarity based on the appearance frequency of the symbols in the test line totaled in the appearance frequency totaling step and the arrangement tendency of a symbol series learned in advance.