JPH11191135A

JPH11191135A - Japanese-English determination method of document image, document recognition method, and recording medium

Info

Publication number: JPH11191135A
Application number: JP10125103A
Authority: JP
Inventors: Tooru Mizuna; 水納　　亨; Takashi Saito; 高志齋藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-09-10
Filing date: 1998-05-07
Publication date: 1999-07-13
Anticipated expiration: 2018-05-07
Also published as: JP3835652B2

Abstract

(57) [Abstract]
[Problem] To distinguish Japanese from English accurately and at high speed, and to make the determination both for each character region and for each page.
[Solution] An input document image (101) is reduced (102), black-pixel connected components are extracted (103), and the components are merged to generate character regions (104). For each generated character region, the Japanese-English determination means 105 classifies the connected components by their length and, based on the tallied classification results, determines whether the region is a Japanese region or an English region.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a Japanese-English determination method for document images, and a recording medium, that determine whether each character region in a document image is a Japanese region or an English region. It also relates to a document recognition method and a recording medium that perform recognition processing after determining whether a document image is a Japanese document image or an English document image.

[0002]

[Prior Art] When character recognition is applied to a document image, an appropriate language must be selected. An English OCR engine applied to Japanese text cannot recognize anything other than alphabetic characters and digits. Conversely, a Japanese OCR engine applied to English text gives a lower recognition rate than an English OCR engine, because of differences in character segmentation and language processing.

[0003] Language identification must therefore be performed before character recognition. Various methods for identifying the character types in a document have been proposed. For example, one document recognition device counts the number of black-white transitions in the vertical or horizontal direction of a binarized text line and identifies the character type from their distribution (see Japanese Patent Application Laid-Open No. 5-108876).

[0004] Another document recognition device recognizes the words it reads and determines the language of the recognized characters from the rate at which the recognition results match a dictionary (see Japanese Patent Application Laid-Open No. 6-150061).

[0005]

[Problems to Be Solved by the Invention] The former device uses the number of black-white transitions as the feature for identifying the character type. This feature varies widely with the font and with the document content (the proportions of kana, kanji, digits, and so on), so identification accuracy is low.

[0006] The latter device, by contrast, performs character recognition once, so if OCR performance is good the character type is found with fairly high probability and Japanese-English determination can be accurate. However, OCR takes a long time to process.

[0007] The present invention was made in view of these circumstances. Its object is to provide a Japanese-English determination method for document images, and a recording medium, that distinguish Japanese from English accurately and at high speed and that can make the determination both for each character region and for each page. A further object is to provide a document recognition method and a recording medium that determine the language of a document image and perform the optimal document recognition processing.

[0008]

[Means for Solving the Problems] To achieve the above object, the invention of claim 1 is a Japanese-English determination method for document images that determines whether each character region in a document image is a Japanese region or an English region, characterized in that a plurality of determination methods are used to determine whether the region is a Japanese region or an English region, and a final determination result is obtained by comparing the plurality of determination results.

[0009] The invention of claim 2 is a Japanese-English determination method for document images that determines whether each character region in a document image is a Japanese region or an English region, characterized in that the black-pixel connected components in a character region generated by reducing the document image are classified based on their length, and whether each character region is a Japanese region or an English region is determined based on the tallied classification results.

[0010] The invention of claim 3 is characterized in that a different determination method is used when the number of black-pixel connected components in the generated character region does not satisfy a predetermined condition.

[0011] The invention of claim 4 is a Japanese-English determination method for document images that determines whether the document image of each page is a Japanese document image or an English document image, characterized in that the black-pixel connected components in a page generated by reducing the document image are classified based on their length, and whether each page is a Japanese region or an English region is determined based on the tallied classification results.

[0012] The invention of claim 5 is a Japanese-English determination method for document images in which a page consists of a plurality of character regions and which determines whether the document image of each page is a Japanese document image or an English document image, characterized in that the black-pixel connected components in a character region generated by reducing the document image are classified based on their length, whether each character region is a Japanese region or an English region is determined based on the tallied classification results, and whether each page is a Japanese region or an English region is determined based on those determination results.

[0013] The invention of claim 6 is a Japanese-English determination method for document images that determines whether each character region in a document image is a Japanese region or an English region, characterized in that lines are detected in the character region, adjacent circumscribed rectangles within each line are merged to extract blocks, each block is determined to be a Japanese region, an English region, or an undeterminable region, the determination results are tallied block by block, and whether each character region is a Japanese region or an English region is determined based on the tallied values.

[0014] The invention of claim 7 is characterized in that a different determination method is used when the number of extracted blocks does not satisfy a predetermined condition.

[0015] The invention of claim 8 is a Japanese-English determination method for document images in which a page consists of a plurality of character regions and which determines whether the document image of each page is a Japanese document image or an English document image, characterized in that lines are detected in the character regions, adjacent circumscribed rectangles within each line are merged to extract blocks, each block is determined to be a Japanese region, an English region, or an undeterminable region, the determination results are tallied per page, and whether each page is a Japanese document image or an English document image is determined based on the tallied values.

[0016] The invention of claim 9 is a Japanese-English determination method for document images in which a page consists of a plurality of character regions and which determines whether the document image of each page is a Japanese document image or an English document image, characterized in that lines are detected in the character regions, adjacent circumscribed rectangles within each line are merged to extract blocks, each block is determined to be a Japanese region, an English region, or an undeterminable region, the determination results are tallied for each character region, each character region is determined to be a Japanese region or an English region based on those tallied values, the region-level determination results are tallied per page, and whether each page is a Japanese document image or an English document image is determined based on the page-level tallied values.

[0017] The invention of claim 10 is characterized in that whether a document image is a Japanese document image or an English document image is determined, and document recognition processing is performed according to the determination result.

[0018] The invention of claim 11 is characterized in that a document image is divided into a plurality of character regions, each of the divided character regions is determined to be a Japanese document region or an English document region, and document recognition processing is performed according to the determination result.

[0019] The invention of claim 12 is a computer-readable recording medium on which is recorded a program that, in order to determine whether each character region in a document image is a Japanese region or an English region, causes a computer to implement a function of determining whether the region is a Japanese region or an English region using a plurality of determination methods, and a function of obtaining a final determination result by comparing the plurality of determination results.

[0020] The invention of claim 13 is a computer-readable recording medium on which is recorded a program that, in order to determine whether each character region in a document image, or the document image of each page, is a Japanese region or an English region, causes a computer to implement a function of classifying the black-pixel connected components in a character region or page generated by reducing the document image based on their length, and a function of determining whether each character region or each page is a Japanese region or an English region based on the tallied classification results.

[0021] The invention of claim 14 is a computer-readable recording medium on which is recorded a program that, in order to determine whether each character region in a document image is a Japanese region or an English region, or, where a page consists of a plurality of character regions, whether the document image of each page is a Japanese document image or an English document image, causes a computer to implement: a function of detecting lines in the character region; a function of merging adjacent circumscribed rectangles within each line to extract blocks; a function of determining whether each block is a Japanese region, an English region, or an undeterminable region; a function of tallying the determination results block by block or per page; and a function of determining, based on the tallied values, whether each character region is a Japanese region or an English region, or whether each page is a Japanese document image or an English document image.

[0022] The invention of claim 15 is a computer-readable recording medium on which is recorded a program that causes a computer to implement a function of determining whether a document image is a Japanese document image or an English document image, or of dividing a document image into a plurality of character regions and determining whether each divided character region is a Japanese document region or an English document region, and a function of performing document recognition processing according to the determination result.

[0023]

[Embodiments of the Invention] An embodiment of the present invention is described concretely below with reference to the drawings. <Embodiment 1> FIG. 1 shows the configuration of Embodiment 1 of the present invention. In the figure, 101 is an image input means that inputs a document image; 102 is an image reduction means that reduces the input document image; 103 is a connected component extraction means that extracts connected components from the document image; 104 is a region generation means that generates character regions by classifying and merging the extracted connected components; 105 is a Japanese-English determination means that distinguishes Japanese from English per character region or per page; 106 is a control unit that controls the whole apparatus; 107 is a data storage unit that stores various data such as the input document image data, connected component data, and region data; 108 is a data communication path; and 109 is a data communication means that connects to a host or the like via a network, a line, or the like.

[0024] FIG. 2 shows the overall processing flowchart of Embodiment 1 of the present invention. The processing operation of the present invention is described below with reference to FIG. 2. First, the image input means 101 obtains a document image by reading a document (step 201). The image input means is, for example, a scanner or a facsimile machine; the image may also be obtained from another device over a network through the data communication means 109.

[0025] Next, the image reduction means 102 reduces the input document image (step 202). This is, for example, an OR reduction of the input document image to about 1/8: each block of 8 × 8 pixels is reduced to one pixel, and the reduced pixel is black if even one of the 64 pixels is black.
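The OR reduction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `or_reduce`, the list-of-rows image representation (1 = black), and the handling of partial tiles at the image edge are all assumptions made for the example.

```python
def or_reduce(image, factor=8):
    """Shrink a binary image by `factor` using OR reduction.

    `image` is a list of rows of 0/1 values (1 = black). Each output pixel is
    black if any pixel in the corresponding factor x factor tile is black.
    Tiles at the right and bottom edges may be smaller than factor x factor.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            tile_has_black = any(
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            )
            row.append(1 if tile_has_black else 0)
        reduced.append(row)
    return reduced


# A 16 x 16 page with a single black pixel at row 3, column 10.
page = [[0] * 16 for _ in range(16)]
page[3][10] = 1
print(or_reduce(page))  # [[0, 1], [0, 0]]
```

Because a single black pixel is enough to blacken a reduced pixel, strokes that lie within a few pixels of each other in the original tend to touch in the reduced image, which is the effect the later steps rely on.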

[0026] The connected component extraction means 103 extracts black-pixel connected components from the reduced image (step 203). The region generation means 104 classifies the extracted connected components and merges them to generate character regions (step 204). A known method, for example the one described in Japanese Patent Application Laid-Open No. 6-20092, may be used for region generation. The information on the connected components that make up each character region is stored and held in the data storage unit 107.
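As an illustration of step 203, the sketch below extracts black-pixel connected components from a small binary image and returns their circumscribed rectangles. It is a generic flood-fill labeling, not the method of the cited publication; the function name, the use of 8-connectivity, and the inclusive `(left, top, right, bottom)` box format are assumptions of this example.

```python
from collections import deque


def connected_component_boxes(image):
    """Return bounding boxes (left, top, right, bottom) of black components.

    `image` is a list of rows of 0/1 values (1 = black). 8-connectivity is an
    assumption of this sketch; box coordinates are inclusive.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            left = right = x0
            top = bottom = y0
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes


reduced = [
    [1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
print(connected_component_boxes(reduced))  # [(0, 0, 2, 0), (5, 0, 5, 1)]
```

The width and height of each box (`right - left + 1` and `bottom - top + 1`) are the quantities a length-based classification would work from.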

[0027] Next, the Japanese-English determination means 105 determines whether each generated character region is Japanese or English (step 205).

[0028] OR-reducing the image in step 202 fuses neighboring black pixels. English text has spaces between words, while the spacing between characters within a word is very narrow. In Japanese text, by contrast, the character spacing does not change much except before and after punctuation marks.

[0029] FIG. 3 shows example images of English and Japanese text and their circumscribed rectangles. Circumscribed rectangles 302 represent the result of reducing the English image 301 and extracting its connected components. (Because of the reduction, the circumscribed rectangles 302 should properly be smaller than the image 301, but they are drawn here at the same size.) In an English image, each word fuses into one connected component.

[0030] When the example Japanese images 303 and 305 are reduced in the same way and their connected components are extracted and represented by circumscribed rectangles, the results are circumscribed rectangles 304 and 306, respectively.

[0031] In English, the number of characters in a word is fairly constant, so many of the circumscribed rectangles have a width-to-height ratio of about 2 to 6 or 7. In Japanese, by contrast, long rectangles that rarely appear in English arise, as in circumscribed rectangles 304, or conversely many small rectangles arise, as in circumscribed rectangles 306.

[0032] The connected component rectangles are therefore classified into three types, "short", "medium", and "long", and tallied for each character region. FIG. 4 shows the processing flowchart for Japanese-English determination in Embodiment 1. The processing of FIG. 4 is performed for each character region. When the line direction is horizontal, a rectangle is classified, for example, as "short" if width/height is 2 or less, "medium" if width/height is from 2 to 6, and "long" if it is greater (step 401). The classification results within the character region are then tallied (step 402), and each character region is determined to be Japanese or English (step 403). Let SCNT be the number of "short" rectangles, NCNT the number of "medium" rectangles, and LCNT the number of "long" rectangles. The Japanese-English determination is then made as shown in FIG. 8 (the detailed flowchart of step 403).
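Steps 401 and 402 can be sketched as below for horizontal text. The thresholds 2 and 6 are the example values given in the text; the function names and the treatment of a ratio of exactly 6 as "medium" are assumptions of this sketch, since the text does not state which side the boundary falls on.

```python
def classify_rectangle(width, height, short_max=2.0, medium_max=6.0):
    """Classify one circumscribed rectangle by its width-to-height ratio."""
    ratio = width / height
    if ratio <= short_max:
        return "short"
    if ratio <= medium_max:  # boundary handling at 6 is an assumption
        return "medium"
    return "long"


def tally(rectangles):
    """Return (SCNT, NCNT, LCNT) for a list of (width, height) rectangles."""
    counts = {"short": 0, "medium": 0, "long": 0}
    for width, height in rectangles:
        counts[classify_rectangle(width, height)] += 1
    return counts["short"], counts["medium"], counts["long"]


# Ratios 1, 3, 5 and 8: one short, two medium, one long.
region = [(10, 10), (30, 10), (50, 10), (80, 10)]
print(tally(region))  # (1, 2, 1)
```

For vertical text the same idea would presumably apply with height/width in place of width/height, but the text gives thresholds only for the horizontal case.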

[0033] First, it is checked whether LCNT / (NCNT + SCNT) > Th1 holds (step 801). Th1 is a predetermined threshold, for example about 0.3. If the condition holds, there are sufficiently many long rectangles, and the character region is determined to be a Japanese region (step 804).

[0034] If the result of step 801 is No, it is checked whether NCNT / (LCNT + SCNT) < Th2 holds (step 802). Th2 is also a predetermined threshold, for example 3. If the condition holds, there are few medium rectangles, and the character region is determined to be a Japanese region (step 804). If neither condition is satisfied, the region is determined to be an English region (step 803).
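The two tests of FIG. 8 (steps 801 to 804) can be written as a short function. The default thresholds are the example values from the text (Th1 = 0.3, Th2 = 3). The function name and return strings are illustrative, and the inequalities are cross-multiplied so that a zero denominator does not raise an error; that rewriting is an assumption of this sketch, not something the text specifies.

```python
def judge_region(scnt, ncnt, lcnt, th1=0.3, th2=3.0):
    """Decide "japanese" or "english" from the short/medium/long counts."""
    # Step 801: LCNT / (NCNT + SCNT) > Th1, i.e. many long rectangles.
    if lcnt > th1 * (ncnt + scnt):
        return "japanese"  # step 804
    # Step 802: NCNT / (LCNT + SCNT) < Th2, i.e. few medium rectangles.
    if ncnt < th2 * (lcnt + scnt):
        return "japanese"  # step 804
    return "english"       # step 803


print(judge_region(scnt=5, ncnt=5, lcnt=10))   # japanese (many long rectangles)
print(judge_region(scnt=30, ncnt=10, lcnt=1))  # japanese (few medium rectangles)
print(judge_region(scnt=5, ncnt=40, lcnt=2))   # english
```

With these defaults a region with no rectangles at all falls through to "english", which is one reason a minimum rectangle count is worth checking before relying on the ratios.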

[0035] <Embodiment 2> In Embodiment 1 above, Japanese-English determination is performed per character region. Some character regions, however, contain very few characters. In that case too few rectangles are obtained, and it may be difficult to make the determination from the ratios of rectangle counts. Embodiment 2 takes into account the case where the number of rectangles is insufficient.

[0036] FIG. 5 shows the processing flowchart of Embodiment 2. The Japanese-English determination means 105 checks whether the tallied number of rectangles in the region is sufficient, that is, whether it is at or above a predetermined threshold Th (step 501). If it is not, Japanese-English determination using OCR, as described in the above-cited Japanese Patent Application Laid-Open No. 6-150061, is performed (step 503). Because the number of characters is small in this case, applying OCR adds little processing time. If the number of rectangles is sufficient, Japanese and English are distinguished by rectangle length as described in Embodiment 1 (step 502).
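The branching of FIG. 5 amounts to choosing between two determination methods by rectangle count. In the sketch below both methods are passed in as callables, because neither the OCR-based method nor a value for the threshold Th is specified in the text; `min_rectangles=10` and the stub lambdas are purely hypothetical.

```python
def judge_with_fallback(rectangle_count, rectangle_judge, ocr_judge,
                        min_rectangles=10):
    """Use the rectangle-length rule when enough rectangles exist, else OCR.

    `rectangle_judge` and `ocr_judge` are zero-argument callables returning
    "japanese" or "english". `min_rectangles` stands in for the threshold Th;
    its value here is an assumption.
    """
    if rectangle_count >= min_rectangles:  # step 501
        return rectangle_judge()           # step 502
    return ocr_judge()                     # step 503


# Stubs standing in for the two determination methods.
by_rectangles = lambda: "english"
by_ocr = lambda: "japanese"

print(judge_with_fallback(47, by_rectangles, by_ocr))  # english
print(judge_with_fallback(3, by_rectangles, by_ocr))   # japanese
```

Passing callables keeps the slower OCR path from running at all for regions that have enough rectangles.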

[0037] <Embodiment 3> Embodiment 3, which performs Japanese-English identification per page, is described next. FIGS. 6 and 7 show detailed flowcharts of step 205 according to Embodiment 3. In the method shown in FIG. 6, the numbers of "short", "medium", and "long" rectangles are tallied over the whole page rather than per character region (steps 601, 602), and the result is used to make the Japanese-English determination per page (step 603). This determination follows the processing flowchart of FIG. 8. The thresholds Th1 and Th2 used here may differ from those used for per-region processing.

[0038] In the method shown in FIG. 7, Japanese-English determination is performed for each character region (step 702), and the page is then determined to be Japanese or English based on those results (step 703). Specifically, let Jn be the number of regions determined to be Japanese regions and En the number determined to be English regions; the page is determined to be a Japanese page if Jn > En and an English page if En > Jn. If Jn = En, the page may be rejected, or it may be assigned to either Japanese or English.
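The page-level vote of FIG. 7 (step 703) reduces to comparing two counts. In this sketch the function name, the string labels, and the `tie_result` parameter are illustrative; the text leaves the tie handling open (reject, or pick either language), so the default of "reject" is only one of the permitted choices.

```python
def judge_page(region_results, tie_result="reject"):
    """Decide a page's language from per-region results.

    `region_results` is a list of "japanese" / "english" strings, one per
    character region. `tie_result` is returned when Jn == En.
    """
    jn = region_results.count("japanese")
    en = region_results.count("english")
    if jn > en:
        return "japanese"
    if en > jn:
        return "english"
    return tie_result


print(judge_page(["japanese", "japanese", "english"]))  # japanese
print(judge_page(["english", "japanese"]))              # reject
```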

[0039] <Embodiment 4> A Japanese-English identification method that uses features different from those of the above embodiments is described next. FIG. 9 shows the configuration of Embodiment 4. It differs from Embodiment 1 in that it has a line extraction unit 902, a block extraction unit 903, and an in-block character type determination unit 904. The other components are the same as in Embodiment 1. FIG. 10 shows the processing flowchart of Embodiment 4.

[0040] First, the line extraction unit 902 extracts lines from the character regions of the document image (steps 1001, 1002). If the technique described in Japanese Patent Application Laid-Open No. 6-20092 is used for region generation, line information is already available once the regions have been extracted and may be used as is. Alternatively, the projection-based method described in the paper of the Institute of Electronics and Communication Engineers, "Region Segmentation of Document Images Using Peripheral Density Distribution, Line Density, and Circumscribed Rectangle Features" (Akiyama et al., August 1986, Vol. J69-D, No. 8), may be used.

[0041] Next, the block extraction unit 903 extracts blocks that correspond to words (step 1003). The method previously proposed by the present applicant in Japanese Patent Application No. 8-34781 may be used for block extraction. That is, the block extraction unit 111 detects the circumscribed rectangles within the line data and groups them into block data, as follows. A histogram of the spacing between character rectangles is computed. (At this stage one rectangle is not yet known to be one character; for kanji, the left-hand radical and the right-hand component are often separated into individual rectangles.) FIG. 18 shows extracted character rectangles and the distances between them. FIG. 19 shows the histogram of rectangle spacing.

[0042] In this histogram, the peak at the shortest distance tends to correspond to the spacing between the radical and the component of a kanji and to the inter-character spacing within a word set in proportional Latin type. Merging rectangles at this spacing rarely puts different character types into one block, so block data is formed by merging them. This processing merges proportionally spaced words, and kanji in which a single character is split into radical and component, into single units.

[0043] The peak at the longest distance often corresponds to the distance between words or between a punctuation mark and the following character. These gaps, especially the gaps between words, often mark a boundary where the character type changes, so the rectangles on either side should not fall into the same block. Character rectangles separated by at least the distance of the longest-distance peak are therefore not placed in the same block.

[0044] Further, the distances A and B between the target rectangle and its two neighboring rectangles are measured. When their difference (A - B) is equal to or greater than a predetermined threshold, the rectangles separated by the longer distance are not merged, and those separated by the shorter distance are merged. FIG. 20 illustrates a case where rectangles are not merged at positions where the difference between gaps is large. In FIG. 20, no merging takes place where the difference is equal to or greater than the threshold, so three blocks are formed. With this processing, even in proportionally spaced English text where the inter-word distance is small in absolute terms, it should still differ from the inter-character distance, so a single word can be merged by itself. In addition, even with a proportional font the kanji portions of Japanese text are placed at relatively even intervals, which is convenient for grouping Japanese text.
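The merging rules above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the cited application: the function name `split_into_blocks`, the parameters `long_peak` and `diff_threshold`, and the use of the absolute difference |A - B| are assumptions made for the example, and the histogram peak value is taken as already known.

```python
def split_into_blocks(gaps, long_peak, diff_threshold):
    """Group rectangles 0..n (left to right) into blocks.

    gaps[i] is the horizontal distance between rectangle i and rectangle i + 1.
    long_peak and diff_threshold are hypothetical, pre-computed parameters.
    """
    cuts = set()
    # Rule 1: never merge across a gap at or beyond the longest-distance peak.
    for i, gap in enumerate(gaps):
        if gap >= long_peak:
            cuts.add(i)
    # Rule 2: for each rectangle with two neighbours, compare its left gap (a)
    # and right gap (b); if they differ enough, cut at the longer one.
    for i in range(1, len(gaps)):
        a, b = gaps[i - 1], gaps[i]
        if abs(a - b) >= diff_threshold:
            cuts.add(i - 1 if a > b else i)
    # Build the blocks from the cut positions.
    blocks, current = [], [0]
    for i in range(len(gaps)):
        if i in cuts:
            blocks.append(current)
            current = [i + 1]
        else:
            current.append(i + 1)
    blocks.append(current)
    return blocks


# Eight rectangles with two clearly wider gaps yield three blocks.
print(split_into_blocks([1, 1, 6, 1, 1, 7, 1], long_peak=10, diff_threshold=4))
```

Gaps that are equal on both sides of a rectangle never trigger a cut under rule 2, which matches the remark that evenly spaced kanji stay together.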

[0045] With the block extraction method described above, English text is prevented from being mixed with other character types in a single block, because in English, unlike in a Japanese document, words are separated by a space roughly equivalent to a half-width character.

[0046] Next, the intra-block character type determination unit 904 performs Japanese-English determination for each block (step 1004). The method of the above-cited application may also be used here. That is, the unit 904 determines whether each group formed into a block by the above processing is Japanese or alphanumeric. All characters within a block are treated as the same character type. The determination is made as follows. A rectangle in the block is identified as a Japanese character when, relative to the width of the rectangle, the number of vertical black runs or the number of black-white transitions is equal to or greater than a predetermined threshold; Latin letters are identified on the basis of the vertical coordinate values of the rectangles in the extracted block. FIGS. 21(a) and 21(b) show specific examples of vertical run counts for Japanese and Latin characters. For alphanumerics, in the ideal noise-free case, the maximum is four runs, for the letter "g" (FIG. 21(b)). Accordingly, a count of five or more runs is taken to indicate Japanese. For the character 「像」 shown in FIG. 21(a), the vertical run count varies as indicated by the numbers below the character.
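The run-count test can be illustrated with a short Python sketch. It assumes a character rectangle is given as a list of pixel rows with 1 for black and 0 for white; the function names and the decision to take the maximum over all columns are assumptions of this example, and only the threshold of five runs comes from the text.

```python
def vertical_run_counts(bitmap):
    """Return, for each pixel column, the number of vertical black runs."""
    rows, cols = len(bitmap), len(bitmap[0])
    counts = []
    for x in range(cols):
        runs, prev = 0, 0
        for y in range(rows):
            # A new run starts wherever a black pixel follows a white one.
            if bitmap[y][x] == 1 and prev == 0:
                runs += 1
            prev = bitmap[y][x]
        counts.append(runs)
    return counts


def looks_japanese(bitmap, run_threshold=5):
    """Treat the rectangle as Japanese if any column has >= run_threshold runs."""
    return max(vertical_run_counts(bitmap)) >= run_threshold


sample = [[1, 0], [0, 0], [1, 1], [0, 0], [1, 0]]
print(vertical_run_counts(sample), looks_japanese(sample))
```

In practice scanned images contain noise, so a real system would likely smooth the counts or require several columns to exceed the threshold; that refinement is outside what the text specifies.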

[0047] The Japanese-English determination means 905 tallies the per-block determination results and determines whether the region is Japanese or English (step 1005). Let JCNT be the number of blocks determined to be Japanese, ECNT the number determined to be English, and NCNT the number determined to be indeterminate. FIG. 11 is a detailed flowchart of step 1005. If JCNT * Th3 > ECNT, the region is determined to be Japanese (steps 1101, 1105); otherwise, if ECNT > JCNT, it is determined to be English (steps 1102, 1104). In all other cases the region is rejected (step 1103). The threshold Th3 is set to 2, for example.
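The tally rule of FIG. 11 is small enough to write out directly. The following Python sketch assumes the per-block counts are already available; the function name `classify_region` and the string labels are chosen for this example only.

```python
def classify_region(jcnt, ecnt, th3=2):
    """Decide a region from the number of Japanese (jcnt) and English (ecnt) blocks."""
    if jcnt * th3 > ecnt:
        return "japanese"
    if ecnt > jcnt:
        return "english"
    return "reject"


print(classify_region(5, 3))  # Japanese blocks dominate
print(classify_region(1, 5))  # English blocks dominate
```

With Th3 = 2 the rule is biased toward Japanese: a region is called English only when English blocks are at least twice as numerous as Japanese blocks.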

[0048] <Embodiment 5> In Embodiment 4 above, Japanese-English determination is performed per character region. Some character regions, however, contain very few characters. In such cases too few rectangles are obtained, and it may be difficult to make the determination from the ratio of per-block results. Embodiment 5 addresses the case where the number of blocks is insufficient.

[0049] FIG. 12 shows the processing flowchart of Embodiment 5. The Japanese-English determination means 105 checks whether the tallied number of blocks in the character region is sufficient, that is, whether it is equal to or greater than a predetermined threshold Th (step 1201). If it is not, the OCR-based Japanese-English determination described in the above-cited Japanese Unexamined Patent Publication No. Hei 6-150061 is performed (step 1203). Because the number of characters is small in this case, applying OCR adds little processing time. If the number of blocks is sufficient, Japanese and English are distinguished from the per-block results as described in Embodiment 4 (step 1202).

[0050] <Embodiment 6> Embodiment 6 changes the per-character-region determination of Embodiment 4 to a per-page determination. The processing flowcharts of FIGS. 6 and 7 are used for Embodiment 6.

[0051] In the processing of FIG. 6, JCNT, ECNT, and NCNT are tallied over the entire page rather than per character region, and the result is used to determine Japanese or English by the processing of FIG. 11 described above. The value of Th3 used here may differ from the value used for per-region determination.

[0052] In the processing of FIG. 7, each character region is first determined individually, and the page is then determined from those results. Specifically, let Jn be the number of regions determined to be Japanese and En the number determined to be English. If Jn > En, the page is determined to be a Japanese page; if En > Jn, it is determined to be an English page. When Jn = En, the page may be rejected or may be assigned to either Japanese or English.
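The page-level vote of FIG. 7 can be sketched as below. The text leaves the tie case open, so returning "reject" for Jn = En is an assumption of this example, as are the function name and the label strings.

```python
def classify_page(region_labels):
    """Majority vote over per-region labels ("japanese", "english", "reject")."""
    jn = region_labels.count("japanese")
    en = region_labels.count("english")
    if jn > en:
        return "japanese"
    if en > jn:
        return "english"
    return "reject"  # tie handling is a free choice in the text


print(classify_page(["japanese", "japanese", "english", "reject"]))
```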

[0053] <Embodiment 7> In Embodiment 7, when Japanese-English determination is performed per character region or per page, two determinations are made as shown in FIG. 13: one using rectangle lengths (step 1301) and one using per-block determination results (step 1302). The final determination is then made from the two results (step 1303).

[0054] When both methods return Japanese, or both return English, that result is taken as the final result. When one of them returns reject, the result of the other, non-rejecting method is taken as the final result.

[0055] When the two results disagree, one being Japanese and the other English, one of the following is done.
(1) The region is rejected.
(2) A confidence value is computed for each method, and the result with the larger value is adopted.
The confidence of the method using rectangle lengths is, for example:
when LCNT / (NCNT + SCNT) > Th1 with Th1 = 0.3, the value of LCNT / (NCNT + SCNT) * 2.5 (capped at 1);
when NCNT / (LCNT + SCNT) < Th2 with Th2 = 3, the value of (LCNT + SCNT) / NCNT * 2.5 (capped at 1);
when NCNT / (LCNT + SCNT) > Th2 with Th2 = 3, the value of NCNT / (LCNT + SCNT) * 0.33 (capped at 1).

[0056] The confidence of the method using per-block determination results is, for example:
when JCNT * Th3 > ECNT with Th3 = 2, the value of JCNT / (ECNT * 3) (capped at 1);
when ECNT > JCNT, the value of ECNT / JCNT * 0.7 (capped at 1).
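The per-block confidence formula and the arbitration between the two methods can be sketched as follows. The formulas are the example values given in the text; everything else is an assumption of this sketch: the function names, returning a confidence of 1.0 when a denominator is zero, and rejecting when two disagreeing results have equal confidence.

```python
def block_confidence(jcnt, ecnt, th3=2):
    """Confidence of the per-block method, capped at 1."""
    if jcnt * th3 > ecnt:
        # Japanese branch; ecnt == 0 would divide by zero, so treat it as certain.
        return min(1.0, jcnt / (ecnt * 3)) if ecnt else 1.0
    if ecnt > jcnt:
        # English branch; jcnt == 0 is likewise treated as certain.
        return min(1.0, ecnt / jcnt * 0.7) if jcnt else 1.0
    return 0.0


def combine(result_a, conf_a, result_b, conf_b):
    """Merge two results, each "japanese", "english", or "reject"."""
    if result_a == result_b:
        return result_a
    if result_a == "reject":
        return result_b
    if result_b == "reject":
        return result_a
    if conf_a == conf_b:
        return "reject"
    # Option (2) in the text: adopt the result with the larger confidence.
    return result_a if conf_a > conf_b else result_b


print(block_confidence(3, 2), combine("japanese", 0.25, "english", 0.9))
```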

[0057] <Embodiment 8> FIG. 14 shows the configuration of Embodiment 8, and FIG. 15 shows its processing flowchart. In this embodiment, the Japanese-English determination unit 1412 determines, for the entire page of the input document, whether the page is Japanese or English, using the method of Embodiment 3 or 6 described above (steps 1501, 1502). Based on the result, the selection unit 1403 selects the English document recognition unit 1404 or the Japanese document recognition unit 1405, document recognition for the selected language is performed (steps 1504, 1505), and the recognition result is output to an output unit such as a display (step 1506).

[0058] Because Japanese and English differ in their attributes, it is sometimes better to switch the region segmentation and font identification processing as well. The document recognition units of this embodiment therefore include not only character recognition but also the region segmentation and font identification processing mentioned above.

[0059] <Embodiment 9> FIG. 16 shows the configuration of Embodiment 9, and FIG. 17 shows its processing flowchart. It differs from Embodiment 8 in that Japanese-English identification is performed per character region. To this end, the region segmentation unit 1602 divides the input document into character regions (steps 1701, 1702), using a segmentation method applicable to both Japanese and English. After segmentation, the Japanese-English determination unit 1603 performs identification for each character region using, for example, the method of Embodiment 1 described above (step 1704). Based on the result, the selection unit 1604 selects the English document recognition unit 1605 or the Japanese document recognition unit 1606, document recognition for the selected language is performed (steps 1705, 1706), and the recognition result is output to an output unit 1607 such as a display (step 1707). The document recognition units of Embodiment 9 perform font identification in addition to document recognition.

[0060] <Embodiment 10> The embodiments described above determine Japanese and English using black-pixel connected components or rectangle lengths as features. However, the determination using black-pixel connected components takes processing time, and the method using rectangle lengths can produce a high rate of rejects. There is also a method that distinguishes Japanese from English text based on the peak positions of the frequency distribution of the relative positions, within the line, of the top and bottom edges of circumscribed rectangles (see Japanese Examined Patent Publication No. Hei 7-21817). When a skewed document is input, however, that frequency distribution changes considerably and identification accuracy drops.

[0061] This embodiment therefore identifies Japanese and English using a histogram of the heights of the circumscribed rectangles in a line relative to the line height, so that each region of the document image is identified accurately and quickly. For regions that this method cannot decide, a different method is used for identification.

[0062] FIG. 22 shows the configuration of Embodiment 10, and FIG. 23 is its overall processing flowchart. First, the image input means 2201 obtains a document image by reading a document (step 2301). The image input means is, for example, a scanner or a facsimile machine; the image may also be obtained from another device over a network via the data communication means 2207.

[0063] Next, the region generation means 2202 generates character regions (step 2302). The method described in Japanese Unexamined Patent Publication No. Hei 6-20092, for example, may be used for region generation. The line extraction means 2203 then extracts lines for character recognition from the character regions; that is, it obtains the circumscribed rectangles of the characters and merges them into lines (step 2303). The Japanese-English identification means 2204 performs Japanese-English identification on the generated character regions (step 2304).

[0064] Japanese-English identification is performed as follows. FIG. 27 is a detailed flowchart of the identification (step 2304). FIG. 24 shows an example of an extracted line and the circumscribed rectangles within it. First, the frequency distribution of the ratio of each circumscribed rectangle's height to the line height is computed (steps 2701, 2702). Let lineheight be the line height and height the rectangle height; the ratio is heightrate = height * 100 / lineheight. For a skewed document such as that in FIG. 25, the maximum rectangle height in the line may be used as lineheight instead of the line height, for more accurate identification. In other words, for a skewed input document, identification is based on a histogram of the ratio of each circumscribed rectangle's height to the maximum rectangle height in the line.

[0065] Let lcnt be the number of rectangles whose heightrate is, for example, 80 or more; ncnt the number whose heightrate is, for example, 70 or more and less than 80; and scnt the number whose heightrate is, for example, 40 or more and less than 70. The values lcnt, ncnt, and scnt are obtained over all rectangles in the character region.
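The height ratio and the three counts can be computed as in the following Python sketch. The bin boundaries 80, 70, and 40 are the example values from the text; the function name and the convention that `lineheight=None` selects the maximum rectangle height (the variant suggested for skewed documents) are assumptions of this example.

```python
def count_height_classes(heights, lineheight=None):
    """Return (lcnt, ncnt, scnt) for the rectangle heights of one line."""
    if lineheight is None:
        # Skewed-document variant: use the tallest rectangle as the reference.
        lineheight = max(heights)
    lcnt = ncnt = scnt = 0
    for height in heights:
        heightrate = height * 100 / lineheight
        if heightrate >= 80:
            lcnt += 1
        elif heightrate >= 70:
            ncnt += 1
        elif heightrate >= 40:
            scnt += 1
        # Rectangles below 40 are not counted in any class.
    return lcnt, ncnt, scnt


print(count_height_classes([20, 18, 15, 10, 5], lineheight=20))
```

For a whole character region, the counts from each line would simply be summed.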

[0066] FIG. 26 shows an example of the rectangle counts obtained for a Japanese document and an English document. In general, lcnt tends to be large for Japanese and scnt tends to be large for English. Predetermined thresholds thJ and thE are therefore set: the region is determined to be Japanese when lcnt / scnt > thJ (step 2703) and English when lcnt / scnt < thE (step 2704). Otherwise it is treated as an unknown region (step 2705).
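A minimal Python sketch of this decision follows. The text does not give values for thJ and thE, so the defaults 2.0 and 0.5 are placeholders, and the handling of scnt = 0 (Japanese if any tall rectangles exist, otherwise unknown) is an assumption added to avoid division by zero.

```python
def classify_by_height(lcnt, scnt, th_j=2.0, th_e=0.5):
    """Decide a region from the tall (lcnt) and short (scnt) rectangle counts."""
    if scnt == 0:
        # The ratio is undefined; assumed handling, not specified in the text.
        return "japanese" if lcnt > 0 else "unknown"
    ratio = lcnt / scnt
    if ratio > th_j:
        return "japanese"
    if ratio < th_e:
        return "english"
    return "unknown"


print(classify_by_height(30, 5), classify_by_height(4, 20))
```

Keeping a gap between thE and thJ is what creates the unknown band that the later statistical step is meant to resolve.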

[0067] An unknown region can be identified as Japanese or English by a statistical method. FIG. 28 is a detailed processing flowchart for unknown regions. For example, the feature values lcnt, ncnt, and scnt of Japanese regions and English regions are normalized in advance, and their mean and the inverse of their covariance matrix are obtained for Japanese and for English. Using the mean and the inverse covariance matrix, the Mahalanobis distance is then computed for Japanese and for English (steps 2801, 2802).

[0068] Let Dj be the Mahalanobis distance for Japanese, De the Mahalanobis distance for English, and Me and Mj predetermined thresholds. The region is determined to be English when Dj / De > Me (step 2803) and Japanese when Dj / De < Mj (step 2804). If neither condition is satisfied, the region is determined to be unknown (step 2805). The Euclidean distance or city-block distance to the mean may be used instead of the Mahalanobis distance.
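The distance-ratio test can be sketched in plain Python as follows. The per-language mean vectors and inverse covariance matrices are assumed to have been estimated beforehand from training regions; the function names, the argument layout, and the handling of De = 0 are assumptions of this example.

```python
import math


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance sqrt((x - mean)^T * inv_cov * (x - mean))."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def classify_unknown(x, jp_model, en_model, m_e, m_j):
    """jp_model and en_model are (mean, inv_cov) pairs; m_e, m_j are thresholds."""
    dj = mahalanobis(x, *jp_model)
    de = mahalanobis(x, *en_model)
    if de == 0:
        # x sits exactly on the English mean; assumed handling of the 0 divisor.
        return "english" if dj > 0 else "unknown"
    ratio = dj / de
    if ratio > m_e:
        return "english"
    if ratio < m_j:
        return "japanese"
    return "unknown"


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
jp = ([0.6, 0.2, 0.2], identity)  # made-up model values for illustration
en = ([0.2, 0.2, 0.6], identity)
print(classify_unknown([0.55, 0.2, 0.25], jp, en, m_e=2.0, m_j=0.5))
```

With an identity matrix as the inverse covariance, the same function yields the Euclidean distance mentioned as an alternative.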

[0069] A region still determined to be unknown is identified using the confidence of English text recognition. FIG. 29 is a detailed processing flowchart of step 2805. Confidence values are computed by English text recognition (step 2901). Then let Good be the number of words with a confidence of, for example, 60% or more; Bad the number of words with a confidence below 60% but not 0; and Zelo the number of words with a confidence of 0 (step 2902).

[0070] Let Value be the decision value for Japanese-English identification, with Value = Good / (Good + Bad + Zelo) (step 2903). If Value exceeds a predetermined threshold theocr (step 2904), the region is determined to be English; otherwise it is determined to be Japanese.

[0071] Zelo may also be weighted. If each Zelo word is counted as, for example, three Bad words, then Bad = Bad + Zelo × 3 and Value = Good / (Good + Bad); the region can be determined to be English if Value exceeds the threshold theocr and Japanese otherwise. In this way, even a region containing too few characters for the other identification methods is identified by the confidence of English text recognition, so region-level Japanese-English identification is performed accurately.
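The decision value can be computed as in the sketch below, which takes per-word confidences (in percent) from an English recognizer as input. The 60% boundary and the weight of 3 come from the text; the default threshold of 0.5 for theocr, the function names, and returning 0.0 for an empty word list are assumptions of this example.

```python
def ocr_value(word_confidences, zero_weight=1):
    """Value = Good / (Good + Bad + zero_weight * Zelo)."""
    good = sum(1 for c in word_confidences if c >= 60)
    zelo = sum(1 for c in word_confidences if c == 0)
    bad = len(word_confidences) - good - zelo  # below 60 but not 0
    denom = good + bad + zero_weight * zelo
    return good / denom if denom else 0.0


def classify_by_ocr(word_confidences, theocr=0.5, zero_weight=1):
    """English if Value exceeds the threshold, Japanese otherwise."""
    value = ocr_value(word_confidences, zero_weight)
    return "english" if value > theocr else "japanese"


confs = [90, 80, 70, 30, 0]
print(classify_by_ocr(confs), classify_by_ocr(confs, zero_weight=3))
```

Weighting the zero-confidence words more heavily makes the test stricter, since words the English recognizer cannot read at all are strong evidence of Japanese text.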

[0072] <Embodiment 11> This embodiment generates circumscribed rectangles from a reduced version of the input document image, merges the generated rectangles appropriately, and uses a histogram of the aspect ratios of the merged rectangles to identify Japanese and English more accurately.

[0073] FIG. 30 shows the configuration of Embodiment 11, and FIG. 31 is its overall processing flowchart. As in the embodiments above, the document image input by the image input means 3001 is reduced by the image reduction means 3002 (steps 3101, 3102). For example, the document image is OR-compressed to about 1/4: each 4 × 4 block of pixels is reduced to one pixel, which is black if at least one of the 16 pixels is black.
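OR compression is straightforward to express in code. The following Python sketch assumes a binary image stored as a list of rows with 1 for black; the function name `or_reduce` and the treatment of partial blocks at the image border are assumptions of this example.

```python
def or_reduce(image, factor=4):
    """Shrink a binary image; an output pixel is black if any source pixel is."""
    rows, cols = len(image), len(image[0])
    reduced = []
    for y in range(0, rows, factor):
        out_row = []
        for x in range(0, cols, factor):
            # Border blocks may be smaller than factor x factor.
            block = (
                image[yy][xx]
                for yy in range(y, min(y + factor, rows))
                for xx in range(x, min(x + factor, cols))
            )
            out_row.append(1 if any(block) else 0)
        reduced.append(out_row)
    return reduced


img = [[0] * 8 for _ in range(8)]
img[5][2] = 1  # a single black pixel
print(or_reduce(img))
```

Because a single black pixel survives the reduction, thin strokes are preserved and nearby strokes tend to fuse, which reduces the number of connected components to process.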

[0074] Next, the region generation means 3003 generates character regions (step 3103). The method described in Japanese Unexamined Patent Publication No. Hei 6-20092, for example, may be used for region generation. The rectangle merging means 3004 then merges rectangles so that the characteristics of Japanese and English show up clearly (step 3104). For example, as shown in FIG. 32, rectangles 1 and 2 are merged when their top and bottom y coordinates (vertical direction) are close and the x coordinates of the adjacent rectangles are very close (for example, when the horizontal distance between them is smaller than the distance corresponding to an English space). As another example, shown in FIG. 33, the rectangles are merged when the left rectangle 1 contains the right rectangle 2 in the y direction and the x coordinates of the adjacent rectangles are very close (again, for example, when the horizontal distance between them is smaller than the distance corresponding to an English space).
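The two merging conditions of FIGS. 32 and 33 can be sketched as below. Rectangles are assumed to be tuples (x0, y0, x1, y1) with y increasing downward; `space_width` and `y_tol` are hypothetical parameters standing in for "the distance corresponding to an English space" and "close" y coordinates, for which the text gives no values.

```python
def should_merge(left, right, space_width, y_tol):
    """Decide whether two horizontally adjacent rectangles should be merged."""
    gap = right[0] - left[2]
    if gap >= space_width:
        return False
    # FIG. 32 case: top and bottom edges are nearly aligned.
    aligned = abs(left[1] - right[1]) <= y_tol and abs(left[3] - right[3]) <= y_tol
    # FIG. 33 case: the left rectangle spans the right one vertically.
    contains = left[1] <= right[1] and right[3] <= left[3]
    return aligned or contains


def merge(left, right):
    """Return the bounding rectangle of the two inputs."""
    return (
        min(left[0], right[0]),
        min(left[1], right[1]),
        max(left[2], right[2]),
        max(left[3], right[3]),
    )


a, b = (0, 0, 10, 20), (12, 1, 20, 19)
if should_merge(a, b, space_width=5, y_tol=2):
    print(merge(a, b))
```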

[0075] The rectangles are then divided by aspect ratio (rectangle height / rectangle width) into four feature classes: long, medium, small, and very small rectangles (FIG. 34). In general, long rectangles appear at a high rate in Japanese, and medium rectangles appear at a high rate in English. Using this difference, the Japanese-English identification means 3005 forms a decision formula and performs identification (step 3105). FIG. 35 is a detailed flowchart of the identification processing.

[0076] For example, the following counts are computed (step 3501):
lcnt, the number of long rectangles in the region;
ncnt, the number of medium rectangles in the region;
scnt, the number of small rectangles in the region;
sscnt, the number of very small rectangles in the region (mostly noise).
Then the proportion of long rectangles in the region, ratio1 = lcnt / (ncnt + scnt), is computed (step 3502), and the proportion of medium rectangles in the region, ratio2 = ncnt / (lcnt + scnt), is computed (step 3503). In computing these proportions, sscnt is ignored as noise.

[0077] Then, with ratio1 as the x coordinate and ratio2 as the y coordinate, the plane is divided into a Japanese region, an English region, and a reject region so that misidentification is minimized and the part where Japanese and English overlap is rejected. For example, the region is determined to be English if ratio2 / ratio1 > thE (step 3504) and Japanese if ratio2 / ratio1 < thJ (step 3505); any other region is treated as Japanese-English unknown (step 3506). Here thE and thJ are predetermined thresholds.
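The two proportions and the threshold test can be sketched in Python as follows. The text gives no values for thE and thJ, so the defaults here are placeholders; returning "unknown" whenever a denominator would be zero is also an assumption of this example.

```python
def classify_by_aspect(lcnt, ncnt, scnt, th_e=2.0, th_j=0.5):
    """Decide a region from long (lcnt), medium (ncnt), and small (scnt) counts."""
    if lcnt == 0 or ncnt + scnt == 0 or lcnt + scnt == 0:
        # A ratio would be undefined; assumed handling, not from the text.
        return "unknown"
    ratio1 = lcnt / (ncnt + scnt)  # proportion of long rectangles
    ratio2 = ncnt / (lcnt + scnt)  # proportion of medium rectangles
    quotient = ratio2 / ratio1
    if quotient > th_e:
        return "english"
    if quotient < th_j:
        return "japanese"
    return "unknown"


print(classify_by_aspect(40, 5, 10), classify_by_aspect(5, 40, 10))
```

Geometrically, the test splits the (ratio1, ratio2) plane with two lines through the origin, and the wedge between them is the reject region.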

[0078] A region determined to be Japanese-English unknown is identified by a statistical method, as in Embodiment 10. For example, the feature values lcnt, ncnt, and scnt of Japanese regions and English regions are normalized in advance, and their mean and the inverse of their covariance matrix are obtained for Japanese and for English. Using the mean and the inverse covariance matrix, the Mahalanobis distance is computed for Japanese and for English. Let Dj be the Mahalanobis distance for Japanese, De the Mahalanobis distance for English, and Me and Mj predetermined thresholds. The region is determined to be English when Dj / De > Me and Japanese when Dj / De < Mj. If neither condition is satisfied, the region is determined to be unknown. The Euclidean distance or city-block distance to the mean may be used instead of the Mahalanobis distance.

[0079] <Embodiment 12> The present invention is not limited to the embodiments described above and can also be realized in software. In that case, as shown in FIG. 36, a computer system comprising a CPU, memory, a display device, a hard disk, a keyboard, a CD-ROM drive, a scanner, and the like is prepared, and a program implementing the Japanese-English determination function and the document recognition function of the present invention is recorded on a computer-readable recording medium such as a CD-ROM. A document image input from an image input means such as a scanner is temporarily stored on the hard disk or the like. When the program is started, the temporarily stored document image data is read, the Japanese-English determination and document recognition processes are executed, and the result is output to a display or the like.

【００８０】[0080]

[Effects of the Invention] As described above, according to the inventions of claims 1 and 12, multiple determination methods are used in combination, so Japanese and English can be distinguished with high accuracy.

[0081] According to the inventions of claims 2, 3, 6, 7, 13, and 14, Japanese and English can be distinguished accurately for each character region in a document image.

[0082] According to the inventions of claims 4, 5, 8, 9, 13, and 14, Japanese and English can be distinguished accurately for each page of a document image.

[0083] According to the inventions of claims 10, 11, and 15, appropriate document recognition processing is applied to a document image determined to be Japanese or English, so highly accurate recognition results can be obtained.

[Brief description of the drawings]

FIG. 1 shows the configuration of Embodiment 1 of the present invention.

FIG. 2 shows the overall processing flowchart of Embodiment 1 of the present invention.

FIG. 3 shows example images of English and Japanese text and their circumscribed rectangles.

FIG. 4 shows the processing flowchart of Japanese-English determination in Embodiment 1.

FIG. 5 shows the processing flowchart of Embodiment 2.

FIG. 6 shows a first detailed flowchart of step 205 according to Embodiment 3.

FIG. 7 shows a second detailed flowchart of step 205 according to Embodiment 3.

FIG. 8 shows a detailed flowchart of step 403.

FIG. 9 shows the configuration of Embodiment 4.

FIG. 10 shows the processing flowchart of Embodiment 4.

FIG. 11 is a detailed flowchart of step 1005.

FIG. 12 shows the processing flowchart of Embodiment 5.

FIG. 13 shows the processing flowchart of Embodiment 7.

FIG. 14 shows the configuration of Embodiment 8.

FIG. 15 shows the processing flowchart of Embodiment 8.

FIG. 16 shows the configuration of Embodiment 9.

FIG. 17 shows the processing flowchart of Embodiment 9.

FIG. 18 shows extracted character rectangles and the distances between the rectangles.

FIG. 19 shows a histogram of rectangle intervals.

FIG. 20 illustrates a case where rectangles are not merged at positions where the difference between gaps is large.

FIGS. 21(a) and 21(b) show specific examples of vertical run counts for Japanese and Latin characters.

FIG. 22 shows the configuration of Embodiment 10.

FIG. 23 is the overall processing flowchart of Embodiment 10.

FIG. 24 shows an example of an extracted line and the circumscribed rectangles within it.

FIG. 25 shows an example of a line and the circumscribed rectangles within it when the document is skewed.

FIG. 26 shows an example of the rectangle counts obtained for a Japanese document and an English document.

FIG. 27 is a detailed processing flowchart of Japanese-English identification (step 2304).

FIG. 28 is a detailed processing flowchart for unknown regions.

FIG. 29 is a detailed processing flowchart of step 2805.

FIG. 30 shows the configuration of Embodiment 11.

FIG. 31 is the overall processing flowchart of Embodiment 11.

FIG. 32 shows an example of merging rectangles.

FIG. 33 shows another example of merging rectangles.

FIG. 34 shows rectangles classified into four types.

FIG. 35 is a detailed processing flowchart of the Japanese-English identification processing of Embodiment 11.

【図３６】実施例１２の構成を示す。FIG. 36 shows a structure of a twelfth embodiment.

[Explanation of symbols]

１０１画像入力手段１０２画像縮小手段１０３連結成分抽出手段１０４領域生成手段１０５日英判別手段１０６制御部１０７データ記憶部１０８データ通信路１０９データ通信手段 DESCRIPTION OF SYMBOLS 101 Image input means 102 Image reduction means 103 Connected component extraction means 104 Area generation means 105 Japanese-English discrimination means 106 Control unit 107 Data storage unit 108 Data communication path 109 Data communication means

─────────────────────────────────────────────────────

【手続補正書】[Procedure amendment]

【提出日】平成１０年７月１５日[Submission date] July 15, 1998

【手続補正１】[Procedure amendment 1]

【補正対象書類名】明細書[Name of document to be amended] Specification

【補正対象項目名】全文[Correction target item name] Full text

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【書類名】明細書[Document name] Specification

【発明の名称】文書画像の日本語英語判定方法、文書
認識方法および記録媒体 [Title of the Invention] Japanese-English determination method for a document image, document recognition method, and recording medium

【特許請求の範囲】[Claims]

【請求項１０】文書画像中の各文字領域が日本語領域
であるか英語領域であるかを判定する文書画像の日本語
英語判定方法であって、前記各文字領域から行を切り出
し、行の高さと行内の矩形高さを基に前記各文字領域が
日本語領域であるか英語領域であるかを判定することを
特徴とする文書画像の日本語英語判定方法。 10. A Japanese-English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region, the method comprising: cutting out lines from each character region; and determining whether each character region is a Japanese region or an English region based on the height of a line and the heights of the rectangles within the line.

【請求項１１】前記行の高さと行内の矩形高さは、行
の高さに対する行内の各矩形高さの割合のヒストグラム
であることを特徴とする請求項１０記載の文書画像の日
本語英語判定方法。 11. The Japanese-English determination method for a document image according to claim 10, wherein the line height and the in-line rectangle heights are used in the form of a histogram of the ratio of each rectangle height in the line to the line height.

【請求項１２】文書画像中の各文字領域が日本語領域
であるか英語領域であるかを判定する文書画像の日本語
英語判定方法であって、前記各文字領域から行を切り出
し、行の高さに対する行内の各矩形高さの割合のヒスト
グラムを基に前記各文字領域が日本語領域であるか英語
領域であるかを判定し、何れの領域にも判定できない不
明領域については、予め前記ヒストグラムを基に日本語
の特性値と英語の特性値とを算出しておき、前記不明領
域が前記何れの特性値に近いかを算出することによって
日本語領域であるか英語領域であるかを判定することを
特徴とする文書画像の日本語英語判定方法。 12. A Japanese-English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region, the method comprising: cutting out lines from each character region; determining whether each character region is a Japanese region or an English region based on a histogram of the ratio of each rectangle height in a line to the line height; and, for an unknown region that cannot be assigned to either, calculating a Japanese characteristic value and an English characteristic value in advance based on the histogram and determining whether the unknown region is a Japanese region or an English region by calculating which characteristic value it is closer to.

【請求項１３】文書画像中の各文字領域が日本語領域
であるか英語領域であるかを判定する文書画像の日本語
英語判定方法であって、前記各文字領域から行を切り出
し、行内の矩形の最大高さと行内の矩形高さを基に前記
各文字領域が日本語領域であるか英語領域であるかを判
定することを特徴とする文書画像の日本語英語判定方
法。 13. A Japanese-English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region, the method comprising: cutting out lines from each character region; and determining whether each character region is a Japanese region or an English region based on the maximum height of the rectangles in a line and the heights of the rectangles in the line.

【請求項１４】前記行内の矩形の最大高さと行内の矩
形高さは、行内の矩形の最大高さに対する行内の各矩形
高さの割合のヒストグラムであることを特徴とする請求
項１３記載の文書画像の日本語英語判定方法。 14. The method according to claim 13, wherein the maximum height of the rectangle in the row and the height of the rectangle in the row are histograms of the ratio of the height of each rectangle in the row to the maximum height of the rectangle in the row. Japanese / English judgment method for document images.

【請求項１５】文書画像中の各文字領域が日本語領域
であるか英語領域であるかを判定する文書画像の日本語
英語判定方法であって、前記各文字領域から行を切り出
し、行内の矩形の最大高さに対する行内の各矩形高さの
割合のヒストグラムを基に前記各文字領域が日本語領域
であるか英語領域であるかを判定し、何れの領域にも判
定できない不明領域については、予め前記ヒストグラム
を基に日本語の特性値と英語の特性値とを算出してお
き、前記不明領域が前記何れの特性値に近いかを算出す
ることによって日本語領域であるか英語領域であるかを
判定することを特徴とする文書画像の日本語英語判定方
法。 15. A Japanese-English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region, the method comprising: cutting out lines from each character region; determining whether each character region is a Japanese region or an English region based on a histogram of the ratio of each rectangle height in a line to the maximum rectangle height in the line; and, for an unknown region that cannot be assigned to either, calculating a Japanese characteristic value and an English characteristic value in advance based on the histogram and determining whether the unknown region is a Japanese region or an English region by calculating which characteristic value it is closer to.

【請求項１６】前記２度目の判定でも判定できない不
明領域については、英文認識の確信度を基に日本語領域
であるか英語領域であるかを判定することを特徴とする
請求項１２または１５記載の文書画像の日本語英語判定
方法。 16. The Japanese-English determination method for a document image according to claim 12 or 15, wherein an unknown region that cannot be determined even by the second determination is determined to be a Japanese region or an English region based on the confidence of English text recognition.

【請求項１７】文書画像中の各文字領域が日本語領域
であるか英語領域であるかを判定する文書画像の日本語
英語判定方法であって、前記文書画像を縮小した画像か
ら外接矩形を生成し、所定の位置関係にある矩形同士を
統合し、統合後の矩形の縦横比のヒストグラムを基に前
記各文字領域が日本語領域であるか英語領域であるかを
判定することを特徴とする文書画像の日本語英語判定方
法。 17. A Japanese-English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region, the method comprising: generating circumscribed rectangles from a reduced image of the document image; merging rectangles that are in a predetermined positional relationship; and determining whether each character region is a Japanese region or an English region based on a histogram of the aspect ratios of the merged rectangles.

【請求項１８】文書画像中の各文字領域が日本語領域
であるか英語領域であるかを判定する文書画像の日本語
英語判定方法であって、前記文書画像を縮小した画像か
ら外接矩形を生成し、所定の位置関係にある矩形同士を
統合し、統合後の矩形の縦横比のヒストグラムを基に前
記各文字領域が日本語領域であるか英語領域であるかを
判定し、何れの領域にも判定できない不明領域について
は、予め前記ヒストグラムを基に日本語の特性値と英語
の特性値とを算出しておき、前記不明領域が前記何れの
特性値に近いかを算出することによって日本語領域であ
るか英語領域であるかを判定することを特徴とする文書
画像の日本語英語判定方法。 18. A Japanese-English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region, the method comprising: generating circumscribed rectangles from a reduced image of the document image; merging rectangles that are in a predetermined positional relationship; determining whether each character region is a Japanese region or an English region based on a histogram of the aspect ratios of the merged rectangles; and, for an unknown region that cannot be assigned to either, calculating a Japanese characteristic value and an English characteristic value in advance based on the histogram and determining whether the unknown region is a Japanese region or an English region by calculating which characteristic value it is closer to.

【請求項２５】文書画像中の各文字領域から行を切り
出す機能と、行の高さに対する行内の各矩形高さの割合
のヒストグラムまたは行内の矩形の最大高さに対する行
内の各矩形高さの割合のヒストグラムを基に前記各文字
領域が日本語領域であるか英語領域であるかを判定する
機能と、何れの領域にも判定できない不明領域について
は、予め前記ヒストグラムを基に日本語の特性値と英語
の特性値とを算出する機能と、前記不明領域が前記何れ
の特性値に近いかを算出することによって日本語領域で
あるか英語領域であるかを判定する機能をコンピュータ
に実現させるためのプログラムを記録したコンピュータ
読み取り可能な記録媒体。 25. A computer-readable recording medium storing a program that causes a computer to implement: a function of cutting out lines from each character region in a document image; a function of determining whether each character region is a Japanese region or an English region based on a histogram of the ratio of each rectangle height in a line to the line height, or a histogram of the ratio of each rectangle height in a line to the maximum rectangle height in the line; a function of calculating, for an unknown region that cannot be assigned to either, a Japanese characteristic value and an English characteristic value in advance based on the histogram; and a function of determining whether the unknown region is a Japanese region or an English region by calculating which characteristic value it is closer to.

【請求項２６】文書画像から縮小画像を生成する機能
と、該縮小画像から文字領域を生成する機能と、該文字
領域から外接矩形を生成する機能と、所定の位置関係に
ある矩形同士を統合する機能と、統合後の矩形の縦横比
のヒストグラムを基に前記文字領域が日本語領域である
か英語領域であるかを判定する機能をコンピュータに実
現させるためのプログラムを記録したコンピュータ読み
取り可能な記録媒体。 26. A computer-readable recording medium storing a program that causes a computer to implement: a function of generating a reduced image from a document image; a function of generating a character region from the reduced image; a function of generating circumscribed rectangles from the character region; a function of merging rectangles that are in a predetermined positional relationship; and a function of determining whether the character region is a Japanese region or an English region based on a histogram of the aspect ratios of the merged rectangles.

【発明の詳細な説明】DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【０００２】[0002]

【０００３】従って、文字認識処理を施す前に、言語識
別を行う必要が生じる。従来から文書中の文字種を識別
する種々の手法が提案されている。例えば、２値化され
た文字行の縦方向または横方向の黒白反転回数を計数
し、その分布を基に文字種の識別を行う文書認識装置が
ある（特開平５−１０８８７６号公報を参照）。Therefore, language identification must be performed before the character recognition processing. Various methods for identifying the character type in a document have been proposed. For example, there is a document recognition device that counts the number of black-white transitions in the vertical or horizontal direction of a binarized character line and identifies the character type from their distribution (see Japanese Patent Application Laid-Open No. 5-108876).

【０００５】[0005]

【０００７】本発明は上記した事情を考慮してなされた
もので、本発明の目的は、精度よくかつ高速に日本語と
英語の識別を行うと共に、識別する範囲についても各文
字領域毎に、またページ単位毎に両者を識別できる文書
画像の日本語英語判別方法および記録媒体、さらには、
文書画像を判定し、最適な文書認識処理を行う文書認識
方法および記録媒体を提供することにある。The present invention has been made in view of the above circumstances. Its object is to provide a Japanese-English discrimination method for document images, and a recording medium, that identify Japanese and English accurately and at high speed and can do so both for each character region and for each page, and further to provide a document recognition method and a recording medium that judge the document image and perform the optimal document recognition processing.

【０００８】[0008]

【００１３】請求項６記載の発明では、文書画像中の各
文字領域が日本語領域であるか英語領域であるかを判定
する文書画像の日本語英語判定方法であって、前記文字
領域中から行を検出し、該行中から近接した外接矩形を
統合してブロックを抽出し、該ブロック毎に日本語領域
であるか英語領域であるか、あるいは判定不能領域であ
るかを判定し、該判定結果を前記ブロック毎に集計し、
該集計値を基に前記各文字領域が日本語領域であるか英
語領域であるかを判定することを特徴としている。According to the sixth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by: detecting lines in the character region; extracting blocks by merging adjacent circumscribed rectangles within each line; determining for each block whether it is a Japanese region, an English region, or an undeterminable region; totaling the per-block determination results; and determining from the totals whether each character region is a Japanese region or an English region.

【００１７】請求項１０記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定する文書画像の日本語英語判定方法であって、前記各
文字領域から行を切り出し、行の高さと行内の矩形高さ
を基に前記各文字領域が日本語領域であるか英語領域で
あるかを判定することを特徴としている。 According to the tenth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by cutting out lines from each character region and determining, based on the line height and the heights of the rectangles within the line, whether each character region is a Japanese region or an English region.

【００１８】請求項１１記載の発明では、前記行の高さ
と行内の矩形高さは、行の高さに対する行内の各矩形高
さの割合のヒストグラムであることを特徴としている。 In the invention according to claim 11, the line height and the in-line rectangle heights are used in the form of a histogram of the ratio of each rectangle height in the line to the line height.
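The histogram of height ratios described in this paragraph can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `height_ratio_histogram`, the choice of 10 equal-width bins over the range 0 to 1, and the clamping of out-of-range ratios are all assumptions made for the example.

```python
def height_ratio_histogram(line_height, rect_heights, bins=10):
    """Histogram of (rectangle height / line height) for one text line.

    The bin count and the clamping to [0, 1] are assumptions made for
    this sketch; the source does not specify them.
    """
    hist = [0] * bins
    for h in rect_heights:
        # Clamp so that a rectangle slightly taller than the line still counts.
        ratio = min(max(h / line_height, 0.0), 1.0)
        # A ratio of exactly 1.0 falls into the last bin.
        index = min(int(ratio * bins), bins - 1)
        hist[index] += 1
    return hist
```

Intuitively, Japanese characters tend to fill most of the line height, so their ratios would cluster in the top bins, whereas English lowercase letters without ascenders would add mass around the middle bins.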

【００１９】請求項１２記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定する文書画像の日本語英語判定方法であって、前記各
文字領域から行を切り出し、行の高さに対する行内の各
矩形高さの割合のヒストグラムを基に前記各文字領域が
日本語領域であるか英語領域であるかを判定し、何れの
領域にも判定できない不明領域については、予め前記ヒ
ストグラムを基に日本語の特性値と英語の特性値とを算
出しておき、前記不明領域が前記何れの特性値に近いか
を算出することによって日本語領域であるか英語領域で
あるかを判定することを特徴としている。 According to the twelfth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by cutting out lines from each character region and determining, based on a histogram of the ratio of each rectangle height in a line to the line height, whether each character region is a Japanese region or an English region. For an unknown region that cannot be assigned to either, a Japanese characteristic value and an English characteristic value are calculated in advance based on the histogram, and the unknown region is determined to be a Japanese region or an English region by calculating which characteristic value it is closer to.
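The second-stage decision for unknown regions can be sketched as a nearest-reference comparison. This passage does not define the "characteristic value", so the sketch below makes assumptions throughout: it treats each characteristic value as a reference histogram, normalizes all histograms, and measures closeness with an L1 distance. The names `normalize` and `nearest_language` are hypothetical.

```python
def normalize(hist):
    """Scale a histogram so that its entries sum to 1 (all zeros if empty)."""
    total = sum(hist)
    return [v / total for v in hist] if total else [0.0] * len(hist)


def nearest_language(unknown_hist, japanese_ref, english_ref):
    """Assign an unknown region to the closer reference histogram.

    Assumption: the "characteristic values" are modelled as reference
    histograms and closeness as L1 distance; the source does not say so.
    """
    u = normalize(unknown_hist)

    def distance(ref):
        return sum(abs(a - b) for a, b in zip(u, normalize(ref)))

    dj, de = distance(japanese_ref), distance(english_ref)
    if dj < de:
        return "japanese"
    if de < dj:
        return "english"
    return "unknown"  # a tie is left undecided
```

Leaving ties as "unknown" matches claim 16, which provides a further fallback (English recognition confidence) for regions that the second determination cannot resolve.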

【００２０】請求項１３記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定する文書画像の日本語英語判定方法であって、前記各
文字領域から行を切り出し、行内の矩形の最大高さと行
内の矩形高さを基に前記各文字領域が日本語領域である
か英語領域であるかを判定することを特徴としている。 According to the thirteenth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by cutting out lines from each character region and determining, based on the maximum rectangle height in a line and the heights of the rectangles in the line, whether each character region is a Japanese region or an English region.

【００２１】請求項１４記載の発明では、前記行内の矩
形の最大高さと行内の矩形高さは、行内の矩形の最大高
さに対する行内の各矩形高さの割合のヒストグラムであ
ることを特徴としている。 In the invention according to claim 14, the maximum rectangle height in the line and the in-line rectangle heights are used in the form of a histogram of the ratio of each rectangle height in the line to the maximum rectangle height in the line.

【００２２】請求項１５記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定する文書画像の日本語英語判定方法であって、前記各
文字領域から行を切り出し、行内の矩形の最大高さに対
する行内の各矩形高さの割合のヒストグラムを基に前記
各文字領域が日本語領域であるか英語領域であるかを判
定し、何れの領域にも判定できない不明領域について
は、予め前記ヒストグラムを基に日本語の特性値と英語
の特性値とを算出しておき、前記不明領域が前記何れの
特性値に近いかを算出することによって日本語領域であ
るか英語領域であるかを判定することを特徴としてい
る。 According to the fifteenth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by cutting out lines from each character region and determining, based on a histogram of the ratio of each rectangle height in a line to the maximum rectangle height in the line, whether each character region is a Japanese region or an English region. For an unknown region that cannot be assigned to either, a Japanese characteristic value and an English characteristic value are calculated in advance based on the histogram, and the unknown region is determined to be a Japanese region or an English region by calculating which characteristic value it is closer to.

【００２３】請求項１６記載の発明では、前記２度目の
判定でも判定できない不明領域については、英文認識の
確信度を基に日本語領域であるか英語領域であるかを判
定することを特徴としている。 In the invention according to claim 16, an unknown region that cannot be determined even by the second determination is determined to be a Japanese region or an English region based on the confidence of English text recognition.

【００２４】請求項１７記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定する文書画像の日本語英語判定方法であって、前記文
書画像を縮小した画像から外接矩形を生成し、所定の位
置関係にある矩形同士を統合し、統合後の矩形の縦横比
のヒストグラムを基に前記各文字領域が日本語領域であ
るか英語領域であるかを判定することを特徴としてい
る。 According to the seventeenth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by generating circumscribed rectangles from a reduced image of the document image, merging rectangles that are in a predetermined positional relationship, and determining, based on a histogram of the aspect ratios of the merged rectangles, whether each character region is a Japanese region or an English region.

【００２５】請求項１８記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定する文書画像の日本語英語判定方法であって、前記文
書画像を縮小した画像から外接矩形を生成し、所定の位
置関係にある矩形同士を統合し、統合後の矩形の縦横比
のヒストグラムを基に前記各文字領域が日本語領域であ
るか英語領域であるかを判定し、何れの領域にも判定で
きない不明領域については、予め前記ヒストグラムを基
に日本語の特性値と英語の特性値とを算出しておき、前
記不明領域が前記何れの特性値に近いかを算出すること
によって日本語領域であるか英語領域であるかを判定す
ることを特徴としている。 According to the eighteenth aspect of the invention, a Japanese-English determination method for a document image determines whether each character region in the document image is a Japanese region or an English region by generating circumscribed rectangles from a reduced image of the document image, merging rectangles that are in a predetermined positional relationship, and determining, based on a histogram of the aspect ratios of the merged rectangles, whether each character region is a Japanese region or an English region. For an unknown region that cannot be assigned to either, a Japanese characteristic value and an English characteristic value are calculated in advance based on the histogram, and the unknown region is determined to be a Japanese region or an English region by calculating which characteristic value it is closer to.

【００２６】請求項１９記載の発明では、文書画像が日
本語文書画像であるか英語文書画像であるかを判定し、
該判定結果に応じた文書認識処理を行うことを特徴とし
ている。According to the nineteenth aspect, it is determined whether the document image is a Japanese document image or an English document image, and document recognition processing is performed according to the determination result.

【００２７】請求項２０記載の発明では、文書画像を複
数の文字領域に分割し、該分割された文字領域毎に日本
語文書領域であるか英語文書領域であるかを判定し、該
判定結果に応じた文書認識処理を行うことを特徴として
いる。In the twentieth aspect, a document image is divided into a plurality of character regions, it is determined for each of the divided character regions whether it is a Japanese document region or an English document region, and document recognition processing is performed according to the determination result.

【００２８】請求項２１記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定するために、複数の判定方法を用いて日本語領域であ
るか英語領域であるかを判定する機能と、該複数の判定
結果を比較することによって最終判定結果を得る機能を
コンピュータに実現させるためのプログラムを記録した
コンピュータ読み取り可能な記録媒体であることを特徴
としている。According to the twenty-first aspect, a plurality of determination methods are used to determine whether each character area in a document image is a Japanese area or an English area. It is a computer-readable recording medium that records a program for causing a computer to realize a function of determining whether an area is an area and a function of obtaining a final determination result by comparing the plurality of determination results. .

【００２９】請求項２２記載の発明では、文書画像中の
各文字領域または各ページの文書画像が日本語領域であ
るか英語領域であるかを判定するために、前記文書画像
を縮小することにより生成される文字領域内またはペー
ジ内の黒画素連結成分の長さを基に該連結成分を分類す
る機能と、該分類結果の集計値を基に前記各文字領域ま
たは各ページが日本語領域であるか英語領域であるかを
判定する機能をコンピュータに実現させるためのプログ
ラムを記録したコンピュータ読み取り可能な記録媒体で
あることを特徴としている。The invention according to claim 22 is a computer-readable recording medium storing a program that causes a computer to implement, in order to determine whether each character region in a document image, or the document image of each page, is a Japanese region or an English region: a function of classifying the black-pixel connected components in the character region or page, generated by reducing the document image, based on their lengths; and a function of determining whether each character region or each page is a Japanese region or an English region based on the totals of the classification results.

【００３０】請求項２３記載の発明では、文書画像中の
各文字領域が日本語領域であるか英語領域であるかを判
定するために、または、ページが複数の文字領域からな
り、各ページの文書画像が日本語文書画像であるか英語
文書画像であるかを判定するために、前記文字領域中か
ら行を検出する機能と、該行中から近接した外接矩形を
統合してブロックを抽出する機能と、該ブロック毎に日
本語領域であるか英語領域であるか、あるいは判定不能
領域であるかを判定する機能と、該判定結果を前記ブロ
ック毎またはページ単位に集計する機能と、該集計値を
基に、前記各文字領域が日本語領域であるか英語領域で
あるかを判定する機能または各ページが日本語文書画像
であるか英語文書画像であるかを判定する機能をコンピ
ュータに実現させるためのプログラムを記録したコンピ
ュータ読み取り可能な記録媒体であることを特徴として
いる。The invention according to claim 23 is a computer-readable recording medium storing a program that causes a computer to implement, in order to determine whether each character region in a document image is a Japanese region or an English region, or, where a page consists of a plurality of character regions, whether the document image of each page is a Japanese document image or an English document image: a function of detecting lines in the character region; a function of extracting blocks by merging adjacent circumscribed rectangles within each line; a function of determining for each block whether it is a Japanese region, an English region, or an undeterminable region; a function of totaling the determination results per block or per page; and a function of determining, based on the totals, whether each character region is a Japanese region or an English region, or whether each page is a Japanese document image or an English document image.

【００３１】請求項２４記載の発明では、文書画像が日
本語文書画像であるか英語文書画像であるかを判定する
機能または文書画像を複数の文字領域に分割し、該分割
された文字領域毎に日本語文書領域であるか英語文書領
域であるかを判定する機能と、該判定結果に応じた文書
認識処理を行う機能をコンピュータに実現させるための
プログラムを記録したコンピュータ読み取り可能な記録
媒体であることを特徴としている。The invention according to claim 24 is a computer-readable recording medium storing a program that causes a computer to implement: a function of determining whether a document image is a Japanese document image or an English document image, or a function of dividing a document image into a plurality of character regions and determining for each divided character region whether it is a Japanese document region or an English document region; and a function of performing document recognition processing according to the determination result.

【００３２】請求項２５記載の発明では、文書画像中の
各文字領域から行を切り出す機能と、行の高さに対する
行内の各矩形高さの割合のヒストグラムまたは行内の矩
形の最大高さに対する行内の各矩形高さの割合のヒスト
グラムを基に前記各文字領域が日本語領域であるか英語
領域であるかを判定する機能と、何れの領域にも判定で
きない不明領域については、予め前記ヒストグラムを基
に日本語の特性値と英語の特性値とを算出する機能と、
前記不明領域が前記何れの特性値に近いかを算出するこ
とによって日本語領域であるか英語領域であるかを判定
する機能をコンピュータに実現させるためのプログラム
を記録したコンピュータ読み取り可能な記録媒体である
ことを特徴としている。 The invention according to claim 25 is a computer-readable recording medium storing a program that causes a computer to implement: a function of cutting out lines from each character region in a document image; a function of determining whether each character region is a Japanese region or an English region based on a histogram of the ratio of each rectangle height in a line to the line height, or a histogram of the ratio of each rectangle height in a line to the maximum rectangle height in the line; a function of calculating, for an unknown region that cannot be assigned to either, a Japanese characteristic value and an English characteristic value in advance based on the histogram; and a function of determining whether the unknown region is a Japanese region or an English region by calculating which characteristic value it is closer to.

【００３３】請求項２６記載の発明では、文書画像から
縮小画像を生成する機能と、該縮小画像から文字領域を
生成する機能と、該文字領域から外接矩形を生成する機
能と、所定の位置関係にある矩形同士を統合する機能
と、統合後の矩形の縦横比のヒストグラムを基に前記文
字領域が日本語領域であるか英語領域であるかを判定す
る機能をコンピュータに実現させるためのプログラムを
記録したコンピュータ読み取り可能な記録媒体であるこ
とを特徴としている。 The invention according to claim 26 is a computer-readable recording medium storing a program that causes a computer to implement: a function of generating a reduced image from a document image; a function of generating a character region from the reduced image; a function of generating circumscribed rectangles from the character region; a function of merging rectangles that are in a predetermined positional relationship; and a function of determining whether the character region is a Japanese region or an English region based on a histogram of the aspect ratios of the merged rectangles.

【００３４】[0034]

【００３５】図２は、本発明の実施例１の全体の処理フ
ローチャートを示す。以下、図２を参照しながら、本発
明の処理動作を説明する。まず、画像入力手段１０１
は、文書を読み取ることによって文書画像を得る（ステ
ップ２０１）。この画像入力手段は、例えばスキャナ、
ファックスなどであり、またデータ通信手段１０９を介
してネットワーク経由で別の機器から画像を得るように
してもよい。FIG. 2 shows the overall processing flowchart of the first embodiment of the present invention. The processing operation of the present invention is described below with reference to FIG. 2. First, the image input means 101 obtains a document image by reading a document (step 201). The image input means is, for example, a scanner or a facsimile; the image may also be obtained from another device over a network via the data communication means 109.

【００３６】次に、画像縮小手段１０２は、入力された
文書画像を縮小する（ステップ２０２）。この処理は、
例えば入力文書画像を１／８程度にＯＲ縮小する処理で
ある。すなわち、８×８画素を１画素に縮小するもの
で、６４画素中に１つでも黒画素があれば縮小画素は黒
画素とする処理である。Next, the image reducing means 102 reduces the input document image (step 202). This is, for example, a process of OR-reducing the input document image to about 1/8: each 8 × 8 block of pixels is reduced to one pixel, and the reduced pixel is black if even one of the 64 pixels is black.
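The OR reduction of step 202 can be sketched as follows. This is an illustrative sketch rather than the patent's implementation: the image representation (a list of rows of 0/1 values, with 1 meaning black), the function name `or_reduce`, and the handling of partial blocks at the right and bottom edges are assumptions.

```python
def or_reduce(image, factor=8):
    """OR-reduce a binary image by `factor` in each direction.

    `image` is a list of rows of 0/1 values (1 = black). Each
    factor x factor block becomes one output pixel that is black if any
    pixel in the block is black. Partial blocks at the edges are treated
    the same way (an assumption; the source does not cover edges).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out_h = (height + factor - 1) // factor
    out_w = (width + factor - 1) // factor
    reduced = [[0] * out_w for _ in range(out_h)]
    for y in range(height):
        row = image[y]
        for x in range(width):
            if row[x]:
                reduced[y // factor][x // factor] = 1
    return reduced
```

Because a single black pixel is enough to blacken the reduced pixel, narrow gaps between neighbouring characters close up, which is what lets whole English words fuse into one connected component.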

【００３７】連結成分抽出手段１０３は、縮小画像から
黒画素連結成分を抽出する（ステップ２０３）。領域生
成手段１０４は、抽出した連結成分を分類し、統合して
文字領域を生成する（ステップ２０４）。この領域生成
方法として、例えば特開平６−２００９２号公報に記載
された公知の方法を用いればよい。このとき、各文字領
域を構成する連結成分の情報はデータ記憶部１０７に格
納、保持する。The connected component extracting means 103 extracts a black pixel connected component from the reduced image (Step 203). The region generating means 104 classifies the extracted connected components and integrates them to generate a character region (step 204). As this area generation method, for example, a known method described in JP-A-6-20092 may be used. At this time, information on the connected components constituting each character area is stored and held in the data storage unit 107.
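Step 203 can be sketched as a flood fill that returns the circumscribed rectangle of each black connected component in the reduced image. This is a sketch under assumptions: the source does not state the connectivity, so 8-connectivity is assumed here, and the function name and the `(x0, y0, x1, y1)` inclusive box format are hypothetical. The region generation of step 204 follows a separately cited method and is not reproduced.

```python
from collections import deque


def connected_component_boxes(image):
    """Return bounding boxes (x0, y0, x1, y1), inclusive, of black components.

    `image` is a list of rows of 0/1 values (1 = black). 8-connectivity is
    an assumption made for this sketch.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

The width and height of each box are then `x1 - x0 + 1` and `y1 - y0 + 1`.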

【００３８】続いて、生成した文字領域について、日英
判別手段１０５は日本語か英語かの判定を行う（ステッ
プ２０５）。Subsequently, for the generated character area, the Japanese / English determining means 105 determines whether the character area is Japanese or English (step 205).

【００３９】ステップ２０２において画像をＯＲ縮小す
ることにより、近傍の黒画素どうしが融合する。ここで
英文においては単語間にはスペースが存在し、単語内の
文字間は非常に狭いという特徴がある。一方、日本語に
おいては、句読点の前後以外では文字間隔は大きくは変
わらない。In step 202, the adjacent black pixels are merged by OR-reducing the image. Here, English sentences have the feature that there is a space between words and the space between characters in a word is very narrow. On the other hand, in Japanese, the character spacing does not change significantly except before and after punctuation.

【００４０】図３は、英文、日本語文の画像例と、その
外接矩形を示す。英文画像３０１を縮小し、連結成分を
抽出した結果を外接矩形で表現したものが外接矩形３０
２である（なお、縮小処理しているので外接矩形３０２
は、本来画像３０１より小さくなるべきだが、ここでは
同じサイズで表現している）。英文画像では、単語毎に
融合して連結成分が構成される。FIG. 3 shows image examples of English and Japanese text and their circumscribed rectangles. Circumscribed rectangles 302 are the result of reducing the English image 301 and extracting its connected components, shown as circumscribed rectangles (because of the reduction, the rectangles 302 should be smaller than the image 301, but they are drawn at the same size here). In an English image, each word fuses into one connected component.

【００４１】日本語画像３０３と３０５の例について、
同様に縮小して連結成分を抽出し、その外接矩形で表現
すると、それぞれ外接矩形３０４、３０６のようにな
る。For the Japanese image examples 303 and 305, reducing the images, extracting the connected components, and drawing their circumscribed rectangles in the same way gives circumscribed rectangles 304 and 306, respectively.

【００４２】英文の場合は、単語を構成する文字の数が
ある程度一定であるので、縦横比が２倍から６、７倍程
度となる外接矩形が多くなる特徴がある。一方、日本語
の場合は、外接矩形３０４に示すように英文では現れに
くい長い矩形が生じたり、逆に外接矩形３０６のように
細かい矩形が多く生じる特徴がある。In English text, the number of characters per word is fairly consistent, so circumscribed rectangles with a width-to-height ratio of about 2 to 6 or 7 are common. In Japanese text, by contrast, long rectangles that rarely appear in English occur, as in circumscribed rectangle 304, or many small rectangles occur, as in circumscribed rectangle 306.

【００４３】そこで、上記した連結成分矩形を「短」、
「中」、「長」の３種類に分類し、これを各文字領域に
ついて集計する。図４は、実施例１の日英判定の処理フ
ローチャートを示す。図４の処理は各文字領域毎に行わ
れる。矩形の分類は、行方向が横の場合には例えば、幅
／高さが２以下で「短」、幅／高さが２から６で
「中」、それ以上で「長」とする（ステップ４０１）。
そして、文字領域中におけるこの分類結果を集計し（ス
テップ４０２）、文字領域毎に日本語か英語かを判定す
る（ステップ４０３）。ここで、「短」矩形の数をＳＣ
ＮＴ、「中」矩形の数をＮＣＮＴ、「長」矩形の数をＬ
ＣＮＴとすると、日英の判定は図８（ステップ４０３の
詳細フローチャート）に示すように行われる。Therefore, the connected-component rectangles described above are classified into three types, "short", "medium", and "long", and the counts are totaled for each character region. FIG. 4 shows the processing flowchart of the Japanese-English determination of the first embodiment. The process of FIG. 4 is performed for each character region. When the line direction is horizontal, a rectangle is classified, for example, as "short" if width/height is 2 or less, "medium" if width/height is from 2 to 6, and "long" if it is greater than that (step 401). The classification results within the character region are then totaled (step 402), and each character region is determined to be Japanese or English (step 403). Letting SCNT be the number of "short" rectangles, NCNT the number of "medium" rectangles, and LCNT the number of "long" rectangles, the Japanese-English determination is performed as shown in FIG. 8 (the detailed flowchart of step 403).
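Steps 401 and 402 can be sketched as follows for horizontal text, using the example thresholds 2 and 6 given above. The function names are hypothetical, and assigning a ratio of exactly 6 to "medium" is an assumption, since the source only says "from 2 to 6" and "greater than that".

```python
def classify_rectangle(width, height):
    """Classify one rectangle by width/height ratio (step 401)."""
    ratio = width / height
    if ratio <= 2:
        return "short"
    if ratio <= 6:  # boundary handling at exactly 6 is an assumption
        return "medium"
    return "long"


def count_classes(rects):
    """Total the classes over a character region (step 402).

    `rects` is an iterable of (width, height) pairs.
    Returns (SCNT, NCNT, LCNT) in the source's notation.
    """
    counts = {"short": 0, "medium": 0, "long": 0}
    for width, height in rects:
        counts[classify_rectangle(width, height)] += 1
    return counts["short"], counts["medium"], counts["long"]
```

For vertical text the ratio would presumably be inverted (height/width), but the source only gives the horizontal case here.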

[0044] First, it is checked whether LCNT / (NCNT + SCNT) > Th1 holds (step 801). Th1 is a predetermined threshold, for example about 0.3. If this condition holds, long rectangles are sufficiently numerous, and the character area is determined to be a Japanese area (step 804).

[0045] Next, when the result of step 801 is No, it is checked whether NCNT / (LCNT + SCNT) < Th2 holds (step 802). Th2 is also a predetermined threshold, for example 3. If this condition holds, medium rectangles are few, and the character area is determined to be a Japanese area (step 804). If neither condition is satisfied, the area is determined to be an English area (step 803).
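The classification of step 401 and the two tests of FIG. 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the (width, height) tuple input, and the treatment of a ratio of exactly 6 as "medium" are assumptions. The ratios are compared in cross-multiplied form so that an empty denominator does not raise an error.

```python
from collections import Counter

def classify_rectangle(width, height):
    """Step 401: classify one rectangle by width/height (horizontal lines assumed)."""
    ratio = width / height  # assumes height > 0
    if ratio <= 2:
        return "short"
    if ratio <= 6:          # boundary handling at exactly 6 is an assumption
        return "medium"
    return "long"

def judge_region(rectangles, th1=0.3, th2=3.0):
    """Steps 402-403: total the classes, then apply the two tests of FIG. 8."""
    counts = Counter(classify_rectangle(w, h) for w, h in rectangles)
    scnt, ncnt, lcnt = counts["short"], counts["medium"], counts["long"]
    # Step 801: LCNT / (NCNT + SCNT) > Th1, written without division.
    if lcnt > th1 * (ncnt + scnt):
        return "japanese"
    # Step 802: NCNT / (LCNT + SCNT) < Th2, written without division.
    if ncnt < th2 * (lcnt + scnt):
        return "japanese"
    return "english"
```

With these example thresholds, an area dominated by medium rectangles comes out as English, and an area dominated by short rectangles comes out as Japanese.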

[0046] <Embodiment 2> In Embodiment 1 described above, the Japanese-English determination is made for each character area. Some character areas, however, contain very few characters. In such a case too few rectangles are obtained, and it may be difficult to make the determination from the ratio of rectangle counts. Embodiment 2 takes into account the case where the number of rectangles is insufficient.

[0047] FIG. 5 shows the processing flowchart of Embodiment 2. The Japanese-English discriminating means 105 checks whether the number of rectangles counted in the area is sufficient (that is, whether it is equal to or greater than a predetermined threshold Th) (step 501). If it is not sufficient, Japanese-English discrimination using OCR, as described in the above-cited JP-A-6-150061, is performed (step 503). Because the number of characters is small in this case, applying OCR adds little processing time. If the number of rectangles is sufficient, Japanese and English are discriminated by rectangle length as described in Embodiment 1 (step 502).
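The branch of FIG. 5 amounts to a small dispatcher. The sketch below assumes the two discriminators are supplied as callables; the names and the default threshold of 20 are hypothetical, since the text gives no value for Th.

```python
def judge_region_with_fallback(region, count_rectangles,
                               judge_by_shape, judge_by_ocr, th=20):
    """Step 501: use rectangle statistics only when enough rectangles exist.

    `th` stands in for the threshold Th; its default is a placeholder.
    """
    if count_rectangles(region) >= th:
        return judge_by_shape(region)   # step 502: rectangle-length method
    return judge_by_ocr(region)         # step 503: OCR-based method
```

Passing the discriminators in as arguments keeps the sketch independent of any particular OCR engine.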

[0048] <Embodiment 3> Next, Embodiment 3, which performs Japanese-English discrimination on a page basis, is described. FIGS. 6 and 7 show detailed flowcharts of step 205 according to Embodiment 3. In the method shown in FIG. 6, the numbers of "short", "medium", and "long" rectangles are totaled for the entire page rather than for each character area (steps 601 and 602), and the result is used to make the Japanese-English determination for the page (step 603). The determination follows the processing flowchart of FIG. 8. The thresholds Th1 and Th2 used here may differ from those used for processing in units of character areas.

[0049] In the method shown in FIG. 7, Japanese-English discrimination is performed for each character area (step 702), and the page is judged from those results (step 703). Specifically, letting Jn be the number of areas determined to be Japanese and En the number of areas determined to be English, the page is determined to be a Japanese page if Jn > En and an English page if En > Jn. If Jn = En, the page may be rejected, or it may be assigned to either Japanese or English.
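The page-level vote of FIG. 7 can be sketched as below. The label strings and the `tie` parameter are assumptions; the text leaves the handling of Jn = En open, so the caller chooses it here.

```python
def judge_page(region_labels, tie="reject"):
    """Step 703: majority vote over per-area results.

    `tie` is returned when Jn == En; it may be "reject", "japanese", or "english".
    """
    jn = sum(1 for label in region_labels if label == "japanese")
    en = sum(1 for label in region_labels if label == "english")
    if jn > en:
        return "japanese"
    if en > jn:
        return "english"
    return tie
```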

[0050] <Embodiment 4> A Japanese-English discrimination method that uses features different from those of the embodiments above is described. FIG. 9 shows the configuration of Embodiment 4. It differs from Embodiment 1 in that a line segmentation unit 902, a block extraction unit 903, and an in-block character type discrimination unit 904 are provided. The other components are the same as those of Embodiment 1. FIG. 10 shows the processing flowchart of Embodiment 4.

[0051] First, the line segmentation unit 902 cuts out lines from the character areas of the document image (steps 1001 and 1002). When the technique described in JP-A-6-20092 is used for the area generation processing, line information is already available at the stage where the areas are extracted, and it may be used as is. Alternatively, the projection-based method described in the Institute of Electronics and Communication Engineers paper "Segmentation of Document Images Using Peripheral Density Distribution, Line Density, and Circumscribed Rectangle Features" (Akiyama et al., August 1986, Vol. J69-D, No. 8) may be used.

[0052] Next, the block extraction unit 903 extracts blocks corresponding to words (step 1003). The method previously proposed by the present applicant in Japanese Patent Application No. 8-34781 may be used for this block extraction. That is, the block extraction unit 111 detects the circumscribed rectangles within the line data and groups them into block data, as follows. A histogram of the intervals between character rectangles is obtained. (At this point one rectangle has not yet been confirmed to be one character; for kanji, the left-hand radical (hen) and the right-hand component (tsukuri) often form separate rectangles.) FIG. 18 shows extracted character rectangles and the distances between them. FIG. 19 shows a histogram of the rectangle intervals.

[0053] In this histogram, the peak at the shortest distance tends to correspond to the gap between the radical and the right-hand component of a kanji, or to the inter-character distance within a word in proportionally spaced Latin text. Merging across such gaps rarely brings different character types into one block, so rectangles separated by these gaps are merged to form block data. As a result, proportionally spaced words, and kanji that are split into more than one rectangle (that is, into radical and right-hand component), are each merged into one unit.

[0054] The peak at the longest distance often corresponds to the distance between words or between a punctuation mark and the following character. These gaps, especially the gaps between words, often mark the boundary where the character type changes, so the rectangles on either side should not fall into the same block. Character rectangles separated by a distance equal to or greater than the longest-distance peak value are therefore not placed in the same block.

[0055] In addition, the distances (A, B) from a target rectangle to the rectangles on both sides are measured. When their difference (A - B) is equal to or greater than a predetermined threshold, the rectangles separated by the longer distance are not merged, and the rectangles separated by the shorter distance are merged. FIG. 20 illustrates the case where rectangles are not merged at positions where the difference between adjacent intervals is large. In FIG. 20, rectangles are not merged at the positions where the difference exceeds the predetermined threshold, so three blocks are formed. With this processing, even in proportionally spaced English text where the distance between words is small in absolute terms, the word gap should still differ from the inter-character gap, so the rectangles of a single word can be merged by themselves. Moreover, even in a proportional font, Japanese kanji are placed at relatively equal intervals, which is also convenient for grouping Japanese text.
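The two splitting rules of paragraphs [0054] and [0055] can be sketched on a list of gaps between consecutive rectangles in one line. This is an illustrative simplification: the function name and parameters are assumptions, the longest-peak value is passed in rather than derived from the histogram, and the absolute difference |A - B| is used for the comparison.

```python
def split_into_blocks(gaps, long_peak, diff_threshold):
    """Group rectangles 0..len(gaps) into blocks.

    gaps[i] is the distance between rectangle i and rectangle i + 1.
    A cut is placed at a gap that is at least `long_peak`, or at the
    longer of two neighbouring gaps whose difference is at least
    `diff_threshold`.
    """
    cut = [gap >= long_peak for gap in gaps]
    for i in range(1, len(gaps)):
        a, b = gaps[i - 1], gaps[i]      # gaps to the left and right of rectangle i
        if abs(a - b) >= diff_threshold:
            if a > b:
                cut[i - 1] = True        # do not merge across the longer gap
            else:
                cut[i] = True
    blocks = [[0]]
    for i, is_cut in enumerate(cut):
        if is_cut:
            blocks.append([i + 1])
        else:
            blocks[-1].append(i + 1)
    return blocks
```

For gaps of 1, 1, 6, 1, 1, 7, 1 with a difference threshold of 3, the sketch yields three blocks, in the spirit of the situation FIG. 20 is described as showing.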

[0056] With the block extraction method described above, because words in English text, unlike in Japanese documents, are separated by a space equivalent to a half-width character, English words are kept from being mixed with other character types in a single block.

[0057] Next, the in-block character type discrimination unit 904 performs Japanese-English discrimination for each block (step 1004). The method of the above-cited application may also be used here. That is, the in-block character type discrimination unit 904 determines whether each group formed into a block by the above processing is Japanese or alphanumeric. All characters within a block are treated as the same character type. The character type is determined as follows. A rectangle in the block is identified as a Japanese character when, relative to the width of the rectangle, the number of vertical black runs or the number of black/white transitions in the rectangle is equal to or greater than a predetermined threshold; Latin letters are identified from the vertical coordinate values of the rectangles in the extracted block. FIGS. 21(a) and 21(b) show concrete examples of the number of vertical runs for Japanese and Latin characters. For alphanumeric characters, in the ideal noise-free case, the maximum is four runs, for the letter "g" (FIG. 21(b)). Therefore, when five or more runs are counted, the character is taken to be Japanese. For the character 「像」 shown in FIG. 21(a), the number of vertical runs varies as indicated by the numbers below the character.
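Counting vertical black runs per pixel column can be sketched as follows. The bitmap representation (a list of rows of 0/1 values), the function names, and the simplification of deciding on the single column with the most runs are assumptions; the text only states that five or more runs indicate Japanese.

```python
def vertical_black_runs(column):
    """Count maximal runs of black (1) pixels going down one pixel column."""
    runs, previous = 0, 0
    for pixel in column:
        if pixel and not previous:
            runs += 1
        previous = pixel
    return runs

def max_vertical_runs(bitmap):
    """Largest per-column run count in a rectangular 0/1 bitmap."""
    if not bitmap or not bitmap[0]:
        return 0
    width = len(bitmap[0])
    return max(vertical_black_runs([row[x] for row in bitmap])
               for x in range(width))

def judge_character(bitmap, run_threshold=5):
    """Five or more runs in some column is taken as Japanese."""
    if max_vertical_runs(bitmap) >= run_threshold:
        return "japanese"
    return "alphanumeric"
```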

[0058] The Japanese-English discriminating means 905 totals the per-block discrimination results and performs Japanese-English discrimination for the area (step 1005). Let JCNT be the number of blocks determined to be Japanese, ECNT the number of blocks determined to be English, and NCNT the number of blocks determined to be indeterminate. FIG. 11 is the detailed flowchart of step 1005. If JCNT * Th3 > ECNT, the area is determined to be Japanese (steps 1101 and 1105); otherwise, if ECNT > JCNT, it is determined to be English (steps 1102 and 1104). In all other cases the area is rejected (step 1103). The threshold Th3 is, for example, 2.
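The decision of FIG. 11 reduces to two comparisons. The sketch below uses a hypothetical function name and label strings; NCNT does not enter either test and is therefore not a parameter.

```python
def judge_from_blocks(jcnt, ecnt, th3=2.0):
    """Step 1005: decide an area from per-block results (FIG. 11)."""
    if jcnt * th3 > ecnt:     # step 1101
        return "japanese"
    if ecnt > jcnt:           # step 1102
        return "english"
    return "reject"           # step 1103
```

With Th3 = 2, the reject branch is reached only when both counts are zero, because ECNT >= 2 * JCNT with JCNT > 0 already implies ECNT > JCNT.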

[0059] <Embodiment 5> In Embodiment 4 described above, the Japanese-English determination is made for each character area. Some character areas, however, contain very few characters. In such a case too few rectangles are obtained, and it may be difficult to make the determination from the ratio of per-block discrimination results. Embodiment 5 addresses the case where the number of blocks is insufficient.

[0060] FIG. 12 shows the processing flowchart of Embodiment 5. The Japanese-English discriminating means 105 checks whether the number of blocks counted in the character area is sufficient (that is, whether it is equal to or greater than a predetermined threshold Th) (step 1201). If it is not sufficient, Japanese-English discrimination using OCR, as described in the above-cited JP-A-6-150061, is performed (step 1203). Because the number of characters is small in this case, applying OCR adds little processing time. If the number of blocks is sufficient, Japanese and English are discriminated from the per-block results as described in Embodiment 4 (step 1202).

[0061] <Embodiment 6> Embodiment 6 changes the per-character-area Japanese-English discrimination of Embodiment 4 to discrimination on a page basis. The processing flowcharts of FIGS. 6 and 7 are used for Embodiment 6.

[0062] In the processing of FIG. 6, JCNT, ECNT, and NCNT are totaled for the entire page rather than for each character area, and the result is used to make the Japanese-English determination by the processing method of FIG. 11 described above. Th3 may differ from the value used for character-area units.

[0063] In the processing of FIG. 7, discrimination is first performed for each character area, and the page is judged from those results. Specifically, letting Jn be the number of areas determined to be Japanese and En the number of areas determined to be English, the page is determined to be a Japanese page if Jn > En and an English page if En > Jn. If Jn = En, the page may be rejected or assigned to either Japanese or English.

[0064] <Embodiment 7> In Embodiment 7, when Japanese-English discrimination is performed for each character area or for each page, discrimination is carried out separately, as shown in FIG. 13, by the discrimination processing that uses rectangle length (step 1301) and by the discrimination processing that uses per-block discrimination results (step 1302). The final Japanese-English determination is then made from the two results (step 1303).

[0065] If both methods give Japanese, or both give English, that result is taken as the final result. If one of them gives a reject, the result of the method that did not reject is taken as the final result.

[0066] If the two results disagree, one being Japanese and the other English, one of the following is done.
(1) The area is rejected.
(2) A confidence value is calculated for each method, and the result with the larger value is adopted.
The confidence of the method that uses rectangle length is, for example:
when LCNT / (NCNT + SCNT) > Th1 with Th1 = 0.3, the value LCNT / (NCNT + SCNT) * 2.5 (with an upper limit of 1);
when NCNT / (LCNT + SCNT) < Th2 with Th2 = 3, the value (LCNT + SCNT) / NCNT * 2.5 (with an upper limit of 1);
when NCNT / (LCNT + SCNT) > Th2 with Th2 = 3, the value NCNT / (LCNT + SCNT) * 0.33 (with an upper limit of 1).

[0067] The confidence of the method that uses per-block discrimination results is, for example:
when JCNT * Th3 > ECNT with Th3 = 2, the value JCNT / (ECNT * 3) (with an upper limit of 1);
when ECNT > JCNT, the value ECNT / JCNT * 0.7 (with an upper limit of 1).
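The confidence formulas of paragraphs [0066] and [0067] and the combination rule of paragraph [0065] can be sketched together. Everything beyond the formulas themselves is an assumption made for illustration: the function names, the (label, confidence) tuples, the handling of zero denominators, the treatment of NCNT / (LCNT + SCNT) equal to Th2 as English, and rejecting when the two confidences tie.

```python
def capped(value):
    """Apply the upper limit of 1 stated for every confidence value."""
    return min(1.0, value)

def rect_length_confidence(lcnt, ncnt, scnt, th1=0.3, th2=3.0):
    """Label and confidence for the rectangle-length method."""
    if lcnt + ncnt + scnt == 0:
        return "reject", 0.0                       # assumption: nothing to judge
    inf = float("inf")
    long_ratio = lcnt / (ncnt + scnt) if ncnt + scnt else inf
    mid_ratio = ncnt / (lcnt + scnt) if lcnt + scnt else inf
    if long_ratio > th1:
        return "japanese", capped(long_ratio * 2.5)
    if mid_ratio < th2:
        conf = capped((lcnt + scnt) / ncnt * 2.5) if ncnt else 1.0
        return "japanese", conf
    return "english", capped(mid_ratio * 0.33)

def block_confidence(jcnt, ecnt, th3=2.0):
    """Label and confidence for the per-block method."""
    if jcnt * th3 > ecnt:
        return "japanese", (capped(jcnt / (ecnt * 3)) if ecnt else 1.0)
    if ecnt > jcnt:
        return "english", (capped(ecnt / jcnt * 0.7) if jcnt else 1.0)
    return "reject", 0.0

def combine(first, second):
    """Step 1303: merge two (label, confidence) results."""
    (label_a, conf_a), (label_b, conf_b) = first, second
    if label_a == label_b:
        return label_a
    if label_a == "reject":
        return label_b
    if label_b == "reject":
        return label_a
    if conf_a == conf_b:
        return "reject"                            # assumption: tie is rejected
    return label_a if conf_a > conf_b else label_b
```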

[0068] <Embodiment 8> FIG. 14 shows the configuration of Embodiment 8, and FIG. 15 shows its processing flowchart. In this embodiment, the Japanese-English discriminating unit 1412 performs Japanese-English discrimination on the entire page of the input document, using the method of Embodiment 3 or 6 described above, to determine whether the page is Japanese or English (steps 1501 and 1502). Based on the result, the selection unit 1403 selects the English document recognition unit 1404 or the Japanese document recognition unit 1405, document recognition processing for the selected language is performed (steps 1504 and 1505), and the recognition result is output to an output unit such as a display (step 1506).

[0069] Because Japanese and English have different properties, it may be better to switch the area segmentation processing, font identification processing, and the like as well. The document recognition units of this embodiment therefore include not only character recognition processing but also the area segmentation and font identification processing mentioned above.

[0070] <Embodiment 9> FIG. 16 shows the configuration of Embodiment 9, and FIG. 17 shows its processing flowchart. It differs from Embodiment 8 in that Japanese-English discrimination is performed for each character area. To this end, the area segmentation unit 1602 divides the input document into character areas (steps 1701 and 1702). The area segmentation unit uses a segmentation method applicable to both Japanese and English. After segmentation, the Japanese-English discriminating unit 1603 performs discrimination for each character area, for example by the method of Embodiment 1 (step 1704). Based on the result, the selection unit 1604 selects the English document recognition unit 1605 or the Japanese document recognition unit 1606, document recognition processing for the selected language is performed (steps 1705 and 1706), and the recognition result is output to an output unit 1607 such as a display (step 1707). The document recognition units of Embodiment 9 perform font identification processing in addition to document recognition processing.
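The per-area routing of Embodiment 9 can be sketched as a loop that picks a recognizer for each area. The discriminator and the two recognizers are passed in as callables with hypothetical names; no real OCR library is implied.

```python
def recognize_regions(regions, judge, recognize_japanese, recognize_english):
    """Steps 1704-1706: discriminate each area, then run the matching recognizer."""
    results = []
    for region in regions:
        label = judge(region)                                   # unit 1603
        if label == "japanese":
            recognizer = recognize_japanese                     # unit 1606
        else:
            recognizer = recognize_english                      # unit 1605
        results.append((label, recognizer(region)))
    return results
```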

[0071] <Embodiment 10> The embodiments described above determine Japanese or English using black-pixel connected components and rectangle length as features. However, the determination method that uses black-pixel connected components takes processing time, and the method that uses rectangle length may produce a high rate of rejects. There is also a method that distinguishes Japanese from English text based on the peak positions of the frequency distribution of the relative positions, within the line, of the top and bottom edges of circumscribed rectangles (see Japanese Examined Patent Publication No. 7-21817). With that method, however, the frequency distribution changes greatly when a skewed document is input, and the discrimination accuracy drops.

[0072] This embodiment therefore discriminates Japanese from English using a histogram of the heights of the circumscribed rectangles in a line relative to the line height, so that Japanese and English are discriminated accurately and quickly for each area of the document image. For areas that cannot be decided even by this method, Japanese-English discrimination is performed by another method.

[0073] FIG. 22 shows the configuration of Embodiment 10, and FIG. 23 is its overall processing flowchart. First, the image input means 2201 obtains a document image by reading a document (step 2301). The image input means is, for example, a scanner or a facsimile machine; an image may also be obtained from another device over a network via the data communication means 2207.

[0074] Next, the area generation means 2202 generates character areas (step 2302). The method described in JP-A-6-20092, for example, may be used for area generation. The line segmentation means 2203 then cuts out lines from the character areas for character recognition; that is, it obtains the circumscribed rectangles of the characters and merges them to generate lines (step 2303). The Japanese-English discrimination means 2204 performs Japanese-English discrimination on the generated character areas (step 2304).

[0075] Japanese-English discrimination is performed as follows. FIG. 27 is the detailed flowchart of the discrimination (step 2304). FIG. 24 shows an example of a cut-out line and the circumscribed rectangles within it. First, the frequency distribution of the ratio of the height of each circumscribed rectangle in a line to the line height is calculated (steps 2701 and 2702). Let lineheight be the line height and height the rectangle height; the ratio is heightrate = height * 100 / lineheight. For a skewed document such as that of FIG. 25, the maximum height of the rectangles in the line may be used as lineheight in place of the line height, for more accurate discrimination. That is, for a skewed input document, discrimination is based on a histogram of the ratio of each circumscribed rectangle's height to the maximum rectangle height in the line.
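The heightrate computation, including the alternative for skewed documents, can be sketched as below. The function name and the convention of passing `line_height=None` to select the maximum rectangle height are assumptions.

```python
def height_rates(rect_heights, line_height=None):
    """heightrate = height * 100 / lineheight for every rectangle in one line.

    When line_height is None, the tallest rectangle in the line is used as
    lineheight, as suggested for skewed documents.
    """
    if not rect_heights:
        return []
    base = line_height if line_height is not None else max(rect_heights)
    return [height * 100 / base for height in rect_heights]
```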

[0076] Let lcnt be the number of rectangles whose heightrate is, for example, 80 or more; ncnt the number whose heightrate is, for example, 70 or more and less than 80; and scnt the number whose heightrate is, for example, 40 or more and less than 70. lcnt, ncnt, and scnt are obtained over all rectangles in the character area.

[0077] FIG. 26 shows an example of the rectangle counts obtained for a Japanese document and an English document. In general, lcnt tends to be large for Japanese and scnt tends to be large for English. Predetermined thresholds thJ and thE are therefore set; the area is determined to be Japanese when lcnt / scnt > thJ (step 2703) and English when lcnt / scnt < thE (step 2704). Otherwise it is treated as an unknown area (step 2705).
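The binning of paragraph [0076] and the threshold test of paragraph [0077] can be sketched as follows. The function names are assumptions, and thJ and thE are required arguments because the text gives no values for them. The ratio test is written in cross-multiplied form so that scnt = 0 does not cause a division error.

```python
def count_bins(rates):
    """Count rectangles by heightrate using the example bin edges 80, 70, 40."""
    lcnt = sum(1 for r in rates if r >= 80)
    ncnt = sum(1 for r in rates if 70 <= r < 80)
    scnt = sum(1 for r in rates if 40 <= r < 70)
    return lcnt, ncnt, scnt

def judge_by_height(lcnt, scnt, th_j, th_e):
    """Steps 2703-2705: compare lcnt / scnt with thJ and thE."""
    if lcnt > th_j * scnt:
        return "japanese"
    if lcnt < th_e * scnt:
        return "english"
    return "unknown"
```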

[0078] An unknown area can be discriminated using a statistical method. FIG. 28 is the detailed processing flowchart for unknown areas. For example, the feature values lcnt, ncnt, and scnt of Japanese areas and English areas are normalized in advance, and their mean and the inverse of their covariance matrix are obtained for Japanese and for English. Using the mean and the inverse covariance matrix, the Mahalanobis distance is then calculated for each of Japanese and English (steps 2801 and 2802).

[0079] Let Dj be the Mahalanobis distance for Japanese and De that for English, and let Me and Mj be predetermined thresholds. The area is determined to be English when Dj / De > Me (step 2803) and Japanese when Dj / De < Mj (step 2804). If neither condition is satisfied, the area is determined to be unknown (step 2805). The Euclidean distance or city-block distance from the mean may be used instead of the Mahalanobis distance.
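The Mahalanobis test of paragraphs [0078] and [0079] can be sketched with plain lists, assuming the class means and inverse covariance matrices have been computed beforehand. The function names, the (mean, inverse covariance) pair layout, and the cross-multiplied comparison are assumptions; Me and Mj are required arguments because the text gives no values.

```python
import math

def mahalanobis(x, mean, inv_cov):
    """sqrt((x - mean)^T * inv_cov * (x - mean)) for small dense matrices."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

def judge_unknown(x, jp_stats, en_stats, me, mj):
    """Steps 2803-2805; each *_stats is a (mean, inverse covariance) pair."""
    dj = mahalanobis(x, *jp_stats)
    de = mahalanobis(x, *en_stats)
    if dj > me * de:          # Dj / De > Me
        return "english"
    if dj < mj * de:          # Dj / De < Mj
        return "japanese"
    return "unknown"
```

Here `x` is the normalized feature vector (lcnt, ncnt, scnt) of the area being judged.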

[0080] Areas still determined to be unknown are discriminated using the confidence of English text recognition. FIG. 29 is the detailed processing flowchart of step 2805. Confidence values are calculated by English text recognition (step 2901). Then, for example, let Good be the number of words with a confidence of 60% or more, Bad the number of words with a confidence below 60% but not 0, and Zelo the number of words with a confidence of 0 (step 2902).

[0081] Let Value be the decision value for Japanese-English discrimination, with Value = Good / (Good + Bad + Zelo) (step 2903). If Value exceeds a predetermined threshold theocr (step 2904), the area is determined to be English; otherwise it is determined to be Japanese.

[0082] Zelo may also be weighted. If, for example, each Zelo word is counted as three Bad words, then Bad = Bad + Zelo × 3 and Value = Good / (Good + Bad); the area can be determined to be English if Value exceeds the threshold theocr and Japanese otherwise. In this way, even areas with too few characters for the Japanese-English determination are discriminated by English recognition confidence, so discrimination is performed accurately for each area.
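The Value computation, with and without the weighting of zero-confidence words, can be sketched as follows. The function names, the representation of confidences as percentages from 0 to 100, and returning 0.0 for an empty word list are assumptions; `theocr` is a required argument because the text gives no value for it.

```python
def ocr_value(confidences, good_threshold=60, zelo_weight=1):
    """Value = Good / (Good + Bad + Zelo * weight) from per-word confidences."""
    good = sum(1 for c in confidences if c >= good_threshold)
    zelo = sum(1 for c in confidences if c == 0)
    bad = len(confidences) - good - zelo     # below the threshold but not 0
    denominator = good + bad + zelo * zelo_weight
    return good / denominator if denominator else 0.0

def judge_by_ocr(confidences, theocr, zelo_weight=1):
    """Step 2904: English if Value exceeds theocr, Japanese otherwise."""
    value = ocr_value(confidences, zelo_weight=zelo_weight)
    return "english" if value > theocr else "japanese"
```

Setting `zelo_weight=3` reproduces the weighted variant in which each zero-confidence word counts as three Bad words.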

[0083] <Embodiment 11> In this embodiment, circumscribed rectangles are generated from a reduced version of the input document image, the generated rectangles are merged appropriately, and a histogram of the aspect ratios of the merged rectangles is used to perform Japanese-English discrimination more accurately.

[0084] FIG. 30 shows the configuration of Embodiment 11, and FIG. 31 is its overall processing flowchart. As in the embodiments above, the document image input by the image input means 3001 is reduced by the image reduction means 3002 (steps 3101 and 3102). This processing, for example, OR-compresses the document image to about 1/4: each 4 × 4 block of pixels is reduced to one pixel, and the reduced pixel is black if at least one of the 16 pixels is black.
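The OR compression can be sketched on a binary image stored as a list of rows of 0/1 values. The representation, the function name, and the handling of partial tiles at the right and bottom edges are assumptions.

```python
def or_reduce(image, factor=4):
    """Reduce a 0/1 image so each factor x factor tile becomes one pixel.

    The output pixel is 1 (black) if any pixel in the tile is 1.
    """
    if not image:
        return []
    height, width = len(image), len(image[0])
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            tile = (image[y][x]
                    for y in range(top, min(top + factor, height))
                    for x in range(left, min(left + factor, width)))
            row.append(1 if any(tile) else 0)
        reduced.append(row)
    return reduced
```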

[0085] Next, the region generation means 3003 generates character regions (step 3103). The method described in JP-A-6-20092, for example, may be used for this region generation. The rectangle integration means 3004 then merges rectangles so that the characteristics of Japanese and English appear clearly (step 3104). For example, as shown in FIG. 32, when the top and bottom y-coordinates (vertical direction) of rectangles 1 and 2 are close to each other and the x-coordinates of the adjacent rectangles 1 and 2 are very close (for example, the horizontal distance between the rectangles is smaller than the distance corresponding to a space in English text), the rectangles are merged. Similarly, as shown in FIG. 33, when the left rectangle 1 contains the right rectangle 2 in its y-coordinate range and the x-coordinates of the adjacent rectangles 1 and 2 are very close (for example, the horizontal distance between the rectangles is smaller than the distance corresponding to a space in English text), the rectangles are merged.
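The two merge conditions (FIG. 32 and FIG. 33) can be sketched as a single predicate. This is a hedged illustration only: the `Rect` type, the function names, and the parameters `y_tol` (how close the top and bottom edges must be) and `space_width` (the distance corresponding to an English space) are hypothetical, and the text gives no concrete values for them. The y-axis is assumed to grow downward, so `top < bottom`.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def should_merge(a, b, y_tol, space_width):
    """Return True if two rectangles meet either merge condition."""
    # Order the pair so that `first` is the left-hand rectangle.
    first, second = (a, b) if a.left <= b.left else (b, a)
    gap = second.left - first.right
    if gap >= space_width:
        return False  # too far apart horizontally for either case
    # FIG. 32 case: top and bottom edges are close to each other.
    aligned = (abs(first.top - second.top) <= y_tol
               and abs(first.bottom - second.bottom) <= y_tol)
    # FIG. 33 case: the left rectangle contains the right one in y.
    contained = first.top <= second.top and second.bottom <= first.bottom
    return aligned or contained


def merge(a, b):
    """Bounding rectangle of a and b."""
    return Rect(min(a.left, b.left), min(a.top, b.top),
                max(a.right, b.right), max(a.bottom, b.bottom))
```

For example, with `y_tol=2` and `space_width=5`, `Rect(0, 0, 10, 10)` and `Rect(12, 1, 20, 10)` would be merged into `Rect(0, 0, 20, 10)`, while a rectangle starting at x = 30 would be left separate.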

[0086] Each rectangle is then classified by its aspect ratio (rectangle height / rectangle width) into one of four feature classes: long rectangles, medium rectangles, small rectangles, and very small rectangles (FIG. 34). In general, long rectangles appear at a high rate in Japanese, whereas medium rectangles appear at a high rate in English. Using this difference in characteristics, the Japanese-English identification means 3005 creates an identification formula and performs Japanese-English identification (step 3105). FIG. 35 is a detailed flowchart of the Japanese-English identification processing.

[0087] For example, the following counts are computed (step 3501): lcnt, the number of long rectangles in the region; ncnt, the number of medium rectangles in the region; scnt, the number of small rectangles in the region; and sscnt, the number of very small rectangles in the region (these are mostly noise). Next, the proportion of long rectangles in the region, ratio1 = lcnt / (ncnt + scnt), is computed (step 3502), and the proportion of medium rectangles in the region, ratio2 = ncnt / (lcnt + scnt), is computed (step 3503). In computing these ratios, sscnt is treated as noise and ignored.

[0088] With ratio1 as the x-coordinate and ratio2 as the y-coordinate, the plane is divided into a Japanese region, an English region, and a reject region so that misidentification is kept to a minimum and the part where Japanese and English overlap is rejected. For example, if ratio2 / ratio1 > thE, the region is determined to be an English region (step 3504); if ratio2 / ratio1 < thJ, it is determined to be a Japanese region (step 3505); otherwise, the region is determined to be Japanese/English unknown (step 3506). Here, thE and thJ are predetermined thresholds.
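Steps 3501 to 3506 can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name, the string labels, and the threshold values in the usage lines are hypothetical. The text also does not say what to do when a denominator is zero, so the sketch returns "unknown" in that case and compares ratio2 against thE × ratio1 (equivalent to ratio2 / ratio1 > thE when ratio1 > 0) to avoid dividing by a zero ratio1.

```python
def classify_by_ratio(lcnt, ncnt, scnt, th_e, th_j):
    """Classify a region from its long/medium/small rectangle counts.

    sscnt (very small rectangles) is deliberately not a parameter,
    because the text treats it as noise and ignores it.
    """
    # Assumption: an undefined ratio means the region cannot be judged.
    if ncnt + scnt == 0 or lcnt + scnt == 0:
        return "unknown"
    ratio1 = lcnt / (ncnt + scnt)  # proportion of long rectangles
    ratio2 = ncnt / (lcnt + scnt)  # proportion of medium rectangles
    if ratio2 > th_e * ratio1:     # ratio2 / ratio1 > thE
        return "english"
    if ratio2 < th_j * ratio1:     # ratio2 / ratio1 < thJ
        return "japanese"
    return "unknown"               # reject region


# Usage with made-up thresholds thE = 2.0 and thJ = 0.5.
mostly_medium = classify_by_ratio(lcnt=2, ncnt=10, scnt=4, th_e=2.0, th_j=0.5)
mostly_long = classify_by_ratio(lcnt=10, ncnt=2, scnt=4, th_e=2.0, th_j=0.5)
```

With these illustrative thresholds, the first region (many medium rectangles) falls in the English area and the second (many long rectangles) in the Japanese area; a region with equal long and medium counts lands in the reject band between the two lines through the origin.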

[0089] Regions determined to be Japanese/English unknown are then identified using a statistical method, as in Embodiment 10. For example, the feature values lcnt, ncnt, and scnt of Japanese regions and English regions are normalized in advance, and their mean and the inverse of their covariance matrix are obtained separately for Japanese and for English. Using the mean and the inverse covariance matrix, the Mahalanobis distance is computed for each of Japanese and English. Let Dj be the Mahalanobis distance for Japanese and De the Mahalanobis distance for English, and let Me and Mj be predetermined thresholds. The region is determined to be English when Dj / De > Me and Japanese when Dj / De < Mj. If neither condition is satisfied, the region is determined to be unknown. Instead of the Mahalanobis distance, the Euclidean distance or the city-block distance from the mean may be used.
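The distance-ratio test can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the function names, the `(mean, inverse_covariance)` model tuples, and the numbers in the usage lines are all hypothetical, and the means and inverse covariance matrices are taken as given (the text says they are obtained in advance from normalized training features). The comparison is cross-multiplied so that De = 0 does not cause a division by zero.

```python
import math


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance sqrt((x - mean)^T * inv_cov * (x - mean))."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    # Matrix-vector product inv_cov * diff.
    weighted = [sum(inv_cov[i][j] * diff[j] for j in range(n))
                for i in range(n)]
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


def classify_by_distance(x, jp_model, en_model, m_e, m_j):
    """x: normalized (lcnt, ncnt, scnt); each model: (mean, inv_cov)."""
    d_j = mahalanobis(x, *jp_model)  # distance to the Japanese class
    d_e = mahalanobis(x, *en_model)  # distance to the English class
    if d_j > m_e * d_e:              # Dj / De > Me
        return "english"
    if d_j < m_j * d_e:              # Dj / De < Mj
        return "japanese"
    return "unknown"


# Usage with toy models: identity inverse covariance, made-up means.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
jp = ([1.0, 0.0, 0.0], identity)
en = ([0.0, 1.0, 0.0], identity)
label = classify_by_distance([0.1, 0.9, 0.0], jp, en, m_e=2.0, m_j=0.5)
```

With an identity inverse covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which is one of the alternatives the text allows.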

[0090] <Embodiment 12> The present invention is not limited to the embodiments described above and can also be implemented in software. In that case, as shown in FIG. 36, a computer system comprising a CPU, a memory, a display device, a hard disk, a keyboard, a CD-ROM drive, a scanner, and the like is prepared, and a program implementing the Japanese-English determination function and the document recognition function of the present invention is recorded on a computer-readable recording medium such as a CD-ROM. A document image input from an image input means such as a scanner is temporarily stored on the hard disk or the like. When the program is started, the temporarily stored document image data is read, the Japanese-English determination processing and the document recognition processing are executed, and the results are output to the display or the like.

[0091]

[Effects of the Invention] As described above, according to the present invention, a plurality of determination methods are used in combination, so Japanese and English can be discriminated with high accuracy. Japanese and English can also be discriminated accurately for each character region in a document image and for each page of a document image. Furthermore, because appropriate document recognition processing is performed on a document image determined to be Japanese or English, highly accurate recognition results can be obtained.

[Brief Description of the Drawings]

[Description of Reference Numerals]
101 Image input means
102 Image reduction means
103 Connected component extraction means
104 Region generation means
105 Japanese-English discrimination means
106 Control unit
107 Data storage unit
108 Data communication path
109 Data communication means

Claims

[Claims]

1. A Japanese-English determination method for a document image, which determines whether each character region in a document image is a Japanese region or an English region, wherein whether the region is a Japanese region or an English region is determined using a plurality of determination methods, and a final determination result is obtained by comparing the plurality of determination results.

2. A Japanese-English determination method for a document image, which determines whether each character region in a document image is a Japanese region or an English region, the method comprising: classifying the black-pixel connected components in a character region generated by reducing the document image, based on the lengths of the connected components; and determining whether each character region is a Japanese region or an English region based on the totals of the classification results.

3. The Japanese-English determination method for a document image according to claim 2, wherein a different determination method is used when the number of black-pixel connected components in the generated character region does not satisfy a predetermined condition.

4. A Japanese-English determination method for a document image, which determines whether the document image of each page is a Japanese document image or an English document image, the method comprising: classifying the black-pixel connected components in a page generated by reducing the document image, based on the lengths of the connected components; and determining whether each page is a Japanese region or an English region based on the totals of the classification results.

5. A method according to claim 1, wherein a page is composed of a plurality of character regions and it is determined whether the document image of each page is a Japanese document image or an English document image, the method comprising: classifying the black-pixel connected components in each character region generated by reducing the document image, based on the lengths of the connected components; determining whether each character region is a Japanese region or an English region based on the totals of the classification results; and determining whether each page is a Japanese region or an English region based on the totals of those determination results.

6. A Japanese-English determination method for a document image, which determines whether each character region in a document image is a Japanese region or an English region, the method comprising: detecting lines in the character region; extracting blocks by merging adjacent circumscribed rectangles within each line; determining whether each block is a Japanese region, an English region, or an undeterminable region; totaling the determination results for the blocks; and determining whether each character region is a Japanese region or an English region based on the totals.

7. The Japanese-English determination method for a document image according to claim 6, wherein a different determination method is used when the number of extracted blocks does not satisfy a predetermined condition.

8. A Japanese-English determination method for a document image in which a page is composed of a plurality of character regions and which determines whether the document image of each page is a Japanese document image or an English document image, the method comprising: detecting lines in the character regions; extracting blocks by merging adjacent circumscribed rectangles within each line; determining whether each block is a Japanese region, an English region, or an undeterminable region; totaling the determination results for each page; and determining whether each page is a Japanese document image or an English document image based on the totals.

9. A Japanese-English determination method for a document image in which a page is composed of a plurality of character regions and which determines whether the document image of each page is a Japanese document image or an English document image, the method comprising: detecting lines in the character regions; extracting blocks by merging adjacent circumscribed rectangles within each line; determining whether each block is a Japanese region, an English region, or an undeterminable region; totaling the determination results for each character region and determining whether each character region is a Japanese region or an English region based on the totals; and totaling those determination results for each page and determining whether each page is a Japanese document image or an English document image based on the totals.

10. A document recognition method comprising: determining whether a document image is a Japanese document image or an English document image; and performing a document recognition process according to a result of the determination.

11. A document recognition method comprising: dividing a document image into a plurality of character regions; determining whether each of the divided character regions is a Japanese document region or an English document region; and performing document recognition processing according to the determination result.

12. A computer-readable recording medium storing a program for causing a computer to realize: a function of determining, using a plurality of determination methods, whether each character region in a document image is a Japanese region or an English region; and a function of obtaining a final determination result by comparing the plurality of determination results.

13. A computer-readable recording medium storing a program for causing a computer to realize, in order to determine whether each character region or each page in a document image is a Japanese region or an English region: a function of classifying the black-pixel connected components in a character region or page generated by reducing the document image, based on the lengths of the connected components; and a function of determining whether each character region or each page is a Japanese region or an English region based on the totals of the classification results.

14. A computer-readable recording medium storing a program for causing a computer to realize, in order to determine whether each character region in a document image is a Japanese region or an English region, or, where a page is composed of a plurality of character regions, whether the document image of each page is a Japanese document image or an English document image: a function of detecting lines in the character regions; a function of extracting blocks by merging adjacent circumscribed rectangles within each line; a function of determining whether each block is a Japanese region, an English region, or an undeterminable region; a function of totaling the determination results for each block or for each page; and a function of determining, based on the totals, whether each character region is a Japanese region or an English region, or whether each page is a Japanese document image or an English document image.

15. A computer-readable recording medium storing a program for causing a computer to realize: a function of determining whether a document image is a Japanese document image or an English document image, or of dividing a document image into a plurality of character regions and determining whether each of the divided character regions is a Japanese document region or an English document region; and a function of performing document recognition processing according to the determination result.