JPH1185899A

JPH1185899A - Character reader, its method and record medium

Info

Publication number: JPH1185899A
Application number: JP9241445A
Authority: JP
Inventors: Hironobu Shishido; 広信宍戸
Original assignee: Tsubasa System Co Ltd
Current assignee: Tsubasa System Co Ltd
Priority date: 1997-09-05
Filing date: 1997-09-05
Publication date: 1999-03-30
Anticipated expiration: 2017-09-05
Also published as: JP3190603B2

Abstract

PROBLEM TO BE SOLVED: To reduce the time needed to analyze character strings. SOLUTION: Through an input device 17, a user specifies, for each type, how many character strings of that type are written on the original. A CPU 13 counts, for each type, the number of times the analysis of a character recognition result has identified that type. Once the count for a type reaches the specified number, the CPU 13 excludes that type's analysis processing from the analysis of subsequent character strings, so that processing is no longer executed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Field of the Invention: The present invention relates to a character reading device that reads a document image on which character strings are written and lexically analyzes those character strings, and to a reading method and a recording medium therefor.

[0002] Description of the Related Art: A character reading device reads, with a scanner, a document image on which character strings are written, performs character recognition on the read character image, and converts it into character codes. Using such a device, it has become possible to read personal information such as names and addresses written on health insurance cards and driver's licenses and to register that information in a database.

[0003] As is well known, a database stores an accumulation of records, each record consisting of data for a plurality of items. The character codes read and converted by the character reading device must therefore be classified according to which of these items they correspond to.

[0004] The following methods are known for specifying this correspondence.

[0005] a) Restrict the documents to be read to those in which the positions of the character strings are fixed.

[0006] The user specifies to the character reading device the positions of the character strings on the document, that is, the positions of the image reading areas, together with the type of record item corresponding to each position. The character processing device reads the characters in the specified areas, converts them into character code strings, and creates a record for database registration.

[0007] b) Restrict the documents to be read to those whose content consists of individual pieces of information that are proper nouns or numbers with a fixed number of digits, such as the name, company name, address, telephone number, and postal code on a business card.

[0008] The character reading device uses a proper noun dictionary, a person name dictionary, a place name dictionary, and the like to analyze which item, such as name or company name, a read character string corresponds to, and creates the record to be registered in the database based on the analysis result.

[0009] Problems to Be Solved by the Invention: In the reading condition setting method a) above, the user must specify each reading area and the attribute of the character string written in it, so the user's specifying operation becomes more cumbersome as the number of reading areas grows.

[0010] In the reading condition setting method b) above, on the other hand, the attributes of the information written on the document are determined automatically, which reduces the user's effort. However, because the character strings are analyzed, attribute analysis takes longer as the number of character string attributes grows.

[0011] Accordingly, an object of the present invention is to provide a character reading device, a reading method therefor, and a recording medium that solve the problem of b) above without complicating the user's specifying operation.

[0012] Means for Solving the Problems: To achieve this object, the invention of claim 1 is a character reading device that reads a document image on which a plurality of character strings are written, performs character recognition on the read character strings, identifies the type of semantic content of each character string by applying analysis processing to the character recognition result, and classifies the character recognition results based on the identification result, the device comprising: instruction means for specifying the types of semantic content to be subjected to the analysis processing and, for each type of semantic content, the number of character strings on the document image; counting means for counting, for each type, the number of determinations each time the type of semantic content of one of the character strings is determined; and control means for excluding, from the types of analysis processing to be executed, any type whose count has reached the number specified through the instruction means.
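The counting and exclusion described in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify_with_quotas`, the representation of each analysis as a predicate, and the first-match policy are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: one predicate per type of semantic content.
Analyzer = Callable[[str], bool]


def classify_with_quotas(
    strings: List[str],
    analyzers: Dict[str, Analyzer],
    quotas: Dict[str, int],
) -> List[Tuple[str, Optional[str]]]:
    """Classify each string, dropping a type once its quota is reached."""
    counts: Counter = Counter()
    # Only types the user selected (quota > 0) are analyzed at all.
    active = [t for t in analyzers if quotas.get(t, 0) > 0]
    results: List[Tuple[str, Optional[str]]] = []
    for s in strings:
        found = None
        for t in active:
            if analyzers[t](s):  # run this type's analysis processing
                found = t
                break  # assumption: the first matching type wins
        results.append((s, found))
        if found is not None:
            counts[found] += 1  # counting means
            if counts[found] >= quotas[found]:
                active.remove(found)  # control means: exclude the type
    return results
```

Once a type is removed from `active`, its analyzer is never called again for later strings, which is where the time saving would come from.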

[0013] The invention of claim 2 is the character reading device according to claim 1, wherein the analysis processing is lexical analysis processing that uses a plurality of analysis rules defining the characteristics of character strings for each type of semantic content.

[0014] The invention of claim 3 is the character reading device according to claim 1, wherein the analysis processing uses a dictionary that lists, for each type of semantic content, the notations of character strings of that type.

[0015] The invention of claim 4 is the character reading device according to claim 1, further comprising selection means for selecting, when a plurality of types are obtained as identification results for a single recognized character string, one of those identification results as the final identification result based on a predetermined selection criterion.

[0016] The invention of claim 5 is a character reading method for a character reading device that reads a document image on which a plurality of character strings are written, performs character recognition on the read character strings, identifies the type of semantic content of each character string by applying analysis processing to the character recognition result, and classifies the character recognition results based on the identification result, wherein the types of semantic content to be subjected to the analysis processing and, for each type of semantic content, the number of character strings on the document image are specified to the character reading device, and the character reading device counts, for each type, the number of determinations each time the type of semantic content of one of the character strings is determined, and excludes from the types of analysis processing to be executed any type whose count has reached the specified number.

[0017] The invention of claim 6 is the character reading method according to claim 5, wherein the analysis processing is lexical analysis processing that uses a plurality of analysis rules defining the characteristics of character strings for each type of semantic content.

[0018] The invention of claim 7 is the character reading method according to claim 5, wherein the analysis processing uses a dictionary that lists, for each type of semantic content, the notations of character strings of that type.

[0019] The invention of claim 8 is the character reading method according to claim 5, wherein, when a plurality of types are obtained as identification results for a single recognized character string, the character processing device selects one of those identification results as the final identification result based on a predetermined selection criterion.

[0020] The invention of claim 9 is a recording medium on which a processing program is recorded for execution by a computer in a character reading device, the processing program defining a series of processes that read a document image on which a plurality of character strings are written, perform character recognition on the read character strings, identify the type of semantic content of each character string by applying analysis processing to the character recognition result, and classify the character recognition results based on the identification result, wherein the processing program comprises: a procedure for specifying to the character reading device the types of semantic content to be subjected to the analysis processing and, for each type of semantic content, the number of character strings on the document image; a procedure for counting, for each type, the number of determinations each time the type of semantic content of one of the character strings is determined; and a procedure for excluding, from the types of analysis processing to be executed, any type whose count has reached the specified number.

[0021] The invention of claim 10 is the recording medium according to claim 9, wherein the analysis processing is lexical analysis processing that uses a plurality of analysis rules defining the characteristics of character strings for each type of semantic content.

[0022] The invention of claim 11 is the recording medium according to claim 9, wherein the analysis processing uses a dictionary that lists, for each type of semantic content, the notations of character strings of that type.

[0023] The invention of claim 12 is the recording medium according to claim 9, wherein the processing program further comprises a procedure for selecting, when a plurality of types are obtained as identification results for a single recognized character string, one of those identification results as the final identification result based on a predetermined selection criterion.

[0024] Embodiments of the Invention: Embodiments of the present invention are described in detail below with reference to the drawings.

[0025] FIG. 1 shows the system configuration of a character reading device to which the present invention is applied. In FIG. 1, a general-purpose personal computer can be used as the character reading device 10. In the main body of the character reading device 10, the following circuits are connected to a bus. An input/output interface (I/O) 11 is connected to a scanner 20, receives the document image read by the scanner 20, and passes it to a CPU 13.

[0026] A system memory 12 stores system programs for system control such as the operating system, as well as data, display images, computation data, and the like. The CPU 13 executes the character reading program described later and creates a record for database registration based on the document image input from the scanner 20. It also controls the operation of the entire system in accordance with the system programs. A communication interface 14 connects to a LAN (local area network) or the like and communicates with other computers.

[0027] A hard disk drive (HDD) 15 stores the system programs, the character reading program, and the window screens and the like through which the user enters information during character reading processing.

[0028] A floppy disk drive (FDD) 16 accepts a floppy disk and reads and writes information from and to it. The character reading program according to the present invention and its related data are installed on the HDD 15 from a floppy disk. An input device 17 has a keyboard and a mouse and is used to enter information. A display 18 displays the read document, the graphical interface for entering reading conditions, the attribute analysis results, and the like.

[0029] The scanner 20 captures an image of the document with a CCD (solid-state image sensor) and outputs the read image to the character reading device 10 as a digital signal.

[0030] Before describing the operation of the character reading device with this system configuration, the screens (graphical interface) used to set reading conditions and to instruct reading are described.

[0031] In this embodiment, documents with different formats can be read, such as the health insurance card (hereinafter "insurance card") shown in FIG. 2 and the postcard shown in FIG. 4. FIG. 3 is a screen for setting the reading conditions for the insurance card of FIG. 2. Reference numeral 101 denotes the format name given to the reading conditions; the reading conditions are saved, displayed, and so on under this format name.

[0032] Reference numeral 102 denotes a field for setting the types of character strings written on the document and the number of occurrences of each type. The user indicates the types of character strings written on the document by placing check marks with the mouse, and enters a number from the keyboard to specify how many character strings of the same type there are. The example of FIG. 3 shows one character string each for person name, furigana, gender, and organization name, two for identification codes, two for dates, and five for place names and addresses.
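As an illustration, the reading conditions of the FIG. 3 example could be held as a simple mapping from type to count. The key names and the helper `empty_record` below are hypothetical; only the types and counts come from the text.

```python
from typing import Dict, List, Optional

# Types and counts from the FIG. 3 example; the English keys are assumed names.
READING_CONDITIONS_FIG3: Dict[str, int] = {
    "person_name": 1,
    "furigana": 1,
    "gender": 1,
    "organization_name": 1,
    "identification_code": 2,
    "date": 2,
    "place_name_address": 5,
}


def empty_record(conditions: Dict[str, int]) -> Dict[str, List[Optional[str]]]:
    """Build a record with one empty slot per expected character string."""
    return {kind: [None] * count for kind, count in conditions.items()}
```

A record built this way has 13 slots in total, matching the number of character strings the user declared for the insurance card.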

[0033] In this embodiment, all analyzable types of character strings are displayed as a guide, and the user selects on the screen of FIG. 3 the types of character strings to be analyzed. The user also selects on the same screen, for each type, the number of character strings written on the document. Once the user has set just these reading conditions, the character reading device automatically analyzes the type of each read character string and creates a record with the specified number of items.

[0034] For reference, FIG. 5 shows the reading conditions that the user sets to read the postcard of FIG. 4. FIG. 6 shows a screen 201 for setting the mask color used when reading a document. Reference numeral 205 denotes an area that displays all selectable colors. Sample colors are displayed in an area 206, and the user specifies the mask color by selecting a sample with the mouse. Reference numeral 202 denotes a clear button, which resets the current settings to the initial settings. Reference numeral 203 denotes an OK button, which confirms the currently set color. Reference numeral 204 denotes a cancel button, which cancels the mask color setting mode of FIG. 6 and closes the window. A window of this kind is called a color picker and is commonly used in image processing software.

[0035] FIG. 7 shows a second example of a screen for setting the mask color. In FIG. 7, reference numeral 301 denotes an area that displays the document image read at low resolution; clicking a position in this area with the mouse sets the color at that position as the mask color. Reference numeral 302 denotes a button for instructing image reading, and 303 denotes a button for clearing the set mask color. Reference numeral 304 denotes an area that displays the selected mask color. Reference numeral 305 denotes an OK button for confirming the current mask color setting, and 306 denotes a button for canceling the mask color setting.

[0036] In this embodiment, the read image is masked with the mask color set in this way, which erases colored backgrounds and characters of a specific color that are not to be used for character recognition.
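The masking step can be pictured as replacing every pixel close to the mask color with the background color. The sketch below is only one plausible reading: the text does not say how color similarity is judged, so the per-channel `tolerance`, the white `background`, and the function name `apply_mask` are assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def apply_mask(
    pixels: List[List[Pixel]],
    mask_color: Pixel,
    tolerance: int = 30,                   # assumed per-channel tolerance
    background: Pixel = (255, 255, 255),   # assumed replacement color
) -> List[List[Pixel]]:
    """Erase pixels whose color is within `tolerance` of the mask color."""

    def is_masked(p: Pixel) -> bool:
        return all(abs(a - b) <= tolerance for a, b in zip(p, mask_color))

    return [[background if is_masked(p) else p for p in row] for row in pixels]
```

For example, masking with a pale red would turn a pale red background white while leaving black characters untouched.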

[0037] FIG. 8 shows a window that displays the character recognition results and the analysis results for the recognized character strings. Reference numeral 401 denotes a field for entering the format name, and 402 denotes a button for creating a new file. Reference numeral 403 denotes a button for entering edit mode, and 404 denotes a button for executing reading. Reference numeral 405 denotes an area that displays the analysis results for the character string types.

[0038] When the user clicks the new button 402 with the mouse and then operates the read execution button 404, the character reading device instructs the scanner 20 to read the image, then performs character recognition on the read image and analyzes the types. The analysis results are displayed in the display area 405 in association with the type (the so-called attribute) of each character string.

[0039] The analysis results are saved to a file in a format that can be registered in a database, for example one of various formats such as comma-delimited text.
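A comma-delimited export of the analysis results might look like the following sketch, which uses Python's standard `csv` module. The two-column layout (`type`, `text`) and the function name `results_to_csv` are assumptions; the text does not specify the file layout.

```python
import csv
import io
from typing import Iterable, Tuple


def results_to_csv(rows: Iterable[Tuple[str, str]]) -> str:
    """Serialize (type, text) pairs as comma-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["type", "text"])  # assumed header row
    writer.writerows(rows)
    return buf.getvalue()
```

The `csv` writer quotes any field that itself contains a comma, so an address containing a comma still occupies a single column.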

[0040] Next, the types of character string attributes that can be analyzed in this embodiment, and the analysis rules for them, are described with reference to FIG. 9. FIG. 9 shows the analyzable attribute types and their analysis rules.

[0041] a) Person name: A person name dictionary is used to analyze whether the character string under analysis is a person name. The person name dictionary lists the notations of family names and given names, here as a plurality of character code strings, and is stored in the HDD 15. If a search of the person name dictionary finds an entry identical to the character code string obtained by character recognition (the character code string under analysis), that string is judged to be a person name.
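The dictionary rule amounts to an exact-match lookup. A minimal sketch, assuming the dictionary fits in memory as a set; the helper name `make_dictionary_analyzer` and the sample entries are hypothetical.

```python
from typing import Callable, Iterable


def make_dictionary_analyzer(entries: Iterable[str]) -> Callable[[str], bool]:
    """Return a predicate that is true when the string is a dictionary entry."""
    lookup = frozenset(entries)  # exact-match lookup, as in the text

    def analyze(text: str) -> bool:
        return text in lookup

    return analyze


# Hypothetical sample entries; a real person name dictionary would be far larger.
is_person_name = make_dictionary_analyzer(["山田", "太郎", "鈴木"])
```

The same helper would serve the other dictionary-based types (titles, place names, organization names) by passing a different list of entries.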

[0042] b) Furigana: For furigana, the device determines whether the character string under analysis consists entirely of hiragana or katakana. Because the hiragana and katakana character codes in JIS (Japanese Industrial Standards) are defined to fall within specific code ranges, the device determines whether the string is furigana by checking whether each of its character codes lies within those ranges.
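The range check can be sketched as below. Note the substitution: the text refers to JIS code ranges, whereas this sketch uses the corresponding Unicode code points, because Python strings are Unicode. Accepting the prolonged sound mark (U+30FC) is a further assumption.

```python
def is_furigana(text: str) -> bool:
    """True when every character is hiragana or katakana (Unicode ranges)."""

    def is_kana(ch: str) -> bool:
        cp = ord(ch)
        return (
            0x3041 <= cp <= 0x3096     # hiragana letters
            or 0x30A1 <= cp <= 0x30FA  # katakana letters
            or cp == 0x30FC            # prolonged sound mark (assumed allowed)
        )

    # An empty string is not treated as furigana.
    return bool(text) and all(is_kana(ch) for ch in text)
```

No dictionary is needed for this type, which makes it one of the cheaper rules to evaluate.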

[0043] c) Title: A title is identified by whether a character string identical to the one under analysis is listed in a dedicated title dictionary. The title dictionary lists character code strings representing titles such as subsection chief and section manager, and is stored in the HDD 15.

[0044] d) Age: A character string is judged to be an age when its character code string consists of numeric character codes and the value they represent falls within a range appropriate for an age, for example 0 to 120 inclusive. Since the computer provides a function for converting character codes into numeric codes, this function can be used for the conversion. For kanji or hiragana character codes, the computer reports a failure when numeric conversion is attempted, in which case the character code string under analysis can be judged not to be an age.

e) Gender: For gender, the device determines whether the character code string under analysis is one of the character code strings "男", "女", "M", "F", "Male", or "Female". If it matches one of these specific strings, the character code string under analysis is judged to be a gender.
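The age and gender rules can be sketched as follows. The function names are hypothetical, and the `isdecimal` guard stands in for the "conversion fails" check mentioned in the text; the 0 to 120 range and the six gender strings come from the text.

```python
GENDER_STRINGS = frozenset({"男", "女", "M", "F", "Male", "Female"})


def is_age(text: str, lowest: int = 0, highest: int = 120) -> bool:
    """True when the string is all digits and its value is a plausible age."""
    if not text.isdecimal():  # kanji, kana, or empty input: not a number
        return False
    return lowest <= int(text) <= highest


def is_gender(text: str) -> bool:
    """True when the string is exactly one of the listed gender strings."""
    return text in GENDER_STRINGS
```

A string such as "35" satisfies the age rule, while "121" or a kanji numeral does not.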

[0045] Place name/address: The character string under analysis is judged to be a place name or address when an identical character string is listed in the place name dictionary. The place name dictionary lists character code strings representing place names and is stored in the HDD 15.

[0046] f) Postal code: The character string under analysis is judged to be a postal code when its character code string consists only of digits, or of digits and the "-" symbol, and the numeric value falls within the range of values (or the range of digit counts) assigned to postal codes.
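A pattern-based version of the postal code rule is sketched below. The accepted shapes (3 digits alone, 3 plus 2 digits, or 3 plus 4 digits, with an optional hyphen) are an assumption about which digit counts count as valid; the text itself only says the value or digit count must lie in the assigned range.

```python
import re

# Assumed shapes: NNN, NNN-NN, NNN-NNNN (hyphen optional; U+2212 also accepted).
_POSTAL_PATTERN = re.compile(r"\d{3}(?:[-\u2212]?\d{2}|[-\u2212]?\d{4})?")


def is_postal_code(text: str) -> bool:
    """True when the whole string has one of the assumed postal code shapes."""
    return _POSTAL_PATTERN.fullmatch(text) is not None
```

Because the rule uses `fullmatch`, a longer digit string such as a telephone number is not mistaken for a postal code.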

[0047] g) Organization name, business office name, department name: When a character string identical to the one under analysis is listed in the organization name dictionary, the business office name dictionary, or the department name dictionary, the string is identified as an organization name, business office name, or department name according to the dictionary in which it is listed.

[0048] h) Telephone number: A character string under analysis that consists of digits accompanied by "電話", "TEL", "(", ")", or "-" is judged to be a telephone number. For more accurate analysis, it is advisable to also check whether the number of digits matches that of a telephone number.
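One way to express the telephone rule, including the optional digit-count check, is sketched below. The 10 to 11 digit range, the tolerance for a space or colon after the "電話"/"TEL" prefix, and the function name are assumptions.

```python
import re
from typing import Tuple


def is_telephone_number(text: str, digit_range: Tuple[int, int] = (10, 11)) -> bool:
    """True for digits accompanied by 電話/TEL, parentheses, or hyphens."""
    # Strip an optional leading marker, remembering whether one was present.
    body, n_prefix = re.subn(r"^(?:電話|TEL)[\s:：]*", "", text.strip())
    if not re.fullmatch(r"[0-9()\-]+", body):
        return False  # something other than digits and the listed symbols
    has_marker = n_prefix > 0 or any(c in "()-" for c in body)
    n_digits = sum(c.isdigit() for c in body)
    # digit_range is an assumed refinement (the "more accurate" check).
    return has_marker and digit_range[0] <= n_digits <= digit_range[1]
```

Requiring at least one marker keeps a bare digit string from being classified as a telephone number by this rule alone.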

[0049] g) Identification code: When the character string under analysis contains any of the character codes "第", "号", or "No", its type is judged to be an identification code. To this end, the first character code of the character code string under analysis is extracted and compared against the character codes for "第" and "号". If neither matches, the first two character codes are extracted from the string and compared against the character code string "No". If any of these comparisons yields a match, the character code string under analysis is judged to be an identification code.
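The comparison procedure can be sketched as follows. It follows the described steps, which inspect only the beginning of the string, rather than searching the whole string; the function name is hypothetical.

```python
def is_identification_code(text: str) -> bool:
    """Check the first character for 第 or 号, then the first two for "No"."""
    if not text:
        return False
    if text[0] in ("第", "号"):  # first comparison: one leading character
        return True
    return text[:2] == "No"      # second comparison: two leading characters
```

Under this reading, a string such as "第12345号" is accepted on the first comparison, and "No.987" on the second.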

[0050] h) Date: The character string under analysis is judged to be a date when it contains a specific character string used in dates, such as "平成", "昭和", "/", or "Jan.", and its numeric values fall within the ranges used in dates.
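The date rule combines a marker test with a numeric range test. In the sketch below, the marker list extends the four examples in the text with the remaining month abbreviations, and the accepted ranges (1 to 64 for era years, months, and days; 1868 to 2100 for Western years) are assumptions chosen for illustration.

```python
import re

# "平成", "昭和", "/", and "Jan." come from the text; the other month
# abbreviations are assumed to be covered by its "and so on".
DATE_MARKERS = ("平成", "昭和", "/", "Jan.", "Feb.", "Mar.", "Apr.", "May",
                "Jun.", "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec.")


def is_date(text: str) -> bool:
    """True when a date marker is present and all numbers look date-like."""
    if not any(marker in text for marker in DATE_MARKERS):
        return False
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return False
    # Assumed ranges: small values (era year, month, day) or a Western year.
    return all(1 <= n <= 64 or 1868 <= n <= 2100 for n in numbers)
```

The marker test runs first because it is cheap and rejects most non-date strings before any number parsing happens.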

[0051] i) Other numeric value: When the character codes under analysis consist of numeric codes, the type is judged to be "other numeric value".

[0052] j) When two or more types are obtained for the character string under analysis, the type is finally determined based on a predetermined selection criterion described later.

[0053] In accordance with the analysis rules and dictionaries described above, the character reading device analyzes each character string obtained by character recognition and determines its type.

[0054] Next, the character reading processing according to the present invention is described with reference to FIGS. 10 to 15. FIG. 10 shows the main processing of the character reading processing, and FIGS. 11 to 15 show details of the individual processes within the main processing.

[0055] The processing programs shown in FIGS. 10 to 15 are written in a programming language executable by the CPU 13 and are stored in the HDD 15. When startup of the character reading program is instructed, the program is loaded into the system memory 12 and executed by the CPU 13.

[0056] k) Setting the reading conditions: On the menu screen, the user selects the reading condition setting mode with the mouse or the like. The window shown in FIG. 3 is then displayed. The user specifies the types of character strings written on the document to be read and the number of character strings of each type. After finishing, the user enters a format name and instructs saving. The CPU 13 stores a reading condition file in table form, as shown in FIG. 16(b), in the HDD 15 (steps S10 → S20 → S30). Needless to say, the settings for mask processing are also made during the reading condition setting processing of step S30.

[0057] To read a document, the user sets a document such as the one shown in FIG. 2 on the scanner 20. The user calls up the window screen of FIG. 8 from the menu screen, enters the name of the previously stored reading condition file in the format name entry field 401 from the keyboard, and operates the read execution button 404 with the mouse. In response to this operation, the CPU 13 proceeds through steps S10 → S20 → S100 in FIG. 10.

[0058] As in the conventional art, the CPU 13 executes the driver software for the scanner to control the scanner 20 and have it read the document image. The color document image that has been read is input to the character reading device via the I/O 11 and is temporarily stored by the CPU 13 in the work area in the system memory 12. Image processing for creating an image suitable for character recognition is then performed (step S100).

[0059] FIG. 11 shows the detailed procedure of step S100. In the processing procedure of FIG. 11, after the color document image read by the scanner 20 is stored in the system memory 12 (step S1001), mask processing and binarization processing are performed (step S1002).

[0060] Through the binarization processing, each pixel in the character portions of the document image data is converted to bit 1, and each pixel in the background portions of the document image is converted to bit 0.

[0061] The CPU 13 detects the character string areas and the remaining blank areas by examining the stroke distribution of the binarized image data, that is, the distribution of bit 1 (step S1003).
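The binarization and stroke-distribution steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed threshold, the function names `binarize` and `text_row_ranges`, and the row-wise projection are all assumptions made for illustration, since the specification does not state how the threshold is chosen or how the distribution of bit 1 is examined.

```python
def binarize(gray, threshold=128):
    """Map dark (character) pixels to bit 1 and light (background) pixels to bit 0.

    The fixed threshold is an assumption; the specification does not give one.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def text_row_ranges(bits):
    """Return (start, end) row ranges, end exclusive, that contain any bit-1 pixel."""
    ranges, start = [], None
    for y, row in enumerate(bits):
        has_stroke = any(row)
        if has_stroke and start is None:
            start = y                      # a run of stroke rows begins
        elif not has_stroke and start is not None:
            ranges.append((start, y))      # a blank row closes the run
            start = None
    if start is not None:
        ranges.append((start, len(bits)))  # run reaches the bottom edge
    return ranges


# Tiny hypothetical grayscale page: 255 is white paper, low values are ink.
page = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255,  15, 255, 255],
    [255, 255, 255, 255],
    [ 30, 255, 255,  40],
]
bits = binarize(page)
rows = text_row_ranges(bits)
```

Here `rows` lists the vertical extents that hold strokes; everything between them is treated as blank area.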

[0062] Next, the skew that arises in the image when the document is set at an angle is corrected, and ruled-line images are removed (step S1005).

[0063] Returning to FIG. 10, the CPU 13 divides the character string areas detected by the above processing into blocks by a conventional method. In the present embodiment, blocking means the process of detecting a character string consisting of consecutive characters and automatically detecting the position of the block (rectangle) that circumscribes it (step S200). The image inside each block is the target of character recognition. Block numbers are assigned to the detected blocks in order of detection, and a table A listing the block numbers and the coordinate positions of the blocks, as shown in FIG. 16(a), is created in the work area in the system memory 12 (FIG. 12, steps S2001 to S2003). A table B for storing the character recognition results in association with the block numbers, as shown in FIG. 16(b), is also created in the work area.

[0064] The CPU 13 performs character recognition on the individual character images in each block and converts them into character codes (step S300). The character recognition results are stored in the above table (see FIG. 15(b)) in association with the block numbers. FIG. 13 shows an example of the detailed procedure of the character recognition processing. In this procedure, the density histogram of the image in the block is examined; if the image shows no change in density (gradation), it is judged to be a character image and character recognition is performed (steps S3001 → S3002 → S3003 → S3004 to S3006). If the image does show a change in density, it is either soiled or not a character image, so to prevent misrecognition, character recognition is not performed and the image itself is stored in the recognition result column of table B corresponding to the block number (steps S3001 → S3002 → S3003 → S3004 → S3007).
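The density check described above might be sketched as follows. This is only one possible reading: the specification does not define what counts as "no change in density", so the criterion used here (at most one ink level besides the background) and the names `looks_like_text`, `recognize_block`, and the `ocr` callback are hypothetical.

```python
from collections import Counter


def density_histogram(pixels):
    """Count how many pixels of the block have each density level."""
    return Counter(px for row in pixels for px in row)


def looks_like_text(pixels, background=255, max_ink_levels=1):
    """Assumed criterion: a character image has a single uniform ink density."""
    hist = density_histogram(pixels)
    ink_levels = [level for level in hist if level != background]
    return 0 < len(ink_levels) <= max_ink_levels


def recognize_block(pixels, ocr):
    """Run OCR on uniform blocks; otherwise keep the raw image as the result."""
    if looks_like_text(pixels):
        return ("text", ocr(pixels))
    return ("image", pixels)  # stored as-is to avoid misrecognition


clean = [[255, 0, 0], [255, 0, 255]]        # one ink level: treated as text
stained = [[255, 0, 120], [200, 0, 255]]    # several levels: soiled or non-text
fake_ocr = lambda block: "A"                # stand-in for a real recognizer

clean_result = recognize_block(clean, fake_ocr)
stained_result = recognize_block(stained, fake_ocr)
```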

[0065] When character recognition has been performed on all blocks, the CPU 13 performs attribute analysis of the character recognition results, that is, the analysis of the type of semantic content according to the present invention (step S400). The details of the type analysis are shown in FIGS. 14 and 15.

[0066] The CPU 13 sets a row pointer to the first row of table B in FIG. 16 and takes the character string recorded as the character recognition result of the first row into the work area of the system memory 12.

[0067] The extracted character string is checked against each of the analysis rules described above, and the names of the matching attributes are entered in the candidate attribute column of table B in FIG. 16. An analysis program is prepared for each analysis rule. One analysis program is executed, and if it determines that the string does not match its rule, the next analysis program is executed; lexical analysis of the character string proceeds in this way (steps S4001 → S4002 → S4003 → S4004 → S4006).

[0068] When the health insurance card of FIG. 2 is read, the character recognition result recorded in the first row of table B is 「平成８年３月１日発行」 ("Issued March 1, Heisei 8", i.e. 1996). When the analysis program for dates is executed on this character string, it is found to match the date analysis rule, and "date" is given as a candidate attribute (see FIG. 16(b)).
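The rule-by-rule lexical analysis can be sketched with regular expressions. The patterns below are hypothetical stand-ins; the actual analysis rules are defined elsewhere in the specification and are not reproduced here. The sketch relies on the fact that, for `str` patterns, Python's `\d` also matches full-width digits such as ８.

```python
import re

# Hypothetical analysis rules: one compiled pattern per attribute type.
RULES = [
    ("date", re.compile(r"(平成|昭和|大正|明治)\d+年\d+月\d+日")),
    ("phone", re.compile(r"\d{2,4}-\d{2,4}-\d{4}")),
    ("sex", re.compile(r"^(男|女)$")),
]


def lexical_candidates(text, rules=RULES):
    """Try every rule in turn and collect the names of all rules that match."""
    return [name for name, pattern in rules if pattern.search(text)]


first_row = "平成８年３月１日発行"
candidates = lexical_candidates(first_row)
```

For the first row of the example document, only the date rule matches, so "date" is the sole candidate attribute.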

[0069] Even after a match with a particular analysis rule has been found, the CPU 13 continues to check the character string against all the analysis rules indicated by the current table C in FIG. 16, and each time an analysis rule identifies an attribute of the character string, that attribute is entered in the candidate attribute column of table B (loop of steps S4002 to S4004 → S4006 → S4002). When the lexical analysis is complete, the dictionary analysis described later is executed (step S4010). After the dictionary analysis, one attribute is determined from the candidate attributes obtained, based on a predetermined attribute selection criterion described later (step S4011).

[0070] If, on the other hand, the selected lexical analysis program is executed and determines that the string does not match its analysis rule (NO in step S4004), another analysis program is selected and lexical analysis continues (steps S4004 → S4006 → S4002). After the lexical analysis, the dictionary analysis processing is performed (steps S4002 → S4010).

[0071] For the recognized character string in the first row of the document of FIG. 2, "date" is the only candidate attribute obtained, so "date" is automatically determined as the final attribute, and the result is entered in the attribute column as shown in FIG. 17(b). In addition, the value set in table C for "date" is updated from the current "3" to "2" (a decrement, which is equivalent to counting the number of identification results obtained for dates) (see FIGS. 16(c) and 17(b)). The point to note is that while the set value of an attribute is 1 or more, lexical analysis or dictionary analysis is performed for that attribute, and when the set value reaches 0 (equivalent to the count of identification results of that type reaching the specified number), the attribute is removed from the analysis. Conventionally, the number of lexical analysis rules and dictionaries used for analysis is fixed; in the present embodiment, the number of analysis rules and dictionaries in use decreases as analysis proceeds, so the analysis processing time is greatly reduced.
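The countdown mechanism of table C can be sketched as below. The analyzers are toy predicates and the final-attribute choice simply takes the first candidate; both are assumptions for illustration, as are the names `classify_all`, `analyzers`, and `quota`. The sketch only aims to show how a type whose remaining count has reached 0 is skipped for all later strings.

```python
def classify_all(strings, analyzers, quota):
    """Classify strings, skipping any analysis type whose quota is used up.

    analyzers: dict of type name -> predicate (hypothetical stand-ins for the
               lexical analysis programs and dictionaries).
    quota:     dict of type name -> number of strings of that type expected on
               the document (the role played by table C).
    Returns (results, remaining quota, number of analyzer calls made).
    """
    remaining = dict(quota)
    results, calls = [], 0
    for text in strings:
        candidates = []
        for name, matches in analyzers.items():
            if remaining.get(name, 0) < 1:
                continue                 # type already exhausted: excluded
            calls += 1
            if matches(text):
                candidates.append(name)
        # Simplification: take the first candidate as the final attribute.
        decided = candidates[0] if candidates else None
        if decided is not None:
            remaining[decided] -= 1      # decrement the set value in table C
        results.append((text, decided))
    return results, remaining, calls


analyzers = {
    "date": lambda s: "年" in s and "月" in s,
    "sex": lambda s: s in ("男", "女"),
}
strings = ["平成８年３月１日発行", "男", "昭和４０年１月２日"]
results, remaining, calls = classify_all(strings, analyzers, {"date": 2, "sex": 1})
```

With two analyzers and three strings, an exhaustive approach would make six analyzer calls; here the "sex" analyzer is skipped for the third string because its quota of 1 was already used.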

[0072] When analysis of the recognized character string in the first row of table C in FIG. 16 is complete, the CPU 13 selects the recognized character string in the second row as the next analysis target and executes the analysis processing described above. For the recognized character strings in the first to fourth rows of table C, analysis results are obtained in the lexical analysis processing of FIG. 14. The recognized character string in the fifth row, however, has the attribute "personal name", so no result is obtained in the lexical analysis processing of FIG. 14 (YES in step S4002), and the procedure proceeds to the dictionary analysis processing of step S4010.

[0073] FIG. 15 shows the details of the dictionary analysis processing.

[0074] The CPU 13 selects the first dictionary indicated by table C (in this case, the personal name dictionary) as the dictionary to be used for analysis (step S5010). The CPU 13 takes the recognized character string of the fifth row into the work area of the system memory 12 and performs the following dictionary analysis.

[0075] To give an example of personal name analysis, the first character of the extracted character string is taken (step S5030), and the selected dictionary is searched for that character. For example, the name dictionary is searched, and if the same character as the first character is listed as a surname, the recognized character string is detected to be a name (step S5040).

[0076] If the same character as the first character is not listed in the dictionary, the first and second characters of the character string taken into the work area are combined, and the selected dictionary is consulted for this combined string. The number of combined characters is increased in this way until a listed character string is found (loop of steps S5030 → S5040 → S5050 → S5070 → S5030).
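The growing-prefix lookup can be sketched as follows. The dictionary contents and the names `find_in_dictionary` and `dictionary_candidates` are hypothetical; the dictionaries are modelled as plain Python sets, which is an assumption about storage, not something the specification states.

```python
def find_in_dictionary(text, entries):
    """Look up the 1-character prefix, then 2 characters, and so on.

    Returns the first prefix found in the dictionary, or None when every
    combination has been tried without a hit.
    """
    for length in range(1, len(text) + 1):
        prefix = text[:length]
        if prefix in entries:
            return prefix
    return None


def dictionary_candidates(text, dictionaries):
    """Check each dictionary in turn and collect the attributes that match."""
    return [
        name
        for name, entries in dictionaries.items()
        if find_in_dictionary(text, entries) is not None
    ]


# Hypothetical dictionary contents, for illustration only.
dictionaries = {
    "person name": {"山田", "林"},
    "place name": {"山形", "東京"},
}
hit = find_in_dictionary("山田太郎", dictionaries["person name"])
name_attrs = dictionary_candidates("山田太郎", dictionaries)
place_attrs = dictionary_candidates("東京都港区", dictionaries)
```

In this example "山" alone is not listed, so the lookup extends to "山田", which is found in the personal name dictionary.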

[0077] If no entry is found even after every combination of characters in the extracted recognized character string has been searched, the selected dictionary is changed to the second dictionary and the check for an entry is repeated (steps S5080 → S5090 → S5030 → S5040).

[0078] When the type of dictionary in which the recognized character string is listed, in other words the attribute of the recognized character string, has been detected in this way, the CPU 13 enters the detected attribute in the candidate attribute column of table B in FIG. 17 (step S5060). Thereafter, the selected dictionary is changed and the remaining dictionaries are likewise checked for the recognized character string (steps S5060 → S5090 to S5095 → S5030 → S5040).

[0079] The lexical analysis and dictionary analysis described above may yield more than one candidate attribute, for example both "personal name" and "place name" (see rows 12 and 13 of FIG. 18(a); the contents of table C at that time are shown in FIG. 18(b)). In such a case, a selection criterion is established in advance, and one attribute is determined from the candidate attributes based on that criterion.

[0080] An example of a selection criterion follows. A weight (or priority) is assigned in advance to each attribute type listed in FIG. 16(c), and the weights of the candidate attributes are compared. The candidate attribute with the largest weight can then be determined as the final attribute. The weight may be fixed, or the number of strings set for the attribute in the reading conditions may be used as the weight. In this example, "personal name" is initially set to 1 and "place name/address" to 5 (see FIG. 16(c)), so "place name/address", which has the larger weight, is determined as the attribute (step S4011 in FIG. 14).
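The weight-based selection amounts to taking a maximum over the candidate attributes. A minimal sketch, assuming the weights are held in a dictionary keyed by attribute name (the name `choose_attribute` and the data layout are illustrative, not from the specification):

```python
def choose_attribute(candidates, weights):
    """Return the candidate attribute with the largest weight.

    Attributes missing from the weights table are treated as weight 0,
    which is an assumption of this sketch.
    """
    return max(candidates, key=lambda name: weights.get(name, 0))


# Initial values taken from the example in the text (FIG. 16(c)).
weights = {"person name": 1, "place name/address": 5}
final = choose_attribute(["person name", "place name/address"], weights)
```

If the counts from the reading conditions are used as weights, the same function can be called with the current contents of table C instead of a fixed table.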

[0081] When lexical analysis and dictionary analysis have been performed in this way on the recognized character strings of all rows entered in table C, the CPU 13 ends the processing procedures of FIGS. 14 and 15 and displays the (determined) attributes recorded in table B, as shown at reference numeral 405 in FIG. 8 (step S500 in FIG. 10).

[0082] The user checks this display and, if correction is necessary, corrects the characters in the same way as when editing a word processor document, thereby creating the data for database registration (steps S500 → S600). Finally, the created data is registered in the database in the conventional manner (step S700), and the processing procedure of FIG. 10 ends.

[0083] In addition to the embodiment described above, the following forms can be implemented.

[0084] 1) In the present embodiment, portions of the document image that are unnecessary for character string analysis are erased by mask processing. It is also possible to erase an area designated by the user from the document image and to perform character recognition processing on the document image after erasure.

[0085] 2) In the above embodiment, the types of semantic content and their counts are specified with the mouse or keyboard. The specification may instead be received from another computer through communication, or from another application program. Furthermore, the data to be specified (types and counts) may be registered in a database for each kind of document, and the specification data may be retrieved from the database according to the kind of document being read.

[0086] In these examples, the other computer, program, or database serves as the instruction means of the present invention.

[0087]

[Effects of the Invention] As described above, according to the inventions of claims 1, 5, and 9, the types of analysis, such as personal name and furigana (phonetic reading), and the number of character strings of each type on the document are specified. For example, when the number of personal-name character strings on the document image is specified as 1 and a personal name has been identified, analysis for personal names is not performed when the remaining character strings are analyzed. Conventionally, every type of analysis was performed on every character string; reducing the number of analysis types to be executed therefore shortens the analysis processing time. Moreover, since the user only has to specify the types and the number of character strings, the user need not perform the cumbersome designation operations required conventionally, such as associating the position of each reading area with an attribute.

[0088] According to the inventions of claims 2, 6, and 10, performing lexical analysis makes it possible to identify character string types such as furigana (phonetic reading), age, and sex.

[0089] According to the inventions of claims 3, 7, and 11, performing analysis using dictionaries makes it possible to identify character strings such as names, personal names, and addresses.

[0090] According to the inventions of claims 4, 8, and 12, analysis is performed among the types of analysis processing that remain executable, a plurality of identification results are obtained as candidates, and the final identification result is selected from those candidates. Misanalysis of character strings that are common to, for example, place names and personal names can thus be reduced as far as possible.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the system configuration of an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an example of a document to be read.

FIG. 3 is an explanatory diagram showing the settings of the graphical interface for setting reading conditions.

FIG. 4 is an explanatory diagram showing another example of a document to be read.

FIG. 5 is an explanatory diagram showing the settings of the graphical interface for setting reading conditions.

FIG. 6 is an explanatory diagram showing the display contents of the graphical interface for setting a mask color.

FIG. 7 is an explanatory diagram showing the display contents of the graphical interface for previewing a document image.

FIG. 8 is an explanatory diagram showing the display contents of an analysis result.

FIG. 9 is an explanatory diagram showing the dictionaries and analysis rules used for analyzing character strings.

FIG. 10 is a flowchart showing a processing procedure executed by the CPU 13.

FIG. 11 is a flowchart showing a processing procedure executed by the CPU 13.

FIG. 12 is a flowchart showing a processing procedure executed by the CPU 13.

FIG. 13 is a flowchart showing a processing procedure executed by the CPU 13.

FIG. 14 is a flowchart showing a processing procedure executed by the CPU 13.

FIG. 15 is a flowchart showing a processing procedure executed by the CPU 13.

FIG. 16 is an explanatory diagram showing the structure and contents of the tables used for attribute analysis.

FIG. 17 is an explanatory diagram showing changes in the contents of tables B and C.

FIG. 18 is an explanatory diagram showing changes in the contents of tables B and C.

[Explanation of symbols]

10 Character reading device
11 I/O
12 System memory
13 CPU
14 Communication interface
15 HDD
16 FDD
17 Input device
18 Display

Claims

[Claims]

1. A character reading device that reads a document image on which a plurality of character strings are written, performs character recognition on the plurality of character strings that have been read, identifies the type of semantic content of each character string by performing analysis processing on the result of the character recognition, and classifies the result of the character recognition based on the identification result, the character reading device comprising: instruction means for specifying the types of semantic content to be analyzed and, for each type of semantic content, the number of character strings on the document image; counting means for counting, for each type, the number of determinations each time the type of semantic content of one of the plurality of character strings is determined; and control means for excluding, from the types of analysis processing to be executed, any type for which the count has reached the number specified by the instruction means.

2. The character reading device according to claim 1, wherein the analysis process is a lexical analysis process using a plurality of analysis rules that define the characteristics of a character string for each type of the semantic content. Character reading device.

3. The character reading device according to claim 1, wherein the analysis processing is analysis processing using a dictionary in which the notation of character strings is listed for each semantic content of the character strings.

4. The character reading device according to claim 1, wherein when a plurality of types of identification results are obtained for one character string whose character has been recognized, the plurality of identification results are determined based on a predetermined selection criterion. A character reading device further comprising a selection unit for selecting one of the types of identification results as a final identification result.

5. A character reading method for a character reading device that reads a document image on which a plurality of character strings are written, performs character recognition on the plurality of character strings that have been read, identifies the type of semantic content of each character string by performing analysis processing on the result of the character recognition, and classifies the result of the character recognition based on the identification result, the method comprising: specifying to the character reading device the types of semantic content to be analyzed and, for each type of semantic content, the number of character strings on the document image; counting in the character reading device, for each type, the number of determinations each time the type of semantic content of one of the plurality of character strings is determined; and excluding, from the types of analysis processing to be executed, any type for which the count has reached the specified number.

6. The character reading method according to claim 5, wherein the analysis process is a lexical analysis process using a plurality of analysis rules defining a characteristic of a character string for each type of the semantic content. A character reading method for a character reading device.

7. The character reading method for a character reading device according to claim 5, wherein the analysis processing is analysis processing using a dictionary in which the notation of character strings is listed for each semantic content of the character strings.

8. The character reading method for a character reading device according to claim 5, wherein, when a plurality of types of identification results are obtained for one character string that has been recognized, the character reading device selects one of the plurality of types of identification results as the final identification result based on a predetermined selection criterion.

9. A recording medium on which a processing program is recorded for execution by a computer in a character reading device, the processing program defining a series of processes for reading a document image on which a plurality of character strings are written, performing character recognition on the plurality of character strings that have been read, identifying the type of semantic content of each character string by performing analysis processing on the result of the character recognition, and classifying the result of the character recognition based on the identification result, wherein the processing program comprises: a processing procedure for accepting a specification, given to the character reading device, of the types of semantic content to be analyzed and, for each type of semantic content, the number of character strings on the document image; a processing procedure for counting, for each type, the number of determinations each time the type of semantic content of one of the plurality of character strings is determined; and a processing procedure for excluding, from the types of analysis processing to be executed, any type for which the count has reached the specified number.

10. The recording medium according to claim 9, wherein the analysis processing is lexical analysis processing using a plurality of analysis rules that define the characteristics of character strings for each type of semantic content.

11. The recording medium according to claim 9, wherein the analysis processing is analysis processing using a dictionary in which the notation of character strings is listed for each semantic content of the character strings.

12. The recording medium according to claim 9, wherein the processing program further comprises a processing procedure for selecting, when a plurality of types of identification results are obtained for one character string that has been recognized, one of the plurality of types of identification results as the final identification result based on a predetermined selection criterion.