JP2011198285A

JP2011198285A - Document processing system and program

Info

Publication number: JP2011198285A
Application number: JP2010066834A
Authority: JP
Inventors: Mitsuharu Ohazama; 光晴大峡
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2010-03-23
Filing date: 2010-03-23
Publication date: 2011-10-06
Anticipated expiration: 2030-03-23
Also published as: JP5550959B2

Abstract

[Problem] To propose a technique that makes the correction of character strings, which becomes necessary when paper documents are digitized, more efficient.
[Solution] The document processing system according to the present invention has, on a character string correction screen, a function that determines whether each character string needs correction and displays an alert on those that do, and a function that suggests candidate character strings when a string is being corrected. The decision whether to raise an alert draws on morphological analysis, previously registered character strings, dictionaries, and consistency with already registered data. The suggested character strings draw on similarity to registered character strings and dictionary entries, and on value ranges and character strings calculated from already registered data.
[Selected drawing] FIG. 4

Description

The present invention relates to a document processing system and a program, for example to a technique for efficiently correcting metadata character strings obtained by scanning a paper document and applying OCR to it.

In companies and many other organizations, large amounts of data, such as files created with office software and files produced by scanning paper documents, are generated every day and stored on file servers and the like. The usual way to find a desired file in this mass of data is to walk through the folders on the file server. However, when the folder structure is complicated, or when the files are grouped into folders in a way the searcher does not expect, finding the desired file takes a very long time. Full-text search is another way to find files, but it has two problems of its own.

The first is that some files cannot be found by keyword search alone (see FIG. 1). For example, to find all documents within a certain period, full-text search is of no use because it cannot treat a date string in a document as date data. As another example, a document cannot be found when it uses a different word or phrase with the same meaning as the searcher's keyword.

The second is that a search hits a large number of unrelated files (see FIG. 2). For example, a search intended to find a bank as a customer also hits files in which that bank appears as a transfer destination, and a search by an ID such as an estimate number also hits files in which the same digits appear in an amount. These problems arise because full-text search does not treat the keywords in a document as text that carries meaning.

For this reason, methods that manage document metadata (attribute information) in association with the document have been considered. For example, Patent Document 1 proposes a virtual folder system. In a virtual folder system, metadata is set on each file, and each virtual folder is given a search condition over that metadata. When a virtual folder is opened, the result of searching the files with its condition is presented, which achieves a classification that follows the search condition. For example, when managing sales documents, a "document type name" (contract, order form, estimate, and so on) and a "draft date" are set as metadata on every file. If the search condition "document type name is 'contract'" is assigned to a virtual folder, opening that folder yields a list of contracts. Similarly, if another virtual folder is assigned "draft date is from January to March 2009", the documents for that period can be collected. Because the virtual folder system classifies files by meaning in this way, documents can be used effectively.
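The virtual folder idea described above can be illustrated with a minimal Python sketch. It is not taken from Patent Document 1; the class names `FileEntry` and `VirtualFolder`, the field names, and the sample files are all assumptions made for illustration. A folder here is simply a named predicate over file metadata.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass(frozen=True)
class FileEntry:
    """A stored file with two metadata items (hypothetical field names)."""
    path: str
    doc_type: str      # "document type name"
    draft_date: date   # "draft date"


@dataclass(frozen=True)
class VirtualFolder:
    """A folder defined by a search condition rather than by stored contents."""
    name: str
    condition: Callable[[FileEntry], bool]

    def list_files(self, files: List[FileEntry]) -> List[FileEntry]:
        # Opening the folder evaluates the condition against every file.
        return [f for f in files if self.condition(f)]


# Illustrative data only.
files = [
    FileEntry("/share/a.pdf", "contract", date(2009, 2, 10)),
    FileEntry("/share/b.pdf", "estimate", date(2009, 3, 1)),
    FileEntry("/share/c.pdf", "contract", date(2009, 6, 5)),
]

contracts = VirtualFolder("contracts", lambda f: f.doc_type == "contract")
q1_2009 = VirtualFolder(
    "2009 Jan-Mar",
    lambda f: date(2009, 1, 1) <= f.draft_date <= date(2009, 3, 31),
)
```

The same file can appear in several virtual folders at once, which is what distinguishes this classification from a physical folder tree.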

When a paper document is scanned and its character data registered, OCR (Optical Character Reader) is generally used. For example, when digitizing a form with a fixed format (a fixed form), the type and position coordinates of each character string to be entered are defined in advance, and the character string found within the defined coordinates is saved as metadata of the defined type. FIG. 3 shows an example of a fixed form. The positions where the document type, customer name, draft date, and case ID are to be entered are defined, and the form is created so that each string is written at its position. When the printed fixed form is digitized, the character strings are recognized according to the fixed form definition and saved as metadata.
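As a rough sketch of the fixed-form approach, the following Python code assigns OCR output to metadata items by position. The `Zone` and `OcrWord` types, the coordinate values, and the assumption that the OCR engine reports one coordinate pair per recognized fragment are all hypothetical; a real OCR product will expose its results differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Zone:
    """Rectangle on the form in which one metadata item is written."""
    item: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class OcrWord:
    """One recognized fragment and its position (assumed OCR output shape)."""
    x: int
    y: int
    text: str


def extract_metadata(zones: List[Zone], words: List[OcrWord]) -> Dict[str, str]:
    """Collect, per zone, the text of every fragment positioned inside it."""
    result = {}
    for zone in zones:
        inside = [w for w in words if zone.contains(w.x, w.y)]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top-down, left-right
        result[zone.item] = "".join(w.text for w in inside)
    return result


# Illustrative definition and OCR output.
zones = [Zone("customer name", 0, 0, 200, 30), Zone("case ID", 0, 40, 200, 70)]
words = [
    OcrWord(120, 50, "001"),
    OcrWord(10, 50, "A-"),
    OcrWord(10, 10, "ACME"),
    OcrWord(300, 300, "page 1"),  # outside every zone, so ignored
]
extracted = extract_metadata(zones, words)
```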

Metadata is then corrected while referring to the original document. Many document management products provide a metadata registration screen on which the user corrects the metadata by hand while looking at the document.

Patent Document 1: JP 2003-323326 A

Because the recognition rate of OCR is not 100%, characters are sometimes misrecognized. The user therefore has to check whether each character has been misrecognized and then correct it.

The present invention has been made in view of this situation, and provides a technique for efficiently correcting metadata character strings obtained from paper documents through scanning and OCR.

To solve the above problem, when metadata obtained from a paper document through scanning and OCR is being corrected, the document processing system of the present invention calculates and displays on the metadata correction screen whether each piece of data needs correction (alert function), and calculates and displays candidates for the corrected data (suggest function).

The alert function can be realized by applying morphological analysis to the target character string and using the result. Alternatively, the data to be processed (the target character string) may be compared with a dictionary DB prepared for each metadata type and/or a DB holding already registered character strings, and whether to display an alert (whether correction is needed) may be decided from the comparison result. It may also be determined whether the data to be processed matches a predefined format, and the alert decision made from that result. The data to be processed may also be compared with already registered data, and the alert decision made from the comparison result (for example, when the chronological order of document creation dates is contradictory). Finally, it may be determined whether there is a contradiction between the relationship (dependency) linking the data to be processed to its related data and the relationship (dependency) linking already registered data to its related data, and the alert decision made from that determination.

The suggest function displays candidate correct character strings, the correct format, or the correct range of values (dates), depending on the type of correction cause determined by the alert function.

Further features of the present invention will become apparent from the following description of embodiments and the accompanying drawings.

According to the present invention, character strings that need correction can be found easily among the character strings obtained by scanning a paper document and applying OCR, and the corrected character string can be set easily. This allows the user to carry out character string correction efficiently.

FIG. 1 is a diagram showing an example in which a file cannot be found by full-text search (keyword search).
FIG. 2 is a diagram showing an example in which full-text search (keyword search) hits unrelated files.
FIG. 3 is a diagram showing an example of metadata settings for a fixed form.
FIG. 4 is a diagram showing a schematic configuration of a character string correction system (document processing system) according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of a metadata item setting file.
FIG. 6 is a diagram showing an example of metadata.
FIG. 7 is a diagram showing an example of dictionary data.
FIG. 8 is a diagram showing an example of a correction screen.
FIG. 9 is a flowchart for explaining the overall outline of the alert determination process.
FIG. 10 is a flowchart for explaining the details of the process that determines the need for correction (step 907) in the alert determination process.
FIG. 11 is a diagram showing an example of data that holds morphological analysis results and manages error flags.
FIG. 12 is a diagram showing an example of a morphological analysis result.
FIG. 13 is a diagram showing an example (1) of registered data with the same case ID.
FIG. 14 is a diagram showing an example (2) of registered data with the same case ID.
FIG. 15 is a diagram showing an example of an alert result.
FIG. 16 is a flowchart for explaining the details of the suggest process.
FIG. 17 is a diagram showing an example (1) of a suggest result.
FIG. 18 is a diagram showing an example (2) of a suggest result.
FIG. 19 is a diagram showing an example (3) of a suggest result.
FIG. 20 is a diagram showing an example (4) of a suggest result.

A character string correction technique according to an embodiment of the present invention is described below with reference to the accompanying drawings. Note, however, that this embodiment is merely one example of realizing the present invention and does not limit its technical scope. Components common to several drawings are given the same reference numbers.

<Configuration of the character string correction system>
FIG. 4 is a diagram showing a schematic configuration of a character string correction system (document processing system) according to an embodiment of the present invention. The character string correction system has a scan data DB 401 that stores scanned images obtained by scanning paper documents and applying OCR in advance, together with the character string data of those images; a metadata DB 402 that stores metadata extracted from the scan data; a dictionary DB 403 that stores a large number of character strings that can serve as metadata; a morphological analysis DB 404 that stores the dictionary used during morphological analysis; a metadata item setting file 405 that contains the metadata definitions (see FIG. 5 for its specific contents); a parameter setting file 406 that contains thresholds and various parameters; a display device 407 that displays search results, the metadata setting screen, and so on; a keyboard 408 and a pointing device 409 such as a mouse for entering and editing data, selecting menus, and similar operations; and a central processing unit 410 that performs the necessary arithmetic and control processing.

The scan data DB 401, metadata DB 402, dictionary DB 403, and morphological analysis DB 404 also cover the case where each DB (database) is physically not one entity but several. The metadata DB 402 contains unconfirmed metadata as well as confirmed metadata. Accordingly, only the unconfirmed metadata is extracted and displayed on the correction screen.

The central processing unit 410 includes a correction screen display unit 411 that displays the screen for correcting metadata in the metadata DB 402; a correction target calculation unit 412 that calculates which character strings among the metadata in the metadata DB 402 need correction; and a correction candidate calculation unit 413 that calculates candidate corrected character strings for the metadata the user selects on the correction screen. The correction target calculation unit 412 and the correction candidate calculation unit 413 run in the background. The processing units and data described above, and the programs used by the processing units, can also be provided stored on various recording media such as CD-ROM, DVD-ROM, MO, floppy disk, and USB memory.

<Metadata item setting file>
FIG. 5 is an example of a setting file for the metadata items of a fixed form. FIG. 5 is text data in XML format that records, for each metadata item, its item name (item), data type (type), description format (define), description range (upper left, lower right), order relation (constraint order), order expression (order), and dependency (dependent item). The item tag gives the kind of metadata. The type tag gives the data type of the metadata. For example, String is the character string type and indicates that character string data is stored. Int is the integer type and indicates that integer data is stored. Date is the date type and indicates that date data is stored. Types can also be combined with the "+" operator. For example, String + Int indicates data made up of character string data followed by integer data. Data can also be described by a regular expression, in which case the expression is given in the define tag that follows. The upper left and lower right tags give the range on the form in which the metadata is written: the upper left tag holds the coordinates of the top-left corner of the range, and the lower right tag holds the coordinates of the bottom-right corner. The document management system stores metadata according to these definitions.
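To make the structure of one item definition concrete, here is a minimal Python sketch of how such a definition might be held in memory after the setting file has been read. The contents of FIG. 5 are not reproduced in this text, so the class name `ItemDefinition`, the sample regular expression, and the coordinates are assumptions; only the field meanings (item, type, define, upper left, lower right) come from the description above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ItemDefinition:
    """In-memory form of one metadata item definition (hypothetical class)."""
    item: str                    # item tag: metadata item name
    type_expr: str               # type tag: "String", "Int", "Date", or a "+" combination
    define: Optional[str]        # define tag: description format as a regular expression
    upper_left: Tuple[int, int]  # top-left corner of the description range
    lower_right: Tuple[int, int] # bottom-right corner of the description range

    def component_types(self) -> List[str]:
        """Split a combined type such as "String + Int" into its parts."""
        return [part.strip() for part in self.type_expr.split("+")]


# Illustrative values only; they are not the values shown in FIG. 5.
case_id = ItemDefinition(
    item="case ID",
    type_expr="String + Int",
    define=r"[A-Z]+\d+",
    upper_left=(400, 40),
    lower_right=(560, 70),
)
```

A value can then be tested against the description format with `re.fullmatch(case_id.define, value)`.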

<Metadata>
FIG. 6 is a diagram showing an example of metadata in the metadata DB 402. In the present invention, a file for which metadata has been registered is managed here together with its metadata. Files with no registered metadata are therefore not registered here.

As shown in FIG. 6, the data in the metadata DB 402 is managed as a table in which one file corresponds to one row. The columns of the table are an ID 601 that uniquely identifies the file, a state 602 that indicates whether the metadata has been confirmed, the file path 603 of the file, and the metadata 604 registered for the file.

The metadata 604 has one column for each metadata item managed by the system. In the example of FIG. 6, the metadata items are document type name 605, customer name 606, draft date 607, and case ID 608. In FIG. 6, some cells contain character errors.
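The table layout can be pictured with a short Python sketch. The `MetadataRow` class, the state strings, and the sample rows are illustrative assumptions and do not reproduce the contents of FIG. 6; the function shows how the unconfirmed rows would be picked out for the correction screen.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataRow:
    """One row of the metadata table: one file and its metadata."""
    file_id: int                 # ID 601
    state: str                   # state 602: "unconfirmed" or "confirmed" (assumed labels)
    file_path: str               # file path 603
    metadata: Dict[str, str] = field(default_factory=dict)  # metadata 604


def rows_for_correction_screen(rows: List[MetadataRow]) -> List[MetadataRow]:
    """Only rows whose metadata is not yet confirmed are shown for correction."""
    return [row for row in rows if row.state == "unconfirmed"]


# Illustrative rows; the second customer name contains an OCR error.
rows = [
    MetadataRow(1, "confirmed", "/scan/0001.pdf",
                {"document type name": "estimate", "customer name": "西戸塚キャピタル",
                 "draft date": "2009/01/10", "case ID": "A001"}),
    MetadataRow(2, "unconfirmed", "/scan/0002.pdf",
                {"document type name": "order form", "customer name": "西戸塚≠ャピタル",
                 "draft date": "2009/01/20", "case ID": "A001"}),
]
pending = rows_for_correction_screen(rows)
```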

<Dictionary data>
FIG. 7 is a diagram showing an example of dictionary data in the dictionary DB 403. The dictionary data consists of one text file per metadata item, each collecting the character strings that can occur as that item's metadata, and is registered in the DB in advance. The figure shows an example in which the keywords for the metadata item "document type name" are collected in "文書種別名.txt" (document type name) and the keywords for the metadata item "customer name" are collected in "顧客名.txt" (customer name). As shown in FIG. 7, each keyword is entered on its own line.

Besides document type names and customer names, dictionary data made up of product names, project names, and the like is also conceivable, and the possibilities are not limited to these.
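Since each dictionary is a plain list with one keyword per line, reading it is straightforward. The following Python sketch parses dictionary text into a set per metadata item; the function name `parse_dictionary` and the keyword values are assumptions for illustration, and file reading is left out so that the example stays self-contained.

```python
from typing import Dict, Set


def parse_dictionary(text: str) -> Set[str]:
    """Turn dictionary text (one keyword per line) into a set of keywords."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Illustrative contents; in the system each text would come from the
# per-item dictionary file registered in the dictionary DB.
dictionaries: Dict[str, Set[str]] = {
    "document type name": parse_dictionary("contract\norder form\nestimate\n"),
    "customer name": parse_dictionary("西戸塚キャピタル\n\n"),
}
```

Using a set makes the later exact-match lookup a constant-time membership test.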

<Correction screen>
FIG. 8 is a diagram showing an example of the correction screen created and displayed by the correction screen display unit 411. In a GUI window, the correction screen contains the metadata in table form, a correction execution menu 801 that is pressed to execute processing, and a processing check box 802 on each row for selecting whether the correction of that row is to be executed.

When this screen is started from the system, the correction screen display unit 411 displays the list of metadata whose state is "unconfirmed" and automatically displays an alert on the cells that need correction. The cells containing metadata are editable. When the user selects a cell, the correction candidate calculation unit 413 responds by calculating candidate corrected character strings, which the correction screen display unit 411 displays; alternatively, the correction screen display unit 411 may display a pop-up showing the metadata format or the range of values. This lets the user correct character strings efficiently. When the user selects the correction execution menu after correcting the character strings in the cells, the metadata on the rows whose processing check box 802 is checked is updated and its state becomes "confirmed".

<Alert determination process>
The alert determination process is executed when the correction screen of FIG. 8 is displayed and whenever the character string in a cell on that screen is updated.

FIG. 9 is a flowchart outlining the alert determination process executed by the correction target calculation unit 412 in the case where it is triggered by the display of the correction screen.

In step 901, the correction target calculation unit 412 stores the number of data items with unconfirmed metadata in the variable N. The initial value of N is the number of data items (files) listed on the metadata correction screen. For example, the screen in FIG. 8 has a scroll bar, so not all of the data is visible, but all the data listed on the screen is targeted, whether or not its check box is checked. If the screen in FIG. 8 had no scroll bar, N would be 4.

In step 902, the correction target calculation unit 412 determines whether N is 0. If it is 0, there is no data left to process and the process proceeds to step 910. Otherwise, the process proceeds to step 903.

In step 903, the correction target calculation unit 412 selects one data item with unconfirmed metadata. This data item is called F.

In step 904, the correction target calculation unit 412 stores the number of metadata items to be processed in the variable M. In the example of FIG. 8, M is 4. In step 905, it determines whether M is 0. If it is 0, there is no metadata item left to process and the process proceeds to step 909. Otherwise, the process proceeds to step 906.

In step 906, the correction target calculation unit 412 selects one piece of metadata to be processed in F. This piece of metadata is called G.

In step 907, the correction target calculation unit 412 determines whether G needs correction. Step 907 is described in detail with reference to FIG. 10.

In step 908, the correction target calculation unit 412 decrements M. The loop is repeated in this way until M reaches 0.

In step 909, the correction target calculation unit 412 decrements N. As with M, the loop is repeated until N reaches 0.

When the loop over N ends, in step 910 the correction target calculation unit 412 causes the metadata to be alert-displayed according to the error flags set in step 907.
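The double loop of steps 901 to 910 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: rows are plain dictionaries, the step 907 check is passed in as a function `needs_correction` that returns an error flag string or `None`, and the stand-in check used in the example is deliberately trivial.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_alert_determination(
    unconfirmed_rows: List[Dict[str, str]],
    items: List[str],
    needs_correction: Callable[[str, str], Optional[str]],
) -> Dict[Tuple[int, str], str]:
    """Return {(row index, item name): error flag} for the cells to alert."""
    error_flags = {}
    n = len(unconfirmed_rows)                    # step 901
    while n != 0:                                # step 902
        row_index = len(unconfirmed_rows) - n
        f = unconfirmed_rows[row_index]          # step 903: data item F
        m = len(items)                           # step 904
        while m != 0:                            # step 905
            item = items[len(items) - m]
            g = f[item]                          # step 906: metadata G
            flag = needs_correction(item, g)     # step 907 (detailed in FIG. 10)
            if flag is not None:
                error_flags[(row_index, item)] = flag
            m -= 1                               # step 908
        n -= 1                                   # step 909
    return error_flags                           # step 910: caller displays the alerts


# Stand-in for step 907, for illustration only: flag any value containing "≠".
def toy_check(item: str, value: str) -> Optional[str]:
    return "STRING" if "≠" in value else None


rows = [
    {"document type name": "estimate", "customer name": "西戸塚≠ャピタル"},
    {"document type name": "contract", "customer name": "西戸塚キャピタル"},
]
flags = run_alert_determination(rows, ["document type name", "customer name"], toy_check)
```

The counters N and M are kept as countdown variables to mirror the flowchart; an idiomatic version would simply iterate over the rows and items.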

<Process for determining the need for correction>
FIG. 10 is a flowchart for explaining the details of the process in step 907 of FIG. 9 that determines whether correction is needed.

When the metadata whose need for correction is to be determined (called P) has been read, in step 1001 the correction target calculation unit 412 checks the data type of P. If the data type is String, the data type that represents character strings, the process proceeds to step 1003. Otherwise, the process proceeds to step 1002.

In step 1002, the correction target calculation unit 412 checks the format of P. P is compared with the metadata format in the metadata item setting file 405; if the format is correct, the process proceeds to step 1005, and if not, the process proceeds to step 1006.
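A format check of this kind could look like the following Python sketch. The function name `format_is_valid`, the `YYYY/MM/DD` date layout, and the sample pattern are assumptions; the text only says that P is compared with the format given in the metadata item setting file.

```python
import re
from datetime import datetime
from typing import Optional


def format_is_valid(value: str, data_type: str, define: Optional[str] = None) -> bool:
    """Check a non-String value against its declared format (step 1002 sketch)."""
    if define is not None:
        # A description format given as a regular expression takes precedence.
        return re.fullmatch(define, value) is not None
    if data_type == "Int":
        return re.fullmatch(r"[0-9]+", value) is not None
    if data_type == "Date":
        try:
            datetime.strptime(value, "%Y/%m/%d")  # assumed date layout
            return True
        except ValueError:
            return False
    return True  # no rule known for this type in this sketch
```

For example, an OCR result such as `2009/0l/15`, in which the digit 1 was read as the letter l, fails the Date check and would be sent to step 1006.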

In step 1003, the correction target calculation unit 412 compares P with the character string data for the metadata item in the dictionary DB 403 and with the character string data for the same metadata item in the metadata already registered in the metadata DB 402. If an exactly matching character string exists, the process proceeds to step 1007. If not, the process proceeds to step 1004. Because identity with the confirmed character string data stored in the metadata DB 402 is also checked, a character string that is not in the dictionary DB but was judged correct in earlier processing is not flagged for correction again.
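The exact-match test of step 1003 can be sketched as a two-stage lookup. The function name `is_known_string` and the shapes of the two inputs (a set of keywords per item, and a list of confirmed rows as dictionaries) are assumptions made for this illustration.

```python
from typing import Dict, List, Set


def is_known_string(
    value: str,
    item: str,
    dictionaries: Dict[str, Set[str]],
    confirmed_rows: List[Dict[str, str]],
) -> bool:
    """True if value exactly matches a dictionary entry or a confirmed value."""
    if value in dictionaries.get(item, set()):
        return True
    # Values confirmed in earlier sessions count as correct even when the
    # dictionary does not list them.
    return any(row.get(item) == value for row in confirmed_rows)


# Illustrative data.
dictionaries = {"customer name": {"西戸塚キャピタル"}}
confirmed_rows = [{"customer name": "東戸塚商事"}]
```

Only an exact match short-circuits the check; anything else falls through to the morphological analysis of step 1004.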

In step 1004, the correction target calculation unit 412 performs morphological analysis on P. If the result contains at least a threshold number of undefined words, the process proceeds to step 1006, where the correction target calculation unit 412 sets the error flag to "STRING" (the error flag indicating that the character string contains an error; error flags are described later). If it does not, the process proceeds to step 1007. The threshold can be set freely according to the condition of the documents, the environment, and similar circumstances, and is stored in the parameter setting file 406.

Morphological analysis is explained here. FIG. 12 is a diagram showing an example of a morphological analysis result. The result consists of the target character string, the morpheme strings, and their parts of speech. For example, morphological analysis breaks "西戸塚≠ャピタル" down into "西", "戸塚", "≠", and "ャピタル". The analysis uses the dictionary data in the morphological analysis DB 404, which stores words and their parts of speech. In the analysis result, a morpheme that is a dictionary-registered word has part-of-speech data, whereas one that is not has none. In general, a character string that OCR has recognized correctly is meaningful text and is unlikely to contain undefined words. A character string that OCR has misrecognized, however, is likely to contain many undefined words.

The present invention uses this phenomenon as evidence for deciding whether a string has been misrecognized. The original character string behind "西戸塚≠ャピタル" is "西戸塚キャピタル" (a customer name); it took this form because OCR misrecognized it. Had it been recognized as "西戸塚キャピタル", morphological analysis would break it down into "西", "戸塚", and "キャピタル", all of which are words registered in the morphological analysis DB. For "西戸塚≠ャピタル", the morphological analysis result is as shown in FIG. 12 and contains undefined words. The string can therefore be regarded as one that needs correction.
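The undefined-word test can be illustrated with a toy analyzer in Python. Real morphological analyzers are far more sophisticated; this sketch swaps in a greedy longest-match lookup against a three-word lexicon, groups consecutive unmatched characters of the same Unicode general category into one undefined morpheme, and counts morphemes without a part of speech. The lexicon, the function names, and the default threshold of 1 are all assumptions.

```python
import unicodedata
from typing import List, Optional, Tuple

# Toy stand-in for the morphological analysis dictionary (word -> part of speech).
LEXICON = {"西": "noun", "戸塚": "noun", "キャピタル": "noun"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def analyze(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (morpheme, part of speech) pairs; the part of speech is None
    for undefined words."""
    result: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(text):
        # Try the longest lexicon entry that starts at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in LEXICON:
                result.append((word, LEXICON[word]))
                i += length
                break
        else:
            # No entry matched: extend the previous undefined morpheme when the
            # character class (letter, symbol, ...) is the same, else start anew.
            ch = text[i]
            kind = unicodedata.category(ch)[0]
            last = result[-1] if result else None
            if last and last[1] is None and unicodedata.category(last[0][-1])[0] == kind:
                result[-1] = (last[0] + ch, None)
            else:
                result.append((ch, None))
            i += 1
    return result


def needs_string_alert(text: str, threshold: int = 1) -> bool:
    """True when the number of undefined morphemes reaches the threshold."""
    undefined = sum(1 for _, pos in analyze(text) if pos is None)
    return undefined >= threshold
```

With this lexicon the misrecognized string splits into the same four morphemes as in the description, two of them undefined, while the correct string splits into three registered words. In the system the threshold would be read from the parameter setting file 406.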

In step 1005, the correction target calculation unit 412 determines whether the order relationship of the metadata is valid. The order relationship is valid when metadata whose values follow a regular order across related data actually obey that rule. For example, in business vouchers a single case has several documents, such as an estimate, an order form, and an invoice. These documents are normally created in the order estimate, order form, invoice, so their draft dates, from oldest to newest, follow the same order. If the draft dates are not in this order, the metadata can be regarded as erroneous, for example because of an OCR misrecognition. To determine validity, the correction target calculation unit 412 first refers to the metadata item setting file 405 and checks whether the metadata item being processed contains a “constraint order” tag. If it does not, the unit determines that the order relationship is correct and advances the process to step 1007. If it does, the unit examines the order relationship further. If the relationship is correct, the process proceeds to step 1007; if it is incorrect, the process proceeds to step 1006 and the correction target calculation unit 412 sets the error flag to “ORDER”. To examine the order relationship, the unit first reads the “group item”, “order item”, and “order” tags, which are child elements of the “constraint order” tag. The “group item” tag groups related data, including the data being processed. For example, an estimate, an order form, and an invoice that belong to the same case form one group, so “案件ID” (case ID) is specified as the “group item” tag. Note that the “group item” tag is used only as a child element of the “constraint order” tag.

In step 1006, the correction target calculation unit 412 updates the error flag to “FORMAT”. The error flag specifies what kind of alert is displayed for each piece of metadata. FIG. 11 shows an example of the data that manages error flags. This data consists of an ID identifying the metadata record and an error flag for each metadata item name. The error flags relate to the metadata displayed on the correction screen of FIG. 8 and are held in memory: they are created when the correction screen is opened and discarded when it is closed. The initial value of every error flag is “CORRECT”. There are five error flag values. “CORRECT” indicates that the character string was judged correct and needs no alert. “STRING” indicates that the character string was judged to contain an error. “FORMAT” indicates that the format of the character string was judged to be wrong. “ORDER” indicates that, on comparison with other related metadata, the order relationship of the values was judged to be wrong. “DEPENDENT” indicates that, on comparison with other related metadata, the value was judged to differ from the character string that should have been stored.
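As a rough illustration of the error flag data of FIG. 11, the following sketch models the five flag values as an enumeration and the in-memory table as a nested dictionary keyed by record ID and metadata item name. The names `ErrorFlag` and `new_flag_table` and the item names are assumptions made for this example; the source does not specify a data layout.

```python
from enum import Enum
from typing import Dict, Iterable


class ErrorFlag(Enum):
    """The five error flag values described in the text."""
    CORRECT = "CORRECT"      # judged correct, no alert needed
    STRING = "STRING"        # character string judged erroneous
    FORMAT = "FORMAT"        # format judged erroneous
    ORDER = "ORDER"          # order relationship judged erroneous
    DEPENDENT = "DEPENDENT"  # dependency relationship judged erroneous


def new_flag_table(record_ids: Iterable[str],
                   item_names: Iterable[str]) -> Dict[str, Dict[str, ErrorFlag]]:
    """Create the table when the correction screen opens.

    Every flag starts as CORRECT, matching the stated initial value.
    """
    names = list(item_names)
    return {rid: {name: ErrorFlag.CORRECT for name in names} for rid in record_ids}


# Hypothetical usage: one record with two metadata items.
table = new_flag_table(["102"], ["customer_name", "draft_date"])
table["102"]["draft_date"] = ErrorFlag.ORDER
```

Dropping the table when the screen closes corresponds to the flags being erased, as described above.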

In step 1007, among the data that share the same “group item” value, and where the metadata item named by the “order item” tag holds one of the character strings listed in the “order” tag, the correction target calculation unit 412 determines whether the metadata item being processed follows the sequence specified by the “order” tag. FIG. 13A shows an example: a set of data with the same case ID. The data with IDs 050 and 051 are confirmed, and the data with ID 102 is unconfirmed. The metadata definitions for these data are given in the metadata item setting file of FIG. 4. The validity of the order relationship is determined for the draft date of ID 102. Because the “constraint order” tag is set for the draft date, it is a metadata item for which an order relationship is defined. The data with the same case ID are those shown in FIG. 13A. Among them, with respect to the document type name specified by the “order item” tag, the draft dates must follow the order estimate, order form, invoice. However, the draft date of the data with ID 102 is older than those of the other two. The draft date of ID 102 is therefore considered to violate the order relationship.
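The order check can be sketched as follows, assuming each record is a dictionary with `case_id`, `doc_type`, and `draft_date` keys and that equal dates are tolerated. The field names, the function `order_is_valid`, and the sample dates are all hypothetical; FIG. 13A's actual values are not reproduced here.

```python
from datetime import date
from typing import Dict, List, Sequence

# Sequence that the "order" tag is assumed to specify.
DOC_ORDER: Sequence[str] = ("estimate", "order_form", "invoice")


def order_is_valid(target: Dict, confirmed: List[Dict],
                   order: Sequence[str] = DOC_ORDER) -> bool:
    """Check the target's draft date against confirmed records of the same case."""
    rank = {name: i for i, name in enumerate(order)}
    t_rank = rank.get(target["doc_type"])
    if t_rank is None:
        return True  # document type not covered by the constraint
    for rec in confirmed:
        if rec["case_id"] != target["case_id"]:
            continue  # different group ("group item" value differs)
        r_rank = rank.get(rec["doc_type"])
        if r_rank is None or r_rank == t_rank:
            continue
        # A document earlier in the sequence must not be dated later, and vice versa.
        if r_rank < t_rank and rec["draft_date"] > target["draft_date"]:
            return False
        if r_rank > t_rank and rec["draft_date"] < target["draft_date"]:
            return False
    return True


# Invented sample data in the spirit of FIG. 13A.
confirmed = [
    {"id": "050", "case_id": "A-001", "doc_type": "estimate", "draft_date": date(2010, 1, 10)},
    {"id": "051", "case_id": "A-001", "doc_type": "order_form", "draft_date": date(2010, 1, 20)},
]
target = {"id": "102", "case_id": "A-001", "doc_type": "invoice", "draft_date": date(2010, 1, 5)}
```

Here the invoice is dated before both the estimate and the order form, so `order_is_valid(target, confirmed)` reports a violation, which would correspond to the “ORDER” flag.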

Also in step 1007, the correction target calculation unit 412 determines whether the dependency relationship of the metadata is valid. When one metadata item depends on another, the dependency is valid if the item holds the value it should hold. For example, in business vouchers the customer name is generally the same throughout a case. If the customer name differs within one case, the stored metadata can be regarded as erroneous. To determine validity, the unit first refers to the metadata item setting file 405 and checks whether the metadata item being processed contains a “dependent item” tag. If it does not, the unit determines that the dependency relationship is correct and ends the process shown in the flowchart of FIG. 10. If it does, the unit examines the dependency relationship further. If the relationship is correct, the flowchart of FIG. 10 ends; if it is incorrect, the process moves to step 1008, where the error flag is set to “DEPENDENT”. To examine the dependency, the unit looks at the data that share the same value for the metadata item named in the “dependent item” tag, and checks whether the value of the metadata item being processed equals the value of that item in the confirmed data. FIG. 13B shows an example: a set of data with the same case ID. The data with IDs 070 and 071 are confirmed, and the data with ID 103 is unconfirmed. The metadata definitions for these data are given in the metadata item setting file of FIG. 4. The validity of the dependency relationship is determined for the draft date of ID 103. In these data, the “dependent item” tag is set for the customer name, so the customer name depends on the case ID. However, the customer name of ID 103 differs from the customer name of the two confirmed data. The customer name of ID 103 can therefore be regarded as violating the dependency relationship.
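A minimal sketch of the dependency check follows, assuming records are dictionaries and that the dependent item (customer name) must equal the value found in confirmed records sharing the same case ID. `dependency_is_valid`, the field names, and the sample values are hypothetical and not taken from FIG. 13B.

```python
from typing import Dict, List


def dependency_is_valid(target: Dict, confirmed: List[Dict],
                        item: str = "customer_name",
                        depends_on: str = "case_id") -> bool:
    """True when the target's value matches confirmed data with the same key."""
    expected = {rec[item] for rec in confirmed
                if rec[depends_on] == target[depends_on]}
    # With no confirmed data for this key there is nothing to contradict.
    return not expected or target[item] in expected


# Invented sample data in the spirit of FIG. 13B.
confirmed = [
    {"id": "070", "case_id": "B-002", "customer_name": "西戸塚キャピタル"},
    {"id": "071", "case_id": "B-002", "customer_name": "西戸塚キャピタル"},
]
target = {"id": "103", "case_id": "B-002", "customer_name": "東戸塚商事"}
```

For this data the check fails for ID 103, which would correspond to setting the “DEPENDENT” flag.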

<Alert determination result>
FIG. 14 shows an example of an alert determination result. Metadata that should be corrected is displayed in a distinguishing color (1401). This lets the user check the metadata to be corrected easily and reduces the risk of overlooking it.

The alert determination process is also executed whenever a character string in a cell is updated on the correction screen. In that case, the process of the flowchart shown in FIG. 10 is performed again, and an alert corresponding to the resulting error flag is displayed.

The description above covers the case in which the error flag of each piece of metadata holds exactly one value and only the alert for that flag is displayed.

However, a piece of metadata may contain several errors at once.
For such cases, the present invention can also display multiple alerts simultaneously. An error flag is then defined for each alert type, and each flag records whether that alert is needed. When alerts are displayed, the flags are consulted and an alert is raised for every item that needs one.
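One way to hold a separate on/off flag per alert type is a bit-flag enumeration, sketched below with Python's `enum.Flag`. This is an assumed representation, not one the source prescribes; `Alert` and `active_alerts` are hypothetical names.

```python
from enum import Flag, auto
from typing import List


class Alert(Flag):
    """One bit per alert type, so several can be set at once."""
    NONE = 0
    STRING = auto()
    FORMAT = auto()
    ORDER = auto()
    DEPENDENT = auto()


def active_alerts(flags: Alert) -> List[str]:
    """Names of the alert types that are switched on, in a fixed order."""
    kinds = (Alert.STRING, Alert.FORMAT, Alert.ORDER, Alert.DEPENDENT)
    return [kind.name for kind in kinds if kind in flags]


# Hypothetical usage: a date that is both malformed and out of order.
flags = Alert.FORMAT | Alert.ORDER
```

`Alert.NONE` plays the role of “CORRECT” in the single-valued scheme: no bit set means no alert is displayed.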

<Suggestion process>
FIG. 15 is a flowchart explaining the suggestion process executed by the correction candidate calculation unit 413.

In step 1501, the correction candidate calculation unit 413 accepts the user's input (selection) of a cell containing metadata on the metadata correction screen.

In step 1502, the correction candidate calculation unit 413 examines the error flag of the selected metadata (denoted S) and carries out the subsequent processing according to that flag. If the error flag is “CORRECT”, the metadata is regarded as correct and the process proceeds directly to step 1508.

Step 1503 handles the case in which the error flag is “STRING”, meaning the character string of the metadata was misrecognized. To present the correct character string, the correction candidate calculation unit 413 first compares S with the values of the same metadata item in the metadata DB 402 and the dictionary DB 403. Any string matching method can be used for the comparison; DP matching is one example.
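As one possible realization of DP matching, the sketch below computes the Levenshtein edit distance by dynamic programming and ranks candidates by a normalized similarity. The choice of Levenshtein distance, the scoring formula, and the names `edit_distance` and `rank_candidates` are assumptions for illustration; the source leaves the matching method open.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_candidates(s: str, candidates: List[str],
                    top_n: int = 3) -> List[Tuple[str, float]]:
    """Return up to top_n candidates with similarity in [0, 1], best first."""
    scored = []
    for cand in candidates:
        longest = max(len(s), len(cand)) or 1  # avoid division by zero
        scored.append((cand, 1.0 - edit_distance(s, cand) / longest))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

For S = “西戸塚≠ャピタル”, the registered string “西戸塚キャピタル” differs by a single substitution, so it ranks above unrelated entries.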

In step 1507, the correction candidate calculation unit 413 displays the results of the matching performed in step 1503 on the display device 407, in descending order of matching degree. FIG. 16B shows an example of the suggestion screen at this point; bold type marks the characters that matched. The user can select any of the suggested character strings. Displaying candidate character strings in this way makes it easy for the user to correct the string.

Step 1504 handles the case in which the error flag is “FORMAT”, meaning the format of the metadata is wrong. To present the correct format, the correction candidate calculation unit 413 displays a pop-up with the contents of the “define” tag of the metadata item being processed, taken from the metadata item setting file 405. FIG. 16A shows an example of this pop-up. The example shows a sample value in addition to the metadata definition. This can be achieved by defining, for instance, an “example” tag in the metadata item setting file, setting the sample string in it, and displaying that string in the pop-up.

Step 1505 handles the case in which the error flag is “ORDER”, meaning the order relationship of the metadata is wrong. To present the correct data range, the correction candidate calculation unit 413 reads the definition in the metadata item setting file 405 and displays a pop-up with the range of values the item can take. FIG. 16C shows an example of this pop-up.
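The range shown in the pop-up could be derived from the confirmed documents of the same case, as in the following sketch. It assumes the same hypothetical record layout and document sequence as an order check would use; `allowed_range` and the sample dates are invented for illustration, and the source does not say how the range is computed.

```python
from datetime import date
from typing import Dict, List, Optional, Sequence, Tuple

# Sequence that the "order" tag is assumed to specify.
DOC_ORDER: Sequence[str] = ("estimate", "order_form", "invoice")


def allowed_range(doc_type: str, case_id: str, confirmed: List[Dict],
                  order: Sequence[str] = DOC_ORDER
                  ) -> Tuple[Optional[date], Optional[date]]:
    """Return (earliest, latest) permissible draft date; None means unbounded."""
    rank = {name: i for i, name in enumerate(order)}
    if doc_type not in rank:
        return (None, None)
    lower: Optional[date] = None
    upper: Optional[date] = None
    for rec in confirmed:
        if rec["case_id"] != case_id or rec["doc_type"] not in rank:
            continue
        d = rec["draft_date"]
        if rank[rec["doc_type"]] < rank[doc_type]:
            # Earlier document types bound the date from below.
            lower = d if lower is None else max(lower, d)
        elif rank[rec["doc_type"]] > rank[doc_type]:
            # Later document types bound the date from above.
            upper = d if upper is None else min(upper, d)
    return (lower, upper)


# Invented sample data: an order form must fall between estimate and invoice.
confirmed = [
    {"case_id": "A-001", "doc_type": "estimate", "draft_date": date(2010, 1, 10)},
    {"case_id": "A-001", "doc_type": "invoice", "draft_date": date(2010, 2, 15)},
]
```

With this data, an order form in case A-001 would be offered the range from 2010-01-10 to 2010-02-15.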

Step 1506 handles the case in which the error flag is “DEPENDENT”, meaning the dependency relationship of the metadata is wrong. To present the correct character string, the correction candidate calculation unit 413 reads the “dependent item” tag of the metadata item being processed from the metadata item setting file 405, refers to the confirmed data that share the same value for the item it depends on, and suggests the corresponding metadata value. FIG. 16D shows an example of this suggestion screen. The user can correct the character string easily by selecting the suggested string.
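The branching on the error flag in steps 1502 to 1506 amounts to a dispatch table. The sketch below records, for each flag value, the step that handles it and the kind of output it produces; the dictionary layout and the name `suggestion_for` are assumptions, and the descriptions paraphrase the text.

```python
from typing import Dict, Optional, Tuple

# flag value -> (handling step, what is shown to the user)
SUGGESTION_STEPS: Dict[str, Optional[Tuple[int, str]]] = {
    "CORRECT": None,  # nothing to suggest; go straight to step 1508
    "STRING": (1503, "candidate strings ranked by matching degree"),
    "FORMAT": (1504, "pop-up with the 'define' tag contents"),
    "ORDER": (1505, "pop-up with the permissible value range"),
    "DEPENDENT": (1506, "value taken from confirmed data of the same group"),
}


def suggestion_for(flag: str) -> Optional[Tuple[int, str]]:
    """Look up the handling step for a flag; unknown flags are rejected."""
    if flag not in SUGGESTION_STEPS:
        raise ValueError(f"unknown error flag: {flag}")
    return SUGGESTION_STEPS[flag]
```

Keeping the mapping in data rather than in nested conditionals makes it straightforward to add a further flag type later.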

In step 1508, the correction candidate calculation unit 413 determines whether the character string has been corrected. If it has, the process proceeds to step 1509; if not, the process ends.

In step 1509, the correction candidate calculation unit 413 replaces the displayed character string of the metadata being processed on the correction screen with the corrected character string.

In step 1510, the correction candidate calculation unit 413 determines once more whether the corrected character string needs correction, because a corrected string is not necessarily a correct one. This determination follows the flowchart of FIG. 10.

In step 1511, the correction candidate calculation unit 413 displays an alert according to the error flag set in step 1510. As noted earlier, if an error flag is set for each item, multiple suggestion processes can be performed simultaneously, just as with the alert process.

<Summary>
As described above, according to this embodiment, for each of the documents stored in the metadata database, the metadata that should be corrected is extracted from that document's metadata, and an alert is displayed to indicate that the metadata needs correction. In addition, the metadata to be corrected is compared with the dictionary data in the dictionary database, and similar data is suggested as correction candidates. This makes it easy to find and correct errors in character strings obtained by scanning paper documents and applying OCR, so that users can set metadata without undue effort.

When the data being processed is corrected, the system determines again whether the corrected data needs correction and, if so, displays the alert again. As a result, correct metadata is ultimately obtained.

Morphological analysis is also performed on the data being processed among the metadata, and the system determines whether the data obtained by the analysis exists in the morphological analysis dictionary data of the morphological analysis DB. If the number of data items obtained by the morphological analysis is less than a predetermined number, the data is determined to be a correction target. Because the morphological analysis DB used for OCR recognition can be reused to decide whether metadata needs correction, the cost of implementing the document processing system can be kept down.

When the data being processed among the metadata is something other than a character string (a date, a case ID number, and so on), the system determines whether the data is written in a predefined format and extracts data that does not follow the format as metadata to be corrected. This makes it possible to point out and correct recognition errors in values, such as numbers, that are not in the dictionary DB.
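A format check of this kind can be sketched with regular expressions. The patterns below are invented for illustration; the source states only that a format is predefined in the metadata item setting file, not what syntax it uses. `FORMATS` and `format_is_valid` are hypothetical names.

```python
import re
from typing import Dict

# Assumed formats, standing in for definitions in the metadata item setting file.
FORMATS: Dict[str, str] = {
    "draft_date": r"\d{4}/\d{2}/\d{2}",  # e.g. 2010/03/23
    "case_id": r"[A-Z]-\d{3}",           # e.g. A-001
}


def format_is_valid(item: str, value: str,
                    formats: Dict[str, str] = FORMATS) -> bool:
    """True when the value matches the item's pattern in full.

    Items without a defined format are not checked here.
    """
    pattern = formats.get(item)
    if pattern is None:
        return True
    return re.fullmatch(pattern, value) is not None
```

A typical OCR confusion such as the letter “O” in place of the digit “0” (“2O10/03/23”) fails the date pattern and would be flagged as “FORMAT”.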

Furthermore, the data being processed among the metadata is compared with already registered data (confirmed metadata) to determine whether it is inconsistent with the registered data. If it is, the data being processed is extracted as metadata to be corrected. Inconsistency here covers two cases. In the first, the registered data includes another document with the same case ID as the document holding the data being processed, and the creation order among these documents is not valid. In the second, the dependency relationship among metadata items in the document holding the registered data differs from that in the document holding the data being processed. In this way the system can judge not only whether a single metadata item is correct, but also whether it is correct in relation to other items and other documents, so that appropriate metadata is ultimately obtained.

Suggestions can also be tailored to the reason for the correction. For a character string, the data to be corrected is compared with the metadata database and the dictionary database, and character string data with a high matching degree is suggested as correction candidates. For a format error, the appropriate format defined in the metadata item setting file, which defines how metadata is written, is suggested. For an order error, data that satisfies the valid creation order is suggested. For a dependency error, character string data that satisfies the valid dependency relationship is suggested.

The present invention can also be realized by software program code that implements the functions of the embodiment. In this case, a storage medium on which the program code is recorded is supplied to a system or apparatus, and the computer (or CPU or MPU) of that system or apparatus reads the program code stored on the storage medium. The program code read from the storage medium then itself implements the functions of the embodiment described above, so the program code and the storage medium storing it constitute the present invention. Storage media for supplying such program code include, for example, flexible disks, CD-ROMs, DVD-ROMs, hard disks, optical disks, magneto-optical disks, CD-Rs, magnetic tapes, nonvolatile memory cards, and ROMs.

Alternatively, an OS (operating system) or the like running on the computer may perform part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment described above may be realized by that processing. Further, after the program code read from the storage medium has been written into memory on the computer, the CPU or the like of the computer may perform part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment may be realized by that processing.

The software program code that implements the functions of the embodiment may also be distributed over a network and stored in a storage means such as a hard disk or memory of a system or apparatus, or on a storage medium such as a CD-RW or CD-R. At the time of use, the computer (or CPU or MPU) of the system or apparatus may then read and execute the program code stored in that storage means or on that storage medium.

401 ... Scan data DB
402 ... Metadata DB
403 ... Dictionary DB
404 ... Morphological analysis DB
405 ... Metadata item setting file
406 ... Parameter setting file
407 ... Display device
408 ... Keyboard
409 ... Mouse
410 ... Central processing unit
411 ... Correction screen display unit
412 ... Correction target calculation unit
413 ... Correction candidate calculation unit
601 ... File ID
602 ... Status
603 ... File path
604 ... Overall metadata
605 ... Document type name
606 ... Customer name
607 ... Draft date
608 ... Case ID
801 ... Correction execution menu
802 ... Processing check box
1401 ... Alert-displayed metadata
1601 ... Format display pop-up
1602 ... Input candidate character string suggestion
1603 ... Data range pop-up
1604 ... Appropriate character string suggestion

Claims

A document processing system comprising:
a metadata database that stores, for each of a plurality of documents, a plurality of metadata obtained by scanning a plurality of paper documents and applying OCR;
a dictionary database that stores character strings that can be character strings of the metadata;
a display device;
an input device; and
a central processing unit,
wherein the central processing unit
extracts, for each of the plurality of documents stored in the metadata database, metadata to be corrected from among the plurality of metadata, and displays on the display device an alert indicating that the metadata needs to be corrected, and
compares the metadata to be corrected with dictionary data included in the dictionary database, and displays similar data on the display device as correction candidates in a suggestion display.

In claim 1,
The document processing system further comprises a morphological analysis database that stores morphological analysis dictionary data used for morphological analysis,
wherein the central processing unit performs morphological analysis on processing target data among the plurality of metadata, determines whether data obtained by the morphological analysis exists in the morphological analysis dictionary data, and, when the number of data obtained by the morphological analysis is less than a predetermined number, determines that the processing target data is the metadata to be corrected.

In claim 1,
The document processing system, wherein, when the processing target data among the plurality of metadata is data other than a character string, the central processing unit determines whether the processing target data is described in a predefined format, and extracts processing target data that is not described in the format as the metadata to be corrected.

In claim 1,
The document processing system, wherein the central processing unit compares processing target data among the plurality of metadata with already registered data, determines whether the processing target data is inconsistent with the registered data, and, if an inconsistency exists, extracts the processing target data as the metadata to be corrected.

In claim 4,
The document processing system, wherein, when the processing target data is data other than a character string described in a predefined format, the central processing unit determines whether the registered data includes another document having the same case ID as the document having the processing target data, and, if the creation order between these documents is not valid, extracts the processing target data as the metadata to be corrected.

In claim 4,
The document processing system, wherein, when the processing target data is data other than a character string described in a predefined format and the registered data includes another document having the same case ID as the document having the processing target data, the central processing unit extracts the processing target data as the metadata to be corrected if the dependency relationship of each metadata item in the document having the registered data differs from the dependency relationship of the metadata items in the document having the processing target data.

In claim 2,
The document processing system, wherein the central processing unit compares the processing target data to be corrected with the metadata database and the dictionary database, and suggests character string data having a high matching degree as the correction candidates.

In claim 3,
The document processing system, wherein, for the processing target data to be corrected, the central processing unit suggests as the correction candidate an appropriate format defined in a metadata item setting file that defines the description format of metadata.

In claim 5,
The document processing system, wherein the central processing unit suggests, as the correction candidate, data that guarantees a valid creation order with respect to the processing target data to be corrected.

In claim 6,
The document processing system, wherein the central processing unit suggests, as the correction candidate, character string data that has a valid dependency relationship with respect to the processing target data to be corrected.

In any one of claims 1 to 10,
The document processing system, wherein, when the processing target data is corrected, the central processing unit determines again whether the corrected data needs correction, and, if correction is needed, displays the alert again.

A program for causing a computer and a storage device to function as the document processing system according to claim 1.