CN1452098A - File classing system and program for carrying out same

Info

Publication number: CN1452098A
Application number: CN02141403.3A
Authority: CN
Inventors: 古贺昌史; 丸川勝美; 田中雅子
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-04-19
Filing date: 2002-08-28
Publication date: 2003-10-29
Also published as: JP2003317034A

Abstract

The invention provides a document classification system and a program that implements it, making it possible to classify letters efficiently according to their content. A character recognition device measures the occurrence frequency of the words stored in an important word dictionary, and a document type recognition unit estimates the document type from those frequencies.

Description

Document Classification System and Its Realization Program

Technical Field

The present invention relates to document classification technology in which a computer automatically classifies inquiry letters, e-mails, and the like at the customer service desks of companies, government offices, and similar organizations, and to a support system for answering such inquiries.

Background Art

In manufacturing, insurance, telecommunications, government offices, and other fields, handling inquiries received directly from customers in document form, such as e-mail, letters, and fax, has become increasingly important in recent years. In many cases it is difficult to have one person answer a wide variety of inquiries efficiently. The number of inquiries is usually more than one person can handle, and their content often spans many areas. In manufacturing, for example, documents must be handled that concern complaints about products, purchasing methods, operating methods, and so on. Handling all of these requires broad knowledge, and it is usually difficult to secure staff who have it.

A system is therefore needed that recognizes the type of each inquiry document, classifies it, and distributes it according to its content to specialist staff, who then answer it.

The present invention relates to technology for classifying such inquiry documents by computer, and to technology for a support system that uses a computer to help answer inquiries.

Techniques for classifying e-mail inquiries by computer are already known. A typical method applies multivariate pattern recognition, using as features the occurrence frequencies of a specific group of words (important words) in the document. The body text and file name of an e-mail are text data, so word occurrence frequencies can be obtained by simple word matching or morphological analysis. Techniques are also known that, once the type of an inquiry e-mail has been recognized, automatically reply by e-mail or the like according to that type.

As a method for recognizing the type of an inquiry letter, there is also a method that converts the content of the letter to text with a character recognition device and then applies the same technique as above.

However, conventional letter type recognition based on character recognition has a recognition accuracy problem. In general, to make character recognition highly accurate, the words that can appear must be stored in advance as a dictionary. To recognize handwritten characters in particular, the vocabulary must be narrowed to roughly a few hundred words. With the character recognition devices used in conventional letter type recognition, however, it is difficult to narrow the range of possible words in advance. It is therefore difficult to recognize the type of a letter with sufficiently high accuracy.

In addition, in ordinary character recognition, ambiguity remains in the recognition result after character segmentation. For example, several candidate characters are typically obtained for each partial image produced by segmentation, and the segmentation itself is sometimes ambiguous. Estimating the occurrence frequency of a specific word from such a recognition result is not straightforward, and methods that compute word occurrence frequencies from text data cannot be used as they are. Conversely, if this ambiguity is not tolerated and the recognition result is treated as plain text, many words in the document will be missed.

Furthermore, recognition errors cannot be avoided in character recognition. Because of these errors, the document type often cannot be recognized or is recognized incorrectly. With conventional methods, such rejections and misrecognitions greatly reduce work efficiency.

Summary of the Invention

The first problem to be solved by the present invention is to recognize the type of a letter by character recognition at high speed and with high accuracy in such an inquiry answering support system.

The second problem to be solved by the present invention is to improve the maintainability of the dictionary used for character recognition.

The third problem to be solved by the present invention is to improve the interface between character recognition processing, whose results contain ambiguity, and document classification processing, which takes text data as input, so that the two fit together better within the system.

The fourth problem to be solved by the present invention is to provide an environment in which answering work can continue efficiently when the document type cannot be recognized or is recognized incorrectly.

The set of words that are important for document type recognition is used as the word dictionary for character recognition. Instead of reading every character as in the past, word spotting is used to measure only the occurrence frequencies of the important words. Unlike the conventional approach, the output of character recognition is a vector of word occurrence frequencies. The resulting frequencies are fed into an existing document type recognizer to identify the document type.

The device that performs character recognition sends the device used for answering work not only the document type recognition result but also the document type candidates ranked second and lower, together with the word recognition results. The answering work device supports the answering work by highlighting the important words on the image of the inquiry letter. In addition, an environment is provided in which, when the document type is wrong, the letter image can be forwarded to an appropriate respondent by using the candidates ranked second and lower.

Brief Description of the Drawings

FIG. 1 shows the hardware configuration of the embodiment.

FIG. 2 is a data flow diagram showing the processing flow of the embodiment.

FIG. 3 is a data flow diagram showing the flow of learning processing.

FIG. 4 is a data flow diagram showing the flow of e-mail classification processing.

FIG. 5 is a data flow diagram showing the flow of letter classification processing.

FIG. 6 is a data flow diagram showing the flow of answer processing.

FIG. 7 shows the display screen used for answering work.

FIG. 8 is an operation sequence diagram showing the learning processing steps.

FIG. 9 is an operation sequence diagram showing the letter processing steps.

FIG. 10 shows the data format of the word occurrence frequencies.

FIG. 11 shows the data format of the answering work data.

FIG. 12 shows an example of an input image.

FIG. 13 shows the virtual segmentation network.

FIG. 14 shows the detected important words.

Detailed Description of the Embodiments

FIG. 1 shows the configuration of an inquiry answering system according to an embodiment of the present invention, that is, a system that automatically classifies inquiries from customers and supports the work of answering them. The inputs to the system are inquiries made by e-mail, letter, telephone, and the like. The outputs are the answers to those inquiries, likewise by e-mail, letter, telephone, and the like. To communicate with the outside, the system is connected to the outside through an interface and telephone lines. The computers within the system exchange information over a LAN.

Before inquiries can be accepted with this system, learning must be performed, that is, processing that computes the information needed to recognize the type of an inquiry document and generates dictionaries. Reference numeral 101 denotes the learning computer that manages learning. The learning computer 101 refers to learning data collected in advance in the learning data file system 102, computes the information needed for document type recognition, and stores it as a classification dictionary in the dictionary file system 103. The learning data is a set of pairs, each consisting of text data transcribing the content of an inquiry and the identifier of its inquiry type. The text data is taken from examples of past inquiries, and the corresponding document types are assigned manually. The generated classification dictionary is copied over the LAN as needed to the dictionary file system 107 of the e-mail classification computer 106, the dictionary file system 109 of the letter classification computer 108, and the dictionary file system 115 of the voice inquiry classification computer 114.

An inquiry by e-mail is received by the e-mail classification computer 106 from the Internet 104 outside the system via the gateway 105. The e-mail classification computer 106 recognizes the type of the e-mail from the inquiry content and forwards the document type recognition result and the positions of the important words (described later), associated with the e-mail, to the automatic answering computer 116.

An inquiry letter is photoelectrically converted by the scanner 110 with sorter, which is connected to the letter classification computer 108, and is input as an image. The letter classification computer 108 recognizes characters in the image using the word spotting technique described later and recognizes the type of the letter from the inquiry content. It forwards the recognition result, associated with the image and the positions of the important words (described later), to the automatic answering computer 116. Inquiries sent by fax are also delivered from the telephone line 111 via the line control device 113 to the letter classification computer 108 and processed in the same way. After this processing, the sorter of the scanner sorts the letters for storage according to the document type recognition result.

An inquiry by telephone is delivered from the telephone line 111 via the line control device 113 to the voice inquiry classification computer 114. The voice inquiry classification computer 114 recognizes the speech and converts it to text, classifies it by inquiry content, and transfers the call to the answering telephone 112, where a specialist staff member suited to the content answers.

When the type of a forwarded document is one that can be answered automatically, the automatic answering computer 116 retrieves a suitable sample answer from the sample answer file system 117 and either replies by e-mail or prints the reply as a letter on the automatic sealing printer 118. When an automatic answer is not possible, it forwards the inquiry document, according to its document type, to the answering work device (121, 125, 126) where a suitable specialist is on standby.

As shown at 121, an answering work device comprises a computer 122, an input device 123 consisting of a keyboard, a mouse, and the like, and an image display device 124. Using these, each staff member refers to the inquiry document, composes an answer document, and forwards it to the automatic answering computer 116. The automatic answering computer 116 then sends the answer e-mail or prints the answer letter in the same manner as above.

The processing flow of the system is described below with reference to the data flow diagram of FIG. 2. The diagram follows the Gane-Sarson notation (J. Martin, "Software Structuring Techniques," Kindai Kagaku Sha, ISBN4-7649-0124-2 C3050 P5562E): solid arrows indicate flows of information, and hollow arrows indicate flows of physical objects. Rounded rectangles represent processes, and rectangles open on the right side represent stored information.

First, learning 202, the processing that computes the information needed to recognize the type of an inquiry document, is executed on the learning data 201 and generates the word statistics dictionary 203 and the important word dictionary 204. The important word dictionary 204 stores the set of words that are important features for recognizing the document type. The word statistics dictionary 203 stores the statistics needed to recognize the document type from the occurrence frequencies of the important words. This processing is carried out by the learning computer 101.

When an inquiry e-mail is received, e-mail classification 205 refers to the word statistics dictionary 203 and the important word dictionary 204 and recognizes the type of the inquiry document. The resulting document type is output as answering work data, associated with the e-mail itself and with the position information of the important words obtained during processing. This processing is carried out by the e-mail classification computer 106.

When an inquiry arrives by letter, letter classification 206 refers to the word statistics dictionary 203 and the important word dictionary 204 and recognizes the type of the inquiry document. The resulting document type is output as answering work data, associated with the document image and with the positions of the important words obtained during processing. The letter itself is sorted according to the resulting document type for purposes such as storage. This processing is carried out by the letter classification computer 108 and the scanner 110 with sorter.

Using the answering work data, answering 208 generates the answer text and prints an answer letter or sends an answer e-mail. At the same time, in order to update the word statistics dictionary 203, the important-word occurrence frequencies and the document type determined at answering time are output to dictionary update 210. When the input is an e-mail, the document type and the text of the e-mail are added to the learning data. Re-learning with the learning data added in this way adapts the word statistics dictionary 203 and the important word dictionary 204 to actual operating conditions. This processing is carried out by the automatic answering computer 116, the automatic sealing printer 118, and the answering work devices 121, 125, and 126.

Dictionary update 210 updates the classification dictionary based on the important-word occurrence frequencies and document types of the inquiry documents obtained during operation. The statistics used for recognition are updated with standard pattern recognition methods. This processing is carried out by the learning computer 101.

When an inquiry is made by telephone, voice inquiry classification 207 recognizes the type of the inquiry content, and a specialist staff member suited to the content handles the answering call 208. This processing is carried out by the voice inquiry classification computer 114 and the answering telephone 112.

Next, the data flow diagram of FIG. 3 shows the processing flow of learning 202. First, important word extraction 301 extracts from the learning data 201 the words that are important for recognizing the document type and stores them in the important word dictionary 204. This processing is implemented with natural-language morphological analysis and pattern recognition feature selection techniques. Next, word statistics calculation 302 computes the statistics needed to recognize the document type and stores them in the word statistics dictionary 203. The recognition method is a standard pattern recognition method, such as a quadratic discriminant function or a neural network. When a quadratic discriminant function is used, the word statistics are the occurrence frequency of each word and the covariances among them. When a neural network is used, the word statistics are the connection weights of the network.
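As an illustration of how stored word statistics might be used by a quadratic discriminant function, the following is a minimal sketch only. It assumes a simplified diagonal-covariance form (per-word means and variances, equal priors), which is a narrower choice than the covariance-based statistics the text mentions; the function name `quadratic_scores` and the sample statistics are hypothetical.

```python
import math

def quadratic_scores(x, stats):
    """Score frequency vector x against each document type.

    stats maps a document type to (means, variances), one entry per
    important word. This is a diagonal-covariance simplification chosen
    for brevity; it is an assumption, not the patent's exact function.
    Higher scores mean a better match.
    """
    scores = []
    for doc_type, (means, variances) in stats.items():
        total = 0.0
        for xj, mj, vj in zip(x, means, variances):
            # Squared distance scaled by the variance, plus a log-variance term.
            total += (xj - mj) ** 2 / vj + math.log(vj)
        scores.append((doc_type, -total))
    # Best candidate first, so lower-ranked candidates remain available.
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical statistics for two document types over two important words.
stats = {
    "claim": ([3.0, 0.0], [1.0, 1.0]),
    "purchase": ([0.0, 2.0], [1.0, 1.0]),
}
ranked = quadratic_scores([2, 0], stats)
```

The ranked list keeps every candidate rather than only the best one, which matches the later use of second and lower-ranked candidates for forwarding.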

Next, the data flow diagram of FIG. 4 shows the processing flow of e-mail classification 205. First, word extraction 401 refers to the important word dictionary 204 and computes the occurrence frequency of each important word in the e-mail. General language processing techniques such as morphological analysis are used to detect the important words. Next, document type recognition 402 recognizes the document type using the word statistics in the word statistics dictionary 203. Recognition uses standard pattern recognition methods such as a quadratic discriminant function or a neural network. Finally, answering work data generation 403 generates and outputs the answering work data, that is, data that associates the positions of the important words and the document type recognition result with the e-mail. Document type recognition usually yields several candidates with different reliabilities, and all of them are stored in the answering work data.
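The word extraction step for e-mail can be sketched as follows. This is a minimal illustration that uses plain substring matching in place of morphological analysis, which is an assumption made for brevity; the function name `count_important_words` and the sample dictionary are hypothetical.

```python
def count_important_words(text, important_words):
    """Return (frequency vector, positions) for the important words in text.

    Plain substring matching stands in for morphological analysis here.
    Positions are character offsets into the string; a byte offset would
    need the text to be encoded first.
    """
    freq = [0] * len(important_words)
    positions = []  # (word index, offset of the word's first character)
    for idx, word in enumerate(important_words):
        start = text.find(word)
        while start != -1:
            freq[idx] += 1
            positions.append((idx, start))
            # Continue searching after this occurrence (no overlaps).
            start = text.find(word, start + len(word))
    return freq, positions

# Hypothetical important word dictionary and inquiry text.
words = ["refund", "manual", "price"]
mail = "Where is the manual? The manual PDF link is broken. What is the price?"
freq, positions = count_important_words(mail, words)
```

The frequency vector is what the document type recognition consumes, while the positions are kept so that the important words can later be highlighted for the staff member.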

Next, the data flow diagram of FIG. 5 shows the processing flow of letter classification 206. First, the region of the letter in which text is written is input as an image. Next, word spotting recognition 502 refers to the important word dictionary 204, recognizes the important words in the image, and outputs the occurrence frequency of each. Document type recognition 503 then refers to the word statistics stored in the word statistics dictionary 203 and recognizes the document type from the important-word occurrence frequencies. Answering work data generation 505 then outputs the image, the important-word occurrence frequencies, and the document type recognition result in association with one another. The letter itself is sorted according to the document type recognition result.

Next, the processing flow of answering 208 is described with reference to the data flow diagram of FIG. 6. First, destination determination 601 forwards the answering work data to answering work 1 to 3 (602, 603, 604) or to automatic answering 605, according to the document type attached to the answering work data. This processing is carried out by the automatic answering computer 116. In answering work 1 to 3 (602, 603, 604), the specialist staff for each document type examine the inquiry content and compose an answer. This is carried out by staff using answering work devices 1 to 3 (121, 125, 126). In automatic answering 605, a sample answer corresponding to the document type is retrieved from the sample answer collection 606 and output. This processing is carried out by the automatic answering computer 116. Answer e-mail sending 607 sends by e-mail the answer text obtained from answering work 1 to 3 (602, 603, 604) or from automatic answering 605. This processing is carried out by the automatic answering computer 116. Answer printing 608 prints the answer text on paper to produce an answer letter. This processing is carried out by the automatic sealing printer 118.

In practice, document type recognition by computer is not always correct, and recognition may also fail or be rejected. This system therefore handles document type recognition errors and rejections as follows. In answering work 1, 2, or 3 (602, 603, 604), when an assigned inquiry document is outside the staff member's own area of responsibility, the staff member performs a forwarding operation on the operation screen described later. As noted above, document type recognition usually yields several candidates with different reliabilities. Taking advantage of this, when a forwarding operation is performed, the answering work data is automatically forwarded to the destination corresponding to the document type candidates ranked second and lower. The staff member can also specify the forwarding destination directly. In the case of a rejection, the important words detected in the e-mail text or in the image are highlighted to support the answering work.
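The automatic forwarding behaviour can be sketched as a walk down the ranked candidate list. This is a minimal illustration under stated assumptions: the routing table, the station names, and the function name `next_destination` are all hypothetical, and the text does not specify how destinations that have already been tried are tracked.

```python
def next_destination(candidates, routing_table, already_tried):
    """Pick the next forwarding destination for an inquiry.

    candidates: list of (document type, reliability), best first.
    routing_table: maps a document type to a destination name.
    already_tried: destinations that have already rejected the inquiry.
    Returns None when no untried destination remains.
    """
    for doc_type, _reliability in candidates:
        dest = routing_table.get(doc_type)
        if dest is not None and dest not in already_tried:
            return dest
    return None

# Hypothetical candidates and routing table.
candidates = [("claim", 0.5), ("purchase", 0.3), ("operation", 0.2)]
routing = {"claim": "station_1", "purchase": "station_2", "operation": "station_3"}

first = next_destination(candidates, routing, set())
second = next_destination(candidates, routing, {"station_1"})
exhausted = next_destination(
    candidates, routing, {"station_1", "station_2", "station_3"}
)
```

When the first destination reports that the document is outside its area, the second-ranked candidate supplies the next destination without further recognition work.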

FIG. 7 shows an example of the screen displayed on the image display device of answering work devices 1 to 3 (121, 125, 126). The inquiry document is displayed in the inquiry document window 708 of the screen 701: text for an e-mail inquiry, and the image of the letter for an inquiry in letter form. Using the important-word positions in the answering work data, the important words are highlighted in the same window to make the answering work easier. The staff member edits the answer in the answer editing window 709, where an ordinary word processor can be used. When the document shown in the inquiry document window 708 is not one the staff member is responsible for, the staff member can click the automatic forward button 703 with the mouse provided as an input device. In response, the answering work data is forwarded to the answering work device or automatic answering computer corresponding to the candidates ranked second and lower in the recognition result. Window 702 displays the document type recognition candidates, including those ranked second and lower. To specify the forwarding destination manually, the operator selects a document type with the radio buttons in window 702 and then clicks the forward button 704. To search for past sample answers for the document type, the operator clicks the sample search button 705; the sample answer is then transferred over the LAN from the sample answer file system 117 and displayed in the answer editing window 709. Clicking the send button 706 sends the edited answer back to the person who sent the e-mail inquiry. Clicking the print button 707 prints the edited answer on the automatic sealing printer 118.

FIG. 8 shows the processing steps of learning 202. In important word extraction, words are first extracted by morphological analysis from each item of learning text data $T_i$ ($1 \le i \le N$, where $N$ is the number of data items), and their occurrence frequencies are stored as a vector $u_i = (u_{i1}, u_{i2}, \ldots, u_{iM})$, where $u_{ij}$ is the number of times word $j$ occurs in $T_i$ and $M$ is the total number of words. From the set of pairs $\{(u_i, c_i)\}$ of each vector $u_i$ and the manually assigned type $c_i$ of the text data $T_i$, the $M'$ words that are important for classification ($M' \ll M$) are selected by a known method such as feature selection with the branch and bound algorithm. Important words may also be selected manually when necessary. Next, in word statistics calculation, the occurrence frequency of each important word is computed and stored as a vector $v_i = (v_{i1}, v_{i2}, \ldots, v_{iM'})$, and the statistics needed for document type recognition are computed. For example, when the recognition method is a quadratic discriminant function, statistics such as the means and correlation coefficients of the variables $v_{i1}, v_{i2}, \ldots, v_{iM'}$ are computed.
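The word statistics calculation can be sketched as follows for the quadratic discriminant case. This is a minimal illustration that assumes the important words have already been selected, so each input is already a frequency vector over the important words. It computes a per-type mean vector and covariance matrix (dividing by the number of samples); correlation coefficients could be derived from the covariances. The function name `class_statistics` and the sample data are hypothetical.

```python
def class_statistics(vectors, labels):
    """Return {label: (mean vector, covariance matrix)}.

    vectors: important-word frequency vectors, one per learning document.
    labels: the manually assigned document type of each vector.
    """
    stats = {}
    for label in set(labels):
        rows = [v for v, c in zip(vectors, labels) if c == label]
        n, dim = len(rows), len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(dim)]
        # Covariance between every pair of important words within this type.
        cov = [
            [
                sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / n
                for b in range(dim)
            ]
            for a in range(dim)
        ]
        stats[label] = (mean, cov)
    return stats

# Hypothetical learning data: two important words, two document types.
vectors = [[2, 0], [4, 0], [0, 1], [0, 3]]
labels = ["claim", "claim", "purchase", "purchase"]
stats = class_statistics(vectors, labels)
```

With only a few hundred important words the matrices stay small, which is one practical benefit of selecting $M' \ll M$ words before computing statistics.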

FIG. 9 shows the processing steps of letter classification 206. The processing consists of image input, word spotting recognition, document type recognition, sorting, and answering work data generation.

Ordinary character recognition recognizes every character in the image. In word spotting recognition, by contrast, the words to be read are specified from outside before execution. During recognition, the character classes to be recognized are restricted to those that can appear in the specified words, and character strings that plausibly match are detected as the specified words. In this embodiment, words are recognized from the image by the method of JP-A-11-85909, and their occurrence frequencies are computed. The processing consists of a step that generates virtual character segmentations, a step that searches for and recognizes the important words, and a step that computes the important-word occurrence frequencies. The important-word occurrence frequencies are expressed as a vector $w = (w_1, w_2, \ldots, w_{M'})$. Searching for a given set of words in this way greatly improves both the accuracy and the speed of recognition. Alternatively, although recognition performance and accuracy are inferior to this approach, every character in the image may be recognized as in conventional methods and the vector $w$ obtained with existing word matching techniques.
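The idea of searching for given words among ambiguous recognition results can be illustrated with a deliberately simplified sketch. It is not the method of JP-A-11-85909: it assumes the segmentation is already fixed, represents each segmented glyph as a set of candidate characters, and ignores recognition confidences. The function name `spot_words` and the sample lattice are hypothetical.

```python
def spot_words(lattice, important_words):
    """Count important words that are consistent with the candidate lattice.

    lattice: list of sets, one per segmented glyph, each holding the
    candidate characters the recognizer produced for that glyph.
    A word is detected where each of its characters appears among the
    candidates of consecutive glyphs.
    """
    freq = [0] * len(important_words)
    for idx, word in enumerate(important_words):
        pos = 0
        while pos + len(word) <= len(lattice):
            if all(ch in lattice[pos + k] for k, ch in enumerate(word)):
                freq[idx] += 1
                pos += len(word)  # skip past the detected word
            else:
                pos += 1
    return freq

# Hypothetical candidates for nine glyphs; several glyphs are ambiguous.
lattice = [
    {"p", "q"}, {"r", "n"}, {"i", "l"}, {"c", "e"}, {"e", "c"},
    {"x"},
    {"f", "t"}, {"a", "o"}, {"x", "k"},
]
w = spot_words(lattice, ["price", "fax"])
```

Because only characters of the specified words are ever tested, ambiguity at a glyph does not cause a word to be missed as long as the right character is among its candidates.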

In document type identification, a general pattern recognition method such as a quadratic discriminant function or a neural network is used to calculate, from the vector w and the word statistics, a confidence for each document type, and the document types are ranked in order of confidence. In addition, following common practice, the result is rejected when the difference between the confidences of the first-ranked and second-ranked document type candidates is smaller than a fixed value, or when the confidence of the first-ranked candidate is smaller than a fixed value.
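The ranking and rejection rule can be sketched as follows. The function name `classify_with_reject` and the threshold values `min_top` and `min_margin` are illustrative assumptions; the specification only states that fixed values are used.

```python
def classify_with_reject(confidences, min_top=0.5, min_margin=0.1):
    """Rank document types by confidence and apply the two rejection tests.

    confidences: {document type: confidence}.
    Returns (ranked list, accepted type or None when rejected).
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_conf = ranked[0]
    second_conf = ranked[1][1] if len(ranked) > 1 else float("-inf")
    rejected = top_conf < min_top or (top_conf - second_conf) < min_margin
    return ranked, (None if rejected else top_label)


ranked, accepted = classify_with_reject({"claim": 0.8, "purchase": 0.1})
```

A rejected document would be passed to a staff member together with the ranked candidates instead of being sorted automatically.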

In sorting, the scanner 110 with a sorter is controlled according to the document type identification result, and each letter is sorted into its designated document stack.

In answer data generation, data that associates the positions of the important words and the document type identification result with the image, that is, the answer work data, is generated and output.

FIG. 10 shows the data format of the word occurrence frequencies produced by word-spotting recognition. It is an array of M' records. Each record stores the occurrence frequency of one of the M' important words. The stored frequency may be an integer value or a real value that reflects the confidence of the recognition.
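As a minimal sketch of this array, the following assumes that each detection is reported as a pair of an important-word index and a recognition confidence. The function name `word_frequencies` and the `weighted` switch are hypothetical; they only illustrate the choice between integer counts and confidence-weighted real values.

```python
def word_frequencies(detections, num_words, weighted=False):
    """Build the array of M' frequency records from word detections.

    detections: iterable of (word index, recognition confidence).
    With weighted=False each detection counts as 1; with weighted=True
    the recognition confidence is accumulated instead.
    """
    w = [0.0 if weighted else 0] * num_words
    for word_id, confidence in detections:
        w[word_id] += confidence if weighted else 1
    return w


detections = [(0, 0.75), (2, 0.5), (0, 0.25)]
counts = word_frequencies(detections, 3)
weighted_counts = word_frequencies(detections, 3, weighted=True)
```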

FIG. 11 shows the data format of the answer work data. The variable kindOfMessage 1101 stores a flag that indicates whether the inquiry is represented as text, as in an e-mail, or as an image, as in a letter or a facsimile. The variable sizeOfMsg 1102 indicates the size of the document stored in the answer work data. Next, the body of the inquiry is stored in the area 1103 of sizeOfMsg bytes: text for e-mail and the like, image data for letters and facsimiles. The variable numberOfCandidate 1104 stores the number of document type candidates obtained by document type identification. Next, the area 1105 stores numberOfCandidate document type candidate records. Each record consists of a pair of an integer identifier indicating the document type and its confidence value. The variable numberOfWords 1106 stores the number of important words detected. Next, the area 1107 stores numberOfWords records of important-word detection results. Each record consists of a pair of the identifier wordID of the important word and a record location indicating the detected position. For text data, the detected position is the byte offset at which the first character of the important word appears in the text. For image data, it is the coordinates of the top, bottom, left, and right edges of the region in which the important word was recognized.
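The layout of FIG. 11 can be sketched as a serializable record. The field order follows the description above, but the field widths (32-bit unsigned integers, 64-bit floating-point confidences, 32-bit signed coordinates), the little-endian byte order, and the flag values `TEXT` and `IMAGE` are assumptions made for illustration; the specification does not state them.

```python
import struct
from dataclasses import dataclass, field

TEXT, IMAGE = 0, 1  # assumed flag values for kindOfMessage


@dataclass
class AnswerWorkData:
    kind_of_message: int                            # kindOfMessage 1101
    message: bytes                                  # area 1103 (sizeOfMsg bytes)
    candidates: list = field(default_factory=list)  # area 1105: (type id, confidence)
    words: list = field(default_factory=list)       # area 1107: (wordID, location)

    def pack(self):
        """Serialize the record in the field order of FIG. 11."""
        out = struct.pack("<II", self.kind_of_message, len(self.message))
        out += self.message
        out += struct.pack("<I", len(self.candidates))
        for type_id, confidence in self.candidates:
            out += struct.pack("<Id", type_id, confidence)
        out += struct.pack("<I", len(self.words))
        for word_id, location in self.words:
            if self.kind_of_message == TEXT:
                # location is the byte offset of the word's first character
                out += struct.pack("<II", word_id, location)
            else:
                # location is (top, bottom, left, right) in image coordinates
                out += struct.pack("<I4i", word_id, *location)
        return out


text_record = AnswerWorkData(TEXT, b"price?", [(3, 0.9)], [(7, 0)])
image_record = AnswerWorkData(IMAGE, b"\x00" * 10, [], [(1, (5, 20, 30, 80))])
```

Storing the counts before each variable-length area lets a reader of the record skip directly from one area to the next.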

An outline of word-spotting recognition is given below. FIG. 12 schematically shows an example of an inquiry letter. Inquiry letters usually have no fixed format, so the positions of the text lines and the size of the characters cannot be known in advance. In many cases it is not even known whether the text is written horizontally or vertically. Moreover, when the spacing between text lines is small, as in this example, it is sometimes difficult to decide whether a component such as the dot above "え" in the second line belongs to the line above or to the line below.

To solve these problems, the word-spotting recognition of the present invention adopts the method of JP-A-11-85909. In this method, character pattern candidates are extracted from the input image, the relationships among them are represented as a segmentation hypothesis network, and the pre-specified words are then searched for and recognized within that network. Because the information about the words is used predictively, words can be recognized with high accuracy and at high speed. As a method of extracting character pattern candidates, any number of connected components in a text line may be combined, and those combinations whose merged shape has a height and width between pre-specified lower and upper limits may be selected. For the search, ordinary breadth-first search is used, and whether to expand a node of the search tree is decided from the result of character recognition.
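The candidate extraction rule can be sketched as follows, assuming that each connected component is given as a bounding box (left, top, right, bottom). Restricting the combinations to runs of at most `max_parts` neighboring components in left-to-right order is a simplification made here; the function names and the parameters `width_range`, `height_range`, and `max_parts` are hypothetical.

```python
def merge_boxes(boxes):
    """Bounding box of several (left, top, right, bottom) boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def character_candidates(components, width_range, height_range, max_parts=3):
    """Merge runs of neighboring components and keep plausibly sized results.

    Returns (start index, end index, merged box) for every run whose merged
    width and height lie within the given (lower, upper) limits.
    """
    ordered = sorted(components)  # left-to-right order (assumption)
    candidates = []
    for start in range(len(ordered)):
        last = min(start + max_parts, len(ordered))
        for end in range(start + 1, last + 1):
            left, top, right, bottom = merge_boxes(ordered[start:end])
            width, height = right - left, bottom - top
            if (width_range[0] <= width <= width_range[1]
                    and height_range[0] <= height <= height_range[1]):
                candidates.append((start, end, (left, top, right, bottom)))
    return candidates


# Two narrow strokes that together form one character, then a full character.
components = [(0, 0, 4, 10), (5, 0, 10, 10), (12, 0, 22, 10)]
candidates = character_candidates(components, (8, 12), (8, 12))
```

Each surviving candidate would become one node of the network, so overlapping candidates coexist until the word search chooses among them.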

In JP-A-11-85909, the segmentation hypothesis network was introduced to overcome the difficulty of segmenting characters within a text line. In the example of FIG. 12, even extracting the text lines, which precedes character segmentation, is difficult. In the word-spotting recognition of the present invention, therefore, character pattern candidates are extracted from the entire image, and the segmentation hypothesis network represents whether candidates can be connected in either the vertical or the horizontal direction, that is, whether the characters can be read in sequence as part of a word. FIG. 13 shows an example of a segmentation hypothesis network obtained in this way. The ellipses in the figure represent character pattern candidates. For example, 1301 shows one character pattern candidate generated by merging two connected components. In this case, character pattern candidate 1301 corresponds to "ら", and candidate pattern 1302 corresponds to the dot at the top of "ら". Edge 1303 indicates that candidate pattern 1302 and candidate pattern 1304 can be connected, that is, that they may follow one another as characters in a word. When an edge emerges from inside another candidate, as edge 1303 does, it also indicates that candidate pattern 1301 and candidate pattern 1304 may be connected. Whether two candidate patterns can be connected is judged from the distance between them: they can be connected when the distance is at or below a predetermined threshold.
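The connection test can be sketched as follows, assuming that each candidate is a bounding box (left, top, right, bottom) and that the distance is the gap between boxes along the writing direction. The requirement that the boxes overlap in the other direction, the function names, and the threshold `max_gap` are assumptions of this sketch and are not taken from JP-A-11-85909.

```python
def gap(a, b, axis):
    """Gap from box a to box b along axis 0 (x) or 1 (y)."""
    return b[axis] - a[axis + 2]


def overlaps(a, b, axis):
    """True if boxes a and b share some extent along the given axis."""
    return a[axis] < b[axis + 2] and b[axis] < a[axis + 2]


def build_network(candidates, max_gap):
    """Return edges (i, j, direction) between candidates that may be read in sequence."""
    edges = []
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i == j:
                continue
            # Horizontal writing: b lies just right of a on the same line.
            if 0 <= gap(a, b, 0) <= max_gap and overlaps(a, b, 1):
                edges.append((i, j, "horizontal"))
            # Vertical writing: b lies just below a in the same column.
            if 0 <= gap(a, b, 1) <= max_gap and overlaps(a, b, 0):
                edges.append((i, j, "vertical"))
    return edges


boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 13, 10, 23), (40, 0, 50, 10)]
edges = build_network(boxes, max_gap=5)
```

Keeping edges for both directions lets the later word search handle horizontal and vertical writing without deciding the orientation in advance.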

With the segmentation hypothesis network obtained in this way as input, the important words are searched for by character recognition, and the places where they appear can be detected. For example, when "ちらし" and "价格" (price) are important words, their positions can be detected as shown at 1401 and 1402 in FIG. 14. By performing character recognition with a word dictionary that contains only the minimum set of words needed for document classification, letters can be classified automatically with high accuracy and at high speed.
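The breadth-first search for an important word over the network can be sketched as follows. The sketch assumes that character recognition has already produced, for each candidate, a map from characters to scores, and that a path is extended only when the next candidate's score for the next character of the word reaches a threshold. The function name `find_word`, the data layout, and the value of `min_score` are hypothetical.

```python
from collections import deque


def find_word(word, labels, edges, min_score=0.5):
    """Breadth-first search for paths through the network that spell `word`.

    labels[i] maps characters to recognition scores for candidate i;
    edges maps candidate i to the candidates that may follow it.
    Returns the list of candidate-index paths that spell the word.
    """
    queue = deque([i] for i, scores in enumerate(labels)
                  if scores.get(word[0], 0.0) >= min_score)
    hits = []
    while queue:
        path = queue.popleft()
        if len(path) == len(word):
            hits.append(path)
            continue
        next_char = word[len(path)]
        # Expand only along edges whose target supports the next character.
        for j in edges.get(path[-1], []):
            if labels[j].get(next_char, 0.0) >= min_score:
                queue.append(path + [j])
    return hits


labels = [{"c": 0.9}, {"a": 0.8, "o": 0.6}, {"t": 0.9}, {"a": 0.3}]
edges = {0: [1, 3], 1: [2]}
hits = find_word("cat", labels, edges)
```

Each returned path identifies the candidates that make up one occurrence of the word, from which its position in the image can be derived.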

By sharing the important-word dictionary used for e-mail classification, the word dictionary of the character recognition device can be generated easily. In addition, the word dictionary can be updated automatically from the cases obtained through e-mail during operation.

Because the output of the character recognition processing is the occurrence frequencies of words, it fits well with document type identification processing based on word occurrence frequencies. As a result, many existing document type identification devices can be reused, and document type identification based on character recognition can easily coexist in one system with document type identification based on text.

Even when a document cannot be classified, the work of the answering staff can be supported by indicating the important words on the image. Even when the classification is wrong, an environment can be provided in which the answering work can be continued efficiently.

Claims

1. A document classification system, characterized by comprising:

an input device for inputting image data of a document,

a storage device that stores information on important words used in identifying the type of the document and on their occurrence frequencies, and

a processing device that processes the image data,

wherein the processing device recognizes the important words in the image data input from the input device by a word-spotting technique, counts their occurrences, identifies the type of the document based on the information stored in the storage device and the counts, and outputs the document type identification result in association with the image data.

2. The document classification system according to claim 1, characterized in that

the system further has a display device, and

the display device displays the document type identification result in association with the image data.

3. The document classification system according to claim 2, characterized in that

the system records, in a recording device, example answer texts for the document in association with the type, and

the display device displays the image data and the example answer text corresponding to the document type identification result.

4. The document classification system according to claim 2 or 3, characterized in that

the processing device also outputs position information, within the image data, of the counted important words, and

the display device highlights the important words in the image data according to the position information.

5. The document classification system according to any one of claims 1 to 3, characterized in that

the system has a sorter, and

the sorter sorts and discharges the identified documents by type.

6. The document classification system according to any one of claims 1 to 3, characterized in that

the document classification system is connected to a communication network, and

the processing device also performs the type identification on e-mail received via the communication network.

7. The document classification system according to any one of claims 1 to 3, characterized in that

the processing device updates the information stored in the storage device using the identification result and the occurrence counts of the important words.

8. A document classification system, characterized by

having a document type identification device and a plurality of document processing devices connected via a network,

the document type identification device comprising:

a device that obtains image data or text data of a document,

a recording device that records information on the important words used in identifying the type in association with the type of the document, and

a processing device that processes the image data or text data,

wherein the processing device identifies the document based on the information, outputs the identification result together with a confidence for each type, and decides, according to the confidences, the document processing device to which the output is sent.

9. The document classification system according to claim 8, characterized in that

the processing device, on accepting an input indicating that the identification is wrong,

transfers the document to another of the document processing devices according to the confidences.

10. A program for causing a computer that is connected to an image data input device and has a data storage device and a control device to execute a document identification method,

characterized in that

the document identification method comprises:

a step of obtaining document data through the image data input device,

a step of recognizing, from the image data by a word-spotting technique, the important words stored in advance in the storage device, and counting the occurrences of each important word,

a step of identifying the document type according to the counts, and

a step of outputting the document data in association with the document type identification result.