CN1452098A - File classing system and program for carrying out same - Google Patents
- Publication number
- CN1452098A
- Authority
- CN
- China
- Prior art keywords
- mentioned
- document
- view data
- recognition
- data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Discrimination (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention provides a document classification system and a program for implementing it, enabling letters to be classified efficiently according to their content. A character recognition device measures the frequency of occurrence of the words stored in an important-word dictionary, and a document type recognition unit estimates the document type from those frequencies.
Description
Technical field
The present invention relates to a document classification technique in which a computer automatically classifies inquiry letters, e-mails, and the like at the customer service desks of companies, public agencies, and similar organizations, and to a support system for answering such inquiries.
Background art
In manufacturing, insurance, telecommunications, public agencies, and similar fields, the work of receiving inquiries directly from customers in document form (e-mail, letters, fax, and so on) has become increasingly important in recent years. In many cases it is difficult for a single person to answer a wide variety of inquiries efficiently. The number of inquiries is usually more than one person can handle, and their content often spans many areas. In manufacturing, for example, documents inquiring about product claims, purchasing methods, operating procedures, and so on must all be handled. Handling all of them requires broad knowledge, and staff with such broad knowledge are usually hard to secure.
A system is therefore needed that recognizes the type of each inquiry document, classifies it, assigns it according to its content to staff who are experts in that area, and has those staff answer it.
The present invention relates to techniques for classifying such inquiry documents by computer and to a computer-based support system for answering inquiries.
Techniques for classifying e-mail inquiries by computer are already known. A typical method applies multivariate pattern recognition, using as features the frequencies of occurrence of a specific group of words (important words) in the document. The body text and file name of an e-mail are text data, so word frequencies can be obtained by simple word matching or morphological analysis. Techniques that, once the type of an inquiry e-mail has been recognized, automatically reply by e-mail or the like according to that type are also known.
As a method of recognizing the type of an inquiry letter, there is also a method that converts the content of the letter into text with a character recognition device and then applies the same technique as above.
However, conventional letter type recognition based on character recognition has a problem with recognition accuracy. In general, to make character recognition highly accurate, the words that may appear must be stored in a dictionary in advance. To recognize handwritten characters in particular, the number of words must be narrowed down to a few hundred. With the character recognition devices used in conventional letter type recognition, however, it is difficult to narrow down the range of possible words in advance, and it is therefore difficult to recognize the type of a letter with sufficiently high accuracy.
Furthermore, in ordinary character recognition, ambiguity remains in the recognition result after character segmentation. For example, multiple candidate characters are typically obtained for each partial image produced by segmentation, and the segmentation itself may also be ambiguous. Estimating the frequency of occurrence of specific words from such a recognition result is not a straightforward process, and the methods used to compute word frequencies from text data cannot be applied as they are. Conversely, if this ambiguity is not tolerated and the recognition result is simply treated as text, many of the words in the document will be missed.
In addition, recognition errors are unavoidable in character recognition. Because of these errors, the document type often cannot be recognized or is recognized incorrectly. In conventional systems, work efficiency drops sharply when such rejection or misrecognition occurs.
Summary of the invention
The first problem to be solved by the invention is to recognize the type of a letter by character recognition, at high speed and with high accuracy, in such an inquiry answering support system.
The second problem to be solved by the invention is to improve the maintainability of the dictionary used for character recognition.
The third problem to be solved by the invention is to improve the interface between character recognition processing, whose results are ambiguous, and document classification processing, which takes text data as input, and thereby to improve their compatibility within the system.
The fourth problem to be solved by the invention is to provide an environment in which answering work can continue efficiently even when the document type cannot be recognized or is recognized incorrectly.
The set of words that are important for document type recognition is used as the word dictionary for character recognition. Instead of reading all the characters as in the past, word spotting is used to measure only the frequencies of occurrence of the important words. Unlike conventional systems, the output of the character recognition process is a vector of word frequencies. The frequencies obtained are fed to an existing document type recognizer to recognize the document type.
The device that performs character recognition sends the device that performs the answering work not only the top-ranked document type but also the candidates ranked second and lower, together with the word recognition results. The answering work device supports the answering work by highlighting the important words on the image of the inquiry letter. In addition, when the document type is wrong, the candidates ranked second and lower can be used to forward the letter image to an appropriate respondent.
Brief description of the drawings
Fig. 1 shows the hardware configuration of the embodiment.
Fig. 2 is a data flow diagram showing the processing flow of the embodiment.
Fig. 3 is a data flow diagram showing the flow of the learning processing.
Fig. 4 is a data flow diagram showing the flow of the e-mail classification processing.
Fig. 5 is a data flow diagram showing the flow of the letter classification processing.
Fig. 6 is a data flow diagram showing the flow of the answering processing.
Fig. 7 shows the display screen used for answering work.
Fig. 8 is an operation sequence diagram showing the learning procedure.
Fig. 9 is an operation sequence diagram showing the letter processing procedure.
Fig. 10 shows the data format of the word frequencies.
Fig. 11 shows the data format of the answering work data.
Fig. 12 shows an example of an input image.
Fig. 13 shows a virtual segmentation network.
Fig. 14 shows the detected important words.
Detailed description of the embodiments
Fig. 1 shows the configuration of the inquiry answering system of an embodiment of the invention, that is, a support system that automatically classifies inquiries from customers and supports the work of answering them. The inputs to the system are inquiries by e-mail, letter, telephone, and so on; the outputs are the answers to those inquiries, likewise by e-mail, letter, telephone, and so on. To communicate with the outside, the system is connected to the outside through an interface and a telephone line. The computers of the system exchange information over a LAN.
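Before the system can accept inquiries, learning must be performed; that is, the information needed to recognize the type of an inquiry document must be computed and the dictionaries generated. 101 is the learning computer, which manages learning. The learning computer 101 refers to learning data collected in advance in the learning data file system 102, computes the information needed for document type recognition, and stores it as a classification dictionary in the dictionary file system 103. The learning data is a set of pairs, each consisting of text data giving the content of an inquiry and the identifier of its inquiry type. The text data is taken from past inquiries, and the corresponding document types are assigned manually. The generated classification dictionary is copied over the LAN as needed to the dictionary file system 107 of the e-mail classification computer 106, the dictionary file system 109 of the letter classification computer 108, and the dictionary file system 115 of the voice inquiry classification computer 114.
Inquiries by e-mail arrive from the Internet 104 outside the system via the gateway 105 and are received by the e-mail classification computer 106. The e-mail classification computer 106 recognizes the type of the e-mail from the content of the inquiry and forwards the document type recognition result and the positions of the important words (described later), associated with the e-mail, to the automatic answering computer 116.
Inquiry letters are photoelectrically converted by the scanner with sorter 110 connected to the letter classification computer 108 and input as images. Using the word spotting technique described later, the letter classification computer 108 recognizes characters in the image and recognizes the type of the letter from the content of the inquiry. It forwards the recognition result, associated with the image and the positions of the important words (described later), to the automatic answering computer 116. Inquiries sent by fax are likewise delivered from the telephone line 111 via the line control device 113 to the letter classification computer 108 and processed in the same way. After this processing, the sorter of the letter scanner sorts the letters for storage according to the document type recognition result.
Inquiries by telephone are delivered from the telephone line 111 via the line control device 113 to the voice inquiry classification computer 114. The voice inquiry classification computer 114 recognizes the speech, converts it to text, classifies it by inquiry content, and transfers the call to the answering telephone 112, where an expert staff member for that content answers.
When the type of a forwarded document is one that can be answered automatically, the automatic answering computer 116 retrieves a suitable sample answer from the sample answer file system 117 and answers by e-mail, or prints the answer with the automatic envelope-sealing printer 118 and answers by letter. When automatic answering is not possible, it forwards the document, according to the document type associated with it, to the answering work device (121, 125, 126) at which the appropriate expert is waiting.
As shown at 121, an answering work device comprises a computer 122, an input device 123 consisting of a keyboard, mouse, and so on, and an image display device 124. Using these devices, each staff member refers to the inquiry document, composes an answer document, and forwards it to the automatic answering computer 116. The automatic answering computer 116 then sends the answer e-mail or prints the answer letter in the same way as above.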
The processing flow of the system is described below with reference to the data flow diagram of Fig. 2. In the figure, following the Gane-Sarson notation (J. Martin, "Software Structuring Techniques", Kindai Kagaku Sha, ISBN4-7649-0124-2 C3050 P5562E), solid arrows denote flows of information and hollow arrows denote flows of physical objects. Rounded rectangles denote processes, and rectangles open on the right denote stored information.
First, learning 202 is executed using the learning data 201; that is, the information needed to recognize the type of an inquiry document is computed, and the word statistics dictionary 203 and the important-word dictionary 204 are generated. The important-word dictionary 204 stores the set of words that are important features for recognizing the document type. The word statistics dictionary 203 stores the statistics needed to recognize the document type from the frequencies of the important words. This processing is carried out by the learning computer 101.
When an inquiry e-mail is received, e-mail classification 205 recognizes the inquiry document type by referring to the word statistics dictionary 203 and the important-word dictionary 204. The resulting document type is associated with the e-mail itself and with the position information of the important words obtained during processing, and is output as answering work data. This processing is carried out by the e-mail classification computer 106.
When an inquiry arrives by letter, letter classification 206 recognizes the inquiry document type by referring to the word statistics dictionary 203 and the important-word dictionary 204. The resulting document type is associated with the locations of the important words obtained while processing the document image, and is output as answering work data. In addition, for storage and similar purposes, the letter itself is sorted according to the document type obtained. This processing is carried out by the letter classification computer 108 and the scanner with sorter 110.
Using the answering work data, answering 208 generates the answer text and prints an answer letter or sends an answer e-mail. At the same time, to update the word statistics dictionary 203, the important-word frequencies and the document type confirmed at answering time are output to dictionary update 210. When the input is an e-mail, the document type and the text of the e-mail are added to the learning data. Relearning with the learning data added in this way adapts the word statistics dictionary 203 and the important-word dictionary 204 to actual operating conditions. This processing is carried out by the automatic answering computer 116, the automatic envelope-sealing printer 118, and the answering work devices 121, 125, and 126.
Dictionary update 210 updates the classification dictionary according to the important-word frequencies and document types of the inquiry documents obtained during operation. The statistics used for recognition are updated by standard pattern recognition methods. This processing is carried out by the learning computer 101.
When an inquiry is made by telephone, voice inquiry classification 207 recognizes the type of the inquiry, and an expert staff member for that content answers in the answering call 208. This processing is carried out by the voice inquiry classification computer 114 and the answering telephone 112.
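Next, the data flow diagram of Fig. 3 shows the processing flow of learning 202. First, important-word extraction 301 extracts from the learning data 201 the words that are important for recognizing the document type and stores them in the important-word dictionary 204. This is done using morphological analysis of natural language and feature selection techniques from pattern recognition. Next, word statistics calculation 302 computes the statistics needed to recognize the document type and stores them in the word statistics dictionary 203. The recognition method is a standard one from pattern recognition, such as a quadratic discriminant function or a neural network. With a quadratic discriminant function, the word statistics are the frequency of occurrence of each word and the covariances; with a neural network, they are the connection weights of the network.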
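As a rough illustration of how a quadratic discriminant function might use the stored word statistics, the sketch below scores a frequency vector against each document type. It assumes, purely for brevity, a diagonal covariance (one variance per important word), which is a simplification of the covariance statistics mentioned above; the names `quadratic_score` and `classify` and the sample statistics are hypothetical and not taken from the patent.

```python
import math

def quadratic_score(w, mean, variance):
    """Quadratic discriminant score of frequency vector w for one document type.

    Assumes a diagonal covariance (one variance per important word).
    A higher score means a better match.
    """
    return -sum((x - m) ** 2 / v + math.log(v)
                for x, m, v in zip(w, mean, variance))

def classify(w, statistics):
    """Return the document types ordered from highest to lowest score.

    statistics maps a document type to its (mean vector, variance vector).
    """
    scores = {kind: quadratic_score(w, mean, var)
              for kind, (mean, var) in statistics.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical statistics for two document types over two important words.
statistics = {
    "claim": ([2.0, 0.0], [1.0, 1.0]),
    "purchase": ([0.0, 2.0], [1.0, 1.0]),
}
ranking = classify([2, 0], statistics)
```

With a full covariance matrix the squared term would become a Mahalanobis distance; the diagonal form is used here only to keep the sketch dependency-free.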
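Next, the data flow diagram of Fig. 4 shows the processing flow of e-mail classification 205. First, word extraction 401 refers to the important-word dictionary 204 and computes the frequency of each important word in the e-mail. General language processing techniques such as morphological analysis are used to detect the important words. Next, document type recognition 402 recognizes the document type using the word statistics in the word statistics dictionary 203, by a standard pattern recognition method such as a quadratic discriminant function or a neural network. Finally, answering work data generation 403 generates and outputs the answering work data, that is, data associating the positions of the important words and the document type recognition result with the e-mail. Usually several document type candidates with different confidence values are produced; all of them are stored in the answering work data.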
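The word extraction step for e-mail can be sketched as follows. This is a minimal illustration that uses plain substring matching in place of morphological analysis and records character offsets rather than byte offsets; the function name `count_important_words` and the sample dictionary are assumptions made for the example.

```python
def count_important_words(text, important_words):
    """Return (frequency vector, positions) for the important words in text.

    positions is a list of (word_id, character offset of the first character).
    Substring matching stands in for morphological analysis here.
    """
    freq = [0] * len(important_words)
    positions = []
    for word_id, word in enumerate(important_words):
        start = text.find(word)
        while start != -1:
            freq[word_id] += 1
            positions.append((word_id, start))
            start = text.find(word, start + 1)
    return freq, positions

# Hypothetical important-word dictionary and inquiry text.
important_words = ["price", "refund", "manual"]
freq, positions = count_important_words(
    "What is the price? Is the price fixed?", important_words)
```

The frequency vector would feed document type recognition, while the positions would be kept for highlighting in the answering work data.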
Next, the data flow diagram of Fig. 5 shows the processing flow of letter classification 206. First, the region of the letter in which the text is written is input as an image. Next, word spotting recognition 502 refers to the important-word dictionary 204, recognizes the important words in the image, and outputs the frequency of each important word. Next, document type recognition 503 refers to the word statistics stored in the word statistics dictionary 203 and recognizes the document type from the important-word frequencies. Next, answering work data generation 505 outputs the image, the important-word frequencies, and the document type recognition result in association with one another. The letter itself is sorted according to the document type recognition result.
Next, the processing flow of answering 208 is described with reference to the data flow diagram of Fig. 6. First, destination determination 601 forwards the answering work data to answering work 1 to 3 (602, 603, 604) or to automatic answering 605, according to the document type attached to the data. This processing is carried out by the automatic answering computer 116. In answering work 1 to 3 (602, 603, 604), the expert staff for each document type examine the inquiry and compose an answer. This is carried out by the staff using answering work devices 1 to 3 (121, 125, 126). In automatic answering 605, a sample answer matching the document type is retrieved from the sample answer collection 606 and output. This processing is carried out by the automatic answering computer 116. Answer e-mail sending 607 sends by e-mail the answer text obtained from answering work 1 to 3 (602, 603, 604) or automatic answering 605. This processing is carried out by the automatic answering computer 116. Answer printing 608 prints the answer text on paper to produce an answer letter. This processing is carried out by the automatic envelope-sealing printer 118.
In practice, document type recognition by computer is not always correct, and the recognition process sometimes fails to recognize a document or rejects it. The system therefore handles errors and rejections in document type recognition as follows. In answering work 1, 2, or 3 (602, 603, 604), when the assigned inquiry document is outside the staff member's own area of responsibility, the staff member performs a forwarding operation on the operation screen described later. As noted above, document type recognition usually yields several candidates with different confidence values. Taking advantage of this, when a forwarding operation is performed, the answering work data is automatically forwarded to the destination corresponding to a document type candidate ranked second or lower. The staff member can also specify the destination directly. When recognition is rejected, the important words detected in the e-mail text or in the image are highlighted to support the answering work.
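The automatic forwarding rule can be sketched roughly as below. The mapping from document types to answering work devices, the type names, and the function name `forwarding_destination` are all hypothetical; the text only states that the data goes to the destination of a candidate ranked second or lower.

```python
def forwarding_destination(ranked_kinds, routing, current_rank=0):
    """Pick the destination for the next lower-ranked document type candidate.

    ranked_kinds: document types ordered by decreasing confidence.
    routing: maps a document type to an answering work device id (assumed).
    Returns None when no lower-ranked candidate has a destination.
    """
    for kind in ranked_kinds[current_rank + 1:]:
        if kind in routing:
            return routing[kind]
    return None

# Hypothetical routing table using the device numbers from Fig. 1.
routing = {"claim": 121, "purchase": 125, "usage": 126}
ranked = ["claim", "usage", "purchase"]
first_forward = forwarding_destination(ranked, routing)      # from rank 0
second_forward = forwarding_destination(ranked, routing, 1)  # from rank 1
```

Passing the current rank lets a document that has already been forwarded once move on to the next candidate instead of bouncing back.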
Fig. 7 shows an example of the screen displayed on the image display device of answering work devices 1 to 3 (121, 125, 126). The inquiry document is displayed in the inquiry document window 708 of screen 701: the text for an e-mail inquiry, and the image of the letter for an inquiry in letter form. Using the important-word positions in the answering work data, the important words are highlighted in the same window, which makes it easier to compose an answer. The staff member edits the answer text in the answer editing window 709, where an ordinary word processor can be used. When the document displayed in the inquiry document window 708 is not one the staff member is responsible for, the staff member can click the automatic forwarding button 703 with the mouse provided as an input device. In response, the answering work data is forwarded to the answering work device or automatic answering computer corresponding to a candidate ranked second or lower in the recognition result. Window 702 displays the document type candidates, including those ranked second and lower. To specify the forwarding destination directly, the operator selects a document type with the radio buttons in window 702 and then clicks the forward button 704. To search for past sample answers matching the document type, the operator clicks the sample search button 705; the sample answer is then transferred over the LAN from the sample answer file system 117 and displayed in the answer editing window 709. Clicking the send button 706 returns the edited answer to the person who sent the inquiry e-mail. Clicking the print button 707 prints the edited answer on the automatic envelope-sealing printer 118.
Fig. 8 shows the procedure of learning 202. In important-word extraction, words are first extracted by morphological analysis from each item of learning text data $T_i$ ($1 \le i \le N$, where $N$ is the number of data items), and their frequencies are stored as a vector $u_i = (u_{i1}, u_{i2}, \dots, u_{iM})$, where $u_{ij}$ is the number of times word $j$ occurs in $T_i$ and $M$ is the total number of words. From the set of pairs $\{(u_i, c_i)\}$ of the vector $u_i$ and the manually assigned type $c_i$ of each $T_i$, the $M'$ words most important for classification ($M' \ll M$) are selected by a known method such as feature selection with the branch-and-bound algorithm. The important words may also be selected manually where necessary. Next, in word statistics calculation, the frequencies of the important words are computed and stored as a vector $v_i = (v_{i1}, v_{i2}, \dots, v_{iM'})$, and the statistics needed for document type recognition are computed. For example, when the recognition method is a quadratic discriminant function, statistics such as the means and correlation coefficients of the variables $v_{i1}, v_{i2}, \dots, v_{iM'}$ are computed.
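The learning steps can be sketched in a few lines. Note the assumptions: documents are taken to be already tokenized (standing in for morphological analysis), and a simple "spread of per-type mean frequency" score replaces branch-and-bound feature selection, so this is an illustrative stand-in rather than the method named in the text. All function names and sample data are hypothetical.

```python
from collections import defaultdict

def frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized document."""
    return [tokens.count(word) for word in vocabulary]

def select_important_words(samples, vocabulary, m_prime):
    """Pick the m_prime words whose mean frequency differs most across types.

    samples is a list of (tokens, document type) pairs. This scoring rule is a
    simple stand-in for branch-and-bound feature selection.
    """
    by_kind = defaultdict(list)
    for tokens, kind in samples:
        by_kind[kind].append(frequency_vector(tokens, vocabulary))
    scores = []
    for j, word in enumerate(vocabulary):
        means = [sum(u[j] for u in vecs) / len(vecs) for vecs in by_kind.values()]
        scores.append((max(means) - min(means), word))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by word
    return [word for _, word in scores[:m_prime]]

def class_means(samples, important_words):
    """Mean important-word frequency vector for each document type."""
    by_kind = defaultdict(list)
    for tokens, kind in samples:
        by_kind[kind].append(frequency_vector(tokens, important_words))
    return {kind: [sum(col) / len(vecs) for col in zip(*vecs)]
            for kind, vecs in by_kind.items()}

# Hypothetical learning data: (tokens, manually assigned document type).
samples = [
    (["price", "of", "the", "printer"], "purchase"),
    (["price", "list", "please"], "purchase"),
    (["printer", "is", "broken"], "claim"),
    (["broken", "on", "arrival"], "claim"),
]
vocabulary = sorted({tok for tokens, _ in samples for tok in tokens})
important = select_important_words(samples, vocabulary, m_prime=2)
means = class_means(samples, important)
```

In this toy data "printer" appears equally often in both types and is therefore scored as uninformative, which is the behavior one would want from any feature selection step.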
Fig. 9 shows the procedure of letter classification 206. It consists of the following steps: image input, word spotting recognition, document type recognition, sorting, and generation of the answering work data.
Ordinary character recognition recognizes all the characters in an image. In word spotting recognition, by contrast, the words to be read are specified externally before execution. During recognition, the character classes to be recognized are limited to those that can occur in the specified words, and character strings that plausibly match are detected as the specified words. In this embodiment, words are recognized in the image and their frequencies computed by the method of Japanese Unexamined Patent Publication No. H11-85909. The processing consists of a virtual character segmentation generation step, a step that searches for and recognizes the important words, and a step that computes the important-word frequencies. The important-word frequencies are expressed as a vector $w = (w_1, w_2, \dots, w_{M'})$. Searching in this way for a given set of words greatly improves both the accuracy and the speed of recognition. Alternatively, although with lower recognition performance and accuracy than this method, all the characters in the image may be recognized as in conventional methods and the vector $w$ obtained by conventional word matching.
In document type recognition, a general pattern recognition method such as a quadratic discriminant function or a neural network is used to compute a confidence value for each document type from the vector $w$ and the word statistics, and the document types are ranked in order of confidence. Following common practice, recognition is rejected when the difference in confidence between the first- and second-ranked candidates is below a given value, or when the confidence of the first-ranked candidate is below a given value.
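The ranking and rejection rule can be sketched as follows. The thresholds, the confidence values, and the function name `rank_document_types` are hypothetical; how the confidence values themselves are computed is left to the recognizer.

```python
def rank_document_types(confidences, min_confidence, min_margin):
    """Rank document types by confidence and decide whether to reject.

    confidences maps a document type to its confidence value.
    Recognition is rejected when the top confidence is below min_confidence,
    or when the gap between the first and second candidates is below min_margin.
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    rejected = top < min_confidence or (top - runner_up) < min_margin
    return ranked, rejected

# Hypothetical confidence values and thresholds.
ranked, rejected = rank_document_types(
    {"purchase": 0.2, "claim": 0.7, "usage": 0.1},
    min_confidence=0.5, min_margin=0.2)
```

Even when `rejected` is true, the ranked list is still returned, since the lower-ranked candidates are what the answering work devices use for forwarding.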
In sorting, the scanner with sorter 110 is controlled according to the document type recognition result, and the letter is sorted into the designated stack.
In answering work data generation, the answering work data, that is, data associating the positions of the important words and the document type recognition result with the image, is generated and output.
Fig. 10 shows the data format of the word frequencies produced by word spotting recognition. It is an array of $M'$ records, in which the $j$-th record stores the frequency of the $j$-th important word. The stored frequency may be an integer or a real value reflecting the confidence of recognition.
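A small sketch of building this array from word spotting detections is given below. It assumes that a "real value reflecting the confidence of recognition" can be approximated by summing per-detection confidences, which is one plausible reading and not something the text specifies; the function name `spotting_frequency_vector` is hypothetical.

```python
def spotting_frequency_vector(detections, num_words, weighted=False):
    """Build the M'-element frequency array from word spotting detections.

    detections is a list of (word_id, confidence) pairs.
    With weighted=False each detection counts as 1 (integer frequencies);
    with weighted=True confidences are summed (an assumed interpretation).
    """
    w = [0.0 if weighted else 0] * num_words
    for word_id, confidence in detections:
        w[word_id] += confidence if weighted else 1
    return w

# Hypothetical detections over a dictionary of three important words.
detections = [(0, 0.5), (2, 0.25), (0, 0.25)]
counts = spotting_frequency_vector(detections, num_words=3)
soft_counts = spotting_frequency_vector(detections, num_words=3, weighted=True)
```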
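Fig. 11 shows the data format of the answering work data. The variable kindOfMessage 1101 holds a flag indicating whether the inquiry is represented as text, as with an inquiry e-mail, or as an image, as with a letter or fax. The variable sizeOfMsg 1102 gives the size of the document stored in the answering work data. The following area 1103 of sizeOfMsg bytes holds the inquiry itself: text for e-mail and the like, image data for letters and faxes. The variable numberOfCandidate 1104 holds the number of document type candidates produced by document type recognition. The following area 1105 holds numberOfCandidate candidate records, each consisting of an integer identifier of the document type paired with its confidence value. The variable numberOfWords 1106 holds the number of important words detected. The following area 1107 holds numberOfWords detection records, each consisting of the important-word identifier wordID paired with a record location giving where the word was detected. For text data, the location is the byte offset at which the first character of the important word occurs in the text. For image data, it is the coordinates of the top, bottom, left, and right edges of the region in which the important word was recognized.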
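One way to model this record in memory is sketched below. The field names follow kindOfMessage, sizeOfMsg, numberOfCandidate, numberOfWords, wordID, and location from the description, but the Python class layout, the string encoding of the flag, and the sample values are assumptions; the patent does not specify a binary encoding.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Candidate:
    kind_id: int          # integer identifier of the document type
    confidence: float     # confidence value of this candidate

@dataclass
class WordHit:
    word_id: int          # wordID of the detected important word
    # Byte offset for text, or (top, bottom, left, right) for an image.
    location: Union[int, Tuple[int, int, int, int]]

@dataclass
class AnswerWorkData:
    kind_of_message: str  # "text" or "image" (flag encoding is assumed)
    message: bytes        # the inquiry itself: text or image data
    candidates: List[Candidate] = field(default_factory=list)
    word_hits: List[WordHit] = field(default_factory=list)

    @property
    def size_of_msg(self) -> int:
        return len(self.message)

    @property
    def number_of_candidate(self) -> int:
        return len(self.candidates)

    @property
    def number_of_words(self) -> int:
        return len(self.word_hits)

# Hypothetical e-mail inquiry with two type candidates and one detected word.
data = AnswerWorkData(
    kind_of_message="text",
    message="What is the price?".encode("utf-8"),
    candidates=[Candidate(2, 0.75), Candidate(0, 0.25)],
    word_hits=[WordHit(word_id=5, location=12)],
)
```

Deriving the three counts from the stored lists, rather than storing them separately, keeps them from drifting out of sync with the data they describe.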
An outline of word spotting recognition follows. Fig. 12 schematically shows an example of an inquiry letter. Inquiry letters usually have no fixed format, so the positions of the text lines and the size of the characters cannot be known in advance. In many cases it is not even known whether the text is written horizontally or vertically. Moreover, when the text lines are closely spaced as in this example, it can be hard to decide whether a component such as the dot above "え" in the second line belongs to the line above or to the line below.
To solve these problems, the word spotting recognition of the invention adopts the method of Japanese Unexamined Patent Publication No. H11-85909. In this method, character pattern candidates are extracted from the input image and the relationships among them are represented as a virtual segmentation network; the pre-specified words are then searched for and recognized within that network. Using knowledge of the words predictively in this way allows words to be recognized with high accuracy and at high speed. One way to extract character pattern candidates is to form combinations of any number of connected components in a text line and to select those combinations whose merged shape has a height and width between pre-specified lower and upper limits. The search uses ordinary breadth-first search, and whether to expand a node of the search tree is decided from the result of character recognition.
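The candidate extraction rule described above can be sketched as follows. Connected components are represented by bounding boxes `(left, top, right, bottom)`; the cap on the number of components per candidate (`max_parts`), the single size range applied to both width and height, and the sample coordinates are assumptions made to keep the example small.

```python
from itertools import combinations

def merged_box(boxes):
    """Bounding box (left, top, right, bottom) enclosing all given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))

def pattern_candidates(components, min_size, max_size, max_parts=2):
    """Combine connected components into character pattern candidates.

    A combination is kept when the width and height of its merged box both
    lie between min_size and max_size. max_parts caps the combination size.
    Returns a list of (component indices, merged box).
    """
    candidates = []
    for k in range(1, max_parts + 1):
        for combo in combinations(range(len(components)), k):
            left, top, right, bottom = merged_box([components[i] for i in combo])
            width, height = right - left, bottom - top
            if min_size <= width <= max_size and min_size <= height <= max_size:
                candidates.append((combo, (left, top, right, bottom)))
    return candidates

# Hypothetical components: a flat stroke above a body, and a separate character.
components = [(0, 0, 10, 4), (0, 6, 10, 12), (40, 0, 50, 10)]
candidates = pattern_candidates(components, min_size=8, max_size=14)
```

Here the two small components are rejected individually but accepted once merged, much like a character whose upper stroke is a separate connected component.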
Japanese Unexamined Patent Publication No. H11-85909 introduces the virtual segmentation network to overcome the difficulty of segmenting characters within a text line. In the example of Fig. 12, even extracting the text lines, which precedes character segmentation, is difficult. In the word spotting recognition of the invention, therefore, character pattern candidates are extracted from the whole image, and the virtual segmentation network represents whether candidates can be connected in either the vertical or the horizontal direction, that is, whether they can be read consecutively as characters of a word. Fig. 13 shows an example of a network obtained in this way. The ellipses in the figure denote character pattern candidates. For example, 1301 shows two connected components merged into a single character pattern candidate; here candidate 1301 corresponds to "ら". Candidate 1302 corresponds to the dot at the top of "ら". Edge 1303 indicates that candidates 1302 and 1304 can be connected, that is, that they may be consecutive characters in a word. An edge such as 1303 that leaves from inside another candidate also indicates that candidates 1301 and 1304 may be connected. Whether two candidates can be connected is decided by the distance between them: they can be connected when the distance is at or below a predetermined threshold.
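The connection rule at the end of this paragraph can be sketched as a small graph-building routine. The distance measure used here (horizontal gap plus vertical gap between bounding boxes) is an assumption, since the text only says that the distance is compared with a threshold; the function names and coordinates are hypothetical.

```python
from itertools import combinations

def box_gap(a, b):
    """Gap between two boxes (left, top, right, bottom).

    Sum of the horizontal and vertical gaps; 0 when the boxes overlap.
    This particular distance measure is an assumption.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return dx + dy

def build_network(boxes, threshold):
    """Connect candidates whose gap is at or below the threshold.

    Returns an adjacency mapping: candidate index -> set of neighbor indices.
    """
    edges = {i: set() for i in range(len(boxes))}
    for i, j in combinations(range(len(boxes)), 2):
        if box_gap(boxes[i], boxes[j]) <= threshold:
            edges[i].add(j)
            edges[j].add(i)
    return edges

# Hypothetical candidates: a right neighbor, a lower neighbor, a distant box.
boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 13, 10, 23), (60, 0, 70, 10)]
network = build_network(boxes, threshold=4)
```

Because edges are added in both the horizontal and the vertical direction, the same network serves whether the letter turns out to be written horizontally or vertically.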
Taking the resulting virtual segmentation network as input and searching for the important words by character recognition locates where the important words occur. For example, when "ちらし" and "价格" (price) are important words, their positions are detected as shown at 1401 and 1402 in Fig. 14. Performing character recognition with a word dictionary made up of the minimum set of words needed for document classification makes it possible to classify letters automatically with high accuracy and at high speed.
Sharing the important-word dictionary with e-mail classification makes it easy to generate the word dictionary for the character recognition device. The word dictionary can also be updated automatically from examples obtained by e-mail during operation.
Because the output of character recognition is a set of word frequencies, it is more compatible within the system with document type recognition based on word frequencies. As a result, most existing document type recognizers can be reused, and document type recognition based on character recognition can readily coexist in the system with document type recognition based on text.
Even when a document cannot be classified, the answering staff's work can be supported by pointing out the important words on the image. And even when a document is misclassified, an environment is provided in which answering work can continue efficiently.
Claims (10)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP116976/2002 | 2002-04-19 | ||
| JP2002116976A JP2003317034A (en) | 2002-04-19 | 2002-04-19 | Document classification system and program for realizing the system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN1452098A true CN1452098A (en) | 2003-10-29 |
Family
ID=29243476
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN02141403.3A Pending CN1452098A (en) | 2002-04-19 | 2002-08-28 | File classing system and program for carrying out same |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP2003317034A (en) |
| CN (1) | CN1452098A (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2012205181A (en) | 2011-03-28 | 2012-10-22 | Fuji Xerox Co Ltd | Image processing device and program |
2002
- 2002-04-19 JP JP2002116976A patent/JP2003317034A/en active Pending
- 2002-08-28 CN CN02141403.3A patent/CN1452098A/en active Pending
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102184343A (en) * | 2005-04-06 | 2011-09-14 | 株式会社东芝 | Report check apparatus and computer program product |
| CN102184343B (en) * | 2005-04-06 | 2014-06-25 | 株式会社东芝 | Report check apparatus and computer program product |
| US8966389B2 (en) | 2006-09-22 | 2015-02-24 | Limelight Networks, Inc. | Visual interface for identifying positions of interest within a sequentially ordered information encoding |
| US9015172B2 (en) | 2006-09-22 | 2015-04-21 | Limelight Networks, Inc. | Method and subsystem for searching media content within a content-search service system |
| CN102119383A (en) * | 2008-03-19 | 2011-07-06 | 德尔夫网络有限公司 | Method and subsystem for information acquisition and aggregation to facilitate ontology and language-model generation within a content-search-service system |
| CN102637205A (en) * | 2012-03-19 | 2012-08-15 | 南京大学 | Document classification method based on Hadoop |
| CN102637205B (en) * | 2012-03-19 | 2014-10-15 | 南京大学 | Document classification method based on Hadoop |
| CN107005613A (en) * | 2014-12-17 | 2017-08-01 | 微软技术许可有限责任公司 | Message view is optimized based on classifying importance |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2003317034A (en) | 2003-11-07 |
Similar Documents
| Publication | Title |
|---|---|
| US7965891B2 (en) | System and method for identifying and labeling fields of text associated with scanned business documents |
| US7251644B2 (en) | Processing an electronic document for information extraction |
| US7734636B2 (en) | Systems and methods for electronic document genre classification using document grammars |
| US5642435A (en) | Structured document processing with lexical classes as context |
| JP4311552B2 (en) | Automatic document separation |
| US7668372B2 (en) | Method and system for collecting data from a plurality of machine readable documents |
| US20090116736A1 (en) | Systems and methods to automatically classify electronic documents using extracted image and text features and using a machine learning subsystem |
| CN112508011A (en) | OCR (optical character recognition) method and device based on neural network |
| US20040181749A1 (en) | Method and apparatus for populating electronic forms from scanned documents |
| US11615244B2 (en) | Data extraction and ordering based on document layout analysis |
| JP3485020B2 (en) | Character recognition method and apparatus, and storage medium |
| CN112861865A (en) | OCR technology-based auxiliary auditing method |
| WO2021140682A1 (en) | Information processing device, information processing method, and information processing program |
| CN1452098A (en) | File classing system and program for carrying out same |
| US20010043742A1 (en) | Communication document detector |
| EP1202213B1 (en) | Document format identification apparatus and method |
| CN115063114A (en) | Contract additional recording automation method, electronic equipment and storage medium |
| JP6856916B1 (en) | Information processing equipment, information processing methods and information processing programs |
| JP2004171316A (en) | OCR device, document search system and document search program |
| CN112464907A (en) | Document processing system and method |
| US6320985B1 (en) | Apparatus and method for augmenting data in handwriting recognition system |
| JP2001009381A (en) | Information processing type mail sorting system |
| Engelbach et al. | Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents |
| CN113221886A (en) | Character learning and proofreading system based on image-text recognition |
| JP2006031129A (en) | Document processing method and document processing apparatus |
Legal Events
| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) |
| WD01 | Invention patent application deemed withdrawn after publication |