CN100511232C - Document retrieving device and method thereof

Info

Publication number: CN100511232C
Application number: CNB200610088580XA
Authority: CN
Inventors: 户岛英一郎
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-06-07
Filing date: 2006-06-06
Publication date: 2009-07-08
Anticipated expiration: 2026-06-06
Also published as: JP4750476B2; CN1877578A; JP2006343870A

Abstract

The present invention provides a document retrieval device and method that search the document data to be retrieved according to a search query and expanded queries derived from that query, and extract character strings that match the search query or an expanded query. It is then determined whether each extracted character string is contained in an unknown-word area. When a string is judged not to be contained in an unknown-word area, its similarity is adjusted downward, and the character strings are output as search results in an order corresponding to the adjusted similarity.

Description

Document retrieval device and method

Technical field

The present invention relates to a document retrieval device, a retrieval method, and a storage medium for retrieving document data according to a search query.

Background art

With the spread of personal computers (PCs), documents are generally created with application software running on a PC, such as document creation software. Specifically, creating and editing various files on a PC screen, and copying and searching those files, are now common operations. In addition, with the development and spread of networks, electronic document data created on a PC in this way is usually not printed as a paper document on a printer or the like. Instead it is accessed from other PCs or sent and distributed by e-mail, and paperless document creation environments are expanding.

Document management systems are built so that a computer can systematically manage such paperless electronic document data. Electronic document data is extremely convenient: sharing documents efficiently reduces the amount of information, and documents can be associated with one another. As electronic document data has spread, search operations such as full-text search and keyword search have become common, and the usefulness of search has become widely known.

On the other hand, a paper document, in which document data is printed on paper, has advantages over electronic document data: it is easy to read, can be handled in a wide range of situations, is easy to carry, and makes it easy to grasp the whole. For example, when materials need to be distributed, they are still handed out as paper documents produced by printing the electronic data on a printing device. However, a paper document cannot be searched as it is, so it is not easy to find the paper document on which the desired information is printed. Conventionally, therefore, the paper document is scanned and converted to electronic text by OCR (Optical Character Recognition) processing, and the resulting electronic document data is searched. If misrecognition occurs during OCR processing, however, the document data the user wants cannot be retrieved correctly.

Various approaches have been proposed to solve this problem. One example is a method that matches the search query against the character strings of the document data while allowing for dropped characters, mixed-in characters, and distorted characters in those strings. Another proposed method expands each character of the search query into the characters it is likely to be misrecognized as, and searches the document data by matching against the expanded query. Here, these matching methods are collectively called fuzzy matching.

Fuzzy matching can hit character strings that would otherwise be missed because of misrecognition, but it also has many drawbacks. For example, suppose “イラク” is entered as the search query, and strings produced by misrecognizing “イラク” (for example “イテク”) are also to be hit as search targets. In that case the “イテク” inside “ハイテク”, which has not been misrecognized, is hit as well. Consequently, every use of the word “ハイテク” in the document data produces an irrelevant hit. When such unwanted hits occur frequently, the user has to pick out the meaningful hits, which increases the user's workload and makes the retrieval device hard to use.
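The false-hit problem described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `find_all`, the sample sentence, and the assumption that fuzzy matching is approximated by plain substring search for each expanded query are not taken from the patent.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


# Sample OCR output (assumed): "イラク" at the start was misread as "イテク".
ocr_text = "イテクへのハイテク兵器輸出の是非を問う"

# The query plus the expansion it is assumed to be misrecognized as.
patterns = ["イラク", "イテク"]

# Collect (position, matched pattern) pairs for every pattern.
hits = sorted((i, p) for p in patterns for i in find_all(ocr_text, p))
print(hits)
```

The second hit falls inside the correctly recognized word “ハイテク”, which is exactly the kind of irrelevant hit the text describes.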

Japanese Patent Laid-Open No. 2004-334334 discloses a technique related to this.

However, irrelevant hits still occur with the technique of Japanese Patent Laid-Open No. 2004-334334. For example, to search for the character string “人間” (or a string produced by misrecognizing “人間”), “人間” is specified as the query. Because the kanji “間” and “関” look alike, “人関” is also set as a search query for the misrecognized characters. Then, if the document data contains the character string “被告人関係者”, the “人関” inside it is hit even though that string has not been misrecognized. In this case “人関”, “告人関”, “人関係”, and so on do not form part of a dictionary word, so the conventional technique cannot suppress this hit.

Summary of the invention

The present invention aims to overcome the drawbacks of the conventional technique described above.

A further feature of the present invention is to provide a document retrieval device and method capable of efficiently retrieving misrecognized character strings.

The present invention provides a document retrieval device comprising: matching means for searching the document data to be retrieved on the basis of a search query and expanded queries similar to the search query, and extracting character strings that match the search query or an expanded query; analysis means for analyzing the document data to be retrieved and identifying unknown-word areas; judgment means for judging whether at least part of a character string extracted by the matching means is contained in an unknown-word area; and search result output means for, according to the processing results of the matching means and the judgment means, lowering the similarity of a character string extracted by the matching means when that string is judged not to be contained in an unknown-word area, and outputting the character string as a search result.

The present invention also provides a document retrieval method comprising: a matching step of searching the document data to be retrieved on the basis of a search query and expanded queries of the search query, and extracting character strings that match the search query or an expanded query; an analysis step of analyzing the document data to be retrieved and identifying unknown-word areas; a judgment step of judging whether at least part of a character string extracted in the matching step is contained in an unknown-word area; and a search result output step of, according to the processing results of the matching step and the judgment step, lowering the similarity of a character string extracted in the matching step when that string is judged not to be contained in an unknown-word area, and outputting the character string as a search result.

This summary does not necessarily list all the essential features, so sub-combinations of these features can also constitute the invention.

Other features, objects, and advantages of the present invention will become apparent from the following description with reference to the accompanying drawings, in which like reference characters denote the same or similar parts.

Brief description of the drawings

The accompanying drawings constitute a part of this specification, illustrate embodiments of the invention, and together with the description serve to explain the principles of the invention.

FIG. 1 is a block diagram showing the configuration of a document retrieval device according to an embodiment of the present invention.

FIGS. 2A and 2B are diagrams illustrating examples of the search operation in this embodiment.

FIG. 3 is a diagram illustrating an example of the unknown word analysis processing in this embodiment.

FIG. 4 is a diagram illustrating the data structure of the unknown-word area table, which stores the unknown-word areas in this embodiment.

FIG. 5 is a diagram illustrating the data structure of the query expansion table used in this embodiment to expand each character (character string) of the search query into similar character strings.

FIG. 6 is a diagram illustrating the expanded-query lattice, in which the search query of this embodiment is expanded into the character strings that misrecognition could produce.

FIG. 7 is a diagram showing the storage format of the unknown-word area identification rules of this embodiment, that is, the rules for determining which parts of the morphological analysis result should become unknown-word areas.

FIG. 8 is a diagram showing the structure of the search result table, which stores the search result candidates in this embodiment.

FIG. 9 is a diagram illustrating the formula for the score that determines the output order of the search results in this embodiment.

FIG. 10 is a flowchart illustrating the processing of the document retrieval device of this embodiment.

FIG. 11 is a flowchart illustrating the search processing performed as part of the event handling in step S4 of FIG. 10.

FIG. 12 is a flowchart illustrating the unknown word analysis processing in step S11 of FIG. 11.

FIG. 13 is a flowchart illustrating the unknown-word area extraction processing in step S23 of FIG. 12.

FIG. 14 is a flowchart illustrating the fuzzy matching processing in step S13 of FIG. 11.

FIG. 15 is a flowchart illustrating the score adjustment processing in step S14 of FIG. 11.

Detailed description of the embodiments

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments do not limit the invention as set out in the claims, and not all combinations of the features described in the embodiments are necessarily essential to the solution provided by the invention.

In FIG. 1, the CPU 101 is a microprocessor. According to programs stored in the ROM 102 or the RAM 103, it performs the computations and logical decisions needed for image processing, character processing, character recognition processing, and search processing, and controls the components connected through the bus 120. The bus 120 is a system bus that carries address signals designating the components controlled by the CPU 101, as well as data and control signals. The ROM 102 is a read-only nonvolatile memory that stores the boot program executed by the CPU 101 and various data. At system startup, the boot program loads the control program stored on the hard disk (HD) 108 into the RAM 103 and causes the CPU 101 to execute it. The control program is described in detail later with reference to the flowcharts. The RAM 103 is a readable and writable random access memory. It stores the various programs loaded from the HD 108 and executed by the CPU 101, and serves as a work area during operation of the CPU 101 for temporarily storing various data from the components.

The input unit 104 includes a keyboard, a mouse, a touch panel, and the like, through which the user selects menu items and enters various data. The display unit 105 has a display such as a liquid crystal, CRT, or plasma display, and presents menus, processing results, errors, warnings, search results, and so on to the user. The scanner 106 optically reads a paper document as an original and digitizes it. The printer 107 is used to print documents and images. The document retrieval device can also print electronic document data in PDL (print control language) format received by the communication unit 110.

The HD (hard disk) 108 stores the control program 111 executed by the CPU 101, a morphological analysis dictionary 112 used for natural language analysis, unknown-word area identification rules 113 that describe how unknown-word areas are identified, and so on. As needed, it also stores working data such as an unknown-word area table 114 for managing unknown-word areas, a search result table 115 that holds search results, and a query expansion table 116 used to expand the search query. These data are loaded into the RAM as needed for reference, and are written back to the HD 108 after being changed as needed. The morphological analysis dictionary 112 stores the information generally required for natural language analysis, such as word spellings, part-of-speech information, conjugation information, and word collocation information.

The removable external storage device 109 is a pluggable storage device such as a USB storage device or an IC card. As on an ordinary PC, it may also be a drive for accessing external storage media such as floppy disks, CDs, and DVDs. The external storage device 109 can be used in the same way as the HD 108, and data can be exchanged with other devices through these storage media. All or part of the control program 111 stored on the hard disk 108 can be copied (installed) from the external storage device 109 to the HD 108 as needed. The communication unit 110 is a network controller and can exchange data with external devices over a communication line.

The document retrieval device of this embodiment, configured as described above, operates in response to various events from the input unit 104 and other components. When an interrupt arrives from the input unit 104, the interrupt signal is sent to the CPU 101 and an event is generated accordingly. In response to the event, the CPU 101 reads the instructions stored in the ROM 102 or the RAM 103 and executes them, thereby performing various kinds of control according to the control program.

In Example 1, shown in FIG. 2A, the text to be searched is originally “イラクへのハイテク兵器輸出の是非を問う” (201). During OCR processing it is recognized and registered as “イテクへのハイテク兵器輸出の是非を問う” (202). Here the character string that was originally “イラク” has become “イテク” through misrecognition.

Next, the operator issues the search query “イラク” (203) in order to find the misrecognized occurrence of “イラク”. In the search processing of this embodiment, similar-word expansion treats similar characters (the expanded query “イテク”) as matching. As a result, the misrecognized character string “イテク” is found as a search result (210). The same character string “イテク” also occurs inside “ハイテク” (211), but that occurrence lies within text that can be analyzed as the phrase “ハイテク”, so it is output as a search result with a lowered rank.

In Example 2, shown in FIG. 2B, the text to be searched is “法律に詳しい人間が被告人関係者に必要” (205). When this text is character-recognized as in Example 1, “間” is misrecognized as “関” and the result is registered (206). To search for “人間”, the search query “人間” is issued (207). The search then also covers the similar characters (“人関”), and the “人関” of the character string 212 is hit. The “人関” 213 inside “被告人関係者” also matches, but it lies within text that can be analyzed as the phrases “被告人” and “関係者”, so it is output as a search result with a lowered rank.

In FIG. 3, a page image 301 represents document data generated by scanning a paper document or rasterizing an electronic document, and it is identical to the original text. Text 302 is the result of OCR processing of that image. In text 302, misrecognition has turned “人間” into “人関” (310), “望まれる” into “望申れる” (311), and “イラク” into “イテク” (312) (the misrecognized characters are shown underlined).

Text 303 is the result of morphological analysis of the character-recognized text 302, with the text divided into phrase units; “/” marks a phrase boundary. The place where morphological analysis failed (the unanalyzable character string 311) is shown enclosed in a frame 313. The misrecognized “人関” (310) and “イテク” (312), by contrast, can be analyzed once they are split into morphemes, so they are not judged to be unanalyzable character strings. Text 304 is the result of further applying the unknown-word area extraction processing, described later, to the analysis result shown in 303 (the framed character strings 314 to 316 are the unknown-word areas). Thus, in addition to the unanalyzable character string in 303 (“望申れる” (311)), the misrecognized “人関” (310) and “イテク” (312) are also set as unknown-word areas according to the unknown-word area identification rules 113 described later.

FIG. 4 is a diagram illustrating the data structure of the unknown-word area table 114, which stores the unknown-word areas in this embodiment.

For each unknown-word area, a start position 401 and an end position 402 are stored. These hold values indicating where the unknown-word area begins and ends in the text (information giving the page, the line number, and the character's index within the line). The character code 403 holds the character codes of the string identified as an unknown-word area, for example the character string 316 in text 304 of FIG. 3. FIG. 4 shows the stored start and end positions of “イテク” 316, which has been designated an unknown-word area.
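One possible in-memory form of such a table, together with the check of whether a hit touches an unknown-word area, is sketched below. The class and function names, the inclusive end position, and the concrete position values are assumptions made for illustration; the patent only states that page, line, and character index are stored.

```python
from dataclasses import dataclass

# (page, line, character index within the line); tuples compare lexicographically,
# which matches reading order.
Position = tuple[int, int, int]


@dataclass(frozen=True)
class UnknownWordArea:
    start: Position  # start position 401
    end: Position    # end position 402 (treated as inclusive, an assumption)
    text: str        # characters of the area (character code 403)


def overlaps_unknown_area(hit_start: Position, hit_end: Position,
                          areas: list[UnknownWordArea]) -> bool:
    """True if at least part of the hit [hit_start, hit_end] lies in some area."""
    return any(hit_start <= a.end and a.start <= hit_end for a in areas)


# Hypothetical entry for "イテク" on page 1, line 3, characters 10 to 12.
table = [UnknownWordArea(start=(1, 3, 10), end=(1, 3, 12), text="イテク")]

print(overlaps_unknown_area((1, 3, 11), (1, 3, 13), table))  # partial overlap
print(overlaps_unknown_area((1, 3, 13), (1, 3, 15), table))  # no overlap
```

Because a partial overlap is enough, the check corresponds to the "at least part of the character string" wording used for the judgment means.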

FIG. 5 is a diagram illustrating the data structure of the query expansion table 116 used in this embodiment to expand each character (character string) of the search query into similar character strings.

Pairs of characters (or character strings) that are similar enough to be misrecognized for one another are stored here. For example, the katakana “ン” and “ソ” are similar, so “ン” is stored as an original character string 501 and “ソ” as the corresponding expanded character string 502. The katakana “デ” may be misrecognized as the two katakana characters “テリ”, so “テリ” is stored as the expanded character string 502 for the original character string 501 “デ”. Likewise, the katakana “ク” may be misrecognized as “ワ”, so that pair is registered. The two katakana characters “イン” may be misrecognized as the single kanji “仁”, so “仁” is stored as the expanded character string 502 for the original character string 501 “イン”. Other characters and character strings that are likely to be misrecognized are also registered, but they are omitted here.
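As a rough illustration, the four pairs named above could be held as a list of (original, expanded) tuples and looked up by position in the query. The names `EXPANSION_TABLE` and `expansions_at` are hypothetical, and a real table would contain many more entries.

```python
# (original character string 501, expanded character string 502)
EXPANSION_TABLE = [
    ("ン", "ソ"),
    ("デ", "テリ"),   # one character expands to two
    ("ク", "ワ"),
    ("イン", "仁"),   # two characters expand to one
]


def expansions_at(query: str, pos: int) -> list[tuple[str, str]]:
    """Return the table entries whose original string starts at query[pos]."""
    return [(orig, exp) for orig, exp in EXPANSION_TABLE
            if query.startswith(orig, pos)]


query = "インデクス"
for pos in range(len(query)):
    print(pos, query[pos], expansions_at(query, pos))
```

Keeping the original strings as keys of variable length is what allows one-to-two and two-to-one substitutions such as “デ” to “テリ” and “イン” to “仁”.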

FIG. 6 is a diagram illustrating the lattice of expanded queries, in which the search query of this embodiment is expanded into the character strings that misrecognition could produce.

Here the connections between the expanded character strings form a lattice. Each path from the start node to the end node 602 represents one query in which the original search query has been expanded into a similar character string that misrecognition could produce. For example, the search query “インデクス” becomes “イソテリワス” by the rules shown at 610 and is expanded to “仁デワス” by the rules shown at 611. Other rules expand it to “インテリクス”, “イソデクス”, and so on. The search query is thus stored in a way that distinguishes the unchanged original query text (here “インデクス”) from the expanded queries, that is, the strings expanded into the similar character strings described above. In FIG. 6, the characters in ellipses are characters that match the original search query, and the characters in rectangles are the characters they may be misrecognized as.
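The set of strings represented by such a lattice can be enumerated recursively, as in the following sketch. It assumes the four expansion pairs described for FIG. 5 and a hypothetical function name `expand_query`; the patent does not specify how the lattice is traversed.

```python
def expand_query(query: str, table: list[tuple[str, str]]) -> set[str]:
    """Enumerate every string reachable through the expansion lattice."""
    results: set[str] = set()

    def walk(pos: int, built: str) -> None:
        if pos == len(query):
            results.add(built)  # reached the end node
            return
        # Keep the original character (an ellipse node in FIG. 6).
        walk(pos + 1, built + query[pos])
        # Or replace a matching original string by its expansion (a rectangle node).
        for orig, exp in table:
            if query.startswith(orig, pos):
                walk(pos + len(orig), built + exp)

    walk(0, "")
    return results


TABLE = [("ン", "ソ"), ("デ", "テリ"), ("ク", "ワ"), ("イン", "仁")]
query = "インデクス"
variants = expand_query(query, TABLE)
expanded_only = variants - {query}  # expanded queries, kept apart from the original

print(len(variants))
print(sorted(expanded_only))
```

With these four pairs the lattice yields twelve strings: three choices for the leading “イン”, two for “デ”, and two for “ク”.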

FIG. 7 is a diagram showing the storage format of the unknown-word area identification rules, that is, the rules for determining which parts of the morphological analysis result should become unknown-word areas.

Each rule describes the conditions to be satisfied by a first phrase 701 and a second phrase 702. For example, Rule 1 is satisfied (that is, the span is identified as an unknown-word area) when the phrase length (number of characters) of the first phrase 701 is 1 and the independent-word length of the second phrase 702 (the number of characters excluding attached words) is 1. Accordingly, in the example shown at 304 in FIG. 3, the two phrases 314 consisting of “人” and “関が” are identified as an unknown-word area.

Similarly, Rule 2 states that a span is identified as an unknown word when the first phrase 701 has a phrase length of 2 or less and is written in katakana, and the second phrase 702 has an independent-word length of 2 or less and is written in katakana. Accordingly, in the example shown at 304 in FIG. 3, the two phrases 316 consisting of “イ” and “テク” are identified as an unknown-word area.
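Rules 1 and 2 can be expressed as simple predicates over a pair of adjacent phrases. The sketch below is an assumption-laden illustration: the `Phrase` class, the idea that the independent word is the leading part of a phrase, and the use of the Unicode Katakana block (U+30A0 to U+30FF) to test "written in katakana" are not specified in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phrase:
    text: str             # whole phrase, e.g. "関が"
    independent_len: int  # length of the independent word, e.g. 1 for "関"


def is_katakana(s: str) -> bool:
    """True if s is non-empty and every character is in the Katakana block."""
    return bool(s) and all("\u30a0" <= ch <= "\u30ff" for ch in s)


def rule1(first: Phrase, second: Phrase) -> bool:
    # First phrase length is 1 and second phrase independent-word length is 1.
    return len(first.text) == 1 and second.independent_len == 1


def rule2(first: Phrase, second: Phrase) -> bool:
    # Both parts are short (2 characters or fewer) and written in katakana.
    independent = second.text[:second.independent_len]
    return (len(first.text) <= 2 and is_katakana(first.text)
            and second.independent_len <= 2 and is_katakana(independent))


def is_unknown_word_area(first: Phrase, second: Phrase) -> bool:
    return rule1(first, second) or rule2(first, second)


print(is_unknown_word_area(Phrase("人", 1), Phrase("関が", 1)))        # Rule 1
print(is_unknown_word_area(Phrase("イ", 1), Phrase("テク", 2)))        # Rule 2
print(is_unknown_word_area(Phrase("被告人", 3), Phrase("関係者に", 3)))  # neither
```

The intuition behind both rules is that a run of very short phrases is more likely to be the debris of a misrecognized word than genuine text.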

FIG. 8 is a diagram showing the structure of the search result table 115, which stores the search result candidates in this embodiment.

For each search result candidate, the start position 801 stores the position in the text of the first character of the search result (the hit character string), and the end position 802 stores the position in the text of its last character. The score 803 stores a value that determines the display rank of the search result (described later). The search results are finally sorted by the value of the score 803 and output.
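A minimal sketch of such a table and its final sort is shown below. The field names, the use of plain character offsets for positions, the sample values, and the choice of descending order (higher score first) are assumptions; the text only says that results are sorted by score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    start: int  # start position 801, simplified here to a character offset
    end: int    # end position 802, same simplification
    score: int  # score 803


def rank(results: list[SearchResult]) -> list[SearchResult]:
    """Sort candidates so that the highest score is output first (assumed order)."""
    return sorted(results, key=lambda r: r.score, reverse=True)


# Hypothetical candidates in the order they were found in the text.
candidates = [
    SearchResult(start=6, end=8, score=1),
    SearchResult(start=0, end=2, score=5),
]
ordered = rank(candidates)
print([(r.start, r.score) for r in ordered])
```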

FIG. 9 is a diagram illustrating the formula for the score that determines the output rank of the search results in this embodiment.

First, the similarity is calculated by the expression (number of exactly matching characters) × 2 + (number of fuzzily matching characters). Here, the number of exactly matching characters is the number of characters in which the search query and the hit character string match exactly (without similar-string expansion). The number of fuzzily matching characters is the number of characters of the hit character string that match only through expansion of the search query into similar character strings. When part of the hit character string lies in an unknown-word area, the similarity obtained above becomes the score unchanged. When the hit character string does not lie in an unknown-word area, that is, when the whole of the hit character string has been analyzed into phrases, the score is the similarity minus the maximum analyzed phrase length. The maximum analyzed phrase length is the length (number of characters) of the longest phrase among the phrases of the morphological analysis result for the hit character string. The details follow.

Example 1 shows a case where the hit character string is in an unknown-word area. In this case the score is defined as equal to the similarity. In other words, the maximum analyzed phrase length is 0.

Examples 2 to 4 show cases where the hit character string is not in an unknown-word area. In Example 2, the hit character string involves analyzed phrase 1 and analyzed phrase 2, and the character count n of the longer of the two phrases is the maximum analyzed phrase length.

In Example 3, the hit character string covers only one phrase, so the character count n of that phrase is the maximum analyzed phrase length.

In Example 4, the hit character string covers three analyzed phrases, and the character count k of the longest of them is the maximum analyzed phrase length.
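The scoring rule of FIG. 9 as described above can be written as two small functions. The function and parameter names are hypothetical, and the list of phrase lengths is assumed to be supplied by the morphological analysis step.

```python
def similarity_from_counts(exact: int, fuzzy: int) -> int:
    """Similarity = (exactly matching characters) x 2 + (fuzzily matching characters)."""
    return 2 * exact + fuzzy


def adjusted_score(similarity: int, in_unknown_area: bool,
                   phrase_lengths: list[int]) -> int:
    """Return the score used for ranking.

    phrase_lengths holds the character counts of the analyzed phrases that the
    hit touches; it is ignored when the hit is in an unknown-word area.
    """
    if in_unknown_area:
        return similarity  # Example 1: the similarity is used unchanged
    # Examples 2 to 4: subtract the longest analyzed phrase length.
    return similarity - max(phrase_lengths, default=0)


# A hit with 1 exact and 1 fuzzy character, first inside an unknown-word area,
# then inside fully analyzed text touching phrases of 3 and 4 characters.
s = similarity_from_counts(exact=1, fuzzy=1)
print(adjusted_score(s, True, []))
print(adjusted_score(s, False, [3, 4]))
```

Subtracting the longest phrase length penalizes hits more heavily the longer the well-formed phrase they fall inside, since a long analyzable phrase is less likely to be an OCR error.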

The method of obtaining the score described above will now be explained using the example shown at 304 in FIG. 3. When the search query word is "人間", "人関が" 314 and "被告人関係者" 317 are retrieved as character strings matching the expanded query word "人関". When the search query word is "イラク", "イテク" 316 and "ハイテク" 318 are retrieved as character strings matching the expanded query word "イテク". First, "人関が" 314 corresponds to Example 1. According to the expression of FIG. 9, its similarity is (number of exactly matching characters ("人"): 1) × 2 + (number of fuzzily matching characters ("関"): 1) = 3. Because part of the character string "人関" lies in an unknown word region, the score is also 3.

"被告人関係者に" 317, on the other hand, corresponds to Example 2. Its similarity is 3, by the same calculation as above. However, no part of it lies in an unknown word region, so the score is the similarity (3) minus the maximum analyzed phrase length (4), that is, 3 - 4 = -1. Similarly, when the search query word is "イラク", "イテク" 316 corresponds to Example 1. According to the expression of FIG. 9, its similarity is (number of exactly matching characters ("イ" and "ク"): 2) × 2 + (number of fuzzily matching characters ("テ"): 1) = 5, and because it lies in an unknown word region, the score is also 5. "ハイテク" 318, by contrast, is contained within a single analyzed phrase and therefore corresponds to Example 3. By the same calculation its similarity is 5, but because it does not lie in an unknown word region, the score is the similarity (5) minus the maximum analyzed phrase length (4), that is, 5 - 4 = 1.
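The four scores worked out above can be reproduced with a short sketch. It assumes that the matched part of each hit has the same length as the query and is aligned with it character by character, so that every non-identical character was matched through expansion; the helper name count_matches and the layout of the case table are hypothetical.

```python
def count_matches(query: str, matched: str) -> tuple[int, int]:
    """Count exactly and fuzzily matching characters of an aligned match.

    Assumes 'matched' already matched the expanded query, so any character
    that differs from the query is counted as a fuzzy match.
    """
    if len(query) != len(matched):
        raise ValueError("query and matched text must have equal length")
    exact = sum(q == m for q, m in zip(query, matched))
    return exact, len(matched) - exact


# (reference numeral, query, matched text, in unknown word region?, max phrase length)
cases = [
    (314, "人間", "人関", True, 0),
    (317, "人間", "人関", False, 4),
    (316, "イラク", "イテク", True, 0),
    (318, "イラク", "イテク", False, 4),
]

scores = {}
for ref, query, matched, in_unknown_region, max_phrase_len in cases:
    exact, fuzzy = count_matches(query, matched)
    similarity = 2 * exact + fuzzy
    # The similarity is reduced only when the hit is fully analyzed.
    scores[ref] = similarity if in_unknown_region else similarity - max_phrase_len
```

The resulting values are 3, -1, 5 and 1 for the hits 314, 317, 316 and 318, in agreement with the text.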

The operations described above will now be explained with reference to flowcharts.

FIG. 10 is a flowchart illustrating the processing of the document retrieval device of this embodiment. The program that executes this processing is stored in the RAM 103 at the time of execution and is executed under the control of the CPU 101.

First, in step S1, system initialization is performed, including initialization of various parameters and display of an initial screen. Next, in step S2, the device waits for an event, such as a request from the input unit 104 or from a device connected via a network or the like. When an event occurs, the process proceeds to step S3, where the event is identified, and the process branches to one of various processes according to the type of the identified event. Step S4 collectively represents the plurality of processes that are the branch destinations corresponding to the various events. One example of such a branch destination is the search processing shown in FIG. 11. Other processes, whose details are not described here, are those of an ordinary retrieval device, such as specifying search conditions, scanning an original to generate a document image, and specifying a document. The process then proceeds to step S5, where the result of the processing in step S4 is displayed. The processing here is of a commonly performed kind, such as displaying search results, displaying an error when one has occurred, and displaying normal completion.
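The event handling of steps S2 to S5 can be pictured as a simple dispatch loop. The sketch below is only one possible arrangement; the function name run, the handler table and the result tuples are assumptions made for illustration, and initialization (step S1) is omitted.

```python
def run(events, handlers, display):
    """Dispatch each event to its handler and display the outcome.

    events:   iterable of (kind, payload) pairs (stands in for waiting, S2).
    handlers: dict mapping an event kind to a callable (branch targets, S4).
    display:  callable receiving the result of each event (S5).
    """
    for kind, payload in events:            # S2: take the next event
        handler = handlers.get(kind)        # S3: identify the event type
        if handler is None:
            result = ("error", f"unknown event: {kind}")
        else:
            result = ("ok", handler(payload))   # S4: event-specific processing
        display(result)                     # S5: show the processing result


shown = []
run([("search", "query"), ("bogus", None)],
    {"search": lambda q: q.upper()},
    shown.append)
```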

FIG. 11 is a flowchart illustrating the search processing, which is part of the event handling processing in step S4 of FIG. 10.

First, in step S11, the unknown word analysis processing described in detail in the flowchart of FIG. 12 is executed. Here, character recognition is performed on the image of the specified document to generate OCR text, and the unknown word region table 114 (FIG. 4) is then generated through morphological analysis and unknown word analysis. Next, in step S12, the input search query word is expanded into similar character strings that may result from misrecognition, and an expanded query word lattice is generated (see FIG. 6). Next, in step S13, the fuzzy matching processing described later with reference to the flowchart of FIG. 14 is executed on the basis of the generated expanded query word lattice, and the search result table 115 is generated. Next, in step S14, as described later with reference to the flowchart of FIG. 15, the scores of the search result table 115 are obtained from the unknown word region table 114 and the expression of FIG. 9. Next, in step S15, the search results are sorted in order of the obtained scores. Then, in step S16, the search results sorted in score order are displayed and output.
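Steps S15 and S16 amount to ordering the search result table by score. A minimal sketch follows, assuming that each result is a dictionary with a "score" entry and that higher scores are listed first; both the record layout and the descending order are assumptions, chosen so that hits in unknown word regions are shown with priority.

```python
def rank_results(results):
    """Return the search results sorted by score, highest first (assumed order)."""
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Scores taken from the worked example of FIG. 3; "ref" holds the reference numeral.
results = [
    {"ref": 314, "score": 3},
    {"ref": 317, "score": -1},
    {"ref": 316, "score": 5},
    {"ref": 318, "score": 1},
]
ranked_refs = [r["ref"] for r in rank_results(results)]
```

With these values the two hits lying in unknown word regions (316 and 314) come before the two fully analyzed hits (318 and 317).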

In the unknown word analysis processing of FIG. 12, first, in step S21, character recognition is performed on the specified document image to obtain text information (302 in FIG. 3). Next, in step S22, morphological analysis is performed on the text information to divide it into phrases (303 in FIG. 3). Next, in step S23, unknown word regions are extracted as described later with reference to the flowchart of FIG. 13 (304 in FIG. 3). Next, in step S24, the extracted unknown word regions are output as the unknown word region table 114.

In the unknown word region extraction processing of FIG. 13, first, in step S31, variables and the like are initialized so that the pointer indicating a phrase points to the beginning of the text. Next, in step S32, information on the phrase indicated by the pointer is obtained. Next, in step S33, the morphological analysis dictionary 112 is consulted to judge whether the phrase obtained in step S32 is a phrase that could not be analyzed. If it is judged to be unanalyzable, it is regarded as an unknown word region and the process branches to step S35. If it is judged not to be unanalyzable, the process proceeds to step S34, where the unknown word region identification rule 113 is consulted to judge whether the phrase belongs to an unknown word region. If it is judged not to belong to an unknown word region, the process branches to step S36. If it is judged to belong to one, the process proceeds to step S35, where the necessary information on the extracted unknown word region is collected and the region is set as an unknown word region. The process then proceeds to step S36, where the pointer is updated to indicate the next phrase. Next, in step S37, it is judged whether a next phrase exists. If so, the process returns to step S32 and the processing described above is repeated. If no next phrase exists, the unknown word region extraction processing ends.
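The loop of steps S31 to S37 can be sketched as follows. The sketch assumes that the phrases are given as a list of strings in text order, that "unanalyzable" can be approximated by absence from a dictionary set, and that the unknown word region identification rule is supplied as a predicate; these three representations are simplifications introduced for illustration and are not specified in this form by the embodiment.

```python
def extract_unknown_regions(phrases, dictionary, region_rule):
    """Return the (start, end) spans of phrases treated as unknown word regions.

    phrases:     list of phrase strings in text order (assumed input form).
    dictionary:  set of analyzable phrases (stand-in for dictionary 112).
    region_rule: predicate on a phrase (stand-in for identification rule 113).
    """
    regions = []
    position = 0                                    # S31: start of the text
    for phrase in phrases:                          # S32, S36, S37: walk the phrases
        end = position + len(phrase)
        unanalyzable = phrase not in dictionary     # S33: could it be analyzed?
        if unanalyzable or region_rule(phrase):     # S34: apply the rule otherwise
            regions.append((position, end))         # S35: record the region
        position = end
    return regions


no_rule = extract_unknown_regions(["abc", "zz", "de"], {"abc", "de"}, lambda p: False)
with_rule = extract_unknown_regions(["abc", "zz", "de"], {"abc", "de"}, lambda p: p == "de")
```

The rule is evaluated only for phrases that were analyzable, mirroring the branch from step S33 directly to step S35.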

In the fuzzy matching processing of FIG. 14, first, in step S41, initialization is performed so that the pointer indicating a character position points to the beginning of the text. Next, in step S42, the expanded query word lattice is matched against the characters of the text at the position indicated by the pointer. Then, in step S43, it is judged whether the expanded query word lattice matches the characters. If not, the process skips to step S46 and moves to the next character. If a match is found in step S43, the process proceeds to step S44, where the degree of matching is calculated as the similarity. This similarity is calculated according to the expression shown in FIG. 9. The process then proceeds to step S45, where the matching character position is registered in the search result table 115 (FIG. 8). At this point, the similarity obtained in step S44 is set as the score as it is. The process then proceeds to step S46, where the pointer is updated to advance the character position by one. Then, in step S47, it is judged whether the character position has reached the end of the text. If not, the process returns to step S42 and the processing described above is repeated. If the end has been reached, the fuzzy matching processing ends.
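A minimal sketch of the scan in steps S41 to S47 is given below. It assumes that the expanded query word lattice can be modeled as one entry per query character, each holding the original character and a set of similar characters, so that every expansion replaces one character by exactly one character; the actual lattice of FIG. 6 may be more general, and the function name and tuple layout are hypothetical.

```python
def fuzzy_match(text, lattice):
    """Scan 'text' and return (position, length, similarity) for every match.

    lattice: list of (exact_char, similar_chars) pairs, one per query character
             (a simplified stand-in for the expanded query word lattice).
    """
    n = len(lattice)
    if n == 0:
        return []
    hits = []
    for pos in range(len(text) - n + 1):                 # S41, S46, S47
        exact = fuzzy = 0
        for (exact_char, similar), ch in zip(lattice, text[pos:pos + n]):  # S42
            if ch == exact_char:
                exact += 1
            elif ch in similar:
                fuzzy += 1
            else:
                break                                    # S43: no match here
        else:
            # S44: similarity per FIG. 9; S45: register the hit.
            hits.append((pos, n, 2 * exact + fuzzy))
    return hits


# Query "イラク", where "ラ" is assumed to be confusable with "テ".
lattice = [("イ", set()), ("ラ", {"テ"}), ("ク", set())]
hits = fuzzy_match("ハイテク", lattice)
```

For the text "ハイテク" this yields a single hit at position 1 with similarity 5, as in the worked example; at this stage the similarity is simply recorded as the provisional score.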

In the score adjustment processing of FIG. 15, first, in step S51, initialization is performed so that the pointer indicating a search result points to the beginning of the search result table 115 (FIG. 8). Next, in step S52, the information (position and score) of the search result indicated by the pointer is obtained. Next, in step S53, the unknown word region table 114 (FIG. 4) is consulted to check whether the hit character string represented by the search result includes an unknown word region in the text. Then, in step S54, if it is judged that an unknown word region is included, the process branches to step S58 and the score is left equal to the similarity. If it is judged that none is included, the process proceeds to step S55, where the maximum analyzed phrase length for the hit character string is obtained as shown in FIG. 9. Next, in step S56, the score is corrected by subtracting the obtained maximum analyzed phrase length from it. Next, in step S57, the corrected score is reflected in the search result table 115 (FIG. 8). Next, in step S58, the pointer is updated to indicate the next search result. Then, in step S59, it is judged whether the last search result has been processed. If not, the process returns to step S52 and the processing described above is repeated. If so, the score adjustment processing ends.
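The score adjustment of steps S51 to S59 may be sketched as a single pass over the search results. The sketch assumes that hits, unknown word regions and phrases are all held as half-open (start, end) character spans and that each result is a dictionary with "start", "end" and "score" entries; these layouts are illustrative assumptions rather than the formats of tables 114 and 115.

```python
def overlaps(a, b):
    """True when two half-open spans share at least one character position."""
    return a[0] < b[1] and b[0] < a[1]


def adjust_scores(results, unknown_regions, phrases):
    """Lower the score of every hit that touches no unknown word region."""
    for result in results:                                    # S51, S58, S59
        span = (result["start"], result["end"])               # S52
        if any(overlaps(span, region) for region in unknown_regions):  # S53, S54
            continue                                          # score stays equal to similarity
        longest = max((end - start for start, end in phrases
                       if overlaps(span, (start, end))), default=0)    # S55
        result["score"] -= longest                            # S56, S57
    return results


results = [
    {"start": 0, "end": 3, "score": 5},   # hit inside an unknown word region
    {"start": 4, "end": 7, "score": 5},   # hit inside a 4-character analyzed phrase
]
adjust_scores(results, unknown_regions=[(0, 3)], phrases=[(3, 7)])
adjusted = [r["score"] for r in results]
```

Here the first hit keeps its score of 5, while the second is reduced to 1.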

(Other Embodiments)

The present invention is not limited to the embodiment described above, and may be modified as appropriate without departing from the gist of the invention. In the embodiment described above, morphological analysis is used as the method of language analysis, but other implementations are also conceivable. For example, a method based on a technique that only divides the text into words may be used. In that case, the portions consisting of ancillary words are not analyzed at all and are treated as unknown word regions. This has the drawback of lowering the accuracy of the analysis, but the analysis is less demanding than morphological analysis, so a system with a lighter load can be constructed.

In addition, in the embodiment described above, fuzzy matching is performed by expanding the characters of the query word into similar character strings and then searching. Alternatively, instead of performing expansion, groups of similar characters may be collected and each group normalized to a representative character before matching. Such a configuration reduces the processing load and can be applied to smaller-scale devices.
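The normalization-based alternative can be sketched with Python's built-in str.maketrans and str.translate. The sketch assumes that each group of similar characters is given as a string whose first character serves as the representative; the two groups shown are taken from the character pairs of the worked example, and the function names are hypothetical.

```python
def build_normalizer(groups):
    """Build a translation table mapping each character to its group's representative.

    groups: iterable of strings; the first character of each string is
            (by assumption) the representative of that group.
    """
    mapping = {}
    for group in groups:
        representative = group[0]
        for ch in group[1:]:
            mapping[ch] = representative
    return str.maketrans(mapping)


def find_normalized(text, query, table):
    """Return the index of the first match after normalizing both sides, or -1."""
    return text.translate(table).find(query.translate(table))


table = build_normalizer(["間関", "ラテ"])
pos_hitech = find_normalized("ハイテク", "イラク", table)
pos_ningen = find_normalized("人関が", "人間", table)
```

Because both the text and the query are normalized, a plain substring search suffices. Note that, in this simplified form, the distinction between exactly and fuzzily matching characters is no longer available after normalization, so the similarity would have to be computed in some other way.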

Entirely different implementations of fuzzy matching are also conceivable. For example, a technique such as wildcard search, which judges the matching to be successful even when some parts do not match, may be adopted. In that case the method of calculating the similarity changes somewhat, but apart from the fuzzy matching part the configuration can be exactly the same, and exactly the same effect can be obtained. Beyond the above, the configuration of this embodiment may be modified as appropriate without departing from the gist of the present invention.

As described above, according to this embodiment, a character string search that tolerates misrecognition can be performed on text containing misrecognized characters. Moreover, even when a hit character string is found by tolerating misrecognition, its score is lowered if it is contained in an analyzable character string, so unwanted false hits can be suppressed. As a result, hits on character strings that were actually misrecognized are displayed with relative priority, and a document retrieval device with high operability can be provided.

The present invention is not limited to the above embodiments, and various changes and modifications can be made within the spirit and scope of the present invention. Therefore, to apprise the public of the scope of the present invention, the following claims are made.

Claims

1. A document retrieval device, characterized by comprising:

matching means for searching document data to be searched on the basis of a search query word and an expanded query word similar to the search query word, and extracting a character string matching the search query word and the expanded query word;

analysis means for analyzing the document data to be searched and identifying an unknown word region;

Judging means, judging whether at least a part of the character string extracted by the matching means is included in the unknown word region; and

search result output means for reducing the similarity of the character string when it is judged, on the basis of the processing results of the matching means and the judging means, that the character string extracted by the matching means is not included in the unknown word region, and outputting the character string as a search result.

2. The document retrieval device according to claim 1, characterized in that:

the search result output means has score correcting means for reducing the similarity of the character string extracted by the matching means when the judging means judges that the character string is not included in the unknown word region, and outputs the character strings as search results in an order corresponding to the similarities obtained by the score correcting means.

3. The document retrieval device according to claim 1, characterized in that:

the expanded query word is a character string formed by replacing a character that constitutes the search query word and has a high probability of being misrecognized as another character during character recognition with that other character.

4. The document retrieval device according to claim 2, characterized in that:

the matching means obtains the similarity of the matching character string from the number of characters contained in the matching character string that match characters of the search query word and the number of characters contained in the matching character string that match characters of the expanded query word.

5. The document retrieval device according to claim 1, characterized in that:

the analysis means performs morphological analysis of the document data, and identifies the unknown word region according to whether words included in the phrases obtained by the morphological analysis are included in a word dictionary.

6. The document retrieval device according to claim 2, characterized in that:

the analysis means analyzes the document data as the retrieval object, divides the document data in units of phrases,

The score correcting means reduces the similarity by subtracting the maximum number of characters of a phrase including at least a part of the character string extracted by the matching means from the similarity of the extracted character string.

7. A document retrieval method comprising:

a matching step of searching document data to be searched on the basis of a search query word and an expanded query word of the search query word, and extracting a character string matching the search query word and the expanded query word;

an analysis step of analyzing the document data to be searched and identifying an unknown word region;

a judging step of judging whether at least a part of the character string extracted in the matching step is included in the unknown word region; and

a search result output step of reducing the similarity of the character string when it is judged, on the basis of the processing results of the matching step and the judging step, that the character string extracted in the matching step is not included in the unknown word region, and outputting the character string as a search result.

8. The document retrieval method according to claim 7, characterized in that:

the search result output step includes a score correction step of reducing the similarity of the character string extracted in the matching step when it is judged in the judging step that the character string is not included in the unknown word region; and

the character strings are output as search results in an order corresponding to the similarities obtained in the score correction step.

9. The document retrieval method according to claim 7, characterized in that:

10. The document retrieval method according to claim 8, characterized in that:

in the matching step, the similarity of the matching character string is obtained from the number of characters contained in the matching character string that match characters of the search query word and the number of characters contained in the matching character string that match characters of the expanded query word.

11. The document retrieval method according to claim 7, characterized in that:

In the analyzing step, the morphological analysis of the document data is performed, and the unknown word region is identified according to whether the words contained in the phrase obtained by the morphological analysis are included in the word dictionary.

12. The document retrieval method according to claim 8, characterized in that:

the analysis step analyzes the document data as the retrieval object, divides the document data in units of phrases,

the score correction step reduces the similarity by subtracting, from the similarity of the character string extracted in the matching step, the maximum number of characters of a phrase that includes at least a part of the extracted character string.