RU2005118673A - METHOD FOR RECOGNIZING TEXT INFORMATION FROM GRAPHIC FILE USING DICTIONARIES AND ADDITIONAL DATA

Info

Publication number: RU2005118673A
Application number: RU2005118673/09A
Authority: RU
Inventors: Константин Владимирович Анисимович (RU); Владимир Юрьевич Рыбкин (RU); Александр Львович Шамис (RU)
Original assignee: "Аби Софтвер Лтд." (CY)
Priority date: 2005-06-16
Filing date: 2005-06-16
Publication date: 2006-12-27
Also published as: RU2295154C1

Claims

1. A method for recognizing text information from a graphic file, comprising obtaining a graphic file from a scanning device or by other means, segmenting the image, and recognizing text characters, characterized in that:
an order of access to additional information is specified in advance, the additional information including at least the following types: information about the points at which a line is divided into characters, and/or the recognition quality of a graphic element, and/or a dictionary, and/or a dictionary of possible word parts, and/or rules arising from the data-type patterns or regular expressions in use, and/or rules determined by the position of the word within the line and/or paragraph, and/or rules determined by the language of the document, and/or rules determined by the type of document, and/or additional rules for processing rare cases;
a quality score is assigned in advance to each type of additional information;
various options for splitting the image of each selected line into fragments presumably containing images of individual words are constructed in advance, based on reliably recognized spaces;
a linear division graph (LDG) is constructed for each line fragment, describing the options for dividing the fragment into graphic elements presumably containing character images;
the images of the graphic elements are recognized using one or more classifiers, and each recognition variant of a graphic element is assigned a score;
a transition is made from grapheme recognition variants to alphabet character variants;
and at least the following steps are performed:
first step: for each LDG chain connecting the initial and final vertices, chains are constructed that correspond to all grapheme recognition variants and all variants of transition from recognized graphemes to alphabet characters, and the resulting variants are ranked in descending order of recognition quality score;
second step: all resulting character-group variants are processed using information on the placement of uppercase and lowercase letters; if the recognition of a graphic element yields more than one character variant, the variants are processed further by successively applying the subsequent types of additional information in the predetermined order and/or, if necessary, by applying all types of additional information simultaneously; each resulting variant is assigned a quality score; character variants scoring below a predetermined threshold are discarded; and the remaining variants are sorted by pairwise comparison;
third step: the recognition of spaces that were erroneously recognized in the previous stages is additionally corrected by joining elements erroneously separated in the previous steps and separating elements erroneously joined in the previous steps.
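
For illustration only, the first step of claim 1 (enumerating the chains of a linear division graph and ranking the resulting character strings) might be sketched as below. The claims do not disclose any data structures or a score-combination rule, so the dictionary-based graph, the multiplication of scores, and every name and value here are assumptions.

```python
from itertools import product

# Hypothetical linear division graph for one line fragment.
# Vertices are cut points 0..2; an edge (start -> end) is a graphic element
# with a list of (character, score) recognition variants. All values are
# invented for illustration.
LINE_GRAPH = {
    0: [(1, [("r", 0.9)]), (2, [("m", 0.6)])],
    1: [(2, [("n", 0.8), ("h", 0.3)])],
}
FINAL_VERTEX = 2


def chains(graph, start, final):
    """Yield every chain of edges leading from start to the final vertex."""
    if start == final:
        yield []
        return
    for end, variants in graph.get(start, []):
        for rest in chains(graph, end, final):
            yield [variants] + rest


def ranked_words(graph, start=0, final=FINAL_VERTEX):
    """Expand each chain into character strings and rank them by score.

    The score of a string is the product of its element scores; this is an
    assumed combination rule, not one stated in the claims.
    """
    results = []
    for chain in chains(graph, start, final):
        for combo in product(*chain):
            text = "".join(ch for ch, _ in combo)
            score = 1.0
            for _, element_score in combo:
                score *= element_score
            results.append((text, score))
    # Descending order of recognition quality score.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

With these invented scores, the two-element reading "rn" ranks ahead of the single-element reading "m", which in turn ranks ahead of "rh"; later steps would then re-score such candidates using dictionaries and other additional information.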

2. The method according to claim 1, characterized in that the rules determined by the language of the document include phonetic, and/or lexical, and/or semantic rules.

3. The method according to claim 1, characterized in that, in the second step, the information on the possible placement of uppercase and lowercase letters includes at least four varieties: all characters are uppercase; all characters are lowercase; the first character is uppercase and the rest are lowercase; and the variant selected based on the scores of the transitions made from recognized graphemes to characters using the first type of additional information.
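
A minimal sketch of the four capitalization varieties in claim 3 follows. The function name and the `recognized` parameter are hypothetical; how the fourth variety is actually chosen from the transition scores is not specified in the claim, so it is simply passed in.

```python
def case_variants(word, recognized=None):
    """Return candidate capitalizations of a word, without duplicates.

    `recognized` stands in for the fourth variety in claim 3: the casing
    chosen from the grapheme-to-character transition scores. That choice
    is made elsewhere, so it is accepted here as a ready-made string.
    """
    candidates = [
        word.upper(),                         # all uppercase
        word.lower(),                         # all lowercase
        word[:1].upper() + word[1:].lower(),  # first uppercase, rest lowercase
    ]
    if recognized is not None:
        candidates.append(recognized)
    # dict.fromkeys keeps first-seen order while removing duplicates.
    return list(dict.fromkeys(candidates))
```

For example, `case_variants("mOSCOW", "mOSCOW")` yields the three regular forms followed by the casing as recognized.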

4. The method according to claim 1, characterized in that a dictionary of possible word fragments that exist in a natural language is used.

5. The method according to claim 4, characterized in that each combination of possible fragments of words is provided with an estimate of the probability of use in the text.
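
One way to picture claims 4 and 5 is a table of word-fragment combinations, each with a probability estimate. The sketch below assumes that a "combination" means a pair of adjacent fragments and that probabilities are multiplied along the word; the dictionary contents, the `UNSEEN` floor, and the function name are all invented for illustration.

```python
# Hypothetical dictionary: pairs of adjacent word fragments with an assumed
# probability of occurring in text. Values are invented for illustration.
FRAGMENT_PAIRS = {
    ("re", "cog"): 0.02,
    ("cog", "ni"): 0.01,
    ("ni", "tion"): 0.03,
}
UNSEEN = 1e-6  # assumed floor for combinations missing from the dictionary


def fragment_score(fragments):
    """Multiply the probabilities of each adjacent pair of fragments."""
    score = 1.0
    for pair in zip(fragments, fragments[1:]):
        score *= FRAGMENT_PAIRS.get(pair, UNSEEN)
    return score
```

Under these assumptions, a candidate split as `["re", "cog", "ni", "tion"]` scores far higher than one containing a misrecognized fragment such as `"c0g"`.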

6. The method according to claim 4, characterized in that, for evaluating a word, patterns are used that differ in the composition and types of the characters they contain: a bilingual word, and/or a bilingual word with numbers, and/or a dictionary identifier, and/or an abbreviation, and/or a number, and/or a Roman numeral, and/or a number with a suffix (ordinal number), and/or a number with a prefix, and/or a word consisting of punctuation marks, and/or a word plus a number, and/or a word with a number inside, and/or a word with brackets, and/or a telephone number, and/or a URL pattern, and/or a file name together with its full location information, and/or a regular-expression pattern, and/or an auxiliary pattern.
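
A few of the pattern types listed in claim 6 could be approximated with regular expressions, as in the sketch below. The expressions are deliberately simplified assumptions (Latin letters only, English ordinal suffixes); the patent does not give the actual patterns, and the names are hypothetical.

```python
import re

# Assumed, simplified regular expressions for some of the pattern types
# named in claim 6. They are illustrative, not taken from the patent.
PATTERNS = [
    ("roman_numeral", re.compile(r"[IVXLCDM]+")),
    ("number", re.compile(r"\d+([.,]\d+)?")),
    ("ordinal", re.compile(r"\d+(st|nd|rd|th)")),
    ("url", re.compile(r"(https?://|www\.)\S+")),
    ("word_plus_number", re.compile(r"[A-Za-z]+-?\d+")),
    ("word", re.compile(r"[A-Za-z]+")),
]


def matching_patterns(token):
    """Return the names of all patterns that match the whole token."""
    return [name for name, rx in PATTERNS if rx.fullmatch(token)]
```

A token may match several patterns at once (for instance, a Roman numeral is also a plain word of letters), which is why candidate variants still need to be scored and compared rather than assigned to the first matching pattern.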

7. The method according to claim 1, characterized in that it provides a means for adding new rules and restrictions, including the introduction of rules for data types, which are divided into simple and composite.

8. The method according to claim 7, characterized in that a composite data type is formed as a combination of at least two simple data types, or any combination of simple and composite data types.

9. The method according to claim 7, characterized in that a data type is specified by at least the following characteristics: a list of characters allowed in words, and/or an additional rule restricting the list of characters, and/or a list of allowed punctuation marks, and/or grammar rules for frequently occurring words or word fragments.
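
The simple and composite data types of claims 7 to 9 might be modelled as in the following sketch. The class names, the `accepts` method, and the example types are assumptions; only the allowed-character and allowed-punctuation lists from claim 9 are represented, and the additional restricting rules and grammar rules are left out.

```python
import string
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SimpleType:
    """A simple data type: allowed characters and allowed punctuation."""
    name: str
    allowed_chars: frozenset
    allowed_punct: frozenset = frozenset()

    def accepts(self, text: str) -> bool:
        return bool(text) and all(
            ch in self.allowed_chars or ch in self.allowed_punct for ch in text
        )


@dataclass(frozen=True)
class CompositeType:
    """A composite type: a sequence of simple and/or composite parts."""
    name: str
    parts: Tuple  # SimpleType or CompositeType instances, in order

    def accepts(self, text: str) -> bool:
        return _match(self.parts, text)


def _match(parts, text):
    """Try every split of `text` so that each part accepts its own piece."""
    if not parts:
        return text == ""
    head, rest = parts[0], parts[1:]
    return any(
        head.accepts(text[:i]) and _match(rest, text[i:])
        for i in range(1, len(text) + 1)
    )


# Hypothetical example types, invented for illustration.
LETTERS = SimpleType("letters", frozenset(string.ascii_uppercase))
DIGITS = SimpleType("digits", frozenset(string.digits))
FLIGHT_CODE = CompositeType("flight_code", (LETTERS, DIGITS))
```

Here `FLIGHT_CODE.accepts("SU1234")` holds while `FLIGHT_CODE.accepts("1234SU")` does not. Because a `CompositeType` exposes the same `accepts` method as a `SimpleType`, composites can be nested inside other composites, mirroring the "any combination of simple and composite data types" wording of claim 8.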