
HK1179387B - Method and apparatus for segmenting words from a textual line image - Google Patents


Info

Publication number
HK1179387B
HK1179387B
Authority
HK
Hong Kong
Prior art keywords
ink
width
percentile
distribution
interrupt
Prior art date
Application number
HK13106061.6A
Other languages
Chinese (zh)
Other versions
HK1179387A1 (en)
Inventor
Aleksandar Uzelac
Bodin Dresevic
Sasa Galic
Bogdan Radakovic
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 12/749,599 (US8345978B2)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1179387A1
Publication of HK1179387B

Description

Method and apparatus for segmenting words from text line images
Background
Optical character recognition (OCR) is a computer-based process that converts images of text into machine-editable text, typically using a standard encoding scheme. OCR eliminates the need to manually enter a document into a computer system. Many problems may arise from poor image quality, imperfections introduced by the scanning process, and the like. For example, a conventional OCR engine may be coupled to a flatbed scanner that scans pages of text. Because the page is placed flush against the scanning surface, the resulting image typically exhibits uniform contrast and illumination, little skew and distortion, and high resolution, so the OCR engine can easily convert the text in the image into machine-editable text. When the image is poorer in terms of contrast, illumination, skew, and the like, however, the performance of the OCR engine may be degraded, and processing time may increase because all pixels in the image are processed. Such situations may arise, for example, when images are taken from books or generated with an image-based scanner: in these cases, the text is captured from a distance, at a varying orientation, and under changing illumination. Even when the scanning process itself is good, OCR engine performance may be degraded if the scanned page of text is of relatively poor quality.
This background section is provided to introduce a brief background regarding the following summary and detailed description. This background section is not intended to be used as an aid in determining the scope of the claimed subject matter, nor should it be viewed as limiting the claimed subject matter to implementations that solve any or all of the disadvantages or problems identified above.
Disclosure of Invention
Line segmentation in an OCR process is performed by extracting features from the input to locate breaks and then classifying each break into one of two break classes, inter-word breaks and inter-character breaks, to detect the locations of words in the input text line image. An output comprising bounding boxes containing the detected words, together with the probability that a given break belongs to its identified class, may then be provided to downstream OCR or other components for post-processing. Advantageously, the line segmentation problem is reduced to feature extraction, which yields the location of each break and a number of break features, followed by break classification.
In an illustrative example, a line segmentation engine implementing a characterization (featurization) component and a break classifier is configured in an architecture without word recognition capabilities. In this architecture, the line segmentation engine is interposed between a pre-processing stage (e.g., one that produces an input grayscale text line image from a scanned document) and a separate word recognizer, which generally does not attempt to correct any inter-word break errors produced by the classifier. In an alternative architecture, the line segmentation engine and the word recognizer are deployed integrally. In the latter architecture, a word-breaking lattice is generated from the detected breaks for a given line of text. Each word in the lattice is recognized by the word recognizer, and word recognition features such as word confidence, character confidence, word frequency, grammar, and word length can be extracted. The extracted word and break features are then used by a beam search engine to select a more optimal line segmentation, since more information is available in the decision process than in the standalone architecture.
Different combinations of features can be extracted from the text line image for use in the characterization process, including absolute features, relative line features, relative break features, relative ink features, relative ink-to-ink features, relative break proximity features, and word recognition features. A variety of break classifiers may be used, including decision tree classifiers, AdaBoost classifiers, clustering classifiers, neural network classifiers, and iterative gradient descent classifiers.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
FIG. 1 shows a simplified functional block diagram of an illustrative line segmentation engine;
FIG. 2 shows an illustrative break classification example using an ink projection;
FIG. 3 shows illustrative classifications of features that may be used in the characterization phase of the current line segmentation process;
FIG. 4 shows an illustrative set of absolute features;
FIG. 5 shows an illustrative distribution of all break widths;
FIG. 6 shows a graphical representation of the baseline, mean line, and x-height for an illustrative word;
FIG. 7 shows an illustrative example of ink-to-ink features;
FIG. 8 shows an illustrative ink-to-ink width distribution;
FIG. 9 shows an illustrative set of relative line features;
FIG. 10 shows an illustrative set of relative break features;
FIG. 11 shows an illustrative set of relative ink features;
FIG. 12 shows an illustrative set of relative ink-to-ink features;
FIG. 13 shows an illustrative set of relative break proximity features;
FIG. 14 shows an illustrative set of word recognition features;
FIG. 15 shows an illustrative set of classifiers in which one or more classifiers may be used with one or more subsets of the features shown in FIG. 3;
FIG. 16 shows a first illustrative architecture for providing output by a line segmentation engine to an external word recognizer, for example as in an OCR system;
FIG. 17 shows a second illustrative architecture in which a line segmentation engine and word recognizer are deployed integrally; and
FIG. 18 is a simplified block diagram of an illustrative computer system, such as a Personal Computer (PC) or server, that can implement the present line segmentation process.
In the drawings, like reference numerals designate like elements.
Detailed Description
FIG. 1 shows an illustrative high-level line segmentation architecture 100, highlighting features of the line segmentation techniques herein. In this illustrative example, the line segmentation technique may be implemented with an engine represented by block 110 in architecture 100, which includes a characterization component 120 and a classifier 130; generally speaking, these components implement the characterization and classification algorithms, respectively. As shown, the input to the line segmentation engine 110 is a pre-processed grayscale image 140 of a single line of text. The input image is pre-processed to the extent necessary to remove background color variations and replace them with white. The foreground color, referred to as "ink", is converted to grayscale. The output of the line segmentation engine is a collection of one or more detected words 150, including word locations generally represented by bounding boxes 160 (i.e., one bounding box for each individual word) and an associated confidence factor 170 for each output bounding box.
Rather than directly detecting the coordinates of a bounding box for each word in a given text line image, the line segmentation technique herein classifies each break into one of two break classes. A break exists at a position in a text line image if a straight line can be drawn at that position from the top to the bottom of the line without encountering (i.e., "touching") any ink. An alternative way to illustrate the meaning of a break is to project the ink vertically: a break exists wherever the ink projection is empty (i.e., no ink is projected). This alternative illustration is shown in FIG. 2. The top row is an exemplary text line image 210, the middle row shows the ink projection 220, and the bottom row shows the breaks 230 where the ink projection is empty. As can be observed in FIG. 2, there are only two classes of breaks, namely inter-word breaks (represented by reference numeral 240) and inter-character breaks (represented by reference numeral 250).
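The projection-based break detection described above can be sketched as follows. This is an illustrative sketch only; the function name, threshold, and array conventions are assumptions rather than part of the disclosure:

```python
import numpy as np

def find_breaks(line_image, ink_threshold=128):
    """Locate breaks as runs of columns whose vertical ink projection
    is empty.  line_image is a 2-D grayscale array with a white (255)
    background; pixels darker than ink_threshold are treated as ink."""
    ink = line_image < ink_threshold      # boolean ink mask
    projection = ink.sum(axis=0)          # per-column ink pixel count
    empty = projection == 0               # True where no ink is projected
    breaks, start = [], None
    for x, is_empty in enumerate(empty):
        if is_empty and start is None:
            start = x                     # a break begins here
        elif not is_empty and start is not None:
            breaks.append((start, x))     # end-exclusive column span
            start = None
    if start is not None:
        breaks.append((start, len(empty)))
    return breaks
```

Each returned span would then be classified as an inter-word or inter-character break, and its width (end minus start) is the basic absolute feature discussed below.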
Advantageously, the complexity of the line segmentation problem can be reduced by extracting text line image features comprising the location of each break and a number of break features. Experience has shown that the line segmentation process here does not lose generality; for example, the technique may be applied to Latin, Cyrillic, Greek, and East Asian scripts. Accordingly, the line segmentation engine 110 of FIG. 1 implements break classification in characterization and classification phases, and returns the target class for each break along with the probability that the break belongs to that class. It should be noted that to achieve satisfactory line segmentation with the techniques herein, not all inter-character breaks 250 must be found; rather, only a relatively high percentage (e.g., 99.5%, determined experimentally) of inter-word breaks 240 is needed. Inaccuracies produced by the line segmentation process, such as false-positive inter-word misclassifications, can be addressed in a post-processing step called "soft word breaking", implemented as part of the word recognizer component. The characterization and classification phases of the present technique are discussed in turn below.
Characterization — characterization can be defined as the process of extracting numerical features from an input text line image. Under this definition, the characterization process is well known and conceptually straightforward. However, as exemplified in FIGS. 3-5 and 8-14, certain features may be used to particular advantage with the line segmentation process herein. More specifically, FIG. 3 shows feature categories 300 that may be used during the characterization phase. It is emphasized that the particular features used in any given scenario may vary; not all of the features shown and described need be used in every situation. A subset of these features may better suit the needs of a particular implementation of the line segmentation process.
As shown in FIG. 3, the characterization component 120 can employ a variety of features that fall into different feature categories. These include absolute features (indicated by reference numeral 300₁), relative line features 300₂, relative break features 300₃, relative ink features 300₄, relative ink-to-ink features 300₅, relative break proximity features 300₆, and word recognition features 300ₙ.
An illustrative absolute feature set 400 is shown in FIG. 4. The set 400 includes the break width, in pixels, of each break in a given text line image (indicated by reference numeral 400₁). A distribution 400₂ of all break widths, in pixels, may also be used. As shown in FIG. 5, the distribution 400₂ of all break widths may include the 90th percentile of the distribution 500₁, the 50th percentile of the distribution 500₂, the 10th percentile of the distribution 500₃, and the number of breaks in the text line image 500ₙ.
Returning to FIG. 4, the absolute feature set 400 also includes the x-height 400₃, defined as the difference between the baseline and the mean line, where the baseline is the line on which most characters in the text line image "sit" and the mean line is the line from which most characters "hang". The mean line, x-height, and baseline defined above are shown in FIG. 6 using the blue, green, and red lines indicated by reference numerals 610, 620, and 630, respectively.
The absolute feature set 400 also includes the stroke width in pixels 400₄, the text line image height in pixels 400₅, the text line image width in pixels 400₆, the total break width 400₇ (the sum of all break widths, in pixels), the ink width in pixels 400₈, the ink height in pixels 400₉, the distribution of ink-to-ink widths in pixels 400₁₀, and the ink-to-ink area 400ₙ.
For the ink-to-ink features (400₁₀ and 400ₙ), attention is directed to FIG. 7, in which the first word (indicated by reference numeral 700) from the text line image of FIG. 2 is shown in an enlarged view. This example considers the first and third breaks (both inter-character breaks, although the same description applies to any inter-word break). The first and third breaks are shown in red (for better visibility, in two shades of red, indicated by reference numerals 710 and 720). An ink-to-ink line is a horizontal line connecting the two pieces of ink on either side of a given break, but only along pixel rows common to both pieces of ink. Thus, for example, the purple line 730 in the third break is not a valid ink-to-ink line, because the upper-right ink pixel has no left counterpart, so the purple line 730 crosses the green line 740 (the break boundary). Furthermore, to remain valid, an ink-to-ink line cannot cross other breaks. For example, the blue line 750 in the third break would cross the green break boundary line; therefore, although there are corresponding ink pixels on both the left and the right, the blue line 750 is not a valid ink-to-ink line.
Once ink-to-ink lines are defined, a distribution of ink-to-ink line widths may be established for each break. It can be observed that the 0th percentile (minimum) of the ink-to-ink line width is typically greater than or equal to the actual break width. This is illustrated in FIG. 7 with the first break 710. Accordingly, as shown in FIG. 8, within the absolute features 300₁, the ink-to-ink width distribution 400₁₀ may include the 100th percentile of the distribution as the maximum (indicated by reference numeral 800₁), the 90th percentile of the distribution 800₂, the 50th percentile of the distribution as the median 800₃, the 10th percentile of the distribution 800₄, and the 0th percentile of the distribution as the minimum 800ₙ.
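The percentile summaries used for both the break-width distribution of FIG. 5 and the ink-to-ink width distribution of FIG. 8 can be computed as in the following sketch (the function and key names are illustrative assumptions, not from the disclosure):

```python
import numpy as np

def width_distribution_features(widths):
    """Summarize a width distribution (break widths or ink-to-ink line
    widths) by the percentiles named in the text: the 100th (maximum),
    90th, 50th (median), 10th, and 0th (minimum), plus the count."""
    w = np.asarray(widths, dtype=float)
    return {
        "p100": float(np.percentile(w, 100)),
        "p90": float(np.percentile(w, 90)),
        "p50": float(np.percentile(w, 50)),
        "p10": float(np.percentile(w, 10)),
        "p0": float(np.percentile(w, 0)),
        "count": int(w.size),
    }
```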
As shown in FIG. 9, the relative line features 300₂ may include an estimated number of characters 900₁ in the input text line image. This number approximates the number of characters in the text line image and is calculated as (text line image width − total break width) / x-height. The relative line features 300₂ may also include the break count per estimated character 900₂, calculated as the number of breaks / estimated number of characters in a given text line image.
The relative line features 300₂ may also include the total break width relative to the line width 900₃, calculated as total break width / text line image width. The median break width relative to the x-height 900ₙ may also be included in the relative line features 300₂; it is calculated as the 50th percentile of the break-width distribution / x-height.
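The relative line features just described reduce to a few ratios, sketched below (a hedged illustration; the names are invented for this example):

```python
import statistics

def relative_line_features(image_width, total_break_width, x_height,
                           break_widths):
    """Relative line features as described above: estimated character
    count, break count per estimated character, total break width per
    line width, and median break width per x-height."""
    est_chars = (image_width - total_break_width) / x_height
    return {
        "estimated_chars": est_chars,
        "breaks_per_char": len(break_widths) / est_chars,
        "break_width_per_line_width": total_break_width / image_width,
        "median_break_per_x_height":
            statistics.median(break_widths) / x_height,
    }
```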
As shown in FIG. 10, the relative break features 300₃ may include the break width relative to the x-height 1000₁, calculated as break width / x-height. They may also include the break width relative to the 90th percentile of the break-width distribution 1000₂, calculated as break width / 90th percentile of the distribution; the break width relative to the 50th percentile of the distribution 1000₃, calculated as break width / 50th percentile; and the break width relative to the 10th percentile of the distribution 1000₄, calculated as break width / 10th percentile.
The relative break features 300₃ may also include the break width relative to the width of the previous break 1000₅, where −1 is used for the first break in a given text line image, and the break width relative to the width of the next break 1000ₙ, where −1 is used for the last break in a given text line image.
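The relative break features for a single break can likewise be sketched; the −1.0 sentinel for a missing previous or next break follows the convention stated above (names and signatures are assumptions):

```python
def relative_break_features(widths, i, x_height, p90, p50, p10):
    """Relative break features for break i: its width relative to the
    x-height, to the 90th/50th/10th percentiles of the break-width
    distribution, and to the previous and next break widths; -1.0 is
    used where no previous or next break exists."""
    w = widths[i]
    prev_ratio = w / widths[i - 1] if i > 0 else -1.0
    next_ratio = w / widths[i + 1] if i + 1 < len(widths) else -1.0
    return {
        "per_x_height": w / x_height,
        "per_p90": w / p90,
        "per_p50": w / p50,
        "per_p10": w / p10,
        "per_prev_break": prev_ratio,
        "per_next_break": next_ratio,
    }
```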
As shown in FIG. 11, the relative ink features 300₄ may include the distance from the bottom of the ink to the baseline relative to the x-height 1100₁, calculated as (distance from ink bottom to baseline) / x-height, and the distance from the top of the ink to the x-height line relative to the x-height 1100ₙ, calculated as (distance from ink top to the x-height line) / x-height.
As shown in FIG. 12, the relative ink-to-ink features 300₅ may include the 100th percentile of the ink-to-ink width distribution relative to the x-height 1200₁, the 90th percentile relative to the x-height 1200₂, the 60th percentile relative to the x-height 1200₃, the 10th percentile relative to the x-height 1200₄, and the 0th percentile relative to the x-height 1200₅; the 100th percentile of the ink-to-ink width distribution relative to the median break width 1200₆, the 90th percentile relative to the median break width 1200₇, the 60th percentile relative to the median break width 1200₈, the 10th percentile relative to the median break width 1200₉, and the 0th percentile relative to the median break width 1200₁₀; and the ink-to-ink area relative to the valid ink-to-ink height 1200ₙ.
As shown in FIG. 13, the relative break proximity features 300₆ may include the widths of the surrounding (previous and next) breaks relative to the x-height 1300₁, and the widths of the surrounding (previous and next) breaks relative to the median break width 1300ₙ.
As shown in FIG. 14, the word recognition features 300ₙ may include word confidence 1400₁, a character confidence for each character in a word 1400₂, the word frequency reported by the particular language model used 1400₃, advanced language-model features 1400₄ (e.g., grammar, indicating whether a given set of words follows certain grammatical rules, or a probability when an exact determination is not possible), and word length in characters 1400ₙ.
Classification — in the classification phase, one or more of the classifiers shown in FIG. 15 can be used with one or more of the features described above. These classifiers include, for example, a decision tree classifier 1500₁, an AdaBoost classifier 1500₂ (typically implemented on top of decision tree classifiers), a clustering classifier 1500₃ such as FCM (fuzzy C-means) or K-means, a neural network classifier 1500₄, and an iterative gradient descent classifier 1500ₙ. In some usage scenarios, a classifier may be trained to penalize false-positive inter-word classifications, at the cost of more false negatives. It should also be noted that all of the enumerated classifiers can provide a confidence level associated with assigning a break to one of the two break classes.
The classifiers 1500 may also be trained using results from engines located upstream and downstream in the OCR system pipeline, to improve end-to-end accuracy. Alternatively, a classifier 1500 may be trained in a standalone fashion; in this case, the engines in the OCR system are trained on the same labeled data. The latter technique is generally expected to provide the best general-purpose accuracy for applications outside of OCR systems, such as handwriting line segmentation.
The first four classifiers 1500₁₋₄ may be implemented in a conventional manner and need not be discussed further here. The iterative gradient descent classifier 1500ₙ, however, is described further below.
Suppose a set of breaks is given, each to be classified as either an inter-word break (BW) or an inter-character break (BC). The set is ordered, meaning that a break with a higher index occurs after a break with a lower index. This observation allows the set of breaks to be treated as a sequence, so the problem of classifying each break independently is transformed into the problem of finding the most likely break sequence. This can be done using well-known Hidden Markov Model (HMM) techniques.
To use HMM techniques, states and transition probabilities are defined here. Given a set of break features, these probabilities can be defined as:

where bc and bw are the median inter-character break width and the median inter-word break width, respectively.
Unfortunately, the median values are not known before classification. To address this problem, an iterative gradient descent technique may be used. In the first iteration, it may be assumed that bc = b_min and bw = b_max. Under this assumption, the most likely sequence can be found using the well-known Viterbi algorithm. Once the most likely sequence is found, the actual medians bc₁ and bw₁ can be calculated from the Viterbi result. The old values are then updated according to the following rule:
where η is the learning rate.
After updating the medians, the most likely sequence is calculated again, the medians are updated again, and so on. Once the medians are stable (i.e., no longer changing), the iterative process ends, and the most likely sequence at that point is the final result of the classification process.
Once the classification process has finished, a verification step may be performed. In this step, word lengths are computed from the classification result. The presence of an excessively long word, or of too many short words, indicates that the iterative gradient descent algorithm converged to a wrong minimum because the initial medians were poorly chosen. In that case, the initial medians are changed and the algorithm is run again, repeating until verification passes.
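Assuming break width is the only feature (as in the simple example below), the iterative scheme can be sketched as follows. The nearest-median labeling here stands in for the Viterbi decoding of the full HMM, so this is an illustrative simplification rather than the disclosed implementation:

```python
import statistics

def classify_breaks(widths, eta=0.5, max_iter=50, tol=1e-6):
    """Iteratively estimate the median inter-character width bc and the
    median inter-word width bw: initialize bc at the minimum width and
    bw at the maximum, label each break by the nearer median, recompute
    the medians from the labels, and blend old and new values with
    learning rate eta until both medians stabilize."""
    bc, bw = min(widths), max(widths)
    labels = ["BC"] * len(widths)
    for _ in range(max_iter):
        labels = ["BW" if abs(w - bw) < abs(w - bc) else "BC"
                  for w in widths]
        bc_obs = [w for w, lab in zip(widths, labels) if lab == "BC"]
        bw_obs = [w for w, lab in zip(widths, labels) if lab == "BW"]
        if not bc_obs or not bw_obs:
            break                      # degenerate labeling; stop early
        new_bc = bc + eta * (statistics.median(bc_obs) - bc)
        new_bw = bw + eta * (statistics.median(bw_obs) - bw)
        converged = abs(new_bc - bc) < tol and abs(new_bw - bw) < tol
        bc, bw = new_bc, new_bw
        if converged:
            break
    return labels, bc, bw
```

Wide breaks are labeled inter-word ("BW") and narrow ones inter-character ("BC"); the verification step described above would then inspect the implied word lengths and restart with different initial medians if needed.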
As a simple example, suppose the only feature used is the break width in pixels. The probabilities may then be defined as:
FIG. 16 shows a first illustrative architecture 1600 in which the line segmentation engine 110 (including the characterization component 120 and the break classifier 130) provides output to an independent word recognizer 1610, for example in an OCR system. As shown in FIG. 16, the input to the word recognizer 1610 is generated using a break-to-bounding-box transformation process, as indicated by reference numeral 1620. In this embodiment, the word recognizer 1610 does not attempt to correct inter-word break errors. However, in some applications, the word recognizer 1610 may still correct some inter-character break errors, depending on whether and what type of "soft break" implementation is used. In architecture 1600, the line segmentation/break classification process uses only the information (i.e., the extracted features) contained in the input text line image itself to detect the individual words in the text line.
FIG. 17 shows a second illustrative architecture 1700 in which the line segmentation engine 110 and the word recognizer 1710 are deployed integrally, for example when word recognition features are available. In this embodiment, the word recognizer 1710 provides results for the entire word-breaking lattice of a given line of text (produced by the word-breaking lattice engine 1720), while the beam search provided by engine 1730 produces the final result 1740, in which all words in the given line of text are segmented and recognized. The word-breaking lattice engine 1720 and the beam search engine 1730 may be implemented in a conventional manner.
The word recognizer 1710 may generally support capabilities such as word confidence, character confidence, word frequency, grammar, and word length. By using this information, the line segmentation process has much more information available (compared to architecture 1600 discussed above) before making the final line segmentation decision, since the actual results of downstream OCR processing of the text line in question are available here. In other words, the beam search engine is provided with several candidate results, each likely to be correct, from which it can pick a more optimal line segmentation.
FIG. 18 is a simplified block diagram of an illustrative computer system 1800, such as a personal computer or server, on which the line segmentation process herein may be implemented. The computer system 1800 includes a processing unit 1805, a system memory 1811, and a system bus 1814 that couples various system components including the system memory 1811 to the processing unit 1805. The system bus 1814 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory 1811 includes Read Only Memory (ROM) 1817 and Random Access Memory (RAM) 1821. A basic input/output system (BIOS) 1825, containing the basic routines that help to transfer information between elements within the computer system 1800, such as during start-up, is stored in ROM 1817. The computer system 1800 may also include a hard disk drive 1828 for reading from and writing to an internally disposed hard disk (not shown), a magnetic disk drive 1830 for reading from or writing to a removable magnetic disk 1833 (e.g., a floppy disk), and an optical disk drive 1838 for reading from or writing to a removable optical disk 1843 such as a CD (compact disc), DVD (digital versatile disc), or other optical media. The hard disk drive 1828, magnetic disk drive 1830, and optical disk drive 1838 are connected to the system bus 1814 by a hard disk drive interface 1846, a magnetic disk drive interface 1849, and an optical drive interface 1852, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer system 1800. 
Although the illustrative example shows a hard disk, a removable magnetic disk 1833, and a removable optical disk 1843, other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, data cartridges, Random Access Memories (RAMs), Read Only Memories (ROMs), and the like, may also be used in some applications of the line segmentation process herein. Further, the term computer-readable media as used herein includes one or more instances of a media type (e.g., one or more diskettes, one or more CDs, etc.).
A number of program modules can be stored on the hard disk, magnetic disk 1833, optical disk 1843, ROM1817, or RAM1821, including an operating system 1855, one or more application programs 1857, other program modules 1860, and program data 1863. A user may enter commands and information into the computer system 1800 through input devices such as a keyboard 1866 and a pointing device 1868, such as a mouse. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 1805 through a serial port interface 1871 that is coupled to the system bus 1814, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus ("USB"). A monitor 1873 or other type of display device is also connected to the system bus 1814 via an interface, such as a video adapter 1875. In addition to the monitor 1873, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The illustrative example shown in FIG. 18 also includes a host adapter 1878, a Small Computer System Interface (SCSI) bus 1883, and an external storage device 1886 connected to the SCSI bus 1883.
The computer system 1800 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 1888. The remote computer 1888 may be selected to be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer system 1800, although only a single typical remote memory/storage device 1890 has been illustrated in fig. 18. The logical connections depicted in FIG. 18 include a local area network ("LAN") 1893 and a wide area network ("WAN") 1895. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer system 1800 is connected to the local network 1893 through a network interface or adapter 1896. When used in a WAN networking environment, the computer system 1800 typically includes a broadband modem 1898, network gateway, or other means for establishing communications over the wide area network 1895, such as the Internet. The broadband modem 1898, which may be internal or external, is connected to the system bus 1814 via the serial port interface 1871. In a networked environment, program modules depicted relative to the computer system 1800, or portions thereof, may be stored in the remote memory storage device 1890. It should be noted that the network connections shown in FIG. 18 are illustrative and other means of establishing a communications link between the computers may be used depending on the particular requirements of the row partitioning application.
Although the subject matter herein has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (30)

1. A method for segmenting words from a text line image, the method comprising the steps of:
extracting features from the text line image using a characterization component;
calculating breaks using the extracted features;
using a break classifier to classify each break into a classification, the classifications including an inter-word break classification and an inter-character break classification, and determining a probability that the classified break is a member of the classification; and
segmenting words from the text line image using the breaks and probabilities.
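The pipeline claimed above can be illustrated with a minimal sketch. This is not the patent's implementation: all names are invented, the "classifier" is a toy threshold stand-in for the trained break classifier, and ink is reduced to one-dimensional pixel spans.

```python
# Hedged sketch of the claimed pipeline (illustrative names, not from the
# patent): gaps ("breaks") between ink spans on a text line are classified
# as inter-word or inter-character with a probability, and words are cut
# at the inter-word breaks.

def compute_breaks(ink_spans):
    """Return (position, width) of each gap between consecutive ink spans."""
    breaks = []
    for (s0, e0), (s1, e1) in zip(ink_spans, ink_spans[1:]):
        breaks.append((e0, s1 - e0))  # gap starts where the previous span ends
    return breaks

def classify_break(width, x_height):
    """Toy stand-in for the break classifier: label plus membership probability."""
    # Assumption for illustration: gaps wider than half the x-height lean
    # toward inter-word breaks.
    ratio = width / x_height
    p_word = min(1.0, max(0.0, ratio))  # crude monotone probability
    if p_word > 0.5:
        return "inter-word", p_word
    return "inter-character", 1.0 - p_word

def segment_words(ink_spans, x_height):
    """Split the span list into words at breaks classified as inter-word."""
    words, current = [], [ink_spans[0]]
    for span, (_, width) in zip(ink_spans[1:], compute_breaks(ink_spans)):
        label, _prob = classify_break(width, x_height)
        if label == "inter-word":
            words.append(current)
            current = [span]
        else:
            current.append(span)
    words.append(current)
    return words

# Ink spans as (start, end) pixel columns; the narrow gaps join characters,
# the wide gap separates two words.
spans = [(0, 10), (12, 22), (40, 50), (52, 62)]
print(segment_words(spans, x_height=12))  # two words of two spans each
```

In the claimed method the threshold heuristic would be replaced by a trained classifier (claim 12), and the returned probabilities would feed the downstream segmentation decision rather than being discarded.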
2. The method of claim 1, wherein the extracted features are selected from absolute features, relative line features, relative break features, relative ink features, relative break proximity features, or word recognition features.
3. The method of claim 2, wherein the absolute features are selected from one or more of: a break width in pixels, a distribution of all break widths in pixels, an x-height in pixels, a stroke width in pixels, a text line image height in pixels, a text line image width in pixels, a total break width in pixels, an ink height in pixels, an ink-to-ink width in pixels, a distribution of ink-to-ink widths in pixels, or an ink-to-ink area.
4. The method of claim 3, wherein the distribution of all break widths comprises at least one of: the 90th percentile of the distribution, the 50th percentile of the distribution, the 10th percentile of the distribution, or the number of breaks in the text line image.
5. The method of claim 3, wherein the ink-to-ink width distribution comprises at least one of: the 100th percentile of the distribution, the 90th percentile of the distribution, the 50th percentile of the distribution, the 10th percentile of the distribution, or the 0th percentile of the distribution.
6. The method of claim 2, wherein the relative line features are selected from one or more of: the number of estimated characters, the number of breaks relative to the number of estimated characters, the total of all break widths relative to the line width, or the median break width relative to the x-height.
7. The method of claim 2, wherein the relative break features are selected from one or more of: the break width relative to the x-height, the break width relative to the 90th percentile of the break-width distribution, the break width relative to the 50th percentile of the break-width distribution, the break width relative to the 10th percentile of the break-width distribution, the break width relative to the previous break width, or the break width relative to the next break width.
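The percentile-based features in claims 4 and 7 can be illustrated with a short sketch. This is an invented example, not the patent's implementation; it uses a simple nearest-rank percentile and normalizes each break width by the x-height and by the distribution's percentiles.

```python
# Illustrative computation (not the patent's implementation) of the
# percentile-based break features: percentiles of the break-width
# distribution, and each break width normalized by them.

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (0 <= p <= 100)."""
    ordered = sorted(values)
    if p <= 0:
        return ordered[0]
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]

def break_features(widths, x_height):
    """Relative break features for each break width on the line."""
    p10, p50, p90 = (percentile(widths, p) for p in (10, 50, 90))
    return [
        {
            "w_over_xheight": w / x_height,  # break width relative to x-height
            "w_over_p90": w / p90,           # relative to 90th percentile
            "w_over_p50": w / p50,           # relative to median break width
            "w_over_p10": w / p10,           # relative to 10th percentile
        }
        for w in widths
    ]

feats = break_features([2, 3, 4, 5, 20], x_height=10)
print(feats[-1]["w_over_p50"])  # widest break is 5.0x the median width
```

Normalizing by the line's own break-width distribution is what makes these features "relative": a 20-pixel gap is wide on a tightly set line but unremarkable on a loosely set one.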
8. The method of claim 2, wherein the relative ink features are selected from one or more of: the distance from the ink bottom to the baseline relative to the x-height, or the distance from the ink top to the x-height line relative to the x-height.
9. The method of claim 2, wherein the relative ink features are selected from one or more of: the 100th percentile of the ink-to-ink width distribution relative to the x-height, the 90th percentile of the ink-to-ink width distribution relative to the x-height, the 60th percentile of the ink-to-ink width distribution relative to the x-height, the 10th percentile of the ink-to-ink width distribution relative to the x-height, the 0th percentile of the ink-to-ink width distribution relative to the x-height, the 100th percentile of the ink-to-ink width distribution relative to the median break width, the 90th percentile of the ink-to-ink width distribution relative to the median break width, the 60th percentile of the ink-to-ink width distribution relative to the median break width, the 10th percentile of the ink-to-ink width distribution relative to the median break width, the 0th percentile of the ink-to-ink width distribution relative to the median break width, or the ink-to-ink area relative to the effective ink-to-ink height.
10. The method of claim 2, wherein the relative break proximity features are selected from one or more of: a surrounding break width relative to the x-height, or a surrounding break width relative to the median break width.
11. The method of claim 2, wherein the word recognition features are selected from one or more of: word confidence, character confidence for each character in a word, word frequency reported by a language model, advanced language model features, or word length in characters.
12. The method of claim 1, wherein the break classifier is selected from one of: a decision tree classifier, an AdaBoost classifier configured as an ensemble classifier, a neural network classifier, or an iterative gradient descent classifier on top of the decision tree classifier.
13. The method of claim 1, wherein the break classifier is trained with results provided by an engine located upstream or downstream of the characterization component and the break classifier.
14. The method of claim 1, wherein the break classifier is trained with an independent scope implementation.
15. The method of claim 1, further comprising the steps of: extracting word features from words of the text line image, the word features including at least one of: word confidence, character confidence, word frequency, grammar, or word length; and selecting a line segmentation process using the features extracted by the characterization component and the extracted word features.
16. An apparatus for segmenting words from a text line image, the apparatus comprising:
means for extracting features from the text line image using a characterization component;
means for calculating breaks using the extracted features;
means for using a break classifier to classify each break into a classification and to determine a probability that the classified break is a member of the classification, the classifications including an inter-word break classification and an inter-character break classification; and
means for segmenting words from the text line image using the breaks and probabilities.
17. The apparatus of claim 16, wherein the extracted features are selected from absolute features, relative line features, relative break features, relative ink features, relative break proximity features, or word recognition features.
18. The apparatus of claim 17, wherein the absolute features are selected from one or more of: a break width in pixels, a distribution of all break widths in pixels, an x-height in pixels, a stroke width in pixels, a text line image height in pixels, a text line image width in pixels, a total break width in pixels, an ink height in pixels, an ink-to-ink width in pixels, a distribution of ink-to-ink widths in pixels, or an ink-to-ink area.
19. The apparatus of claim 18, wherein the distribution of all break widths comprises at least one of: the 90th percentile of the distribution, the 50th percentile of the distribution, the 10th percentile of the distribution, or the number of breaks in the text line image.
20. The apparatus of claim 18, wherein the ink-to-ink width distribution comprises at least one of: the 100th percentile of the distribution, the 90th percentile of the distribution, the 50th percentile of the distribution, the 10th percentile of the distribution, or the 0th percentile of the distribution.
21. The apparatus of claim 17, wherein the relative line features are selected from one or more of: the number of estimated characters, the number of breaks relative to the number of estimated characters, the total of all break widths relative to the line width, or the median break width relative to the x-height.
22. The apparatus of claim 17, wherein the relative break features are selected from one or more of: the break width relative to the x-height, the break width relative to the 90th percentile of the break-width distribution, the break width relative to the 50th percentile of the break-width distribution, the break width relative to the 10th percentile of the break-width distribution, the break width relative to the previous break width, or the break width relative to the next break width.
23. The apparatus of claim 17, wherein the relative ink features are selected from one or more of: the distance from the ink bottom to the baseline relative to the x-height, or the distance from the ink top to the x-height line relative to the x-height.
24. The apparatus of claim 17, wherein the relative ink features are selected from one or more of: the 100th percentile of the ink-to-ink width distribution relative to the x-height, the 90th percentile of the ink-to-ink width distribution relative to the x-height, the 60th percentile of the ink-to-ink width distribution relative to the x-height, the 10th percentile of the ink-to-ink width distribution relative to the x-height, the 0th percentile of the ink-to-ink width distribution relative to the x-height, the 100th percentile of the ink-to-ink width distribution relative to the median break width, the 90th percentile of the ink-to-ink width distribution relative to the median break width, the 60th percentile of the ink-to-ink width distribution relative to the median break width, the 10th percentile of the ink-to-ink width distribution relative to the median break width, the 0th percentile of the ink-to-ink width distribution relative to the median break width, or the ink-to-ink area relative to the effective ink-to-ink height.
25. The apparatus of claim 17, wherein the relative break proximity features are selected from one or more of: a surrounding break width relative to the x-height, or a surrounding break width relative to the median break width.
26. The apparatus of claim 17, wherein the word recognition features are selected from one or more of: word confidence, character confidence for each character in a word, word frequency reported by a language model, advanced language model features, or word length in characters.
27. The apparatus of claim 16, wherein the break classifier is selected from one of: a decision tree classifier, an AdaBoost classifier configured as an ensemble classifier, a neural network classifier, or an iterative gradient descent classifier on top of the decision tree classifier.
28. The apparatus of claim 16, wherein the break classifier is trained with results provided by an engine located upstream or downstream of the characterization component and the break classifier.
29. The apparatus of claim 16, wherein the break classifier is trained with an independent scope implementation.
30. The apparatus of claim 16, further comprising:
means for extracting word features from words of the text line image, the word features including at least one of: word confidence, character confidence, word frequency, grammar, or word length; and
means for selecting a line segmentation process using the features extracted by the characterization component and the extracted word features.
HK13106061.6A 2010-03-30 2011-03-24 Method and apparatus for segmenting words from a textual line image HK1179387B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12/749,599 US8345978B2 (en) 2010-03-30 2010-03-30 Detecting position of word breaks in a textual line image
US12/749,599 2010-03-30
PCT/US2011/029752 WO2011126755A2 (en) 2010-03-30 2011-03-24 Detecting position of word breaks in a textual line image

Publications (2)

Publication Number Publication Date
HK1179387A1 HK1179387A1 (en) 2013-09-27
HK1179387B true HK1179387B (en) 2016-12-02
