
CN112507782B - Text image recognition method and device - Google Patents

Text image recognition method and device

Info

Publication number
CN112507782B
CN112507782B (granted publication of application CN202011138322.4A; related publication CN112507782A)
Authority
CN
China
Prior art keywords
text
image
boundary
target
gray
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011138322.4A
Other languages
Chinese (zh)
Other versions
CN112507782A (en)
Inventor
林涛
潘甜甜
黄伟如
金成伟
郑建飞
赵仕嘉
董浩欣
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State-owned Assets Supervision and Administration Commission of the State Council
Original Assignee
State-owned Assets Supervision and Administration Commission of the State Council
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State-owned Assets Supervision and Administration Commission of the State Council
Priority to CN202011138322.4A
Publication of CN112507782A
Application granted
Publication of CN112507782B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/90 Dynamic range modification of images or parts thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20024 Filtering details
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30176 Document
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Character Input (AREA)

Abstract


The present invention discloses a text image recognition method and device, including: preprocessing a text image bearing a target text to obtain a model input image; inputting the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image; clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set; inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of that bounding box; and determining the text recognition result of the text image according to each bounding box's character recognition result, coordinate information, and cluster set. The invention thus improves the recognition accuracy of text in a text image and ensures that characters on the same line of the target text are output on the same line when the text recognition result is produced.

Description

Text image recognition method and device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a text image recognition method and apparatus.
Background
OCR (Optical Character Recognition) is a common technique in the field of image processing, which analyzes text images (e.g., images of printed matter such as notes, newspapers, books, etc.) containing text by various analysis algorithms, thereby converting the text contained in the text images into text information that is more convenient for computer storage and processing. Currently, OCR technology is widely used in many scenarios where a large amount of printed matter material (e.g., corporate qualification certificates, employee qualification certificates, notes, archive volumes, etc.) needs to be processed.
In practice, OCR systems typically determine the ordinate (vertical coordinate) of each character in the text image through an analysis algorithm. Characters on the same line of text generally share roughly the same ordinate, so characters with the same ordinate can be output as one line when the text information is produced. However, some characters (for example, the characters giving the university and the major on some diplomas) may be shifted vertically: the ordinate of such a shifted character is larger than that of the other characters on its line but smaller than that of the characters on the next line, so the character is easily output as a new line, yielding incorrect text information. Improving the recognition accuracy of text in text images is therefore particularly important.
Disclosure of Invention
The invention addresses this problem by providing a text image recognition method and apparatus that cluster all characters according to the coordinate information of each character, so that characters on the same line of the target text fall into one cluster set, and then determine the text recognition result of the text image from each character's recognition result, coordinate information, and cluster set. This improves the recognition accuracy of text in the text image and ensures that characters on the same line of the target text are output on the same line when the text recognition result is produced.
In order to solve the technical problem, a first aspect of the present invention discloses a text image recognition method, which includes:
preprocessing a text image bearing a target text to obtain a model input image;
inputting the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image, wherein the text detection result comprises at least one bounding box selecting a target image area in the text image, each bounding box at least comprises coordinate information representing the position of the target image area in the text image, and a target image area is the image area in the text image where a single character of the target text is located;
clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, wherein the characters corresponding to all the bounding boxes in each cluster set are characters on the same line of the target text;
inputting the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of that bounding box;
and determining the text recognition result of the text image according to the character recognition result, coordinate information, and bounding box cluster set of each bounding box.
The second aspect of the present invention discloses a text image recognition apparatus, the apparatus comprising:
a preprocessing module, configured to preprocess the text image bearing the target text to obtain a model input image;
a first analysis module, configured to input the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image, wherein the text detection result comprises at least one bounding box selecting a target image area in the text image, each bounding box at least comprises coordinate information representing the position of the target image area in the text image, and a target image area is the image area in the text image where a single character of the target text is located;
a clustering module, configured to cluster all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, wherein the characters corresponding to all the bounding boxes in each cluster set are characters on the same line of the target text;
a second analysis module, configured to input the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of that bounding box;
and a determining module, configured to determine the text recognition result of the text image according to the character recognition result, coordinate information, and bounding box cluster set of each bounding box.
In a third aspect, the present invention discloses another text image recognition apparatus, the apparatus comprising:
A memory storing executable program code;
a processor coupled to the memory;
the processor invokes the executable program code stored in the memory to perform some or all of the steps in the method for identifying a text image as disclosed in the first aspect of the present invention.
A fourth aspect of the invention discloses a computer storage medium storing computer instructions which, when invoked, are adapted to perform part or all of the steps of the method of identifying a text image as disclosed in the first aspect of the invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, the coordinate information of each character in the text image bearing the target text is first obtained; all characters are then clustered according to that coordinate information, so that characters on the same line of the target text fall into one cluster set; the recognition result of each character is then obtained; and finally the text recognition result of the image is determined from each character's recognition result, coordinate information, and cluster set. The method and apparatus therefore improve the recognition accuracy of text in the text image and ensure that characters on the same line of the target text are output on the same line when the text recognition result is produced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a text image recognition method disclosed in an embodiment of the present invention;
FIG. 2 is a flow chart of another method for recognizing text images according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a text image recognition device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another text image recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a still another text image recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The terms "first", "second", and the like in the description, the claims, and the above figures are used to distinguish different objects, not to describe a particular sequence or chronological order. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, product, or device that comprises a list of steps or elements is not limited to those listed, but may optionally include other steps or elements not listed or inherent to such a process, method, product, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The invention discloses a text image recognition method and a text image recognition apparatus. Coordinate information for each character in a text image bearing a target text is obtained; all characters are clustered according to that information, so that characters on the same line of the target text fall into one cluster set; the recognition result of each character is obtained; and the text recognition result of the image is determined from each character's recognition result, coordinate information, and cluster set. This improves the recognition accuracy of text in text images and ensures that characters on the same line of the target text are output on the same line when the text recognition result is produced. The method and apparatus are described in detail below.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of a text image recognition method according to an embodiment of the present invention. As shown in fig. 1, the text image recognition method may include the following operations:
101. Preprocess the text image bearing the target text to obtain a model input image.
In step 101 above, the text image bearing the target text may be a scanned image or photograph of any certificate such as a diploma, a professional qualification certificate, an enterprise business license, or an enterprise qualification certificate. Preprocessing of the text image may include mean filtering, graying, binarization, alignment transformation, and the like. The preprocessing process in the embodiment of the present invention is described in the subsequent embodiments.
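The preprocessing steps mentioned above (graying, mean filtering, binarization) can be sketched in plain Python. The functions below are minimal illustrations of the named operations, not the patent's actual pipeline; the alignment transformation is omitted and all function names are invented:

```python
def to_gray(rgb):
    """Convert an H x W x 3 RGB image (nested lists) to grayscale with the usual luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]

def mean_filter(gray, k=3):
    """Naive k x k mean filter; border pixels average over the pixels that exist."""
    h, w = len(gray), len(gray[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [gray[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = sum(window) / len(window)
    return out

def binarize(gray, thresh=128):
    """Global threshold to a black-and-white image; a real pipeline might use Otsu's method."""
    return [[255 if v >= thresh else 0 for v in row] for row in gray]
```

In practice these steps would be done with an image-processing library; the sketch only shows the order of operations (gray, then smooth, then threshold).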
102. Input the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image.
In the step 102, the text detection result includes at least one bounding box for selecting a target image area in the text image, where each bounding box includes at least coordinate information for indicating a position of the target image area in the text image, and the target image area is an image area where a single text in the target text is located in the text image.
Alternatively, the text detection model may be the deep-learning model PixelLink, which forgoes bounding-box regression for detecting text-line boxes and instead uses instance segmentation, deriving the bounding box of each text line directly from the segmented text region. The algorithm proceeds as follows:
(1) The deep-learning model VGG16 is used as the feature-extraction network, and its output is split into two branches:
pixel segmentation: judging whether each pixel of the model input image is a text pixel or a non-text pixel;
link prediction: for each pixel of the model input image, predicting links to its eight neighbors; if a link is positive, the neighboring pixels are merged into the same text region, otherwise the link is discarded.
(2) By computing the minimum bounding rectangle, an oriented circumscribed rectangle (i.e., the bounding box of a target image area) is extracted for each character in the model input image, expressed as ((x, y), (w, h), θ) (i.e., the coordinate information of the bounding box), where (x, y) is the center of the rectangle, (w, h) its width and height, and θ its rotation angle.
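The ((x, y), (w, h), θ) representation can be unpacked into the four corner points of the oriented rectangle. A small sketch, assuming θ is in radians and measured counter-clockwise (the patent does not state the angle convention, and the function name is invented):

```python
import math

def rect_corners(cx, cy, w, h, theta):
    """Corners of an oriented rectangle ((x, y), (w, h), theta), theta in radians, CCW."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]
```

With θ = 0 this reduces to an axis-aligned box, which is the case the clustering step below relies on.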
103. Cluster all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set.
In step 103, the text corresponding to all the bounding boxes in each bounding box cluster set is the text in the same line in the target text.
In an alternative embodiment, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box, and the ordinate information of each bounding box includes a maximum ordinate and a minimum ordinate of the bounding box;
And clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, including:
determining an ordinate interval of each bounding box according to the maximum ordinate and the minimum ordinate of the bounding box;
Judging whether an intersection exists between the ordinate intervals of every two bounding boxes;
Dividing the two bounding boxes into the same bounding box cluster set when judging that the intersection exists between the ordinate intervals of the two bounding boxes;
And when judging that no intersection exists between the ordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
In this alternative embodiment, after obtaining the center coordinates, width, and height of the circumscribed rectangle (bounding box) output by the deep-learning model PixelLink, the maximum and minimum ordinates of each bounding box can be determined, and from them its ordinate interval. For example, suppose there are three bounding boxes with center coordinates (10, 10), (15, 12), and (20, 20), each of width 2 and height 4. Their ordinate intervals are then [8, 12], [10, 14], and [18, 22]. The two bounding boxes with intervals [8, 12] and [10, 14] are placed in the same bounding box cluster set, and the bounding box with interval [18, 22] is placed in another cluster set.
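The interval-overlap clustering described above can be sketched as a greedy merge. The patent does not specify a particular clustering algorithm, so this is one plausible reading; function names are illustrative:

```python
def ordinate_interval(cy, h):
    """Ordinate interval (min, max) of a box with center ordinate cy and height h."""
    return (cy - h / 2, cy + h / 2)

def cluster_boxes(boxes):
    """Greedily group boxes whose ordinate intervals intersect.

    boxes: list of (center_y, height) pairs.
    Returns a list of clusters, each a list of box indices.
    """
    clusters = []  # each entry: [merged_interval, indices]
    for i, (cy, h) in enumerate(boxes):
        lo, hi = ordinate_interval(cy, h)
        for entry in clusters:
            (clo, chi), idxs = entry
            if lo <= chi and clo <= hi:  # the two intervals intersect
                entry[0] = (min(clo, lo), max(chi, hi))
                idxs.append(i)
                break
        else:
            clusters.append([(lo, hi), [i]])
    return [idxs for _, idxs in clusters]
```

On the worked example (centers at y = 10, 12, 20, all of height 4), the first two boxes fall into one cluster and the third into another.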
Therefore, by implementing the alternative embodiment, whether the two bounding boxes are located in the bounding boxes of the characters in the same row can be judged according to whether the intersection exists between the ordinate intervals of the two bounding boxes, so that the bounding boxes of the characters in the same row can be divided into the same bounding box clustering set, and the clustering of the bounding boxes is realized.
104. Input the image in the target image area selected by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of that bounding box.
In step 104, the character recognition model may be the deep-learning model CRNN, which mainly consists of three types of layers, as follows:
(1) Convolutional layer: extracts features from the image in the target image region, e.g., converting an image of size (32, 100, 3) into a convolutional feature matrix of size (1, 25, 512).
(2) Recurrent layer: a deep bidirectional LSTM network further extracts character-sequence features on top of the convolutional feature matrix.
(3) Transcription layer: after the RNN output is passed through a softmax activation, the character with the highest probability at each step is selected as the character recognition output.
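In CRNN, the transcription step described above is typically CTC greedy decoding: take the argmax character at each time step, collapse consecutive repeats, and drop the blank symbol. A minimal sketch of that standard procedure (not necessarily the patent's exact implementation; names are illustrative):

```python
def ctc_greedy_decode(step_probs, alphabet, blank=0):
    """Greedy CTC decoding: argmax per time step, collapse repeats, drop blanks.

    step_probs: list of per-step probability (or score) lists, e.g. softmax outputs.
    alphabet: maps index -> character; index `blank` is the CTC blank symbol.
    """
    best = [max(range(len(step)), key=step.__getitem__) for step in step_probs]
    chars, prev = [], None
    for idx in best:
        if idx != blank and idx != prev:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)
```

For example, per-step argmaxes of [a, a, blank, b] collapse to the string "ab".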
105. Determine the text recognition result of the text image according to the character recognition result, coordinate information, and bounding box cluster set of each bounding box.
In another optional embodiment, determining the text recognition result of the text image according to the text recognition result, the coordinate information and the belonging boundary box cluster set of each boundary box includes:
Determining text coordinates of each bounding box according to the bounding box cluster set to which the bounding box belongs and coordinate information of the bounding box;
And determining the text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box.
In this alternative embodiment, the text coordinates represent the position, within the target text, of the character corresponding to the bounding box. They comprise at least a text vertical coordinate and a text horizontal coordinate, and bounding boxes belonging to the same cluster set share the same text vertical coordinate. For example, if the text coordinate of a bounding box is (3, 3), the corresponding character is the third character on the third line of the target text. After the character and text coordinate of each bounding box are determined, the recognized characters are arranged and combined according to their text coordinates to obtain the text of the whole text image.
By implementing this alternative embodiment, the text coordinates of each bounding box can be determined from its coordinate information, and the characters in the bounding boxes can then be ordered and combined according to those coordinates to obtain the recognized text of the entire text image.
Further optionally in this alternative embodiment, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box;
And determining text coordinates of each bounding box according to the bounding box cluster set to which the bounding box belongs and the coordinate information of the bounding box, wherein the text coordinates comprise:
determining the text vertical coordinate of all the bounding boxes in each cluster set according to the ordinate information of all the bounding boxes in that cluster set;
and determining the text horizontal coordinate of each bounding box within each cluster set according to the abscissa information of the bounding boxes in that cluster set.
In this further alternative embodiment, the abscissa and ordinate of the center point of the circumscribed rectangle output by the deep-learning model PixelLink may be taken as the abscissa and ordinate information of the bounding box, respectively. Since characters on the same line of the target text usually have similar ordinates while characters on different lines differ markedly, the text vertical coordinate of all bounding boxes in a cluster set can be determined from their ordinate information. Suppose there are three bounding box cluster sets. The first contains three bounding boxes with center coordinates (10, 10), (15, 12), (20, 8); the second contains three with centers (11, 21), (14, 18), (20, 20); and the third contains three with centers (9, 30), (16, 28), (19, 32). The ordinates in the first set cluster around 10, in the second around 20, and in the third around 30, so the three sets correspond to lines 1, 2, and 3 of the target text, and their bounding boxes receive text vertical coordinates 1, 2, and 3 respectively. Sorting the bounding boxes within each cluster set by abscissa then yields their text horizontal coordinates. Thus the boxes centered at (10, 10), (15, 12), (20, 8) receive text coordinates (1, 1), (2, 1), (3, 1); the boxes at (11, 21), (14, 18), (20, 20) receive (1, 2), (2, 2), (3, 2); and the boxes at (9, 30), (16, 28), (19, 32) receive (1, 3), (2, 3), (3, 3).
It can be seen that, by implementing this further alternative embodiment, the bounding boxes can be ordered according to their abscissa and ordinate information, thereby obtaining their text transverse coordinates and text longitudinal coordinates.
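The row-and-column ordering described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name and the representation of each bounding box by its center-point tuple are assumptions.

```python
# Hypothetical sketch: assign (text_x, text_y) coordinates to bounding-box
# center points, assuming the boxes are already clustered into rows as in
# the worked example above.
clusters = [
    [(10, 10), (15, 12), (20, 8)],   # ordinates near 10 -> text ordinate 1
    [(11, 21), (14, 18), (20, 20)],  # ordinates near 20 -> text ordinate 2
    [(9, 30), (16, 28), (19, 32)],   # ordinates near 30 -> text ordinate 3
]

def assign_text_coordinates(clusters):
    # Rows are ordered top-to-bottom by their mean ordinate; within a row,
    # boxes are ordered left-to-right by abscissa.
    rows = sorted(clusters, key=lambda c: sum(y for _, y in c) / len(c))
    coords = {}
    for text_y, row in enumerate(rows, start=1):
        for text_x, center in enumerate(sorted(row), start=1):
            coords[center] = (text_x, text_y)
    return coords

coords = assign_text_coordinates(clusters)
```

Running this on the example clusters reproduces the text coordinates given above, e.g. the box at (15, 12) receives the text coordinate (2, 1).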
In this another alternative embodiment, the determining the text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box further includes:
Determining an original text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box;
Determining a regular expression and a text template corresponding to the text image;
Extracting key text information from an original text recognition result based on a regular expression;
and filling the key text information into a text template to obtain a text recognition result of the text image.
In this yet further alternative embodiment, a regular expression is a technical means in the computer arts that can be used to extract a specific portion of the target text. When the text image is a scanned image or a photograph of a certificate such as a graduation certificate, a professional qualification certificate, an enterprise business license or an enterprise qualification certificate, the text in the certificate is usually in a prescribed format, so apart from some key information the remaining words carry little distinguishing meaning. For example, the key text information in a graduation certificate is the graduating institution, the major, the name and so on, while the wording of the other parts differs little even between different graduation certificates. The words of these other parts are therefore used as a text template; the key text information in the graduation certificate is extracted through a regular expression and then filled into the text template, so that the text recognition result of the whole text image is obtained. In this way, even when the text of the other parts is recognized incorrectly, the text recognition result of the whole text image is not thereby rendered incorrect. In addition, the regular expressions and text templates corresponding to different types of text images differ: for example, the regular expression corresponding to the text image of an enterprise business license is used to extract key text information such as the enterprise name and the enterprise address, and its corresponding text template likewise differs from the text template corresponding to the text image of a graduation certificate.
Therefore, by implementing this further alternative embodiment, the text recognition result of the whole text image is obtained by extracting the key text information in the text image and filling it into the text template corresponding to the text image. This avoids the situation in which a recognition error in other parts of the text makes the text recognition result of the whole text image wrong, thereby improving the recognition accuracy of the text image.
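The extract-then-fill flow described above can be sketched as follows. This is a hypothetical illustration: the field names, the pattern, and the template wording are assumptions chosen for the example, not taken from the patent.

```python
import re

# Hypothetical template: the fixed wording of the certificate, with slots
# for the key text information extracted by the regular expression.
TEMPLATE = "Student {name} majored in {major} and graduated from {school}."

def extract_and_fill(raw_text):
    # The regular expression picks out only the key text information; the
    # rest of the certificate wording is carried by the fixed template, so
    # recognition errors in the non-key parts do not affect the result.
    pattern = re.compile(
        r"name:(?P<name>\S+) school:(?P<school>\S+) major:(?P<major>\S+)")
    match = pattern.search(raw_text)
    if match is None:
        return None
    return TEMPLATE.format(**match.groupdict())

result = extract_and_fill("name:Zhang school:PKU major:CS")
```

A different certificate type would pair a different pattern with a different template, as the embodiment notes.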
In still another further alternative embodiment, after the key text information is extracted from the original text recognition result based on the regular expression, and before the key text information is filled into the text template to obtain the text recognition result of the text image, the text image recognition method further includes:
Judging whether the number of characters contained in the key text information is equal to the number of correct characters determined in advance;
And triggering and executing the step of filling the key text information into the text template to obtain a text recognition result of the text image when the number of characters contained in the key text information is judged to be equal to the predetermined correct number of characters.
In this further alternative embodiment, the number of correct characters corresponding to each text image is preset; for example, the number of correct characters corresponding to the text image of a graduation certificate may be 9, 10 or 11, and the number of correct characters corresponding to the text image of a business license may be 19, 20, 21, 22, and so on. When the number of characters of the key text information is equal to a correct number of characters, it can be determined that the extracted key text information is correct.
It can be seen that implementing this further alternative embodiment, whether the extracted key text information is correct is determined by the number of the included characters, and after determining that the extracted key text information is correct, the extracted key text information is filled into the text template, so that the accuracy of the text recognition result can be improved.
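The character-count check above amounts to a simple set-membership test, sketched below. The function name and the illustrative count values are assumptions, not part of the patent.

```python
# Hypothetical sketch: the key text is accepted only when its character
# count matches one of the preset correct counts for this certificate type.
VALID_COUNTS = {9, 10, 11}  # illustrative values for a graduation certificate

def is_key_text_valid(key_text):
    # Equal count -> treat the extraction as correct and fill the template;
    # otherwise fall back to user-supplied corrected key text information.
    return len(key_text) in VALID_COUNTS

ok = is_key_text_valid("abcdefghi")   # 9 characters -> accepted
bad = is_key_text_valid("abc")        # 3 characters -> rejected
```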
Still further optionally, in this alternative embodiment, the text image recognition method further includes:
When the number of characters contained in the key text information is not equal to the number of correct characters determined in advance, receiving corrected key text information input by a user;
And filling the corrected key text information into a text template to obtain a text recognition result of the text image.
In this still further alternative embodiment, the corrected key text information may be directly input by the user according to the target text. That is, after it is determined that the key text information was extracted incorrectly, the user directly inputs the corrected key text information, which is then filled into the text template to obtain the text recognition result of the text image.
It can be seen that, in implementing this further alternative embodiment, after determining that the key text information is extracted in error, the user directly inputs the corrected key text information, and then fills the corrected key text information input by the user into the text template, so that the accuracy of the text recognition result can be improved.
Therefore, by implementing the text image recognition method described in fig. 1, the recognition accuracy of the text in the text image can be improved, and it is ensured that characters on the same line of the target text are output as the same line of characters when the text recognition result is output. Whether two bounding boxes frame characters on the same line is judged according to whether their ordinate intervals intersect. The text coordinates of each bounding box are determined from its coordinate information, and the characters in the bounding boxes are sorted and combined according to the text coordinates, thereby obtaining the recognized text of the whole text image. The bounding boxes are sorted according to their abscissa and ordinate information to obtain their text transverse coordinates and text longitudinal coordinates. Moreover, the situation in which a recognition error in other parts of the text makes the text recognition result of the whole text image wrong can be avoided, thereby improving the recognition accuracy of the text image.
Example two
Referring to fig. 2, fig. 2 is a flowchart illustrating another text image recognition method according to an embodiment of the invention. As shown in fig. 2, the text image recognition method may include the following operations:
201. and carrying out mean value filtering on the text image carrying the target text to obtain a filtered image.
In the step 201, the average filtering process may be as follows:
(1) The size of the filter is set; the size is generally odd. For example, the filter may be a matrix of size (3, 3) containing 9 elements, each element being 1. In an actual application scene, the size and the element values of the filter can be set according to actual conditions.
(2) The filter is moved over the text image so that its center coincides with each pixel of the text image in turn; the filter elements are multiplied by the corresponding overlapping pixels, and the products are summed and divided by the number of filter elements. This can be expressed as the following formula:

g(i, j) = (1/n) * Σ_k Σ_l f(i + k, j + l) * h(k, l)

where f(i+k, j+l) represents the pixel value at coordinate (i+k, j+l) in the pixel matrix of the picture before denoising, g(i, j) represents the pixel value at coordinate (i, j) in the pixel matrix of the picture after denoising, and h(k, l) is the filter matrix containing n elements.
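The mean filtering of step 201 can be sketched with NumPy alone, as below; in practice an OpenCV call such as cv2.blur(img, (3, 3)) performs the same operation. The function name is an assumption for illustration.

```python
import numpy as np

# Minimal sketch of a size x size mean filter with all-ones elements,
# as described in step 201; edge pixels are handled by replicate padding.
def mean_filter(img, size=3):
    pad = size // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            # g(i, j) is the average of the size*size neighbourhood of f.
            out[i, j] = padded[i:i + size, j:j + size].mean()
    return out

# A single bright pixel (impulse noise) is spread evenly over its window.
img = np.array([[0, 0, 0], [0, 9, 0], [0, 0, 0]], dtype=np.float64)
smoothed = mean_filter(img)
```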
202. And carrying out graying treatment on the filtered image to obtain a gray image.
In the above step 202, the image graying process means converting an image in the RGB color model so that the R channel, G channel and B channel of each pixel point have the same value, making the whole image gray. This common value m of the R, G and B channels is called the gray value. The usual graying modes are the following:
Mode one: m = (R + G + B) / 3

Mode two: m = 0.3R + 0.59G + 0.11B
In the embodiment of the present invention, the graying process is preferably performed in the second mode.
203. And carrying out binarization processing on the gray level image to obtain a binarized image.
In an alternative embodiment, binarizing the gray scale image to obtain a binarized image includes:
dividing a gray image into a plurality of gray image regions;
according to the gray values of all pixel points in each gray image area, determining a binarization threshold value corresponding to the gray image area;
and carrying out binarization processing on the image in each gray level image area according to the binarization threshold value corresponding to each gray level image area to obtain a binarized image of the whole gray level image.
In this alternative embodiment, the process of binarizing the image may be understood as resetting the gray value of each pixel according to whether it is greater than the binarization threshold: when the gray value of a pixel is greater than the binarization threshold, it is set to 255 (i.e., the pixel becomes white), and when it is less than the binarization threshold, it is set to 0 (i.e., the pixel becomes black), so that the whole image becomes a black-and-white image. Further, since different portions of the same image may have different brightness, the binarized image obtained with a uniform binarization threshold is not always ideal. In this case, adaptive binarization is used, that is, different binarization thresholds are used for different areas of the same image, so that a binarized image with better effect can be obtained.
It can be seen that by implementing this alternative embodiment, different binarization thresholds can be used for different regions of the same image, so that a better binarized image can be obtained.
In this alternative embodiment, further optionally, each gray image region is a rectangular region;
and determining a binarization threshold corresponding to each gray image area according to gray values of all pixel points in the gray image area, wherein the binarization threshold comprises:
The binarization threshold corresponding to each gray image region is calculated by the following formula:

Thr = (1/n) * Σ_{i=a}^{a+k-1} Σ_{j=b}^{b+l-1} f(i, j)

Where a denotes the abscissa, in the gray image, of the lower left corner pixel of each gray image region, b denotes the ordinate, in the gray image, of the lower right corner pixel of each gray image region, k denotes the number of pixels occupied by each gray image region in the lateral direction of the gray image, l denotes the number of pixels occupied by each gray image region in the longitudinal direction of the gray image, f(i, j) denotes the gray value of the pixel with coordinates (i, j) in the gray image, n denotes the total number of pixels occupied by each gray image region, and Thr denotes the binarization threshold corresponding to each gray image region.
It can be seen that this further alternative embodiment is implemented, taking the average value of the gray values of all the pixels in each gray image region as the binarization threshold, so that a better effect of binarization image can be obtained.
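The per-region mean thresholding of step 203 can be sketched as below. This is a minimal NumPy illustration; the function name and the square region tiling are assumptions.

```python
import numpy as np

# Sketch of adaptive binarization: each rectangular region uses the mean
# of its own gray values as the threshold Thr, as in the formula above.
def adaptive_binarize(gray, region=2):
    out = np.zeros_like(gray)
    for i in range(0, gray.shape[0], region):
        for j in range(0, gray.shape[1], region):
            block = gray[i:i + region, j:j + region]
            thr = block.mean()  # Thr = (1/n) * sum of the block's gray values
            # Above the threshold -> 255 (white); otherwise -> 0 (black).
            out[i:i + region, j:j + region] = np.where(block > thr, 255, 0)
    return out

gray = np.array([[10, 200], [20, 210]], dtype=np.float64)
binary = adaptive_binarize(gray)
```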
204. The first vertex coordinates of the binarized image are calculated based on a predetermined image edge detection algorithm.
In the step 204, the image edge detection algorithm may be a Canny edge detection algorithm, and three vertex coordinates (i.e., first vertex coordinates) of the target text in the text image are extracted by using the Canny edge detection algorithm, where the three vertex coordinates are respectively:
A(x1,y1),B(x2,y2),C(x3,y3)
The Canny edge detection algorithm first convolves the pixel matrix of the text image with the Sobel or other operators to obtain the gradient values g_x(m, n) and g_y(m, n) in different directions, and then combines the directions to obtain the gradient value and the gradient direction:

G(m, n) = sqrt(g_x(m, n)^2 + g_y(m, n)^2)

θ(m, n) = arctan(g_y(m, n) / g_x(m, n))

wherein G(m, n) is the gradient value and θ(m, n) is the gradient direction.
Then the upper and lower thresholds are used to screen the edge pixel points. The screening rule sets two thresholds, an upper threshold maxVal and a lower threshold minVal. Pixels whose gradient value is greater than maxVal are all determined to be edges, and pixels whose gradient value is lower than minVal are all determined to be non-edges. A middle pixel point, lying between the two thresholds, is judged to be an edge if it is adjacent to a pixel point already determined to be an edge, and is judged to be a non-edge if it is not.
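The double-threshold screening rule can be sketched in one dimension, as below; a real Canny implementation (e.g. cv2.Canny) applies the same hysteresis rule in two dimensions after non-maximum suppression. The function name and the 1-D simplification are assumptions for illustration.

```python
# One-dimensional sketch of hysteresis thresholding: gradients above
# max_val are strong edges, below min_val are non-edges, and middle
# values become edges only if connected to an existing edge.
def hysteresis_1d(grads, min_val, max_val):
    edges = [g > max_val for g in grads]
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(grads):
            if edges[i] or g < min_val or g > max_val:
                continue
            # A middle pixel adjacent to an edge pixel becomes an edge.
            if (i > 0 and edges[i - 1]) or (i + 1 < len(grads) and edges[i + 1]):
                edges[i] = True
                changed = True
    return edges

# The two middle values (60 and 50) attach to the strong edge at 120.
edges = hysteresis_1d([10, 60, 120, 50, 10], min_val=40, max_val=100)
```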
205. The second vertex coordinates are determined from the predetermined alignment image.
In step 205 described above, the alignment image is a reference image with a specified geometry to which the binarized image is to be corrected; the new binarized image obtained after aligning the binarized image with the alignment image will have the same geometry. Three vertex coordinates (i.e., second vertex coordinates) of the alignment image may be determined from its long side and wide side, and are expressed as:
A′(x′1,y′1),B′(x′2,y′2),C′(x′3,y′3)
206. An affine transformation matrix is determined from the first vertex coordinates and the second vertex coordinates.
In step 206 described above, the affine transformation matrix M is a 2 × 3 matrix whose six elements are solved from the three pairs of corresponding vertices, i.e., M maps A, B, C (written in homogeneous coordinates) onto A′, B′, C′ respectively, and can be expressed as:

M = | m11 m12 m13 |
    | m21 m22 m23 |
207. And carrying out affine transformation on the binarized image according to the affine transformation matrix to obtain a model input image.
In step 207 described above, the affine transformation can be expressed as:
G′=M*G
wherein G' represents the affine transformed picture pixel matrix, and G represents the picture pixel matrix before affine transformation.
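Solving M from the three vertex pairs and applying it (steps 206 and 207) can be sketched with NumPy alone, as below; in practice cv2.getAffineTransform and cv2.warpAffine would typically be used. The point values and function name are assumptions for illustration.

```python
import numpy as np

# Sketch: solve the 2x3 affine matrix M mapping the three detected
# vertices A, B, C onto the alignment-image vertices A', B', C'.
def affine_from_points(src, dst):
    # Each point pair contributes two linear equations in the six
    # unknowns of M, giving a 6x6 system.
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0]); b.append(xp)
        A.append([0, 0, 0, x, y, 1]); b.append(yp)
    m = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return m.reshape(2, 3)

src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 3), (4, 3), (2, 6)]   # scale x by 2, y by 3, translate (2, 3)
M = affine_from_points(src, dst)

# Step 207: G' = M * G, applied here to a single point in homogeneous form.
point = M @ np.array([1.0, 1.0, 1.0])
```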
208. And inputting the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image.
209. And clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set.
210. And inputting the images in the target image area selected by each boundary box into a predetermined character recognition model for analysis, and obtaining the character recognition result of the boundary box.
211. And determining the text recognition result of the text image according to the text recognition result, the coordinate information and the belonging boundary box cluster set of each boundary box.
For a specific description of the steps 208 to 211, reference may be made to the specific descriptions of the steps 102 to 105, which are not described in detail herein.
In a further alternative embodiment, the bounding box further comprises geometric information of the target image area, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image area;
And inputting the model input image into a predetermined text detection model for analysis, and clustering all the bounding boxes according to the coordinate information of each bounding box after the text detection result of the text image is obtained, so as to obtain at least one bounding box cluster set, wherein the text image recognition method further comprises the following steps:
When the geometric information comprises the pixel width of the target image area, selecting a target pixel width from all the pixel widths;
Removing the boundary frames with the corresponding pixel widths smaller than or equal to the target pixel width from the text detection result, triggering and executing the step of clustering all the boundary frames according to the coordinate information of each boundary frame to obtain at least one boundary frame clustering set;
When the geometric information comprises the pixel length of the target image area, selecting a target pixel length from all the pixel lengths;
Removing the boundary frames with the corresponding pixel lengths smaller than or equal to the target pixel length from the text detection result, and triggering and executing the step of clustering all the boundary frames according to the coordinate information of each boundary frame to obtain at least one boundary frame clustering set;
When the geometric information comprises pixel areas of the target image area, selecting the target pixel area from all the pixel areas;
and removing the boundary boxes with the pixel areas smaller than or equal to the target pixel areas from the text detection result, and triggering and executing the step of clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set.
In this alternative embodiment, the circumscribed rectangular frame (i.e., bounding box) output by the deep learning model PixelLink is sometimes erroneous, so the bounding boxes it outputs need to be filtered to ensure the accuracy of the resulting text recognition result. In particular, filtering the bounding boxes using the geometric features (pixel width, pixel length, pixel area) of the target image region is a simple and efficient method. For example, the pixel widths of all the target image areas are sorted from large to small, and the pixel width ranked at the 99% position is selected as the target pixel width; if the target pixel width is 10 pixels, the bounding boxes with a pixel width smaller than 10 pixels are deleted, thereby filtering the bounding boxes. Among all the bounding boxes output by the deep learning model PixelLink, a bounding box whose pixel width is less than 10 pixels is generally an erroneous bounding box and therefore needs to be deleted, while the bounding boxes ranked in the first 99% are generally valid bounding boxes.
It can be seen that implementing this alternative embodiment, the erroneous bounding box can be removed from the text detection result according to the geometric features of the bounding box, thereby improving the accuracy of the resulting text recognition result.
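The width-based filtering above can be sketched as follows; the percentile choice, dictionary representation of a bounding box, and function name are assumptions for illustration, not the patent's implementation.

```python
# Sketch of geometric bounding-box filtering: sort pixel widths from
# large to small, take the width at the keep_fraction position as the
# target width, and discard boxes at or below it as likely errors.
def filter_boxes_by_width(boxes, keep_fraction=0.99):
    widths = sorted((b["width"] for b in boxes), reverse=True)
    idx = min(int(len(widths) * keep_fraction), len(widths) - 1)
    target = widths[idx]
    return [b for b in boxes if b["width"] > target]

# Four plausible character boxes and one 3-pixel-wide spurious detection.
boxes = [{"width": w} for w in (30, 28, 31, 29, 3)]
kept = filter_boxes_by_width(boxes)
```

The same pattern applies to pixel length and pixel area filtering.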
Therefore, when the text image recognition method described in fig. 2 is implemented, different binarization thresholds can be used for different areas of the same image, and an average value of gray values of all pixels in each gray image area is used as the binarization threshold, so that a binarized image with better effect can be obtained. The false bounding box can be removed from the text detection result according to the geometric characteristics of the bounding box, so that the accuracy of the obtained text recognition result is improved.
Example III
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text image recognition device according to an embodiment of the invention. As shown in fig. 3, the text image recognition apparatus may include:
the preprocessing module 301 is configured to preprocess a text image carrying a target text to obtain a model input image;
A first analysis module 302, configured to input a model input image to a predetermined text detection model for analysis, to obtain a text detection result of a text image, where the text detection result includes at least one bounding box for framing a target image region in the text image, and each bounding box includes at least coordinate information for indicating a position of the target image region in the text image, and the target image region is an image region in which a single text in the target text is located in the text image;
The clustering module 303 is configured to cluster all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, where the characters corresponding to all bounding boxes in each bounding box cluster set are characters in the same line in the target text;
the second analysis module 304 is configured to input an image in the target image area selected by each bounding box to a predetermined text recognition model for analysis, so as to obtain a text recognition result of the bounding box;
The determining module 305 is configured to determine a text recognition result of the text image according to the text recognition result, the coordinate information, and the belonging bounding box cluster set of each bounding box.
Therefore, implementing the text image recognition device depicted in fig. 3, the coordinate information corresponding to each character in the text image carrying the target text is analyzed; all the characters are then clustered according to their coordinate information, so that the characters on the same line of the target text are divided into one cluster set; the character recognition result corresponding to each character is analyzed; and finally the text recognition result of the text image is determined according to the character recognition result, the coordinate information and the cluster set to which each character belongs. In this way, the recognition accuracy of the text in the text image can be improved, and it is ensured that the characters on the same line of the target text are output as the same line of characters when the text recognition result is output.
In an alternative embodiment, the determining module 305 determines, according to the text recognition result, the coordinate information and the belonging bounding box cluster set of each bounding box, the text recognition result of the text image in the following specific manner:
determining text coordinates of each boundary box according to the boundary box cluster set to which the boundary box belongs and coordinate information of the boundary box, wherein the text coordinates are used for representing the position of characters corresponding to the boundary box in a target text, and the text coordinates at least comprise text longitudinal coordinates and text transverse coordinates, and the text longitudinal coordinates of the boundary boxes belonging to the same boundary box cluster set are the same;
And determining the text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box.
It can be seen that implementing the text image recognition apparatus described in fig. 4 can determine the text coordinates of the bounding box according to the coordinate information of the bounding box, and then sort and combine the text in the bounding box according to the text coordinates, thereby obtaining the recognition text of the whole text image.
In this optional embodiment, further optional, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box;
and, the determining module 305 determines, according to the bounding box cluster set to which each bounding box belongs and the coordinate information of the bounding box, the text coordinates of the bounding box in the following specific manner:
determining the text longitudinal coordinates of all the bounding boxes in the bounding box cluster set to which each bounding box belongs according to the longitudinal coordinate information of all the bounding boxes in the bounding box cluster set to which the bounding box belongs;
and determining the text transverse coordinates of each bounding box in each bounding box cluster set according to the transverse coordinate information of each bounding box in the bounding box cluster set.
It can be seen that implementing the recognition device for text images described in fig. 4 can sort the bounding boxes according to their abscissa and ordinate information, and obtain the text transverse coordinates and text longitudinal coordinates of the bounding boxes.
In this further alternative embodiment, still further alternative, the ordinate information of each bounding box includes a maximum ordinate and a minimum ordinate of the bounding box;
and, the clustering module 303 clusters all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set in the following specific ways:
determining an ordinate interval of each bounding box according to the maximum ordinate and the minimum ordinate of the bounding box;
Judging whether an intersection exists between the ordinate intervals of every two bounding boxes;
Dividing the two bounding boxes into the same bounding box cluster set when judging that the intersection exists between the ordinate intervals of the two bounding boxes;
And when judging that no intersection exists between the ordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
Therefore, the recognition device for text images described in fig. 4 can judge whether two bounding boxes are located in the bounding boxes of the characters in the same row according to whether the intersection exists between the ordinate sections of the two bounding boxes, so that the bounding boxes of the characters in the same row can be divided into the same bounding box clustering set, and the clustering of the bounding boxes is realized.
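The interval-intersection clustering performed by the clustering module can be sketched as below. This is a simplified illustration: representing each bounding box by its (minimum ordinate, maximum ordinate) interval and the greedy grouping strategy are assumptions.

```python
# Two bounding boxes frame characters on the same line when their
# [min_y, max_y] ordinate intervals intersect.
def intervals_intersect(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

# Greedy sketch: each box joins the first cluster set containing a box
# whose ordinate interval intersects its own, else starts a new set.
def cluster_rows(boxes):
    clusters = []
    for box in boxes:
        for cluster in clusters:
            if any(intervals_intersect(box, other) for other in cluster):
                cluster.append(box)
                break
        else:
            clusters.append([box])
    return clusters

# The first two intervals overlap (same line); the third is a new line.
rows = cluster_rows([(8, 12), (10, 14), (28, 32)])
```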
In another alternative embodiment, the determining module 305 determines the text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box in the following specific manner:
Determining an original text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box;
Determining a regular expression and a text template corresponding to the text image;
Extracting key text information from an original text recognition result based on a regular expression;
and filling the key text information into a text template to obtain a text recognition result of the text image.
Therefore, by implementing the text image recognition device described in fig. 4, by extracting the key text information in the text image and then filling the key text information into the text template corresponding to the text image, the text recognition result of the whole text image is obtained, and the situation that the text recognition result of the whole text image is wrong when the text of other parts is recognized can be avoided, thereby improving the recognition accuracy of the text image.
In yet another alternative embodiment, the preprocessing module 301 performs preprocessing on the text image carrying the target text, and the specific manner of obtaining the model input image is:
carrying out mean value filtering on the text image bearing the target text to obtain a filtered image;
Carrying out graying treatment on the filtered image to obtain a gray image;
performing binarization processing on the gray level image to obtain a binarized image;
calculating a first vertex coordinate of the binarized image based on a predetermined image edge detection algorithm;
determining a second vertex coordinate according to the predetermined alignment image;
determining an affine transformation matrix according to the first vertex coordinates and the second vertex coordinates;
And carrying out affine transformation on the binarized image according to the affine transformation matrix to obtain a model input image.
It can be seen that implementing the text image recognition apparatus described in fig. 4 can implement preprocessing of the text image in multiple ways.
In this further alternative embodiment, further optionally, the preprocessing module 301 performs binarization processing on the gray-scale image, and the specific manner of obtaining the binarized image is:
dividing a gray image into a plurality of gray image regions;
according to the gray values of all pixel points in each gray image area, determining a binarization threshold value corresponding to the gray image area;
and carrying out binarization processing on the image in each gray level image area according to the binarization threshold value corresponding to each gray level image area to obtain a binarized image of the whole gray level image.
It can be seen that implementing the text image recognition device described in fig. 4 can use different binarization thresholds for different areas of the same image, so that a better binarized image can be obtained.
In this further alternative embodiment, still further optionally, each gray image region is a rectangular region;
and, the preprocessing module 301 determines, according to the gray values of all the pixels in each gray image area, the binarization threshold corresponding to the gray image area in the following specific manner:
The binarization threshold corresponding to each gray image region is calculated by the following formula:

Thr = (1/n) * Σ_{i=a}^{a+k-1} Σ_{j=b}^{b+l-1} f(i, j)

Where a denotes the abscissa, in the gray image, of the lower left corner pixel of each gray image region, b denotes the ordinate, in the gray image, of the lower right corner pixel of each gray image region, k denotes the number of pixels occupied by each gray image region in the lateral direction of the gray image, l denotes the number of pixels occupied by each gray image region in the longitudinal direction of the gray image, f(i, j) denotes the gray value of the pixel with coordinates (i, j) in the gray image, n denotes the total number of pixels occupied by each gray image region, and Thr denotes the binarization threshold corresponding to each gray image region.
As can be seen, the recognition device for text image described in fig. 4 is implemented, and the average value of the gray values of all the pixels in each gray image area is used as the binarization threshold, so that a better binarized image can be obtained.
In a further alternative embodiment, the bounding box further comprises geometric information of the target image area, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image area;
and, the recognition device of the text image further includes:
The selecting module 306 is configured to, after the first analyzing module 302 inputs the model input image to the predetermined text detection model for analysis to obtain the text detection result of the text image, and before the clustering module 303 clusters all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, select a target pixel width from all pixel widths when the geometric information includes the pixel width of the target image area;
The removing module 307 is configured to remove the bounding boxes with the pixel widths less than or equal to the target pixel width from the text detection result, and trigger the clustering module 303 to perform the above-described operation of clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
the selecting module 306 is configured to, after the first analyzing module 302 inputs the model input image to the predetermined text detection model for analysis to obtain the text detection result of the text image, and before the clustering module 303 clusters all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, select a target pixel length from all pixel lengths when the geometric information includes the pixel lengths of the target image area;
The removing module 307 is configured to remove the bounding boxes with the pixel lengths less than or equal to the target pixel length from the text detection result, and trigger the clustering module 303 to perform the above-described operation of clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
The selecting module 306 is configured to, after the first analyzing module 302 inputs the model input image to the predetermined text detection model for analysis to obtain the text detection result of the text image, and before the clustering module 303 clusters all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, select a target pixel area from all pixel areas when the geometric information includes the pixel areas of the target image area;
The removing module 307 is configured to remove the bounding boxes with the pixel areas less than or equal to the target pixel area from the text detection result, and trigger the clustering module 303 to perform the above-mentioned operation of clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set.
It can be seen that implementing the text image recognition apparatus described in fig. 4 can remove the erroneous bounding box from the text detection result according to the geometric features of the bounding box, thereby improving the accuracy of the obtained text recognition result.
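As a hedged sketch of the width-based variant of this filtering (selecting module 306 plus removing module 307): the dict-based box representation and the median-based choice of the target pixel width are assumptions, since the text does not fix how the target width is selected from the candidates.

```python
import statistics

def filter_boxes_by_width(boxes, pick_target):
    """Drop every bounding box whose pixel width is at most the target.

    `boxes` is a list of dicts with a 'width' key (an assumed
    representation); `pick_target` selects the target pixel width
    from the list of all candidate widths.
    """
    target = pick_target([box["width"] for box in boxes])
    return [box for box in boxes if box["width"] > target]

# Hypothetical usage with median-low as the target-selection rule:
boxes = [{"width": 2}, {"width": 18}, {"width": 20}, {"width": 22}]
kept = filter_boxes_by_width(boxes, statistics.median_low)
# boxes no wider than 18 pixels are removed, including the
# likely-erroneous 2-pixel box
```

The length-based and area-based variants described above differ only in the key inspected.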
For the specific description of the text image recognition device, reference may be made to the specific description of the text image recognition method, which is not described in detail herein.
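The row-clustering step that the clustering module 303 performs puts two bounding boxes in the same cluster set exactly when their ordinate intervals intersect. A minimal sketch; representing each box as a (min_y, max_y) pair and grouping transitively through a running row interval are assumptions beyond the pairwise rule stated in the text:

```python
def cluster_rows(boxes):
    """Group (min_y, max_y) bounding boxes into text rows.

    Boxes are sorted by minimum ordinate, so a box joins the current
    row whenever its interval intersects the row's running interval.
    """
    rows = []
    for min_y, max_y in sorted(boxes):
        if rows and min_y <= rows[-1]["max_y"]:
            row = rows[-1]                       # intervals intersect:
            row["boxes"].append((min_y, max_y))  # same line of text
            row["max_y"] = max(row["max_y"], max_y)
        else:
            rows.append({"max_y": max_y, "boxes": [(min_y, max_y)]})
    return [row["boxes"] for row in rows]
```

Each returned group corresponds to one bounding box cluster set, i.e. one line of the target text.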
Example IV
Referring to fig. 5, fig. 5 is a schematic structural diagram of a text image recognition device according to another embodiment of the present invention. As shown in fig. 5, the apparatus may include:
a memory 501 storing executable program code;
a processor 502 coupled to the memory 501;
the processor 502 invokes the executable program code stored in the memory 501 to perform the steps in the text image recognition method disclosed in the first or second embodiment of the present invention.
Example V
The embodiment of the invention discloses a computer storage medium storing computer instructions which, when invoked, perform the steps in the text image recognition method disclosed in the first or second embodiment of the present invention.
The apparatus embodiments described above are merely illustrative, wherein the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied, essentially or in part, in the form of a software product that may be stored in a computer-readable storage medium, including a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc memory, magnetic disc memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the text image recognition method and apparatus disclosed in the embodiments of the present invention are merely preferred embodiments, used only to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of identifying a text image, the method comprising:
preprocessing a text image bearing a target text to obtain a model input image;
inputting the model input image into a predetermined text detection model for analysis, and obtaining a text detection result of the text image, wherein the text detection result comprises at least one boundary box for selecting a target image area in the text image, each boundary box at least comprises coordinate information for representing the position of the target image area in the text image, and the target image area refers to an image area where a single text in the target text is located in the text image;
Clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box clustering set, wherein the characters corresponding to all the bounding boxes in each bounding box clustering set are characters in the same line in the target text;
Inputting the images in the target image areas selected by each boundary frame into a predetermined character recognition model for analysis to obtain a character recognition result of the boundary frame;
determining a text recognition result of the text image according to the text recognition result, the coordinate information and the belonging boundary box cluster set of each boundary box;
wherein the coordinate information of each boundary frame comprises abscissa information and ordinate information of the boundary frame, and the ordinate information of each boundary frame comprises a maximum ordinate and a minimum ordinate of the boundary frame; and the clustering of all the boundary frames according to the coordinate information of each boundary frame to obtain at least one boundary frame cluster set comprises:
Determining an ordinate interval of each boundary frame according to the maximum ordinate and the minimum ordinate of the boundary frame;
judging whether an intersection exists between the ordinate intervals of every two bounding boxes;
Dividing the two boundary frames into the same boundary frame cluster set when judging that the intersection exists between the ordinate intervals of the two boundary frames;
When judging that no intersection exists between the ordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets;
and determining the text recognition result of the text image according to the text recognition result, the coordinate information and the belonging boundary box cluster set of each boundary box, wherein the method comprises the following steps:
For each boundary box cluster set, determining an ordinate distribution target point of the boundary box cluster set according to the ordinate information of all the boundary boxes in the boundary box cluster set, wherein the ordinate distribution target point is used for representing the distribution condition of the ordinate information of all the boundary boxes in the boundary box cluster set;
Determining the text longitudinal coordinates of each bounding box in each bounding box cluster set according to all the ordinate distribution target points, wherein the text longitudinal coordinates of the bounding boxes in the same bounding box cluster set are the same;
For each boundary box cluster set, determining the text transverse ordering information of the boundary box cluster set according to the abscissa information of all boundary boxes of the boundary box cluster set, wherein the text transverse ordering information is used for representing the text transverse ordering condition of the boundary box cluster set;
And determining a text recognition result of the text image according to the text recognition result of each bounding box and text coordinates, wherein the text coordinates comprise the text longitudinal coordinates and the text transverse coordinates, and the text coordinates are used for representing the position, in the target text, of the character in the bounding box.
2. The method according to claim 1, wherein the determining the text recognition result of the text image based on the text recognition result and the text coordinates of each of the bounding boxes includes:
Determining an original text recognition result of the text image according to the text recognition result and the text coordinates of each bounding box;
Determining a regular expression and a text template corresponding to the text image;
extracting key text information from the original text recognition result based on the regular expression;
And filling the key text information into the text template to obtain a text recognition result of the text image.
3. The method for recognizing a text image according to claim 1, wherein the preprocessing the text image carrying the target text to obtain a model input image includes:
carrying out mean value filtering on the text image bearing the target text to obtain a filtered image;
performing graying processing on the filtered image to obtain a gray image;
performing binarization processing on the gray level image to obtain a binarized image;
Calculating a first vertex coordinate of the binarized image based on a predetermined image edge detection algorithm;
determining a second vertex coordinate according to the predetermined alignment image;
determining an affine transformation matrix according to the first vertex coordinates and the second vertex coordinates;
and carrying out affine transformation on the binarized image according to the affine transformation matrix to obtain a model input image.
4. A method for recognizing a text image according to claim 3, wherein the binarizing the gray-scale image to obtain a binarized image comprises:
dividing the gray image into a plurality of gray image regions;
Determining a binarization threshold value corresponding to each gray image area according to gray values of all pixel points in each gray image area;
and carrying out binarization processing on the image in the gray image area according to the binarization threshold value corresponding to each gray image area to obtain a binarized image of the whole gray image.
5. The method of recognizing a text image according to claim 4, wherein each of the gray image regions is a rectangular region;
and determining a binarization threshold corresponding to each gray image area according to gray values of all pixel points in the gray image area, including:
calculating a binarization threshold value corresponding to each gray image region by the following formula:
Where a represents the abscissa of the pixel point at the lower left corner of each gray image region in the gray image, b represents the ordinate of the pixel point at the lower right corner of each gray image region in the gray image, k represents the number of pixels occupied by each gray image region in the lateral direction of the gray image, l represents the number of pixels occupied by each gray image region in the longitudinal direction of the gray image, f (i, j) represents the gray value of the pixel point with coordinates (i, j) in the gray image, n represents the total number of pixels occupied by each gray image region, thr represents the corresponding binarization threshold value of each gray image region.
6. The method of claim 5, wherein the bounding box further comprises geometric information of the target image region, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image region;
And after the model input image is input to the predetermined text detection model for analysis and the text detection result of the text image is obtained, and before all the bounding boxes are clustered according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, the method further includes:
When the geometric information comprises the pixel width of the target image area, selecting a target pixel width from all the pixel widths;
Removing the corresponding bounding boxes with the pixel width smaller than or equal to the target pixel width from the text detection result, triggering and executing the step of clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
When the geometric information comprises the pixel length of the target image area, selecting a target pixel length from all the pixel lengths;
Removing the boundary frames with the pixel lengths smaller than or equal to the target pixel lengths from the text detection result, triggering and executing the step of clustering all the boundary frames according to the coordinate information of each boundary frame to obtain at least one boundary frame cluster set;
when the geometric information comprises pixel areas of the target image area, selecting a target pixel area from all the pixel areas;
and removing the boundary boxes with the pixel areas smaller than or equal to the target pixel areas from the text detection result, triggering and executing the step of clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box cluster set.
7. A text image recognition apparatus for performing the text image recognition method according to any one of claims 1 to 6, comprising:
The preprocessing module is used for preprocessing the text image bearing the target text to obtain a model input image;
The first analysis module is used for inputting the model input image into a predetermined text detection model for analysis, so as to obtain a text detection result of the text image, wherein the text detection result comprises at least one boundary box for selecting a target image area in the text image, each boundary box at least comprises coordinate information for representing the position of the target image area in the text image, and the target image area refers to an image area where a single text in the target text is located in the text image;
The clustering module is used for clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set, and characters corresponding to all the boundary boxes in each boundary box clustering set are characters in the same line in the target text;
The second analysis module is used for inputting the image in the target image area selected by each boundary frame into a predetermined character recognition model for analysis to obtain a character recognition result of the boundary frame;
and the determining module is used for determining the text recognition result of the text image according to the text recognition result, the coordinate information and the belonging boundary box cluster set of each boundary box.
8. A text image recognition apparatus, the apparatus comprising:
A memory storing executable program code;
a processor coupled to the memory;
The processor invokes the executable program code stored in the memory to perform the method of recognition of a text image as claimed in any one of claims 1 to 6.
CN202011138322.4A 2020-10-22 2020-10-22 Text image recognition method and device Active CN112507782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011138322.4A CN112507782B (en) 2020-10-22 2020-10-22 Text image recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011138322.4A CN112507782B (en) 2020-10-22 2020-10-22 Text image recognition method and device

Publications (2)

Publication Number Publication Date
CN112507782A CN112507782A (en) 2021-03-16
CN112507782B true CN112507782B (en) 2025-04-01

Family

ID=74954683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011138322.4A Active CN112507782B (en) 2020-10-22 2020-10-22 Text image recognition method and device

Country Status (1)

Country Link
CN (1) CN112507782B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221632A (en) * 2021-03-23 2021-08-06 奇安信科技集团股份有限公司 Document picture identification method and device and computer equipment
CN115188009A (en) * 2021-03-23 2022-10-14 北京中医药大学 Character recognition method and apparatus, electronic device and computer readable medium
CN113537199B (en) * 2021-08-13 2023-05-02 上海淇玥信息技术有限公司 Image boundary box screening method, system, electronic device and medium
CN115730563A (en) * 2021-08-30 2023-03-03 广东艾檬电子科技有限公司 Typesetting method and device for text image, electronic equipment and storage medium
CN113850208B (en) * 2021-09-29 2024-09-27 平安科技(深圳)有限公司 Picture information structuring method, device, equipment and medium
CN114119349A (en) * 2021-10-13 2022-03-01 广东金赋科技股份有限公司 Image information extraction method, device and medium
CN113903040A (en) * 2021-10-27 2022-01-07 上海商米科技集团股份有限公司 Text recognition method, equipment, system and computer readable medium for shopping receipt
CN114140800B (en) * 2021-10-29 2025-05-30 广东省电信规划设计院有限公司 Intelligent positioning method and device for pages
CN114155535B (en) * 2021-12-07 2025-03-11 创优数字科技(广东)有限公司 Text detection method, device, storage medium and computer equipment
CN114240890B (en) * 2021-12-17 2025-06-27 深圳前海环融联易信息科技服务有限公司 A method, device, equipment and storage medium for removing blank space from text image
CN114937270A (en) * 2022-05-05 2022-08-23 上海迥灵信息技术有限公司 Ancient book word processing method, ancient book word processing device and computer readable storage medium
CN115471919B (en) * 2022-09-19 2023-09-12 江苏至真健康科技有限公司 Filing method and system based on portable mydriasis-free fundus camera
CN116052175A (en) * 2022-10-28 2023-05-02 北京迈格威科技有限公司 Text detection method, electronic device, storage medium and computer program product
CN115546810B (en) * 2022-11-29 2023-04-11 支付宝(杭州)信息技术有限公司 Image element category identification method and device
CN120182987B (en) * 2025-05-22 2025-07-18 西安葆康医管数据科技有限公司 Intelligent verification and identification method, device and electronic equipment for financial statements

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748888A (en) * 2017-10-13 2018-03-02 众安信息技术服务有限公司 A kind of image text row detection method and device
CN109284750A (en) * 2018-08-14 2019-01-29 北京市商汤科技开发有限公司 Bank slip recognition method and device, electronic equipment and storage medium
CN111539330A (en) * 2020-04-17 2020-08-14 西安英诺视通信息技术有限公司 Transformer substation digital display instrument identification method based on double-SVM multi-classifier

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993040B (en) * 2018-01-03 2021-07-30 北京世纪好未来教育科技有限公司 Text recognition method and device
US10740602B2 (en) * 2018-04-18 2020-08-11 Google Llc System and methods for assigning word fragments to text lines in optical character recognition-extracted data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748888A (en) * 2017-10-13 2018-03-02 众安信息技术服务有限公司 A kind of image text row detection method and device
CN109284750A (en) * 2018-08-14 2019-01-29 北京市商汤科技开发有限公司 Bank slip recognition method and device, electronic equipment and storage medium
CN111539330A (en) * 2020-04-17 2020-08-14 西安英诺视通信息技术有限公司 Transformer substation digital display instrument identification method based on double-SVM multi-classifier

Also Published As

Publication number Publication date
CN112507782A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112507782B (en) Text image recognition method and device
CN104112128B (en) Digital image processing system and method applied to bill image character recognition
CN111680690B (en) Character recognition method and device
JP5492205B2 (en) Segment print pages into articles
JP6139396B2 (en) Method and program for compressing binary image representing document
US8712188B2 (en) System and method for document orientation detection
CN110598566A (en) Image processing method, device, terminal and computer readable storage medium
CN112926564B (en) Picture analysis method, system, computer device and computer readable storage medium
JP4694613B2 (en) Document orientation determination apparatus, document orientation determination method, program, and recording medium therefor
CN105283884A (en) Classify objects in digital images captured by mobile devices
US11151402B2 (en) Method of character recognition in written document
CN111737478B (en) Text detection method, electronic device and computer readable medium
CN111340023A (en) Text recognition method and device, electronic equipment and storage medium
CN114241463A (en) Signature verification method, apparatus, computer equipment and storage medium
CN110210297A (en) The method declaring at customs the positioning of single image Chinese word and extracting
CN113963353A (en) Character image processing and identifying method and device, computer equipment and storage medium
CN114926829A (en) Certificate detection method and device, electronic equipment and storage medium
CN113569859A (en) Image processing method and device, electronic equipment and storage medium
CN115410191B (en) Text image recognition method, device, equipment and storage medium
EP2545498B1 (en) Resolution adjustment of an image that includes text undergoing an ocr process
CN118334670A (en) Seal content identification method, system and medium
CN117854090A (en) Universal form identification method and device
CN115830607B (en) Text recognition method and device based on artificial intelligence, computer equipment and medium
Boiangiu et al. Handwritten documents text line segmentation based on information energy
CN112215783B (en) Image noise point identification method, device, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant