Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort shall fall within the scope of the invention.
The terms "first", "second", and the like in the description, in the claims, and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, or article that comprises a list of steps or elements is not limited to only those steps or elements listed, but may include other steps or elements not listed or inherent to such process, method, apparatus, or article.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The invention discloses a text image recognition method and a text image recognition device. Coordinate information corresponding to each character in a text image carrying a target text is analyzed; all the characters are then clustered according to their coordinate information, so that characters in the same line of the target text are divided into one cluster set; the character recognition result corresponding to each character is analyzed; and the text recognition result of the text image is finally determined according to the character recognition results, the coordinate information, and the cluster set to which each character belongs. In this way, the recognition accuracy of the text in the text image can be improved, and characters in the same line of the target text are output as characters in the same line in the text recognition result. The method and the device are described in detail below.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of a text image recognition method according to an embodiment of the present invention. As shown in fig. 1, the text image recognition method may include the following operations:
101. Preprocess the text image bearing the target text to obtain a model input image.
In step 101, the text image bearing the target text may be a scanned image or a photograph of a certificate such as a graduation certificate, a professional qualification certificate, an enterprise business license, or an enterprise seniority certificate. Preprocessing of the text image may include mean filtering, graying, binarization, alignment transformation, and the like. For the specific preprocessing process in the embodiment of the present invention, reference may be made to the description in the subsequent embodiment.
102. Input the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image.
In step 102, the text detection result includes at least one bounding box for framing a target image area in the text image, where each bounding box includes at least coordinate information for indicating the position of the target image area in the text image, and the target image area is the image area in the text image where a single character of the target text is located.
Alternatively, the text detection model may be the deep learning model PixelLink, which forgoes the bounding-box regression approach to detecting text-line bounding boxes and instead adopts an instance segmentation approach, deriving the bounding box of a text line directly from the segmented text-line region. The algorithm proceeds as follows:
(1) The deep learning model VGG16 is adopted as a feature extraction network, wherein the output of the deep learning model VGG16 is divided into two parts:
pixel segmentation, namely judging whether each pixel point of the model input image is a text pixel or a non-text pixel;
link prediction, namely predicting, for each pixel of the model input image, the links to its eight neighboring pixels; a neighbor whose link is predicted positive is merged into the same text region, and one whose link is predicted negative is discarded.
(2) A circumscribed rectangular box with direction information (i.e., the bounding box of a target image area) corresponding to each character in the model input image is extracted by calculating the minimum circumscribed rectangle. The box is expressed as ((x, y), (w, h), θ) (i.e., the coordinate information of the bounding box), where (x, y) represents the coordinates of the center point of the circumscribed rectangular box, (w, h) represents its width and height, and θ represents its rotation angle.
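As a minimal sketch of step (2) only, assuming a binary text-segmentation mask is already available (PixelLink's pixel and link predictions themselves are not reproduced here), OpenCV's minimum-circumscribed-rectangle routine recovers ((x, y), (w, h), θ) for each text region:

```python
import cv2
import numpy as np

# Toy binary mask standing in for the network's text-pixel output.
mask = np.zeros((64, 128), dtype=np.uint8)
cv2.putText(mask, "Hi", (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1.5, 255, 3)

# Each connected text region yields one oriented bounding box.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    (x, y), (w, h), theta = cv2.minAreaRect(contour)  # minimum circumscribed rectangle
    print(f"center=({x:.1f}, {y:.1f}), size=({w:.1f}, {h:.1f}), angle={theta:.1f}")
```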
103. Cluster all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set.
In step 103, the characters corresponding to all the bounding boxes in each bounding box cluster set are characters in the same line of the target text.
In an alternative embodiment, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box, and the ordinate information of each bounding box includes a maximum ordinate and a minimum ordinate of the bounding box;
And clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, including:
determining an ordinate interval of each bounding box according to the maximum ordinate and the minimum ordinate of the bounding box;
Judging whether an intersection exists between the ordinate intervals of every two bounding boxes;
Dividing the two bounding boxes into the same bounding box cluster set when judging that the intersection exists between the ordinate intervals of the two bounding boxes;
And when judging that no intersection exists between the ordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
In this alternative embodiment, after the center point coordinates and the width and height of the circumscribed rectangular box (bounding box) output by the deep learning model PixelLink are obtained, the maximum ordinate and the minimum ordinate of each bounding box can be determined, and then its ordinate interval. For example, suppose there are three bounding boxes in total, whose center point coordinates are (10, 10), (15, 12), and (20, 20) and whose widths and heights are all (2, 4); the ordinate intervals of the three bounding boxes are then [8, 12], [10, 14], and [18, 22]. Finally, the two bounding boxes with ordinate intervals [8, 12] and [10, 14] are divided into the same bounding box cluster set, and the bounding box with ordinate interval [18, 22] is divided into another bounding box cluster set.
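The following is a minimal sketch of this clustering rule, assuming boxes are given as (center x, center y, width, height); the function and type names are illustrative, not from the source:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)

def ordinate_interval(box: Box) -> Tuple[float, float]:
    """Minimum and maximum ordinate of a bounding box."""
    _, cy, _, h = box
    return cy - h / 2, cy + h / 2

def cluster_by_ordinate(boxes: List[Box]) -> List[List[Box]]:
    """Group boxes whose ordinate intervals intersect, i.e., the same text line.
    For brevity each cluster is represented by its first box's interval."""
    clusters: List[List[Box]] = []
    for box in boxes:
        lo, hi = ordinate_interval(box)
        for cluster in clusters:
            c_lo, c_hi = ordinate_interval(cluster[0])
            if lo <= c_hi and c_lo <= hi:  # the two intervals intersect
                cluster.append(box)
                break
        else:
            clusters.append([box])         # no intersection: start a new line
    return clusters

# Reproduces the example above: the first two boxes form one line, the third another.
print(cluster_by_ordinate([(10, 10, 2, 4), (15, 12, 2, 4), (20, 20, 2, 4)]))
```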
Therefore, by implementing this alternative embodiment, whether two bounding boxes frame characters in the same line can be judged according to whether their ordinate intervals intersect, so that the bounding boxes of characters in the same line can be divided into the same bounding box cluster set, thereby realizing the clustering of the bounding boxes.
104. Input the image in the target image area framed by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of the bounding box.
In step 104, the character recognition model may be the deep learning model CRNN, which is mainly composed of three types of layers, as follows:
(1) Convolutional layer: features of the image in the target image area are extracted with convolutional layers; for example, an image of size (32, 100, 3) is converted into a convolutional feature matrix of size (1, 25, 512).
(2) Recurrent layer: a deep bidirectional LSTM network further extracts character sequence features on the basis of the convolutional feature matrix.
(3) Transcription layer: the RNN output is passed through the softmax activation function, and the character with the highest probability is selected as the character recognition result.
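A minimal PyTorch sketch of this three-layer structure is given below; the exact convolutional stack is not specified in the source, so an adaptive pooling layer is assumed in order to reach the (1, 25, 512) feature shape mentioned above:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Convolutional layer -> recurrent layer -> transcription layer."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # (64, 16, 50)
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # (128, 8, 25)
            nn.Conv2d(128, 512, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 25)),                                 # (512, 1, 25)
        )
        self.rnn = nn.LSTM(512, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x).squeeze(2).permute(0, 2, 1)  # (N, 25, 512) sequence
        seq, _ = self.rnn(feat)                          # bidirectional LSTM features
        return self.fc(seq).softmax(dim=-1)              # per-step class probabilities

probs = TinyCRNN(num_classes=37)(torch.randn(1, 3, 32, 100))  # shape (1, 25, 37)
```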
105. Determine the text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs.
In another optional embodiment, determining the text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs includes:
Determining text coordinates of each bounding box according to the bounding box cluster set to which the bounding box belongs and coordinate information of the bounding box;
And determining the text recognition result of the text image according to the character recognition result and the text coordinates of each bounding box.
In this alternative embodiment, the text coordinates are used to represent the position, in the target text, of the character corresponding to the bounding box; the text coordinates include at least a text longitudinal coordinate and a text transverse coordinate, and bounding boxes belonging to the same bounding box cluster set have the same text longitudinal coordinate. For example, if the text coordinates of a bounding box are (3, 3), the character corresponding to that bounding box is the third character in the third line of the target text. After the character and the text coordinates of each bounding box are determined, the recognized characters are arranged and combined according to their text coordinates to obtain the text corresponding to the whole text image.
It can be seen that by implementing this alternative embodiment, the text coordinates of each bounding box can be determined from its coordinate information, and the characters in the bounding boxes can then be ordered and combined according to the text coordinates to obtain the recognized text of the entire text image.
In this alternative embodiment, further optionally, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box;
and determining the text coordinates of each bounding box according to the bounding box cluster set to which the bounding box belongs and the coordinate information of the bounding box includes:
determining the text longitudinal coordinates of all the bounding boxes in the bounding box cluster set to which each bounding box belongs according to the longitudinal coordinate information of all the bounding boxes in the bounding box cluster set to which the bounding box belongs;
and determining the text transverse coordinates of each bounding box in each bounding box cluster set according to the transverse coordinate information of each bounding box in the bounding box cluster set.
In this further alternative embodiment, the abscissa and the ordinate of the center point coordinates of the circumscribed rectangular box output by the deep learning model PixelLink may be taken as the abscissa information and the ordinate information of the bounding box, respectively. Since the ordinates of characters in the same line of the target text usually differ little, while the ordinates of characters in different lines differ greatly, the text longitudinal coordinates of all the bounding boxes in each bounding box cluster set can be determined according to the ordinate information of all the bounding boxes in that cluster set. Suppose there are three bounding box cluster sets in total. The first cluster set contains three bounding boxes whose abscissa and ordinate information are, in order, (10, 10), (15, 12), and (20, 8); the second contains three bounding boxes with (11, 21), (14, 18), and (20, 20); and the third contains three bounding boxes with (9, 30), (16, 28), and (19, 32). The ordinate information of the bounding boxes in the first, second, and third cluster sets is distributed around 10, 20, and 30, respectively, so the text longitudinal coordinates of the bounding boxes in the three cluster sets are 1, 2, and 3. Sorting the bounding boxes within each cluster set by their abscissa information then yields their text transverse coordinates, so that the bounding boxes (10, 10), (15, 12), (20, 8) obtain the text coordinates (1, 1), (2, 1), (3, 1) in sequence; the bounding boxes (11, 21), (14, 18), (20, 20) obtain (1, 2), (2, 2), (3, 2) in sequence; and the bounding boxes (9, 30), (16, 28), (19, 32) obtain (1, 3), (2, 3), (3, 3) in sequence.
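A short sketch of this coordinate assignment follows (helper names are illustrative): cluster sets are ordered by their mean ordinate to obtain text longitudinal coordinates, and boxes within a set are ordered by abscissa to obtain text transverse coordinates, reproducing the example above:

```python
from typing import Dict, List, Tuple

Center = Tuple[float, float]  # (abscissa, ordinate) of a box center

def assign_text_coordinates(clusters: List[List[Center]]) -> Dict[Center, Tuple[int, int]]:
    """Map each box center to (text transverse, text longitudinal), both 1-based."""
    coords: Dict[Center, Tuple[int, int]] = {}
    lines = sorted(clusters, key=lambda c: sum(y for _, y in c) / len(c))
    for line_idx, line in enumerate(lines, start=1):            # top line first
        for char_idx, box in enumerate(sorted(line), start=1):  # left to right
            coords[box] = (char_idx, line_idx)
    return coords

clusters = [[(10, 10), (15, 12), (20, 8)],
            [(11, 21), (14, 18), (20, 20)],
            [(9, 30), (16, 28), (19, 32)]]
print(assign_text_coordinates(clusters))  # (10, 10) -> (1, 1), ..., (19, 32) -> (3, 3)
```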
It can be seen that by implementing this further alternative embodiment, the bounding boxes can be ordered according to their abscissa and ordinate information to obtain their text transverse and text longitudinal coordinates.
In this another alternative embodiment, determining the text recognition result of the text image according to the character recognition result and the text coordinates of each bounding box includes:
Determining an original text recognition result of the text image according to the character recognition result and the text coordinates of each bounding box;
Determining a regular expression and a text template corresponding to the text image;
Extracting key text information from an original text recognition result based on a regular expression;
and filling the key text information into a text template to obtain a text recognition result of the text image.
In this yet further alternative embodiment, a regular expression is a technical means in the computer arts that can be used to extract a specific portion of the target text. When the text image is a scanned image or a photograph of a certificate such as a graduation certificate, a professional qualification certificate, an enterprise business license, or an enterprise seniority certificate, the text in the certificate is usually in a prescribed format, and apart from some key information the remaining words are of little importance. For example, the key text information in a graduation certificate is the graduating institution, the major, the name, and the like, while the wording of the other parts differs little even across different graduation certificates. The words in the other parts of the graduation certificate are therefore used as a text template; the key text information in the certificate is extracted with a regular expression and then filled into the text template, which yields the text recognition result of the whole text image. In this way, when words in the other parts are recognized incorrectly, the text recognition result of the whole text image is prevented from being incorrect as well. In addition, different types of text images correspond to different regular expressions and text templates; for example, the regular expression corresponding to the text image of an enterprise business license is used to extract key text information such as the enterprise name and the enterprise address, and its text template also differs from the one corresponding to the text image of a graduation certificate.
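The sketch below illustrates the mechanism with hypothetical certificate wording; the template string and the regular expression are stand-ins invented for illustration, not patterns from any real certificate:

```python
import re

TEMPLATE = "This certifies that {name}, majoring in {major}, graduated from {school}."
PATTERN = re.compile(
    r"that\s+(?P<name>\S+),\s+majoring in\s+(?P<major>.+?),\s+graduated from\s+(?P<school>.+?)\."
)

ocr_text = "This certifies that Alice, majoring in Physics, graduated from Example University."
match = PATTERN.search(ocr_text)        # extract the key text information
if match:
    key_info = match.groupdict()
    print(TEMPLATE.format(**key_info))  # fill the text template
```

A character-count check, as described in the next alternative embodiment, could be inserted before the template is filled.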
Therefore, by implementing this further alternative embodiment, the text recognition result of the whole text image is obtained by extracting the key text information in the text image and filling it into the text template corresponding to the text image; this avoids the situation in which the text recognition result of the whole text image is wrong because characters in other parts are misrecognized, thereby improving the recognition accuracy of the text image.
In still another further alternative embodiment, after the key text information is extracted from the original text recognition result based on the regular expression, and before the key text information is filled into the text template to obtain the text recognition result of the text image, the text image recognition method further includes:
Judging whether the number of characters contained in the key text information is equal to the number of correct characters determined in advance;
And triggering and executing the step of filling the key text information into the text template to obtain a text recognition result of the text image when the number of characters contained in the key text information is judged to be equal to the predetermined correct number of characters.
In this further alternative embodiment, the correct number of characters corresponding to each type of text image is preset; for example, the correct number of characters corresponding to the text image of a graduation certificate may be 9, 10, or 11, and that corresponding to the text image of a business license may be 19, 20, 21, 22, and so on. When the number of characters of the key text information is equal to the correct number of characters, it can be determined that the extracted key text information is correct.
It can be seen that by implementing this further alternative embodiment, whether the extracted key text information is correct is determined from the number of characters it contains, and the key text information is filled into the text template only after it is determined to be correct, so that the accuracy of the text recognition result can be improved.
In this still further alternative embodiment, still further optionally, the text image recognition method further includes:
When the number of characters contained in the key text information is not equal to the number of correct characters determined in advance, receiving corrected key text information input by a user;
And filling the corrected key text information into a text template to obtain a text recognition result of the text image.
In this still further alternative embodiment, the corrected key text information may be input directly by the user according to the target text; that is, after it is determined that the key text information was extracted incorrectly, the user directly inputs the corrected key text information, which is then filled into the text template to obtain the text recognition result of the text image.
It can be seen that by implementing this further alternative embodiment, after it is determined that the key text information was extracted incorrectly, the user directly inputs corrected key text information, which is then filled into the text template, so that the accuracy of the text recognition result can be improved.
Therefore, by implementing the text image recognition method described in fig. 1, the recognition accuracy of the text in the text image can be improved, and it is ensured that characters in the same line of the target text are output as characters in the same line in the text recognition result. Whether two bounding boxes frame characters in the same line can be judged according to whether their ordinate intervals intersect. The text coordinates of each bounding box can be determined from its coordinate information, and the characters in the bounding boxes can be ordered and combined according to the text coordinates to obtain the recognized text of the whole text image. The bounding boxes can be sorted according to their abscissa and ordinate information to obtain their text transverse and text longitudinal coordinates. Moreover, the situation in which the text recognition result of the whole text image is wrong because characters in other parts are misrecognized can be avoided, thereby improving the recognition accuracy of the text image.
Example 2
Referring to fig. 2, fig. 2 is a flowchart illustrating another text image recognition method according to an embodiment of the invention. As shown in fig. 2, the text image recognition method may include the following operations:
201. Perform mean filtering on the text image carrying the target text to obtain a filtered image.
In step 201, the mean filtering process may be as follows:
(1) The size of the filter is set; it is generally odd. For example:
the filter is a matrix of size (3, 3) containing 9 elements, each equal to 1; in an actual application scenario, the size and the element values of the filter can be set according to actual conditions.
(2) The filter is moved over the text image so that its center coincides with each pixel of the text image in turn; the elements of the filter are multiplied by the pixels of the text image they overlap, and the products are summed and divided by the number of filter elements. This can be expressed as the following formula:
g(i, j) = (1/n) * Σ_(k, l) h(k, l) * f(i+k, j+l)
where f(i+k, j+l) represents the pixel value at coordinate (i+k, j+l) in the pixel matrix of the picture before denoising, g(i, j) represents the pixel value at coordinate (i, j) in the pixel matrix of the picture after denoising, and h(k, l) is the filter matrix containing n elements, with the sum taken over all filter positions (k, l).
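A brief sketch of step 201 using OpenCV (the input path is hypothetical); the (3, 3) kernel of ones divided by 9 matches the formula above:

```python
import cv2
import numpy as np

image = cv2.imread("certificate.jpg")  # hypothetical input path
filtered = cv2.blur(image, (3, 3))     # built-in mean filtering with a 3x3 kernel

# Equivalent explicit form: convolve with h = ones((3, 3)) / 9.
kernel = np.ones((3, 3), dtype=np.float32) / 9.0
filtered_explicit = cv2.filter2D(image, -1, kernel)
```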
202. Perform graying processing on the filtered image to obtain a gray image.
In step 202, graying an image means that, in an image under the RGB color model, the R channel, G channel, and B channel of each pixel are given the same value, so that the whole image appears gray. The common value m of the R, G, and B channels is called the gray value. The usual graying methods are the following:
Mode one: m = (R + G + B) / 3
Mode two: m = 0.3R + 0.59G + 0.11B
In the embodiment of the present invention, the graying process is preferably performed in mode two.
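A brief sketch of mode two (note that OpenCV loads images in BGR order, so the channels are reversed relative to the RGB weights above; the input path is hypothetical):

```python
import cv2
import numpy as np

image = cv2.imread("certificate.jpg")  # hypothetical input path
b, g, r = image[..., 0], image[..., 1], image[..., 2]
gray = (0.3 * r + 0.59 * g + 0.11 * b).astype(np.uint8)  # mode two

# cv2.cvtColor uses the very similar ITU-R BT.601 weights (0.299, 0.587, 0.114).
gray_cv = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
```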
203. Perform binarization processing on the gray image to obtain a binarized image.
In an alternative embodiment, binarizing the gray scale image to obtain a binarized image includes:
dividing a gray image into a plurality of gray image regions;
according to the gray values of all pixel points in each gray image area, determining a binarization threshold value corresponding to the gray image area;
and carrying out binarization processing on the image in each gray level image area according to the binarization threshold value corresponding to each gray level image area to obtain a binarized image of the whole gray level image.
In this alternative embodiment, binarizing an image may be understood as resetting the gray value of each pixel according to whether it is greater than the binarization threshold: when the gray value of a pixel is greater than the binarization threshold, the gray value is set to 255 (i.e., the pixel becomes white), and otherwise it is set to 0 (i.e., the pixel becomes black), so that the whole image becomes a black-and-white image. Further, since different portions of the same image may have different brightness, the binarized image obtained with a single binarization threshold for the whole image is not always ideal. In this case, adaptive binarization is used, that is, different binarization thresholds are used for different areas of the same image, so that a binarized image with a better effect can be obtained.
It can be seen that by implementing this alternative embodiment, different binarization thresholds can be used for different regions of the same image, so that a better binarized image can be obtained.
In this alternative embodiment, further optionally, each gray image region is a rectangular region;
and determining the binarization threshold corresponding to each gray image area according to the gray values of all pixel points in the gray image area includes:
The binarization threshold corresponding to each gray image region is calculated by the following formula:
Thr = (1/n) * Σ_(i = a .. a+k-1) Σ_(j = b .. b+l-1) f(i, j)
where a denotes the abscissa, in the gray image, of the lower left corner pixel of the gray image region, b denotes the ordinate of that pixel, k denotes the number of pixels the region occupies in the lateral direction of the gray image, l denotes the number of pixels it occupies in the longitudinal direction, f(i, j) denotes the gray value of the pixel with coordinates (i, j) in the gray image, n = k * l denotes the total number of pixels in the region, and Thr denotes the binarization threshold corresponding to the region.
It can be seen that by implementing this further alternative embodiment, the average of the gray values of all pixels in each gray image region is taken as the binarization threshold, so that a binarized image with a better effect can be obtained.
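A sketch of this region-wise thresholding follows; the block size is an illustrative choice:

```python
import numpy as np

def adaptive_binarize(gray: np.ndarray, block: int = 32) -> np.ndarray:
    """Binarize each block of the gray image with its own mean threshold."""
    out = np.zeros_like(gray)
    h, w = gray.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            region = gray[y:y + block, x:x + block]
            thr = region.mean()  # Thr = (1/n) * sum of the region's gray values
            out[y:y + block, x:x + block] = np.where(region > thr, 255, 0)
    return out
```

OpenCV's cv2.adaptiveThreshold with ADAPTIVE_THRESH_MEAN_C implements a sliding-window variant of the same idea.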
204. Calculate the first vertex coordinates of the binarized image based on a predetermined image edge detection algorithm.
In step 204, the image edge detection algorithm may be the Canny edge detection algorithm; three vertex coordinates (i.e., the first vertex coordinates) of the target text in the text image are extracted with the Canny edge detection algorithm, and the three vertex coordinates are respectively:
A(x1, y1), B(x2, y2), C(x3, y3)
The Canny edge detection algorithm first convolves the pixel matrix of the text image with the Sobel operator (or another gradient operator) to obtain the gradient values gx(m, n) and gy(m, n) in the two directions, and then combines them to obtain the gradient magnitude and gradient direction:
G(m, n) = √(gx(m, n)² + gy(m, n)²), θ(m, n) = arctan(gy(m, n) / gx(m, n))
where G(m, n) is the gradient magnitude and θ(m, n) is the gradient direction.
Edge pixels are then screened with upper and lower thresholds. The screening rule is to set two thresholds, an upper threshold maxVal and a lower threshold minVal: pixels with gradient magnitude greater than maxVal are all detected as edges, and pixels with gradient magnitude lower than minVal are all detected as non-edges. A pixel whose gradient magnitude lies between the two thresholds is judged to be an edge if it is adjacent to a pixel already determined to be an edge, and a non-edge otherwise.
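A brief sketch of step 204 (the threshold values and input path are illustrative): Canny edge detection with the lower and upper thresholds minVal and maxVal, followed by contour extraction from which vertex coordinates could be derived:

```python
import cv2

binarized = cv2.imread("binarized.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
edges = cv2.Canny(binarized, threshold1=50, threshold2=150)    # minVal=50, maxVal=150
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
```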
205. Determine the second vertex coordinates from the predetermined alignment image.
In step 205, the alignment image is a reference image with a specified geometry to which the binarized image is to be aligned; the new binarized image obtained after the binarized image is aligned with the alignment image will have the same geometry. Three vertex coordinates (i.e., the second vertex coordinates) of the alignment image may be determined from the long side and the wide side of the alignment image, and the three vertex coordinates are expressed as:
A′(x′1, y′1), B′(x′2, y′2), C′(x′3, y′3)
206. Determine an affine transformation matrix from the first vertex coordinates and the second vertex coordinates.
In step 206 described above, the affine transformation matrix M can be expressed as the 2 × 3 matrix
M = [ m11  m12  m13
      m21  m22  m23 ]
whose six elements are solved from the three point correspondences, i.e., M satisfies (x′i, y′i)ᵀ = M · (xi, yi, 1)ᵀ for i = 1, 2, 3.
207. Perform affine transformation on the binarized image according to the affine transformation matrix to obtain a model input image.
In step 207 described above, the affine transformation can be expressed as:
G′=M*G
wherein G' represents the affine transformed picture pixel matrix, and G represents the picture pixel matrix before affine transformation.
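A brief sketch of steps 205 to 207 with illustrative coordinates: the affine matrix M is solved from the three detected vertices (first vertex coordinates) and the three vertices of the alignment image (second vertex coordinates), and then applied to the binarized image:

```python
import cv2
import numpy as np

src = np.float32([[12, 34], [410, 28], [18, 280]])  # A, B, C from edge detection
dst = np.float32([[0, 0], [400, 0], [0, 260]])      # A', B', C' of the alignment image

M = cv2.getAffineTransform(src, dst)                # 2x3 affine transformation matrix
binarized = cv2.imread("binarized.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
model_input = cv2.warpAffine(binarized, M, (400, 260))         # G' = M * G
```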
208. Input the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image.
209. Cluster all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set.
210. Input the image in the target image area framed by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of the bounding box.
211. Determine the text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs.
For a specific description of steps 208 to 211, reference may be made to the specific descriptions of steps 102 to 105, which are not repeated here.
In a further alternative embodiment, the bounding box further comprises geometric information of the target image area, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image area;
and after the model input image is input into the predetermined text detection model for analysis to obtain the text detection result of the text image, and before all the bounding boxes are clustered according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, the text image recognition method further includes the following steps:
When the geometric information comprises the pixel width of the target image area, selecting a target pixel width from all the pixel widths;
removing the bounding boxes whose pixel widths are smaller than or equal to the target pixel width from the text detection result, and triggering and executing the step of clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
When the geometric information comprises the pixel length of the target image area, selecting a target pixel length from all the pixel lengths;
removing the bounding boxes whose pixel lengths are smaller than or equal to the target pixel length from the text detection result, and triggering and executing the step of clustering all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
When the geometric information comprises pixel areas of the target image area, selecting the target pixel area from all the pixel areas;
and removing the boundary boxes with the pixel areas smaller than or equal to the target pixel areas from the text detection result, and triggering and executing the step of clustering all the boundary boxes according to the coordinate information of each boundary box to obtain at least one boundary box clustering set.
In this alternative embodiment, the circumscribed rectangular box (i.e., bounding box) output by the deep learning model PixelLink is sometimes erroneous, so the bounding boxes it outputs need to be filtered to ensure the accuracy of the final text recognition result. Filtering the bounding boxes using the geometric features (pixel width, pixel length, pixel area) of the target image areas is a simple and efficient method. For example, the pixel widths of all target image areas are sorted from large to small, and the pixel width ranked at the 99% position is selected as the target pixel width; if the target pixel width is 10 pixels, bounding boxes with a pixel width smaller than 10 pixels are deleted, which realizes the filtering of the bounding boxes. Among all the bounding boxes output by the deep learning model PixelLink, a bounding box with a pixel width of less than 10 pixels is generally erroneous and therefore needs to be deleted, while the bounding boxes ranked in the first 99% are generally valid.
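A sketch of the width-based variant follows (the same idea applies to pixel length and pixel area); the function name is illustrative:

```python
from typing import List, Sequence

import numpy as np

def filter_by_width(boxes: Sequence, widths: Sequence[float]) -> List:
    """Keep boxes wider than the width ranked at the 99% position (descending)."""
    order = np.sort(np.asarray(widths))[::-1]  # large to small
    target = order[min(len(order) - 1, int(0.99 * len(order)))]
    return [box for box, w in zip(boxes, widths) if w > target]
```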
It can be seen that by implementing this alternative embodiment, erroneous bounding boxes can be removed from the text detection result according to their geometric features, thereby improving the accuracy of the obtained text recognition result.
Therefore, by implementing the text image recognition method described in fig. 2, different binarization thresholds can be used for different areas of the same image, with the average of the gray values of all pixels in each gray image area taken as the binarization threshold, so that a binarized image with a better effect can be obtained. In addition, erroneous bounding boxes can be removed from the text detection result according to their geometric features, thereby improving the accuracy of the obtained text recognition result.
Example 3
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text image recognition device according to an embodiment of the invention. As shown in fig. 3, the text image recognition apparatus may include:
the preprocessing module 301 is configured to preprocess a text image carrying a target text to obtain a model input image;
A first analysis module 302, configured to input the model input image into a predetermined text detection model for analysis to obtain a text detection result of the text image, where the text detection result includes at least one bounding box for framing a target image region in the text image, each bounding box includes at least coordinate information for indicating the position of the target image region in the text image, and the target image region is the image region in the text image where a single character of the target text is located;
The clustering module 303 is configured to cluster all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set, where the characters corresponding to all bounding boxes in each bounding box cluster set are characters in the same line in the target text;
the second analysis module 304 is configured to input the image in the target image area framed by each bounding box into a predetermined character recognition model for analysis to obtain the character recognition result of the bounding box;
the determining module 305 is configured to determine the text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs.
Therefore, by implementing the text image recognition device depicted in fig. 3, the coordinate information corresponding to each character in the text image carrying the target text is analyzed; all the characters are then clustered according to their coordinate information, so that characters in the same line of the target text are divided into one cluster set; the character recognition result corresponding to each character is analyzed; and the text recognition result of the text image is finally determined according to the character recognition result, the coordinate information, and the cluster set to which each character belongs. The recognition accuracy of the text in the text image can thereby be improved, and it is ensured that characters in the same line of the target text are output as characters in the same line in the text recognition result.
In an alternative embodiment, the determining module 305 determines the text recognition result of the text image according to the character recognition result, the coordinate information, and the bounding box cluster set to which each bounding box belongs in the following specific manner:
determining the text coordinates of each bounding box according to the bounding box cluster set to which the bounding box belongs and the coordinate information of the bounding box, where the text coordinates are used to represent the position, in the target text, of the character corresponding to the bounding box; the text coordinates include at least a text longitudinal coordinate and a text transverse coordinate, and bounding boxes belonging to the same bounding box cluster set have the same text longitudinal coordinate;
and determining the text recognition result of the text image according to the character recognition result and the text coordinates of each bounding box.
It can be seen that the text image recognition apparatus described in fig. 4 can determine the text coordinates of each bounding box according to its coordinate information, and then sort and combine the characters in the bounding boxes according to the text coordinates, thereby obtaining the recognized text of the whole text image.
In this optional embodiment, further optional, the coordinate information of each bounding box includes abscissa information and ordinate information of the bounding box;
and, the determining module 305 determines, according to the bounding box cluster set to which each bounding box belongs and the coordinate information of the bounding box, the text coordinates of the bounding box in the following specific manner:
determining the text longitudinal coordinates of all the bounding boxes in the bounding box cluster set to which each bounding box belongs according to the longitudinal coordinate information of all the bounding boxes in the bounding box cluster set to which the bounding box belongs;
and determining the text transverse coordinates of each bounding box in each bounding box cluster set according to the transverse coordinate information of each bounding box in the bounding box cluster set.
It can be seen that the text image recognition device described in fig. 4 can sort the bounding boxes according to their abscissa and ordinate information to obtain their text transverse and text longitudinal coordinates.
In this further alternative embodiment, still further alternative, the ordinate information of each bounding box includes a maximum ordinate and a minimum ordinate of the bounding box;
and, the clustering module 303 clusters all the bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set in the following specific ways:
determining an ordinate interval of each bounding box according to the maximum ordinate and the minimum ordinate of the bounding box;
Judging whether an intersection exists between the ordinate intervals of every two bounding boxes;
Dividing the two bounding boxes into the same bounding box cluster set when judging that the intersection exists between the ordinate intervals of the two bounding boxes;
And when judging that no intersection exists between the ordinate intervals of the two bounding boxes, dividing the two bounding boxes into different bounding box clustering sets.
Therefore, the text image recognition device described in fig. 4 can judge whether two bounding boxes frame characters in the same line according to whether their ordinate intervals intersect, so that the bounding boxes of characters in the same line can be divided into the same bounding box cluster set, thereby realizing the clustering of the bounding boxes.
In another alternative embodiment, the determining module 305 determines the text recognition result of the text image according to the character recognition result and the text coordinates of each bounding box in the following specific manner:
determining an original text recognition result of the text image according to the character recognition result and the text coordinates of each bounding box;
Determining a regular expression and a text template corresponding to the text image;
Extracting key text information from an original text recognition result based on a regular expression;
and filling the key text information into a text template to obtain a text recognition result of the text image.
Therefore, by implementing the text image recognition device described in fig. 4, the text recognition result of the whole text image is obtained by extracting the key text information in the text image and filling it into the corresponding text template; this avoids the situation in which the text recognition result of the whole text image is wrong because characters in other parts are misrecognized, thereby improving the recognition accuracy of the text image.
In yet another alternative embodiment, the preprocessing module 301 performs preprocessing on the text image carrying the target text, and the specific manner of obtaining the model input image is:
carrying out mean value filtering on the text image bearing the target text to obtain a filtered image;
Carrying out graying treatment on the filtered image to obtain a gray image;
performing binarization processing on the gray level image to obtain a binarized image;
calculating a first vertex coordinate of the binarized image based on a predetermined image edge detection algorithm;
determining a second vertex coordinate according to the predetermined alignment image;
determining an affine transformation matrix according to the first vertex coordinates and the second vertex coordinates;
And carrying out affine transformation on the binarized image according to the affine transformation matrix to obtain a model input image.
It can be seen that the text image recognition apparatus described in fig. 4 can preprocess the text image in multiple ways.
In this further alternative embodiment, further optionally, the preprocessing module 301 performs binarization processing on the gray-scale image, and the specific manner of obtaining the binarized image is:
dividing a gray image into a plurality of gray image regions;
according to the gray values of all pixel points in each gray image area, determining a binarization threshold value corresponding to the gray image area;
and carrying out binarization processing on the image in each gray level image area according to the binarization threshold value corresponding to each gray level image area to obtain a binarized image of the whole gray level image.
It can be seen that the text image recognition device described in fig. 4 can use different binarization thresholds for different areas of the same image, so that a better binarized image can be obtained.
In this further alternative embodiment, still further optionally, each gray image region is a rectangular region;
and, the preprocessing module 301 determines, according to the gray values of all the pixels in each gray image area, the binarization threshold corresponding to the gray image area in the following specific manner:
The corresponding binarization threshold value of each gray image region is calculated by the following formula:
Thr = (1/n) * Σ_(i = a .. a+k-1) Σ_(j = b .. b+l-1) f(i, j)
where a denotes the abscissa, in the gray image, of the lower left corner pixel of the gray image region, b denotes the ordinate of that pixel, k denotes the number of pixels the region occupies in the lateral direction of the gray image, l denotes the number of pixels it occupies in the longitudinal direction, f(i, j) denotes the gray value of the pixel with coordinates (i, j) in the gray image, n = k * l denotes the total number of pixels in the region, and Thr denotes the binarization threshold corresponding to the region.
As can be seen, the text image recognition device described in fig. 4 takes the average of the gray values of all pixels in each gray image region as the binarization threshold, so that a better binarized image can be obtained.
In a further alternative embodiment, the bounding box further comprises geometric information of the target image area, the geometric information comprising a pixel width and/or a pixel length and/or a pixel area of the target image area;
and, the recognition device of the text image further includes:
The selecting module 306 is configured to, after the first analyzing module 302 inputs the model input image to a predetermined text detection model for analysis, obtain a text detection result of the text image, cluster all bounding boxes according to the coordinate information of each bounding box by the clustering module 303 to obtain at least one bounding box cluster set, and when the geometric information includes the pixel width of the target image area, select a target pixel width from all pixel widths;
The removing module 307 is configured to remove the bounding boxes with the pixel widths less than or equal to the target pixel width from the text detection result, and trigger the clustering module 303 to perform the above-described operation of clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
the selecting module 306 is configured to, after the first analyzing module 302 inputs the model input image to a predetermined text detection model for analysis, obtain a text detection result of the text image, cluster all bounding boxes according to coordinate information of each bounding box by the clustering module 303 to obtain at least one bounding box cluster set, and when the geometric information includes pixel lengths of the target image area, select a target pixel length from all pixel lengths;
The removing module 307 is configured to remove the bounding boxes with the pixel lengths less than or equal to the target pixel length from the text detection result, and trigger the clustering module 303 to perform the above-described operation of clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set;
The selecting module 306 is configured to, after the first analyzing module 302 inputs the model input image to a predetermined text detection model for analysis, obtain a text detection result of the text image, cluster all bounding boxes according to the coordinate information of each bounding box by the clustering module 303 to obtain at least one bounding box cluster set, and when the geometric information includes the pixel areas of the target image area, select a target pixel area from all pixel areas;
The removing module 307 is configured to remove the bounding boxes with the pixel areas less than or equal to the target pixel area from the text detection result, and trigger the clustering module 303 to perform the above-mentioned operation of clustering all bounding boxes according to the coordinate information of each bounding box to obtain at least one bounding box cluster set.
It can be seen that the text image recognition apparatus described in fig. 4 can remove erroneous bounding boxes from the text detection result according to their geometric features, thereby improving the accuracy of the obtained text recognition result.
For the specific description of the text image recognition device, reference may be made to the specific description of the text image recognition method, which is not described in detail herein.
Example 4
Referring to fig. 5, fig. 5 is a schematic structural diagram of a text image recognition device according to another embodiment of the present invention. As shown in fig. 5, the apparatus may include:
A memory 501 in which executable program codes are stored;
A processor 502 coupled to the memory 501;
the processor 502 invokes executable program codes stored in the memory 501 to perform the steps in the recognition method of a text image disclosed in the first or second embodiment of the present invention.
Example 5
The embodiment of the invention discloses a computer storage medium which stores computer instructions for executing the steps in the method for recognizing a text image disclosed in the first or second embodiment of the invention when the computer instructions are called.
The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied, essentially or in part, in the form of a software product that may be stored in a computer-readable storage medium, including a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disc memory, tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the method and apparatus for recognizing text images disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, and are only used for illustrating the technical solutions of the present invention, but not limiting the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solutions described in the foregoing embodiments may be modified or some technical features thereof may be equivalently replaced, and these modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.