CN112329803A - A natural scene text recognition method based on standard glyph generation - Google Patents
A natural scene text recognition method based on standard glyph generation
- Publication number
- CN112329803A (application CN201910716704.1A)
- Authority
- CN
- China
- Prior art keywords
- glyph
- standard
- font
- attention
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Character Discrimination (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a character recognition method based on standard glyph generation. A neural network model based on an attention mechanism and a generation mechanism is established; at each time step, attention is focused on one position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed. The method thereby recognizes and outputs the characters in a natural scene picture containing one or more characters. By exploiting multi-font glyph generation, the invention improves the attention module and the quality of the generated glyphs, and consequently the accuracy of character recognition.
Description
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, relates to a character recognition method, and particularly relates to a method for recognizing characters in a natural scene picture.
Background
In the field of computer vision and pattern recognition, character recognition means having a computer automatically recognize the textual content of a picture. Natural scene character recognition specifically refers to recognizing all of the text in a natural scene picture whose main subject is text. Automatic recognition of text in natural scenes is of great significance for improving productivity and daily life, understanding image content, and enabling machines to perceive their environment.
To date, many text recognition techniques have been proposed in academia and industry, mainly classified into local-feature-based methods and neural-network-based methods. The local-feature-based methods are represented by the method proposed in the literature (Wang, K., Babenko, B., & Belongie, S. J. (2011). End-to-End Scene Text Recognition. In 2011 International Conference on Computer Vision (pp. 1457-1464)). It locates feature points by a series of hand-crafted rules and extracts features at those points for character classification. However, in natural scene pictures the background of the text and its fonts are complex and the shape of the text is not fixed (bent, tilted, etc.), and such methods provide no unified criterion for which feature points are important, so they cannot achieve good recognition results.
Recently, methods based on neural networks have been proposed. Exploiting the ability of neural networks to select features adaptively and their strong robustness to noise, these methods perform excellently on the character recognition problem. They generally extract visual features of the picture with a convolutional neural network (CNN), perform sequence modeling with a recurrent neural network (RNN), and predict each character in the picture in turn; the long short-term memory network (LSTM) is a commonly used RNN structure. The most advanced current methods are represented by the ASTER method (Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence) and the SAR method (Li, H., Wang, P., Shen, C., & Zhang, G. (2018). Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. arXiv:1811.00751). However, these methods share a defect: they supervise the neural network with character category labels only, and the guidance such labels provide is insufficient. When processing pictures with noisy text backgrounds or novel font styles, they cannot extract discriminative features, so recognition accuracy remains unsatisfactory. Some methods attempt to use standard glyphs as additional supervisory information, such as the SSFL method (Liu, Y., Wang, Z., Jin, H., & Wassell, I. J. (2018). Synthetically Supervised Feature Learning for Scene Text Recognition. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 449-465)) and the method of (Zhang, Y., Liang, S., Nie, S., Liu, W., & Peng, S. (2018). Robust offline handwritten character recognition through exploring writer-independent features under the guidance of printed data. Pattern Recognition Letters), but these methods do not adopt model structures suited to generating standard glyphs in multiple fonts, so the quality of the generated glyphs and the recognition accuracy achieved with them remain limited.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a character recognition method based on standard glyph generation. For the natural scene character features extracted by the neural network, an attention mechanism is combined with glyph generation: the neural network is used both to predict character categories and to generate, for each natural scene character, its corresponding standard glyphs in multiple fonts. By learning how to generate standard glyphs, the network extracts natural scene character features that are more robust to interference factors such as noisy backgrounds and font styles, thereby improving the accuracy of character recognition.
For convenience of explanation, the present invention has the following definitions of terms:
Natural scene picture: a picture of a real scene captured by a person.
Text picture: a picture whose main subject is text content, containing one or more pieces of text.
The core of the invention is as follows: in the process of recognizing characters, the font style information carried in the neural network features is unnecessary, redundant information. The prior-art SSFL has two main problems. First, SSFL generates standard glyphs in a single font only to learn how to filter out the background of natural scene characters; it considers neither generating glyphs of multiple fonts nor the effect that doing so could bring. Second, the model provided by SSFL cannot generate glyphs of multiple fonts, which poses a technical difficulty. Unlike SSFL, which uses the glyphs of only one font as the generation target, the invention proposes standard glyph generation for multiple fonts: a glyph generator generates from the attention vector c(x,t) the corresponding standard glyphs $\hat g_i(x,t)$, i = 1, ..., m, of m fonts, and a glyph discriminator competes against the glyph generator so that the generator produces more realistic standard glyphs. For a given character there exist several typical standard fonts, such as Song, Kai (regular), Hei (black) and so on. The method uses a font-style embedding vector z to control which font is generated, so that the features extracted by the neural network reflect only the essential content information, namely which character it is; this reduces unnecessary font style information in the neural network features and further improves recognition accuracy. At the same time, controlling the font of the generated glyph with the style embedding vector z, as newly proposed by the invention, solves the problem of multi-font generation. In addition, the attention mechanism and standard glyph generation are optimized jointly, organically combining two models that would otherwise be learned independently, so that both perform better.
The technical scheme provided by the invention is as follows:
A character recognition method based on standard glyph generation. The invention processes a natural scene picture containing one or more characters and outputs the characters in the picture sequentially in writing order. It uses a neural network model based on an attention mechanism and a generation mechanism: at each time step, attention is focused on one position of the picture, and the neural network features at that position are used to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in the picture.
The attention mechanism and generation mechanism based neural network model comprises:
A. a convolutional neural network for extracting the visual features F(x) of the input picture x;
B. a recurrent neural network for sequence modeling of F(x), comprising an LSTM encoder and an LSTM decoder;
C. an attention module for obtaining the attention weight matrix M(x,t) from the hidden state h(x,t) of the recurrent neural network at time t and from F(x);
D. a classifier for classifying features; in the specific implementation a softmax classifier is adopted;
E. a glyph generator for generating, from the attention vector c(x,t), the corresponding standard glyphs $\hat g_i(x,t)$, i = 1, ..., m, of m fonts;
F. a glyph discriminator for competing against the glyph generator, so that the glyph generator can generate more realistic standard glyphs.
The character recognition method based on standard font generation specifically comprises the following steps:
1. Extract the visual features F(x) of the input picture x using a convolutional neural network.
2. Perform sequence modeling on F(x) with the recurrent neural network, and feed its hidden state h(x,t) at time t, together with F(x), to the attention module to obtain the attention weight matrix M(x,t), which represents the attention allocated to each region of the picture at time t.
3. Point-multiply each feature channel of F(x) with M(x,t) to obtain the attention vector c(x,t), which represents the features of the picture region attended to at time t.
4. Concatenate c(x,t) and h(x,t), classify the concatenated features with the classifier, and predict the character category at the attended position at time t.
5. Generate from the attention vector c(x,t), using the glyph generator, the corresponding standard glyphs $\hat g_i(x,t)$ of m fonts, and use the glyph discriminator to compete against the glyph generator so that the generator produces more realistic standard glyphs.
In step 1, the convolutional neural network of the SAR method (a scene text recognizer based on an attention mechanism) is taken as a basis, and the stride of the first convolution unit in each of the last two convolution groups is modified to 1 × 1; this serves as the CNN feature extractor of the invention for extracting the visual features F(x) of the input picture x. Here $x \in \mathbb{R}^{48\times160\times3}$, i.e., the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and $F(x) \in \mathbb{R}^{H\times W\times C}$ with H = 6, W = 40, C = 512 denoting the height, width and number of channels of F(x), respectively.
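As a check on these dimensions, the following minimal PyTorch sketch maps an input of the stated shape to a 6×40×512 feature map. The single strided convolution is an illustrative stand-in chosen only to reproduce the shapes; it is not the SAR-based extractor described above.

```python
import torch
import torch.nn as nn

# Stand-in CNN feature extractor (assumption): one strided convolution that
# reproduces the stated shapes 48x160x3 -> 6x40x512. The actual extractor is
# the SAR backbone with the modified strides described above.
cnn = nn.Sequential(
    nn.Conv2d(3, 512, kernel_size=3, stride=(8, 4), padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 3, 48, 160)   # input picture x: height 48, width 160, 3 channels
Fx = cnn(x)                      # visual features F(x)
print(Fx.shape)                  # torch.Size([1, 512, 6, 40]): C=512, H=6, W=40
```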
In step 2, the features F(x) are sequence-modeled using an LSTM encoder and decoder. Both the LSTM encoder and the decoder have two hidden layers with 512 nodes per layer. Each column of features of F(x) along the width (W) dimension is max-pooled along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x,t) of the LSTM decoder at time t is fed to the attention module together with F(x) to obtain the attention weight matrix $M(x,t) \in \mathbb{R}^{H\times W}$, which is calculated as follows:

$$M'_{ij}(x,t) = \tanh\Big(\sum_{p,q\in N(i,j)} W_F F_{pq}(x) + W_h h(x,t)\Big) \qquad (1)$$

$$M(x,t) = \mathrm{softmax}\big(W_M M'(x,t)\big) \qquad (2)$$

where M'(x,t) is an intermediate variable of the calculation; $M'_{ij}(x,t)$ is its entry at position (i,j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i,j) is the neighborhood centered at (i,j), i.e. i−1 ≤ p ≤ i+1 and j−1 ≤ q ≤ j+1; $F_{pq}(x)$ is the feature of F(x) at position (p,q); $W_F$, $W_h$ and $W_M$ are parameters to be learned; tanh is the hyperbolic tangent function and softmax is the normalized exponential function.
In step 3, the features of each channel of F(x) are point-multiplied with M(x,t) and summed over all positions, giving the attention vector $c(x,t) = \sum_{i,j} M_{ij}(x,t)\,F_{ij}(x) \in \mathbb{R}^{C}$, which represents the features of the picture region attended to at time t.
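A minimal sketch of the attention module of equations (1)-(2) and of the attention vector of step 3 is given below. Folding the sum over the 3×3 neighborhood N(i,j) of $W_F F_{pq}(x)$ into a single 3×3 convolution, and the layer sizes, are implementation assumptions.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """M(x,t) from F(x) and h(x,t) per equations (1)-(2), then c(x,t)."""
    def __init__(self, C=512, hidden=512):
        super().__init__()
        # W_F fused with the 3x3 neighborhood sum of equation (1) (assumption):
        self.W_F = nn.Conv2d(C, hidden, kernel_size=3, padding=1, bias=False)
        self.W_h = nn.Linear(hidden, hidden, bias=False)
        self.W_M = nn.Conv2d(hidden, 1, kernel_size=1, bias=False)

    def forward(self, Fx, h_t):
        B, C, H, W = Fx.shape
        # Equation (1): M'(x,t) = tanh(sum_{p,q} W_F F_pq(x) + W_h h(x,t))
        M_prime = torch.tanh(self.W_F(Fx) + self.W_h(h_t)[:, :, None, None])
        # Equation (2): softmax over all H*W positions of W_M M'(x,t)
        M = torch.softmax(self.W_M(M_prime).flatten(1), dim=1).view(B, 1, H, W)
        # Step 3: weight each channel of F(x) by M(x,t) and sum over positions
        c = (M * Fx).sum(dim=(2, 3))           # c(x,t) in R^C
        return M.squeeze(1), c

att = AttentionModule()
M, c = att(torch.randn(1, 512, 6, 40), torch.randn(1, 512))
print(M.shape, c.shape)                        # (1, 6, 40) and (1, 512)
```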
In step 4, the concatenated features of the attention vector c(x,t) and h(x,t) are classified with the softmax classifier commonly used in machine learning, giving the probability of the character category $y_t$ at the attended position at time t:

$$p(y_t \mid x) = \mathrm{softmax}\big(W_o\,[c(x,t);\,h(x,t)] + b_o\big) \qquad (3)$$

where $W_o$ and $b_o$ are parameters to be learned, the square brackets denote concatenation, and $y_t \in \mathcal{C}$, with $\mathcal{C}$ the set of all character categories. The category $\hat y_t$ that maximizes $p(y_t \mid x)$ is selected as the predicted character category.
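Equation (3) amounts to a single linear layer applied to the concatenation [c(x,t); h(x,t)] followed by a softmax. A sketch, with the class count chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

num_classes = 37                         # illustrative size of the category set C
W_o = nn.Linear(512 + 512, num_classes)  # realizes W_o and b_o of equation (3)

c_t, h_t = torch.randn(1, 512), torch.randn(1, 512)
p = torch.softmax(W_o(torch.cat([c_t, h_t], dim=1)), dim=1)  # p(y_t | x)
y_hat = p.argmax(dim=1)                  # predicted character category
```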
In step 5, a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x,t) as input and generates the standard glyphs of the m selected fonts, as represented by formula (4):

$$\hat g_i(x,t) = G\big([c(x,t);\,z_i]\big), \quad i = 1, \dots, m \qquad (4)$$

where $z_i$, the embedding vector of the i-th font, is a random vector following a multivariate standard normal distribution; the square brackets denote concatenation; and m is the chosen number of fonts. The real multi-font standard glyphs $g_i(x,t)$ are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, adopting the idea of generative adversarial networks, a glyph discriminator based on a convolutional neural network discriminates between generated and real standard glyphs; the competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator assigns the generated glyph $\hat g_i(x,t)$ a probability $p(y_d{=}1 \mid \hat g_i(x,t))$ of being real and a probability $p(y_d{=}0 \mid \hat g_i(x,t)) = 1 - p(y_d{=}1 \mid \hat g_i(x,t))$ of being fake; likewise, it assigns the real glyph $g_i(x,t)$ a probability $p(y_d{=}1 \mid g_i(x,t))$ of being real and $p(y_d{=}0 \mid g_i(x,t)) = 1 - p(y_d{=}1 \mid g_i(x,t))$ of being fake.
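A minimal sketch of the DCNN glyph generator conditioned on the font-style embedding $z_i$, together with a convolutional glyph discriminator, follows. The layer counts, the 32×32 glyph size and the 128-dimensional embedding are assumptions for illustration; the actual configurations are those of Table 2 below.

```python
import torch
import torch.nn as nn

class GlyphGenerator(nn.Module):
    """Equation (4): maps [c(x,t); z_i] to the standard glyph of font i."""
    def __init__(self, C=512, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(C + z_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(      # 4x4 -> 8x8 -> 16x16 -> 32x32
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, c, z):
        h = self.fc(torch.cat([c, z], dim=1)).view(-1, 256, 4, 4)  # [c(x,t); z_i]
        return self.deconv(h)

class GlyphDiscriminator(nn.Module):
    """Outputs p(y_d = 1 | glyph), the probability that a glyph is real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(128 * 8 * 8, 1), nn.Sigmoid(),
        )

    def forward(self, g):
        return self.net(g)

c, z = torch.randn(1, 512), torch.randn(1, 128)  # z_i ~ N(0, I)
g_hat = GlyphGenerator()(c, z)                   # generated glyph of font i
p_real = GlyphDiscriminator()(g_hat)             # p(y_d = 1 | generated glyph)
```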
The network parameters to be trained comprise the parameters to be learned in the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator and the glyph discriminator. During training, the invention updates the network parameters by combining the character category prediction loss, the glyph pixel-point loss and the glyph discriminator loss. Specifically, two objective functions $L_G$ and $L_D$ are optimized iteratively:

$$L_G = -\sum_{t=1}^{T} \log p(y_t \mid x) \;+\; \alpha \sum_{t=1}^{T}\sum_{i=1}^{m} \Big( -\log p\big(y_d{=}1 \mid \hat g_i(x,t)\big) + \big\lVert \hat g_i(x,t) - g_i(x,t) \big\rVert_1 \Big) \qquad (5)$$

$$L_D = -\sum_{t=1}^{T}\sum_{i=1}^{m} \Big( \log p\big(y_d{=}0 \mid \hat g_i(x,t)\big) + \log p\big(y_d{=}1 \mid g_i(x,t)\big) \Big) \qquad (6)$$

where α is a weight coefficient, set to 0.01; $y_1, y_2, \dots, y_T$ are the category labels of all T characters in the input picture x; and $\lVert\cdot\rVert_1$ denotes the L1 norm. In $L_G$, the first term is the character category prediction loss, the second term is the loss for the glyph discriminator judging a generated glyph to be real, and the third term is the glyph pixel-point loss. In $L_D$, the first term is the loss for the glyph discriminator correctly judging a generated glyph to be fake, and the second term is the loss for correctly judging a real glyph to be real. This iterative optimization realizes the adversarial competition between the glyph discriminator and the glyph generator. Network parameters are optimized with the Adam optimizer (Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations), with an initial learning rate of 0.001 decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method are used.
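Given the term-by-term description above, the two objectives can be sketched as the following alternating update. Attaching the weight α to the generation terms and using binary cross-entropy for the discriminator probabilities are assumptions consistent with standard adversarial training:

```python
import torch
import torch.nn.functional as F

alpha = 0.01  # weight coefficient from the description above

def generator_objective(logits, labels, d_fake, g_hat, g_real):
    """L_G: recognition loss + alpha * (adversarial term + L1 pixel loss).

    Assumed shapes: logits (B, T, num_classes), labels (B, T) long tensors,
    d_fake discriminator probabilities on generated glyphs, g_hat/g_real the
    generated and rendered real glyphs.
    """
    cls_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())  # 1st term
    adv_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))  # 2nd term
    pix_loss = (g_hat - g_real).abs().mean()                            # 3rd term
    return cls_loss + alpha * (adv_loss + pix_loss)

def discriminator_objective(d_fake, d_real):
    """L_D: judge generated glyphs to be fake and real glyphs to be real."""
    return (F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)) +
            F.binary_cross_entropy(d_real, torch.ones_like(d_real)))
```

In training, the two objectives are minimized in alternation, which realizes the adversarial competition described above.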
Compared with the prior art, the beneficial effects of the invention include the following aspects:
The invention provides a character recognition method based on standard glyph generation: a neural network model based on an attention mechanism and a generation mechanism is established; at each time step attention is focused on one position of the picture, and the neural network features at that position are used to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters. By means of multi-font generation, the invention improves the attention module, the recognition precision and the quality of the generated glyphs, specifically as follows:
First, the method uses standard glyph generation to guide the learning of character features. Compared with most methods that guide feature learning with character labels alone, it learns scene-independent features better, thereby improving recognition accuracy.
Second, standard glyphs are generated with a spatial attention mechanism. Compared with the existing SSFL method, this generates the standard glyphs corresponding to irregularly shaped text better and greatly improves the quality of the generated glyphs, so character recognition achieves better accuracy.
Third, the method adopts multi-font standard glyph generation, further enhancing the robustness of the learned features. Compared with generating standard glyphs of a single font, this suppresses the font style characteristics of natural scene characters and is more conducive to recognizing their content.
Drawings
Fig. 1 is a flowchart of a text recognition method provided in the present invention.
FIG. 2 is a comparison of the glyph generation method of the present invention and the SSFL method.
FIG. 3 is an exemplary diagram of a standard glyph font utilized in the present invention.
FIG. 4 is a comparison graph of glyphs generated by the present invention and SSFL method when processing irregular-shaped text pictures.
FIG. 5 compares glyph pixel-point loss curves during training for the present invention and other prior art.
Fig. 6 is a comparison graph of the visualization result of the attention weight matrix calculated by the SAR method and the present invention.
FIG. 7 is a comparison graph of standard glyphs generated with and without the use of resist learning in accordance with the present invention.
FIG. 8 is a comparison graph of standard glyphs generated using single font and multi-font training in accordance with the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters, and sequentially outputs the characters in the picture according to the writing sequence. The invention uses a neural network model based on an attention mechanism and a generation mechanism, focuses attention on a certain position of the picture at each moment, and respectively predicts character types and generates multi-font standard fonts by using the neural network characteristics of the position until all characters in the picture are traversed.
The flow chart of the invention is shown in FIG. 1. In a specific implementation, the method comprises the following steps:
1. Extract the visual features F(x) of the input picture x using the CNN feature extractor, where $x \in \mathbb{R}^{48\times160\times3}$, i.e., the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and $F(x) \in \mathbb{R}^{H\times W\times C}$ with H = 6, W = 40, C = 512 denoting the height, width and number of channels of F(x), respectively.
Table 1. Parameter configuration of the CNN feature extractor in the embodiment
The configuration parameters of the CNN feature extractor are as shown in Table 1: the second column gives the feature dimensions output by each convolution group, in the format h × w × c, where h, w and c denote the height, width and number of channels of the features. Except for the first convolution group, the interior of each convolution group is configured from the residual units proposed in (He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778)): a convolution group is denoted by n residual units with o output channels, each residual unit comprising two convolution layers with kernel sizes 1 × 1 and 3 × 3 respectively and o output feature channels, and the stride entry denotes the convolution stride.
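A sketch of the residual unit convention just described (two convolutions of kernel sizes 1×1 and 3×3 with o output channels) follows; the projection shortcut used when shapes change is taken from the cited He et al. paper and is otherwise an assumption:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One residual unit of a convolution group: 1x1 then 3x3 convolution."""
    def __init__(self, in_channels, o, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, o, kernel_size=1, stride=stride),
            nn.BatchNorm2d(o), nn.ReLU(),
            nn.Conv2d(o, o, kernel_size=3, padding=1),
            nn.BatchNorm2d(o),
        )
        # Identity shortcut, or a 1x1 projection when the shape changes:
        self.shortcut = (nn.Identity() if in_channels == o and stride == 1
                         else nn.Conv2d(in_channels, o, kernel_size=1, stride=stride))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))
```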
2. Sequence-model the features F(x) using the LSTM encoder and decoder of the SAR method. Both the LSTM encoder and the decoder have two hidden layers with 512 nodes per layer. Each column of features of F(x) along the width (W) dimension is max-pooled along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x,t) of the LSTM decoder at time t is fed to the attention module together with F(x) to obtain the attention weight matrix $M(x,t) \in \mathbb{R}^{H\times W}$, calculated per equations (1)-(2):

$$M'_{ij}(x,t) = \tanh\Big(\sum_{p,q\in N(i,j)} W_F F_{pq}(x) + W_h h(x,t)\Big)$$

$$M(x,t) = \mathrm{softmax}\big(W_M M'(x,t)\big)$$

where M'(x,t) is an intermediate variable of the calculation, i.e. the attention weight matrix before softmax normalization; $M'_{ij}(x,t)$ is its entry at position (i,j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i,j) is the neighborhood centered at (i,j), i.e. i−1 ≤ p ≤ i+1 and j−1 ≤ q ≤ j+1; $F_{pq}(x)$ is the feature of F(x) at position (p,q); $W_F$, $W_h$ and $W_M$ are parameters to be learned; tanh is the hyperbolic tangent function and softmax is the normalized exponential function.
3. Point-multiply the features of each channel of F(x) with M(x,t) and sum over all positions to obtain the attention vector $c(x,t) \in \mathbb{R}^{C}$, which represents the features of the picture region attended to at time t.
4. Classify the concatenated features of the attention vector c(x,t) and h(x,t) with the softmax classifier commonly used in machine learning to obtain the probability of the character category $y_t$ at the attended position at time t, per equation (3), where $W_o$ and $b_o$ are parameters to be learned, the square brackets denote concatenation, and $y_t \in \mathcal{C}$, with $\mathcal{C}$ the set of all character categories. The category $\hat y_t$ that maximizes $p(y_t \mid x)$ is selected as the predicted character category.
5. Using the glyph generator based on a deconvolutional neural network, generate from the attention vector c(x,t) the standard glyphs $\hat g_i(x,t)$, i = 1, ..., m, of the m selected fonts per equation (4), where $z_i$, the embedding vector of the i-th font, is a random vector following a multivariate standard normal distribution and the square brackets denote concatenation. The real multi-font standard glyphs $g_i(x,t)$ are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, adopting the idea of generative adversarial networks, the glyph discriminator discriminates between the generated and the real standard glyphs, and the competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator assigns the generated glyph $\hat g_i(x,t)$ a probability $p(y_d{=}1 \mid \hat g_i(x,t))$ of being real and a probability $p(y_d{=}0 \mid \hat g_i(x,t)) = 1 - p(y_d{=}1 \mid \hat g_i(x,t))$ of being fake; likewise, it assigns the real glyph $g_i(x,t)$ a probability $p(y_d{=}1 \mid g_i(x,t))$ of being real and $p(y_d{=}0 \mid g_i(x,t)) = 1 - p(y_d{=}1 \mid g_i(x,t))$ of being fake. The configuration parameters of the glyph generator and discriminator are as shown in Table 2: the first, second and third columns of the table give the name, type and specific configuration of each network layer. In the third column, for convolution and deconvolution layers, "k × k × c, s, BN, ReLU" denotes a kernel size of k × k, an output feature dimension of c and a stride of s, with batch normalization and a ReLU activation function; for a fully connected layer, "i × o" denotes that the layer's input features have dimension i and its output features dimension o.
Table 2. Parameter configuration of the glyph generator and glyph discriminator in the embodiment
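The "k × k × c, s, BN, ReLU" rows of Table 2 correspond to the following layer constructor (a convenience sketch; the padding default is an assumption):

```python
import torch.nn as nn

def make_layer(kind, in_c, k, c, s, padding=1):
    """Builds one Table 2 row 'k x k x c, s, BN, ReLU' as a conv or deconv layer."""
    Conv = {"conv": nn.Conv2d, "deconv": nn.ConvTranspose2d}[kind]
    return nn.Sequential(
        Conv(in_c, c, kernel_size=k, stride=s, padding=padding),
        nn.BatchNorm2d(c),
        nn.ReLU(),
    )
```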
When training the whole network, the invention combines the character category prediction loss, the glyph pixel-point loss and the glyph discriminator loss to update the network parameters. Specifically, the two adversarial objective functions $L_G$ and $L_D$ of equations (5)-(6) are optimized iteratively, where α is a weight coefficient set to 0.01 and $y_1, y_2, \dots, y_T$ are the category labels of all T characters in the input picture x. In $L_G$, the first term is the character category prediction loss, the second term is the loss for the glyph discriminator judging a generated glyph to be real, and the third term is the glyph pixel-point loss; in $L_D$, the first term is the loss for the discriminator correctly judging a generated glyph to be fake and the second term the loss for correctly judging a real glyph to be real. This iterative optimization realizes the adversarial competition between the glyph discriminator and the glyph generator. Network parameters are optimized with the Adam optimizer (Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations), with an initial learning rate of 0.001 decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method are used.
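The stated optimization schedule (Adam with an initial learning rate of 0.001, multiplied by 0.9 every 10,000 steps) corresponds, for example, to:

```python
import torch

model = torch.nn.Linear(8, 8)  # stands in for any trainable module above
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.9)

# per training step: loss.backward(); optimizer.step();
# optimizer.zero_grad(); scheduler.step()
```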
FIG. 2 compares the glyph generation schemes of the present invention and the existing SSFL method. Above the dotted line is the attention-based standard glyph generation proposed by the invention; below it is the standard glyph generation of the SSFL method. The scheme of the invention differs from SSFL in two main points: first, an attention mechanism is adopted to generate, one by one, the standard glyphs corresponding to each scene character; second, the invention adopts multi-font standard glyph generation, which helps learn features independent of font style.
FIG. 3 shows examples of the standard glyph fonts used in the invention. The invention trains three network models, for English, Chinese and Bengali respectively. For English, 8 fonts (m = 8) are used: Arial, Bradley Hand ITC, Comic Sans MS, Courier New, Georgia, Times New Roman, Kunstler Script and Vladimir Script. For Chinese, 4 fonts (m = 4) are used: Song, Kai (regular script), Hei (black) and imitation Song. For Bengali, 1 font (m = 1) is used: Nirmala UI.
TABLE 3 recognition accuracy of the present invention and other prior art techniques on English evaluation datasets
| Method | IIIT5k | SVT | IC13 | IC15 | SVTP | CT80 |
|---|---|---|---|---|---|---|
| SSFL | 89.4 | 87.1 | 94.0 | - | 73.9 | 62.5 |
| ASTER | 93.4 | 89.5 | 91.8 | 76.1 | 78.5 | 79.5 |
| SAR | 95.0 | 91.2 | 94.0 | 78.8 | 86.4 | 89.6 |
| The invention | 95.3 | 91.3 | 95.1 | 81.7 | 86.0 | 88.5 |
TABLE 4 recognition accuracy of the present invention and other prior art techniques on Chinese and Bengali evaluation datasets
| Method | Pan+ChiPhoto | ISI Bengali |
|---|---|---|
| HOG | 59.2 | 87.4 |
| CNN | 61.5 | 89.7 |
| ConvCoHOG | 71.2 | 92.2 |
| The invention | 89.4 | 97.4 |
Tables 3 and 4 report the recognition accuracy (in %) of the present invention and other prior art on the evaluation datasets. IIIT5k, SVT, IC13, IC15, SVTP and CT80 are English text datasets in common use in the field. The invention achieves the best results on most datasets, with a clear advantage in accuracy on the IC15 dataset; on the two smaller datasets SVTP and CT80, its accuracy trails the SAR method slightly. Pan+ChiPhoto is a Chinese dataset and ISI Bengali a Bengali dataset, and on both the invention likewise achieves the highest recognition accuracy. HOG is the method of (Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (Vol. 1, pp. 886-893)), CNN is the method of (Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep Features for Text Spotting. In European Conference on Computer Vision (pp. 512-528)), and ConvCoHOG is a method based on convolutional co-occurrence histograms of oriented gradients. Overall, the invention advances beyond the prior art on the task of character recognition in natural scenes.
FIG. 4 compares the glyphs generated by the present invention and the SSFL method when processing irregularly shaped text pictures. The SSFL method generates a standard glyph by a global mapping and does not handle irregularly shaped text well. The invention locates the approximate position of each character through the attention mechanism and then generates the corresponding standard glyph, obtaining better results on irregular text.
FIG. 5 plots the glyph pixel-point loss (the L1 loss) during training for the present invention and other prior art. CNN-DCNN is the standard glyph generation framework used by the SSFL method; CNN-DCNN (Skip) adds skip connections to CNN-DCNN; CNN-LSTM-DCNN is an improved version of CNN-DCNN in which the CNN features pass through an LSTM before being delivered to the deconvolutional network (DCNN); and Attentional Generation is the attention-based standard glyph generation framework proposed herein. For a fair comparison, the four methods use the same CNN and DCNN structures and the same training data, and multi-font generation is also introduced into the first three. The comparison shows that the attention-based generation method of the invention generates more accurate standard glyphs.
FIG. 6 compares visualizations of the attention weight matrix M(x,t) computed by the SAR method and by the present invention. Because the invention learns to generate standard glyphs, its attention module produces a more accurate and more meaningful attention weight matrix. Columns 2 and 3 show heat maps of M(x,t) computed by the SAR method and by the invention, respectively, and the underlined letter below each heat map is the character label predicted by the model at that moment. Taking the first group of pictures as an example, the invention focuses its attention on the lower half of the decorative letter "L" and correctly recognizes it as "L", whereas the attention of the SAR method deviates and it misrecognizes the letter as "R".
FIG. 7 compares standard glyphs generated by the present invention with and without adversarial learning, where one "output" row is the result without adversarial training, the other "output" row is the result with adversarial training, and "target" is the real standard glyph. With adversarial learning, the invention generates standard glyphs better for blurred and distorted text and recognizes the text content. Although many generated standard glyphs still show some gap to the real standard glyphs even with adversarial training, the improvement over training without it is evident.
FIG. 8 compares standard glyphs generated under single-font and multi-font training, where "output" denotes the result of training with a single font (whose name is given in parentheses) or with multiple fonts, and "target" is the real standard glyph. If training uses standard glyphs of only one font, then at test time, when the model encounters characters in an unfamiliar font style, it can neither generate the standard glyph nor recognize the character correctly. By generating standard glyphs of multiple fonts, the model better learns features independent of font style and thus correctly recognizes the characters' content.
The technical solutions in the embodiments of the present invention are described above clearly and completely with reference to the drawings. The described examples are only some, not all, of the embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910716704.1A CN112329803B (en) | 2019-08-05 | 2019-08-05 | Natural scene character recognition method based on standard font generation |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910716704.1A CN112329803B (en) | 2019-08-05 | 2019-08-05 | Natural scene character recognition method based on standard font generation |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112329803A true CN112329803A (en) | 2021-02-05 |
| CN112329803B CN112329803B (en) | 2022-08-26 |
Family
ID=74319415
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910716704.1A Active CN112329803B (en) | 2019-08-05 | 2019-08-05 | Natural scene character recognition method based on standard font generation |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112329803B (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114419174A (en) * | 2021-12-07 | 2022-04-29 | 科大讯飞股份有限公司 | On-line handwritten text synthesis method, device and storage medium |
| JP2023039888A (en) * | 2021-09-09 | 2023-03-22 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method, device, apparatus, and storage medium for model training and word stock generation |
| CN116030476A (en) * | 2022-12-30 | 2023-04-28 | 中南民族大学 | A multi-style handwritten English image label recognition system and method |
| CN120744144A (en) * | 2025-09-03 | 2025-10-03 | 成都中医药大学 | Method, system and medium for constructing Chinese ancient book foreign word dictionary and aligning text |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2007122500A (en) * | 2005-10-28 | 2007-05-17 | Ricoh Co Ltd | Character recognition device, character recognition method, and character data |
| CN107577651A (en) * | 2017-08-25 | 2018-01-12 | 上海交通大学 | Chinese character style migratory system based on confrontation network |
| CN107644006A (en) * | 2017-09-29 | 2018-01-30 | 北京大学 | A kind of Chinese script character library automatic generation method based on deep neural network |
| CN108615036A (en) * | 2018-05-09 | 2018-10-02 | 中国科学技术大学 | A kind of natural scene text recognition method based on convolution attention network |
| CN108804397A (en) * | 2018-06-12 | 2018-11-13 | 华南理工大学 | A method of the Chinese character style conversion based on a small amount of target font generates |
| CN109255356A (en) * | 2018-07-24 | 2019-01-22 | 阿里巴巴集团控股有限公司 | A kind of character recognition method, device and computer readable storage medium |
| CN109389091A (en) * | 2018-10-22 | 2019-02-26 | 重庆邮电大学 | The character identification system and method combined based on neural network and attention mechanism |
| US20190147304A1 (en) * | 2017-11-14 | 2019-05-16 | Adobe Inc. | Font recognition by dynamically weighting multiple deep learning neural networks |
-
2019
- 2019-08-05 CN CN201910716704.1A patent/CN112329803B/en active Active
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2007122500A (en) * | 2005-10-28 | 2007-05-17 | Ricoh Co Ltd | Character recognition device, character recognition method, and character data |
| CN107577651A (en) * | 2017-08-25 | 2018-01-12 | 上海交通大学 | Chinese character style migratory system based on confrontation network |
| CN107644006A (en) * | 2017-09-29 | 2018-01-30 | 北京大学 | A kind of Chinese script character library automatic generation method based on deep neural network |
| US20190147304A1 (en) * | 2017-11-14 | 2019-05-16 | Adobe Inc. | Font recognition by dynamically weighting multiple deep learning neural networks |
| CN108615036A (en) * | 2018-05-09 | 2018-10-02 | 中国科学技术大学 | A kind of natural scene text recognition method based on convolution attention network |
| CN108804397A (en) * | 2018-06-12 | 2018-11-13 | 华南理工大学 | A method of the Chinese character style conversion based on a small amount of target font generates |
| CN109255356A (en) * | 2018-07-24 | 2019-01-22 | 阿里巴巴集团控股有限公司 | A kind of character recognition method, device and computer readable storage medium |
| CN109389091A (en) * | 2018-10-22 | 2019-02-26 | 重庆邮电大学 | The character identification system and method combined based on neural network and attention mechanism |
Non-Patent Citations (1)
| Title |
|---|
| ZHANZHAN CHENG et al.: "Focusing Attention: Towards Accurate Text Recognition in Natural Images", 2017 IEEE International Conference on Computer Vision (ICCV) * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2023039888A (en) * | 2021-09-09 | 2023-03-22 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method, device, apparatus, and storage medium for model training and word stock generation |
| CN114419174A (en) * | 2021-12-07 | 2022-04-29 | 科大讯飞股份有限公司 | On-line handwritten text synthesis method, device and storage medium |
| CN116030476A (en) * | 2022-12-30 | 2023-04-28 | 中南民族大学 | A multi-style handwritten English image label recognition system and method |
| CN120744144A (en) * | 2025-09-03 | 2025-10-03 | 成都中医药大学 | Method, system and medium for constructing Chinese ancient book foreign word dictionary and aligning text |
| CN120744144B (en) * | 2025-09-03 | 2026-02-03 | 成都中医药大学 | Method, system and medium for constructing Chinese ancient book foreign word dictionary and aligning text |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112329803B (en) | 2022-08-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Jiang et al. | Scfont: Structure-guided chinese font generation via deep stacked networks | |
| CN112819686B (en) | Image style processing method and device based on artificial intelligence and electronic equipment | |
| CN107368831B (en) | English words and digit recognition method in a kind of natural scene image | |
| CN108376244B (en) | A method for identifying text fonts in natural scene pictures | |
| CN112364873A (en) | Character recognition method and device for curved text image and computer equipment | |
| CN111368660A (en) | A single-stage semi-supervised image human object detection method | |
| CN112434599A (en) | Pedestrian re-identification method based on random shielding recovery of noise channel | |
| CN112329803B (en) | Natural scene character recognition method based on standard font generation | |
| CN106372624B (en) | Face recognition method and system | |
| CN110728694A (en) | A long-term visual target tracking method based on continuous learning | |
| Wang et al. | Multiscale deep alternative neural network for large-scale video classification | |
| CN111737511A (en) | Image description method based on self-adaptive local concept embedding | |
| CN108681735A (en) | Optical character recognition method based on convolutional neural networks deep learning model | |
| CN110517270A (en) | A kind of indoor scene semantic segmentation method based on super-pixel depth network | |
| CN115512109B (en) | A method for image semantic segmentation based on relational context aggregation | |
| CN111680705A (en) | MB-SSD Method and MB-SSD Feature Extraction Network for Object Detection | |
| CN110503090B (en) | Character detection network training method based on limited attention model, character detection method and character detector | |
| CN111242114B (en) | Character recognition method and device | |
| CN110766001B (en) | Bank card number positioning and end-to-end identification method based on CNN and RNN | |
| CN112560866B (en) | OCR recognition method based on background suppression | |
| CN116994255B (en) | Stroke extraction method based on multi-level deep feature fusion | |
| CN117115824B (en) | A visual text detection method based on stroke region segmentation strategy | |
| CN114202659B (en) | Fine-grained image classification method based on space symmetry irregular local region feature extraction | |
| CN112101479B (en) | A hair style recognition method and device | |
| CN118134963B (en) | Anti-background-interference twin network single-target tracking method based on hierarchical feature fusion |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |