CN112329803A

CN112329803A - A natural scene text recognition method based on standard glyph generation

Info

Publication number: CN112329803A
Application number: CN201910716704.1A
Authority: CN
Inventors: 连宙辉; 王逸之; 唐英敏; 肖建国
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2021-02-05
Anticipated expiration: 2039-08-05
Also published as: CN112329803B

Abstract

The invention discloses a character recognition method based on standard font generation, which is characterized in that a neural network model based on an attention mechanism and a generation mechanism is established, attention is focused at a certain position of a picture at each moment, character category prediction and multi-font standard font generation are respectively carried out by utilizing the neural network characteristics of the position until all characters in the picture are traversed, and recognition and output of the characters in a natural scene picture containing one or more characters are realized. The invention utilizes the multi-font generation, improves the attention module, and improves the character recognition precision and the font generation quality, thereby improving the character recognition accuracy.

Description

Natural scene character recognition method based on standard font generation

Technical Field

The invention belongs to the technical field of computer vision and pattern recognition, relates to a character recognition method, and particularly relates to a method for recognizing characters in a natural scene picture.

Background

In the field of computer vision and pattern recognition, character recognition refers to letting a computer automatically recognize character contents in a picture. The natural scene character recognition specifically refers to recognizing all character contents in a picture for a natural scene picture with characters as main bodies. The method realizes automatic recognition of the characters in the natural scene, and has great significance in improving the production and living efficiency of people, understanding the image content, recognizing the environment by a machine and the like.

To date, many text recognition techniques have been proposed in academia and industry, mainly classified into a local feature-based method and a neural network-based method. Among them, the method based on local features is represented by a method proposed in the literature (Wang, k., Babenko, b., & Belongie, S.J. (2011), End-to-End scene text registration.in 2011 International Conference on Computer Vision (pp.1457-1464)). It locates the positions of the characteristic points by a series of rules set by human, and extracts the characteristics on the positions for character classification. However, in natural scene images, the background of the text and the font thereof are complicated, the shape of the text is not fixed (bending, tilting, etc.), and the method cannot provide the unified standard of which feature points are important, so that the method cannot show a good recognition effect.

Recently, some methods based on neural networks have been proposed. The method has excellent performance on the character recognition problem by utilizing the characteristics of self-adaptive selection characteristics of the neural network and strong noise robustness. These methods generally extract visual features of an image using a Convolutional Neural Network (CNN), perform sequence modeling using a Recurrent Neural Network (RNN), and sequentially predict each character in the image. Among them, a Long Short Term Memory Network (LSTM) is a commonly used RNN structure. The most advanced methods at present are represented by the ASTER method in the literature (Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An Interactive Screen Text Recombnizer with Flexible Reconfiguration. IEEE Transactions on Pattern and Machine Intelligence, 1-1.) and the method in the literature (Li, H., Wang, P., Shen, C., & Zhang, G. (2018). Show, extended Read: A Simple and string base for Irregurguar SAR Text registration. ArXpriv: 1811.00751.). However, these methods still have the defect that they only use the word class labels to supervise the neural network, but the guiding information provided by the word class labels is not sufficient. When processing a picture with a noisy text background and a novel font style, the methods cannot extract the characteristics with identification power, so that the identification precision is still not ideal. There are some methods that attempt to use standard glyphs as additional supervisory information, such as the methods in the literature (Liu, y., Wang, z., Jin, h., & Wassell, I.J. (2018). synthetic provided Learning for Scene Text registration. in Proceedings of the European Conference Computer Vision (ECCV) (pp.449-465.) (SSFL method hereinafter) and the literature (Zhang, y., Liang, S., Nie, S., Liu, w., & pending, S. (2018). Robust writing reader-editor of font library, but no suitable Recognition methods for generating fonts in the standard script 26, which results in no Recognition of fonts 26 and characters in the methods of using these methods.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a character recognition method based on standard font generation. For the natural scene character features extracted by the neural network, the attention mechanism is used in the character pattern generation mode, the neural network is used for predicting character types, and the neural network is also used for generating standard character patterns of various characters corresponding to the natural scene characters. By learning how to generate the standard font through the neural network, the method can extract the character features of the natural scene, which are more robust to interference factors such as a noisy background, a font style and the like, so that the accuracy of character recognition is improved.

For convenience of explanation, the present invention has the following definitions of terms:

natural scene picture: and (4) a real scene picture shot artificially.

Text picture: the picture with the text content as the main body comprises one or more texts.

The core of the invention is: in the process of recognizing the characters, unnecessary font style information in the neural network features is redundant information. Two main problems of the prior art SSFL are: firstly, SSFL generates a standard font (single font) for learning how to filter out the background of the characters in the natural scene, and does not consider the font of various fonts to be generated and what effect can be brought by generating the font of various fonts; secondly, the model provided by the SSFL can not generate fonts of various fonts, and the technical difficulty exists. Unlike the prior art SSFL which uses only one font glyph as the generation target, the present invention proposes standard glyph generation for multiple fonts: generating the attention vector c (x, t) into the standard font of the m fonts corresponding to the attention vector c (x, t) by using a font generator

Using a glyph discriminator to compete against the glyph generator enables the glyph generator to generate a more realistic standard glyph. Because there are several typical standard fonts, such as song, regular, black, etc., for a certain character. The method uses the font style embedded vector z to control the font to be generated, and the characteristics extracted by the neural network only reflect the most important content information (which character is), so that the method reduces unnecessary font style information in the neural network characteristics and further improves the identification precision; meanwhile, the method for controlling the font of the generated font by using the font style embedded vector z innovatively provided by the invention perfectly solves the problem of multi-font generation. In addition, the attention mechanism and the standard font generation are jointly optimized in a mode of jointly learning (joint optimization) through the two models, the two models which are independently learned are organically combined, and the two models are better in performance through jointly exchanging and learning (joint optimization).

The technical scheme provided by the invention is as follows:

a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters, and can sequentially output the characters in the picture according to the writing sequence. The invention uses a neural network model based on an attention mechanism and a generation mechanism, focuses attention on a certain position of an image at each moment, and respectively predicts character types and generates multi-font standard fonts by using the neural network characteristics of the position until all characters in the image are traversed, thereby realizing the recognition and output of the characters in a natural scene image containing one or more characters.

The attention mechanism and generation mechanism based neural network model comprises:

A. a convolutional neural network for extracting visual features f (x) of the input picture x;

B. a recurrent neural network for sequence modeling the features f (x); the recurrent neural network comprises an LSTM encoder and a decoder;

C. an attention module for acquiring an attention weight matrix M (x, t) according to hidden states h (x, t) and F (x) of the recurrent neural network at a time t;

D. a classifier for classifying the features; in specific implementation, a softmax classifier is adopted;

E. standard glyph for generating attention vector c (x, t) into its corresponding m fonts

The glyph generator of (1);

F. a glyph discriminator for competing with the glyph generator so that the glyph generator can generate a more realistic standard glyph.

The character recognition method based on standard font generation specifically comprises the following steps:

1. the visual features f (x) of the input picture x are extracted using a convolutional neural network.

2. And F (x) performing sequence modeling on the recurrent neural network, and transmitting the hidden state h (x, t) of the recurrent neural network at the time t to an attention module together with the F (x) to obtain an attention weight matrix M (x, t) which represents the attention allocated to each area of the image at the time t.

3. Performing point multiplication on each feature channel by using F (x) and M (x, t), and obtaining an attention vector c (x, t) which represents the feature of the concerned picture area at the time t.

4. And classifying the features of c (x, t) and h (x, t) after the c (x, t) and the h (x, t) are connected in series by using a classifier, and predicting the character category of the attention position at the moment t.

5. Generating the attention vector c (x, t) into the standard font of the m fonts corresponding to the attention vector c (x, t) by using a font generator

Using a glyph discriminator to compete against the glyph generator enables the glyph generator to generate a more realistic standard glyph.

In step 1, based on the convolutional neural network in the enter method (a scene text identifier based on attention mechanism), the step size of the first convolution unit in the last two convolution groups is modified to 1 × 1, which is used as the CNN feature extractor in the present invention to extract the visual feature f (x) of the input picture x. Wherein

That is, the input image is scaled to 48 pixels in height, 160 pixels in width, and 3 color channels;

where H is 6, W is 40, C is 512, which respectively represent the height, width and number of channels of feature f (x).

In step 2, the features f (x) are sequence modeled using an LSTM encoder and decoder. Both the LSTM encoder and decoder have two hidden layers, 512 nodes in each layer. F (x) for each set of features in the width (W) dimension, Pooling (Max-Pooling) is performed along the height dimension and then sequentially input into the LSTM encoder. The hidden state of the LSTM encoder at the last instant is used as the initial state of the LSTM decoder. The hidden state h (x, t) of the LSTM decoder at time t is fed to the attention module together with F (x) to obtain the attention weight matrix

The M (x, t) mode is calculated as follows:

M′_ij(x，t)＝tanh(∑_{p，q∈N(i，j)}W_FF_pq(x)+W_hh (x, t)) formula (1)

M(x，t)＝sotfmax(W_MM' (x, t)) formula (2)

Where M '(x, t) is an intermediate variable of the calculation process, M'_ij(x, t) represents the characteristic of M' (x, t) at position (i, j), i ≦ 1 ≦ H, j ≦ 1 ≦ W; n (i, j) represents a neighborhood of the center of (i, j), i.e. p is more than or equal to i-1 and less than or equal to i +1, and q is more than or equal to j-1 and less than or equal to j + 1; f_pq(x) Represents F (x) the feature at position (p, q); w_FAnd W_hIs a parameter to be learned, tanh is a hyperbolic tangent function, and sotfmax is a normalized exponential function (softmax function).

In step 3, the feature of each channel of F (x) and M (x, t) are used for dot multiplication to obtain

Representing the features of the picture region of interest at time t.

In step 4, features of attention vectors c (x, t) and h (x, t) after being connected in series are classified by using a softmax classifier commonly used in machine learning, and the character type of the attention position at the moment t can be obtained

Probability of (c):

wherein, W_oAnd b_oIs a parameter to be learned, the middle brackets represent tandem operation,

c represents the overall character category. Is selected such that

Largest size

As a predicted character category.

In step 5, a glyph generator based on a deconvolution Neural Network (DCNN) is used to generate standard glyphs of m selected fonts by taking the attention vector c (x, t) as an input

Represented by formula (4):

wherein z is_iThe embedded vector of the ith font is a random vector which follows multivariate standard normal distribution, the bracket represents the tandem operation, and m is the set font type number. True multifont standard glyphs g_iAnd (x, t) is rendered by a TTF (true Type font) or OTF (open Type font) file. Meanwhile, the idea of a generating type confrontation network is adopted, and a character pattern discriminator based on a convolution neural network is used for discriminating the generated standard character pattern and the real standard character pattern. The generated font is more accurate through the countermeasure between the font discriminator and the font generator. The glyph discriminator gives the generated glyph

Probability of being true is

Probability of being false

By the same token, it gives the true glyph g_iThe probability that (x, t) is true is p (y)_d＝1|g_i(x, t)), probability p (y) of being false_d＝0|g_i(x，t))＝1-p(y_d＝1|g_i(x，t))。

The network parameters that need to be trained include the CNN feature extractor,parameters to be learned in the LSTM encoder, LSTM decoder, attention module, glyph generator and glyph discriminator module. When the network is trained, the method and the device update the parameters of the network by combining character type prediction loss, font pixel point loss and font discriminator loss. In particular, the present invention iteratively optimizes two objective functions L_GAnd L_D：

Wherein α is a weight coefficient, set to 0.01; y is₁，y₂，...，y_TIs the category label of all T characters in the input picture x; and | L | · | | represents the norm operation of L1. L is_GItem I of (1)

Predicting loss for text categories, second term

A third term for the loss of the glyph discriminator to falsely discriminate the generated glyph as true

Is a loss of glyph pixel points; l is_DItem I of (1)

Loss of generating a glyph to be false for the right authentication of the glyph discriminator, the second term

A loss of correctly identifying the true glyph to true for the glyph identifier. The iterative optimization method realizes the countermeasure between the font discriminator and the font generator. Literature (Kingma, D is used.P.，&Ba, j.l. (2015). Adam: a Method for Stochartic optimization, International Conference on Learning retrieval), an adam optimizer optimizes network parameters, the initial Learning rate is set to be 0.001, the attenuation of every 1 ten thousand steps is 0.9 times of the original attenuation, and the same training data are adopted as an SAR Method.

Compared with the prior art, the beneficial effects of the invention include the following aspects:

the invention provides a character recognition method based on standard font generation, which is characterized in that a neural network model based on an attention mechanism and a generation mechanism is established, attention is focused at a certain position of an image at each moment, character category prediction and multi-font standard font generation are respectively carried out by utilizing the neural network characteristics of the position until all characters in the image are traversed, and recognition and output of the characters in a natural scene image containing one or more characters are realized. The invention utilizes the multi-font generation to improve or improve the attention module, the recognition precision and the font generation quality, and the specific embodiment is as follows:

firstly, the method adopts standard font generation to guide the learning process of character features, and compared with most methods for guiding feature learning by using character labels, the method can better learn features irrelevant to scenes, thereby improving the identification accuracy.

Secondly, the standard font is generated by adopting a space attention mechanism, compared with the existing SSFL method, the standard font corresponding to the irregular-shaped text can be generated better, the generation quality of the standard font is greatly improved, and the character recognition can obtain better accuracy.

Thirdly, the method adopts multi-font standard font generation, and further enhances the robustness of the learned characteristics. Compared with the generation of single-font standard fonts, the method reduces the font style characteristics in the characters in the natural scene, and is more beneficial to the identification of the contents.

Drawings

Fig. 1 is a flowchart of a text recognition method provided in the present invention.

FIG. 2 is a comparison of the glyph generation method of the present invention and the SSFL method.

FIG. 3 is an exemplary diagram of a standard glyph font utilized in the present invention.

FIG. 4 is a comparison graph of glyphs generated by the present invention and SSFL method when processing irregular-shaped text pictures.

FIG. 5 is a graph of glyph pixel loss versus value during training for the present invention and other prior art.

Fig. 6 is a comparison graph of the visualization result of the attention weight matrix calculated by the SAR method and the present invention.

FIG. 7 is a comparison graph of standard glyphs generated with and without the use of resist learning in accordance with the present invention.

FIG. 8 is a comparison graph of standard glyphs generated using single font and multi-font training in accordance with the invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention provides a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters, and sequentially outputs the characters in the picture according to the writing sequence. The invention uses a neural network model based on an attention mechanism and a generation mechanism, focuses attention on a certain position of the picture at each moment, and respectively predicts character types and generates multi-font standard fonts by using the neural network characteristics of the position until all characters in the picture are traversed.

The flow chart of the invention is shown in the attached figure 1, and when the method is implemented, the method comprises the following steps:

1 extracting visual features f (x) of the input picture x using a CNN feature extractor.

Table 1 parameter configuration diagram of CNN feature extractor in embodiment

The configuration parameters of the CNN feature extractor are shown in table 1: the second column is a characteristic dimension output by each convolution group, and the format of the second column is h multiplied by w multiplied by c, wherein h, w and c respectively represent the height, width and channel number of the characteristic; in addition to the first convolution group, the other convolution groups are internally represented by the literature (He, k., Zhang, x., Ren, s.,&residual unit (Residual unit) proposed by Sun, J. (2016.) Deep reactive Learning for Image registration in 2016 IEEE Conference on Computer Vision and Pattern Registration (CVPR) (pp.770-778.) configuration

The convolution group is represented by n residual error units, each residual error unit comprises two convolution layers with convolution kernel sizes of 1 × 1 and 3 × 3 respectively and the number of output characteristic channels of o, and the step size represents the convolution step size.

2 sequence modeling of feature f (x) using LSTM encoder and decoder in SAR approach. Both the LSTM encoder and decoder have two hidden layers, 512 nodes in each layer. F (x) for each set of features in the width (W) dimension, Pooling (Max-Pooling) is performed along the height dimension and then sequentially input into the LSTM encoder. The hidden state of the LSTM encoder at the last instant is used as the initial state of the LSTM decoder. The hidden state h (x, t) of the LSTM decoder at time t is fed to the attention module together with F (x) to obtain the attention weight matrix

Wherein the M (x, t) mode is calculated as follows:

M(x，t)＝sotfmax(W_MM′(x，t))

where M' (x, t) is an intermediate variable of the calculation process, i.e. the attention weight matrix before softmax normalization; m'_ij(x, t) represents the characteristic of M' (x, t) at position (i, j), i ≦ 1 ≦ H, j ≦ 1 ≦ W; n (i, j) represents a neighborhood of the center of (i, j), i.e. p is more than or equal to i-1 and less than or equal to i +1, and q is more than or equal to j-1 and less than or equal to j + 1; f_pq(x) Represents F (x) the feature at position (p, q); w_FAnd W_hIs a parameter to be learned, tanh is a hyperbolic tangent function, and sotfmax is a normalized exponential function (softmax function).

3 using the characteristic of each channel of F (x) and M (x, t) to carry out dot multiplication to obtain

Representing the features of the picture region of interest at time t.

4, classifying the features of the attention vectors c (x, t) and h (x, t) after the attention vectors are connected in series by using a softmax classifier commonly used in machine learning, and obtaining the character class of the concerned position at the moment t

Probability of (2)

c represents the overall character category. Is selected such that

Largest size

As a predicted character category.

5 generating standard glyphs of m selected fonts by using a glyph generator based on a deconvolution neural network and taking the attention vector c (x, t) as input

Wherein z is_iThe embedded vector of the ith font is a random vector which follows a multivariate standard normal distribution, and the bracket represents a concatenation operation. True multifont standard glyphs g_iAnd (x, t) is rendered by a TTF (true Type font) or OTF (open Type font) file. Meanwhile, the idea of a generating type confrontation network is adopted, the generated standard font and the real standard font are identified by using the font identifier, and the generated font is more accurate through the confrontation between the font identifier and the font generator. The glyph discriminator gives the generated glyph

Probability of being true is

Probability of being false is

By the same token, it gives the true glyph g_iThe probability that (x, t) is true is p (y)_d＝1|g_i(x, t)), the probability of being false is p (y)_d＝0|g_i(x，t))＝1-p(y_d＝1|g_i(x, t)). The configuration parameters of the glyph generator and the discriminator are shown in table 2: the first, second and third columns of the table respectively representing the network layersName, type and specific configuration. In the third column, "k × k × c, s, BN, ReLU" represents the size of the convolution kernel for the convolution and deconvolution layers as k × k, the output characteristic dimension as c, the step size as s, and the batch normalization and ReLU activation functions are used. For a fully connected layer, "i × o" represents that the dimension of the input features of the layer is i and the dimension of the output features is o.

Table 2 parameter configuration diagram of glyph generator and glyph discriminator in the embodiment

When the whole network is trained, the invention combines character type prediction loss, font pixel point loss and font discriminator loss to update the parameters of the network. Specifically, the invention iteratively optimizes two objective functions L for countermeasures in the countermeasures learning_GAnd L_DExpressed as follows:

wherein α is a weight coefficient, set to 0.01; y is₁，y₂，...，y_TIs the category label for all T characters in the input picture x. L is_GItem I of (1)

Predicting loss for text categories, second term

Identifying loss of generating a glyph to be genuine for a glyph discriminatorItem III

Is a loss of glyph pixel points; l is_DItem I of (1)

Second term for loss of the glyph discriminator to discriminate the generated glyph as false

A loss to identify the true glyph to be true for the glyph discriminator. The iterative optimization method realizes the countermeasure between the font discriminator and the font generator. Using literature (Kingma, d.p.,&ba, j.l. (2015). Adam: a Method for Stochartic optimization, International Conference on Learning retrieval), an adam optimizer optimizes network parameters, the initial Learning rate is set to be 0.001, the attenuation of every 1 ten thousand steps is 0.9 times of the original attenuation, and the same training data are adopted as an SAR Method.

FIG. 2 is a comparison of the glyph generation methods provided by the present invention and the existing SSFL method, respectively. The upper half of the dotted line is the standard font generation mode based on the attention mechanism provided by the invention, and the lower half of the dotted line is the standard font generation mode provided by the SSFL method. The scheme adopted by the invention is different from the SSFL in two main points: firstly, generating standard fonts corresponding to each scene character one by adopting an attention mechanism; second, the invention adopts multi-font standard font generation, which is helpful to better learn the character irrelevant to the font style.

FIG. 3 is an example of a font for a standard glyph used in the present invention. The invention trains three network models respectively for English, Chinese and Bengali. For English, the present invention uses 8 (m ═ 8) fonts, namely Arial, Bradley Hand ITC, cosmetic Sans MS, Courier New, Georgia, Times New Roman, Kunstler Script and Vladimir Script. For Chinese, the invention adopts 4 (m ═ 8) fonts, namely Song style, regular style, black style and imitated Song style. For Bengali, the present invention uses 1 font, Nirmala UI.

TABLE 3 recognition accuracy of the present invention and other prior art techniques on English evaluation datasets

Method of producing a composite material	IIIT5k	SVT	IC13	IC15	SVTP	CT80
							SSFL	89.4	87.1	94.0	-	73.9	62.5
ASTER	93.4	89.5	91.8	76.1	78.5	79.5
							SAR	95.0	91.2	94.0	78.8	86.4	89.6
The invention	95.3	91.3	95.1	81.7	86.0	88.5

TABLE 4 recognition accuracy of the present invention and other prior art techniques on Chinese and Bengali evaluation datasets

Method of producing a composite material	Pan+ChiPhoto	ISI Bengali
			HOG	59.2	87.4
CNN	61.5	89.7
			ConvCoHOG	71.2	92.2
The invention	89.4	97.4

Tables 2-3 are graphs of the recognition accuracy of the present invention and other prior art on an evaluation data set. Among them, IIIT5k, SVT, IC13, IC15, SVTP and CT80 are english character data sets commonly used in the industry at present. The present invention achieves the best results over most data sets, as seen by the recognition accuracy (in%) in the graph. The invention has great advantages in accuracy on the IC15 data set; the accuracy of the present invention is somewhat behind the SAR approach on the two smaller datasets SVTP and CT 80. Pan + ChiPhoto is the Chinese dataset and ISI Bengali is the Mengladesh dataset, on which the present invention also achieves the highest recognition accuracy. HOG is a method in the literature (Dalal, N., & Triggs, B. (2005). Histograms of oriented graphics for human detection. in 2005IEEE Computer Conference on Computer Vision and Pattern Recognition (CVPR '05) (Vol.1, pp.886-893.), CNN is a method in the literature (Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014.) Deep Features for Text pointing. in European Computer Conference video (pp.512-528), and ConvCoHOG is a method in the literature (Dalal, N., & B., & Triggs, B. (2005.) Histograms of Computer Conference on Computer Vision (pp.512-528) & CVPR' 05) (Vol.1, PP.886-893). In general, the invention is more advanced than the prior art in the task of character recognition in natural scenes.

FIG. 4 is a comparison graph of glyphs generated by the present invention and SSFL method when processing irregular-shaped text pictures. The SSFL method generates a standard glyph by a global mapping method and does not better handle irregular shaped text. The invention locates the approximate position of each character through an attention mechanism, and then generates the corresponding standard font, thereby obtaining better results aiming at the irregular text.

FIG. 5 is a diagram of glyph pixel loss (also called L1 loss) in the training process of the present invention and other prior arts, in which CNN-DCNN is a standard glyph Generation framework used by SSFL method, CNN-DCNN (Skip) is added with Skip Connection (Skip Connection) in CNN-DCNN, CNN-LSTM-DCNN is an improved version of CNN-DCNN, in which CNN features are passed through LSTM and then delivered to deconvolution network (DCNN), and Attentional Generation is the attention-based standard glyph Generation framework proposed herein. For fair comparison, the four methods use the same CNN and DCNN structure configuration, the same training data, and the first three methods also introduce multi-font generation. Through comparison, the attention-based generation method provided by the invention generates more accurate standard fonts.

Fig. 6 is a graph comparing the visualization results of the attention weight matrix (i.e., M (x, t)) obtained by the present invention and SAR method. The invention generates the standard font through learning, so that the attention module generates a more accurate and more meaningful attention weight matrix. The 2 nd and 3 rd columns in the figure respectively represent the thermodynamic diagrams of M (x, t) calculated by the SAR method and the invention, and the underlined letters below the thermodynamic diagrams represent character labels predicted by the model at a certain moment. Taking the first group of pictures as an example, the invention focuses attention on the lower half of the flower character "L" and correctly identifies the flower character "L"; while the SAR method focuses on the lower half of the flower-shaped word "L", it will be recognized as "R" if it is mistaken.

FIG. 7 is a comparison graph of a standard glyph generated with and without the use of antagonistic learning in accordance with the present invention. Where "output" is the result of not using countermeasure training, "output" is the result of using countermeasure training, and "target" is the true standard glyph. Through the counterstudy, the invention can better generate the standard font and identify the text content for the fuzzy and distorted text. Although many generated standard glyphs have some gap from the true standard glyphs after using the resistance training, the key improvement is related compared with not using the resistance training.

FIG. 8 is a standard glyph comparison graph generated under training with single fonts and multiple fonts according to the invention. Where "output" is the result of training using a single font, the name of the font is in parentheses, "output" is the result of training using multiple fonts, and "target" is the true standard glyph. If the standard font of only one font is adopted for training, the standard font and the recognition cannot be correctly generated when the model encounters characters with newer font styles during testing. By generating standard glyphs of multiple fonts, the model can better learn the character-independent features, thereby correctly identifying the content of the characters.

The technical solutions in the embodiments of the present invention are clearly and completely described above with reference to the drawings in the embodiments of the present invention. It is to be understood that the described examples are only a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims

1. A text recognition method based on standard glyph generation, establishing a neural network model based on attention mechanism and generation mechanism, focusing attention on a certain position of the picture at each moment, using the neural network characteristics of this position, respectively. Predict the text category and generate multi-font standard glyphs until all the text in the picture is traversed, thereby realizing the recognition and output of the text in a natural scene picture containing one or more texts;

The neural network model based on the attention mechanism and the generation mechanism includes:

A. A convolutional neural network for extracting the visual feature F(x) of the input image x;

B. Recurrent Neural Network for sequence modeling of feature F(x); Recurrent Neural Network includes LSTM encoder and decoder;

C. An attention module for obtaining the attention weight matrix M(x, t) according to the hidden states h(x, t) and F(x) of the recurrent neural network at time t;

D. A classifier for classifying features;

E. The standard glyphs used to generate the corresponding m fonts from the attention vector c(x, t)

glyph generator;

F. A glyph discriminator for confronting the glyph generator so that the glyph generator can generate more realistic standard glyphs;

Described character recognition method based on standard font generation comprises the following steps:

1) The structure of the convolutional neural network in the ASTER method is reconstructed. The stride of the first convolutional unit in the last two convolutional groups of the convolutional neural network is 1 × 1, which is used as a CNN feature extractor; the convolutional neural network is used. The neural network extracts the visual feature F(x) of the input image x;

where H, W, and C represent the height, width and number of channels of the feature F(x), respectively;

2) Use the recurrent neural network to model the sequence of F(x), and send the hidden state h(x, t) of the recurrent neural network at time t to the attention module together with F(x) to obtain the attention weight matrix M(x, t), representing the attention allocated to each area of the image at time t;

3) Use F(x) and M(x, t) to perform dot multiplication on each feature channel to obtain an attention vector c(x, t), which represents the feature of the image area concerned at time t;

4) Use the classifier to classify the concatenated features of c(x, t) and h(x, t), and predict the text category of the position of attention at time t;

5) Use the glyph generator to generate the standard glyphs of the corresponding m fonts from the attention vector c(x, t)

Further, the glyph discriminator can be used to fight against the glyph generator, so that the glyph generator can generate more realistic standard glyphs;

Through the above steps, the recognition of the text in the picture based on the standard font generation is realized.

2. the character recognition method based on standard font generation as claimed in claim 1, is characterized in that, in step 1), the visual feature F (x) of described picture x, wherein,

That is to scale the image x to a height of 48 pixels, a width of 160 pixels, and the number of color channels to 3;

where H=6, W=40, C=512.

3. the character recognition method based on standard font generation as claimed in claim 1, it is characterized in that, in step 2), use LSTM encoder and decoder to carry out sequence modeling to feature F (x); Comprise:

21) Both the LSTM encoder and decoder include two hidden layers with 512 nodes in each layer.

22) For each group of features of F(x) in the width W dimension, first pool Max-Pooling along the height dimension; then input them into the LSTM encoder in turn;

23) Use the hidden state of the LSTM encoder at the last moment as the initial state of the LSTM decoder;

24) Send the hidden state h(x, t) of the LSTM decoder at time t to the attention module together with F(x) to get the attention weight matrix

The way to calculate M(x, t) is:

M′ _ij (x, t)=tanh(∑ _{p, q∈N(i, j)} W _F F _pq (x)+W _h h(x, t)) Equation (1)

M(x, t)=sotfmax(W _M M'(x, t)) Equation (2)

Among them, M'(x, t) is the intermediate variable of the calculation process, M' _ij (x, t) represents the feature of M'(x, t) at position (i, j), 1≤i≤H, 1 ≤j≤W; N(i, j) represents the neighborhood of the center of (i, j), that is, i-1≤p≤i+1, j-1≤q≤j+1; F _pq (x) represents The feature of F(x) at position (p, q); W _F and W _h are parameters to be learned, tanh is the hyperbolic tangent function, and sotfmax is the softmax function.

4. the character recognition method based on standard glyph generation as claimed in claim 1, it is characterized in that, in step 4), concrete softmax classifier to attention vector c (x, t) and h (x, t) after the concatenation The features are classified, and the character category of the position of interest at time t is obtained.

The probability:

Among them, W _o and b _o are the parameters to be learned, and the square brackets represent the concatenation operation,

C stands for the total character class.

5. the character recognition method based on standard glyph generation as claimed in claim 1, it is characterized in that, in step 5), use the glyph generator based on deconvolutional neural network DCNN, with attention vector c (x, t) as. input, generate standard glyphs for m selected fonts

Expressed as formula (4):

Among them, zi is the embedding vector of the _i -th font, which is a random vector obeying the multivariate standard normal distribution, the square brackets represent the concatenation operation, and m is the set number of font types.

6. The character recognition method based on standard font generation as claimed in claim 1, characterized in that, specifically utilizing TTF or OTF file rendering to obtain a real multi-font standard font _gi (x, t).

7. the character recognition method that generates based on standard font as claimed in claim 1, it is characterized in that, further adopt generative confrontation network, utilize the font discriminator based on convolutional neural network to differentiate the standard font of generation and the real standard font ; where: the glyph discriminator gives the generated glyph

The probability of being true is

probability of being false

The true glyph _gi (x, t) is true with probability p(y _d =1| _gi (x,t)), and false with probability p(y _d =0| _gi (x,t))= 1-p(y _d =1| _gi (x, t)).

8. the character recognition method based on standard glyph generation as claimed in claim 1 is characterized in that, establish and train the described neural network model based on attention mechanism and generation mechanism, the loss function of model comprises: character category prediction loss, glyph Pixel point loss, glyph discriminator loss; learning parameters include parameters in CNN feature extractor, LSTM encoder, LSTM decoder, attention module, glyph generator and glyph discriminator modules.

9. the character recognition method based on standard font generation as claimed in claim 8, it is characterized in that, training process is iterative optimization two objective functions _LG and _LD , is expressed as formula (5) and formula (6):

Among them, α is the weight coefficient; y ₁ , y ₂ , ..., y _T is the category label of all T characters in the input image x; || · || represents the operation of finding the L1 norm; in L _G ,

predict the loss for the text category;

A loss that generates true glyphs for glyph discriminator errors;

is the glyph pixel loss; in L _D ,

The loss for the glyph discriminator to correctly identify the generated glyph as false,

Loss for the glyph discriminator to correctly identify real glyphs as true.

10 . The method for character recognition based on standard glyph generation according to claim 9 , wherein the network parameters are optimized by adam optimizer, the initial learning rate is set to 0.001, and the weight coefficient α is set to 0.01. 11 .