
CN112329803A - A natural scene text recognition method based on standard glyph generation - Google Patents


Info

Publication number: CN112329803A (application CN201910716704.1A; granted as CN112329803B)
Authority: CN (China)
Prior art keywords: glyph, standard, font, attention, neural network
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Inventors: 连宙辉 (Lian Zhouhui), 王逸之 (Wang Yizhi), 唐英敏 (Tang Yingmin), 肖建国 (Xiao Jianguo)
Original and current assignee: Peking University
Application filed by Peking University

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10: Character recognition
    • G06V 30/14: Image acquisition
    • G06V 30/148: Segmentation of character regions
    • G06V 30/153: Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text recognition method based on standard glyph generation. A neural network model based on an attention mechanism and a generation mechanism is established; at each time step, attention is focused on one position of the picture, and the neural network features at that position are used both to predict the character category and to generate standard glyphs in multiple fonts, until all characters in the picture have been traversed. The method thereby recognizes and outputs the text in a natural scene picture containing one or more characters. By generating glyphs in multiple fonts and improving the attention module, the invention improves the quality of the generated glyphs and the discriminative power of the learned features, thereby improving text recognition accuracy.

Description

Natural scene character recognition method based on standard font generation
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, relates to a character recognition method, and particularly relates to a method for recognizing characters in a natural scene picture.
Background
In the field of computer vision and pattern recognition, text recognition refers to having a computer automatically recognize the textual content of a picture. Natural scene text recognition specifically refers to recognizing all of the textual content in a natural scene picture whose main subject is text. Automatically recognizing text in natural scenes is of great significance for improving the efficiency of production and daily life, for understanding image content, and for enabling machines to perceive their environment.
To date, many text recognition techniques have been proposed in academia and industry, mainly divided into local-feature-based methods and neural-network-based methods. Local-feature-based methods are represented by the method proposed in the literature (Wang, K., Babenko, B., & Belongie, S.J. (2011). End-to-End Scene Text Recognition. In 2011 International Conference on Computer Vision (pp. 1457-1464)). That method locates feature points using a series of hand-crafted rules and extracts features at those locations for character classification. However, in natural scene pictures the background and fonts of the text are complex and the shape of the text is not fixed (curved, tilted, etc.), and such methods provide no unified standard for which feature points matter, so they do not achieve good recognition results.
Recently, methods based on neural networks have been proposed. Exploiting the ability of neural networks to select features adaptively and their strong robustness to noise, these methods perform excellently on the text recognition problem. They generally extract visual features of a picture with a Convolutional Neural Network (CNN), perform sequence modeling with a Recurrent Neural Network (RNN), and predict each character in the picture in turn. The Long Short-Term Memory network (LSTM) is a commonly used RNN structure. The most advanced current methods are represented by the ASTER method in the literature (Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence) and the SAR method in the literature (Li, H., Wang, P., Shen, C., & Zhang, G. (2018). Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. arXiv:1811.00751). However, these methods still have a defect: they supervise the neural network with character category labels only, and the guidance those labels provide is insufficient. When processing pictures with noisy text backgrounds or novel font styles, these methods cannot extract discriminative features, so recognition accuracy remains unsatisfactory. Some methods attempt to use standard glyphs as additional supervisory information, such as the method in the literature (Liu, Y., Wang, Z., Jin, H., & Wassell, I.J. (2018). Synthetically Supervised Feature Learning for Scene Text Recognition. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 449-465)) (hereinafter the SSFL method) and the method in the literature (Zhang, Y., Liang, S., Nie, S., Liu, W., & Peng, S. (2018)). However, these methods generate standard glyphs in only a single font and lack a suitable mechanism for generating the standard glyphs of irregularly shaped text, which limits both the quality of the generated glyphs and the resulting recognition accuracy.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a text recognition method based on standard glyph generation. For the natural scene text features extracted by the neural network, the attention mechanism and the glyph generation mechanism are combined: the neural network predicts character categories and, at the same time, generates the standard glyphs, in multiple fonts, of the characters appearing in the natural scene. By learning how to generate standard glyphs, the network extracts natural scene text features that are more robust to interference factors such as noisy backgrounds and font styles, thereby improving text recognition accuracy.
For convenience of explanation, the present invention has the following definitions of terms:
Natural scene picture: a real-world scene picture taken with a camera.
Text picture: a picture whose main subject is textual content, containing one or more characters.
The core of the invention is as follows. In the process of recognizing text, font style information contained in the neural network features is redundant. The prior art SSFL has two main problems: first, SSFL generates a standard glyph in a single font only in order to learn how to filter out the background of natural scene text, and does not consider generating glyphs in multiple fonts or what benefit doing so could bring; second, the model provided by SSFL cannot generate glyphs in multiple fonts, which poses a technical difficulty. Unlike SSFL, which uses the glyph of only one font as the generation target, the present invention proposes standard glyph generation for multiple fonts: a glyph generator maps the attention vector c(x,t) to its corresponding standard glyphs ĝ_1(x,t), ..., ĝ_m(x,t) in m fonts, and a glyph discriminator competes against the glyph generator so that the generator produces more realistic standard glyphs. For a given character there are several typical standard fonts, such as Song, Kai (regular script), and Hei. The method uses a font style embedding vector z to control which font is generated, so that the features extracted by the neural network reflect only the most important content information (which character it is); this reduces unnecessary font style information in the neural network features and further improves recognition accuracy. At the same time, controlling the generated font with the style embedding vector z, which the invention proposes, solves the multi-font generation problem. In addition, the attention mechanism and the standard glyph generation are jointly optimized, organically combining two otherwise independently learned components so that both perform better.
The technical scheme provided by the invention is as follows:
A text recognition method based on standard glyph generation. The invention processes a natural scene picture containing one or more characters and outputs the characters in the picture in writing order. It uses a neural network model based on an attention mechanism and a generation mechanism: at each time step, attention is focused on one position of the picture, and the neural network features at that position are used to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the text in a natural scene picture containing one or more characters.
The attention mechanism and generation mechanism based neural network model comprises:
A. a convolutional neural network for extracting visual features f (x) of the input picture x;
B. a recurrent neural network for sequence modeling the features f (x); the recurrent neural network comprises an LSTM encoder and a decoder;
C. an attention module for computing an attention weight matrix M(x,t) from the hidden state h(x,t) of the recurrent neural network at time t and the features F(x);
D. a classifier for classifying the features; in specific implementation, a softmax classifier is adopted;
E. a glyph generator for generating from the attention vector c(x,t) the standard glyphs ĝ_1(x,t), ..., ĝ_m(x,t) of its corresponding m fonts;
F. a glyph discriminator for competing with the glyph generator so that the glyph generator can generate a more realistic standard glyph.
The character recognition method based on standard font generation specifically comprises the following steps:
1. Extract the visual features F(x) of the input picture x using a convolutional neural network.
2. Perform sequence modeling on F(x) with the recurrent neural network, and feed the hidden state h(x,t) of the recurrent neural network at time t, together with F(x), to the attention module to obtain the attention weight matrix M(x,t), which represents the attention allocated to each region of the picture at time t.
3. Perform a dot product between each feature channel of F(x) and M(x,t) to obtain the attention vector c(x,t), which represents the features of the picture region attended to at time t.
4. Concatenate c(x,t) and h(x,t), classify the concatenated features with the classifier, and predict the character category at the attended position at time t.
5. Use the glyph generator to generate from the attention vector c(x,t) its corresponding standard glyphs ĝ_1(x,t), ..., ĝ_m(x,t) in m fonts, and use the glyph discriminator to compete against the glyph generator so that the generator produces more realistic standard glyphs.
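The five steps above can be sketched at the level of tensor shapes with a minimal NumPy stand-in for the real networks. The layer sizes (48 × 160 × 3 input, 6 × 40 × 512 features) follow the embodiment; the random projections are placeholders for the trained CNN, LSTM, generator, and discriminator, and the 32 × 32 glyph size and 37-way character set are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 6, 40, 512          # feature-map size from the embodiment
NUM_CLASSES, M_FONTS = 37, 8  # assumed: 36 alphanumerics + end token; 8 fonts

# Step 1: CNN stand-in; the real extractor maps a 48x160x3 picture to 6x40x512.
x = rng.random((48, 160, 3))
F = rng.random((H, W, C))     # placeholder for F(x)

# Step 2: decoder hidden state h(x,t) and attention weights M(x,t).
h = rng.random(512)
M_logits = rng.random((H, W))
M = np.exp(M_logits) / np.exp(M_logits).sum()   # softmax over all positions

# Step 3: attention vector c(x,t), a weighted sum of F over positions.
c = np.tensordot(M, F, axes=([0, 1], [0, 1]))   # shape (C,)

# Step 4: classify the concatenation [c; h].
W_o = rng.random((NUM_CLASSES, C + 512))
b_o = rng.random(NUM_CLASSES)
logits = W_o @ np.concatenate([c, h]) + b_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(np.argmax(probs))

# Step 5: generator stand-in; one 32x32 glyph per font from [c; z_i].
glyphs = np.stack([rng.random((32, 32)) for _ in range(M_FONTS)])
```

In the trained model, steps 2-5 repeat once per character until the end-of-sequence token is predicted.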
In step 1, the convolutional neural network of the SAR method (an attention-based scene text recognizer) is used, with the stride of the first convolution unit in the last two convolution groups modified to 1 × 1; this serves as the CNN feature extractor of the invention to extract the visual features F(x) of the input picture x. Here x ∈ R^(48×160×3), i.e. the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and F(x) ∈ R^(H×W×C), where H = 6, W = 40, and C = 512 are respectively the height, width, and number of channels of the features F(x).
In step 2, the features F(x) are sequence modeled using an LSTM encoder and decoder. Both the LSTM encoder and decoder have two hidden layers with 512 nodes per layer. Each column of F(x) along the width (W) dimension is max-pooled along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x,t) of the LSTM decoder at time t is fed to the attention module together with F(x) to obtain the attention weight matrix M(x,t) ∈ R^(H×W), computed as follows:

M′_ij(x,t) = tanh( Σ_{p,q ∈ N(i,j)} W_F F_pq(x) + W_h h(x,t) )   formula (1)

M(x,t) = softmax( W_M M′(x,t) )   formula (2)

where M′(x,t) is an intermediate variable of the calculation; M′_ij(x,t) is its entry at position (i,j), 1 ≤ i ≤ H, 1 ≤ j ≤ W; N(i,j) is the neighborhood centered at (i,j), i.e. i−1 ≤ p ≤ i+1 and j−1 ≤ q ≤ j+1; F_pq(x) is the feature of F(x) at position (p,q); W_F, W_h, and W_M are parameters to be learned; tanh is the hyperbolic tangent function; and softmax is the normalized exponential function.
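Formulas (1) and (2) can be sketched in NumPy as follows. The dimensions follow the text (H = 6, W = 40, C = 512, a 512-dimensional h); the hidden attention dimension d and the random parameter values are illustrative stand-ins for the learned W_F, W_h, and W_M.

```python
import numpy as np

def attention_weights(F, h, W_F, W_h, W_M):
    """Compute M(x,t) from F(x) and h(x,t) per formulas (1)-(2).

    F: (H, W, C) feature map; h: (Dh,) decoder hidden state.
    W_F: (d, C); W_h: (d, Dh); W_M: (d,).
    """
    H, W, _ = F.shape
    proj = np.tensordot(F, W_F, axes=([2], [1]))       # (H, W, d): W_F F_pq(x)
    Mp = np.zeros((H, W, W_F.shape[0]))
    for i in range(H):
        for j in range(W):
            # N(i, j): the 3x3 neighborhood, clipped at the borders.
            nb = proj[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            Mp[i, j] = np.tanh(nb.sum(axis=(0, 1)) + W_h @ h)   # formula (1)
    scores = Mp @ W_M                                  # (H, W)
    e = np.exp(scores - scores.max())
    return e / e.sum()                                 # formula (2): softmax

rng = np.random.default_rng(1)
H, W, C, Dh, d = 6, 40, 512, 512, 128
F = rng.standard_normal((H, W, C))
h = rng.standard_normal(Dh)
M = attention_weights(F, h,
                      rng.standard_normal((d, C)) * 0.01,
                      rng.standard_normal((d, Dh)) * 0.01,
                      rng.standard_normal(d))
```

The neighborhood sum over N(i,j) is what distinguishes this module from a plain per-position attention score: each position's score also sees its 3 × 3 surroundings.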
In step 3, the feature of each channel of F(x) is dot-multiplied with M(x,t) to obtain the attention vector c(x,t) ∈ R^C, which represents the features of the picture region attended to at time t.
In step 4, the features obtained by concatenating the attention vector c(x,t) with h(x,t) are classified with a softmax classifier, commonly used in machine learning, giving the probability of the character category ŷ_t at the attended position at time t:

p(ŷ_t | x) = softmax( W_o [c(x,t); h(x,t)] + b_o )   formula (3)

where W_o and b_o are parameters to be learned, the square brackets denote concatenation, and ŷ_t ∈ C, with C the set of all character categories. The ŷ_t that maximizes p(ŷ_t | x) is selected as the predicted character category.
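A minimal NumPy sketch of formula (3), assuming 512-dimensional c(x,t) and h(x,t) and a small illustrative character set; the random W_o and b_o stand in for learned parameters.

```python
import numpy as np

def classify(c, h, W_o, b_o):
    """p(y_t | x) = softmax(W_o [c; h] + b_o); returns (probs, argmax index)."""
    logits = W_o @ np.concatenate([c, h]) + b_o
    e = np.exp(logits - logits.max())      # numerically stable softmax
    probs = e / e.sum()
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(2)
C_FEAT, D_H, N_CLASSES = 512, 512, 37      # assumed: a-z, 0-9, end token
c = rng.standard_normal(C_FEAT)
h = rng.standard_normal(D_H)
W_o = rng.standard_normal((N_CLASSES, C_FEAT + D_H)) * 0.01
b_o = np.zeros(N_CLASSES)
probs, pred = classify(c, h, W_o, b_o)
```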
In step 5, a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x,t) as input and generates the standard glyphs ĝ_1(x,t), ..., ĝ_m(x,t) of m selected fonts, as in formula (4):

ĝ_i(x,t) = G( [c(x,t); z_i] ),  i = 1, ..., m   formula (4)

where z_i is the embedding vector of the i-th font, a random vector drawn from a multivariate standard normal distribution; the square brackets denote concatenation; and m is the chosen number of fonts. The real multi-font standard glyphs g_i(x,t) are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, adopting the idea of generative adversarial networks, a glyph discriminator based on a convolutional neural network discriminates between the generated standard glyphs and the real ones; the competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator gives the probability p(y_d = 1 | ĝ_i(x,t)) that a generated glyph ĝ_i(x,t) is real and the probability p(y_d = 0 | ĝ_i(x,t)) = 1 − p(y_d = 1 | ĝ_i(x,t)) that it is fake. Likewise, it gives the probability p(y_d = 1 | g_i(x,t)) that a real glyph g_i(x,t) is real, and the probability p(y_d = 0 | g_i(x,t)) = 1 − p(y_d = 1 | g_i(x,t)) that it is fake.
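The generator/discriminator interface can be sketched as follows. Simple linear maps with a sigmoid stand in for the DCNN generator G and the CNN discriminator, the 32 × 32 glyph size and 64-dimensional z_i are assumptions, and the key points illustrated are the per-font concatenation [c; z_i] of formula (4) and the complementary probabilities p(y_d = 0 | ·) = 1 − p(y_d = 1 | ·).

```python
import numpy as np

rng = np.random.default_rng(3)
C_FEAT, Z_DIM, M_FONTS, G_SIZE = 512, 64, 8, 32   # glyph size assumed 32x32

# Toy linear stand-ins for the DCNN generator and the CNN discriminator.
W_gen = rng.standard_normal((G_SIZE * G_SIZE, C_FEAT + Z_DIM)) * 0.01
w_dis = rng.standard_normal(G_SIZE * G_SIZE) * 0.01

def generate(c, z):
    """g_hat_i(x,t) = G([c; z_i]), formula (4), with a linear G for illustration."""
    flat = W_gen @ np.concatenate([c, z])
    sig = 1.0 / (1.0 + np.exp(-flat))              # pixel intensities in (0, 1)
    return sig.reshape(G_SIZE, G_SIZE)

def d_real_prob(glyph):
    """Discriminator's probability p(y_d = 1 | glyph) that the glyph is real."""
    s = w_dis @ glyph.reshape(-1)
    return 1.0 / (1.0 + np.exp(-s))

c = rng.standard_normal(C_FEAT)
z = [rng.standard_normal(Z_DIM) for _ in range(M_FONTS)]   # one z_i per font
fakes = [generate(c, zi) for zi in z]

p_real = d_real_prob(fakes[0])
p_fake = 1.0 - p_real          # p(y_d = 0 | g) = 1 - p(y_d = 1 | g)
```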
The network parameters to be trained include the parameters to be learned in the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator, and the glyph discriminator. When training the network, the invention updates the parameters by combining the character category prediction loss, the glyph pixel loss, and the glyph discriminator loss. Specifically, the invention iteratively optimizes two objective functions L_G and L_D:

L_G = Σ_{t=1}^{T} [ −log p(ŷ_t = y_t | x) + α Σ_{i=1}^{m} ( −log p(y_d = 1 | ĝ_i(x,t)) + ||ĝ_i(x,t) − g_i(x,t)||_1 ) ]   formula (5)

L_D = Σ_{t=1}^{T} Σ_{i=1}^{m} [ −log p(y_d = 0 | ĝ_i(x,t)) − log p(y_d = 1 | g_i(x,t)) ]   formula (6)

where α is a weight coefficient, set to 0.01; y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x; and ||·||_1 denotes the L1 norm. In L_G, the first term −log p(ŷ_t = y_t | x) is the character category prediction loss, the second term −log p(y_d = 1 | ĝ_i(x,t)) is the loss for the glyph discriminator mistakenly judging a generated glyph as real, and the third term ||ĝ_i(x,t) − g_i(x,t)||_1 is the glyph pixel loss. In L_D, the first term −log p(y_d = 0 | ĝ_i(x,t)) is the loss for the discriminator correctly judging a generated glyph as fake, and the second term −log p(y_d = 1 | g_i(x,t)) is the loss for the discriminator correctly judging a real glyph as real. This iterative optimization realizes the competition between the glyph discriminator and the glyph generator. The Adam optimizer from the literature (Kingma, D.P., & Ba, J.L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations) is used to optimize the network parameters; the initial learning rate is set to 0.001 and is decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method are used.
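The objective computation and the stated learning-rate schedule can be sketched in plain Python. The probabilities and pixel losses below are illustrative numbers, not outputs of a trained network; α = 0.01 and the 0.001 / ×0.9-per-10,000-steps schedule follow the text.

```python
import math

ALPHA = 0.01   # weight coefficient from the text

def generator_loss(log_p_class, p_fake_is_real, pixel_l1):
    """L_G for one time step: class loss + alpha * (adversarial + pixel losses)."""
    adv = sum(-math.log(p) for p in p_fake_is_real)   # reward fooling the discriminator
    pix = sum(pixel_l1)                               # L1 glyph pixel loss
    return -log_p_class + ALPHA * (adv + pix)

def discriminator_loss(p_fake_is_real, p_real_is_real):
    """L_D for one time step: judge generated glyphs fake and real glyphs real."""
    return (sum(-math.log(1.0 - p) for p in p_fake_is_real)
            + sum(-math.log(p) for p in p_real_is_real))

def learning_rate(step):
    """Adam learning rate: 0.001, multiplied by 0.9 every 10,000 steps."""
    return 0.001 * 0.9 ** (step // 10000)

# One time step with m = 4 fonts (illustrative values).
p_fake = [0.3, 0.2, 0.4, 0.1]       # p(y_d = 1 | generated glyph)
p_real = [0.9, 0.8, 0.95, 0.85]     # p(y_d = 1 | real glyph)
pix = [5.0, 4.0, 6.0, 3.0]
L_G = generator_loss(math.log(0.7), p_fake, pix)
L_D = discriminator_loss(p_fake, p_real)
```

In training, one would alternate: minimize L_D with the generator frozen, then minimize L_G with the discriminator frozen, which is the iterative competition the text describes.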
Compared with the prior art, the beneficial effects of the invention include the following aspects:
The invention provides a text recognition method based on standard glyph generation: a neural network model based on an attention mechanism and a generation mechanism is established; at each time step, attention is focused on one position of the picture, and the neural network features at that position are used to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the text in a natural scene picture containing one or more characters. By generating glyphs in multiple fonts and improving the attention module, the invention improves recognition accuracy and glyph generation quality. The specific benefits are as follows:
First, the method uses standard glyph generation to guide the learning of character features; compared with most methods that guide feature learning with character labels alone, it learns scene-irrelevant features better, thereby improving recognition accuracy.
Second, the method generates standard glyphs through a spatial attention mechanism; compared with the existing SSFL method, it better generates the standard glyphs corresponding to irregularly shaped text, greatly improving glyph generation quality, so text recognition achieves better accuracy.
Third, the method generates standard glyphs in multiple fonts, further enhancing the robustness of the learned features. Compared with generating a single-font standard glyph, this reduces the font style information retained in the natural scene text features, which is more conducive to recognizing the content.
Drawings
Fig. 1 is a flowchart of a text recognition method provided in the present invention.
FIG. 2 is a comparison of the glyph generation method of the present invention and the SSFL method.
FIG. 3 is an exemplary diagram of a standard glyph font utilized in the present invention.
FIG. 4 is a comparison graph of glyphs generated by the present invention and SSFL method when processing irregular-shaped text pictures.
FIG. 5 is a comparison graph of glyph pixel loss values during training for the present invention and other prior art.
Fig. 6 is a comparison graph of the visualization result of the attention weight matrix calculated by the SAR method and the present invention.
FIG. 7 is a comparison graph of standard glyphs generated with and without the use of adversarial learning in accordance with the present invention.
FIG. 8 is a comparison graph of standard glyphs generated using single font and multi-font training in accordance with the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters, and sequentially outputs the characters in the picture according to the writing sequence. The invention uses a neural network model based on an attention mechanism and a generation mechanism, focuses attention on a certain position of the picture at each moment, and respectively predicts character types and generates multi-font standard fonts by using the neural network characteristics of the position until all characters in the picture are traversed.
The flow chart of the invention is shown in the attached figure 1, and when the method is implemented, the method comprises the following steps:
1. Extract the visual features F(x) of the input picture x using the CNN feature extractor, where x ∈ R^(48×160×3), i.e. the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and F(x) ∈ R^(H×W×C), where H = 6, W = 40, and C = 512 are respectively the height, width, and number of channels of F(x).
Table 1: parameter configuration of the CNN feature extractor in the embodiment (reproduced as an image in the original filing). The second column is the feature dimension output by each convolution group, in the format h × w × c, where h, w, and c are respectively the height, width, and number of channels of the features. Except for the first convolution group, each convolution group is internally configured with the residual units proposed in the literature (He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778)). A convolution group is denoted by its number n of residual units, where each residual unit contains two convolution layers with kernel sizes 1 × 1 and 3 × 3 respectively and o output feature channels, and the stride denotes the convolution step size.
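The residual unit described above (a 1 × 1 convolution followed by a 3 × 3 convolution, added to a shortcut) can be sketched in NumPy. The channel counts and random weights are illustrative, and stride-1, same-padded convolutions with a ReLU between the two layers are assumed, since the exact Table 1 values are not recoverable from the filing.

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded, stride-1 2D convolution: x (H, W, Cin), w (k, k, Cin, Cout)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]               # (k, k, Cin)
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

def residual_unit(x, w1, w3):
    """y = x + conv3x3(relu(conv1x1(x))), with an identity shortcut (Cin == Cout)."""
    h = np.maximum(conv2d(x, w1), 0.0)   # 1x1 conv + ReLU
    return x + conv2d(h, w3)             # 3x3 conv + shortcut

rng = np.random.default_rng(4)
Cin, Cmid = 32, 16                       # illustrative channel counts
x = rng.standard_normal((6, 8, Cin))
w1 = rng.standard_normal((1, 1, Cin, Cmid)) * 0.1
w3 = rng.standard_normal((3, 3, Cmid, Cin)) * 0.1
y = residual_unit(x, w1, w3)
```

The shortcut makes the unit an identity map when both convolutions are zero, which is the property that lets deep stacks of these units train stably.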
2. Perform sequence modeling on the features F(x) using the LSTM encoder and decoder of the SAR method. Both the LSTM encoder and decoder have two hidden layers with 512 nodes per layer. Each column of F(x) along the width (W) dimension is max-pooled along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x,t) of the LSTM decoder at time t is fed to the attention module together with F(x) to obtain the attention weight matrix M(x,t) ∈ R^(H×W), computed as:

M′_ij(x,t) = tanh( Σ_{p,q ∈ N(i,j)} W_F F_pq(x) + W_h h(x,t) )

M(x,t) = softmax( W_M M′(x,t) )

where M′(x,t) is an intermediate variable of the calculation, i.e. the attention weight matrix before softmax normalization; M′_ij(x,t) is its entry at position (i,j), 1 ≤ i ≤ H, 1 ≤ j ≤ W; N(i,j) is the neighborhood centered at (i,j), i.e. i−1 ≤ p ≤ i+1 and j−1 ≤ q ≤ j+1; F_pq(x) is the feature of F(x) at position (p,q); W_F, W_h, and W_M are parameters to be learned; tanh is the hyperbolic tangent function; and softmax is the normalized exponential function.
3. Dot-multiply the feature of each channel of F(x) with M(x,t) to obtain the attention vector c(x,t) ∈ R^C, which represents the features of the picture region attended to at time t.
4. Classify the features obtained by concatenating the attention vectors c(x,t) and h(x,t) using a softmax classifier, commonly used in machine learning, to obtain the probability of the character category ŷ_t at the attended position at time t:

p(ŷ_t | x) = softmax( W_o [c(x,t); h(x,t)] + b_o )

where W_o and b_o are parameters to be learned, the square brackets denote concatenation, and ŷ_t ∈ C, with C the set of all character categories. The ŷ_t that maximizes p(ŷ_t | x) is selected as the predicted character category.
5. Use the glyph generator, based on a deconvolutional neural network, with the attention vector c(x,t) as input to generate the standard glyphs of the m selected fonts:

ĝ_i(x,t) = G( [c(x,t); z_i] ),  i = 1, ..., m

where z_i is the embedding vector of the i-th font, a random vector drawn from a multivariate standard normal distribution, and the square brackets denote concatenation. The real multi-font standard glyphs g_i(x,t) are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, adopting the idea of generative adversarial networks, the glyph discriminator discriminates between the generated and real standard glyphs, and the competition between the discriminator and the generator makes the generated glyphs more accurate. The glyph discriminator gives the probability p(y_d = 1 | ĝ_i(x,t)) that a generated glyph ĝ_i(x,t) is real and the probability p(y_d = 0 | ĝ_i(x,t)) = 1 − p(y_d = 1 | ĝ_i(x,t)) that it is fake; likewise, it gives the probability p(y_d = 1 | g_i(x,t)) that a real glyph g_i(x,t) is real and the probability p(y_d = 0 | g_i(x,t)) = 1 − p(y_d = 1 | g_i(x,t)) that it is fake. The configuration parameters of the glyph generator and discriminator are shown in Table 2: the first, second, and third columns of the table give each network layer's name, type, and specific configuration, respectively. In the third column, for convolution and deconvolution layers, "k × k × c, s, BN, ReLU" means the kernel size is k × k, the output feature dimension is c, the stride is s, and batch normalization and the ReLU activation function are used. For a fully connected layer, "i × o" means the input feature dimension of the layer is i and the output feature dimension is o.
Table 2: parameter configuration of the glyph generator and glyph discriminator in the embodiment (reproduced as an image in the original filing).
When training the whole network, the invention combines the character category prediction loss, the glyph pixel loss, and the glyph discriminator loss to update the network parameters. Specifically, the invention iteratively optimizes the two adversarial objective functions L_G and L_D:

L_G = Σ_{t=1}^{T} [ −log p(ŷ_t = y_t | x) + α Σ_{i=1}^{m} ( −log p(y_d = 1 | ĝ_i(x,t)) + ||ĝ_i(x,t) − g_i(x,t)||_1 ) ]

L_D = Σ_{t=1}^{T} Σ_{i=1}^{m} [ −log p(y_d = 0 | ĝ_i(x,t)) − log p(y_d = 1 | g_i(x,t)) ]

where α is a weight coefficient, set to 0.01, and y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x. In L_G, the first term is the character category prediction loss, the second term is the loss for the glyph discriminator mistakenly judging a generated glyph as real, and the third term is the glyph pixel loss. In L_D, the first term is the loss for the discriminator correctly judging a generated glyph as fake, and the second term is the loss for the discriminator correctly judging a real glyph as real. This iterative optimization realizes the competition between the glyph discriminator and the glyph generator. The Adam optimizer from the literature (Kingma, D.P., & Ba, J.L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations) is used to optimize the network parameters; the initial learning rate is set to 0.001 and is decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method are used.
FIG. 2 compares the glyph generation schemes of the present invention and the existing SSFL method. The upper half of the dotted line is the attention-based standard glyph generation scheme provided by the invention, and the lower half is the standard glyph generation scheme of the SSFL method. The scheme adopted by the invention differs from SSFL in two main respects: first, an attention mechanism is used to generate the standard glyph corresponding to each scene character one by one; second, the invention generates standard glyphs in multiple fonts, which helps the model better learn features that are independent of font style.
FIG. 3 shows examples of the fonts used for the standard glyphs in the present invention. The invention trains three network models for English, Chinese and Bengali, respectively. For English, the invention uses 8 fonts (m = 8), namely Arial, Bradley Hand ITC, Comic Sans MS, Courier New, Georgia, Times New Roman, Kunstler Script and Vladimir Script. For Chinese, the invention uses 4 fonts (m = 4), namely the Song, Kai (regular script), Hei (black) and FangSong (imitation Song) typefaces. For Bengali, the invention uses 1 font (m = 1), Nirmala UI.
TABLE 3 recognition accuracy of the present invention and other prior art techniques on English evaluation datasets
Method          IIIT5k   SVT    IC13   IC15   SVTP   CT80
SSFL            89.4     87.1   94.0   -      73.9   62.5
ASTER           93.4     89.5   91.8   76.1   78.5   79.5
SAR             95.0     91.2   94.0   78.8   86.4   89.6
The invention   95.3     91.3   95.1   81.7   86.0   88.5
TABLE 4 recognition accuracy of the present invention and other prior art techniques on Chinese and Bengali evaluation datasets
Method          Pan+ChiPhoto   ISI Bengali
HOG             59.2           87.4
CNN             61.5           89.7
ConvCoHOG       71.2           92.2
The invention   89.4           97.4
Tables 3 and 4 report the recognition accuracy (in %) of the present invention and other prior art on the evaluation datasets. Among them, IIIT5k, SVT, IC13, IC15, SVTP and CT80 are English text datasets commonly used in the field. The present invention achieves the best results on most datasets: it has a clear advantage in accuracy on the IC15 dataset, while its accuracy lags slightly behind the SAR method on the two smaller datasets SVTP and CT80. Pan+ChiPhoto is a Chinese dataset and ISI Bengali is a Bengali dataset, on both of which the present invention also achieves the highest recognition accuracy. HOG is the method in the literature (Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (Vol. 1, pp. 886–893)), CNN is the method in the literature (Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep Features for Text Spotting. In European Conference on Computer Vision (pp. 512–528)), and ConvCoHOG is a convolutional co-occurrence-HOG-based method from the literature. Overall, the invention is more advanced than the prior art on the task of text recognition in natural scenes.
FIG. 4 compares the glyphs generated by the present invention and the SSFL method when processing pictures of irregularly shaped text. The SSFL method generates a standard glyph by a global mapping and therefore cannot handle irregularly shaped text well. The invention locates the approximate position of each character through the attention mechanism and then generates the corresponding standard glyph, thereby obtaining better results on irregular text.
FIG. 5 plots the glyph pixel loss (also called the L1 loss) during training for the present invention and other prior art. CNN-DCNN is the standard glyph generation framework used by the SSFL method; CNN-DCNN (Skip) adds skip connections to CNN-DCNN; CNN-LSTM-DCNN is an improved version of CNN-DCNN in which the CNN features are passed through an LSTM before being fed to the deconvolutional network (DCNN); and Attentional Generation is the attention-based standard glyph generation framework proposed herein. For a fair comparison, the four methods use the same CNN and DCNN structures and the same training data, and the first three methods also adopt multi-font generation. The comparison shows that the attention-based generation method proposed by the invention produces more accurate standard glyphs.
FIG. 6 compares the visualizations of the attention weight matrix (i.e., M(x, t)) obtained by the present invention and the SAR method. Because the invention learns to generate standard glyphs, its attention module produces a more accurate and more meaningful attention weight matrix. The 2nd and 3rd columns in the figure show the heat maps of M(x, t) computed by the SAR method and by the invention, respectively, and the underlined letter below each heat map is the character label predicted by the model at a certain time step. Taking the first group of pictures as an example, the invention focuses attention on the cursive letter "L" and correctly recognizes it, while the SAR method attends to an inaccurate region and mistakenly recognizes it as "R".
FIG. 7 compares standard glyphs generated by the present invention with and without adversarial learning, where one row of outputs shows the results without adversarial training, another shows the results with adversarial training, and "target" shows the real standard glyphs. Through adversarial learning, the invention can better generate standard glyphs and recognize the text content for blurred and distorted text. Although many of the generated standard glyphs still differ somewhat from the real standard glyphs after adversarial training, the key strokes are clearly improved compared with training without the adversarial loss.
FIG. 8 compares standard glyphs generated by the invention when trained with a single font and with multiple fonts, where one row of outputs shows the results of training with a single font (the font name is given in parentheses), another shows the results of training with multiple fonts, and "target" shows the real standard glyphs. If only the standard glyphs of a single font are used for training, the model can neither correctly generate the standard glyph nor recognize the character when it encounters an unfamiliar font style at test time. By generating standard glyphs in multiple fonts, the model can better learn features that are independent of font style and thereby correctly recognize the text content.
The technical solutions in the embodiments of the present invention are clearly and completely described above with reference to the drawings in the embodiments of the present invention. It is to be understood that the described examples are only a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (10)

1. A text recognition method based on standard glyph generation, which builds a neural network model based on an attention mechanism and a generation mechanism; at each time step, attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the text in a natural scene picture containing one or more characters;

the neural network model based on the attention mechanism and the generation mechanism comprises:

A. a convolutional neural network for extracting the visual features F(x) of an input picture x;

B. a recurrent neural network for sequence modeling of the features F(x), comprising an LSTM encoder and an LSTM decoder;

C. an attention module for obtaining the attention weight matrix M(x, t) from F(x) and the hidden state h(x, t) of the recurrent neural network at time t;

D. a classifier for classifying features;

E. a glyph generator for generating, from the attention vector c(x, t), the corresponding standard glyphs ĝ_1(x, t), ..., ĝ_m(x, t) in m fonts;

F. a glyph discriminator that competes against the glyph generator so that the glyph generator can generate more realistic standard glyphs;

the text recognition method based on standard glyph generation comprises the following steps:

1) restructuring the convolutional neural network of the ASTER method so that the stride of the first convolution unit in each of the last two convolution groups is 1×1, and using it as the CNN feature extractor; extracting with it the visual features F(x) ∈ R^(H×W×C) of the input picture x, where H, W and C denote the height, width and number of channels of F(x), respectively;

2) performing sequence modeling on F(x) with the recurrent neural network, and feeding the hidden state h(x, t) of the recurrent neural network at time t, together with F(x), into the attention module to obtain the attention weight matrix M(x, t), which represents the attention allocated to each region of the image at time t;

3) taking the element-wise product of F(x) and M(x, t) on each feature channel to obtain the attention vector c(x, t), which represents the features of the image region attended to at time t;

4) classifying the concatenation of c(x, t) and h(x, t) with the classifier to predict the character category at the position attended to at time t;

5) generating, with the glyph generator, the corresponding standard glyphs ĝ_1(x, t), ..., ĝ_m(x, t) in m fonts from the attention vector c(x, t); further, the glyph discriminator may compete against the glyph generator so that the glyph generator can generate more realistic standard glyphs;

through the above steps, recognition of the text in the picture based on standard glyph generation is realized.
2. The text recognition method based on standard glyph generation according to claim 1, wherein in step 1), for the visual features F(x) of the picture x, x ∈ R^(48×160×3), i.e., the picture x is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels; and F(x) ∈ R^(H×W×C), where H = 6, W = 40 and C = 512.
3. The text recognition method based on standard glyph generation according to claim 1, wherein in step 2), the LSTM encoder and decoder are used for sequence modeling of the features F(x), comprising:

21) the LSTM encoder and decoder each comprise two hidden layers with 512 nodes per layer;

22) each group of features of F(x) along the width dimension W is first max-pooled along the height dimension, and then fed into the LSTM encoder in turn;

23) the hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder;

24) the hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), into the attention module to obtain the attention weight matrix M(x, t) ∈ R^(H×W), computed as:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )   Eq. (1)

M(x, t) = softmax( W_M M′(x, t) )   Eq. (2)

where M′(x, t) is an intermediate variable of the computation; M′_ij(x, t) denotes the feature of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighborhood centered at (i, j), i.e., i−1 ≤ p ≤ i+1 and j−1 ≤ q ≤ j+1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function and softmax is the softmax function.
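Equations (1) and (2) can be sketched in numpy as below. In the actual model, W_F and W_h are learned matrices producing a feature vector per position, which W_M then projects to a scalar; using scalar-producing vectors here is a simplifying assumption to keep the sketch short:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(2)
H, W, C, D = 6, 40, 512, 512
F = rng.normal(size=(H, W, C))        # visual features F(x)
h = rng.normal(size=D)                # decoder hidden state h(x, t)
W_F = 0.01 * rng.normal(size=C)       # stand-ins for the learned projections
W_h = 0.01 * rng.normal(size=D)
W_M = 1.0

# Eq. (1): sum the projected features over the 3x3 neighborhood N(i, j),
# add the projected hidden state, and squash with tanh.
M_prime = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        acc = sum(W_F @ F[p, q]
                  for p in range(max(0, i - 1), min(H, i + 2))
                  for q in range(max(0, j - 1), min(W, j + 2)))
        M_prime[i, j] = np.tanh(acc + W_h @ h)

# Eq. (2): softmax over all H*W positions, so the attention weights sum to 1.
M = softmax(W_M * M_prime.ravel()).reshape(H, W)
```

The resulting M plays the role of M(x, t): a nonnegative map over the H×W feature grid that sums to 1.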
4. The text recognition method based on standard glyph generation according to claim 1, wherein in step 4), a softmax classifier classifies the concatenation of the attention vector c(x, t) and h(x, t), giving the probability that the character category at the position attended to at time t is ŷ_t:

p(ŷ_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )   Eq. (3)

where W_o and b_o are parameters to be learned, the square brackets denote the concatenation operation, ŷ_t ∈ {1, 2, ..., C}, and C denotes the total number of character categories.
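The softmax classification over the concatenated features described in the claim above amounts to a softmax of an affine map of [c(x, t); h(x, t)]. A self-contained numpy sketch; the feature sizes and the total number of character categories (37, i.e. 26 letters, 10 digits and an end token) are illustrative assumptions:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(3)
c_dim, h_dim, n_classes = 512, 512, 37
c = rng.normal(size=c_dim)                         # attention vector c(x, t)
h = rng.normal(size=h_dim)                         # hidden state h(x, t)
W_o = 0.01 * rng.normal(size=(n_classes, c_dim + h_dim))
b_o = np.zeros(n_classes)

feat = np.concatenate([c, h])                      # [c(x, t); h(x, t)]
p = softmax(W_o @ feat + b_o)                      # class probabilities over C categories
pred = int(p.argmax())                             # predicted character category at time t
```

At inference, `pred` is emitted at each time step until the end token is produced.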
5. The text recognition method based on standard glyph generation according to claim 1, wherein in step 5), a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_1(x, t), ..., ĝ_m(x, t) in m selected fonts, expressed as Eq. (4):

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ),  i = 1, 2, ..., m   Eq. (4)

where z_i is the embedding vector of the i-th font, a random vector drawn from a multivariate standard normal distribution; the square brackets denote the concatenation operation; and m is the chosen number of fonts.
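The input construction of Eq. (4), concatenating the attention vector with a per-font embedding z_i drawn from a standard normal distribution, can be sketched as follows; the single linear-plus-sigmoid layer stands in for the deconvolutional stack, and the embedding size and 32×32 output resolution are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
c_dim, z_dim, m = 512, 128, 8                  # attention/embedding sizes, number of fonts
c = rng.normal(size=c_dim)                     # attention vector c(x, t)
Z = rng.standard_normal(size=(m, z_dim))       # z_i ~ N(0, I): one embedding per font
W_g = 0.01 * rng.normal(size=(32 * 32, c_dim + z_dim))

def G(v):
    # Linear layer + sigmoid standing in for the DCNN glyph generator;
    # the real model would upsample v through deconvolution layers instead.
    return (1.0 / (1.0 + np.exp(-(W_g @ v)))).reshape(32, 32)

# Eq. (4): g_hat_i(x, t) = G([c(x, t); z_i]) for i = 1, ..., m.
glyphs = [G(np.concatenate([c, Z[i]])) for i in range(m)]
```

Each entry of `glyphs` is one grayscale "standard glyph" in [0, 1], one per font embedding.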
6. The text recognition method based on standard glyph generation according to claim 1, wherein the real multi-font standard glyphs g_i(x, t) are obtained by rendering TTF or OTF font files.

7. The text recognition method based on standard glyph generation according to claim 1, further adopting a generative adversarial network, wherein a glyph discriminator based on a convolutional neural network discriminates between the generated standard glyphs and the real standard glyphs; the glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d=1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d=0 | ĝ_i(x, t)) = 1 − p(y_d=1 | ĝ_i(x, t)); it gives the probability that a real glyph g_i(x, t) is real as p(y_d=1 | g_i(x, t)) and the probability that it is fake as p(y_d=0 | g_i(x, t)) = 1 − p(y_d=1 | g_i(x, t)).
8. The text recognition method based on standard glyph generation according to claim 1, wherein the neural network model based on the attention mechanism and the generation mechanism is built and trained with a loss function comprising the character category prediction loss, the glyph pixel loss and the glyph discriminator loss; the learned parameters include those of the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator and the glyph discriminator.

9. The text recognition method based on standard glyph generation according to claim 8, wherein the training process iteratively optimizes two objective functions L_G and L_D, expressed as Eq. (5) and Eq. (6):

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d=1 | ĝ_i(x, t)) + ||ĝ_i(x, t) − g_i(x, t)||_1 ) ]   Eq. (5)

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d=0 | ĝ_i(x, t)) − log p(y_d=1 | g_i(x, t)) ]   Eq. (6)

where α is a weight coefficient; y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x; and ||·|| denotes the L1 norm. In L_G, −log p(y_t | x) is the character category prediction loss, −log p(y_d=1 | ĝ_i(x, t)) is the loss for the glyph discriminator mistakenly identifying a generated glyph as real, and ||ĝ_i(x, t) − g_i(x, t)||_1 is the glyph pixel loss. In L_D, −log p(y_d=0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly identifying a generated glyph as fake, and −log p(y_d=1 | g_i(x, t)) is the loss for the glyph discriminator correctly identifying a real glyph as real.
10. The text recognition method based on standard glyph generation according to claim 9, wherein the network parameters are optimized with the Adam optimizer, the initial learning rate is set to 0.001, and the weight coefficient α is set to 0.01.
CN201910716704.1A 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation Active CN112329803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910716704.1A CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910716704.1A CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Publications (2)

Publication Number Publication Date
CN112329803A true CN112329803A (en) 2021-02-05
CN112329803B CN112329803B (en) 2022-08-26

Family

ID=74319415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910716704.1A Active CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Country Status (1)

Country Link
CN (1) CN112329803B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007122500A (en) * 2005-10-28 2007-05-17 Ricoh Co Ltd Character recognition device, character recognition method, and character data
CN107577651A (en) * 2017-08-25 2018-01-12 上海交通大学 Chinese character style migratory system based on confrontation network
CN107644006A (en) * 2017-09-29 2018-01-30 北京大学 A kind of Chinese script character library automatic generation method based on deep neural network
US20190147304A1 (en) * 2017-11-14 2019-05-16 Adobe Inc. Font recognition by dynamically weighting multiple deep learning neural networks
CN108615036A (en) * 2018-05-09 2018-10-02 中国科学技术大学 A kind of natural scene text recognition method based on convolution attention network
CN108804397A (en) * 2018-06-12 2018-11-13 华南理工大学 A method of the Chinese character style conversion based on a small amount of target font generates
CN109255356A (en) * 2018-07-24 2019-01-22 阿里巴巴集团控股有限公司 A kind of character recognition method, device and computer readable storage medium
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANZHAN CHENG et al.: "Focusing Attention: Towards Accurate Text Recognition in Natural Images", 2017 IEEE International Conference on Computer Vision (ICCV) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023039888A (en) * 2021-09-09 2023-03-22 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, apparatus, and storage medium for model training and word stock generation
CN114419174A (en) * 2021-12-07 2022-04-29 科大讯飞股份有限公司 On-line handwritten text synthesis method, device and storage medium
CN116030476A (en) * 2022-12-30 2023-04-28 中南民族大学 A multi-style handwritten English image label recognition system and method
CN120744144A (en) * 2025-09-03 2025-10-03 成都中医药大学 Method, system and medium for constructing Chinese ancient book foreign word dictionary and aligning text
CN120744144B (en) * 2025-09-03 2026-02-03 成都中医药大学 Method, system and medium for constructing Chinese ancient book foreign word dictionary and aligning text

Also Published As

Publication number Publication date
CN112329803B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
Jiang et al. Scfont: Structure-guided chinese font generation via deep stacked networks
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
CN107368831B (en) English words and digit recognition method in a kind of natural scene image
CN108376244B (en) A method for identifying text fonts in natural scene pictures
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN111368660A (en) A single-stage semi-supervised image human object detection method
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
CN112329803B (en) Natural scene character recognition method based on standard font generation
CN106372624B (en) Face recognition method and system
CN110728694A (en) A long-term visual target tracking method based on continuous learning
Wang et al. Multiscale deep alternative neural network for large-scale video classification
CN111737511A (en) Image description method based on self-adaptive local concept embedding
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN110517270A (en) A kind of indoor scene semantic segmentation method based on super-pixel depth network
CN115512109B (en) A method for image semantic segmentation based on relational context aggregation
CN111680705A (en) MB-SSD Method and MB-SSD Feature Extraction Network for Object Detection
CN110503090B (en) Character detection network training method based on limited attention model, character detection method and character detector
CN111242114B (en) Character recognition method and device
CN110766001B (en) Bank card number positioning and end-to-end identification method based on CNN and RNN
CN112560866B (en) OCR recognition method based on background suppression
CN116994255B (en) Stroke extraction method based on multi-level deep feature fusion
CN117115824B (en) A visual text detection method based on stroke region segmentation strategy
CN114202659B (en) Fine-grained image classification method based on space symmetry irregular local region feature extraction
CN112101479B (en) A hair style recognition method and device
CN118134963B (en) Anti-background-interference twin network single-target tracking method based on hierarchical feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant