Disclosure of Invention
In order to solve the above problems, it is necessary to provide a digital dial plate recognition method based on Mask R-CNN and CRNN.
The invention provides a digital dial plate recognition model based on Mask R-CNN and CRNN, which is obtained by training according to the following method:
step 1, preprocessing the collected digital dial original images, and then dividing them into a training data set, a verification data set and a test data set at a ratio of 6:2:2;
step 2, performing a labeling operation on the training set images: using the labelme tool to label the regions that need to be located and segmented in the original images, obtaining label images;
step 3, inputting the training set images into a ResNet101 model for feature extraction, performing feature map fusion with an FPN to obtain fused feature maps, and inputting the fused feature maps into a region proposal network (RPN) and a RoIAlign layer to obtain target frames of interest (ROIs);
step 4, performing an FCN operation on the fused feature maps and outputting a mask; classifying the target frame ROIs and outputting a class and a labeling frame (box);
step 5, calculating the four corner coordinates of the region to be recognized in the mask, converting the quadrilateral recognition region into a rectangle through perspective transformation, and storing the rectangular image;
step 6, performing digital recognition by taking the stored rectangular image as the input of the convolutional recurrent neural network model CRNN.
The second aspect of the invention provides a digital dial plate recognition method based on Mask R-CNN and CRNN, characterized in that, after the digital dial plate recognition model based on Mask R-CNN and CRNN is constructed, an original image to be recognized is input into the model and the recognition result is output.
The third aspect of the invention provides a terminal, which comprises a processor, a memory and a digital dial plate recognition program stored in the memory, wherein when the recognition program is run by the processor, the steps of the digital dial plate recognition method based on Mask R-CNN and CRNN are realized.
A fourth aspect of the present invention provides a computer-readable storage medium having computer instructions stored thereon, characterized in that, when executed by a processor, the computer instructions realize the steps of the digital dial plate recognition method based on Mask R-CNN and CRNN.
Compared with the prior art, the invention has prominent substantive features and represents notable progress, in particular: the model constructed by the invention adopts a Mask R-CNN model to realize pixel-level classification and outputs a mask for digit recognition; a CRNN model is then adopted for digit recognition, introducing a bidirectional LSTM and CTC, so that the accuracy of digit recognition is significantly improved.
The model constructed by the invention is particularly applicable to recognizing the digits in gas meter images; compared with conventional algorithms that model instance segmentation and digit recognition separately, it offers wide application scenarios, high accuracy and high recognition speed.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example 1
As shown in fig. 1, this embodiment proposes a digital dial recognition model based on Mask R-CNN and CRNN, which is obtained by training according to the following method:
step 1, preprocessing the collected digital dial original images, and then dividing them into a training data set, a verification data set and a test data set at a ratio of 6:2:2.
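What follows is a minimal Python sketch of such a 6:2:2 split, assuming the preprocessed images are available as a list of file paths; the function name and fixed seed are illustrative, not part of the invention:

```python
import random

def split_dataset(image_paths, seed=42):
    """Shuffle the preprocessed dial images and split them 6:2:2
    into training, verification and test sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * 0.6)
    n_val = int(len(paths) * 0.2)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]  # the remaining ~20%
    return train, val, test
```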
Step 2, performing a labeling operation on the training set images: the regions that need to be located and segmented in the original images are labeled with the labelme tool to obtain label images.
Step 3, inputting the training set images into a ResNet101 model for feature extraction, performing feature map fusion with an FPN to obtain fused feature maps, and inputting the fused feature maps into a region proposal network (RPN) and a RoIAlign layer to obtain target frames of interest (ROIs);
wherein the ResNet101 model comprises five stages, conv1, conv2_x, conv3_x, conv4_x and conv5_x, whose outputs are denoted C1, C2, C3, C4 and C5, respectively;
the output bottom layer feature layer obtains the same channel number as the upper layer feature layer through convolution of 1x 1, the upper layer feature layer obtains the same length and width as the lower layer feature layer through up-sampling and then is added, so that a new feature layer P2-P6 which is well fused is obtained, and the FPN completes fusion of bottom layer-to-high layer fusion feature map maps; the feature layers P2-P5 are used for predicting class, box and mask of the object, and the feature layers P2-P6 are used for training the RPN, namely P6 is only used in the RPN network.
The region proposal network RPN slides a window over the shared feature map and generates, at each position, 9 anchors (target frames) with preset aspect ratios and areas. These 9 initial anchors cover three areas (128×128, 256×256, 512×512), each combined with three aspect ratios (1:1, 1:2, 2:1). The RPN adopts a tree-shaped structure: the trunk is a 3×3 convolutional layer and the branches are two 1×1 convolutional layers, the first of which judges whether an anchor covers a target, while the second performs coordinate correction on the foreground anchors, thereby outputting corrected frames (ROIs).
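The 9 initial anchors can be enumerated directly; a small sketch, assuming anchors are represented as (width, height) pairs centred at a sliding-window position:

```python
import numpy as np

def base_anchors():
    """Enumerate the 9 initial anchors: three areas (128x128, 256x256,
    512x512), each combined with aspect ratios (h:w) of 1:1, 2:1 and 1:2."""
    anchors = []
    for area in (128 ** 2, 256 ** 2, 512 ** 2):
        for ratio in (1.0, 2.0, 0.5):  # ratio = h / w
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((w, h))
    return np.array(anchors)           # shape (9, 2)
```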
The RoIAlign layer is a slight modification of RoI pooling that solves the problem of large pixel misalignment when feature maps in RoI pooling are mapped back to the original image. Its principle is to divide the feature map into k×k units, fix four sampling positions within each unit, compute the values at these four positions by bilinear interpolation, and then perform max pooling to obtain an accurate target frame ROI.
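The bilinear interpolation at the heart of RoIAlign can be sketched as below; this is an illustrative NumPy rendition of sampling one fractional position (in practice a library routine such as torchvision.ops.roi_align performs the whole operation):

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at fractional (y, x) by bilinear
    interpolation, instead of rounding to the nearest pixel as
    RoI pooling does."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0, x0] * (1 - dx) + feature[y0, x1] * dx
    bottom = feature[y1, x0] * (1 - dx) + feature[y1, x1] * dx
    return top * (1 - dy) + bottom * dy
```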
Step 4, performing an FCN operation on the fused feature maps and outputting a mask; classifying the target frame ROIs and outputting a class and a labeling frame (box);
the FCN can receive an input image with any size, then up-samples the feature map of the last convolutional layer through the deconvolution layer to restore the feature map to the same size as the input image, so that a prediction can be generated for each pixel, space information in the original input image is kept, and finally, each pixel is classified on a feature map with the same size as the input map to output a mask.
The target frame ROI is passed through a fully connected network to obtain the final class and labeling frame (box) of the object detection. Notably, the mask branch and the classification branch run in parallel during training, whereas during prediction classification is performed before the mask. When the target frame ROIs are classified, the loss function of each ROI is:
L = L_cls + L_box + L_mask

where L_cls, L_box and L_mask are the classification loss, the bounding-box regression loss and the mask loss, respectively. The classification and regression terms take the standard form

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where:
N_cls: classification weight (normalization) coefficient;
p_i: probability that anchor i is predicted to be a target;
p_i*: binary classification label, equal to 1 if the anchor contains an object and 0 otherwise;
N_reg: regression weight (normalization) coefficient;
t_i*: ground-truth (real) four-point coordinates;
t_i: predicted four-point coordinates.
the output dimension of the mask branch to each ROI is K m, wherein m represents the size of the mask, and K represents K categories; after the prediction mask is obtained, a sigmoid function value L is calculated for each pixel point value of the maskmaskIs input.
Step 5, calculating the four corner coordinates of the region to be recognized in the mask, converting the quadrilateral recognition region into a rectangle through perspective transformation, and storing the rectangular image;
in this step, digital recognition is realized by calculating the position coordinates of the target area in the mask and taking the target area image (quadrangle) as input. Before recognition, the quadrangle should be transformed into a rectangle through perspective transformation. Perspective transformation formula:
u and v are the coordinates of the original picture, the parameter ω is equal to 1, and the picture coordinates x, y are obtained after perspective transformation, wherein,
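In practice this transformation can be performed with OpenCV, which estimates the 3×3 matrix from the four corner correspondences; a minimal sketch, where the output rectangle size and the point ordering are illustrative assumptions:

```python
import cv2
import numpy as np

def rectify_quad(image, quad_pts, out_w=200, out_h=50):
    """Warp the quadrilateral reading window of the dial into an
    upright rectangle. quad_pts are the four corner coordinates
    (u, v) taken from the mask, ordered top-left, top-right,
    bottom-right, bottom-left."""
    src = np.float32(quad_pts)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(src, dst)  # the 3x3 matrix A above
    return cv2.warpPerspective(image, M, (out_w, out_h))
```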
Step 6, performing digital recognition by taking the stored rectangular image as the input of the convolutional recurrent neural network model CRNN;
specifically, the method for performing digital recognition by taking the stored rectangular image as the input of the convolution cyclic neural network model CRNN comprises the following steps:
and after the stored rectangular image is used as the input of the depth CNN for feature extraction to obtain a digital feature map, converting the digital feature map into a feature vector sequence and inputting the feature vector sequence into a BLSTM network, predicting the feature vector sequence, outputting predicted label distribution, and finally converting the obtained label distribution into a final label sequence by using CTC loss to obtain a predicted value, namely a true value of the identified number.
Specifically, the convolutional recurrent neural network model CRNN comprises three parts, from bottom to top:
The convolutional layer CNN uses a deep CNN to extract features from the input image, obtaining a feature map from which a feature vector sequence is extracted. Specifically, the grayscale input image (height H = 32) is passed through the deep CNN to obtain a feature map whose height is reduced to 1 (H = 1), from which the feature vector sequence required by the RNN is extracted. Two adjustments are made to the deep CNN: so that the CNN-extracted features can be fed into the RNN network, the window sizes of the last two pooling layers are changed from 2×2 to 1×2; and to accelerate model convergence and shorten training, BN (batch normalization) modules are introduced. When the feature vector sequence is extracted, the feature vectors are generated on the feature map column by column from left to right, i.e. the i-th feature vector is the concatenation of the pixels in the i-th column across all feature map channels; these vectors form a sequence, and each vector corresponds, at a fixed stride, to a region of the original image, which it serves to classify;
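A minimal sketch of this map-to-sequence step in PyTorch, assuming the CNN has already reduced the feature map height to 1:

```python
def map_to_sequence(feature_map):
    """Convert a CNN feature map of shape (N, C, 1, W) into the
    feature vector sequence expected by the recurrent layers:
    one C-dimensional vector per horizontal position, left to right."""
    n, c, h, w = feature_map.shape
    assert h == 1, "CRNN expects the CNN to reduce the height to 1"
    seq = feature_map.squeeze(2)   # (N, C, W)
    seq = seq.permute(2, 0, 1)     # (W, N, C): sequence-first for nn.LSTM
    return seq
```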
The recurrent layer RNN uses a bidirectional LSTM to make predictions over the feature vector sequence, learning from each feature vector in the sequence and outputting the predicted label distributions;
the feature vector sequence is input into a two-layer bidirectional LSTM network, which predicts which character each feature vector corresponds to, obtaining a softmax probability distribution over all characters, i.e. a vector whose length equals the number of character classes; this distribution serves as the input of the CTC layer;
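A hedged PyTorch sketch of such a two-layer bidirectional LSTM followed by the per-character projection; the hidden size and the alphabet size (10 digits plus one CTC blank, a reasonable assumption for a dial) are illustrative:

```python
import torch.nn as nn

class BidirectionalLSTM(nn.Module):
    """Two stacked bidirectional LSTM layers plus a linear projection
    to per-character logits, one distribution per feature vector."""
    def __init__(self, input_size=512, hidden_size=256, num_classes=11):
        super().__init__()
        # num_classes = 10 digits + 1 CTC blank (assumption for a dial)
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers=2,
                           bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, seq):        # seq: (W, N, C)
        out, _ = self.rnn(seq)     # (W, N, 2 * hidden_size)
        return self.fc(out)        # (W, N, num_classes) logits for CTC
```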
The transcription layer (CTC loss) converts the series of label distributions obtained from the recurrent layer into the final label sequence; by converting the RNN's per-feature-vector predictions into a label sequence and introducing a blank mechanism, the problem of merging consecutive identical characters during prediction is solved.
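During training the transcription layer can be realized with a CTC loss (e.g. torch.nn.CTCLoss); at prediction time the label distributions are collapsed by taking the best class per frame, merging consecutive repeats, and removing blanks. A greedy-decoding sketch, assuming the blank has index 0 and a batch of one:

```python
def ctc_greedy_decode(logits, blank=0):
    """Collapse per-frame predictions into a label sequence:
    best class per frame, merge consecutive repeats, drop blanks."""
    best = logits.argmax(dim=-1).squeeze(1).tolist()  # (W,) best class per frame
    result, prev = [], blank
    for c in best:
        if c != prev and c != blank:
            result.append(c)
        prev = c
    return result
```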
The specific network configuration is given in the accompanying table, where the first row is the top-most layer, and "k", "s" and "p" denote kernel size, stride and padding size, respectively.
The model constructed by the invention adopts a Mask R-CNN model to realize pixel-level classification and outputs a mask for digit recognition; a CRNN model is then adopted for digit recognition, introducing a bidirectional LSTM and CTC, so that the accuracy of digit recognition is significantly improved.
It should be noted that, after the training of the recognition model is completed, the verification data set and the test data set are used to verify and test the recognition model.
Example 2
This embodiment provides a digital dial plate recognition method based on Mask R-CNN and CRNN: after the digital dial plate recognition model based on Mask R-CNN and CRNN of embodiment 1 is constructed, an original image to be recognized is input into the model, and the recognition result is output.
Specifically, the recognition model of embodiment 1 is embedded into the front-end software of the gas meter to form a digital recognition mode integrating photographing, classification and recognition; the recognition model is used according to the following steps:
acquiring a gas meter image: manually photographing the gas meter and uploading the photo to the front end to obtain an original image;
inputting the original image into the recognition model as the input image;
running the recognition model, outputting the digits to be recognized in the target area, and feeding the result back to the front end.
Example 3
This embodiment provides a terminal, which includes a processor, a memory, and a digital dial plate recognition program stored in the memory, wherein when the recognition program is executed by the processor, the steps of the Mask R-CNN and CRNN-based digital dial plate recognition method according to embodiment 2 are implemented.
Example 4
The present embodiment provides a computer-readable storage medium, on which computer instructions are stored; when the computer instructions are executed by a processor, the steps of the Mask R-CNN and CRNN-based digital dial plate recognition method according to embodiment 2 are implemented.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated module may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, all or part of the flow in the method of the embodiments described above may be implemented by a computer program, which may be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file or some intermediate form.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.