CN114067111A

CN114067111A - Method for eliminating interference of seal on document extraction

Info

Publication number: CN114067111A
Application number: CN202111197695.3A
Authority: CN
Inventors: 潘新星; 陶提; 黄登; 高翔; 陈运文; 纪达麒
Original assignee: Daguan Data Suzhou Co ltd
Current assignee: Daguan Data Suzhou Co ltd
Priority date: 2021-10-14
Filing date: 2021-10-14
Publication date: 2022-02-18

Abstract

The invention discloses a method for eliminating the interference of seals with document extraction. For an image P bearing a seal, the method comprises: removing the seal from image P by an image processing method to obtain an image Q; performing text detection on image Q to obtain the coordinates of the text regions; and taking the regions of image P at those coordinates as the text regions for text recognition. In the intelligent auditing of bank statements, eliminating seal interference improves text detection and yields more accurate digit recognition results; in intelligent contract comparison, the invention reduces the spurious and erroneous recognitions caused by seal interference, thereby reducing the probability of subsequent comparison errors.

Description

Method for eliminating interference of seal on document extraction

Technical Field

The invention belongs to the field of optical character recognition, and particularly relates to a method for eliminating interference of a seal on document extraction.

Background

Optical character recognition (OCR) technology is widely applied in business fields such as intelligent verification and comparison of electronically scanned documents. The accuracy of the extraction result directly determines the effectiveness of these services: the higher the extraction accuracy, the simpler the post-processing logic of the corresponding service and the more robust the system.

Scanned documents often bear official seals and signature stamps. When a seal is stamped over text, it interferes with the model's detection and recognition of that text and produces wrong extraction results: a normal document is flagged as abnormal, or two documents that are actually identical are judged to differ, which ultimately causes business errors in verification and comparison. If the interference of the seal with extraction can be effectively eliminated, extraction accuracy improves and the error probability of the subsequent processes can be greatly reduced.

Two methods are currently in common use against seal interference. The first uses traditional image processing to remove the red channel of the image, thereby filtering out red areas; the second uses a generative adversarial network (GAN) to remove the seal and restore the text in the seal area.

The first method is simple and effective, but it works poorly on seals that are not pure red. Blindly widening the removal threshold affects the extraction of other valid text; in particular, when the document contains valid red text, the method filters that information out by mistake and causes errors in the subsequent business process.

The second method works well on complex scenes, but the model is heavy and consumes more system resources. In addition, training a GAN for seal removal requires a large amount of data, and labeling real samples alone cannot meet this requirement, because besides the stamped scanned document, the original document before stamping must also be available. Training data is therefore often synthesized, but synthetic data inevitably differs from real data, which leads to poor generalization of the trained model. Moreover, a GAN does not restore text perfectly: some pixels may be missing or distorted, which degrades the recognition performance of the subsequent model.

Disclosure of Invention

To address these problems in the prior art, the invention provides a method for eliminating the interference of seals with document extraction. Traditional image processing alone struggles with complex and variable real scenes, while deep-learning seal-removal methods such as generative adversarial networks are difficult to train. By combining traditional image processing with deep learning algorithms, the method quickly and effectively eliminates the interference of the seal with the model, so that the useful information in the document can be extracted accurately.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for eliminating the interference of a seal with document extraction is provided. For an image P bearing a seal, the method comprises the following steps: removing the seal from image P by an image processing method to obtain an image Q; performing text detection on image Q to obtain the coordinates of the text regions; and taking the regions of image P at those coordinates as the text regions for text recognition.
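The data flow of this method can be sketched as below. This is a minimal illustration, not the patented implementation: the function name `extract_text`, the helper `crop`, the box layout (x1, y1, x2, y2), and the representation of an image as a list of pixel rows are all assumptions, and the three stages are passed in as callables because the source does not name specific models.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2)
Image = Sequence[Sequence[object]]  # assumed stand-in: a list of pixel rows


def crop(image: Image, box: Box) -> List[List[object]]:
    """Cut the rectangle described by `box` out of `image`."""
    x1, y1, x2, y2 = box
    return [list(row[x1:x2]) for row in image[y1:y2]]


def extract_text(
    image_p: Image,
    remove_seal: Callable[[Image], Image],
    detect_text: Callable[[Image], List[Box]],
    recognize_text: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Detect text on the seal-free image Q, but recognize it on the original P."""
    image_q = remove_seal(image_p)   # image Q: seal removed
    boxes = detect_text(image_q)     # coordinates come from Q
    # Recognition reads the pixels of P, so text pixels lost during
    # seal removal cannot affect the recognized characters.
    return [(box, recognize_text(crop(image_p, box))) for box in boxes]
```

The point of the design is that Q is used only to locate text; the pixels handed to the recognizer always come from P.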

Preferably, removing the seal from image P by the image processing method to obtain image Q comprises: performing seal detection on image P to obtain a seal region; and removing the seal from that seal region of image P by the image processing method.

Preferably, removing the seal from the seal region of image P by the image processing method comprises: computing the median of the three channel values over all pixels in the seal region and taking this median as the channel values of the background color; and examining the three channel values of each pixel in the seal region and, if they fall in the red color gamut or the blue color gamut, replacing them with the channel values of the background color.
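A minimal NumPy sketch of this median-background replacement follows. The source does not define the red or blue color gamut, so the dominance test used here (one channel exceeding both others by `margin`) and the default `margin=40` are hypothetical choices for illustration only; the function name `remove_seal_colors` and the RGB channel order are likewise assumptions.

```python
import numpy as np


def remove_seal_colors(region: np.ndarray, margin: int = 40) -> np.ndarray:
    """Return a copy of an H x W x 3 RGB region with red/blue pixels set to the background."""
    # Per-channel median over all pixels; text is sparse, so this
    # approximates the background color of the region.
    background = np.median(region.reshape(-1, 3), axis=0).astype(region.dtype)

    # Use a signed type so channel differences do not wrap around.
    r, g, b = (region[..., i].astype(np.int16) for i in range(3))

    # Hypothetical gamut test (not from the source): a pixel counts as
    # red or blue when that channel exceeds both others by `margin`.
    is_red = (r - np.maximum(g, b)) > margin
    is_blue = (b - np.maximum(r, g)) > margin

    cleaned = region.copy()
    cleaned[is_red | is_blue] = background
    return cleaned
```

A production version would more likely test hue and saturation ranges in a color space such as HSV; the simple RGB rule above only shows where such a test plugs in.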

Preferably, the input of the text recognition is a three-channel color image.

Preferably, removing the seal from image P by the image processing method to obtain image Q includes a pre-processing step of correcting image P by image rotation.
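As a small illustration of this rotation correction, the sketch below assumes that the angle is already known and is a multiple of 90 degrees measured clockwise. The source only says that a model outputs angle information, so both the four-way angle convention and the function name `correct_rotation` are assumptions.

```python
import numpy as np


def correct_rotation(image: np.ndarray, clockwise_angle: int) -> np.ndarray:
    """Undo a clockwise rotation of 0, 90, 180 or 270 degrees."""
    if clockwise_angle % 90 != 0:
        raise ValueError("this sketch only handles multiples of 90 degrees")
    # np.rot90 rotates counterclockwise for positive k, which undoes
    # a clockwise rotation of the same size.
    return np.rot90(image, k=(clockwise_angle // 90) % 4)
```

Arbitrary skew angles would need an interpolating rotation from an image library instead of `np.rot90`.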

A computer storage medium, wherein a computer program is stored in the computer storage medium, and when the computer program is executed, the computer program realizes any one of the methods for eliminating the interference of the seal on the document extraction.

An apparatus for eliminating the interference of a seal with document extraction, the apparatus being used to process an image P bearing a seal and comprising: an image processing module for removing the seal from image P by an image processing method to obtain an image Q; a text detection module for performing text detection on image Q to obtain the coordinates of the text regions; and a text recognition module for performing text recognition with the regions of image P at those coordinates as the text regions.

Compared with the prior art, the invention has the beneficial effects that:

1. In the intelligent auditing of bank statements, a seal on a statement can make a text detection box too large, which causes digit recognition errors, and can even cause digit regions to be missed altogether; such errors cannot be corrected by rules. After the seal is removed by this scheme, the detected text boxes are more accurate and recognition is easier, so more accurate digit recognition results are obtained and the accuracy of amount verification improves;

2. In intelligent contract comparison, the scheme reduces the spurious and erroneous recognitions caused by seal interference, thereby reducing the probability of errors in subsequent comparison. It also reduces the complexity of the comparison rules and improves the robustness of the system.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.

As shown in FIG. 1, this embodiment provides a general document extraction process with four main parts: file preprocessing, text detection, text recognition, and information extraction. In the file preprocessing stage, input files are unified into a picture format and the pictures are rotation-corrected; a multi-page PDF can be converted into multiple pictures that are extracted in parallel. The text detection stage provides the coordinates of the bounding boxes of the text regions in the picture. The text recognition stage recognizes the text in each box, one by one, according to those coordinates. The information extraction stage extracts useful information through post-processing rules based on the coordinates and content of the text.

The specific embodiment of the invention is as follows:

1. Upload the document files and, after format conversion, unify them into a picture format. A multi-page PDF is split into multiple pictures, and the following steps are executed on them in parallel.

2. Pass the picture to a rotation classification model, which outputs the angle of the text in the picture, and rotation-correct the picture according to this angle.

3. Pass the corrected picture to an object detection module and determine from the seal-region detection result whether the picture contains a seal. If there is no seal, go directly to step 5. If there is a seal, denote the corrected picture as A and the detected seal region as S, with coordinates (x1, y1, x2, y2), that is, S = A[y1:y2, x1:x2].

4. Remove the seal in region S by an image processing method, with the following specific steps:

a) copy region S of the picture to obtain a picture B;

b) compute the median of the three channel values over all pixels in picture B; because text is sparse, this median can be taken as the channel values of the background color;

c) examine the three channel values of each pixel in picture B; if they fall in the red color gamut or the blue color gamut, replace them directly with the channel values of the background color, and denote the resulting picture as B1;

d) copy picture A to obtain a duplicate A1, and replace region S in A1 with picture B1, that is, A1[y1:y2, x1:x2] = B1;

e) send the original picture A and the seal-removed picture A1 to the next step.

5. If no seal region exists, pass picture A directly to the text detection module; if a seal region exists, pass picture A1 instead. Because the seal has been removed from A1, the detection module obtains more accurate text-region coordinates and avoids the spurious and erroneous boxes caused by seal interference. The result of this step is the detected coordinates of the text regions.

6. Crop text-line strips from the original picture A according to the text-region coordinates and send them to the text recognition flow in batches. The input of the text recognition model is a three-channel color image, so the model can effectively distinguish text from residual seal interference. Moreover, because the model receives slices of the original picture, any loss of text pixels caused by the seal-removal image processing does not affect text recognition.

7. Integrate and extract the useful text information through post-processing rules, according to the detected coordinates and recognized content of the text and the business scenario.
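The array bookkeeping in steps 3, 4 d) and 6 above can be sketched with NumPy slicing as follows. The function names `paste_cleaned_region` and `crop_text_lines` are hypothetical, pictures are assumed to be H x W x 3 arrays, and boxes are assumed to use the (x1, y1, x2, y2) layout of step 3; the seal detector, the text detector and the recognizer are outside the scope of this sketch.

```python
from typing import List, Sequence, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), as in step 3


def paste_cleaned_region(a: np.ndarray, b1: np.ndarray, seal_box: Box) -> np.ndarray:
    """Step 4 d): copy A to A1 and overwrite region S with the cleaned picture B1."""
    x1, y1, x2, y2 = seal_box
    a1 = a.copy()            # A itself stays untouched for step 6
    a1[y1:y2, x1:x2] = b1    # A1[y1:y2, x1:x2] = B1
    return a1


def crop_text_lines(a: np.ndarray, text_boxes: Sequence[Box]) -> List[np.ndarray]:
    """Step 6: cut text-line strips out of the ORIGINAL picture A."""
    return [a[y1:y2, x1:x2].copy() for (x1, y1, x2, y2) in text_boxes]
```

In use, the text boxes passed to `crop_text_lines` would come from running detection on A1, while the strips themselves are always read from A.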

Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.

Claims

1. A method for eliminating the interference of a seal on document extraction is characterized in that aiming at an image P with a seal, the method comprises the following steps:

eliminating the seal by adopting an image processing method on the image P to obtain an image Q;

performing text detection on the image Q to obtain coordinates of a text area;

and taking the area of the image P with the coordinates as a text area for text recognition.

2. The method for eliminating the interference of a seal on document extraction according to claim 1, wherein eliminating the seal by adopting the image processing method on the image P to obtain the image Q comprises:

carrying out seal detection on the image P to obtain a seal area;

and eliminating the stamp by adopting an image processing method for the stamp area in the image P.

3. The method for eliminating the interference of a seal on document extraction according to claim 2, wherein eliminating the seal by adopting the image processing method for the seal area in the image P comprises:

computing the median of the three channel values over all pixel points in the seal area, and taking the median as the channel values of the background color; and examining the three channel values of each pixel point in the seal area, and if the three channel values of a pixel point belong to a red color gamut or a blue color gamut, replacing the three channel values of that pixel point with the channel values of the background color.

4. The method for eliminating the interference of a seal on document extraction according to claim 1, wherein the input of the text recognition is a three-channel color image.

5. The method for eliminating the interference of a seal on document extraction according to claim 1, wherein eliminating the seal by adopting the image processing method on the image P to obtain the image Q comprises a pre-processing of performing picture rotation correction on the image P.

6. A computer storage medium, wherein a computer program is stored in the computer storage medium, and when the computer program is executed, the method for eliminating the interference of the seal on the document extraction according to any one of claims 1 to 5 is implemented.

7. An apparatus for eliminating stamp-to-document extraction interference, said apparatus being adapted to process an image P with a stamp, said apparatus comprising:

the image processing module is used for eliminating the seal by adopting an image processing method on the image P to obtain an image Q;

the text detection module is used for performing text detection on the image Q to obtain the coordinates of the text area;

and the text recognition module is used for performing text recognition by taking the area with the coordinates in the image P as a text area.