CN114067111A - Method for eliminating interference of seal on document extraction - Google Patents
Method for eliminating interference of seal on document extraction
- Publication number
- CN114067111A
- Authority
- CN
- China
- Prior art keywords
- image
- seal
- eliminating
- stamp
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Character Input (AREA)
Abstract
The invention discloses a method for eliminating the interference of a seal with document extraction. For an image P bearing a seal, the method comprises the following steps: removing the seal from the image P by an image processing method to obtain an image Q; performing text detection on the image Q to obtain the coordinates of the text areas; and performing text recognition on the areas of the image P located at those coordinates. In the intelligent auditing of bank statements, eliminating seal interference improves text detection and yields more accurate digit recognition results; in intelligent contract comparison, the invention reduces the duplicate and erroneous recognition caused by seal interference, thereby lowering the probability of subsequent comparison errors.
Description
Technical Field
The invention belongs to the field of optical character recognition, and particularly relates to a method for eliminating interference of a seal on document extraction.
Background
Optical character recognition (OCR) is widely used in business applications such as the intelligent verification and comparison of electronically scanned documents. The accuracy of the extraction result directly determines how well these services perform: the higher the extraction accuracy, the simpler the post-processing logic of the corresponding service and the more robust the system.
Electronically scanned documents often bear official seals and signature stamps. When a seal overlaps text, it interferes with the model's detection and recognition of the characters and produces wrong extraction results: a normal document may be flagged as abnormal, or two documents that are actually identical may be judged to differ, which ultimately causes errors in services such as verification and comparison. If the interference of the seal with extraction can be eliminated effectively, extraction accuracy improves and the error probability of downstream processes drops substantially.
Two methods are commonly used to address seal interference. The first is a traditional image processing method that removes the red channel of the image, thereby filtering out the red areas; the second uses a generative adversarial network (GAN) to remove the seal and restore the text in the seal area.
The first method is simple and effective, but it performs poorly on seals that are not pure red. Blindly widening the removal threshold harms the extraction of other valid text, and when the document contains valid red text the method filters that text out by mistake, causing errors in the subsequent business process.
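A toy illustration of the first method and its failure modes may help. It is only a sketch under assumptions: "removing the red channel" is read here as using the red channel alone as the gray level, so that red ink becomes as bright as paper, and the sample pixel values and the threshold of 128 are made up for illustration.

```python
# Pixels are (R, G, B) tuples in 0..255. All sample values are hypothetical.
WHITE_PAPER = (255, 255, 255)
BLACK_TEXT = (20, 20, 20)
RED_SEAL = (230, 40, 40)     # pure-red seal ink
DARK_SEAL = (110, 30, 50)    # seal ink that is not pure red
RED_TEXT = (200, 0, 0)       # valid red text in the document


def red_channel_gray(pixel):
    """Use the red channel alone as the gray level of the pixel."""
    return pixel[0]


def is_ink(pixel, threshold=128):
    """A pixel counts as ink if its red-channel gray level is dark."""
    return red_channel_gray(pixel) < threshold


for name, px in [("black text", BLACK_TEXT), ("red seal", RED_SEAL),
                 ("dark seal", DARK_SEAL), ("red text", RED_TEXT)]:
    print(f"{name}: kept as ink = {is_ink(px)}")
```

In this toy setting the pure-red seal is filtered out, the darker seal survives as ink, and the valid red text is lost together with the seal, which mirrors the two drawbacks named above.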
The second method works well on complex scenes, but the model is heavy and consumes more system resources. In addition, training a GAN for seal removal requires a large amount of data, and labeling real samples alone cannot meet that need: besides the stamped scanned document, the original document as it was before stamping must also be available. Training data is therefore often synthesized, but synthetic data inevitably differs from real data, so the trained model generalizes poorly. Moreover, a GAN does not restore text perfectly; some pixels are missing or distorted, which degrades the recognition performance of the downstream model.
Disclosure of Invention
To address the problems in the prior art, the invention provides a method for eliminating the interference of a seal with document extraction. By combining traditional image processing with deep learning algorithms, the method quickly and effectively eliminates the interference of the seal with the model, so that the useful information in the document can be extracted accurately. Traditional image processing alone struggles with complex and variable real-world scenes, while deep learning removal methods such as GANs are difficult to train.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for eliminating the interference of a seal with document extraction, which, for an image P bearing a seal, comprises the following steps: removing the seal from the image P by an image processing method to obtain an image Q; performing text detection on the image Q to obtain the coordinates of the text areas; and performing text recognition on the areas of the image P located at those coordinates.
Preferably, removing the seal from the image P by the image processing method to obtain the image Q comprises: performing seal detection on the image P to obtain a seal area; and removing the seal from the seal area of the image P by the image processing method.
Preferably, removing the seal from the seal area of the image P by the image processing method comprises: computing the median of the three channel values over all pixels in the seal area and taking that median as the channel values of the background color; and examining the three channel values of every pixel in the seal area and, if they fall within the red color gamut or the blue color gamut, replacing them with the channel values of the background color.
Preferably, the input to the text recognition is a three-channel color image.
Preferably, removing the seal from the image P by the image processing method to obtain the image Q includes a preprocessing step of rotation-correcting the image P.
A computer storage medium storing a computer program which, when executed, implements any one of the above methods for eliminating the interference of a seal with document extraction.
An apparatus for eliminating the interference of a seal with document extraction, the apparatus being used to process an image P bearing a seal and comprising: an image processing module for removing the seal from the image P by an image processing method to obtain an image Q; a text detection module for performing text detection on the image Q to obtain the coordinates of the text areas; and a text recognition module for performing text recognition on the areas of the image P located at those coordinates.
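The median-based background replacement described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the seal area is modeled as a nested list of (R, G, B) tuples, and the hue ranges, saturation floor, and brightness floor used to decide "red or blue gamut" are assumptions, since the text does not specify them.

```python
import colorsys
from statistics import median_low


def background_color(region):
    """Per-channel median over all pixels; text is sparse, so this is paper."""
    pixels = [px for row in region for px in row]
    return tuple(median_low([px[c] for px in pixels]) for c in range(3))


def is_red_or_blue(pixel, min_sat=0.3, min_val=0.2):
    """Assumed gamut test in HSV space; the thresholds are hypothetical."""
    r, g, b = (c / 255.0 for c in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if s < min_sat or v < min_val:
        return False  # gray, black, or white pixels are never seal ink
    hue = h * 360.0
    is_red = hue <= 20 or hue >= 340
    is_blue = 200 <= hue <= 260
    return is_red or is_blue


def remove_seal(region):
    """Replace red or blue pixels with the background color."""
    bg = background_color(region)
    return [[bg if is_red_or_blue(px) else px for px in row] for row in region]
```

Black text pixels have near-zero saturation and are left untouched, so only colored seal ink is overwritten. Any text that shares the seal's color inside the seal area would also be overwritten in this sketch.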
Compared with the prior art, the invention has the beneficial effects that:
1. In the intelligent auditing of bank statements, a seal on the statement can make the text detection box too large, which causes digit recognition errors and may cause digit areas to be missed altogether, so that the digits cannot be corrected by rules. After the seal is removed by this scheme, the detected text boxes are more accurate, recognition becomes easier, and more accurate digit recognition results are obtained, which improves the accuracy of amount checking;
2. In intelligent contract comparison, this scheme reduces the duplicate and erroneous recognition caused by seal interference, thereby lowering the probability of errors in subsequent comparison. It also reduces the complexity of the comparison rules and improves the robustness of the system.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
As shown in FIG. 1, the present embodiment provides a general process of document extraction consisting of four main parts: file preprocessing, text detection, text recognition, and information extraction. In the file preprocessing stage, input files are converted to a uniform picture format and the pictures are rotation-corrected; a multi-page PDF can be converted into multiple pictures that are extracted in parallel. The text detection stage outputs the coordinates of the bounding boxes of the text areas in the picture. The text recognition stage recognizes the text in each box, one by one, according to those coordinates. The information extraction stage extracts the useful information through post-processing rules, based on the coordinates and content of the text.
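The four stages above can be wired together roughly as in the sketch below. All names here (`TextItem`, `extract_document`, and the four callables) are hypothetical placeholders for whatever models and rules a real system uses; the thread pool merely stands in for the parallel handling of pages mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class TextItem:
    box: Box   # where the text was detected
    text: str  # what was recognized inside the box


def extract_document(pages, preprocess, detect, recognize, extract_info):
    """Run preprocessing, detection, recognition, and extraction per page."""

    def run_page(page):
        image = preprocess(page)       # unify format, rotation-correct
        boxes = detect(image)          # bounding boxes of text areas
        items = [TextItem(box, recognize(image, box)) for box in boxes]
        return extract_info(items)     # post-processing rules

    # Pages are independent of each other, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_page, pages))
```

Passing the stages in as callables keeps the skeleton independent of any particular detection or recognition model.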
The specific embodiment of the invention is as follows:
1. Upload the document files and convert them to a uniform picture format. A multi-page PDF is split into multiple pictures, and the following steps are executed on them in parallel.
2. Pass the picture to a rotation classification model, which outputs the angle of the text in the picture, and rotation-correct the picture according to that angle.
3. Pass the corrected picture to the target detection module and determine from the seal-area detection result whether the picture contains a seal. If no seal is present, go directly to step 5. If a seal is present, denote the corrected picture as A and the detected seal area as S, with coordinates (x1, y1, x2, y2), that is, S = A[y1:y2, x1:x2].
4. Remove the seal in area S by the image processing method, as follows:
a) Copy area S of the picture to obtain picture B;
b) Compute the median of the three channel values over all pixels in picture B; because text is sparse, this median can be taken as the channel values of the background color;
c) Examine the three channel values of every pixel in picture B; if they fall within the red color gamut or the blue color gamut, replace them directly with the channel values of the background color, and denote the resulting picture as B1;
d) Copy picture A to obtain a duplicate A1, and replace area S in A1 with picture B1, that is, A1[y1:y2, x1:x2] = B1;
e) Pass both the original picture A and the seal-removed picture A1 to the next step.
5. If no seal area exists, pass picture A directly to the text detection module; if a seal area exists, pass picture A1 instead. Because the seal has been removed from A1, the detection module obtains more accurate text-area coordinates and avoids the extra and erroneous boxes caused by seal interference. The output of this step is the detection coordinates of the text areas.
6. Crop text strips from the original picture A according to the text-area coordinates and send them in batches to the text recognition process. The text recognition model takes a three-channel color image as input and can therefore distinguish characters from residual seal interference. In addition, because the model receives crops of the original picture, any character pixels lost during seal removal do not affect text recognition.
7. Integrate and extract the useful text information through post-processing rules, based on the detection coordinates and recognized content of the text and on the business scenario.
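The key idea of steps 3 to 6, detecting on the seal-removed copy A1 but recognizing on crops of the original A, can be sketched as below. Pictures are modeled as nested lists so that the example stays dependency-free; `crop`, `paste`, and `recognize_with_seal_bypass` are hypothetical names, and the seal remover, text detector, and recognizer are passed in as stand-ins for the real modules.

```python
def crop(image, box):
    """Return image[y1:y2, x1:x2] for a picture stored as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def paste(image, box, patch):
    """Return a copy of image with the area given by box replaced by patch."""
    x1, y1, x2, y2 = box
    out = [list(row) for row in image]  # copy, so the original stays intact
    for dy, patch_row in enumerate(patch):
        out[y1 + dy][x1:x2] = patch_row
    return out


def recognize_with_seal_bypass(A, seal_box, remove_seal, detect_text, recognize):
    """Detect text on A1 (seal removed), then recognize on crops of A."""
    if seal_box is None:
        detect_input = A                          # step 5, no seal found
    else:
        B1 = remove_seal(crop(A, seal_box))       # steps 4a to 4c
        detect_input = paste(A, seal_box, B1)     # step 4d, this is A1
    boxes = detect_text(detect_input)             # step 5
    # Step 6: the coordinates come from A1, the pixels come from A.
    return [(box, recognize(crop(A, box))) for box in boxes]
```

Because `paste` works on a copy, picture A keeps every original pixel, which is what allows recognition to ignore any character pixels that the seal removal overwrote.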
Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.
Claims (7)
1. A method for eliminating the interference of a seal with document extraction, characterized in that, for an image P bearing a seal, the method comprises the following steps:
removing the seal from the image P by an image processing method to obtain an image Q;
performing text detection on the image Q to obtain the coordinates of the text areas;
and performing text recognition on the areas of the image P located at those coordinates.
2. The method for eliminating the interference of a seal with document extraction according to claim 1, wherein removing the seal from the image P by the image processing method to obtain the image Q comprises:
performing seal detection on the image P to obtain a seal area;
and removing the seal from the seal area of the image P by the image processing method.
3. The method for eliminating the interference of a seal with document extraction according to claim 2, wherein removing the seal from the seal area of the image P by the image processing method comprises:
computing the median of the three channel values over all pixels in the seal area and taking that median as the channel values of the background color; and examining the three channel values of every pixel in the seal area and, if they fall within the red color gamut or the blue color gamut, replacing them with the channel values of the background color.
4. The method for eliminating the interference of a seal with document extraction according to claim 1, wherein the input to the text recognition is a three-channel color image.
5. The method for eliminating the interference of a seal with document extraction according to claim 1, wherein removing the seal from the image P by the image processing method to obtain the image Q includes a preprocessing step of rotation-correcting the image P.
6. A computer storage medium, wherein a computer program is stored in the computer storage medium and, when executed, implements the method for eliminating the interference of a seal with document extraction according to any one of claims 1 to 5.
7. An apparatus for eliminating the interference of a seal with document extraction, the apparatus being used to process an image P bearing a seal and comprising:
an image processing module for removing the seal from the image P by an image processing method to obtain an image Q;
a text detection module for performing text detection on the image Q to obtain the coordinates of the text areas;
and a text recognition module for performing text recognition on the areas of the image P located at those coordinates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111197695.3A CN114067111A (en) | 2021-10-14 | 2021-10-14 | Method for eliminating interference of seal on document extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114067111A (en) | 2022-02-18 |
Family
ID=80234531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111197695.3A Pending CN114067111A (en) | 2021-10-14 | 2021-10-14 | Method for eliminating interference of seal on document extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067111A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116994269A (en) * | 2023-05-11 | 2023-11-03 | 达而观信息科技(上海)有限公司 | Seal similarity comparison method and seal similarity comparison system in image document |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109636825A (en) * | 2018-11-01 | 2019-04-16 | 平安科技(深圳)有限公司 | Seal graphics dividing method, device and computer readable storage medium |
CN110163786A (en) * | 2019-04-02 | 2019-08-23 | 阿里巴巴集团控股有限公司 | A kind of method, device and equipment removing watermark |
CN110895696A (en) * | 2019-11-05 | 2020-03-20 | 泰康保险集团股份有限公司 | Image information extraction method and device |
CN111680694A (en) * | 2020-05-28 | 2020-09-18 | 中国工商银行股份有限公司 | Method and device for filtering colored seal in character image |
CN111814716A (en) * | 2020-07-17 | 2020-10-23 | 上海眼控科技股份有限公司 | Seal removing method, computer device and readable storage medium |
CN112766275A (en) * | 2021-04-08 | 2021-05-07 | 金蝶软件(中国)有限公司 | Seal character recognition method and device, computer equipment and storage medium |
WO2021115490A1 (en) * | 2020-06-22 | 2021-06-17 | 平安科技(深圳)有限公司 | Seal character detection and recognition method, device, and medium for complex environments |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112966537B (en) | Form identification method and system based on two-dimensional code positioning | |
US9898808B1 (en) | Systems and methods for removing defects from images | |
US9754164B2 (en) | Systems and methods for classifying objects in digital images captured using mobile devices | |
TW201719505A (en) | Methods, apparatus, and tangible computer readable storage media to extract text from imaged documents | |
CN110363095A (en) | A kind of recognition methods for table font | |
CN112949471A (en) | Domestic CPU-based electronic official document identification reproduction method and system | |
US20060171587A1 (en) | Image processing apparatus, control method thereof, and program | |
JP4904175B2 (en) | Method and apparatus for creating high fidelity glyph prototypes from low resolution glyph images | |
JP3359433B2 (en) | How to improve document image quality | |
CN110765740B (en) | Full-type text replacement method, system, device and storage medium based on DOM tree | |
CN110807454B (en) | Text positioning method, device, equipment and storage medium based on image segmentation | |
CN113901952A (en) | Print form and handwritten form separated character recognition method based on deep learning | |
CN111931769A (en) | Invoice processing device, invoice processing apparatus, invoice computing device and invoice storage medium combining RPA and AI | |
CN113592735B (en) | Text page image restoration method and system, electronic device and computer readable medium | |
CN112508000B (en) | Method and equipment for generating OCR image recognition model training data | |
CN114445841A (en) | Tax return form recognition method and device | |
CN116630984A (en) | OCR character recognition method and system based on seal removal | |
CN110889311A (en) | Financial electronic facsimile document identification system and method | |
US20210174119A1 (en) | Systems and methods for digitized document image data spillage recovery | |
CN117333893A (en) | OCR-based custom template image recognition method, system and storage medium | |
CN102682457A (en) | Rearrangement method for performing adaptive screen reading on print media image | |
CN117649670A (en) | Document layout analysis model training method, application method, computer device and computer readable storage medium | |
CN114067111A (en) | Method for eliminating interference of seal on document extraction | |
US12266145B1 (en) | Machine-learning models for image processing | |
CN118196799A (en) | Circular seal character recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||