CN116758560B

CN116758560B - Document image classification method and device

Info

Publication number: CN116758560B
Application number: CN202311030954.2A
Authority: CN
Inventors: 申意萍; 陈友斌; 张志坚; 徐一波
Original assignee: Hubei Micropattern Technology Development Co ltd
Current assignee: Hubei Micropattern Technology Development Co ltd
Priority date: 2023-08-16
Filing date: 2023-08-16
Publication date: 2023-11-17
Anticipated expiration: 2043-08-16
Also published as: CN116758560A

Abstract

The invention discloses a document image classification method and device, relating to the technical field of document image classification. The method comprises the following steps: step 1, performing title detection on a document image to obtain a title region; step 2, performing text recognition on the title region to obtain title text content; step 3, performing text error correction on the title text content to correct misrecognition caused by text distortion and occlusion; and step 4, classifying the corrected title text content based on a lossless compressor and K-nearest neighbors (KNN). When a new type needs to be added, only the KNN training data point set needs to be updated. The method aims to solve the document classification problems caused by differing shooting environments, deformation of non-rigid paper materials, text partially occluded by seals or other objects, complex and variable document content, and new document types that may appear at any time, so that document classification is accurate and efficient.

Description

Document image classification method and device

Technical Field

The invention relates to the technical field of document image classification, in particular to a document image classification method and device.

Background

As digital transformation progresses across industries, the number of electronic document images continues to increase. In the financial field (such as banking, insurance, securities and taxation), a wide variety of paper materials must be digitized for long-term preservation, which produces huge electronic document image datasets. In recent years, remote financial activities such as remote account opening and online reimbursement have been continuously promoted owing to external circumstances. These activities require the paper material to be digitized, typically with the user's mobile phone or tablet. The resulting large volume of electronic material must be sorted, archived and recognized. Electronic documents contain a large amount of industry-related image and text information, and processing it manually is time-consuming and costly, so automatic classification of electronic document images is highly desirable. However, classifying these document images faces the following difficulties:

(1) Shooting environments (such as illumination, angle and background) and shooting devices differ, as do device-related properties such as resolution, exposure time and degree of distortion, so the resulting document images vary widely and are difficult to unify and standardize;

(2) The paper material is a non-rigid body, and is easy to deform, so that the characters are distorted and deformed;

(3) The text is partially blocked by the stamp or other things, for example, titles of various notes are blocked by the stamp;

(4) The document material content is complex and changeable, the document layout of the same kind is not uniform, and the intra-class difference is large. Taking medical documents as an example, examination reports with different names are not in a fixed form, some examination reports comprise images shot by a medical camera, some examination reports only have tables, some examination reports only have text descriptions, and in addition, different medical institutions generate different documents;

(5) Newly added document types will occur at any time. For pre-trained classifiers, the classifier needs to be retrained when a new document type is added.

Compared with an electronic document image acquired by a scanner, one shot with a mobile phone or tablet is often affected by resolution, illumination and shooting background, and the photographed document may be deformed or occluded. This makes text recognition challenging, so methods that classify electronic document images by text recognition alone have low accuracy.

Disclosure of Invention

In order to solve the technical problems, the invention provides a document image classification method and a document image classification device. The following technical scheme is adopted:

a document image classification method comprising the steps of:

step 1, performing title detection on a document image to obtain a title region;

step 2, performing text recognition on the title area to obtain title text content;

step 3, performing text error correction on the title text content to correct misrecognition caused by text distortion and occlusion;

step 4, classifying the corrected title text content based on a lossless compressor and K-nearest neighbors (KNN).

By adopting the above technical scheme, although document images vary widely even within a class, the title of a document provides a good text description: it summarizes the document content and is therefore useful for document image classification. First, title detection and recognition are performed on the document image to obtain the title text content. Text error correction is then applied to handle misrecognition caused by distorted or occluded characters. The corrected title text content is then classified using a lossless compressor and K-nearest neighbors. When a new type needs to be added, only the K-nearest-neighbor training data point set needs to be updated.
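The four steps can be read as a simple pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `classify_document_image` and its four callable parameters are hypothetical, and the toy stand-ins in `demo_run` only imitate the detection, recognition, correction and classification stages so that the sketch runs without any model.

```python
from typing import Any, Callable


def classify_document_image(
    image: Any,
    detect_title: Callable[[Any], Any],
    recognize_text: Callable[[Any], str],
    correct_text: Callable[[str], str],
    classify_title: Callable[[str], str],
) -> str:
    """Run the four steps in order; every stage is supplied by the caller."""
    title_region = detect_title(image)        # step 1: title detection
    raw_title = recognize_text(title_region)  # step 2: text recognition
    fixed_title = correct_text(raw_title)     # step 3: text error correction
    return classify_title(fixed_title)        # step 4: title classification


def demo_run() -> str:
    """Toy stand-ins (hypothetical) that replace the real models."""
    return classify_document_image(
        image={"title": "CT imege examination report"},
        detect_title=lambda img: img["title"],
        recognize_text=lambda region: region,
        correct_text=lambda text: text.replace("imege", "image"),
        classify_title=lambda text: (
            "examination report" if "report" in text else "invoice"
        ),
    )


print(demo_run())
```

Passing the stages as callables keeps the sketch independent of any particular detector, recognizer or corrector, which matches the optional nature of the algorithms named later in the description.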

Optionally, the method of step 4 is:

step 41, establishing a training data point set for K-nearest neighbors, compressing each training data point in the set with a lossless compressor, and obtaining the length of the compressed training data;

step 42, compressing the title text content by using a lossless compressor, and obtaining the length of the compressed title text content;

step 43, connecting the title text content with each training data to obtain a long text, compressing the long text by using a lossless compressor, and obtaining the length of the compressed long text;

step 44, determining the k nearest neighbors of the corrected title text content according to the three compressed lengths obtained in steps 41 to 43;

step 45, determining the category of the document image according to the category information of the k nearest neighbors.
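Steps 41 to 45 can be sketched as a small classifier. This is an illustrative sketch, not the patented program: the class name `CompressionKNN`, the helper `clen`, the single-space separator and the majority vote for k greater than 1 are assumptions; gzip is used because the description names it in its worked example, and the distance is the one the description gives for step 44. The three training titles are the examples quoted in this description.

```python
import gzip
from collections import Counter
from typing import List, Tuple


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


class CompressionKNN:
    def __init__(self, k: int = 1, sep: str = " ") -> None:
        self.k = k
        self.sep = sep  # separator used when concatenating (an assumption)
        self.points: List[Tuple[str, str, int]] = []  # (title, category, Lt)

    def add(self, title: str, category: str) -> None:
        # Step 41: store the training point together with its length Lt.
        self.points.append((title, category, clen(title)))

    def classify(self, title: str) -> str:
        if not self.points:
            raise ValueError("the training data point set is empty")
        ly = clen(title)  # step 42: compressed length Ly of the query title
        scored = []
        for t, category, lt in self.points:
            lty = clen(t + self.sep + title)  # step 43: compressed length Lty
            d = (lty - min(lt, ly)) / max(lt, ly)
            scored.append((d, category))
        scored.sort(key=lambda item: item[0])  # step 44: smallest distance first
        votes = Counter(c for _, c in scored[: self.k])  # step 45: vote
        return votes.most_common(1)[0][0]


knn = CompressionKNN(k=1)
knn.add("Henan province medical hospitalization bill (electronic)", "invoice")
knn.add("medical imaging DX examination report", "examination report")
knn.add("CT image examination report", "examination report")
print(knn.classify("CT image examination report"))
```

Because the training set is just a list of labeled titles, supporting a new document type in this sketch amounts to one more `add` call; nothing is retrained.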

Optionally, the specific method of step 4 is:

in step 41, let the compressed length of each training data point be Lt, t = 1, 2, …, N, where N is the number of training data points; the lossless compressor is a compressor based on Huffman coding, predictive coding or dictionary coding;

in step 42, compressing the title text content y by using a lossless compressor to obtain a compressed length Ly;

in step 43, the content of the title text is connected with each training data to obtain a long text ty, the long text ty is compressed by using a lossless compressor, and the compressed length Lty is obtained;

in step 44, the k nearest neighbors of the document image are determined based on Lt, Ly and Lty:

define the distance D(t, y) as

D(t, y) = (Lty - min(Lt, Ly)) / max(Lt, Ly),

in which case the k nearest neighbors are the k training data points with the smallest D(t, y). Alternatively, the measure may be computed as D(t, y) = (Lt + Ly - Lty) / Lty, in which case the k nearest neighbors are the k training data points with the largest D(t, y).

In step 45, the value of k is 1 or another integer.
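The two measures rank neighbors in opposite directions, which is easy to get wrong in code. The sketch below is an illustration under assumptions: the function names `distance`, `shared_ratio` and `k_nearest` are hypothetical, gzip stands in for the lossless compressor, and a single space is assumed as the separator when forming the long text.

```python
import gzip
from typing import List


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def distance(t: str, y: str) -> float:
    """(Lty - min(Lt, Ly)) / max(Lt, Ly); a smaller value means closer."""
    lt, ly, lty = clen(t), clen(y), clen(t + " " + y)
    return (lty - min(lt, ly)) / max(lt, ly)


def shared_ratio(t: str, y: str) -> float:
    """(Lt + Ly - Lty) / Lty; a larger value means closer."""
    lt, ly, lty = clen(t), clen(y), clen(t + " " + y)
    return (lt + ly - lty) / lty


def k_nearest(titles: List[str], y: str, k: int = 1,
              use_ratio: bool = False) -> List[str]:
    """Return the k training titles nearest to y under the chosen measure."""
    if use_ratio:
        # Largest values first for the alternative measure.
        ranked = sorted(titles, key=lambda t: shared_ratio(t, y), reverse=True)
    else:
        # Smallest values first for the distance.
        ranked = sorted(titles, key=lambda t: distance(t, y))
    return ranked[:k]


titles = [
    "Henan province medical hospitalization bill (electronic)",
    "medical imaging DX examination report",
    "CT image examination report",
]
query = "medical imaging DX examination report"
print(k_nearest(titles, query))
print(k_nearest(titles, query, use_ratio=True))
```

For a query that also appears in the training set, both rankings are expected to place that identical title first, since repeating a text adds very little to the compressed length.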

By adopting the technical scheme, the corrected title text content is classified by using a lossless compressor and k neighbor:

establishing a K-nearest-neighbor training data point set, and compressing each training data point with a lossless compressor to obtain its compressed length. Each training data point includes the title text content of a document and the category to which the document belongs, such as { "Henan province medical hospitalization bill (electronic)", "invoice" }, { "medical imaging DX examination report", "examination report" }, { "CT image examination report", "examination report" }. The title text content in the training data set can be obtained through title detection, title text recognition and text correction, or can be entered manually. Here, let the compressed length of each training data point be Lt, t = 1, 2, …, N, where N is the number of training data points. The lossless compressor can be based on Huffman coding, predictive coding, dictionary coding or other methods;

compressing the title text content y by using a lossless compressor to obtain the compressed length Ly;

connecting the content of the title text with each training data to obtain a long text ty, and compressing the long text by using a lossless compressor to obtain a compressed length Lty;

determining the k nearest neighbors of the document according to the three compressed lengths; for example, using the distance D(t, y) = (Lty - min(Lt, Ly)) / max(Lt, Ly);

and determining the category of the document according to the category information of the k nearest neighbors. The value of k may be 1 or another integer.

Optionally, in step 1, a target detection algorithm or a layout analysis algorithm is adopted to detect the title of the document image, so as to obtain a title region.

By adopting the technical scheme, the title of the document is detected by using a target detection algorithm (such as YOLO, SSD) or a layout analysis algorithm (such as LayoutLM), so that the title area can be obtained quickly and accurately.

Optionally, in step 2, text line detection and text line recognition are performed on the title region to obtain the title text content.

By adopting the technical scheme, text line detection (such as DBNet and FCENet) and text line recognition (such as CRNN+CTC) can be performed on the title region, so that the title text content can be obtained rapidly and accurately; an end-to-end text recognition algorithm (e.g., PGNet) may also be used to obtain the title text content.

Optionally, in step 3, a text error correction algorithm Soft-Masked-BERT is used to perform error correction processing on the text content of the title.

Optionally, in step 3, the BERT model is adjusted according to the corpus, so as to improve the accuracy of error correction.

By adopting the technical scheme, the title text content is error-corrected using a text error correction algorithm (such as Soft-Masked-BERT). The BERT model can be fine-tuned on a corpus of erroneous-correct sentence pairs, thereby improving the accuracy of error correction.

The document image classification device comprises a shooting device, a memory and a processor. The shooting device photographs a paper document to be classified, acquires the document image to be classified and transmits it to the processor; the memory is preloaded with a document image classification program; and the processor runs the document image classification program in the memory to complete the classification of the document image.

Optionally, the system further comprises a display, wherein the display is in communication connection with the processor and displays the classification result of the image to be classified under the control of the processor.

By adopting the technical scheme, the electronic classification filing of the paper documents can be realized, and the document classification problem caused by the situations that the shooting environments are different, the paper materials are non-rigid bodies, characters are partially blocked by seals or other things, the document material content is complex and changeable, the newly added document types can appear at any time, and the like is solved.

In summary, the invention has at least the following beneficial technical effects:

the invention provides a document image classification method and device. First, title detection and recognition are performed on the document image to obtain the title text content. Text error correction is then applied to handle misrecognition caused by distorted or occluded characters. The corrected title text content is then classified using a lossless compressor and K-nearest neighbors. When a new type needs to be added, only the K-nearest-neighbor training data point set needs to be updated. This solves the document classification problems caused by differing shooting environments, non-rigid paper materials, text partially occluded by seals or other objects, complex and variable document content, and new document types that may appear at any time.

Drawings

FIG. 1 is a flow chart of a document image classification method of the present invention;

FIG. 2 is a flowchart showing the specific steps of step 4 of a document image classification method according to the present invention;

FIG. 3 is a schematic diagram of the connection principle of a document image classification device according to the present invention.

Reference numerals illustrate: 1. a photographing device; 2. a memory; 3. a processor; 4. a display.

Description of the embodiments

The present invention will be described in further detail with reference to the accompanying drawings.

The embodiment of the invention discloses a document image classification method and a document image classification device.

Referring to FIGS. 1 to 3, a document image classification method includes the steps of:

Although document images vary widely even within a class, the title of a document provides a good text description: it summarizes the document content and is therefore useful for document image classification. First, title detection and recognition are performed on the document image to obtain the title text content. Text error correction is then applied to handle misrecognition caused by distorted or occluded characters. The corrected title text content is then classified using a lossless compressor and K-nearest neighbors. When a new type needs to be added, only the K-nearest-neighbor training data point set needs to be updated.

The method of step 4 is as follows:

step 41, building a K-neighbor training data point set for the corrected title text content, compressing each training data of the training data point set by adopting a lossless compressor, and obtaining the length of the compressed training data;

The specific method of the step 4 is as follows:

in step 41, let the compressed length of each training data point be Lt, t = 1, 2, …, N, where N is the number of training data points; the lossless compressor is a compressor based on Huffman coding, predictive coding or dictionary coding;

define the distance D(t, y) as

D(t, y) = (Lty - min(Lt, Ly)) / max(Lt, Ly).

In step 45, the value of k is 1 or another integer.

Classifying the corrected title text content by using a lossless compressor and k-nearest neighbor:

establishing a K-nearest-neighbor training data point set, and compressing each training data point with a lossless compressor to obtain its compressed length. Each training data point includes the title text content of a document and the category to which the document belongs, such as { "Henan province medical hospitalization bill (electronic)", "invoice" }, { "medical imaging DX examination report", "examination report" }, { "CT image examination report", "examination report" }. Here, let the compressed length of each training data point be Lt, t = 1, 2, …, N, where N is the number of training data points. The lossless compressor can be based on Huffman coding, predictive coding, dictionary coding or other methods;

The title of the document to be classified, each training data point, and the long text formed by concatenating the title with each training data point are first compressed with a lossless compressor, which reduces the amount of data by encoding identical or similar content compactly while preserving the integrity of the original data. The K-nearest-neighbor algorithm is then applied to the compressed lengths. If the title of the document to be classified is very close to, or even identical to, a training data point, the lossless compressor achieves a higher compression ratio on the concatenated long text and the compressed length is shorter; conversely, when the title differs from a training data point, the compression ratio on the concatenated long text is lower and the compressed length is longer.
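The effect described above can be observed directly by measuring how many extra compressed bytes a title costs once a training title has already been seen. The following is a small illustrative sketch, assuming gzip as the lossless compressor and a single space as the separator; the helper names `clen` and `extra_bytes` are hypothetical, and the strings are the example titles quoted in this description.

```python
import gzip


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def extra_bytes(train: str, query: str) -> int:
    """Growth of the compressed length when query is appended to train."""
    return clen(train + " " + query) - clen(train)


train = "medical imaging DX examination report"
identical = "medical imaging DX examination report"
unrelated = "Henan province medical hospitalization bill (electronic)"

# An identical title can be encoded as a back-reference to the first copy,
# so it is expected to add far fewer bytes than a mostly unrelated title.
print("identical title adds:", extra_bytes(train, identical))
print("unrelated title adds:", extra_bytes(train, unrelated))
```

The exact byte counts depend on the compressor and its settings, so only the ordering of the two values is meaningful here.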

Taking medical images as an example, a training data point set containing 13 training data points in 4 categories is established, as shown in Table 1 below.

TABLE 1

The title content detected in the document image to be classified is "Guangdong province medical charging bill". Compression is performed with the gzip algorithm, and the distance is defined as D(t, y) = (Lty - min(Lt, Ly)) / max(Lt, Ly). The three compressed lengths and the resulting distances are shown in Table 2 below. With k = 1, the nearest training data point is "medical toll ticket", at a distance of 0.26, and its category is "ticket", so the category of the document is "ticket".

Table 2 shows the compression length and distance calculation results;

TABLE 2
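The columns of such a table (Lt, Ly, Lty and D) can be computed as sketched below. This is an illustration only: the helper names `clen` and `table_rows` are hypothetical, a single space is assumed as the separator, and the training titles are the three English example titles quoted earlier in this description rather than the 13 entries of Table 1, so the printed numbers are not expected to match the distance reported above.

```python
import gzip
from typing import List, Tuple

Row = Tuple[str, str, int, int, int, float]


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def table_rows(training: List[Tuple[str, str]], y: str) -> List[Row]:
    """One row per training point: (title, category, Lt, Ly, Lty, D)."""
    ly = clen(y)
    rows: List[Row] = []
    for t, category in training:
        lt = clen(t)
        lty = clen(t + " " + y)
        d = (lty - min(lt, ly)) / max(lt, ly)
        rows.append((t, category, lt, ly, lty, round(d, 2)))
    # Sort by distance so that the first row is the nearest neighbor (k = 1).
    return sorted(rows, key=lambda row: row[5])


training = [
    ("Henan province medical hospitalization bill (electronic)", "invoice"),
    ("medical imaging DX examination report", "examination report"),
    ("CT image examination report", "examination report"),
]
rows = table_rows(training, "Guangdong province medical charging bill")
for title, category, lt, ly, lty, d in rows:
    print(f"{title!r:62} {category:20} Lt={lt:3} Ly={ly:3} Lty={lty:3} D={d}")
print("predicted category with k = 1:", rows[0][1])
```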

In step 1, a target detection algorithm or a layout analysis algorithm is adopted to detect the title of the document image, and a title area is obtained.

The title region can be obtained quickly and accurately by detecting the title of the document using a target detection algorithm (e.g., YOLO, SSD) or a layout analysis algorithm (e.g., LayoutLM).

In step 2, text line detection and text line identification are performed on the title region, and title text content is obtained.

Text line detection (such as DBNet and FCENet) and text line recognition (such as CRNN+CTC) can be performed on the title region, so that the title text content can be obtained rapidly and accurately; an end-to-end text recognition algorithm (e.g., PGNet) may also be used to obtain the title text content.

In step 3, a text error correction algorithm, Soft-Masked-BERT, is used to perform error correction on the title text content.

In step 3, the BERT model is adjusted according to the corpus, so that the accuracy of error correction is improved.

The title text content is error-corrected using a text error correction algorithm (e.g., Soft-Masked-BERT). The BERT model can be fine-tuned on a corpus of erroneous-correct sentence pairs, thereby improving the accuracy of error correction;

in a specific embodiment, due to the shielding of the seal, the text content identification result is "Henan medical hospital charging bill", and after text correction, the text content is corrected to be "Henan medical hospital charging bill".

A document image classifying device comprises a shooting device 1, a memory 2 and a processor 3, wherein the shooting device 1 shoots a paper document to be classified, acquires a document image to be classified and transmits the document image to the processor 3, the memory 2 is preloaded with a document image classifying program designed according to the method of claim 6, and the processor 3 runs the document image classifying program in the memory 2 to complete classification of the document image.

The system also comprises a display 4, wherein the display 4 is in communication connection with the processor 3 and displays the classification result of the images to be classified under the control of the processor 3.

The electronic classified filing of the paper documents can be realized, and the document classification problem caused by the situations that the shooting environments are different, the paper materials are non-rigid bodies, characters are partially blocked by seals or other things, the document material content is complex and changeable, the newly added document types can occur at any time, and the like is solved.

The implementation principle of the document image classification method and the document image classification device in the embodiment of the invention is as follows:

a batch of medical document images need to be classified, the images of a batch of documents are respectively shot by the shooting equipment 1, the images are stored in the memory 2, the document images of the batch are transmitted to the processor 3 for document image classification, and finally the classification result is transmitted to the display 4 for displaying the result.

After the title region is detected, the title text content is recognized as "Henan medical hospital charging bill", in which one character is misrecognized (the character rendered as "any" should be the character rendered as "live"); after text correction, the text content is corrected to "Henan medical hospital charging bill". Lossless compression and K-nearest-neighbor classification are then performed on the text content, and the final category is bill.

The above embodiments are not intended to limit the scope of the present invention, and therefore: all equivalent changes in structure, shape and principle of the invention should be covered in the scope of protection of the invention.

Claims

1. A document image classification method, characterized by comprising the steps of:

step 4, classifying the corrected title text content based on the lossless compressor and K neighbor;

the specific method of the step 4 is as follows:

step 42, compressing the corrected title text content by using a lossless compressor, and obtaining the length of the compressed title text content;

step 44, determining k neighbor of corrected title text content according to the three text lengths;

the three text lengths are the compressed training data length obtained in step 41, the compressed title text content length obtained in step 42 and the compressed long text length obtained in step 43;

step 45, determining the category of the document image according to the category information of the k neighbor;

in step 44, the k nearest neighbors of the document image are determined based on Lt, Ly and Lty: a distance D(t, y) is defined as D(t, y) = (Lty - min(Lt, Ly)) / max(Lt, Ly);

in step 45, the value of k nearest neighbor is 1 or other integer;

in the step 1, a target detection algorithm or a layout analysis algorithm is adopted to detect the title of a document image, so as to obtain a title region;

in the step 3, a text error correction algorithm Soft-Masked-BERT is adopted to carry out error correction processing on the text content of the title;

2. A document image classification method according to claim 1, wherein: in step 2, text line detection and text line identification are performed on the title region, and title text content is obtained.

3. A document image classification apparatus, characterized in that: the apparatus comprises a shooting device (1), a memory (2) and a processor (3), wherein the shooting device (1) photographs a paper document to be classified, acquires an image of the document to be classified, and transmits the image to the processor (3), the memory (2) is preloaded with a document image classification program designed according to the method of claim 2, and the processor (3) runs the document image classification program in the memory (2) to complete the classification of the document image.

4. A document image classification apparatus according to claim 3, wherein: the system also comprises a display (4), wherein the display (4) is in communication connection with the processor (3) and displays the classification result of the images to be classified under the control of the processor (3).