
CN117217203A - A noisy text sentence segmentation method for document image translation - Google Patents

A noisy text sentence segmentation method for document image translation Download PDF

Info

Publication number
CN117217203A
CN117217203A (Application CN202310477508.XA)
Authority
CN
China
Prior art keywords
text
loss
noisy
document image
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310477508.XA
Other languages
Chinese (zh)
Inventor
邓彪
翟飞飞
白书航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongkefan Language Technology Co ltd
Original Assignee
Beijing Zhongkefan Language Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongkefan Language Technology Co ltd filed Critical Beijing Zhongkefan Language Technology Co ltd
Priority to CN202310477508.XA priority Critical patent/CN117217203A/en
Publication of CN117217203A publication Critical patent/CN117217203A/en
Pending legal-status Critical Current


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a noisy text sentence segmentation method for document image translation, relating to the technical field of natural language processing. The method comprises: simulating and constructing, on the basis of a clean plain text data set, a noisy text data set containing various recognition noises to obtain input data; encoding the input data, using a BERT pre-training language model to encode the text to be processed and complete word embedding, thereby obtaining dynamic word vectors; and contrast learning, which comprises calculating contrast losses between the clean plain text and the various noisy texts respectively. According to the invention, the BERT pre-training language model fully extracts the semantic features of the noisy text, and contrast learning fully captures the relation between the noisy text and the clean plain text, so that the characteristics of the noisy text obtained after OCR recognition of a document image are fully exploited. Sentences can therefore be segmented more accurately, more accurate natural sentences are provided to the machine translation task, and the overall performance of document image translation is improved.

Description

Noisy text sentence segmentation method for document image translation
Technical Field
The invention relates to the technical field of natural language processing, in particular to a noisy text sentence segmentation method for document image translation.
Background
Document image translation refers to using a computer system to automatically translate the source language contained in a document image into a target language. The conventional document image translation pipeline first performs text detection and recognition on the document image to obtain plain text paragraphs, then segments those paragraphs into sentences, and finally feeds the segmented natural sentences into a machine translation system. The consistency and accuracy of sentence segmentation directly affect the performance of the subsequent machine translation. However, the text detection and recognition stage can miss or misrecognize text, in particular punctuation marks, so segmenting a noisy paragraph into sentences is considerably harder than segmenting a clean one.
The conventional sentence segmentation method relies on punctuation and regular expressions. For noisy text, however, such purely rule-based segmentation cannot produce ideal natural sentences. The existing rule-based methods for segmenting noisy text sentences in document image translation therefore cannot meet practical requirements, and an improved technique is urgently needed to solve these problems.
Disclosure of Invention
The invention aims to provide a noisy text sentence segmentation method for document image translation. The method uses a BERT pre-training model for encoding, a Bi-LSTM model for further feature extraction, and a CRF model for classification, predicting labels by sequence labeling. During training, a contrast learning method draws the representations of clean plain text and noisy text closer together, fully exploiting the characteristics of noisy text and solving the problems identified in the background art.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a noisy text sentence segmentation method for document image translation, which comprises the following steps:
step one: simulating and constructing a noisy text data set containing various recognition noises on the basis of a clean pure text data set to obtain input data, and preprocessing the input data;
step two: encoding the input data, encoding a text to be processed by using a BERT pre-training language model to complete word embedding, obtaining a dynamic word vector, and further extracting features of the dynamic word vector by using a Bi-LSTM model to obtain a text vector;
step three: contrast learning, wherein the contrast learning comprises respectively calculating contrast loss of the clean plain text and various noisy texts;
step four: calculating a classification model loss, wherein the classification model loss comprises the steps of inputting a text vector into a CRF model to obtain a prediction category of each word, and calculating a loss between a classification result and a standard answer;
step five: loss fusion, which comprises weighting and summing the contrast learning loss and the classification model loss to obtain the final training loss;
step six: back-propagating gradients from the final training loss, and updating the model parameters.
Further, the processing of the input data specifically includes:
And taking the clean plain text paragraph data set, setting the sample length of each model input to N characters, and using a sliding window method to slide a window across the whole plain text data set with a stride of M characters per step, thereby obtaining a number of clean plain text original training samples.
Further, the processing of the clean plain text original training sample specifically includes:
And simulating and adding various noises to each clean plain text original training sample: deleting punctuation marks and other text in a certain proportion simulates and constructs groups of missed recognition noise samples; replacing them in a certain proportion simulates and constructs groups of false recognition noise samples; and applying both the deletion and the replacement operations simulates and constructs groups of mixed noise samples containing both missed recognition noise and false recognition noise.
Further, one clean plain text sample, one missed recognition noise sample, one false recognition noise sample, and one mixed noise sample are used together as a set of training samples.
Further, the training sample is processed, which specifically includes:
And adding a corresponding triple tag to each character of the training sample according to the position of the end of the natural sentence, wherein the tag triple is [B], [I], [E], respectively representing the start position, the middle position, and the end position of a natural sentence.
Further, extracting features from the dynamic word vector specifically includes:
and serializing a group of training samples and then sending the training samples into a BERT model to obtain dynamic word vectors.
Further, the dynamic word vector is input into a Bi-LSTM model, and text features are further extracted.
Further, the contrast learning includes a set of contrast losses, specifically including the following losses:
calculating contrast loss for the clean plain text sample and the missed recognition noise sample;
calculating a contrast loss for the clean plain text sample and the false recognition noise sample;
a contrast loss is calculated for the clean plain text sample and the mixed noise sample.
Further, the contrast learning further includes:
And averaging the group of contrast losses to obtain the final contrast learning loss.
Further, the loss fusion specifically includes:
And averaging the contrast learning loss calculated in step three and the classification model loss calculated in step four to obtain the final training loss.
The invention has the following beneficial effects:
According to the invention, a BERT pre-training model is used to obtain high-quality word vectors, a Bi-LSTM model performs further feature extraction, and a CRF model performs classification. A contrast learning method trains on the various kinds of data with added simulated noise, so that the training data is fully utilized and the characteristics of noisy text are mined. This improves the accuracy of sentence segmentation, provides more accurate natural sentences for the machine translation task, and improves the overall performance of document image translation.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a model architecture of a noisy-text-sentence segmentation method of the present invention;
FIG. 2 is a training architecture diagram of the noisy text sentence segmentation method of the present invention;
fig. 3 is a flowchart illustrating a noisy text sentence segmentation method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 1-3, the present embodiment is a noisy text sentence segmentation method for document image translation, comprising the following steps:
step one: simulating and constructing a noisy text data set containing various recognition noises on the basis of a clean pure text data set to obtain input data, and preprocessing the input data;
step two: encoding input data, encoding a text to be processed by using a BERT pre-training language model to complete word embedding, obtaining a dynamic word vector, and further extracting features from the dynamic word vector by using a Bi-LSTM model to obtain a text vector;
step three: contrast learning, which comprises respectively calculating contrast loss of clean plain text and various noisy texts;
step four: calculating a classification model loss, wherein the classification model loss comprises the steps of inputting a text vector into a CRF model to obtain a prediction category of each word, and calculating a loss between a classification result and a standard answer;
step five: loss fusion, which comprises weighting and summing the contrast learning loss and the classification model loss to obtain the final training loss;
step six: back-propagating gradients from the final training loss, and updating the model parameters.
The specific workflow of the noisy text sentence segmentation method for document image translation comprises the following steps:
as shown in fig. 1-3, the processing of the input data specifically includes:
And taking a clean plain text paragraph data set, setting the sample length of each model input to N characters, and using a sliding window method to slide a window across the whole plain text data set with a stride of M characters per step, thereby obtaining a number of clean plain text original training samples.
It should be noted that, for example, suppose the clean plain text is "I gently wave my hand, bidding farewell to the clouds in the western sky." If the maximum text length of the model input is limited to 10 characters and the window slides 5 characters each time, the paragraph is segmented into overlapping samples: the first window covers characters 1-10, the second covers characters 6-15, and so on. In the specific experiments, the window is slid until all the text in the paragraph is covered, so that no segmentation boundary is affected by incomplete input.
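The sliding-window preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation; the window length `n` and stride `m` stand in for the patent's N and M and are hypothetical values here.

```python
def sliding_windows(text, n=10, m=5):
    """Split a paragraph into overlapping windows of n characters,
    sliding m characters per step until all text is covered."""
    windows = []
    start = 0
    while True:
        windows.append(text[start:start + n])
        if start + n >= len(text):
            break  # the last window reached the end of the paragraph
        start += m
    return windows
```

For a 15-character paragraph with n=10 and m=5, this yields two windows that overlap by 5 characters, so every sentence boundary falls wholly inside at least one window.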
1-2, the processing of the clean plain text original training samples specifically comprises:
Multiple kinds of noise are simulated and added to each clean plain text original training sample: deleting punctuation marks and other text in a certain proportion simulates and constructs groups of missed recognition noise samples; replacing them in a certain proportion simulates and constructs groups of false recognition noise samples; and applying both the deletion operation and the replacement operation simulates and constructs groups of mixed noise samples containing both missed recognition noise and false recognition noise.
It should be noted that, for example, take the sample "I gently wave my hand, bidding farewell to the clouds in the western sky." Replacing the comma with "[MASK]" yields the new sample "I gently wave my hand [MASK] bidding farewell to the clouds in the western sky.", simulating the case where a punctuation mark is misrecognized. Deleting the comma yields the new sample "I gently wave my hand bidding farewell to the clouds in the western sky.", simulating the case where a punctuation mark is missed. Applying both operations, for example deleting one punctuation mark and replacing another with "[MASK]", yields a sample that simulates text containing both false recognition and missed recognition noise.
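A minimal sketch of the three noise-simulation operations follows. The punctuation set, the corruption ratio, and the random policy for the mixed case are illustrative assumptions; the patent only states that punctuation and other text are deleted or replaced "in a certain proportion".

```python
import random

PUNCT = set("，。、；：？！,.;:?!")  # assumed punctuation inventory

def add_noise(sample, mode, ratio=1.0, rng=None):
    """Simulate OCR noise on a clean sample.
    mode='miss': delete punctuation  (missed recognition)
    mode='err' : replace punctuation with [MASK]  (false recognition)
    mode='mix' : randomly apply either corruption  (mixed noise)
    """
    rng = rng or random.Random(0)
    out = []
    for ch in sample:
        if ch in PUNCT and rng.random() < ratio:
            if mode == "miss":
                continue              # drop the mark entirely
            if mode == "err":
                out.append("[MASK]")  # wrong recognition result
                continue
            if mode == "mix":
                if rng.random() < 0.5:
                    continue          # missed
                out.append("[MASK]")  # misrecognized
                continue
        out.append(ch)
    return "".join(out)
```

One clean sample thus expands into a group of four training samples: the original plus the three noisy variants.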
Wherein a clean plain text sample, a missed recognition noise sample, a false recognition noise sample, and a mixed noise sample are used as a set of training samples, as shown in fig. 1-2.
As shown in fig. 1-2, the training samples are processed, which specifically includes:
And adding a corresponding triple tag to each character of the training sample according to the position of the end of the natural sentence, wherein the tag triple is [B], [I], [E], respectively representing the start position, the middle position, and the end position of a natural sentence.
It should be noted that, for example, the tags corresponding to the sample "I gently wave my hand, bidding farewell to the clouds in the western sky." are [B][I][I]…[I][E]. In addition, because the punctuation mark positions in a document picture contain few pixels, missed recognition easily occurs there; therefore the character at the sentence-end position, i.e. the character immediately before the sentence-final punctuation mark, is also tagged [E]. This reduces the effect of sentences being stuck together and failing to be separated correctly when end-of-sentence punctuation is missed.
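The tagging rule above, including the extra [E] on the character just before the sentence-final punctuation, can be sketched as follows. The set of sentence-ending marks is an assumption for illustration.

```python
SENT_END = set("。？！.?!")  # assumed sentence-final punctuation

def bie_tags(sample):
    """Assign a B/I/E tag to each character.  The sentence-final
    punctuation is tagged E, and the character immediately before it
    is promoted from I to E so that segmentation still works when
    the punctuation mark itself is missed by OCR."""
    tags = []
    new_sentence = True
    for ch in sample:
        if ch in SENT_END:
            tags.append("E")
            if len(tags) >= 2 and tags[-2] == "I":
                tags[-2] = "E"  # also tag the preceding character
            new_sentence = True
        elif new_sentence:
            tags.append("B")
            new_sentence = False
        else:
            tags.append("I")
    return tags
```

Deleting the final punctuation from a tagged sample then still leaves an [E] on the true sentence end, which is exactly the robustness the doubled label buys.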
As shown in fig. 1-2, extracting features for the dynamic word vector specifically includes:
And serializing a group of training samples, then sending them into the BERT model to obtain the dynamic word vectors.
Wherein as shown in fig. 1-2, the dynamic word vector is input into the Bi-LSTM model to further extract text features.
It should be noted that, given a sentence input s = (ω_1, ω_2, …, ω_{n−1}, ω_n), text features are extracted through the BERT pre-training language model and further extracted through the Bi-LSTM model, yielding the output vector X = (x_1, x_2, …, x_{n−1}, x_n).
1-2, the contrasted study includes a set of contrast losses, including in particular the following losses:
calculating contrast loss for clean plain text samples and missed recognition noise samples;
calculating contrast loss for clean plain text samples and false recognition noise samples;
contrast loss is calculated for clean plain text samples and mixed noise samples.
Wherein as shown in fig. 1-2, the contrast learning further comprises:
And averaging the group of contrast losses to obtain the final contrast learning loss.
It should be noted that the contrast learning loss L_contra is composed of three parts: the loss between the false recognition noise samples and the original samples, the loss between the missed recognition noise samples and the original samples, and the loss between the mixed (false plus missed recognition) noise samples and the original samples. The three losses have the same specific expression, the N-pair loss L_{N-pair}.
Here x is the anchor, i.e. the noisy text containing missed, false, or mixed recognition noise; x⁺ is the original clean plain text sample, serving as the positive; and n other unrelated samples selected from the batch are taken as negatives. The final contrast learning loss L_contra is obtained by averaging the three N-pair losses.
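Under the assumption that the similarity function is cosine similarity without a temperature term (the patent's exact formula was lost in extraction and is not specified here), the N-pair loss and its three-way average can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def n_pair_loss(anchor, positive, negatives):
    """L_{N-pair} = -log( e^{sim(x,x+)} /
                          (e^{sim(x,x+)} + sum_i e^{sim(x,x_i^-)}) )"""
    pos = math.exp(cosine(anchor, positive))
    neg = sum(math.exp(cosine(anchor, n)) for n in negatives)
    return -math.log(pos / (pos + neg))

def contrast_loss(err_anchor, miss_anchor, mix_anchor, clean, negatives):
    """Average the three losses: clean sample vs. each noise type."""
    losses = [n_pair_loss(a, clean, negatives)
              for a in (err_anchor, miss_anchor, mix_anchor)]
    return sum(losses) / len(losses)
```

The loss is minimized when each noisy anchor is more similar to its clean counterpart than to any unrelated sample in the batch, which is exactly the "drawing closer" effect the method relies on.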
As shown in fig. 1-2, the loss fusion specifically includes:
And averaging the contrast learning loss calculated in step three and the classification model loss calculated in step four to obtain the final training loss.
It should be noted that;
Classification model loss: for an input sequence X = (x_1, x_2, …, x_{n−1}, x_n), the corresponding predicted tag sequence is Y = (y_1, y_2, …, y_{n−1}, y_n). The total score of a predicted tag sequence is s(X, Y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i}, where A represents the transition score between tags and P_{i, y_i} represents the score of the i-th word for the tag y_i. Normalizing over all possible sequences gives the probability of the predicted sequence, P(Y | X) = exp(s(X, Y)) / Σ_{Y′} exp(s(X, Y′)). The final classification loss is L = −Σ_{s∈S} log P(Y_s | X_s), where S represents the set of all sentences in the training data, X_s represents the CRF input sequence corresponding to sentence s, and Y_s represents the tag sequence corresponding to sentence s;
The model is trained with the back propagation algorithm. The loss function consists of two parts: one part is the forward propagation loss L_origin obtained after the text is fed through the Bi-LSTM + CRF model, and the other part is the contrast learning loss L_contra. The final model loss is L_total = L_origin + L_contra.
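The CRF scoring and normalization above, plus the loss fusion L_total = L_origin + L_contra, can be sketched as follows. Emission scores P and transition scores A are assumed given; start/end boundary transitions are omitted for simplicity, and the partition function is computed by brute-force enumeration, which only works for tiny examples (a real implementation would use the forward algorithm).

```python
import math
from itertools import product

def crf_score(P, A, y):
    """s(X, Y) = sum_i P[i][y_i] + sum_i A[y_i][y_{i+1}].
    P[i][t]: emission score of position i for tag t.
    A[t][u]: transition score from tag t to tag u."""
    emit = sum(P[i][t] for i, t in enumerate(y))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def crf_neg_log_likelihood(P, A, y):
    """-log P(Y|X), normalizing over every possible tag sequence."""
    n_tags = len(A)
    log_z = math.log(sum(math.exp(crf_score(P, A, seq))
                         for seq in product(range(n_tags), repeat=len(P))))
    return log_z - crf_score(P, A, y)

def total_loss(l_origin, l_contra):
    """L_total = L_origin + L_contra (the loss fusion of step five)."""
    return l_origin + l_contra
```

For a one-position sequence with two tags and emission scores (0, 1), choosing tag 1 gives the loss log(1 + e) − 1, matching the normalized-probability formula above.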
In the description of the present specification, reference to the terms "one embodiment," some embodiments, "" examples, "" particular examples, "" some examples, "or" what is desired to be described, etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Finally, it should be noted that: the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited thereto, but it is to be understood that modifications and equivalents of some of the technical features described in the foregoing embodiments may be made by those skilled in the art, although the present invention has been described in detail with reference to the foregoing embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A noisy text sentence segmentation method for document image translation, characterized by comprising the following steps:
step one: simulating and constructing a noisy text data set containing various recognition noises on the basis of a clean pure text data set to obtain input data, and preprocessing the input data;
step two: encoding the input data, encoding a text to be processed by using a BERT pre-training language model to complete word embedding, obtaining a dynamic word vector, and further extracting features of the dynamic word vector by using a Bi-LSTM model to obtain a text vector;
step three: contrast learning, wherein the contrast learning comprises respectively calculating contrast loss of the clean plain text and various noisy texts;
step four: calculating a classification model loss, wherein the classification model loss comprises the steps of inputting a text vector into a CRF model to obtain a prediction category of each word, and calculating a loss between a classification result and a standard answer;
step five: loss fusion, wherein the loss fusion comprises weighting and summing the contrast learning loss and the classification model loss to obtain the final training loss;
step six: and carrying out gradient feedback on the final training loss, and updating model parameters.
2. The method for segmenting a noisy text sentence for document image translation according to claim 1, wherein the processing of the input data specifically comprises:
and taking the clean plain text paragraph data set, setting the sample length of each input model as N characters, and adopting a sliding window processing method to slide the whole plain text data set in a window mode, wherein the number of words sliding in each window is M, so as to obtain a plurality of clean plain text original training samples.
3. The method for segmenting the noisy text sentence oriented to the translation of the document image according to claim 2, wherein the processing of the clean plain text original training sample specifically comprises:
And simulating and adding various noises to each clean plain text original training sample: deleting punctuation marks and other text in a certain proportion simulates and constructs groups of missed recognition noise samples; replacing them in a certain proportion simulates and constructs groups of false recognition noise samples; and applying both the deletion and the replacement operations simulates and constructs groups of mixed noise samples containing both missed recognition noise and false recognition noise.
4. The noisy text sentence segmentation method for document image translation according to claim 3, wherein one clean plain text sample, one missed recognition noise sample, one false recognition noise sample, and one mixed noise sample are used together as a set of training samples.
5. The method for segmentation of noisy text sentences for document image translation according to claim 4, wherein said training samples are processed comprising:
And adding a corresponding triple tag to each character of the training sample according to the position of the end of the natural sentence, wherein the tag triple is [B], [I], [E], respectively representing the start position, the middle position, and the end position of a natural sentence.
6. The method for segmenting a noisy text sentence for document image translation according to claim 5, wherein extracting features from the dynamic word vector specifically comprises:
and serializing a group of training samples and then sending the training samples into a BERT model to obtain dynamic word vectors.
7. The method for segmenting a noisy text sentence for document image translation according to claim 6, wherein after obtaining a dynamic word vector, the dynamic word vector is input into a Bi-LSTM model to further extract text features.
8. The method for segmentation of noisy text sentences for document image translation according to claim 2, wherein said contrast learning comprises a set of contrast losses, in particular the following:
calculating contrast loss for the clean plain text sample and the missed recognition noise sample;
calculating a contrast loss for the clean plain text sample and the false recognition noise sample;
a contrast loss is calculated for the clean plain text sample and the mixed noise sample.
9. The method for segmentation of noisy text sentences for document image translation of claim 8 wherein said contrast learning further comprises:
And averaging the group of contrast losses to obtain the final contrast learning loss.
10. The method for segmentation of noisy text sentences for document image translation according to claim 9, wherein said loss fusion comprises:
and (3) averaging the contrast learning loss calculated in the step (III) and the classification model loss calculated in the step (IV) to obtain the final training loss.
CN202310477508.XA 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation Pending CN117217203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310477508.XA CN117217203A (en) 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310477508.XA CN117217203A (en) 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation

Publications (1)

Publication Number Publication Date
CN117217203A true CN117217203A (en) 2023-12-12

Family

ID=89043200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310477508.XA Pending CN117217203A (en) 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation

Country Status (1)

Country Link
CN (1) CN117217203A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120849350A (en) * 2025-09-22 2025-10-28 浪潮通用软件有限公司 A RAG-oriented document parsing method, system and computer device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination