
CN117217203A - A noisy text sentence segmentation method for document image translation - Google Patents

A noisy text sentence segmentation method for document image translation Download PDF

Info

Publication number
CN117217203A
CN117217203A (Application CN202310477508.XA)
Authority
CN
China
Prior art keywords
text
loss
noisy
document image
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310477508.XA
Other languages
Chinese (zh)
Inventor
邓彪
翟飞飞
白书航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongkefan Language Technology Co ltd
Original Assignee
Beijing Zhongkefan Language Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongkefan Language Technology Co ltd filed Critical Beijing Zhongkefan Language Technology Co ltd
Priority to CN202310477508.XA priority Critical patent/CN117217203A/en
Publication of CN117217203A publication Critical patent/CN117217203A/en
Pending legal-status Critical Current


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a noisy text sentence segmentation method for document image translation, relating to the technical field of natural language processing. The method comprises: simulating and constructing, on the basis of a clean plain text data set, a noisy text data set containing various recognition noises to obtain input data; encoding the input data, using a BERT pre-training language model to encode the text to be processed and complete word embedding, thereby obtaining dynamic word vectors; and contrast learning, which comprises calculating contrast losses between the clean plain text and the various noisy texts respectively. According to the invention, the BERT pre-training language model fully extracts the semantic features of the noisy text, and contrast learning fully captures the relation between the noisy text and the clean plain text, so that the characteristics of the noisy text obtained after OCR recognition of a document image are fully exploited. Sentences can therefore be segmented more accurately, more accurate natural sentences are provided to the machine translation task, and the overall performance of document image translation is improved.

Description

Noisy text sentence segmentation method for document image translation
Technical Field
The invention relates to the technical field of natural language processing, in particular to a noisy text sentence segmentation method for document image translation.
Background
Document image translation refers to using a computer system to automatically translate the source language contained in a document image into a target language. The conventional document image translation pipeline first performs text detection and recognition on the document image to obtain plain text paragraphs, then segments those paragraphs into sentences, and finally feeds the segmented natural sentences into a machine translation system. The consistency and accuracy of sentence segmentation directly affect the performance of the subsequent machine translation. However, the text detection and recognition stage can miss or misrecognize text, in particular punctuation marks, so segmenting a noisy paragraph into sentences is considerably harder than segmenting a clean one.
The conventional sentence segmentation method relies on punctuation and regular expressions. For noisy text, however, such purely rule-based segmentation cannot produce ideal natural sentences. The existing rule-based methods for segmenting noisy text sentences in document image translation therefore cannot meet practical requirements, and an improved technique is urgently needed to solve these problems.
Disclosure of Invention
The invention aims to provide a noisy text sentence segmentation method for document image translation. The method uses a BERT pre-training model for encoding, a Bi-LSTM model for further feature extraction, and a CRF model for classification, predicting labels by sequence labeling. During training, a contrast learning method draws the representations of clean plain text and noisy text closer together, fully exploiting the characteristics of noisy text and solving the problems identified in the background art.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a noisy text sentence segmentation method for document image translation, which comprises the following steps:
step one: simulating and constructing a noisy text data set containing various recognition noises on the basis of a clean pure text data set to obtain input data, and preprocessing the input data;
step two: encoding the input data, encoding a text to be processed by using a BERT pre-training language model to complete word embedding, obtaining a dynamic word vector, and further extracting features of the dynamic word vector by using a Bi-LSTM model to obtain a text vector;
step three: contrast learning, wherein the contrast learning comprises respectively calculating contrast loss of the clean plain text and various noisy texts;
step four: calculating a classification model loss, wherein the classification model loss comprises the steps of inputting a text vector into a CRF model to obtain a prediction category of each word, and calculating a loss between a classification result and a standard answer;
step five: loss fusion, which comprises weighting and summing the contrast learning loss and the classification model loss to obtain the final training loss;
step six: back-propagating gradients from the final training loss, and updating the model parameters.
Further, the processing of the input data specifically includes:
And taking the clean plain text paragraph data set, setting the sample length of each model input to N characters, and using a sliding window method to slide a window across the whole plain text data set with a stride of M characters per step, thereby obtaining a number of clean plain text original training samples.
Further, the processing of the clean plain text original training sample specifically includes:
And simulating and adding various noises to each clean plain text original training sample: deleting punctuation marks and other text in a certain proportion simulates and constructs groups of missed recognition noise samples; replacing them in a certain proportion simulates and constructs groups of false recognition noise samples; and applying both the deletion and the replacement operations simulates and constructs groups of mixed noise samples containing both missed recognition noise and false recognition noise.
Further, one clean plain text sample, one missed recognition noise sample, one false recognition noise sample, and one mixed noise sample are used together as a set of training samples.
Further, the training sample is processed, which specifically includes:
And adding a corresponding triple tag to each character of the training sample according to the position of the end of the natural sentence, wherein the tag triple is [B], [I], [E], respectively representing the start position, the middle position, and the end position of a natural sentence.
Further, extracting features from the dynamic word vector specifically includes:
and serializing a group of training samples and then sending the training samples into a BERT model to obtain dynamic word vectors.
Further, the dynamic word vector is input into a Bi-LSTM model, and text features are further extracted.
Further, the contrast learning includes a set of contrast losses, specifically including the following losses:
calculating contrast loss for the clean plain text sample and the missed recognition noise sample;
calculating a contrast loss for the clean plain text sample and the false recognition noise sample;
a contrast loss is calculated for the clean plain text sample and the mixed noise sample.
Further, the contrast learning further includes:
And averaging the group of contrast losses to obtain the final contrast learning loss.
Further, the loss fusion specifically includes:
And averaging the contrast learning loss calculated in step three and the classification model loss calculated in step four to obtain the final training loss.
The invention has the following beneficial effects:
According to the invention, a BERT pre-training model is used to obtain high-quality word vectors, a Bi-LSTM model performs further feature extraction, and a CRF model performs classification. A contrast learning method trains on the various kinds of data with added simulated noise, so that the training data is fully utilized and the characteristics of noisy text are mined. This improves the accuracy of sentence segmentation, provides more accurate natural sentences for the machine translation task, and improves the overall performance of document image translation.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a model architecture of a noisy-text-sentence segmentation method of the present invention;
FIG. 2 is a training architecture diagram of the noisy text sentence segmentation method of the present invention;
fig. 3 is a flowchart illustrating a noisy text sentence segmentation method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 1-3, the present embodiment is a noisy text sentence segmentation method for document image translation, comprising the following steps:
step one: simulating and constructing a noisy text data set containing various recognition noises on the basis of a clean pure text data set to obtain input data, and preprocessing the input data;
step two: encoding input data, encoding a text to be processed by using a BERT pre-training language model to complete word embedding, obtaining a dynamic word vector, and further extracting features from the dynamic word vector by using a Bi-LSTM model to obtain a text vector;
step three: contrast learning, which comprises respectively calculating contrast loss of clean plain text and various noisy texts;
step four: calculating a classification model loss, wherein the classification model loss comprises the steps of inputting a text vector into a CRF model to obtain a prediction category of each word, and calculating a loss between a classification result and a standard answer;
step five: loss fusion, which comprises weighting and summing the contrast learning loss and the classification model loss to obtain the final training loss;
step six: back-propagating gradients from the final training loss, and updating the model parameters.
The specific workflow of the noisy text sentence segmentation method for document image translation comprises the following steps:
as shown in fig. 1-3, the processing of the input data specifically includes:
And taking a clean plain text paragraph data set, setting the sample length of each model input to N characters, and using a sliding window method to slide a window across the whole plain text data set with a stride of M characters per step, thereby obtaining a number of clean plain text original training samples.
It should be noted that, for example, suppose the clean plain text is "I gently wave my hand, bidding farewell to the clouds in the western sky." If the maximum text length of the model input is limited to 10 characters and the window slides 5 characters each time, the paragraph is segmented into overlapping samples: the first window covers characters 1-10, the second covers characters 6-15, and so on. In the specific experiments, the window is slid until all the text in the paragraph is covered, so that no segmentation boundary is affected by incomplete input.
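The sliding-window preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation; the window length `n` and stride `m` stand in for the patent's N and M and are hypothetical values here.

```python
def sliding_windows(text, n=10, m=5):
    """Split a paragraph into overlapping windows of n characters,
    sliding m characters per step until all text is covered."""
    windows = []
    start = 0
    while True:
        windows.append(text[start:start + n])
        if start + n >= len(text):
            break  # the last window reached the end of the paragraph
        start += m
    return windows
```

For a 15-character paragraph with n=10 and m=5, this yields two windows that overlap by 5 characters, so every sentence boundary falls wholly inside at least one window.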
1-2, the processing of the clean plain text original training samples specifically comprises:
Multiple kinds of noise are simulated and added to each clean plain text original training sample: deleting punctuation marks and other text in a certain proportion simulates and constructs groups of missed recognition noise samples; replacing them in a certain proportion simulates and constructs groups of false recognition noise samples; and applying both the deletion operation and the replacement operation simulates and constructs groups of mixed noise samples containing both missed recognition noise and false recognition noise.
It should be noted that, for example, take the sample "I gently wave my hand, bidding farewell to the clouds in the western sky." Replacing the comma with "[MASK]" yields the new sample "I gently wave my hand [MASK] bidding farewell to the clouds in the western sky.", simulating the case where a punctuation mark is misrecognized. Deleting the comma yields the new sample "I gently wave my hand bidding farewell to the clouds in the western sky.", simulating the case where a punctuation mark is missed. Applying both operations, for example deleting one punctuation mark and replacing another with "[MASK]", yields a sample that simulates text containing both false recognition and missed recognition noise.
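A minimal sketch of the three noise-simulation operations follows. The punctuation set, the corruption ratio, and the random policy for the mixed case are illustrative assumptions; the patent only states that punctuation and other text are deleted or replaced "in a certain proportion".

```python
import random

PUNCT = set("，。、；：？！,.;:?!")  # assumed punctuation inventory

def add_noise(sample, mode, ratio=1.0, rng=None):
    """Simulate OCR noise on a clean sample.
    mode='miss': delete punctuation  (missed recognition)
    mode='err' : replace punctuation with [MASK]  (false recognition)
    mode='mix' : randomly apply either corruption  (mixed noise)
    """
    rng = rng or random.Random(0)
    out = []
    for ch in sample:
        if ch in PUNCT and rng.random() < ratio:
            if mode == "miss":
                continue              # drop the mark entirely
            if mode == "err":
                out.append("[MASK]")  # wrong recognition result
                continue
            if mode == "mix":
                if rng.random() < 0.5:
                    continue          # missed
                out.append("[MASK]")  # misrecognized
                continue
        out.append(ch)
    return "".join(out)
```

One clean sample thus expands into a group of four training samples: the original plus the three noisy variants.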
Wherein a clean plain text sample, a missed recognition noise sample, a false recognition noise sample, and a mixed noise sample are used as a set of training samples, as shown in fig. 1-2.
As shown in fig. 1-2, the training samples are processed, which specifically includes:
And adding a corresponding triple tag to each character of the training sample according to the position of the end of the natural sentence, wherein the tag triple is [B], [I], [E], respectively representing the start position, the middle position, and the end position of a natural sentence.
It should be noted that, for example, the tags corresponding to the sample "I gently wave my hand, bidding farewell to the clouds in the western sky." are [B][I][I]…[I][E]. In addition, because the punctuation mark positions in a document picture contain few pixels, missed recognition easily occurs there; therefore the character at the sentence-end position, i.e. the character immediately before the sentence-final punctuation mark, is also tagged [E]. This reduces the effect of sentences being stuck together and failing to be separated correctly when end-of-sentence punctuation is missed.
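The tagging rule above, including the extra [E] on the character just before the sentence-final punctuation, can be sketched as follows. The set of sentence-ending marks is an assumption for illustration.

```python
SENT_END = set("。？！.?!")  # assumed sentence-final punctuation

def bie_tags(sample):
    """Assign a B/I/E tag to each character.  The sentence-final
    punctuation is tagged E, and the character immediately before it
    is promoted from I to E so that segmentation still works when
    the punctuation mark itself is missed by OCR."""
    tags = []
    new_sentence = True
    for ch in sample:
        if ch in SENT_END:
            tags.append("E")
            if len(tags) >= 2 and tags[-2] == "I":
                tags[-2] = "E"  # also tag the preceding character
            new_sentence = True
        elif new_sentence:
            tags.append("B")
            new_sentence = False
        else:
            tags.append("I")
    return tags
```

Deleting the final punctuation from a tagged sample then still leaves an [E] on the true sentence end, which is exactly the robustness the doubled label buys.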
As shown in fig. 1-2, extracting features for the dynamic word vector specifically includes:
And serializing a group of training samples, then sending them into the BERT model to obtain the dynamic word vectors.
Wherein as shown in fig. 1-2, the dynamic word vector is input into the Bi-LSTM model to further extract text features.
It should be noted that, given a sentence input s = (ω_1, ω_2, …, ω_{n−1}, ω_n), text features are extracted through the BERT pre-training language model and further extracted through the Bi-LSTM model, yielding the output vector X = (x_1, x_2, …, x_{n−1}, x_n).
1-2, the contrasted study includes a set of contrast losses, including in particular the following losses:
calculating contrast loss for clean plain text samples and missed recognition noise samples;
calculating contrast loss for clean plain text samples and false recognition noise samples;
contrast loss is calculated for clean plain text samples and mixed noise samples.
Wherein as shown in fig. 1-2, the contrast learning further comprises:
And averaging the group of contrast losses to obtain the final contrast learning loss.
It should be noted that the contrast learning loss L_contra is composed of three parts: the loss between the false recognition noise samples and the original samples, the loss between the missed recognition noise samples and the original samples, and the loss between the mixed (false plus missed recognition) noise samples and the original samples. The three losses have the same specific expression, the N-pair loss L_{N-pair}.
Here x is the anchor, i.e. the noisy text containing missed, false, or mixed recognition noise; x⁺ is the original clean plain text sample, serving as the positive; and n other unrelated samples selected from the batch are taken as negatives. The final contrast learning loss L_contra is obtained by averaging the three N-pair losses.
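Under the assumption that the similarity function is cosine similarity without a temperature term (the patent's exact formula was lost in extraction and is not specified here), the N-pair loss and its three-way average can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def n_pair_loss(anchor, positive, negatives):
    """L_{N-pair} = -log( e^{sim(x,x+)} /
                          (e^{sim(x,x+)} + sum_i e^{sim(x,x_i^-)}) )"""
    pos = math.exp(cosine(anchor, positive))
    neg = sum(math.exp(cosine(anchor, n)) for n in negatives)
    return -math.log(pos / (pos + neg))

def contrast_loss(err_anchor, miss_anchor, mix_anchor, clean, negatives):
    """Average the three losses: clean sample vs. each noise type."""
    losses = [n_pair_loss(a, clean, negatives)
              for a in (err_anchor, miss_anchor, mix_anchor)]
    return sum(losses) / len(losses)
```

The loss is minimized when each noisy anchor is more similar to its clean counterpart than to any unrelated sample in the batch, which is exactly the "drawing closer" effect the method relies on.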
As shown in fig. 1-2, the loss fusion specifically includes:
And averaging the contrast learning loss calculated in step three and the classification model loss calculated in step four to obtain the final training loss.
It should be noted that;
Classification model loss: for an input sequence X = (x_1, x_2, …, x_{n−1}, x_n), the corresponding predicted tag sequence is Y = (y_1, y_2, …, y_{n−1}, y_n). The total score of a predicted tag sequence is s(X, Y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i}, where A represents the transition score between tags and P_{i, y_i} represents the score of the i-th word for the tag y_i. Normalizing over all possible sequences gives the probability of the predicted sequence, P(Y | X) = exp(s(X, Y)) / Σ_{Y′} exp(s(X, Y′)). The final classification loss is L = −Σ_{s∈S} log P(Y_s | X_s), where S represents the set of all sentences in the training data, X_s represents the CRF input sequence corresponding to sentence s, and Y_s represents the tag sequence corresponding to sentence s;
The model is trained with the back propagation algorithm. The loss function consists of two parts: one part is the forward propagation loss L_origin obtained after the text is fed through the Bi-LSTM + CRF model, and the other part is the contrast learning loss L_contra. The final model loss is L_total = L_origin + L_contra.
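The CRF scoring and normalization above, plus the loss fusion L_total = L_origin + L_contra, can be sketched as follows. Emission scores P and transition scores A are assumed given; start/end boundary transitions are omitted for simplicity, and the partition function is computed by brute-force enumeration, which only works for tiny examples (a real implementation would use the forward algorithm).

```python
import math
from itertools import product

def crf_score(P, A, y):
    """s(X, Y) = sum_i P[i][y_i] + sum_i A[y_i][y_{i+1}].
    P[i][t]: emission score of position i for tag t.
    A[t][u]: transition score from tag t to tag u."""
    emit = sum(P[i][t] for i, t in enumerate(y))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def crf_neg_log_likelihood(P, A, y):
    """-log P(Y|X), normalizing over every possible tag sequence."""
    n_tags = len(A)
    log_z = math.log(sum(math.exp(crf_score(P, A, seq))
                         for seq in product(range(n_tags), repeat=len(P))))
    return log_z - crf_score(P, A, y)

def total_loss(l_origin, l_contra):
    """L_total = L_origin + L_contra (the loss fusion of step five)."""
    return l_origin + l_contra
```

For a one-position sequence with two tags and emission scores (0, 1), choosing tag 1 gives the loss log(1 + e) − 1, matching the normalized-probability formula above.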
In the description of the present specification, reference to the terms "one embodiment," some embodiments, "" examples, "" particular examples, "" some examples, "or" what is desired to be described, etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Finally, it should be noted that: the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited thereto, but it is to be understood that modifications and equivalents of some of the technical features described in the foregoing embodiments may be made by those skilled in the art, although the present invention has been described in detail with reference to the foregoing embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A noisy text sentence segmentation method for document image translation, characterized by comprising the following steps:
step one: simulating and constructing a noisy text data set containing various recognition noises on the basis of a clean pure text data set to obtain input data, and preprocessing the input data;
step two: encoding the input data, encoding a text to be processed by using a BERT pre-training language model to complete word embedding, obtaining a dynamic word vector, and further extracting features of the dynamic word vector by using a Bi-LSTM model to obtain a text vector;
step three: contrast learning, wherein the contrast learning comprises respectively calculating contrast loss of the clean plain text and various noisy texts;
step four: calculating a classification model loss, wherein the classification model loss comprises the steps of inputting a text vector into a CRF model to obtain a prediction category of each word, and calculating a loss between a classification result and a standard answer;
step five: loss fusion, wherein the loss fusion comprises weighting and summing the contrast learning loss and the classification model loss to obtain the final training loss;
step six: and carrying out gradient feedback on the final training loss, and updating model parameters.
2. The method for segmenting a noisy text sentence for document image translation according to claim 1, wherein the processing of the input data specifically comprises:
and taking the clean plain text paragraph data set, setting the sample length of each input model as N characters, and adopting a sliding window processing method to slide the whole plain text data set in a window mode, wherein the number of words sliding in each window is M, so as to obtain a plurality of clean plain text original training samples.
3. The method for segmenting the noisy text sentence oriented to the translation of the document image according to claim 2, wherein the processing of the clean plain text original training sample specifically comprises:
And simulating and adding various noises to each clean plain text original training sample: deleting punctuation marks and other text in a certain proportion simulates and constructs groups of missed recognition noise samples; replacing them in a certain proportion simulates and constructs groups of false recognition noise samples; and applying both the deletion and the replacement operations simulates and constructs groups of mixed noise samples containing both missed recognition noise and false recognition noise.
4. The noisy text sentence segmentation method for document image translation according to claim 3, wherein one clean plain text sample, one missed recognition noise sample, one false recognition noise sample, and one mixed noise sample are used together as a set of training samples.
5. The method for segmentation of noisy text sentences for document image translation according to claim 4, wherein said training samples are processed comprising:
And adding a corresponding triple tag to each character of the training sample according to the position of the end of the natural sentence, wherein the tag triple is [B], [I], [E], respectively representing the start position, the middle position, and the end position of a natural sentence.
6. The method for segmenting a noisy text sentence for document image translation according to claim 5, wherein extracting features from the dynamic word vector specifically comprises:
and serializing a group of training samples and then sending the training samples into a BERT model to obtain dynamic word vectors.
7. The method for segmenting a noisy text sentence for document image translation according to claim 6, wherein after obtaining a dynamic word vector, the dynamic word vector is input into a Bi-LSTM model to further extract text features.
8. The method for segmentation of noisy text sentences for document image translation according to claim 2, wherein said contrast learning comprises a set of contrast losses, in particular the following:
calculating contrast loss for the clean plain text sample and the missed recognition noise sample;
calculating a contrast loss for the clean plain text sample and the false recognition noise sample;
a contrast loss is calculated for the clean plain text sample and the mixed noise sample.
9. The method for segmentation of noisy text sentences for document image translation of claim 8 wherein said contrast learning further comprises:
And averaging the group of contrast losses to obtain the final contrast learning loss.
10. The method for segmentation of noisy text sentences for document image translation according to claim 9, wherein said loss fusion comprises:
and (3) averaging the contrast learning loss calculated in the step (III) and the classification model loss calculated in the step (IV) to obtain the final training loss.
CN202310477508.XA 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation Pending CN117217203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310477508.XA CN117217203A (en) 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310477508.XA CN117217203A (en) 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation

Publications (1)

Publication Number Publication Date
CN117217203A true CN117217203A (en) 2023-12-12

Family

ID=89043200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310477508.XA Pending CN117217203A (en) 2023-04-27 2023-04-27 A noisy text sentence segmentation method for document image translation

Country Status (1)

Country Link
CN (1) CN117217203A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120849350A (en) * 2025-09-22 2025-10-28 浪潮通用软件有限公司 A RAG-oriented document parsing method, system and computer device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination