CN120705590A

CN120705590A - A data set enhancement method and system based on deep learning

Info

Publication number: CN120705590A
Application number: CN202511141956.8A
Authority: CN
Inventors: 吴洋; 兰伟杰; 胡志
Original assignee: Jiajie Technology Co ltd
Current assignee: Jiajie Technology Co ltd
Priority date: 2025-08-15
Filing date: 2025-08-15
Publication date: 2025-09-26
Anticipated expiration: 2045-08-15
Also published as: CN120705590B

Abstract

The invention relates to the technical field of data set enhancement, in particular to a data set enhancement method and a data set enhancement system based on deep learning. Presetting multiple target modes, constructing projection functions of all modes, mapping original semantic vectors into style guide vectors of all target modes, and carrying out weighted fusion to generate comprehensive style vectors. And carrying out syntactic analysis on the original text, extracting word level masks, carrying out differential fusion on the context word vectors and the comprehensive style vectors based on the masks, and generating global fusion semantic vectors. The training projection matrix is mapped to the input space of the large language model to form soft prompt vectors, the soft prompt vectors are injected into the model input layer, a plurality of enhancement texts with faithful semantics and various styles are guided to be generated, and data set enhancement is completed. The invention realizes the effect of enhancing the generation of the text multi-style while retaining the original semantic information.

Description

Data set enhancement method and system based on deep learning

Technical Field

The invention relates to the technical field of image processing. And more particularly to a data set enhancement method and system based on deep learning.

Background

Along with the wide application of deep learning in the field of natural language processing, high-quality and large-scale labeling data becomes a key for improving the performance of a model. However, in many practical application scenarios, such as medical, legal, financial, etc., the cost of acquiring a large amount of annotation data is high and the period is long, resulting in sparse training data or outstanding class imbalance problems. The traditional data enhancement method, such as semantic word replacement, back translation, random insertion/deletion and the like, can increase the data quantity to a certain extent, but the generated text often lacks diversity, the semantics are easy to drift, and the dual requirements of the complicated task on semantic fidelity and expression richness are difficult to meet.

In recent years, text enhancement techniques based on generative models have come to be increasingly popular, particularly the advent of Large Language Models (LLM), providing a powerful tool for generating high quality text. However, direct use of LLM for enhancement often lacks control, results generated are highly random, easily deviate from the original semantics, and create "illusions" or unrelated content.

In addition, the existing methods are mostly limited to the transformation in the text, and the non-language information (such as vision, emotion and hearing) cannot be effectively introduced to guide the stylized expression of the text, so that the enhanced text is insufficient in terms of vitality, expressiveness and emotion infection, and the task requirements such as creative writing, advertising text, emotion analysis and the like with specific requirements on expression styles are difficult to meet.

Disclosure of Invention

The method aims to solve the technical problems that semantic fidelity and expression diversity are difficult to consider, style guiding capability is weak and a generating process is uncontrollable in the existing text data enhancement technology. The present invention provides aspects as follows.

In a first aspect, a data set enhancement method based on deep learning includes the steps of obtaining an original text, inputting the original text into a pre-training cross-modal text encoder to obtain an original semantic vector, presetting a multi-target-mode set, constructing a projection function of each target mode, inputting the original semantic vector into each mode projection function, generating each target-mode style guide vector, synthesizing all target-mode style guide vectors to obtain a comprehensive style vector, carrying out syntactic analysis on the original text, extracting a mask, fusing a context word vector of the original text with the comprehensive style vector based on the mask to generate a global fusion semantic vector, mapping the global fusion semantic vector to an input embedding space of a pre-training large language model through a trainable linear projection matrix to form a soft prompt vector, injecting the soft prompt vector into an input layer of the large language model to generate a plurality of enhanced texts, and completing data set enhancement.

Preferably, the pre-training cross-modal text encoder is a pre-training CLIP model, and the pre-training CLIP model is composed of an image encoder and a text encoder.

Preferably, the multi-target mode set includes expression features of non-language information such as image target modes, emotion target modes and the like.

Preferably, the comprehensive style vector comprises the steps of presetting a multi-target mode set, obtaining a mode projection function corresponding to each target mode, mapping an original semantic vector to a function of a target mode semantic subspace according to the mode projection function corresponding to each target mode, outputting the function as a style guide vector, and carrying out weighted fusion on the style guide vectors of different modes to generate the comprehensive style vector.

Preferably, the modal projection function comprises the steps of taking an image mode as an example, modeling the projection function based on the directional characteristic of a cross-modal image encoder, acquiring a graph-text data pair similar to the original text semantic in graph-text annotation data according to the original text semantic through a manual screening method, acquiring a corresponding semantic vector from the screened graph-text data pair, acquiring a graph-text offset of each graph-text data by using the cross-modal image encoder, acquiring a graph-text offset mean value of all graph-text data, taking the graph-text offset mean value as an image style guiding direction, and weighting and adding the image style guiding direction and the original semantic vector of the original text to obtain the modal projection function.

Preferably, the global fusion semantic vector comprises the steps of carrying out syntactic analysis on an original text to extract an original text semantic trunk vocabulary and a non-semantic trunk vocabulary, generating word level masks according to the original text semantic trunk vocabulary and the non-semantic trunk vocabulary, obtaining a context sensing word vector of each word in the original text by utilizing a pre-training language model code, combining the context sensing word vector of each word with a comprehensive style vector of the original text, carrying out fusion according to corresponding mask information to obtain a word level fusion vector of each word after fusion, and carrying out average aggregation on the word level fusion vectors of each word to obtain the global fusion semantic vector.

Preferably, the generating a plurality of enhanced texts comprises the steps of setting a projection matrix, multiplying a global fusion semantic vector by the projection matrix to obtain a soft prompt vector with a dimension required by a large language model input space, wherein the size of the projection matrix is consistent with the dimension of the global fusion semantic vector and the dimension of a hidden layer of the large language model, using a word segmentation device of the large language model to segment an original text to obtain a word sequence, obtaining a corresponding word embedded vector, splicing the soft prompt vector and the word embedded vector in the sequence dimension to form an extended input embedded sequence, wherein the soft prompt vector is at the forefront end of the extended input embedded sequence, and the extended sequence is used as input of the large language model and is sent into a transducer structure, and the large language model generates a plurality of enhanced texts in an autoregressive mode.

In a second aspect, a deep learning based data set enhancement system includes a processor and a memory storing computer program instructions that when executed by the processor implement a deep learning based data set enhancement method as described above.

The invention has the following effects:

1. By introducing a pre-training cross-modal text encoder to generate an original semantic vector and constructing a multi-target-mode learnable projection function, mapping the original semantic vector into a style guide vector of each mode, the effective migration of cross-modal knowledge to a text enhancement task is realized, the rich non-language information contained in a cross-modal model can be utilized to provide a clear and quantifiable style guide direction for text generation, and the expression effect of the generated text in different mode spaces is enhanced.

2. By performing dependency syntactic analysis on the original text, generating a mask vector for distinguishing semantic trunk words from non-trunk modifier words, and performing differential fusion on a context word vector and a comprehensive style vector based on the mask, fine granularity semantic control of trunk freezing and edge disturbance is realized, the stability of the core semantics of sentences is ensured, semantic drift caused by global style migration is avoided, and the expression diversity and liveliness of the generated text are improved on the basis of not damaging the original meaning.

Drawings

Fig. 1 is a flowchart of a method of step S1 to step S4 in a data set enhancement method based on deep learning according to an embodiment of the present invention.

FIG. 2 is a block diagram of a data set enhancement system based on deep learning in accordance with an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention.

Referring to fig. 1, a data set enhancement method based on deep learning includes steps S1 to S4, specifically as follows:

s1, acquiring an original text, and inputting the original text into a pre-training cross-modal text encoder to obtain an original semantic vector.

And reading input text samples to be enhanced in batches from the original training set to serve as original texts. And preprocessing the original text, namely removing redundant spaces and unifying punctuation mark formats to obtain a standardized text.

The preprocessed original text is input into a Pre-trained cross-modal text encoder to generate a corresponding original semantic vector, wherein the original semantic vector represents a vector of the original text in a semantic space, the cross-modal text encoder adopts a text encoder in a CLIP (Contrastive Language-Image Pre-Training, contrast language-Image Pre-Training ‌) model, and the CLIP model is a known technology means and is not repeated.

S2, presetting a set of multi-target modes, constructing a projection function of each target mode, inputting an original semantic vector into each mode projection function, generating each target mode style guide vector, and synthesizing all target mode style guide vectors to obtain a synthesized style vector.

To the original textInputting the pre-trained cross-modal text encoder to obtain corresponding original semantic vectors。

After the original semantic vector corresponding to the original text is obtained, the original semantic vector represents the vector of the original text in the semantic space, and because different text expression styles are different in a multi-modal scene, a preset target modal set is introduced and used for guiding the original semantic vector to migrate towards a specific expression style direction, so that generalization of the original text expression content is realized.

The target mode set includes, but is not limited to, image modes, emotion modes, etc., each mode corresponding to an expression feature of non-language information.

The image mode is used for enhancing the space scene description capability and visual appearance of the text, and the emotion mode is used for enhancing the subjective emotion color and emotion intensity of the text.

By constructing the mode projection function corresponding to each target mode one by one, the controllable stylized offset of the original semantic vector is realized, so that a guiding signal is provided for the subsequent generation of diversified and consistent-style enhanced texts.

Set the mode projection functionIs defined as the original semantic vectorThe function mapped to the target modal semantic subspace is output as a style guide vectorI.e.。

The function of the method is not to completely change the original semantics, but to moderately perturb along the semantic direction defined by the target mode on the premise of maintaining the core semantic structure, so that the generated text presents typical characteristics of the mode in an expression mode.

Illustratively, when the target modality is "emotion",The output of (c) should be closer to the semantic region of high emotional intensity such as "excited", "sad" or "anger" to guide the subsequent generation of more emotional-infected text.

For different target modes, the mode of constructing the mode projection function can be designed differently according to the mode characteristics.

Illustratively, taking image modality as an example, the projection functionModeling can be performed based on the directional characteristics of the cross-mode image encoder, a graph-text data pair similar to the original text semantic in graph-text annotation data is obtained through a manual screening method according to the original text semantic, and corresponding semantic vectors are obtained from the screened graph-text data pair by using the cross-mode text encoder in the step S1And acquiring corresponding image vectors using a cross-modality image encoderAcquiring the image-text offset of each image-text dataAcquiring the image-text offset average value of all image-text data asAs image style guide directionAnd defineWherein the cross-mode image encoder adopts an image encoder in a CLIP model, and the CLIP model is the same model as the CLIP model in the step S1, so that the cross-mode image encoder and the cross-mode text encoder in the step S1 are trained by using the same graphic data set,For adjustable gain coefficient, taking the checked valueControlling style migration intensity, and setting the screening number of image-text data similar to the original text semanteme asThe adjustment may be made by the practitioner depending on the particular implementation scenario.

Illustratively, taking emotion mode as an example, the projection functionIs realized by a trainable multi-layer perceptron (MLP) structure. The MLP is composed of two fully-connected layers, and the activation function isThe whole structure is as follows:

;

In the training process, the original text is acquired by a manual screening method Related text pairs with emotion labels, wherein,Is thatIs a mood enhancing version of (c) a,The generation can be performed by means of a manual standard method, optimizing the parameters by minimizing the loss function of the mean square error,,,,For the learnable optimization parameters in the MLP, the input original semantic text vector is enabled by the MLPThe corresponding output vector tends to a high emotional intensity semantic region.

When multi-mode collaborative enhancement is started, respectively generating corresponding style guide vectors of the original text in different modes through a plurality of mode projection functions, and generating comprehensive style vectors through a weighted fusion strategy:

Wherein, the Is the firstA weight value corresponding to a preset target mode,For the preset total number of target modes,In the first place for the original textStyle guide vectors for each preset target modality,The sum of the weights between different target modes is 1, wherein the weight value can be adjusted by an implementer according to the implementation scene, the weight values between different target modes are equal, the weight of different target modes in the comprehensive style vector can be adjusted by adjusting the weight values among different target modes.

The comprehensive style vector is used as a key input for subsequent semantic fusion, the dimension of the comprehensive style vector is kept consistent with the original semantic vector, and the compatibility of vector operation is ensured. The integrated style vector does not directly correspond to a particular text, but rather represents a "semantic trend" or "expression trend" for introducing style perturbation at the non-stem vocabulary level.

And S3, carrying out syntactic analysis on the original text to extract a mask, and fusing the context word vector of the original text with the comprehensive style vector based on the mask to generate a fused semantic vector.

The integrated style vector corresponding to the original text can be obtained through the step S2 to realize modal level fusion, but if the integrated style migration is carried out on the whole original text, the problem of semantic deviation of the original text is caused, so that word level semantic analysis is carried out on the original text, and the main semantic information of the original text is reserved.

In order to realize stable reservation of the semantic trunk and stylized enhancement of non-trunk components, dependency syntax analysis is carried out on an original text, a core grammar structure is identified, a word level mask vector is generated according to the dependency syntax analysis, a mask value belonging to the semantic trunk vocabulary in the word level mask vector is set to be 1, and a mask value belonging to the non-semantic trunk vocabulary is set to be 0 so as to distinguish the semantic trunk vocabulary from the non-trunk vocabulary, wherein the dependency syntax is a known technology means and is not repeated.

On the basis, the comprehensive style vector is combinedPerforming differential fusion on context-aware word vectors of each word to finally generate a word-level fusion semantic vector with both semantic fidelity and expression diversityProviding high quality semantic input for subsequent controllable text generation.

After the word level mask vector is obtained, a context-aware word vector for each word in the original text is obtained. The vector is obtained through pre-training language model coding, and semantic information of words in specific contexts can be reflected, wherein the pre-training language model coder is a pre-training language model such as BERT, roBERTa, sense vec.

Will be the first in the original textInputting the individual words into a pre-trained language model code to obtain corresponding context-aware word vectors;

Will beWith the generated style guide vectorWhen fusion is carried out, a mask-based differential weighting strategy is adopted, and a fusion principle of 'trunk freezing and edge disturbance' is realized. The specific fusion formula is as follows:

For the first Individual words, their fused word vectorsThe definition is as follows:

Wherein, the For the main fusion coefficient, in order to ensure that the main word vector mainly keeps original semantics, only very slight style disturbance is introduced, the required value is smaller, and the experience value is takenCan be adjusted by an implementer according to specific implementation scenes;

for non-trunk fusion coefficients, the non-trunk words are allowed to fully absorb the information of style guide vectors, so that the diversified expansion of expression modes is realized, the required value is large, and the experience value is taken The adjustment may be made by the practitioner depending on the particular implementation scenario.

Is the firstThe corresponding mask value of the individual word in the word level mask vector,Expressed as the backbone semantic words in the original text,Represented as non-stem semantic words in the original text.

Is the original textA corresponding synthesis style vector.

Is the firstContext-aware word vectors corresponding to individual words.

The design ensures that the core semantic structure of the sentence is not destroyed, and the modifier component of the non-main semantic information can be enhanced to be more visual or emotional color expression.

All word-level fusion vectors are aggregated into a global fusion semantic vector through an average operation:

The vector is Not only keep the original textThe core semantic intention of the target mode is fused with style guiding information of the target mode, and an enhanced semantic representation with rich semantics and controllable styles is formed. The vector is input as a soft prompt in the subsequent large language model generation process, and the guide model generates enhanced text which is faithful and original and has diversified expression styles.

And S4, mapping the global fusion semantic vector to an input embedded space of the pre-trained large language model through a trainable linear projection matrix to form a soft prompt vector, injecting the soft prompt vector into an input layer of the large language model to generate a plurality of enhancement texts, and completing data set enhancement.

The global fusion semantic vector generated in the step S3 is obtainedAnd then, because the semantic space of the global fusion semantic vector and the semantic space of the large language model input embedding space are not consistent, mapping alignment is carried out by forming a learnable soft prompt vector, and then enhanced text generation is carried out.

Due toThe semantic space of a cross-mode encoder derived from the CLIP model, while the input embedded space of the large language model has different dimension structures, and the problem of mismatch of space dimension and semantic distribution exists between the two.

For this purpose, a trainable linear projection matrix is introduced, wherein,For the global fusion of the semantic vector dimensions,Is a hidden layer dimension of a large language model,Is thatIs a matrix of (a) in the matrix. Mapping the global fusion semantic vector vf to the input space of the large language model by matrix multiplication operation:

the resulting vector The soft prompt vector does not correspond to any actual vocabulary, but carries the fused semantic and style information, and is used as an initial guiding signal in the generation process.

Word segmentation device using large language model for original textWord segmentation is carried out to obtain a word element sequence,The word segmentation number is the word segmentation number of the word segmentation device.

Obtaining corresponding word embedding vectors。

Splicing the soft prompt vector h0 and the word embedding vector in the sequence dimension to form an extended input embedding sequence:

The extended sequence is fed into a transducer structure as input to a large language model. Due to At the forefront of the sequence, the semantic information carried by the sequence participates in calculation as context bias in an attention mechanism, so that the generation direction is globally guided.

Autoregressive generation of multiple enhanced text by large language modelWhereinFor the preset generation quantity, the experience value is 1000, and the implementation can be adjusted according to the specific implementation scene.

The generation process can adopt various decoding strategies, including Top-k sampling, namely randomly sampling from k words with highest probability, increasing diversity, nuclus sampling (Top-p), namely sampling in the minimum word set with accumulated probability reaching p, balancing fluency and creativity, beamSearch, reserving a plurality of candidate paths, and selecting the output with highest overall probability, thereby being suitable for a scene with emphasized semantic continuity.

Preferably, the system sets a temperature coefficient (temperature. Apprxeq. 0.7-0.9) to encourage diversity while maintaining fluency. The generated text semantic is consistent with the original text, but the target modal style is embodied.

By way of example, generating text may emphasize subjective emotion more when the global fusion semantic vector is guided by emotion modalities, and text may contain more spatial description and visual detail when guided by image modalities.

Will thenAnd injecting the context guide information into an input layer of the large language model, driving the model to generate a plurality of enhanced texts with stable semantic information and various styles, and completing data set enhancement.

The invention also provides a data set enhancement system based on deep learning. As shown in fig. 2, the system comprises a processor and a memory storing computer program instructions which, when executed by the processor, implement a deep learning based data set enhancement method according to the first aspect of the invention. The system further comprises other components known to those skilled in the art, such as communication buses and communication interfaces, the arrangement and function of which are known in the art and therefore will not be described in detail herein.

It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. A method for enhancing a data set based on deep learning, comprising:

The method comprises the steps of obtaining an original text, and inputting the original text into a pre-training cross-mode text encoder to obtain an original semantic vector;

Presetting a set of multi-target modes, constructing a projection function of each target mode, inputting an original semantic vector into each mode projection function, generating each target mode style guide vector, and synthesizing all target mode style guide vectors to obtain a synthesized style vector;

carrying out syntactic analysis on the original text to extract a mask, and fusing the context word vector of the original text with the comprehensive style vector based on the mask to generate a global fusion semantic vector;

The global fusion semantic vector is mapped to an input embedded space of a pre-trained large language model through a trainable linear projection matrix to form a soft prompt vector, the soft prompt vector is injected into an input layer of the large language model to generate a plurality of enhancement texts, and data set enhancement is completed.

2. The method of claim 1, wherein the pre-training cross-modal text encoder is a pre-training CLIP model, and the pre-training CLIP model is composed of an image encoder and a text encoder.

3. The method for enhancing a data set based on deep learning according to claim 1, wherein the set of multi-target modalities includes expression features of non-language information such as image target modalities and emotion target modalities.

4. The method for enhancing a data set based on deep learning according to claim 1, the method is characterized in that the comprehensive style vector comprises the following steps:

presetting a multi-target mode set, and acquiring a mode projection function corresponding to each target mode;

According to the mode projection function corresponding to each target mode, mapping the original semantic vector to a function of a target mode semantic subspace, and outputting the function as a style guide vector;

and carrying out weighted fusion on the style guide vectors of different modes to generate a comprehensive style vector.

5. The method of depth learning based data set enhancement of claim 4, wherein the modal projection function comprises:

Taking an image mode as an example, a projection function can be modeled based on the directional characteristic of a cross-mode image encoder, a graph-text data pair similar to the original text semantic in graph-text annotation data is obtained through a manual screening method according to the original text semantic, a corresponding semantic vector is obtained from the screened graph-text data pair, a corresponding image vector is obtained by the cross-mode image encoder, the graph-text offset of each graph-text data is obtained, the graph-text offset average value of all the graph-text data is obtained and is used as an image style guiding direction, and the image style guiding direction and the original semantic vector of the original text are added in a weighted mode projection function.

6. The method for enhancing a data set based on deep learning according to claim 1, the global fusion semantic vector is characterized by comprising:

Carrying out syntactic analysis on the original text to extract the semantic trunk vocabulary and the non-semantic trunk vocabulary of the original text, and generating a word level mask according to the semantic trunk vocabulary and the non-semantic trunk vocabulary of the original text;

The method comprises the steps of obtaining a context sensing word vector of each word in an original text by utilizing a pre-training language model code, combining the context sensing word vector of each word with a comprehensive style vector of the original text, and fusing according to corresponding mask information to obtain a word level fusion vector of each word after fusion;

And carrying out average aggregation on the word-level fusion vectors of each word to obtain a global fusion semantic vector.

7. The method for enhancing a data set based on deep learning of claim 1, wherein the generating a plurality of enhanced texts comprises the steps of:

Setting a projection matrix, multiplying the global fusion semantic vector by the projection matrix to obtain a soft prompt vector with the dimension required by the input space of the large language model, wherein the size of the projection matrix is consistent with the dimension of the global fusion semantic vector and the dimension of a hidden layer of the large language model;

word segmentation is carried out on the original text by using a word segmentation device of a large language model to obtain a word element sequence, and a corresponding word embedding vector is obtained;

splicing the soft prompt vector and the word embedding vector in the sequence dimension to form an extended input embedding sequence, wherein the soft prompt vector is at the forefront end of the extended input embedding sequence;

The extended sequence is fed into a transducer structure as input to a large language model that generates a plurality of enhanced text in an autoregressive manner.

8. A deep learning based data set enhancement system comprising a processor and a memory, the memory storing computer program instructions which, when executed by the processor, implement a deep learning based data set enhancement method according to any of claims 1-7.