CN114358025A

CN114358025A - A text data processing method, apparatus, device and medium

Info

Publication number: CN114358025A
Application number: CN202110897046.8A
Authority: CN
Inventors: 杨振; 许钰林; 孟凡东
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-08-05
Filing date: 2021-08-05
Publication date: 2022-04-15
Anticipated expiration: 2041-08-05
Also published as: CN114358025B

Abstract

Embodiments of the present application provide a text data processing method, apparatus, device, and medium relating to the field of artificial intelligence. The method comprises: acquiring a first text pair and a second text pair, acquiring a first sub-text from the first text pair, and acquiring a second sub-text from the second text pair; determining an edit distance between the first sub-text and the second sub-text and, if the edit distance satisfies a similarity condition, generating a first target sub-text that is associated with the semantic information of the first sub-text and belongs to a third language type, and generating a second target sub-text that is associated with the semantic information of the second sub-text and belongs to a second language type; and generating text sample pairs according to the first text pair, the second text pair, the first target sub-text, and the second target sub-text. The method and apparatus can generate text sample pairs composed of different language types, increasing the amount of data in the corpus while preserving its quality.

Description

A text data processing method, apparatus, device and medium

Technical Field

The present application relates to the field of computer technology, and in particular to a text data processing method, apparatus, device, and medium.

Background

Existing corpora usually contain English-centric text pairs (that is, sentence pairs in which either the source-language side or the target-language side is English). To increase the amount of data in an existing corpus, non-English sentence pairs (pairs in which both the source language and the target language are languages other than English) need to be generated from the English-centric sentence pairs in that corpus.

At present, corpora are generated with extraction-based methods. When extracting parallel data from an existing English-centric corpus, such a method usually aligns identical English sides to build a multi-way parallel corpus. For example, when the English side of a "French-English" sentence pair is exactly the same as the English side of an "English-German" sentence pair, a three-way aligned "French-English-German" sentence can be obtained, and from it a two-way aligned "French-German" sentence pair. However, this method requires the English sides to be strictly aligned, so the extracted corpus is far smaller than the English-centric corpus. The existing way of generating text pairs therefore yields only a limited amount of data, which limits the quality of machine translation.
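The exact-match extraction described above can be illustrated with a minimal Python sketch. The function name, the tuple layout `(non_english, english)` / `(english, non_english)`, and the sample sentences are assumptions made for illustration; the source does not specify an implementation.

```python
from collections import defaultdict

def extract_exact_pivot_pairs(fr_en_pairs, en_de_pairs):
    """Join (french, english) and (english, german) pairs whose English sides are identical."""
    german_by_english = defaultdict(list)
    for english, german in en_de_pairs:
        german_by_english[english].append(german)

    aligned = []
    for french, english in fr_en_pairs:
        # Only an exact string match on the English side produces a French-German pair.
        for german in german_by_english.get(english, []):
            aligned.append((french, german))
    return aligned

fr_en = [("Bonjour le monde", "Hello world"),
         ("Merci beaucoup", "Thank you very much")]
en_de = [("Hello world", "Hallo Welt"),
         ("Thank you so much", "Vielen Dank")]
pairs = extract_exact_pivot_pairs(fr_en, en_de)
```

In this toy input the second French sentence is lost because its English side differs from the English-German entry by a single word, which is the coverage problem the background describes.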

Summary of the Invention

Embodiments of the present application provide a text data processing method, apparatus, device, and medium that can increase the amount of data in a corpus while preserving the quality of the corpus.

In one aspect, an embodiment of the present application provides a text data processing method, including:

acquiring a first text pair and a second text pair, acquiring a first sub-text from the first text pair, and acquiring a second sub-text from the second text pair, where the first sub-text and the second sub-text both belong to a first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type; and the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type;

determining an edit distance between the first sub-text and the second sub-text and, if the edit distance satisfies a similarity condition, generating a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generating a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type; and

generating text sample pairs according to the first text pair, the second text pair, the first target sub-text, and the second target sub-text.

In one aspect, an embodiment of the present application provides a text data processing apparatus, including:

a text acquisition module, configured to acquire a first text pair and a second text pair, acquire a first sub-text from the first text pair, and acquire a second sub-text from the second text pair, where the first sub-text and the second sub-text both belong to a first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type; and the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type;

a text generation module, configured to determine an edit distance between the first sub-text and the second sub-text and, if the edit distance satisfies a similarity condition, generate a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generate a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type; and

a sample pair generation module, configured to generate text sample pairs according to the first text pair, the second text pair, the first target sub-text, and the second target sub-text.

The apparatus further includes:

a length determination module, configured to determine the text length of the first sub-text as a first text length, determine the text length of the second sub-text as a second text length, and determine, according to the first text length and the second text length, a target length corresponding to the similarity condition;

a distance determination module, configured to acquire a similarity parameter associated with the similarity condition and determine, according to the target length and the similarity parameter, a similarity distance corresponding to the similarity condition;

a first comparison module, configured to determine that the edit distance satisfies the similarity condition if the edit distance is less than or equal to the similarity distance; and

a second comparison module, configured to determine that the edit distance does not satisfy the similarity condition if the edit distance is greater than the similarity distance.
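The comparison performed by these modules can be sketched as follows. This passage does not state how the target length or the similarity distance is computed, so the sketch assumes that the target length is the shorter of the two text lengths and that the similarity distance is the similarity parameter multiplied by the target length; the function name and the default value 0.2 are likewise hypothetical.

```python
def satisfies_similarity(edit_distance, first_length, second_length, similarity_parameter=0.2):
    """Return True if the edit distance is within the similarity distance."""
    # Assumption: the target length is the shorter of the two text lengths.
    target_length = min(first_length, second_length)
    # Assumption: the similarity distance scales linearly with the target length.
    similarity_distance = similarity_parameter * target_length
    # Less than or equal: condition satisfied; greater: not satisfied.
    return edit_distance <= similarity_distance
```

Under these assumptions, two sentences of 10 and 12 unit texts that differ by one edit satisfy the condition, while the same sentences differing by five edits do not.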

The text generation module includes:

a text extraction unit, configured to extract a first unit text from the first sub-text and a second unit text from the second sub-text;

a distance determination unit, configured to determine the edit distance between the first sub-text and the second sub-text based on the first unit text and the second unit text;

a target acquisition unit, configured to acquire a network model set associated with the first sub-text and the second sub-text if the edit distance satisfies the similarity condition; and

a text generation unit, configured to generate, based on the network model set, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.
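As an illustration of an edit distance computed over unit texts, the sketch below uses the standard Levenshtein distance with whitespace-separated words as the unit texts. Both the choice of Levenshtein distance and the whitespace tokenization are assumptions; the source only states that the distance is determined from the unit texts of the two sub-texts.

```python
def token_edit_distance(first_subtext, second_subtext):
    """Levenshtein distance between two sentences, counted in word-level edits."""
    first_units = first_subtext.split()    # assumption: a unit text is a whitespace-separated word
    second_units = second_subtext.split()

    # previous[j] holds the distance between the processed prefix of first_units
    # and the first j units of second_units.
    previous = list(range(len(second_units) + 1))
    for i, unit_a in enumerate(first_units, start=1):
        current = [i]
        for j, unit_b in enumerate(second_units, start=1):
            substitution_cost = 0 if unit_a == unit_b else 1
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + substitution_cost))  # substitution or match
        previous = current
    return previous[-1]
```

Counting edits in words rather than characters keeps the distance comparable to sentence length measured in unit texts.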

The network model set includes a first target network model and a second target network model; the first target network model is associated with the first language type and the third language type, and the second target network model is associated with the first language type and the second language type.

The text generation unit includes:

a first splicing subunit, configured to splice the first sub-text and the fourth sub-text to obtain a first spliced text;

a first generation subunit, configured to input the first spliced text into the first target network model and generate, through the first target network model, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type;

a second splicing subunit, configured to splice the second sub-text and the third sub-text to obtain a second spliced text; and

a second generation subunit, configured to input the second spliced text into the second target network model and generate, through the second target network model, the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.
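The data flow through the splicing and generation subunits can be sketched as below. The separator string, the function names, and the stub models are hypothetical: the source does not say how the two sub-texts are joined, and a real system would call trained network models where the stubs appear.

```python
SEPARATOR = " <sep> "  # hypothetical separator; the source does not name one

def splice(first_language_subtext, reference_subtext):
    """Join a first-language sub-text with a reference sub-text from the other text pair."""
    return first_language_subtext + SEPARATOR + reference_subtext

def generate_targets(first, second, third, fourth, model_to_third, model_to_second):
    """Return (first target sub-text, second target sub-text)."""
    first_spliced = splice(first, fourth)   # first sub-text + fourth sub-text
    second_spliced = splice(second, third)  # second sub-text + third sub-text
    return model_to_third(first_spliced), model_to_second(second_spliced)

def stub_model(tag):
    """Stand-in for a trained model: it only tags and echoes its input."""
    return lambda spliced_text: "[" + tag + "] " + spliced_text

first_target, second_target = generate_targets(
    "Hello world", "Hello, world", "Bonjour le monde", "Hallo, Welt",
    stub_model("de"), stub_model("fr"))
```

The point of the sketch is the pairing: each first-language sub-text is spliced with the non-English side of the other text pair, so the model sees a nearly correct translation in the language it is asked to produce.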

The first target network model includes an encoder for encoding and a decoder for decoding.

The first generation subunit includes:

an encoding processing subunit, configured to input the first spliced text into the encoder of the first target network model and encode the first spliced text with the encoder to obtain a first feature vector corresponding to the first spliced text;

a decoding processing subunit, configured to input the first feature vector into the decoder of the first target network model and decode the first feature vector with the decoder to obtain a first text vector corresponding to the first feature vector; and

a text generation subunit, configured to generate, based on the first text vector, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type.

The sample pair generation module includes:

a first combining unit, configured to combine the first target sub-text and the third sub-text into a first sample pair, where the first target sub-text is associated with the semantic information of the third sub-text;

a second combining unit, configured to combine the second target sub-text and the fourth sub-text into a second sample pair, where the second target sub-text is associated with the semantic information of the fourth sub-text; and

a sample pair determination unit, configured to determine the first text pair, the second text pair, the first sample pair, and the second sample pair as the text sample pairs.
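The combination performed by these units amounts to simple bookkeeping, sketched below. Representing each pair as a two-element tuple with the first-language sub-text first is an assumption made for illustration, as are the function name and the sample sentences.

```python
def build_text_sample_pairs(first_text_pair, second_text_pair, first_target, second_target):
    """Assemble the text sample pairs from two original pairs and two generated sub-texts."""
    _first_subtext, third_subtext = first_text_pair     # (first language, second language)
    _second_subtext, fourth_subtext = second_text_pair  # (first language, third language)

    first_sample_pair = (first_target, third_subtext)     # third language, second language
    second_sample_pair = (second_target, fourth_subtext)  # second language, third language
    return [first_text_pair, second_text_pair, first_sample_pair, second_sample_pair]

samples = build_text_sample_pairs(
    ("Hello world", "Bonjour le monde"),
    ("Hello, world", "Hallo, Welt"),
    first_target="Hallo Welt",
    second_target="Bonjour, le monde")
```

Each generated sample pair contains no first-language text, which is how the two English-centric inputs yield two non-English pairs.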

The text generation unit further includes:

an initial acquisition subunit, configured to acquire a first initial network model associated with the first language type and the third language type;

a transformation subunit, configured to apply an edit transformation to the fourth sub-text to obtain a first transformed text associated with the fourth sub-text;

a splicing processing subunit, configured to splice the second sub-text and the first transformed text to obtain a first spliced sample; and

a model training subunit, configured to obtain a first sample vector of the first spliced sample through the first initial network model, and train the first initial network model based on the first sample vector and the fourth sub-text to obtain the first target network model.

The fourth sub-text contains N unit texts, where N is a positive integer.

The transformation subunit includes:

a probability generation subunit, configured to generate a random transformation probability for each of the N unit texts, and determine each unit text whose random transformation probability falls within a transformable probability interval as a unit text to be edited; and

an edit transformation subunit, configured to acquire an edit transformation mode associated with the unit text to be edited, apply an edit transformation to the unit text to be edited according to the edit transformation mode to obtain an edit transformation result, and generate, according to the edit transformation result, the first transformed text associated with the fourth sub-text.

The edit transformation subunit is specifically configured to acquire the edit transformation mode associated with the unit text to be edited;

the edit transformation subunit is further specifically configured to, if the edit transformation mode is a replacement operation, acquire a dictionary table corresponding to the third language type, acquire a first edit text from the dictionary table, and replace the unit text to be edited with the first edit text to obtain the edit transformation result;

the edit transformation subunit is further specifically configured to, if the edit transformation mode is an insertion operation, acquire a second edit text from the dictionary table and insert the second edit text at a position adjacent to the unit text to be edited to obtain the edit transformation result; and

the edit transformation subunit is further specifically configured to, if the edit transformation mode is a deletion operation, delete the unit text to be edited from the fourth sub-text to obtain the edit transformation result.
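A minimal sketch of such an edit transformation follows. Several details are assumptions rather than statements from the source: the transformable probability interval is taken to be [0, threshold), the edit transformation mode is drawn uniformly at random, inserted text is placed immediately after the unit text, and the dictionary table is a plain list of words.

```python
import random

def edit_transform(unit_texts, dictionary_table, threshold=0.15, rng=None):
    """Randomly replace, insert after, or delete unit texts to produce a transformed text."""
    if rng is None:
        rng = random.Random()
    transformed = []
    for unit in unit_texts:
        # Assumption: a unit is edited when its random probability falls in [0, threshold).
        if rng.random() >= threshold:
            transformed.append(unit)
            continue
        # Assumption: the edit transformation mode is chosen uniformly at random.
        mode = rng.choice(["replace", "insert", "delete"])
        if mode == "replace":
            transformed.append(rng.choice(dictionary_table))
        elif mode == "insert":
            transformed.append(unit)
            transformed.append(rng.choice(dictionary_table))
        # mode == "delete": the unit is dropped, so nothing is appended.
    return transformed
```

Calling `edit_transform("Hallo Welt wie geht es".split(), ["Haus", "Baum"], rng=random.Random(0))` gives a reproducible noisy variant; passing `threshold=0.0` returns the input unchanged.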

The model training subunit includes:

a text prediction subunit, configured to obtain the first sample vector of the first spliced sample through the first initial network model and generate, based on the first sample vector, a predicted sample sub-text belonging to the third language type;

a loss generation subunit, configured to obtain a sample semantic similarity between the predicted sample sub-text and the fourth sub-text, and generate a model loss function of the first initial network model according to the sample semantic similarity; and

a parameter adjustment subunit, configured to adjust the parameters of the first initial network model based on the model loss function to obtain the first target network model.

The apparatus further includes:

a model training module, configured to acquire an initial translation model associated with the text sample pairs, iteratively train the initial translation model based on the text sample pairs, and determine the iteratively trained initial translation model as a target translation model, where the target translation model is used to translate text between any two of the first language type, the second language type, and the third language type.

In one aspect, an embodiment of the present application provides a computer device, including a processor and a memory.

The processor is connected to the memory, the memory is configured to store a computer program, and the computer program, when executed by the processor, causes the computer device to perform the method provided by the embodiments of the present application.

In one aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiments of the present application.

In one aspect, an embodiment of the present application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method provided by the embodiments of the present application.

In the embodiments of the present application, when a computer device acquires a first text pair and a second text pair, it may acquire a first sub-text from the first text pair and a second sub-text from the second text pair. The first sub-text and the second sub-text both belong to a first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type; and the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type. The computer device may then determine an edit distance between the first sub-text and the second sub-text and, if the edit distance satisfies a similarity condition, generate a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generate a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type. The computer device may then generate text sample pairs according to the first text pair, the second text pair, the first target sub-text, and the second target sub-text.

It can be seen that the embodiments of the present application generate multi-way aligned corpora through an extraction step and a generation step. The extraction step determines the edit distance between the first sub-text in the first text pair and the second sub-text in the second text pair, and takes the first text pair and the second text pair whose edit distance satisfies the similarity condition as a candidate sentence pair. The generation step eliminates the semantic difference between the first sub-text and the second sub-text, converting the "partially aligned" candidate sentence pair into "fully aligned" text sample pairs. Through these extraction and generation steps, a large amount of high-quality, semantically aligned data (that is, text sample pairs) can be generated quickly and accurately, increasing the amount of data in the corpus while preserving its quality.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a data interaction scenario provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a text data processing method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a scenario for generating candidate text pairs provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of generating target sub-texts provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a scenario for generating text sample pairs provided by an embodiment of the present application;

FIG. 7 is a schematic flowchart of a text data processing method provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a scenario for generating target sub-texts provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of a text data processing method provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of a performance comparison scenario provided by an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a text data processing apparatus provided by an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

It should be understood that artificial intelligence (AI) is the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.

The solutions provided by the embodiments of the present application mainly involve the natural language processing (NLP) and machine learning (ML) technologies of artificial intelligence.

Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in every field of artificial intelligence. Machine learning and deep learning usually include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

Natural language processing is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies usually include text processing, semantic understanding, machine translation, question answering, and knowledge graphs.

Specifically, refer to FIG. 1, which is a schematic structural diagram of a network architecture provided by an embodiment of the present application. As shown in FIG. 1, the network architecture may include a service server 2000 and a user terminal cluster. The user terminal cluster may include one or more user terminals; the number of user terminals in the cluster is not limited here. As shown in FIG. 1, the user terminals may include user terminal 3000a, user terminal 3000b, user terminal 3000c, ..., and user terminal 3000n. Each of these user terminals may establish a direct or indirect network connection with the service server 2000 through wired or wireless communication, so that each user terminal can exchange data with the service server 2000 over that connection.

The service server 2000 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.

Each user terminal in the user terminal cluster may be an intelligent terminal with a text data processing function, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart home device, a wearable device, or a vehicle-mounted terminal. It should be understood that each user terminal in the user terminal cluster shown in FIG. 1 may have an application client installed; when the application client runs on a user terminal, it can exchange data with the service server 2000 shown in FIG. 1. The application client may be, for example, a vehicle-mounted client, a smart home client, an entertainment client (for example, a game client), a multimedia client (for example, a video client), a social client, or an information client (for example, a news client).

For ease of understanding, in this embodiment of the present application, one of the user terminals shown in FIG. 1 may be selected as the target user terminal. For example, the user terminal 3000a shown in FIG. 1 may be used as the target user terminal, and an application client with a text data processing function may be integrated in the target user terminal. In this case, the target user terminal can exchange data with the service server 2000 through the application client.

It should be understood that the computer device in this embodiment of the present application may acquire a first text pair and a second text pair from a corpus. The corpus may include text pairs composed of sub-texts of multiple language types. The first text pair may include a first sub-text belonging to a first language type and a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type. The second text pair may include a second sub-text belonging to the first language type and a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type. Further, based on the first text pair and the second text pair, the computer device may generate text pairs composed of sub-texts of other language types (i.e., text sample pairs), thereby enriching the existing corpus.

It can be understood that the text data processing method provided in the embodiments of the present application may be executed by the service server 2000 (i.e., the computer device may be the service server 2000), by the target user terminal (i.e., the computer device may be the target user terminal), or jointly by the service server 2000 and the target user terminal.

When the text data processing method is executed jointly by the service server 2000 and the target user terminal, the service server 2000 may train an initial translation model used for text translation on the text sample pairs obtained above, to obtain a target translation model. A user corresponding to the target user terminal (e.g., user Y) can then send a text translation request to the service server 2000 through the application client in the target user terminal. The text translation request may include the initial translation text that user Y asks to have translated. After receiving the text translation request, the service server 2000 may obtain the initial translation text from the request, translate it with the target translation model to obtain a target translation text, and return the target translation text to the target user terminal.

Optionally, when the text data processing method is executed by the service server 2000, the service server 2000 may directly translate the initial translation text with the target translation model obtained by the above training, to obtain the target translation text. Optionally, when the text data processing method is executed by the target user terminal, the target user terminal may train the initial translation model on the text sample pairs obtained above to obtain the target translation model, and then directly translate the initial translation text with the target translation model trained on the target user terminal, to obtain the target translation text.

For ease of understanding, further refer to FIG. 2, which is a schematic diagram of a data interaction scenario provided by an embodiment of the present application. The server 20a shown in FIG. 2 may be the service server 2000 in the embodiment corresponding to FIG. 1, and the user terminal 20b shown in FIG. 2 may be any user terminal in the user terminal cluster of the embodiment corresponding to FIG. 1. For ease of understanding, this embodiment takes the user terminal 3000a shown in FIG. 1 as the user terminal 20b, to describe the specific process of data interaction between the server 20a and the user terminal 20b shown in FIG. 2. An application client is installed on the user terminal 20b and can be used to display the initial translation text and the target translation text. The user corresponding to the user terminal 20b may be the user 20c.

It can be understood that the text database 20d shown in FIG. 2 may include multiple databases, specifically the database 30a, the database 30b, ..., and the database 30n shown in FIG. 2. This means that the text database 20d can be used to store text pairs composed of sub-texts of different language types. For example, the database 30a may store text pairs composed of sub-texts of the first language type and the second language type, the database 30b may store text pairs composed of sub-texts of the first language type and the third language type, ..., and the database 30n may store text pairs composed of sub-texts of the first language type and a fourth language type.

As shown in FIG. 2, the server 20a may acquire the first text pair and the second text pair from a text database (e.g., the text database 20d), for example, the first text pair 21a from the database 30a and the second text pair 21b from the database 30b. The first text pair 21a may include a first sub-text of the first language type and a third sub-text of the second language type, where the first sub-text and the third sub-text are semantically aligned (i.e., have the same semantic information). The second text pair 21b may include a second sub-text of the first language type and a fourth sub-text of the third language type, where the second sub-text and the fourth sub-text are semantically aligned (i.e., have the same semantic information).

Further, the server 20a may obtain, from the first text pair 21a and the second text pair 21b, the first sub-text and the second sub-text, which have the same language type (i.e., the first language type), and then determine the text similarity between the first sub-text and the second sub-text; for example, the text similarity may be determined by the edit distance. It can be understood that if the text similarity between the first sub-text and the second sub-text satisfies the similarity condition (i.e., the edit distance satisfies the similarity condition), the server 20a may generate the target sub-text 21c based on the first text pair 21a and the second text pair 21b. The target sub-text 21c may include a first target sub-text and a second target sub-text.

It can be understood that the server 20a may generate, based on the first sub-text and the fourth sub-text, a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type; and the server 20a may generate, based on the second sub-text and the third sub-text, a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

As shown in FIG. 2, the server 20a may generate text sample pairs according to the first text pair 21a, the second text pair 21b, and the target sub-text 21c, then obtain an initial translation model used for text translation, and train the initial translation model on the text sample pairs to obtain a target translation model. The target translation model can be used for text translation between any two of the first language type, the second language type, and the third language type.

As shown in FIG. 2, when the user 20c needs a passage (i.e., the initial translation text) translated, the user terminal 20b may send a text translation request carrying the initial translation text to the server 20a. For example, the initial translation text may belong to the second language type, and the language type into which the user 20c wants it translated may be the third language type. After receiving the initial translation text, the server 20a may obtain the target translation model used for text translation and use it to translate the initial translation text of the second language type into a target translation text of the third language type. Optionally, the target translation model may also translate the initial translation text of the second language type into a target translation text of the first language type. It should be understood that the embodiments of the present application do not limit the language type of the target translation text.

Further, the server 20a may return the target translation text obtained by translation to the user terminal 20b, so that the user terminal 20b can display the target translation text together with the initial translation text, allowing the user 20c to obtain the target translation text corresponding to the initial translation text.

It can thus be seen that the embodiments of the present application may use the edit distance to extract text pairs (i.e., the first text pair and the second text pair) whose first-language-type sides (e.g., the English sides) are "highly similar", and then use model generation to eliminate the semantic differences between the text pairs. The "highly similar" text pairs ensure the scale of the generated corpus; they also serve as prior knowledge, which effectively prevents the diversity of the generated corpus from falling below that of the original corpus. Therefore, the embodiments of the present application can generate a large-scale and diverse corpus, and training the initial translation model on this corpus can improve the translation performance of the target translation model.

Further, refer to FIG. 3, which is a schematic flowchart of a text data processing method provided by an embodiment of the present application. The method may be executed by a server, by a user terminal, or jointly by the server and the user terminal. The server may be the server 20a in the embodiment corresponding to FIG. 2, and the user terminal may be the user terminal 20b in the embodiment corresponding to FIG. 2. For ease of understanding, this embodiment is described with the method executed by the server as an example. The text data processing method may include the following steps S101 to S103:

Step S101: acquire a first text pair and a second text pair, acquire a first sub-text from the first text pair, and acquire a second sub-text from the second text pair;

The first sub-text and the second sub-text both belong to the first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to the second language type, and the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to the third language type.

It should be understood that the first text pair and the second text pair may be acquired from a corpus, and the corpus may include text pairs composed of sub-texts of multiple language types. When the corpus is a bilingual corpus, it may include text pairs each composed of texts of two language types, for example, the first text pair composed of the first sub-text of the first language type and the third sub-text of the second language type, and the second text pair composed of the second sub-text of the first language type and the fourth sub-text of the third language type.

It can be understood that the corpus used in the embodiments of the present application may be a bilingual corpus, and the bilingual corpus may be the WMT-5 dataset, which is English-centric (i.e., either the source language side or the target language side is English). It should be understood that the embodiments of the present application do not limit the specific type of the bilingual corpus.

It can be understood that the WMT-5 dataset is composed of five datasets: the WMT13EnEs dataset, the WMT14EnDe dataset, the WMT15EnFr dataset, the WMT18EnCs dataset, and the WMT18EnRu dataset. The WMT13EnEs dataset consists of English (abbreviated En/en) and Spanish (abbreviated Es/es), and every text pair in it has English and Spanish as its source and target language sides. The WMT14EnDe dataset consists of English and German (abbreviated De/de), and every text pair in it has English and German as its source and target language sides. The WMT15EnFr dataset consists of English and French (abbreviated Fr/fr), and every text pair in it has English and French as its source and target language sides. The WMT18EnCs dataset consists of English and Czech (abbreviated Cs/cs), and every text pair in it has English and Czech as its source and target language sides. The WMT18EnRu dataset consists of English and Russian (abbreviated Ru/ru), and every text pair in it has English and Russian as its source and target language sides.

It should be understood that the bilingual corpus may also include sample datasets made up of text pairs composed of sub-texts of other language types. This embodiment is described with the bilingual corpus including the WMT-5 dataset as an example; for the process of performing text data processing on the text pairs in a sample dataset, refer to the description of performing text data processing on the first text pair and the second text pair. The WMT-5 dataset may include sub-texts of six language types: the first language type, the second language type, the third language type, the fourth language type, the fifth language type, and the sixth language type.
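As an illustration of why an English-centric corpus leaves non-English directions uncovered, the five datasets named above can be written as a small lookup table. The sketch below is only an assumption about how such a table might be organized; the names `WMT5_DATASETS`, `languages`, `pivot_languages`, and `non_english_directions` are hypothetical and do not come from the application.

```python
from itertools import permutations

# Hypothetical lookup table: dataset name -> the two language codes it pairs.
WMT5_DATASETS = {
    "WMT13EnEs": ("en", "es"),
    "WMT14EnDe": ("en", "de"),
    "WMT15EnFr": ("en", "fr"),
    "WMT18EnCs": ("en", "cs"),
    "WMT18EnRu": ("en", "ru"),
}


def languages(datasets):
    """Return the sorted list of language codes covered by the datasets."""
    return sorted({code for pair in datasets.values() for code in pair})


def pivot_languages(datasets):
    """Return the language codes that appear in every dataset."""
    common = None
    for pair in datasets.values():
        common = set(pair) if common is None else common & set(pair)
    return sorted(common or [])


def non_english_directions(datasets, pivot="en"):
    """Return every ordered (source, target) pair that avoids the pivot language."""
    others = [code for code in languages(datasets) if code != pivot]
    return list(permutations(others, 2))
```

Under these assumptions the table covers six language types with English as the only shared one, and twenty ordered non-English directions (for example Czech to German) have no direct text pairs, which is the gap the generated text sample pairs are meant to fill.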

It should be understood that the embodiments of the present application do not limit the specific languages of the first language type, the second language type, and the third language type. For example, when the first text pair belongs to the WMT15EnFr dataset, the first language type may be English and the second language type may be French; when the second text pair belongs to the WMT14EnDe dataset, the first language type may be English and the third language type may be German. The second language type may serve as either the source language side or the target language side, and so may the third language type.

Similarly, the embodiments of the present application do not limit the specific languages of the fourth language type, the fifth language type, and the sixth language type. For example, when the first language type is English, the second language type is French, and the third language type is German, the fourth language type may be Czech, the fifth language type may be Spanish, and the sixth language type may be Russian.

Step S102: determine the edit distance between the first sub-text and the second sub-text; if the edit distance satisfies the similarity condition, generate a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generate a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type;

Specifically, the server may extract first unit texts from the first sub-text and second unit texts from the second sub-text. Further, the server may determine the edit distance between the first sub-text and the second sub-text based on the first unit texts and the second unit texts. Further, if the edit distance satisfies the similarity condition, the server may generate the first target sub-text, which is associated with the semantic information of the first sub-text and belongs to the third language type, and the second target sub-text, which is associated with the semantic information of the second sub-text and belongs to the second language type.

It can be understood that the present application may apply the edit distance at the word level. For example, when the first sub-text is "How do you do?", the first unit texts extracted from it may be "How", "do", "you", "do", and "?". Likewise, when the second sub-text is "How are you?", the second unit texts extracted from it may be "How", "are", "you", and "?".

The embodiments of the present application do not distinguish between words (e.g., "How", "do") and symbols (e.g., "?"); both are collectively referred to as unit texts.

It can be understood that the edit distance measures the similarity between sentences by the minimum number of operations required to convert one sentence into the other, so similar sentence pairs obtained in this way tend to share more words and have more similar sentence structures. In addition, computing the edit distance involves only three operations (substitution, insertion, and deletion) and is therefore relatively simple. The distance may be computed by dynamic programming; it should be understood that the embodiments of the present application do not limit the specific distance calculation method.
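The word-level edit distance described above can be sketched with the standard dynamic-programming recurrence. This is a minimal sketch, not the applicant's implementation: the regular-expression tokenizer and the function names `tokenize` and `word_edit_distance` are assumptions introduced here, and each substitution, insertion, or deletion is assumed to cost 1.

```python
import re


def tokenize(text):
    """Split a sentence into unit texts: runs of word characters, or single symbols.

    Assumption: words and punctuation are treated alike, as the text states.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def word_edit_distance(first_units, second_units):
    """Minimum number of substitutions, insertions and deletions (unit cost)
    needed to turn first_units into second_units."""
    # prev[j] holds the distance between the processed prefix of first_units
    # and the first j units of second_units.
    prev = list(range(len(second_units) + 1))
    for i, unit_a in enumerate(first_units, start=1):
        curr = [i]  # deleting all i units reaches the empty prefix
        for j, unit_b in enumerate(second_units, start=1):
            substitution = prev[j - 1] + (0 if unit_a == unit_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For the two example sub-texts, `tokenize("How do you do?")` yields five unit texts and `tokenize("How are you?")` yields four; one substitution ("do" to "are") and one deletion (the second "do") convert the first into the second, so the word-level edit distance is 2.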

It should be understood that the relationship between the edit distance and the similarity condition may be described as follows. The server may determine the text length of the first sub-text as the first text length and the text length of the second sub-text as the second text length, and determine the target length corresponding to the similarity condition according to the first text length and the second text length. Further, the server may obtain the similarity parameter associated with the similarity condition, and determine the similarity distance corresponding to the similarity condition according to the target length and the similarity parameter. Further, if the edit distance is less than or equal to the similarity distance, the server may determine that the edit distance satisfies the similarity condition. Optionally, if the edit distance is greater than the similarity distance, the server may determine that the edit distance does not satisfy the similarity condition.

It can be understood that the server may be given two language pairs (i.e., text pairs) $(X^1, Y^1)$ and $(X^2, Y^2)$, where $X^1$ and $X^2$ belong to the first language type (e.g., the English side), and $Y^1$ and $Y^2$ belong to language $L_a$ (e.g., the second language type) and language $L_b$ (e.g., the third language type), respectively. Accordingly, $(X^1, Y^1)$ may be the first text pair, in which $X^1$ denotes the first sub-text and $Y^1$ denotes the third sub-text; $(X^2, Y^2)$ may be the second text pair, in which $X^2$ denotes the second sub-text and $Y^2$ denotes the fourth sub-text.

The relationship between the edit distance and the similarity distance may be expressed by the following formula (1):

$$d(X^1, X^2) \le \gamma \cdot \min\left(|X^1|, |X^2|\right) \tag{1}$$

Here, $|X^1|$ may denote the text length of the first sub-text (i.e., the first text length), and $|X^2|$ may denote the text length of the second sub-text (i.e., the second text length). $\min(|X^1|, |X^2|)$ may denote the smaller of the first text length and the second text length; therefore, the embodiments of the present application may use the smaller of the two lengths as the target length. When the first text length is less than the second text length, the target length may be equal to the first text length; when the first text length is greater than the second text length, the target length may be equal to the second text length; when the two lengths are equal, the target length is that common length. $\gamma$ may denote the similarity parameter, $\gamma \cdot \min(|X^1|, |X^2|)$ may denote the similarity distance corresponding to the similarity condition, and $d(X^1, X^2)$ may denote the edit distance between the first sub-text and the second sub-text.

If the relationship between the edit distance and the similarity distance satisfies formula (1), the edit distance is determined to be less than or equal to the similarity distance. In this case, the server may take $(X^1, Y^1, X^2, Y^2)$ as a candidate text pair (i.e., a candidate sentence pair) and add the candidate text pair to the candidate set. Optionally, if the relationship between the edit distance and the similarity distance does not satisfy formula (1), the edit distance is determined to be greater than the similarity distance.

It can be understood that γ is a tunable hyperparameter, γ ∈ [0, 1], that controls how similar the language pairs (i.e., candidate text pairs) in the candidate set are, that is, how similar the first sub-text and the second sub-text must be. The larger γ is, the less similar the first sub-text and the second sub-text may be and the larger their semantic difference; the smaller γ is, the more similar they must be and the smaller their semantic difference. When γ = 0, the similarity distance is 0, and the similarity condition is satisfied only if the first sub-text and the second sub-text are identical (i.e., the edit distance is 0); in that case, the first sub-text and the second sub-text belong to a fully aligned multi-way parallel corpus. When γ > 0, more similar sentence pairs with semantic differences are obtained.
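The effect of γ can be illustrated numerically. The sketch below assumes that the similarity distance is the product of γ and the target length (the smaller of the two text lengths), which is how the surrounding definitions read; the function names `similarity_distance` and `satisfies_similarity` are hypothetical.

```python
def similarity_distance(first_length, second_length, gamma):
    """Similarity distance, assumed here to be gamma times the target length.

    The target length is the smaller of the two text lengths (in unit texts).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is expected to lie in [0, 1]")
    target_length = min(first_length, second_length)
    return gamma * target_length


def satisfies_similarity(edit_distance, first_length, second_length, gamma):
    """True if the edit distance is less than or equal to the similarity distance."""
    return edit_distance <= similarity_distance(first_length, second_length, gamma)
```

For the example sub-texts "How do you do?" (five unit texts) and "How are you?" (four unit texts), whose word-level edit distance is 2, the target length is 4. With γ = 0.5 the similarity distance is 2.0 and the condition holds; with γ = 0.3 the similarity distance is about 1.2 and the condition fails; with γ = 0 only identical sub-texts pass.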

For ease of understanding, refer to FIG. 4, which is a schematic diagram of a scenario for generating a candidate text pair provided by an embodiment of the present application. As shown in FIG. 4, the first text pair 41a may include a first sub-text of the first language type and a third sub-text of the second language type, and the second text pair 41b may include a second sub-text of the first language type and a fourth sub-text of the third language type. The first sub-text and the third sub-text have the same semantic information (i.e., first semantic information), and the second sub-text and the fourth sub-text have the same semantic information (i.e., second semantic information).

As shown in FIG. 4, the server may obtain the sub-texts of the same language type from the first text pair 41a and the second text pair 41b, that is, the first sub-text of the first language type from the first text pair 41a and the second sub-text of the first language type from the second text pair 41b, and then determine the edit distance between the first sub-text and the second sub-text. Further, when the server determines that the edit distance satisfies the similarity condition, it may generate a candidate text pair based on the first text pair 41a and the second text pair 41b. The candidate text pair here may be the candidate text pair 41c, which may be expressed as (first sub-text, third sub-text, second sub-text, fourth sub-text). Further, after obtaining the candidate text pair 41c, the server may store it in the candidate set.
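The candidate-generation flow of FIG. 4 can be sketched end to end as follows. This is a simplified sketch under several assumptions: the tokenizer is a plain regular expression, the similarity distance is taken as γ times the smaller text length, every first text pair is compared with every second text pair by brute force (a real corpus would need a more scalable search), and the function names and toy sentences are invented for illustration only.

```python
import re


def tokenize(text):
    """Split a sentence into unit texts (words and symbols treated alike)."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_edit_distance(first_units, second_units):
    """Unit-cost edit distance over unit texts, by dynamic programming."""
    prev = list(range(len(second_units) + 1))
    for i, unit_a in enumerate(first_units, start=1):
        curr = [i]
        for j, unit_b in enumerate(second_units, start=1):
            cost = 0 if unit_a == unit_b else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def build_candidate_set(first_pairs, second_pairs, gamma):
    """Collect (first, third, second, fourth) sub-text tuples whose
    first-language sides satisfy the similarity condition."""
    candidates = []
    for first_sub, third_sub in first_pairs:
        first_units = tokenize(first_sub)
        for second_sub, fourth_sub in second_pairs:
            second_units = tokenize(second_sub)
            threshold = gamma * min(len(first_units), len(second_units))
            if word_edit_distance(first_units, second_units) <= threshold:
                candidates.append((first_sub, third_sub, second_sub, fourth_sub))
    return candidates


# Toy data, invented for illustration: English-French and English-German pairs.
FIRST_PAIRS = [("How do you do?", "Comment allez-vous ?")]
SECOND_PAIRS = [
    ("How are you?", "Wie geht es dir?"),
    ("The weather is nice today.", "Das Wetter ist heute schön."),
]
```

With γ = 0.5, `build_candidate_set(FIRST_PAIRS, SECOND_PAIRS, 0.5)` keeps only the tuple built from "How do you do?" and "How are you?", since their edit distance of 2 does not exceed 0.5 × 4, while the weather sentence is too far from the first sub-text to qualify.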

For ease of understanding, refer to FIG. 5, which is a schematic flowchart of generating target sub-texts provided by an embodiment of the present application. As shown in FIG. 5, (E1, A1) may be a first text pair 50a and (E2, Z1) may be a second text pair 50b. (E3, A2) may be a first text pair 50c and (E4, Z2) may be a second text pair 50d; optionally, the first text pair 50c may also be called the third text pair 50c, and the second text pair 50d may also be called the fourth text pair 50d. (E5, A3) may be a first text pair 50e and (E6, Z3) may be a second text pair 50f; optionally, the first text pair 50e may also be called the fifth text pair 50e, and the second text pair 50f may also be called the sixth text pair 50f. (E7, A4) may be a first text pair 50g and (E8, Z4) may be a second text pair 50h; optionally, the first text pair 50g may also be called the seventh text pair 50g, and the second text pair 50h may also be called the eighth text pair 50h.

It can be understood that (E1, A1), (E3, A2), (E5, A3) and (E7, A4) are described here as examples of different first text pairs, and (E2, Z1), (E4, Z2), (E6, Z3) and (E8, Z4) are described here as examples of different second text pairs. In other words, (E1, A1), (E3, A2), (E5, A3) and (E7, A4) are collectively referred to as first text pairs, and (E2, Z1), (E4, Z2), (E6, Z3) and (E8, Z4) are collectively referred to as second text pairs. The first text pairs and the second text pairs here may belong to the OPUS dataset.

Here, E may denote the first language type, A may denote the second language type, and Z may denote the third language type. For example, the first language type may be English, the second language type may be Arabic, and the third language type may be Chinese. FIG. 5 therefore shows examples of semantically aligned Arabic-to-Chinese and Chinese-to-Arabic text generated by the method provided in the embodiments of the present application.

As shown in FIG. 5, the server may generate a first target sub-text 61a based on the first text pair 50a and the second text pair 50b. The first target sub-text 61a may be Z5; Z5 and the fourth sub-text Z1 both belong to the third language type, and Z5 is associated with the semantic information of the first sub-text E1. The server may also generate a second target sub-text 61b based on the first text pair 50a and the second text pair 50b. The second target sub-text 61b may be A5; A5 and the third sub-text A1 both belong to the second language type, and A5 is associated with the semantic information of the second sub-text E2.

Similarly, as shown in FIG. 5, the server may generate a first target sub-text 62a and a second target sub-text 62b based on the first text pair 50c and the second text pair 50d; the first target sub-text 62a may be Z6, and the second target sub-text 62b may be A6. The server may generate a first target sub-text 63a and a second target sub-text 63b based on the first text pair 50e and the second text pair 50f; the first target sub-text 63a may be Z7, and the second target sub-text 63b may be A7. The server may generate a first target sub-text 64a and a second target sub-text 64b based on the first text pair 50g and the second text pair 50h; the first target sub-text 64a may be Z8, and the second target sub-text 64b may be A8.

For example, the first sub-text E1 may be "Did you have anything to do with him.", the second sub-text E2 may be "Did you have anything to do with it?", and the fourth sub-text Z1 may be "你跟这事有关吗" ("Do you have anything to do with this matter"); therefore, the first target sub-text Z5 may be "你跟他有关吗？" ("Do you have anything to do with him?"). It can be understood that x¹ and x² are highly similar but still differ semantically: x¹ contains "him", whereas x² does not. The generated Chinese sentence adds "他" ("him"), which corresponds to "him" in the English sentence and eliminates the semantic difference. The generated sentence is therefore semantically aligned with x¹, that is, semantically aligned with y¹, which yields a high-quality, semantically aligned bilingual corpus (or multi-way parallel corpus).

As another example, the first sub-text E3 may be "You want justice, right?", the second sub-text E4 may be "You want out, right?", and the fourth sub-text Z2 may be "你想出去，嗯？" ("You want to go out, huh?"); therefore, the first target sub-text Z6 may be "你想伸张正义对吧，嗯？" ("You want justice, right, huh?"). As another example, the first sub-text E5 may be "Item 56 of the provisional agenda*", the second sub-text E6 may be "Item 100 of the provisional agenda*", and the fourth sub-text Z3 may be "临时议程项目100" ("Provisional agenda item 100"); therefore, the first target sub-text Z7 may be "临时议程项目56" ("Provisional agenda item 56"). As another example, the first sub-text E7 may be "What does he want?", the second sub-text E8 may be "What does he know?", and the fourth sub-text Z4 may be "他刚说什么，他知道什么？" ("What did he just say, what does he know?"); therefore, the first target sub-text Z8 may be "他想知道什么？" ("What does he want to know?").

Step S103: generate a text sample pair according to the first text pair, the second text pair, the first target sub-text and the second target sub-text.

Specifically, the server may combine the first target sub-text and the third sub-text into a first sample pair, where the first target sub-text is associated with the semantic information of the third sub-text. Further, the server may combine the second target sub-text and the fourth sub-text into a second sample pair, where the second target sub-text is associated with the semantic information of the fourth sub-text. Further, the server may determine the first text pair, the second text pair, the first sample pair and the second sample pair as the text sample pair.

Optionally, the server may also combine the first target sub-text and the first sub-text into a third sample pair. Further, the server may also combine the second target sub-text and the second sub-text into a fourth sample pair. Further, the server may determine the first text pair, the second text pair, the first sample pair, the second sample pair, the third sample pair and the fourth sample pair as the text sample pair.

Optionally, the server may further combine the first sub-text, the third sub-text and the first target sub-text into a first aligned text pair. Further, the server may combine the second sub-text, the fourth sub-text and the second target sub-text into a second aligned text pair. The first aligned text pair and the second aligned text pair may be three-way aligned text pairs.
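The combinations described for step S103 can be sketched as follows. The `Candidate` type and the function names are hypothetical and are introduced only for illustration; the order of the returned pairs is likewise an assumption.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    first_sub: str   # first language type, from the first text pair
    third_sub: str   # second language type, from the first text pair
    second_sub: str  # first language type, from the second text pair
    fourth_sub: str  # third language type, from the second text pair


def build_sample_pairs(c: Candidate, first_target: str, second_target: str,
                       include_optional: bool = False
                       ) -> List[Tuple[str, str]]:
    """Collect the text pairs that together form the text sample pair."""
    pairs = [
        (c.first_sub, c.third_sub),     # first text pair
        (c.second_sub, c.fourth_sub),   # second text pair
        (first_target, c.third_sub),    # first sample pair
        (second_target, c.fourth_sub),  # second sample pair
    ]
    if include_optional:
        pairs.append((first_target, c.first_sub))    # third sample pair
        pairs.append((second_target, c.second_sub))  # fourth sample pair
    return pairs


def build_aligned_triples(c: Candidate, first_target: str,
                          second_target: str
                          ) -> List[Tuple[str, str, str]]:
    """Three-way aligned text pairs, one per original text pair."""
    return [
        (c.first_sub, c.third_sub, first_target),
        (c.second_sub, c.fourth_sub, second_target),
    ]
```

With the labels of FIG. 5, `build_sample_pairs(Candidate("E1", "A1", "E2", "Z1"), "Z5", "A5")` would contain the first sample pair `("Z5", "A1")` and the second sample pair `("A5", "Z1")`.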

For ease of understanding, please refer to FIG. 6, which is a schematic diagram of a scenario for generating a text sample pair according to an embodiment of the present application. The first text pair 60a shown in FIG. 6 may be the first text pair 50a in the embodiment corresponding to FIG. 5 above, and the second text pair 60b shown in FIG. 6 may be the second text pair 50b in the embodiment corresponding to FIG. 5 above. E1 and E2 may be sub-texts of the first language type, A1 may be a sub-text of the second language type, and Z1 may be a sub-text of the third language type. E1 may be the first sub-text, E2 may be the second sub-text, A1 may be the third sub-text, and Z1 may be the fourth sub-text.

As shown in FIG. 6, Z5 may be the first target sub-text generated based on the first text pair 60a and the second text pair 60b, and A5 may be the second target sub-text generated based on the first text pair 60a and the second text pair 60b. The first target sub-text Z5 is associated with the semantic information of the first sub-text E1 and the third sub-text A1, and the second target sub-text A5 is associated with the semantic information of the second sub-text E2 and the fourth sub-text Z1.

As shown in FIG. 6, the server may combine the first target sub-text Z5 and the third sub-text A1 into a first sample pair 60c, and combine the second target sub-text A5 and the fourth sub-text Z1 into a second sample pair 60d. Optionally, the server may also combine the first target sub-text Z5 and the first sub-text E1 into a third sample pair (not shown in the figure), and combine the second target sub-text A5 and the second sub-text E2 into a fourth sample pair (not shown in the figure). Further, the server may determine the first text pair 60a, the second text pair 60b, the first sample pair 60c, the second sample pair 60d, the third sample pair and the fourth sample pair as the text sample pair.

It can be seen that the embodiments of the present application can generate a multi-way aligned corpus through an extraction step and a generation step. The extraction step determines the edit distance between the first sub-text in the first text pair and the second sub-text in the second text pair, and takes the first text pair and the second text pair whose edit distance satisfies the similarity condition as a candidate sentence pair. The generation step eliminates the semantic difference between the first sub-text and the second sub-text, converting the "partially aligned" candidate sentence pair into a "fully aligned" text sample pair. On this basis, the extraction and generation steps can quickly and accurately produce a large amount of high-quality, semantically aligned corpus data (that is, text sample pairs), so that the size of the corpus is increased while its quality is preserved.

Further, please refer to FIG. 7, which is a schematic flowchart of a text data processing method according to an embodiment of the present application. The method may be performed by a server, by a user terminal, or jointly by a server and a user terminal. The server may be the server 20a in the embodiment corresponding to FIG. 2 above, and the user terminal may be the user terminal 20b in the embodiment corresponding to FIG. 2 above. For ease of understanding, the embodiments of the present application are described by taking the case in which the method is performed by a server as an example. The text data processing method may include the following steps S201 to S202:

Step S201: determine the edit distance between the first sub-text and the second sub-text, and if the edit distance satisfies the similarity condition, obtain the network model set associated with the first sub-text and the second sub-text.

Specifically, the server may extract first unit texts from the first sub-text and second unit texts from the second sub-text. Further, the server may determine the edit distance between the first sub-text and the second sub-text based on the first unit texts and the second unit texts. Further, if the edit distance satisfies the similarity condition, the server may obtain the network model set associated with the first sub-text and the second sub-text.

The network model set may include a first target network model and a second target network model. The first target network model is obtained by training a first initial network model, and the second target network model is obtained by training a second initial network model. It can be understood that the first target network model and the first initial network model are associated with the first language type and the third language type, and the second target network model and the second initial network model are associated with the first language type and the second language type.

Optionally, the network model set may further include a third target network model, a fourth target network model and a fifth target network model. The third target network model is obtained by training a third initial network model, the fourth target network model is obtained by training a fourth initial network model, and the fifth target network model is obtained by training a fifth initial network model. It can be understood that the third target network model and the third initial network model are associated with the first language type and a fourth language type, the fourth target network model and the fourth initial network model are associated with the first language type and a fifth language type, and the fifth target network model and the fifth initial network model are associated with the first language type and a sixth language type.

It should be understood that a target network model in the network model set (for example, the first target network model or the second target network model) may be an NMT (Neural Machine Translation) model; the embodiments of the present application do not limit the model type of the target network model. It can be understood that, for the specific process of training an initial network model (for example, the first initial network model or the second initial network model) to obtain a target network model (for example, the first target network model corresponding to the first initial network model, or the second target network model corresponding to the second initial network model), reference may be made to the description of steps S302 to S305 in the embodiment corresponding to FIG. 9 below.

Step S202: based on the network model set, generate a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generate a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

Specifically, the server may concatenate the first sub-text and the fourth sub-text to obtain a first concatenated text. Further, the server may input the first concatenated text into the first target network model, and generate, through the first target network model, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type. Further, the server may concatenate the second sub-text and the third sub-text to obtain a second concatenated text. Further, the server may input the second concatenated text into the second target network model, and generate, through the second target network model, the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

The output of the first target sub-text by the first target network model based on the first concatenated text may be expressed by the following formula (2):

$$\tilde{y}^{2} = m\left([x^{1}; y^{2}]\right) \tag{2}$$

where $[x^{1}; y^{2}]$ may denote the first concatenated text, that is, the concatenation of the first sub-text $x^{1}$ and the fourth sub-text $y^{2}$; $m$ may denote the first target network model; and $\tilde{y}^{2}$ may denote the first target sub-text.

The output of the second target sub-text by the second target network model based on the second concatenated text may be expressed by the following formula (3):

$$\tilde{y}^{1} = m\left([x^{2}; y^{1}]\right) \tag{3}$$

where $[x^{2}; y^{1}]$ may denote the second concatenated text, that is, the concatenation of the second sub-text $x^{2}$ and the third sub-text $y^{1}$; $m$ may denote the second target network model; and $\tilde{y}^{1}$ may denote the second target sub-text.

It can be understood that, in formula (2) and formula (3), $m(x)$ means taking $x$ as the input and running the decoding process of $m$ to obtain the prediction result of the model. By taking the concatenation of the first sub-text $x^{1}$ and the fourth sub-text $y^{2}$ as the model input, the output $\tilde{y}^{2}$ is obtained. The generated $\tilde{y}^{2}$ is semantically aligned with $x^{1}$ (that is, $\tilde{y}^{2}$ is associated with the semantic information of $x^{1}$), and is correspondingly also semantically aligned with the third sub-text $y^{1}$ (that is, $\tilde{y}^{2}$ is associated with the semantic information of $y^{1}$). Likewise, by taking the concatenation of the second sub-text $x^{2}$ and the third sub-text $y^{1}$ as the model input, the output $\tilde{y}^{1}$ is obtained. The generated $\tilde{y}^{1}$ is semantically aligned with $x^{2}$ (that is, $\tilde{y}^{1}$ is associated with the semantic information of $x^{2}$), and is correspondingly also semantically aligned with the fourth sub-text $y^{2}$ (that is, $\tilde{y}^{1}$ is associated with the semantic information of $y^{2}$).

It can be understood that, when the prediction results of the model are highly accurate, "associated" has the same meaning as "the same". That is, the first target sub-text has the same semantic information as the first sub-text and the third sub-text, and the second target sub-text has the same semantic information as the second sub-text and the fourth sub-text.

The first target network model includes an encoder for encoding and a decoder for decoding. It should be understood that the specific process by which the server generates the first target sub-text through the first target network model may be described as follows. The server may input the first concatenated text into the encoder of the first target network model, and encode the first concatenated text through that encoder to obtain a first feature vector corresponding to the first concatenated text. Further, the server may input the first feature vector into the decoder of the first target network model, and decode the first feature vector through that decoder to obtain a first text vector corresponding to the first feature vector. Further, the server may generate, based on the first text vector, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type.

Similarly, the second target network model includes an encoder for encoding and a decoder for decoding. It should be understood that the specific process by which the server generates the second target sub-text through the second target network model may be described as follows. The server may input the second concatenated text into the encoder of the second target network model, and encode the second concatenated text through that encoder to obtain a second feature vector corresponding to the second concatenated text. Further, the server may input the second feature vector into the decoder of the second target network model, and decode the second feature vector through that decoder to obtain a second text vector corresponding to the second feature vector. Further, the server may generate, based on the second text vector, the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

It can be understood that the server may vectorize the first concatenated text to obtain a first concatenated vector corresponding to the first concatenated text, input the first concatenated vector into the first target network model, and encode the first concatenated vector through the first target network model. Similarly, the server may vectorize the second concatenated text to obtain a second concatenated vector corresponding to the second concatenated text, input the second concatenated vector into the second target network model, and encode the second concatenated vector through the second target network model.

For ease of understanding, please refer to FIG. 8, which is a schematic diagram of a scenario for generating target sub-texts according to an embodiment of the present application. As shown in FIG. 8, the server may obtain a candidate text pair from the candidate set. The candidate text pair here may be the candidate text pair 80a, which may be the candidate text pair 41c in the embodiment corresponding to FIG. 4. The target network model 80d may be the first target network model 80d, which may include a first encoder and a first decoder; the target network model 80e may be the second target network model 80e, which may include a second encoder and a second decoder.

As shown in FIG. 8, the server may generate concatenated texts based on the candidate text pair 80a. The concatenated texts may include a first concatenated text, which may be the first concatenated text 80b, and a second concatenated text, which may be the second concatenated text 80c. The first concatenated text 80b may be expressed as "first sub-text; fourth sub-text", and the second concatenated text 80c may be expressed as "second sub-text; third sub-text".

As shown in FIG. 8, the server may input the first concatenated text 80b into the first target network model 80d and input the second concatenated text 80c into the second target network model 80e. It can be understood that the first encoder in the first target network model 80d may output the first feature vector corresponding to the first concatenated text 80b, and the first decoder in the first target network model 80d may output the first text vector corresponding to the first feature vector, so that the first target sub-text can be determined based on the first text vector. Similarly, the second encoder in the second target network model 80e may output the second feature vector corresponding to the second concatenated text 80c, and the second decoder in the second target network model 80e may output the second text vector corresponding to the second feature vector, so that the second target sub-text can be determined based on the second text vector.
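The flow of FIG. 8 can be sketched as follows, treating each target network model as an opaque callable from a concatenated text to a generated text. The separator string, the function names and the callable interface are assumptions made for illustration; the text only states that the two sub-texts are concatenated and fed to an encoder-decoder model.

```python
from typing import Callable, Tuple

# Hypothetical separator between the two concatenated sub-texts.
SEPARATOR = " ; "

# A target network model viewed as a function: concatenated text -> output.
Model = Callable[[str], str]


def concatenate(source_sub: str, reference_sub: str) -> str:
    """Join a first-language sub-text with a sub-text from the other pair."""
    return f"{source_sub}{SEPARATOR}{reference_sub}"


def generate_targets(candidate: Tuple[str, str, str, str],
                     first_model: Model,
                     second_model: Model) -> Tuple[str, str]:
    """Return (first target sub-text, second target sub-text)."""
    first_sub, third_sub, second_sub, fourth_sub = candidate
    # First concatenated text: first sub-text with fourth sub-text.
    first_concat = concatenate(first_sub, fourth_sub)
    # Second concatenated text: second sub-text with third sub-text.
    second_concat = concatenate(second_sub, third_sub)
    return first_model(first_concat), second_model(second_concat)
```

In practice `first_model` and `second_model` would wrap trained encoder-decoder models; any stand-in callable with the same interface is enough to exercise the data flow.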

Optionally, if the edit distance does not satisfy the similarity condition, the server may determine that the first text pair and the second text pair do not satisfy the similarity condition, obtain a new first text pair (for example, a third text pair) and a new second text pair (for example, a fourth text pair) from the candidate set, and then perform on the third text pair and the fourth text pair the same operations as those performed on the first text pair and the second text pair above. The third text pair may include a fifth sub-text belonging to the first language type and a seventh sub-text belonging to a fourth language type, and the fourth text pair may include a sixth sub-text belonging to the first language type and an eighth sub-text belonging to a fifth language type. It should be understood that the embodiments of the present application do not limit the language types of the seventh sub-text and the eighth sub-text.

It can be seen that the embodiments of the present application can eliminate semantic differences through the network model set: the first target sub-text and the second target sub-text are generated based on the "partially aligned" first text pair and second text pair, and then, based on the first text pair, the second text pair, the first target sub-text and the second target sub-text, fully aligned parallel sentence pairs $(y^{1}, \tilde{y}^{2})$ in the $(Y^{1}, Y^{2})$ languages (the third sub-text with the first target sub-text) and parallel sentence pairs $(y^{2}, \tilde{y}^{1})$ in the $(Y^{2}, Y^{1})$ languages (the fourth sub-text with the second target sub-text) are generated. In this way, the size of the corpus is increased while its quality is preserved.

Further, please refer to FIG. 9, which is a schematic flowchart of a text data processing method according to an embodiment of the present application. The method may be performed by a server, by a user terminal, or jointly by a server and a user terminal. The server may be the server 20a in the embodiment corresponding to FIG. 2 above, and the user terminal may be the user terminal 20b in the embodiment corresponding to FIG. 2 above. For ease of understanding, the embodiments of the present application are described by taking the case in which the method is performed by a server as an example. The text data processing method may include the following steps S301 to S309:

Step S301: obtain a first text pair and a second text pair, obtain a first sub-text from the first text pair, and obtain a second sub-text from the second text pair.

The first sub-text and the second sub-text both belong to the first language type. The first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to the second language type. The second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to the third language type.

Step S302: determine the edit distance between the first sub-text and the second sub-text, and if the edit distance satisfies the similarity condition, obtain a first initial network model associated with the first language type and the third language type.

Optionally, the edit distance may be used to determine the text similarity between the first sub-text and the second sub-text. The embodiments of the present application may also use other similarity measures to determine this text similarity, for example sentence embeddings or TF-IDF (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining).
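As one illustration of such an alternative measure, the sketch below computes TF-IDF vectors and their cosine similarity in plain Python. The lowercased whitespace tokenization, the smoothed IDF formula and the function names are assumptions chosen for brevity, not details taken from the embodiments.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (token -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {tok: math.log((1 + n_docs) / (1 + cnt)) + 1.0
           for tok, cnt in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({tok: (cnt / len(toks)) * idf[tok]
                        for tok, cnt in tf.items()})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A similarity condition based on this measure would compare `cosine(...)` against a chosen threshold, in the same way that the edit distance is compared against its own condition.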

It can be understood that, when the edit distance satisfies the similarity condition, the server may take the first text pair and the second text pair that satisfy the similarity condition as a candidate text pair, and add the candidate text pair to the candidate set. Further, the server may obtain the first initial network model associated with the first language type and the third language type, and then train the first initial network model in the following steps S303 to S305 based on the candidate text pairs in the candidate set. Similarly, the server may obtain the second initial network model associated with the first language type and the second language type, and then train the second initial network model based on the candidate text pairs in the candidate set.

其中，服务器确定第一子文本和第二子文本之间的编辑距离的具体过程，可以参见上述图3所对应实施例中对步骤S102的描述，这里将不再进行赘述。The specific process for the server to determine the edit distance between the first sub-text and the second sub-text may refer to the description of step S102 in the embodiment corresponding to FIG. 3 , which will not be repeated here.
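As an illustration of the candidate selection in step S302, the following minimal sketch computes a token-level edit distance and keeps the text pairs whose first-language sides are close. It assumes whitespace tokenization and a plain `distance < threshold` test; the names `edit_distance`, `select_candidates` and `threshold` are hypothetical, and the actual similarity condition is the one described for step S102.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance (insert, delete, replace each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,           # delete tok_a
                            curr[j - 1] + 1,       # insert tok_b
                            prev[j - 1] + cost))   # replace or match
        prev = curr
    return prev[-1]


Pair = Tuple[str, str]  # (first-language sentence, other-language sentence)


def select_candidates(pairs_a: List[Pair], pairs_b: List[Pair],
                      threshold: int) -> List[Tuple[Pair, Pair]]:
    """Keep (pair_a, pair_b) whose first-language sides differ by < threshold edits."""
    candidates = []
    for pair_a in pairs_a:
        for pair_b in pairs_b:
            distance = edit_distance(pair_a[0].split(), pair_b[0].split())
            if distance < threshold:  # assumed form of the similarity condition
                candidates.append((pair_a, pair_b))
    return candidates


print(edit_distance("How do you do ?".split(), "How are you ?".split()))  # 2
```

The quadratic scan over both pair lists is only for readability; a real corpus would need indexing or blocking to avoid comparing every pair.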

步骤S303，对第四子文本进行编辑变换，得到与第四子文本相关联的第一变换文本；Step S303, editing and transforming the fourth sub-text to obtain the first transformed text associated with the fourth sub-text;

Specifically, the server may generate a random transformation probability for each of N unit texts, and determine each unit text whose random transformation probability falls within a transformable probability interval as a unit text to be edited. Here the fourth sub-text contains the N unit texts, and N may be a positive integer. Further, the server may obtain the editing transformation mode associated with the unit text to be edited, edit and transform the unit text to be edited according to that editing transformation mode to obtain an editing transformation result, and generate the first transformed text associated with the fourth sub-text according to the editing transformation result.

It can be understood that the server may perform word segmentation on the fourth sub-text to obtain its unit texts, and thereby determine that the number of unit texts is N. For example, when the fourth sub-text is "How do you do?", the number of unit texts may be 5, and the 5 unit texts may be "How", "do", "you", "do" and "?".

It can be understood that the server may generate a random transformation probability for each unit text, and this probability may be a randomly generated decimal between 0 and 1. For example, the random transformation probability of the unit text "How" may be G1 (for example, G1 equals 0.1), that of the unit text "do" may be G2 (for example, G2 equals 0.2), ..., and that of the unit text "?" may be G5 (for example, G5 equals 0.7).

It can be understood that the transformable probability interval may be a sub-interval of [0, 1]; for example, it may be [0, 0.25] or [0.25, 0.35]. For example, when the transformable probability interval is [0, 0.25] and the 5 unit texts are "How", "do", "you", "do" and "?", the random transformation probability G1 of the unit text "How" (G1 equals 0.1) and the random transformation probability G2 of the unit text "do" (G2 equals 0.2) fall within the transformable probability interval. The server may therefore determine the unit text "How" and the unit text "do" among the 5 unit texts as the unit texts to be edited.

Optionally, in other words, the server may obtain a transformation threshold associated with the N unit texts, generate a random transformation probability for each of the N unit texts, and determine each unit text whose random transformation probability is less than the transformation threshold as a unit text to be edited. In this case, the transformable probability interval corresponding to the transformation threshold may be [0, transformation threshold).
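The threshold-based selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the values used for G3 and G4 are assumptions, since the text only gives G1, G2 and G5.

```python
import random
from typing import List, Sequence


def draw_probabilities(n_units: int, rng: random.Random) -> List[float]:
    """Draw one random transformation probability in [0, 1) per unit text."""
    return [rng.random() for _ in range(n_units)]


def units_to_edit(probabilities: Sequence[float], threshold: float) -> List[int]:
    """Return the indices whose probability lies in [0, threshold)."""
    return [i for i, p in enumerate(probabilities) if p < threshold]


units = ["How", "do", "you", "do", "?"]
# G1, G2 and G5 come from the example in the text; G3 = 0.5 and G4 = 0.9 are
# assumed values, chosen only so that they fall outside the interval.
probabilities = [0.1, 0.2, 0.5, 0.9, 0.7]
selected = units_to_edit(probabilities, threshold=0.25)
print([units[i] for i in selected])  # ['How', 'do']
```

In practice the probabilities would come from `draw_probabilities` rather than a fixed list; the fixed list is used here only to reproduce the worked example.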

It should be understood that the specific process by which the server edits and transforms the unit text to be edited according to the editing transformation mode to obtain the editing transformation result may be described as follows. The server may obtain the editing transformation mode associated with the unit text to be edited. If the editing transformation mode is a replacement operation, the server may obtain the dictionary table corresponding to the third language type, obtain a first editing text from the dictionary table, and replace the unit text to be edited with the first editing text to obtain the editing transformation result. If the editing transformation mode is an insertion operation, the server may obtain a second editing text from the dictionary table and insert it at a position adjacent to the unit text to be edited to obtain the editing transformation result. If the editing transformation mode is a deletion operation, the server may delete the unit text to be edited from the fourth sub-text to obtain the editing transformation result.

其中，词典表即单词列表，还可以称之为词典，该词典用于在替换操作和插入操作时，选取用于替换或者插入的编辑文本。可以理解的是，不同语言类型可以对应不同的词典表，因此，第三语言类型对应的词典表可以为第一词典表，第二语言类型对应的词典表可以为第二词典表，以此类推，可以分别得到其他语言类型对应的词典表。Among them, the dictionary table is a word list, which can also be called a dictionary, and the dictionary is used to select edit texts for replacement or insertion during the replacement operation and the insertion operation. It can be understood that different language types may correspond to different dictionary tables. Therefore, the dictionary table corresponding to the third language type may be the first dictionary table, the dictionary table corresponding to the second language type may be the second dictionary table, and so on. , the dictionary tables corresponding to other language types can be obtained respectively.

Optionally, different language types may share one dictionary table, which may store editing texts of different language types, for example the first editing text and the second editing text corresponding to the third language type and a third editing text corresponding to the second language type. Accordingly, when editing and transforming the fourth sub-text, the first or second editing text corresponding to the third language type may be obtained from the dictionary table; when editing and transforming other sub-texts, editing texts corresponding to other language types may be obtained from the dictionary table. For example, when editing and transforming the third sub-text, the third editing text corresponding to the second language type may be obtained from the dictionary table.

It can be understood that the number of unit texts to be edited may be zero, one or more. When there are no unit texts to be edited, the server does not need to edit and transform the fourth sub-text. When there is one unit text to be edited, the server may edit and transform that unit text, thereby editing and transforming the fourth sub-text. When there are multiple unit texts to be edited, the server may obtain the editing transformation mode corresponding to each of them and edit and transform each unit text according to its own editing transformation mode, thereby editing and transforming the fourth sub-text.

The editing transformation modes corresponding to the multiple unit texts to be edited may include the replacement operation, the insertion operation and the deletion operation. The editing transformation mode of each unit text to be edited is chosen independently and may be, at random, any of the replacement operation, the insertion operation or the deletion operation.

It can be understood that the position adjacent to the unit text to be edited may be the position immediately before it or the position immediately after it. The embodiments of the present application are described taking the adjacent position to be, uniformly, the position immediately before the unit text to be edited.
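The three editing transformation modes can be sketched as below. This is an illustrative sketch under stated assumptions: `apply_operation`, `edit_transform` and `word_list` are hypothetical names, the dictionary table is modeled as a plain list of words, and insertion places the new word immediately before the unit text, following the convention stated above.

```python
import random
from typing import List, Sequence


def apply_operation(unit: str, operation: str,
                    word_list: Sequence[str], rng: random.Random) -> List[str]:
    """Return the unit texts that take the place of `unit` after one edit."""
    if operation == "replace":
        return [rng.choice(word_list)]          # swap in a dictionary word
    if operation == "insert":
        return [rng.choice(word_list), unit]    # insert before the unit
    if operation == "delete":
        return []                               # drop the unit
    raise ValueError(f"unknown operation: {operation}")


def edit_transform(units: Sequence[str], positions: Sequence[int],
                   word_list: Sequence[str], rng: random.Random) -> List[str]:
    """Apply an independently chosen random operation at each selected position."""
    selected = set(positions)
    result: List[str] = []
    for index, unit in enumerate(units):
        if index in selected:
            operation = rng.choice(("replace", "insert", "delete"))
            result.extend(apply_operation(unit, operation, word_list, rng))
        else:
            result.append(unit)  # unselected units are copied unchanged
    return result
```

Calling `edit_transform(["How", "do", "you", "do", "?"], [0, 1], word_list, random.Random())` would edit only the first two unit texts, matching the earlier example; passing an empty position list leaves the sub-text unchanged.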

步骤S304，对第二子文本和第一变换文本进行拼接处理，得到第一拼接样本；Step S304, performing splicing processing on the second subtext and the first transformed text to obtain a first splicing sample;

步骤S305，通过第一初始网络模型获取第一拼接样本的第一样本向量，基于第一样本向量和第四子文本，对第一初始网络模型进行模型训练，得到第一目标网络模型；Step S305, obtaining the first sample vector of the first spliced sample through the first initial network model, and performing model training on the first initial network model based on the first sample vector and the fourth sub-text to obtain the first target network model;

具体的，服务器可以通过第一初始网络模型获取第一拼接样本的第一样本向量，基于第一样本向量，生成属于第三语言类型的预测样本子文本。进一步地，服务器可以获取预测样本子文本与第四子文本之间的样本语义相似度，根据样本语义相似度，生成第一初始网络模型的模型损失函数。进一步地，服务器可以基于模型损失函数对第一初始网络模型进行参数调整，得到第一目标网络模型。Specifically, the server may obtain the first sample vector of the first spliced sample through the first initial network model, and generate the predicted sample sub-text belonging to the third language type based on the first sample vector. Further, the server may obtain the sample semantic similarity between the predicted sample sub-text and the fourth sub-text, and generate a model loss function of the first initial network model according to the sample semantic similarity. Further, the server may adjust parameters of the first initial network model based on the model loss function to obtain the first target network model.

其中，第一初始网络模型包括用于编码处理的编码器和用于解码处理的解码器。应当理解，服务器通过第一初始网络模型获取第一拼接样本的第一样本向量的具体过程可以描述为：服务器可以将第一拼接样本输入至第一初始网络模型中的编码器，通过第一初始网络模型中的编码器对第一拼接样本进行编码处理，得到第一拼接样本对应的第一编码向量。进一步地，服务器可以将第一编码向量输入至第一初始网络模型中的解码器，通过第一初始网络模型中的解码器对第一编码向量进行解码处理，得到第一编码向量对应的第一样本向量(即第一拼接样本的第一样本向量)。Wherein, the first initial network model includes an encoder for encoding processing and a decoder for decoding processing. It should be understood that the specific process for the server to obtain the first sample vector of the first spliced sample through the first initial network model can be described as follows: the server can input the first spliced sample into the encoder in the first initial network model, The encoder in the initial network model encodes the first spliced sample to obtain a first encoding vector corresponding to the first spliced sample. Further, the server may input the first encoding vector into the decoder in the first initial network model, and perform decoding processing on the first encoding vector by the decoder in the first initial network model to obtain the first encoding vector corresponding to the first encoding vector. The sample vector (ie the first sample vector of the first stitched sample).

It should be understood that the specific process by which the server adjusts the parameters of the first initial network model based on the model loss function (i.e., iteratively trains the first initial network model) may be described as follows. When the model loss function of the first initial network model does not satisfy the model convergence condition, the server may adjust the model parameters of the first initial network model based on that model loss function. Further, the server may take the first initial network model with adjusted parameters as a transition network model and iteratively train the transition network model until the model loss function of the iteratively trained transition network model satisfies the model convergence condition, at which point the transition network model that satisfies the model convergence condition is taken as the first target network model.
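The iterate-until-converged procedure can be sketched generically as follows. The text does not specify the convergence condition, so this sketch assumes "loss at or below a threshold" plus an iteration cap; the scalar parameter and the quadratic toy loss are stand-ins for a real encoder-decoder and its loss, and all names are hypothetical.

```python
from typing import Callable, Tuple


def train_until_converged(params: float,
                          loss_fn: Callable[[float], float],
                          update_fn: Callable[[float], float],
                          loss_threshold: float,
                          max_iters: int = 10_000) -> Tuple[float, int]:
    """Adjust params until the loss meets the (assumed) convergence condition."""
    for iteration in range(max_iters):
        if loss_fn(params) <= loss_threshold:  # convergence condition satisfied
            return params, iteration           # plays the role of the target model
        params = update_fn(params)             # plays the role of the transition model
    return params, max_iters


def toy_loss(w: float) -> float:
    return (w - 3.0) ** 2


def toy_update(w: float) -> float:
    # One gradient-descent step on toy_loss with learning rate 0.1.
    return w - 0.1 * 2.0 * (w - 3.0)


final_w, iterations = train_until_converged(0.0, toy_loss, toy_update, loss_threshold=1e-6)
print(final_w, iterations)
```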

It should be understood that, for an aligned example (x¹, y¹, x², y²) in the candidate set (i.e., a candidate text pair, where x² denotes the second sub-text and y² denotes the fourth sub-text), the training sample ([x²; ŷ²], y²) can be generated. Here [x²; ŷ²] is the input of the first initial network model (i.e., the first spliced sample), ";" denotes the splicing operation, y² is the prediction target of the first initial network model, and ŷ² is the sentence generated by applying replacement, insertion and deletion operations to y² (i.e., the first transformed text); that is, the word at each position in y² has probability β (i.e., the transformation threshold) of undergoing a replacement, insertion or deletion operation. When the first initial network model is trained on the training corpus constructed in this way, its prediction results have the following two characteristics: the prediction result is semantically aligned with the left part of the input (i.e., x²), and the prediction result can be obtained from the right part of the input (i.e., ŷ²) through replacement, insertion and deletion operations.

Optionally, the process by which the server trains the second initial network model to obtain the second target network model may be described as follows. The server may edit and transform the third sub-text to obtain a second transformed text associated with the third sub-text. Further, the server may splice the first sub-text and the second transformed text to obtain a second spliced sample. Further, the server may obtain a second sample vector of the second spliced sample through the second initial network model, and train the second initial network model based on the second sample vector and the third sub-text to obtain the second target network model.

应当理解，服务器对第三子文本进行编辑变换的具体过程，可以参见上述对第四子文本进行编辑变换的描述，这里将不再进行赘述。应当理解，服务器获取第二拼接样本的第二样本向量的具体过程，可以参见上述获取第一拼接样本的第一样本向量的描述，这里将不再进行赘述。应当理解，服务器基于第二样本向量和第三子文本对第二初始网络模型进行模型训练的具体过程，可以参见上述基于第一样本向量和第四子文本对第一初始网络模型进行模型训练的描述，这里将不再进行赘述。It should be understood that, for the specific process of editing and transforming the third sub-text by the server, reference may be made to the above description of editing and transforming the fourth sub-text, which will not be repeated here. It should be understood that, for the specific process for the server to obtain the second sample vector of the second spliced sample, reference may be made to the above description of obtaining the first sample vector of the first spliced sample, which will not be repeated here. It should be understood that, for the specific process that the server performs model training on the second initial network model based on the second sample vector and the third sub-text, refer to the above-mentioned model training on the first initial network model based on the first sample vector and the fourth sub-text. description, which will not be repeated here.

步骤S306，获取与第一子文本和第二子文本相关联的网络模型集合；Step S306, acquiring network model sets associated with the first subtext and the second subtext;

可以理解的是，服务器可以将上述步骤S302-步骤S305训练得到的第一目标网络模型和第二目标网络模型，确定为网络模型集合。It can be understood that the server may determine the first target network model and the second target network model obtained by training in the above steps S302 to S305 as a network model set.

The first initial network model and the first target network model may be collectively referred to as the first generalization model; they are the names of the first generalization network at different stages. In the training stage the first generalization network may be called the first initial network model, and in the prediction stage it may be called the first target network model. Similarly, the second initial network model and the second target network model may be collectively referred to as the second generalization model; they are the names of the second generalization network at different stages. In the training stage the second generalization network may be called the second initial network model, and in the prediction stage it may be called the second target network model.

步骤S307，基于网络模型集合，生成与第一子文本的语义信息相关联且属于第三语言类型的第一目标子文本，生成与第二子文本的语义信息相关联且属于第二语言类型的第二目标子文本；Step S307, based on the network model set, generate a first target subtext that is associated with the semantic information of the first subtext and belongs to the third language type, and generates a second target subtext that is associated with the semantic information of the second subtext and belongs to the second language type. the second target subtext;

其中，可以理解的是，第一目标子文本是指能够由第一拼接文本经过编辑变换后所得到的子文本，该第一目标子文本与第四子文本具有类似的结构和单词组成；第二目标子文本是指能够由第二拼接文本经过编辑变换后所得到的子文本，该第二目标子文本与第三子文本具有类似的结构和单词组成。其中，第一拼接文本是对第一子文本和第四子文本进行拼接处理所得到的，第二拼接文本是对第二子文本和第三子文本进行拼接处理所得到的。It can be understood that the first target sub-text refers to a sub-text that can be obtained by editing and transforming the first spliced text, and the first target sub-text and the fourth sub-text have similar structures and words; The second target subtext refers to a subtext that can be obtained by editing and transforming the second spliced text, and the second target subtext and the third subtext have a similar structure and word composition. The first spliced text is obtained by splicing the first sub-text and the fourth sub-text, and the second spliced text is obtained by splicing the second sub-text and the third sub-text.

其中，服务器基于网络模型集合，生成第一目标子文本和第二目标子文本的具体过程，可以参见上述图7所对应实施例中对步骤S202的描述，这里将不再进行赘述。The specific process for the server to generate the first target sub-text and the second target sub-text based on the network model set may refer to the description of step S202 in the embodiment corresponding to FIG. 7 , which will not be repeated here.

步骤S308，根据第一文本对、第二文本对、第一目标子文本和第二目标子文本，生成文本样本对；Step S308, generating a text sample pair according to the first text pair, the second text pair, the first target sub-text and the second target sub-text;

Steps S302 to S308 may specifically comprise three steps: a training data generation step, a model training step and a model generation step. In these three steps, given a candidate sentence pair (x¹, y¹, x², y²), a generalization network model m, a noise probability β (i.e., the transformation threshold) and a word list W_b, a semantically aligned corpus (x¹, y¹, ỹ²) can be returned, where ỹ² denotes the sub-text generated by the model m.

In the model training step, the model m may be initialized; if the model m has not converged, training data is generated through the training data generation step. When generating training data, a random number α (i.e., the random transformation probability) may be generated for each position of y² (the fourth sub-text); if the random number α is less than the noise probability β, an insertion, deletion or replacement operation is randomly performed at the corresponding position of y² based on the word list W_b, yielding the first transformed text (i.e., ŷ²). Further, in the model training step, the training sample ([x²; ŷ²], y²) can be generated based on the first transformed text and (x², y²), where x² is the second sub-text, and the model m can then be trained on ([x²; ŷ²], y²). It can be understood that if the model m converges, the model m is returned; if the model m has not converged, training data continues to be generated through the training data generation step. For ease of understanding, the description here takes the generated training data to be a single example (i.e., the first transformed text).
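The training data generation step can be sketched end to end as follows: draw α per position, apply a random operation when α < β, then splice the noised sentence onto the first-language sentence. This is a sketch under assumptions, not the patented implementation: sentences are token lists, the splicing operator is modeled as a literal ";" token, and `add_noise` and `make_training_example` are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple

SEPARATOR = ";"  # assumed stand-in for the splicing operator


def add_noise(tokens: Sequence[str], beta: float,
              word_list: Sequence[str], rng: random.Random) -> List[str]:
    """Noise each position with probability beta using a random edit operation."""
    noised: List[str] = []
    for token in tokens:
        alpha = rng.random()                 # random transformation probability
        if alpha >= beta:
            noised.append(token)             # position left untouched
            continue
        operation = rng.choice(("insert", "delete", "replace"))
        if operation == "insert":
            noised.extend([rng.choice(word_list), token])
        elif operation == "replace":
            noised.append(rng.choice(word_list))
        # "delete": the token is simply dropped
    return noised


def make_training_example(x2: Sequence[str], y2: Sequence[str], beta: float,
                          word_list: Sequence[str],
                          rng: random.Random) -> Tuple[List[str], List[str]]:
    """Build (spliced input, prediction target) from one sentence pair."""
    source = list(x2) + [SEPARATOR] + add_noise(y2, beta, word_list, rng)
    return source, list(y2)
```

With `beta = 0` the input is simply the two sentences spliced together; larger values of `beta` force the model to repair more of the right-hand part from the left-hand part.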

In the model generation step, [x¹; y²] (the first sub-text spliced with the fourth sub-text) is input into the trained model m, and the inference step described above yields the final aligned example ỹ² paired with (x¹, y¹), after which (x¹, y¹, ỹ²) is returned. It can be understood that there may be multiple candidate sentence pairs (x¹, y¹, x², y²), and an aligned example may be generated for each of them. It should be understood that, because candidate sentence pairs are selected by edit distance, the model generation step can more easily imitate the three operations underlying the edit distance to eliminate semantic differences, which makes the approach highly practicable.
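The data flow of the model generation step can be sketched as below. The trained model m is represented by an arbitrary callable; `copy_right_part` is a placeholder that merely echoes the right-hand part of the input so that the sketch runs, and is not a trained model. All names, and the literal ";" separator token, are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
Generator = Callable[[Tokens], Tokens]  # stand-in for a trained model m
SEPARATOR = ";"                         # assumed stand-in for the splicing operator


def generate_aligned(x1: Tokens, y1: Tokens, y2: Tokens,
                     model: Generator) -> Tuple[Tokens, Tokens, Tokens]:
    """Feed the spliced input [x1; y2] to the model; return (x1, y1, generated y2)."""
    y2_generated = model(x1 + [SEPARATOR] + y2)
    return x1, y1, y2_generated


def copy_right_part(spliced: Tokens) -> Tokens:
    """Placeholder model: returns the right part of the input unchanged.

    A trained model would instead edit this part so that it aligns with
    the left part; this stub only makes the data flow runnable.
    """
    return spliced[spliced.index(SEPARATOR) + 1:]


x1 = ["I", "like", "tea", "."]
y1 = ["Ich", "mag", "Tee", "."]
y2 = ["J'aime", "le", "lait", "."]
print(generate_aligned(x1, y1, y2, copy_right_part))
```

Running the same function in the other direction, with [x²; y¹] and the second model, would yield the second aligned example for the same candidate pair.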

可以理解的是，文本样本对可以用于对初始翻译模型进行模型训练，即文本样本对可以作为用于训练初始翻译模型的训练语料，这里的训练语料的规模是候选集中的语料规模的2倍。应当理解，基于文本样本对，对初始翻译模型进行模型训练的具体过程可以参见下述步骤S309。It can be understood that the text sample pair can be used for model training of the initial translation model, that is, the text sample pair can be used as the training corpus for training the initial translation model, and the size of the training corpus here is twice the size of the corpus in the candidate set. . It should be understood that, for the specific process of performing model training on the initial translation model based on the text sample pair, reference may be made to the following step S309.

步骤S309，获取与文本样本对相关联的初始翻译模型，基于文本样本对，对初始翻译模型进行迭代训练，将迭代训练后的初始翻译模型确定为目标翻译模型。Step S309 , acquiring the initial translation model associated with the text sample pair, performing iterative training on the initial translation model based on the text sample pair, and determining the initial translation model after the iterative training as the target translation model.

其中，目标翻译模型用于在第一语言类型、第二语言类型和第三语言类型中的任意两种语言类型之间进行文本翻译。Wherein, the target translation model is used to perform text translation between any two language types among the first language type, the second language type and the third language type.

可选的，目标翻译模型还可以用于在第一语言类型、第四语言类型、第五语言类型和第六语言类型中的任意两种语言类型之间进行文本翻译。换言之，目标翻译模型可以用于在第一语言类型、第二语言类型、第三语言类型、第四语言类型、第五语言类型和第六语言类型中的任意两种语言类型之间进行文本翻译。Optionally, the target translation model may also be used for text translation between any two language types among the first language type, the fourth language type, the fifth language type and the sixth language type. In other words, the target translation model can be used to translate text between any two language types among the first language type, the second language type, the third language type, the fourth language type, the fifth language type, and the sixth language type .

The initial translation model and the target translation model here may be C-MNMT (Complete Multilingual Neural Machine Translation) models; the embodiments of the present application do not limit the model types of the initial translation model and the target translation model.

应当理解，初始翻译模型和目标翻译模型可以统称为泛化翻译模型，初始翻译模型和目标翻译模型属于泛化翻译模型在不同时刻的名称。在训练阶段，泛化翻译模型可以称之为初始翻译模型，在预测阶段，泛化翻译模型可以称之为目标翻译模型。It should be understood that the initial translation model and the target translation model may be collectively referred to as the generalized translation model, and the initial translation model and the target translation model belong to the names of the generalized translation model at different times. In the training phase, the generalized translation model can be called the initial translation model, and in the prediction phase, the generalized translation model can be called the target translation model.

It can be understood that the method proposed in this application can be tested and verified on the public dataset WMT-5. The experimental results show that the EAG (Extract and Generate) algorithm proposed in this application generates far more corpus data on the WMT-5 dataset than the prior art (i.e., the extraction-based method), producing on average about 10 times more high-quality, semantically aligned bilingual pairs for each non-English language pair. The extraction-based method is the prior-art method that generates aligned corpora by requiring complete alignment, and it is the method used by the C-MNMT model.

为便于理解，请参见图10，图10是本申请实施例提供的一种进行性能比较的场景示意图。如图10所示为现有技术所提供的训练语料(即第一训练语料)和本申请实施例所提供的训练语料(即第二训练语料)分别对应的实验结果，为便于理解，这里将通过第一训练语料对初始翻译模型进行训练所得到的模型称之为第一翻译模型，这里将通过第二训练语料对初始翻译模型进行训练所得到的模型称之为第二翻译模型(即目标翻译模型)。For ease of understanding, please refer to FIG. 10 , which is a schematic diagram of a performance comparison scenario provided by an embodiment of the present application. Figure 10 shows the experimental results corresponding to the training corpus provided by the prior art (ie, the first training corpus) and the training corpus (ie, the second training corpus) provided by the embodiments of the present application. The model obtained by training the initial translation model with the first training corpus is called the first translation model, and the model obtained by training the initial translation model with the second training corpus is called the second translation model (ie the target translation model).

This application selects C-MNMT as the baseline system. The performance comparison between the EAG method and the C-MNMT method is shown in FIG. 10, where experimental result 90a corresponds to the first translation model and experimental result 90b corresponds to the second translation model. As can be seen from FIG. 10, the EAG method performs clearly better than the C-MNMT method on most language pairs; in the non-English directions, the BLEU (Bilingual Evaluation Understudy) score of the EAG method is 1.1 points higher than that of the baseline system. BLEU is a metric that can be used to evaluate machine translation quality automatically.

如图10所示，第一翻译模型和第二翻译模型均可以用于捷克语、德语、英语、西班牙语、法语和俄语之间的翻译。实验结果充分展示了本方法的有效性，横向为不同类型的源语言，纵向为不同类型的目标语言，任意两种语言类型之间均可以实现翻译。此外，实验结果90a显示C-MNMT方法在非英文方向的翻译质量与英文方向的翻译质量有一定差距。As shown in FIG. 10, both the first translation model and the second translation model can be used for translation between Czech, German, English, Spanish, French, and Russian. The experimental results fully demonstrate the effectiveness of this method. The horizontal is different types of source languages, the vertical is different types of target languages, and translation can be achieved between any two language types. In addition, the experimental results 90a show that the translation quality of the C-MNMT method in the non-English direction has a certain gap with the translation quality in the English direction.

Experimental result 90b shows that when the source language is Czech (cs) and the target language is German (de), the BLEU score increases by 1.8; when the source language is Czech and the target language is English (en), the BLEU score decreases by 0.1; when the source language is Czech and the target language is Spanish (es), the BLEU score increases by 1.5; when the source language is Czech and the target language is French (fr), the BLEU score increases by 2.4; and when the source language is Czech and the target language is Russian (ru), the BLEU score increases by 1.5.
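As a quick arithmetic check on the Czech-source figures quoted above, the short sketch below averages the five BLEU changes, with and without the English direction. It covers only these five directions, so its result is not expected to equal the 1.1-point average reported over all non-English directions; the variable names are illustrative.

```python
# BLEU changes for Czech as the source language, as quoted in the text.
bleu_change = {"de": 1.8, "en": -0.1, "es": 1.5, "fr": 2.4, "ru": 1.5}

mean_all = sum(bleu_change.values()) / len(bleu_change)

# Restrict to the non-English targets (cs -> de, es, fr, ru).
non_english = {lang: delta for lang, delta in bleu_change.items() if lang != "en"}
mean_non_english = sum(non_english.values()) / len(non_english)

print(round(mean_all, 2))          # 1.42
print(round(mean_non_english, 2))  # 1.8
```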

其中，在如图10所示的实验结果中，除捷克语和英语作为源语言和目标语言之外，与现有技术相比，BLEU值均有不同程度的提升。特别是在捷克语作为源语言、法语作为目标语言以及俄语作为源语言、德语作为目标语言时，BLEU值提升较大。Among them, in the experimental results shown in Figure 10, except for Czech and English as the source language and target language, compared with the prior art, the BLEU value has been improved to different degrees. Especially when Czech is used as the source language, French is used as the target language, and Russian is used as the source language and German is used as the target language, the BLEU value is greatly improved.

It can be seen that the embodiments of the present application can use the EAG method to extract and generate a multi-way aligned corpus (i.e., the training corpus) for the generalized translation model (a complete multilingual neural machine translation model, i.e., the C-MNMT model). The EAG method consists mainly of two steps: extraction and generation. In the first step, edit distance may be used to extract candidate sentence pairs: when the edit distance between the first-language (i.e., English) sides is less than a threshold (i.e., the similarity distance), the two sentence pairs (i.e., the first text pair and the second text pair) are added to the candidate set. In the second step, model-based generation may be used to eliminate semantic differences, converting the "partially aligned" sentence pairs in the candidate set into "fully aligned" training corpus (i.e., text sample pairs). On this basis, the method proposed in this application can be used in any task that requires aligned training corpora across different language pairs. Through the above process, a large amount of high-quality, semantically aligned training corpus can be generated from existing bilingual corpora, which can in turn significantly improve the translation quality (i.e., performance) of the generalized translation model.

Further, please refer to FIG. 11, which is a schematic structural diagram of a text data processing apparatus provided by an embodiment of the present application. The text data processing apparatus 1 may include: a text acquisition module 10, a text generation module 20, and a sample pair generation module 30. Further, the text data processing apparatus 1 may also include: a length determination module 40, a distance determination module 50, a first comparison module 60, a second comparison module 70, and a model training module 80;

The text acquisition module 10 is configured to acquire a first text pair and a second text pair, acquire a first sub-text from the first text pair, and acquire a second sub-text from the second text pair; the first sub-text and the second sub-text both belong to a first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type; the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type;

The text generation module 20 is configured to determine an edit distance between the first sub-text and the second sub-text, and, if the edit distance satisfies a similarity condition, to generate a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and to generate a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type;

The text generation module 20 includes: a text extraction unit 201, a distance determination unit 202, a target acquisition unit 203, and a text generation unit 204;

The text extraction unit 201 is configured to extract a first unit text from the first sub-text and extract a second unit text from the second sub-text;

The distance determination unit 202 is configured to determine the edit distance between the first sub-text and the second sub-text based on the first unit text and the second unit text.

The target acquisition unit 203 is configured to acquire, if the edit distance satisfies the similarity condition, a network model set associated with the first sub-text and the second sub-text;

The text generation unit 204 is configured to generate, based on the network model set, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and to generate the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

The text generation unit 204 includes: a first splicing subunit 2041, a first generation subunit 2042, a second splicing subunit 2043, and a second generation subunit 2044. Optionally, the text generation unit 204 may further include: an initial acquisition subunit 2045, a transformation subunit 2046, a splicing processing subunit 2047, and a model training subunit 2048;

The first splicing subunit 2041 is configured to splice the first sub-text and the fourth sub-text to obtain a first spliced text;

The first generation subunit 2042 is configured to input the first spliced text into a first target network model, and to generate, through the first target network model, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type;

The first generation subunit 2042 includes: an encoding processing subunit 20421, a decoding processing subunit 20422, and a text generation subunit 20423;

The encoding processing subunit 20421 is configured to input the first spliced text into an encoder in the first target network model, and to encode the first spliced text through the encoder to obtain a first feature vector corresponding to the first spliced text;

The decoding processing subunit 20422 is configured to input the first feature vector into a decoder in the first target network model, and to decode the first feature vector through the decoder to obtain a first text vector corresponding to the first feature vector;

The text generation subunit 20423 is configured to generate, based on the first text vector, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type.

For the specific implementation of the encoding processing subunit 20421, the decoding processing subunit 20422, and the text generation subunit 20423, reference may be made to the description of step S202 in the embodiment corresponding to FIG. 7 above, which is not repeated here.

The second splicing subunit 2043 is configured to splice the second sub-text and the third sub-text to obtain a second spliced text;

The second generation subunit 2044 is configured to input the second spliced text into a second target network model, and to generate, through the second target network model, the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

Optionally, the initial acquisition subunit 2045 is configured to acquire a first initial network model associated with the first language type and the third language type;

The transformation subunit 2046 is configured to perform editing transformation on the fourth sub-text to obtain a first transformed text associated with the fourth sub-text;

The transformation subunit 2046 includes: a probability generation subunit 20461 and an editing transformation subunit 20462;

The probability generation subunit 20461 is configured to generate a random transformation probability for each of the N unit texts, and to determine each unit text whose random transformation probability falls within a transformable probability interval as a unit text to be edited;

The editing transformation subunit 20462 is configured to acquire an editing transformation mode associated with the unit text to be edited, perform editing transformation on the unit text to be edited according to the editing transformation mode to obtain an editing transformation result, and generate the first transformed text associated with the fourth sub-text according to the editing transformation result.

The editing transformation subunit 20462 is specifically configured to acquire the editing transformation mode associated with the unit text to be edited;

The editing transformation subunit 20462 is further specifically configured to, if the editing transformation mode is a replacement operation, acquire a dictionary table corresponding to the third language type, acquire a first editing text from the dictionary table, and replace the unit text to be edited with the first editing text to obtain the editing transformation result;

The editing transformation subunit 20462 is further specifically configured to, if the editing transformation mode is an insertion operation, acquire a second editing text from the dictionary table and insert the second editing text at a position adjacent to the unit text to be edited to obtain the editing transformation result;

The editing transformation subunit 20462 is further specifically configured to, if the editing transformation mode is a deletion operation, delete the unit text to be edited from the fourth sub-text to obtain the editing transformation result.

For the specific implementation of the probability generation subunit 20461 and the editing transformation subunit 20462, reference may be made to the description of step S303 in the embodiment corresponding to FIG. 9 above, which is not repeated here.
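The editing transformation performed by subunits 20461 and 20462 can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration and are not stated in the application: the transformable probability interval is taken to be [0, p_edit), the three operations are chosen uniformly at random, the inserted text is placed after the edited unit, and the names `edit_transform`, `vocab`, and `p_edit` are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def edit_transform(
    units: Sequence[str],
    vocab: Sequence[str],
    p_edit: float = 0.15,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly replace, insert after, or delete unit texts of the fourth sub-text.

    Assumptions (not from the application): the transformable interval is
    [0, p_edit) and the operation is picked uniformly from the three modes.
    """
    rng = rng or random.Random()
    transformed: List[str] = []
    for unit in units:
        # Draw the random transformation probability for this unit text.
        if rng.random() >= p_edit:
            transformed.append(unit)  # not selected for editing: keep as-is
            continue
        mode = rng.choice(("replace", "insert", "delete"))
        if mode == "replace":
            # Replace the unit with an entry from the dictionary table.
            transformed.append(rng.choice(vocab))
        elif mode == "insert":
            # Keep the unit and insert a dictionary entry next to it.
            transformed.extend([unit, rng.choice(vocab)])
        # mode == "delete": append nothing, so the unit is dropped.
    return transformed
```

Passing a seeded `random.Random` makes the transformation reproducible, which is convenient when regenerating the same noisy training samples.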

The splicing processing subunit 2047 is configured to splice the second sub-text and the first transformed text to obtain a first spliced sample;

The model training subunit 2048 is configured to obtain a first sample vector of the first spliced sample through the first initial network model, and to perform model training on the first initial network model based on the first sample vector and the fourth sub-text to obtain the first target network model.

The model training subunit 2048 includes: a text prediction subunit 20481, a loss generation subunit 20482, and a parameter adjustment subunit 20483;

The text prediction subunit 20481 is configured to obtain the first sample vector of the first spliced sample through the first initial network model, and to generate, based on the first sample vector, a predicted sample sub-text belonging to the third language type;

The loss generation subunit 20482 is configured to obtain a sample semantic similarity between the predicted sample sub-text and the fourth sub-text, and to generate a model loss function of the first initial network model according to the sample semantic similarity;

The parameter adjustment subunit 20483 is configured to adjust the parameters of the first initial network model based on the model loss function to obtain the first target network model.

For the specific implementation of the text prediction subunit 20481, the loss generation subunit 20482, and the parameter adjustment subunit 20483, reference may be made to the description of step S305 in the embodiment corresponding to FIG. 9 above, which is not repeated here.

For the specific implementation of the first splicing subunit 2041, the first generation subunit 2042, the second splicing subunit 2043, and the second generation subunit 2044, reference may be made to the description of step S202 in the embodiment corresponding to FIG. 7 above, which is not repeated here. Optionally, for the specific implementation of the initial acquisition subunit 2045, the transformation subunit 2046, the splicing processing subunit 2047, and the model training subunit 2048, reference may be made to the description of steps S302 to S305 in the embodiment corresponding to FIG. 9 above, which is not repeated here.
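As a rough illustration of the splicing and generation path handled by subunits 2041 to 2044, the following Python sketch pairs each first-language sub-text with the sub-text from the other text pair and passes the spliced text to a generation model. The separator string `SEP`, the function names, and the two model callables are hypothetical; the application does not specify how the spliced text is delimited or how the target network models are invoked.

```python
from typing import Callable, Tuple

# Hypothetical separator placed between the two spliced sub-texts.
SEP = " <sep> "


def splice(first_language_text: str, other_text: str, sep: str = SEP) -> str:
    """Join a first-language sub-text with a sub-text from the other text pair."""
    return f"{first_language_text}{sep}{other_text}"


def generate_targets(
    first_sub: str,   # first language type, from the first text pair
    third_sub: str,   # second language type, from the first text pair
    second_sub: str,  # first language type, from the second text pair
    fourth_sub: str,  # third language type, from the second text pair
    model_to_third_language: Callable[[str], str],
    model_to_second_language: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (first target sub-text, second target sub-text)."""
    # First spliced text: first sub-text + fourth sub-text.
    first_target = model_to_third_language(splice(first_sub, fourth_sub))
    # Second spliced text: second sub-text + third sub-text.
    second_target = model_to_second_language(splice(second_sub, third_sub))
    return first_target, second_target
```

In practice the two callables would wrap the trained first and second target network models; any function mapping a string to a string can stand in for them when testing the data flow.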

For the specific implementation of the text extraction unit 201 and the distance determination unit 202, reference may be made to the description of step S102 in the embodiment corresponding to FIG. 3 above, which is not repeated here. Optionally, for the specific implementation of the target acquisition unit 203 and the text generation unit 204, reference may be made to the description of steps S201 to S202 in the embodiment corresponding to FIG. 7 above, which is not repeated here.
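The work of the text extraction unit 201 and the distance determination unit 202 can be sketched in Python as a standard Levenshtein distance computed over unit texts. The sketch assumes that unit texts are whitespace-separated words and that insertion, deletion, and substitution each cost 1; the application does not fix either choice here, and the function names are hypothetical.

```python
from typing import List, Sequence


def unit_texts(sub_text: str) -> List[str]:
    """Split a sub-text into unit texts (assumption: whitespace-separated words)."""
    return sub_text.split()


def edit_distance(first_units: Sequence[str], second_units: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of unit texts."""
    # previous[j] holds the distance between the processed prefix of
    # first_units and the first j items of second_units.
    previous = list(range(len(second_units) + 1))
    for i, unit_a in enumerate(first_units, start=1):
        current = [i]
        for j, unit_b in enumerate(second_units, start=1):
            cost = 0 if unit_a == unit_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete unit_a
                    current[j - 1] + 1,      # insert unit_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]
```

For example, `edit_distance(unit_texts("the cat sat on the mat"), unit_texts("the cat sits on a mat"))` evaluates to 2, because two words differ.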

The sample pair generation module 30 is configured to generate text sample pairs according to the first text pair, the second text pair, the first target sub-text, and the second target sub-text.

The sample pair generation module 30 includes: a first combination unit 301, a second combination unit 302, and a sample pair determination unit 303;

The first combination unit 301 is configured to combine the first target sub-text and the third sub-text into a first sample pair; the first target sub-text is associated with the semantic information of the third sub-text;

The second combination unit 302 is configured to combine the second target sub-text and the fourth sub-text into a second sample pair; the second target sub-text is associated with the semantic information of the fourth sub-text;

The sample pair determination unit 303 is configured to determine the first text pair, the second text pair, the first sample pair, and the second sample pair as the text sample pairs.

For the specific implementation of the first combination unit 301, the second combination unit 302, and the sample pair determination unit 303, reference may be made to the description of step S103 in the embodiment corresponding to FIG. 3 above, which is not repeated here.
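The combination performed by units 301 to 303 can be sketched as follows. The `Sentence` and `TextPair` types and the function name are hypothetical containers introduced only for illustration; the sketch assumes that each text pair stores its first-language sub-text first and its other-language sub-text second.

```python
from typing import List, NamedTuple


class Sentence(NamedTuple):
    lang: str  # language type label
    text: str


class TextPair(NamedTuple):
    a: Sentence
    b: Sentence


def build_text_sample_pairs(
    first_pair: TextPair,     # (first sub-text, third sub-text)
    second_pair: TextPair,    # (second sub-text, fourth sub-text)
    first_target: Sentence,   # generated, third language type
    second_target: Sentence,  # generated, second language type
) -> List[TextPair]:
    """Return the two original pairs plus the two newly combined sample pairs."""
    third_sub = first_pair.b
    fourth_sub = second_pair.b
    first_sample = TextPair(first_target, third_sub)     # third + second language
    second_sample = TextPair(second_target, fourth_sub)  # second + third language
    return [first_pair, second_pair, first_sample, second_sample]
```

Both new sample pairs join the second and third language types, which is how pairs that do not involve the first language type are obtained from two pairs that do.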

Optionally, the length determination module 40 is configured to determine the text length of the first sub-text as a first text length, determine the text length of the second sub-text as a second text length, and determine a target length corresponding to the similarity condition according to the first text length and the second text length;

The distance determination module 50 is configured to acquire a similarity parameter associated with the similarity condition, and to determine a similarity distance corresponding to the similarity condition according to the target length and the similarity parameter;

The first comparison module 60 is configured to determine that the edit distance satisfies the similarity condition if the edit distance is less than or equal to the similarity distance;

The second comparison module 70 is configured to determine that the edit distance does not satisfy the similarity condition if the edit distance is greater than the similarity distance.
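The check carried out by modules 40 to 70 can be sketched in Python. This passage does not state how the target length is derived from the two text lengths or how it is combined with the similarity parameter, so the sketch assumes the target length is the shorter of the two lengths and the similarity distance is their product with the parameter; `gamma` and both function names are hypothetical.

```python
def similarity_distance(first_length: int, second_length: int, gamma: float = 0.5) -> float:
    """Similarity distance for the similarity condition.

    Assumptions (not from the application): the target length is the shorter
    of the two text lengths, and it is scaled by the similarity parameter.
    """
    target_length = min(first_length, second_length)
    return gamma * target_length


def satisfies_similarity(
    edit_dist: int, first_length: int, second_length: int, gamma: float = 0.5
) -> bool:
    """True if the edit distance is less than or equal to the similarity distance."""
    return edit_dist <= similarity_distance(first_length, second_length, gamma)
```

With these assumptions, two six-word sub-texts at edit distance 2 satisfy the condition for `gamma = 0.5` (2 <= 3), whereas an edit distance of 4 does not.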

Optionally, the model training module 80 is configured to acquire an initial translation model associated with the text sample pairs, iteratively train the initial translation model based on the text sample pairs, and determine the iteratively trained initial translation model as a target translation model; the target translation model is used to translate text between any two of the first language type, the second language type, and the third language type.

For the specific implementation of the text acquisition module 10, the text generation module 20, the sample pair generation module 30, the length determination module 40, the distance determination module 50, the first comparison module 60, and the second comparison module 70, reference may be made to the description of steps S101 to S103 in the embodiment corresponding to FIG. 3, steps S201 to S202 in the embodiment corresponding to FIG. 7, and steps S301 to S308 in the embodiment corresponding to FIG. 9 above, which is not repeated here. Optionally, for the specific implementation of the model training module 80, reference may be made to the description of step S309 in the embodiment corresponding to FIG. 9 above, which is not repeated here. In addition, the beneficial effects of using the same method are not described again.

Further, please refer to FIG. 12, which is a schematic structural diagram of a computer device provided by an embodiment of the present application. As shown in FIG. 12, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005. In addition, the computer device 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory, for example, at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in FIG. 12, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in FIG. 12, the network interface 1004 may provide a network communication function; the user interface 1003 is mainly used to provide an input interface for a user; and the processor 1001 may be used to call the device control application program stored in the memory 1005 to implement:

It should be understood that the computer device 1000 described in the embodiments of the present application can carry out the text data processing method described in the embodiments corresponding to FIG. 3, FIG. 7, or FIG. 9 above, and can also carry out the functions of the text data processing apparatus 1 described in the embodiment corresponding to FIG. 11 above, which is not repeated here. In addition, the beneficial effects of using the same method are not described again.

In addition, it should be pointed out that the embodiments of the present application further provide a computer-readable storage medium. The computer-readable storage medium stores the computer program executed by the aforementioned text data processing apparatus 1, and the computer program includes program instructions. When a processor executes the program instructions, it can carry out the text data processing method described in the embodiments corresponding to FIG. 3, FIG. 7, or FIG. 9 above, which is therefore not repeated here. In addition, the beneficial effects of using the same method are not described again. For technical details not disclosed in the computer-readable storage medium embodiments of the present application, please refer to the description of the method embodiments of the present application.

In addition, it should be noted that the embodiments of the present application further provide a computer program product or computer program, which may include computer instructions, and the computer instructions may be stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and can execute them, so that the computer device carries out the text data processing method described in the embodiments corresponding to FIG. 3, FIG. 7, or FIG. 9 above, which is therefore not repeated here. In addition, the beneficial effects of using the same method are not described again. For technical details not disclosed in the computer program product or computer program embodiments of the present application, please refer to the description of the method embodiments of the present application.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

What is disclosed above is merely the preferred embodiments of the present application and certainly cannot be used to limit the scope of rights of the present application. Therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.

Claims

1. A text data processing method, comprising:

acquiring a first text pair and a second text pair, acquiring a first sub-text from the first text pair, and acquiring a second sub-text from the second text pair; the first sub-text and the second sub-text both belong to a first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type; the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type;

determining an edit distance between the first sub-text and the second sub-text, and if the edit distance satisfies a similarity condition, generating a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generating a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type;

generating a text sample pair according to the first text pair, the second text pair, the first target sub-text and the second target sub-text.

2. The method according to claim 1, wherein the method further comprises:

determining the text length of the first sub-text as a first text length, determining the text length of the second sub-text as a second text length, and determining a target length corresponding to the similarity condition according to the first text length and the second text length;

obtaining a similarity parameter associated with the similarity condition, and determining a similarity distance corresponding to the similarity condition according to the target length and the similarity parameter;

if the edit distance is less than or equal to the similarity distance, determining that the edit distance satisfies the similarity condition;

if the edit distance is greater than the similarity distance, determining that the edit distance does not satisfy the similarity condition.

3. The method according to claim 1, wherein the determining an edit distance between the first sub-text and the second sub-text comprises:

extracting a first unit text from the first sub-text, and extracting a second unit text from the second sub-text;

determining, based on the first unit text and the second unit text, the edit distance between the first sub-text and the second sub-text.

4. The method according to claim 1, wherein the generating, if the edit distance satisfies a similarity condition, a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generating a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type, comprises:

if the edit distance satisfies the similarity condition, acquiring a network model set associated with the first sub-text and the second sub-text;

generating, based on the network model set, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generating the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

5. The method according to claim 4, wherein the network model set comprises a first target network model and a second target network model; the first target network model is associated with the first language type and the third language type, and the second target network model is associated with the first language type and the second language type;

the generating, based on the network model set, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generating the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type, comprises:

performing splicing processing on the first sub-text and the fourth sub-text to obtain a first spliced text;

inputting the first spliced text into the first target network model, and generating, through the first target network model, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type;

performing splicing processing on the second sub-text and the third sub-text to obtain a second spliced text;

inputting the second spliced text into the second target network model, and generating, through the second target network model, the second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type.

6. The method according to claim 5, wherein the first target network model comprises an encoder for performing encoding processing and a decoder for performing decoding processing;

the inputting the first spliced text into the first target network model, and generating, through the first target network model, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, comprises:

inputting the first spliced text into the encoder in the first target network model, and encoding the first spliced text through the encoder to obtain a first feature vector corresponding to the first spliced text;

inputting the first feature vector into the decoder in the first target network model, and decoding the first feature vector through the decoder to obtain a first text vector corresponding to the first feature vector;

generating, based on the first text vector, the first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type.

7. The method according to claim 1, wherein the generating a text sample pair according to the first text pair, the second text pair, the first target sub-text and the second target sub-text comprises:

combining the first target sub-text and the third sub-text into a first sample pair; the first target sub-text is associated with the semantic information of the third sub-text;

combining the second target sub-text and the fourth sub-text into a second sample pair; the second target sub-text is associated with the semantic information of the fourth sub-text;

determining the first text pair, the second text pair, the first sample pair and the second sample pair as the text sample pair.

8. The method according to claim 5, wherein the method further comprises:

obtaining a first initial network model associated with the first language type and the third language type;

performing editing transformation on the fourth sub-text to obtain a first transformed text associated with the fourth sub-text;

performing splicing processing on the second sub-text and the first transformed text to obtain a first spliced sample;

obtaining a first sample vector of the first spliced sample through the first initial network model, and performing model training on the first initial network model based on the first sample vector and the fourth sub-text to obtain the first target network model.

9. The method according to claim 8, wherein the fourth sub-text comprises N unit texts, N being a positive integer;

the performing editing transformation on the fourth sub-text to obtain a first transformed text associated with the fourth sub-text comprises:

generating a random transformation probability corresponding to each of the N unit texts, and determining each unit text whose random transformation probability is within a transformable probability interval as a unit text to be edited;

obtaining an editing transformation mode associated with the unit text to be edited, performing editing transformation on the unit text to be edited according to the editing transformation mode to obtain an editing transformation result, and generating the first transformed text associated with the fourth sub-text according to the editing transformation result.

10. The method according to claim 9, wherein the acquiring an editing transformation mode associated with the unit text to be edited, and performing editing transformation on the unit text to be edited according to the editing transformation mode to obtain an editing transformation result, comprises:

acquiring the editing transformation mode associated with the unit text to be edited;

if the editing transformation mode is a replacement operation, obtaining a dictionary table corresponding to the third language type, obtaining a first editing text from the dictionary table, and replacing the unit text to be edited with the first editing text to obtain the editing transformation result;

if the editing transformation mode is an insertion operation, obtaining a second editing text from the dictionary table, and inserting the second editing text at a position adjacent to the unit text to be edited to obtain the editing transformation result;

if the editing transformation mode is a deletion operation, deleting the unit text to be edited from the fourth sub-text to obtain the editing transformation result.
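The editing transformation of claims 9 and 10 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the transformable probability interval is taken to be the half-open range `[low, high)`, the editing transformation mode is drawn uniformly at random, the inserted text goes immediately after the unit text, and the dictionary table is a plain list. The function name and the default values are hypothetical.

```python
import random


def edit_transform(units, dictionary, low=0.0, high=0.15, rng=None):
    """Randomly replace, insert after, or delete unit texts.

    units      -- the N unit texts of the fourth sub-text
    dictionary -- dictionary table of the third language type (assumed a list)
    low, high  -- assumed bounds of the transformable probability interval
    """
    if rng is None:
        rng = random.Random()
    result = []
    for unit in units:
        p = rng.random()  # random transformation probability for this unit
        if not (low <= p < high):
            result.append(unit)  # outside the interval: keep unchanged
            continue
        mode = rng.choice(("replace", "insert", "delete"))
        if mode == "replace":
            result.append(rng.choice(dictionary))
        elif mode == "insert":
            result.append(unit)
            result.append(rng.choice(dictionary))  # adjacent position (after)
        # "delete": the unit text is simply not copied to the result
    return result
```

Setting `high=0.0` disables all edits, which is a convenient sanity check. A wider interval produces a noisier first transformed text.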

11. The method according to claim 8, wherein the obtaining a first sample vector of the first spliced sample through the first initial network model, and performing model training on the first initial network model based on the first sample vector and the fourth sub-text to obtain a first target network model, comprises:

obtaining the first sample vector of the first spliced sample through the first initial network model, and generating a predicted sample sub-text belonging to the third language type based on the first sample vector;

obtaining a sample semantic similarity between the predicted sample sub-text and the fourth sub-text, and generating a model loss function of the first initial network model according to the sample semantic similarity;

adjusting parameters of the first initial network model based on the model loss function to obtain the first target network model.

12. The method of claim 1, wherein the method further comprises:

obtaining an initial translation model associated with the text sample pair, performing iterative training on the initial translation model based on the text sample pair, and determining the initial translation model after the iterative training as a target translation model; the target translation model is used for text translation between any two of the first language type, the second language type and the third language type.

13. A text data processing device, comprising:

a text acquisition module, configured to acquire a first text pair and a second text pair, acquire a first sub-text from the first text pair, and acquire a second sub-text from the second text pair; the first sub-text and the second sub-text both belong to a first language type; the first text pair further includes a third sub-text that has the same semantic information as the first sub-text and belongs to a second language type; the second text pair further includes a fourth sub-text that has the same semantic information as the second sub-text and belongs to a third language type;

a text generation module, configured to determine an edit distance between the first sub-text and the second sub-text, and, if the edit distance satisfies a similarity condition, generate a first target sub-text that is associated with the semantic information of the first sub-text and belongs to the third language type, and generate a second target sub-text that is associated with the semantic information of the second sub-text and belongs to the second language type;

a sample pair generation module, configured to generate a text sample pair according to the first text pair, the second text pair, the first target sub-text and the second target sub-text.
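The edit distance check performed by the text generation module can be sketched as follows. The sketch assumes a Levenshtein distance computed over whitespace-separated tokens and a similarity condition of the form "distance divided by the longer length does not exceed a threshold"; the function names and the `max_ratio` value are hypothetical, since the claims do not specify the condition.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def satisfies_similarity(first_sub_text, second_sub_text, max_ratio=0.3):
    """Assumed similarity condition: normalized token edit distance <= max_ratio."""
    a, b = first_sub_text.split(), second_sub_text.split()
    longest = max(len(a), len(b), 1)  # avoid dividing by zero on empty input
    return edit_distance(a, b) / longest <= max_ratio
```

Only sub-text pairs that pass such a check would be passed on for generation of the first and second target sub-texts.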

14. A computer device, comprising: a processor and a memory;

the processor is connected to the memory, wherein the memory is used to store a computer program, and the processor is used to invoke the computer program, so that the computer device performs the method of any one of claims 1-12.

15. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method of any one of claims 1-12.