CN113705251B

CN113705251B - Training method of machine translation model, language translation method and equipment

Info

Publication number: CN113705251B
Application number: CN202110356556.4A
Authority: CN
Inventors: 涂兆鹏; 刘洋; 史树明; 王硕
Original assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Current assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Priority date: 2021-04-01
Filing date: 2021-04-01
Publication date: 2024-08-06
Anticipated expiration: 2041-04-01
Also published as: CN113705251A

Abstract

The embodiment of the application discloses a training method of a machine translation model, a language translation method and equipment, and relates to the field of machine translation of natural language processing. The method comprises the following steps: the data in the first bilingual parallel database is divided into source language data and target language data, an initial machine translation model is subjected to fine adjustment through the source language data to obtain a fine-adjusted machine translation model, the fine-adjusted machine translation model is applied to carry out translation tasks, the influence of language coverage deviation existing among data in different languages on the machine translation model can be eliminated, the performance of the machine translation model obtained through training by the method is improved, and translations with high translation quality and high fidelity can be obtained by applying the model.

Description

Machine translation model training method, language translation method and device

技术领域Technical Field

本发明涉及人工智能的自然语言处理领域，尤其涉及一种机器翻译模型的训练方法、语言翻译方法及设备。The present invention relates to the field of natural language processing of artificial intelligence, and in particular to a training method for a machine translation model, a language translation method and a device.

背景技术Background technique

神经机器翻译近年来迅速崛起。相比统计机器翻译而言，神经机器翻译从模型上来说相对简单，它主要包含两个部分，一个是编码器，一个是解码器。编码器是把源语言经过一系列的神经网络的变换之后，表示成一个高维的向量。解码器负责把这个高维向量再重新解码(翻译)成目标语言。Neural machine translation has risen rapidly in recent years. Compared with statistical machine translation, neural machine translation is relatively simple in terms of model. It mainly consists of two parts, an encoder and a decoder. The encoder represents the source language as a high-dimensional vector after a series of neural network transformations. The decoder is responsible for re-decoding (translating) this high-dimensional vector into the target language.

神经机器翻译模型的训练离不开大规模、高质量的双语平行数据。双语平行数据通常是请人工译员进行翻译所得到的，构建大规模双语平行数据需要耗费巨大的人力资源和时间成本。The training of neural machine translation models is inseparable from large-scale, high-quality bilingual parallel data. Bilingual parallel data is usually obtained by asking human translators to translate, and building large-scale bilingual parallel data requires huge human resources and time costs.

然而，双语平行数据中源自不同语言的所覆盖的内容有明显的差异，将该差异称作语言覆盖偏差。通过神经机器翻译模型翻译得到的译文的忠实度与语言覆盖偏差密切相关。因此，双语平行数据中语言覆盖偏差的存在，将会影响通过双语平行数据训练得到的神经机器翻译模型的性能。However, there are obvious differences in the content covered by bilingual parallel data from different languages, which is called language coverage bias. The fidelity of the translation obtained by the neural machine translation model is closely related to the language coverage bias. Therefore, the existence of language coverage bias in bilingual parallel data will affect the performance of the neural machine translation model trained with bilingual parallel data.

发明内容Summary of the invention

本申请实施例提供一种机器翻译模型的训练方法、语言翻译方法及设备，应用该机器翻译模型的训练方法可以消除由于双语平行数据库中源自不同语言的双语平行性句对之间存在的语言覆盖偏差对机器翻译模型产生的影响，从而提升应用该方法训练得到的机器翻译模型的性能。The embodiments of the present application provide a training method for a machine translation model, a language translation method and a device. The application of the training method for a machine translation model can eliminate the influence of the language coverage deviation between bilingual parallel sentence pairs originating from different languages in a bilingual parallel database on the machine translation model, thereby improving the performance of the machine translation model trained by the method.

第一方面，本申请实施例提供了一种机器翻译模型的训练方法，包括：获取双语平行数据库，所述双语平行数据库包括多组双语平行句对，所述双语平行句对为由源语言数据和目标语言数据构成的内容对齐的数据，所述双语平行数据库包括第一双语平行数据库；In a first aspect, an embodiment of the present application provides a method for training a machine translation model, comprising: obtaining a bilingual parallel database, the bilingual parallel database comprising a plurality of groups of bilingual parallel sentence pairs, the bilingual parallel sentence pairs being content-aligned data consisting of source language data and target language data, the bilingual parallel database comprising a first bilingual parallel database;

将所述第一双语平行数据库中的多组双语平行句对划分为源自源语言数据和源自目标语言数据，其中，属于所述源自源语言数据的双语平行句对中的目标语言数据是基于源语言数据翻译得到的，属于所述源自目标语言数据的双语平行句对中的源语言数据是基于目标语言数据翻译得到的；Dividing the plurality of groups of bilingual parallel sentence pairs in the first bilingual parallel database into those derived from source language data and those derived from target language data, wherein the target language data in the bilingual parallel sentence pairs derived from the source language data are obtained by translating based on the source language data, and the source language data in the bilingual parallel sentence pairs derived from the target language data are obtained by translating based on the target language data;

通过所述源自源语言数据训练第一机器翻译模型，所述第一机器翻译模型用于将源语言翻译为目标语言。A first machine translation model is trained by using the data derived from the source language, and the first machine translation model is used to translate the source language into the target language.

其中，所述将所述第一双语平行数据库中的多个双语平行句对划分为源自源语言数据和源自目标语言数据，包括：The step of dividing the plurality of bilingual parallel sentence pairs in the first bilingual parallel database into data derived from source language and data derived from target language includes:

从所述第一双语平行数据库中获取待处理平行句对；Acquire parallel sentence pairs to be processed from the first bilingual parallel database;

根据待处理平行句对覆盖的内容，确定待处理平行句对的数据类型，所述数据类型包括所述源自源语言数据和所述源自目标语言数据。According to the contents covered by the parallel sentence pairs to be processed, the data types of the parallel sentence pairs to be processed are determined, and the data types include the data derived from the source language and the data derived from the target language.

其中，所述根据待处理平行句对覆盖的内容，确定待处理平行句对的数据类型，包括：The step of determining the data type of the parallel sentence pairs to be processed according to the content covered by the parallel sentence pairs to be processed includes:

根据所述待处理平行句对中的源语言数据确定所述待处理平行句对中的源语言数据来源于源语言的第一概率；Determining a first probability that the source language data in the parallel sentence pair to be processed originates from a source language according to the source language data in the parallel sentence pair to be processed;

根据所述待处理平行句对中的目标语言数据的内容确定所述待处理平行句对中的目标语言数据来源于目标语言的第二概率；Determining a second probability that the target language data in the parallel sentence pair to be processed originates from the target language according to the content of the target language data in the parallel sentence pair to be processed;

根据所述第一概率和所述第二概率之间的偏差确定所述待处理平行句对的数据类型。The data type of the parallel sentence pair to be processed is determined according to a deviation between the first probability and the second probability.

其中，所述根据所述第一概率和所述第二概率之间的偏差确定所述待处理平行句对的数据类型，具体包括：The step of determining the data type of the parallel sentence pair to be processed according to the deviation between the first probability and the second probability specifically includes:

根据所述第一概率和所述第二概率之间的偏差确定所述待处理平行句对的评分，所述评分用于确定所述待处理平行句对的数据类型；Determine a score of the parallel sentence pair to be processed according to a deviation between the first probability and the second probability, wherein the score is used to determine a data type of the parallel sentence pair to be processed;

当所述评分大于目标阈值时，确定所述待处理平行句对属于源自源语言数据；When the score is greater than a target threshold, determining that the parallel sentence pair to be processed is derived from source language data;

当所述评分小于目标阈值时，确定所述待处理平行句对属于源自目标语言数据。When the score is less than the target threshold, it is determined that the parallel sentence pair to be processed is derived from the target language data.

其中，根据所述待处理平行句对中的源语言数据确定所述待处理平行句对中的源语言数据来源于源语言的第一概率，包括：将所述待处理平行句对中的源语言数据输入到第一语言模型中，确定所述第一概率，所述第一语言模型用于确定所述待处理平行句对中的源语言数据在源语言中出现的概率，所述第一语言模型是通过源语言单语数据库训练得到的模型；Wherein, determining, according to the source language data in the parallel sentence pairs to be processed, a first probability that the source language data in the parallel sentence pairs to be processed are derived from a source language comprises: inputting the source language data in the parallel sentence pairs to be processed into a first language model to determine the first probability, wherein the first language model is used to determine the probability that the source language data in the parallel sentence pairs to be processed appear in the source language, and the first language model is a model obtained by training a source language monolingual database;

根据所述待处理平行句对中的目标语言数据的内容确定所述待处理平行句对中的目标语言数据是源自目标语言的第二概率，包括：将所述待处理平行句对中的目标语言数据输入到第二语言模型中，确定所述待第二概率，所述第二语言模型用于确定所述待处理平行句对中的目标语言数据在目标语言中出现的概率，所述第二语言模型为通过目标语言单语数据库训练得到的模型。Determining a second probability that the target language data in the parallel sentence pair to be processed is derived from a target language according to the content of the target language data in the parallel sentence pair to be processed includes: inputting the target language data in the parallel sentence pair to be processed into a second language model to determine the second probability, wherein the second language model is used to determine the probability that the target language data in the parallel sentence pair to be processed appears in the target language, and the second language model is a model trained by a target language monolingual database.

其中，在通过所述源自源语言数据训练第一机器翻译模型之前，所述方法还包括：Before training the first machine translation model using the source language data, the method further includes:

使用所述双语平行数据库训练初始机器翻译模型，得到所述第一机器翻译模型。An initial machine translation model is trained using the bilingual parallel database to obtain the first machine translation model.

获取单语数据库，所述单语数据库包括多条语言为目标语言的原文本；Acquire a monolingual database, wherein the monolingual database includes a plurality of original texts in a target language;

将所述单语数据库中的每一条原文本输入到所述第二机器翻译模型中，得到所述每一条原文本对应的源语言翻译文本，所述第二机器翻译模型是通过所述第一双语平行数据库训练得到的，所述第二机器翻译模型用于将目标语言翻译为源语言；Inputting each original text in the monolingual database into the second machine translation model to obtain a source language translation text corresponding to each original text, wherein the second machine translation model is obtained by training the first bilingual parallel database, and the second machine translation model is used to translate the target language into the source language;

将所述多条语言为目标语言的原文本和所述多条语言为目标语言的原文本分别对应的源语言翻译文本组成的多组伪平行句对添加至所述双语平行数据库。Adding a plurality of groups of pseudo-parallel sentence pairs consisting of the plurality of original texts in the target language and the source language translation texts respectively corresponding to the plurality of original texts in the target language to the bilingual parallel database.

获取单语数据库，所述单语数据库包括多条语言为源语言的原文本；Acquire a monolingual database, wherein the monolingual database includes a plurality of original texts in a source language;

将所述单语数据库中的每一条原文本输入到所述第三机器翻译模型中，得到所述每一条原文本对应的目标语言翻译文本，所述第三机器翻译模型是通过所述第一双语平行数据库训练得到的，所述第三机器翻译模型用于将源语言翻译为目标语言；Inputting each original text in the monolingual database into the third machine translation model to obtain a target language translation text corresponding to each original text, wherein the third machine translation model is obtained by training the first bilingual parallel database, and the third machine translation model is used to translate the source language into the target language;

将所述多条语言为源语言的原文本和所述多条语言为源语言的原文本分别对应的目标语言翻译文本组成的多组伪平行句对添加至所述双语平行数据库。Adding a plurality of groups of pseudo-parallel sentence pairs consisting of the plurality of original texts in the source language and the target language translation texts corresponding to the plurality of original texts in the source language to the bilingual parallel database.

第二方面，本申请实施例提供一种语言翻译方法，包括：In a second aspect, an embodiment of the present application provides a language translation method, comprising:

接收待翻译数据，所述待翻译数据为源语言数据；receiving data to be translated, wherein the data to be translated is source language data;

将待翻译数据输入到机器翻译模型中，得到所述待翻译数据对应的目标语言数据，所述机器翻译模型是通过如第一方面或者第一方面的各种可选实现方式中提供的方法训练得到的第一机器翻译模型。The data to be translated is input into a machine translation model to obtain target language data corresponding to the data to be translated, wherein the machine translation model is a first machine translation model trained by the method provided in the first aspect or various optional implementations of the first aspect.

第三方面，本申请实施例提供一种机器翻译模型的训练装置，包括：In a third aspect, an embodiment of the present application provides a training device for a machine translation model, comprising:

获取单元，用于获取双语平行数据库，所述双语平行数据库包括多组双语平行句对，所述双语平行句对为由源语言数据和目标语言数据构成的内容对齐的数据，所述双语平行数据库包括第一双语平行数据库；An acquisition unit is used to acquire a bilingual parallel database, wherein the bilingual parallel database includes a plurality of groups of bilingual parallel sentence pairs, wherein the bilingual parallel sentence pairs are content-aligned data consisting of source language data and target language data, and the bilingual parallel database includes a first bilingual parallel database;

划分单元，用于将所述第一双语平行数据库中的多组双语平行句对划分为源自源语言数据和源自目标语言数据，其中，属于所述源自源语言数据的双语平行句对中的目标语言数据是基于源语言数据翻译得到的，属于所述源自目标语言数据的双语平行句对中的源语言数据是基于目标语言数据翻译得到的；a dividing unit, configured to divide the plurality of groups of bilingual parallel sentence pairs in the first bilingual parallel database into groups derived from source language data and groups derived from target language data, wherein the target language data in the bilingual parallel sentence pairs derived from the source language data are obtained by translating based on the source language data, and the source language data in the bilingual parallel sentence pairs derived from the target language data are obtained by translating based on the target language data;

训练单元，用于通过所述源自源语言数据训练第一机器翻译模型，所述第一机器翻译模型用于将源语言翻译为目标语言。A training unit is used to train a first machine translation model using the data derived from the source language, wherein the first machine translation model is used to translate the source language into a target language.

第四方面，本申请实施例提供一种语言翻译装置，包括：In a fourth aspect, an embodiment of the present application provides a language translation device, including:

接收单元，用于接收待翻译数据，所述待翻译数据为源语言数据；A receiving unit, configured to receive data to be translated, wherein the data to be translated is source language data;

翻译单元，用于将所述待翻译数据翻译为其对应的目标语言数据，所述翻译单元包括机器翻译模型，所述机器翻译模型是通过如第一方面或者第一方面的各种可选实现方式中提供的机器模型的训练方法训练得到的第一机器翻译模型。A translation unit is used to translate the data to be translated into its corresponding target language data, the translation unit includes a machine translation model, and the machine translation model is a first machine translation model trained by the machine model training method provided in the first aspect or various optional implementations of the first aspect.

第五方面，本申请实施例提供一种计算机设备，其特征在于，包括：一个或多个处理器、一个或多个存储器，所述一个或多个存储器分别与所述一个或多个处理器耦合；所述一个或多个存储器用于存储计算机程序代码，所述计算机程序代码包括计算机指令；In a fifth aspect, an embodiment of the present application provides a computer device, characterized in that it includes: one or more processors and one or more memories, wherein the one or more memories are respectively coupled to the one or more processors; the one or more memories are used to store computer program codes, wherein the computer program codes include computer instructions;

所述处理器用于调用所述计算机指令执行：如第一方面或者第一方面的各种可选实现方式中提供的机器模型的训练方法。The processor is used to call the computer instruction to execute: a training method for a machine model as provided in the first aspect or various optional implementations of the first aspect.

第六方面，本申请实施例提供一种计算机设备，其特征在于，包括：一个或多个处理器、一个或多个存储器，所述一个或多个存储器分别与所述一个或多个处理器耦合；所述一个或多个存储器用于存储计算机程序代码，所述计算机程序代码包括计算机指令；In a sixth aspect, an embodiment of the present application provides a computer device, characterized in that it includes: one or more processors and one or more memories, wherein the one or more memories are respectively coupled to the one or more processors; the one or more memories are used to store computer program codes, wherein the computer program codes include computer instructions;

所述处理器用于调用所述计算机指令执行：如第二方面提供的语言翻译方法。The processor is used to call the computer instruction to execute: the language translation method provided in the second aspect.

第七方面，本申请实施例提供一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序包括程序指令，所述程序指令当被处理器执行时，执行如第一方面或者第一方面的各种可选实现方式中提供的机器翻译模型的训练方法。In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, wherein the computer program includes program instructions, and when the program instructions are executed by a processor, the training method of a machine translation model provided in the first aspect or various optional implementations of the first aspect is executed.

第八方面，本申请实施例提供一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序包括程序指令，所述程序指令当被处理器执行时，执行如第二方面所述的语言翻译方法。In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, wherein the computer program includes program instructions, and when the program instructions are executed by a processor, the language translation method as described in the second aspect is executed.

第九方面，本申请实施例提供了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述第一方面或者第一方面的各种可选实现方式中提供的方法。In a ninth aspect, an embodiment of the present application provides a computer program product or a computer program, the computer program product or the computer program including computer instructions, the computer instructions being stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided in the first aspect or various optional implementations of the first aspect.

第十方面，本申请实施例提供了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述第二方面提供的语言翻译方法。In a tenth aspect, an embodiment of the present application provides a computer program product or a computer program, the computer program product or the computer program including computer instructions, the computer instructions being stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the language translation method provided in the second aspect above.

本申请实施例提供的一种机器翻译模型的训练方法，对第一双语平行数据库中的数据进行划分，划分为源自源语言数据和源自目标语言数据，通过源自源语言数据对初始机器翻译模型进行微调，得到微调后的机器翻译模型，应用该微调后的机器翻译模型进行翻译任务，能够消除由于不同语言的数据之间存在的语言覆盖偏差对机器翻译模型的影响，从而提高通过该方法训练得到的机器翻译模型的性能，应用该模型可以得到译文质量和忠实度较高的译文。A method for training a machine translation model provided in an embodiment of the present application divides data in a first bilingual parallel database into data derived from a source language and data derived from a target language, fine-tunes an initial machine translation model using the data derived from the source language to obtain a fine-tuned machine translation model, and applies the fine-tuned machine translation model to perform translation tasks, thereby eliminating the influence of language coverage deviations between data in different languages on the machine translation model, thereby improving the performance of the machine translation model trained by the method, and applying the model can obtain translations with higher quality and fidelity.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本申请实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1A是本申请实施例提供的一种计算机系统的结构示意图；FIG1A is a schematic diagram of the structure of a computer system provided in an embodiment of the present application;

图1B是本申请实施例提供的一种机器翻译模型的训练方法的流程示意图；FIG1B is a flow chart of a method for training a machine translation model provided in an embodiment of the present application;

图2是本申请实施例提供的应用一种划分数据类型的方法得到的划分准确率的结果的示意图；FIG2 is a schematic diagram of a result of a classification accuracy rate obtained by applying a method for classifying data types provided in an embodiment of the present application;

图3-图5是本申请实施例提供的通过划分后的数据进行训练得到的几种机器模型的翻译表现结果的示意图；3-5 are schematic diagrams of translation performance results of several machine models obtained by training with divided data provided in embodiments of the present application;

图6是本申请实施例提供的另一种机器翻译模型的训练方法的流程示意图；FIG6 is a flow chart of another method for training a machine translation model provided in an embodiment of the present application;

图7是本申请实施例提供的应用一种机器翻译模型的训练方法的训练得到的模型的翻译表现结果的示意图；7 is a schematic diagram of the translation performance results of a model obtained by training using a training method for a machine translation model provided in an embodiment of the present application;

图8是本申请实施例提供的另一种机器翻译模型的训练方法的流程示意图；FIG8 is a flow chart of another method for training a machine translation model provided in an embodiment of the present application;

图9是本申请实施例提供的另一种机器翻译模型的训练方法的流程示意图；FIG9 is a flow chart of another method for training a machine translation model provided in an embodiment of the present application;

图10是本申请实施例提供的应用几种机器翻译模型的得到的译文质量的结果示意图；FIG10 is a schematic diagram of the results of translation quality obtained by applying several machine translation models provided in an embodiment of the present application;

图11A是本申请实施例提供的一种机器翻译模型的训练装置的结构示意图；FIG11A is a schematic diagram of the structure of a training device for a machine translation model provided in an embodiment of the present application;

图11B是本申请实施例提供的一种语言翻译装置的结构示意图；FIG11B is a schematic diagram of the structure of a language translation device provided in an embodiment of the present application;

图12是本申请实施例提供的一种语言翻译设备的结构示意图；FIG12 is a schematic diagram of the structure of a language translation device provided in an embodiment of the present application;

图13是本申请实施例提供的一种机器翻译模型的训练设备的结构示意图。FIG13 is a schematic diagram of the structure of a training device for a machine translation model provided in an embodiment of the present application.

具体实施方式Detailed ways

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The following will be combined with the drawings in the embodiments of the present application to clearly and completely describe the technical solutions in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.

首先，为便于理解，在介绍本申请实施例提供的机器翻译模型的训练方法之前，下面介绍本申请发明例所涉及的相关术语。First, for ease of understanding, before introducing the training method of the machine translation model provided in the embodiment of the present application, the following introduces the relevant terms involved in the invention example of the present application.

DL：Deep Learning，深度学习，是机器学习的分支，是一种试图使用包含复杂结构或由多重非线性变换构成的多个处理层对数据进行高层抽象的算法。DL: Deep Learning is a branch of machine learning. It is an algorithm that attempts to use multiple processing layers containing complex structures or composed of multiple nonlinear transformations to perform high-level abstraction of data.

NN：Neural Network，神经网络，在机器学习和认知科学领域的一种模仿生物神经网络结构和功能的深度学习模型。NN: Neural Network, a deep learning model in the field of machine learning and cognitive science that imitates the structure and function of biological neural networks.

DNN：Deep Neural Network，深度神经网络，网络结构较深的神经网络，深度学习中的核心模型。DNN: Deep Neural Network, a neural network with a deeper network structure, and the core model in deep learning.

NMT：Neural Machine Translation，神经机器翻译，最新一代基于神经网络的机器翻译技术。NMT: Neural Machine Translation, the latest generation of machine translation technology based on neural networks.

BLEU：机器翻译评测的标准方法，该值越高表示效果越好。BLEU: A standard method for evaluating machine translation. The higher the value, the better the effect.

人工智能(Artificial Intelligence，AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法，使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making.

人工智能技术是一门综合学科，涉及领域广泛，既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operating/interactive systems, mechatronics, and other technologies. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

机器学习(Machine Learning，ML)是一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为，以获取新的知识或技能，重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心，是使计算机具有智能的根本途径，其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。Machine Learning (ML) is a multi-disciplinary subject that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications are spread across all areas of artificial intelligence. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and self-learning.

自然语言处理(Nature Language processing，NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that can achieve effective communication between people and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language people use in daily life, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.

机器翻译，是指通过计算机将一种自然语言(源语言)的句子翻译为另一种自然语言(目标语言)的句子的翻译方式。通常，机器翻译是通过训练好的机器翻译模型将源语言的句子翻译为目标语言的句子。示例性的，源语言为中文，目标语言为英文，源语言的句子为“我是一个学生。”，通过机器翻译模型将该句子翻译为“Iamastudent.”。其中，机器翻译模型可以是通过大量的双语平行句对训练而成的。Machine translation refers to a translation method that translates a sentence in a natural language (source language) into another natural language (target language) by a computer. Usually, machine translation is to translate a sentence in the source language into a sentence in the target language by a trained machine translation model. Exemplary, the source language is Chinese, the target language is English, and the sentence in the source language is "I am a student.", which is translated into "I am a student." by the machine translation model. Among them, the machine translation model can be trained by a large number of bilingual parallel sentence pairs.

双语平行句对，由源语言数据和其对应的目标语言数据构成的内容对齐的数据，这里“内容对齐”是指源语言数据的内容和目标语言数据的内容具有翻译关系、意思表达一致。Bilingual parallel sentence pairs are content-aligned data consisting of source language data and its corresponding target language data. Here, "content alignment" means that the content of the source language data and the content of the target language data have a translation relationship and consistent meaning.

在本申请实施例中，双语平行句对可以划分为两类，分别是源自源语言数据和源自目标语言数据。In the embodiment of the present application, bilingual parallel sentence pairs can be divided into two categories, namely, those derived from source language data and those derived from target language data.

其中，源自源语言数据指的是由源语言的创作者先产生文本，再由人类译员翻译到目标语言，构成内容对齐的双语平行句对。也就是说，属于源自源语言数据的双语平行句对中的目标语言数据是基于源语言数据翻译得到的。The source language data refers to the text first generated by the creator of the source language and then translated into the target language by human translators to form content-aligned bilingual parallel sentence pairs. In other words, the target language data in the bilingual parallel sentence pairs derived from the source language data is obtained by translation based on the source language data.

其中，源自目标语言数据指的是由目标语言的创作者先产生文本，然后通过反向的翻译方向由人类译员翻译到源语言，构成内容对齐的双语平行句对。也就是说，属于源自目标语言数据的双语平行句对中的源语言数据是基于目标语言数据翻译得到的。The target language data refers to the text first generated by the creator of the target language, and then translated into the source language by human translators in the reverse translation direction to form content-aligned bilingual parallel sentence pairs. In other words, the source language data in the bilingual parallel sentence pairs derived from the target language data is obtained by translation based on the target language data.

语言覆盖偏差，源自不同语言的数据所覆盖的内容有明显的差异，将这种差异称作语言覆盖偏差。例如，源自中文的数据中可以包括“云南、张三、李四、大闸蟹、巴城”等具有中文特征的内容，而源自英文的数据中可以包括“California、National BasketballAssociation、Birmingham”等具有英文特征的内容。Language coverage bias: Data from different languages have obvious differences in the content they cover, which is called language coverage bias. For example, data from Chinese may include content with Chinese characteristics such as "Yunnan, Zhang San, Li Si, hairy crab, Bacheng", while data from English may include content with English characteristics such as "California, National Basketball Association, Birmingham".

随着人工智能技术研究和进步，人工智能技术在多个领域展开研究和应用，例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等，相信随着技术的发展，人工智能技术将在更多的领域得到应用，并发挥越来越重要的价值。With the research and advancement of artificial intelligence technology, artificial intelligence technology has been studied and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, drones, robots, smart medical care, smart customer service, etc. I believe that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.

通过本申请实施例提供的机器翻译模型的训练方法训练得到的机器翻译模型可以用于如下场景：The machine translation model trained by the machine translation model training method provided in the embodiment of the present application can be used in the following scenarios:

(1)机器翻译(1) Machine Translation

在该应用场景下，采用本申请实施例提供的方法所训练的机器翻译模型可以用于电子词典应用程序、电子书应用程序、网页浏览应用程序、社交应用程序、图文识别应用程序等支持翻译功能的应用程序中。当上述应用程序接收到待翻译数据时，机器翻译模型根据输入的待翻译数据，输出翻译结果。示例性的，待翻译数据包括文本类型、图片类型、音频类型和视频类型的内容中的至少一种。In this application scenario, the machine translation model trained by the method provided in the embodiment of the present application can be used in applications supporting translation functions such as electronic dictionary applications, e-book applications, web browsing applications, social applications, and image and text recognition applications. When the above application receives the data to be translated, the machine translation model outputs the translation result according to the input data to be translated. Exemplarily, the data to be translated includes at least one of text type, picture type, audio type, and video type content.

(2)对话问答(2) Dialogue Q&A

在该应用场景下，采用本申请实施例提供的方法所训练的机器翻译模型可以应用于智能终端或智能家居等智能设备中。以智能终端中设置的虚拟助理为例，该虚拟助理的自动回答功能是通过本申请实施例提供的机器翻译模型的训练方法训练得到的及翻译模型实现的。用户向虚拟助理提出有关翻译的问题，当虚拟助理接收到用户输入的问题时(用户输入的问题可以是通过语音或文字输入的形式实现)，机器翻译模型分局输入的问题，输出翻译结果，智能设备将翻译结果转换为语音或文字的形式，通过虚拟助理反馈给用户。In this application scenario, the machine translation model trained by the method provided in the embodiment of the present application can be applied to smart devices such as smart terminals or smart homes. Taking the virtual assistant set in the smart terminal as an example, the automatic answer function of the virtual assistant is obtained by training the training method of the machine translation model provided in the embodiment of the present application and the translation model is implemented. The user asks the virtual assistant questions about translation. When the virtual assistant receives the questions input by the user (the questions input by the user can be implemented in the form of voice or text input), the machine translation model divides the questions input by the bureau and outputs the translation results. The smart device converts the translation results into voice or text and feeds them back to the user through the virtual assistant.

上述仅以两种场景为例进行说明，本申请实施例提供的方法还可以用于其他的应用场景，如提取文本摘要，本申请实施例对具体的应用场景不作限定。The above description only takes two scenarios as examples. The method provided in the embodiments of the present application can also be used in other application scenarios, such as extracting text summaries. The embodiments of the present application do not limit the specific application scenarios.

本申请实施例提供的机器翻译模型的训练方法和语言翻译方法可以应用于具有较强数据处理能力的计算机设备中。在一些实施例中，本申请实施例提供的机器翻译模型的训练方法和语言翻译方法可以应用于个人计算机、工作站或服务器中，即可以通过个人计算机、工作站或服务器实现机器翻译以及训练机器翻译模型。The training method of the machine translation model and the language translation method provided in the embodiments of the present application can be applied to a computer device with strong data processing capabilities. In some embodiments, the training method of the machine translation model and the language translation method provided in the embodiments of the present application can be applied to a personal computer, a workstation or a server, that is, machine translation and training of the machine translation model can be implemented by a personal computer, a workstation or a server.

下面介绍本申请实施例提供的一种计算机系统，请参阅图1A，图1A为一种计算机系统的结构示意图，该计算机系统包括数据库10、训练设备11和执行设备12，执行设备12中可以包括第一设备110和第二设备120。A computer system provided in an embodiment of the present application is introduced below. Please refer to Figure 1A. Figure 1A is a structural diagram of a computer system. The computer system includes a database 10, a training device 11 and an execution device 12. The execution device 12 may include a first device 110 and a second device 120.

其中，数据库10包括不同语言的双语平行数据库、单语数据库，其中的数据可以作为样本数据用于机器翻译模型的训练。The database 10 includes bilingual parallel databases and monolingual databases in different languages, and the data therein can be used as sample data for training a machine translation model.

训练设备11可以为服务器、工作站、个人电脑等设备，用于通过从数据库10获取到的数据训练机器翻译模型。具体的，训练设备11可以获取第一双语平行数据库，然后将所述第一双语平行数据库中的多组双语平行句对划分为源自源语言数据和源自目标语言数据，进而通过源自源语言数据训练第一机器翻译模型，第一机器翻译模型用于将源语言翻译为目标语言。在一些实施例中，训练设备还可以通过第一双语平行数据库训练的机器翻译模型对从数据库10中获取的单语数据库中的单语数据进行翻译，得到伪平行句对，并将得到的伪平行句对添加到数据库10中。The training device 11 can be a server, workstation, personal computer or other device, and is used to train the machine translation model through the data obtained from the database 10. Specifically, the training device 11 can obtain a first bilingual parallel database, and then divide the multiple groups of bilingual parallel sentence pairs in the first bilingual parallel database into data derived from the source language and data derived from the target language, and then train the first machine translation model through the data derived from the source language, and the first machine translation model is used to translate the source language into the target language. In some embodiments, the training device can also translate the monolingual data in the monolingual database obtained from the database 10 through the machine translation model trained by the first bilingual parallel database to obtain pseudo-parallel sentence pairs, and add the obtained pseudo-parallel sentence pairs to the database 10.

执行设备12可以存储上述训练设备11训练的机器翻译模型，执行设备12中的第一设备110与服务器之间通过通信网络进行传输。The execution device 12 can store the machine translation model trained by the training device 11, and the first device 110 in the execution device 12 is transmitted to the server via a communication network.

其中，第一设备110中安装有支持翻译功能的应用程序，该应用程序可以是电子词典应用程序、电子书阅读应用程序、网页浏览应用程序、社交应用程序等。第一设备110可以是智能手机、智能手表、平板电脑、笔记本电脑、智能机器人等终端设备。The first device 110 is installed with an application supporting the translation function, which may be an electronic dictionary application, an e-book reading application, a web browsing application, a social application, etc. The first device 110 may be a terminal device such as a smart phone, a smart watch, a tablet computer, a laptop computer, an intelligent robot, etc.

第二设备120可以是独立的物理服务器，也可以是多个物理服务器构成的服务器集群或分布式系统，还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、内容分发网络、以及大数据和人工智能平台等基础云计算服务的云服务器。在一些实施例中，第二设备120是第一设备110中应用程序的后台服务器。The second device 120 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, content distribution networks, and big data and artificial intelligence platforms. In some embodiments, the second device 120 is a background server of an application in the first device 110.

在一些实施例中，本申请实施例提供的语言翻译方法可以由第二设备120执行。当第一设备110获取待翻译数据后，可以通过通信网络发送至第二设备120，第二设备120接收到待翻译数据后执行上述语言翻译的方法，得到翻译结果，进一步地，第二设备120将该翻译结果发送至第一设备110，第一设备110将翻译结果通过应用程序展示。In some embodiments, the language translation method provided in the embodiments of the present application can be executed by the second device 120. When the first device 110 obtains the data to be translated, it can be sent to the second device 120 through the communication network. After receiving the data to be translated, the second device 120 executes the above-mentioned language translation method to obtain a translation result. Further, the second device 120 sends the translation result to the first device 110, and the first device 110 displays the translation result through an application.

在另一些实施例中，第一设备110和第二设备120可以是同一设备，本申请实施例提供的语言翻译方法也可以由该设备执行，本申请实施例不作限定。In other embodiments, the first device 110 and the second device 120 may be the same device, and the language translation method provided in the embodiment of the present application may also be executed by the device, which is not limited in the embodiment of the present application.

应理解，在一些实施例中，训练设备11和执行设备12可以是同一设备，该设备可以执行机器翻译模型的训练工作和语言翻译工作，本申请实施例对该设备的功能不作限定。It should be understood that in some embodiments, the training device 11 and the execution device 12 may be the same device, which can perform machine translation model training and language translation work, and the embodiments of the present application do not limit the function of the device.

本申请实施例提供的方案涉及人工智能的自然语言处理技术，具体通过如下实施例进行说明：The solution provided in the embodiments of the present application relates to the natural language processing technology of artificial intelligence, which is specifically described by the following embodiments:

请参见图1B，是本申请实施例提供的一种机器翻译模型的训练方法的流程示意图。该机器翻译模型的训练方法可以由上述图1A中的训练设备11执行，本申请实施例以上述图1A中的训练设备11执行为例进行说明，如图1B所示，该机器翻译模型的训练方法包括但不限于以下步骤：Please refer to FIG. 1B , which is a flow chart of a method for training a machine translation model provided in an embodiment of the present application. The method for training the machine translation model can be executed by the training device 11 in FIG. 1A above. The embodiment of the present application is described by taking the execution of the training device 11 in FIG. 1A above as an example. As shown in FIG. 1B , the method for training the machine translation model includes but is not limited to the following steps:

S1、获取双语平行数据库，该双语平行数据库包括第一双语平行数据库。S1. Obtain a bilingual parallel database, where the bilingual parallel database includes a first bilingual parallel database.

双语平行数据库包括多组双语平行句对。双语平行句对为由源语言数据和目标语言数据构成的内容对齐的数据。也就是说，在双语平行数据库中，每条源语言数据都有与之对应的目标语言数据。The bilingual parallel database includes multiple sets of bilingual parallel sentence pairs. The bilingual parallel sentence pairs are content-aligned data consisting of source language data and target language data. That is, in the bilingual parallel database, each source language data has corresponding target language data.

在一些实施例中，双语平行数据库为第一双语平行数据库，该第一双语平行数据库中包括多组双语平行句对，多组双语平行句对为源自源语言数据或源自目标语言数据，即每组双语平行句对中的一种语言数据是由人类译员翻译得到的。In some embodiments, the bilingual parallel database is a first bilingual parallel database, which includes multiple groups of bilingual parallel sentence pairs, and the multiple groups of bilingual parallel sentence pairs are derived from source language data or from target language data, that is, one language data in each group of bilingual parallel sentence pairs is translated by a human translator.

在另一些实施例中，双语平行数据库包括第一双语平行数据库和伪平行句对，伪平行句对为翻译单语数据库中的第一语言原文本和通过机器翻译模型翻译第一语言原文本得到的第二语言文本组成的句对。In other embodiments, the bilingual parallel database includes a first bilingual parallel database and pseudo-parallel sentence pairs, wherein the pseudo-parallel sentence pairs are sentence pairs consisting of a first language original text in a translated monolingual database and a second language text obtained by translating the first language original text through a machine translation model.

S2、将第一双语平行数据库中的双语平行句对划分为源自源语言数据和源自目标语言数据。S2. Divide the bilingual parallel sentence pairs in the first bilingual parallel database into those derived from source language data and those derived from target language data.

在一种实现中，S2可以包括但不限于以下步骤：In one implementation, S2 may include but is not limited to the following steps:

S21、从第一双语平行数据库中获取待处理平行句对。S21. Obtain parallel sentence pairs to be processed from the first bilingual parallel database.

S22、根据待处理平行句对覆盖的内容，确定待处理平行句对的数据类型，所数据类型包括源自源语言数据和源自目标语言数据。S22. Determine the data type of the parallel sentence pair to be processed according to the content covered by the parallel sentence pair to be processed, where the data type includes data derived from the source language and data derived from the target language.

不同语言所覆盖的内容分布不同，即在待处理平行句对中，源语言数据和目标语言数据所覆盖的内容分布不同，因此，可以通过源语言数据和目标语言数据所覆盖的内容分布之间的偏差确定该待处理平行句对的数据类型。Different languages cover different content distributions, that is, in the parallel sentence pairs to be processed, the content distributions covered by the source language data and the target language data are different. Therefore, the data type of the parallel sentence pairs to be processed can be determined by the deviation between the content distributions covered by the source language data and the target language data.

在具体的实施方式中，根据待处理平行句对覆盖的内容，确定待处理平行句对的数据类型，可以包括以下过程：In a specific implementation, determining the data type of the parallel sentence pair to be processed according to the content covered by the parallel sentence pair to be processed may include the following process:

S221、通过源语言单语数据库训练第一语言模型，第一语言模型用于确定双语平行句对中的源语言数据在源语言中出现的概率。S221. Train a first language model using a source language monolingual database, where the first language model is used to determine the probability of source language data in a bilingual parallel sentence pair appearing in the source language.

S222、将待处理平行句对中的源语言数据输入到第一语言模型中，得到待处理平行句对中的源语言数据来源于源语言的第一概率。S222: Input the source language data in the parallel sentence pairs to be processed into the first language model to obtain a first probability that the source language data in the parallel sentence pairs to be processed are derived from the source language.

S223、通过目标语言单语数据库训练第二语言模型，第二语言模型用于确定双语平行句对中的目标语言数据在目标语言中出现的概率。S223. Train a second language model using a target language monolingual database, where the second language model is used to determine the probability of target language data in a bilingual parallel sentence pair appearing in the target language.

S224、将所述待处理平行句对中的目标语言数据输入到第二语言模型中，得到待处理平行句对中的目标语言数据来源于目标语言的第二概率。S224: Input the target language data in the parallel sentence pairs to be processed into a second language model to obtain a second probability that the target language data in the parallel sentence pairs to be processed are from the target language.

S225、根据第一概率和第二概率之间的偏差确定待处理平行句对的数据类型。S225. Determine the data type of the parallel sentence pair to be processed according to the deviation between the first probability and the second probability.

在一些实现方式中，S225可以包括以下过程：根据第一概率和第二概率之间的偏差确定待处理平行句对的评分，评分用于确定所述待处理平行句对的数据类型；当该评分大于目标阈值时，确定待处理平行句对属于源自源语言数据；当该评分小于目标阈值时，确定待处理平行句对属于源自目标语言数据。In some implementations, S225 may include the following process: determining a score of the parallel sentence pair to be processed based on a deviation between the first probability and the second probability, the score being used to determine a data type of the parallel sentence pair to be processed; when the score is greater than a target threshold, determining that the parallel sentence pair to be processed is derived from source language data; when the score is less than the target threshold, determining that the parallel sentence pair to be processed is derived from target language data.

在另一些实施方式中，S225也可以通过其他方式实现，例如，比较第一概率与第二概率的大小，当第一概率大于第二概率时，则确定待处理平行句对属于源自源语言数据，当第一概率小于第二概率时，待处理平行句对属于源自目标语言数据。In other implementations, S225 may also be implemented in other ways, for example, by comparing the first probability with the second probability. When the first probability is greater than the second probability, it is determined that the parallel sentence pair to be processed is derived from the source language data. When the first probability is less than the second probability, the parallel sentence pair to be processed is derived from the target language data.

下面描述S221-S224中根据不同语言所覆盖的内容分布来检测待处理平行句对的数据类型的原理。具体地，可以使用表示源语言所覆盖的内容分布，表示目标语言所覆盖的内容分布。给定待处理平行句对<x,y>，其被源语言或目标语言覆盖的概率可以表示为：The following describes the principle of detecting the data type of the parallel sentence pairs to be processed according to the content distribution covered by different languages in S221-S224. Indicates the content distribution covered by the source language, Represents the content distribution covered by the target language. Given a parallel sentence pair <x, y> to be processed, the probability of it being covered by the source language or the target language can be expressed as:

可以使用一个评分来量化以上两个概率的差：The difference between the two probabilities above can be quantified using a score:

其中由于和仅和语言有关，而与待处理平行句对<x,y>无关，所以当源语言和目标语言给定时，c为常数。可以得到，分数值越高的平行句对有更高的概率属于源自源语言的数据，评分值越低的平行句对有更高的概率属于源自目标语言的数据。为了刻画与可以使用源语言单语数据库与目标语言单语数据库分别训练基于自注意力机制的语言模型：第一语言模型与第二语言模型并且使用语言模型的概率来估计与具体地，令则上述评分可以表示为：in because and It is only related to the language, but not to the parallel sentence pair <x, y> to be processed, so when the source language and the target language are given, c is a constant. It can be obtained that the parallel sentence pairs with higher scores have a higher probability of belonging to the data from the source language, and the parallel sentence pairs with lower scores have a higher probability of belonging to the data from the target language. In order to characterize and You can use the source language monolingual database and the target language monolingual database to train the language model based on the self-attention mechanism: the first language model With the second language model And use the probability of the language model to estimate and Specifically, The above score can be expressed as:

语言模型训练结束后，可以通过一个小规模的已知数据类型的双语平行数据库确定c的值。After the language model training is completed, the value of c can be determined through a small-scale bilingual parallel database of known data types.

应理解，第一语言模型和第二语言模型可以为上述基于自注意力机制的语言模型，也可以为基于其他任意架构的语言模型，如基于循环神经网络的语言模型、基于卷积神经网络的语言模型。It should be understood that the first language model and the second language model can be the above-mentioned language model based on the self-attention mechanism, or can be a language model based on any other architecture, such as a language model based on a recurrent neural network or a language model based on a convolutional neural network.

其中，语言模型为通过大量的样本数据以极大似然估计的方法训练得到的语言模型。通过语言模型确定一种语言数据来源于该语言的概率的原理如下，一个语言模型通常构建为字符串s的概率分布p(s)，这里的p(s)实际上反映的是s作为一个句子出现的概率。这里的概率指的是组成字符串的这个组合，在训练语料中出现的似然。假设训练语料来自于人类的语言，那么可以认为这个概率是输入的一句话是否是人类的语言的概率。语言模型可以包括输入层、投影层、隐藏层和输出层。字符串s由多个词组成，在输入层使用独热(one-hot)向量的形式来表示每个词，在投影层中将每个独热向量转换为另一种词向量，连接以创建矩阵e，然后将该矩阵展平并通过隐藏层进一步转换为隐藏向量，最后根据隐藏向量使用softmax函数计算并输出字符串s的概率分布。其中，句子(字符串)s的概率可以为该句子中每个词在该句子中出现的概率之积，可以根据输入的句子中的一个词预测该词的后一个词在该句子中出现的概率。The language model is a language model trained by a large amount of sample data using the maximum likelihood estimation method. The principle of determining the probability that a language data comes from the language through the language model is as follows: a language model is usually constructed as the probability distribution p(s) of a string s, where p(s) actually reflects the probability that s appears as a sentence. The probability here refers to the likelihood of this combination of string components appearing in the training corpus. Assuming that the training corpus comes from human language, it can be considered that this probability is the probability of whether the input sentence is human language. The language model may include an input layer, a projection layer, a hidden layer, and an output layer. The string s consists of multiple words. Each word is represented in the form of a one-hot vector in the input layer. Each one-hot vector is converted into another word vector in the projection layer, connected to create a matrix e, and then the matrix is flattened and further converted into a hidden vector through the hidden layer. Finally, the probability distribution of the string s is calculated and output based on the hidden vector using the softmax function. The probability of a sentence (string) s can be the product of the probability of each word in the sentence appearing in the sentence, and the probability of the next word of the word appearing in the sentence can be predicted based on a word in the input sentence.

例如，以单语数据库为中文单语数据库为例，说明语言模型的训练过程。示例性的，以句子“我有一个梦想”为一个样本数据训练语言模型的过程可以是：首先构造目标模型，该目标模型包括输入层，投影层，隐藏层和输出层；然后将“我有一个梦想”输入至该目标模型中，经过输入层，投影层，隐藏层和输出层，最后输出该模型预测“我有一个梦想”这句话的概率，模型在训练过程中被指导最大化此概率。以此方式，将中文单语数据库中的每一条数据输入至目标模型中，经过训练，得到优化后的目标模型，即为训练完成的语言模型。For example, the monolingual database is a Chinese monolingual database, and the training process of the language model is explained. Exemplarily, the process of training a language model with the sentence "I have a dream" as a sample data can be: first construct a target model, the target model includes an input layer, a projection layer, a hidden layer and an output layer; then "I have a dream" is input into the target model, through the input layer, the projection layer, the hidden layer and the output layer, and finally the model predicts the probability of the sentence "I have a dream". The model is guided to maximize this probability during the training process. In this way, each piece of data in the Chinese monolingual database is input into the target model, and after training, the optimized target model is obtained, which is the trained language model.

示例性的，如图2所示，通过上述S2所描述的确定双语平行句对的数据类型的方法分别对英语-汉语(En-Zh)平行数据库、英语-日语(En-Ja)平行数据库和英语-德语(En-De)平行数据库中的数据进行划分，得到如图2所示的划分准确率。其中，方法FT为现有的基于卷积神经网络文本分类方法，方法Ours为本申请实施例中S2所描述的方法，对三种数据库中的数据进行划分的准确率用F1值来表示，F1值越高，表示对数据的划分准确率越高。Exemplarily, as shown in FIG2, the data in the English-Chinese (En-Zh) parallel database, the English-Japanese (En-Ja) parallel database, and the English-German (En-De) parallel database are divided respectively by the method for determining the data type of the bilingual parallel sentence pair described in S2 above, and the division accuracy as shown in FIG2 is obtained. Among them, method FT is an existing text classification method based on convolutional neural network, method Ours is the method described in S2 in the embodiment of the present application, and the accuracy of dividing the data in the three databases is represented by the F1 value. The higher the F1 value, the higher the accuracy of data division.

当将双语数据库中的数据划分完之后，可以通过划分好的数据训练机器翻译模型。具体的，可以根据数据类型训练三种机器翻译模型：(1)通过源自目标语言数据训练得到的模型1；(2)通过源自源语言数据训练得到的模型2；(3)不加区分数据类型，通过第一双语数据库训练得到的模型3。After the data in the bilingual database is divided, the machine translation model can be trained with the divided data. Specifically, three machine translation models can be trained according to the data type: (1) Model 1 trained with data from the target language; (2) Model 2 trained with data from the source language; (3) Model 3 trained with the first bilingual database without distinguishing the data type.

示例性的，分别以英语-汉语(En-Zh)平行数据库、英语-日语(En-Ja)平行数据库和英语-德语(En-De)平行数据库中的数据为训练样本训练对应的上述三种机器翻译模型。从译文整体质量、翻译忠实度、翻译流畅度三个方面对每个模型的翻译表现进行评估，得到如图3-图5的结果。For example, the data in the English-Chinese (En-Zh) parallel database, the English-Japanese (En-Ja) parallel database, and the English-German (En-De) parallel database are used as training samples to train the corresponding three machine translation models. The translation performance of each model is evaluated from three aspects: overall translation quality, translation fidelity, and translation fluency, and the results are shown in Figures 3 to 5.

如图3所示，第一列Data Origin表示为机器翻译模型的训练样本的数据类型，Target表示源自目标语言数据，Source表示源自源语言数据，Both表示不区分数据类型地使用语平行数据库中的全部的数据。其中，从第二列起，每列BLEU值中的粗体表示该列分数的最高分，下划线标出的分数表示该列的第二高分，BLEU值表示译文的质量，该值越高表示译文的质量越好。示例性的，En-Zh表示英文与中文互译，第二列表示以英文为源语言，中文为目标语言，以第二列为例，则Target对应的分数33.2表示以源自中文的数据训练得到的模型翻译的译文质量，Source对应的分数36.5表示以源自英文的数据训练得到的模型翻译的译文质量，Both对应的分数36.6表示不加区分源自中文或源自英文的数据，以En-Zh双语平行数据库中的全部的数据训练得到的模型翻译的译文质量。则由图3中的每列的各BLEU值可知，仅使用源自源语言数据训练得到的模型与不区分数据类型地使用语平行数据库中的全部的数据训练得到的模型在译文质量上相差较小，并且仅使用源自源语言数据训练得到的模型的译文质量也取得过最高分，由此可知，仅使用源自源语言数据训练得到的模型的在译文质量上的翻译表现较好。As shown in Figure 3, the first column Data Origin represents the data type of the training sample for the machine translation model, Target represents the data derived from the target language, Source represents the data derived from the source language, and Both represents the use of all the data in the language parallel database without distinguishing the data type. Among them, starting from the second column, the bold in each column of BLEU values represents the highest score in the column, and the underlined score represents the second highest score in the column. The BLEU value represents the quality of the translation. The higher the value, the better the quality of the translation. Exemplarily, En-Zh represents English-Chinese translation, and the second column represents English as the source language and Chinese as the target language. Taking the second column as an example, the score 33.2 corresponding to Target represents the translation quality of the model translation obtained by training with data derived from Chinese, the score 36.5 corresponding to Source represents the translation quality of the model translation obtained by training with data derived from English, and the score 36.6 corresponding to Both represents the translation quality of the model translation obtained by training with all the data in the En-Zh bilingual parallel database without distinguishing between data derived from Chinese or English. It can be seen from the BLEU values in each column in Figure 3 that the translation quality of the model trained only using data from the source language and the model trained using all the data in the language parallel database without distinguishing the data type are very similar, and the translation quality of the model trained only using data from the source language has also achieved the highest score. It can be seen that the model trained only using data from the source language has better translation performance in terms of translation quality.

如图4所示，第一列Data Origin表示为机器翻译模型的训练样本的数据类型，Target表示源自目标语言数据，Source表示源自源语言数据，Both表示不区分数据类型地使用语平行数据库中的全部的数据。其中，从第二列起，每列的F-measure值中的粗体表示该列分数的最高分，下划线标出的分数表示该列的第二高分，F-measure值标识译文的忠实度，该值越高说明译文的忠实度越高。这里以第一双语平行数据库为En-Zh双语平行数据库为例，示例性的，En→Zh表示英文为源语言，中文为目标语言，En←Zh表示中文为源语言，英文为目标语言，分别确定名词(noun)、动词(verb)和形容词或副词(adj)这些实词的F-measure值。则由图4可知，仅使用源自源语言数据训练得到的模型的译文的忠实度取得多次最高分，由此可知，在译文的忠实度上，仅使用源自源语言数据训练得到的模型在三种模型的表现最好。而使用源自目标语言数据训练得到的模型在译文的忠实度上表现较差(F-measure值相对于其他两种模型较低)，而且，如图5中的第四列至第七列的分数，不区分数据类型地使用语平行数据库中的全部的数据训练得到的模型相比于仅使用源自源语言数据训练得到的模型的F-measure值较低，说明不加区分地使用额外的源自目标语言数据训练得到的模型未能提高译文的忠实度。As shown in FIG4 , the first column Data Origin indicates the data type of the training sample of the machine translation model, Target indicates the target language data, Source indicates the source language data, and Both indicates that all the data in the language parallel database is used without distinguishing the data type. Among them, starting from the second column, the bold in the F-measure value of each column indicates the highest score of the column, and the underlined score indicates the second highest score of the column. The F-measure value indicates the fidelity of the translation. The higher the value, the higher the fidelity of the translation. Here, the first bilingual parallel database is the En-Zh bilingual parallel database as an example. For example, En→Zh indicates that English is the source language and Chinese is the target language, and En←Zh indicates that Chinese is the source language and English is the target language. The F-measure values of nouns, verbs, and adjectives or adverbs are determined respectively. It can be seen from FIG4 that the fidelity of the translation obtained by using only the model trained from the source language data has achieved the highest score many times. Therefore, in terms of the fidelity of the translation, the model trained using only the source language data performs best among the three models. The model trained with data from the target language performed poorly in terms of translation fidelity (the F-measure value was lower than that of the other two models). Moreover, as shown in the scores from the fourth to seventh columns in Figure 5, the model trained with all the data in the language parallel database without distinguishing the data type had a lower F-measure value than the model trained only with data from the source language, indicating that the indiscriminate use of additional models trained with data from the target language failed to improve the fidelity of the translation.

如图5所示，翻译的流畅度是由语言模型的困惑度(即PPL)来衡量的，PPL越低，流畅度越好。Diff指一个模型对应的PPL相对于“Both”对应的PPL的相对变化。“No.Abs.”表示没有对实词进行抽象，“Count.Abs.”表示将所有实词用相应的词性标签进行抽象。以WMT20En-Zh双语平行数据库为例，训练的三种模型在流畅度的表现如图5所示，当将待翻译的文本中的所有实词用相应的词性标签进行抽象后，以源自源语言数据训练的模型的译文流畅度相对于不区分数据类型地使用语平行数据库中的全部的数据训练得到的模型的译文流畅度的变化也很小，+1.4％和+2.2％，因此可以说明当将所有实词用相应的词性标签进行抽象后，仅使用源自源语言数据训练的模型在译文的流畅度上也有较好的表现。As shown in Figure 5, the fluency of translation is measured by the perplexity of the language model (i.e., PPL). The lower the PPL, the better the fluency. Diff refers to the relative change of the PPL corresponding to a model relative to the PPL corresponding to "Both". "No.Abs." means that no content words are abstracted, and "Count.Abs." means that all content words are abstracted with corresponding part-of-speech tags. Taking the WMT20En-Zh bilingual parallel database as an example, the performance of the three trained models in fluency is shown in Figure 5. When all content words in the text to be translated are abstracted with corresponding part-of-speech tags, the fluency of the translation of the model trained with the source language data is also very small compared with the translation fluency of the model trained with all the data in the bilingual parallel database without distinguishing data types, which is +1.4% and +2.2%. Therefore, it can be shown that when all content words are abstracted with corresponding part-of-speech tags, the model trained only with source language data also has a better performance in translation fluency.

应理解，双语平行数据库中的双语平行句对都是由人类译员将一种语言(第一语言)的句子翻译为另一种语言(第二语言)的句子，将双语平行句对的第一语言作为该双语平行句对的原始语言(即该双语平行句对源自哪种语言)。由于前人的工作忽视了数据的原始语言对神经机器翻译模型的影响，大多数大规模双语平行数据库中的双语平行句对的原始语言信息在数据构建与整理的过程中都丢失了。由上述的通过不同的原始语言的数据(源自源语言数据和源自目标语言数据)训练的模型的表现可知，由于双语平行数据库中的大量的双语平行句对的原始语言不同，通过不加区分地使用不同原始语言的数据训练机器翻译模型翻译的译文的质量、忠实度等较差，因此，根据双语平行句对的原始语言可以将双语平行数据库中的双语平行句对划分为源自源语言数据和源自目标语言数据，针对划分好的数据可以进一步训练机器翻译模型，提高机器翻译模型的性能。It should be understood that the bilingual parallel sentence pairs in the bilingual parallel database are all translated by human translators from sentences in one language (first language) to sentences in another language (second language), and the first language of the bilingual parallel sentence pair is used as the original language of the bilingual parallel sentence pair (i.e., which language the bilingual parallel sentence pair originates from). Since the previous work ignored the influence of the original language of the data on the neural machine translation model, the original language information of the bilingual parallel sentence pairs in most large-scale bilingual parallel databases was lost in the process of data construction and collation. From the performance of the above-mentioned model trained with data in different original languages (derived from the source language data and derived from the target language data), it can be seen that since the original languages of a large number of bilingual parallel sentence pairs in the bilingual parallel database are different, the quality and fidelity of the translations translated by indiscriminately using data in different original languages to train the machine translation model are poor. Therefore, the bilingual parallel sentence pairs in the bilingual parallel database can be divided into those derived from the source language data and those derived from the target language data according to the original language of the bilingual parallel sentence pairs, and the machine translation model can be further trained for the divided data to improve the performance of the machine translation model.

S3、通过源自源语言数据训练第一机器翻译模型。S3. Training a first machine translation model by using data derived from a source language.

在本申请实施例中，通过源自源语言数据训练第一机器翻译模型可以通过以下几种实现方式实现，下面具体地描述S3的几种实现方式的过程。In the embodiment of the present application, training the first machine translation model by using source language data can be implemented by the following several implementation methods. The following specifically describes the processes of several implementation methods of S3.

实现方式1：以第一双语平行数据库中划分好的源自源语言数据作为训练样本，训练得到第一机器翻译模型。具体的，以源自源语言数据中的源语言数据作为输入，目标语言数据作为输出训练第一机器翻译模型。Implementation method 1: Use the source language data divided in the first bilingual parallel database as training samples to train a first machine translation model. Specifically, use the source language data from the source language data as input and the target language data as output to train the first machine translation model.

实现方式2：首先以未划分数据类型的第一双语平行数据库训练第一机器翻译模型，然后通过划分好的源自源语言数据对训练好的第一机器翻译模型进行微调。Implementation method 2: First, a first machine translation model is trained with a first bilingual parallel database of unclassified data types, and then the trained first machine translation model is fine-tuned using the classified source language data.

具体的，请参阅图6，图6为实现方式2的流程示意图。For details, please refer to FIG. 6 , which is a flowchart of implementation method 2. FIG.

其中，上述S3可以包括图6中的以下步骤：The above S3 may include the following steps in FIG6:

S301、通过第一双语平行数据库训练得到第一机器翻译模型。S301. Obtain a first machine translation model through training on a first bilingual parallel database.

将未划分数据类型的第一双语平行数据库中的双语平行句对作为训练样本，以每一组的双语平行句对中的源语言数据作为输入，目标语言数据作为输出，训练得到初始机器翻译模型，也称为第一机器翻译模型。The bilingual parallel sentence pairs in the first bilingual parallel database without data type division are used as training samples, the source language data in each group of bilingual parallel sentence pairs is used as input, and the target language data is used as output, and an initial machine translation model is trained, also called the first machine translation model.

S302、通过源自源语言数据微调第一机器翻译模型。S302: Fine-tune the first machine translation model using source language data.

以划分好的源自源语言数据中的源语言数据作为输入，目标语言数据作为输出，训练通过S301中训练得到的第一机器翻译模型，得到微调后的第一机器翻译模型。The divided source language data from the source language data is used as input and the target language data is used as output to train the first machine translation model obtained through the training in S301 to obtain a fine-tuned first machine translation model.

其中，可以理解的，对机器翻译模型的微调为通过部分数据进一步地对机器翻译模型进行训练。It can be understood that fine-tuning the machine translation model is to further train the machine translation model through partial data.

下面描述通过实现方式2所描述的方法训练得到第一机器翻译模型的翻译表现，示例性的，以源自不同语言的数据对其分别对应的第一机器翻译模型进行微调，使用微调后的第一机器翻译模型进行翻译，应用译文质量(BLEU值)来表征模型的翻译表现。如图7所示，图7为六个翻译方向(En-Zh、Zh–En、En-Ja、Ja–En、En-De、De–En，其中，En-Zh表示英文En为源语言，中文Zh为目标语言)上，使用第一双语数据库训练的第一机器模型和微调第一机器翻译模型后的BLEU值。其中，Baseline对应的行中的数值表示六个翻译方向上使用的第一双语平行数据库训练的第一机器模型的BLEU值，Tune对应的行中的数值表示六个翻译方向上使用源自不同语言的数据对其分别对应的第一机器模型进行微调后的模型的BLEU值，Average表示每行数值的平均值。则由图可知，在六个翻译方向上，使用源自源语言数据微调后的模型的BLEU值相对于没有微调的第一机器翻译模型的BLEU值较高，说明通过上述实现方式2所描述的方法训练得到的翻译模型能够提高译文质量。The following describes the translation performance of the first machine translation model obtained by training the method described in implementation method 2. Exemplarily, the first machine translation models corresponding to the first machine translation models are fine-tuned with data from different languages, and the fine-tuned first machine translation model is used for translation, and the translation quality (BLEU value) is used to characterize the translation performance of the model. As shown in Figure 7, Figure 7 shows the BLEU values of the first machine model trained with the first bilingual database and the first machine translation model after fine-tuning in six translation directions (En-Zh, Zh-En, En-Ja, Ja-En, En-De, De-En, where En-Zh means English En is the source language and Chinese Zh is the target language). Among them, the values in the row corresponding to Baseline represent the BLEU values of the first machine model trained with the first bilingual parallel database used in the six translation directions, the values in the row corresponding to Tune represent the BLEU values of the model after the first machine models corresponding to the first machine models are fine-tuned using data from different languages in the six translation directions, and Average represents the average value of each row of values. It can be seen from the figure that in the six translation directions, the BLEU value of the model fine-tuned using the source language data is higher than the BLEU value of the first machine translation model without fine-tuning, indicating that the translation model trained by the method described in the above implementation method 2 can improve the quality of the translation.

实现方式3：请参阅图8，图8为本申请实施例提供的另一种机器翻译模型的训练方法的流程示意图，其中，上述S3可以包括图8中的以下步骤：Implementation method 3: Please refer to FIG. 8 , which is a flow chart of another method for training a machine translation model provided in an embodiment of the present application, wherein the above S3 may include the following steps in FIG. 8 :

S31a、获取单语数据库，该单语数据库包括多条语言为目标语言的原文本。S31a, obtaining a monolingual database, wherein the monolingual database includes a plurality of original texts in the target language.

S32a、将单语数据库中的每一条语言为目标语言的原文本输入到第二机器翻译模型，得到每一条原文本对应的源语言翻译文本，第二机器翻译模型用于将目标语言翻译为源语言。S32a, input each original text in the target language in the monolingual database into the second machine translation model to obtain a source language translation text corresponding to each original text, and the second machine translation model is used to translate the target language into the source language.

其中，第二机器翻译模型为通过第一双语平行数据库训练得到的机器翻译模型，可以通过输入第一双语平行数据库每组双语平行句对中的目标语言数据，输出其对应的源语言数据的方式进行训练。Among them, the second machine translation model is a machine translation model trained by the first bilingual parallel database, and can be trained by inputting the target language data in each group of bilingual parallel sentence pairs in the first bilingual parallel database and outputting the corresponding source language data.

在一些实施例中，当得到单语数据库的多条目标语言的原文本和其分别对应的源语言翻译文本后，对每一条目标语言的原文本和其对应的源语言翻译文本组成的伪平行句对加上指示信息，例如标签BT(意为反向翻译)。其中，反向翻译表示由目标语言翻译为源语言。In some embodiments, after obtaining multiple target language original texts and their corresponding source language translation texts in a monolingual database, indication information, such as a label BT (meaning back translation), is added to each pseudo-parallel sentence pair consisting of the target language original text and its corresponding source language translation text. Back translation means translation from the target language to the source language.

S33a、将单语数据库中的多条目标语言的原文本和其分别对应的源语言翻译文本组成的多组伪平行句对添加至双语平行数据库。S33a, adding multiple sets of pseudo-parallel sentence pairs consisting of multiple original texts in the target language and their corresponding source language translation texts in the monolingual database to the bilingual parallel database.

S34a、通过双语平行数据库训练得到第一机器翻译模型。S34a, obtaining a first machine translation model through bilingual parallel database training.

S35a、通过源自源语言数据微调第一机器翻译模型。该过程可以参见上述实现方式1中S302的相关描述，此处不再赘述。S35a, fine-tuning the first machine translation model using the source language data. This process can refer to the relevant description of S302 in the above implementation mode 1, and will not be repeated here.

实现方式4：请参阅图9，图9为本申请实施例提供的另一种机器翻译模型的训练方法的流程示意图，其中，上述S3可以包括图9中的以下步骤：Implementation 4: Please refer to FIG. 9 , which is a flow chart of another method for training a machine translation model provided in an embodiment of the present application, wherein the above S3 may include the following steps in FIG. 9 :

S31b、获取单语数据库，该单语数据库包括多条语言为源语言的原文本。S31b, obtaining a monolingual database, wherein the monolingual database includes a plurality of original texts whose language is a source language.

S32b、将单语数据库中的每一条语言为源语言的原文本输入到第三机器翻译模型，得到每一条原文本对应的目标语言翻译文本，第三机器翻译模型用于将源语言翻译为目标语言。S32b, input each original text in the monolingual database in the source language into the third machine translation model to obtain a target language translation text corresponding to each original text, and the third machine translation model is used to translate the source language into the target language.

其中，第三机器翻译模型为通过第一双语平行数据库训练得到的，可以通过输入第一双语平行数据库每组双语平行句对中的源语言数据，输出其对应的目标语言数据的方式进行训练。Among them, the third machine translation model is obtained by training the first bilingual parallel database, and can be trained by inputting the source language data in each group of bilingual parallel sentence pairs in the first bilingual parallel database and outputting the corresponding target language data.

在一些实施例中，当得到单语数据库的多条源语言的原文本和其分别对应的目标语言翻译文本后，对每一条源语言的原文本和其对应的目标语言翻译文本组成的伪平行句对加上指示信息，例如标签FT(意为正向翻译)。其中，正向翻译表示由源语言翻译为目标语言。In some embodiments, after obtaining multiple source language original texts and their corresponding target language translation texts in a monolingual database, indication information, such as a label FT (meaning forward translation), is added to each pseudo-parallel sentence pair consisting of the source language original text and its corresponding target language translation text. Forward translation means translation from the source language to the target language.

S33b、将单语数据库中的多条语言为源语言的原文本和其分别对应的目标语言翻译文本组成的多条伪平行句对添加至双语平行数据库。S33b, adding multiple pseudo-parallel sentence pairs consisting of multiple original texts in the source language and their corresponding target language translation texts in the monolingual database to the bilingual parallel database.

S34b、通过双语平行数据库训练得到第一机器翻译模型。S34b. Obtain a first machine translation model through bilingual parallel database training.

S35b、通过源自源语言数据微调第一机器翻译模型。该过程可以参见上述实现方式1中S302的相关描述，此处不再赘述。S35b, fine-tuning the first machine translation model using the source language data. This process can refer to the relevant description of S302 in the above implementation mode 1, and will not be repeated here.

在实现方式3和实现方式4中，根据目标语言单语数据库和通过第一双语平行数据库训练得到的机器翻译模型确定多组伪平行句对，并将该多组伪平行句对添加至双语平行数据库，这样，通过新增伪平行句对的双语平行数据库训练初始机器翻译模型，相比于仅使用第一双语平行数据库训练的初始机器翻译模型，可以通过伪平行句对的特征充分训练机器翻译模型，增强机器翻译模型的性能。而且，当针对伪平行句对添加上指示信息后，通过双语平行数据库训练第一机器翻译模型的表现更好。In implementation 3 and implementation 4, multiple sets of pseudo-parallel sentence pairs are determined based on the target language monolingual database and the machine translation model trained by the first bilingual parallel database, and the multiple sets of pseudo-parallel sentence pairs are added to the bilingual parallel database. In this way, the initial machine translation model is trained by the bilingual parallel database with the newly added pseudo-parallel sentence pairs. Compared with the initial machine translation model trained only by the first bilingual parallel database, the machine translation model can be fully trained by the features of the pseudo-parallel sentence pairs, thereby enhancing the performance of the machine translation model. Moreover, after adding the indication information for the pseudo-parallel sentence pairs, the performance of the first machine translation model trained by the bilingual parallel database is better.

示例性的，下面说明以英文为目标语言或源语言时，通过上述实现方式2、实现方式3和实现方式4训练得到的机器翻译模型的翻译表现(以译文质量BLEU值表征)。如图10所示，表中Monolingual表示单语，Bilingual表示双语；X→En表示：X为源语言，英文(En)为目标语言，表示由不同源语言翻译为英文；En→X表示：英文(En)为源语言，X为目标语言，表示由英文翻译为不同目标语言；Tagging表示在通过单语数据得到的伪平行数据对上添加标签，如(BT或FT)；Fine-Tune表示使用源自源语言数据对初始机器翻译模型进行微调。第1行表示通过第一双语平行数据库训练得到的第一模型的译文质量(BLEU)，第2行表示通过源自源语言数据微调上述第一模型后得到的第二模型的译文质量，第3行表示通过第一平行数据库和未添加指示信息(也称为标签)的伪平行句对训练得到的第三模型的译文质量，第4行表示通过第一平行数据库和添加了指示信息(也称为标签)的伪平行句对训练得到的第四模型的译文质量，第5行表示通过源自源语言数据微调上述第三模型后得到的第五模型的译文质量，第6行示通过源自源语言数据微调上述第四模型后得到的第六模型的译文质量。其中，第3-6行表示通过包括应用英文(En)单语数据库获得的伪平行句对的双语平行数据库训练得到的模型的译文质量，框101内的BLUE值表示通过包括反向翻译单语数据库得到的伪平行句对的双语平行数据库训练得到的机器翻译模型的译文质量，框102内的BLUE值表示通过包括正向翻译单语数据库得到的伪平行句对的双语平行数据库训练得到的机器翻译模型的译文质量，Ave表示平均值。从图中可知，通过源自源语言数据微调的机器翻译模型(第二模型、第五模型、第六模型)分别比其对应的未使用源自源语言数据微调的机器翻译模型(第一模型、第三模型、第四模型)的译文翻译质量高。Exemplarily, the following describes the translation performance (characterized by the translation quality BLEU value) of the machine translation model trained by the above implementation methods 2, 3, and 4 when English is the target language or source language. As shown in Figure 10, in the table, Monolingual means monolingual, Bilingual means bilingual; X→En means: X is the source language, English (En) is the target language, indicating translation from different source languages to English; En→X means: English (En) is the source language, X is the target language, indicating translation from English to different target languages; Tagging means adding tags to the pseudo-parallel data pairs obtained through monolingual data, such as (BT or FT); Fine-Tune means fine-tuning the initial machine translation model using data derived from the source language. The first row represents the translation quality (BLEU) of the first model trained by the first bilingual parallel database, the second row represents the translation quality of the second model obtained by fine-tuning the above-mentioned first model with source language data, the third row represents the translation quality of the third model trained by the first parallel database and pseudo-parallel sentence pairs without adding indication information (also called labels), the fourth row represents the translation quality of the fourth model trained by the first parallel database and pseudo-parallel sentence pairs with added indication information (also called labels), the fifth row represents the translation quality of the fifth model obtained by fine-tuning the above-mentioned third model with source language data, and the sixth row represents the translation quality of the sixth model obtained by fine-tuning the above-mentioned fourth model with source language data. Among them, the 3rd to 6th rows represent the translation quality of the model obtained by training the bilingual parallel database including pseudo-parallel sentence pairs obtained by applying the English (En) monolingual database, the BLUE value in the box 101 represents the translation quality of the machine translation model obtained by training the bilingual parallel database including pseudo-parallel sentence pairs obtained by the reverse translation monolingual database, the BLUE value in the box 102 represents the translation quality of the machine translation model obtained by training the bilingual parallel database including pseudo-parallel sentence pairs obtained by the forward translation monolingual database, and Ave represents the average value. It can be seen from the figure that the machine translation models (the second model, the fifth model, and the sixth model) fine-tuned by the source language data are respectively higher than the corresponding machine translation models (the first model, the third model, and the fourth model) that do not use the fine-tuning from the source language data.

上述机器翻译模型的训练方法，对第一双语平行数据库中的数据进行划分，划分为源自源语言数据和源自目标语言数据，通过源自源语言数据对初始机器翻译模型进行微调，得到微调后的机器翻译模型，应用该微调后的机器翻译模型进行翻译任务，能够消除由于不同语言的数据之间存在的语言覆盖偏差对机器翻译模型的影响，从而提高通过该方法训练得到的机器翻译模型的性能，应用该模型可以得到译文质量和忠实度较高的译文。The training method of the above-mentioned machine translation model divides the data in the first bilingual parallel database into data from the source language and data from the target language, fine-tunes the initial machine translation model through the data from the source language to obtain a fine-tuned machine translation model, and applies the fine-tuned machine translation model to perform translation tasks. This can eliminate the influence of the language coverage deviation between data in different languages on the machine translation model, thereby improving the performance of the machine translation model trained by this method. Application of this model can obtain translations with higher quality and fidelity.

下面介绍本申请实施例提供的一种语言翻译方法，该方法可以包括：The following describes a language translation method provided by an embodiment of the present application, which may include:

计算机设备接收待翻译数据，该待翻译数据为源语言数据。The computer device receives data to be translated, where the data to be translated is source language data.

计算机设备将待翻译数据输入到机器翻译模型中，得到上述待翻译数据对应的目标语言数据。该过程可以是，首先应用机器翻译模型对输入的待翻译数据进行词嵌入处理，得到映射中间向量；进而根据映射中间向量之间的依赖关系对映射中间向量进行拟合，得到拟合后的向量；最后，通过机器翻译模型中的解码器对拟合后的向量进行解码，得到解码后的向量，从而输出翻译结果。其中，该机器翻译模型可以是通过上述图1B、或图6、或图8、或图9任一种机器翻译模型的训练方法训练得到的第一机器翻译模型。其中，机器翻译模型的相关描述可以参见上述图1B、或图6、或图8、或图9的描述，此处不再赘述。The computer device inputs the data to be translated into the machine translation model to obtain the target language data corresponding to the data to be translated. The process may be that the machine translation model is first applied to perform word embedding processing on the input data to be translated to obtain a mapping intermediate vector; then the mapping intermediate vector is fitted according to the dependency relationship between the mapping intermediate vectors to obtain a fitted vector; finally, the fitted vector is decoded by the decoder in the machine translation model to obtain a decoded vector, thereby outputting a translation result. Among them, the machine translation model may be the first machine translation model trained by the training method of any one of the machine translation models of Figure 1B, Figure 6, Figure 8, or Figure 9. Among them, the relevant description of the machine translation model can refer to the description of Figure 1B, Figure 6, Figure 8, or Figure 9, which will not be repeated here.

请参见图11A，图11A为本申请实施例提供的一种机器翻译模型的训练装置1100的结构示意图。该机器翻译模型的训练装置可以应用于上述图1A中的训练设备11，该机器翻译模型的训练装置可以包括：Please refer to FIG. 11A , which is a schematic diagram of the structure of a machine translation model training device 1100 provided in an embodiment of the present application. The machine translation model training device can be applied to the training device 11 in FIG. 1A above, and the machine translation model training device can include:

获取单元1101，用于获取双语平行数据库，所述双语平行数据库包括多组双语平行句对，所述双语平行句对为由源语言数据和目标语言数据构成的内容对齐的数据，所述双语平行数据库包括第一双语平行数据库；An acquisition unit 1101 is used to acquire a bilingual parallel database, wherein the bilingual parallel database includes a plurality of groups of bilingual parallel sentence pairs, wherein the bilingual parallel sentence pairs are content-aligned data consisting of source language data and target language data, and the bilingual parallel database includes a first bilingual parallel database;

划分单元1102，用于将所述第一双语平行数据库中的多组双语平行句对划分为源自源语言数据和源自目标语言数据，其中，属于所述源自源语言数据的双语平行句对中的目标语言数据是基于源语言数据翻译得到的，属于所述源自目标语言数据的双语平行句对中的源语言数据是基于目标语言数据翻译得到的；A division unit 1102 is used to divide the plurality of groups of bilingual parallel sentence pairs in the first bilingual parallel database into groups derived from source language data and groups derived from target language data, wherein the target language data in the bilingual parallel sentence pairs derived from the source language data are obtained by translating based on the source language data, and the source language data in the bilingual parallel sentence pairs derived from the target language data are obtained by translating based on the target language data;

训练单元1103，用于通过所述源自源语言数据训练第一机器翻译模型，所述第一机器翻译模型用于将源语言翻译为目标语言。The training unit 1103 is configured to train a first machine translation model using the data derived from the source language, wherein the first machine translation model is used to translate the source language into the target language.

其中，划分单元1102的具体功能实现可以参见上述S2中S21、S221-S225的相关描述；训练单元1103的具体功能实现可以参见上述S3中实现方式1-4的相关描述，这里不再赘述。Among them, the specific functional implementation of the division unit 1102 can refer to the relevant descriptions of S21, S221-S225 in the above S2; the specific functional implementation of the training unit 1103 can refer to the relevant descriptions of implementation methods 1-4 in the above S3, which will not be repeated here.

请参见图11B，图11B为本申请实施例提供的一种语言翻译装置的结构示意图。该语言翻译装置可以应用于上述图1A中的执行设备12，该语言翻译装置可以包括：Please refer to FIG. 11B , which is a schematic diagram of the structure of a language translation device provided in an embodiment of the present application. The language translation device can be applied to the execution device 12 in FIG. 1A above, and the language translation device may include:

接收单元1201，用于接收待翻译数据，所述待翻译数据为源语言数据；The receiving unit 1201 is used to receive data to be translated, where the data to be translated is source language data;

翻译单元1202，用于将待翻译数据翻译为其对应的目标语言数据，翻译单元1202中包括机器翻译模型，该机器翻译模型是通过上述图1B、或图6、或图8、或图9的任一种机器翻译模型的训练方法训练得到的机器翻译模型。The translation unit 1202 is used to translate the data to be translated into its corresponding target language data. The translation unit 1202 includes a machine translation model, which is a machine translation model trained by any of the machine translation model training methods of FIG. 1B , or FIG. 6 , or FIG. 8 , or FIG. 9 .

请参见图12，图12是本申请实施例提供的一种语言翻译设备的结构示意图。如图12所示，该语言翻译设备1000可以对应于上述图1A所示的计算机系统中的第二设备120或第一设备110，该语言翻译设备1000可以包括：处理器1001，网络接口1004和存储器1005，此外，上述语言翻译设备1000还可以包括：用户接口1003，和至少一个通信总线1002。其中，通信总线1002用于实现这些组件之间的连接通信。其中，用户接口1003可以包括显示屏(Display)、键盘(Keyboard)，可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1004可以是高速RAM存储器，也可以是非不稳定的存储器(non-volatile memory)，例如至少一个磁盘存储器。存储器1005可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图12所示，作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。Please refer to Figure 12, Figure 12 is a structural diagram of a language translation device provided by an embodiment of the present application. As shown in Figure 12, the language translation device 1000 can correspond to the second device 120 or the first device 110 in the computer system shown in Figure 1A above, and the language translation device 1000 may include: a processor 1001, a network interface 1004 and a memory 1005. In addition, the above-mentioned language translation device 1000 may also include: a user interface 1003, and at least one communication bus 1002. Wherein, the communication bus 1002 is used to realize the connection communication between these components. Wherein, the user interface 1003 may include a display screen (Display), a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1004 may be a high-speed RAM memory, or it may be a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 1005 may also be at least one storage device located away from the aforementioned processor 1001. As shown in FIG. 12 , the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.

在如图12所示的语言翻译设备1000中，网络接口1004可提供网络通讯功能；而用户接口1003主要用于为用户提供输入的接口；而处理器1001可以用于调用存储器1005中存储的设备控制应用程序，以实现通过上述图1B、或图6、或图8、或图9所示的任一种机器翻译模型的训练方法训练得到的模型进行的翻译任务，训练方法这里不再进行赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。In the language translation device 1000 shown in FIG12, the network interface 1004 can provide a network communication function; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application stored in the memory 1005 to implement the translation task performed by the model trained by the training method of any machine translation model shown in FIG1B, or FIG6, or FIG8, or FIG9, and the training method is not described in detail here. In addition, the description of the beneficial effects of using the same method is not described in detail.

此外，这里需要指出的是：本申请实施例还提供了一种计算机可读存储介质，且所述计算机可读存储介质中存储有前文提及的语言翻译设备1000所执行的计算机程序，且所述计算机程序包括程序指令，当所述处理器执行所述程序指令时，能够执行上述语言翻译方法，因此，这里将不再进行赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。对于本发明所涉及的计算机可读存储介质实施例中未披露的技术细节，请参照本申请方法实施例的描述。In addition, it should be noted that: the embodiment of the present application also provides a kind of computer-readable storage medium, and the computer program executed by the language translation device 1000 mentioned above is stored in the computer-readable storage medium, and the computer program includes program instructions, when the processor executes the program instructions, the above-mentioned language translation method can be executed, therefore, it will not be repeated here. In addition, the description of the beneficial effects of the same method will not be repeated. For the technical details not disclosed in the computer-readable storage medium embodiment involved in the present invention, please refer to the description of the method embodiment of the present application.

请参见图13，图13是本申请实施例提供的一种机器翻译模型的训练设备的结构示意图。如图13所示，该机器翻译模型的训练设备2000可以包括：处理器2001，网络接口2004和存储器2005，此外，上述机器翻译模型的训练设备2000还可以包括：用户接口2003，和至少一个通信总线2002。其中，通信总线2002用于实现这些组件之间的连接通信。其中，用户接口2003可以包括显示屏(Display)、键盘(Keyboard)，可选用户接口2003还可以包括标准的有线接口、无线接口。网络接口2004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器2004可以是高速RAM存储器，也可以是非不稳定的存储器(non-volatilememory)，例如至少一个磁盘存储器。存储器2005可选的还可以是至少一个位于远离前述处理器2001的存储装置。如图13所示，作为一种计算机可读存储介质的存储器2005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。Please refer to Figure 13, which is a structural diagram of a training device for a machine translation model provided in an embodiment of the present application. As shown in Figure 13, the training device 2000 of the machine translation model may include: a processor 2001, a network interface 2004 and a memory 2005. In addition, the training device 2000 of the above-mentioned machine translation model may also include: a user interface 2003, and at least one communication bus 2002. Among them, the communication bus 2002 is used to realize the connection communication between these components. Among them, the user interface 2003 may include a display screen (Display), a keyboard (Keyboard), and the optional user interface 2003 may also include a standard wired interface and a wireless interface. The network interface 2004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 2004 may be a high-speed RAM memory, or it may be a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 2005 may also be at least one storage device located away from the aforementioned processor 2001. As shown in FIG. 13 , the memory 2005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.

在如图13所示的机器翻译模型的训练设备2000中，网络接口2004可提供网络通讯功能；而用户接口2003主要用于为用户提供输入的接口，可以包括显示屏，显示屏可以将处理器2001执行的指令获得的结果进行显示，例如显示模型训练的进度以及显示待翻译数据的翻译结果等；而处理器2001可以用于调用存储器2005中存储的设备控制应用程序，以实现上述图1B、或图6、或图8、或图9任一种机器翻译模型的训练方法，相关描述这里不再进行赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。In the training device 2000 of the machine translation model as shown in FIG13 , the network interface 2004 can provide a network communication function; the user interface 2003 is mainly used to provide an input interface for the user, and can include a display screen, which can display the results obtained by the instructions executed by the processor 2001, such as displaying the progress of model training and displaying the translation results of the data to be translated; and the processor 2001 can be used to call the device control application stored in the memory 2005 to implement the training method of any of the machine translation models in FIG1B , or FIG6 , or FIG8 , or FIG9 , and the relevant description is not repeated here. In addition, the description of the beneficial effects of using the same method is not repeated.

应当理解，本申请实施例中所描述的机器翻译模型的训练设备2000可执行前文行上述图1B、或图6、或图8、或图9任一种机器翻译模型的训练方法，在此不再赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。It should be understood that the machine translation model training device 2000 described in the embodiment of the present application can execute the training method of any machine translation model in the above-mentioned FIG. 1B, or FIG. 6, or FIG. 8, or FIG. 9, which will not be described in detail here. In addition, the description of the beneficial effects of using the same method will not be repeated.

此外，这里需要指出的是：本申请实施例还提供了一种计算机可读存储介质，且所述计算机可读存储介质中存储有前文提及的机器翻译模型的训练设备2000所执行的计算机程序，且所述计算机程序包括程序指令，当所述处理器执行所述程序指令时，能够执行上述图1B、或图6、或图8、或图9任一种机器翻译模型的训练方法，因此，相关描述这里不再进行赘述。另外，对采用相同方法的有益效果描述，也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节，请参照本申请方法实施例的描述。In addition, it should be pointed out here that: the embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the training device 2000 of the machine translation model mentioned above, and the computer program includes program instructions. When the processor executes the program instructions, it can execute the training method of any one of the machine translation models of Figure 1B, or Figure 6, or Figure 8, or Figure 9, so the relevant description is not repeated here. In addition, the description of the beneficial effects of adopting the same method is not repeated. For the technical details not disclosed in the computer-readable storage medium embodiment involved in the present application, please refer to the description of the method embodiment of the present application.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的程序可存储于一计算机可读取存储介质中，该程序在执行时，可包括如上述各方法的实施例的流程。其中，所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory，ROM)或随机存储记忆体(Random AccessMemory，RAM)等。A person skilled in the art can understand that all or part of the processes in the above-mentioned embodiments can be implemented by instructing the relevant hardware through a computer program, and the program can be stored in a computer-readable storage medium, and when the program is executed, it can include the processes of the embodiments of the above-mentioned methods. The storage medium can be a disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), etc.

以上所揭露的仅为本发明较佳实施例而已，当然不能以此来限定本发明之权利范围，因此依本发明权利要求所作的等同变化，仍属本发明所涵盖的范围。The above disclosure is only a preferred embodiment of the present invention, which certainly cannot be used to limit the scope of the present invention. Therefore, equivalent changes made according to the claims of the present invention are still within the scope of the present invention.

Claims

1. A method for training a machine translation model, comprising:

Acquire a bilingual parallel database, the bilingual parallel database comprising a plurality of groups of bilingual parallel sentence pairs, the bilingual parallel sentence pairs being content-aligned data consisting of source language data and target language data, the bilingual parallel database comprising a first bilingual parallel database;

Dividing the plurality of groups of bilingual parallel sentence pairs in the first bilingual parallel database into those derived from source language data and those derived from target language data, wherein the target language data in the bilingual parallel sentence pairs derived from the source language data are translated by a human translator based on the source language data, and the source language data in the bilingual parallel sentence pairs derived from the target language data are translated by a human translator based on the target language data;

Training a first machine translation model using the source language data, wherein the first machine translation model is used to translate the source language into a target language;

The method of dividing the plurality of groups of bilingual parallel sentence pairs in the first bilingual parallel database into those derived from source language data and those derived from target language data comprises: obtaining parallel sentence pairs to be processed from the first bilingual parallel database; determining a first probability that the source language data in the parallel sentence pairs to be processed are derived from the source language according to the source language data in the parallel sentence pairs to be processed; determining a second probability that the target language data in the parallel sentence pairs to be processed are derived from the target language according to the content of the target language data in the parallel sentence pairs to be processed; determining a data type of the parallel sentence pairs to be processed according to a deviation between the first probability and the second probability; the data type comprises the data derived from the source language data and the data derived from the target language data.

2. The method according to claim 1, characterized in that the step of determining the data type of the parallel sentence pair to be processed according to the deviation between the first probability and the second probability specifically comprises:

Determine a score of the parallel sentence pair to be processed according to a deviation between the first probability and the second probability, wherein the score is used to determine a data type of the parallel sentence pair to be processed;

When the score is greater than a target threshold, determining that the parallel sentence pair to be processed is derived from source language data;

When the score is less than the target threshold, it is determined that the parallel sentence pair to be processed is derived from the target language data.

3. The method according to claim 1 or 2, characterized in that determining, according to the source language data in the parallel sentence pairs to be processed, a first probability that the source language data in the parallel sentence pairs to be processed are derived from a source language comprises: inputting the source language data in the parallel sentence pairs to be processed into a first language model to determine the first probability, wherein the first language model is used to determine the probability that the source language data in the parallel sentence pairs to be processed appear in the source language, and the first language model is a model obtained by training a source language monolingual database;

Determining a second probability that the target language data in the parallel sentence pair to be processed is derived from a target language according to the content of the target language data in the parallel sentence pair to be processed includes: inputting the target language data in the parallel sentence pair to be processed into a second language model to determine the second probability, wherein the second language model is used to determine the probability that the target language data in the parallel sentence pair to be processed appears in the target language, and the second language model is a model trained by a target language monolingual database.

4. The method according to any one of claims 1-2, characterized in that before training the first machine translation model using the source language data, the method further comprises:

An initial machine translation model is trained using the bilingual parallel database to obtain the first machine translation model.

5. The method according to any one of claims 1-2, characterized in that before training the first machine translation model using the source language data, the method further comprises:

Acquire a monolingual database, wherein the monolingual database includes a plurality of original texts in a target language;

Inputting each original text in the monolingual database into a second machine translation model to obtain a source language translation text corresponding to each original text, wherein the second machine translation model is obtained by training the first bilingual parallel database, and the second machine translation model is used to translate the target language into the source language;

Adding a plurality of groups of pseudo-parallel sentence pairs consisting of the plurality of original texts in the target language and the source language translation texts respectively corresponding to the plurality of original texts in the target language to the bilingual parallel database.

6. The method according to any one of claims 1-2, characterized in that before training the first machine translation model using the source language data, the method further comprises:

Acquire a monolingual database, wherein the monolingual database includes a plurality of original texts in a source language;

Inputting each original text in the monolingual database into a third machine translation model to obtain a target language translation text corresponding to each original text, wherein the third machine translation model is obtained by training the first bilingual parallel database, and the third machine translation model is used to translate the source language into the target language;

Adding a plurality of groups of pseudo-parallel sentence pairs consisting of the plurality of original texts in the source language and the target language translation texts corresponding to the plurality of original texts in the source language to the bilingual parallel database.

7. A language translation method, characterized by comprising:

receiving data to be translated, wherein the data to be translated is source language data;

The data to be translated is input into a machine translation model to obtain target language data corresponding to the data to be translated, wherein the machine translation model is a first machine translation model trained by the method according to any one of claims 1 to 6.

8. A training device for a machine translation model, comprising:

An acquisition unit is used to acquire a bilingual parallel database, wherein the bilingual parallel database includes a plurality of groups of bilingual parallel sentence pairs, wherein the bilingual parallel sentence pairs are content-aligned data consisting of source language data and target language data, and the bilingual parallel database includes a first bilingual parallel database;

a dividing unit, configured to divide the plurality of groups of bilingual parallel sentence pairs in the first bilingual parallel database into data derived from source language data and data derived from target language data, wherein the target language data in the bilingual parallel sentence pairs derived from the source language data are translated by a human translator based on the source language data, and the source language data in the bilingual parallel sentence pairs derived from the target language data are translated by a human translator based on the target language data;

A training unit, configured to train a first machine translation model using the data derived from the source language, wherein the first machine translation model is used to translate the source language into a target language;

Wherein, when the division unit divides the multiple groups of bilingual parallel sentence pairs in the first bilingual parallel database into those derived from source language data and those derived from target language data, the division unit is specifically used to: obtain parallel sentence pairs to be processed from the first bilingual parallel database; determine a first probability that the source language data in the parallel sentence pairs to be processed are derived from the source language according to the source language data in the parallel sentence pairs to be processed; determine a second probability that the target language data in the parallel sentence pairs to be processed are derived from the target language according to the content of the target language data in the parallel sentence pairs to be processed; determine a data type of the parallel sentence pairs to be processed according to a deviation between the first probability and the second probability; the data type includes the data derived from the source language data and the data derived from the target language data.

9. A language translation device, comprising:

A receiving unit, configured to receive data to be translated, wherein the data to be translated is source language data;

A translation unit is used to translate the data to be translated into its corresponding target language data, wherein the translation unit includes a machine translation model, and the machine translation model is a first machine translation model trained by the method according to any one of claims 1 to 6.

10. A computer device, comprising: one or more processors and one or more memories, wherein the one or more memories are respectively coupled to the one or more processors; the one or more memories are used to store computer program codes, wherein the computer program codes include computer instructions;

The processor is used to call the computer instruction to execute: the training method of the machine translation model as described in any one of claims 1-6.

11. A computer device, comprising: one or more processors and one or more memories, wherein the one or more memories are respectively coupled to the one or more processors; the one or more memories are used to store computer program codes, wherein the computer program codes include computer instructions;

The processor is used to call the computer instruction to execute: the language translation method as claimed in claim 7.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a processor, the training method of the machine translation model according to any one of claims 1 to 6 is executed.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program includes program instructions, and when the program instructions are executed by a processor, the language translation method according to claim 7 is executed.

14. A computer program product, characterized in that the computer program product comprises computer instructions, which, when executed by a processor, implement the training method for a machine translation model as described in any one of claims 1 to 6, or implement the language translation method as described in claim 7.