CN104933038A - Machine translation method and machine translation device - Google Patents
- Publication number
- CN104933038A (application CN201410104256.7A)
- Authority
- CN
- China
- Prior art keywords
- mentioned
- sentence
- translation
- translated
- corpus
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The present invention provides a dynamic machine translation method and machine translation device capable of improving translation quality. A machine translation device according to one embodiment of the present invention includes: an input unit that inputs a sentence to be translated; a calculation unit that calculates the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus; a selection unit that selects, based on the similarity, a plurality of sentence pairs from the bilingual corpus as a training corpus; a training unit that trains a translation system using the training corpus; and a translation unit that translates the sentence to be translated using the translation system.
Description
Technical Field
The present invention relates to natural language processing technology, and in particular to a machine translation method and a machine translation device.
Background Art
The general workflow of a statistical machine translation system is to first determine the model (algorithm), then train the model parameters (translation knowledge) on training data, and finally use the trained model parameters to translate an input sentence.
Training data usually consists of large-scale aligned bilingual sentence pairs. These pairs may come from different domains and take different forms, and even the same source-language sentence may have different target-language translations. Likewise, the same word in a source-language sentence may be translated differently depending on its context.
In a typical translation system, the translation model no longer changes once training is complete, and it is then used to translate every sentence to be translated. Because sentences to be translated are diverse, a translation system that is fixed after generation usually cannot suit all of them, which results in low translation quality.
To address this, several domain adaptation methods have been proposed for constructing "dynamic" translation systems. Some methods first interpolate in-domain data with out-of-domain data and then build a translation model from the interpolated data. Other methods first cluster the training data by domain and train a separate translation sub-model on each cluster; at translation time, the sub-model corresponding to the domain of the sentence to be translated is selected to perform the translation.
Summary of the Invention
After studying the domain adaptation methods above, the inventors of the present invention found that, although these methods have some adaptive capability, the translation model or translation sub-models no longer change once they have been generated by training. In other words, the trained translation model is still "static", so the adaptability of the translation system is limited and translation quality remains low.
To solve these problems in the prior art, embodiments of the present invention provide a dynamic machine translation method and machine translation device capable of improving translation quality. Specifically, the following technical solutions are provided.
[1] A machine translation method, comprising the steps of: inputting a sentence to be translated; calculating the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus; selecting, based on the similarity, a plurality of sentence pairs from the bilingual corpus as a training corpus; training a translation system using the training corpus; and translating the sentence to be translated using the translation system.
In the machine translation method of this embodiment, corpus entries with high similarity to the sentence to be translated are selected from the bilingual corpus, and a translation system is constructed in real time from the selected entries. A dynamic, targeted translation system can thus be constructed, which improves translation quality.
[2] The machine translation method according to [1], wherein the selecting step includes: sorting the sentence pairs in the bilingual corpus in descending order of similarity; and selecting the first N sorted sentence pairs as the training corpus, N being an integer of 1 or more.
In the machine translation method of this embodiment, selecting the first N sorted sentence pairs means that, when the bilingual corpus contains a large number of entries highly similar to the sentence to be translated, the translation system can be trained on a fixed number of the most similar entries. This not only maintains translation quality but also reduces the processing load of training the translation system.
[3] The machine translation method according to [1] or [2], wherein the selecting step includes: selecting, as the training corpus, the sentence pairs in the bilingual corpus whose similarity is greater than a predetermined threshold.
In the machine translation method of this embodiment, training the translation system on entries whose similarity exceeds a predetermined threshold excludes low-similarity entries. This prevents such entries from interfering with the translation system and further safeguards translation accuracy.
[4] The machine translation method according to any one of [1] to [3], wherein the step of calculating the similarity includes: calculating the similarity using the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus.
[5] The machine translation method according to any one of [1] to [4], wherein the step of calculating the similarity includes: calculating the similarity in syntactic structure between the sentence to be translated and the source-language sentences in the bilingual corpus.
[6] The machine translation method according to any one of [1] to [5], further comprising, after the translating step: saving the sentence to be translated and its translation result in a translation buffer.
[7] The machine translation method according to [6], further comprising, after the inputting step: looking up the sentence to be translated in the translation buffer.
In the machine translation method of this embodiment, the sentence to be translated and its translation result are saved in the translation buffer, so the next time the same sentence is translated its translation result can be obtained directly from the buffer. This saves computing resources and improves translation efficiency.
[8] The machine translation method according to any one of [1] to [7], further comprising, after the translating step: adding the sentence to be translated and its translation result to the bilingual corpus.
In the machine translation method of this embodiment, adding the sentence to be translated and its translation result to the bilingual corpus expands the corpus data, which can improve the quality of subsequent translations.
[9] The machine translation method according to any one of [1] to [8], further comprising, after the translating step: performing word alignment on the sentence to be translated and its translation result; and adding the word alignment result to the bilingual corpus.
In the machine translation method of this embodiment, adding the word alignment result to the bilingual corpus not only expands the corpus data and improves the quality of subsequent translations, but also improves translation efficiency.
[10] The machine translation method according to any one of [1] to [9], further comprising, before the step of calculating the similarity: adding user-related training data to the bilingual corpus.
In the machine translation method of this embodiment, adding user-related training data (for example, aligned sentence pairs, context-related data, or word-aligned sentence pairs) makes user adaptation achievable even when training data is insufficient.
[11] The machine translation method according to any one of [1] to [10], further comprising, after the translating step: calculating the confidence of the translation result using the similarity.
In the machine translation method of this embodiment, the confidence of the translation result is calculated from the similarity, so it is available as soon as the translation result is obtained. No separate method is needed to calculate the confidence, which improves translation efficiency.
[12] A machine translation device, comprising: an input unit that inputs a sentence to be translated; a calculation unit that calculates the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus; a selection unit that selects, based on the similarity, a plurality of sentence pairs from the bilingual corpus as a training corpus; a training unit that trains a translation system using the training corpus; and a translation unit that translates the sentence to be translated using the translation system.
In the machine translation device of this embodiment, corpus entries with high similarity to the sentence to be translated are selected from the bilingual corpus, and a translation system is constructed in real time from the selected entries. A dynamic, targeted translation system can thus be constructed, which improves translation quality.
[13] The machine translation device according to [12], wherein the selection unit includes a sorting unit that sorts the sentence pairs in the bilingual corpus in descending order of similarity, and the selection unit selects the first N sorted sentence pairs as the training corpus, N being an integer of 1 or more.
In the machine translation device of this embodiment, selecting the first N sorted sentence pairs means that, when the bilingual corpus contains a large number of entries highly similar to the sentence to be translated, the translation system can be trained on a fixed number of the most similar entries. This not only maintains translation quality but also reduces the processing load of training the translation system.
[14] The machine translation device according to [12] or [13], wherein the selection unit selects, as the training corpus, the sentence pairs in the bilingual corpus whose similarity is greater than a predetermined threshold.
In the machine translation device of this embodiment, training the translation system on entries whose similarity exceeds a predetermined threshold excludes low-similarity entries. This prevents such entries from interfering with the translation system and further safeguards translation accuracy.
[15] The machine translation device according to any one of [12] to [14], wherein the calculation unit calculates the similarity using the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus.
[16] The machine translation device according to any one of [12] to [15], wherein the calculation unit calculates the similarity in syntactic structure between the sentence to be translated and the source-language sentences in the bilingual corpus.
[17] The machine translation device according to any one of [12] to [16], further comprising: a storage unit that saves the sentence to be translated and its translation result in a translation buffer.
[18] The machine translation device according to [17], further comprising: a lookup unit that looks up the sentence to be translated in the translation buffer after the input unit has input the sentence to be translated.
In the machine translation device of this embodiment, the sentence to be translated and its translation result are saved in the translation buffer, so the next time the same sentence is translated its translation result can be obtained directly from the buffer. This saves computing resources and improves translation efficiency.
[19] The machine translation device according to any one of [12] to [18], further comprising: a sentence pair adding unit that adds the sentence to be translated and its translation result to the bilingual corpus.
In the machine translation device of this embodiment, adding the sentence to be translated and its translation result to the bilingual corpus expands the corpus data, which can improve the quality of subsequent translations.
[20] The machine translation device according to any one of [12] to [19], further comprising: a word alignment unit that performs word alignment on the sentence to be translated and its translation result; and a word alignment result adding unit that adds the word alignment result to the bilingual corpus.
In the machine translation device of this embodiment, adding the word alignment result to the bilingual corpus not only expands the corpus data and improves the quality of subsequent translations, but also improves translation efficiency.
[21] The machine translation device according to any one of [12] to [20], further comprising: a training corpus adding unit that adds user-related training data to the bilingual corpus.
In the machine translation device of this embodiment, adding user-related training data (for example, aligned sentence pairs, context-related data, or word-aligned sentence pairs) makes user adaptation achievable even when training data is insufficient.
[22] The machine translation device according to any one of [12] to [21], further comprising: a confidence calculation unit that calculates the confidence of the translation result using the similarity.
In the machine translation device of this embodiment, the confidence of the translation result is calculated from the similarity, so it is available as soon as the translation result is obtained. No separate method is needed to calculate the confidence, which improves translation efficiency.
Brief Description of the Drawings
FIG. 1 is a flowchart of a machine translation method according to one embodiment of the present invention.
FIG. 2 is a block diagram of a machine translation device according to another embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Machine translation method
This embodiment provides a machine translation method comprising the steps of: inputting a sentence to be translated; calculating the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus 10; selecting, based on the similarity, a plurality of sentence pairs from the bilingual corpus 10 as a training corpus; training a translation system using the training corpus; and translating the sentence to be translated using the translation system.
The method is described in detail below with reference to FIG. 1, which is a flowchart of a machine translation method according to one embodiment of the present invention.
As shown in FIG. 1, first, in step S101, a sentence to be translated is input. In this embodiment, the sentence to be translated may be any sentence requiring translation known to those skilled in the art and may be in any language; the present invention places no limitation on this.
Next, in step S105, the similarity between the sentence to be translated and the source-language sentences in the bilingual corpus 10 is calculated.
In this embodiment, the bilingual corpus 10 includes a plurality of sentence pairs in a source language and a target language. It may be any bilingual corpus known to those skilled in the art, for example an English-Chinese corpus, an English-German corpus, or a Japanese-Chinese corpus; this embodiment places no limitation on the bilingual corpus 10. Furthermore, the bilingual corpus may be one in which only sentence alignment has been performed, or one in which the sentence pairs have also been word-aligned; the present invention places no limitation on this.
In this embodiment, the similarity is a parameter indicating how similar the sentence to be translated is to a source-language sentence; for example, a string-based similarity or a structural similarity may be used. The similarity may be calculated, for example, from the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus 10. It may also be calculated from the syntactic structures of the sentence to be translated and the source-language sentences in the bilingual corpus 10. This embodiment places no limitation on how the similarity is calculated, and any method known to those skilled in the art may be used.
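As an illustration of the edit-distance option in step S105, the following is a minimal Python sketch. The text does not specify how an edit distance is turned into a similarity score; the word-level tokenization by whitespace and the normalization by the longer sentence length are assumptions made here for illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (lists of tokens or strings)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(sentence, source):
    """Assumed normalization: 1.0 for identical token sequences, 0.0 for no overlap."""
    a, b = sentence.split(), source.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For languages written without spaces, such as Chinese or Japanese, a word segmenter or character-level comparison would be needed in place of `split()`.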
Next, in step S110, a plurality of sentence pairs are selected from the bilingual corpus 10, based on the similarity, as a training corpus.
When selecting sentence pairs, for example, the sentence pairs in the bilingual corpus 10 may be sorted in descending order of similarity according to the calculated similarities, and the first N sorted sentence pairs selected as the training corpus, N being an integer of 1 or more.
Alternatively, instead of sorting the sentence pairs, the sentence pairs in the bilingual corpus 10 whose similarity is greater than a predetermined threshold may be selected as the training corpus. This embodiment places no limitation on how the training corpus is selected based on similarity; a combination of the above methods or any other method known to those skilled in the art may be used.
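The two selection strategies of step S110, and one possible combination of them, can be sketched as follows. This is an illustrative sketch only: the tuple layout `(similarity, source, target)` and the function names are assumptions, and the particular combination shown (threshold first, then top N) is just one of the combinations the text allows.

```python
import heapq


def select_top_n(scored_pairs, n):
    """Return the n pairs with the highest similarity, in descending order."""
    return heapq.nlargest(n, scored_pairs, key=lambda pair: pair[0])


def select_above_threshold(scored_pairs, threshold):
    """Return every pair whose similarity is strictly greater than the threshold."""
    return [pair for pair in scored_pairs if pair[0] > threshold]


def select_combined(scored_pairs, n, threshold):
    """Assumed combination: drop low-similarity pairs, then keep at most n."""
    return select_top_n(select_above_threshold(scored_pairs, threshold), n)
```

Using `heapq.nlargest` avoids sorting the whole corpus when N is small relative to the corpus size.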
Next, in step S115, the translation system is trained using the training corpus selected in step S110.
In this embodiment, the translation system may be any translation system known to those skilled in the art, as long as it can be used to translate the sentence to be translated. That is, any algorithm known to those skilled in the art may be used to train the translation system on the training corpus.
Specifically, when the bilingual corpus is one in which only sentence alignment has been performed, the selected training corpus is first word-aligned using any word alignment method known to those skilled in the art, and the word-aligned training corpus is then used to train the translation system. When the bilingual corpus is one in which the sentence pairs have already been word-aligned, this word alignment is unnecessary and the selected training corpus is used directly to train the translation system.
Preferably, the translation system of this embodiment includes a translation model (TM) and a language model (LM). The translation model is trained on the bilingual sentence pairs in the training corpus and their word alignment information, and describes how accurately the source language is converted into the target language. The language model is trained on the target-language sentences in the training corpus, and describes the fluency of the translation result.
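To make the role of the language model concrete, the following Python sketch trains a toy bigram model on the target-language side of a selected training corpus and scores candidate outputs. The text does not prescribe a language model type; the bigram order, the add-one smoothing, and all names here are assumptions chosen for brevity, not the patent's method.

```python
import math
from collections import Counter


def train_bigram_lm(target_sentences):
    """Count context words and bigrams over whitespace-tokenized target sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in target_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log-probability; higher means more fluent under the model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )
```

Because the model is trained only on sentences similar to the input, word orders seen in that subset score higher than scrambled ones, which is the fluency signal a decoder would use.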
Finally, in step S120, the sentence to be translated is translated using the translation system trained in step S115, yielding the translation result 20.
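Steps S101 to S120 can be summarized in a short Python sketch. It shows only the control flow: `similarity` and `train` are hypothetical callables supplied by the caller (the text leaves both open to any known method), and the top-N selection with a default of `n=100` is an assumption for illustration.

```python
def translate_dynamically(sentence, corpus, similarity, train, n=100):
    """Train a translation system per input sentence and translate with it.

    corpus:     list of (source, target) sentence pairs
    similarity: callable (sentence, source) -> float        (hypothetical)
    train:      callable (pairs) -> callable (sentence) -> translation  (hypothetical)
    """
    # S105: score every source-language sentence against the input.
    scored = [(similarity(sentence, src), src, tgt) for src, tgt in corpus]
    # S110: keep the n most similar sentence pairs as the training corpus.
    scored.sort(key=lambda item: item[0], reverse=True)
    training_corpus = [(src, tgt) for _, src, tgt in scored[:n]]
    # S115: train a translation system on the selected pairs only.
    system = train(training_corpus)
    # S120: translate the input sentence with the freshly trained system.
    return system(sentence)
```

The key design point is that `train` runs inside the per-sentence call, so the resulting system is rebuilt for each input rather than fixed after a single offline training run.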
In the machine translation method of this embodiment, corpus entries with high similarity to the sentence to be translated are selected from the bilingual corpus 10, and a translation system is constructed in real time from the selected entries. A dynamic, targeted translation system can thus be constructed, which improves translation quality.
In addition, selecting the first N sorted sentence pairs means that, when the bilingual corpus 10 contains a large number of entries highly similar to the sentence to be translated, the translation system can be trained on a fixed number of the most similar entries. This not only maintains translation quality but also reduces the processing load of training the translation system.
In addition, training the translation system on entries whose similarity exceeds a predetermined threshold excludes low-similarity entries. This prevents such entries from interfering with the translation system and further safeguards translation accuracy.
Preferably, the machine translation method of this embodiment further includes, after step S120, the step of saving the sentence to be translated and its translation result in a translation buffer. The translation buffer may be any memory known to those skilled in the art.
Furthermore, the machine translation method of this embodiment preferably further includes, after step S101, the step of looking up the sentence to be translated in the translation buffer. If the sentence to be translated is present in the translation buffer, its translation result is obtained directly; if not, the method proceeds to step S105.
In the machine translation method of this embodiment, the sentence to be translated and its translation result are saved in the translation buffer, so the next time the same sentence is translated its translation result can be obtained directly from the buffer. This saves computing resources and improves translation efficiency.
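The buffer lookup after step S101 and the save after step S120 amount to a cache keyed on the input sentence. The following Python sketch assumes an in-memory dictionary and exact string matching; the class and function names are hypothetical, and `translate_fn` stands in for steps S105 to S120.

```python
class TranslationBuffer:
    """In-memory store mapping a source sentence to its translation result."""

    def __init__(self):
        self._store = {}

    def lookup(self, sentence):
        """Return the cached translation, or None if the sentence is not stored."""
        return self._store.get(sentence)

    def save(self, sentence, translation):
        self._store[sentence] = translation


def translate_with_buffer(sentence, buffer, translate_fn):
    """Check the buffer first; fall back to full translation on a miss."""
    cached = buffer.lookup(sentence)
    if cached is not None:
        return cached                     # hit: skip steps S105 to S120
    translation = translate_fn(sentence)  # miss: run the full pipeline
    buffer.save(sentence, translation)
    return translation
```

On a hit, neither similarity calculation nor training is repeated, which is where the saving in computing resources comes from.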
Preferably, the machine translation method of this embodiment further includes, after step S120, the step of adding the sentence to be translated and its translation result to the bilingual corpus.
In the machine translation method of this embodiment, adding the sentence to be translated and its translation result to the bilingual corpus 10 expands the corpus data of the bilingual corpus 10, which can improve the quality of subsequent translations.
Preferably, the machine translation method of this embodiment further includes, before adding the sentence to be translated and its translation result, the steps of: performing word alignment on the sentence to be translated and its translation result; and adding the word alignment result to the bilingual corpus 10.
In this embodiment, any word alignment tool known to those skilled in the art, for example GIZA++, may be used to word-align the sentence to be translated and its translation result. This embodiment places no limitation on the word alignment method.
In the machine translation method of this embodiment, adding the word alignment result to the bilingual corpus not only expands the corpus data and improves the quality of subsequent translations, but also improves translation efficiency.
Preferably, the machine translation method of this embodiment further includes, before step S110, the step of adding user-related training data to the bilingual corpus 10.
In the machine translation method of this embodiment, adding user-related training data (for example, aligned sentence pairs, context-related data, or word-aligned sentence pairs) makes user adaptation achievable even when training data is insufficient.
Preferably, the machine translation method of this embodiment further includes the following step after step S120: calculating the confidence of the translation result using the similarity.
In this embodiment, confidence refers to how trustworthy the translation result is. If even the part of the bilingual corpus 10 most similar to the input sentence is insufficient for learning the required translation knowledge, the probability of obtaining the best translation from the entire bilingual corpus 10 is also low. When similar sentences are selected, if the number of sentences whose similarity exceeds a given value is below a predetermined threshold, the bilingual corpus 10 can be regarded as lacking the knowledge needed to translate the sentence. For example, the confidence may be regarded as high when the number of sentences whose similarity exceeds the given value is at or above the predetermined threshold.
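The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `confidence_is_high` and the default values `min_similarity=0.7` and `min_count=3` are hypothetical and do not come from the patent, which leaves both values unspecified.

```python
def confidence_is_high(similarities, min_similarity=0.7, min_count=3):
    """Return True when enough corpus sentences resemble the input.

    similarities: similarity scores between the sentence to be translated
    and the source-language sentences of the bilingual corpus.
    min_similarity and min_count are assumed values, not from the patent.
    """
    # Count the sentences whose similarity exceeds the given value.
    similar_count = sum(1 for s in similarities if s > min_similarity)
    # Confidence is regarded as high when the count reaches the threshold.
    return similar_count >= min_count
```

Because the scores are already available from the selection step, this check needs no additional pass over the corpus.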
In the machine translation method of this embodiment, because the confidence of the translation result is calculated from the similarity, the confidence is obtained together with the translation result. No separate method is needed to calculate the confidence, which improves translation efficiency.
Machine translation device
Under the same inventive concept, FIG. 2 is a block diagram of a machine translation device according to another embodiment of the present invention. This embodiment is described below with reference to that figure. Descriptions of parts identical to the preceding embodiment are omitted where appropriate.
This embodiment provides a machine translation device 200 including: an input unit 201 that inputs a sentence to be translated; a calculation unit 205 that calculates the similarity between the sentence to be translated and the source language sentences in the bilingual corpus 10; a selection unit 210 that, based on the similarity, selects a plurality of sentence pairs from the bilingual corpus 10 as a training corpus; a training unit 215 that trains a translation system using the training corpus; and a translation unit 220 that translates the sentence to be translated using the translation system.
A detailed description is given below with reference to FIG. 2. As shown in FIG. 2, the input unit 201 inputs a sentence to be translated. In this embodiment, the sentence to be translated may be any sentence requiring translation known to those skilled in the art and may be in any language; the present invention places no limitation on this.
The calculation unit 205 calculates the similarity between the sentence to be translated and the source language sentences in the bilingual corpus 10.
In this embodiment, the bilingual corpus 10 includes a plurality of sentence pairs in a source language and a target language. It may be any bilingual corpus known to those skilled in the art, for example an English-Chinese corpus, an English-German corpus, or a Japanese-Chinese corpus. This embodiment places no limitation on the bilingual corpus 10. In addition, the bilingual corpus of this embodiment may be a corpus that has only been sentence-aligned or a corpus whose sentence pairs have also been word-aligned; the present invention places no limitation on this.
In this embodiment, the similarity is a parameter indicating how similar the sentence to be translated is to a source language sentence; for example, a string-based similarity or a structural similarity may be used. When calculating the similarity, the calculation unit 205 may, for example, base it on the edit distance between the sentence to be translated and the source language sentences in the bilingual corpus 10. Alternatively, the calculation unit 205 may calculate the similarity based on the syntactic structures of the sentence to be translated and the source language sentences in the bilingual corpus 10. This embodiment places no limitation on the method of calculating the similarity, and any method known to those skilled in the art may be used.
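As one possible reading of the edit-distance option, the following Python sketch computes a Levenshtein distance over tokens and normalizes it to a score between 0 and 1. The normalization by the longer sentence length and the whitespace tokenization are assumptions made for illustration; the patent does not fix either, and whitespace splitting would not suit unsegmented Chinese or Japanese text.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def similarity(sentence, source_sentence):
    """Similarity in [0, 1]; 1.0 means the token sequences are identical.

    Assumption: similarity = 1 - distance / length of the longer sentence.
    """
    a, b = sentence.split(), source_sentence.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

For instance, `similarity("the cat sat", "the dog sat")` differs in one of three tokens and therefore scores about 0.67.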
The selection unit 210 selects a plurality of sentence pairs from the bilingual corpus 10 as a training corpus based on the similarity.
The selection unit 210 preferably includes a sorting unit. When selecting sentence pairs, the sorting unit first sorts the sentence pairs in the bilingual corpus 10 in descending order of similarity according to the calculated similarities, and the selection unit 210 then selects the top N sorted sentence pairs as the training corpus, where N is an integer of 1 or more.
Alternatively, the selection unit 210 may omit the sorting unit and directly select, as the training corpus, the sentence pairs in the bilingual corpus 10 whose similarity is greater than a predetermined threshold. This embodiment places no limitation on the method of selecting the training corpus based on the similarity; it may be a combination of the above methods or any other method known to those skilled in the art.
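The two selection strategies above can be sketched as follows. The function names and the `(similarity, (source, target))` tuple layout are hypothetical choices for this illustration, not structures defined in the patent.

```python
import heapq


def select_top_n(scored_pairs, n):
    """Return the n sentence pairs with the highest similarity, best first.

    scored_pairs: iterable of (similarity, (source, target)) tuples.
    """
    best = heapq.nlargest(n, scored_pairs, key=lambda item: item[0])
    return [pair for _, pair in best]


def select_above_threshold(scored_pairs, threshold):
    """Return every sentence pair whose similarity exceeds the threshold,
    in corpus order and without sorting."""
    return [pair for score, pair in scored_pairs if score > threshold]
```

The two can also be combined, for example by filtering with the threshold first and then keeping the top N of what remains, which corresponds to the combination the text allows.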
The training unit 215 trains the translation system using the training corpus selected by the selection unit 210.
In this embodiment, the translation system may be any translation system known to those skilled in the art, as long as it can be used to translate the sentence to be translated. That is, any algorithm known to those skilled in the art may be used to train the translation system on the training corpus.
Specifically, when the bilingual corpus has only been sentence-aligned, the training unit 215 further includes a word alignment unit, which performs word alignment on the selected training corpus using any word alignment method known to those skilled in the art; the training unit 215 then trains the translation system using the word-aligned training corpus. Conversely, when the sentence pairs in the bilingual corpus have already been word-aligned, the training unit 215 need not include a word alignment unit and trains the translation system directly on the selected training corpus.
Preferably, the translation system of this embodiment includes a translation model (TM) and a language model (LM). The translation model is trained on the bilingual sentence pairs in the training corpus together with their word alignment information, and describes how accurately the source language is converted into the target language. The language model is trained on the target language sentences in the training corpus, and describes how fluent the translation result is.
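To make the role of the language model concrete, the sketch below trains a bigram model on target-language sentences and scores a candidate by log-probability. The bigram order, the add-one smoothing, and the whitespace tokenization are all assumptions chosen for brevity; the patent does not specify what kind of language model is used.

```python
import math
from collections import Counter


def train_bigram_lm(target_sentences):
    """Count contexts and bigrams over the target side of the training corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in target_sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log-probability; higher means more fluent under the model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )
```

A model trained on "the cat sat" and "the dog sat" assigns a higher score to "the cat sat" than to the scrambled "sat cat the", which is the fluency signal the text attributes to the language model.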
The translation unit 220 translates the sentence to be translated using the translation system trained by the training unit 215, and obtains a translation result 20.
The machine translation device 200 of this embodiment selects from the bilingual corpus 10 the corpus entries most similar to the sentence to be translated and builds a translation system in real time from the selected entries. It can therefore construct a dynamic, targeted translation system and improve translation quality.
In addition, by selecting the top N sorted sentence pairs, the machine translation device 200 of this embodiment can train the translation system on a fixed number of the most similar entries when the bilingual corpus 10 contains a large number of entries highly similar to the sentence to be translated. This not only ensures translation quality but also reduces the processing load of training the translation system.
In addition, by training the translation system on entries whose similarity is greater than a predetermined threshold, the machine translation device 200 of this embodiment excludes entries with low similarity. This prevents such entries from interfering with the translation system and further safeguards translation accuracy.
Preferably, the machine translation device 200 of this embodiment further includes a storage unit that stores the sentence to be translated and its translation result in a translation buffer. The translation buffer may be any memory known to those skilled in the art.
Furthermore, the machine translation device 200 of this embodiment preferably further includes a search unit that looks up the sentence to be translated in the translation buffer. If the sentence to be translated is present in the translation buffer, its translation result is obtained directly; if not, translation is performed using the calculation unit 205, the selection unit 210, the training unit 215, and the translation unit 220.
Because the machine translation device 200 of this embodiment stores the sentence to be translated and its translation result in the translation buffer, the translation result can be taken directly from the translation buffer the next time the same sentence is translated, which saves computing resources and improves translation efficiency.
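The lookup-then-translate behavior of the translation buffer can be sketched as a small wrapper. The class name `TranslationBuffer` and the `translate_fn` callable are hypothetical stand-ins: `translate_fn` represents the full path through the calculation, selection, training, and translation units, and an in-memory dictionary is assumed as the buffer.

```python
class TranslationBuffer:
    """Caches translation results keyed by the exact source sentence."""

    def __init__(self, translate_fn):
        # translate_fn stands in for the full similarity/selection/
        # training/translation pipeline; it is an assumption of this sketch.
        self._translate_fn = translate_fn
        self._buffer = {}

    def translate(self, sentence):
        # Search step: reuse the stored result when the sentence was seen before.
        if sentence in self._buffer:
            return self._buffer[sentence]
        # Otherwise run the full pipeline and store the result.
        result = self._translate_fn(sentence)
        self._buffer[sentence] = result
        return result
```

Only exact repeats hit the buffer in this sketch; near-duplicates still go through the full pipeline.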
Preferably, the machine translation device 200 of this embodiment further includes a sentence pair adding unit that adds the sentence to be translated and its translation result to the bilingual corpus.
By adding the sentence to be translated and its translation result to the bilingual corpus 10, the machine translation device 200 of this embodiment expands the corpus data of the bilingual corpus 10 and thereby improves the quality of subsequent translations.
Preferably, the machine translation device 200 of this embodiment further includes a word alignment unit that performs word alignment on the sentence to be translated and its translation result, and a word alignment result adding unit that adds the word alignment result to the bilingual corpus 10.
In this embodiment, any word alignment tool known to those skilled in the art, for example GIZA++, may be used to perform word alignment on the sentence to be translated and its translation result. This embodiment places no limitation on the word alignment method.
By adding the word alignment result to the bilingual corpus, the machine translation device 200 of this embodiment not only expands the corpus data of the bilingual corpus and improves the quality of subsequent translations, but also improves translation efficiency.
Preferably, the machine translation device 200 of this embodiment further includes a training data adding unit that adds user-related training data to the bilingual corpus 10. Optionally, the word alignment result adding unit described above may also serve as the training data adding unit.
By adding user-related training data, for example aligned sentence pairs, context-related data, or word-aligned sentence pairs, the machine translation device 200 of this embodiment can achieve user adaptation even when training data is insufficient.
Preferably, the machine translation device 200 of this embodiment further includes a confidence calculation unit that calculates the confidence of the translation result using the similarity. Optionally, the calculation unit 205 described above may also serve as the confidence calculation unit.
In this embodiment, confidence refers to how trustworthy the translation result is. If even the part of the bilingual corpus 10 most similar to the input sentence is insufficient for learning the required translation knowledge, the probability of obtaining the best translation from the entire bilingual corpus 10 is also low. When similar sentences are selected, if the number of sentences whose similarity exceeds a given value is below a predetermined threshold, the bilingual corpus 10 can be regarded as lacking the knowledge needed to translate the sentence. For example, the confidence may be regarded as high when the number of sentences whose similarity exceeds the given value is at or above the predetermined threshold.
Because the machine translation device 200 of this embodiment calculates the confidence of the translation result from the similarity, the confidence is obtained together with the translation result. No separate method is needed to calculate the confidence, which improves translation efficiency.
Although the machine translation method and machine translation device of the present invention have been described in detail above through several exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various changes and modifications within the spirit and scope of the present invention. Accordingly, the present invention is not limited to these embodiments, and its scope is defined solely by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410104256.7A CN104933038A (en) | 2014-03-20 | 2014-03-20 | Machine translation method and machine translation device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN104933038A (en) | 2015-09-23 |
Family ID: 54120207
Family Applications (1)
| Application Number | Status | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| CN201410104256.7A | Pending | CN104933038A (en) | 2014-03-20 | 2014-03-20 | Machine translation method and machine translation device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104933038A (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101079028A (en) * | 2007-05-29 | 2007-11-28 | 中国科学院计算技术研究所 | On-line translation model selection method of statistic machine translation |
| CN101271451A (en) * | 2007-03-20 | 2008-09-24 | 株式会社东芝 | Method and device for computer-aided translation |
| CN101393547A (en) * | 2007-09-20 | 2009-03-25 | 株式会社东芝 | Machine translation device, method and system |
| CN101667176A (en) * | 2008-09-01 | 2010-03-10 | 株式会社东芝 | Method and system for counting machine translation based on phrases |
| CN101714137A (en) * | 2008-10-06 | 2010-05-26 | 株式会社东芝 | Methods for evaluating and selecting example sentence pairs and building universal example sentence library, and machine translation method and device |
| CN103631773A (en) * | 2013-12-16 | 2014-03-12 | 哈尔滨工业大学 | Statistical machine translation method based on field similarity measurement method |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106708812A (en) * | 2016-12-19 | 2017-05-24 | 新译信息科技(深圳)有限公司 | Machine translation model obtaining method and device |
| CN106598959A (en) * | 2016-12-23 | 2017-04-26 | 北京金山办公软件股份有限公司 | Method and system for determining intertranslation relationship of bilingual sentence pairs |
| CN107193809A (en) * | 2017-05-18 | 2017-09-22 | 广东小天才科技有限公司 | Teaching material script generation method and device and user equipment |
| CN107329961A (en) * | 2017-07-03 | 2017-11-07 | 西安市邦尼翻译有限公司 | A kind of method of cloud translation memory library Fast incremental formula fuzzy matching |
| CN107526727A (en) * | 2017-07-31 | 2017-12-29 | 苏州大学 | language generation method based on statistical machine translation |
| CN110472251B (en) * | 2018-05-10 | 2023-05-30 | 腾讯科技(深圳)有限公司 | Translation model training method, sentence translation method, equipment and storage medium |
| CN110472251A (en) * | 2018-05-10 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method, the method for statement translation, equipment and the storage medium of translation model training |
| CN111191469A (en) * | 2019-12-17 | 2020-05-22 | 语联网(武汉)信息技术有限公司 | Large-scale corpus cleaning and aligning method and device |
| CN111191469B (en) * | 2019-12-17 | 2023-09-19 | 语联网(武汉)信息技术有限公司 | Large-scale corpus cleaning and aligning method and device |
| CN111221965A (en) * | 2019-12-30 | 2020-06-02 | 成都信息工程大学 | Classification and sampling detection method based on bilingual corpus of public signs |
| CN111800505A (en) * | 2020-07-05 | 2020-10-20 | 胡时英 | Big data acquisition and processing system under the control of on-site remote terminal unit |
| CN113988092A (en) * | 2021-11-05 | 2022-01-28 | 语联网(武汉)信息技术有限公司 | Task-adaptive dynamic training method for engine rollover |
| CN113988092B (en) * | 2021-11-05 | 2024-10-25 | 语联网(武汉)信息技术有限公司 | Task self-adaptive engine turning dynamic training method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20150923 |
|
| WD01 | Invention patent application deemed withdrawn after publication |