WO2017088136A1

WO2017088136A1 - Translation method and terminal

Info

Publication number: WO2017088136A1
Application number: PCT/CN2015/095579
Authority: WO
Inventors: 李想; 李朋; 甘强; 陈天雄; 何永光
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2015-11-25
Filing date: 2015-11-25
Publication date: 2017-06-01
Anticipated expiration: 2018-05-25
Also published as: CN108141498B; CN108141498A

Abstract

Disclosed are a translation method and a terminal, relating to an intelligent voice translation technique, and used to solve the problem that an existing translation method is relatively low in translation accuracy. The method provided in the embodiments of the present invention comprises: acquiring an audio signal sent by a user, wherein the audio signal contains a voice segment signal; performing semantic analysis on the voice segment signal in the audio signal, and if a feature point exists in the voice segment signal, dividing the voice segment signal into at least one voice sub-segment signal by taking the feature point as a division point; and translating the at least one voice sub-segment signal into a voice signal conforming to a target user language, and sending the translated voice signal to a target terminal.

Description

Translation method and terminal

Technical field

本发明涉及语音智能翻译技术，尤其涉及一种翻译方法及终端。The invention relates to a voice intelligent translation technology, in particular to a translation method and a terminal.

Background technique

随着国家同国家的贸易、交流的不断发展和进步，用户国际化的趋势必然导致不同语种的人们在使用手机进行沟通时存在问题，以中、英文母语用户通话为例，使用中文的用户需要熟练使用英语才能和英文用户进行交流，而英文用户也很少有人懂汉语，于是语言成为国际化交流的最大障碍，使得通话过程中的即时语言翻译需求显得越来越重要。With the continuous development and progress of the country's trade and exchange with the country, the trend of users' internationalization will inevitably lead to problems in people of different languages when using mobile phones for communication. For example, Chinese and English native speakers are used as examples. Proficiency in English can communicate with English users, and few English users understand Chinese. Therefore, language has become the biggest obstacle to international communication, making the demand for instant language translation during the call more and more important.

目前的翻译技术主要基于语音端点检测(英文全称：voice activity detection，英文缩写：VAD)技术，检测出连续语句中的静音段，以静音段为分割点，将连续的语句划分为多个短句，来实现通话过程中的实时翻译。然而，这种从物理层出发，判断用户停顿时间较长时才进行断句、翻译的方法，完全脱离了翻译场景，由于用户通话过程中可能存在环境噪音、背景音、以及口头禅“恩～啊～这个”等这些无法提供VAD检测需要的静音时间但又不具有明显语义的语音信号，导致断句失败或者断句不合理，从而造成翻译失真，降低了翻译的准确性。The current translation technology is mainly based on the speech endpoint detection (English full name: voice activity detection, English abbreviation: VAD) technology, detecting the silent segment in the continuous sentence, using the silent segment as the segmentation point, dividing the continuous sentence into multiple short sentences. To achieve real-time translation during the call. However, this method of deciphering and translating from the physical layer to judge the user's pause time is completely out of the translation scenario, because the user may have ambient noise, background sound, and a mantra during the conversation. This "such voice signal that does not provide the silent time required for VAD detection but does not have obvious semantics causes the sentence to fail or the sentence to be unreasonable, resulting in translation distortion and reduced translation accuracy.

发明内容Summary of the invention

本发明的实施例提供一种翻译方法及终端，以解决现有翻译方法翻译准确性较低的问题。Embodiments of the present invention provide a translation method and a terminal to solve the problem that the translation accuracy of the existing translation method is low.

为达到上述目的，本发明的实施例采用如下技术方案：In order to achieve the above object, embodiments of the present invention adopt the following technical solutions:

第一方面，本发明实施例提供一种翻译方法，应用于正在进行语音通话的终端，所述终端可以为正在将本端语音发送至目标终端的发送端，所述方法可以包括：In a first aspect, the embodiment of the present invention provides a translation method, which is applied to a terminal that is performing a voice call, and the terminal may be a sender that is transmitting the local voice to the target terminal. The method may include:

获取用户输入的包含语音段信号的音频信号； Obtaining an audio signal input by the user and including a voice segment signal;

对所述语音段信号进行语义分析，若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；Performing semantic analysis on the voice segment signal, if there is a feature point in the voice segment signal, dividing the voice segment signal into at least one sub-segment segment signal by using the feature point as a segmentation point;

将所述至少一个子语音段信号翻译成符合目标用户语种的语音，将翻译后的语音信号发送给目标终端。Translating the at least one sub-segment segment signal into a voice conforming to a target user language, and transmitting the translated voice signal to the target terminal.

如此，可以将语音段信号中不具有完整语义的语音剔除，同时，保证划分的子语音段为一具有完整语义的语音段，与现有翻译方法相比，提高了翻译准确性。In this way, the speech segment without the complete semantics in the speech segment signal can be eliminated, and at the same time, the sub-speech segment segmentation is guaranteed to be a speech segment with complete semantics, which improves the translation accuracy compared with the existing translation method.

同时，为了实现同声翻译以及提高翻译效率的目的，本发明通过一个完整语义的语句翻译完成后即进行播放的流水线模式、以及在该语句翻译后的语音播放的同时，下一完整语义的语音叠加在该语音中进行播放的语音合成技术，使原始语音和翻译后的语音合成叠加在一起，原声音量降低作为背景音，翻译后的语音作为主音量，发送至目标终端，具体实现如下：At the same time, in order to achieve simultaneous translation and improve translation efficiency, the present invention performs a pipeline mode of playing after completion of a complete semantic statement translation, and a speech of the next complete semantics while the speech is played after the translation of the sentence. The speech synthesis technology superimposed in the speech for playing, superimposes the original speech and the translated speech synthesis, the original sound volume is reduced as the background sound, and the translated speech is sent to the target terminal as the master volume, and the specific implementation is as follows:

向所述目标终端发送第一子语音信号；Transmitting, to the target terminal, a first sub-voice signal;

在将第一子语音信号播放给所述目标用户后，合成所述第一子语音信号翻译后的语音信号和第二子语音信号；After the first sub-speech signal is played to the target user, the translated sub-speech signal and the second sub-speech signal are synthesized;

将合成后的语音信号发送给目标终端。The synthesized voice signal is sent to the target terminal.

如此，不需要等到全部语句播放完成后，再将翻译后的语句逐句播放，与现有播放模式相比，翻译播放时间提前，减少了翻译等待时延，提高了翻译效率，增强用户体验。In this way, it is not necessary to wait until all the statements are played, and then the translated sentences are played sentence by sentence. Compared with the existing play mode, the translation play time is advanced, which reduces the translation waiting delay, improves the translation efficiency, and enhances the user experience.

由于在双发通话过程中，发送端和接收端为相对概念，通常根据通话双发正在通话的情况而定，将说话方确定为发送端，将收听方确定为接收端；因此，在某一时刻，上述发送端可以作为接收端。当上述发送端作为接收端，执行翻译功能时，所述方法还可以包括：Since the transmitting end and the receiving end are relative concepts during the dual-talking process, the calling party is determined to be the transmitting end and the listening party is determined as the receiving end according to the situation in which the double call is being called; therefore, at a certain end At the moment, the above transmitting end can serve as a receiving end. When the sending end is used as the receiving end and the translation function is performed, the method may further include:

接收源终端发送的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号；Receiving an audio signal sent by the source terminal; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold;

对所述音频信号中的语音段信号进行语义分析；Performing semantic analysis on the speech segment signal in the audio signal;

若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点；If there is a feature point in the voice segment signal, the feature point is used as a segmentation point. Segmenting the speech segment signal into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

将所述至少一个子语音段信号翻译成预设语种的语音信号；Translating the at least one sub-segment segment signal into a speech signal of a preset language;

播放翻译后的语音信号。Play the translated voice signal.

同理，为了实现同声翻译以及提高翻译效率的目的，所述方法还可以包括：Similarly, in order to achieve simultaneous translation and improve translation efficiency, the method may further include:

播放第一子语音信号；Playing the first sub-speech signal;

合成所述第一子语音信号翻译后的语音信号和第二子语音信号；Synthesizing the translated speech signal and the second sub-speech signal of the first sub-speech signal;

播放合成后的语音信号。Play the synthesized speech signal.

第二方面，本发明实施例还提供了一种终端，所述终端可以为集成有翻译功能的发送端，用于执行上述翻译方法，所述终端可以包括：In a second aspect, the embodiment of the present invention further provides a terminal, where the terminal may be a sender that is integrated with a translation function, and is configured to perform the foregoing translation method, where the terminal may include:

音频处理模块，用于获取用户输入的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号；An audio processing module, configured to acquire an audio signal input by a user; the audio signal includes a voice segment signal; and the voice segment signal is a segment of a voice signal whose power value is greater than a preset threshold;

语音端点检测模块，用于对所述音频处理模块获取到的音频信号中的语音段信号进行语义分析；若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点；a voice endpoint detection module, configured to perform semantic analysis on a voice segment signal in an audio signal acquired by the audio processing module; if a feature point exists in the voice segment signal, the feature point is used as a segmentation point, and the The speech segment signal is segmented into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

翻译模块，用于将所述语音端点检测模块检测出的至少一个子语音段信号翻译成符合目标用户语种的语音信号；a translation module, configured to translate at least one sub-segment segment signal detected by the speech endpoint detection module into a speech signal conforming to a target user language;

语音合成模块，用于将所述翻译模块翻译后的语音信号发送给目标终端。a voice synthesis module, configured to send the translated voice signal of the translation module to the target terminal.

同时，为了实现同声翻译以及提高翻译效率的目的，本发明通过一个完整语义的语句翻译完成后即进行播放的流水线模式、以及在该语句翻译后的语音播放的同时，下一完整语义的语音叠加在该语音中进行播放的语音合成技术，原声音量降低作为背景音，翻译后的语音作为主音量，进行播放，具体的，所述语音合成模块用于：At the same time, in order to achieve simultaneous translation and improve translation efficiency, the present invention implements a pipeline mode of playing after completion of a complete semantic statement translation, and At the same time as the speech playback after the translation of the sentence, the speech synthesis technology in which the next complete semantic speech is superimposed in the speech, the original sound volume is reduced as the background sound, and the translated speech is played as the master volume, specifically, The speech synthesis module is used to:

在将第一子语音信号播放给所述目标用户后，将所述第一子语音信号翻译后的语音信号和所述第二子语音信号进行语音合成；After the first sub-speech signal is played to the target user, the speech signal translated by the first sub-speech signal and the second sub-speech signal are synthesized by speech;

将合成后的语音播放给所述目标用户。The synthesized voice is played to the target user.

由于发送端和接收端为相对概念，通常根据通话双发正在通话的情况而定，将说话方确定为发送端，将收听方确定为接收端，因此，在某一时刻，上述终端可以作为接收端，当上述终端作为接收端时，所述音频处理模块，还可以用于：Since the transmitting end and the receiving end are relative concepts, usually according to the situation in which the double call is being called, the speaking party is determined as the transmitting end, and the listening party is determined as the receiving end. Therefore, at a certain moment, the terminal can be received as a receiving end. The audio processing module is further configured to: when the terminal is used as the receiving end, the audio processing module is further configured to:

所述语音端点检测模块，还可以用于对所述音频处理模块获取到的音频信号中的语音段信号进行语义分析；若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点；The voice endpoint detection module may be further configured to perform semantic analysis on a voice segment signal in the audio signal acquired by the audio processing module; if a feature point exists in the voice segment signal, the feature point is segmented Pointing, dividing the speech segment signal into at least one sub-speech segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

所述翻译模块，还可以用于将所述语音端点检测模块检测出的至少一个子语音段信号翻译成预设语种的语音信号；The translation module may be further configured to translate the at least one sub-segment segment signal detected by the speech endpoint detection module into a speech signal of a preset language;

所述语音合成模块，还可以用于播放所述翻译模块翻译后的语音信号。The speech synthesis module may be further configured to play the translated speech signal of the translation module.

同理，为了实现同声翻译以及提高翻译效率的目的，所述语音合成模块，还可以用于：Similarly, in order to achieve simultaneous translation and improve translation efficiency, the speech synthesis module can also be used to:

播放第一子语音信号；Playing the first sub-speech signal;

在播放第一子语音信号后，合成所述第一子语音信号翻译后的语音信号和第二子语音信号； After playing the first sub-speech signal, synthesizing the translated sub-speech signal and the second sub-speech signal;

播放合成后的语音信号。Play the synthesized speech signal.

第三方面，本发明实施例还提供了一种终端，所述终端可以为集成有翻译功能的发送端，用于执行上述翻译方法，所述终端可以包括：In a third aspect, the embodiment of the present invention further provides a terminal, where the terminal may be a sender that is integrated with a translation function, and is configured to perform the foregoing translation method, where the terminal may include:

输入设备，用于获取用户输入的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号；An input device, configured to acquire an audio signal input by a user; the audio signal includes a voice segment signal; and the voice segment signal is a segment of a voice signal whose power value is greater than a preset threshold;

处理器，用于对所述输入设备获取到的音频信号中的语音段信号进行语义分析；若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点；a processor, configured to perform semantic analysis on a voice segment signal in the audio signal acquired by the input device; if a feature point exists in the voice segment signal, the feature segment is used as a segmentation point, and the voice segment is Separating the signal into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

将所述至少一个子语音段信号翻译成符合目标用户语种的语音信号；Translating the at least one sub-segment segment signal into a speech signal conforming to a target user language;

发送器，用于将所述处理器翻译后的语音信号发送给目标终端。And a transmitter, configured to send the translated voice signal of the processor to the target terminal.

同时，为了实现同声翻译以及提高翻译效率的目的，本发明通过一个完整语义的语句翻译完成后即进行播放的流水线模式、以及在该语句翻译后的语音播放的同时，下一完整语义的语音叠加在该语音中进行播放的语音合成技术，原声音量降低作为背景音，翻译后的语音作为主音量，进行播放，具体的，所述处理器，还用于：At the same time, in order to achieve simultaneous translation and improve translation efficiency, the present invention performs a pipeline mode of playing after completion of a complete semantic statement translation, and a speech of the next complete semantics while the speech is played after the translation of the sentence. The speech synthesis technology superimposed on the speech for playing, the original sound volume is reduced as the background sound, and the translated speech is played as the master volume. Specifically, the processor is further used for:

在所述发送器将所述处理器翻译后的语音信号发送给目标终端之前，合成第一子语音信号翻译后的语音信号和第二子语音信号；Before the transmitter sends the translated voice signal of the processor to the target terminal, synthesizing the translated voice signal and the second sub-voice signal of the first sub-voice signal;

所述发送器，具体用于：The transmitter is specifically configured to:

将所述第一子语音信号发送给目标终端；Transmitting the first sub-voice signal to the target terminal;

将所述合成后的语音信号发送给目标终端。And transmitting the synthesized voice signal to the target terminal.

如此，不需要等到全部语句播放完成后，再将翻译后的语句逐句播放，与现有播放模式相比，翻译播放时间提前，减少了翻译等待时延，提高了翻译效率，增强用户体验。In this way, it is not necessary to wait until all the statements have been played, and then the translated sentences are played sentence by sentence. Compared with the existing play mode, the translation play time is advanced, and the translation is reduced. Time delays improve translation efficiency and enhance user experience.

由于发送端和接收端为相对概念，通常根据通话双发正在通话的情况而定，将说话方确定为发送端，将收听方确定为接收端，因此，在某一时刻，上述终端也可以作为接收端，当上述终端作为接收端时，所述终端，还可以包括：Since the transmitting end and the receiving end are relative concepts, usually according to the situation in which the double call is being called, the speaking party is determined as the transmitting end, and the listening party is determined as the receiving end. Therefore, at a certain moment, the terminal can also be used as The receiving end, when the terminal is used as the receiving end, the terminal may further include:

接收器，用于接收源终端发送的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号；a receiver, configured to receive an audio signal sent by the source terminal; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold;

所述处理器，还可以用于对所述接收器获取到的音频信号中的语音段信号进行语义分析；若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点；The processor may be further configured to perform semantic analysis on a voice segment signal in the audio signal acquired by the receiver; if a feature point exists in the voice segment signal, the feature point is used as a segmentation point, and The speech segment signal is segmented into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

输出设备，用于播放所述处理器翻译后的语音信号。And an output device, configured to play the translated voice signal of the processor.

同理，为了实现同声翻译以及提高翻译效率的目的，所述处理器，还用于：Similarly, in order to achieve simultaneous translation and improve translation efficiency, the processor is also used to:

在所述输出设备播放所述处理器翻译后的语音信号之前，合成第一子语音信号翻译后的语音信号和第二子语音信号；Before the output device plays the translated speech signal of the processor, synthesizing the translated speech signal and the second sub-speech signal of the first sub-speech signal;

所述输出设备，具体用于：The output device is specifically configured to:

播放所述第一子语音信号；Playing the first sub-speech signal;

播放所述合成后的语音信号。Playing the synthesized speech signal.

由上可知，本发明实施例提供一种翻译方法及终端，获取源用户发出的一帧音频信号；所述音频信号包含语音段信号；对所述音频信号中的语音段信号进行语义分析，检测所述语音段信号中是否存在特征点；所述特征点为：不具有完整语义的语音信号所处的时间点；若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；将至少一个子语音段信号翻译成符合目标用户语种的语音，将翻译后的语音播放给所述目标用户。如此，基于语义分析把经VAD端点检测的语句中不具有完整语义的语音剔除，切分为更短的且具有完整语义的语句，完整的表达了说话者的语句含义，避免出现断句或半句的情况，有效地提高了通话中即时翻译的准确性；同时，通过流水线模式+两路音频叠加的语音合成技术，不需要等到全部语句播放完成后，再将翻译后的语句逐句播放，与现有播放模式相比，翻译播放时间提前，减少了翻译等待时延，提高了翻译效率，增强用户体验。As can be seen from the above, an embodiment of the present invention provides a translation method and a terminal, which acquires a frame of an audio signal sent by a source user; the audio signal includes a voice segment signal; and performs semantic analysis on the voice segment signal in the audio signal, and detects Whether there is a feature point in the speech segment signal; the feature point is: a time point at which the speech signal without complete semantics is located; if there is a feature point in the speech segment signal, the feature point is used as a segmentation point And dividing the voice segment signal into at least one sub-segment segment signal; translating the at least one sub-segment segment signal into a voice conforming to a target user language, and playing the translated voice to the target user. So, based on semantic analysis, the statement detected by the VAD endpoint is not Speech culling with complete semantics is divided into shorter and complete semantic statements, which fully expresses the speaker's sentence meaning, avoids the occurrence of sentence or half sentence, and effectively improves the accuracy of instant translation during the call. At the same time, the speech synthesis technology through pipeline mode + two-channel audio superposition does not need to wait until all the statements are played, and then the translated sentences are played sentence by sentence. Compared with the existing play mode, the translation play time is advanced, which is reduced. Translation latency is delayed, which improves translation efficiency and enhances user experience.

DRAWINGS

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below. Obviously, the drawings in the following description are only It is a certain embodiment of the present invention, and other drawings can be obtained from those skilled in the art without any creative work.

图1为本发明实施例提供的一种终端的结构示意图；FIG. 1 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure;

图2为本发明实施例提供的一种翻译方法的流程示意图；2 is a schematic flowchart of a translation method according to an embodiment of the present invention;

图3为本发明实施例提供的一种翻译方法的流程示意图；FIG. 3 is a schematic flowchart diagram of a translation method according to an embodiment of the present disclosure;

图4为本发明实施例提供的实时翻译的时序图；4 is a sequence diagram of real-time translation according to an embodiment of the present invention;

图5为本发明实施例提供的终端的结构图示意图。FIG. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

detailed description

本发明的核心思想是：在现有手机中集成不同语种间的实时语音翻译功能，基于语义分析将主叫用户录入的语音信息或对被叫用户发来的语音信息进行分割，剔除语音信息中基本不具有任何语义的语音，将语音信息划分为更短的且具有完整语义的短句，同时，采用将一个完整语义的语句翻译完成后即进行播放的流水线模式、以及在该语句翻译后的语音播放的同时，下一完整语义的语音叠加在该语音中进行播放的语音合成技术将翻译后的语音播放给目标用户，来实现对不同语种的手机用户进行通话的支持。The core idea of the invention is to integrate real-time speech translation function between different languages in an existing mobile phone, and to segment the voice information input by the calling user or the voice information sent by the called user based on semantic analysis, and eliminate the voice information. A speech that does not have any semantics, divides the speech information into shorter short sentences with complete semantics, and at the same time, adopts a pipeline mode in which a complete semantic statement is translated and then played, and after the statement is translated. At the same time of voice playback, the speech synthesis technology in which the next complete semantic voice is superimposed in the voice plays the translated voice to the target user, so as to realize the support for the mobile phone users of different languages.

可以理解的是，本发明实施例所述的主叫和被叫为相对概念，根据通话双发的发起顺序而定，通常将通话发起者称之为“主叫”，相应的，将通话接收者称之为“被叫”；所述的语音信息可以为包含多个语义的但无静音段信号的一段语音段信号，每个语句可以表达一个意思，且语句之间的时间间隔比较短(基本采用现有端点检测技术是区分不开的)；比如，根据人们的通话习惯，通常会将表达不同意思的语言基本不间断地说完，如：“吃什么好呢～嗯～额～哪个～吃面条吧”虽然为不间断的一段语音信号，但是“～嗯～额～哪个～”不具有任何特殊的含义，因此可以根据语义分析将该语音信息分为两个意思的语句：一、吃什么好呢；二、吃面条吧。需要说明的是，所述静音段信号为用户当前给对端待发出的一段完整语句中，语音信号的功率值低于预设门限值，且持续时间大于预设时间值的一段信号，如语音信号的功率值低于0dB，且持续时间大于500ms的一段语音信号可以作为静音段信号；相对应的，语音信号的功率值大于预设门限值的数据为语音段信号；其中，预设门限值和预设时间值可以根据需要进行设定，本发明实施例对比不进行限定。It can be understood that the calling party and the called party are relative concepts according to the embodiment of the present invention, and the calling initiator is usually referred to as a “calling party” according to the order in which the call is sent, and correspondingly, the call is received. Called "called"; the voice information can be included A segment of speech segment signals with multiple semantic but no mute segment signals, each statement can express a meaning, and the time interval between the sentences is relatively short (basically using existing endpoint detection techniques is indistinguishable); for example, according to People's habits of communication usually end up with a language that expresses different meanings, such as: "What is good to eat ~ ah ~ amount ~ which ~ eat noodles" although an uninterrupted piece of voice signal, but "~ "Hmm ~ amount ~ which ~" does not have any special meaning, so you can divide the voice information into two meanings according to semantic analysis: First, what to eat; Second, eat noodles. It should be noted that the mute segment signal is a segment of the signal that the power value of the voice signal is lower than the preset threshold and the duration is greater than the preset time value, such as a segment of the complete sentence to be sent by the user to the peer end, such as The voice signal has a power value lower than 0 dB, and a voice signal having a duration greater than 500 ms can be used as the silent segment signal; correspondingly, the data with the power value of the voice signal greater than the preset threshold is a voice segment signal; wherein, the preset The threshold value and the preset time value can be set as needed, and the comparison of the embodiments of the present invention is not limited.

下面结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

参照图1，为本发明实施例提供的集成有翻译功能的终端10的结构示意图，用于实现用户通话过程中的实时翻译，如图1所示，所述终端10可以由如下模块组成：音频处理模块101、语音端点检测模块102、语音识别模块103、翻译模块104、语音合成模块105；1 is a schematic structural diagram of a terminal 10 integrated with a translation function according to an embodiment of the present invention, which is used to implement real-time translation during a user call. As shown in FIG. 1 , the terminal 10 may be composed of the following modules: The processing module 101, the voice endpoint detection module 102, the voice recognition module 103, the translation module 104, and the voice synthesis module 105;

音频处理模块101：可以包含音频驱动、数字信号处理器(Digital Signal Processing，DSP)、调制解调器(Modem)、编译码器(Codec)、麦克风(MIC)、扬声器(Specker，简称：SPK)等子模块；主要提供录音放音功能，用于接收主叫用户发出的音频信号，将音频信号发送至语音端点检测模块102进行后续翻译工作，将翻译成被叫语种的语音经模数转换、调制、编码等处理后播放给通话被叫，或接收被叫用户发出的音频信号，将该音频信号经过数模转换、解调、译码等处理后发送至语音端点检测模块102进行后续翻译工作，将翻译成主叫语种的被叫语音播放给主叫用户；其中，音频驱动、DSP、Modem、Codec、MIC、SPK等子模块为现有音频处理中常用模块，在此不再详细赘述。The audio processing module 101: may include an audio driver, a digital signal processing (DSP), a modem (Modem), a codec, a microphone (MIC), a speaker (Specker, SPK for short), and the like. The recording and playback function is mainly provided for receiving the audio signal sent by the calling user, and the audio signal is sent to the voice endpoint detection module 102 for subsequent translation work, and the voice translated into the called language is subjected to analog-to-digital conversion, modulation, and encoding. After being processed, it is played to the called party, or receives the audio signal sent by the called user, and the audio signal is subjected to digital-to-analog conversion, demodulation, After decoding and the like, the processing is sent to the voice endpoint detection module 102 for subsequent translation work, and the called voice translated into the calling language is played to the calling user; among them, the audio driver, the DSP, the Modem, the Codec, the MIC, the SPK, and the like are sub-modules. It is a common module in the existing audio processing, and will not be described in detail here.

语音端点检测模块102：主要根据语音识别模块103中的语义数据库，检测出语音段信号中语义独立的语句，并将检测出的语句提供给语音识别模块103进行文字转换。The speech endpoint detection module 102: based on the semantic database in the speech recognition module 103, detects a semantically independent statement in the speech segment signal, and provides the detected statement to the speech recognition module 103 for text conversion.

语音识别模块103：可以包含语义数据库，主要为语音端点检测模块102的语句检测提供判断依据，并将语音端点检测模块102检测到的语句转换为文字信息。The speech recognition module 103 can include a semantic database, and provides a judgment basis for the sentence detection of the speech endpoint detection module 102, and converts the statement detected by the speech endpoint detection module 102 into text information.

翻译模块104：主要用于将语音识别模块103转换后的文字信息翻译成符合目标(主叫或被叫)语种的文字信息。The translation module 104 is mainly used to translate the text information converted by the speech recognition module 103 into text information conforming to the target (calling or called) language.

语音合成模块105，主要用于将翻译模块104翻译后的文字信息转换为语音信息发送至音频处理模块101，由音频处理模块101播放给目标用户。The speech synthesis module 105 is mainly configured to convert the translated text information of the translation module 104 into voice information and send it to the audio processing module 101, and the audio processing module 101 plays the target information to the target user.

其中，为了使翻译后的语音不影响原始语音，所述语音合成模块105，还可以用于将原声和语音合成模块105翻译后的声音叠加在一起，原声音量降低作为背景音，翻译后的语音作为主音量，达到类似现场同声翻译的效果。The voice synthesizing module 105 can also be used to superimpose the sounds translated by the original sound and the speech synthesis module 105, and the original sound volume is reduced as the background sound, and the translated speech is used in order to prevent the translated speech from affecting the original speech. As the master volume, it achieves the effect of similar on-site simultaneous translation.

可理解的是，对于手机通话过程中实现即时语音翻译的任意两个手机用户而言，上述执行翻译功能的模块可以集成在一个终端中，也可以分别集成在通话的两个终端中，即在本发明中，任意一个通话终端都可以采用图2所示的结构来实现通话过程中的即时翻译功能。具体来说，本发明的终端应用在实际应用中可采用以下几种基本架构：①主叫终端采用图2所示结构，被叫终端保持不变；②主叫终端保持不变，被叫终端采用图2所示结构；③主叫终端和被叫终端均采用图2所示的结构，即每个通讯终端都能支持第一语言到第二语言的翻译、以及第二语言到第一语言的翻译；具体采用哪种架构，本发明实施例不进行限定，本发明仅以将执行翻译功能的模块集中在一端终端内为例进行说明。It can be understood that, for any two mobile phone users who implement instant voice translation during a mobile phone call, the above-mentioned module for performing the translation function can be integrated into one terminal, or can be integrated into two terminals of the call, that is, In the present invention, any one of the call terminals can adopt the structure shown in FIG. 2 to implement the instant translation function during the call. Specifically, the terminal application of the present invention can adopt the following basic architectures in practical applications: 1 the calling terminal adopts the structure shown in FIG. 2, the called terminal remains unchanged; 2 the calling terminal remains unchanged, and the called terminal The structure shown in FIG. 2 is adopted; 3 the calling terminal and the called terminal adopt the structure shown in FIG. 2, that is, each communication terminal can support the translation of the first language to the second language, and the second language to the first language. The embodiment of the present invention is not limited, and the present invention only uses the module that will perform the translation function. The block is concentrated in one terminal and is described as an example.

为了便于描述，以下实施例一以步骤的形式示出并详细描述了本发明终端10执行自动翻译的过程，其中，示出的步骤也可以在除终端10之外的诸如一组可执行指令的计算机系统中执行。此外，虽然在图中示出了逻辑顺序，但是在某些情况下，可以以不同于此处的顺序执行所示出或描述的步骤。For convenience of description, the following embodiment 1 shows and describes in detail the process of the automatic translation of the terminal 10 of the present invention in the form of steps, wherein the steps shown may also be in addition to the terminal 10 such as a set of executable instructions. Executed in a computer system. Moreover, although logical sequences are shown in the figures, in some cases the steps shown or described may be performed in a different order than the ones described herein.

实施例一Embodiment 1

图2为本发明实施例提供的一种翻译方法的流程图，应用于如图1所示的终端，且所述终端和对端终端正在进行语音通话，当前时刻正在将本端语音发送至目标终端(即对端终端)，如图2所示，所述方法可以包括：2 is a flowchart of a translation method according to an embodiment of the present disclosure, which is applied to a terminal as shown in FIG. 1 , and the terminal and the opposite terminal are performing a voice call, and the local voice is being sent to the target at the current moment. The terminal (ie, the peer terminal), as shown in FIG. 2, the method may include:

步骤101：获取用户输入的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号。Step 101: Acquire an audio signal input by a user; the audio signal includes a voice segment signal; and the voice segment signal is a segment of a voice signal whose power value is greater than a preset threshold.

其中，所述用户为当前通话过程中正在说话的用户，为手持所述终端的本端用户。The user is a user who is talking during the current call, and is a local user who holds the terminal.

其中，所述音频信号是带有语音、音乐和音效的有规律的声波的频率、幅度变化信息的载体，根据声波的特征，可以为一种连续变化的模拟信号，从时间上来划分，可以分为多段语音信号，通常情况下，音频信号有三个重要参数：频率、幅度和相位，来决定音频信号的特征；可以将音频信号的信号幅度的平方值确定为语音信号的功率值(以dB为单位)，用于表示该语音信号的强度大小，即音量大小。Wherein, the audio signal is a carrier of frequency and amplitude change information of regular sound waves with voice, music and sound effects, and according to the characteristics of sound waves, it may be a continuously changing analog signal, which is divided in time and can be divided. For multi-segment speech signals, usually, the audio signal has three important parameters: frequency, amplitude and phase to determine the characteristics of the audio signal; the square value of the signal amplitude of the audio signal can be determined as the power value of the speech signal (in dB) Unit), used to indicate the intensity of the voice signal, that is, the volume.

所述静音段信号为用户当前给对端待发出的一段完整语句中，语音信号的功率值低于预设门限值，且持续时间大于预设时间值的一段信号，如语音信号的功率值低于0dB，且持续时间大于500ms的一段语音信号可以称之为静音段信号；相对应的，语音信号的功率值大于预设门限值的数据为语音段信号；其中，预设门限值和预设时间值可以根据需要进行设定，本发明实施例对比不进行限定。The mute segment signal is a segment of a signal in which a power value of the voice signal is lower than a preset threshold and the duration is greater than a preset time value, such as a power value of the voice signal, in a complete sentence to be sent by the user to the opposite end. A voice signal below 0 dB and having a duration greater than 500 ms may be referred to as a silent segment signal; correspondingly, the data whose voice signal has a power value greater than a preset threshold is a voice segment signal; wherein, the preset threshold value The preset time value can be set as needed, and the comparison of the embodiments of the present invention is not limited.

可选的，在本发明实施例中，可以采用现有语音端点检测(voice activity detection，VAD)技术对所述音频信号进行检测，先将语音信号的功率值小于预设门限值，且持续时间大于预设时间值的语音信号确定为静音段信号，然后，以所述静音段信号为分割点，对所述音频信号进行分割，获取至少一个语音段信号。Optionally, in the embodiment of the present invention, the existing voice endpoint detection (voice) may be adopted. The activity detection (VAD) technology detects the audio signal by first determining that the voice signal has a power value less than a preset threshold, and the voice signal having a duration greater than the preset time value is determined as a silent segment signal, and then The mute segment signal is a segmentation point, and the audio signal is segmented to obtain at least one segment of the speech segment.

例如，对端用户说：“今天天气很好，咱们去吃饭吧，吃什么好呢～嗯～额～哪个～吃面条吧”，其中，“今天天气很好”和“咱们去吃饭吧”之间的语音信号的功率值低于预设门限值，且持续时间大于预设时间值，则确定这两句话之间发出的语音信号为静音段信号，同理，若“咱们去吃饭吧”和“吃什么好呢～嗯～额～哪个～吃面条吧”之间的语音信号的功率值也低于预设门限值，且持续时间也大于预设时间值，可以确定“咱们去吃饭吧”和“吃什么好呢～嗯～额～哪个～吃面条吧”之间的语音信号为静音段信号，因此，可以将对端用户说的该段话“今天天气很好，咱们去吃饭吧，吃什么好呢～嗯～额～哪个～吃面条吧”划分为三个语音段信号“今天天气很好”、“咱们去吃饭吧”、“吃什么好呢～嗯～额～哪个～吃面条吧”。For example, the opposite user said: "The weather is very good today, let's go eat, what is good to eat ~ um ~ amount ~ which ~ eat noodles", among them, "today is good weather" and "we go to eat" If the power value of the voice signal is lower than the preset threshold and the duration is greater than the preset time value, it is determined that the voice signal sent between the two sentences is a silent segment signal. Similarly, if "we go to eat." "And what is good to eat ~ ah ~ amount ~ which ~ eat noodles" between the power value of the voice signal is also lower than the preset threshold, and the duration is also greater than the preset time value, you can determine "we go Let's eat and "What do you eat? Well, the amount ~ which ~ eat noodles." The voice signal is a silent segment signal. Therefore, the peer user can say the paragraph "Today's weather is very good, let's go Eat, what to eat? ~Hmm~ amount~ which eat noodles?" Divided into three voice segment signals "Today's weather is very good", "Let's go eat", "What are you eating? Well~~~ ~ Eat noodles."

步骤102：对所述音频信号中的语音段信号进行语义分析，若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号，所述特征点为：不具有完整语义的语音信号所处的时间点。Step 102: Perform semantic analysis on the speech segment signal in the audio signal. If there is a feature point in the speech segment signal, divide the speech segment signal into at least one sub-speech by using the feature point as a segmentation point. Segment signal, the feature point is: the time point at which the speech signal without complete semantics is located.

在本发明实施例中，可以预先将实际当中常用的一些不具有完整语义的词语或词的特征值作为语义特征值存储在语义数据库中，然后，可以查询所述语义数据库，对所述音频信号中的语音段信号进行语义分析；若所述语音段信号中存在第一语音信号，所述第一语音信号的特征值包含在所述语义数据库中，则确定所述第一语音信号为所述特征点；若所述语音段信号中所有语音信号的特征值均未包含在所述语义数据库中，则确定所述语音段信号中未包含特征点。其中，所述第一语音信号可以为所述语音段信号中的任一语音信号。In the embodiment of the present invention, some feature values of words or words that are not commonly used in practice may be stored in the semantic database as semantic feature values, and then the semantic database may be queried for the audio signal. The speech segment signal is semantically analyzed; if the first speech signal is present in the speech segment signal, and the feature value of the first speech signal is included in the semantic database, determining the first speech signal is the And a feature point; if the feature values of all the voice signals in the voice segment signal are not included in the semantic database, determining that the feature segment does not include a feature point. The first voice signal may be any one of the voice segment signals.

例如，在实际通话过程中，按照人们的习惯，通常会将“另外、还有、首先、其次、嗯、额、哪个”等过渡词前后的语句作为两个不同意思的语句同时讲给对方，但是，这些过渡词却不具有完整的语义，此时，在本发明实施例中，可以根据该习惯应用，将这些过渡词的特征值作为语义特征值，预先存在在语义特征库中，以便后续对语音段信号进行语义分析。For example, during an actual call, according to people’s habits, The statements before and after the transition words such as first, second, um, forehead, and which are used as two different meanings are spoken to each other at the same time. However, these transition words do not have complete semantics. In this case, the present invention is implemented. For example, according to the custom application, the feature values of the transition words may be pre-existed in the semantic feature database as semantic feature values, so as to perform semantic analysis on the speech segment signals.

可理解的是，在本发明实施例中，还可以将“噪声、背景音”等非静音但又不具有任何语义的语音的特征值作为语义特征值存储在语义数据库中，以便匹配语音段信号中该部分语音。It can be understood that, in the embodiment of the present invention, feature values of speech that are not muted but have no semantics such as “noise, background sound” may be stored as semantic feature values in the semantic database to match the voice segment signal. The part of the voice.

可选的，所述以所述特征点为分割点，将所述子语音段信号分割为至少一个子语音段信号具体可以包括：对于任一特征点，将所述特征点和所述特征点相邻的上一特征点间的语音信号作为一个子语音段信号，将所述特征点和所述特征点相邻的下一特征点之间的语音信号作为另外一个语音段信号。Optionally, the dividing the sub-speech segment signal into the at least one sub-segment segment signal by using the feature point as a segmentation point may specifically include: for any feature point, the feature point and the feature point The speech signal between the adjacent previous feature points is used as a sub-speech segment signal, and the speech signal between the feature point and the next feature point adjacent to the feature point is used as another speech segment signal.

如此，可以将语音段信号中不具有完整语义的语音剔除，同时，保证划分的子语音段为一具有完整语义的语音段。In this way, the speech segment signal without the complete semantics can be culled, and at the same time, the divided sub-speech segment is guaranteed to be a speech segment with complete semantics.

仍以步骤101中的例子为例，分别对音频信号中的三个语音段信号进行语义分析，发现语音段信号“吃什么好呢～嗯～额～哪个～吃面条吧”中“嗯～额～哪个”语音信号的特征值包含在语义特征库中，则可以以“嗯～额～哪个”为间隔将该语音段信号分为两个完整语义的子语音段“吃什么好呢”和“吃面条吧”。Still taking the example in step 101 as an example, the semantic analysis of the three speech segment signals in the audio signal is performed separately, and it is found that the speech segment signal "what is good to eat ~ um ~ amount ~ which ~ eat noodles" in the "hmm ~ amount ~ Which "the eigenvalue of the speech signal is included in the semantic feature database, then the speech segment signal can be divided into two complete semantic sub-speech segments at intervals of "hmm ~ amount ~ which", "What are you eating?" and " Eat noodles."

当然，子语音段信号的获取还可以是多个特征点之间的语音段信号。例如，某一语音段之间包括10个特征点，可提取第1特征点和第4特征点之间的语音段信号作为子语音段信号。该子语音段信号的提取规则可根据终端的处理能力等确定。Of course, the acquisition of the sub-segment segment signal may also be a speech segment signal between a plurality of feature points. For example, if a certain speech segment includes 10 feature points, the speech segment signal between the first feature point and the fourth feature point may be extracted as a sub-speech segment signal. The extraction rule of the sub-speech segment signal can be determined according to the processing capability of the terminal or the like.

可理解的是，若语音段信号中不存在特征点，则表示该语音段信号不能分割成至少两个完整语义的语句，需直接对该语音段信号进行翻译。It can be understood that if there is no feature point in the speech segment signal, it indicates that the speech segment signal cannot be segmented into at least two complete semantic sentences, and the speech segment signal needs to be directly translated.

步骤103：将所述至少一个子语音段信号翻译成符合目标用户语种的语音信号，将翻译后的语音信号发送给目标终端。 Step 103: Translate the at least one sub-speech segment signal into a speech signal conforming to the target user language, and send the translated speech signal to the target terminal.

其中，所述目标用户为当前通话过程中正在收听语音的用户，且本端用户和所述目标用户所支持的通话语种是不同的；例如，本端用户可以用汉语打电话，而目标用户则可以用英文通话。The target user is a user who is listening to the voice during the current call, and the calling language supported by the local user and the target user is different; for example, the local user can call in Chinese, and the target user can You can talk in English.

可选的，所述将所述至少一个子语音段信号翻译为符合目标用户语种的语音具体可以包括：Optionally, the translating the at least one sub-segment segment signal into the voice that conforms to the target user language may include:

所述终端中的语音识别模块将每个子语音段信号转换为符合源用户语种的文本；The voice recognition module in the terminal converts each sub-speech segment signal into text conforming to the source user language;

所述终端中的翻译模块将转换后的文本翻译成符合目标用户语种的文本；Translating the module in the terminal to translate the converted text into text conforming to the target user language;

所述终端中的语音合成模块将翻译后的文本转换为语音信号。The speech synthesis module in the terminal converts the translated text into a speech signal.

相应的，可以将翻译后的语音信号按照时间顺序依次播放给所述目标用户。由于在实际应用中，完全过滤掉原始语音，仅播放翻译后的语音信号的播放方式会让用户感到疑惑，不自然，为解决该问题，目前，人们基本上会将用户发出的原始语音信号和翻译后的语音信号均播放为目标用户，此时，为了使翻译后的语音信号不影响原始语音信号的播放，现有常规做法是在一段语音信号完全停止播放后，再将该段音频信号中每段子音频信号翻译后的语音信号依次播放出去，这导致目标用户需要长时间等待，为避免目标用户等待翻译的时间过长，本发明通过流水线模式+两路音频叠加的语音合成技术，使原始语音和翻译后的语音合成叠加在一起，原声音量降低作为背景音，翻译后的语音作为主音量，进行播放，具体实现如下：Correspondingly, the translated voice signals can be sequentially played to the target user in chronological order. Since the original voice is completely filtered out in the actual application, only playing the translated voice signal will make the user feel confused and unnatural. To solve the problem, at present, people basically will send the original voice signal and the user. The translated speech signal is played as the target user. At this time, in order to prevent the translated speech signal from affecting the playback of the original speech signal, it is conventional practice to place the audio signal in the segment after the speech signal is completely stopped. The translated voice signal of each sub-audio signal is sequentially played out, which causes the target user to wait for a long time. In order to avoid the target user waiting for the translation for too long, the present invention makes the original through the pipeline mode + two-way audio superposition speech synthesis technology. The speech and the translated speech synthesis are superimposed together, the original sound volume is reduced as the background sound, and the translated speech is played as the master volume, and the specific implementation is as follows:

将合成的语音信号发送给目标终端。The synthesized speech signal is transmitted to the target terminal.

同理，将第二子语音信号播放给所述目标终端后，可以将所述第二子语音信号翻译后的语音信号和所述第三子语音信号合成后播放给所述目标终端，按照这种方式，直至将子语音信号和翻译后的语音信号完全播放给所述目标终端，如此，实现了边播放边翻译的效果，降低了翻译的等待时延。Similarly, after the second sub-speech signal is played to the target terminal, the second sub-voice signal translated speech signal and the third sub-speech signal may be synthesized and played to the target terminal, according to this Ways until the sub-speech signal and the translated The voice signal is completely played to the target terminal, thus realizing the effect of translation while playing, and reducing the waiting delay of translation.

需要说明的是，第一子语音信号、第二子语音信号、第三子语音信号可以为步骤102中获得的至少一个子语音信号中的任一子语音信号，但是，从时间顺序上来看，第二子语音信号为：在第一子语音信号的时间之后且与第一子语音信号相邻的语音信号，第三子语音信号为：在第二子语音信号的时间之后且与第二子语音信号相邻的语音信号。It should be noted that the first sub-speech signal, the second sub-speech signal, and the third sub-speech signal may be any one of the at least one sub-speech signals obtained in step 102, but, in terms of chronological order, The second sub-speech signal is: a speech signal after the time of the first sub-speech signal and adjacent to the first sub-speech signal, and the third sub-speech signal is: after the time of the second sub-speech signal and with the second sub- A speech signal adjacent to a speech signal.

例如，如图3所示，为本发明实施例提供的实时翻译的时序图，本端基于语义分析将语音信息划分为三个完整语义的语音信息，且逐句翻译后，现有技术通常会这三个原始语句完全播放后，才将翻译后的语音按照时间顺序逐句播放给目标用户，由此导致了时间迟延。而本申请采用流水线模式，一个语句播放之后即进行翻译后的语音播放，同时，为了使下一语句的原始播放不影响当前翻译后的语音播放，将二者进行了音频合成处理，如此，不需要等到全部语句播放完成后，再将翻译后的语句逐句播放，从图3可以看出，本申请的流水线模式与现有播放模式相比，翻译播放时间提前，减少了翻译等待时延，提高了翻译效率，增强用户体验。For example, as shown in FIG. 3, which is a timing diagram of real-time translation provided by an embodiment of the present invention, the local end divides the voice information into three complete semantic voice information based on semantic analysis, and after the sentence is translated, the prior art usually After the three original sentences are fully played, the translated speech is played back to the target user in chronological order, resulting in a delay. The application adopts a pipeline mode, and the translated voice is played after a statement is played. At the same time, in order to make the original playback of the next sentence does not affect the current translated voice play, the two are subjected to audio synthesis processing, so It is necessary to wait until all the statements have been played, and then the translated sentences are played one by one. As can be seen from FIG. 3, the pipeline mode of the present application is earlier than the existing play mode, and the translation waiting time is reduced. Improve translation efficiency and enhance user experience.

可选的，本发明实施例中，所述合成所述第一子语音信号翻译后的语音信号和第二子语音信号具体可以包括：Optionally, in the embodiment of the present invention, the synthesizing the translated voice signal and the second sub-voice signal of the first sub-voice signal may specifically include:

对所述第一子语音信号翻译后的语音信号和所述第二子语音信号进行加权求和；Performing weighted summation on the voice signal translated by the first sub-speech signal and the second sub-speech signal;

其中，在加权求和过程中，所述第一子语音信号翻译后的语音信号的权值和所述第二子语音信号的权值，可以根据需要进行设定，本发明实施例对比不进行限定，所述权值可以理解为加权求和过程中，多个语音信号在总的语音信号中所占据的比重。但是，为了使原始语音不影响到翻译后的语音信号的播放，本发明实施例中，在对所述第一子语音信号翻译后的语音信号的权值和所述第二子语音信号的权值的设定过程中，需要使所述第一子语音信号翻译后的语音信号的权值大于所述第二子语音信号的权值。In the weighted summation process, the weight of the voice signal after the translation of the first sub-speech signal and the weight of the second sub-speech signal may be set as needed, and the comparison in the embodiment of the present invention is not performed. By definition, the weight can be understood as the proportion of a plurality of speech signals occupying the total speech signal in the weighted summation process. However, in order to prevent the original voice from affecting the playback of the translated voice signal, in the embodiment of the present invention, the weight of the voice signal after the first sub-voice signal is translated and the weight of the second sub-voice signal During the setting of the value, the translated language of the first sub-speech signal needs to be translated The weight of the tone signal is greater than the weight of the second sub-speech signal.

例如，假设A是第二子语音信号的原始语音，B是第一子语音信号翻译后的语音，令A的权值为第二为10％，B的权值为90％，则合成后的语音为：10％*A+90％*B，即可得到A和B混音效果，由于B的权重大于A，则可将A认为背景音，B为用户主要听到的声音。需要说明的是，本发明实施例包含但不限于上述加权方式的语音合成。For example, suppose A is the original speech of the second sub-speech signal, and B is the translated speech of the first sub-speech signal, so that the weight of A is 10% for the second, and the weight of B is 90%, then the synthesized The voice is: 10%*A+90%*B, and the A and B mix effects can be obtained. Since the weight of B is greater than A, A can be regarded as the background sound, and B is the sound that the user mainly hears. It should be noted that the embodiments of the present invention include, but are not limited to, speech synthesis of the above weighting method.

进一步的，为了识别出对端支持的语种，在将所述至少一个子语音段信号翻译成符合目标用户语种的语音信号之前，所述方法还包括：Further, in order to identify the language supported by the peer end, before the at least one sub-segment segment signal is translated into a voice signal conforming to the target user language, the method further includes:

接收所述目标终端发送的指示信息，其中，所述指示信息用于：指示所述目标终端所支持的语种。And receiving the indication information sent by the target terminal, where the indication information is used to: indicate a language supported by the target terminal.

例如，通话双方分别为中国人和美国人，即通话双方所使用的语种为中文和英文，通过通话双方最初相互发送的指示消息(中国人说“中文”，美国人说“英文”)，确定通话双方作用的语种为中文和英文。这样，在后续的处理过程中，如果输入语音信号为中文的语音信号，则源语种为中文，目标语种为英文；反之，如果输入语音信号为英文的语音信号，则源语种为英文，目标语种为中文。For example, the two sides of the call are Chinese and American respectively, that is, the language used by both parties is Chinese and English. The instructions are sent by the two parties (the Chinese say "Chinese" and the Americans say "English"). The languages spoken by both parties are Chinese and English. Thus, in the subsequent processing, if the input speech signal is a Chinese speech signal, the source language is Chinese and the target language is English; conversely, if the input speech signal is an English speech signal, the source language is English, and the target language is English. Chinese.

进一步的，在获取用户输入的音频信号之前，所述方法还包括：Further, before acquiring the audio signal input by the user, the method further includes:

和所述目标终端建立语音通话；Establishing a voice call with the target terminal;

接收所述目标终端发送的翻译请求，所述翻译请求用于请求向所述目标终端发送翻译后的语音信号。Receiving a translation request sent by the target terminal, the translation request being used to request that the translated voice signal be sent to the target terminal.

其中，本发明实例所述语音通话可以包括正常的通过数据网络进行的通话，也可以包括通过APP或语音聊天软件等进行的语音通话。The voice call in the example of the present invention may include a normal call through the data network, and may also include a voice call through an APP or voice chat software.

由于在双方通话过程中，发送端和接收端为相对概念，通常根据通话双发正在通话的情况而定，将说话方确定为发送端，将收听方确定为接收端，因此，在某一时刻，上述正在发送语音信号的终端也可以作为接收端。相应的，当上述终端作为接收端时，所述终端还可以执行如图4所示的几个方法步骤，以实现对接收到的语音信号进行翻译播放：Since the sender and the receiver are relative concepts during the conversation between the two parties, usually according to the situation that the call is being sent, the speaker is determined to be the sender, and the listener is determined to be the receiver. Therefore, at a certain moment The terminal that is transmitting the voice signal described above may also serve as the receiving end. Correspondingly, when the terminal is used as the receiving end, the terminal The terminal can also perform several method steps as shown in FIG. 4 to implement translation and playback of the received voice signal:

步骤201：接收源终端发送的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号。Step 201: Receive an audio signal sent by the source terminal, where the audio signal includes a voice segment signal, and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold.

步骤202：对所述音频信号中的语音段信号进行语义分析；Step 202: Perform semantic analysis on the voice segment signal in the audio signal.

若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点。If a feature point exists in the voice segment signal, the voice segment signal is segmented into at least one sub-segment segment signal by using the feature point as a segmentation point; and the feature point is a voice signal without complete semantics Time point.

步骤203：将所述至少一个子语音段信号翻译成预设语种的语音信号，播放翻译后的语音信号。Step 203: Translate the at least one sub-speech segment signal into a voice signal of a preset language, and play the translated voice signal.

其中，预设语种为本端用户所支持的语种，在此不再进行限定。步骤202和步骤102的具体实现过程相同，步骤203中将子语音信号翻译成预设语种的语音信号与步骤103中将子语音信号翻译成符合目标语种的具体实现过程相同，在此不再一一详细赘述。The default language is the language supported by the local user, and is not limited herein. The specific implementation process of step 202 and step 102 is the same. In step 203, the translation of the sub-speech signal into the speech signal of the preset language is the same as the specific implementation process of translating the sub-speech signal into the target language in step 103. A detailed description.

可选的，步骤204播放翻译后的语音信号具体是指：Optionally, the step 204 playing the translated voice signal specifically refers to:

将翻译后的语音信号通过终端自身的音频处理模块播放给本端用户收听。The translated voice signal is played to the local user through the audio processing module of the terminal itself.

播放第一子语音信号；Playing the first sub-speech signal;

在播放第一子语音信号后，合成所述第一子语音信号翻译后的语音信号和第二子语音信号；After playing the first sub-speech signal, synthesizing the translated sub-speech signal and the second sub-speech signal;

播放合成后的语音信号。Play the synthesized speech signal.

具体的，可以通过上述加权求和的方法合成语音信号，在此不再详细赘述。Specifically, the voice signal can be synthesized by the above method of weighted summation, and details are not described herein again.

进一步的，在步骤202之前，所述方法还可以包括：Further, before step 202, the method may further include:

和所述源终端建立语音通话；Establishing a voice call with the source terminal;

在所述终端的用户界面上显示提示信息；所述提示信息用于提示用户是否启动翻译功能；Displaying prompt information on a user interface of the terminal; the prompt information is used to raise Indicate whether the user initiates the translation function;

接收所述用户发送的确认信息，启动翻译功能。Receiving the confirmation information sent by the user, and starting the translation function.

由上可知，本发明实施例提供一种翻译方法，应用于正在进行语音通话的终端，包括：获取用户发出的音频信号；所述音频信号包含语音段信号；对所述音频信号中的语音段信号进行语义分析，若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；将至少一个子语音段信号翻译成符合目标用户语种的语音，将翻译后的语音发送给目标终端。如此，基于语义分析把经VAD端点检测的语句中不具有完整语义的语音剔除，切分为更短的且具有完整语义的语句，完整的表达了说话者的语句含义，避免出现断句或半句的情况，有效地提高了通话中即时翻译的准确性。As can be seen from the above, an embodiment of the present invention provides a translation method, which is applied to a terminal that is engaged in a voice call, including: acquiring an audio signal sent by a user; the audio signal includes a voice segment signal; and a voice segment in the audio signal The signal is semantically analyzed. If there is a feature point in the voice segment signal, the segment is segmented into at least one sub-segment segment signal by using the feature point as a segmentation point; and at least one sub-segment segment signal is translated into The voice conforming to the target user's language, and the translated voice is sent to the target terminal. In this way, based on semantic analysis, the speech without the complete semantics in the statement detected by the VAD endpoint is culled into shorter sentences with complete semantics, which fully express the speaker's sentence meaning and avoid the occurrence of sentence or sentence. The situation effectively improves the accuracy of instant translation during a call.

需要说明的是，上述过程可以由图1所示终端中的各单元执行，具体不再赘述。此外，本发明图1所示终端中的音频处理模块可以为终端的输入设备或发送器；语音端点检测模块、语音识别模块、翻译模块、语音合成模块可以为单独设立的处理器，也可以集成在终端的某一个处理器中实现，此外，也可以以程序代码的形式存储于终端的存储器中，由终端的某一个处理器调用并执行以上翻译功能。这里所述的处理器可以是一个中央处理器(Central Processing Unit，CPU)，或者是特定集成电路(Application Specific Integrated Circuit，ASIC)，或者是被配置成实施本发明实施例的一个或多个集成电路。具体的，如实施例二所述，本发明还提供了一种终端，优选地用于实现上述方法实施例中的方法。It should be noted that the foregoing process may be performed by each unit in the terminal shown in FIG. 1, and details are not described herein. In addition, the audio processing module in the terminal shown in FIG. 1 of the present invention may be an input device or a transmitter of the terminal; the voice endpoint detection module, the voice recognition module, the translation module, and the voice synthesis module may be separate processors or integrated. It is implemented in a certain processor of the terminal, and may also be stored in the memory of the terminal in the form of program code, and is called by one of the processors of the terminal and performs the above translation function. The processor described herein may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more integrated systems configured to implement embodiments of the present invention. Circuit. Specifically, as described in the second embodiment, the present invention further provides a terminal, which is preferably used to implement the method in the foregoing method embodiment.

实施例二Embodiment 2

图5为本发明实施例提供的一种终端20的结构图，本发明实施例提供的终端20可以用于实施上述方法实施例所示的方法，为了便于说明，仅示出了与本发明实施例相关的部分，具体技术细节未揭示的，请参照上述方法实施例中的描述。FIG. 5 is a structural diagram of a terminal 20 according to an embodiment of the present invention. The terminal 20 provided by the embodiment of the present invention may be used to implement the method shown in the foregoing method embodiment. For ease of description, only the implementation of the present invention is shown. For the relevant parts of the examples, the specific technical details are not disclosed, please refer to the description in the above method embodiments.

该终端可以为手机、平板电脑、笔记本电脑、UMPC(Ultra-mobile Personal Computer，超级移动个人计算机)、上网本、PDA(Personal Digital Assistant，个人数字助理)等即时通话工具，本发明实施例以终端为手机为例进行说明，图5示出的是与本发明各实施例相关的手机20的部分结构的框图。The terminal can be a mobile phone, a tablet, a laptop, a UMPC (Ultra-mobile) An instant call tool such as a personal computer, a netbook, a PDA (Personal Digital Assistant), and the like. The embodiment of the present invention uses a terminal as a mobile phone as an example, and FIG. 5 shows the implementation of the present invention. A block diagram of a portion of the structure of the associated handset 20.

如图5所示，手机20包括：输入设备201、存储器202、处理器203、发送器204、输出设备205、接收器206等部件。本领域技术人员可以理解，图5中示出的手机结构并不构成对手机的限定，可以包括比图示更多的部件，或者组合某些部件，或者不同的部件布置。As shown in FIG. 5, the mobile phone 20 includes components such as an input device 201, a memory 202, a processor 203, a transmitter 204, an output device 205, a receiver 206, and the like. It will be understood by those skilled in the art that the structure of the handset shown in FIG. 5 does not constitute a limitation to the handset, and may include more components than those illustrated, or some components may be combined, or different component arrangements.

下面结合图5对手机20的各个构成部件进行具体的介绍：The components of the mobile phone 20 will be specifically described below with reference to FIG. 5:

输入设备201，可以包括触摸屏，也可以包括音频电路中的麦克风，用于实现手机20的输入功能。可收集用户在其上或附近发出的语音信号，并根据预先设定的程式驱动相应的连接装置，将收集的声音信号转换为电信号，由音频电路接收后转换为音频信号，再将音频信号发送给另一手机，或者将音频信号输出至存储器202以便进一步处理。The input device 201 may include a touch screen, and may also include a microphone in the audio circuit for implementing the input function of the mobile phone 20. The voice signal sent by the user on or near the user may be collected, and the corresponding connection device is driven according to a preset program to convert the collected sound signal into an electrical signal, which is received by the audio circuit, converted into an audio signal, and then the audio signal is transmitted. Send to another handset or output an audio signal to memory 202 for further processing.

存储器202可用于存储数据、软件程序以及模块；主要包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等；存储数据区可存储根据手机20的使用所创建的数据(比如音频数据、图像数据、电话本等)等。此外，存储器202可以包括高速随机存取存储器，还可以包括非易失性存储器，例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。The memory 202 can be used to store data, software programs, and modules; and mainly includes a storage program area and a storage data area, wherein the storage program area can store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.) The storage data area can store data (such as audio data, image data, phone book, etc.) created according to the use of the mobile phone 20. Moreover, memory 202 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

处理器203是手机20的控制中心，利用各种接口和线路连接整个手机的各个部分，通过运行或执行存储在存储器202内的软件程序和/或模块，以及调用存储在存储器202内的数据，执行手机20的各种功能和处理数据，从而对手机进行整体监控。可选的，处理器203可包括一个或多个处理单元；优选的，处理器203可集成应用处理器和调制解调处理器，其中，应用处理器主要处理操作系统、用户界面和应用程序等，调制解调处理器主要处理无线通信。可以理解的是，上述调制解调处理器也可以不集成到处理器203中。The processor 203 is the control center of the handset 20, which connects various portions of the entire handset using various interfaces and lines, by running or executing software programs and/or modules stored in the memory 202, and recalling data stored in the memory 202, The various functions and processing data of the mobile phone 20 are performed to perform overall monitoring of the mobile phone. Optionally, the processor 203 may include one or more processing units; preferably, the processor 203 may integrate an application processor and a modem processor, where the application processor mainly processes an operating system, The user interface and applications, etc., the modem processor primarily handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 203.

发送器204，可以包括射频电路(Radio Frequency，RF)，可用于通话过程中语音信号的发送，特别地，将处理器203处理后的语音信号通过无线通道发送至另一手机；通常，发送器204包括但不限于天线、至少一个放大器、收发信机、耦合器、LNA(low noise amplifier，低噪声放大器)、双工器等。The transmitter 204 may include a radio frequency circuit (RF), which may be used for sending a voice signal during a call, in particular, transmitting a voice signal processed by the processor 203 to another mobile phone through a wireless channel; usually, the transmitter 204 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, an LNA (low noise amplifier), a duplexer, and the like.

接收器205，可以包括RF电路，所述RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、LNA(low noise amplifier，低噪声放大器)、双工器等，可通过无线通信与网络和其他设备通信，接收其他设备发送的语音信号；通常情况下，由于天线具有互易功能，通常情况下，可以将上述发送器204和接收器205集成在一起，作为收发器。The receiver 205 may include an RF circuit including, but not limited to, an antenna, at least one amplifier, a transceiver, a coupler, an LNA (low noise amplifier), a duplexer, etc., which can communicate wirelessly. Communicate with the network and other devices to receive voice signals sent by other devices; usually, since the antenna has a reciprocal function, the above-mentioned transmitter 204 and receiver 205 can be integrated as a transceiver.

输出设备206，可以包括音频电路中的扬声器，也可以包括触摸屏，可提供用户与手机20之间的音频接口，可将接收到的音频信号转换后的电信号，传输到扬声器，由扬声器转换为声音信号播放给本端用户。The output device 206 may include a speaker in the audio circuit, and may also include a touch screen, and may provide an audio interface between the user and the mobile phone 20, and transmit the converted electrical signal to the speaker, and convert the speaker into The sound signal is played to the local user.

尽管未示出，手机20还可以包括：WiFi(wireless fidelity，无线保真)模块、蓝牙模块、各个部件供电的电源(比如电池)等，在此不再赘述。Although not shown, the mobile phone 20 may further include: a wireless (wireless fidelity) module, a Bluetooth module, a power source (such as a battery) powered by various components, and the like, and details are not described herein.

在本发明实施例中，若手机20当前时刻正在将本端语音发送至目标终端(即对端终端)，则输入设备201，还可以用于获取用户输入的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号。In the embodiment of the present invention, if the mobile phone 20 is currently transmitting the local voice to the target terminal (ie, the peer terminal), the input device 201 may also be configured to acquire an audio signal input by the user; the audio signal includes a voice. a segment signal; the speech segment signal is a segment of the speech signal whose power value is greater than a preset threshold.

处理器203，还可以用于对所述输入设备201获取到的音频信号中的语音段信号进行语义分析，若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为：不具有完整语义的语音信号所处的时间点；The processor 203 is further configured to perform semantic analysis on the voice segment signal in the audio signal acquired by the input device 201. If a feature point exists in the voice segment signal, the feature point is used as a segmentation point, and Segmenting the speech segment signal into at least one sub-language a segment signal; the feature point is: a time point at which the speech signal without complete semantics is located;

以及，将所述至少一个子语音段信号翻译成符合目标用户语种的语音；And translating the at least one sub-segment segment signal into a speech that conforms to the target user language;

发送器204，还可以用于将翻译后的语音信号发送给目标终端。The transmitter 204 can also be configured to send the translated voice signal to the target terminal.

在本发明实施例中，为了实现语义分析，可以预先将实际当中常用的一些不具有完整语义的词语或词的特征值作为语义特征值存储在存储器202语义数据库中，然后，所述处理器203具体用于：In the embodiment of the present invention, in order to implement semantic analysis, some feature values of words or words that are not commonly used in practice may be stored in the semantic database of the memory 202 as semantic feature values, and then the processor 203 Specifically used for:

查询所述存储器202中的语义数据库；其中，所述语义数据库包含至少一个语义特征值，所述语义特征值为：不具有完整语义的词语或词的特征值；Querying a semantic database in the memory 202; wherein the semantic database includes at least one semantic feature value, and the semantic feature value is: a feature value of a word or a word that does not have complete semantics;

若所述语音段信号中存在第一语音信号，所述第一语音信号的特征值包含在所述语义数据库中，则确定所述第一语音信号为所述特征点；If the first voice signal exists in the voice segment signal, and the feature value of the first voice signal is included in the semantic database, determining that the first voice signal is the feature point;

若所述语音段信号中所有语音信号的特征值均未包含在所述语义数据库中，则确定所述语音段信号中未包含特征点。If the feature values of all the voice signals in the voice segment signal are not included in the semantic database, it is determined that the feature segment does not include a feature point.

进一步的，为避免目标用户等待翻译的时间过长，本发明通过流水线模式+两路音频叠加的语音合成技术，使原始语音和翻译后的语音合成叠加在一起，原声音量降低作为背景音，翻译后的语音作为主音量，进行播放，具体的，所述处理器203，还用于：Further, in order to prevent the target user from waiting for the translation for too long, the present invention superimposes the original speech and the translated speech synthesis by the pipeline synthesis mode and the two-channel audio superposition speech synthesis technology, and the original sound volume is reduced as the background sound, and the translation is performed. After the voice is played as the master volume, the processor 203 is further configured to:

在发送器204将翻译后的语音信号发送给目标终端合成第一子语音信号翻译后的语音信号和第二子语音信号；Transmitting, by the transmitter 204, the translated voice signal to the target terminal to synthesize the translated voice signal and the second sub-voice signal of the first sub-voice signal;

所述发送器204，具体用于：The transmitter 204 is specifically configured to:

将所述合成的语音信号发送给目标终端。The synthesized speech signal is transmitted to the target terminal.

同理，将第二子语音信号播放给所述目标终端后，可以将所述第二子语音信号翻译后的语音信号和所述第三子语音信号合成后播放给所述目标终端，按照这种方式，直至将子语音信号和翻译后的语音信号完全播放给所述目标终端，如此，实现了边播放边翻译的效果，降低了翻译的等待时延。Similarly, after the second sub-speech signal is played to the target terminal, the second sub-voice signal translated speech signal and the third sub-speech signal may be synthesized and played to the target terminal, according to this In a manner, until the sub-speech signal and the translated speech signal are completely played to the target terminal, so that translation is performed while playing. The effect is to reduce the waiting delay of the translation.

可选的，本发明实施例中，所述处理器203具体用于：Optionally, in the embodiment of the present invention, the processor 203 is specifically configured to:

其中，在加权求和过程中，所述第一子语音信号翻译后的语音信号的权值和所述第二子语音信号的权值，可以根据需要进行设定，本发明实施例对比不进行限定，所述权值可以理解为加权求和过程中，多个语音信号在总的语音信号中所占据的比重。但是，为了使原始语音不影响到翻译后的语音信号的播放，本发明实施例中，在对所述第一子语音信号翻译后的语音信号的权值和所述第二子语音信号的权值的设定过程中，需要使所述第一子语音信号翻译后的语音信号的权值大于所述第二子语音信号的权值。In the weighted summation process, the weight of the voice signal after the translation of the first sub-speech signal and the weight of the second sub-speech signal may be set as needed, and the comparison in the embodiment of the present invention is not performed. By definition, the weight can be understood as the proportion of a plurality of speech signals occupying the total speech signal in the weighted summation process. However, in order to prevent the original voice from affecting the playback of the translated voice signal, in the embodiment of the present invention, the weight of the voice signal after the first sub-voice signal is translated and the weight of the second sub-voice signal During the setting of the value, the weight of the voice signal translated by the first sub-speech signal needs to be greater than the weight of the second sub-speech signal.

进一步的，为了识别出对端支持的语种，所述接收器205还可以用于：Further, in order to identify the language supported by the peer, the receiver 205 can also be used to:

在处理器203将所述至少一个子语音段信号翻译成符合目标用户语种的语音信号之前，接收所述目标终端发送的指示信息，其中，所述指示信息用于：指示所述目标终端所支持的语种。Before the processor 203 translates the at least one sub-speech segment signal into a voice signal conforming to the target user language, receiving the indication information sent by the target terminal, where the indication information is used to: indicate that the target terminal supports Language.

进一步的，所述接收器205，还可以用于：Further, the receiver 205 can also be used to:

在输入设备201获取用户输入的音频信号之前，所述终端与所述目标终端建立语音通话之后，接收所述目标终端发送的翻译请求，所述翻译请求用于请求向所述目标终端发送翻译后的语音信号。Before the input device 201 acquires the audio signal input by the user, after the terminal establishes a voice call with the target terminal, receiving a translation request sent by the target terminal, where the translation request is used to request to send a translation to the target terminal. Voice signal.

由于在双方通话过程中，发送端和接收端为相对概念，通常根据通话双发正在通话的情况而定，将说话方确定为发送端，将收听方确定为接收端，因此，在某一时刻，上述正在发送语音信号的手机20也可以作为接收端。相应的，当上述手机20作为接收端时，所述手机20中的接收器205，还可以用于：Since the sender and the receiver are relative concepts during the conversation between the two parties, usually according to the situation that the call is being sent, the speaker is determined to be the sender, and the listener is determined to be the receiver. Therefore, at a certain moment The mobile phone 20 that is transmitting the voice signal described above can also serve as the receiving end. Correspondingly, when the mobile phone 20 is used as the receiving end, the receiver 205 in the mobile phone 20 can also be used to:

接收源终端发送的音频信号；所述音频信号包含语音段信号；所述语音段信号为功率值大于预设门限值的一段语音信号。Receiving an audio signal sent by the source terminal; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold.

处理器203，还可以用于对所述接收器205接收到的音频信号中的语音段信号进行语义分析；The processor 203 is further configured to use the audio signal received by the receiver 205. The speech segment signal in the semantic analysis;

若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；所述特征点为不具有完整语义的语音信号所处的时间点；If a feature point exists in the voice segment signal, the voice segment signal is segmented into at least one sub-segment segment signal by using the feature point as a segmentation point; and the feature point is a voice signal having no complete semantics Time point

将所述至少一个子语音段信号翻译成预设语种的语音信号。Translating the at least one sub-segment segment signal into a speech signal of a preset language.

输出设备206，还可以用于播放所述处理器203翻译后的语音信号。The output device 206 can also be used to play the translated voice signal of the processor 203.

其中，处理器203执行翻译功能的详细步骤如前所述，在此不再详细赘述。The detailed steps of the processor 203 performing the translation function are as described above, and will not be described in detail herein.

同理，为了实现同声翻译以及提高翻译效率的目的，所述处理器203，还用于：Similarly, in order to achieve simultaneous translation and improve translation efficiency, the processor 203 is further configured to:

在所述输出设备203播放所述处理器203翻译后的语音信号之前，合成第一子语音信号翻译后的语音信号和第二子语音信号；Before the output device 203 plays the translated speech signal of the processor 203, synthesizing the translated speech signal and the second sub-speech signal of the first sub-speech signal;

所述输出设备206，具体可以用于：The output device 206 can be specifically configured to:

播放所述第一子语音信号；Playing the first sub-speech signal;

播放所述合成后的语音信号。Playing the synthesized speech signal.

进一步的，所述输出设备206，还可以用于：Further, the output device 206 can also be used to:

在所述终端和所述源终端建立语音通话后，在所述终端的用户界面上显示提示信息；所述提示信息用于提示用户是否启动翻译功能；After the terminal establishes a voice call with the source terminal, displaying prompt information on the user interface of the terminal; the prompt information is used to prompt the user whether to initiate a translation function;

所述输入设备201，还可以用于接收所述用户发送的确认信息，The input device 201 may be further configured to receive the confirmation information sent by the user.

所述处理器203，还可以用于启动翻译功能。The processor 203 can also be used to initiate a translation function.

由上可知，本发明实施例提供一种终端，获取用户发出的音频信号；所述音频信号包含语音段信号；对所述音频信号中的语音段信号进行语义分析，若所述语音段信号中存在特征点，则以所述特征点为分割点，将所述语音段信号分割为至少一个子语音段信号；将至少一个子语音段信号翻译成符合目标用户语种的语音，将翻译后的语音发送给目标终端。如此，基于语义分析把经VAD端点检测的语句中不具有完整语义的语音剔除，切分为更短的且具有完整语义的语句，完整的表达了说话者的语句含义，避免出现断句或半句的情况，有效地提高了通话中即时翻译的准确性。As can be seen from the above, an embodiment of the present invention provides a terminal that acquires an audio signal sent by a user; the audio signal includes a voice segment signal; and performs semantic analysis on a voice segment signal in the audio signal, if the voice segment signal is in the voice segment signal There is a feature point, wherein the feature point is used as a segmentation point, and the voice segment signal is segmented into at least one sub-segment segment signal; and at least one sub-segment segment signal is translated into a voice conforming to a target user language, and the translation is performed. The subsequent voice is sent to the target terminal. In this way, based on semantic analysis, the speech without the complete semantics in the statement detected by the VAD endpoint is culled into shorter sentences with complete semantics, which fully express the speaker's sentence meaning and avoid the occurrence of sentence or sentence. The situation effectively improves the accuracy of instant translation during a call.

以上所述，仅为本发明的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应所述以权利要求的保护范围为准。 The above is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily think of changes or substitutions within the technical scope of the present invention. It should be covered by the scope of the present invention. Therefore, the scope of the invention should be determined by the scope of the claims.

Claims

A translation method is applied to a terminal that is engaged in a voice call, and includes:

Acquiring an audio signal input by the user; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold;

Performing semantic analysis on the speech segment signal in the audio signal;

If a feature point exists in the voice segment signal, the voice segment signal is segmented into at least one sub-segment segment signal by using the feature point as a segmentation point; and the feature point is a voice signal having no complete semantics Time point

Translating the at least one sub-segment segment signal into a speech signal conforming to the target user language, and transmitting the translated speech signal to the target terminal.

The method according to claim 1, wherein the performing semantic analysis on the voice segment signal in the audio signal comprises:

Querying a semantic database; the semantic database includes at least one semantic feature value, the semantic feature value being: a feature value of a word or a word having no complete semantics;

If the first voice signal exists in the voice segment signal, and the feature value of the first voice signal is included in the semantic database, determining that the first voice signal is the feature point;

If the feature values of all the voice signals in the voice segment signal are not included in the semantic database, it is determined that the feature segment does not include a feature point.

The method according to claim 1 or 2, wherein before the transmitting the translated voice signal to the target terminal, the method further comprises:

Combining the translated speech signal and the second sub-speech signal of the first sub-speech signal;

The sending the translated voice signal to the target terminal specifically includes:

Transmitting the first sub-voice signal to the target terminal;

Transmitting the synthesized speech signal to the target terminal; the second sub-speech signal is: a speech signal after the time of the first sub-speech signal and adjacent to the first sub-speech signal.

The method of claim 3 wherein said synthesizing said The translated voice signal and the second sub-voice signal of a sub-voice signal specifically include:

Performing weighted summation on the voice signal translated by the first sub-speech signal and the second sub-speech signal;

The weight of the translated voice signal of the first sub-speech signal is greater than the weight of the second sub-speech signal.

The method according to any one of claims 1 to 4, wherein before the translating the at least one sub-segment segment signal into a speech signal conforming to a target user language, the method further comprises:

And receiving the indication information sent by the target terminal, where the indication information is used to: indicate a language supported by the target terminal.

The method according to any one of claims 1-5, wherein before the obtaining the audio signal input by the user, the method further comprises:

Establishing a voice call with the target terminal;

Receiving a translation request sent by the target terminal, the translation request being used to request that the translated voice signal be sent to the target terminal.

Receiving an audio signal sent by the source terminal; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold;

Performing semantic analysis on the speech segment signal in the audio signal;

Translating the at least one sub-segment segment signal into a speech signal of a preset language;

Play the translated voice signal.

The method according to claim 7, wherein the performing semantic analysis on the voice segment signal in the audio signal comprises:

The method according to claim 7 or 8, wherein before the playing the translated voice signal, the method further comprises:

The playing the translated voice signal specifically includes:

Playing the first sub-speech signal;

Playing the synthesized speech signal; the second sub-speech signal is: a speech signal after the time of the first sub-speech signal and adjacent to the first sub-speech signal.

The method according to claim 9, wherein the synthesizing the translated speech signal and the second sub-speech signal of the first sub-speech signal comprises:

The method according to any one of claims 7 to 10, further comprising: before receiving the audio signal transmitted by the source terminal, the method further comprising:

Establishing a voice call with the source terminal;

Displaying prompt information on a user interface of the terminal; the prompt information is used to prompt the user whether to initiate a translation function;

Receiving the confirmation information sent by the user, and starting the translation function.

A terminal, comprising:

An audio processing module, configured to acquire an audio signal input by a user; the audio signal includes a voice segment signal; and the voice segment signal is a segment of a voice signal whose power value is greater than a preset threshold;

a voice endpoint detection module, configured to acquire an audio signal from the audio processing module The speech segment signal is semantically analyzed; if there is a feature point in the speech segment signal, the speech segment signal is segmented into at least one sub-speech segment signal by using the feature point as a segmentation point; The point in time at which the speech signal without complete semantics is located;

a translation module, configured to translate at least one sub-segment segment signal detected by the speech endpoint detection module into a speech signal conforming to a target user language;

a voice synthesis module, configured to send the translated voice signal of the translation module to the target terminal.

The terminal according to claim 12, wherein the terminal further comprises: a voice recognition module, wherein the voice endpoint detection module is specifically configured to:

The terminal according to claim 12 or 13, wherein the speech synthesis module is specifically configured to:

Transmitting the first sub-voice signal to the target terminal;

Synthesizing the translated speech signal and the second sub-speech signal of the first sub-speech signal;

The terminal according to claim 14, wherein the speech synthesis module is specifically configured to:

The terminal according to any one of claims 12-15, wherein the audio processing module is further configured to:

Before the voice synthesizing module translates the at least one sub-speech segment signal into a voice signal that conforms to the target user language, receiving the indication information sent by the target terminal, where the indication information is used to: indicate that the target terminal supports Language.

The terminal according to any one of claims 12 to 16, wherein the audio processing module is further configured to:

Before acquiring the audio signal input by the user, receiving a translation request sent by the target terminal, the translation request is used to request to send the translated voice signal to the target terminal.

A terminal, comprising:

An audio processing module, configured to receive an audio signal sent by the source terminal; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold;

a voice endpoint detection module, configured to perform semantic analysis on a voice segment signal in an audio signal acquired by the audio processing module; if a feature point exists in the voice segment signal, the feature point is used as a segmentation point, and the The speech segment signal is segmented into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

a translation module, configured to translate at least one sub-segment segment signal detected by the speech endpoint detection module into a speech signal of a preset language;

a speech synthesis module, configured to play the translated speech signal of the translation module.

The terminal according to claim 18, wherein the terminal further comprises: a voice recognition module, wherein the voice endpoint detection module is specifically configured to:

If the feature values of all the voice signals in the voice segment signal are not included in the language In the semantic database, it is determined that the feature segment does not include a feature point.

The terminal according to claim 18 or 19, wherein the speech synthesis module is specifically configured to:

Playing the first sub-speech signal;

The terminal according to claim 20, wherein the speech synthesis module is specifically configured to:

The terminal according to any one of claims 18 to 21, wherein before receiving the audio signal transmitted by the source terminal, the audio processing module is further configured to:

Before receiving the audio signal sent by the source terminal, the terminal and the source terminal establish a voice call, receive the confirmation information sent by the user, and start a translation function.

A terminal, comprising:

An input device, configured to acquire an audio signal input by a user; the audio signal includes a voice segment signal; and the voice segment signal is a segment of a voice signal whose power value is greater than a preset threshold;

a processor, configured to perform semantic analysis on a voice segment signal in the audio signal acquired by the input device; if a feature point exists in the voice segment signal, the feature segment is used as a segmentation point, and the voice segment is Separating the signal into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

Translating the at least one sub-segment segment signal into a speech signal conforming to a target user language;

And a transmitter, configured to send the translated voice signal of the processor to the target terminal.

The terminal according to claim 23, wherein said treatment tool Body for:

The terminal according to claim 23 or 24, wherein the transmitter is further configured to:

Before the processor sends the translated voice signal to the target terminal, synthesizing the first sub-voice signal translated speech signal and the second sub-speech signal;

The transmitter is specifically configured to:

Transmitting the first sub-voice signal to the target terminal;

The terminal according to claim 25, wherein the processor is specifically configured to:

The terminal according to any one of claims 23 to 26, wherein the terminal further comprises a receiver,

The receiver is configured to receive, after the transmitter translates the at least one sub-speech segment signal into a voice signal conforming to a target user language, the indication information sent by the target terminal, where the indication information is used to: indicate The language supported by the target terminal.

The terminal according to claim 27, wherein said receiver, Also used for:

Before the input device acquires the audio signal input by the user, the terminal and the target terminal establish a voice call, and receive a translation request sent by the target terminal, where the translation request is used to request to send a translation to the target terminal. After the voice signal.

A terminal, comprising:

a receiver, configured to receive an audio signal sent by the source terminal; the audio signal includes a voice segment signal; and the voice segment signal is a segment of the voice signal whose power value is greater than a preset threshold;

a processor, configured to perform semantic analysis on a voice segment signal in the audio signal acquired by the receiver; if a feature point exists in the voice segment signal, use the feature point as a segmentation point, and the voice segment is Separating the signal into at least one sub-segment segment signal; the feature point is a time point at which the speech signal without complete semantics is located;

And an output device, configured to play the translated voice signal of the processor.

The terminal according to claim 29, wherein the processor is specifically configured to:

The terminal according to claim 29 or 30, wherein the processor is further configured to:

Before the output device plays the translated speech signal of the processor, synthesizing the translated speech signal and the second sub-speech signal of the first sub-speech signal;

The output device is specifically configured to:

Playing the first sub-speech signal;

The terminal according to claim 31, wherein the processor is specifically configured to:

The terminal according to any one of claims 29 to 32, wherein the terminal further comprises an output device and an input device;

The output device is configured to display prompt information on a user interface of the terminal after the terminal establishes a voice call with the source terminal; the prompt information is used to prompt the user whether to initiate a translation function;

The input device is further configured to receive confirmation information sent by the user,

The processor is further configured to initiate a translation function.