CN104810021A

CN104810021A - Pre-processing method and device applied to far-field recognition

Info

Publication number: CN104810021A
Application number: CN201510236032.6A
Authority: CN
Inventors: 魏建强; 崔玮玮; 宋辉; 王昕�; 姜俊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2015-05-11
Filing date: 2015-05-11
Publication date: 2015-07-29
Anticipated expiration: 2035-05-11
Also published as: CN104810021B

Abstract

The invention provides a pre-processing method and device applied to far-field recognition. The method comprises: performing fixed beamforming on the sound signals to be processed to obtain beam signals; performing acoustic echo cancellation and optimal beam selection on the beam signals obtained by fixed beamforming; and obtaining a pre-processed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection. The method can improve the pre-processing effect and, optionally, can reduce the amount of computation when the number of sound signals is large.

Description

Pre-processing method and device applied to far-field recognition

Technical field

The invention relates to the technical field of data processing, and in particular to a pre-processing method and device applied to far-field recognition.

Background

Far-field recognition, that is, long-distance recognition, typically handles speech recognition requests in scenes where the speaker is more than 2 meters from the speech device. To obtain stable and reliable far-field recognition performance, pre-processing (far-field sound pickup) for far-field recognition scenes is particularly urgent and important.

In the prior art, the far-field sound pickup pipeline comprises, in series: acoustic echo cancellation (AEC), sound source localization, adaptive beamforming (ABF), single-microphone enhancement, and post-processing.

However, the prior art requires a sound source localization module. The accuracy of this module is itself unsatisfactory, and because it is connected in series with the subsequent ABF, its errors also degrade ABF performance and hence the pre-processing effect. In addition, because AEC is performed first, the amount of computation is large when the number of sound signals to be processed is large.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, one object of the present invention is to propose a pre-processing method applied to far-field recognition, which can improve the pre-processing effect and, optionally, can reduce the amount of computation when the number of sound signals is large.

Another object of the present invention is to propose a pre-processing device applied to far-field recognition.

To achieve the above objects, the pre-processing method applied to far-field recognition proposed by the embodiment of the first aspect of the present invention comprises: performing fixed beamforming on the sound signals to be processed to obtain beam signals; performing acoustic echo cancellation and optimal beam selection on the beam signals obtained by fixed beamforming; and obtaining a pre-processed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

The method proposed by the embodiment of the first aspect does not require a sound source localization module, so it avoids the poor pre-processing results caused by inaccurate sound source localization and thereby improves the pre-processing effect. Optionally, fixed beamforming (FBF) is performed first and AEC afterwards; because the number of beams after FBF is usually smaller than the number of sound signals to be processed, the amount of computation can be reduced.

To achieve the above objects, the pre-processing device applied to far-field recognition proposed by the embodiment of the second aspect of the present invention comprises: a fixed beamforming module, configured to perform fixed beamforming on the sound signals to be processed to obtain beam signals; a processing module, configured to perform acoustic echo cancellation and optimal beam selection on the beam signals obtained by fixed beamforming; and an acquisition module, configured to obtain a pre-processed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

The device proposed by the embodiment of the second aspect does not require a sound source localization module, so it avoids the poor pre-processing results caused by inaccurate sound source localization and thereby improves the pre-processing effect. Optionally, FBF is performed first and AEC afterwards; because the number of beams after FBF is usually smaller than the number of sound signals to be processed, the amount of computation can be reduced.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Description of drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, wherein:

Fig. 1 is a schematic flowchart of a pre-processing method applied to far-field recognition proposed by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a pre-processing method applied to far-field recognition proposed by another embodiment of the present invention;

Fig. 3 is a schematic flowchart of a pre-processing method applied to far-field recognition proposed by another embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a pre-processing device applied to far-field recognition proposed by another embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a pre-processing device applied to far-field recognition proposed by another embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a pre-processing device applied to far-field recognition proposed by another embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals throughout denote the same or similar modules or modules having the same or similar functions. The embodiments described below with reference to the figures are exemplary, serve only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents coming within the spirit and scope of the appended claims.

Fig. 1 is a schematic flowchart of a pre-processing method applied to far-field recognition proposed by an embodiment of the present invention. The method comprises:

S11: Perform fixed beamforming on the sound signals to be processed to obtain beam signals.

The sound signal to be processed may be a microphone signal, that is, the signal picked up by a microphone, which includes the near-end voice signal (voice control command), room reverberation, and various environmental noises.

In far-field recognition, a microphone array (of directional or omnidirectional microphones) is usually used to improve recognition performance. The sound signals to be processed may therefore specifically be microphone array signals, which comprise multiple microphone signals.

Beamforming techniques include the ABF used in the prior art as well as fixed beamforming (FBF).

The spatial beam characteristics of ABF change adaptively, whereas those of FBF are fixed. An example of a spatial beam characteristic is the signal gain response in a particular direction.
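As an illustration of what "fixed" means here, the following minimal sketch implements a delay-and-sum beamformer whose per-channel delays are set once and never adapted. Delay-and-sum is a swapped-in textbook technique chosen for illustration; the source does not say which FBF design is used. The function names, the integer-sample delays, and the delay table are all hypothetical.

```python
def delay_and_sum(mic_signals, delays):
    """Fixed beamformer for one beam (illustrative, not the patent's design).

    mic_signals: list of equal-length sample lists, one per microphone.
    delays: fixed integer delay in samples for each microphone.
    """
    n = len(mic_signals[0])
    out = [0.0] * n
    for sig, d in zip(mic_signals, delays):
        # Shift this channel by its fixed delay and accumulate.
        for t in range(d, n):
            out[t] += sig[t - d]
    # Average so that a coherent signal keeps its amplitude.
    return [v / len(mic_signals) for v in out]


def fixed_beamform(mic_signals, delay_table):
    """Produce one output beam per row of the (hypothetical) delay table."""
    return [delay_and_sum(mic_signals, delays) for delays in delay_table]
```

Because the delay table never changes at run time, a downstream adaptive filter sees a time-invariant front end, which is the property the description relies on.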

Optionally, the fixed beamforming uses multiple fixed beams, each of which covers part of the space, and all fixed beams together cover the entire space.

Full coverage of the space by the beams ensures that the user's speech can be detected wherever the user is located, which avoids placing restrictions on the user's position.

When the number of sound signals to be processed (such as microphone array signals) is large, the number of fixed beams used by FBF may be smaller than the number of sound signals in order to reduce the amount of computation.

For example, there are 3 fixed beams, each covering a different 120-degree sector of the space; or there are 6 fixed beams, each covering a different 60-degree sector.
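The sector layout in this example can be sketched as follows. The sketch assumes equal-width sectors in a single horizontal plane with the first sector starting at 0 degrees; neither the alignment nor the function names come from the source.

```python
def beam_centers(num_beams):
    """Center angle in degrees of each equal-width sector (assumed layout)."""
    width = 360.0 / num_beams
    return [i * width + width / 2.0 for i in range(num_beams)]


def covering_beam(angle_deg, num_beams):
    """Index of the fixed beam whose sector contains the given direction."""
    width = 360.0 / num_beams
    # Wrap the angle into [0, 360) so every direction maps to some beam.
    return int((angle_deg % 360.0) // width)
```

With 3 beams a talker at 130 degrees falls in beam 1, and with 6 beams the same talker falls in beam 2; every direction maps to exactly one beam, which is the full-coverage property described above.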

S12: Perform acoustic echo cancellation and optimal beam selection on the beam signals obtained by fixed beamforming.

To eliminate interference signals, a speech recognition interaction system usually includes an acoustic echo cancellation (AEC) module, which is commonly referred to as the BargeIn function module.

Interference signals are, for example, music played by the speech recognition interaction system (hereinafter "the system"), text-to-speech (TTS) signals, and the like.

Besides tracking and learning the acoustic transfer function (ATF) from the system's loudspeaker to the microphone, the AEC module must also learn the time-varying components introduced by every processing module that precedes it. If these components change faster than the adaptive filter in the AEC converges, the AEC module can never learn them properly, and the interference signals played by the system cannot be removed well.

The spatial beam characteristics of ABF vary, and the ABF filter usually changes much faster than the filter of the AEC module, so in the prior art ABF cannot be placed before AEC to raise the signal-to-noise ratio. The effect of AEC depends on the signal-to-noise ratio: the higher the signal-to-noise ratio, the better the result. Being unable to place ABF before AEC therefore limits the AEC effect in the prior art and, in turn, the far-field recognition effect.

In this embodiment FBF is used instead. Because the spatial beam characteristics of FBF are fixed, they are effectively known to the AEC module and do not need to be tracked and learned, so FBF can be placed before AEC. Since FBF raises the signal-to-noise ratio, placing it before AEC improves the AEC result and thereby the far-field recognition effect.

Furthermore, when the microphone array signal contains many signals (for example more than 6), the prior art performs AEC first, so the number of AEC modules required equals the number of microphone signals and is correspondingly large. In this embodiment FBF is performed before AEC, so the number of AEC modules required equals the number of FBF beams. This number, for example 3 or 6, is usually smaller than the number of microphone signals in a large array, so the number of AEC modules required, and hence the amount of computation, can be significantly reduced.
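The saving can be made concrete with a rough count. The sketch below assumes one adaptive filter per AEC instance and counts only the filter multiplies per sample; the cost model, the function name, and the example numbers (8 microphones, 512 taps) are illustrative assumptions, and the cost of FBF itself is ignored.

```python
def aec_cost(num_mics, num_beams, filter_taps, fbf_first):
    """Return (AEC instances, filter multiplies per sample) for one ordering.

    fbf_first=True models FBF before AEC (one AEC per beam);
    fbf_first=False models AEC before beamforming (one AEC per microphone).
    """
    instances = num_beams if fbf_first else num_mics
    return instances, instances * filter_taps
```

For a hypothetical 8-microphone array with 3 fixed beams and 512-tap filters, this gives 8 instances and 4096 multiplies per sample with AEC first, against 3 instances and 1536 multiplies with FBF first.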

Optimal beam selection may follow a preset selection criterion. For example, if the preset criterion is the maximum signal-to-noise ratio criterion, the beam with the largest signal-to-noise ratio is selected as the optimal beam.
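A minimal sketch of the maximum signal-to-noise ratio criterion is given below. It assumes that a signal power and a noise power estimate are already available for each beam; how those estimates are obtained is not specified in the source, and the function names are hypothetical.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from two power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)


def select_best_beam(beam_powers):
    """Index of the beam with the largest SNR.

    beam_powers: list of (signal_power, noise_power) pairs, one per beam.
    Ties are resolved in favour of the lowest index.
    """
    snrs = [snr_db(s, n) for s, n in beam_powers]
    return max(range(len(snrs)), key=snrs.__getitem__)
```

Other criteria could be substituted by replacing `snr_db` with a different per-beam score.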

In practice, AEC may be performed first and optimal beam selection afterwards, or optimal beam selection may be performed first and AEC afterwards.

S13: Obtain the pre-processed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

After acoustic echo cancellation and optimal beam selection, some post-processing may be applied to further improve the result.

Once the pre-processed signal for far-field recognition has been obtained, it can be fed into a recognizer (far-field recognition engine) for recognition.

This embodiment requires no sound source localization, so it effectively avoids the unstable and abnormal overall system behavior caused by localization errors. By selecting the optimal beam signal among fixed spatial beam signals, it removes the constraint that traditional methods place on the position of the near-end speaker, adapts seamlessly to scenarios in which the speaker moves continuously around the room, and significantly improves the overall user experience. With fixed beamforming, the spatial beam characteristics do not change over time, a property that the subsequent AEC module can learn well, so the FBF module can be moved ahead of the AEC module. On the one hand, this yields a reference signal with a higher signal-to-noise ratio, which effectively improves the convergence speed and performance of the subsequent AEC. On the other hand, because the number of spatial beams produced by FBF is usually smaller than the number of microphones, it reduces the number of times the AEC module is run and lowers the overall amount of computation.

Fig. 2 is a schematic flowchart of a pre-processing method applied to far-field recognition proposed by another embodiment of the present invention. The method comprises:

S21: Perform fixed beamforming on the microphone array signals.

To improve AEC performance and reduce the amount of computation, the microphone array (of directional or omnidirectional microphones) is first used to divide the entire space into several spatial beam regions (for example 3 or 6).

Because fixed beamforming (FBF) is used, the beam characteristics do not change over time, a property that the subsequent AEC module can learn well. The FBF module can therefore be moved ahead of the AEC module. On the one hand, FBF processing yields a reference signal with a higher signal-to-noise ratio, which effectively improves the convergence speed and performance of the subsequent AEC. On the other hand, the number of spatial beams formed by the FBF module is usually smaller than the number of microphones, which reduces the number of times the AEC module is run and lowers the overall amount of computation.

S22: Using as many acoustic echo cancellation modules as there are beam signals after fixed beamforming, perform acoustic echo cancellation on each beam signal to obtain multiple echo-cancelled beam signals.

The FBF module outputs beam signals for several directions. These signals are fed into the AEC modules to remove the interference signals they contain, such as music played by the system and TTS. The echo-free signals can significantly improve far-field recognition performance.
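Step S22 can be sketched with one adaptive filter per beam. The example uses a normalized least-mean-squares (NLMS) filter, a common textbook choice for echo cancellation; the source does not name the adaptive algorithm, so NLMS, the parameter values, and the function names are assumptions made for illustration.

```python
def nlms_echo_cancel(beam, reference, taps=4, mu=0.5, eps=1e-8):
    """Remove the echo of `reference` (the signal the system plays) from `beam`.

    Returns the residual signal. NLMS is an illustrative choice, not
    necessarily the algorithm used in the patent.
    """
    w = [0.0] * taps          # adaptive estimate of the echo path
    out = []
    for t in range(len(beam)):
        # Most recent `taps` reference samples, newest first, zero-padded.
        x = [reference[t - k] if t - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = beam[t] - echo_estimate
        norm = sum(xk * xk for xk in x) + eps
        # Normalized gradient step toward the true echo path.
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out


def cancel_all_beams(beams, reference):
    """One independent AEC instance per fixed beam, as in step S22."""
    return [nlms_echo_cancel(b, reference) for b in beams]
```

Each beam gets its own filter state because the echo path seen through each fixed beam differs.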

S23: Perform optimal beam selection among the multiple echo-cancelled beam signals and select the optimal beam signal.

After the two modules above, each spatial beam signal has had environmental interference removed as far as possible, including background noise, room reverberation, and the music and TTS played by the system. In this step the embodiment selects the optimal spatial beam signal from the several spatial beam signals according to a given criterion (such as the maximum signal-to-noise ratio criterion) and uses it as the output of the step. This removes the sound source localization module of the traditional solution, which not only reduces the amount of computation but also avoids error propagation, that is, the unstable and abnormal overall system behavior caused by localization errors. It also removes the traditional requirement that the near-end speaker stay in a relatively fixed position, further improving the user experience. Because this embodiment replaces the sound source localization module with automatic selection of the optimal signal among several fixed beam signals, it adapts seamlessly to scenarios in which the speaker moves continuously around the room.

S24: Perform single-microphone enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and take the resulting signal as the pre-processed signal for far-field recognition.

As in the traditional solution, single-microphone noise reduction techniques are used to further remove residual noise, followed in series by dedicated post-processing such as gain amplification and dynamic range control (DRC), to further improve far-field recognition performance.
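The post-processing stage can be sketched as a gain followed by a very simple static compressor. This is a deliberately simplified, sample-by-sample form of dynamic range control with no attack or release smoothing; the function name and the gain, threshold, and ratio values are illustrative assumptions, not parameters from the source.

```python
def apply_gain_and_drc(samples, gain=2.0, threshold=0.5, ratio=4.0):
    """Amplify, then compress any magnitude above `threshold` by `ratio`."""
    out = []
    for s in samples:
        v = s * gain
        mag = abs(v)
        if mag > threshold:
            # Above the threshold, the excess is reduced by the ratio.
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if v >= 0 else -mag)
    return out
```

Quiet samples receive the full gain while loud samples are held closer to the threshold, which narrows the dynamic range presented to the recognizer.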

In this embodiment, building on the previous one, AEC can be performed first and optimal beam selection afterwards. It is then not necessary to require that no system interference signal be present, so the applicable scenarios are broader.

Fig. 3 is a schematic flowchart of a pre-processing method applied to far-field recognition proposed by another embodiment of the present invention. This method can be applied when no system interference signal is present. The method comprises:

S31: Perform fixed beamforming on the microphone array signals.

For details of the fixed beamforming, refer to the relevant description in the foregoing embodiments, which is not repeated here.

S32: Perform optimal beam selection among the multiple beam signals obtained by fixed beamforming and select one optimal beam signal.

The AEC module contains a dedicated sub-module that detects the talk state. There are roughly three states: near-end speech only, double-talk (near-end and far-end speech), and far-end signal only, where the far-end speech is the music or TTS signal played by the system. When this sub-module detects that only near-end speech is currently present, it can be concluded that no system interference signal exists, so optimal beam selection can be performed first, for example using the maximum signal-to-noise ratio criterion. For the specific manner of optimal beam selection, refer to the relevant description in the foregoing embodiments, which is not repeated here.
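The three talk states and the resulting ordering decision can be sketched as follows. The two activity flags are assumed to come from detectors that the source does not describe, and all names are hypothetical.

```python
from enum import Enum


class TalkState(Enum):
    NEAR_END_ONLY = "near-end only"
    DOUBLE_TALK = "double-talk"
    FAR_END_ONLY = "far-end only"


def classify_talk_state(near_active, far_active):
    """Map two activity flags to one of the three talk states.

    Returns None when neither side is active, which is not one of the
    three states listed in the description.
    """
    if near_active and far_active:
        return TalkState.DOUBLE_TALK
    if near_active:
        return TalkState.NEAR_END_ONLY
    if far_active:
        return TalkState.FAR_END_ONLY
    return None


def selection_before_aec(state):
    """True only when no system interference is present (near-end only)."""
    return state is TalkState.NEAR_END_ONLY
```

In every state other than near-end only, the sketch falls back to running AEC before selection, which matches the broader applicability of that ordering.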

S33: Use a single acoustic echo cancellation module to perform acoustic echo cancellation on the optimal beam signal.

After optimal beam selection only one signal is output, so a single AEC module suffices, which reduces the amount of computation.

S34: Perform single-microphone enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and take the resulting signal as the pre-processed signal for far-field recognition.

In this embodiment, when no system interference signal is present, optimal beam selection can be performed before AEC, which reduces the number of AEC modules and the amount of computation.
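The two orderings (AEC on every beam, then selection; or selection, then a single AEC) can be compared with a small skeleton. The `aec` and `select` stages are passed in as callables because the source leaves their internals open; the function name and return shape are illustrative assumptions.

```python
def preprocess(beams, reference, aec, select, select_first):
    """Run AEC and beam selection in either order.

    aec(beam, reference) -> cleaned beam
    select(list_of_beams) -> index of the chosen beam
    Returns (output beam, number of AEC runs).
    """
    if select_first:
        # Selection first: only the chosen beam goes through AEC.
        best = beams[select(beams)]
        return aec(best, reference), 1
    # AEC first: every beam is cleaned, then the best one is chosen.
    cleaned = [aec(b, reference) for b in beams]
    return cleaned[select(cleaned)], len(beams)
```

The second element of the result makes the trade-off visible: selecting first needs a single AEC run, while cancelling first needs one run per beam but does not depend on the absence of system interference.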

Fig. 4 is a schematic structural diagram of a pre-processing device applied to far-field recognition proposed by another embodiment of the present invention. The device 40 comprises:

A fixed beamforming module 41, configured to perform fixed beamforming on the sound signals to be processed to obtain beam signals;

A processing module 42, configured to perform acoustic echo cancellation and optimal beam selection on the beam signals obtained by fixed beamforming;

For example, referring to Fig. 5, when there are multiple beam signals after fixed beamforming, the processing module 42 comprises:

Acoustic echo cancellation modules 51, equal in number to the beam signals after fixed beamforming and connected to the fixed beamforming module, configured to perform acoustic echo cancellation on each beam signal to obtain multiple echo-cancelled beam signals;

An optimal beam selection module 52, connected to the acoustic echo cancellation modules, configured to perform optimal beam selection among the multiple echo-cancelled beam signals and select the optimal beam signal.

As another example, referring to Fig. 6, when there are multiple beam signals after fixed beamforming and no system interference signal is present, the processing module 42 comprises:

An optimal beam selection module 61, connected to the fixed beamforming module, configured to perform optimal beam selection among the multiple beam signals obtained by fixed beamforming and select one optimal beam signal;

The AEC module contains a dedicated sub-module that detects the talk state. There are roughly three states: near-end speech only, double-talk (near-end and far-end speech), and far-end signal only, where the far-end speech is the music or TTS signal played by the system. When this sub-module detects that only near-end speech is currently present, it can be concluded that no system interference signal exists, so optimal beam selection can be performed first, for example using the maximum signal-to-noise ratio criterion. For the specific manner of optimal beam selection, refer to the relevant description in the foregoing embodiments, which is not repeated here.

A single acoustic echo cancellation module 62, connected to the optimal beam selection module, configured to perform acoustic echo cancellation on the optimal beam signal.

An acquisition module 43, configured to obtain the pre-processed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

Optionally, the acquisition module 43 is specifically configured to:

Perform single-microphone enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and take the resulting signal as the pre-processed signal for far-field recognition.

需要说明的是，在本发明的描述中，术语“第一”、“第二”等仅用于描述目的，而不能理解为指示或暗示相对重要性。此外，在本发明的描述中，除非另有说明，“多个”的含义是指至少两个。It should be noted that, in the description of the present invention, the terms "first", "second" and so on are only used for description purposes, and should not be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, the meaning of "plurality" means at least two.

流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为，表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分，并且本发明的优选实施方式的范围包括另外的实现，其中可以不按所示出或讨论的顺序，包括根据所涉及的功能按基本同时的方式或按相反的顺序，来执行功能，这应被本发明的实施例所属技术领域的技术人员所理解。Any process or method descriptions in flowcharts or otherwise described herein may be understood to represent modules, segments or portions of code comprising one or more executable instructions for implementing specific logical functions or steps of the process , and the scope of preferred embodiments of the invention includes alternative implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, which shall It is understood by those skilled in the art to which the embodiments of the present invention pertain.

应当理解，本发明的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中，多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。例如，如果用硬件来实现，和在另一实施方式中一样，可用本领域公知的下列技术中的任一项或他们的组合来实现：具有用于对数据信号实现逻辑功能的逻辑门电路的离散逻辑电路，具有合适的组合逻辑门电路的专用集成电路，可编程门阵列(PGA)，现场可编程门阵列(FPGA)等。It should be understood that various parts of the present invention can be realized by hardware, software, firmware or their combination. In the embodiments described above, various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or combination of the following techniques known in the art: Discrete logic circuits, ASICs with suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), etc.

Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist separately as a physical unit, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
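As a purely illustrative aid, the two processing orders recited in the claims (one echo canceller per beam followed by selection, or selection followed by a single echo canceller) can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the invention: the energy-based selection rule, the one-tap least-squares echo canceller, and all function names are hypothetical stand-ins, and reading "system interference signal" as the device's own loudspeaker playback is likewise an assumption.

```python
from typing import List, Sequence, Tuple

Signal = List[float]


def energy(x: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(s * s for s in x)


def cancel_echo(beam: Sequence[float], reference: Sequence[float]) -> Signal:
    """Stand-in echo canceller (assumption): one-tap least-squares fit.

    Estimates the single gain g that best maps the playback reference onto
    the beam, then subtracts g * reference. A practical system would use an
    adaptive multi-tap filter instead.
    """
    ref_energy = energy(reference)
    if ref_energy == 0.0:
        return list(beam)  # nothing is played back, so nothing to cancel
    g = sum(b * r for b, r in zip(beam, reference)) / ref_energy
    return [b - g * r for b, r in zip(beam, reference)]


def select_best_beam(beams: Sequence[Sequence[float]]) -> int:
    """Hypothetical selection rule: index of the highest-energy beam."""
    return max(range(len(beams)), key=lambda i: energy(beams[i]))


def aec_then_select(beams: Sequence[Sequence[float]],
                    reference: Sequence[float]) -> Tuple[Signal, int]:
    """One echo canceller per beam, then pick the best cleaned beam.

    Returns the chosen signal and the number of echo-canceller runs.
    """
    cleaned = [cancel_echo(b, reference) for b in beams]
    return cleaned[select_best_beam(cleaned)], len(beams)


def select_then_aec(beams: Sequence[Sequence[float]],
                    reference: Sequence[float]) -> Tuple[Signal, int]:
    """Pick the best beam first, then run a single echo canceller on it."""
    best = beams[select_best_beam(beams)]
    return cancel_echo(best, reference), 1


# Toy data: three beams that contain speech only, and a silent reference.
quiet_beams = [
    [0.1, 0.1, -0.1, -0.1],
    [0.5, 0.5, -0.5, -0.5],
    [0.05, 0.05, -0.05, -0.05],
]
silent_reference = [0.0, 0.0, 0.0, 0.0]

print(aec_then_select(quiet_beams, silent_reference))
print(select_then_aec(quiet_beams, silent_reference))
```

In this toy case both orders return the same beam, but the second runs the echo canceller once instead of once per beam. When playback echo dominates a beam, an energy-based choice made before cancellation can pick the wrong beam, which is one plausible reason to restrict the second order to the case without a system interference signal.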

Claims

1. A pre-processing method applied to far-field recognition, characterized by comprising:

performing fixed beamforming processing on a sound signal to be processed, to obtain a beam signal after the fixed beamforming processing;

performing acoustic echo cancellation and optimal beam selection on the beam signal after the fixed beamforming processing; and

obtaining, according to the beam signal after the acoustic echo cancellation and optimal beam selection, a pre-processed signal applied to far-field recognition.

2. The method according to claim 1, characterized in that the fixed beamforming processing uses a plurality of fixed beams, each fixed beam covers part of the space, and all the fixed beams together cover the entire space.

3. The method according to claim 2, characterized in that the number of fixed beams is 3 and the different fixed beams each cover a different 120-degree region of space; or the number of fixed beams is 6 and the different fixed beams each cover a different 60-degree region of space.

4. The method according to claim 1, characterized in that the fixed beamforming processing uses a plurality of fixed beams, and the number of fixed beams is less than the number of sound signals to be processed.

5. The method according to any one of claims 1-4, characterized in that, when there are a plurality of beam signals after the fixed beamforming processing, performing acoustic echo cancellation and optimal beam selection on the beam signals after the fixed beamforming processing comprises:

using acoustic echo cancellation modules equal in number to the beam signals after the fixed beamforming processing to perform acoustic echo cancellation on each beam signal after the fixed beamforming processing, to obtain a plurality of beam signals after acoustic echo cancellation; and

performing optimal beam selection among the plurality of beam signals after acoustic echo cancellation, to select an optimal beam signal.

6. The method according to any one of claims 1-4, characterized in that, when there are a plurality of beam signals after the fixed beamforming processing and no system interference signal is present, performing acoustic echo cancellation and optimal beam selection on the beam signals after the fixed beamforming processing comprises:

performing optimal beam selection among the plurality of beam signals after the fixed beamforming processing, to select one optimal beam signal; and

using one acoustic echo cancellation module to perform acoustic echo cancellation on the optimal beam signal.

7. The method according to any one of claims 1-4, characterized in that obtaining, according to the beam signal after the acoustic echo cancellation and optimal beam selection, the pre-processed signal applied to far-field recognition comprises:

performing single-microphone enhancement and post-processing on the beam signal after the acoustic echo cancellation and optimal beam selection, and determining the signal after the single-microphone enhancement and post-processing as the pre-processed signal applied to far-field recognition.

8. A pre-processing device applied to far-field recognition, characterized by comprising:

a fixed beamforming module, configured to perform fixed beamforming processing on a sound signal to be processed, to obtain a beam signal after the fixed beamforming processing;

a processing module, configured to perform acoustic echo cancellation and optimal beam selection on the beam signal after the fixed beamforming processing; and

an acquisition module, configured to obtain, according to the beam signal after the acoustic echo cancellation and optimal beam selection, a pre-processed signal applied to far-field recognition.

9. The device according to claim 8, characterized in that, when there are a plurality of beam signals after the fixed beamforming processing, the processing module comprises:

acoustic echo cancellation modules, equal in number to the beam signals after the fixed beamforming processing and connected to the fixed beamforming module, configured to perform acoustic echo cancellation on each beam signal after the fixed beamforming processing, to obtain a plurality of beam signals after acoustic echo cancellation; and

an optimal beam selection module, connected to the acoustic echo cancellation modules, configured to perform optimal beam selection among the plurality of beam signals after acoustic echo cancellation, to select an optimal beam signal.

10. The device according to claim 8, characterized in that, when there are a plurality of beam signals after the fixed beamforming processing and no system interference signal is present, the processing module comprises:

an optimal beam selection module, connected to the fixed beamforming module, configured to perform optimal beam selection among the plurality of beam signals after the fixed beamforming processing, to select one optimal beam signal; and

one acoustic echo cancellation module, connected to the optimal beam selection module, configured to perform acoustic echo cancellation on the optimal beam signal.

11. The device according to any one of claims 8-10, characterized in that the acquisition module is specifically configured to: