CN111816215A

CN111816215A - Voice endpoint detection model training and use method and device

Info

Publication number: CN111816215A
Application number: CN202010725288.4A
Authority: CN
Inventors: 吴梦玥; 陈烨斐; 丁翰林; 俞凯
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-10-23

Abstract

The invention discloses a method and a device for training and using a voice endpoint detection model, wherein the training method comprises the following steps: inputting training audio into a generalized voice endpoint detection model; detecting a plurality of audio events present in the training audio via the generalized speech endpoint detection model, wherein the plurality of audio events include a speak-by-person event, a silence event, and at least one noise event; obtaining the voice and non-voice distinguishing results of the plurality of audio events output by the generalized voice endpoint detection model; computing a loss function based on the audio event labels of the training audio and the output of the generalized speech endpoint detection model; and optimizing the generalized speech endpoint detection model by controlling the loss function. The scheme of the embodiment of the application can distinguish different types in the non-voice part, can improve the accuracy of classification, and is not easy to misjudge noise as the voice part.

Description

Method and device for training and using voice endpoint detection model

技术领域technical field

本发明属于语言模型领域，尤其涉及语音端点检测模型训练和使用方法及装置。The invention belongs to the field of language models, and in particular relates to a method and device for training and using a voice endpoint detection model.

背景技术Background technique

相关技术中，存在各种类型的语音端点检测模型，包括以短时能量，过零率等阈值作为语音和非语音区分标准的语音端点检测模型，以及使用神经网络等区分性模型训练的语音端点检测器。In the related art, there are various types of speech endpoint detection models, including speech endpoint detection models that use thresholds such as short-term energy and zero-crossing rate as the criteria for distinguishing between speech and non-speech, and speech endpoints trained using discriminative models such as neural networks. Detector.

一方面，阈值区分法主要是使用短时能量，过零率等指标作为阈值，从音频中提取声学特征之后，每一帧或每一小段计算这些指标，然后通过是否达到阈值来区分音频中的语音部分和非语音部分。On the one hand, the threshold discrimination method mainly uses short-term energy, zero-crossing rate and other indicators as thresholds. After extracting acoustic features from the audio, these indicators are calculated for each frame or each segment, and then the audio is distinguished by whether the threshold is reached. Voice part and non-voice part.

另一方面，模型区分法主要是使用深度神经网络等区分性模型，以声学特征作为输入，然后通过神经网络隐层的训练，最后输出每一帧是否为语音或者非语音的后验概率，通过比较后验概率的大小来判断语音还是非语音。On the other hand, the model discrimination method mainly uses a discriminative model such as a deep neural network, takes acoustic features as input, and then trains the hidden layer of the neural network, and finally outputs the posterior probability of whether each frame is speech or non-speech. Compare the magnitude of the posterior probability to judge speech or non-speech.

发明人在实现本申请的过程中发现，现有方案至少存在以下缺陷：During the process of realizing the present application, the inventor found that the existing solution has at least the following defects:

相关技术主要的缺陷就是在带有噪声的环境下不能很好的区分语音和非语音。一些噪声自适应的技术方法需要很多的标注好的数据进行训练才能对一些特定的噪声具有鲁棒性，而这些标注好的数据往往是比较难得到的，或者数量没有那么大。The main defect of the related art is that speech and non-speech cannot be distinguished well in a noisy environment. Some noise-adaptive technical methods require a lot of labeled data for training to be robust to some specific noises, and these labeled data are often difficult to obtain, or the amount is not so large.

发明内容SUMMARY OF THE INVENTION

本发明实施例提供一种语音端点检测模型训练和使用方法及装置，用于至少解决上述技术问题之一。Embodiments of the present invention provide a method and device for training and using a voice endpoint detection model, which are used to solve at least one of the above technical problems.

第一方面，本发明实施例提供一种语音端点检测模型训练方法，包括：将训练音频输入至广义上的语音端点检测模型中；经由所述广义上的语音端点检测模型检测所述训练音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果；基于所述训练音频的音频事件标注和所述广义上的语音端点检测模型的输出计算损失函数，其中，所述音频事件标注包括预先对所述训练音频进行音频事件的标注；以及通过控制所述损失函数优化所述广义上的语音端点检测模型。In a first aspect, an embodiment of the present invention provides a method for training a voice endpoint detection model, including: inputting training audio into a voice endpoint detection model in a broad sense; There are multiple audio events, wherein the multiple audio events include a human speaking event, a silence event, and at least one noise event; acquiring the voice and the multiple audio events output by the voice endpoint detection model in the broad sense. Non-speech discrimination result; a loss function is calculated based on the audio event labeling of the training audio and the output of the speech endpoint detection model in a broad sense, wherein the audio event labeling includes pre-labeling the training audio with audio events ; and optimizing the generalized speech endpoint detection model by controlling the loss function.

第二方面，本发明实施例提供一种语音端点检测模型使用方法，包括：经由根据第一方面所述的方法训练后的广义上的语音端点检测模型检测输入音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；以及获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果。In a second aspect, an embodiment of the present invention provides a method for using a voice endpoint detection model, including: detecting multiple audio events existing in the input audio through a voice endpoint detection model in a broad sense trained by the method according to the first aspect, Wherein, the plurality of audio events include a human speaking event, a silence event and at least one noise event; and a distinction result of speech and non-speech of the plurality of audio events output by the speech endpoint detection model in the broad sense is obtained.

第三方面，本发明实施例提供一种语音端点检测模型训练装置，包括：输入模块，配置为将训练音频输入至广义上的语音端点检测模型中；检测模块，配置为经由所述广义上的语音端点检测模型检测所述训练音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；输出模块，配置为获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果；损失计算模块，配置为基于所述训练音频的音频事件标注和所述广义上的语音端点检测模型的输出计算损失函数，其中，所述音频事件标注包括预先对所述训练音频进行音频事件的标注；以及优化模块，配置为通过控制所述损失函数优化所述广义上的语音端点检测模型。In a third aspect, an embodiment of the present invention provides an apparatus for training a voice endpoint detection model, including: an input module configured to input training audio into a voice endpoint detection model in a broad sense; The voice endpoint detection model detects a plurality of audio events existing in the training audio, wherein the plurality of audio events include a human speaking event, a silence event and at least one noise event; an output module is configured to obtain the generalized The results of distinguishing speech and non-speech of the multiple audio events output by the speech endpoint detection model; the loss calculation module is configured to calculate the loss based on the audio event annotation of the training audio and the output of the speech endpoint detection model in a broad sense function, wherein the audio event labeling comprises pre-labeling the training audio with audio events; and an optimization module configured to optimize the speech endpoint detection model in a broad sense by controlling the loss function.

第四方面，本发明实施例提供一种语音端点检测模型使用装置，包括：模型处理模块，配置为经由根据权利要求1-4所述的方法训练后的广义上的语音端点检测模型检测输入音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；以及区分模块，配置为获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果。In a fourth aspect, an embodiment of the present invention provides an apparatus for using a voice endpoint detection model, comprising: a model processing module configured to detect an input audio through a voice endpoint detection model in a broad sense trained by the method according to claims 1-4 A plurality of audio events existing in the audio event, wherein the plurality of audio events include a human speech event, a silence event and at least one noise event; Discrimination results of speech and non-speech for multiple audio events.

第五方面，提供一种电子设备，其包括：至少一个处理器，以及与所述至少一个处理器通信连接的存储器，其中，所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行本发明任一实施例的语音端点检测模型训练和使用方法的步骤。A fifth aspect provides an electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, The instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the voice endpoint detection model training and use method of any embodiment of the present invention.

第四方面，本发明实施例还提供一种计算机程序产品，所述计算机程序产品包括存储在非易失性计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行本发明任一实施例的语音端点检测模型训练和使用方法的步骤。In a fourth aspect, an embodiment of the present invention further provides a computer program product, the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, the computer program includes program instructions, and when the program is When the instructions are executed by a computer, the computer is made to execute the steps of the method for training and using the voice endpoint detection model according to any embodiment of the present invention.

本申请的方法和装置提供的方案利用音频事件检测的方法来解决音频端点检测的问题，这一点是比较具有创新性的，之前没有相同的工作这样来解决这个问题，这样的做法最大的好处就是提高了带噪环境下的检测性能，因为它把原本非语音类别进行了细化，区分了静音和各种类型的噪声，这样可以减少把各种噪声误判为语音的可能性，原来的语音端点检测模型中把所有非语音归为一类，但是这一类中有很多不同类型的噪声，各自特征也完全不同，所以在训练的时候类间的相似性就会比较低，会降低模型的区分性。The solution provided by the method and device of the present application uses the method of audio event detection to solve the problem of audio endpoint detection, which is relatively innovative. There is no similar work to solve this problem before. The biggest advantage of this approach is that It improves the detection performance in noisy environments, because it refines the original non-speech category, distinguishes silence and various types of noise, which can reduce the possibility of misjudging various noises as speech, the original speech In the endpoint detection model, all non-voices are classified into one category, but there are many different types of noise in this category, and their characteristics are completely different, so the similarity between categories will be relatively low during training, which will reduce the model's performance. Distinctive.

附图说明Description of drawings

为了更清楚地说明本发明实施例的技术方案，下面将对实施例描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. Obviously, the drawings in the following description are some embodiments of the present invention. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

图1为本发明一实施例提供的一种语音端点检测模型训练方法的流程图；1 is a flowchart of a method for training a voice endpoint detection model according to an embodiment of the present invention;

图2为本发明一实施例提供的一种语音端点检测模型使用方法的流程图；2 is a flowchart of a method for using a voice endpoint detection model according to an embodiment of the present invention;

图3为本发明一实施例提供的模型的流程图；3 is a flowchart of a model provided by an embodiment of the present invention;

图4示出了Aurora4(深色)和DCASE18(浅色)之间关于持续时间(左)和每个发声的段数(右)的评估数据分布；Figure 4 shows the distribution of evaluation data between Aurora4 (dark) and DCASE18 (light) with respect to duration (left) and number of segments per utterance (right);

图5示出了三个样本片段的每帧概率输出，带有可视化语音出现(方框，灰色)；Figure 5 shows per-frame probability output for three sample segments with visual speech occurrences (boxes, grey);

图6为本发明一实施例提供的一种语音端点检测模型训练装置的框图；6 is a block diagram of an apparatus for training a voice endpoint detection model according to an embodiment of the present invention;

图7为本发明一实施例提供的一种语音端点检测模型使用装置的框图；7 is a block diagram of an apparatus for using a voice endpoint detection model according to an embodiment of the present invention;

图8是本发明一实施例提供的电子设备的结构示意图。FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明实施例的目的、技术方案和优点更加清楚，下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments These are some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

请参考图1，其示出了本申请的语音端点检测模型训练方法的一实施例的流程图，本实施例的语音端点检测模型训练方法可以适用于对语音端点检测模型进行训练，本申请在此没有限制。Please refer to FIG. 1 , which shows a flowchart of an embodiment of the voice endpoint detection model training method of the present application. The voice endpoint detection model training method of the present embodiment can be applied to training the voice endpoint detection model. There is no limit to this.

如图1所示，在步骤101中，将训练音频输入至广义上的语音端点检测模型中；As shown in Figure 1, in step 101, the training audio is input into the voice endpoint detection model in a broad sense;

在步骤102中，经由所述广义上的语音端点检测模型检测所述训练音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；In step 102, a plurality of audio events existing in the training audio are detected via the generalized speech endpoint detection model, wherein the plurality of audio events include a human speaking event, a silence event and at least one noise event;

在步骤103中，获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果；In step 103, obtain the distinction result of speech and non-speech of the multiple audio events output by the speech endpoint detection model in the broad sense;

在步骤104中，基于所述训练音频的音频事件标注和所述广义上的语音端点检测模型的输出计算损失函数，其中，所述音频事件标注包括预先对所述训练音频进行音频事件的标注；In step 104, a loss function is calculated based on the audio event labeling of the training audio and the output of the speech endpoint detection model in a broad sense, wherein the audio event labeling includes pre-labeling the training audio with audio events;

在步骤105中，通过控制所述损失函数优化所述广义上的语音端点检测模型。In step 105, the generalized speech endpoint detection model is optimized by controlling the loss function.

在一些可选的实施例中，所述音频事件标注为段落级别的标注。段落级别的标注是更容易获得的，而且在音频事件检测方法中使用段落级别的标注也是不影响标注效果的。In some optional embodiments, the audio event annotations are paragraph-level annotations. Paragraph-level annotations are easier to obtain, and the use of paragraph-level annotations in audio event detection methods does not affect the annotation effect.

在一些可选的实施例中，在所述将训练音频输入至广义上的语音端点检测模型中之前，所述方法还包括：提取所述训练音频种的声学特征；以及使用卷积循环神经网络模型对所述声学特征进行训练分类。In some optional embodiments, before the inputting the training audio into the speech endpoint detection model in a broad sense, the method further comprises: extracting acoustic features of the training audio; and using a convolutional recurrent neural network The model is trained to classify the acoustic features.

在一些可选的实施例中，所述经由所述广义上的语音端点检测模型检测所述训练音频中存在的多个音频事件包括：利用所述广义上的语音端点检测模型中的音频事件检测来识别所述训练音频中存在的多个音频事件。In some optional embodiments, the detecting a plurality of audio events existing in the training audio via the speech endpoint detection model in the broad sense comprises: using audio event detection in the speech endpoint detection model in the broad sense to identify multiple audio events present in the training audio.

请参考图2，其示出了本申请一实施例提供的一种语音端点检测模型使用方法。Please refer to FIG. 2 , which shows a method for using a voice endpoint detection model provided by an embodiment of the present application.

如图2所示，在步骤201中，经由根据权利要求1-4所述的方法训练后的广义上的语音端点检测模型检测输入音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；As shown in FIG. 2 , in step 201, a plurality of audio events existing in the input audio are detected via the speech endpoint detection model in a broad sense trained by the method according to claims 1-4, wherein the plurality of audio Events include human speaking events, silence events, and at least one noise event;

在步骤202中，获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果。In step 202, a result of distinguishing speech and non-speech of the plurality of audio events output by the speech endpoint detection model in the broad sense is obtained.

在一些可选的实施例中，获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果包括：基于所述多个音频事件加上双阈值的后处理方法得到语音和非语音的区分结果。In some optional embodiments, acquiring the speech and non-speech discrimination results of the plurality of audio events output by the speech endpoint detection model in the broad sense includes: adding a double threshold based on the plurality of audio events The processing method obtains the distinction between speech and non-speech.

在一些可选的实施例中，所述基于所述多个音频事件加上双阈值的后处理方法得到语音和非语音的区分结果包括：基于所识别的多个音频事件将所述人说话事件作为语音部分，将所述静音事件和所述至少一种噪音事件作为非语音部分；以及基于所述语音部分和所述非语音部分的区分结果确定所述待检测音频的语音端点。In some optional embodiments, the post-processing method based on the multiple audio events plus a double threshold to obtain a distinction result between speech and non-speech includes: based on the identified multiple audio events, the person speaking event As a voice part, the silence event and the at least one noise event are regarded as a non-voice part; and a voice endpoint of the audio to be detected is determined based on the discrimination result of the voice part and the non-voice part.

下面对通过描述发明人在实现本发明的过程中遇到的一些问题和对最终确定的方案的一个具体实施例进行说明，以使本领域技术人员更好地理解本申请的方案。The following describes some problems encountered by the inventor in the process of implementing the present invention and a specific embodiment of the finalized solution, so that those skilled in the art can better understand the solution of the present application.

本领域的技术人员，一般要解决相关技术中存在的这些缺陷，通常会采用一些降噪的方法，即对需要判断的音频进行各种噪声消除的处理，降低噪声的影响之后再进行语音和非语音的区分，或者进行一些噪声自适应的方法，就是针对一些特定的场景，使用带有噪声的数据进行训练或者提取场景噪声的一些特征加入到模型中协助区分。Those skilled in the art generally want to solve these defects in the related art, and usually adopt some noise reduction methods, that is, perform various noise removal processing on the audio that needs to be judged, reduce the influence of noise, and then perform voice and noise reduction processing. The distinction of speech, or some methods of noise adaptation, is to use noisy data for training or extract some features of scene noise and add them to the model to assist in the distinction for some specific scenes.

发明人在实现本申请实施例的过程中发现：相关技术中存在的上述缺陷主要是由一个比较普遍的假设前提导致的，就是绝大多数的语音端点检测模型都是进行两个类别的区分，即语音和非语音，语音就是人说话的部分，然后把静音和各种噪声都归到非语音的一类中去，但其实静音段和各种不同类型的噪声特征都不相同，把它们放到同一类中会导致模型的区分性能下降。In the process of implementing the embodiments of the present application, the inventor found that the above-mentioned defects in the related art are mainly caused by a relatively common assumption, that is, the vast majority of voice endpoint detection models are to distinguish between two categories, That is, speech and non-speech, speech is the part of people speaking, and then mute and various noises are classified into the non-speech category, but in fact, the characteristics of silent segments and various types of noise are different. into the same class will lead to a decrease in the discriminative performance of the model.

本申请实施例中提出的方法就是使用音频事件检测的方法来帮助训练语音端点检测模型，即使用音频事件检测来对音频进行多个音频事件的检测，这些事件中包括人说话的时间，静音的事件，以及各种噪声的事件，例如鸟叫声，流水声等等，这样区分完毕之后，把人说话的事件作为语音部分，然后把其他的各种事件作为非语音的部分。这样做的好处就是把非语音的部分里面不同的类型(静音和不同类型的噪声)也区分开来，这样可以提高分类的准确性，不容易把噪声误判为语音部分。The method proposed in the embodiment of the present application is to use the method of audio event detection to help train the voice endpoint detection model, that is, to use audio event detection to detect multiple audio events in the audio. Events, as well as various noise events, such as bird calls, running water, etc., after this distinction is completed, the event of people speaking is regarded as the voice part, and other various events are regarded as the non-voice part. The advantage of this is that different types (silence and different types of noise) in the non-speech part are also distinguished, which can improve the accuracy of classification, and it is not easy to misjudge the noise as the speech part.

图3是我们模型的流程图，其中灰色的部分是模型训练的部分，首先使用CRNN(Convolutional Recurrent Neural Network，卷积循环神经网络)模型对输入的声学特征进行训练分类，然后我们提出的GPVAD(General Purpose Voice Activity Detection，广义上的语音端点检测)模型就是利用音频事件检测模型来识别出输入音频中存在的音频事件(说话，猫叫，开门声等等)，不同于传统语音端点检测模型VAD-C的训练过程，GPVAD模型不需要帧级别的标注来计算损失函数，只需要更容易得到的段落级别的标注(即某段音频中各个事件是否发生的0，1序列)来计算段落级别的损失函数即可。然后在inference andevaluation(测试和评价)部分，加上Double threshold(双阈值)的后处理方法得到语音和非语音的区分结果。其中，Label表示标签，Speech表示语音，Noise表示噪声，Predication表示预测，truth表示真相。Figure 3 is the flow chart of our model, in which the gray part is the part of model training. First, the CRNN (Convolutional Recurrent Neural Network, Convolutional Recurrent Neural Network) model is used to train and classify the input acoustic features, and then our proposed GPVAD ( The General Purpose Voice Activity Detection (VAD in a broad sense) model uses the audio event detection model to identify audio events (speaking, cat meowing, door opening, etc.) existing in the input audio, which is different from the traditional voice endpoint detection model VAD In the training process of -C, the GPVAD model does not need frame-level annotations to calculate the loss function, but only needs paragraph-level annotations that are easier to obtain (that is, the sequence of 0, 1 whether each event occurs in a certain piece of audio) to calculate the paragraph-level loss function. Then in the inference and evaluation (test and evaluation) part, add the post-processing method of Double threshold (double threshold) to obtain the distinction between speech and non-speech. Among them, Label represents the label, Speech represents the speech, Noise represents the noise, Prediction represents the prediction, and truth represents the truth.

发明人在实现本申请的过程中，还采用过一些备选方案，有一个备选方案就是利用相同的数据和标注来训练语音事件检测模型和语音端点检测模型，即都使用帧级别的标注来计算相同的损失函数，这样可以保证两个模型之间的可比性。但是缺点在于帧级别的标注不容易得到，尤其是对于音频事件的帧级别标注更难得到，而且我们要解决的问题也包括具有完整帧级别标注的数据太少这一个，所以最后没有使用完全相同的数据和标注来训练两个模型，而是使用段落级别的标注来训练音频事件检测模型，用帧级别的标注来训练音频端点检测模型。In the process of realizing this application, the inventor has also adopted some alternatives. One alternative is to use the same data and annotations to train the voice event detection model and the voice endpoint detection model, that is, both use frame-level annotations. Calculate the same loss function, which guarantees comparability between the two models. But the disadvantage is that frame-level annotations are not easy to obtain, especially for audio events, frame-level annotations are more difficult to obtain, and the problem we have to solve also includes too little data with complete frame-level annotations, so we did not use exactly the same. Instead, use paragraph-level annotations to train the audio event detection model and frame-level annotations to train the audio endpoint detection model.

本申请实施例提出的方法主要就是利用音频事件检测的方法来解决音频端点检测的问题，这一点是比较具有创新性的，之前没有相同的工作这样来解决这个问题，这样的做法最大的好处就是提高了带噪环境下的检测性能，因为它把原本非语音类别进行了细化，区分了静音和各种类型的噪声，这样可以减少把各种噪声误判为语音的可能性，原来的语音端点检测模型中把所有非语音归为一类，但是这一类中有很多不同类型的噪声，各自特征也完全不同，所以在训练的时候类间的相似性就会比较低，会降低模型的区分性。The method proposed in the embodiment of the present application mainly uses the method of audio event detection to solve the problem of audio endpoint detection, which is relatively innovative. There is no similar work to solve this problem before. The biggest advantage of this method is that It improves the detection performance in noisy environments, because it refines the original non-speech category, distinguishes silence and various types of noise, which can reduce the possibility of misjudging various noises as speech, the original speech In the endpoint detection model, all non-voices are classified into one category, but there are many different types of noise in this category, and their characteristics are completely different, so the similarity between categories will be relatively low during training, which will reduce the model's performance. Distinctive.

以下介绍发明人的实现本申请实施例的过程，以及在该过程中的一些实验过程及相应的实验数据，以使本领域技术人员更好地理解本申请的技术方案。The inventor's process of implementing the embodiments of the present application, as well as some experimental procedures and corresponding experimental data in the process are described below, so that those skilled in the art can better understand the technical solutions of the present application.

传统的有监督的语音端点检测技术在干净无噪声的特定环境下可以有很好的的效果。但是在真实带噪场景下性能会有很明显的下降。一个可能的瓶颈就是在真实场景下的语音通常带有很多不可预测的噪声，因此对于传统的有监督的语音端点检测模型来说，帧级别的预测是比较困难的。我们提出一种广泛意义上的语音端点检测框架(GPVAD)，这个框架可以用半监督训练的方式比轻松的训练带噪声的数据，而且只需要段落级别的标注。我们提出了两种GPVAD模型，一种是GPV-F，是在包含527个音频事件的Audioset数据集上训练的多分类器，另一种是GPV-B，只区分语音和噪声。我们在三个不同的测试集(干净的，合成噪声的，真实场景的)上比较这两个GPV模型以及传统的基于CRNN的语音端点检测模型(VAD-C)。结果显示我们提出的GPV-F模型在干净和合成的两个测试集上的检测性能与传统的VAD-C模型相当。同时在真实场景的测试中，帧级别的评价指标和段落级别的评价指标都表明，GPV-F模型的结果比传统的VAD-C模型提升很多。在真实场景下，相对来说简单一些的GPV-B模型也获得和VAD-C模型可比的性能。Traditional supervised speech endpoint detection techniques can work well in clean and noise-free specific environments. However, the performance will drop significantly in real noisy scenes. A possible bottleneck is that speech in real scenes usually has a lot of unpredictable noise, so frame-level prediction is difficult for traditional supervised speech endpoint detection models. We propose a general speech endpoint detection framework (GPVAD) that can be trained on noisy data more easily than semi-supervised training and requires only paragraph-level annotations. We propose two GPVAD models, GPV-F, a multi-classifier trained on the Audioset dataset containing 527 audio events, and GPV-B, which only distinguishes between speech and noise. We compare these two GPV models as well as the traditional CRNN-based speech endpoint detection model (VAD-C) on three different test sets (clean, synthetic noise, real scene). The results show that the detection performance of our proposed GPV-F model is comparable to the conventional VAD-C model on both clean and synthetic test sets. At the same time, in the test of real scenes, both the frame-level evaluation indicators and the paragraph-level evaluation indicators show that the results of the GPV-F model are much better than the traditional VAD-C model. In real scenarios, the relatively simple GPV-B model also achieves comparable performance to the VAD-C model.

1、简介1 Introduction

语音端点检测(VAD)的主要目的是检测语音段并将其与非语音区分开，它是语音识别，说话者识别和说话者确认等任务的关键组件。深度学习方法已成功应用于VAD。对于复杂环境中的VAD，神经网络(NN，Neural Networks)已取得成功。与传统方法相比，深度神经网络(DNN，Deep Neural Networks)特别是卷积神经网络(CNN，Convolutional NeuralNetworks)提供了改进的建模能力，而递归(RNN)和长短期记忆(LSTM，Long Short TermMemory)网络则可以更好地建模序列输入之间的长时依赖关系。但是，尽管应用了深度学习方法，基于NN的VAD训练仍然需要帧级别的标签。因此，所利用的训练数据通常是在有或没有其他合成噪声的受控环境下进行的。这不可避免地阻止了VAD在现实世界中的应用，因为在真实场景下的语音中经常伴随着无数种具有不同特征的没有见过的噪音。The main purpose of Voice Endpoint Detection (VAD) is to detect speech segments and distinguish them from non-speech, and it is a key component for tasks such as speech recognition, speaker identification, and speaker confirmation. Deep learning methods have been successfully applied to VAD. For VAD in complex environments, Neural Networks (NN) have been successful. Compared with traditional methods, Deep Neural Networks (DNN, Deep Neural Networks), especially Convolutional Neural Networks (CNN, Convolutional Neural Networks) provide improved modeling capabilities, while Recurrent (RNN) and Long Short-Term Memory (LSTM, Long Short TermMemory) network can better model long-term dependencies between sequence inputs. However, despite applying deep learning methods, NN-based VAD training still requires frame-level labels. Therefore, the exploited training data is usually performed in a controlled environment with or without other synthetic noise. This inevitably prevents the application of VAD in the real world, because speech in real scenes is often accompanied by countless unseen noises with different characteristics.

因此，本文旨在提出一种在干净无噪声的噪声环境之外检测语音的方法。Therefore, this paper aims to propose a method for detecting speech outside a clean and noise-free noisy environment.

应该注意的是，由于人工标记的成本很高，因此现实生活中音频的帧级别标签是比较难获得的，并且根据隐马尔可夫模型进行的标记预测需要有关所使用语言的先验知识。在启用噪声数据训练的同时检测语音成分的任务与弱监督音频事件检测(WSSED，Weakly Supervised Sound Event Detection)有关，该事件检测并定位不同的声音，包括通过段落级监督的语音。由于WSSED系统被验证了对噪声具有鲁棒性，并且只需要段落级标签，因此这项工作集成了WSSED方法，可将VAD缩放到野外语音场景并放宽对帧标签的依赖。具体来说，我们研究两个问题：1)当前的，多类WSSED模型的性能是否可与基于DNN的VAD媲美；2)段落级别的训练是否可以代替帧级别标签的训练？因此，我们介绍了我们的框架，这是VAD的通用训练框架(GPVAD，请参见图1)。一般而言，我们指的是两个不同的方面：首先，该框架具有强大的抗噪能力，并且能够在真实的生活场景中进行部署；其次，可以在不受约束的数据上训练框架，从而可以在大量的无标签数据(如嘈杂的在线视频)中学习。It should be noted that frame-level labels for real-life audio are relatively difficult to obtain due to the high cost of human labeling, and label prediction from hidden Markov models requires prior knowledge about the language used. The task of detecting speech components while enabling training on noisy data is related to Weakly Supervised Sound Event Detection (WSSED), which detects and localizes distinct sounds, including speech via paragraph-level supervision. Since the WSSED system is validated to be robust to noise and only requires paragraph-level labels, this work integrates the WSSED method to scale VAD to speech-in-the-wild scenes and relax the reliance on frame labels. Specifically, we investigate two questions: 1) Can the performance of current, multi-class WSSED models be comparable to DNN-based VADs; 2) Can paragraph-level training replace frame-level label training? Therefore, we introduce our framework, a generalized training framework for VAD (GPVAD, see Figure 1). In general, we refer to two distinct aspects: first, the framework is robust against noise and can be deployed in real life scenarios; second, the framework can be trained on unconstrained data, thereby Can learn on large amounts of unlabeled data such as noisy online videos.

该文件的结构如下：在第2节中，我们简要回顾了WSSED的相关工作以及如何将其用于真实环境中的VAD任务。在第3节中，介绍了GPVAD方法。此外，在第4节中，我们介绍了实验设置并提供了实现细节。在第5节中介绍了结果，最后在第6节中提供了结论。The document is structured as follows: In Section 2, we briefly review related work on WSSED and how it can be used for VAD tasks in real-world settings. In Section 3, the GPVAD method is introduced. Furthermore, in Section 4, we introduce the experimental setup and provide implementation details. The results are presented in Section 5, and finally the conclusions are provided in Section 6.

2、弱监督的音频事件检测2. Weakly supervised audio event detection

由于WSSED可以很好地在嘈杂的环境中检测语音而无需帧级标记，因此我们借用了这种想法以在真实环境中实现VAD。在这里，我们介绍有关音频事件检测(SED，SoundEvent Detection)的相关工作，其目的是对(音频标记)进行分类，并可能定位来自给定音频片段的多个同时发生的声音事件。在这项工作中，我们主要关注弱监督SED(WSSED)，这是一种半监督任务，该任务在训练期间只能访问段落级别的标签，而在评估过程中需要对特定事件进行分类和定位。这种弱监督的方式可以对噪声数据进行训练，而对标记方法的要求较低。弱监督的音频事件检测的最新进展，特别是声学场景和事件的检测和分类(DCASE)挑战，为预测准确的声音事件边界和事件标签带来了很大的进步。特别是最近的工作在短时不连续事件(例如语音)的检测方面显示出令人鼓舞的性能。Since WSSED can detect speech well in noisy environments without frame-level labeling, we borrow this idea to implement VAD in real environments. Here, we present related work on Audio Event Detection (SED, SoundEvent Detection), which aims to classify (audio markers) and possibly locate multiple simultaneous sound events from a given audio segment. In this work, we mainly focus on Weakly Supervised SED (WSSED), a semi-supervised task that only has access to paragraph-level labels during training and requires classification and localization of specific events during evaluation . This weakly supervised approach enables training on noisy data with less demanding labelling methods. Recent advances in weakly supervised audio event detection, especially the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, have brought great progress in predicting accurate sound event boundaries and event labels. In particular, recent work has shown encouraging performance in the detection of short-term discontinuous events such as speech.

图3示出了本申请实施例建议的框架。利用了CRNN架构，而GPVAD利用段落级标签进行训练，而VAD-C则使用帧级别标签进行训练。每个Conv2d块表示一个批归一化，然后是用0补齐的二维卷积，其内核大小为3×3，使用带泄漏的ReLU作为激活函数，负斜率为0.1。CNN输出被馈送到具有128个隐藏单元的双向门控循环单元(GRU)。该体系结构对时间维度T进行4倍的下采样，然后再进行上采样以匹配原始输入的时间维度。对于GPV-F，事件数E设置为527，对于GPV-B和VAD-C，事件数设置为2。在后处理之后，仅保留输出事件(语音)，以进行最终评估。FIG. 3 shows the framework proposed by the embodiments of the present application. The CRNN architecture is utilized, while GPVAD is trained with paragraph-level labels, and VAD-C is trained with frame-level labels. Each Conv2d block represents a batch normalization followed by a 0-padded 2D convolution with a kernel size of 3 × 3, using ReLU with leakage as the activation function, and a negative slope of 0.1. The CNN output is fed to a bidirectional Gated Recurrent Unit (GRU) with 128 hidden units. The architecture downsamples the time dimension T by a factor of 4, and then upsamples it to match the time dimension of the original input. For GPV-F, the number of events E is set to 527, and for GPV-B and VAD-C, the number of events is set to 2. After post-processing, only output events (speech) are retained for final evaluation.

3通过WSSED在嘈杂环境中使用VAD3 Using VAD in a noisy environment with WSSED

传统上，用于嘈杂场景的VAD是按照公式(1)建模的。假设可以从观察到的语音信号x中滤除加性噪声u以获得清晰的语音s。Traditionally, VADs for noisy scenes are modeled according to Equation (1). Suppose that the additive noise u can be filtered from the observed speech signal x to obtain a clear speech s.

x＝s+u(1)x=s+u(1)

但是，直接对u建模非常棘手，因为每种类型的噪声都有其各自的特征。因此，我们旨在通过观察潜在的L个不同的非语音事件(u1...，uL)来了解s的性质。这些事件不限于背景/前景噪声，并且可以具有不同的真实世界声音(例如，猫，音乐)。However, modeling u directly is tricky because each type of noise has its own characteristics. Therefore, we aim to understand the nature of s by looking at potentially L distinct non-speech events (u1...,uL). These events are not limited to background/foreground noise and can have different real world sounds (eg cats, music).

X＝{x1，...，xl，...，xL}X={x1,...,xl,...,xL}

xl＝(s，ul)(2)xl=(s,ul)(2)

我们的方法源于多实例学习(MIL)，这意味着关于特定标签的训练集知识是不完整的(例如，从未直接观察到语音)。在这里，我们将观察到的语音数据X建模为“包”，其中包含语音和其他任何一个可能有噪声的背景/前景事件标签l∈{1，...，L}共同发生的段落，可能的事件标签L<E(等式(2))。可以这么说，我们的方法旨在在复杂的环境场景中完善模型对语音信号的信念。这种建模方法的优点是可以同时应用于帧和段落级别的训练。因此，我们的GPVAD通过允许在片段/整句级别上进行训练来放松这些限制，其中每个训练片段至少包含一个感兴趣的事件。我们提出了两种不同的模型：GPV-F，它输出E＝527个标签(L＝405)，而朴素的GPV-B，E＝2，L＝1。GPV-F可以看作是一种成熟的WSSED方法，使用最多的标签监督，因此比GPV-B对标签的要求更高，后者只需要有关包含语音的片段的知识。但是，GPV-F应该能够对每个单独的噪声事件进行建模，而不是将所有噪声聚类为一个类别(GPV-B)，因此可能会增强重噪声场景下的性能。将这两个模型与在帧级上训练的模型(又称为VAD-C)进行了比较。Our method derives from multiple instance learning (MIL), which means that training set knowledge about a particular label is incomplete (e.g., speech is never directly observed). Here, we model the observed speech data X as "bags" containing passages where speech and any other possibly noisy background/foreground event label l ∈ {1,...,L} co-occur, Possible event labels L<E (equation (2)). Suffice it to say, our approach aims to refine the model's beliefs about speech signals in complex environmental scenarios. The advantage of this modeling approach is that it can be applied to both frame- and paragraph-level training. Therefore, our GPVAD relaxes these constraints by allowing training at the segment/full-sentence level, where each training segment contains at least one event of interest. We propose two different models: GPV-F, which outputs E=527 labels (L=405), and naive GPV-B, E=2, L=1. GPV-F can be seen as a full-fledged WSSED method that uses the most label supervision and is therefore more demanding on labels than GPV-B, which only requires knowledge about segments containing speech. However, GPV-F should be able to model each individual noise event instead of clustering all noises into one class (GPV-B), thus potentially enhancing performance in heavy noise scenarios. The two models were compared with a model trained at the frame level (aka VAD-C).

所有模型都共享一个在WSSED中使用的通用主干卷积递归神经网络(CRNN)方法，该方法对短时间的不连续事件(例如语音)具有鲁棒性。对以上方法进行了以下修改：1.添加上采样操作，以使模型的时间分辨率保持恒定。2.使用Lp池化并设置默认值(p＝4)，因为它对于持续时间不变性估计是有益的。与可以使用帧级标签的VAD-C培训不同，我们的GPVAD框架分为两个不同的阶段。在训练期间，仅可访问片段/整句的标签。因此，需要时间池化函数(等式(4))。在推论过程中，需要进行后处理(第4.3节)，以将概率序列转换为二进制标签(事件的缺失/存在)，并丢弃所有预测的非语音标签。该框架如图3所示。All models share a common backbone Convolutional Recurrent Neural Network (CRNN) method used in WSSED, which is robust to short-duration discrete events such as speech. The following modifications are made to the above method: 1. An upsampling operation is added to keep the temporal resolution of the model constant. 2. Use Lp pooling and set the default value (p=4) as it is beneficial for duration invariance estimation. Unlike VAD-C training, which can use frame-level labels, our GPVAD framework is divided into two distinct stages. During training, only labels for fragments/full sentences are accessible. Therefore, a temporal pooling function (equation (4)) is required. During inference, post-processing (Section 4.3) is required to convert probabilistic sequences into binary labels (absence/presence of events) and discard all predicted non-speech labels. The frame is shown in Figure 3.

4、实验4. Experiment

在我们的工作中，深层神经网络在PyTorch中实现，前端特征提取利用librosa。代码将在线提供。In our work, deep neural networks are implemented in PyTorch and front-end feature extraction utilizes librosa. Codes will be available online.

4.1数据集4.1 Dataset

表1：针对GPVAD(音频集)和VAD-C(Aurora4+)的培训数据集，以及针对干净，合成噪声和真实场景提出的三种建议测试集。持续时间代表大致的讲话时间。Table 1: Training datasets for GPVAD (audio set) and VAD-C (Aurora4+), and three proposed test sets for clean, synthetic noise, and real scenes. Duration represents the approximate speaking time.

图4示出了：Aurora4(深色)和DCASE18(浅色)之间关于持续时间(左)和每个句子片段数(右)的评估数据分布。彩色效果最佳。Figure 4 shows: The distribution of evaluation data between Aurora4 (dark) and DCASE18 (light) with respect to duration (left) and number of fragments per sentence (right). Color works best.

这项工作中利用的数据集可以分为训练数据部分(在GPVAD和VAD方法之间有所不同)和评估数据，这两种方法都共享该数据。我们的主要GPVAD训练数据集是AudioSet语料库提供的“平衡”集，其中包含21100/22160(由于部分不可用)10秒的Youtube音频片段，分为527个嘈杂事件标签。在可用的21100个段落(58h)中，有5452个段落(≈15h)被标记为包含语音，但始终与L＝405个其他事件(例如，狗叫声)并排。关于GPV-B，我们替换了平衡数据集中的所有526个事件，而不是将语音作为“噪音”，因此XGPV-B＝{(s，unoise，)，unoise}。重要的是要注意，对于GPVB/V的训练，永远不会单独观察语音。The datasets utilized in this work can be divided into training data parts (which differ between GPVAD and VAD methods) and evaluation data, which are shared by both methods. Our main GPVAD training dataset is the "balanced" set provided by the AudioSet corpus, which contains 21100/22160 (due to partial unavailability) 10-second Youtube audio clips grouped into 527 noisy event labels. Of the 21100 paragraphs available (58h), 5452 paragraphs (≈15h) were marked as containing speech, but were consistently side-by-side with L=405 other events (eg, dog barking). Regarding GPV-B, instead of treating speech as "noise", we replaced all 526 events in the balanced dataset, so XGPV-B={(s, unoise, ), unoise}. It is important to note that for the training of GPVB/V, speech is never observed alone.

我们的VAD-C模型在Aurora4训练集上进行了训练，该训练集是在15h的Switchboard数据子集的基础上进行了扩展，从而获得了我们的Aurora4+训练子集，其中包含干净的以及合成噪声数据。附加合成噪声(Syn)是从六种不同的噪声类型(汽车，杂音，餐厅，街道，机场和火车)获得的，这些噪声在10至20dB之间随机选择SNR进行添加。表1中描述了所有利用的数据集。提出了三种不同的评估方案。首先，我们在长达40分钟的干净Aurora4测试集合上进行测试。其次，我们通过使用从5db到15db的SNR以1db为步长，从100种噪声类型中随机添加噪声，基于干净的Aurora4测试集来合成一个噪声测试集。最后，我们合并DCASE18挑战[10]本身的音频集的开发和测试集，以创建我们的实际场景的评估数据。DCASE18数据提供了十个家庭环境事件标签，我们忽略了语音以外的所有标签，但是报告了存在非语音标签的实例数量。我们的DCASE18评估集包含596个标记为“语音”的语音，414个语音(占69％)包含另一个非语音标签，114个语音(占20％)仅包含语音和68个语音(占11％)包含两个或多个非语音标签。Our VAD-C model was trained on the Aurora4 training set, which was extended on the 15h subset of Switchboard data to obtain our Aurora4+ training subset, which contains clean as well as synthetic noise data. The additive synthetic noise (Syn) is obtained from six different noise types (car, murmur, restaurant, street, airport, and train) with randomly selected SNRs between 10 and 20 dB for addition. All utilized datasets are described in Table 1. Three different evaluation scenarios are proposed. First, we test on a clean Aurora4 test set of up to 40 minutes. Second, we synthesize a noisy test set based on the clean Aurora4 test set by randomly adding noise from 100 noise types with SNR from 5db to 15db in steps of 1db. Finally, we merge the dev and test sets of the audio set of the DCASE18 challenge [10] itself to create evaluation data for our real-world scenarios. The DCASE18 data provides ten home environment event labels, we ignore all labels except speech, but report the number of instances where non-speech labels are present. Our DCASE18 evaluation set contains 596 utterances labelled "speech", 414 utterances (69%) contain another non-speech label, 114 utterances (20%) contain speech only and 68 utterances (11%) ) contains two or more non-speech tags.

从图4中可以看出，DCASE18评估数据集与Aurora4数据集的不同之处在于，语音的平均持续时间(1.49s与3.31s)以及句子内的检测片段数量(3.87与2.08)。As can be seen in Figure 4, the DCASE18 evaluation dataset differs from the Aurora4 dataset in the average duration of speech (1.49s vs. 3.31s) and the number of detected fragments within a sentence (3.87 vs. 2.08).

4.2评估指标4.2 Evaluation indicators

帧级别：对于帧级别评估，我们利用帧级别宏/微平均F1分数(F1-宏，F1-micro)，曲线下面积(AUC，Area Under the Curve)和帧错误率(FER，frame error rate)。Frame-level: For frame-level evaluation, we utilize frame-level macro/micro-average F1 score (F1-macro, F1-micro), area under the curve (AUC, Area Under the Curve) and frame error rate (FER, frame error rate) .

段落级别：对于段落级别评估，我们使用基于事件的F1-Score(Event-F1)。事件F1计算开始，偏移和预测的标签是否与基本事实重叠，因此是时间一致性的度量。根据WSSED研究，我们将t-collar设置为200ms，以允许开始预测点的误差并进一步允许参考与预测之间的时长误差为20％。Paragraph level: For paragraph level evaluation, we use event-based F1-Score (Event-F1). Event F1 computes onset, offset and predicted labels overlap with ground truth and are therefore measures of temporal consistency. Based on the WSSED study, we set the t-collar to 200ms to allow for an error at the start of the prediction point and further allow a 20% error in duration between reference and prediction.

4.3设定4.3 Settings

关于特征提取，在这项工作中，所有实验都使用64维对数-梅尔功率谱图(LMS，log-Mel power spectrograms)。每个LMS样本使用Hann窗口，每20ms通过2048点傅里叶变换提取，窗口大小为40ms。在训练期间，将所有数据用0补齐到批处理中最长的样本长度，而在推理期间，将使用批处理大小为1，这意味着不填充。Regarding feature extraction, in this work, 64-dimensional log-Mel power spectrograms (LMS, log-Mel power spectrograms) are used for all experiments. Each LMS sample uses a Hann window, extracted by a 2048-point Fourier transform every 20ms, with a window size of 40ms. During training, all data is padded with 0s to the longest sample length in the batch, while during inference, a batch size of 1 will be used, meaning no padding.

真实结果y和预测

之间的所有实验的训练准则是所有样本N的交叉熵方程式(3)。线性softmax(方程式(4))用作合并帧级别的时间合并层单个向量表示y(e)∈[0,1]E的概率yt(e)∈[0,1]。true outcome y and prediction

The training criterion for all experiments in between is the cross-entropy equation (3) for all samples N. Linear softmax (Equation (4)) is used as the temporal merging layer at the merging frame level. A single vector represents the probability yt(e) ∈ [0,1] of y(e) ∈ [0,1]E.

GPVAD：可用的训练数据被分为90％训练平衡标签数据和10％验证集合。由于音频集内在的固有标签失衡，因此需要进行采样，以使每批包含每个标签中分布均匀的片段。训练使用Adam优化，起始学习率为1e-4，批量大小为64，并且如果指标在验证的数据集上没有减少，则在七轮后终止。GPVAD: The available training data is split into 90% training balanced label data and 10% validation set. Due to the inherent label imbalance inherent in audio sets, sampling is required so that each batch contains evenly distributed segments within each label. Training uses Adam optimization with a starting learning rate of 1e-4, a batch size of 64, and is terminated after seven epochs if the metric does not decrease on the validation dataset.

VAD-C：VAD-C训练使用20的批量大小，而对于填充的帧，则不计算损失函数(等式(3))。学习速率设置为1e-5，并且SGD用于模型优化。训练目标标签是通过从Kaldi训练过的ASRHMM模型中进行对齐获得的。VAD-C: VAD-C training uses a batch size of 20, while for padded frames, no loss function is computed (equation (3)). The learning rate is set to 1e-5, and SGD is used for model optimization. The training target labels are obtained by aligning from the Kaldi-trained ASRHMM model.

后处理在推理过程中，需要进行后处理，以便从类概率序列(yt(e))中获得硬标签。我们在此使用双阈值后处理，该后处理使用两个阈值

Post-processing During inference, post-processing is required to obtain hard labels from the sequence of class probabilities (yt(e)). Here we use dual threshold postprocessing which uses two thresholds

5结果5 results

我们的结果见表2。首先，我们提供的证据表明，我们的VAD-C模型能够与其他深度神经网络方法获得可以匹配的性能。将VAD-C与GPV-B/F进行比较，可以看出，鉴于我们针对干净和合成噪声数据集的指标，VAD-C确实是性能最好的模型。但是，对真实数据集的评估揭示了另外一些内容。在这里，VAD-C似乎无法对抗朴素的GPV-B方法(AUC 87.87与89.12，FER 21.92与19.65)，这表明VAD-C在存在真实噪声的情况下更有可能对语音进行错误分类。此外，在实际情况下，对于每个建议的指标，GPV-F的性能都优于VAD-C。我们提出的GPV-F方法也可以被认为是具有对噪声的鲁棒性，因为它在合成噪声和实际场景之间的性能差异很小。Our results are shown in Table 2. First, we provide evidence that our VAD-C model achieves comparable performance with other deep neural network approaches. Comparing VAD-C with GPV-B/F, it can be seen that VAD-C is indeed the best performing model given our metrics for clean and synthetic noisy datasets. However, evaluation on real datasets revealed something else. Here, VAD-C does not seem to be able to compete against the naive GPV-B method (AUC 87.87 vs 89.12, FER 21.92 vs 19.65), suggesting that VAD-C is more likely to misclassify speech in the presence of real noise. Furthermore, GPV-F outperforms VAD-C for every proposed metric in real-world situations. Our proposed GPV-F method can also be considered robust to noise, since its performance difference between synthetic noise and real scenes is small.

即使GPV-B的平均表现不及其他两种方法，也应注意这是成本最低的系统，因为GPV-B的标签数据本质上是一个二进制问题，即是否有人听到段落中的任何语音，这是能够廉价地扩展到大数据的方法。我们得出的结论是，仅使用段落级标签训练的GPVAD模型具有与在帧级标签上训练模型的竞争能力。Even though the average performance of GPV-B is not as good as the other two methods, it should be noted that this is the lowest cost system, because the label data for GPV-B is inherently a binary problem of whether anyone hears any speech in a passage, which is A method that scales to big data cheaply. We conclude that GPVAD models trained using only paragraph-level labels are competitive with models trained on frame-level labels.

定量结果Quantitative results

为了对模型能力进行可视化，从测试集中采样了三个片段(一个Aurora4Noisy，两个DCASE18)，并且每个帧的输出概率如图5所示。在顶部的合成Aurora4测试中，我们可以看到，我们的GPVAD模型能够对两个语音段之间的短暂停顿进行建模，而这两个停顿在VAD-C失败的情况下进行，但是两个GPVAD模型都无法正确估计第二个语音段结束。中心样本进一步展示了实际场景中的一个典型VADC问题：对于大多数语音，它无法区分前景事件(此处为Guitar)和活动语音。尤其是最下面的样本例证了这个问题：VAD-C开始预测语音，而没有语音，两个GPVAD模型都能够区分语音中的任何背景噪声。请注意，底部片段的结尾包含笑声，VAD-C将其分类为语音。在未来的工作中，我们希望通过利用更大的训练数据(例如，不平衡的AudioSet)来进一步扩展GPVAD训练的范围。To visualize model capability, three segments (one Aurora4Noisy, two DCASE18) were sampled from the test set, and the output probabilities for each frame are shown in Figure 5. In the synthetic Aurora4 test at the top, we can see that our GPVAD model is able to model short pauses between two speech segments that fail VAD-C, but both None of the GPVAD models can correctly estimate the end of the second speech segment. The center sample further demonstrates a typical VADC problem in real scenarios: for most speech, it cannot distinguish foreground events (Guitar here) and active speech. The bottom sample in particular exemplifies the problem: VAD-C starts to predict speech, and without speech, both GPVAD models are able to distinguish any background noise in speech. Note that the end of the bottom segment contains laughter, which VAD-C classifies as speech. In future work, we hope to further expand the scope of GPVAD training by leveraging larger training data (e.g., an unbalanced AudioSet).

表2：在每种评估条件下获得的最佳结果。粗体字表示各个数据集的最佳结果，下划线表示次佳。其中，AUC表示曲线下面积，FER表示误帧率。Table 2: Best results obtained under each evaluation condition. Bold text indicates the best result for each dataset, underlined indicates the next best. Among them, AUC represents the area under the curve, and FER represents the frame error rate.

图5：三个样本片段的每帧概率输出，带有可视化语音出现(方框，灰色)。(顶部)包含来自Aurora4(B)的片段；(中)有一位弹吉他的音乐家(DCASE18)；(底部)包含有人在谈论背景噪音(DCASE18)。指示了后处理阈值

彩色效果最佳。图中，Speech表示语音。Figure 5: Per-frame probability output for three sample segments with visual speech occurrences (boxes, grey). (Top) contains clips from Aurora4 (B); (middle) contains a musician playing guitar (DCASE18); (bottom) contains someone talking about background noise (DCASE18). postprocessing threshold indicated

Color works best. In the figure, Speech represents speech.

6结论6 Conclusion

本发明实施例介绍了一种利用弱标签进行训练的音频事件检测的鲁棒VAD方法。研究了两个GPVAD系统：仅对二进制语音和非语音对进行训练的GPV-B，以及使用所有527AudioSet标签的GPV-F。我们的测试集使用五个不同的指标将我们建议的GPVAD方法与传统VAD进行了彻底比较。结果表明，即使利用段落级别的标签对GPV-B进行训练，也可以将其用于检测语音，而无需使用干净的带有框架标签的训练数据。此外，虽然GPV-B/F在针对VAD-C的干净噪声和合成噪声场景中均有不足，但它们在针对真实场景的的稳定预测方面表现出色。具体而言，可以看出，我们提出的方法在合成噪声和现实噪声数据集的性能上都非常可靠。我们的最佳性能模型GPV-F在真实场景的上大大优于传统的监督VAD方法，最终的绝对性能提升是5.57％的F1-macro，6.45％的F1micro，3.93％的AUC，6.45％的FER和10.4％的Event-F1。Embodiments of the present invention introduce a robust VAD method for audio event detection using weak labels for training. Two GPVAD systems are studied: GPV-B, which trains only on binary speech and non-speech pairs, and GPV-F, which uses all 527AudioSet labels. Our test set thoroughly compares our proposed GPVAD method with traditional VAD using five different metrics. The results show that even if GPV-B is trained with paragraph-level labels, it can be used to detect speech without using clean frame-labeled training data. Furthermore, although GPV-B/F falls short for both clean noise and synthetic noise scenes for VAD-C, they excel in stable prediction for real scenes. Specifically, it can be seen that our proposed method performs very reliably on both synthetic noise and real-world noise datasets. Our best performing model, GPV-F, outperforms traditional supervised VAD methods by a large margin on real-world scenarios, with final absolute performance gains of 5.57% F1-macro, 6.45% F1micro, 3.93% AUC, 6.45% FER and 10.4% Event-F1.

请参考图6，其示出了本发明一实施例提供的一种语音端点检测模型训练和使用装置的框图。Please refer to FIG. 6 , which shows a block diagram of an apparatus for training and using a voice endpoint detection model according to an embodiment of the present invention.

如图6所示，语音端点检测模型训练装置600，包括输入模块610、检测模块620、输出模块630、损失计算模块640和优化模块650。As shown in FIG. 6 , the voice endpoint detection model training apparatus 600 includes an input module 610 , a detection module 620 , an output module 630 , a loss calculation module 640 and an optimization module 650 .

其中，输入模块610，配置为将训练音频输入至广义上的语音端点检测模型中；检测模块620，配置为经由所述广义上的语音端点检测模型检测所述训练音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；输出模块630，配置为获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果；损失计算模块640，配置为基于所述训练音频的音频事件标注和所述广义上的语音端点检测模型的输出计算损失函数，其中，所述音频事件标注包括预先对所述训练音频进行音频事件的标注；以及优化模块650，配置为通过控制所述损失函数优化所述广义上的语音端点检测模型。Wherein, the input module 610 is configured to input the training audio into the speech endpoint detection model in a broad sense; the detection module 620 is configured to detect multiple audio events existing in the training audio via the speech endpoint detection model in the broad sense , wherein the multiple audio events include a human speaking event, a mute event, and at least one noise event; the output module 630 is configured to acquire the voice and the multiple audio events output by the voice endpoint detection model in the broad sense. non-speech discrimination result; the loss calculation module 640 is configured to calculate a loss function based on the audio event annotation of the training audio and the output of the speech endpoint detection model in the broad sense, wherein the audio event annotation includes pre- training audio for audio event labeling; and an optimization module 650 configured to optimize the generalized speech endpoint detection model by controlling the loss function.

请参考图7，其示出了本发明一实施例提供的一种语音端点检测模型使用装置。Please refer to FIG. 7 , which shows an apparatus for using a voice endpoint detection model provided by an embodiment of the present invention.

如图7所示，语音端点检测模型使用装置700包括模型处理模块710和区分模块720。As shown in FIG. 7 , the apparatus 700 for using a voice endpoint detection model includes a model processing module 710 and a distinguishing module 720 .

其中，模型处理模块710，配置为经由根据上述的方法训练后的广义上的语音端点检测模型检测输入音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；以及区分模块720，配置为获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果。Wherein, the model processing module 710 is configured to detect multiple audio events existing in the input audio through the speech endpoint detection model in a broad sense trained according to the above method, wherein the multiple audio events include human speaking events, mute events and at least one noise event; and a distinguishing module 720, configured to obtain a distinguishing result of speech and non-speech of the plurality of audio events output by the speech endpoint detection model in the broad sense.

应当理解，图6和图7中记载的诸模块与参考图1和2中描述的方法中的各个步骤相对应。由此，上文针对方法描述的操作和特征以及相应的技术效果同样适用于图6和图7中的诸模块，在此不再赘述。It should be understood that the modules recited in FIGS. 6 and 7 correspond to the various steps in the method described with reference to FIGS. 1 and 2 . Therefore, the operations and features described above with respect to the method and the corresponding technical effects are also applicable to the modules in FIG. 6 and FIG. 7 , and will not be repeated here.

值得注意的是，本申请的实施例中的模块并不用于限制本申请的方案，例如接收模块可以描述为接收语音识别请求的模块。另外，还可以通过硬件处理器来实现相关功能模块，例如接收模块也可以用处理器实现，在此不再赘述。It should be noted that the modules in the embodiments of the present application are not used to limit the solution of the present application, for example, the receiving module may be described as a module that receives a voice recognition request. In addition, the relevant functional modules may also be implemented by a hardware processor, for example, the receiving module may also be implemented by a processor, which will not be repeated here.

在另一些实施例中，本发明实施例还提供了一种非易失性计算机存储介质，计算机存储介质存储有计算机可执行指令，该计算机可执行指令可执行上述任意方法实施例中的语音端点检测模型训练和使用方法；In other embodiments, embodiments of the present invention further provide a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the voice endpoint in any of the foregoing method embodiments Detect model training and usage methods;

作为一种实施方式，本发明的非易失性计算机存储介质存储有计算机可执行指令，计算机可执行指令设置为：As an embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are set to:

将训练音频输入至广义上的语音端点检测模型中；Input the training audio into the speech endpoint detection model in a broad sense;

经由所述广义上的语音端点检测模型检测所述训练音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；Detecting, via the generalized speech endpoint detection model, a plurality of audio events present in the training audio, wherein the plurality of audio events include a human speaking event, a silence event, and at least one noise event;

获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果；Obtaining the discrimination results of speech and non-speech of the multiple audio events output by the speech endpoint detection model in the broad sense;

基于所述训练音频的音频事件标注和所述广义上的语音端点检测模型的输出计算损失函数，其中，所述音频事件标注包括预先对所述训练音频进行音频事件的标注；A loss function is calculated based on the audio event labeling of the training audio and the output of the speech endpoint detection model in a broad sense, wherein the audio event labeling includes pre-labeling the training audio for audio events;

通过控制所述损失函数优化所述广义上的语音端点检测模型。The generalized speech endpoint detection model is optimized by controlling the loss function.

作为另一种实施方式，本发明的非易失性计算机存储介质存储有计算机可执行指令，计算机可执行指令设置为：As another implementation manner, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are set to:

经由根据第一方面所述的方法训练后的广义上的语音端点检测模型检测输入音频中存在的多个音频事件，其中，所述多个音频事件包括人说话事件、静音事件以及至少一种噪音事件；Detecting a plurality of audio events present in the input audio via a generalized speech endpoint detection model trained by the method according to the first aspect, wherein the plurality of audio events include a human speaking event, a silence event, and at least one noise event;

获取所述广义上的语音端点检测模型输出的所述多个音频事件的语音和非语音的区分结果。Obtaining a distinction result of speech and non-speech of the plurality of audio events output by the speech endpoint detection model in the broad sense.

非易失性计算机可读存储介质可以包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需要的应用程序；存储数据区可存储根据语音端点检测模型训练和使用装置的使用所创建的数据等。此外，非易失性计算机可读存储介质可以包括高速随机存取存储器，还可以包括非易失性存储器，例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中，非易失性计算机可读存储介质可选包括相对于处理器远程设置的存储器，这些远程存储器可以通过网络连接至语音端点检测模型训练和使用装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The non-volatile computer-readable storage medium can include a stored program area and a stored data area, wherein the stored program area can store an operating system and an application program required by at least one function; Use the data created by the use of the device, etc. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-transitory computer-readable storage medium may optionally include memory located remotely from the processor that can be connected to the voice endpoint detection model training and use apparatus through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

本发明实施例还提供一种计算机程序产品，计算机程序产品包括存储在非易失性计算机可读存储介质上的计算机程序，计算机程序包括程序指令，当程序指令被计算机执行时，使计算机执行上述任一项语音端点检测模型训练和使用方法。An embodiment of the present invention further provides a computer program product, the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is made to execute the above Any speech endpoint detection model training and usage method.

图8是本发明实施例提供的电子设备的结构示意图，如图8所示，该设备包括：一个或多个处理器810以及存储器820，图8中以一个处理器810为例。语音端点检测模型训练和使用方法的设备还可以包括：输入装置830和输出装置840。处理器810、存储器820、输入装置830和输出装置840可以通过总线或者其他方式连接，图8中以通过总线连接为例。存储器820为上述的非易失性计算机可读存储介质。处理器810通过运行存储在存储器820中的非易失性软件程序、指令以及模块，从而执行服务器的各种功能应用以及数据处理，即实现上述方法实施例语音端点检测模型训练和使用方法。输入装置830可接收输入的数字或字符信息，以及产生与语音端点检测模型训练和使用装置的用户设置以及功能控制有关的键信号输入。输出装置840可包括显示屏等显示设备。FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 8 , the device includes: one or more processors 810 and a memory 820 . In FIG. 8 , one processor 810 is used as an example. The apparatus for training and using the voice endpoint detection model may further include: an input device 830 and an output device 840 . The processor 810, the memory 820, the input device 830, and the output device 840 may be connected through a bus or in other ways, and the connection through a bus is taken as an example in FIG. 8 . The memory 820 is the aforementioned non-volatile computer-readable storage medium. The processor 810 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 820, that is, to implement the voice endpoint detection model training and use methods of the above method embodiments. The input device 830 may receive input numerical or character information and generate key signal input related to voice endpoint detection model training and user settings and function control of the use device. The output device 840 may include a display device such as a display screen.

上述产品可执行本发明实施例所提供的方法，具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节，可参见本发明实施例所提供的方法。The above product can execute the method provided by the embodiment of the present invention, and has corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.

作为一种实施方式，上述电子设备应用于语音端点检测模型训练装置中，包括：As an embodiment, the above-mentioned electronic equipment is applied to a voice endpoint detection model training device, including:

至少一个处理器；以及，与至少一个处理器通信连接的存储器；其中，存储器存储有可被至少一个处理器执行的指令，指令被至少一个处理器执行，以使至少一个处理器能够：at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:

作为一种实施方式，上述电子设备应用于语音端点检测模型使用装置中，包括：As an embodiment, the above-mentioned electronic equipment is applied to a device for using a voice endpoint detection model, including:

本申请实施例的电子设备以多种形式存在，包括但不限于：The electronic devices in the embodiments of the present application exist in various forms, including but not limited to:

(1)移动通信设备：这类设备的特点是具备移动通信功能，并且以提供话音、数据通信为主要目标。这类终端包括:智能手机、多媒体手机、功能性手机，以及低端手机等。(1) Mobile communication equipment: This type of equipment is characterized by having mobile communication functions, and its main goal is to provide voice and data communication. Such terminals include: smart phones, multimedia phones, feature phones, and low-end phones.

(2)超移动个人计算机设备：这类设备属于个人计算机的范畴，有计算和处理功能，一般也具备移动上网特性。这类终端包括：PDA、MID和UMPC设备等。(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has the characteristics of mobile Internet access. Such terminals include: PDA, MID and UMPC devices.

(3)便携式娱乐设备：这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器，掌上游戏机，电子书，以及智能玩具和便携式车载导航设备。(3) Portable entertainment equipment: This type of equipment can display and play multimedia content. Such devices include: audio and video players, handheld game consoles, e-books, as well as smart toys and portable car navigation devices.

(4)服务器：提供计算服务的设备，服务器的构成包括处理器、硬盘、内存、系统总线等，服务器和通用的计算机架构类似，但是由于需要提供高可靠的服务，因此在处理能力、稳定性、可靠性、安全性、可扩展性、可管理性等方面要求较高。(4) Server: A device that provides computing services. The composition of the server includes a processor, a hard disk, a memory, a system bus, etc. The server is similar to a general computer architecture, but due to the need to provide highly reliable services, the processing power, stability , reliability, security, scalability, manageability and other aspects of high requirements.

(5)其他具有数据交互功能的电子装置。(5) Other electronic devices with data interaction function.

以上所描述的装置实施例仅仅是示意性的，其中作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place , or distributed to multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand: it can still be Modifications are made to the technical solutions described in the foregoing embodiments, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A speech endpoint detection model training method comprises the following steps:

inputting training audio into a generalized voice endpoint detection model;

detecting a plurality of audio events present in the training audio via the generalized speech endpoint detection model, wherein the plurality of audio events include a speak-by-person event, a silence event, and at least one noise event;

obtaining the voice and non-voice distinguishing results of the plurality of audio events output by the generalized voice endpoint detection model;

calculating a loss function based on the audio event label of the training audio and the output of the generalized voice endpoint detection model, wherein the audio event label comprises the label of an audio event of the training audio in advance;

optimizing the generalized speech endpoint detection model by controlling the loss function.

2. The method of claim 1, wherein the audio event annotation is a paragraph level annotation.

3. The method of claim 1, wherein prior to the inputting training audio into the generalized speech endpoint detection model, the method further comprises:

extracting acoustic features of the training audio species;

and training and classifying the acoustic features by using a convolution cyclic neural network model.

4. The method of claim 3, wherein the detecting, via the generalized speech endpoint detection model, a plurality of audio events present in the training audio comprises:

audio event detection in the generalized speech endpoint detection model is utilized to identify a plurality of audio events present in the training audio.

5. A method for using a voice endpoint detection model comprises the following steps:

detecting a plurality of audio events present in input audio via a generalized speech endpoint detection model trained according to the method of claims 1-4, wherein the plurality of audio events include a human speaking event, a silence event, and at least one noise event;

and acquiring a voice and non-voice distinguishing result of the plurality of audio events output by the generalized voice endpoint detection model.

6. The method of claim 5, wherein obtaining the speech and non-speech discrimination results for the plurality of audio events output by the generalized speech endpoint detection model comprises:

and obtaining a voice and non-voice distinguishing result based on the plurality of audio events and the dual-threshold post-processing method.

7. The method of claim 6, wherein the deriving a distinction between speech and non-speech based on the plurality of audio events plus a dual threshold post-processing method comprises:

identifying the human speaking event as a speech portion based on the identified plurality of audio events and the silence event and the at least one noise event as non-speech portions;

and determining the voice endpoint of the audio to be detected based on the distinguishing result of the voice part and the non-voice part.

8. A speech endpoint detection model training apparatus, comprising:

an input module configured to input training audio into a generalized speech endpoint detection model;

a detection module configured to detect a plurality of audio events present in the training audio via the generalized speech endpoint detection model, wherein the plurality of audio events include a human speaking event, a silence event, and at least one noise event;

an output module configured to obtain a result of distinguishing between speech and non-speech of the plurality of audio events output by the generalized speech endpoint detection model;

a loss calculation module configured to calculate a loss function based on an audio event label of the training audio and an output of the generalized speech endpoint detection model, wherein the audio event label includes a label of an audio event of the training audio in advance;

an optimization module configured to optimize the generalized speech endpoint detection model by controlling the loss function.

9. A speech endpoint detection model using apparatus, comprising:

a model processing module configured to detect a plurality of audio events present in input audio via a generalized speech endpoint detection model trained according to the method of claims 1-4, wherein the plurality of audio events include a human speaking event, a silence event, and at least one noise event;

and the distinguishing module is configured to acquire a distinguishing result of the voice and the non-voice of the plurality of audio events output by the generalized voice endpoint detection model.

10. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 7.