CN111816216A - Voice activity detection method and device
- Publication number
- CN111816216A (application number CN202010867436.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- level
- sentence
- activity detection
- voice activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
Technical Field
The present invention belongs to the technical field of speech recognition, and in particular relates to a voice activity detection method and device.
Background Art
Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing to detect whether a speech signal is present. VAD is mainly used in speech coding and speech recognition. It can simplify speech processing and can also be used to remove non-speech segments during an audio session: in IP telephony applications, encoding and transmission of silent packets can be avoided, saving computation time and bandwidth.
VAD technology makes a whole range of voice-based applications possible. Accordingly, there is a family of VAD algorithms with different characteristics in terms of latency, sensitivity, accuracy, and computational cost. Some VAD algorithms also provide further analysis, such as whether the speech is voiced, unvoiced, or sustained. Voice activity detection is usually language-independent.
VAD technology was first used in time-assignment speech interpolation (TASI) systems.
Voice activity detection can be performed based on traditional acoustic features such as short-time energy, spectral energy, and zero-crossing rate, or based on features extracted by a neural network, giving a speech/non-speech decision for every frame of audio. This approach performs well when the signal-to-noise ratio is high.
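As a minimal sketch of this frame-level, feature-threshold style of VAD (the frame length, hop size, and thresholds below are illustrative assumptions, not values taken from this disclosure):

```python
import numpy as np

def frame_level_vad(signal, sr=16000, frame_ms=25, hop_ms=10,
                    energy_thresh=1e-3, zcr_thresh=0.25):
    """Label each frame as speech (True) or non-speech (False) using
    short-time energy and zero-crossing rate; thresholds would normally be tuned."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    labels = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                          # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # zero-crossing rate
        # Frames with enough energy and a moderate zero-crossing rate count as speech.
        labels.append(bool(energy > energy_thresh and zcr < zcr_thresh))
    return np.array(labels)
```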
Summary of the Invention
Embodiments of the present invention provide a voice activity detection method and device, which are intended to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice activity detection method, comprising: inputting audio to be detected into a frame-level VAD system for frame-level voice activity detection, and obtaining first audio output by the frame-level VAD system; and inputting the first audio into a sentence-level VAD system for sentence-level voice activity detection, obtaining second audio output by the sentence-level VAD system, and performing subsequent processing on the second audio.
In a second aspect, an embodiment of the present invention provides a voice activity detection device, comprising: a first input-detection-output module, configured to input audio to be detected into a frame-level VAD system for frame-level voice activity detection and obtain first audio output by the frame-level VAD system; and a second input-detection-output module, configured to input the first audio into a sentence-level VAD system for sentence-level voice activity detection, obtain second audio output by the sentence-level VAD system, and perform subsequent processing on the second audio.
In a third aspect, a computer program product is provided. The computer program product comprises a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the voice activity detection method of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
By appending a sentence-level VAD system after an existing frame-level VAD system, the method provided by the embodiments of the present application can make a further whole-sentence-level decision on the audio that the previous system judged to be speech, reducing audio misjudgments, improving the recall rate for non-speech segments, and further saving back-end recognition resources.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
FIG. 1 is a flowchart of a voice activity detection method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of another voice activity detection method provided by an embodiment of the present invention;
FIG. 3 is a framework diagram of the sentence-level voice activity detection system in a specific embodiment of the voice activity detection solution of the present invention;
FIG. 4 is a classification flowchart for the original audio in a specific embodiment of the voice activity detection solution of the present invention;
FIG. 5 is a block diagram of a voice activity detection device provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description of the Embodiments
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Please refer to FIG. 1, which shows a flowchart of an embodiment of a voice activity detection method of the present invention.
As shown in FIG. 1, in step 101, audio to be detected is input into a frame-level VAD system for frame-level voice activity detection, and first audio output by the frame-level VAD system is obtained;
In step 102, the first audio is input into a sentence-level VAD system for sentence-level voice activity detection, second audio output by the sentence-level VAD system is obtained, and subsequent processing is performed on the second audio.
In this embodiment, for step 101, the voice activity detection device inputs the audio to be detected into the frame-level VAD system for frame-level voice activity detection and obtains the first audio output by the frame-level VAD system. The frame-level VAD is based on an FSMN (Feed-forward Sequential Memory Network) model, and both an online FSMN model and an offline FSMN model are used. The online FSMN model uses only a few frames of history and has no look-ahead, that is, it uses only historical information and no future information; the offline FSMN model uses both history and a few frames of look-ahead, that is, it uses historical as well as future information. For example, after the audio to be detected is acquired, it is subjected to feature processing and fed into the frame-level VAD system for screening, and the first audio that the frame-level VAD system judges to be speech is obtained.
For step 102, the voice activity detection device inputs the first audio into the sentence-level VAD system for sentence-level voice activity detection, obtains the second audio output by the sentence-level VAD system, and performs subsequent processing on the second audio. For example, feature extraction is first performed on the first audio that the frame-level VAD system judged to be speech, and the features are then passed through offline FSMN layers and a DNN (deep neural network) to give the final second audio.
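A minimal sketch of this two-stage flow, assuming placeholder `frame_vad`, `sentence_vad`, and `recognizer` objects (their interfaces and the frame length are illustrative assumptions, not details from this disclosure):

```python
import numpy as np

def vad_cascade(audio, frame_vad, sentence_vad, recognizer, frame_len=160):
    """Two-stage cascade: frame-level screening, then a sentence-level check,
    and only then back-end recognition. The three model objects are assumed."""
    # Stage 1: per-frame speech/non-speech decision from the frame-level VAD.
    frame_labels = frame_vad.detect(audio)                    # assumed: one bool per frame
    if not np.any(frame_labels):
        return None                                           # no speech frames: stop here
    # Keep only the samples of frames judged to be speech ("first audio").
    keep = np.repeat(frame_labels, frame_len)
    keep = np.pad(keep, (0, max(0, len(audio) - len(keep))))[:len(audio)].astype(bool)
    first_audio = audio[keep]

    # Stage 2: whole-sentence decision on the screened audio ("second audio").
    if not sentence_vad.is_speech(first_audio):               # assumed sentence-level API
        return None                                           # rejected: saves recognition cost
    return recognizer.recognize(first_audio)                  # subsequent processing (ASR)
```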
In the solution of this embodiment, by appending a sentence-level VAD system after the existing frame-level VAD system, a further whole-sentence-level decision can be made on the audio that the previous system judged to be speech, reducing audio misjudgments, improving the recall rate for non-speech segments, and further saving back-end recognition resources.
Please refer to FIG. 2, which shows a flowchart of another voice activity detection method provided by an embodiment of the present invention. This flowchart mainly further defines the step of "inputting the first audio into the sentence-level VAD system for sentence-level voice activity detection" in step 102 of FIG. 1.
As shown in FIG. 2, in step 201, the first audio is split into multiple audio segments, and the sentence-level VAD system performs voice activity detection on each of the segments;
In step 202, if any one of the multiple segments is detected to contain speech, the entire first audio is output.
In this embodiment, for step 201, the voice activity detection device splits the first audio, which the frame-level VAD system judged to be speech, into multiple audio segments, and uses the sentence-level VAD system to perform voice activity detection on each segment separately. For example, a piece of audio is "*你*好**小*驰", where "*" denotes noise and "你好小驰" ("Hello, Xiaochi") is the user's speech. The voice activity detection device splits it into multiple segments, for example "**你*", "好**", and "小驰*", and then uses the sentence-level VAD system to detect voice activity in "**你*", "好**", and "小驰*" separately. The audio may be split into segments by a fixed duration or length; for example, a particularly short piece of audio may not need to be split at all, while a particularly long piece of audio may be split into more segments. The present application imposes no limitation here.
For step 202, if the voice activity detection device detects that any one of the multiple segments contains speech, it outputs the entire first audio. For example, if speech is detected in any of the segments "**你*", "好**", or "小驰*" — say, the segment "**你" is detected to contain speech — detection of the other segments stops, the whole audio is directly judged to contain speech, and the whole audio is then output.
In the solution of this embodiment, a whole piece of audio is split into segments and each segment is judged separately; as long as one segment is judged to be speech, the whole piece of audio is considered to contain speech, and the judgment of the remaining segments can be skipped.
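A minimal sketch of this segmented, early-exit decision, assuming a placeholder `sentence_vad` object and an illustrative segment length (neither is specified by this disclosure):

```python
def contains_speech(first_audio, sentence_vad, segment_len=48000):
    """Segmented, early-exit sentence-level check: split the audio into
    fixed-length segments and stop as soon as one segment contains speech."""
    if len(first_audio) <= segment_len:
        # A very short piece of audio is judged as a single piece, without splitting.
        return sentence_vad.is_speech(first_audio)     # assumed API
    for start in range(0, len(first_audio), segment_len):
        segment = first_audio[start:start + segment_len]
        if sentence_vad.is_speech(segment):
            return True       # one speech segment is enough; skip the remaining segments
    return False
```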
In some optional embodiments, the frame-level VAD system is used to judge whether each frame of the audio to be detected is a speech frame, and to output the first audio composed of the frames of the audio to be detected that are judged to be speech frames; if it is judged that the audio to be detected contains no speech frames, no subsequent processing is performed.
In some optional embodiments, the sentence-level VAD system is used to judge whether the first audio, as a whole sentence, is speech; if so, the first audio is input into a speech recognition system for speech recognition; if not, no subsequent processing is performed.
In some optional embodiments, the sentence-level VAD system is an FSMN-based model, and the FSMN (Feed-forward Sequential Memory Network)-based model includes a feature extraction layer, multiple offline FSMN layers, and a DNN layer.
In the method described in the above embodiments, the frame-level VAD system is also an FSMN-based model.
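For illustration only, the following PyTorch-style sketch shows one way such a stack (feature layer, several offline FSMN layers, a DNN classifier) might look; the layer sizes, memory orders, pooling, and the use of a depthwise temporal convolution for the FSMN memory are assumptions, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class FSMNBlock(nn.Module):
    """Offline FSMN memory block (sketch): each hidden unit also aggregates a
    symmetric window of past and future frames via a depthwise temporal conv."""
    def __init__(self, dim=256, context=10):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.memory = nn.Conv1d(dim, dim, kernel_size=2 * context + 1,
                                padding=context, groups=dim, bias=False)

    def forward(self, x):                     # x: (batch, time, dim)
        h = torch.relu(self.proj(x))
        m = self.memory(h.transpose(1, 2)).transpose(1, 2)   # memory over the time axis
        return h + m                          # hidden output plus memory output

class SentenceVAD(nn.Module):
    """Feature layer -> stacked offline FSMN blocks -> DNN -> speech/non-speech."""
    def __init__(self, feat_dim=80, dim=256, num_blocks=4):
        super().__init__()
        self.feature = nn.Linear(feat_dim, dim)               # feature extraction layer
        self.blocks = nn.Sequential(*[FSMNBlock(dim) for _ in range(num_blocks)])
        self.dnn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 2))           # speech / non-speech logits

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        h = self.blocks(self.feature(feats))
        return self.dnn(h.mean(dim=1))        # pool over time: one decision per sentence
```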
It should be noted that although numbered steps such as step 101 and step 102 are used in the above embodiments to define an order of execution, in actual application scenarios some steps may be executed in parallel, and the order of some steps is not limited by those numbers. The present application is not limited in this respect, and details are not repeated here.
The following describes some problems encountered by the inventors in the process of implementing the present invention, as well as a specific embodiment of the finally determined solution, so that those skilled in the art can better understand the solution of the present application.
In the process of implementing the present invention, the inventors found the following defects in similar technologies:
When the signal-to-noise ratio is low and the background noise is strong, the system easily judges non-speech segments as speech segments and sends them to the back-end recognition system, resulting in a waste of resources.
An ASR system generally relies on VAD to control the input and output of audio. If the VAD control is not good enough, noise audio or other non-human-voice audio is fed into the recognition system, causing falsely triggered recognition and confused recognition results. Especially in interactive scenarios, this means the whole system keeps interrupting and waiting for user input, which is very bad for the user experience.
Making a decision on every single frame of audio covers too short a duration; the result is too local and cannot accurately judge a whole sentence.
In the process of implementing the present invention, the inventors found reasons why this solution is not easy to arrive at:
The usual approach is to raise the decision threshold, which is simple and direct, or to collect counter-example data for training so as to reduce the model's false-trigger probability.
The reason this solution is not obvious: it does not occur to one to decouple the solution from the existing VAD system; improvements keep being made on the existing frame-level system, where substantial improvement is difficult.
The solutions of the embodiments of the present application solve the above technical problems in the prior art as follows:
By appending a sentence-level VAD system after the existing frame-level VAD system, a further whole-sentence-level decision is made on the audio that the previous system judged to be speech, reducing the amount of non-speech segments erroneously flowing to back-end recognition.
Technical innovations of the present invention:
FIG. 3 is a framework diagram of the sentence-level voice activity detection system.
FIG. 4 is a classification flowchart for the original audio.
Regarding FIG. 3, the input of the system is audio that has been screened by the frame-level VAD system (that is, audio that the frame-level VAD system judged to be speech). Features are extracted from this audio, which is then passed through offline FSMN layers and a DNN to give the final result.
Regarding FIG. 4, the flowchart shows the relationship between the sentence-level VAD system and the frame-level VAD system.
Alternative solutions explored by the inventors in the process of implementing the present invention:
Using an SVM or a DNN as a classifier to classify the audio: the advantages are that the system is easy to build and classification is fast; the disadvantage is that the models' expressive power is not strong and their performance on the test set is poor, so this approach was abandoned.
Regarding the input of the system, the initial input was the entire audio, that is, a single speech/non-speech decision was made once per piece of audio. The advantage of this method is that it is best from a global perspective; the disadvantage is that it cannot meet the real-time requirements of online streaming data and introduces serious lag. It was therefore later changed to a segmented decision: a whole piece of audio is split into several segments, each segment is judged separately, and as long as one segment is judged to be speech, the piece of audio is considered to contain speech, so the judgment of the remaining segments can be skipped.
When doing frame-level VAD, we are also based on FSMN models. We use not only the online FSMN model (in which the FSMN uses only a few frames of history and has no look-ahead, i.e., it uses only historical information and no future information) but also the offline FSMN model (in which the FSMN uses both history and a few frames of look-ahead, i.e., historical as well as future information). Although the latency of the offline FSMN model is worse than that of the online one, its performance is much better. Therefore, in scenarios where latency is not a concern, the VAD trained from the offline FSMN can be used.
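A small sketch of how this online/offline distinction might be expressed in code — the online variant pads only on the history side so that no future frames are consumed (again an assumed implementation detail, not one taken from this disclosure):

```python
import torch.nn as nn
import torch.nn.functional as F

class MemoryConv(nn.Module):
    """Temporal memory with separate history (left) and look-ahead (right) orders:
    an online VAD would use right=0 (history only); an offline VAD uses right > 0."""
    def __init__(self, dim=256, left=10, right=0):
        super().__init__()
        self.left, self.right = left, right
        self.conv = nn.Conv1d(dim, dim, kernel_size=left + right + 1,
                              groups=dim, bias=False)

    def forward(self, x):                        # x: (batch, dim, time)
        x = F.pad(x, (self.left, self.right))    # asymmetric padding fixes the context window
        return self.conv(x)
```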
In the process of implementing the present invention, the inventors found that deeper-level effects are achieved:
The most direct effect of this solution is that it reduces audio misjudgments and improves the recall rate for non-speech segments; it further saves back-end recognition resources; it improves the robustness of the whole speech recognition system to noise; and it improves the user experience. Previously, non-speech segments would flow into the recognition system, and the recognized results were often meaningless "um, ah" or NULL; in a voice interaction system, these misrecognitions often cause erroneous interruptions, which significantly degrade the user experience.
Please refer to FIG. 5, which shows a block diagram of a voice activity detection device provided by an embodiment of the present invention.
As shown in FIG. 5, the device includes a first input-detection-output module 510 and a second input-detection-output module 520.
The first input-detection-output module 510 is configured to input the audio to be detected into a frame-level VAD system for frame-level voice activity detection and obtain the first audio output by the frame-level VAD system; the second input-detection-output module 520 is configured to input the first audio into a sentence-level VAD system for sentence-level voice activity detection, obtain the second audio output by the sentence-level VAD system, and perform subsequent processing on the second audio.
It should be understood that the modules shown in FIG. 5 correspond to the steps of the methods described with reference to FIG. 1 and FIG. 2. Therefore, the operations and features described above for the methods, and the corresponding technical effects, also apply to the modules in FIG. 5 and are not repeated here.
It is worth noting that the modules in the embodiments of the present application are not intended to limit the solution of the present application. For example, the first input-detection-output module may be described as a module that inputs the audio to be detected into a frame-level VAD system for frame-level voice activity detection and obtains the first audio output by the frame-level VAD system. In addition, the relevant functional modules may also be implemented by a hardware processor; for example, the first input-detection-output module may be implemented by a processor, which is not repeated here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium. The computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the voice activity detection method of any of the above method embodiments;
As one implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are configured to:
input the audio to be detected into a frame-level VAD system for frame-level voice activity detection, and obtain the first audio output by the frame-level VAD system;
input the first audio into a sentence-level VAD system for sentence-level voice activity detection, obtain the second audio output by the sentence-level VAD system, and perform subsequent processing on the second audio.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the voice activity detection device, and the like. In addition, the non-volatile computer-readable storage medium may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memory located remotely from the processor, and these remote memories may be connected to the voice activity detection device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
An embodiment of the present invention further provides a computer program product. The computer program product comprises a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute any one of the above voice activity detection methods.
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 6, the device includes one or more processors 610 and a memory 620; one processor 610 is taken as an example in FIG. 6. The device for the voice activity detection method may further include an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 6. The memory 620 is the above-mentioned non-volatile computer-readable storage medium. The processor 610 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 620, that is, implements the voice activity detection device method of the above method embodiments. The input device 630 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice activity detection device. The output device 640 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present invention.
As one implementation, the above electronic device is applied in a voice activity detection device and includes:
at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
input the audio to be detected into a frame-level VAD system for frame-level voice activity detection, and obtain the first audio output by the frame-level VAD system;
input the first audio into a sentence-level VAD system for sentence-level voice activity detection, obtain the second audio output by the sentence-level VAD system, and perform subsequent processing on the second audio.
The electronic devices of the embodiments of the present application exist in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication functions, with voice and data communication as the main goals. Such terminals include smart phones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, as well as smart toys and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server consists of a processor, a hard disk, memory, a system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, and manageability are higher.
(5) Other electronic devices with data interaction functions.
The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010867436.6A CN111816216A (en) | 2020-08-25 | 2020-08-25 | Voice activity detection method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010867436.6A CN111816216A (en) | 2020-08-25 | 2020-08-25 | Voice activity detection method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111816216A true CN111816216A (en) | 2020-10-23 |
Family
ID=72859103
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010867436.6A Withdrawn CN111816216A (en) | 2020-08-25 | 2020-08-25 | Voice activity detection method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111816216A (en) |
- 2020-08-25: CN application CN202010867436.6A filed (patent document CN111816216A, not active — withdrawn)
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110264447A1 (en) * | 2010-04-22 | 2011-10-27 | Qualcomm Incorporated | Systems, methods, and apparatus for speech feature detection |
| CN103854662A (en) * | 2014-03-04 | 2014-06-11 | 中国人民解放军总参谋部第六十三研究所 | Self-adaptation voice detection method based on multi-domain joint estimation |
| US20180174583A1 (en) * | 2016-12-21 | 2018-06-21 | Avnera Corporation | Low-power, always-listening, voice command detection and capture |
| CN108346428A (en) * | 2017-09-13 | 2018-07-31 | 腾讯科技(深圳)有限公司 | Voice activity detection and its method for establishing model, device, equipment and storage medium |
| CN110047470A (en) * | 2019-04-11 | 2019-07-23 | 深圳市壹鸽科技有限公司 | A kind of sound end detecting method |
| CN110136749A (en) * | 2019-06-14 | 2019-08-16 | 苏州思必驰信息科技有限公司 | Speaker-related end-to-end voice endpoint detection method and device |
| CN110808073A (en) * | 2019-11-13 | 2020-02-18 | 苏州思必驰信息科技有限公司 | Voice activity detection method, voice recognition method and system |
| CN110992979A (en) * | 2019-11-29 | 2020-04-10 | 北京搜狗科技发展有限公司 | Detection method and device and electronic equipment |
| CN110931048A (en) * | 2019-12-12 | 2020-03-27 | 广州酷狗计算机科技有限公司 | Voice endpoint detection method and device, computer equipment and storage medium |
| CN111312218A (en) * | 2019-12-30 | 2020-06-19 | 苏州思必驰信息科技有限公司 | Neural network training and voice endpoint detection method and device |
Non-Patent Citations (1)
| Title |
|---|
| 牛米佳 等 (Niu Mijia et al.), 蒙古语长音频语音文本自动对齐的研究 (Research on automatic alignment of long Mongolian audio with its text), 《中文信息学报》 (Journal of Chinese Information Processing) * |
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112652296A (en) * | 2020-12-23 | 2021-04-13 | 北京华宇信息技术有限公司 | Streaming voice endpoint detection method, device and equipment |
| CN112652296B (en) * | 2020-12-23 | 2023-07-04 | 北京华宇信息技术有限公司 | Method, device and equipment for detecting streaming voice endpoint |
| CN112786029A (en) * | 2020-12-25 | 2021-05-11 | 苏州思必驰信息科技有限公司 | Method and apparatus for training VAD using weakly supervised data |
| CN112786029B (en) * | 2020-12-25 | 2022-07-26 | 思必驰科技股份有限公司 | Method and apparatus for training VAD using weakly supervised data |
| CN113077821A (en) * | 2021-03-23 | 2021-07-06 | 平安科技(深圳)有限公司 | Audio quality detection method and device, electronic equipment and storage medium |
| CN113077821B (en) * | 2021-03-23 | 2024-07-05 | 平安科技(深圳)有限公司 | Audio quality detection method and device, electronic equipment and storage medium |
| CN114038487A (en) * | 2021-11-10 | 2022-02-11 | 北京声智科技有限公司 | An audio extraction method, apparatus, device and readable storage medium |
| CN114038487B (en) * | 2021-11-10 | 2025-09-05 | 北京声智科技有限公司 | Audio extraction method, device, equipment and readable storage medium |
| CN115132231A (en) * | 2022-08-31 | 2022-09-30 | 安徽讯飞寰语科技有限公司 | Voice activity detection method, device, equipment and readable storage medium |
| CN115132231B (en) * | 2022-08-31 | 2022-12-13 | 安徽讯飞寰语科技有限公司 | Voice activity detection method, device, equipment and readable storage medium |
| CN116168685A (en) * | 2023-02-21 | 2023-05-26 | 浙江大华技术股份有限公司 | Speech recognition method and device, storage medium and electronic device |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110136749B (en) | Method and device for detecting end-to-end voice endpoint related to speaker | |
| CN111816216A (en) | Voice activity detection method and device | |
| CN110648692B (en) | Voice endpoint detection method and system | |
| CN112735385B (en) | Voice endpoint detection method, device, computer equipment and storage medium | |
| CN107393526B (en) | Voice silence detection method, device, computer equipment and storage medium | |
| US9818407B1 (en) | Distributed endpointing for speech recognition | |
| CN108962227B (en) | Voice starting point and end point detection method and device, computer equipment and storage medium | |
| CN111312218B (en) | Neural network training and voice endpoint detection method and device | |
| CN110910885B (en) | Voice wake-up method and device based on decoding network | |
| CN111640456B (en) | Method, device and equipment for detecting overlapping sound | |
| CN110517670A (en) | Method and apparatus for improving wake-up performance | |
| CN111816215A (en) | Voice endpoint detection model training and use method and device | |
| CN110473539A (en) | Promote the method and apparatus that voice wakes up performance | |
| CN113345423B (en) | Voice endpoint detection method, device, electronic equipment and storage medium | |
| CN115803808A (en) | Synthesized speech detection | |
| CN114399992B (en) | Voice instruction response method, device and storage medium | |
| CN112581938B (en) | Speech breakpoint detection method, device and equipment based on artificial intelligence | |
| CN112992191A (en) | Voice endpoint detection method and device, electronic equipment and readable storage medium | |
| CN114299962A (en) | Method, system, device and storage medium for separating conversation role based on audio stream | |
| CN112951219A (en) | Noise rejection method and device | |
| US20230186943A1 (en) | Voice activity detection method and apparatus, and storage medium | |
| CN112614506B (en) | Voice activation detection method and device | |
| CN115273862B (en) | Voice processing method, device, electronic device and medium | |
| CN112185367A (en) | Keyword detection method and device, computer readable storage medium and electronic equipment | |
| CN110808073A (en) | Voice activity detection method, voice recognition method and system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| CB02 | Change of applicant information |
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province Applicant after: Sipic Technology Co.,Ltd. Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province Applicant before: AI SPEECH Co.,Ltd. |
|
| CB02 | Change of applicant information | ||
| WW01 | Invention patent application withdrawn after publication |
Application publication date: 20201023 |
|
| WW01 | Invention patent application withdrawn after publication |