
CN110600008A - Voice wake-up optimization method and system - Google Patents

Voice wake-up optimization method and system

Info

Publication number
CN110600008A
Authority
CN
China
Prior art keywords
acoustic model, wake, level, phoneme, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910899791.9A
Other languages
Chinese (zh)
Inventor
徐俊峰 (Xu Junfeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN201910899791.9A
Publication of CN110600008A
Legal status: Pending

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 - Speech recognition
                    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
                        • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
                    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 - Training
                            • G10L 2015/0631 - Creating reference templates; Clustering
                    • G10L 15/08 - Speech classification or search
                        • G10L 2015/088 - Word spotting
                    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides a method for optimizing voice wake-up. The method includes: constructing a two-level wake-up acoustic model that includes a phoneme-level acoustic model and a word-level acoustic model; performing feature extraction on received speech audio, inputting the extracted acoustic features into the phoneme-level acoustic model of the two-level wake-up acoustic model, and extracting the output features of the phoneme-level acoustic model; using those output features as the input of the word-level acoustic model in the two-level wake-up acoustic model to determine the confidence of the wake-up word; and, when the confidence exceeds a preset wake-up threshold, determining the speech audio to be the wake-up word and performing voice wake-up. An embodiment of the present invention also provides a corresponding optimization system. The embodiments directly reduce the dependence of the final classification result on the accuracy of the phoneme modeling unit, so that the wake-up word can still be correctly identified even when phoneme classification is inaccurate.

Description

Optimization method and system for voice wake-up

Technical field

The present invention relates to the field of intelligent voice dialogue, and in particular to a method and system for optimizing voice wake-up.

Background

Voice wake-up typically uses a deep neural network to acoustically model basic acoustic units, and the acoustic unit is generally the phoneme.

In the voice wake-up technique described above, the modeling unit is the phoneme. The phonemes are first predicted, classified, and post-processed; the similarity between the processed sequence and the wake-up word sequence is then computed, and the device wakes up if the similarity exceeds a threshold, and otherwise does not.
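As a concrete illustration of this prior-art flow, the sketch below compares a hypothetical decoded phoneme sequence against a wake-word phoneme sequence using a normalized edit-distance similarity. The phoneme strings and the 0.7 threshold are illustrative stand-ins, not values from the patent.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def similarity(pred, target):
    """1.0 for identical sequences, approaching 0.0 for fully different ones."""
    if not pred and not target:
        return 1.0
    return 1.0 - edit_distance(pred, target) / max(len(pred), len(target))

WAKE_PHONEMES = ["n", "i", "h", "ao"]  # illustrative wake-word phonemes
THRESHOLD = 0.7                        # illustrative similarity threshold

decoded = ["n", "i", "h", "ao"]        # hypothetical phoneme-model output
wake = similarity(decoded, WAKE_PHONEMES) >= THRESHOLD
```

Under low SNR, the decoded sequence drifts away from the target and the similarity falls below the threshold, which is exactly the failure mode the patent addresses.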

In the course of implementing the present invention, the inventor found at least the following problem in the related art:

this technique relies heavily on the acoustic model's accuracy in classifying the speech signal over the modeling units. Under low signal-to-noise ratio (SNR) conditions, the acoustic model's phoneme classification accuracy is poor, which degrades the wake-up rate in low-SNR scenarios.

Summary of the invention

The embodiments of the present invention aim at least to solve the prior-art problem of the low wake-up rate in low-SNR scenarios.

In a first aspect, an embodiment of the present invention provides a method for optimizing voice wake-up, including:

constructing a two-level wake-up acoustic model, where the two-level wake-up acoustic model includes a phoneme-level acoustic model and a word-level acoustic model;

performing feature extraction on received speech audio, inputting the extracted acoustic features into the phoneme-level acoustic model of the two-level wake-up acoustic model, and extracting the output features of the phoneme-level acoustic model;

using the output features of the phoneme-level acoustic model as the input of the word-level acoustic model in the two-level wake-up acoustic model to determine the confidence of the wake-up word;

when the confidence exceeds a preset wake-up threshold, determining the speech audio to be the wake-up word and performing voice wake-up.

In a second aspect, an embodiment of the present invention provides a system for optimizing voice wake-up, including:

a model construction program module, configured to construct a two-level wake-up acoustic model that includes a phoneme-level acoustic model and a word-level acoustic model;

a feature extraction program module, configured to perform feature extraction on received speech audio, input the extracted acoustic features into the phoneme-level acoustic model of the two-level wake-up acoustic model, and extract the output features of the phoneme-level acoustic model;

a confidence determination program module, configured to use the output features of the phoneme-level acoustic model as the input of the word-level acoustic model in the two-level wake-up acoustic model to determine the confidence of the wake-up word;

a wake-up program module, configured to determine the speech audio to be the wake-up word and perform voice wake-up when the confidence exceeds the preset wake-up threshold.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the voice wake-up optimization method of any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored; when the program is executed by a processor, it implements the steps of the voice wake-up optimization method of any embodiment of the present invention.

The beneficial effect of the embodiments of the present invention is that, on top of one acoustic model, deep acoustic features extracted from a speech signal of a certain length are fed into another classification model for direct classification. This directly reduces the dependence of the final classification result on the accuracy of the phoneme modeling unit, so that the wake-up word can still be correctly identified even when phoneme classification is inaccurate.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for optimizing voice wake-up provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a system for optimizing voice wake-up provided by an embodiment of the present invention.

Detailed description

To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Fig. 1 shows a flowchart of a method for optimizing voice wake-up provided by an embodiment of the present invention, including the following steps:

S11: construct a two-level wake-up acoustic model, where the two-level wake-up acoustic model includes a phoneme-level acoustic model and a word-level acoustic model;

S12: perform feature extraction on received speech audio, input the extracted acoustic features into the phoneme-level acoustic model of the two-level wake-up acoustic model, and extract the output features of the phoneme-level acoustic model;

S13: use the output features of the phoneme-level acoustic model as the input of the word-level acoustic model in the two-level wake-up acoustic model to determine the confidence of the wake-up word;

S14: when the confidence exceeds a preset wake-up threshold, determine the speech audio to be the wake-up word and perform voice wake-up.

This embodiment differs from using a single modeled acoustic model, and also from the usual comparison between the results of two acoustic models: the outputs of the two models are not compared against each other, because under a low signal-to-noise ratio, using multiple acoustic models would not noticeably improve phoneme classification accuracy.

For step S11, unlike a single phoneme-level acoustic model, a two-level wake-up acoustic model is constructed that includes a phoneme-level acoustic model and a word-level acoustic model. The task of an acoustic model is to compute P(O|W), i.e., the probability that the model produces the speech waveform. The acoustic model is an important component of a speech recognition system; it accounts for most of the system's computational cost and determines its performance. Traditional speech recognition systems generally use a GMM-HMM acoustic model, where the GMM models the distribution of the acoustic features and the HMM models the temporal structure of the speech signal. After the rise of deep learning in 2006, deep neural networks (DNNs) were applied to speech acoustic models. The phoneme-level acoustic model determines the probability of each phoneme in the speech waveform, and the word-level acoustic model determines the probability of each word in the speech waveform.

For step S12, to support real-time voice wake-up, the smart device must capture speech audio from its environment in real time. Feature extraction is performed on the received speech audio, the extracted acoustic features are input into the phoneme-level acoustic model of the two-level wake-up acoustic model, and the output features of the phoneme-level acoustic model, such as the phoneme sequence of the speech audio, are extracted.

For step S13, the output features of the phoneme-level acoustic model, such as the sequence output in step S12, are used as the input of the word-level acoustic model in the two-level wake-up acoustic model. Classifying with this second acoustic model yields an explicit classification and thus a more accurate confidence that the user's audio contains the wake-up word.

For step S14, when the confidence exceeds the preset wake-up threshold, the speech audio is determined to be the wake-up word and voice wake-up is performed.

As this embodiment shows, on top of one acoustic model, deep acoustic features extracted from a speech signal of a certain length are fed into another classification model for direct classification. This directly reduces the dependence of the final classification result on the accuracy of the phoneme modeling unit; the wake-up word can still be correctly identified even when phoneme classification is inaccurate.
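The two-level cascade summarized above can be sketched numerically as follows. Random weight matrices stand in for the trained phoneme-level and word-level networks; the dimensions, names, and 0.5 threshold are illustrative assumptions, not values from the patent, and a real implementation would use trained DNNs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 40-dim acoustic features, 50 phoneme classes,
# a 30-frame window feeding the word-level classifier.
N_FEAT, N_PHONE, N_FRAMES = 40, 50, 30

W_phone = rng.normal(size=(N_FEAT, N_PHONE))       # stand-in for the trained phoneme-level model
W_word = rng.normal(size=(N_FRAMES * N_PHONE, 1))  # stand-in for the trained word-level model

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def phoneme_outputs(frames):
    """First level: per-frame phoneme posteriors (the 'output features')."""
    return softmax(frames @ W_phone)

def wake_confidence(outputs):
    """Second level: concatenate the accumulated frames into one 1-D
    feature and classify it directly as wake word vs. not wake word."""
    flat = outputs.reshape(1, -1)
    logit = flat @ W_word
    return 1.0 / (1.0 + np.exp(-float(logit[0, 0])))  # sigmoid confidence

frames = rng.normal(size=(N_FRAMES, N_FEAT))  # dummy acoustic feature frames
posteriors = phoneme_outputs(frames)
confidence = wake_confidence(posteriors)

WAKE_THRESHOLD = 0.5  # illustrative preset wake-up threshold
is_wake_word = confidence > WAKE_THRESHOLD
```

The key design point the sketch reflects is that the second model consumes the first model's raw posteriors rather than a hard phoneme decision, so a misclassified phoneme does not automatically break the wake-word decision.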

As one implementation, in this embodiment, one acoustic model in the two-level wake-up acoustic model is a phoneme-level acoustic model, and the other acoustic model is a word-level acoustic model.

In this implementation, one acoustic model is a phoneme-level acoustic model and the other is a word-level acoustic model. Through repeated experiments, the inventor found that wake-up word recognition using only a phoneme-level acoustic model yields poor wake-up performance under low SNR, and that its recognition performance depends heavily on the phoneme classification accuracy of the phoneme-level model. By connecting a word-level acoustic model on top of the phoneme-level acoustic model and classifying the wake-up word directly, the recognition of the wake-up word can be improved even when phoneme classification is inaccurate, compensating for the weakness of a single phoneme-level acoustic model.

As one implementation, in this embodiment, after the output features of the phoneme-level acoustic model are extracted, the method further includes:

sending the output features of each frame to a feature accumulator;

when the accumulated number of speech audio frames in the feature accumulator reaches a preset threshold, concatenating the output features in the feature accumulator into a one-dimensional feature;

inputting the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.

In this implementation, after the features output by the phoneme-level acoustic model are extracted, the output features of each frame are sent to a feature accumulator. When a certain number of frames has accumulated, these features are concatenated into one complete one-dimensional feature, which is input into the word-level acoustic model. In this way the two acoustic models are coupled, ensuring that both are used.
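A minimal sketch of such a feature accumulator follows, using plain Python lists as per-frame features; the frame threshold of 3 and the 2-dimensional dummy frames are illustrative, not values from the patent.

```python
class FeatureAccumulator:
    """Collects per-frame output features; once the preset number of
    frames is reached, concatenates them into one 1-D feature vector
    and resets itself for the next window."""

    def __init__(self, frame_threshold):
        self.frame_threshold = frame_threshold
        self.frames = []

    def push(self, feature_frame):
        """Add one frame. Returns the concatenated 1-D feature when the
        threshold is reached, otherwise None."""
        self.frames.append(list(feature_frame))
        if len(self.frames) >= self.frame_threshold:
            flat = [x for frame in self.frames for x in frame]  # splice into 1-D
            self.frames.clear()
            return flat
        return None

acc = FeatureAccumulator(frame_threshold=3)
out = None
for i in range(3):
    out = acc.push([float(i), float(i) + 0.5])  # dummy 2-dim frame features
```

After the third push, `out` holds the spliced one-dimensional feature ready to be fed to the word-level model, and the accumulator is empty again.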

As one implementation, in this embodiment, before feature extraction is performed on the received speech audio, the method further includes:

receiving an audio signal in real time from an acoustic sensor, and determining through a voice endpoint detection model whether the audio signal is speech audio;

when the audio signal is speech audio, performing acoustic feature extraction on the received speech.

Because voice wake-up requires detecting received audio in real time, running wake-up detection on every piece of received audio would waste resources. Before feature extraction, the audio signal is received in real time from the acoustic sensor of the smart device, and it is checked whether the signal contains speech audio of a user speaking. Detection runs only when a user is actually speaking, which avoids triggering wake-up detection on every audio signal and improves detection efficiency.
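This gating step can be illustrated with a crude energy-based check; a real system would use a trained voice endpoint detection (VAD) model, and the threshold below is purely illustrative.

```python
def is_speech(samples, energy_threshold=0.01):
    """Crude voice activity check: mean squared amplitude above a
    threshold is treated as speech. Real endpoint detectors use trained
    models; this stand-in only illustrates the gating idea."""
    if not samples:
        return False
    energy = sum(s * s for s in samples) / len(samples)
    return energy > energy_threshold

silence = [0.0] * 160                                 # one 10 ms frame at 16 kHz, all zeros
tone = [0.5 if i % 2 else -0.5 for i in range(160)]   # loud alternating signal
```

Only frames that pass `is_speech` would be forwarded to feature extraction and the two-level wake-up model, so the expensive models stay idle during silence.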

Fig. 2 shows a schematic structural diagram of a system for optimizing voice wake-up provided by an embodiment of the present invention. The system can execute the voice wake-up optimization method of any of the above embodiments and is configured in a terminal.

The voice wake-up optimization system provided by this embodiment includes: a model construction program module 11, a feature extraction program module 12, a confidence determination program module 13, and a wake-up program module 14.

The model construction program module 11 is configured to construct a two-level wake-up acoustic model that includes a phoneme-level acoustic model and a word-level acoustic model; the feature extraction program module 12 is configured to perform feature extraction on received speech audio, input the extracted acoustic features into the phoneme-level acoustic model of the two-level wake-up acoustic model, and extract the output features of the phoneme-level acoustic model; the confidence determination program module 13 is configured to use the output features of the phoneme-level acoustic model as the input of the word-level acoustic model in the two-level wake-up acoustic model to determine the confidence of the wake-up word; the wake-up program module 14 is configured to determine the speech audio to be the wake-up word and perform voice wake-up when the confidence exceeds the preset wake-up threshold.

Further, one of the acoustic models is a phoneme-level acoustic model, and the other acoustic model is a word-level acoustic model.

Further, after the feature extraction program module, the system further includes a feature accumulation program module, configured to:

send the output features of each frame to a feature accumulator;

when the accumulated number of speech audio frames in the feature accumulator reaches a preset threshold, concatenate the output features in the feature accumulator into a one-dimensional feature;

input the one-dimensional feature into the other acoustic model to complete the coupling of the two models.

Further, the feature extraction program module is also configured to:

receive an audio signal in real time from an acoustic sensor, and determine through a voice endpoint detection model whether the audio signal is speech audio;

when the audio signal is speech audio, perform acoustic feature extraction on the received speech.

An embodiment of the present invention also provides a non-volatile computer storage medium; the computer storage medium stores computer-executable instructions that can execute the voice wake-up optimization method of any of the above method embodiments.

As one implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions that are configured to:

construct a two-level wake-up acoustic model, where the two-level wake-up acoustic model includes a phoneme-level acoustic model and a word-level acoustic model;

perform feature extraction on received speech audio, input the extracted acoustic features into the phoneme-level acoustic model of the two-level wake-up acoustic model, and extract the output features of the phoneme-level acoustic model;

use the output features of the phoneme-level acoustic model as the input of the word-level acoustic model in the two-level wake-up acoustic model to determine the confidence of the wake-up word;

when the confidence exceeds a preset wake-up threshold, determine the speech audio to be the wake-up word and perform voice wake-up.

As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the voice wake-up optimization method of any of the above method embodiments.

The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the device, and the like. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memory located remotely from the processor, and these remote memories may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

本发明实施例还提供一种电子设备,其包括:至少一个处理器,以及与所述至少一个处理器通信连接的存储器,其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行本发明任一实施例的语音唤醒的优化方法的步骤。An embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor , the instructions are executed by the at least one processor, so that the at least one processor can execute the steps of the method for optimizing wake-up by voice according to any embodiment of the present invention.

本申请实施例的客户端以多种形式存在,包括但不限于:The clients in the embodiments of the present application exist in various forms, including but not limited to:

(1)移动通信设备:这类设备的特点是具备移动通信功能,并且以提供话音、数据通信为主要目标。这类终端包括:智能手机、多媒体手机、功能性手机,以及低端手机等。(1) Mobile communication equipment: This type of equipment is characterized by having mobile communication functions, and its main goal is to provide voice and data communication. Such terminals include: smart phones, multimedia phones, feature phones, and low-end phones.

(2)超移动个人计算机设备:这类设备属于个人计算机的范畴,有计算和处理功能,一般也具备移动上网特性。这类终端包括:PDA、MID和UMPC设备等,例如平板电脑。(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has the characteristics of mobile Internet access. Such terminals include: PDAs, MIDs, and UMPC devices, such as tablet computers.

(3)便携式娱乐设备:这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器,掌上游戏机,电子书,以及智能玩具和便携式车载导航设备。(3) Portable entertainment equipment: This type of equipment can display and play multimedia content. Such devices include: audio and video players, handheld game consoles, e-books, as well as smart toys and portable car navigation devices.

(4)其他具有数据处理功能的电子装置。(4) Other electronic devices with data processing functions.

在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”,不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。In this document, relational terms such as first and second, etc. are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such existence between these entities or operations. The actual relationship or sequence. Furthermore, the terms "comprising" and "comprising" include not only those elements, but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprises" does not preclude the presence of additional identical elements in a process, method, article, or device that includes the element.

以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in One place, or it can be distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above-mentioned technical solutions can be embodied in the form of software products in essence or the parts that make contributions to the prior art, and the computer software products can be stored in computer-readable storage media, such as ROM/RAM, magnetic A disc, an optical disc, etc., includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in various embodiments or some parts of the embodiments.

Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, and some of their technical features can be replaced by equivalents, without these modifications or replacements causing the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A voice wake-up optimization method, comprising:
constructing a two-stage wake-up acoustic model, the two-stage wake-up acoustic model comprising a phoneme acoustic model and a word-level acoustic model;
performing feature extraction on received speech audio, inputting the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extracting the output features of the phoneme acoustic model;
determining a confidence score for the wake-up word by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
when the confidence score exceeds a preset wake-up threshold, determining the speech audio to be the wake-up word and performing voice wake-up.

2. The method according to claim 1, wherein, after the extracting of the output features of the phoneme acoustic model, the method further comprises:
sending the output features of each frame to a feature accumulator;
when the number of frames of speech audio accumulated in the feature accumulator reaches a preset threshold, splicing the output features in the feature accumulator into a one-dimensional feature;
inputting the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.

3. The method according to claim 1, wherein, before the feature extraction is performed on the received speech audio, the method further comprises:
receiving an audio signal in real time from an acoustic sensor, and determining whether the audio signal is speech audio through a voice endpoint detection model;
when the audio signal is speech audio, performing acoustic feature extraction on the received dialogue speech.

4. A voice wake-up optimization system, comprising:
a model construction program module, configured to construct a two-stage wake-up acoustic model, the two-stage wake-up acoustic model comprising a phoneme acoustic model and a word-level acoustic model;
a feature extraction program module, configured to perform feature extraction on received speech audio, input the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extract the output features of the phoneme acoustic model;
a confidence determination program module, configured to determine a confidence score for the wake-up word by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
a wake-up program module, configured to determine the speech audio to be the wake-up word and perform voice wake-up when the confidence score exceeds a preset wake-up threshold.

5. The system according to claim 4, wherein the system further comprises a feature accumulation program module, configured to:
send the output features of each frame to a feature accumulator;
when the number of frames of speech audio accumulated in the feature accumulator reaches a preset threshold, splice the output features in the feature accumulator into a one-dimensional feature;
input the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.

6. The system according to claim 4, wherein the feature extraction program module is further configured to:
receive an audio signal in real time from an acoustic sensor, and determine whether the audio signal is speech audio through a voice endpoint detection model;
when the audio signal is speech audio, perform acoustic feature extraction on the received dialogue speech.

7. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1-3.

8. A storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method according to any one of claims 1-3 are implemented.
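The two-stage pipeline of claims 1-3 can be illustrated with a minimal sketch: a voice-endpoint gate, a per-frame phoneme model whose outputs go into a feature accumulator, splicing into a one-dimensional feature once a preset number of frames has accumulated, and a word-level model that yields a confidence compared against a wake-up threshold. All names, dimensions, the threshold value, and the two placeholder "models" below are hypothetical illustrations, not the patent's actual networks or parameters.

```python
import numpy as np

# Hypothetical sizes; the claims do not specify any of these values.
N_PHONES = 32         # dimension of the phoneme model's per-frame output
ACCUM_FRAMES = 40     # preset frame-count threshold of the feature accumulator
WAKE_THRESHOLD = 0.8  # preset wake-up threshold on the confidence score

def phoneme_model(frame_feats: np.ndarray) -> np.ndarray:
    """Stand-in for the first-stage phoneme acoustic model: maps one frame
    of acoustic features to a phoneme posterior vector (softmax output)."""
    logits = np.random.randn(N_PHONES)        # placeholder for a real network
    e = np.exp(logits - logits.max())
    return e / e.sum()

def word_model(spliced: np.ndarray) -> float:
    """Stand-in for the second-stage word-level acoustic model: maps the
    spliced one-dimensional feature to a wake-word confidence in [0, 1]."""
    return float(1.0 / (1.0 + np.exp(-spliced.mean())))  # placeholder scorer

def run_wakeup(frames) -> bool:
    """Run the two-stage wake-up decision over a stream of speech frames."""
    accumulator = []                           # the "feature accumulator"
    for frame in frames:
        accumulator.append(phoneme_model(frame))
        if len(accumulator) == ACCUM_FRAMES:
            # Splice the accumulated per-frame outputs into one 1-D feature,
            # coupling the phoneme model to the word-level model.
            spliced = np.concatenate(accumulator)  # shape (ACCUM_FRAMES * N_PHONES,)
            confidence = word_model(spliced)
            accumulator.clear()
            if confidence > WAKE_THRESHOLD:
                return True                    # treat the audio as the wake word
    return False
```

Because the second stage consumes the first stage's compact per-frame outputs rather than raw audio, the word-level model only needs to run once per accumulated window, which is the coupling described in claims 2 and 5.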
CN201910899791.9A 2019-09-23 2019-09-23 Voice wake-up optimization method and system Pending CN110600008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910899791.9A CN110600008A (en) 2019-09-23 2019-09-23 Voice wake-up optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910899791.9A CN110600008A (en) 2019-09-23 2019-09-23 Voice wake-up optimization method and system

Publications (1)

Publication Number Publication Date
CN110600008A true CN110600008A (en) 2019-12-20

Family

ID=68862451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910899791.9A Pending CN110600008A (en) 2019-09-23 2019-09-23 Voice wake-up optimization method and system

Country Status (1)

Country Link
CN (1) CN110600008A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
CN103632667A (en) * 2013-11-25 2014-03-12 华为技术有限公司 Acoustic model optimization method and device, voice awakening method and device, as well as terminal
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161714A (en) * 2019-12-25 2020-05-15 联想(北京)有限公司 Voice information processing method, electronic equipment and storage medium
CN111429901A (en) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 IoT chip-oriented multi-stage voice intelligent awakening method and system
CN113129873A (en) * 2021-04-27 2021-07-16 思必驰科技股份有限公司 Optimization method and system for stack type one-dimensional convolution network awakening acoustic model
CN113241059A (en) * 2021-04-27 2021-08-10 标贝(北京)科技有限公司 Voice wake-up method, device, equipment and storage medium
CN113450771A (en) * 2021-07-15 2021-09-28 维沃移动通信有限公司 Awakening method, model training method and device
CN113590207A (en) * 2021-07-30 2021-11-02 思必驰科技股份有限公司 Method and device for improving awakening effect
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN113707132B (en) * 2021-09-08 2024-03-01 北京声智科技有限公司 Awakening method and electronic equipment
CN115762480A (en) * 2022-10-24 2023-03-07 浙江大华技术股份有限公司 Voice wake-up method, voice wake-up device and storage medium
CN115862604A (en) * 2022-11-24 2023-03-28 镁佳(北京)科技有限公司 Voice wakeup model training and voice wakeup method, device and computer equipment
CN115862604B (en) * 2022-11-24 2024-02-20 镁佳(北京)科技有限公司 Voice awakening model training and voice awakening method and device and computer equipment

Similar Documents

Publication Publication Date Title
CN110600008A (en) Voice wake-up optimization method and system
US8972260B2 (en) Speech recognition using multiple language models
CN111862942B (en) Training method and system for hybrid speech recognition model of Mandarin and Sichuan dialect
CN106297777B (en) Method and device for waking up voice service
CN110136749A (en) Speaker-related end-to-end voice endpoint detection method and device
CN106098059A (en) customizable voice awakening method and system
CN110517670A (en) Method and apparatus for improving wake-up performance
CN108010515A (en) A kind of speech terminals detection and awakening method and device
CN109584865A (en) A kind of application control method, device, readable storage medium storing program for executing and terminal device
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
WO2017084334A1 (en) Language recognition method, apparatus and device and computer storage medium
CN110503944B (en) Method and device for training and using voice wake-up model
CN111583906A (en) Character recognition method, device and terminal for voice conversation
CN111832308A (en) Speech recognition text coherence processing method and device
CN111179915A (en) Age identification method and device based on voice
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
WO2024008215A2 (en) Speech emotion recognition method and apparatus
CN110706691B (en) Voice verification method and device, electronic equipment and computer readable storage medium
CN111105803A (en) Method and device for quickly identifying gender and method for generating algorithm model for identifying gender
CN111540363A (en) Keyword model and decoding network construction method, detection method and related equipment
CN111062209A (en) Natural language processing model training method and natural language processing model
CN110597958A (en) Text classification model training and use method and device
CN110223676A (en) The optimization method and system of deception recording detection neural network model
CN110491394B (en) Method and device for acquiring wake-up corpus
CN116884410A (en) Cognitive detection method, cognitive detection device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20191220