Disclosure of Invention
An object of the present disclosure is to provide a voice wake-up detection scheme capable of ensuring good wake-up performance without significantly increasing resource occupation.
According to a first aspect of the present disclosure, there is provided a voice wake-up detection method, including: inputting a target frame in multi-frame audio data, together with the audio data frames within a predetermined range near the target frame, into an acoustic model component, wherein the acoustic model component is a feedforward sequential memory network (FSMN) model component, and the output of the acoustic model component is a state recognition result of at least one frame among the target frame and the audio data frames within the predetermined range; taking a single unprocessed frame of audio data located after the target frame in the multi-frame audio data as the next target frame, and iteratively processing the subsequent target frames using the acoustic model component; and comparing the state recognition results of a plurality of frames in the multi-frame audio data with a preset wake-up word, so as to recognize whether the multi-frame audio data is a wake-up instruction.
Optionally, the audio data frames within the predetermined range include: audio data frames in the multi-frame audio data that are within a first predetermined range before the target frame; and/or audio data frames in the multi-frame audio data that are within a second predetermined range after the target frame.
Optionally, the voice wake-up detection method further includes: detecting voice input of a user in real time; and performing framing processing on the detected voice input to obtain the multi-frame audio data.
Optionally, the step of comparing the state recognition results of the plurality of frames in the multi-frame audio data with a preset wake-up word includes: searching, from a plurality of preset path models, for a path model matching the state recognition results, so as to recognize whether the multi-frame audio data is a wake-up instruction, wherein different path models correspond to different recognition results.
Optionally, the path models comprise: a wake-up instruction model; a filler model; and a silence model.
Optionally, the acoustic model comprises: an input layer; a hidden layer structure; and a plurality of output layers for respectively predicting the analysis results of different frames of audio data in the input.
Optionally, the hidden layer structure includes a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is configured to store history information and future information useful for determining the current target frame.
Optionally, the output of the memory module is used as the input of the next hidden layer, and the output of the memory module includes the output of the current hidden layer, the output of the hidden layer with the predetermined look-back order, and the output of the hidden layer with the predetermined look-ahead order.
Optionally, the computation satisfies:

$$h_t^{l+1} = f\left(U^l \tilde{m}_t^l + b^{l+1}\right)$$

$$\tilde{m}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$$

wherein $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ represents a weight; $\tilde{m}_t^l$ represents the output of the memory module; $b^{l+1}$ represents an offset; $h_t^l = f(W^l x_t^l + b^l)$ represents the output of the l-th hidden layer, with $x_t^l$ representing the input of the l-th hidden layer, $W^l$ a weight, and $b^l$ an offset; $t$ represents the current time; $s_1$ and $s_2$ represent the coding stride factors for historical and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the coding coefficients of the memory module. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l$ may be regarded as the output of the hidden layers of the predetermined look-back order, i.e., the result of bitwise multiplying the hidden-layer outputs at the $N_1$ look-back positions before the current time $t$, spaced by the coding stride factor $s_1$, by their corresponding coding coefficients; the term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$ may be regarded as the output of the hidden layers of the predetermined look-ahead order, i.e., the result of bitwise multiplying the hidden-layer outputs at the $N_2$ look-ahead positions after the current time $t$, spaced by the coding stride factor $s_2$, by their corresponding coding coefficients.
According to a second aspect of the present disclosure, there is also provided a voice wake-up detection apparatus, including: a state recognition module for inputting a target frame in multi-frame audio data, together with the audio data frames within a predetermined range near the target frame, into an acoustic model component, the acoustic model component being a feedforward sequential memory network (FSMN) model component whose output is a state recognition result of at least one frame among the target frame and the audio data frames within the predetermined range, wherein the state recognition module takes a single unpredicted frame of audio data located after the target frame in the multi-frame audio data as the next target frame to be analyzed, and iteratively processes the subsequent target frames using the acoustic model component; and a wake-up recognition module for comparing the state recognition results of a plurality of frames in the multi-frame audio data with a preset wake-up word, so as to recognize whether the multi-frame audio data is a wake-up instruction.
Optionally, the audio data frames within the predetermined range include: audio data frames in the multi-frame audio data that are within a first predetermined range before the target frame; and/or audio data frames in the multi-frame audio data that are within a second predetermined range after the target frame.
Optionally, the voice wake-up detection apparatus further includes: the detection module is used for detecting the voice input of a user in real time; and the framing module is used for framing the detected voice input to obtain multi-frame audio data.
Optionally, the wake-up identification module searches a path model matched with the state identification result of the audio data of the plurality of frames from a plurality of preset path models to identify whether the audio data of the plurality of frames is a wake-up instruction, wherein different path models correspond to different identification results.
Optionally, the path models comprise: a wake-up instruction model; a filler model; and a silence model.
Optionally, the acoustic model comprises: an input layer; a hidden layer structure; and a plurality of output layers for respectively predicting the analysis results of different frames of audio data in the input.
Optionally, the hidden layer structure includes a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is configured to store history information and future information useful for determining the current target frame.
Optionally, the output of the memory module is used as the input of the next hidden layer, and the output of the memory module includes the output of the current hidden layer, the output of the hidden layer with the predetermined look-back order, and the output of the hidden layer with the predetermined look-ahead order.
Optionally, the computation satisfies:

$$h_t^{l+1} = f\left(U^l \tilde{m}_t^l + b^{l+1}\right)$$

$$\tilde{m}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$$

wherein $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ represents a weight; $\tilde{m}_t^l$ represents the output of the memory module; $b^{l+1}$ represents an offset; $h_t^l = f(W^l x_t^l + b^l)$ represents the output of the l-th hidden layer, with $x_t^l$ representing the input of the l-th hidden layer, $W^l$ a weight, and $b^l$ an offset; $t$ represents the current time; $s_1$ and $s_2$ represent the coding stride factors for historical and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the coding coefficients of the memory module. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l$ may be regarded as the output of the hidden layers of the predetermined look-back order, i.e., the result of bitwise multiplying the hidden-layer outputs at the $N_1$ look-back positions before the current time $t$, spaced by the coding stride factor $s_1$, by their corresponding coding coefficients; the term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$ may be regarded as the output of the hidden layers of the predetermined look-ahead order, i.e., the result of bitwise multiplying the hidden-layer outputs at the $N_2$ look-ahead positions after the current time $t$, spaced by the coding stride factor $s_2$, by their corresponding coding coefficients.
According to a third aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform a method as set forth in the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as set forth in the first aspect of the present disclosure.
By combining multi-frame prediction with an FSMN, the wake-up detection of the present disclosure reduces the number of frames that need to be computed by a multiple, thereby greatly reducing device-side resource occupation, while still ensuring good wake-up performance under this smaller resource footprint and meeting the real-time requirements of wake-up.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[ scheme overview ]
When an acoustic model is used for voice wake-up detection, the current frame in the multi-frame audio data is generally taken as the input of the acoustic model to obtain an output for that frame. To improve the accuracy of the output, the input frame currently to be processed may be spliced with audio data of a certain length before and after it, so that the input contains the associated context information of the input frame. Thus, when the current frame is processed (i.e., predicted) using the acoustic model, the input is the audio data within a certain range before and after the current frame, including the current frame itself, while only the prediction result for the current frame is output.
When voice wake-up detection is performed in this multi-frame-input, single-frame-output mode, two adjacent inputs share audio of a certain length; that is, their features partially overlap and are therefore similar to a degree. Since the acoustic model predicts only the current frame, processing these overlapping features wastes resources, and the more the features overlap, the more obvious the waste.
As shown in fig. 1, scales 0 to 9 represent consecutive multi-frame audio data after framing. In the present disclosure, the section of audio data from scale 0 to scale 1 may be regarded as the 1st frame of audio data, the section from scale 1 to scale 2 as the 2nd frame, and so on. Assume that, for the input frame currently to be predicted, the input to the acoustic model is that frame spliced with the following frames to a total length of 4 frames. When the 1st frame is predicted, the 1st to 4th frames may be used as input; when the 2nd frame is predicted, the 2nd to 5th frames may be used as input; and when the 3rd frame is predicted, the 3rd to 6th frames may be used as input.
It can be seen that there is duplicate audio data in the 1 st and 2 nd inputs (2 nd frame-4 th frame), there is duplicate audio data in the 2 nd and 3 rd inputs (3 rd frame-5 th frame), and there is also duplicate audio data in the 1 st and 3 rd inputs (3 rd frame, 4 th frame).
After the acoustic model processes the 1 st input to obtain the prediction result of the 1 st frame of audio data, the 2 nd input is processed to predict the 2 nd frame of audio data, and the 2 nd to 4 th frames of audio data in the current input are processed data when the 1 st input is processed. When the acoustic model continues to process the 3 rd input to predict the 3 rd frame of audio data, the 3 rd to 5 th frames in the current input are data processed by the model when the 2 nd input is processed, and the 3 rd and 4 th frames in the current input are data processed by the model when the 1 st input is processed. It can be seen that such repeated features (or similar features) among adjacent inputs are somewhat wasteful of computing resources.
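The redundancy just described can be made concrete with a small sketch; the 4-frame window length follows the example of fig. 1, and the helper function is illustrative, not part of the disclosure:

```python
# Sketch of the conventional one-to-one prediction mode described above.
# Window length (4 frames) matches the example of fig. 1; frame indices are 1-based.

def conventional_inputs(num_frames, window=4):
    """Return the list of input windows when every frame is predicted separately."""
    windows = []
    for target in range(1, num_frames - window + 2):
        windows.append(list(range(target, target + window)))
    return windows

windows = conventional_inputs(6)            # frames 1..6, as in fig. 1
# 1st input predicts frame 1, 2nd input predicts frame 2, ...
overlap_1_2 = set(windows[0]) & set(windows[1])   # frames processed twice
```

Each adjacent pair of windows re-processes three of its four frames, which is exactly the wasted computation the multi-frame prediction scheme below removes.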
In view of this, the present disclosure proposes that the output of the acoustic model be modified using a Multi-Frame Prediction (MFP) method, changing the "one-to-one prediction mode" into a "one-to-many prediction mode". Specifically, since the input already contains the input frame and its associated context, the acoustic model may be adapted to predict the input frame and one or more other frames contained in the input. The number of frames that need to be computed can thereby be reduced by a multiple, greatly reducing device-side resource occupation.
Further, as described in the background section, wake-up performance is also important on the premise of keeping resource occupation small. Since a wake-up word carries little contextual information, the decision of whether to wake up depends entirely on the acoustic model. To pursue better performance, i.e., a higher recall rate and a lower false wake-up rate, acoustic modeling often adopts a larger-scale model structure with stronger data-expression capability; meanwhile, the wake-up technology has high requirements on real-time rate and latency, which determine how quickly the product responds after the user utters a wake-up word, and both indicators are directly affected by the computation amount and structure of the acoustic model. There is thus some contradiction between the two. Therefore, the main problem in voice wake-up technology at present is how to ensure good wake-up performance and real-time response without significantly increasing resource occupation.
To obtain better analysis performance, the acoustic modeling part currently mostly adopts a Deep Neural Network (DNN). Compared with other neural network structures, a DNN has an obvious advantage in computation amount, but its drawback is that it cannot exploit long-term information, which limits the achievable performance.
To make up for this disadvantage, a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) may be adopted, which improves model performance through the recurrent connections of the network and the ability of the LSTM units to store history information. However, the structure of the LSTM unit and the recurrence mechanism require a large amount of computing resources, which is disadvantageous for resource-constrained device-side products (e.g., mobile products).
The inventors of the present disclosure have noticed that a Feedforward Sequential Memory Network (FSMN) introduces a memory module on the basis of a DNN and, at the cost of a small amount of extra computation, obtains a large performance improvement. Taking a model with four hidden layers of 512 nodes each as an example, with the same number of inputs and outputs, the per-frame computation of the FSMN is only about 1% higher than that of the DNN, while the computation of an LSTM is 5 times that of the FSMN; and when FSMN and LSTM models of equal computation are compared, the performance of the FSMN model is far superior. Thus, in the present disclosure, the acoustic model may employ an FSMN model, improving wake-up performance while reducing resource occupation.
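As a rough illustration of why the memory module is cheap, a back-of-the-envelope multiply-accumulate count can be sketched as follows; the input/output sizes and the look-back/look-ahead orders are illustrative assumptions, not figures given in the text:

```python
# Rough per-frame multiply-accumulate (MAC) counts for a 4x512 network,
# illustrating that the memory module adds only a small fraction of computation.
# Input/output widths (512) and memory orders (N1 = N2 = 10) are assumptions.

def dnn_macs(layers=4, width=512, n_in=512, n_out=512):
    """MACs of a plain DNN forward pass (dense layers only)."""
    sizes = [n_in] + [width] * layers + [n_out]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

def fsmn_extra_macs(width=512, n1=10, n2=10, modules=1):
    """Extra MACs of the memory modules: (N1 + N2) element-wise products."""
    return modules * (n1 + n2) * width

base = dnn_macs()
extra = fsmn_extra_macs()
overhead = extra / base      # on the order of one percent
```

Under these assumed sizes the memory module adds well under 2% to the dense-layer cost, which is consistent in spirit with the roughly 1% figure quoted above.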
The following further describes aspects of the present disclosure.
[ multiframe prediction ]
The mechanism for implementing the voice wake-up detection method of the present disclosure is described below with reference to fig. 2. Fig. 2 is a schematic flow chart diagram illustrating a voice wake-up detection method according to an embodiment of the present disclosure.
Referring to fig. 2, in step S210, audio data frames within a predetermined range near a target frame in a plurality of frames of audio data are input to an acoustic model component together with the target frame.
The target frame may be regarded as a frame to be processed currently in the multi-frame audio data, and the audio data frames within a predetermined range near the target frame may be audio data frames within a certain time length range before and/or after the target frame. For example, the audio data frames in the multi-frame audio data that are located in a first predetermined range before the target frame may be used, or the audio data frames in the multi-frame audio data that are located in a second predetermined range after the target frame may be used. Preferably, the audio data frames within the first predetermined range and the audio data frames within the second predetermined range may be included at the same time, so that the input may contain the association information of the target frame context at the same time.
Generally, if the first predetermined range and the second predetermined range are set too small, the input contains only limited context information about the target frame, which reduces the accuracy of the state recognition result that the acoustic model component obtains for the target frame; if they are set too large, computing resources are wasted. The specific values of the first predetermined range and the second predetermined range may therefore be determined experimentally. In the present disclosure, the first predetermined range and the second predetermined range each cover at least a single frame duration, and are preferably integer multiples of the frame length. In other words, the audio data within the predetermined range near the target frame may or may not consist of whole frames; the disclosure is not limited in this respect. As a preferred embodiment, the input may include one or several frames of audio data before and/or after the target frame.
The input includes the target frame currently to be analyzed and the audio data frames within the predetermined range nearby, for example, the frames within a certain length before and after the target frame. The acoustic model can therefore be modified such that its output is the state recognition result (i.e., the prediction result) of the target frame and of at least one frame of the audio data within the predetermined range. In the present disclosure, an acoustic model component may be seen as an aggregation of software and/or hardware resources capable of implementing the processing functionality of the acoustic model; the output of the acoustic model is thus also the output of the acoustic model component. The structure of the acoustic model component and the state recognition result are described in detail below.
It should be noted that, to improve the accuracy of the output of the acoustic model component, the "at least one frame of audio data" mentioned in the present disclosure may refer to any one or more of the complete frames contained in the audio data within the predetermined range. For example, where the predetermined range is the two frames after the target frame, the input may be regarded as three frames of audio data including the target frame. For the target frame, the two following frames may be regarded as its context information; for the intermediate frame, the target frame and the last frame may be regarded as its context; and for the last frame, the target frame and the intermediate frame may be regarded as its context. Thus, the acoustic model component may be adapted to predict the target frame, the intermediate frame, and the last frame respectively, obtaining an analysis result for each. Of course, the acoustic model component may also be modified to predict only the target frame and the intermediate frame, obtaining analysis results for those two frames.
In step S220, a single unprocessed frame of audio data located after the target frame in the multi-frame audio data is taken as the next target frame, and the subsequent target frames are processed iteratively using the acoustic model component.
In a conventional scheme, the multi-frame audio data would need to be input into the acoustic model component frame by frame to obtain a prediction result for each frame. With the audio analysis scheme of the present disclosure, when the acoustic model component is used to recognize the states of the multi-frame audio data, inputs can instead be taken at a predetermined interval (one or several frames apart), so that the computation amount is reduced to 1/N, greatly reducing the computing-resource occupation of device-side products. N may be an integer greater than or equal to 2, and its specific value may be set according to the actual situation; the disclosure is not limited in this respect.
As shown in figs. 3A and 3B, scales 0 to 10 represent consecutive multi-frame audio data. In the present disclosure, the section of audio data from scale 0 to scale 1 may be regarded as the 1st frame of audio data, the section from scale 1 to scale 2 as the 2nd frame, and so on. Assume that, for the input frame currently to be predicted, the input to the acoustic model component is that frame spliced with the following frames to a total length of 4 frames. When the 1st frame is predicted, the 1st to 4th frames may be used as input. Unlike fig. 1, for the 1st input, the acoustic model component may predict the states of the 1st frame and of one or more frames following it. Since the 1st input contains the 1st to 4th frames, the acoustic model component could in theory be adapted to predict the states of the 1st, 2nd, 3rd, and 4th frames respectively. However, considering prediction accuracy, the acoustic model component preferably predicts the states of the frames that have context within the input; for example, it may predict the states of the 1st, 2nd, and 3rd frames.
As shown in fig. 3A, for the 1 st input, as an example, the acoustic model component may predict the states of the 1 st frame and the frame (i.e., the 2 nd frame) after the 1 st frame to obtain the state recognition results of the 1 st frame and the 2 nd frame, respectively. Therefore, after the 1 st input is processed, the acoustic model component can take the unanalyzed 3 rd frame audio data as the current target frame to be predicted, then splice the audio data with the length of 3 frames after the 3 rd frame as the 2 nd input, and input the audio data into the acoustic model component, and the acoustic model component can predict the 3 rd frame and the next frame (namely, the 4 th frame) after the 3 rd frame so as to respectively obtain the prediction results of the 3 rd frame and the 4 th frame. Thus, inter-frame (one frame apart) prediction can be achieved, reducing the amount of computation to 1/2.
As shown in fig. 3B, for the 1 st input, as an example, the acoustic model component may predict the 1 st frame and the two frames (i.e., the 2 nd frame and the 3 rd frame) after the 1 st frame to obtain the prediction results (i.e., the state recognition results) of the 1 st frame, the 2 nd frame and the 3 rd frame, respectively. Therefore, after the 1 st input is processed by the acoustic model, the unprocessed 4 th frame of audio data can be used as a current target frame to be predicted, then the audio data with the length of 3 frames after the 4 th frame is spliced to be used as the 2 nd input and input into the acoustic model component, and the acoustic model component can predict the states of the 4 th frame and the two frames after the 4 th frame (namely, the 5 th frame and the 6 th frame) to respectively obtain the prediction results of the 4 th frame, the 5 th frame and the 6 th frame. Thus, inter-frame (two frames apart) prediction can be achieved, reducing the amount of computation to 1/3.
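The strided, one-to-many processing of figs. 3A and 3B can be sketched as follows; the window length and prediction count are the example values used above, and the helper is illustrative only:

```python
# Sketch of multi-frame prediction (MFP) with a stride of n_predicted frames.
# Each 4-frame input window yields predictions for n_predicted frames at once,
# so the model runs only on every n_predicted-th target frame
# (fig. 3A: n_predicted = 2, fig. 3B: n_predicted = 3).

def mfp_inputs(num_frames, window=4, n_predicted=2):
    """Return (input_window, predicted_frames) pairs for strided processing."""
    steps = []
    target = 1
    while target + window - 1 <= num_frames:
        window_frames = list(range(target, target + window))
        predicted = list(range(target, target + n_predicted))
        steps.append((window_frames, predicted))
        target += n_predicted        # skip frames that were just predicted
    return steps

steps = mfp_inputs(10, window=4, n_predicted=2)   # the fig. 3A case
```

For 10 frames, the model is invoked 4 times instead of once per frame, matching the roughly 1/2 computation reduction described above.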
In step S230, the state recognition results of the audio data of multiple frames in the multi-frame audio data are compared with the preset wake-up word to recognize whether the multi-frame audio data is a wake-up command.
The multi-frame audio data mentioned in the present disclosure may be obtained by performing a framing process on the detected speech input. For example, a user's voice input may be detected in real time and then the detected voice input may be framed to obtain a plurality of frames of audio data.
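A minimal framing sketch is given below; the 25 ms frame length and 10 ms shift at a 16 kHz sample rate are common choices assumed for illustration, not values fixed by the disclosure:

```python
# Minimal framing sketch: split detected speech samples into fixed-length frames.
# 400 samples / 160-sample shift correspond to 25 ms / 10 ms at 16 kHz (assumed).

def frame_signal(samples, frame_len=400, frame_shift=160):
    """Split a 1-D sample list into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames

frames = frame_signal([0.0] * 16000)    # 1 s of silence at 16 kHz
```

Each resulting frame would then be the unit fed (spliced with its neighbors) into the acoustic model component.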
For each input, the acoustic model component may be configured to predict the state of the target frame and of at least one frame of the audio data frames within the predetermined range. For example, the acoustic model component may calculate the score (i.e., probability) of each such frame being in each state, and the state with the highest score may be taken as the state recognition result of the corresponding frame.
Therefore, the state of each frame of audio data can be determined based on the state recognition result of the frame of audio data, and a phoneme can be recognized according to the states of several consecutive frames of audio data, and a plurality of phonemes can be combined into a word. Therefore, whether the multi-frame audio data contains the wake-up instruction or not can be identified according to the state identification result of the plurality of frames in the multi-frame audio data. For example, the state recognition results of the multiple frames may be compared with a preset wake-up word, and if the state recognition results of the audio data of the multiple frames are consistent with the wake-up word, it may be determined that the audio data of the multiple frames includes a wake-up command. When it is determined that the multi-frame audio data includes the wake-up instruction, the subsequent wake-up operation may be performed, which is not described in detail herein.
As an example, a plurality of path models may be preset, with different path models corresponding to different wake-up recognition results. Based on the state recognition results of the plurality of frames in the multi-frame audio data, a matching path model can be searched for among the preset path models, so as to recognize whether the multi-frame audio data is a wake-up instruction. The path models may include a Wake-up Instruction Model (which may also be referred to as a "Keyword Model"), a Filler Model, and a Silence Model. There may be multiple wake-up instruction models, each corresponding to a different wake-up instruction (i.e., wake-up word); for example, there may be wake-up instruction models respectively corresponding to "open", "play", "I want to see", and so on. The Filler Model serves as a filler characterizing audio that is not part of a wake-up instruction. The Silence Model refers to an audio model for the absence of speech input. When it is determined that the multi-frame audio data includes a wake-up instruction, the subsequent wake-up operation may be performed, which is not described in detail here.
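The matching of state recognition results against preset path models might look like the following hypothetical sketch; the state labels, model names, and the simple collapse-and-compare rule are all illustrative assumptions (a real decoder would typically use a weighted graph search):

```python
# Hypothetical sketch of matching per-frame state results against preset path models.
# State sequences and model names below are illustrative, not from the disclosure.

PATH_MODELS = {
    "wake:open": ["sil", "o", "p", "e", "n", "sil"],   # a wake-up instruction model
    "filler":    ["fil"],                              # filler model (non-wake speech)
    "silence":   ["sil"],                              # silence model (no speech input)
}

def collapse(states):
    """Merge runs of identical per-frame states into a single state each."""
    merged = []
    for s in states:
        if not merged or merged[-1] != s:
            merged.append(s)
    return merged

def match_path(frame_states):
    """Return the name of the path model whose state sequence matches, else None."""
    seq = collapse(frame_states)
    for name, path in PATH_MODELS.items():
        if seq == path:
            return name
    return None

result = match_path(["sil", "sil", "o", "o", "p", "e", "e", "n", "sil"])
is_wake = result is not None and result.startswith("wake:")
```

Here a run of per-frame states that collapses to the keyword path is treated as a wake-up; anything matching only the filler or silence model is not.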
[ ACOUSTIC MODEL ]
In the present disclosure, to improve the analytical performance of the acoustic model component, the acoustic model component may be an FSMN model component. In addition, the output of the acoustic model component is modified, so that the acoustic model component can respectively predict a plurality of frames in the input.
Fig. 4 is a schematic diagram illustrating a network structure of an acoustic model component according to an embodiment of the present disclosure.
As shown in fig. 4, the network structure of the acoustic model component may include an Input Layer, a Hidden Layer structure, and a plurality of Output layers. The output layers are used for respectively predicting the analysis results of the audio data of a plurality of different frames in the input.
The hidden layer structure may include a plurality of hidden layers, and the plurality of output layers may all be connected to the last hidden layer. During training, a conventional model prepares one target value per frame, whereas the acoustic model of the present disclosure needs to provide target values for the current frame and the following frames. In actual use, each input produces multi-frame output, so inputs are needed only every several frames, reducing the computation amount to 1/N of the original; the computing resources saved are invaluable for resource-constrained device-side products.
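The "one shared hidden stack, several output layers" idea can be sketched in a few lines; the layer sizes, random weights, and two-head configuration are illustrative assumptions:

```python
# Sketch of the "one input window, multiple output layers" structure described above.
# Layer sizes and random weights are illustrative; two heads predict two frames at once.
import random

random.seed(0)

def make_weights(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(vec, weights):
    """Dense layer: weights is a rows x cols list-of-lists applied to vec."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

hidden_w = make_weights(8, 16)                     # shared hidden layer
head_ws = [make_weights(4, 8) for _ in range(2)]   # two output heads, one per predicted frame

x = [0.5] * 16                                     # features of the spliced input window
h = [max(0.0, v) for v in linear(x, hidden_w)]     # ReLU hidden activation (shared)
outputs = [linear(h, w) for w in head_ws]          # one state-score vector per predicted frame
```

The hidden computation is done once per input window, and only the small per-head projections are duplicated, which is what makes multi-frame output nearly free.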
In the present disclosure, the audio data frames within the predetermined range near the target frame may be spliced with the target frame and input directly to the input layer, which performs feature extraction on the input data and passes the result to the hidden layer structure. Alternatively, the frames may first be spliced, features may be extracted from the spliced audio data, and the extracted features may then be fed to the input layer, which passes them to the hidden layer structure.
The hidden layer structure may employ an FSMN structure. The core difference of FSMN compared with the ordinary DNN layer is that a memory module is added between the adjacent hidden layers, and the memory module is used for storing the history information and the future information which are useful for judging the current target frame. The output of the memory module is used as the input of the next hidden layer, and the output of the memory module may include the output of the current hidden layer, the output of the hidden layer with the predetermined look-back order, and the output of the hidden layer with the predetermined look-ahead order.
Fig. 5 is a diagram showing the structure of the introduced FSMN.
As shown in fig. 5, the core difference between an FSMN and a conventional DNN layer is that a memory module B is added, in which part of the past and future information is stored; B processes this information and then transmits it to the next hidden layer, which gives the network the capability of processing long-term information. In order to reduce the calculation amount, the previous hidden layer may first output to a module A whose dimensionality is smaller than that of the previous hidden layer; this is equivalent to splitting the parameter matrix from the previous hidden layer to B into two parts, and a reasonable configuration of A can reduce the calculation amount without losing performance. The calculation of the FSMN layer is expressed as follows.
$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right)$ (1)

$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ (2)

Wherein $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained through the nonlinear transformation of an activation function f; $U^l$ represents a weight; $\tilde{h}_t^l$ represents the output of the memory module; $b^{l+1}$ represents an offset; $h_t^l$ represents the output of the l-th hidden layer, obtained from the input $x_t^l$ of the l-th hidden layer as $h_t^l = f(W^l x_t^l + b^l)$, wherein $W^l$ represents a weight and $b^l$ represents an offset; t represents the current time; $s_1$ and $s_2$ represent the coding stride factors for historical times and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the coding coefficients of the memory module.

According to equation (2), the output of the memory module is the sum of the output of the current hidden layer, the outputs of the hidden layer at the predetermined look-back orders, and the outputs of the hidden layer at the predetermined look-ahead orders. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l$ may be regarded as the hidden-layer outputs of the predetermined look-back orders: the hidden-layer outputs at the different look-back orders before the current time t, taken at the coding stride factor $s_1$, are multiplied element-wise by the corresponding coding coefficients and summed. Likewise, the term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ may be regarded as the hidden-layer outputs of the predetermined look-ahead orders: the hidden-layer outputs at the different look-ahead orders after the current time t, taken at the coding stride factor $s_2$, are multiplied element-wise by the corresponding coding coefficients and summed.
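A minimal pure-Python sketch of the memory-module output described above, i.e. the sum of the current hidden-layer output and the coefficient-weighted look-back and look-ahead hidden-layer outputs (variable names and the skipping of out-of-range time indices are assumptions):

```python
def memory_module_output(h, t, a, c, s1, s2):
    """Memory-module output at time t.

    h  -- list of hidden-layer output vectors over time
    a  -- look-back coding coefficient vectors, orders 1..N1
    c  -- look-ahead coding coefficient vectors, orders 1..N2
    s1, s2 -- coding stride factors for past and future times
    """
    dim = len(h[t])
    out = list(h[t])  # start from the current hidden-layer output
    for i, coef in enumerate(a, start=1):  # look-back orders
        idx = t - s1 * i
        if 0 <= idx < len(h):  # skip positions outside the sequence
            for d in range(dim):
                out[d] += coef[d] * h[idx][d]  # element-wise product
    for j, coef in enumerate(c, start=1):  # look-ahead orders
        idx = t + s2 * j
        if 0 <= idx < len(h):
            for d in range(dim):
                out[d] += coef[d] * h[idx][d]
    return out
```

With one-dimensional hidden outputs [1.0], [2.0], [3.0], one look-back order with coefficient 0.5 and one look-ahead order with coefficient 0.25, the output at t = 1 is 2.0 + 0.5·1.0 + 0.25·3.0 = 3.25.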
The difference in the amount of calculation between FSMN and DNN comes from equation (2). Specific calculations show that, for similar network architectures (the same or a similar number of layers and of nodes per layer), the number of floating-point operations of an FSMN is close to that of a DNN, whereas the calculation amount of an LSTM is more than twice that of a DNN. Therefore, the calculation amount introduced by the FSMN is far less than that introduced by an LSTM of equal structure, so that the model can effectively control the real-time rate while possessing a long-term information modeling capability that the DNN lacks, and its performance is superior to that of the LSTM.
Fig. 6 is a network architecture diagram illustrating acoustic model components according to an embodiment of the present disclosure.
As shown in fig. 6, the network structure of the acoustic model component may include an input layer, a hidden layer structure composed of a DNN layer and an FSMN layer, and a plurality of output layers. The DNN layer structure is well known to those skilled in the art and is not described herein. For descriptions of the input layer, the FSMN layer, and the output layers, see the above description, which is not repeated here.
Fig. 7 is a structural framework diagram illustrating a voice wake-up system according to an embodiment of the present disclosure.
As shown in fig. 7, the voice wake-up system of the present disclosure mainly includes a detection module 710, an acoustic prediction module 720 and a keyword detection module 730.
The detection module 710 may detect a voice input of a user in real time, and may perform framing processing on the detected voice input to obtain multiple frames of audio data.
The acoustic prediction module 720 may predict the state recognition result of each frame of audio data in the multiple frames of audio data. In the prediction process, the acoustic prediction module 720 may splice the audio data within a predetermined range near the target frame currently to be analyzed in the multi-frame audio data with the target frame as an input, and feed it to a pre-trained acoustic model component, where the acoustic model component may respectively predict the state recognition result of at least one frame of audio data among the target frame and the audio data within the predetermined range. A single frame of audio data that is located after the target frame and has not yet been predicted may then be used as the next target frame to be analyzed; in this way, the acoustic prediction module 720 may iteratively process a plurality of subsequent target frames using the acoustic model component. For the network structure of the acoustic model component, see the above description, which is not repeated here.
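The iteration over target frames performed by the acoustic prediction module can be sketched as follows (the `model` callable, the window bounds, and the truncation at the end of the sequence are stand-ins for illustration, not the disclosed model):

```python
def predict_states(frames, model, context_left, context_right, extra_outputs):
    """Iterate target frames through an acoustic model stand-in.

    Each call to `model(window, t)` is assumed to return state results for
    the target frame t plus `extra_outputs` following frames, so the target
    index advances by extra_outputs + 1 per step.
    """
    states = []
    t = 0
    while t < len(frames):
        window = frames[max(0, t - context_left): t + context_right + 1]
        # Truncate outputs that would run past the end of the audio.
        states.extend(model(window, t)[: len(frames) - t])
        t += extra_outputs + 1
    return states
```

With a dummy model that labels frames by index and 2 extra outputs per pass, 7 frames require only 3 model calls while still yielding one state per frame.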
According to the state recognition results of the audio data of a plurality of frames in the multi-frame audio data, the keyword detection module 730 may search for a path model matching the state recognition results among a plurality of path models. The path models may be classified into a keyword model, a filler model, and a silence model. When the state recognition results match the keyword model, it can be determined that the user has issued a wake-up instruction, and the device can then be controlled to start, thereby realizing voice wake-up of the device.
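A simplified stand-in for matching state recognition results against the keyword, filler and silence path models (the subsequence rule and the string model names are illustrative assumptions; a real system would decode against trained path models):

```python
def contains_in_order(states, keyword_states):
    """True if keyword_states occur within `states` in order, a simplified
    stand-in for decoding along a keyword path model."""
    it = iter(states)
    # `s in it` advances the iterator, enforcing in-order matching.
    return all(s in it for s in keyword_states)

def classify(states, keyword_states):
    """Map a state sequence to one of three illustrative path models."""
    if not states:
        return "silence"
    if contains_in_order(states, keyword_states):
        return "keyword"
    return "filler"
```

For example, a state sequence that contains the keyword states in order (possibly interleaved with other states) matches the keyword model, an empty sequence maps to silence, and anything else maps to the filler model.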
[ VOICE WAKE-UP DETECTING DEVICE ]
The voice wake-up detection method of the present disclosure may also be implemented as a voice wake-up detection apparatus.
Fig. 8 is a schematic block diagram illustrating the structure of a voice wake-up detecting apparatus according to an embodiment of the present disclosure. The functional modules of the voice wake-up detection apparatus can be implemented by hardware, software or a combination of hardware and software for implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 8 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the voice wake-up detection apparatus may have and operations that each functional module may perform are briefly described, and for details related thereto, reference may be made to the description above in conjunction with fig. 2 to fig. 6, and details are not repeated here.
Referring to fig. 8, the voice wake-up detecting apparatus 800 includes a state recognition module 810 and a wake-up recognition module 820. The state recognition module 810 is configured to splice an audio data frame within a predetermined range near a target frame in multiple frames of audio data with the target frame as an input, and feed it to a pre-trained acoustic model component, where the acoustic model component is a feedforward sequential memory network (FSMN) component, and the output of the acoustic model component is a state recognition result of at least one frame of audio data among the target frame and the audio data frames within the predetermined range. The state recognition module 810 may use a single frame of audio data that is located after the target frame and has not yet been predicted as the next target frame, and iteratively process a plurality of subsequent target frames using the acoustic model component.
The wake-up recognition module 820 may identify, based on the state recognition results of the audio data of a plurality of frames in the multi-frame audio data, whether the multi-frame audio data is a wake-up instruction. For example, the wake-up recognition module 820 may compare the state recognition results of the audio data of multiple frames in the multi-frame audio data with a preset wake-up word to recognize whether the multi-frame audio data is a wake-up instruction. As an example, the wake-up recognition module 820 may search for a path model matching the state recognition results of the audio data of the plurality of frames among a plurality of path models to identify whether the multi-frame audio data is a wake-up instruction, where different path models correspond to different recognition results. The path models may include a wake-up instruction model, a filler model, and a silence model.
In the present disclosure, the audio data frames within the predetermined range may include: audio data frames which are positioned in a first preset range before the target frame in the multi-frame audio data; and/or audio data frames in the plurality of frames of audio data which are positioned in a second preset range after the target frame.
As shown in fig. 8, the voice wake-up detecting apparatus 800 may further optionally include a detection module 830 and a framing module 840, which are shown by dashed boxes in the figure. The detection module 830 is configured to detect a voice input of a user in real time, and the framing module 840 is configured to perform framing processing on the detected voice input to obtain the multi-frame audio data.
As shown in fig. 4, in the present embodiment, the network structure of the acoustic model component may include: an input layer; a hidden layer structure; and a plurality of output layers for predicting analysis results of audio data of different frames in the input, respectively.
The hidden layer structure may include a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is configured to store history information and future information useful for determining the current target frame. And the output of the memory module is used as the input of the next hidden layer, and the output of the memory module comprises the output of the current hidden layer, the output of the hidden layer with the preset look-back order and the output of the hidden layer with the preset look-ahead order.
The calculation of the hidden layer is expressed as follows.
$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right)$ (1)

$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ (2)

Wherein $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained through the nonlinear transformation of an activation function f; $U^l$ represents a weight; $\tilde{h}_t^l$ represents the output of the memory module; $b^{l+1}$ represents an offset; $h_t^l$ represents the output of the l-th hidden layer, obtained from the input $x_t^l$ of the l-th hidden layer as $h_t^l = f(W^l x_t^l + b^l)$, wherein $W^l$ represents a weight and $b^l$ represents an offset; t represents the current time; $s_1$ and $s_2$ represent the coding stride factors for historical times and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the coding coefficients of the memory module. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l$ may be regarded as the hidden-layer outputs of the predetermined look-back orders: the hidden-layer outputs at the different look-back orders before the current time t, taken at the coding stride factor $s_1$, are multiplied element-wise by the corresponding coding coefficients and summed. The term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ may be regarded as the hidden-layer outputs of the predetermined look-ahead orders: the hidden-layer outputs at the different look-ahead orders after the current time t, taken at the coding stride factor $s_2$, are multiplied element-wise by the corresponding coding coefficients and summed.
[ calculating device ]
Fig. 9 is a schematic structural diagram of a computing device for data processing, which can be used to implement the audio analysis and voice wake-up detection methods according to an embodiment of the present invention.
Referring to fig. 9, computing device 900 includes memory 910 and processor 920.
The processor 920 may be a multi-core processor or may include multiple processors. In some embodiments, processor 920 may include a general-purpose main processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), Digital Signal Processor (DSP), or the like. In some embodiments, processor 920 may be implemented using custom circuits, such as Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs).
The memory 910 may include various types of storage units, such as a system memory, a read-only memory (ROM), and a permanent storage device. The ROM may store static data or instructions required by the processor 920 or other modules of the computer. The permanent storage device may be a read-write storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as a dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a mini SD card, or a Micro-SD card), or a magnetic floppy disk. The computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 910 has executable code stored thereon which, when executed by the processor 920, causes the processor 920 to perform the audio analysis and voice wake-up detection methods described above.
The audio analysis and voice wake detection methods, apparatuses, and computing devices according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.