
CN110619871A - Voice wake-up detection method, device, equipment and storage medium


Info

Publication number
CN110619871A
Authority
CN
China
Prior art keywords
audio data
frame
wake
frames
input
Prior art date
Legal status
Granted
Application number
CN201810637168.1A
Other languages
Chinese (zh)
Other versions
CN110619871B (en)
Inventor
陈梦喆
雷鸣
高杰
张仕良
刘勇
姚海涛
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810637168.1A
Publication of CN110619871A
Application granted
Publication of CN110619871B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure proposes a voice wake-up detection method, apparatus, device, and storage medium. Audio data frames within a predetermined range near a target frame in multi-frame audio data are input to an acoustic model component together with the target frame. The acoustic model component is a feedforward sequential memory network (FSMN) model component, and its output is the state recognition result of at least one frame of audio data among the target frame and the audio data frames within the predetermined range. A single frame of audio data that follows the target frame and has not yet been processed is taken as the next target frame, and the acoustic model component iteratively processes subsequent target frames in this way. The state recognition results of multiple frames of the multi-frame audio data are then compared with a preset wake-up word to identify whether the multi-frame audio data is a wake-up instruction. This reduces device-side resource occupation while maintaining good wake-up performance and meeting the real-time requirements of wake-up.

Description

Voice wake-up detection method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of voice technologies, and in particular, to a voice wake-up detection method, apparatus, device, and storage medium.
Background
Voice wake-up refers to switching a device from a sleep state to an active state when a user speaks a specific voice instruction (i.e., a wake-up word). Wake-up technology lets the user operate a device entirely by voice, hands-free; at the same time, the wake-up mechanism means the device need not stay in a working state at all times, which greatly saves energy. Voice wake-up is now widely applied in voice-controlled products such as robots, mobile phones, wearable devices, smart homes, and vehicles.
Such products generally need to work both with and without a network connection. Since wake-up is the first step of any interaction, it must work normally even offline, which means it must be implemented with the storage and computing resources of the device itself. Device-side computing resources are usually very limited: in CPU core count, memory size, and clock frequency, such devices fall far short of an ordinary computer, let alone a cloud server. In the offline case these limited resources must serve not only wake-up but also signal processing, semantic understanding, and other tasks, so wake-up, as a high-frequency component, needs to occupy as few resources as possible.
Moreover, beyond keeping resource occupation small, wake-up performance is also important. Because a wake-up word carries little contextual information, the decision whether to wake up depends entirely on the acoustic model. To pursue better performance, i.e., a higher recall rate and a lower false wake-up rate, acoustic modeling often adopts a larger model structure with stronger expressive power; meanwhile, wake-up technology has strict requirements on real-time rate and latency, which determine how quickly the product responds after the user utters the wake-up word, and both indicators are directly affected by the computational load and structure of the acoustic model. There is thus a tension between these two goals. How to ensure good wake-up performance and meet real-time requirements without significantly increasing resource occupation is therefore the main problem in current voice wake-up technology.
Disclosure of Invention
An object of the present disclosure is to provide a voice wake-up detection scheme capable of ensuring good wake-up performance without significantly increasing resource occupation.
According to a first aspect of the present disclosure, there is provided a voice wake-up detection method, including: inputting audio data frames within a predetermined range near a target frame in multi-frame audio data, together with the target frame, into an acoustic model component, where the acoustic model component is a feedforward sequential memory network model component and its output is a state recognition result of at least one frame of audio data among the target frame and the audio data frames within the predetermined range; taking a single frame of audio data that is located after the target frame and has not yet been processed as the next target frame, and iteratively processing subsequent target frames using the acoustic model component; and comparing the state recognition results of multiple frames of the multi-frame audio data with a preset wake-up word to recognize whether the multi-frame audio data is a wake-up instruction.
Optionally, the audio data frames within the predetermined range include: audio data frames within a first predetermined range before the target frame in the multi-frame audio data; and/or audio data frames within a second predetermined range after the target frame in the multi-frame audio data.
Optionally, the voice wake-up detection method further includes: detecting voice input of a user in real time; and performing framing processing on the detected voice input to obtain the multi-frame audio data.
Optionally, the step of comparing the state recognition results of the audio data of the plurality of frames with the preset wake-up word includes: searching, among a plurality of preset path models, for a path model matching the state recognition results, to identify whether the multi-frame audio data is a wake-up instruction, where different path models correspond to different recognition results.
Optionally, the path models comprise: a wake-up instruction model; a filler model; and a silence model.
Optionally, the acoustic model component comprises: an input layer; a hidden layer structure; and a plurality of output layers for respectively predicting the analysis results of audio data of different frames in the input.
Optionally, the hidden layer structure includes a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is configured to store history information and future information useful for determining the current target frame.
Optionally, the output of the memory module is used as the input of the next hidden layer, and the output of the memory module includes the output of the current hidden layer, the output of the hidden layer with the predetermined look-back order, and the output of the hidden layer with the predetermined look-ahead order.
Optionally, the computation of the hidden layer with the memory module is expressed as:

$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right) \qquad (1)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l \qquad (2)$$

where $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained by the nonlinear transformation of an activation function $f$; $U^l$ represents a weight matrix; $\tilde{h}_t^l$ represents the output of the memory module; $b^{l+1}$ represents a bias; $h_t^l$ represents the output of the $l$-th hidden layer and $x_t^l$ its input, with $h_t^l = f(W^l x_t^l + b^l)$, where $W^l$ is a weight matrix and $b^l$ a bias; $t$ represents the current time; $s_1$ and $s_2$ represent the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module; and $\odot$ denotes element-wise (bit-by-bit) multiplication. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l$ may be regarded as the output of the hidden layer at the predetermined look-back order: the hidden-layer outputs at the different look-back orders before the current time $t$, spaced by the stride factor $s_1$, each multiplied element-wise by its corresponding encoding coefficient. Likewise, $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$ may be regarded as the output of the hidden layer at the predetermined look-ahead order: the hidden-layer outputs at the different look-ahead orders after $t$, spaced by $s_2$, each multiplied element-wise by its corresponding encoding coefficient.
According to a second aspect of the present disclosure, there is also provided a voice wake-up detection apparatus, including: a state recognition module configured to input audio data frames within a predetermined range near a target frame in multi-frame audio data, together with the target frame, into an acoustic model component, where the acoustic model component is a feedforward sequential memory network model component and its output is a state recognition result of at least one frame of audio data among the target frame and the audio data frames within the predetermined range, the state recognition module taking a single frame of audio data that is located after the target frame and has not yet been predicted as the next target frame to be analyzed and iteratively processing subsequent target frames using the acoustic model component; and a wake-up recognition module configured to compare the state recognition results of multiple frames of the multi-frame audio data with a preset wake-up word, so as to recognize whether the multi-frame audio data is a wake-up instruction.
Optionally, the audio data frames within the predetermined range include: audio data frames within a first predetermined range before the target frame in the multi-frame audio data; and/or audio data frames within a second predetermined range after the target frame in the multi-frame audio data.
Optionally, the voice wake-up detection apparatus further includes: the detection module is used for detecting the voice input of a user in real time; and the framing module is used for framing the detected voice input to obtain multi-frame audio data.
Optionally, the wake-up recognition module searches, among a plurality of preset path models, for a path model matching the state recognition results of the audio data of the plurality of frames, to identify whether the multi-frame audio data is a wake-up instruction, where different path models correspond to different recognition results.
Optionally, the path models comprise: a wake-up instruction model; a filler model; and a silence model.
Optionally, the acoustic model component comprises: an input layer; a hidden layer structure; and a plurality of output layers for respectively predicting the analysis results of audio data of different frames in the input.
Optionally, the hidden layer structure includes a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is configured to store history information and future information useful for determining the current target frame.
Optionally, the output of the memory module is used as the input of the next hidden layer, and the output of the memory module includes the output of the current hidden layer, the output of the hidden layer with the predetermined look-back order, and the output of the hidden layer with the predetermined look-ahead order.
Optionally, the computation of the hidden layer with the memory module is expressed as:

$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right) \qquad (1)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l \qquad (2)$$

where $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained by the nonlinear transformation of an activation function $f$; $U^l$ represents a weight matrix; $\tilde{h}_t^l$ represents the output of the memory module; $b^{l+1}$ represents a bias; $h_t^l$ represents the output of the $l$-th hidden layer and $x_t^l$ its input, with $h_t^l = f(W^l x_t^l + b^l)$, where $W^l$ is a weight matrix and $b^l$ a bias; $t$ represents the current time; $s_1$ and $s_2$ represent the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module; and $\odot$ denotes element-wise (bit-by-bit) multiplication. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l$ may be regarded as the output of the hidden layer at the predetermined look-back order: the hidden-layer outputs at the different look-back orders before the current time $t$, spaced by the stride factor $s_1$, each multiplied element-wise by its corresponding encoding coefficient. Likewise, $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$ may be regarded as the output of the hidden layer at the predetermined look-ahead order: the hidden-layer outputs at the different look-ahead orders after $t$, spaced by $s_2$, each multiplied element-wise by its corresponding encoding coefficient.
According to a third aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform a method as set forth in the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as set forth in the first aspect of the present disclosure.
By combining multi-frame prediction with the FSMN, the wake-up detection of the present disclosure reduces the number of frames that must be computed by an integer factor, thereby greatly reducing device-side resource occupation; at the same time, with this smaller resource footprint it still ensures good wake-up performance and meets the real-time requirements of wake-up.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 is an exemplary diagram showing an analysis manner for multi-frame audio data.
Fig. 2 is a schematic flow chart diagram illustrating a voice wake-up detection method according to an embodiment of the present disclosure.
Fig. 3A and 3B are exemplary diagrams illustrating an analysis manner for multi-frame audio data according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating a structure of an acoustic model according to an embodiment of the present disclosure.
Fig. 5 is a diagram showing the structure of the FSMN with the introduced memory module.
Fig. 6 is a schematic diagram illustrating a structure of an acoustic model according to an embodiment of the present disclosure.
Fig. 7 is a structural framework diagram illustrating a voice wake-up system according to an embodiment of the present disclosure.
Fig. 8 is a schematic block diagram illustrating the structure of a voice wake-up detecting apparatus according to an embodiment of the present disclosure.
Fig. 9 shows a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[ scheme overview ]
When an acoustic model is used for voice wake-up detection, the current frame of the multi-frame audio data is generally used as input to obtain an output for that frame. To improve the accuracy of the output, audio data of a certain length before and after the frame to be processed may be spliced with it as the model input, so that the input carries the frame's contextual information. In this scheme, processing (i.e., predicting) the current frame takes as input the current frame plus the audio data within a certain range before and after it, yet outputs a prediction result for the current frame only.
With this multi-frame-input, single-frame-output mode of wake-up detection, two adjacent inputs share a stretch of repeated audio: their features overlap to a certain extent and are therefore largely similar. Because the acoustic model predicts only the current frame, this feature overlap wastes resources during prediction, and the more the features overlap, the more obvious the waste.
As shown in fig. 1, scales 0 to 9 mark consecutive multi-frame audio data after framing. In the present disclosure, the audio between scale 0 and scale 1 may be regarded as the 1st frame, the audio between scale 1 and scale 2 as the 2nd frame, and so on. Assume that, for the frame currently to be predicted, the input to the acoustic model is that frame spliced with audio data of 3 further frame lengths. Predicting the 1st frame then takes frames 1-4 as input; predicting the 2nd frame takes frames 2-5; predicting the 3rd frame takes frames 3-6.
It can be seen that the 1st and 2nd inputs share duplicate audio data (frames 2-4), the 2nd and 3rd inputs share frames 3-5, and even the 1st and 3rd inputs share frames 3 and 4.
After the acoustic model processes the 1st input to obtain the prediction for frame 1, it processes the 2nd input to predict frame 2, yet frames 2-4 of that input were already processed as part of the 1st input. When it goes on to the 3rd input to predict frame 3, frames 3-5 were already processed with the 2nd input, and frames 3 and 4 were already processed with the 1st input. Such repeated (or similar) features across adjacent inputs waste computing resources.
In view of this, the present disclosure proposes modifying the output of the acoustic model with Multi-Frame Prediction (MFP), changing the "one-to-one" prediction mode into a "one-to-many" mode. In particular, since the input already contains the frame to be predicted together with its context, the acoustic model may be adapted to predict that frame and one or more other frames contained in the same input. The number of inputs that must be computed is thereby reduced by an integer factor, greatly reducing device-side resource occupation.
Further, as described in the background section, wake-up performance remains important even with a small resource footprint: since the wake-up word carries little context, the decision rests entirely on the acoustic model; better performance (higher recall, fewer false wake-ups) favors larger, more expressive model structures, while strict real-time-rate and latency requirements, which determine how quickly the product responds to the wake-up word, are directly affected by the model's computational load and structure. The main problem is therefore to ensure good wake-up performance and real-time behavior without significantly increasing resource occupation.
To obtain better analysis performance, acoustic modeling at present mostly adopts a Deep Neural Network (DNN). Compared with other neural network structures, a DNN has a clear advantage in computational cost, but it cannot exploit long-term information, which limits its performance.
To make up for this shortcoming, a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) may be adopted: the recurrent connections and the LSTM unit's ability to store historical information improve model performance. However, the LSTM unit structure and the recurrence require substantial computing resources, which is a disadvantage for resource-constrained device-side products (e.g., mobile products).
The inventors of the present disclosure have noted that a Feedforward Sequential Memory Network (FSMN) adds a memory module on top of a DNN, gaining a large performance improvement for a small increase in computation. Taking a model with four hidden layers of 512 nodes each as an example, with the same input and output dimensions, the FSMN computes only about 1% more per frame of data than the DNN, while the LSTM computes 5 times as much as the FSMN; and at equal computational cost, an FSMN model performs far better than an LSTM model. Thus, in the present disclosure the acoustic model may employ an FSMN, improving wake-up performance while reducing resource occupation.
The following further describes aspects of the present disclosure.
[ multiframe prediction ]
The mechanism for implementing the voice wake-up detection method of the present disclosure is described below with reference to fig. 2. Fig. 2 is a schematic flow chart diagram illustrating a voice wake-up detection method according to an embodiment of the present disclosure.
Referring to fig. 2, in step S210, audio data frames within a predetermined range near a target frame in a plurality of frames of audio data are input to an acoustic model component together with the target frame.
The target frame may be regarded as the frame of the multi-frame audio data currently to be processed, and the audio data frames within the predetermined range near it may be those within a certain time span before and/or after it. For example, the input may include audio data frames within a first predetermined range before the target frame, or audio data frames within a second predetermined range after it. Preferably, frames from both ranges are included at the same time, so that the input contains contextual information on both sides of the target frame.
If the first and second predetermined ranges are set too small, the input contains limited context for the target frame and the accuracy of the state recognition result produced by the acoustic model component drops; if they are set too large, computing resources are wasted. The specific values of the two ranges may therefore be determined experimentally. In the present disclosure each range spans at least one frame duration and is preferably an integer multiple of the frame length; in other words, the audio data within the predetermined range need not be a whole number of frames, and the disclosure is not limited in this respect. In a preferred embodiment, the range covers one or several frames of audio data preceding and/or following the target frame.
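By way of illustration only, the context splicing described above may be sketched as follows. The function name, the use of whole frames, and the repetition of boundary frames for padding are assumptions of this example rather than details fixed by the disclosure:

```python
import numpy as np

def splice_window(frames: np.ndarray, t: int, n_before: int, n_after: int) -> np.ndarray:
    """Splice the n_before frames preceding target frame t and the n_after
    frames following it with the target frame itself, forming one model input.
    frames: (T, D) array of per-frame feature vectors. Positions that fall
    outside the utterance are padded by repeating the edge frame."""
    T = len(frames)
    idx = [min(max(i, 0), T - 1) for i in range(t - n_before, t + n_after + 1)]
    return frames[idx].reshape(-1)  # one (n_before + 1 + n_after) * D vector
```

For instance, `splice_window(feats, t=10, n_before=3, n_after=3)` would yield the spliced features of seven consecutive frames centered on frame 10.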
The input comprises the target frame currently to be analyzed together with the audio data frames within the nearby predetermined range, for example a certain number of frame lengths before and after the target frame. The acoustic model may therefore be modified so that its output is the state recognition result (i.e., the prediction result) of the target frame and of at least one frame of audio data within the predetermined range. In the present disclosure, an acoustic model component may be seen as the aggregation of software and/or hardware resources implementing the processing functionality of the acoustic model; the output of the acoustic model is thus also the output of the acoustic model component. The structure of the acoustic model component and the state recognition result are described in detail below.
It should be noted that, to keep the output accurate, the "at least one frame of audio data" mentioned in this disclosure may be any one or more of the complete frames contained in the audio data within the predetermined range. For example, when the predetermined range is the two frames after the target frame, the input can be regarded as three frames of audio data. For the target frame, the following two frames serve as its context; for the intermediate frame, the target frame and the last frame serve as its context; and for the last frame, the target frame and the intermediate frame serve as its context. The acoustic model component may thus be adapted to predict the target frame, the intermediate frame, and the last frame, obtaining an analysis result for each; alternatively, it may be adapted to predict only the target frame and the intermediate frame.
In step S220, a single frame of audio data, which is located after the target frame and is not processed, of the multi-frame of audio data is used as a next target frame, and the subsequent multiple target frames are processed iteratively using the acoustic model component.
In the conventional scheme, obtaining a prediction for every frame requires feeding the frames into the acoustic model component one by one. With the audio analysis scheme of the present disclosure, when recognizing the states of multi-frame audio data the inputs can instead advance by a preset interval (one frame or several frames), so the computation is reduced to 1/N and the computing resources occupied on device-side products drop greatly. N may be any integer greater than or equal to 2, and its specific value may be set according to the actual situation; the disclosure is not limited in this respect.
As shown in figs. 3A and 3B, scales 0 to 10 mark consecutive multi-frame audio data; the audio between scale 0 and scale 1 is the 1st frame, between scale 1 and scale 2 the 2nd frame, and so on. Assume again that the input to the acoustic model component is the frame currently to be predicted spliced with 3 further frame lengths of audio, so predicting the 1st frame takes frames 1-4 as input. Unlike in fig. 1, for this 1st input the acoustic model component may predict the states of the 1st frame and of one or more frames after it. Since the 1st input contains frames 1-4, in principle the component could be adapted to predict the states of frames 1, 2, 3, and 4 and obtain a state recognition result for each. Considering prediction accuracy, however, it may be preferable to predict only frame data that has context within the input; for example, the component may predict the states of frames 1, 2, and 3.
As shown in fig. 3A, for the 1st input the acoustic model component may, as an example, predict the states of the 1st frame and the frame after it (the 2nd frame), yielding their state recognition results. After the 1st input is processed, the component takes the not-yet-analyzed 3rd frame as the current target frame, splices the 3 frame lengths of audio after the 3rd frame to form the 2nd input, and predicts the 3rd frame and the frame after it (the 4th frame) to obtain their prediction results. Prediction thus proceeds at an interval of one frame, reducing the computation to 1/2.
As shown in fig. 3B, for the 1st input the acoustic model component may instead predict the 1st frame and the two frames after it (the 2nd and 3rd frames), yielding prediction results (i.e., state recognition results) for frames 1, 2, and 3. After the 1st input is processed, the unprocessed 4th frame becomes the current target frame; the 3 frame lengths of audio after the 4th frame are spliced to form the 2nd input, and the component predicts the states of the 4th frame and the two frames after it (the 5th and 6th frames). Prediction thus proceeds at an interval of two frames, reducing the computation to 1/3.
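The iteration of figs. 3A and 3B may be sketched as follows (frames are 0-indexed here, whereas the figures count from 1). This is a minimal illustration under assumptions: `acoustic_model` is a hypothetical callable returning one state prediction per output head, and handling of the utterance tail, where fewer than `window` frames remain, is left to that callable:

```python
def predict_states(frames, acoustic_model, window=4, n_out=3):
    """Multi-frame prediction as in Fig. 3B: each input covers `window`
    consecutive frames starting at the target; the model returns predictions
    for the first `n_out` of them; the next target is the first frame not
    yet predicted, so the input position advances by n_out each step."""
    T = len(frames)
    results = [None] * T
    t = 0
    while t < T:
        chunk = frames[t:t + window]      # target frame plus its following context
        outs = acoustic_model(chunk)      # n_out per-frame state predictions
        for k, out in enumerate(outs):
            if t + k < T:
                results[t + k] = out
        t += n_out                        # skip the frames already predicted
    return results
```

With `n_out=2` this reproduces fig. 3A (computation reduced to 1/2); with `n_out=3` it reproduces fig. 3B (computation reduced to 1/3).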
In step S230, the state recognition results of the audio data of multiple frames in the multi-frame audio data are compared with the preset wake-up word to recognize whether the multi-frame audio data is a wake-up command.
The multi-frame audio data mentioned in the present disclosure may be obtained by performing a framing process on the detected speech input. For example, a user's voice input may be detected in real time and then the detected voice input may be framed to obtain a plurality of frames of audio data.
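A sketch of such framing is given below. The 16 kHz sample rate, 25 ms window, and 10 ms hop are common signal-processing defaults assumed for this example; the disclosure does not fix these values, and non-overlapping frames as drawn in the figures correspond to a hop equal to the frame length:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a mono waveform into (possibly overlapping) frames,
    returning an array of shape (n_frames, frame_len)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
```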
For each input, the acoustic model component may be configured to predict the state of at least one frame among the target frame and the audio data frames within the predetermined range: for example, it may compute a score (i.e., a probability) for each possible state of each such frame, and the highest-scoring state may be taken as that frame's state recognition result.
Each frame's state can thus be determined from its state recognition result; the states of several consecutive frames identify a phoneme, and several phonemes combine into a word. Whether the multi-frame audio data contains the wake-up instruction can therefore be identified from the state recognition results of multiple of its frames. For example, those results may be compared with a preset wake-up word, and if they are consistent with it, the audio data is determined to contain a wake-up command, after which the subsequent wake-up operation (not detailed here) may be performed.
As an example, a plurality of path models may be preset, with different path models corresponding to different wake-up recognition results. Given the state recognition results of multiple frames of the audio data, the path model matching them can be looked up among the preset models to identify whether the multi-frame audio data is a wake-up instruction. The path models may include wake-up instruction models (also called "keyword models"), a filler model, and a silence model. There may be several wake-up instruction models, each corresponding to a different wake-up instruction (i.e., wake-up word); for example, there may be models for "open", "play", "i want to see", and the like. The filler model characterizes audio that is not part of any wake-up instruction, and the silence model characterizes audio with no speech input.
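A heavily simplified stand-in for this path matching is sketched below. A real system decodes jointly over the wake-up instruction, filler, and silence path models (for example with a Viterbi search over a decoding graph); this toy version merely collapses repeated per-frame state IDs, discards silence, and scans for a keyword's state sequence. All names and state IDs here are hypothetical:

```python
def match_path(frame_states, keyword_paths, silence_id=0):
    """frame_states: per-frame state recognition results (state IDs).
    keyword_paths: dict mapping a wake-up word to its expected state sequence.
    Returns the matched wake-up word, or None if the audio is better
    explained by the filler/silence models (i.e., no wake-up)."""
    collapsed = []
    for s in frame_states:                      # collapse runs of equal states, drop silence
        if s != silence_id and (not collapsed or collapsed[-1] != s):
            collapsed.append(s)
    for word, path in keyword_paths.items():    # scan for each keyword's state sequence
        n = len(path)
        if any(collapsed[i:i + n] == path for i in range(len(collapsed) - n + 1)):
            return word
    return None

# e.g. match_path([0, 0, 3, 3, 7, 0, 9], {"wake_word": [3, 7, 9]}) -> "wake_word"
```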
[ ACOUSTIC MODEL ]
In the present disclosure, to improve the analytical performance of the acoustic model component, the acoustic model component may be an FSMN model component. In addition, the output of the acoustic model component is modified, so that the acoustic model component can respectively predict a plurality of frames in the input.
Fig. 4 is a schematic diagram illustrating a network structure of an acoustic model component according to an embodiment of the present disclosure.
As shown in fig. 4, the network structure of the acoustic model component may include an Input Layer, a Hidden Layer structure, and a plurality of Output layers. The output layers are used for respectively predicting the analysis results of the audio data of a plurality of different frames in the input.
The hidden layer structure may include a plurality of hidden layers, and the output layers may all be connected to the last hidden layer. During training, where conventionally one target value is prepared per frame, the acoustic model of the present disclosure needs target values for the current frame and the next N frames. In use, each input then produces output for multiple frames, so inputs are needed only every several frames; the computation drops to 1/N of the original, and the saved resources are invaluable for resource-starved device-side products.
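A minimal PyTorch sketch of this shared-trunk, multi-output-layer shape follows. The plain feedforward hidden stack and the layer sizes are placeholders for illustration; the disclosure's hidden structure combines DNN and FSMN layers, as described below:

```python
import torch.nn as nn

class MultiFramePredictor(nn.Module):
    """Shared hidden layers feeding one output layer (head) per predicted
    frame; all heads are connected to the last hidden layer."""
    def __init__(self, in_dim: int, hidden_dim: int, n_states: int, n_heads: int):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # one head per frame to predict: the target frame and the N-1 frames after it
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n_states) for _ in range(n_heads))

    def forward(self, x):                        # x: (batch, in_dim) spliced features
        h = self.hidden(x)
        return [head(h) for head in self.heads]  # per-frame state logits
```

During training, each head would then be supervised with its own target, e.g. head k with the state label of the k-th frame covered by the input, summing the per-head cross-entropy losses.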
In the disclosure, the audio data frames within the predetermined range near the target frame may be spliced with the target frame and fed directly to the input layer, which extracts features from the input data and passes them to the hidden layer structure. Alternatively, features may first be extracted from the spliced audio data and the extracted features fed to the input layer, which passes them on to the hidden layer structure.
The hidden layer structure may employ an FSMN structure. The core difference of FSMN compared with the ordinary DNN layer is that a memory module is added between the adjacent hidden layers, and the memory module is used for storing the history information and the future information which are useful for judging the current target frame. The output of the memory module is used as the input of the next hidden layer, and the output of the memory module may include the output of the current hidden layer, the output of the hidden layer with the predetermined look-back order, and the output of the hidden layer with the predetermined look-ahead order.
Fig. 5 is a diagram showing the structure of the introduced FSMN.
As shown in fig. 5, the core difference between the FSMN and a conventional DNN layer is the added memory module B, which stores part of the past and future information; B processes this information and passes it to the next hidden layer, giving the network the ability to handle long-term information. To reduce computation, the previous hidden layer may first be projected into a module A whose dimensionality is smaller than that of the hidden layer; this is equivalent to splitting the parameter matrix from the hidden layer to B into two factors, and a reasonably sized A reduces computation without losing performance. The calculation of the FSMN layer is expressed as follows.
$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right) \qquad (1)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l \qquad (2)$$

where $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained by the nonlinear transformation of an activation function $f$; $U^l$ represents a weight matrix; $\tilde{h}_t^l$ represents the output of the memory module; $b^{l+1}$ represents a bias; $h_t^l$ represents the output of the $l$-th hidden layer and $x_t^l$ its input, with $h_t^l = f(W^l x_t^l + b^l)$, where $W^l$ is a weight matrix and $b^l$ a bias; $t$ represents the current time; $s_1$ and $s_2$ represent the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module; and $\odot$ denotes element-wise (bit-by-bit) multiplication.

According to equation (2), the output of the memory module is the sum of the output of the current hidden layer, the outputs of the hidden layer at the predetermined look-back orders, and the outputs of the hidden layer at the predetermined look-ahead orders. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l$ may be regarded as the look-back part: the hidden-layer outputs at the different look-back orders before the current time $t$, spaced by the stride factor $s_1$, each multiplied element-wise by its corresponding encoding coefficient. Likewise, $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$ may be regarded as the look-ahead part: the hidden-layer outputs at the different look-ahead orders after $t$, spaced by $s_2$, each multiplied element-wise by its corresponding encoding coefficient.
The extra computation of the FSMN relative to the DNN comes from equation (2). Concrete calculations show that, for similar network architectures (the same or similar number of layers and nodes per layer), the FSMN's floating-point operation count is close to the DNN's, whereas an LSTM of similar architecture is more than twice as computationally intensive as the DNN. The computation introduced by the FSMN is therefore far less than that of an LSTM of equal structure, so the model keeps the real-time rate under control while gaining the long-term information modeling capability the DNN lacks, and its performance surpasses the LSTM's.
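As a concrete reading of equation (2), a minimal numpy sketch of the memory module follows; skipping out-of-range taps at the sequence boundaries (i.e., zero-padding in time) is an assumption of this example. Per equation (1), the next hidden layer would then compute `f(U @ h_tilde + b)`:

```python
import numpy as np

def fsmn_memory(h: np.ndarray, t: int, a: np.ndarray, c: np.ndarray,
                s1: int = 1, s2: int = 1) -> np.ndarray:
    """Memory-module output of equation (2) at time t.
    h: (T, D) hidden-layer outputs; a: (N1, D) look-back coefficients;
    c: (N2, D) look-ahead coefficients; s1, s2: encoding stride factors."""
    T, _ = h.shape
    h_tilde = h[t].copy()                         # output of the current hidden layer
    for i in range(1, len(a) + 1):                # look-back taps
        if t - s1 * i >= 0:
            h_tilde += a[i - 1] * h[t - s1 * i]   # element-wise (Hadamard) product
    for j in range(1, len(c) + 1):                # look-ahead taps
        if t + s2 * j < T:
            h_tilde += c[j - 1] * h[t + s2 * j]
    return h_tilde
```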
Fig. 6 is a network architecture diagram illustrating acoustic model components according to an embodiment of the present disclosure.
As shown in fig. 6, the network structure of the acoustic model component may include an input layer, a hidden layer structure composed of a DNN layer and an FSMN layer, and a plurality of output layers. The DNN layer structure is well known to those skilled in the art and will not be described herein. For the descriptions of the input layer, the FSMN layer, and the output layers, see the above description, and are not repeated here.
Fig. 7 is a structural framework diagram illustrating a voice wake-up system according to an embodiment of the present disclosure.
As shown in fig. 7, the voice wake-up system of the present disclosure mainly includes a detection module 710, an acoustic prediction module 720 and a keyword detection module 730.
The detection module 710 may detect a voice input of a user in real time, and may perform framing processing on the detected voice input to obtain multiple frames of audio data.
The acoustic prediction module 720 may predict the state recognition result of each frame in the multi-frame audio data. During prediction, the module splices the audio data within the predetermined range near the current target frame with the target frame as input and feeds it to a pre-trained acoustic model component, which predicts the state recognition result of at least one frame among the target frame and the frames within the predetermined range. The single frame that follows the target frame and has not yet been predicted then becomes the next target frame to be analyzed, so the acoustic prediction module 720 iteratively processes subsequent target frames with the acoustic model component. For the network structure of the acoustic model component, see the description above.
Based on the state recognition results of multiple frames of the audio data, the keyword detection module 730 may search the path models for the one matching those results. The path models fall into keyword models, a filler model, and a silence model. When the state recognition results match a keyword model, it is determined that the user has issued a wake-up instruction, and the device can then be switched on, realizing voice wake-up.
[ VOICE WAKE-UP DETECTING DEVICE ]
The voice wake-up detection method of the present disclosure may also be implemented as a voice wake-up detection apparatus.
Fig. 8 is a schematic block diagram illustrating the structure of a voice wake-up detecting apparatus according to an embodiment of the present disclosure. The functional modules of the voice wake-up detection apparatus can be implemented by hardware, software or a combination of hardware and software for implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 8 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the voice wake-up detection apparatus may have and operations that each functional module may perform are briefly described, and for details related thereto, reference may be made to the description above in conjunction with fig. 2 to fig. 6, and details are not repeated here.
Referring to fig. 8, the voice wake-up detection apparatus 800 includes a state recognition module 810 and a wake-up recognition module 820. The state recognition module 810 is configured to splice the audio data frames within a predetermined range near a target frame in multi-frame audio data with the target frame as input and feed it to a pre-trained acoustic model component; the acoustic model component is a feedforward sequential memory network (FSMN) component whose output is the state recognition result of at least one frame among the target frame and the audio data frames within the predetermined range. The state recognition module 810 may take the single frame that follows the target frame and has not yet been predicted as the next target frame, and iteratively process subsequent target frames using the acoustic model component.
The wake-up recognition module 820 may identify whether the multi-frame audio data is a wake-up instruction based on the state recognition results of multiple of its frames. For example, it may compare those results with a preset wake-up word to recognize whether the multi-frame audio data is a wake-up command. As an example, it may search the preset path models for one matching the state recognition results, where different path models correspond to different recognition results; the path models may include a wake-up instruction model, a filler model, and a silence model.
In the present disclosure, the audio data frames within the predetermined range may include: audio data frames within a first predetermined range before the target frame in the multi-frame audio data; and/or audio data frames within a second predetermined range after the target frame in the multi-frame audio data.
As shown in fig. 8, the voice wake-up detection apparatus 800 may further optionally include a detection module 830 and a framing module 840, shown with dashed boxes in the figure. The detection module 830 is configured to detect the user's voice input in real time, and the framing module 840 is configured to frame the detected voice input to obtain the multi-frame audio data.
As shown in fig. 4, in the present embodiment, the network structure of the acoustic model component may include: an input layer; a hidden layer structure; and a plurality of output layers for predicting analysis results of audio data of different frames in the input, respectively.
The hidden layer structure may include a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is configured to store history information and future information useful for determining the current target frame. And the output of the memory module is used as the input of the next hidden layer, and the output of the memory module comprises the output of the current hidden layer, the output of the hidden layer with the preset look-back order and the output of the hidden layer with the preset look-ahead order.
The calculation of the hidden layer is expressed as follows.
$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right) \qquad (1)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l \qquad (2)$$

where $h_t^{l+1}$ represents the input of the (l+1)-th hidden layer, obtained by the nonlinear transformation of an activation function $f$; $U^l$ represents a weight matrix; $\tilde{h}_t^l$ represents the output of the memory module; $b^{l+1}$ represents a bias; $h_t^l$ represents the output of the $l$-th hidden layer and $x_t^l$ its input, with $h_t^l = f(W^l x_t^l + b^l)$, where $W^l$ is a weight matrix and $b^l$ a bias; $t$ represents the current time; $s_1$ and $s_2$ represent the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ represent the look-back order and the look-ahead order, respectively; $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module; and $\odot$ denotes element-wise (bit-by-bit) multiplication. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 i}^l$ may be regarded as the output of the hidden layer at the predetermined look-back order: the hidden-layer outputs at the different look-back orders before the current time $t$, spaced by the stride factor $s_1$, each multiplied element-wise by its corresponding encoding coefficient. Likewise, $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 j}^l$ may be regarded as the output of the hidden layer at the predetermined look-ahead order: the hidden-layer outputs at the different look-ahead orders after $t$, spaced by $s_2$, each multiplied element-wise by its corresponding encoding coefficient.
[ calculating device ]
Fig. 9 is a schematic structural diagram of a computing device for data processing, which can be used to implement the voice wake-up detection method according to an embodiment of the present invention.
Referring to fig. 9, computing device 900 includes memory 910 and processor 920.
The processor 920 may be a multi-core processor or may include multiple processors. In some embodiments, processor 920 may include a general-purpose main processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), Digital Signal Processor (DSP), or the like. In some embodiments, processor 920 may be implemented using custom circuits, such as Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs).
The memory 910 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions for the processor 920 or other modules of the computer. The permanent storage may be a readable and writable, non-volatile storage device that does not lose its stored instructions and data even when the computer is powered off; in some embodiments it is a mass storage device (e.g., a magnetic or optical disk, or flash memory), in other embodiments a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a volatile or non-volatile read-write memory device, such as dynamic random access memory, and may store the instructions and data that some or all of the processors require at runtime. In addition, the memory 910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 910 has executable code stored thereon which, when executed by the processor 920, causes the processor 920 to perform the audio analysis and voice wake-up detection methods described above.
The audio analysis and voice wake-up detection methods, apparatuses, and computing devices according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A voice wake-up detection method, comprising:
inputting audio data frames within a predetermined range around a target frame in multi-frame audio data, together with the target frame, into an acoustic model component, wherein the acoustic model component is a feedforward sequential memory neural network model component, and the output of the acoustic model component is a state recognition result for at least one frame of audio data among the target frame and the audio data frames within the predetermined range;
taking a single frame of audio data that is located after the target frame in the multi-frame audio data and has not yet been processed as the next target frame, and iteratively processing subsequent target frames with the acoustic model component; and
comparing the state recognition results of multiple frames of the multi-frame audio data with a preset wake-up word, to identify whether the multi-frame audio data is a wake-up instruction.

2. The voice wake-up detection method according to claim 1, wherein the audio data frames within the predetermined range comprise:
audio data frames within a first predetermined range before the target frame in the multi-frame audio data; and/or
audio data frames within a second predetermined range after the target frame in the multi-frame audio data.

3. The voice wake-up detection method according to claim 1, further comprising:
detecting a user's voice input in real time; and
performing framing processing on the detected voice input to obtain the multi-frame audio data.

4. The voice wake-up detection method according to claim 1, wherein the step of comparing the state recognition results of multiple frames of the multi-frame audio data with the preset wake-up word comprises:
searching a plurality of preset path models for a path model that matches the state recognition results of the multiple frames of audio data, to identify whether the multi-frame audio data is a wake-up instruction, wherein different path models correspond to different recognition results.

5. The voice wake-up detection method according to claim 4, wherein the path models comprise:
a wake-up instruction model;
a filler model; and
a silence model.

6. The voice wake-up detection method according to claim 1, wherein the acoustic model component comprises:
an input layer;
a hidden layer structure; and
a plurality of output layers, the plurality of output layers being used to respectively predict the analysis results for different frames of audio data in the input.

7. The voice wake-up detection method according to claim 6, wherein the hidden layer structure comprises a plurality of hidden layers, a memory module is arranged between at least two adjacent hidden layers, and the memory module is used to store historical information and future information useful for judging the current target frame.

8. The voice wake-up detection method according to claim 7, wherein the output of the memory module is used as the input of the next hidden layer, and the output of the memory module comprises the output of the current hidden layer, the output of the hidden layer at the predetermined look-back order, and the output of the hidden layer at the predetermined look-ahead order.

9. The voice wake-up detection method according to claim 8, wherein

$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right), \qquad \tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t - s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t + s_2 \cdot j}^l$$

where $h_t^{l+1}$ denotes the input of the (l+1)-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ denotes a weight; $\tilde{h}_t^l$ denotes the output of the memory module; $b^{l+1}$ denotes an offset; $h_t^l$ denotes the output of the l-th hidden layer, computed from the input of the l-th hidden layer with weight $W^l$ and offset $b^l$; $t$ denotes the current time; $s_1$ and $s_2$ denote the coding stride factors for historical and future times, respectively; $N_1$ and $N_2$ denote the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the coding coefficients of the memory module.

10. A voice wake-up detection apparatus, comprising:
a state recognition module, configured to input audio data frames within a predetermined range around a target frame in multi-frame audio data, together with the target frame, into an acoustic model component, wherein the acoustic model component is a feedforward sequential memory neural network model component, and the output of the acoustic model component is a state recognition result for at least one frame of audio data among the target frame and the audio data frames within the predetermined range, and wherein the state recognition module takes a single frame of audio data that is located after the target frame in the multi-frame audio data and has not yet been predicted as the next target frame to be analyzed, and iteratively processes subsequent target frames with the acoustic model component; and
a wake-up recognition module, configured to compare the state recognition results of multiple frames of the multi-frame audio data with a preset wake-up word, to identify whether the multi-frame audio data is a wake-up instruction.

11. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, wherein the executable code, when executed by the processor, causes the processor to perform the method according to any one of claims 1-9.

12. A non-transitory machine-readable storage medium having executable code stored thereon, wherein the executable code, when executed by a processor of an electronic device, causes the processor to perform the method according to any one of claims 1 to 9.
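To illustrate how the claimed steps compose, the following is a minimal Python sketch of the detection loop of claim 1: each target frame is scored together with the audio frames in a predetermined range around it, one model pass may emit state results for several frames (the plurality of output layers of claim 6), and the accumulated state results are finally matched against path models as in claims 4 and 5. The acoustic_model callable, the path_models dictionary and its keys, and the window sizes are hypothetical placeholders assumed for this sketch, not the patented implementation.

import numpy as np

def detect_wake_word(frames, acoustic_model, path_models,
                     window_before=5, window_after=5):
    # frames         : (T, D) array, one feature vector per audio frame
    # acoustic_model : callable mapping a (window, D) array to a list of
    #                  per-frame state recognition results
    # path_models    : dict mapping a path name to a callable that scores
    #                  a sequence of state recognition results
    T = frames.shape[0]
    state_results = []
    t = 0
    while t < T:
        # the target frame plus the frames in the predetermined range
        lo = max(0, t - window_before)
        hi = min(T, t + window_after + 1)
        results = acoustic_model(frames[lo:hi])
        # one pass may yield state results for more than one frame
        state_results.extend(results)
        # the next unprocessed frame becomes the next target frame
        t += max(1, len(results))
    # compare the accumulated results with the preset wake-up word by
    # finding the best-matching path model
    scores = {name: model(state_results) for name, model in path_models.items()}
    return max(scores, key=scores.get) == "wake"

A caller would supply a trained feedforward sequential memory network as acoustic_model and, for example, "wake", "filler", and "silence" entries in path_models, mirroring the wake-up instruction model, filler model, and silence model of claim 5.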
CN201810637168.1A 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium Active CN110619871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810637168.1A CN110619871B (en) 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810637168.1A CN110619871B (en) 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110619871A true CN110619871A (en) 2019-12-27
CN110619871B (en) 2023-06-30

Family

ID=68920779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810637168.1A Active CN110619871B (en) 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110619871B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267924A1 (en) * 2013-10-22 2016-09-15 Nec Corporation Speech detection device, speech detection method, and medium
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
WO2017114201A1 * 2015-12-31 2017-07-06 Alibaba Group Holding Ltd Method and device for executing setting operation
CN105845128A * 2016-04-06 2016-08-10 University of Science and Technology of China Speech recognition efficiency optimization method based on dynamic pruning beam prediction
CN107767861A * 2016-08-22 2018-03-06 iFlytek Co., Ltd. Voice wake-up method, system and intelligent terminal
CN106782536A * 2016-12-26 2017-05-31 Beijing Unisound Information Technology Co., Ltd. Voice wake-up method and device
CN107358951A * 2017-06-29 2017-11-17 Alibaba Group Holding Ltd Voice wake-up method, device and electronic device
CN107680597A * 2017-10-23 2018-02-09 Ping An Technology (Shenzhen) Co., Ltd. Speech recognition method, device, equipment and computer-readable storage medium
CN107871506A * 2017-11-15 2018-04-03 Beijing Unisound Information Technology Co., Ltd. Wake-up method and device for speech recognition function

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114945980A * 2020-01-15 2022-08-26 Google LLC Small-footprint multichannel keyword spotting
CN113643693A * 2020-04-27 2021-11-12 SoundHound, Inc. Acoustic model conditioned on sound features
CN113643693B (en) * 2020-04-27 2024-02-09 SoundHound, Inc. Acoustic model conditioned on sound characteristics
US11741943B2 (en) 2023-08-29 SoundHound, Inc Method and system for acoustic model conditioning on non-phoneme information features
US12154546B2 (en) 2024-11-26 SoundHound AI IP, LLC. Method and system for acoustic model conditioning on non-phoneme information features
CN111640440A (en) * 2020-04-30 2020-09-08 Huawei Technologies Co., Ltd. Audio stream decoding method, device, storage medium and device
WO2021218240A1 (en) * 2020-04-30 2021-11-04 Huawei Technologies Co., Ltd. Audio stream decoding method and apparatus, storage medium, and device
CN111640440B (en) * 2020-04-30 2022-12-30 Huawei Technologies Co., Ltd. Audio stream decoding method, device, storage medium and equipment
CN111755029A (en) * 2020-05-27 2020-10-09 Beijing Dami Technology Co., Ltd. Voice processing method, device, storage medium and electronic device
CN111755029B (en) * 2020-05-27 2023-08-25 Beijing Dami Technology Co., Ltd. Voice processing method, device, storage medium and electronic equipment
CN111653270A (en) * 2020-08-05 2020-09-11 Tencent Technology (Shenzhen) Co., Ltd. Voice processing method and device, computer readable storage medium and electronic equipment
CN111798859B (en) * 2020-08-27 2024-07-12 Beijing Century TAL Education Technology Co., Ltd. Data processing method, device, computer equipment and storage medium
CN111798859A (en) * 2020-08-27 2020-10-20 Beijing Century TAL Education Technology Co., Ltd. Data processing method, device, computer equipment and storage medium
CN112882394A (en) * 2021-01-12 2021-06-01 Beijing Xiaomi Pinecone Electronics Co., Ltd. Device control method, control apparatus, and readable storage medium
CN114333794A (en) * 2021-12-21 2022-04-12 iFlytek Co., Ltd. Voice wake-up method and device, electronic equipment and storage medium
CN114596841A (en) * 2022-03-15 2022-06-07 Tencent Technology (Shenzhen) Co., Ltd. Real-time voice recognition method, model training method, device and equipment
CN115101063B (en) * 2022-08-23 2023-01-06 Shenzhen Youjie Zhixin Technology Co., Ltd. Low-computation-power voice recognition method, device, equipment and medium
CN115101063A (en) * 2022-08-23 2022-09-23 Shenzhen Youjie Zhixin Technology Co., Ltd. Low-computation-power voice recognition method, device, equipment and medium

Also Published As

Publication number Publication date
CN110619871B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN110619871B (en) Voice wakeup detection method, device, equipment and storage medium
US11132992B2 (en) On-device custom wake word detection
US11664020B2 (en) Speech recognition method and apparatus
EP3966813B1 (en) Online verification of custom wake word
US12254036B2 (en) System and method for summarizing a multimedia content item
US8589163B2 (en) Adapting language models with a bit mask for a subset of related words
EP3654328B1 (en) Method and apparatus with speech recognition
KR20210015967A (en) End-to-end streaming keyword detection
US20190057683A1 (en) Encoder-decoder models for sequence to sequence mapping
CN110070859B (en) A voice recognition method and device
CN109543190A Intention recognition method, device, equipment and storage medium
CN108010515A Voice endpoint detection and wake-up method and device
CN112825249A (en) Voice processing method and device
CN109741735B (en) A modeling method, an acoustic model acquisition method and device
JP7044856B2 (en) Speech recognition model learning methods and systems with enhanced consistency normalization
US20200090642A1 (en) Method and apparatus with speech recognition
US12499876B2 (en) Multi-device speech processing
EP4465293A1 (en) Automatic speech recognition with combined speech-text-embeddings
CN112652306A (en) Voice wake-up method and device, computer equipment and storage medium
CN106875936A (en) Voice recognition method and device
CN117409818A (en) Voice emotion recognition method and device
JP2009047838A (en) Speech recognition apparatus and method
JP2014098874A (en) Voice recognition apparatus, voice recognition method and program
WO2012076895A1 (en) Pattern recognition
CN113658593B (en) Wake-up realization method and device based on voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant