CN111667818B - Method and device for training wake-up model - Google Patents
Method and device for training wake-up model
- Publication number
- CN111667818B (granted from application CN202010461982.XA / CN202010461982A)
- Authority
- CN
- China
- Prior art keywords
- acoustic model
- wake
- model
- voice
- current acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000012549 training Methods 0.000 title claims abstract description 132
- 238000000034 method Methods 0.000 title claims abstract description 77
- 230000001960 triggered effect Effects 0.000 claims abstract description 8
- 238000009826 distribution Methods 0.000 claims description 102
- 230000006870 function Effects 0.000 claims description 54
- 230000008569 process Effects 0.000 claims description 16
- 238000003860 storage Methods 0.000 claims description 13
- 230000006978 adaptation Effects 0.000 abstract description 5
- 230000000875 corresponding effect Effects 0.000 description 68
- 238000003062 neural network model Methods 0.000 description 10
- 238000010586 diagram Methods 0.000 description 9
- 238000004422 calculation algorithm Methods 0.000 description 7
- 238000004590 computer program Methods 0.000 description 7
- 238000012545 processing Methods 0.000 description 7
- 239000013598 vector Substances 0.000 description 6
- 238000004891 communication Methods 0.000 description 4
- 238000013075 data extraction Methods 0.000 description 4
- 230000011218 segmentation Effects 0.000 description 4
- 238000013528 artificial neural network Methods 0.000 description 3
- 238000004140 cleaning Methods 0.000 description 3
- 230000001276 controlling effect Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000009795 derivation Methods 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 210000005069 ears Anatomy 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000012634 fragment Substances 0.000 description 2
- 238000004519 manufacturing process Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012216 screening Methods 0.000 description 2
- 230000004075 alteration Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000002085 persistent effect Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 239000004984 smart glass Substances 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 230000002618 waking effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Telephone Function (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention provides a method and a device for training a wake-up model. The method comprises the following steps: when model training is triggered, obtaining a first training set and a second training set; inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing the output results of the initial acoustic model and the current acoustic model; inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot codes corresponding to the wake-up speech that the current acoustic model can recognize; and adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter. The method enables the current acoustic model to remain compatible with the initial speech while adapting to the current scene, reduces the risk of unstable wake-up performance caused by updating, and allows the trained acoustic model to stay well compatible with the initial wake-up scene while adapting to more complex scenes.
Description
Technical Field
The invention relates to the technical field of voice processing, in particular to a method and a device for training a wake-up model.
Background
With the continuous development of technology, intelligent devices have become increasingly popular. A current intelligent device can determine whether to wake up by receiving a voice signal input by the user. In one application scenario, an intelligent device in the standby (to-be-woken) state recognizes the user's input voice signal with an acoustic model inside the device; when the acoustic model can recognize the input voice signal, the intelligent device is woken up, so that the user can control other smart home devices through the woken device. The user pre-records wake-up speech on the intelligent device, and the device performs model training on the recorded wake-up speech to generate an acoustic model. When the intelligent device later receives a voice signal input by the user, if the acoustic model can recognize the input voice signal, the signal is determined to be wake-up speech, the device can be woken, and further operations can be performed on it.
In actual use, because the pre-recorded wake-up speech samples in the first-recorded acoustic model are too few, because the user's speaking environment is too noisy, or because of the speaker's accent, the acoustic model may fail to recognize the corresponding wake-up speech, causing false wake-ups or missed wake-ups. Existing wake-up devices therefore have to continuously collect each newly input wake-up speech, label the new wake-up speech manually or filter out noise in the user's input through data cleaning, determine whether each utterance is speech that should wake the intelligent device or speech that was falsely woken on or missed, and mix the labeled or judged voice data with the previously recorded wake-up speech for retraining, so as to obtain an acoustic model with better effect to replace the previously obtained model.
However, when the wake-up device retrains with the mixed wake-up speech, the wake-up word data in the wake-up speech has to be rearranged for training. Retraining occupies a large amount of the wake-up device's computing resources and takes a long time to produce the final acoustic model, and because the wake-up model has to be rebuilt and the proportion of training data is hard to control, the actual wake-up effect of the retrained wake-up model is unstable.
Disclosure of Invention
The invention provides a method and a device for training a wake-up model, which solve the problems in the prior art that continuously collecting new wake-up word data, mixing it with previously recorded wake-up word data and retraining a new acoustic model to replace the old one with a better-performing model requires a large amount of model training, that retraining the acoustic model takes a long time, and that, because the wake-up model has to be rebuilt and the proportion of training data is hard to control, the actual wake-up effect of the retrained wake-up model is unstable.
The first aspect of the present invention provides a method of training a wake model, the method comprising:
when model training is triggered, a first training set and a second training set are obtained, wherein the first training set comprises initial voice feature data for training an initial acoustic model, and the second training set comprises new voice feature data of the missed-wake/false-wake speech corresponding to the current acoustic model during wake-up speech recognition;
Respectively inputting the first training set into an initial acoustic model and a current acoustic model, and determining a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech that the current acoustic model can recognize;
and adjusting model parameters of the current acoustic model according to the first difference parameters and the second difference parameters.
Optionally, determining the first difference parameter includes:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
and determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Optionally, determining the second difference parameter includes:
acquiring a third probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining cross entropy according to the difference between the third probability distribution and the fourth probability distribution.
Optionally, after determining the second difference parameter, further comprising:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
Optionally, adjusting the model parameters of the initial acoustic model according to the second difference parameters includes:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, adjusting the model parameters of the current acoustic model according to the first difference parameter and the second difference parameter includes:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, obtaining the one-hot code corresponding to the wake-up voice that can be identified by the current acoustic model includes:
Acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining the one-hot coding corresponding to the speech unit that the current acoustic model can recognize according to the recognition result of the ASR speech model, wherein the speech unit comprises at least one of a phoneme state, a phoneme and a word.
A second aspect of the present invention provides an apparatus for training a wake model, the apparatus comprising a memory for storing instructions and a processor;
the processor is configured to read the instructions in the memory and perform the following steps:
when model training is triggered, a first training set and a second training set are obtained, wherein the first training set comprises initial voice feature data for training an initial acoustic model, and the second training set comprises new voice feature data of the missed-wake/false-wake speech corresponding to the current acoustic model during wake-up speech recognition;
respectively inputting the first training set into an initial acoustic model and a current acoustic model, and determining a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech that the current acoustic model can recognize;
And adjusting model parameters of the current acoustic model according to the first difference parameters and the second difference parameters.
Optionally, the processor is configured to determine a first difference parameter, including:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
and determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Optionally, the processor is configured to determine a second difference parameter, including:
acquiring a third probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining cross entropy according to the difference between the third probability distribution and the fourth probability distribution.
Optionally, after the processor is configured to determine the second difference parameter, the method further includes:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the processor is configured to adjust model parameters of an initial acoustic model according to the second difference parameter, including:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the processor is configured to adjust a model parameter of the current acoustic model according to the first difference parameter and the second difference parameter, including:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or the new speech feature data used by the processor comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the processor is configured to obtain the one-hot code corresponding to the wake-up speech that the current acoustic model can recognize, including:
acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining the one-hot coding corresponding to the speech unit that the current acoustic model can recognize according to the recognition result of the ASR speech model, wherein the speech unit comprises at least one of a phoneme state, a phoneme and a word.
A third aspect of the present invention provides an apparatus for training a wake model, the apparatus comprising:
The training set acquisition module acquires a first training set and a second training set when model training is triggered, wherein the first training set comprises initial voice feature data for training an initial acoustic model, and the second training set comprises new voice feature data of the missed-wake/false-wake speech corresponding to the current acoustic model during wake-up speech recognition;
the first difference parameter determining module is used for inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
the second difference parameter determining module is used for inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech that the current acoustic model can recognize;
and the model adjustment module is used for adjusting the model parameters of the current acoustic model according to the first difference parameters and the second difference parameters.
Optionally, the first difference parameter determining module is configured to determine a first difference parameter, including:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
And determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Optionally, the second difference parameter determining module is configured to determine a second difference parameter, including:
acquiring a third probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
and determining cross entropy according to the difference between the third probability distribution and the fourth probability distribution.
Optionally, after the current acoustic model determining module is configured to determine the second difference parameter, the method further includes:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the current model determining module is configured to adjust model parameters of an initial acoustic model according to the second difference parameters, including:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the model adjustment module is configured to adjust a model parameter of the current acoustic model according to the first difference parameter and the second difference parameter, including:
And determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the second difference parameter determining module is configured to obtain the one-hot code corresponding to the wake-up speech that the current acoustic model can recognize, and includes:
acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining the one-hot coding corresponding to the speech unit that the current acoustic model can recognize according to the recognition result of the ASR speech model, wherein the speech unit comprises at least one of a phoneme state, a phoneme and a word.
A fourth aspect of the invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, implement a method of training a wake model as claimed in any one of the first aspects of the invention.
The method for training the wake-up model provided by the invention divides the training set of the wake-up model into two classes: one is the initial voice feature data used to train the initial acoustic model, and the other is the missed-wake/false-wake voice data used to train the current acoustic model. When training the current acoustic model, the one-hot codes from the ASR model are used to ensure adaptation to the actual scene, and the difference parameters of the initial voice feature data between the initial acoustic model and the current acoustic model are calculated; with these difference parameters, the current acoustic model remains compatible with the initial voice data while adapting to the current scene. This reduces the risk of unstable wake-up performance caused by updating, effectively improves the wake-up success rate of the wake-up model both for speech in the current scene and for the speech used in the initial training, and lets the trained acoustic model stay well compatible with the initial wake-up scene while adapting to more complex scenes. Compared with the approach of continuously collecting new wake-up word data, mixing it with the old data and retraining a new acoustic model to obtain a better-performing model, the amount of training and the retraining time are greatly reduced.
Drawings
FIG. 1 is a schematic diagram of a wake-up device system;
FIG. 2 is a flow chart of a method of training a wake model;
FIG. 3 is a complete flow chart of a method of training a wake model;
FIG. 4 is a schematic diagram of a device for training a wake-up model;
FIG. 5 is a block diagram of an apparatus for training a wake model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For convenience of understanding, the terms involved in the embodiments of the present invention are explained below:
(1) Cross Entropy is a concept commonly used in deep learning, generally used to measure the difference between a target and a predicted value; it is an important concept in Shannon's information theory and is mainly used to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy can be used as a loss function in a neural network: with p denoting the distribution of the true labels and q the distribution predicted by the trained model, the cross-entropy loss function measures the similarity of p and q;
(2) Relative Entropy, also known as the Kullback-Leibler divergence (Kullback-Leibler Divergence) or information divergence (Information Divergence), is an asymmetric measure of the difference between two probability distributions (Probability Distribution). In information theory, the relative entropy is equivalent to the difference between the information entropies (Shannon Entropy) of two probability distributions. It is the loss function of some optimization algorithms, such as the Expectation-Maximization (EM) algorithm, and represents the information loss incurred when a theoretical distribution is used to fit the true distribution.
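For reference, with p denoting the target (true) distribution and q the model's predicted distribution, the two measures can be written in their standard textbook forms (these formulas are supplied here for readability and are not reproduced from the patent figures):

```latex
% Cross entropy between the target distribution p and the predicted distribution q
H(p, q) = -\sum_{x} p(x)\,\log q(x)

% Relative entropy (KL divergence) of q from p
D_{KL}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
```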
(3) MFCC is the abbreviation of Mel Frequency Cepstrum Coefficient. The Mel frequency scale was proposed based on the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz; Mel frequency cepstrum coefficients are spectral features computed by exploiting this relationship. MFCCs have been widely used as recognition features in the field of speech recognition.
(4) FBANK features are extracted in the same way as MFCCs except that the final DCT (cepstrum) step is omitted. FBANK features are closer to the response characteristics of the human ear, but their drawback is that adjacent FBANK features are highly correlated (adjacent filter banks overlap), so when phonemes are modeled with an HMM, a cepstral transform is almost always applied first. Since MFCCs are computed on top of FBANK features, the computation of MFCC is larger, while the correlation of FBANK features is higher.
Considering the false-wake and missed-wake problems of existing wake-word recognition technology, a wake-up model can be trained continuously to improve accuracy. Therefore, in this embodiment, wake-up voice data is collected locally or in the cloud for the device terminal to be woken, and the wake-up model is updated and optimized based on this data, reducing the probability of false wake-ups and missed wake-ups of the device terminal and improving the accuracy of wake-up word recognition.
Based on this, this embodiment provides a wake-up device system, in which a device to be woken is woken by means of a wake-up device. The wake-up device may be any electronic device capable of receiving voice, such as a smart speaker, a smart phone or a smart home appliance, and is not limited in any way.
As shown in fig. 1, the system includes a wake-up device 101 and an awakened device 102, which may be the same device or different devices. In an alternative embodiment, the system may further include a server in communication with the wake-up device. The wake-up device is used for acquiring voice feature data, recognizing the voice feature data with the acoustic model, and, when the input voice feature data can be recognized, determining whether to wake the awakened device according to the probability with which the voice feature data is recognized. The acoustic model may be generated by the wake-up device collecting voice data and training on it, or the wake-up device may collect voice data and send it to the server, which performs the training to generate the acoustic model and sends the resulting acoustic model data back to the wake-up device after training is completed. Further, the acoustic model may also be generated by training on other devices besides servers.
The awakened device 102 may include, but is not limited to, a smart speaker, a smart television, a smart robot, a smart refrigerator, a smart air conditioner, a smart electric cooker, a smart sensor (such as an infrared sensor, a light sensor, a vibration sensor, a sound sensor, etc.), a smart water purifier, etc. that is fixedly installed or is movable in a small range. Alternatively, the awakened device 102 may be a mobile device such as an MP3 player (Moving Picture Experts Group Audio Layer III, mpeg 3), MP4 (Moving Picture Experts Group Audio Layer IV, mpeg 4) player, or smart bluetooth headset.
The various awakened devices 102 may also be connected by a wired or wireless network, optionally using standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired or wireless network, a private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
The wake-up device 101 may be connected to the awakened device 102 through the above-mentioned wired or wireless network, and the user may cause the corresponding smart home device to perform the corresponding operation through control via the wake-up device. Optionally, the wake-up device may be a smart terminal, such as a smart phone, a tablet computer, an e-book reader, smart glasses or a smart watch. For example, a user may control device A among the smart home devices to send data or signals to device B through a smart phone, or control the temperature of a smart refrigerator among the smart home devices through a smart phone, and so on.
When any one of these devices is trained to generate an acoustic model, the network model can first be trained with initial voice feature data to obtain the initial acoustic model, and then an adaptive acoustic model is obtained by training on new voice feature data. Wake-up speech can be captured by a microphone in the wake-up device to obtain voice feature data, and the voice data transmitted to the wake-up device may be voice data that has already been processed by data cleaning. Generally speaking, there are three data-cleaning methods: binning, clustering and regression. Binning is a frequently used method: the data to be processed are placed into bins according to a certain rule, the data in each bin are then examined, and the voice data are processed with a method chosen according to the actual situation of each bin.
As described above, the specific process of training the acoustic model may take place on the server side or on the wake-up device side. The wake-up device or the server trains the network model on the initial voice feature data to obtain an initial acoustic model, and continuously trains and optimizes the current acoustic model on new voice feature data, so as to adapt to the current wake-up environment and further improve wake-up accuracy.
For example, after the user inputs wake-up speech to the awakened equipment, this embodiment can perform detection and analysis through the acoustic model, obtain the posterior probability distribution over the recognizable input voice feature data, compute a wake-up confidence from the posterior probability distribution using the decoder, and judge whether the wake-up confidence is greater than or equal to a set threshold. If the wake-up confidence is greater than or equal to the threshold, the wake-up speech is considered to contain the wake-up word text or to conform to the current acoustic model, and the device can be woken; if the wake-up confidence is less than the threshold, the wake-up speech is considered not to contain the wake-up word text or not to conform to the current acoustic model.
Specifically, this embodiment does not limit the preset wake-up words, such as the word "small" ("xiao") used in the examples below, "Siri", and the like. The wake-up words include wake-up words preset in the server and/or user-defined wake-up words, and users can later delete existing wake-up words (both preset and user-defined) or add new ones.
The embodiment of the invention provides a method for training a wake-up model, applied to the training process of the wake-up model of the wake-up word detection module; as shown in fig. 2, the method comprises the following steps:
step S201, when model training is triggered, a first training set and a second training set are obtained, wherein the first training set comprises initial voice feature data for training an initial acoustic model, and the second training set comprises new voice feature data of the missed-wake/false-wake speech corresponding to the current acoustic model during wake-up speech recognition;
inputting the initial voice characteristic data marked with the wake-up words and the non-wake-up words into a preset deep neural network model, and training the preset deep neural network model to obtain an initial acoustic model;
the initial voice feature data of the wake-up word and the non-wake-up word in the first training set can be manually recorded audio fragments specific to the wake-up word or the non-wake-up word, the initial voice feature data can be obtained through voice data extraction, the initial voice feature data can also be obtained through voice data extraction for audio fragments collected by a user in normal wake-up and use, the initial voice feature data which is to be awakened by the current acoustic model and the initial voice feature data which is not to be awakened by the current acoustic model are determined through the current acoustic model, and the initial voice feature data is stored in a wake-up device or a server side; or, the received voice is manually screened in a manual monitoring mode, the initial voice characteristic data which is to be judged to be the awakened voice by the current acoustic model and the initial voice characteristic data which is not to be determined to be the non-awakened voice by the current acoustic model are stored in the awakened equipment or the server side, and the judgment accuracy of the awakening word is higher in the manual screening mode, and the method is specifically used for selecting according to the number of the received voice samples.
The voice feature extraction can be implemented with conventional techniques in the art, and the method adopted in this step is not particularly limited; for example, any one of the Mel Frequency Cepstrum Coefficient (MFCC) method, the Perceptual Linear Prediction (PLP) method and the Mel-scale Filter Bank (FBANK) method can be used to extract the voice feature data.
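As a purely illustrative sketch (the patent does not prescribe any toolkit), MFCC and FBANK features of the kind mentioned above could be extracted with the librosa library roughly as follows; the frame length, hop size and filter-bank size are assumed values:

```python
# Hedged sketch: extracting FBANK and MFCC features from a wake-up audio clip (librosa assumed).
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # FBANK: log Mel-scale filter-bank energies, one vector per voice frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    fbank = np.log(mel + 1e-6).T               # shape: (num_frames, 40)
    # MFCC: a DCT (cepstrum) step applied on top of the log Mel energies
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160).T
    return fbank, mfcc
```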
The preset deep neural network model may be a deep neural network model, or a deep neural network-hidden Markov model, where the deep neural network comprises a plurality of network layers, each consisting of a fully connected layer and an activation function (typically ReLU or sigmoid), and the last network layer typically consists of a fully connected layer plus a softmax activation function;
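A deep neural network of the kind just described (several fully connected layers with activations and a final fully connected layer plus softmax) could be sketched as below. This is only an assumed PyTorch illustration; the layer sizes and the number of output units (words, phonemes or phoneme states) are not values taken from the patent.

```python
# Hedged sketch of the acoustic model structure described above (PyTorch assumed).
import torch.nn as nn

class WakeAcousticModel(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden_dim: int = 128, num_units: int = 64):
        super().__init__()
        # Several fully connected layers, each followed by an activation function (ReLU here)
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Last network layer: fully connected layer plus softmax over the modeling units
        self.out = nn.Linear(hidden_dim, num_units)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, frames):                              # frames: (num_frames, feat_dim)
        return self.softmax(self.out(self.hidden(frames)))  # posterior distribution per frame
```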
the initial acoustic model and the current acoustic model are deep neural network models modeled to any one of words, phonemes or phoneme states, a voice frame in voice characteristic data is input to the current acoustic model, and posterior probability distribution is output from the current acoustic model;
the modeling mode comprises the following steps:
Segmentation alignment (forced alignment) of the voice feature data is performed using the baseline neural network of the deep neural network model, obtaining the word-level label, the phoneme-level label and the phoneme-state-level label corresponding to each frame of voice feature data, so as to form the inputs and outputs for training the deep neural network model.
The deep neural network model modeled to words takes the feature vector of each frame of the voice feature data as input and the word-level label of each frame as the word output, and aligns the input with the word output by segmentation.
The deep neural network model modeled to phonemes takes the feature vector of each frame of the voice feature data as input and the phoneme-level label of each frame as the phoneme output, and aligns the input with the phoneme output by segmentation.
The deep neural network model modeled to phoneme states takes the feature vector of each frame of the voice feature data as input and the phoneme-state-level label of each frame as the phoneme-state output, and aligns the input with the phoneme-state output by segmentation.
The phoneme-level label is the phoneme pronounced in the voice feature at a certain moment, such as moment t; the phoneme-state-level label is the context-dependent phoneme, represented by a clustered phoneme-state unit, i.e. the phoneme state corresponding to the feature at moment t.
As an optional implementation manner, the output posterior probability distribution is input to a decoder to obtain a wake-up confidence score, and whether the input voice feature data can wake the device is determined by comparing the wake-up confidence score with a wake-up threshold;
specifically, the process of obtaining a wake-up result by inputting the extracted voice features into the current acoustic model mainly comprises the following steps: 1. extract the features with a voice data extraction method; 2. input each voice frame of the extracted features into the current acoustic model to obtain the posterior probability distribution of each voice frame; 3. compute the wake-up confidence corresponding to the posterior probability distribution with a decoder, and judge whether the current acoustic model wakes the device on the input voice feature data according to whether the wake-up confidence exceeds a certain threshold, thereby obtaining the new voice feature data of missed-wake and false-wake speech. The decoding method of the decoder may be to select the path with the highest score among all paths in the decoder, or to select, during the path search, a path that satisfies preset rules, the preset rules being, for example, according to the Viterbi decoding algorithm (Viterbi Algorithm);
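The decision logic of this step might look like the sketch below. The confidence computation is reduced to an average of the best per-frame posteriors, which is only a stand-in for the decoder (e.g. Viterbi path search) described in the text, and the threshold value is an assumption:

```python
# Hedged sketch: turning per-frame posterior distributions into a wake / no-wake decision.
import numpy as np

WAKE_THRESHOLD = 0.6   # assumed value; the patent only requires "a certain threshold"

def wake_confidence(posteriors: np.ndarray) -> float:
    # posteriors: (num_frames, num_units). A real system would run a decoder over paths;
    # averaging the best per-frame score is only a placeholder for that search.
    return float(posteriors.max(axis=1).mean())

def should_wake(posteriors: np.ndarray) -> bool:
    return wake_confidence(posteriors) >= WAKE_THRESHOLD
```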
The new voice feature data of the missed-wake and false-wake speech in the second training set is acquired as follows:
by receiving a voice judgment instruction determined by the voice screening side for the received speech, speech whose actual semantics are the wake-up word but which was not woken on by the current acoustic model is determined to be missed-wake speech, and received speech whose actual semantics are not the wake-up word but which was woken on by the current acoustic model is determined to be false-wake speech; the new voice feature data is then obtained by voice data extraction, as sketched below.
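A minimal sketch of assembling the second training set from screened recordings follows; the attributes `is_wake_word` (the screened judgment of the actual semantics), `was_woken` (the current acoustic model's decision) and `features` are hypothetical names introduced only for illustration:

```python
# Hedged sketch: selecting missed-wake and false-wake utterances for the second training set.
def build_second_training_set(utterances):
    new_feature_data = []
    for utt in utterances:
        missed_wake = utt.is_wake_word and not utt.was_woken       # should wake, but did not
        false_wake = (not utt.is_wake_word) and utt.was_woken      # should not wake, but did
        if missed_wake or false_wake:
            new_feature_data.append(utt.features)                  # extracted voice feature data
    return new_feature_data
```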
Step S202, the first training set is respectively input into an initial acoustic model and a current acoustic model, and a first difference parameter is determined by comparing output results of the initial acoustic model and the current acoustic model;
specifically, determining the first difference parameter comprises:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
and determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Inputting a first training set into an initial acoustic model, outputting a first probability distribution by the initial acoustic model;
The first training set is input to a current acoustic model from which a second probability distribution is output.
The relative entropy is determined according to the difference between the first probability distribution and the second probability distribution. Relative entropy is also called KL (Kullback-Leibler) divergence and is used to measure the difference between two probability distributions; the specific manner of determining the relative entropy from two probability distributions is known to those skilled in the art and is not described here.
Step S203, inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot code corresponding to the wake-up speech that the current acoustic model can recognize.
The one-hot codes corresponding to the wake-up speech that the current acoustic model can recognize are acquired as follows:
acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR speech model, and determining the one-hot coding corresponding to the speech unit that the current acoustic model can recognize according to the recognition result of the ASR speech model, wherein the speech unit comprises at least one of a phoneme state, a phoneme and a word.
The ASR speech model comprises an acoustic model, a pronunciation dictionary and text labels corresponding to the speech feature data. The preset wake-up voice information that the current acoustic model can recognize is input into the ASR speech model, the optimal path and its score are obtained through the decoder in the ASR speech model, and the optimal path is processed into a probability distribution in one-hot form.
For example, consider a "small" ("xiao") wake-up speech of 3 frames in total. Table 1 shows the one-hot probability distribution for modeling units at the phoneme level, where the wake-up word "small" consists of the three phonemes "x", "i" and "ao". Each voice frame of the preset wake-up voice information is input into the ASR speech model to obtain the posterior probability distribution corresponding to the preset wake-up voice information, and a decoder is used to select the optimal path according to the posterior probability distribution. The optimal path may be the path with the highest score among all paths in the decoder, or a path satisfying preset rules during the path search (the preset rules being, for example, the Viterbi decoding algorithm); the optimal path is then processed into the one-hot probability distribution. The one-hot probability distribution may also be pre-constructed from the wake-up speech that the current acoustic model can recognize; the pre-construction method is known to those skilled in the art and is not repeated here;
TABLE 1

|                     | Phoneme "x" | Phoneme "i" | Phoneme "ao" |
| Phoneme "x" frame   | 1           | 0           | 0            |
| Phoneme "i" frame   | 0           | 1           | 0            |
| Phoneme "ao" frame  | 0           | 0           | 1            |
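A one-hot target distribution like Table 1 could be built from the frame-level best-path alignment produced by the ASR model, for instance as in the sketch below; the phoneme inventory and the alignment shown are illustrative values matching the example, not an implementation taken from the patent:

```python
# Hedged sketch: building the one-hot probability distribution of Table 1 from a frame-level alignment.
import numpy as np

phoneme_inventory = ["x", "i", "ao"]        # modeling units of the wake-up word "small" ("xiao")
frame_alignment = ["x", "i", "ao"]          # best-path phoneme assigned to each of the 3 frames

def one_hot_targets(alignment, inventory):
    targets = np.zeros((len(alignment), len(inventory)))
    for t, unit in enumerate(alignment):
        targets[t, inventory.index(unit)] = 1.0   # all probability mass on the aligned unit
    return targets

# one_hot_targets(frame_alignment, phoneme_inventory) reproduces the 3x3 distribution of Table 1.
```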
The new voice feature data in the second training set is input into the current acoustic model in the form of voice frames, and the current acoustic model outputs the posterior probability distribution as the output result;
as shown in table 2, which gives the posterior probability distribution obtained when the new voice feature data is input into the current acoustic model. For example, the new voice feature data also comprises 3 frames, and the wake-up word corresponding to the current acoustic model is "small", which comprises the three phonemes "x", "i" and "ao" at the phoneme level; inputting each voice frame of the new voice feature data into the current acoustic model yields the posterior probability distribution corresponding to each voice frame;
TABLE 2

|              | Phoneme "x" | Phoneme "i" | Phoneme "ao" |
| First frame  | 0.8         | 0.3         | 0.1          |
| Second frame | 0.4         | 0.8         | 0.6          |
| Third frame  | 0.1         | 0.4         | 0.9          |
Acquiring a third probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot code;
the cross entropy is determined according to the difference between the third probability distribution and the fourth probability distribution; the manner of determining the cross entropy from two probability distributions is known to those skilled in the art and is not described in detail here.
Step S204, according to the first difference parameter and the second difference parameter, the model parameters of the current acoustic model are adjusted.
And determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Specifically, the loss function for adjusting the current acoustic model is formed as a weighted sum of the cross entropy and the relative entropy. The gradient vector is determined layer by layer using chain-rule differentiation; since the gradient is the direction in which the loss function increases fastest, and the goal is to make the loss function as small as possible, the network layer parameters of the current acoustic model are adjusted in the direction opposite to the gradient. In the specific adjustment, a learning rate is set manually for the network layers to control the step size of each update, and the parameters of each network layer in the current acoustic model are updated continuously; for example, if the fully connected function is y = w·x + b, the parameters w and b of each fully connected layer are updated.
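One parameter-update step of this kind could be sketched as follows; the network shown is a placeholder, PyTorch and plain stochastic gradient descent are assumptions, and `loss` stands for the weighted sum of cross entropy and relative entropy described above:

```python
# Hedged sketch of one gradient-descent update of the current acoustic model (PyTorch assumed).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 64))  # placeholder network layers
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)                  # learning rate set manually

def update_step(loss: torch.Tensor):
    optimizer.zero_grad()
    loss.backward()    # chain-rule back-propagation yields the gradient for every network layer
    optimizer.step()   # move the w and b of each fully connected layer against the gradient direction
```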
As an alternative embodiment, after determining the second difference parameter, the method further comprises:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
Adjusting model parameters of an initial acoustic model according to the second difference parameters, including:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
When the current acoustic model is determined to be the initial acoustic model, the loss function for adjusting the initial acoustic model is determined according to the second difference parameter, neural network learning is performed through the back-propagation algorithm using the constructed loss function, and the model parameters of the initial acoustic model are adjusted with this loss function. When the acoustic model is adjusted for the first time, because the current acoustic model is the initial acoustic model, the relative entropy calculated by inputting the initial voice feature data into the initial acoustic model and the current acoustic model respectively is 0, so the loss function for adjusting the initial acoustic model is determined from the cross entropy alone.
In summary, the method for training the wake-up model divides the training set of the wake-up model into two classes: the initial voice feature data used to train the initial acoustic model, and the missed-wake/false-wake voice data used to train the current acoustic model. During training of the current acoustic model, the one-hot codes from the ASR model ensure adaptation to the actual scene, while the difference parameters of the initial voice feature data between the initial acoustic model and the current acoustic model keep the current acoustic model compatible with the initial voice data. This reduces the risk of unstable wake-up performance caused by updating, effectively improves the wake-up success rate for speech in the current scene as well as for the speech used in the initial training, and keeps the trained acoustic model well compatible with the initial wake-up scene while adapting to more complex scenes; compared with retraining a new acoustic model on continuously collected new wake-up word data mixed with the old data, the training amount and retraining time are greatly reduced.
As shown in fig. 3, a complete flowchart of a method for training a wake model includes the steps of:
step S301, when model training is triggered, a first training set and a second training set are obtained, wherein the first training set comprises initial voice feature data for training an initial acoustic model, and the second training set comprises new voice feature data of the missed-wake/false-wake speech corresponding to the current acoustic model during wake-up speech recognition;
step S302, inputting the second training set into the initial acoustic model, determining a loss function for adjusting the initial acoustic model by comparing the output result of the initial acoustic model with the one-hot codes corresponding to the wake-up speech that the current acoustic model can recognize, and adjusting the parameters of each network layer in the initial acoustic model with the loss function to obtain the current acoustic model;
specifically, the loss function is determined by the cross entropy calculated from the difference between the one-hot probability distribution and the posterior probability distribution of the new voice feature data input into the initial acoustic model, where p(x) represents the posterior probability distribution of the new voice feature data input into the initial acoustic model and p_emp(x) represents the one-hot probability distribution, so the cross entropy obtained is:

H(p_emp, p) = -Σ_x p_emp(x)·log p(x)
Determining a loss function by using the cross entropy, and adjusting parameters of each network layer in the initial acoustic model according to the loss function to obtain a current acoustic model;
step S303, the first training set is respectively input into an initial acoustic model and a current acoustic model, and the relative entropy is determined by comparing output results of the initial acoustic model and the current acoustic model;
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
and determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Wherein p_si(x) represents the first probability distribution of the initial voice feature data in the initial acoustic model, and p1(x) represents the second probability distribution of the initial voice feature data in the current acoustic model;
the relative entropy is then:

D_kl(p_si, p1) = Σ_x p_si(x)·log( p_si(x) / p1(x) )
step S304, inputting the second training set into the current acoustic model, and determining the cross entropy by comparing the output result of the current acoustic model with the one-hot codes corresponding to the wake-up speech that the current acoustic model can recognize;
cross entropy determination loss function using differential calculation of a unithermal coding form probability distribution and a posterior probability distribution of new speech feature data input to a current acoustic model, where p2 (x) represents a third probability distribution of new speech feature data input to an initial acoustic model, and p emp (x) Representing a fourth probability distribution of the single thermal encoding form, the resulting cross entropy is:
step S305, determining a loss function for adjusting the current acoustic model according to the relative entropy and the cross entropy, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
The loss function is defined from the cross entropy and the relative entropy as follows:

J_kld = (1 - α)·H(p_emp, p2) + α·D_kl(p_si, p1)
wherein α is the weight coefficient controlling the balance between the cross entropy and the KL divergence; it is generally set empirically to the constant value 0.25, and the value is increased if the amount of feature data is large. The gradient vector is determined layer by layer using chain-rule differentiation, and the network layer parameters of the current acoustic model are adjusted in the direction opposite to the gradient, yielding the updated current acoustic model.
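Putting the two difference parameters together, the loss of this step could be computed roughly as in the sketch below, assuming PyTorch tensors holding the per-frame distributions; the small epsilon is added only to avoid log(0) and is not part of the patent's formula:

```python
# Hedged sketch of J_kld = (1 - α)·H(p_emp, p2) + α·D_kl(p_si, p1).
import torch

ALPHA = 0.25   # weight between cross entropy and KL divergence, per the empirical value above

def combined_loss(p_si, p1, p_emp, p2, alpha=ALPHA, eps=1e-8):
    # p_si:  initial model's output on the first training set   (first probability distribution)
    # p1:    current model's output on the first training set   (second probability distribution)
    # p2:    current model's output on the second training set  (third probability distribution)
    # p_emp: one-hot targets for the second training set        (fourth probability distribution)
    cross_entropy = -(p_emp * torch.log(p2 + eps)).sum(dim=-1).mean()
    kl_divergence = (p_si * (torch.log(p_si + eps) - torch.log(p1 + eps))).sum(dim=-1).mean()
    return (1 - alpha) * cross_entropy + alpha * kl_divergence
```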
Finally, after the current acoustic model is obtained, the speech feature data of the wake-up voice is input into the current acoustic model for recognition, yielding the posterior probability distribution of each speech frame; a decoder calculates the wake-up confidence corresponding to the posterior probability distribution, and the current acoustic model determines, according to the wake-up confidence and the wake-up threshold, whether to send a wake-up command to the device to be woken.
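The decoder and confidence measure are not detailed in the patent; as a loose sketch only, a smoothed-posterior confidence over hypothetical keyword units can be thresholded as below, where the unit indices, window length, and threshold are all assumed values.

```python
import numpy as np

def wake_confidence(posteriors, keyword_units, window=30):
    """posteriors: (frames, units) posterior distribution from the acoustic model.
    Confidence = geometric mean over keyword units of each unit's peak
    posterior within the most recent `window` frames."""
    recent = posteriors[-window:]
    peaks = np.array([recent[:, u].max() for u in keyword_units])
    return float(np.prod(peaks) ** (1.0 / len(keyword_units)))

posteriors = np.random.dirichlet(np.ones(8), size=100)  # stand-in model output
WAKE_THRESHOLD = 0.6                                     # assumed threshold
if wake_confidence(posteriors, keyword_units=[2, 5, 7]) > WAKE_THRESHOLD:
    print("send wake-up command to the device")
```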
An embodiment of the invention provides an apparatus for training a wake-up model, which comprises a memory and a processor, wherein the memory is used for storing instructions;
Fig. 4 shows an apparatus for training a wake-up model according to an embodiment of the present invention. The apparatus 400 may vary considerably in configuration and performance, and may include one or more processors (central processing units, CPU) 401 (for example, one or more processors), a memory 402, and one or more storage media 403 (for example, one or more mass storage devices) storing application programs 404 or data 406. The memory 402 and the storage medium 403 may be transitory or persistent storage. The program stored in the storage medium 403 may include one or more modules (not shown), each of which may include a series of instruction operations for the apparatus. Further, the processor 401 may be arranged to communicate with the storage medium 403 so as to execute, on the apparatus 400, the series of instruction operations stored in the storage medium 403.
The apparatus 400 may also include one or more power supplies 409, one or more wired or wireless network interfaces 407, one or more input/output interfaces 408, and/or one or more operating systems 405, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
The processor is used for reading the instructions in the memory, and the implementation method comprises the following steps:
when the model training is triggered, a first training set and a second training set are obtained, wherein the first training set comprises initial speech feature data for training the initial acoustic model, and the second training set comprises new speech feature data corresponding to the missed wake-up/false wake-up voices of the current acoustic model in the wake-up voice recognition process;
respectively inputting the first training set into an initial acoustic model and a current acoustic model, and determining a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model;
and adjusting model parameters of the current acoustic model according to the first difference parameters and the second difference parameters.
Optionally, the processor is configured to determine a first difference parameter, including:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
And determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Optionally, the processor is configured to determine a second difference parameter, including:
acquiring a third probability distribution corresponding to a wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot codes;
and determining cross entropy according to the difference between the third probability distribution and the fourth probability distribution.
Optionally, after the second difference parameter is determined, the processor is further configured to perform the following:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the processor is configured to adjust model parameters of an initial acoustic model according to the second difference parameter, including:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the processor is configured to adjust a model parameter of the current acoustic model according to the first difference parameter and the second difference parameter, including:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or the new speech feature data used by the processor includes at least one of the following (an illustrative feature-extraction sketch follows this list):
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
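For concreteness only, the sketch below extracts MFCC or log-Mel filter-bank (FBANK) features with librosa; the sampling rate, feature dimensions, and file path are assumptions, and PLP extraction is not shown because librosa does not provide it out of the box.

```python
import librosa

def extract_features(wav_path, kind="mfcc"):
    """Return (frames, dims) MFCC or log-Mel filter-bank (FBANK) features."""
    y, sr = librosa.load(wav_path, sr=16000)
    if kind == "mfcc":
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    return librosa.power_to_db(mel).T

# feats = extract_features("wake_word.wav", kind="fbank")  # hypothetical file
```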
Optionally, the processor is configured to obtain the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model, including:
acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR voice model, and determining the one-hot codes corresponding to the voice units that can be recognized by the current acoustic model according to the recognition result of the ASR voice model, wherein the voice unit comprises at least one of a phoneme state, a phoneme and a word (an illustrative one-hot encoding sketch follows).
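As a purely illustrative sketch of turning recognizable speech units into one-hot codes (the ASR recognition and alignment themselves are outside this snippet), the unit inventory and names below are hypothetical.

```python
import numpy as np

# hypothetical inventory of speech units for a wake word; in practice the
# units (phoneme states, phonemes, or words) come from the ASR model output
UNITS = ["sil", "ph_1", "ph_2", "ph_3", "ph_4"]
UNIT_INDEX = {u: i for i, u in enumerate(UNITS)}

def one_hot(unit):
    """One-hot code for a single speech unit."""
    vec = np.zeros(len(UNITS), dtype=np.float32)
    vec[UNIT_INDEX[unit]] = 1.0
    return vec

# one-hot codes for the unit sequence of a hypothetical wake word
codes = np.stack([one_hot(u) for u in ["ph_1", "ph_2", "ph_3"]])
```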
An embodiment of the invention provides an apparatus for training a wake-up model, as shown in Fig. 5, which comprises the following modules:
the training set acquisition module 501 is configured to acquire a first training set and a second training set when model training is triggered, wherein the first training set comprises initial speech feature data for training the initial acoustic model, and the second training set comprises new speech feature data corresponding to the missed wake-up/false wake-up voices of the current acoustic model in the wake-up voice recognition process;
A first difference parameter determining module 502, configured to input the first training set to an initial acoustic model and a current acoustic model, and determine a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
a second difference parameter determining module 503, configured to input the second training set into the current acoustic model, and determine a second difference parameter by comparing the output result of the current acoustic model with the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model;
the model adjustment module 504 is configured to adjust a model parameter of the current acoustic model according to the first difference parameter and the second difference parameter.
Optionally, the first difference parameter determining module 502 is configured to determine the first difference parameter, including:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
and determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
Optionally, the second difference parameter determining module 503 is configured to determine a second difference parameter, including:
acquiring a third probability distribution corresponding to a wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot codes;
and determining cross entropy according to the difference between the third probability distribution and the fourth probability distribution.
Optionally, the apparatus further comprises a current acoustic model determining module 505, which, after the second difference parameter is determined, is configured to perform the following:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
Optionally, the current acoustic model determining module 505 is configured to adjust model parameters of an initial acoustic model according to the second difference parameter, including:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
Optionally, the model adjustment module 504 is configured to adjust a model parameter of the current acoustic model according to the first difference parameter and the second difference parameter, including:
and determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
Optionally, the initial speech feature data or new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
Optionally, the second difference parameter determining module 503 is configured to obtain the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model, including:
acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR voice model, and determining the one-hot codes corresponding to the voice units that can be recognized by the current acoustic model according to the recognition result of the ASR voice model, wherein the voice unit comprises at least one of a phoneme state, a phoneme and a word.
An embodiment of the present application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a method of training a wake model as provided in any of the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. A method of training a wake model, the method comprising:
when the model training is triggered, a first training set and a second training set are obtained, wherein the first training set comprises initial speech feature data for training an initial acoustic model, and the second training set comprises new speech feature data corresponding to the missed wake-up/false wake-up voices of the current acoustic model in the wake-up voice recognition process;
Respectively inputting the first training set into an initial acoustic model and a current acoustic model, and determining a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model;
according to the first difference parameter and the second difference parameter, adjusting model parameters of a current acoustic model;
wherein acquiring the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model comprises the following steps:
acquiring preset wake-up voice information which can be identified by a current acoustic model;
and inputting the preset wake-up voice information into an ASR voice model, and determining the one-hot codes corresponding to the voice units that can be recognized by the current acoustic model according to the recognition result of the ASR voice model, wherein the voice unit comprises at least one of a phoneme state, a phoneme and a word.
2. The method of claim 1, wherein determining the first difference parameter comprises:
acquiring a first probability distribution corresponding to a wake-up voice recognition result output by the initial acoustic model and a second probability distribution corresponding to the wake-up voice recognition result output by the current acoustic model;
And determining the relative entropy according to the difference between the first probability distribution and the second probability distribution.
3. The method of claim 1, wherein determining the second difference parameter comprises:
acquiring a third probability distribution corresponding to a wake-up voice recognition result output by the current acoustic model and a fourth probability distribution corresponding to the one-hot codes;
and determining cross entropy according to the difference between the third probability distribution and the fourth probability distribution.
4. The method of claim 1, further comprising, after determining the second difference parameter:
and when the current acoustic model is determined to be the initial acoustic model, adjusting model parameters of the initial acoustic model according to the second difference parameters.
5. The method of claim 4, wherein adjusting model parameters of an initial acoustic model based on the second difference parameters comprises:
and determining a loss function for adjusting the initial acoustic model according to the second difference parameters, and adjusting parameters of each network layer in the initial acoustic model by using the loss function to obtain the current acoustic model.
6. The method of claim 1, wherein adjusting model parameters of a current acoustic model based on the first and second difference parameters comprises:
And determining a loss function for adjusting the current acoustic model according to the first difference parameter and the second difference parameter, and adjusting parameters of each network layer in the current acoustic model by using the loss function.
7. The method of claim 1, wherein the initial speech feature data or new speech feature data comprises at least one of:
Mel-frequency cepstral coefficient (MFCC) feature data;
perceptual linear prediction (PLP) feature data;
Mel-scale filter bank (FBANK) feature data.
8. An apparatus for training a wake model, the apparatus comprising: a memory for storing instructions;
a processor for reading instructions in said memory implementing a method of training a wake model as claimed in any one of claims 1 to 7.
9. An apparatus for training a wake model, the apparatus comprising:
the training set acquisition module is configured to acquire a first training set and a second training set when model training is triggered, wherein the first training set comprises initial speech feature data for training an initial acoustic model, and the second training set comprises new speech feature data corresponding to the missed wake-up/false wake-up voices of the current acoustic model in the wake-up voice recognition process;
The first difference parameter determining module is used for inputting the first training set into an initial acoustic model and a current acoustic model respectively, and determining a first difference parameter by comparing output results of the initial acoustic model and the current acoustic model;
the second difference parameter determining module is used for inputting the second training set into the current acoustic model, and determining a second difference parameter by comparing the output result of the current acoustic model with the one-hot codes corresponding to the wake-up voice that can be recognized by the current acoustic model;
the model adjustment module is used for adjusting the model parameters of the current acoustic model according to the first difference parameters and the second difference parameters;
the second difference parameter determining module is specifically configured to obtain preset wake-up voice information that can be identified by the current acoustic model;
and inputting the preset wake-up voice information into an ASR voice model, and determining the one-hot codes corresponding to the voice units that can be recognized by the current acoustic model according to the recognition result of the ASR voice model, wherein the voice unit comprises at least one of a phoneme state, a phoneme and a word.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement a method of training a wake model as claimed in any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010461982.XA CN111667818B (en) | 2020-05-27 | 2020-05-27 | Method and device for training wake-up model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111667818A (en) | 2020-09-15 |
| CN111667818B (en) | 2023-10-10 |
Family
ID=72384785
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010461982.XA Active CN111667818B (en) | 2020-05-27 | 2020-05-27 | Method and device for training wake-up model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111667818B (en) |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112185382B (en) * | 2020-09-30 | 2024-03-08 | 北京猎户星空科技有限公司 | Method, device, equipment and medium for generating and updating wake-up model |
| CN112435656B (en) * | 2020-12-11 | 2024-03-01 | 平安科技(深圳)有限公司 | Model training method, speech recognition method, device, equipment and storage medium |
| CN112712801B (en) * | 2020-12-14 | 2024-02-02 | 北京有竹居网络技术有限公司 | Voice wakeup method and device, electronic equipment and storage medium |
| CN113096647B (en) * | 2021-04-08 | 2022-11-01 | 北京声智科技有限公司 | Voice model training method and device and electronic equipment |
| CN113608664B (en) * | 2021-07-26 | 2024-06-18 | 京东科技控股股份有限公司 | Intelligent voice robot interaction effect optimization method and device and intelligent robot |
| CN113782016B (en) * | 2021-08-06 | 2023-05-05 | 佛山市顺德区美的电子科技有限公司 | Wakeup processing method, wakeup processing device, equipment and computer storage medium |
| CN113436629B (en) * | 2021-08-27 | 2024-06-04 | 中国科学院自动化研究所 | Voice control method, voice control device, electronic equipment and storage medium |
| CN113782012B (en) * | 2021-09-10 | 2024-03-08 | 北京声智科技有限公司 | Awakening model training method, awakening method and electronic equipment |
| CN115985300B (en) * | 2021-10-15 | 2025-09-16 | 赛微科技股份有限公司 | Intelligent expansion similar word model system |
| CN114078472A (en) * | 2021-11-08 | 2022-02-22 | 北京核芯达科技有限公司 | A training method and device for a keyword calculation model with a low false arousal rate |
| CN114220418A (en) * | 2021-12-17 | 2022-03-22 | 四川启睿克科技有限公司 | Wake-up word recognition method and device for target speaker |
| CN114565807B (en) * | 2022-03-03 | 2024-09-24 | 腾讯科技(深圳)有限公司 | Method and device for training target image retrieval model |
| CN114842855B (en) * | 2022-04-06 | 2025-09-05 | 北京百度网讯科技有限公司 | Voice wake-up model training, wake-up method, device, equipment and storage medium |
| CN115223555B (en) * | 2022-06-09 | 2025-11-25 | 中国科学技术大学 | Voice wake-up methods, acoustic model training methods and related devices |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107221326A (en) * | 2017-05-16 | 2017-09-29 | 百度在线网络技术(北京)有限公司 | Voice awakening method, device and computer equipment based on artificial intelligence |
| CN107610702A (en) * | 2017-09-22 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | Terminal device standby wakeup method, apparatus and computer equipment |
| CN108335696A (en) * | 2018-02-09 | 2018-07-27 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device |
| CN109545194A (en) * | 2018-12-26 | 2019-03-29 | 出门问问信息科技有限公司 | Wake up word pre-training method, apparatus, equipment and storage medium |
| CN109801636A (en) * | 2019-01-29 | 2019-05-24 | 北京猎户星空科技有限公司 | Training method, device, electronic equipment and the storage medium of Application on Voiceprint Recognition model |
| US10332513B1 (en) * | 2016-06-27 | 2019-06-25 | Amazon Technologies, Inc. | Voice enablement and disablement of speech processing functionality |
| CN110459204A (en) * | 2018-05-02 | 2019-11-15 | Oppo广东移动通信有限公司 | Speech recognition method, device, storage medium and electronic device |
| CN110534099A (en) * | 2019-09-03 | 2019-12-03 | 腾讯科技(深圳)有限公司 | Voice wakes up processing method, device, storage medium and electronic equipment |
| CN110808027A (en) * | 2019-11-05 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Voice synthesis method and device and news broadcasting method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111667818A (en) | 2020-09-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111667818B (en) | Method and device for training wake-up model | |
| CN108320733B (en) | Voice data processing method and device, storage medium and electronic equipment | |
| CN109326302B (en) | Voice enhancement method based on voiceprint comparison and generation of confrontation network | |
| CN110534099B (en) | Voice wake-up processing method and device, storage medium and electronic equipment | |
| US20180190280A1 (en) | Voice recognition method and apparatus | |
| WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
| CN112102850B (en) | Emotion recognition processing method and device, medium and electronic equipment | |
| Das et al. | Recognition of isolated words using features based on LPC, MFCC, ZCR and STE, with neural network classifiers | |
| JP3584458B2 (en) | Pattern recognition device and pattern recognition method | |
| CN108564940A (en) | Audio recognition method, server and computer readable storage medium | |
| EP2363852B1 (en) | Computer-based method and system of assessing intelligibility of speech represented by a speech signal | |
| CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
| CN114171009B (en) | Voice recognition method, device, equipment and storage medium for target equipment | |
| CN113096647B (en) | Voice model training method and device and electronic equipment | |
| KR100826875B1 (en) | On-line speaker recognition method and apparatus therefor | |
| CN114127849A (en) | Speech emotion recognition method and device | |
| CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
| CN112825250A (en) | Voice wake-up method, apparatus, storage medium and program product | |
| CN105788596A (en) | Speech recognition television control method and system | |
| KR20200126675A (en) | Electronic device and Method for controlling the electronic device thereof | |
| CN113744734A (en) | A voice wake-up method, device, electronic device and storage medium | |
| CN114078472A (en) | A training method and device for a keyword calculation model with a low false arousal rate | |
| CN119744416A (en) | System and method for detecting wake-up command of voice assistant | |
| CN112185357A (en) | Device and method for simultaneously recognizing human voice and non-human voice | |
| Salekin et al. | Distant emotion recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |