
GB2546325B - Speaker-adaptive speech recognition - Google Patents

Speaker-adaptive speech recognition

Info

Publication number
GB2546325B
GB2546325B
Authority
GB
United Kingdom
Prior art keywords
speaker
adaptive
training
test
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
GB1600842.7A
Other versions
GB2546325A (en)
GB201600842D0 (en)
Inventor
Doddipatla Rama
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1600842.7A priority Critical patent/GB2546325B/en
Publication of GB201600842D0 publication Critical patent/GB201600842D0/en
Priority to US15/407,663 priority patent/US10013973B2/en
Priority to JP2017007052A priority patent/JP6437581B2/en
Publication of GB2546325A publication Critical patent/GB2546325A/en
Application granted granted Critical
Publication of GB2546325B publication Critical patent/GB2546325B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • G10L15/07Adaptation to the speaker
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Monitoring And Testing Of Exchanges (AREA)

Description

Speaker-adaptive speech recognition
Field
The present disclosure relates to methods and systems for recognising sounds in speech spoken by an individual. The systems may be components of apparatus for taking actions based on the recognised sounds.
Background
In recent years progress has been made in devising automatic speech recognition (ASR) systems which receive input data (generated by a microphone) which encodes speech spoken by a speaker - here referred to as a “test speaker” - and from it recognise phonemes spoken by the test speaker. A phoneme is a set of one or more “phones”, which are individual units of sound. Typically, the input data is initially processed to generate feature data indicating whether the input data has certain input features, and the feature data is passed to a system which uses it to recognise the phones. The phones may be recognised as individual phones (“mono-phones”), or pairs of adjacent phones (“diphones”), or sequences of three phones (“triphones”).
Since multiple individuals speak in different respective ways, it is desirable for the system which recognises the phones to be adapted to the speech of the test speaker, and for the adaptation to be performed automatically using training data which is speech spoken by the test speaker.
Desirably, the volume of training data which the test speaker is required to speak should be minimised. For that reason, conventional ASR systems are trained using data from many other speakers (“training speakers”) for whom training data is available. Since there is a huge amount of speaker variability in the data used for training the system, the performance can be very poor for an unknown test speaker. Speaker adaptation, which either transforms the features of the test speaker to better match the trained model or transforms the model parameters to better match the test speaker, has been found to improve the ASR performance.
Many adaptive systems are known. Recently there has been increasing interest in so-called deep neural networks (DNN). A deep neural network is an artificial neural network with more than one hidden layer between the input and output layers. Each layer is composed of one or more neurons, and each neuron performs a function of its inputs which is defined by a set of network parameters, such as numerical weights. DNNs are typically designed as feedforward networks, although recurrent forms of DNN also exist. In feedforward networks, each neuron in the first layer of neurons receives multiple input signals; in each successive layer, each neuron receives the output of multiple neurons in the preceding layer.
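To make the feedforward structure concrete, the following is a minimal sketch of such a network, assuming PyTorch; the layer sizes and the sigmoid non-linearity are illustrative assumptions, not values taken from this patent.

```python
import torch
import torch.nn as nn

# Minimal feedforward DNN: each neuron computes a function of a weighted sum of
# its inputs, and each layer receives the outputs of all neurons in the layer before it.
class FeedForwardDNN(nn.Module):
    def __init__(self, n_inputs, hidden_sizes, n_outputs):
        super().__init__()
        layers, prev = [], n_inputs
        for h in hidden_sizes:
            layers += [nn.Linear(prev, h), nn.Sigmoid()]  # the weights are the trainable network parameters
            prev = h
        layers.append(nn.Linear(prev, n_outputs))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Example: 40 input features, two hidden layers of 512 neurons, 10 output classes.
dnn = FeedForwardDNN(40, [512, 512], 10)
scores = dnn(torch.randn(8, 40))  # a batch of 8 feature vectors
```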
Speaker adaptive training (SAT) is an approach to perform speaker adaptation in ASR, where speaker variability is normalized both in training and recognition. SAT improves acoustic modelling and can be helpful both in DNN-based automatic speech recognition (ASR) and speech synthesis. Speaker adaptation in DNNs is performed either by transforming the input features before training the DNN or by tuning parameters of the DNN using the test-speaker-specific data. A wide range of systems have been proposed using both approaches. For approaches that transform the input features before training the DNN, the primary drawback is that the DNN has to be re-trained once a new feature transformation is applied. For approaches that tune the network parameters, the DNN typically requires more adaptive parameters, so the primary challenge is to tune those parameters with the limited data available from the test speaker.
Summary of the invention
The present invention aims to provide new and useful methods and systems for generating an ASR system.
It further aims to provide new and useful ASR systems, and methods of using them to perform speech recognition.
In general terms the invention proposes that an adaptive model component is provided for each of a number of training speakers. Each adaptive model component is trained, as part of an adaptive network having an input network component (typically a plurality of layers of neurons) and the adaptive model component, using training data for the corresponding training speaker. Thus, for each of the training speakers, a corresponding training-speaker-specific adaptive model component is formed.
Preferably, the adaptive network also includes an adaptive output network component (at least one further layer of neurons) receiving the output of the adaptive model component. However, this is not necessary, since examples of the invention may be formed in which the adaptive model component is used to produce outputs which are directly indicative of a phone, e.g. a mono-phone, for example outputs which can be formatted by a non-adaptive output layer as a signal indicating a mono-phone.
Then, a speaker-adaptive DNN model is trained, successively using each of the training-speaker-specific adaptive model components and training data for the corresponding training speaker.
When training data is available for a test speaker, a further adaptive model is formed comprising the input network component, an adaptive model component and the output network component (if any). Within this further adaptive model, the adaptive model component is trained using the training data for the test subject. Thus, the adaptive model component becomes specific to the test subject. A test-speaker-specific adaptive system is formed from the input network component, the trained test-speaker-specific bottleneck layer, and the speaker-adaptive DNN model. Note that the input network component and the speaker-adaptive DNN model do not have to be changed using the training data for the test speaker: they are both formed solely using the training data for the training speakers, as described above. The test-speaker-specific adaptive system is well-adapted for recognising the speech of the test speaker.
The adaptive model components have the same size for each of the training speakers and the test speaker. They may have a much smaller number of network variables than the number of variables of the speaker-adaptive DNN model, the input network component or the output network component (if any).
For this reason, the amount of training data for the test speaker which is needed to train the test-speaker-specific adaptive model component is low: much lower than the amount of training data from the training speakers which is used to obtain the speaker-adaptive DNN model. In other words, an example of the invention may be used when there is little data available from the test speaker.
For example, each adaptive model component may be defined by fewer than 10%, or even fewer than 5%, of the number of neurons in the input network component. Similarly, it may contain fewer than 10%, or even fewer than 5%, of the number of neurons of the speaker-adaptive DNN model.
Each adaptive model component may be a single layer in which each of the neurons receives outputs of the input network component. For that reason, the adaptive model component may be referred to as a “bottleneck layer”, since it may form a layer of the complete test-speaker-specific adaptive system which has a smaller number of neurons than either a layer of the input network component or a layer of the speaker-adaptive DNN model.
The input network component and the speaker-specific adaptive model component (together referred to as the “first stage”) primarily act as a feature extractor, to provide input for the speaker-adaptive DNN (“second stage”). The number of neurons in the hidden layers of the first stage, and particularly the number of neurons in the adaptive model component, can be much smaller than the dimension of the hidden layers in the speaker-adaptive DNN (second-stage DNN). This means that there are fewer parameters to estimate, which can be very helpful for online recognition (e.g. during recognition of the test speaker, the system can be tuned to perform better, using as little as one minute of speech data from the test speaker).
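To make the parameter-count argument concrete, a rough calculation is given below using the illustrative layer sizes from the detailed description (a 40-dimensional FBANK input, three 512-neuron input layers, a 75-neuron bottleneck, and 2048-neuron second-stage layers with 2281 tri-phone targets); all of these sizes are examples, and only the bottleneck parameters are speaker-specific.

```python
# Rough parameter counts (weights + biases) for the illustrative layer sizes
# used in the detailed description; only the bottleneck layer is speaker-specific.
fbank_dim = 40        # mel-FBANK dimension (assumed)
hidden = 512          # input network component layer width
bottleneck = 75       # adaptive model component ("bottleneck layer") width
stage2_hidden = 2048  # speaker-adaptive (second-stage) DNN layer width
stage2_in = bottleneck * 11   # +/- 5 frames of context -> 825-dimensional input
triphones = 2281

def dense(n_in, n_out):
    return n_in * n_out + n_out   # weight matrix plus biases

input_net = dense(fbank_dim, hidden) + 2 * dense(hidden, hidden)   # ~546k parameters
bn_layer = dense(hidden, bottleneck)                               # ~38k parameters (speaker-specific)
stage2 = (dense(stage2_in, stage2_hidden) + 2 * dense(stage2_hidden, stage2_hidden)
          + dense(stage2_hidden, triphones))                       # ~14.8M parameters

print(input_net, bn_layer, stage2)   # 546304 38475 14758121
```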
The input network component, and output network component (if any), of the adaptive model used to train the training-speaker-specific adaptive model components are preferably produced during an initial training procedure in which an adaptive model comprising the input network component, a generic adaptive model component and the output network component (if any), is trained using the training data for the training speakers.
In this training procedure, and/or in the subsequent training procedure in which the training-speaker-specific adaptive model components are produced, and/or in the subsequent procedure in which the test-speaker-specific adaptive model components are produced, the adaptive model is preferably trained to produce signals indicating mono-phones. However, this is merely an option. For example, it is alternatively possible for the example of the invention to use triphones in each step.
By contrast, during the training procedure which produces the speaker-adaptive DNN, the speaker-adaptive DNN is trained to generate signals indicating tri-phones.
The training data for the test speaker may take the form of data comprising a series of recorded utterances from the test speaker, and associated phones (i.e. the training method uses training data for the test speaker in which the sounds have already been decoded as phones), preferably triphones. In this case, the training of the test-speaker-specific adaptive model component may be supervised learning.
Alternatively, the training data for the test speaker may not include the associated phones (i.e. the training method does not employ training data for the test speaker in which the sounds have already been decoded as phones). In this case, the algorithm may include a preliminary step of using each element of the training data for the test speaker to produce a corresponding first estimate (“first pass”) of the associated phones.
This first estimate may be in the form of triphones. Conveniently, this may be done by feeding the training data for the test subject into an adaptive network comprising the input network component, the trained generic adaptive model component, and a “speaker independent” DNN, which has been trained, successively using training data from the training speakers, to generate triphones using the output of the generic adaptive model component. The output of the adaptive network is the first estimate of the associated triphone. The test data for the test speaker, and the associated first estimate of the associated triphone, are then used to train the test-speaker-specific adaptive model component in a supervised learning process. In other words, although the training procedure as a whole is unsupervised (since it does not use training data for the test speaker in which the sounds have already been decoded as phonemes), the step of generating the test-speaker-specific adaptive model may be performed using a supervised learning algorithm.
In all of the adaptive networks discussed above, the signals input to the input network component are typically the output of a filter bank which identifies features in the speech of the user captured by a microphone. The speech of the test speaker is captured using a microphone, and passed through the filter bank before being transmitted to the input network component of the test-speaker-specific adaptive model.
The proposed approach facilitates integration of feature transformation approaches with approaches which tune the model parameters for DNNs to perform speaker adaptation.
Optionally, the training data for the training speakers may be pre-generated data stored in a database. If this training data is stored in the form of data which was output by the filter bank, then the filter bank does not need to be used again during the training process which produces the training-speaker-specific adaptive network component, and the speaker-adaptive DNN.
The proposed approach has been shown to improve performance when the test-speaker-specific bottleneck is generated by both supervised and unsupervised adaptation.
Optionally, the step of generating the test-speaker-specific adaptive model component may be repeated at intervals, and the test-speaker-specific adaptive network is updated with the most recent test-speaker-specific adaptive model component. In this way, the test-speaker-specific adaptive network may be updated for changes in the acoustic environment of the test speaker. The updating process may be performed on a predefined timetable (e.g. at regular intervals), or following a step of automatically detecting that an update would be beneficial.
The invention may be expressed in terms of a computer-implemented method of generating the test-speaker-specific adaptive system, or a computer system for performing the method, or a computer program product (such as a tangible data storage device) including program instructions (e.g. in non-transitory form) for causing a computer system to perform the methods.
Optionally, the invention may be expressed only in terms of the steps carried out using the training data from the test speaker. This is because the steps using the training data from the training speakers may be carried out in advance, and optionally by a different computer system.
The invention may furthermore be expressed as a method or a system for using the test-speaker-specific adaptive system to recognise speech from the test speaker. The recognised speech may be converted into words. The method or system may use those words to select actions, and optionally perform those actions.
Description of the drawings
An example of the invention will now be described with reference to the following drawings in which:
Fig. 1 is a flow diagram of steps of a method which is an example of the invention to produce a test-speaker-specific adaptive system;
Fig. 2 illustrates schematically a computer system for performing the method of Fig. 1;
Fig. 3 illustrates an adaptive model which is trained in a step of the method of Fig. 1;
Fig. 4 illustrates a further adaptive model which is trained in a step of the method of Fig. 1;
Fig. 5 illustrates a further adaptive model which is trained in a step of the method of Fig. 1;
Fig. 6 illustrates a further adaptive model which is trained in a step of the method of Fig. 1;
Fig. 7 illustrates a further adaptive model which is trained in a step of the method of Fig. 1; and
Fig. 8 illustrates a further adaptive model which is trained in a step of the method of Fig. 1.
Detailed description
Referring to Fig. 1, a flow-diagram is shown of a method which is an example of the invention.
The method may be performed by a computer system 10 shown in Fig. 2. The computer system includes a processor 11, a data storage system 12 and a microphone 13. The processor 11 is controlled by program instructions in a first memory device 111, and generates data which it stores in a second memory device 112. The computer system 10 may, for example, be a general computer system, such as a workstation PC (personal computer) or tablet computer. Alternatively, the processor 11 may be a processor of a server system. In another possibility the processor 11 may be a portion of a larger apparatus which it is desired to provide with ASR capability, such as a car, or an item of home or office equipment.
The data storage system 12 is for storing training data. It includes a first database 14 which is used for storing training data for a test speaker. The forms this training data may take are described below. The data storage system 12 further includes a database 15 for storing training data for N training speakers, labelled i=1,...,N. The database 15 is divided into N respective sections 151, 152, ..., 15N, which respectively store training data for each of the N training speakers.
The training data for each training speaker stored in the corresponding one of the database sections 151, 152, ..., 15N consists of a first portion which is raw sound data recorded by a microphone. The sound data is divided into successive portions referred to here as frames. The training data further includes a second portion which, for each frame, indicates the phone which the training speaker spoke at the corresponding time. The frames are of equal length, and each frame is associated with one mono-phone or tri-phone. The first portion of the data may have been recorded by the microphone 13. Alternatively, the first and second portions of the data may have been obtained from a pre-existing database, such as one generated by a third party.
1. Training the Bottleneck DNN (step 1 of Fig. 1)
The first step (step 1) of the method of Fig. 1 is performed using an adaptive system 20 as illustrated in Fig. 3. The adaptive system 20 exists only virtually in the computer system 10. It receives the output of a filter bank (FBANK) 16 for receiving and processing raw sound data 17. As mentioned below, as step 1 is carried out, the raw sound data 17 is successively drawn from the raw sound data in the first portions of the database sections 151, 152, ..., 15N. At any instant, the raw sound data input to the FBANK 16 is one of the frames.
The filter bank FBANK 16 may be a mel FBANK. However, in variations of the example of the invention described below, the FBANK 16 may be replaced, throughout the following description, by one of (i) a mel FBANK plus a D-vector unit (a D-vector is an additional component appended to the FBANK features; this is described in Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno and Jorge Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification”, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014), (ii) a mel FBANK plus a constrained maximum likelihood linear regression (CMLLR) unit (this is described in S. P. Rath, D. Povey, K. Vesely and J. Cernocky, “Improved feature processing for deep neural networks”, in Proc. of INTERSPEECH, 2013; note that CMLLR is not an appended feature like a D-vector), or (iii) a mel FBANK plus a CMLLR unit and a D-vector unit.
The FBANK 16 identifies whether a plurality of features are present in the raw sound data it receives at any time (a “sound item”). It generates a plurality of output signals which indicate whether these features are present in the sound item.
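As an illustration of such a front end, the sketch below computes 40-dimensional log-mel filter-bank features, assuming torchaudio; the 40-band dimensionality matches the experiments reported later, while the sample rate and frame length/shift are typical ASR defaults and are assumptions.

```python
import torch
import torchaudio

# 40-band log-mel filter bank applied to raw audio; 25 ms frames with a 10 ms
# shift at 16 kHz are assumed defaults, not values taken from the patent.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)

def fbank_features(waveform):
    """waveform: tensor of shape (1, num_samples). Returns (num_frames, 40) features."""
    spec = mel(waveform)                    # (1, 40, num_frames)
    log_spec = torch.log(spec + 1e-6)       # log compression
    return log_spec.squeeze(0).transpose(0, 1)

wav = torch.randn(1, 16000)                 # one second of dummy audio
print(fbank_features(wav).shape)            # e.g. torch.Size([101, 40])
```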
The plurality of output signals of the FBANK 16 are fed to inputs of an adaptive input network component 18, which is typically a DNN. The input network component 18 depicted in Fig. 3 has three layers 18a, 18b, 18c, but in variations of the example of the invention there may be any integer number of layers. Each of these layers has a plurality of neurons, e.g. 512 neurons per layer. Each neuron receives multiple inputs and generates one output. Each neuron of the first layer 18a receives all the outputs of the FBANK 16. Each of the neurons in the second layer 18b receives the outputs of all the neurons in the first layer 18a. Each of the neurons in the third layer 18c receives the outputs of all the neurons in the second layer 18b.
The outputs of the input network component 18 (i.e. the respective outputs of the neurons in the third layer 18c) are fed to an adaptive model component 19 referred to as a “bottleneck layer” 19. The bottleneck layer 19 is a single layer of neurons which each receive all the outputs of the input network component 18 (i.e. the outputs of the neurons in layer 18c). The number of neurons in the bottleneck layer 19 is much lower than in each layer of the input network component 18. For example, there may be just 75 neurons in the bottleneck layer 19.
The outputs of the neurons in the bottleneck layer 19 are fed as inputs to an adaptive output network component 21. This consists, in this example, of a single layer 21a, which may contain 512 neurons.
The outputs of the output network component 21 are fed to a non-adaptive output layer 27 which formats the outputs of the output network component 21, to produce a signal indicative of a single monophone.
The input network component 18, the bottleneck layer 19 and the output network component 21 are collectively referred to here as a bottleneck DNN (BN-DNN).
Each of the neurons in the layers 18a, 18b, 18c, 19 and 21a forms a respective output which is a function of its inputs, such as a weighted sum of its inputs. The weights are variable parameters. The number of neurons in the bottleneck layer 19 is much lower than in any of the layers 18a, 18b, 18c or 21a (e.g. no more than 20% of the neurons in any one of those layers), and thus only a very small proportion of the total network parameters are associated with the bottleneck layer 19.
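Putting the components of Fig. 3 together, the sketch below assembles a BN-DNN with the layer sizes quoted above (three 512-neuron input layers, a 75-neuron bottleneck, one 512-neuron output layer), assuming PyTorch; the sigmoid non-linearity and the number of mono-phone targets are assumptions.

```python
import torch
import torch.nn as nn

N_FBANK, HIDDEN, BOTTLENECK, N_MONOPHONES = 40, 512, 75, 40  # mono-phone count assumed

# Input network component 18: three fully connected layers of 512 neurons each.
input_net = nn.Sequential(
    nn.Linear(N_FBANK, HIDDEN), nn.Sigmoid(),
    nn.Linear(HIDDEN, HIDDEN), nn.Sigmoid(),
    nn.Linear(HIDDEN, HIDDEN), nn.Sigmoid())

# Bottleneck layer 19: a single, much narrower layer.
bottleneck = nn.Sequential(nn.Linear(HIDDEN, BOTTLENECK), nn.Sigmoid())

# Output network component 21 plus non-adaptive output layer 27
# (scores over mono-phones, represented here as raw logits).
output_net = nn.Sequential(
    nn.Linear(BOTTLENECK, HIDDEN), nn.Sigmoid(),
    nn.Linear(HIDDEN, N_MONOPHONES))

bn_dnn = nn.Sequential(input_net, bottleneck, output_net)
logits = bn_dnn(torch.randn(8, N_FBANK))   # one row of scores per input FBANK frame
```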
In step 1, the adaptive system 20 is trained to associate the raw speech data in the first portions of the database sections 151, 152, ..., 15N with the mono-phones in the second portions of the database sections 151, 152, ..., 15N. That is, the weights of the layers 18a, 18b, 18c, 19 and 21a are gradually modified by known algorithms such that if speech items are successively input to the mel FBANK 16, the outputs of the output network component 21 encode the corresponding mono-phone.
Note that this process is carried out using training data in the database 15 for all the training speakers successively. Thus, the input network component 18, bottleneck layer 19 and output network component 21 are not trained in a way which is specific to any of the training speakers. In particular, the bottleneck layer 19 is trained to become a generic bottleneck layer (i.e. applicable for any of the training speakers).
In most suitable learning algorithms, speech items are presented one-by-one to the FBANK 16, and the network parameters are modified such that the output network component 21 outputs the corresponding mono-phone.
The order in which frames from the training speakers are learnt is not important. In one possibility a randomly chosen one of the frames for the first training speaker may be input to the FBANK 16, and the network parameters are adjusted such that the output of the output network component 21 is indicative of the corresponding mono-phone. Then, the same is done with a randomly chosen one of the frames for the second training speaker. And so on, until the same is done with a randomly chosen one of the frames for the N-th training speaker. Then the entire process is repeated as many times as desired (e.g. until a convergence criterion has been reached).
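A minimal sketch of that interleaved training schedule, assuming PyTorch and the bn_dnn model from the previous sketch; frames[i] and monophones[i] are assumed to hold the FBANK frames and mono-phone label ids for the i-th training speaker, and SGD with a cross-entropy criterion is an assumption about the training recipe.

```python
import random
import torch
import torch.nn as nn

def train_generic_bn_dnn(bn_dnn, frames, monophones, n_updates=100000, lr=1e-3):
    """frames[i]: tensor (n_i, 40); monophones[i]: tensor (n_i,) of mono-phone label ids."""
    opt = torch.optim.SGD(bn_dnn.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    n_speakers = len(frames)
    for step in range(n_updates):              # repeat until a convergence criterion is met
        i = step % n_speakers                  # training speakers are visited in turn
        j = random.randrange(len(frames[i]))   # a randomly chosen frame for that speaker
        x = frames[i][j].unsqueeze(0)
        y = monophones[i][j].unsqueeze(0)
        opt.zero_grad()
        loss_fn(bn_dnn(x), y).backward()       # push the output towards the correct mono-phone
        opt.step()
    return bn_dnn
```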
The reason for using mono-phone targets for training the BN-DNN is to make the bottleneck layer training robust to transcription errors during recognition, and alleviate the problem of data sparsity. Mapping the triphone targets onto mono-phones can be interpreted as state tying, and helps alleviate the problem of data sparsity.
We now describe step 2, and steps 3 and 4. Note that step 2 is independent of steps 3 and 4. It can be performed after steps 3 and 4, or it could be performed at the same time as steps 3 and 4.
2. Training the Speaker Independent (SI) DNN (step 2 of Fig. 1) for recognising the test speaker
In step 2, the trained input network component 18 and trained generic bottleneck layer 19 are used as the first stage of a two-stage adaptive network 25 shown in Fig. 4. Components with the same meaning as in Fig. 3 are given the same reference numerals. The adaptive network comprises the trained input network component 18, which receives the output of the FBANK 16, and the trained generic bottleneck layer 19. The adaptive network further comprises a “stage 2” DNN 22, comprising three layers 22a, 22b, 22c. Each layer may contain 2048 neurons, each of which forms a respective output as a function of a weighted sum of its inputs. Each neuron of the second layer 22b receives the outputs of the neurons in the first layer 22a, and each neuron of the third layer 22c receives the outputs of the neurons in the second layer 22b. Note that in variations of the example of the invention, the number of layers in the stage 2 DNN 22, and the number of neurons per layer, can be different.
As in step 1, speech items for all training speakers are input successively to the FBANK 16, which feeds the first stage of the adaptive network (i.e. the trained input layer 18 and trained bottleneck layer 19). The corresponding resulting output of the bottleneck layer 19 is combined with the respective five outputs of the bottleneck layer 19 when the five succeeding frames for the same training speaker are successively input to the FBANK, and the respective five outputs of the bottleneck layer 19 when the five preceding frames for the same training speaker are successively input to the FBANK, to form a feature vector 26. Note that in variations of the example of the invention, the number of preceding and/or succeeding frames may differ from five, but five such frames were used in our experimental implementation explained below. The feature vector 26 is input to each neuron of the first layer 22a of the stage-2 DNN 22.
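A sketch of that splicing of bottleneck outputs over an eleven-frame window (the current frame plus the five preceding and five succeeding frames) into the 825-dimensional feature vector 26; edge frames are handled here by repeating the first and last frames, which is an assumption since the patent does not specify the edge treatment.

```python
import torch

def splice_bottleneck_outputs(bn_out, context=5):
    """bn_out: (num_frames, 75) bottleneck outputs for one speaker's frame sequence.
    Returns (num_frames, 75 * (2*context + 1)), i.e. 825-dimensional feature vectors 26."""
    first = bn_out[:1].expand(context, -1)    # pad by repeating the first frame (assumption)
    last = bn_out[-1:].expand(context, -1)    # and the last frame
    padded = torch.cat([first, bn_out, last], dim=0)
    windows = [padded[i:i + bn_out.shape[0]] for i in range(2 * context + 1)]
    return torch.cat(windows, dim=1)          # preceding, current and succeeding outputs side by side

print(splice_bottleneck_outputs(torch.randn(100, 75)).shape)   # torch.Size([100, 825])
```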
When a given feature vector 26 is input to the stage-2 DNN, the parameters of the neurons of the stage-2 DNN 22 are modified such that the third layer 22c generates a signal indicative of the triphone centred on the speech item input to the FBANK 16. The outputs of the third layer 22c are fed to a non-adaptive output layer 28 which formats the outputs of the layer 22c, to produce a signal indicative of a triphone.
This process is repeated successively for speech items for all of the training speakers.
As in step 1, the order in which the speech items for the training speakers are used is not important. In one possibility a randomly chosen one of the frames for the first training speaker may be input to the FBANK 16, and the parameters of the neurons in the stage-2 DNN 22 are adjusted such that the output of the output network 22c is indicative of the corresponding tri-phone. Then, the same is done with a randomly chosen one of the frames for the second training speaker. And so on, until the same is done with a randomly chosen one of the frames for the N-th training speaker. Then the entire process is repeated as many times as desired (e.g. until a convergence criterion has been reached).
Thus, the stage-2 DNN 22 is gradually trained to become a speaker-independent (SI) DNN. The trained adaptive system 25 of Fig. 4 is well adapted for recognising tri-phones in speech of any of the training speakers, and as described below can be used to obtain a first-pass transcription of a test speaker when the phone transcriptions are not available. The speaker variability is not yet normalised.
3. Forming a speaker-adaptive training model (steps 3 and 4 of Fig. 1)
In step 3, a respective adaptive system is formed for each of the N training speakers. The adaptive network 30i for the i-th training speaker is shown in Fig. 5. It receives the output of the FBANK 16, and includes the trained input network component 18 and the trained output network component 21 generated in step 1, but it further includes a bottleneck layer 19i which is different for each of the adaptive networks 30i. Optionally, the bottleneck layer 19i may initially be equal to the generic bottleneck layer generated in step 1.
The respective bottleneck layer 19i for each adaptive system 30i is trained using only the training data in the section of the database 15 for the respective i-th training speaker. The trained input network component 18 and trained output network component 21 are not modified. This results in a training-speaker-specific trained bottleneck layer 19i. The training-speaker-specific trained bottleneck layers may be labelled as SDBN-1, SDBN-2, ..., SDBN-N. The combination of the trained input network component 18 and the training-speaker-specific trained bottleneck layer 19i is a training-speaker-specific BN DNN. Again, a non-adaptive output layer 27 is provided to format the output of the output network component 21 as a signal indicating a single monophone.
This process is performed for each of the training speakers in turn.
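A sketch of step 3 for one training speaker: the trained input and output network components are held fixed and only the speaker-specific bottleneck layer is updated; it assumes PyTorch, the component shapes from the earlier BN-DNN sketch, and a condensed full-batch training loop for brevity.

```python
import copy
import torch
import torch.nn as nn

def train_speaker_specific_bottleneck(input_net, generic_bottleneck, output_net,
                                      frames_i, monophones_i, n_epochs=5, lr=1e-3):
    """Returns a bottleneck layer (SDBN-i) adapted to the i-th training speaker."""
    bottleneck_i = copy.deepcopy(generic_bottleneck)   # optionally start from the generic layer
    for p in list(input_net.parameters()) + list(output_net.parameters()):
        p.requires_grad = False                        # components 18 and 21 are not modified
    opt = torch.optim.SGD(bottleneck_i.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        logits = output_net(bottleneck_i(input_net(frames_i)))
        opt.zero_grad()
        loss_fn(logits, monophones_i).backward()       # mono-phone targets, as in step 1
        opt.step()
    return bottleneck_i
```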
In step 4, the N training-speaker-specific trained bottleneck layers are used to train a stage-2 DNN 32 having the same form as the stage-2 DNN 22 of Fig. 4. This training is done while the stage-2 DNN is within an adaptive system 35 shown in Fig. 6.
The stage-2 DNN 32 comprises three layers 32a, 32b, 32c. Each layer may contain 2048 neurons, each of which forms a respective output as a function of a weighted sum of its inputs. Each neuron of the first layer 32a receives the feature vector, each neuron of the second layer 32b receives the outputs of the neurons in the first layer 32a, and each neuron of the third layer 32c receives the outputs of the neurons in the second layer 32b. Again, a non-adaptive output layer 28 is provided to format the output of the layer 32c as a signal indicating a single triphone.
The adaptive system 35 shown in Fig. 6 receives the output of the FBANK 16. The adaptive system comprises the trained input network component 18 formed in step 1.
As in steps 1 and 2, speech items for all training speakers are input successively to the FBANK 16. At times at which a speech item for the i-th training speaker is input to the FBANK 16, the output of the input layer 18 is fed to the i-th training-speaker-specific bottleneck layer 19i.
The resulting output of the bottleneck layer 19i is combined with the outputs of the bottleneck layer 19i for the five succeeding frames and the five preceding frames for the same training speaker to form a feature vector 26. This is input to each neuron of the first layer 32a of the stage-2 DNN 32.
When a given feature vector 26 is input to the stage-2 DNN 32, the neurons of the stage-2 DNN 32 are modified such that the third layer 32c generates a signal indicative of the tri-phone centred on the speech item input to the FBANK 16.
This process is repeated successively for speech items for all of the training speakers.
The order in which the speech items for the training speakers are used is not important. In one possibility a randomly chosen one of the frames for the first training speaker may be input to the FBANK 16, and the parameters of the neurons in the stage-2 DNN 32 are adjusted such that the output of the output network 32c is indicative of the corresponding tri-phone. Then, the same is done with a randomly chosen one of the frames for the second training speaker. And so on, until the same is done with a randomly chosen one of the frames for the N-th training speaker. Then the entire process is repeated as many times as desired (e.g. until a convergence criterion has been reached).
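A sketch of step 4: the stage-2 DNN 32 is trained on spliced bottleneck features, with the bottleneck swapped to the speaker-specific one for whichever training speaker the current frames come from; it assumes PyTorch, the splice_bottleneck_outputs function sketched earlier, and a condensed one-batch-per-speaker loop rather than the frame-by-frame schedule described above.

```python
import torch
import torch.nn as nn

def train_sat_stage2(input_net, speaker_bottlenecks, frames, triphones,
                     n_triphones=2281, n_epochs=10, lr=1e-3):
    """speaker_bottlenecks[i]: SDBN-i; frames[i]: (n_i, 40); triphones[i]: (n_i,) tri-phone label ids."""
    stage2 = nn.Sequential(                          # stage-2 DNN 32: three 2048-neuron layers
        nn.Linear(75 * 11, 2048), nn.Sigmoid(),
        nn.Linear(2048, 2048), nn.Sigmoid(),
        nn.Linear(2048, 2048), nn.Sigmoid(),
        nn.Linear(2048, n_triphones))                # non-adaptive output layer 28 (tri-phone scores)
    opt = torch.optim.SGD(stage2.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        for i in range(len(frames)):                 # training speakers visited in turn
            with torch.no_grad():                    # the first stage is not modified in step 4
                bn_out = speaker_bottlenecks[i](input_net(frames[i]))
            feats = splice_bottleneck_outputs(bn_out)   # 825-dimensional feature vectors 26
            opt.zero_grad()
            loss_fn(stage2(feats), triphones[i]).backward()
            opt.step()
    return stage2
```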
Thus, the stage-2 DNN 32 is gradually trained to recognise tri-phones from the data output by any of the training-speaker-specific BN DNNs. This is in contrast to the stage-2 DNN 22 generated in step 2, which is trained to recognise tri-phones from the output of the generic BN DNN generated in step 1.
4. Automatic speech recognition for test speaker (steps 5 to 9 of Fig. 1)
We now turn to how speech from a test speaker is recognised. This step is typically performed after steps 1-4, when training data from the test speaker becomes available. It employs: the trained input network component 18 and the trained output network component 21 generated in step 1; the adaptive network (SI-DNN) 25 generated in step 2; and the stage-2 DNN generated in step 4. The speech from the test speaker is captured by the microphone 13, and stored in the database 14. Steps 5 to 9 are typically carried out after steps 1-4, when speech from a test speaker becomes available. This is indicated by the dashed line in Fig. 1. However, in some examples of the invention steps 5-7 could be carried out before steps 3 and 4, or at the same time.
In step 5, the adaptive network 25 (produced in step 2) is used to generate a “first-pass” recognition of the tri-phones in the captured speech of the test speaker. The result is reasonably accurate.
In step 6, the tri-phones derived in step 5 are converted into mono-phones. Note that this is an optional step of the method; the method can alternatively be performed entirely using triphones. This process also shows the alignments of the mono-phones with the captured speech of the test speaker (i.e. the time at which each mono-phone begins). Thus, the training data for the test speaker in the database 14 is divided into frames.
In step 7, an adaptive system 40 shown in Fig. 7 is formed. It receives the output of the FBANK 16, and includes the trained input network component 18, a new bottleneck layer 45 (which optionally may initially be equal to the generic bottleneck layer generated in step 1), and the trained output network component 21. Again, a non-adaptive output layer 27 is provided to format the output of the output network component 21 as a signal indicating a single monophone.
Then a learning procedure is performed, in a way similar to step 3, by successively inputting speech items from the database 14 into the FBANK 16 of the adaptive system 40 and modifying the bottleneck layer 45 such that the output of the output network component 21 is the corresponding mono-phone obtained in step 6.
Thus, the bottleneck layer 45 is trained to be a test-speaker-specific bottleneck layer. Note that the number of variable parameters associated with the bottleneck layer is much smaller than the number of variable parameters associated with the input network component 18 or output network component 21, so a much smaller amount of training data is required to fix the parameters of the bottleneck layer 45 than was required in step 1. Thus, the amount of captured speech required from the test speaker is low. In particular, the training of the bottleneck layer 45 is performed with mono-phones, not tri-phones, which reduces the amount of captured speech of the test speaker required to train the test-speaker-specific bottleneck layer 45.
In step 8, a test-speaker-specific adaptive system 50 shown in Fig. 8 is formed. It is used to recognise speech from the test speaker collected by the microphone 13. The output of the microphone is transmitted to the FBANK 16, and the output of the FBANK 16 is transmitted to the input network component 18 which is the first part of the test-speaker-specific adaptive system 50. Specifically, the test-speaker-specific adaptive system 50 includes the trained input network component 18, the trained test-speaker-specific bottleneck layer 45 and the stage-2 DNN 32 generated in step 4. This test-speaker-specific adaptive system 50 can be used to recognise tri-phones in speech captured by the microphone 13. Again, a non-adaptive output layer 28 is provided to format the output of the layer 32c of the stage-2 DNN 32 as a signal indicating a single triphone.
Note that the output from the test-speaker-specific bottleneck layer 45 when a certain frame is input to the FBANK 16 is combined with the five respective outputs of the bottleneck layer 45 when each of the 5 frames before that frame is successively input to the FBANK, and the five respective outputs of the bottleneck layer 45 when each of the 5 frames after the frame is successively input to the FBANK, to generate the input to the stage-2 DNN 32.
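A sketch of the assembled test-speaker-specific adaptive system 50 just described, assuming PyTorch and the splice_bottleneck_outputs function from the earlier sketch; recognise returns one tri-phone index per input frame, and the conversion of tri-phones to words is left to a separate decoder.

```python
import torch

def recognise(input_net, test_speaker_bottleneck, stage2_dnn, fbank_frames):
    """fbank_frames: (num_frames, 40) FBANK features for speech captured from the test speaker.
    Returns a (num_frames,) tensor of predicted tri-phone indices."""
    with torch.no_grad():
        bn_out = test_speaker_bottleneck(input_net(fbank_frames))   # first stage: components 18 and 45
        feats = splice_bottleneck_outputs(bn_out)                   # +/- 5 frames of context, 825-dim
        logits = stage2_dnn(feats)                                  # speaker-adaptive stage-2 DNN 32
    return logits.argmax(dim=1)                                     # most likely tri-phone per frame
```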
The method of Fig. 1 does not require a priori information about the phonemes spoken by the test speaker: these are recognised in an approximate fashion in step 5, so that supervised learning can be performed in step 7. In other words, although the method as a whole performed in steps 5-8 is unsupervised (in the sense that no a priori information is available about the phones spoken by the test speaker), step 7 can be regarded as a supervised step.
Optionally, the system may determine (in step 9) that a certain amount of time has passed. After this, new training data for the test speaker is collected, and then converted into triphones using the existing test-speaker-specific adaptive system. The steps 6-8 are then repeated. This would produce a replacement test-speaker-specific adaptive system, incorporating a replacement test-speaker-specific bottleneck layer. The replacement test-speaker-specific adaptive system would cope, for example, with the possibility that the acoustic environment of the test speaker has changed since the steps 5-8 were first performed.
Note that an alternative to converting the new training data for the test speaker into triphones using the existing test-speaker-specific adaptive system in step 9, would be to use the SI-DNN of Fig. 4 to convert the new training data for the test speaker into triphones. Then steps 6-8 would be repeated as described in the preceding paragraph.
The process of generating a replacement test-speaker-specific adaptive system may be performed at intervals indefinitely, and/or upon receiving a control signal (e.g. from the test speaker) indicating that it would be beneficial to repeat them because the accuracy of the existing test-speaker-specific adaptive system is insufficient.
In a variation of this concept, steps 5-8 may be repeated upon some other criterion being met. For example, the ASR system might include a component for determining the characteristics of noise in the sound received by the microphone 13, and steps 5-8 may be repeated upon a determination that the noise characteristics of sound received by the microphone 13 have changed by more than a pre-determined amount.
Note that if, in a variation of the example of the invention, training data from the test speaker is available in which, for items of captured speech of the test speaker, corresponding mono-phones spoken by the test speaker are identified, steps 2, 5 and 6 could be omitted. The training data relating to the test speaker could be employed in step 7 to generate the test-speaker-specific bottleneck layer 45, by performing supervised learning of the bottleneck layer 45 within the adaptive network 40 using the training data relating to the test speaker.
As will be clear, the adaptive networks 20, 25, 30i, 35, 40 and 50 are implemented virtually in a memory space of the computer system 10.
It is not necessary for the steps 1-8 to be performed by the same computer system or at substantially the same time. Rather, steps 1-4 could optionally be performed by a first computer system, e.g. using a very large amount of training data relating to the training speakers, and then steps 5-9 could be performed by a second computer system (e.g. with a different human operator) when data for a test speaker is available.
In a variation of the adaptive models of Figs. 3, 5, and 7, the output network component 21 may be omitted from certain embodiments of the invention, such that the bottleneck layers 19, 19i, 45 are trained to produce outputs which are directly indicative of the monophone corresponding to the speech item input to the FBANK. The non-adaptive output layer 27 would format the output of the bottleneck layers 19, 19i, 45 to generate a signal indicating a single monophone.
Results
Table 1 below shows the performance of the example of the invention described above when using the unsupervised mode of adaptation illustrated in Fig. 1, as compared to some conventional neural network algorithms. As noted above, the first-pass (errorful) ASR transcription (performed in step 5) is used for generating training data for updating the weights of the bottleneck layer 45 in step 7. In step 8, a test-speaker-specific network is formed for recognising triphones, and using known algorithms the triphones are converted to words.
The training data consisted of clean and multi-condition training data comprising 7138 utterances from 83 speakers. The clean data was recorded using a primary Sennheiser microphone, whereas the multi-condition training data included data recorded with a primary microphone and a secondary microphone, which includes convolutive distortions. The multi-condition data further included data having additive noise from six noise conditions: airport, babble, car, restaurant, street and train station.
The test data consisted of 14 test sets, including 330 utterances from 8 test speakers, recorded by two different microphones.
The FBANK was a 40-dimensional mel FBANK. Thus, since the bottlenecks produced a 75-dimensional output, the input to each of the stage-2 DNNs 22, 32 was an 825-dimensional feature vector 26. The stage-2 DNNs 22, 32 were trained to produce a signal indicative of one of 2281 triphones. RBM (Restricted Boltzmann Machine) pre-training was performed, and the networks were optimised using a cross-entropy criterion.
Table 1 shows, in the second row, the performance (i.e. percentage word error rate, %WER) of the example of the invention in the case that the FBANK 16 is a mel FBANK. Rows 3-5 respectively show the performance of the example of the invention when the mel FBANK is supplemented with a D-vector unit, a CMLLR unit, and both a CMLLR unit and a D-vector unit. The final column of Table 1 compares the performance of each of these examples of the invention with a baseline which is the performance of the SI system shown in Fig. 4, which does not have a speaker-specific bottleneck layer.
Table 1
The CMLLR transforms were estimated while training a SAT (speaker adaptive training) GMM-HMM model (Gaussian mixture model - Hidden Markov model). D-vectors were obtained by training a bottleneck DNN with speaker labels as targets in the output layer. In the experiments, the D-vectors were obtained by averaging the output of the bottleneck layer over an utterance, and then appending the constant vector to the filterbank features in the utterance. This means that the speaker representation is allowed to change across utterances from the same speaker.
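A sketch of the D-vector handling just described: the bottleneck output of a speaker-verification DNN is averaged over an utterance and the resulting constant vector is appended to every FBANK frame of that utterance; it assumes PyTorch and that the speaker-verification bottleneck network is available as dvector_net (a name introduced here for illustration).

```python
import torch

def append_dvector(fbank_frames, dvector_net):
    """fbank_frames: (num_frames, 40) FBANK features for one utterance.
    Returns (num_frames, 40 + d) features with the per-utterance D-vector appended."""
    with torch.no_grad():
        bn_outputs = dvector_net(fbank_frames)        # bottleneck outputs, one per frame
    d_vector = bn_outputs.mean(dim=0, keepdim=True)   # average over the utterance -> constant vector
    d_vector = d_vector.expand(fbank_frames.shape[0], -1)
    return torch.cat([fbank_frames, d_vector], dim=1) # speaker representation may change per utterance
```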
One can observe that the proposed approach when applied on top of DNN trained with Mel filter-bank (FBANK) features provides a relative gain (%WER reduction, or “%WERR”) of 8.9% in terms of word error rate (WER). A relative gain of 8.6% is observed when applied to a DNN trained with FBANK features appended with D-vectors. The best performance is achieved when the speaker adaptive DNN is applied on top of a DNN trained with FBANK features transformed with CMLLR feature transforms. The performance seems to saturate when CMLLR-FBANK is appended with D-vectors.
Instead of appending the D-vectors to the FBANK features, we tried, in another experiment, appending them to the bottleneck features before training the second stage DNN. This provided broadly similar gains in performance. No gain in performance was observed when D-vectors were appended to both the FBANK features and the bottleneck features.
We also studied the influence of reducing the number of neurons in the input network component. The motivation for this was to see whether it would be possible to reduce the number of parameters of the bottleneck layer which need to be adapted when the bottleneck layer is trained. We performed experiments in which each layer of the input network component was reduced to 256 neurons. This gave a slight reduction in performance. Accordingly, using a larger size for the layers of the input network component might give an improvement in performance.
Supervised adaptation experiments, where true transcripts of the test speaker training data are used for updating the weights of the BN layer in step 7, are shown in Table 2. In other words, the following results are the result of the variation mentioned above in which steps 2, 5 and 6 are not required. Again, the baseline is the system shown in Fig. 4, which is the baseline shown in Table 1.
Table 2
The columns indicate the number of utterances used per speaker to update the weights. Comparing both the tables, one can notice that using as few as 10 utterances (which correspond to one minute of audio) to update the weights of the bottleneck layer seems to improve the performance over the baseline. It is interesting to note that less adaptation data is required to achieve a similar or better performance if the data is normalised with CMLLR or D-vectors, compared to using only FBANK features. This may be because a better acoustic model was trained in the SAT framework. Also, we note that using D-vectors in combination with CMLLR-FBANK features seems to give little improvement over using only CMLLR-FBANK features.

Claims (17)

CLAIMS:
1. A method for generating a test-speaker-specific adaptive system for recognising sounds in speech spoken by a test speaker, the method employing: for each of a plurality of training speakers, a respective set of first training data comprising (i) data characterizing speech items spoken by the respective training speaker, and (ii) data characterizing phones for the speech items; and second training data comprising data characterizing speech items spoken by the test speaker; the method comprising: (a) using the sets of first training data to perform supervised learning of a first adaptive model (BN-DNN) comprising (i) an input network component and (ii) an adaptive model component, thereby training the input network component and the adaptive model component; (b) for each of the training speakers: (i) providing a respective second adaptive model comprising the trained input network component and a respective training-speaker-specific adaptive model component; and (ii) modifying the training-speaker-specific adaptive model component to perform supervised learning of the respective second adaptive model using the respective set of first training data, thereby producing a respective training-speaker-specific adaptive model component (SDBN-1, SDBN-2, ..., SDBN-N); (c) training a speaker-adaptive output network, by, successively for each training speaker, modifying the speaker-adaptive output network to train, using the respective set of first training data, a respective third adaptive model comprising the trained input network component, the respective trained training-speaker-specific adaptive model component, and the speaker-adaptive output network; (d) using the second training data to train a test-speaker-specific adaptive model component of a fourth adaptive model comprising the trained input network component, and the test-speaker-specific adaptive model component; and (e) providing the test-speaker-specific adaptive system comprising the trained input network component, the trained test-speaker-specific adaptive model component, and the trained speaker-adaptive output network.
2. A method according to claim 1 in which the first adaptive network further comprises an output adaptive component which is trained in step (a), the second adaptive models and the fourth adaptive model further comprising the trained output adaptive component.
3. A method according to claim 1 or claim 2 in which each adaptive model component is a single layer of neurons.
4. A method according to claim 1, or claim 2 or claim 3 in which said input network component comprises a plurality of layers which each comprise a plurality of neurons, and each adaptive model component comprises a smaller number of neurons than any layer of the input network component.
5. A method according to any preceding claim in which said speaker-adaptive output network comprises a plurality of layers which each comprise a plurality of neurons, and each adaptive model component comprises a smaller number of neurons than any layer of the speaker-adaptive output network.
6. A method according to any preceding claim in which in steps (a), (b) and (d), the first and second adaptive networks are trained to produce signals indicating monophones.
7. A method according to any preceding claim in which in step (c), the speaker-adaptive DNN is trained to produce signals indicating tri-phones.
8. A method according to any preceding claim further comprising, prior to step (d), a step of generating, from elements of the second training data, a corresponding first estimate of associated phones, said first estimate of the associated phones being used in step (d).
9. A method according to claim 8, when dependent on claim 6, in which the first estimate of the associated phones is in the form of a triphone, the method further comprising converting each of the first estimates of the associated phones into monophones and obtaining alignment information characterizing times at which the second training data exhibits a transition between mono-phones.
10. A method according to claim 8 or 9 when dependent upon claim 7, in which the step of generating from each element of the second training data a corresponding first estimate of the associated phones comprises: training a speaker-independent network (SI-DNN) successively using training data from the training speakers, by training a fifth adaptive model comprising the trained input network component, the trained adaptive model component and the speaker-independent network (SI-DNN), to generate triphones from the training data from the training speakers; and inputting the second training data for the test subject into the trained fifth adaptive model, the output of the trained fifth adaptive network being the first estimate of the associated triphone.
11. A method according to any preceding claim in which the input network component of the first, second, third, and fourth adaptive models receives the output of a filter bank.
12. A method according to any preceding claim further comprising: at least once repeating step (d) using replacement second training data to generate an updated test-speaker-specific adaptive model component, and providing an updated test-speaker-specific adaptive system comprising the trained input network component, the updated test-speaker-specific adaptive model component, and the trained speaker-adaptive output network.
13. A method for generating a test-speaker-specific adaptive system for recognising sounds in speech spoken by a test speaker, the method employing: (i) for each of a plurality of training speakers, a respective set of first training data, comprising (1) data characterizing speech items spoken by the corresponding training speaker and (2) data characterizing sounds in the speech items; and (ii) second training data comprising data characterizing speech items spoken by the test speaker; (iii) an input network component; the method comprising: (a) for each of the training speakers: (i) forming a respective first adaptive model comprising the input network component, and a respective training-speaker-specific adaptive model component, (ii) performing supervised learning of the respective first adaptive model using the respective set of first training data by modifying the respective adaptive network component, thereby forming a trained training-speaker-specific adaptive model component; (b) training a speaker-adaptive output network, by, for successive ones of the training speakers, modifying the speaker-adaptive output network to train, using the respective set of first training data, a respective second adaptive model comprising the input network component, the respective trained training-speaker-specific adaptive model component, and the speaker-adaptive output network; (c) using the second training data to train a test-speaker-specific adaptive model component of a third adaptive model comprising the input network component, and the test-speaker-specific adaptive model component; and (d) providing the test-speaker-specific adaptive system comprising the input network component, and the trained test-speaker-specific adaptive model component.
14. A method according to claim 13 further employing an output network component, the first adaptive models and the third adaptive model further comprising the output network component.
15. A method of recognising sounds in speech spoken by a test speaker, the method comprising: generating a test-speaker-specific adaptive system by a method according to any preceding claim; receiving speech data encoding speech spoken by the test speaker; passing the speech data into a filter bank; and passing data comprising the output of the filter bank into the test-speakerspecific adaptive system.
16. A computer system for generating a test-speaker-specific adaptive system for recognising sounds in speech spoken by a test speaker, the computer system comprising: a processor; and a data storage device which stores program instructions operative, when implemented by the processor, to cause the processor to perform a method according to any preceding claim.
17. A computer program product storing program instructions operative, when implemented by a processor, to cause the processor to perform a method according to any of claims 1 to 15.
GB1600842.7A 2016-01-18 2016-01-18 Speaker-adaptive speech recognition Expired - Fee Related GB2546325B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1600842.7A GB2546325B (en) 2016-01-18 2016-01-18 Speaker-adaptive speech recognition
US15/407,663 US10013973B2 (en) 2016-01-18 2017-01-17 Speaker-adaptive speech recognition
JP2017007052A JP6437581B2 (en) 2016-01-18 2017-01-18 Speaker-adaptive speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1600842.7A GB2546325B (en) 2016-01-18 2016-01-18 Speaker-adaptive speech recognition

Publications (3)

Publication Number Publication Date
GB201600842D0 GB201600842D0 (en) 2016-03-02
GB2546325A GB2546325A (en) 2017-07-19
GB2546325B true GB2546325B (en) 2019-08-07

Family

ID=55488065

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1600842.7A Expired - Fee Related GB2546325B (en) 2016-01-18 2016-01-18 Speaker-adaptive speech recognition

Country Status (1)

Country Link
GB (1) GB2546325B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3267438B1 (en) * 2016-07-05 2020-11-25 Nxp B.V. Speaker authentication with artificial neural networks
CN108346428B (en) * 2017-09-13 2020-10-02 腾讯科技(深圳)有限公司 Voice activity detection and model building method, device, equipment and storage medium thereof
CN111243576B (en) * 2020-01-16 2022-06-03 腾讯科技(深圳)有限公司 Speech recognition and model training method, apparatus, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161994A1 (en) * 2013-12-05 2015-06-11 Nuance Communications, Inc. Method and Apparatus for Speech Recognition Using Neural Networks with Speaker Adaptation
US20160034811A1 (en) * 2014-07-31 2016-02-04 Apple Inc. Efficient generation of complementary acoustic models for performing automatic speech recognition system combination

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161994A1 (en) * 2013-12-05 2015-06-11 Nuance Communications, Inc. Method and Apparatus for Speech Recognition Using Neural Networks with Speaker Adaptation
US20160034811A1 (en) * 2014-07-31 2016-02-04 Apple Inc. Efficient generation of complementary acoustic models for performing automatic speech recognition system combination

Also Published As

Publication number Publication date
GB2546325A (en) 2017-07-19
GB201600842D0 (en) 2016-03-02

Similar Documents

Publication Publication Date Title
US10013973B2 (en) Speaker-adaptive speech recognition
Feng et al. End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model.
Rebai et al. Improving speech recognition using data augmentation and acoustic model fusion
Abdel-Hamid et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
US11823655B2 (en) Synthetic speech processing
US12283266B2 (en) Synthetic speech processing
Seki et al. A deep neural network integrated with filterbank learning for speech recognition
Zhao et al. Investigating online low-footprint speaker adaptation using generalized linear regression and click-through data
Kundu et al. Joint acoustic factor learning for robust deep neural network based automatic speech recognition
GB2546325B (en) Speaker-adaptive speech recognition
Yılmaz et al. Noise robust exemplar matching using sparse representations of speech
Long et al. Enhancing zero-shot many to many voice conversion via self-attention VAE with structurally regularized layers
Lim et al. CNN-based bottleneck feature for noise robust query-by-example spoken term detection
Kaur et al. Feature space discriminatively trained punjabi children speech recognition system using kaldi toolkit
WO2005096271A1 (en) Speech recognition device and speech recognition method
GB2558629B (en) Speaker-adaptive speech recognition
WO2019212375A1 (en) Method for obtaining speaker-dependent small high-level acoustic speech attributes
Seki et al. Discriminative learning of filterbank layer within deep neural network based speech recognition for speaker adaptation
Aafaq et al. Convolutional neural networks for deep spoken keyword spotting
US20230317085A1 (en) Audio processing device, audio processing method, recording medium, and audio authentication system
Nasef et al. Stochastic gradient descent analysis for the evaluation of a speaker recognition
Samarakoon et al. An investigation into learning effective speaker subspaces for robust unsupervised DNN adaptation
Doddipatla Speaker adaptive training in deep neural networks using speaker dependent bottleneck features
Tang et al. Deep neural network trained with speaker representation for speaker normalization
Wan et al. Waveform-based speaker representations for speech synthesis

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20240118