HK1183141B - Discriminative pretraining of deep neural networks - Google Patents
Description
Technical Field
The invention relates to a method and a system for discriminative pretraining of deep neural networks.
Background
Deep neural networks (DNNs) are known to be powerful discriminative modeling tools and can be used for various purposes. For example, a DNN can be combined with a hidden Markov model (HMM) to characterize context-dependent (CD) phonemes, which are pronunciation units of speech. The resulting hybrid CD-DNN-HMM combines the temporally localized discriminative modeling capability of the DNN with the sequential modeling capability of the HMM. CD-DNN-HMMs can be used in many systems, such as speech recognition systems, handwriting recognition systems, and human behavior recognition/detection systems, including gesture recognition systems.
One of the key procedures in building such a CD-DNN-HMM is training the DNN. This training typically begins by initializing the weights, a procedure known as "pretraining."
Disclosure of Invention
The discriminative pretraining technique embodiments described herein are typically employed to pretrain the hidden layers of a deep neural network (DNN). These embodiments have the following advantage: the DNN layer weights are brought close to a good local optimum while still being left in a range with a high gradient, enabling them to be fine-tuned effectively in later stages of training.
In one exemplary discriminative pretraining technique embodiment, a DNN is pretrained by first training a single-hidden-layer neural network (NN) having: an input layer into which training data is input; an output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the output layer with randomly initialized weights. The training involves accessing a set of training data entries, each of which has a corresponding label assigned to it. Each data entry is then input into the input layer of the single-hidden-layer NN, one after the other, until all data entries have been input at least once. After each data entry is input, the weights associated with the first hidden layer are set via a back-propagation (BP) process so that the output generated from the output layer matches the label associated with that training data entry. This produces the initial NN.
Once the single-hidden-layer NN has been trained, the current output layer is discarded and a new hidden layer is added, interconnected with the last previously trained hidden layer and with a new output layer via randomly initialized weights, to produce a multi-hidden-layer DNN. This newly generated DNN is then trained as follows. Each data entry of the training set is input, one by one, into the input layer of the newly generated multi-hidden-layer DNN until all data entries have been input at least once. After each data entry is input, the weights associated with the new hidden layer and each previously trained hidden layer are set via BP so that the output generated from the output layer matches the label associated with that training data entry. This produces a deeper neural network with one more hidden layer than the previous one.
Additional new hidden layers are then added and trained in the same manner until a prescribed number of hidden layers have been added. The resulting most recently generated modified multi-layer DNN is then designated as the pre-trained DNN.
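The layer-growing procedure just summarized can be sketched in NumPy as follows. This is a minimal illustration, not the claimed implementation: the layer widths, initialization scale, learning rate, omission of bias terms, and the single pass over the data per growth step are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_one_pass(hidden_Ws, out_W, X, labels, lr=0.08):
    """One pass over the training set: per-entry BP through all current layers."""
    n_classes = out_W.shape[1]
    for x, lab in zip(X, labels):
        # forward pass, keeping each layer's output
        acts = [x]
        for W in hidden_Ws:
            acts.append(sigmoid(acts[-1] @ W))
        p = softmax(acts[-1] @ out_W)
        # top error signal: gradient of the log posterior w.r.t. the top activation
        e = np.eye(n_classes)[lab] - p
        grads = [np.outer(acts[-1], e)]          # gradient for out_W (kept last)
        for i in range(len(hidden_Ws) - 1, -1, -1):
            W_above = out_W if i == len(hidden_Ws) - 1 else hidden_Ws[i + 1]
            e = (W_above @ e) * acts[i + 1] * (1 - acts[i + 1])
            grads.insert(0, np.outer(acts[i], e))
        out_W += lr * grads[-1]                  # gradient ascent on the log posterior
        for i in range(len(hidden_Ws)):
            hidden_Ws[i] += lr * grads[i]

def discriminative_pretrain(X, labels, layer_widths, n_classes, lr=0.08):
    """Grow the DNN one hidden layer at a time, discarding the output layer
    each time a new hidden layer is added."""
    hidden_Ws = []
    for width in layer_widths:
        fan_in = X.shape[1] if not hidden_Ws else hidden_Ws[-1].shape[1]
        hidden_Ws.append(0.1 * rng.standard_normal((fan_in, width)))  # new hidden layer
        out_W = 0.1 * rng.standard_normal((width, n_classes))         # fresh output layer
        train_one_pass(hidden_Ws, out_W, X, labels, lr)               # one epoch, then grow
    return hidden_Ws, out_W
```

Note that BP runs through all layers added so far, but only one new layer is introduced per step; the final `hidden_Ws` and `out_W` then serve as the starting point for fine-tuning.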
It should be noted that this summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
FIG. 1 is a diagram of an exemplary computer program architecture for implementing embodiments of the discriminative pretraining technique described herein.
FIG. 2 is a flow chart summarizing one embodiment of a pre-training technique process for pre-training a Deep Neural Network (DNN).
FIG. 3 is a flowchart outlining one embodiment of a process for performing an iteration of a multi-iteration process of fine-tuning a pre-trained DNN.
FIG. 4 is a diagram depicting a general-purpose computing device constituting an exemplary system for implementing embodiments of the discriminative pretraining technique described herein.
Detailed Description
In the following description of the discriminative pretraining technique embodiments, reference is made to the accompanying drawings, which form a part hereof and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.
1.0 Discriminative pretraining and fine-tuning of deep neural networks
The discriminative pretraining technique embodiments described herein are typically employed to pretrain the hidden layers of a deep neural network (DNN). This results in a pretrained DNN that can be fine-tuned to produce a fully trained DNN. For the purposes of this description, a DNN is defined as a neural network with more than one hidden layer. Furthermore, the term "pretraining" refers to the process of obtaining DNN weights in all layers, subject to further modification, based on a purely discriminative learning process running through all layers of the DNN. One such discriminative learning process is the aforementioned fine-tuning, which applies BP through all DNN layers from the top layer down to the bottom layer.
The trained DNN can be used for various purposes. For example, the DNN can directly model tied context-dependent (CD) states, or it can instead model context-independent states. In the case of modeling tied CD states, as noted previously, the DNN can model context-dependent phonemes and can be combined with a hidden Markov model (HMM). The resulting hybrid CD-DNN-HMM combines the discriminative modeling capability of the DNN with the sequential modeling capability of the HMM. CD-DNN-HMMs can be used in many systems, such as speech recognition systems, handwriting recognition systems, and human behavior recognition/detection systems. In a speech recognition system, such as one used for a voice search task or for the Switchboard (a telephone speech recognition benchmark data set) phone-call transcription task, the CD-DNN-HMM directly models the senones (tied CD states) of the HMM speech recognizer and approximates their emission probabilities. A senone represents a clustered (or tied) context-dependent triphone state. However, the discriminative pretraining technique embodiments described herein are not limited to speech recognition systems or to any other such system. Rather, they can be employed with any DNN and for any purpose.
1.1 Deep neural networks
DNNs can be thought of as conventional multilayer perceptrons (MLPs) with many hidden layers. In particular, a DNN models the posterior probability $P_{s|o}(s \mid o)$ of a class $s$ given an observation vector $o$ as a stack of $(L+1)$ layers of log-linear models. The first $L$ layers, $l = 0, \ldots, L-1$, model hidden binary output units $h^l$ given input vectors $v^l$ as Bernoulli distributions

$$P^l_{h|v}(h^l \mid v^l) = \prod_{j=1}^{N^l} \frac{e^{z^l_j(v^l)\, h^l_j}}{e^{z^l_j(v^l)\cdot 1} + e^{z^l_j(v^l)\cdot 0}}, \qquad 0 \le l < L,$$

and the top layer $L$ models the desired class posterior probability as a multinomial distribution

$$P^L_{s|v}(s \mid v^L) = \frac{e^{z^L_s(v^L)}}{\sum_{s'} e^{z^L_{s'}(v^L)}} = \operatorname{softmax}_s(z^L(v^L)),$$

where $z^l(v^l) = (W^l)^T v^l + a^l$ is the activation at layer $l$, $W^l$ and $a^l$ are the weight matrix and bias vector at layer $l$, and $h^l_j$ and $z^l_j(v^l)$ are the $j$th components of $h^l$ and $z^l(v^l)$, respectively.
Exact computation of $P_{s|o}(s \mid o)$ is infeasible, because it requires integrating over all possible values of $h^l$ across all layers. An effective practical technique is to replace this marginalization with a mean-field approximation. Given an observation $o$, we set $v^0 = o$ and choose the conditional expectation $E\{h^l \mid v^l\} = \sigma(z^l(v^l))$ as the input $v^{l+1}$ to the next layer, where $\sigma_j(z) = 1/(1 + e^{-z_j})$ is the sigmoid function.
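A minimal sketch of this mean-field forward pass, assuming sigmoid hidden layers and a softmax top layer (the weight shapes used below are illustrative, and bias vectors play the role of the $a^l$ above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # shift by the max for numerical stability
    return e / e.sum()

def forward(o, Ws, bs):
    """Mean-field pass: v^0 = o, and v^{l+1} = E[h^l | v^l] = sigmoid(z^l(v^l))."""
    v = o
    for W, a in zip(Ws[:-1], bs[:-1]):       # hidden layers 0 .. L-1
        v = sigmoid(W.T @ v + a)             # conditional expectation as next input
    z = Ws[-1].T @ v + bs[-1]                # top-layer activation z^L(v^L)
    return softmax(z)                        # class posterior P(s | o)
```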
1.2 Training deep neural networks
A DNN, being a "deep" MLP, can be trained with the well-known error back-propagation (BP) procedure. Because BP can easily get trapped in poor local optima for deep networks, it is helpful to "pretrain" the model in the layer-growing manner to be described later. Before describing this pretraining, however, it is helpful to review BP briefly. MLPs are often trained using the stochastic gradient-ascent form of BP: for an objective function $D$ and a learning rate $\varepsilon$,

$$W^l \leftarrow W^l + \varepsilon \frac{\partial D}{\partial W^l}, \qquad a^l \leftarrow a^l + \varepsilon \frac{\partial D}{\partial a^l}, \qquad 0 \le l \le L.$$
typically, the target is the total log posterior probability, i.e. the total log posterior probability, with respect to the T training samples O = { O (T) } having the true label s (T), i.e. the true label s (T)
And (4) maximizing. The gradient is then
eL(t)=(log softmax)′(zL(vL(t)))
el-1(t)=Wl·ωl(t)·el(t), L < L for 0. ltoreq. L
Wherein the error signalAs propagated backwards from network l +1 and above; derivative ω of the output nonlinearity of the network ll(t), if any; derivative of component form σ'j(z)=σj(z)·(1-σj(z)) and (logsoftmax)'j(z)=s(t),j-softmaxj(z) and Kronecker (Kronecker delta).
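Under this notation, one stochastic gradient-ascent step of BP on a single sample might be sketched as follows; the network shapes and learning rate are illustrative, and the bias vectors play the role of the $a^l$ above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bp_step(Ws, bs, o, s, eps=0.08):
    """One gradient-ascent step on D = log P(s|o) for a single sample (o, s)."""
    L = len(Ws) - 1
    # forward pass, keeping v^l and z^l for every layer
    vs, zs = [o], []
    for l in range(L + 1):
        zs.append(Ws[l].T @ vs[l] + bs[l])
        if l < L:
            vs.append(sigmoid(zs[l]))
    # e^L = (log softmax)'(z^L) = onehot(s) - softmax(z^L)
    e = -softmax(zs[L])
    e[s] += 1.0
    for l in range(L, -1, -1):
        # omega^l = diag(sigmoid'(z^l)) for hidden layers, identity at the top
        w_e = e if l == L else sigmoid(zs[l]) * (1 - sigmoid(zs[l])) * e
        e_below = Ws[l] @ w_e                 # e^{l-1} = W^l . omega^l . e^l
        Ws[l] += eps * np.outer(vs[l], w_e)   # dD/dW^l = v^l (omega^l e^l)^T
        bs[l] += eps * w_e                    # dD/da^l = omega^l e^l
        e = e_below
    return Ws, bs
```

Since the update follows the exact gradient of the log posterior, a sufficiently small step increases $D$ on the sample it was computed from.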
1.3 Discriminative pretraining
It has been found that pretraining a DNN and then following with a fine-tuning training process can yield more accurate results than conventional training methods. The discriminative pretraining technique embodiments described herein have the following advantage: the DNN layer weights are brought close to a good local optimum while still being left in a range with a high gradient, enabling them to be fine-tuned effectively. Although both the pretraining and the fine-tuning described herein are discriminative, they differ in that the former operates on one layer of the DNN at a time, whereas the latter is performed through all layers of the DNN jointly. Embodiments of the discriminative pretraining technique are described in this section; the fine-tuning process is described in the next.
The discriminative pretraining technique embodiments described herein operate as a computer-implemented process for pretraining a DNN. This can involve employing a computer-readable storage medium having stored thereon computer-executable instructions for implementing the training. Suitable computing devices and storage media are described in greater detail in the exemplary operating environment section that follows.
FIG. 1 illustrates an exemplary computer program architecture for implementing embodiments of the discriminative pretraining technique described herein. The architecture includes various program modules executable by a computing device, such as a hidden layer generator program module 100. Module 100 instructs the computing device to initially generate a single-hidden-layer NN. The single-hidden-layer NN includes: an input layer into which training data is input; an output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the output layer with randomly initialized weights.
There is also a pretraining program module 102 that instructs the computing device to first access a set of training data entries, each of which has a corresponding label. The label represents the particular recognition output desired from the DNN when the corresponding training data entry is input. For example, in the speech recognizer example above, a training data entry could be a frame of a spoken utterance. The frame is assigned a senone label that indicates the desired DNN output for that frame. For example, each unique senone associated with the training data entries would be assigned a different label (e.g., 1, 2, 3, …, N, where N is the total number of senones). This simplified representation of the senone outputs allows a clear distinction to be made between them. It should also be noted that the training data set accessed for pretraining may be smaller than the set used for subsequent training. Once the single-hidden-layer NN has been generated in response to the above-described instruction from the hidden layer generator program module 100, the pretraining program module 102 also instructs the computing device to access it. The pretraining program module 102 then inputs the training data entries and produces a pretrained version of the single-hidden-layer NN. One embodiment of a process for accomplishing this task is described in the exemplary discriminative pretraining technique process provided later.
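As a toy illustration of this labeling scheme (the senone names below are hypothetical, and the labels here are 0-based so they can directly index the DNN's output nodes):

```python
# Hypothetical senone (phonetic unit) names for four training frames; any
# consistent naming works -- the integers simply index the DNN output nodes.
frame_senones = ["ah_s2", "k_s1", "ah_s2", "k_s2"]

# Assign each unique phonetic unit a distinct integer label 0 .. N-1.
label_of = {name: i for i, name in enumerate(sorted(set(frame_senones)))}
frame_labels = [label_of[s] for s in frame_senones]
```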
Whenever a pretrained version of the single-hidden-layer NN is produced under the direction of the pretraining program module 102, the hidden layer generator program module 100 instructs the computing device to discard the current output layer and to add a new hidden layer that is interconnected with the first hidden layer and with a new output layer via randomly initialized weights, to produce a multi-hidden-layer DNN. Furthermore, whenever a pretrained version of the most recently generated multi-hidden-layer DNN is produced under the direction of the pretraining program module 102 (as will be described later) and is designated as lacking the prescribed number of hidden layers, the hidden layer generator program module 100 instructs the computing device to discard the current output layer and to add a new hidden layer that is interconnected with the most recently added hidden layer and with a new output layer via randomly initialized weights, producing a new multi-hidden-layer DNN.
To produce the pretrained versions of the multi-hidden-layer DNNs described above, the pretraining program module 102 instructs the computing device to access each multi-hidden-layer DNN as it is generated and, for each one accessed, to input the training data entries into it and to produce a pretrained version of the accessed network. One embodiment of a process for accomplishing this task is described in the exemplary discriminative pretraining technique process provided later.
A deep neural network program module 104 is employed to determine whether a pretrained version of the most recently generated multi-hidden-layer DNN lacks the prescribed number of hidden layers. More specifically, each time such a network is produced under the direction of the hidden layer generator program module 100, the deep neural network module 104 instructs the computing device to determine whether the most recently generated pretrained multi-hidden-layer DNN includes the prescribed number of hidden layers. Whenever it is determined that it does not, it is designated as lacking the prescribed number of hidden layers under the direction of the deep neural network module 104. However, whenever it is determined that the most recently generated pretrained multi-hidden-layer DNN does include the prescribed number of hidden layers, the deep neural network module 104 designates it as the desired pretrained DNN.
The computer program architecture described above can be advantageously used to implement the discriminative pretraining technique embodiments described herein. More specifically, referring to FIG. 2, one embodiment of a pretraining technique process for pretraining a DNN is presented. The process begins with training the aforementioned single-hidden-layer neural network. As previously indicated, the single-hidden-layer neural network includes: an input layer into which training data is input; an output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the output layer with randomly initialized weights. Training involves first accessing a set of training data entries (process action 200), each of which has a corresponding label assigned to it.
Next, each data entry in the training set is input into the input layer of the single-hidden-layer neural network one after another until all data entries have been input at least once (process action 202). It should be noted that after each data entry is input, the weights associated with the first hidden layer are set via the error Back Propagation (BP) process described above so that the output generated from the output layer matches the label associated with the training data entry. This results in the initial NN.
It should also be noted that, in one embodiment, each data entry in the training set is input into the input layer of the single-hidden-layer neural network only once (a practice sometimes referred to as early stopping). Further, in one embodiment, BP uses a prescribed high learning rate ranging between 0.01 and 0.20. In tested examples, a learning rate of 0.08 was employed. It has been found that using either or both of the foregoing features can result in improved accuracy.
It is also noted that, in one embodiment, the output from the first hidden layer is transformed via the softmax function to better correspond to the label associated with the currently input training data entry. The softmax function is typically used to transform the outputs of a layer of a neural network such that all output values fall between 0 and 1 and sum to 1. In one version, this is done using the following equation:

$$p_i = \frac{e^{q_i}}{\sum_{j=1}^{n} e^{q_j}},$$

where $p_i$ is the output value of node $i$, $q_i$ is the net input to output node $i$, and $n$ is the number of output nodes.
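A minimal sketch of this transformation (the max-subtraction is a standard numerical-stability step and does not change the result of the equation above):

```python
import numpy as np

def softmax(q):
    """p_i = exp(q_i) / sum_j exp(q_j): all outputs in (0, 1), summing to 1."""
    q = np.asarray(q, dtype=float)
    e = np.exp(q - q.max())   # subtract the max to avoid overflow; p is unchanged
    return e / e.sum()
```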
Once the single-hidden-layer neural network has been trained as described above, the current output layer is discarded and a new hidden layer is added that interconnects the last previously trained hidden layer and a new output layer with randomly initialized weights (process action 204). This effectively creates a new multi-hidden-layer DNN.
The newly generated multi-hidden-layer DNN is then trained as follows. Each data entry in the training set is input into the input layer of the newly generated multi-hidden-layer neural network, one after another, until all data entries have been input at least once (process action 206). After each data entry is input, the weights associated with the new hidden layer and each previously trained hidden layer are set via the error back-propagation (BP) process such that the output generated from the output layer matches the label associated with that training data entry. This produces a modified multi-hidden-layer deep neural network.
As with the single hidden layer neural network, in one embodiment, each data entry in the training set is input only once to the input layer of the newly generated multi-hidden layer neural network. Further, in one embodiment, BP uses a prescribed high learning rate ranging between 0.01 and 0.20. In the test example, a learning rate of 0.08 was employed. As before, it has been found that utilizing either or both of the foregoing features can result in improved accuracy.
Also as before, in one embodiment, the output from the new hidden layer is transformed via the softmax function described above to better correspond to the label associated with the currently input training data entry.
Additional new hidden layers are then added and trained. More specifically, in process action 208, it is determined whether the most recently generated modified multi-hidden-layer deep neural network has a prescribed number of hidden layers (e.g., at least two hidden layers). If not, acts 204 and 206 are repeated. When it is determined that the most recently generated modified multi-hidden-layer deep neural network has a specified number of hidden layers, it is designated as a pre-trained DNN (process action 210).
1.4 Fine-tuning
As mentioned previously, the pretrained DNN can be fine-tuned. More specifically, in one embodiment, fine-tuning involves iteratively training the pretrained DNN a prescribed number of times (e.g., four times) to produce the trained DNN. In another embodiment, fine-tuning involves iteratively training the pretrained DNN until the weights associated with each hidden layer change by no more than a prescribed training threshold between iterations. In yet another embodiment, the fine-tuning process ends when either of the aforementioned conditions occurs first. Referring again to FIG. 1, a fine-tuning program module 106 is used to instruct the computing device to fine-tune the layer weights of the pretrained DNN.
More specifically, referring to FIG. 3, in one embodiment each iteration of the fine-tuning process is accomplished by first inputting each data entry in the aforementioned set of training data entries, one by one, into the input layer of the pretrained DNN until all data entries have been input once (process action 300). After each data entry is input, the weights associated with the hidden layers are set via the error back-propagation (BP) process such that the output generated from the output layer matches the label associated with that training data entry. It is then determined whether the pretrained DNN has been fine-tuned the prescribed number of times, or whether the weights associated with each hidden layer changed by no more than the prescribed training threshold (process action 302). If neither condition holds, process action 300 is repeated. If either condition is true, the resulting fine-tuned DNN is designated as the trained DNN (process action 304).
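The iteration control of FIG. 3 can be sketched as follows. Here `bp_epoch`, the iteration cap, and the threshold value are illustrative stand-ins for the full-network BP pass and the prescribed limits:

```python
import numpy as np

def fine_tune(weights, bp_epoch, max_iters=4, tol=1e-4):
    """Repeat full-network BP epochs until the iteration cap is reached or the
    per-layer weight change falls below the training threshold (process 302)."""
    for _ in range(max_iters):
        before = [W.copy() for W in weights]
        bp_epoch(weights)                     # one BP pass over all training entries
        change = max(np.abs(W - B).max() for W, B in zip(weights, before))
        if change <= tol:                     # weights have effectively converged
            break
    return weights                            # the trained DNN (process 304)
```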
2.0 Exemplary operating environment
The discriminative pretraining technique embodiments described herein are operational with numerous general-purpose or special-purpose computing system environments or configurations. FIG. 4 illustrates a simplified example of a general-purpose computer system on which various embodiments and elements of the discriminative pretraining technique may be implemented. Any boxes represented by broken or dashed lines in FIG. 4 represent alternative embodiments of the simplified computing device, and any or all of these alternative embodiments, as described below, may be used in combination with other alternative embodiments described throughout this document.
For example, FIG. 4 illustrates a general system diagram showing a simplified computing device 10. Typically, such computing devices can be found in devices having at least some minimum computing capability, including but not limited to personal computers, server computers, hand-held computing devices, laptop or mobile computers, communication devices such as cell phones or PDAs, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, and the like.
To allow a device to implement the discriminative pretraining technique embodiments described herein, the device should have sufficient computational capability and system memory to enable basic computing operations. In particular, as illustrated in FIG. 4, the computational capability is generally illustrated by one or more processing units 12, and may also include one or more GPUs 14, either or both in communication with system memory 16. Note that the processing unit(s) 12 of the general computing device may be specialized microprocessors, such as a DSP, a VLIW processor, or other microcontroller, or can be a conventional CPU having one or more processing cores, including specialized GPU-based cores in a multi-core CPU.
In addition, the simplified computing device of FIG. 4 may also include other components, such as, for example, a communication interface 18. The simplified computing device of fig. 4 may also include one or more conventional computer input devices 20 (e.g., pointing devices, keyboards, audio input devices, video input devices, tactile input devices, devices for receiving wired or wireless data transmissions, etc.). The simplified computing device of fig. 4 may also include other optional components, such as, for example, one or more conventional display devices 24 or other computer output devices 22 (e.g., audio output devices, video output devices, devices for sending wired or wireless data transmissions, etc.). Note that typical communication interfaces 18, input devices 20, output devices 22, and storage devices 26 for a general purpose computer are well known to those skilled in the art and will not be described in detail herein.
The simplified computing device of FIG. 4 may also include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 10 via the storage devices 26 and includes both volatile and nonvolatile media as removable and/or non-removable memory 28, 30 for storage of information such as computer-readable or computer-executable instructions, data structures, program modules or other data. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes, but is not limited to, computer or machine readable media or storage devices, such as DVD, CD, floppy disks, tape drives, hard disk drives, optical disk drives, solid state memory devices, RAM, ROM, EEPROM, flash memory or other memory technology, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other device that can be used to store the desired information and that can be accessed by one or more computing devices.
The preservation of information, such as computer-readable or computer-executable instructions, data structures, program modules, etc., can be accomplished by encoding one or more modulated data signals or carrier waves, or other transport mechanisms or communication protocols, using any of a variety of communication media as described above, and including any wired or wireless information delivery mechanisms. Note that the term "modulated data signal" or "carrier wave" generally refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a direct-wired connection or a wired network that carries one or more modulated data signals; and wireless media such as acoustic, RF, infrared, laser, or other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Furthermore, software, programs, and/or computer program products, or portions thereof, embodying some or all of the various pre-training technique embodiments described herein may be stored, received, transmitted, or read in the form of computer-executable instructions or other data structures from any desired combination of computers or machine-readable media or storage devices and communication media.
Finally, the discrimination pre-training technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or in a cloud of one or more devices that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Additionally, the instructions described above may be implemented partially or wholly as hardware logic circuitry, which may or may not include a processor.
3.0 Other embodiments
In another exemplary discriminative pretraining technique embodiment, the DNN is changed from having all nonlinear layers to having interleaved linear and nonlinear layers. In this embodiment, BP is no longer needed for the discriminative pretraining; instead, convex optimization is used to determine the DNN weights prior to fine-tuning. Here too, pretraining this type of DNN with interleaved linear and nonlinear layers involves accessing a set of training data entries (optionally plus output layer data), each of which has a corresponding label assigned to it. However, all data entries are input in a batch manner rather than one after another.
It should also be noted that any or all of the above embodiments throughout the specification can be used in any desired combination to form additional hybrid embodiments. Furthermore, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Supplementary note
1. A computer-implemented process for pre-training a Deep Neural Network (DNN), comprising:
performing the following processing actions with a computer:
(a) training a single-hidden-layer neural network NN, the single-hidden-layer neural network comprising: an input layer into which training data is input; an output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the output layer with randomly initialized weights, wherein the training comprises:
accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto,
inputting each data entry in the set into the input layer one by one until all of the data entries have been input at least once to produce an initial NN, whereby after the input of each data entry, the weights associated with the first hidden layer are set via an error back propagation process such that the output generated from the output layer matches the label associated with the training data entry;
(b) discarding the current output layer and adding a new hidden layer interconnected with the last previously trained hidden layer and the new output layer with randomly initialized weights to generate a new multi-hidden layer deep neural network;
(c) inputting each data entry in the set into the input layer one after another until all of the data entries have been input at least once to produce a modified multi-hidden-layer deep neural network, whereby after the input of each data entry, the weights associated with the new hidden layer and each previously trained hidden layer are set via the error back-propagation process to produce an output from the new output layer that matches the label associated with the training data entry;
(d) repeating acts (b) and (c) until a prescribed number of hidden layers have been added; and
(e) designating the most recently generated modified multi-hidden-layer DNN as the pre-trained DNN.
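The procedure of actions (a) through (e) can be sketched in code. The following is a minimal NumPy illustration, not the claimed implementation: it assumes full-batch back-propagation updates, sigmoid hidden units, a softmax output layer with cross-entropy error, and omits bias terms; all function names and the fixed number of passes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, hidden):
    # returns the activation of every layer, input included
    acts = [X]
    for W in hidden:
        acts.append(sigmoid(acts[-1] @ W))
    return acts

def bp_pass(X, Y, hidden, W_out, lr):
    # one full-batch error back-propagation update (softmax + cross-entropy)
    acts = forward(X, hidden)
    P = softmax(acts[-1] @ W_out)
    delta = (P - Y) / len(X)               # gradient at the output pre-activation
    grad_out = acts[-1].T @ delta
    delta = (delta @ W_out.T) * acts[-1] * (1.0 - acts[-1])
    grads = []
    for i in range(len(hidden) - 1, -1, -1):
        grads.append(acts[i].T @ delta)    # gradient for hidden[i]
        if i > 0:
            delta = (delta @ hidden[i].T) * acts[i] * (1.0 - acts[i])
    for W, g in zip(hidden, reversed(grads)):
        W -= lr * g                        # update every previously trained layer too
    W_out -= lr * grad_out
    return W_out

def discriminative_pretrain(X, Y, layer_widths, lr=0.1, passes=10):
    n_classes = Y.shape[1]
    hidden, d_in = [], X.shape[1]
    W_out = None
    for width in layer_widths:
        # (b) discard the old output layer, add a randomly initialized hidden layer
        hidden.append(0.1 * rng.standard_normal((d_in, width)))
        W_out = 0.1 * rng.standard_normal((width, n_classes))  # fresh output layer
        # (c) run back-propagation over the whole training set with the new stack
        for _ in range(passes):
            W_out = bp_pass(X, Y, hidden, W_out, lr)
        d_in = width
    # (e) the final stack of hidden layers is the pre-trained DNN
    return hidden, W_out
```

Because every previously trained hidden layer stays in the update loop, each growing step refines all existing weights rather than freezing them, which is what keeps them in a high-gradient range for later fine-tuning.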
2. The process of supplementary note 1, wherein each output layer employed utilizes a softmax function to match its output to the label associated with the currently incoming training data entry.
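As a brief illustration of supplementary note 2, the softmax function turns the output layer's pre-activations into a probability distribution over class labels, and the predicted label is the one with the highest probability; the specific logit values below are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # hypothetical output-layer pre-activations
p = softmax(logits)                   # a distribution over the class labels
predicted = int(np.argmax(p))         # compared against the entry's label index
```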
3. The process of supplementary note 1, wherein the act of accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto, comprises accessing a set of speech frames, each speech frame in the set of speech frames corresponding to a speech unit label.
4. The process of supplementary note 1, wherein the act of inputting each data entry of the set into the input layer one after the other until all of the data entries have been input at least once to produce an initial NN comprises inputting each data entry of the set only once.
5. The process of supplementary note 1, wherein the act of inputting each data entry in the set to the input layer one after the other until all of the data entries have been input at least once to produce a modified multi-hidden-layer deep neural network comprises inputting each data entry in the set only once.
6. The process according to supplementary note 1, wherein the error back-propagation process for setting the weights associated with the first hidden layer employs a prescribed learning rate ranging between 0.01 and 0.20.
7. The process according to supplementary note 1, wherein the error back-propagation process for setting the weights associated with each new hidden layer and each previously trained hidden layer employs a prescribed learning rate ranging between 0.01 and 0.20.
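For context, the back-propagation update referenced in supplementary notes 6 and 7 is the standard gradient step on each weight, with the learning rate drawn from the stated range; taking the error E to be the cross-entropy of the softmax output is an assumption consistent with supplementary note 2:

```latex
w^{(\ell)}_{ij} \;\leftarrow\; w^{(\ell)}_{ij} \;-\; \eta\,\frac{\partial E}{\partial w^{(\ell)}_{ij}},
\qquad 0.01 \le \eta \le 0.20
```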
8. A system for training a context-dependent deep neural network CD-DNN, comprising:
a computing device;
a computer program comprising program modules executable by the computing device, the computer program comprising:
a hidden layer generator program module, wherein the hidden layer generator program module instructs the computing device to:
initially generating a single hidden layer neural network, the single hidden layer neural network comprising: an input layer into which training data is input; an output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the output layer with randomly initialized weights,
whenever a pre-trained version of the single hidden layer neural network is generated, discarding the current output layer and adding a new hidden layer interconnected with the first hidden layer and a new output layer with randomly initialized weights to generate a multi-hidden-layer deep neural network, and
whenever a pre-trained version of a most recently generated multi-hidden-layer deep neural network is generated and designated as lacking the prescribed number of hidden layers, discarding the current output layer and adding a new hidden layer interconnected with the last previously added hidden layer and a new output layer with randomly initialized weights to generate a new multi-hidden-layer deep neural network,
a pre-training program module, wherein the pre-training program module instructs the computing device to:
accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto,
accessing the single hidden layer neural network once it is generated,
inputting each data entry in the set into the input layer of the single-hidden-layer neural network one after another until all of the data entries have been input at least once to produce the pre-trained version of the single-hidden-layer neural network, whereby after the input of each data entry, the weights associated with the first hidden layer are set via an error back-propagation process to produce an output from the output layer that matches the label associated with the training data entry;
accessing each multi-hidden-layer deep neural network as it is generated,
for each multi-hidden-layer deep neural network accessed, inputting each data entry in the set of training data entries into the input layer one by one until all of the data entries have been input at least once to produce a pre-trained version of the accessed multi-hidden-layer deep neural network, setting the weights associated with the most recently added hidden layer and each previously trained hidden layer via the error back-propagation process after the input of each data entry to produce an output from the output layer that matches the label associated with the training data entry, and
a DNN module, wherein the DNN module instructs the computing device to:
each time a pre-trained version of a multi-hidden-layer DNN is generated, determining whether that pre-trained version includes the prescribed number of hidden layers,
designating the most recently generated pre-trained multi-hidden-layer deep neural network as lacking the prescribed number of hidden layers whenever it is determined that it does not include the prescribed number of hidden layers, and
designating the most recently generated pre-trained multi-hidden-layer deep neural network as the pre-trained DNN whenever it is determined that it includes the prescribed number of hidden layers.
9. The system of supplementary note 8, further comprising a fine-tuning program module, wherein the fine-tuning program module instructs the computing device to iteratively train the pre-trained DNN until the weights associated with each hidden layer do not vary by more than a prescribed training threshold between iterations to produce a trained DNN, wherein each training iteration includes inputting each data entry of the set of training data entries into the input layer one by one until all of the data entries have been input once to produce a new fine-tuned version of the pre-trained DNN, whereby after the input of each data entry, the weights associated with the hidden layers are set via the error back-propagation process such that an output generated from the output layer matches the label associated with the training data entry.
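The stopping rule of supplementary note 9 (iterate until no weight moves more than a threshold between iterations) can be sketched as follows. This is an illustrative skeleton, not the claimed implementation: `run_bp_epoch` stands in for one full back-propagation pass over all training entries, and the toy epoch function and tolerance values are hypothetical.

```python
import numpy as np

def fine_tune(weights, run_bp_epoch, tol=1e-4, max_iters=200):
    """Repeat full BP epochs until no weight in any hidden layer changes
    by more than `tol` between successive iterations."""
    iters = 0
    for iters in range(1, max_iters + 1):
        before = [W.copy() for W in weights]
        run_bp_epoch(weights)         # one pass over every training data entry
        change = max(np.abs(W - B).max() for W, B in zip(weights, before))
        if change <= tol:             # weights have effectively stopped moving
            break
    return iters

def toy_epoch(weights):
    weights[0] *= 0.5                 # stand-in for a real back-propagation update

ws = [np.ones((2, 2))]
n_iters = fine_tune(ws, toy_epoch, tol=1e-3)
```

With the toy epoch, each iteration halves the weights, so the per-iteration change shrinks geometrically and the loop halts once it falls below the tolerance.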
10. The system of supplementary note 8, wherein the accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto comprises accessing a set of speech frames, each speech frame in the set of speech frames corresponding to a speech unit label.
11. The system of supplementary note 8, wherein the pre-training program module instructs the computing device to input each data entry in the set to the input layer of the single-hidden-layer neural network one after another until all of the data entries have been input only once to generate the pre-trained version of the single-hidden-layer neural network.
12. The system of supplementary note 8, wherein the pre-training program module instructs the computing device to, for each multi-hidden-layer deep neural network accessed, input each data entry of the set of training data entries into the input layer one by one until all of the data entries have been input only once to produce a pre-trained version of the accessed multi-hidden-layer deep neural network.
13. The system of supplementary note 8, wherein the pre-training program module instructs the computing device to set the weights associated with the first hidden layer via an error back-propagation process employing a prescribed learning rate ranging between 0.01 and 0.20 to produce an output from the output layer that matches the label associated with the training data entry.
14. The system of supplementary note 8, wherein the pre-training program module instructs the computing device to set the weights associated with the most recently added hidden layer and each previously trained hidden layer via an error back-propagation process employing a prescribed learning rate ranging between 0.01 and 0.20 to produce an output from the output layer that matches the label associated with the training data entry.
15. A computer-readable storage medium having stored thereon computer-executable instructions for training a Deep Neural Network (DNN), the computer-executable instructions comprising:
(a) training a single-hidden-layer neural network NN, the single-hidden-layer neural network comprising: an input layer into which training data is input; an output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the output layer with randomly initialized weights, wherein the training comprises:
accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto,
inputting each data entry in the set into the input layer one by one until all of the data entries have been input once to produce an initial NN, whereby after the input of each data entry, the weights associated with the first hidden layer are set via an error back-propagation process employing a prescribed learning rate ranging between 0.01 and 0.20 to produce an output from the output layer that matches the label associated with the training data entry;
(b) discarding the current output layer and adding a new hidden layer interconnected with the last previously trained hidden layer and a new output layer with randomly initialized weights to generate a new multi-hidden-layer deep neural network;
(c) training the newly generated multi-hidden-layer deep neural network, wherein the training comprises: inputting each data entry in the set into the input layer one after another until all of the data entries have been input once to produce a modified multi-hidden-layer deep neural network, whereby after the input of each data entry, the weights associated with the new hidden layer and each previously trained hidden layer are set via the error back-propagation process employing the prescribed learning rate such that the output generated from the output layer matches the label associated with the training data entry;
(d) repeating instructions (b) and (c) until a prescribed number of hidden layers have been added; and
(e) designating the most recently generated modified multi-hidden-layer DNN as the pre-trained DNN.
16. The computer-readable storage medium of supplementary note 15, wherein the instructions for training the single hidden layer NN include each output layer employing a softmax function to match its output to the label associated with a currently incoming training data entry.
17. The computer-readable storage medium of supplementary note 15, wherein the instructions for training the newly generated multi-hidden-layer deep neural network include each output layer employing a softmax function to match its output to the label associated with a currently incoming training data entry.
18. The computer-readable storage medium of supplementary note 15, wherein the instructions for accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto, comprise accessing a set of speech frames, each speech frame in the set of speech frames corresponding to a speech unit label.
19. The computer-readable storage medium of supplementary note 15, further comprising instructions for iteratively training the pre-trained DNN a prescribed number of times to produce a trained DNN, wherein each training iteration includes inputting each data entry of the set of training data entries into the input layer one by one until all of the data entries have been input once to produce a new fine-tuned version of the pre-trained DNN, whereby after the input of each data entry, the weights associated with the hidden layers are set via the error back-propagation process to produce an output from the output layer that matches the label associated with the training data entry.
20. The computer-readable storage medium of supplementary note 19, wherein the instructions for iteratively training the pre-trained DNN a prescribed number of times to produce the trained DNN comprise training the pre-trained DNN four times to produce the trained DNN.
Claims (7)
1. A computer-implemented method for pre-training a Deep Neural Network (DNN), comprising:
performing the following processing actions with a computer:
(a) training a single-hidden-layer neural network NN, the single-hidden-layer neural network comprising: an input layer into which training data is input; a multi-neuron output layer from which an output is generated; and a first hidden layer interconnected with the input layer and the multi-neuron output layer with randomly initialized weights, wherein the training comprises:
accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto (200),
inputting each data entry in the set into the input layer one by one until all of the data entries have been input at least once to produce an initial NN (202), whereby after the input of each data entry, the weights associated with the first hidden layer are set via an error back-propagation process such that the output generated from the multi-neuron output layer matches the label associated with the training data entry;
(b) discarding the current multi-neuron output layer and adding a new hidden layer interconnected with the last previously trained hidden layer and a new multi-neuron output layer with randomly initialized weights to generate a new multi-hidden-layer deep neural network (204);
(c) inputting each data entry in the set to the input layer one by one until all of the data entries have been input at least once to produce a modified multi-hidden-layer deep neural network (206), whereby after the input of each data entry, the weights associated with the new hidden layer and each previously trained hidden layer are set via the error back-propagation process to produce an output from the new multi-neuron output layer that matches the label associated with the training data entry;
(d) repeating acts (b) and (c) until a prescribed number of hidden layers have been added (208); and
(e) designating the most recently generated modified multi-hidden-layer DNN as the pre-trained DNN (210).
2. The method of claim 1, wherein each multi-neuron output layer employed utilizes a softmax function to match its output to the label associated with a currently incoming training data entry.
3. The method of claim 1, wherein the processing action of accessing a set of training data entries, each data entry in the set of training data entries having a corresponding label assigned thereto comprises accessing a set of speech frames, each speech frame in the set of speech frames corresponding to a speech unit label.
4. The method of claim 1, wherein the processing action of inputting each data entry in the set into the input layer one after another until all of the data entries have been input at least once to produce an initial NN comprises inputting each data entry of the set only once.
5. The method of claim 1, wherein the processing action of inputting each data entry in the set to the input layer one after the other until all of the data entries have been input at least once to produce a modified multi-hidden-layer deep neural network comprises inputting each data entry in the set only once.
6. The method of claim 1, wherein the error back propagation process for setting the weights associated with the first hidden layer employs a prescribed learning rate ranging between 0.01 and 0.20.
7. The method of claim 1, wherein the error back propagation process for setting the weights associated with each new hidden layer and each previously trained hidden layer employs a prescribed learning rate ranging between 0.01 and 0.20.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/304,643 | 2011-11-26 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1183141A HK1183141A (en) | 2013-12-13 |
| HK1183141B true HK1183141B (en) | 2017-10-06 |