WO1992017849A1 - Automatic design of signal processors using neural networks - Google Patents
Automatic design of signal processors using neural networks
- Publication number
- WO1992017849A1 (PCT/US1992/002796)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- training
- gain
- node
- neural network
- learning rate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
Abstract
A method of training a neural network (10) having an output layer (20) and at least one middle layer (16) including one or more internal nodes (18) each of which is characterized by a node activation function having a gain, the method including the steps of setting the gain on at least some of the internal nodes equal to an initial gain value; training the multi-layer perceptron starting with the initial gain value; and changing the gain on at least some of the internal nodes during training, the gain change being in a direction which increases sensitivity of the multi-layer perceptron.
Description
AUTOMATIC DESIGN OF SIGNAL PROCESSORS USING NEURAL NETWORKS

Background of the Invention

The Government has rights in this invention pursuant to Contract Number F19628-90-C-0002 awarded by the Department of the Air Force.
The invention relates to neural networks and methods of training neural networks. An important problem in signal processing is the capability to discriminate between signals originating from many different measurements. The discrimination is arbitrary in that any difference in the target or environment can serve as the basis of separation. The signal processing system must evaluate an available set of measurements and determine if the signals are separable at the operating signal-to-noise ratio (SNR). Typically, a filter or transform is applied to the raw signal to obtain a representation that contains an easily separable set of features. The difficulty lies in designing a transformation that maps the raw signal into a more easily separable representation. In some cases, it is clear that a certain filtering operation is appropriate, though in general, the selection of a filter that maps the raw signal to a salient data representation is a heuristic process.
Ideally, the signal processor would "learn" the mapping required to perform signal discrimination from a known set of separable measurements. For certain signals, such as an EEG measurement, the relevant information is known to be in the frequency domain, and the separability can be enhanced by applying an FFT to the signal. For other signals, such as radar signatures, the appropriate mapping is unknown and some heuristic application of transformations known in the literature must be attempted.
Neural networks offer the promise of a completely data-driven processor that automatically learns the required mapping by example. The neural network based processor can simply be retrained to accommodate changes in the measurement and discrimination.
The neural network approach has a loose correspondence to biological nervous systems where each neuron receives input from potentially thousands of other neurons to form a nonlinear interconnected network. It is believed that these biological networks are capable of learning highly complex mappings. A possible mechanism for learning in neural systems was proposed by Hebb (see, D. O. Hebb, "The Organization of Behavior," New York, N.Y., John Wiley, 1949) and this led to an interest in computer modeling of networks of "neuron-like" elements. Some of the first experiments on computer modeling of neural networks with adaptation were done at Lincoln Laboratory in Lexington, Massachusetts. Further modeling work by Rosenblatt (see, F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psych. Rev., 65, pp. 386-408 (1958)) led to the development of a multi-layer interconnected system of simple threshold neurons known as the Perceptron model. This work also described the Perceptron Learning Algorithm for adjusting the interconnection strengths to train the system. Later an adaptive system for optimizing filter response known as the ADALINE was developed by Widrow (see, B. Widrow and M.E. Hoff, "Adaptive Switching Circuits," IRE WESCON Conf. Record, Part 4, pp. 96-104 (1960)) and led to the development of the Least Mean Squared error training algorithm (LMS). However, an efficient training algorithm for multi-layer networks was not available, which limited the application of neural networks to real world problems.
A training algorithm for multi-layer networks was then developed that allowed any desired mapping to be approximated given enough nodes in the network. This approach was later redeveloped as the BEP training algorithm and applied to many different problems in the framework of parallel distributed processing (see, D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning Internal Representations by Error Propagation," in D.E. Rumelhart and J.L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Cambridge, MA, MIT Press (1986)). A drawback with the BEP approach is that convergence is slow due to local minima problems and it requires thousands of presentations of the training set for large dimensional problems.
Summary of the Invention

The invention relates to a neural network based signal recognition system that "learns" an appropriate transform for a given signal type and application. The recognition system is based on a multi-layer Perceptron neural network trained with a highly efficient deterministic annealing algorithm that can be two to three orders of magnitude faster than the commonly used Backward Error Propagation (BEP) technique. In addition, the training algorithm is less susceptible to local minima problems. The system is data driven in the sense that nodes are added until a specified level of performance is achieved, thereby making most efficient use of the available processing resources.
In general, in one aspect, the invention features a method of training a neural network having an output layer and at least one middle layer including one or more
internal nodes each of which is characterized by a node activation function having a gain. The method includes the steps of setting the gain on at least some of the internal nodes equal to an initial gain value; training the multi-layer perceptron starting with the initial gain value; and changing the gain on at least some of the internal nodes during training, the gain change being in a direction which increases sensitivity of the multi-layer perceptron. Preferred embodiments include the following features. The neural network is a fully connected, multi-perceptron neural network which includes no more than one middle layer that has but a single node. The training employs a gradient descent training procedure, in particular, a back error propagation training procedure. The training is characterized by a learning rate and the method also includes the step of decreasing that learning rate during training while also changing the gain. The method further includes the step of setting the gain of each output node to a fixed value before beginning any training.
Also in preferred embodiments, the internal nodes are each characterized by a sigmoid-like activation function which has the following form:
f(x) = α/(1 + e^(-βx)) - δ,
where α and δ are equal to 2 and 1, respectively, β is a gain variable, and x is a node input. The gain changing step changes the gain by increasing β. In addition, the method further includes the step of computing an error for the neural network after the gain has reached a final gain, the error indicating how well the neural network has been trained. Also, the method includes the further steps of adding an additional node to one of the
internal layers if the error exceeds a predetermined threshold; and after adding the additional node, retraining the neural network.
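For illustration, the activation just described can be sketched in a few lines of Python (a minimal sketch, not part of the original disclosure; the function name and sample inputs are illustrative, with α = 2 and δ = 1 as stated above):

```python
import numpy as np

# Gain-parameterized sigmoid: f(x) = alpha / (1 + exp(-beta * x)) - delta.
# With alpha = 2 and delta = 1 the output ranges over (-1, 1).
def activation(x, beta, alpha=2.0, delta=1.0):
    return alpha / (1.0 + np.exp(-beta * x)) - delta

# A small gain gives a nearly flat, "fuzzy" response; a large gain
# approaches a hard threshold, increasing the node's sensitivity.
print(activation(0.5, beta=1.0))   # gentle slope at low gain
print(activation(0.5, beta=10.0))  # near saturation at high gain
```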
Furthermore, in preferred embodiments, the training is supervised training using a training set made up of members for which corresponding desired outputs are known, and the said error is a measure of how far the desired outputs for the members of the training set are from actual outputs generated by applying the members of the training set to the neural network. The error is computed in accordance with the following equation:
E = Σ_p Σ_j |o^p_j - d^p_j|,
where p is an index identifying a member of the training set; j is an index identifying an output node; o^p_j is an actual output of output node j for the pth member of the training set; and d^p_j is a desired output of output node j for the pth member of the training set.
The method further includes the steps of determining whether the training is converging; and if it is determined that the training is not converging, modifying the training by increasing the training rate so as to cause an instability in training to occur. The method also includes the step of resuming training at a reduced training rate after training for a preselected period of time with the increased learning rate.
In general, in another aspect, the invention features an apparatus for training a neural network having an output layer and at least one middle layer which includes one or more internal nodes each of which is characterized by a node activation function having a gain. The apparatus includes means for setting the gain on at least some of the internal nodes equal to an
initial gain value; means for training the multi-layer perceptron starting with the initial gain value; and means for changing the gain on at least some of the internal nodes during training, the gain change being in a direction which increases sensitivity of the multi-layer perceptron.
In preferred embodiments, the internal nodes are each characterized by a sigmoid-like activation function of the following form:
f(x) = α/(1 + e^(-βx)) - δ,
wherein α and δ are constants, β is a gain variable, and x is a node input. The apparatus also includes means for computing an error for the neural network after the gain has reached a final gain, the error indicating how well the neural network has been trained. It further includes means for adding an additional node to one of the internal layers if the error exceeds a predetermined threshold; and means for causing the training means to retrain the neural network after the additional node has been added.
One advantage of the invention is that it can find a solution for architectures which appear to be insufficient based upon previous training techniques. For many problems involving real sensor signals, the invention arrives at architectures requiring fewer than 10 to 15 internal ("hidden") nodes to achieve the desired signal discrimination. In addition, the invention enables one to train a neural network on a very limited portion of the data and still achieve good generalization to the remainder of the data set. Moreover, the performance of the training algorithm is not particularly dependent on the order in which the neural network is trained.
Other advantages and features will become apparent from the following description of the preferred embodiment and from the claims.
Description of the Preferred Embodiment

Fig. 1 is a multi-layer perceptron;
Figs. 2a and 2b present a flow chart of the gradient descent gain annealing (GDGA) algorithm for training a multi-layer perceptron;
Fig. 3 shows the testing performance of a MLP network as a function of the percent of the data set used for training;
Fig. 4 shows the training performance of a single hidden node MLP network as a function of the percent of the data set used for training; and Fig. 5 is a comparison of the average testing performance of a MLP network trained on 1% of the data set.
Structure and Operation
Referring to Fig. 1, a multi-layer perceptron (MLP) neural network 10 is made up of an input layer 12 of input nodes 14 followed by an internal "hidden" layer 16 of internal nodes 18 that are connected to an output layer 20 of output nodes 22. In the described embodiment, MLP network 10 is a fully interconnected MLP network operating in a feed-forward mode. In such an architecture, each node in a given layer is connected to every node in the next higher layer, and conversely, every node at any level above input layer 12 receives input from every node on the next lower level. Thus, for example, the node labelled "A", i.e., a representative node 14 of input layer 12, is connected to every node 18 in the next higher, internal layer 16 and the node labelled "B", i.e., a representative node 18 of internal
layer 16, is connected to every node 14 in the input layer 12.
Though the depicted MLP network has only a single hidden layer 16, it could have more than one hidden layer depending upon the complexity and type of problem being modeled. In addition, the number of input nodes 14 which are actually used depends on the dimensionality of the signal which will be fed into MLP network 10. For example, if the input signal is an M-point FFT, it may be necessary to use M input nodes. Similarly, the number of output nodes varies depending upon the number of classifications in the training set, i.e., the number of different class categories which MLP network 10 is being trained to distinguish. If MLP network 10 is to be trained to only determine whether or not a particular signal pattern is present in the input signal, only one output node would be required. Whereas, if MLP network 10 is being asked to distinguish among eight different class categories, at least three output nodes (2³ = 8) will be necessary.
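As a side note, this relationship between class categories and binary-coded output nodes can be computed directly (a hypothetical helper, not part of the disclosed method):

```python
import math

# Minimum number of output nodes needed to binary-code num_classes
# categories, per the 2^3 = 8 example above.
def min_output_nodes(num_classes):
    return max(1, math.ceil(math.log2(num_classes)))

print(min_output_nodes(2))  # 1 node for a present/absent decision
print(min_output_nodes(8))  # 3 nodes distinguish eight categories
```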
Each node in MLP network 10 is characterized by a particular node activation function f(x) and an offset θ, and each connection between node j in one layer and node k in the next lower level is characterized by a weight, w_kj. In the described embodiment, the activation function for input nodes 14 simply represents a pass through, i.e., f(x) = x, with zero offset. Whereas, the activation functions for internal nodes 18 and output nodes 22 are sigmoid functions having the following form:
f(x) = 2/(1 + e^(-βx)) - 1, Eq. 1
where β is the gain and x is the sum of the signals received by that node plus an offset. That is, the input signal is:
x = I_j(l) + θ_j(l), Eq. 2
where I_j(l) equals the sum of the signals received at node j on level l from all nodes of the next lower level and θ_j(l) equals the offset. Thus, the output for node j on level l is as follows:
O_j(l) = f[I_j(l) + θ_j(l)]. Eq. 3
In terms of the output signals of the next lower level, I_j(l) can be expressed as follows:
I_j(l) = Σ_k w_kj · O_k(l-1). Eq. 4
Thus,
O_j(l) = f[(Σ_k O_k(l-1) · w_kj) + θ_j(l)]. Eq. 5
In general, a modified BEP training procedure is used to train the MLP network 10. It is modified by starting the system at a small magnitude for the gain and annealing the system to a large gain. At each gain value the BEP algorithm is run to convergence. The gain is a variable which has the characteristic that changing it deforms the energy surface with respect to the other free parameters (i.e., the weights and offsets). At low gain values, the MLP network has a nearly flat energy landscape, and the search covers a large portion of the parameter space. As the gain is increased the sensitivity to the parameters increases and the search is restricted to the neighborhood of a good solution. Empirical studies have shown that this procedure, which shall hereinafter be referred to as the Gradient Descent Gain Annealing (GDGA) procedure, reduces the chance of the network state being trapped in a local minimum state during training. Quantitative results are presented later.
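A minimal Python sketch of the feed-forward computation of Eqs. 2 through 5 follows (the data layout and names are assumptions for illustration, not part of the disclosure):

```python
import numpy as np

def activation(x, beta):
    # Eq. 1 with the constants of the described embodiment (2 and 1)
    return 2.0 / (1.0 + np.exp(-beta * x)) - 1.0

# weights[l] has shape (nodes on level l, nodes on level l-1), so that
# weights[l][j, k] is w_kj; offsets[l][j] is theta_j(l); betas[l] is the
# gain applied to the nodes on level l.
def forward(signal, weights, offsets, betas):
    output = np.asarray(signal, dtype=float)  # input nodes pass through
    for w, theta, beta in zip(weights, offsets, betas):
        net = w @ output + theta        # Eqs. 2 and 4: I_j(l) + theta_j(l)
        output = activation(net, beta)  # Eqs. 3 and 5: O_j(l) = f[...]
    return output
```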
If the performance of the network on a testing data set does not meet a specified criterion, a node is added to the hidden layer and the GDGA training procedure is repeated. The dynamic architecture MLP network has the ability to grow to accommodate the complexity of the
problem and make efficient use of the available resources. The GDGA training procedure will now be described in greater detail.
In the described embodiment, the MLP network includes only a single hidden layer. However, it should be understood that the procedure applies to other MLP networks and other architectures, including those with multiple hidden layers. The steps of a GDGA training algorithm 100 are presented in Figs. 2a-b. Training algorithm 100, which implements a supervised training schedule, begins with the selection of a set of input signals for which the desired outputs are known, i.e., a training set (step 102). Before actual training occurs, MLP network 10 is initialized, which involves setting the gain of all of the output nodes to a low fixed quantity, e.g., 2 (step 104). Setting this to a low level forces internal nodes 18 to perform the discrimination necessary to properly classify the signals in the training set and prevents them from relying on the output nodes to do so. As another part of the initialization phase, the number of internal nodes 18 on hidden layer 16 is set to one (i.e., k = 1) (step 106). During the subsequent training, it may become apparent that k = 1 does not provide the dimensionality required to perform the classification satisfactorily, in which case, more nodes are added. This approach, as compared to other prior art approaches to this problem, has the advantage of proceeding from simpler architectures to more complex ones based upon the demands of the problem rather than by starting with an architecture which is unnecessarily complex for the problem at hand and then trying to pare away unneeded nodes.
Next, the weights and offsets of all of the internal and output nodes 18 and 22 and connections are
set to some small random values (step 108). In the described embodiment, the values for the weights and offsets are selected by using the following algorithm: 2.0 * RANDOM - 1.0, where RANDOM is a random number generating function which produces a number between 0 and 1. The output of RANDOM is scaled and shifted so as to yield a distribution of randomly generated numbers centered on zero, which thus introduces no bias into the initialization of MLP network 10.
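In code, this initialization amounts to one line per parameter (a minimal sketch; the layer sizes shown are illustrative):

```python
import random

# 2.0 * RANDOM - 1.0 maps a uniform draw on [0, 1) onto [-1, 1),
# giving small random values centered on zero (no initial bias).
def init_value():
    return 2.0 * random.random() - 1.0

n_inputs, n_hidden = 64, 1  # illustrative sizes; training starts at k = 1
hidden_weights = [[init_value() for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
hidden_offsets = [init_value() for _ in range(n_hidden)]
```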
After initializing the weights and offsets, algorithm 100 defines the ranges over which the gain β of internal nodes 18 and the learning rate for the subsequent training will be permitted to vary during the gain annealing process (step 110). It then initializes the gain and learning rate to their initial values (step 112). In the described embodiment, β_init, the initial gain value, is set to 1.0; β_final, the final gain value, is set to 10.0; η_init, the initial learning rate, is set to 0.03; and η_final, the final learning rate, is set to 0.0001. Algorithm 100 also initializes an energy variable E_k equal to some large number. E_k serves to keep track of the minimum energy which is achieved during the training procedure. Setting E_k to a large initial value assures that the first computed energy for MLP network 10 will be smaller than the initial value of E_k.
Next, an initial energy E_old is computed for MLP network 10. E_old is a measure of the total error between all of the training set signals and the desired outputs for those signals (step 114). The expression for computing E_old is as follows:
E_old = Σ_p Σ_j |O^p_j - d^p_j|,
where p is an index identifying the member of the training set; j is an index identifying the output node; O^p_j is the actual output of output node j for the pth signal of the training set; and d^p_j is the desired output of output node j for the pth signal of the training set.
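This total error can be computed directly; a minimal sketch (array names and shapes are assumptions):

```python
import numpy as np

# actual and desired are both shaped (n_patterns, n_output_nodes);
# the energy is the absolute error summed over patterns p and nodes j.
def network_energy(actual, desired):
    return np.sum(np.abs(np.asarray(actual) - np.asarray(desired)))
```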
After computing E_old, algorithm 100 begins training MLP network 10 to adjust the weights and offsets using a back error propagation (BEP) procedure such as is well known to those skilled in the art (step 118). The BEP training procedure is run in the mode in which the weights and offsets are adjusted for each signal pattern of the training set rather than for the entire training set at once. Thus, one iteration of the BEP training procedure consists of a separate training for each of the members of the training set. The BEP training continues through multiple iterations until either the desired convergence is achieved or the number of iterations exceeds some threshold amount, indicating that the procedure is not converging. Algorithm 100 keeps track of the number of iterations which are performed for a given gain and learning rate to determine whether the training procedure becomes stuck and fails to converge.
After an iteration is complete, algorithm 100 computes E_new, the energy for MLP network 10 resulting from that iteration of training (step 120). E_new is then compared to E_k (step 122). If it is smaller than E_k, the value of E_k is set equal to E_new and the weights, offsets and gain for that new minimum are saved (step 124).
Once the new minimum energy value has been saved (or even when E_new is not a new minimum), algorithm 100 determines whether the number of iterations which have been performed during this loop of the BEP training procedure has exceeded 50 (step 126). During the initial iterations of the training, algorithm 100 will of course detect that the number of iterations does not exceed 50 and it will then determine whether the desired convergence toward a global solution is occurring (step 134). Algorithm 100 performs the convergence test by comparing the relative difference between E_new and E_old to some threshold level. In particular, algorithm 100 computes the absolute value of (E_new - E_old)/E_new and checks whether it is greater than 0.001. If the training at the given gain and learning rate is still achieving the desired convergence, algorithm 100 sets the value of E_old to E_new (step 135) and moves on to the next iteration of the BEP training procedure (i.e., algorithm 100 branches back to step 118).
If the training procedure gets trapped in a local minimum which causes the value of E to oscillate from one iteration to the next, it may be necessary to force the system out of that local minimum. The iteration count indicates when such a problem occurs by rising above 50 (see step 126). When algorithm 100 detects that the iteration count has exceeded 50, it "kicks" the system by boosting the learning rate to a very high number, e.g., 0.75 (step 128). After the learning rate has been increased to 0.75, algorithm 100 performs ten iterations of the BEP training procedure (step 130). Forcing a high learning rate during BEP training causes the system to become unstable and thus dislodges it from the local minimum. After the tenth iteration, algorithm 100 jumps to the next higher gain and the next lower learning rate, and branches back to step 118 to proceed with the BEP training with the new set of initial values for the state variables. It should be noted that in the described embodiment, algorithm 100 moves through the range of permissible gains and the range of permissible learning rates in a linear fashion, one jump at a time. Each step in gain is equal to (β_final - β_init)/5 and each step in learning rate is equal to (η_init - η_final)/5. In addition, when algorithm 100 increases the gain by one step, at the same time it also decreases the learning rate by one step.
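The bookkeeping for the kick and the coupled gain/learning-rate schedule might look as follows (a sketch; the constants follow the values given in the text, while the function names and the `run_bep_iterations` callable are illustrative assumptions):

```python
def next_schedule(beta, eta, beta_init=1.0, beta_final=10.0,
                  eta_init=0.03, eta_final=0.0001, n_steps=5):
    """One linear annealing jump: one step up in gain, one step down
    in learning rate, clamped to the permissible ranges."""
    beta = min(beta_final, beta + (beta_final - beta_init) / n_steps)
    eta = max(eta_final, eta - (eta_init - eta_final) / n_steps)
    return beta, eta

def kick(run_bep_iterations, beta, eta, kick_rate=0.75, kick_iters=10):
    """Escape a local minimum: ten iterations at a very high learning
    rate to destabilize the search, then jump to the next gain pair."""
    run_bep_iterations(kick_iters, learning_rate=kick_rate)
    return next_schedule(beta, eta)
```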
In step 134, if the relative change in the magnitude of the energy does not exceed the threshold value, algorithm 100 prepares to move on to the next higher gain level. First, it sets the value of E_old to E_new (step 139). Then, it checks β to determine whether it has reached the maximum gain level allowed (step 138). If β is less than β_final, algorithm 100 jumps to the next gain and learning rate (step 140) and then branches back to step 118 to repeat the above-described BEP training procedure.
When β reaches β_final, algorithm 100 checks if k equals 1 (step 142). If only one internal node has been tried, algorithm 100 adds a second node, sets k = 2, and branches back to step 108 to perform the gain annealed training procedure for the new architecture (step 144). After the above-described procedure has been performed for the two internal node architecture, algorithm 100 compares the improvement in performance resulting from adding the second node (step 146). That is, algorithm 100 computes the absolute value of (E_k - E_(k-1))/E_k. If it is greater than 10%, indicating that significant improvement resulted from adding the last node, algorithm 100 adds a third node and again branches back to step 108 to see what effect the third node yields (step 148). Algorithm 100 continues adding nodes until the resulting improvement in performance is no greater than 10%. At that point, algorithm 100 selects the structure and values of the state variables which yielded the lowest energy and terminates.
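The outer node-adding loop can be summarized as follows (a sketch; `train_with_gdga` is a hypothetical callable that runs the full gain-annealed training for a given hidden-layer size and returns the minimum energy E_k it found):

```python
def grow_network(train_with_gdga, max_nodes=15):
    energies = []
    k = 1
    while k <= max_nodes:
        energies.append(train_with_gdga(n_hidden=k))  # minimum energy E_k
        if k > 1:
            # relative improvement contributed by the last added node
            improvement = abs(energies[-1] - energies[-2]) / energies[-1]
            if improvement <= 0.10:
                break
        k += 1
    # keep the architecture that yielded the lowest energy overall
    best_k = min(range(len(energies)), key=energies.__getitem__) + 1
    return best_k
```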
The node function (i.e., Eq. 1) partitions different regions of the input space by constructing hyperplanes to approximate the region boundaries. The
input to each node, given by Eq. 2, is a linear equation for a plane in multidimensional space. As more hidden layer nodes are used, the actual boundaries are more closely approximated. The "sigmoid" transfer function implements a sharp or fuzzy boundary based on a high or low magnitude of the gain term in Eq. 1. An important characteristic of the sigmoid function is that it acts globally across the input space, thereby allowing the possibility of forming a compact representation of the salient features in the training data. Other commonly used techniques for supervised classification such as Radial Basis Functions (RBF's) and nearest neighbor algorithms act locally by computing the distance between clusters of data. The implication is that unless the distance metric in a local method happens to correctly represent the training data, the classifier will form O(n) clusters for n training samples. For many problems of interest this amounts to constructing a look-up table of the training data. By acting globally, the MLP network with a sigmoid node function is able to learn the correct representation and avoid storing the data.
An MLP network trained with the GDGA algorithm was evaluated using actual radar signatures. The task was to separate the radar signatures into two classes: object types A and B. The problem is difficult because the effects of object geometry and measurement conditions on the radar signature are not well characterized. As a consequence, the signatures are not easy to discriminate, and it is not clear which transformation will increase the separability. A data set consisting of 3692 radar signatures (equal numbers of types A and B) was used in this study.
The training procedure consisted of initializing the network weights and offsets to a set of random values and presenting a certain percent of the data set in a random order to train the network. After training, the weights and offsets were fixed and the entire data set (3692 signatures) was used to test the network. The combination of training and then testing the network is defined as a trial.
A specific MLP network was evaluated by running 100 trials, where the weights and offsets were initialized to a different set of random values at each trial. Each trial also selected a different (random) training set. The performance was defined as the percent of the input patterns correctly classified, based on the distance between the network output and a set of target outputs for the two signature types. The network had a single output node with target output values of 0.95 for type A and 0.05 for type B signatures. For each input pattern the target class with the minimum (Euclidean) distance to the network output was chosen as the pattern class. The trial with the maximum percent correct during testing was used for performance comparisons. The percent of the data set used for training and the number of nodes in the hidden layer were treated as independent parameters in the experiment. Fig. 3 shows the effect of these parameters on the percent correct over the entire data set. The network was able to generalize to nearly 85% of the data after training on only 1% of the data set. A single hidden node network quickly saturated at about 87% correct during testing. This was simply due to the limited number of degrees of freedom in a single node system, i.e., it can account for only a certain percentage of the training data.
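The per-trial scoring just described reduces to a nearest-target rule (a sketch; the target values come from the text, while the function names are illustrative):

```python
TARGETS = {"A": 0.95, "B": 0.05}

def classify(network_output):
    # With a single output node, Euclidean distance reduces to the
    # absolute difference from each class target.
    return min(TARGETS, key=lambda c: abs(network_output - TARGETS[c]))

def percent_correct(outputs, labels):
    hits = sum(classify(o) == y for o, y in zip(outputs, labels))
    return 100.0 * hits / len(labels)
```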
The observation is supported by Fig. 4 where the percent correct during training for a single node network is shown as a function of the percent of the data set used for training. Here the limited capacity of a single hidden layer node is shown by the decrease in training
set performance from 100% to 93% correct. This performance decrease occurred when the training set size was increased from 1% to 25% of the entire data set.
As shown in Fig. 3, the testing performance was significantly increased by adding nodes to the hidden layer. After five nodes, though, any further node addition only slightly improved the performance. The ten node network could account for 95% of the data after training on 20% of the data and was able to attain 97% correct during testing as the training percentage was increased. Further investigation showed that the remaining 3% of the data that the network could not account for were actually bad measurements. Apparently the network was able to discriminate between signatures and also identify a non-signature without being explicitly trained as to what constitutes a bad measurement.
The network was also trained with the standard BEP algorithm and the testing results are shown in Fig. 5. The generalization capability of the network trained with BEP was less than that of a network trained with the GDGA algorithm. This effect was most pronounced when training on a very small percent of the data set (a situation that is especially relevant to real world problems). The GDGA technique is able to explore the state space of the network more thoroughly at the low gain values than BEP training operating at a single gain. In this experiment the BEP algorithm was initialized to the optimum value of gain found by training with the GDGA algorithm. The "history" of training at many different gain values was apparently significant to the generalization capability of the network. Both the BEP and GDGA algorithms required approximately the same number of iterations for training (approximately 300 iterations), but on average the GDGA algorithm found better solutions. In the BEP technique, once the system is "stuck" in a local minimum, it is virtually impossible to find a good solution. In fact, BEP often takes more than 100,000 iterations to escape from a local minimum. By varying the gain, the GDGA algorithm can avoid local minima during training.
Other embodiments are within the following claims.
Claims

What is claimed is:

1. A method of training a neural network having an output layer and at least one middle layer comprising one or more internal nodes each of which is characterized by a node activation function having a gain, the method comprising: setting the gain on at least some of the internal nodes equal to an initial gain value; training the multi-layer perceptron starting with the initial gain value; and changing the gain on at least some of the internal nodes during training, said gain change being in a direction which increases sensitivity of the multi-layer perceptron.
2. The method of claim 1 wherein said neural network is a multi-layer perceptron neural network.

3. The method of claim 2 wherein said multi-layer perceptron neural network is fully connected.

4. The method of claim 2 wherein said multi-layer perceptron neural network comprises no more than one middle layer.
5. The method of claim 4 wherein said middle layer comprises only one node.
6. The method of claim 1 wherein said training employs a gradient descent training procedure.
7. The method of claim 6 wherein said training employs a back error propagation training procedure.
8. The method of claim 1 wherein said training is characterized by a learning rate and wherein said method further comprises decreasing said learning rate during training when said gain is changed.
9. The method of claim 1 further comprising the step of setting the gain of each output node to a fixed value before beginning any training.
10. The method of claim 1 wherein the internal nodes are each characterized by a sigmoid-like activation function.
11. The method of claim 10 wherein the internal nodes are each characterized by an activation function f(x) of the following form:

$$f(x) = \frac{\alpha}{1 + e^{-\beta x}} - \delta$$

wherein α and δ are constants, β is a gain variable, and x is a node input.
12. The method of claim 11 wherein said gain changing step changes the gain by increasing β.
13. The method of claim 11 wherein α equals 2 and δ equals 1.
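With α equal to 2 and δ equal to 1, the claimed activation function reduces to a scaled hyperbolic tangent, which makes the role of the gain β explicit (a worked simplification, not text from the patent):

$$f(x) = \frac{2}{1 + e^{-\beta x}} - 1 = \frac{1 - e^{-\beta x}}{1 + e^{-\beta x}} = \tanh\!\left(\frac{\beta x}{2}\right)$$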
14. The method of claim 1 further comprising the step of computing an error for the neural network after said gain has reached a final gain, said error indicating how well said neural network has been trained.
15. The method of claim 14 further comprising the steps of: adding an additional node to one of said internal layers if said error exceeds a predetermined threshold; and after adding said additional node, retraining said neural network.
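Claims 14 and 15 together describe an error-driven node-addition loop. A minimal sketch of that control flow, under assumed threshold and cap values (neither constant comes from the patent), follows:

```python
def grow_and_train(train_fn, error_fn, n_hidden=1,
                   error_threshold=0.05, max_nodes=20):
    """Error-driven node addition per claims 14-15: train, measure the
    error at the final gain, and if it exceeds the threshold, add a
    hidden node and retrain.  `train_fn(n)` returns a network trained
    with n hidden nodes; `error_fn(net)` returns its error E.
    Threshold and node cap are illustrative, not from the patent."""
    while n_hidden <= max_nodes:
        net = train_fn(n_hidden)
        if error_fn(net) <= error_threshold:
            return net, n_hidden             # trained well enough
        n_hidden += 1                        # error too high: add a node
    return net, max_nodes                    # give up at the cap
```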
16. The method of claim 14 wherein said training is supervised training using a training set made up of members for which corresponding desired outputs are known, and wherein said error is a measure of how far the desired outputs for the members of the training set are from actual outputs generated by applying the members of the training set to the neural network.
17. The method of claim 16 wherein said error is computed in accordance with the following equation:

$$E = \sum_{p} \sum_{j} \left| o_{j}^{p} - d_{j}^{p} \right|$$

where p is an index identifying a member of the training set; j is an index identifying an output node; $o_j^p$ is the actual output of output node j for the pth member of the training set; and $d_j^p$ is the desired output of output node j for the pth member of the training set.
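The claimed error is simply the sum of absolute differences between actual and desired outputs over all patterns and output nodes; a minimal sketch (array names and example values are hypothetical):

```python
import numpy as np

def training_error(actual: np.ndarray, desired: np.ndarray) -> float:
    """E = sum over patterns p and output nodes j of |o_pj - d_pj|;
    both arrays have shape (patterns, output_nodes)."""
    return float(np.abs(actual - desired).sum())

# Example: three patterns, one output node, 0.95 / 0.05 targets.
actual = np.array([[0.90], [0.12], [0.80]])
desired = np.array([[0.95], [0.05], [0.95]])
print(training_error(actual, desired))   # 0.05 + 0.07 + 0.15 -> approx. 0.27
```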
18. The method of claim 1 further comprising the steps of: determining whether said training is converging; and if it is determined that said training is not converging, modifying the training so as to cause an instability in training to occur.
19. The method of claim 18 wherein said training is characterized by a learning rate and wherein said training modifying step comprises increasing the learning rate so as to cause an instability in training to occur.
20. The method of claim 19 further comprising the step of resuming training at a reduced training rate after training for a preselected period of time with the increased learning rate.
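Claims 18 through 20 describe deliberately destabilizing a stalled training run by raising the learning rate, then resuming at a reduced rate. A minimal sketch of that control logic, with all thresholds and factors as illustrative assumptions:

```python
def train_with_instability(step, error, lr=0.1, max_iter=5000,
                           window=50, tol=1e-4, kick=10.0,
                           kick_iters=25, settle=0.05):
    """Convergence watchdog per claims 18-20.  `step(lr)` applies one
    weight update; `error()` returns the current training error.
    If the error stops improving over `window` iterations, train
    briefly at an increased rate to cause an instability (claim 19),
    then resume at a reduced rate (claim 20).  All constants here
    are illustrative assumptions."""
    history = []
    i = 0
    while i < max_iter:
        step(lr)
        history.append(error())
        i += 1
        recent = history[-window:]
        if len(recent) == window and max(recent) - min(recent) < tol:
            for _ in range(kick_iters):   # deliberately destabilize
                step(lr * kick)
                i += 1
            lr *= settle                  # resume at a reduced rate
            history.clear()               # restart the convergence check
    return lr
```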
21. An apparatus for training a neural network having an output layer and at least one middle layer comprising one or more internal nodes, each of which is characterized by a node activation function having a gain, the apparatus comprising:
means for setting the gain on at least some of the internal nodes equal to an initial gain value;
means for training the neural network starting with the initial gain value; and
means for changing the gain on at least some of the internal nodes during training, said gain change being in a direction which increases the sensitivity of the neural network.
22. The apparatus of claim 21 wherein said training is characterized by a learning rate and wherein said apparatus further comprises means for decreasing said learning rate during training.
23. The apparatus of claim 22 wherein said gain changing means and said decreasing means cooperate so as to cause said gain and said learning rate to change simultaneously.
24. The apparatus of claim 21 further comprising means for setting the gain of each output node to a fixed value before said training means begins training.
25. The apparatus of claim 21 wherein the internal nodes are each characterized by a sigmoid-like activation function.
26. The apparatus of claim 25 wherein the internal nodes are each characterized by an activation function f(x) of the following form:

$$f(x) = \frac{\alpha}{1 + e^{-\beta x}} - \delta$$

wherein α and δ are constants, β is a gain variable, and x is a node input.
27. The apparatus of claim 26 wherein said gain changing means changes the gain by increasing β.
28. The apparatus of claim 21 further comprising means for computing an error for the neural network after said gain has reached a final gain, said error indicating how well said neural network has been trained.
29. The apparatus of claim 28 further comprising: means for adding an additional node to one of said internal layers if said error exceeds a predetermined threshold; and means for causing said training means to retrain said neural network after said additional node has been added.
30. The apparatus of claim 28 wherein said training means performs supervised training using a training set made up of members for which corresponding desired outputs are known, and wherein said error is a measure of how far the desired outputs for the members of the training set are from actual outputs generated by applying the members of the training set to the neural network.
31. The apparatus of claim 28 wherein said error is computed in accordance with the following equation:

$$E = \sum_{p} \sum_{j} \left| o_{j}^{p} - d_{j}^{p} \right|$$

where p is an index identifying a member of the training set; j is an index identifying an output node; $o_j^p$ is the actual output of output node j for the pth member of the training set; and $d_j^p$ is the desired output of output node j for the pth member of the training set.
32. The apparatus of claim 21 further comprising: means for determining whether said training is converging; and means for modifying the training so as to cause an instability in training to occur if said determining means determines that said training is not converging.
33. The apparatus of claim 32 wherein said training is characterized by a learning rate and wherein said modifying means causes an instability in training to occur by increasing the learning rate.
34. The apparatus of claim 33 further comprising means for causing said training means to resume training at a reduced training rate after said training means has performed training for a preselected period of time with the increased learning rate.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US67922591A | 1991-04-02 | 1991-04-02 | |
| US679,225 | 1991-04-02 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO1992017849A1 true WO1992017849A1 (en) | 1992-10-15 |
Family
ID=24726072
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US1992/002796 (WO1992017849A1, Ceased) | Automatic design of signal processors using neural networks | 1991-04-02 | 1992-04-01 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO1992017849A1 (en) |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5033006A (en) * | 1989-03-13 | 1991-07-16 | Sharp Kabushiki Kaisha | Self-extending neural-network |
Non-Patent Citations (2)
| Title |
|---|
| RUMELHART et al., "Learning Internal Representations by Error Propagation", PARALLEL DISTRIBUTED PROCESSING, Volume 1, Foundations, MIT Press, 1986. * |
| VOGL et al., "Accelerating the Convergence of the Back-Propagation Method", BIOLOGICAL CYBERNETICS, SPRINGER-VERLAG, 1988, Pages 250, 259, 260. * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2001020364A1 (en) * | 1999-09-10 | 2001-03-22 | Henning Trappe | Method for processing seismic measured data with a neuronal network |
| US6725163B1 (en) | 1999-09-10 | 2004-04-20 | Henning Trappe | Method for processing seismic measured data with a neuronal network |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Pal et al. | Multilayer perceptron, fuzzy sets, and classification | |
| US6167390A (en) | Facet classification neural network | |
| Sutton et al. | Online learning with random representations. | |
| Murray et al. | Synaptic weight noise during multilayer perceptron training: fault tolerance and training improvements | |
| Maclin et al. | Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks | |
| Denoeux et al. | Initializing back propagation networks with prototypes | |
| Billings et al. | The determination of multivariable nonlinear models for dynamic systems using neural networks | |
| US5943661A (en) | Hybrid neural network classifier, systems and methods | |
| Yoon et al. | Training algorithm with incomplete data for feed-forward neural networks | |
| US5469530A (en) | Unsupervised training method for a neural net and a neural net classifier device | |
| US6965885B2 (en) | Self-organizing feature map with improved performance by non-monotonic variation of the learning rate | |
| Du et al. | Multilayer perceptrons: architecture and error backpropagation | |
| Lee et al. | A two-stage neural network approach for ARMA model identification with ESACF | |
| WO1992017849A1 (en) | Automatic design of signal processors using neural networks | |
| Moreno et al. | Efficient adaptive learning for classification tasks with binary units | |
| Kia et al. | Unsupervised clustering and centroid estimation using dynamic competitive learning | |
| Taheri et al. | Artificial neural networks | |
| WO1991002322A1 (en) | Pattern propagation neural network | |
| Karouia et al. | Performance analysis of a MLP weight initialization algorithm. | |
| Wann et al. | Clustering with unsupervised learning neural networks: a comparative study | |
| Hartono et al. | Adaptive neural network ensemble that learns from imperfect supervisor | |
| Owens et al. | A multi-output-layer perceptron | |
| Kim et al. | Pattern classification of vibration signatures using unsupervised artificial neural network | |
| de Paula Canuto | Combining neural networks and fuzzy logic for applications in character recognition | |
| Villalobos et al. | Learning Evaluation and Pruning Techniques |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): CA JP |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
| | 122 | Ep: pct application non-entry in european phase | |
| | NENP | Non-entry into the national phase | Ref country code: CA |