Expectation Sampling and Perception Control
The present invention (expectation sampling) samples incoming data in a way that selects the most relevant data and completes missing data based on information extracted from the previous data flow. The invention can advantageously be coupled to a control system and thereby act as an operation system (perception control).
Background for the invention and introduction to the invention
Data acquisition is an important element of handling information. In this process two generic problems are often encountered. One is the selection of relevant information from the plenitude of information that is received or offered. For example, one can here think of the information that is received every day through the media, including the Internet. The problem is also known in production systems, where a lot of information is often available, but where it is not known whether, or to what degree, the various accessible data are relevant. In surveys, the problem of selection is well known: which questions are really relevant to ask?
The present invention provides a way, called expectation sampling, to perform a selection that is far from random. Instead, the sampling is based on a learning process where previously encountered relations are used to select the next pattern of activity.
Another generic problem of data acquisition is missing data. Information may for various reasons reach a control system only in an incomplete form. A sensor could fail during a process, or some other cut in communication could occur. Often one would not like to stop the ongoing process, but rather have the system continue acting as well as possible under the given circumstances. The fact that information typically is obtained at different rates is another factor where it could be advantageous to speed up the next process at the expense of risking missing some data.
The present invention also completes missing data based on previously encountered relations. The invention can with advantage be coupled to a control system and thereby
act as an operation system that selects and completes information. We call this perception control.
Brief description of the invention
One aim of the present invention is to provide a method which provides an action corresponding to coming information. This aim implies the problem of providing an action corresponding to information either not being available, not yet being available, or a combination thereof. Thus, a method which meets this aim may in some sense utilise a system being able - or configured - to provide an action based on at least an expectation of coming information.
In one aspect of the present invention a method is provided utilising two separate systems: a perception system and an action system. In this aspect the perception system is configured to provide perceptions of coming information, that is, information that either is not yet available or has not yet been received by the action system, and the action system is configured to provide an action based on the perception of the coming information.
Thus, in one aspect the present invention relates to a method, preferably a control method, utilising a computerised perception system (1) and a computerised action system (2) which in common determine an action (a), preferably a control signal. Said perception system (1) establishes a perception (ψ) of coming information (S) based on expectations (ψ), and said action system (2) determines an action (a) corresponding to the perception (ψ) of the coming information (S) based on motivations (expectations of rewards), the method comprising: determining an expectation (ψ) of coming information (S) based on past perceptions (ψ) of past coming information (S); combining the expectation (ψ) and present incoming information (S) so as to establish a perception (ψ) of coming information (S), by use of the perception system (1) receiving said present incoming information (S) from an external environment (3); and determining an action (a) based on said perception (ψ) of the coming information (S) by use of the action system (2) receiving said perception (ψ).
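Purely by way of illustration, the control loop defined by these steps may be sketched as follows; the class and method names (PerceptionSystem, ActionSystem, environment.observe and so on) are hypothetical placeholders introduced only for this sketch and do not limit the invention.

```python
# Illustrative sketch of the loop described above. PerceptionSystem and ActionSystem
# are hypothetical placeholders; concrete realisations are given in the detailed description.

def perception_control_loop(perception_system, action_system, environment, n_steps):
    for t in range(n_steps):
        s = environment.observe()                         # present incoming information S (may be incomplete)
        expectation = perception_system.expect()          # expectation based on past perceptions
        psi = perception_system.perceive(s, expectation)  # perception: expectation combined with S
        a = action_system.act(psi)                        # action based on the perception and motivations
        r = environment.apply(a)                          # primary drive (reward/punishment) from the environment
        perception_system.learn(s, psi)                   # expectation learning (EL)
        action_system.learn(psi, r)                       # motivation learning (ML)
```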
In this context it should be noted that expectation and perception typically are vectors, and that when reference is made in the following to combinations of either expectation or perception, reference is made to combinations of the elements of the vectors, to combinations of vectors, or a combination thereof.
The term past perception indicates, correctly, that perceptions are formed as time evolves, that is, perceptions follow one after another in the time domain. Thus, formation of a present perception of coming information is later followed by formation of a new present perception, and at that time the former present perception of coming information has become a past perception of past incoming information.
It is often preferred, and very advantageous, that the perception system is configured to produce expectations (ψ) in a self-driven and self-organising activation process in order to provide the expectations on which the perceptions are to be based.
In preferred embodiments, the expectation (ψ) on which a perception is to be based is determined as a combination of past perceptions (ψ), such as a weighted sum of past perceptions (ψ). Furthermore, it is preferred that the perception (ψ) of coming information is based on a combination of expectation(s) (ψ) and incoming information (S).
According to a practical embodiment of the method according to the present invention it is preferred that the combination of expectation(s) (ψ) and incoming information (S) includes weighting the expectation(s) (ψ) and the incoming information (S) by the confidence levels for the expectation and adding up the weighted expectation(s) and incoming information (S). In such and other similar cases it may be preferred that the past perceptions (ψ) on which the expectation(s) (ψ) are based are stored in a memory, such as a short-term memory.
According to many of the practical embodiments of the invention the perception system utilises a neural network, but other computer means such as statistical means may advantageously be utilised in connection with the perception system.
In embodiments of the method according to the invention in which a neural network based perception system is utilised, such a system must, of course, be trained. In some situations the neural network is trained during use, and in such cases training of the perception system (1) preferably takes place when the difference between the perception (ψ) of coming information and a combination of expectation (ψ), perception (ψ) and incoming information (S) is not within a predefined limit. This ongoing learning may very advantageously be combined with an initial learning of the neural network.
In many very useful and important embodiments of the method according to the present invention, it is preferred that the information (S) on which perception(s) (ψ) and expectation(s) (ψ) are based is selected from a plenitude of information. Such selection is particularly useful, and in some cases even a must, for instance if the plenitude of information is so vast that the information on which perception and expectation are to be based is drowned by the presence of non-relevant information. In such and similar cases where selection is performed, it is often preferred that the selection is performed by the perception system (1), said selection preferably being performed based on previously encountered relations.
The opposite situation may also occur, that is, the information on which perception and expectation are to be based may be missing in part or in total. In such cases the method according to the present invention is configured to perform addition of information, and in particular embodiments it is preferred that the information (S) on which perception(s) (ψ) and expectation(s) (ψ) are based is supplemented with additional information. It is often preferred in such cases that the additional information is provided based on previously encountered relations.
It should be noted that these two measures may very advantageously be combined in one embodiment. This may be utilised, for instance, in a situation where the information on which the perception and expectation are to be based is extracted from a plenitude of information and where this plenitude of information misses parts of the information on which the perception and expectation are to be based. In this case, the perception system extracts the information available from the plenitude of information and adds additional information to the extracted information, so that expectation and perception may be provided.
Based on the perception, the action system determines an action. According to many important embodiments of the present invention the action (a) determined by the action
system (2) is determined based on a combination of perceptions (ψ) of coming information, said perceptions (ψ) being determined by the perception system (1).
Preferably, the action (a) determined by the action system (2) is determined based on a combination of motivation (V) dependent weights (w) and the perceptions (ψ) of coming information determined by the perception system (1). In an important embodiment of the invention the motivation dependent weights (w) represent tendency levels to perform certain actions.
In a number of practical and important embodiments of the invention, the combination is a summation in which the motivation dependent weights (w) are multiplied on the perceptions (ψ) of coming information.
Preferably, the motivations (V) are formed as a combination of perceptions (ψ), preferably as a weighted sum of perceptions (ψ), of the coming information. In such cases it is furthermore preferred that the action system forms motivations (V) in an ongoing process. This ongoing formation may, of course, also be utilised in other situations.
In order to enable the action system to investigate the landscape of actions (the possible actions) it is preferred to incorporate noise in the system. This is preferably implemented by configuring the action system so that actions (a) determined by the action system (2) comprise noise. Preferably, a random action is selected with probability ε being related to or being the level of the noise.
Preferably, the action system (2) also utilises a neural network. In such cases it may be preferred that the neural network is trained by motivation learning, said motivation (V) being realised in the form of expectations of receiving rewards or punishments.
In embodiments utilising motivation learning it may be preferred that the motivation learning is conducted in case a measure of how well the motivation anticipates a reward is not within a pre-defined limit.
In a very important embodiment of the method according to the present invention, hardwired needs or goals act as a primary drive (r) in developing the motivation (V) on which actions (a) are based.
The invention relates in another aspect to a system configured to carry out the method according to the present invention. In this aspect, the system comprises at least means embodying the perception system and the action system. Furthermore, these means or additional means are configured to carry out any of the features of the method according to the present invention.
According to another aspect, the present invention relates to a data processing system such as a computer system for performing a control method, said data processing system comprises a computerised perception system and a computerised action system. The systems according to the present invention comprise processor means, such as one or more electronic processing unit(s), for processing data and storage means, such as RAM, ROM, PROM and/or EPROM, for storing data, and these systems determine in common by use of the processor means and the storage means an action (a), preferably being a control signal. Furthermore, the perception system (1) establishes a perception (ψ) of coming information (S) based on expectations (ψ) and said action system (2) determines an action (a) corresponding to the perception (ψ) of the coming information (S) based on motivations (expectations of rewards), which establishment and determination are also done by use of processor means and storage means similar to the ones listed above.
In particular, the data processing system preferably comprises: means for determining an expectation (ψ) of coming information (S) based on past perceptions (ψ) of past coming information (S); means for combining the expectation (ψ) and present incoming information (S) so as to establish a perception (ψ) of coming information (S), by use of the perception system (1) receiving said present incoming information (S) from an external environment (3); and means for determining an action (a) based on said perception (ψ) of the coming information (S) by use of the action system (2) receiving said perception (ψ).
These means are also of the types listed above, i.e. processor means, such as one or more electronic processing unit(s), and storage means, such as RAM, ROM, PROM and/or EPROM.
Thus, a data processing system, such as a computer system, is considered within the scope of the present invention, and such a processing system accordingly comprises computer means, such as the ones listed above, for executing one or more of the steps of the control method according to the present invention.
In the following, the invention and in particular preferred embodiments will be presented. This description comprises examples of preferred embodiments which are explained with reference to the figures accompanying the description.
Figure 1 shows schematically a perception control system.
Figure 2 shows schematically a perception system and an example of Expectation Learning (EL). The new perception is a combination of both expectation and the sensory input from the environment. The previous perceptions form an expectation, Ψ', which is a prediction of the next perception to be formed. This is then compared to the actual perception formed, which is a weighted combination of the previous perceptions and the externally provided sensory input. Should the perception differ from the expected, the weights u and confidence measure β are adjusted.
Figure 3 shows Appetitive Conditioning and Extinction. The stimulus configuration is: Trials 0-100: "bell" followed by "food"; Trials 101-: "bell" alone to test for conditioning. Example of a simulation with a "bell" input representing the CS, and a "food" input representing the US. The CS was activated prior to the US. The US stops (with the CS continuing) after 100 trials in this case. As can be seen, the expectation of "food" remains. This causes the perception of "food" to be activated until the 185th trial, 85 trials after the actual food input was last present. At this point extinction becomes effective.
Figure 4 shows Blocking. The stimulus configuration is: Trials 0-100: "bell" followed by "food", Trials 101-150: "bell" and "light" followed by "food", Trials 151-200 trials: "light" alone to test for conditioning, Trials 201 - : "bell" alone to test for conditioning.
During the testing of the "light" conditioning, no expectation of "food" is evoked, and thus no perception of "food" is formed. The conditioning of the "light" has been blocked by the prior conditioning of the "bell". The confidence level, βfood, for food remains high, as no discrepancy between the expectation and perception has occurred. When the "bell"
conditioning is tested, this confidence decreases until the perception again reflects the sensory input.
Figure 5 shows Motivation Learning (ML). Action system: The motivation V formed provides predictions of future "rewards". Comparing with the actual reward received, the weights v and the tendency weights w may be modified.
Figure 6 shows an example of use of Instrumental Reinforcement of the present invention. (a,b) Positive reinforcement: A is the steady-state percentage of time the rewarded action is taken. (c,d) Aversive reinforcement: A is the steady-state percentage of time the punished action is taken. (Note the different vertical scale in these two figures). (e,f) Negative reinforcement: A is the steady-state percentage of time the relieving action is taken. (a,c,e) A versus the number of actions, NA. Results for three different noise levels (ε = 0, ε = 0.5, and ε = 1) are shown. The line shows the unrewarded behaviour of A, A = 100/NA, which is encountered for ε = 1.
(b,d,f) A versus noise level ε. Results for three different values of NA (2, 4, and 10) are shown. The straight lines show the unrewarded levels of
A in the 3 cases (50, 25, and 10 respectively). When the action is completely randomly selected (ε = 1), the results coincide with the unrewarded case.
(a-f) The error bars show the standard deviation and average of the sample of 100 trials of 2000 time steps, measured after a transitory period (arbitrarily taken to be
500 time steps).
Figure 7 shows Transitory Behaviour: Example of the transitory behaviour during a single trial for a positively rewarded system with 10 possible actions. This case is for no noise, ε = 0. The rewarded action is continuously chosen after an initial period, defined by the amount of time taken for the rewarded action first to be selected in a random fashion.
Figure 8 shows The Law of Effect: A more highly rewarded action is chosen at the expense of a lower rewarded action.
(a,c) Closed circles: Action A, rewarded with +100 once the time is greater than 50. Open circles: Action B, rewarded with +1. The data show the percentage of time each action is selected, A, versus time (in time steps). The points show the average of 100 trials of 10,000 time steps, measured after a transitory period (arbitrarily taken to be 500 time steps). (b,d) Evolution of the motivation V. The points show the average and standard deviation of 100 trials.
(a,b) Noise level ε = 0.1. (c,d) Noise level ε = 0.05.
Figure 9 shows the perception I × a, illustrated for controlling a cart-pole system. In this case, four components of information may be received, and only two actions can be taken.
Figure 10 shows an example of the perception system.
Figure 11 shows the cart-pole task.
Figure 12 shows results for the cart-pole task. Learning times are shown for the three simple systems and when using perception control (κL = 0, 0.01, 0.05, 0.2, 0.3, 0.5, 0.9 and 1.0). The "random" system and the "extra unit" system have the worst performance. The "same as last" system performs well, but is clearly worse than perception control, marked by the value of the parameter κL. Each point is a mean over 40 runs. The maximum number of trials given is 5000.
Figure 13 shows the Mountain-Car task. On the steep sides, gravity is stronger than the motor.
Figure 14 shows results for the mountain-car task. Average time (steps) per trial plotted versus probability p for the three simple systems and for perception control with κL = 0, 0.01, 0.05, 0.2, 0.3, 0.5, 0.9 and 1.0. Each point is a mean of 30 runs. The maximum number of time steps allowed is 100,000.
Figure 15 shows average number of received components per trial. The same systems as in Figure 14. Each point is a mean of 30 runs. Note the logarithmic vertical axis.
Figure 16 shows the timing problem, which consists of eleven states. At state seven the agent is rewarded (r=+3) for being active. At all other states activity is punished (r=-1). At all states there is a general small punishment (r=-0.05) for being passive.
Figure 17 shows results for the timing problem. The fraction of maximum possible return the control system obtains for the three simple systems and for the perception control system (κL = 0, 0.01, 0.05, 0.2, 0.3, 0.5, 0.9 and 1.0). Each point is a mean of 40 runs.
Detailed description of preferred embodiments of the invention
The present invention is based on ideas from human perception. It is well known that human perception is not always consistent with reality. General phenomena such as "we see what we expect to see", "selective perception", the "halo effect", and "self-fulfilling prophecies" are all accepted concepts in the description of how our perception may be responsible for erroneous and inexpedient behaviour. However, despite the strong focus on the disadvantages of perception, there are important advantages as well. For example, our perception selects information from a continuously incoming, highly complex information stream in a way that, for instance, allows us to separate even a weak voice in a noisy environment.
Another, equally important feature of human perception is its ability to complete information that the sensory system has not provided, but is needed in order to act adequately. Through our senses, one often receives only partial information, for instance a few words of a sentence. Still, one may be able to form an expectation of what was said, or even reconstruct the sentence.
The present invention involves a "perception" system that selects and completes information from an external environment with the aim of controlling it (Figure 1). The success or failure of control may typically be described in terms of success or failure criteria that serve as a "motivational" drive for the actions taken. The computational approach used
may for instance be reinforcement learning [see e.g. R.S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction", MIT Press 1998], which we shall use in the specific applications described below. The success/failure criteria are expressed in terms of "rewards"/"punishments", and "motivation" is realized in the form of "expectations" of receiving rewards and punishments. The motivation strongly influences the action tendencies, thus evoking the selection of appropriate control actions.
Actions are not given by a simple mapping of the perception, but are influenced by motivation (expectations of rewards). In the same way, perception is not a simple mapping of the input, but is influenced by expectations of incoming information. In this invention the latter type of expectations is formed in an ongoing process of one activity pattern initiating another. Below, specific realizations of an expectation-based "perception system" are introduced and coupled to a motivation-based "action system". The full perception control system is applied to a number of generic control tasks, for which we exemplify the information-completing feature.
In the learning methods used in the present invention, expectations form the basis for learning, which is assumed to take place exactly when, and only when, expectations are wrong. The learning algorithm associated with motivation is denoted motivation learning (ML), an essential characteristic being that learning is driven by a basic reward system. An important ingredient is the existence of "hard-wired" needs or goals that act as a primary drive in the development of motivation. Motivation is valuable because rewards and punishments are not always instantly given, or may be intermittent in nature.
The other, equally important and essential feature of this invention is the expectation of incoming information. These expectations are not related to any predefined drive, but are produced in a self-driven and self-organizing activation process, where the activity propagating through the connections from one activity pattern initiates another. The learning method basic to the invention and associated with these expectations is denoted expectation learning (EL) to distinguish it from the reward-driven ML method. The expectations are valuable because the information on which action is taken is not always complete, or may be "polluted" by irrelevant information.
While ML may be considered the effective mechanism in selecting actions - in the development of performance, EL is the effective mechanism in selecting activity patterns
- in the development of perception. In some cases, for example for attaining control, one may focus on the reward-driven action system as the central system. In other cases, however, for example for classification or recognition, one may with advantage focus on the perception system as the central system. In general though, both systems should be involved. This is shown diagrammatically in Figure 1.
The basic test of learning is conditioning. As is the case for expectations, there are two distinct types of conditioning tasks, one is called classical conditioning, and the other is called instrumental conditioning. Instrumental conditioning concerns a reward- or punishment-driven change in performance. Below, this is specified in the context of the action system and the ML method, applying a reinforcement learning algorithm.
Classical conditioning concerns the pairing of inputs, where the action associated with a certain input is given (the unconditioned stimulus). The reward system is irrelevant for classical conditioning. Thus, while ML is the relevant learning method for instrumental conditioning, EL is the appropriate method for classical conditioning, as elaborated further below.
The perception system
The irradiation-driven dynamics introduced in this invention and tested for classical conditioning does not involve a reward system. By separating the action system from the perception system we allow the action system to be optimized to the (relatively) easier task of associating the internal representation, or the "perception", to the action. The expectation dynamics, on the other hand, processes associations between different stimuli to form the perception. These two association processes may happen at very different time scales.
In the expectation approach introduced here, the particular action called the unconditioned response is by definition connected strongly to the perception of the input called the unconditioned stimulus. The conditioned stimulus perception is in the expectation framework not directly associated with the unconditioned response. Rather, the conditioned stimulus evokes the perception of the unconditioned stimulus thus producing the unconditioned response.
In the perception system, a memory structure of the perception may be introduced, whereby the perception is "remembered" for a given time period - a type of "short-term memory". In this case, an internal history of the perceived environment can be built up, and then used to attempt to attain predictions for the near-future environment. This type of bootstrapping of the perception can be regarded as a type of internal classical conditioning. Formally, perception is here described by a vector, Ψ, in general the perception of the last τ timesteps, τ being the duration or size of the short-term memory. This size may be assumed fixed, so that a constant historical snapshot of what has happened is maintained. As time increases, the perceptions are moved systematically back in the short-term memory, and new perceptions for the current time are added (fig. A).
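By way of illustration, such a short-term memory may be realised as a fixed-length buffer of the perceptions of the last τ time steps; the sketch below (with an assumed memory length tau) is only one possible implementation.

```python
from collections import deque
import numpy as np

class ShortTermMemory:
    """Fixed-size buffer holding the perceptions of the last tau time steps."""

    def __init__(self, tau, n_units):
        self.buffer = deque(maxlen=tau)   # the oldest perception is dropped automatically
        self.n_units = n_units

    def push(self, psi):
        self.buffer.append(np.asarray(psi, dtype=float))  # newest perception enters, older ones shift back

    def contents(self):
        return list(self.buffer)          # perceptions ordered from oldest to newest
```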
The fundamental characteristic of the perception system of the invention is that it produces a vector Ψ', the expectation, which in a simple form is given by a weighted sum of past perceptions,

Ψ'i(t+1) = Σj uij Ψj.
Here Ψj runs over the entire short-term memory of the perception (fig. B). The basic point of expectation is that it strongly influences the perception of a sensory input. In a specific realization, the new perception may be given in the following form:
Ψi(t+1) = fWTA[βi Ψ'i + (1 - βi) si],
where s is the input, and β is a weighting vector, weighting the expectation against the input. The weighting factors βi give the confidence levels for the expectation. A process fWTA called "winners-take-all" is here applied, whereby the entries with the largest magnitude, the winners, are set to unit value and the rest are set to zero. The total "activity" (number of winners) is assumed to be fixed.
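A minimal numerical sketch of the perception update above, assuming binary perception vectors, an element-wise confidence vector β and a fixed number of winners (the function names are illustrative only):

```python
import numpy as np

def winners_take_all(x, n_winners):
    """Set the n_winners entries of largest magnitude to 1 and the rest to 0."""
    out = np.zeros_like(x, dtype=float)
    out[np.argsort(-np.abs(x))[:n_winners]] = 1.0
    return out

def form_perception(expectation, s, beta, n_winners):
    """Psi(t+1) = fWTA[beta * Psi' + (1 - beta) * s], with element-wise confidence beta."""
    return winners_take_all(beta * expectation + (1.0 - beta) * s, n_winners)
```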
If si and Ψ'i are interchangeable in the equation above, we say that the perception is as expected. When this is not the case, expectation learning takes place, and the weights u as well as the confidence measure β are modified so that the expectation more correctly predicts the input. This is achieved by comparing the expectation with the actual new perception state, adapting the weights u only and exactly when the two do not match, for example

Δuij = αu δi (Ψj - uij δi),

where j covers the entire short-term memory of the perception, αu is a constant defining the adaptation rate of the weights u, and

δ = Ψ(t+1) - fWTA[Ψ'(t+1)]

is the difference between the actual perception and the perception resulting from expectation alone. In this way, the expectation system adapts to incoming input. The term (Ψj - uij δi) above is only introduced to limit the values of uij between -1 and 1.
The confidence level βi is here taken to be the average of the fraction of times the expectation element Ψ'i matches the actual sensory input si. This may also be capped, so that, for instance, βi runs between 0.1 and 0.9 (which is used in the examples presented), rather than between 0 and 1, thus allowing the input and the expectation to always have some role in the determination of the perception. The perception system is shown in Figure 2.
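One possible realisation of the expectation-learning step and of the confidence update is sketched below. The adaptation rates alpha_u and beta_rate are assumed parameters, and the exponential running average used for the confidence is one convenient approximation of the averaging described above.

```python
import numpy as np

def expectation_learning_step(u, beta, psi_new, psi_from_expectation, memory_vec, s,
                              alpha_u=0.1, beta_rate=0.05):
    """One expectation-learning (EL) step.

    u                    : weight matrix, shape (n_units, len(memory_vec))
    beta                 : confidence vector, one entry per perception unit
    psi_new              : perception actually formed at t+1
    psi_from_expectation : perception that expectation alone would have produced
    memory_vec           : flattened short-term memory of past perceptions
    alpha_u, beta_rate   : assumed adaptation rates (not specified numerically in the text)
    """
    delta = psi_new - psi_from_expectation   # mismatch; zero when the perception is "as expected"
    # Bounded update: drives u_ij towards delta_i * psi_j while keeping |u_ij| <= 1.
    u += alpha_u * (np.outer(delta, memory_vec) - u * (delta ** 2)[:, None])
    # Confidence: running estimate of how often the expectation matched the sensory input,
    # capped to [0.1, 0.9] so neither input nor expectation is ever ignored completely.
    match = (psi_from_expectation == s).astype(float)
    beta[:] = np.clip((1.0 - beta_rate) * beta + beta_rate * match, 0.1, 0.9)
    return u, beta
```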
Basic examples showing use of the perception system
We consider classical conditioning by expectation learning. Classical conditioning involves the association of stimuli, with, at its basis, the pairing of two stimuli. One is a conditioned stimulus (CS), which has no overt, or only weak responses unrelated to the response that will eventually be learned. The other is the unconditioned stimulus (US), which always produces an overt response. This response is "innate" and is denoted the unconditioned response (UR). In the expectation framework of classical conditioning, the connection (w) between the perception of an US and the following UR is inherently hardwired.
Classical conditioning has three basic ingredients: activation, inhibition and irradiation. The expectation dynamics considered here embodies these ingredients, including the cases outlined below.
Basic classical conditioning: After repeated presentations of the CS followed by the US, the CS elicits a response (CR, the conditioned response), which resembles the UR.
Depending on the US, the conditioning is named appetitive (when the US is rewarding) or defensive (in the case of a punished US). Extinction: When a CS is repeatedly presented without subsequent presentation of the
US, the UR decays.
Blocking: Two stimuli, A and B, are presented and paired with a US. First, A is presented followed by the US, causing A to become a CS. Then the combination of A and B is presented followed by the US. In this case, the conditioning of stimulus B is blocked by the prior conditioning of A, and B will not become a CS.
Stimulus generalization (phobias): If a new stimulus, C, is presented, where C possesses similarities to an existing CS, then the UR is elicited.
Discrimination: Repeatedly, a CS is presented with subsequent presentation of the US, while a stimulus C similar to the CS is presented without subsequent presentation of the US. Then discrimination between C and the CS will appear where only the CS elicits the
UR.
Conditioned inhibition: A new stimulus D is repeatedly presented together with a CS without subsequent presentation of the US. Afterwards, when D is presented together with another CS, the UR may not appear. To illustrate the behavior of the perception system, some examples are analyzed in detail below. The size of the "short-term memory" was taken to be 3 time steps, but this is not crucial for the conditioning phenomena discussed.
First, consider the case (i) of appetitive classical conditioning. A "bell" input represents the new stimulus to be conditioned, while a "food" input represents the US causing the UR, "salivation". (The case of defensive conditioning is obtained simply by replacing "food" with "electric shock" and "salivation" with "leg withdrawal".) The expectation dynamics activates the connection between the "bell" and the perception of "food". When "food" fails to appear, the activated connection produces the perception of "food", and "salivation" occurs.
The experiment is illustrated in Figure 3. The "bell" is presented with subsequent presentation of "food" for 100 times. The actual number of presentations is here chosen arbitrarily and could as well be 10. After the conditioning period, the "bell" is presented without subsequent presentation of "food". However, for a period the expectation of "food"
still dominates the perception Ψ, which then causes "salivation". The confidence level, βfood, for the expectation of "food" is also shown in Figure 3. After a small number of presentations the confidence has reached its maximum level of 0.9. In the period where the conditioning is tested, βfood decreases until it reaches a level where expectation no longer dominates the perception - extinction has taken place [case (ii)].
Our model also embodies blocking [case (iii)], where a new stimulus, say a "light", is presented together with a CS, the "bell", subsequently followed by presentation of the US, the "food". The expected perception of the US appears, and therefore no learning takes place - nothing is changed. Thus, the "light" will not be conditioned.
The experiment is illustrated in Figure 4. As above, the "bell" is first presented with subsequent presentation of "food" for 100 times. Thereafter, the "bell" is presented together with the "light" with subsequent presentation of "food" for 50 times. Then the "food" is permanently removed from the sequence, and the "light" alone is presented for 50 times. It can be seen that the system does not form an expectation of "food" in this case, and hence the unconditioned response is not observed. Finally, the conditioned response of the "bell" is tested. As observed, the bell evokes an expectation of the "food", overriding the actual input in the perception, and causing the UR "salivation".
The other conditioning phenomena described above are also embodied using the expectation dynamics of the invention. In general, if a stimulus is similar to (overlapping with) a CS, it will give rise to the UR [case (iv)]. Indeed when an experiment was conducted with a sliding window of inputs, changing from one CS to another, a hysteresis effect was observed. The current CS was expected beyond the time of its presentation delaying recognition of the subsequent stimuli.
When a new stimulus is presented that possesses similarities to an existing CS, without subsequent presentation of the US, the network will learn to discriminate between the two stimuli [case (v)]. Initially, the system will form the expectation of the existing CS.
However, as a part of the sensory input does not match the CS, the expectation does not coincide with the perception. Thus, the confidence level of the expectation decreases, and discrimination emerges.
When a new stimulus is presented together with a CS without subsequent presentation of the expected US [case (vi)], both the connections from the new stimulus and the CS will be inhibited. This is embodied by strengthening connections to other stimuli. If the new stimulus is then presented together with another CS, the UR will not appear.
The action system
In instrumental conditioning a reward- or punishment-driven change in performance is tested. Here instrumental conditioning is tested in the context of the reward-driven motivation approach, applying a reinforcement learning algorithm. Alternative implementations could also be considered. An important characteristic of the motivation system is the existence of innate "hard-wired" needs or goals that act as a primary drive in the development of motivation. This drive may formally be introduced through a scalar r, the value of which represents an accomplished success (r positive) or failure (r negative) in fulfilling the needs or reaching the goals set.
Besides the reward drive, the action system is characterized by the ongoing formation of another scalar, the motivation V, which in a simple form is a weighted sum of the perception,

V(t) = Σj vj Ψj(t).
The basic point of motivation is its strong influence on the performance, i.e. on the actions a taken. Actions are formed based on the perception Ψ. In a specific realization, the action a is given by the following form,

ai(t) = fWTA[Σj wij Ψj(t)],
where w are motivation dependent weights, representing the tendency levels to perform certain actions in given situations (fig. C). The number of "winners" in the "winners-take- all" process sets the total action activity.
Noise can be introduced, and is in the examples below. At each timestep, a random action (rather than the action above) is selected with probability ε (noise level). Non-zero
noise allows the system to "explore" and take actions that it would not have done in the deterministic case of zero noise.
A certain level of noise may be essential. If the system is not allowed to "explore", more highly rewarded actions (or the possibility of avoidance of punishment) may remain undiscovered. When there is a non-zero noise level, the system can indeed discover more highly rewarded actions within the system, and act accordingly.
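A sketch of this noisy action selection, assuming a single winner: with probability epsilon a random action is taken, otherwise the action with the largest tendency w·Ψ.

```python
import numpy as np

def select_action(w, psi, epsilon, rng=None):
    """Return a one-hot action vector: a random action with probability epsilon,
    otherwise the action with the largest tendency (w @ psi)."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.zeros(w.shape[0])
    if rng.random() < epsilon:              # noise: explore a random action
        a[rng.integers(w.shape[0])] = 1.0
    else:                                   # exploit the highest action tendency
        a[np.argmax(w @ psi)] = 1.0
    return a
```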
A measure of how well the motivation anticipates the reward is given by
R = r + γV(t) - V(t-1),
where r is the reward drive, and the factor γ is a discount rate, setting the rate by which future rewards are discounted. When R is not zero, motivation learning (reinforcement learning) takes place, and the weights v as well as the action tendency weights w are modified so that the motivation more correctly predicts the (discounted) rewards received.
The motivation is modified adapting the weights v,
Δvj = αvRΨj ,
where αv is the motivation adaptation rate. The tendency weights w are modified according to
Δwij = αwReij,
where αw is the adaptation rate. eij is called the eligibility of the pathway between i and j. This is an exponentially decaying factor, set to unit value upon the coincidence of Ψj and ai being unity,

eij(t+1) = 1 if Ψj(t) = ai(t) = 1, and eij(t+1) = λw eij(t) otherwise,

where λw < 1.
The decay time (τw = -1/ln λw) is preferably much larger than the short-term memory time, τ. The concept of eligibility is an integral part of reinforcement learning algorithms (see e.g. R.S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction", MIT Press 1998).
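A compact sketch of one motivation-learning step, combining the quantities R, v, w and the eligibilities e defined above; the default parameter values are those used in the examples below, and binary perception and action units are assumed.

```python
import numpy as np

def motivation_learning_step(v, w, e, psi, a, r, V_prev,
                             alpha_v=0.01, alpha_w=0.5, lam_w=0.9, gamma=0.95):
    """One motivation-learning (ML) step; parameter values as in the examples below."""
    V = float(v @ psi)                  # motivation: weighted sum of the perception
    R = r + gamma * V - V_prev          # how well the motivation anticipated the reward
    v += alpha_v * R * psi              # adapt the motivation weights
    w += alpha_w * R * e                # adapt the action-tendency weights via the eligibilities
    e *= lam_w                          # eligibilities decay exponentially ...
    e[e < 1e-5] = 0.0                   # ... with the cutoff used in the examples
    e[np.outer(a, psi) == 1.0] = 1.0    # ... and are set to 1 when a_i and psi_j coincide (binary units)
    return v, w, e, V
```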
The action system is shown in Figure 5.
Basic examples showing use of the action system
Instrumental conditioning involves reward-based emission of specific actions, i.e. the increase and decrease in the tendency to perform specific actions. The motivation dynamics considered in this invention models instrumental conditioning in various aspects, including four broad cases:
Positive reinforcement: A certain action is rewarded. As a result, the rewarded action is observed more frequently.
Aversive reinforcement: A certain action is punished. As a result, the punished action is observed less frequently.
Negative reinforcement: A certain action relieves punishment. As a result, the relieving action is observed more frequently.
The law of effect: A more highly rewarded action is chosen at the expense of a less rewarded action.
Consider an action system with a tunable number NA of actions. The reward scenarios are discussed below in the presence of different noise levels, and for multiple trials for each value of NA. All of the examples use the following learning parameters and rates: αv = 0.01, αw = 0.5, λw = 0.9, γ = 0.95, and a cutoff value of 0.00001 for the eligibility, below which it is set to zero.
In the first two cases (i) and (ii), a specific action is rewarded (r = 1) or punished (r = -1). The percentage of time that the specific action is taken, A, is illustrated as a function of the number of actions, NA, and as a function of the noise level ε (the level of random choice of action). In the case of no rewards, we have
A = 100/NA,
which is the expected outcome when random actions are taken - no preferred actions exist. For the case of positive reinforcement [case (i)], the results are shown in Figure 6 (a) and (b).
A positively rewarded action is selected in a much higher proportion of the time than unrewarded actions. Indeed, once the positively rewarded action has been selected among the available actions (at first randomly before learning occurs), then the positively rewarded action will continually be taken (see Figure 7), and others only taken due to the presence of noise.
As the noise level ε increases, the behavior of the rewarded system moves towards that of the unrewarded system. At ε = 0.5, the value of A decreased substantially from its maximum value, although the motivation system still clearly selects the rewarded action more frequently than the unrewarded one. When ε is increased to 1, the behavior is indistinguishable from the unrewarded case (see Figure 6 (b)).
How does it work? When input is incident, the perception is activated, and an action is produced. Initially, all the weights are set to zero, and so the action taken is selected randomly. In this case of positive reinforcement, no changes are made to the weights if the action is not rewarded. Once the rewarded action is selected, however, the weight between the perception and this action is strengthened, meaning that from this point onwards, that action is always selected when the perception is activated, apart from when it is overridden by the noise added in the action selection.
For the case of aversive reinforcement [case (ii)], the results are shown in Figure 6 (c) and (d). The punished action is taken at a much lower proportion of the time than in the unrewarded case. Indeed, once the punished action has been selected (randomly), this action is not selected again, except in the presence of noise. Again, as the noise level increases, the behavior moves towards that of the unrewarded system.
For negative reinforcement [case (iii)], the punishment is given (r= -1) whenever a particular action, the relieving action, is not selected. The relieving action receives no reward. The results for the percentage of time the relieving action is taken, A, is shown in Figure 6 (e) and (f). The results closely resemble those for positive reinforcement.
An important feature of instrumental conditioning is the tendency to follow more highly rewarded actions [case (iv)]. This was tested using a system with NA = 5 possible actions and the following sequence of rewards: for the first 50 time steps, only action B receives a reward of +1, and no other actions are rewarded. After 50 time steps, action A receives a reward of +100, action B still receiving a reward of only +1, and no other actions are rewarded.
The results are shown in Figure 8. As can be seen, it takes some time to swap to the more highly rewarded action A, but once this happens, action B is almost totally deselected. The amount of time it takes to change action from B to A increases with decreasing noise level as shown in Figure 8 (a) (ε=0.1) and in Figure 8 (c) (ε=0.05). The evolution of the motivation V for each of the examples above is shown in Figure 8 (b) and Figure 8 (d). The maximum value of V in the limit of zero noise is 2000, which is the A reward of 100 divided by the factor 1-γ (γ being the discount rate). Note that the actual change in behavior from B to A is more rapid than the growth of V, the latter roughly taking place over a time scale of 1/(αv(1-γ)) = 2000 time steps.
Perception control
The partnership of classical conditioning via expectation learning and instrumental conditioning via motivation learning provides a comprehensive method that embodies conditioning phenomena. The separability of instrumental and classical conditioning phenomena can utilize the strengths of the motivation dynamical protocols to encode instrumental conditioning between the perception and the action for the system, whilst allowing the expectation sampling to do the associative work between incoming information. The use of binary valued units in the perception provides an advantage in that the expectation sampling is learnt quickly. Whilst the system may employ a large number of computational units (in perception) leading to a large number of weights, the weights are not readjusted once the expectation and perception agree, and the connections to the actions are sparsely activated.
The fixed activity level requires that the perception be fully activated (i.e. a fixed number of "winners") at any one time step. This protocol allows for imperfect stimuli. If the system is missing or overloaded with sensory information from the environment, the
expectation and confidence form a filter with which prior experience (conditioning) helps to form the perception.
The current implementation of the expectation dynamics uses only positive expectation weights. This means that the strengths of the connections are only decreased relative to other weights (i.e. to decrease a particular connection, the others need to be higher relative to this one). The system is not directly inhibited by a particular connection; rather, inhibition is modeled by increasing the connection to other (competing) units.
The actions may only act upon the environment. Generally, however, the actions (or part of them) are also transferred to the perception itself - so that the perception on which the system acts is made up of both internal and external components. Both of these can then also be modified according to the expectation learning method.
In order to evaluate the performance of perception control, we shall compare with three simple systems coupled with the same action system, but where the perception system is replaced by a simple information completion mechanism. In the first system, the "random" system, missing components are assigned a random value within a reasonable interval. In the second system, the "same as last" system, a missing component is assigned the same value as last time it was available. In a continuously changing environment this value may often be close to correct, and a reasonable control action may be taken. We shall consider a timing task example where this is not the case. In the third system, the "extra unit" system, a special unit is assigned and activated if and only if some component is missing. One can think of the unit as a warning indicator: Some information is missing.
Consider an environment that provides information, s, which generally is sufficient for a motivation-based action system to achieve control of the environment. However, sometimes the information is missing an essential component - only a part of the information is received, and control is lost. The purpose of the perception system is in this case to provide sufficient information, ψ, to the action system, so that the ability to control is not lost. To achieve this, an expectation ψ' is formed that completes the information. Below we describe in detail how such an expectation may be formed in the present invention when the action information is transferred to the perception.
We assume that the information provided by the environment has components c, and that each component has as many elements as the number of values the component can take on. The set of elements in a component c is denoted Cc. Hence, an information vector I may be defined having elements Ii, where i ∈ Cc for some component c. Perception is introduced as a two-dimensional array of units that consists of as many units as the size of the information vector I times the number of possible control actions (Figure 9).
The perception is I × a, where a is the action vector. The action vector is zero in all places but the current control action,
aj(t) = 1 if the agent chose action j at time t, and aj(t) = 0 otherwise.
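For illustration, the one-hot action vector and the two-dimensional perception I × a may be formed as follows (a minimal sketch; the function names are illustrative):

```python
import numpy as np

def one_hot_action(j, n_actions):
    """Action vector: 1 at the chosen action j, 0 elsewhere."""
    a = np.zeros(n_actions)
    a[j] = 1.0
    return a

def perception_array(info, a):
    """Two-dimensional perception: the information vector I times the (one-hot) action vector a."""
    return np.outer(info, a)
```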
Figure 10 shows the perception system. The expectation ψ'(t) is formed as a weighted sum from the perception calculated at time t - 1. For all k ∈ Cc:

ψ'k(t) = 1 if k = argmax over k' ∈ Cc of Σij uijk' ψi(t-1) aj(t-1), and ψ'k(t) = 0 otherwise,
where ψi is given below, uijk are the weights, and argmax picks out the index associated with the maximal argument. Hence, for each component the expectation is 1 for the unit with the largest sum and 0 for the rest of the units. In the example illustrated in Figure 9 (the cart-pole system, see Figure 11), there are four components of information. The expectation is a vector of 15 elements; four of these are 1 and the rest are 0.
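A sketch of this per-component winner-take-all expectation, assuming the weights are stored as a three-dimensional array u[i, j, k] and the components Cc are given as lists of element indices:

```python
import numpy as np

def form_expectation(u, psi_prev, a_prev, components):
    """Per-component winner-take-all expectation.

    u          : weights, shape (n_info, n_actions, n_info), indexed u[i, j, k]
    psi_prev   : perception elements psi_i(t-1)
    a_prev     : one-hot action vector a_j(t-1)
    components : list of index arrays, one per information component Cc
    """
    drive = np.einsum('ijk,i,j->k', u, psi_prev, a_prev)   # sum_ij u_ijk psi_i(t-1) a_j(t-1)
    expectation = np.zeros_like(drive)
    for idx in components:
        expectation[idx[np.argmax(drive[idx])]] = 1.0       # one winner per component
    return expectation
```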
The expectation obtained from the perception in turn influences the next perception,
ψi(t) = κ ψ'i(t) if the component that i belongs to is missing, and ψi(t) = si(t) otherwise,
i.e. the expectation is recurrent by nature. The factor κ ∈ [0;1] determines the confidence in earlier expectations when forming the perception. If κ = 0, the perception only involves information actually received. In the examples given below κ = 0.5 has been used. The expectation ψ' is also used for completing the received information. The i-th element of the completed information passed to the action system at time step t is

ψ'i(t) if the component that i belongs to is missing, and si(t) otherwise.
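The recurrent perception and the completion of the received information may, for illustration, be implemented as below; since the equation for the completed information is only given in fragmentary form above, the second branch is an assumption.

```python
import numpy as np

def complete_information(s, expectation, missing, kappa=0.5):
    """Recurrent perception and completed information.

    s           : received information (entries of missing components are ignored)
    expectation : psi'(t) formed from the previous perception
    missing     : boolean mask, True for elements whose component was not received
    kappa       : confidence in earlier expectations (0.5 in the examples)
    """
    psi = np.where(missing, kappa * expectation, s)   # perception, as in the equation above
    completed = np.where(missing, expectation, s)     # information handed to the action system (assumed form)
    return psi, completed
```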
The dynamics for the weights is such that the expectation comes to resemble the actually received information. This part is the expectation learning. Learning takes place when the expectation is wrong. To this end, an error term ε is calculated. If the k-th component is received, the error term εk is
εk(t) = sk(t) - ψ'k(t).
If component k is not received at time step t, then the expectation is accepted as the alternative, and the error term is set to 0.
Let uijk be the weight from the perception unit ij to the expectation unit k, where j denotes the control action taken. The update of the weights is given as follows,
where η is the adaptation rate. In the examples below η = 0.2. The factor κL ∈ [0;1] determines to what degree the expectation learning is based on earlier expectations. If κL = 0, expectation learning is based only on actually received information. This may give a slow learning process. If, on the other hand, κL > 0, "self-confident" errors may accumulate in the perception system.
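The weight-update rule itself is not reproduced above. The sketch below assumes a simple error-driven outer-product rule with rate η; any update consistent with the error term εk could be substituted.

```python
import numpy as np

def expectation_learning_update(u, error, psi_prev, a_prev, eta=0.2):
    """Assumed error-driven update of the weights u[i, j, k].

    error    : epsilon_k, set to 0 for components that were not received
    psi_prev : perception elements psi_i(t-1)
    a_prev   : one-hot action vector a_j(t-1)
    eta      : adaptation rate (0.2 in the examples)
    """
    # Outer-product rule: only the units active at t-1 have their weights adjusted.
    u += eta * np.einsum('i,j,k->ijk', psi_prev, a_prev, error)
    return u
```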
Examples showing use of perception control
As the first example, we consider Barto, Sutton and Anderson's classical cart-pole system (Figure 11), and as the motivation-based action system that receives the completed
information from the perception system, we use their actor-critic network [see A.G. Barto, R.S. Sutton and C.W. Anderson, "Neuronlike Adaptive Elements that can Solve Difficult Learning Control Problems", IEEE Transactions on Systems, Man and Cybernetics SMC-13, (834-846), 1983]. The actor-critic network has three motivation parameters, β, γ, and λ, and two action parameters, α and δ. For the actor-critic network parameters we used exactly the same values as in the original paper (β = 0.5, γ = 0.95, λ = 0.8, α = 1000, and δ = 0.9). The system to be controlled contains a cart with a pole attached. The pole must be held upright within ±12° measured from the vertical position, and the cart must not move more than ±2.4 m from the start position. If one of these things happens, a punishment is received. In order to balance the pole, two actions are considered: giving the cart a 10 N push to the left or to the right.
The environmental information has four components: x (the position of the cart on the track), ẋ (the cart velocity), θ (the angle of the pole measured from the vertical), and θ̇ (the angular velocity of the pole). Contrary to the original situation, we now introduce for each of the components a probability p that the component is missing. The component units are defined by discretizing the phase space into intervals, based on the following quantization thresholds:
x: ±0.8, ±2.4 m (3 units); ẋ: ±0.5, ±∞ m/s (3 units); θ: 0, ±1, ±6, ±12° (6 units); θ̇: ±50, ±∞ °/s (3 units).
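For illustration, the discretization into 3 + 3 + 6 + 3 = 15 binary units may be implemented as follows (a sketch based on the thresholds listed above; angles in degrees):

```python
import numpy as np

def discretise_cart_pole(x, x_dot, theta, theta_dot):
    """Map the four cart-pole components onto 15 binary units (one active unit per component)."""
    def one_hot(value, edges):
        units = np.zeros(len(edges) + 1)
        units[np.searchsorted(edges, value)] = 1.0
        return units

    return np.concatenate([
        one_hot(x,         [-0.8, 0.8]),                      # 3 position units
        one_hot(x_dot,     [-0.5, 0.5]),                      # 3 velocity units
        one_hot(theta,     [-6.0, -1.0, 0.0, 1.0, 6.0]),      # 6 angle units
        one_hot(theta_dot, [-50.0, 50.0]),                    # 3 angular-velocity units
    ])
```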
Figure 12 shows the results using perception control (eight different values of κ ), as well as results from using the three simple methods described above. Control was considered obtained if the control system could balance the pole for 10,000 time steps. The learning time is defined as the number of times the pole falls or the cart enters a forbidden area before control is obtained. In our example the maximum number of trials allowed is 5000 - if a system does not obtain control within this time, a learning time of 5000 is then assigned, in order to make it possible to calculate a mean.
The learning times shown in Figure 12 are means of 40 runs. Worst performance is obtained by the "random" system and by the "extra unit" system. These systems are
unable to obtain control if the probability of a component missing is higher than 2%. A better performance is obtained by using the "same as last" system. Here the network is able to learn the balancing task up to p = 17% probability of a component missing, which is equivalent to 1 - (1-p)^4 = 53% probability that the information provided is incomplete.
Using perception control, the learning time is notably reduced. Curves for eight values (0, 0.01, 0.05, 0.2, 0.3, 0.5, 0.9 and 1.0) of κL are shown. This parameter has no significant influence on the learning time. In all cases, the perception system was able to obtain control up to 28% probability of a component missing, which is 73% probability that the information is incomplete.
The second example is Moore's Mountain-Car Task [see: J.A. Boyan and A.W. Moore, "Generalization in Reinforcement Learning: Safely Approximating the Value Function" in Advances in Reinforcement Learning, eds. G. Tesauro, D.S. Touretsky and T.K. Leen, MIT Press, Cambridge 1995]. Consider driving an underpowered car up a steep mountain road, as illustrated in Figure 13. The problem is that gravity is stronger than the car's engine can overcome, and the car cannot accelerate up the slope. The driver (control system) must first move the car away from the goal and up the opposite slope, and then, by applying full throttle the car can build up enough momentum to reach the goal. This is an example of a continuous control task where things have to get worse (farther from the goal) before the problem can be solved. At each time step a punishment is received until the mountaintop is reached. To minimize the punishment the control system (driver) must get to the goal in minimum time.
Three actions are considered: full thrust forward, no thrust, or full thrust backwards. Two components of information are available: the car's position and the car's velocity. Again, contrary to the original situation, we introduce for each component a probability p that the component is missing.
As the motivation-based action system, we use the Sarsa-system thoroughly examined for the mountain-car task by Singh and Sutton, see S.P. Singh and R.S. Sutton "Reinforcement Learning with Replacing Eligibility Traces", Machine Learning, 22, (123- 158), 1996. For the specific equations for the action system, see there. The Sarsa system has three motivation parameters (we used α = 0.7, γ = 1, λ = 0.9) and the best-motivated action is always selected.
The time (number of time steps) the control system uses to get the car to the top was recorded and averaged over 20 trials. If the control system was not able to solve the task within 100,000 time steps, the trial was ended and 100,000 was used in the calculation of the average. This happened only a very few times and only for values of p close to 1.
Figure 14 shows our results for the average trial time. Perception control works even better than for the cart-pole system. The "extra unit" system has the worst performance, while the "random" system performs somewhat better. The exact timing of each action is of little importance, and therefore the "same as last" system is expected to perform rather well. (In the next example, we discuss a problem where timing matters and the "same as last" system performs miserably.) At p = 0.30 the average time for a trial has increased to twice the value at p = 0. In comparison, the corresponding value for perception control is p = 0.65.
For large values of p the information received by the control system is generally several time steps old. At p=0.95, the agent only has 0.3% probability for receiving all the information, and with 36% probability the information will at any given time step be totally lacking for the next 10 time steps. Hence, the "same as last" system will at large values of p often perform the same action over and over again, which is the kind of behavior that actually solves the task. The surprisingly good performance of the "same as last" system above p=0.75 is thus due to the very nature of the problem at hand.
Another performance measure is the average amount of information the control system receives during the 20 trials. In other words: how much information does the control system require to learn to get the car to the top? Figure 15 shows the average number of received components per trial. It is remarkable that for values of p < 0.7 the perception control system requires less information than for p = 0. If we compare Figure 14 and Figure 15, we observe that although the average number of steps taken per trial increases dramatically at large values of p, the information needed by the perception control system does not change substantially.
The third and last example considered is a discrete, non-continuous problem. The purpose is to demonstrate that the perception control system can handle discrete problems, in contrast to the "same as last" system that strongly relies on continuity. The
example is an abstract formulation of a simple but general timing problem. The environment consists of an arbitrary number of states, here chosen to be eleven, and the (control) system can choose to be active or passive at each state. State n+1 follows state n, as illustrated in Figure 16. One state is now selected as a state where the system has to act, here taken to be state seven.
One may think of baseball and the states as representing the position of a thrown ball. At what position must the batter (control system) start moving the bat, so it hits the ball? The best behavior is to act once at the right time and be passive at all other time steps. Therefore, the control system (batter) receives a reward (r=+3) for being active at state seven, while it is punished (r=-1) for being active at all other states. In order to evoke action, there is at all states a general small punishment (r=-0.05) for being passive. At every time step, information is generally provided of which state the environment is in. However, at each time step there is a probability p that this information is missing. As the action system, the actor-critic system is used. For the actor-critic network parameters we used β = 0.5, γ = 0.95, λ = 0, α = 1, and δ = 0.
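For illustration, the reward structure of the timing problem may be expressed as a small function (state numbering as in Figure 16):

```python
def timing_reward(state, active, rewarded_state=7):
    """Reward structure of the timing problem: +3 for acting at the selected state,
    -1 for acting at any other state, and -0.05 for staying passive."""
    if not active:
        return -0.05
    return 3.0 if state == rewarded_state else -1.0
```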
The control system was allowed to go through the 11 states 100 times before the received rewards and punishments were recorded. Then the total return (sum of received rewards and punishments) over the following 10 passes was recorded and divided by the maximum possible return. Figure 17 shows the results.
The results for the three simple methods are the three curves on the left. The "random" system and the "same as last" system have the worst performance. Their performance is easy to understand. In both cases, for small probabilities the system acts when it finds that the environment is in the selected state (even if it is not). The mean return, normalized by the maximal possible return can be calculated to be 1-1.6p. For larger probabilities (p>0.7625), the systems "give up" and stay passive all the time, receiving the associated punishment. In this case, the fraction of maximal possible return is -0.22.
The "extra unit" system performs slightly better. The system learns that when information is missing, it is best to stay passive. The fraction of maximal possible reinforcement is then 1-1.22p. Contrary to the two other simple systems, the "extra unit" system never reaches the level of total passivity (except for p=1).
Figure 17 shows that perception control is effective for non-continuous problems as well as for continuous problems. The perception control system solves the problem even for high probabilities that the state information is missing. Small values of κL give the best results. Actually, for all values of p < 1, the perception control system learns to solve the problem. The actual performance depends on how much time the system is given to become "acquainted" with the environment. The system must at least be given enough time that the relevant state information has been available once.
The examples show that perception control works far better than some simple methods. The perception dynamics makes it possible for the subsequent action dynamics to obtain control even when large amounts of environmental information are missing.
In connection with control of industrial processes and systems, typical applications of the invention (expectations and perception control) are:
a) where the operational feedback has a time delay, e.g., in regulation of fluid transport where there is a time delay between a change in input and the resulting change in output,
b) where the operation point is close to an instability point, e.g. in chemical catalytic reactors or fertilisation tanks where the typical onset of oscillatory instabilities causes substantial reduction in production,
c) in production lines where there is a potential risk for expensive production break downs.
In all cases, the input S is the state of the process, and the action a is the regulation of control parameters. Moreover, one would like to take action before a damaging state S is encountered, and to this end the expectations and perception control signal are needed.