FKI-REPORTS AVAILABLE

Juergen Schmidhuber schmidhu at tumult.informatik.tu-muenchen.de
Wed Feb 28 10:37:59 EST 1990


Three reports on three quite different on-line algorithms for 
recurrent neural networks with  external feedback
(through a non-stationary environment) are available. 



        A LOCAL LEARNING ALGORITHM FOR DYNAMIC FEEDFORWARD 
	             AND RECURRENT NETWORKS 
                       Juergen Schmidhuber
		        FKI-Report 90-124

Most known learning algorithms for dynamic neural networks in
non-stationary environments need global computations to 
perform credit assignment. These algorithms are either not local 
in time or not local in space. Algorithms that are local in both
time and space usually cannot deal sensibly with `hidden units'.
In contrast, as far as we can judge at present, learning rules in 
biological systems with many `hidden units' are local in both space 
and time.
In this paper we propose a parallel on-line learning algorithm which
performs only local computations, yet is still designed to deal with 
hidden units and with units whose past activations are
`hidden in time'. The approach is inspired by Holland's idea of 
the bucket brigade for classifier systems, which is transformed to 
run on a neural network with fixed topology. The result is a 
feedforward or recurrent `neural' dissipative system which consumes
`weight-substance' and continually tries to distribute this
substance over its connections in an appropriate way. 
Experiments demonstrating the feasibility of the algorithm
are reported.
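
A minimal Python/NumPy sketch of the general idea (the payout scheme, the
equal split among contributing connections, and all constants below are
illustrative assumptions, not the rule used in the report): a connection
that has just helped to activate its target unit pays part of its
weight-substance back to the connections that activated its source unit
one step earlier, so substance flows backwards along chains of
contributing connections using only local information.

import numpy as np

n, eta = 6, 0.1
rng = np.random.default_rng(0)
W = rng.uniform(0.5, 1.0, (n, n))          # 'weight-substance' held by connection j->i

def used_connections(x_pre, x_post):
    # entry [i, j] is 1 if connection j->i was 'used': j fired, then i fired
    return np.outer(x_post, x_pre)

def bucket_brigade_step(W, x_tm1, x_t, x_tp1, reward=0.0):
    used_t   = used_connections(x_tm1, x_t)    # connections used at step t
    used_tp1 = used_connections(x_t, x_tp1)    # connections used at step t+1

    pay = eta * W * used_tp1                   # substance each used connection pays
    W = W - pay + reward * used_tp1            # payers lose substance; external
                                               # reward adds fresh substance
    collected = pay.sum(axis=0)                # per unit: total paid by its
                                               # outgoing used connections
    n_contrib = used_t.sum(axis=1, keepdims=True) + 1e-12
    W = W + used_t * (collected[:, None] / n_contrib)  # handed back, in equal
                                               # shares, to the connections that
                                               # activated that unit one step earlier
    return W

# toy run with three successive random binary activation vectors
x_tm1, x_t, x_tp1 = [(rng.random(n) > 0.5).astype(float) for _ in range(3)]
W = bucket_brigade_step(W, x_tm1, x_t, x_tp1, reward=0.1)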



                   NETWORKS ADJUSTING NETWORKS
                       Juergen Schmidhuber
		        FKI-Report 90-125

An approach to spatiotemporal credit assignment in recurrent 
reinforcement learning networks is presented. The algorithm may be 
viewed as an application of Sutton's `Temporal Difference Methods' 
to the temporal evolution of recurrent networks.  State transitions 
in a completely recurrent network are observed by a second 
non-recurrent adaptive network which receives as input the complete 
activation vectors of the recurrent one. Differences between
successive state evaluations made by the second network provide
update information for the recurrent network.
In a reinforcement learning system an adaptive critic 
(like the one used in Barto, Sutton and Anderson's AHC algorithm) 
controls the temporal evolution of a recurrent network in a changing 
environment. This is done by letting the critic learn learning rates 
for a Hebb-like rule used to associate or disassociate successive 
states in the recurrent network. Only computations local in space 
and time take place. Even with a linear critic this scheme can be applied
to tasks without linear solutions. It was successfully tested on a 
delayed XOR problem and a complicated pole balancing task with 
asymmetrically scaled inputs.
Finally, we consider how, in a changing environment, a recurrent 
dynamic supervised learning critic can interact with a recurrent 
dynamic reinforcement learning network in order to improve the
latter's performance.
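
A minimal Python/NumPy sketch of such a scheme, assuming a linear critic
that evaluates the full activation vector of the recurrent network and
whose temporal-difference error acts as a signed learning rate for a
Hebb-like update of the recurrent weights; the tanh units, the
dimensions, and the learning rates are illustrative assumptions, not the
report's exact algorithm:

import numpy as np

n = 8
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((n, n))   # weights of the recurrent network
v = np.zeros(n)                         # weights of the linear critic
x = rng.random(n)                       # current activation vector

alpha_c, alpha_w, gamma = 0.05, 0.01, 0.9   # illustrative rates and discount

def step(W, v, x, reward):
    x_next = np.tanh(W @ x)             # state transition of the recurrent net
    # difference between successive state evaluations made by the critic
    td = reward + gamma * (v @ x_next) - (v @ x)
    v = v + alpha_c * td * x            # critic update (TD(0)-style)
    # Hebb-like rule associating (td > 0) or disassociating (td < 0)
    # successive states; the critic's error acts as a signed learning rate
    W = W + alpha_w * td * np.outer(x_next, x)
    return W, v, x_next

for t in range(100):                    # toy run with sparse random reward
    W, v, x = step(W, v, x, reward=float(rng.random() < 0.1))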


      
  MAKING THE WORLD DIFFERENTIABLE: ON USING SUPERVISED LEARNING
     FULLY RECURRENT NEURAL NETWORKS FOR DYNAMIC REINFORCEMENT
       LEARNING AND PLANNING IN NON-STATIONARY ENVIRONMENTS.  
                       Juergen Schmidhuber
		        FKI-Report 90-126

First a brief introduction to supervised and reinforcement learning 
with recurrent networks in non-stationary environments is given. 
The introduction also covers the basic principle of  SYSTEM
IDENTIFICATION as employed by Munro, Robinson and Fallside, Werbos, 
Jordan, and Widrow. This principle makes it possible to employ supervised 
learning techniques for reinforcement learning.

Then a very general on-line algorithm for a reinforcement learning
neural network with internal and  external feedback in a
non-stationary reactive environment is described. Internal feedback is 
given by connections that allow cyclic activation flow through the
network. External feedback is given by output actions that may change 
the state of the environment, thus influencing subsequent input 
activations. The network's main goal is to receive as much
reinforcement (or as little `pain') as possible.

Arbitrary time lags between actions and later consequences are possible.
Although the approach is based on `supervised' learning algorithms 
for fully recurrent dynamic networks, no teacher is required. 
An adaptive  model of the environmental dynamics is constructed
which includes a model of future reinforcement to be received. 
This model is used for learning goal-directed behavior. For reasons 
of efficiency the on-line algorithm  CONCURRENTLY learns the 
model and learns to pursue the main goal. The algorithm 
is applied to the most difficult pole balancing problem ever given to 
any neural network.
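
A minimal Python/NumPy sketch of the underlying system-identification
idea, in which a model and a controller are trained concurrently
on-line: the model learns to predict the next input and the
reinforcement from its own prediction errors, and the controller is
adjusted by differentiating the predicted reinforcement through the
(frozen) model. The linear model, the feedforward tanh controller, and
the toy environment below are illustrative assumptions; the report's
networks are fully recurrent and handle arbitrary time lags between
actions and consequences.

import numpy as np

rng = np.random.default_rng(0)
s, m = 4, 2                                  # state and action dimensions
A = 0.5 * rng.standard_normal((s, s))        # unknown environment dynamics
B = 0.5 * rng.standard_normal((s, m))
g = rng.standard_normal(s)                   # reinforcement depends on the state

def environment(x, a):
    x_next = np.tanh(A @ x + B @ a)
    return x_next, float(g @ x_next)         # next input and reinforcement

C_ctrl = 0.1 * rng.standard_normal((m, s))      # controller weights
M      = 0.1 * rng.standard_normal((s, s + m))  # model: predicts next input
w_r    = np.zeros(s + m)                        # model: predicts reinforcement
lr_model, lr_ctrl = 0.05, 0.01

x = rng.random(s)
for t in range(2000):
    a = np.tanh(C_ctrl @ x)                  # action (external feedback)
    x_next, r = environment(x, a)

    # 1) model learning (supervised): reduce prediction errors for the
    #    observed next input and the observed reinforcement
    z = np.concatenate([x, a])
    e_state = (M @ z) - x_next
    e_r = (w_r @ z) - r
    M   -= lr_model * np.outer(e_state, z)
    w_r -= lr_model * e_r * z

    # 2) controller learning: increase the model's predicted reinforcement
    #    by differentiating it through the (frozen) model w.r.t. the action
    dr_da = w_r[s:]                          # sensitivity of predicted r to a
    C_ctrl += lr_ctrl * np.outer((1.0 - a**2) * dr_da, x)

    x = x_next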

A connection to `META-learning' (learning how to learn) is noted.
The possibility of using the model for learning by `mental simulation' 
of the environmental dynamics is investigated. The approach is
compared to approaches based on Sutton's methods of temporal
differences and Werbos' heuristic dynamic programming.  

Finally, it is described how the algorithm can be augmented by dynamic 
CURIOSITY and BOREDOM. This can be done by introducing 
(delayed) reinforcement for controller actions that increase the
model network's knowledge about the world. This in turn requires
the model network to model its own ignorance.
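
A minimal sketch of how such a curiosity term might enter the
reinforcement signal, assuming the model network also outputs an
estimate of its own prediction error (a model of its own ignorance);
the function name, the error norm, and the thresholding below are
illustrative assumptions, not the report's formulation:

import numpy as np

def curiosity_reward(pred_next, actual_next, expected_error, beta=0.1):
    """Assumed curiosity/boredom term: extra (possibly delayed)
    reinforcement for controller actions whose outcome the model
    network predicted badly; none once the model already knows the
    outcome.  `expected_error` stands for the model's estimate of its
    own prediction error in this situation."""
    surprise = float(np.linalg.norm(actual_next - pred_next))
    # reward only the part of the surprise the model did not expect;
    # well-modelled (boring) situations add nothing
    return beta * max(surprise - expected_error, 0.0)

# hypothetical use, with the names from the previous sketch:
#   r_total = r + curiosity_reward(M @ z, x_next, expected_error=0.05)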




Please direct requests to 

        schmidhu at lan.informatik.tu-muenchen.dbp.de

Only if this does not work for some reason, try

        schmidhu at tumult.informatik.tu-muenchen.de

Leave nothing but your physical address (subject: FKI-Reports).
DO NOT USE `REPLY'.

Of course, those who asked for copies at IJCNN in Washington 
will receive them without any further requests.




