FKI-REPORTS AVAILABLE
Juergen Schmidhuber
schmidhu at tumult.informatik.tu-muenchen.de
Wed Feb 28 10:37:59 EST 1990
Three reports on three quite different on-line algorithms for
recurrent neural networks with external feedback
(through a non-stationary environment) are available.
A LOCAL LEARNING ALGORITHM FOR DYNAMIC FEEDFORWARD
AND RECURRENT NETWORKS
Juergen Schmidhuber
FKI-Report 90-124
Most known learning algorithms for dynamic neural networks in
non-stationary environments need global computations to
perform credit assignment. These algorithms are either not local
in time or not local in space. Those algorithms which are local in both
time and space usually cannot deal sensibly with `hidden units'.
In contrast, as far as we can judge at present, learning rules in
biological systems with many `hidden units' are local in both space
and time.
In this paper we propose a parallel on-line learning algorithm which
performs only local computations, yet is still designed to deal with
hidden units and with units whose past activations are
`hidden in time'. The approach is inspired by Holland's idea of
the bucket brigade for classifier systems, which is transformed to
run on a neural network with fixed topology. The result is a
feedforward or recurrent `neural' dissipative system which consumes
`weight-substance' and continually tries to distribute this
substance over its connections in an appropriate way.
Experiments demonstrating the feasibility of the algorithm
are reported.
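The following fragment is only a toy sketch of the general flavor of
such a rule, not the report's algorithm; the unit count, threshold and
pay-back fraction are invented for illustration. Each connection holds
some `weight-substance'; a connection used at the current step pays a
fraction of its substance to the connections that were used one step
earlier and end in its source unit, while external reinforcement adds
fresh substance to the currently used connections. Every quantity is
available at the two units a connection links, so the update stays
local in space and time.

    import random

    n = 4                 # number of units (assumed)
    THRESH = 0.5          # activity threshold (assumed)
    PAY = 0.1             # fraction of substance passed back per step (assumed)

    # initial weight-substance on each connection i -> j
    w = [[random.uniform(0.05, 0.15) for _ in range(n)] for _ in range(n)]

    def used_connections(prev_act, act):
        # connections i -> j that carried activity from step t-1 to step t
        return [(i, j) for i in range(n) for j in range(n)
                if prev_act[i] > THRESH and act[j] > THRESH]

    def bucket_brigade_step(w, used_prev, used_now, reward):
        # one purely local weight-substance update
        for (i, j) in used_now:
            payment = PAY * w[i][j]
            w[i][j] -= payment
            # pay the connections that helped activate unit i one step earlier
            payees = [(k, m) for (k, m) in used_prev if m == i]
            if payees:
                for (k, m) in payees:
                    w[k][m] += payment / len(payees)
            else:
                w[i][j] += payment          # nobody to pay: keep the substance
            # external reinforcement creates new substance on used connections
            w[i][j] += reward / len(used_now)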
NETWORKS ADJUSTING NETWORKS
Juergen Schmidhuber
FKI-Report 90-125
An approach to spatiotemporal credit assignment in recurrent
reinforcement learning networks is presented. The algorithm may be
viewed as an application of Sutton's `Temporal Difference Methods'
to the temporal evolution of recurrent networks. State transitions
in a completely recurrent network are observed by a second
non-recurrent adaptive network which receives as input the complete
activation vectors of the recurrent one. Differences between
successive state evaluations made by the second network provide
update information for the recurrent network.
In a reinforcement learning system an adaptive critic
(like the one used in Barto, Sutton and Anderson's AHC algorithm)
controls the temporal evolution of a recurrent network in a changing
environment. This is done by letting the critic learn learning rates
for a Hebb-like rule used to associate or disassociate successive
states in the recurrent network. Only computations local in space
and time take place. With a linear critic this scheme can be applied to
tasks without linear solutions. It was successfully tested on a
delayed XOR problem and on a complicated pole balancing task with
asymmetrically scaled inputs.
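As a rough illustration of this scheme (a minimal sketch under assumed
dimensions and learning rates, not the report's exact update), the
temporal difference of a linear evaluator of the recurrent state can
directly serve as the learning rate of a Hebb-like association of two
successive states:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10                                  # recurrent units (assumed)
    W = rng.normal(0.0, 0.1, (n, n))        # recurrent weights
    c = np.zeros(n)                         # linear critic weights
    GAMMA, ETA_W, ETA_C = 0.9, 0.05, 0.1    # assumed constants

    def recur(W, x, inp):
        # one activation update of the fully recurrent network
        return np.tanh(W @ x + inp)

    def critic_step(W, c, x_prev, x, reward):
        # the difference between successive state evaluations acts as a
        # learning rate for a Hebb-like rule on the recurrent weights
        delta = reward + GAMMA * (c @ x) - (c @ x_prev)   # temporal difference
        W = W + ETA_W * delta * np.outer(x, x_prev)       # (dis)associate states
        c = c + ETA_C * delta * x_prev                    # adapt the critic itself
        return W, c

All quantities used at a given step are available at that step, so the
computations remain local in space and time.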
Finally we consider how, in a changing environment, a recurrent
dynamic supervised learning critic can interact with a recurrent
dynamic reinforcement learning network in order to improve the
latter's performance.
MAKING THE WORLD DIFFERENTIABLE: ON USING SUPERVISED LEARNING
IN FULLY RECURRENT NEURAL NETWORKS FOR DYNAMIC REINFORCEMENT
LEARNING AND PLANNING IN NON-STATIONARY ENVIRONMENTS
Juergen Schmidhuber
FKI-Report 90-126
First a brief introduction to supervised and reinforcement learning
with recurrent networks in non-stationary environments is given.
The introduction also covers the basic principle of SYSTEM
IDENTIFICATION as employed by Munro, Robinson and Fallside, Werbos,
Jordan, and Widrow. This principle makes it possible to employ
supervised learning techniques for reinforcement learning.
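To make the principle concrete, here is a deliberately tiny linear
sketch (all dimensions, names and learning rates are assumptions, not
taken from the report): a model network is trained by supervised
learning to predict the reinforcement that follows a state/action
pair, and because the model is differentiable, the gradient of its
prediction can be passed back into the controller.

    import numpy as np

    rng = np.random.default_rng(1)
    S, A = 4, 2                        # state and action dimensions (assumed)
    C = rng.normal(0.0, 0.1, (A, S))   # linear controller: a = C s
    v = np.zeros(S + A)                # linear model: r_hat = v . [s; a]
    LR_M, LR_C = 0.1, 0.05             # assumed learning rates

    def act(s):
        return C @ s

    def model_update(s, a, r):
        # supervised step: fit the model's prediction of reinforcement
        global v
        x = np.concatenate([s, a])
        v += LR_M * (r - v @ x) * x

    def controller_update(s):
        # backpropagate predicted reinforcement through the (frozen) model
        # into the controller; no external teacher is needed
        global C
        v_a = v[S:]                    # model weights attached to the action
        C += LR_C * np.outer(v_a, s)   # gradient ascent on predicted reward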
Then a very general on-line algorithm for a reinforcement learning
neural network with internal and external feedback in a
non-stationary reactive environment is described. Internal feedback is
given by connections that allow cyclic activation flow through the
network. External feedback is given by output actions that may change
the state of the environment, thus influencing subsequent input
activations. The network's main goal is to receive as much
reinforcement (or as little `pain') as possible.
Arbitrary time lags between actions and later consequences are possible.
Although the approach is based on `supervised' learning algorithms
for fully recurrent dynamic networks, no teacher is required.
An adaptive model of the environmental dynamics is constructed
which includes a model of future reinforcement to be received.
This model is used for learning goal-directed behavior. For reasons
of efficiency the on-line algorithm CONCURRENTLY learns the
model and learns to pursue the main goal. The algorithm
is applied to the most difficult pole balancing problem ever given to
any neural network.
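Continuing the toy linear sketch above (env_step stands for an assumed
interface to the environment and is not part of the report), the
on-line version simply interleaves one model update and one controller
update per interaction with the world:

    def run_online(env_step, s, steps=1000):
        # concurrently learn the model and pursue the main goal
        for _ in range(steps):
            a = act(s)                  # choose an action
            s_next, r = env_step(s, a)  # external feedback from the world
            model_update(s, a, r)       # improve the reinforcement model
            controller_update(s)        # improve the controller through the model
            s = s_next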
A connection to `META-learning' (learning how to learn) is noted.
The possibility of using the model for learning by `mental simulation'
of the environmental dynamics is investigated. The approach is
compared to approaches based on Sutton's methods of temporal
differences and Werbos' heuristic dynamic programming.
Finally it is described how the algorithm can be augmented by dynamic
CURIOSITY and BOREDOM. This can be done by introducing
(delayed) reinforcement for controller actions that increase the
model network's knowledge about the world. This in turn requires
the model network to model its own ignorance.
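A hedged sketch of one way to realize such a curiosity term (the
constant and the squared-error measure of `ignorance' are assumptions,
not the report's formulation): the controller's reinforcement is
increased in proportion to how badly the model predicted the
consequences of the last action, so the bonus vanishes wherever the
model is already accurate.

    import numpy as np

    CURIOSITY = 0.5     # weight of the curiosity bonus (assumed)

    def curious_reinforcement(predicted_next, observed_next, external_r):
        # reward actions that expose the model network's ignorance;
        # once the model predicts well, the bonus decays to zero (boredom)
        ignorance = float(np.mean((np.asarray(predicted_next)
                                   - np.asarray(observed_next)) ** 2))
        return external_r + CURIOSITY * ignorance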
Please direct requests to
schmidhu at lan.informatik.tu-muenchen.dbp.de
Only if this does not work for some reason, try
schmidhu at tumult.informatik.tu-muenchen.de
Leave nothing but your physical address (subject: FKI-Reports).
DO NOT USE `REPLY'.
Of course, those who asked for copies at IJCNN in Washington
will receive them without any further requests.