PhD thesis and papers on reinforcement-learning based neural planner and basal ganglia

Gianluca Baldassarre baldassarre at www.ip.rm.cnr.it
Fri Nov 21 11:38:57 EST 2003


Dear connectionists,

You can find my PhD thesis, and downloadable preprints
of some related papers, at this web page:
http://gral.ip.rm.cnr.it/baldassarre/publications/publications.html

The thesis and papers are about a neural-network planner
based on reinforcement learning
(it builds on Sutton's Dyna-PI architecture, 1990).
Some of the papers show the biological inspiration of the model and
its possible relations with the brain (basal ganglia).

Below you will find:
- the list of the titles of the thesis and the papers
- the same list with abstracts
- the index of the thesis.

Best regards,
Gianluca Baldassarre


|.CS...|.......|...............|..|......US.|||.|||||.||.||||..|...|....
Gianluca Baldassarre, Ph.D.
Institute of Cognitive Sciences and Technologies
National Research Council of Italy (ISTC-CNR)
Viale Marx 15, 00137, Rome, Italy
E-mail: baldassarre at ip.rm.cnr.it
Web: http://gral.ip.rm.cnr.it/baldassarre
Tel: ++39-06-86090227
Fax: ++39-06-824737
..CS.|||.||.|||.||..|.......|........|...US.|.|....||..|..|......|......



****************************************************************************
TITLES
****************************************************************************
Baldassarre G. (2002).
Planning with Neural Networks and Reinforcement Learning.
PhD Thesis.
Colchester - UK: Computer Science Department, University of Essex.

Baldassarre G. (2001).
Coarse Planning for Landmark Navigation in a Neural-Network Reinforcement
Learning Robot.
Proceedings of the International Conference on Intelligent Robots and
Systems (IROS-2001). IEEE.

Baldassarre G. (2001).
A Planning Modular Neural-Network Robot for Asynchronous Multi-Goal
Navigation Tasks.
In Arras K.O., Baerveldt A.-J., Balkenius C., Burgard W., Siegwart R. (eds.),
Proceedings of the 2001 Fourth European Workshop on Advanced Mobile Robots -
EUROBOT-2001,
pp. 223-230. Lund, Sweden: Lund University Cognitive Studies.

Baldassarre G. (2003).
Forward and Bidirectional Planning Based on Reinforcement Learning and
Neural Networks in a Simulated Robot.
In Butz M., Sigaud O., Gérard P. (eds.),
Adaptive Behaviour in Anticipatory Learning Systems,
pp. 179-200. Berlin: Springer Verlag.

The following papers describe the biological inspiration of the model and its
possible relations with the brain (not in the thesis):

Baldassarre G. (2002).
A modular neural-network model of the basal ganglia's role in learning and
selecting motor behaviours.
Journal of Cognitive Systems Research, Vol. 3, pp. 5-13.

Baldassarre G. (2002).
A biologically plausible model of human planning based on neural networks
and Dyna-PI models.
In Butz M., Sigaud O., Gérard P. (eds.),
Proceedings of the Workshop on Adaptive Behaviour in Anticipatory Learning
Systems – ABiALS-2002
(held within SAB-2002), pp. 40-60. Würzburg: University of Würzburg.


****************************************************************************
TITLES WITH ABSTRACTS
****************************************************************************
Baldassarre G. (2002).
Planning with Neural Networks and Reinforcement Learning.
PhD Thesis.
Colchester - UK: Computer Science Department, University of Essex.

Abstract
This thesis presents the design, implementation and investigation of some
predictive-planning controllers built with neural networks and inspired by
Dyna-PI architectures (Sutton, 1990). Dyna-PI architectures are planning
systems based on actor-critic reinforcement learning methods and a model of
the environment. The controllers are tested with a simulated robot that
solves a stochastic path-finding landmark navigation task. A critical review
of ideas and models proposed by the literature on problem solving, planning,
reinforcement learning, and neural networks precedes the presentation of the
controllers. The review isolates ideas relevant to the design of planners
based on neural networks. A “neural forward planner” is implemented that,
unlike the Dyna-PI architectures, is taskable in a strong sense. This
planner is capable of building a “partial policy” focussed around efficient
start-goal paths, and of deciding to re-plan if
“unexpected” states are encountered. Planning iteratively generates “chains
of predictions” starting from the current state and using the model of the
environment. This model is made up of neural networks trained to
predict the next input when an action is executed. A “neural bidirectional
planner” that generates trajectories backward from the goal and forward from
the current state is also implemented. This planner exploits the knowledge
(image) of the goal, further focuses planning around efficient start-goal
paths, and produces a quicker updating of evaluations. In several
experiments the generalisation capacity of neural networks proves important
for learning, but it also causes problems of interference. To deal with these
problems a modular neural architecture is implemented that uses a
mixture-of-experts network for the critic and a simple hierarchical modular network
for the actor. The research also implements a simple form of neural abstract
planning, named “coarse planning”, and investigates its strengths in terms of
exploration and the updating of evaluations. Some experiments with coarse planning
and with other controllers suggest that discounted reinforcement learning
may have problems dealing with long-lasting tasks.
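
For readers unfamiliar with the Dyna-PI scheme, the sketch below illustrates
the planning loop the abstract refers to: an actor-critic learner whose critic
and actor are updated on "chains of predictions" generated with a model of the
environment. It is a hypothetical, minimal illustration; the toy tabular world,
sizes and learning rates are stand-ins for the thesis's neural networks and
simulated robot.

import numpy as np

# Hypothetical sketch of a Dyna-PI-style planning loop (illustrative only).

N_STATES, N_ACTIONS = 25, 4
GAMMA, ALPHA = 0.95, 0.1

V = np.zeros(N_STATES)                                 # critic: state evaluations
preferences = np.zeros((N_STATES, N_ACTIONS))          # actor: action preferences
policy = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)

def model(state, action):
    """Stand-in for the learned model of the environment: it predicts the
    next state and the reward of executing an action (here hard-coded)."""
    next_state = (state + (1, -1, 5, -5)[action]) % N_STATES
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def plan(start_state, n_chains=10, chain_length=20):
    """Generate chains of predictions from the current state and update the
    critic and the actor on this simulated experience."""
    for _ in range(n_chains):
        s = start_state
        for _ in range(chain_length):
            a = np.random.choice(N_ACTIONS, p=policy[s])
            s_next, r = model(s, a)                    # simulated step
            td_error = r + GAMMA * V[s_next] - V[s]
            V[s] += ALPHA * td_error                   # critic update
            preferences[s, a] += ALPHA * td_error      # actor update
            exp_p = np.exp(preferences[s] - preferences[s].max())
            policy[s] = exp_p / exp_p.sum()            # softmax action policy
            s = s_next

plan(start_state=0)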


Baldassarre G. (2001).
Coarse Planning for Landmark Navigation in a Neural-Network Reinforcement
Learning Robot.
Proceedings of the International Conference on Intelligent Robots and
Systems (IROS-2001). IEEE.

Abstract
Is it possible to plan at a coarse level and act at a fine level with a
neural-network (NN) reinforcement-learning (RL) planner? This work presents
a NN planner, used to control a simulated robot in a stochastic
landmark-navigation problem, which plans at an abstract level. The
controller has both reactive components, based on actor-critic RL, and
planning components inspired by the Dyna-PI architecture (this roughly
corresponds to RL plus a model of the environment). Coarse planning is based
on macro-actions, each defined as a sequence of identical primitive actions. It
updates the evaluations and the action policy while generating simulated
experience at the macro level with the model of the environment (a NN
trained at the macro level). The simulations show how the controller works.
They also show the advantages of using a discount coefficient tuned to the
level of planning coarseness, and suggest that discounted RL has problems
dealing with long periods of time.
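
The point about tuning the discount coefficient to the level of planning
coarseness can be illustrated with some simple arithmetic (an assumption about
the intended relation, not taken from the paper):

# Illustrative arithmetic only: if a macro-action is k identical primitive
# actions, discounting with gamma**k per macro step keeps macro-level
# evaluations consistent with primitive-level ones, whereas keeping the same
# per-step discount at the macro level lets planning "see" k times further
# in real time.
gamma, k = 0.95, 5
gamma_macro = gamma ** k                 # discount per macro step

steps_to_goal = 100                      # distance to the reward in primitive steps
v_fine = gamma ** steps_to_goal                   # value seen at the fine level
v_coarse = gamma_macro ** (steps_to_goal // k)    # same distance in macro steps
print(v_fine, v_coarse)                  # identical: ~0.0059

# With gamma = 0.95 applied per macro step instead, the same evaluation decays
# only as 0.95**20 ~ 0.36, so distant rewards remain visible to the planner.
print(gamma ** (steps_to_goal // k))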


Baldassarre G. (2001).
A Planning Modular Neural-Network Robot for Asynchronous Multi-Goal
Navigation Tasks.
In Arras K.O., Baerveldt A.-J., Balkenius C., Burgard W., Siegwart R. (eds.),
Proceedings of the 2001 Fourth European Workshop on Advanced Mobile Robots -
EUROBOT-2001,
pp. 223-230. Lund, Sweden: Lund University Cognitive Studies.

Abstract
This paper focuses on two planning neural-network controllers, a "forward
planner" and a "bidirectional planner". These have been developed within the
framework of Sutton's Dyna-PI architectures (planning within reinforcement
learning) and have already been presented in previous papers. The novelty of
this paper is that the architecture of these planners is made modular in
some of its components in order to deal with catastrophic interference. The
controllers are tested through a simulated robot engaged in an asynchronous
multi-goal path-planning problem that should exacerbate the interference
problems. The results show that: (a) the modular planners can cope with
multi-goal problems allowing generalisation but avoiding interference; (b)
when dealing with multi-goal problems the planners keep the advantages
shown previously for one-goal problems versus plain reinforcement learning; (c)
the superiority of the bidirectional planner vs. the forward planner is
confirmed for the multi-goal task.
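
The modular device referred to here (a mixture-of-experts evaluator, used for
the critic in the thesis) can be sketched roughly as follows; the linear
experts, sizes and gating input are illustrative assumptions rather than the
paper's actual architecture.

import numpy as np

# Rough sketch of a mixture-of-experts evaluator: a gating network weighs the
# evaluations of several experts, so different goals can recruit different
# experts and learning one goal interferes less with the others.

rng = np.random.default_rng(0)
n_inputs, n_experts = 10, 3
W_experts = rng.normal(size=(n_experts, n_inputs))   # one linear evaluator per expert
W_gate = rng.normal(size=(n_experts, n_inputs))      # gating-network weights

def evaluate(state):
    """Return the gated combination of the experts' evaluations of a state."""
    expert_values = W_experts @ state
    gate_logits = W_gate @ state
    gate = np.exp(gate_logits - gate_logits.max())
    gate /= gate.sum()                                # softmax responsibilities
    return float(gate @ expert_values)

print(evaluate(rng.normal(size=n_inputs)))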


Baldassarre G. (2003).
Forward and Bidirectional Planning Based on Reinforcement Learning and
Neural Networks in a Simulated Robot.
In Butz M., Sigaud O., Gérard P. (eds.),
Adaptive Behaviour in Anticipatory Learning Systems,
pp. 179-200. Berlin: Springer Verlag.

Abstract
Building intelligent systems that are capable of learning, acting reactively
and planning actions before their execution is a major goal of artificial
intelligence. This paper presents two reactive and planning systems that
contain important novelties with respect to previous neural-network planners
and reinforcement-learning based planners: (a) the introduction of a new
component (“matcher”) allows both planners to execute genuine taskable
planning (while previous reinforcement-learning based models have used
planning only for speeding up learning); (b) the planners show for the first
time that trained neural-network models of the world can generate long
prediction chains that have an interesting robustness with respect to noise;
(c) two novel algorithms that generate chains of predictions in order to
plan, and control the flows of information between the systems’ different
neural components, are presented; (d) one of the planners uses backward
“predictions” to exploit the knowledge of the pursued goal; (e) the two
systems presented nicely integrate reactive behavior and planning on the
basis of a measure of “confidence” in action. The soundness and
potentialities of the two reactive and planning systems are tested and
compared with a simulated robot engaged in a stochastic path-finding task.
The paper also presents an extensive literature review on the relevant
issues.
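
Two of the novelties listed above, the "matcher" and the "confidence" measure,
can be sketched in a few lines; the function names and thresholds below are
illustrative assumptions, not the paper's exact formulation.

import numpy as np

# Hedged sketch: a "matcher" that self-generates the reward signal when a
# predicted perception matches the goal image, and a "confidence" measure
# used to decide between acting reactively and planning.

def matcher(predicted_state, goal_image, threshold=0.9):
    """Return a reward of 1 when the predicted perception matches the goal."""
    similarity = 1.0 - np.abs(predicted_state - goal_image).mean()
    return 1.0 if similarity >= threshold else 0.0

def confident(action_probs, threshold=0.8):
    """Act reactively when the actor's best action is sufficiently probable."""
    return action_probs.max() >= threshold

action_probs = np.array([0.1, 0.7, 0.1, 0.1])
if confident(action_probs):
    pass   # execute the most probable action in the environment
else:
    pass   # generate prediction chains (plan) before committing to an action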


Baldassarre G. (2002).
A modular neural-network model of the basal ganglia's role in learning and
selecting motor behaviours.
Journal of Cognitive Systems Research, Vol. 3, pp. 5-13.

Abstract
This work presents a modular neural-network model (based on
reinforcement-learning actor-critic methods) that tries to capture some of
the most relevant known aspects of the role that the basal ganglia play in
learning and selecting motor behaviour related to different goals. In
particular, some simulations with the model show how the basal ganglia select
"chunks" of behaviour whose "details" are specified by direct sensory-motor
pathways, and how emergent modularity can help to deal with tasks with
asynchronous multiple goals. A "top-down" approach is adopted, beginning
with the analysis of the adaptive interaction of a (simulated) organism with
the environment, and its capacity to learn. Then an attempt is made to
implement these functions with neural architectures and mechanisms that have
an empirical neuroanatomical and neurophysiological foundation.


Baldassarre G. (2002).
A biologically plausible model of human planning based on neural networks
and Dyna-PI models.
In Butz M., Sigaud O., Gérard P. (eds.),
Proceedings of the Workshop on Adaptive Behaviour in Anticipatory Learning
Systems – ABiALS-2002
(held within SAB-2002), pp. 40-60. Würzburg: University of Würzburg.

Abstract
Understanding the neural structures and physiological mechanisms underlying
human planning is a difficult challenge. In fact, planning is the product of a
sophisticated network of different brain components that interact in complex
ways. However, data produced by brain-imaging, neuroanatomical, and
neurophysiological research are now beginning to make it possible to draw a
first approximate picture of this network. This paper proposes such a
picture in the form of a neural-network computational model inspired by the
Dyna-PI models (Sutton, 1990). The model is based on the actor-critic
reinforcement learning model, which has been shown to be a good
representation of the anatomy and functioning of the basal ganglia. It is
also based on a “predictor”, a network capable of predicting the sensory
consequences of actions, which may correspond to the lateral
cerebellum-prefrontal and rostral premotor cortex pathways. All these neural
structures have been shown to be involved in human planning by functional
brain-imaging research. The model has been tested with an animat engaged
in a landmark navigation task. In accordance with the brain-imaging data,
the simulations show that, with repeated practice on the task, the
complex planning processes, and the activity of the neural structures
underlying them, fade away and leave the routine control of action to
lower-level reactive components. The simulations also show the biological
advantages offered by planning and some interesting properties of the
processing of “mental images”, based on neural networks, during planning. On
the machine learning side, the model presented extends the Dyna-PI models
with two important novelties: a “matcher” for the self-generation of a
reward signal in correspondence with any possible goal, and an algorithm that
focuses the exploration of the model of the world around important states
and allows the animat to decide when to plan and when to act on the basis
of a measure of its “confidence”. The paper also offers a wide collection of
references on the addressed issues.



****************************************************************************
INDEX OF THE THESIS
****************************************************************************

1	INTRODUCTION	12
1.1	The Objective of the Thesis	13
1.1.1	Why Neural-Network Planning Controllers?	13
1.1.2	Why a Robot and a Noisy Environment? Why a simulated robot?	15
1.1.3	Reinforcement Learning, Dynamic Programming and Dyna Architectures	16
1.1.4	Ideas from Problem Solving and Logical Planning	18
1.1.5	Why Dyna-PI Architectures (Reinforcement Learning + Model of the
Environment)?	19
1.1.6	Stochastic Path-Finding Landmark Navigation Problems	20
1.2	Overview of the Controllers and Outline of the Thesis	22
1.2.1	Overview of the Controllers Implemented in this Research	22
1.2.2	Outline of the Thesis and Problems Addressed Chapter by Chapter	23

PART 1: CRITICAL LITERATURE REVIEW AND ANALYSIS OF CONCEPTS USEFUL FOR
NEURAL PLANNING

2	PROBLEM SOLVING, SEARCH, AND STRIPS PLANNING	28
2.1	Planning as a Searching Process: Blind-Search Strategies	28
2.1.1	Critical Observations	29
2.2	Planning as a Searching Process: Heuristic-Search Strategies	29
2.2.1	Critical Observations	29
2.3	STRIPS Planning: Partial Order Planner	30
2.3.1	Situation Space and Plan Space	30
2.3.2	Partial Order Planner	31
2.3.3	Critical Observations	32
2.4	STRIPS Planning: Conditional Planning, Execution Monitoring, Abstract
Planning	32
2.4.1	Conditional Planning	33
2.4.2	Execution Monitoring and Replanning	33
2.4.3	Abstract Planning	34
2.4.4	Critical Observations	34
2.5	STRIPS Planning: Probabilistic and Reactive Planning	34
2.5.1	BURIDAN Planning Algorithm	35
2.5.2	Reactive Planning and Universal Plans	35
2.5.3	Decision theoretic planning	35
2.5.4	Maes' Planner	37
2.5.5	Critical Observations	37
2.6	Navigation and Motion Planning Through Configuration Spaces	38

3	MARKOV DECISION PROCESSES AND DYNAMIC PROGRAMMING	40
3.1	The Problem Domain Considered Here: Stochastic Path-Finding Problems	40
3.2	Critical Observations on Dynamic Programming and Heuristic Search	42
3.3	Dyna Framework and Dyna-PI Architecture	43
3.3.1	Critical Observations	44
3.4	Prioritised Sweeping and Trajectory Sampling	45
3.4.1	Critical Observations	46

4	NEURAL-NETWORKS	47
4.1	What is a Neural Network?	47
4.1.1	Critical Observations	48
4.2	Critical Observations: Feed-Forward Networks and Mixture of Experts
Networks	48
4.3	Neural Networks for Prediction Learning	50
4.3.1	Critical Observations	51
4.4	Properties of Neural Networks and Planning	51
4.4.1	Generalisation, Noise Tolerance, and Catastrophic Interference	51
4.4.2	Prototype Extraction	52
4.4.3	Learning	53
4.5	Planning with Neural Networks	53
4.5.1	Activation Diffusion Planning	54
4.5.2	Neural Planners Based on Gradient Descent Methods	56

5	UNIFYING CONCEPTS	58
5.1	Learning, Planning, Prediction and Taskability	58
5.1.1	Learning of Behaviour	59
5.1.2	Taskable Planning	60
5.1.3	Taskability: Reactive and Planning Controllers	61
5.1.4	Taskability and Dyna-PI	63
5.2	A Unified View of Heuristic Search, Dynamic Programming, and Activation
Diffusion	63
5.3	Policies and Plans	65

PART 2: DESIGNING AND TESTING NEURAL PLANNERS

6	NEURAL ACTOR-CRITIC REINFORCEMENT LEARNING	69
6.1	Introduction: Basic Neural Actor-Critic Controller and Simulations'
Scenarios	69
6.2	Scenarios of Simulations and the Simulated Robot	70
6.3	Architectures and Algorithms	72
6.4	Results and Interpretations	76
6.4.1	Functioning of the Matcher	76
6.4.2	Performance of the Controller: The Critic and the Actor	77
6.4.3	Aliasing Problem and Parameters' Exploration	81
6.4.4	Parameter Exploration	83
6.4.5	Why the Contrasts? Why no more than the Contrasts?	84
6.5	Temporal Limitations of Discounted Reinforcement Learning	85
6.6	Conclusion	89

7	REINFORCEMENT LEARNING, MULTIPLE GOALS, MODULARITY	91
7.1	Introduction	91
7.2	Scenario of Simulations: An Asynchronous Multi-Goal Task	92
7.3	Architectures and Algorithms: Monolithic and Modular Neural-Networks	93
7.4	Results and Interpretation	96
7.5	Limitations of the Controllers	100
7.6	Conclusion	100

8	THE NEURAL FORWARD PLANNER	101
8.1	Introduction: Taskability, Planning and Acting, Focussing	101
8.2	Scenario of the Simulations	103
8.3	Architectures and Algorithms: Reactive and Planning Components	104
8.3.1	The Reactive Components of the Architecture	104
8.3.2	The Planning Components of the Architecture	105
8.4	Results and Interpretation	108
8.4.1	Taskable Planning vs. Reactive Behaviour	108
8.4.2	Focussing, Partial Policies and Replanning	111
8.4.3	Neural Networks for Prediction: “True” Images as Attractors?	112
8.5	Limitations of the Neural Forward Planner	115
8.6	Conclusion	115

9	THE NEURAL BIDIRECTIONAL PLANNER	117
9.1	Introduction: More Efficient Exploration	117
9.2	Scenario of Simulations	118
9.3	Architectures and Algorithms	119
9.3.1	The Reactive Components of the Architecture	119
9.3.2	The Planning Components of the Architecture: Forward Planning	119
9.3.3	The Planning Components of the Architecture: Bidirectional Planning	121
9.4	Results and Interpretation	123
9.4.1	Common Strengths of the Forward-Planner and the Bidirectional Planner	123
9.4.2	The Forward Planner Versus the Bidirectional Planner	124
9.5	Limitations of the Neural Bidirectional Planner	126
9.6	A New “Goal Oriented Forward Planner” (Not Implemented)	126
9.7	Conclusion	127

10		NEURAL NETWORK PLANNERS AND MULTI-GOAL TASKS	128
10.1		Introduction: Neural Planners, Interference and Modularity	128
10.2		Scenario: Again the Asynchronous Multi-Goal Task	129
10.3		Architectures and Algorithms	129
10.3.1	Modular Reactive Components	129
10.3.2	Neural Modular Forward Planner	130
10.3.3	Neural Modular Bidirectional Planner	131
10.4		Results and Interpretation	132
10.4.1	Modularity and Interference	132
10.4.2	Taskability	134
10.4.3	From Planning To Reaction	134
10.4.4	The Forward Planner Versus the Bidirectional Planner	135
10.5		Limitations of the Modular Planners	137
10.6		Conclusion	137

11		COARSE PLANNING	138
11.1		Introduction: Abstraction, Macro-actions and Coarse Planning	138
11.2		Scenario of Simulations: A Simplified Navigation Task	139
11.3		Architectures and Algorithms: Coarse Planning with Macro-actions	140
11.4		Results and Interpretation	142
11.4.1	Reinforcement Learning at a Coarse Level	142
11.4.2	The Advantages of Coarse Planning	143
11.4.3	Predicting at a Coarse Level	145
11.4.4	Coarse Planning, Discount Coefficient and Time Limitations of
Reinforcement Learning	146
11.5		Limitations of the Neural Coarse Planner	149
11.6		Conclusion	150

12		CONCLUSION AND FUTURE WORK	152
12.1		Conclusion: What Have We Learned from This Research?	152
12.1.1	Ideas for Neural-Network Reinforcement-Learning Planning	152
12.1.2	Landmark Navigation, Reinforcement Learning and Neural Networks	153
12.1.3	A New Neural Forward Planner	153
12.1.4	A New Neural Bidirectional Planner	155
12.1.5	Common Structure, Interference, and Modular Networks	156
12.1.6	Coarse Planning and Time Limits of Reinforcement Learning	157
12.2		A List of the Major “Usable” Insights Delivered	158
12.3		Future Work	159

13		APPENDICES	162
13.1		Blind-Search and Heuristic-Search Strategies	162
13.1.1	Blind-Search Strategies	162
13.1.2	Heuristic-Search Strategies	163
13.2		Markov Decision Processes, Reinforcement Learning and Dynamic
Programming	165
13.2.1	Markov Decision Processes	165
13.2.2	Markov Property and Partially Observable Markov Decision Problems	167
13.2.3	Reinforcement Learning	168
13.2.4	Approximating the State or State-Action Evaluations	168
13.2.5	Searching the Policy with the Q'* and Q'π evaluations	170
13.2.6	Actor-Critic Model	171
13.2.7	Macro-actions and Options	172
13.2.8	Function Approximation and Reinforcement Learning	174
13.2.9	Dynamic Programming	174
13.2.10	Asynchronous Dynamic Programming	176
13.2.11	Trial-Based Real-Time Dynamic Programming and Heuristic Search	176
13.3		Feed-Forward Architectures and Mixture of Experts Networks	178
13.3.1	Feed-Forward Architectures and Error Backpropagation Algorithm	178
13.3.2	Mixture of Experts Neural Networks	179
13.3.3	The Generalisation Property of Neural Networks	181

14		REFERENCES	182
14.1		Candidate's Publications During the PhD Research	182
14.2		References	183
****************************************************************************

















