PhD thesis and papers on reinforcement-learning based neural planner and basal ganglia
Gianluca Baldassarre
baldassarre at www.ip.rm.cnr.it
Fri Nov 21 11:38:57 EST 2003
Dear Connectionists,
you can find my PhD thesis, and downloadable preprints
of some papers related to it, at the web page:
http://gral.ip.rm.cnr.it/baldassarre/publications/publications.html
The thesis and papers are about a neural-network planner
based on reinforcement learning
(it builds on Sutton's (1990) Dyna-PI architectures).
Some of the papers show the biological inspiration of the model and
its possible relations with the brain (basal ganglia).
Below you will find:
- the list of the titles of the thesis and the papers
- the same list with abstracts
- the index of the thesis.
Best regards,
Gianluca Baldassarre
|.CS...|.......|...............|..|......US.|||.|||||.||.||||..|...|....
Gianluca Baldassarre, Ph.D.
Institute of Cognitive Sciences and Technologies
National Research Council of Italy (ISTC-CNR)
Viale Marx 15, 00137, Rome, Italy
E-mail: baldassarre at ip.rm.cnr.it
Web: http://gral.ip.rm.cnr.it/baldassarre
Tel: ++39-06-86090227
Fax: ++39-06-824737
..CS.|||.||.|||.||..|.......|........|...US.|.|....||..|..|......|......
****************************************************************************
TITLES
****************************************************************************
Baldassarre G. (2002).
Planning with Neural Networks and Reinforcement Learning.
PhD Thesis.
Colchester - UK: Computer Science Department, University of Essex.
Baldassarre G. (2001).
Coarse Planning for Landmark Navigation in a Neural-Network Reinforcement
Learning Robot.
Proceedings of the International Conference on Intelligent Robots and
Systems (IROS-2001). IEEE.
Baldassarre G. (2001).
A Planning Modular Neural-Network Robot for Asynchronous Multi-Goal
Navigation Tasks.
In Arras K.O., Baerveldt A.-J., Balkenius C., Burgard W., Siegwart R. (eds.),
Proceedings of the 2001 Fourth European Workshop on Advanced Mobile Robots -
EUROBOT-2001,
pp. 223-230. Lund, Sweden: Lund University Cognitive Studies.
Baldassarre G. (2003).
Forward and Bidirectional Planning Based on Reinforcement Learning and
Neural Networks
in a Simulated Robot.
In Butz M., Sigaud O., Gérard P. (eds.),
Adaptive Behaviour in Anticipatory Learning Systems,
pp. 179-200. Berlin: Springer Verlag.
Papers describing the biological inspiration of the model and its
possible relations with the brain (not in the thesis):
Baldassarre G. (2002).
A modular neural-network model of the basal ganglia's role in learning and
selecting
motor behaviours. Journal of Cognitive Systems Research. Vol. 3, pp. 5-13.
Baldassarre G. (2002).
A biologically plausible model of human planning based on neural networks
and Dyna-PI models.
In Butz M., Sigaud O., Gérard P. (eds.),
Proceedings of the Workshop on Adaptive Behaviour in Anticipatory Learning
Systems ABiALS-2002
(held within SAB-2002), pp. 40-60. Wurzburg: University of Wurzburg.
****************************************************************************
TITLES WITH ABSTRACTS
****************************************************************************
Baldassarre G. (2002).
Planning with Neural Networks and Reinforcement Learning.
PhD Thesis.
Colchester - UK: Computer Science Department, University of Essex.
Abstract
This thesis presents the design, implementation and investigation of some
predictive-planning controllers built with neural networks and inspired by
Dyna-PI architectures (Sutton, 1990). Dyna-PI architectures are planning
systems based on actor-critic reinforcement learning methods and a model of
the environment. The controllers are tested with a simulated robot that
solves a stochastic path-finding landmark navigation task. A critical review
of ideas and models proposed by the literature on problem solving, planning,
reinforcement learning, and neural networks precedes the presentation of the
controllers. The review isolates ideas relevant to the design of planners
based on neural networks. A neural forward planner is implemented that,
unlike the Dyna-PI architectures, is taskable in a strong sense. This
planner is capable of building a partial policy focussed around efficient
start-goal paths, and is capable of deciding to re-plan if
unexpected states are encountered. Planning iteratively generates chains
of predictions starting from the current state and using the model of the
environment. This model is made up of neural networks trained to
predict the next input when an action is executed. A neural bidirectional
planner that generates trajectories backward from the goal and forward from
the current state is also implemented. This planner exploits the knowledge
(image) of the goal, further focuses planning around efficient start-goal
paths, and produces a quicker updating of evaluations. In several
experiments the generalisation capacity of neural networks proves important
for learning, but it also causes problems of interference. To deal with these
problems, a modular neural architecture is implemented that uses a mixture
of experts network for the critic and a simple hierarchical modular network
for the actor. The research also implements a simple form of neural abstract
planning named coarse planning, and investigates its strengths in terms of
exploration and evaluations updating. Some experiments with coarse planning
and with other controllers suggest that discounted reinforcement learning
may have problems dealing with long-lasting tasks.
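
For readers who want a concrete picture of the planning mechanism described
above, the following minimal sketch (in Python) illustrates a Dyna-PI-style
cycle: an actor-critic learner is updated on simulated experience produced by
a learned model of the environment. Tabular dictionaries stand in for the
thesis's neural networks, and all names and values are illustrative
assumptions, not the thesis code.

# Minimal sketch of a Dyna-PI-style planning cycle (illustrative assumptions,
# not the thesis code): tabular structures replace the thesis's neural networks.
import random

GAMMA, ALPHA = 0.95, 0.1        # discount coefficient and learning rate (assumed values)
ACTIONS = ["N", "S", "E", "W"]
V = {}                          # critic: state -> evaluation
policy = {}                     # actor: state -> action preferences

def predict(model, state, action):
    # Model of the environment: predicts (next_state, reward) for a state-action pair.
    return model.get((state, action), (state, 0.0))

def td_update(state, action, next_state, reward):
    # Actor-critic update: the TD error trains the critic and biases the actor.
    td_error = reward + GAMMA * V.get(next_state, 0.0) - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + ALPHA * td_error
    prefs = policy.setdefault(state, {a: 0.0 for a in ACTIONS})
    prefs[action] += ALPHA * td_error

def plan(model, start_state, n_steps=20):
    # Planning: generate a chain of predictions from the current state with the
    # model of the environment, and learn from this simulated experience.
    state = start_state
    for _ in range(n_steps):
        prefs = policy.setdefault(state, {a: 0.0 for a in ACTIONS})
        if random.random() < 0.2:              # some exploration in simulation
            action = random.choice(ACTIONS)
        else:                                  # otherwise follow the current actor
            action = max(prefs, key=prefs.get)
        next_state, reward = predict(model, state, action)
        td_update(state, action, next_state, reward)
        state = next_state
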
Baldassarre G. (2001).
Coarse Planning for Landmark Navigation in a Neural-Network Reinforcement
Learning Robot.
Proceedings of the International Conference on Intelligent Robots and
Systems (IROS-2001). IEEE.
Abstract
Is it possible to plan at a coarse level and act at a fine level with a
neural-network (NN) reinforcement-learning (RL) planner? This work presents
a NN planner, used to control a simulated robot in a stochastic
landmark-navigation problem, which plans at an abstract level. The
controller has both reactive components, based on actor-critic RL, and
planning components inspired by the Dyna-PI architecture (this roughly
corresponds to RL plus a model of the environment). Coarse planning is based
on macro-actions, each defined as a sequence of identical primitive actions. It
updates the evaluations and the action policy while generating simulated
experience at the macro level with the model of the environment (a NN
trained at the macro level). The simulations show how the controller works.
They also show the advantages of using a discount coefficient tuned to the
level of planning coarseness, and suggest that discounted RL has problems
dealing with long periods of time.
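
As an illustration of what planning "at a coarse level" could amount to, here
is a small sketch (an assumption for illustration, not taken from the paper)
of a macro-action built as k identical primitive actions, with the discount
coefficient matched to the planning coarseness by raising it to the power k.

# Sketch of coarse-planning ingredients (assumptions, not the paper's code).
GAMMA, K = 0.95, 5              # primitive discount and macro-action length (assumed)
GAMMA_MACRO = GAMMA ** K        # discount tuned to the level of planning coarseness

def execute_macro(env_step, state, primitive_action, k=K, gamma=GAMMA):
    # A macro-action: repeat the same primitive action k times, accumulating the
    # discounted reward collected along the way.
    total_reward, discount = 0.0, 1.0
    for _ in range(k):
        state, reward = env_step(state, primitive_action)
        total_reward += discount * reward
        discount *= gamma
    return state, total_reward

A model of the environment trained at the macro level would then directly
predict the state reached after such a sequence of identical actions, so that
planning can skip the intermediate fine-grained steps.
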
Baldassarre G. (2001).
A Planning Modular Neural-Network Robot for Asynchronous Multi-Goal
Navigation Tasks.
In Arras K.O., Baerveldt A.-J., Balkenius C., Burgard W., Siegwart R. (eds.),
Proceedings of the 2001 Fourth European Workshop on Advanced Mobile Robots -
EUROBOT-2001,
pp. 223-230. Lund, Sweden: Lund University Cognitive Studies.
Abstract
This paper focuses on two planning neural-network controllers, a "forward
planner" and a "bidirectional planner". These have been developed within the
framework of Sutton's Dyna-PI architectures (planning within reinforcement
learning) and have already been presented in previous papers. The novelty of
this paper is that the architecture of these planners is made modular in
some of its components in order to deal with catastrophic interference. The
controllers are tested through a simulated robot engaged in an asynchronous
multi-goal path-planning problem that should exacerbate the interference
problems. The results show that: (a) the modular planners can cope with
multi-goal problems allowing generalisation but avoiding interference; (b)
when dealing with multi-goal problems the planners keep the advantages
shown previously for one-goal problems over plain reinforcement learning; (c)
the superiority of the bidirectional planner vs. the forward planner is
confirmed for the multi-goal task.
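
To make the modular idea concrete, the sketch below (one possible
implementation assumed for illustration, not the paper's code) shows a
mixture-of-experts computation in which a gating function softly assigns the
current input to a few expert modules, so that what is learned for one goal is
less likely to overwrite what was learned for another.

# Mixture-of-experts sketch (illustrative; linear experts and all names are assumptions).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def modular_output(state, gate_weights, expert_weights):
    # gate_weights: (n_experts, state_dim); expert_weights: list of (out_dim, state_dim).
    g = softmax(gate_weights @ state)                         # responsibility of each expert
    outputs = np.array([w @ state for w in expert_weights])   # one output per expert module
    return g @ outputs, g                                     # gated combination + gating
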
Baldassarre G. (2003).
Forward and Bidirectional Planning Based on Reinforcement Learning and
Neural Networks
in a Simulated Robot.
In Butz M., Sigaud O., Gérard P. (eds.),
Adaptive Behaviour in Anticipatory Learning Systems,
pp. 179-200. Berlin: Springer Verlag.
Abstract
Building intelligent systems that are capable of learning, acting reactively
and planning actions before their execution is a major goal of artificial
intelligence. This paper presents two reactive and planning systems that
contain important novelties with respect to previous neural-network planners
and reinforcement-learning based planners: (a) the introduction of a new
component (matcher) allows both planners to execute genuine taskable
planning (while previous reinforcement-learning based models have used
planning only for speeding up learning); (b) the planners show for the first
time that trained neural-network models of the world can generate long
prediction chains that have an interesting robustness with regard to noise;
(c) two novel algorithms that generate chains of predictions in order to
plan, and control the flow of information between the systems' different
neural components, are presented; (d) one of the planners uses backward
predictions to exploit the knowledge of the pursued goal; (e) the two
systems presented nicely integrate reactive behavior and planning on the
basis of a measure of confidence in action. The soundness and
potential of the two reactive and planning systems are tested and
compared with a simulated robot engaged in a stochastic path-finding task.
The paper also presents an extensive literature review on the relevant
issues.
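
Two of the ingredients listed above can be given a concrete, if simplified,
form. The sketch below (assumptions for illustration, not the papers' code;
vector states and all names are hypothetical) shows a "matcher" that
self-generates the reward by comparing a predicted state with the goal image,
and a backward prediction step of the kind a bidirectional planner could use
to grow chains from the goal.

# Matcher and backward-chaining sketch (illustrative assumptions only).
import numpy as np

def matcher(predicted_state, goal_image, threshold=0.1):
    # Reward 1.0 when the predicted state matches the goal image, 0.0 otherwise.
    return 1.0 if np.linalg.norm(predicted_state - goal_image) < threshold else 0.0

def backward_chain(backward_model, goal_image, n_steps, actions):
    # Grow a chain of backward predictions from the goal (bidirectional planning).
    chain, state = [goal_image], goal_image
    for _ in range(n_steps):
        action = np.random.choice(actions)
        state = backward_model(state, action)   # predicted predecessor under `action`
        chain.append(state)
    return chain
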
Baldassarre G. (2002).
A modular neural-network model of the basal ganglia's role in learning and
selecting
motor behaviours. Journal of Cognitive Systems Research. Vol. 3, pp. 5-13.
Abstract
This work presents a modular neural-network model (based on
reinforcement-learning actor-critic methods) that tries to capture some of
the most relevant known aspects of the role that the basal ganglia play in
learning and selecting motor behaviour related to different goals. In
particular, some simulations with the model show that the basal ganglia select
"chunks" of behaviour whose "details" are specified by direct sensory-motor
pathways, and how emergent modularity can help to deal with tasks with
asynchronous multiple goals. A "top-down" approach is adopted, beginning
with the analysis of the adaptive interaction of a (simulated) organism with
the environment, and its capacity to learn. Then an attempt is made to
implement these functions with neural architectures and mechanisms that have
an empirical neuroanatomical and neurophysiological foundation.
Baldassarre G. (2002).
A biologically plausible model of human planning based on neural networks
and Dyna-PI models.
In Butz M., Sigaud O., Gérard P. (eds.),
Proceedings of the Workshop on Adaptive Behaviour in Anticipatory Learning
Systems ABiALS-2002
(held within SAB-2002), pp. 40-60. Wurzburg: University of Wurzburg.
Abstract
Understanding the neural structures and physiological mechanisms underlying
human planning is a difficult challenge. In fact, planning is the product of a
sophisticated network of different brain components that interact in complex
ways. However, some data produced by brain imaging, neuroanatomical and
neurophysiological research are now beginning to make it possible to draw a
first approximate picture of this network. This paper proposes such a
picture in the form of a neural-network computational model inspired by the
Dyna-PI models (Sutton, 1990). The model is based on the actor-critic
reinforcement learning model, which has been shown to be a good
representation of the anatomy and functioning of the basal ganglia. It is
also based on a predictor, a network capable of predicting the sensory
consequences of actions, which may correspond to the lateral
cerebellum-prefrontal and rostral premotor cortex pathways. All these neural
structures have been shown to be involved in human planning by functional
brain-imaging research. The model has been tested with an animat engaged
in a landmark navigation task. In accordance with the brain imaging data,
the simulations show that with repeated practice performing the task, the
complex planning processes, and the activity of the neural structures
underlying them, fade away and leave the routine control of action to
lower-level reactive components. The simulations also show the biological
advantages offered by planning and some interesting properties of the
processing of mental images, based on neural networks, during planning. On
the machine learning side, the model presented extends the Dyna-PI models
with two important novelties: a matcher for the self-generation of a
reward signal in correspondence with any possible goal, and an algorithm that
focuses the exploration of the model of the world around important states
and allows the animat to decide when to plan and when to act on the basis
of a measure of its confidence. The paper also offers a wide collection of
references on the addressed issues.
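
A minimal sketch, under assumptions about the control flow, of how planning
could fade away with practice: the controller runs planning cycles only while
its confidence in the current state is low, so that as the reactive components
improve, confidence rises and routine control passes to them. All function
names below are hypothetical, not the paper's algorithm.

# Plan-or-act control step (illustrative assumptions only).
def control_step(state, goal, actor, world_model,
                 confidence, planning_cycle, act,
                 threshold=0.6, max_planning_cycles=50):
    # Plan (mental simulation with the model of the world) only while confidence
    # is low; with practice confidence rises and planning fades, leaving routine
    # reactive control of action.
    for _ in range(max_planning_cycles):
        if confidence(actor, state) >= threshold:
            break
        planning_cycle(actor, world_model, state, goal)
    return act(actor, state)
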
****************************************************************************
INDEX OF THE THESIS
****************************************************************************
1 INTRODUCTION 12
1.1 The Objective of the Thesis 13
1.1.1 Why Neural-Network Planning Controllers? 13
1.1.2 Why a Robot and a Noisy Environment? Why a simulated robot? 15
1.1.3 Reinforcement Learning, Dynamic Programming and Dyna Architectures 16
1.1.4 Ideas from Problem Solving and Logical Planning 18
1.1.5 Why Dyna-PI Architectures (Reinforcement Learning + Model of the Environment)? 19
1.1.6 Stochastic Path-Finding Landmark Navigation Problems 20
1.2 Overview of the Controllers and Outline of the Thesis 22
1.2.1 Overview of the Controllers Implemented in this Research 22
1.2.2 Outline of the Thesis and Problems Addressed Chapter by Chapter 23
PART 1: CRITICAL LITERATURE REVIEW AND ANALYSIS OF CONCEPTS USEFUL FOR NEURAL PLANNING
2 PROBLEM SOLVING, SEARCH, AND STRIPS PLANNING 28
2.1 Planning as a Searching Process: Blind-Search Strategies 28
2.1.1 Critical Observations 29
2.2 Planning as a Searching Process: Heuristic-Search Strategies 29
2.2.1 Critical Observations 29
2.3 STRIPS Planning: Partial Order Planner 30
2.3.1 Situation Space and Plan Space 30
2.3.2 Partial Order Planner 31
2.3.3 Critical Observations 32
2.4 STRIPS Planning: Conditional Planning, Execution Monitoring, Abstract Planning 32
2.4.1 Conditional Planning 33
2.4.2 Execution Monitoring and Replanning 33
2.4.3 Abstract Planning 34
2.4.4 Critical Observations 34
2.5 STRIPS Planning: Probabilistic and Reactive Planning 34
2.5.1 BURIDAN Planning Algorithm 35
2.5.2 Reactive Planning and Universal Plans 35
2.5.3 Decision theoretic planning 35
2.5.4 Maes' Planner 37
2.5.5 Critical Observations 37
2.6 Navigation and Motion Planning Through Configuration Spaces 38
3 MARKOV DECISION PROCESSES AND DYNAMIC PROGRAMMING 40
3.1 The Problem Domain Considered Here: Stochastic Path-Finding Problems 40
3.2 Critical Observations on Dynamic Programming and Heuristic Search 42
3.3 Dyna Framework and Dyna-PI Architecture 43
3.3.1 Critical Observations 44
3.4 Prioritised Sweeping and Trajectory Sampling 45
3.4.1 Critical Observations 46
4 NEURAL-NETWORKS 47
4.1 What is a Neural Network? 47
4.1.1 Critical Observations 48
4.2 Critical Observations: Feed-Forward Networks and Mixture of Experts Networks 48
4.3 Neural Networks for Prediction Learning 50
4.3.1 Critical Observations 51
4.4 Properties of Neural Networks and Planning 51
4.4.1 Generalisation, Noise Tolerance, and Catastrophic Interference 51
4.4.2 Prototype Extraction 52
4.4.3 Learning 53
4.5 Planning with Neural Networks 53
4.5.1 Activation Diffusion Planning 54
4.5.2 Neural Planners Based on Gradient Descent Methods 56
5 UNIFYING CONCEPTS 58
5.1 Learning, Planning, Prediction and Taskability 58
5.1.1 Learning of Behaviour 59
5.1.2 Taskable Planning 60
5.1.3 Taskability: Reactive and Planning Controllers 61
5.1.4 Taskability and Dyna-PI 63
5.2 A Unified View of Heuristic Search, Dynamic Programming, and Activation Diffusion 63
5.3 Policies and Plans 65
PART 2: DESIGNING AND TESTING NEURAL PLANNERS
6 NEURAL ACTOR-CRITIC REINFORCEMENT LEARNING 69
6.1 Introduction: Basic Neural Actor-Critic Controller and Simulations' Scenarios 69
6.2 Scenarios of Simulations and the Simulated Robot 70
6.3 Architectures and Algorithms 72
6.4 Results and Interpretations 76
6.4.1 Functioning of the Matcher 76
6.4.2 Performance of the Controller: The Critic and the Actor 77
6.4.3 Aliasing Problem and Parameters' Exploration 81
6.4.4 Parameter Exploration 83
6.4.5 Why the Contrasts? Why no more than the Contrasts? 84
6.5 Temporal Limitations of Discounted Reinforcement Learning 85
6.6 Conclusion 89
7 REINFORCEMENT LEARNING, MULTIPLE GOALS, MODULARITY 91
7.1 Introduction 91
7.2 Scenario of Simulations: An Asynchronous Multi-Goal Task 92
7.3 Architectures and Algorithms: Monolithic and Modular Neural-Networks 93
7.4 Results and Interpretation 96
7.5 Limitations of the Controllers 100
7.6 Conclusion 100
8 THE NEURAL FORWARD PLANNER 101
8.1 Introduction: Taskability, Planning and Acting, Focussing 101
8.2 Scenario of the Simulations 103
8.3 Architectures and Algorithms: Reactive and Planning Components 104
8.3.1 The Reactive Components of the Architecture 104
8.3.2 The Planning Components of the Architecture 105
8.4 Results and Interpretation 108
8.4.1 Taskable Planning vs. Reactive Behaviour 108
8.4.2 Focussing, Partial Policies and Replanning 111
8.4.3 Neural Networks for Prediction: True Images as Attractors? 112
8.5 Limitations of the Neural Forward Planner 115
8.6 Conclusion 115
9 THE NEURAL BIDIRECTIONAL PLANNER 117
9.1 Introduction: More Efficient Exploration 117
9.2 Scenario of Simulations 118
9.3 Architectures and Algorithms 119
9.3.1 The Reactive Components of the Architecture 119
9.3.2 The Planning Components of the Architecture: Forward Planning 119
9.3.3 The Planning Components of the Architecture: Bidirectional Planning 121
9.4 Results and Interpretation 123
9.4.1 Common Strengths of the Forward-Planner and the Bidirectional Planner 123
9.4.2 The Forward Planner Versus the Bidirectional Planner 124
9.5 Limitations of the Neural Bidirectional Planner 126
9.6 A New Goal Oriented Forward Planner (Not Implemented) 126
9.7 Conclusion 127
10 NEURAL NETWORK PLANNERS AND MULTI-GOAL TASKS 128
10.1 Introduction: Neural Planners, Interference and Modularity 128
10.2 Scenario: Again the Asynchronous Multi-Goal Task 129
10.3 Architectures and Algorithms 129
10.3.1 Modular Reactive Components 129
10.3.2 Neural Modular Forward Planner 130
10.3.3 Neural Modular Bidirectional Planner 131
10.4 Results and Interpretation 132
10.4.1 Modularity and Interference 132
10.4.2 Taskability 134
10.4.3 From Planning To Reaction 134
10.4.4 The Forward Planner Versus the Bidirectional Planner 135
10.5 Limitations of the Modular Planners 137
10.6 Conclusion 137
11 COARSE PLANNING 138
11.1 Introduction: Abstraction, Macro-actions and Coarse Planning 138
11.2 Scenario of Simulations: A Simplified Navigation Task 139
11.3 Architectures and Algorithms: Coarse Planning with Macro-actions 140
11.4 Results and Interpretation 142
11.4.1 Reinforcement Learning at a Coarse Level 142
11.4.2 The Advantages of Coarse Planning 143
11.4.3 Predicting at a Coarse Level 145
11.4.4 Coarse Planning, Discount Coefficient and Time Limitations of Reinforcement Learning 146
11.5 Limitations of the Neural Coarse Planner 149
11.6 Conclusion 150
12 CONCLUSION AND FUTURE WORK 152
12.1 Conclusion: What Have We Learned from This Research? 152
12.1.1 Ideas for Neural-Network Reinforcement-Learning Planning 152
12.1.2 Landmark Navigation, Reinforcement Learning and Neural Networks 153
12.1.3 A New Neural Forward Planner 153
12.1.4 A New Neural Bidirectional Planner 155
12.1.5 Common Structure, Interference, and Modular Networks 156
12.1.6 Coarse Planning and Time Limits of Reinforcement Learning 157
12.2 A List of the Major Usable Insights Delivered 158
12.3 Future Work 159
13 APPENDICES 162
13.1 Blind-Search and Heuristic-Search Strategies 162
13.1.1 Blind-Search Strategies 162
13.1.2 Heuristic-Search Strategies 163
13.2 Markov Decision Processes, Reinforcement Learning and Dynamic Programming 165
13.2.1 Markov Decision Processes 165
13.2.2 Markov Property and Partially Observable Markov Decision Problems 167
13.2.3 Reinforcement Learning 168
13.2.4 Approximating the State or State-Action Evaluations 168
13.2.5 Searching the Policy with the Q* and Q^p Evaluations 170
13.2.6 Actor-Critic Model 171
13.2.7 Macro-actions and Options 172
13.2.8 Function Approximation and Reinforcement Learning 174
13.2.9 Dynamic Programming 174
13.2.10 Asynchronous Dynamic Programming 176
13.2.11 Trial-Based Real-Time Dynamic Programming and Heuristic Search 176
13.3 Feed-Forward Architectures and Mixture of Experts Networks 178
13.3.1 Feed-Forward Architectures and Error Backpropagation Algorithm 178
13.3.2 Mixture of Experts Neural Networks 179
13.3.3 The Generalisation Property of Neural Networks 181
14 REFERENCES 182
14.1 Candidate's Publications During the PhD Research 182
14.2 References 183
****************************************************************************