new publications

Csaba Szepesvari szepes at iserv.iki.kfki.hu
Wed Sep 2 04:57:40 EDT 1998


Dear Connectionists,

I have updated my homepage, which now contains a link to my
online publications. Below you can find a list of the publications
(title, abstract). They can be accessed from the page
http://sneaker.mindmaker.kfkipark.hu/~szepes/research/OnlinePubs.htm
in the form of gzipped postscript files.

Yours Sincerely,

  Csaba Szepesvari

====================================
Reinforcement Learning & Markovian Decision Problems
-------------------------------------------------------------------------

Module-Based Reinforcement Learning: Experiments with a Real Robot

Zs. Kalmár, Cs. Szepesvári and A. Lorincz
Machine Learning 31:55-85, 1998. ps

Autonomous Robots 5:273-295, 1998. (This was a joint special issue of the
two journals.)

The behavior of reinforcement learning (RL) algorithms is best understood
in completely observable, discrete-time controlled Markov chains with
finite state and action spaces. In contrast, robot-learning domains are
inherently continuous both in time and space, and moreover are partially
observable. Here we suggest a systematic approach to solve such problems
in which the available qualitative and quantitative knowledge is used to
reduce the complexity of the learning task. The steps of the design process
are to: i) decompose the task into subtasks using the qualitative
knowledge at hand; ii) design local controllers to solve the subtasks
using the available quantitative knowledge and iii) learn a coordination
of these controllers by means of reinforcement learning. It is argued
that the approach enables fast, semi-automatic, but still high-quality
robot control, as no fine-tuning of the local controllers is needed. The
approach was verified on a non-trivial real-life robot task. Several RL
algorithms were compared by ANOVA and it was found that the model-based
approach worked significantly better than the model-free approach. The
learnt switching strategy performed comparably to a handcrafted version.
Moreover, the learnt strategy seemed to exploit certain properties of the
environment which were not foreseen in advance, thus supporting the view
that adaptive algorithms are advantageous over non-adaptive ones in complex
environments.
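
For illustration only, here is a minimal sketch of step iii) above:
tabular Q-learning over a discrete set of hand-designed local controllers
("modules"). The class and argument names are hypothetical; this is not
the code used in the paper.

  import random
  from collections import defaultdict

  class ModuleSwitcher:
      """Q-learning over discrete module features to pick a local controller."""

      def __init__(self, n_modules, alpha=0.1, gamma=0.95, epsilon=0.1):
          self.n_modules = n_modules
          self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
          self.q = defaultdict(lambda: [0.0] * n_modules)  # Q[features][module]

      def select(self, features):
          # epsilon-greedy choice among the hand-designed local controllers
          if random.random() < self.epsilon:
              return random.randrange(self.n_modules)
          values = self.q[features]
          return max(range(self.n_modules), key=values.__getitem__)

      def update(self, features, module, reward, next_features):
          # one-step Q-learning update at the coordination level
          target = reward + self.gamma * max(self.q[next_features])
          old = self.q[features][module]
          self.q[features][module] = old + self.alpha * (target - old)

Here "features" stands for a discrete abstraction of the continuous,
partially observable robot state; the local controllers themselves are
fixed and are not fine-tuned.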

Non-Markovian Policies in Sequential Decision Problems

Cs. Szepesvári
Acta Cybernetica (to appear) 1998. ps

In this article we prove the validity of the Bellman Optimality Equation
and related results for sequential decision problems with a general
recursive structure. The characteristic feature of our approach is that
non-Markovian policies are also taken into account. The theory is
motivated by some experiments with a learning robot.
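
For reference, in the familiar completely observable, discounted Markovian
special case the Bellman Optimality Equation referred to above reads (the
article itself treats a more general recursive structure):

  V^*(x) = \sup_{a \in A(x)} \Big[ r(x,a) + \gamma \sum_{y} p(y \mid x,a) V^*(y) \Big]

where A(x) is the set of admissible actions in state x, r is the immediate
reward and \gamma \in [0,1) is the discount factor.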

Convergence Results for Single-Step On-Policy Reinforcement-Learning
Algorithms

S. Singh, T. Jaakkola, M.L. Littman and Cs. Szepesvári
Machine Learning, to appear, 1998. ps

An important application of reinforcement learning (RL) is to
finite-state control problems and one of the most difficult problems in
learning for control is balancing the exploration/exploitation tradeoff.
Existing theoretical results for RL give very little guidance on
reasonable ways to perform exploration. In this paper, we examine the
convergence of single-step on-policy RL algorithms for control. On-policy
algorithms cannot separate exploration from learning and therefore must
confront the exploration problem directly. We prove convergence results
for several related on-policy algorithms with both decaying exploration
and persistent exploration. We also provide examples of exploration
strategies that can be followed during learning that result in
convergence to both optimal values and optimal policies.
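
As a concrete instance (a sketch under assumptions, not the paper's
pseudocode): Sarsa(0) is a single-step on-policy algorithm, shown below
with a decaying epsilon-greedy schedule so that the policy becomes greedy
in the limit. The environment interface (env.reset / env.step), the 1/k
decay and the constant step size are illustrative simplifications; the
convergence results require suitably decreasing step sizes.

  import random
  from collections import defaultdict

  def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99):
      q = defaultdict(lambda: [0.0] * n_actions)

      def greedy_or_explore(state, epsilon):
          if random.random() < epsilon:
              return random.randrange(n_actions)
          return max(range(n_actions), key=q[state].__getitem__)

      for k in range(1, episodes + 1):
          epsilon = 1.0 / k                  # decaying exploration schedule
          state = env.reset()
          action = greedy_or_explore(state, epsilon)
          done = False
          while not done:
              next_state, reward, done = env.step(action)
              next_action = greedy_or_explore(next_state, epsilon)
              nxt = 0.0 if done else gamma * q[next_state][next_action]
              # on-policy target: uses the action actually selected next
              q[state][action] += alpha * (reward + nxt - q[state][action])
              state, action = next_state, next_action
      return q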

Multi-criteria Reinforcement Learning

Z. Gábor, Zs. Kalmár and Cs. Szepesvári
In Proceedings of the International Conference on Machine Learning, in press,
1998. ps

We consider multi-criteria sequential decision making problems where the
vector-valued evaluations are compared by a given, fixed total ordering.
Conditions for the optimality of stationary policies and the Bellman
optimality equation are given for a special, but important class of
problems when the evaluation of policies can be computed for the criteria
independently of each other. The analysis requires special care as the
topology introduced by pointwise convergence and the order-topology
introduced by the preference order are in general incompatible.
Reinforcement learning algorithms are proposed and analyzed. Preliminary
computer experiments confirm the validity of the derived algorithms.
This type of multi-criteria problem is most useful when there are
several optimal solutions to a problem and one wants to choose, among
these, the one that is optimal according to another fixed criterion.
Possible applications in robotics and repeated games are outlined.
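
To make the ordering concrete (an illustrative sketch, not from the
paper): if the fixed total ordering is lexicographic, greedy action
selection over vector-valued Q-estimates can be written as follows.

  def lexicographic_greedy(q_vectors):
      """q_vectors: dict mapping action -> tuple of per-criterion values."""
      # Python compares tuples lexicographically: the first criterion
      # decides, ties are broken by the second, and so on.
      return max(q_vectors, key=q_vectors.get)

  # 'b' ties with 'a' on the first criterion but wins on the second.
  print(lexicographic_greedy({'a': (1, 0.2), 'b': (1, 0.7), 'c': (0, 2.0)}))

This matches the use case mentioned above: optimise a primary criterion
first, then break ties according to a secondary one.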

Multi-criteria Reinforcement Learning

Z. Gábor, Zs. Kalmár and Cs. Szepesvári
Technical Report TR-98-115, "Attila József" University, Research Group on
Artificial Intelligence, Szeged, HU-6700, 1998 (submitted in 1997). ps

(Abstract identical to the conference version above.)

The Asymptotic Convergence-Rate of Q-learning

Cs. Szepesvári
In Advances in Neural Information Processing Systems 10, pp.
1064-1070, 1997. ps

In this paper we show that for discounted MDPs with discount factor
\gamma > 1/2 the asymptotic rate of convergence of Q-learning is
O(1/t^{R(1-\gamma)}) if R(1-\gamma) < 1/2 and O(\sqrt{\log\log t / t})
otherwise, provided that the state-action pairs are sampled from a fixed
probability distribution. Here R = p_{\min}/p_{\max} is the ratio of the
minimum and maximum state-action occupation frequencies. The results
extend to convergent on-line learning provided that p_{\min} > 0, where
p_{\min} and p_{\max} now become the minimum and maximum state-action
occupation frequencies corresponding to the stationary distribution.
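
For concreteness, a minimal sketch (not the paper's construction) of the
sampling model assumed above: tabular Q-learning with state-action pairs
drawn from a fixed distribution, using a 1/(number of updates) step size
per pair, which is one common choice consistent with the 1/t scaling but
an assumption here. The access function sample_transition and the
argument names are also illustrative assumptions.

  import random
  from collections import defaultdict

  def q_learning_fixed_sampling(pairs, actions, dist, sample_transition,
                                gamma=0.9, steps=100_000):
      """pairs: list of (state, action); dist: matching sampling weights.
      R in the result above equals min(dist)/max(dist)."""
      q = defaultdict(float)
      visits = defaultdict(int)
      for _ in range(steps):
          s, a = random.choices(pairs, weights=dist)[0]  # fixed distribution
          visits[(s, a)] += 1
          alpha = 1.0 / visits[(s, a)]                   # linear learning rate
          r, s_next = sample_transition(s, a)            # reward, next state
          best_next = max(q[(s_next, b)] for b in actions)
          q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
      return q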

Learning and Exploitation do not Conflict Under Minimax Optimality

Cs. Szepesvári
In Proceedings of the 9th European Conference on Machine Learning, pp.
242-249, 1997. ps

We show that adaptive real-time dynamic programming, extended with an
action-selection strategy that chooses the best action according to the
latest estimate of the cost function, yields asymptotically optimal
policies within finite time under the minimax optimality criterion. From
this it follows that learning and exploitation do not conflict under this
special optimality criterion. We relate this result to learning optimal
strategies in repeated two-player zero-sum deterministic games.
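
For orientation, one common formulation of the minimax (worst-case)
optimality criterion, with costs to be minimised, is (notation ours, not
necessarily the paper's):

  V^*(x) = \min_{a \in A(x)} \Big[ c(x,a) + \gamma \max_{y \in N(x,a)} V^*(y) \Big]

where N(x,a) is the set of states reachable from x under action a. The
exploiting action-selection strategy referred to above then simply takes
the minimising action with the latest estimate substituted for V^*.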

A unified analysis of value-function-based reinforcement-learning algorithms

Cs. Szepesvári and M. L. Littman
Submitted for review, 1997. ps

Reinforcement learning is the problem of generating optimal behavior in a
sequential decision-making environment given the opportunity of
interacting with it. Many algorithms for solving reinforcement-learning
problems work by computing improved estimates of the optimal value
function. We extend prior analyses of reinforcement-learning algorithms
and present a powerful new theorem that can provide a unified analysis of
value-function-based reinforcement-learning algorithms. The usefulness of
the theorem lies in how it allows the asynchronous convergence of a
complex reinforcement-learning algorithm to be proven by verifying that a
simpler synchronous algorithm converges. We illustrate the application of
the theorem by analyzing the convergence of Q-learning, model-based
reinforcement learning, Q-learning with multi-state updates, Q-learning
for Markov games, and risk-sensitive reinforcement learning.
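
To illustrate the synchronous/asynchronous relationship the theorem
exploits (a toy sketch in our notation, not the paper's generalised
framework), ordinary value iteration can be run either way; both variants
converge to the same fixed point of the Bellman operator, provided every
state keeps being updated.

  import random

  def bellman_backup(V, s, P, R, gamma):
      # P[s][a]: list of (probability, next_state); R[s][a]: immediate reward
      return max(R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                 for a in P[s])

  def synchronous_vi(P, R, gamma=0.9, sweeps=1000):
      V = {s: 0.0 for s in P}
      for _ in range(sweeps):
          V = {s: bellman_backup(V, s, P, R, gamma) for s in P}  # all states
      return V

  def asynchronous_vi(P, R, gamma=0.9, updates=100_000):
      V = {s: 0.0 for s in P}
      for _ in range(updates):
          s = random.choice(list(P))          # update one state at a time
          V[s] = bellman_backup(V, s, P, R, gamma)
      return V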

Some basic facts concerning minimax sequential decision processes

Cs. Szepesvári
Technical Report TR-96-100, "Attila József" University, Research Group on
Artificial Intelligence, Szeged, HU-6700, 1996. ps

It is shown that for discounted minimax sequential decision processes the
evaluation function of a stationary policy is the fixed point of the
so-called policy evaluation operator which is a contraction mapping.
Using this we prove that Bellman's principle of optimality holds. We also
prove that the asynchronous value iteration algorithm converges to the
optimal value function.

Adaptive (Neuro-)Control
---------------------------------------------------------------------
Uncertainty and Performance of Adaptive Controllers for Functionally
Uncertain Output Feedback Systems

M. French, Cs. Szepesvári and E. Rogers
In Proc. of the 1998 IEEE Conference on Decision and Control, 1998. ps

We consider nonlinear systems in an output feedback form which are
functionally known up to an L2 measure of uncertainty. The control task is
to drive the output of the system to some neighbourhood of the origin. A
modified L2 measure of transient performance (penalising both state and
control effort) is given, and the performance of a class of model based
adaptive controllers is studied. An upper performance bound is derived.

Uncertainty, Performance and Model Dependency in Approximate Adaptive
Nonlinear Control

M. French, Cs. Szepesvári and E. Rogers
In Proc. of the 1997 IEEE Conference on Decision and Control, San Diego,
California, 1997. ps

We consider systems satisfying a matching condition which are
functionally known up to an L2 measure of uncertainty. A modified L2
performance measure is given, and the performance of a class of model
based adaptive controllers is studied. An upper performance bound is
derived in terms of the uncertainty measure and measures of the
approximation error of the model. Asymptotic analyses of the bounds under
increasing model size are undertaken, and sufficient conditions are given
on the model that ensure the performance bounds are bounded independent
of the model size.
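
For readers less familiar with the setting, the following is a generic
sketch of the certainty-equivalence structure such analyses start from: a
scalar plant with matched, linearly parameterised uncertainty, a
stabilising feedback plus adaptive cancellation term, and a
Lyapunov-based gradient adaptation law. This is not the controller class
analysed in these papers; the plant, basis functions and gains are
made-up illustration values.

  import numpy as np

  def simulate(T=20.0, dt=1e-3, k=2.0, Gamma=5.0):
      # plant: xdot = theta^T phi(x) + u, with theta unknown to the controller
      phi = lambda x: np.array([x, x**2, np.sin(x)])  # hypothetical model basis
      theta_true = np.array([0.5, -0.3, 1.0])         # unknown parameters
      x, theta_hat = 1.0, np.zeros(3)
      for _ in range(int(T / dt)):
          u = -k * x - theta_hat @ phi(x)     # certainty-equivalence control
          xdot = theta_true @ phi(x) + u      # matched uncertainty
          # gradient adaptation law from a quadratic Lyapunov function
          theta_hat = theta_hat + dt * Gamma * phi(x) * x
          x = x + dt * xdot                   # forward-Euler integration
      return x, theta_hat                     # x is driven towards zero

  print(simulate())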

Uncertainty, Performance and Model Dependency in Approximate Adaptive
Nonlinear Control

M. French, Cs. Szepesvári and E. Rogers
Submitted for journal publication 1997. ps

We consider systems satisfying a matching condition which are
functionally known up to weighted L2 and L-infinity measures of
uncertainty. A modified LQ measure of control and state transient
performance is given, and the performance of a class of approximate model
based adaptive controllers is studied. An upper performance bound is
derived in terms of the uncertainty models (stability and the state
transient bounds require only the L2 uncertainty model; control effort
bounds require both L2 and L-infinity uncertainty models), and various
structural properties of the model basis. Sufficient conditions are
given to ensure that the performance is bounded independent of the model
basis size.



