Technical Report Available

Tue Nov 18 08:00:25 EST 1997

Please accept my apologies if you receive multiple copies of this
message.

The following technical report is available on the web at the page:

http://www.elet.polimi.it/~caironi/listpub.html

or directly at:

ftp://www.elet.polimi.it/pub/data/Pierguido.Caironi/tr97_50.ps.gz

-----------------------------------------------------------------------

                  Gradient-Based Reinforcement Learning:
                Learning Combinations of Control Policies

                          Pierguido V.C. Caironi
                      email: caironi at elet.polimi.it

                          Technical Report 97.50
                Dipartimento di Elettronica e Informazione 
                           Politecnico di Milano

                                 Abstract

        This report presents two innovative reinforcement learning
        algorithms for continuous state-action environments:
        Gradient REinforceMent LearnINg for Multiple control
        policies (GREMLIN-M) and Gradient REinforceMent LearnINg
        for Multiple and Single control policies (GREMLIN-MS).

        The two algorithms learn optimal combinations of control
        policies for autonomous agents.  GREMLIN-M learns an optimal
        combination of fixed base control policies.  GREMLIN-MS
        extends GREMLIN-M by enabling the agent to learn the base
        control policies at the same time.

        GREMLIN-M and GREMLIN-MS optimize a performance function
        equal to the sum of the expected reinforcements in a sliding
        temporal window of finite length.  The optimization is carried
        out through gradient ascent with respect to the parameter
        values of the control functions.  While they are natural
        extensions of existing supervised learning algorithms,
        GREMLIN-M and GREMLIN-MS advance the state of the art in
        reinforcement learning by taking into account the temporal
        credit assignment problem for the on-line, real-time
        combination of control policies.

        Furthermore, GREMLIN-M and GREMLIN-MS lend themselves to a
        motivational interpretation.  That is, the combination
        function resulting from learning may be seen as a
        representation of the motivations to apply any single base
        control policy in different environmental conditions.
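
The idea summarized in the abstract (gradient ascent on the sum of
reinforcements over a finite window, applied to the weights that combine
fixed base control policies) can be illustrated with a small toy sketch.
This is not the algorithm from the report: the 1-D dynamics, the two base
policies, the softmax combination, the reinforcement function, and all
names below are assumptions chosen for illustration, the window is
evaluated from a single fixed start state rather than slid along a
trajectory, and a finite-difference gradient stands in for whatever
analytic gradient the report derives.

```python
import math

def softmax(theta):
    """Turn unconstrained parameters into combination weights that sum to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical fixed base control policies on a 1-D state (assumption).
BASE_POLICIES = [
    lambda x: -x,   # steer toward the origin
    lambda x: 1.0,  # constant push in the positive direction
]

def window_return(theta, x0, horizon=10, dt=0.1):
    """Sum of reinforcements over a finite window of `horizon` steps."""
    weights = softmax(theta)
    x, total = x0, 0.0
    for _ in range(horizon):
        # Combined action: weighted sum of the base policies' actions.
        u = sum(w * pi(x) for w, pi in zip(weights, BASE_POLICIES))
        x = x + dt * u    # toy deterministic dynamics (assumption)
        total += -x * x   # toy reinforcement: penalize distance from 0
    return total

def finite_diff_gradient(theta, x0, eps=1e-5):
    """Central-difference estimate of d(window_return)/d(theta)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        diff = window_return(up, x0) - window_return(down, x0)
        grad.append(diff / (2 * eps))
    return grad

def train(x0=1.0, steps=200, lr=0.5):
    """Plain gradient ascent on the combination parameters."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = finite_diff_gradient(theta, x0)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

if __name__ == "__main__":
    theta = train()
    print("combination weights:", softmax(theta))
    print("window return:", window_return(theta, 1.0))
```

In this toy setting the learned weights can be read loosely in the
"motivational" sense mentioned above: a larger weight means a stronger
preference for applying that base policy. Here the weights do not depend
on the state, which is a simplification of the sketch.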

-- 
Name:    Pierguido V. C. CAIRONI
Job:     Ph.D. Student at the Politecnico di Milano - ITALY
e-mail:  caironi at elet.polimi.it
Address: Politecnico di Milano - Dip. di Elettronica e Informazione
	 Piazza Leonardo da Vinci 32
	 20133 - Milano - ITALY
Tel: 	 +39-2-23993622
Fax:	 +39-2-23993411
WWW:     http://www.elet.polimi.it/~caironi