[ACT-R-users] Question concerning the new utility learning algo
Chris R. Sims
simsc at rpi.edu
Tue Jan 16 08:13:32 EST 2007
Hi Marc,
The new algorithm (using a constant learning rate) implicitly defines
an exponentially recency-weighted average over the history of feedback.
You can do a series expansion to show this. Although someone at CMU
will have to give you the "official" reason for using the equation,
it does, as you suggest, allow for re-adapting faster when the task
changes. More importantly, though, human and animal decision makers
tend to be more influenced by local feedback than to integrate the
entire history of rewards, even in stationary environments. Most of
the studies demonstrating this have used environments with
probabilistic rewards or reward magnitudes, but stationary reward
statistics.
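To make the recency weighting concrete, here is a minimal sketch (my
own illustration in Python, not the actual ACT-R code) that unrolls
the constant-learning-rate update and checks numerically that it
equals an exponentially weighted sum of past rewards; the learning
rate a and the reward sequence are arbitrary example values.

def incremental(rewards, a, x0=0.0):
    """New-style update with constant learning rate: x <- x + a*(r - x)."""
    x = x0
    for r in rewards:
        x = x + a * (r - x)
    return x

def recency_weighted(rewards, a, x0=0.0):
    """Series expansion of the same recurrence:
    x(n) = (1-a)^n * x(0) + sum_k a*(1-a)^(n-k) * r_k,
    i.e. an exponentially recency-weighted average of the rewards."""
    n = len(rewards)
    return (1 - a) ** n * x0 + sum(
        a * (1 - a) ** (n - k) * r for k, r in enumerate(rewards, start=1)
    )

def sample_average(rewards):
    """Old-style 'normal' average (a = 1/N): every reward weighted equally."""
    return sum(rewards) / len(rewards)

rewards = [10, 0, 10, 10, 0, 10]         # arbitrary example feedback
print(incremental(rewards, a=0.2))       # 4.95936
print(recency_weighted(rewards, a=0.2))  # 4.95936 -- identical by the expansion
print(sample_average(rewards))           # 6.666... -- flat weighting of all rewards

The weight on a reward received k steps in the past is a*(1-a)^k,
which is where the exponential recency weighting comes from; with
a = 1/N every past reward keeps the same weight.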
It's an empirical question whether the new algorithm's exponential
decay is more appropriate than, say, a power-law decay or some other
function. I have some currently unpublished data, using maximum
likelihood fits of several decay functions, suggesting that an
exponential decay fits just as well as a power function, but
definitely better than the old ACT-R mechanism. Although I'm not
aware of any other published studies specifically examining
exponential decay in adapting to feedback, it's certainly reminiscent
of the debate over the appropriate memory retention function, where
there is a large but somewhat inconclusive literature.
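For illustration of what is being compared, here are generic shapes
of the two candidate recency-weighting functions; the exact
parameterizations in the fits mentioned above are not given here, so
treat the forms and parameter values below as assumptions for the
sketch.

def exponential_weight(lag, a=0.2):
    """Weight on feedback received `lag` trials ago under exponential
    decay (the form implied by a constant learning rate a)."""
    return a * (1 - a) ** lag

def power_weight(lag, d=1.0):
    """Weight under a power-law decay with exponent d (unnormalized)."""
    return (lag + 1) ** (-d)

for lag in range(6):
    print(lag, round(exponential_weight(lag), 4), round(power_weight(lag), 4))

The qualitative difference is in the tail: the power-law weights
shrink much more slowly, so distant feedback retains more influence.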
I hope that helps with your questions,
Chris Sims
Department of Cognitive Science
Rensselaer Polytechnic Institute
Troy, NY 12180
On Jan 16, 2007, at 5:17 AM, Marc Halbruegge wrote:
> Hi!
>
> I'm just figuring out how the new version works. To me, there are two
> big differences between the algos:
>
> 1. In the old one, there was no discounting. The reward was propagated
> to the whole trajectory (quote: "When such a production fires all the
> productions that have fired since the last marked production fired are
> credited with a success or failure.").
>
> 2. The new one uses a moving average as the estimator of the utility
> (r = reward):
>     x(n) = x(n-1) + a[r - x(n-1)]   with constant a,
> while the old one used a "normal" average:
>     x(n) = x(n-1) + a[r - x(n-1)]   with a = 1/N, N being the number
>     of visits.
>
> The old algo was quite similar to what is called Monte Carlo
> estimation (equivalent to TD(1)) in the machine learning literature.
> The new one has some similarities with TD(lambda), but uses a linear
> instead of the usual exponential discounting function, and uses a
> moving average, which is quite unusual because it leads to much more
> noise in the estimator.
>
> So here's my question: what is the reason for using a constant
> learning rate? The possibility of re-adapting faster when the task
> changes?
> Is it possible to get a mixture of both algorithms?
>
> Greetings
> Marc Halbruegge
>
>
> --
> Dipl.-Psych. Marc Halbruegge
> Human Factors Institute
> Faculty of Aerospace Engineering
> Bundeswehr University Munich
> Werner-Heisenberg-Weg 39
> D-85579 Neubiberg
>
> Phone: +49 89 6004 3497
> Fax: +49 89 6004 2564
> E-Mail: marc.halbruegge at unibw.de
> _______________________________________________
> ACT-R-users mailing list
> ACT-R-users at act-r.psy.cmu.edu
> http://act-r.psy.cmu.edu/mailman/listinfo/act-r-users
>