[ACT-R-users] Question concerning the new utility learning algo
Chris R. Sims
simsc at rpi.edu
Tue Jan 16 08:13:32 EST 2007
Hi Marc,
The new algorithm (using a constant learning rate) implicitly defines
an exponentially recency-weighted average over the history of feedback.
You can do a series expansion to show this. Although someone at CMU
will have to give you the "official" reason for using the equation,
it does, as you suggest, allow for re-adapting faster when the task
changes. More importantly, though, human and animal decision makers
tend to be more influenced by local feedback than to integrate the
entire history of rewards, even in stationary environments. Most of
the studies demonstrating this have used environments with
probabilistic rewards or reward magnitudes, but stationary reward
statistics.
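To make the recency weighting concrete, here is a minimal sketch (my
own illustration in Python, not the actual ACT-R code) that unrolls
the constant-learning-rate update and checks numerically that it
equals an exponentially weighted sum of past rewards; the learning
rate a and the reward sequence are arbitrary example values.

def incremental(rewards, a, x0=0.0):
    """New-style update with constant learning rate: x <- x + a*(r - x)."""
    x = x0
    for r in rewards:
        x = x + a * (r - x)
    return x

def recency_weighted(rewards, a, x0=0.0):
    """Series expansion of the same recurrence:
    x(n) = (1-a)^n * x(0) + sum_k a*(1-a)^(n-k) * r_k,
    i.e. an exponentially recency-weighted average of the rewards."""
    n = len(rewards)
    return (1 - a) ** n * x0 + sum(
        a * (1 - a) ** (n - k) * r for k, r in enumerate(rewards, start=1)
    )

def sample_average(rewards):
    """Old-style 'normal' average (a = 1/N): every reward weighted equally."""
    return sum(rewards) / len(rewards)

rewards = [10, 0, 10, 10, 0, 10]         # arbitrary example feedback
print(incremental(rewards, a=0.2))       # 4.95936
print(recency_weighted(rewards, a=0.2))  # 4.95936 -- identical by the expansion
print(sample_average(rewards))           # 6.666... -- flat weighting of all rewards

The weight on a reward received k steps in the past is a*(1-a)^k,
which is where the exponential recency weighting comes from; with
a = 1/N every past reward keeps the same weight.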
It's an empirical question whether the new algorithm's exponential
decay is more appropriate than, say, a power-law decay or some other
function. I have some currently unpublished data, using maximum
likelihood fits of several decay functions, suggesting that an
exponential decay fits just as well as a power function, but
definitely better than the old ACT-R mechanism. Although I'm not
aware of any other published studies specifically examining
exponential decay in adapting to feedback, it's certainly reminiscent
of the debate over the appropriate memory retention function, where
there is a large but somewhat inconclusive literature.
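For illustration of what is being compared, here are generic shapes
of the two candidate recency-weighting functions; the exact
parameterizations in the fits mentioned above are not given here, so
treat the forms and parameter values below as assumptions for the
sketch.

def exponential_weight(lag, a=0.2):
    """Weight on feedback received `lag` trials ago under exponential
    decay (the form implied by a constant learning rate a)."""
    return a * (1 - a) ** lag

def power_weight(lag, d=1.0):
    """Weight under a power-law decay with exponent d (unnormalized)."""
    return (lag + 1) ** (-d)

for lag in range(6):
    print(lag, round(exponential_weight(lag), 4), round(power_weight(lag), 4))

The qualitative difference is in the tail: the power-law weights
shrink much more slowly, so distant feedback retains more influence.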
I hope that helps with your questions,
Chris Sims
Department of Cognitive Science
Rensselaer Polytechnic Institute
Troy, NY 12180
On Jan 16, 2007, at 5:17 AM, Marc Halbruegge wrote:
> Hi!
>
> I'm just figuring out how the new version works. To me, there are two
> big differences between the algos:
>
> 1. In the old one, there was no discounting. The reward was propagated
> to the whole trajectory (quote: "When such a production fires all the
> productions that have fired since the last marked production fired are
> credited with a success or failure.").
>
> 2. The new one uses a moving average as the estimator of the utility
> (r = reward):
>     x(n) = x(n-1) + a[r - x(n-1)]   with constant a,
> while the old one used a "normal" average:
>     x(n) = x(n-1) + a[r - x(n-1)]   with a = 1/N, N being the number
>     of visits.
>
> The old algo was quite similar to what is called Monte Carlo
> estimation (equivalent to TD(1)) in the machine learning literature.
> The new one has some similarities with TD(lambda), but uses a linear
> instead of the usual exponential discounting function, and uses a
> moving average, which is quite unusual because it leads to much more
> noise in the estimator.
>
> So here's my question: what is the reason for using a constant
> learning rate? The possibility of re-adapting faster when the task
> changes?
> Is it possible to get a mixture of both algorithms?
>
> Greetings
> Marc Halbruegge
>
>
> --
> Dipl.-Psych. Marc Halbruegge
> Human Factors Institute
> Faculty of Aerospace Engineering
> Bundeswehr University Munich
> Werner-Heisenberg-Weg 39
> D-85579 Neubiberg
>
> Phone: +49 89 6004 3497
> Fax: +49 89 6004 2564
> E-Mail: marc.halbruegge at unibw.de
> _______________________________________________
> ACT-R-users mailing list
> ACT-R-users at act-r.psy.cmu.edu
> http://act-r.psy.cmu.edu/mailman/listinfo/act-r-users
>