[ACT-R-users] Question concerning the new utility learning algo
Marc Halbruegge
marc.halbruegge at unibw.de
Tue Jan 16 05:17:10 EST 2007
Hi!
I'm just figuring out how the new version works. As far as I can tell,
there are two big differences between the algorithms:
1. In the old one, there was no discounting. The reward was propagated
to the whole trajectory (quote: "When such a production fires all the
productions that have fired since the last marked production fired are
credited with a success or failure.").
2. The new one uses a moving average as an estimator of the utility
(r = reward):
x(n) = x(n-1) + a[r - x(n-1)] with constant a
while the old one used a "normal" average
x(n) = x(n-1) + a[r - x(n-1)] with a = 1/N, N being the number of visits
(see the small code sketch right after this list).
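To make the difference concrete, here is a minimal Python sketch of the
two update rules exactly as written above; the function names and the
example numbers are my own and are not taken from the ACT-R code.

# Sketch of the two utility update rules from the equations above.
# Names and values are illustrative; this is not the ACT-R implementation.

def update_constant(x, r, a):
    """New rule: moving average with a constant learning rate a."""
    return x + a * (r - x)

def update_sample_average(x, r, n):
    """Old rule: running ("normal") average, i.e. a = 1/N with N visits."""
    return x + (1.0 / n) * (r - x)

# Feed the same reward sequence to both estimators.
rewards = [10, 0, 10, 10, 0]
x_new = x_old = 0.0
for n, r in enumerate(rewards, start=1):
    x_new = update_constant(x_new, r, a=0.2)
    x_old = update_sample_average(x_old, r, n)
print(x_new, x_old)   # x_old is exactly the plain mean of the rewards so far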
The old algorithm was quite similar to what is called Monte Carlo
estimation (equivalent to TD(1)) in the machine learning literature.
The new one has some similarities with TD(lambda), but it uses a linear
instead of the usual exponential discounting function, and it uses a
moving average, which is quite unusual because it leads to much more
noise in the estimator.
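To illustrate the noise point (and the re-adaptation question below),
here is a small simulation sketch; the reward process, the learning
rate of 0.2, and the trial count are assumptions of mine, chosen only
to show the qualitative tradeoff, not ACT-R defaults.

# Constant-learning-rate estimator vs. 1/N sample average on a noisy
# reward whose mean drops from 10 to 0 halfway through the run.
import random

random.seed(1)
ALPHA = 0.2                                      # assumed constant learning rate
N_TRIALS = 200

x_const = x_avg = 0.0
for n in range(1, N_TRIALS + 1):
    mean = 10.0 if n <= N_TRIALS // 2 else 0.0   # task change at the midpoint
    r = random.gauss(mean, 2.0)                  # noisy reward
    x_const += ALPHA * (r - x_const)             # moving average (new rule)
    x_avg += (r - x_avg) / n                     # sample average (old rule)

# After the change, x_const tracks the new mean of 0 within a few trials
# but keeps fluctuating around it; x_avg is much smoother but is still
# around 5 at the end, because each new reward only gets weight 1/n.
print(round(x_const, 2), round(x_avg, 2))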
So here's my question: what is the reason for using a constant
learning rate? Is it the ability to re-adapt faster when the task
changes? And is it possible to get a mixture of both algorithms?
Greetings
Marc Halbruegge
--
Dipl.-Psych. Marc Halbruegge
Human Factors Institute
Faculty of Aerospace Engineering
Bundeswehr University Munich
Werner-Heisenberg-Weg 39
D-85579 Neubiberg
Phone: +49 89 6004 3497
Fax: +49 89 6004 2564
E-Mail: marc.halbruegge at unibw.de