[ACT-R-users] Question concerning the new utility learning algo
Marc Halbruegge
marc.halbruegge at unibw.de
Tue Jan 16 05:17:10 EST 2007
Hi!
I'm just figuring out how the new version works. As far as I can tell,
there are two big differences between the algorithms:
1. In the old one, there was no discounting. The reward was propagated
to the whole trajectory (quote: "When such a production fires all the
productions that have fired since the last marked production fired are
credited with a success or failure.").
2. The new one uses a moving average as an estimator of the utility
(r = reward):
x(n) = x(n-1) + a[r - x(n-1)] with constant a
while the old one used a "normal" average
x(n) = x(n-1) + a[r - x(n-1)] with a = 1/N, N being the number of visits
(see the small code sketch right after this list).
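To make the difference concrete, here is a minimal Python sketch of the
two update rules exactly as written above; the function names and the
example numbers are my own and are not taken from the ACT-R code.

# Sketch of the two utility update rules from the equations above.
# Names and values are illustrative; this is not the ACT-R implementation.

def update_constant(x, r, a):
    """New rule: moving average with a constant learning rate a."""
    return x + a * (r - x)

def update_sample_average(x, r, n):
    """Old rule: running ("normal") average, i.e. a = 1/N with N visits."""
    return x + (1.0 / n) * (r - x)

# Feed the same reward sequence to both estimators.
rewards = [10, 0, 10, 10, 0]
x_new = x_old = 0.0
for n, r in enumerate(rewards, start=1):
    x_new = update_constant(x_new, r, a=0.2)
    x_old = update_sample_average(x_old, r, n)
print(x_new, x_old)   # x_old is exactly the plain mean of the rewards so far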
The old algorithm was quite similar to what is called Monte Carlo
estimation (equivalent to TD(1)) in the machine learning literature.
The new one has some similarities with TD(lambda), but it uses a linear
instead of the usual exponential discounting function, and it uses a
moving average, which is quite unusual because it leads to much more
noise in the estimator.
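To illustrate the noise point (and the re-adaptation question below),
here is a small simulation sketch; the reward process, the learning
rate of 0.2, and the trial count are assumptions of mine, chosen only
to show the qualitative tradeoff, not ACT-R defaults.

# Constant-learning-rate estimator vs. 1/N sample average on a noisy
# reward whose mean drops from 10 to 0 halfway through the run.
import random

random.seed(1)
ALPHA = 0.2                                      # assumed constant learning rate
N_TRIALS = 200

x_const = x_avg = 0.0
for n in range(1, N_TRIALS + 1):
    mean = 10.0 if n <= N_TRIALS // 2 else 0.0   # task change at the midpoint
    r = random.gauss(mean, 2.0)                  # noisy reward
    x_const += ALPHA * (r - x_const)             # moving average (new rule)
    x_avg += (r - x_avg) / n                     # sample average (old rule)

# After the change, x_const tracks the new mean of 0 within a few trials
# but keeps fluctuating around it; x_avg is much smoother but is still
# around 5 at the end, because each new reward only gets weight 1/n.
print(round(x_const, 2), round(x_avg, 2))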
So here's my question: what is the reason for using a constant
learning rate? Is it the ability to re-adapt faster when the task
changes? And is it possible to get a mixture of both algorithms?
Greetings
Marc Halbruegge
--
Dipl.-Psych. Marc Halbruegge
Human Factors Institute
Faculty of Aerospace Engineering
Bundeswehr University Munich
Werner-Heisenberg-Weg 39
D-85579 Neubiberg
Phone: +49 89 6004 3497
Fax: +49 89 6004 2564
E-Mail: marc.halbruegge at unibw.de