[ACT-R-users] Question concerning the new utility learning algo

John Anderson ja+ at cmu.edu
Tue Jan 16 08:17:04 EST 2007


Thanks for these questions.

The time between a production's firing and the reward is subtracted
from the reward for that production, giving a discount.  This fact is
relevant to your question below.  The motivation for these choices
was to make the behavior of the new utility learning mechanism
similar to the current one in terms of the reward a production
receives.  The new system is simpler and extends to a greater variety
of situations (e.g., variable rewards and rewards not tied to
productions).
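To make the discount concrete, here is a small sketch in Python (the
actual system is in Lisp, and the names here are purely illustrative)
of how each credited production could receive the reward minus the
time since it fired:

# Illustrative sketch, not the ACT-R source: each production that is
# credited with a reward receives the reward value minus the time
# that elapsed between its firing and the delivery of the reward.

def discounted_rewards(reward_value, reward_time, firing_times):
    """Effective reward for each credited production firing."""
    return [reward_value - (reward_time - t) for t in firing_times]

# A reward of 10 delivered at t = 12 s, credited to productions that
# fired at t = 2 s, 8 s, and 11 s.
print(discounted_rewards(10.0, 12.0, [2.0, 8.0, 11.0]))
# -> [0.0, 6.0, 9.0]: earlier firings receive a smaller effective reward.

Productions that fired long before the reward thus receive little or
even negative credit, which is what produces the linear discount you
note below.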

You are right, however, that the averaging mechanism is different.
We may revisit the averaging mechanism, but it has the following
attractive properties: (1) it discounts past experiences; (2) it is
computationally very simple; (3) it is essentially the simple
learning rule that has a long and successful history going back at
least to Bush & Mosteller (1955) and, indeed, has similarities to
TD(lambda).
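The first of these properties is easy to see in a small Python
comparison (again purely illustrative, with hypothetical names) of
the constant-rate update against the 1/N sample average of the old
mechanism:

# Illustrative comparison of the two update rules in this thread:
#   constant learning rate: x(n) = x(n-1) + a * (r - x(n-1)), a fixed
#   sample average:         x(n) = x(n-1) + (1/N) * (r - x(n-1))

def estimate(rewards, alpha=None):
    x = 0.0
    for n, r in enumerate(rewards, start=1):
        a = alpha if alpha is not None else 1.0 / n  # 1/N if no constant rate
        x += a * (r - x)
    return x

# Reward is 10 for 50 trials, then drops to 2 for 50 trials.
rewards = [10.0] * 50 + [2.0] * 50

print(estimate(rewards, alpha=0.2))  # constant rate tracks the change: ~2.0
print(estimate(rewards))             # 1/N average lags at the overall mean: 6.0

With the constant rate the estimate tracks the new reward value after
the change, whereas the 1/N average converges to the overall mean and
adapts ever more slowly; this bears directly on your question about
re-adapting when the task changes.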

The new mechanism receives further discussion in my forthcoming book.

I remind everyone that the old mechanism is still available within
the new release.  We strive not to upset anyone's work or mandate
changes.

At 11:17 AM +0100 1/16/07, Marc Halbruegge wrote:
>Hi!
>
>I'm just figuring out how the new version works. To me, there are two
>big differences between the algos:
>
>1. In the old one, there was no discounting. The reward was propagated
>  to the whole trajectory (quote: "When such a production fires all the
>  productions that have fired since the last marked production fired are
>  credited with a success or failure.").
>
>2. The new one uses a moving average as estimator of the utility
>  (r=reward):
>     x(n) = x(n-1) + a[r - x(n-1)]  with constant a
>  while the old one used a "normal" average
>     x(n) = x(n-1) + a[r - x(n-1)]  with a = 1/N, N being the number of visits
>
>The old algo was quite similar to what is called Monte Carlo
>estimation (equivalent to TD(1)) in the machine learning literature.
>The new one has some similarities with TD(lambda), but uses a linear
>instead of the usual exponential discounting function and uses a
>moving average, which is quite unusual because it leads to much more
>noise in the estimator.
>
>So here's my question: What is the reason for the use of a constant
>learning rate? The possibility to re-adapt faster when the task changes?
>Is it possible to get a mixture of both algorithms?
>
>Greetings
>Marc Halbruegge
>
>
>--
>Dipl.-Psych. Marc Halbruegge
>Human Factors Institute
>Faculty of Aerospace Engineering
>Bundeswehr University Munich
>Werner-Heisenberg-Weg 39
>D-85579 Neubiberg
>
>Phone: +49 89 6004 3497
>Fax: +49 89 6004 2564
>E-Mail: marc.halbruegge at unibw.de
>_______________________________________________
>ACT-R-users mailing list
>ACT-R-users at act-r.psy.cmu.edu
>http://act-r.psy.cmu.edu/mailman/listinfo/act-r-users


-- 

==========================================================

John R. Anderson
Carnegie Mellon University
Pittsburgh, PA 15213

Phone: 412-268-2788
Fax:     412-268-2844
email: ja at cmu.edu
URL:  http://act.psy.cmu.edu/


