[ACT-R-users] experiments to validate behaviour of a cognitive model

Wayne Gray grayw at rpi.edu
Tue Sep 16 11:55:55 EDT 2008


Florian,

There is nothing particularly special about establishing the
reliability of experimental results in order to model those results.
The issues here are the same as for establishing the reliability of any
experimental result, whether it is modeled or not.

The complexity of the task often means that there are many different  
strategies that human subjects can bring to bear in performing the  
task. As a trivial example, the number of strategies available in  
something like the attentional blink (or RSVP) task is small compared  
to the number of strategies that could be brought to bear by an expert  
playing chess (against an expert opponent).

Your example of a pilot flying through a thunderstorm is a lot like the
chess example in that small choices made early on may have a big
impact on how the game or flight plays out.

The real issue to focus on is at what level you wish to model (see  
Newell's time scale of human activity; first used in Newell, A., &  
Card, S. K. (1985). The prospects for psychological science in human- 
computer interaction. Human-Computer Interaction, 1(3), 209–242; later  
discussed in Newell, A. (1990). Unified theories of cognition.  
Cambridge, MA: Harvard University Press.). That is, what human and  
model data do you wish to compare?

One thing to think about is which aspect of performance in a complex
task you wish to predict. Total flight time? Success? Mean attitude and
speed? Changes in direction, altitude, or speed? Or perhaps the exact
path of the flight, down to the level of predicting every course
change? More detail still? Which instruments they look at from moment
to moment? The sequence of instrument checks before they make a
decision to alter altitude, speed, or course? Eye movements?
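
To make the grain-size question concrete, here is a rough Python sketch
of how coarse-grain and fine-grain measures might be pulled from the
same flight log. The log format and field names are invented for
illustration only; they are not from any particular flight simulator or
from ACT-R.

# Hypothetical flight-log analysis at two grain sizes.
from dataclasses import dataclass
from typing import List

@dataclass
class LogEntry:
    t: float                  # seconds into the flight
    altitude: float           # feet
    airspeed: float           # knots
    fixated_instrument: str   # e.g. "altimeter", "attitude_indicator"

def gross_measures(log: List[LogEntry]) -> dict:
    """Coarse grain: total flight time and mean altitude/airspeed."""
    return {
        "total_time": log[-1].t - log[0].t,
        "mean_altitude": sum(e.altitude for e in log) / len(log),
        "mean_airspeed": sum(e.airspeed for e in log) / len(log),
    }

def instrument_scan_sequence(log: List[LogEntry]) -> List[str]:
    """Fine grain: order of instruments fixated, consecutive repeats collapsed."""
    seq: List[str] = []
    for entry in log:
        if not seq or seq[-1] != entry.fixated_instrument:
            seq.append(entry.fixated_instrument)
    return seq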

Perhaps ironically, the gross characteristics of performance (total
flight time) and the lower-level features, such as the sequence of
instrument checks prior to making a change, are fairly easy to collect
data on and model. The harder level would be to predict what seems
like it should be the middle layer in the above; namely, it would be
hard if not impossible to predict the exact path of any given flight
through a dynamic flight environment. The problems here, though, are no
different for comparing model to data than they would be for comparing
one pilot's performance to another pilot's.

Establishing the reliability of the exact path of performance in a  
complex task is hard to do simply because there are so many variables  
that it is unlikely that even the same pilot put into the same  
beginning state of the system would have the exact same experience  
twice.

An example is provided by the Argus task that Mike Schoelles and I
have worked on (some references are given below). Argus is a fairly
simple radar-like task, but a complex one by the standards of most
laboratory studies. In Argus, people have to track targets as they move
on a screen and classify their threat value each time they fly into a
new sector. In our studies, students do this task in 12-15 min
scenarios; usually about 10 of these scenarios are spaced over two
days. Mean correct classifications range from 60% to 85%, depending on
the exact conditions of the study.
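
As a minimal sketch of that upper-level measure (and not the actual
Argus analysis code), percent correct per scenario and its mean could
be computed from trial-level records like this; the record layout below
is assumed, not the real Argus data format.

# Invented (subject, scenario, target_id, correct) records -- not real data.
from collections import defaultdict

classifications = [
    ("s01", 1, "t07", True),
    ("s01", 1, "t12", False),
    ("s01", 2, "t03", True),
]

def mean_percent_correct(records):
    """Average, over subject-by-scenario cells, of percent correctly classified."""
    by_cell = defaultdict(list)
    for subject, scenario, target, correct in records:
        by_cell[(subject, scenario)].append(correct)
    cell_means = [100.0 * sum(v) / len(v) for v in by_cell.values()]
    return sum(cell_means) / len(cell_means)

print(round(mean_percent_correct(classifications), 1))   # 75.0 for the toy records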

However, it is impossible to establish commonalities, either across
subjects or between model and subjects, in things such as the order in
which targets are checked. Trying to describe a common search method
for all subjects is pretty near impossible as well. To make things more
interesting, in one of our studies we combined the basic Argus task
with a second (not secondary) task of keeping the cursor over a target
airplane that jittered by itself (away from the radar) in the lower
right-hand part of the screen. The cursor turned blue when it was over
the jittering target most of the time, yellow when it was on and off
the target, and red when the Ss ignored this "jitter" task in favor of
the main classification task.

Our models of this task reproduced many of the overall characteristics
of human performance -- number of targets correctly classified and mean
correct tracking on the jitter task. The model did not exactly
reproduce any one S's data in terms of the sequence in which individual
targets were checked. Our interest in this particular task was in task
switching -- could our model predict when human Ss would decide to
switch from the classification task to the tracking task and back?
This focus gave us a lot of data, as people switched back and forth
quite a bit and our model also switched back and forth quite a bit.
However, our model would not switch tasks except at subtask boundaries
(during the classification task, once the model began to classify a
target by hooking it, it would stay focused on that target until it had
completed classifying it). Our humans interrupted the classification
task much more than our model did.

In this example, we established reliability at the upper level of
performance (number of targets correctly classified) between
experimental conditions, and the model matched those differences. At
the "middle" level (the sequence in which targets were selected and
classified), we did not establish any consistent differences within or
between experimental conditions, and therefore the model could not
predict behavior at this level. At the lower level of "when did people
and model switch tasks," the experimental data showed consistency
within conditions. However, this consistency was not matched by the
model, which enabled us to conclude that whatever people were doing,
they did not preferentially switch tasks at subtask boundaries.
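
A sketch of what that levels-of-analysis comparison might look like in
code (all numbers below are invented, and this is not the analysis we
actually ran): compare model and human aggregate accuracy per
condition, and separately ask the narrower question of what proportion
of task switches fall at subtask boundaries.

# Upper level: aggregate percent correct per condition (invented values).
human_pct_correct = {"cond_A": 82.0, "cond_B": 64.0}
model_pct_correct = {"cond_A": 80.5, "cond_B": 66.3}

def rmse(a: dict, b: dict) -> float:
    """Root-mean-square deviation between two condition->score mappings."""
    diffs = [(a[k] - b[k]) ** 2 for k in a]
    return (sum(diffs) / len(diffs)) ** 0.5

# Lower level: of all task switches, how many occur at a subtask boundary
# (e.g., just after a target has been fully classified)? Counts are invented.
human_switches = {"boundary": 14, "mid_subtask": 22}
model_switches = {"boundary": 31, "mid_subtask": 0}   # model never interrupts a subtask

def boundary_proportion(switches: dict) -> float:
    return switches["boundary"] / sum(switches.values())

print("aggregate RMSE:", round(rmse(human_pct_correct, model_pct_correct), 2))
print("human switches at boundaries:", round(boundary_proportion(human_switches), 2))
print("model switches at boundaries:", round(boundary_proportion(model_switches), 2))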

I hope this helps. I think it illustrates the dilemma of those who
would model complex tasks. You can only hope to model tasks at the
level at which you have consistency across human subjects. You can ask
many different lower-level questions of the human data (such as we did
in asking "when" humans switched tasks) and compare these to model
performance. However, these questions have to be fairly narrowly
focused. If humans do not follow a consistent path, then there is no
way that your models will be able to match your humans in terms of
moment-by-moment performance. If you can somehow characterize human
performance more abstractly (as we did in terms of when humans would
switch tasks), then you can compare models and humans at that level.
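
One way to operationalize "check for consistency before you model at
that level" is a simple reliability check on whatever measure you plan
to compare. The sketch below uses an odd/even split-half correlation on
a hypothetical per-subject measure (it needs Python 3.10+ for
statistics.correlation); the numbers are invented.

from statistics import correlation   # Python 3.10+

# Per-subject measure (say, proportion of switches at subtask boundaries),
# computed separately on odd- and even-numbered scenarios. Invented values.
odd_half  = [0.41, 0.35, 0.52, 0.47, 0.39]
even_half = [0.44, 0.31, 0.49, 0.50, 0.42]

r = correlation(odd_half, even_half)
print(f"split-half reliability r = {r:.2f}")
# If r is low, the humans are not consistent on this measure, and there is
# little point asking whether a model "matches" them at this level.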

Cheers,

Wayne

Schoelles, M. J., Neth, H., Myers, C. W., & Gray, W. D. (2006). Steps  
towards integrated models of cognitive systems: A levels-of-analysis  
approach to comparing human performance to model predictions in a  
complex task environment. In R. Sun (Ed.), Proceedings of the 28th  
Annual Meeting of the Cognitive Science Society (pp. 756-761). Austin,  
TX: Cognitive Science Society.

Schoelles, M. J., & Gray, W. D. (2001). Decomposing interactive
behavior. In Proceedings of the Twenty-Third Annual Conference of the
Cognitive Science Society (pp. 898–903). Mahwah, NJ: Erlbaum.

Schoelles, M. J., & Gray, W. D. (2001). Argus: A suite of tools for  
research in complex cognition. Behavior Research Methods, Instruments,  
& Computers, 33(2), 130–140.



On Sep 16, 2008, at 07:36, Florian Frische wrote:

> Hi all,
>
> I would like to ask you about the amount of experimental data needed
> to validate a model's behaviour.
> I think there are several parameters that have an effect on the amount
> of data (subjects, trials) that is needed.
> The following 2 examples should help to describe what this question is
> about:
>
> Example 1: Let's say we have a 2-armed bandit that always returns -1
> on the one arm and 1 on the other arm. We would like to validate the
> behaviour of our model against the behaviour of gamblers (the strategy
> that they use to maximize the overall outcome).
>
> Example 2: We have a complex flight task where a pilot has to avoid a
> thunderstorm. We would like to validate the behaviour of our pilot
> model in this task against real pilots' behaviour.
>
> It is obvious that example 1 is less complex than example 2, and I
> think that we need much more experimental data to get reliable results
> for the second example. I suppose there is a relationship between task
> complexity (e.g. independent variables) and the number of subjects and
> number of trials.
> But how can I assess how many subjects/trials I need (how many
> participants and how many repeats)?
>
> Thanks a lot for participating in this discussion,
>
> Florian Frische
>
> OFFIS
> FuE Bereich Verkehr | R&D Division Transportation
> Escherweg 2 - 26121 Oldenburg - Germany
> Phone.: +49 441 9722-523
> E-Mail: florian.frische at offis.de
> URL: http://www.offis.de
>

**Rensselaer**Rensselaer**Rensselaer**Rensselaer**Rensselaer**
Wayne D. Gray; Professor of Cognitive Science
Rensselaer Polytechnic Institute
Carnegie Building (rm 108) ;;for all surface mail & deliveries
110 8th St.; Troy, NY 12180

EMAIL: grayw at rpi.edu, Office: 518-276-3315, Fax: 518-276-3017

for general information see: http://www.rpi.edu/~grayw/

for On-Line publications see: http://www.rpi.edu/~grayw/pubs/downloadable_pubs.htm

for the CogWorks Lab see: http://www.cogsci.rpi.edu/cogworks/

If you just have formalisms or a model you are doing "operations
research" or "AI", if you just have data and a good study you are
doing "experimental psychology", and if you just have ideas you are
doing "philosophy" -- it takes all three to do cognitive science.




