[ACT-R-users] experiments to validate behaviour of a cognitive model
Wayne Gray
grayw at rpi.edu
Tue Sep 16 11:55:55 EDT 2008
Florian,
There is nothing particularly special about establishing the
reliability of experimental results in order to model those results.
The issues here are the same as for establishing the reliability of any
experimental result, whether it is modeled or not.
The complexity of the task often means that there are many different
strategies that human subjects can bring to bear in performing the
task. As a trivial example, the number of strategies available in
something like the attentional blink (or RSVP) task is small compared
to the number of strategies that could be brought to bear by an expert
playing chess (against an expert opponent).
Your example of a pilot flying through a thunderstorm is a lot like the
chess example in that small choices made early on may have a big
impact on how the game or flight plays out.
The real issue to focus on is at what level you wish to model (see
Newell's time scale of human activity; first used in Newell, A., &
Card, S. K. (1985). The prospects for psychological science in human-
computer interaction. Human-Computer Interaction, 1(3), 209–242; later
discussed in Newell, A. (1990). Unified theories of cognition.
Cambridge, MA: Harvard University Press.). That is, what human and
model data do you wish to compare?
One thing to think about is what aspect of performance in a complex
task you wish to predict. Total flight time? Success? Mean attitude and
speed? Changes in direction, altitude, or speed? Or perhaps the exact
path of the flight, down to the level of predicting all course changes?
More detail still? Which instruments they look at from moment to
moment? The sequence of instrument checks before they make a decision
to alter altitude, speed, or course? Eye movements?
Perhaps ironically, the gross characteristics of performance (total
flight time) and the lower-level features, such as the sequence of
instrument checks prior to making a change, are fairly easy to collect
data on and to model. The harder level would be to predict what seems
like it should be the middle layer in the above; namely, it would be
hard, if not impossible, to predict the exact path of any given flight
through a dynamic flight environment. The problems here, though, are no
different for comparing model to data than they would be for comparing
one pilot's performance to another pilot's performance.
Establishing the reliability of the exact path of performance in a
complex task is hard to do simply because there are so many variables
that it is unlikely that even the same pilot put into the same
beginning state of the system would have the exact same experience
twice.
An example is provided by the Argus task that Mike Schoelles and I
have worked on (some references are given below). Argus is a fairly
simple radar-like task, but a complex one by the standards of most
laboratory studies. In Argus, people have to track targets as they move
on a screen and classify their threat value each time they fly into a
new sector. In our studies, students do this task in 12-15 min
scenarios; usually about 10 of these scenarios are spaced over two
days. Mean correct classifications range from 60% to 85%, depending on
the exact conditions of the study.
However, it is impossible to establish commonalities, either across
subjects or between model and subjects, in things such as the order in
which targets are checked. Trying to describe a common search method
for all subjects is pretty near impossible as well. To make things more
interesting, in one of our studies we combined the basic Argus task
with a second (not secondary) task of keeping the cursor over a target
airplane that jittered by itself (away from the radar) in the lower
right-hand part of the screen. The cursor turned blue when it was over
the jittering target most of the time, yellow when it was on and off
the target, and red when the Ss ignored this "jitter" task in favor of
the main classification task.
Our models of this task reproduced many of the overall characteristics
of human performance -- number of targets correctly classified, mean
correct tracking on the jitter task. The model did not exactly
reproduce any one S's data in terms of the sequence in which individual
targets were checked. Our interest in this particular task was in task
switching -- could our model predict when human Ss would decide to
switch from the classification task to the tracking task and back?
This focus gave us a lot of data, as people switched back and forth
quite a bit, and our model also switched back and forth quite a bit.
However, our model would not switch tasks except at subtask boundaries
(during the classification task, once the model began to classify a
target by hooking it, it stayed focused on that target until it had
completed classifying it). Our humans interrupted the classification
task much more than our model did.
In this example, we established reliability at the upper levels of
performance (number of targets correctly classified) between
experimental conditions, and the model matched those differences. At
the "middle" level -- the sequence in which targets were selected and
classified -- we did not establish any consistent differences within or
between experimental conditions, and (therefore) the model could not
predict behavior at this level. At the lower level of "when did people
and model switch tasks," the experimental data showed consistency
within conditions. However, this consistency was not matched by the
model, which enabled us to conclude that, whatever people were doing,
they did not preferentially switch tasks at subtask boundaries.
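To make these levels concrete, here is a toy sketch in Python of the
two kinds of comparison. None of this is the actual Argus analysis
code; the condition names, numbers, and switch codings are invented
purely for illustration.

# Toy sketch of comparing model and human data at two levels.
# All names and numbers below are hypothetical, not Argus data.

# Upper level: mean proportion correct per experimental condition.
human_means = {"condition_A": 0.82, "condition_B": 0.65}   # hypothetical
model_means = {"condition_A": 0.80, "condition_B": 0.68}   # hypothetical

for cond in human_means:
    diff = human_means[cond] - model_means[cond]
    print(cond, "human:", human_means[cond],
          "model:", model_means[cond], "diff:", round(diff, 3))

# Lower level: a narrowly focused question -- what proportion of task
# switches fall on subtask boundaries (i.e., after a target has been
# fully classified) rather than interrupting a classification?
human_switches = [True, False, False, True, False, True]   # hypothetical codings
model_switches = [True, True, True, True, True]            # switches only at boundaries

def boundary_rate(switches):
    """Proportion of switch events that occur at subtask boundaries."""
    return sum(switches) / len(switches)

print("human boundary rate:", round(boundary_rate(human_switches), 2))
print("model boundary rate:", round(boundary_rate(model_switches), 2))

The upper-level comparison needs only condition means; the lower-level
comparison requires coding each switch event for whether it fell on a
subtask boundary.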
I hope this helps. I think it illustrates the dilemma faced by those
who would model complex tasks. You can only hope to model tasks at the
level at which you have consistency across human subjects. You can ask
many different lower-level questions of the human data (such as we did
in asking "when" humans switched tasks) and compare these to model
performance. However, these questions have to be fairly narrowly
focused. If humans do not follow a consistent path, then there is no
way that your models will be able to match your humans in terms of
moment-by-moment performance. If you can somehow characterize human
performance more abstractly (as we did in terms of when humans would
switch tasks), then you can compare models and humans at that level.
Cheers,
Wayne
Schoelles, M. J., Neth, H., Myers, C. W., & Gray, W. D. (2006). Steps
towards integrated models of cognitive systems: A levels-of-analysis
approach to comparing human performance to model predictions in a
complex task environment. In R. Sun (Ed.), Proceedings of the 28th
Annual Meeting of the Cognitive Science Society (pp. 756-761). Austin,
TX: Cognitive Science Society.
Schoelles, M. J., & Gray, W. D. (2001). Decomposing interactive
behavior. In Proceedings of the Twenty-Third Annual Conference of the
Cognitive Science Society (pp. 898–903). Mahwah, NJ.
Schoelles, M. J., & Gray, W. D. (2001). Argus: A suite of tools for
research in complex cognition. Behavior Research Methods, Instruments,
& Computers, 33(2), 130–140.
On Sep 16, 2008, at 07:36, Florian Frische wrote:
> Hi all,
>
> I would like to ask you about the amount of experimental data needed
> to validate a model's behaviour.
> I think there are several parameters that have an effect on the amount
> of data (subjects, trials) that is needed.
> The following 2 examples should help to describe what this question is
> about:
>
> Example 1: Let's say we have a 2-armed bandit that always returns -1
> on one arm and 1 on the other arm. We would like to validate the
> behaviour of our model against the behaviour of gamblers (the strategy
> that they use to maximize the overall outcome).
>
> Example 2: We have a complex flight task where a pilot has to avoid a
> thunderstorm. We would like to validate the behaviour of our pilot
> model in this task against real pilots' behaviour.
>
> It is obvious that example 1 is less complex than example 2, and I
> think that we need much more experimental data to get reliable results
> for the second example. I suppose there is a relationship between task
> complexity (e.g. independent variables) and the number of subjects/
> number of trials.
> But how can I assess how many subjects/trials I need (how many
> participants and how many repeats)?
>
> Thanks a lot for participating in this discussion,
>
> Florian Frische
>
> OFFIS
> FuE Bereich Verkehr | R&D Division Transportation
> Escherweg 2 - 26121 Oldenburg - Germany
> Phone.: +49 441 9722-523
> E-Mail: florian.frische at offis.de
> URL: http://www.offis.de
>
**Rensselaer**Rensselaer**Rensselaer**Rensselaer**Rensselaer**
Wayne D. Gray; Professor of Cognitive Science
Rensselaer Polytechnic Institute
Carnegie Building (rm 108) ;;for all surface mail & deliveries
110 8th St.; Troy, NY 12180
EMAIL: grayw at rpi.edu, Office: 518-276-3315, Fax: 518-276-3017
for general information see: http://www.rpi.edu/~grayw/
for On-Line publications see: http://www.rpi.edu/~grayw/pubs/downloadable_pubs.htm
for the CogWorks Lab see: http://www.cogsci.rpi.edu/cogworks/
If you just have formalisms or a model you are doing "operations
research" or "AI", if you just have data and a good study you are
doing "experimental psychology", and if you just have ideas you are
doing "philosophy" -- it takes all three to do cognitive science.