Summary (long): pattern recognition comparisons

Aaron Sloman aarons at cogs.sussex.ac.uk
Sun Aug 5 07:57:52 EDT 1990


> From: Leonard Uhr <uhr at cs.wisc.edu>
>
> Neural nets using backprop have only handled VERY SIMPLE images.....
>  .......In sharp contrast, pr/computer vision systems are designed
> to handle MUCH MORE COMPLEX images (eg houses, furniture) in
> 128-by-128 or even larger inputs....
	.....

> From: Nici Schraudolph <schraudo%cs at ucsd.edu>
> Well, Gary Cottrell for instance has successfully used a standard (3-layer,
> fully interconnected) backprop net for various face recognition tasks from
> 64x64 images.  While I agree with you that many NN architectures don't scale
> well to large input sizes, and that modular, heterogeneous architectures have
> the potential to overcome this limitation, I don't understand why you insist
> that current NNs could only handle simple images - unless you consider any
> image with less than 16k pixels simple.  Does face recognition qualify as a
> complex visual task with you?
> ......

Characterising the complexity of the task in terms of the number of
pixels seems to me to miss the most important points.

Some (but by no means all) of the people working on NNs appear to have
joined the field (the bandwagon?) without feeling obliged to study the
AI literature on vision, perhaps on the assumption that since the AI
mechanisms are "wrong", all the literature must be irrelevant.

On the contrary, good work in AI vision was concerned with understanding
the nature of the task (or rather tasks) of a visual system,
independently of the mechanisms postulated to perform those tasks. (When
your programs fail you learn more about the nature of the task.)

Recognition of isolated objects (e.g. face recognition) is just _one_ of
the tasks of vision.

Others include:

(a) Interpreting a 2-D array (retinal array or optic array) in terms of
3-D structures and relationships. Seeing the 3-D structure of a face is
a far more complex task than simply attaching a label: "Igor", "Bruce"
or whatever.
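
To make (a) concrete, here is a minimal sketch of just one 2-D-to-3-D
cue, depth from binocular disparity, assuming an idealised, rectified
pinhole stereo rig (the focal length, baseline and image coordinates
are all invented for illustration):

    # Depth from binocular disparity for an idealised rectified stereo
    # pair: Z = f * B / d, where f is the focal length, B the baseline
    # between the cameras, and d the disparity of the matched point.
    def depth_from_disparity(x_left, x_right, f=1.0, B=0.1):
        d = x_left - x_right          # disparity (same units as f)
        if d <= 0:
            raise ValueError("no positive disparity: point at/beyond infinity")
        return f * B / d

    # A point seen at x=0.30 in the left image and x=0.25 in the right:
    print(depth_from_disparity(0.30, 0.25))   # -> 2.0 scene units

Even this toy case shows the shift in task: the output is a 3-D
quantity, not a label.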

(b) Segmenting a complex scene into separate objects and describing the
relationships between them (e.g. "houses, furniture"!). (The
relationships include 2-D and 3-D spatial and functional relations.)
Because evidence for boundaries is often unclear and ambiguous, and
because recognition has to be based on combinations of features, the
segmentation often cannot be done without recognition and recognition
cannot be done without segmentation. This chicken and egg problem can
lead to dreadful combinatorial searches. NNs offer the prospect of doing
some of the searching in parallel by propagating constraints (a toy
sketch follows below), but as far as I know they have not yet matched
the more sophisticated AI visual systems.

(It is important to distinguish segmentation, recognition and
description of 2-D image fragments from segmentation, recognition and
description of 3-D objects. The former seems to be what people in
pattern recognition and NN research concentrate on most. The latter has
been a major concern of AI vision work since the mid/late sixties,
starting with L.G. Roberts I think, although some people in AI have
continued trying to find 2-D cues to 3-D segmentation. Both 2-D and 3-D
interpretations are important in human vision.)
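
To make the constraint-propagation remark in (b) concrete, here is a
toy relaxation-labelling sketch in the spirit of Rosenfeld, Hummel and
Zucker: each pixel holds a probability distribution over labels, and
neighbouring pixels adjust each other's distributions in parallel, so
that label (recognition) and boundary (segmentation) decisions
co-evolve. All the numbers are invented; real systems of this kind run
on 2-D images with many labels.

    import numpy as np

    # Toy relaxation labelling on a 1-D "image" of five pixels, with
    # two labels per pixel (object, background).
    obs = np.array([0.2, 0.4, 0.9, 0.85, 0.3])     # noisy object evidence
    p = np.stack([obs, 1.0 - obs], axis=1)         # P(object), P(background)
    compat = np.array([[1.0, -0.5],                # like labels support,
                       [-0.5, 1.0]])               # unlike labels inhibit
    for _ in range(20):
        support = np.zeros_like(p)
        for i in range(len(p)):
            for j in (i - 1, i + 1):               # immediate neighbours
                if 0 <= j < len(p):
                    support[i] += compat @ p[j]
        p = np.clip(p * (1.0 + 0.2 * support), 1e-6, None)
        p /= p.sum(axis=1, keepdims=True)          # renormalise
    print(p.argmax(axis=1))                        # 0 = object, 1 = background

The point is that all pixels update at once: no serial search through
segmentation hypotheses.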

(c) Seeing events, processes and their relationships. Change "2-D" to
"3-D" and "3-D" to "4-D" in (b) above. We are able to segment, recognize
and describe events, processes and causal relationships as well as
objects (e.g. following, entering, leaving, catching, bouncing,
intercepting, grasping, sliding, supporting, stretching, compressing,
twisting, untwisting, etc. etc.). Sometimes, as Johansson showed by
attaching lights to human joints in a dark room, motion can be used
to disambiguate 3-D structure.
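
The Johansson point can be demonstrated computationally. The following
sketch (my own illustration, using a rank-3 factorisation under an
assumed orthographic camera, with synthetic data) recovers the 3-D
configuration of moving points, up to an affine ambiguity, from nothing
but their 2-D trajectories; the extra step needed for a metric
reconstruction is omitted.

    import numpy as np

    # Synthetic rigid cloud of 3-D points rotating in front of an
    # orthographic camera; only the 2-D projections are "observed".
    rng = np.random.default_rng(0)
    pts3d = rng.standard_normal((3, 20))              # 20 points
    views = []
    for t in np.linspace(0.0, 1.0, 12):               # 12 frames
        c, s = np.cos(t), np.sin(t)
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotate about y
        views.append((R @ pts3d)[:2])                 # orthographic: drop z
    W = np.vstack(views)                              # 2F x N measurements
    W = W - W.mean(axis=1, keepdims=True)             # centre each row

    # Rigidity forces W to have rank 3, so a rank-3 factorisation
    # recovers the point configuration up to a 3x3 (affine) ambiguity.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    structure = np.diag(S[:3]) @ Vt[:3]               # 3 x N affine shape
    print(np.linalg.matrix_rank(W))                   # -> 3

Pure 2-D motion, interpreted under a rigidity constraint, yields 3-D
structure.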

(d) Providing information and/or control signals for motor-control
mechanisms: e.g. visual feedback is used (unconsciously) for posture
control in sighted people, and for controlling movement of arm, hand and
fingers in grasping, etc. (I suspect that many such processes of fine
tuning and control use changing 2-D "image" information rather than (or
in addition to) 3-D structural information.)
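
As an illustration of (d), here is a toy "visual servo" loop that
steers a hand toward a target using only the changing 2-D image-plane
error, with no 3-D reconstruction at all (the gain and geometry are
invented):

    # Proportional visual servoing in the image plane: repeatedly move
    # a fraction of the way toward the target's observed image position.
    def visual_servo(hand, target, gain=0.3, steps=100, tol=1e-3):
        for _ in range(steps):
            ex, ey = target[0] - hand[0], target[1] - hand[1]
            if (ex * ex + ey * ey) ** 0.5 < tol:      # close enough
                break
            hand = (hand[0] + gain * ex, hand[1] + gain * ey)
        return hand

    print(visual_servo((0.0, 0.0), (0.5, -0.2)))      # converges on target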

That's still only a partial list of the tasks of a visual system.
For more detail see:
 A. Sloman, `On designing a visual system: Towards a Gibsonian
 computational model of vision', Journal of Experimental and
 Theoretical AI, 1(4), 1989.

 D.H. Ballard and C.M. Brown,
 Computer Vision,
 Englewood Cliffs, NJ: Prentice-Hall, 1982.

A system might be able to recognize isolated faces or other objects in
an image by using mechanisms that would fail miserably in dealing with
cluttered scenes where recognition and segmentation need to be combined.
So an NN that recognised faces might tell us nothing about how it is done
in natural visual systems, if the latter use more general mechanisms.

One area in which I think neither AI nor NN work has made significant
progress is shape perception. (I don't mean shape recognition!). People,
and presumably many other animals, can see complex, intricate, irregular
and varied shapes in a manner that supports a wide range of tasks,
including recognizing, grasping, planning, controlling motion,
predicting the consequences of motion, copying, building, etc. etc.
Although a number of different kinds of shape representations have been
explored in work on computer vision, CAD, graphics etc. (e.g. feature
vectors; logical descriptions; networks of nodes and arcs; numbers
representing co-ordinates, orientations, curvature etc; systems of
equations for lines, planes, and other mathematically simple structures;
fractals; etc. etc. etc.) they all seem capable of capturing only a
superficial subset of what we can see when we look at kittens, sand
dunes, crumpled paper, a human torso, a shrubbery, cloud formations,
under-water scenes, etc. (Work on computer graphics is particularly
misleading, because people are often tempted to think that a
representation that _generates_ a natural looking image on a screen must
capture what we see in the image, or in the scene that it depicts.)
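
To fix ideas, here are two of those representations applied to the same
trivial shape, a unit square (both encodings are invented for
illustration); neither comes anywhere near capturing what we see in a
kitten or a sand dune:

    # (1) A feature vector: a few global numbers.
    square_features = [4, 4, 1.0]        # vertices, edges, aspect ratio

    # (2) A network of nodes and arcs: vertices plus labelled relations.
    square_structure = {
        "nodes": ["a", "b", "c", "d"],   # corners
        "arcs": [("a", "b", "horizontal"), ("b", "c", "vertical"),
                 ("c", "d", "horizontal"), ("d", "a", "vertical")],
    }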

Does anyone have any idea what kind of breakthrough is needed in order
to give a machine the kind of grasp of shape that can explain animal
abilities to cope with real environments?

Is there anything about NN shape representations that gives them an
advantage over others that have been explored, and if so, what is it?

I suspect that going for descriptions of static geometric structure is a
dead end: seeing a shape really involves seeing potential processes
involving that shape, and their limits (something like what J.J. Gibson
meant by "affordances"?). I.e. a 3-D shape is inherently a vast array of
4-D possibilities and one of the tasks of a visual system is computing a
large collection of those possibilities and making them readily
available for a variety of subsequent processes.

But that's much too vague an idea to be very useful. Or is it?
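
As one gesture toward making it less vague: a shape description could
be paired with the processes it affords and their limits, along the
following deliberately crude lines, where every object, process and
limit is an invented placeholder:

    # A shape as a bundle of afforded processes with limits, rather
    # than static geometry alone. All entries are illustrative only.
    affordances = {
        "mug": {
            "grasp": ("handle_gap_cm", 1.5, 4.0),   # hand must fit gap
            "pour":  ("tilt_deg", 60.0, 120.0),
            "stack": ("load_kg", 0.0, 0.5),
        },
    }

    def afforded(obj, process, value):
        # Is `process` possible for `obj` at this parameter value?
        _, lo, hi = affordances[obj][process]
        return lo <= value <= hi

    print(afforded("mug", "grasp", 2.0))    # -> True
    print(afforded("mug", "pour", 30.0))    # -> False

The hard, unsolved part is computing such possibility structures from
images, rather than hand-coding them.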

Aaron Sloman,
School of Cognitive and Computing Sciences,
Univ of Sussex, Brighton, BN1 9QH, England
    EMAIL   aarons at cogs.sussex.ac.uk
or:
            aarons%uk.ac.sussex.cogs at nsfnet-relay.ac.uk


