simple pictures, tough problems (language grounding)

Ross Gayler ross at psych.psy.uq.oz.au
Fri Jul 19 03:31:49 EDT 1991


Neza van der Leeuw (Computational Linguistics, University of Amsterdam) writes:

>I am currently working on the "grounding problem", in the sense that I try to
>derive meaning of language from pictures coupled with sentences ...
> 
>The problem is that I want the architecture to generalize. This means that it
>should group small and large circles, rather than circles of size X with
>squares of size nearly X. ...
> 
>One could avoid this problem by choosing a representation in language-like
>propositions, but this seems to me to be the "solving the problem by doing it
>yourself" approach. Some represenational mechanism should provide the answer
>to my problem, but I want it to be "picture-like" instead of "language-like".
>Otherwise I bounce back into my favourite "grounding problem" again.

I think you will have to do one of two things: either hand-craft an environment
representation that is biased towards generalisation along the directions that
you think are 'natural'; or provide the system with some way of manipulating
the environment so that some properties of the environment are invariant under
manipulation.

It may help to think of the problem in terms of learning a first language and
a second language.  In learning a second language, the concepts are (mostly)
already present - so only the language has to be learned.  In learning a first
language, the concepts are learned more or less at the same time as the
language (without wanting to get into an argument about the Whorf hypothesis).

I think that learning concepts by pure observation (as your system has to) is
generally impossible.  What is there in the input to suggest that 'circle-ness'
or 'square-ness' is a better basis for generalisation than pixel overlap?
Imagine that a civilisation living on the surface of a neutron star has just
made contact with us by sending a subtitled videotape of life on their world.
The environmental scenes would make no sense to us - we probably would not
even be able to segment the scene into objects.  So how could we learn the
language if we can't 'understand' the picture?
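The pixel-overlap point is easy to make concrete. In a toy sketch (not from the original post; shapes, grid size, and the Jaccard overlap measure are all my assumptions), a small circle overlaps a similarly sized square far more than it overlaps a large circle - so a learner driven by raw pixel overlap would group by size, not by 'circle-ness':

```python
import numpy as np

def disk(size, r):
    """Binary image of a filled circle of radius r, centred on a size x size grid."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2
    return (x**2 + y**2 <= r**2).astype(int)

def square(size, half):
    """Binary image of a filled axis-aligned square with half-width `half`."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2
    return ((np.abs(x) <= half) & (np.abs(y) <= half)).astype(int)

def overlap(a, b):
    """Jaccard overlap of two binary images: |intersection| / |union|."""
    return (a & b).sum() / (a | b).sum()

N = 64
small_circle = disk(N, 8)
small_square = square(N, 8)    # roughly the same size as the small circle
large_circle = disk(N, 24)

# Pixel overlap groups the small circle with the similarly sized square,
# not with the other circle.
print(overlap(small_circle, small_square))   # large
print(overlap(small_circle, large_circle))   # small
```

Nothing in the pixel data itself says this grouping is wrong; that is the sense in which pure observation underdetermines the 'natural' concepts.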

The human visual system has certain generalisation biases (say, based on
edge detectors etc), but I think a stronger requirement is to be able to
manipulate the environment.  By grabbing an object and moving it backwards
and forwards under our control, we can learn that the shape remains constant
while the apparent size varies.
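One way to picture what manipulation buys you (again a toy sketch of my own, not the post's proposal - the grid, the `compactness` measure, and the shapes are all assumptions): treat 'moving the object backwards and forwards' as varying its apparent radius. The raw size (pixel count) changes with distance, but a simple scale-invariant cue stays constant per shape, giving the learner something to generalise on:

```python
import numpy as np

def disk(size, r):
    """Binary image of a filled circle of radius r, centred on a size x size grid."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2
    return (x**2 + y**2 <= r**2).astype(int)

def square(size, half):
    """Binary image of a filled axis-aligned square with half-width `half`."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2
    return ((np.abs(x) <= half) & (np.abs(y) <= half)).astype(int)

def compactness(img):
    """Fill ratio of the bounding box - insensitive to apparent size."""
    ys, xs = np.nonzero(img)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    return img.sum() / (h * w)

# Varying r mimics the object moving nearer and farther under our control:
# the pixel count varies, the shape cue stays put (near pi/4 for circles).
for r in (6, 12, 24):
    print(r, disk(64, r).sum(), compactness(disk(64, r)))
# Squares of any size share a different constant value (exactly 1.0).
```

The invariant is only discoverable because the system can vary distance while knowing the object itself did not change - which is exactly the leverage passive observation lacks.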


I would appreciate any references to more formal arguments for the necessity
(or otherwise) of being able to manipulate the environment (or perceptual
apparatus) in order to learn 'natural' concepts.

Ross Gayler
ross at psych.psy.uq.oz.au
