shift invariance

Geoffrey Hinton hinton at cs.toronto.edu
Thu Feb 22 10:25:32 EST 1996


Dear Jerry,

It's a long time since we had a really good disagreement.

Contrary to your assertions, shift invariance can be learned by
backpropagation. It was one of the problems that I tried when fiddling about
with backprop in the mid-1980s. I published a paper demonstrating this in the
proceedings of an obscure conference:

Hinton, G. E. (1987)
Learning translation invariant recognition in a massively parallel network.
In Goos, G. and Hartmanis, J., editors, PARLE: Parallel
Architectures and Languages Europe, pages 1-13, Lecture Notes in Computer
Science, Springer-Verlag, Berlin.

So far as I know, this is also the first paper to demonstrate that weight
decay can make a really big difference in generalization performance.
It reduced the error rate from about 45% to about 6%, though
I must confess that the amount of weight decay was determined by using the
test set (as was usual in our sloppy past). 
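
In case anyone is unfamiliar with the trick: weight decay just means
penalizing the squared weights, which amounts to shrinking every weight
slightly towards zero at each update. A rough sketch (the decay coefficient
here is purely illustrative, not the value used in the paper):

def update(w, grad, lr=0.1, decay=0.001):
    # Ordinary gradient step plus the decay term, which pulls each weight
    # towards zero; it is the gradient of an extra 0.5*decay*w**2 penalty.
    return w - lr * (grad + decay * w)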

I used a one dimensional "retina" with 12 pixels.  On this retina there was
one instance of a shape at a time.  The "shape" consisted of two bright
"boundary" pixels with 4 pixels between them.  The 4 pixels in the sandwich
could have any of the 16 binary patterns, so there were 16 very confusable
shapes.  For example, here are two instances of the shape corresponding to the
binary number 0011 (strip off the boundary bits before reading the number):

000100111000

010011100000

The retina had wraparound so that each shape could occur in 12 different
positions. This seems to me to be exactly the kind of data that you think
cannot be learned.  In other words, if I train a neural network to identify
some instances of the shapes, you think that it couldn't possibly generalize to
instances in other positions.
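
For anyone who wants to try this, the training set is trivial to generate.
Here is a rough sketch in Python (the names are mine, and modern Python is
obviously not what we used in 1987):

def make_shapes(retina=12, inner=4):
    # Every shape is two bright boundary pixels with 'inner' shape pixels in
    # between, placed at every position on a wraparound retina.
    patterns, labels = [], []
    for shape in range(2 ** inner):                 # 16 confusable shapes
        bits = [(shape >> (inner - 1 - i)) & 1 for i in range(inner)]
        template = [1] + bits + [1]                 # boundary, shape bits, boundary
        for pos in range(retina):                   # 12 positions, wraparound
            image = [0] * retina
            for i, v in enumerate(template):
                image[(pos + i) % retina] = v
            patterns.append(image)
            labels.append(shape)
    return patterns, labels

That gives the full set of 192 cases (16 shapes in each of 12 positions);
some of the cases were held out for testing.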

Of course, once you understand why it can generalize, you will decide on a way
to exclude this kind of example, but so far it seems to me to fit your
assertion about what cannot be done.

The network had two hidden layers.  In the first hidden layer there were 60
units divided into 12 groups of 5 with local receptive fields.  So we are
telling it about locality, but not about translation.  Within each group, all
5 units receive input from the same 6 adjacent input units.  In the next
hidden layer there is a bottleneck of only 6 units (I didn't dare use 4), so
all the information used to make the final decision has to be represented in a
distributed pattern of activity in the bottleneck.  There are 16 output units
for the 16 shapes.
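
For concreteness, here is roughly what the architecture amounts to (a sketch,
not the original code; biases are omitted, and I am assuming the receptive
fields wrap around the retina the same way the shapes do):

import numpy as np

RETINA, GROUPS, PER_GROUP, RF, BOTTLENECK, SHAPES = 12, 12, 5, 6, 6, 16

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Connectivity mask for the first hidden layer: the 5 units in group g all
# see the same 6 adjacent pixels, starting at pixel g (with wraparound).
mask = np.zeros((RETINA, GROUPS * PER_GROUP))
for g in range(GROUPS):
    for i in range(RF):
        mask[(g + i) % RETINA, g * PER_GROUP:(g + 1) * PER_GROUP] = 1.0

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.3, (RETINA, GROUPS * PER_GROUP)) * mask  # 12 -> 60, local
W2 = rng.normal(0.0, 0.3, (GROUPS * PER_GROUP, BOTTLENECK))     # 60 -> 6 bottleneck
W3 = rng.normal(0.0, 0.3, (BOTTLENECK, SHAPES))                 # 6 -> 16 outputs

def forward(image):
    h1 = sigmoid(np.asarray(image, dtype=float) @ W1)  # position-dependent features
    h2 = sigmoid(h1 @ W2)                               # squeezed through the bottleneck
    return sigmoid(h2 @ W3)                             # one output unit per shape

During training, the gradients for W1 would also have to be multiplied by the
mask so that the local connectivity is preserved; there is no weight sharing
anywhere.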

The idea behind the network is as follows: Shapes are composed of features
that they share with other shapes.  Although we may not have seen a particular
shape in a novel position, we will presumably have seen its features in those
positions before.  So if we have already developed translation invariant
feature detectors, and if we represent our knowledge of the shape in terms of
these detectors, we can generalize across translation.  The "features" in this
example are the values of the four pixels inside the sandwich. A hidden unit
in the first hidden layer can see the whole sandwich, so it could learn to
respond to the conjunction of the two boundary pixels and ONE of the four
internal "shape feature" pixels. Its weights might look like this:

...+.+..+...

It would then be a position-dependent feature detector.  In each location we
have five such units to enable the net to develop all 4 position-dependent
feature detectors (or funny combinations of them that span the space). In the
next layer, we simply perform an OR for the 12 different copies of the same
feature detector in the 12 different positions.  So in the next layer we have
position-independent feature detectors.  Finally the outgoing connections from
this layer represent the identity of a shape in terms of its
position-independent feature detectors.  Notice that the use of an OR should
encourage the net to choose equivalent position-dependent feature detectors in
the different locations, even though there is no explicit weight sharing.  The
amazing thing is that simply using backprop on the shape identities is
sufficient to create this whole structure (or rather one of the zillions of
mangled versions of it that uses hard-to-decipher distributed
representations).  Thanks to Kevin Lang for writing the Convex code that made
this simulation possible in 1987.
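
To see why a solution of this form generalizes perfectly across position, you
can wire up the idealized version by hand (my reconstruction of the clean
solution; what backprop actually finds is a mangled, distributed version of
the same thing):

def classify(image, retina=12, inner=4):
    # Position-dependent detectors: at each position p, fire only if both
    # boundary pixels are on, then read off the internal pixels there.
    features = [0] * inner                  # position-independent features
    for p in range(retina):
        if image[p] and image[(p + inner + 1) % retina]:
            for f in range(inner):
                # OR across the 12 positions gives translation invariance
                features[f] |= image[(p + 1 + f) % retina]
    # The identity of the shape is just the 4-bit pattern of invariant features.
    return sum(b << (inner - 1 - f) for f, b in enumerate(features))

Run over the 192 cases from make_shapes above, it classifies every one
correctly, which is the whole point: once the features are position
independent, nothing about a novel position is actually novel.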

Please note that the local connectivity was NOT NECESSARY to get
generalization.  Without it the net still got 20/32 correct (guessing would be
2/32). 

Now, I don't for a moment believe that human shape perception is learned
entirely by backpropagating from object identity labels.  The simulation was
simply intended to answer the philosophical point about whether this was
impossible. It's taken nearly a decade for someone to come out and publicly
voice the widely held belief that there is no way in hell a network could
learn this.

Thanks
Geoff

PS: I do think that there may be some biological mileage in the idea that
local, position-dependent feature detectors are encouraged to break symmetry
in the same way in order for later stages of processing to be able to achieve
position independence by just performing an OR. An effect like this ought to
occur in slightly less unrealistic models like Helmholtz machines.

