shift invariance

dorffner at cns.bu.edu
Thu Feb 22 21:05:34 EST 1996


Hi fellow connectionists,

I must say I'm a little puzzled by this discussion about shift invariance.
It was started by Jerry Feldman, who wrote:

>  Shift invariance is the ability  of a neural system to recognize a pattern
> independent of where it appears on the retina. It is generally understood that
> this property can not be learned by neural network methods, but I have
> not seen a published proof. A "local" learning rule is one that updates the
> input weights of a unit as a function of the unit's own activity and some
> performance measure for the network on the training example. All biologically
> plausible learning rules, as well as all backprop variants, are local in this
> sense.

Now I always thought this was so obvious that it didn't need any proof.

Geoff Hinton responded by disagreeing:

> Contrary to your assertions, shift invariance can be learned by
> backpropagation. It was one of the problems that I tried when fiddling about
> with backprop in the mid 80's. I published a paper demonstrating this in an
> obscure conference proceedings:

He describes a model based on feature detectors and subsequent backpropagation
that can actually generalize over different positions. He finishes by
saying

> The simulation was
> simply intended to answer the philosophical point about whether this was
> impossible. It's taken nearly a decade for someone to come out and publicly
> voice the widely held belief that there is no way in hell a network could
> learn this.
> 

IMHO, there seems to be a misunderstanding of what the topic of discussion
is here. I don't think that Jerry meant that no model consisting of neural
network components could ever learn shift invariance. After all, there are
many famous examples in visual recognition with neural networks (such as
the Neocognitron, as Rolf Würtz pointed out), and if this were really
impossible, we would have to give up neural network research in 
perceptual modeling altogether.

What I think Jerry meant is that any cascade of fully-connected feed-forward
connection schemes between layers (including the perceptron and the MLP)
cannot learn shift invariance, in the sense of generalizing to positions it
has never seen during training (a minimal sketch of that generalization test
is appended at the end of this message). Besides being fairly obvious, this
does raise some important questions, possibly weakening the foundations of
connectionism. Let me explain why:

- state spaces in connectionist layers (given the assumption that activation
  patterns are viewed as vectors) span a Euclidean space, and each connection
  scheme that maps patterns into another layer applies a certain kind of
  metric defining similarity. This metric is non-trivial, especially in
  MLPs, but it restricts what such a basic neural network component
  (i.e. a fully connected feedforward mapping) can treat as similar. Patterns
  that are close in this space according to a distance measure, or patterns
  that have large projections onto each other, i.e. large inner products
  (in my analysis the basic similarity measure in MLPs), are similar
  according to this metric. Two patterns that contain the same sub-pattern
  at different positions obviously are NOT. Neither are patterns that share
  only the differences between their components (e.g. the patterns
  (0.8 0.3) and (0.6 0.1)), and there is a whole list of other examples
  (a small numerical illustration follows this list). That is why we have
  to be so careful about the right kind of preprocessing when we apply
  neural networks in engineering, and why we have to be equally careful in
  choosing appropriate representations in connectionist cognitive modeling.

- introducing feature detectors and other complex connectivities and learning
  schemes (weight sharing, or the OR Geoff mentioned) is a way of translating
  the original pattern space into a new space in which the similarity
  structures we care about do obey the above metric again. It is the same
  thing we do in preprocessing (e.g. we apply an FFT to a signal because we
  cannot expect the network to extract, from the raw samples, invariances
  that only become evident in the frequency domain).

- Geoff's model, the Neocognitron, and many others do exactly that. Each
  single component (e.g. one feature detector) is restricted by the
  similarity metric mentioned above. But by applying non-linear functions,
  and by combining the detectors' outputs in a clever way, these models
  translate the original patterns into a new pattern space where similarity
  corresponds to this metric again (e.g. for the final backprop network
  Geoff introduced). A miniature version of this translation step is
  sketched further below.
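
To make the first two points concrete, here is a small numerical sketch
(plain NumPy; the particular vectors are arbitrary choices of mine, not
anyone's data). It compares a pattern with a shifted copy of itself and
with a genuinely different pattern, first in raw activation space and then
after a fixed, shift-invariant preprocessing step (the DFT magnitude,
standing in for whatever transform fits the task at hand):

import numpy as np

# Three activation patterns on a length-8 "retina" (arbitrary values):
# the sub-pattern [1, 0, 1] at position 0, the same sub-pattern shifted
# to position 4, and a pattern that does not contain it at all.
a = np.array([1., 0., 1., 0., 0., 0., 0., 0.])   # sub-pattern at position 0
b = np.roll(a, 4)                                # same sub-pattern, shifted
c = np.array([1., 1., 0., 0., 1., 0., 0., 0.])   # a different pattern

def report(label, x, y):
    print(f"{label}: distance = {np.linalg.norm(x - y):.3f}, "
          f"inner product = {np.dot(x, y):.3f}")

# In raw activation space the shifted copy is no closer to the original
# than the unrelated pattern (here it is in fact farther away, and
# orthogonal to it).
report("a vs. shifted a   (raw)  ", a, b)
report("a vs. different c (raw)  ", a, c)

# After a fixed shift-invariant transform (the DFT magnitude is exactly
# invariant under circular shifts) the shifted copy becomes identical to
# the original, while the different pattern stays different.
A, B, C = (np.abs(np.fft.fft(v)) for v in (a, b, c))
report("a vs. shifted a   (|FFT|)", A, B)
report("a vs. different c (|FFT|)", A, C)

The FFT here is only a stand-in: the point is that the invariance is put
in by the fixed transform, not discovered by the metric itself.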

Now obviously, when we look at the human visual system, the brain does seem
to do this kind of preprocessing as well, for instance through feature
detectors. So we are reasonably safe here. But the observation above does
make one wonder whether the similarity metric a neural network basically
applies is really the right kind of basis for cognitive modeling. Think
about it: by introducing complex wiring and learning schemes, and by
carefully choosing representations, we go to great lengths just to satisfy
the actual neural network that has to do the job of extracting information
from the patterns. Visual recognition is only one, although prominent,
example. What makes us sure that deeper processes ARE of the kind a fully
connected feedforward network can handle (i.e. that those processes DO
operate on the restricted similarity metric described above)?
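
As a miniature version of the feature-detector route mentioned above (and
of the translation step in the third point): a single local detector with
the same shared weights applied at every position, followed by a maximum
(or OR) over positions, gives the same output for every shifted copy of a
sub-pattern, so whatever network sits on top of it only ever sees the
translated space. A sketch, again with arbitrary numbers of my own:

import numpy as np

# One local detector with shared weights, swept over all positions of the
# input (a 1-D convolution), followed by a max over positions.  The pooled
# value is the same wherever the sub-pattern appears, as long as it lies
# fully inside the input.  Weights and patterns are arbitrary choices.
w = np.array([1., -1., 1.])      # detector tuned to the sub-pattern [1, 0, 1]

def detect_then_pool(x, w):
    k = len(w)
    responses = [np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)]
    return max(responses)

x0 = np.array([1., 0., 1., 0., 0., 0., 0., 0.])   # sub-pattern at position 0
x4 = np.roll(x0, 4)                               # same sub-pattern at position 4
xc = np.array([1., 1., 0., 0., 1., 0., 0., 0.])   # no sub-pattern anywhere

for label, x in [("position 0 ", x0), ("position 4 ", x4), ("different  ", xc)]:
    print(label, "->", detect_then_pool(x, w))

Each detector window on its own is still bound to the raw metric; it is
the shared weights plus the pooling step that buy the invariance.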

Now I do not really think that we have a problem here. But some people have
recently raised serious doubts. Some have suggested that replacing
connectionist state spaces by the "space" spanned by the attractors of a
dynamical system might give a more appropriate metric. I am not advocating
this myself; I am only proposing that connectionists stay alert in this
respect and keep asking whether the fundamental assumptions we are making
are the appropriate ones. In that sense I think Jerry's point is well worth
discussing.
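
Finally, for anyone who wants to see the generalization test behind Jerry's
claim spelled out, here is a minimal sketch (plain NumPy; the sizes,
patterns and learning parameters are arbitrary choices of mine, not
anyone's published setup). A small fully connected backprop net is trained
to tell two sub-patterns apart when they appear at some positions, and is
then probed at positions it has never seen during training:

import numpy as np

rng = np.random.default_rng(0)
N = 12                          # length of the input "retina" (arbitrary)
A = np.array([1., 0., 1.])      # class-1 sub-pattern (arbitrary)
B = np.array([1., 1., 0.])      # class-0 sub-pattern (arbitrary)

def example(pattern, pos):
    x = np.zeros(N)
    x[pos:pos + len(pattern)] = pattern
    return x

def make_set(positions):
    X = np.array([example(p, pos) for pos in positions for p in (A, B)])
    y = np.array([1.0, 0.0] * len(positions))
    return X, y

train_pos = [0, 1, 2, 3, 4, 5]  # positions shown during training
test_pos  = [8, 9]              # positions never shown during training
Xtr, ytr = make_set(train_pos)
Xte, yte = make_set(test_pos)

# Plain fully connected net: N inputs, 20 sigmoid hidden units, 1 sigmoid
# output, squared error, full-batch gradient descent.  Inputs 8..11 are
# always zero in the training set, so the gradient on their first-layer
# weights is exactly zero and they never move from their random initial
# values.
H, lr, epochs = 20, 1.0, 20000
W1 = rng.normal(0.0, 0.3, (N, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.3, H);      b2 = 0.0
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):
    h = sig(Xtr @ W1 + b1)                    # hidden activations
    o = sig(h @ W2 + b2)                      # network output
    d_o = (o - ytr) * o * (1.0 - o)           # output-layer delta
    d_h = np.outer(d_o, W2) * h * (1.0 - h)   # hidden-layer deltas
    W2 -= lr * h.T @ d_o / len(ytr);   b2 -= lr * d_o.mean()
    W1 -= lr * Xtr.T @ d_h / len(ytr); b1 -= lr * d_h.mean(axis=0)

def accuracy(X, y):
    out = sig(sig(X @ W1 + b1) @ W2 + b2)
    return np.mean((out > 0.5) == (y > 0.5))

print("accuracy at trained positions :", accuracy(Xtr, ytr))
print("accuracy at held-out positions:", accuracy(Xte, yte))  # the interesting number

Whatever the numbers come out to be on a particular run, the first-layer
weights from input units that are never active during training provably
receive zero weight updates, which is the formal core of the point above.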

Just my 2 c. worth,

Georg
(email: georg at ai.univie.ac.at)

P.S.: I would like to acknowledge F.G. Winkler from Vienna for some of the
ideas expressed in this message.


