Dynamic Binding

janetw@cs.uq.oz.au
Tue Nov 9 20:20:39 EST 1993


On binding using hidden layers, Graham Smith writes:

> I have recently trained a simple multi-layer feed-forward network using
> back-propagation, to auto-associate patterns which simultaneously describe
> two items. The patterns consist of four features (red, blue, square,
> ...

Graham's query has generated a reference to my work; below I summarize some
of the issues as I see them and give references to related work.

The use of a hidden layer for binding is at least implicit (and sometimes
explicit) in several uses of nets in the late 1980s (as Mark St John points
out).  The experiments Graham describes are a very clean version of the
problem, useful for highlighting the explicit issues in binding, combinatorial
structure and generalisation.  They are also known as multi-variable
encoders (MVEs) or Nk-Mk-Nk encoders, where k is the number of "variables"
(or "features" or "components"), N is the number of "values" (or "elements")
per variable and M is the number of hidden units per variable (a small worked
sketch of such a task appears after point 2 below).  At least two groups have
published simulation results that I know about:

1. Olivier Brousse and Paul Smolensky worked on a variation (see Brousse's
PhD thesis "Generativity and systematicity in neural network combinatorial
learning", Oct, 1993, originally presented in 1989, at Cog Sci 11 or
possibly earlier, though I don't have a reference). The papers related to
this work addressed the issue of generalisation in combinatorial domains,
in which the binding problem is an implicit part.  The experiments are
described using Smolensky's tensor notation, and hence the connection to the
MVE task is perhaps not obvious for those unfamiliar with tensors.  The results
are very clear, showing massive generalisation from very few training samples.
Brousse's thesis goes into issues in compositionality and systematicity in
detail.

2. Mark Ollila and I presented an analysis of the representations in hidden
unit (HU) space formed in a colour-shape-location mapping task (1992, NIPS
"Intersecting regions: The key to combinatorial structure in HU space").
As Graham Smith described in his simulations, HU space forms a representation
of a particular "scene".  In our simulations we were interested in the range
of classifications possible in HU space.  E.g., could a single hyperplane
distinguish all the scenes with a blue object on the left from the rest? All
the scenes with a blue object anywhere? With a blue square and a red triangle
anywhere?
These questions ask whether the HU space can be partitioned by a single
hyperplane so that all the patterns relevant to the query (e.g. all the blue
scenes) can be distinguished from all the others.
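
One concrete way to run this test is to record the HU vector for every scene
and then train a single linear threshold unit (a perceptron) on the query: if
it reaches zero errors, the required hyperplane exists.  The sketch below is
only an illustration, not our simulation code.  It assumes two variables
(colour and shape) with four values each, and it stands in for the recorded
activations with a hand-made 2-units-per-variable code of the kind discussed
later in this message; with activations from a real trained net the same test
applies unchanged.

  import itertools
  import numpy as np

  COLOURS = ["red", "blue", "green", "yellow"]
  SHAPES = ["square", "circle", "triangle", "star"]

  def value_code(index, n_values):
      # Place the n values of one variable at distinct points on a circle,
      # i.e. a 2-unit code per variable (an assumption standing in for the
      # hidden-unit activations a trained net would produce).
      angle = 2.0 * np.pi * index / n_values
      return np.array([np.cos(angle), np.sin(angle)])

  # One 4-dimensional "HU vector" per scene: colour code followed by shape code.
  scenes = list(itertools.product(range(len(COLOURS)), range(len(SHAPES))))
  hu = np.array([np.concatenate([value_code(c, len(COLOURS)),
                                 value_code(s, len(SHAPES))]) for c, s in scenes])

  # Query: "the scene's colour is blue" (+1 for query scenes, -1 for the rest).
  labels = np.array([1 if COLOURS[c] == "blue" else -1 for c, s in scenes])

  # Perceptron learning: if it reaches zero errors within the epoch cap,
  # a single hyperplane (w, b) separates the query scenes from all the others.
  w, b = np.zeros(hu.shape[1]), 0.0
  for epoch in range(1000):
      errors = 0
      for x, y in zip(hu, labels):
          if y * (w @ x + b) <= 0:
              w, b = w + y * x, b + y
              errors += 1
      if errors == 0:
          break

  print("query realisable by a single hyperplane:", errors == 0)

Single-value queries like this one succeed because each value's code is an
extreme point of its variable's code set; it is the conjunctive queries across
variables that can fail to be linearly separable, which is the capacity
question taken up next.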

The answer took us into an analysis of possible structures in HU space.
If HU representations are structured so that any classification can be
made using a single hyperplane, then the patterns must lie at the corners of
a hypertetrahedron (a generalised triangle): only then are the P patterns
affinely independent, so that every one of the 2^P ways of splitting them can
be cut off by a single hyperplane - at the cost of needing P-1 dimensions.
This is a geometric way of thinking about VC-dimension.

The MVE tasks do not require nearly such capacity in HU space - we can think
of them as compressed representations of the general hypertetrahedron.  For
each variable, the task only requires a 1-of-N selection.  Hence the structure
of HU space can be organized into a hypertetrahedron over the number of
variables (rather than over the variables x values).  This structure would allow
the combination of any colour in location one with any colour in location two,
but not both blue and green in location one, since that is not a legal pattern.
The bottom line is that instead of needing N^k - 1 hidden units, where N is the
number of values per variable and k is the number of variables, we only
need Mk, where M is the number of HU required per variable (M is a constant).
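
To put numbers on the saving: with k = 3 variables and N = 10 values each,
placing all 10^3 = 1000 legal patterns at the corners of a single
hypertetrahedron would take 999 hidden units, whereas the compressed structure
needs only Mk - e.g. 6 hidden units at the minimum of M = 2 per variable
discussed below.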

This year's Cog Sci paper (which John Kolen mentioned) began as an
extension to the NIPS92 study, asking: What is the minimum number of hidden
units required for such a mapping task, and how easily is it learned?
These tasks are "ultra-tight" MVEs, since there are several variables but a
minimum number of HUs.  Given the ultra-tight bottleneck, the hidden
units divide into pairs, each pair encoding the internal representation of
one variable.  (The theoretical minimum is 2 HU per variable, based on a
proof by Kruglyak on the N-2-N encoder).
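
One way to picture the 2-HU minimum: in an N-2-N encoder the internal
representations of the N values are N points in the plane of the two hidden
units, and each output unit implements a line (plus a squashing function) in
that plane.  If the N points sit in convex position - around a circle, say -
each value can be cut off from all the others by its own line, which is
exactly the 1-of-N selection the task requires.  With only one hidden unit the
values lie on a line and an output unit can only cut off one end of it, so for
N > 2 not every value can be isolated.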

Caveats: The decomposition of the hidden layer into pairs of cooperating
units only occurs (in fact is only possible from a coding theory point
of view) when there is a specific structure in the patterns for each
variable.  The traditional "local" codes used in encoder tasks satisfy
this criterion, but are hard to learn.  In our simulations, we used block
codes (e.g. 1111100000) for each variable in the output and local codes
(e.g. 100000000) for each variable in the input (Bakker et al., 1993, ICANN).
A second caveat is that the bottleneck of 2 HU/variable does not provide
sufficient capacity for a net to represent the probability distribution of
the input patterns (e.g. co-occurrence between variables), and hence the net
generalises to all possible combinations of the variables.  Sometimes this may be desirable,
sometimes not, depending on the task.
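
To make the setup concrete, here is a minimal sketch - not the simulation code
from Bakker et al. or the Cog Sci paper - of an ultra-tight MVE trained by
plain backprop, with local (1-of-N) codes on the input, block codes on the
output, and 2 hidden units per variable.  The particular block code used
(circular shifts of a run of ones), the network sizes, the learning rate and
the train/test split are all assumptions for illustration only.

  import itertools
  import numpy as np

  rng = np.random.default_rng(0)
  K, N = 2, 4                  # K variables, N values per variable
  NHID = 2 * K                 # ultra-tight bottleneck: 2 hidden units per variable
  BLOCK = N // 2               # length of the run of ones in the block code

  def local_code(v):
      x = np.zeros(N)
      x[v] = 1.0
      return x

  def block_code(v):
      # A circular block of ones starting at position v (an assumed form of
      # block code; the exact codes used in the papers may differ).
      x = np.zeros(N)
      x[[(v + i) % N for i in range(BLOCK)]] = 1.0
      return x

  # All N**K legal patterns: concatenated per-variable codes.
  combos = list(itertools.product(range(N), repeat=K))
  X = np.array([np.concatenate([local_code(v) for v in c]) for c in combos])
  T = np.array([np.concatenate([block_code(v) for v in c]) for c in combos])

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # One hidden layer of sigmoid units, squared error, plain batch backprop.
  W1 = rng.normal(0.0, 0.5, (X.shape[1], NHID))
  b1 = np.zeros(NHID)
  W2 = rng.normal(0.0, 0.5, (NHID, T.shape[1]))
  b2 = np.zeros(T.shape[1])

  # Train on half the legal patterns; hold out the rest to look at
  # generalisation to unseen combinations of the variables.
  order = rng.permutation(len(X))
  train, test = order[: len(X) // 2], order[len(X) // 2:]

  lr = 0.5
  for epoch in range(20000):
      h = sigmoid(X[train] @ W1 + b1)
      y = sigmoid(h @ W2 + b2)
      dy = (y - T[train]) * y * (1.0 - y)      # error signal at the output layer
      dh = (dy @ W2.T) * h * (1.0 - h)         # backpropagated to the hidden layer
      W2 -= lr * h.T @ dy / len(train)
      b2 -= lr * dy.mean(axis=0)
      W1 -= lr * X[train].T @ dh / len(train)
      b1 -= lr * dh.mean(axis=0)

  def accuracy(rows):
      out = sigmoid(sigmoid(X[rows] @ W1 + b1) @ W2 + b2)
      return np.mean(np.all((out > 0.5) == (T[rows] > 0.5), axis=1))

  print("train accuracy:", accuracy(train), "held-out accuracy:", accuracy(test))

Whether a given run reaches perfect held-out accuracy with so tight a
bottleneck is exactly the kind of question the Cog Sci and IJCNN papers
address; swapping the block output codes back to local codes reproduces the
"hard to learn" setting mentioned above.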

In later work, Steven Phillips and I showed that ultra-tight encoders with
block-structured outputs can be learned "efficiently" by standard bp,
i.e. with a polynomial number of patterns in the training set (1993, IJCNN Japan).
Steve is continuing this work for his PhD.

-----
Several people have asked for copies of my CogSci paper in response to
comments generated by Graham's query.  It is not online -- hard copy is
available for people who do not have access to the Proceedings.

------------------------
Janet Wiles
Departments of Computer Science and Psychology
University of Queensland
QLD  4072  AUSTRALIA
email: janetw at cs.uq.oz.au

