Distributed Representations

Bo Xu ITGT500 at INDYCMS.BITNET
Sat Jun 22 11:38:17 EDT 1991


Ali Minai presented a good example involving apples and pears.  I am going to
answer some of the questions he raised.  Let's look at his statements first.
 
>is not. No reference is made to "properties" defining the object, and so there
>is no semantic content in any unit beyond that of mere signification: each
 
This is a very good question.  Generally speaking, many properties exist
at the same time for each object.  Let's take the apple as an example.
An apple can be classified according to its taste, color, size, shape, or
whether or not it is a fruit (as Ali Minai chose), etc.  Different people will
choose different criteria to suit the purposes of their applications.
 
>unit is, ideally, identical. The question is: why have three units signifying
>one object when they work as one? One reason might be to achieve redundancy,
>and consequent fault-tolerance, through a voting scheme (e.g. 101001 <-> pear).
Redundancy and fault-tolerance may be reasons for the binary distributed
representation.  Another reason probably comes from considerations of
convergence rate.  Karen Kukich has done some interesting work and concludes
that the advantage of local representation is its faster convergence rate
(see K. Kukich, "Variations on a Back-Propagation Name Recognition Net," in
Proceedings of the United States Postal Service Advanced Technology
Conference, Vol. 2, pp. 722-735).  The binary distributed representation is
similar to the local representation in that both take binary values, so it may
share this advantage.  However, as to why "three" units instead of "five" or
any other number, I do not know either.  This question is probably similar to
the question of how many hidden units are needed for a specific task.  It may
depend on how much redundancy is needed.
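To make the voting idea concrete, here is a minimal sketch in Python (only the
pear code 101001 comes from Ali Minai's example; the apple code and the names
below are my own assumptions for illustration).  If the two codes differ in
enough bits, nearest-code ("voting") decoding recovers the intended object even
when one unit gives a wrong value:

    # Minimal sketch: redundant binary codes with nearest-code ("voting")
    # decoding.  The apple code is assumed here to be the bitwise complement
    # of the pear code, so any single wrong bit still leaves the output
    # closer to the intended code.
    codes = {"pear":  (1, 0, 1, 0, 0, 1),   # 101001 <-> pear (from the example)
             "apple": (0, 1, 0, 1, 1, 0)}   # assumed code for illustration

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def decode(output):
        # pick the stored code closest in Hamming distance (a voting scheme)
        return min(codes, key=lambda name: hamming(codes[name], output))

    print(decode((1, 0, 1, 0, 0, 1)))  # -> pear
    print(decode((1, 0, 1, 0, 1, 1)))  # one unit wrong, still -> pear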
 
>Here, under the obvious reading of this definition, I have two categories
>(units) called "fruits" and "vegetables". Each represents many objects
>with different values, but mutually exclusively. Thus, I might have
>apple <-> 0.1,0 and squash <-> 0,0.1, but no object will have the code
>0.1,0.1. This is obviously equivalent to a binary representation with
>each unit replaced by, say, n binary units. The question is: does this
>code embody the principle of dispensibility? Not necessarily. One wrong bit
>could change an apple into a lemon, or even lose all information about the
>category of the object. Thus, in the general case, such a representation
>is "distributed" only in the physical sense of activating (or not activating)
>units in a group. Each unit is still functionally critical.
 
It is true that with a one-bit error, the apple will change into a lemon, etc.
However, the key point here is that a neural net's fault-tolerance exists
only after the net has been trained and has reached an accuracy criterion.
If we are dealing with many objects and use a spacing of 0.1 to differentiate
them, we will train the net to a criterion smaller than 0.1 (otherwise the net
will be of no use).  Thus, for seen patterns, the error will not be large
enough to turn an apple into a lemon.  For unseen patterns, larger errors will
probably occur, and apples may indeed turn into lemons or whatever.  However,
in that case we should not attribute the problem to the representation alone.
It is a matter of the generalizability of the net, and the learning algorithm,
the units' response characteristics, and even the topology of the net all
probably play roles in that generalizability.
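To spell out the point about the training criterion, here is a small sketch
(the target codes below are my own assumed values, not taken from any
particular net).  With nearest-target decoding, a seen pattern is decoded
correctly as long as its output error stays below half the spacing between
codes (0.05 here), which a training criterion well under 0.1 guarantees:

    # Sketch: analog target codes spaced 0.1 apart on two output units,
    # decoded by choosing the nearest target.  Errors below half the spacing
    # cannot turn an apple into a lemon; larger errors can.
    targets = {"apple": (0.1, 0.0), "lemon": (0.2, 0.0), "squash": (0.0, 0.1)}

    def decode(output):
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        return min(targets, key=lambda name: dist(targets[name], output))

    print(decode((0.13, 0.01)))  # error ~0.03 from "apple": still "apple"
    print(decode((0.17, 0.02)))  # error ~0.07 from "apple": becomes "lemon"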
 
>Now here we have what most people mean by "distributed representations". We
 
>nother. The question then is: is this dependency small or large? Does
>small malfunction in a unit cause catastrophic change in the semantic
>content of the whole group of units? I can "distribute" my representation
 
When talking about representations, the graceful degradation of the brain
is introduced as a criterion.  However, since neural nets are still far
from being real brain models, some caution is needed when relating a neural
net to the brain.  The first thing to be made clear is which layer of the
neural net we are referring to.  Most people refer to the interface layers
(the input and output layers) of a neural net when they talk about
local/distributed representations.  However, they refer to all layers (both
the interface layers and the hidden layers) when they talk about graceful
degradation.
      However, what is the justification for requiring the interface layers to
possess graceful degradation?  If we say that a neural net resembles the brain
in some respects, then the resemblance most likely lies in the hidden layers
rather than the interface layers.  The criterion of graceful degradation should
be applied to the hidden layers instead of the interface layers.  In most
current nets, the hidden layers use a mixed distributed representation and thus
possess the graceful degradation characteristic.
      As to the interface layers (input/output layers), we could demand that
they possess graceful degradation too.  However, in my opinion, this would
lead to many additional problems and confusions.  The mixed distributed
representation is good for hidden layers, not for interface layers.  I think
the analog distributed representation works best for the interface layers,
because: (1) The priority at the interface layers should be practicality
rather than graceful degradation; there is no justification and no need for
the interface layers to possess graceful degradation.  (2) The analog
distributed representation classifies the objects to be represented: objects
with the same property are put into the same group, and the differences
between objects in the same group are represented by different analog values
of the unit representing that property group.  For example, suppose there are
four apples and three pears.  In the analog distributed representation, two
units are used: unit A for apples and unit P for pears.  The four apples are
represented by letting unit A take four different analog values, and the three
pears by letting unit P take three different analog values (a small sketch of
this encoding is given below).  This is the most natural way to deal with many
objects.  Why should we sacrifice this natural way (the analog distributed
representation) for graceful degradation, which may not belong to the
interface layers (the hidden layers already use a mixed distributed
representation and possess graceful degradation), when we are considering the
interface layers?  We used the analog distributed representation in a
parabolic problem (a task of mapping the parabola curve, which we used to
compare the performances of BPNN and PPNN) and found it to be the best and
most natural representation for problems (such as the parabolic problem) which
have continuous and infinite sets of training/test patterns (objects).
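 
To make the analog distributed representation concrete, here is a small sketch
of the apple/pear encoding described above (the particular analog values are
my own assumptions for illustration):

    # Sketch: analog distributed encoding with two interface units, unit A for
    # the apple group and unit P for the pear group.  Objects within a group
    # are distinguished by the analog value of that group's unit; the specific
    # values below are assumed for illustration.
    encoding = {
        "apple1": (0.2, 0.0), "apple2": (0.4, 0.0),
        "apple3": (0.6, 0.0), "apple4": (0.8, 0.0),   # unit A varies, P is off
        "pear1":  (0.0, 0.3), "pear2":  (0.0, 0.6),
        "pear3":  (0.0, 0.9),                         # unit P varies, A is off
    }

For a continuous task such as the parabola mapping, the same idea applies even
more directly: the analog value of an input unit can simply be x and that of
an output unit the corresponding y, so infinitely many patterns are handled by
a few units (this reading of the parabolic problem is my own assumption).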
 
In sum, I think that we should be more specific when we talk about the
representations and brain-like characteristics of neural nets:
(1) For the interface layers (input/output layers), the analog distributed
representation is the best choice, because at the interface layers the
priority is practicality, and the analog distributed representation is the
most natural and the most easily used when dealing with many objects.
(2) For the hidden layers, the mixed distributed representation is the best
choice, because graceful degradation is now the priority to be taken into
account.  Fortunately, most current network architectures already ensure this
for the hidden layers.
 
 
 
Bo Xu
ITGT500 at INDYCMS.BITNET

