Distributed Representations

aam9n aam9n at honi4.acc.virginia.edu
Thu Jun 20 17:39:38 EDT 1991


 
 
Bo Xu presents a very interesting classification of representations
in terms of their distribution over representational units. The definitions
of each class are internally clear enough, but I have some comments about
how "distributivity" is defined, and where it leads. Let's take the
 definitions
that Bo Xu gives:
 
>Local Representation ---- The one-to-one correspondence in which each object
>      is represented by one unit, and each unit represents only one object.
>      Units in local representation always take binary values.
 
No quarrel about this one being a local representation.
 
>Binary Distributed Representation ---- The one-to-multiple correspondence
>      in which each object is represented by multiple units and each unit
>      is employed to represent only one object.  The unit takes only binary
>      values here because it represents only one object, there is no need
>      for it to take analog values.
 
Suppose I have two objects --- an apple and a pear --- and six representational
units r1...r6. Then, if I read this definition correctly, a distributed
representation might be 000111 <-> apple and 111000 <-> pear. Since the units
are binary, they are presumably "on" if the object is present and "off" if it
is not. No reference is made to "properties" defining the object, and so there
is no semantic content in any unit beyond that of mere signification: each
unit is, ideally, identical. The question is: why have three units signifying
one object when they work as one? One reason might be to achieve redundancy,
and consequent fault-tolerance, through a voting scheme (e.g. 101001 <-> pear).
Is this a distributed representation, though? To decide that, I must have
an *external* definition of what it means for a representation to be
distributed. Tentatively, I say that "a representation is distributed over
a group of units if no single unit's correct operation is critical to the
representation". This certainly holds in the above example. It holds, indeed,
in all error-correcting codes. In a binary distributed representation, then,
I can define the "degree of distributivity" as the minimum Hamming distance
of the code. This is quite consistent, if rather disappointingly mundane.
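 
To make this concrete, here is a small Python sketch of the six-unit code
above, with the "degree of distributivity" computed as the minimum Hamming
distance. The function names and the nearest-codeword readout are my own
illustration, not part of Bo Xu's definitions:
 
    # Illustrative sketch only; the names and readout rule are mine.
    CODE = {"apple": "000111", "pear": "111000"}

    def hamming(a, b):
        # Number of units on which two patterns disagree.
        return sum(x != y for x, y in zip(a, b))

    def degree_of_distributivity(code):
        # Minimum Hamming distance between any two codewords.
        words = list(code.values())
        return min(hamming(a, b)
                   for i, a in enumerate(words)
                   for b in words[i + 1:])

    def decode(pattern, code):
        # Nearest-codeword readout: a corrupted pattern is read as the
        # object whose codeword it is closest to.
        return min(code, key=lambda obj: hamming(pattern, code[obj]))

    print(degree_of_distributivity(CODE))  # 6: two unit faults still corrected
    print(decode("101001", CODE))          # "pear", as in the voting example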
 
>Analog Distributed Representation ---- The multiple-to-one correspondence
>      in which multiple objects with the same property are represented by
>      one unit and each unit represents multiple objects with the same
>      property only.  Here the unit takes different analog values for
>      different objects within this property group. Different analog
>      values are used to differentiate these different objects within the
>      same property group.
 
Here, under the obvious reading of this definition, I have two categories
(units) called "fruits" and "vegetables". Each represents many objects
with different values, but mutually exclusively. Thus, I might have
apple <-> (0.1, 0) and squash <-> (0, 0.1), but no object will have the code
(0.1, 0.1). This is obviously equivalent to a binary representation with
each unit replaced by, say, n binary units. The question is: does this
code embody the principle of dispensability? Not necessarily. One wrong bit
could change an apple into a lemon, or even lose all information about the
category of the object. Thus, in the general case, such a representation
is "distributed" only in the physical sense of activating (or not activating)
units in a group. Each unit is still functionally critical.
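 
The same point in a little Python (the objects, values, and readout rule
below are mine, purely for illustration): with one analog unit per category,
a small drift in a single unit changes the object's identity, and zeroing
that unit loses even the category.
 
    # Illustrative sketch only; objects and values are invented.
    ANALOG_CODE = {
        "apple":  (0.1, 0.0),
        "lemon":  (0.2, 0.0),
        "squash": (0.0, 0.1),
    }

    def nearest(pattern, code):
        # Read out the object whose codeword is closest (squared Euclidean).
        def dist(obj):
            return sum((p - c) ** 2 for p, c in zip(pattern, code[obj]))
        return min(code, key=dist)

    print(nearest((0.10, 0.0), ANALOG_CODE))  # "apple"
    print(nearest((0.18, 0.0), ANALOG_CODE))  # "lemon": one unit drifted
    print(nearest((0.00, 0.0), ANALOG_CODE))  # "apple", but only by tie-break;
                                              # squash is equally close, so even
                                              # the category is lost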
 
>Mixed Distributed Representation ---- The multiple-to-multiple correspondence
>       in which multiple objects of multiple properties are represented by
>       one unit and each unit represents multiple objects with multiple
>       properties. Here, the units take either binary or analog values
>       depending on the properties and the object they represent.
 
Now here we have what most people mean by "distributed representations". We
have many properties, each represented by a unit, and many objects. Each
object can be encoded in terms of its properties. If the set of properties
does not have enough discrimination, multiple objects could have the same
code. Even if the property set is sufficient for unique representation, it
is possible that the malfunction of one unit may change one object to
another. The question then is: is this dependency small or large? Does
a small malfunction in a unit cause catastrophic change in the semantic
content of the whole group of units? I can "distribute" my representation
over all the atoms in the universe, but if that doesn't give me some
protection from point failures, I have not truly "distributed" things
at all --- merely multiplied the local representation. Now, of course, in
the "real" world where things are uniformly or normally distributed and
errors are uncorrelated, increasing the size of a representation over a
set of independent units will almost always confer some degree of protection
from catastrophic point failures. An important issue is how to *maximize*
this. And to do that, we must be able to measure it. One way would be to
minimize the average information each representational unit conveys about
the represented objects, which is a simple maximum entropy formulation.
This requirement must, of course, be balanced by an adequate representation
imperative. Other formulations are certainly possible, and probably much
better. In any case, many of the more interesting issues in distributed
representation arise when the "object" being represented is only implicitly
available, or when the representation is distributed over a hierarchy of
units, not all of which are directly observable, and not all of which
count in the final encoding.
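 
As a crude example of the "measure it" part above, here is a direct count of
how often a single-unit fault still reads out the right object under
nearest-codeword decoding. This is not the maximum-entropy formulation I
mentioned, just the bluntest possible robustness measure, and the property
codes are invented for the purpose:
 
    # Sketch only; the codebooks and function names are mine.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def decode(pattern, code):
        return min(code, key=lambda obj: hamming(pattern, code[obj]))

    def single_fault_robustness(code):
        # Flip each unit of each codeword in turn; count how often the
        # corrupted pattern still decodes to the original object.
        ok = trials = 0
        for obj, word in code.items():
            for u in range(len(word)):
                bit = "0" if word[u] == "1" else "1"
                flipped = word[:u] + bit + word[u + 1:]
                trials += 1
                ok += decode(flipped, code) == obj
        return ok / trials

    # Four objects over two property units: every pattern is a valid
    # codeword, so any single fault turns one object into another.
    tight = {"apple": "10", "pear": "01", "plum": "11", "fig": "00"}

    # The same four objects over six units, all codewords at Hamming
    # distance of at least 3: every single fault is corrected.
    redundant = {"apple": "000000", "pear": "111000",
                 "plum": "000111", "fig": "111111"}

    print(single_fault_robustness(tight))      # 0.0
    print(single_fault_robustness(redundant))  # 1.0
 
The price, of course, is more units for the same number of objects.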
 
Comments?
 
Ali Minai
aam9n at Virginia.EDU
 
 

