distribution and its advantages

Mon Jun 17 11:45:58 EDT 1991

Javier Movellan's question -- what are distributed 
representations good for anyway? -- is, I think, an 
important one for connectionism and cognitive science 
generally. Trouble is, the way it was put, it presupposes that 
there is some one kind of representation that everyone is 
referring to when they talk about distribution. In fact, though 
most people have a reasonable idea what they themselves 
intend when they use the term "distributed", they usually 
don't realize that it's not the way many other people use it. 

This is immediately apparent if one takes an overview of the 
responses that actually came in. Various people took it that a 
representation is distributed if it utilizes many units rather 
than just one, with the "strength" of distribution increasing 
as the total number of units (or perhaps, the proportion of 
available units) used increases. Massone by contrast thought 
the key concept is that of redundancy, which I take roughly to 
mean that a given piece of input information is represented 
multiple times. This presumably requires that many units are 
used (i.e., that there is distribution in the previous sense) but 
is a significantly stronger requirement. Massone's position 
was echoed in some other responses. Chalmers claims that a 
distributed representation is one in which every 
representation, whether of a basic concept or a more complex 
one, has a kind of semantically significant internal structure. 
This definition also seems to presuppose the first kind of 
definition, but is different from redundancy. Proposing a 
somewhat different definition again, French suggested that 
distribution is a matter of the degree of "overlap" between 
representations of different entities. And so on.

This lack of agreement over what distribution actually is is at 
least partly responsible for the fact that no clear and 
useful consensus on the advantages of distributed 
representation emerged in the responses to the initial 
question. It manifests a wider lack of agreement over the 
concept of distribution in connectionism and cognitive 
science more generally. I once surveyed as many of the 
definitions and occurrences of "distribution", "distributed 
representation", etc., as I could find in the cognitive science 
literature, and found that there were at least 5 very different 
basic properties that people often refer to as distribution. 
These ranged from a very simple notion of "spread-out-
ness" - each entity being represented by activity in many 
units rather than just one - at one extreme, to complete 
functional equipotentiality at the other. (A representation is 
functionally equipotential when any part of it can stand in for 
the whole thing. Holograms are famous for exhibiting a form 
of equipotentiality.) Authors often picked up multiple strands 
and ran them together in one characterization, or defined 
distribution differently on different occasions, sometimes 
even in the same work. 

Probably the two most common definitions are (1) the notion 
of simple extendedness just mentioned (i.e., using "many" 
units to represent a given item) and (2) superimposition of 
representations. We have superimposition when there are 
multiple items being represented at the same time, but no 
way of pointing to the discrete part of the representation 
which is responsible for item A, the discrete part which is 
responsible for item B, and so forth. Think of the weights in 
a standard feed-forward network. Here multiple input-output 
associations are represented at the same time, but there is (in 
general) no separate set of weights for each association. 

To see how these two senses simultaneously dominate 
connectionist discussions of distribution, think again of the 
answers to Movellan's question. Many of the answers took 
the form, roughly, that "when I used representations 
involving activity in many units rather than just one in such 
and such a network, I found better (or worse!) performance". 
Other responses, particularly those that made reference to 
the brain or neuropsychological results, were more concerned 
with the extent to which there is separate or discrete storage 
of the various components of our knowledge in a given 
circumscribed domain. (In these contexts, "graceful 
degradation" in performance is often thought to be a 
consequence of knowledge being stored in an inextricably 
superimposed fashion.)

In one sense, it is not surprising that these are the two most 
common notions of distribution. Perhaps the only thing that 
is really clear about distribution is the opposition between 
distribution and localization: whatever distributed 
representations are, they are non-local. Trouble is, "local" 
turns out to be ambiguous. Sometimes "local" means 
restricted in extent (e.g., using only one unit rather than 
many), and sometimes it means not overlapping with the 
representation of anything else. The two most common 
senses of "distribution" mentioned a moment ago simply 
result from denying locality in these two distinct senses. 

It seems to me that a necessary condition for any significant 
progress on the question "what are distributed 
representations good for?" is that this general state of 
confusion over what "distributed" means be resolved. This 
means clearly laying out the different senses that are floating 
around, picking out the one that is the most central and most 
theoretically significant, and giving it a reasonably precise 
definition. I attempted this in Ch.1 of my PhD dissertation 
(Distributed Representation, University of Pittsburgh 1989); 
a shorter overview of some of the material from that chapter 
has recently appeared as "What is the D in PDP? An overview 
of the concept of distribution" in Stich, Ramsey & Rumelhart 
(eds) Philosophy and Connectionist Theory. 

In my opinion, the most important concept in the vicinity of 
distribution is that of superimposition of representations, 
and it is for this that the term "distributed" should really be 
reserved. One advantage of this strategy is that 
superimposition admits of a surprisingly clear and satisfying 
mathematical definition:

Suppose R is a representation of multiple items. If the 
representings of the different items are fully superimposed, 
every part of the representation R must be implicated in 
representing each item. If this is achieved in a non-trivial 
way there must be some encoding process that generates R 
given the various items to be stored, and which makes R 
vary, at every point, as a function of each item. This process 
will be implementing a certain kind of transformation from 
items to representations. This suggests thinking of 
distribution more generally in terms of mathematical 
transformations exhibiting a certain abstract structure of 
dependency of the output on the input. More precisely, define 
any transformation from a function F to another function G 
as strongly distributing just in case the value of G at any 
point varies with the value of F at every point; the Fourier 
transform is a classic example. Similarly, a transformation 
from F to G is weakly distributing, relative to a division of 
the domain of F into a number of sub-domains, just in case 
the value of G at every point varies as a function of the value 
of F at at least one point in each sub-domain. The classic 
example here is the linear associator, in which a series of 
vector pairs are stored in a weight matrix by first forming, 
and then adding together, their respective outer products. 
Each element of the matrix varies with every stored vector, 
but only with one element of each of those vectors. (The 
"functions" F and G in this case describe the input vectors 
and the association matrix respectively; e.g., given an 
argument specifying a place in an input vector, F returns the 
value of the vector at that place.)
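
As a rough illustration of the linear associator case, here is a minimal 
Python sketch. The function names (store, recall) and the example vectors 
are assumptions made for this demonstration, not anything taken from a 
particular network or library. It stores two vector pairs as a sum of outer 
products, then checks which matrix elements move when one element of one 
stored vector is changed.

```python
# Illustrative sketch only: the names store and recall, and the example
# vectors, are assumptions made for this demonstration.

def store(pairs):
    """Add together the outer products of all (input, output) pairs."""
    n_out, n_in = len(pairs[0][1]), len(pairs[0][0])
    w = [[0.0] * n_in for _ in range(n_out)]
    for x, y in pairs:
        for i in range(n_out):
            for j in range(n_in):
                # Element (i, j) uses only y[i] and x[j] from this pair.
                w[i][j] += y[i] * x[j]
    return w

def recall(w, x):
    """Multiply the weight matrix by an input vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# Orthonormal inputs, so each stored output can be recovered.
x1, y1 = [0.6, 0.8], [2.0, 3.0]
x2, y2 = [-0.8, 0.6], [5.0, 7.0]

w = store([(x1, y1), (x2, y2)])

# Every element of the matrix differs from what either pair alone
# would produce, i.e. each element varies with every stored pair.
only_first = store([(x1, y1)])
only_second = store([(x2, y2)])
depends_on_both = all(
    w[i][j] != only_first[i][j] and w[i][j] != only_second[i][j]
    for i in range(2) for j in range(2)
)

# Changing one element of one stored vector moves only the matrix
# elements in the matching row.
w_changed = store([(x1, y1), (x2, [5.5, 7.0])])
changed = [(i, j) for i in range(2) for j in range(2)
           if abs(w[i][j] - w_changed[i][j]) > 1e-12]
```

There is no separate block of weights for either association, yet 
recall(w, x1) and recall(w, x2) return the stored outputs up to rounding 
error, because the two input vectors were chosen to be orthonormal.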

Clearly, a given distributing transformation yields a 
whole space of functions resulting from applying that 
transformation to different inputs (i.e., different functions 
F). If we think of these output functions as descriptions of 
representations, and the input functions as descriptions of 
items to be represented, the distributing transformation is 
defining a whole space or scheme of distributed 
representations. To be a distributed representation, then, is 
to be a member of such a scheme; it is to be a representation 
R of a series of items C such that the encoding process which 
generates R on the basis of C implements a given distributing 
transformation.

Basically, then, distributed representations are what you get 
from distributing transformations, which are 
transformations which make each part of the output (the 
representation) depend on every part of the input (what 
you're representing). Now, mathematically speaking, there is 
a vast number of different kinds of distributing 
transformations, and so there is a vast number of possible 
instantiations of distributed representation. Connectionists 
can be seen as exploring that portion of the space of possible 
transformations that you can handle with n-dimensional 
vector operations, learning algorithms, etc. In other domains 
such as optics it is possible to implement other forms of 
distributing transformations and hence to get distributed 
representations with different properties.
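
The dependency structure just described can be probed numerically. The 
following is a small Python sketch, under the assumption that "varies 
with" can be approximated by perturbing one input point and seeing which 
output points move; the helper affected_outputs is a hypothetical name 
introduced here. It contrasts a discrete Fourier transform with the 
identity transformation.

```python
import cmath

def dft(f):
    """Discrete Fourier transform of a list of numbers."""
    n = len(f)
    return [sum(f[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def identity(f):
    """A non-distributing transformation, for contrast."""
    return list(f)

def affected_outputs(transform, f, position, delta=1.0, tol=1e-9):
    """Output positions that move when one input position is perturbed.

    Hypothetical helper: a single perturbation is only a crude stand-in
    for the functional dependence used in the definition.
    """
    before = transform(f)
    perturbed = list(f)
    perturbed[position] += delta
    after = transform(perturbed)
    return [k for k, (a, b) in enumerate(zip(before, after))
            if abs(a - b) > tol]

f = [1.0, 4.0, 2.0, 8.0]
dft_spread = [affected_outputs(dft, f, p) for p in range(len(f))]
identity_spread = [affected_outputs(identity, f, p) for p in range(len(f))]
```

For the Fourier transform, every entry of dft_spread lists all four output 
positions; for the identity, each input point moves only its own output 
point. The first is strongly distributing in the sense defined above, the 
second is not distributing at all.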

There are a number of reasons for wanting to define 
distributed representation in terms of superimposition 
generally, and distributing transformations in particular:
(a) superimposition is certainly one of the most common of 
the standard senses of "distribution" in current usage, and so 
we remain as close as possible to that usage;
(b) superimposition admits of a precise mathematical 
definition, so those who think clarity only comes from 
formalization should be kept happy; 
(c) various popular properties of distributed representation 
such as automatic generalization and graceful degradation are 
a natural consequence of distribution defined this way;
(d) in practice, in a connectionist context, distribution in the 
sense of requiring many units rather than just one is a 
necessary precondition of this more full-blooded notion; 
hence any advantages that accrue to representations in virtue 
of utilizing many units also accrue to superimposed 
representations;
(e) a number of other interesting theoretical results follow 
from defining distribution this way: in particular, it can be 
shown that distributed representations cannot be symbolic in 
nature, on a reasonably precise definition of "symbolic" (see 
e.g. my "Why distributed representation is inherently non-
symbolic", in G. Dorffner (ed.) Konnektionismus in Artificial 
Intelligence und Kognitionsforschung. Berlin: Springer-
Verlag, 1990; 58-66). 
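
Point (c) can be made a little more concrete with a toy example. The 
sketch below is only an illustration under stated assumptions: the three 
stored pairs, the choice of which weight to "lesion", and the dictionary 
used as a stand-in for local storage are all made up for the purpose. It 
damages one storage location in a superimposed store and in a local store, 
and compares what is lost.

```python
# Hypothetical sketch; names and numbers are illustrative only.

def store(pairs):
    """Superimpose (input, output) pairs as a sum of outer products."""
    n_out, n_in = len(pairs[0][1]), len(pairs[0][0])
    w = [[0.0] * n_in for _ in range(n_out)]
    for x, y in pairs:
        for i in range(n_out):
            for j in range(n_in):
                w[i][j] += y[i] * x[j]
    return w

def recall(w, x):
    """Multiply the weight matrix by an input vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def error(got, want):
    """Euclidean distance between recalled and stored output."""
    return sum((g - t) ** 2 for g, t in zip(got, want)) ** 0.5

# Three orthonormal inputs, each paired with an output vector.
pairs = [
    ([0.5, 0.5, 0.5, 0.5], [4.0, 0.0]),
    ([0.5, -0.5, 0.5, -0.5], [0.0, 4.0]),
    ([0.5, 0.5, -0.5, -0.5], [2.0, 2.0]),
]

# Superimposed storage: damage a single weight.
w = store(pairs)
w[0][0] = 0.0
superimposed_errors = [error(recall(w, x), y) for x, y in pairs]

# Local storage: one slot per item; damage a single slot.
local = {tuple(x): y for x, y in pairs}
del local[tuple(pairs[0][0])]
local_lost = [tuple(x) not in local for x, _ in pairs]
```

In the superimposed store the damage is shared: every item is recalled 
with the same moderate error and none is erased. In the local store one 
item is gone entirely and the others are untouched. This is one ingredient 
of graceful degradation, not a demonstration of the general claim.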

On the basis of this kind of definition of what distributed 
representation is, what kind of answer can be given to the 
"what are distributed representations good for?" question? 
Well, the kind of answer you will find satisfying will depend 
very much on what your theoretical interests are. A 
connectionist whose concerns have more of an applied, 
engineering focus will want to know what specific processing 
benefits arise from using representations generated by 
distributing transformations. As mentioned in (c) above, I 
think that some of the favorite virtues of distribution are 
best seen as an immediate consequence of superimposition. 
The technical issues here still need much clarification, 
however.

As a cognitive scientist, on the other hand, I'm interested in 
more general questions such as - what are the advantages of 
distribution for human knowledge representation? Here I 
don't have any actual answers ready to hand; the most I can 
do at the moment is point to the kind of question that seems the 
most interesting. Speaking at the broadest possible level: 
various difficulties encountered in mainstream AI, combined 
with some philosophical reflections, suggest that everyday 
commonsense knowledge cannot be fully and effectively 
captured in any kind of purely symbolic format; that, in other 
words, symbolic representation is fundamentally the wrong 
medium for capturing at least certain kinds of human 
knowledge. Just above I mentioned that distributed 
representation (defined in terms of superimposition) can be 
shown to be intrinsically non-symbolic. The obvious 
suggestion then is: perhaps the most important advantage of 
distributed representation is that it (and it alone?) is capable 
of representing the kind of knowledge that underlies everyday 
human competence?

Tim van Gelder