Subtractive network design

Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
Sun Nov 17 18:03:27 EST 1991


    My point is that subtractive schemes are more likely to find
    these global descriptions. These structures, so to speak, condense out of
    the more complicated structures under the force of subtraction.
    
    I would like to hear your opinion on this claim! 
    
For best generalization in supervised learning, the goal is to develop a
separating surface that captures as much as possible of the "signal" in the
data set, without capturing too much of the noise.  If you assume that the
signal components are larger and more coherent than the noise, you can do
this by restricting the complexity of the separating surface(s).  This, in
turn, can be accomplished by choosing a network architecture with exactly
the right level of complexity or by stopping the training before the
surface gets too contorted.  (The excess degrees of freedom are still
there, but tend to be redundant with one another in the early phases of
training.)
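
To make the early-stopping idea concrete, here is a rough sketch of my own
(in Python with NumPy; the data, network size, learning rate, and patience
threshold are all just illustrative choices, not a recipe): train a
deliberately oversized net, watch a held-out set, and halt when the held-out
error stops improving even though the training error would keep falling.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n):
        # two inputs; the "signal" is the sign of their product,
        # with 10% of the labels flipped to act as noise
        x = rng.normal(size=(n, 2))
        y = (x[:, 0] * x[:, 1] > 0).astype(float)
        return x, np.where(rng.random(n) < 0.1, 1 - y, y)

    x_tr, y_tr = make_data(200)
    x_va, y_va = make_data(200)

    n_hidden = 20                       # more units than the task needs
    W1 = rng.normal(scale=0.5, size=(2, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=n_hidden)
    b2 = 0.0

    def forward(x):
        h = np.tanh(x @ W1 + b1)
        return h, 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

    best_err, best, patience = np.inf, None, 0
    for epoch in range(5000):
        h, p = forward(x_tr)
        # gradient of mean squared error through the sigmoid output
        d = 2 * (p - y_tr) * p * (1 - p) / len(y_tr)
        gW2, gb2 = h.T @ d, d.sum()
        dh = np.outer(d, W2) * (1 - h ** 2)
        gW1, gb1 = x_tr.T @ dh, dh.sum(axis=0)
        lr = 0.5
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

        val_err = np.mean((forward(x_va)[1] - y_va) ** 2)
        if val_err < best_err - 1e-5:
            best_err, patience = val_err, 0
            best = (W1.copy(), b1.copy(), W2.copy(), b2)
        else:
            patience += 1
            if patience >= 50:          # held-out error stopped improving
                break

    W1, b1, W2, b2 = best               # keep the best held-out weights
    print("stopped after", epoch + 1, "epochs; held-out MSE", best_err)

The excess hidden units are still present at the stopping point; the early
halt simply keeps them from being exploited to fit the noise.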

Since, in most cases, we can't guess in advance just what architecture is
needed, we must make this selection dynamically.  An architecture like
Cascade-Correlation builds the best model it can without hidden units, then
the best it can do with one, and so on.  It's possible to stop the process
as soon as the cross-validation performance begins to decline -- a sign
that the signal has been exhausted and you're starting to model noise.  One
problem is that each new hidden unit receives connections from ALL
available inputs.  Normally, you don't really need all those free
parameters at once, and the excess ones can hurt generalization in some
problems.  Various schemes have been proposed to eliminate these
unnecessary degrees of freedom as the new units are being trained, and I
think this problem will soon be solved.
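
For what it's worth, the stopping rule itself is easy to write down.  The
sketch below is not Cascade-Correlation itself -- it retrains from scratch at
each step and uses a polynomial degree as a stand-in for the hidden-unit
count, purely to keep the example short -- but it shows the
grow-until-cross-validation-declines control loop I have in mind.

    import numpy as np

    rng = np.random.default_rng(1)

    # Noisy 1-D regression problem: a smooth signal plus Gaussian noise.
    x = np.linspace(-1, 1, 60)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)
    x_tr, y_tr = x[::2], y[::2]         # training half
    x_va, y_va = x[1::2], y[1::2]       # cross-validation half

    best_err, best_deg = np.inf, 0
    for k in range(1, 15):              # "add one more unit" = raise degree
        coeffs = np.polyfit(x_tr, y_tr, deg=k)
        err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
        if err >= best_err:             # validation error has begun to rise:
            break                       # extra complexity now models noise
        best_err, best_deg = err, k

    print("selected complexity", best_deg, "with held-out MSE", best_err)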

A subtractive scheme can also lead to a network of about the right
complexity, and you cite a couple of excellent studies that demonstrate
this.  But I don't see why these should be better than additive methods
(except for the problem noted above).  You suggest that a larger net can
somehow form a good global description (presumably one that models a lot of
the noise as well as the signal), and that the good stuff is more likely to
be retained as the net is compressed.  I think it is equally likely that
the global model will form some sort of description that blends signal and
noise components in a very distributed manner, and that it is then hard to
get rid of just the noisy parts by eliminating discrete chunks of network.
That's my hunch, anyway -- maybe someone with more experience in
subtractive methods can comment.

I believe that the subtractive schemes will be slower, other things being
equal: you have to train a very large net, lop off something, retrain and
evaluate the remainder, and iterate till done.  It's quicker to build up
small nets and to lock in useful sub-assemblies as you go.  But I guess
you wanted to focus only on generalization and not on speed.
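
To be concrete about the loop I mean, here is a toy version of my own (the
magnitude-based pruning criterion, the network size, and the data are all
arbitrary illustrative choices): train an oversized net, repeatedly lop off
the hidden unit with the smallest outgoing weight, retrain the remainder, and
stop as soon as the held-out error gets worse.

    import numpy as np

    rng = np.random.default_rng(2)

    def make_data(n):
        x = rng.normal(size=(n, 2))
        y = np.sin(x[:, 0]) + 0.5 * x[:, 1] + rng.normal(scale=0.1, size=n)
        return x, y

    x_tr, y_tr = make_data(200)
    x_va, y_va = make_data(200)

    def train(W1, b1, W2, b2, epochs=1000, lr=0.05):
        # one tanh hidden layer, linear output, gradient descent on MSE
        for _ in range(epochs):
            h = np.tanh(x_tr @ W1 + b1)
            p = h @ W2 + b2
            d = 2 * (p - y_tr) / len(y_tr)
            gW2, gb2 = h.T @ d, d.sum()
            dh = np.outer(d, W2) * (1 - h ** 2)
            gW1, gb1 = x_tr.T @ dh, dh.sum(axis=0)
            W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
        return W1, b1, W2, b2

    def val_error(W1, b1, W2, b2):
        return np.mean((np.tanh(x_va @ W1 + b1) @ W2 + b2 - y_va) ** 2)

    # start deliberately large, then subtract
    n_hidden = 30
    W1 = rng.normal(scale=0.5, size=(2, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=n_hidden)
    b2 = 0.0
    W1, b1, W2, b2 = train(W1, b1, W2, b2, epochs=3000)
    err = val_error(W1, b1, W2, b2)

    while W1.shape[1] > 1:
        weakest = np.argmin(np.abs(W2))       # candidate unit to lop off
        cand = (np.delete(W1, weakest, axis=1), np.delete(b1, weakest),
                np.delete(W2, weakest), b2)
        cand = train(*cand)                   # retrain the remainder
        new_err = val_error(*cand)
        if new_err > err:                     # pruning started to hurt: stop
            break
        (W1, b1, W2, b2), err = cand, new_err

    print("kept", W1.shape[1], "hidden units; held-out MSE", err)

Note how much work each subtractive step costs: every candidate removal
triggers a full retraining pass before it can even be evaluated.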

Well, opinions are cheap.  If you really want to know the answer, why don't
you run some careful comparative studies and tell the rest of us what you
find out?

Scott Fahlman
School of Computer Science
Carnegie Mellon University

