Similarity to Cascade-Correlation

Scott.Fahlman@SEF1.SLISP.CS.CMU.EDU
Mon Aug 6 23:20:09 EDT 1990


    >It is my understanding that some of the latest work of Hal White et al.
    >presents a learning algorithm - backprop plus a rule for adding hidden
    >units - that can (in the limit) provably learn any function of interest.
    >(Disclaimer: I don't have the mathematical proficiency required to fully
    >appreciate White et al.'s proofs and thus have to rely on second-hand
    >interpretations.)
    
    How does this new work compare with the Cascade-Correlation method
    developed by Fahlman, where a new hidden unit is added by training
    its receptive weights to maximize the correlation between its
    output and the network error, and then training the projective
    weights to the outputs to minimize the error (thus requiring only
    single-layer backprop learning at each iteration)?
    
    -Thomas Edwards
    The Johns Hopkins University / U.S. Naval Research Lab


I'll take a stab at answering this.  Maybe we'll also hear something from
Hal White or one of his colleagues -- especially if I somehow misrepresent
their work.

I believe that all of the published completeness results from White's group
assume a single layer of hidden units.  They show that this architecture
can approximate any desired transfer function (assuming it has certain
smoothness properties) to any desired accuracy if you add enough units in
this single layer.  It's rather like proving that a piecewise linear
approximation can approach any desired curve with arbitrarily small error
as long as you're willing to use enough tiny pieces.  Unless I've
missed something, their work does not attempt to say anything about the
minimum number of hidden units you might need in this hidden layer.
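
As a rough numerical illustration of that picture (my own toy sketch, not
White et al.'s construction), you can fit a single hidden layer of sigmoid
units to a one-dimensional target and watch the worst-case error shrink as
you widen the layer.  To keep the sketch short, the hidden weights below are
simply drawn at random and only the output weights are fit, by least squares
rather than by backprop:

    # Toy sketch: approximation by a single hidden layer of sigmoids.
    # Random hidden weights, least-squares output weights -- the only point
    # is that the achievable error shrinks as hidden units are added.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_one_hidden_layer(x, y, n_hidden):
        # Hidden-layer weights and biases are drawn at random; only the
        # output weights are fit, which is enough to show the capacity
        # of the layer growing with its width.
        w = rng.normal(scale=4.0, size=n_hidden)
        b = rng.uniform(-4.0, 4.0, size=n_hidden)
        H = sigmoid(np.outer(x, w) + b)            # (n_points, n_hidden)
        v, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights
        return lambda xs: sigmoid(np.outer(xs, w) + b) @ v

    x = np.linspace(-np.pi, np.pi, 400)
    y = np.sin(3 * x)                              # target to approximate

    for n in (2, 8, 32, 128):
        f_hat = fit_one_hidden_layer(x, y, n)
        err = np.max(np.abs(f_hat(x) - y))
        print(f"{n:4d} hidden units: max error = {err:.3f}")

The function and parameter names here are just for illustration; nothing
about this says how few units would suffice, which is exactly the question
the proofs leave open.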

Cascade-Correlation produces a feed-forward network of sigmoid units, but
it differs in a number of ways from the kinds of nets considered by White:

1. Cascade-Correlation is intended to be a practical learning algorithm
that produces a relatively compact solution as fast as possible.

2. In a Cascade net, each new hidden unit can receive inputs from all
pre-existing hidden units.  Therefore, each new unit is potentially a new
layer.  White's results show that you don't really NEED more than a single
hidden layer, but having more layers can sometimes result in a very
dramatic reduction in the total number of units and weights needed to solve
a given problem.  (The sketch after point 3 illustrates this wiring.)

3. There is no convergence proof for Cascade-Correlation.  The candidate
training phase, in which we try to create new hidden units by hill-climbing
in some correlation measure, can and does get stuck in local maxima of this
function.  That's one reason we use a pool of candidate units: by training
many candidates at once, we can greatly reduce the probability of creating
new units that do not contribute significantly to the solution, but with a
finite candidate pool we can never totally eliminate this possibility.
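
Here, in rough outline, is what that candidate phase looks like (points 2
and 3 above).  This is my own much-simplified toy rendering -- one output,
plain gradient steps instead of quickprop, no output-weight training phase
-- and the function and variable names are just for illustration, not the
algorithm from the tech report:

    # Simplified sketch of the Cascade-Correlation candidate phase.
    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_candidates(F, E, n_candidates=8, epochs=200, lr=0.5):
        # F: (n_patterns, n_features) -- original inputs plus the values of
        #    every existing hidden unit, so a new unit can be a new layer.
        # E: (n_patterns,) residual error of the current net (output - target).
        # Each candidate is hill-climbed on the magnitude of its covariance
        # with the error; the winner's input weights are returned.
        n_p, n_f = F.shape
        W = rng.normal(scale=1.0, size=(n_candidates, n_f))
        E_c = E - E.mean()                        # error with its mean removed
        for _ in range(epochs):
            V = sigmoid(F @ W.T)                  # candidate outputs, (n_p, n_candidates)
            S = (V - V.mean(axis=0)).T @ E_c      # covariance of each candidate with the error
            # Gradient ascent on |S| (hill-climbing, not error descent):
            dV = np.sign(S) * E_c[:, None] * V * (1.0 - V)
            W += lr * dV.T @ F / n_p
        V = sigmoid(F @ W.T)
        scores = np.abs((V - V.mean(axis=0)).T @ E_c)
        return W[np.argmax(scores)]               # best-correlated candidate wins

    # Tiny usage example: XOR inputs plus a bias column, and the residual
    # error of a (hypothetical) net that currently outputs 0.5 everywhere.
    X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
    target = np.array([0., 1., 1., 0.])
    E = 0.5 - target
    w_new = train_candidates(X, E)
    new_hidden = sigmoid(X @ w_new)               # value of the newly installed unit
    # The next candidate pool would see the new unit as one more input --
    # each installed unit is potentially a new layer, as in point 2 above.
    F_next = np.column_stack([X, new_hidden])
    print("new hidden unit values:", np.round(new_hidden, 3))

Only the winning candidate gets installed and frozen; training many
candidates in parallel is the cheap insurance against a single candidate
stalling in a poor local maximum, but as noted above it is not a guarantee.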

It would not be hard to modify Cascade-Correlation to guarantee that it
will eventually grind out a solution.  The hard part, for a practical
learning algorithm, is to guarantee that you'll find a "reasonably good"
solution, however you want to define that.  The recent work of Gallant
and of Frean takes interesting steps in this direction, at least for
binary-valued transfer functions and fixed, finite training sets.

-- Scott

