Similarity to Cascade-Correlation

INS_ATGE%JHUVMS.BITNET@VMA.CC.CMU.EDU
Sun Aug 5 15:56:00 EDT 1990


As a side note on the problem of using backpropagation on large
problems, it should be noted that using efficient error-minimization
methods (e.g. conjugate-gradient methods) as opposed to
the "vanilla" backprop described in _Parallel_Distributed_Processing_
allows one to work with much larger problems, and also gives much
better performance on the patterns the network was trained on.
For example, an IR target threat detection problem I have recently
been working on (with 127 or 254 inputs and 20 training patterns)
failed miserably when trained with "vanilla" backprop (hours and
hours on a Connection Machine without success).  When a
conjugate-gradient training program was used, the network learned
100% of the training set in just a minute or two.
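
To make the conjugate-gradient idea concrete, here is a minimal
sketch (in Python with NumPy/SciPy -- this is not the Connection
Machine code mentioned above, and the network sizes, data, and names
are made up for illustration) of handing the whole-training-set
squared error to a generic conjugate-gradient optimizer instead of
stepping with plain gradient-descent backprop:

# Minimal sketch: train a small one-hidden-layer network by giving the
# summed squared error over all training patterns to a conjugate-gradient
# optimizer, rather than stepping with plain gradient-descent backprop.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy stand-in for a training set: n_in inputs, a handful of patterns.
n_in, n_hid, n_out, n_pat = 8, 4, 1, 20
X = rng.standard_normal((n_pat, n_in))
T = np.where(X.sum(axis=1, keepdims=True) > 0, 1.0, -1.0)  # arbitrary targets

def unpack(w):
    """Split the flat parameter vector into the two weight matrices."""
    w1 = w[:n_in * n_hid].reshape(n_in, n_hid)
    w2 = w[n_in * n_hid:].reshape(n_hid, n_out)
    return w1, w2

def sse(w):
    """Summed squared error over the whole training set."""
    w1, w2 = unpack(w)
    h = np.tanh(X @ w1)          # hidden layer
    y = h @ w2                   # linear output layer
    return np.sum((y - T) ** 2)

w0 = 0.1 * rng.standard_normal(n_in * n_hid + n_hid * n_out)
# method="CG" is a nonlinear conjugate-gradient routine; the gradient is
# estimated numerically here for brevity -- an analytic backprop gradient
# would be supplied in practice.
result = minimize(sse, w0, method="CG")
print("final training error:", result.fun)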

>It is my understanding that some of the latest work of Hal White et al.
>presents a learning algorithm - backprop plus a rule for adding hidden
>units - that can (in the limit) provably learn any function of interest.
>(Disclaimer: I don't have the mathematical proficiency required to fully
>appreciate White et al.'s proofs and thus have to rely on second-hand
>interpretations.)

How does this new work compare with the Cascade-Correlation method
developed by Fahlman, where a new hidden unit is added by first
training its incoming ("receptive") weights to maximize the
correlation between its output and the residual network error, and
then training its projective weights to the outputs to minimize that
error (so that only single-layer backprop-style learning is needed at
each step)?
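
(To pin down that candidate-training step: the sketch below is my own
Python/NumPy paraphrase, not Fahlman's code, and the function name,
learning rate, and step count are arbitrary.  It does gradient ascent
on a covariance score between a candidate unit's output and the
residual output errors, in the spirit of the step described above.)

# Rough sketch of the candidate-training step: given the residual errors E
# of the current network over the training patterns, adjust a candidate
# unit's incoming ("receptive") weights by gradient ascent on
# S = sum_o | sum_p (V_p - V_mean)(E_po - E_mean_o) |.
import numpy as np

def train_candidate(X, E, steps=200, lr=0.1, seed=0):
    """X: (patterns, inputs) activations feeding the candidate.
       E: (patterns, outputs) residual errors of the current network."""
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.standard_normal(X.shape[1])
    for _ in range(steps):
        v = np.tanh(X @ w)                   # candidate output per pattern
        vc = v - v.mean()                    # centered candidate output
        ec = E - E.mean(axis=0)              # centered residual errors
        s_o = vc @ ec                        # covariance with each output's error
        # dS/dw: each output's term is weighted by the sign of its covariance.
        dv_dw = (1.0 - v ** 2)[:, None] * X  # tanh' times the inputs
        grad = dv_dw.T @ (ec @ np.sign(s_o))
        w += lr * grad                       # ascend the correlation score
    return w, np.abs(s_o).sum()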

-Thomas Edwards
The Johns Hopkins University / U.S. Naval Research Lab


