Similarity to Cascade-Correlation

Wed Aug 8 10:09:48 EDT 1990

I got this clarification from Steve Hanson of his original query, which I
found a bit cryptic:

    Isn't cascade Correlation a version (almost exact except for splitting
    rule--although I believe CART allows for other splitting rules) of 
    CART---the decision tree with the hyperplane  feature space cuts...?

My memory of Cart is a bit fuzzy, but I think it's very different from
Cascade-Correlation.  Unless I'm confused, here are a couple of glaring
differences:

1. In a decision-tree setup like CART, each new split works within one of
the regions of space that you've already carved out -- that is, within only
one branch of the tree.  So for something like N-bit parity, you'd need 2^N
hidden units (hyperplanes).  In a single-layer backprop net, you need only
N hidden units because they are shared.  Because it creates higher-order
units, Cascade-Correlation can generally do the job in less than N.  (See
the results in the Cascade-Correlation paper.)  I don't remember if any
version of CART makes serendipitous use of hyperplanes that were created
earlier to split other branches.  I am pretty sure, however, that it works
on splitting just one branch at a time, and doesn't actively try to create
hyperplanes that are useful in splitting many branches at once.

2. If you create all your new hidden units in a single layer, all you can
do is create hyperplanes in the original space of input features.  Because
it builds up multiple layers, Cascade-Correlation can create higher-order
units of great complexity, not just hyperplanes.  If you have the tech
report on Cascade-Correlation (the diagrams had to be cut from the NIPS
version due to page limitations), look at the strange complex curves it
creates in solving the two-spirals problem.  If you prefer,
Cascade-Correlation works by raising the dimensionality of the space and
then drawing hyperplanes in this new complex space, but the projection back
onto the original input space does not look like a straight line.  I've
never heard of anyone solving the two-spirals problem with a single layer
of sigmoid or threshold units -- it would take an awful lot of them.

I think that these two differences change the game entirely.  The only
resemblance I see between CART and Cascade-Correlation is that both build
up a structure little by little, trying to add new nonlinear elements that
eliminate some part of the remaining error.  But the kinds of structures
the two algorithms deal in is qualitatively different.

-- Scott