Multiple Models, Committee of nets etc...

Michael P. Perrone mpp at cns.brown.edu
Thu Jul 29 02:43:58 EDT 1993


For those interested in the recent discussion of Multiple Models, Committees,
etc., the following references may be of interest.  The first three references
deal directly with the issues that have recently been discussed on Connectionists.
The salient contributions of these papers are:

 1) A very general result which proves that averaging ALWAYS improves optimization 
 performance for a broad class of (convex) optimization problems, including MSE, MLE,
 Maximum Entropy, Maximum Mutual Information, Splines, HMMs, etc.  This is a result
 about the topology of the optimization measure and is independent of the underlying
 data distribution, learning algorithm or network architecture.

 2) A closed-form solution for the optimal weighted average of a set of regression
 estimates (here I regard density estimation and classification as special cases of
 regression) for a given cross-validation set under MSE optimization; a minimal
 sketch of this weighting appears after this list.  It should be noted that the
 solution may suffer from over-fitting when the CV set is not representative of the
 true underlying distribution.  However, the solution is amenable to ridge regression
 and a wide variety of heuristic robustification techniques.

 3) Experiments on real-world datasets (NIST OCR data, human face data and time-series
 data) which demonstrate the improvement due to averaging.  The improvement is so
 dramatic that in most cases the average estimator performs significantly better than
 the best individual estimator.  (It is important to note that the CV performance of
 a network is not a guaranteed predictor of its performance on an independent test
 set, so the network with the best performance on the CV set may not have the best
 performance on the test set; however, in practice, even when CV performance is a
 good predictor of test-set performance, the average estimator usually performs
 better.)

 4) Numerous extensions, including bootstrapped and jackknifed neural net generation
 (see the bootstrap sketch after this list) and averaging over "hyperparameters" such
 as architectures, priors and/or regularizers.

 5) An interpretation of averaging, in the case of MSE optimization, as a
 regularizer which performs smoothing by variance reduction.  This implies that
 averaging has no effect on the bias of the estimators: for a given population of
 estimators, the bias of the average estimator will be the same as the expected bias
 of any estimator in the population (the short derivation after this list spells this
 out).

 6) A very natural definition of the number of "distinct" estimators in a population,
 which emphasizes two points: (a) local minima are not necessarily a bad thing!
 We can actually USE LOCAL MINIMA TO IMPROVE PERFORMANCE; and (b) there is an
 important distinction between the number of local minima in parameter space and
 the number of local minima in function space.  Function space is what we are
 really concerned with, and empirically, averaging suggests that there are not
 that many "distinct" local minima in trained populations.  Therefore, one direction
 for future work is to devise ways of generating as many "distinct" estimators as
 possible.
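
As a rough illustration of points 1 and 2, here is a minimal numpy sketch of combining
a fixed set of regression estimates.  It assumes the MSE-optimal weights take the
familiar inverse-error-correlation form on the cross-validation set, with an optional
ridge term standing in for the robustification mentioned above; the exact formulation
is in the references, and the function names below are illustrative only.

    import numpy as np

    def optimal_average_weights(cv_predictions, cv_targets, ridge=0.0):
        """MSE-optimal combination weights estimated on a cross-validation set.

        cv_predictions : (n_estimators, n_cv_points) array, one row per network.
        cv_targets     : (n_cv_points,) array of CV targets.
        ridge          : optional constant added to the diagonal of the error
                         correlation matrix to guard against over-fitting when
                         the CV set is small or unrepresentative.
        """
        errors = cv_predictions - cv_targets          # per-network errors on the CV set
        C = errors @ errors.T / errors.shape[1]       # error correlation matrix
        C = C + ridge * np.eye(C.shape[0])
        w = np.linalg.solve(C, np.ones(C.shape[0]))   # proportional to C^{-1} 1
        return w / w.sum()                            # normalize to sum to one

    def average_estimator(predictions, weights=None):
        """Combine (n_estimators, n_points) predictions; uniform weights by default."""
        if weights is None:
            weights = np.full(predictions.shape[0], 1.0 / predictions.shape[0])
        return weights @ predictions

With weights=None this is the plain average of point 1; the weighted version reduces to
it when the networks' CV errors are uncorrelated and of comparable size.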
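
For the bootstrapped generation in point 4, a toy sketch along the following lines may
help; least-squares fits on random tanh features stand in for trained networks, and the
base learner and all names here are purely illustrative, not the method of the papers.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_base_learner(x, y, n_hidden=16):
        """Stand-in for training one network: least squares on random tanh features."""
        a = rng.normal(size=n_hidden)             # random input weights, kept fixed
        b = rng.normal(size=n_hidden)             # random biases
        phi = np.tanh(np.outer(x, a) + b)         # (n_points, n_hidden) feature matrix
        w, *_ = np.linalg.lstsq(phi, y, rcond=None)
        return lambda xq: np.tanh(np.outer(xq, a) + b) @ w

    def bootstrap_ensemble(x, y, n_members=10):
        """Fit each member on a bootstrap resample of the training data."""
        members = []
        for _ in range(n_members):
            idx = rng.integers(0, len(x), size=len(x))   # sample with replacement
            members.append(fit_base_learner(x[idx], y[idx]))
        return members

    # Toy usage: noisy sine data; compare one member against the ensemble average.
    x = np.linspace(-3.0, 3.0, 80)
    y = np.sin(x) + 0.3 * rng.normal(size=x.size)
    members = bootstrap_ensemble(x, y)
    preds = np.stack([f(x) for f in members])            # (n_members, n_points)
    print("single member MSE :", np.mean((preds[0] - np.sin(x)) ** 2))
    print("ensemble avg. MSE :", np.mean((preds.mean(axis=0) - np.sin(x)) ** 2))

Jackknife generation is the same idea with leave-one-out (or leave-k-out) subsets in
place of the bootstrap resamples.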
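
The bias claim in point 5 follows from the standard decomposition for the simple
average; this is just the usual argument, stated here for convenience.  Writing
\bar{f}(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x) for the population average,

    \mathrm{E}[\bar{f}(x)] - t(x) = \frac{1}{N}\sum_{i=1}^{N}
        \left( \mathrm{E}[f_i(x)] - t(x) \right),

so the bias of the average estimator equals the mean bias of the population, while

    \mathrm{Var}[\bar{f}(x)] = \frac{1}{N^2}\sum_{i,j}\mathrm{Cov}[f_i(x), f_j(x)],

which drops to 1/N times the average individual variance when the estimators' errors
are uncorrelated.  All of the gain therefore comes from variance reduction, which is
the sense in which averaging acts as a smoothing regularizer.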

The other three references deal with what I consider to be the flip side of the
same coin: on one side is the problem of combining networks, on the other is the
problem of generating networks.  These references explore neural-net-motivated
divide-and-conquer heuristics within the CART framework.

Enjoy!

Michael
--------------------------------------------------------------------------------
Michael P. Perrone                                      Email: mpp at cns.brown.edu
Institute for Brain and Neural Systems                  Tel:   401-863-3920
Brown University                                        Fax:   401-863-3934
Providence, RI 02912
--------------------------------------------------------------------------------

@phdthesis{Perrone93,
   AUTHOR    = {Michael P. Perrone},
   TITLE     = {Improving Regression Estimation: Averaging Methods for Variance Reduction
       with Extensions to General Convex Measure Optimization},
   YEAR      = {1993},
   MONTH     = {May},
   SCHOOL    = {Brown University, Institute for Brain and Neural Systems},
   NOTE      = {Thesis supervisor: Leon N Cooper}
}

@incollection{PerroneCooper93CAIP,
   AUTHOR    = {Michael P. Perrone and Leon N Cooper},
   TITLE     = {When Networks Disagree: Ensemble Method for Neural Networks},
   BOOKTITLE = {Neural Networks for Speech and Image Processing},
   EDITOR    = {R. J. Mammone},
   PUBLISHER = {Chapman-Hall},
   ADDRESS   = {London},
   YEAR      = {1993},
   NOTE      = {To appear}
}

@inproceedings{PerroneCooper93WCNN,
   AUTHOR    = {Michael P. Perrone and Leon N Cooper},
   TITLE     = {Learning from What's Been Learned: Supervised Learning in Multi-Neural Network Systems},
   BOOKTITLE = {Proceedings of the World Conference on Neural Networks},
   YEAR      = {1993},
   PUBLISHER = {INNS}
}

---------------------

@inproceedings{Perrone91,
   AUTHOR    = {M. P. Perrone},
   TITLE     = {A Novel Recursive Partitioning Criterion},
   BOOKTITLE = {Proceedings of the International Joint Conference on Neural Networks},
   YEAR      = {1991},
   PUBLISHER = {IEEE},
   PAGES     = {989},
   VOLUME    = {II}
}

@inproceedings{Perrone92,
   AUTHOR    = {M. P. Perrone},
   TITLE     = {A Soft-Competitive Splitting Rule for Adaptive Tree-Structured Neural Networks},
   BOOKTITLE = {Proceedings of the International Joint Conference on Neural Networks},
   YEAR      = {1992},
   PUBLISHER = {IEEE},
   PAGES     = {689--693},
   VOLUME    = {IV}
}

@inproceedings{PerroneIntrator92,
   AUTHOR    = {M. P. Perrone and N. Intrator},
   TITLE     = {Unsupervised Splitting Rules for Neural Tree Classifiers},
   BOOKTITLE = {Proceedings of the International Joint Conference on Neural Networks},
   YEAR      = {1992},
   PUBLISHER = {IEEE},
   PAGES     = {820--825},
   VOLUME    = {III}
}


