Redundancy

Mark Plutowski pluto at cs.UCSD.EDU
Mon Oct 21 19:29:59 EDT 1991


Scott Fahlman writes:

::	
::	I guess you could measure redundancy by seeing if some subset of the
::	training data set produces essentially the same gradient vector as the full
::	set.  Probably statisticians have good ways of talking about this
::	redundancy business -- unfortunately, I don't know the right vocabulary.


Indeed they do; however, they begin from a more general perspective:
for a particular "n", where "n" is the number of exemplars we are going to
train on, call a set of "n" exemplars optimal if better generalization
cannot be obtained by training on any other set of "n" exemplars.
The criterion used to formalize this is called "Integrated Mean Squared Error" (IMSE).
See [Khuri & Cornell, 1987], [Box & Draper, 1987], or [Myers et al., 1989]. 
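
(Roughly, and with notation of my own choosing: if \hat{f}_n denotes the
network fit obtained by training on the n exemplars x_1, ..., x_n, and g
denotes the target function, then

	IMSE  =  \int E[ ( \hat{f}_n(x) - g(x) )^2 ] \, d\mu(x),

i.e., the expected squared error integrated over the input distribution
\mu; an optimal set of n exemplars is one minimizing this quantity.)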

Using appropriate approximations, we can use this to obtain what you suggest.  
Results for the case of clean data are currently available in
Neuroprose in the report "plutowski.active.ps.Z", or from the UCSD
CSE department (see [Plutowski & White, 1991]).  Basically, given a set of
candidate training examples, we select a subset which, when trained upon,
gives a gradient highly correlated with the gradient obtained by
training upon the entire set.  This results in a concise set of exemplars 
representative (in a precise sense) of the entire set.  
Preliminary empirical results indicate that the end result is what we
originally desired: training upon this well-chosen subset yields 
generalization close to that obtained by training upon the entire set.  
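
To give the flavor of the selection step, here is a small greedy sketch
(illustrative only; the helper names and the greedy strategy are mine for
this post, not the exact procedure of the report):

import numpy as np

def select_subset(per_example_grads, k):
    # per_example_grads: (N, D) array, one flattened gradient per candidate example.
    # Greedily pick k examples whose summed gradient best correlates
    # (cosine) with the gradient summed over all N candidates.
    full_grad = per_example_grads.sum(axis=0)
    chosen, subset_grad = [], np.zeros_like(full_grad)
    remaining = list(range(len(per_example_grads)))
    for _ in range(k):
        best_i, best_corr = None, -np.inf
        for i in remaining:
            g = subset_grad + per_example_grads[i]
            corr = np.dot(g, full_grad) / (np.linalg.norm(g) * np.linalg.norm(full_grad) + 1e-12)
            if corr > best_corr:
                best_corr, best_i = corr, i
        chosen.append(best_i)
        subset_grad += per_example_grads[best_i]
        remaining.remove(best_i)
    return chosen

The point is only that candidates are scored by how well the subset's
gradient lines up with the full-set gradient; the report derives the
selection criterion more carefully.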



Tom Dietterich writes:

::	
::	There has been a fair amount of work in decision-tree learning on the
::	issue of breaking large training sets into smaller batches.  In 1980,
::	Quinlan introduced a method called "windowing" in which a small sample
::	(or window) of the training data is initially drawn at random.  The
::	algorithm is trained on this window and then tested on the remainder of
::	the data (that was excluded from the window).  Then, some fraction of
::	the misclassified examples (possibly all of them) are added to the
::	window.
::	
::	Generally speaking, in noise-free domains, windowing works quite well.
::	A very high-performing decision tree can be learned with a relatively
::	small window.  However, for noisy data, the general experience has
::	been that the window eventually grows to include the entire training set.
::	Jason Catlett (Sydney U) recently completed his dissertation on
::	testing windowing and various other related tricks on datasets of
::	roughly 100K examples (straight classification problems).  I recommend
::	his papers and thesis.
::	
::	His main conclusion is that if you want high performance, you need to
::	look at all of the data.
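
For concreteness, here is my rough paraphrase of the windowing loop
described above (the helper names build_tree and misclassified are
placeholders; this is not meant as Quinlan's exact procedure):

import random

def windowing(examples, build_tree, misclassified, window_size, add_fraction=1.0):
    # Draw an initial random window, train on it, test on the remainder,
    # and fold some fraction of the misclassified examples back into the window.
    window = random.sample(examples, window_size)
    rest = [e for e in examples if e not in window]
    while True:
        tree = build_tree(window)
        errors = misclassified(tree, rest)
        if not errors:
            return tree          # consistent with everything outside the window
        to_add = errors[:max(1, int(add_fraction * len(errors)))]
        window.extend(to_add)
        rest = [e for e in rest if e not in to_add]

In noise-free domains this typically terminates with a window much smaller
than the full training set; with noisy data, as noted above, the window
tends to grow until it includes essentially everything.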
	

Could you provide a reference to the work demonstrating the performance 
of windowing on clean data?  And could you provide an e-mail address for
Jason Catlett?  I am in the process of setting up benchmarking experiments
for the technique I mentioned above.  Although I am considering the more
general task of fitting arbitrary functional mappings, these works seem relevant.

Thanks,

=================
== Mark Plutowski
Computer Science and Engineering 0114
University of California, San Diego
La Jolla,  CA



-----------
REFERENCES:
-----------

Box, G., and N. Draper. 1987.
	{\bf Empirical Model-Building and Response Surfaces.}
	Wiley, New York. 

Khuri, A.I., and J.A. Cornell. 1987.
	{\bf Response Surfaces (Designs and Analyses)}.
	Marcel Dekker, Inc., New York. 

Myers, Raymond H., A.I. Khuri, and W.H. Carter, Jr. 1989. 
	``Response Surface Methodology: 1966-1988.'' 
	{\em Technometrics}, vol. 31, no. 2.

Plutowski, Mark E., and Halbert White. 1991.
	``Active selection of training examples for network learning 
	in noiseless environments.''  
	Technical Report No. CS91-180, 
	Department of Computer Science and Engineering, 
	The University of California, San Diego. 92093-0114.
	Accepted pending revision by IEEE Transactions on Neural Networks.



	---- Here are some other related works: --------


Cohn, David, Les Atlas, and Richard Ladner. 1990.
	``Training connectionist networks with queries and selective sampling.''
	{\em Advances in Neural Information Processing Systems 2,}
	Proc. of the Neural Information Processing Systems Conference.
	Morgan Kaufmann, San Mateo, California.
	
Hwang, Jenq-Neng, J.J. Choi, Seho Oh, and Robert J. Marks III. 1990.
	``Query learning based on boundary search and gradient 
	computation of trained multilayer perceptrons.'' 
	{\em Proc. IJCNN 1990, San Diego. The 
	International Joint Conference on Neural Networks.}  IEEE Press.  


