Redundancy
Mark Plutowski
pluto at cs.UCSD.EDU
Mon Oct 21 21:54:29 EDT 1991
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
..in response to your message, included here:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
:: To: Mark Plutowski <pluto at cs.UCSD.EDU>
:: Cc: connectionists at CS.CMU.EDU
:: Subject: Re: Redundancy
:: In-Reply-To: Your message of Mon, 21 Oct 91 16:29:59 -0800.
:: <9110212329.AA12326 at tournesol.ucsd.edu>
:: Date: Mon, 21 Oct 91 21:27:08 -0400
:: From: Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
::
:: :: I guess you could measure redundancy by seeing if some subset of the
:: :: training data set produces essentially the same gradient vector as the full
:: :: set. Probably statisticians have good ways of talking about this
:: :: redundancy business -- unfortunately, I don't know the right vocabulary.
::
:: Indeed they do; however, they begin from a more general perspective:
:: for a particular "n", where "n" is the number of exemplars we are going to
:: train on, call a set of "n" exemplars optimal if better generalization
:: cannot be obtained by training on any other set of "n" exemplars.
:: This criterion is called "Integrated Mean Squared Error."
:: See [Khuri & Cornell, 1987], [Box & Draper, 1987], or [Myers et al., 1989].
::
:: Using appropriate approximations, we can use this to obtain what you suggest.
:: Results for the case of clean data are currently available in
:: Neuroprose in the report "plutowski.active.ps.Z", or from the UCSD
:: CSE department (see [Plutowski & White, 1991]). Basically, given a set of
:: candidate training examples, we select a subset which, if trained upon,
:: gives a gradient highly correlated with the gradient obtained by
:: training upon the entire set. This results in a concise set of exemplars
:: representative (in a precise sense) of the entire set.
:: Preliminary empirical results indicate that the end result is what we
:: originally desired: training upon this well chosen subset results in
:: generalization close to that obtained by training upon the entire set.
::
:: Thanks for the references. This is a useful beginning, but doesn't seem to
:: address the problem we were discussing. In many real-world problems, the
:: following constraints hold:
::
:: 1. We do not have direct access to "the entire set". In fact, this set may
:: well be infinite. All we can do is collect some number of samples, and
:: there is usually a cost for obtaining each sample.
::
:: 2. Rather than hand-crafting a training set by choosing all its elements,
:: we want to choose an appropriate "n" and then pick "n" samples at random
:: from the set we are trying to model. Of course, if collecting samples is
:: cheap and network training is expensive, you might throw some samples away
:: and not use them in the training set. I don't *think* that this would ever
:: improve generalization, but it might lead to faster training without
:: hurting generalization.
::
:: 3. The data may not be "clean". The structure we are trying to model may
:: be masked by a lot of random noise.
::
:: Do you know of any work on how to pick an optimal "n" under these
:: conditions? I would guess that this sort of problem is already
:: well-studied in statistics; if not, it seems like a good research topic for
:: someone with the proper background.
::
:: -- Scott Fahlman
::
I don't know of a feasible way of choosing such an "n".
Instead, I obtain a greedy approximation to it.
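To fix notation (my paraphrase, not the exact formulation in the references
cited above): write g for the target function, \hat{f}_n for the network fit
obtained by training on exemplars x_1,...,x_n, and \mu for a weighting
distribution over the inputs of interest. Then, roughly,

    IMSE(x_1,...,x_n) = \int E[ ( \hat{f}_n(x) - g(x) )^2 ] d\mu(x),

with the expectation taken over whatever randomness enters the data and the
fit; an "optimal" size-n training set is one minimizing this quantity over
all candidate sets of size n. The greedy procedure described next only
approximates that.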
What we do (as reported in the tech report by Plutowski & White) is
sequentially grow the training set: first find an "optimal" training set of
size 1 and fit the network to it; then append a new exemplar, selected from
the set of available candidates, to obtain an "approximately optimal"
training set of size 2, and fit that; then append a third exemplar, and so
on, continuing until the network trained on the selected exemplars fits the
rest of the available examples within the desired tolerance.
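For concreteness, here is a minimal sketch of that loop. It is illustrative
only: the function names are mine, and the cosine-correlation selection rule
and stopping test here are stand-ins for the criterion actually derived from
the IMSE approximation in the tech report, which differs in detail.

    import numpy as np

    def grow_training_set(candidates, gradient_fn, fit_fn, error_fn, tolerance):
        """Greedily grow a training set drawn from `candidates`.

        gradient_fn(examples) -> gradient (numpy vector) of the current fit on `examples`
        fit_fn(examples)      -> (re)train the network on `examples`
        error_fn(examples)    -> worst-case error of the current fit on `examples`
        """
        chosen = []                                   # indices of selected exemplars
        remaining = list(range(len(candidates)))      # indices not yet selected
        while remaining:
            full_grad = gradient_fn(candidates)       # gradient over everything available
            def alignment(i):
                # Correlation between the gradient on the grown subset and the
                # gradient on the full candidate set, if exemplar i were added.
                g = gradient_fn([candidates[j] for j in chosen + [i]])
                return float(np.dot(g, full_grad)
                             / (np.linalg.norm(g) * np.linalg.norm(full_grad) + 1e-12))
            best = max(remaining, key=alignment)
            chosen.append(best)
            remaining.remove(best)
            fit_fn([candidates[j] for j in chosen])   # refit on the grown training set
            rest = [candidates[j] for j in remaining]
            if not rest or error_fn(rest) <= tolerance:
                break                                 # remaining examples already fit well
        return [candidates[j] for j in chosen]

Note that this naive version recomputes the full-candidate gradient at every
step purely to illustrate the selection criterion; the computational benefit
discussed below comes from the subsequent training passes, which only need
gradients over the concise subset.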
I have no idea how close the resulting training sets are to being truly
IMSE-optimal. But they are much more concise than the original set, and so
far, at least on the toy problems I have tried, the procedure has yielded a
computational benefit, apparently because training on the smaller set of
exemplars provides an informative gradient at much lower cost than computing
a gradient over all of the available examples. The more redundancy there is
in the data, the greater the computational benefit. Of course, more
extensive testing is required (and is in progress).
= Mark Plutowski