Redundancy
Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU
Mon Oct 21 21:27:08 EDT 1991
:: I guess you could measure redundancy by seeing if some subset of the
:: training data set produces essentially the same gradient vector as the full
:: set. Probably statisticians have good ways of talking about this
:: redundancy business -- unfortunately, I don't know the right vocabulary.
: Indeed they do; however, they begin from a more general perspective:
: for a particular "n", where "n" is the number of exemplars we are going to
: train on, call a set of "n" exemplars optimal if better generalization
: cannot be obtained by training on any other set of "n" exemplars.
: This criterion is called "Integrated Mean Squared Error."
: See [Khuri & Cornell, 1987], [Box & Draper, 1987], or [Myers et al., 1989].
: Using appropriate approximations, we can use this to obtain what you suggest.
: Results for the case of clean data are currently available in
: Neuroprose in the report "plutowski.active.ps.Z", or from the UCSD
: CSE department (see [Plutowski & White, 1991]). Basically, given a set of
: candidate training examples, we select a subset which, if trained upon,
: gives a gradient highly correlated with the gradient obtained by
: training upon the entire set. This results in a concise set of exemplars
: representative (in a precise sense) of the entire set.
: Preliminary empirical results indicate that the end result is what we
: originally desired: training upon this well-chosen subset results in
: generalization close to that obtained by training upon the entire set.
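[The IMSE criterion mentioned above is, roughly, the expected squared
prediction error of the fitted network integrated over the input
distribution. To make the gradient-correlation idea concrete, here is a
minimal toy sketch: it greedily keeps examples so that the subset's mean
gradient stays highly correlated with the full-set gradient. The linear
model, data, and greedy rule are illustrative assumptions only, not the
procedure from [Plutowski & White, 1991].]

# Toy sketch (not the algorithm in plutowski.active.ps.Z): greedily pick
# training examples so that the mean gradient over the chosen subset stays
# highly correlated with the mean gradient over the full candidate set.
import numpy as np

rng = np.random.default_rng(0)

# Candidate set: clean data from a linear target, fit by a linear model.
N, d = 200, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true

w = rng.normal(size=d)                         # current weights during training
per_example_grads = (X @ w - y)[:, None] * X   # grad of 0.5*(w.x - y)^2 per example
full_grad = per_example_grads.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

chosen, chosen_sum = [], np.zeros(d)
target_corr, max_size = 0.999, 40
while len(chosen) < max_size and (not chosen or
        cosine(chosen_sum / len(chosen), full_grad) < target_corr):
    # Add the candidate whose inclusion best aligns the subset gradient
    # with the full-set gradient.
    best_i, best_c = None, -np.inf
    for i in range(N):
        if i in chosen:
            continue
        c = cosine((chosen_sum + per_example_grads[i]) / (len(chosen) + 1), full_grad)
        if c > best_c:
            best_i, best_c = i, c
    chosen.append(best_i)
    chosen_sum += per_example_grads[best_i]

print(f"kept {len(chosen)} of {N} examples, gradient correlation = "
      f"{cosine(chosen_sum / len(chosen), full_grad):.4f}")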
Thanks for the references. This is a useful beginning, but doesn't seem to
address the problem we were discussing. In many real-world problems, the
following constraints hold:
1. We do not have direct access to "the entire set". In fact, this set may
well be infinite. All we can do is collect some number of samples, and
there is usually a cost for obtaining each sample.
2. Rather than hand-crafting a training set by choosing all its elements,
we want to choose an appropriate "n" and then pick "n" samples at random
from the set we are trying to model. Of course, if collecting samples is
cheap and network training is expensive, you might throw some samples away
and not use them in the training set. I don't *think* that this would ever
improve generalization, but it might lead to faster training without
hurting generalization.
3. The data may not be "clean". The structure we are trying to model may
be masked by a lot of random noise.
Do you know of any work on how to pick an optimal "n" under these
conditions? I would guess that this sort of problem is already
well-studied in statistics; if not, it seems like a good research topic for
someone with the proper background.
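[One standard empirical workaround, sketched below under purely illustrative
assumptions (a simulated noisy linear process, a ridge-regression fit, and a
1% improvement threshold), is to grow "n", refit, and watch a held-out error
estimate, then stop paying for samples once the learning curve flattens.
This is a rough heuristic, not an answer to the optimal-"n" question raised
here.]

# Rough sketch: estimate generalization as a function of n on noisy data,
# and stop collecting samples when the held-out error stops improving.
import numpy as np

rng = np.random.default_rng(1)
d, noise_sd = 10, 0.5
w_true = rng.normal(size=d)

def draw(n):
    """Simulate paying for n noisy samples from the unknown process."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(scale=noise_sd, size=n)
    return X, y

def fit_ridge(X, y, lam=1e-3):
    # Closed-form ridge regression as a stand-in for network training.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X_test, y_test = draw(5000)            # stand-in for "true" generalization error

prev_err = np.inf
for n in [25, 50, 100, 200, 400, 800, 1600]:
    errs = []
    for _ in range(10):                # average over resampled training sets
        X, y = draw(n)
        w = fit_ridge(X, y)
        errs.append(np.mean((X_test @ w - y_test) ** 2))
    err = float(np.mean(errs))
    print(f"n = {n:5d}   held-out MSE = {err:.4f}")
    if prev_err - err < 0.01 * prev_err:   # under 1% relative improvement
        print(f"diminishing returns; n around {n} looks sufficient")
        break
    prev_err = err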
-- Scott Fahlman