Redundancy
Mark Plutowski
pluto at cs.UCSD.EDU
Mon Oct 21 21:54:29 EDT 1991
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
..in response to your message, included here:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
:: To: Mark Plutowski <pluto at cs.UCSD.EDU>
:: Cc: connectionists at CS.CMU.EDU
:: Subject: Re: Redundancy
:: In-Reply-To: Your message of Mon, 21 Oct 91 16:29:59 -0800.
:: <9110212329.AA12326 at tournesol.ucsd.edu>
:: Date: Mon, 21 Oct 91 21:27:08 -0400
:: From: Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
::
:: :: I guess you could measure redundancy by seeing if some subset of the
:: :: training data set produces essentially the same gradient vector as the full
:: :: set. Probably statisticians have good ways of talking about this
:: :: redundancy business -- unfortunately, I don't know the right vocabulary.
::
:: Indeed they do; however, they begin from a more general perspective:
:: for a particular "n", where "n" is the number of exemplars we are going to
:: train on, call a set of "n" exemplars optimal if better generalization
:: cannot be obtained by training on any other set of "n" exemplars.
:: This criterion is called "Integrated Mean Squared Error."
:: See [Khuri & Cornell, 1987], [Box & Draper, 1987], or [Myers et al., 1989].
::
:: Using appropriate approximations, we can use this to obtain what you suggest.
:: Results for the case of clean data are currently available in
:: Neuroprose in the report "plutowski.active.ps.Z", or from the UCSD
:: CSE department (see [Plutowski & White, 1991]). Basically, given a set of
:: candidate training examples, we select a subset which, if trained upon,
:: gives a gradient highly correlated with the gradient obtained by
:: training upon the entire set. This results in a concise set of exemplars
:: representative (in a precise sense) of the entire set.
:: Preliminary empirical results indicate that the end result is what we
:: originally desired: training upon this well chosen subset results in
:: generalization close to that obtained by training upon the entire set.
::
:: Thanks for the references. This is a useful beginning, but doesn't seem to
:: address the problem we were discussing. In many real-world problems, the
:: following constraints hold:
::
:: 1. We do not have direct access to "the entire set". In fact, this set may
:: well be infinite. All we can do is collect some number of samples, and
:: there is usually a cost for obtaining each sample.
::
:: 2. Rather than hand-crafting a training set by choosing all its elements,
:: we want to choose an appropriate "n" and then pick "n" samples at random
:: from the set we are trying to model. Of course, if collecting samples is
:: cheap and network training is expensive, you might throw some samples away
:: and not use them in the training set. I don't *think* that this would ever
:: improve generalization, but it might lead to faster training without
:: hurting generalization.
::
:: 3. The data may not be "clean". The structure we are trying to model may
:: be masked by a lot of random noise.
::
:: Do you know of any work on how to pick an optimal "n" under these
:: conditions? I would guess that this sort of problem is already
:: well-studied in statistics; if not, it seems like a good research topic for
:: someone with the proper background.
::
:: -- Scott Fahlman
::
I don't know of a feasible way of choosing such an "n".
Instead, I obtain a greedy approximation to it.
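To fix notation (my paraphrase, not the exact formulation in the references
cited above): write g for the target function, \hat{f}_n for the network fit
obtained by training on exemplars x_1,...,x_n, and \mu for a weighting
distribution over the inputs of interest. Then, roughly,

    IMSE(x_1,...,x_n) = \int E[ ( \hat{f}_n(x) - g(x) )^2 ] d\mu(x),

with the expectation taken over whatever randomness enters the data and the
fit; an "optimal" size-n training set is one minimizing this quantity over
all candidate sets of size n. The greedy procedure described next only
approximates that.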
What we do (as reported in the tech report by Plutowski & White) is
sequentially grow the training set: first find an "optimal" training set of
size 1 and fit the network to it; then append a new exemplar, selected from
the set of available candidates, to obtain an "approximately optimal"
training set of size 2, and fit that; then append a third exemplar, and so
on, continuing until the network trained on the selected exemplars fits the
rest of the available examples within the desired tolerance.
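For concreteness, here is a minimal sketch of that loop. It is illustrative
only: the function names are mine, and the cosine-correlation selection rule
and stopping test here are stand-ins for the criterion actually derived from
the IMSE approximation in the tech report, which differs in detail.

    import numpy as np

    def grow_training_set(candidates, gradient_fn, fit_fn, error_fn, tolerance):
        """Greedily grow a training set drawn from `candidates`.

        gradient_fn(examples) -> gradient (numpy vector) of the current fit on `examples`
        fit_fn(examples)      -> (re)train the network on `examples`
        error_fn(examples)    -> worst-case error of the current fit on `examples`
        """
        chosen = []                                   # indices of selected exemplars
        remaining = list(range(len(candidates)))      # indices not yet selected
        while remaining:
            full_grad = gradient_fn(candidates)       # gradient over everything available
            def alignment(i):
                # Correlation between the gradient on the grown subset and the
                # gradient on the full candidate set, if exemplar i were added.
                g = gradient_fn([candidates[j] for j in chosen + [i]])
                return float(np.dot(g, full_grad)
                             / (np.linalg.norm(g) * np.linalg.norm(full_grad) + 1e-12))
            best = max(remaining, key=alignment)
            chosen.append(best)
            remaining.remove(best)
            fit_fn([candidates[j] for j in chosen])   # refit on the grown training set
            rest = [candidates[j] for j in remaining]
            if not rest or error_fn(rest) <= tolerance:
                break                                 # remaining examples already fit well
        return [candidates[j] for j in chosen]

Note that this naive version recomputes the full-candidate gradient at every
step purely to illustrate the selection criterion; the computational benefit
discussed below comes from the subsequent training passes, which only need
gradients over the concise subset.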
I have no idea how close the resulting training sets are to being truly
IMSE-optimal. But they are much more concise than the original set, and so
far, at least on the toy problems I have tried, the procedure has yielded a
computational benefit, apparently because training on the smaller set of
exemplars provides an informative gradient at much lower cost than computing
a gradient over all of the available examples. The more redundancy there is
in the data, the greater the computational benefit. Of course, more
extensive testing is required (and is in progress).
= Mark Plutowski