redundancy (was Re: batch-mode implementations)
honavar@iastate.edu
Sat Oct 19 13:30:33 EDT 1991
Scott Fahlman wrote:
>>I guess you could measure redundancy by seeing if some subset of the
>>training data set produces essentially the same gradient vector as the full
>>set.
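Fahlman's proposed measure can be sketched concretely: compare the gradient computed on a random subset against the full-batch gradient, e.g. via cosine similarity. The least-squares loss, the data, and the function names below are my own illustrative choices, not anything from the original discussion.

```python
# Sketch (illustrative, not from the post): estimate training-set
# redundancy by how closely a subset's gradient matches the full one.
import numpy as np

def gradient(w, X, y):
    # Gradient of mean squared error 0.5 * ||Xw - y||^2 / n w.r.t. w.
    return X.T @ (X @ w - y) / len(y)

def subset_gradient_agreement(w, X, y, frac=0.3, seed=0):
    # Cosine similarity between subset and full-batch gradients;
    # values near 1.0 suggest the subset carries most of the signal.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=max(1, int(frac * len(y))), replace=False)
    g_full = gradient(w, X, y)
    g_sub = gradient(w, X[idx], y[idx])
    return float(g_full @ g_sub /
                 (np.linalg.norm(g_full) * np.linalg.norm(g_sub)))

# Redundant data: 1000 noisy samples of one underlying linear relation.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.01 * rng.normal(size=1000)
w0 = np.zeros(5)
print(subset_gradient_agreement(w0, X, y))  # should be close to 1.0 here
```

On data this redundant, a 30% subset reproduces the full gradient direction almost exactly, which is the sense in which the remaining examples add little.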
Yann Le Cun responded:
> Hmmm, I think any dataset for which you expect good generalization is redundant.
> Train your net on 30% of the dataset, and measure how many of the remaining
> 70% you get right. If you get a significant portion of them right, then
> accumulating gradients on these examples (without updating the weights) would
> be little more than a waste of time.
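Le Cun's 30%/70% check is easy to simulate. The nearest-centroid classifier and the synthetic two-class data below are my own illustrative choices; the post itself does not specify a model.

```python
# Sketch (illustrative): train on 30% of the data, measure how much of
# the held-out 70% is already predicted correctly.
import numpy as np

def nearest_centroid_fit(X, y):
    # One centroid per class.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, X):
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1)
                      for c in labels])
    return np.array(labels)[dists.argmin(axis=0)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.repeat([0, 1], 500)
perm = rng.permutation(1000)
train, test = perm[:300], perm[300:]   # 30% train, 70% held out

model = nearest_centroid_fit(X[train], y[train])
acc = (nearest_centroid_predict(model, X[test]) == y[test]).mean()
print(f"accuracy on held-out 70%: {acc:.2f}")
```

A high score on the held-out 70% is exactly the situation Le Cun describes: accumulating gradients over those examples would contribute little beyond what the 30% already provides.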
It is probably useful to distinguish between redundancy WITHIN the training set
and redundancy BETWEEN the training and test sets (or, redundancy in
the combined training and test sets). I suspect Scott Fahlman was
referring to the redundancy (R1) within the training set, while Le Cun
was referring to the redundancy (R2) in the set formed by the union of the
training set and test set (please correct me if I am wrong). I would
expect the relationship between generalization and R1 to be quite different
from the relationship between generalization and R2.
Whether the two measures of redundancy will be the same or not will almost
certainly depend on the method(s) (e.g., sampling procedures, sample size
reduction techniques) used to arrive at the data actually given to the
network during training.
In fact, suppose a training set T (obtained, say, by random sampling
from some underlying distribution) were preprocessed in
some fashion (e.g., using statistical techniques) and a reduced
training set T' obtained from T by eliminating the "redundant" samples.
Clearly the redundancy (R1') within the reduced training set T' will be much
smaller than the redundancy (R1) in the original training set T, although the
overall redundancy (R2) in the set formed by the union of T and the test data
may be more or less equal to the redundancy (R2') in the set formed by the
union of T' and the test data. My guess is that the generalization on the test
data will be more or less the same irrespective of whether T or T' is used for
training the network.
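One simple way such a T → T' reduction could look (my own greedy scheme, purely for illustration, with a hypothetical distance threshold `eps`): keep a sample only if it is farther than `eps` from every sample already kept.

```python
# Sketch (illustrative): reduce T to T' by dropping near-duplicate samples.
import numpy as np

def reduce_training_set(X, eps=0.5):
    # Greedily keep a point only if no already-kept point is within eps.
    kept = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) > eps for j in kept):
            kept.append(i)
    return np.array(kept)

# Build a highly redundant T: five noisy copies of 20 base points.
rng = np.random.default_rng(1)
base = rng.normal(size=(20, 3))
X = np.vstack([base + 0.01 * rng.normal(size=(20, 3)) for _ in range(5)])
idx = reduce_training_set(X, eps=0.5)
print(len(X), "->", len(idx))   # T' is much smaller than T
```

Here R1' within T' is far lower than R1 within T, yet whatever overlap T had with the test distribution is largely preserved, which is the scenario the paragraph above considers.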
Vasant Honavar
honavar at iastate.edu