BP for categorization...relative frequency problem

Dave Rumelhart der%beren at Forsythe.Stanford.EDU
Wed Aug 22 13:35:59 EDT 1990


We have also encountered the problem.  Since BP does gradient descent, and
since the contribution of any set of patterns depends in part on the
relative frequency of those patterns, fewer resources are allocated to low
frequency categories.  Moreover, those resources are allocated later in
training -- probably after over-fitting has already become a problem for
the higher frequency categories.

Of course, if your training distribution is the same as your testing
distribution, you will be getting the appropriate Bayesian estimate of the
class probabilities.  On the other hand, if the generalization distribution
is unknown at test time, we may wish to factor out the relative frequencies
of the categories during training and add any known "priors" during
generalization.  There are two ways to do this.
One way, suggested in one of the notes on this topic, is to "post process"
our output data.  That is, divide each output unit's value by the relative
frequency of its category in the training set and multiply by the relative
frequency in the test set.  This will give you an estimate of the Bayesian
probability for the test set.
For a variety of reasons, this is less appropriate than correcting during
training.  In that case, the procedure is to effectively increase the
learning rate in inverse proportion to the relative frequency of the
category in the training set.  Thus, we take bigger learning steps on low
frequency categories.  In a simple classification task, this is roughly
equivalent to normalizing the data set by sampling from each category
equally.
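
As a sketch of that per-category scaling, here is a single-layer sigmoid
network trained by the delta rule with squared error (this setup and the
names are my assumption, not a description of any particular system):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, x, t, c, base_lr, train_priors):
    # One gradient step whose learning rate is scaled inversely to the
    # relative frequency of the pattern's category c, so low frequency
    # categories take bigger steps.
    lr = base_lr / train_priors[c]
    y = sigmoid(W @ x)
    grad = np.outer((y - t) * y * (1.0 - y), x)  # squared-error gradient
    return W - lr * grad
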
In the case of cross-classification (in which a given input can be a
member of more than one class), it is roughly equivalent to weighting each
pattern inversely by the probability that that pattern would occur, given
independence between the output classes.  We have used this method
successfully in a system designed to classify mass spectra.
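
Under that independence assumption, the weight for a pattern's full label
vector might be computed as below (again my own sketch and naming; in
particular, counting a (1 - prior) factor for each absent class is my
reading of "the probability that that pattern would occur"):

import numpy as np

def pattern_weight(targets, class_priors):
    # targets:      (n_classes,) 0/1 label vector for one pattern
    # class_priors: (n_classes,) relative frequency of each class
    # Probability of this label vector under independent classes:
    p = np.where(targets == 1, class_priors, 1.0 - class_priors)
    # Weight the pattern inversely by that probability.
    return 1.0 / np.prod(p)
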
In this method, an output of .5 means that the evidence for and against
the category is equal.  In the normal training method, by contrast, an
output equal to the relative frequency of the category in the training set
means that the evidence for and against is equal -- and in some cases this
can be very small.  It is possible to add the priors back in manually
(sketched below) and compare performance on the training set with the
original method.  We find that we do only slightly worse on the training
set with the corrected method.  We do much better in generalization on
classes that were low frequency in the training set, and slightly worse on
classes that were high frequency in the training set.
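
One way to read the "add the priors back in" step, reusing the names from
the first sketch above (my reconstruction, not a procedure stated in this
note): multiply the corrected network's outputs by the training-set priors
and renormalize before scoring on the training set.

# outputs and train_priors as in reweight_outputs above
posteriors = outputs * train_priors
posteriors = posteriors / posteriors.sum(axis=1, keepdims=True)
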


                                        der

