MacKay's recent work and feature selection

dhw@santafe.edu
Tue Aug 10 16:29:07 EDT 1993


Recently David MacKay made a posting concerning a technique he used to
win an energy prediction competition. Parts of that technique have
been done before (e.g., combining generalizers via validation set
behavior). However, other parts are both novel and very interesting.
This posting concerns the "feature selection" aspect of his technique,
which I understand MacKay developed in association w/ Radford Neal.
(Note: MacKay prefers to call the technique "automatic relevance
determination"; nothing I'll discuss here will be detailed enough for
that distinction to be important though.)

What I will say grew out of conversations w/ David Rosen and Tom
Loredo, in part. Of course, any stupid or silly aspects to what I will
say should be assumed to originate w/ me.

***

Roughly speaking, MacKay implemented feature selection in a neural net
framework as follows:

1) Define a potentially different "weight decay constant" (i.e.,
regularization hyperparameter) for each input neuron. The idea is
that one wants to have those constants set high for input neurons
representing "features" of the input vector which it behooves us to
ignore.

2) One way to set those hyperparameters would be via a technique like
cross-validation. MacKay instead set them via maximum likelihood,
i.e., he set the weight decay constants alpha_i to those values
maximizing P(data | alpha_i). Given a reasonably flat prior
P(alpha_i), this is essentially equivalent to finding the maximum a
posteriori (MAP) alpha_i, i.e., the alpha_i maximizing P(alpha_i |
data). (A minimal sketch of this setup appears just after this list.)

3) Empirically, David found that this worked very well. (I.e., he won
the competition.)
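
(For concreteness, here is a minimal sketch, in Python/NumPy and
emphatically not MacKay's actual code, of the kind of objective this
setup implies for a single-hidden-layer net. The names W1, W2 and
alpha are illustrative only.)

import numpy as np

def penalized_loss(W1, W2, alpha, X, y):
    """Training-set error plus a separate weight-decay term per input.

    W1[i, :] are the weights leaving input neuron i, and alpha[i] is
    that input's weight-decay constant (large alpha[i] means "mostly
    ignore feature i").  X has shape (N, n_in), y has shape (N, 1).
    """
    H = np.tanh(X @ W1)                              # hidden activations
    pred = H @ W2                                    # network output
    err = np.sum((pred - y) ** 2)                    # training-set error
    decay = np.sum(alpha * np.sum(W1 ** 2, axis=1))  # per-input penalty
    return err + decay

In MacKay's scheme the alpha_i entering this objective are then set by
maximizing P(data | alpha_i), rather than by hand or by
cross-validation; the sketch only fixes the form of the term they
multiply.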

***

This neat idea makes some interesting suggestions:

1) The first grows out of "blurring" the distinction between
parameters (i.e., weights w_j) and hyperparameters (the
alpha_i). Given such squinting, MacKay's procedure amounts to a sort
of "greedy MAP". First he sets one set of parameters to its MAP values
(the alpha_i), and then with those values fixed, he sets the other
parameters (the w_j) to their MAP values (this is done via the usual
back-propagation w/ weight-decay, which we can do since the first
stage set the weight decay constants). In general, the resultant
system will not be at the global MAP maximizing P(alpha_i, w_j | data).
In essence, a sort of extra level of regularization has been
added. (Note: Radford Neal informs me that calculationally, in the
procedure MacKay used, the second MAP step is "automatic", in the
sense that one has already made the necessary calculations to perform
that step when one carries out the first MAP step.)

Of course, viewing the technique from this "blurred" perspective is
a bit of a fudge, since hyperparameters are not the same thing as
parameters. Nonetheless, this view suggests some interesting new
techniques. E.g., first set the weights leading into hidden layer 1 to
their MAP values (or maximum likelihood values, for that matter). Then,
with those values fixed, do the same to the weights in the second
layer, etc.  Another reason to consider this layer-by-layer technique
is that the weights connecting different layers should in general be
treated differently in training; e.g., as MacKay has pointed out, one
should have different weight-decay constants for the different
layers.
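
To make the "greedy MAP" reading concrete, here is a hedged sketch of
the alternation it suggests (illustrative Python only; in particular
the alpha update below is a crude stand-in for MacKay's evidence
maximization, ignoring his "effective number of well-determined
parameters" correction). Stage one picks the alpha_i from the current
weights; stage two holds them fixed and runs ordinary gradient descent
on {error + per-input weight decay}.

import numpy as np

def fit_greedy(X, y, n_hid=5, n_outer=20, n_inner=200, lr=1e-3, seed=0):
    """Alternate between setting the per-input decay constants alpha
    and doing penalized backprop with those constants held fixed.
    X: (N, n_in) inputs; y: (N, 1) targets."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
    W2 = rng.normal(scale=0.1, size=(n_hid, 1))
    k = n_hid                      # number of weights leaving each input
    alpha = np.ones(n_in)
    for _ in range(n_outer):
        # "Stage 1": set each input's decay constant from its current
        # fan-out weights (a crude fixed-point rule, not the full
        # evidence calculation).
        alpha = k / (np.sum(W1 ** 2, axis=1) + 1e-8)
        # "Stage 2": with alpha fixed, ordinary backprop w/ weight decay.
        for _ in range(n_inner):
            H = np.tanh(X @ W1)
            err = H @ W2 - y
            gW2 = H.T @ err
            gW1 = X.T @ ((err @ W2.T) * (1.0 - H ** 2)) + alpha[:, None] * W1
            W1 -= lr * gW1
            W2 -= lr * gW2
    return W1, W2, alpha

The same skeleton could be applied layer by layer: fit the first
layer's weights, freeze them, then fit the second layer's, and so on.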

2) Another interesting suggestion comes from justifying the technique
not as a priori reasonable, but rather as an approximation to a full
"hierarchical" Bayesian technique, in which one writes

P(w_j | data)   (i.e., the ultimate object of interest)
    prop. to
integral d(alpha_i) P(data | w_j, alpha_i) P(w_j | alpha_i) P(alpha_i).

Note that all 3 distributions occurring in this integrand must be set
in order to use MacKay's technique. (The by-now-familiar object of
contention between MacKay and myself concerns how generically this
approximation will be valid, and whether one should explicitly test
its validity when one claims that it holds. That issue isn't pertinent
to the current discussion, however.)

Let's assume the approximation is very good. Then, under the
assumptions that:
i) P(alpha_i) is flat enough to be ignored;
ii) the distribution P(w_j | alpha_i) is a product of Gaussians (each
Gaussian being for those w_j connecting to input neuron i, i.e., for
those weights using weight decay constant alpha_i);

what MacKay did is equivalent to back-propagation with
weight-decay, where rather than minimizing

{training set error} + constant x {sum over all j (w_j)^2},

as in conventional weight decay, MacKay is minimizing (something like)

{training set error} +
   {sum over i [ (number of weights connecting to neuron i) x
        ln ( sum over those j whose weights connect to neuron i of (w_j)^2 ) ]}.

(The logarithm arises from doing the alpha_i integral: with a roughly
flat prior, integrating each Gaussian group over its alpha_i leaves a
term proportional to the log of that group's summed squared weights.)
What's interesting about this isn't so much the logarithm in the
"weight decay" term, but rather the fact that the weights are being
clumped together in that weight-decay term, into groups of those
weights connecting to the same input neuron. (This is not true in
conventional weight decay.) So in essence, the weight-decay term in
MacKay's scenario is designed to affect all the weights connecting to a
given input neuron as a group. This makes intuitive sense if the goal
is feature selection.

3) One obvious idea based on viewing things this way is to try
training with this modified weight-decay term directly (a minimal
sketch follows). This might be reasonable even if MacKay's technique
is not a good approximation to the full Bayesian technique.
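
Concretely, the term and its gradient look like this (illustrative
Python; the constant in front of the logarithm is taken exactly as in
the expression above, which is itself only "something like" the right
one):

import numpy as np

def grouped_log_decay(W1, eps=1e-8):
    """The modified weight-decay term: for each input neuron i,
    (number of weights leaving i) * ln(sum of those weights squared)."""
    group_ss = np.sum(W1 ** 2, axis=1) + eps   # sum_j (w_ij)^2 per input
    k = W1.shape[1]                            # weights leaving each input
    return k * np.sum(np.log(group_ss))

def grouped_log_decay_grad(W1, eps=1e-8):
    """Gradient of the term above.  Note it equals 2 * alpha_i * w_ij
    with alpha_i = k / sum_j (w_ij)^2, i.e. a per-input decay whose
    strength tunes itself to the size of the group."""
    group_ss = np.sum(W1 ** 2, axis=1, keepdims=True) + eps
    k = W1.shape[1]
    return 2.0 * k * W1 / group_ss

One consequence is visible in the gradient: a group whose summed
squared weights is already small feels a proportionally stronger
decay, while a group of large weights feels almost none. That is
exactly the "affect all the weights attached to a given input neuron
as a group" behavior described above.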

4) MacKay's idea also suggests all kinds of ways to set up the
weight-decay term so as to enforce feature selection (or automatic
relevance determination, if you prefer). These need not have anything
to do w/ the precise weight-decay term MacKay used; rather the idea is
simply to take his (implicit) suggestion of trying to do feature
selection via the weight-decay term, and see where it leads.

For example: where originally we had input neurons at layer 1, hidden
layers 2 through n, and then output neurons at layer n+1, now use the
same architecture with an extra "pre-processing" layer 0 added. Inputs
are now fed to the neurons at layer 0. For each input neuron at layer
0, there is one and only one weight, leading straight up to the neuron
at layer 1 which in the original formulation was the (corresponding)
input neuron.

The hope would be that for those input neurons which we "should" mostly
ignore, something like backprop might set the associated weights from
layer 0 to layer 1 to very small values.
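
A hedged sketch of this variant (illustrative Python; the gate vector
g below holds the single layer-0-to-layer-1 weight for each input, and
the explicit decay term on g is my own addition, not part of the
proposal above):

import numpy as np

def forward_with_gate(X, g, W1, W2):
    """Layer 0 is a per-input gate: input i is multiplied by its one
    weight g[i] before reaching what used to be input neuron i."""
    H = np.tanh((X * g) @ W1)       # g broadcasts across the batch
    return H @ W2

def gated_loss(g, W1, W2, X, y, decay_g=0.1):
    """Training error plus an (assumed) weight-decay term on the gates,
    to encourage g[i] to shrink for ignorable inputs."""
    pred = forward_with_gate(X, g, W1, W2)
    return np.sum((pred - y) ** 2) + decay_g * np.sum(g ** 2)

After training, the size of each g[i] can be read off directly as a
rough relevance score for input i.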




David Wolpert


