Methods for improving generalization (was Re: some questions on ...)

hicks at cs.titech.ac.jp
Fri Feb 11 00:02:54 EST 1994


Dear Mr. Franke Lange (lange at ira.uka.de),

	On Wed, 9 Feb 94 14:19:22 MET you wrote:
>But Soft Weight-Sharing does not really adapt to the data,
>because you have to tune the same parameters as in normal Weight-Decay:
>the parameters that are used to control the strength of the penalty term.
>The article by Nowlan and Hinton, "Simplifying Neural Networks by Soft
>Weight-Sharing", does not mention a method to do this automatically - so no
>"real" adaptation to the data is made.

I say "every model is adaptive, and no model is adaptive, but some are more
adaptive than others".  Every model has parameters which are adjusted during
learning.  Penalty functions, including soft weight sharing, affect the prior
distribution of the weights and so can be thought of as simply providing
different models.  All of these models adapt to the data.  On the other hand,
every model >must< make some assumptions about which it is adamant; if it
made none, there would be no model.  These assumptions are non-adaptive to
the data. (note1)
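
To make the "penalty = prior" point concrete, here is a minimal sketch (my
own illustration, not taken from either paper) of ordinary weight decay seen
as a Gaussian prior on the weights; the function and variable names are
hypothetical:

    import numpy as np

    def penalized_error(data_error, weights, lam):
        """Weight decay: E = E_data + (lam/2) * sum_i w_i**2.

        Up to an additive constant, the penalty term is -log of an
        independent Gaussian prior N(0, 1/lam) on each weight, so
        choosing lam amounts to choosing a different model."""
        return data_error + 0.5 * lam * np.sum(weights ** 2)

    # The same weights are penalized differently under different
    # priors, i.e. under different models.
    w = np.array([0.1, -2.0, 0.5])
    for lam in (0.01, 1.0):
        print(lam, penalized_error(0.0, w, lam))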

	You further wrote:
>Maybe the methods of MacKay ("Bayesian Interpolation", Neural Comp. 4 (1992),
>page 415-447) could be used to get a fully-automatic adaption. A combination
>of this method with Weight-Decay or Soft Weight-Sharing would perhaps be
>data-adaptive; but Soft Weight-Sharing alone has still a parameter, that is
>not adapted by the data.


The article was very enlightening.  Figure 1 on page 417 shows the two main
steps of modeling which involve Bayesian methods: (1) fit each model to the
data; (2) assign preferences to the alternative models.  The first step is
the one we are all familiar with.  The second is the topic of the paper and
consists of assigning objective preferences to each model: the probability
of the data given the model is called the evidence for the model.
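
In symbols (this is standard Bayesian model comparison, consistent with the
paper's Figure 1, though the notation here is my own):

    P(H_i | D)  \propto  P(D | H_i) P(H_i)

    P(D | H_i)  =  \int P(D | w, H_i) P(w | H_i) dw

where w are the parameters (weights) of model H_i and D is the data.  The
integral defining the evidence P(D | H_i) automatically penalizes models
that spread their prior over too large a volume of parameter space, which
is how the evidence embodies Occam's razor.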

Re your idea of "fully-automatic adaptation": I will first review the
parameters related to soft weight sharing: (a) the number of weight groups,
and (b) the mean and variance of each group of weights.  The weight penalty
weighting is not arbitrary but is determined by the variance of the squared
error (which changes with time), divided by a factor (determined by
cross-validation) that adjusts for the number of free parameters.  I take
"fully-automatic adaptation" to mean that parameters (a) and (b) should be
held constant during stage (1), and that after running the simulation many
times with different values for (a) and (b) we should select the best ones
with stage (2) methods, i.e. by weighing the evidence for each model.  This
would take a long time, BUT we might get a different answer from the one
obtained by choosing (a) and (b) during stage (1).
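
For concreteness, here is a minimal sketch of a soft weight-sharing penalty
in the style of Nowlan and Hinton: the negative log of a mixture of
Gaussians over the weights.  The parameters in (b) are the mixture means and
variances (plus mixing proportions), and in stage (1) they are adapted by
gradient descent along with the weights themselves.  The names and the
two-group setup below are my own illustration:

    import numpy as np

    def soft_sharing_penalty(weights, prop, mu, sigma):
        """-sum_i log sum_j prop_j * N(w_i | mu_j, sigma_j**2).

        prop, mu, sigma hold one entry per weight group; during
        learning they are adjusted together with the weights."""
        w = weights[:, None]                       # (n_weights, 1)
        dens = prop / (np.sqrt(2 * np.pi) * sigma) \
            * np.exp(-0.5 * ((w - mu) / sigma) ** 2)
        return -np.sum(np.log(dens.sum(axis=1)))

    # Two weight groups: a tight cluster at zero (prunable weights)
    # and a broad cluster for the weights that carry the function.
    prop  = np.array([0.7, 0.3])
    mu    = np.array([0.0, 1.0])
    sigma = np.array([0.05, 1.0])
    w     = np.array([0.01, -0.02, 0.95, 1.1])
    print(soft_sharing_penalty(w, prop, mu, sigma))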

However, as to which way better deserves to be called "automatic", I would
personally favor the present stage (1) way, because it automatically
(although perhaps imperfectly) estimates the best parameters (a) and (b)
implicitly during learning, leaving less labor for the later and harder
stage (2).  I realize I am splitting semantic hairs here.

(note1) MacKay does give a special example of a 100% data-adaptive model:
the Sure Thing hypothesis, which says that the data set will be exactly what
it is (predicted, of course, before seeing the data, and selected
afterwards), but this hypothesis has very small a priori probability.  Too
bad for our universe.  The other example is of course stock tips (predicted
of course before seeing the money, collected afterwards), but look what
happened to Michael Milken!

Respectfully Yours,

	Craig Hicks

Craig Hicks           hicks at cs.titech.ac.jp | Kore ya kono  Yuku mo kaeru mo
Ogawa Laboratory, Dept. of Computer Science | Wakarete wa   Shiru mo shiranu mo
Tokyo Institute of Technology, Tokyo, Japan |  	    Ausaka no seki        
lab:03-3726-1111 ext.2190 home:03-3785-1974 |  (from hyaku-nin-issyu)
fax: +81(3)3729-0685 (from abroad) 
     03-3729-0685  (from Japan)



