Weight Decay

Stephen J Hanson jose at tractatus.bellcore.com
Wed Jan 25 10:02:09 EST 1989


Actually, I think the connection is more general: ridge regression
is a special case of a family of variance-reduction techniques in regression
called "biased regression" (which also includes principal components
regression).  Biases are introduced in order to remove the effects of
collinearity, as has been discussed, and to try to obtain estimators with
lower variance than the best linear unbiased estimator ("BLUE"); when the
assumptions of linearity and independence are violated, least squares
estimators are not particularly attractive and will not necessarily be BLUE.
Consequently, nonlinear regression and ordinary linear least squares
regression with collinear variables may both achieve lower-variance
estimators by entertaining some bias.

In the nonlinear case the bias enters as a "constraint" term to be minimized
together with the error, the sum of (y - yhat)^2.  This constraint is a term
that pushes weights differentially towards zero; what regression calls a bias
is, in neural-network terms, weight decay.  Ridge regression is the specific
linear least squares case in which the off-diagonal terms of the correlation
matrix are given less weight by adding a small constant to the diagonal, in
order to reduce the collinearity problem.  The approach is still controversial
in statistical circles; not everyone subscribes to the notion of introducing
biases, since it is hard to know a priori what bias might be optimal for a
given problem.
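
A minimal sketch of that relationship for the linear case, assuming a toy
collinear data set, a penalty constant lam, and a gradient step size that are
all illustrative choices rather than anything from the discussion above
(Python with numpy):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data with two nearly collinear predictors (assumed for illustration):
    # least squares is unbiased here but high-variance, so a small bias
    # can buy a large reduction in variance.
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.001 * rng.normal(size=n)    # almost an exact copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + 0.1 * rng.normal(size=n)

    lam = 0.1    # the "small constant added to the diagonal"

    # Ordinary least squares: solve (X'X) w = X'y
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Ridge regression: solve (X'X + lam*I) w = X'y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

    # Weight decay: minimize sum (y - Xw)^2 + lam * sum w^2 by gradient descent.
    # The penalty's gradient, 2*lam*w, shrinks every weight toward zero at each
    # step; for a linear model the minimizer is exactly the ridge solution.
    w = np.zeros(2)
    lr, steps = 2e-3, 50_000
    for _ in range(steps):
        grad = -2 * X.T @ (y - X @ w) + 2 * lam * w
        w -= lr * grad

    print("OLS:         ", w_ols)     # high-variance estimate; X'X is nearly singular
    print("ridge:       ", w_ridge)   # both coefficients pulled toward ~1
    print("weight decay:", w)         # agrees with the ridge solution

For the linear model, gradient descent on the penalized error converges to the
same weights as the ridge solve; the penalty constant plays exactly the role
of the small constant added to the diagonal.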

I have a paper with Lori Pratt that describes this relationship more
generally; it was given at the last NIPS and should be available soon as a
tech report.

	Steve Hanson

