Weight Decay
Stephen J Hanson
jose at tractatus.bellcore.com
Wed Jan 25 10:02:09 EST 1989
Actually, I think the connection is more general: ridge regression is a
special case of a family of variance-reduction techniques in regression called
"biased regression" (which also includes principal components regression).
Biases are introduced in order to remove the effects of collinearity, as has
been discussed, and to obtain estimators that may have lower variance than
the best linear unbiased estimator ("BLUE"). When the assumptions of
linearity and independence are violated, least squares estimators are not
particularly attractive and will not necessarily be BLUE. Consequently, both
nonlinear regression and ordinary linear least squares regression with
collinear variables may achieve lower-variance estimators by entertaining some bias.
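
To make that tradeoff explicit (a standard identity, not stated in the
original post): for any estimator,

    MSE  =  variance  +  (bias)^2

so accepting a small bias is worthwhile whenever it buys a larger reduction
in variance.
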
In the nonlinear case a bias term would enter as a "constraint" to be
minimized along with the squared error, (y - yhat)^2. This constraint is
actually a term that can push weights differentially toward zero; in
regression terms it is a bias, and in neural network terms it is weight decay.
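
As a rough sketch of how such a term enters (my own illustration, assuming
Python/numpy; the names lam and lr and the quadratic penalty are mine, not
anything specific from the post):

    import numpy as np

    def penalized_error(y, yhat, w, lam):
        # squared error plus the "constraint" (bias) term lam * sum(w^2)
        return np.sum((y - yhat) ** 2) + lam * np.sum(w ** 2)

    def decay_step(w, grad_error, lam, lr):
        # the penalty's gradient is 2*lam*w, so each update shrinks the
        # weights toward zero in proportion to their size -- weight decay
        return w - lr * (grad_error + 2.0 * lam * w)

With lam = 0 this is ordinary error minimization; larger lam pushes more
weights toward zero.
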
Ridge regression is the specific case for linear least squares in which the
off-diagonal terms of the correlation matrix are given less weight by adding
a small constant to the diagonal, in order to reduce the collinearity
problem. The approach is still controversial in statistical circles; not
everyone subscribes to the notion of introducing bias, since it is hard to
know a priori what bias might be optimal for a given problem.
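
For the linear case the same idea can be written in a few lines (again a
sketch of my own, assuming numpy; k is the ridge constant):

    import numpy as np

    def ridge(X, y, k):
        # OLS solves (X'X) b = X'y; ridge adds k to the diagonal of X'X,
        # which shrinks the estimates and tames near-collinear columns
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

Setting k = 0 recovers the unbiased least squares solution; a small positive
k trades bias for variance.
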
I have a paper with Lori Pratt, presented at the last NIPS, that describes
this relationship more generally; it should be available soon as a tech report.
Steve Hanson