Weight Decay
movellan%garnet.Berkeley.EDU@violet.berkeley.edu
Mon Jan 23 23:32:11 EST 1989
Referring to John's compilation about weight decay: I cannot see
the analogy between weight decay and ridge regression.
The weight solutions in a linear network (Ordinary Least Squares)
are the solutions to (I'I) W = I'T, where I is the input matrix
(one row per pattern in the epoch, one column per input unit in
the net), T is the teacher matrix (one row per pattern, one column
per teacher unit), and W is the matrix of weights (the net is
linear, with only one layer!).
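For concreteness, here is a minimal numerical sketch of these
normal equations. It assumes a NumPy environment; the sizes, the
random data, and the name "Im" for the input matrix (chosen to
avoid clashing with an identity matrix) are mine, for illustration
only.

    import numpy as np

    rng = np.random.default_rng(0)
    n_patterns, n_inputs, n_teachers = 20, 5, 3

    # Illustrative input matrix I ("Im") and teacher matrix T.
    Im = rng.normal(size=(n_patterns, n_inputs))
    T = Im @ rng.normal(size=(n_inputs, n_teachers))

    # OLS weight solutions: solve (I'I) W = I'T for W.
    W = np.linalg.solve(Im.T @ Im, Im.T @ T)
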
The weight solutions in ridge regression would be given by
(I'I + k<1>) W = I'T, where k is a "shrinkage" constant and <1>
represents the identity matrix. Notice that k<1> has the same
effect as increasing the variances of the inputs (the diagonal of
I'I) without increasing their covariances (the rest of the I'I
matrix). The net effect is to bias the W solutions while reducing
the extreme variability to which they are subject when I'I is
nearly singular (multicollinearity), as the sketch at the end of
this message illustrates. Obviously, collinearity may be a problem
in nets with a large # of hidden units. I am presently
studying how and why collinearity in the hidden layer affects
generalization and whether ridge solutions may help in this
situation. I cannot see, though, how these ridge solutions relate
to weight decay.
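
Here is the sketch referred to above: a hypothetical NumPy example
(all names and numbers are mine) contrasting the OLS and ridge
solutions when two input columns are nearly collinear, so that I'I
is close to singular.

    import numpy as np

    rng = np.random.default_rng(0)
    n_patterns, n_inputs, n_teachers = 20, 5, 3

    # Two nearly collinear input columns make I'I close to singular.
    Im = rng.normal(size=(n_patterns, n_inputs))
    Im[:, 1] = Im[:, 0] + 1e-4 * rng.normal(size=n_patterns)
    T = (Im @ rng.normal(size=(n_inputs, n_teachers))
         + 0.1 * rng.normal(size=(n_patterns, n_teachers)))

    # OLS: solve (I'I) W = I'T. Near-singular I'I lets the weights
    # on the collinear inputs grow to extreme values.
    W_ols = np.linalg.solve(Im.T @ Im, Im.T @ T)

    # Ridge: solve (I'I + k<1>) W = I'T. Adding k to the diagonal
    # raises the input variances without touching the covariances,
    # biasing the solution but taming its variability.
    k = 0.1
    W_ridge = np.linalg.solve(Im.T @ Im + k * np.eye(n_inputs),
                              Im.T @ T)

    # The OLS weight norm is typically orders of magnitude larger here.
    print(np.linalg.norm(W_ols), np.linalg.norm(W_ridge))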
-Javier