Weight Decay
kanderso@BBN.COM
Tue Jan 24 13:54:04 EST 1989
Date: Mon, 23 Jan 89 20:32:11 pst
From: movellan%garnet.Berkeley.EDU@violet.berkeley.edu
Message-Id: <8901240432.AA18293@garnet.berkeley.edu>
To: connectionists@cs.cmu.edu
Subject: Weight Decay
Referring to the compilation about weight decay from John: I
cannot see the analogy between weight decay and ridge regression.
The weight solutions in a linear network (ordinary least squares)
are the solutions to (I'I) W = I'T, where:
I is the input matrix (one row per pattern in the epoch, one
column per input unit in the net), T is the teacher matrix (one
row per pattern in the epoch, one column per teacher unit in the
net), and W is the matrix of weights (the net is linear, with only
one layer!).
The weight solutions in ridge regression would be given by
(I'I + k<1>) W = I'T, where k is a "shrinkage" constant and <1>
represents the identity matrix. Notice that k<1> has the same
effect as increasing the variances of the inputs (Diagonal of
I'I) without increasing their covariances (rest of the I'I
matrix). The final effect is biasing the W solutions but reducing
the extreme variability to which they are subject when I'I is
near singular (multicollinearity). Obviously collinearity may be
a problem in nets with a large # of hidden units. I am presently
studying how and why collinearity in the hidden layer affects
generalization and whether ridge solutions may help in this
situation. I cannot see, though, how these ridge solutions relate
to weight decay.
-Javier
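A small numerical sketch of the multicollinearity point above (my own
illustration, assuming NumPy; the synthetic data and the value of k are
made up for the example, not taken from the message): when two input
columns are nearly identical, I'I is close to singular, the OLS weights
become wildly variable, and adding k<1> to the diagonal stabilizes them.

    import numpy as np

    rng = np.random.default_rng(0)

    # Input matrix I: 50 patterns, 2 nearly collinear input units.
    x = rng.normal(size=(50, 1))
    I = np.hstack([x, x + 1e-3 * rng.normal(size=(50, 1))])
    # Teacher matrix T generated from true weights (1, 1) plus noise.
    T = I @ np.array([[1.0], [1.0]]) + 0.1 * rng.normal(size=(50, 1))

    k = 0.1  # "shrinkage" constant

    # OLS: solve (I'I) W = I'T  -- ill-conditioned here.
    W_ols = np.linalg.solve(I.T @ I, I.T @ T)
    # Ridge: solve (I'I + k<1>) W = I'T  -- k added to the variances only.
    W_ridge = np.linalg.solve(I.T @ I + k * np.eye(2), I.T @ T)

    print("condition number of I'I:", np.linalg.cond(I.T @ I))
    print("OLS weights:  ", W_ols.ravel())    # typically large, opposite-signed
    print("ridge weights:", W_ridge.ravel())  # shrunk, but near (1, 1) in scale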
Yes, I was confused by this too. Here is what the connection seems to
be. Say we are trying to minimize an energy function E(w) of the
weight vector for our network. If we add a constraint that also
attempts to minimize the length of w, we add a term kw'w to the
energy function. Taking your linear least-squares problem, we would have
E = (T - IW)'(T - IW) + kW'W
dE/dW = I'IW - I'T + kW    (dropping a common factor of 2)
Setting dE/dW = 0 gives
[I'I + k<1>]W = I'T, i.e., ridge regression, so
W = [I'I + k<1>]^-1 I'T
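A quick numerical check of this identification (again my own sketch,
assuming NumPy; the data, k, learning rate, and iteration count are
arbitrary): plain gradient descent on the squared error plus the kW'W
penalty, i.e. gradient descent with weight decay, converges to the same
W as solving [I'I + k<1>]W = I'T directly.

    import numpy as np

    rng = np.random.default_rng(1)
    I = rng.normal(size=(100, 5))   # input matrix (patterns x input units)
    T = rng.normal(size=(100, 2))   # teacher matrix (patterns x teacher units)
    k = 0.5                         # weight-decay / shrinkage constant

    # Closed-form ridge solution: [I'I + k<1>]^-1 I'T
    W_ridge = np.linalg.solve(I.T @ I + k * np.eye(5), I.T @ T)

    # Gradient descent on E = (T - IW)'(T - IW) + kW'W
    W = np.zeros((5, 2))
    lr = 1e-3
    for _ in range(20000):
        grad = I.T @ (I @ W - T) + k * W   # dE/dW, dropping the factor of 2
        W -= lr * grad

    print("max |W_gd - W_ridge| =", np.abs(W - W_ridge).max())  # ~0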
The covariance matrix is [I'I + k<1>]^-1, so increasing k has three
effects (see the sketch after this list):
1. It makes the matrix better conditioned (more easily invertible).
2. It reduces the covariance, so new training data will have less
effect on your weights.
3. You lose some resolution in weight space.
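Points 2 and 3 can be seen numerically with a sketch like the following
(my own illustration, assuming NumPy; the true weights, noise level, and
values of k are arbitrary): refitting on fresh training sets drawn from
the same teacher, a larger k gives weight estimates that vary less from
sample to sample but are shrunk away from the true values.

    import numpy as np

    rng = np.random.default_rng(2)
    true_W = np.array([2.0, -1.0, 0.5])

    def ridge_fit(k):
        # One fresh training set from the same linear teacher.
        I = rng.normal(size=(30, 3))
        T = I @ true_W + 0.5 * rng.normal(size=30)
        return np.linalg.solve(I.T @ I + k * np.eye(3), I.T @ T)

    for k in (0.0, 10.0, 100.0):
        Ws = np.array([ridge_fit(k) for _ in range(200)])
        print("k =", k,
              " spread (std):", Ws.std(axis=0).round(3),
              " mean estimate:", Ws.mean(axis=0).round(3))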
I agree that collinearity is probably very important, and I'll be glad
to discuss that off-line.
k