Weight Decay

movellan%garnet.Berkeley.EDU at violet.berkeley.edu
Mon Jan 23 23:32:11 EST 1989


Referring to John's compilation about weight decay: I cannot see
the analogy between weight decay and ridge regression.
 
 
The weight solutions in a linear network (ordinary least squares)
are the solutions to (I'I) W = I'T, where: I is the input matrix
(one row per pattern in the epoch, one column per input unit in
the net); T is the teacher matrix (one row per pattern in the
epoch, one column per teacher unit); and W is the matrix of
weights (the net is linear, with only one layer!).
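
As a minimal sketch of those normal equations (in NumPy; the
shapes and random data are purely illustrative, not from the
original post):

    import numpy as np

    rng = np.random.default_rng(0)
    I = rng.normal(size=(100, 5))   # 100 patterns, 5 input units
    T = rng.normal(size=(100, 2))   # 100 patterns, 2 teacher units

    # Ordinary least squares: solve (I'I) W = I'T for W
    W_ols = np.linalg.solve(I.T @ I, I.T @ T)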
 
The weight solutions in ridge regression would be given by
(I'I + k<1>) W = I'T, where k is a "shrinkage" constant and <1>
is the identity matrix. Notice that k<1> has the same effect as
increasing the variances of the inputs (the diagonal of I'I)
without increasing their covariances (the rest of the I'I
matrix). The overall effect is to bias the W solutions while
reducing the extreme variability to which they are subject when
I'I is near singular (multicollinearity).
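
As another illustrative sketch (the shrinkage constant k and the
deliberately near-collinear inputs are my own choices), the ridge
solution simply adds k to each diagonal entry of I'I before
solving, which tames the unstable OLS solution:

    import numpy as np

    # Two almost identical input columns, so I'I is near singular
    rng = np.random.default_rng(1)
    x = rng.normal(size=(100, 1))
    I = np.hstack([x, x + 1e-6 * rng.normal(size=(100, 1))])
    T = rng.normal(size=(100, 1))

    k = 0.1  # illustrative shrinkage constant
    W_ols   = np.linalg.solve(I.T @ I, I.T @ T)                  # huge, unstable
    W_ridge = np.linalg.solve(I.T @ I + k * np.eye(2), I.T @ T)  # shrunk, stable
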
Obviously, collinearity may be a problem in nets with a large
number of hidden units. I am presently studying how and why
collinearity in the hidden layer affects generalization and
whether ridge solutions may help in this situation. I cannot
see, though, how these ridge solutions relate to weight decay.
 
-Javier

