Weight Decay
movellan%garnet.Berkeley.EDU@violet.berkeley.edu
Mon Jan 23 23:32:11 EST 1989
Referring to John's compilation about weight decay: I cannot see
the analogy between weight decay and ridge regression.
The weight solutions in a linear network (Ordinary Least Squares)
are the solutions to (I'I) W = I'T, where I is the input matrix
(one row per pattern in the epoch, one column per input unit in
the net), T is the teacher matrix (one row per pattern, one column
per teacher unit), and W is the matrix of weights (the net is
linear, with only one layer!).
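For concreteness, here is a minimal numerical sketch of these
normal equations. It assumes a NumPy environment; the sizes, the
random data, and the name "Im" for the input matrix (chosen to
avoid clashing with an identity matrix) are mine, for illustration
only.

    import numpy as np

    rng = np.random.default_rng(0)
    n_patterns, n_inputs, n_teachers = 20, 5, 3

    # Illustrative input matrix I ("Im") and teacher matrix T.
    Im = rng.normal(size=(n_patterns, n_inputs))
    T = Im @ rng.normal(size=(n_inputs, n_teachers))

    # OLS weight solutions: solve (I'I) W = I'T for W.
    W = np.linalg.solve(Im.T @ Im, Im.T @ T)
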
The weight solutions in ridge regression would be given by
(I'I + k<1>) W = I'T, where k is a "shrinkage" constant and <1>
represents the identity matrix. Notice that k<1> has the same
effect as increasing the variances of the inputs (the diagonal of
I'I) without increasing their covariances (the rest of the I'I
matrix). The net effect is to bias the W solutions while reducing
the extreme variability to which they are subject when I'I is
nearly singular (multicollinearity), as the sketch at the end of
this message illustrates. Obviously, collinearity may be a problem
in nets with a large # of hidden units. I am presently
studying how and why collinearity in the hidden layer affects
generalization and whether ridge solutions may help in this
situation. I cannot see, though, how these ridge solutions relate
to weight decay.
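
Here is the sketch referred to above: a hypothetical NumPy example
(all names and numbers are mine) contrasting the OLS and ridge
solutions when two input columns are nearly collinear, so that I'I
is close to singular.

    import numpy as np

    rng = np.random.default_rng(0)
    n_patterns, n_inputs, n_teachers = 20, 5, 3

    # Two nearly collinear input columns make I'I close to singular.
    Im = rng.normal(size=(n_patterns, n_inputs))
    Im[:, 1] = Im[:, 0] + 1e-4 * rng.normal(size=n_patterns)
    T = (Im @ rng.normal(size=(n_inputs, n_teachers))
         + 0.1 * rng.normal(size=(n_patterns, n_teachers)))

    # OLS: solve (I'I) W = I'T. Near-singular I'I lets the weights
    # on the collinear inputs grow to extreme values.
    W_ols = np.linalg.solve(Im.T @ Im, Im.T @ T)

    # Ridge: solve (I'I + k<1>) W = I'T. Adding k to the diagonal
    # raises the input variances without touching the covariances,
    # biasing the solution but taming its variability.
    k = 0.1
    W_ridge = np.linalg.solve(Im.T @ Im + k * np.eye(n_inputs),
                              Im.T @ T)

    # The OLS weight norm is typically orders of magnitude larger here.
    print(np.linalg.norm(W_ols), np.linalg.norm(W_ridge))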
-Javier