Weight Decay

kanderso@BBN.COM kanderso at BBN.COM
Tue Jan 24 13:54:04 EST 1989


  Date: Mon, 23 Jan 89 20:32:11 pst
  From: movellan%garnet.Berkeley.EDU at violet.berkeley.edu
  Message-Id: <8901240432.AA18293 at garnet.berkeley.edu>
  To: connectionists at cs.cmu.edu
  Subject: Weight Decay
  
  Referring to the compilation about weight decay from John:  I
  cannot see the analogy between weight decay and ridge regression.

  The weight solutions in a linear network (Ordinary Least Squares)
  are the solutions to (I'I) W = I'T, where:

  I is the input matrix (rows are # of patterns in the epoch and
  columns are # of input units in the net). T is the teacher matrix
  (rows are # of patterns in the epoch and columns are # of
  teacher units in the net). W is the matrix of weights (the net is
  linear with only one layer!).
   
  The weight solutions in ridge regression would be given by
  (I'I + k<1>) W = I'T, where k is a "shrinkage" constant and <1>
  represents the identity matrix. Notice that k<1> has the same
  effect as increasing the variances of the inputs (the diagonal of
  I'I) without increasing their covariances (the rest of the I'I
  matrix). The final effect is to bias the W solutions while reducing
  the extreme variability to which they are subject when I'I is
  near singular (multicollinearity). Obviously collinearity may be
  a problem in nets with a large # of hidden units. I am presently
  studying how and why collinearity in the hidden layer affects
  generalization and whether ridge solutions may help in this
  situation. I cannot see, though, how these ridge solutions relate
  to weight decay.
   
  -Javier

Yes, I was confused by this too.  Here is what the connection seems to
be.  Say we are trying to minimize an energy function E(W) of the
weight matrix for our network.  If we add a constraint that also
attempts to minimize the length of W, we would add a term kW'W to our
energy function.  Taking your linear least squares problem we would have

E = (T-IW)'(T-IW) + kW'W

dE/dW = I'IW - I'T + kW     (dropping an overall factor of 2)

Setting dE/dW = 0 gives

[I'I + k<1>]W = I'T, i.e. Ridge Regression, so

W = [I'I + k<1>]^-1 I'T
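
To check this numerically, here is a minimal sketch (assuming Python
with NumPy; the array names follow the notation above, with Id standing
in for <1>, and the sizes are made up for the illustration).  It solves
the ridge normal equations directly and also runs plain gradient
descent on E with the kW'W penalty; the two weight matrices agree.

import numpy as np

rng = np.random.default_rng(0)

# I: input matrix (patterns x input units), T: teacher matrix (patterns x teacher units)
I = rng.normal(size=(100, 5))
T = rng.normal(size=(100, 2))
k = 0.1                                   # shrinkage / weight-decay constant
Id = np.eye(5)                            # the identity matrix, <1> above

# Ridge solution: W = [I'I + k<1>]^-1 I'T
W_ridge = np.linalg.solve(I.T @ I + k * Id, I.T @ T)

# Gradient descent on E = (T - IW)'(T - IW) + kW'W
W = np.zeros((5, 2))
lr = 1e-3
for _ in range(20000):
    grad = 2 * (I.T @ (I @ W - T) + k * W)   # dE/dW, including the factor of 2 dropped above
    W = W - lr * grad

print(np.allclose(W, W_ridge, atol=1e-6))    # True: weight decay finds the ridge solution
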

The covariance matrix is [I'I + k<1>]^-1, so increasing k has three
effects (a small numerical illustration follows the list):

1.  It makes the matrix better conditioned (more reliably invertible).

2.  It reduces the covariance, so that new training data will have
less effect on your weights.

3.  You lose some resolution in weight space.
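
For points 1 and 2 in particular, a quick sketch (again assuming
NumPy; the two nearly identical input columns are invented for the
illustration) shows the condition number of I'I + k<1> dropping and
the weights shrinking as k grows:

import numpy as np

rng = np.random.default_rng(1)

# Two nearly collinear input columns, the situation Javier worries about
x = rng.normal(size=(50, 1))
I = np.hstack([x, x + 1e-4 * rng.normal(size=(50, 1))])
T = rng.normal(size=(50, 1))

for k in [0.0, 0.01, 0.1, 1.0]:
    A = I.T @ I + k * np.eye(2)
    W = np.linalg.solve(A, I.T @ T)
    print(f"k={k:<4}  cond = {np.linalg.cond(A):12.1f}   ||W|| = {np.linalg.norm(W):10.3f}")
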

I agree that collinearity is probably very important, and I'll be glad
to discuss that off-line.

k

