Weight Decay

kanderso@DINO.BBN.COM kanderso at DINO.BBN.COM
Fri Feb 10 15:52:34 EST 1989


  To: att!cs.cmu.edu!connectionists
  Subject: Re: Weight Decay 
  Reply-To: yann at neural.att.com
  Date: Wed, 25 Jan 89 15:13:58 -0500
  From: Yann le Cun <neural!yann>
  
  
  Consider a single layer linear network with N inputs. 
  When the number of training patterns is smaller than N, the
  set of solutions (in weight space) is a proper linear subspace.
  Adding weight decay will select the minimum norm solution in this subspace
  (if the weight decay coefficient is decreased with time).
  The minimum norm solution happens to be the solution given by the 
  pseudo-inverse technique (cf Kohonen), and the solution which
  optimally cancels out uncorrelated zero mean additive noise on the input.
  
  - Yann Le Cun
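
As a quick numerical check on the quoted claim (my own numpy sketch, not part of either message; the sizes and decay values are only illustrative): for an underdetermined single-layer linear net, the weight-decay solution (A^T A + lam*I)^{-1} A^T d approaches the pseudo-inverse (minimum norm) solution as the decay coefficient lam is taken toward zero.

import numpy as np

rng = np.random.default_rng(0)
P, N = 3, 8                         # fewer training patterns (P) than inputs (N)
A = rng.standard_normal((P, N))     # input patterns
d = rng.standard_normal(P)          # target outputs

w_pinv = np.linalg.pinv(A) @ d      # minimum norm solution of A w = d

for lam in (1.0, 1e-2, 1e-4, 1e-6):
    # solution with weight decay coefficient lam
    w_decay = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ d)
    print(lam, np.linalg.norm(w_decay - w_pinv))   # distance shrinks as lam -> 0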
  
I think this needs some clarification.  Your linear network problem is
Aw = d, where A is an N x M matrix of input patterns (N training
patterns, M inputs), w is an M x 1 vector of weights, and d is an
N x 1 vector of outputs.

In the case you described, N < M and the system is underdetermined, i.e.
there are many solutions.  The pseudoinverse solution, w, is the one
among all solutions that minimizes |w|^2, i.e. any other solution will
be longer.
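
A small sketch of that point (my own example, with arbitrary sizes): numpy's pinv picks the shortest of the many solutions; adding anything from the null space of A still solves Aw = d but gives a longer w.

import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 8                          # fewer patterns than weights: underdetermined
A = rng.standard_normal((N, M))
d = rng.standard_normal(N)

w_pinv = np.linalg.pinv(A) @ d       # pseudoinverse (minimum norm) solution

# any other solution is w_pinv plus a null-space component, and is longer
null_rows = np.linalg.svd(A)[2][N:]             # rows spanning the null space of A
w_other = w_pinv + 0.5 * null_rows[0]

print(np.allclose(A @ w_pinv, d), np.allclose(A @ w_other, d))   # both solve Aw = d
print(np.linalg.norm(w_pinv) < np.linalg.norm(w_other))          # True: pinv is shortest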

In the case where N > M and A is full rank, the pseudo-inverse
minimizes |d - Aw|^2, i.e. it is the least squares solution.  In the
general case, where A is not full rank, the pseudoinverse solution
minimizes (1) |d - Aw|^2 and, among all such minimizers, (2) |w|^2.
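
For instance (my own quick check, not from the post), in the overdetermined full-rank case numpy's pinv and lstsq agree:

import numpy as np

rng = np.random.default_rng(2)
N, M = 20, 5                        # more patterns than weights, A full rank
A = rng.standard_normal((N, M))
d = rng.standard_normal(N)

w_pinv = np.linalg.pinv(A) @ d                    # pseudoinverse solution
w_lstsq = np.linalg.lstsq(A, d, rcond=None)[0]    # least squares solution
print(np.allclose(w_pinv, w_lstsq))               # True: same solution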

In an iterative network application, a learning step typically
minimizes (1) while adding weight decay minimizes (2) at the same
time.  Another way to say this is that it tries to find a w that
minimizes the error subject to the constraint that w is bounded to
some length.  That length is determined by the weight decay
coefficient you use.
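
Here is a minimal sketch of that combined iteration (my own illustration; the learning rate and decay coefficient are assumed values, not anything from the original post). Gradient steps on (1) plus a weight decay term amount to gradient descent on |d - Aw|^2 + lam |w|^2, so the iteration settles at the bounded-length (ridge) solution rather than at plain least squares.

import numpy as np

rng = np.random.default_rng(3)
N, M = 20, 5
A = rng.standard_normal((N, M))
d = rng.standard_normal(N)

lam, lr = 0.1, 0.005                 # weight decay coefficient and learning rate (assumed)
w = np.zeros(M)
for _ in range(5000):
    grad_error = 2 * A.T @ (A @ w - d)   # gradient of (1), |d - Aw|^2
    grad_decay = 2 * lam * w             # weight decay, gradient of lam * |w|^2, i.e. (2)
    w -= lr * (grad_error + grad_decay)

# closed-form minimizer of |d - Aw|^2 + lam |w|^2 for comparison
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(M), A.T @ d)
print(np.allclose(w, w_ridge, atol=1e-6))    # the iteration converges to the decayed solution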

In general, it would seem wrong to let the weight decay coefficient go
to zero, since then you will wind up at the least squares solution,
which may not be what you want.
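
A quick numerical illustration of that (again my own example): with a nonzero decay coefficient the solution stays shorter than the least squares one, at the price of a slightly larger training error; letting the coefficient go to zero just recovers plain least squares.

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 5))
d = rng.standard_normal(20)

w_ls = np.linalg.lstsq(A, d, rcond=None)[0]          # least squares solution (no decay)
for lam in (1.0, 1e-2, 1e-6):
    w = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ d)
    # |w| grows toward |w_ls| and the residual drops toward the least squares residual
    print(lam, np.linalg.norm(w), np.linalg.norm(d - A @ w))
print(0.0, np.linalg.norm(w_ls), np.linalg.norm(d - A @ w_ls))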

k

 

