Overfitting in learning discrete patterns

john kolen kolen-j at cis.ohio-state.edu
Sun Mar 6 10:39:14 EST 1994


Fabien.Moutarde at aar.alcatel-alsthom.fr wrote:

  I would like to know how were the weights initialized ?
  Were they taken from uniform distribution in some fixed
  interval whatever the network architecture ? Which interval ?

You are asking the right questions.  Are you aware of (Kolen &
Pollack, 1990), which explores the effects of initial weights on
back propagation?

  if you begin learning with some neurons already in their
  nonlinear regime somewhere in learning space, then the initial
  function realized by the network is not smooth, and the
  irregularities are likely to remain between learning points and
  to produce overfitting. This implies that the bigger the
  network, the lower the initial weights should be.


The last sentence does not necessarily follow from the ones
before it.  The magnitude of the weights matters less than the
magnitude of the *net input* reaching each unit.  For instance, if
the network operates in an environment in which there are
between-unit correlations in the input, then large-magnitude
weights can effectively act as small-magnitude weights from the
perspective of the nonlinear squashing function.  In this
situation, I would predict that large weights actually help
distribute error to the previous layer.
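A minimal numerical sketch of that point, assuming a single sigmoid
unit and a hypothetical pair of strongly anticorrelated inputs (the
weights, input distributions, and names below are illustrative
assumptions, not figures from this exchange):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two large-magnitude weights feeding one sigmoid unit.
w = np.array([4.0, 4.0])

# Case 1: uncorrelated inputs -- the net input is often large, so
# the unit saturates and its slope (used to pass error back) shrinks.
x_uncorr = rng.uniform(-1.0, 1.0, size=(1000, 2))

# Case 2: strongly anticorrelated inputs (x2 ~ -x1 plus noise) --
# the weighted contributions largely cancel, so the same large
# weights produce a small net input and the unit stays near its
# linear regime.
x1 = rng.uniform(-1.0, 1.0, size=1000)
x_corr = np.column_stack([x1, -x1 + 0.05 * rng.normal(size=1000)])

for name, x in [("uncorrelated", x_uncorr), ("anticorrelated", x_corr)]:
    net = x @ w
    y = sigmoid(net)
    slope = y * (1.0 - y)   # sigmoid derivative: small when saturated
    print(f"{name:>14}: mean |net| = {np.abs(net).mean():.2f}, "
          f"mean slope = {slope.mean():.3f}")

With uncorrelated inputs the same weights drive the unit into
saturation, where the sigmoid's slope (and hence the error passed
backward) becomes small; with anticorrelated inputs the contributions
cancel and the unit stays near its linear regime despite the large
weights.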

John Kolen

References

J. F. Kolen and J. B. Pollack (1990).  Backpropagation is
Sensitive to Initial Conditions.  _Complex Systems_, 4(3),
pp. 269-280.  Available from neuroprose as kolen.bpsic.*.ps.Z
(8 files).



