LMS fails even in separable cases

John Denker neural!jsd
Wed Nov 2 11:25:35 EST 1988


((I apologize for possible duplication.  I posted this 5 days ago with
no discernible effect, so I'm trying again.))

Yes, we noticed that a Least-Mean-Squares (LMS) network, even with no
hidden units, fails to separate some linearly separable problems.  Ben
Wittner spoke at the IEEE NIPS meeting in Denver, November 1987,
describing !two! failings of this type.
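
For concreteness, here is a minimal sketch of the basic phenomenon in
Python.  It is my own toy example, not one of the counter-examples
from the paper: a one-dimensional, linearly separable training set on
which the least-squares fit (the fixed point that LMS converges to for
a linear output unit) misclassifies a training pattern anyway.

    import numpy as np

    # Five class -1 patterns at x = 0; class +1 patterns at x = 1 and
    # x = 100.  Any threshold between 0 and 1 separates them perfectly.
    X = np.array([0., 0., 0., 0., 0., 1., 100.])
    t = np.array([-1., -1., -1., -1., -1., 1., 1.])

    # Closed-form least-squares fit of f(x) = w*x + b, the minimum of
    # the squared error that LMS descends.
    A = np.column_stack([X, np.ones_like(X)])
    (w, b), *_ = np.linalg.lstsq(A, t, rcond=None)

    print(w, b)                     # w ~ 0.017, b ~ -0.67
    print(np.sign(w * X + b) == t)  # the pattern at x = 1 comes out False

The distant (and correctly classified) pattern at x = 100 drags the
least-squares line far enough that the pattern at x = 1 lands on the
wrong side of the decision boundary, even though the data are
separable.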

He gave an example of a situation in which LMS algorithms (including
ordinary versions of back-prop) are metastable, i.e., they fail to
separate the data for certain initial configurations of the weights.
He went on to describe another case in which the algorithm actually
!leaves! the solution region after starting within it.

He also pointed out that this can lead to learning sessions in which
the categorization performance of back-prop nets (with or without
hidden units) is not a monotonically improving function of learning
time.

Finally, he presented a couple of ways of modifying the algorithm to
prevent these problems, and proved a convergence theorem for the
modified algorithms.  One of the key ideas is something that has been
mentioned in several recent postings, namely, to have zero penalty
when the training pattern is well-classified or "beyond".
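
As a rough illustration of that idea (my own sketch, with an arbitrary
margin and step size; it is not the exact modification or the
convergence proof from the paper), here is gradient descent on a
one-sided error that vanishes once a pattern is classified correctly
by at least a fixed margin:

    import numpy as np

    def train(X, t, lr=0.01, steps=5000, margin=1.0):
        """Gradient descent on sum_i max(0, margin - t_i*f(x_i))**2
        for a linear unit f(x) = w*x + b.  A pattern that is
        well-classified (or "beyond") has zero violation, so it
        exerts zero pull on the weights."""
        w, b = 0.0, 0.0
        for _ in range(steps):
            viol = np.maximum(0.0, margin - t * (w * X + b))
            w -= lr * np.sum(-2.0 * viol * t * X)
            b -= lr * np.sum(-2.0 * viol * t)
        return w, b

    # On the separable toy data from the sketch above, where plain
    # least squares failed, this finds a separating line:
    X = np.array([0., 0., 0., 0., 0., 1., 100.])
    t = np.array([-1., -1., -1., -1., -1., 1., 1.])
    w, b = train(X, t)
    print(np.sign(w * X + b) == t)  # all True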

We cited Minsky & Papert as well as Duda & Hart; we believe they were
more or less aware of these bugs in LMS, although they never presented
explicit examples of the failure modes.

Here is the abstract of our paper in the proceedings, _Neural
Information Processing Systems -- Natural and Synthetic_, Denver,
Colorado, November 8-12, 1987, Dana Z. Anderson, Ed., AIP Press.  We
posted the abstract back in January '88, but apparently it didn't get
through to everybody.  Reprints of the whole paper are available.

   Strategies for Teaching Layered Networks Classification Tasks

                Ben S. Wittner (1)
                John S. Denker
            AT&T Bell Laboratories
            Holmdel, New Jersey 07733

ABSTRACT:  There is a widespread misconception that the delta rule is
in some sense guaranteed to work on networks without hidden units.  As
previous authors have mentioned, there is no such guarantee for
classification tasks.  We will begin by presenting explicit
counter-examples illustrating two different interesting ways in which
the delta rule can fail.  We go on to provide conditions which do
guarantee that gradient descent will successfully train networks
without hidden units to perform two-category classification tasks.  We
discuss the generalization of our ideas to networks with hidden units
and to multi-category classification tasks.

(1) Currently at NYNEX Science and Technology /  500 Westchester Ave.
White Plains, NY 10604


