some questions on training neural nets...
Wray Buntine
wray at ptolemy-ethernet.arc.nasa.gov
Fri Feb 4 15:47:25 EST 1994
Tom Dietterich and William Finnoff covered a lot of issues.
I'd just like to highlight two points:
* this is a contentious area
* there are several opposing factors at play that
confuse our understanding of this
================ detail
Basically, this comment below is SO true.
> There are many ways to manage the bias/variance tradeoff. I would say
> that there is nothing approaching complete agreement on the best
> approaches (and more fundamentally, the best approach varies from one
> application to another, since this is really a form of prior). The
> approaches can be summarized as
The bias/variance tradeoff lies at the heart of almost all disagreements
between different learning philosophies such as classical, Bayesian, minimum
description length, resampling schemes (now often viewed as empirical
Bayesian), statistical physics approaches, and the various
"implementation" schemes.
One thing to note is that there are several quite separate forces
in operation here:
computational and search issues:
(e.g. maybe early stopping works better
because it's a more efficient way of
searching the space of smaller networks?
see the early-stopping sketch after this list)
prior issues:
(e.g. have you thrown in 20 attributes you
happen to think might apply, though probably
15 are irrelevant; OR did a medical
specialist carefully pick all 10 attributes
and assure you every one is important;
OR is a medical specialist able to solve the
task blind, just by reading the 20 attribute
values (without seeing the patient), etc.)
(e.g. are 30 hidden units adequate for the
structure of the task?)
asking the right question:
(e.g. sometimes the question "what's the 'best' network?"
is a bit silly when you have a small amount of
data; perhaps you should be trying to find
10 reasonable alternative networks and pool their
results, a la Michael Perrone's NIPS'93 workshop;
a pooling sketch also follows this list)
understanding your representation:
(e.g. with rule-based systems, each rule has a clear
interpretation, so the question of how to
prune, etc., is something you can understand
well, BUT with a large feed-forward network,
understanding the structure of the space is more
involved, e.g. if I set these 2 weights to zero,
what the hell happens to my proposed solution?)
(e.g. this confuses the problem of designing
good regularizers/priors/network-encodings).
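On the early-stopping point above, here is a minimal sketch of
validation-based stopping on an overparameterised least-squares
problem; the data, step size, and patience are assumptions chosen
only to show the mechanism:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 40, 100                                 # fewer samples than weights
    X = rng.normal(size=(n, d))
    w_true = np.zeros(d)
    w_true[:5] = 1.0                               # only 5 relevant weights
    y = X @ w_true + 0.5 * rng.normal(size=n)
    X_tr, y_tr, X_va, y_va = X[:30], y[:30], X[30:], y[30:]

    w = np.zeros(d)
    best_va, best_w, patience = np.inf, w.copy(), 0
    for step in range(5000):
        w -= 0.01 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient step
        va = np.mean((X_va @ w - y_va) ** 2)
        if va < best_va:
            best_va, best_w, patience = va, w.copy(), 0      # remember best weights
        else:
            patience += 1
            if patience >= 100:                    # validation has stalled
                break
    print(f"stopped at step {step}, best validation MSE {best_va:.3f}")

Stopping when validation error stalls keeps the weights near their
small initial values, which is one reason it can act like a cheap way
of searching the space of smaller, more constrained networks.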
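And on the pooling point: a sketch of averaging several alternative
fits, in the spirit of (though not taken from) Perrone's work; the
bootstrap resamples and degree-7 polynomials stand in for "10
reasonable alternative networks":

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 1.0, 25)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=25)
    x_test = np.linspace(0.0, 1.0, 100)

    members = []
    for _ in range(10):                            # 10 alternative fits
        idx = rng.integers(0, len(x), len(x))      # bootstrap resample
        members.append(np.polyfit(x[idx], y[idx], 7))

    member_preds = np.array([np.polyval(c, x_test) for c in members])
    pooled = member_preds.mean(axis=0)             # pool by simple averaging
    print(f"mean spread across members: {member_preds.std(axis=0).mean():.3f}")

Averaging leaves the shared bias alone but cuts the variance
contributed by any single fit, which is exactly the point of asking
for 10 reasonable networks rather than one "best" one.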
The problem is that theory people tend to focus on one,
maybe two, of these, whereas application people tend to
conflate them all.
Wray Buntine
NASA Ames Research Center phone: (415) 604 3389
Mail Stop 269-2 fax: (415) 604 3594
Moffett Field, CA, 94035-1000 email: wray at kronos.arc.nasa.gov