Ability to generalise over multiple training runs

tenorio at ecn.purdue.edu
Thu Mar 5 17:00:40 EST 1992


>In their recent technical report, Bates & Elman note that "it is possible
>for several different networks to reach the same solution to a problem, each
>with a totally different set of weights." (p. 13b) I am interested in the
>relationship between this phenomenon and the measurement of a network's
>ability to generalise.  


I have not seen the report yet, and I don't know what assumptions were
made to reach this conclusion, but if the solution is defined as some
quality of approximation over a finite number of points in the training
or testing set, then I contend that there is a very large, possibly
infinite, number of networks that would yield the same "solution."
This argument holds whether the networks are built with different
architectures or with a single fixed architecture; whether the task is
classification or interpolation; and whether or not the weights are
allowed to be real valued. A simple change in the input variable order,
the presentation order, the node functions, the initial weights, or the
number of hidden nodes would lead to a different net. The only way to
talk about the "optimum weights" (for an architecture fixed in all
respects) is if the function is defined at EVERY possible point. For
classification tasks, for example, how many ways can a closed contour be
delimited by hyperplanes? Or in interpolation, how many functions visit
the data points perfectly, yet do wildly different things at the "don't
care" points?
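
To make the interpolation question concrete, here is a quick numerical
toy (my own sketch in Python/numpy, not anything from the report): take
any interpolant p through the data and add c * prod_i (x - x_i). Every
member of that family visits the training points exactly, yet the
members disagree arbitrarily in between.

    import numpy as np

    # Hypothetical training data: five points sampled from sin(x).
    x_train = np.linspace(0.0, np.pi, 5)
    y_train = np.sin(x_train)

    # Base interpolant: the unique degree-4 polynomial through the 5 points.
    p = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)

    def family_member(x, c):
        """p(x) + c * prod_i (x - x_i): still exact at every training point."""
        bump = np.prod([x - xi for xi in x_train], axis=0)
        return p(x) + c * bump

    x_dense = np.linspace(0.0, np.pi, 101)   # includes the "don't care" points
    for c in (0.0, 1.0, -5.0):
        train_err = np.max(np.abs(family_member(x_train, c) - y_train))
        off_data = np.max(np.abs(family_member(x_dense, c) - family_member(x_dense, 0.0)))
        print(f"c={c:5.1f}  max train error={train_err:.1e}  "
              f"max disagreement off the data={off_data:.2f}")

Every member of this family fits the finite point set equally well; the
data alone cannot prefer one member over the rest.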


Therefore, a function defined by a finite number of points can be
represented, to within an epsilon of error, by a whole family of
equivalent functions, regardless of how big that finite set is.
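
The same multiplicity shows up with networks proper. Here is a rough
sketch (again my own toy, with an arbitrary target, certainly not Bates
& Elman's setup): train one small backprop net twice from different
random initial weights. Both runs fit the same finite point set, yet the
learned weights differ, and so do the outputs away from the data.

    import numpy as np

    def train_net(seed, x, y, hidden=8, lr=0.1, epochs=20000):
        """One-hidden-layer tanh net, trained by plain gradient descent."""
        rng = np.random.default_rng(seed)
        W1 = rng.normal(0.0, 1.0, (hidden, 1)); b1 = np.zeros((hidden, 1))
        W2 = rng.normal(0.0, 1.0, (1, hidden)); b2 = np.zeros((1, 1))
        X = x.reshape(1, -1); Y = y.reshape(1, -1)
        n = X.shape[1]
        for _ in range(epochs):
            H = np.tanh(W1 @ X + b1)          # hidden activations, (hidden, n)
            out = W2 @ H + b2                 # linear output, (1, n)
            d_out = 2.0 * (out - Y) / n       # gradient of mean squared error
            dW2 = d_out @ H.T
            db2 = d_out.sum(axis=1, keepdims=True)
            d_hid = (W2.T @ d_out) * (1.0 - H ** 2)   # backprop through tanh
            dW1 = d_hid @ X.T
            db1 = d_hid.sum(axis=1, keepdims=True)
            W1 -= lr * dW1; b1 -= lr * db1
            W2 -= lr * dW2; b2 -= lr * db2
        def predict(xs):
            return (W2 @ np.tanh(W1 @ xs.reshape(1, -1) + b1) + b2).ravel()
        return predict, W1

    x_train = np.linspace(-1.0, 1.0, 6)       # a made-up finite point set
    y_train = x_train ** 2

    net_a, W1_a = train_net(1, x_train, y_train)
    net_b, W1_b = train_net(2, x_train, y_train)

    print("max train error, net A:", np.max(np.abs(net_a(x_train) - y_train)))
    print("max train error, net B:", np.max(np.abs(net_b(x_train) - y_train)))
    print("first-layer weights differ by up to:", np.max(np.abs(W1_a - W1_b)))

    x_away = np.linspace(-3.0, 3.0, 121)      # away from the training points
    print("max gap between the two nets:", np.max(np.abs(net_a(x_away) - net_b(x_away))))

In Bates & Elman's terms, both runs "reach the same solution," each
with a totally different set of weights.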

>
>From the above I presume (possibly incorrectly) that, if there are many
>possible solutions, then some of them will work well for new inputs and
>others will not work well.  So on one training run a network may appear to
>generalise well to a new input set, while on another it does not.  Does this
>mean that, when connectionists refer to the ability of a network to
>generalise, they are referring to an average ability over many trials?  Has
>anyone encountered situations in which the same network appeared to
>generalise well on one learning trial and poorly on another? 
>

This issue came up on the network a couple of weeks ago, in a
discussion about regularization and network training. It has to do with
the network's power to express the function (does the network have more
or fewer degrees of freedom than needed?), with the limited number of
training points, and with the fact that those points can be noisy and
hence a poor representation of the function itself. Things get even
more hectic if, because of the way it was designed, the training set is
not a faithful representation of the distribution underlying the
function. I'll let the people who published reports on these questions,
and who contributed to that discussion, contact you directly with their
views.
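
To illustrate the degrees-of-freedom point with a toy of my own
(polynomials standing in for networks of varying expressive power;
every number below is made up for the sketch): with noisy samples of a
simple function, a model with many more degrees of freedom than needed
drives the training error toward zero by fitting the noise, and the
held-out error grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        return np.sin(2.0 * x)                # the "true" underlying function

    x_train = np.linspace(0.0, np.pi, 15)
    y_train = f(x_train) + rng.normal(0.0, 0.1, x_train.shape)   # noisy labels
    x_test = np.linspace(0.0, np.pi, 200)     # dense held-out grid

    for deg in (2, 5, 14):                    # too few / roughly right / too many parameters
        model = np.polynomial.Polynomial.fit(x_train, y_train, deg=deg)
        train_rmse = np.sqrt(np.mean((model(x_train) - y_train) ** 2))
        test_rmse = np.sqrt(np.mean((model(x_test) - f(x_test)) ** 2))
        print(f"parameters={deg + 1:3d}  train RMSE={train_rmse:.3f}  "
              f"held-out RMSE={test_rmse:.3f}")

With 15 coefficients and 15 noisy points, the degree-14 model
interpolates the noise exactly, and its held-out error is the worst of
the three.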



>Reference:
>Bates, E.A. & Elman, J.L. (1992).  Connectionism and the study
>	of change.  CRL Technical Report 9202, (February).
>
>-- 
>Paul Atkins                     email: patkins at laurel.mqcc.mq.oz.au
>School of Behavioural Sciences  phone: (02) 805-8606
>Macquarie University            fax  : (02) 805-8062
>North Ryde, NSW, 2113
>Australia.


< Manoel Fernando Tenorio                             >
< (tenorio at ecn.purdue.edu) or (..!pur-ee!tenorio)  >
< MSEE233D                                            >
< Parallel Distributed Structures Laboratory          >
< School of Electrical Engineering                    >
< Purdue University                                   >
< W. Lafayette, IN, 47907                             >
< Phone: 317-494-3482 Fax: 317-494-6440               >



