Ability to generalise over multiple training runs
Lee Giles
giles at research.nj.nec.com
Tue Mar 10 13:21:49 EST 1992
Regarding recent discussions on training
different nets and their ability to get the same
solution:
We observed (Giles et al., in IJCNN91, NIPS4, and Neural Computation '92)
similar results for recurrent nets learning small regular grammars
(finite state automata) from positive and negative sample strings.
Briefly, one character of each string is presented per time step, and
supervised training (RTRL) occurs at the end of the string presentation.
[See the above papers for more information.]
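Very roughly, the setup looks something like the sketch below. This is
illustrative only, not the code from the papers: the network size, learning
rate, end-of-string marker, and toy grammar (even parity of 1s) are all made
up, and a real experiment would need more neurons and training data.

    import numpy as np

    rng = np.random.default_rng(0)

    N, K = 4, 3                                 # state neurons; symbols: '0', '1', end-marker
    W = rng.uniform(-1.0, 1.0, (N, N, K))       # second-order weights W[j, i, k]

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def run(string, W, learn=False, target=None, lr=0.5):
        """Present one string symbol per time step; with learn=True, apply an
        RTRL update against the end-of-string target on state neuron 0."""
        S = np.zeros(N); S[0] = 1.0             # initial state
        dS_dW = np.zeros((N, N, N, K))          # RTRL sensitivities dS[j]/dW[l, m, n]
        for sym in list(string) + ['$']:        # '$' marks end of string
            k = {'0': 0, '1': 1, '$': 2}[sym]
            I = np.zeros(K); I[k] = 1.0
            net = np.einsum('jik,i,k->j', W, S, I)
            S_new = sigmoid(net)
            if learn:
                g = S_new * (1.0 - S_new)       # sigmoid derivative
                sens = np.einsum('jik,k,ilmn->jlmn', W, I, dS_dW)
                for j in range(N):
                    sens[j, j] += np.outer(S, I)   # delta_{jl} * S_m * I_n term
                dS_dW = g[:, None, None, None] * sens
            S = S_new
        if learn:
            W -= lr * (S[0] - target) * dS_dW[0]   # gradient of 0.5*(S[0]-target)^2
        return S[0] > 0.5

    # Toy training set: accept strings with an even number of 1s (illustrative)
    strings = [''.join(rng.choice(['0', '1'], size=L)) for L in range(1, 6) for _ in range(4)]
    data = [(s, float(s.count('1') % 2 == 0)) for s in strings]

    for epoch in range(1000):
        for s, t in data:
            run(s, W, learn=True, target=t)

    print(sum(run(s, W) == bool(t) for s, t in data), "/", len(data), "training strings correct")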
Using random initial weights and different numbers of
neurons, most of the trained neural networks perfectly classified the
training sets. Using a heuristic extraction method (there are
many similar methods), a grammar could be
extracted from each trained network. These extracted
grammars were all different, but each could be reduced to the same unique
"minimal number of states" grammar (or minimal finite state automaton).
Though these experiments were for 2nd order fully recurrent nets,
we've extracted the same grammars from 1st order recurrent
nets using the same training data.
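For illustration, one common variant of such an extraction heuristic
quantizes each state neuron's activation into q bins, follows the net's
dynamics through the quantized space to read off a DFA, and then reduces the
result by standard partition refinement. The sketch below is my own
illustration, not the exact procedure from the papers; the helpers
step(S, sym) and accepting(S), which stand in for the trained net's
transition and end-of-string output functions, and the quantization level q
are assumptions.

    from collections import deque

    def extract_dfa(step, accepting, S0, alphabet, q=2):
        """Breadth-first search over quantized network states."""
        def quantize(S):
            return tuple(min(int(s * q), q - 1) for s in S)
        start = quantize(S0)
        reps = {start: S0}            # one analog representative per quantized state
        trans, accept = {}, set()
        todo = deque([start])
        while todo:
            a = todo.popleft()
            S = reps[a]
            if accepting(S):
                accept.add(a)
            for sym in alphabet:
                S_next = step(S, sym)
                b = quantize(S_next)
                trans[(a, sym)] = b
                if b not in reps:
                    reps[b] = S_next
                    todo.append(b)
        return start, set(reps), trans, accept

    def minimize_dfa(states, trans, accept, alphabet):
        """Moore-style partition refinement; each block of the final
        partition is one state of the minimal finite state automaton."""
        parts = [p for p in (states & accept, states - accept) if p]
        while True:
            def block(s):             # index of the block containing state s
                return next(i for i, p in enumerate(parts) if s in p)
            new_parts = []
            for p in parts:
                buckets = {}
                for s in p:
                    key = tuple(block(trans[(s, a)]) for a in alphabet)
                    buckets.setdefault(key, set()).add(s)
                new_parts.extend(buckets.values())
            if len(new_parts) == len(parts):
                return parts
            parts = new_parts

Minimizing the DFAs extracted from differently initialized nets is the step
that collapses them to the same minimal automaton.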
Not all of the trained machines performed equally well on unseen strings:
some were perfect on all strings tested; others weren't.
For small grammars, nearly all of the trained neural
networks produced perfect extracted grammars.
In most cases the nets were trained on 10**3 strings and tested
on 10**6 randomly chosen strings of length < 99.
(Since an unbounded number of strings can be generated
by these grammars, perfect generalization cannot be
tested in practice.) In fact, it was possible to extract
ideal grammars from trained nets that
classified fairly well, but not perfectly, on the test set.
[In other words, you could throw away the net and use
just the extracted grammar.]
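The kind of comparison involved is roughly the following sketch; the helper
names in_grammar, net_classifies, and dfa_classifies are hypothetical
stand-ins for the target grammar, the trained net, and the extracted
automaton, each returning True/False for a string.

    import random

    def random_string(max_len=98, alphabet='01'):
        L = random.randint(1, max_len)
        return ''.join(random.choice(alphabet) for _ in range(L))

    def compare(in_grammar, net_classifies, dfa_classifies, n=10**6):
        """Score the net and the extracted grammar against the target
        grammar on n randomly drawn strings."""
        net_ok = dfa_ok = 0
        for _ in range(n):
            s = random_string()
            truth = in_grammar(s)
            net_ok += (net_classifies(s) == truth)
            dfa_ok += (dfa_classifies(s) == truth)
        print(f"net: {net_ok / n:.4%}   extracted grammar: {dfa_ok / n:.4%}")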
This agrees with Paul Atkins' comment:
>From the above I presume (possibly incorrectly) that, if there are many
>possible solutions, then some of them will work well for new inputs and
>others will not work well.
and with Manoel Fernando Tenorio's observation:
>...then I contend that there are a very large, possibly infinite number of
>network architectures, or if a single architecture is chosen; if it is a
>classification or interpolation; and if the weights are allowed to be real
>valued or not. A simple modification of the input variable order, or the
>presentation order, or the functions of the nodes, or the initial points,
>or the number of hidden nodes would lead to different nets...
C. Lee Giles
NEC Research Institute
4 Independence Way
Princeton, NJ 08540
USA
Internet: giles at research.nj.nec.com
UUCP: princeton!nec!giles
PHONE: (609) 951-2642
FAX: (609) 951-2482