some questions on training neural nets...

Tal Grossman tal at goshawk.lanl.gov
Fri Feb 4 10:22:12 EST 1994


Dear Charles X. Ling,

You say: "Some rather basic issues in training NN still puzzle me a lot, 
and I hope to get advice and help from the experts in the area."
Well... the questions you have asked still puzzle the experts as well,
and good answers, where they exist, are very much case dependent.
As Tom Dietterich wrote, in general "Even in the noise-free case, the bias/variance
tradeoff is operating and it is possible to overfit the training data",
so you cannot expect just any large net to generalize well.


It was also observed recently that...
When you have a large enough set of examples (so that both the training set and
the validation set are good samples), you can obtain better generalization with
larger nets by using cross-validation to decide when to stop training,
as demonstrated in the paper by A. Weigend:

Weigend, A.S. (1994), in the Proc. of the 1993 Connectionist
Models Summer School, edited by M.C. Mozer, P. Smolensky, D.S. Touretzky,
J.L. Elman and A.S. Weigend, pp. 335-342
(Erlbaum Associates, Hillsdale NJ, 1994).

Rich Caruana presented similar results at the "Complexity Issues" workshop
at the last NIPS post-conference.
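
For concreteness, the idea of validation-based stopping is roughly the following.
This is only a minimal sketch, not the actual procedure of the papers above; the
library (scikit-learn), the toy data, and all parameter values are illustrative
assumptions.

# Minimal sketch of validation-based early stopping: hold out a validation
# set, train incrementally, and stop when the validation score has not
# improved for a while.  Toy data and parameters are assumptions only.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy stand-in data; in practice X, y come from your problem.
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Split into training and validation sets.
X_train, y_train = X[:700], y[:700]
X_val, y_val = X[700:], y[700:]

net = MLPClassifier(hidden_layer_sizes=(30,), random_state=0)

best_val, best_epoch, patience, wait = -np.inf, 0, 20, 0
for epoch in range(500):
    net.partial_fit(X_train, y_train, classes=np.array([0, 1]))
    val_score = net.score(X_val, y_val)
    if val_score > best_val:
        best_val, best_epoch, wait = val_score, epoch, 0
        # (a full implementation would also keep a copy of the best weights)
    else:
        wait += 1
        if wait >= patience:   # validation score stopped improving
            break

print(f"stopped at epoch {epoch}; best validation accuracy {best_val:.3f} "
      f"(epoch {best_epoch})")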

But... Larger networks can generalize as well as, or even better than, small
networks even without cross-validation.  A simple experiment that demonstrates
this was presented in:

    T. Grossman, R. Meir and E. Domany,
    Learning by choice of Internal Representations,
    Complex Systems 2,  555-575 (1988).

In that experiment, networks with different numbers of hidden units were
trained to perform the symmetry task, using a fraction of the possible
examples as the training set, training each net to 100% performance on the
training set and testing its performance on the rest (off-training-set
generalization). No early stopping, no cross-validation.
The symmetry problem can be solved by 2 hidden units - 
so this is the minimal architecture required for this specific function. 
However, it was found that it is NOT the best generalizing architecture.
The generalization rates of all the architectures (H=2..N, where N is the size
of the input) were similar, with the larger networks somewhat better.
Now, this is a special case. One can explain it by observing that the symmetry
problem can also be solved by a network of N hidden units with smaller
weights, and not only by effectively "zeroing" the contributions of all but two
units (see an example in Minsky and Papert's Perceptrons), and probably by
all the intermediate architectures as well. So, considering the mapping from
weight space to function space, it is very likely that training a large network
on partial data will take you closer (in function space) to your target function
F (symmetry in this case) than training a small one.
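
A rough sketch of this kind of symmetry experiment is below. The original work
used a different learning algorithm (learning by choice of internal
representations), a single seed, and its own parameter choices; the trainer,
training fraction, and all values here are assumptions for illustration only.

# Sketch: train nets with H = 2..N hidden units on a fraction of all N-bit
# patterns for the symmetry task and test on the remaining (off-training-set)
# patterns.  No early stopping, no cross-validation.
import itertools
import numpy as np
from sklearn.neural_network import MLPClassifier

N = 8
X = np.array(list(itertools.product([0, 1], repeat=N)))
y = np.array([int(np.array_equal(x, x[::-1])) for x in X])  # 1 if symmetric

rng = np.random.default_rng(0)
train_idx = rng.choice(len(X), size=len(X) // 2, replace=False)
test_idx = np.setdiff1d(np.arange(len(X)), train_idx)

for H in range(2, N + 1):
    net = MLPClassifier(hidden_layer_sizes=(H,), max_iter=5000,
                        random_state=0)
    net.fit(X[train_idx], y[train_idx])
    print(f"H={H}: train acc {net.score(X[train_idx], y[train_idx]):.2f}, "
          f"off-training-set acc {net.score(X[test_idx], y[test_idx]):.2f}")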

The picture can be different in other cases...
One has to remember that the training/generalization problem (including the
bias/variance tradeoff problem) is, in general, a complex interaction between
three entities:
1. The target function (or the task).
2. The learning model, and the class of functions that is realizable
 by this model (and its associated learning algorithm).
3. The training set, and how well it represents the task.

Even the simple question of whether your training set is large enough (or good
enough) is not simple at all. One might think that it should be larger than, say,
twice the number of free parameters (weights) in your model/network architecture.
It turns out that not even this is enough in general.
Allow me to advertise here the paper presented by A. Lapedes and myself at the
last NIPS, where we present a method to test a "general" classification algorithm
(i.e. any classifier, such as a neural net, a decision tree, etc., together with
 its learning algorithm, which may include pruning or net construction) by a method
we call the "noise sensitivity signature" (NSS; see abstract below). In addition to
introducing this new model selection method, which we believe can be a good
alternative to cross-validation in data-limited cases, we present the following
experiment: the target function is a network with a 20:5:1 architecture (weights
chosen at random). The training set is produced by choosing M random input
patterns and classifying them with this teacher net. We then train other nets
with various architectures, ranging from 1 to 8 hidden units, on the training
set (without controlled stopping, but with a tolerance in the error function).
A different (and large) set of classified examples is used to determine the
generalization performance of the trained nets (averaged over several
realizations with different initial weights).
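
The teacher/student setup can be sketched roughly as follows. The activation
function, weight scale, trainer, and the single random seed are my assumptions
here, not the original implementation, which also averaged over several weight
initializations.

# Sketch of the teacher/student experiment: a random 20:5:1 "teacher" net
# labels M random input patterns; "student" nets with 1..8 hidden units are
# trained on those labels and evaluated on a large independent test set.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Random 20:5:1 teacher network.
W1 = rng.normal(size=(20, 5))
W2 = rng.normal(size=(5, 1))

def teacher(X):
    return (np.tanh(X @ W1) @ W2 > 0).astype(int).ravel()

def make_set(m):
    X = rng.choice([-1.0, 1.0], size=(m, 20))   # random binary patterns
    return X, teacher(X)

X_test, y_test = make_set(10000)                # large independent test set

for M in (400, 700, 1000):
    X_train, y_train = make_set(M)
    scores = []
    for H in range(1, 9):                       # students with 1..8 hidden units
        net = MLPClassifier(hidden_layer_sizes=(H,), max_iter=2000,
                            random_state=0)
        net.fit(X_train, y_train)
        scores.append((H, net.score(X_test, y_test)))
    best = max(scores, key=lambda s: s[1])
    print(f"M={M}: best architecture H={best[0]} (test acc {best[1]:.3f})")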

Some of the results are:
1. With different training set sizes M=400, 700, 1000, the optimal
 architecture is different. A smaller training set yields a smaller optimal
 network, according to the independent test set measure.
2. Even with M=1000 (much more than twice the number of weights), the
 optimal learning net is still smaller than the original teacher net.
3. There are differences of up to a few percent in the generalization performance
 of the different learning nets for all training set sizes.
 In particular, nets that are larger than the optimum do worse as their size
 grows.
 Depending on your problem, a few percent can be insignificant or can make
 a real difference.  In some real applications, 1-2% can be the difference
 between a contract and a paper...  In such cases you would like to tune your
 model (i.e. to identify the optimal architecture) as well as you can.
4. Using the NSS it was possible to recognize the optimal architectures for 
each training set, without using extra data.

Some conclusions are:  
1. If one uses a validation set to choose the architecture (not for
stopping) - for example, the extra 1000 examples - then the architecture
picked when using the 700-example training set is going to be smaller
(and worse) than the one picked when using the 1000-example training set.
In other words, if your data is just 1000 examples and you devote 300 of
them to your validation set, then even if those 300 give a good estimate of
the generalization of the trained net, when you choose the model according
to this test set you end up with the optimal model for 700 training
examples, which is worse than the optimal model you could obtain by
training with all 1000 examples.
It means that in many cases you need more examples than one might expect in
order to obtain a well-tuned model, especially if you are using a considerable
fraction of them as a validation set.
2. Using NSS one would find the right architecture for the total number of
examples you have, paying a factor of about 30 in training effort.
3. You can use leave-one-out ("set one aside") cross-validation to select your
 model (see the sketch below). This will probably overcome the bias caused by
 giving up a large fraction of the examples. However, in order to obtain a
 reliable estimate of the performance, the training process has to be repeated
 many times, probably more than what is needed to calculate the NSS.
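
A minimal illustration of the leave-one-out estimate mentioned in point 3, with
toy data; the helpers used here are an assumption, and the point to notice is
simply that it costs one training run per example.

# Leave-one-out ("set one aside") cross-validation estimate of performance.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                  # toy stand-in data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
scores = cross_val_score(net, X, y, cv=LeaveOneOut())   # one fit per example
print(f"leave-one-out accuracy estimate: {scores.mean():.3f}")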

It is important to emphasize again:
the above results were obtained for that specific experiment. We have obtained
similar results with different tasks (e.g. DNA structure classification) and
with different learning machines (e.g. decision trees), but still, these
results prove nothing "in general", except maybe that life is complicated
and full of uncertainty...
A more careful comparison with cross-validation as a stopping method, and the
use of NSS in other scenarios (like function fitting), are under investigation.
If anyone is interested in using the NSS method in combination with pruning
methods (e.g. to test the stopping criteria), I will be glad to help.
I will be grateful for any other information/references about similar experiments.

I hope all the above did not add too much to your puzzlement.
Good luck with your training,
Tal
------------------------------------------------

The paper I mentioned above is:
Learning Theory seminar:  Thursday, Feb. 10, 15:15.  CNLS Conference room.

title: Use of Bad Training Data For Better Predictions.

by : Tal Grossman and Alan Lapedes (Complex Systems group, LANL)

Abstract:
We present a method for calculating the ``noise sensitivity signature''
of a learning algorithm which is based on scrambling the output classes of
various fractions of the training data. This signature
can be used to indicate a good (or bad) match between the complexity of the
classifier and the complexity of the data and hence
to improve the predictive accuracy of a classification algorithm.
Use of noise sensitivity signatures is distinctly different from other schemes
to avoid overtraining, such as cross-validation, which uses only part of the
training data, or various penalty functions, which are not data-adaptive.
Noise sensitivity signature methods use all of the training data and
are manifestly data-adaptive and non-parametric.
They are well suited for situations with limited training data

It is going to appear in the Proc. of NIPS 6.   An expanded version of it
will (hopefully) be placed in the neuroprose archive within a week or two.
Until then I can send a ps file to anyone interested.
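
For readers who just want the flavour of the scrambling idea in the abstract,
the basic ingredient can be sketched as below. The exact quantities that form
the signature are defined in the paper; the quantity recorded here (accuracy on
the original, unscrambled labels, averaged over scrambling realizations), the
toy data, and all parameters are assumptions for illustration only.

# Illustration of the scrambling loop behind the noise sensitivity signature:
# scramble the output classes of various fractions of the training data,
# retrain, and record how the fitted classifier's behaviour changes with the
# noise level.
import numpy as np
from sklearn.neural_network import MLPClassifier

def scramble(y, fraction, rng):
    """Return a copy of y with `fraction` of the labels randomly reassigned."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_noisy[idx] = rng.integers(0, 2, size=len(idx))   # random 0/1 classes
    return y_noisy

def noise_signature(X, y, hidden_units, fractions, n_repeats=3, seed=0):
    rng = np.random.default_rng(seed)
    signature = []
    for rho in fractions:
        accs = []
        for _ in range(n_repeats):
            net = MLPClassifier(hidden_layer_sizes=(hidden_units,),
                                max_iter=2000, random_state=0)
            net.fit(X, scramble(y, rho, rng))
            accs.append(net.score(X, y))   # accuracy on the clean labels
        signature.append((rho, float(np.mean(accs))))
    return signature

# Toy usage: compare the signatures of a small and a large net.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)
for H in (2, 8):
    print(H, noise_signature(X, y, H, fractions=(0.0, 0.1, 0.2, 0.4)))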


