Overfitting in learning discrete patterns

Charles X. Ling ling at csd.uwo.ca
Thu Mar 3 16:47:02 EST 1994


Hi everyone,

A few weeks ago I posted several questions regarding overfitting in
network training. I got many helpful replies (some were not forwarded
to the list). Thanks very much to all.

After some thought, and following Andreas Weigend's Summer School 93
paper, I designed and implemented the following experiments on the
overfitting problem. The design is very simple but has a clear rationale,
the results seem conclusive, and anyone can verify them easily.


THE FUNCTION TO BE LEARNED:
First, a Boolean function with 7 variables and 1 output is defined by:
  count := 2*v1 - v2 + 3*v3 + 2*v4 -v5 + 3*v6 + v7;
  if (count > 2) and (count < 7) then f1 := 0  else f1 := 1
This function requires a network with 2 hidden units (in a single hidden layer) to represent it.
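
For concreteness, here is the same function as a small Python sketch (my own
transcription of the definition above, not code from the experiments):

  def f1(v):
      # v is a tuple of the 7 input bits (v1..v7)
      v1, v2, v3, v4, v5, v6, v7 = v
      count = 2*v1 - v2 + 3*v3 + 2*v4 - v5 + 3*v6 + v7
      return 0 if 2 < count < 7 else 1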

The actual target function is the one above, except that 10% of the function
values, chosen randomly, are flipped (0 to 1 and 1 to 0). Note that those
flipped values can be regarded as having a "high-level regularity",
and no sampling noise is added to this target function. (It is like
learning verb past tenses, where a few verbs are so-called "irregular",
or learning string-to-phoneme mappings, which have some exceptions.)

The 128 possible examples are split randomly into a training set
(64 examples) and a testing set (64) with no overlap. It happens that
the training set has 4 flipped examples and the testing set has 7.
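
As an illustration of this setup, a Python sketch of the data generation
(reusing the f1 sketch above; the random seed is arbitrary, since the
original split is not reproducible from this post):

  import itertools, random

  random.seed(0)  # arbitrary seed, for illustration only

  # All 128 examples of the 7-variable Boolean function.
  examples = [(bits, f1(bits)) for bits in itertools.product((0, 1), repeat=7)]

  # Flip 10% of the function values (about 13 of 128) to create the exceptions.
  flip_idx = set(random.sample(range(len(examples)), round(0.1 * len(examples))))
  data = [(bits, 1 - y if i in flip_idx else y)
          for i, (bits, y) in enumerate(examples)]

  # Random, non-overlapping 64/64 split into training and testing sets.
  random.shuffle(data)
  train_set, test_set = data[:64], data[64:]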


TRAINING:
Xerion from U of Toronto is used: backprop with a small learning
rate and momentum 0. Testing accuracy is monitored every 10 epochs
up to 1000 epochs, then at several points up to 50,000 epochs.
Note that only basic BP is used, without weight decay.

Networks have 7 input units and one output unit. Networks
with 1 (too small), 2 (the right model), 3, 4 (enough to train
to 0 symbolic-level error, see below), 6, 8, 12, and 20 (over-sized)
hidden units are trained, each with 5 different random seeds.
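
The original runs used Xerion; purely as an illustration, here is a minimal
numpy sketch of the same kind of run (plain batch backprop on a 7-H-1 sigmoid
network, momentum 0, no weight decay). The learning rate, weight
initialization, and monitoring schedule here are my own guesses, not the
original settings:

  import numpy as np

  def run(train_set, test_set, n_hidden, seed, epochs=50000, lr=0.05):
      rng = np.random.default_rng(seed)
      X  = np.array([b for b, _ in train_set], dtype=float)
      y  = np.array([t for _, t in train_set], dtype=float).reshape(-1, 1)
      Xt = np.array([b for b, _ in test_set], dtype=float)
      yt = np.array([t for _, t in test_set], dtype=float).reshape(-1, 1)

      W1 = rng.uniform(-0.5, 0.5, (7, n_hidden)); b1 = np.zeros(n_hidden)
      W2 = rng.uniform(-0.5, 0.5, (n_hidden, 1)); b2 = np.zeros(1)
      sig = lambda z: 1.0 / (1.0 + np.exp(-z))

      test_errors = []
      for epoch in range(1, epochs + 1):
          h   = sig(X @ W1 + b1)                      # forward pass
          out = sig(h @ W2 + b2)
          d_out = (out - y) * out * (1 - out)         # squared-error gradient
          d_h   = (d_out @ W2.T) * h * (1 - h)
          W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(0)   # batch update
          W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(0)
          if epoch % 10 == 0:                         # monitor testing error
              pred = (sig(sig(Xt @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
              test_errors.append((epoch, int((pred != yt).sum())))
      return test_errors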


ERROR:
The most important change from Weigend's paper is the way the error
is measured. Since we are learning discrete patterns, the error we
really care about is the number of misclassifications (i.e., the
symbolic-level error). The discrete pattern with the smallest
real-valued Hamming distance (smallest angle) to the actual network
outputs is taken as the prediction. Since there is only one output here,
if the output is greater than 0.5 it is taken as 1, otherwise as 0.
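
In code form, the rule sketched above (my own illustration) is simply
component-wise rounding, which for a single output reduces to the 0.5
threshold:

  # Nearest discrete pattern under real-valued Hamming distance:
  # round each output independently.
  def nearest_discrete_pattern(outputs):
      return [1 if o > 0.5 else 0 for o in outputs]

  # Symbolic-level error: number of examples whose rounded pattern
  # differs from the target pattern.
  def symbolic_error(predicted, targets):
      return sum(p != t for p, t in zip(predicted, targets))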



RESULTS: 
See the table below. Results are given as average +- standard error.

# of       min training   min testing    overfitting        percent of
hidden U   error          error          (# of increase)    increase
 1         19.0 +- 0.0    30.5 +- 0.6     2.5 +- 0.6          8%
 2          2.0 +- 0.0    11.0 +- 0.0     1.0 +- 0.0          9%
 3          1.0 +- 0.0    11.3 +- 0.9     2.5 +- 1.7         22%
 4          0.6 +- 0.5    11.3 +- 0.9     6.3 +- 1.9         56%
 6          0.4 +- 0.5    13.2 +- 1.3     2.6 +- 0.5         20%
 8          0.0 +- 0.0    13.4 +- 1.3     5.2 +- 1.1         39%
12          0.2 +- 0.5    12.4 +- 1.1     3.4 +- 2.8         27%
20          0.5 +- 0.6    12.8 +- 1.5     4.0 +- 2.2         31%
min training error: minimal number of misclassifications on the training set
min testing error:  minimal number of misclassifications on the testing set
overfitting: increase in the number of misclassifications on the testing set
	     when training to 50,000 epochs
percent of increase: overfitting / min testing error


CONCLUSIONS:

1. Too-small and right-sized networks do overfit, but only by a very
small amount (8%, 9%). Their testing errors and overfitting amounts are stable.

2. The right-sized network (2 hidden units) has the minimal testing
error and the minimal fluctuation.

3. Too-large networks overfit a lot. This is because the trained networks
represent a very complex separator that does not align with the hyperplanes
of the target function.

4. Too-large networks show large variation in testing error across
different random seeds, so their results are less reliable.
Why? Too many degrees of freedom in the net.

5. The average testing error with too-large networks is slightly *higher* than
with the right-sized network. Therefore, training overly large networks
and stopping early does not seem to be beneficial.


IN SUM:
In terms of symbolic-level errors...

Even without noise, training to 0 symbolic-level error overfits. 

Training overly large networks does not seem beneficial (see 3, 4, 5 above).

We should look for a small network with minimal cross-validation error.
That net should also have a small overfitting effect and small variation.
That is (a code sketch follows the list):
1. Split the training set into TR1 (training) and TR2 (validation).
2. Train net1, net2, ..., net_n (with different numbers of hidden units)
   on TR1, stopping each when its minimal validation error is reached.
3. Find a small net, net_k, with the smallest validation error.
4. Train net_k on TR1+TR2 until the error on the training set drops to
   the TR1 training error scaled up to the larger set size. This net will
   not overfit too much anyway, and the accuracy on testing should be stable.
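
A minimal sketch of this procedure (my own illustration; the helpers
train_until_min_val_error and train_until_error are hypothetical stand-ins
for a backprop routine such as the one sketched under TRAINING):

  def select_and_train(training_set, hidden_sizes=(1, 2, 3, 4, 6, 8)):
      # 1. Split the training set into TR1 (training) and TR2 (validation).
      split = int(0.75 * len(training_set))       # split ratio is a guess
      tr1, tr2 = training_set[:split], training_set[split:]

      # 2. Train nets with different numbers of hidden units on TR1,
      #    stopping each at its minimal validation error on TR2.
      results = []
      for h in hidden_sizes:
          val_err, tr1_err = train_until_min_val_error(h, tr1, tr2)  # hypothetical helper
          results.append((h, val_err, tr1_err))

      # 3. Pick a small net with the smallest validation error
      #    (ties broken in favour of fewer hidden units).
      h, _, tr1_err = min(results, key=lambda r: (r[1], r[0]))

      # 4. Retrain that net on TR1+TR2 until its training error drops to
      #    the TR1 error scaled up to the larger set size.
      target_err = tr1_err * len(training_set) / len(tr1)
      return train_until_error(h, tr1 + tr2, target_err)             # hypothetical helper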


The only thing that should be explored further is basic BP with
weight decay. I will do that in the near future.

Comments and suggestions are very welcome.

Regards,
Charles


