Training XOR with BP
Luis B. Almeida
lba at sara.inesc.pt
Mon Mar 15 12:49:19 EST 1993
I must say I am a bit surprised myself by the XOR discussion, but
in the opposite sense: for me, the XOR has always converged rather
fast. Let me be more specific: I already had the idea in my mind, from
previous informal tests, that the XOR usually converged in much less
than 100 epochs (with a relatively large percentage of runs that fell
into "local minima" - more about this below). The difference from
other people's results may come from implementation details, which I
give below.
The experiments I reported were made with a BP simulator developed
here at Inesc, which has a lot of facilities (adaptive step sizes,
cross-validation, optional entropy error, weight decay, momentum,
etc.). For these experiments I set the parameters so that all these
features were disabled. And to be sure, I have just looked through
the essential parts of the code and found no bugs - the thing really
appears to be doing plain BP, without any tricks.
So, here are the details:
Problem: XOR (2 inputs)
No. of training patterns: 4
Input logical levels: -1 for FALSE, 1 for TRUE
Target output logical levels: -.9 for FALSE, .9 for TRUE
Network: 2 inputs, 2 hidden, 1 output
Interconnection: Full between successive layers,
no direct links from inputs to output
Unit non-linearity: Scaled arctangent, i.e. 2/Pi * arctan(s),
where "s" is the input sum
Learning method: Backpropagation, batch mode, no momentum
Step size (learning rate): 1
Cost function: Squared error, summed over the 4 training
patterns
Weight initialization: Random, uniform in [-1,1]
Stopping criterion: When the sign of the output is correct
for all 4 training patterns
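
To make the setup concrete, here is a minimal sketch of plain batch BP
on XOR with these settings (my own Python sketch, not the Inesc
simulator; in particular, the bias weights on the hidden and output
units, their initialization in [-1,1], and the 1000-epoch cap are my
assumptions, since they are not spelled out above):

    import math
    import random

    def f(s):
        # scaled arctangent non-linearity: 2/Pi * arctan(s)
        return (2.0 / math.pi) * math.atan(s)

    def df(s):
        # its derivative: (2/Pi) / (1 + s^2)
        return (2.0 / math.pi) / (1.0 + s * s)

    # the 4 XOR patterns: inputs at -1/1, targets at -.9/.9
    patterns = [([-1.0, -1.0], -0.9),
                ([-1.0,  1.0],  0.9),
                ([ 1.0, -1.0],  0.9),
                ([ 1.0,  1.0], -0.9)]

    ETA = 1.0  # step size (learning rate)

    def train(seed, max_epochs=1000):
        rng = random.Random(seed)
        # 2 hidden units (2 input weights + bias each) and 1 output unit
        # (2 hidden weights + bias), all initialized uniformly in [-1, 1]
        w_h = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]
        w_o = [rng.uniform(-1.0, 1.0) for _ in range(3)]

        for epoch in range(1, max_epochs + 1):
            g_h = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
            g_o = [0.0, 0.0, 0.0]
            all_signs_ok = True
            for x, t in patterns:
                # forward pass
                s_h = [w[0] * x[0] + w[1] * x[1] + w[2] for w in w_h]
                y_h = [f(s) for s in s_h]
                s_o = w_o[0] * y_h[0] + w_o[1] * y_h[1] + w_o[2]
                y_o = f(s_o)
                if y_o * t <= 0.0:
                    all_signs_ok = False
                # backward pass for E = sum over patterns of (t - y_o)^2
                d_o = 2.0 * (y_o - t) * df(s_o)
                g_o[0] += d_o * y_h[0]
                g_o[1] += d_o * y_h[1]
                g_o[2] += d_o
                for j in range(2):
                    d_h = d_o * w_o[j] * df(s_h[j])
                    g_h[j][0] += d_h * x[0]
                    g_h[j][1] += d_h * x[1]
                    g_h[j][2] += d_h
            if all_signs_ok:
                return epoch  # stopping criterion met
            # batch update, plain gradient descent, no momentum
            for j in range(2):
                for k in range(3):
                    w_h[j][k] -= ETA * g_h[j][k]
            for k in range(3):
                w_o[k] -= ETA * g_o[k]
        return None  # run that got stuck (a "local minimum" in the above sense)

    if __name__ == "__main__":
        results = [train(seed) for seed in range(100)]
        converged = [e for e in results if e is not None]
        print(len(converged), "of 100 runs converged; median epochs:",
              sorted(converged)[len(converged) // 2] if converged else "-")

Note that the gradient keeps the full factor of 2 from the summed
squared error, so the effective step corresponds to step size 1 under
exactly the cost function stated above.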
Why did I choose these parameters? It is relatively well known that
symmetrical sigmoids (e.g. varying between -1 and 1) give faster
learning than unsymmetrical ones (e.g. varying between 0 and 1) [Yann
Le Cun had a poster on the reasons for that at one of the NIPS
conferences, two or three years ago]. On the other hand, I thought that
"arctan" probably learns faster than "tanh", because of its slower
saturation, but I never ran any extensive tests on that - see below
for the results with "tanh(s/2)".
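
As a rough numerical illustration of that slower saturation (my own
back-of-the-envelope check, not something from the experiments
reported here): the derivative of 2/Pi * arctan(s) decays only like
1/s^2, while the derivative of tanh(s/2) decays exponentially, so an
arctan unit keeps a noticeably larger gradient when its input sum is
driven far from zero.

    import math

    for s in [0.0, 1.0, 2.0, 4.0, 8.0]:
        d_atan = (2.0 / math.pi) / (1.0 + s * s)        # derivative of 2/Pi * arctan(s)
        d_tanh = 0.5 * (1.0 - math.tanh(s / 2.0) ** 2)  # derivative of tanh(s/2)
        print("s = %4.1f   arctan': %.4f   tanh': %.4f" % (s, d_atan, d_tanh))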