Training XOR with BP
Luis B. Almeida
lba at sara.inesc.pt
Mon Mar 15 12:49:19 EST 1993
I must say I am a bit surprised myself by the XOR discussion, but
in the opposite sense: for me, the XOR has always converged rather
fast. Let me be more specific: I already had the idea in my mind, from
previous informal tests, that the XOR usually converged in much less
than 100 epochs (with a relatively large percentage of runs that fell
into "local minima" - more about this below). The difference from
other people's results may come from implementation details, which I
give below.
The experiments I reported were made with a BP simulator developed
here at Inesc, which has a lot of facilities (adaptive step sizes,
cross-validation, optional entropy error, weight decay, momentum,
etc.). For these experiments I set the parameters so that all these
features were disabled. And to be sure, I have just looked through
the essential parts of the code and found no bugs - the thing really
appears to be doing plain BP, without any tricks.
So, here are the details:
Problem: XOR (2 inputs)
No. of training patterns: 4
Input logical levels: -1 for FALSE, 1 for TRUE
Target output logical levels: -.9 for FALSE, .9 for TRUE
Network: 2 inputs, 2 hidden, 1 output
Interconnection: Full between successive layers,
no direct links from inputs to output
Unit non-linearity: Scaled arctangent, i.e. 2/Pi * arctan(s),
where "s" is the input sum
Learning method: Backpropagation, batch mode, no momentum
Step size (learning rate): 1
Cost function: Squared error, summed over the 4 training
patterns
Weight initialization: Random, uniform in [-1,1]
Stopping criterion: When the sign of the output is correct
for all 4 training patterns
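
To make the setup concrete, here is a minimal sketch of plain batch BP
on XOR with these settings (my own Python sketch, not the Inesc
simulator; in particular, the bias weights on the hidden and output
units, their initialization in [-1,1], and the 1000-epoch cap are my
assumptions, since they are not spelled out above):

    import math
    import random

    def f(s):
        # scaled arctangent non-linearity: 2/Pi * arctan(s)
        return (2.0 / math.pi) * math.atan(s)

    def df(s):
        # its derivative: (2/Pi) / (1 + s^2)
        return (2.0 / math.pi) / (1.0 + s * s)

    # the 4 XOR patterns: inputs at -1/1, targets at -.9/.9
    patterns = [([-1.0, -1.0], -0.9),
                ([-1.0,  1.0],  0.9),
                ([ 1.0, -1.0],  0.9),
                ([ 1.0,  1.0], -0.9)]

    ETA = 1.0  # step size (learning rate)

    def train(seed, max_epochs=1000):
        rng = random.Random(seed)
        # 2 hidden units (2 input weights + bias each) and 1 output unit
        # (2 hidden weights + bias), all initialized uniformly in [-1, 1]
        w_h = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]
        w_o = [rng.uniform(-1.0, 1.0) for _ in range(3)]

        for epoch in range(1, max_epochs + 1):
            g_h = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
            g_o = [0.0, 0.0, 0.0]
            all_signs_ok = True
            for x, t in patterns:
                # forward pass
                s_h = [w[0] * x[0] + w[1] * x[1] + w[2] for w in w_h]
                y_h = [f(s) for s in s_h]
                s_o = w_o[0] * y_h[0] + w_o[1] * y_h[1] + w_o[2]
                y_o = f(s_o)
                if y_o * t <= 0.0:
                    all_signs_ok = False
                # backward pass for E = sum over patterns of (t - y_o)^2
                d_o = 2.0 * (y_o - t) * df(s_o)
                g_o[0] += d_o * y_h[0]
                g_o[1] += d_o * y_h[1]
                g_o[2] += d_o
                for j in range(2):
                    d_h = d_o * w_o[j] * df(s_h[j])
                    g_h[j][0] += d_h * x[0]
                    g_h[j][1] += d_h * x[1]
                    g_h[j][2] += d_h
            if all_signs_ok:
                return epoch  # stopping criterion met
            # batch update, plain gradient descent, no momentum
            for j in range(2):
                for k in range(3):
                    w_h[j][k] -= ETA * g_h[j][k]
            for k in range(3):
                w_o[k] -= ETA * g_o[k]
        return None  # run that got stuck (a "local minimum" in the above sense)

    if __name__ == "__main__":
        results = [train(seed) for seed in range(100)]
        converged = [e for e in results if e is not None]
        print(len(converged), "of 100 runs converged; median epochs:",
              sorted(converged)[len(converged) // 2] if converged else "-")

Note that the gradient keeps the full factor of 2 from the summed
squared error, so the effective step corresponds to step size 1 under
exactly the cost function stated above.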
Why did I choose these parameters? It is relatively well known that
symmetrical sigmoids (e.g. varying between -1 and 1) give faster
learning than unsymmetrical ones (e.g. varying between 0 and 1) [Yann
Le Cun had a poster on the reasons for that at one of the NIPS
conferences, two or three years ago]. On the other hand, I thought that
"arctan" probably learns faster than "tanh", because of its slower
saturation, but I never ran any extensive tests on that - see below
for the results with "tanh(s/2)".
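
As a rough numerical illustration of that slower saturation (my own
back-of-the-envelope check, not something from the experiments
reported here): the derivative of 2/Pi * arctan(s) decays only like
1/s^2, while the derivative of tanh(s/2) decays exponentially, so an
arctan unit keeps a noticeably larger gradient when its input sum is
driven far from zero.

    import math

    for s in [0.0, 1.0, 2.0, 4.0, 8.0]:
        d_atan = (2.0 / math.pi) / (1.0 + s * s)        # derivative of 2/Pi * arctan(s)
        d_tanh = 0.5 * (1.0 - math.tanh(s / 2.0) ** 2)  # derivative of tanh(s/2)
        print("s = %4.1f   arctan': %.4f   tanh': %.4f" % (s, d_atan, d_tanh))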