No subject


Mon Jun 5 16:42:55 EDT 2006


faster if you don't start too close to the origin. That's why I
normally use the range [-1,1] for weight initialization. Again, I
never ran any extensive tests on that.

The input logical values are symmetrical for the same reason that the
sigmoid should be symmetrical: to avoid a DC component. On the other
hand, it is well known that one should not choose the saturation
levels of the sigmoid as the target logical values, otherwise the
weights will tend to grow to infinity. That's why I chose +-0.9.
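
To make this concrete, here is a minimal Python/NumPy sketch (not the
simulator code) of the symmetric encoding and a symmetric sigmoid; the
choice of tanh as the sigmoid, and the names X, T, sigmoid and
dsigmoid, are only assumptions made for illustration.

    import numpy as np

    # Symmetric encoding of the XOR problem: inputs in {-1, +1},
    # targets at +/-0.9 rather than at the saturation levels.
    X = np.array([[-1.0, -1.0],
                  [-1.0, +1.0],
                  [+1.0, -1.0],
                  [+1.0, +1.0]])
    T = np.array([[-0.9], [+0.9], [+0.9], [-0.9]])

    def sigmoid(s):
        # A symmetric (zero-mean) sigmoid; tanh is one possible choice.
        return np.tanh(s)

    def dsigmoid(s):
        # Derivative of tanh.
        return 1.0 - np.tanh(s) ** 2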

The only parameter that I played with, in this case, was the learning
rate. I made a few preliminary runs with different values of this
parameter, and the value of 1 looked good. Note, however, that these
were really just a few runs, not any extensive optimization.
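
Putting these pieces together, a rough sketch of a single run could
look like the following. The network size (2-2-1 with biases), the
batch update, and the convergence test (every output within 0.1 of its
target) are assumptions made only for illustration, not a description
of the simulator; the weight range [-1,1] and the learning rate of 1
are the values mentioned above.

    def train_xor(seed, lr=1.0, max_epochs=2000, f=sigmoid, df=dsigmoid):
        # One run of batch backprop on a 2-2-1 network.  Returns the
        # number of epochs needed to converge, or None if max_epochs
        # is exceeded.
        rng = np.random.RandomState(seed)
        W1 = rng.uniform(-1, 1, (3, 2))       # input (2 + bias) -> hidden (2)
        W2 = rng.uniform(-1, 1, (3, 1))       # hidden (2 + bias) -> output (1)
        Xb = np.hstack([X, np.ones((4, 1))])  # inputs with appended bias

        for epoch in range(1, max_epochs + 1):
            s1 = Xb @ W1                           # hidden pre-activations
            h = f(s1)
            hb = np.hstack([h, np.ones((4, 1))])   # hidden with appended bias
            s2 = hb @ W2                           # output pre-activation
            y = f(s2)

            if np.all(np.abs(y - T) < 0.1):        # assumed convergence test
                return epoch

            # Batch gradient of the summed squared error.
            d2 = (y - T) * df(s2)
            d1 = (d2 @ W2[:2].T) * df(s1)
            W2 -= lr * hb.T @ d2
            W1 -= lr * Xb.T @ d1
        return None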


Since the previous informal results generated some discussion, I
decided to be a bit more formal, and I report here the results of 51
runs using the framework indicated above, with different seeds for the
random number generator. What I give below is the histogram of the
number of epochs to convergence: the first figure is the number of
epochs, the second one is the number of runs that converged in that
number of epochs.

	 7 - 3		22 - 2
	 8 - 1		27 - 1
	 9 - 3		28 - 1
	10 - 3		36 - 1
	11 - 2		46 - 1
	12 - 2		48 - 1
	13 - 5		50 - 1
	17 - 5		51 - 1
	18 - 1		56 - 1
	19 - 1		72 - 1
	21 - 2	     >2000 - 12

2000" are">
The ">2000" entries are the "local minima" (see below). As you can see,
the median of this distribution is 19 epochs. Some colleagues around
here have been running tests, with results consistent with these. One
of them (Jose Amaral) has been studying algorithm convergence speeds,
and therefore has software specially designed for this kind of test.
He also has similar results for this situation (in fact a median of
19, too). But he also came up with a very surprising result: if you
use "tanh(s/2)" as the sigmoid, with a step size of 0.7, the median
number of epochs is only 4 (!) [I've put the exclamation mark between
parentheses, so that people don't think it is the factorial of 4]. We
plan to make available, in a few days, a PostScript version of one or
two graphs, with a summary of his results for a few different cases.
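
For completeness, here is a sketch of how this kind of statistic could
be reproduced with the train_xor sketch given earlier: run a number of
seeds, build the histogram, and take the median, counting runs that do
not converge within 2000 epochs as ">2000". The particular seed values
are arbitrary.

    from collections import Counter

    # 51 runs with different seeds, as in the histogram above.
    results = [train_xor(seed) for seed in range(51)]

    # Histogram of epochs to convergence; None stands for ">2000".
    histogram = Counter(results)

    # Median over all 51 runs, with the non-converging runs sorted to
    # the end so that they count as ">2000".
    ordered = sorted(results, key=lambda e: float("inf") if e is None else e)
    median = ordered[len(ordered) // 2]

    print("histogram:", dict(histogram))
    print("median epochs:", median)

Amaral's variant could be tried, under the same assumptions, by passing
f=lambda s: np.tanh(s / 2), df=lambda s: 0.5 * (1.0 - np.tanh(s / 2) ** 2)
and lr=0.7 to train_xor.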


A few words about "local minima": I used this expression somewhat
informally, as we normally do, meaning that after a large number of
epochs (say, 2000) the network has not yet learned the correct outputs
for all training patterns, and the cost function is decreasing very
slowly, so it appears to be converging to a local minimum. I must say,
however, that some years ago I once took one of these "local minima"
of the XOR, and allowed it to continue training for a long time. After
some 180,000 epochs, the net actually learned all 4 patterns correctly.
I tried this with one of the "local minima" here, and the same thing
happened again (after I reduced the step size to 0.5, and then to 0.2).
I don't know how many epochs it took: when I left to teach a class, it
was above 1,000,000 epochs, with the wrong output in one of the
patterns. I left it running, and when I came back it was at 5,360,000
epochs and had already learned all 4 patterns.


Finally, I am sorry that I cannot publish the simulator code itself.
We sell this simulator (we don't make much money with it, but anyway),
so I can't make it public. And besides, now that I have told you all
my tricks, leave me at least with my little simulator, so that I can
earn my living by selling it to those that didn't read this e-mail :)


Happy training,

Luis B. Almeida

INESC                             Phone: +351-1-544607, +351-1-3100246
Apartado 10105                    Fax:   +351-1-525843
P-1017 Lisboa Codex
Portugal

lba@inesc.pt
lba@inesc.uucp                      (if you have access to uucp)


