Generalization ability of a BPTT-net

Luis B. Almeida lba at sara.inesc.pt
Fri Nov 13 07:28:43 EST 1992


[To the "Connectionists" moderator: I am sending the following
response directly to Walter Weber. I don't know whether you would
consider it appropriate to also publish it in Connectionists.]

I would like to make two suggestions concerning the first problem:

a) Teacher forcing, though often very useful, does not necessarily
perform descent (and therefore minimization) on the objective
function. Why not use the weights obtained with teacher forcing as an
initialization for a second training stage, which would use normal
BPTT without teacher forcing?

b) Using a sigmoid on the output unit means that, in order to produce
peaks (values close to 0 or to 1), the sum at the input of that unit
must become relatively large, in absolute value. The net might perform
better if you remove the sigmoid from the output unit, which will then
become linear.
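
Both suggestions can be illustrated with a small numerical sketch. It
is not Walter Weber's network: the one-unit recurrent net, its weights
(w_rec, w_in) and the toy sequence below are all assumptions made up
for illustration. The first part shows that the error measured with
teacher forcing can differ from the error of the free-running net, so
reducing the former does not guarantee reducing the latter. The second
part computes the net input a sigmoid unit needs in order to output a
value near 0 or 1 (about 4.6 in absolute value for an output of 0.99
or 0.01).

```python
import math


def sigmoid(s):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-s))


def logit(y):
    """Inverse of the sigmoid: the net input needed to output y (0 < y < 1)."""
    return math.log(y / (1.0 - y))


def run(w_rec, w_in, inputs, targets, teacher_forcing):
    """Forward pass of a hypothetical one-unit recurrent net, linear output.

    With teacher forcing, the value fed back at each step is the previous
    target; without it, the net's own previous output is fed back.
    """
    outputs = []
    prev = 0.0
    for x, d in zip(inputs, targets):
        y = w_rec * prev + w_in * x
        outputs.append(y)
        prev = d if teacher_forcing else y
    return outputs


def sse(outputs, targets):
    """Sum of squared errors over the sequence."""
    return sum((y - d) ** 2 for y, d in zip(outputs, targets))


# Toy sequence (assumed): targets decay by a factor of 0.5 after an impulse.
inputs = [1.0, 0.0, 0.0, 0.0]
targets = [1.0, 0.5, 0.25, 0.125]

# Same (deliberately wrong) weights, evaluated in both modes.
forced = run(0.9, 1.0, inputs, targets, teacher_forcing=True)
free = run(0.9, 1.0, inputs, targets, teacher_forcing=False)
print("error with teacher forcing:", sse(forced, targets))
print("error when free-running:   ", sse(free, targets))

# Net input required at a sigmoid output unit for several target values.
for y in (0.5, 0.9, 0.99, 0.999):
    print("output", y, "needs net input", logit(y))
```

In this toy case the free-running error is larger than the
teacher-forced one for the same weights, which is the motivation for a
second training stage that minimizes the free-running error directly.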

I didn't fully understand your second problem. What are the inputs to
the net in the [.6, .8] case and in the [.4, .6] case? Are they the
same? Are they similar in some way?

Luis B. Almeida

INESC                             Phone: +351-1-544607
Apartado 10105                    Fax:   +351-1-525843
P-1017 Lisboa Codex
Portugal

lba at inesc.pt
lba at inesc.uucp                    (if you have access to uucp)

