forward and backward (or, the transposed weight matrix hassle)

Nelson Morgan morgan at icsib.Berkeley.EDU
Mon Nov 4 11:42:20 EST 1991


> From: Henrik Klagges <uh311ae at sunmanager.lrz-muenchen.de>
> Message-Id: <9111031856.AA02137 at sunmanager.lrz-muenchen.de>
> To: connectionists at cs.cmu.edu
> Subject: The transposed weight matrix hassle
> 
> There are some nasty things showing up if you want to fine-tune
> a parallel architecture to algorithms such as backprop. E.g., you
> either get the communications fast for the forward phase or the
> backward phase - but if you want to use the same communication
> flow for both, you have to transpose the weight matrices. This
> is on the order of O(forget it). Has anybody cooked up an idea?
> 
> Cheers, Henrik
> 
> MPCI at LLNL
> IBM Research
> U. of Munich
> 
> ------- End of Forwarded Message

Sure. We have had our parallel architecture, the Ring Array Processor
(RAP), training up backprop nets for our speech recognition research for
about a year and a half now. With a ring or a torus, you don't need to
duplicate weight matrices. For instance, you can organize the weights
so they are most convenient for the forward pass, and then during the
backward pass just compute partial sums for all of the deltas;
that is, on each processor, compute the pieces of every sum that involve
the locally stored weights. Then pass the partial sums around the ring
systolically, accumulating in each processor as they arrive. If your
computation is strongly virtualized (many more than one neuron per
physical processor), and if the communication is efficient (we shift
around the ring in one cycle, plus a few cycles of overhead per complete
trip around the ring), then this part of backprop is not a bad cost.
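
In case the bookkeeping is unclear, here is a rough serial sketch of
the idea in Python/NumPy notation (not RAP code; the processor count,
layer sizes, and variable names are all made up for illustration). It
circulates each processor's partial-sum vector around a simulated ring
so that every processor ends up with the complete deltas for its own
slice of hidden units, without ever forming the transposed weight
matrix:

import numpy as np

# Serial simulation of the ring scheme sketched above; all sizes and
# names are illustrative, not taken from the RAP implementation.
P = 4                     # processors in the ring
n_out, n_hid = 8, 8       # output / hidden units, both divisible by P
chunk = n_hid // P        # hidden units owned by each processor
rng = np.random.default_rng(0)

# Weights stored the way the forward pass likes them: processor p holds
# the rows of W for its own output units (all hidden columns present).
W_blk = [rng.standard_normal((n_out // P, n_hid)) for _ in range(P)]
d_out = [rng.standard_normal(n_out // P) for _ in range(P)]  # local output deltas

# Backward pass, step 1: each processor forms a partial sum of the
# hidden error terms from the rows it holds -- no transpose needed.
partial = [d @ Wb for d, Wb in zip(d_out, W_blk)]   # each of length n_hid

# Step 2: circulate the partial-sum vectors around the ring; each
# processor accumulates the slice for its own hidden units as they arrive.
acc = [partial[p][p * chunk:(p + 1) * chunk].copy() for p in range(P)]
travel = [partial[p].copy() for p in range(P)]
for _ in range(P - 1):
    travel = [travel[(p - 1) % P] for p in range(P)]   # one systolic shift
    for p in range(P):
        acc[p] += travel[p][p * chunk:(p + 1) * chunk]

# Sanity check against the single-processor result W^T * delta.
W_full = np.vstack(W_blk)
d_full = np.concatenate(d_out)
ref = d_full @ W_full
assert all(np.allclose(acc[p], ref[p * chunk:(p + 1) * chunk]) for p in range(P))

The point is just that the weight layout chosen for the forward pass
also serves the backward pass, at the price of one trip around the ring.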

I think this is described in our paper in Proceedings of ASAP '90.
You can also send mail to info at icsi.berkeley.edu to ask about RAP TR's.

