Parallel MLPs

Patrick Grother patrick at magi.ncsl.nist.gov
Wed Nov 6 14:29:04 EST 1991


Parallel Multilayer Perceptron

We have implemented batch-mode conjugate gradient and backprop algorithms
on an AMT DAP 510C array processor (32 x 32 8-bit SIMD elements). As you know,
the weight update (a straightforward vector operation done once per epoch) is a
tiny fraction of the total cost, since for realistically large (non-redundant,
i.e. noisy) training sets the forward and backward propagation time is
dominant. Given that this applies to both conjugate gradient and backprop,
and that conjugate gradient typically converges in about 30 times fewer
iterations than backprop, conjugate gradient is undeniably the way to do it.
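To put rough numbers on that (a back-of-the-envelope estimate of my own, not a
measured figure): a HIDS x INPS layer costs about PATS * HIDS * INPS multiply-adds
for the forward pass over a training set of PATS patterns, and roughly as much
again for the backward pass, whereas the weight update is a single pass over the
HIDS * INPS weights. The update is therefore of order 1/(2*PATS) of the epoch
cost, i.e. well under one percent for any training set of a few hundred patterns
or more.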

On-line training involves the forward pass of a single input vector through the
weights. This is a matrix*vector operation followed by a sigmoid (or whatever)
evaluation. The latter is purely parallel. The matrix*vector operation involves
a broadcast, a parallel multiplication and a recursive-doubling sum over rows
(or columns).
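
As a point of reference, here is a minimal serial sketch of that single-vector
pass in numpy (my own translation, not DAP code; the names are mine, and numpy's
sum stands in for the recursive doubling reduction):

   import numpy as np

   def forward_one(W, x):
       # W: (HIDS, INPS) weight matrix, x: (INPS,) input vector
       replicas = np.broadcast_to(x, W.shape)   # broadcast x across the rows of W
       net = (W * replicas).sum(axis=1)         # parallel multiply, then sum over columns
       return 1.0 / (1.0 + np.exp(-net))        # sigmoid, purely parallel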

Batch (or semi-batch) training passes many vectors through at once and is thus
a matrix*matrix operation. The parallel literature on this operation is huge,
and the best algorithm inevitably depends on the particular machine (its
communications bandwidth especially) and on the size of the matrices. On the
array processor the outer-product accumulation algorithm is up to 5 times
quicker than the inner-product algorithm:

Outer:  Given two matrices W(HIDS,INPS), P(INPS,PATS) obtain F(HIDS,PATS) thus
   F = 0
   do i = 1, INPS
   {
      col_replicas = col_broadcast(W(,i))	# replicate col i over cols
      row_replicas = row_broadcast(P(i,))	# replicate row i over rows
      F = F + row_replicas * col_replicas
   }

Inner:  As above except
   do i = 1, HIDS
   {
      col_replicas = col_broadcast(W(i,))	# replicate row i over cols
      F(i,) = sum_over_rows( P * col_replicas ) # sum up the cols (ie over rows)
   }
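
For anyone working in a serial language, the two variants above translate
directly; a minimal numpy sketch of my own (the names matmul_outer and
matmul_inner are mine, not DAP routines), checked against the ordinary product:

   import numpy as np

   def matmul_outer(W, P):
       # F(HIDS,PATS) built as a sum of INPS rank-one (outer product) updates
       HIDS, INPS = W.shape
       PATS = P.shape[1]
       F = np.zeros((HIDS, PATS))
       for i in range(INPS):
           col_replicas = np.broadcast_to(W[:, i:i+1], (HIDS, PATS))  # col i over cols
           row_replicas = np.broadcast_to(P[i:i+1, :], (HIDS, PATS))  # row i over rows
           F += row_replicas * col_replicas
       return F

   def matmul_inner(W, P):
       # F built one row at a time: elementwise product then a sum over rows
       HIDS, INPS = W.shape
       PATS = P.shape[1]
       F = np.zeros((HIDS, PATS))
       for i in range(HIDS):
           col_replicas = np.broadcast_to(W[i, :, None], (INPS, PATS))  # row i over cols
           F[i, :] = (P * col_replicas).sum(axis=0)                     # sum over rows
       return F

   W = np.random.randn(8, 16)
   P = np.random.randn(16, 4)
   assert np.allclose(matmul_outer(W, P), W @ P)
   assert np.allclose(matmul_inner(W, P), W @ P)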

Henrik Klagges' weight matrix transposition in backprop is not really necessary.
The output error is backpropagated through the final layer weights using
the algorithm above; the difference is merely one of selecting rows of the
weight matrix instead of columns.

Outer:  With weights W(HIDS,INPS), errors F(HIDS,PATS) obtain P(INPS,PATS) thus
   P = 0
   do i = 1, HIDS
   {
      col_replicas = col_broadcast(W(i,))	# replicate row i of W over cols
      row_replicas = row_broadcast(F(i,))	# replicate row i of F over rows
      P = P + col_replicas * row_replicas
   }
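
The same translation works for the transpose-free backward pass; a numpy sketch
of my own (backprop_outer is a hypothetical name), checked against the explicit
transpose:

   import numpy as np

   def backprop_outer(W, F):
       # P(INPS,PATS) = transpose(W) * F without ever forming the transpose:
       # the roles of the two broadcasts are simply swapped
       HIDS, INPS = W.shape
       PATS = F.shape[1]
       P = np.zeros((INPS, PATS))
       for i in range(HIDS):
           col_replicas = np.broadcast_to(W[i, :, None], (INPS, PATS))  # row i of W over cols
           row_replicas = np.broadcast_to(F[i:i+1, :], (INPS, PATS))    # row i of F over rows
           P += col_replicas * row_replicas
       return P

   W = np.random.randn(8, 16)
   F = np.random.randn(8, 4)
   assert np.allclose(backprop_outer(W, F), W.T @ F)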

On the DAP this operation is just as fast as doing the transpose explicitly.
Transposition can be sped up greatly when the matrix dimensions are powers of
two, but it is inexpensive compared to the matrix multiplication anyway, for
matrices of any size.

A "recall" forward pass through a 32:32:10 MLP with 24 bit floats
is taking 79 microsecs per input vector. Through a 128:64:10 takes 305
microsecs and a 1024:512:32 takes 1237. The latter is equivalent to 
17.4 million connection-presentations per second. Such speed permits MLPs,
trained from many initial random weight positions, to be optimised.

The on-line versus batch question is still unclear, and I think a semi-batch
conjugate gradient method looks like a good compromise, in which case parallel
code like the above applies.


Patrick Grother
patrick at magi.ncsl.nist.gov

Image Recognition Group
Advanced Systems Division
Computer Systems Laboratory
Room A216 Building 225
NIST
Gaithersburg MD 20899

