Backprop w/o Math, Not.
Jerry Feldman
jfeldman at ICSI.Berkeley.EDU
Tue Mar 24 12:08:34 EST 1998
Last fall, I posted a request for ideas on how to teach
back propagation to undergrads, some from linguistics,
etc., who had little math. There were quite a few clever
suggestions, but my conclusion was that it couldn't be
done. There is no problem conveying the ideas of searching
weight space, local minima, etc. But how could they
understand the functional dependence w/o math?
The students had already done some simple exercises with
Tlearn, and this seemed to help get them motivated.
Following several suggestions and PDP v.3, I started with
the integer-based perceptron "delta rule" and left that
on the board. The next step was to do the "delta rule" for
linear nets with no hidden units. But before doing that, I
"reminded" them about partial derivatives, using the volume
of a cylinder, V = pi*r*r*h.
There was an overhead with pictures of how the two
partials affected V:
dV/dh = pi*r*r was a thin dotted disk
dV/dr = 2*pi*r*h was a thin dotted sleeve
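For anyone who wants to check the picture numerically, here
is a minimal Python sketch using finite differences (the
values of r and h are arbitrary examples, nothing from the
lecture):

    import math

    def V(r, h):
        return math.pi * r * r * h

    r, h, eps = 2.0, 5.0, 1e-6

    dV_dh = (V(r, h + eps) - V(r, h)) / eps  # approaches pi*r*r, the thin disk
    dV_dr = (V(r + eps, h) - V(r, h)) / eps  # approaches 2*pi*r*h, the thin sleeve

    print(dV_dh, math.pi * r * r)        # both about 12.57
    print(dV_dr, 2 * math.pi * r * h)    # both about 62.83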
The only other math needed is the chain rule, which I
motivated directly from the error formula for the linear
case. They saw that the error is expressed in
terms of the output, but that one needs to know the effect
of a weight change, etc. The fact that the result had the
same form as the perceptron case was, of course,
satisfying.
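For concreteness, here is a minimal Python sketch of that
linear-case argument for one output unit and one input
pattern (the numbers are made up): the error is written in
terms of the output y, and the chain rule turns dE/dw_i
into -(t - y)*x_i.

    # One linear output unit, one input pattern (made-up values).
    x  = [1.0, 0.5, -1.0]     # input pattern
    w  = [0.2, -0.3, 0.1]     # weights
    t  = 1.0                  # target
    lr = 0.1                  # learning rate

    y = sum(wi * xi for wi, xi in zip(w, x))   # linear output
    E = 0.5 * (t - y) ** 2                     # error, in terms of the output

    # Chain rule: dE/dw_i = dE/dy * dy/dw_i = -(t - y) * x_i
    delta = t - y
    w = [wi + lr * delta * xi for wi, xi in zip(w, x)]  # same form as the perceptron rule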
They had already seen various activation functions and
knew that the sigmoid had several advantages, but was
obviously more complex. I derived the delta rule for a
network with only one top node, using only one input
pattern; this eliminates lots of subscripts. The
derivation of the sigmoid derivative, f' = f*(1-f), was
given as a handout in gory detail, but I only went over it
briefly. The idea was to get them all to believe that they
could work it through and maybe some of them did. At that
point, I just hand-waved about the delta for a hidden unit
being the appropriate function of the deltas to which it
contributed, and gave the final result.
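For anyone who wants the whole thing in one place, here is
a minimal Python sketch of that final result for a tiny net
with two sigmoid hidden units and a single top node, one
input pattern (the sizes and numbers are illustrative, not
anything from Tlearn):

    import math

    def f(net):                        # sigmoid activation
        return 1.0 / (1.0 + math.exp(-net))

    # Tiny net: 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid top node.
    x  = [1.0, 0.5]
    W1 = [[0.1, -0.2], [0.3, 0.4]]     # W1[j][i]: input i to hidden unit j
    w2 = [0.5, -0.5]                   # hidden unit j to the top node
    t, lr = 1.0, 0.5                   # target and learning rate

    # Forward pass
    h = [f(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    y = f(sum(w2[j] * h[j] for j in range(2)))

    # Delta for the top node: error term times f' = f*(1 - f)
    delta_out = (t - y) * y * (1 - y)

    # Delta for each hidden unit: its own f' times the delta it
    # contributed to, weighted by the connecting weight
    delta_hid = [h[j] * (1 - h[j]) * w2[j] * delta_out for j in range(2)]

    # Weight changes
    w2 = [w2[j] + lr * delta_out * h[j] for j in range(2)]
    W1 = [[W1[j][i] + lr * delta_hid[j] * x[i] for i in range(2)]
          for j in range(2)]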
We then talked about search, local minima and the
learning rate. Since they used momentum in Tlearn, there
was another slide and story on that. My impression is that
this approach works and that nothing simpler would suffice.
There were only about thirty students and
questions were allowed; it would certainly be harder with
a large lecture.
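As an aside on the momentum slide: the whole story reduces
to one line of update rule, the current gradient step plus
a fraction of the previous weight change. A minimal Python
sketch with made-up values:

    # Weight update with a learning rate and momentum (made-up values).
    w, prev_dw = 0.0, 0.0
    lr, momentum = 0.1, 0.9

    for grad in [2.0, 1.5, 1.0, 0.5]:         # pretend sequence of gradients dE/dw
        dw = -lr * grad + momentum * prev_dw  # gradient step plus a fraction of the last change
        w += dw
        prev_dw = dw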
This was all done in one lecture because of the nature of
the course. With more time, I would have followed some
other suggestions and had them work through a tiny
example by hand in class. For this course, we next went to
a discussion of Regier's system, which uses some backprop
extensions to push learning into the structured part of
the net. I was able to describe Regier's techniques quite
easily based on their being familiar with the derivation
of backprop.
I would still be interested in feedback on the overall
course design: www.icsi.berkeley.edu/~mbrodsky/cogsci110/
--
Jerry Feldman