Backprop w/o Math, Not.
Jerry Feldman
jfeldman at ICSI.Berkeley.EDU
Tue Mar 24 12:08:34 EST 1998
Last fall, I posted a request for ideas on how to teach
back propagation to undergrads, some from linguistics,
etc., who had little math. There were quite a few clever
suggestions, but my conclusion was that it couldn't be
done. There is no problem conveying the ideas of searching
weight space, local minima, etc. But how could they
understand the functional dependence w/o math?
The students had already done some simple exercises with
Tlearn, and this seemed to help get them motivated.
Following several suggestions and PDP v.3, I started with
the integer-based perceptron "delta rule" and left that
on the board. The next step was to do the "delta rule" for
linear nets with no hidden units. But before doing that, I
"reminded" them about partial derivatives, using the volume
of a cylinder, V = pi*r*r*h.
There was an overhead with pictures of how the two
partials affected V:
dV/dh = pi*r*r was a thin dotted disk
dV/dr = 2*pi*r*h was a thin dotted sleeve
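For anyone who wants to check the picture numerically, here
is a minimal Python sketch using finite differences (the
values of r and h are arbitrary examples, nothing from the
lecture):

    import math

    def V(r, h):
        return math.pi * r * r * h

    r, h, eps = 2.0, 5.0, 1e-6

    dV_dh = (V(r, h + eps) - V(r, h)) / eps  # approaches pi*r*r, the thin disk
    dV_dr = (V(r + eps, h) - V(r, h)) / eps  # approaches 2*pi*r*h, the thin sleeve

    print(dV_dh, math.pi * r * r)        # both about 12.57
    print(dV_dr, 2 * math.pi * r * h)    # both about 62.83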
The only other math needed is the chain rule, which I
motivated directly from the error formula for the linear
case. They saw that the error is expressed in
terms of the output, but that one needs to know the effect
of a weight change, etc. The fact that the result had the
same form as the perceptron case was, of course,
satisfying.
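For concreteness, here is a minimal Python sketch of that
linear-case argument for one output unit and one input
pattern (the numbers are made up): the error is written in
terms of the output y, and the chain rule turns dE/dw_i
into -(t - y)*x_i.

    # One linear output unit, one input pattern (made-up values).
    x  = [1.0, 0.5, -1.0]     # input pattern
    w  = [0.2, -0.3, 0.1]     # weights
    t  = 1.0                  # target
    lr = 0.1                  # learning rate

    y = sum(wi * xi for wi, xi in zip(w, x))   # linear output
    E = 0.5 * (t - y) ** 2                     # error, in terms of the output

    # Chain rule: dE/dw_i = dE/dy * dy/dw_i = -(t - y) * x_i
    delta = t - y
    w = [wi + lr * delta * xi for wi, xi in zip(w, x)]  # same form as the perceptron rule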
They had already seen various activation functions and
knew that the sigmoid had several advantages, but was
obviously more complex. I derived the delta rule for a
network with only one top node, using only one input
pattern; this eliminates lots of subscripts. The
derivation of the sigmoid derivative, f' = f*(1-f), was
given as a handout in gory detail, but I only went over it
briefly. The idea was to get them all to believe that they
could work it through and maybe some of them did. At that
point, I just hand-waved about the delta for a hidden unit
being the appropriate function of the deltas to which it
contributed, and gave the final result.
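For anyone who wants the whole thing in one place, here is
a minimal Python sketch of that final result for a tiny net
with two sigmoid hidden units and a single top node, one
input pattern (the sizes and numbers are illustrative, not
anything from Tlearn):

    import math

    def f(net):                        # sigmoid activation
        return 1.0 / (1.0 + math.exp(-net))

    # Tiny net: 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid top node.
    x  = [1.0, 0.5]
    W1 = [[0.1, -0.2], [0.3, 0.4]]     # W1[j][i]: input i to hidden unit j
    w2 = [0.5, -0.5]                   # hidden unit j to the top node
    t, lr = 1.0, 0.5                   # target and learning rate

    # Forward pass
    h = [f(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    y = f(sum(w2[j] * h[j] for j in range(2)))

    # Delta for the top node: error term times f' = f*(1 - f)
    delta_out = (t - y) * y * (1 - y)

    # Delta for each hidden unit: its own f' times the delta it
    # contributed to, weighted by the connecting weight
    delta_hid = [h[j] * (1 - h[j]) * w2[j] * delta_out for j in range(2)]

    # Weight changes
    w2 = [w2[j] + lr * delta_out * h[j] for j in range(2)]
    W1 = [[W1[j][i] + lr * delta_hid[j] * x[i] for i in range(2)]
          for j in range(2)]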
We then talked about search, local minima and the
learning rate. Since they used momentum in Tlearn, there
was another slide and story on that. My impression is that
this approach works and that nothing simpler would suffice.
There were only about thirty students and
questions were allowed; it would certainly be harder with
a large lecture.
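As an aside on the momentum slide: the whole story reduces
to one line of update rule, the current gradient step plus
a fraction of the previous weight change. A minimal Python
sketch with made-up values:

    # Weight update with a learning rate and momentum (made-up values).
    w, prev_dw = 0.0, 0.0
    lr, momentum = 0.1, 0.9

    for grad in [2.0, 1.5, 1.0, 0.5]:         # pretend sequence of gradients dE/dw
        dw = -lr * grad + momentum * prev_dw  # gradient step plus a fraction of the last change
        w += dw
        prev_dw = dw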
This was all done in one lecture because of the nature of
the course. With more time, I would have followed some
other suggestions and had them work through a tiny
example by hand in class. For this course, we next went to
a discussion of Regier's system, which uses some backprop
extensions to push learning into the structured part of
the net. I was able to describe Regier's techniques quite
easily based on their being familiar with the derivation
of backprop.
I would still be interested in feedback on the overall
course design: www.icsi.berkeley.edu/~mbrodsky/cogsci110/
--
Jerry Feldman