[CMU AI Seminar] Nov 2 at 12pm (Zoom) -- Jeremy Cohen (CMU MLD) -- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability -- AI Seminar sponsored by Morgan Stanley

Wed Oct 27 10:56:12 EDT 2021

Dear all,

We look forward to seeing you *next Tuesday (11/2)* from *12:00-1:00 PM
(U.S. Eastern time)* for the next talk of our *CMU AI Seminar*, sponsored
by Morgan Stanley <https://www.morganstanley.com/about-us/technology/>.

To learn more about the seminar series or see the future schedule, please
visit the seminar website <http://www.cs.cmu.edu/~aiseminar/>.

On 11/2, *Jeremy Cohen* (CMU MLD) will be giving a talk on "*Gradient
Descent on Neural Networks Typically Occurs at the Edge of Stability*".

*Title:* Gradient Descent on Neural Networks Typically Occurs at the Edge
of Stability

*Talk Abstract:* Neural networks are trained using optimization algorithms.
While we sometimes understand how these algorithms behave in restricted
settings (e.g. on quadratic or convex functions), very little is known
about the dynamics of these optimization algorithms on real neural
objective functions. In this paper, we take a close look at the simplest
optimization algorithm—full-batch gradient descent with a fixed step
size—and find that its behavior on neural networks is both (1) surprisingly
consistent across different architectures and tasks, and (2) surprisingly
different from that envisioned in the "conventional wisdom."

In particular, we empirically demonstrate that during gradient descent
training of neural networks, the maximum Hessian eigenvalue (the
"sharpness") always rises all the way to the largest stable value, which is
2/(step size), and then hovers just *above* that numerical value for the
remainder of training, in a regime we term the "Edge of Stability." (See
<https://twitter.com/deepcohen/status/1366881479175847942> for a 1m 17s
animation.) At the Edge of Stability, the sharpness is still "trying" to
increase further—and that's what happens if you drop the step size—but is
somehow being actively restrained from doing so, by the implicit dynamics
of the optimization algorithm. Our findings have several implications for
the theory of neural network optimization. First, whereas the conventional
wisdom in optimization says that the sharpness ought to determine the step
size, our paper shows that in the topsy-turvy world of deep learning, the
reality is precisely the opposite: the *step size* wholly determines the
*sharpness*. Second, our findings imply that convergence analyses based on
L-smoothness, or on ensuring monotone descent, do not apply to neural
network training.
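
For readers who want to see where the 2/(step size) threshold comes from,
here is a minimal sketch on a one-dimensional quadratic. It is an
illustration only, not code from the paper: the function name, the
objective f(x) = 0.5 * sharpness * x**2, and all numeric values are
assumptions chosen for the example. On this objective the curvature (the
"sharpness") is constant, so the sketch shows only the stability
threshold, not the rise of the sharpness described in the abstract.

```python
def gradient_descent_on_quadratic(sharpness, step_size, x0=1.0, num_steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return all iterates."""
    xs = [x0]
    x = x0
    for _ in range(num_steps):
        grad = sharpness * x       # f'(x) = sharpness * x
        x = x - step_size * grad   # multiplies x by (1 - step_size * sharpness)
        xs.append(x)
    return xs


step_size = 0.1
threshold = 2.0 / step_size        # largest stable sharpness: 2 / (step size)

# Hypothetical sharpness values just below and just above the threshold.
below = gradient_descent_on_quadratic(0.9 * threshold, step_size)
above = gradient_descent_on_quadratic(1.1 * threshold, step_size)

print("below threshold, final |x|:", abs(below[-1]))  # shrinks toward 0
print("above threshold, final |x|:", abs(above[-1]))  # grows with each step
```

Each step multiplies x by (1 - step_size * sharpness), so the iterates
shrink only when that factor has magnitude below 1, that is, when the
sharpness is below 2/(step size).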

*Speaker Bio:* Jeremy Cohen is a PhD student in the Machine Learning
Department at CMU, co-advised by Zico Kolter and Ameet Talwalkar. His
research focus is "neural network plumbing": how to initialize and
normalize neural networks so that they train quickly and generalize well.

*Zoom Link:*
https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09

Thanks,
Asher Trockman