<div dir="ltr">Hi all,<div><br></div><div><b>Jeremy Cohen</b>'s talk on surprising observations and dynamics of full-batch GD on deep neural nets is starting in a few minutes! </div><div><br></div><div>Zoom link: <a href="https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09" target="_blank">https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09</a></div><div><br></div><div>Best,</div><div>Shaojie</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 1, 2021 at 3:32 PM Asher Trockman &lt;<a href="mailto:ashert@cs.cmu.edu">ashert@cs.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi all,<div><br></div><div>Just a reminder that the <a href="http://www.cs.cmu.edu/~aiseminar/" target="_blank">CMU AI Seminar</a> is tomorrow <b><font color="#ff0000">12pm-1pm</font></b>: <a href="https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09" target="_blank">https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09</a>.</div><div><br></div><div><b>Jeremy Cohen (CMU MLD)</b> will be giving a talk on the surprising dynamics of full-batch gradient descent on neural networks.</div><div><br></div><div>Thanks,</div><div>Asher</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 27, 2021 at 10:56 AM Asher Trockman &lt;<a href="mailto:ashert@cs.cmu.edu" target="_blank">ashert@cs.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Dear all,<br><br>We look forward to seeing you <b>next Tuesday (11/2)</b> from <b><font color="#ff0000">1</font></b><font color="#ff0000"><b>2:00-1:00 PM (U.S. 
Eastern time)</b></font> for the next talk of our <b>CMU AI Seminar</b>, sponsored by <a href="https://www.morganstanley.com/about-us/technology/" target="_blank">Morgan Stanley</a>.<br><br>To learn more about the seminar series or see the future schedule, please visit the <a href="http://www.cs.cmu.edu/~aiseminar/" target="_blank">seminar website</a>.<br><br><font color="#0b5394" style="background-color:rgb(255,255,0)">On 11/2, <b><u>Jeremy Cohen</u></b> (CMU MLD) will be giving a talk on "<b>Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability</b></font><font color="#0b5394" style="background-color:rgb(255,255,0)">".</font><br><br><font color="#0b5394"><b>Title:</b> Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability<br><br><b>Talk Abstract:</b> Neural networks are trained using optimization algorithms. While we sometimes understand how these algorithms behave in restricted settings (e.g. on quadratic or convex functions), very little is known about the dynamics of these optimization algorithms on real neural objective functions. In this paper, we take a close look at the simplest optimization algorithm<font size="1">—</font>full-batch gradient descent with a fixed step size—and find that its behavior on neural networks is both (1) surprisingly consistent across different architectures and tasks, and (2) surprisingly different from that envisioned in the "conventional wisdom."</font></div><div><font color="#0b5394"><br></font></div><div><font color="#0b5394">In particular, we empirically demonstrate that during gradient descent training of neural networks, the maximum Hessian eigenvalue (the "sharpness") always rises all the way to the largest stable value, which is 2/(step size), and then hovers just <i>above</i> that numerical value for the remainder of training, in a regime we term the "Edge of Stability." 
(Click <a href="https://twitter.com/deepcohen/status/1366881479175847942" target="_blank">here</a> for 1m 17s animation.) At the Edge of Stability, the sharpness is still "trying" to increase further—and that's what happens if you drop the step size—but is somehow being actively restrained from doing so, by the implicit dynamics of the optimization algorithm. Our findings have several implications for the theory of neural network optimization. First, whereas the conventional wisdom in optimization says that the sharpness ought to determine the step size, our paper shows that in the topsy-turvy world of deep learning, the reality is precisely the opposite: the <i>step size</i> wholly determines the <i>sharpness</i>. Second, our findings imply that convergence analyses based on L-smoothness, or on ensuring monotone descent, do not apply to neural network training.<br><br><b>Speaker Bio: </b>Jeremy Cohen is a PhD student in the Machine Learning Department at CMU, co-advised by Zico Kolter and Ameet Talwalkar. His research focus is "neural network plumbing": how to initialize and normalize neural networks so that they train quickly and generalize well.</font><br><br><b>Zoom Link:</b> <a href="https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09" target="_blank">https://cmu.zoom.us/j/96099846691?pwd=NEc3UjQ4aHJ5dGhpTHpqYnQ2cnNaQT09</a><br><br>Thanks,<br>Asher Trockman</div></div>
</blockquote></div>
</blockquote></div>