<div dir="ltr">Dear all,<div><br></div><div><div>We look forward to seeing you <b>this Tuesday (10/3)</b> from <b><font color="#ff0000">1</font></b><font color="#ff0000"><b>2:00-1:00 PM (U.S. Eastern time)</b></font> for the next talk of this semester's <b>CMU AI Seminar</b>, sponsored by <a href="https://sambanova.ai/" target="_blank">SambaNova Systems</a>. The seminar will be held in GHC 6115 <b>with pizza provided </b>and will<b> </b>be streamed on Zoom.</div><div><br></div><div>To learn more about the seminar series or to see the future schedule, please visit the <a href="http://www.cs.cmu.edu/~aiseminar/" target="_blank">seminar website</a>.</div><div><br></div><font color="#0b5394"><span style="background-color:rgb(255,255,0)">On this Tuesday (10/3), <u>Nikhil Ghosh</u> </span><span style="background-color:rgb(255,255,0)">(UC Berkeley) will be giving a talk titled </span><b style="background-color:rgb(255,255,0)">"</b><b style="background-color:rgb(255,255,0)">Hyperparameter Transfer for Finetuning Large-Scale Models</b><b style="background-color:rgb(255,255,0)">".</b></font></div><div><font color="#0b5394"><span style="background-color:rgb(255,255,0)"><br></span><b>Title</b>: Hyperparameter Transfer for Finetuning Large-Scale Models<br><br></font><div><font color="#0b5394"><b>Talk Abstract</b>: Current models have become so large that most practitioners are unable to effectively tune hyperparameters due to limited computational resources, which results in suboptimal performance. In this talk I will be discussing ongoing work which aims to address this issue by transferring the optimal learning rate from smaller models. This work builds on previous ideas of Yang et al. (2022), which achieves hyperparameter transfer for pretraining large models. In the current work, we aim to study the same problem but in the finetuning setting. 
By reducing the width of a pretrained model via random subsampling and rescaling according to the muP parameterization of Yang et al., we obtain a smaller proxy model which we can finetune with significantly fewer resources. In certain settings, such as when finetuning using LoRA on large datasets, the optimal learning rate is preserved under subsampling, which allows for immediate transfer to larger models. In general, however, we find through both experiments and theoretical calculations that the optimal learning rate can display a rich variety of scaling behaviors. Characterizing the scaling behavior requires understanding more fine-grained aspects of training and generalization.</font><div><div><font color="#0b5394"> </font><font color="#0b5394"><br></font></div><div><font color="#0b5394"><b>Speaker Bio:</b> Nikhil Ghosh is a PhD student in the Statistics department at UC Berkeley working with Bin Yu and Song Mei. His main interests are in the theory of deep learning. Previously, he studied computer science at Caltech and completed internships at Google and Microsoft Research.<br></font></div><div><br></div><div><font color="#0b5394"><b>In person: </b>GHC 6115</font></div><div><font color="#0b5394"><b>Zoom Link</b>:  <a href="https://cmu.zoom.us/j/99510233317?pwd=ZGx4aExNZ1FNaGY4SHI3Qlh0YjNWUT09" target="_blank">https://cmu.zoom.us/j/99510233317?pwd=ZGx4aExNZ1FNaGY4SHI3Qlh0YjNWUT09</a></font></div></div></div></div><div><br></div><div>Thanks,</div><div>Asher Trockman</div></div>