<div dir="ltr">Hey Predrag,<br><div><br></div><div>Thanks for your help on this. I am not sure about gpu5 or gpu7; they may have been unaffected. Besides those machines, I just confirmed that gpus 1-13 have the same output as the attached screenshot. Interestingly, I just looked at some of the other gpus, and they also logged some activity, but only this morning. However, they say "still running" (see screenshot for gpu23), and it seems like the jobs on them may have been unaffected. I am not sure what this means. Does this suggest some sort of power outage?</div><div><br></div><div><img src="cid:ii_lcrvb2w90" alt="image.png" width="562" height="48"><br></div><div><br></div><div>Thanks,</div><div>Ian</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div>Hi Ian,</div><div dir="auto"><br></div><div dir="auto">Interesting. I had not noticed, but the monitoring switches are connected to the same PDUs. This makes me think that this has something to do with a power outage. Namely, GPU nodes, for obvious reasons, are not UPSed. Either the PDUs' capacity was exceeded or power to certain outlets was cut. I will look into it. The server room was too hot the other day when I did the work. They are doing something, but I couldn't see what. Are you sure that the affected machines are from 2 different racks? GPU1-9 + Denver is one rack. 
GPU 5 and GPU 7 are not even part of the GPU pool.</div><div dir="auto"><br></div><div dir="auto">Predrag<br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Wed, Jan 11, 2023, 9:56 AM Ian Char <<a href="mailto:ichar@andrew.cmu.edu" rel="noreferrer" target="_blank">ichar@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hey Predrag,<br><div><br></div><div>Hope you had a happy new year and are doing well!</div><div><br></div><div>It seems that both today and yesterday morning, many GPU machines were rebooted at the exact same time (see screenshot below). As far as I can tell, this happened for GPUs 1-15. Do you have any insight into why this might be happening?</div><div><br></div><div><img src="https://mail.google.com/mail/?ui=2&ik=d8eb32d3d7&attid=0.1&th=185a17581cab0db6&view=fimg&fur=ip&rm=185a17581cab0db6&sz=w1600-h1000&attbid=ANGjdJ-kpTM1riC3rMG8kvIlDFcrMNx19AP-s4HA1eLBxi_jlbr1XvyM4efHjNfBzlwJ0T0NO5W2VOEQEgn4WMiMgEdkElq1u_cBz77hYCBP5Zwmog3UTEjRbP8ePck&disp=emb&realattid=ii_lcrsaqta0&zw" alt="image.png" width="542" height="66"><br></div><div><br></div><div>Thank you,</div><div>Ian</div></div>
</blockquote></div></div></div>
</blockquote></div>