<div dir="auto"><div>Hi Ian,</div><div dir="auto"><br></div><div dir="auto">Interesting. I had not noticed, but the monitoring switches are connected to the same PDUs. This makes me think it has something to do with a power outage. Namely, the GPU nodes, for obvious reasons, are not on a UPS. Either the PDUs' capacity was exceeded or power to certain outlets was cut. I will look into it. The server room was too hot the other day when I did the work. They are doing something there, but I couldn't see what. Are you sure that the affected machines are from two different racks? GPU1-9 + Denver is one rack. GPU 5 and GPU 7 are not even part of the GPU pool.</div><div dir="auto"><br></div><div dir="auto">Predrag<br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Wed, Jan 11, 2023, 9:56 AM Ian Char &lt;<a href="mailto:ichar@andrew.cmu.edu" target="_blank" rel="noreferrer">ichar@andrew.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey Predrag,<br><div><br></div><div>Hope you had a happy new year and are doing well!</div><div><br></div><div>It seems that both today and yesterday morning many GPU machines were rebooted at the exact same time (see screenshot below). As far as I can tell this happened for GPUs 1-15. Do you have any insights why this might be happening?</div><div><br></div><div><img src="https://mail.google.com/mail/?ui=2&ik=d8eb32d3d7&attid=0.1&th=185a17581cab0db6&view=fimg&fur=ip&rm=185a17581cab0db6&sz=w1600-h1000&attbid=ANGjdJ-kpTM1riC3rMG8kvIlDFcrMNx19AP-s4HA1eLBxi_jlbr1XvyM4efHjNfBzlwJ0T0NO5W2VOEQEgn4WMiMgEdkElq1u_cBz77hYCBP5Zwmog3UTEjRbP8ePck&disp=emb&realattid=ii_lcrsaqta0&zw" alt="image.png" width="542" height="66"><br></div><div><br></div><div>Thank you,</div><div>Ian</div></div>
</blockquote></div></div></div>