GPU Machines Rebooting

Thu Jan 12 15:45:24 EST 2023

I talked earlier today with Dave from CS CMU operations. Apparently this
was a scheduled power outage. I am supposed to receive emails when those
things happen, but I didn't :-(

Best,
Predrag

On Wed, Jan 11, 2023 at 8:27 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Dear Autonians,
>
> The dead autofs and sssd daemons on the GPU machines, which are causing
> login troubles, appear to be due to this electrical instability. I owe a
> big apology to Ifi. Not that she shouldn't debug those scripts, but that is
> another story :-) I am really taken aback by the electric grid problems. I
> haven't seen anything similar in 10 years.
>
> Predrag
>
> On Wed, Jan 11, 2023 at 8:15 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> It appears that all machines which are not UPSed (all GPU nodes) were
>> rebooted 14 hours and 27 minutes ago. The only explanation is electricity.
>> I will talk tomorrow morning to the guys who are supposed to monitor Wean
>> Hall 3611.
>>
>> Predrag
>>
>> On Wed, Jan 11, 2023 at 11:20 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>>
>>> Hey Predrag,
>>>
>>> Thanks for your help on this. I am not sure about gpu5 or gpu7; they may
>>> have been unaffected. Besides those machines, I just confirmed that gpus
>>> 1-13 have the same output as the attached screenshot. Interestingly, I just
>>> looked at some of the other gpus, and they also logged some activity this
>>> morning only. However, they say "still running" (see screenshot for gpu23)
>>> and it seems like the jobs on them may have been unaffected. I am not
>>> familiar with what this means. Does this suggest some sort of power outage?
>>>
>>> [image: image.png]
>>>
>>> Thanks,
>>> Ian
>>>
>>> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac <
>>> predragp at andrew.cmu.edu> wrote:
>>>
>>>> Hi Ian,
>>>>
>>>> Interesting. I had not noticed, but the monitoring switches are
>>>> connected to the same PDUs. This makes me think that this has
>>>> something to do with a power outage. Namely, GPU nodes, for obvious
>>>> reasons, are not UPSed. Either the PDUs' capacity was exceeded or power
>>>> to certain outlets was cut. I will look into it. The server room was too
>>>> hot the other day when I did the work. They are doing something, but I
>>>> couldn't see what. Are you sure that the affected machines are from 2
>>>> different racks? GPU1-9 + Denver is one rack. GPU 5 and GPU 7 are not
>>>> even part of the GPU pool.
>>>>
>>>> Predrag
>>>>
>>>> On Wed, Jan 11, 2023, 9:56 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>>>>
>>>>> Hey Predrag,
>>>>>
>>>>> Hope you had a happy new year and are doing well!
>>>>>
>>>>> It seems that both today and yesterday morning many GPU machines were
>>>>> rebooted at the exact same time (see screenshot below). As far as I can
>>>>> tell, this happened for GPUs 1-15. Do you have any insight into why
>>>>> this might be happening?
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> Thank you,
>>>>> Ian
>>>>>
>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 16867 bytes
Desc: not available
URL: <http://mailman.srv.cs.cmu.edu/pipermail/autonlab-users/attachments/20230112/580b9ab8/attachment-0001.png>