GPU Machines Rebooting

Thu Jan 12 15:48:50 EST 2023

Good to know the root cause. Thanks for getting to the bottom of this,
Predrag!

On Thu, Jan 12, 2023 at 3:45 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> I talked earlier today with Dave from CS CMU operations. Apparently this
> was a scheduled power outage. I am supposed to receive emails when those
> things happen but I didn't :-(
>
> Best,
> Predrag
>
> On Wed, Jan 11, 2023 at 8:27 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Dear Autonians,
>>
>> The dead autofs and sssd daemons on the GPU machines, which are causing
>> login troubles, appear to be due to this electrical instability. I owe a big
>> apology to Ifi. Not that she shouldn't debug those scripts, but that is
>> another story :-) I am really taken aback by the electric grid problems. I
>> haven't seen anything similar in 10 years.
>>
>> Predrag
>>
>> On Wed, Jan 11, 2023 at 8:15 PM Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> It appears that all machines which are not UPSed (all GPU nodes) were
>>> rebooted 14 hours and 27 minutes ago. The only explanation is a power problem.
>>> I will talk tomorrow morning to the guys who are supposed to monitor Wean
>>> Hall 3611.
>>>
>>> Predrag
>>>
>>> On Wed, Jan 11, 2023 at 11:20 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>>>
>>>> Hey Predrag,
>>>>
>>>> Thanks for your help on this. I am not sure about gpu5 or gpu7; they
>>>> may have been unaffected. Besides those machines, I just confirmed that
>>>> gpus 1-13 have the same output as the attached screenshot. Interestingly, I
>>>> just looked at some of the other gpus, and they also logged some activity
>>>> this morning only. However, they say "still running" (see screenshot for
>>>> gpu23) and it seems like the jobs on them may have been unaffected. I am
>>>> not familiar with what this means. Does this suggest some sort of power
>>>> outage?
>>>>
>>>> [image: image.png]
>>>>
>>>> Thanks,
>>>> Ian
>>>>
>>>> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac <
>>>> predragp at andrew.cmu.edu> wrote:
>>>>
>>>>> Hi Ian,
>>>>>
>>>>> Interesting. I had not noticed, but the monitoring switches are
>>>>> connected to the same PDUs. This makes me think that this has
>>>>> something to do with a power outage. Namely, GPU nodes, for obvious reasons,
>>>>> are not UPSed. Either the PDUs' capacity was exceeded or power to certain
>>>>> outlets was cut. I will look into it. The server room was too hot the
>>>>> other day when I did the work. They are doing something but I couldn't see what.
>>>>> Are you sure that the affected machines are from 2 different racks? GPU1-9 +
>>>>> Denver is one rack. GPU 5 and GPU 7 are not even part of the GPU pool.
>>>>>
>>>>> Predrag
>>>>>
>>>>> On Wed, Jan 11, 2023, 9:56 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>>>>>
>>>>>> Hey Predrag,
>>>>>>
>>>>>> Hope you had a happy new year and are doing well!
>>>>>>
>>>>>> It seems that both today and yesterday morning many GPU machines were
>>>>>> rebooted at the exact same time (see screenshot below). As far as I can
>>>>>> tell this happened for GPUs 1-15. Do you have any insight into why this
>>>>>> might be happening?
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>> Thank you,
>>>>>> Ian
>>>>>>
>>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 16867 bytes
Desc: not available
URL: <http://mailman.srv.cs.cmu.edu/pipermail/autonlab-users/attachments/20230112/026a4b6f/attachment-0001.png>