GPU Machines Rebooting

Predrag Punosevac predragp at andrew.cmu.edu
Wed Jan 11 20:27:11 EST 2023


Dear Autonians,

The dead autofs and sssd daemons on the GPU machines, which are causing the
login troubles, appear to be due to this electrical instability. I owe a big
apology to Ifi. Not that she shouldn't debug those scripts, but that is another
story :-) I am really taken aback by the electric grid problems. I
haven't seen anything similar in 10 years.
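
For anyone who wants to confirm whether the two daemons are down on a given
node, here is a minimal sketch. It assumes the node runs systemd and that the
units are named "autofs" and "sssd"; neither is stated in this thread. The
function names are hypothetical and are not part of any Auton Lab tooling.

```python
import subprocess
from typing import Callable, Dict, Iterable

# Assumption: systemd unit names; adjust if the nodes use different names.
SERVICES = ("autofs", "sssd")


def systemctl_state(unit: str) -> str:
    """Return the state word printed by `systemctl is-active <unit>`."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit just means "not active"
        )
    except FileNotFoundError:
        return "unknown"  # systemctl is not installed on this host
    return result.stdout.strip() or "unknown"


def dead_services(
    units: Iterable[str] = SERVICES,
    probe: Callable[[str], str] = systemctl_state,
) -> Dict[str, str]:
    """Map each unit that is not 'active' to the state that was reported."""
    states = {unit: probe(unit) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}
```

Calling dead_services() on a node returns an empty dict when both units are
active; otherwise it lists each unit with the state systemctl reported (for
example "failed" or "inactive"). The probe argument exists only so the logic
can be exercised without touching a real machine.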

Predrag

On Wed, Jan 11, 2023 at 8:15 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> It appears that all machines which are not UPSed (all GPU nodes) were
> rebooted 14 hours and 27 minutes ago. The only explanation is electricity. I
> will talk tomorrow morning to the guys who are supposed to monitor Wean
> Hall 3611.
>
> Predrag
>
> On Wed, Jan 11, 2023 at 11:20 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>
>> Hey Predrag,
>>
>> Thanks for your help on this. I am not sure about gpu5 or gpu7; they may
>> have been unaffected. Besides those machines, I just confirmed that gpus
>> 1-13 have the same output as the attached screenshot. Interestingly, I just
>> looked at some of the other gpus, and they also logged some activity this
>> morning only. However, they say "still running" (see screenshot for gpu23)
>> and it seems like the jobs on them may have been unaffected. I am not
>> familiar with what this means. Does this suggest some sort of power outage?
>>
>> [image: image.png]
>>
>> Thanks,
>> Ian
>>
>> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> Hi Ian,
>>>
>>> Interesting, I had not noticed, but the monitoring switches are connected
>>> to the same PDUs. This makes me think that this has something to do
>>> with a power outage. Namely, GPU nodes for obvious reasons are not UPSed.
>>> Either the PDUs' capacity was exceeded or power on certain outlets was cut.
>>> I will look into it. The server room was too hot the other day when I did
>>> the work. They are doing something but I couldn't see what. Are you sure
>>> that the affected machines are from 2 different racks? GPU1-9 + Denver is
>>> one rack. GPU 5 and GPU 7 are not even part of the GPU pool.
>>>
>>> Predrag
>>>
>>> On Wed, Jan 11, 2023, 9:56 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>>>
>>>> Hey Predrag,
>>>>
>>>> Hope you had a happy new year and are doing well!
>>>>
>>>> It seems that both today and yesterday morning many GPU machines were
>>>> rebooted at the exact same time (see screenshot below). As far as I can
>>>> tell this happened for GPUs 1-15. Do you have any insights why this might
>>>> be happening?
>>>>
>>>> [image: image.png]
>>>>
>>>> Thank you,
>>>> Ian
>>>>
>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 16867 bytes
Desc: not available
URL: <http://mailman.srv.cs.cmu.edu/pipermail/autonlab-users/attachments/20230111/06ac8498/attachment-0001.png>

