GPU Machines Rebooting

Ian Char ichar at andrew.cmu.edu
Wed Jan 11 11:20:33 EST 2023


Hey Predrag,

Thanks for your help on this. I am not sure about gpu5 or gpu7; they may
have been unaffected. Apart from those machines, I just confirmed that gpus
1-13 show the same output as the attached screenshot. Interestingly, I
looked at some of the other gpus, and they also logged activity, but only
this morning. However, they say "still running" (see screenshot for gpu23),
and it seems the jobs on them may have been unaffected. I am not
familiar with what this means. Does this suggest some sort of power outage?

[image: image.png]

Thanks,
Ian

On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Hi Ian,
>
> Interesting. I had not noticed, but the monitoring switches are connected
> to the same PDUs. This makes me think that this has something to do
> with a power outage. Namely, the GPU nodes, for obvious reasons, are not on
> a UPS. Either the PDUs' capacity was exceeded or power to certain outlets
> was cut. I will look into it. The server room was too hot the other day
> when I did the work. They are doing something but I couldn't see what. Are you sure that
> affected machines are from 2 different racks? GPU1-9 + Denver is one rack.
> GPU 5 and GPU 7 are not even part of the GPU pool.
>
> Predrag
>
> On Wed, Jan 11, 2023, 9:56 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>
>> Hey Predrag,
>>
>> Hope you had a happy new year and are doing well!
>>
>> It seems that both today and yesterday morning many GPU machines were
>> rebooted at the exact same time (see screenshot below). As far as I can
>> tell, this happened for GPUs 1-15. Do you have any insight into why this
>> might be happening?
>>
>> [image: image.png]
>>
>> Thank you,
>> Ian
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 16867 bytes
Desc: not available
URL: <http://mailman.srv.cs.cmu.edu/pipermail/autonlab-users/attachments/20230111/55495636/attachment-0001.png>