GPU Machines Rebooting
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Jan 11 20:15:03 EST 2023
It appears that all machines that are not on a UPS (all GPU nodes) were
rebooted 14 hours and 27 minutes ago. The only explanation is a power
interruption. I will talk tomorrow morning to the guys who are supposed to
monitor Wean Hall 3611.
Predrag
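
[Editorial sketch, not part of the original message.] The uptime reasoning
above can be illustrated with a few lines of Python: subtract each node's
uptime from the current time to get its boot time, then group the nodes whose
boot times fall within a small window of each other. The helper names, the
5-minute tolerance, and the per-host uptime values in the usage note are
hypothetical; only the 14 h 27 min figure and the message timestamp come from
the email.

```python
from datetime import datetime, timedelta


def boot_time(now, uptime):
    """Return the moment a machine booted, given its current uptime."""
    return now - uptime


def rebooted_together(uptimes, now, tolerance=timedelta(minutes=5)):
    """Group hosts whose boot times are within `tolerance` of a neighbour.

    `uptimes` maps a host name to its uptime as a timedelta.
    Only groups with more than one host are returned.
    """
    boots = sorted(
        ((host, boot_time(now, up)) for host, up in uptimes.items()),
        key=lambda pair: pair[1],
    )
    groups = []
    for host, booted in boots:
        # Extend the current group if this boot is close to the previous one.
        if groups and booted - groups[-1][-1][1] <= tolerance:
            groups[-1].append((host, booted))
        else:
            groups.append([(host, booted)])
    return [[host for host, _ in group] for group in groups if len(group) > 1]
```

For example, taking the message timestamp (Jan 11, 2023, 20:15:03) as "now",
`boot_time(now, timedelta(hours=14, minutes=27))` gives 05:48:03 that morning.
Feeding `rebooted_together` a dictionary of made-up uptimes, in which two
nodes are up for about 14.5 hours and one for 30 days, returns a single group
containing the two recently booted nodes.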
On Wed, Jan 11, 2023 at 11:20 AM Ian Char <ichar at andrew.cmu.edu> wrote:
> Hey Predrag,
>
> Thanks for your help on this. I am not sure about gpu5 or gpu7; they may
> have been unaffected. Besides those machines, I just confirmed that gpus
> 1-13 have the same output as the attached screenshot. Interestingly, I just
> looked at some of the other gpus, and they also logged some activity this
> morning only. However, they say "still running" (see screenshot for gpu23)
> and it seems like the jobs on them may have been unaffected. I am not
> familiar with what this means. Does this suggest some sort of power outage?
>
> [image: image.png]
>
> Thanks,
> Ian
>
> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac <
> predragp at andrew.cmu.edu> wrote:
>
>> Hi Ian,
>>
>> Interesting. I have not noticed, but the monitoring switches are connected
>> to the same PDUs. This makes me think that this has something to do
>> with a power outage. Namely, GPU nodes, for obvious reasons, are not on a
>> UPS. Either the PDUs' capacity was exceeded or power to certain outlets was
>> cut. I will look into it. The server room was too hot the other day when I
>> did the work. They are doing something, but I couldn't see what. Are you
>> sure that the affected machines are from 2 different racks? GPU1-9 + Denver
>> is one rack. GPU 5 and GPU 7 are not even part of the GPU pool.
>>
>> Predrag
>>
>> On Wed, Jan 11, 2023, 9:56 AM Ian Char <ichar at andrew.cmu.edu> wrote:
>>
>>> Hey Predrag,
>>>
>>> Hope you had a happy new year and are doing well!
>>>
>>> It seems that both today and yesterday morning many GPU machines were
>>> rebooted at the exact same time (see screenshot below). As far as I can
>>> tell, this happened for GPUs 1-15. Do you have any insight into why this
>>> might be happening?
>>>
>>> [image: image.png]
>>>
>>> Thank you,
>>> Ian
>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 16867 bytes
Desc: not available
URL: <http://mailman.srv.cs.cmu.edu/pipermail/autonlab-users/attachments/20230111/1c172d36/attachment-0001.png>