GPU Machines Rebooting

Predrag Punosevac predragp at andrew.cmu.edu
Wed Jan 11 10:41:15 EST 2023


Hi Ian,

Interesting. I had not noticed, but the monitoring switches are connected to
the same PDUs. This makes me think that this has something to do with a
power outage. Namely, GPU nodes, for obvious reasons, are not on a UPS.
Either the PDUs' capacity was exceeded or power to certain outlets was cut.
I will look into it. The server room was too hot the other day when I did
the work. They are doing something, but I couldn't see what. Are you sure
that the affected machines are from 2 different racks? GPU1-9 + Denver is
one rack.
GPU 5 and GPU 7 are not even part of the GPU pool.

Predrag

On Wed, Jan 11, 2023, 9:56 AM Ian Char <ichar at andrew.cmu.edu> wrote:

> Hey Predrag,
>
> Hope you had a happy new year and are doing well!
>
> It seems that both today and yesterday morning many GPU machines were
> rebooted at the exact same time (see screenshot below). As far as I can
> tell, this happened for GPUs 1-15. Do you have any insight into why this
> might be happening?
>
> [image: image.png]
>
> Thank you,
> Ian
>