GPUs 1-9 offline
Predrag Punosevac
predragp at andrew.cmu.edu
Thu Dec 1 23:38:30 EST 2022
The PDU's master cable plug (IEC 60309 60 A 3P + PE) is still plugged into the
floor outlet. Either somebody was messing with the main electrical
switchboard or we have a catastrophic failure of the PDU.
I emailed Ed Walter, the director of the CS computing facility. I would be very
surprised if any major electrical work (switchboard) was done without my
knowing about it. CMU doesn't have people who are licensed to do that
kind of work. We hire an external crew, and such work is scheduled months in
advance.
I will inspect the cables and the unit tomorrow after I hear back from Ed.
It looks like a replacement PDU costs close to $4000. We used to buy them
for about $1800.
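The reasoning in the quoted message below (every server on one PDU is dark, and so are their IPMI controllers) can be sketched in Python. This is only an illustration under stated assumptions: the host names, the `Probe` type, and `diagnose_pdu` are hypothetical, and the reachability values are filled in by hand rather than measured. It relies on the fact that a BMC normally runs on standby power, so it stays reachable when a host is merely shut down or crashed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Probe:
    """Reachability of one server (hypothetical record, filled in by hand)."""
    host_up: bool  # does the operating system answer (e.g. ping or ssh)?
    bmc_up: bool   # does the IPMI/BMC interface answer?


def diagnose_pdu(probes: Dict[str, Probe]) -> str:
    """Classify a group of servers that all draw power from the same PDU."""
    if not probes:
        return "no data"
    # A BMC runs on standby power, so if every host AND every BMC is dark,
    # the whole group has most likely lost power upstream.
    if all(not p.host_up and not p.bmc_up for p in probes.values()):
        return "rack power loss suspected"
    if all(p.host_up for p in probes.values()):
        return "all hosts up"
    return "mixed: inspect hosts individually"


if __name__ == "__main__":
    # Hypothetical names for the ten servers on the affected PDU.
    rack = {f"gpu{i}": Probe(host_up=False, bmc_up=False) for i in range(1, 10)}
    rack["denver"] = Probe(host_up=False, bmc_up=False)
    print(diagnose_pdu(rack))
```

The sketch cannot tell a dead PDU from a tripped breaker upstream; that still takes a physical inspection.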
Best,
Predrag
On Thu, Dec 1, 2022 at 11:14 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> Hi Conor,
>
> I just noticed it myself. It is not just GPUs 1-9; it is also Denver. The
> common thing for all those 10 servers is that they draw electricity from
> the same metered 17.3 kW PDU. Sure enough, IPMI is off as well, which
> confirms that there is no electric power in that server rack. Either
> somebody cut the electricity to rack A1-2A or the PDU had a catastrophic
> failure. I am now calling the server room to have them physically inspect
> the rack.
>
> Best,
> Predrag
>
> On Thu, Dec 1, 2022 at 6:37 PM Conor Igoe <cigoe at cs.cmu.edu> wrote:
>
>> Predrag,
>>
>> Sorry to bother you, but I was wondering if you knew why GPUs 1-9 are
>> offline since earlier today?
>>
>> Best,
>> *Conor*
>>
>