GPUs 1-9 offline
Predrag Punosevac
predragp at andrew.cmu.edu
Fri Dec 2 18:31:11 EST 2022
Sorry for the delayed update. There is nothing wrong with the electricity. The
network switch is busted! I have two spare switches in the storage room. I
am replacing it right now. Hopefully one of them will be good.
Best,
Predrag
On Thu, Dec 1, 2022 at 11:38 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> The PDU's master cable plug (IEC 60309 60 A 3P + PE) is still plugged into
> the floor outlet. Either somebody was messing with the main electric
> switchboard or we have a catastrophic failure of the PDU.
>
> I emailed Ed Walter, the director of the CS computing facility. I would be
> very surprised if any major electrical work (switchboard) was done without
> my knowing about it. CMU doesn't have people who are licensed to do that
> kind of work. We hire an external crew, and such work is scheduled months
> in advance.
>
> I will inspect the cables and the unit tomorrow after I hear back from
> Ed. It looks like the replacement PDU is close to $4000. We used to buy
> them for about $1800.
>
> Best,
> Predrag
>
> On Thu, Dec 1, 2022 at 11:14 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Hi Conor,
>>
>> I just noticed it myself. It is not just GPUs 1-9; it is also Denver. The
>> common thing for all those 10 servers is that they draw electricity from
>> the same Metered 17.3 kW PDU. Sure enough, IPMI is off as well, which
>> confirms that there is no electric power in that server rack. Either
>> somebody cut the electricity to rack A1-2A or the PDU had a catastrophic
>> failure. I am now calling the server room to have them physically inspect
>> the rack.
>>
>> Best,
>> Predrag
>>
>> On Thu, Dec 1, 2022 at 6:37 PM Conor Igoe <cigoe at cs.cmu.edu> wrote:
>>
>>> Predrag,
>>>
>>> Sorry to bother you, but I was wondering if you knew why GPUs 1-9 have
>>> been offline since earlier today?
>>>
>>> Best,
>>> *Conor*
>>>
>>