GPUs 1-9 offline

Predrag Punosevac predragp at andrew.cmu.edu
Fri Dec 2 19:54:24 EST 2022


Fixed! All computing nodes are available.

Predrag

On Fri, Dec 2, 2022 at 6:31 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Sorry for the delayed update. There is nothing wrong with the electricity.
> The network switch is busted! I have two spare switches in the storage room.
> I am replacing it right now. Hopefully one of them will be good.
>
> Best,
> Predrag
>
> On Thu, Dec 1, 2022 at 11:38 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> The PDU's master cable plug (IEC 60309 60 A 3P + PE) is still plugged into
>> the floor outlet. Somebody was either messing with the main electric
>> switchboard or we have a catastrophic failure of the PDU.
>>
>> I emailed the director of the CS computing facility, Ed Walter. I would be
>> very surprised if any major electrical work (switchboard) was done without
>> me knowing about it. CMU doesn't have people who are licensed to do
>> that kind of work. We hire an external crew, and such work is scheduled
>> months in advance.
>>
>> I will inspect the cables and the unit tomorrow after I hear back from
>> Ed. It looks like a replacement PDU is close to $4000. We used to buy
>> them for about $1800.
>>
>> Best,
>> Predrag
>>
>> On Thu, Dec 1, 2022 at 11:14 PM Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> Hi Conor,
>>>
>>> I just noticed it myself. It is not just GPUs 1-9; it is also Denver. The
>>> common thing for all those 10 servers is that they draw electricity from
>>> the same metered 17.3 kW PDU. Sure enough, IPMI is off as well, which
>>> confirms that there is no electric power in that server rack. Somebody cut
>>> the electricity to rack A1-2A or the PDU had a catastrophic failure. I am
>>> now calling the server room to have them physically inspect the rack.
>>>
>>> Best,
>>> Predrag
>>>
>>> On Thu, Dec 1, 2022 at 6:37 PM Conor Igoe <cigoe at cs.cmu.edu> wrote:
>>>
>>>> Predrag,
>>>>
>>>> Sorry to bother you, but I was wondering if you knew why GPUs 1-9 are
>>>> offline since earlier today?
>>>>
>>>> Best,
>>>> *Conor*
>>>>
>>>


More information about the Autonlab-users mailing list