CPU servers lo3, lo4, low1, ari, Foxconn down

Predrag Punosevac predragp at andrew.cmu.edu
Mon Jun 22 15:42:12 EDT 2020


I have really good news. It appears that a fuse on one of the UPSs had
blown. The server room attendant was able to reset it following my
directions over the phone. I was able to use IPMI to power up 4 out
of 5 machines. LOW1 appears to be down due to faulty hardware. In
my experience working with very old machines like LOW1, the most likely
culprit is a dead RAM module. I do have some old yet good RAM modules
to fix LOW1, but that will have to wait. Before COVID-19 struck, I was
in the process of moving all CPU nodes off the UPSs, which are no
longer capable of backing power-hungry computing nodes.

There is a design consensus among people who know the server room
electric grid that, going forward, all computing nodes are to be
considered stateless and designed to crash before they pull down
mission-critical gear like file servers, firewalls, and web servers
with them. All our GPU nodes are already in compliance, but the CPU
nodes are not.

Cheers,
Predrag

On Mon, Jun 22, 2020 at 1:13 PM Predrag Punosevac
<predragp at andrew.cmu.edu> wrote:
>
> On Mon, Jun 22, 2020 at 12:03 PM Arundhati Banerjee
> <arundhat at andrew.cmu.edu> wrote:
> >
> > Hi Predrag,
> >
> > I just wanted to bring to your attention that some of the CPU servers are not
> > accessible at the moment. I actually had some code running on Foxconn which I
> > am unable to access now. I would be obliged if you could kindly look into it.
> >
>
> I saw it on Monit when I logged in this morning. This appears to be a
> major problem with the electric supply. The one thing common to all
> five of these servers is that they are connected to the same dumb PDU
> (power distribution unit), which in turn is connected to the same old
> 208V UPS. It was planned before COVID-19 to remove all computing nodes
> from UPSs.
>
> I am trying to talk to David, who is a server room attendant and the
> person physically located in Wean 3611. However, I would not hold my
> breath on this one. If the circuits are messed up, that would require
> Ed Walter, me, and perhaps external contractors in the server room.
> That can take a long time.
> A nuclear option is moving those 5 servers to GHC. I am not sure that
> I would have enough electricity there. In either case, we have a major
> problem on our hands.
>
> In my almost 8 years with the lab I have not seen such a catastrophic
> failure of power circuits.
>
> Best,
> Predrag
>
>
>
>
>
> > Thank you.
> > Best regards,
> > Arundhati

