system outage

Mon Jan 25 18:04:35 EST 2016

Predrag Punosevac <predragp at cs.cmu.edu> wrote:

> Predrag Punosevac <predragp at cs.cmu.edu> wrote:
> 
> > Dear Autonians,
> > 
> > One of us has done something nasty to our network file systems, which
> > has caused a massive outage in the Lab. I am working to restore
> > services. Please stay tuned.
> > 
> > Predrag
> > 
> > P.S. The person who did this will be hunted down and will have to
> > give at least 5 seminar talks before the end of the year.
> 
> 
> I have a little more info about this. The outage was caused by a power
> failure in one of our racks. I am working on this right now. I can't give
> an estimate of how long it will take to restore the services. File
> servers are affected!
> 
> Predrag

Ok Folks,

I was able to partially restore power in A1-2C. This is the most
important server rack, as it hosts the core network infrastructure
servers, the file servers (GAIA, Neill-ZFS, Uranus), the virtual host
Athena, and the following computing nodes: GPU1, GPU2, ari, foxconn,
low1, lov3, lov4, lot1.

Here is the summary.

All core network servers, Athena, and Uranus are safe, fully operational,
and connected to their own 120V PDU.

File servers GAIA and Neill-ZFS are safe, fully operational, and
connected to their own 120V PDU.

GPU1 and GPU2 are safe, fully operational, and connected to their own
208V PDU.

I have shut down the following computing nodes on an emergency basis:
ari, foxconn, lov3, lov4, low1, and lot1. I am afraid to add them to
any of the above-mentioned PDUs. The good news is that I have space for
an extra power supply in this rack, so the best and easiest solution
would be to add another PDU/UPS and safely connect these computing nodes
to a separate power supply. They will remain down at least until
tomorrow morning while I consult with Artur about the future course of
action.

Best,
Predrag