[auton-users] Auton System Status

Michael J. Baysek mjbaysek at cs.cmu.edu
Wed Aug 4 22:45:52 EDT 2010


The system has been restored.  You must reboot your /auton desktop
before you will be able to work.

There are a few other things you may care to know about the outage that
ran from this afternoon until now:

** The main file server (LYRE) that served /auton storage failed in a 
way that required it to be pulled offline.  At the moment, the problem 
appears to be related to a strange interplay between the brand of disks 
and the storage controller.  The instability seems to be hit or miss, 
depending on some unknown variable during boot.  A good boot seems to
run stably, but about 8 boots out of 10 are bad.  For the last 6 months
we have been riding on a 'good boot', but all of the reboots since
Friday's crash have been 'bad reboots'.

** Today's outage was in no way related to yesterday evening's outage.
It's more like a step brother to the outages on Friday and Monday.  The
outage on Monday was planned, and its intent was to prevent outages like
today's, which it failed to do.

** There have been numerous minor failures of this type since Friday.  
Today, the failure forcibly unmounted /auton from the live server, 
outside of my control.

** We cannot continue with that server until the root cause of the
problem is fixed, and it has not been fixed yet.

Solution:

** LOT2 has been commandeered to be used as the temporary file server.  
LOT2 is therefore unavailable for compute jobs.

** Only a partial copy of the /auton space has been made to the
temporary server.  Due to the controller malfunction in the other
server, I am not comfortable copying data while the server is
unattended, so I will start copying the rest of the data tomorrow when
I arrive at the office.  All /auton data is intact; it's just not all
online yet.

** I have already copied the /auton/home/* directories for all active
accounts to the temporary server.  If I missed your account, please let
me know.

** If you need access to data that I have not yet copied, please email
me the approximate location of that data so it can be added to the
'priority list' for copying.

I hope that this is the last time I have to write to you all, for a 
while, anyway.

Mike


