[auton-users] Auton System Status
Michael J. Baysek
mjbaysek at cs.cmu.edu
Wed Aug 4 22:45:52 EDT 2010
The system has been restored again. You must reboot your /auton desktop
before you will be able to work.
A few facts you may care to know about the outage that ran from this
afternoon until now:
** The main file server (LYRE) that served /auton storage failed in a
way that required it to be pulled offline. At the moment, the problem
appears to be related to a strange interplay between the brand of disks
and the storage controller. The instability seems to be hit or miss,
depending on some unknown variable during boot. If you get a good boot,
it seems to run stably, but 8 boots out of 10 are not good. For the last
6 months, we have been riding on a 'good boot', but all of the reboots
since Friday's crash have been 'bad reboots'.
** The outage was actually in no way related to the outage from last
evening. It's more like a step brother to the outages on Friday and
Monday. The outage on Monday was planned, and its intent was to prevent
outages like today's, which it didn't.
** There have been numerous minor failures of this type since Friday.
Today, the failure forcibly unmounted /auton from the live server,
outside of my control.
** We cannot continue with that server until the root cause of the
problem is rectified, which it still is not.
Solution:
** LOT2 has been commandeered to be used as the temporary file server.
LOT2 is therefore unavailable for compute jobs.
** Only a partial copy of the /auton data has been made to the
temporary server. Due to the controller malfunction in the other
server, I am not comfortable copying data while the server is
unattended, so the rest of the data will start copying tomorrow when
I arrive at the office. All /auton data is intact; it's just not online
yet.
** I have already copied the /auton/home/* directories of all active
accounts to the temporary server. If I missed your account, please let
me know.
** If you need access to data that I have not yet copied, please email
me the approximate location of that data so it can be added to the
'priority list' for copying.
I hope that this is the last time I have to write to you all, for a
while, anyway.
Mike