Main file server trashed!
Predrag Punosevac
predragp at andrew.cmu.edu
Thu Jul 28 17:32:35 EDT 2022
Hi Autonians,
Over the past day and a half I became aware of many issues with our
cluster. People can't log in because their ssh public keys are not loaded,
Carla is broken, ipython is broken, TensorFlow is broken, etc.
All of these problems have one thing in common: NFS. People can't log into
the system because their ssh keys cannot be read. ipython stores its config
files in an SQLite database, which most of you don't bother to put into the
scratch directory. Those databases are now corrupted. The same goes for
Carla. TensorFlow has similar issues if left on NFS (that is why we have
local scratch directories).
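
For those who want to check where their ipython files actually live, here is
a minimal Python sketch. It assumes a Linux node where /proc/mounts is
readable. The function names, the scratch root, and the user name are made up
for illustration and are not the lab's actual layout. IPYTHONDIR is the
environment variable ipython itself consults for its directory.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that holds `path`.

    `mounts_text` is the content of /proc/mounts: one mount per line,
    whitespace-separated fields (device, mount point, fstype, ...).
    Mount points containing spaces (escaped as \\040) are not handled.
    """
    target = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if target == mount_point or target.startswith(prefix):
            # Prefer the longest matching mount point; on a tie the
            # later entry wins, as it was mounted on top.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path, mounts_text):
    """True if `path` sits on an NFS mount (fstype nfs, nfs4, ...)."""
    fs_type = fs_type_for(path, mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")


def ipython_env(scratch_root, user):
    """Build the environment override that moves ipython off NFS.

    `scratch_root` and `user` are hypothetical placeholders.
    """
    return {"IPYTHONDIR": os.path.join(scratch_root, user, "ipython")}
```

To use it, read /proc/mounts into a string, call
is_on_nfs(os.path.expanduser("~/.ipython"), text), and if it returns True,
export the IPYTHONDIR value from ipython_env(...) in your shell before
starting ipython.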
I don't want to name names, but some of you put a very high read/write
load onto the file server hosting your home directories. The server is very
powerful and the network is 10Gb, but it is not unbreakable.
I have been with the lab long enough to know how to sandbox home
directories. Unlike the data and project folders, which will have to be
rebuilt due to the misuse of public space, this time around the damage is
limited to an NFS daemon crash.
Those of you who have been doing this heavy reading and writing can try to
stop your processes (they might be zombie processes by now). Otherwise, I
will have to reboot not only the main file server but also all computing
nodes and workstations (to clear stale NFS file handles). That will not
happen today. It will also not happen before I take note of the offenders
and suspend their user accounts.
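
If you are not sure whether you still have processes hanging on a node, a
rough Python sketch like the one below can list them. It assumes a Linux
/proc filesystem, and the function names are illustrative only. It flags
state D (uninterruptible sleep, typical of a process blocked on NFS I/O) and
state Z (zombie). Keep in mind that a zombie cannot be killed directly; it
goes away only when its parent reaps it or exits.

```python
import os


def parse_status(text):
    """Extract (name, one-letter state, real UID) from /proc/<pid>/status text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    state = info.get("State", "?")[:1]      # e.g. "D (disk sleep)" -> "D"
    uid_field = info.get("Uid", "").split()  # real, effective, saved, fs
    uid = int(uid_field[0]) if uid_field else None
    return info.get("Name", "?"), state, uid


def flagged_processes(uid, proc_root="/proc"):
    """List (pid, name, state) for processes of `uid` in state D or Z."""
    results = []
    if not os.path.isdir(proc_root):
        return results
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, state, owner = parse_status(handle.read())
        except OSError:
            continue  # process exited while we were scanning
        if owner == uid and state in ("D", "Z"):
            results.append((int(entry), name, state))
    return sorted(results)
```

Calling flagged_processes(os.getuid()) on a node returns your own stuck or
zombie processes, if any.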
I am a bit hesitant to give a link to the new Wiki as I am not sure if all
of you have access:
https://drive.google.com/drive/folders/1ho7t48EES4AUYeBsMSh7BcpFCwGh1t_n
FYI the website and the Wiki were migrated last Friday to the new
location. However, I was only involved in the demolition part so I don't
have much info beyond the above link.
Most Kind Regards,
Predrag Punosevac