Observium issues

predragp at andrew.cmu.edu predragp at andrew.cmu.edu
Thu Mar 12 21:01:33 EDT 2015


According to Observium, six computing nodes appear to be down. That is
not true, as you can see at http://monit.autonlab.org:8080.

username: auton
password: Dr.Who

(The same credentials work for Observium at https://monit.autonlab.org)

The only server that is really down is the backup file server Lyre, which I
brought down today around 6:30 PM to replace a failed HDD and never put
back on-line, since the people in the machine room were out for dinner (they
are not supposed to be out for more than 10 minutes). The operation is
postponed until tomorrow.

However, something is happening to the computing nodes. I restarted the SNMP
daemons, and as a result all of the missing computing nodes are coming back
on-line according to Observium. I also restarted the LDAP daemon on Atlas,
just to make sure that stale LDAP connections are cleared.

In spite of that, if you try right now to log into one of those computing
nodes, say low1, the login will just hang. I can log in with the root
account on low1, and I can also log into any of the Neill machines with my
account (note that my home directory is not mounted on the Neill machines).
This means that something is wrong with the NFS mounts of your home
directories, which reside on GAIA.
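
For anyone who wants to check this from a node without getting a shell
stuck, here is a minimal sketch, assuming Python 3 and the standard `stat`
command are available there. The function name `path_responds`, the
5-second timeout, and the example path `/home/someuser` are illustrative
assumptions, not anything configured in the lab. It only reports whether
the filesystem answers in time; it does not say why a mount is hung.

```python
import subprocess
import time


def path_responds(path, timeout=5.0):
    """Return True if `stat path` finishes within `timeout` seconds.

    A hung NFS mount typically makes stat block indefinitely, so the
    probe runs in a child process and this function never waits on it
    longer than `timeout`.
    """
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            # stat returned, successfully or with an error such as
            # "No such file": either way the filesystem answered.
            return True
        time.sleep(0.05)
    # Best effort only: a process blocked on a dead mount may not exit
    # right away, so we deliberately do not wait for it here.
    proc.kill()
    return False


if __name__ == "__main__":
    # Hypothetical path; substitute your own home directory.
    for candidate in ["/", "/home/someuser"]:
        state = "responds" if path_responds(candidate) else "HANGS"
        print(candidate, state)
```

Run as root on an affected node, a home directory that reports HANGS
while / responds would be consistent with a stuck NFS mount rather than
a problem with the node itself.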

Has anybody done any massive writing over NFS? I am not planning to reboot
GAIA until I am physically present on campus. If people need access to CVS,
they should use

shell.autonlab.org

Your home directories are local there. This is by design, so that you
still have partial access to the lab in situations like this.

Predrag


