From predragp at andrew.cmu.edu Thu Mar 12 21:01:33 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Thu, 12 Mar 2015 21:01:33 -0400
Subject: Observium issues
Message-ID: 

According to Observium, six computing nodes appear to be down. That is not true, as you can see on http://monit.autonlab.org:8080. username: auton password: Dr.Who (the same credentials work for Observium at https://monit.autonlab.org).

The only server which is really down is the backup file server Lyre, which I brought down today around 6:30 PM to replace a failed HDD. I never put it back online, since the people in the machine room were out for dinner (they are not supposed to be out for more than 10 minutes). The operation is postponed until tomorrow.

However, something is happening to the computing nodes. I restarted the SNMP daemons, and as a result all the missing computing nodes are coming back online according to Observium. I restarted the LDAP daemon on Atlas just to make sure that stale LDAP connections are cleared. In spite of that, if you try right now to log into one of those computing nodes, let's say low1, the login will just hang. I can log in with the root account on low1, and I can also log into any of the Neill machines with my account (note that my home directory is not mounted on the Neill machines). This means that something is wrong with the NFS mounts of your home directories, which reside on GAIA. Has anybody done any massive writing over NFS?

I am not planning to reboot GAIA until I am physically present on campus. If people need access to CVS they should use shell.autonlab.org; your home directories are local there. This is by design, to enable you to have partial access to the lab in situations like this.
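A hung login of this kind can usually be confirmed from the client side without risking a wedged shell. A minimal probe, assuming standard coreutils and the /zfsauton/home path named in this thread, might look like:

```shell
# Probe the NFS-backed home tree with a timeout so that the check
# itself cannot hang the way a normal login does when the mount is wedged.
# /zfsauton/home is the path discussed in this thread; adjust per node.
if ! timeout 5 ls /zfsauton/home >/dev/null 2>&1; then
    echo "NFS home directories on GAIA are not responding"
fi
```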
Predrag

From predragp at andrew.cmu.edu Thu Mar 12 21:44:18 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Thu, 12 Mar 2015 21:44:18 -0400
Subject: Observium follow-up
Message-ID: <122ee8305570011dc8c14ec2412ef342.squirrel@webmail.andrew.cmu.edu>

I am almost 100% sure that the "file server problem" is caused by data science experiments on low1 and lov3. Both servers are using 100% of their swap, which means that they might crash at any moment.

Predrag

From predragp at andrew.cmu.edu Fri Mar 13 13:59:33 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Fri, 13 Mar 2015 13:59:33 -0400
Subject: lov3 & low1 + MATLAB info
Message-ID: <88d11b9ba20943c723a8e9a4ef8df121.squirrel@webmail.andrew.cmu.edu>

Dear Autonians,

lov3 and low1 are completely unusable at this point. Effective immediately, I will try to reboot both computing nodes to clear the locked file systems. Hopefully this will work. Since the nodes are already down, I will take this opportunity to update MATLAB to R2015a. Note that the MATLAB license manager will not be started until the update is done.

Predrag

From predragp at andrew.cmu.edu Fri Mar 13 14:34:30 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Fri, 13 Mar 2015 14:34:30 -0400
Subject: lov3 & low1 + MATLAB info
Message-ID: <474989a225485135f24cb0ab619a11a4.squirrel@webmail.andrew.cmu.edu>

lov3 and low1 were locked by careless writing to home directories. Both machines are now accessible to regular users. MATLAB will remain down a couple of hours longer.

Predrag

From predragp at andrew.cmu.edu Sat Mar 14 18:58:21 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Sat, 14 Mar 2015 18:58:21 -0400
Subject: misc issues
Message-ID: <07c00d6c240dd21b5306910b0139756d.squirrel@webmail.andrew.cmu.edu>

Dear Autonians,

I just updated MATLAB to R2015a on the three computing nodes which were not busy: lov3, low1, and charity.
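The swap exhaustion that took low1 and lov3 down can be spotted before a node locks up. A quick triage sketch, assuming standard Linux procps tools on the computing nodes:

```shell
# Report swap pressure and the top memory consumers -- the kind of
# check that would have flagged the runaway experiments on low1/lov3
# before they exhausted 100% of swap.
free -m | awk '/^Swap:/ { printf "swap used: %d of %d MB\n", $3, $2 }'
ps -eo pid,user,rss,comm --sort=-rss | head -n 6
```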
LOV3 and LOW1 appear to be in good shape now, but I still see that the network is saturated by careless writing over NFS. At this point I am not even sure whether it is due to writing on /zfsauton or /zfsneill, as I see a bunch of MATLAB and other scripts running across the servers of both groups. If you started a job on Thursday afternoon or Friday, please contact me by PM so that we can get to the bottom of this. Hopping from computing node to computing node and crashing them with MATLAB scripts or programs has a serious impact on the productivity of the entire lab. I would be happy to sit with you on Monday and figure out what is wrong with the programs (from a systems point of view) and why they are crashing servers.

Predrag

From predragp at andrew.cmu.edu Sat Mar 14 23:19:20 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Sat, 14 Mar 2015 23:19:20 -0400
Subject: network/NFS issues part II
Message-ID: <505e172d4de8fe104c459d56d7a2efe7.squirrel@webmail.andrew.cmu.edu>

Dear Autonians,

After further troubleshooting of the network/NFS problem, I pinpointed the issue to the main file server GAIA. To make matters more complicated, I realized that /zfsauton/project and /zfsauton/data are mounted on the Neill group computing nodes for the Highmark project. One should not be writing any data to /zfsauton/data (it is mounted rw, so writing is possible), which leaves us with a wedged /zfsauton/project, or writing to your /zfsauton/home directory (not relevant for members of the Neill group), as the cause of all the trouble. Careful inspection of the collectd disk-write data for the GAIA HDDs reveals lots of tiny disk writes (the speed never exceeds 70 kB/s), which is pathetic (for reference, the current speed on Neill-ZFS is 800 kB/s per disk). Anyhow, at this point I am confident that members of the Neill group are not seriously affected unless they work on Highmark data.
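The tiny-write pattern described above can also be confirmed without collectd by sampling the kernel's disk counters directly. A sketch for a Linux host, assuming /proc/diskstats is available (field 10 is sectors written, 512 bytes each):

```shell
# Sample /proc/diskstats twice, two seconds apart, and report the
# per-device write rate. Rates persistently in the tens of kB/s under
# heavy load are the "lots of tiny writes" signature described above.
cat /proc/diskstats > /tmp/diskstats.before
sleep 2
awk 'NR==FNR { before[$3] = $10; next }
     { rate = ($10 - before[$3]) * 512 / 2
       if (rate > 0) printf "%-10s %.0f B/s written\n", $3, rate }' \
    /tmp/diskstats.before /proc/diskstats
```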
I would be happy to temporarily unmount the /zfsauton/project and /zfsauton/data directories from Neill[1-4] to make sure those four computing nodes are 100% OK. The rest of the lab might experience various issues. Your safest bet for logging into the lab is shell.autonlab.org, since NFS shares are not mounted there. If the issues persist, I will be forced to reboot the main file server (current uptime: 267 days).

Predrag

From predragp at andrew.cmu.edu Mon Mar 16 15:14:14 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Mon, 16 Mar 2015 15:14:14 -0400
Subject: Main file server scheduled to reboot
Message-ID: 

Dear Autonians,

I was hoping to avoid this situation, but it looks like that after close to 320 days of uptime the main file server will have to be rebooted to clear stale NFS file handles. It is also possible that a few computing nodes will have to be rebooted as well. Tentatively, I am planning to do this tomorrow, March 17th, around 10:30 AM, if the ZFS monthly scrub has completed. The members of the Neill group will not be affected, but Highmark data will be unavailable for at least a couple of hours.

Predrag

From predragp at andrew.cmu.edu Tue Mar 17 13:07:37 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Tue, 17 Mar 2015 13:07:37 -0400
Subject: File server rebooted
Message-ID: <3a9b690705f2095fbe4d57d77ab52f7e.squirrel@webmail.andrew.cmu.edu>

The main file server was rebooted. The stale NFS file handles all appear to be cleared up, and the lab appears to be 100% functional at this moment. I even did a minor update of the OS on the main file server to fix an rsync-related bug.

Predrag

From predragp at andrew.cmu.edu Tue Mar 17 15:19:44 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Tue, 17 Mar 2015 15:19:44 -0400
Subject: MEMEX important backup info
Message-ID: <14e25b8480f989841108dfd4a83d1ee2.squirrel@webmail.andrew.cmu.edu>

This message is only relevant for members of the MEMEX project who are also using Auton Lab desktops.
Note that the VPN connection you have to the MEMEX server affects the backup of your desktops. Namely, the MEMEX VPN overrides your DNS settings, and to the backup server your machine appears to be coming from a hostile network. If you want your desktop machine to be backed up regularly, make sure you switch off the VPN between 6:00 PM and 7:30 PM.

Predrag

From boecking at andrew.cmu.edu Wed Mar 25 09:24:20 2015
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Wed, 25 Mar 2015 09:24:20 -0400
Subject: lov3 disc space
Message-ID: 

Hello everyone,

We are quickly running out of space in the /home/scratch/ directory on lov3; it is 99% full. I would appreciate it if everyone could make an effort to delete unnecessary files from their folders, save them locally on their own machines, or at least compress very large files. I appreciate your cooperation!

Best,
Ben
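For anyone cleaning up, the biggest space consumers under /home/scratch can be listed before deciding what to delete or compress. A minimal sketch, assuming GNU coreutils; the path is from this thread and the filename below is purely illustrative:

```shell
# List the ten largest entries under /home/scratch, human-readable
# (sort -h pairs with the -h sizes that du emits).
du -sh /home/scratch/* 2>/dev/null | sort -rh | head -n 10

# Compress a very large file in place instead of deleting it.
# The filename here is purely illustrative.
f="/home/scratch/$USER/big_results.csv"
if [ -f "$f" ]; then gzip -9 "$f"; fi
```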