From predragp at andrew.cmu.edu Thu Mar 12 21:01:33 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Thu, 12 Mar 2015 21:01:33 -0400
Subject: Observium issues
Message-ID: 

According to Observium, six computing nodes appear to be down. That is not true, as you can see on http://monit.autonlab.org:8080. username: auton password: Dr.Who (the same credentials work for Observium at https://monit.autonlab.org).

The only server which is really down is the backup file server Lyre, which I brought down today around 6:30 PM to replace a failed HDD. I never put it back online, since the people in the machine room were out for dinner (they are not supposed to be out for more than 10 minutes). The operation is postponed until tomorrow.

However, something is happening to the computing nodes. I restarted the SNMP daemons, and as a result all the missing computing nodes are coming back online according to Observium. I restarted the LDAP daemon on Atlas just to make sure that stale LDAP connections are cleared. In spite of that, if you try right now to log into one of those computing nodes, let's say low1, the login will just hang. I can log in with the root account on low1, and I can also log into any of the Neill machines with my account (note that my home directory is not mounted on the Neill machines). This means that something is wrong with the NFS mounts of your home directories, which reside on GAIA. Has anybody done any massive writing over NFS?

I am not planning to reboot GAIA until I am physically present on campus. If people need access to CVS they should use shell.autonlab.org; your home directories are local there. This is by design, to enable you to have partial access to the lab in situations like this.
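A hung login of this kind can usually be confirmed from the client side without risking a wedged shell. A minimal probe, assuming standard coreutils and the /zfsauton/home path named in this thread, might look like:

```shell
# Probe the NFS-backed home tree with a timeout so that the check
# itself cannot hang the way a normal login does when the mount is wedged.
# /zfsauton/home is the path discussed in this thread; adjust per node.
if ! timeout 5 ls /zfsauton/home >/dev/null 2>&1; then
    echo "NFS home directories on GAIA are not responding"
fi
```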
Predrag

From predragp at andrew.cmu.edu Thu Mar 12 21:44:18 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Thu, 12 Mar 2015 21:44:18 -0400
Subject: Observium follow-up
Message-ID: <122ee8305570011dc8c14ec2412ef342.squirrel@webmail.andrew.cmu.edu>

I am almost 100% sure that the "file server problem" is caused by data science experiments on low1 and lov3. Both servers are using 100% of their swap, which means that they might crash at any moment.

Predrag

From predragp at andrew.cmu.edu Fri Mar 13 13:59:33 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Fri, 13 Mar 2015 13:59:33 -0400
Subject: lov3 & low1 + MATLAB info
Message-ID: <88d11b9ba20943c723a8e9a4ef8df121.squirrel@webmail.andrew.cmu.edu>

Dear Autonians,

lov3 and low1 are completely unusable at this point. Effective immediately, I will try to reboot both computing nodes to clear the locked file systems. Hopefully this will work. Since the nodes are already down, I will take this opportunity to update MATLAB to R2015a. Note that the MATLAB license manager will not be started until the update is done.

Predrag

From predragp at andrew.cmu.edu Fri Mar 13 14:34:30 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Fri, 13 Mar 2015 14:34:30 -0400
Subject: lov3 & low1 + MATLAB info
Message-ID: <474989a225485135f24cb0ab619a11a4.squirrel@webmail.andrew.cmu.edu>

lov3 and low1 were locked by careless writing to home directories. Both machines are now accessible to regular users. MATLAB will remain down a couple of hours longer.

Predrag

From predragp at andrew.cmu.edu Sat Mar 14 18:58:21 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Sat, 14 Mar 2015 18:58:21 -0400
Subject: misc issues
Message-ID: <07c00d6c240dd21b5306910b0139756d.squirrel@webmail.andrew.cmu.edu>

Dear Autonians,

I just updated MATLAB to R2015a on the three computing nodes which were not busy: lov3, low1, and charity.
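The swap exhaustion that took low1 and lov3 down can be spotted before a node locks up. A quick triage sketch, assuming standard Linux procps tools on the computing nodes:

```shell
# Report swap pressure and the top memory consumers -- the kind of
# check that would have flagged the runaway experiments on low1/lov3
# before they exhausted 100% of swap.
free -m | awk '/^Swap:/ { printf "swap used: %d of %d MB\n", $3, $2 }'
ps -eo pid,user,rss,comm --sort=-rss | head -n 6
```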
LOV3 and LOW1 appear to be in good shape now, but I still see that the network is saturated by careless writing over NFS. At this point I am not even sure whether it is due to writing on /zfsauton or /zfsneill, as I see a bunch of MATLAB and other scripts running across the servers of both groups. If you started a job on Thursday afternoon or Friday, please contact me by PM so that we can get to the bottom of this. Hopping from computing node to computing node and crashing them with MATLAB scripts or programs has a serious impact on the productivity of the entire lab. I would be happy to sit with you on Monday and figure out what is wrong with the programs (from a systems point of view) and why they are crashing servers.

Predrag

From predragp at andrew.cmu.edu Sat Mar 14 23:19:20 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Sat, 14 Mar 2015 23:19:20 -0400
Subject: network/NFS issues part II
Message-ID: <505e172d4de8fe104c459d56d7a2efe7.squirrel@webmail.andrew.cmu.edu>

Dear Autonians,

After further troubleshooting of the network/NFS problem, I pinpointed the issue to the main file server GAIA. To make matters more complicated, I realized that /zfsauton/project and /zfsauton/data are mounted on the Neill group computing nodes for the Highmark project. One should not be writing any data to /zfsauton/data (it is mounted rw, so writing is possible), which leaves us with a wedged /zfsauton/project, or writing to your /zfsauton/home directory (not relevant for members of the Neill group), as the cause of all the trouble. Careful inspection of the collectd disk-write data for the GAIA HDDs reveals lots of tiny disk writes (the speed never exceeds 70 kB/s), which is pathetic (for reference, the current speed on Neill-ZFS is 800 kB/s per disk). Anyhow, at this point I am confident that members of the Neill group are not seriously affected unless they work on Highmark data.
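The tiny-write pattern described above can also be confirmed without collectd by sampling the kernel's disk counters directly. A sketch for a Linux host, assuming /proc/diskstats is available (field 10 is sectors written, 512 bytes each):

```shell
# Sample /proc/diskstats twice, two seconds apart, and report the
# per-device write rate. Rates persistently in the tens of kB/s under
# heavy load are the "lots of tiny writes" signature described above.
cat /proc/diskstats > /tmp/diskstats.before
sleep 2
awk 'NR==FNR { before[$3] = $10; next }
     { rate = ($10 - before[$3]) * 512 / 2
       if (rate > 0) printf "%-10s %.0f B/s written\n", $3, rate }' \
    /tmp/diskstats.before /proc/diskstats
```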
I would be happy to temporarily unmount the /zfsauton/project and /zfsauton/data directories from Neill[1-4] to make sure those four computing nodes are 100% OK. The rest of the lab might experience various issues. Your safest bet for logging into the lab is shell.autonlab.org, since NFS shares are not mounted there. If the issues persist, I will be forced to reboot the main file server (current uptime: 267 days).

Predrag

From predragp at andrew.cmu.edu Mon Mar 16 15:14:14 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Mon, 16 Mar 2015 15:14:14 -0400
Subject: Main file server scheduled to reboot
Message-ID: 

Dear Autonians,

I was hoping to avoid this situation, but it looks like that after close to 320 days of uptime the main file server will have to be rebooted to clear stale NFS file handles. It is also possible that a few computing nodes will have to be rebooted as well. Tentatively, I am planning to do this tomorrow, March 17th, around 10:30 AM, if the ZFS monthly scrub has completed. The members of the Neill group will not be affected, but Highmark data will be unavailable for at least a couple of hours.

Predrag

From predragp at andrew.cmu.edu Tue Mar 17 13:07:37 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Tue, 17 Mar 2015 13:07:37 -0400
Subject: File server rebooted
Message-ID: <3a9b690705f2095fbe4d57d77ab52f7e.squirrel@webmail.andrew.cmu.edu>

The main file server was rebooted. The stale NFS file handles all appear to be cleared up, and the lab appears to be 100% functional at this moment. I even did a minor update of the OS on the main file server to fix an rsync-related bug.

Predrag

From predragp at andrew.cmu.edu Tue Mar 17 15:19:44 2015
From: predragp at andrew.cmu.edu (predragp at andrew.cmu.edu)
Date: Tue, 17 Mar 2015 15:19:44 -0400
Subject: MEMEX important backup info
Message-ID: <14e25b8480f989841108dfd4a83d1ee2.squirrel@webmail.andrew.cmu.edu>

This message is only relevant for members of the MEMEX project who are also using Auton Lab desktops.
Note that the VPN connection you have to the MEMEX server affects the backup of your desktops. Namely, the MEMEX VPN overrides your DNS settings, and to the backup server your machine appears to be coming from a hostile network. If you want your desktop machine to be backed up regularly, make sure you switch off the VPN between 6:00 PM and 7:30 PM.

Predrag

From boecking at andrew.cmu.edu Wed Mar 25 09:24:20 2015
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Wed, 25 Mar 2015 09:24:20 -0400
Subject: lov3 disc space
Message-ID: 

Hello everyone,

We are quickly running out of space in the /home/scratch/ directory on lov3; it is 99% full. I would appreciate it if everyone could make an effort to delete unnecessary files from their folders, save them locally on their own machines, or at least compress very large files. I appreciate your cooperation!

Best,
Ben
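For anyone cleaning up, the biggest space consumers under /home/scratch can be listed before deciding what to delete or compress. A minimal sketch, assuming GNU coreutils; the path is from this thread and the filename below is purely illustrative:

```shell
# List the ten largest entries under /home/scratch, human-readable
# (sort -h pairs with the -h sizes that du emits).
du -sh /home/scratch/* 2>/dev/null | sort -rh | head -n 10

# Compress a very large file in place instead of deleting it.
# The filename here is purely illustrative.
f="/home/scratch/$USER/big_results.csv"
if [ -f "$f" ]; then gzip -9 "$f"; fi
```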