ssh login problems (NFS server killed due to overload)

Predrag Punosevac predragp at andrew.cmu.edu
Mon Oct 24 12:01:12 EDT 2022


I am trying really hard not to reboot anything. I manually restarted a
bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I
noticed that restarting the autofs daemon on a computing node restored its
access, so I am using Ansible to propagate the autofs restart across all
computing nodes. It appears that some of them hang. I am hoping to get away
with rebooting only a machine or two and to definitely avoid rebooting the
main file server.
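For reference, the restart propagation can be done with a minimal Ansible
playbook along these lines. This is only a sketch: the inventory group name
"compute" is an assumption, and the real inventory naming may differ.

```yaml
# Sketch: restart autofs on all computing nodes, a few at a time.
# The "compute" group name is hypothetical; adjust to the real inventory.
- hosts: compute
  become: true
  serial: 5            # small batches, so one hung node does not block the rest
  tasks:
    - name: Restart autofs so stale NFS automounts get re-established
      ansible.builtin.service:
        name: autofs
        state: restarted
```

The equivalent ad-hoc form would be
ansible compute -b -m ansible.builtin.service -a "name=autofs state=restarted".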

Out of curiosity: NFS is last-century technology (1980s, Sun Microsystems).
It is a centralized system with a single point of failure. We mitigate this
risk by distributing NFS exports over several different physical file
servers, each running its own NFS instance. That is why /zfsauton/data and
/zfsauton/project as well as /zfsauton/datasets are not affected.
Unfortunately, all of your home directories are located on Gaia. If I catch
rogue users I could theoretically move their home directories to a
different file server and avoid this mess. The other option I have been
looking at is migrating from NFS to GlusterFS (a distributed network file
system). That migration would be non-trivial, and the performance penalty
on small files might be significant. This is not an exact science.
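The small-file penalty comes mostly from per-file metadata operations
(open/stat/close), not bandwidth: on a network file system each of those is
a round trip to a server. A rough, purely local illustration of why many
small files cost more than one big file of the same total size (hypothetical
sketch, not a GlusterFS benchmark; file counts and sizes are arbitrary):

```python
import os
import tempfile
import time

def write_files(directory, n_files, file_size):
    """Write n_files files of file_size bytes each; return elapsed seconds."""
    payload = b"x" * file_size
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(directory, f"f{i}"), "wb") as f:
            f.write(payload)
    return time.perf_counter() - start

total = 8 * 1024 * 1024  # 8 MiB of data either way
with tempfile.TemporaryDirectory() as d:
    many_small = write_files(d, 1024, total // 1024)  # 1024 files of 8 KiB
with tempfile.TemporaryDirectory() as d:
    one_big = write_files(d, 1, total)                # one file of 8 MiB

print(f"many small files: {many_small:.3f}s, one big file: {one_big:.3f}s")
```

Even on a local disk the per-file overhead is visible; over NFS or GlusterFS
every open and close adds network latency on top, so the gap grows sharply.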

Predrag




On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking <boecking at andrew.cmu.edu>
wrote:

> If there is any way to not reboot gpu24 and gpu27 you might save me 2
> weeks of work. If they are rebooted I may be screwed for my ICLR rebuttal.
>
> But ultimately, do what you have to, of course. Thanks!
>
>
>
> > On Oct 24, 2022, at 10:43 AM, Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
> >
> >
> >
> > Dear Autoninas,
> >
> > I got several reports this morning from a few of you (Ifi, Abby, Ben,
> Vedant) about problems accessing the system. After a bit of
> investigation, I nailed down the culprit: the main file server. The
> server (its NFS instance) appears to be dead or severely degraded due to
> overload.
> >
> > I am afraid that the only medicine will be to reboot the machine,
> perhaps followed by a reboot of all 45+ computing nodes. This will
> result in a significant loss of work and productivity. We went through
> this exercise less than two months ago.
> >
> > The Auton Lab cluster is not policed for rogue users. Its usability
> depends on the collegial behaviour of each of our 130 members. Using
> scratch directories instead of taxing NFS is well described in the
> documentation, and as recently as last week I added extra scratch space
> on at least four machines.
> >
> > Best,
> > Predrag
>
>

