ssh login problems (NFS server killed due to overload)

Mon Oct 24 13:27:07 EDT 2022

That means that the processes which caused the crash are still alive. I need
to think a bit about how to proceed in the most efficient way. Logging into 45
computing nodes and poking around doesn't scale well. If I end up doing
that, the offending account will be suspended.

Predrag

On Mon, Oct 24, 2022, 1:21 PM Benedikt Boecking <boecking at andrew.cmu.edu>
wrote:

> Just to confirm, looks like things are down again.
>
>
>
> On Oct 24, 2022, at 11:12 AM, Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
> Please try to test bash.autonlab.org, upload.autonlab.org, and
> lop2.autonlab.org.
>
> It appears that NFS mounts work on these shell gateways. If you have an
> Auton Lab workstation, please remount your network home directory
> (mount -o remount) or reboot the workstation.
>
> Predrag
>
> On Mon, Oct 24, 2022 at 12:01 PM Predrag Punosevac <
> predragp at andrew.cmu.edu> wrote:
>
>> I am trying really hard not to reboot anything. I manually restarted a
>> bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I
>> noticed that restarting the autofs daemon on computing nodes restored
>> access. I am using Ansible to propagate the autofs daemon restart to all
>> computing nodes. It appears that some of them hang. I am hoping to get away
>> with rebooting only a machine or two and to definitely avoid rebooting the
>> main file server.
>>
>> For the curious: NFS is last-century technology (1980s, Sun Microsystems).
>> It is a centralized system with a single point of failure. We
>> mitigate this risk by distributing NFS exports over several different
>> physical file servers which run their own NFS instances. That is why
>> /zfsauton/data and /zfsauton/project as well as /zfsauton/datasets are not
>> affected. Unfortunately, all of your home directories are located on Gaia.
>> If I catch rogue users, I could theoretically move their home directories to
>> a different file server and avoid this mess. The other option I was
>> looking into was migrating from NFS to GlusterFS (a distributed network file
>> system). The migration would be non-trivial, and the performance penalty with
>> small files might be significant. This is not an exact science.
>>
>> Predrag
>>
>>
>>
>>
>> On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking <
>> boecking at andrew.cmu.edu> wrote:
>>
>>> If there is any way to avoid rebooting gpu24 and gpu27, you might save me 2
>>> weeks of work. If they are rebooted I may be screwed for my ICLR rebuttal.
>>>
>>> But ultimately, do what you have to of course. Thanks!
>>>
>>>
>>>
>>> > On Oct 24, 2022, at 10:43 AM, Predrag Punosevac <
>>> predragp at andrew.cmu.edu> wrote:
>>> >
>>> >
>>> >
>>> > Dear Autonians,
>>> >
>>> > I got several reports this morning from a few of you (Ifi, Abby, Ben,
>>> Vedant) about problems accessing the system. After a bit of
>>> investigation, I traced the problem to the main file server. The
>>> server (NFS instance) appears to be dead or severely degraded due to
>>> overload.
>>> >
>>> > I am afraid that the only remedy will be to reboot the machine,
>>> perhaps followed by a reboot of all 45+ computing nodes. This will
>>> result in a significant loss of work and productivity. We went through
>>> this exercise less than two months ago.
>>> >
>>> > The Auton Lab cluster is not policed for rogue users. Its usability
>>> depends on the collegial behaviour of each of our 130 members. Using scratch
>>> directories instead of taxing NFS is well described in the documentation,
>>> and as recently as last week I added extra scratch space on at least four
>>> machines.
>>> >
>>> > Best,
>>> > Predrag
>>>
>>>
>