ssh login problems (NFS server killed due to overload)

Mon Oct 24 12:12:47 EDT 2022

Please test bash.autonlab.org, upload.autonlab.org, and
lop2.autonlab.org.

NFS mounts appear to work on these shell gateways. If you have an
Auton Lab workstation, please remount your network home directory
(mount -o remount) or reboot the workstation.
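
Before remounting or rebooting, it can help to check whether the home
directory responds at all. The following is only a minimal sketch, assuming
Python 3 and the stat utility are available on the workstation. The function
name path_responds and the 5-second timeout are illustrative choices, not
part of any Auton Lab tooling.

```python
import subprocess


def path_responds(path, timeout=5.0):
    """Return True if `stat` on path succeeds within `timeout` seconds."""
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means stat could read the path's metadata.
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A process blocked on a dead NFS mount may not exit promptly even
        # when killed, so send the signal and return without waiting for it.
        proc.kill()
        return False
```

If path_responds returns False for your home directory path, the mount is
probably still hung and a remount or reboot is likely needed.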

Predrag

On Mon, Oct 24, 2022 at 12:01 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> I am trying really hard not to reboot anything. I manually restarted a
> bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I
> noticed that restarting autofs daemons on computing nodes restored the
> access. I am using Ansible to propagate autofs daemon restart over all
> computing nodes. It appears that some of them hang. I am hoping to get away
> with rebooting only a machine or two and definitely avoid rebooting the
> main file server.
>
> For the curious: NFS is last-century technology (1980s, Sun Microsystems)
> and a centralized single point of failure. We mitigate this risk by
> distributing NFS exports over several physical file servers, each running
> its own NFS instance. That is why /zfsauton/data, /zfsauton/project, and
> /zfsauton/datasets are not affected. Unfortunately, all of your home
> directories are located on Gaia. If I catch rogue users, I could in
> principle move their home directories to a different file server and
> avoid this mess. The other option I was looking at is migrating from NFS
> to GlusterFS (a distributed network file system). The migration would be
> non-trivial, and the performance penalty with small files might be
> significant. This is not an exact science.
>
> Predrag
>
>
>
>
> On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking <
> boecking at andrew.cmu.edu> wrote:
>
>> If there is any way to not reboot gpu24 and gpu27 you might save me 2
>> weeks of work. If they are rebooted I may be screwed for my ICLR rebuttal.
>>
>> But ultimately, do what you have to of course. Thanks!
>>
>>
>>
>> > On Oct 24, 2022, at 10:43 AM, Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>> >
>> >
>> >
>> > Dear Autoninas,
>> >
>> > I got several reports this morning from a few of you (Ifi, Abby, Ben,
>> Vedant) about problems accessing the system. After a bit of
>> investigation, I traced the problem to the main file server. The
>> server (NFS instance) appears to be dead or severely degraded due to
>> overload.
>> >
>> > I am afraid that the only remedy will be to reboot the machine,
>> perhaps followed by a reboot of all 45+ computing nodes. This will
>> result in a significant loss of work and productivity. We went through
>> this exercise less than two months ago.
>> >
>> > The Auton Lab cluster is not policed for rogue users. Its usability
>> depends on the collegial behaviour of each of our 130 members. Using scratch
>> directories instead of taxing NFS is well described in the documentation,
>> and as recently as last week I added extra scratch space on at least four
>> machines.
>> >
>> > Best,
>> > Predrag
>>
>>