ssh login problems (NFS server killed due to overload)

Predrag Punosevac predragp at andrew.cmu.edu
Wed Nov 9 10:03:55 EST 2022


The only sure fix is rebooting the file server and all of the computing nodes,
then deleting the corrupted SQLite database once everything is back. Things
will be OK until the same person(s) who killed the file server restart their
scripts.
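
For reference, the SQLite files in question are usually Jupyter's notebook
signature database and IPython's command history. Assuming the stock default
locations under a network home directory (paths may differ per installation),
clearing them would look roughly like this:

    # stop any running Jupyter/IPython sessions first, then:
    rm -f ~/.local/share/jupyter/nbsignatures.db
    rm -f ~/.ipython/profile_default/history.sqlite
    # both files are recreated automatically on the next Jupyter start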

Predrag

On Wed, Nov 9, 2022, 8:00 AM Gus Welter <gwelter at andrew.cmu.edu> wrote:

> Hi Predrag,
>
> I've heard mixed results. For one lab member, the issue persists on all
> servers, including gpu. For another, it is on the lov servers but not gpu.
> Personally, I've only tested it on lov3.
>
> Best,
> Gus
>
> On Wed, Nov 9, 2022 at 12:44 AM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Hi Gus,
>>
>> The SQLite database used by your Jupyter is corrupted. The corruption was
>> caused by NFS (high I/O or a stale NFS file handle). Which server(s) are
>> affected? The real mystery is why you don't instruct Jupyter Notebook to
>> store its SQLite files in the scratch directory.
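>>
>> As a rough sketch, assuming a per-user scratch directory such as
>> /path/to/scratch/$USER (a placeholder; use whatever scratch exists on your
>> node), moving the Jupyter/IPython SQLite files off NFS can be as simple as
>> exporting these before launching Jupyter:
>>
>>     # keep nbsignatures.db and history.sqlite on local scratch, not NFS
>>     export JUPYTER_DATA_DIR=/path/to/scratch/$USER/jupyter
>>     export IPYTHONDIR=/path/to/scratch/$USER/ipython
>>     mkdir -p "$JUPYTER_DATA_DIR" "$IPYTHONDIR"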
>>
>> Predrag
>>
>> On Tue, Nov 8, 2022 at 12:48 PM Gus Welter <gwelter at andrew.cmu.edu>
>> wrote:
>>
>>> Hi Predrag,
>>>
>>> Multiple lab members see Jupyter hang when they try to create or open a
>>> file in Jupyter Lab. I also tried "df -h" on lov3, and it hangs. Maybe
>>> there are some lingering NFS issues?
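>>>
>>> (A quick diagnostic sketch: excluding NFS lets df return while the home
>>> mount is probed separately with a timeout; note that a process stuck on a
>>> hard NFS mount may still ignore the timeout.)
>>>
>>>     df -h -x nfs -x nfs4                       # local filesystems only
>>>     timeout 5 stat "$HOME" >/dev/null || echo "home mount appears hung"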
>>>
>>> Best,
>>> Gus
>>>
>>> On Mon, Oct 24, 2022 at 9:28 PM Predrag Punosevac <
>>> predragp at andrew.cmu.edu> wrote:
>>>
>>>> That means the processes which caused the crash are still alive. I need
>>>> to think a bit about how to proceed in the most efficient way. Logging into
>>>> 45 computing nodes and poking around doesn't scale well. If I end up doing
>>>> that, the offending account will be suspended.
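>>>>
>>>> (A rough sketch of how such a survey could be done with one Ansible
>>>> ad-hoc command instead of 45 logins; the "compute" group name is a
>>>> placeholder for whatever the inventory actually uses, and iotop must be
>>>> installed on the nodes:)
>>>>
>>>>     # show the top I/O-generating processes on every node
>>>>     ansible compute -b -m shell -a "iotop -b -o -n 1 | head -n 20"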
>>>>
>>>> Predrag
>>>>
>>>>
>>>> On Mon, Oct 24, 2022, 1:21 PM Benedikt Boecking <
>>>> boecking at andrew.cmu.edu> wrote:
>>>>
>>>>> Just to confirm, looks like things are down again.
>>>>>
>>>>>
>>>>>
>>>>> On Oct 24, 2022, at 11:12 AM, Predrag Punosevac <
>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>
>>>>> Please try to test bash.autonlab.org, upload.autonlab.org, and
>>>>> lop2.autonlab.org.
>>>>>
>>>>> It appears that NFS mounts work on these shell gateways. If you have an
>>>>> Auton Lab workstation, please mount -o remount your network home
>>>>> directory or reboot the workstation.
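>>>>>
>>>>> (Concretely, something along these lines; the mount point is a
>>>>> placeholder, substitute whatever your workstation actually mounts for
>>>>> the home directories:)
>>>>>
>>>>>     sudo mount -o remount /path/to/nfs/home
>>>>>     # if the remount itself hangs, a reboot is the cleaner option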
>>>>>
>>>>> Predrag
>>>>>
>>>>> On Mon, Oct 24, 2022 at 12:01 PM Predrag Punosevac <
>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>
>>>>>> I am trying really hard not to reboot anything. I manually restarted a
>>>>>> bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I
>>>>>> noticed that restarting the autofs daemon on computing nodes restored
>>>>>> access, so I am using Ansible to propagate the autofs restart over all
>>>>>> computing nodes. It appears that some of them hang. I am hoping to get
>>>>>> away with rebooting only a machine or two, and to definitely avoid
>>>>>> rebooting the main file server.
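>>>>>>
>>>>>> (For reference, the propagated restart amounts to an Ansible ad-hoc
>>>>>> call roughly like this; "compute" is a placeholder group name:)
>>>>>>
>>>>>>     ansible compute -b -m service -a "name=autofs state=restarted" -f 20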
>>>>>>
>>>>>> For the curious: NFS is last-century technology (1980s Sun
>>>>>> Microsystems). It is a centralized system with a single point of
>>>>>> failure. We mitigate this risk by distributing NFS exports over several
>>>>>> different physical file servers, each running its own NFS instance. That
>>>>>> is why /zfsauton/data, /zfsauton/project, and /zfsauton/datasets are not
>>>>>> affected. Unfortunately, all of your home directories are located on
>>>>>> GAIA. If I catch the rogue users, I could theoretically move their home
>>>>>> directories to a different file server and avoid this mess. The other
>>>>>> option I have been looking at is migrating from NFS to GlusterFS (a
>>>>>> distributed network file system). That migration would be non-trivial,
>>>>>> and the performance penalty with small files might be significant. This
>>>>>> is not an exact science.
>>>>>>
>>>>>> Predrag
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking <
>>>>>> boecking at andrew.cmu.edu> wrote:
>>>>>>
>>>>>>> If there is any way not to reboot gpu24 and gpu27, you might save me
>>>>>>> two weeks of work. If they are rebooted, I may be screwed for my ICLR
>>>>>>> rebuttal.
>>>>>>>
>>>>>>> But ultimately, do what you have to, of course. Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > On Oct 24, 2022, at 10:43 AM, Predrag Punosevac <
>>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > Dear Autoninas,
>>>>>>> >
>>>>>>> > I got several reports this morning from a few of you (Ifi, Abby,
>>>>>>> Ben, Vedant) that you are having problems accessing the system. After a
>>>>>>> bit of investigation, I traced the culprit to the main file server. The
>>>>>>> server (its NFS instance) appears to be dead or severely degraded due
>>>>>>> to overload.
>>>>>>> >
>>>>>>> > I am afraid that the only medicine will be to reboot the machine,
>>>>>>> perhaps followed by a reboot of all 45+ computing nodes. This will
>>>>>>> result in a significant loss of work and productivity. We already went
>>>>>>> through this exercise less than two months ago.
>>>>>>> >
>>>>>>> > The Auton Lab cluster is not policed for rogue users. Its usability
>>>>>>> depends on the collegial behaviour of each of our 130 members. The use
>>>>>>> of scratch directories instead of taxing NFS is well described in the
>>>>>>> documentation, and as recently as last week I added extra scratch space
>>>>>>> on at least four machines.
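>>>>>>> >
>>>>>>> > (As a generic illustration of pointing heavy I/O at scratch; the
>>>>>>> > scratch path is a placeholder, use whatever local directory your node
>>>>>>> > provides:)
>>>>>>> >
>>>>>>> >     export TMPDIR=/path/to/scratch/$USER && mkdir -p "$TMPDIR"
>>>>>>> >     # write intermediate outputs under $TMPDIR, then copy only the
>>>>>>> >     # final results back to the NFS home directory:
>>>>>>> >     rsync -a "$TMPDIR/results/" ~/results/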
>>>>>>> >
>>>>>>> > Best,
>>>>>>> > Predrag
>>>>>>>
>>>>>>>
>>>>>