<div dir="ltr">Please test <a href="http://bash.autonlab.org">bash.autonlab.org</a>, <a href="http://upload.autonlab.org">upload.autonlab.org</a>, and <a href="http://lop2.autonlab.org">lop2.autonlab.org</a>.<div><br></div><div>It appears that NFS mounts work on these shell gateways. If you have an Auton Lab workstation, please either remount your network home directory (mount -o remount) or reboot the workstation. </div><div><br></div><div>Predrag</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 24, 2022 at 12:01 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I am trying really hard not to reboot anything. I manually restarted a bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I noticed that restarting the autofs daemons on computing nodes restored access. I am using Ansible to propagate the autofs daemon restart to all computing nodes. Some of them appear to hang. I am hoping to get away with rebooting only a machine or two and definitely avoid rebooting the main file server. <div><br></div><div>For the curious: NFS is last-century technology (1980s, Sun Microsystems). It is a centralized system with a single point of failure. We mitigate this risk by distributing NFS exports over several physical file servers, each running its own NFS instance. That is why /zfsauton/data, /zfsauton/project, and /zfsauton/datasets are not affected. Unfortunately, all of your home directories are located on Gaia. If I catch rogue users, I could theoretically move their home directories to a different file server and avoid this mess. The other option I have been looking at is migrating from NFS to GlusterFS (a distributed network file system). The migration would be non-trivial, and the performance penalty with small files might be significant. 
This is not an exact science. </div><div><br></div><div>Predrag<br><div><br></div><div><br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking <<a href="mailto:boecking@andrew.cmu.edu" target="_blank">boecking@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">If there is any way to not reboot gpu24 and gpu27 you might save me 2 weeks of work. If they are rebooted I may be screwed for my ICLR rebuttal. <br>

<br>

But ultimately, do what you have to of course. Thanks! <br>

<br>

<br>

<br>

> On Oct 24, 2022, at 10:43 AM, Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>> wrote:<br>

> <br>

> <br>

> <br>

> Dear Autonians,<br>

> <br>

> This morning I got reports from several of you (Ifi, Abby, Ben, Vedant) about problems accessing the system. After a bit of investigation, I traced the problem to the main file server. The server (NFS instance) appears to be dead or severely degraded due to overload. <br>

> <br>

> I am afraid that the only remedy will be to reboot the machine, perhaps followed by a reboot of all 45+ computing nodes. This will result in a significant loss of work and productivity. We went through this exercise less than two months ago. <br>

> <br>

> The Auton Lab cluster is not policed for rogue users. Its usability depends on the collegial behaviour of each of our 130 members. Using scratch directories instead of taxing NFS is well described in the documentation, and as recently as last week I added extra scratch space on at least four machines. <br>

> <br>

> Best,<br>

> Predrag<br>

<br>

</blockquote></div>

</blockquote></div>