<div dir="ltr">Hi Predrag,<div><br></div><div>I've heard mixed reports. For one lab member, the issue persists on all servers, including gpu. For another, it occurs on the lov servers but not on gpu. Personally, I've only tested it on lov3.</div><div><br></div><div>Best,</div><div>Gus</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 9, 2022 at 12:44 AM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Gus,<div><br></div><div>The SQLite database used by Jupyter is corrupted. The corruption was caused by NFS (high I/O or a stale NFS file handle). The real mystery is why you don't instruct Jupyter notebook to store its SQLite database in the scratch directory. Which server(s) are affected?</div><div><br></div><div>Predrag</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Nov 8, 2022 at 12:48 PM Gus Welter <<a href="mailto:gwelter@andrew.cmu.edu" target="_blank">gwelter@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Predrag,<div><br></div><div>Multiple lab members have Jupyter hang when they try to create or open a file in JupyterLab. I tried running "df -h" on lov3, and it hangs. Maybe there are some lingering NFS issues? </div><div><br></div><div>Best,</div><div>Gus</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 24, 2022 at 9:28 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div>That means that the processes which caused the crash are still alive. 
I need to think a bit about how to proceed in the most efficient way. Logging into 45 computing nodes and poking around doesn't scale well. If I end up doing that, the offending account will be suspended.<div dir="auto"><br></div><div dir="auto">Predrag</div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 24, 2022, 1:21 PM Benedikt Boecking <<a href="mailto:boecking@andrew.cmu.edu" rel="noreferrer" target="_blank">boecking@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Just to confirm, looks like things are down again. <div><br><div>
<div dir="auto" style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><div dir="auto" style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div><div style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><br></div><div style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><br></div></div></div></div></div></div></div><div><blockquote type="cite"><div>On Oct 24, 2022, at 11:12 AM, Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" rel="noreferrer noreferrer" target="_blank">predragp@andrew.cmu.edu</a>> wrote:</div><br><div><div dir="ltr">Please try to test <a href="http://bash.autonlab.org/" rel="noreferrer noreferrer" target="_blank">bash.autonlab.org</a>, <a href="http://upload.autonlab.org/" rel="noreferrer noreferrer" target="_blank">upload.autonlab.org</a>, and <a href="http://lop2.autonlab.org/" rel="noreferrer noreferrer" target="_blank">lop2.autonlab.org</a>.<div><br></div><div>It appears that NFS mounts work on these shell gateways. If you have an Auton Lab workstation please mount -o remount your network home directory or reboot it. 
</div><div><br></div><div>Predrag</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 24, 2022 at 12:01 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" rel="noreferrer noreferrer" target="_blank">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I am trying really hard not to reboot anything. I manually restarted a bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I noticed that restarting the autofs daemon on computing nodes restored access. I am using Ansible to propagate the autofs daemon restart to all computing nodes. It appears that some of them hang. I am hoping to get away with rebooting only a machine or two, and definitely to avoid rebooting the main file server. <div><br></div><div>For the curious: NFS is last-century technology (1980s, Sun Microsystems). It is a centralized system with a single point of failure. We mitigate this risk by distributing NFS exports over several different physical file servers, each of which runs its own NFS instance. That is why /zfsauton/data and /zfsauton/project as well as /zfsauton/datasets are not affected. Unfortunately, all of your home directories are located on Gaia. If I catch rogue users, I could theoretically move their home directories to a different file server and avoid this mess. The other option I was looking at was migrating from NFS to GlusterFS (a distributed network file system). The migration would be non-trivial, and the performance penalty with small files might be significant. This is not an exact science. 
</div><div><br></div><div>Predrag<br><div><br></div><div><br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking <<a href="mailto:boecking@andrew.cmu.edu" rel="noreferrer noreferrer" target="_blank">boecking@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">If there is any way to not reboot gpu24 and gpu27 you might save me 2 weeks of work. If they are rebooted I may be screwed for my ICLR rebuttal. <br>
<br>
But ultimately, do what you have to of course. Thanks! <br>
<br>
<br>
<br>
> On Oct 24, 2022, at 10:43 AM, Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" rel="noreferrer noreferrer" target="_blank">predragp@andrew.cmu.edu</a>> wrote:<br>
> <br>
> <br>
> <br>
> Dear Autoninas,<br>
> <br>
> I got several reports this morning from a few of you (Ifi, Abby, Ben, Vedant) about problems accessing the system. After a bit of investigation, I traced the problem to the main file server. The server (NFS instance) appears to be dead or severely degraded due to overload. <br>
> <br>
> I am afraid that the only remedy will be to reboot the machine, perhaps followed by a reboot of all 45+ computing nodes. This will result in a significant loss of work and productivity. We went through this exercise less than two months ago. <br>
> <br>
> The Auton Lab cluster is not policed for rogue users. Its usability depends on the collegial behaviour of each of our 130 members. The use of scratch directories instead of taxing NFS is well described in the documentation, and as recently as last week I added extra scratch space on at least four machines. <br>
> <br>
> Best,<br>
> Predrag<br>
<br>
</blockquote></div>
</blockquote></div>
</div></blockquote></div><br></div></div></blockquote></div>
</div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>