<div dir="ltr"><div>Dear Autonians,</div><div><br></div><div>I am happy to report that the second ZFS pool (storage) on our main file server has healed 100%. I don't know how you feel about it, but I am going to buy Matt Ahrens a beer the next time I see him at one of the BSD or OpenZFS conferences for saving my rear end :-)</div><div><br></div><div>Cheers,</div><div>Predrag</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 31, 2021 at 9:03 PM Predrag Punosevac &lt;<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Ifi,<div><br></div><div>The issue arose because your home directory was unavailable while the file server was down. I have managed to bring the file server up using IPMI and the KVM console, and you can proceed as usual. If you want to hear more details, please keep reading this email.</div><div><br></div><div>Namely, one of the two ZFS pools seems to be a bit damaged and was holding up the boot. I managed to use IPMI and the KVM console to boot the server into single-user mode and export the ZFS data pools so that they don't hold up the boot process. Once I was convinced the server worked as expected, I imported back the ZFS pool tank, which holds your home directory and the home directories of other people who have a higher than normal disk quota, 15 users in total. Then I tried to import the ZFS pool storage, which holds the home directories of the majority of users, including my own. The import didn't complete, and I suspect that the file system is trying to self-heal (not possible on any other file system except ZFS). Even the half-functional system was working well enough for me to use my regular user account and my home directory. There is nothing wrong with the HDDs, and I hope that the system will be able to repair itself. 
However, if you are unlucky and your home directory is on the damaged part of the ZFS pool, I suspect that login might not work for you. Your only option, in that case, is to be patient and hope that things will be back to normal by tomorrow morning. </div><div><br></div><div>Cheers,</div><div>Predrag</div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 31, 2021 at 7:59 PM Ifigeneia Apostolopoulou &lt;<a href="mailto:iapostol@andrew.cmu.edu" target="_blank">iapostol@andrew.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hi Predrag,<br><div><br></div><div>I can now log in, but I'm getting the following. Is it an issue related to my account (or do I just have to be a little bit more patient :))? Thanks!</div><div> <span style="color:rgb(0,0,0);font-family:Menlo;font-size:11px">Could not chdir to home directory /zfsauton3/home/iapostol: Input/output error</span></div><div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 31, 2021 at 6:08 PM Predrag Punosevac &lt;<a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Dear Autonians,<div><br></div><div>The main file server Gaia is temporarily down. I applied security patches and updates across all the FreeBSD file servers and jail hosts (6 in total) and rebooted them. Unfortunately, Gaia was running a ZFS scrub and didn't come back. In my experience, once the scrub finishes (hopefully in an hour or two), it should automatically come back online. 
I apologize for any inconvenience, but servers need regular maintenance to be able to run.</div><div><br></div><div>On an unrelated note, I had to reboot GPU2, which had been crashed by runaway scripts. As you probably know, Python is not designed for scientific computing and has a global interpreter lock (GIL), which makes multithreaded programming almost impossible. Our users, just like all other people who use Python for scientific computing, keep spawning processes in order to fake multithreading; as a result, we have regular server reboots/crashes.</div><div><br></div><div>Cheers,</div><div>Predrag</div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div></div>