Main file server temp down
Predrag Punosevac
predragp at andrew.cmu.edu
Tue Feb 2 21:37:26 EST 2021
Dear Autonians,
I am happy to report that the second ZFS pool (storage) on our main file
server has healed 100%. I don't know how you feel about it, but I am going
to buy Matt Ahrens a beer the next time I see him at one of the BSD or
OpenZFS conferences for saving my rear end :-)
Cheers,
Predrag
On Sun, Jan 31, 2021 at 9:03 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> Hi Ifi,
>
> The issue came from the fact that your home directory was unavailable
> because the file server was down. I have managed to bring the file server
> up using IPMI and the KVM console, and you can proceed as usual. If you
> want to hear more details, please keep reading this email.
>
> Namely, one of the two ZFS pools seems to be a bit damaged and was holding
> up the boot. I managed to use IPMI and the KVM console to boot the server
> into single-user mode and export the ZFS data pools so that they don't
> hold up the boot process. Once I was convinced the server worked as
> expected, I imported back the ZFS pool tank, which holds your home
> directory and the home directories of other people who have a higher than
> normal disk quota, 15 users in total.
> Then I tried to import the ZFS pool storage, which holds the home
> directories of the majority of users, including my own. The import didn't
> complete, and I suspect that the file system is trying to self-heal
> (something few file systems other than ZFS can do). Even the
> half-functional system was working well enough for me to use my regular
> user account and my home directory. There is nothing wrong with the HDDs,
> and I hope that the system will be able to repair itself. However, if you
> are unlucky and your home directory is on the damaged part of the ZFS
> pool, I suspect that login might not work for you. Your only option in
> that case is to be patient and hope that things will be back to normal by
> tomorrow morning.
>
> Cheers,
> Predrag
>
>
>
>
> On Sun, Jan 31, 2021 at 7:59 PM Ifigeneia Apostolopoulou <
> iapostol at andrew.cmu.edu> wrote:
>
>> Hi Predrag,
>>
>> I can now log in, but I'm getting the following. Is it an issue related to
>> my account, or do I just have to be a little bit more patient :)? Thanks!
>> Could not chdir to home directory /zfsauton3/home/iapostol:
>> Input/output error
>>
>>
>> On Sun, Jan 31, 2021 at 6:08 PM Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> Dear Autonians,
>>>
>>> The main file server Gaia is temporarily down. I applied security patches
>>> and updates across all the FreeBSD file servers and jail hosts (6 in total)
>>> and rebooted them. Unfortunately, Gaia was running a ZFS scrub and didn't
>>> come back. In my experience, once the scrub finishes (hopefully in an
>>> hour or two) it should automatically come back online. I apologize for any
>>> inconvenience, but servers need regular maintenance to be able to run.
>>>
>>> On an unrelated note, I had to reboot GPU2, which was crashed by runaway
>>> scripts. As you probably know, Python was not designed for scientific
>>> computing, and its global interpreter lock (GIL) keeps CPU-bound threads
>>> from running in parallel. Our users, like everyone else who uses Python
>>> for scientific computing, keep spawning processes to work around this,
>>> and as a result we have regular server crashes and reboots.
>>>
>>> Cheers,
>>> Predrag
>>>
>>