Main file server temp down

Predrag Punosevac predragp at andrew.cmu.edu
Mon Feb 1 10:19:54 EST 2021


Hi Ifi,

The issue came from the fact that your home directory was unavailable
because the file server was down. I have managed to bring the file server up
using IPMI and the KVM console, and you can proceed as usual. If you want to
hear more details, please keep reading this email.

Namely, one of the two ZFS pools seems to be a bit damaged and was holding
up the boot. I used IPMI and the KVM console to boot the server into
single-user mode and export the ZFS data pools so that they don't hold up
the boot process. Once I was convinced the server worked as expected, I
imported back the ZFS pool tank, which holds your home directory and the
home directories of other people who have a higher than normal disk quota,
15 users in total.
Then I tried to import the ZFS pool storage, which holds the home
directories of the majority of users, including my own. The import didn't
complete, and I suspect that the file system is trying to self-heal
(something few file systems other than ZFS can do). Even the half-functional
system was working well enough for me to use my regular user account and my
home directory. There is nothing wrong with the HDDs, and I hope that the
system will be able to repair itself. However, if you are unlucky and your
home directory is on the damaged part of the ZFS pool, I suspect that login
might not work for you. Your only option, in that case, is to be patient and
hope that things will be back to normal by tomorrow morning.
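
For anyone curious, the export/import sequence described above can be
sketched roughly as follows. This is only an illustration, not the exact
commands that were typed on the console: the pool names tank and storage
come from this email, while the helper names (export_commands,
import_commands, recovery_plan) and the ordering are assumptions made for
the example. The script only prints the zpool commands; it does not run
them.

```python
import shlex

# Pool names taken from this email; the order is an assumption.
POOLS = ["tank", "storage"]


def export_commands(pools):
    """Commands to export every data pool so none of them blocks the boot."""
    return [["zpool", "export", pool] for pool in pools]


def import_commands(pool):
    """Commands to bring one pool back and then inspect its health."""
    return [["zpool", "import", pool], ["zpool", "status", pool]]


def recovery_plan(pools):
    """Export all pools first, then import and check them one at a time."""
    plan = export_commands(pools)
    for pool in pools:
        plan.extend(import_commands(pool))
    return plan


if __name__ == "__main__":
    # Print the plan only; an administrator would run these by hand.
    for argv in recovery_plan(POOLS):
        print(shlex.join(argv))
```

Importing one pool at a time makes it easier to see which pool is the one
that stalls.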

Cheers,
Predrag




On Sun, Jan 31, 2021 at 7:59 PM Ifigeneia Apostolopoulou <
iapostol at andrew.cmu.edu> wrote:

> Hi Predrag,
>
> I can now log in, but I'm getting the following. Is it an issue related to
> my account (or do I just have to be a little bit more patient :))? Thanks!
>  Could not chdir to home directory /zfsauton3/home/iapostol: Input/output error
>
>
> On Sun, Jan 31, 2021 at 6:08 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Dear Autonians,
>>
>> The main file server Gaia is temporarily down. I applied security patches
>> and updates across all the FreeBSD file servers and jail hosts (6 in total)
>> and rebooted them. Unfortunately, Gaia was running a ZFS scrub and didn't
>> come back. In my experience, once the scrub finishes (hopefully in an
>> hour or two) it should automatically come back online. I apologize for any
>> inconvenience, but servers need some regular maintenance to be able to run.
>>
>> On an unrelated note, I had to reboot GPU2, which was crashed by runaway
>> scripts. As you probably know, Python is not designed for scientific
>> computing and has a global interpreter lock (GIL), which makes CPU-bound
>> multithreaded programming almost impossible. Our users, just like all other
>> people who use Python for scientific computing, keep spawning processes in
>> order to fake multithreading, and as a result we have regular server
>> reboots/crashes.
>>
>> Cheers,
>> Predrag
>>
>
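
Regarding the runaway scripts mentioned in the original announcement: one
common way to keep process-based parallelism from overwhelming a shared
server is to cap the number of worker processes. The sketch below is a
minimal illustration under that assumption, not a description of what any
user's script actually does. The names cpu_bound, worker_limit and
run_bounded are hypothetical, and the toy workload stands in for a real
computation.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def cpu_bound(n):
    """Stand-in for a CPU-heavy task: sum of squares below n."""
    return sum(i * i for i in range(n))


def worker_limit(requested):
    """Clamp the requested worker count to between 1 and the CPU count."""
    return max(1, min(requested, os.cpu_count() or 1))


def run_bounded(inputs, requested_workers):
    """Run cpu_bound over inputs with a fixed, capped pool of processes."""
    with ProcessPoolExecutor(max_workers=worker_limit(requested_workers)) as pool:
        # map() reuses the same workers instead of spawning one per task.
        return list(pool.map(cpu_bound, inputs))


if __name__ == "__main__":
    print(run_bounded([10, 100, 1000], requested_workers=2))
```

Because the pool is created once and reused, the number of live processes
stays fixed no matter how many tasks are submitted.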