Autonlab-sysinfo Digest, Vol 69, Issue 175
Predrag Punosevac
predragp at andrew.cmu.edu
Sat Apr 18 12:58:50 EDT 2020
I have a solution to this problem that will minimize the number of
users affected. I am migrating the remaining few home directories from
/zfsauton/home to /zfsauton2/home. Once the migration is finished, I
will destroy the empty ZFS pool and use its 6 HDDs as hot spares to
repair the degraded pool and, potentially, to deal with future problems
on the other 2 healthy ZFS pools, which contain a large amount of data.
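
As a rough illustration of the hot-spare step (a sketch only, not the
commands actually run on the server), the Python below picks out pools
that are not ONLINE from `zpool list -H -o name,health` output and
builds, without executing it, a `zpool add <pool> spare ...` command.
The pool names come from the listing quoted below; the device names
da30 and da31 and all function names are hypothetical placeholders.

```python
import subprocess

# Sample of what `zpool list -H -o name,health` prints: tab-separated,
# no header. Names and states are copied from the listing in this thread.
SAMPLE = (
    "archive\tONLINE\n"
    "backups\tDEGRADED\n"
    "data0\tONLINE\n"
    "data1\tONLINE\n"
    "zroot\tONLINE\n"
)


def read_pool_health():
    """Query a live system. Needs ZFS installed; not called in this sketch."""
    proc = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        check=True, capture_output=True, text=True,
    )
    return proc.stdout


def unhealthy_pools(listing):
    """Return (pool, health) pairs for every pool that is not ONLINE."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, health = fields
        if health != "ONLINE":
            found.append((name, health))
    return found


def spare_add_command(pool, devices):
    """Build, but do not run, the argv that adds hot spares to a pool."""
    if not devices:
        raise ValueError("at least one spare device is required")
    return ["zpool", "add", pool, "spare"] + list(devices)


if __name__ == "__main__":
    # da30 and da31 are assumed names for drives freed from the emptied pool.
    for pool, health in unhealthy_pools(SAMPLE):
        print(health, " ".join(spare_add_command(pool, ["da30", "da31"])))
```

Printing the command instead of running it leaves the actual change to
the pool in the administrator's hands.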
Predrag
On Fri, Apr 17, 2020 at 6:24 PM Predrag Punosevac
<predragp at andrew.cmu.edu> wrote:
>
> Dear Autonians,
>
> Please see below. One of the HDDs on the file server hosting
> /zfsauton/data and /zfsauton/project has crapped out on me. That, in
> turn, degraded one of the ZFS pools
>
> root at uranus:~ # zpool list
> NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> archive  21.8T  12.1T  9.62T        -         -    21%    55%  1.00x    ONLINE  -
> backups  36.2T  23.4T  12.9T        -         -    17%    64%  1.00x  DEGRADED  -
> data0    36.2T  17.3T  18.9T        -         -    30%    47%  1.00x    ONLINE  -
> data1    36.2T  2.54T  33.7T        -         -     7%     6%  1.00x    ONLINE  -
> zroot     107G  64.7G  42.3G        -         -    24%    60%  1.00x    ONLINE  -
>
>
> which conveniently holds those two datasets. Under normal
> circumstances, I would just replace the HDD with the spare I have in my
> office. Unfortunately, RMA-ing the failed drive is quite challenging
> under the current circumstances. In my experience, the 3TB Seagate
> drives that shipped with the server five years ago have been nothing
> but trouble. I have already replaced 11 of the 36 drives originally
> shipped. They will be out of warranty within the next 6 months.
>
> I have two options:
>
> 1. Replace the failed HDD and do lots of praying that another HDD
> doesn't die before I can RMA the faulty one. If another one dies, we
> will be in the same position a month from now, but I will not have a
> spare drive to react with.
>
> 2. Pull the trigger and remount the datasets from the backup that I
> made on the newest file server, purchased by Dr. Schneider in December.
> It could mean a few hours of inconvenience and perhaps a tiny data loss.
> Right now, I am even scared to try to replicate the delta since the
> last ZFS snapshot, as that could kill another drive, which would
> degrade the ZFS pool even further. So theoretically a tiny portion of
> the work in the project folder could remain on the decaying ZFS pool,
> which I will let rot.
>
> I will not do anything until I hear from the Lab elders (Artur and Jeff).
> Thirty legacy home directories (/zfsauton/home), which are regularly
> backed up, are on the same file server. I will probably migrate those
> home directories to zfsauton2, as the incentive to keep them in their
> current location (20-30 min inconvenience to users) is very low.
>
> My plan, once we are back to normal operation, is to get 36 new 12TB
> HDDs and retire those four ZFS pools (five years old) built with
> crappy Seagate drives.
>
> Best,
> Predrag
>
> ---------- Forwarded message ---------
> From: <autonlab-sysinfo-request at autonlab.org>
> Date: Fri, Apr 17, 2020 at 12:00 PM
> Subject: Autonlab-sysinfo Digest, Vol 69, Issue 175
> To: <autonlab-sysinfo at autonlab.org>
>
>
> Send Autonlab-sysinfo mailing list submissions to
> autonlab-sysinfo at autonlab.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://mailman.srv.cs.cmu.edu/mailman/listinfo/autonlab-sysinfo
> or, via email, send a message with subject or body 'help' to
> autonlab-sysinfo-request at autonlab.org
>
> You can reach the person managing the list at
> autonlab-sysinfo-owner at autonlab.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Autonlab-sysinfo digest..."
>
>
> Today's Topics:
>
> 1. neill-zfs.int.autonlab.org security run output
> (auton.sysnotify at gmail.com)
> 2. SMART error (ErrorCount) detected on host: uranus
> (auton.sysnotify at gmail.com)
> 3. SMART error (CurrentPendingSector) detected on host: uranus
> (auton.sysnotify at gmail.com)
> 4. SMART error (OfflineUncorrectableSector) detected on host:
> uranus (auton.sysnotify at gmail.com)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 17 Apr 2020 07:01:02 -0000
> From: auton.sysnotify at gmail.com
> To: sysinfo at autonlab.org
> Subject: neill-zfs.int.autonlab.org security run output
> Message-ID: <5e99542f.1c69fb81.22168.ff77 at mx.google.com>
> Content-Type: text/plain; charset="utf-8"
>
>
> neill-zfs.int.autonlab.org kernel log messages:
> > pid 47721 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 47881 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 48047 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 48252 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 48416 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 48576 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 48711 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 48902 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 49068 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 49228 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 49398 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 49558 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 49693 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 49884 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 50050 (winbindd), uid 0: exited on signal 6 (core dumped)
> > pid 50257 (winbindd), uid 0: exited on signal 6 (core dumped)
>
> -- End of security output --
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 17 Apr 2020 08:47:45 -0400
> From: auton.sysnotify at gmail.com
> To: root+ at cs.cmu.edu
> Subject: SMART error (ErrorCount) detected on host: uranus
> Message-ID: <5e99a571.1a6f0.4349afd1 at uranus.int.autonlab.org>
>
> This message was generated by the smartd daemon running on:
>
> host name: uranus
> DNS domain: int.autonlab.org
>
> The following warning/error was logged by the smartd daemon:
>
> Device: /dev/da6 [SAT], ATA error count increased from 0 to 14
>
> Device info:
> ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB
>
> For details see host's SYSLOG.
>
> You can also use the smartctl utility for further investigation.
> No additional messages about this problem will be sent.
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 17 Apr 2020 09:17:11 -0400
> From: auton.sysnotify at gmail.com
> To: root+ at cs.cmu.edu
> Subject: SMART error (CurrentPendingSector) detected on host: uranus
> Message-ID: <5e99ac57.1a6f7.79b86657 at uranus.int.autonlab.org>
>
> This message was generated by the smartd daemon running on:
>
> host name: uranus
> DNS domain: int.autonlab.org
>
> The following warning/error was logged by the smartd daemon:
>
> Device: /dev/da6 [SAT], 16 Currently unreadable (pending) sectors
>
> Device info:
> ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB
>
> For details see host's SYSLOG.
>
> You can also use the smartctl utility for further investigation.
> No additional messages about this problem will be sent.
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 17 Apr 2020 09:17:11 -0400
> From: auton.sysnotify at gmail.com
> To: root+ at cs.cmu.edu
> Subject: SMART error (OfflineUncorrectableSector) detected on host:
> uranus
> Message-ID: <5e99ac57.1a6f9.1d0a864 at uranus.int.autonlab.org>
>
> This message was generated by the smartd daemon running on:
>
> host name: uranus
> DNS domain: int.autonlab.org
>
> The following warning/error was logged by the smartd daemon:
>
> Device: /dev/da6 [SAT], 16 Offline uncorrectable sectors
>
> Device info:
> ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB
>
> For details see host's SYSLOG.
>
> You can also use the smartctl utility for further investigation.
> No additional messages about this problem will be sent.
>
>
> ------------------------------
>
> End of Autonlab-sysinfo Digest, Vol 69, Issue 175
> *************************************************