Fwd: Autonlab-sysinfo Digest, Vol 69, Issue 175
Predrag Punosevac
predragp at andrew.cmu.edu
Fri Apr 17 18:24:07 EDT 2020
Dear Autonians,
Please see below. One of the HDDs on the file server hosting
/zfsauton/data and /zfsauton/project has crapped out on me. That, in
turn, degraded one of the ZFS pools:
root@uranus:~ # zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
archive  21.8T  12.1T  9.62T        -         -    21%    55%  1.00x    ONLINE  -
backups  36.2T  23.4T  12.9T        -         -    17%    64%  1.00x  DEGRADED  -
data0    36.2T  17.3T  18.9T        -         -    30%    47%  1.00x    ONLINE  -
data1    36.2T  2.54T  33.7T        -         -     7%     6%  1.00x    ONLINE  -
zroot     107G  64.7G  42.3G        -         -    24%    60%  1.00x    ONLINE  -
which conveniently holds those two datasets. Under normal
circumstances, I would just replace the HDD with the spare I have in my
office. Unfortunately, RMA-ing the failed drive is quite challenging
under these circumstances. In my experience, the 3TB Seagate drives that
shipped with the server five years ago have been nothing but trouble. I
have already replaced 11 of the 36 drives originally shipped. They
will be out of warranty within the next 6 months.
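For anyone who wants to check pool health without eyeballing the listing above, here is a minimal sketch in Python. It assumes the input comes from `zpool list -H -o name,health` (scripted mode, one tab-separated pool per line). The function names and the sample text are made up for illustration, and the sample simply mirrors the pools shown above.

```python
def parse_zpool_health(listing):
    """Map pool name -> health state.

    Expects text in the shape produced by `zpool list -H -o name,health`:
    one pool per line, two whitespace-separated fields.
    """
    health = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            name, state = fields
            health[name] = state
    return health


def unhealthy_pools(listing):
    """Return the sorted names of pools whose state is not ONLINE."""
    pools = parse_zpool_health(listing)
    return sorted(name for name, state in pools.items() if state != "ONLINE")


# Hypothetical sample mirroring the pools in the listing above.
SAMPLE = (
    "archive\tONLINE\n"
    "backups\tDEGRADED\n"
    "data0\tONLINE\n"
    "data1\tONLINE\n"
    "zroot\tONLINE\n"
)

print(unhealthy_pools(SAMPLE))
```

On a live server the same functions could be fed the real command output, for example captured with `subprocess.run`, and hooked into whatever notification mechanism is already in place.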
I have two options:
1. Replace the failed HDD and do lots of praying that another HDD
doesn't die before I can RMA the faulty one. If another one dies, we
will be in the same position a month from now, but I will not have a
spare drive to react with.
2. Pull the trigger and remount the datasets from the backup that I made
on the newest file server, purchased by Dr. Schneider in December. It
could mean a few hours of inconvenience and perhaps a tiny data loss.
Right now, I am even scared to try to zfs replicate the delta since the
last ZFS snapshot, as that could kill another drive, which would degrade
the ZFS pool even further. So theoretically a tiny portion of the work in
the project folder could remain on the decaying ZFS pool, which I will
let rot.
I will not do anything until I hear from Lab elders (Artur and Jeff).
Thirty legacy home directories (zfsauton/home), which are regularly
backed up, are on the same file server. I will probably migrate those
home directories to zfsauton2 (a 20-30 min inconvenience to users), as
the incentive to keep them in their current location is very low.
My plan once we are back to normal operation is to get 36 new 12TB
HDDs and retire those four five-year-old ZFS pools built with
crappy Seagate drives.
Best,
Predrag
---------- Forwarded message ---------
From: <autonlab-sysinfo-request at autonlab.org>
Date: Fri, Apr 17, 2020 at 12:00 PM
Subject: Autonlab-sysinfo Digest, Vol 69, Issue 175
To: <autonlab-sysinfo at autonlab.org>
Send Autonlab-sysinfo mailing list submissions to
autonlab-sysinfo at autonlab.org
To subscribe or unsubscribe via the World Wide Web, visit
https://mailman.srv.cs.cmu.edu/mailman/listinfo/autonlab-sysinfo
or, via email, send a message with subject or body 'help' to
autonlab-sysinfo-request at autonlab.org
You can reach the person managing the list at
autonlab-sysinfo-owner at autonlab.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Autonlab-sysinfo digest..."
Today's Topics:
1. neill-zfs.int.autonlab.org security run output
(auton.sysnotify at gmail.com)
2. SMART error (ErrorCount) detected on host: uranus
(auton.sysnotify at gmail.com)
3. SMART error (CurrentPendingSector) detected on host: uranus
(auton.sysnotify at gmail.com)
4. SMART error (OfflineUncorrectableSector) detected on host:
uranus (auton.sysnotify at gmail.com)
----------------------------------------------------------------------
Message: 1
Date: Fri, 17 Apr 2020 07:01:02 -0000
From: auton.sysnotify at gmail.com
To: sysinfo at autonlab.org
Subject: neill-zfs.int.autonlab.org security run output
Message-ID: <5e99542f.1c69fb81.22168.ff77 at mx.google.com>
Content-Type: text/plain; charset="utf-8"
neill-zfs.int.autonlab.org kernel log messages:
> pid 47721 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 47881 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48047 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48252 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48416 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48576 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48711 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48902 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49068 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49228 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49398 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49558 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49693 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49884 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 50050 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 50257 (winbindd), uid 0: exited on signal 6 (core dumped)
-- End of security output --
------------------------------
Message: 2
Date: Fri, 17 Apr 2020 08:47:45 -0400
From: auton.sysnotify at gmail.com
To: root+ at cs.cmu.edu
Subject: SMART error (ErrorCount) detected on host: uranus
Message-ID: <5e99a571.1a6f0.4349afd1 at uranus.int.autonlab.org>
This message was generated by the smartd daemon running on:
host name: uranus
DNS domain: int.autonlab.org
The following warning/error was logged by the smartd daemon:
Device: /dev/da6 [SAT], ATA error count increased from 0 to 14
Device info:
ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
------------------------------
Message: 3
Date: Fri, 17 Apr 2020 09:17:11 -0400
From: auton.sysnotify at gmail.com
To: root+ at cs.cmu.edu
Subject: SMART error (CurrentPendingSector) detected on host: uranus
Message-ID: <5e99ac57.1a6f7.79b86657 at uranus.int.autonlab.org>
This message was generated by the smartd daemon running on:
host name: uranus
DNS domain: int.autonlab.org
The following warning/error was logged by the smartd daemon:
Device: /dev/da6 [SAT], 16 Currently unreadable (pending) sectors
Device info:
ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
------------------------------
Message: 4
Date: Fri, 17 Apr 2020 09:17:11 -0400
From: auton.sysnotify at gmail.com
To: root+ at cs.cmu.edu
Subject: SMART error (OfflineUncorrectableSector) detected on host:
uranus
Message-ID: <5e99ac57.1a6f9.1d0a864 at uranus.int.autonlab.org>
This message was generated by the smartd daemon running on:
host name: uranus
DNS domain: int.autonlab.org
The following warning/error was logged by the smartd daemon:
Device: /dev/da6 [SAT], 16 Offline uncorrectable sectors
Device info:
ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
------------------------------
End of Autonlab-sysinfo Digest, Vol 69, Issue 175
*************************************************