Fwd: Autonlab-sysinfo Digest, Vol 69, Issue 175

Predrag Punosevac predragp at andrew.cmu.edu
Fri Apr 17 18:24:07 EDT 2020


Dear Autonians,

Please see below. One of the HDDs on the file server hosting
/zfsauton/data and /zfsauton/project has crapped out on me. That, in
turn, degraded one of the ZFS pools

root@uranus:~ # zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
archive  21.8T  12.1T  9.62T        -         -    21%    55%  1.00x  ONLINE    -
backups  36.2T  23.4T  12.9T        -         -    17%    64%  1.00x  DEGRADED  -
data0    36.2T  17.3T  18.9T        -         -    30%    47%  1.00x  ONLINE    -
data1    36.2T  2.54T  33.7T        -         -     7%     6%  1.00x  ONLINE    -
zroot     107G  64.7G  42.3G        -         -    24%    60%  1.00x  ONLINE    -


which conveniently holds those two datasets. Under normal
circumstances, I would just replace the HDD with the spare I have in my
office. Unfortunately, RMA-ing the failed drive is quite challenging under
the current circumstances. In my experience, the 3TB Seagate drives that
were shipped with the server five years ago have been nothing but trouble.
I have already replaced 11 of the 36 drives originally shipped. They
will be out of warranty within the next 6 months.
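
For anyone who wants to check pool health without reading the table by eye, here is a minimal sketch that parses `zpool list` text and reports every pool whose HEALTH column is not ONLINE. The function name `unhealthy_pools` is hypothetical and not part of any lab tooling; the sample text is the output quoted above, and the sketch assumes the default column layout shown there.

```python
# Sketch only: `unhealthy_pools` is a hypothetical helper, not lab tooling.
# It assumes the default `zpool list` layout, where every column is a single
# whitespace-free token (empty values are printed as "-").

ZPOOL_LIST = """\
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
archive  21.8T  12.1T  9.62T        -         -    21%    55%  1.00x  ONLINE    -
backups  36.2T  23.4T  12.9T        -         -    17%    64%  1.00x  DEGRADED  -
data0    36.2T  17.3T  18.9T        -         -    30%    47%  1.00x  ONLINE    -
data1    36.2T  2.54T  33.7T        -         -     7%     6%  1.00x  ONLINE    -
zroot     107G  64.7G  42.3G        -         -    24%    60%  1.00x  ONLINE    -
"""


def unhealthy_pools(zpool_list_output):
    """Return {pool_name: health} for every pool that is not ONLINE."""
    lines = [ln for ln in zpool_list_output.splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    health_idx = header.index("HEALTH")

    result = {}
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= health_idx:
            continue  # skip truncated or wrapped lines rather than guess
        if fields[health_idx] != "ONLINE":
            result[fields[name_idx]] = fields[health_idx]
    return result


if __name__ == "__main__":
    print(unhealthy_pools(ZPOOL_LIST))  # {'backups': 'DEGRADED'}
```

Looking the columns up by header name rather than by fixed position keeps the sketch working if the column order differs between ZFS versions.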

I have two options:

1. Replace the failed HDD and pray that another HDD
doesn't die before I can RMA the faulty one. If another one dies, we
will be in the same position a month from now, but I will not have a
spare drive to react with.

2. Pull the trigger and remount the datasets from the backup which I made
on the newest file server, purchased by Dr. Schneider in December. It
could mean a few hours of inconvenience and perhaps a tiny data loss.
Right now, I am even scared to try to replicate the delta from the
last ZFS snapshot, as that could kill another drive, which would degrade
the ZFS pool even further. So theoretically a tiny portion of the work in
the project folder could remain on the decaying ZFS pool, which I will
let rot.
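
To make "replicating the delta from the last ZFS snapshot" concrete, here is a minimal sketch that only builds the command line for an incremental `zfs send -i` piped into `zfs receive` over ssh; it does not run anything. The dataset, snapshot, and host names are placeholders, since the real ones are not given in this message, and `incremental_send_pipeline` is a hypothetical helper. Note that running such a pipeline reads the changed blocks from the source pool, which is exactly the extra load on the degraded pool that option 2 tries to avoid.

```python
import shlex


def incremental_send_pipeline(dataset, old_snap, new_snap,
                              target_host, target_dataset):
    """Build (but do not execute) an incremental ZFS replication pipeline.

    All arguments are placeholders supplied by the caller; nothing here is
    taken from the actual server configuration.
    """
    # `zfs send -i A B` streams only the blocks that changed between A and B.
    send = ["zfs", "send", "-i",
            f"{dataset}@{old_snap}", f"{dataset}@{new_snap}"]
    # The receiving side applies that stream on top of its copy of snapshot A.
    recv = ["ssh", target_host, "zfs", "receive", target_dataset]
    quote = lambda argv: " ".join(shlex.quote(arg) for arg in argv)
    return quote(send) + " | " + quote(recv)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(incremental_send_pipeline(
        "pool/project", "snap1", "snap2", "newserver", "pool/project"))
```

Returning a string instead of executing the command leaves the decision to run it, and when, with the administrator.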

I will not do anything until I hear from the Lab elders (Artur and Jeff).
Thirty legacy home directories (/zfsauton/home), which are regularly
backed up, are on the same file server. I will probably migrate those
home directories to zfsauton2, as the incentive to keep them in their
current location is very low (the move is a 20-30 min inconvenience to
users).

My plan, once we get back to normal operation, is to buy 36 new 12TB
HDDs and retire those four five-year-old ZFS pools built with
crappy Seagate drives.

Best,
Predrag






---------- Forwarded message ---------
From: <autonlab-sysinfo-request at autonlab.org>
Date: Fri, Apr 17, 2020 at 12:00 PM
Subject: Autonlab-sysinfo Digest, Vol 69, Issue 175
To: <autonlab-sysinfo at autonlab.org>


Today's Topics:

   1. neill-zfs.int.autonlab.org security run output
      (auton.sysnotify at gmail.com)
   2. SMART error (ErrorCount) detected on host: uranus
      (auton.sysnotify at gmail.com)
   3. SMART error (CurrentPendingSector) detected on host: uranus
      (auton.sysnotify at gmail.com)
   4. SMART error (OfflineUncorrectableSector) detected on host:
      uranus (auton.sysnotify at gmail.com)


----------------------------------------------------------------------

Message: 1
Date: Fri, 17 Apr 2020 07:01:02 -0000
From: auton.sysnotify at gmail.com
To: sysinfo at autonlab.org
Subject: neill-zfs.int.autonlab.org security run output
Message-ID: <5e99542f.1c69fb81.22168.ff77 at mx.google.com>
Content-Type: text/plain; charset="utf-8"


neill-zfs.int.autonlab.org kernel log messages:
> pid 47721 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 47881 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48047 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48252 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48416 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48576 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48711 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 48902 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49068 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49228 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49398 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49558 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49693 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 49884 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 50050 (winbindd), uid 0: exited on signal 6 (core dumped)
> pid 50257 (winbindd), uid 0: exited on signal 6 (core dumped)

-- End of security output --



------------------------------

Message: 2
Date: Fri, 17 Apr 2020 08:47:45 -0400
From: auton.sysnotify at gmail.com
To: root+ at cs.cmu.edu
Subject: SMART error (ErrorCount) detected on host: uranus
Message-ID: <5e99a571.1a6f0.4349afd1 at uranus.int.autonlab.org>

This message was generated by the smartd daemon running on:

   host name:  uranus
   DNS domain: int.autonlab.org

The following warning/error was logged by the smartd daemon:

Device: /dev/da6 [SAT], ATA error count increased from 0 to 14

Device info:
ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.


------------------------------

Message: 3
Date: Fri, 17 Apr 2020 09:17:11 -0400
From: auton.sysnotify at gmail.com
To: root+ at cs.cmu.edu
Subject: SMART error (CurrentPendingSector) detected on host: uranus
Message-ID: <5e99ac57.1a6f7.79b86657 at uranus.int.autonlab.org>

This message was generated by the smartd daemon running on:

   host name:  uranus
   DNS domain: int.autonlab.org

The following warning/error was logged by the smartd daemon:

Device: /dev/da6 [SAT], 16 Currently unreadable (pending) sectors

Device info:
ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.


------------------------------

Message: 4
Date: Fri, 17 Apr 2020 09:17:11 -0400
From: auton.sysnotify at gmail.com
To: root+ at cs.cmu.edu
Subject: SMART error (OfflineUncorrectableSector) detected on host:
        uranus
Message-ID: <5e99ac57.1a6f9.1d0a864 at uranus.int.autonlab.org>

This message was generated by the smartd daemon running on:

   host name:  uranus
   DNS domain: int.autonlab.org

The following warning/error was logged by the smartd daemon:

Device: /dev/da6 [SAT], 16 Offline uncorrectable sectors

Device info:
ST4000NM0024-1HT178, S/N:Z4F05P33, WWN:5-000c50-07b5a28c3, FW:SN02, 4.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.


------------------------------


End of Autonlab-sysinfo Digest, Vol 69, Issue 175
*************************************************

