How to replace a failed disk in a ZFS mirror

I recently built a new file server for my media needs at home, something I've been thinking about doing for years. After many years of using Linux software RAID, I chose ZFS as the storage technology. I went with a mirrored setup for a lot of the reasons outlined in this article: performance, simplicity, and in particular, easy recovery from disk failures.

This is the setup I ended up with according to zpool status.

$ zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

    NAME                                   STATE     READ WRITE CKSUM
    storage                                ONLINE       0     0     0
      mirror-0                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGASE7L  ONLINE       0     0     0
      mirror-1                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL  ONLINE       0     0     0
      mirror-2                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD982X  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAG9X8YL  ONLINE       0     0     0

errors: No known data errors

Well, no sooner had I completed the ZFS setup (a very straightforward process) than one of my disks started reporting SMART errors. I don't think a disk that is weeks old should do this, so I decided to start the RMA process.

And this is how I replaced the disk.

Replacing the disk

I started by physically removing the old disk and replacing it with a brand new one. I originally set up my pool using the disk IDs from /dev/disk/by-id/, so identifying the failed disk was very easy, as the serial number is part of the device name.
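As a rough illustration of why by-id names make this easy, here is a small sketch that pulls the serial number out of a device name so it can be compared with the label printed on the drive. The function name serial_from_id is made up for this example, and it assumes the serial is whatever follows the last underscore, which holds for the names in this post but may not for every vendor.

```shell
# Print the serial number embedded in a /dev/disk/by-id name.
# Assumption: the serial is the text after the last underscore.
serial_from_id() {
    name=${1##*/}        # drop any leading directory
    name=${name%-part*}  # drop a trailing -partN suffix, if present
    printf '%s\n' "${name##*_}"
}

serial_from_id /dev/disk/by-id/ata-WDC_WD80EFAX-68KNBN0_VAG9X8YL-part1
```

Running readlink -f on a by-id path shows which kernel device (for example /dev/sdX) it currently points to, which can help when checking cabling.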

Once I started back up, I ran zpool status and saw this output.

$ zpool status
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

    NAME                                   STATE     READ WRITE CKSUM
    storage                                DEGRADED     0     0     0
      mirror-0                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGASE7L  ONLINE       0     0     0
      mirror-1                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL  ONLINE       0     0     0
      mirror-2                             DEGRADED     0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD982X  ONLINE       0     0     0
        18311740819329882151               UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EFAX-68KNBN0_VAG9X8YL-part1

errors: No known data errors

ZFS noticed that it had a missing disk, and was now in a DEGRADED state, but crucially, everything was still working and available.
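With more disks, spotting the one bad row by eye gets harder. The following is a minimal sketch that filters zpool status output down to the rows that are not ONLINE. The function name unhealthy_vdevs is hypothetical, and the parsing assumes the config table layout shown above (a NAME/STATE header row, ending at the "errors:" line).

```shell
# Read `zpool status` text on stdin and print "NAME STATE" for every
# row of the config table whose state is not ONLINE.
unhealthy_vdevs() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The errors line marks the end of the table for this pool.
        /^errors:/ { in_config = 0 }
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}
```

It would be used as zpool status | unhealthy_vdevs. For the output above it should list the pool, mirror-2, and the missing disk's numeric ID.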

The next step was to find out what the new device is called. I did this by listing /dev/disk/by-id/, filtered to the ata entries, and seeing which disk was new.

$ ls -1 /dev/disk/by-id/ | grep ata
ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL
ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL-part1
ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL-part9
ata-WDC_WD80EFAX-68KNBN0_VAGASE7L
ata-WDC_WD80EFAX-68KNBN0_VAGASE7L-part1
ata-WDC_WD80EFAX-68KNBN0_VAGASE7L-part9
ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F
ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX
ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX-part1
ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX-part9
ata-WDC_WD80EFZX-68UW8N0_VJHD982X
ata-WDC_WD80EFZX-68UW8N0_VJHD982X-part1
ata-WDC_WD80EFZX-68UW8N0_VJHD982X-part9
ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX
ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX-part1
ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX-part9

The new disk is ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F. It stands out in this example because all the other disk serial numbers start with "V", and it is the only disk without -part1 and -part9 entries.
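The listing also suggests a way to find the new disk without relying on serial numbers: the disks already in the pool each have -part1 and -part9 entries, while the blank replacement has none. Here is a small sketch of that idea. The function name unpartitioned_disks is invented for this example, and it assumes the new disk really is blank, which will not be true for a reused drive.

```shell
# Read by-id names on stdin (one per line) and print the whole-disk
# names that have no matching "-partN" entries.
unpartitioned_disks() {
    awk '
        # Remember the base name of every partition entry.
        /-part[0-9]+$/ {
            base = $0
            sub(/-part[0-9]+$/, "", base)
            has_part[base] = 1
            next
        }
        # Everything else is a whole-disk entry; keep input order.
        { disks[++n] = $0 }
        END {
            for (i = 1; i <= n; i++)
                if (!(disks[i] in has_part)) print disks[i]
        }
    '
}
```

It would be used as ls -1 /dev/disk/by-id/ | grep '^ata-' | unpartitioned_disks.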

I now needed to tell ZFS to replace the missing disk with this one.

$ sudo zpool replace -f storage 18311740819329882151 /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F
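The command takes three arguments: the pool name, the old device (here the numeric ID that zpool status printed for the missing disk), and the new device. Since a mistake here is easy to make, this sketch only prints the command for review rather than running it. The function name print_replace_cmd is hypothetical, and it leaves out sudo and the -f flag used above, so zpool can still refuse a disk that appears to be in use.

```shell
# Print (do not run) a zpool replace command for review.
# Arguments: pool name, old device name or ID, by-id name of the new disk.
print_replace_cmd() {
    printf 'zpool replace %s %s /dev/disk/by-id/%s\n' "$1" "$2" "$3"
}

print_replace_cmd storage 18311740819329882151 ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F
```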

ZFS automatically started the resilvering process (copying data to the new disk). I wasn't sure how long that would take...

$ zpool status
  pool: storage
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 15 17:01:06 2018
    7.97G scanned out of 7.51T at 233M/s, 9h22m to go
    2.56G resilvered, 0.10% done
config:

    NAME                                     STATE     READ WRITE CKSUM
    storage                                  DEGRADED     0     0     0
      mirror-0                               ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX    ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGASE7L    ONLINE       0     0     0
      mirror-1                               ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX    ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL    ONLINE       0     0     0
      mirror-2                               DEGRADED     0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD982X    ONLINE       0     0     0
        replacing-1                          DEGRADED     0     0     0
          18311740819329882151               UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD80EFAX-68KNBN0_VAG9X8YL-part1
          ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F  ONLINE       0     0     0  (resilvering)

errors: No known data errors
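To keep an eye on progress without reading the whole report each time, the percentage can be pulled out of the scan section. This is a minimal sketch that assumes the "resilvered, N% done" wording shown above. Other ZFS versions may word this line differently, and resilver_percent is a name made up for this example.

```shell
# Read `zpool status` text on stdin and print the resilver percentage,
# e.g. "0.10". Prints nothing if no resilver progress line is found.
resilver_percent() {
    sed -n 's/.*resilvered, \([0-9.]*\)% done.*/\1/p'
}
```

It would be used as zpool status storage | resilver_percent, for example inside a watch or a loop with sleep.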

The resilvering completed in 5 hours and 53 minutes, a figure I'm very satisfied with. In a mirrored setup, the affected mirror has no redundancy until resilvering completes, so the quicker, the better.
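For a sense of the average throughput, the figures from the final status report (2.50T resilvered in 5h53m) can be turned into a rate. This is a back-of-the-envelope sketch that assumes the "T" in the report means TiB (1024-based units). The function name avg_rate_mib is invented for this example.

```shell
# Average rate in MiB/s for a given amount (in TiB) and duration.
# Arguments: TiB, hours, minutes.
avg_rate_mib() {
    awk -v tib="$1" -v h="$2" -v m="$3" 'BEGIN {
        mib  = tib * 1024 * 1024   # TiB -> MiB
        secs = h * 3600 + m * 60   # duration in seconds
        printf "%.0f\n", mib / secs
    }'
}

avg_rate_mib 2.50 5 53   # prints 124
```

That works out to roughly 124 MiB/s on average. The early 9h22m estimate was presumably longer because it was measured against the 7.51T total in the pool, whereas only the 2.50T belonging to the affected mirror had to be copied.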

$ zpool status
  pool: storage
 state: ONLINE
  scan: resilvered 2.50T in 5h53m with 0 errors on Thu Nov 15 22:54:41 2018
config:

    NAME                                   STATE     READ WRITE CKSUM
    storage                                ONLINE       0     0     0
      mirror-0                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHDBDGX  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGASE7L  ONLINE       0     0     0
      mirror-1                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD6BAX  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAGA5BPL  ONLINE       0     0     0
      mirror-2                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJHD982X  ONLINE       0     0     0
        ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F  ONLINE       0     0     0

errors: No known data errors

ZFS is easy to set up and use for the most part. It feels solid. Stable. If all disk replacements are this easy, I will be very happy.