TrueNAS: Replacing a Failed Disk in a Degraded Pool

Identify the failed drive, source a replacement, perform the swap, kick off the resilver, and verify the pool is healthy again on TrueNAS SCALE and CORE.

By TrueNASGuide Editorial · 8 min read

A DEGRADED pool status looks alarming the first time you see it. It is also, in most home setups, the most boring possible failure mode: one drive has dropped out, ZFS is still serving every byte of your data correctly from the remaining redundancy, and you have time to do this properly. The goal of this guide is to keep it boring.

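We will cover identifying the failed disk, sourcing a replacement, performing the swap, kicking off the resilver, and verifying the pool is healthy. The menu paths and shell commands shown are for TrueNAS SCALE. On CORE the menu paths differ, and because CORE runs on FreeBSD, Linux-only tools such as lsblk are not available and disks appear as /dev/adaX or /dev/daX rather than /dev/sdX.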

First, do not panic

A DEGRADED pool means redundancy has been reduced, not lost. On a RAIDZ2 vdev with one failed disk, you still have one disk of fault tolerance remaining. On a mirror with one half failed, you still have a complete copy. Your data is intact and accessible.

What you should not do, in order of how much damage they can cause:

  1. Reboot to “see if it fixes itself.” It will not, and a reboot can hide useful diagnostic state.
  2. Run a scrub immediately. A scrub on a degraded pool puts extra load on the surviving drives at exactly the moment you need them to be reliable. Replace first, then let the resilver do the verification work.
  3. Pull a drive without confirming which serial number is the failed one. Pulling a healthy drive from an already degraded single-parity vdev takes the pool offline and puts the data at risk. Always confirm by serial.

What you should do, in order:

  1. Confirm pool status and identify the failed device.
  2. Confirm the failure with SMART data.
  3. Source a replacement of the same or larger capacity.
  4. Physically replace the drive (hot-swap if your backplane supports it).
  5. Tell ZFS about the replacement and let it resilver.
  6. Verify the pool is healthy.

Step 1: Identify the failed device

In the TrueNAS web UI, go to Storage and look at the pool topology. A failed drive will appear with state FAULTED, REMOVED, UNAVAIL, or DEGRADED depending on how it failed. Click into the pool to see the per-vdev breakdown.

For a more complete picture, use the shell:

zpool status -v tank

The output names the offending device by its TrueNAS device path (typically /dev/disk/by-partuuid/<uuid> or /dev/sdX). What you actually care about is the drive’s serial number, because the device path is not what you read off the physical drive in the chassis.
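On a pool with many disks, it can help to filter the status table down to the rows that are not ONLINE. The following is a minimal sketch rather than an official tool: unhealthy_rows is a hypothetical helper name, and it assumes the usual NAME STATE READ WRITE CKSUM column layout. It prints the pool and vdev rows as well as the leaf device, since all of them report a non-ONLINE state when a member fails.

```shell
# Hypothetical helper: print NAME and STATE for every row of the
# `zpool status` config table whose state is not ONLINE.
# Assumes the usual "NAME STATE READ WRITE CKSUM" column layout;
# NF >= 5 skips the "state: DEGRADED" summary line near the top.
unhealthy_rows() {
    awk 'NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|REMOVED|UNAVAIL|OFFLINE)$/ { print $1, $2 }'
}

# Usage on a live system (pool name "tank" as in this guide):
#   zpool status -v tank | unhealthy_rows
```

The innermost row in the output is the device to map to a serial number.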

Map device to serial:

lsblk -o NAME,SERIAL,SIZE,MODEL

Or for a single device:

smartctl -i /dev/sdc

Write down the serial number of the failed drive. You will need it before you open the chassis.
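If you would rather script the lookup than read it off a table, a small filter is enough. This is a sketch under stated assumptions: serial_for is a hypothetical helper name, and it expects whitespace-separated NAME and SERIAL pairs of the kind lsblk prints on SCALE.

```shell
# Hypothetical helper: read "NAME SERIAL" pairs on stdin and print the
# serial for one device name (given without the /dev/ prefix).
serial_for() {
    awk -v dev="$1" '$1 == dev { print $2 }'
}

# Usage on TrueNAS SCALE; -d skips partitions, -n drops the header:
#   lsblk -dn -o NAME,SERIAL | serial_for sdc
```

An empty result means the device name was not found or the drive did not report a serial, in which case fall back to smartctl -i.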

The zpool status output also tells you the failure mode through the STATE column, the READ, WRITE, and CKSUM error counters, and any note printed at the end of the device row:

  • DEGRADED or FAULTED with too many checksum errors — the drive is returning bad data. ZFS usually marks such a device DEGRADED rather than FAULTED. SMART may still say “PASSED.” Replace it.
  • FAULTED with too many read/write errors — the drive is failing to complete IO. Often a cable, sometimes the drive. Investigate the cable before assuming the drive.
  • REMOVED — the device disappeared from the system entirely. Almost always a cable, backplane, or drive failure; sometimes a controller hiccup.
  • UNAVAIL — ZFS cannot open the device. Similar to REMOVED.

Step 2: Confirm with SMART

Before pulling anything, get the SMART report from the suspect drive:

smartctl -a /dev/sdc

The OpenZFS man page for zpool-status notes that STATE reflects the device’s current condition and is the authoritative pool-side signal; SMART tells you what the drive itself thinks. (OpenZFS: zpool-status man page)

Look at:

  • Reallocated_Sector_Ct — non-zero is a yellow flag; rising over time is a red flag.
  • Current_Pending_Sector — non-zero means the drive has sectors it cannot read and is waiting to remap. Any non-zero value is concerning.
  • Offline_Uncorrectable — unrecoverable errors. Replace.
  • Reported_Uncorrect — uncorrectable errors reported to the host. Replace.
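The four attributes above can be pulled out of the report with a short filter. This is a minimal sketch, not part of smartmontools: smart_red_flags is a hypothetical helper name, and it assumes the classic ATA attribute table in which the raw value is the tenth column. NVMe and SAS drives report health in a different format and will produce no output here.

```shell
# Hypothetical helper: from `smartctl -A` output on stdin, print the
# watched attributes whose raw value (10th column) is greater than zero.
# Assumes the ATA attribute table layout:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_red_flags() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage:
#   smartctl -A /dev/sdc | smart_red_flags
```

No output means none of the watched counters is above zero. That is not a clean bill of health on its own, for the reason given in the next paragraph.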

The smartmontools documentation covers attribute interpretation in depth. (smartmontools wiki)

If SMART says “PASSED” but ZFS has flagged the drive with checksum errors, trust ZFS. Drive firmware is conservative about declaring itself bad, and ZFS sees actual data corruption that SMART has no visibility into.

If SMART is itself unreachable (drive does not respond), the drive is essentially dead. Replace.

Step 3: Source the replacement

The replacement drive must be at least as large as the failed drive. ZFS allows replacing with a larger drive, which can later expand the vdev if all members are upgraded (autoexpand). It does not allow replacing with a smaller drive.
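Drives sold with the same nominal capacity can differ by a few megabytes, so it can be worth comparing exact byte counts before you commit. The sketch below is illustrative only: big_enough is a hypothetical helper name, and comparing raw device sizes is an approximation of the check ZFS itself performs when you run the replace.

```shell
# Hypothetical helper: succeed if the replacement (first argument) is at
# least as large as the disk it replaces (second argument), in bytes.
big_enough() {
    [ "$1" -ge "$2" ]
}

# Usage on SCALE; -b prints bytes, -d skips partitions, -n drops the header.
# If the failed disk no longer responds, measure a surviving member instead.
#   new=$(lsblk -bdn -o SIZE /dev/sdX | tr -d ' ')
#   old=$(lsblk -bdn -o SIZE /dev/sdc | tr -d ' ')
#   big_enough "$new" "$old" && echo "size OK"
```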

A few practical notes:

  • CMR over SMR. Shingled Magnetic Recording drives perform poorly with ZFS resilver workloads and have been known to drop out mid-resilver, which on a single-parity vdev means data loss. Confirm the replacement is CMR. Manufacturer datasheets list this; community-maintained lists fill in the gaps for older models.
  • Same family is convenient but not required. Mixed vendors and models work fine in a ZFS vdev. The only hard requirement is capacity.
  • New is safer than used for a single-disk replacement. A used drive with unknown power-on hours that you introduce into a degraded vdev is two unknowns stacked on top of each other.

If you keep a cold spare on the shelf, this is what it is for.

Step 4: Physical replacement

If your chassis supports hot-swap and you trust your backplane:

  1. In the TrueNAS UI, go to Storage → the pool → the vdev with the failed disk and click on the failed disk’s row. Note the slot identifier the UI shows.
  2. Physically locate the disk in the chassis using the serial number you wrote down. Confirm the serial on the label matches before pulling.
  3. Pull the failed drive and insert the replacement in the same slot.
  4. Wait 10–15 seconds for the system to detect the new drive. Confirm with lsblk that it appears.
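One way to confirm that the system really sees the new drive is to compare the list of serial numbers before and after the swap. This is a sketch under assumptions: new_serials is a hypothetical helper name and the /tmp file names are placeholders. It compares serials rather than device names because the kernel may hand the old /dev/sdX name to the new drive.

```shell
# Hypothetical helper: print serials that appear in the "after" list but
# not in the "before" list (one serial per line in each file).
# -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
new_serials() {
    grep -vxF -f "$1" "$2"
}

# Usage on SCALE:
#   lsblk -dn -o SERIAL > /tmp/before.txt   # before pulling the drive
#   lsblk -dn -o SERIAL > /tmp/after.txt    # after inserting the replacement
#   new_serials /tmp/before.txt /tmp/after.txt
```

If nothing is printed after the wait, the drive has not been detected yet; check seating and the backplane before going further.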

If your chassis does not support hot-swap (or you do not trust the backplane), shut down cleanly first:

shutdown -h now

Replace the drive while powered off, then boot.

Step 5: Tell ZFS to replace

Once the new disk is present, kick off the replace operation. In the TrueNAS UI:

  1. Storage → pool → vdev → failed disk row → Replace.
  2. Select the new disk from the dropdown.
  3. Confirm.

From the shell, the equivalent is:

zpool replace tank <old-device-id> /dev/sdX

Where <old-device-id> is the identifier from zpool status and /dev/sdX is the new drive. The OpenZFS zpool-replace man page is the canonical reference for the syntax and edge cases. (OpenZFS: zpool-replace man page) The TrueNAS docs cover the UI-driven version in detail. (TrueNAS SCALE: Replacing Disks)

The resilver starts immediately. Monitor progress:

zpool status tank

You will see a scan: resilver in progress line with percent complete and an ETA. On a home NAS with 8 TB drives and a moderately full vdev, expect 12–36 hours. Resilver time scales with used capacity on the vdev, not raw drive size, so a half-full pool resilvers in roughly half the time of a full one.
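To log progress without reading the full status output each time, the percentage can be extracted with sed. This is a minimal sketch: resilver_percent is a hypothetical helper name, and it assumes the progress line contains the text "% done", which is how recent OpenZFS releases print it. The exact wording may differ between versions.

```shell
# Hypothetical helper: print the resilver percentage from `zpool status`
# output on stdin. Prints nothing if no "<number>% done" text is found,
# for example once the resilver has finished.
resilver_percent() {
    sed -n 's/.*[ ,]\([0-9][0-9.]*\)% done.*/\1/p'
}

# Usage:
#   zpool status tank | resilver_percent
```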

What to do during the resilver

  • Let it run. Avoid heavy reads or writes if you can; they do not break the resilver but they slow it down.
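  • Do not try to start a scrub. A scrub and a resilver share the same scan machinery, so current OpenZFS releases normally refuse the request while a resilver is running, and the resilver already verifies the data it rebuilds.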
  • Do not yank another drive. If a second drive fails mid-resilver on a single-parity vdev (RAIDZ1 or a two-way mirror), the pool is lost.

If you have email alerting configured (which you should), TrueNAS will notify you when the resilver completes.

Step 6: Verify

After the resilver finishes, run:

zpool status tank

You want to see state: ONLINE, no errors, and a scan: resilvered <X> in <time> with 0 errors line. If the line shows non-zero errors, ZFS encountered checksum mismatches during the resilver — usually old corruption from before the replacement, now repaired. Note the number; it is informational rather than urgent at this point.
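The two key lines can also be checked from a script, for example in a cron job that sends a notification. This is a sketch, not a replacement for reading the output yourself: pool_is_healthy is a hypothetical helper name, and it only checks for "state: ONLINE" and "errors: No known data errors" as printed by zpool status.

```shell
# Hypothetical helper: exit 0 if `zpool status` output on stdin reports
# "state: ONLINE" and "errors: No known data errors", otherwise exit 1.
pool_is_healthy() {
    awk '
        $1 == "state:"  && $2 == "ONLINE"              { state_ok = 1 }
        $1 == "errors:" && $0 ~ /No known data errors/ { errors_ok = 1 }
        END { exit !(state_ok && errors_ok) }
    '
}

# Usage:
#   zpool status tank | pool_is_healthy && echo "tank looks healthy"
```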

Now is the right time to start a scrub manually:

zpool scrub tank

A post-replacement scrub gives you confidence that the rebuilt vdev is consistent end-to-end. It can run with the pool in production; it just adds load. See our scrub and SMART guide for what a healthy scrub output looks like.

Edge cases

Resilver pauses or restarts: if another drive throws errors during the resilver, ZFS may pause or restart the resilver from scratch. This is rare but extremely unwelcome. The fix is to investigate the second drive (cable? failing?) and decide whether to push through or replace it too.

Replacement drive itself is bad: new drives arrive dead or fail early at a non-trivial rate. If your new drive throws errors within the first hours of the resilver, replace it before continuing.

Pool is FAULTED, not DEGRADED: if you lost more drives than your vdev’s parity allows (e.g., two drives in a RAIDZ1, three in a RAIDZ2), the pool is offline and data recovery is a separate, much harder topic. The short version: stop everything, do not write to the pool, consult the OpenZFS docs on import recovery, and consider professional data recovery if the data is irreplaceable.

What this teaches you about pool design

If a disk replacement felt nerve-wracking, that is a signal about your pool topology. RAIDZ1 has exactly one disk of fault tolerance — every resilver is a single-fault-distance-from-data-loss event. RAIDZ2 and 3-way mirrors give you a buffer; you can survive a second failure during resilver. For a home NAS rebuilt today on 8 TB+ drives, RAIDZ2 is the conservative default precisely because resilvers are long and a second failure is not implausible.

The other lesson: a tested backup is the thing that makes any of this calm. If you know with confidence that your replicated copy is current and restorable, even a worst-case “pool is lost” outcome is recoverable. Without that, every drive failure is high-stakes.

Concrete checklist

  1. zpool status -v tank — confirm which device failed and how.
  2. lsblk -o NAME,SERIAL — map the failed device to a physical serial.
  3. smartctl -a /dev/sdX — confirm the drive’s own assessment.
  4. Source a CMR replacement of equal or greater capacity.
  5. Replace physically, matching slot and confirming serial.
  6. Storage → vdev → failed disk → Replace in the UI, or zpool replace.
  7. Let the resilver run. Do not start a scrub. Do not pull anything.
  8. After resilver, run a scrub. Verify clean.
  9. Update your spares inventory. Order a new cold spare if you used the only one.

Sources

  1. OpenZFS: zpool-replace(8) man page
  2. OpenZFS: zpool-status(8) man page
  3. TrueNAS SCALE: Replacing Disks documentation
  4. smartmontools documentation wiki