ZFS: three SSDs in a mirror, two have failed and the remaining one also shows errors. How do I recover the data and find the cause?

Senior member: woyaojizhu8

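My laptop is a Dell Precision 7740 with 5 SSDs installed: one is the Windows 10 system disk, one is the Ubuntu 20.04 system disk (the OS I use daily), and the other three are 1 TB SSDs that form a ZFS pool configured as a mirror, which holds my data. After setting it up I used it for a year without ever checking its status, and only recently discovered that two of the disks were FAULTED.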

sudo zpool status -v
pool: tankmain
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 91.2G in 0 days 00:20:39 with 0 errors on Wed Feb 10
17:30:23 2021
config:

NAME STATE READ WRITE CKSUM
tankmain DEGRADED 0 0 0
mirror-0 DEGRADED 28 0 0
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 47 0 220
too many errors
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063 FAULTED 32 0 2
too many errors
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220 FAULTED 22 0 3
too many errors

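All three SSDs were bought new and checked out fine on arrival, and they have only been in use for a year, so I never paid attention to them. I can hardly believe that two of them have failed. The SMART information shows nothing abnormal either. I then rescued part of the data by copying it from the internal SSDs (the zpool) to an external SSD, running rsync -avcXP twice to make sure the data was correct. For some files the first pass reported a verification error (failed verification -- update discarded), while the second pass reported nothing. On inspection, those files do appear to be corrupted. Then I rebooted. The status after the reboot: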
sudo zpool status -v
pool: tankmain
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Feb 14 15:38:53 2021
40.5G scanned at 251M/s, 11.1G issued at 68.8M/s, 755G total
23.1G resilvered, 1.47% done, 0 days 03:04:30 to go
config:

NAME STATE READ WRITE CKSUM
tankmain DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 0 0 0 too many errors
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063 ONLINE 0 0 7 (resilvering)
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220 ONLINE 0 0 11 (resilvering)

After the resilver finished:

sudo zpool status -v
pool: tankmain
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 83.5G in 0 days 00:10:26 with 0 errors on Sun Feb 14 15:49:19 2021
config:

NAME STATE READ WRITE CKSUM
tankmain DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 2 0 9 too many errors
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063 ONLINE 0 0 15
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220 ONLINE 0 0 19

errors: No known data errors
Next I ran a scrub (zpool scrub), and shortly afterwards the latter two SSDs were FAULTED again:

sudo zpool status -v
pool: tankmain
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub in progress since Sun Feb 14 15:56:49 2021
209G scanned at 1.76G/s, 903M issued at 7.59M/s, 755G total
849K repaired, 0.12% done, no estimated completion time
config:

NAME STATE READ WRITE CKSUM
tankmain DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 3 0 9 too many errors (repairing)
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063 FAULTED 32 0 1.90K too many errors (repairing)
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220 FAULTED 64 0 419 too many errors (repairing)

After the repair finished:

sudo zpool status -v
pool: tankmain
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 970K in 0 days 00:29:42 with 213 errors on Sun Feb 14 16:26:31 2021
config:

NAME STATE READ WRITE CKSUM
tankmain DEGRADED 0 0 0
mirror-0 DEGRADED 168 0 0
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 327 0 2.34K too many errors
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063 FAULTED 32 0 690K too many errors
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220 FAULTED 64 0 682K too many errors

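On inspection, a few files seem to have been repaired, but most were not. Is there any other way to recover the damaged files?
Also, regarding the possible causes of this situation: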
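
Since the two rsync passes described earlier disagreed with each other, an independent check of the rescued copy might be useful before trusting it. The following is only a minimal sketch in Python: the function names `sha256_of` and `compare_trees` are made up for illustration, and the paths in the usage note are hypothetical. It hashes every file on the pool and on the external disk and sorts the files into ok / mismatch / unreadable / missing. A matching hash only shows that both copies read the same at that moment; repeat reads can be served from cache, so it does not prove the underlying disk is healthy.

```python
import hashlib
import os
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    except OSError:
        # ZFS normally reports an I/O error when no copy of a block
        # passes its checksum, so such files end up here.
        return None
    return digest.hexdigest()


def compare_trees(source_root, backup_root):
    """Classify each file under source_root against its copy under backup_root."""
    report = {"ok": [], "mismatch": [], "unreadable": [], "missing": []}
    for folder, _dirs, files in os.walk(source_root):
        for name in files:
            source_path = os.path.join(folder, name)
            relative = os.path.relpath(source_path, source_root)
            backup_path = os.path.join(backup_root, relative)
            if not os.path.isfile(backup_path):
                report["missing"].append(relative)
                continue
            source_hash = sha256_of(source_path)
            backup_hash = sha256_of(backup_path)
            if source_hash is None or backup_hash is None:
                report["unreadable"].append(relative)
            elif source_hash == backup_hash:
                report["ok"].append(relative)
            else:
                report["mismatch"].append(relative)
    for bucket in report.values():
        bucket.sort()
    return report


if __name__ == "__main__" and len(sys.argv) == 3:
    result = compare_trees(sys.argv[1], sys.argv[2])
    for status, paths in result.items():
        print(f"{status}: {len(paths)}")
        for path in paths:
            if status != "ok":
                print(f"  {path}")
```

It could be run as, for example, `python3 compare_trees.py /tankmain /mnt/external` (both the script name and the mount points are assumptions). Files listed under mismatch or unreadable are the ones that would need to come from another backup or another replica.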

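1. Memory: my RAM is ECC and not overclocked. When I received it I verified in several ways that it was genuine, so at least it is not a shoddy knock-off. I also ran memtest86 for several days without a single error. However, I sometimes run with memory nearly full; does that matter?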

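2. My household power wiring: the laptop has a battery and the power adapter has enough wattage, so I doubt the supply is falling short, but given what happened over a year ago it is still a possibility. Back then I was using a different laptop, and two internal SSDs failed in quick succession, destroying all my primary data. I did not expect to go through the same thing again a year later. At the time I guessed that I had been too rough when disassembling the laptop: several times I kept working inside it without disconnecting the battery, which may have damaged the power circuitry and made it prone to killing SSDs. I no longer dared to use that laptop as my main machine, so I bought a new one and set up three SSDs as a ZFS mirror, figuring that three would not all fail at once. I have never worked inside the new laptop without first disconnecting the battery. Yet here I am again. Could my household power wiring be the problem? Could it be that different laptops and different power adapters all failed to filter out this power problem, so that SSDs keep getting damaged?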

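3. Laptop motherboard power delivery: could it be that the motherboard's power delivery cannot handle 3 SSDs reading and writing at the same time (in which case the SSD failures on my previous laptop are unrelated to this one)? One symptom that may or may not be related: the output current of the laptop's Thunderbolt and USB ports seems very low, and external drives often fail to be recognized.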

4. Did running defragment and compact operations on my virtual machines' vmdk files corrupt the file system? That seems unlikely; they are just files.

5. A bug in ZFS itself. I have no idea how to investigate this one.

6. This SSD model or batch is defective. The SSDs are Plextor M9P Plus 1 TB. This is also hard to investigate.
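
While testing these possibilities one at a time, it may help to record the per-device READ/WRITE/CKSUM counters from each zpool status run so they can be compared between runs. Below is a rough Python sketch that extracts those rows from pasted status text. The names `parse_count`, `parse_status_rows` and `rows_with_errors` are hypothetical, and treating the K/M/G/T suffixes as powers of 1024 is an assumption, so abbreviated counts such as 2.34K are approximate.

```python
import re

# Matches rows such as:
#   nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 327 0 2.34K too many errors
DEVICE_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+"
    r"(?P<state>ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)\s+"
    r"(?P<read>[\d.]+[KMGT]?)\s+"
    r"(?P<write>[\d.]+[KMGT]?)\s+"
    r"(?P<cksum>[\d.]+[KMGT]?)"
)

# Assumption: suffixes are powers of 1024; the result is approximate either way.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a counter such as '32' or '1.90K' to an integer."""
    if text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def parse_status_rows(status_text):
    """Return one dict per pool, vdev or device row found in the status text."""
    rows = []
    for line in status_text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            rows.append({
                "name": match.group("name"),
                "state": match.group("state"),
                "read": parse_count(match.group("read")),
                "write": parse_count(match.group("write")),
                "cksum": parse_count(match.group("cksum")),
            })
    return rows


def rows_with_errors(rows):
    """Return the names of rows whose counters are not all zero."""
    return [r["name"] for r in rows if r["read"] + r["write"] + r["cksum"] > 0]


if __name__ == "__main__":
    sample = """\
NAME STATE READ WRITE CKSUM
tankmain DEGRADED 0 0 0
mirror-0 DEGRADED 168 0 0
nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 DEGRADED 327 0 2.34K too many errors
"""
    for row in parse_status_rows(sample):
        print(row)
```

Saving the parsed rows with a timestamp after each change (different slot, different power source, one disk at a time) would make it easier to see which change, if any, stops the counters from growing.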
