I built a home NAS two years ago, that was the first COVID summer and I finally had the time. It’s running Proxmox, which is running TrueNAS (then Core, now Scale) as a VM. An HBA card is passed directly to the TrueNAS VM. The HBA card is a Dell PERC H310, but I’ve crossflashed it so that now it shows up as an LSI SAS2008 PCI-Express Fusion-MPT SAS-2. The system originally had five ST4000VN008 disks (4 TB) in a RAIDZ2.
Pretty much from the beginning I noticed the system was spewing out storage related error messages when booting up. ZFS also noticed, but after the TrueNAS VM was completely up, there were no more errors and I quite rarely rebooted or shut down the system, so I wasn’t too worried. The few read errors I got each boot I cleared with
zpool clear, which probably was not the best idea.
Last summer we had very cheap electricity here in Finland, something like 1-3 c/kWh plus transfer and taxes. Well, this summer it can be even 60 c/kWh during the worst times. I started shutting down my NAS when I knew we would not need it for a while. This made the disk issues worse.
I know the high electricity prices are partly due to Russia’s attack in Ukraine and the sanctions against Russia. I completely support Ukraine, they are fighting for the freedom of all of the Eastern EU border states. Please donate to support Ukraine.
TrueNAS keeps only one day of systemd journal data (why?) so I’ve already lost the actual error messages. By going through my Google search history I was able to find some of the errors I got. They were like this:
Unaligned partial completion ... tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE ... print_req_error: critical medium error ...
Because there’s quite a lot of discussion on the web about Ironwolf firmware issues, issues with NCQ etc. I hoped this was something that could be fixed with software. I tried passing many kernel options found by googling to the TrueNAS Scale kernel. I came up with
libata.force=noncq mpt3sas.msix_disable=1 mpt3sas.max_queue_depth=10000. For more discussion on these issues, see here, here, here, here. Seagate has actually released a firmware update from SC60 to SC61 for the larger Ironwolfs, but I have the 4 TB ones without an update available.
None of these options helped. Eventually the whole disk just disappeared. At this point it was clear to me that the issue was not a kernel bug, a disk firmware bug, an HBA firmware bug or anything like that. The disk had been faulty already on arrival.
I noticed Seagate has come up with new versions of the Ironwolfs. The 4 TB version is now ST4000VN006 with 256 MB of cache instead of 64 MB. The new version is also physically thinner and might run cooler. I ordered one of those. Unfortunately the firmware version is still SC60.
I replaced the faulty disk with the new one, ZFS resilvered the pool in about 8 hours and all is good again. I guess the moral of the story is that it seems like a disk could be defective, it probably is and you should start by replacing it.