pool corrupted after hibernation #14118
Comments
Speaking only for myself, my advice was, is, and remains: don't hibernate with ZFS. There is, AFAIK, nothing in the test suite that tests that it works, and it's very easy for the resume process to do the wrong thing and eat your data. Other people disagree with me, but I've had too many people report that their data got eaten after hibernation, and since nobody seems to be actively working on making it less of a footgun, I just advise people not to play with fire for now. cf. #12842, this comment.
I definitely agree with the recommendation, but despite the brittleness this is already "sort of" supported by distributions, so I guess the best we can do is document the possible scenarios. As for me personally, the S3 state overheats the laptop's RAM in my backpack and I can afford to risk data loss, so I'll keep trying.
This turned out to be false: genkernel's initrd does import my root pool before resume; https://gitweb.gentoo.org/proj/genkernel.git/tree/defaults/linuxrc#n657 is where it runs the import. If I were to run two drives in a mirror configuration for the root pool, would that make things even worse for hibernation? Would the extra device make the pool more likely to fail upon resume, or not? (Obviously RAID1 is preferable in many cases, but what about the specific resume case?)
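Roughly, the ordering constraint looks like the sketch below. This is not genkernel's actual linuxrc, just an illustration of why importing before the resume attempt is dangerous; the swap device's major:minor numbers and the pool name are assumptions.

```
#!/bin/sh
# Sketch only: not genkernel's linuxrc; 259:3 and rpool are assumed values.

# 1. Attempt resume FIRST. Writing the swap device's major:minor numbers to
#    /sys/power/resume makes the kernel look for a hibernation image there;
#    if one is found, the hibernated kernel is restored and this script never
#    runs past this line.
echo 259:3 > /sys/power/resume

# 2. Only when no image was found do we fall through to a normal boot and
#    import the root pool. Importing the pool (even transiently) before the
#    resume attempt modifies on-disk state behind the back of the hibernated
#    kernel, which still holds the pool open inside its memory image.
zpool import -N rpool
```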
If it is supported by your distribution, you should open a downstream ticket with them to have it resolved, because the OpenZFS project does not yet support that use case.
System information
Describe the problem you're observing
Zpool corrupted after resume from swap; hardware verified as healthy. Single ZFS disk, separate swap device, no zvols or LUKS.
This is a laptop with ZFS as the root filesystem. The hardware is a single 1 TB NVMe drive with three GPT partitions: EFI, boot pool, and root pool. No LUKS or dm-crypt involved. No separate log devices.
I use hibernation to a separate physical SSD block device, formatted as raw swap, no zvol. Resume is done by Gentoo's standard genkernel initrd built via `genkernel initramfs --zfs`; I don't expect the initrd to be causing it. The hardware is 1 year old, the filesystem was 2 months old, with hibernation in daily use. The corruption occurred after several quick hibernate/resume cycles; I hibernated and resumed the machine maybe five or six times in a row.
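For completeness, each cycle was an ordinary suspend-to-disk; which frontend triggers it is immaterial, but a sketch of the two usual ways:

```
# Trigger suspend-to-disk; the kernel writes the image to the configured
# swap device and powers off.
systemctl hibernate
# Equivalent low-level sysfs interface.
echo disk > /sys/power/state
```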
I attempted rewinding to an earlier transaction group without success. Some zdb info is below.
pool debug info
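The rewind attempts were along the following lines; this is a sketch only, with the pool name rpool assumed and not necessarily the exact flag combinations used.

```
# Dry run: check whether discarding the last few transactions would make the
# pool importable, without actually performing the rewind.
zpool import -F -n rpool
# Recovery import, discarding the last few transactions if necessary.
zpool import -F rpool
# "Extreme" rewind, trying progressively older transaction groups.
zpool import -FX rpool
```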
Eventually I was able to override the errors by disabling `spa_load_verify_data` and `spa_load_verify_metadata`. After that, import would simply freeze indefinitely. However, a read-only import works when used with those overrides.
trace
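Concretely, the overrides and the subsequent read-only import amounted to something like this (a sketch; the pool name rpool is an assumption):

```
# Disable the data/metadata traversal that spa_load performs at import time.
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
# With the overrides in place, a read-write import hangs indefinitely, but a
# read-only import succeeds.
zpool import -o readonly=on rpool
```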
As long as I import read-only, the pool seems to be okay; the failures mentioned in the zdb dump only seem to have broken one disposable dataset.
With read-only access I was able to salvage all my data and migrate the system to a different device. However, the original device containing the crashed pool seems to be in perfect health: I successfully copied the entire device with no errors, and I did not observe any block device error messages.
block device test
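The device check was essentially a full read of the drive plus a look at its health counters; a sketch, with the device path assumed:

```
# NVMe health and error-log counters.
smartctl -a /dev/nvme0n1
# Read every block of the device; any I/O error would show up here and in dmesg.
dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress
```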
I still have the disk and the crashed pool on it available for tests. I have already salvaged all the data, so this seems like a perfect test bed for hunting down a possible bug. To the best of my knowledge the initrd image doesn't attempt mounting the pool like in NixOS/nixpkgs#106093, and I don't think I have any scripts that would modify the pool outside the intended resume sequence. What would be the steps to identify the specific cause of this failure?
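To start, I can run read-only inspection against the untouched pool along these lines (a sketch; the partition and pool names are assumptions):

```
# Dump the vdev labels from the root-pool partition.
zdb -l /dev/nvme0n1p3
# Show the uberblock the exported pool would import at (-e: pool not imported).
zdb -e -u rpool
# Traverse block pointers and verify metadata checksums, read-only.
zdb -e -bc rpool
```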