
Support freeze/thaw #260

Closed
behlendorf opened this issue May 31, 2011 · 78 comments
Labels
Type: Feature Feature request or new feature

Comments

@behlendorf
Contributor

ZFS has hooks for suspending the filesystem but they have not yet been integrated with their Linux freeze/thaw counterparts. This must be done before you can safely hibernate a system running ZFS.

@devsk

devsk commented May 31, 2011

This is a MUST if I ever want to be able to use ZFS on my laptop. Even if not as rootfs, I would still want to use ZFS for my other filesystems on the laptop.

It is possible that, back when my rootfs was on ZFS on my laptop, the issues I faced were because of this.

@devsk

devsk commented Jun 8, 2011

Brian, this is a much-needed feature. Any idea how much work it is?

@behlendorf
Contributor Author

I haven't carefully scoped it, but my gut tells me it's probably not that much work. Why is this feature so critical? It's really only important for laptops, correct? (Which I agree is important if you're using a laptop.)

What needs to be done here is to tie the Linux freeze/unfreeze hooks to the zfs_suspend_fs()/zfs_resume_fs() functions in zfs_vfsops.c. That should be just a couple lines of code, but then we need to review that change and make sure it's working as expected. Plus there will be the needed compatibility code for older kernels.

I'm not going to be able to get to this anytime soon but if you want to dig in to it I'm happy to review changes and comment. But I don't have time for the actual leg work on this right now.

@devsk

devsk commented Jun 8, 2011

It is important because: 1. I absolutely need it on the laptop; 2. I need it on my desktop, which has been suspending to RAM/disk every night for the last 7 years. It is the best of both worlds: I save energy, I don't heat up my room in summer, and I get to restore my desktop workspaces just as they were the previous day.

Native ZFS has broken that tradition for me. And I would never want to blame ZFS for anything...;-)

I will dig into it though to see if I can come up with a patch for you.

@kohlschuetter
Contributor

This feature can be very important for home NAS environments, too.

These boxes are kept idling most of the time anyway, and S2R/hibernation can save a significant amount of power (about 15 W with my setup).

I encourage implementing this; maybe it is the missing link to get suspend-to-RAM fully working on my Zotac Fusion NAS :)

@kohlschuetter
Contributor

To easily test freeze/thaw, we could use xfs_freeze (from xfsprogs). It is documented to work on other filesystems, too. Currently, of course, it reports that it is unable to freeze ZFS.
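
For illustration, the same path can be exercised directly with the FIFREEZE/FITHAW ioctls that xfs_freeze and fsfreeze issue under the hood; a minimal userspace sketch (pass the mountpoint of the dataset you want to test):

    /* Issue the same FIFREEZE/FITHAW ioctls that xfs_freeze and fsfreeze use. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char *argv[])
    {
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                    return (1);
            }

            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return (1);
            }

            if (ioctl(fd, FIFREEZE, 0) != 0)        /* block writes, flush dirty data */
                    perror("FIFREEZE");
            else if (ioctl(fd, FITHAW, 0) != 0)     /* allow writes again */
                    perror("FITHAW");

            close(fd);
            return (0);
    }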

@kohlschuetter
Contributor

So, would this do?

    .freeze_fs  = zpl_freeze,
    .unfreeze_fs    = zpl_unfreeze,

into zpl_super.c's const struct super_operations zpl_super_operations

with

static int zpl_freeze(struct super_block *sb) {
        zfs_sb_t *zsb = sb->s_fs_info;
        return zfs_suspend_fs(zsb);
}

static int zpl_unfreeze(struct super_block *sb) {
        zfs_sb_t *zsb = sb->s_fs_info;
        const char *osname = /* what goes in here? */;
        return zfs_resume_fs(zsb, osname);
}

What about returned error codes? Are they compatible?

@behlendorf
Contributor Author

That's going to be the gist of it. However, the devil is in the details, and that's why this isn't a trivial change. The questions you're asking are the right ones, but someone needs to sit down and read through the code to get the right answers. A few things to be careful of:

  • Ensure you negate the zfs_suspend_fs()/zfs_resume_fs() return codes. Solaris internally uses positive errno values, Linux uses negative ones. This inversion is handled uniformly in the ZPL layer for consistency. See zpl_sync_fs() as an example of this; in fact, all the zpl_* wrapper functions do this. Make sure you add the ASSERT3S(error, <=, 0); (see the sketch after this list).
  • Since the zfs_suspend_fs()/zfs_resume_fs() functions don't take a credential, you won't need to worry about handling a cred_t.
  • zfs_resume_fs() takes the object set name as a second argument to reopen the dataset. We may need to stash that information in the zfs_sb_t when suspending and closing the dataset. Under Solaris the VFS layer would provide it, but under Linux we're on our own.
  • Verify that the negated return codes from Solaris are going to cause reasonable behavior when returned to the Linux VFS.
  • Add any needed compatibility code for older kernel versions back to 2.6.26; this API has changed a little bit, I believe.
  • Absolutely run the xfs_freeze tests from xfsprogs and see how it goes. In fact, I'd love to see the full results from xfsprogs; I haven't yet run that test suite over the ZFS code.
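
For illustration, a minimal sketch of such a wrapper following the zpl_* convention described in the first point; nothing here is final, and the compat shims are omitted:

    static int zpl_freeze(struct super_block *sb)
    {
            zfs_sb_t *zsb = sb->s_fs_info;
            int error;

            /* zfs_suspend_fs() returns a positive Solaris-style errno;
             * negate it for the Linux VFS and assert the convention. */
            error = -zfs_suspend_fs(zsb);
            ASSERT3S(error, <=, 0);

            return (error);
    }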

@kohlschuetter
Contributor

Some updates on this feature in my branch: https://github.com/kohlschuetter/zfs/commits/freeze (see kohlschuetter@f9e8ae5 )

Freeze/unfreeze seems to work with Linux >= 2.6.35 and silently fails with 2.6.32/RHEL6. I haven't tried it with earlier kernels, though.

Before 2.6.35, freeze requires a block device set in the superblock, which zfs does not provide. The RHEL6 kernel can be patched easily by back-porting a few changes.

With a compatible kernel, freezing/unfreezing seems to work with xfs_freeze, but unfreezing fails with util-linux's fsfreeze (you can freeze with fsfreeze -f, but unfreezing only works with xfs_freeze -u). The reason is that the filesystem really freezes completely: you cannot even perform an fstat (which fsfreeze does before both freeze and unfreeze).

I am not sure about the expected behavior here. Changes to the freeze behavior are in fact outside the scope of this patch; they should probably be made at the ZFS suspend/resume level.

@baryluk

baryluk commented Sep 1, 2011

I actually think freeze/thaw is more important for backup scenarios, and for cases where the underlying storage has its own snapshot/cloning mechanism (like iSCSI, LVM on a local or remote machine, or snapshots of a zvol exported over iSCSI, etc.).

Freeze will make sure the underlying devices are in a consistent state, that all direct/synchronous data has actually been pushed to the devices, and it will block all processes from further writes to the whole filesystem (this constraint could be relaxed as long as there is enough memory and no fsync/fdatasync/create/close/unlink/rename etc. is performed; those should block only if actual write I/O would need to be issued). After successfully freezing the filesystem, one can safely create a snapshot/clone on the storage (LVM snapshot, zvol snapshot, NetApp snapshot), then unfreeze ZFS and use the snapshot for something (like dumping it to a tape streamer or another machine).

@behlendorf
Contributor Author

@baryluk: Yes, after more carefully reviewing the kernel code you're exactly right. In fact, supporting freeze/thaw appears to be useful only if you want to support md/lvm-style snapshots under the ZFS vdev layer. That's something we probably don't care about. This support isn't needed for proper suspend/resume behavior. In fact, upon further inspection the filesystem doesn't really need to do anything to support this. So my question is... in practice, why doesn't it work today?

@kohlschuetter
Contributor

So we actually have two problems now:

  1. Suspend/resume doesn't work regardless of freeze/unfreeze.
  2. Freeze now works, unfreeze hangs the FS because fstat hangs on the frozen FS.

@paulhandy

Has there been any progress on this issue in the last 3 years?

@behlendorf
Contributor Author

@paulhandy this hasn't been a priority for any of the developers. In part this has been because the interfaces provided for freeze/thaw by the Linux kernel have been horrible until fairly recently. I don't think this would be a ton of work if people were interested in working on it.

@cyberius0

Since I started using ZFS I have used a custom pm-sleep script, which exports the pool before the system enters sleep mode. So I guess there is no better way to do it?

@kernelOfTruth
Contributor

@cyberius0 so you basically log out of X and run the script to initiate suspend?

I've been wondering how to do this if /home is on a zpool - but there's probably only this way.

@cyberius0

Sorry, I didn't read the whole thread; my /home isn't on a zpool. The zpool is mounted at /RAID.
Without the export before going to suspend, the filesystem "freezes". Then every attempt to access it, e.g. "ls /RAID/", leads to a frozen console/shell, and I have to reboot the system to access the RAID again.

@ccic

ccic commented Jan 22, 2017

@kohlschuetter and @behlendorf , the above implementation has two issues:

  1. When the call to zfs_suspend_fs returns, the caller still holds two locks: 'z_teardown_lock' and 'z_teardown_inactive_lock'. The process cannot exit while holding those locks, otherwise it causes issues. So a possible modification is to call zfs_suspend_fs, sleep for a specified duration, and then call zfs_resume_fs to release those locks. In other words, call zfs_suspend_fs and zfs_resume_fs in pairs so the locks are taken and released correctly.
  2. zfs_suspend_fs/zfs_resume_fs is not sufficient to freeze/thaw the pool. Consider a scenario where several filesystems share one pool: if you suspend writes from some of the filesystems, the other filesystems are still allowed to write to the pool. Moreover, even while a filesystem is suspended, setting pool properties is still allowed. So this is not a "real" freeze of the disk. We should explore other methods, for example freezing the uberblock, to get a "real" freeze.

@behlendorf
Contributor Author

@ccic thanks for taking the time to investigate how this functionality could be implemented. The changes proposed here should be considered an initial prototype/WIP to help us investigate the various issues. Unfortunately, adding this functionality hasn't been a priority for the team. Regarding your specific concerns:

  1. When the call to zfs_suspend_fs returns, the caller still holds two locks: 'z_teardown_lock' and 'z_teardown_inactive_lock'. The process cannot exit while holding those locks,

Good point. So one possible avenue worth exploring might be to have a freeze take a snapshot and then use the existing rollback code to effectively pivot on to that snapshot. That would allow us to use the existing model of suspend/rollback/resume except that you'd be resuming on an immutable snapshot.

  2. zfs_suspend_fs/zfs_resume_fs is not sufficient to freeze/thaw the pool.

Using a snapshot would provide a solid guarantee of immutability. As for allowing the pool to be manipulated, or other non-frozen filesystems to keep writing, it's not clear that's a problem. The VFS is only requesting that a specific super block be frozen. If freezing an entire pool is needed, then alternate interfaces will be required.

@ccic

ccic commented Jan 24, 2017

@behlendorf thanks for sharing your thoughts. I know this feature is not a priority. I just want to get some clues about how to design and implement it.
As for the existing suspend/rollback/resume model, I have checked the code: zfs_ioc_rollback already contains the logic to suspend and then resume. So a possible avenue is to (1) take a snapshot of the specified filesystem, (2) roll back to it but wait for a while after suspending (this is where we freeze the fs), then (3) resume it. That would have the effect of freezing the filesystem. Am I correct?

@behlendorf
Contributor Author

@ccic yes, it should have that effect.

@isegal

isegal commented Feb 16, 2017

+1
Definitely, would love to have this feature for backup snapshot scenarios in the cloud.

For example on AWS one can perform a point-in-time snapshot of an attached EBS drive. Some backup tools rely on FS flushing and freezing so that the snapshot data is consistent. For example with xfs_freeze we are able to snapshot a raid array with no consistency issues.

An example of this is the mysql-ebs-backup script that's currently tailored for XFS on EBS: https://github.com/graphiq/mysql-ebs-snapshot.

If anyone knows of a workaround (the sync command, perhaps?), please do share.

@ccic

ccic commented Feb 16, 2017

A quick workaround for one ZFS filesystem may be:
(1) Expose zfs_suspend_fs/zfs_resume_fs to users through the zfs command. That should take less than 100 lines, I think. Since zfs_suspend_fs still holds the two locks, it requires users to specify another input: how long to suspend, for example 5 seconds (I think that is long enough).
(2) Execute "zfs suspend 5" ('suspend' is the command, 5 means 5 seconds) to freeze.
(3) Run the other logic on MySQL ..
(4) Resume the ZFS filesystem automatically after the suspend timeout.

For a complete feature to freeze a zpool, we have to (1) flush dirty pages and (2) suspend writes to the pool. Right now it only prevents data from being written to disk and unfortunately excludes synchronous writes. It still needs more investigation. A rough sketch of the kernel-side piece is below.
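
For illustration, here is what the kernel-side helper behind a hypothetical "zfs suspend <seconds>" command might look like; the function name and the way osname is obtained are assumptions, and only zfs_suspend_fs()/zfs_resume_fs() are existing functions:

    /* Hypothetical helper for a timed suspend; pairing suspend with resume
     * keeps the z_teardown_* locks from being held past the freeze window. */
    static int zfs_timed_suspend(zfs_sb_t *zsb, const char *osname,
        unsigned int seconds)
    {
            int error;

            error = zfs_suspend_fs(zsb);            /* takes the teardown locks */
            if (error != 0)
                    return (error);

            ssleep(seconds);                        /* hold the freeze window */

            return (zfs_resume_fs(zsb, osname));    /* drops the locks again */
    }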

@isegal

isegal commented Feb 16, 2017

Does /bin/sync flush dirty ZFS pages to disk?

@dm17

dm17 commented Nov 24, 2021

Is there any evidence of suspend doing this? As far as I know this only applies to "hibernation" (aka suspend to disk, not suspend to memory aka "s3").

@bghira

bghira commented Nov 24, 2021

ZFS has no control over this.

@bghira

bghira commented Nov 24, 2021

also, even if the distribution disables hibernate on ZFS, people just go out of their way to avoid all warnings and do whatever they want: https://askubuntu.com/questions/1266599/how-to-increase-swap-with-zfs-on-ubuntu-20-04-to-enable-hibernation

@eblau

eblau commented Nov 24, 2021

@bghira It doesn't seem believable to me that ZFS has no control over this. Surely there are callbacks that can be registered with the kernel to be invoked when hibernate is invoked. At the very least these could be implemented to hang or kernel panic instead of allowing hibernate to proceed and silently corrupt the zpool.

@bghira

bghira commented Nov 24, 2021

as i understand, they're GPL-only symbols.

@danielmorlock

After digging into the kernel hibernation (suspend-to-disk) process with @problame, we figured out that ZFS (even with hibernation) did not cause the zpool corruption. Furthermore, during our (rough) analysis, we did not find a reason why ZFS wouldn't work with hibernation, provided the swap is outside of ZFS. Even without the freeze_fs hooks mentioned by @kohlschuetter, hibernating with ZFS should "just work". I guess the hooks are only relevant for hibernating to a swap that lives inside ZFS.

TL;DR: The problem was in genkernel (Gentoo's automatic kernel building scripts), which includes a script for the initramfs. This script handles LUKS encryption and boots from what is listed in the boot options. In my ZFS setup, I have an encrypted swap and an encrypted root containing the ZFS pool. The initramfs script decrypts the root and imports the zpool BEFORE it decrypts the swap containing the RAM state for hibernation. So the pool is always imported first, and then hibernation resumes a system state in which the zpool was already imported and online. That is probably the reason for my corrupted zpool.

Thanks @problame for your support.

@AttilaFueloep
Contributor

Well, I can add to that. I've been hibernating regularly for a couple of years without any problem so far (knock on wood). I have root on ZFS but boot and swap on LUKS on top of mdadm mirrors. I'm using mkinitcpio, if that matters.

@luke-jr

luke-jr commented Nov 24, 2021

Maybe ZFS should refuse to mount read-write without the user forcing it if it believes it was mounted at hibernation? (I'm assuming a read-only mount won't change anything on disk...)

@bghira

bghira commented Nov 24, 2021

if you can help point out which docs might need updating to include these hints then we might make some progress on it

i know there's a few install guides that are hosted by the OpenZFS project that could be enhanced. each one could link to a page of caveats, which would mean just one spot to maintain them.

i would suggest that it be added to the man pages but ever since they were 'modernised' by breaking them out into separate pages, i have found them to be less useful and rarely search them for information that's now easier to find on Google.

@eblau

eblau commented Nov 24, 2021

@danielmorlock @problame wow, thanks for that hint on the initramfs scripts issue! I use Arch Linux and think that I'm in the same situation with the scripts importing the zpool and then decrypting swap on resume.

I store the swap partition's key in the zpool itself so on resume it does the following:

  1. Runs "cryptsetup open" to prompt for the password to open the encrypted LUKS partition with the zpool on it.
  2. Imports the zpool.
  3. Invokes "cryptsetup open" to open the encrypted LUKS swap partition using the swap partition's key in the zpool.
  4. Unmounts the root dataset and exports the zpool.
  5. Resumes from hibernate.

I'm assuming that step 2 is the issue since the state of the zpool on disk could then differ from the in-memory state in swap that we resume from. Would this work if the pool is imported read-only in step 2?

@problame
Contributor

@eblau yeah, that sounds unhealthy.

Maybe ZFS should refuse to mount read-write without the user forcing it if it believes it was mounted at hibernation? (I'm assuming a read-only mount won't change anything on disk...)

zfs import should actually fail. I guess many scripts use zpool import -f to shoot themselves in the foot :)
@danielmorlock can you confirm it was using zpool import -f?

Regardless, I think -f shouldn't be sufficient to import a pool that was hibernated.
Idea:

Hibernation workflow:

  • somehow get notified by the kernel that we're hibernating
    • If I remember today's session correctly, freeze_fs and thaw_fs are not useful for this.
  • generate a random hibernation cookie
  • store the hibernation cookie in the in-DRAM spa_t, and somewhere on disk, let's say in the MOS config
  • wait until the txg with that on-disk change has synced out
  • but let all later txgs get stuck in transitioning from quiesce -> syncing
    • implementation in txg.c, just prevent the transition from happening
  • now allow hibernation. it must not be allowed before.

Resume workflow:

  • Let initrd restore kernel threads and userland
  • Somehow get notified from the kernel that we're resuming. That should be possible.
  • Load the MOS config from disk
  • Compare the hibernation cookie stored in the MOS config with the one we have in the (restored) DRAM
    • If they don't match, panic the kernel with the following message:
      zpool was used in between hibernation and resume
      
    • If they match, allow quiescing -> syncing transitions again.

To prevent accidental imports, we extend zpool import / spa_import such that they will fail by default if a hibernation cookie is present in the MOS config.
This behavior can be overridden by a new flag zpool import --discard-hibernation-state-and-fail-resume. A rough sketch of the resume-side cookie check is below.
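
For illustration only; spa_hib_cookie_in_core(), spa_hib_cookie_on_disk(), and txg_allow_sync() are made-up helpers that merely illustrate the flow described above:

    /* Hypothetical resume-side check for the design sketched above. */
    static void spa_check_hibernation_cookie(spa_t *spa)
    {
            uint64_t in_core = spa_hib_cookie_in_core(spa);  /* restored DRAM copy */
            uint64_t on_disk = spa_hib_cookie_on_disk(spa);  /* from the MOS config */

            if (in_core != on_disk)
                    cmn_err(CE_PANIC,
                        "zpool was used in between hibernation and resume");

            /* Cookies match: allow quiesce -> sync transitions again. */
            txg_allow_sync(spa_get_dsl(spa));                /* made-up helper */
    }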

Thoughts on this design, @behlendorf ?

@eblau

eblau commented Nov 24, 2021

@eblau yeah, that sounds unhealthy.

Maybe ZFS should refuse to mount read-write without the user forcing it if it believes it was mounted at hibernation? (I'm assuming a read-only mount won't change anything on disk...)

zfs import should actually fail. I guess many scripts use zpool import -f to shoot themselves in the foot :) @danielmorlock can you confirm it was using zpool import -f?

My initcpio script is not using -f. Here are the exact commands it runs:

    modprobe zfs
    mkdir /crypto_key_device
    zpool import -d /dev/mapper -R /crypto_key_device zroot
    cryptsetup open --type=luks --key-file /crypto_key_device/etc/swapkeyfile --allow-discards /dev/nvme0n1p3 cryptswap
    zfs unmount -f /crypto_key_device
    zpool export zroot

@danielmorlock

danielmorlock commented Nov 24, 2021

zfs import should actually fail. I guess many scripts use zpool import -f to shoot themselves in the foot :)
@danielmorlock can you confirm it was using zpool import -f?

I'm pretty sure that -f was not used; man genkernel says:

       dozfs[=cache,force]
           Scan for bootable ZFS pools on bootup. Optionally use cachefile or force import if necessary or perform both actions.

My kernel command line was:

options dozfs crypt_root=UUID=612a36bf-607c-4c8f-8dfd-498b87ea6b7f crypt_swap=UUID=8d173ef7-2af5-4ae5-9b7f-ad06985b1dd0 root=ZFS=rpool_ws1/system/root resume=UUID=74ef965e-688b-495d-95b4-afc449c15750 systemd.unified_cgroup_hierarchy=0

In the initrd phase, an attempt is made to import the zpool before resuming from suspend-to-disk. So this is equivalent to importing an already-imported zpool from a different kernel, isn't it? Does ZFS track that it is already imported (from another system)?

@bghira

bghira commented Nov 24, 2021

I'm assuming that step 2 is the issue since the state of the zpool on disk could then differ from the in-memory state in swap that we resume from. Would this work if the pool is imported read-only in step 2?

@behlendorf would be best to answer, but from past discussions I recall him saying this would need to be very carefully handled (and might not be possible, since many symbols surrounding the hibernation code are GPL-only, possibly why nvidia's implementation sucks as well), as it could lead to undefined behaviour and crashes with the ZFS code as currently written.

@AttilaFueloep
Contributor

@problame

That design sounds reasonable to me.

@eblau

I'm assuming that step 2 is the issue

Yes, definitely. I once wrecked a pool beyond repair by accidentally resuming: I had written to it from a different system between hibernation and resume, and after the resume things went south. But even an import/export cycle alone will change the pool, so it won't match the state stored in the hibernation image.

Would this work if the pool is imported read-only in step 2?

IIRC there was an issue with read only import modifying some state of the pool. Not sure what the current situation is.

@danielmorlock

In the initrd phase, an attempt is made to import the zpool before resuming from suspend-to-disk. So this is equivalent to importing an already-imported zpool from a different kernel, isn't it?

Yes.

Does ZFS track that it is already imported (from another system)?

Not by default. MMP (multi-modifier protection, aka multihost) handles that, but I can't tell if it would work in this case.

@Greek64

Greek64 commented Nov 25, 2021

@eblau
Without getting too far off-topic, may I ask why you are doing this 2-stage LUKS unlocking?
Is it because you only need to enter the unlock passphrase once (since the swap is then unlocked by key file)?

If so, why not use something like decrypt_derived or decrypt_keyctl?

decrypt_derived basically generates a key based on an unlocked LUKS container. So you could unlock the root container normally, and the swap container is then automatically unlocked by a derived key based on the root container.

decrypt_keyctl is basically a key cache store. If both containers use the same passphrase, you only insert the passphrase once, it is stored in cache and then used for all containers.

Also - in a hibernation sense - wouldn't it make more sense to decrypt the swap first before the root?

@eblau

eblau commented Nov 25, 2021

@Greek64
I will check out the decrypt_derived and decrypt_keyctl suggestions, thank you. I do indeed do 2-stage LUKS unlocking to avoid typing a password twice.

I do it this way due to ignorance. :) I researched LUKS on Arch Linux wiki pages and implemented that using the 2-stage unlock approach and then added hibernate/resume later without recognizing the bad interaction between the two.

Definitely it makes more sense to decrypt the swap before the root. That's why when I saw the explanation from @danielmorlock and @problame, I immediately realized the error of my ways.

Sorry for troubling the ZFS folks with this issue. The subject of this issue made me think that some ZFS support was missing. The crazy thing is that I hibernated every day for about 2 years using this approach and only hit zpool corruption about 3 times. Luckily I take backups religiously and never lost much, thanks to the magic of zfs send/receive.

@danielmorlock

I've opened a bug ticket for genkernel: https://bugs.gentoo.org/827281

@problame
Contributor

problame commented Nov 29, 2021

We should close this issue since @behlendorf 's original motivation is misleading people into believing freeze & thaw are related to or required for supporting hibernation.

Barring some uncertainty from my side about in-flight IOs, my current understanding is that it's safe to hibernate a system with an imported pool if and only if the swap space into which the hibernation image is written resides outside of the zpool.
As @danielmorlock and I figured out it's a very brittle process though, since ZFS has no safeguards if your initrd scripts accidentally import the zpool on boot-to-resume.
I have outlined a design to improve this situation above and will create a new issue for it.

A few words on freeze_fs / unfreeze_fs, since they have been mentioned in this thread. The idea behind these callbacks is to block out all VFS ops to a single struct super_block through an rwlock. Here's the kernel function freeze_super that is invoked from the ioctl, if freeze_fs != NULL.
As I understand it, the idea is that, with a one-super-block-per-bdev type of filesystem like XFS or Ext4, userspace can freeze the filesystem using the ioctl, then create a block-level snapshot or backup, then thaw it again using another ioctl.
If that understanding is correct and an exhaustive description of the use case, then I believe it is ill-advised to implement the callbacks for ZFS, since other datasets (= other super blocks) will continue to do IO to the same zpool. And even if userspace is careful to freeze all datasets, the Linux VFS isn't aware of the ZFS management operations that perform IO to the pool (send/recv, properties, ...).

Note that there are also the super block ops freeze_super / thaw_super (yep, confusing naming). A filesystem can implement these instead of freeze_fs / unfreeze_fs if it wants to implement the infrastructure for locking out VFS ops itself instead of using the one provided by the freeze_super function.

Note also: btrfs. It's the mainline filesystem most similar to ZFS with regard to pooled storage (multiple blockdevs!) and multiple super blocks on top. Btrfs implements freeze_fs/unfreeze_fs. But the btrfs secret sauce is that a single btrfs filesystem (= pool in ZFS terms) only has a single struct super_block - the subvolumes (= ZPL datasets in ZFS terms) are implemented through mount_subtree.

@problame
Contributor

Maybe one more remark regarding just how brittle hibernation currently is:
hibernation does a system-wide sync prior to "freezing" userspace and kernel threads.
But there is no transactionality here, i.e., userspace and kernel threads can continue to dirty DMU state, issue and execute ZIOs to the pool, until they are "frozen" (sigh, naming, it should be called "descheduled for hibernation").

What does this mean?

  • If we are able to successfully resume, everything should be ok, IF AND ONLY IF no zio's have been dropped on the floor during hibernation. Otherwise, those lost zios might later look a lot like data corruption. Later could be a lot later, if the hibernate image had the data cached in the ARC.
    • I haven't had time to look into this yet. Maybe the kernel quiesces the IO stack before freezing threads, I don't know. I wouldn't want to place a bet on it.
  • If we are unable to resume, the situation is like a hard crash shortly after txg sync / ZIL write.
    • In theory, the pool should be safe to import. You might lose a few seconds worth of data. But since the kernel does the equivalent of a sync during hibernation, you probably didn't lose much in practice. Sync write guarantees should continue to hold as well.
    • In practice, this still exercises a non-happy code path.

@behlendorf
Contributor Author

@problame your general design makes good sense to me. It's been a while since I looked at this (years), but I agree it should largely be a matter of suspending the txg engine when hibernating and then resuming it after the system state has been restored.

generate a random hibernation cookie
store the hibernation cookie in the in-DRAM spa_t, and somewhere on disk, let's say in the MOS config

One avenue you may want to explore here is leveraging the existing zio_suspend() / zio_resume() machinery. Fundamentally this code implements the logic to cleanly halt the txg engine, including suspending the zio pipeline and halting any inflight zios. When resuming, the pipeline is restarted and any halted zios are resubmitted and allowed to proceed as if nothing happened. Today this is what's used to suspend the pool when, due to drive failures, there's no longer sufficient redundancy to allow the pool to continue operation. But I could imagine extending it so the pool could be intentionally put in this suspended state for hibernation and a cookie stored.
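
For illustration only, a sketch of that avenue; zfs_hibernate_prepare()/zfs_hibernate_resume(), the spa_hibernate_cookie field, and the ZIO_SUSPEND_HIBERNATE reason are all assumptions, and only zio_suspend()/zio_resume() exist today:

    /* Hypothetical sketch: intentionally park the pool via the existing
     * suspend machinery before hibernating, then restart it on resume. */
    static void zfs_hibernate_prepare(spa_t *spa)
    {
            /* Stash a cookie so resume can detect intervening imports. */
            (void) random_get_pseudo_bytes(
                (uint8_t *)&spa->spa_hibernate_cookie,      /* made-up field */
                sizeof (spa->spa_hibernate_cookie));

            zio_suspend(spa, NULL, ZIO_SUSPEND_HIBERNATE);  /* made-up reason */
    }

    static void zfs_hibernate_resume(spa_t *spa)
    {
            (void) zio_resume(spa); /* restart pipeline, resubmit held zios */
    }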

Opening a new issue to track this work sounds like a good idea to me. I'd just suggest that we somehow reference this old issue from the new one to make it easy to find. I don't mind closing this one at all once we have a replacement to continue the discussion.

@problame
Contributor

@behlendorf I have created two follow-up issues:

I think you can close this issue now.

@alek-p
Contributor

alek-p commented Sep 13, 2022

possibly related to #13879
