"zpool export <zpoolname>" on a faulted zpool hangs and blocks other zpool commands after that #6649
Comments
Here is a possible fix which I have tested and found to be working:
@sanjeevbagewadi when you open a PR for this, please make sure you add your reproducer above as a test case for the ZFS Test Suite.
@behlendorf I don't see the fix in @sanjeevbagewadi's zfs fork; it appears he tested locally without committing to GitHub.
@behlendorf, this slipped through the cracks. It is pending review internally and we did not get to it. This fix only allows other zpool commands to continue; it will not help export the faulted zpool. We might need additional work for that. I will generate a pull request soon.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Stale, but still happening.
Is there a way to make this not happen? I have a simple-to-reproduce case where one external pool failure breaks the whole working server: connect an external USB pool, then disconnect the power of that disk.

Result: ANY zpool command hangs from this point on. You can't do anything with it until you reboot or, miraculously, the I/O times out (which almost never happens; I see D state from several days ago). You can keep using the existing zfs mounts but cannot manage any pools. For example, cron jobs that rely on 'zpool status -x' will queue up as D-state processes. Some zfs commands appear to work, though (such as zfs list). All of these D-state processes (zpool, samba, etc.) ultimately depend on a txg_sync process in D state:

Even if the power has returned to the external USB pool disk, I cannot issue 'zpool online backup' or 'zpool clear backup' because there are 100 other 'zpool status' processes queued in D state (and rsync, and samba).
This is something I know multiple people have looked into; I'm sure it's doable, but it's surprisingly tricky. There's been some recent renewed interest and work towards being able to export a suspended pool, but no patches for review just yet.
Being able to export/unload a suspended pool would be awesome, but in the meantime can we at least solve the case where one faulted pool blocks zpool commands from working on other online pools? (Currently zpool status/list hangs as well.)
I just ran into the same issue: zfs with an iSCSI volume below it, which was removed a while ago. Somehow the pool never even faulted since no IO was running on it, but now I tried to clean it up and that led to a total hang of everything ZFS-related. So it should be split into two separate issues:

Is there anything I can do now, except reboot the server? Will that even work, since all ZFS-related commands lead to a D-state process?
I ran into this issue today as well (my external backup disk hung on some USB bug). In my opinion, any single failure that brings down the whole server should be a P0. Can someone please at least implement the minimal change of allowing other commands to work?
@devsk watch for this when it gets merged: #5242 (comment)
Folks, I'd very much appreciate it if this issue were fixed. Thank you!
Just ran into this issue myself, and it's basically locked everything; the only way I've found to resolve it is a full restart, which is hardly ideal! I'm not so bothered about
Folks, sorry to say, but this really sucks. I had numerous reboots because of this issue, which created far more hassle than the issue itself. Even worse, you cannot cleanly reboot a system with a hanging zpool, because things get stuck on shutdown; you need to do a hard reset! Please add some logic to at least avoid hanging the zpool/zfs commands.
It's currently being worked on; you can track the progress in issue #11082. I believe the actual code to support this is implemented, but it's currently failing some tests; hopefully once those last few bugs are resolved it can be rolled out.
Thank you for the update!
System information
Describe the problem you're observing
A zpool (with failmode=wait) entered a degraded/faulted state due to IO failures.
Issued a "zpool export" and it blocked as below:
crash> bt 0xffff8801e79e9540
PID: 5478 TASK: ffff8801e79e9540 CPU: 1 COMMAND: "zpool"
#0 [ffff88032e1a7bf0] __schedule at ffffffff816cb514
#1 [ffff88032e1a7ca0] schedule at ffffffff816cbc10
#2 [ffff88032e1a7cc0] cv_wait_common at ffffffffa07cf845 [spl]
#3 [ffff88032e1a7d40] __cv_wait at ffffffffa07cf8d5 [spl]
#4 [ffff88032e1a7d50] txg_wait_synced at ffffffffa08f3919 [zfs]
#5 [ffff88032e1a7da0] spa_export_common at ffffffffa08e3dc0 [zfs]
#6 [ffff88032e1a7e00] spa_export at ffffffffa08e407b [zfs]
#7 [ffff88032e1a7e10] zfs_ioc_pool_export at ffffffffa0924d7f [zfs]
#8 [ffff88032e1a7e40] zfsdev_ioctl at ffffffffa09277d4 [zfs]
#9 [ffff88032e1a7eb0] do_vfs_ioctl at ffffffff81216072
#10 [ffff88032e1a7f00] sys_ioctl at ffffffff81216402
#11 [ffff88032e1a7f50] entry_SYSCALL_64_fastpath at ffffffff816cf76e
RIP: 00007f40aaff1a77 RSP: 00007fff29addca8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000c491a0 RCX: 00007f40aaff1a77
RDX: 00007fff29addcc0 RSI: 0000000000005a03 RDI: 0000000000000003
RBP: 00007fff29adda70 R8: 6338383337336261 R9: 3566353033323238
R10: 00007fff29adda30 R11: 0000000000000246 R12: 0000000000000006
R13: 00007fff29addb50 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
Unfortunately, spa_sync() will not complete because the vdisk is faulted, and it will not make progress until a "zpool online" is issued. However, the "zpool export" is holding the spa_namespace_lock, and hence other commands will block as below:
crash> bt 5582
PID: 5582 TASK: ffff88008ffc5500 CPU: 2 COMMAND: "zpool"
#0 [ffff8802a7e1fb80] __schedule at ffffffff816cb514
#1 [ffff8802a7e1fc30] schedule at ffffffff816cbc10
#2 [ffff8802a7e1fc50] schedule_preempt_disabled at ffffffff816cbe4e
#3 [ffff8802a7e1fc60] __mutex_lock_slowpath at ffffffff816cd440
#4 [ffff8802a7e1fd00] mutex_lock at ffffffff816cd4f3
#5 [ffff8802a7e1fd20] spa_open_common at ffffffffa08e64a3 [zfs]
#6 [ffff8802a7e1fda0] spa_get_stats at ffffffffa08e6909 [zfs]
#7 [ffff8802a7e1fe00] zfs_ioc_pool_stats at ffffffffa0924c11 [zfs]
#8 [ffff8802a7e1fe40] zfsdev_ioctl at ffffffffa09277d4 [zfs]
#9 [ffff8802a7e1feb0] do_vfs_ioctl at ffffffff81216072
#10 [ffff8802a7e1ff00] sys_ioctl at ffffffff81216402
#11 [ffff8802a7e1ff50] entry_SYSCALL_64_fastpath at ffffffff816cf76e
RIP: 00007fd4730c5a77 RSP: 00007fffa8889408 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 00007fd473373120 RCX: 00007fd4730c5a77
RDX: 00007fffa8889430 RSI: 0000000000005a05 RDI: 0000000000000004
RBP: 0000000000772f80 R8: 0000000000000008 R9: 0000000001e00000
R10: 00007fffa8889190 R11: 0000000000000202 R12: 0000000000020090
R13: 0000000000772f70 R14: 0000000000010000 R15: 00007fd473373120
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
Hence, all zpool commands will block on the spa_namespace_lock.
It would probably be better for spa_export_common() to wait in txg_wait_synced() without holding the spa_namespace_lock.
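To make the locking pattern concrete, here is a minimal userspace simulation of the behaviour described above. This is not OpenZFS code: namespace_lock, wait_for_sync_forever(), export_thread(), status_thread() and the DROP_LOCK_BEFORE_WAIT switch are hypothetical stand-ins for spa_namespace_lock, txg_wait_synced() on a suspended pool, spa_export_common(), spa_open_common() and the suggested change of not holding the lock across the wait. Built with cc -pthread it hangs just like the backtraces above; built with -DDROP_LOCK_BEFORE_WAIT the "status" thread proceeds.

/*
 * Minimal userspace simulation of the deadlock pattern (NOT OpenZFS code).
 * namespace_lock stands in for spa_namespace_lock, wait_for_sync_forever()
 * for txg_wait_synced() on a suspended pool, export_thread() for
 * spa_export_common(), and status_thread() for spa_open_common().
 *
 * Build: cc -pthread sim.c                          (hangs, current behaviour)
 *        cc -pthread -DDROP_LOCK_BEFORE_WAIT sim.c  (status thread proceeds)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t namespace_lock = PTHREAD_MUTEX_INITIALIZER;

/* Simulates waiting for a txg sync that never completes on a suspended pool. */
static void wait_for_sync_forever(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

    pthread_mutex_lock(&m);
    for (;;)
        pthread_cond_wait(&cv, &m);   /* nobody ever signals */
}

/* Simulates "zpool export": takes the namespace lock, then waits for sync. */
static void *export_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&namespace_lock);
    printf("export: holding namespace_lock, waiting for sync...\n");
#ifdef DROP_LOCK_BEFORE_WAIT
    /* Hypothetical change: release the lock before the long wait. */
    pthread_mutex_unlock(&namespace_lock);
#endif
    wait_for_sync_forever();
    return (NULL);
}

/* Simulates "zpool status"/"zpool list": needs the namespace lock briefly. */
static void *status_thread(void *arg)
{
    (void)arg;
    sleep(1);   /* let export_thread grab the lock first */
    printf("status: waiting for namespace_lock...\n");
    pthread_mutex_lock(&namespace_lock);
    printf("status: got namespace_lock, pool stats would be returned here\n");
    pthread_mutex_unlock(&namespace_lock);
    return (NULL);
}

int main(void)
{
    pthread_t e, s;

    pthread_create(&e, NULL, export_thread, NULL);
    pthread_create(&s, NULL, status_thread, NULL);
    pthread_join(s, NULL);   /* hangs here unless DROP_LOCK_BEFORE_WAIT */
    printf("status finished; other zpool commands are not blocked\n");
    return (0);
}

Note that in the real code the export path would still need to re-acquire spa_namespace_lock and re-validate the pool state after the wait, which is part of why this is trickier than the simulation suggests.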
Describe how to reproduce the problem
Here are the steps to reproduce the problem:
zinject -a -d /dev/sdz -e io zpool-1
zpool export zpool-1 (this command will hang)
At this point all other zpool commands (e.g. zpool list, zpool status) will hang.
Include any warning/errors/backtraces from the system logs