Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 to 6.10.14 #15140
The timing of when that breaks makes me wonder about something related to https://lwn.net/Articles/937943/, but who knows.
Also, as an aside, can't you tell the kernel at boot time to pre-reserve 1G hugepages for you? Not that this isn't a bug, but just as a workaround for your use case for now.
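For reference, a minimal sketch of that boot-time workaround, assuming a GRUB-based setup; the page count of 16 is illustrative only:
# /etc/default/grub -- reserve 16 x 1 GiB hugepages at boot
GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=1G hugepagesz=1G hugepages=16"
# Regenerate the GRUB config, reboot, then verify the reservation:
grub-mkconfig -o /boot/grub/grub.cfg
grep HugePages_ /proc/meminfo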
Yeah, sure, but then the memory is locked. I would like to keep using it when the VM is not in use; I rarely use Windows these days anyway :P For now, remaining on 6.1 LTS is good enough, at least until someone can track down this issue. I've noticed that now even just the initial call crashes the kernel, which might be related to the changes in the mm subsystem. But that's just speculation on my part, since I can't reproduce the issue with the stock kernel in the same way (not that this excludes a kernel bug either, to be clear). I'll test mainline shortly to check whether they have already fixed this issue.
What do you mean, you can't reproduce it with stock? It works on vanilla 6.3.x/6.4.x but not Arch's patchset?
It works fine without zfs-dkms installed. It doesn't work with zfs-dkms installed.
Does it break with ZFS loaded and no pools imported?
I can try it shortly on my old BtrFS install; I'll finish testing mainline first :P
Well, heck, dkms doesn't build. Testing with just the module loaded now.
Interesting, without mounting my ZFS encrypted root drive it doesn't seem to trigger it. I'll do further testing tomorrow.
I can reproduce this too. Good find.
BTW, I think the issue has to do with compacting memory. I can still reserve hugepages just fine.
Yeah, the trigger is definitely memory compaction, not hugepage allocation by itself. Good to know that I'm not the only one with this issue :)
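To make the distinction concrete, these are the two standard kernel interfaces involved (a minimal sketch; the hugepage count is arbitrary):
# Trigger memory compaction only -- this is the step that appears to hit the bug:
echo 1 > /proc/sys/vm/compact_memory
# Reserve 1 GiB hugepages at runtime -- this alone reportedly does not trigger it:
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages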
Starting the bisect now; I'll post an update shortly if I can find something from it. At least it should narrow down the possible changes.
I finally have the result of the bisection:
I'll try to revert this on top of the latest kernel to see if it still works fine. git bisect log here
LMK how this should proceed, and to whom this needs to be reported (whether it's purely a ZFS kernel bug, or whether it can affect other parts of the kernel and needs to be reported directly to the Linux developers).
Yeah, unfortunately too many changes to the mm subsystem have been applied, so I really can't easily revert that change on top of the latest Linux kernel. Please let me know how this should proceed.
This should be reported as a kernel bug, no? Especially as this provides a simple reproducer. Does this occur with the 6.5-rc kernels too?
I can't reproduce it without ZFS on an encrypted root, so I can't prove it's a kernel regression. Highly likely if you ask me, but that's just my opinion at this point.
Just tested 10 boot-up and shutdown cycles with ZFS loaded but not on root and a non-encrypted dataset imported, and this didn't cause any issues. If someone from OpenZFS could take a look at why calling drop_caches causes a kernel oops with ZFS on an encrypted root, it would be appreciated.
It also happens on a non-encrypted root, but with an encrypted dataset present. So it looks like it is related to encryption.
Thanks a lot for the report, this at least narrows down the area.
I have encrypted datasets too and can confirm this happens.
This is still an issue on the latest mainline 6.5.1. Can anyone from the team try to debug what's going wrong with encryption on the latest kernels? Steps to repro:
Marco.
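The repro steps referenced above were not preserved in this thread; a likely minimal sequence, based on the commands quoted in the issue description further below and run on a system with an encrypted dataset mounted, would be:
# Assumed reconstruction of the repro -- flush caches, then force compaction:
sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory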
Also experiencing this one in my own vfio script lately. The call of
Linux page migration code won't wait for writeback to complete unless it needs to call release_folio. Call SetPagePrivate from zpl_readpage_common and define .release_folio, to cause fallback_migrate_folio to wait for us. Signed-off-by: Tim Stabrawa <59430211+tstabrawa@users.noreply.github.com> Closes openzfs#15140
Hi all, not a ZFS dev here, but I heard about this bug from the Proxmox release notes, and gave it a good long look over. As far as I can tell, what's happening is that the Linux kernel page migration code is starting writeback on some pages, not waiting for writeback to complete, and then throwing a BUG when it finds that the pages are still under writeback. Pretty much all of the interesting action happens in fallback_migrate_folio(), which doesn't show up in your stack traces, but suffice it to say that it's called from move_to_new_folio(), which does appear in the stack traces.
What appears to be happening in the case of the crashes described here is that fallback_migrate_folio() is being called on dirty ZFS page-cache pages, so it starts writeback by calling writeout(). Then, since ZFS doesn't store private data in any page-cache pages, it skips the call to filemap_release_folio() (because folio_test_private() returns false) and immediately calls migrate_folio(), which in turn calls migrate_folio_extra(). At the beginning of migrate_folio_extra(), it BUGs out because the page is still under writeback (folio_test_writeback() returns true). Notably, if the page did have private data, then fallback_migrate_folio() would call into filemap_release_folio(), which would return false for pages under writeback, causing fallback_migrate_folio() to exit before calling migrate_folio(). So, in summary, for the BUG to happen a few things need to be true:
1. The filesystem has dirty page-cache pages at migration time, so fallback_migrate_folio() starts writeback on them.
2. That writeback is asynchronous, so the pages are still under writeback when migration proceeds.
3. The filesystem does not store private data in its page-cache pages, so filemap_release_folio() is never called and migrate_folio() runs anyway.
I went through the code for all of the filesystems in the Linux kernel and didn't see any that meet all three conditions. Notably, pretty much all traditional filesystems store buffers in page private data. Those filesystems that don't store buffers either store something else in page_private (e.g. shmem/tmpfs, iomap) or don't do asynchronous writeback (e.g. ecryptfs, fuse, romfs, squashfs). So it would appear that ZFS may be the only filesystem that experiences this particular behavior. Also, I wasn't able to identify anything special about kernel 6.3.1 that would cause this BUG to happen; as far as I can tell, the above-described behavior goes back all the way to when page migration was first implemented in kernel 2.6.16. The way I see it, there are two ways to make the problem go away:
1. Change the kernel page-migration code so that fallback_migrate_folio() waits for writeback to complete before migrating the page.
2. Change ZFS to set PagePrivate on its page-cache pages and provide a release_folio handler, so that filemap_release_folio() gets called and migration backs off while writeback is in progress.
I assume the latter may be preferable (even if only temporarily), so that ZFS can avoid this crash on any/all kernel versions, but I'm happy to defer to the ZFS devs on which option(s) they choose to pursue. The latter is the approach I took in the patch on my fix_15140 branch. Would one of you who has a reliable way to reproduce the problem please give this patch a try? It otherwise passes all of the tests in the ZFS test suite (or at least, all of the tests that pass without the patch), so once I have confirmation that it fixes the problem, I can submit it as a PR.
Well, thanks a lot for trying to improve this! As already said above, I no longer have this issue since I've switched my root filesystem to another format. I hope that someone above will test it and report back, though.
@tstabrawa Thank you so much for your effort in fixing this issue! Much appreciated! I have just checked out your branch and installed your fix. The system has been up for around five minutes. I will check back later.
@tstabrawa The system that experienced the kernel panic before has now been up for over three hours. I will keep an eye on it for the next few days.
I was able to repro this every time I flushed caches for VM stuff... let me check... :-)
@tstabrawa Another panic, unfortunately:
4,1821,54929238000,-;------------[ cut here ]------------
2,1822,54929238528,-;kernel BUG at mm/migrate.c:664!
4,1823,54929238929,-;invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
4,1824,54929239314,-;CPU: 17 PID: 2566 Comm: numad Tainted: P OE 6.6.13+bpo-amd64 #1 Debian 6.6.13-1~bpo12+1
4,1825,54929239715,-;Hardware name: Supermicro Super Server/H11SSL-i, BIOS 1.3 06/25/2019
4,1826,54929240120,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1827,54929240538,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 18 b5 85 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 05 b5 85 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1828,54929241408,-;RSP: 0018:ffffbaedefd17880 EFLAGS: 00010202
4,1829,54929241879,-;RAX: 0057ffffc000010b RBX: ffffe0d324faff80 RCX: 0000000000000002
4,1830,54929242338,-;RDX: ffffe0d324faff80 RSI: ffffe0d340b51e00 RDI: ffff9b56fe4e1e10
4,1831,54929242832,-;RBP: ffff9b56fe4e1e10 R08: 0000000000000000 R09: 0000000000038120
4,1832,54929243340,-;R10: ffffe0d340b51e08 R11: 0000000000000000 R12: 0000000000000002
4,1833,54929243853,-;R13: ffffe0d340b51e00 R14: ffffe0d324faff80 R15: ffffbaedefd17928
4,1834,54929244373,-;FS: 00007fd9bbc70740(0000) GS:ffff9b4ebf140000(0000) knlGS:0000000000000000
4,1835,54929244909,-;CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1836,54929245448,-;CR2: 00007f65daa22838 CR3: 0000001901352000 CR4: 00000000003506e0
4,1837,54929245997,-;Call Trace:
4,1838,54929246538,-; <TASK>
4,1839,54929247072,-; ? die+0x36/0x90
4,1840,54929247618,-; ? do_trap+0xda/0x100
4,1841,54929248160,-; ? migrate_folio_extra+0x6b/0x70
4,1842,54929248709,-; ? do_error_trap+0x6a/0x90
4,1843,54929249253,-; ? migrate_folio_extra+0x6b/0x70
4,1844,54929249805,-; ? exc_invalid_op+0x50/0x70
4,1845,54929250358,-; ? migrate_folio_extra+0x6b/0x70
4,1846,54929250917,-; ? asm_exc_invalid_op+0x1a/0x20
4,1847,54929251485,-; ? migrate_folio_extra+0x6b/0x70
4,1848,54929252050,-; move_to_new_folio+0x138/0x140
4,1849,54929252624,-; migrate_pages_batch+0x865/0xbe0
4,1850,54929253203,-; ? __pfx_remove_migration_pte+0x10/0x10
4,1851,54929253783,-; migrate_pages+0xc1b/0xd60
4,1852,54929254357,-; ? __pfx_alloc_migration_target+0x10/0x10
4,1853,54929254954,-; migrate_to_node+0xfd/0x140
4,1854,54929255554,-; do_migrate_pages+0x210/0x2b0
4,1855,54929256151,-; kernel_migrate_pages+0x425/0x490
4,1856,54929256755,-; __x64_sys_migrate_pages+0x1d/0x30
4,1857,54929257353,-; do_syscall_64+0x5f/0xc0
4,1858,54929257951,-; ? srso_return_thunk+0x5/0x10
4,1859,54929258544,-; ? sched_setaffinity+0x1a9/0x230
4,1860,54929259138,-; ? srso_return_thunk+0x5/0x10
4,1861,54929259745,-; ? exit_to_user_mode_prepare+0x40/0x1e0
4,1862,54929260307,-; ? srso_return_thunk+0x5/0x10
4,1863,54929260893,-; ? syscall_exit_to_user_mode+0x2b/0x40
4,1864,54929261448,-; ? srso_return_thunk+0x5/0x10
4,1865,54929261998,-; ? do_syscall_64+0x6b/0xc0
4,1866,54929262563,-; ? srso_return_thunk+0x5/0x10
4,1867,54929263131,-; ? syscall_exit_to_user_mode+0x2b/0x40
4,1868,54929263703,-; ? srso_return_thunk+0x5/0x10
4,1869,54929264266,-; ? do_syscall_64+0x6b/0xc0
4,1870,54929264828,-; ? do_syscall_64+0x6b/0xc0
4,1871,54929265372,-; entry_SYSCALL_64_after_hwframe+0x6e/0xd8
4,1872,54929265910,-;RIP: 0033:0x7fd9bbd74719
4,1873,54929266438,-;Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
4,1874,54929267515,-;RSP: 002b:00007fff544e9a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000100
4,1875,54929268048,-;RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd9bbd74719
4,1876,54929268583,-;RDX: 000055f1fc8e6e10 RSI: 0000000000000005 RDI: 0000000000005e4e
4,1877,54929269113,-;RBP: 000055f1fc8e74f0 R08: 000055f1fc8d4e50 R09: 00007fff544e9f00
4,1878,54929269641,-;R10: 000055f1fc8e5c40 R11: 0000000000000246 R12: 000055f1fc8e5c40
4,1879,54929270168,-;R13: 0000000000000002 R14: 0000000000000002 R15: 0000000000000008
4,1880,54929270708,-; </TASK>
4,1881,54929271223,-;Modules linked in: nls_ascii nls_cp437 vfat fat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 ip_set xt_mark cls_bpf sch_ingress vxlan ip6_udp_tunnel udp_tunnel xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_filter iptable_raw iptable_mangle iptable_nat xt_CT dummy xt_comment veth xt_nat xt_tcpudp nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_compat nf_tables nfnetlink ceph libceph fscache netfs scsi_transport_iscsi nvme_fabrics nvme_core overlay binfmt_misc intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 xfs aesni_intel ast crypto_simd acpi_ipmi cryptd drm_shmem_helper rapl acpi_cpufreq pcspkr sp5100_tco drm_kms_helper ipmi_si ccp watchdog k10temp ipmi_devintf ipmi_msghandler evdev joydev button sg xt_ipvs xt_conntrack nf_conntrack_ftp ip_vs_wrr ip_vs_wlc ip_vs_sh ip_vs_rr ip_vs_ftp nf_nat nfsd ip_vs nf_conntrack nf_defrag_ipv6
4,1882,54929271381,c; nfs_acl nf_defrag_ipv4 lockd libcrc32c auth_rpcgss crc32c_generic br_netfilter grace bridge drm sunrpc stp dm_mod llc loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 zfs(POE) spl(OE) hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci ixgbe xhci_pci libahci xfrm_algo xhci_hcd libata mdio_devres crct10dif_pclmul igb crct10dif_common libphy usbcore crc32_pclmul scsi_mod crc32c_intel mdio i2c_algo_bit scsi_common usb_common i2c_piix4 dca
4,1883,54929279719,-;---[ end trace 0000000000000000 ]---
4,1884,54929377510,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1885,54929378401,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 18 b5 85 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 05 b5 85 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1886,54929379940,-;RSP: 0018:ffffbaedefd17880 EFLAGS: 00010202
4,1887,54929380703,-;RAX: 0057ffffc000010b RBX: ffffe0d324faff80 RCX: 0000000000000002
4,1888,54929381469,-;RDX: ffffe0d324faff80 RSI: ffffe0d340b51e00 RDI: ffff9b56fe4e1e10
4,1889,54929382261,-;RBP: ffff9b56fe4e1e10 R08: 0000000000000000 R09: 0000000000038120
4,1890,54929383058,-;R10: ffffe0d340b51e08 R11: 0000000000000000 R12: 0000000000000002
4,1891,54929383836,-;R13: ffffe0d340b51e00 R14: ffffe0d324faff80 R15: ffffbaedefd17928
4,1892,54929384628,-;FS: 00007fd9bbc70740(0000) GS:ffff9b4ebf140000(0000) knlGS:0000000000000000
4,1893,54929385430,-;CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1894,54929386242,-;CR2: 00007f65daa22838 CR3: 0000001901352000 CR4: 00000000003506e0
Still looks identical. So that might not be the only cause then.
@stephan2012 @RodoMa92 Thanks for trying out the patch. I'll have to take another look this weekend, I guess. It seems bizarre to me that the same BUG can be hit with the patch in place. Would you mind humoring me and confirming which ZFS module version you see loaded?
Sure. Here we go:
[0|root@n0044:~]# dmesg | grep -i zfs
[ 0.000000] Command line: BOOT_IMAGE=/ROOT/debian@/boot/vmlinuz-6.6.13+bpo-amd64 root=ZFS=rpool/ROOT/debian ro consoleblank=0 apparmor=0 group_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=1
[ 0.022474] Kernel command line: BOOT_IMAGE=/ROOT/debian@/boot/vmlinuz-6.6.13+bpo-amd64 root=ZFS=rpool/ROOT/debian ro consoleblank=0 apparmor=0 group_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=1
[ 5.695756] zfs: module license 'CDDL' taints kernel.
[ 5.716912] zfs: module license taints kernel.
[ 6.498122] WARNING: ignoring tunable zfs_arc_min (using 0 instead)
[ 6.507022] WARNING: ignoring tunable zfs_arc_min (using 0 instead)
[ 7.960227] ZFS: Loaded module v2.2.99-1, ZFS pool version 5000, ZFS filesystem version 5
Previously, version 2.2.4-1 (from Debian Backports) was installed.
Linux page migration code won't wait for writeback to complete unless it needs to call release_folio. Call SetPagePrivate wherever PageUptodate is set and define .release_folio, to cause fallback_migrate_folio to wait for us. Signed-off-by: Tim Stabrawa <59430211+tstabrawa@users.noreply.github.com> Closes openzfs#15140
@stephan2012 @RodoMa92 @numinit Sorry for the delay. Much head-scratching ensued, and I was able to identify some potential scenarios where pages could end up in the page cache without having PagePrivate set by my previous changes. My new patch takes a different approach, setting PagePrivate wherever PageUptodate is set, so there should be no way for pages to end up dirty / under writeback without first going through one of these code paths. Would you please give the new patch (on my fix_15140 branch) a try?
@tstabrawa Thanks for your work. Much appreciated! I have compiled and installed your new fix. The system has been up and running for 30 minutes now. Let's wait a few days and see what happens.
@tstabrawa Oops, another panic:
4,1799,61665349822,-;------------[ cut here ]------------
2,1800,61665349833,-;kernel BUG at mm/migrate.c:663!
4,1801,61665349906,-;invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
4,1802,61665349950,-;CPU: 31 PID: 2594 Comm: numad Tainted: P OE 6.7.12+bpo-amd64 #1 Debian 6.7.12-1~bpo12+1
4,1803,61665350015,-;Hardware name: Supermicro Super Server/H11SSL-i, BIOS 1.3 06/25/2019
4,1804,61665350061,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1805,61665350109,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 88 5c 86 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 75 5c 86 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1806,61665350210,-;RSP: 0018:ffffbf03e506b988 EFLAGS: 00010202
4,1807,61665350254,-;RAX: 0057ffffe000010b RBX: ffffe8d928ecc800 RCX: 0000000000000002
4,1808,61665350296,-;RDX: ffffe8d928ecc800 RSI: ffffe8d965641c80 RDI: ffff9f0b087f4ee0
4,1809,61665350337,-;RBP: ffff9f0b087f4ee0 R08: 0000000000000000 R09: ffff9f15544945d8
4,1810,61665350379,-;R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000002
4,1811,61665350419,-;R13: ffffe8d965641c80 R14: ffffbf03e506ba30 R15: ffffe8d928ecc800
4,1812,61665350459,-;FS: 00007fefcc5a0740(0000) GS:ffff9f2a1fbc0000(0000) knlGS:0000000000000000
4,1813,61665350491,-;CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1814,61665350516,-;CR2: 00000000012f9800 CR3: 0000001840d7a000 CR4: 00000000003506f0
4,1815,61665350545,-;Call Trace:
4,1816,61665350561,-; <TASK>
4,1817,61665350577,-; ? die+0x36/0x90
4,1818,61665350600,-; ? do_trap+0xda/0x100
4,1819,61665350622,-; ? migrate_folio_extra+0x6b/0x70
4,1820,61665350648,-; ? do_error_trap+0x6a/0x90
4,1821,61665350669,-; ? migrate_folio_extra+0x6b/0x70
4,1822,61665350695,-; ? exc_invalid_op+0x50/0x70
4,1823,61665350718,-; ? migrate_folio_extra+0x6b/0x70
4,1824,61665350743,-; ? asm_exc_invalid_op+0x1a/0x20
4,1825,61665350776,-; ? migrate_folio_extra+0x6b/0x70
4,1826,61665350801,-; ? srso_return_thunk+0x5/0x5f
4,1827,61665350824,-; move_to_new_folio+0x138/0x140
4,1828,61665350847,-; migrate_pages_batch+0x874/0xba0
4,1829,61665350876,-; ? __pfx_remove_migration_pte+0x10/0x10
4,1830,61665350905,-; migrate_pages+0xc4b/0xd90
4,1831,61665350927,-; ? __pfx_alloc_migration_target+0x10/0x10
4,1832,61665350961,-; ? srso_return_thunk+0x5/0x5f
4,1833,61665350984,-; ? queue_pages_range+0x6a/0xb0
4,1834,61665351009,-; migrate_to_node+0xf0/0x170
4,1835,61665351041,-; do_migrate_pages+0x1f2/0x260
4,1836,61665351072,-; kernel_migrate_pages+0x425/0x490
4,1837,61665351110,-; __x64_sys_migrate_pages+0x1d/0x30
4,1838,61665351132,-; do_syscall_64+0x63/0x120
4,1839,61665351154,-; ? srso_return_thunk+0x5/0x5f
4,1840,61665351178,-; ? do_syscall_64+0x6f/0x120
4,1841,61665351200,-; entry_SYSCALL_64_after_hwframe+0x73/0x7b
4,1842,61665351227,-;RIP: 0033:0x7fefcc6a4719
4,1843,61665351878,-;Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
4,1844,61665353202,-;RSP: 002b:00007fff4f591a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000100
4,1845,61665353871,-;RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fefcc6a4719
4,1846,61665354539,-;RDX: 0000555c297c3d90 RSI: 0000000000000005 RDI: 000000000000460c
4,1847,61665355192,-;RBP: 0000555c297bbdb0 R08: 0000555c297b7e50 R09: 00007fff4f591f20
4,1848,61665355828,-;R10: 0000555c297c3db0 R11: 0000000000000246 R12: 0000555c297c3db0
4,1849,61665356453,-;R13: 0000000000000003 R14: 0000000000000003 R15: 0000000000000008
4,1850,61665357080,-; </TASK>
4,1851,61665357670,-;Modules linked in: udp_diag inet_diag xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 ip_set xt_CT xt_mark cls_bpf sch_ingress vxlan ip6_udp_tunnel udp_tunnel xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_filter iptable_raw iptable_mangle iptable_nat dummy xt_comment veth ceph libceph fscache netfs xt_nat xt_tcpudp nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_compat nf_tables nfnetlink scsi_transport_iscsi nvme_fabrics nvme_core overlay binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 xfs aesni_intel ast crypto_simd cryptd acpi_ipmi drm_shmem_helper rapl acpi_cpufreq pcspkr drm_kms_helper ipmi_si ccp ipmi_devintf sp5100_tco watchdog k10temp evdev joydev ipmi_msghandler button sg xt_ipvs xt_conntrack nfsd nf_conntrack_ftp ip_vs_wrr ip_vs_wlc ip_vs_sh ip_vs_rr ip_vs_ftp nf_nat ip_vs nf_conntrack nfs_acl
4,1852,61665357851,c; auth_rpcgss lockd nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter grace bridge drm sunrpc stp llc dm_mod loop efi_pstore configfs ip_tables x_tables autofs4 zfs(POE) spl(OE) efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci libahci xhci_pci ixgbe libata xhci_hcd xfrm_algo crct10dif_pclmul mdio_devres crct10dif_common scsi_mod libphy crc32_pclmul crc32c_intel usbcore igb mdio scsi_common i2c_piix4 i2c_algo_bit dca usb_common
4,1853,61665367202,-;---[ end trace 0000000000000000 ]---
4,1854,61666622822,-;clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1283089391 wd_nsec: 1283089400
3,1855,61666629562,-;pstore: backend (erst) writing error (-28)
4,1856,61666630657,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1857,61666631542,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 88 5c 86 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 75 5c 86 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1858,61666633198,-;RSP: 0018:ffffbf03e506b988 EFLAGS: 00010202
4,1859,61666633972,-;RAX: 0057ffffe000010b RBX: ffffe8d928ecc800 RCX: 0000000000000002
4,1860,61666634800,-;RDX: ffffe8d928ecc800 RSI: ffffe8d965641c80 RDI: ffff9f0b087f4ee0
4,1861,61666635519,-;RBP: ffff9f0b087f4ee0 R08: 0000000000000000 R09: ffff9f15544945d8
4,1862,61666636226,-;R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000002
4,1863,61666636881,-;R13: ffffe8d965641c80 R14: ffffbf03e506ba30 R15: ffffe8d928ecc800
4,1864,61666637607,-;FS: 00007fefcc5a0740(0000) GS:ffff9f2a1fbc0000(0000) knlGS:0000000000000000
4,1865,61666638381,-;CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1866,61666639192,-;CR2: 00000000012f9800 CR3: 0000001840d7a000 CR4: 00000000003506f0
0,1867,61666640050,-;Kernel panic - not syncing: Fatal exception
Message from syslogd@n0044 at May 31 13:00:25 ...
kernel:[61666.640050] Kernel panic - not syncing: Fatal exception
@stephan2012 Thanks again for trying the new patch. Unfortunately, I don't expect to be able to look at this closely for maybe a couple of weeks. Hopefully someone else and/or one of the ZFS devs can pick up where I left off or identify something I missed. Otherwise, I'll try to help however/whenever I'm able. If you have the chance in the meantime, it may be helpful to double-confirm that you're building the intended code. The version number in your previous check unfortunately didn't include the Git hash (presumably the source was downloaded as a ZIP file, so the ZFS build just didn't know what the hash should be, but I'd like to eliminate any remaining uncertainty, if we can). Here are some example commands to check out my branch using Git:
tim@ubuntu2310-test:~/temp$ git clone https://github.com/openzfs/zfs.git
Cloning into 'zfs'...
remote: Enumerating objects: 190864, done.
remote: Counting objects: 100% (164/164), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 190864 (delta 63), reused 88 (delta 30), pack-reused 190700
Receiving objects: 100% (190864/190864), 127.21 MiB | 3.57 MiB/s, done.
Resolving deltas: 100% (139911/139911), done.
tim@ubuntu2310-test:~/temp$ cd zfs
tim@ubuntu2310-test:~/temp/zfs$ git remote add tstabrawa https://github.com/tstabrawa/zfs.git
tim@ubuntu2310-test:~/temp/zfs$ git fetch tstabrawa
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 14 (delta 10), reused 14 (delta 10), pack-reused 0
Unpacking objects: 100% (14/14), 2.78 KiB | 203.00 KiB/s, done.
From https://github.com/tstabrawa/zfs
* [new branch] fix_15140 -> tstabrawa/fix_15140
* [new branch] master -> tstabrawa/master
tim@ubuntu2310-test:~/temp/zfs$ git checkout fix_15140
branch 'fix_15140' set up to track 'tstabrawa/fix_15140'.
Switched to a new branch 'fix_15140'
tim@ubuntu2310-test:~/temp/zfs$ git describe
zfs-2.2.99-517-g778fe7923
tim@ubuntu2310-test:~/temp/zfs$
From there, if you build, install, and load the ZFS driver, you ought to see the full version string (including the Git hash) when you repeat that check.
@numinit You mentioned earlier that you have a reliable way to reproduce the issue. If you still have the time to check, it may be helpful to know if that is still the case with the patched ZFS driver.
Thanks all, and sorry I wasn't able to solve this so far.
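For anyone following the build-and-load step mentioned above, one plausible sequence is sketched here; it assumes an out-of-tree build from the checked-out branch, and the exact steps will differ per distro (DKMS, packaged modules, ZFS on root, etc.):
# Build and install the checked-out branch (sketch only; needs the usual ZFS build dependencies)
cd ~/temp/zfs
sh autogen.sh && ./configure && make -j"$(nproc)"
sudo make install && sudo depmod -a
# Reboot (or reload the zfs module if no pool is in use), then confirm the git hash:
dmesg | grep -i "ZFS: Loaded module"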
I was experiencing this issue with an Ubuntu kernel (6.8.0-40-generic) and ZFS 2.2.4 where a script was doing:
As this was being executed during boot before starting guest VMs (under Xen), I ended up with a system in a boot/crash loop. I've applied the patch from 778fe79 to ZFS 2.2.5 and corrected a couple of apparent typos:
Running those commands in isolation no longer resulted in a crash. I'll put the original boot script back in place for some additional testing and report back if that doesn't work.
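In case it helps others repeat this kind of test, a rough sketch of applying such a patch to a DKMS-managed ZFS source tree follows; the source path, version, and patch file name are assumptions, not taken from this thread:
# Assumed paths/names -- adjust to your installed zfs-dkms version
sudo patch -d /usr/src/zfs-2.2.5 -p1 < fix_15140.patch
sudo dkms build zfs/2.2.5 && sudo dkms install zfs/2.2.5 --force
# Rebooting is the safest way to pick up the rebuilt module when ZFS hosts the root filesystem.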
Linux page migration code won't wait for writeback to complete unless it needs to call release_folio. Call SetPagePrivate wherever PageUptodate is set and define .release_folio, to cause fallback_migrate_folio to wait for us. Signed-off-by: Tim Stabrawa <59430211+tstabrawa@users.noreply.github.com> Closes openzfs#15140
@tstabrawa
oh ok, that's not possible because we need to call btw i found a maybe simpler way? you can call |
i also realized it is still possible to set |
All: Sorry for being away so long. Busy summer, I guess. @JKDingwall: Thanks for catching the invalidatepage-related typos. I admit I didn't actually try building against 3.x kernels, though I was trying to make the patch as backward-compatible as possible. I pushed a commit (59c2ab1) that should fix the typos. Let me know if you think I missed anything. @yshui: Thanks for the suggestions. I'll try to address them one at a time below:
Agreed. I think you could probably achieve something similar to
This would work, and I did consider it, but unfortunately
I haven't investigated this particular setting, but if it works as you describe, it may be a useful workaround for people facing this crash until a fix can get merged.
@JKDingwall: How has your follow-on testing gone? Has the crash returned at all since applying the patch?
It's only a single test system that does the compaction, but I've not had any further problems since adding this patch. I have lots of other systems with the same ZFS build that don't do the compaction, and they have also been fine.
This is not fixed; I made the wrong assumption that it was and migrated back to ZFS. As soon as I executed the usual command:
Please reopen this. I'll test with 6.10 shortly, as soon as I compile it. zfs-dkms 2.2.6.
I can confirm the issue is still present even on 6.10.14, the most recent kernel supported by this ZFS build:
The bug is sadly still alive and well.
It seems that some code paths leading to this issue have not been completely covered yet; from what I'm seeing, the stack trace is still identical to my earlier reports. Not sure why it's so hard to reproduce on other systems.
System information
Describe the problem you're observing
While executing the prepare script for a QEMU virtual machine on any kernel from 6.3.1 up to the latest 6.4.7, the script crashes with the following stack trace (this log is from a crash on 6.3.9, but I have tested both extremes of that range and the error is almost the same as below):
I had my previous installation on top of BtrFS available before switching to ZFS as root, so I could test the same thing under another filesystem. This does not happen there on the latest kernel.
I'm using a zvol as the backing store for the VM; the libvirt XML is also attached below, with minor fields redacted.
Describe how to reproduce the problem
vboxdrv vboxnetadp vboxnetflt
)
sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
It seems that the combination of drop_caches and compact_memory, ZFS plus this commit, and having the VirtualBox driver loaded somehow clash with each other and cause a kernel oops.
I'm using the above for managing hugepage allocation for the VM's memory.
With anything older than 6.3 the code works perfectly fine and I can boot up the VM every time with no issues. I'm currently using 6.1 LTS, and I have no problems with the VM itself.
The above is needed to be able to allocate 1 GB hugepages correctly; otherwise, after the first boot-up the memory is too fragmented to allocate 1 GB chunks without compacting it first, and the VM fails to boot properly. This sometimes causes host system instability and issues with the host shutting down cleanly.
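To illustrate the fragmentation problem described above, this sketch uses the standard sysfs/procfs hugepage interfaces; the page count of 16 is arbitrary:
# Try to reserve 16 x 1 GiB hugepages at runtime...
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# ...then check how many the kernel actually managed to allocate.
# On a fragmented system this often falls short unless memory is compacted first.
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
grep -E 'HugePages_(Total|Free)' /proc/meminfo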
qemu.tar.gz
GamingWin11.tar.gz