
Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 to 6.10.14 #15140

Closed

RodoMa92 opened this issue Aug 2, 2023 · 96 comments

Labels: Component: Encryption ("native encryption" feature), Type: Defect (Incorrect behavior, e.g. crash, hang)

@RodoMa92 commented Aug 2, 2023

System information

Type                  Version/Name
Distribution Name     Arch Linux
Distribution Version  updated to latest
Kernel Version        6.6.9-arch1-1
Architecture          x86_64
OpenZFS Version       zfs-kmod-2.2.2-1

Describe the problem you're observing

While executing the prepare script for a QEMU virtual machine on any kernel from 6.3.1 up to the latest 6.4.7, the script crashes with the following stack trace (this log is from a crash on 6.3.9, but I have tested both extremes of that range and the error is almost identical):

[ 2682.534320] bash (54689): drop_caches: 3
[ 2682.624207] ------------[ cut here ]------------
[ 2682.624211] kernel BUG at mm/migrate.c:662!
[ 2682.624219] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 2682.624223] CPU: 2 PID: 54689 Comm: bash Tainted: P           OE      6.3.9-arch1-1 #1 124dc55df4f5272ccb409f39ef4872fc2b3376a2
[ 2682.624226] Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 5102 05/31/2023
[ 2682.624228] RIP: 0010:migrate_folio_extra+0x6c/0x70
[ 2682.624234] Code: de 48 89 ef e8 35 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 e7 6d 9d 00 e8 22 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 d4 6d 9d 00 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
[ 2682.624236] RSP: 0018:ffffb4685b5038f8 EFLAGS: 00010282
[ 2682.624238] RAX: 02ffff0000008025 RBX: ffffd9f684f02740 RCX: 0000000000000002
[ 2682.624240] RDX: ffffd9f684f02740 RSI: ffffd9f68d958dc0 RDI: ffff99d8d1cfe728
[ 2682.624241] RBP: ffff99d8d1cfe728 R08: 0000000000000000 R09: 0000000000000000
[ 2682.624242] R10: ffffd9f68d958dc8 R11: 0000000004020000 R12: ffffd9f68d958dc0
[ 2682.624243] R13: 0000000000000002 R14: ffffd9f684f02740 R15: ffffb4685b5039b8
[ 2682.624245] FS:  00007f78b8182740(0000) GS:ffff99de9ea80000(0000) knlGS:0000000000000000
[ 2682.624246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2682.624248] CR2: 00007fe9a0001960 CR3: 000000011e406000 CR4: 00000000003506e0
[ 2682.624249] Call Trace:
[ 2682.624251]  <TASK>
[ 2682.624253]  ? die+0x36/0x90
[ 2682.624258]  ? do_trap+0xda/0x100
[ 2682.624261]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624263]  ? do_error_trap+0x6a/0x90
[ 2682.624266]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624268]  ? exc_invalid_op+0x50/0x70
[ 2682.624271]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624273]  ? asm_exc_invalid_op+0x1a/0x20
[ 2682.624278]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624280]  move_to_new_folio+0x136/0x150
[ 2682.624283]  migrate_pages_batch+0x913/0xd30
[ 2682.624285]  ? __pfx_compaction_free+0x10/0x10
[ 2682.624289]  ? __pfx_remove_migration_pte+0x10/0x10
[ 2682.624292]  migrate_pages+0xc61/0xde0
[ 2682.624295]  ? __pfx_compaction_alloc+0x10/0x10
[ 2682.624296]  ? __pfx_compaction_free+0x10/0x10
[ 2682.624300]  compact_zone+0x865/0xda0
[ 2682.624303]  compact_node+0x88/0xc0
[ 2682.624306]  sysctl_compaction_handler+0x46/0x80
[ 2682.624308]  proc_sys_call_handler+0x1bd/0x2e0
[ 2682.624312]  vfs_write+0x239/0x3f0
[ 2682.624316]  ksys_write+0x6f/0xf0
[ 2682.624317]  do_syscall_64+0x60/0x90
[ 2682.624322]  ? syscall_exit_to_user_mode+0x1b/0x40
[ 2682.624324]  ? do_syscall_64+0x6c/0x90
[ 2682.624327]  ? syscall_exit_to_user_mode+0x1b/0x40
[ 2682.624329]  ? exc_page_fault+0x7c/0x180
[ 2682.624330]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 2682.624333] RIP: 0033:0x7f78b82f5bc4
[ 2682.624355] Code: 15 99 11 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 80 3d 3d 99 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[ 2682.624356] RSP: 002b:00007ffd9d25ed18 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2682.624358] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f78b82f5bc4
[ 2682.624359] RDX: 0000000000000002 RSI: 000055c97c5f05c0 RDI: 0000000000000001
[ 2682.624360] RBP: 000055c97c5f05c0 R08: 0000000000000073 R09: 0000000000000001
[ 2682.624362] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 2682.624363] R13: 00007f78b83d86a0 R14: 0000000000000002 R15: 00007f78b83d3ca0
[ 2682.624365]  </TASK>
[ 2682.624366] Modules linked in: vhost_net vhost vhost_iotlb tap tun snd_seq_dummy snd_hrtimer snd_seq xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bridge stp llc intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_codec_hdmi snd_usb_audio btusb btrtl snd_hda_intel btbcm snd_intel_dspcfg crct10dif_pclmul btintel crc32_pclmul snd_intel_sdw_acpi btmtk vfat polyval_clmulni snd_usbmidi_lib polyval_generic fat snd_hda_codec ext4 gf128mul snd_rawmidi eeepc_wmi bluetooth ghash_clmulni_intel snd_hda_core sha512_ssse3 asus_wmi snd_seq_device aesni_intel mc ledtrig_audio snd_hwdep crc32c_generic crypto_simd snd_pcm sparse_keymap crc32c_intel igb ecdh_generic platform_profile sp5100_tco cryptd snd_timer mbcache rapl rfkill wmi_bmof pcspkr dca asus_wmi_sensors snd i2c_piix4 zenpower(OE) ccp
[ 2682.624417]  jbd2 crc16 soundcore gpio_amdpt gpio_generic mousedev acpi_cpufreq joydev mac_hid dm_multipath i2c_dev crypto_user loop fuse dm_mod bpf_preload ip_tables x_tables usbhid zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) nouveau nvme nvme_core xhci_pci nvme_common xhci_pci_renesas vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd amdgpu i2c_algo_bit drm_ttm_helper ttm mxm_wmi video wmi drm_buddy gpu_sched drm_display_helper cec
[ 2682.624456] ---[ end trace 0000000000000000 ]---
[ 2682.624457] RIP: 0010:migrate_folio_extra+0x6c/0x70
[ 2682.624461] Code: de 48 89 ef e8 35 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 e7 6d 9d 00 e8 22 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 d4 6d 9d 00 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
[ 2682.624463] RSP: 0018:ffffb4685b5038f8 EFLAGS: 00010282
[ 2682.624465] RAX: 02ffff0000008025 RBX: ffffd9f684f02740 RCX: 0000000000000002
[ 2682.624466] RDX: ffffd9f684f02740 RSI: ffffd9f68d958dc0 RDI: ffff99d8d1cfe728
[ 2682.624467] RBP: ffff99d8d1cfe728 R08: 0000000000000000 R09: 0000000000000000
[ 2682.624469] R10: ffffd9f68d958dc8 R11: 0000000004020000 R12: ffffd9f68d958dc0
[ 2682.624470] R13: 0000000000000002 R14: ffffd9f684f02740 R15: ffffb4685b5039b8
[ 2682.624472] FS:  00007f78b8182740(0000) GS:ffff99de9ea80000(0000) knlGS:0000000000000000
[ 2682.624473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2682.624475] CR2: 00007fe9a0001960 CR3: 000000011e406000 CR4: 00000000003506e0

I still had my previous BtrFS installation available from before switching to ZFS as root, so I could test the same thing under another filesystem. The crash does not happen there on the latest kernel.

I'm using a zvol as the backing store for the VM; the libvirt XML is also attached below, with minor fields redacted.

Describe how to reproduce the problem

  1. Use a kernel >= 6.3.1
  2. Load the VirtualBox kernel drivers (vboxdrv, vboxnetadp, vboxnetflt)
  3. Have an encrypted dataset mounted on the system and a zvol created on that filesystem (not sure if both are needed as a precondition)
  4. Execute this command as root:
    sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
  5. Crash is triggered

It seems that the combination of drop_caches and compact_memory somehow interacts with ZFS (together with this commit and the loaded VirtualBox drivers) and causes a kernel oops.

I'm using the above to manage hugepage allocation for VM memory.

On anything older than 6.3 the code works perfectly fine and I can boot up the VM every time with no issues. I'm currently using 6.1 LTS, and I have no problems with the VM itself.

The above is needed to be able to allocate 1 GB hugepages correctly; otherwise, after the first bootup, memory is too fragmented to allocate 1 GB chunks without compacting it first, and the VM fails to boot properly. This sometimes causes host system instability and issues with the host shutting down cleanly.
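
For reference, the full sequence looks roughly like this (a sketch; the page count is illustrative and the sysfs path assumes 1 GB hugepages):

    sync
    echo 3 > /proc/sys/vm/drop_caches        # drop page cache, dentries, and inodes
    echo 1 > /proc/sys/vm/compact_memory     # defragment free memory into contiguous blocks
    echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages   # reserve 16 x 1 GB pages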

qemu.tar.gz
GamingWin11.tar.gz

RodoMa92 added the "Type: Defect (Incorrect behavior, e.g. crash, hang)" label Aug 2, 2023
@rincebrain (Contributor)

The timing of when that breaks makes me wonder about something related to https://lwn.net/Articles/937943/, but who knows.

@rincebrain (Contributor)

Also, as an aside, can't you tell the kernel at boot time to pre-reserve 1G hugepages for you?

Not that this isn't a bug, but just as a workaround for your use case atm.
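
(For reference, that would be kernel boot parameters roughly like the following; the page count is illustrative:)

    default_hugepagesz=1G hugepagesz=1G hugepages=16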

@RodoMa92 (Author) commented Aug 2, 2023

Also, as an aside, can't you tell the kernel at boot time to pre-reserve 1G hugepages for you?

Not that this isn't a bug, but just as a workaround for your use case atm.

Yeah, sure, but then the memory is locked. I would like to keep using it if the VM is not in use. I rarely use Windows anyway these days :P

For now remaining on 6.1 LTS is good enough, at least until someone can track down this issue.

I've noticed that now even just the initial call crashes the kernel; it might be related to the changes in the mm subsystem. But that's just speculation on my part, since I can't reproduce the issue with the stock kernel in the same way (not that this excludes a kernel bug either, to be clear).

Testing mainline shortly to check if they have already fixed this issue.

@rincebrain (Contributor)

What do you mean, you can't reproduce it with stock? It works on vanilla 6.3.x/6.4.x but not Arch's patchset?

@RodoMa92 (Author) commented Aug 2, 2023

It works fine without zfs-dkms installed. It doesn't work with zfs-dkms installed.

@rincebrain (Contributor)

Does it break with ZFS loaded and no pools imported?

@RodoMa92 (Author) commented Aug 2, 2023

Does it break with ZFS loaded and no pools imported?

Can try it shortly on my old BtrFS install; I'll finish testing mainline first :P

@RodoMa92 (Author) commented Aug 2, 2023

Well, heck, dkms doesn't build. Testing with just the module loaded now.

@RodoMa92 (Author) commented Aug 2, 2023

Interesting, without mounting my zfs encrypted root drive it doesn't seem to trigger it. I'll do further testing tomorrow.

@numinit (Contributor) commented Aug 3, 2023

I can reproduce this too. Good find.

@numinit (Contributor) commented Aug 3, 2023

BTW I think the issue has to do with compacting memory. I can still reserve hugepages just fine.

@RodoMa92 (Author) commented Aug 3, 2023

BTW I think the issue has to do with compacting memory. I can still reserve hugepages just fine.

Yeah, the trigger is definitely memory compact, not hugepage allocation by itself. Good to know that I'm not the only one with this issue :)

@RodoMa92 (Author) commented Aug 6, 2023

Starting bisect now, I'll update it shortly if I can find something from it. At least it should narrow down the possible changes.

@RodoMa92 (Author) commented Aug 7, 2023

I finally have the result of the bisection:

5dfab109d5193e6c224d96cabf90e9cc2c039884 is the first bad commit
commit 5dfab109d5193e6c224d96cabf90e9cc2c039884
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:40 2023 +0800

    migrate_pages: batch _unmap and _move
    
    In this patch the _unmap and _move stage of the folio migration is
    batched.  That for, previously, it is,
    
      for each folio
        _unmap()
        _move()
    
    Now, it is,
    
      for each folio
        _unmap()
      for each folio
        _move()
    
    Based on this, we can batch the TLB flushing and use some hardware
    accelerator to copy folios between batched _unmap and batched _move
    stages.
    
    Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

 mm/migrate.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 189 insertions(+), 25 deletions(-)

I'll try to revert this on top of the latest kernel to see if it still works fine.
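
(Roughly, something like the following, assuming the revert still applies cleanly:)

    git revert 5dfab109d5193e6c224d96cabf90e9cc2c039884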

git bisect log here

git bisect start
# status: waiting for both good and bad commits
# good: [a5c95ca18a98d742d0a4a04063c32556b5b66378] Merge tag 'drm-next-2023-02-23' of git://anongit.freedesktop.org/drm/drm
git bisect good a5c95ca18a98d742d0a4a04063c32556b5b66378
# status: waiting for bad commit, 1 good commit known
# bad: [3822a7c40997dc86b1458766a3f146d62393f084] Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 3822a7c40997dc86b1458766a3f146d62393f084
# good: [620932cd285208ef3009ac338b1eeed13ccd1753] mm/damon/dbgfs: print DAMON debugfs interface deprecation message
git bisect good 620932cd285208ef3009ac338b1eeed13ccd1753
# good: [7482c19173b7eb044d476b3444d7ee55bc669d03] selftests: arm64: Fix incorrect kernel headers search path
git bisect good 7482c19173b7eb044d476b3444d7ee55bc669d03
# good: [81ce2ebd194cf32027854ce1c703b7fd129c86b8] mm/slab.c: cleanup is_debug_pagealloc_cache()
git bisect good 81ce2ebd194cf32027854ce1c703b7fd129c86b8
# good: [65c084d848cd717d5913032dfa9e9c62ed33babd] leds: blinkm: Convert to i2c's .probe_new()
git bisect good 65c084d848cd717d5913032dfa9e9c62ed33babd
# good: [6a60dd2e876913be55e17e53ee57e1fe09448238] perf vendor events arm64: Add TLB metrics for neoverse-n2-v2
git bisect good 6a60dd2e876913be55e17e53ee57e1fe09448238
# good: [869b9eddf0b38a22c27a400e2fa849d2ff2aa7e1] mfd: intel-m10-bmc: Add PMCI driver
git bisect good 869b9eddf0b38a22c27a400e2fa849d2ff2aa7e1
# good: [45204677d427b7d0ed11930bd5be4a42893d1c93] perf symbols: Allow for .plt entries with no symbol
git bisect good 45204677d427b7d0ed11930bd5be4a42893d1c93
# good: [3a396f9859755e822775319516cd71dabc2b4e69] backlight: sky81452: Fix sky81452_bl_platform_data kernel-doc
git bisect good 3a396f9859755e822775319516cd71dabc2b4e69
# good: [a912f5975ffc82d52bbb5937eafe367d44db711c] perf test: Replace legacy `...` with $(...)
git bisect good a912f5975ffc82d52bbb5937eafe367d44db711c
# skip: [2b79eb73e2c4b362a2a261b7b2f718385fb478e4] Merge tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect skip 2b79eb73e2c4b362a2a261b7b2f718385fb478e4
# good: [db95818e888a927456686518880ed0145b1f20ce] perf pmu-events: Add separate metric from pmu_event
git bisect good db95818e888a927456686518880ed0145b1f20ce
# skip: [cd43b5068647f47d6936ffef4d15d99518fcab94] Merge tag 'slab-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
git bisect skip cd43b5068647f47d6936ffef4d15d99518fcab94
# good: [cf1d2ffcc6f17b422239f6ab34b078945d07f9aa] efi: Discover BTI support in runtime services regions
git bisect good cf1d2ffcc6f17b422239f6ab34b078945d07f9aa
# skip: [0df82189bc42037678fa590a77ed0116f428c90d] Merge tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect skip 0df82189bc42037678fa590a77ed0116f428c90d
# good: [1470a108a60e8c0c4d19da10117c9b98f0078654] perf c2c: Add report option to show false sharing in adjacent cachelines
git bisect good 1470a108a60e8c0c4d19da10117c9b98f0078654
# good: [c2d3cf3653a8ff6e4b402d55e7f84790ac08a8ad] selftests: filesystems: Fix incorrect kernel headers search path
git bisect good c2d3cf3653a8ff6e4b402d55e7f84790ac08a8ad
# skip: [d8763154455e92a2ffed256e48fa46bb35ef3bdf] Merge tag 'printk-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
git bisect skip d8763154455e92a2ffed256e48fa46bb35ef3bdf
# good: [1f428356c38dcbe49fd2f1c488b41e88720ead92] rtla: Add hwnoise tool
git bisect good 1f428356c38dcbe49fd2f1c488b41e88720ead92
# bad: [f9366f4c2a29d14f5992b195e268240c2deb116e] include/linux/migrate.h: remove unneeded externs
git bisect bad f9366f4c2a29d14f5992b195e268240c2deb116e
# good: [42012e0436d44aeb2e68f11a28ddd0ad3f38b61f] migrate_pages: restrict number of pages to migrate in batch
git bisect good 42012e0436d44aeb2e68f11a28ddd0ad3f38b61f
# bad: [9325ddf90ec3a801c09da374b74532d4589a7346] m68k/nommu: add missing definition of ARCH_PFN_OFFSET
git bisect bad 9325ddf90ec3a801c09da374b74532d4589a7346
# bad: [6f7d760e86fa84862d749e36ebd29abf31f4f883] migrate_pages: move THP/hugetlb migration support check to simplify code
git bisect bad 6f7d760e86fa84862d749e36ebd29abf31f4f883
# bad: [80562ba0d8378e89fe5836c28ea56c2aab3014e8] migrate_pages: move migrate_folio_unmap()
git bisect bad 80562ba0d8378e89fe5836c28ea56c2aab3014e8
# bad: [5dfab109d5193e6c224d96cabf90e9cc2c039884] migrate_pages: batch _unmap and _move
git bisect bad 5dfab109d5193e6c224d96cabf90e9cc2c039884
# good: [64c8902ed4418317cd416c566f896bd4a92b2efc] migrate_pages: split unmap_and_move() to _unmap() and _move()
git bisect good 64c8902ed4418317cd416c566f896bd4a92b2efc
# first bad commit: [5dfab109d5193e6c224d96cabf90e9cc2c039884] migrate_pages: batch _unmap and _move

@RodoMa92 (Author) commented Aug 7, 2023

LMK how this should proceed and to whom it needs to be reported (whether it's purely a ZFS bug, or whether it can affect other parts of the kernel and needs to be reported directly to the Linux developers).

@RodoMa92 (Author) commented Aug 7, 2023

Yeah, unfortunately too many changes have been applied to the mm subsystem since then, so I really can't easily revert that commit on top of the latest Linux kernel. Please let me know how this should proceed.

@satmandu (Contributor) commented Aug 15, 2023

Shouldn't this be reported as a kernel bug, especially as there is a simple reproducer? Does this occur with the 6.5-rc kernels too?

@RodoMa92 (Author) commented Aug 15, 2023

I can't reproduce it without encrypted ZFS on root, so I can't prove it's a kernel regression. Highly likely if you ask me, but that's just my opinion at this point.

RodoMa92 changed the title from "Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 onwards up to 6.4.7" to "Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 onwards up to 6.4.10" Aug 15, 2023
@RodoMa92 (Author)

Just tested 10 bootup and shutdown cycles with ZFS loaded (not on root) and a non-encrypted dataset imported, and this didn't cause any issues. If someone from OpenZFS could take a look at why calling drop_caches causes a kernel oops with ZFS on encrypted root, it would be appreciated.

@igrekster commented Aug 15, 2023

It also happens on a non-encrypted root, but with an encrypted dataset present. So it looks like it is related to encryption.

@RodoMa92 (Author)

Thanks a lot for the report, this at least narrows down the area.

@numinit (Contributor) commented Aug 17, 2023

I have encrypted datasets too and can confirm this happens.

RodoMa92 changed the title from "Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 onwards up to 6.4.10" to "Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 onwards up to 6.5.1" Sep 3, 2023
@RodoMa92 (Author) commented Sep 3, 2023

This is still an issue on the latest main 6.5.1.

Can anyone from the team try to debug what's going wrong with encryption on the latest kernels?

Steps to repro:

  1. Use a kernel >= 6.3.1
  2. Have an encrypted dataset mounted on the system
  3. Execute this command as root:
    echo 3 > /proc/sys/vm/drop_caches

Marco.

rincebrain added the "Component: Encryption (native encryption feature)" label Sep 3, 2023
@ipaqmaster commented Sep 4, 2023

Also experiencing this one in my own vfio script lately. Writing 3 > drop_caches and then 1 > compact_memory consistently results in a kernel oops: kernel BUG at mm/migrate.c:656! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI.

tstabrawa added a commit to tstabrawa/zfs that referenced this issue May 21, 2024
Linux page migration code won't wait for writeback to complete unless
it needs to call release_folio.  Call SetPagePrivate from
zpl_readpage_common and define .release_folio, to cause
fallback_migrate_folio to wait for us.

Signed-off-by: Tim Stabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes openzfs#15140
@tstabrawa (Contributor)

Hi all, not a ZFS dev here, but I heard about this bug from the Proxmox release notes, and gave it a good long look over. As far as I can tell, what's happening is that the Linux kernel page migration code is starting writeback on some pages, not waiting for writeback to complete, and then throwing a BUG when it finds that pages are still under writeback.

Pretty much all of the interesting action happens in fallback_migrate_folio(), which doesn't show up in your stack traces, but suffice it to say that it's called from move_to_new_folio(), which does appear in the stack traces. What appears to be happening in the case of the crashes described here is that fallback_migrate_folio() is being called upon dirty ZFS page-cache pages, so it's starting writeback by calling writeout(). Then, since ZFS doesn't store private data in any page cache pages, it skips the call to filemap_release_folio() (because folio_test_private() returns false), and immediately calls migrate_folio(), which in turn calls migrate_folio_extra(). Then, at the beginning of migrate_folio_extra(), it BUGs out because the page is still under writeback (folio_test_writeback() returns true).

Notably, if the page did have private data, then fallback_migrate_folio() would call into filemap_release_folio(), which would return false for pages under writeback, causing fallback_migrate_folio() to exit before calling migrate_folio().
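
For illustration, the fallback path looks roughly like this (a paraphrased, simplified sketch of mm/migrate.c, not the verbatim kernel source):

static int fallback_migrate_folio(struct address_space *mapping,
		struct folio *dst, struct folio *src, enum migrate_mode mode)
{
	if (folio_test_dirty(src)) {
		/* First pass: start asynchronous writeback and bail out;
		 * migration retries the folio later. */
		if (mode != MIGRATE_SYNC)
			return -EBUSY;
		return writeout(mapping, src);
	}

	/*
	 * On the retry, the folio is clean but may still be under writeback.
	 * filemap_release_folio() would return false for such a folio, but
	 * it is only reached when the folio has private data...
	 */
	if (folio_test_private(src) &&
	    !filemap_release_folio(src, GFP_KERNEL))
		return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;

	/* ...so without private data we fall straight through to
	 * migrate_folio() -> migrate_folio_extra(), which hits
	 * BUG_ON(folio_test_writeback(src)). */
	return migrate_folio(mapping, dst, src, mode);
}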

So, in summary, in order for the BUG to happen a few things need to be true:

  • Dirty pages are being migrated (or in the case of OP, compacted)
  • The filesystem does asynchronous writeback (calls to its .writepage function return with the page unlocked and with PG_writeback set)
  • The filesystem does not store private data with page cache pages (e.g. buffers)

I went through the code for all of the filesystems in the Linux kernel and didn't see any that met all three conditions. Notably, pretty much all traditional filesystems store buffers in page private data. Those filesystems that don't store buffers either store something else in page_private (e.g. shmem/tmpfs, iomap), or don't do asynchronous writeback (e.g. ecryptfs, fuse, romfs, squashfs). So it would appear as if ZFS may be the only filesystem that experiences this particular behavior.

Also, I wasn't able to identify anything special about kernel 6.3.1 that would cause this BUG to happen. As far as I can tell, the above-described behavior goes back all the way to when page migration was first implemented in kernel 2.6.16.

The way I see it, there are two ways to make the problem go away:

  • Change the Linux kernel so that fallback_migrate_folio() won't call migrate_folio() if the page is under writeback, even for pages without private data.
  • Change ZFS so that it stores some private data (or at least indicates as if it does)

I assume the latter may be preferable (even if only temporarily) so that ZFS can avoid this crash for any/all kernel versions, but I'm happy to defer to the ZFS devs on which option(s) they choose to pursue.

The latter is the approach I took in the patch on my fix_15140 branch.
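
In outline, the idea is along these lines (an illustrative sketch only, not the actual patch; the hook name zpl_release_folio and call sites are placeholders):

/* When a page is filled and marked uptodate, also mark it as
 * holding private data: */
SetPagePrivate(page);

/* And register a release hook in the address_space_operations table.
 * With PG_private set, fallback_migrate_folio() now goes through
 * filemap_release_folio(), which returns false for folios still under
 * writeback, making migration back off instead of hitting the BUG. */
static bool zpl_release_folio(struct folio *folio, gfp_t gfp)
{
	folio_clear_private(folio);
	return true;
}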

Would one of you who has a reliable way to reproduce the problem please give this patch a try? It otherwise passes all of the tests in the ZFS test suite (or at least, all of the tests that pass without the patch), so once I have confirmation that it fixes the problem, I could submit it as a PR.

@RodoMa92 (Author)

Well, thanks a lot for trying to improve this! As already said above, I no longer have this issue since I've switched to another filesystem. I hope that someone above will test it and report back, though.

@stephan2012

@tstabrawa Thank you so much for your effort in fixing this issue! Much appreciated!

I have just checked out your branch and installed your fix. The system is up for around five minutes. I will check back later.

@stephan2012

@tstabrawa The system that has experienced the kernel panic before has been up for over three hours. I will keep an eye on it for the next few days.

@numinit (Contributor) commented May 22, 2024

I was able to repro this every time I flushed caches for VM stuff... let me check... :-)

@stephan2012

@tstabrawa Another panic, unfortunately.

4,1821,54929238000,-;------------[ cut here ]------------
2,1822,54929238528,-;kernel BUG at mm/migrate.c:664!
4,1823,54929238929,-;invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
4,1824,54929239314,-;CPU: 17 PID: 2566 Comm: numad Tainted: P           OE      6.6.13+bpo-amd64 #1  Debian 6.6.13-1~bpo12+1
4,1825,54929239715,-;Hardware name: Supermicro Super Server/H11SSL-i, BIOS 1.3 06/25/2019
4,1826,54929240120,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1827,54929240538,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 18 b5 85 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 05 b5 85 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1828,54929241408,-;RSP: 0018:ffffbaedefd17880 EFLAGS: 00010202
4,1829,54929241879,-;RAX: 0057ffffc000010b RBX: ffffe0d324faff80 RCX: 0000000000000002
4,1830,54929242338,-;RDX: ffffe0d324faff80 RSI: ffffe0d340b51e00 RDI: ffff9b56fe4e1e10
4,1831,54929242832,-;RBP: ffff9b56fe4e1e10 R08: 0000000000000000 R09: 0000000000038120
4,1832,54929243340,-;R10: ffffe0d340b51e08 R11: 0000000000000000 R12: 0000000000000002
4,1833,54929243853,-;R13: ffffe0d340b51e00 R14: ffffe0d324faff80 R15: ffffbaedefd17928
4,1834,54929244373,-;FS:  00007fd9bbc70740(0000) GS:ffff9b4ebf140000(0000) knlGS:0000000000000000
4,1835,54929244909,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1836,54929245448,-;CR2: 00007f65daa22838 CR3: 0000001901352000 CR4: 00000000003506e0
4,1837,54929245997,-;Call Trace:
4,1838,54929246538,-; <TASK>
4,1839,54929247072,-; ? die+0x36/0x90
4,1840,54929247618,-; ? do_trap+0xda/0x100
4,1841,54929248160,-; ? migrate_folio_extra+0x6b/0x70
4,1842,54929248709,-; ? do_error_trap+0x6a/0x90
4,1843,54929249253,-; ? migrate_folio_extra+0x6b/0x70
4,1844,54929249805,-; ? exc_invalid_op+0x50/0x70
4,1845,54929250358,-; ? migrate_folio_extra+0x6b/0x70
4,1846,54929250917,-; ? asm_exc_invalid_op+0x1a/0x20
4,1847,54929251485,-; ? migrate_folio_extra+0x6b/0x70
4,1848,54929252050,-; move_to_new_folio+0x138/0x140
4,1849,54929252624,-; migrate_pages_batch+0x865/0xbe0
4,1850,54929253203,-; ? __pfx_remove_migration_pte+0x10/0x10
4,1851,54929253783,-; migrate_pages+0xc1b/0xd60
4,1852,54929254357,-; ? __pfx_alloc_migration_target+0x10/0x10
4,1853,54929254954,-; migrate_to_node+0xfd/0x140
4,1854,54929255554,-; do_migrate_pages+0x210/0x2b0
4,1855,54929256151,-; kernel_migrate_pages+0x425/0x490
4,1856,54929256755,-; __x64_sys_migrate_pages+0x1d/0x30
4,1857,54929257353,-; do_syscall_64+0x5f/0xc0
4,1858,54929257951,-; ? srso_return_thunk+0x5/0x10
4,1859,54929258544,-; ? sched_setaffinity+0x1a9/0x230
4,1860,54929259138,-; ? srso_return_thunk+0x5/0x10
4,1861,54929259745,-; ? exit_to_user_mode_prepare+0x40/0x1e0
4,1862,54929260307,-; ? srso_return_thunk+0x5/0x10
4,1863,54929260893,-; ? syscall_exit_to_user_mode+0x2b/0x40
4,1864,54929261448,-; ? srso_return_thunk+0x5/0x10
4,1865,54929261998,-; ? do_syscall_64+0x6b/0xc0
4,1866,54929262563,-; ? srso_return_thunk+0x5/0x10
4,1867,54929263131,-; ? syscall_exit_to_user_mode+0x2b/0x40
4,1868,54929263703,-; ? srso_return_thunk+0x5/0x10
4,1869,54929264266,-; ? do_syscall_64+0x6b/0xc0
4,1870,54929264828,-; ? do_syscall_64+0x6b/0xc0
4,1871,54929265372,-; entry_SYSCALL_64_after_hwframe+0x6e/0xd8
4,1872,54929265910,-;RIP: 0033:0x7fd9bbd74719
4,1873,54929266438,-;Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
4,1874,54929267515,-;RSP: 002b:00007fff544e9a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000100
4,1875,54929268048,-;RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd9bbd74719
4,1876,54929268583,-;RDX: 000055f1fc8e6e10 RSI: 0000000000000005 RDI: 0000000000005e4e
4,1877,54929269113,-;RBP: 000055f1fc8e74f0 R08: 000055f1fc8d4e50 R09: 00007fff544e9f00
4,1878,54929269641,-;R10: 000055f1fc8e5c40 R11: 0000000000000246 R12: 000055f1fc8e5c40
4,1879,54929270168,-;R13: 0000000000000002 R14: 0000000000000002 R15: 0000000000000008
4,1880,54929270708,-; </TASK>
4,1881,54929271223,-;Modules linked in: nls_ascii nls_cp437 vfat fat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 ip_set xt_mark cls_bpf sch_ingress vxlan ip6_udp_tunnel udp_tunnel xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_filter iptable_raw iptable_mangle iptable_nat xt_CT dummy xt_comment veth xt_nat xt_tcpudp nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_compat nf_tables nfnetlink ceph libceph fscache netfs scsi_transport_iscsi nvme_fabrics nvme_core overlay binfmt_misc intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 xfs aesni_intel ast crypto_simd acpi_ipmi cryptd drm_shmem_helper rapl acpi_cpufreq pcspkr sp5100_tco drm_kms_helper ipmi_si ccp watchdog k10temp ipmi_devintf ipmi_msghandler evdev joydev button sg xt_ipvs xt_conntrack nf_conntrack_ftp ip_vs_wrr ip_vs_wlc ip_vs_sh ip_vs_rr ip_vs_ftp nf_nat nfsd ip_vs nf_conntrack nf_defrag_ipv6
4,1882,54929271381,c; nfs_acl nf_defrag_ipv4 lockd libcrc32c auth_rpcgss crc32c_generic br_netfilter grace bridge drm sunrpc stp dm_mod llc loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 zfs(POE) spl(OE) hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci ixgbe xhci_pci libahci xfrm_algo xhci_hcd libata mdio_devres crct10dif_pclmul igb crct10dif_common libphy usbcore crc32_pclmul scsi_mod crc32c_intel mdio i2c_algo_bit scsi_common usb_common i2c_piix4 dca
4,1883,54929279719,-;---[ end trace 0000000000000000 ]---
4,1884,54929377510,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1885,54929378401,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 18 b5 85 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 05 b5 85 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1886,54929379940,-;RSP: 0018:ffffbaedefd17880 EFLAGS: 00010202
4,1887,54929380703,-;RAX: 0057ffffc000010b RBX: ffffe0d324faff80 RCX: 0000000000000002
4,1888,54929381469,-;RDX: ffffe0d324faff80 RSI: ffffe0d340b51e00 RDI: ffff9b56fe4e1e10
4,1889,54929382261,-;RBP: ffff9b56fe4e1e10 R08: 0000000000000000 R09: 0000000000038120
4,1890,54929383058,-;R10: ffffe0d340b51e08 R11: 0000000000000000 R12: 0000000000000002
4,1891,54929383836,-;R13: ffffe0d340b51e00 R14: ffffe0d324faff80 R15: ffffbaedefd17928
4,1892,54929384628,-;FS:  00007fd9bbc70740(0000) GS:ffff9b4ebf140000(0000) knlGS:0000000000000000
4,1893,54929385430,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1894,54929386242,-;CR2: 00007f65daa22838 CR3: 0000001901352000 CR4: 00000000003506e0

@RodoMa92 (Author)

@tstabrawa Another panic, unfortunately.

[stephan2012's full panic trace, quoted verbatim from the comment above]

Still looks identical. So that might not be the only cause then.

@tstabrawa (Contributor)

@stephan2012 @RodoMa92 Thanks for trying out the patch. I'll have to take another look this weekend, I guess.

It seems bizarre to me that the same BUG can be hit with the patch in place. Would you mind humoring me and confirming that you see ZFS: Loaded module v2.2.99-505_ga2d6c487f in your journalctl output from before the most recent crash? I'm sure you both know what you're doing, but just in case, I want to be sure you're running the patched driver when you see this.

@stephan2012

It seems bizarre to me that the same BUG can be hit with the patch in place. Would you mind humoring me and confirm that you see ZFS: Loaded module v2.2.99-505_ga2d6c487f in your journalctl output from before the most recent crash? I'm sure you both know what you're doing, but just in case, I want to be sure you're running the patched driver when you see this.

Sure. Here we go:

[0|root@n0044:~]# dmesg | grep -i zfs
[    0.000000] Command line: BOOT_IMAGE=/ROOT/debian@/boot/vmlinuz-6.6.13+bpo-amd64 root=ZFS=rpool/ROOT/debian ro consoleblank=0 apparmor=0 group_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=1
[    0.022474] Kernel command line: BOOT_IMAGE=/ROOT/debian@/boot/vmlinuz-6.6.13+bpo-amd64 root=ZFS=rpool/ROOT/debian ro consoleblank=0 apparmor=0 group_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=1
[    5.695756] zfs: module license 'CDDL' taints kernel.
[    5.716912] zfs: module license taints kernel.
[    6.498122] WARNING: ignoring tunable zfs_arc_min (using 0 instead)
[    6.507022] WARNING: ignoring tunable zfs_arc_min (using 0 instead)
[    7.960227] ZFS: Loaded module v2.2.99-1, ZFS pool version 5000, ZFS filesystem version 5

Previously, version 2.2.4-1 (from Debian Backports) was installed.

tstabrawa added a commit to tstabrawa/zfs that referenced this issue May 30, 2024
Linux page migration code won't wait for writeback to complete unless
it needs to call release_folio.  Call SetPagePrivate wherever
PageUptodate is set and define .release_folio, to cause
fallback_migrate_folio to wait for us.

Signed-off-by: Tim Stabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes openzfs#15140
@tstabrawa (Contributor)

@stephan2012 @RodoMa92 @numinit

Sorry for the delay. Much head-scratching ensued, and I was able to identify some potential scenarios where pages could end up in the page cache without having PagePrivate set by my previous changes. My new patch takes a different approach, setting PagePrivate wherever PageUptodate is set, so there should be no way for pages to end up dirty / under writeback without first going through one of these code paths.

Would you please give the new patch (on my fix_15140 branch) a try?

@stephan2012

@tstabrawa Thanks for your work. Much appreciated!

I have compiled and installed your new fix. The system has been up and running for 30 minutes now. Let’s wait for a few days and see what happens.

@stephan2012

@tstabrawa Oops, another panic:

4,1799,61665349822,-;------------[ cut here ]------------
2,1800,61665349833,-;kernel BUG at mm/migrate.c:663!
4,1801,61665349906,-;invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
4,1802,61665349950,-;CPU: 31 PID: 2594 Comm: numad Tainted: P           OE      6.7.12+bpo-amd64 #1  Debian 6.7.12-1~bpo12+1
4,1803,61665350015,-;Hardware name: Supermicro Super Server/H11SSL-i, BIOS 1.3 06/25/2019
4,1804,61665350061,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1805,61665350109,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 88 5c 86 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 75 5c 86 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1806,61665350210,-;RSP: 0018:ffffbf03e506b988 EFLAGS: 00010202
4,1807,61665350254,-;RAX: 0057ffffe000010b RBX: ffffe8d928ecc800 RCX: 0000000000000002
4,1808,61665350296,-;RDX: ffffe8d928ecc800 RSI: ffffe8d965641c80 RDI: ffff9f0b087f4ee0
4,1809,61665350337,-;RBP: ffff9f0b087f4ee0 R08: 0000000000000000 R09: ffff9f15544945d8
4,1810,61665350379,-;R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000002
4,1811,61665350419,-;R13: ffffe8d965641c80 R14: ffffbf03e506ba30 R15: ffffe8d928ecc800
4,1812,61665350459,-;FS:  00007fefcc5a0740(0000) GS:ffff9f2a1fbc0000(0000) knlGS:0000000000000000
4,1813,61665350491,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1814,61665350516,-;CR2: 00000000012f9800 CR3: 0000001840d7a000 CR4: 00000000003506f0
4,1815,61665350545,-;Call Trace:
4,1816,61665350561,-; <TASK>
4,1817,61665350577,-; ? die+0x36/0x90
4,1818,61665350600,-; ? do_trap+0xda/0x100
4,1819,61665350622,-; ? migrate_folio_extra+0x6b/0x70
4,1820,61665350648,-; ? do_error_trap+0x6a/0x90
4,1821,61665350669,-; ? migrate_folio_extra+0x6b/0x70
4,1822,61665350695,-; ? exc_invalid_op+0x50/0x70
4,1823,61665350718,-; ? migrate_folio_extra+0x6b/0x70
4,1824,61665350743,-; ? asm_exc_invalid_op+0x1a/0x20
4,1825,61665350776,-; ? migrate_folio_extra+0x6b/0x70
4,1826,61665350801,-; ? srso_return_thunk+0x5/0x5f
4,1827,61665350824,-; move_to_new_folio+0x138/0x140
4,1828,61665350847,-; migrate_pages_batch+0x874/0xba0
4,1829,61665350876,-; ? __pfx_remove_migration_pte+0x10/0x10
4,1830,61665350905,-; migrate_pages+0xc4b/0xd90
4,1831,61665350927,-; ? __pfx_alloc_migration_target+0x10/0x10
4,1832,61665350961,-; ? srso_return_thunk+0x5/0x5f
4,1833,61665350984,-; ? queue_pages_range+0x6a/0xb0
4,1834,61665351009,-; migrate_to_node+0xf0/0x170
4,1835,61665351041,-; do_migrate_pages+0x1f2/0x260
4,1836,61665351072,-; kernel_migrate_pages+0x425/0x490
4,1837,61665351110,-; __x64_sys_migrate_pages+0x1d/0x30
4,1838,61665351132,-; do_syscall_64+0x63/0x120
4,1839,61665351154,-; ? srso_return_thunk+0x5/0x5f
4,1840,61665351178,-; ? do_syscall_64+0x6f/0x120
4,1841,61665351200,-; entry_SYSCALL_64_after_hwframe+0x73/0x7b
4,1842,61665351227,-;RIP: 0033:0x7fefcc6a4719
4,1843,61665351878,-;Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
4,1844,61665353202,-;RSP: 002b:00007fff4f591a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000100
4,1845,61665353871,-;RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fefcc6a4719
4,1846,61665354539,-;RDX: 0000555c297c3d90 RSI: 0000000000000005 RDI: 000000000000460c
4,1847,61665355192,-;RBP: 0000555c297bbdb0 R08: 0000555c297b7e50 R09: 00007fff4f591f20
4,1848,61665355828,-;R10: 0000555c297c3db0 R11: 0000000000000246 R12: 0000555c297c3db0
4,1849,61665356453,-;R13: 0000000000000003 R14: 0000000000000003 R15: 0000000000000008
4,1850,61665357080,-; </TASK>
4,1851,61665357670,-;Modules linked in: udp_diag inet_diag xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 ip_set xt_CT xt_mark cls_bpf sch_ingress vxlan ip6_udp_tunnel udp_tunnel xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_filter iptable_raw iptable_mangle iptable_nat dummy xt_comment veth ceph libceph fscache netfs xt_nat xt_tcpudp nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_compat nf_tables nfnetlink scsi_transport_iscsi nvme_fabrics nvme_core overlay binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 xfs aesni_intel ast crypto_simd cryptd acpi_ipmi drm_shmem_helper rapl acpi_cpufreq pcspkr drm_kms_helper ipmi_si ccp ipmi_devintf sp5100_tco watchdog k10temp evdev joydev ipmi_msghandler button sg xt_ipvs xt_conntrack nfsd nf_conntrack_ftp ip_vs_wrr ip_vs_wlc ip_vs_sh ip_vs_rr ip_vs_ftp nf_nat ip_vs nf_conntrack nfs_acl
4,1852,61665357851,c; auth_rpcgss lockd nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter grace bridge drm sunrpc stp llc dm_mod loop efi_pstore configfs ip_tables x_tables autofs4 zfs(POE) spl(OE) efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci libahci xhci_pci ixgbe libata xhci_hcd xfrm_algo crct10dif_pclmul mdio_devres crct10dif_common scsi_mod libphy crc32_pclmul crc32c_intel usbcore igb mdio scsi_common i2c_piix4 i2c_algo_bit dca usb_common
4,1853,61665367202,-;---[ end trace 0000000000000000 ]---
4,1854,61666622822,-;clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1283089391 wd_nsec: 1283089400
3,1855,61666629562,-;pstore: backend (erst) writing error (-28)
4,1856,61666630657,-;RIP: 0010:migrate_folio_extra+0x6b/0x70
4,1857,61666631542,-;Code: de 48 89 ef e8 86 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 88 5c 86 00 e8 73 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 75 5c 86 00 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
4,1858,61666633198,-;RSP: 0018:ffffbf03e506b988 EFLAGS: 00010202
4,1859,61666633972,-;RAX: 0057ffffe000010b RBX: ffffe8d928ecc800 RCX: 0000000000000002
4,1860,61666634800,-;RDX: ffffe8d928ecc800 RSI: ffffe8d965641c80 RDI: ffff9f0b087f4ee0
4,1861,61666635519,-;RBP: ffff9f0b087f4ee0 R08: 0000000000000000 R09: ffff9f15544945d8
4,1862,61666636226,-;R10: 0000000000000000 R11: 0000000000001000 R12: 0000000000000002
4,1863,61666636881,-;R13: ffffe8d965641c80 R14: ffffbf03e506ba30 R15: ffffe8d928ecc800
4,1864,61666637607,-;FS:  00007fefcc5a0740(0000) GS:ffff9f2a1fbc0000(0000) knlGS:0000000000000000
4,1865,61666638381,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1866,61666639192,-;CR2: 00000000012f9800 CR3: 0000001840d7a000 CR4: 00000000003506f0
0,1867,61666640050,-;Kernel panic - not syncing: Fatal exception

Message from syslogd@n0044 at May 31 13:00:25 ...
 kernel:[61666.640050] Kernel panic - not syncing: Fatal exception

@tstabrawa (Contributor)

@stephan2012 Thanks again for trying the new patch. Unfortunately, I don't expect to be able to look at this closely for maybe a couple of weeks. Hopefully someone else and/or one of the ZFS devs can pick up where I left off or identify something I missed. Otherwise, I'll try to help however/whenever I'm able.

If you have the chance in the meantime, it may be helpful to double-confirm that you're building the intended code. The version number in your previous check unfortunately didn't include the Git hash (presumably it was downloaded as a ZIP file, so the ZFS build just didn't know what the hash should be, but I'd like to eliminate any remaining uncertainty, if we can). Here are some example commands to check out my branch using Git:

tim@ubuntu2310-test:~/temp$ git clone https://github.com/openzfs/zfs.git
Cloning into 'zfs'...
remote: Enumerating objects: 190864, done.
remote: Counting objects: 100% (164/164), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 190864 (delta 63), reused 88 (delta 30), pack-reused 190700
Receiving objects: 100% (190864/190864), 127.21 MiB | 3.57 MiB/s, done.
Resolving deltas: 100% (139911/139911), done.
tim@ubuntu2310-test:~/temp$ cd zfs
tim@ubuntu2310-test:~/temp/zfs$ git remote add tstabrawa https://github.com/tstabrawa/zfs.git
tim@ubuntu2310-test:~/temp/zfs$ git fetch tstabrawa
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 14 (delta 10), reused 14 (delta 10), pack-reused 0
Unpacking objects: 100% (14/14), 2.78 KiB | 203.00 KiB/s, done.
From https://github.com/tstabrawa/zfs
 * [new branch]          fix_15140  -> tstabrawa/fix_15140
 * [new branch]          master     -> tstabrawa/master
tim@ubuntu2310-test:~/temp/zfs$ git checkout fix_15140
branch 'fix_15140' set up to track 'tstabrawa/fix_15140'.
Switched to a new branch 'fix_15140'
tim@ubuntu2310-test:~/temp/zfs$ git describe
zfs-2.2.99-517-g778fe7923
tim@ubuntu2310-test:~/temp/zfs$

From there, if you build, install, and load the ZFS driver, you ought to see the full ZFS: Loaded module zfs-2.2.99-517-g778fe7923 string in journalctl instead of the shortened form (v2.2.99-1) that you saw last time. Again, very unlikely that you're running the wrong driver when you saw the problem happen again, but on the off chance that something is off, it can save a lot of headaches to find out.

@numinit You mentioned earlier that you have a reliable way to reproduce the issue. If you still have the time to check, it may be helpful to know if that is still the case with the patched ZFS driver.

I was able to repro this every time I flushed caches for VM stuff... let me check... :-)

Thanks all, and sorry I wasn't able to solve this so far.

@JKDingwall (Contributor)

I was experiencing this issue with an Ubuntu kernel (6.8.0-40-generic) and ZFS 2.2.4 where a script was doing:

    echo 3 > /proc/sys/vm/drop_caches
    echo 1 > /proc/sys/vm/compact_memory

As this was being executed during boot, before starting guest VMs (under Xen), I ended up with a system in a boot/crash loop. I've applied the patch from 778fe79 to ZFS 2.2.5 and corrected a couple of apparent typos:

  • zpd_invalidate_page -> zpl_invalidate_page
  • .invalidate_page -> .invalidatepage

Running those commands in isolation no longer resulted in a crash. I'll put the original boot script back in place for some additional testing and report if that doesn't work.

whimbree pushed a commit to whimbree/zfs that referenced this issue Aug 15, 2024
Linux page migration code won't wait for writeback to complete unless
it needs to call release_folio.  Call SetPagePrivate wherever
PageUptodate is set and define .release_folio, to cause
fallback_migrate_folio to wait for us.

Signed-off-by: Tim Stabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes openzfs#15140
@yshui (Contributor) commented Sep 8, 2024

@tstabrawa move_to_new_folio calls a_ops->migrate_folio if that exists, right? so can't we define a migrate_folio and emulate what fallback_migrate_folio does except remove the private flag check there? won't that be a bit easier than ensuring private flag is set?

@yshui (Contributor) commented Sep 8, 2024

oh ok, that's not possible because we need to call remove_migration_ptes and that function is not callable from the zfs module.

btw i found a maybe simpler way? you can call mapping_set_release_always to make sure release_folio is always called (thus the writeback flag always checked) even without the private flag set.
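
(For reference, that's a one-time call on the file's mapping, available on 6.6+ kernels. A sketch, assuming it would be done where the znode's inode is set up:)

/* Sketch (6.6+ only): always call release_folio, even without PG_private. */
mapping_set_release_always(ZTOI(zp)->i_mapping);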

@yshui (Contributor) commented Sep 8, 2024

i also realized it is still possible to set compaction_proactiveness to a non-zero value. because proactive compaction does not trigger writebacks, therefore won't trigger this BUG.
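
(In other words, a possible workaround is to rely on proactive compaction instead of writing to compact_memory; 20 is the kernel's default value:)

    echo 20 > /proc/sys/vm/compaction_proactiveness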

@tstabrawa (Contributor)

All: Sorry for being away so long. Busy summer, I guess.

@JKDingwall: Thanks for catching the invalidatepage-related typos. I admit I didn't actually try building against 3.x kernels, though I was trying to make the patch as backward-compatible as possible. I pushed a commit (59c2ab1) that should fix the typos. Let me know if you think I missed anything.

@yshui: Thanks for the suggestions. I'll try to address them one at a time below:

@tstabrawa move_to_new_folio calls a_ops->migrate_folio if that exists, right? so can't we define a migrate_folio and emulate what fallback_migrate_folio does except remove the private flag check there? won't that be a bit easier than ensuring private flag is set?
...
oh ok, that's not possible because we need to call remove_migration_ptes and that function is not callable from the zfs module.

Agreed. I think you could probably achieve something similar to remove_migration_ptes using e.g. unmap_mapping_range, though it's a bigger hammer. Attempting to replicate what the kernel does in the ZFS codebase might also raise some eyebrows / be rejected by the ZFS devs.

btw i found a maybe simpler way? you can call mapping_set_release_always to make sure release_folio is always called (thus the writeback flag always checked) even without the private flag set.

This would work, and I did consider it, but unfortunately mapping_set_release_always doesn't exist until the 6.6 kernel, so we'd be leaving folks on older kernels out in the cold if we went with this approach.

i also realized it is still possible to set compaction_proactiveness to a non-zero value. because proactive compaction does not trigger writebacks, therefore won't trigger this BUG.

I haven't investigated this particular setting, but if it works as you describe, it may be a useful work-around for people facing this crash until a fix can get merged.

@tstabrawa (Contributor)

@JKDingwall: How has your follow-on testing gone? Has the crash returned at all since applying the patch?

@JKDingwall (Contributor)

@JKDingwall: How has your follow-on testing gone? Has the crash returned at all since applying the patch?

It's only a single test system which does the compaction but I've not had any further problems since adding this patch. I've lots of other systems with the same zfs build without the compaction that have also been fine.

@RodoMa92 (Author) commented Oct 30, 2024

This is not fixed; I wrongly assumed it was and migrated back to ZFS. As soon as I executed the usual command:

[17534.993246] bash (32381): drop_caches: 3
[17535.060207] ------------[ cut here ]------------
[17535.060212] kernel BUG at mm/migrate.c:664!
[17535.060220] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[17535.060225] CPU: 9 PID: 32381 Comm: bash Tainted: P           OE      6.6.58-1-lts #1 1400000003000000474e5500ee53b845eb376bed
[17535.060230] Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 5404 03/18/2024
[17535.060233] RIP: 0010:migrate_folio_extra+0x6b/0x70
[17535.060240] Code: de 48 89 ef e8 66 e1 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 e8 64 a2 00 e8 53 e1 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 d5 64 a2 00 <0f> 0b 90 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[17535.060244] RSP: 0018:ffffc900205f7658 EFLAGS: 00010202
[17535.060249] RAX: 02ffff800000030f RBX: ffffea00042e7780 RCX: 0000000000000002
[17535.060252] RDX: ffffea00042e7780 RSI: ffffea000fa4db40 RDI: ffff88812e98eeb0
[17535.060255] RBP: ffff88812e98eeb0 R08: 0000000000000000 R09: ffffffff8d7f1080
[17535.060257] R10: 0000000000000000 R11: 000000000000e001 R12: 0000000000000002
[17535.060260] R13: ffffea000fa4db40 R14: 0000000000000001 R15: ffffc900205f7710
[17535.060263] FS:  00007e466bc81b80(0000) GS:ffff88881ea40000(0000) knlGS:0000000000000000
[17535.060266] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17535.060269] CR2: 000063f8d5932140 CR3: 00000001e9c3c000 CR4: 00000000003506e0
[17535.060272] Call Trace:
[17535.060275]  <TASK>
[17535.060277]  ? die+0x36/0x90
[17535.060284]  ? do_trap+0xda/0x100
[17535.060288]  ? migrate_folio_extra+0x6b/0x70
[17535.060294]  ? do_error_trap+0x6a/0x90
[17535.060298]  ? migrate_folio_extra+0x6b/0x70
[17535.060303]  ? exc_invalid_op+0x50/0x70
[17535.060308]  ? migrate_folio_extra+0x6b/0x70
[17535.060312]  ? asm_exc_invalid_op+0x1a/0x20
[17535.060321]  ? migrate_folio_extra+0x6b/0x70
[17535.060326]  ? srso_return_thunk+0x5/0x5f
[17535.060329]  move_to_new_folio+0x13c/0x150
[17535.060335]  migrate_pages_batch+0x8ff/0xcd0
[17535.060341]  ? __pfx_compaction_free+0x10/0x10
[17535.060348]  ? __pfx_remove_migration_pte+0x10/0x10
[17535.060355]  migrate_pages+0xc36/0xe20
[17535.060360]  ? __pfx_compaction_alloc+0x10/0x10
[17535.060364]  ? __pfx_compaction_free+0x10/0x10
[17535.060369]  ? srso_return_thunk+0x5/0x5f
[17535.060376]  ? srso_return_thunk+0x5/0x5f
[17535.060381]  compact_zone+0x8de/0xe90
[17535.060386]  ? srso_return_thunk+0x5/0x5f
[17535.060392]  compact_node+0x88/0xc0
[17535.060401]  sysctl_compaction_handler+0x66/0xb0
[17535.060406]  proc_sys_call_handler+0x1c4/0x2e0
[17535.060413]  vfs_write+0x23e/0x410
[17535.060421]  ksys_write+0x6d/0xf0
[17535.060427]  do_syscall_64+0x5a/0x80
[17535.060432]  ? srso_return_thunk+0x5/0x5f
[17535.060435]  ? __x64_sys_fcntl+0x94/0xc0
[17535.060440]  ? srso_return_thunk+0x5/0x5f
[17535.060443]  ? syscall_exit_to_user_mode+0x22/0x40
[17535.060446]  ? srso_return_thunk+0x5/0x5f
[17535.060449]  ? set_close_on_exec+0x32/0x70
[17535.060463]  ? srso_return_thunk+0x5/0x5f
[17535.060466]  ? filp_flush+0x52/0x80
[17535.060471]  ? srso_return_thunk+0x5/0x5f
[17535.060474]  ? syscall_exit_to_user_mode+0x22/0x40
[17535.060477]  ? srso_return_thunk+0x5/0x5f
[17535.060480]  ? do_syscall_64+0x66/0x80
[17535.060484]  ? do_syscall_64+0x66/0x80
[17535.060488]  ? xa_load+0x91/0xe0
[17535.060493]  ? srso_return_thunk+0x5/0x5f
[17535.060496]  ? srso_return_thunk+0x5/0x5f
[17535.060499]  ? __pte_offset_map+0x1b/0x180
[17535.060505]  ? srso_return_thunk+0x5/0x5f
[17535.060508]  ? __handle_mm_fault+0xbad/0xdb0
[17535.060516]  ? srso_return_thunk+0x5/0x5f
[17535.060519]  ? __count_memcg_events+0x42/0x90
[17535.060524]  ? srso_return_thunk+0x5/0x5f
[17535.060527]  ? count_memcg_events.constprop.0+0x1a/0x30
[17535.060531]  ? srso_return_thunk+0x5/0x5f
[17535.060534]  ? handle_mm_fault+0x1f2/0x350
[17535.060539]  ? srso_return_thunk+0x5/0x5f
[17535.060542]  ? do_user_addr_fault+0x30f/0x620
[17535.060547]  ? srso_return_thunk+0x5/0x5f
[17535.060550]  ? exc_page_fault+0x7f/0x180
[17535.060556]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[17535.060560] RIP: 0033:0x7e466bdfe7a4
[17535.060583] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 28 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[17535.060586] RSP: 002b:00007fff61e95958 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[17535.060590] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007e466bdfe7a4
[17535.060593] RDX: 0000000000000002 RSI: 000063f8d5933630 RDI: 0000000000000001
[17535.060596] RBP: 00007fff61e95980 R08: 0000000000000073 R09: 0000000000000000
[17535.060598] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[17535.060601] R13: 000063f8d5933630 R14: 00007e466beda5c0 R15: 00007e466bed7ea0
[17535.060608]  </TASK>
[17535.060610] Modules linked in: snd_seq_dummy snd_hrtimer rfcomm snd_seq uhid cmac algif_hash algif_skcipher af_alg nft_masq nft_ct nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c bridge stp llc bnep intel_rapl_msr intel_rapl_common vfat fat edac_mce_amd btusb kvm_amd btrtl btintel kvm iwlmvm btbcm btmtk ext4 crct10dif_pclmul crc32c_generic eeepc_wmi crc32_pclmul bluetooth snd_hda_codec_realtek mac80211 crc32c_intel snd_usb_audio asus_wmi polyval_clmulni mbcache snd_hda_codec_generic sparse_keymap polyval_generic ledtrig_audio snd_usbmidi_lib platform_profile gf128mul ecdh_generic jbd2 libarc4 crc16 ghash_clmulni_intel snd_hda_codec_hdmi snd_ump sha512_ssse3 snd_rawmidi sha256_ssse3 snd_seq_device snd_hda_intel sha1_ssse3 mc snd_intel_dspcfg aesni_intel snd_intel_sdw_acpi iwlwifi crypto_simd joydev mousedev snd_hda_codec cryptd snd_hda_core snd_hwdep asus_wmi_sensors cfg80211 igb snd_pcm rapl i8042 snd_timer dca sp5100_tco serio mxm_wmi
[17535.060724]  wmi_bmof snd pcspkr acpi_cpufreq ccp soundcore k10temp i2c_piix4 rfkill gpio_amdpt gpio_generic mac_hid i2c_dev sg crypto_user loop fuse dm_mod nfnetlink bpf_preload ip_tables x_tables usbhid nvme nvme_core xhci_pci nvme_common xhci_pci_renesas zfs(POE) spl(OE) amdgpu i2c_algo_bit drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper cec vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd
[17535.060787] ---[ end trace 0000000000000000 ]---
[17535.060790] RIP: 0010:migrate_folio_extra+0x6b/0x70
[17535.060808] Code: de 48 89 ef e8 66 e1 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 e8 64 a2 00 e8 53 e1 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 d5 64 a2 00 <0f> 0b 90 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[17535.060811] RSP: 0018:ffffc900205f7658 EFLAGS: 00010202
[17535.060815] RAX: 02ffff800000030f RBX: ffffea00042e7780 RCX: 0000000000000002
[17535.060818] RDX: ffffea00042e7780 RSI: ffffea000fa4db40 RDI: ffff88812e98eeb0
[17535.060820] RBP: ffff88812e98eeb0 R08: 0000000000000000 R09: ffffffff8d7f1080
[17535.060823] R10: 0000000000000000 R11: 000000000000e001 R12: 0000000000000002
[17535.060825] R13: ffffea000fa4db40 R14: 0000000000000001 R15: ffffc900205f7710
[17535.060828] FS:  00007e466bc81b80(0000) GS:ffff88881ea40000(0000) knlGS:0000000000000000
[17535.060831] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17535.060834] CR2: 000063f8d5932140 CR3: 00000001e9c3c000 CR4: 00000000003506e0

Please reopen this. I'll test with 6.10 shortly, as soon as I've compiled it. This is with zfs-dkms 2.2.6.

@RodoMa92
Author

RodoMa92 commented Oct 30, 2024

I can confirm the issue is still present even on 6.10.14, the newest kernel supported by this ZFS build:

[   55.726522] bash (2897): drop_caches: 3
[   55.735483] ------------[ cut here ]------------
[   55.735487] kernel BUG at mm/migrate.c:679!
[   55.735497] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   55.735501] CPU: 10 PID: 2897 Comm: bash Tainted: P           OE      6.10.14-273-tkg-eevdf #1 1400000003000000474e5500566833b5df9ca30d
[   55.735506] Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 5404 03/18/2024
[   55.735509] RIP: 0010:migrate_folio_extra+0x6b/0x70
[   55.735515] Code: de 48 89 ef e8 56 e0 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 58 e1 9f 00 e8 43 e0 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 45 e1 9f 00 <0f> 0b 90 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[   55.735518] RSP: 0018:ffffbdbf9c2d3648 EFLAGS: 00010202
[   55.735522] RAX: 02fff0000000010b RBX: fffffc6c040fde80 RCX: 0000000000000002
[   55.735525] RDX: fffffc6c040fde80 RSI: fffffc6c12e49180 RDI: ffff9f8c29c7cf68
[   55.735527] RBP: ffff9f8c29c7cf68 R08: 0000000000000000 R09: ffff9f8d4be2e260
[   55.735529] R10: ffffbdbf9c2d3230 R11: 0000000000000040 R12: 0000000000000002
[   55.735531] R13: fffffc6c12e49180 R14: fffffc6c12e49180 R15: ffffbdbf9c2d36f0
[   55.735533] FS:  00007c7c97382b80(0000) GS:ffff9f931f100000(0000) knlGS:0000000000000000
[   55.735536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   55.735539] CR2: 00005f7e028af140 CR3: 00000001eb6fc000 CR4: 00000000003506f0
[   55.735541] Call Trace:
[   55.735544]  <TASK>
[   55.735547]  ? __die_body.cold+0x19/0x27
[   55.735553]  ? die+0x2e/0x50
[   55.735558]  ? do_trap+0xca/0x110
[   55.735563]  ? do_error_trap+0x6a/0x90
[   55.735565]  ? migrate_folio_extra+0x6b/0x70
[   55.735569]  ? exc_invalid_op+0x50/0x70
[   55.735573]  ? migrate_folio_extra+0x6b/0x70
[   55.735576]  ? asm_exc_invalid_op+0x1a/0x20
[   55.735584]  ? migrate_folio_extra+0x6b/0x70
[   55.735587]  ? srso_return_thunk+0x5/0x5f
[   55.735590]  move_to_new_folio+0x149/0x160
[   55.735594]  migrate_pages_batch+0x960/0xcc0
[   55.735598]  ? __pfx_compaction_free+0x10/0x10
[   55.735604]  ? __pfx_remove_migration_pte+0x10/0x10
[   55.735609]  migrate_pages+0xc2f/0xdc0
[   55.735612]  ? __pfx_compaction_alloc+0x10/0x10
[   55.735616]  ? __pfx_compaction_free+0x10/0x10
[   55.735619]  ? cgroup_rstat_updated+0x69/0x220
[   55.735623]  ? __pfx_compaction_free+0x10/0x10
[   55.735631]  compact_zone+0xa3d/0x1050
[   55.735638]  compact_node+0xa9/0x120
[   55.735648]  sysctl_compaction_handler+0x74/0xd0
[   55.735652]  proc_sys_call_handler+0x1c4/0x2e0
[   55.735658]  vfs_write+0x294/0x460
[   55.735665]  ksys_write+0x6d/0xf0
[   55.735669]  do_syscall_64+0x82/0x190
[   55.735673]  ? srso_return_thunk+0x5/0x5f
[   55.735676]  ? set_close_on_exec+0x31/0x70
[   55.735680]  ? srso_return_thunk+0x5/0x5f
[   55.735683]  ? do_fcntl+0x394/0x700
[   55.735695]  ? srso_return_thunk+0x5/0x5f
[   55.735698]  ? srso_return_thunk+0x5/0x5f
[   55.735700]  ? syscall_exit_to_user_mode+0x6d/0x180
[   55.735703]  ? srso_return_thunk+0x5/0x5f
[   55.735706]  ? do_syscall_64+0x8e/0x190
[   55.735709]  ? srso_return_thunk+0x5/0x5f
[   55.735711]  ? __x64_sys_fcntl+0x98/0xd0
[   55.735714]  ? srso_return_thunk+0x5/0x5f
[   55.735717]  ? syscall_exit_to_user_mode+0x6d/0x180
[   55.735720]  ? srso_return_thunk+0x5/0x5f
[   55.735722]  ? do_syscall_64+0x8e/0x190
[   55.735726]  ? srso_return_thunk+0x5/0x5f
[   55.735728]  ? do_user_addr_fault+0x3c7/0x710
[   55.735733]  ? srso_return_thunk+0x5/0x5f
[   55.735736]  ? srso_return_thunk+0x5/0x5f
[   55.735739]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   55.735743] RIP: 0033:0x7c7c974ff7a4
[   55.735767] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 28 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[   55.735770] RSP: 002b:00007ffc6861e488 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   55.735773] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007c7c974ff7a4
[   55.735775] RDX: 0000000000000002 RSI: 00005f7e028b0630 RDI: 0000000000000001
[   55.735777] RBP: 00007ffc6861e4b0 R08: 0000000000000073 R09: 0000000000000000
[   55.735780] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[   55.735782] R13: 00005f7e028b0630 R14: 00007c7c975db5c0 R15: 00007c7c975d8ea0
[   55.735788]  </TASK>
[   55.735790] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq nft_masq nft_ct nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c bridge stp llc uhid cmac algif_hash algif_skcipher af_alg bnep ext4 crc32c_generic mbcache vfat fat jbd2 intel_rapl_msr amd_atl intel_rapl_common iosf_mbi kvm_amd iwlmvm kvm btusb snd_hda_codec_realtek btrtl crct10dif_pclmul snd_hda_codec_generic mac80211 crc32_pclmul btintel crc32c_intel snd_hda_scodec_component snd_hda_codec_hdmi polyval_clmulni btbcm polyval_generic btmtk eeepc_wmi gf128mul libarc4 snd_usb_audio asus_wmi snd_hda_intel ghash_clmulni_intel sparse_keymap snd_intel_dspcfg bluetooth sha512_ssse3 snd_usbmidi_lib snd_hda_codec sha256_ssse3 snd_ump sha1_ssse3 iwlwifi crc16 snd_hwdep i8042 aesni_intel serio snd_hda_core crypto_simd snd_rawmidi sp5100_tco asus_wmi_sensors cryptd platform_profile igb wmi_bmof snd_pcm mxm_wmi snd_seq_device pcspkr cfg80211 ptp snd_timer k10temp pps_core i2c_piix4 snd
[   55.735884]  mc soundcore mousedev ccp joydev rfkill gpio_amdpt gpio_generic mac_hid i2c_dev sg crypto_user loop dm_mod nfnetlink ip_tables x_tables hid_generic usbhid xhci_pci xhci_pci_renesas zfs(POE) spl(OE) amdgpu video wmi amdxcp i2c_algo_bit mfd_core drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_display_helper cec vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd
[   55.735930] ---[ end trace 0000000000000000 ]---
[   55.735932] RIP: 0010:migrate_folio_extra+0x6b/0x70
[   55.735936] Code: de 48 89 ef e8 56 e0 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 58 e1 9f 00 e8 43 e0 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 45 e1 9f 00 <0f> 0b 90 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[   55.735938] RSP: 0018:ffffbdbf9c2d3648 EFLAGS: 00010202
[   55.735941] RAX: 02fff0000000010b RBX: fffffc6c040fde80 RCX: 0000000000000002
[   55.735943] RDX: fffffc6c040fde80 RSI: fffffc6c12e49180 RDI: ffff9f8c29c7cf68
[   55.735945] RBP: ffff9f8c29c7cf68 R08: 0000000000000000 R09: ffff9f8d4be2e260
[   55.735947] R10: ffffbdbf9c2d3230 R11: 0000000000000040 R12: 0000000000000002
[   55.735949] R13: fffffc6c12e49180 R14: fffffc6c12e49180 R15: ffffbdbf9c2d36f0
[   55.735951] FS:  00007c7c97382b80(0000) GS:ffff9f931f100000(0000) knlGS:0000000000000000
[   55.735954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   55.735956] CR2: 00005f7e028af140 CR3: 00000001eb6fc000 CR4: 00000000003506f0

The bug is sadly still alive and well.

@RodoMa92 RodoMa92 changed the title Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 to 6.7.4 Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 to 6.10.14 Oct 30, 2024
@RodoMa92
Author

It seems that some code paths leading to this issue have not been completely covered yet; from what I'm seeing, the stack trace is still identical to my earlier reports. I'm not sure why it's so hard to reproduce on other systems.
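
For anyone trying to reproduce this: every trace above shows the same two-step trigger from the VM prepare script (a reconstruction inferred from the drop_caches message and the sysctl_compaction_handler frames; the exact script may differ, and dirty data in mmap'd files on ZFS appears to be required):

```sh
# Flush caches, then force a full synchronous compaction pass to carve
# out contiguous memory for hugepages:
echo 3 > /proc/sys/vm/drop_caches     # appears in the log as "drop_caches: 3"
echo 1 > /proc/sys/vm/compact_memory  # handled by sysctl_compaction_handler()
```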
