TEST-55-OOMD triggers kernel crash on Fedora Rawhide #35044

DaanDeMeyer · 2024-11-06T08:17:03Z

systemd version the issue has been seen with

main

Used distribution

Fedora Rawhide

Linux kernel version used

6.12

CPU architectures issue was seen on

x86_64

Component

systemd-oomd

Expected behaviour you didn't see

No response

Unexpected behaviour you saw

No response

Steps to reproduce the problem

No response

Additional program output to the terminal or log subsystem illustrating the issue

[   76.313104] TEST-55-OOMD.sh[1216]: + timeout 1m bash -xec 'until oomctl | grep "/TEST-55-OOMD-testchill.service"; do sleep 1; done'
[   76.320322] TEST-55-OOMD.sh[1312]: + oomctl
[   76.326294] TEST-55-OOMD.sh[1313]: + grep /TEST-55-OOMD-testchill.service
[   76.382255] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   76.382883] #PF: supervisor read access in kernel mode
[   76.383221] #PF: error_code(0x0000) - not-present page
[   76.383571] PGD 0 P4D 0 
[   76.383739] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[   76.386799] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.12.0-0.rc6.51.fc42.x86_64 #1
[   76.387405] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   76.387988] Workqueue: events swap_reclaim_work
[   76.388330] RIP: 0010:__list_del_entry_valid_or_report+0x4/0x80
[   76.388725] Code: 00 4c 39 c7 0f 84 bc cc 8a 00 b8 01 00 00 00 e9 7d 8f c1 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <48> 8b 17 48 8b 4f 08 48 85 d2 0f 84 d9 cc 8a 00 48 85 c9 0f 84 2e
[   76.390623] RSP: 0018:ffffb9124005be18 EFLAGS: 00010286
[   76.390959] RAX: 0000000000000000 RBX: ffff9032950b5400 RCX: ffff9032fbc36228
[   76.391475] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
[   76.392775] RBP: ffffb912404ac000 R08: ffff9032803e4f40 R09: ffff9032803e4f80
[   76.393273] R10: 0000000000000007 R11: 0000000000000007 R12: 000000000000001f
[   76.399693] R13: fffffffffffffff8 R14: ffff9032950b5678 R15: 0000000000000000
[   76.400203] FS:  0000000000000000(0000) GS:ffff9032fbc00000(0000) knlGS:0000000000000000
[   76.400702] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.401116] CR2: 0000000000000000 CR3: 000000010aae2001 CR4: 0000000000370ef0
[   76.401610] Call Trace:
[   76.401777]  <TASK>
[   76.401937]  ? __die_body.cold+0x19/0x27
[   76.402169]  ? page_fault_oops+0x15a/0x2f0
[   76.402432]  ? srso_alias_return_thunk+0x5/0xfbef5
[   76.402727]  ? exc_page_fault+0x7e/0x180
[   76.402963]  ? asm_exc_page_fault+0x26/0x30
[   76.403204]  ? __list_del_entry_valid_or_report+0x4/0x80
[   76.403524]  swap_reclaim_full_clusters+0x56/0x140
[   76.407538]  swap_reclaim_work+0x2b/0x40
[   76.407809]  process_one_work+0x179/0x330
[   76.408068]  worker_thread+0x252/0x390
[   76.408323]  ? __pfx_worker_thread+0x10/0x10
[   76.410680]  kthread+0xd2/0x100
[   76.410941]  ? __pfx_kthread+0x10/0x10
[   76.411194]  ret_from_fork+0x34/0x50
[   76.411461]  ? __pfx_kthread+0x10/0x10
[   76.414525]  ret_from_fork_asm+0x1a/0x30
[   76.414789]  </TASK>
[   76.414956] Modules linked in: intel_rapl_msr intel_rapl_common kvm_amd kvm iTCO_wdt intel_pmc_bxt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus joydev serio_raw lpc_ich cfg80211 rfkill scsi_dh_rdac dm_multipath scsi_dh_emc scsi_dh_alua nfnetlink crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_scsi virtio_blk virtio_balloon loop fuse qemu_fw_cfg vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common vsock virtio_console
[   76.418701] CR2: 0000000000000000
[   76.418933] ---[ end trace 0000000000000000 ]---
[   76.419228] RIP: 0010:__list_del_entry_valid_or_report+0x4/0x80
[   76.420373] Code: 00 4c 39 c7 0f 84 bc cc 8a 00 b8 01 00 00 00 e9 7d 8f c1 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <48> 8b 17 48 8b 4f 08 48 85 d2 0f 84 d9 cc 8a 00 48 85 c9 0f 84 2e
[   76.423591] RSP: 0018:ffffb9124005be18 EFLAGS: 00010286
[   76.423990] RAX: 0000000000000000 RBX: ffff9032950b5400 RCX: ffff9032fbc36228
[   76.425825] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
[   76.426231] RBP: ffffb912404ac000 R08: ffff9032803e4f40 R09: ffff9032803e4f80
[   76.430535] R10: 0000000000000007 R11: 0000000000000007 R12: 000000000000001f
[   76.431040] R13: fffffffffffffff8 R14: ffff9032950b5678 R15: 0000000000000000
[   76.431559] FS:  0000000000000000(0000) GS:ffff9032fbc00000(0000) knlGS:0000000000000000
[   76.433764] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.436458] CR2: 0000000000000000 CR3: 000000010aae2001 CR4: 0000000000370ef0
[   76.436475] Kernel panic - not syncing: Fatal exception
[   76.436970] Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
qemu-system-x86_64: terminating on signal 15 from pid 94854 (/usr/bin/python3)

bluca · 2024-11-07T13:03:26Z

@anitazha @teknoraver any chance this could be reported to the kernel devs that look after the oom stuff please?

bluca · 2024-11-07T15:19:54Z

A patch has been posted: https://lore.kernel.org/all/20241107142335.GB1172372@cmpxchg.org/

syzbot and Daan report a NULL pointer crash in the new full swap cluster reclaim work:

> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> Workqueue: events swap_reclaim_work
> RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
> Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
> RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
> RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
> R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
> FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> __list_del_entry_valid include/linux/list.h:124 [inline]
> __list_del_entry include/linux/list.h:215 [inline]
> list_move_tail include/linux/list.h:310 [inline]
> swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
> swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779

The syzbot console output indicates a virtual environment where swapfile is on a rotational device. In this case, clusters aren't actually used, and si->full_clusters is not initialized. Daan's report is from qemu, so likely rotational too.

Make sure to only schedule the cluster reclaim work when clusters are actually in use.

Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org
Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/
Link: systemd/systemd#35044
Fixes: 5168a68 ("mm, swap: avoid over reclaim of full clusters")
Reported-by: syzbot+078be8bfa863cb9e0c6b@syzkaller.appspotmail.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
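
The commit message only guesses that the swap device in the qemu guest is rotational. One way to check that from inside the guest is to read the block layer's `queue/rotational` attribute in sysfs. The following is a minimal sketch, not something taken from the thread or the patch: the function name `is_rotational` and the `sysfs_root` parameter are illustrative, and it assumes the caller passes a whole-disk name (for example `vda`), not a partition.

```python
from pathlib import Path


def is_rotational(disk: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if the kernel reports the given whole disk as rotational.

    Reads <sysfs_root>/<disk>/queue/rotational, which contains "1" for
    rotational devices and "0" for non-rotational ones. `disk` must be a
    whole-disk name such as "vda" (an example name, not from the report).
    Raises OSError if the attribute does not exist.
    """
    flag = Path(sysfs_root, disk, "queue", "rotational").read_text().strip()
    return flag == "1"
```

Calling `is_rotational("vda")` in the test VM, with `vda` replaced by whatever disk backs the swap area listed in `/proc/swaps`, would show whether the guest matches the condition described in the commit message. The `sysfs_root` parameter exists only so the function can be exercised against a fake directory tree.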

bluca · 2024-11-20T16:53:23Z

This was fixed in kernel v6.12

DaanDeMeyer added the bug 🐛 Programming errors, that need preferential fixing label Nov 6, 2024

github-actions bot added the oomd label Nov 6, 2024

DaanDeMeyer added the tests label Nov 6, 2024

yuwata added kernel-bug and removed bug 🐛 Programming errors, that need preferential fixing labels Nov 6, 2024

bluca changed the title ~~TEST-55-OOMD broken on Fedora Rawhide~~ TEST-55-OOMD triggers kernel crash on Fedora Rawhide Nov 7, 2024

bluca closed this as completed Nov 20, 2024
