DLPX-83701 Make function mnt_add_count() traceable #18

don-brady · 2022-12-07T22:22:31Z

Background

Some of the kernel functions in the unmount path are not traceable which makes it harder to debug busy unmounts.

Problem

To help with diagnosing busy unmounts we should allow tracing of mnt_add_count() and do_umount() in bpftrace probes.

Solution

Augment these functions with noinline and __noclone (disables the "-fipa-sra" compilation optimization to keep the compiler from optimizing the function signatures).

Testing Done

Confirm these function symbols are traceable:

delphix@ip-10-110-220-172:~$ sudo bpftrace -l '*mnt_add_count'
kprobe:mnt_add_count
delphix@ip-10-110-220-172:~$ sudo bpftrace -l '*do_umount'
kprobe:do_umount

Also used them with a bpftrace script that watches mnt_add_count calls that decrement the mount reference after a busy unmount occurred.

BugLink: https://bugs.launchpad.net/bugs/1990009 Tested on x86-64 and Ilya was also kind enough to give it a spin on s390x, both passing with probe_user:OK there. The test is using the newly added bpf_probe_read_user() to dump sockaddr from connect call into .bss BPF map and overrides the user buffer via bpf_probe_write_user(): # ./test_progs [...] #17 pkt_md_access:OK #18 probe_user:OK #19 prog_run_xattr:OK [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/90f449d8af25354e05080e82fc6e2d3179da30ea.1572649915.git.daniel@iogearbox.net (cherry picked from commit fa553d9) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Cengiz Can <cengiz.can@canonical.com> Acked-by: Joseph Salisbury <joseph.salisbury@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2002347 [ Upstream commit 5dd7caf ] In __unregister_kprobe_top(), if the currently unregistered probe has post_handler but other child probes of the aggrprobe do not have post_handler, the post_handler of the aggrprobe is cleared. If this is a ftrace-based probe, there is a problem. In later calls to disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in __disarm_kprobe_ftrace() and may even cause use-after-free: Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2) WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0 Modules linked in: testKprobe_007(-) CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18 [...] Call Trace: <TASK> __disable_kprobe+0xcd/0xe0 __unregister_kprobe_top+0x12/0x150 ? mutex_lock+0xe/0x30 unregister_kprobes.part.23+0x31/0xa0 unregister_kprobe+0x32/0x40 __x64_sys_delete_module+0x15e/0x260 ? do_user_addr_fault+0x2cd/0x6b0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] For the kprobe-on-ftrace case, we keep the post_handler setting to identify this aggrprobe armed with kprobe_ipmodify_ops. This way we can disarm it correctly. Link: https://lore.kernel.org/all/20221112070000.35299-1-lihuafei1@huawei.com/ Fixes: 0bc11ed ("kprobes: Allow kprobes coexist with livepatch") Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Li Huafei <lihuafei1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

…g the sock BugLink: https://bugs.launchpad.net/bugs/2003914 [ Upstream commit 3cf7203 ] There is a race condition in vxlan that when deleting a vxlan device during receiving packets, there is a possibility that the sock is released after getting vxlan_sock vs from sk_user_data. Then in later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got NULL pointer dereference. e.g. #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757 #1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d #2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48 #3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b #4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb #5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542 #6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62 [exception RIP: vxlan_ecn_decapsulate+0x3b] RIP: ffffffffc1014e7b RSP: ffffa25ec6978cb0 RFLAGS: 00010246 RAX: 0000000000000008 RBX: ffff8aa000888000 RCX: 0000000000000000 RDX: 000000000000000e RSI: ffff8a9fc7ab803e RDI: ffff8a9fd1168700 RBP: ffff8a9fc7ab803e R8: 0000000000700000 R9: 00000000000010ae R10: ffff8a9fcb748980 R11: 0000000000000000 R12: ffff8a9fd1168700 R13: ffff8aa000888000 R14: 00000000002a0000 R15: 00000000000010ae ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan] #8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507 #9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45 #10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807 #11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951 #12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde #13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b #14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139 #15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a #16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3 #17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca #18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3 Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh Fix this by waiting for all sk_user_data reader to finish before releasing the sock. Reported-by: Jianlin Shi <jishi@redhat.com> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2003914 [ Upstream commit b6702a9 ] syzkaller reported use-after-free with the stack trace like below [1]: [ 38.960489][ C3] ================================================================== [ 38.963216][ C3] BUG: KASAN: use-after-free in ar5523_cmd_tx_cb+0x220/0x240 [ 38.964950][ C3] Read of size 8 at addr ffff888048e03450 by task swapper/3/0 [ 38.966363][ C3] [ 38.967053][ C3] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.0.0-09039-ga6afa4199d3d-dirty #18 [ 38.968464][ C3] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 [ 38.969959][ C3] Call Trace: [ 38.970841][ C3] <IRQ> [ 38.971663][ C3] dump_stack_lvl+0xfc/0x174 [ 38.972620][ C3] print_report.cold+0x2c3/0x752 [ 38.973626][ C3] ? ar5523_cmd_tx_cb+0x220/0x240 [ 38.974644][ C3] kasan_report+0xb1/0x1d0 [ 38.975720][ C3] ? ar5523_cmd_tx_cb+0x220/0x240 [ 38.976831][ C3] ar5523_cmd_tx_cb+0x220/0x240 [ 38.978412][ C3] __usb_hcd_giveback_urb+0x353/0x5b0 [ 38.979755][ C3] usb_hcd_giveback_urb+0x385/0x430 [ 38.981266][ C3] dummy_timer+0x140c/0x34e0 [ 38.982925][ C3] ? notifier_call_chain+0xb5/0x1e0 [ 38.984761][ C3] ? rcu_read_lock_sched_held+0xb/0x60 [ 38.986242][ C3] ? lock_release+0x51c/0x790 [ 38.987323][ C3] ? _raw_read_unlock_irqrestore+0x37/0x70 [ 38.988483][ C3] ? __wake_up_common_lock+0xde/0x130 [ 38.989621][ C3] ? reacquire_held_locks+0x4a0/0x4a0 [ 38.990777][ C3] ? lock_acquire+0x472/0x550 [ 38.991919][ C3] ? rcu_read_lock_sched_held+0xb/0x60 [ 38.993138][ C3] ? lock_acquire+0x472/0x550 [ 38.994890][ C3] ? dummy_urb_enqueue+0x860/0x860 [ 38.996266][ C3] ? do_raw_spin_unlock+0x16f/0x230 [ 38.997670][ C3] ? dummy_urb_enqueue+0x860/0x860 [ 38.999116][ C3] call_timer_fn+0x1a0/0x6a0 [ 39.000668][ C3] ? add_timer_on+0x4a0/0x4a0 [ 39.002137][ C3] ? reacquire_held_locks+0x4a0/0x4a0 [ 39.003809][ C3] ? __next_timer_interrupt+0x226/0x2a0 [ 39.005509][ C3] __run_timers.part.0+0x69a/0xac0 [ 39.007025][ C3] ? dummy_urb_enqueue+0x860/0x860 [ 39.008716][ C3] ? call_timer_fn+0x6a0/0x6a0 [ 39.010254][ C3] ? cpuacct_percpu_seq_show+0x10/0x10 [ 39.011795][ C3] ? kvm_sched_clock_read+0x14/0x40 [ 39.013277][ C3] ? sched_clock_cpu+0x69/0x2b0 [ 39.014724][ C3] run_timer_softirq+0xb6/0x1d0 [ 39.016196][ C3] __do_softirq+0x1d2/0x9be [ 39.017616][ C3] __irq_exit_rcu+0xeb/0x190 [ 39.019004][ C3] irq_exit_rcu+0x5/0x20 [ 39.020361][ C3] sysvec_apic_timer_interrupt+0x8f/0xb0 [ 39.021965][ C3] </IRQ> [ 39.023237][ C3] <TASK> In ar5523_probe(), ar5523_host_available() calls ar5523_cmd() as below (there are other functions which finally call ar5523_cmd()): ar5523_probe() -> ar5523_host_available() -> ar5523_cmd_read() -> ar5523_cmd() If ar5523_cmd() timed out, then ar5523_host_available() failed and ar5523_probe() freed the device structure. So, ar5523_cmd_tx_cb() might touch the freed structure. This patch fixes this issue by canceling in-flight tx cmd if submitted urb timed out. Link: https://syzkaller.appspot.com/bug?id=9e12b2d54300842b71bdd18b54971385ff0d0d3a [1] Reported-by: syzbot+95001b1fd6dfcc716c29@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Link: https://lore.kernel.org/r/20221009183223.420015-1-syoshida@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2003914 [ Upstream commit 031af50 ] The inline assembly for arm64's cmpxchg_double*() implementations use a +Q constraint to hazard against other accesses to the memory location being exchanged. However, the pointer passed to the constraint is a pointer to unsigned long, and thus the hazard only applies to the first 8 bytes of the location. GCC can take advantage of this, assuming that other portions of the location are unchanged, leading to a number of potential problems. This is similar to what we fixed back in commit: fee960b ("arm64: xchg: hazard against entire exchange variable") ... but we forgot to adjust cmpxchg_double*() similarly at the same time. The same problem applies, as demonstrated with the following test: | struct big { | u64 lo, hi; | } __aligned(128); | | unsigned long foo(struct big *b) | { | u64 hi_old, hi_new; | | hi_old = b->hi; | cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78); | hi_new = b->hi; | | return hi_old ^ hi_new; | } ... which GCC 12.1.0 compiles as: | 0000000000000000 <foo>: | 0: d503233f paciasp | 4: aa0003e4 mov x4, x0 | 8: 1400000e b 40 <foo+0x40> | c: d2800240 mov x0, #0x12 // #18 | 10: d2800681 mov x1, #0x34 // #52 | 14: aa0003e5 mov x5, x0 | 18: aa0103e6 mov x6, x1 | 1c: d2800ac2 mov x2, #0x56 // #86 | 20: d2800f03 mov x3, #0x78 // #120 | 24: 48207c82 casp x0, x1, x2, x3, [x4] | 28: ca050000 eor x0, x0, x5 | 2c: ca060021 eor x1, x1, x6 | 30: aa010000 orr x0, x0, x1 | 34: d2800000 mov x0, #0x0 // #0 <--- BANG | 38: d50323bf autiasp | 3c: d65f03c0 ret | 40: d2800240 mov x0, #0x12 // #18 | 44: d2800681 mov x1, #0x34 // #52 | 48: d2800ac2 mov x2, #0x56 // #86 | 4c: d2800f03 mov x3, #0x78 // #120 | 50: f9800091 prfm pstl1strm, [x4] | 54: c87f1885 ldxp x5, x6, [x4] | 58: ca0000a5 eor x5, x5, x0 | 5c: ca0100c6 eor x6, x6, x1 | 60: aa0600a6 orr x6, x5, x6 | 64: b5000066 cbnz x6, 70 <foo+0x70> | 68: c8250c82 stxp w5, x2, x3, [x4] | 6c: 35ffff45 cbnz w5, 54 <foo+0x54> | 70: d2800000 mov x0, #0x0 // #0 <--- BANG | 74: d50323bf autiasp | 78: d65f03c0 ret Notice that at the lines with "BANG" comments, GCC has assumed that the higher 8 bytes are unchanged by the cmpxchg_double() call, and that `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and LL/SC versions of cmpxchg_double(). This patch fixes the issue by passing a pointer to __uint128_t into the +Q constraint, ensuring that the compiler hazards against the entire 16 bytes being modified. With this change, GCC 12.1.0 compiles the above test as: | 0000000000000000 <foo>: | 0: f9400407 ldr x7, [x0, #8] | 4: d503233f paciasp | 8: aa0003e4 mov x4, x0 | c: 1400000f b 48 <foo+0x48> | 10: d2800240 mov x0, #0x12 // #18 | 14: d2800681 mov x1, #0x34 // #52 | 18: aa0003e5 mov x5, x0 | 1c: aa0103e6 mov x6, x1 | 20: d2800ac2 mov x2, #0x56 // #86 | 24: d2800f03 mov x3, #0x78 // #120 | 28: 48207c82 casp x0, x1, x2, x3, [x4] | 2c: ca050000 eor x0, x0, x5 | 30: ca060021 eor x1, x1, x6 | 34: aa010000 orr x0, x0, x1 | 38: f9400480 ldr x0, [x4, #8] | 3c: d50323bf autiasp | 40: ca0000e0 eor x0, x7, x0 | 44: d65f03c0 ret | 48: d2800240 mov x0, #0x12 // #18 | 4c: d2800681 mov x1, #0x34 // #52 | 50: d2800ac2 mov x2, #0x56 // #86 | 54: d2800f03 mov x3, #0x78 // #120 | 58: f9800091 prfm pstl1strm, [x4] | 5c: c87f1885 ldxp x5, x6, [x4] | 60: ca0000a5 eor x5, x5, x0 | 64: ca0100c6 eor x6, x6, x1 | 68: aa0600a6 orr x6, x5, x6 | 6c: b5000066 cbnz x6, 78 <foo+0x78> | 70: c8250c82 stxp w5, x2, x3, [x4] | 74: 35ffff45 cbnz w5, 5c <foo+0x5c> | 78: f9400480 ldr x0, [x4, #8] | 7c: d50323bf autiasp | 80: ca0000e0 eor x0, x7, x0 | 84: d65f03c0 ret ... sampling the high 8 bytes before and after the cmpxchg, and performing an EOR, as we'd expect. For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note that linux-4.9.y is oldest currently supported stable release, and mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run on my machines due to library incompatibilities. I've also used a standalone test to check that we can use a __uint128_t pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM 3.9.1. Fixes: 5284e1b ("arm64: xchg: Implement cmpxchg_double") Fixes: e9a4b79 ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU") Reported-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/ Reported-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/lkml/Y6GXoO4qmH9OIZ5Q@hirez.programming.kicks-ass.net/ Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: stable@vger.kernel.org Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230104151626.3262137-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/1990009 Tested on x86-64 and Ilya was also kind enough to give it a spin on s390x, both passing with probe_user:OK there. The test is using the newly added bpf_probe_read_user() to dump sockaddr from connect call into .bss BPF map and overrides the user buffer via bpf_probe_write_user(): # ./test_progs [...] #17 pkt_md_access:OK #18 probe_user:OK #19 prog_run_xattr:OK [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/90f449d8af25354e05080e82fc6e2d3179da30ea.1572649915.git.daniel@iogearbox.net (cherry picked from commit fa553d9) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Cengiz Can <cengiz.can@canonical.com> Acked-by: Joseph Salisbury <joseph.salisbury@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2076435 commit be346c1a6eeb49d8fda827d2a9522124c2f72f36 upstream. The code in ocfs2_dio_end_io_write() estimates number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain arbitrary number of extents. Extent tree manipulations do often extend the current transaction but not in all of the cases. For example if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction and eventually exhaust all the transaction credits if the IO contains many single block extents. Once that happens a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem. To fix the issue make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written(). Heming Zhao said: ------ PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error" PID: xxx TASK: xxxx CPU: 5 COMMAND: "SubmitThread-CA" #0 machine_kexec at ffffffff8c069932 #1 __crash_kexec at ffffffff8c1338fa #2 panic at ffffffff8c1d69b9 #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2] #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2] #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2] #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2] #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2] #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2] #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2] #10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2] #11 dio_complete at ffffffff8c2b9fa7 #12 do_blockdev_direct_IO at ffffffff8c2bc09f #13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2] #14 generic_file_direct_write at ffffffff8c1dcf14 #15 __generic_file_write_iter at ffffffff8c1dd07b #16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2] #17 aio_write at ffffffff8c2cc72e #18 kmem_cache_alloc at ffffffff8c248dde #19 do_io_submit at ffffffff8c2ccada #20 do_syscall_64 at ffffffff8c004984 #21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Portia Stephens <portia.stephens@canonical.com> Signed-off-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2078289 [ Upstream commit f0c18025693707ec344a70b6887f7450bf4c826b ] When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, the following kernel panic occurs: [...] Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160 Code: ... ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- [...] This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic. This crash is because a NULL pointer is passed to page_address() in the sk_msg_recvmsg(). Due to the different implementations depending on the architecture, page_address(NULL) will trigger a panic on Loongarch platform but not on x86 platform. So this bug was hidden on x86 platform for a while, but now it is exposed on Loongarch platform. The root cause is that a zero length skb (skb->len == 0) was put on the queue. This zero length skb is a TCP FIN packet, which was sent by shutdown(), invoked in test_sockmap_skb_verdict_shutdown(): shutdown(p1, SHUT_WR); In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list. And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it to kmap_local_page() and to page_address(), then kernel panics. To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(), if copy is zero, that means it's a zero length skb, skip invoking copy_page_to_iter(). We are using the EFAULT return triggered by copy_page_to_iter to check for is_fin in tcp_bpf.c. Fixes: 604326b ("bpf, sockmap: convert to generic sk_msg interface") Suggested-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.1719992715.git.tanggeliang@kylinos.cn Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Portia Stephens <portia.stephens@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

DLPX-83701 Make function mnt_add_count() traceable

4821180

pcd1193182 approved these changes Dec 7, 2022

View reviewed changes

tonynguien approved these changes Dec 8, 2022

View reviewed changes

don-brady merged commit d5fef04 into delphix:6.0/stage Dec 9, 2022

don-brady deleted the dlpx-83701-azure branch December 9, 2022 16:40

delphix-devops-bot pushed a commit that referenced this pull request Dec 15, 2022

DLPX-83701 Make function mnt_add_count() traceable (#18)

ff54082

delphix-devops-bot pushed a commit that referenced this pull request Jan 11, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

bff8965

delphix-devops-bot pushed a commit that referenced this pull request Feb 2, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

346b4d2

delphix-devops-bot pushed a commit that referenced this pull request Feb 10, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

b17b225

delphix-devops-bot pushed a commit that referenced this pull request Mar 4, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

7b00e24

prakashsurya pushed a commit that referenced this pull request Mar 14, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

1dc7a6f

prakashsurya pushed a commit that referenced this pull request Mar 14, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

379fa4f

delphix-devops-bot pushed a commit that referenced this pull request Mar 30, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

4427b2e

delphix-devops-bot pushed a commit that referenced this pull request Apr 20, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

e3ec2d8

delphix-devops-bot pushed a commit that referenced this pull request Apr 28, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

67a697f

delphix-devops-bot pushed a commit that referenced this pull request May 26, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

30cedad

jwk404 pushed a commit to jwk404/linux-kernel-azure that referenced this pull request Mar 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (delphix#18)

4428276

delphix-devops-bot pushed a commit that referenced this pull request Mar 24, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

cbc69a2

delphix-devops-bot pushed a commit that referenced this pull request Mar 25, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

1e8e1b9

jwk404 pushed a commit to jwk404/linux-kernel-azure that referenced this pull request Mar 25, 2024

DLPX-83701 Make function mnt_add_count() traceable (delphix#18)

4464f65

delphix-devops-bot pushed a commit that referenced this pull request Mar 26, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

c3b6902

delphix-devops-bot pushed a commit that referenced this pull request Mar 27, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

d31be43

jwk404 pushed a commit that referenced this pull request Apr 10, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

5c42d3d

jwk404 pushed a commit that referenced this pull request Apr 10, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

022d480

jwk404 pushed a commit that referenced this pull request Apr 11, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

76a1774

jwk404 pushed a commit that referenced this pull request Apr 14, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

4052cb5

jwk404 pushed a commit that referenced this pull request Apr 15, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

1d75088

jwk404 pushed a commit that referenced this pull request Apr 15, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

a22d072

jwk404 pushed a commit that referenced this pull request Apr 15, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

e0a78c6

delphix-devops-bot pushed a commit that referenced this pull request Apr 20, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

6bc26a7

delphix-devops-bot pushed a commit that referenced this pull request May 9, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

f033f01

delphix-devops-bot pushed a commit that referenced this pull request May 16, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

80e6ca4

delphix-devops-bot pushed a commit that referenced this pull request Jun 30, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

832bcb0

delphix-devops-bot pushed a commit that referenced this pull request Aug 1, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

2da8be5

delphix-devops-bot pushed a commit that referenced this pull request Aug 6, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

b7cf946

pcd1193182 pushed a commit to pcd1193182/linux-kernel-azure that referenced this pull request Aug 19, 2024

DLPX-83701 Make function mnt_add_count() traceable (delphix#18)

e24687c

delphix-devops-bot pushed a commit that referenced this pull request Aug 22, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

2279326

delphix-devops-bot pushed a commit that referenced this pull request Aug 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

8446bb3

prakashsurya pushed a commit that referenced this pull request Sep 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

9017853

delphix-devops-bot pushed a commit that referenced this pull request Oct 20, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

19735c0

palash-gandhi pushed a commit that referenced this pull request Oct 24, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

e465da9

delphix-devops-bot pushed a commit that referenced this pull request Nov 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

b126c22

delphix-devops-bot pushed a commit that referenced this pull request Jan 10, 2025

DLPX-83701 Make function mnt_add_count() traceable (#18)

9c66b13

delphix-devops-bot pushed a commit that referenced this pull request Feb 1, 2025

DLPX-83701 Make function mnt_add_count() traceable (#18)

8db813c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DLPX-83701 Make function mnt_add_count() traceable #18

DLPX-83701 Make function mnt_add_count() traceable #18

don-brady commented Dec 7, 2022

DLPX-83701 Make function mnt_add_count() traceable #18

DLPX-83701 Make function mnt_add_count() traceable #18

Conversation

don-brady commented Dec 7, 2022

Background

Problem

Solution

Testing Done