Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STI DRM kernel panic with glmark2-drm #11

Open
waby38b opened this issue Jan 6, 2025 · 1 comment
Open

STI DRM kernel panic with glmark2-drm #11

waby38b opened this issue Jan 6, 2025 · 1 comment

Comments

@waby38b
Copy link
Collaborator

waby38b commented Jan 6, 2025

Linux 4kOpen 6.12.6 #1 SMP Sat Dec 21 16:26:54 CET 2024 armv7l GNU/Linux /
bootargs= (...) cma=256m
glmark2 2023.01

glmark2-es2-wayland / weston

=> 2560x1600: OK
=> 1920x1080: OK
=> 2560x1440: OK

glmark2-es2-drm / DRM

=> 2560x1600: OK
=> 1920x1080: OK
=> 2560x1440: Dead !

Here kernel backstrace (loop)

(...)
[  155.987679] Division by zero in kernel.
[  155.994000] Division by zero in kernel.
[  156.000354] Division by zero in kernel.
[  156.006625] Division by zero in kernel.
(...)
Jan  6 22:23:11 4kOpen kern.err kernel: [  294.792802] Division by zero in kernel.
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796644] CPU: 3 UID: 0 PID: 317 Comm: glmark2-es2-drm Tainted: G        W          6.12.6 #1
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796677] Tainted: [W]=WARN
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796688] Hardware name: STi SoC with Flattened Device Tree
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796702] Call trace: 
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796712]  unwind_backtrace from show_stack+0x10/0x14
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796754]  show_stack from dump_stack_lvl+0x54/0x68
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796784]  dump_stack_lvl from Ldiv0_64+0x8/0x18
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796827]  Ldiv0_64 from div_u64_rem+0x30/0x44
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796880]  div_u64_rem from div64_u64+0x28/0xcc
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796923]  div64_u64 from clk_fs660c32_dig_get_params+0x1f0/0x3ac
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796966]  clk_fs660c32_dig_get_params from quadfs_round_rate+0x4c/0x8c
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.796998]  quadfs_round_rate from clk_core_determine_round_nolock+0x98/0xd0
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.797030]  clk_core_determine_round_nolock from clk_calc_new_rates+0xa0/0x2b0
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.797069]  clk_calc_new_rates from clk_calc_new_rates+0x134/0x2b0
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.797106]  clk_calc_new_rates from clk_core_set_rate_nolock+0xdc/0x324
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.797148]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x154
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.797190]  clk_set_rate from sti_hdmi_set_mode+0x5c/0x90 [sti_drm]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.797349]  sti_hdmi_set_mode [sti_drm] from drm_bridge_chain_mode_set+0x40/0x54 [drm]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.798030]  drm_bridge_chain_mode_set [drm] from crtc_set_mode+0x190/0x1cc [drm_kms_helper]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.798774]  crtc_set_mode [drm_kms_helper] from drm_atomic_helper_commit_tail+0x38/0x8c [drm_kms_helper]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.799070]  drm_atomic_helper_commit_tail [drm_kms_helper] from commit_tail+0x15c/0x180 [drm_kms_helper]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.799337]  commit_tail [drm_kms_helper] from drm_atomic_helper_commit+0x138/0x15c [drm_kms_helper]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.799599]  drm_atomic_helper_commit [drm_kms_helper] from drm_atomic_commit+0xc0/0xf4 [drm]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.800291]  drm_atomic_commit [drm] from drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.801011]  drm_atomic_helper_set_config [drm_kms_helper] from drm_mode_setcrtc+0x37c/0x848 [drm]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.801724]  drm_mode_setcrtc [drm] from drm_ioctl+0x2a0/0x490 [drm]
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.802746]  drm_ioctl [drm] from sys_ioctl+0x2d8/0xd10
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.803290]  sys_ioctl from ret_fast_syscall+0x0/0x54
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.803328] Exception stack(0xf1005fa8 to 0xf1005ff0)
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.803352] 5fa0:                   00000003 bee14960 00000003 c06864a2 bee14960 bee14940
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.803376] 5fc0: 00000003 bee14960 c06864a2 00000036 0059adac 0058e120 bee14a4c 004b617c
Jan  6 22:23:11 4kOpen kern.warn kernel: [  294.803395] 5fe0: b6f16ef4 bee14930 b6f0ac70 b6b900e4
@waby38b
Copy link
Collaborator Author

waby38b commented Jan 6, 2025

# modetest -M  sti
Encoders:
id	crtc	type	possible crtcs	possible clones	
52	35	TMDS	0x00000003	0x00000007
53	0	DAC	0x00000003	0x00000007
54	0	LVDS	0x00000003	0x00000007

Connectors:
id	encoder	status		name		size (mm)	modes	encoders
55	52	connected	HDMI-A-1       	600x330		100	52
  modes:
	index name refresh (Hz) hdisp hss hse htot vdisp vss vse vtot
  #0 2560x1440 59.97 2560 2608 2640 2720 1440 1443 1448 1481 241570 flags: phsync, pvsync; type: preferred, driver
  #1 3840x2160 60.00 3840 4016 4104 4400 2160 2168 2178 2250 594000 flags: phsync, pvsync; type: driver
  #2 3840x2160 59.94 3840 4016 4104 4400 2160 2168 2178 2250 593407 flags: phsync, pvsync; type: driver
  #3 3840x2160 50.00 3840 4896 4984 5280 2160 2168 2178 2250 594000 flags: phsync, pvsync; type: driver
  #4 3840x2160 30.00 3840 4016 4104 4400 2160 2168 2178 2250 297000 flags: phsync, pvsync; type: driver

1 -
# modetest -M sti -s 55@35:2560x1440-59.97 -d setting mode 2560x1440-59.97Hz on connectors 55, crtc 35
=> no crash here ! (picture ok)

2 -
# modetest -M sti -s 55@35:1920x1080-60 -d setting mode 2560x1440-59.97Hz on connectors 55, crtc 35
=> no crash here ! (picture ok)

3 -
# modetest -M sti -s 55@35:2560x1440-59.97 -d setting mode 2560x1440-59.97Hz on connectors 55, crtc 35
=> nothing on screen....

# dmesg 
[  567.138010] sti-hdmi a804000.sti-hdmi: sti_hdmi_tx6g0c28phy_start: hdmi phy pll not locked
[  567.146327] [drm:drm_atomic_bridge_chain_pre_enable [drm]] *ERROR* Unable to start hdmi phy

4 -

# modetest -M sti -s 55@35:3840x2160-60 -d 
setting mode 3840x2160-60.00Hz on connectors 55, crtc 35

[  325.195200] Division by zero in kernel.
[  325.199026] CPU: 3 UID: 0 PID: 316 Comm: modetest Tainted: G        W          6.12.6 #1
[  325.199039] Tainted: [W]=WARN
[  325.199044] Hardware name: STi SoC with Flattened Device Tree
[  325.199050] Call trace: 
[  325.199055]  unwind_backtrace from show_stack+0x10/0x14
[  325.199072]  show_stack from dump_stack_lvl+0x54/0x68
[  325.199085]  dump_stack_lvl from Ldiv0_64+0x8/0x18
[  325.199103]  Ldiv0_64 from div_u64_rem+0x30/0x44
[  325.199125]  div_u64_rem from div64_u64+0x28/0xcc
[  325.199144]  div64_u64 from clk_fs660c32_dig_get_params+0x1f0/0x3ac
[  325.199161]  clk_fs660c32_dig_get_params from quadfs_round_rate+0x4c/0x8c
[  325.199175]  quadfs_round_rate from clk_core_determine_round_nolock+0x98/0xd0
[  325.199189]  clk_core_determine_round_nolock from clk_calc_new_rates+0xa0/0x2b0
[  325.199204]  clk_calc_new_rates from clk_calc_new_rates+0x134/0x2b0
[  325.199221]  clk_calc_new_rates from clk_core_set_rate_nolock+0xdc/0x324
[  325.199238]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x154
[  325.199256]  clk_set_rate from sti_hdmi_set_mode+0x5c/0x90 [sti_drm]
[  325.199326]  sti_hdmi_set_mode [sti_drm] from drm_bridge_chain_mode_set+0x40/0x54 [drm]
[  325.199609]  drm_bridge_chain_mode_set [drm] from crtc_set_mode+0x190/0x1cc [drm_kms_helper]
[  325.199921]  crtc_set_mode [drm_kms_helper] from drm_atomic_helper_commit_tail+0x38/0x8c [drm_kms_helper]
[  325.200047]  drm_atomic_helper_commit_tail [drm_kms_helper] from commit_tail+0x15c/0x180 [drm_kms_helper]
[  325.200159]  commit_tail [drm_kms_helper] from drm_atomic_helper_commit+0x138/0x15c [drm_kms_helper]
[  325.200275]  drm_atomic_helper_commit [drm_kms_helper] from drm_atomic_commit+0xc0/0xf4 [drm]
[  325.200574]  drm_atomic_commit [drm] from drm_client_modeset_commit_atomic+0x208/0x24c [drm]
[  325.201008]  drm_client_modeset_commit_atomic [drm] from drm_client_modeset_commit_locked+0x5c/0x184 [drm]
[  325.201439]  drm_client_modeset_commit_locked [drm] from drm_client_modeset_commit+0x24/0x40 [drm]
[  325.201871]  drm_client_modeset_commit [drm] from drm_fb_helper_lastclose+0x48/0x74 [drm_kms_helper]
[  325.202177]  drm_fb_helper_lastclose [drm_kms_helper] from drm_fbdev_dma_client_restore+0xc/0x14 [drm_dma_helper]
[  325.202262]  drm_fbdev_dma_client_restore [drm_dma_helper] from drm_client_dev_restore+0x6c/0xc0 [drm]
[  325.202499]  drm_client_dev_restore [drm] from drm_release+0x120/0x124 [drm]
[  325.202931]  drm_release [drm] from __fput+0xcc/0x2d0
[  325.203156]  __fput from task_work_run+0x8c/0xbc
[  325.203171]  task_work_run from do_exit+0x320/0x96c
[  325.203192]  do_exit from do_group_exit+0x34/0x84
[  325.203212]  do_group_exit from get_signal+0x95c/0x9a4
[  325.203229]  get_signal from do_work_pending+0xf8/0x4b8
[  325.203240]  do_work_pending from slow_work_pending+0xc/0x24
[  325.203251] Exception stack(0xf0a3dfb0 to 0xf0a3dff8)
[  325.203261] dfa0:                                     00000000 c06864a2 becf3900 becf38e0
[  325.203271] dfc0: 00000003 becf3900 c06864a2 00000036 0041b1fc 00419220 00000001 00000000
[  325.203280] dfe0: b6f22ef4 becf38d0 b6f16c70 b6e450e4 40000010 00000003
[  325.211326] sti-hdmi a804000.sti-hdmi: sti_hdmi_tx6g0c28phy_start: hdmi phy pll not locked
[  325.219595] [drm:drm_atomic_bridge_chain_pre_enable [drm]] *ERROR* Unable to start hdmi phy

GallaisPoutine pushed a commit that referenced this issue Jan 23, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Link: https://lore.kernel.org/r/20250117101816.2101857-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
avolmat pushed a commit that referenced this issue Feb 9, 2025
libtraceevent parses and returns an array of argument fields, sometimes
larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr",
idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6
elements max, creating an out-of-bounds access. This runtime error is
found by UBsan. The error message:

  $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1
  builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]'
    #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966
    #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110
    #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436
    #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897
    #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335
    #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502
    #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351
    #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404
    #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448
    #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556
    #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360
    torvalds#12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6)

     0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1)                                      = 1

Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint")
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Link: https://lore.kernel.org/r/20250122025519.361873-1-howardchu95@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant