5.11: Fix amdgpu with S0ix #60

crawfxrd · 2021-08-13T16:20:56Z

The patch train at least applies cleanly, so let's see if it fixes it on 5.11 as well.

crawfxrd · 2021-08-17T13:12:15Z

pang11 suspends fine, but resets when it should resume.

The one time it "worked", I got this:

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 8 PID: 2684 at kernel/irq/chip.c:210 irq_startup+0xef/0x100
kernel: Modules linked in: ccm rfcomm cmac algif_hash algif_skcipher af_alg bnep snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd>
kernel:  system76_acpi(OE) i915 hid_generic amdgpu iommu_v2 gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect s>
kernel: CPU: 8 PID: 2684 Comm: systemd-sleep Tainted: G           OE     5.13.0-7614-generic #14
kernel: Hardware name: System76 Pangolin/Pangolin, BIOS 1.07.07 10/15/2020
kernel: RIP: 0010:irq_startup+0xef/0x100
kernel: Code: f6 4c 89 ef e8 e2 3e 00 00 85 c0 75 21 4c 89 ef 31 d2 4c 89 f6 e8 b1 c6 ff ff 4c 89 e7 e8 a9 fe ff ff 41 89 c5 e9 4c ff ff ff <0>
kernel: RSP: 0018:ffffb3e6427cbba0 EFLAGS: 00010002
kernel: RAX: 0000000000000010 RBX: 0000000000000001 RCX: 0000000000000040
kernel: RDX: ffffffffffffffff RSI: ffffffff83d60620 RDI: ffff9ab701080318
kernel: RBP: ffffb3e6427cbbc0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: ffffffff83b553c8 R12: ffff9ab702b6fe00
kernel: R13: 0000000000000001 R14: ffff9ab701080318 R15: ffff9ab702b6fea4
kernel: FS:  00007f7bb663d900(0000) GS:ffff9ab807600000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000556c0b9d9bb4 CR3: 00000001137b4000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel:  __enable_irq+0x52/0x60
kernel:  resume_irqs+0xd3/0x120
kernel:  resume_device_irqs+0x10/0x20
kernel:  dpm_resume_noirq+0x1db/0x2a0
kernel:  suspend_devices_and_enter+0x214/0x580
kernel:  pm_suspend.cold+0x332/0x37d
kernel:  state_store+0x82/0xe0
kernel:  kobj_attr_store+0x12/0x20
kernel:  sysfs_kf_write+0x3e/0x50
kernel:  kernfs_fop_write_iter+0x138/0x1c0
kernel:  new_sync_write+0x117/0x1b0
kernel:  vfs_write+0x185/0x250
kernel:  ksys_write+0x67/0xe0
kernel:  __x64_sys_write+0x1a/0x20
kernel:  do_syscall_64+0x61/0xb0
kernel:  ? syscall_exit_to_user_mode+0x27/0x50
kernel:  ? __do_sys_gettid+0x1b/0x20
kernel:  ? do_syscall_64+0x6e/0xb0
kernel:  ? exit_to_user_mode_prepare+0x3d/0x1c0
kernel:  ? do_user_addr_fault+0x1d0/0x640
kernel:  ? irqentry_exit_to_user_mode+0x9/0x20
kernel:  ? irqentry_exit+0x19/0x30
kernel:  ? exc_page_fault+0x8f/0x170
kernel:  ? asm_exc_page_fault+0x8/0x30
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f7bb6ed7c47
kernel: Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <4>
kernel: RSP: 002b:00007ffdfa4abba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f7bb6ed7c47
kernel: RDX: 0000000000000004 RSI: 00007ffdfa4abc60 RDI: 0000000000000004
kernel: RBP: 00007ffdfa4abc60 R08: 0000000000000004 R09: 0000000000000000
kernel: R10: 00007f7bb6f75040 R11: 0000000000000246 R12: 0000000000000004
kernel: R13: 000055bf4826f2d0 R14: 0000000000000004 R15: 00007f7bb6fb18a0
kernel: ---[ end trace 41bcfee1ceee2330 ]---

kernel: nvme nvme0: I/O 192 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 193 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 194 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 195 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 196 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 197 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 198 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 199 QID 4 timeout, aborting
kernel: nvme nvme0: I/O 16 QID 0 timeout, reset controller
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: nvme nvme0: Abort status: 0x371
kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -16
kernel: nvme 0000:04:00.0: PM: failed to resume async: error -16

Although first implemented for NVME, this check may be usable by other drivers as well. Microsoft's specification explicitly mentions that is may be usable by SATA and AHCI devices. Google also indicates that they have used this with SDHCI in a downstream kernel tree that a user can plug a storage device into. Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Suggested-by: Keith Busch <kbusch@kernel.org> CC: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> CC: Alexander Deucher <Alexander.Deucher@amd.com> CC: Rafael J. Wysocki <rjw@rjwysocki.net> CC: Prike Liang <prike.liang@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>

AMD systems from Renoir and Lucienne require that the NVME controller is put into D3 over a Modern Standby / suspend-to-idle cycle. This is "typically" accomplished using the `StorageD3Enable` property in the _DSD, but this property was introduced after many of these systems launched and most OEM systems don't have it in their BIOS. On AMD Renoir without these drives going into D3 over suspend-to-idle the resume will fail with the NVME controller being reset and a trace like this in the kernel logs: ``` [ 83.556118] nvme nvme0: I/O 161 QID 2 timeout, aborting [ 83.556178] nvme nvme0: I/O 162 QID 2 timeout, aborting [ 83.556187] nvme nvme0: I/O 163 QID 2 timeout, aborting [ 83.556196] nvme nvme0: I/O 164 QID 2 timeout, aborting [ 95.332114] nvme nvme0: I/O 25 QID 0 timeout, reset controller [ 95.332843] nvme nvme0: Abort status: 0x371 [ 95.332852] nvme nvme0: Abort status: 0x371 [ 95.332856] nvme nvme0: Abort status: 0x371 [ 95.332859] nvme nvme0: Abort status: 0x371 [ 95.332909] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -16 [ 95.332936] nvme 0000:03:00.0: PM: failed to resume async: error -16 ``` The Microsoft documentation for StorageD3Enable mentioned that Windows has a hardcoded allowlist for D3 support, which was used for these platforms. Introduce quirks to hardcode them for Linux as well. As this property is now "standardized", OEM systems using AMD Cezanne and newer APU's have adopted this property, and quirks like this should not be necessary. CC: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> CC: Alexander Deucher <Alexander.Deucher@amd.com> CC: Prike Liang <prike.liang@amd.com> Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>

AMD spec mentions only revision 0. With this change, device constraint list is populated properly. Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Refactor common code to prepare for upcoming changes. * Remove unused struct. * Print error before returning. * Frees ACPI obj if _DSM type is not as expected. * Treat lps0_dsm_func_mask as an integer rather than character * Remove extra out_obj * Move rev_id Co-developed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Required for follow-up patch adding new UUID needing new function mask. Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This adds supports for _DSM notifications to the Microsoft UUID described by Microsoft documentation for s2idle. Link: https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/modern-standby-firmware-notifications Co-developed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Some AMD Systems with uPEP _HID AMD004/AMDI005 have an off by one bug in their function mask return. This means that they will call entrance but not exit for matching functions. Other AMD systems with this HID should use the Microsoft generic UUID. AMD systems with uPEP HID AMDI006 should be using the Microsoft method. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

When using s2idle on a variety of AMD notebook systems, they are experiencing spurious events that the EC or SMU are in the wrong state leading to a hard time waking up or higher than expected power consumption. These events only occur when the EC GPE is inadvertently set as a wakeup source. Originally the EC GPE was only set as a wakeup source when using the intel-vbtn or intel-hid drivers in commit 10a08fd ("ACPI: PM: Set up EC GPE for system wakeup from drivers that need it") but during testing a reporter discovered that this was not enough for their ASUS Zenbook UX430UNR/i7-8550U to wakeup by lid event or keypress. Marking the EC GPE for wakeup universally resolved this for that reporter in commit b90ff35 ("ACPI: PM: s2idle: Always set up EC GPE for system wakeup"). However this behavior has lead to a number of problems: * On both Lenovo T14 and P14s the keyboard wakeup doesn't work, and sometimes the power button event doesn't work. * On HP 635 G7 detaching or attaching AC during suspend will cause the system not to wakeup * On Asus vivobook to prevent detaching AC causing resume problems * On Lenovo 14ARE05 to prevent detaching AC causing resume problems * On HP ENVY x360 to prevent detaching AC causing resume problems As there may be other Intel systems besides ASUS Zenbook UX430UNR/i7-8550U that don't use intel-vbtn or intel-hid, avoid these problems by only universally marking the EC GPE wakesource on non-AMD systems. Link: https://patchwork.kernel.org/project/linux-pm/cover/5997740.FPbUVk04hV@kreacher/#22825489 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1230 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1629 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The protocol to submit a job request to SMU is to wait for AMD_PMC_REGISTER_RESPONSE to return 1,meaning SMU is ready to take requests. PMC driver has to make sure that the response code is always AMD_PMC_RESULT_OK before making any command submissions. When we submit a message to SMU, we have to wait until it processes the request. Adding a read_poll_timeout() check as this was missing in the existing code. Also, add a mutex to protect amd_pmc_send_cmd() calls to SMU. Fixes: 156ec47 ("platform/x86: amd-pmc: Add AMD platform support for S2Idle") Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Acked-by: Raul E Rangel <rrangel@chromium.org> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-2-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

It was lately understood that the current mechanism available in the driver to get SMU firmware info works only on internal SMU builds and there is a separate way to get all the SMU logging counters (addressed in the next patch). Hence remove all the smu info shown via debugfs as it is no more useful. Fixes: 156ec47 ("platform/x86: amd-pmc: Add AMD platform support for S2Idle") Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-3-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

Currently amd_pmc_dump_registers() routine is being called at multiple places. The best to call it is after command submission to SMU. Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-4-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

SMU provides a way to dump the s0ix debug statistics in the form of a metrics table via a of set special mailbox commands. Add support to the driver which can send these commands to SMU and expose the information received via debugfs. The information contains the s0ix entry/exit, active time of each IP block etc. As a side note, SMU subsystem logging is not supported on Picasso based SoC's. Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-5-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

Even the FCH SSC registers provides certain level of information about the s0ix entry and exit times which comes handy when the SMU fails to report the statistics via the mailbox communication. This information is captured via a new debugfs file "s0ix_stats". A non-zero entry in this counters would mean that the system entered the s0ix state. If s0ix entry time and exit time don't change during suspend to idle, the silicon has not entered the deepest state. Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-6-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

Some newer BIOSes have added another ACPI ID for the uPEP device. SMU statistics behave identically on this device. Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-7-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

The upcoming PMC controller would have a newer acpi id, add that to the supported acpid device list. Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210629084803.248498-8-Shyam-sundar.S-k@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

Right now the driver will still return success even if the OS_HINT command failed to send to the SMU. In the rare event of a failure, the suspend should really be aborted here so that relevant logs can may be captured. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://lore.kernel.org/r/20210707141647.8871-1-mario.limonciello@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>

It was reported by a user with a Dell m15 R5 (5800H) that the keyboard backlight was turning on when entering suspend and turning off when exiting (the opposite of how it should be). The user bisected it back to commit 5dbf509 ("ACPI: PM: s2idle: Add support for new Microsoft UUID"). Previous to that commit the LEDs didn't turn off at all. Confirming in the spec, these were reversed when introduced. Fix them to match the spec. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1230#note_1021836 Fixes: 5dbf509 ("ACPI: PM: s2idle: Add support for new Microsoft UUID") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

…orted It was reported that on "HP ENVY x360" that power LED does not come back, certain keys like brightness controls do not work, and the fan never spins up, even under load on 5.14 final. In analysis of the SSDT it's clear that the Microsoft UUID doesn't provide functional support, but rather the AMD UUID should be supporting this system. Because this is a gap in the expected logic, we checked back with internal team. The conclusion was that on Windows AMD uPEP *does* run even when Microsoft UUID present, but most OEM systems have adopted value of "0x3" for supported functions and hence nothing runs. Henceforth add support for running both Microsoft and AMD methods. This approach will also allow the same logic on Intel systems if desired at a future time as well by pulling the evaluation of `lps0_dsm_func_mask_microsoft` out of the `if` block for `acpi_s2idle_vendor_amd`. Link: https://gitlab.freedesktop.org/drm/amd/uploads/9fbcd7ec3a385cc6949c9bacf45dc41b/acpi-f.20.bin BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1691 Reported-by: Maxwell Beck <max@ryt.one> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> [ rjw: Edits of the new comments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[ Upstream commit 04d8066 ] Currently, with an unknown recv_type, mwifiex_usb_recv just return -1 without restoring the skb. Next time mwifiex_usb_rx_complete is invoked with the same skb, calling skb_put causes skb_over_panic. The bug is triggerable with a compromised/malfunctioning usb device. After applying the patch, skb_over_panic no longer shows up with the same input. Attached is the panic report from fuzzing. skbuff: skb_over_panic: text:000000003bf1b5fa len:2048 put:4 head:00000000dd6a115b data:000000000a9445d8 tail:0x844 end:0x840 dev:<NULL> kernel BUG at net/core/skbuff.c:109! invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 PID: 198 Comm: in:imklog Not tainted 5.6.0 #60 RIP: 0010:skb_panic+0x15f/0x161 Call Trace: <IRQ> ? mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb] skb_put.cold+0x24/0x24 mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb] __usb_hcd_giveback_urb+0x1e4/0x380 usb_giveback_urb_bh+0x241/0x4f0 ? __hrtimer_run_queues+0x316/0x740 ? __usb_hcd_giveback_urb+0x380/0x380 tasklet_action_common.isra.0+0x135/0x330 __do_softirq+0x18c/0x634 irq_exit+0x114/0x140 smp_apic_timer_interrupt+0xde/0x380 apic_timer_interrupt+0xf/0x20 </IRQ> Reported-by: Brendan Dolan-Gavitt <brendandg@nyu.edu> Signed-off-by: Zekun Shen <bruceshenzk@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/YX4CqjfRcTa6bVL+@Zekuns-MBP-16.fios-router.home Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 8b59b0a upstream. arm32 uses software to simulate the instruction replaced by kprobe. some instructions may be simulated by constructing assembly functions. therefore, before executing instruction simulation, it is necessary to construct assembly function execution environment in C language through binding registers. after kasan is enabled, the register binding relationship will be destroyed, resulting in instruction simulation errors and causing kernel panic. the kprobe emulate instruction function is distributed in three files: actions-common.c actions-arm.c actions-thumb.c, so disable KASAN when compiling these files. for example, use kprobe insert on cap_capable+20 after kasan enabled, the cap_capable assembly code is as follows: <cap_capable>: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr} e1a05000 mov r5, r0 e280006c add r0, r0, #108 ; 0x6c e1a04001 mov r4, r1 e1a06002 mov r6, r2 e59fa090 ldr sl, [pc, #144] ; ebfc7bf8 bl c03aa4b4 <__asan_load4> e595706c ldr r7, [r5, #108] ; 0x6c e2859014 add r9, r5, #20 ...... The emulate_ldr assembly code after enabling kasan is as follows: c06f1384 <emulate_ldr>: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr} e282803c add r8, r2, #60 ; 0x3c e1a05000 mov r5, r0 e7e37855 ubfx r7, r5, #16, #4 e1a00008 mov r0, r8 e1a09001 mov r9, r1 e1a04002 mov r4, r2 ebf35462 bl c03c6530 <__asan_load4> e357000f cmp r7, #15 e7e36655 ubfx r6, r5, #12, #4 e205a00f and sl, r5, #15 0a000001 beq c06f13bc <emulate_ldr+0x38> e0840107 add r0, r4, r7, lsl #2 ebf3545c bl c03c6530 <__asan_load4> e084010a add r0, r4, sl, lsl #2 ebf3545a bl c03c6530 <__asan_load4> e2890010 add r0, r9, #16 ebf35458 bl c03c6530 <__asan_load4> e5990010 ldr r0, [r9, #16] e12fff30 blx r0 e356000f cm r6, #15 1a000014 bne c06f1430 <emulate_ldr+0xac> e1a06000 mov r6, r0 e2840040 add r0, r4, #64 ; 0x40 ...... when running in emulate_ldr to simulate the ldr instruction, panic occurred, and the log is as follows: Unable to handle kernel NULL pointer dereference at virtual address 00000090 pgd = ecb46400 [00000090] *pgd=2e0fa003, *pmd=00000000 Internal error: Oops: 206 [#1] SMP ARM PC is at cap_capable+0x14/0xb0 LR is at emulate_ldr+0x50/0xc0 psr: 600d0293 sp : ecd63af8 ip : 00000004 fp : c0a7c30c r10: 00000000 r9 : c30897f4 r8 : ecd63cd4 r7 : 0000000f r6 : 0000000a r5 : e59fa090 r4 : ecd63c98 r3 : c06ae294 r2 : 00000000 r1 : b7611300 r0 : bf4ec008 Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user Control: 32c5387d Table: 2d546400 DAC: 55555555 Process bash (pid: 1643, stack limit = 0xecd60190) (cap_capable) from (kprobe_handler+0x218/0x340) (kprobe_handler) from (kprobe_trap_handler+0x24/0x48) (kprobe_trap_handler) from (do_undefinstr+0x13c/0x364) (do_undefinstr) from (__und_svc_finish+0x0/0x30) (__und_svc_finish) from (cap_capable+0x18/0xb0) (cap_capable) from (cap_vm_enough_memory+0x38/0x48) (cap_vm_enough_memory) from (security_vm_enough_memory_mm+0x48/0x6c) (security_vm_enough_memory_mm) from (copy_process.constprop.5+0x16b4/0x25c8) (copy_process.constprop.5) from (_do_fork+0xe8/0x55c) (_do_fork) from (SyS_clone+0x1c/0x24) (SyS_clone) from (__sys_trace_return+0x0/0x10) Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7) Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support") Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM") Signed-off-by: huangshaobo <huangshaobo6@huawei.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit 04d8066 ] Currently, with an unknown recv_type, mwifiex_usb_recv just return -1 without restoring the skb. Next time mwifiex_usb_rx_complete is invoked with the same skb, calling skb_put causes skb_over_panic. The bug is triggerable with a compromised/malfunctioning usb device. After applying the patch, skb_over_panic no longer shows up with the same input. Attached is the panic report from fuzzing. skbuff: skb_over_panic: text:000000003bf1b5fa len:2048 put:4 head:00000000dd6a115b data:000000000a9445d8 tail:0x844 end:0x840 dev:<NULL> kernel BUG at net/core/skbuff.c:109! invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 PID: 198 Comm: in:imklog Not tainted 5.6.0 #60 RIP: 0010:skb_panic+0x15f/0x161 Call Trace: <IRQ> ? mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb] skb_put.cold+0x24/0x24 mwifiex_usb_rx_complete+0x26b/0xfcd [mwifiex_usb] __usb_hcd_giveback_urb+0x1e4/0x380 usb_giveback_urb_bh+0x241/0x4f0 ? __hrtimer_run_queues+0x316/0x740 ? __usb_hcd_giveback_urb+0x380/0x380 tasklet_action_common.isra.0+0x135/0x330 __do_softirq+0x18c/0x634 irq_exit+0x114/0x140 smp_apic_timer_interrupt+0xde/0x380 apic_timer_interrupt+0xf/0x20 </IRQ> Reported-by: Brendan Dolan-Gavitt <brendandg@nyu.edu> Signed-off-by: Zekun Shen <bruceshenzk@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/YX4CqjfRcTa6bVL+@Zekuns-MBP-16.fios-router.home Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 8b59b0a upstream. arm32 uses software to simulate the instruction replaced by kprobe. some instructions may be simulated by constructing assembly functions. therefore, before executing instruction simulation, it is necessary to construct assembly function execution environment in C language through binding registers. after kasan is enabled, the register binding relationship will be destroyed, resulting in instruction simulation errors and causing kernel panic. the kprobe emulate instruction function is distributed in three files: actions-common.c actions-arm.c actions-thumb.c, so disable KASAN when compiling these files. for example, use kprobe insert on cap_capable+20 after kasan enabled, the cap_capable assembly code is as follows: <cap_capable>: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr} e1a05000 mov r5, r0 e280006c add r0, r0, #108 ; 0x6c e1a04001 mov r4, r1 e1a06002 mov r6, r2 e59fa090 ldr sl, [pc, #144] ; ebfc7bf8 bl c03aa4b4 <__asan_load4> e595706c ldr r7, [r5, #108] ; 0x6c e2859014 add r9, r5, #20 ...... The emulate_ldr assembly code after enabling kasan is as follows: c06f1384 <emulate_ldr>: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr} e282803c add r8, r2, #60 ; 0x3c e1a05000 mov r5, r0 e7e37855 ubfx r7, r5, #16, #4 e1a00008 mov r0, r8 e1a09001 mov r9, r1 e1a04002 mov r4, r2 ebf35462 bl c03c6530 <__asan_load4> e357000f cmp r7, #15 e7e36655 ubfx r6, r5, #12, #4 e205a00f and sl, r5, #15 0a000001 beq c06f13bc <emulate_ldr+0x38> e0840107 add r0, r4, r7, lsl #2 ebf3545c bl c03c6530 <__asan_load4> e084010a add r0, r4, sl, lsl #2 ebf3545a bl c03c6530 <__asan_load4> e2890010 add r0, r9, #16 ebf35458 bl c03c6530 <__asan_load4> e5990010 ldr r0, [r9, #16] e12fff30 blx r0 e356000f cm r6, #15 1a000014 bne c06f1430 <emulate_ldr+0xac> e1a06000 mov r6, r0 e2840040 add r0, r4, #64 ; 0x40 ...... when running in emulate_ldr to simulate the ldr instruction, panic occurred, and the log is as follows: Unable to handle kernel NULL pointer dereference at virtual address 00000090 pgd = ecb46400 [00000090] *pgd=2e0fa003, *pmd=00000000 Internal error: Oops: 206 [#1] SMP ARM PC is at cap_capable+0x14/0xb0 LR is at emulate_ldr+0x50/0xc0 psr: 600d0293 sp : ecd63af8 ip : 00000004 fp : c0a7c30c r10: 00000000 r9 : c30897f4 r8 : ecd63cd4 r7 : 0000000f r6 : 0000000a r5 : e59fa090 r4 : ecd63c98 r3 : c06ae294 r2 : 00000000 r1 : b7611300 r0 : bf4ec008 Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user Control: 32c5387d Table: 2d546400 DAC: 55555555 Process bash (pid: 1643, stack limit = 0xecd60190) (cap_capable) from (kprobe_handler+0x218/0x340) (kprobe_handler) from (kprobe_trap_handler+0x24/0x48) (kprobe_trap_handler) from (do_undefinstr+0x13c/0x364) (do_undefinstr) from (__und_svc_finish+0x0/0x30) (__und_svc_finish) from (cap_capable+0x18/0xb0) (cap_capable) from (cap_vm_enough_memory+0x38/0x48) (cap_vm_enough_memory) from (security_vm_enough_memory_mm+0x48/0x6c) (security_vm_enough_memory_mm) from (copy_process.constprop.5+0x16b4/0x25c8) (copy_process.constprop.5) from (_do_fork+0xe8/0x55c) (_do_fork) from (SyS_clone+0x1c/0x24) (SyS_clone) from (__sys_trace_return+0x0/0x10) Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7) Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support") Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM") Signed-off-by: huangshaobo <huangshaobo6@huawei.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

crawfxrd changed the title ~~Fix amdgpu with S0ix~~ 5.11: Fix amdgpu with S0ix Aug 16, 2021

crawfxrd marked this pull request as ready for review August 17, 2021 12:59

crawfxrd requested review from a team August 17, 2021 12:59

jackpot51 force-pushed the master branch 2 times, most recently from a887ad7 to e862cf5 Compare August 26, 2021 15:31

superm1 and others added 18 commits September 9, 2021 09:39

crawfxrd force-pushed the amdgpu-s0ix branch from fd161ac to 5b5c939 Compare September 9, 2021 15:41

jackpot51 closed this Sep 14, 2021

jackpot51 deleted the amdgpu-s0ix branch September 14, 2021 18:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

5.11: Fix amdgpu with S0ix #60

5.11: Fix amdgpu with S0ix #60

crawfxrd commented Aug 13, 2021

crawfxrd commented Aug 17, 2021 •

edited

Loading

5.11: Fix amdgpu with S0ix #60

5.11: Fix amdgpu with S0ix #60

Conversation

crawfxrd commented Aug 13, 2021

crawfxrd commented Aug 17, 2021 • edited Loading

crawfxrd commented Aug 17, 2021 •

edited

Loading