Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADL]: DSP fw load failure consistently due to memory alloc failure #3844

Closed
jairaj-arava opened this issue Sep 2, 2022 · 16 comments
Closed
Assignees
Labels
ADL Applies to Alder Lake platform bug Something isn't working P1 Blocker bugs or important features

Comments

@jairaj-arava
Copy link

jairaj-arava commented Sep 2, 2022

While running suspend stress test with a Youtube playback, consistently soon after system resume SOF fw load is failing due to memory alloc failure as shown below:

[ 379.480206] sof-audio-pci-intel-tgl 0000:00:1f.3: loading firmware
[ 379.480213] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000002 at 00000048
[ 379.480218] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000000 at 00000048
[ 379.480224] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000000 at 00000044
[ 379.480228] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000004 at 00000044
[ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware
[ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12
[ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed
[ 379.484812] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to reset DSP
[ 379.484814] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware after resume -12

[ 379.484817] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -12
[ 379.484822] SSP1-Codec: soc_pcm_open() failed (-12)
[ 379.484824] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-12)
[ 379.484827] Speakers: dpcm_fe_dai_startup() failed (-12)
[ 380.547828] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -22
[ 380.547834] SSP1-Codec: soc_pcm_open() failed (-22)
[ 380.547837] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-22)
[ 380.547839] Speakers: dpcm_fe_dai_startup() failed (-22)

Below are the steps to repro the issue quickly:
Restrict memory to 2G. This can be done by adding the mem=4g command line option to the linux kernel. Here's an example of how to do that:
/usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions "2 4" && reboot

on next reboot:
#rootdev
/dev/mmcblk0p5
Partition 4 holds current kernel. If rootdev had been mmcblk0p3, it would have been partition 2
/usr/share/vboot/bin/make_dev_ssd.sh --edit_config --partitions 4
Add mem=4G to the end of the command line and reboot.

On the next reboot:
#cat /proc/cmdline
cros_secure console= loglevel=7 init=/sbin/init cros_secure drm.trace=0x106 root=PARTUUID=ea23107f-815b-1f4e-93a4-ef3e1e14c915/PARTNROFF=1 rootwait rw dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=0 dm="1 vroot none ro 1,0 5324800 verity payload=ROOT_DEV hashtree=HASH_DEV hashstart=5324800 alg=sha256 root_hexdigest=06fa741b1ba01512f692173dd0a735d52410068b9ee77af537bb654c90057d1d salt=89acb6f5b0d731a492b52f2601378a9e1b5a290e1633337c5262128d241eb9a1" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=ea23107f-815b-1f4e-93a4-ef3e1e14c915 add_efi_memmap boot=local noresume noswap i915.modeset=1 disablevmx=off kvm-intel.vmentry_l1d_flush=always i915.enable_dpcd_backlight=1 i915.enable_dbc=1 i915.enable_guc_loading=1 i915.enable_guc_submission=0 mem=4g

head -1 /proc/meminfo
MemTotal: 1948356 kB
Notice the mem=4g parameter at the end of the command line. The actual memory available is less than 2G.

Now, log in and run youtube. After a video is playing, you can suspend & resume the system with the following command:
/usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1
If the system comes back up successfully, log in and restart the youtube video. Then suspend again, restart again, etc. Within a few minutes the resume of the audio driver will fail with the messages shown above.

Here's the bash loop can be used on the DUT to automate the suspend/resume, login, and resume of the video:
while true; do
/usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1;
sleep 20;
/usr/local/sbin/inject-keys.py -s $PASSWORD -k enter;
sleep 5;
/usr/local/sbin/inject-keys.py -k " ";
sleep 45;
done

Note: where $PASSWORD is the password for the test account used to login into the device.

Attached the dmesg log with dynamic logs enabled:
OOM_dmesg.txt

@jairaj-arava
Copy link
Author

jairaj-arava commented Sep 3, 2022

The same issue is seen on KBL Eve platform too. However @crojewsk has submitted the fix https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3792265 which reduced the frequency of occurance of the issue.

@sathya-nujella @sathyap-chrome @cujomalainey @ranj063 @plbossart @kv2019i @Vamshigopal

@jairaj-arava jairaj-arava added ADL Applies to Alder Lake platform bug Something isn't working labels Sep 3, 2022
@jairaj-arava jairaj-arava changed the title [ADL]: DSP fw load failure due to memory alloc failure [ADL]: DSP fw load failure consistently due to memory alloc failure Sep 3, 2022
@kv2019i
Copy link
Collaborator

kv2019i commented Sep 5, 2022

FYI to @ujfalusi as well. It seems a snd_dma_alloc_pages() fails allocating memory that can fit the firmware. One option would of course be to cache the allocation. This would cut the available memory for use-cases, but ensure we have enough at each runtime resume.

@jairaj-arava jairaj-arava added the P1 Blocker bugs or important features label Sep 6, 2022
@kv2019i
Copy link
Collaborator

kv2019i commented Sep 8, 2022

@jairaj-arava SOF driver is already using the snd_dma_alloc_pages() so that change for SKL doesn't help here. The options are not so great if free memory is low and fragmented. Let us try to reproduce this.

@ujfalusi
Copy link
Collaborator

ujfalusi commented Sep 8, 2022

Can you check the sound/core/memalloc.c in case you are missing some patches?
There were a regression which manifested like this caused by
2c95b92 ALSA: memalloc: Unify x86 SG-buffer handling (take#3)

If I recall right this patch was fixing that issue
925ca89 ALSA: memalloc: Add fallback SG-buffer allocations for x86

and like the following patch is needed:
9882d63 ALSA: memalloc: Drop x86-specific hack for WC allocations

kv2019i added a commit to kv2019i/linux that referenced this issue Sep 8, 2022
The data received via iccmax stream is not used anywhere, so no need to
allocate a big DMA buffer for it. This is especially important as the
allocation is done even in cases where reload of the firmware is skipped
and execution happens directly from the firmware stored in IMR.

BugLink: thesofproject#3844
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i
Copy link
Collaborator

kv2019i commented Sep 8, 2022

I think I found a bug that can explain this. @jairaj-arava , can you try this #3853

@jairaj-arava
Copy link
Author

I think I found a bug that can explain this. @jairaj-arava , can you try this #3853
I still see the issue with the #3853.

@plbossart
Copy link
Member

@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853?

@jairaj-arava
Copy link
Author

@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853?

We tried the below scenarios and could still see the issue in all scenarios . For the step 3, attached the dmesg logs and couldn't see fw logs since fw boot is failed (Attached fw logs dump just before we hit the issue).

  1. ADL-004 release
  2. ADL-004 release + IMR
  3. ADL-004 release + IMR + Kai's ASoC: SOF: Intel: hda-loader: use small buffer for iccmax stream #3853 patch
    SOF_DMESG_logs.txt
    SOF_fw_logs.txt

@kv2019i
Copy link
Collaborator

kv2019i commented Sep 20, 2022

@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between.

We could also enable kmemleak in kernel and see if that catches any leaks in the test.

I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback.

@ujfalusi
Copy link
Collaborator

Locally tested workaround: #3872

The buffer allocation for ICCMAX will be also skipped in case of IMR booting.

@jairaj-arava, can you test it on your system?

@jairaj-arava
Copy link
Author

jairaj-arava commented Sep 21, 2022

@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between.

We could also enable kmemleak in kernel and see if that catches any leaks in the test.

I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback.

Hi @kv2019i ,
Attached is the memory info dump captured. I consistently see there is a good amount of memory available in the dump.

Memory_dump_new.txt

@ujfalusi
Copy link
Collaborator

The blueprint of this is really similar to what we had not long ago within 5.17 cycle:
#3609 , #3530 , #3584

In all cases we had good amount of memory available, yet the DMA memory allocation failed.
I have checked sound/core/memalloc.c, sgbuf.c for clues and patterns, but I don't think there were similar issues reported with 5.10.

@kv2019i kv2019i self-assigned this Sep 21, 2022
@kv2019i
Copy link
Collaborator

kv2019i commented Sep 22, 2022

@jairaj-arava Please try this out: kv2019i@09ce62e

@jairaj-arava
Copy link
Author

@jairaj-arava Please try this out: kv2019i@09ce62e

sure @kv2019i. Will try this and update.

kv2019i added a commit to kv2019i/linux that referenced this issue Sep 23, 2022
Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in
snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages
for device memory. The MAYFAIL flag retains the semantics of not
triggering the OOM killer, but lowers the risk of alloc failure.

MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace
__GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic").

This change addresses recurring failures with SOF audio driver in test
cases where a system suspend-resume stress test is run, combined with an
active high memory-load use-case. The failure typically shows up as:

[ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware
[ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12
[ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed

Multiple fixes to reduce the memory usage of DSP boot have been
identified in SOF driver, but even with those fixes, debug on affected
systems has shown that even a single page alloc may fail with
__GFP_NORETRY. When this occurs, system is under significant load on
physical memory, but a lot of reclaimable pages are available, so the
system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors
are not hit in these stress tests.

The alloc failure is severe as audio capability is completely lost if
alloc failure is hit at system resume.

An alternative solution was considered where the resources for DSP boot
would be kept allocated until driver is unbound. This would avoid the
allocation failure, but consume memory that is only needed temporarily
at probe and resume time. It seems better to not hang on to the memory,
but rather work a bit harder for allocating the pages at resume.

BugLink: thesofproject#3844
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Sep 23, 2022
Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in
snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages
for device memory. The MAYFAIL flag retains the semantics of not
triggering the OOM killer, but lowers the risk of alloc failure.

MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace
__GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic").

This change addresses recurring failures with SOF audio driver in test
cases where a system suspend-resume stress test is run, combined with an
active high memory-load use-case. The failure typically shows up as:

[ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware
[ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12
[ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed

Multiple fixes to reduce the memory usage of DSP boot have been
identified in SOF driver, but even with those fixes, debug on affected
systems has shown that even a single page alloc may fail with
__GFP_NORETRY. When this occurs, system is under significant load on
physical memory, but a lot of reclaimable pages are available, so the
system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors
are not hit in these stress tests.

The alloc failure is severe as audio capability is completely lost if
alloc failure is hit at system resume.

An alternative solution was considered where the resources for DSP boot
would be kept allocated until driver is unbound. This would avoid the
allocation failure, but consume memory that is only needed temporarily
at probe and resume time. It seems better to not hang on to the memory,
but rather work a bit harder for allocating the pages at resume.

BugLink: thesofproject#3844
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@jairaj-arava
Copy link
Author

@jairaj-arava Please try this out: kv2019i@09ce62e

sure @kv2019i. Will try this and update.

So far with this fix no issue is observed in 2 boards with 10+ hours of testing and the other 4 boards running from 6+ hours.

tiwai pushed a commit to tiwai/sound that referenced this issue Sep 26, 2022
Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in
snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages
for device memory. The MAYFAIL flag retains the semantics of not
triggering the OOM killer, but lowers the risk of alloc failure.

MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace
__GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic").

This change addresses recurring failures with SOF audio driver in test
cases where a system suspend-resume stress test is run, combined with an
active high memory-load use-case. The failure typically shows up as:

[ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware
[ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12
[ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed

Multiple fixes to reduce the memory usage of DSP boot have been
identified in SOF driver, but even with those fixes, debug on affected
systems has shown that even a single page alloc may fail with
__GFP_NORETRY. When this occurs, system is under significant load on
physical memory, but a lot of reclaimable pages are available, so the
system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors
are not hit in these stress tests.

The alloc failure is severe as audio capability is completely lost if
alloc failure is hit at system resume.

An alternative solution was considered where the resources for DSP boot
would be kept allocated until driver is unbound. This would avoid the
allocation failure, but consume memory that is only needed temporarily
at probe and resume time. It seems better to not hang on to the memory,
but rather work a bit harder for allocating the pages at resume.

BugLink: thesofproject#3844
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220923153501.3326041-1-kai.vehmanen@linux.intel.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
@kv2019i
Copy link
Collaborator

kv2019i commented Sep 26, 2022

Test results are good and patch merged upstream, so closing the bug.

@kv2019i kv2019i closed this as completed Sep 26, 2022
sysopenci pushed a commit to projectceladon/linux-intel-lts2019-chromium that referenced this issue Dec 8, 2022
…m allocs

Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in
snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages
for device memory. The MAYFAIL flag retains the semantics of not
triggering the OOM killer, but lowers the risk of alloc failure.

MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace
__GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic").

This change addresses recurring failures with SOF audio driver in test
cases where a system suspend-resume stress test is run, combined with an
active high memory-load use-case. The failure typically shows up as:

[ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware
[ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12
[ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed

Multiple fixes to reduce the memory usage of DSP boot have been
identified in SOF driver, but even with those fixes, debug on affected
systems has shown that even a single page alloc may fail with
__GFP_NORETRY. When this occurs, system is under significant load on
physical memory, but a lot of reclaimable pages are available, so the
system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors
are not hit in these stress tests.

The alloc failure is severe as audio capability is completely lost if
alloc failure is hit at system resume.

An alternative solution was considered where the resources for DSP boot
would be kept allocated until driver is unbound. This would avoid the
allocation failure, but consume memory that is only needed temporarily
at probe and resume time. It seems better to not hang on to the memory,
but rather work a bit harder for allocating the pages at resume.

BugLink: thesofproject/linux#3844
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220923153501.3326041-1-kai.vehmanen@linux.intel.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jairaj Arava <jairaj.arava@intel.com>
Signed-off-by: Sathya Prakash M R <sathya.prakash.m.r@intel.com>
Signed-off-by: Sathyanarayana Nujella <sathyanarayana.nujella@intel.com>
(cherry picked from commit a61c7d88d38cf3b9c88cf667c4f8a389a57744d4
 https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/ for-next)

BUG=b:171808591
TEST=Run suspend stress test for longer duration.

Change-Id: I4a5d398367d03843603ea967cb06f947c9633a60
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3919620
Reviewed-by: Curtis Malainey <cujomalainey@chromium.org>
Commit-Queue: Curtis Malainey <cujomalainey@chromium.org>
Reviewed-by: Jairaj Arava <jairaj.arava@intel.corp-partner.google.com>
Tested-by: Jairaj Arava <jairaj.arava@intel.corp-partner.google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADL Applies to Alder Lake platform bug Something isn't working P1 Blocker bugs or important features
Projects
None yet
Development

No branches or pull requests

4 participants