-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADL]: DSP fw load failure consistently due to memory alloc failure #3844
Comments
The same issue is seen on KBL Eve platform too. However @crojewsk has submitted the fix https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3792265 which reduced the frequency of occurance of the issue. @sathya-nujella @sathyap-chrome @cujomalainey @ranj063 @plbossart @kv2019i @Vamshigopal |
FYI to @ujfalusi as well. It seems a snd_dma_alloc_pages() fails allocating memory that can fit the firmware. One option would of course be to cache the allocation. This would cut the available memory for use-cases, but ensure we have enough at each runtime resume. |
@jairaj-arava SOF driver is already using the snd_dma_alloc_pages() so that change for SKL doesn't help here. The options are not so great if free memory is low and fragmented. Let us try to reproduce this. |
Can you check the If I recall right this patch was fixing that issue and like the following patch is needed: |
The data received via iccmax stream is not used anywhere, so no need to allocate a big DMA buffer for it. This is especially important as the allocation is done even in cases where reload of the firmware is skipped and execution happens directly from the firmware stored in IMR. BugLink: thesofproject#3844 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
I think I found a bug that can explain this. @jairaj-arava , can you try this #3853 |
|
@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853? |
We tried the below scenarios and could still see the issue in all scenarios . For the step 3, attached the dmesg logs and couldn't see fw logs since fw boot is failed (Attached fw logs dump just before we hit the issue).
|
@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between. We could also enable kmemleak in kernel and see if that catches any leaks in the test. I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback. |
Locally tested workaround: #3872 The buffer allocation for ICCMAX will be also skipped in case of IMR booting. @jairaj-arava, can you test it on your system? |
Hi @kv2019i , |
The blueprint of this is really similar to what we had not long ago within 5.17 cycle: In all cases we had good amount of memory available, yet the DMA memory allocation failed. |
@jairaj-arava Please try this out: kv2019i@09ce62e |
sure @kv2019i. Will try this and update. |
Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages for device memory. The MAYFAIL flag retains the semantics of not triggering the OOM killer, but lowers the risk of alloc failure. MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic"). This change addresses recurring failures with SOF audio driver in test cases where a system suspend-resume stress test is run, combined with an active high memory-load use-case. The failure typically shows up as: [ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware [ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12 [ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed Multiple fixes to reduce the memory usage of DSP boot have been identified in SOF driver, but even with those fixes, debug on affected systems has shown that even a single page alloc may fail with __GFP_NORETRY. When this occurs, system is under significant load on physical memory, but a lot of reclaimable pages are available, so the system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors are not hit in these stress tests. The alloc failure is severe as audio capability is completely lost if alloc failure is hit at system resume. An alternative solution was considered where the resources for DSP boot would be kept allocated until driver is unbound. This would avoid the allocation failure, but consume memory that is only needed temporarily at probe and resume time. It seems better to not hang on to the memory, but rather work a bit harder for allocating the pages at resume. BugLink: thesofproject#3844 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages for device memory. The MAYFAIL flag retains the semantics of not triggering the OOM killer, but lowers the risk of alloc failure. MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic"). This change addresses recurring failures with SOF audio driver in test cases where a system suspend-resume stress test is run, combined with an active high memory-load use-case. The failure typically shows up as: [ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware [ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12 [ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed Multiple fixes to reduce the memory usage of DSP boot have been identified in SOF driver, but even with those fixes, debug on affected systems has shown that even a single page alloc may fail with __GFP_NORETRY. When this occurs, system is under significant load on physical memory, but a lot of reclaimable pages are available, so the system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors are not hit in these stress tests. The alloc failure is severe as audio capability is completely lost if alloc failure is hit at system resume. An alternative solution was considered where the resources for DSP boot would be kept allocated until driver is unbound. This would avoid the allocation failure, but consume memory that is only needed temporarily at probe and resume time. It seems better to not hang on to the memory, but rather work a bit harder for allocating the pages at resume. BugLink: thesofproject#3844 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
So far with this fix no issue is observed in 2 boards with 10+ hours of testing and the other 4 boards running from 6+ hours. |
Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages for device memory. The MAYFAIL flag retains the semantics of not triggering the OOM killer, but lowers the risk of alloc failure. MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic"). This change addresses recurring failures with SOF audio driver in test cases where a system suspend-resume stress test is run, combined with an active high memory-load use-case. The failure typically shows up as: [ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware [ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12 [ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed Multiple fixes to reduce the memory usage of DSP boot have been identified in SOF driver, but even with those fixes, debug on affected systems has shown that even a single page alloc may fail with __GFP_NORETRY. When this occurs, system is under significant load on physical memory, but a lot of reclaimable pages are available, so the system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors are not hit in these stress tests. The alloc failure is severe as audio capability is completely lost if alloc failure is hit at system resume. An alternative solution was considered where the resources for DSP boot would be kept allocated until driver is unbound. This would avoid the allocation failure, but consume memory that is only needed temporarily at probe and resume time. It seems better to not hang on to the memory, but rather work a bit harder for allocating the pages at resume. BugLink: thesofproject#3844 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20220923153501.3326041-1-kai.vehmanen@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
Test results are good and patch merged upstream, so closing the bug. |
…m allocs Use __GFP_RETRY_MAYFAIL instead of __GFP__NORETRY in snd_dma_dev_alloc(), snd_dma_wc_alloc() and friends, to allocate pages for device memory. The MAYFAIL flag retains the semantics of not triggering the OOM killer, but lowers the risk of alloc failure. MAYFAIL flag was added in commit dcda9b0 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic"). This change addresses recurring failures with SOF audio driver in test cases where a system suspend-resume stress test is run, combined with an active high memory-load use-case. The failure typically shows up as: [ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware [ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12 [ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed Multiple fixes to reduce the memory usage of DSP boot have been identified in SOF driver, but even with those fixes, debug on affected systems has shown that even a single page alloc may fail with __GFP_NORETRY. When this occurs, system is under significant load on physical memory, but a lot of reclaimable pages are available, so the system has not run out of memory. With __GFP_RETRY_MAYFAIL, the errors are not hit in these stress tests. The alloc failure is severe as audio capability is completely lost if alloc failure is hit at system resume. An alternative solution was considered where the resources for DSP boot would be kept allocated until driver is unbound. This would avoid the allocation failure, but consume memory that is only needed temporarily at probe and resume time. It seems better to not hang on to the memory, but rather work a bit harder for allocating the pages at resume. BugLink: thesofproject/linux#3844 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20220923153501.3326041-1-kai.vehmanen@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jairaj Arava <jairaj.arava@intel.com> Signed-off-by: Sathya Prakash M R <sathya.prakash.m.r@intel.com> Signed-off-by: Sathyanarayana Nujella <sathyanarayana.nujella@intel.com> (cherry picked from commit a61c7d88d38cf3b9c88cf667c4f8a389a57744d4 https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/ for-next) BUG=b:171808591 TEST=Run suspend stress test for longer duration. Change-Id: I4a5d398367d03843603ea967cb06f947c9633a60 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3919620 Reviewed-by: Curtis Malainey <cujomalainey@chromium.org> Commit-Queue: Curtis Malainey <cujomalainey@chromium.org> Reviewed-by: Jairaj Arava <jairaj.arava@intel.corp-partner.google.com> Tested-by: Jairaj Arava <jairaj.arava@intel.corp-partner.google.com>
While running suspend stress test with a Youtube playback, consistently soon after system resume SOF fw load is failing due to memory alloc failure as shown below:
[ 379.480206] sof-audio-pci-intel-tgl 0000:00:1f.3: loading firmware
[ 379.480213] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000002 at 00000048
[ 379.480218] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000000 at 00000048
[ 379.480224] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000000 at 00000044
[ 379.480228] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000004 at 00000044
[ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware
[ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12
[ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed
[ 379.484812] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to reset DSP
[ 379.484814] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware after resume -12
[ 379.484817] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -12
[ 379.484822] SSP1-Codec: soc_pcm_open() failed (-12)
[ 379.484824] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-12)
[ 379.484827] Speakers: dpcm_fe_dai_startup() failed (-12)
[ 380.547828] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -22
[ 380.547834] SSP1-Codec: soc_pcm_open() failed (-22)
[ 380.547837] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-22)
[ 380.547839] Speakers: dpcm_fe_dai_startup() failed (-22)
Below are the steps to repro the issue quickly:
Restrict memory to 2G. This can be done by adding the mem=4g command line option to the linux kernel. Here's an example of how to do that:
/usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions "2 4" && reboot
on next reboot:
#rootdev
/dev/mmcblk0p5
Partition 4 holds current kernel. If rootdev had been mmcblk0p3, it would have been partition 2
/usr/share/vboot/bin/make_dev_ssd.sh --edit_config --partitions 4
Add mem=4G to the end of the command line and reboot.
On the next reboot:
#cat /proc/cmdline
cros_secure console= loglevel=7 init=/sbin/init cros_secure drm.trace=0x106 root=PARTUUID=ea23107f-815b-1f4e-93a4-ef3e1e14c915/PARTNROFF=1 rootwait rw dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=0 dm="1 vroot none ro 1,0 5324800 verity payload=ROOT_DEV hashtree=HASH_DEV hashstart=5324800 alg=sha256 root_hexdigest=06fa741b1ba01512f692173dd0a735d52410068b9ee77af537bb654c90057d1d salt=89acb6f5b0d731a492b52f2601378a9e1b5a290e1633337c5262128d241eb9a1" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=ea23107f-815b-1f4e-93a4-ef3e1e14c915 add_efi_memmap boot=local noresume noswap i915.modeset=1 disablevmx=off kvm-intel.vmentry_l1d_flush=always i915.enable_dpcd_backlight=1 i915.enable_dbc=1 i915.enable_guc_loading=1 i915.enable_guc_submission=0 mem=4g
head -1 /proc/meminfo
MemTotal: 1948356 kB
Notice the mem=4g parameter at the end of the command line. The actual memory available is less than 2G.
Now, log in and run youtube. After a video is playing, you can suspend & resume the system with the following command:
/usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1
If the system comes back up successfully, log in and restart the youtube video. Then suspend again, restart again, etc. Within a few minutes the resume of the audio driver will fail with the messages shown above.
Here's the bash loop can be used on the DUT to automate the suspend/resume, login, and resume of the video:
while true; do
/usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1;
sleep 20;
/usr/local/sbin/inject-keys.py -s $PASSWORD -k enter;
sleep 5;
/usr/local/sbin/inject-keys.py -k " ";
sleep 45;
done
Note: where $PASSWORD is the password for the test account used to login into the device.
Attached the dmesg log with dynamic logs enabled:
OOM_dmesg.txt
The text was updated successfully, but these errors were encountered: