-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG]Empty or stuck DMA trace - Initial FW ABI banner message not found in logger.data.txt #4333
Comments
@XiaoyunWu6666 I assume you are using latest logger (built using same commit ID as FW) ? |
I just rewrote that test in thesofproject/sof-test#666, before that it was not checking anything. So it is not obvious whether the lack of DMA trace is a recent regression (for instance caused by recent #4327), or it has always been a problem and we never noticed (green failure...) |
I logged into the storage server to explore the generic_test results and I found that the DMA trace on TGLU_RVP_HDA has been empty 20% of the time; 50 times out of 250 runs. So this is NOT a recent regression and there is no reason to revert the #4327 fixes. There is a reason why thesofproject/sof-test#666 was called "rewrite check-sof-logger so it can actually find (existing) bugs". Don't shoot the messenger. The DMA trace on TGLU_RVP_HDA has more specifically been empty in these test runs:
706 715 717 722 723 732 762 779 788 789 800 802 812 817 829 853 863 871 881 990 995 1022 1139 1288 1293 1305 1326 1429 1720 1897 1964 1995 2284 2290 2308 2320 2377 2491 2596 2810 3961 4081 4137 4171 4191 4276 4290 4346 4359 4398 4408 706 is the oldest TGLU_RVP_HDA failure and it's dated 2020-11-09 21:27:08 UTC Overall 706 is the 3rd oldest record I found so I think this issue is older than these records - possibly been here since TGLU_RVP_HDA existed? In fact DMA traces from the check-sof-logger have been randomly empty in many different TGL product (not just TGLU_RVP_HDA): 173 174 178 182 186 190 194 198 203 207 211 215 219 223 227 231 235 239 245 249 256 267 272 276 297 331 332 338 340 343 347 348 351 354 356 357 358 362 370 374 377 382 385 388 391 394 402 406 407 408 412 415 421 431 432 437 438 439 440 441 444 446 447 451 454 457 460 464 473 482 489 495 498 504 507 513 519 524 527 531 538 548 558 564 566 567 573 590 594 606 607 610 613 620 627 628 633 640 643 675 683 706 715 717 722 723 727 732 750 755 762 779 788 789 800 802 812 817 822 828 829 843 850 853 863 871 881 886 891 912 990 995 1022 1059 1082 1118 1139 1145 1288 1293 1305 1326 1351 1361 1366 1391 1397 1408 1429 1441 1456 1510 1539 1547 1568 1616 1668 1679 1720 1734 1735 1807 1828 1868 1893 1897 1926 1933 1937 1950 1964 1995 2020 2026 2050 2081 2096 2118 2131 2150 2154 2164 2171 2177 2190 2220 2256 2275 2284 2290 2302 2305 2308 2320 2357 2367 2377 2491 2596 2810 3610 3664 3685 3688 3711 3961 4081 4137 4171 4191 4276 4290 4346 4359 4398 4408 Run 173 (TGL_RVP_HDA) is dated 2020-07-23 For some unknown reason it's been rare with TGLU_RVP_SDW 5 runs out of 2628! (runs: 788 1868 1893 2256 3664) and rare with TGL_RVP_SDW too: 1 run out of 124, (run 447) While at it I checked other platforms and the most numerous failures happened (by far) on BYT_MB_NOCODEC (but only 15% of the time)
We noticed this issue on BYT (and others) a long time ago, see "byt: trace out of sync" email thread and this is why I made this "DMA nudge" test change to hide it: thesofproject/sof-test#663 However this "DMA nudge" applies only to generic log collection at the end of every "normal" test, it does not take effect for check-sof-logger that is specifically invoking sof-logger. |
While working on some Zephyr issues I discovered that enum task_state task_main_primary_core(void *data)
{
/* main audio processing loop */
while (1) {
/* sleep until next IPC or DMA */
wait_for_interrupt(0);
if (!ipc->pm_prepare_D3)
ipc_send_queued_msg();
}
} This seems strange to me: could this cause some IPCs to be delayed for a very long time? I'm not considering Zephyr in this bug. See #4356 Cc: @lgirdwood , @lyakh , @kv2019i , @keyonjie , @plbossart |
@marc-hb the LL scheduler on SOF will tick so this is not an issue, however we run Zephyr in tickless mode so all IPC work should be done in IPC threads.
e.g something like the below would work for 1 & 2. @lyakh may already have some example code. while (1) {
if (ipc_is pending())
do_ipc()
else
sleep();
} |
See bug thesofproject/sof#4333 Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Some interesting experiments and results in thesofproject/sof-test#710
|
See bug thesofproject/sof#4333 Signed-off-by: Marc Herbert <marc.herbert@intel.com>
See bug thesofproject/sof#4333 Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Workaround 710 has been merged, please link to test results if you see this again. 710 is a workaround, not a fix. |
hi @marc-hb I found a sof-logger related failure has happened on ADLP_RVP_HDA since 3 days ago. |
No: there was topology issue in that run and all tests failed. |
@marc-hb Sorry for giving the wrong info. Please refer to this one |
check-sof-logger has been failing regularly again. I submitted thesofproject/sof-test#876 to "nudge" the trace again and stop the noise in test results. Recent failures with an empty DMA trace
|
Happened again in daily run 10879 : the last daily run before workaround thesofproject/sof-test#876 was merged. Happened on ADLP_GMB_I2S_ZEPHYR, TGLU_RVP_NOCODEC and ADLP_RVP_NOCODEC and no other product. Start Time: 2022-03-09 22:27:42 UTC |
@marc-hb, I came to a conclusion that this is a feature, not a bug. Here is why: This is implemented like this to avoid duplicated traces in the sof-logger since what the sof-logger does is: Since the kernel has no means to know that the file is re-opened by the sof-logger or the sof-logger was stopped and later started, the kernel always resets the host_offset on file release. The only way this might be solved is to add write support for the trace file and any write would reset the host_offset. sof-logger would write there before re-open to avoid duplication, but eventually when you will have trace data coming in, you will have the trace from offset 0 printed out (which includes duplicated trace). Or we can add new variable to track the read/write pointer in the host buffer, but we still need a way to let the trace kernel code to know that sof-logger is re-opening or it has been killed and then started again. |
Thanks for the complicated explanation but the check-sof-logger test is a fairly simple test that just restarts the firmware and tries to read the DMA trace only once. The sof-logger process is never restarted and the message you quoted is never seen in that particular test whether it passes or fail. The DMA trace should never be empty in that super simple test case. It's really testing the most basic "does it work at all" feature, not any sof-logger restart or any other advanced use case. This test barely looks at what was logged. If we couldn't reliably test something as basic and as simple as "the logs are not completely empty" then there would probably be a more serious design issue than the one you describe. It is also a very deterministic test (or at least should be) yet it has been failing semi-randomly again for the last few weeks after being very reliable for months. We can make the test even simpler and even more deterministic if you have any specific suggestion. So your explanation looks very interesting and worth being saved in some README or source code comments, not buried and lost in some obscure github issue, but I doubt it's relevant to the super basic failures we are seeing right now. Did you try to run that test? |
@marc-hb, I found a local machine which I can reproduce this and the prints were super helpful! thesofproject/linux#3523 should fix this for good. Fingers crossed. |
Daily run 10920 (2022-03-10) was run short after the "trace nudge" was restored in sof-test (that's one week ago, sorry for the delay). The DMA trace got stuck on 2 platforms and it the nudge was enough to make the test pass only on one:
Start Time: 2022-03-10 22:27:34 UTC |
@marc-hb, on my machine the bootup is pretty silent (
Self compiled FW for dtrace position tracking. I see now, the nudge is local to fwiw, thesofproject/linux#3523 is updated. |
Switch everything to the shared memory
OK :-)
It's used in other tests but in other tests it makes no PASS/FAIL difference.
The test is currently waiting for the DSP to go to sleep. We can remove that to make it simpler, it should be a one-line change. What do you think? |
Similar results in daily run 11164: empty on 3 systems ADLP_GMB_I2S_ZEPHYR, ICL_RVP_HDA and ADLP_RVP_NOCODEC, test "rescued" by the nudge except on ADLP_GMB_I2S_ZEPHYR Start Time: 2022-03-17 21:28:49 UTC |
It would be interesting to have thesofproject/linux#3523 merged and observe the daily results for couple of days. |
@marc-hb, it looks like we have two types of missing dtrace data:
Binaries from thesofproject/linux#3526 can be used with identical prints if there is an interest. |
@marc-hb, pushed one more commit to thesofproject/linux#3526 to track the trace file open and release as well.
|
@marc-hb, I have managed to reproduce the issue (on
The dmesg have this to say about the runs:
The reason for the |
In Intel daily test run 12177 I see a clear IPC failure in setting up the logger (ADLP_GMB_I2S_ZEPHYR): [ 89.504320] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x90030000: GLB_TRACE_MSG: DMA_PARAMS_EXT This seems quite different from the cases @ujfalusi you've last seen on BYT... right? |
I added this wait for D3 in commit 6a8871f ("check-sof-logger: reload drivers first") that made the test more deterministic but I'm not 100% sure why it's now waiting for D3. Probably to minimize changes because before the addition of reload_drivers() the DSP was most likely in D3. While I recently made cavstool compatible with D3, the protocol used by cavstool is still different from the other traces: it won't print ANY log while the DSP is off. So change the test behavior to grab logs immediately after reloading the drivers / rebooting the DSP. This may also help with thesofproject/sof#4333, thesofproject/linux#3448 and others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
I added this wait for D3 in commit 6a8871f ("check-sof-logger: reload drivers first") that made the test more deterministic but I'm not 100% sure why it's now waiting for D3. Probably to minimize changes because before the addition of reload_drivers() the DSP was most likely in D3. While I recently made cavstool compatible with D3, the protocol used by cavstool is still different from the other traces: it won't print ANY log while the DSP is off. So change the test behavior to grab logs immediately after reloading the drivers / rebooting the DSP. This may also help with thesofproject/sof#4333, thesofproject/linux#3448 and others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
I added this wait for D3 in commit 6a8871f ("check-sof-logger: reload drivers first") that made the test more deterministic but I'm not 100% sure why it's now waiting for D3. Probably to minimize changes because before the addition of reload_drivers() the DSP was most likely in D3. While I recently made cavstool compatible with D3, the protocol used by cavstool is still different from the other traces: it won't print ANY log while the DSP is off. So change the test behavior to grab logs immediately after reloading the drivers / rebooting the DSP. This may also help with thesofproject/sof#4333, thesofproject/linux#3448 and others linked from there. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Close as this issue cannot be reproduced in recent daily testing. |
It can be reproduced, there's a workaround to ignore the error. |
@marc-hb EDIT: after making testing more deterministic in thesofproject/sof-test#710, now reproduced 100% on BDW and BYT, see any recent test results like for instance this one: https://sof-ci.01.org/softestpr/PR720/build763/devicetest/?model=BYT_MB_NOCODEC&testcase=check-sof-logger
Describe the bug
inner daily 4532 ; model=TGLU_RVP_HDA (and many others, see below); testcase=check-sof-logger
To Reproduce
TPLG=sof-hda-generic-4ch-kwd.tplg ~/sof-test/test-case/check-sof-logger.sh
Reproduction Rate
100%
Environment
Kernel Branch: topic/sof-dev
Kernel Commit: 57440285
SOF Branch: main
SOF Commit: 73df644
Topology:sof-hda-generic-4ch-kwd.tplg
Platform: TGLU_RVP_HDA
Screenshots or console output
https://thesofproject.github.io/latest/developer_guides/firmware/drivers/dma/intel/hda-dma.html
The text was updated successfully, but these errors were encountered: