Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]Empty or stuck DMA trace - Initial FW ABI banner message not found in logger.data.txt #4333

Closed
XiaoyunWu6666 opened this issue Jun 11, 2021 · 43 comments
Assignees
Labels
ADL Applies to Alder Lake platform bug Something isn't working as expected Intel Linux Daily tests This issue can be found in internal Linux daily tests P2 Critical bugs or normal features TGL Applies to Tiger Lake won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc)

Comments

@XiaoyunWu6666
Copy link
Contributor

XiaoyunWu6666 commented Jun 11, 2021

@marc-hb EDIT: after making testing more deterministic in thesofproject/sof-test#710, now reproduced 100% on BDW and BYT, see any recent test results like for instance this one: https://sof-ci.01.org/softestpr/PR720/build763/devicetest/?model=BYT_MB_NOCODEC&testcase=check-sof-logger

Describe the bug
inner daily 4532 ; model=TGLU_RVP_HDA (and many others, see below); testcase=check-sof-logger

To Reproduce
TPLG=sof-hda-generic-4ch-kwd.tplg ~/sof-test/test-case/check-sof-logger.sh

Reproduction Rate
100%

Environment
Kernel Branch: topic/sof-dev
Kernel Commit: 57440285
SOF Branch: main
SOF Commit: 73df644
Topology:sof-hda-generic-4ch-kwd.tplg
Platform: TGLU_RVP_HDA

Screenshots or console output

2021-06-10 21:24:17 UTC N kill 4320 # DMA trace
2021-06-10 21:24:17 UTC [REMOTE_ERROR] Initial ERROR FW ABI message not found in /home/ubuntu/sof-test/logs/check-sof-logger/2021-06-10-21:24:15-18859/logger.data.txt


2021-06-10 21:24:17 UTC N Log data BEG::
 TIMESTAMP      (us)           DELTA  C# COMPONENT          LOCATION                      CONTENT	ktime=102.780s  @  2021-06-10 21:24:15 +0000 UTC
2021-06-10 21:24:17 UTC N ::END data


2021-06-10 21:24:17 UTC N Log etrace BEG::
 TIMESTAMP      (us)           DELTA  C# COMPONENT          LOCATION                      CONTENT	ktime=104.787s  @  2021-06-10 21:24:17 +0000 UTC
[     9558057.954] (           0.000) c0 dma-trace             src/trace/dma-trace.c:343  ERROR FW ABI 0x3012001 DBG ABI 0x5003000 tag v1.8-rc1-184-g73df64444b75 src hash 0xb1c22c73 (ldc hash 0xb1c22c73)
Skipped 8144 bytes after the last statement.
2021-06-10 21:24:17 UTC N ::END etrace


2021-06-10 21:24:17 UTC N Log error BEG::
2021-06-10 21:24:17 UTC N ::END error


2021-06-10 21:24:17 UTC N Log etrace_stderr BEG::
2021-06-10 21:24:17 UTC N ::END etrace_stderr
2021-06-10 21:24:17 UTC [REMOTE_ERROR] Initial ERROR FW ABI message not found in /home/ubuntu/sof-test/logs/check-sof-logger/2021-06-10-21:24:15-18859/logger.data.txt

https://thesofproject.github.io/latest/developer_guides/firmware/drivers/dma/intel/hda-dma.html

@XiaoyunWu6666 XiaoyunWu6666 added bug Something isn't working as expected Intel Linux Daily tests This issue can be found in internal Linux daily tests labels Jun 11, 2021
@lgirdwood
Copy link
Member

@XiaoyunWu6666 I assume you are using latest logger (built using same commit ID as FW) ?
@marc-hb assigning, but please re-assign if it turns out not to be a logger update issue.

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 11, 2021

I just rewrote that test in thesofproject/sof-test#666, before that it was not checking anything. So it is not obvious whether the lack of DMA trace is a recent regression (for instance caused by recent #4327), or it has always been a problem and we never noticed (green failure...)

@marc-hb marc-hb changed the title Initial ERROR FW ABI message not found in logger.data.txt on TGLU_RVP_HDA Initial ERROR FW ABI message not found in DMA logger.data.txt on TGLU_RVP_HDA Jun 11, 2021
@marc-hb marc-hb changed the title Initial ERROR FW ABI message not found in DMA logger.data.txt on TGLU_RVP_HDA Initial FW ABI banner message not found in DMA logger.data.txt on TGLU_RVP_HDA Jun 12, 2021
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 12, 2021

I logged into the storage server to explore the generic_test results and I found that the DMA trace on TGLU_RVP_HDA has been empty 20% of the time; 50 times out of 250 runs. So this is NOT a recent regression and there is no reason to revert the #4327 fixes.

There is a reason why thesofproject/sof-test#666 was called "rewrite check-sof-logger so it can actually find (existing) bugs". Don't shoot the messenger.

The DMA trace on TGLU_RVP_HDA has more specifically been empty in these test runs:

wc -l */TGLU_RVP_HDA/check-sof-logger/logger.data.log | sed -e 's#/# #g' | awk '$1 == 1 { print $2 }' | sort -u -n | tr '\n' ' '

706 715 717 722 723 732 762 779 788 789 800 802 812 817 829 853 863 871 881 990 995 1022 1139 1288 1293 1305 1326 1429 1720 1897 1964 1995 2284 2290 2308 2320 2377 2491 2596 2810 3961 4081 4137 4171 4191 4276 4290 4346 4359 4398 4408

706 is the oldest TGLU_RVP_HDA failure and it's dated 2020-11-09 21:27:08 UTC

Overall 706 is the 3rd oldest record I found so I think this issue is older than these records - possibly been here since TGLU_RVP_HDA existed?

In fact DMA traces from the check-sof-logger have been randomly empty in many different TGL product (not just TGLU_RVP_HDA):

173 174 178 182 186 190 194 198 203 207 211 215 219 223 227 231 235 239 245 249 256 267 272 276 297 331 332 338 340 343 347 348 351 354 356 357 358 362 370 374 377 382 385 388 391 394 402 406 407 408 412 415 421 431 432 437 438 439 440 441 444 446 447 451 454 457 460 464 473 482 489 495 498 504 507 513 519 524 527 531 538 548 558 564 566 567 573 590 594 606 607 610 613 620 627 628 633 640 643 675 683 706 715 717 722 723 727 732 750 755 762 779 788 789 800 802 812 817 822 828 829 843 850 853 863 871 881 886 891 912 990 995 1022 1059 1082 1118 1139 1145 1288 1293 1305 1326 1351 1361 1366 1391 1397 1408 1429 1441 1456 1510 1539 1547 1568 1616 1668 1679 1720 1734 1735 1807 1828 1868 1893 1897 1926 1933 1937 1950 1964 1995 2020 2026 2050 2081 2096 2118 2131 2150 2154 2164 2171 2177 2190 2220 2256 2275 2284 2290 2302 2305 2308 2320 2357 2367 2377 2491 2596 2810 3610 3664 3685 3688 3711 3961 4081 4137 4171 4191 4276 4290 4346 4359 4398 4408

Run 173 (TGL_RVP_HDA) is dated 2020-07-23

For some unknown reason it's been rare with TGLU_RVP_SDW 5 runs out of 2628! (runs: 788 1868 1893 2256 3664) and rare with TGL_RVP_SDW too: 1 run out of 124, (run 447)

While at it I checked other platforms and the most numerous failures happened (by far) on BYT_MB_NOCODEC (but only 15% of the time)

awk '{print $3}' all_empty_DMA_traces | sort | uniq -c | sort -nr

553/3349 BYT_MB_NOCODEC
  89/116 TGL_RVP_HDA
  51/251 TGLU_RVP_HDA
   48/90 TGLU_SKU09FB_HDA
  40/173 APL_UP2_PCM512X
 33/3481 GLK_BOB_DA7219
   29/61 TGL_SHURI_HDA
 25/3396 BDW_WSB_RT286
   16/31 TGLU_SHUR_HDA
  16/106 BYT_MB_RT5682
 16/2177 BSW_CYN_MAX98090
  8/2287 TGLH_RVP_HDA
  5/2628 TGLU_RVP_SDW
    5/12 ADLP_RVP_SDW
  4/2910 CML_HEL_RT5682
  3/3452 APL_UP2_NOCODEC
  3/3323 APL_UP2_HDA
    2/42 TGL_RVP_NOCODEC
  2/2338 ICL_RVP_HDA
  2/2782 CML_SKU0955_HDA
  2/2813 CML_RVP_SDW
   1/124 TGL_RVP_SDW

We noticed this issue on BYT (and others) a long time ago, see "byt: trace out of sync" email thread and this is why I made this "DMA nudge" test change to hide it: thesofproject/sof-test#663

However this "DMA nudge" applies only to generic log collection at the end of every "normal" test, it does not take effect for check-sof-logger that is specifically invoking sof-logger.

@marc-hb marc-hb changed the title Initial FW ABI banner message not found in DMA logger.data.txt on TGLU_RVP_HDA Empty DMA trace Jun 12, 2021
@marc-hb marc-hb changed the title Empty DMA trace Empty DMA trace - Initial FW ABI banner message not found in DMA logger.data.txt Jun 12, 2021
@marc-hb marc-hb changed the title Empty DMA trace - Initial FW ABI banner message not found in DMA logger.data.txt Empty DMA trace - Initial FW ABI banner message not found in logger.data.txt Jun 12, 2021
@slawblauciak slawblauciak added the P1 Blocker bugs or important features label Jun 15, 2021
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 15, 2021

While working on some Zephyr issues I discovered that ipc_send_queued_msg() is run like this:

enum task_state task_main_primary_core(void *data)
{
	/* main audio processing loop */
	while (1) {
		/* sleep until next IPC or DMA */
		wait_for_interrupt(0);

		if (!ipc->pm_prepare_D3)
			ipc_send_queued_msg();

	}
}

This seems strange to me: could this cause some IPCs to be delayed for a very long time? I'm not considering Zephyr in this bug. See #4356

Cc: @lgirdwood , @lyakh , @kv2019i , @keyonjie , @plbossart

@lgirdwood
Copy link
Member

@marc-hb the LL scheduler on SOF will tick so this is not an issue, however we run Zephyr in tickless mode so all IPC work should be done in IPC threads.

  1. IPC command from host - we wake our IPC command thread and do the command from host. This is preemptable.
  2. IPC notifications to host - we wake our IPC notification thread and sent the message to host. This is preemptable.

e.g something like the below would work for 1 & 2. @lyakh may already have some example code.

while (1) {
    if (ipc_is pending())
        do_ipc()
    else
        sleep();
}

@marc-hb marc-hb changed the title Empty DMA trace - Initial FW ABI banner message not found in logger.data.txt Empty or stuck DMA trace - Initial FW ABI banner message not found in logger.data.txt Jun 18, 2021
marc-hb added a commit to marc-hb/sof-test that referenced this issue Jun 18, 2021
See bug thesofproject/sof#4333

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 18, 2021

Some interesting experiments and results in thesofproject/sof-test#710

  • Reloading the drivers (which reboots the DSP) seems to reproduce 100% on BDW and BYT
  • As has been empirically observed by many people for a very long time, the DMA trace is not empty or lost, it's only stuck. A simple "nudge (log level change, aplay,... ) gets it finally out.

marc-hb added a commit to marc-hb/sof-test that referenced this issue Jun 19, 2021
See bug thesofproject/sof#4333

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to thesofproject/sof-test that referenced this issue Jun 25, 2021
See bug thesofproject/sof#4333

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 25, 2021

Workaround 710 has been merged, please link to test results if you see this again.

710 is a workaround, not a fix.

@marc-hb marc-hb added P2 Critical bugs or normal features and removed P1 Blocker bugs or important features labels Jun 28, 2021
@XiaoyunWu6666
Copy link
Contributor Author

XiaoyunWu6666 commented Jun 28, 2021

Workaround 710 has been merged, please link to test results if you see this again.

710 is a workaround, not a fix.

hi @marc-hb I found a sof-logger related failure has happened on ADLP_RVP_HDA since 3 days ago.
inner 4883?model=ADLP_RVP_HDA&testcase=check-sof-logger
Is it related to 710 and this issue? Thanks

@XiaoyunWu6666 XiaoyunWu6666 changed the title Empty or stuck DMA trace - Initial FW ABI banner message not found in logger.data.txt [BUG]Empty or stuck DMA trace - Initial FW ABI banner message not found in logger.data.txt Jun 28, 2021
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 28, 2021

4883?model=ADLP_RVP_HDA&testcase=check-sof-logger Is it related to 710 and this issue? Thanks

No: there was topology issue in that run and all tests failed.

@XiaoyunWu6666
Copy link
Contributor Author

@marc-hb Sorry for giving the wrong info. Please refer to this one
inner daily-ADL 4903?model=ADLP_RVP_SDW&testcase=check-sof-logger

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 9, 2022

check-sof-logger has been failing regularly again. I submitted thesofproject/sof-test#876 to "nudge" the trace again and stop the noise in test results.

Recent failures with an empty DMA trace

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 11, 2022

Happened again in daily run 10879 : the last daily run before workaround thesofproject/sof-test#876 was merged.

Happened on ADLP_GMB_I2S_ZEPHYR, TGLU_RVP_NOCODEC and ADLP_RVP_NOCODEC and no other product.

Start Time: 2022-03-09 22:27:42 UTC
End Time: 2022-03-10 02:31:16 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 92d71723d47c
SOF Branch: main
SOF Commit: b9d971920e30
Zephyr Commit: adc901aa6a39

@ujfalusi
Copy link
Contributor

@marc-hb, I came to a conclusion that this is a feature, not a bug. Here is why:
The kernel will reset it's host_offset variable when the debugfs/sof/trace file is closed.
When the trace file is opened, it is opened with offset 0.
if there were no trace messages after the file was closed then 0 - 0 = 0, the kernel will copy 0 bytes (from offset 0) to user space.

This is implemented like this to avoid duplicated traces in the sof-logger since what the sof-logger does is:
---- Re-opening trace input file; device suspend? -----
If we would not have reset the host_offset in the kernel then you would have the duplicated trace copied from 0 to previous offset.

Since the kernel has no means to know that the file is re-opened by the sof-logger or the sof-logger was stopped and later started, the kernel always resets the host_offset on file release.
This filters out the duplication in case of re-open, but it will make the trace empty of the logger was terminated and then started again.

The only way this might be solved is to add write support for the trace file and any write would reset the host_offset. sof-logger would write there before re-open to avoid duplication, but eventually when you will have trace data coming in, you will have the trace from offset 0 printed out (which includes duplicated trace).

Or we can add new variable to track the read/write pointer in the host buffer, but we still need a way to let the trace kernel code to know that sof-logger is re-opening or it has been killed and then started again.

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 15, 2022

This filters out the duplication in case of re-open, but it will make the trace empty of the logger was terminated and then started again.

Thanks for the complicated explanation but the check-sof-logger test is a fairly simple test that just restarts the firmware and tries to read the DMA trace only once. The sof-logger process is never restarted and the message you quoted is never seen in that particular test whether it passes or fail. The DMA trace should never be empty in that super simple test case. It's really testing the most basic "does it work at all" feature, not any sof-logger restart or any other advanced use case. This test barely looks at what was logged. If we couldn't reliably test something as basic and as simple as "the logs are not completely empty" then there would probably be a more serious design issue than the one you describe.

It is also a very deterministic test (or at least should be) yet it has been failing semi-randomly again for the last few weeks after being very reliable for months.

We can make the test even simpler and even more deterministic if you have any specific suggestion.

So your explanation looks very interesting and worth being saved in some README or source code comments, not buried and lost in some obscure github issue, but I doubt it's relevant to the super basic failures we are seeing right now.

Did you try to run that test?

@ujfalusi
Copy link
Contributor

@marc-hb, yes I have been running that test on my local machines but can not reproduce it.
#3516 might give some hints on the trace position, but to be able to trace it better we would also need the diff which I added in the description.

@ujfalusi
Copy link
Contributor

@marc-hb, I found a local machine which I can reproduce this and the prints were super helpful!

thesofproject/linux#3523 should fix this for good.

Fingers crossed.

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 16, 2022

Daily run 10920 (2022-03-10) was run short after the "trace nudge" was restored in sof-test (that's one week ago, sorry for the delay). The DMA trace got stuck on 2 platforms and it the nudge was enough to make the test pass only on one:

  • 10920?model=TGLH_SKU0A70_HDA&testcase=check-sof-logger

    The nudge made the test pass but the DMA is very short, only 11 lines. It look incomplete but that was enough for the test to pass

  • 10920?model=ADLP_GMB_I2S_ZEPHYR&testcase=check-sof-logger

    The nudge did not help, DMA trace still empty after it.

Start Time: 2022-03-10 22:27:34 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 43946399bb74
SOF Branch: main
SOF Commit: 7a847a5bd820
Zephyr Commit: e60a4afcdfda

@ujfalusi
Copy link
Contributor

@marc-hb, on my machine the bootup is pretty silent (trace: new host_offset: 0x80):

2022-03-17 07:48:15 UTC [INFO] Log file logger.data.txt BEG::
 TIMESTAMP      (us)           DELTA  C# COMPONENT          LOCATION                      CONTENT       ktime=1961.905s  @  2022-03-17 09:48:13 +0200 EET
[  3914022702.908] (           0.000) c0 dma-trace             src/trace/dma-trace.c:341  INFO DMA: FW ABI 0x3014001 DBG ABI 0x5003000 tags SOF:v2.0-rc1-508-gb746574370bf-dirty src hash 0xce57a501 (ldc hash 0xce57a501) (Peter)
[         162.187] (         162.188) c0 ll-schedule        ./schedule/ll_schedule.c:460  INFO task add 0x9e0e2230 dma-trace-task <2b972272-c5b1-4b7e-926f-0fc5cb4c4690>
[         181.979] (          19.792) c0 ll-schedule        ./schedule/ll_schedule.c:462  INFO task params pri 4 flags 0 start 500000 period 500000
[         209.792] (          27.812) c0 ll-schedule        ./schedule/ll_schedule.c:363  INFO new added task->start 3765187 at 3752514
2022-03-17 07:48:15 UTC [INFO] ::END log file logger.data.txt

Self compiled FW for dtrace position tracking.

I see now, the nudge is local to check-sof-logger.sh. All it does is to force the DSP to boot up (because the trace will wake up the DSP if you send a filter update, we have discussed this already).
The odds that this will work is probably higher than on module loading, but the very same race is there for the first position update. What might be different is that the filter update is generating more trace and that will result another position update, which will be handled -> info is relayed to sof-logger.
And yes, the trace can be incomplete, I think on HDA we do 0x20 bursts/alignment for the transfers, so if we have <32 bytes in firmware buffer, it is going to stay there.

fwiw, thesofproject/linux#3523 is updated.

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 18, 2022

@marc-hb, on my machine the bootup is pretty silent

Switch everything to the shared memory etrace using CONFIG_TRACEM and you will see that's incomplete.

And yes, the trace can be incomplete,

OK :-)

I see now, the nudge is local to check-sof-logger.sh.

It's used in other tests but in other tests it makes no PASS/FAIL difference.

All it does is to force the DSP to boot up (because the trace will wake up the DSP if you send a filter update, we have discussed this already).

The test is currently waiting for the DSP to go to sleep. We can remove that to make it simpler, it should be a one-line change. What do you think?

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 18, 2022

The DMA trace got stuck on 2 platforms and it the nudge was enough to make the test pass only on one:

Similar results in daily run 11164: empty on 3 systems ADLP_GMB_I2S_ZEPHYR, ICL_RVP_HDA and ADLP_RVP_NOCODEC, test "rescued" by the nudge except on ADLP_GMB_I2S_ZEPHYR

Start Time: 2022-03-17 21:28:49 UTC

@ujfalusi
Copy link
Contributor

It would be interesting to have thesofproject/linux#3523 merged and observe the daily results for couple of days.

@ujfalusi
Copy link
Contributor

ujfalusi commented Mar 18, 2022

@marc-hb, it looks like we have two types of missing dtrace data:

  1. the host_offset update comes too early and it is ignored by the kernel
  1. No read is issued from user space to debugfs/sof/trace file for some unknown reason:
[ 1324.554607] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_init_trace: dma_trace_pages: 16
[ 1324.554710] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_enable_trace: stream_tag: 0
[ 1325.055467] sof-audio-acpi-intel-byt 80860F28:00: trace: new host_offset: 0xffc
[ 1329.132220] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[ 1329.132240] sof-audio-acpi-intel-byt 80860F28:00: trace: copy to user from offset : 0x0, count: 0xffc
[ 1329.140847] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 4092)
[ 1331.129464] sof-audio-acpi-intel-byt 80860F28:00: trace: nothing to copy
...
[ 1413.306707] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_init_trace: dma_trace_pages: 16
[ 1413.306794] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_enable_trace: stream_tag: 0
[ 1413.807551] sof-audio-acpi-intel-byt 80860F28:00: trace: new host_offset: 0xffc
[ 1417.898211] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[ 1417.898227] sof-audio-acpi-intel-byt 80860F28:00: trace: copy to user from offset : 0x0, count: 0xffc
[ 1417.906805] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 4092)
[ 1419.894534] sof-audio-acpi-intel-byt 80860F28:00: trace: nothing to copy

[ 1426.241505] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_init_trace: dma_trace_pages: 16
[ 1426.241580] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_enable_trace: stream_tag: 0
[ 1426.742391] sof-audio-acpi-intel-byt 80860F28:00: trace: new host_offset: 0xffc

check-sof-logger.sh reports failure for the last iteration (banner not found in logger.data.txt), but if we run it (sof-logger directly) again manually we get the data out just fine:

[ 1715.768315] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[ 1715.768331] sof-audio-acpi-intel-byt 80860F28:00: trace: copy to user from offset : 0x0, count: 0xffc
[ 1715.776976] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 4092)

Binaries from thesofproject/linux#3526 can be used with identical prints if there is an interest.

@ujfalusi
Copy link
Contributor

@marc-hb, pushed one more commit to thesofproject/linux#3526 to track the trace file open and release as well.
fwiw, on my machine it looks like this:

# dmesg | grep "trace\: "
[  518.703117] sof-audio-pci-intel-tgl 0000:00:1f.3: snd_sof_init_trace: dma_trace_pages: 16
[  518.703215] sof-audio-pci-intel-tgl 0000:00:1f.3: snd_sof_enable_trace: stream_tag: 1
[  518.704245] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: new host_offset: 0x80
[  523.594868] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_open: ENTER
[  523.595031] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[  523.595038] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: copy to user from offset : 0x0, count: 0x80
[  523.595112] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 128)
[  523.595115] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: nothing to copy
[  523.595172] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_open: ENTER
[  523.595178] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_release: ENTER
[  523.595180] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: new host_offset: 0x0
[  523.595188] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[  525.594176] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: nothing to copy
[  525.594285] sof-audio-pci-intel-tgl 0000:00:1f.3: trace: sof_dfsentry_trace_release: ENTER

@ujfalusi
Copy link
Contributor

@marc-hb, I have managed to reproduce the issue (on jf-byt-mb-nocodec-1) with the kernel from thesofproject/linux#3526 :

2022-03-23 07:29:59 UTC Sub-Test: [INFO] Waiting a few seconds for the DSP to fully boot and then suspend
2022-03-23 07:30:03 UTC Sub-Test: [INFO] Trying to get the DMA trace log with background sof-logger ...
2022-03-23 07:30:03 UTC Sub-Test: [COMMAND] sudo /usr/local/bin/sof-logger  -t -f 3 -l  /etc/sof/sof-byt.ldc  -o  /home/ubuntu/sof-test/logs/check-sof-logger/2022-03-23-07:29:54-4132/logger.data.txt  2>  /home/ubuntu/sof-test/logs/check-sof-logger/2022-03-23-07:29:54-4132/logger.error.txt  &
2022-03-23 07:30:11 UTC Sub-Test: [INFO] Trying to get the etrace mailbox ...
2022-03-23 07:30:11 UTC Sub-Test: [COMMAND] sudo /usr/local/bin/sof-logger    -f 3 -l  /etc/sof/sof-byt.ldc  2>  /home/ubuntu/sof-test/logs/check-sof-logger/2022-03-23-07:29:54-4132/logger.etrace_stderr.txt  -o  /home/ubuntu/sof-test/logs/check-sof-logger/2022-03-23-07:29:54-4132/logger.etrace.txt

2022-03-23 07:30:11 UTC Sub-Test: [ERROR] Log header not found in /home/ubuntu/sof-test/logs/check-sof-logger/2022-03-23-07:29:54-4132/logger.data.txt

2022-03-23 07:30:11 UTC Sub-Test: [INFO] Log file logger.data.txt BEG::
2022-03-23 07:30:11 UTC Sub-Test: [INFO] ::END log file logger.data.txt


2022-03-23 07:30:11 UTC Sub-Test: [INFO] Log file logger.etrace.txt BEG::
 TIMESTAMP      (us)           DELTA  C# COMPONENT          LOCATION                      CONTENT       ktime=300.868s  @  2022-03-23 07:30:11 +0000 UTC
[   288253210.160] (           0.000) c0 dma-trace             src/trace/dma-trace.c:422  INFO SHM: FW ABI 0x3014000 DBG ABI 0x5003000 tag v2.0 src hash 0x927ffc8c (ldc hash 0x927ffc8c)
Skipped 848 bytes after the last statement.
2022-03-23 07:30:11 UTC Sub-Test: [INFO] ::END log file logger.etrace.txt


2022-03-23 07:30:11 UTC Sub-Test: [INFO] Log file logger.error.txt BEG::
GOOD: logger.error.txt was empty, no stderr output from that sof-logger instance
2022-03-23 07:30:11 UTC Sub-Test: [INFO] ::END log file logger.error.txt


2022-03-23 07:30:11 UTC Sub-Test: [INFO] Log file logger.etrace_stderr.txt BEG::
GOOD: logger.etrace_stderr.txt was empty, no stderr output from that sof-logger instance
2022-03-23 07:30:11 UTC Sub-Test: [INFO] ::END log file logger.etrace_stderr.txt

2022-03-23 07:30:11 UTC Sub-Test: [ERROR] Log header not found in /home/ubuntu/sof-test/logs/check-sof-logger/2022-03-23-07:29:54-4132/logger.data.txt
2022-03-23 07:30:11 UTC Sub-Test: [ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2022-03-23 07:30:11 UTC Sub-Test: [ERROR]  print_logs_exit()  @  /home/ubuntu/sof-test/test-case/check-sof-logger.sh
2022-03-23 07:30:11 UTC Sub-Test: [ERROR]  main()  @  /home/ubuntu/sof-test/test-case/check-sof-logger.sh:201
2022-03-23 07:30:11 UTC Sub-Test: [ERROR]  main()  @  /home/ubuntu/sof-test/test-case/check-sof-logger.sh:231
2022-03-23 07:30:12 UTC Sub-Test: [INFO] Test Result: FAIL!

The dmesg have this to say about the runs:

# successful run
[  186.496554] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_init_trace: dma_trace_pages: 16
[  186.496693] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_enable_trace: stream_tag: 0
[  186.997608] sof-audio-acpi-intel-byt 80860F28:00: trace: new host_offset: 0xff4
[  191.072270] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_open: ENTER
[  191.072813] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[  191.072826] sof-audio-acpi-intel-byt 80860F28:00: trace: copy to user from offset : 0x0, count: 0xff4
[  191.081454] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 4084)
[  193.070735] sof-audio-acpi-intel-byt 80860F28:00: trace: nothing to copy
[  193.071211] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_release: ENTER
...
# successful runs
...
# successful run
[  265.152479] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_init_trace: dma_trace_pages: 16
[  265.152539] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_enable_trace: stream_tag: 0
[  265.653319] sof-audio-acpi-intel-byt 80860F28:00: trace: new host_offset: 0xff4
[  277.608813] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_open: ENTER
[  277.609471] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[  277.609484] sof-audio-acpi-intel-byt 80860F28:00: trace: copy to user from offset : 0x0, count: 0xff4
[  277.619388] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 4084)
[  279.607347] sof-audio-acpi-intel-byt 80860F28:00: trace: nothing to copy
[  279.607573] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_release: ENTER
...
# The last, failed run:
[  288.075033] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_init_trace: dma_trace_pages: 16
[  288.075144] sof-audio-acpi-intel-byt 80860F28:00: snd_sof_enable_trace: stream_tag: 0
[  288.576010] sof-audio-acpi-intel-byt 80860F28:00: trace: new host_offset: 0xff4

# running sof-logger manually
[  896.931802] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_open: ENTER
[  896.932272] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 0)
[  896.932289] sof-audio-acpi-intel-byt 80860F28:00: trace: copy to user from offset : 0x0, count: 0xff4
[  896.941216] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_read: ENTER (count: 4096, lpos: 4084)
[  900.897926] sof-audio-acpi-intel-byt 80860F28:00: trace: nothing to copy
[  900.898275] sof-audio-acpi-intel-byt 80860F28:00: trace: sof_dfsentry_trace_release: ENTER

The reason for the Log header not found in logger.data.txt failure is that sof-logger does not even open the trace file.

@kv2019i
Copy link
Collaborator

kv2019i commented Apr 29, 2022

In Intel daily test run 12177 I see a clear IPC failure in setting up the logger (ADLP_GMB_I2S_ZEPHYR):

[ 89.504320] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x90030000: GLB_TRACE_MSG: DMA_PARAMS_EXT
[ 90.007325] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx timed out for 0x90030000 (msg/reply size: 80/12)
[ 90.007363] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: preventing DSP entering D3 state to preserve context

This seems quite different from the cases @ujfalusi you've last seen on BYT... right?

@ujfalusi
Copy link
Contributor

@kv2019i, I think it is the same issue as #5733, it is under investigation

marc-hb added a commit to marc-hb/sof-test that referenced this issue Jul 14, 2022
I added this wait for D3 in commit 6a8871f ("check-sof-logger:
reload drivers first") that made the test more deterministic but I'm not
100% sure why it's now waiting for D3. Probably to minimize changes
because before the addition of reload_drivers() the DSP was most likely
in D3.

While I recently made cavstool compatible with D3, the protocol used by
cavstool is still different from the other traces: it won't print ANY
log while the DSP is off. So change the test behavior to grab logs
immediately after reloading the drivers / rebooting the DSP.

This may also help with
thesofproject/sof#4333,
thesofproject/linux#3448
and others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/sof-test that referenced this issue Jul 15, 2022
I added this wait for D3 in commit 6a8871f ("check-sof-logger:
reload drivers first") that made the test more deterministic but I'm not
100% sure why it's now waiting for D3. Probably to minimize changes
because before the addition of reload_drivers() the DSP was most likely
in D3.

While I recently made cavstool compatible with D3, the protocol used by
cavstool is still different from the other traces: it won't print ANY
log while the DSP is off. So change the test behavior to grab logs
immediately after reloading the drivers / rebooting the DSP.

This may also help with
thesofproject/sof#4333,
thesofproject/linux#3448
and others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to thesofproject/sof-test that referenced this issue Jul 19, 2022
I added this wait for D3 in commit 6a8871f ("check-sof-logger:
reload drivers first") that made the test more deterministic but I'm not
100% sure why it's now waiting for D3. Probably to minimize changes
because before the addition of reload_drivers() the DSP was most likely
in D3.

While I recently made cavstool compatible with D3, the protocol used by
cavstool is still different from the other traces: it won't print ANY
log while the DSP is off. So change the test behavior to grab logs
immediately after reloading the drivers / rebooting the DSP.

This may also help with
thesofproject/sof#4333,
thesofproject/linux#3448
and others linked from there.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@mengdonglin
Copy link
Collaborator

Close as this issue cannot be reproduced in recent daily testing.

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 6, 2022

It can be reproduced, there's a workaround to ignore the error.

@marc-hb marc-hb added the won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) label Sep 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADL Applies to Alder Lake platform bug Something isn't working as expected Intel Linux Daily tests This issue can be found in internal Linux daily tests P2 Critical bugs or normal features TGL Applies to Tiger Lake won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc)
Projects
None yet
Development

No branches or pull requests

8 participants