
memory: Use cached memory for frequently used objects #527

Closed

Conversation

libinyang
Contributor

This patch allocates the frequently used normal objects with cached
memory.

Signed-off-by: Libin Yang <libin.yang@intel.com>

@libinyang
Contributor Author

@lgirdwood
@mmaka1
@tlauda

Hi all,

I submitted this patch. It can fix the UP2 initialization crash reported in issue #443.

While debugging this issue, I found that it happens after a dcache flush, so I tried invalidating the dcache every time I allocate memory, and the crash disappeared.

However, some of the allocated objects are used frequently by the FW, so I changed them to L1 cacheable memory.
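
For illustration, here is a minimal sketch of the kind of change described above, using the rzalloc()/RZONE_* allocator API quoted later in this thread; the object is hypothetical and this is not the exact diff:

	/* before: the object is allocated from the uncached mapping */
	obj = rzalloc(RZONE_RUNTIME | RZONE_FLAG_UNCACHED, SOF_MEM_CAPS_RAM,
		      sizeof(*obj));

	/* after: the frequently accessed object goes through the L1 dcache */
	obj = rzalloc(RZONE_RUNTIME, SOF_MEM_CAPS_RAM, sizeof(*obj));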

@lgirdwood
Member

@libinyang sorry I'm not following how changing memory attributes on the DSP will fix the stream tag on the host for #443 ? Yes, the stream tag should be validated on the DSP (but this PR does not even do that).

Looking at the driver hda-trace.c

	*tag = sdev->hda->dtrace_stream->hstream.stream_tag;

	/*
	 * initialize capture stream, set BDL address and return corresponding
	 * stream tag which will be sent to the firmware by IPC message.
	 */
	return hda_dsp_trace_prepare(sdev);

Surely the stream tag will be fixed here ?

@libinyang
Contributor Author

@libinyang sorry I'm not following how changing memory attributes on the DSP will fix the stream tag on the host for #443 ? Yes, the stream tag should be validated on the DSP (but this PR does not even do that).

Looking at the driver hda-trace.c

	*tag = sdev->hda->dtrace_stream->hstream.stream_tag;

	/*
	 * initialize capture stream, set BDL address and return corresponding
	 * stream tag which will be sent to the firmware by IPC message.
	 */
	return hda_dsp_trace_prepare(sdev);

Surely the stream tag will be fixed here ?

@lgirdwood
Actually, there are 2 different bugs in #443. This patch is for the "on APL it is a topology loading issue" part. The dmesg shows:
[ 5.736618] sof-audio sof-audio: tplg: pipeline id 8 comp 47 scheduling comp id 46
[ 5.736622] sof-audio sof-audio: pipeline PIPELINE.8.SSP3.IN: deadline 1000 pri 0 mips 5000 core 0 frames 48
[ 5.736637] sof-audio sof-audio: ipc tx: 0x30100000
[ 5.736743] sof-audio sof-audio: error : DSP panic!
[ 5.736752] sof-audio sof-audio: panic: dsp_oops_offset 788480 offset 788480
[ 5.736756] sof-audio sof-audio: status: fw entered - code 00000005
[ 5.736778] sof-audio sof-audio: error: runtime exception
[ 5.736780] sof-audio sof-audio: error: trace point 00004000
[ 5.736782] sof-audio sof-audio: error: panic happen at :0
[ 5.736785] sof-audio sof-audio: error: DSP Firmware Oops
[ 5.736788] sof-audio sof-audio: error: Exception Cause: LoadProhibitedCause, A load referenced a page mapped with an attribute that does not permit loads
[ 5.736791] sof-audio sof-audio: EXCCAUSE 0x0000001c EXCVADDR 0xfffffff8 PS 0x00060020 SAR 0x0000001d
[ 5.736794] sof-audio sof-audio: EPC1 0xbe015a93 EPC2 0xbe0118f6 EPC3 0xbe013a48 EPC4 0x00000000
[ 5.736796] sof-audio sof-audio: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
[ 5.736799] sof-audio sof-audio: EPS2 0x00060720 EPS3 0x00060722 EPS4 0x00000000 EPS5 0x00000000
[ 5.736802] sof-audio sof-audio: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x000101d0 INTERRU 0x00000222
[ 5.736804] sof-audio sof-audio: stack dump from 0xbe07fd20
[ 5.736807] sof-audio sof-audio: 0xbe07fd20: 0xbe015a93 0x00060d30 0x0000001d 0x00000180
[ 5.736810] sof-audio sof-audio: 0xbe07fd24: 0xbe04db80 0x0000002f 0x9e050368 0x9e050368
[ 5.736812] sof-audio sof-audio: 0xbe07fd28: 0x0000001c 0xbe015f70 0xbe015a90 0xbe015aae
[ 5.736815] sof-audio sof-audio: 0xbe07fd2c: 0x00000002 0x00060d20 0xbe042048 0x20004eb8
[ 5.736818] sof-audio sof-audio: 0xbe07fd30: 0x00000000 0x001d72fc 0x00000000 0xbe05b400
[ 5.736821] sof-audio sof-audio: 0xbe07fd34: 0x00000008 0xbe07fdc0 0x07530000 0x00000000
[ 5.736824] sof-audio sof-audio: 0xbe07fd38: 0xffffffff 0xbe014117 0xbe0141ac 0xbe0141b9
[ 5.736826] sof-audio sof-audio: 0xbe07fd3c: 0x00000000 0x001d72fc 0x00000000 0x2000126c
[ 5.736829] sof-audio sof-audio: error: waking up any trace sleepers
[ 5.778036] random: crng init done
[ 5.778041] random: 7 urandom warning(s) missed due to ratelimiting
[ 6.048106] sof-audio sof-audio: error: ipc timed out for 0x30100000 size 0x30

@libinyang
Contributor Author

libinyang commented Nov 5, 2018

#443 is split into 2 bugs:
one is #443 for stream_tag,
the other is #539: DSP panic after system booted for topology load issue on up2

The root cause of #539 is that sometimes memory is modified randomly after a dcache flush.

This patch is for #539.

@libinyang
Contributor Author

I think I found the root cause of why the dcache flush randomly corrupts memory. I will make a patch for bug #539: DSP panic after system booted for topology load issue on up2.

But I do think this patch is useful to improve performance and should be merged. What do you guys think? Thanks ;-)

@lgirdwood
Member

@mmaka1 can you comment on @libinyang's PR for changing memory attributes?

@lgirdwood
Member

@lbetlej can you comment on this as @tlauda is OOO

This patch allocates the frequently used normal objects with cached
memory to improve the performance.

Signed-off-by: Libin Yang <libin.yang@intel.com>
@libinyang
Contributor Author

any comments?

@libinyang
Contributor Author

Added more objects to use cacheable memory to improve performance.

@lgirdwood
Member

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@libinyang
Contributor Author

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@lgirdwood
The patch doesn't aim to fix the crash issue; PR 541 fixes that.
I think the cache sync-up will be done by HW, won't it? If the HW can't handle the cache sync, I think we also need to add the uncached flag to other memory allocations.

@lgirdwood
Member

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@lgirdwood
The patch doesn't aim to fix the crash issue; PR 541 fixes that.
I think the cache sync-up will be done by HW, won't it? If the HW can't handle the cache sync, I think we also need to add the uncached flag to other memory allocations.

@libinyang ok, if this PR does not fix the issue what is it needed for ?

@libinyang
Contributor Author

libinyang commented Nov 9, 2018

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@lgirdwood
The patch doesn't aim to fix the crash issue; PR 541 fixes that.
I think the cache sync-up will be done by HW, won't it? If the HW can't handle the cache sync, I think we also need to add the uncached flag to other memory allocations.

@libinyang ok, if this PR does not fix the issue what is it needed for ?

@lgirdwood
Actually, this patch can fix some potential issues caused by the dcache. Since our FW doesn't allocate/free memory heavily, these issues are hard to catch and reproduce, but a stress test on memory allocation/free may trigger some weird bugs. I think such bugs would be general issues.

Let's look at the memory-free code: it doesn't handle cache flush/invalidate. Imagine we free a piece of memory that is cacheable, and later allocate uncacheable memory that overlaps the same lines we just freed. Some time later, the HW may decide to flush the dcache, and the newly allocated memory's contents will be changed randomly. Such bugs will be hard to debug, so let's prevent them in advance.

My suggestion is that within the same block of memory we always allocate with the same behavior, either always cacheable or always uncacheable. For frequently used objects, cacheable seems more reasonable. For multi-core, I think the HW should already handle the synchronization, but if QA can run a multi-core test, that would be helpful and more persuasive.

@mmaka1 What do you think? Please correct me if I missed something.

@lgirdwood
Member

@lgirdwood
Actually, this patch can fix some potential issues caused by the dcache. Since our FW doesn't allocate/free memory heavily, these issues are hard to catch and reproduce, but a stress test on memory allocation/free may trigger some weird bugs. I think such bugs would be general issues.

Let's look at the memory-free code: it doesn't handle cache flush/invalidate. Imagine we free a piece of memory that is cacheable, and later allocate uncacheable memory that overlaps the same lines we just freed. Some time later, the HW may decide to flush the dcache, and the newly allocated memory's contents will be changed randomly. Such bugs will be hard to debug, so let's prevent them in advance.

Ok, but the correct fix here is to invalidate and writeback the memory region on free() within the allocator.
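
A minimal sketch of that idea, assuming the firmware's dcache helpers; block_size() and free_block() are placeholders for the allocator's own bookkeeping:

	void rfree(void *ptr)
	{
		/* placeholder: however the allocator knows the block length */
		size_t size = block_size(ptr);

		/* write back dirty lines and drop them from the dcache so a
		 * later (possibly uncached) allocation of the same lines is
		 * not corrupted by a stale writeback
		 */
		dcache_writeback_invalidate_region(ptr, size);

		/* existing heap bookkeeping */
		free_block(ptr);
	}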

@libinyang
Contributor Author

Ok, but the correct fix here is to invalidate and writeback the memory region on free() within the allocator.

Yes, invalidating the cache in free() is a good choice. This makes sure the memory is safe, and it can also help handle DMA memory at the same time. My plan is to handle DMA memory with a flag and invalidate the cache on allocation, just like the Linux kernel does.

However, caching the frequently used objects will improve performance. I think we had better cache them all. What do you think?
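
A rough sketch of that plan; the wrapper name and the DMA capability flag below are illustrative assumptions, not the existing API:

	/* hypothetical helper: allocate a DMA-visible buffer and drop any
	 * stale cache lines so the DSP reads what the DMA engine writes
	 */
	static void *rzalloc_dma(size_t bytes)
	{
		void *ptr = rzalloc(RZONE_RUNTIME,
				    SOF_MEM_CAPS_RAM | SOF_MEM_CAPS_DMA, bytes);

		if (ptr)
			dcache_invalidate_region(ptr, bytes);

		return ptr;
	}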

@tlauda
Contributor

tlauda commented Nov 19, 2018

@libinyang Making those objects cacheable will break memory synchronization between cores. Of course, the best way would be to have a dedicated shared memory space, but right now that would require a huge effort to accomplish. There are simply too many objects that need to be shared in order to run a workload on the slave cores.
We could potentially make them all cacheable, but:

  1. Writeback and invalidate operations will pollute the code.
  2. I need data that proves the performance increase coming from such changes. From my experience, frequent cache operations on such small memory regions (<1 cache line) will increase the number of processor cycles executed.

Contributor

@tlauda left a comment

SMP will be broken.

@lgirdwood closed this Nov 19, 2018
@libinyang
Contributor Author

@libinyang Making those objects cacheable will break memory synchronization between cores. Of course, the best way would be to have a dedicated shared memory space, but right now that would require a huge effort to accomplish. There are simply too many objects that need to be shared in order to run a workload on the slave cores.
We could potentially make them all cacheable, but:

  1. Writeback and invalidate operations will pollute the code.
  2. I need data that proves the performance increase coming from such changes. From my experience, frequent cache operations on such small memory regions (<1 cache line) will increase the number of processor cycles executed.

@tlauda

Thanks for the explanation. My understanding is that the HW will handle such cache operations automatically, won't it? Do you have some test cases for SMP that reproduce such a synchronization issue? I would like to run a test and try to find the reason.

@tlauda
Contributor

tlauda commented Nov 20, 2018

@libinyang All cores share the same piece of SRAM, but have separate caches. HW won't handle such synchronization, so we need to make sure that coreX flushes its own data to SRAM in order for coreY to have it up to date.
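
For reference, a minimal sketch of that manual synchronization, assuming the firmware's dcache helpers and an illustrative shared object:

	/* core X publishes an update to a shared, cacheable object */
	shared->ready = 1;
	dcache_writeback_region(shared, sizeof(*shared));

	/* core Y must drop its stale copy before reading the object */
	dcache_invalidate_region(shared, sizeof(*shared));
	if (shared->ready)
		handle_update(shared);	/* placeholder */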

@libinyang
Contributor Author

@libinyang Making those objects cacheable will break memory synchronization between cores. Of course, the best way would be to have a dedicated shared memory space, but right now that would require a huge effort to accomplish. There are simply too many objects that need to be shared in order to run a workload on the slave cores.
We could potentially make them all cacheable, but:

  1. Writeback and invalidate operations will pollute the code.
  2. I need data that proves the performance increase coming from such changes. From my experience, frequent cache operations on such small memory regions (<1 cache line) will increase the number of processor cycles executed.

I went through our spec on cacheable memory and didn't find a requirement to manually sync between multiple cores. I only found that multiple microprocessors need to handle this manually. Maybe this is a special requirement of our HW/Xtensa?

Besides, I found that our code sometimes allocates memory as cached and sometimes not. For example, in ipc.c:

	icd = rzalloc(RZONE_RUNTIME | RZONE_FLAG_UNCACHED, SOF_MEM_CAPS_RAM,
		      sizeof(struct ipc_comp_dev));

	elem_array->elems = rzalloc(RZONE_RUNTIME, SOF_MEM_CAPS_RAM,
				    sizeof(struct dma_sg_elem) * ring->pages);

Is there any special reason?

@tlauda
Contributor

tlauda commented Nov 20, 2018

@libinyang Only structures that are used by the slave cores are allocated in uncached memory.
