
memory: Use cached memory for frequently used objects #527

Closed

Conversation

libinyang
Contributor

This patch allocates the frequently used normal objects with cached
memory.

Signed-off-by: Libin Yang <libin.yang@intel.com>

@libinyang
Contributor Author

@lgirdwood
@mmaka1
@tlauda

Hi all,

I submitted this patch. It can fix the UP2 initialization crash reported in issue #443.

While debugging this issue, I found that it happens after a dcache flush, so I tried invalidating the dcache every time I allocate memory, and the crash disappeared.

However, some of the allocated objects are used frequently by the FW, so I changed them to L1 cacheable memory.
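
For illustration, here is a minimal sketch of the kind of change described above, using the rzalloc()/RZONE_* allocator API quoted later in this thread; the object is hypothetical and this is not the exact diff:

	/* before: the object is allocated from the uncached mapping */
	obj = rzalloc(RZONE_RUNTIME | RZONE_FLAG_UNCACHED, SOF_MEM_CAPS_RAM,
		      sizeof(*obj));

	/* after: the frequently accessed object goes through the L1 dcache */
	obj = rzalloc(RZONE_RUNTIME, SOF_MEM_CAPS_RAM, sizeof(*obj));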

@lgirdwood
Member

@libinyang sorry I'm not following how changing memory attributes on the DSP will fix the stream tag on the host for #443 ? Yes, the stream tag should be validated on the DSP (but this PR does not even do that).

Looking at the driver hda-trace.c

	*tag = sdev->hda->dtrace_stream->hstream.stream_tag;

	/*
	 * initialize capture stream, set BDL address and return corresponding
	 * stream tag which will be sent to the firmware by IPC message.
	 */
	return hda_dsp_trace_prepare(sdev);

Surely the stream tag will be fixed here ?

@libinyang
Contributor Author

@libinyang sorry I'm not following how changing memory attributes on the DSP will fix the stream tag on the host for #443 ? Yes, the stream tag should be validated on the DSP (but this PR does not even do that).

Looking at the driver hda-trace.c

	*tag = sdev->hda->dtrace_stream->hstream.stream_tag;

	/*
	 * initialize capture stream, set BDL address and return corresponding
	 * stream tag which will be sent to the firmware by IPC message.
	 */
	return hda_dsp_trace_prepare(sdev);

Surely the stream tag will be fixed here ?

@lgirdwood
Actually, there are 2 different bugs in #443. This patch is for the "on APL it is a topology loading issue" part. The dmesg shows:
[ 5.736618] sof-audio sof-audio: tplg: pipeline id 8 comp 47 scheduling comp id 46
[ 5.736622] sof-audio sof-audio: pipeline PIPELINE.8.SSP3.IN: deadline 1000 pri 0 mips 5000 core 0 frames 48
[ 5.736637] sof-audio sof-audio: ipc tx: 0x30100000
[ 5.736743] sof-audio sof-audio: error : DSP panic!
[ 5.736752] sof-audio sof-audio: panic: dsp_oops_offset 788480 offset 788480
[ 5.736756] sof-audio sof-audio: status: fw entered - code 00000005
[ 5.736778] sof-audio sof-audio: error: runtime exception
[ 5.736780] sof-audio sof-audio: error: trace point 00004000
[ 5.736782] sof-audio sof-audio: error: panic happen at :0
[ 5.736785] sof-audio sof-audio: error: DSP Firmware Oops
[ 5.736788] sof-audio sof-audio: error: Exception Cause: LoadProhibitedCause, A load referenced a page mapped with an attribute that does not permit loads
[ 5.736791] sof-audio sof-audio: EXCCAUSE 0x0000001c EXCVADDR 0xfffffff8 PS 0x00060020 SAR 0x0000001d
[ 5.736794] sof-audio sof-audio: EPC1 0xbe015a93 EPC2 0xbe0118f6 EPC3 0xbe013a48 EPC4 0x00000000
[ 5.736796] sof-audio sof-audio: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
[ 5.736799] sof-audio sof-audio: EPS2 0x00060720 EPS3 0x00060722 EPS4 0x00000000 EPS5 0x00000000
[ 5.736802] sof-audio sof-audio: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x000101d0 INTERRU 0x00000222
[ 5.736804] sof-audio sof-audio: stack dump from 0xbe07fd20
[ 5.736807] sof-audio sof-audio: 0xbe07fd20: 0xbe015a93 0x00060d30 0x0000001d 0x00000180
[ 5.736810] sof-audio sof-audio: 0xbe07fd24: 0xbe04db80 0x0000002f 0x9e050368 0x9e050368
[ 5.736812] sof-audio sof-audio: 0xbe07fd28: 0x0000001c 0xbe015f70 0xbe015a90 0xbe015aae
[ 5.736815] sof-audio sof-audio: 0xbe07fd2c: 0x00000002 0x00060d20 0xbe042048 0x20004eb8
[ 5.736818] sof-audio sof-audio: 0xbe07fd30: 0x00000000 0x001d72fc 0x00000000 0xbe05b400
[ 5.736821] sof-audio sof-audio: 0xbe07fd34: 0x00000008 0xbe07fdc0 0x07530000 0x00000000
[ 5.736824] sof-audio sof-audio: 0xbe07fd38: 0xffffffff 0xbe014117 0xbe0141ac 0xbe0141b9
[ 5.736826] sof-audio sof-audio: 0xbe07fd3c: 0x00000000 0x001d72fc 0x00000000 0x2000126c
[ 5.736829] sof-audio sof-audio: error: waking up any trace sleepers
[ 5.778036] random: crng init done
[ 5.778041] random: 7 urandom warning(s) missed due to ratelimiting
[ 6.048106] sof-audio sof-audio: error: ipc timed out for 0x30100000 size 0x30

@libinyang
Contributor Author

libinyang commented Nov 5, 2018

#443 is split into 2 bugs:
one is #443 for stream_tag,
the other is #539: DSP panic after system booted for topology load issue on up2

The root cause of #539 is that sometimes memory is modified randomly after a dcache flush.

This patch is for #539.

@libinyang
Contributor Author

I think I found the root cause of why the dcache flush randomly corrupts memory. I will make a patch for bug #539: DSP panic after system booted for topology load issue on up2.

But I do think this patch is useful to improve performance and should be merged. What do you guys think? Thanks ;-)

@lgirdwood
Member

@mmaka1 can you comment on @libinyang's PR for changing memory attributes?

@lgirdwood
Member

@lbetlej can you comment on this as @tlauda is OOO

This patch allocates the frequently used normal objects with cached
memory to improve the performance.

Signed-off-by: Libin Yang <libin.yang@intel.com>
@libinyang
Contributor Author

any comments?

@libinyang
Contributor Author

Added more objects to use cacheable memory to improve performance.

@lgirdwood
Member

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@libinyang
Contributor Author

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@lgirdwood
The patch doesn't aim to fix the crash issue; PR 541 fixes that.
I think the cache sync-up will be done by HW, won't it? If the HW can't handle the cache sync, I think we also need to add the uncached flag to other memory allocations.

@lgirdwood
Member

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@lgirdwood
The patch doesn't aim to fix the crash issue; PR 541 fixes that.
I think the cache sync-up will be done by HW, won't it? If the HW can't handle the cache sync, I think we also need to add the uncached flag to other memory allocations.

@libinyang ok, if this PR does not fix the issue what is it needed for ?

@libinyang
Contributor Author

libinyang commented Nov 9, 2018

@libinyang I think we need the uncached flag so that these peripherals can be shared between multiple cores. Where is the crash location from the stack trace?

@lgirdwood
The patch doesn't aim to fix the crash issue; PR 541 fixes that.
I think the cache sync-up will be done by HW, won't it? If the HW can't handle the cache sync, I think we also need to add the uncached flag to other memory allocations.

@libinyang ok, if this PR does not fix the issue what is it needed for ?

@lgirdwood
Actually, this patch can fix some potential issues caused by the dcache. Since our FW doesn't allocate/free memory heavily, these issues are hard to catch and reproduce, but a stress test on memory allocation/free may trigger some weird bugs. I think such bugs would be general issues.

Let's look at the memory-free code: it doesn't handle cache flush/invalidate. Imagine we free a piece of memory that is cacheable, and later allocate uncacheable memory that overlaps the same lines we just freed. Some time later, the HW may decide to flush the dcache, and the newly allocated memory's contents will be changed randomly. Such bugs will be hard to debug, so let's prevent them in advance.

My suggestion is that within the same block of memory we always allocate with the same behavior, either always cacheable or always uncacheable. For frequently used objects, cacheable seems more reasonable. For multi-core, I think the HW should already handle the synchronization, but if QA can run a multi-core test, that would be helpful and more persuasive.

@mmaka1 What do you think? Please correct me if I missed something.

@lgirdwood
Member

@lgirdwood
Actually, this patch can fix some potential issues caused by the dcache. Since our FW doesn't allocate/free memory heavily, these issues are hard to catch and reproduce, but a stress test on memory allocation/free may trigger some weird bugs. I think such bugs would be general issues.

Let's look at the memory-free code: it doesn't handle cache flush/invalidate. Imagine we free a piece of memory that is cacheable, and later allocate uncacheable memory that overlaps the same lines we just freed. Some time later, the HW may decide to flush the dcache, and the newly allocated memory's contents will be changed randomly. Such bugs will be hard to debug, so let's prevent them in advance.

Ok, but the correct fix here is to invalidate and writeback the memory region on free() within the allocator.
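
A minimal sketch of that idea, assuming the firmware's dcache helpers; block_size() and free_block() are placeholders for the allocator's own bookkeeping:

	void rfree(void *ptr)
	{
		/* placeholder: however the allocator knows the block length */
		size_t size = block_size(ptr);

		/* write back dirty lines and drop them from the dcache so a
		 * later (possibly uncached) allocation of the same lines is
		 * not corrupted by a stale writeback
		 */
		dcache_writeback_invalidate_region(ptr, size);

		/* existing heap bookkeeping */
		free_block(ptr);
	}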

@libinyang
Contributor Author

Ok, but the correct fix here is to invalidate and writeback the memory region on free() within the allocator.

Yes, invalidating the cache in free() is a good choice. This makes sure the memory is safe, and it can also help handle DMA memory at the same time. My plan is to handle DMA memory with a flag and invalidate the cache on allocation, just like the Linux kernel does.

However, caching the frequently used objects will improve performance. I think we had better cache them all. What do you think?
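
A rough sketch of that plan; the wrapper name and the DMA capability flag below are illustrative assumptions, not the existing API:

	/* hypothetical helper: allocate a DMA-visible buffer and drop any
	 * stale cache lines so the DSP reads what the DMA engine writes
	 */
	static void *rzalloc_dma(size_t bytes)
	{
		void *ptr = rzalloc(RZONE_RUNTIME,
				    SOF_MEM_CAPS_RAM | SOF_MEM_CAPS_DMA, bytes);

		if (ptr)
			dcache_invalidate_region(ptr, bytes);

		return ptr;
	}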

@tlauda
Contributor

tlauda commented Nov 19, 2018

@libinyang Making those objects cacheable will break memory synchronization between cores. Of course, the best way would be to have a dedicated shared memory space, but right now that would require a huge effort to accomplish. There are simply too many objects that need to be shared in order to run a workload on the slave cores.
We could potentially make them all cacheable, but:

  1. Writeback and invalidate operations will pollute the code.
  2. I need data that proves the performance increase coming from such changes. From my experience, frequent cache operations on such small memory regions (<1 cache line) will increase the number of processor cycles executed.

Contributor

@tlauda left a comment

SMP will be broken.

@lgirdwood closed this Nov 19, 2018
@libinyang
Contributor Author

@libinyang Making those objects cacheable will break memory synchronization between cores. Of course, the best way would be to have a dedicated shared memory space, but right now that would require a huge effort to accomplish. There are simply too many objects that need to be shared in order to run a workload on the slave cores.
We could potentially make them all cacheable, but:

  1. Writeback and invalidate operations will pollute the code.
  2. I need data that proves the performance increase coming from such changes. From my experience, frequent cache operations on such small memory regions (<1 cache line) will increase the number of processor cycles executed.

@tlauda

Thanks for the explanation. My understanding is that the HW will handle such cache operations automatically, won't it? Do you have some test cases for SMP that reproduce such a synchronization issue? I would like to run a test and try to find the reason.

@tlauda
Contributor

tlauda commented Nov 20, 2018

@libinyang All cores share the same piece of SRAM, but have separate caches. HW won't handle such synchronization, so we need to make sure that coreX flushes its own data to SRAM in order for coreY to have it up to date.
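
For reference, a minimal sketch of that manual synchronization, assuming the firmware's dcache helpers and an illustrative shared object:

	/* core X publishes an update to a shared, cacheable object */
	shared->ready = 1;
	dcache_writeback_region(shared, sizeof(*shared));

	/* core Y must drop its stale copy before reading the object */
	dcache_invalidate_region(shared, sizeof(*shared));
	if (shared->ready)
		handle_update(shared);	/* placeholder */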

@libinyang
Contributor Author

@libinyang Making those objects cacheable will break memory synchronization between cores. Of course, the best way would be to have a dedicated shared memory space, but right now that would require a huge effort to accomplish. There are simply too many objects that need to be shared in order to run a workload on the slave cores.
We could potentially make them all cacheable, but:

  1. Writeback and invalidate operations will pollute the code.
  2. I need data that proves the performance increase coming from such changes. From my experience, frequent cache operations on such small memory regions (<1 cache line) will increase the number of processor cycles executed.

I went through our spec on cacheable memory and didn't find a requirement to manually sync between multiple cores. I only found that multiple microprocessors need to handle this manually. Maybe this is a special requirement of our HW/Xtensa?

Besides, I found that our code sometimes allocates memory as cached and sometimes not. For example, in ipc.c:

	icd = rzalloc(RZONE_RUNTIME | RZONE_FLAG_UNCACHED, SOF_MEM_CAPS_RAM,
		      sizeof(struct ipc_comp_dev));

	elem_array->elems = rzalloc(RZONE_RUNTIME, SOF_MEM_CAPS_RAM,
				    sizeof(struct dma_sg_elem) * ring->pages);

Is there any special reason?

@tlauda
Contributor

tlauda commented Nov 20, 2018

@libinyang Only structures that are used by the slave cores are allocated in uncached memory.
