PSRAM Cache Issue stills exist (IDFGH-31) #2892

ThomasRogg · 2018-12-27T16:42:46Z

We stumbled upon the fact that cache issues with PSRAM still exist, even in the newest development environment. This can produce random crashes, even if the code is 100% valid.

This very small example program reproduces the problem easily, at least if compiled with the newest ESP-IDF and toolchain under Mac OS X (we did not try other environments):
https://github.com/neonious/memcrash-esp32/

(As a side note: We noticed this problem when we implemented the dlmalloc memory allocator in a fork of ESP-IDF. We worked around this problem (hopefully you can fix it correctly), and now have an ESP-IDF with far faster allocations. Take a look at the blog post here: https://www.neonious-basics.com/index.php/2018/12/27/faster-optimized-esp-idf-fork-psram-issues/ ).

Alvin1Zhang · 2018-12-29T01:19:33Z

@neoniousTR Hi, neoniousTR, thanks for reporting this. We will look into it and update you if there is any feedback. There is also a topic about the issue on our forum at http://bbs.esp32.com/viewtopic.php?f=13&t=8628&sid=1acc8bd897e72cf450ad9eb71491d732. Thanks.

ThomasRogg · 2019-01-26T23:42:26Z

We updated the example project at https://github.com/neonious/memcrash-esp32
It is now leaner, more to the point, and most importantly, compiles out of the box.

We think this problem is urgent to fix, as random crashes can occur to anyone using the PSRAM of the ESP32.

There seem to be only two workarounds:

- Use only the first 2 MB of the 4 MB of PSRAM (big penalty)
- End every function which stores to PSRAM with a memw instruction (slow). nops do not help.
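
As a rough illustration of the second workaround, a store helper could finish with a `memw` so the write is completed before execution continues. This is only a sketch: the name `psram_store32` is hypothetical, GCC/Clang-style inline assembly is assumed, and on non-Xtensa hosts it falls back to a plain compiler barrier so it still builds (which naturally does nothing about the ESP32 cache behaviour).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store one word, then issue a memory barrier.
 * On Xtensa, "memw" makes earlier loads/stores complete before later ones.
 * On other targets only a compiler barrier is emitted (host stand-in). */
static inline void psram_store32(volatile uint32_t *dst, uint32_t value)
{
    assert(dst != NULL);
    *dst = value;
#if defined(__XTENSA__)
    __asm__ __volatile__("memw" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}
```

The barrier costs cycles on every call, which is the slowdown mentioned above.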

Please take a look at the project, and hopefully you have a better idea.

Spritetm · 2019-01-29T10:21:29Z

Fyi, we're working on this. For what it's worth, it seems to be caused by an interrupt (in your example, the FreeRTOS tick interrupt) firing while some cache activity is going on. We have our digital team running simulations to see what exactly is going on in the hardware; we hope to create a better workaround than the memw solution from that.

ThomasRogg · 2019-01-29T10:28:15Z

Good that you can reproduce this.
Interrupts are a good explanation for why this happens only randomly.
Hoping for the best.

markwj · 2019-01-30T11:36:54Z

We seem to be seeing this as a std::string memory corruption (all zeros, on a 4 byte boundary).

In our case, disabling the top 2 MB of SPIRAM didn't seem to work. But pre-allocating 2 MB (which we then never use) seemed to work around the problem. Our code runs primarily on core #1.
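
A pre-allocation of this kind might look roughly like the sketch below. It assumes the block is reserved once, early at startup, via `heap_caps_malloc` with `MALLOC_CAP_SPIRAM`; the helper name `reserve_psram` and the `RESERVE_ALLOC` macro are hypothetical, and which half of the PSRAM the allocator actually hands out is not guaranteed by this code. Off-target it falls back to `malloc` so it can be compiled on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
/* On target: take the block from external SPI RAM. */
#define RESERVE_ALLOC(n) heap_caps_malloc((n), MALLOC_CAP_SPIRAM)
#else
/* Host stand-in so the sketch builds off-target. */
#define RESERVE_ALLOC(n) malloc(n)
#endif

/* Kept for the lifetime of the program and never read or written. */
static void *reserved_block;

/* Hypothetical helper: reserve `bytes` of PSRAM once.
 * Returns 1 on success, 0 if the allocation failed. */
static int reserve_psram(size_t bytes)
{
    assert(bytes > 0);
    if (reserved_block == NULL) {
        reserved_block = RESERVE_ALLOC(bytes);
    }
    return reserved_block != NULL;
}
```

Calling `reserve_psram(2 * 1024 * 1024)` before any other PSRAM allocation would mirror the approach described above.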

This is impacting us quite badly. Lots of random corruptions and crashes with devices in the field.

ThomasRogg · 2019-01-30T11:42:18Z

Maybe whether the top or bottom of the RAM works depends on the core used.

dexterbg · 2019-01-30T21:41:45Z

Confirmed: running our test project from #3006 with that 2 MB allocation and also starting the test task on core 0 shows the corruptions again. It seems core 0 can only work reliably with the lower 2 MB and core 1 only with the higher 2 MB.

ThomasRogg · 2019-02-01T11:04:33Z

@Spritetm or @Alvin1Zhang
As this issue does not happen in single-core mode, do you know whether the original PSRAM cache issue (the one fixed by the flag that adds many nops and memws) also occurs only in dual-core mode?

If so, we will try to switch low.js to single-core mode. This might even be faster in the end, because the JavaScript itself is single-core anyhow and carries most of the load.

Also, how is the progress going? I'd think the best chance of fixing this is by modifying the interrupt handlers or the cache fetchers and savers (are they part of the ROM?).

xbary · 2019-02-11T20:03:15Z

Hello, I want to add an error I observed in my application, which uses PSRAM: a random error when retrieving the amount of free PSRAM. My application checks the amount of free PSRAM in the main loop, and occasionally the reply was, for example, 16 bytes of free memory, while the next check correctly reported ~4 MB.

In my opinion, there must have been an erroneous random reading from PSRAM.

ThomasRogg · 2019-02-19T21:47:35Z

Our current status:

Load to cache/Write from cache does not seem to be interrupt-based. Might it be 100% hardware-based?

Adding memw to the interrupt handlers does not change anything.

Currently we believe Dual Core + PSRAM is a broken combination.

So we will completely switch to Unicore now.

Please answer:
Do you know if the original PSRAM cache issue also exists in unicore mode? It would be great if we could get rid of the nops and memws with this once and for all...

xbary · 2019-02-21T20:28:55Z

I ran an experiment: I rewrote the String class from the Arduino project to use PSRAM. I changed the name and changed the realloc in the changebuffer function. It turned out that my application sporadically showed 0x00 in one cell.
Here is the changed String class: https://github.com/xbary/xb_StringPSRAM
As part of your tests, you could replace the String class with StringSRAM; you may be able to reproduce the error that way.

ThomasRogg · 2019-02-22T23:16:07Z

I have to confirm that the original PSRAM workaround is still required in Unicore mode.

So Workaround + Unicore is the only combination which works reliably with PSRAM. If I am wrong, I hope somebody will post. Otherwise we have to take this as a fact ...

me21 · 2019-02-23T06:47:32Z

Can a dual-core chip be switched to unicore mode?
Does this error manifest itself in the Arduino framework? As far as I know, the Arduino task is pinned to core 0, so it can effectively be viewed as unicore. Am I right?

ThomasRogg · 2019-02-23T09:23:46Z

Yes the CPU can be configured for unicore. Pinning to one core does not help, as in dual core mode the other 2 MB are handled by the other core.

Spritetm · 2019-02-26T02:58:11Z

FWIW, we have a tentative solution for this; the existing workaround solution does actually seem to work but doesn't take calls/returns into account properly. We'll ship a toolchain with improved workaround code soon, but we want to have this fairly well tested so we don't have any other edge cases sneaking past us. I'll see if I can post a preliminary patch as soon as I have something halfway stable.

markwj · 2019-03-04T00:38:31Z

Do you have any idea of the schedule for this, or an ability to get us a pre-release toolchain?

This is impacting us quite badly. The 2MB pre-allocation solves the problem for our code, but just shifts the problem to wifi running on the first core (which now experiences random errors and throughput problems).

xbary · 2019-03-04T07:00:38Z

I confirm, the error still occurs at random moments, even hangs completely.

markwj · 2019-04-25T01:43:43Z

Do you have any idea of the schedule for this, or an ability to get us a pre-release toolchain?

dexterbg · 2019-05-25T07:46:23Z

@Spritetm We do appreciate your efforts in making sure your patch is perfect. But meanwhile our system has to bear a huge performance hit by the workaround, while the stability is still impacted by the bug. We're more than willing to help you in beta testing your patch by using it on our project. Please do a pre-release or share some update on the status. Thanks!

Patrik-Berglund · 2019-05-26T18:39:09Z

I also think an update is in order; we are waiting to see whether you are able to fix this bug or whether it makes the PSRAM feature unusable.

We need more RAM than is available internally in the ESP32, so this is a deal breaker for our product.

negativekelvin · 2019-05-26T20:38:39Z

Just wondering why, if the original workaround should work in this case, forcing nops does not resolve it.

400d4b3c:	1047a5        	call8	400e4fb8 <crash_set_both>
400d4b3f:	f03d      	nop.n
400d4b41:	f03d      	nop.n
400d4b43:	f03d      	nop.n
400d4b45:	f03d      	nop.n
400d4b47:	0228      	l32i.n	a2, a2, 0

400e4fb8 <crash_set_both>:
400e4fb8:	004136        	entry	a1, 32
400e4fbb:	0249      	s32i.n	a4, a2, 0
400e4fbd:	0349      	s32i.n	a4, a3, 0
400e4fbf:	f03d      	nop.n
400e4fc1:	f03d      	nop.n
400e4fc3:	f03d      	nop.n
400e4fc5:	f03d      	nop.n
400e4fc7:	f01d      	retw.n
400e4fc9:	000000        	ill

Also I noticed the workaround will add the nops even when there is already a memw barrier.

400d4e86:	03a9      	s32i.n	a10, a3, 0
400d4e88:	0020c0        	memw
400d4e8b:	01a9      	s32i.n	a10, a1, 0
400d4e8d:	f03d      	nop.n
400d4e8f:	f03d      	nop.n
400d4e91:	f03d      	nop.n
400d4e93:	f03d      	nop.n
400d4e95:	0020c0        	memw
400d4e98:	0138      	l32i.n	a3, a1, 0

Spritetm · 2019-05-29T11:17:24Z

From what I can see, the load-store inversion doesn't occur there exactly... the issue has something to do with a cache miss around that time that has delayed effects later. Because a cache miss takes a while to resolve, you can fix it with nops but you'd need to put a gazillion of them there.

I also noticed the nop/memw interaction... will see if I can get rid of that as well, insofar as gcc marks it. (As in: I can probably detect volatiles that cause an implicit memw, but a literal asm("memw") is harder to spot.)

negativekelvin · 2019-05-29T13:11:44Z

Ok thanks. Other things I noticed when playing with the memcrash-esp32 example:

- If running on core 0 error will occur with mem2 (lower) in HIGH-LOW mode
- If running on core 1 error will occur with mem1 (upper) in HIGH-LOW mode
- If running on core 0 error will occur with both mem1 & mem2 even in EVEN-ODD mode
- If running on core 1 error will occur with both mem1 & mem2 odd in EVEN-ODD mode
- Error does not seem to happen in NORMAL mode on either core (I don't know if this is supported or just giving a false result)

Assuming that running in NORMAL mode is a valid workaround with a performance trade-off, how will it compare to the performance of the planned workaround?
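
For readers unfamiliar with the modes named above, the following sketch shows how a PSRAM byte offset would map to the CPU whose cache serves it. It rests on assumptions that should be checked against the ESP-IDF sources: a 4 MB part, CPU 0 owning the low 2 MB in HIGH-LOW mode, and alternating 32-byte ranges in EVEN-ODD mode. All names here are hypothetical and not ESP-IDF API.

```c
#include <assert.h>
#include <stdint.h>

#define PSRAM_SIZE (4u * 1024u * 1024u) /* assumed 4 MB part */
#define CACHE_LINE 32u                  /* assumed even/odd granularity */

typedef enum { MODE_LOW_HIGH, MODE_EVEN_ODD } cache_mode_t; /* hypothetical */

/* Returns 0 or 1: the CPU whose cache is assumed to serve this offset. */
static int owning_cpu(cache_mode_t mode, uint32_t offset)
{
    assert(offset < PSRAM_SIZE);
    if (mode == MODE_LOW_HIGH) {
        /* Low half goes to CPU 0, high half to CPU 1. */
        return offset < PSRAM_SIZE / 2u ? 0 : 1;
    }
    /* Even 32-byte ranges go to CPU 0, odd ranges to CPU 1. */
    return (int)((offset / CACHE_LINE) & 1u);
}
```

Under these assumptions, any buffer larger than 32 bytes touches both caches in EVEN-ODD mode, while in HIGH-LOW mode a buffer stays with one cache unless it straddles the 2 MB boundary.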

negativekelvin · 2019-05-30T14:51:44Z

More info: the memw in the example actually does not prevent the error when the routine is running in parallel on both cores. It is much more infrequent but still happens.

Normal mode does have a performance cost as it is around 24% fewer tries/ms with the example on both cores, but no errors.

ThomasRogg · 2019-06-01T22:46:32Z

Normal mode has cache coherency issues according to documentation, so not an option.

igrr · 2020-09-22T17:11:17Z

Toolchain updated in release/v4.1 branch with c7ba54e.

Curclamas · 2020-10-14T15:48:21Z

Hello @igrr, we're working with v3.3 but also experience issues with regard to PSRAM (interestingly enough, with and without a PSRAM chip attached). It only happens in release mode; the debug build works just fine.
Could this be the same issue? What is the roadmap for updating the toolchain/fix for v3.3?

AxelLin · 2020-11-08T04:58:39Z

v4.2 9f0c564
v3.3 81da2ba @Curclamas, does this fix the issue in v3.3?

tmihovm2m · 2020-11-11T16:17:33Z

@AxelLin I started using the PSRAM a couple of days ago and started having weird issues with MQTT (from the IDF) on version v3.3. I tried updating the compiler to the one from the linked commit, but that didn't help... after that I tried forcing the MQTT allocation to use the DMA RAM and the issue disappeared. So my guess is the issue is not fixed :(

dexterbg · 2020-11-11T19:45:56Z

@tmihovm2m Did you enable CONFIG_SPIRAM_CACHE_WORKAROUND and do a full rebuild?

tmihovm2m · 2020-11-13T08:21:40Z

@dexterbg Hmm, I thought it was enabled, but I guess I accidentally reverted the change when testing with the updated toolchain.
Sorry guys :(

igrr · 2020-11-13T13:22:16Z

Given that the toolchain has been updated in all currently maintained releases, I will close this issue. Please open a new issue if you are seeing a PSRAM-related problem.
Thanks, everyone, for all the help reproducing the issue and for the great deal of patience while we were releasing the fixes.

vonnieda · 2020-11-13T15:52:04Z

@igrr It seems premature to close this when the toolchain released contains a critical bug as described at https://github.com/espressif/esp-idf/releases/tag/v3.3.4. I will note that the Known Issue says "difficult to reproduce", but in my use case, which is heavy WiFi and BLE, it crashes within minutes and usually under a minute. I have had to revert to a prior revision to keep my app stable.

dexterbg · 2020-11-13T16:16:43Z

@vonnieda Do you still see this with toolchain 1.22.0-97-gc752ad5?

vonnieda · 2020-11-13T23:06:45Z

@dexterbg I hadn't seen 97 yet, I will try it next week. On the commit, though, it says "Revert a part of PSRAM workaround because of regression"; so, does this mean the PSRAM issue is still not fixed in this version?

dexterbg · 2020-11-14T08:43:53Z

No, that means toolchain 97-gc752ad5 reverts the regression of the fix introduced in toolchain 96-g2852398 (see above). The fix is now supposed to be fully functional.

igrr · 2020-11-15T09:06:39Z

Sorry that I didn't make it clear: all release branches now contain the fix, i.e. a commit which updates the toolchain version. These commits are:

- master: 439f4e4.
- release/v4.2: 9f0c564. This commit will be part of v4.2-rc, due to be released soon.
- release/v4.1: c7ba54e. This commit will be part of the next v4.1.1 bugfix release.
- release/v4.0: 6093407. This commit is part of v4.0.2 release.
- release/v3.3: 81da2ba. This commit will be part of the next v3.3.5 bugfix release.

The toolchain versions which contain the fix are esp-2020r3 based on GCC 8.4 (used in 4.x releases), and 1.22.0-97-gc752ad5 based on GCC 5.2.0 (used in 3.3 release).

Curclamas · 2020-12-18T13:07:45Z

@igrr do you happen to have any timeline when the v3.3.5 bugfix release is scheduled?

igrr · 2020-12-18T13:28:39Z

At the moment QA is testing two bugfix releases: 4.1.1 and 3.3.5. Testing will be finished around Dec 25; if no issues are found, we will proceed with the release. 4.1.1 currently has higher priority for us, so if issues are found we will work on fixing them in release/v4.1 first. Early January is probably viable for the release. We'll try to keep you updated (cc @Alvin1Zhang).

EtherFidelity · 2021-01-05T09:03:44Z

So, this has been a problem FOR TWO YEARS. How certain are you that the early January release you speak of will actually fix the problem permanently?

It's early January now, by the way.

igrr · 2021-01-05T10:24:25Z

Hi @Etherfi, some issues (unrelated to PSRAM) have been found while testing v4.1.1 release candidate, so the v3.3.5 release is still pending while we are working on fixing the issues in v4.1.1. To the best of our knowledge, no new PSRAM issue reports appeared since we have switched to this version of the compiler. That said, for new designs it is recommended to use ESP32 silicon revision 3 as it fixes the PSRAM cache issue in hardware.

EtherFidelity · 2021-01-09T09:53:36Z

Thanks!

dexterbg · 2021-12-29T16:59:14Z

FYI: there is a very strong indication that the workaround in toolchain 1.22.0-97-gc752ad5-5.2.0 does not fully fix the issue.

I don't have the time to create a test case. If you want to investigate, use my reduced OVMS build as described here:

Duktape: Randomly unable to access functions openvehicles/Open-Vehicle-Monitoring-System-3#474

Regards,
Michael

themadsens · 2022-03-04T13:19:50Z

We still see this as well, with toolchain esp-2021r1-8.4.0 and IDF v4.3.1.

The occurrence we have been able to identify revolves around the following pseudocode when growing a buffer:

    char *oldbuf = buf;
    buf = heap_caps_malloc(size * 2, MALLOC_CAP_DEFAULT | MALLOC_CAP_SPIRAM);
    assert(buf != NULL);  /* the allocation must succeed */
    memcpy(buf, oldbuf, size);
    free(oldbuf);
    size *= 2;

The result is invariably (by deduction) that the first byte in the destination ends up as 0, 1, or 2 (presumably 0).
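
One way to narrow such a case down could be to compare the copy against the source before the old block is freed. The sketch below is a host-buildable variant under stated assumptions: the helper name `grow_checked` is hypothetical, plain `malloc` stands in for the `heap_caps_malloc` call above, and a read-back that is served from the cache may well miss the kind of corruption discussed in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: double a buffer and verify the copy before the
 * old block is freed. Returns the new buffer, or NULL if the copy did
 * not match (the old buffer is then left untouched for inspection). */
static char *grow_checked(char *oldbuf, size_t *size)
{
    assert(oldbuf != NULL && size != NULL && *size > 0);
    char *newbuf = malloc(*size * 2); /* on target: heap_caps_malloc(...) */
    assert(newbuf != NULL);
    memcpy(newbuf, oldbuf, *size);
    if (memcmp(newbuf, oldbuf, *size) != 0) {
        free(newbuf); /* mismatch: keep oldbuf and its size as they were */
        return NULL;
    }
    free(oldbuf);
    *size *= 2;
    return newbuf;
}
```

A caller would use it as `buf = grow_checked(buf, &size);` and log or halt when it returns NULL.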

I am afraid you can safely reopen this issue, alas!

For now, we will go unicore, which will hopefully alleviate our concrete pains.

christhomas · 2022-10-22T16:30:31Z

@ThomasRogg your website has a problem with the ssl certificate

ThomasRogg · 2022-10-24T10:04:33Z

@christhomas: Well, the links are 4 years old.

From my side, I believe the old ESP32 revisions are broken and cannot be reliably fixed for Dual Core, as ESP-IDF interrupts at arbitrary times etc., and I'm not expecting Espressif to really try any longer. So I don't really care about what is happening here.

Edit: oh, and looking above, the old ESP32 revisions might never work well with PSRAM, even in Unicore mode.

beckerzito · 2023-03-18T16:45:49Z

Hi, any update on this topic? Not sure if it is related, but using ESP32 revision 1 in dual-core mode, with all the current workarounds applied according to the documentation and IDF v4.4, I'm getting random crashes (StoreProhibited and LoadProhibited) with PSRAM!

Did we reach any closure on fixing that in this discussion?

Alvin1Zhang changed the title ~~PSRAM Cache Issue stills exist~~ [TW#28180] PSRAM Cache Issue stills exist Dec 28, 2018

ThomasRogg mentioned this issue Jan 29, 2019

Memory corruptions with std::string in SPI RAM (IDFGH-596) #3006

Closed

ThomasRogg pushed a commit to neonious/lowjs that referenced this issue Feb 22, 2019

esp32: switching to unicore because of espressif/esp-idf#2892

2941717

projectgus changed the title ~~[TW#28180] PSRAM Cache Issue stills exist~~ PSRAM Cache Issue stills exist (IDFGH-31) Mar 12, 2019

lll000111 mentioned this issue Mar 15, 2019

ESP32 (Espressif): Does the PSRAM problem of ESP-IDF affect Moddable? Moddable-OpenSource/moddable#151

Closed

igrr closed this as completed Nov 13, 2020

SWillSZ mentioned this issue Mar 15, 2021

Broken jpegs - fixed ??? - 1.0.5rc6, config.xclk_freq_hz = 20000000, ov2640 and ov5640 jpeg i2s problem espressif/esp32-camera#244

Closed

PerMalmberg mentioned this issue Apr 6, 2021

SecureSocket mbedtls_ssl_handshake returned -29056: SSL - Verification of the message MAC failed PerMalmberg/Smooth#33

Closed

dexterbg mentioned this issue Dec 29, 2021

Duktape: Randomly unable to access functions openvehicles/Open-Vehicle-Monitoring-System-3#474

Closed

igrr reopened this Oct 23, 2022

espressif-bot added the Status: Opened Issue is new label Oct 24, 2022

alex1115alex mentioned this issue Oct 27, 2022

CVBS output flickers/glitches when ESP32 SPIRAM is enabled lovyan03/LovyanGFX#295

Closed

beckerzito mentioned this issue Mar 24, 2023

ESP32 PSRAM Memory Exception Moddable-OpenSource/moddable#1069

Closed
