Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ets intr lock nest #6484

Merged
merged 10 commits into from
Sep 11, 2019
Merged

Conversation

ChocolateFrogsNuts
Copy link
Contributor

Replace the SDK's use of ets_intr_lock / ets_intr_unlock with nestable versions.

This is intended for use in addition to using xt_rsil / xt_wsr_ps in the Arduino code and other more correct approaches where they are possible.

Testing has shown that there are several paths in the SDK that result in nested
calls to ets_intr_lock() / ets_intr_unlock() which may be a problem.

These functions also do not preserve the enabled interrupt level and may
result in code running with interrupts enabled when that is not intended.
This issue has recently been fixed in the Arduino code by using
xt_rsil() / xt_wsr_ps() but still exists in the Espressif SDK code.

This commit is intended to fix that and should be used in addition to the above.

The maximum nesting I have seen is 2 and lock/unlock calls appear to be balanced.
A max of 7 levels of nesting leaves plenty of room for that to change.
…r/underflow

The PS register is 15 bits, we should store the whole thing as xt_wsr_ps()
writes the whole thing.

Also if there is an underflow, we should make sure interrupts are enabled.
Same goes for overflow making sure interrupts are disabled, although this
is less important.
This saves having to modify libmain.a, libpp.a and libnet80211.a to use the
nested versions.
Adjusts fix_sdk_libs.sh accordingly.
@d-a-v
Copy link
Collaborator

d-a-v commented Sep 4, 2019

Are the modifications you brought to fw libs scriptable ?
Can you provide this script in the PR for when we update the sdk ?
(like fix_sdk_libs.sh)

Add a wrapper around the ets_post code in rom to preserve the interrupt enable state.

Rather than modifying the SDK libs, rename ets_post in the .ld file and call the
wrapper "ets_post" to replace it.

As far as I can establish, ets_post is the only rom function in use by our code or
the SDK libs we use that causes calls to ets_intr_(un)lock.
@ChocolateFrogsNuts
Copy link
Contributor Author

The ets_post wrapper appears to have resolved the arp / network deafness issue - at least my test unit is 58 hours up and reachable where I was seeing it go deaf after around 25-30 hours operation.
Testing continues, but it looks like it's worth getting a few more people verifying the result.

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 9, 2019

If that's true, this is a real major improvement !

What about

  • some comments like "// save PS state before calling rom ets_post, workaround for silicium bug",
  • restoring and commenting removed provides and telling the reason

This should be merged asap !

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 9, 2019

IRAM_ATTR is needed for this PR to work on my side:
(otherwise I get wdt or bootloops)
(I don't have the arp issue so I can't test further)

diff --git a/cores/esp8266/core_esp8266_main.cpp b/cores/esp8266/core_esp8266_main.cpp
index ad3c3522..0ddff2fb 100644
--- a/cores/esp8266/core_esp8266_main.cpp
+++ b/cores/esp8266/core_esp8266_main.cpp
@@ -128,14 +128,18 @@ extern "C" void optimistic_yield(uint32_t interval_us) {
     }
 }
 
-extern "C" void ets_intr_lock() {
+extern "C"
+IRAM_ATTR
+void ets_intr_lock() {
   if (ets_intr_lock_stack_ptr < ETS_INTR_LOCK_NEST_MAX)
      ets_intr_lock_stack[ets_intr_lock_stack_ptr++] = xt_rsil(3);
   else
      xt_rsil(3);
 }
 
-extern "C" void ets_intr_unlock() {
+extern "C"
+IRAM_ATTR
+void ets_intr_unlock() {
   if (ets_intr_lock_stack_ptr > 0)
      xt_wsr_ps(ets_intr_lock_stack[--ets_intr_lock_stack_ptr]);
   else
@@ -145,7 +149,9 @@ extern "C" void ets_intr_unlock() {
 
 extern "C" bool ets_post_rom(uint8 prio, ETSSignal sig, ETSParam par);
 
-extern "C" bool ets_post(uint8 prio, ETSSignal sig, ETSParam par) {
+extern "C"
+IRAM_ATTR
+bool ets_post(uint8 prio, ETSSignal sig, ETSParam par) {
   uint32_t saved;
   asm volatile ("rsr %0,ps":"=a" (saved));
   bool rc=ets_post_rom(prio, sig, par);
diff --git a/tools/sdk/ld/eagle.rom.addr.v6.ld b/tools/sdk/ld/eagle.rom.addr.v6.ld
index 215c3d07..97d6081d 100644
--- a/tools/sdk/ld/eagle.rom.addr.v6.ld
+++ b/tools/sdk/ld/eagle.rom.addr.v6.ld
@@ -119,6 +119,10 @@ PROVIDE ( ets_install_external_printf = 0x40002450 );
 PROVIDE ( ets_install_putc1 = 0x4000242c );
 PROVIDE ( ets_install_putc2 = 0x4000248c );
 PROVIDE ( ets_install_uart_printf = 0x40002438 );
+/* permanently hide reimplemented ets_intr_*lock(), see #6484
+PROVIDE ( ets_intr_lock = 0x40000f74 );
+PROVIDE ( ets_intr_unlock = 0x40000f80 );*/
+*/
 PROVIDE ( ets_isr_attach = 0x40000f88 );
 PROVIDE ( ets_isr_mask = 0x40000f98 );
 PROVIDE ( ets_isr_unmask = 0x40000fa8 );
@@ -126,6 +130,9 @@ PROVIDE ( ets_memcmp = 0x400018d4 );
 PROVIDE ( ets_memcpy = 0x400018b4 );
 PROVIDE ( ets_memmove = 0x400018c4 );
 PROVIDE ( ets_memset = 0x400018a4 );
+/* renamed to ets_post_rom(), see #6484
+PROVIDE ( ets_post = 0x40000e24 );
+*/
 PROVIDE ( ets_post_rom = 0x40000e24 );
 PROVIDE ( ets_printf = 0x400024cc );
 PROVIDE ( ets_putc = 0x40002be8 );

@TD-er
Copy link
Contributor

TD-er commented Sep 9, 2019

With the IRAM_ATTR my node does boot and seems to run fine.
Without it, the node entered a reboot loop.

@TD-er
Copy link
Contributor

TD-er commented Sep 9, 2019

Just a short update here.
Running for 4 hours, with sending ARP disabled and also the communication from the ESP node is set to a minimum.
It is still responding quickly and does not seem to be missing UDP packets from other nodes.
Will now increase the likelihood for missing packets by enabling "eco mode" (calling delay() when nothing has to be done)

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 9, 2019

With WIFI_LIGHT_SLEEP ?

@TD-er
Copy link
Contributor

TD-er commented Sep 9, 2019

@d-a-v If that's the default behavior then yes.
I do not set any WiFi sleep mode.

The only WiFi sleep related code I have in the entire project is this:

 WiFi.setSleepMode(WIFI_NONE_SLEEP);

And that one is now not being called, so it is just running the default sleep mode.

@TD-er
Copy link
Contributor

TD-er commented Sep 9, 2019

Just some quick update.
Running on 2 nodes now, both with my so called "eco" mode enabled.
Running just fine and also the timing stats do seem to show less extremes which were mostly caused by timeouts connecting to a host.
I will leave them running now without querying them for a number of hours to see what happens.
But for now I'm quite positive about this fix.

@cziter15
Copy link
Contributor

cziter15 commented Sep 9, 2019

Me too, running two nodes with modem sleep option DTIM=3.

I've had problems with connection stability on LWIP2 before this fix, while LWIP 1.4 was working longer w/o disconnection from MQTT and without router disassociation due to inactivity. Testing LWIP2 Higher Bandwidth right now on this PR. Everything flashed properly and my device started successfully. I will update soon.

@TD-er
Copy link
Contributor

TD-er commented Sep 10, 2019

My nodes were still reacting fast after the night, so at least it is not breaking anything :)
The nodes with this PR are not running Gratuitous ARP, so that's a good thing.

@ChocolateFrogsNuts
Copy link
Contributor Author

Awesome! My original test device is at 95 hours now and still going strong.
I'll change the new vars to static and add some comments to ets_post as suggested above.
I'll push that in the morning - by then the test will be past the 100 hour mark.

@cziter15
Copy link
Contributor

cziter15 commented Sep 10, 2019

Two nodes - 25 hours passed, no disconnection from MQTT on LWIP2. Never reached this before :)
It was earlier always dropping connection after ~10 hours.

It means that it improves #2330 , don't saying 'fixes' yet. Test is still running.

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 10, 2019

don't saying 'fixes' yet

Let's dare

@ChocolateFrogsNuts This is thrilling

@ChocolateFrogsNuts ChocolateFrogsNuts changed the title Ets intr lock nest - WIP Ets intr lock nest Sep 11, 2019
@devyte
Copy link
Collaborator

devyte commented Sep 11, 2019

Fixes #2330

@devyte
Copy link
Collaborator

devyte commented Sep 11, 2019

Ref: tools uploaded to this repo:
https://github.com/ChocolateFrogsNuts/ESP-DebugTools

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 23, 2019

I'd like to mention that following this fix, a board that was never working suddenly resurrected.
Namely, it is a sonoff basic. I had ordered two of them, one was working and not the other. I thought of an hardware malfunction. But it was not. This fix is golden. Thanks again!

(symptom: Losing wifi after the first connected second. Never able to reconnect until a reset)

@d-a-v d-a-v added this to the 2.6.0 milestone Sep 23, 2019
TD-er added a commit to TD-er/ESPEasy that referenced this pull request Sep 23, 2019
TD-er added a commit to TD-er/ESPEasy that referenced this pull request Sep 24, 2019
@cziter15
Copy link
Contributor

Yeah ;) My boards connection time is now more than one week. No disconnections anymore! On LWIP2, before this fix, it was dropping after few hours.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants