WDT reset when enabling OTA (only with lwIP v2) #4028

Closed
NdK73 opened this issue Dec 26, 2017 · 28 comments

@NdK73
Contributor

NdK73 commented Dec 26, 2017

Hardware: ESP-12
Core Version: GIT

Description

I have a sketch where calling ArduinoOTA.begin() triggers a WDT reset if I compile with lwIP v2. The same sketch (I only change "lwIP variant" option) works when compiling with "lwIP v1.4 (prebuilt GCC)".

That sketch seems to work when loaded on a Wemos D1 mini (though the surrounding HW is quite different there). Loading it on another ESP12 (on the same HW) gives the same problem. Is it possible it's related to some GPIO state? I don't think so...

I could trace the issue to ESP8266mDNS.cpp, function _listen(), line 198: igmp_joingroup(IP_ADDR_ANY, &multicast_addr) seems not to return (at least not for the time needed to trigger WDT). But I couldn't find igmp_joingroup under lwip2 tree to dig deeper :(

The failing (minimal) sketch is:

#include <ArduinoOTA.h>
#include <ESP8266WiFi.h>
const char* ssid = "MySSID";
const char* password = "MySecurePassword";
const char* host = "dispenc-01";

void setup() {
  WiFi.persistent(false); // Avoid wear-out
  WiFi.mode(WIFI_STA);
  WiFi.hostname(host);
  WiFi.begin(ssid, password);

  Serial.begin(115200);
  ArduinoOTA.setHostname(host);
}

long now, lastact;

void loop() {
  static bool otaStarted=false;
  static long last=0;

  now=millis();

  if(now!=last && 0==now%1000) { // Ignore rollover, ATM
    Serial.println(now/1000);
    last=now;
  }
  if(otaStarted) {
    // Handle updates
    ArduinoOTA.handle();
  } else {
    if(WiFi.status() == WL_CONNECTED) {
      /* setup the OTA server once we're connected and have obtained an IP */
      Serial.println(WiFi.localIP());
      ArduinoOTA.begin();
      Serial.println("begun");
      otaStarted=true;
    }
  }

}

It prints the IP but does not print "begun". Then, after 7-8s, wdt triggers :(

@d-a-v
Collaborator

d-a-v commented Dec 26, 2017

This is interesting. Aren't the chips all the same (except the esp8285)?
My main dev boards are D1 mini but I also have ESP01/512K (which I run without OTA), D1 mini lite, and D1 R1. Santa may have put a nodemcu in my mailbox. I will try and reproduce.

You can get lwip2 submodule sources and recompile (for further debugging in igmp_joingroup()) with

cd tools/sdk/lwip2
make

and why not try its latest version

cd builder
git checkout master
cd ..
make

@NdK73
Contributor Author

NdK73 commented Dec 26, 2017

They should be the same.
I already followed instructions to update lwip2 (from issue #3970) -- no change :(
I can't find a way to enable LWIP2 debugging from Arduino IDE... Should I modify /tools/sdk/lwip2/builder/lwip2-src/src/include/lwip/debug.h ?

@d-a-v
Collaborator

d-a-v commented Dec 26, 2017

To debug lwip2, edit builder/glue/gluedebug.h. You can activate glue and lwip debug messages. Use Serial.setDebugOutput(true) or equivalent.
This will be better documented soon.

If there is no difference between the chips on the boards, could you check or add more capacitors across the power pins?

@NdK73
Contributor Author

NdK73 commented Dec 27, 2017

Done activating debug. I get:

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
vd5bb4a99
~ld
   �scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 3
cnt 

connected with Base, channel 12
dhcp client start...

ip:192.168.1.74,mask:255.255.255.0,gw:192.168.1.1

ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset

And the cycle repeats :(
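
For readers decoding the boot line in this log: `rst cause:4` is the watchdog reset, which matches the `wdt reset` line printed right after it. Below is a minimal sketch of that mapping, following the boot-message table in the ESP8266 Arduino core documentation. The helper name `rstCauseName` is made up for illustration, and any value outside that table is reported as unknown.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not part of the ESP8266 core): maps the number printed
// after "rst cause:" in the boot banner to a short description.
inline const char* rstCauseName(int cause) {
  switch (cause) {
    case 1: return "normal boot";
    case 2: return "reset pin";
    case 3: return "software reset";
    case 4: return "watchdog reset";
    default: return "unknown";
  }
}

// Example: the banner in the log above reports cause 4.
inline void exampleRstCause() {
  assert(std::strcmp(rstCauseName(4), "watchdog reset") == 0);
}
```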

I already have 10nF + 100pF SMD near the power pins. But I tried anyway adding a 1uF electrolytic capacitor: no change (phew... I already have 60 boards with the same base design...).

@NdK73 NdK73 mentioned this issue Dec 28, 2017
@d-a-v
Collaborator

d-a-v commented Dec 28, 2017

Does it happen with your D1 mini as well ?
I just tried with a D1 mini lite (UDEBUG=1, ULWIPDEBUG=1) with no problem.
I will try to find other hardware

@NdK73
Contributor Author

NdK73 commented Jan 2, 2018

No, I can't replicate it any more, either with the D1 or with my HW.
I just pulled the latest GIT, enabled UDEBUG and ULWIPDEBUG in tools/sdk/lwip2/builder/glue/gluedebug.h, ran "make install" from tools/sdk/lwip2 and finally loaded the sketch both on the D1 and on my HW.
Left running for some time, no WDT resets. Seems latest patches solved it!
Tks!

@d-a-v
Collaborator

d-a-v commented Jan 2, 2018

Thanks for the report !

@NdK73 NdK73 closed this as completed Jan 2, 2018
@NdK73
Contributor Author

NdK73 commented Jan 4, 2018

I have to reopen this issue.
Seems it only became more "random" :( The latest tests I did yesterday night triggered the WDT again.

@NdK73 NdK73 reopened this Jan 4, 2018
@NdK73
Contributor Author

NdK73 commented Jan 4, 2018

On my HW with full sketch and debug enabled I get:

⸮ip4_output_if: ap1
IP header:
+-------------------------------+
| 4 | 6 |  0x00 |        32     | (v, hl, tos, len)
+-------------------------------+
|        0      |000|       0   | (id, flags, offset)
+-------------------------------+
|    1  |    2  |    0x98e4     | (ttl, proto, chksum)
+-------------------------------+
|  192  |  168  |    4  |    1  | (src)
+-------------------------------+
|  239  |  255  |  215  |   74  | (dest)
+-------------------------------+
ip4_output_if: call netif->output()
GLUE: linkoutput: netif@0x3fff1320 (soft-ap) pbuf=0x3fff1754 payload=0x3fff179a
GLUE: linkoutput default netif: 0
lwESP: LINKOUTPUT: real pbuf sent to wilderness (len=46B esp-pbuf=0x3fff18a0 glue-pbuf=0x3fff1754 payload=0x3fff179a netifidx=1)

 ets Jan  8 2013,rst cause:4, boot mode:(1,6)

wdt reset

On WemosD1mini it starts nearly the same, but then goes on (a lot -- here it's truncated):

⸮ip4_output_if: ap1
IP header:
+-------------------------------+
| 4 | 6 |  0x00 |        32     | (v, hl, tos, len)
+-------------------------------+
|        0      |000|       0   | (id, flags, offset)
+-------------------------------+
|    1  |    2  |    0x5d8e     | (ttl, proto, chksum)
+-------------------------------+
|    0  |    0  |    0  |    0  | (src)
+-------------------------------+
|  239  |  255  |  215  |   74  | (dest)
+-------------------------------+
ip4_output_if: call netif->output()
GLUE: linkoutput: netif@0x3fff1320 (soft-ap) pbuf=0x3fff27e4 payload=0x3fff282a
GLUE: linkoutput default netif: 0
lwESP: glue2esp_linkoutput: if 1 not initialized
GLUE: linkoutput error sending pbuf@0x3fff27e4
ip4_output_if: st0
IP header:
+-------------------------------+
| 4 | 6 |  0x00 |        32     | (v, hl, tos, len)

Seems on my HW it tries sending the packet even if wifi is not yet initialized...

With the minimal sketch posted when opening the issue, it goes on a bit more, up to:

ip4_output_if: ap1
IP header:
+-------------------------------+
| 4 | 6 |  0x00 |        32     | (v, hl, tos, len)
+-------------------------------+
|        3      |000|       0   | (id, flags, offset)
+-------------------------------+
|    1  |    2  |    0x7f30     | (ttl, proto, chksum)
+-------------------------------+
|  192  |  168  |    4  |    1  | (src)
+-------------------------------+
|  224  |    0  |    0  |  251  | (dest)
+-------------------------------+
ip4_output_if: call netif->output()
GLUE: linkoutput: netif@0x3fff0930 (soft-ap) pbuf=0x3fff151c payload=0x3fff1562
GLUE: linkoutput default netif: 0
lwESP: LINKOUTPUT: real pbuf sent to wilderness (len=46B esp-pbuf=0x3fff1774 glue-pbuf=0x3fff151c payload=0x3fff1562 netifidx=1)

 ets Jan  8 2013,rst cause:4, boot mode:(1,6)

Again, that message "sent to wilderness"... Uhm...
If needed, I can post the schematic of my HW (it's open source) but it's nothing special... Too bad I can't reproduce the bug on D1mini...
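
As a reading aid for the IP header dumps above: `proto` 2 is IGMP, `ttl` is 1, and both destinations (239.255.215.74 and 224.0.0.251) are multicast groups, so these are IGMP packets for group membership. The sketch below recomputes the header checksum of the first dump. It rests on one assumption that the log does not show: since `hl` is 6, the sixth 32-bit header word is taken to be the IP Router Alert option (`0x94 0x04 0x00 0x00`) that IGMP packets normally carry. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// RFC 1071 Internet checksum over 16-bit words. For an IPv4 header the
// checksum field itself is set to zero while computing.
inline uint16_t inetChecksum(const uint16_t* words, std::size_t count) {
  uint32_t sum = 0;
  for (std::size_t i = 0; i < count; ++i) sum += words[i];
  while (sum >> 16) sum = (sum & 0xFFFFu) + (sum >> 16);  // fold the carries
  return static_cast<uint16_t>(~sum & 0xFFFFu);
}

// Header of the first dump: v=4, hl=6, tos=0, len=32, id=0, ttl=1, proto=2,
// src 192.168.4.1, dest 239.255.215.74.
inline uint16_t firstDumpChecksum() {
  const uint16_t hdr[12] = {
      0x4600, 0x0020,  // version / header length / TOS, total length 32
      0x0000, 0x0000,  // id, flags and fragment offset
      0x0102, 0x0000,  // ttl=1, proto=2 (IGMP), checksum placeholder
      0xC0A8, 0x0401,  // src 192.168.4.1
      0xEFFF, 0xD74A,  // dest 239.255.215.74
      0x9404, 0x0000   // ASSUMED Router Alert option (not printed in the log)
  };
  return inetChecksum(hdr, 12);
}

inline void exampleChecksum() {
  assert(firstDumpChecksum() == 0x98E4);  // the log prints chksum 0x98e4
}
```

The computed value agrees with the `chksum` field in the log, which supports the Router Alert assumption. The source address 192.168.4.1 is the ESP8266's default soft-AP address, consistent with the `ap1` and `(soft-ap)` markers in the same dump.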

@NdK73
Contributor Author

NdK73 commented Jan 4, 2018

A question: why is it using soft-ap instead of station interface?

@NdK73
Contributor Author

NdK73 commented Jan 4, 2018

I just did a little experiment... Suspecting it could be something in the blob that does not like my "calling convention", I first commented out WiFi.persistent(false) to let the blob store my network credentials in its cache. The loaded sketch still triggered WDT.
Then I replaced WiFi.begin(ssid, password) with a plain WiFi.begin() that uses the previously cached credentials. It's still up and running after ~5 minutes!

The current setup() is:

void setup() {
  // WiFi.persistent(false); // Avoid wear-out
  WiFi.mode(WIFI_STA);
  WiFi.hostname(host);
  // WiFi.begin(ssid, password);
  WiFi.begin(); // Use the previously cached credentials

  Serial.begin(115200);
  Serial.setDebugOutput(true);
  ArduinoOTA.setHostname(host);
}

Maybe that could give a hint about what's happening. But it doesn't explain why it works on the D1mini.
The same change apparently "fixed" my full sketch too.

@d-a-v
Collaborator

d-a-v commented Jan 4, 2018

I can now reproduce the bug with your original sketch on a D1 mini lite.

@NdK73
Contributor Author

NdK73 commented Jan 5, 2018

Urgh! Today (w/o updating anything from git) it does not crash. Even restoring the commented lines!

@d-a-v
Collaborator

d-a-v commented Jan 5, 2018

@NdK73 That's unfortunate - I promise I did nothing from here - so far :)

More seriously, I will shortly send a fix for lwip2.
Indeed, in this sketch configuration, lwip2 is trying to send multicast packets through the uninitialized SOFTAP interface, which caused the error, and which is now contained.
I am now trying to figure out how lwip chooses its multicast output interface and why STA is not chosen in that case (and to check the other multicast cases: ap only, ap+sta).
There is this new lwIP-v2 function ip4_set_default_multicast_netif that is not used at all and was not present in lwIP-v1. I need to understand why multicast packets are not sent to every LINK_UP netif.

@NdK73
Contributor Author

NdK73 commented Jan 5, 2018

Glad you could spot it!
But WHICH multicast is it trying to send?
Unless ArduinoOTA.setHostname() already tries to send it even though ArduinoOTA is not yet started, I'm waiting for WiFi.status() == WL_CONNECTED before generating traffic.
Well, actually it could be better to move that call inside the if, just to be more sure...

@d-a-v
Collaborator

d-a-v commented Jan 5, 2018

OTA uses mDNS which is multicast. It does it after the waiting loop, but currently lwip can choose the wrong interface to send its multicast packets.
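
To make the multicast point concrete: an IPv4 address is multicast when it falls in 224.0.0.0/4, and mDNS uses the group 224.0.0.251 (UDP port 5353, RFC 6762), which is one of the destinations in the dumps earlier in this thread. A minimal sketch of that classification follows; the helper names are made up for illustration and are not lwIP or ESP8266mDNS functions.

```cpp
#include <cassert>
#include <cstdint>

// IPv4 multicast is 224.0.0.0/4: the top four bits of the first octet are 1110.
constexpr bool isIpv4Multicast(uint8_t firstOctet) {
  return (firstOctet & 0xF0u) == 0xE0u;
}

// mDNS uses the link-local multicast group 224.0.0.251.
constexpr bool isMdnsGroup(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return a == 224 && b == 0 && c == 0 && d == 251;
}

// Addresses taken from the logs in this thread.
inline void exampleMulticast() {
  assert(isIpv4Multicast(224));   // 224.0.0.251
  assert(isIpv4Multicast(239));   // 239.255.215.74
  assert(!isIpv4Multicast(192));  // 192.168.4.1 is unicast
  assert(isMdnsGroup(224, 0, 0, 251));
}
```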

@devyte
Collaborator

devyte commented Jan 6, 2018

@NdK73 Please retest with the referenced PR.

@NdK73
Contributor Author

NdK73 commented Jan 6, 2018

I don't know how to import a PR into my local repo...
But I just copied the file at https://raw.githubusercontent.com/d-a-v/esp82xx-nonos-linklayer/ac6e34a69e9f1481dd102a8bc5920a12d7c503be/glue-esp/lwip-esp.c and
No WDT reset, but I'll need other tests. That's the problem with intermittent problems: you never know for sure if they're solved or not... :(

@d-a-v
Collaborator

d-a-v commented Jan 6, 2018

You can try this:

git fetch origin pull/4105/head:pr-4105
git checkout pr-4105
git branch

When you are done testing

git checkout master
git branch

@d-a-v
Collaborator

d-a-v commented Jan 6, 2018

About what I said before

  • ip4_set_default_multicast_netif() does not need to be used
  • by default igmp loops through all available interfaces
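
The idea discussed here (visit every interface, but do not hand packets to one that is not usable) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyNetif` and `multicastTargets` are hypothetical names, not lwIP's `struct netif` or any lwIP function, and the code does not reproduce the actual fix in the referenced PR. The interface names `st0` and `ap1` are the ones that appear in the debug logs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a network interface; NOT lwIP's struct netif.
struct ToyNetif {
  std::string name;
  bool usable;  // true once the interface is initialized and up
};

// Returns the names of the interfaces a multicast packet would be handed to
// when every interface is visited but unusable ones are skipped.
inline std::vector<std::string> multicastTargets(
    const std::vector<ToyNetif>& netifs) {
  std::vector<std::string> targets;
  for (const ToyNetif& n : netifs) {
    if (n.usable) targets.push_back(n.name);
  }
  return targets;
}

// Station-only configuration, as in the sketch from this issue:
// the soft-AP interface exists but is not initialized.
inline void exampleTargets() {
  const std::vector<ToyNetif> netifs = {{"st0", true}, {"ap1", false}};
  const std::vector<std::string> targets = multicastTargets(netifs);
  assert(targets.size() == 1 && targets[0] == "st0");
}
```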

d-a-v added a commit to d-a-v/Arduino that referenced this issue Jan 6, 2018
@NdK73
Contributor Author

NdK73 commented Jan 6, 2018

Tks. I followed your instructions, but lwip-esp.c is the old version (18928 bytes vs 20277 bytes for the file I manually downloaded); it actually seems not to have been touched:

ndk@arwen:~/Arduino/hardware/esp8266com/esp8266$ git checkout pr-4105
M	tools/sdk/lwip2/builder
M	tools/sdk/lwip2/include/gluedebug.h
M	tools/sdk/lwip2/include/lwipopts.h
Switched to branch 'pr-4105'

Anyway, it keeps working (but I'll have to check it better after successfully connecting to another network).
Tks a lot!

@devyte
Collaborator

devyte commented Jan 6, 2018

@NdK73 I'll be merging the PR once CI passes, pls retest then.

devyte pushed a commit that referenced this issue Jan 6, 2018
@devyte devyte added this to the 2.5.0 milestone Jan 7, 2018
incosystem pushed a commit to incosys/Arduino that referenced this issue Jan 8, 2018
@d-a-v
Collaborator

d-a-v commented Jan 11, 2018

@NdK73 can you confirm that it's OK with the git version of the core?

To clear modifications made to your local repository and update, you can:

git diff
<check that nothing you did is to be kept>
git stash
<rename directory tools/sdk/lwip2/builder to something else>
git checkout tools/sdk/lwip2/builder
git pull origin master
<restart the IDE>

@NdK73
Contributor Author

NdK73 commented Jan 11, 2018

Done. Seems it does not crash... at least for now :)
Used lwIP2 MSS=536. Does that mean that I can't receive a UDP packet longer than 536 bytes?

Now there seems to be an interaction between analogRead() and SPI transfers that crashes TFT_eSPI, but that's another (sad) story. But why do I have to find all these race conditions? :(

@d-a-v
Collaborator

d-a-v commented Jan 12, 2018

WiFi is limited by MTU which is always 1460 on esp8266 (initialized by the physical layer). This applies to IP (and TCP, UDP).
Besides, TCP limits itself with MSS.
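
Put as arithmetic: MSS is a TCP-only limit, so the MSS=536 build option does not cap UDP datagrams; a single unfragmented UDP packet is bounded by the MTU minus the IPv4 and UDP headers. The sketch below takes the MTU as a parameter and, purely as an input, uses the 1460-byte figure quoted in this comment. Header sizes assume no IP or TCP options, and the function names are made up for illustration.

```cpp
#include <cassert>

// Header sizes without options.
constexpr int kIpv4HeaderBytes = 20;
constexpr int kUdpHeaderBytes = 8;
constexpr int kTcpHeaderBytes = 20;

// Largest UDP payload that fits in one unfragmented IPv4 packet.
constexpr int maxUdpPayload(int mtu) {
  return mtu - kIpv4HeaderBytes - kUdpHeaderBytes;
}

// Largest TCP payload per segment that fits in one unfragmented IPv4 packet.
constexpr int maxTcpPayload(int mtu) {
  return mtu - kIpv4HeaderBytes - kTcpHeaderBytes;
}

inline void exampleSizes() {
  assert(maxUdpPayload(1460) == 1432);  // well above the 536-byte MSS option
  assert(maxTcpPayload(1460) == 1420);
}
```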

@igrr igrr modified the milestones: 2.5.0, 2.4.1 Jan 16, 2018
@d-a-v
Collaborator

d-a-v commented Jan 17, 2018

@NdK73 I'll let you close this issue if everything is OK regarding the original post.

@NdK73
Contributor Author

NdK73 commented Jan 18, 2018

Haven't had the issue anymore after the latest pulls. Seems resolved. Closing.
Tks for the support.

@NdK73 NdK73 closed this as completed Jan 18, 2018
@barneyman

Pretty sure this commit also fixes a problem (hang / crash) I was seeing when flashing a Sonoff Basic with my own firmware using mDNS / MDNSserver

thanks
