Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*** PANIC *** sys_timeout: timeout != NULL, pool MEMP_SYS_TIMEOUT is empty #1281

Closed
tjko opened this issue Feb 25, 2023 · 17 comments
Closed
Assignees

Comments

@tjko
Copy link
Contributor

tjko commented Feb 25, 2023

After upgrading to SDK 1.5.0, I noticed significant regression with the WiFi stack.

Trying to access simple (LwIP) web server causes pretty much immediate panic/crash:

[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9

*** PANIC ***

sys_timeout: timeout != NULL, pool MEMP_SYS_TIMEOUT is empty

Is this perhaps known issue? This does not happen on SDK 1.4.0...

@tjko
Copy link
Contributor Author

tjko commented Feb 25, 2023

Found #1106 and based on that increased MEMP_NUM_SYS_TIMEOUT and that seems to have helped with the panics (at least not seeing immediate crashing anymore):

#define MEMP_NUM_SYS_TIMEOUT        LWIP_NUM_SYS_TIMEOUT_INTERNAL + 8

Did LwIP perhaps get updated in 1.5.0 release?

However, still seeing steady stream of these got unexpected packet -9 errors:

[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
...

Looks like why I didn't see above errors on 1.4.0 is because of this change: georgerobotics/cyw43-driver#33 (comment)

I noticed this change #1039 , are these warnings from CYW43 driver now "safe" on 1.5.0 and no risk these causing deadlock anymore?

@manofmanyhats
Copy link

I too get the PANIC on version 1.5.0 - I did not on on 1.4.0
However, I don't see "unexpected packet" errors that @tjko does.
My scenario is:
Connect to wifi; cyw43_tcpip_link_status(&cyw43_state, CYW43_ITF_STA) returns CYW43_LINK_UP for the first time, so I start connecting to the MQTT server...

 cyw43_arch_lwip_begin();
 mqttClient = mqtt_client_new();

 memset(&ci, 0, sizeof(ci));
 ci.client_id = DEVICE_NAME;
 ci.keep_alive = 10;
 
  /* Setup callback for incoming publish requests */
  mqtt_set_inpub_callback(mqttClient, publishCallback, incomingDataCallback, NULL);                                                                   
  cyw43_arch_lwip_end();

Next, mqtt_client_is_connected(mqttClient) returns false (it would because we haven't tried to connect yet)
so...

cyw43_arch_lwip_begin();
err_t err = mqtt_client_connect(mqttClient, &ip_addr, 1883, connectionStatusCallback, 0, &ci);
cyw43_arch_lwip_end();

At this point, err == ERR_OK, so all good.
But then before the callback routine is called, I get:

*** PANIC ***

sys_timeout: timeout != NULL, pool MEMP_SYS_TIMEOUT is empty

@peterharperuk
Copy link
Contributor

are these warnings from CYW43 driver now "safe" on 1.5.0

That's the hope. The "got unexpected packet" warnings should be harmless and can be ignored. The issue has been raised with Infineon.

@peterharperuk
Copy link
Contributor

@manofmanyhats What value of MEMP_NUM_SYS_TIMEOUT are you using? Have you tried increasing it? This sounds like an issue in your code.

@tjko
Copy link
Contributor Author

tjko commented Feb 28, 2023

For MEMP_NUM_SYS_TIMEOUT I was using the default (from LwIP), as "lwipopts.h" from pico-examples (that I was using as starting point) didn't change it from default...

LwIP default seems to be these:

#define  LWIP_NUM_SYS_TIMEOUT_INTERNAL   (LWIP_TCP + IP_REASSEMBLY + LWIP_ARP + (2*LWIP_DHCP) + LWIP_AUTOIP + LWIP_IGMP + LWIP_DNS + PPP_NUM_TIMEOUTS + (LWIP_IPV6 * (1 + LWIP_IPV6_REASS + LWIP_IPV6_MLD)))
#define  MEMP_NUM_SYS_TIMEOUT   LWIP_NUM_SYS_TIMEOUT_INTERNAL

(https://www.nongnu.org/lwip/2_1_x/group__lwip__opts__memp.html)

Seems like something must have changed since there was no need to change these on 1.4.0, but with 1.5.0 seems like it is needed...

@manofmanyhats
Copy link

So, MEMP_NUM_SYS_TIMEOUT is defined in many include files - you have to change the correct one!
When I change the one in ./lib/lwip/src/include/lwip/opt.h I can get a system that works...
Original value:
#define MEMP_NUM_SYS_TIMEOUT LWIP_NUM_SYS_TIMEOUT_INTERNAL gives the PANIC
Increasing by just 1:
#define MEMP_NUM_SYS_TIMEOUT (LWIP_NUM_SYS_TIMEOUT_INTERNAL+1) does not give the panic, and allows the system to work.

Side note in case it helps anyone else; When I first got my mqtt code working with 1.4.0, there was an issue with mqtt keep alive what was affecting me. As a work around for this, I set keep_alive = 0; in the mqtt_connect_client_info_t that I passed to mqtt_client_connect(), and setup my own 14 second timer to send a dummy message to keep the server happy. This code is still in place.

@peterharperuk
Copy link
Contributor

Does anyone have an example that works in pico-sdk 1.4.0 and not 1.5.0? A lot has changed, but not lwip. It's possible timeout behaviour has been changed indirectly with the async changes we've made. Apart from that, there's MDNS support, which requires an extra timeout, but you need to enable that feature.

So, MEMP_NUM_SYS_TIMEOUT is defined in many include files

I would expect you to have your own lwipopts.h file and add it to that?

@manofmanyhats
Copy link

OK, so my code was working with 1.4.0.
I upgraded to 1.5.0 and I got the PANIC detailed above.
I added:
#define MEMP_NUM_SYS_TIMEOUT (LWIP_NUM_SYS_TIMEOUT_INTERNAL+1)
to my local "lwipopts.h" (and restored the changes I'd previously made to the real opt.h) and now everything works as it should again.

For me, that is the only change that is required.

@earlephilhower
Copy link
Contributor

I can confirm the same issue w/LWIP timers for the arduino-pico WiFi stack. As I did not wish to edit the Pico-SDK code directly, I've settled on making one of the components of LWIP_NUM_SYS_TIMEOUT_INTERNAL to 2 instead of 1 in lwipopts.h and afterwards the 1.5 LWIP stack seems pretty solid:

#define LWIP_ARP                    2

That said, because I was moving from my own custom CYW43 Ethernet packet<->LWIP hook I assumed I was doing something wrong on my end and worked around it vs. deep debugging.

I think if we can get a trace of the timer allocation call stacks (i.e. a gdb in 1.4 vs. 1.5 it would be instructive, something like:

set pagination off
br sys_timeout_abs
commands
   where
   continue
   end

That would show where the extra call to allocate a timer is coming from. Unfortunately, again, my own code path changed a lot so I can't really compare apples to apples only changing the SDK. If you have a simpler app that just swaps the SDK version, then it may be much simpler to see where the difference comes from.

@tjko
Copy link
Contributor Author

tjko commented Mar 1, 2023

I enabled following (in "lwipopts.h") to log 'timers' activity:

#define LWIP_DEBUG_TIMERNAMES       1
#define TIMERS_DEBUG                LWIP_DBG_ON

When comparing differences in debug output between 1.4.0 and 1.5.0 , somewhat surprisingly I didn't see any additional timers being used. I saw following (7) timers being used:

dhcp_fine_tmr
dns_tmr
etharp_tmr
ip_reass_tmr
lwip_cyclic_timer
sntp_request
tcpip_tcp_timer

While the default "formula" for LWIP_NUM_SYS_TIMEOUT_INTERNAL seemed to return 6 (in my use case) both on 1.4.0 and 1.5.0, so it seems I've always been running with one too few slots for timers...

Callin stats_display() (from LwIP) on 1.4.0 shows this about timers:

MEM SYS_TIMEOUT
	avail: 6
	used: 6
	max: 6
	err: 1

So, I suspect for some reason on 1.4.0 running out of timer 'slots' doesn't always trigger a panic, while on 1.5.0 it seems to be causing it always... (on 1.5.0 I don't get far enough to run stats_display()).

If I increment the MEMP_NUM_SYS_TIMEOUT (or LWIP_NUM_SYS_TIMEOUT_INTERNAL) by 1 so that things are 'stable' on 1.5.0, I get this:

MEM SYS_TIMEOUT
	avail: 7
	used: 7
	max: 7
	err: 0

@peterharperuk
Copy link
Contributor

Great - thanks for investigating further. So I guess this was caused by us changing LWIP_PLATFORM_ASSERT from the clearly wrong...

#define LWIP_PLATFORM_ASSERT(x) do { if(!(x)) while(1); } while(0)

To the correct implementation...

#define LWIP_PLATFORM_ASSERT(x) panic(x)

I guess we should have emphasised this a bit more...

commit 16ab2b21baae91d52b0b63378945403005ef470d
Author: James Peach <JamesWP@users.noreply.github.com>
Date:   Fri Oct 21 14:57:21 2022 +0100

  Remove incorrect override for LWIP_PLATFORM_ASSERT (#1064)
   
   * Remove incorrect override for LWIP_PLATFORM_ASSERT
   
   LWIP_PLATFORM_ASSERT macro is used by lwip driver implementations to specify the behavior of the assertions in lwip code.
   
   The previous override of this macro incorrectly assumed that the parameter to the function macro was the condition to check. However this is incorrect. The parameter is actually a message string defining what failed.
   
   This mistake caused all assertions to be ignored. ( and myself to loose many hours of debugging time )
   
   By removing this, we restore the default behavior specified by lwip which is to use `printf` to log the message.
   
   * I think i prefer to use panic (which doesn't pull in fflush etc)
   
   Co-authored-by: Graham Sanderson <graham.sanderson@gmail.com>

@tjko
Copy link
Contributor Author

tjko commented Mar 5, 2023

I guess this can be closed.

However, maybe https://github.com/raspberrypi/pico-examples/blob/master/pico_w/wifi/lwipopts_examples_common.h
could have (commented out) example on how to increase MEMP_NUM_SYS_TIMEOUT ?

@Windwoes
Copy link

Windwoes commented Jul 1, 2023

Does anyone have an example that works in pico-sdk 1.4.0 and not 1.5.0? A lot has changed, but not lwip. It's possible timeout behaviour has been changed indirectly with the async changes we've made. Apart from that, there's MDNS support, which requires an extra timeout, but you need to enable that feature.

So, MEMP_NUM_SYS_TIMEOUT is defined in many include files

I would expect you to have your own lwipopts.h file and add it to that?

Why does the default formula not account for MDNS being enabled?

@Windwoes
Copy link

Windwoes commented Jul 1, 2023

FWIW, I get the panic when using MDNS unless I add three additional timers to my lwipopts.h

#define MEMP_NUM_SYS_TIMEOUT            (LWIP_NUM_SYS_TIMEOUT_INTERNAL+3)

@peterharperuk
Copy link
Contributor

Yes, I see the same. The only MDNS example I have is unfortunately not pushed yet for various reasons https://github.com/peterharperuk/pico-examples/tree/add_httpd

@Windwoes
Copy link

Windwoes commented Jul 3, 2023

I may go ahead and PR an MDNS example since I have it working

@raarts
Copy link

raarts commented Jan 7, 2024

For whoever gets here: I got this error because I forgot to #define LWIP_SOCKETS 1 while trying to use the httpd server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants