Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite PicoW LWIP interface, major stability increase #916

Merged
merged 8 commits into from
Oct 15, 2022

Conversation

earlephilhower
Copy link
Owner

@earlephilhower earlephilhower commented Oct 15, 2022

Rewrite PicoW LWIP interface for major stability increase

Random crashes, infinite loops, and other lockups were happening to the PicoW
while under high load from multiple clients.

This seems to have been due to two issues:

  • The periodic sys_check_timeouts() call from an alarm/IRQ was happening while
    the core was in LWIP code. LWIP is not re-entrant or multi-cire/thread safe
    so this is a bad thing. Some calls may not have been locked with a manual
    addition of the LWIPMutex object to hit this.
  • The WiFi driver supplies packet data during an interrupt. PBUF work is
    legal in an interrupt, but actually calling netif->input() from an IRQ to
    queue up the Ethernet packet for processing is illegal if LWIP is already
    in progress.

Rearchitect the LWIP interface to fix these problems:

  • Disable interrupts during malloc/etc. to avoid the possibility of the
    periodic LWIP timeout check interrupting and potentially calling user
    code which did a memory operation
  • Wrap all used LWIP calls to note LWIP code will be executing, instead of
    manually eyeballing and adding in protection in user code.
  • Remove all user code LWIPMutex blocking, the wrapping takes care of it.
  • When an Ethernet packet is received by interrupt and we're in LWIP code,
    just throw the packet away for now. The upper layers can handle retransmit.
    (A possible optimization would be to set the packet aside and add a
    housekeeping portion to the LWIP wrappers to netif->input() them when safe).
  • Ignore callbacks during TCP connection teardown when the ClientContext
    passed from LWIP == nullptr

Fixes random crashes while on the PicoW with WiFi.

The system malloc/etc. is not re-entrant, so it is not safe to do
any memory (de)allocation from an interrupt.

The LWIP stack is configured not to use the memory manager, but it may
end up calling callback functions in userspace which may well use
new or malloc() (i.e. to manage a new connection info or process a
UDP packet like in MDNS).

This could cause unpredicatable crashes due to memory corruption at
some random time after the interrupt/malloc occurred.

Avoid the issue completely by locking interrupts out while in the memory
manager.  Inter-core will be protected by the malloc_lock already in
Newlib.

This should not have any effect on absolute performance of the system.
Brute-force protect all LWIP calls from re-rentrancy
Was causing memory corruption during long, high frequency, multi-client
workloads.  It is illegal to call netif->input() from an IRQ while
in LWIP code.

For now, throw them out and let upper layer handle a re-transmit.
@earlephilhower
Copy link
Owner Author

earlephilhower commented Oct 15, 2022

A while true; do curl ...; done job fetching without any rate limiting as well as two to four web clients (plus random MDNS queries) all fetching the AdvancedWebServer page has been running at full tilt for 90+ mins with these patches:

During a ClientContext we set TCP connection arguments to nullptr and then
turn off all the callbacks.  It is possible for an interrupt to occur
between setting the arg nullptr and the clearing of all CBs.  The the CB
function will get nullptr as (this), causing a crash like:
````
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
isr_hardfault () at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:98
98	decl_isr_bkpt isr_hardfault
(gdb) where
   e/Arduino/hardware/pico/rp2040/libraries/WiFi/src/include/ClientContext.h:619
   ries/WiFi/src/include/ClientContext.h:612
   raries/WiFi/src/include/ClientContext.h:671
    ) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/lib/lwip/src/core/tcp_in.c:541
    ) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/lib/lwip/src/core/ipv4/ip4.c:743
   rdware/pico/rp2040/pico-sdk/lib/lwip/src/netif/ethernet.c:186
    buf=0x20001b7a <cyw43_state+226> "(\315\301") at /home/earle/Arduino/hardware/pico/rp2040/libraries/lwIP_CYW43/src/utility/CYW43shim.cpp:132
   2040/pico-sdk/lib/cyw43-driver/src/cyw43_ll.c:1212
   cores/rp2040/lwip_wrap.cpp:163
   e/ClientContext.h:85
   /libraries/WiFi/src/WiFiClient.cpp:81
   /libraries/WiFi/src/WiFiClient.cpp:78
   rver.cpp:284
   ico/rp2040/libraries/WebServer/src/WebServerTemplate.h:86
````

Avoid by checking and silently ignoring CBs who have a nullptr arg.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant