
common/msg_queue: use more efficient form for large queues #7767

Merged

Conversation

rustyrussell (Contributor) commented Oct 25, 2024

We've had a report of connectd using a lot of CPU in memmove. This seems to be caused by a large queue, so:

  1. Use membuf, which is far more efficient than a simple array.
  2. Print a backtrace the first time a queue passes 100,000 entries.
  3. Fix the root cause: trace messages were not being suppressed.

Fixes: #7757

@rustyrussell rustyrussell added this to the v24.11 milestone Oct 25, 2024
@rustyrussell rustyrussell force-pushed the guilt/msgq-handle-longer branch 2 times, most recently from 0a72af2 to 37121ca Compare October 26, 2024 01:58
JssDWt (Contributor) commented Oct 26, 2024

Seeing an enormous number of backtraces being printed: about 9 million lines per 30 seconds. Sending a small sample:

```
Suppressed 8850603 messages from lightningd.service
0x56166a642fb2 do_enqueue
        common/msg_queue.c:65
0x56166a643043 msg_enqueue
        common/msg_queue.c:82
0x56166a63c91d daemon_conn_send
        common/daemon_conn.c:161
0x56166a6454f0 status_send
        common/status.c:90
0x56166a645804 status_vfmt
        common/status.c:169
0x56166a645bde status_backtrace_print
        common/subdaemon.c:12
0x56166a63c35f send_backtrace
        common/daemon.c:36
[... the same seven-frame cycle repeats for the rest of the sample ...]
```
@rustyrussell rustyrussell force-pushed the guilt/msgq-handle-longer branch from 37121ca to bde4a92 Compare October 27, 2024 05:59
Based on CPU consumption in memmove with the current naive approach.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the guilt/msgq-handle-longer branch 2 times, most recently from b770075 to ce6bf9a Compare October 28, 2024 02:03
@rustyrussell rustyrussell changed the title common/msg_queue: use a linked list, not an array. common/msg_queue: use more efficient form for large queues Oct 28, 2024
Scary looking, but great for debugging!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When this (very spammy) "handle_recv_gossip" message was changed
from debug to trace, the suppression code wasn't updated: we suppress
overly active debug messages, but not trace messages.

This is the backtrace from an earlier version of the "too large queue"
patch:

```
lightning_gossipd: msg_queue length excessive (version v24.08.1-17-ga780ad4-modded)
0x557e521e833f send_backtrace
        common/daemon.c:33
0x557e521eefb9 do_enqueue
        common/msg_queue.c:66
0x557e521ef043 msg_enqueue
        common/msg_queue.c:82
0x557e521e891d daemon_conn_send
        common/daemon_conn.c:161
0x557e521f14f0 status_send
        common/status.c:90
0x557e521f1804 status_vfmt
        common/status.c:169
0x557e521f1433 status_fmt
        common/status.c:180
0x557e521de7c6 handle_recv_gossip
        gossipd/gossipd.c:206
0x557e521de9f5 connectd_req
        gossipd/gossipd.c:307
0x557e521e862d handle_read
        common/daemon_conn.c:35
```
The original complaint which caused my investigation was the 100% CPU
consumption of connectd, which we traced to the queue to gossipd.

However, the issue is not really connectd's overproduction, but
gossipd's underconsumption, probably caused by its own queueing issues
with the trace messages to lightningd, which the prior patch fixed.

Nonetheless, gossipd *can* get busy, and if we were to ask multiple
nodes for full gossip, we could see a few hundred thousand messages
come in at once.  Hence I'm increasing the warning limit to 250,000
messages.

This commit is also where we attach the Changelog message, even
though it was really "common/msg_queue: use membuf for greater efficiency"
and "gossipd: fix excessive msg_queue length from status_trace()" that
solved the problem.

Here's the backtrace from a previous debug patch:

```
lightning_connectd: msg_queue length excessive (version v24.08.1-17-ga780ad4-modded)
0x5580534051f0 send_backtrace
        common/daemon.c:33
0x55805340bd5b do_enqueue
        common/msg_queue.c:66
0x55805340bde5 msg_enqueue
        common/msg_queue.c:82
0x5580534057ce daemon_conn_send
        common/daemon_conn.c:161
0x5580533fe3ff handle_gossip_in
        connectd/multiplex.c:624
0x5580533ff23b handle_message_locally
        connectd/multiplex.c:763
0x5580533ff2d6 read_body_from_peer_done
        connectd/multiplex.c:1112
```

Reported-by: https://github.com/JssDWt
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: `connectd` and `gossipd` message queues are much more efficient.
@rustyrussell rustyrussell force-pushed the guilt/msgq-handle-longer branch from ce6bf9a to 40f3058 Compare October 28, 2024 09:56
@rustyrussell rustyrussell merged commit 183da39 into ElementsProject:master Nov 1, 2024
39 checks passed
yaslama added a commit to breez/lightning that referenced this pull request Nov 21, 2024
Linked issue closed by this pull request: connectd consistently taking 100% cpu