-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common/msg_queue: use more efficient form for large queues #7767
Merged
rustyrussell
merged 4 commits into
ElementsProject:master
from
rustyrussell:guilt/msgq-handle-longer
Nov 1, 2024
Merged
common/msg_queue: use more efficient form for large queues #7767
rustyrussell
merged 4 commits into
ElementsProject:master
from
rustyrussell:guilt/msgq-handle-longer
Nov 1, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
rustyrussell
force-pushed
the
guilt/msgq-handle-longer
branch
2 times, most recently
from
October 26, 2024 01:58
0a72af2
to
37121ca
Compare
Seeing enormous amounts of backtraces being printed. About 9 million lines per 30 seconds. Sending a small sample:
|
rustyrussell
force-pushed
the
guilt/msgq-handle-longer
branch
from
October 27, 2024 05:59
37121ca
to
bde4a92
Compare
Based on CPU consumption in memmove with the current naive approach. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell
force-pushed
the
guilt/msgq-handle-longer
branch
2 times, most recently
from
October 28, 2024 02:03
b770075
to
ce6bf9a
Compare
rustyrussell
changed the title
common/msg_queue: use a linked list, not an array.
common/msg_queue: use more efficient form for large queues
Oct 28, 2024
Scary looking, but great for debugging! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When this (very spammy) "handle_recv_gossip" message was changed from debug to trace, the suppression code wasn't updated: we suppress overly active debug messages, but not trace messages. This is the backtrace from an earlier version of the "too large queue" patch: ``` lightning_gossipd: msg_queue length excessive (version v24.08.1-17-ga780ad4-modded) 0x557e521e833f send_backtrace common/daemon.c:33 0x557e521eefb9 do_enqueue common/msg_queue.c:66 0x557e521ef043 msg_enqueue common/msg_queue.c:82 0x557e521e891d daemon_conn_send common/daemon_conn.c:161 0x557e521f14f0 status_send common/status.c:90 0x557e521f1804 status_vfmt common/status.c:169 0x557e521f1433 status_fmt common/status.c:180 0x557e521de7c6 handle_recv_gossip gossipd/gossipd.c:206 0x557e521de9f5 connectd_req gossipd/gossipd.c:307 0x557e521e862d handle_read common/daemon_conn.c:35 ```
The original complaint which caused my investigation was the 100% CPU consumption of connectd, which we traced to the queue to gossipd. However, the issue is not really connectd's overproduction, but gossipd's underconsumption, probably caused by its own queueing issues with the trace messages to lightningd, which the prior patch fixed. Nonetheless, gossipd *can* get busy, and if we were to ask multiple nodes for full gossip, we could see a few hundred thousand messages come it at once. Hence I'm increasing the warning limit to 250,000 messages. This commit is also where we attach the Changelog message, even though it's really "common/msg_queue: use membuf for greater efficiency." and "gossipd: fix excessive msg_queue length from status_trace()" which solved the problem. Here's the backtrace from a previous debug patch: ``` lightning_connectd: msg_queue length excessive (version v24.08.1-17-ga780ad4-modded) 0x5580534051f0 send_backtrace common/daemon.c:33 0x55805340bd5b do_enqueue common/msg_queue.c:66 0x55805340bde5 msg_enqueue common/msg_queue.c:82 0x5580534057ce daemon_conn_send common/daemon_conn.c:161 0x5580533fe3ff handle_gossip_in connectd/multiplex.c:624 0x5580533ff23b handle_message_locally connectd/multiplex.c:763 0x5580533ff2d6 read_body_from_peer_done connectd/multiplex.c:1112 ``` Reported-by: https://github.com/JssDWt Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Fixed: `connectd` and `gossipd` message queues are much more efficient.
rustyrussell
force-pushed
the
guilt/msgq-handle-longer
branch
from
October 28, 2024 09:56
ce6bf9a
to
40f3058
Compare
yaslama
added a commit
to breez/lightning
that referenced
this pull request
Nov 21, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We've had a report of connectd using a lot of CPU in memmove. This seems to be a large queue, so:
Fixes: #7757