-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connectd: don't insist on ping replies when other traffic is flowing. #5347
Conversation
Tried both 94e7cac and e3a5fa9 applied against v0.11.2rc3. ACINQ's node is still reconnecting at intervals of less than a minute.
Trying full debug next ... |
Debug logs no longer doing any of these unreturned ping hangups.
Immediately after I receive these it disconnects every minute. Prior to the less-ping patch it did disconnect in this way sometimes according to earlier debug logs. |
Here is what we have on our side:
It seems we receive a weird message
This is the decimal representation of edit: this message type is apparently related to peerswap : https://github.com/ElementsProject/peerswap/blob/24ad6a6d51e64a8aa800de90f8f15eede9020ad4/docs/peer-protocol.md |
If the current chunk of data read from the TCP connection only contains unknown messages (in particular, could be only one isolated unknown message on an otherwise idle connection), we never resumed reading on the connection. This means all subsequent messages, including pings/pongs, won't be read, which is why the most visible effect is disconnecting due to no response to ping. Related to ElementsProject/lightning#5347.
So, the root cause is a bug in eclair ACINQ/eclair#2332, triggered by the unknown message. The ping issue is just a consequence. |
Confirmed the Eclair fix stopped the disconnects. Regarding this PR, while this made it easier to spot what was going on I suspect it might be more correct to keep the previous behavior as that would be more robust in attempting to recover from bad Tor circuits? Thank you to everyone involved. Sorry for my own confusion. I learned a lot in the process and hopefully can be more useful in the future. |
Thank you for reporting the issue @wtogami! |
I suspect we're better off closing this PR? CLN is already doing the right thing. |
I still like the PR: I know LND can ahve issues if it's streaming lots of gossip to us. But it's not urgent, so I'll change milestone, and also will rework description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK dc7cb01
If the current chunk of data read from the TCP connection only contains unknown messages (in particular, could be only one isolated unknown message on an otherwise idle connection), we never resumed reading on the connection. This means all subsequent messages, including pings/pongs, won't be read, which is why the most visible effect is disconnecting due to no response to ping. Related to ElementsProject/lightning#5347. Reported by @wtogami.
Got complaints about us hanging up on some nodes because they don't respond to pings in a timely manner (e.g. ACINQ?), but that turned out to be something else. Nonetheless, we've had reports in the past of LND badly prioritizing gossip traffic, and thus important messages can get queued behind gossip dumps! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Changed: connectd: give busy peers more time to respond to pings.
Fixes 2569999 (pytest: fix flake in test_node_reannounce). Converting ->set->list does not give deterministic order, so we can end up with the two lists differing due to order: ``` May send its own announcement *twice*, since it always spams us. msgs2 = list(set(msgs2)) > assert msgs == msgs2 E AssertionError: assert ['01012ff5580...000000000000'] == ['01014973d81...000000000000'] E At index 0 diff: '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000' != '01014973d8160dd8fc28e8fb25c40b9d5c68aed8dfb36af9fc13e4d2040fb3718553051a188ce98239c0bed138e1f8713a64acc7de98c183c9597fa58bf37f0b89bb0007800000080269a262bbd16c022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d336333626132392d6d6f6464656400000000000000' E Full diff: E [ E + '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000', E '01014973d8160dd8fc28e8fb25c40b9d5c68aed8dfb36af9fc13e4d2040fb3718553051a188ce98239c0bed138e1f8713a64acc7de98c183c9597fa58bf37f0b89bb0007800000080269a262bbd16c022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d336333626132392d6d6f6464656400000000000000', E - '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000', E ] `` Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 30c1346
Very useful have this so I can catch why a ping is failing in lnmetrics
If the current chunk of data read from the TCP connection only contains unknown messages (in particular, could be only one isolated unknown message on an otherwise idle connection), we never resumed reading on the connection. This means all subsequent messages, including pings/pongs, won't be read, which is why the most visible effect is disconnecting due to no response to ping. Related to ElementsProject/lightning#5347. Reported by @wtogami.
Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?). This only matters to us if no
other traffic is flowing, so mitigate that.
This may solve the problem because we've previously seen implementations
badly prioritize gossip traffic, and thus important messages can get queued
behind gossip dumps!