Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle connection better: make connectd know which peers are important #7630

Merged
merged 15 commits into from
Nov 25, 2024

Conversation

rustyrussell
Copy link
Contributor

This series changes connectd to maintain reconnections, rather than the rather convoluted logic we had before. lightningd simply tells it which peers are "important", based on which peers have channels. We also are smarter about saving the last known address, if we have a successfully established an outgoing connection previously.

This lets connectd be smarter in overload, too, when it needs to select a peer to evict. And it's a chance to update our "startup" logic which tried to space out connections: we keep 10 in flight now, and wait until lightingd has forked off a subdaemon before adding more (though we always let through one a second, for corner cases).

These changes should help large nodes.

@rustyrussell rustyrussell added the Highlight - Stability and Security Refinement of basics, prevention and cures label Aug 31, 2024
@rustyrussell rustyrussell added this to the v24.11 milestone Aug 31, 2024
@rustyrussell rustyrussell changed the title Handle connection better: make connectd know what are important Handle connection better: make connectd know which peers are important Sep 1, 2024
@rustyrussell rustyrussell force-pushed the guilt/ccan-io-scale branch 4 times, most recently from 28997b7 to 0e5f665 Compare November 21, 2024 03:51
@rustyrussell rustyrussell force-pushed the guilt/ccan-io-scale branch 2 times, most recently from aa59a03 to 030b46c Compare November 22, 2024 23:51
It's now trivial for us to do this ourselves, since we have gossmap.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Let lightningd feed us hints to try first, but we can extract the
addresses from node_announcement messages ourselves.

(Lightningd used to ask gossipd on our behalf: this is far simpler!)

One side effect of this is that we don't hand back address hints given to us
by lightningd: it would use these again for reconnecting.  This is breaks
test_sendpay_grouping, so we disable it temporarily.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In fact, only 951 of 17419 (5%) of node announcements are missing an address
(and gossipd doesn't know if we can connect to Tor addresses anyway) so
just check it *has* a node_announcement.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Once connectd is controlling reconnections, it'll need these.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…not just state.

We're going to use this to ask if there are any channels which make it
important to reconnect to the peer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we connected out, remember that address.  We always remember the last
address, but that may be an incoming address.  This is explicitly the last
outgoing address which worked.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is more useful than the last address, which may be it connecting
to us.  And use it when we restore it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rather than have lightningd call us repeatedly to try to connect, have
it tell us what peers are transient and aren't, and connectd will
automatically try to maintain that connection.

There's a new "downgrade_peer" message to tell it a peer is now
transient: to make it non-transient we simply tell connectd to
connect as a non-transient.

The first time, I missed that dual_open_control does its own state
transitions :(

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: `connectd` now handles maintaining/reconnecting to important peers, and we remember the last successful address we connected to.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The important flag replaces it, and now we can be more intelligent about
eviction in overload.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We wait until a connection fails, or a subd is connected to the peer,
before letting another one through.  This should prevent us from
overwhelming lightningd on large nodes, but unlike the previous back-off,
it's based on how fast lightningd is, not an arbitrary time.

We also let one through each second, in case we're connecting to many,
but not doing anything but gossip (e.g. 100 explicit connect
commands).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Reconnecting to peers at startup should be significantly faster (dependent on machine speed).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The seeker can send a full gossip query, which means the ping doesn't happen
(it needs 14-45 seconds of quiet!).

We disable the gossip_queries feature, so it doesn't ask.

```
    def test_ping_timeout(node_factory):
        # Disconnects after this, but doesn't know it.
        l1_disconnects = ['xWIRE_PING']
    
        l1, l2 = node_factory.get_nodes(2, opts=[{'dev-no-reconnect': None,
                                                  'disconnect': l1_disconnects},
                                                 {'dev-no-ping-timer': None}])
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    
        # This can take 10 seconds (dev-fast-gossip means timer fires every 5 seconds)
        l1.daemon.wait_for_log('seeker: startup peer finished', timeout=15)
        # Ping timers runs at 15-45 seconds, *but* only fires if also 60 seconds
        # after previous traffic.
>       l1.daemon.wait_for_log('dev_disconnect: xWIRE_PING', timeout=60 + 45 + 5)

tests/test_connection.py:4194: 
...
>                   raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E                   TimeoutError: Unable to find "[re.compile('dev_disconnect: xWIRE_PING')]" in logs.
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell merged commit 5701123 into ElementsProject:master Nov 25, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Highlight - Stability and Security Refinement of basics, prevention and cures
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant