This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Pre-connect on collator protocol #4381

Closed
eskimor opened this issue Nov 26, 2021 · 14 comments
Labels
T5-parachains_protocol This PR/Issue is related to Parachains features and protocol changes.

Comments

@eskimor
Member

eskimor commented Nov 26, 2021

With AURA for example collators know when it is their turn to produce a block and can initiate a connection to the backing group before actually having the collation ready, thus saving time and improving performance.

This ticket can be seen as a preparation task for #3428, but it should be much faster to implement and should help with issues like this until we have contextual execution ready.

As @ordian pointed out, we should also take care of collators disconnecting once they are done with their collation.

@ordian
Member

ordian commented Nov 26, 2021

This is only relevant at group rotation boundaries, right? Currently, every collator would try to connect to all group validators.

Another thing: should we maybe increase collator slots as an interim measure?

@eskimor
Member Author

eskimor commented Nov 26, 2021

Yes, although the current pre-connect logic was kind of broken, and I think we even removed it already? I would need to check that. Increasing the rotation time is not something we would like to do: we are already seeing issues with slow validators on Kusama causing very long block times for parachains, and an increased rotation time would make things worse.

@ordian
Member

ordian commented Nov 26, 2021

Pre-connection to the next group was removed in #4261.

How sure are we that the very long times are only caused at group rotation boundaries? I don't think that's the only contributing factor.

@eskimor
Member Author

eskimor commented Nov 26, 2021

No, it is not; it is just a contributing factor. The major problem is that TCP has slow start: the sender starts with a relatively small number of IP packets and then waits for all of them to be acknowledged. If things went well, the number of packets allowed in flight before requiring acknowledgement is increased, and the game continues. On high-latency connections, this waiting for acknowledgements completely kills the effective bandwidth.

You are right that this should only be a problem right now at group rotations, but unconditionally connecting to validators, regardless of whether one is a block producer, is a problem in its own right, given the limited number of incoming connections validators will accept. The effect on paritytech/substrate#10359 will be limited though - good point!
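The slow-start effect described above can be sketched with a back-of-the-envelope model. This is a simplification, not a faithful TCP implementation, and the packet counts and RTT figures below are illustrative assumptions:

```rust
/// Rough model of TCP slow start: the congestion window (cwnd) doubles
/// each round trip until everything has been sent, so transfer time is
/// dominated by round-trip count rather than raw bandwidth.
fn round_trips_to_send(total_packets: u32, initial_cwnd: u32) -> u32 {
    let mut sent = 0;
    let mut cwnd = initial_cwnd;
    let mut rtts = 0;
    while sent < total_packets {
        sent += cwnd;
        cwnd *= 2; // exponential growth during slow start
        rtts += 1;
    }
    rtts
}

fn main() {
    // Illustrative numbers: a ~1.5 MiB collation is on the order of 1000
    // full-size packets; a common initial window is 10 segments.
    let rtts = round_trips_to_send(1000, 10);
    // On a 200 ms RTT link, slow start alone costs rtts * 200 ms before
    // the connection is warm -- which is what pre-connecting would save.
    println!("{} round trips, {} ms at 200 ms RTT", rtts, rtts * 200);
}
```

Note that the cost scales with the number of round trips, so pre-connecting (warming up the connection before the collation is ready) moves this latency off the critical path.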

@bkchr
Member

bkchr commented Nov 26, 2021

> Pre-connection to the next group was removed in #4261.
>
> How sure are we that the very long times are only caused at group rotation boundaries? I don't think that's the only contributing factor.

We only connected to the next group when we wanted to distribute a collation, and validators would have disconnected us relatively fast anyway. So, this code should have had no effect at all.

@bkchr
Member

bkchr commented Nov 26, 2021

If you extend the "collator" interface with some signaling on when it would like to connect to the next validator, we can write code in Cumulus to tell you if you should do this or not, based on if the collator will be building a block in the next slot.
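A minimal sketch of what such a signal could look like. All names here are hypothetical, not the actual Cumulus or Polkadot API; it only illustrates the shape of the hook being proposed:

```rust
/// Hypothetical hook the embedder (e.g. Cumulus) would implement to tell
/// the collator protocol whether to pre-connect to the backing group.
trait PreConnectPolicy {
    /// True if this collator expects to author in `next_slot` and should
    /// therefore connect to the backing group ahead of time.
    fn should_pre_connect(&self, next_slot: u64) -> bool;
}

/// AURA-style round robin: the author of slot `s` is authority
/// `s % num_authorities`, so a collator knows its turn in advance.
struct AuraRoundRobin {
    our_index: u64,
    num_authorities: u64,
}

impl PreConnectPolicy for AuraRoundRobin {
    fn should_pre_connect(&self, next_slot: u64) -> bool {
        next_slot % self.num_authorities == self.our_index
    }
}

fn main() {
    let policy = AuraRoundRobin { our_index: 2, num_authorities: 5 };
    assert!(policy.should_pre_connect(7));  // 7 % 5 == 2: our turn
    assert!(!policy.should_pre_connect(8)); // not our turn, stay off
}
```

The point of the trait boundary is exactly the division of labor described above: the collator protocol asks, and the consensus-aware embedder answers based on slot assignment.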

@eskimor
Member Author

eskimor commented Nov 26, 2021

> If you extend the "collator" interface with some signaling on when it would like to connect to the next validator, we can write code in Cumulus to tell you if you should do this or not, based on if the collator will be building a block in the next slot.

Yes! That's exactly what I was aiming at.

@ordian
Member

ordian commented Dec 1, 2021

Currently, collators add validators to the reserved peerset when a collation is ready and only update it on the next collation.
That means that after a validator disconnects a collator, the collator will still try to reconnect, even if the validator is no longer in the assigned backing group.

As a short term fix, we could:

  1. Issue a connection request (add to the reserved peerset) on every relay parent activation. This ensures that we switch groups on time on the collator side and gives us pre-connection. The downside is that every collator will now try to occupy a slot.
  2. Increase the collation peer slots to 50 or 100 to counter the negative effects of 1. Currently, parachain teams have that many collators per parachain.
  3. Reduce the `inactive_collator: Duration::from_secs(24)` policy to 5 or 6 seconds.

@eskimor
Member Author

eskimor commented Dec 1, 2021

This feels a little hacky to me. I think we should go with the reduced vote requirement, which should already improve things a lot, and then do the pre-connect properly.

If I remember correctly, we have parachains with 300 collators; having them all connect, although most of them will not provide a collation to that backing group, does not sound ideal. Also, even if we set the incoming limit to, say, 300 while the real number is 310, the collators that actually need a connection might not be able to get one, making things worse. On top of that, allowing more open connections increases the attack surface for DoS attacks.

@ordian
Member

ordian commented Dec 1, 2021

I agree this is not the way to go, but it's something that can be implemented quickly for the next release. If, however, you think that pre-connect will be implemented very soon, then it's not needed.

But if we implement pre-connect properly, we should probably also issue an empty connection request to clear the reserved set after distributing a collation, so that a collator won't keep trying to connect and occupy a slot unnecessarily.

Reducing the `inactive_collator` policy also applies here.
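The connect-then-clear lifecycle being discussed could look roughly like this. The types and names below are hypothetical stand-ins; in the actual node this goes through the network bridge's reserved peerset handling:

```rust
/// Hypothetical stand-in for the collation protocol's reserved peerset.
#[derive(Default)]
struct ReservedValidators(Vec<String>);

impl ReservedValidators {
    /// Pre-connect: reserve connections to the current backing group.
    fn set_group(&mut self, group: &[&str]) {
        self.0 = group.iter().map(|v| v.to_string()).collect();
    }

    /// Done distributing the collation: issue an "empty" connection
    /// request so we stop occupying validator slots.
    fn clear(&mut self) {
        self.0.clear();
    }
}

fn main() {
    let mut reserved = ReservedValidators::default();
    reserved.set_group(&["val-a", "val-b"]); // ahead of our slot
    // ... produce and distribute the collation ...
    reserved.clear(); // free the slots for other collators
    assert!(reserved.0.is_empty());
}
```

The key invariant is symmetry: every pre-connect is paired with a clear once the collation is distributed, so slots are never held longer than needed.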

@eskimor
Member Author

eskimor commented Dec 1, 2021

Absolutely, you raised a very good point here - we should not only have logic for timing the connect, we should also make sure to disconnect once we are done.

@eskimor
Member Author

eskimor commented Dec 1, 2021

I wasn't aware that we are already running into problems here. In that case, yes let's just bump incoming slots to 100. Making sure collators connect to the right backing group makes sense as well obviously and then let's get this ticket going asap.

@slumber
Contributor

slumber commented Apr 11, 2023

Do we still need this, considering async backing is coming? Collators can take their time issuing a connection and sending candidates to validators, since they are no longer tied to the latest relay chain parent. And with the validator groups ring buffer, we no longer discard needed validators from the peerset, while we do drop unneeded ones.

@ordian
Member

ordian commented Apr 11, 2023

I think this can be dropped as an unneeded optimization with async backing.

@slumber closed this as not planned on Aug 21, 2023
4 participants