This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Pre-connect on collator protocol #4381

Closed
eskimor opened this issue Nov 26, 2021 · 14 comments
Labels
T5-parachains_protocol This PR/Issue is related to Parachains features and protocol changes.

Comments

@eskimor
Member

eskimor commented Nov 26, 2021

With AURA for example collators know when it is their turn to produce a block and can initiate a connection to the backing group before actually having the collation ready, thus saving time and improving performance.

This ticket can be seen as a preparation task for #3428, but it should be much faster to implement and should help with issues like this until we have contextual execution ready.

As @ordian pointed out, we should also take care of collators disconnecting once they are done with their collation.

@ordian
Member

ordian commented Nov 26, 2021

This is only relevant at group rotation boundaries, right? Currently, every collator would try to connect to all group validators.

Another thing: should we maybe increase collator slots as an interim measure?

@eskimor
Member Author

eskimor commented Nov 26, 2021

Yes, although the current pre-connect logic was kind of broken, and I think we even removed it already? I would need to check that. Increasing the rotation time is not something we would like to do: we are already seeing issues with slow validators on Kusama causing very long block times for parachains, and an increased rotation time would make things worse.

@ordian
Member

ordian commented Nov 26, 2021

Pre-connection to the next group was removed in #4261.

How sure are we that the very long times are only caused at group rotation boundaries? I don't think that's the only contributing factor.

@eskimor
Member Author

eskimor commented Nov 26, 2021

No, it is not; it is just a contributing factor. The major problem is that TCP has slow start: the sender starts with a relatively small number of IP packets and then waits for all of them to be acknowledged. If things went well, the number of packets allowed in flight before requiring acknowledgement is increased, and the game continues. On high-latency connections, this waiting for acknowledgements completely kills the effective bandwidth.

You are right that this should only be a problem right now at group rotations, but unconditionally connecting to validators, regardless of whether one is a block producer, is a problem in its own right, given the limited number of incoming connections validators will accept. The effect on paritytech/substrate#10359 will be limited though - good point!
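The slow-start effect described above can be sketched with a back-of-the-envelope model. This is a simplification, not a faithful TCP implementation, and the packet counts and RTT figures below are illustrative assumptions:

```rust
/// Rough model of TCP slow start: the congestion window (cwnd) doubles
/// each round trip until everything has been sent, so transfer time is
/// dominated by round-trip count rather than raw bandwidth.
fn round_trips_to_send(total_packets: u32, initial_cwnd: u32) -> u32 {
    let mut sent = 0;
    let mut cwnd = initial_cwnd;
    let mut rtts = 0;
    while sent < total_packets {
        sent += cwnd;
        cwnd *= 2; // exponential growth during slow start
        rtts += 1;
    }
    rtts
}

fn main() {
    // Illustrative numbers: a ~1.5 MiB collation is on the order of 1000
    // full-size packets; a common initial window is 10 segments.
    let rtts = round_trips_to_send(1000, 10);
    // On a 200 ms RTT link, slow start alone costs rtts * 200 ms before
    // the connection is warm -- which is what pre-connecting would save.
    println!("{} round trips, {} ms at 200 ms RTT", rtts, rtts * 200);
}
```

Note that the cost scales with the number of round trips, so pre-connecting (warming up the connection before the collation is ready) moves this latency off the critical path.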

@bkchr
Member

bkchr commented Nov 26, 2021

> Pre-connection to the next group was removed in #4261.
>
> How sure are we that the very long times are only caused at group rotation boundaries? I don't think that's the only contributing factor.

We only connected to the next group when we wanted to distribute a collation, and validators would have disconnected us relatively fast anyway. So, this code should have had no effect at all.

@bkchr
Member

bkchr commented Nov 26, 2021

If you extend the "collator" interface with some signaling on when it would like to connect to the next validator, we can write code in Cumulus to tell you if you should do this or not, based on if the collator will be building a block in the next slot.
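A minimal sketch of what such a signal could look like. All names here are hypothetical, not the actual Cumulus or Polkadot API; it only illustrates the shape of the hook being proposed:

```rust
/// Hypothetical hook the embedder (e.g. Cumulus) would implement to tell
/// the collator protocol whether to pre-connect to the backing group.
trait PreConnectPolicy {
    /// True if this collator expects to author in `next_slot` and should
    /// therefore connect to the backing group ahead of time.
    fn should_pre_connect(&self, next_slot: u64) -> bool;
}

/// AURA-style round robin: the author of slot `s` is authority
/// `s % num_authorities`, so a collator knows its turn in advance.
struct AuraRoundRobin {
    our_index: u64,
    num_authorities: u64,
}

impl PreConnectPolicy for AuraRoundRobin {
    fn should_pre_connect(&self, next_slot: u64) -> bool {
        next_slot % self.num_authorities == self.our_index
    }
}

fn main() {
    let policy = AuraRoundRobin { our_index: 2, num_authorities: 5 };
    assert!(policy.should_pre_connect(7));  // 7 % 5 == 2: our turn
    assert!(!policy.should_pre_connect(8)); // not our turn, stay off
}
```

The point of the trait boundary is exactly the division of labor described above: the collator protocol asks, and the consensus-aware embedder answers based on slot assignment.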

@eskimor
Member Author

eskimor commented Nov 26, 2021

> If you extend the "collator" interface with some signaling on when it would like to connect to the next validator, we can write code in Cumulus to tell you if you should do this or not, based on if the collator will be building a block in the next slot.

Yes! That's exactly what I was aiming at.

@ordian
Member

ordian commented Dec 1, 2021

Currently, collators add validators to the reserved peerset when a collation is ready and only update it on the next collation.
That means that after a validator disconnects a collator, the collator will still try to reconnect, even if the validator is no longer in the assigned backing group.

As a short term fix, we could:

  1. Issue a connection request (add to the reserved peerset) on every relay parent activation. This ensures that we switch groups on time on the collator side and gives us pre-connection. The downside is that every collator will now try to occupy a slot.
  2. Increase the collation peer slots to 50 or 100 to counter the negative effects of 1. Currently, parachain teams have that many collators per parachain.
  3. Reduce the `inactive_collator: Duration::from_secs(24)` policy to 5 or 6 seconds.

@eskimor
Member Author

eskimor commented Dec 1, 2021

This feels a little hacky to me. I think we should go with the reduced vote requirement, which should already improve things a lot, and then do the pre-connect properly.

If I remember correctly, we have parachains with 300 collators; having them all connect, although most of them will not provide a collation to that backing group, does not sound ideal. Also, even if we set the incoming limit to, say, 300 while the real number is 310, the collators that actually need a connection might not be able to get one, making things worse. On top of that, allowing more open connections increases the attack surface for DoS attacks.

@ordian
Member

ordian commented Dec 1, 2021

I agree this is not the way to go, but it's something that can be implemented quickly for the next release. If, however, you think that pre-connect will be implemented very soon, then it's not needed.

But if we implement pre-connect properly, we should probably also issue an empty connection request to clear the reserved set after distributing a collation, so that a collator won't keep trying to connect and occupy a slot unnecessarily.

Reducing the `inactive_collator` policy also applies here.
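The connect-then-clear lifecycle being discussed could look roughly like this. The types and names below are hypothetical stand-ins; in the actual node this goes through the network bridge's reserved peerset handling:

```rust
/// Hypothetical stand-in for the collation protocol's reserved peerset.
#[derive(Default)]
struct ReservedValidators(Vec<String>);

impl ReservedValidators {
    /// Pre-connect: reserve connections to the current backing group.
    fn set_group(&mut self, group: &[&str]) {
        self.0 = group.iter().map(|v| v.to_string()).collect();
    }

    /// Done distributing the collation: issue an "empty" connection
    /// request so we stop occupying validator slots.
    fn clear(&mut self) {
        self.0.clear();
    }
}

fn main() {
    let mut reserved = ReservedValidators::default();
    reserved.set_group(&["val-a", "val-b"]); // ahead of our slot
    // ... produce and distribute the collation ...
    reserved.clear(); // free the slots for other collators
    assert!(reserved.0.is_empty());
}
```

The key invariant is symmetry: every pre-connect is paired with a clear once the collation is distributed, so slots are never held longer than needed.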

@eskimor
Member Author

eskimor commented Dec 1, 2021

Absolutely, you raised a very good point here - we should not only have logic for timing the connect, we should also make sure to disconnect once we are done.

@eskimor
Member Author

eskimor commented Dec 1, 2021

I wasn't aware that we are already running into problems here. In that case, yes let's just bump incoming slots to 100. Making sure collators connect to the right backing group makes sense as well obviously and then let's get this ticket going asap.

@slumber
Contributor

slumber commented Apr 11, 2023

Do we still need this, considering async backing is coming? Collators can take their time issuing a connection and sending candidates to validators, since they are no longer tied to the latest relay chain parent. And with the validator groups ring buffer, we no longer discard needed validators from the peerset, while we do drop unneeded ones.

@ordian
Member

ordian commented Apr 11, 2023

I think this can be dropped as an unneeded optimization with async backing.

@slumber closed this as not planned on Aug 21, 2023
4 participants