-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Pre-connect on collator protocol #4381
Comments
This is only relevant at group rotation boundaries, right? Currently, every collator would try to connect to all group validators. Another thing is we should increase collator slots maybe as an interim measure? |
Yes, although the current pre-connect logic was kind of broken and I think we even removed it already? But I would need to check that. Increasing the rotation time is not something we would like to do, because we are already seeing issues with slow validators on Kusama, causing very long block times for parachains - an increased rotation time would make things worse. |
Pre-connection to the next group was removed in #4261. How sure are we that the very long times are only caused at group rotation boundaries? I don't think that's the only contributing factor. |
No it is not, it is just a contributing factor. The major problem is that TCP has slow start, so it will start with a relatively small amount of IP packages and then wait for a confirmation for all of those, if things went well, the amount of packets before requiring confirmation is increased, the game continues, .... With high latency connections, the waiting for confirmation will totally kill the effective bandwidth. You are right that this should only be a problem right now at group rotations, but that unconditional connecting to validators, no matter whether one is a block producer, is a problem on its own, given the limited number of incoming connections validators will accept. The effect on paritytech/substrate#10359 will be limited though - good point! |
We only connected to the next group when we wanted to distribute a collation and validators would have disconnected us anyway relative fast. So, this code should have had no effect at all. |
If you extend the "collator" interface with some signaling on when it would like to connect to the next validator, we can write code in Cumulus to tell you if you should do this or not, based on if the collator will be building a block in the next slot. |
Yes! That's exactly what I was aiming at. |
Currently, collators add validators to the reserved peerset when a collation is ready and only update it on the next collation. As a short term fix, we could:
|
This feels a little hacky to me, I think we should go with the reduced vote requirement, which should already improve things a lot and then do the pre-connect properly. If I remember correctly we have parachains with 300 collators, having them all connect, although most of them will not provide a collation to that backing group does not sound ideal. Also even if we set the incoming limit to like 300, if it is in reality 310, then it could actually happen that the collators really needing the connection won't be able to get one, making things worse. This comes on top of that more allowed open connections, increases the attack vector for DoS attacks. |
I agree this is not the way to go, but it's something that can quickly implemented for the next release. If however, you think that pre-connect will be implemented very soon, then it's not needed. But if we implement pre-connect properly, we probably should also issue an empty connection request to clear the reserved after after distributing collation, so that a collator won't try to connect and occupy a slot unnecessarily. Reducing |
Absolutely, you raised a very good point here - we should not only have logic for timing the connect, we should also make sure to disconnect once we are done. |
I wasn't aware that we are already running into problems here. In that case, yes let's just bump incoming slots to 100. Making sure collators connect to the right backing group makes sense as well obviously and then let's get this ticket going asap. |
Do we still need it considering async backing is coming? Collators can take their time issuing a connection and sending candidates to validators since no longer tied to latest relay chain parent, |
I think this can be dropped as unneeded optimization with async backing. |
With AURA for example collators know when it is their turn to produce a block and can initiate a connection to the backing group before actually having the collation ready, thus saving time and improving performance.
This ticket can be seen as a preparation task to #3428 , but should be way faster to implement and should help with issues like this, until we have contextual execution ready.
As @ordian pointed out, we should also take care of collators disconnecting once they are done with their collation.
The text was updated successfully, but these errors were encountered: