
Runtime Upgrade (Big PoV) leading to collator peer reputation dropping (network stalled) #10359

crystalin opened this issue Nov 24, 2021 · 14 comments

Comments

@crystalin
Contributor

Conditions

  • Internal permanent network (4 validators, 4 collators)
  • Relay runtime 9122 (kusama-local) client 0.12.1
  • Parachain runtime Moonbase 902 (substrate v0.9.11) migrating to 1000 (substrate v0.9.12)

Event

  • Parachain runtime upgraded from 902 to 1000
  • Parachain stalled at the block applying the upgrade

Logs

https://drive.google.com/file/d/1tNixS4n4_HHiaGz5bMURWM5-wGyGFmWb/view?usp=sharing

Observations

PoV for the block applying the runtime upgrade is 3000kb:

PoV size { header: 0.244140625kb, extrinsics: 2.8232421875kb, storage_proof: 3100.509765625kb }

Validators are reporting the collators:

validator2 2021-11-24 00:56:49.379 DEBUG tokio-runtime-worker parachain::collator-protocol: Fetching collation failed due to network error hash=0x91139d3d4a8c5c31febfb614a2a509681204535403fe0a769e5d0837095dd2fe para_id=Id(1000) peer_id=PeerId("12D3KooWEokXpahv1QL2aJceYDnfjR6d2B4bTpc1xnY2wP7cgkiS") err=Network(Timeout)
validator2 2021-11-24 00:56:49.379 DEBUG tokio-runtime-worker parachain::network-bridge: peer=PeerId("12D3KooWEokXpahv1QL2aJceYDnfjR6d2B4bTpc1xnY2wP7cgkiS") rep=CostMinor("Some network error") action="ReportPeer"


Conclusion

When the PoV is too big and can't be fetched in time by the validators, they reduce the reputation of the collator, preventing it from sending any further blocks and stalling the network.
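
To illustrate the failure mode described above, here is a minimal, self-contained sketch of how repeated per-timeout penalties can accumulate until a peer is effectively cut off. The cost and threshold values are purely illustrative assumptions, not the actual values used by the node.

```rust
// Illustrative model of the stall mechanism: every fetch of the oversized PoV
// times out, the validator reports the collator each time, and the penalties
// accumulate until the collator is disconnected and can no longer advertise
// collations. COST and THRESHOLD are hypothetical numbers, not the real ones.
const COST_NETWORK_ERROR: i32 = -50; // assumed per-timeout penalty
const BANNED_THRESHOLD: i32 = -200;  // assumed disconnect threshold

fn main() {
    let mut reputation = 0i32;
    for attempt in 1.. {
        reputation += COST_NETWORK_ERROR;
        println!("fetch attempt {attempt}: reputation = {reputation}");
        if reputation <= BANNED_THRESHOLD {
            println!("collator dropped; no further collations accepted -> parachain stalls");
            break;
        }
    }
}
```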

@eskimor
Member

eskimor commented Nov 24, 2021

Am I correct that the stall actually happens earlier than the disconnect due to reputation changes?

@crystalin
Contributor Author

I'm having a bit of trouble fully interpreting the logs (but you can access all the validator/collator logs from the file). The stall started right when the parachain upgrade was applied (the block itself wasn't being included because of the timeout, which led to the reputation issue).

@eskimor
Member

eskimor commented Nov 24, 2021

The real issue is: why is fetching the collation taking so long? The timeout is 1 second, the size is 3MB ... so with a decent connection that transfer should take way less than a second. 🤔
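
For reference, a quick back-of-the-envelope check of that claim: ignoring latency and TCP slow start, a ~3MB transfer at a few plausible (assumed, not measured) link speeds looks like this.

```rust
// Bandwidth-only estimate for a ~3 MB PoV; the link speeds are assumptions
// for illustration, not measurements from the network in question.
fn main() {
    let pov_bits = 3.0 * 1024.0 * 1024.0 * 8.0;
    for mbit_per_s in [50.0, 100.0, 1000.0] {
        let ms = pov_bits / (mbit_per_s * 1_000_000.0) * 1000.0;
        println!("{mbit_per_s} Mbit/s -> ~{ms:.0} ms");
    }
}
```

On any of those links the raw transfer stays well under the 1-second timeout, which is the point being made here; the latency effects raised later in the thread are a separate matter.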

@eskimor
Member

eskimor commented Nov 24, 2021

> I'm having a bit of trouble fully interpreting the logs (but you can access all the validator/collator logs from the file). The stall started right when the parachain upgrade was applied (the block itself wasn't being included because of the timeout, which led to the reputation issue).

Yep, that makes sense. The disconnect is not the actual problem: even if it did not happen, the chain would still not make any progress, because the import would keep failing.

Just dropping the reputation change is not a good solution anyway, as it would make DoSing a parachain easier. Increasing the timeout will likely not help either - we only have two seconds in total; if the initial upload takes longer than a second, it becomes very unlikely that the candidate is going to make it into a block.

How good is the connection between validators and collators? Like, what bandwidth is available?

@eskimor
Member

eskimor commented Nov 24, 2021

How big is the backing group?

@eskimor
Member

eskimor commented Nov 24, 2021

Interestingly, the time it takes to transfer a collation varies a lot between validators. Sending a collation to validator 3 takes 134ms:

validator3 2021-11-24 00:55:24.337 DEBUG tokio-runtime-worker parachain::collator-protocol: Requesting collation peer_id=12D3KooWEokXpahv1QL2aJceYDnfjR6d2B4bTpc1xnY2wP7cgkiS para_id=1000 relay_parent=0x187fb3bf0012f5fe882253446aebcc5971652cee47fef7e3c53e22aa83b846ac

validator3 2021-11-24 00:55:24.471 DEBUG tokio-runtime-worker parachain::collator-protocol: Received collation para_id=1000 hash=0x187fb3bf0012f5fe882253446aebcc5971652cee47fef7e3c53e22aa83b846ac candidate_hash=0x7530401ceb9d00469bd7fab7b0484790d99bad8ab00e71d9bc09e29a34480840

but sending the same collation to validator 2 takes almost 900ms (the transfer to the next validator starts once the first transfer has finished):

validator3 2021-11-24 00:55:24.471 DEBUG tokio-runtime-worker parachain::collator-protocol: Received collation para_id=1000 hash=0x187fb3bf0012f5fe882253446aebcc5971652cee47fef7e3c53e22aa83b846ac candidate_hash=0x7530401ceb9d00469bd7fab7b0484790d99bad8ab00e71d9bc09e29a34480840
validator2 2021-11-24 00:55:25.368 DEBUG tokio-runtime-worker parachain::collator-protocol: Received collation para_id=1000 hash=0x187fb3bf0012f5fe882253446aebcc5971652cee47fef7e3c53e22aa83b846ac candidate_hash=0x7530401ceb9d00469bd7fab7b0484790d99bad8ab00e71d9bc09e29a34480840

@eskimor
Member

eskimor commented Nov 24, 2021

Sometimes the logs don't really make any sense - could it be that log lines are missing?

@crystalin
Contributor Author

@eskimor I don't fully agree with your statement that 1s is enough for 3MB (at least not in a decentralized network).
But let me give a bit more context: our collators (8) and validators (4) are split across 2 regions (US, EU), with 4 collators and 2 validators in each.
So the latency between some collator<->validator pairs is not optimal (around 100ms).

So this brings me to the case where transferring 3MB with an RTT of 200ms, over TCP with a slow-start initial congestion window of 10 * MSS (64kb), is going to take close to 1000ms in some cases (when the connection is not pre-established or has gone idle). There are some improvements to be made in that area (using QUIC instead of TCP, or changing the default configuration: paritytech/polkadot-sdk#908), but it is not realistic to expect 3MB to always transfer in under 1s (an RTT of 300ms would go over).
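
A rough sketch of that estimate, under the same assumptions stated above (fresh TCP connection, pure slow start doubling the window every RTT, ~64kb initial window, no loss, bandwidth not the bottleneck); the numbers are illustrative, not taken from the actual node configuration.

```rust
// Estimate how many round trips (and thus how long) a PoV transfer takes over
// a fresh TCP connection in pure slow start: each RTT one congestion window is
// sent, then the window doubles. No loss and no window cap -- an idealised
// model matching the assumptions in the comment above.
fn slow_start_transfer_ms(total_bytes: f64, rtt_ms: f64, initial_window_bytes: f64) -> f64 {
    let mut sent = 0.0;
    let mut window = initial_window_bytes;
    let mut elapsed_ms = 0.0;
    while sent < total_bytes {
        sent += window;
        window *= 2.0;
        elapsed_ms += rtt_ms;
    }
    elapsed_ms
}

fn main() {
    let pov = 3.0 * 1024.0 * 1024.0;    // ~3MB PoV
    let initial_window = 64.0 * 1024.0; // ~10 * MSS, as assumed above
    for rtt in [100.0, 200.0, 300.0] {
        let ms = slow_start_transfer_ms(pov, rtt, initial_window);
        println!("RTT {rtt} ms -> ~{ms} ms on a cold connection");
    }
}
```

Under these assumptions, a cold connection at 200-300ms RTT already exceeds the 1-second fetch timeout, which is the point being made here.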

@crystalin
Contributor Author

> Sometimes the logs don't really make any sense - could it be that log lines are missing?

There could be a few lines missing when the log lines are using , but when I looked it was like 10 lines of log in total for the file. If you want a precise validator or collator log, I can retrieve the exact one.

@eskimor
Member

eskimor commented Nov 25, 2021

> @eskimor I don't fully agree with your statement that 1s is enough for 3MB (at least not in a decentralized network).

I see. The problem is that, without contextual execution (yes, we really need that asap), there is only so much we can do. We only have 2 seconds in total for the complete backing phase; if sending the initial PoV already takes more than a second, this is going to be tough.

Things we can do about that - very short term:

  • Increase the timeout and see if it helps; maybe the 2-second window is still enough.
  • Try whether dropping the reputation change does indeed help - for testing you could simply drop all reputation changes there and see if it makes a difference (a rough sketch of that experiment follows at the end of this comment).
  • Change deployment (different regions?), ..

Longer term (our roadmap):

  • Improve collator protocol to better establish connections ahead of time
  • Contextual Execution ..... this will pretty much solve everything, but is also the hardest
  • Other collation protocol performance improvements we have in the pipeline

As the problem seems to be pressing right now, it would be great if you could try out those short-term solutions. Other than that - yes, precise logs might also help reveal things we can do about the issue right now.
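
For the second bullet above, a minimal sketch of what that diagnostic experiment could look like. All names here are hypothetical stand-ins rather than the real collator-protocol API; the point is only that the validator-side report path becomes a no-op, so fetch timeouts stop costing the collator reputation.

```rust
// Hypothetical diagnostic toggle: skip reputation changes entirely so that a
// PoV fetch timeout no longer punishes the collator. For testing only; as
// noted earlier in the thread, shipping this would make DoSing a parachain
// easier.
const DISABLE_REPUTATION_CHANGES: bool = true; // hypothetical test flag

#[derive(Debug)]
struct Rep(&'static str); // stand-in for the real reputation cost type

fn report_peer(peer_id: &str, rep: Rep) {
    if DISABLE_REPUTATION_CHANGES {
        // Log what would have happened instead of reporting the peer.
        println!("skipping report of {peer_id} ({rep:?})");
        return;
    }
    // The normal path would send a ReportPeer message to the network bridge.
    println!("reporting {peer_id} ({rep:?})");
}

fn main() {
    report_peer("12D3KooW...collator", Rep("CostMinor: Some network error"));
}
```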

@crystalin
Contributor Author

Thank you @eskimor; all the short-term changes are to be applied on the relay nodes, which is fine for our internal network but not really important, as those issues happen only in rare conditions (like a runtime upgrade/migration).

However, they are appearing all the time on Moonriver, and I'm not sure what we can do in those conditions. All our collators have very good hardware (above the Polkadot requirements) and bandwidth, but we have zero control over the validators.

@eskimor
Member

eskimor commented Nov 26, 2021

You mentioned that this happens in your super centralized test setup as well. But there we really should not be seeing it take a second for a 3MB PoV, assuming well-specced nodes. So something else must be going on - I think we really should be looking at the logs, so the exact ones would be really useful here.

Other than that, what could also help in pinpointing the issue is if you could check whether increasing the timeout or dropping reputation changes helps in your setup. E.g. a 10% increase in the timeout fixing it, vs. doubling it and the issue remaining ... those would be interesting data points - although the very first thing will definitely be a proper examination of the logs.

@eskimor
Member

eskimor commented Nov 26, 2021

Ok, going forward with this issue, I think we should:

  1. Examine logs more closely - rule out any bugs
  2. Fix any found bugs
  3. Make collators pre-connect
  4. Remove reputation change on timeout ... not great, but if it helps until we get contextual execution and pre validation in, it might be a worthwhile tradeoff.

@eskimor
Member

eskimor commented Nov 26, 2021

> Ok, going forward with this issue, I think we should:
>
> 1. Examine logs more closely - rule out any bugs
> 2. Fix any found bugs
> 3. Make collators [pre-connect](https://github.com/paritytech/polkadot/issues/4381)
> 4. Remove reputation change on timeout ... not great, but if it helps until we get contextual execution and [pre validation](https://github.com/paritytech/polkadot/issues/3428) in, it might be a worthwhile tradeoff.

^^ Obsoleted by: paritytech/polkadot#4386
