Local Collator Panic - Connecting to Rococo with Validator RPC #4167
Comments
I did a general sanity test today by testing locally with an rpc-based collator and the openzeppelin template. The base case of it producing blocks worked (for several hours). So if this crashes for you maybe something else is wrong. I tried setting up your exact scenario with |
I was able to try the ondemand scenario. After checking your logs again I noticed that I built the openzeppelin template using In general the situation is not yet optimal when it comes to collating for these on-demand cores. The lookahead collator tries to fill the unincluded segment by building multiple blocks, but we only have one spot, so the node will have a higher block locally than what is included in the relay chain. @SBalaguer Can I ask you to try two things?
|
Hi, a thought, please ignore if irrelevant: I have the feeling there is a problem in the interfacing between the relay chain and the system chain; before the overseer crashes it gives
On a system chain collator I also saw the overseer error and crash when the relay chain was unavailable for a moment, which made the collator crash with a similar error. |
Hello, our collectives collator started crashing yesterday at 6pm UTC, without any prior intervention, with the following error:
This repeated every few minutes, and some other system chain collators crashed from time to time as well. After approximately 4 hours everything went back to normal. All the collators share the same datacenter and use a remote relay chain.
As we have other collators in different datacenters that use the same remote relay RPC endpoints, this suggests there might have been some minor connectivity problems in this particular datacenter which caused the remote relay connections to fail. |
Thanks for the reports!
This message is not suspicious at all. It just means that we already know of a higher best block locally than what is included in the relay chain, hence we skip setting a lower best block.
It's true that this crash happened because both relay chain nodes that you specified were unavailable. This is expected, and we see the proper error messages beforehand. But I think we should improve this; I opened an issue for tracking: #4278 @Curu24 Your error seems very relevant, can you post more logs? Which collator version are you running? |
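The "skip setting a lower best block" behaviour described in this reply can be illustrated with a minimal sketch. The names `BestBlockUpdate` and `update_best_block` are hypothetical and are not taken from the collator code; block heights are plain integers, and treating an equal height as a (no-op) update is an assumption.

```rust
/// Hypothetical result of handling a relay chain notification that reports
/// which parachain block is currently included.
#[derive(Debug, PartialEq, Eq)]
pub enum BestBlockUpdate {
    /// The included block is at least as high as the local best: adopt it.
    Set(u32),
    /// The node already knows a higher best block locally: leave it alone.
    Skipped,
}

/// Decide whether the locally known best block should be replaced by the
/// block that the relay chain reports as included.
/// Assumption: only heights are compared, and an equal height counts as `Set`.
pub fn update_best_block(local_best: u32, included_in_relay: u32) -> BestBlockUpdate {
    if included_in_relay < local_best {
        // This is the harmless case behind the log message: the collator
        // has built ahead of what the relay chain has included so far.
        BestBlockUpdate::Skipped
    } else {
        BestBlockUpdate::Set(included_in_relay)
    }
}
```

A collator that has built ahead of the included block, as happens when it tries to fill the unincluded segment, would hit the `Skipped` branch on every such notification.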
So until now this has only been reported on collators (and not on full nodes). Also, it does not seem to be related to the usage of on-demand cores. I did a quick skim of the collator-protocol subsystem, but nothing immediately caught my eye. @sandreim Since we are seeing this just now, could this be related to the recent changes regarding elastic scaling? |
Running 1.10.1. Sending a part of the logs when errors started here: https://pastebin.com/raw/K2rEUgMW |
Quick update: Came back to this and was finally able to reproduce the issue. The problem is that |
Specifically this line seems to be the culprit, fetching the backing state for every para id for every activated leaf:
@alindima I propose we introduce a parameter on |
I was quite surprised to see the collators having the prospective-parachains subsystem activated. Digging a bit, it's only used in the CollationGeneration subsystem, to fetch the minimum relay parents the collator can build collations on. It's indeed quite an overkill to run the entire subsystem just for this piece of information. We can either do as you suggest and add a parameter for the prospective-parachains subsystem or directly use the ChainApi for this information and duplicate this bit of code in the collation-generation. |
I like this 👍. IMO not running the prospective-parachains subsystem is preferable to adding "collator-specific" logic to it. |
Implements #4429. Collators only need to maintain the implicit view for the paraid they are collating on. In this case, bypass prospective-parachains entirely. It's still useful to use the GetMinimumRelayParents message from prospective-parachains for validators, because the data is already present there. This enables us to entirely remove the subsystem from collators, which consumed resources needlessly. Aims to resolve #4167. TODO: - [x] fix unit tests
@skunert sorry for the tag in a closed issue, but I wanted to avoid opening another one for a follow-up question. We've started experiencing similar issues after upgrading to use For our particular setup, we have 8 separate Shiden nodes communicating with two Kusama nodes (private, only used by us). Before the upgrade it worked fine, but now the errors above, followed by a crash, happen constantly. Was there more discussion about this issue somewhere else? |
From which version did you upgrade and what is the exact error message? The problem is only loosely related to the relay chain nodes themselves. The problem is a subsystem that was contained in the collator. It was doing a lot of RPC calls which led to a stall in that subsystem. The fix in the linked PRs is to not include that subsystem. In general your setup sounds reasonable, 4 parachain nodes connecting via RPC to a relay chain node seems fine to me. |
We've upgraded from It's actually 8 parachain collators/nodes connecting to two relay chain nodes 🙂.
Maybe I misunderstood, but from your comment I understood that the issue was that the subsystem was sending too many RPC calls (as you repeated now), but also that the relay node couldn't handle the load, i.e. reply in a timely manner. We also have a testnet parachain which relies on relay chain RPC, but in that case only a single node uses it and we haven't had any problems there (the client code is exactly the same). |
Yes, I was assuming that you provided a different order so that 4 connect to the first relay node and 4 to the other. In case one fails, all 8 would connect to the same.
Yes, it's a mix of both. But the main problem was the high number of queries sent. If your nodes are running exclusively for you, I would expect them to be able to handle the normal load.
The subsystem that was running on the collators was sending a lot of requests per parachain registered on the relay chain. This means that for testnets where you only have a couple of chains it's fine. But on Kusama quite a few are registered, so it's sending a lot more requests. |
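To make the scaling argument above concrete, here is a rough sketch of how the request volume grows. It assumes exactly one RPC request per (activated leaf, registered para) pair, which is a simplification of "a lot of requests per parachain"; the function names are hypothetical and the numbers in the usage note are purely illustrative.

```rust
/// Requests issued when state is fetched for every registered para on every
/// activated leaf. Assumption: one request per (leaf, para) pair.
pub fn requests_all_paras(activated_leaves: u64, registered_paras: u64) -> u64 {
    activated_leaves * registered_paras
}

/// Requests issued when a collator only tracks the single para it collates
/// for. Same assumption: one request per activated leaf.
pub fn requests_own_para(activated_leaves: u64) -> u64 {
    activated_leaves
}
```

Under these assumptions, `requests_all_paras(3, 50)` gives 150 requests where `requests_own_para(3)` gives 3, while a testnet with two registered chains would see only 6. That gap is consistent with a single-node testnet setup running without problems while the same client code struggles against Kusama.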
I see, missed it before! Thanks for this explanation!
I wrongly assumed that a random connection is picked out of the two but I've checked the code and see that's not true 🙈. |
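The exchange above suggests that the relay chain RPC URLs are tried in the order given rather than picked at random. A minimal sketch of that selection rule follows, with the hypothetical names `pick_endpoint` and `is_reachable`; it is not the actual client code, and the reachability check is passed in as a closure so the sketch stays self-contained.

```rust
/// Return the first URL, in the order supplied, that the caller-provided
/// `is_reachable` check accepts. Hypothetical helper for illustration only.
pub fn pick_endpoint<'a>(
    urls: &[&'a str],
    is_reachable: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    // `find` walks the slice front to back, so earlier entries always win
    // over later ones; there is no randomisation or load balancing here.
    urls.iter().copied().find(|&url| is_reachable(url))
}
```

With this rule, eight collators that all list the same two relay nodes in the same order would all land on the first node while it is reachable. Giving half of them the reverse order spreads the load 4/4 until one relay node fails, at which point all eight fall back to the remaining one.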
Hi @Dinonard, we encountered the same issue when updating binaries to v1.9.0. Did you manage to find a fix for it? If you found an alternative solution or workaround, I'd appreciate hearing how you resolved it. Thanks in advance! |
Description of bug
I'm testing out registering a Parachain on Rococo and getting some blocks validated using onDemandAssignmentProvider. The objective is to reproduce what a new builder could potentially do when registering and testing the system for the first time.
In order to achieve this, I'm running a collator that uses an RPC connection to Rococo instead of running a Rococo validator directly within the collator. I'm doing this by passing the flag
--relay-chain-rpc-url "wss://rococo-rpc.polkadot.io"
to my collator.
At the beginning everything seems to work fine, and I even manage to produce blocks on demand with my parachain. However, at some point the collator panics. The message looks like this (full logs from alice and bob attached):
The parachain (runtime and node) I'm testing is from the OpenZeppelin Generic Template.
More logs 👇
alice-logs-extra-logging.txt
bob-logs-extra-logging.txt
Steps to reproduce
alice and one for bob with the same flags (changing ports), although I noticed the same behavior when running only one (worked because of the force-authoring flag).
OnDemandAssignmentProvider.placeOrderAllowDeath.