-
Notifications
You must be signed in to change notification settings - Fork 766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
polkadot 0.9.40-a2b62fb872b Validator Issues / ram and peers issues / lost sync / chilled #679
Comments
Number of TCP connections is not related to There was also a bug with the number of reported peers in the logs. This was already fixed. |
Please provide more logs around this issue.
We need at least |
From the offchain worker, maybe the staking transaction for the result of the next election. Generally isn't an issue.
Requires more logs, but could indicate that the node is too slow. |
Do we have logs for this node? |
@dcolley could you provide logs with |
Here are some trace logs of a "low specs" server running latest version At this point I wasn't able to reproduce the issue where the node stops syncing blocks with no error message, will keep it running and check. |
Are you able to confirm this reported behavior:
The logs you provided seem to indicate a lot of connectivity issues which would then translate to syncing issues as there are no peers to provide blocks. One issue in particular stands out as interesting:
which originates from here. There was another syncing-related issue resolved the other day that was related to NAT configuration: paritytech/polkadot#6696 (comment). Could you and any other person noticing these syncing problems first verify that the issue is not because of any connectivity issues pertaining to NAT configurations. |
I eventually got an error on one of my nodes.
|
My validator synchronization has stopped, and the following log appears Mar 26 13:29:00 kusama-3-new bash[36605]: 2023-03-26 13:29:00 The number of unprocessed messages in channel |
@xubincc what kind of hardware are you running the validator on? Before this release, syncing was part of |
@altonen 8vcpu 16GB ram and NVMe SSD disk |
In the night of 24 march, right after the update I had two nodes chilled, they are on dedicated hardware (Ryzen 3600, 64 Gb, NVME drives) which have been running stable (A+ one-t performance) for a long time without problems What seemed to happen: it was importing blocks, then at 2:41 it stopped importing and was loosing it's peers (peer count dropping), until after a few minutes it picks up the import again. This keeps happening on and off, because it happened at the end of a session I was chilled. Later the process crashed with a After this the stopping importing and restoring importing continues with a lot of Grafana gave a lot of "lagging block" errors. In the morning I reverted back and all problems went away. Later I put two nodes again on the new version and for now they seem to be holding up without the earlier pattern. On element I see some reports of this kind of behaviour and some more chilled validators. Maybe some interaction with the "mpsc_network_worker" which overloads or something like that? |
I have 12th Gen Intel(R) Core(TM) i9-12900K, 128G RAM and NVME disks. I'm running 2x kusama and 1x polkadot validators on this machine - has been running for months like this. I ran the polkadot benchmark machine and got this:
I suspect the RAM fail is because of the active validator |
@dcolley What RAM do you have? DDR4? DDR5? 3200Mhz? Single or double channel? |
I have 4x of these:
|
@dcolley As far as you RAM test is concerned, I would think the culprit is your RAM speed. You may want to switch to 3200Mhz. |
So what has changed? This server has been running with the same spec for months. |
On another machine with exactly the same RAM, I get this result:
The disk write is tested on SSD boot volume, but we have NVME - not sure how to force the benchmark command to test the NVME |
Today again the importing stopped multiple times and the process crashes while catching up importing with a mpsc network error. Same like my previous post the import stops, only idle messages, the peers drop until at a certain moment a lot of catch up importing and the the process crashes Same like @dcolley this system was running smoothly for ages before this update @altonen @bkchr it really looks something is off with the sync/network functioning, maybe the move of the syncing/sc-networking? |
To benchmark with a specific directory for IO: As for the benchmark itself, I don't believe it has changed recently. The 13.92 GiBs minimum has been there for quite a while. If two machines have substantially different benchmark results, there is probably something different between those two. Now this benchmark is only to compare with the hardware Parity has decided to use themselves (I believe). If the performance of your validator is generally A+, then it is doing just fine. Personally, I would not be comfortable with 6.96 GiBs memory speed or the lack of ECC memory, but that's just me. edit: if you want to compare the memory performance of your two "supposedly" identical machines, while using a tool other than |
We have one validator that we had synced from scratch with Flags:
We first started to get couple of the following error logs:
Shortly after the whole DB got corrupted and the validator crashed.
|
We get A+/A on most performance reports so I'm not too worried about that. Thanks for the recommendation for another tool: server 5 (had slow RAM on bench test):
server 6 (same memory)
|
There are at least three underlying issues that I can think of that contribute to this:
We are working on fixing these problems but they require some deep refactoring in the entire |
I had also problems on only one node, today around 2023-03-26 17:30:06.504 (CET TIME) Last block imported was
Then it went IDLE and peers started to decrease slowly...
At 18:01:31.021 (cet time), node started to import the missed blocks during this timeframe...
The log was full of "Imported " lines until 18:01:46, where it crashed
|
@gregorst3, thats the same pattern I observed; the node gets high in peer count, stops importing and goes "idle", peer count drops and at a certain point it restarts importing a lot of blocks at once which seems to overload some network subsystem and causes a node crash. If above happens at the end of a session the node gets involuntarily chilled. Strangely some nodes seem to suffer more then others. Maybe after the update the resource demand got bigger and nodes who had more spare resources were able to compensate better or something like that? But that is just guessing. @altonen thanks for the extensive reply, that sounds logical and makes it more clear |
Any of you able to provide logs with |
I'm running 0.9.41 on an Intel NUC running Ubuntu 20.04 and Kusama node seems to be operating fine. No sync issues or stalling. Been running smoothly since the upgrade to 0.9.41 the day it was released. |
We were running a kusama node on 0.9.41 for 5 days without any issues, but today it got stuck at
No logs produced after this event. A restart helps. |
I see the same dispute message in log for our node as well, but syncing has carried on normally w/out need for restart. |
I stand corrected. Our validator is syncing alright, but it was halted as a validator. Required us to start validating again via https://polkadot.js.org. |
Was it still using CPU? |
While I had similar issues with v0.9.41 as v0.9.40, they were far less often. I was eventually chilled on v0.9.40, but things did not get bad enough to receive a chill on v09.41. I would say v09.41 reduced the frequency of the issues by about 80%. I've since downgraded again to v0.9.39-1. Here are my triggered alerts from v0.9.41. I can pull logs if it's helpful let me know. |
Agree with @yayoi-k. Got chilled but issues with 0.9.41 are way less often than 0.9.40. Downgraded our node to 0.9.39-1 as well for the time being. |
@bkchr, I checked grafana metrics, but cannot see any CPU usage drop at that time. |
Prometheus reported those errors today between 10:35 and 11:53 UTC:
Following link contains btw, I'm using the following alerting-rules updated from more than a year ago, not sure if there's another one somewhere with more recent rules: https://github.com/paritytech/substrate/blob/master/scripts/ci/monitoring/alerting-rules/alerting-rules.yaml Also, the node is running 0.9.41. |
Thank you for the logs. We were able to reproduce the issue in the meantime and found out that the inbound substream for |
The node has crashed a little while ago, log file of the 10000 last entries (I can provide more if necessary): |
These look interesting but I don't know much about the dispute coordinator
|
Dispute coordinator messages are "normal". They just mean we've imported votes for a dispute. There were enough votes for the dispute to conclude. The candidate was "valid", so nothing to worry about.
This warning is more concerning though. It looks indeed like there is a potential deadlock: approval-distribution sends messages over a bounded channel to approval-voting (and also awaits the results): However, the timeout is the way to break the cycle. It could only happens when the approval-voting queue is full. paritytech/polkadot#6782 should make this much less likely to happen in practice by reducing the number of messages. The timeout could also be hit if node is slow/under load. |
@ordian is correct. This looks concerning, maybe there is a way to refactor and remove the cycle, by moving |
We experienced the mentioned problems related to nodes getting stuck / syncing after upgrading our nodes in South America to |
How many nodes are you connected to? This release included a fix for syncing which makes it kick inactive peers, releasing slots for peers who are hopefully more active. I don't know how to read the first graph but is both best and finalized block getting stuck? The second image doesn't look extremely concerning unless it is happening constantly. |
When that happened, 80 - in line with the usual count
Yes, best and finalized. This generally doesn't happen but we experienced it on |
What is the ratio between in/out peers? The default setting is 8 out/32 in. There is a 30 second delay between detecting the node is stuck and kicking the peers so if your best block is stuck multiple minutes then there is some other problem but I don't think that is the case unless the node is literally unable to establish any new connections and all it has is nodes that are already full. If I would have to guess what happened since you don't have logs, it looks like We did get a report from our DevOps that one of our RPC nodes in Brazil had similar issue but that was before the release. I've been busy with other stuff but I will try to look into this issue soon and see if I'm able to make any sense why it's lagging in SA. |
I keep having some UnboundedChannelPersistentlyLarge on a variety of servers, in which case I generally manually restart the polkadot process as once the first one happens, I generally get more soon after and this eventually leads to a stall. Is there any progress on this issue? Thanks! |
The issue was supposed to have been addressed in 0.9.42. Which channel is getting clogged? |
I get those messages below on a regular basis. It's much less than before 0.9.42, but it's still there once in a while on most/all my nodes.
|
This looks like a different bug than what this issue was originally about because the channel that's getting clogged is the block import notification stream. Could you open a new issue for this? |
* Remove user bob from bootstrap test * Remove unused testSubClient from bootstrap test * Add 6-account basic channel test * Change test name * Move ethClient.initialize * Get dispatchAs working for Alice only * Fetch sudo key instead of using Alice * Add transactions for the remaining 5 accounts Tests are failing because only a single transaction is reflected in the eth account. * Match naming convention for incentivized channel * Send multiple proofs from parachain relayer * Replace proof leaf with bundle from event * Tweak log for accounts & nonces * Update recommended polkadot version to 0.9.28 * Check for nil incentivized channel commitment * Avoid zero value proofs by setting len 0 * Fix proof struct field order Field order matters when decoding into RawMerkleProof. * Tweak warning log message * REMOVEME Debug logs * Handle non-perfect complete trees in hashSides * Fix power of 2 XOR strikes again! * This decodes properly, but rather use the proof * Add TODO for bundle & leaf validation * Loop by event instead of account * Tweak test README * Remove TODOs * Fix IncentivizedChannelCommitment pointer * Remove unused method * Return pointer from constructor instead of value * Refactor beefy-listener - Extract scanForBasicChannelProofs. - Rename to basicChannel{AccountNonces,ScanAccounts}. - Pass digestItemHash around instead of digestItem. * Remove extra debug logs * Separate parameters to generateHashSides * Newline * Add note about account ids * Add tests for generateHashSides * Remove NewIncentivizedChannelCommitment * Separate proofs from bundles * Remove fetchOffchainData
* fix broken message lane benchmarks * proof-size related benchmarks * impl Size for proof parameters * include proof weight into weight formula * left TODO * fixed proof size * WeightInfoExt::receive_messages_proof_weight * charge for extra message bytes delivery in send_message * removed default impl of WeightsInfoExt * moved weight formulas to WeightInfoExt * receive_messages_proof_outbound_lane_state_overhead is included twice in weight * typo * typo * fixed TODO * more asserts * started wotk on message-lane documentation * expected_extra_storage_proof_size() is actually expected in delivery confirmation tx * update README.md * ensure_able_to_receive_confirmation * test rialto message lane weights * removed TODO * removed unnecessary trait requirements * fixed arguments * fix compilation * decreased basic delivery tx weight * fmt * clippy * Update modules/message-lane/src/benchmarking.rs Co-authored-by: Hernando Castano <HCastano@users.noreply.github.com> * structs * Update primitives/millau/src/lib.rs Co-authored-by: Hernando Castano <HCastano@users.noreply.github.com> * removed readme.md * removed obsolete trait bounds * Revert "removed readme.md" This reverts commit 50b7376a41687a94c27bf77565434be153f87ca1. * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * Update bin/runtime-common/src/messages.rs Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com> * PreComputedSize Co-authored-by: Hernando Castano <HCastano@users.noreply.github.com> Co-authored-by: Tomasz Drwięga <tomusdrw@users.noreply.github.com>
Stale. |
The latest medium-priority release caused problems on many validator instances.
A lot of validators got chilled since updating.
I'll just try to sum up the problems reported by various operators for visibility and to get decision-making going.
Peering gone wild
Even waiting nodes that don't participate in consensus see a rapid increasement of peers to 700+ ignoring the
--in-peers
and--out-peers
Number of TCP connections
Ram consumption
Consumed Ram clearly increased with this release and spikes in Ram usage caused the node to crash
Stops syncing
There are reports of nodes just stopping to sync the chain without any identifiable problem.
Some operators noticed problems on one machine but had none on the second identical build.
UnboundedChannelPersistentlyLarge
message = Channel mpsc_import_notification_stream on node localhost:9616 contains more than 200 items for more than 5 minutes. Node might be frozen.
Mar 24 06:13:04 worker02 polkadot[429783]: 2023-03-24 06:13:04 (offchain call) Error submitting a transaction to the pool: Transaction pool error: Too low priority (18446744073709551615 > 18446744073709551615)
Mar 24 06:13:04 worker02 polkadot[429783]: 2023-03-24 06:13:04 Ran out of free WASM instances
Mar 24 06:13:09 worker02 polkadot[429783]: 2023-03-24 06:13:09 Got a bad assignment from peer hash=0x65d2558f1c946ba12867777cfdcbb0182e944e511c2d3a0b3de9b1fa9d79b6b8 peer_id=PeerId("12D3KooWBy4Mhc9j4GJxZfKjettapQbsGr8z5wVNHXPgrYnnU4RH") error=Unknown block: 0x65d2558f1c946ba12867777cfdcbb0182e944e511c2d3a0b3de9b1fa9d79b6b8
Many of the 1KV validators have downgraded because of the problems which caused them invalid for program participation.
The text was updated successfully, but these errors were encountered: