Subnet-EVM v0.5.3 stalls at a random height after bootstrapping (DFK as the example) #747
Comments
Hello! Which AvalancheGo version were you running when you saw this problem? Also, which versions (Subnet-EVM + AvalancheGo) did you try afterwards where the issue was fixed? When updating to another version, did you remove the database and start from scratch, or did you leave everything other than the versions untouched? Also, if you're able to reproduce the issue, could you try running without the state-sync option in Subnet-EVM? It should be under …
Hey! Yeah, I was using AvalancheGo v1.10.4 with Subnet-EVM v0.5.2 at first, then removed the DB and tried again and still had the same problem. Then I removed the DB and the plugin and moved to AvalancheGo v1.10.4 with Subnet-EVM v0.5.1, and this resolved the issue. I still have one machine that's currently stuck with the current config; I can try and see if I can reproduce this and just have …
Coming back to this after a few weeks: I've updated to the latest AvalancheGo v1.10.7 with Subnet-EVM v0.5.3, and it seems like the issue still persists. I've removed … But rolling back to v1.10.5 with Subnet-EVM v0.5.3 seems to work fine while also removing the …
Thanks for reporting this issue, and sorry for the delay. If you're able to reproduce this easily, could you provide a zip of the full log directory of your node? Additionally, since the node is getting stuck somewhere without additional logs about where it's spending its time, it would be useful to get profiling information from the node. Could you enable continuous profiling on AvalancheGo with this flag: https://docs.avax.network/nodes/configure/avalanchego-config-flags#--profile-continuous-enabled-boolean and on the Subnet-EVM binary with this flag: https://docs.avax.network/nodes/configure/chain-config-flags#continuous-profiler-dir-string? Unfortunately, the linked logs don't give much additional information to look into the root cause.
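For reference, a sketch of how those two settings can be enabled (the profile directory path below is illustrative, not a required location): continuous profiling on AvalancheGo is the node flag `--profile-continuous-enabled=true`, while the Subnet-EVM profiler is a key in the chain's config.json:

```json
{
  "continuous-profiler-dir": "/home/ubuntu/dfk-profiles"
}
```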
Hey @aaronbuchwald, thanks for getting back to me on this, I just saw your reply. It's not really easily reproducible; it's more of a random occurrence at the moment, but we do see most of our nodes hit at around the same time, and a select few just don't recover. I'll enable profiling on the nodes to see if the issue comes up on one of them. Please see attached the zip of the logs. The node whose logs I attached recovered for a little bit and then died again, so I think this might give you a better picture.
I've tried doing … and this hasn't worked either; I've attached the logs for this above. I first removed the state-sync-enabled flag and restarted the node; it progressed for a bit and has since stalled. The node has now been flushed completely, so I'll try adding the profiler that Aaron mentioned.
Please see attached the profiles for the current DFK node; it crashed recently, so I thought it would be good to send this through.
Thanks for sharing the profiles! Looking at these, the node is spending the majority of its time wiping the state snapshot. The snapshot layer was added to provide an O(1) lookup of the state (instead of looking up values in the merkle trie, which requires an O(log n) lookup). Ref here: https://blog.ethereum.org/2020/07/17/ask-about-geth-snapshot-acceleration.

The snapshot is considered necessary for performance, but unfortunately it can be corrupted if the node dies while it is in the middle of accepting a block. In that case, the snapshot gets wiped and re-generated asynchronously, which can cause degraded performance. Here, it looks like running the wiping process asynchronously is degrading the node's performance to the point that it struggles to catch up.

We'd like to eliminate the possibility of the snapshot getting corrupted by a shutdown during block accept entirely, but in the meantime, you may see better performance if you disable async snapshot generation, so that the node waits for the snapshot to be fully formed before continuing and escapes this degraded-performance loop. You can set this flag to false to disable async snapshot generation: https://docs.avax.network/nodes/configure/chain-config-flags#snapshot-async-boolean.
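To make the O(1)-vs-O(log n) point concrete, here is a toy Go sketch (not the geth implementation): the "snapshot" is a flat key→value map answered in a single read, while a trie lookup touches one node per level of the key.

```go
package main

import "fmt"

// trieNode is a toy byte-trie: each level consumes one byte of the key,
// so a lookup touches len(key) nodes, analogous to the merkle trie's
// O(log n) node reads. The flat snapshot map needs only one access.
type trieNode struct {
	children map[byte]*trieNode
	value    []byte
}

func (n *trieNode) insert(key, value []byte) {
	cur := n
	for _, b := range key {
		if cur.children == nil {
			cur.children = map[byte]*trieNode{}
		}
		if cur.children[b] == nil {
			cur.children[b] = &trieNode{}
		}
		cur = cur.children[b]
	}
	cur.value = value
}

// get walks the trie and counts how many nodes were touched.
func (n *trieNode) get(key []byte) (value []byte, nodesRead int) {
	cur := n
	for _, b := range key {
		nodesRead++
		cur = cur.children[b]
		if cur == nil {
			return nil, nodesRead
		}
	}
	return cur.value, nodesRead
}

func main() {
	trie := &trieNode{}
	snapshot := map[string][]byte{}

	key, val := []byte("acct1"), []byte("balance=100")
	trie.insert(key, val)
	snapshot[string(key)] = val

	// Snapshot: a single map access. Trie: one node read per key byte.
	fmt.Printf("snapshot lookup: %s (1 read)\n", snapshot[string(key)])
	v, reads := trie.get(key)
	fmt.Printf("trie lookup: %s (%d node reads)\n", v, reads)
}
```

This is why losing the snapshot hurts so much: every state read falls back to the multi-node trie walk until regeneration finishes.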
Thanks, this should be very helpful! I've set this to false now and I'll do the same on my other nodes. I'll keep the profiler going and see if I catch another crash. I'll have a read of the blog post as well.
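For anyone landing here later, the change suggested above goes in the chain's config.json (a sketch based on the linked chain-config-flags docs; any other keys you already set stay alongside it):

```json
{
  "snapshot-async": false
}
```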
Made a PR upstream as well to increase the size threshold at which a batch is broken up when flushing the snapshot diff to disk: ethereum/go-ethereum#27977. We'll migrate this change into Subnet-EVM/Coreth as well, and this should significantly reduce the likelihood that a snapshot gets corrupted in the first place.
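A minimal Go sketch of the batching idea behind that PR (names and numbers are illustrative, not the real geth values): writes accumulate in a batch and are committed once the batch grows past a size threshold, so a larger threshold means fewer, bigger disk commits and a smaller window in which a crash leaves a half-flushed snapshot.

```go
package main

import "fmt"

// batch is a toy stand-in for a database write batch: it tracks how many
// bytes are buffered and how many times it has been committed to disk.
type batch struct {
	pending int // bytes buffered so far
	commits int // flushes performed
}

func (b *batch) put(size int) { b.pending += size }

// flushIfFull commits the batch once it exceeds the size threshold.
func (b *batch) flushIfFull(threshold int) {
	if b.pending >= threshold {
		b.commits++
		b.pending = 0
	}
}

// wipe simulates deleting `entries` snapshot entries of `entrySize` bytes,
// returning how many disk commits the given threshold produced.
func wipe(entries, entrySize, threshold int) int {
	b := &batch{}
	for i := 0; i < entries; i++ {
		b.put(entrySize)
		b.flushIfFull(threshold)
	}
	if b.pending > 0 { // final partial flush
		b.commits++
	}
	return b.commits
}

func main() {
	// Raising the threshold 10x cuts the commit count 10x for the same data.
	fmt.Println(wipe(1000, 100, 10000))  // prints 10
	fmt.Println(wipe(1000, 100, 100000)) // prints 1
}
```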
Describe the bug
When attempting to bootstrap a node using Subnet-EVM v0.5.3 with AvalancheGo v1.10.5, the process stalls at a random block height. This issue doesn't occur with Subnet-EVM v0.5.1 on AvalancheGo v1.10.4. The behaviour also only happens on mainnet, not on testnet.
To Reproduce
1. Install Subnet-EVM v0.5.3.
2. Add the upgrade.json file for DFK in the subnet config folder:
   {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/upgrade.json
3. Begin the bootstrapping process for the node.
4. Observe that the node gets stuck at a random block height and does not progress.
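Concretely, assuming the default AvalancheGo directory layout (the base directory is illustrative; the blockchain ID is DFK's, from the path above), the upgrade.json placement looks like:

```
~/.avalanchego/configs/chains/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/upgrade.json
```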
Expected behavior
The node should successfully bootstrap without stalling, similar to the behavior observed in subnet-evm version 0.5.1.
Logs
I've included the log, which shows that the subnet has an extended processing time:
q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi.log
Operating System
Running on Linux on an AWS m5.4xlarge.
Additional context
This was done while bootstrapping nodes completely from scratch, so I haven't tried going from 0.5.1 to 0.5.2, as I'm afraid that might cause the same experience as above.