
Subnet-EVM v0.5.3 stalls at a random height after bootstrapping (DFK as the example) #747

Closed
NickSolante opened this issue Jul 21, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@NickSolante

NickSolante commented Jul 21, 2023

Describe the bug
When attempting to bootstrap a node using Subnet-EVM version 0.5.3 with AvalancheGo v1.10.5, the process stalls at a random block height. This issue doesn't occur in version 0.5.1 while using AvalancheGo v1.10.4. This behaviour also only happens on mainnet, not on testnet.

To Reproduce
Install subnet-evm version 0.5.3.
Add the upgrade.json file for DFK in the subnet config folder {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/upgrade.json

{
  "precompileUpgrades": [
    {
      "feeManagerConfig": {
        "adminAddresses": ["0x79A1A35d07044a3d856197756b82319E7F97ba25"],
        "blockTimestamp": 1662134400
      }
    }
  ],
  "stateUpgrades": [
    {
      "blockTimestamp": 1679072400,
      "accounts": {
        "0x04b9dA42306B023f3572e106B11D82aAd9D32EBb": {
          "storage": {
            "0x0000000000000000000000000000000000000000000000000000000000000007": "0x000000000000000000000000000000000000000000cecb8f27f4200f3a000000",
            "0x000000000000000000000000000000000000000000000000000000000000000b": "0x00000000000000000000000000000000000000000069E10DE76676D080000000"
          }
        }
      }
    }
  ]
}

Begin the bootstrapping process for the node.
Observe that the node gets stuck at a random block height and does not progress.

Expected behavior
The node should successfully bootstrap without stalling, similar to the behavior observed in subnet-evm version 0.5.1.

Logs
I've included the log, which shows that the subnet has an extended processing time:
q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi.log

Operating System
Running on Linux, on an m5.4xlarge instance on AWS.

Additional context
This was done while bootstrapping nodes completely from scratch, so I haven't tried going from 0.5.1 to 0.5.2, as I'm afraid that might cause the same experience as well.

@NickSolante added the bug label Jul 21, 2023
@NickSolante changed the title from "Subnet-EVM v0.5.2 stalls at a certain height after bootstrapping (DFK as the example)" to "Subnet-EVM v0.5.2 stalls at a random height after bootstrapping (DFK as the example)" Jul 21, 2023
@ceyonur
Collaborator

ceyonur commented Jul 23, 2023

Hello!
It seems that bootstrapping was completed, but your node then failed to process messages in a reasonable time.

Which AvalancheGo version were you running when you saw this problem? Also, which versions (Subnet-EVM + AvalancheGo) did you try afterwards and find the issue fixed? When updating to another version, did you remove the database and start from scratch, or did you change nothing other than the versions?

Also, if you're able to reproduce the issue, could you try running without the state-sync option in Subnet-EVM? It should be under {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json as state-sync-enabled. From your logs this is currently turned on ("state-sync-enabled": true); you can turn it off with "state-sync-enabled": false.
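For example, a minimal config.json sketch (keep whatever other options you already have in the file; this only shows the flag itself):

{
  "state-sync-enabled": false
}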

@NickSolante
Author

Hey!

Yeah, I was using AvalancheGo 1.10.4 with Subnet-EVM 0.5.2 at first, then removed the db and tried again and still had the same problem. I then removed the db and the plugin and moved to AvalancheGo 1.10.4 with Subnet-EVM 0.5.1, which resolved the issue.

I still have one machine that's currently stuck with the current config. I can try to see if I can reproduce this with just "state-sync-enabled": false

@NickSolante
Author

Coming back to this after a few weeks: I've updated to the latest AvalancheGo 1.10.7 with Subnet-EVM v0.5.3, and the issue still persists. I've also removed "state-sync-enabled": false from the file entirely and get the same issue.

But rolling back to AvalancheGo 1.10.5 with Subnet-EVM v0.5.3 seems to work fine, also with the "state-sync-enabled" flag removed from the config file.

@aaronbuchwald
Collaborator

Thanks for reporting this issue and sorry for the delay.

If you're able to reproduce this easily, could you provide a zip of the full log directory of your node?

Additionally, since the node is getting stuck somewhere without additional logs about where it's spending time, it would be useful to get profiling information from the node. Could you enable continuous profiling on AvalancheGo with this flag: https://docs.avax.network/nodes/configure/avalanchego-config-flags#--profile-continuous-enabled-boolean and on the Subnet-EVM binary with this flag: https://docs.avax.network/nodes/configure/chain-config-flags#continuous-profiler-dir-string.
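For reference, this is roughly what that looks like (the profiles directory path below is just an illustration):

# AvalancheGo node flag (node-level continuous profiling)
avalanchego --profile-continuous-enabled=true ...

# Subnet-EVM chain config, e.g. {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json
{
  "continuous-profiler-dir": "/home/ubuntu/profiles"
}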

Unfortunately the linked logs don't give much additional information to look into the root cause.

@NickSolante
Author

NickSolante commented Aug 21, 2023

Hey @aaronbuchwald thanks for getting back to me on this, I just saw your reply.

It's not really easily reproducible; it's more of a random occurrence at the moment, but we do see most of our nodes getting hit at roughly the same time, and a select few just don't recover. I'll enable profiling on the nodes to see if the issue comes up on one of them.

Please see the attached zip of the logs.

logs.zip

The node whose logs I've attached recovered for a little bit and then died again, so I think this might give you a better outline.

@NickSolante changed the title from "Subnet-EVM v0.5.2 stalls at a random height after bootstrapping (DFK as the example)" to "Subnet-EVM v0.5.3 stalls at a random height after bootstrapping (DFK as the example)" Aug 21, 2023
@NickSolante
Author

> Hello! It seems that bootstrapping was completed, but your node then failed to process messages in a reasonable time.
>
> Which AvalancheGo version were you running when you saw this problem? Also, which versions (Subnet-EVM + AvalancheGo) did you try afterwards and find the issue fixed? When updating to another version, did you remove the database and start from scratch, or did you change nothing other than the versions?
>
> Also, if you're able to reproduce the issue, could you try running without the state-sync option in Subnet-EVM? It should be under {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json as state-sync-enabled. From your logs this is currently turned on ("state-sync-enabled": true); you can turn it off with "state-sync-enabled": false.

I've tried this and it hasn't worked either; I've attached the logs for it above. I first removed the state-sync-enabled flag and restarted the node; it progressed for a bit and has since stalled. The node has now been flushed completely, so I'll try adding in the profiler that Aaron mentioned.

@NickSolante
Author

Please see the attached profiles for the current DFK node; it crashed recently so I thought it would be good to send this through.
profiles.zip

@aaronbuchwald
Collaborator

Thanks for sharing the profiles! Looking at this, it's spending the majority of its time wiping the state snapshot.

The snapshot layer was added to provide an O(1) lookup of the state (instead of looking up values in the merkle trie, which requires an O(log n) lookup). Ref here: https://blog.ethereum.org/2020/07/17/ask-about-geth-snapshot-acceleration.

This is considered a performance requirement, but unfortunately the snapshot can be corrupted if the node dies while it is in the middle of accepting a block. In that case, the snapshot gets wiped and re-generated asynchronously, which can cause degraded performance.

In this case, it looks like running the wiping process async is causing degraded performance of the node such that it struggles to catch up.

We'd like to eliminate the possibility of the snapshot getting corrupted during a shutdown mid block-accept altogether, but in the meantime you may see better performance if you disable async snapshot generation, so that the node waits for the snapshot to be fully regenerated before continuing and escapes this degraded performance loop.

You can set this flag to false: https://docs.avax.network/nodes/configure/chain-config-flags#snapshot-async-boolean to disable async snapshot generation.
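For example, in {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json (a minimal sketch; merge with your existing options):

{
  "snapshot-async": false
}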

@NickSolante
Author

NickSolante commented Aug 22, 2023

Thanks, this should be very helpful! I've set this to false now; I'll do the same on my other nodes.

I'll keep the profiler going and see if I catch another crash.

I'll have a read of the blog post as well.

@aaronbuchwald
Collaborator

Made a PR upstream as well to increase the size at which a batch is broken up when flushing the snapshot diff to disk: ethereum/go-ethereum#27977.

We'll migrate this change into Subnet-EVM/Coreth as well and this should significantly reduce the likelihood that a snapshot will be corrupted in the first place.
