
Subnet-EVM v0.5.3 stalls at a random height after bootstrapping (DFK as the example) #747

Closed
NickSolante opened this issue Jul 21, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@NickSolante

NickSolante commented Jul 21, 2023

Describe the bug
When attempting to bootstrap a node using Subnet-EVM version 0.5.3 with AvalancheGo v1.10.5, the process stalls at a random block height. This issue doesn't occur in version 0.5.1 while using AvalancheGo v1.10.4. This behaviour also only happens on mainnet, not on testnet.

To Reproduce
Install subnet-evm version 0.5.3.
Add the upgrade.json file for DFK in the subnet config folder {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/upgrade.json

{
  "precompileUpgrades": [
    {
      "feeManagerConfig": {
        "adminAddresses": ["0x79A1A35d07044a3d856197756b82319E7F97ba25"],
        "blockTimestamp": 1662134400
      }
    }
  ],
  "stateUpgrades": [
    {
      "blockTimestamp": 1679072400,
      "accounts": {
        "0x04b9dA42306B023f3572e106B11D82aAd9D32EBb": {
          "storage": {
            "0x0000000000000000000000000000000000000000000000000000000000000007": "0x000000000000000000000000000000000000000000cecb8f27f4200f3a000000",
            "0x000000000000000000000000000000000000000000000000000000000000000b": "0x00000000000000000000000000000000000000000069E10DE76676D080000000"
          }
        }
      }
    }
  ]
}

Begin the bootstrapping process for the node.
Observe that the node gets stuck at a random block height and does not progress.

Expected behavior
The node should successfully bootstrap without stalling, similar to the behavior observed in subnet-evm version 0.5.1.

Logs
I've included the log, which shows that the subnet has an extended processing time:
q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi.log

Operating System
Running on Linux, on an m5.4xlarge instance on AWS.

Additional context
This was done while bootstrapping nodes completely from scratch, so I haven't tried going from 0.5.1 to 0.5.2, as I'm afraid that might cause the same experience as well.

@NickSolante added the bug label Jul 21, 2023
@NickSolante changed the title from "Subnet-EVM v0.5.2 stalls at a certain height after bootstrapping (DFK as the example)" to "Subnet-EVM v0.5.2 stalls at a random height after bootstrapping (DFK as the example)" Jul 21, 2023
@ceyonur
Collaborator

ceyonur commented Jul 23, 2023

Hello!
It seems that bootstrapping was completed, but your node then failed to process messages in a reasonable time.

Which AvalancheGo version were you running when you saw this problem? Also, which versions (Subnet-EVM + AvalancheGo) did you try afterwards and find the issue fixed? When updating to another version, did you remove the database and start from scratch, or did you change nothing other than the versions?

Also, if you're able to reproduce the issue, could you try running without the state-sync option in Subnet-EVM? It should be under {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json as state-sync-enabled. From your logs this is currently turned on ("state-sync-enabled": true); you can turn it off with "state-sync-enabled": false.
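For example, a minimal config.json sketch (keep whatever other options you already have in the file; this only shows the flag itself):

{
  "state-sync-enabled": false
}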

@NickSolante
Author

Hey!

Yeah, I was using AvalancheGo 1.10.4 with Subnet-EVM 0.5.2 at first, then removed the db and tried again and still had the same problem. I then removed the db and the plugin and moved to AvalancheGo 1.10.4 with Subnet-EVM 0.5.1, which resolved the issue.

I still have one machine that's currently stuck with the current config. I can try to see if I can reproduce this with just "state-sync-enabled": false

@NickSolante
Author

Coming back to this after a few weeks: I've updated to the latest AvalancheGo 1.10.7 with Subnet-EVM v0.5.3, and the issue still persists. I've also removed "state-sync-enabled": false from the file entirely and get the same issue.

But rolling back to AvalancheGo 1.10.5 with Subnet-EVM v0.5.3 seems to work fine, also with the "state-sync-enabled" flag removed from the config file.

@aaronbuchwald
Collaborator

Thanks for reporting this issue and sorry for the delay.

If you're able to reproduce this easily, could you provide a zip of the full log directory of your node?

Additionally, since the node is getting stuck somewhere without additional logs about where it's spending time, it would be useful to get profiling information from the node. Could you enable continuous profiling on AvalancheGo with this flag: https://docs.avax.network/nodes/configure/avalanchego-config-flags#--profile-continuous-enabled-boolean and on the Subnet-EVM binary with this flag: https://docs.avax.network/nodes/configure/chain-config-flags#continuous-profiler-dir-string.
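For reference, this is roughly what that looks like (the profiles directory path below is just an illustration):

# AvalancheGo node flag (node-level continuous profiling)
avalanchego --profile-continuous-enabled=true ...

# Subnet-EVM chain config, e.g. {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json
{
  "continuous-profiler-dir": "/home/ubuntu/profiles"
}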

Unfortunately the linked logs don't give much additional information to look into the root cause.

@NickSolante
Author

NickSolante commented Aug 21, 2023

Hey @aaronbuchwald thanks for getting back to me on this, I just saw your reply.

It's not really easily reproducible; it's more of a random occurrence at the moment, but we do see most of our nodes getting hit at roughly the same time, and a select few just don't recover. I'll enable profiling on the nodes to see if the issue comes up on one of them.

Please see the attached zip of the logs.

logs.zip

The node whose logs I've attached recovered for a little bit and then died again, so I think this might give you a better outline.

@NickSolante changed the title from "Subnet-EVM v0.5.2 stalls at a random height after bootstrapping (DFK as the example)" to "Subnet-EVM v0.5.3 stalls at a random height after bootstrapping (DFK as the example)" Aug 21, 2023
@NickSolante
Author

> Hello! It seems that bootstrapping was completed, but your node then failed to process messages in a reasonable time.
>
> Which AvalancheGo version were you running when you saw this problem? Also, which versions (Subnet-EVM + AvalancheGo) did you try afterwards and find the issue fixed? When updating to another version, did you remove the database and start from scratch, or did you change nothing other than the versions?
>
> Also, if you're able to reproduce the issue, could you try running without the state-sync option in Subnet-EVM? It should be under {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json as state-sync-enabled. From your logs this is currently turned on ("state-sync-enabled": true); you can turn it off with "state-sync-enabled": false.

I've tried this and it hasn't worked either; I've attached the logs for it above. I first removed the state-sync-enabled flag and restarted the node; it progressed for a bit and has since stalled. The node has now been flushed completely, so I'll try adding in the profiler that Aaron mentioned.

@NickSolante
Author

Please see the attached profiles for the current DFK node; it crashed recently so I thought it would be good to send this through.
profiles.zip

@aaronbuchwald
Collaborator

Thanks for sharing the profiles! Looking at this, it's spending the majority of its time wiping the state snapshot.

The snapshot layer was added to provide an O(1) lookup of the state (instead of looking up values in the merkle trie, which requires an O(log n) lookup). Ref here: https://blog.ethereum.org/2020/07/17/ask-about-geth-snapshot-acceleration.

This is considered a performance requirement, but unfortunately the snapshot can be corrupted if the node dies while it is in the middle of accepting a block. In that case, the snapshot gets wiped and re-generated asynchronously, which can cause degraded performance.

In this case, it looks like running the wiping process async is causing degraded performance of the node such that it struggles to catch up.

We'd like to eliminate the possibility of the snapshot getting corrupted during a shutdown mid block-accept altogether, but in the meantime you may see better performance if you disable async snapshot generation, so that the node waits for the snapshot to be fully regenerated before continuing and escapes this degraded performance loop.

You can set this flag to false: https://docs.avax.network/nodes/configure/chain-config-flags#snapshot-async-boolean to disable async snapshot generation.
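For example, in {config-dir}/q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi/config.json (a minimal sketch; merge with your existing options):

{
  "snapshot-async": false
}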

@NickSolante
Author

NickSolante commented Aug 22, 2023

Thanks, this should be very helpful! I've set this to false now; I'll do the same on my other nodes.

I'll keep the profiler going and see if I catch another crash.

I'll have a read of the blog post as well.

@aaronbuchwald
Collaborator

Made a PR upstream as well to increase the size at which a batch is broken up when flushing the snapshot diff to disk: ethereum/go-ethereum#27977.

We'll migrate this change into Subnet-EVM/Coreth as well and this should significantly reduce the likelihood that a snapshot will be corrupted in the first place.
