OP-Reth: Base node crashes, v1.0.8, --engine.experimental
#11471
Comments
What's the command used to run the node? |
exec ${HOME}/bin/op-reth/${OP_RETH_VER}/op-reth node --chain=base |
These are all the prune settings you can set, either on the CLI or in the config file (part of the node's help output):
Pruning:
--full
Run full node. Only the most recent [`MINIMUM_PRUNING_DISTANCE`] block states are stored
--block-interval <BLOCK_INTERVAL>
Minimum pruning interval measured in blocks
[default: 0]
--prune.senderrecovery.full
Prunes all sender recovery data
--prune.senderrecovery.distance <BLOCKS>
Prune sender recovery data before the `head-N` block number. In other words, keep last N + 1 blocks
--prune.senderrecovery.before <BLOCK_NUMBER>
Prune sender recovery data before the specified block number. The specified block number is not pruned
--prune.transactionlookup.full
Prunes all transaction lookup data
--prune.transactionlookup.distance <BLOCKS>
Prune transaction lookup data before the `head-N` block number. In other words, keep last N + 1 blocks
--prune.transactionlookup.before <BLOCK_NUMBER>
Prune transaction lookup data before the specified block number. The specified block number is not pruned
--prune.receipts.full
Prunes all receipt data
--prune.receipts.distance <BLOCKS>
Prune receipts before the `head-N` block number. In other words, keep last N + 1 blocks
--prune.receipts.before <BLOCK_NUMBER>
Prune receipts before the specified block number. The specified block number is not pruned
--prune.accounthistory.full
Prunes all account history
--prune.accounthistory.distance <BLOCKS>
Prune account history before the `head-N` block number. In other words, keep last N + 1 blocks
--prune.accounthistory.before <BLOCK_NUMBER>
Prune account history before the specified block number. The specified block number is not pruned
--prune.storagehistory.full
Prunes all storage history data
--prune.storagehistory.distance <BLOCKS>
Prune storage history before the `head-N` block number. In other words, keep last N + 1 blocks
--prune.storagehistory.before <BLOCK_NUMBER>
Prune storage history before the specified block number. The specified block number is not pruned
--prune.receiptslogfilter <FILTER_CONFIG>
Configure receipts log filter. Format: <`address`>:<`prune_mode`>[,<`address`>:<`prune_mode`>...] Where <`prune_mode`> can be 'full', 'distance:<`blocks`>', or 'before:<`block_number`>' |
I just attempted to sync with the "consensus-layer" param from scratch on base (--full flag defined), and it appears my op-reth node got completely stuck:
Is there a way to recover the progress and tell it to chill out and reject/drop the payload and carry on? |
fwiw it was syncing along for a few hours, and this is at block |
Here's the hildr logs when op-reth gets stuck in a crash loop:
The part that I thought was curious is that it will have a bunch of these:
logs where there's an "invalid timestamp mismatch" -- these appear to be wall-clock times. When I attempted to sync from the Base snapshot, the timestamp difference was more drastic since it came from the zipped file. Also, it isn't clear to me whether hildr is actually purging these "batch" records that op-node is leaving behind, since if I restart the process it seemingly has at least some of the same timestamps as the previous run. |
I was able to un-stick the sync by using the advice of @BowTiedDevil in this thread (thank you!) -- #11570. I am not doing it in a container, so I just ran it directly. After I did this it was able to continue to sync; it seems like the exception handling should roll back the offending record that causes the crash and delete it so the op-node/hildr/magi process can re-fetch (and do this back to the epoch boundary rather than genesis). My main concern for now, I guess, is how to see/delete/expire/purge these "batch" records in a targeted way, since it seems like it would be faster for |
glad it worked out! did your node sync past the point it got stuck at first? mind that the experimental engine is experimental, and hence for performance guarantees it is still recommended to use the traditional engine |
--engine.experimental
It's still syncing, currently at Wanted to also verify -- is using |
use |
Hmm, I just tried switching back (at tip of main branch, b787d9e)
However this is incorrect.. I see: Still showing: which seems wrong |
I didn't include this in my comments in #11570, but I was already using execution-layer sync mode for op-node when I got this same crash. Here's my op-node container config for reference (managed through Podman and systemd):
|
Let's focus this discussion on the I also think that the finalized / safe hashes being zero is fine, and not relevant to this issue. |
Also @DaveWK is this parent hash mismatch issue reproducible with a restart? |
I've pulled journalctl logs from my op-node container for the 30s prior to the op-reth crash, which occurred at 11:52:50. No errors in the log until after the crash, when op-reth goes down and op-node can't communicate with it.
|
Yes -- if I restart op-node/reth it will continue looping over downloading all the headers from latest to 0, and then complain about another allegedly "bad block" when running with execution-layer and unwinding. I swapped back to consensus-layer and am waiting to see if it heals itself and eventually starts syncing and advancing the "actually correct" canonical chain |
I'm optimistic 😜 that this commit will fix things for me -- #11623; building it now. At the very least this will stop hildr from crash-looping with null-value exceptions. My working theory is that the bogus block hashes come from a forkchoice somewhere that gets hashed with a 0x000 parent block and canonicalized, and then it starts unwinding the real blocks (or something along those lines). |
I'll try it now too, if it doesn't cause any problems in the next few days, I'll close the issue |
I'm doing a from-scratch sync of a base node. When I tried the new version with my existing db it started a rewind and seemed to get stuck/move very slowly, but it probably has to do with some edge case of switching between execution/consensus sync and with/without experimental a few times while troubleshooting. One last question (can also open a new ticket) -- is there some way to make it leverage the static_files for a resync? I notice when I just delete the |
Now it immediately fails when I attempt a from-scratch sync on execution-layer, it seems; it winds back, and is seemingly stuck in the
|
This is pretty weird, that error is actually a failure in the execution stage. Just as an update I don't have any leads on this so far, and have not been able to reproduce this. I don't know whether or not this has to do with hildr failing, or op-node, or op-reth, or some interaction between the two |
I am attempting to sync from scratch. Wondering also if the genesis block(s) hashes should be the lowest/floor values for |
If you try from |
Cool, my previous run when I got that most recent error was at d027b7b but am building |
Currently passed the previous failure; going to be in the execution stage for a while:
current status:
|
This error has occurred again on v1.1.0. Here is roughly a minute of logs from the container before it crashes:
And here are the logs from op-node prior to the crash and for about 10s following:
After restarting the node with experimental engine disabled, I captured the following interesting info including a message about a mismatched parent hash and a bad block encountered during a rewind:
|
The bad block (hash 0x7fd6dda911d2d5029a6026896f9c9ca13fc51e28ad7b6be5ae3922824cd4a705) came from op-node at 15:45:36. That same block is marked as ignored because of a mismatched parent hash at 15:45:37. Here are the logs during that 10s window:
|
I appear to be slowly making it through my first pass of the Execution stage on commit 1ba631b. This has been running ~2 days now, so let me know if there are recent commits that would be good to pick up, but I would rather not restart from scratch (losing 2 days of sync) unless there are relevant commits |
Seems possibly relevant: #11651 |
I met a similar issue on Ethereum mainnet, logs as below:
reth version: $ ./bin/reth --version
reth Version: 1.1.0-dev
Commit SHA: dfcaad4608797ed05a8b896fcfad83a83a6292af
Build Timestamp: 2024-10-18T02:27:42.090653312Z
Build Features: asm_keccak,jemalloc
Build Profile: release
Reth run with:
reth node \
--datadir=/data \
--http \
--port=5500 \
--http.addr=0.0.0.0 \
--http.port=5545 \
--http.api=admin,debug,eth,net,trace,txpool,web3,rpc,ots \
--http.corsdomain=* \
--ws \
--ws.addr=0.0.0.0 \
--ws.port=5546 \
--ws.origins=* \
--ws.api=admin,debug,eth,net,trace,txpool,web3,rpc \
--authrpc.addr=0.0.0.0 \
--authrpc.port=8551 \
--authrpc.jwtsecret=/jwt/jwt.hex \
--metrics=0.0.0.0:5550 \
--rpc.max-request-size=50 \
--rpc.max-response-size=1024 \
--rpc.max-subscriptions-per-connection=1024 \
--rpc.max-connections=1000 \
--rpc.max-tracing-requests=20 \
--rpc.max-blocks-per-filter=500000 \
--rpc.max-logs-per-response=1000000 \
--color=never |
Still caught in an endless loop of the stages and unable to keep up.. Still seeing: |
One thing I am noticing in the "execution" and "merkleExecute" stages is that when I restart the node it goes fast for a few checkpoints, i.e. in the 1.5 Ggas range, then it seems to drop down considerably to around 100 Mgas when it adds peers. Is there perhaps a thread associated with p2p that is making the execution threads block? |
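One way to probe that hypothesis is to temporarily run with peering capped or discovery disabled and see whether the higher Ggas-range throughput persists, at least over a stretch of already-downloaded blocks. This is only a sketch; the networking flags below are an assumption on my part, so confirm them against op-reth node --help before relying on them:
# assumption: these reth networking flags are available in this op-reth build
exec ${HOME}/bin/op-reth/${OP_RETH_VER}/op-reth node --chain=base \
  --max-outbound-peers 0 \
  --max-inbound-peers 0 \
  --disable-discovery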
the ignored safe/finalized hashes issue has been fixed on main. need to think about your p2p idea |
One thing I noticed during these stages, as per the troubleshooting docs, is that the Commit total_duration seemed to be about 2-3 µs, which seems to be the expected latency/timing. Not sure what logging setting I could adjust, but whenever it seems to wake up and do "pinging boot nodes" there seems to be a stall and no other logging..
|
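If the stalls around "pinging boot nodes" need more visibility, bumping log verbosity is the simplest knob. Treat this as a sketch, since the available verbosity flags and levels depend on the op-reth version -- verify against op-reth node --help:
# assumption: the repeatable -v verbosity flag exists in this op-reth build
exec ${HOME}/bin/op-reth/${OP_RETH_VER}/op-reth node --chain=base -vvv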
Not to be argumentative, but I have been running off main for a few days now (gradually updating my commit) and my op-reth chain still does not appear to be storing the L2 safe/finalized heads, as they still show up as 0x00.. in the op-node logs. I would have expected, now that it's looping over all 12 stages and then starting from stage 1 again, that it would have at some point set finalized/safe to something other than 0x00.. Could it have persisted the incorrect 0x00 value and never have any integrity checking between being "unset" and being "0x00"? Are these values supposed to be derived from entries in the |
The safe / finalized hashes are set here if there is no safe head / finalized head: Had to do some investigation on what the
Let me double check that this is the only way for op-node to set the finalized / safe hashes, and that it is the only way for them to be set at all. |
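For context on how those hashes reach op-reth at all: op-node drives them through the Engine API forkchoice update on the authenticated engine port. A request with unset safe/finalized heads looks roughly like the JSON below (illustrative only -- the head hash is a placeholder, the call is JWT-authenticated, and the exact method version depends on the active hardfork):
{"jsonrpc":"2.0","id":1,"method":"engine_forkchoiceUpdatedV3","params":[{"headBlockHash":"0x<current head hash>","safeBlockHash":"0x0000000000000000000000000000000000000000000000000000000000000000","finalizedBlockHash":"0x0000000000000000000000000000000000000000000000000000000000000000"},null]}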
hmm, when I request the finalized block on startup when running base as:
curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["finalized", false],"id":67}' http://localhost:8545
I get:
{"jsonrpc":"2.0","id":67,"result":null}
Which is notably not the |
Hmm, so I am able to query and get a I am currently waiting for another stages loop to finish, as I am currently in another run of |
I am not 100% sure if this issue is related to my issue, but I get a similar error:
I switched to the new engine yesterday on Image version:
Update: Custom prune setting:
|
I also have the same error on Base |
Describe the bug
Oct 03 21:21:31 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: 2024-10-03T21:21:31.173888Z INFO Status connected_peers=11 stage=MerkleExecute checkpoint=20477501 target=20589185
Oct 03 21:21:32 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: thread 'tokio-runtime-worker' panicked at /data/base-user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/alloy-trie-0.6.0/src/hash_builder/mod.rs:116:9:
Oct 03 21:21:32 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: add_leaf key Nibbles("0503070103010e05090a0003050f0a09050c0f0009000d0b05010f030c070a060e0e0e03020a01050a09000f0d01090a0d02070c080c080809000e0600030a07") self.key Nibbles("060e0b03020d0e01010a0604080b02070a0c0d0b0b020a070a0c050b070b0f0f0f01010c0a010d0b0905080f03010700020b07090f06090d0d010f0f0709060c")
Oct 03 21:21:32 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: stack backtrace:
Oct 03 21:21:32 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: 2024-10-03T21:21:33.040583Z ERROR Critical task `pipeline task` panicked: add_leaf key Nibbles("0503070103010e05090a0003050f0a09050c0f0009000d0b05010f030c070a060e0e0e03020a01050a09000f0d01090a0d02070c080c080809000e0600030a07") self.key Nibbles("060e0b03020d0e01010a0604080b02070a0c0d0b0b020a070a0c050b070b0f0f0f01010c0a010d0b0905080f03010700020b07090f06090d0d010f0f0709060c")
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: 2024-10-03T21:21:33.040598Z ERROR backfill sync task dropped err=channel closed
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: 2024-10-03T21:21:33.040620Z ERROR Fatal error in consensus engine
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: 2024-10-03T21:21:33.040659Z ERROR shutting down due to error
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: Error: Fatal error in consensus engine
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: Location:
Oct 03 21:21:33 base-mainnet-reth.infra.proofofalpha.co start_op-reth.sh[38658]: /data/base-user/src/ret
Steps to reproduce
Attempt to run a base node; added a pruning config because I didn't see any and it got too slow without one:
Node logs
Platform(s)
Linux (x86)
What version/commit are you on?
v1.0.8 (make op-reth)
What database version are you on?
./bin/op-reth/v1.0.8/op-reth db --datadir ./oprethdata version
2024-10-03T21:30:34.226654Z INFO Initialized tracing, debug log directory: /data/base-user/.cache/reth/logs/dev
Current database version: 2
Local database version: 2
Which chain / network are you on?
base
What type of node are you running?
Pruned with custom reth.toml config
What prune config do you use, if any?
[prune]
block_interval = 5
[prune.segments]
sender_recovery = "full"
# transaction_lookup is not pruned
receipts = { before = 11052984 } # Beacon Deposit Contract deployment block: https://etherscan.io/tx/0xe75fb554e433e03763a1560646ee22dcb74e5274b34c5ad644e7c0f619a7e1d0
account_history = { distance = 10_064 }
storage_history = { distance = 10_064 }
[prune.segments.receipts_log_filter]
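For reference, a roughly equivalent setup expressed with the CLI flags listed earlier in the thread would look like the sketch below. This is illustrative only -- the TOML section above is sufficient on its own, and the receipts cutoff block is simply the value from that config, not a Base-specific recommendation:
# rough CLI equivalent of the [prune] section above; flag spellings taken
# from the Pruning help output earlier in the thread
op-reth node --chain=base \
  --block-interval 5 \
  --prune.senderrecovery.full \
  --prune.receipts.before 11052984 \
  --prune.accounthistory.distance 10064 \
  --prune.storagehistory.distance 10064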
If you've built Reth from source, provide the full command you used
make maxperf-op