Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Validator falls with error: Trie lookup error: Database missing expected key: #6377

Closed
base1217 opened this issue Dec 1, 2022 · 19 comments · Fixed by paritytech/substrate#12902

Comments

@base1217
Copy link

base1217 commented Dec 1, 2022

Hello,

After upgrade to 0.9.33 some of our validators node from time to time fall.
No spikes or bottlenecks detected on hosts meanwhile.

Event bellow:

====================

Version: 0.9.33-c7d6c21242f

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:702:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:588:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:138:18
   4: rust_begin_unwind
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   5: core::panicking::panic_fmt
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   6: core::result::unwrap_failed
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/result.rs:1785:5
   7: <sp_state_machine::ext::Ext<H,B> as sp_externalities::Externalities>::storage
   8: sp_io::storage::get_version_1
   9: sp_io::storage::ExtStorageGetVersion1::call
  10: <F as wasmtime::func::IntoFunc<T,(wasmtime::func::Caller<T>,A1),R>>::into_func::wasm_to_host_shim
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: wasmtime_runtime::traphandlers::catch_traps::call_closure
  15: wasmtime_setjmp
  16: sc_executor_wasmtime::runtime::perform_call
  17: <sc_executor_wasmtime::runtime::WasmtimeInstance as sc_executor_common::wasm_runtime::WasmInstance>::call_with_allocation_stats
  18: sc_executor_common::wasm_runtime::WasmInstance::call_export
  19: sc_executor::native_executor::WasmExecutor<H>::with_instance::{{closure}}
  20: <sc_executor::native_executor::NativeElseWasmExecutor<D> as sp_core::traits::CodeExecutor>::call
  21: sp_state_machine::execution::StateMachine<B,H,Exec>::execute_aux
  22: sp_state_machine::execution::StateMachine<B,H,Exec>::execute_using_consensus_failure_handler
  23: <sc_service::client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
  24: <kusama_runtime::RuntimeApiImpl<__SR_API_BLOCK__,RuntimeApiImplCall> as sp_api::Core<__SR_API_BLOCK__>>::__runtime_api_internal_call_api_at
  25: sp_consensus_babe::BabeApi::current_epoch
  26: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  27: polkadot_node_core_runtime_api::make_runtime_api_request::{{closure}}
  28: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  29: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  30: tokio::runtime::task::raw::poll
  31: std::sys_common::backtrace::__rust_begin_short_backtrace
  32: core::ops::function::FnOnce::call_once{{vtable.shim}}
  33: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/alloc/src/boxed.rs:1940:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/alloc/src/boxed.rs:1940:9
      std::sys::unix::thread::Thread::new::thread_start
             at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys/unix/thread.rs:108:17
  34: start_thread
  35: clone


Thread 'tokio-runtime-worker' panicked at 'Externalities not allowed to fail within runtime: "Trie lookup error: Database missing expected key: 0x88720faab03e11ebe126d613a3020a15809d118d431100619f36faa0b8a4f7d2"', /usr/local/cargo/git/checkouts/substrate-7e08433d4c370a21/2dff067/primitives/state-machine/src/ext.rs:189

This is a bug. Please report it at:

	https://github.com/paritytech/polkadot/issues/new

@zikarad
Copy link

zikarad commented Dec 2, 2022

We've also observed behaviour described in previous comment:

Dec 02 14:27:30 polkadot[5359]: Thread 'tokio-runtime-worker' panicked at 'Externalities not allowed to fail within runtime: "Trie lookup error: Database missing expected key: 0x18aa2d51bb06890ed77960d1cb611ad3ab81c358d88b428f688deb7e943d1b58"', /usr/local/cargo/git/checkouts/substrate-7e08433d4c370a21/2dff067/primitives/state-machine/src/ext.rs:189
Dec 02 14:27:30 polkadot[5359]: This is a bug. Please report it at:
Dec 02 14:27:30 polkadot[5359]:         https://github.com/paritytech/polkadot/issues/new
Dec 02 14:27:30 systemd[1]: blockchain-val1.service: Main process exited, code=exited, status=1/FAILURE
Dec 02 14:27:30 systemd[1]: blockchain-val1.service: Failed with result 'exit-code'.

@bkchr
Copy link
Member

bkchr commented Dec 5, 2022

Does this error directly happens after the upgrade and restart of the binary? What kind of database are you using? Rocksdb or ParityDB?

In general this looks like some db issue, which could hint some disk issue.

@bkchr
Copy link
Member

bkchr commented Dec 5, 2022

@arkpar your recent changes to state-db could they be the cause of this? There is a similar report here: #6389

I checked the trie code and we generate these errors only when fetching data from the db, so it can not be the cache. However, it needs to be some in memory structure as in #6389 was reported to work again after a restart.

@arkpar
Copy link
Member

arkpar commented Dec 5, 2022

@bkchr #6389 could be caused by the database error only if the trie node was found, but contain incorrect data. I don't see how this is possible

This issue however could be caused by an unknown issue with the pruning code. The changes to state-db mostly concern the case when --pruning is set to a value higher than the default 256.

@base1217 What's your --pruning setting?

@bkchr
Copy link
Member

bkchr commented Dec 5, 2022

@bkchr #6389 could be caused by the database error only if the trie node was found, but contain incorrect data. I don't see how this is possible

Why only invalid data? https://github.com/paritytech/trie/blob/02b030a24bc60d46ed7f156888bdbbed6498b216/trie-db/src/lookup.rs#L362-L369 looking at this, it is the same code path and when the db returns None?

@tugytur
Copy link
Contributor

tugytur commented Dec 6, 2022

We encountered this today as well within 24hours after updating two validators.

Both crashed within one second of each other. Both are on different hardware. One has a Intel and the other AMD cpu.

Both validators use Rocksdb

Thread 'tokio-runtime-worker' panicked at 'Externalities not allowed to fail within runtime: "Trie lookup error: Database missing expected key: 0x309faafa96196b156ca1da83066a147276a4beca2f80e1defeaed98bb805aba3"', /usr/local/cargo/git/checkouts/substrate-7e08433d4c370a21/2dff067/primitives/state-machine/src/ext.rs:189
This is a bug. Please report it at:
        https://github.com/paritytech/polkadot/issues/new
Thread 'tokio-runtime-worker' panicked at 'Externalities not allowed to fail within runtime: "Trie lookup error: Database missing expected key: 0x46069249c711ed7583804a506a5bb14ea7b05f048de157da1a8b6795945bc48b"', /usr/local/cargo/git/checkouts/substrate-7e08433d4c370a21/2dff067/primitives/state-machine/src/ext.rs:275
This is a bug. Please report it at:
        https://github.com/paritytech/polkadot/issues/new

And these are the parameters for the validators:

ExecStart=/usr/local/bin/polkadot \
        --name "XXXX" \
        --validator \
        --state-pruning 1000 \
        --no-mdns \
        --public-addr=/ip4/X.X.X.X/tcp/30333 \
        --listen-addr=/ip4/X.X.X.X/tcp/30333 \
        --chain=kusama \
        --database=rocksdb \
        --prometheus-port 9615 \
        --rpc-port 9933 \
        --ws-port 9944 \
        --no-telemetry \
        --keystore-path=/var/blockchain/keystore \
        --base-path=/var/blockchain/data

@bkchr
Copy link
Member

bkchr commented Dec 6, 2022

@tugytur but after restart it works?

@tugytur
Copy link
Contributor

tugytur commented Dec 6, 2022

@bkchr yes. Just crashed and automatically restarted.
One of them was also a p/v at that moment and there were no signs of problems regarding block production or so after the restart.

@paradox-tt
Copy link

I just encountered this error again, this time on another validator.

@prostakers-com
Copy link

prostakers-com commented Dec 6, 2022

I can confirm that for a new validator also, it resolves on restart.

@stakeworld
Copy link
Contributor

I've also had this error, most on my kusama rocksdb node, multiple times, decided to rebuild after that. On paritydb only one slightly different error Failed to write to trie: Database missing expected key: 0x378b864637cdc6280b0d0ace43023d0fabff7e1b333ec467da65d9d66997faef but no recurrences after that. Both on good hardware, There were no disk/memory/dmesg problems. I run with --state-pruning 1000 --blocks-pruning 1000 so could be related to the pruning. No occurences on my archive (paritydb) nodes without pruning.

@arturgontijo
Copy link

Same for my Kusama node with --state-pruning 1000.
Restarting it seems to work for a few hours only...then it crashes again.

@arkpar
Copy link
Member

arkpar commented Dec 7, 2022

@arturgontijo @stakeworld and others seeing this issue, could you please add -l state-db=trace logging and collect logs for this crash?

@arturgontijo
Copy link

Just did it, if it crashes again I'll post the logs here...thanks!

This was referenced Dec 8, 2022
@arkpar
Copy link
Member

arkpar commented Dec 9, 2022

Identified a potential race condition introduced in #12239. State data moved from unfinalized overlay to pruning window becomes temporarily unavailable.

@ordian
Copy link
Member

ordian commented Dec 12, 2022

FWIW, We've still observed this error on Versi with #6423.

@chaindigital
Copy link

Node stops working every hour, has anyone solved this problem?

@bkchr
Copy link
Member

bkchr commented Dec 14, 2022

paritytech/substrate#12927 this should hopefully fix it. It is now tested.

@bkchr
Copy link
Member

bkchr commented Dec 18, 2022

Should be fixed in the next release.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.