Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Applying WASM snapshot fails occasionally #3780

Open
raykyri opened this issue Oct 7, 2019 · 4 comments
Open

Applying WASM snapshot fails occasionally #3780

raykyri opened this issue Oct 7, 2019 · 4 comments

Comments

@raykyri
Copy link

raykyri commented Oct 7, 2019

Our nodes occasionally panic when the executor's WASM snapshot gets out of sync. This happens more frequently on different nodes than others (e.g. nodes that have been running for longer / were started earlier).

We're using this version of Substrate: e0d486d

Oct 07 00:35:25 edg09-1 run_node.sh[20824]: ====================
Oct 07 00:35:25 edg09-1 run_node.sh[20824]: Version: 1.0.0-5a52390-x86_64-linux-gnu
Oct 07 00:35:25 edg09-1 run_node.sh[20824]: stack backtrace:
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    0: substrate_panic_handler::set::{{closure}}
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    1: std::panicking::rust_panic_with_hook
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libstd/panicking.rs:481
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    2: std::panicking::continue_panic_fmt
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libstd/panicking.rs:384
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    3: rust_begin_unwind
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libstd/panicking.rs:311
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    4: core::panicking::panic_fmt
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libcore/panicking.rs:85
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    5: core::result::unwrap_failed
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libcore/result.rs:1084
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    6: std::thread::local::LocalKey<T>::with
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    7: substrate_state_machine::StateMachine<B,H,N,T,O,Exec>::execute_aux
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    8: substrate_state_machine::StateMachine<B,H,N,T,O,Exec>::execute_using_consensus_failure_handler
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:    9: <substrate_client::call_executor::LocalCallExecutor<B,E> as substrate_client::call_executor::CallExecutor<Block,substrate_primitives
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   10: <substrate_client::client::Client<B,E,Block,RA> as substrate_client::runtime_api::CallRuntimeAt<Block>>::call_api_at
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   11: substrate_client::runtime_api::runtime_decl_for_Core::initialize_block_call_api_at
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   12: substrate_client::runtime_api::Core::initialize_block
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   13: <substrate_client::call_executor::LocalCallExecutor<B,E> as substrate_client::call_executor::CallExecutor<Block,substrate_primitives
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   14: <substrate_client::client::Client<B,E,Block,RA> as substrate_client::runtime_api::CallRuntimeAt<Block>>::call_api_at
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   15: substrate_client::runtime_api::runtime_decl_for_Metadata::metadata_call_api_at
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   16: substrate_client::runtime_api::Metadata::metadata
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   17: <substrate_rpc::state::state_full::FullState<B,E,Block,RA> as substrate_rpc::state::StateBackend<B,E,Block,RA>>::metadata
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   18: <jsonrpc_core::delegates::DelegateAsyncMethod<T,F> as jsonrpc_core::calls::RpcMethod<M>>::call
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   19: <futures::future::lazy::Lazy<F,R> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   20: futures::future::chain::Chain<A,B,C>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   21: <futures::future::then::Then<A,B,F> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   22: <futures::future::map::Map<A,F> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   23: <futures::future::either::Either<A,B> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   24: <futures::future::map::Map<A,F> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   25: <futures::future::select::Select<A,B> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   26: <futures::future::map::Map<A,F> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   27: <futures::future::map_err::MapErr<A,F> as futures::future::Future>::poll
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   28: futures::task_impl::std::set
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   29: std::panicking::try::do_call
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   30: __rust_maybe_catch_panic
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libpanic_unwind/lib.rs:80
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   31: tokio_threadpool::task::Task::run
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   32: tokio_threadpool::worker::Worker::run_task
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   33: tokio_threadpool::worker::Worker::run
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   34: std::thread::local::LocalKey<T>::with
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   35: std::thread::local::LocalKey<T>::with
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   36: std::thread::local::LocalKey<T>::with
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   37: tokio::runtime::threadpool::builder::Builder::build::{{closure}}
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   38: std::thread::local::LocalKey<T>::with
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   39: std::thread::local::LocalKey<T>::with
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   40: std::sys_common::backtrace::__rust_begin_short_backtrace
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   41: std::panicking::try::do_call
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   42: __rust_maybe_catch_panic
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libpanic_unwind/lib.rs:80
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   43: core::ops::function::FnOnce::call_once{{vtable.shim}}
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   44: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs:787
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   45: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs:787
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:       std::sys_common::thread::start_thread
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libstd/sys_common/thread.rs:13
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:       std::sys::unix::thread::Thread::new::thread_start
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:              at src/libstd/sys/unix/thread.rs:79
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   46: start_thread
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:   47: __clone
Oct 07 00:35:25 edg09-1 run_node.sh[20824]: Thread 'jsonrpc-eventloop-0' panicked at 'applying the snapshot can only fail if the passed instance is different
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:                         from the one that was used for creation of the snapshot;
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:                         we use the snapshot that is directly associated with the instance;
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:                         thus the snapshot was created using the instance;
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:                         qed: ApplySnapshotFailed', src/libcore/result.rs:1084
Oct 07 00:35:25 edg09-1 run_node.sh[20824]: This is a bug. Please report it at:
Oct 07 00:35:25 edg09-1 run_node.sh[20824]:         https://github.com/hicommonwealth/edgeware-node/issues/new
Oct 07 00:35:25 edg09-1 systemd[1]: edgeware.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 00:35:25 edg09-1 systemd[1]: edgeware.service: Failed with result 'exit-code'.

Sometimes this happens with the offchain worker API:

Oct 07 03:10:43 edg09-1 run_node.sh[9920]: Version: 1.0.0-5a52390-x86_64-linux-gnu
Oct 07 03:10:43 edg09-1 run_node.sh[9920]: stack backtrace:
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    0: substrate_panic_handler::set::{{closure}}
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    1: std::panicking::rust_panic_with_hook
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libstd/panicking.rs:481
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    2: std::panicking::continue_panic_fmt
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libstd/panicking.rs:384
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    3: rust_begin_unwind
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libstd/panicking.rs:311
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    4: core::panicking::panic_fmt
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libcore/panicking.rs:85
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    5: core::result::unwrap_failed
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libcore/result.rs:1084
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    6: std::thread::local::LocalKey<T>::with
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    7: substrate_state_machine::StateMachine<B,H,N,T,O,Exec>::execute_aux
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    8: substrate_state_machine::StateMachine<B,H,N,T,O,Exec>::execute_using_consensus_failure_handler
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:    9: <substrate_client::call_executor::LocalCallExecutor<B,E> as substrate_client::call_executor::CallExecutor<Block,substrate_primitives:
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   10: <substrate_client::client::Client<B,E,Block,RA> as substrate_client::runtime_api::CallRuntimeAt<Block>>::call_api_at
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   11: substrate_offchain_primitives::runtime_decl_for_OffchainWorkerApi::offchain_worker_call_api_at
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   12: substrate_offchain_primitives::OffchainWorkerApi::offchain_worker_with_context
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   13: std::sys_common::backtrace::__rust_begin_short_backtrace
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   14: std::panicking::try::do_call
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   15: __rust_maybe_catch_panic
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libpanic_unwind/lib.rs:80
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   16: core::ops::function::FnOnce::call_once{{vtable.shim}}
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   17: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs:787
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   18: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs:787
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:       std::sys_common::thread::start_thread
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libstd/sys_common/thread.rs:13
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:       std::sys::unix::thread::Thread::new::thread_start
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:              at src/libstd/sys/unix/thread.rs:79
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   19: start_thread
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:   20: __clone
Oct 07 03:10:43 edg09-1 run_node.sh[9920]: Thread '<unnamed>' panicked at 'applying the snapshot can only fail if the passed instance is different
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:                         from the one that was used for creation of the snapshot;
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:                         we use the snapshot that is directly associated with the instance;
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:                         thus the snapshot was created using the instance;
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:                         qed: ApplySnapshotFailed', src/libcore/result.rs:1084
Oct 07 03:10:43 edg09-1 run_node.sh[9920]: This is a bug. Please report it at:
Oct 07 03:10:43 edg09-1 run_node.sh[9920]:         https://github.com/hicommonwealth/edgeware-node/issues/new
Oct 07 03:10:43 edg09-1 systemd[1]: edgeware.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 03:10:43 edg09-1 systemd[1]: edgeware.service: Failed with result 'exit-code'.
@bkchr
Copy link
Member

bkchr commented Oct 8, 2019

CC @pepyakin as you implemented this :)

@pepyakin
Copy link
Contributor

pepyakin commented Oct 28, 2019

  1. Does it still happen?
  2. Do you have an idea what is the difference between these machines?
  3. Is there anything special about those machines? Do they have similar amounts of memory available ? Swap? Version of Linux? Anything?
  4. What do you see if you execute wasm2wat <path_to_your_runtime>.wasm | grep \(memory (wasm2wat is from wabt, you can try to use the web version as well)?

@pepyakin
Copy link
Contributor

Friendly ping @raykyri ! It would be useful to get some info here.

@raykyri
Copy link
Author

raykyri commented Nov 13, 2019

This seems to still be happening, although more rarely than before, which I think is because we're running fewer validators now. I believe it is only happening on validator boxes from my search through logs, but we only have a couple weeks of logs.

We're running on 4GB mem 8GB swap Ubuntu 18.04 boxes.

This is what our runtime says:

/Users/raymond/Development/edgeware-node/chains% wasm2wat testnet-0.9.2.compact.wasm | grep \(memor
  (memory (;0;) 18)
  (export "memory" (memory 0))

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants