[Node Issue]: Validator Migration Giving Errors #12140

Closed
Frixoe opened this issue Sep 11, 2024 · 3 comments
Labels: community (Issues created by community), investigation required

Comments

Frixoe commented Sep 11, 2024

Contact Details

suryansh@luganodes.com

What happened?

We have 3 machines that have been synced using the same config. We were migrating the validator from machine 1 to machine 2, and machine 2 crashed.

So we tried moving the key back to machine 1, and then machine 1 stopped syncing. We resynced machine 2 from the snapshot; whenever the key is not on either machine, both run properly with no issues, but the moment we restart with the key, the nodes stop syncing.

On machine 3, we did a fresh snapshot download and restarted with the validator key. First, we see an error like this:

WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0

Then after a restart, we see:

WARN chunks: Error processing partial encoded chunk: ChainError(InvalidChunkHeight)

and if we restart again, we see:

ERROR metrics: Error when exporting postponed receipts count DB Not Found Error: BLOCK: AX8wFPoyVoULT9N7hMcVLxJtPBYZj4EkBhbhThKYZ7WN.

But once the validator key is removed and the node is restarted, the node syncs with no issues. Then, if we put the validator key back, it stops syncing.

The validator key doesn't seem to be in use on any machine.
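
For reference, this is the kind of check we've been doing by hand after each restart. Below is a small diagnostic sketch (illustrative only, not something we actually ran) that polls the node's /status RPC endpoint, assuming the default RPC address 127.0.0.1:3030, and prints whether a validator key is loaded (validator_account_id) and whether the head height keeps advancing:

# Hypothetical diagnostic sketch: poll the node's /status endpoint and report
# whether a validator key is loaded and whether the head keeps moving.
# Assumes the default RPC address 127.0.0.1:3030; adjust to rpc.addr from config.json.
import json
import time
import urllib.request

RPC_STATUS_URL = "http://127.0.0.1:3030/status"

def get_status():
    with urllib.request.urlopen(RPC_STATUS_URL, timeout=5) as resp:
        return json.load(resp)

last_height = None
while True:
    status = get_status()
    height = status["sync_info"]["latest_block_height"]
    syncing = status["sync_info"]["syncing"]
    # validator_account_id is null when no validator key is loaded on this node
    validator_id = status.get("validator_account_id")
    stalled = last_height is not None and height == last_height
    print(f"height={height} syncing={syncing} validator={validator_id} stalled={stalled}")
    last_height = height
    time.sleep(60)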

We also saw another error while trying to sync machine 3 from the snapshot (after 1 restart of neard) with the validator key on it:

Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]: 2024-09-11T08:47:26.647486Z  WARN near_store::db::rocksdb: target="store::db::rocksdb" making a write batch took a very long time, make smaller transactions! elapsed=8.637442105s backtrace=   0: <near_store::db::rocksdb::RocksDB as near_store::db::Database>::write
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    1: near_store::StoreUpdate::commit
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    2: near_chain::store::ChainStoreUpdate::commit
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    3: near_chain::garbage_collection::<impl near_chain::store::ChainStore>::reset_data_pre_state_sync
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    4: near_client::client_actor::ClientActorInner::run_sync_step
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    5: near_client::client_actor::ClientActorInner::check_triggers
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    6: <actix::address::envelope::SyncEnvelopeProxy<M> as actix::address::envelope::EnvelopeProxy<A>>::handle
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    7: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    8: tokio::runtime::task::raw::poll
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    9: tokio::task::local::LocalSet::tick
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   10: tokio::task::local::LocalSet::run_until::{{closure}}
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   11: std::sys_common::backtrace::__rust_begin_short_backtrace
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   12: core::ops::function::FnOnce::call_once{{vtable.shim}}
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   13: std::sys::pal::unix::thread::Thread::new::thread_start
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   14: <unknown>
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   15: <unknown>

Then, once the logs start moving, we see entries like this:

WARN client: Dropping tx me=Some(AccountId("validator_contract.pool.near")) tx=SignedTransaction { transaction: V0(TransactionV0 { signer_id: AccountId("0-relay.hot.tg"), public_key: ed25519:HD71jFrShGZfUYY2QqtMgq6Y4wCoVffTqwFJ8ayixXL9, nonce: 114347435947781, receiver_id: AccountId("7254785709.tg"), block_hash: 9au2oGsybmTqUUW17Re9wSgM24vrw94m2mvHp4PUcLs7, actions: [Delegate(SignedDelegateAction { delegate_action: DelegateAction { sender_id: AccountId("7254785709.tg"), receiver_id: AccountId("game.hot.tg"), actions: [NonDelegateAction(FunctionCall(FunctionCallAction { method_name: l2_claim, args: eyJjaGFyZ2VfZ2FzX2ZlZSI6ZmFsc2UsInNpZ25hdHVyZSI6IjI0MWQxZDlhNWYxODE1MDViYzA2ZGMxNTJhZTM3NTNiMGIyMGU3OGM1NGRmZGQ0YzU5MWY4YjEzZDRiMTg4NjAiLCJtaW5pbmdfdGltZSI6Ijk4Mjk2MyIsIm1heF90cyI6IjE3MjYwNDIxNTkyNjA0NzEwNDAifQ==, gas: 300000000000000, deposit: 0 }))], nonce: 120626739001019, max_block_height: 128190301, public_key: ed25519:9NHXSXiyLnU2zV7z44CAVDp1jxqnAuF5Fbyxfj7tGeJM }, signature: ed25519:2yeC1AURAAUG55b5rcGnezjna77tJXaX9pTdtHvCmTCE3V1reyKnoya5SnFyfm8kVmiBtpwtUXfLku8VG2Bq76iN })] }), signature: ed25519:3XUsbK8KDvfcumfyVTLn6yTriY5Mdc5Yigk74wuBMHNQgGWih3ZgNwnni3UmUjbqsQUWSHsi2wLY44NmDLzKTv5W, hash: BmEeoZxr1mUwcqeVUnXaeNyN8gCtBDCxceyYQFNVAWpJ, size: 461 }

Version

neard (release 2.2.0) (build 2.2.0) (rustc 1.79.0) (protocol 71) (db 40)
features: [default, json_rpc, rosetta_rpc]

Node type

RPC (Default)

Are you a validator?

  • I am a validator.

Relevant log output

INFO stats: State 4cupEDQnemFXCa6s9Z3ankCBgj4sXDSnb2SP4KJTX51T[0: parts] 32 peers ⬇ 5.13 MB/s ⬆ 4.08 MB/s 0.00 bps 0 gas/s CPU: 296%, Mem: 22.4 GB
INFO stats: State 4cupEDQnemFXCa6s9Z3ankCBgj4sXDSnb2SP4KJTX51T[0: apply in progress] 31 peers ⬇ 5.27 MB/s ⬆ 4.63 MB/s 0.00 bps 0 gas/s CPU: 306%, Mem: 6.47 GB

Node head info

"CHUNK_TAIL": 127685197
"FINAL_HEAD": Tip { height: 127685193, last_block_hash: AX8wFPoyVoULT9N7hMcVLxJtPBYZj4EkBhbhThKYZ7WN, prev_block_hash: 5eeheX9m3Et51gnLenQh8t6SvECYZ5JkfqTgqwHjs6KK, epoch_id: EpochId(A6faGmnyqHh6gZYa8bX8NjrPuPYqY3eG1TtzDS67GXp), next_epoch_id: EpochId(EuzPvfoXd71sfgPRVoiGME3muncV5h3bhkJMc8bm3CCp) }
"FORK_TAIL": 127484451
"GENESIS_JSON_HASH": 93on1kcuqTXU94zGyGvBm3YYpPqCkaM8bssbxndgbeRX
"GENESIS_STATE_ROOTS": [8EhZRfDTYujfZoUZtZ3eSMB9gJyFo5zjscR12dEcaxGU]
"HEAD": Tip { height: 127685195, last_block_hash: D7rQvNeRWaD1fEZywBgEfRxNLrU1mb8GNKSa8tys46eW, prev_block_hash: 4kGUzZKyw1964LMtQsz9ukgLgDn6uV6KZDQyCviTzFRw, epoch_id: EpochId(A6faGmnyqHh6gZYa8bX8NjrPuPYqY3eG1TtzDS67GXp), next_epoch_id: EpochId(EuzPvfoXd71sfgPRVoiGME3muncV5h3bhkJMc8bm3CCp) }
"HEADER_HEAD": Tip { height: 127791665, last_block_hash: EnYcS4d4CqfXR3A2axXuPn4XcJni8qiLsPHLdUR4XimF, prev_block_hash: 5L2GZk24n36fivKJeqkFZKcx2dWLw9XeFPzhnKHbapXU, epoch_id: EpochId(7TUPSvHkWBZS81zzqHi16C2PheG7dJ6svjyvXeH5vWmk), next_epoch_id: EpochId(EsEQwWtjtURiwejU64TR5CVR37kCv6fWi5nN7W4yVRCs) }
"LARGEST_TARGET_HEIGHT": 127685406
"LATEST_KNOWN": LatestKnown { height: 127791665, seen: 1726043931818204310 }
"STATE_SYNC_DUMP:\0\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"STATE_SYNC_DUMP:\u{1}\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"STATE_SYNC_DUMP:\u{2}\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"STATE_SYNC_DUMP:\u{3}\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"SYNC_HEAD": Tip { height: 13740748, last_block_hash: 69A1wh25GwoD2CzEuQhs8D2goWPXqe1Liu3jq1i5tdMS, prev_block_hash: 8ZdbgiXn3JpfGGdMGByMqVE5GppKYrojjHJjgZajNV8Z, epoch_id: EpochId(EeWh36LxiVaZgQRsyCzyAUBhaL5yACKZSjK2vTobAC4d), next_epoch_id: EpochId(7edSVzdsSoo1ujdy79abYv3ztbfx7WDawdhCgKhK5qjj) }
"TAIL": 127484451

Node upgrade history

We were migrating to a node with the latest version (2.2.0) and started facing this issue.

DB reset history

Multiple times today on all our machines.
staffik (Contributor) commented Sep 11, 2024

Could you share config.json for each machine?
Do you move node_key.json too or just validator_key.json?
So after upgrading all machines to 2.2.0, it only works if neither machine runs as a validator?

telezhnaya transferred this issue from near/nearcore-support Sep 24, 2024
telezhnaya (Contributor) commented
@Frixoe do you still have this problem?

telezhnaya added the "investigation required" and "community" labels Sep 27, 2024
Frixoe (Author) commented Sep 28, 2024

@telezhnaya No, we don't.
