Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

'Snapshot bank failed to verify', ledger/src/snapshot_utils.rs:462:9 #8130

Closed
mvines opened this issue Feb 5, 2020 · 10 comments · Fixed by #8218
Closed

'Snapshot bank failed to verify', ledger/src/snapshot_utils.rs:462:9 #8130

mvines opened this issue Feb 5, 2020 · 10 comments · Fixed by #8218
Assignees

Comments

@mvines
Copy link
Contributor

mvines commented Feb 5, 2020

It looks like the bootstrap leader on TdS is serving up a bad snapshot:

$ solana-validator -l ledger -n 35.230.25.59:8001 -o - --expected-shred-version 16135
.
.
.
[2020-02-05T16:29:45.572013797Z INFO  solana_metrics::metrics] datapoint: accounts_db-stores total_count=615i
thread 'main' panicked at 'Snapshot bank failed to verify', ledger/src/snapshot_utils.rs:462:9

Steps to reproduce on the 0.23.2 release:

  1. Fetch and extract bad-snapshot.tgz from https://drive.google.com/drive/folders/1bBteuyTDRWDbCXhf-IgaTVog28teAjYn?usp=sharing
  2. Run solana-validator --ledger bad-snapshot/ -o -
@mvines
Copy link
Contributor Author

mvines commented Feb 5, 2020

solana_runtime::accounts_db] mismatched bank hash for slot 209668: BankHash 9fe1cdc89d1963bc7d782dca2b01c1e0711d8d27bbb4233be8eae71620732960 (calculated) != BankHash f2bfa120240db230730f7220eef3cde8c71684670a992b5c302e0cfeeece6a70 (expected)
solana_runtime::accounts] verify_bank_hash failed: MismatchedBankHash

@mvines
Copy link
Contributor Author

mvines commented Feb 5, 2020

Newer snapshots from the bootstrap leader produce a similar bankhash failure:

solana_runtime::accounts_db] mismatched bank hash for slot 216868: BankHash 6a3998f301818041a22e16270e6109ff214414a24e47cc24d79f6cce6a0ec855 (calculated) != BankHash 0767f41bb89551cdac5949cdcb9305f7974f1de2ff6ac4430f5b8726a4b38b45 (expected)

@mvines
Copy link
Contributor Author

mvines commented Feb 5, 2020

I picked up a new snapshot at slot 221774 from the BSV. BSV log output (bank hash is 80aac54190...)

solana_core::replay_stage] new fork:221774 parent:221773 (leader) root:221742
solana_runtime::bank] bank frozen: 221774 hash: 6ha2U1wpt5mAxHzVC4spw8DPDBNdWNR2A1rYPAfrKpby accounts_delta: BankHash 80aac54190539e6778eeebb99b0dcee980165e4848520a5d63f567ac76f33249 signature_count: 122 last_blockhash: 71UxsJwaSpFNxQ2j2M8Sh2hZVVydy64VjdNFuyH7NQLm
solana_runtime::bank] accounts hash slot: 221774 stats: BankHashStats { num_removed_accounts: 113, num_added_accounts: 0, num_lamports_stored: 8620136224739, total_data_len: 333836, num_executable_accounts: 0 }

but when I boot my validator, I get bank hash edf4a9a92947...

solana_runtime::accounts_db] mismatched bank hash for slot 221774: BankHash edf4a9a929474feb7699b4535effc2e1361d5708f97f023abb318c44b84e7159 (calculated) != BankHash 80aac54190539e6778eeebb99b0dcee980165e4848520a5d63f567ac76f33249 (expected)

@mvines
Copy link
Contributor Author

mvines commented Feb 5, 2020

While the BSV is still serving bad snapshots (accounts_db::verify_bank_hash() fails), the node at 62.171.132.17 is serving good snapshots. Unfortunately the BSV is pruning its ledger so we may not have the full history to go back and figure out when the bad hash was introduced.

However I've started up another node at -n 51.143.93.203:8115, booting it from a good snapshot, and have it configured to preserve all future ledger.

Next up is to start digging into the differences between the BSV and the new node

@mvines mvines self-assigned this Feb 5, 2020
@ryoqun
Copy link
Contributor

ryoqun commented Feb 5, 2020

@mvines Sorry for the disturbance.. I'll look into this.

@mvines
Copy link
Contributor Author

mvines commented Feb 6, 2020

Yes please!

The best plan I have right now is to compare the snapshots produced by 35.230.25.59:8001 (bad) with those produced by 51.143.93.203:8001 (good)

Both nodes are at the tip of the cluster and generally voting in the same way, I suspect there's something corrupt in the accounts files of 35.230.25.59.

If you need more access/logging from these machines let me know

@sakridge
Copy link
Contributor

sakridge commented Feb 6, 2020

@ryoqun one idea is maybe backing out the latest snapshot changes. How confident are you that that might have affected something?

@mvines
Copy link
Contributor Author

mvines commented Feb 6, 2020

Ok, at https://drive.google.com/drive/u/1/folders/1uD-ll87-5pmKvdSfQDguQUreR18K94_1 I've put two snapshots, both of slot 281297. One from the "bad" node, one from the "good" node. The snapshots themselves are very different in size but I've done no further analysis on it beyond observing that solana-ledger-tool verify accepts the good one and rejects the bad one.

@ryoqun
Copy link
Contributor

ryoqun commented Feb 6, 2020

status report: I have found an oddity.

@mvines Thanks for preparing good and bad snapshots with identical slot. That greatly accelerated the investigation.

Seems that account db is too eagerly purging AppendVec in some corner case. I'll try to fix it first quickly. But if that is found too hard for me, I'll share all the findings.

@ryoqun one idea is maybe backing out the latest snapshot changes. How confident are you that that might have affected something?

@sakridge Yeah, I've suspected my recent #7892 . But it seems that it's not culprit.

Sadly, we've found yet another long standing unnoticed bug.

@ryoqun
Copy link
Contributor

ryoqun commented Feb 6, 2020

I've left the above findings at #8148.

I moved the findings to #8176.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants