
Consensus (bank hash) mismatch between master and v1.13/v1.14 (mnb) #31365

Closed
steviez opened this issue Apr 27, 2023 · 4 comments · Fixed by #31371

Comments

@steviez
Contributor

steviez commented Apr 27, 2023

Problem

On April 23, nodes running against master were observed to have diverged from mainnet. Nodes running various commits off of master forked off at slot 190061698 and hit this panic:

thread 'solReplayStage' panicked at 'We have tried to repair duplicate slot: 190061698 more than 10 times
and are unable to freeze a block with bankhash J5GobCrZZLvvj8EW72BV1Q7973TefL7f3qtqGinebCWF,
instead we have a block with bankhash Some(FqJHBuSpEXHGM5Rt5pk1hWuKQh2Wg3iyQnyiQ4kwT8NW).
This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', 

The incorrect bank hash and the associated fields that feed into it are:

bank frozen: 190061698
hash: FqJHBuSpEXHGM5Rt5pk1hWuKQh2Wg3iyQnyiQ4kwT8NW
accounts_delta: nSAB5dSBaN94HF7UNT4ZwKna4rYfZKMUPLoA9ZRAHTE
signature_count: 1640 
last_blockhash: 4Nez6awxz3mqyEPPtvmBErhaUVFgCd1Us1GVsr9kFEU4
capitalization: 545940604106710686,
stats: BankHashStats {
	num_updated_accounts: 4976,
	num_removed_accounts: 21,
	num_lamports_stored: 2524779513915279,
	total_data_len: 20916547,
	num_executable_accounts: 0
}

whereas the correct hash produced by v1.13/v1.14 looks like:

bank frozen: 190061698
hash: J5GobCrZZLvvj8EW72BV1Q7973TefL7f3qtqGinebCWF
accounts_delta: qnhRzBRp4Uapif5QewsCkLaEVL2zgLAiBvMacbchn45
signature_count: 1640
last_blockhash: 4Nez6awxz3mqyEPPtvmBErhaUVFgCd1Us1GVsr9kFEU4
capitalization: 545940604106710686
accounts hash slot: 190061698
stats: BankHashStats {
	num_updated_accounts: 4984,
	num_removed_accounts: 22,
	num_lamports_stored: 2524854860183319,
	total_data_len: 23534954,
	num_executable_accounts: 1
}

At first glance, there are obvious differences in several of these numbers, which points to a transaction having produced a different execution result between the two versions.
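For context, here is a rough sketch of how these logged fields feed into the frozen bank hash. The composition below is my paraphrase of the runtime's Bank::hash_internal_state from around this era, not something quoted in this issue. Since signature_count and last_blockhash match in both dumps, the divergent accounts_delta (i.e. different post-replay account state) is what flips the hash.

use solana_sdk::hash::{hashv, Hash};

// Assumed composition (sketch): the parent bank hash, the accounts delta hash,
// the signature count, and the last blockhash are hashed together.
fn bank_hash(
    parent_hash: &Hash,
    accounts_delta: &Hash,
    signature_count: u64,
    last_blockhash: &Hash,
) -> Hash {
    hashv(&[
        parent_hash.as_ref(),
        accounts_delta.as_ref(),
        &signature_count.to_le_bytes(),
        last_blockhash.as_ref(),
    ])
}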

Adding some extra logging, I determined that the transaction (explorer link here) that executed differently between versions is:

6ujGsjkwWJtE3VsraD97PBrhWpJaKGVfbxQAyV5thV32NTR9HWa6PjJgqGComn8y9oXBJMc1iT1NwZRLWLm4aM8

According to the basic logs in explorer, the above transaction fully succeeded on v1.13 / v1.14. However, in the debug logs from replaying with master via solana-ledger-tool, I see the following for this tx:

[...] Program ComputeBudget111111111111111111111111111111 invoke [1]
[...] Program ComputeBudget111111111111111111111111111111 success
[...] Program SMPLecH534NA9acpos4G6x7uf3LWbCAwZQE9e8ZekMu invoke [1]
[...] Program log: Instruction: ExecuteTransaction
[...] Program hvsrNC3NKbcryqDs2DocYHZ9yPKEVzdSjQG6RVtK1s8 invoke [2]
[...] Program has been closed
[...] Program hvsrNC3NKbcryqDs2DocYHZ9yPKEVzdSjQG6RVtK1s8 failed: invalid account data for instruction
[...] Program SMPLecH534NA9acpos4G6x7uf3LWbCAwZQE9e8ZekMu consumed 17573 of 1400000 compute units
[...] Program SMPLecH534NA9acpos4G6x7uf3LWbCAwZQE9e8ZekMu failed: invalid account data for instruction
...
[...] tx error: InstructionError(1, InvalidAccountData) SanitizedTransaction { ... }

Proposed Solution

@steviez
Contributor Author

steviez commented Apr 27, 2023

I have determined that aebc191 (#30703) introduced the breaking change (see the command sketch after this list):

  • Running solana-ledger-tool with aebc191 produces the incorrect bank hash FqJHBuSpEXHGM5Rt5pk1hWuKQh2Wg3iyQnyiQ4kwT8NW for slot 190061698
  • Running solana-ledger-tool with the parent commit e7887cf produces the correct bank hash J5GobCrZZLvvj8EW72BV1Q7973TefL7f3qtqGinebCWF for slot 190061698
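Each data point above comes from replaying the ledger at the respective commit and comparing the bank hash logged for slot 190061698 against the two hashes above. A typical invocation would look roughly like the following; the exact flags are my assumption of a standard solana-ledger-tool run, not copied from this issue:

solana-ledger-tool verify --ledger <LEDGER_DIR> --halt-at-slot 190061698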

@pgarg66 @Lichtso - Can you two please look into this further, given that you were the author and reviewer on this PR?

@Lichtso
Contributor

Lichtso commented Apr 27, 2023

I think I have an idea of what is going on:

First of all, in the TX in question, the program hvsrNC3NKbcryqDs2DocYHZ9yPKEVzdSjQG6RVtK1s8 is first executed via CPI (meaning it is passed as an instruction account in a top level instruction) and then upgraded via CPI (meaning it is set to writable in the TX). Either of these conditions triggers the special case we added in the PR in question here: aebc191#diff-a6df749724ef3698bf0b90625702f100670b58c585f91da8725f78a94b04196eR388

However, if both occur simultaneously on the same program in the same TX, as is the case here, the function filter_executable_program_accounts() incorrectly filters the program out here:

if !tx.message().is_writable(i) && !result.contains_key(key) {
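To make the failure mode concrete, here is a minimal, self-contained illustration (not the actual Solana code) of how that condition behaves. A program that is upgraded via CPI is marked writable in the TX, so the check short-circuits and the program is never collected:

// Mirrors `!tx.message().is_writable(i) && !result.contains_key(key)`.
fn should_collect(is_writable_in_tx: bool, already_collected: bool) -> bool {
    !is_writable_in_tx && !already_collected
}

fn main() {
    // Read-only program account: collected and later pre-loaded as expected.
    assert!(should_collect(false, false));
    // Program that is also upgraded via CPI in the same TX (marked writable):
    // the check returns false, so the program is silently skipped.
    assert!(!should_collect(true, false));
}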

Thus, the matching programdata account is no longer pre-loaded into the executor cache here:

loaded_programs_for_txs.insert(**pubkey, program.clone());

Later, in the program runtime, when the program attempts to execute, it is not found in the executor cache because it was never pre-loaded:

if let Some(ref tx_executor_cache) = tx_executor_cache {

This leads to an incorrect fall-through into get_programdata_offset_and_deployment_offset(), where the program is finally reported as being closed:

ic_logger_msg!(log_collector, "Program has been closed");

In other words, while #30703 made the problem visible, it was actually introduced earlier in #30371.

Patch

In filter_executable_program_accounts(), remove the !tx.message().is_writable(i) && from the condition so that it becomes if !result.contains_key(key) only.
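A minimal sketch of the patched condition, assuming the surrounding collection loop stays as-is (the full body of filter_executable_program_accounts() is not shown in this issue, so the comments stand in for it):

// Before: program accounts that are writable in the TX are skipped
// and therefore never pre-loaded into the executor cache.
if !tx.message().is_writable(i) && !result.contains_key(key) {
    // ... collect `key` as an executable program account ...
}

// After: collect the program account regardless of its writability in the TX.
if !result.contains_key(key) {
    // ... collect `key` as an executable program account ...
}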

@jeffwashington
Contributor

jeffwashington commented Apr 27, 2023

I also had a machine hit this at what I think is a different slot:
We have tried to repair duplicate slot: 190064501 more than 10 times

  • actually - now that I'm digging in, 2 machines so far have stopped at this same slot.

@steviez
Contributor Author

steviez commented Apr 27, 2023

I also had a machine hit this at what I think is a different slot: We have tried to repair duplicate slot: 190064501 more than 10 times

  • actually - now that I'm digging in, 2 machines so far have stopped at this same slot.

Can you look a little further back in your logs and see if these machines initially died on slot 190061698? I already looked at another node that also failed on 190064501 and determined that it was a subsequent failure after failing on 190061698. Basically, what happened was:

  • The node replayed 190061698 and computed the wrong hash
  • The node retried up to the 10-attempt limit, kept getting the wrong hash, and panicked
  • The node was restarted, and as part of "startup / local ledger replay", it still computed the incorrect hash for slot 190061698
  • However, since this was during startup ledger replay before the validator was up, those computed hashes are NOT checked against the cluster
  • Thus, the node calculated the incorrect hash for slot 190061698 and all of its children
  • The node then hit the panic on the new slot 190064501 because 190064501 was the first slot attempted by ReplayStage once the validator was up
