
Consensus (bank hash) mismatch between master and v1.13/v1.14 (mnb) #31365

Closed
steviez opened this issue Apr 27, 2023 · 4 comments · Fixed by #31371

Comments

@steviez
Contributor

steviez commented Apr 27, 2023

Problem

On April 23, nodes running against master were observed to have diverged from mainnet. Nodes running various commits off of master forked off at slot 190061698 and hit this panic:

thread 'solReplayStage' panicked at 'We have tried to repair duplicate slot: 190061698 more than 10 times
and are unable to freeze a block with bankhash J5GobCrZZLvvj8EW72BV1Q7973TefL7f3qtqGinebCWF,
instead we have a block with bankhash Some(FqJHBuSpEXHGM5Rt5pk1hWuKQh2Wg3iyQnyiQ4kwT8NW).
This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', 

The incorrect bank hash and the associated fields that feed into it are:

bank frozen: 190061698
hash: FqJHBuSpEXHGM5Rt5pk1hWuKQh2Wg3iyQnyiQ4kwT8NW
accounts_delta: nSAB5dSBaN94HF7UNT4ZwKna4rYfZKMUPLoA9ZRAHTE
signature_count: 1640 
last_blockhash: 4Nez6awxz3mqyEPPtvmBErhaUVFgCd1Us1GVsr9kFEU4
capitalization: 545940604106710686,
stats: BankHashStats {
	num_updated_accounts: 4976,
	num_removed_accounts: 21,
	num_lamports_stored: 2524779513915279,
	total_data_len: 20916547,
	num_executable_accounts: 0
}

whereas the correct hash produced by v1.13/v1.14 looks like:

bank frozen: 190061698
hash: J5GobCrZZLvvj8EW72BV1Q7973TefL7f3qtqGinebCWF
accounts_delta: qnhRzBRp4Uapif5QewsCkLaEVL2zgLAiBvMacbchn45
signature_count: 1640
last_blockhash: 4Nez6awxz3mqyEPPtvmBErhaUVFgCd1Us1GVsr9kFEU4
capitalization: 545940604106710686
accounts hash slot: 190061698
stats: BankHashStats {
	num_updated_accounts: 4984,
	num_removed_accounts: 22,
	num_lamports_stored: 2524854860183319,
	total_data_len: 23534954,
	num_executable_accounts: 1
}

At first glance, there are obvious differences in several of these numbers, which points to a transaction having produced a different execution result between the two versions.
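For context, here is a rough sketch of how these logged fields feed into the frozen bank hash. The composition below is my paraphrase of the runtime's Bank::hash_internal_state from around this era, not something quoted in this issue. Since signature_count and last_blockhash match in both dumps, the divergent accounts_delta (i.e. different post-replay account state) is what flips the hash.

use solana_sdk::hash::{hashv, Hash};

// Assumed composition (sketch): the parent bank hash, the accounts delta hash,
// the signature count, and the last blockhash are hashed together.
fn bank_hash(
    parent_hash: &Hash,
    accounts_delta: &Hash,
    signature_count: u64,
    last_blockhash: &Hash,
) -> Hash {
    hashv(&[
        parent_hash.as_ref(),
        accounts_delta.as_ref(),
        &signature_count.to_le_bytes(),
        last_blockhash.as_ref(),
    ])
}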

Adding some extra logging, I determined that the transaction (explorer link here) that executed differently between versions is:

6ujGsjkwWJtE3VsraD97PBrhWpJaKGVfbxQAyV5thV32NTR9HWa6PjJgqGComn8y9oXBJMc1iT1NwZRLWLm4aM8

According to the basic logs in explorer, the above transaction fully succeeded on v1.13 / v1.14. However, in the debug logs from replaying with master via solana-ledger-tool, I see the following for this tx:

[...] Program ComputeBudget111111111111111111111111111111 invoke [1]
[...] Program ComputeBudget111111111111111111111111111111 success
[...] Program SMPLecH534NA9acpos4G6x7uf3LWbCAwZQE9e8ZekMu invoke [1]
[...] Program log: Instruction: ExecuteTransaction
[...] Program hvsrNC3NKbcryqDs2DocYHZ9yPKEVzdSjQG6RVtK1s8 invoke [2]
[...] Program has been closed
[...] Program hvsrNC3NKbcryqDs2DocYHZ9yPKEVzdSjQG6RVtK1s8 failed: invalid account data for instruction
[...] Program SMPLecH534NA9acpos4G6x7uf3LWbCAwZQE9e8ZekMu consumed 17573 of 1400000 compute units
[...] Program SMPLecH534NA9acpos4G6x7uf3LWbCAwZQE9e8ZekMu failed: invalid account data for instruction
...
[...] tx error: InstructionError(1, InvalidAccountData) SanitizedTransaction { ... }

Proposed Solution

@steviez
Contributor Author

steviez commented Apr 27, 2023

I have determined that aebc191 (#30703) introduced the breaking change (see the command sketch after this list):

  • Running solana-ledger-tool with aebc191 produces the incorrect bank hash FqJHBuSpEXHGM5Rt5pk1hWuKQh2Wg3iyQnyiQ4kwT8NW for slot 190061698
  • Running solana-ledger-tool with the parent commit e7887cf produces the correct bank hash J5GobCrZZLvvj8EW72BV1Q7973TefL7f3qtqGinebCWF for slot 190061698
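Each data point above comes from replaying the ledger at the respective commit and comparing the bank hash logged for slot 190061698 against the two hashes above. A typical invocation would look roughly like the following; the exact flags are my assumption of a standard solana-ledger-tool run, not copied from this issue:

solana-ledger-tool verify --ledger <LEDGER_DIR> --halt-at-slot 190061698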

@pgarg66 @Lichtso - Can you two please look into this further, given that you were the author and reviewer on this PR?

@Lichtso
Contributor

Lichtso commented Apr 27, 2023

I think I have an idea of what is going on:

First of all, in the TX in question, the program hvsrNC3NKbcryqDs2DocYHZ9yPKEVzdSjQG6RVtK1s8 is first executed via CPI (meaning it is passed as an instruction account in a top level instruction) and then upgraded via CPI (meaning it is set to writable in the TX). Either of these conditions triggers the special case we added in the PR in question here: aebc191#diff-a6df749724ef3698bf0b90625702f100670b58c585f91da8725f78a94b04196eR388

However, if both occur simultaneously on the same program in the same TX, as is the case here, the function filter_executable_program_accounts() incorrectly filters the program out here:

if !tx.message().is_writable(i) && !result.contains_key(key) {
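To make the failure mode concrete, here is a minimal, self-contained illustration (not the actual Solana code) of how that condition behaves. A program that is upgraded via CPI is marked writable in the TX, so the check short-circuits and the program is never collected:

// Mirrors `!tx.message().is_writable(i) && !result.contains_key(key)`.
fn should_collect(is_writable_in_tx: bool, already_collected: bool) -> bool {
    !is_writable_in_tx && !already_collected
}

fn main() {
    // Read-only program account: collected and later pre-loaded as expected.
    assert!(should_collect(false, false));
    // Program that is also upgraded via CPI in the same TX (marked writable):
    // the check returns false, so the program is silently skipped.
    assert!(!should_collect(true, false));
}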

Thus, the matching programdata account is no longer pre-loaded into the executor cache here:

loaded_programs_for_txs.insert(**pubkey, program.clone());

Later, in the program runtime, when the program attempts to execute, it is not found in the executor cache because it was never pre-loaded:

if let Some(ref tx_executor_cache) = tx_executor_cache {

This leads to an incorrect fall-through into get_programdata_offset_and_deployment_offset(), where the program is finally reported as being closed:

ic_logger_msg!(log_collector, "Program has been closed");

In other words, while #30703 made the problem visible, it was actually introduced earlier in #30371.

Patch

In filter_executable_program_accounts(), remove the !tx.message().is_writable(i) && from the condition so that it becomes if !result.contains_key(key) only.
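A minimal sketch of the patched condition, assuming the surrounding collection loop stays as-is (the full body of filter_executable_program_accounts() is not shown in this issue, so the comments stand in for it):

// Before: program accounts that are writable in the TX are skipped
// and therefore never pre-loaded into the executor cache.
if !tx.message().is_writable(i) && !result.contains_key(key) {
    // ... collect `key` as an executable program account ...
}

// After: collect the program account regardless of its writability in the TX.
if !result.contains_key(key) {
    // ... collect `key` as an executable program account ...
}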

@jeffwashington
Contributor

jeffwashington commented Apr 27, 2023

I also had a machine hit this at what I think is a different slot:
We have tried to repair duplicate slot: 190064501 more than 10 times

  • actually - now that I'm digging in, 2 machines so far have stopped at this same slot.

@steviez
Contributor Author

steviez commented Apr 27, 2023

I also had a machine hit this at what I think is a different slot: We have tried to repair duplicate slot: 190064501 more than 10 times

  • actually - now that I'm digging in, 2 machines so far have stopped at this same slot.

Can you look a little further back in your logs and see if these machines initially died on slot 190061698? I already looked at another node that also failed on 190064501 and determined that it was a subsequent failure after failing on 190061698. Basically, what happened was:

  • The node replayed 190061698 and computed the wrong hash
  • The node retried up to the 10-attempt limit, kept getting the wrong hash, and panicked
  • The node was restarted, and as part of "startup / local ledger replay", it still computed the incorrect hash for slot 190061698
  • However, since this was during startup ledger replay before the validator was up, those computed hashes are NOT checked against the cluster
  • Thus, the node calculated the incorrect hash for slot 190061698 and all of its children
  • The node then hit the panic on the new slot 190064501 because 190064501 was the first slot attempted by ReplayStage once the validator was up
