-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Blockchain: fix temp fails causing alt blocks to be permanently invalid #9395
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the premise that an invalid block cache doesn't make much sense: If a non-malicious peer sends a block the node deems invalid, validation should fail fast in common scenarios (unsupported transaction versions, missing parent blocks, etc), rather than relying on caching. Caching just tends to obscure flaws in the efficiency of the validation logic (e.g. one might not notice fail-fast conditions being spotted late in the validation because of the caching).
Also looking at the other spot monero/src/cryptonote_core/blockchain.cpp Lines 4288 to 4293 in caa62bc
since Note: the fourth and final spot monero/src/cryptonote_core/blockchain.cpp Lines 4307 to 4311 in caa62bc
is slated for removal in #9135 @jeffro256 thoughts on just removing the cache in this PR? |
Hmm, actually it would seem the block's Approving this PR as is, though agree the invalid cache generally seems to introduce a wider vector for abuse than benefit |
I ran this patch on a few stressnet nodes (i.e. nodes running https://github.com/spackle-xmr/monero ) and left a few nodes unpatched. My patched nodes never got stuck because of this invalid blocks problem, but my unpatched nodes got stuck four times during this test period. The patch seems to do what it's supposed to do. AFAIK, the problem is likely to occur when transaction propagation is imperfect. An unpatched node may get a fluffy block, not have all the fluffy block's transactions in its transaction pool, then mark the block as permanently invalid because it can't find all of the block's transactions. On stressnet, this has occurred when (1) there are many large consolidation transactions (e.g. many 144in/2out) or (2) when one or more nodes have their maximum tx pool size set above the default and the aggregate size of unconfirmed txs on the network exceeds the default maximum tx pool size. |
Issue appeared in stressnet where nodes would never switch to the valid chain with longest proof of work, but instead log a message like so:
What I believe was happening is this chain of events:
1. Transaction T is added to mempool
2. Block B1, which contains T, is added as an alt block
3. Mempool conditions cause T to be pushed out
4. Block B2, with previous block B1, is added as an alt block
5. Cumulative difficulty of B2 relative to the main chain triggers a reorg attempt
6. Tx validation of B1 begins, but T is not present in the mempool
7. Method
handle_block_to_main_chain
returns false8. As a result, method
switch_to_alternative_blockchain
adds B1 and B2 tom_invalid_blocks
list permanently9. Peer gets blocked, and node never reaches consensus with other nodes because B1 and B2 stay in the invalid blocks list
I could be wrong about the exact trigger, but regardless,
handle_block_to_main_chain
returnsfalse
for many reasons, some of which are non-deterministic (i.e. the node not having a transaction in the mempool). As such, we should not add blocks to the invalid blocks list every timehandle_block_to_main_chain
returns false. Additionally,handle_block_to_main_chain
already adds blocks to the invalid blocks list for transaction input consensus failures (which should be deterministic). The other big type of failure that could occur s a PoW check, but in that case, we shouldn't add those blocks to the invalid blocks list either, since it is 100% free for a peer to create a block that fails the PoW check, and doing so would cause the list to be filled up with garbage quickly.There is a discussion to be had whether we should have any invalid blocks cache in the first place...
Thanks to @Rucknium for finding this issue, and to everyone running a stressnet node!