
silkworm: inconsistent database content after unwind #2359

Closed
canepat opened this issue Sep 17, 2024 · 0 comments · Fixed by #2360

canepat (Member) commented Sep 17, 2024

After encountering the invalid block issue described in #2358, Silkworm performs an unwind cycle and subsequently a new forward cycle triggered by the Consensus Layer:

...
  INFO [09-12|07:43:57.151 UTC] ExecutionEngine                          verifying chain 0c2f4eeb92fe45bff59e422fef6beddc062758e0e4b098559c523c0591c07bf9
  INFO [09-12|07:43:57.295 UTC] sync::SentryClient                       Peer de753506a7855f1f5be13d9e75a5616a42a00a12 disconnected, active 44, info: -info-not-found-
  INFO [09-12|07:43:57.472 UTC] ExecutionPipeline                        Forward start
  INFO [09-12|07:43:57.472 UTC] [1/12 Headers]                           op=Forward from=20581422 to=20727515 span=146093 
  INFO [09-12|07:43:57.952 UTC] [1/12 Headers]                           op=Forward done=480.476ms 
  INFO [09-12|07:43:57.952 UTC] [2/12 BlockHashes]                       op=Forward from=20581422 to=20727515 span=146093 
  INFO [09-12|07:43:58.726 UTC] [2/12 BlockHashes]                       op=Forward done=773.195ms 
  INFO [09-12|07:43:58.726 UTC] [3/12 Bodies]                            op=Forward from=20581422 to=20727515 span=146093 
...
  INFO [09-12|07:44:12.899 UTC] [3/12 Bodies]                            op=Forward done=14.173s 
  INFO [09-12|07:44:12.899 UTC] [4/12 Senders]                           op=Forward from=20581422 to=20727515 span=146093 max_batch_size=139810 
...
  INFO [09-12|07:45:37.219 UTC] [4/12 Senders]                           op=Forward done=1m 24s 
  INFO [09-12|07:45:37.219 UTC] [5/12 Execution]                         op=Forward from=20580035 to=20727515 span=147480 
 ERROR [09-12|07:45:37.634 UTC] [5/12 Execution]                         function=execute_batch exception=MDBX_EKEYMISMATCH: The given key value is mismatched to the current cursor position 
 ERROR [09-12|07:45:37.634 UTC] [5/12 Execution]                         function=forward exception=kDbError 
 ERROR [09-12|07:45:37.634 UTC] [5/12 Execution]                         op=Forward returned=kDbError 
 ERROR [09-12|07:45:37.634 UTC] ExecPipeline                             Forward interrupted due to stage Execution failure
  INFO [09-12|07:45:37.634 UTC] MainChain::verify_chain forward_result=kDbError
...

The forward cycle fails with the internal error code MDBX_EKEYMISMATCH, which indicates that the database content has become inconsistent. Restarting Silkworm produces the same error during the initial cleanup forward cycle.
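For context, MDBX_EKEYMISMATCH is the error libmdbx returns when a cursor put with the MDBX_APPEND flag receives a key that does not sort after the last key already in the table. One plausible way this surfaces here is that records keyed by a higher block number were left behind, so a later forward cycle trying to append data for a lower block fails. A minimal sketch using the plain libmdbx C API (not Silkworm code; the database path and payload are made up for illustration):

```cpp
#include <cstdint>
#include <cstdio>
#include "mdbx.h"

// Appending a key that sorts *before* the last key already in the table makes
// mdbx_cursor_put(..., MDBX_APPEND) fail with MDBX_EKEYMISMATCH.
int main() {
    MDBX_env* env = nullptr;
    mdbx_env_create(&env);
    mdbx_env_open(env, "./ekeymismatch-demo", MDBX_NOSUBDIR, 0644);  // demo path

    MDBX_txn* txn = nullptr;
    mdbx_txn_begin(env, nullptr, MDBX_TXN_READWRITE, &txn);

    MDBX_dbi dbi{};
    mdbx_dbi_open(txn, "Receipt", MDBX_CREATE, &dbi);

    MDBX_cursor* cursor = nullptr;
    mdbx_cursor_open(txn, dbi, &cursor);

    // Keys are 8-byte big-endian block numbers.
    uint8_t higher_key[8] = {0, 0, 0, 0, 0x01, 0x3a, 0x0c, 0x2f};  // 20581423 (stale record)
    uint8_t lower_key[8]  = {0, 0, 0, 0, 0x01, 0x3a, 0x06, 0xc4};  // 20580036 (new forward cycle)
    char payload[] = "rlp...";

    MDBX_val key{higher_key, sizeof(higher_key)};
    MDBX_val data{payload, sizeof(payload)};
    mdbx_cursor_put(cursor, &key, &data, MDBX_APPEND);  // ok: table was empty

    key = MDBX_val{lower_key, sizeof(lower_key)};
    int rc = mdbx_cursor_put(cursor, &key, &data, MDBX_APPEND);
    std::printf("append lower key: %s\n", mdbx_strerror(rc));  // MDBX_EKEYMISMATCH

    mdbx_cursor_close(cursor);
    mdbx_txn_abort(txn);
    mdbx_env_close(env);
}
```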

This is the status of the pipeline stages in the Silkworm database:

db_toolbox --datadir /Volumes/Extreme\ SSD/Library/silkworm/silkworm stages                                        

 Stage Name                    Block 
 ------------------------ ---------- 
 AccountHistoryIndex        20460999         
 BlockHashes                20727515         
 Bodies                     20727515         
 CallTraces                 20460999         
 Execution                  20580035         
 Finish                     20460999         
 HashState                  20460999         
 Headers                    20727515         
 HistoryIndex               20460999         
 IntermediateHashes         20460999         
 LogIndex                   20460999         
 Senders                    20727515         
 StorageHistoryIndex        20460999         
 TxLookup                   20460999   

Forcing the Execution state progress to 20581423 (the failed block) and performing a stage unwind cycle from 20581423 to 20581422 (the last valid block) reveals something interesting:

db_toolbox --exclusive --datadir /Volumes/Extreme\ SSD/Library/silkworm/silkworm unwind --height 20581422
  INFO [09-16|14:52:31.817 UTC] ExecutionPipeline                        Unwind start
  INFO [09-16|14:52:31.817 UTC] [1/12 Finish]                            op=Unwind done=343us 
  INFO [09-16|14:52:31.817 UTC] [2/12 TxLookup]                          op=Unwind done=31us 
  INFO [09-16|14:52:31.817 UTC] [3/12 CallTraces]                        op=Unwind done=15us 
  INFO [09-16|14:52:31.817 UTC] [4/12 LogIndex]                          op=Unwind done=25us 
  INFO [09-16|14:52:31.817 UTC] [5/12 HistoryIndex]                      op=Unwind done=27us 
  INFO [09-16|14:52:31.817 UTC] [6/12 HashState]                         op=Unwind done=23us 
  INFO [09-16|14:52:31.817 UTC] [7/12 IntermediateHashes]                op=Unwind done=33us 
  INFO [09-16|14:52:56.245 UTC] [8/12 Execution]                         op=Unwind from=20581423 to=20581422 span=1 
  INFO [09-16|14:52:59.462 UTC] Erased 0 records from AccountChangeSet starting key 00000000013a0c2f
  INFO [09-16|14:53:18.273 UTC] Erased 0 records from StorageChangeSet starting key 00000000013a0c2f
  INFO [09-16|14:53:21.348 UTC] Erased 1 records from Receipt starting key 00000000013a0c2f
  INFO [09-16|14:53:24.431 UTC] Erased 97 records from TransactionLog starting key 00000000013a0c2f
  INFO [09-16|14:53:32.280 UTC] Erased 0 records from CallTraceSet starting key 00000000013a0c2f

The database still contains 1 block receipt and 97 transaction logs belonging to block 20581423, which the built-in unwind cycle executed automatically by Silkworm after the forward failure did not remove. In fact, that automatic unwind cycle did the opposite: after detecting the failure it committed the changes that had already been applied on the MDBX RW transaction.
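Note that the starting key 00000000013a0c2f in the unwind log above appears to be simply block number 20581423 (the failed block) encoded as an 8-byte big-endian prefix, which is how the records in these tables are keyed. A small sketch with a hypothetical helper (for illustration only; Silkworm has its own encoding routines):

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical helper mirroring the 8-byte big-endian block-number prefix
// seen in the unwind log ("starting key 00000000013a0c2f").
std::array<uint8_t, 8> block_key(uint64_t block_number) {
    std::array<uint8_t, 8> key{};
    for (int i = 7; i >= 0; --i) {
        key[i] = static_cast<uint8_t>(block_number & 0xFF);
        block_number >>= 8;
    }
    return key;
}

int main() {
    const auto key = block_key(20'581'423);  // the failed block
    for (uint8_t b : key) std::printf("%02x", b);
    std::printf("\n");  // prints 00000000013a0c2f
}
```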

This is a regression introduced by #1511.

sw_invalid_block.log
sw.log

canepat added the bug (Something isn't working) label Sep 17, 2024
canepat self-assigned this Sep 17, 2024
battlmonstr pushed a commit that referenced this issue Sep 18, 2024
Fixes #2359

This PR restores the invalid block handling logic in stage Execution that was removed in #1511.

It also fixes the invalid block reference lifetime issue that caused an incorrect block hash to be stored as the bad block hash in SyncContext and displayed in the warning log, as described in #2358.
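For reference, a minimal sketch of the two behaviours the fix touches, using hypothetical stand-in types (not Silkworm's actual API): on an invalid block the partially applied changes must not be committed, and the bad block hash must be stored by value rather than through a reference that outlives the block it refers to:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for illustration only.
struct Hash { uint8_t bytes[32]; };

struct SyncContext {
    // Storing the hash by value avoids the dangling-reference problem from
    // #2358, where a reference to an already-destroyed block yields a wrong hash.
    std::optional<Hash> bad_block_hash;
};

enum class StageResult { kSuccess, kInvalidBlock, kDbError };

struct RwTxn {            // stand-in for an MDBX read-write transaction
    void commit() {}
    void abort() {}
};

// Sketch of the intended control flow for stage Execution: when validation of
// a block fails, record its hash and discard the uncommitted writes for that
// block instead of committing them (the behaviour this issue describes).
StageResult execute_blocks(RwTxn& txn, SyncContext& ctx) {
    bool invalid_block_found = false;  // set by block validation (elided)
    Hash failed_block_hash{};          // hash of the offending block (elided)

    if (invalid_block_found) {
        ctx.bad_block_hash = failed_block_hash;  // copy, not a reference
        txn.abort();  // do NOT commit receipts/logs of the failed block
        return StageResult::kInvalidBlock;
    }
    txn.commit();
    return StageResult::kSuccess;
}

int main() {
    RwTxn txn;
    SyncContext ctx;
    return execute_blocks(txn, ctx) == StageResult::kSuccess ? 0 : 1;
}
```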