TransactionClient `gc` function often enters error loop during CleanupLocks #471

jmhrpr · 2024-12-05T14:38:06Z

I have a program that runs periodically and makes a single call to the TransactionClient `gc` function. Around half of the time it enters a loop that prints thousands of these logs and eventually OOMs. It checks many ranges/regions for the same key, and the key is different each time it enters the loop. The logs are INFO level and originate from client-rust-da362376b56921db/1fa846b/src/request/plan.rs:686.

CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE A START>, <REDACTED RANGE A END>) for region 116477
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE B START>, <REDACTED RANGE B END>) for region 116288
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE C START>, <REDACTED RANGE C END>) for region 970120
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE D START>, <REDACTED RANGE D END>) for region 969086
...
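For readers unfamiliar with the message format: `[start, end)` in these logs denotes a half-open key range, so a key equal to the end bound is already outside the region. The check behind such an error can be sketched as below. This is an illustration only, not the client-rust or TiKV implementation; `key_in_region` is a hypothetical helper, and treating an empty end key as "unbounded" is an assumption based on the usual TiKV convention.

```rust
/// Returns true if `key` lies in the half-open range [start, end).
/// Byte slices compare lexicographically, which matches how raw keys are ordered.
/// Assumption: an empty `end` means the range has no upper bound.
fn key_in_region(key: &[u8], start: &[u8], end: &[u8]) -> bool {
    let above_start = key >= start;
    let below_end = end.is_empty() || key < end;
    above_start && below_end
}
```

With this definition, `key_in_region(b"c", b"a", b"c")` is false because the end bound is exclusive, which is the situation each log line above reports for a different region.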


pingyu · 2024-12-06T00:05:06Z

Could you share your program, or a snippet that reproduces this issue, as well as the setup of your TiKV cluster?

ekexium · 2024-12-09T03:36:56Z

I'm afraid GC is not stable yet (#180).
After a quick look at the code (I haven't reviewed this part for years), one possible problem to investigate: the scan_lock request might be passed directly to cleanup_locks without proper region setup (e.g. via something like retry_multi_region).

client-rust/src/transaction/client.rs

Lines 264 to 271 in 59f13b5

    let req = new_scan_lock_request(range, safepoint, options.batch_size);
    let plan = crate::request::PlanBuilder::new(self.pd.clone(), self.keyspace, req)
        .cleanup_locks(ctx.clone(), options, backoff, self.keyspace)
        .retry_multi_region(DEFAULT_REGION_BACKOFF)
        .extract_error()
        .merge(crate::request::Collect)
        .plan();
    plan.execute().await

This change was introduced in PR #378, which added async commit lock resolution support. As a workaround, you might want to temporarily disable async commit and use the pre-PR GC implementation until this issue is resolved.
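To make the "proper region setup" idea concrete, here is a minimal standalone sketch of routing keys to the region that owns them before dispatching a request. It is not client-rust code and does not claim to describe what `retry_multi_region` does internally; the `Region` struct, `owning_region`, and `group_keys_by_region` are hypothetical names, and the empty-end-key-means-unbounded rule is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical region descriptor for illustration; not a client-rust type.
#[derive(Debug, Clone)]
struct Region {
    id: u64,
    start: Vec<u8>,
    /// Assumption: an empty end key means the region is unbounded above.
    end: Vec<u8>,
}

/// Finds the region whose half-open range [start, end) contains `key`.
fn owning_region<'a>(regions: &'a [Region], key: &[u8]) -> Option<&'a Region> {
    regions.iter().find(|r| {
        key >= r.start.as_slice() && (r.end.is_empty() || key < r.end.as_slice())
    })
}

/// Groups keys by the id of the region that owns them, so that each region
/// would only receive keys inside its own range. Keys that no known region
/// covers are skipped in this sketch.
fn group_keys_by_region(regions: &[Region], keys: &[Vec<u8>]) -> BTreeMap<u64, Vec<Vec<u8>>> {
    let mut groups: BTreeMap<u64, Vec<Vec<u8>>> = BTreeMap::new();
    for key in keys {
        if let Some(region) = owning_region(regions, key) {
            groups.entry(region.id).or_default().push(key.clone());
        }
    }
    groups
}
```

If keys were instead sent to a region without this kind of grouping, every region that does not own a key would be expected to reject it with a "key is not in region key range" error like the ones reported above.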
