TransactionClient gc function often enters error loop during CleanupLocks #471

Open
jmhrpr opened this issue Dec 5, 2024 · 2 comments

jmhrpr commented Dec 5, 2024

I have a program that runs periodically and makes a single call to the TransactionClient gc function. Around half of the time it enters a loop that prints thousands of the logs below and eventually OOMs. It checks many ranges/regions for the same key, and the key is different each time it enters the loop. These are INFO logs originating from client-rust-da362376b56921db/1fa846b/src/request/plan.rs:686.

CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE A START>, <REDACTED RANGE A END>) for region 116477
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE B START>, <REDACTED RANGE B END>) for region 116288
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE C START>, <REDACTED RANGE C END>) for region 970120
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE D START>, <REDACTED RANGE D END>) for region 969086
...
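
For readers unfamiliar with the message, each log line reports a failed containment check of one key against one region's half-open range `[start, end)`. The following is a minimal, self-contained sketch of that check, not code from client-rust or TiKV: the `Region` struct and both function names are hypothetical, and treating an empty end key as "unbounded" is an assumption made for the illustration.

```rust
/// Hypothetical, simplified region descriptor used only for this illustration.
struct Region {
    id: u64,
    start: Vec<u8>,
    end: Vec<u8>,
}

/// Returns true when `key` lies in the half-open range [start, end).
/// Assumption: an empty `end` means the range has no upper bound.
fn key_in_region(key: &[u8], start: &[u8], end: &[u8]) -> bool {
    // Byte slices compare lexicographically.
    key >= start && (end.is_empty() || key < end)
}

/// Finds the id of the first region whose range contains `key`, if any.
fn find_region(regions: &[Region], key: &[u8]) -> Option<u64> {
    regions
        .iter()
        .find(|r| key_in_region(key, &r.start, &r.end))
        .map(|r| r.id)
}
```

In a non-overlapping layout a key matches exactly one region, so a request carrying that key and sent to any other region would fail this check, which is the pattern the repeated log lines show.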

pingyu (Collaborator) commented Dec 6, 2024

Could you share your program, or a snippet that reproduces this issue?

Please also describe the setup of your TiKV cluster.

ekexium (Collaborator) commented Dec 9, 2024

I'm afraid GC is not stable yet (#180).
After a quick look at the code (I haven't reviewed this part for years), one possible problem to investigate: the scan_lock request might be passed directly to cleanup_locks without proper region setup (e.g. via something like retry_multi_region):

    let req = new_scan_lock_request(range, safepoint, options.batch_size);
    let plan = crate::request::PlanBuilder::new(self.pd.clone(), self.keyspace, req)
        .cleanup_locks(ctx.clone(), options, backoff, self.keyspace)
        .retry_multi_region(DEFAULT_REGION_BACKOFF)
        .extract_error()
        .merge(crate::request::Collect)
        .plan();
    plan.execute().await

This change was introduced in PR #378, which added async commit lock resolution support. As a workaround, you might want to temporarily disable async commit and use the pre-PR GC implementation until this issue is resolved.
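
Independently of the root cause, a loop like the one reported can be kept from running until OOM by capping the number of retries and returning the last error. The sketch below is a generic illustration of that idea in plain Rust, not the client-rust retry logic: `Attempt` and `retry_bounded` are hypothetical names, and a real implementation would also sleep between attempts.

```rust
/// Hypothetical outcome of one attempt against a single region.
enum Attempt<T> {
    Done(T),
    RegionError(String),
}

/// Calls `attempt` at most `max_attempts` times, passing the attempt index.
/// Returns the first success, or an error describing the last region error.
fn retry_bounded<T>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Attempt<T>,
) -> Result<T, String> {
    let mut last_err = String::from("no attempts made");
    for n in 0..max_attempts {
        match attempt(n) {
            Attempt::Done(value) => return Ok(value),
            // Remember the error and try again (no backoff sleep in this sketch).
            Attempt::RegionError(e) => last_err = e,
        }
    }
    Err(format!("gave up after {} attempts: {}", max_attempts, last_err))
}
```

With a cap in place, a key that never matches any region surfaces as a single error to the caller instead of thousands of log lines.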
