-
Notifications
You must be signed in to change notification settings - Fork 874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust shard acquire expiration #3032
Adjust shard acquire expiration #3032
Conversation
Can you explain what behavior you want to see here, and how this is supposed to achieve it? Note that this value isn't the timeout for the persistence operation, that's 10s and has been for a long time. This is how long to keep retrying before unloading the shard when persistence is unavailable. If you change this to 8s, that effectively unloads the shard as soon as persistence is unavailable, which undoes the whole optimization we did last year to be able to handle a few minutes of unavailability without unloading the shard (and hence dropping caches, which will take time to fill later and cause high load). |
only place where acquireShard is invoked is here: setStateAcquiring := func() {
s.state = contextStateAcquiring
go s.acquireShard()
} this setStateAcquiring can be invokes in 2 situations:
if a shard is in initialization phase, then keep on retrying does not make situation better, e.g. some other k8s pod can be instructed by ringpop to take ownership & cause shard ownership instability if a shard is already initialized and sees e.g. timeout error, this shard is in unusable state until shard ownership is asserted: acquireShard is invoked, but why keep on retry for 5m? |
Can we just add a way for persistence to signal 400-style (unrecoverable) failure? Today every persistence failure is assumed to be retryable. With ownership enforced in our persistence layer, when ringpop instructs many pods (over time) to take ownership of coincident shards, it creates a long-lived shard stealing party. |
the logic will give up shard ownership immediately when seeing shard ownership lost error |
This reverts commit e52d10e.
What changed?
Why?
Setting is not right
How did you test it?
N/A
Potential risks
N/A
Is hotfix candidate?
N/A