
Adjust shard acquire expiration #3032

Merged
merged 1 commit into temporalio:master from shard-acquire-expiration on Jun 28, 2022

Conversation

wxing1292
Contributor

What changed?

  • Adjust shard acquire expiration from 5m to 8s

Why?
The setting is not right

How did you test it?
N/A

Potential risks
N/A

Is hotfix candidate?
N/A

@wxing1292 wxing1292 requested a review from a team as a code owner June 28, 2022 19:44
@wxing1292 wxing1292 enabled auto-merge (squash) June 28, 2022 20:02
@wxing1292 wxing1292 merged commit e52d10e into temporalio:master Jun 28, 2022
@wxing1292 wxing1292 deleted the shard-acquire-expiration branch June 28, 2022 20:14
wxing1292 added a commit that referenced this pull request Jun 28, 2022
@dnr
Member

dnr commented Jun 29, 2022

Can you explain what behavior you want to see here, and how this is supposed to achieve it?

Note that this value isn't the timeout for the persistence operation; that's 10s and has been for a long time. This is how long to keep retrying before unloading the shard when persistence is unavailable. If you change this to 8s, the shard is effectively unloaded as soon as persistence becomes unavailable, which undoes the whole optimization we did last year to be able to handle a few minutes of unavailability without unloading the shard (and hence dropping caches, which will take time to fill later and cause high load).
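
To illustrate the distinction, a minimal sketch (placeholder names like acquireOnce and unload, not the actual shard code): the 10s persistence timeout bounds each attempt, while the acquire expiration bounds how long the shard keeps retrying before it is unloaded.

	package shard

	import (
		"context"
		"time"
	)

	// Simplified sketch: the per-call persistence timeout is separate from the
	// overall acquire expiration; the shard is unloaded only after the whole
	// expiration window elapses without a successful acquire.
	func acquireWithRetry(ctx context.Context, acquireOnce func(context.Context) error, unload func()) {
		const persistenceTimeout = 10 * time.Second // timeout for each persistence call
		const acquireExpiration = 5 * time.Minute   // total retry window before unloading

		deadline := time.Now().Add(acquireExpiration)
		for time.Now().Before(deadline) {
			callCtx, cancel := context.WithTimeout(ctx, persistenceTimeout)
			err := acquireOnce(callCtx)
			cancel()
			if err == nil {
				return // acquired: the shard stays loaded and its caches stay warm
			}
			time.Sleep(time.Second) // fixed backoff, purely illustrative
		}
		unload() // only now is the shard unloaded and its caches dropped
	}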

@wxing1292
Contributor Author

> Can you explain what behavior you want to see here, and how this is supposed to achieve it?
>
> Note that this value isn't the timeout for the persistence operation; that's 10s and has been for a long time. This is how long to keep retrying before unloading the shard when persistence is unavailable. If you change this to 8s, the shard is effectively unloaded as soon as persistence becomes unavailable, which undoes the whole optimization we did last year to be able to handle a few minutes of unavailability without unloading the shard (and hence dropping caches, which will take time to fill later and cause high load).

The only place where acquireShard is invoked is here:

	setStateAcquiring := func() {
		// transition to the acquiring state and re-run acquisition in the background
		s.state = contextStateAcquiring
		go s.acquireShard()
	}

This setStateAcquiring can be invoked in 2 situations:

  • contextStateInitialized
  • contextStateAcquired

If a shard is in the initialization phase, then continuing to retry does not make the situation better; e.g. some other k8s pod can be instructed by ringpop to take ownership, causing shard ownership instability.

If a shard is already initialized and sees e.g. a timeout error, the shard is in an unusable state until shard ownership is asserted again (which is what acquireShard does), so why keep retrying for 5m?

@nagl-temporal
Contributor

nagl-temporal commented Jun 29, 2022

Can we just add a way for persistence to signal a 400-style (unrecoverable) failure? Today every persistence failure is assumed to be retryable. With ownership enforced in our persistence layer, when ringpop instructs many pods (over time) to take ownership of coincident shards, it creates a long-lived shard-stealing party.

@wxing1292
Contributor Author

> Can we just add a way for persistence to signal a 400-style (unrecoverable) failure?

The logic will give up shard ownership immediately when it sees a shard ownership lost error.
This "retry for 5m" is triggered only when seeing a timeout or an unknown error.

wxing1292 added a commit that referenced this pull request Jun 30, 2022
wxing1292 added a commit that referenced this pull request Jun 30, 2022