
Adjust shard acquire expiration #3032

Merged
merged 1 commit into temporalio:master from shard-acquire-expiration on Jun 28, 2022

Conversation

wxing1292
Contributor

What changed?

  • Adjust shard acquire expiration from 5m to 8s

Why?
The setting is not right

How did you test it?
N/A

Potential risks
N/A

Is hotfix candidate?
N/A

@wxing1292 wxing1292 requested a review from a team as a code owner June 28, 2022 19:44
@wxing1292 wxing1292 enabled auto-merge (squash) June 28, 2022 20:02
@wxing1292 wxing1292 merged commit e52d10e into temporalio:master Jun 28, 2022
@wxing1292 wxing1292 deleted the shard-acquire-expiration branch June 28, 2022 20:14
wxing1292 added a commit that referenced this pull request Jun 28, 2022
@dnr
Member

dnr commented Jun 29, 2022

Can you explain what behavior you want to see here, and how this is supposed to achieve it?

Note that this value isn't the timeout for the persistence operation; that's 10s and has been for a long time. This is how long to keep retrying before unloading the shard when persistence is unavailable. If you change this to 8s, the shard is effectively unloaded as soon as persistence becomes unavailable, which undoes the whole optimization we did last year to be able to handle a few minutes of unavailability without unloading the shard (and hence dropping caches, which will take time to fill later and cause high load).
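
To illustrate the distinction, a minimal sketch (placeholder names like acquireOnce and unload, not the actual shard code): the 10s persistence timeout bounds each attempt, while the acquire expiration bounds how long the shard keeps retrying before it is unloaded.

	package shard

	import (
		"context"
		"time"
	)

	// Simplified sketch: the per-call persistence timeout is separate from the
	// overall acquire expiration; the shard is unloaded only after the whole
	// expiration window elapses without a successful acquire.
	func acquireWithRetry(ctx context.Context, acquireOnce func(context.Context) error, unload func()) {
		const persistenceTimeout = 10 * time.Second // timeout for each persistence call
		const acquireExpiration = 5 * time.Minute   // total retry window before unloading

		deadline := time.Now().Add(acquireExpiration)
		for time.Now().Before(deadline) {
			callCtx, cancel := context.WithTimeout(ctx, persistenceTimeout)
			err := acquireOnce(callCtx)
			cancel()
			if err == nil {
				return // acquired: the shard stays loaded and its caches stay warm
			}
			time.Sleep(time.Second) // fixed backoff, purely illustrative
		}
		unload() // only now is the shard unloaded and its caches dropped
	}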

@wxing1292
Contributor Author

> Can you explain what behavior you want to see here, and how this is supposed to achieve it?
>
> Note that this value isn't the timeout for the persistence operation; that's 10s and has been for a long time. This is how long to keep retrying before unloading the shard when persistence is unavailable. If you change this to 8s, the shard is effectively unloaded as soon as persistence becomes unavailable, which undoes the whole optimization we did last year to be able to handle a few minutes of unavailability without unloading the shard (and hence dropping caches, which will take time to fill later and cause high load).

The only place where acquireShard is invoked is here:

	setStateAcquiring := func() {
		// transition to the acquiring state and re-run acquisition in the background
		s.state = contextStateAcquiring
		go s.acquireShard()
	}

This setStateAcquiring can be invoked in 2 situations:

  • contextStateInitialized
  • contextStateAcquired

If a shard is in the initialization phase, then continuing to retry does not make the situation better; e.g. some other k8s pod can be instructed by ringpop to take ownership, causing shard ownership instability.

If a shard is already initialized and sees e.g. a timeout error, the shard is in an unusable state until shard ownership is asserted again (which is what acquireShard does), so why keep retrying for 5m?

@nagl-temporal
Contributor

nagl-temporal commented Jun 29, 2022

Can we just add a way for persistence to signal a 400-style (unrecoverable) failure? Today every persistence failure is assumed to be retryable. With ownership enforced in our persistence layer, when ringpop instructs many pods (over time) to take ownership of coincident shards, it creates a long-lived shard-stealing party.

@wxing1292
Contributor Author

> Can we just add a way for persistence to signal a 400-style (unrecoverable) failure?

The logic will give up shard ownership immediately when it sees a shard ownership lost error.
This "retry for 5m" is triggered only when seeing a timeout or an unknown error.

wxing1292 added a commit that referenced this pull request Jun 30, 2022
wxing1292 added a commit that referenced this pull request Jun 30, 2022