
Add a mechanism to disable automatic retries for workflows and remote bazel #8172

Merged

maggie-lou merged 6 commits from rb_reenq into master on Jan 31, 2025

Conversation

maggie-lou (Contributor) commented on Jan 14, 2025

This PR adds a platform property that disables automatic retries by the scheduler. It can be used to ensure that non-idempotent tasks are not retried.

For example, someone using remote bazel for a deploy may not want automatic retries.

Pre-req to #8195.
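
For illustration only, here is a minimal sketch of what an executor-side check for such a property could look like. The property name `retry`, the map representation, and the function names are assumptions for this sketch, not necessarily the PR's actual code:

```go
package platform

import "strings"

// retryPropertyName is the assumed name of the platform property that
// controls whether the scheduler may automatically retry a task.
const retryPropertyName = "retry"

// RetriesEnabled reports whether a task may be retried automatically.
// Anything other than an explicit "false" preserves the existing
// retry-everything default.
func RetriesEnabled(execProps map[string]string) bool {
	val, ok := execProps[retryPropertyName]
	if !ok {
		return true
	}
	return !strings.EqualFold(strings.TrimSpace(val), "false")
}
```

With a property like this in place, a non-idempotent remote run could opt out via Bazel, e.g. `--remote_default_exec_properties=retry=false` (again, the property name here is an assumption).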

maggie-lou force-pushed the rb_reenq branch 2 times, most recently from e32bdb4 to c7b921c, on January 14, 2025 07:47
maggie-lou changed the title from "Add a mechanism to disable automatic retries for remote runs" to "Add a mechanism to disable automatic retries for workflows and remote bazel" on Jan 14, 2025
maggie-lou (Contributor, Author) commented on Jan 14, 2025

@bduffany I decided to keep the default that everything is retried, even for the Run API and for bazel run. The more consistent we keep the Remote Bazel experience with workflows, the better: we market them somewhat interchangeably, so reducing subtle differences is probably for the best. At least for now this also feels like the better default, because at-most-once behavior hasn't come up as a priority, while preventing failures on transient errors is more common.

(Follow-up from the conversation in #8098.)

Review thread on cli/remotebazel/remotebazel.go (outdated):

```go
// If the remote bazel process is canceled or killed, cancel the remote run
isInvocationRunning := true
go func() {
	<-ctx.Done()
	// ... (snippet truncated in the review view)
```
Member commented:

will this ctx be cancelled after the current attempt, or after all attempts?

(another benefit of moving the loop body into a function is that you could probably defer this logic instead of putting it into a goroutine)

maggie-lou (Contributor, Author) replied:

This logic is in a goroutine because it listens in the background for the parent context being canceled (e.g. if you Ctrl-C the remote bazel command).

I added a child context that is canceled after the current retry attempt, so the goroutine is cleaned up after each attempt.
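
A minimal sketch of the pattern described here, under the assumption that each attempt is wrapped in a helper (`attempt` and `cancelRemoteRun` are hypothetical stand-ins for the CLI's real functions):

```go
package remotebazel

import "context"

// runOneAttempt runs a single retry attempt under a child context, so the
// cancellation-watcher goroutine started here exits when the attempt ends
// rather than leaking across retries.
func runOneAttempt(ctx context.Context, attempt func(context.Context) error, cancelRemoteRun func()) error {
	attemptCtx, cancel := context.WithCancel(ctx)
	defer cancel() // fires when this attempt returns, unblocking the watcher

	go func() {
		<-attemptCtx.Done()
		if ctx.Err() != nil {
			// The parent context was canceled (e.g. the user hit Ctrl-C),
			// not just this attempt: cancel the remote run as well.
			cancelRemoteRun()
		}
	}()
	return attempt(attemptCtx)
}
```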

Comment on lines 167 to 156:

```go
if !retriesEnabled(task) {
	return false
}
```
Member commented:

I think the scheduler server also has some retry logic, e.g. if the executor crashes before we hit this codepath, the scheduler still probably needs some way for the task to be attempted again. Have you looked at that / is there anything we need to change there?

maggie-lou (Contributor, Author) replied:

Good call. Added some logic to address that.
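
For context, a rough sketch of the kind of scheduler-side guard this implies: a task whose executor disappears is only re-enqueued if retries are allowed. The `ScheduledTask` type, its fields, and the attempt limit are illustrative assumptions, not the PR's actual scheduler code:

```go
package scheduler

// ScheduledTask is an illustrative stand-in for the scheduler's view of a
// task; RetriesEnabled would mirror the task's "retry" platform property.
type ScheduledTask struct {
	RetriesEnabled bool
	Attempts       int
}

// maxTaskAttempts is an assumed re-enqueue limit, for illustration.
const maxTaskAttempts = 5

// shouldReEnqueue decides whether a task whose executor crashed or was
// disconnected may be handed to another executor. Tasks that opted out of
// retries are failed instead of silently re-run.
func shouldReEnqueue(task *ScheduledTask) bool {
	if !task.RetriesEnabled {
		return false
	}
	return task.Attempts < maxTaskAttempts
}
```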

maggie-lou marked this pull request as draft on January 16, 2025 16:26
maggie-lou force-pushed the rb_reenq branch 5 times, most recently from 15fdd71 to 0366b60, on January 16, 2025 17:45
maggie-lou (Contributor, Author) commented:

I also moved the CLI code into a separate PR, because it has to go out after the server and executor changes have rolled out: #8195

maggie-lou marked this pull request as ready for review on January 16, 2025 17:59
maggie-lou marked this pull request as draft on January 16, 2025 18:40
maggie-lou force-pushed the rb_reenq branch 2 times, most recently from 8db56b9 to 1688dae, on January 16, 2025 18:53
maggie-lou marked this pull request as ready for review on January 16, 2025 20:29
bduffany (Member) left a comment:

lgtm, just 1 nit. Would be good to get another review for the scheduler changes, since that code is a bit tricky

maggie-lou (Contributor, Author) replied:

> lgtm, just 1 nit. Would be good to get another review for the scheduler changes, since that code is a bit tricky

👍 Added Iain because he's worked on the scheduler before. I'll also wait to merge until next week in case anything goes wrong during the release.

maggie-lou merged commit badb015 into master on Jan 31, 2025 (14 of 15 checks passed)
maggie-lou deleted the rb_reenq branch on January 31, 2025 17:45
maggie-lou added a commit that referenced this pull request on Feb 6, 2025:

Extracted CLI code from #8172.

This PR disables automatic retries when using the remote bazel CLI. The CLI then implements the retry behavior itself, so it can detect when a retry is happening.

This fixes a bug: if a remote run failed due to a transient error, the CLI would report a failure, but the scheduler might retry the run in the background and eventually succeed. Users who then followed the invocation link would see a success despite expecting a failure.
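
A minimal sketch of a client-side retry loop like the one described, assuming gRPC status codes are used to classify transient errors (the helper and its parameters are illustrative, not the CLI's actual code):

```go
package remotebazel

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// runWithRetries runs the remote invocation up to maxAttempts times,
// retrying only transient (Unavailable) errors. Because retries happen in
// the CLI rather than the scheduler, the CLI always knows which attempt
// produced the result it reports.
func runWithRetries(ctx context.Context, maxAttempts int, run func(context.Context) error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = run(ctx); err == nil {
			return nil
		}
		if status.Code(err) != codes.Unavailable {
			// Non-transient failure: report it immediately.
			return err
		}
	}
	return err
}
```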

Slack thread (Issue 2):
https://buildbuddy-corp.slack.com/archives/C07GMM2VBLY/p1734595804049229?thread_ts=1734595606.837269&cid=C07GMM2VBLY

Should be merged after #8172 is rolled out.