-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a mechanism to disable automatic retries for workflows and remote bazel #8172
Conversation
e32bdb4
to
c7b921c
Compare
@bduffany I decided to maintain the default that everything is retried, even for the Run API and for bazel run. I think the more consistent we keep the Remote Bazel experience with workflows the better, because we kind of market them interchangeably and reducing subtle differences is probably for the best. At least for now, this feels like a better default anyway because at-most-once behavior hasn't come up as a priority and preventing failures on transient errors is more common (Follow up from convo here #8098) |
cli/remotebazel/remotebazel.go
Outdated
// If the remote bazel process is canceled or killed, cancel the remote run | ||
isInvocationRunning := true | ||
go func() { | ||
<-ctx.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this ctx be cancelled after the current attempt, or after all attempts?
(another benefit of moving the loop body into a function is that you could probably defer
this logic instead of putting it into a goroutine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is in a goroutine because it's listening in the background if the parent context is canceled (like if you CTRL-C the remote bazel command)
I added a child context that is canceled after the current retry attempt, so the goroutine is cleaned up after each attempt
if !retriesEnabled(task) { | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the scheduler server also has some retry logic, e.g. if the executor crashes before we hit this codepath, the scheduler still probably needs some way for the task to be attempted again. Have you looked at that / is there anything we need to change there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call - added some logic to address that
15fdd71
to
0366b60
Compare
I also moved the CLI code into a separate PR, because it has to go out after the server and executors have rolled out: #8195 |
8db56b9
to
1688dae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just 1 nit. Would be good to get another review for the scheduler changes, since that code is a bit tricky
👍 Added Iain because he's worked on the scheduler before. I'll also wait to merge until next week in case anything goes wrong during the release |
Extracted CLI code from #8172 This PR disables automatic retries when using the remote bazel CLI. The CLI then implements retry behavior itself so it can detect when a retry is happening This fixes a bug where if the remote run fails due to a transient error, the CLI will report a failure. However the scheduler may retry the run in the background and it may eventually succeed. When users go to the invocation link, they'll see that it succeeded despite expecting a failure Slack thread (Issue 2): https://buildbuddy-corp.slack.com/archives/C07GMM2VBLY/p1734595804049229?thread_ts=1734595606.837269&cid=C07GMM2VBLY Should be merged after #8172 is rolled out
This PR adds a platform property to disable automatic retries by the scheduler. This can be used to ensure non-idempotent tasks are not retried.
For example, someone using remote bazel for a deploy may not want automatic retries
Pre-req to #8195