-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout)
We should just cancel the invocation in this case
I think we should try to find a way to avoid failures during a rollout? Workflows don't result in "failed" GitHub statuses when the executors restart, so it seems like a worse experience if remote bazel runs do fail in this case.
Workflows are retried on the server currently, because there is no "client" that would be capable of doing retries. For remote bazel at least, maybe we could transparently retry the invocation in the CLI (instead of retrying on the server) if WaitExecution reports that the execution fails? I'm not sure how to handle this for the Run
API though.
Hmm that's a good point. With remote bazel, I'm concerned that people will want to run things like deploy commands and there's not a good way for us to detect if a workflow is safe to rerun from the top (whereas with CI it's probably safe to rerun) What do you think of
|
@maggie-lou sgtm overall - re the retry behavior, wdyt about using a heuristic like this:
Maybe @siggisim has thoughts here as well |
another thought
how do people get the results from this API currently? do they poll GetInvocation? if so, maybe it's fine to keep the server-side retries in this case |
Yes they poll GetInvocation. I guess there's a brief window where GetInvocation will return an error before it's retried though, so their scripts would have to expect / account for that. I feel like it would be simpler to just not retry and we can add documentation to the API that they may wish to automatically retry on certain errors |
Continuing here #8172 |
A customer reported a bug where the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout). That error was immediately returned to the CLI, which reported that the remote run failed. However that execution was retried, and eventually succeeded.
We should just cancel the invocation in this case, so the CLI reports the correct status of the final results of the remote run. We don't want a case where the CLI reports that the run failed, which may lead the client to rerun the remote bazel command, which in this case would've caused duplicate runs
If the invocation did already complete running, the CancelExecutions will be a no-op
Fixes #2 here: https://buildbuddy-corp.slack.com/archives/C07GMM2VBLY/p1734595804049229?thread_ts=1734595606.837269&cid=C07GMM2VBLY