[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

maggie-lou · 2024-12-19T22:30:41Z

A customer reported a bug where the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout). That error was immediately returned to the CLI, which reported that the remote run failed. However that execution was retried, and eventually succeeded.

We should just cancel the invocation in this case, so the CLI reports the correct status of the final results of the remote run. We don't want a case where the CLI reports that the run failed, which may lead the client to rerun the remote bazel command, which in this case would've caused duplicate runs

If the invocation did already complete running, the CancelExecutions will be a no-op

Fixes #2 here: https://buildbuddy-corp.slack.com/archives/C07GMM2VBLY/p1734595804049229?thread_ts=1734595606.837269&cid=C07GMM2VBLY

…retries

bduffany

the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout)

We should just cancel the invocation in this case

I think we should try to find a way to avoid failures during a rollout? Workflows don't result in "failed" GitHub statuses when the executors restart, so it seems like a worse experience if remote bazel runs do fail in this case.

Workflows are retried on the server currently, because there is no "client" that would be capable of doing retries. For remote bazel at least, maybe we could transparently retry the invocation in the CLI (instead of retrying on the server) if WaitExecution reports that the execution fails? I'm not sure how to handle this for the Run API though.

maggie-lou · 2025-01-07T03:32:49Z

I think we should try to find a way to avoid failures during a rollout?

Hmm that's a good point. With remote bazel, I'm concerned that people will want to run things like deploy commands and there's not a good way for us to detect if a workflow is safe to rerun from the top (whereas with CI it's probably safe to rerun)

What do you think of

The server only automatically retrying workflows, and not remote bazel
Automatically retry from the remote bazel CLI. Add a flag to disable this behavior
If people want to use the Run API, they will have to implement the retry behavior themselves

bduffany · 2025-01-07T14:36:59Z

@maggie-lou sgtm overall - re the retry behavior, wdyt about using a heuristic like this:

build / test etc. are retried by default - this should generally be safe, since build/test actions are cacheable anyway.
- One possible downside is if the user has some pre-build or post-build hooks in their tools/bazel script that shouldn't be retried, then we'll retry them - but for those cases, the flag can be used to opt-out.
run commands aren't retried by default since we don't know if the user's command is idempotent or not

Maybe @siggisim has thoughts here as well

bduffany · 2025-01-07T14:39:13Z

another thought

If people want to use the Run API, they will have to implement the retry behavior themselves

how do people get the results from this API currently? do they poll GetInvocation? if so, maybe it's fine to keep the server-side retries in this case

maggie-lou · 2025-01-07T22:58:41Z

how do people get the results from this API currently? do they poll GetInvocation? if so, maybe it's fine to keep the server-side retries in this case

Yes they poll GetInvocation. I guess there's a brief window where GetInvocation will return an error before it's retried though, so their scripts would have to expect / account for that. I feel like it would be simpler to just not retry and we can add documentation to the API that they may wish to automatically retry on certain errors

maggie-lou · 2025-01-14T08:00:21Z

Continuing here #8172

maggie-lou added 2 commits December 19, 2024 16:04

[RB] Fix error handling

670db21

[RB] Always cancel remote bazel run at end of CLI to prevent missing …

5ac2881

…retries

maggie-lou requested review from bduffany and sluongng December 19, 2024 22:30

bduffany reviewed Dec 20, 2024

View reviewed changes

Base automatically changed from rb_fix to master January 7, 2025 03:25

maggie-lou mentioned this pull request Jan 14, 2025

Add a mechanism to disable automatic retries for workflows and remote bazel #8172

Merged

maggie-lou closed this Jan 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

maggie-lou commented Dec 19, 2024

bduffany left a comment •

edited

Loading

maggie-lou commented Jan 7, 2025

bduffany commented Jan 7, 2025

bduffany commented Jan 7, 2025

maggie-lou commented Jan 7, 2025

maggie-lou commented Jan 14, 2025 •

edited

Loading

[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

Conversation

maggie-lou commented Dec 19, 2024

bduffany left a comment • edited Loading

Choose a reason for hiding this comment

maggie-lou commented Jan 7, 2025

bduffany commented Jan 7, 2025

bduffany commented Jan 7, 2025

maggie-lou commented Jan 7, 2025

maggie-lou commented Jan 14, 2025 • edited Loading

bduffany left a comment •

edited

Loading

maggie-lou commented Jan 14, 2025 •

edited

Loading