Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

Closed
wants to merge 2 commits into from

Conversation

maggie-lou
Copy link
Contributor

A customer reported a bug where the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout). That error was immediately returned to the CLI, which reported that the remote run failed. However that execution was retried, and eventually succeeded.

We should just cancel the invocation in this case, so the CLI reports the correct status of the final results of the remote run. We don't want a case where the CLI reports that the run failed, which may lead the client to rerun the remote bazel command, which in this case would've caused duplicate runs

If the invocation did already complete running, the CancelExecutions will be a no-op

Fixes #2 here: https://buildbuddy-corp.slack.com/archives/C07GMM2VBLY/p1734595804049229?thread_ts=1734595606.837269&cid=C07GMM2VBLY

Copy link
Member

@bduffany bduffany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout)

We should just cancel the invocation in this case

I think we should try to find a way to avoid failures during a rollout? Workflows don't result in "failed" GitHub statuses when the executors restart, so it seems like a worse experience if remote bazel runs do fail in this case.

Workflows are retried on the server currently, because there is no "client" that would be capable of doing retries. For remote bazel at least, maybe we could transparently retry the invocation in the CLI (instead of retrying on the server) if WaitExecution reports that the execution fails? I'm not sure how to handle this for the Run API though.

Base automatically changed from rb_fix to master January 7, 2025 03:25
@maggie-lou
Copy link
Contributor Author

I think we should try to find a way to avoid failures during a rollout?

Hmm that's a good point. With remote bazel, I'm concerned that people will want to run things like deploy commands and there's not a good way for us to detect if a workflow is safe to rerun from the top (whereas with CI it's probably safe to rerun)

What do you think of

  • The server only automatically retrying workflows, and not remote bazel
  • Automatically retry from the remote bazel CLI. Add a flag to disable this behavior
  • If people want to use the Run API, they will have to implement the retry behavior themselves

@bduffany
Copy link
Member

bduffany commented Jan 7, 2025

@maggie-lou sgtm overall - re the retry behavior, wdyt about using a heuristic like this:

  • build / test etc. are retried by default - this should generally be safe, since build/test actions are cacheable anyway.
    • One possible downside is if the user has some pre-build or post-build hooks in their tools/bazel script that shouldn't be retried, then we'll retry them - but for those cases, the flag can be used to opt-out.
  • run commands aren't retried by default since we don't know if the user's command is idempotent or not

Maybe @siggisim has thoughts here as well

@bduffany
Copy link
Member

bduffany commented Jan 7, 2025

another thought

If people want to use the Run API, they will have to implement the retry behavior themselves

how do people get the results from this API currently? do they poll GetInvocation? if so, maybe it's fine to keep the server-side retries in this case

@maggie-lou
Copy link
Contributor Author

how do people get the results from this API currently? do they poll GetInvocation? if so, maybe it's fine to keep the server-side retries in this case

Yes they poll GetInvocation. I guess there's a brief window where GetInvocation will return an error before it's retried though, so their scripts would have to expect / account for that. I feel like it would be simpler to just not retry and we can add documentation to the API that they may wish to automatically retry on certain errors

@maggie-lou
Copy link
Contributor Author

maggie-lou commented Jan 14, 2025

Continuing here #8172

@maggie-lou maggie-lou closed this Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants