Automatically retry the build if encountered remote cache eviction error #17711

Closed
coeuvre
Member

@coeuvre coeuvre commented Mar 9, 2023

With #17358, Bazel will exit with code 39 if remote cache evicts blobs during the build. With #17462 and #17747, Bazel is able to continue the build without bazel clean or bazel shutdown.

However, even with #17639 and following changes to extend the lease, remote cache can still evict blobs in some rare cases.

Building on the above changes, this PR makes Bazel retry the invocation when the previous invocation hit the remote cache eviction error and --experimental_remote_cache_eviction_retries is set. This is also referred to as build rewinding.

$ bazel build --experimental_remote_cache_eviction_retries=5 ...
INFO: Invocation ID: b7348bfa-9446-4c72-a888-0a0ad012f225
Loading: 
Loading: 
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: .../workspace/a/BUILD:8:8: Executing genrule //a:bar failed: Failed to fetch blobs because they do not exist remotely: Missing digest: b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c/4
Target //a:bar failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.447s, Critical Path: 0.05s
INFO: 2 processes: 2 internal.
ERROR: Build did NOT complete successfully
Found remote cache eviction error, retrying the build...
INFO: Invocation ID: 983f60dc-8bb9-4b82-aa33-a378469ce140
Loading: 
Loading: 
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
Target //a:bar up-to-date:
  bazel-bin/a/bar.out
INFO: Elapsed time: 0.866s, Critical Path: 0.35s
INFO: 3 processes: 1 internal, 1 processwrapper-sandbox, 1 remote.
INFO: Build completed successfully, 3 total actions
$ 

Part of #16660.

@coeuvre coeuvre requested a review from a team as a code owner March 9, 2023 16:53
@brentleyjones
Contributor

> With #17462, Bazel is able to continue the build without bazel clean or bazel shutdown.

But only for the blobs that were detected as evicted, right? If other blobs were evicted (which is probably the case because of TTL being reached), then a re-run could fail on similar blobs. If my understanding is correct, I'm currently recommending people perform bazel clean --expunge if they get exit code 39.

@sgowroji sgowroji added the team-Remote-Exec and awaiting-review labels Mar 10, 2023
@coeuvre
Member Author

coeuvre commented Mar 10, 2023

> With #17462, Bazel is able to continue the build without bazel clean or bazel shutdown.
>
> But only for the blobs that were detected as evicted, right? If other blobs were evicted (which is probably the case because of TTL being reached), then a re-run could fail on similar blobs. If my understanding is correct, I'm currently recommending people perform bazel clean --expunge if they get exit code 39.

True for 6.1. But with #17639 which I couldn't get into 6.1, Bazel knows when blobs will be expired and will discard them in advance.

I don't know whether it is good to always discard all remote metadata when remote cache eviction happens.

@brentleyjones
Contributor

brentleyjones commented Mar 10, 2023

I've been running into the issue I laid out, now that I'm on 6.1. I once had to restart the build 6 times to get it to finally complete.

Until we have the ttl based discarding, maybe it's best to discard all of it? I'm not sure either. The above behavior though is influencing my current recommendations for others to do just that.

@ulrfa
Contributor

ulrfa commented Mar 10, 2023

> I don't know whether it is good to always discard all remote metadata when remote cache eviction happens.

Is this related to what was discussed Jan 27 in the review of design doc with the following comments?

> Can we make bazel discard metadata about ALL remote blobs, in cases ANY remote blob is unexpectedly missing, before exiting with a special exit code?
>
> The purpose is to prevent multiple repeated restarts in use cases where a lease service is not possible (e.g. HTTP caches) or chosen to be disabled (e.g. disk-based gRPC caches bound by disk space and no actual TTLs)

> Yes, I think this is a valid workaround for CAS that doesn't support TTL.

@brentleyjones
Contributor

Thinking more, I agree. We should remove ALL metadata before the restart, so the next build is guaranteed to work.

@coeuvre
Member Author

coeuvre commented Mar 10, 2023

> Until we have the ttl based discarding, maybe it's best to discard all of it?

I probably won't have the chance to change that for 6.1. For later releases, we will have TTL based discarding.

> Thinking more, I agree. We should remove ALL metadata before the restart, so the next build is guaranteed to work.

Yes, I agree. The reason I didn't do that before was performance: if we keep the other remote metadata, we can still hit the skyframe cache / local action cache. However, if the remote cache evicted blobs, it probably means it is under high load, and Bazel won't know what else it evicted, so the following restarts could still fail for the same reason. Additionally, with TTL-based discarding, we can avoid most of these errors. So the performance penalty in the real remote cache eviction case should be acceptable.

I will draft a PR for that.
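
To make the trade-off above concrete, here is a toy model of the two policies. It is not Bazel code: the function name `restarts_until_success` and its parameters are made up for illustration, and it assumes that each failed invocation detects exactly one evicted blob before exiting.

```python
def restarts_until_success(needed, evicted, discard_all):
    """Count the failed invocations before a build succeeds (toy model).

    needed:      blobs whose remote metadata the local server still trusts
    evicted:     blobs the remote cache has actually dropped
    discard_all: True  -> drop ALL remote metadata after a failure
                 False -> drop only the metadata of the blob that was missing
    """
    trusted = set(needed)
    evicted = set(evicted)
    restarts = 0
    while True:
        missing = sorted(trusted & evicted)
        if not missing:
            # Nothing the build still trusts is gone, so it can finish.
            return restarts
        # Assumption: the invocation fails on the first missing blob it hits.
        restarts += 1
        if discard_all:
            trusted.clear()
        else:
            trusted.discard(missing[0])
```

Under these assumptions, discarding only the detected blob costs one restart per evicted blob the build depends on, while discarding everything costs a single restart at the price of losing the remaining cache hits.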

@ulrfa
Contributor

ulrfa commented Mar 12, 2023

> I will draft a PR for that.

Thanks @coeuvre!

@coeuvre coeuvre force-pushed the build-rewinding branch 3 times, most recently from f997f5b to 7795c48 March 17, 2023 14:18
@brentleyjones
Contributor

@bazel-io flag

@bazel-io bazel-io added the "potential release blocker" label Mar 30, 2023
@ShreeM01
Contributor

@bazel-io fork 6.2.0

@bazel-io bazel-io removed the "potential release blocker" label Mar 30, 2023
ShreeM01 pushed a commit to ShreeM01/bazel that referenced this pull request Mar 30, 2023
Closes bazelbuild#17711.

PiperOrigin-RevId: 520610524
Change-Id: I20d43d1968767a03250b9c8f8a6dda4e056d4f52
ShreeM01 pushed a commit to ShreeM01/bazel that referenced this pull request Mar 31, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
keertk pushed a commit that referenced this pull request Apr 21, 2023
…ror (#18171)

@sluongng
Contributor

Hey @coeuvre, we do have users who prefer to set the invocation id manually so that they would be able to know the build URL in BES before triggering the build. This helps our users integrate the BES/RBE system against their internal CI system.

In those cases, the UUID is generated externally and passed to Bazel with --invocation_id which is ultimately processed here https://cs.opensource.google/bazel/bazel/+/refs/heads/master:src/main/java/com/google/devtools/build/lib/runtime/CommandEnvironment.java;l=467;drc=fdf46f7a17657bbd6f33e5d954ae37a35492b5e9;bpv=1;bpt=1

Now that execExclusively is re-run multiple times, computeCommandId() is also called multiple times and generates a new UUID each time, unless a custom --invocation_id is set, in which case the UUID stays the same. This would cause a lot of trouble, as multiple builds could be sending BES data under the same UUID to the BES server.

As a simple fix, I would suggest documenting the incompatibility between the --experimental_remote_cache_eviction_retries flag and the --invocation_id flag, and asking users to intercept exit code 39 and handle the retry manually.

I don't think automatically retrying with a new UUID is desirable for this use case, as users would prefer to know the invocation ID of each retry so they can integrate with their system.

A more complicated fix is to make --invocation_id accept a list of pre-generated UUIDs and allow the retry count to go up to the size of the UUID list.

WDYT?
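
For the "handle retry manually" option, a wrapper outside Bazel could look roughly like the sketch below. It only relies on the --invocation_id flag and on exit code 39 as described in this PR; the helper names (`new_invocation_ids`, `run_with_known_invocation_ids`) and the injectable `run` parameter are hypothetical and exist only to keep the example testable without a Bazel install.

```python
import subprocess
import uuid

# Exit code this PR describes for "remote cache evicted blobs during the build".
REMOTE_CACHE_EVICTED = 39


def new_invocation_ids(count):
    """Pre-generate invocation IDs so the CI system knows them up front."""
    return [str(uuid.uuid4()) for _ in range(count)]


def run_with_known_invocation_ids(command, args, invocation_ids, run=subprocess.call):
    """Run Bazel once per pre-generated ID until an attempt does not exit with 39.

    The number of attempts is bounded by len(invocation_ids).
    Returns (invocation_id, exit_code) of the last attempt that was made.
    """
    if not invocation_ids:
        raise ValueError("need at least one invocation ID")
    for invocation_id in invocation_ids:
        # Put the flag right after the command so it cannot land after a "--".
        cmd = ["bazel", command, f"--invocation_id={invocation_id}", *args]
        exit_code = run(cmd)
        if exit_code != REMOTE_CACHE_EVICTED:
            break
    return invocation_id, exit_code
```

A caller would do something like `run_with_known_invocation_ids("build", ["//a:bar"], new_invocation_ids(3))` and leave --experimental_remote_cache_eviction_retries unset, so that every attempt reports to BES under an ID the CI system already knows.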

@coeuvre
Member Author

coeuvre commented Apr 21, 2023

I agree if you set --invocation_id manually, you don't want to use --experimental_remote_cache_eviction_retries for the reason you said.

The fix you described sounds plausible, but I am not sure how well it can integrate with other tools/systems.

@coeuvre coeuvre deleted the build-rewinding branch April 21, 2023 13:15
@sluongng
Contributor

Another thing I found today: this field should be incremented on each retry attempt.

third_party/googleapis/google/devtools/build/v1/build_events.proto
38:    int64 attempt_number = 1;

Currently we are setting it to 1 always https://cs.opensource.google/bazel/bazel/+/a767fe78e8ab28c685c8ec580ba4dd47689b05cb:src/main/java/com/google/devtools/build/lib/buildeventservice/BuildEventServiceProtoUtil.java;l=89-98
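
As a rough illustration of the numbering being asked for (not Bazel's implementation), the sketch below counts attempts from 1 and bumps the number on every retry. `retry_with_attempt_numbers` and the `run_attempt` callback are hypothetical names, and the 1-based numbering is an assumption that matches the value currently being sent.

```python
# Exit code this PR describes for the remote cache eviction error.
REMOTE_CACHE_EVICTED = 39


def retry_with_attempt_numbers(run_attempt, max_retries):
    """Call run_attempt(attempt_number) until it stops returning 39.

    At most max_retries retries are made after the first attempt.
    Returns the list of (attempt_number, exit_code) pairs, in order.
    """
    history = []
    for attempt_number in range(1, max_retries + 2):
        exit_code = run_attempt(attempt_number)
        history.append((attempt_number, exit_code))
        if exit_code != REMOTE_CACHE_EVICTED:
            break
    return history
```

With numbering like this, a BES backend could tell a retried invocation apart from the first attempt instead of seeing attempt 1 every time.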

fweikert pushed a commit to fweikert/bazel that referenced this pull request May 25, 2023
@coeuvre
Member Author

coeuvre commented Jun 6, 2023

Created #18591 to address #17711 (comment).

Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team