Remote cache poisoned? #1174

Closed
alexeagle opened this issue Jun 9, 2021 · 17 comments


@alexeagle
Contributor

Hey @fweikert

rules_nodejs has had almost all our RBE builds red for the last few days. On PRs I'm pressing the Retry button multiple times to get them green, and our default branch has been red for a day.

https://buildkite.com/bazel/rules-nodejs-nodejs/builds/9591#bdced136-a515-4099-a52b-5984d9575a61 is an example failure:


(03:24:59) ERROR: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/bazel_tools/tools/jdk/BUILD:346:14: Action external/bazel_tools/tools/jdk/platformclasspath.jar [for host] failed: (Exit 34): com.google.devtools.build.lib.remote.BulkTransferException

	at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:227)
[...]
Suppressed: java.io.IOException: Output download failed: Expected digest '6f5ff115e713ede319bc832024f78d018ae0c5da7a810c8af68b2b5368d00a0d/85582084' does not match received digest 'e294ea66b89ce1dee25c6c6f354ddb8ebffd303f6ca9255c5fbe7d0d6f31d374/85782084'.

at com.google.devtools.build.lib.remote.util.Utils.verifyBlobContents(Utils.java:201)
at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onCompleted(GrpcCacheClient.java:372)
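
For readers unfamiliar with the error, the check that fails here can be sketched roughly as follows. This is only an illustration, not Bazel's actual code: it assumes the digest is a SHA-256 hex hash followed by `/` and the size in bytes (which matches the shape of the values in the log), and the helper names `blob_digest` and `verify_blob` are made up for the example.

```python
import hashlib


def blob_digest(data: bytes) -> str:
    """Return '<sha256 hex>/<size in bytes>', the shape seen in the log (assumed format)."""
    return f"{hashlib.sha256(data).hexdigest()}/{len(data)}"


def verify_blob(expected: str, data: bytes) -> None:
    """Raise IOError if the downloaded bytes do not match the expected digest.

    Hypothetical helper; the message mirrors the one in the log above.
    """
    received = blob_digest(data)
    if received != expected:
        raise IOError(
            f"Output download failed: Expected digest '{expected}' "
            f"does not match received digest '{received}'."
        )
```

Note that a mismatch only says the bytes that arrived differ from what the action result promised; it does not by itself say whether the stored blob or the transfer is at fault.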

It's always this same entry causing the problem. Seems like something has gotten into the remote cache that shouldn't be there. Is it easy to just blow away the storage for that cache instance? (No idea if it's shared-tenant with other rulesets, or if there are other reasons to be careful there.)

Thanks!!

@philwo
Member

philwo commented Jun 9, 2021

@coeuvre Is this something you want to debug? 🤔

@alexeagle The tricky thing here is that this is happening on the RBE platform, so our usual approach to dealing with cache poisoning will not work here:

platform_cache_key += ["cache-poisoning-20210323".encode("utf-8")]

I will reach out for help.
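
As a rough illustration of why changing that string normally helps: if the parts of `platform_cache_key` are hashed into a single key that partitions the cache, then bumping one part yields a fresh key and the old entries are never looked up again. The sketch below assumes that hashing scheme; the function name `cache_silo_key`, the organization name, and the second salt value are hypothetical and not taken from the CI scripts.

```python
import hashlib
from typing import Sequence


def cache_silo_key(parts: Sequence[bytes]) -> str:
    """Hash all key parts into one hex string (assumed scheme, for illustration)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part)
        digest.update(b":")  # separator so that part boundaries affect the hash
    return digest.hexdigest()


# "example-org" and the 20210609 salt are placeholders, not real values.
base = ["example-org".encode("utf-8")]
old_key = cache_silo_key(base + ["cache-poisoning-20210323".encode("utf-8")])
new_key = cache_silo_key(base + ["cache-poisoning-20210609".encode("utf-8")])

# Different salts give different keys, so previously cached entries are bypassed.
print(old_key != new_key)
```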

@philwo
Member

philwo commented Jun 9, 2021

@alexeagle I checked with the RBE team, this doesn't look like something on the backend side. The blob is apparently there and fine. 👀 We're suspecting a Bazel bug, although not sure yet what's going on.

Could you bump your .bazelversion to 4.1.0 and see if it fixes this?

@alexeagle
Contributor Author

Sure, bazel-contrib/rules_nodejs#2761 thanks for investigating!!

(maybe that will be enough to change the input key for that action in the cache such that this isn't reproducible, but that's okay with me; I'm not trying to discover Bazel bugs today, just keep our CI green)

@alexeagle
Contributor Author

Hmm, I get some new errors on RBE now:


(20:46:23) ERROR: Traceback (most recent call last):
File "/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/buildkite_config/java/BUILD", line 19, column 60, in <toplevel>
load("@bazel_tools//tools/jdk:local_java_repository.bzl", "local_java_runtime")
Error: file '@bazel_tools//tools/jdk:local_java_repository.bzl' does not contain symbol 'local_java_runtime'
(20:46:23) INFO: ToolchainResolution:     Type //toolchains/node:toolchain_type: target platform @buildkite_config//config:platform: Rejected toolchain @nodejs_linux_s390x_config//:toolchain; mismatching values: s390x
(20:46:23) INFO: ToolchainResolution:     Type @io_bazel_rules_go//go:toolchain: target platform @buildkite_config//config:platform: Rejected toolchain @go_sdk//:go_android_arm_cgo-impl; mismatching values: android, arm, cgo_on
(20:46:23) ERROR: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/bazel_tools/tools/jdk/BUILD:69:26: every rule of type java_runtime_alias implicitly depends upon the target '@buildkite_config//java:jdk', but this target could not be found because of: no such target '@buildkite_config//java:jdk': target 'jdk' not declared in package 'java' defined by /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/buildkite_config/java/BUILD

maybe related to a warning I see locally

Current running Bazel is ahead of bazel-toolchains repo. Please update your pin to bazel-toolchains repo in your WORKSPACE file.
DEBUG: /home/alexeagle/.cache/bazel/_bazel_alexeagle/78c7479a1baf683797e1e6cd7cfefe83/external/bazel_toolchains/rules/rbe_repo/checked_in.bzl:125:14: buildkite_config not using checked in configs; Bazel version 4.1.0 was picked/selected but no checked in config was found in map {"0.20.0": ["8.0.0"], "0.21.0": ["8.0.0"], "0.22.0": ["8.0.0", "9.0.0"], "0.23.0": ["8.0.0", "9.0.0"], "0.23.1": ["8.0.0", "9.0.0"], "0.23.2": ["9.0.0"], "0.24.0": ["9.0.0"], "0.24.1": ["9.0.0"], "0.25.0": ["9.0.0"], "0.25.1": ["9.0.0"], "0.25.2": ["9.0.0"], "0.26.0": ["9.0.0"], "0.26.1": ["9.0.0"], "0.27.0": ["9.0.0"], "0.27.1": ["9.0.0"], "0.28.0": ["9.0.0"], "0.28.1": ["9.0.0"], "0.29.0": ["9.0.0"], "0.29.1": ["9.0.0", "10.0.0"], "1.0.0": ["9.0.0", "10.0.0"], "1.0.1": ["10.0.0"], "1.1.0": ["10.0.0"], "1.2.0": ["10.0.0"], "1.2.1": ["10.0.0"], "2.0.0": ["10.0.0"], "2.1.0": ["10.0.0"], "2.1.1": ["10.0.0", "11.0.0"], "2.2.0": ["11.0.0"], "3.0.0": ["11.0.0"], "3.1.0": ["11.0.0"], "3.2.0": ["11.0.0"], "3.3.0": ["11.0.0"], "3.3.1": ["11.0.0"], "3.4.1": ["11.0.0"], "3.5.0": ["11.0.0"], "3.5.1": ["11.0.0"], "3.6.0": ["11.0.0"], "3.7.0": ["11.0.0"], "3.7.1": ["11.0.0"], "3.7.2": ["11.0.0"], "4.0.0": ["11.0.0"]}

will try pinning like it says

@rubensf

rubensf commented Jun 9, 2021

Sounds like you may need to update the bazel-toolchains repo?

FYI rbe_autoconfig doesn't work past bazel 4.0.0 - you'll need to update configurations following https://github.com/bazelbuild/bazel-toolchains#generating-configs.

@alexeagle
Contributor Author

alexeagle commented Jun 10, 2021

thanks @rubensf I was never able to figure this out in prior attempts. We've been using the bazel-0.28.0.bazelrc this whole time ever since setting up our RBE test job and managing to get away with that.

I tried following the instructions, running

./rbe_configs_gen --bazel_version=4.1.0 --toolchain_container=l.gcr.io/google/rbe-ubuntu16-04:latest --output_src_root=$HOME/Projects/rules_nodejs --output_config_path=tools/rbe --exec_os=linux --target_os=linux

and committing the resulting files along with the recent bazel-4.1.0.bazelrc, with local edits to point to that tools/rbe folder, and still have a failure:

(04:34:20) ERROR: Traceback (most recent call last):
File "/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/buildkite_config/java/BUILD", line 19, column 60, in <toplevel>
load("@bazel_tools//tools/jdk:local_java_repository.bzl", "local_java_runtime")
Error: file '@bazel_tools//tools/jdk:local_java_repository.bzl' does not contain symbol 'local_java_runtime'

maybe there's still some version skew in that config

@rubensf

rubensf commented Jun 10, 2021

I have only used these configs for C++ so far, so I may be missing something 😅

I think you might just want to remove those rbe_autoconfig declarations from your workspace in bazel-contrib/rules_nodejs#2761.

Also, I'm not sure misleading -- see comment in bazelbuild/bazel-toolchains#926. I'm not sure if java_runtime_version already exists in 4.1.0... Certainly the auto generated configurations haven't created a rbe-sdk. I'd say stick to the 2.0.0 rc :p

@philwo
Member

philwo commented Jun 10, 2021

@alexeagle I think that error referring to local_java_runtime is this one: bazelbuild/bazel#13099 (comment)

More context: bazelbuild/bazel#13502

@alexeagle
Contributor Author

thanks @philwo I had seen that discussion but didn't put it together that I'd need that 4.1.0 of the toolchains repo to fix this.

Now I get

(16:12:40) ERROR: /Users/buildkite/builds/bk-imacpro-10/bazel/rules-nodejs-nodejs/tools/rbe/cc/BUILD:50:19: in cc_toolchain_suite rule //tools/rbe/cc:toolchain: cc_toolchain_suite '//tools/rbe/cc:toolchain' does not contain a toolchain for cpu 'darwin'

obviously darwin is an os, not a cpu, so some wiring is crossed somewhere?

The only related thing I can find in rules_nodejs is our platform definitions in toolchains/node/BUILD.bazel, such as

platform(
    name = "darwin_amd64",
    constraint_values = [
        "@platforms//os:osx",
        "@platforms//cpu:x86_64",
    ],
)

sorry to be stuck here :(

@rubensf

rubensf commented Jun 14, 2021

For Windows, you'll also need to run rbe_configs_gen using a Windows Docker container on a Windows machine, or deactivate RBE for Windows.

For macOS, there's no RBE support, so you'll have to deactivate remote support for that too.

(Basically, tweak the .bazelrc configurations to not include --remote on Windows/macOS builds.)

Note that you can still use remote caching (basically setting --remote_cache instead of --remote_executor, and removing all the toolchain configurations).
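
The split described above could be sketched like this: remote execution only where RBE workers exist, and plain remote caching elsewhere. It is a minimal sketch under assumptions. The helper `remote_flags`, the platform names, and the cache endpoint passed in are hypothetical; only the executor endpoint comes from this thread.

```python
from typing import List

# Executor endpoint as quoted elsewhere in this thread.
RBE_ENDPOINT = "remotebuildexecution.googleapis.com"


def remote_flags(platform: str, cache_endpoint: str) -> List[str]:
    """Pick remote flags for a platform name (hypothetical helper).

    Linux gets remote execution (the generated toolchain configs would be
    added separately); every other platform gets remote caching only, with
    no remote toolchain configuration.
    """
    if platform == "linux":
        return [f"--remote_executor={RBE_ENDPOINT}"]
    return [f"--remote_cache={cache_endpoint}"]


# The cache endpoint below is a placeholder, not a real service.
print(remote_flags("macos", "grpcs://cache.example.com"))
```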

@philwo
Member

philwo commented Jun 14, 2021

@alexeagle Can you send a link to a Buildkite log where that error message is visible? I don't think it should happen when the job runs on the rbe_ubuntu1604 platform on Bazel CI (it would automatically manage the --remote_* flags and only enable execution for that platform, but not for macOS / Windows). 🤔

@coeuvre
Member

coeuvre commented Jun 15, 2021

After comparing the corrupted blob with the correct one, I think this is caused by a known issue bazelbuild/bazel#12927 (comment). Bazel 4.1.0 and HEAD contain the fix.

@EricBurnett

FYI, for the corrupted blob downloads, I expect you can mitigate in any version of bazel by setting --remote_timeout=3600 (or some suitably long period).

Chi linked the relevant bug - appears to be a race in gRPC at certain versions, which is triggered by the RPC timeout/retry flow for RPCs that exceed a certain duration. I'm guessing your build is download-bottlenecked at that phase, causing the 85MB file download to take >60s (the default), causing the race to be hit.
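
To put numbers on that guess, here is a back-of-the-envelope calculation of the sustained download rate needed to finish before the timeout. It uses the expected blob size from the error message in the original report (85582084 bytes) and the 60 s default mentioned above; the helper name `min_throughput` is made up for the example.

```python
def min_throughput(size_bytes: int, timeout_s: float) -> float:
    """Bytes per second needed to download size_bytes within timeout_s."""
    return size_bytes / timeout_s


BLOB_SIZE_BYTES = 85_582_084  # size component of the expected digest in the log

# With the 60 s default, the download must sustain about 1.43 MB/s.
print(f"{min_throughput(BLOB_SIZE_BYTES, 60) / 1e6:.2f} MB/s")

# With --remote_timeout=3600, about 23.8 kB/s is enough.
print(f"{min_throughput(BLOB_SIZE_BYTES, 3600) / 1e3:.1f} kB/s")
```

So any worker whose effective download rate for this blob drops below roughly 1.4 MB/s would run into the timeout and retry path with the default setting.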

@philwo
Member

philwo commented Jun 15, 2021

Thank you Eric! I assume the flag is safe to set even with future versions of Bazel? We could add it to our list of remote execution flags here, then it would apply to all jobs running on Bazel CI that use our rbe_ubuntu1604 platform:

flags = [
    "--remote_executor=remotebuildexecution.googleapis.com",
    "--remote_instance_name=projects/bazel-untrusted/instances/default_instance",
    "--incompatible_strict_action_env",
    "--google_default_credentials",
    "--toolchain_resolution_debug",
]

@EricBurnett

Should be safe, yep - it's a default we recommend for RBE, and most of our users have had it set since the beginning.

I'll note that it may not be safe with other remote execution services than RBE...some have races that can lead to hung RPCs and thus hung builds that a smaller remote timeout would have papered over. But if you don't use other services, or don't set this flag on them, you should be fine.

IIRC some future version of bazel will make this unnecessary by changing the logic for how remote timeouts apply to long-running RPCs like execution and download, but I don't think that has landed yet, so still useful at HEAD.
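
For illustration, the flag list quoted earlier with the suggested timeout appended might look like the sketch below. The first five flags are copied from that list; the wrapper function `rbe_flags`, its parameter, and the position of the new flag are assumptions rather than the actual change.

```python
from typing import List


def rbe_flags(remote_timeout_s: int = 3600) -> List[str]:
    """Return the RBE flag list with a long remote timeout (hypothetical helper)."""
    return [
        "--remote_executor=remotebuildexecution.googleapis.com",
        "--remote_instance_name=projects/bazel-untrusted/instances/default_instance",
        "--incompatible_strict_action_env",
        "--google_default_credentials",
        "--toolchain_resolution_debug",
        # Long timeout so that large downloads do not hit the retry race.
        f"--remote_timeout={remote_timeout_s}",
    ]
```

Keeping the timeout as a parameter makes it easy to use a shorter value for other remote execution services, given the caveat about hung RPCs above.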

philwo added a commit that referenced this issue Jun 15, 2021
The bug is fixed in Bazel >= 4.1.0, but it doesn't hurt.

Context: #1174
coeuvre pushed a commit that referenced this issue Jun 16, 2021
The bug is fixed in Bazel >= 4.1.0, but it doesn't hurt.

Context: #1174
@coeuvre
Member

coeuvre commented Jun 16, 2021

> IIRC some future version of bazel will make this unnecessary by changing the logic for how remote timeouts apply to long-running RPCs like execution and download, but I don't think that has landed yet, so still useful at HEAD.

Yes, this work is on my list.

@alexeagle
Contributor Author

I think this was resolved, since rules_nodejs was able to upgrade Bazel versions with only a small number of RBE flags in bazel-contrib/rules_nodejs#2792
