remote execution fails with a garbage collecting remote cache #3452

Closed
buchgr opened this issue Jul 26, 2017 · 8 comments

@buchgr
Contributor

buchgr commented Jul 26, 2017

Using Bazel 0.5.3rc3 for local execution. The remote cache is nginx. We have a Python script running that enforces an upper size bound on the cache and deletes files based on their atime.
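For readers unfamiliar with this kind of setup, here is a minimal sketch of what an atime-based size-bounding script could look like. It is not the script used in this report; the function name `evict_by_atime` and its parameters are made up for illustration.

```python
import os
from pathlib import Path
from typing import List


def evict_by_atime(cache_dir: str, max_bytes: int) -> List[str]:
    """Delete least-recently-accessed files until the cache fits in max_bytes.

    Returns the paths that were deleted, oldest access time first.
    (Hypothetical helper, not the script from this issue.)
    """
    entries = []
    for path in Path(cache_dir).rglob("*"):
        if path.is_file():
            st = path.stat()
            entries.append((st.st_atime, st.st_size, str(path)))

    total = sum(size for _, size, _ in entries)
    deleted = []
    # Oldest atime first: these are the blobs nobody has read recently.
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        deleted.append(path)
    return deleted
```

A script of this shape evicts blob by blob and knows nothing about which blobs a cached action result refers to, so a cached action can outlive the output files it points at.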

We get errors like

/usr/bin/ar: bazel-out/local-fastbuild/bin/third_party/zlib/libzlib.a: File format not recognized

The problem is that the file is empty.

$ ll bazel-out/local-fastbuild/bin/third_party/zlib/libzlib.a
-rw-r----- 1 buchgr eng 0 Jul 26 10:43 bazel-out/local-fastbuild/bin/third_party/zlib/libzlib.a

cc: @damienmg @ulfjack @ola-rozenfeld

buchgr added the "category: service APIs" and "P1 I'll work on this now. (Assignee required)" labels Jul 26, 2017
buchgr self-assigned this Jul 26, 2017
@ulfjack
Contributor

ulfjack commented Jul 26, 2017

That is ... not ideal. Any ideas?

@buchgr
Contributor Author

buchgr commented Jul 26, 2017

I don't know the cause yet. I am debugging it currently. The good thing is that I can reproduce it 100%.

@buchgr
Contributor Author

buchgr commented Jul 26, 2017

We somehow seem to upload the results of failed action executions to the action cache. I managed to get the CAS hash of the ActionKey, which is 9456920373db4ff976e09c3d1446cba8bc301e66.

I then logged on to the remote cache machine and decoded the blob (which is a serialized ActionResult message):

protoc --decode_raw < 9456920373db4ff976e09c3d1446cba8bc301e66
2 {
  1: "bazel-out/local-fastbuild/bin/third_party/ijar/libplatform_utils.a"
  2 {
    1: "da39a3ee5e6b4b0d3255bfef95601890afd80709"
  }
}
6 {
  1: "da39a3ee5e6b4b0d3255bfef95601890afd80709"
}
8 {
  1: "9026d157afa41c8daa19757d490a42cc77de382e"
  2: 108
}

Tag 2 is output files, 6 is stdout digest, 8 is stderr digest.

da39a3ee5e6b4b0d3255bfef95601890afd80709 is the SHA-1 hash of the zero-byte blob, so the output file and stdout are both empty. However, there is also a stderr blob, 9026d157afa41c8daa19757d490a42cc77de382e, and that is 108 bytes long.

$ cat 9026d157afa41c8daa19757d490a42cc77de382e
/usr/bin/ar: bazel-out/local-fastbuild/bin/third_party/ijar/libplatform_utils.a: File format not recognized
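The two numbers in this analysis are easy to check locally. The snippet below is only a sanity check: it confirms that the SHA-1 of zero bytes is the digest seen in the ActionResult, and that the stderr text has the reported length, assuming the blob ends in a single trailing newline (an assumption, since the raw blob is not shown).

```python
import hashlib

# SHA-1 of zero bytes of input: the digest of the empty blob.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()
print(EMPTY_SHA1)  # da39a3ee5e6b4b0d3255bfef95601890afd80709

# The stderr text from the issue, plus one trailing newline (assumed).
stderr_blob = (
    b"/usr/bin/ar: bazel-out/local-fastbuild/bin/third_party/ijar/"
    b"libplatform_utils.a: File format not recognized\n"
)
print(len(stderr_blob))  # 108
```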

So we seem to upload the results of failed action executions. That means we have two bugs:

  1. If we purge the remote cache, bazel errors.
  2. Bazel uploads these errors to the remote cache.

😄 🍾 🎆

@buchgr
Contributor Author

buchgr commented Jul 26, 2017

I fixed the upload of failed action results, and now things are better but we are still failing in weird ways. Looking...
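The actual change is not shown in this thread. As a rough illustration of the idea only, a guard of the following kind would keep failed results out of the action cache. `LocalResult`, `InMemoryActionCache` and `maybe_upload` are hypothetical names invented for this sketch, not Bazel classes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LocalResult:
    """Hypothetical stand-in for the outcome of a locally executed action."""
    exit_code: int
    outputs: Dict[str, bytes] = field(default_factory=dict)  # path -> content
    stderr: bytes = b""


class InMemoryActionCache:
    """Hypothetical action cache keyed by action key; not a Bazel class."""

    def __init__(self) -> None:
        self.entries: Dict[str, LocalResult] = {}

    def put(self, action_key: str, result: LocalResult) -> None:
        self.entries[action_key] = result


def maybe_upload(cache: InMemoryActionCache, action_key: str,
                 result: LocalResult) -> bool:
    """Store the result only if the action succeeded. Returns True if stored."""
    if result.exit_code != 0:
        # A failed action must not become a cache hit for later builds.
        return False
    cache.put(action_key, result)
    return True
```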

@buchgr
Contributor Author

buchgr commented Jul 26, 2017

We seem to have a problem with output files of cached actions having been deleted from the cache. So the action is still cached, but one or more of the output files has been deleted.

@buchgr
Contributor Author

buchgr commented Jul 26, 2017

Ok, we found the problem. If the remote download of an output file fails, Bazel still creates an empty output file ... and then the local fallback execution fails because the output file already exists, as we are not using sandboxing as a fallback.

The solution seems to be to delete all output files if the remote download fails.
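To make the failure mode and the proposed cleanup concrete, here is a small sketch. It is not the Bazel change itself (Bazel is written in Java); `download_outputs` and the `fetch_blob` callable are hypothetical and exist only for this illustration.

```python
import os
from typing import Callable, Dict


def download_outputs(
    outputs: Dict[str, str],
    fetch_blob: Callable[[str], bytes],
) -> bool:
    """Materialize cached outputs; on any failure, remove all of them.

    `outputs` maps output path -> blob digest. `fetch_blob` is a hypothetical
    callable that returns the blob's bytes or raises on a cache miss.
    Returns True on success, False if the caller should fall back to
    local execution.
    """
    try:
        for path, digest in outputs.items():
            # Opening for write creates the file before the fetch can fail,
            # which is how an empty output can be left behind.
            with open(path, "wb") as f:
                f.write(fetch_blob(digest))
        return True
    except Exception:
        # Delete every declared output, not just the one that failed, so a
        # local re-run starts from a clean slate.
        for path in outputs:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass
        return False
```

Without the `except` branch, a cache miss on any blob would leave a zero-byte file at that output path, which matches the empty `libzlib.a` in the original report.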

@ola-rozenfeld
Contributor

Right, I thought we should delete the partially created file as part of the fail/cleanup, but Ulf just explained this will not be enough for some tools. You're right, deleting everything is the way to go.

@buchgr
Contributor Author

buchgr commented Jul 26, 2017

Ok got no more failures, even with aggressive purging ... working on fixes ...

bazel-io pushed a commit that referenced this issue Jul 27, 2017
If a remote download fails, delete any output files that might have
already been created. Otherwise, they might interfere with a subsequent
locally executed action that expects none of its output files to
exist. See #3452.

Change-Id: I467a97d05606c586aa257326213940a37dad9dd5
PiperOrigin-RevId: 163336093