Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V3 idea: No longer allow Digest.size_bytes <= 0 #134

Open
EdSchouten opened this issue Apr 20, 2020 · 4 comments
Open

V3 idea: No longer allow Digest.size_bytes <= 0 #134

EdSchouten opened this issue Apr 20, 2020 · 4 comments

Comments

@EdSchouten
Copy link
Collaborator

Right now it is allowed to create Digest messages that have size_bytes == 0, referring to the empty blob. In #131 we're extending the protocol to require that the empty blob is always present, because it can be derived trivially. I personally find this a bit problematic:

  • It makes the protocol less regular and consistent.
  • Naïvely implemented client/servers will get this wrong. For example, what is FindMissingBlobs() on {hash: "e984d2bdd07318c4e29f7a2ceea4a9e4569e2d8e695a953a4e2df6f69fbdec95", size_bytes: 0} supposed to do? Report existence, because it has size zero? Or should it report absence, because the empty blob actually has SHA-256 sum e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855?
  • When digests for empty blobs are embedded in other messages, they still waste space. We still end up storing a SHA-256 sum.

I would like to suggest that we simply deny the existence of Digest messages with size_bytes <= 0. Any field where the empty blob needs to be referenced, we should use null. This means that the optimization that Bazel performs of not loading the empty blob becomes the norm, as there is no longer any way to even address the empty blob.

@juergbi
Copy link
Contributor

juergbi commented Apr 20, 2020

* For example, what is FindMissingBlobs() on `{hash: "e984d2bdd07318c4e29f7a2ceea4a9e4569e2d8e695a953a4e2df6f69fbdec95", size_bytes: 0}` supposed to do? Report existence, because it has size zero? Or should it report absence, because the empty blob actually has SHA-256 sum `e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`?

As it's not the correct empty blob, it should report absence.

* It makes the protocol less regular and consistent.

I'm not sure either optimization variant is really better with regards to regularity/consistency. Also with your suggested approach, special cases have to be defined. E.g., what happens if a client attempts to write a blob with a Digest with size_bytes == 0? I'd guess something like INVALID_ARGUMENT.

However, as this approach is overall more efficient and the implementation complexity doesn't seem higher to an optimized v2 implementation, I think it's a sensible plan.

There may be code in some clients and/or servers that currently associates a different meaning to null Digests in some contexts (e.g., Digest not yet calculated). However, I don't think this should be a reason against this idea. Assuming there is no ambiguity anywhere in the protocol itself.

@EdSchouten
Copy link
Collaborator Author

E.g., what happens if a client attempts to write a blob with a Digest with size_bytes == 0? I'd guess something like INVALID_ARGUMENT.

Note that implementations likely already have tests like these for the size_bytes < 0 case. They now only need to be updated to size_bytes <= 0.

@EricBurnett
Copy link
Collaborator

One thing to watch out for is type confusion between "unpopulated proto" and "legitimately empty digest" that can cause trouble and make validation more difficult. (See e.g. #6 for the reverse of this, where it's desired that the unpopulated ActionResult protos never be interpreted as a valid message).

If we go down the road of making the empty blob not be a "normal" blob (no digest), would it make sense to have a separate field like empty that must be specified instead? I don't think that makes the special-casing any worse than checking hash==nil && size==0, and it only requires storing 1 extra byte or so per "empty" Digest proto.

@erikmav
Copy link
Contributor

erikmav commented Jul 25, 2020

FYI this proposal conflicts with the proposed dynamic execution and two-phase caching in #156 (PR #155) where an empty digest is intended to mean a filesystem entry presented in the worker's filesystem view but not predicted as being needed by the action. See the commentary on the File and Directory message doc comments in the PR.

santigl pushed a commit to santigl/remote-apis that referenced this issue Aug 26, 2020
When given an output directory containing an invalid symlink (i.e., a
symlink pointing to a non-existent file), the SDK fails to compute
Merkle tree because of Stat failure. This change makes the SDK ignore
it.

Tested by both using this version of re-client and added unit-tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants