Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Servers must provide empty blobs that weren't uploaded #131

Merged
merged 1 commit into from
May 2, 2020

Conversation

mostynb
Copy link
Contributor

@mostynb mostynb commented Apr 16, 2020

Some clients refer to empty blobs without previously uploading them, as a performance optimization. Servers must support this.

Copy link
Collaborator

@EdSchouten EdSchouten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion this is the worst outcome that could have come out of this discussion. It adds unnecessary complexity to a system that would otherwise have been completely regular. To me it is also not evident how this leads to any meaningful increase in performance. Any client with some level of smartness would only download the empty blob once, due to it also using CAS-based storage locally. Because CAS blobs are (generally) only uploaded after calling FindMissingBlobs(), it would also save just one upload. It also doesn't reduce the total number of FindMissingBlobs() calls, as it is extremely unlikely that FindMissingBlobs() is called on just the empty blob.

Smarter ways to solve this would either have been to teach Bazel to upload empty blobs, or to simply disallow Digest messages to exist with size_bytes <= 0. References to empty objects could use null Digests. The latter would solve the problem in a completely obvious, non-ambiguous way, while also reducing the size of, most notably, ActionResult.

That said, I suspect that I am in a minority here, so let's just get this over with.

build/bazel/remote/execution/v2/remote_execution.proto Outdated Show resolved Hide resolved
@mostynb
Copy link
Contributor Author

mostynb commented Apr 16, 2020

In my opinion this is the worst outcome that could have come out of this discussion. It adds unnecessary complexity to a system that would otherwise have been completely regular.

Note that I'm documenting existing behaviour for compatibility reasons, not advocating for the design.

Smarter ways to solve this would either have been to teach Bazel to upload empty blobs, or to simply disallow Digest messages to exist with size_bytes <= 0. References to empty objects could use null Digests. The latter would solve the problem in a completely obvious, non-ambiguous way, while also reducing the size of, most notably, ActionResult.

Agreed- I initially tried this approach with bazel-remote based on the spec, but had to resort to this behaviour in order to support bazel. Something to improve for REAPIv3 (since we can't do it without breaking existing clients)?

Some clients refer to empty blobs without previously uploading them, as a
performance optimization. Servers must support this.

Perhaps we can improve on this for REAPIv3, eg require that empty blobs be
represented by null Digests.
@mostynb mostynb changed the title Servers should provide empty blobs that weren't uploaded Servers must provide empty blobs that weren't uploaded Apr 16, 2020
@scele
Copy link

scele commented Apr 17, 2020

In my opinion this is the worst outcome that could have come out of this discussion. -- I suspect that I am in a minority here

FWIW, I agree with @EdSchouten. Seems backwards to change the spec to be more complex, just because one client (bazel) is not conforming to the existing spec. Why not fix this on bazel side instead?

@mostynb
Copy link
Contributor Author

mostynb commented Apr 17, 2020

FWIW, I agree with @EdSchouten. Seems backwards to change the spec to be more complex, just because one client (bazel) is not conforming to the existing spec. Why not fix this on bazel side instead?

IIUC there are many bazel projects still using older versions of bazel, so even if we could convince the bazel team to accept such a change we would probably suffer from incompatibility problems indefinitely.

@EricBurnett
Copy link
Collaborator

In my opinion this is the worst outcome that could have come out of this discussion. -- I suspect that I am in a minority here

FWIW, I agree with @EdSchouten. Seems backwards to change the spec to be more complex, just because one client (bazel) is not conforming to the existing spec. Why not fix this on bazel side instead?

A few reasons.

  1. The complexity server-side is quite small - upon receipt of a upload, download, or FindMissing request for the empty blob (by hash+size), short circuit. This is probably not even a net increase in complexity, in that it simplifies lower layers and makes it easier to add sanity error-checking to know that they'll never need to deal with 0-byte payloads, any other request observed with a 0-size blob is a client error, etc.

  2. The converse is not true - the client-side complexity is ugly. To do null-blob optimizations clients can simply ignore downloading the empty blob, but have to track global state to be able to ensure that the empty blob is uploaded and present whenever referenced. And this includes potentially "touching" it or downloading it anyways if some action is required to ensure its lifetime is extended remotely, etc.

  3. There are more clients than servers (and that will only grow), so there's extra bias to push complexity server-side.

  4. As noted, for anyone who deals with Bazel (or any other clients making this assumption) it's a practical necessity today.

  5. And finally, this optimization actually matters. I haven't checked stats recently, but from when we implemented this same optimization early on in our bot, we observed some stupid fraction of blob references were for the empty blob. It's weirdly prevalent. And so to not special case it puts more read/write load on the server unnecessarily, risks hotspotting a row in the backing DB because it's the single hottest blob in existence, etc etc.

@illicitonion
Copy link
Contributor

In my opinion this is the worst outcome that could have come out of this discussion. -- I suspect that I am in a minority here

FWIW, I agree with @EdSchouten. Seems backwards to change the spec to be more complex, just because one client (bazel) is not conforming to the existing spec. Why not fix this on bazel side instead?

A few reasons.

  1. The complexity server-side is quite small - upon receipt of a upload, download, or FindMissing request for the empty blob (by hash+size), short circuit. This is probably not even a net increase in complexity, in that it simplifies lower layers and makes it easier to add sanity error-checking to know that they'll never need to deal with 0-byte payloads, any other request observed with a 0-size blob is a client error, etc.
  2. The converse is not true - the client-side complexity is ugly. To do null-blob optimizations clients can simply ignore downloading the empty blob, but have to track global state to be able to ensure that the empty blob is uploaded and present whenever referenced. And this includes potentially "touching" it or downloading it anyways if some action is required to ensure its lifetime is extended remotely, etc.
  3. There are more clients than servers (and that will only grow), so there's extra bias to push complexity server-side.
  4. As noted, for anyone who deals with Bazel (or any other clients making this assumption) it's a practical necessity today.
  5. And finally, this optimization actually matters. I haven't checked stats recently, but from when we implemented this same optimization early on in our bot, we observed some stupid fraction of blob references were for the empty blob. It's weirdly prevalent. And so to not special case it puts more read/write load on the server unnecessarily, risks hotspotting a row in the backing DB because it's the single hottest blob in existence, etc etc.

Similarly, we independently implemented this optimisation in another client (Pants) and another server (Scoot) because we found it both simplified implementation, and had noticeable performance effects.

@ulfjack
Copy link
Collaborator

ulfjack commented Apr 17, 2020

We had a release blocker because of this special case (thanks to the person who started this thread for us finding it ;-), and it also seems to increase server-side complexity rather than decrease. A server can still special-case this even if it isn't in the spec, so I think the primary question is whether it improves the client or not.

I don't immediately see the client-side value - the client already has to keep track of all the other things, and adding one more doesn't seem to make a big difference.

@ulfjack
Copy link
Collaborator

ulfjack commented Apr 17, 2020

The reason we didn't see it before is that it only caused an issue for us if the client attempts to run an action consuming an empty file as an input on a fresh cluster. The empty output is so common that it's basically guaranteed to be there after a few actions.

@EdSchouten
Copy link
Collaborator

EdSchouten commented Apr 17, 2020

  1. And finally, this optimization actually matters. I haven't checked stats recently, but from when we implemented this same optimization early on in our bot, we observed some stupid fraction of blob references were for the empty blob. It's weirdly prevalent. And so to not special case it puts more read/write load on the server unnecessarily, risks hotspotting a row in the backing DB because it's the single hottest blob in existence, etc etc.

What I've observed is that this optimization is just a drop in the ocean compared to what may actually be done to improve performance. Right now Bazel makes absolutely no attempt to cache knowledge of which blobs exist on the server. FindMissingBlobs() calls tend to be huge, relative to what is actually meaningful to request at that point in time.

For a workload of mine I have observed that if Bazel were to just cache blob existence info for a single minute (which seems very fair), I can obtain a >99.8% hit rate for blobs that ended up existing. Such an optimization has a lot more impact than distinguishing between the empty blob.

@EricBurnett
Copy link
Collaborator

...and it also seems to increase server-side complexity rather than decrease...

To put numbers to this, I checked and our special-casing of the empty blob is approximately 20 LoC in our server in total, not all of which are strictly required (we do IsEmpty() checks at various layers, sometimes redundantly). These are pretty boring lines of code, e.g. for ByteStream.Read the entire special-case looks like if digest.IsEmpty(dg) { return nil }

I can't quantify the savings we get from knowing that our storage should never legitimately see 0-byte blobs, so let's just ignore that and say that our anecdotal evidence is it's at worst +20 LoC for us.

For a workload of mine I have observed that if Bazel were to just cache blob existence info for a single minute (which seems very fair), I can obtain a >99.8% hit rate for blobs that ended up existing. Such an optimization has a lot more impact than distinguishing between the empty blob.

Certainly bazel is super dumb about blob existence checks today, and so the presence or absence of the empty blob may not be material there. (Noting the fact that bazel exists in the wild with this assumption already, so this point is also somewhat moot.)

Bot side it's a little different though, in that most actions output 2-3 blobs, one of which is usually the empty blob, and so having to upload it is in the realm of +50-100% of blob uploads. Which might be negligible (if all blobs fit in a batch upload, it's a few bytes in an existing RPC); and might not be (e.g. if there's one other file and it's >4MB, we'd be making an additional RPC purely for uploading the empty blob). Though I should also note that we don't require a REAPI spec change to be able to unblock our bots on this, so this is only an argument for why it's useful, not why it's required.

@peterebden
Copy link
Contributor

I agree with @scele and @EdSchouten ; if this is supposed to be a generic API, it should be driven by what makes for a clean API rather than "what does Bazel do".

I don't strongly disagree in that this isn't hard to implement on the server side, but as an optimisation it seems clearly better to be implemented on the client side, where a client conforming to the current API could simply optimise away some requests. Why isn't a SHOULD on the client side enough?

@sstriker
Copy link
Collaborator

Personally I'm fine with including the MUST language for the server side for v2. I think we can revisit in v3. I am not convinced we would end up in a different place.

@EdSchouten
Copy link
Collaborator

I think we can revisit in v3. I am not convinced we would end up in a different place.

Wouldn't my proposal to use null digests for empty objects even be worth considering for a v3?

@sstriker
Copy link
Collaborator

@EdSchouten I think it certainly would be consider. Personally, I am just thinking that that may be more complexity for clients and servers than the current requirement for a server to always have the empty blob available.

@EdSchouten
Copy link
Collaborator

EdSchouten commented Apr 17, 2020

@EdSchouten I think it certainly would be consider. Personally, I am just thinking that that may be more complexity for clients and servers than the current requirement for a server to always have the empty blob available.

You’d be amazed, but that’s often not the case. For certain classes of programming languages, null Protobuf messages are mapped directly to null pointers, option types with value ‘None’, etc. In those programming languages you already need to do null pointer comparisons or pattern matching whenever you access a Digest that’s stored in a message, as it could be null right now. These comparisons could be used to skip loading empty blobs.

This is what I meant with that using null Protobuf messages for this is more obvious and not ambiguous.

@EricBurnett
Copy link
Collaborator

EricBurnett commented Apr 18, 2020 via email

@mostynb
Copy link
Contributor Author

mostynb commented Apr 19, 2020

I think we have a slightly better chance of reducing interoperability problems if we say "MUST"- so this gets my vote. But I'm also OK with "SHOULD".

Re improving this for REAPIv3, should we move that discussion somewhere else? (If so, where? Are there any other v3 changes already being discussed?)

@EricBurnett
Copy link
Collaborator

EricBurnett commented Apr 20, 2020 via email

@sstriker sstriker merged commit 26c67c2 into bazelbuild:master May 2, 2020
@mostynb mostynb deleted the empty_blob_handling branch May 3, 2020 15:09
santigl pushed a commit to santigl/remote-apis that referenced this pull request Aug 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Pull requests whose authors are covered by a CLA with Google.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants