Add workaround for Bazel committed_size check when uploading compressed blobs #1493

Merged
merged 5 commits into master from committed-size-fix
Jan 28, 2022

Conversation

bduffany
Member

Context: bazelbuild/bazel#14654


Version bump: Patch

@bduffany bduffany marked this pull request as ready for review January 27, 2022 23:54
@bduffany bduffany changed the title Add workaround for Bazel committed_size check Add workaround for Bazel committed_size check when uploading compressed blobs Jan 27, 2022
Comment on lines 328 to 331
streamState, err = s.initStreamState(ctx, req)
if status.IsAlreadyExistsError(err) {
return handleAlreadyExists(stream, req)
}
Member

does this jibe with the uploadTracker stuff below, or does this mean we're missing data?

Member Author

@bduffany bduffany Jan 28, 2022

For compressed uploads that span more than one message, this does mean we may now let the client upload a significant amount of data that is not tracked as an upload ("cache write" in the UI).

Fixed by creating a separate UploadTracker in handleAlreadyExists, but only in the case where we have to read the remaining stream.

Note that it still tracks the uncompressed size; we have an existing issue to track the raw number of bytes sent by the client as opposed to uncompressed bytes.

Side note (probably not worth worrying about too much, since these short-circuiting cases should be relatively rare): I think we can do better in the case where the client sends a single message containing the entire payload and we short-circuit because the object already exists in cache. Currently (and also before this PR) we don't track those payloads at all, so we might present an inaccurate view of upload rate, since these payloads are excluded from the total uploaded bytes. We also probably need to decide whether to present them as "cache writes" to the user.
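
For illustration, a minimal sketch of the fix described above, assuming a hypothetical hitTracker.TrackUpload API plus the recvRemainingUploadSize helper sketched further down in this thread; the real BuildBuddy interfaces may differ (bspb and repb are the usual ByteStream and remote-execution proto aliases):

// drainWithTracking reads the rest of an already-exists upload while
// recording it as an upload ("cache write"), so the drained bytes still
// show up in the UI. hitTracker.TrackUpload is an assumed API.
func drainWithTracking(stream bspb.ByteStream_WriteServer, d *repb.Digest) (int64, error) {
	tracker := hitTracker.TrackUpload(d) // hypothetical; records the digest (uncompressed) size
	n, err := recvRemainingUploadSize(stream)
	if err != nil {
		return 0, err
	}
	tracker.Close() // hypothetical: finalizes the tracked upload
	return n, nil
}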

Contributor

We also probably need to decide whether to present those as "cache writes" to the user

If the user uploaded, even if it's a duplicate, I feel we should show it. It's confusing to see a number lower than what they'd see in the gRPC log. Similarly, when it's a read-only key, I feel we should show the number of items uploaded. Maybe we split those in two in the UI, though: uploaded versus written to cache?

Member Author

+1 for separating upload vs. written -- I will file an issue for this.

Contributor

Also, I think showing a warning when a user uploads but nothing is written because of a read-only key would remove a lot of confusion.

return stream.SendAndClose(&bspb.WriteResponse{CommittedSize: committedSize})
}

func remainingUploadSize(stream bspb.ByteStream_WriteServer) (int64, error) {
Member

nit: comment what this function does?

Member Author

Done, and renamed it to make it clearer that it does IO.
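
A sketch of what the renamed helper might look like (the exact name in the PR may differ); it loops over stream.Recv(), which is why the new name should signal that it performs IO:

import (
	"io"

	bspb "google.golang.org/genproto/googleapis/bytestream"
)

// recvRemainingUploadSize reads the rest of the client's Write stream,
// discarding payloads and returning the total number of data bytes
// received. Each iteration performs a network read.
func recvRemainingUploadSize(stream bspb.ByteStream_WriteServer) (int64, error) {
	size := int64(0)
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return size, nil
		}
		if err != nil {
			return 0, err
		}
		size += int64(len(req.Data))
		if req.FinishWrite {
			return size, nil
		}
	}
}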

//
// There are two cases where we can short-circuit, though:
//
// - If this is an uncompressed stream, we assume that the committed_size will
Member

probably worth commenting on what's happening in the biggest part of the function below, i.e. when this is a compressed stream?

Member Author

Reworded and shortened -- hopefully it's clearer now.
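
To make the short-circuit cases concrete, here is a hedged sketch of the committed-size logic the comment describes; committedSizeForExistingBlob is an illustrative name, and the real function may be structured differently:

// committedSizeForExistingBlob computes the committed_size to report when
// the blob already exists in cache. Illustrative only; names and structure
// are assumptions, not copied from the PR.
func committedSizeForExistingBlob(stream bspb.ByteStream_WriteServer, req *bspb.WriteRequest, compressed bool, digestSize int64) (int64, error) {
	if !compressed {
		// Uncompressed stream: Bazel expects committed_size to equal the
		// digest size, which we already know.
		return digestSize, nil
	}
	if req.FinishWrite {
		// Compressed stream, single message: the entire compressed payload
		// arrived in the first request, so its size is already known.
		return req.WriteOffset + int64(len(req.Data)), nil
	}
	// Compressed stream, multiple messages: the compressed size can't be
	// known up front, so read out the rest of the stream and report the
	// number of bytes the client actually sent.
	remaining, err := recvRemainingUploadSize(stream)
	if err != nil {
		return 0, err
	}
	return req.WriteOffset + int64(len(req.Data)) + remaining, nil
}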

Member

@tylerwilliams tylerwilliams left a comment

LGTM

nice job coming up with a workaround for this on the spot :)

@bduffany bduffany merged commit 1cf409d into master Jan 28, 2022
@bduffany bduffany deleted the committed-size-fix branch January 28, 2022 18:06