-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds index_codecs to the sharding codec #253
Adds index_codecs to the sharding codec #253
Conversation
@jbms @jstriebel Please take a look. If we decide to go with this, it would be good to merge this before we send ZEP2 to the ZIC. |
Thanks! Looks good to me other than the comment I raised. |
I'm a bit confused how Rather I think the input should just be |
I think this could indeed be clarified, since we have "array -> array", "array -> bytes", and "bytes -> bytes" codecs to consider. For "array -> array" and "array -> bytes", the input is an array and therefore the relevant size would be the shape of the array, not its size in bytes. For "bytes -> bytes" codecs the input size would be in bytes.
|
The size of the decoded representation is Please note, that these are just implementation suggestions. Some implementations may choose to do this differently. |
That makes sense. I'll clarify that. |
Actually, this can be complicated for |
Yes, the "array -> array" codec should perhaps also indicate the output dtype. Possibly it is easier to just leave these implementation details unspecified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the generalization of the shard index representation is great 🎉
Co-authored-by: Jonathan Striebel <jstriebel@users.noreply.github.com>
Thanks for the feedback! I added some more text to clarify the |
The documentation looks improved. What is the current method to identify if a codec is one of |
You can't tell just from looking at the json, but any supported codec must be known to the implementation, and it is up to the implementation exactly how this is managed. In tensorstore there is a class for each codec, which inherits from one of To parse the json codec list, it first parses as a generic list of |
Please review and merge if approved. Thanks! |
This PR adds the
index_codecs
configuration to the sharding codec. It also adds thecrc32c
checksum bytes-to-bytes codec.Index codecs have been proposed by @jbms: #152 (comment)
There is a prototype implementation in zarrita: https://github.com/scalableminds/zarrita/blob/codec-pipeline/zarrita/sharding.py#L507-L518