Fix/prediction upload #298
base: main
In line with my comments in the Hub PR:
Rather than having these workarounds, I think it makes more sense to ditch the custom codecs. We can use default codecs (i.e. `MsgPack` for AtomArrays and `VLenBytes` for RDKit mols) to ensure it remains a valid default Zarr archive that any machine can open, and then convert from these formats to the objects we need internally.
This would be a bigger change than you've signed up for, though, as it also requires non-trivial changes to the dataset class. Let's do the following:
- For datasets, we keep using the custom codecs
- For predictions, we use default Zarr codecs and add conversion code on the client and service side.
Does that make sense?
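To make the proposed split concrete, here is a minimal sketch of the conversion boundary for predictions: objects are turned into plain bytes before they are written, and turned back into objects after they are read. Everything in it is an assumption for illustration. `Molecule`, `to_default_format`, and `from_default_format` are hypothetical names, and a SMILES string stands in for whatever serialization the real client and service would use for RDKit mols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """Hypothetical stand-in for an internal object such as an RDKit mol."""

    smiles: str


def to_default_format(mol: Molecule) -> bytes:
    # Writing side: reduce the object to plain bytes, which a default
    # variable-length bytes codec can store without any custom codec.
    return mol.smiles.encode("utf-8")


def from_default_format(raw: bytes) -> Molecule:
    # Reading side: rebuild the internal object from the stored bytes.
    return Molecule(smiles=raw.decode("utf-8"))


predictions = [Molecule("CCO"), Molecule("c1ccccc1")]
stored = [to_default_format(m) for m in predictions]  # what the archive would hold
restored = [from_default_format(b) for b in stored]
```

The point of the design is that the archive only ever contains bytes, so it stays readable on machines that do not have the custom codecs registered.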
…to and from custom codecs handled by the client
Hi @j279li, left some comments. Most are pretty minor! Take a look and let me know what you think.
Co-authored-by: Cas Wognum <caswognum@outlook.com>
This pull request introduces improvements to how Zarr object codecs and chunking are handled for prediction outputs, as well as some minor serialization and documentation fixes. The main focus is on making object codec support explicit and robust, especially for custom codecs, and ensuring correct serialization behavior for non-JSON-serializable fields.
Enhancements to Zarr object codec and chunking support:
- Added `detect_object_codec_and_chunking` in `polaris/utils/zarr/codecs.py` to determine the correct object codec, filter list, and chunking compatibility from template filters.
- Added a `supports_chunking` attribute to `RDKitMolCodec` (set to `True`) and `AtomArrayCodec` (set to `False`).
- Added handling of `None` values in `AtomArrayCodec.encode` to explicitly set missing values in packed arrays.
Serialization and model improvements:
- Updated `BenchmarkPredictionsV2` to exclude the non-serializable `dataset_zarr_root` attribute from JSON serialization using a Pydantic `Field` directive, and modified `__repr__` to exclude the `predictions` field, as these are also stored in Zarr.
Minor documentation cleanup:
- Cleaned up the documentation of `splits` in `BenchmarkV2Specification`.
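As a rough illustration of the codec detection summarized above, the sketch below splits a template's filter list into an object codec, the remaining filters, and a chunking flag read from `supports_chunking`. It is not the implementation from this pull request. `FakeCodec` and `detect_codec_sketch` are hypothetical names, and two behaviors are assumptions: that the object codec is the first entry in the filter list, and that a codec without a `supports_chunking` attribute is treated as chunkable.

```python
from dataclasses import dataclass


@dataclass
class FakeCodec:
    """Hypothetical stand-in for a Zarr filter or object codec."""

    codec_id: str
    supports_chunking: bool = True


def detect_codec_sketch(filters):
    """Return (object_codec, remaining_filters, can_chunk) for a filter list."""
    if not filters:
        # No filters: no object codec, nothing to preserve, chunking is fine.
        return None, [], True
    # Assumption: the object codec is stored as the first filter.
    object_codec, *rest = filters
    # Assumption: codecs that do not declare the attribute can be chunked.
    can_chunk = bool(getattr(object_codec, "supports_chunking", True))
    return object_codec, rest, can_chunk


rdkit_like = FakeCodec("rdkit_mol", supports_chunking=True)
atom_array_like = FakeCodec("atom_array", supports_chunking=False)
compressor_like = FakeCodec("other_filter")

codec, rest, can_chunk = detect_codec_sketch([atom_array_like, compressor_like])
```

With these stand-ins, a template whose first filter mirrors `AtomArrayCodec` reports that chunking is not supported, so a caller could fall back to writing the array as a single chunk.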