Skip to content

Conversation

j279li
Copy link
Contributor

@j279li j279li commented Aug 6, 2025

This pull request introduces improvements to how Zarr object codecs and chunking are handled for prediction outputs, as well as some minor serialization and documentation fixes. The main focus is on making object codec support explicit and robust, especially for custom codecs, and ensuring correct serialization behavior for non-JSON-serializable fields.

Enhancements to Zarr object codec and chunking support:

  • Added a utility function detect_object_codec_and_chunking in polaris/utils/zarr/codecs.py to determine the correct object codec, filter list, and chunking compatibility from template filters.
  • Made codec chunking support explicit by adding a supports_chunking attribute to RDKitMolCodec (set to True) and AtomArrayCodec (set to False).
  • Improved handling of None values in AtomArrayCodec.encode to explicitly set missing values in packed arrays.

Serialization and model improvements:

  • Updated BenchmarkPredictionsV2 to exclude the non-serializable dataset_zarr_root attribute from JSON serialization using a Pydantic Field directive, and modified __repr__ to exclude the predictions field as they are also stored in zarr.

Minor documentation cleanup:

  • Removed an outdated docstring reference to splits in BenchmarkV2Specification.

@j279li j279li requested review from danielpeng1 and Andrewq11 August 6, 2025 14:28
@j279li j279li self-assigned this Aug 6, 2025
@j279li j279li requested a review from cwognum as a code owner August 6, 2025 14:28
@j279li j279li added bug Something isn't working fix Annotates any PR that fixes bugs labels Aug 6, 2025
Copy link
Collaborator

@cwognum cwognum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In line with my comments in the Hub PR:

Rather than having these workarounds, I think it makes more sense to ditch the custom codecs. We can use default codecs (i.e. MsgPack for AtomArrays and VLenBytes for RDKit mols), to ensure it remains a valid default Zarr archive that any machine can open, and then convert from these formats to the objects we need internally.

This would be a bigger change than you've signed up for, though, as it also requires non-trivial changes to the dataset class. Let's do the following:

  1. For datasets, we keep using the custom codecs
  2. For predictions, we use default Zarr codecs and add conversion code on the client and service side.

Does that make sense?

Copy link
Collaborator

@cwognum cwognum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @j279li , left some comments. Most are pretty minor! Take a look and let me know what you think.

j279li and others added 11 commits August 7, 2025 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fix Annotates any PR that fixes bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants