Integration with Zarr and chunked arrays #718

Closed
Cadair opened this issue Dec 11, 2019 · 15 comments

Comments

@Cadair
Contributor

Cadair commented Dec 11, 2019

"Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays." It is well suited to parallel computation in Dask arrays, as well as being friendly to workflows where data are stored in the cloud in object stores etc.

The Zarr file format has a specification and supports multiple storage backends, for example DictStore, DirectoryStore, LMDBStore, and SQLiteStore. The storage backend API for the Python zarr package accepts anything that implements the MutableMapping interface; i.e., Zarr passes a backend store a string key and some bytes to store under that key.
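
For illustration, here is a minimal sketch of the store interface Zarr (v2) expects. The AsdfStore name and its dict-backed internals are hypothetical stand-ins, not a real asdf API:

```python
from collections.abc import MutableMapping


class AsdfStore(MutableMapping):
    """Hypothetical store that would persist Zarr keys inside an ASDF file."""

    def __init__(self):
        self._data = {}  # stand-in for real ASDF-backed storage

    def __getitem__(self, key):
        # Zarr reads chunk and metadata bytes through here
        return self._data[key]

    def __setitem__(self, key, value):
        # Zarr hands the store a string key and the bytes to persist under it
        self._data[key] = bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```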

The way this works is that each chunk of the array is passed as a bytes blob to the backend, with a key identifying the chunk. Three extra keys are defined in the Zarr specification for metadata: one for each group (.zgroup), one for each array (.zarray), and one for user metadata (.zattrs).
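
This key layout is easy to inspect: a plain dict satisfies the MutableMapping contract, so we can see exactly what a backend receives (zarr v2 API assumed):

```python
import numpy as np
import zarr

store = {}  # any MutableMapping works as a Zarr store
z = zarr.create(shape=(4, 4), chunks=(2, 2), dtype="f8", store=store)
z[:] = np.arange(16).reshape(4, 4)

print(sorted(store))
# ['.zarray', '0.0', '0.1', '1.0', '1.1']
# '.zarray' is JSON metadata; the 'i.j' keys hold the (compressed) chunk bytes.
```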

What I am proposing is that, with a little work, asdf could be a valid backing store for Zarr. This would enable workflows where astronomy-specific metadata can be easily encoded in files which can be read via the Zarr API, and which are therefore well suited to parallel workflows, specifically Dask arrays.

I would envisage that the Zarr layout in asdf would generally map to a single YAML tree. When asdf is passed the .zarray top-level metadata object (a JSON blob as bytes), we could decode it and save it into the tree. This would provide information such as dtype, top-level array size, and chunk size.
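
A sketch of that decoding step, continuing the dict-store example above (field names follow the Zarr v2 spec):

```python
import json

zarray_meta = json.loads(store[".zarray"])  # bytes -> dict
print(zarray_meta["shape"], zarray_meta["chunks"], zarray_meta["dtype"])
# [4, 4] [2, 2] <f8
# Under this proposal the decoded dict would be saved into the ASDF YAML tree.
```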

When asdf is passed the bytes for an individual array chunk, we would save them to an exploded asdf file, constructing the block header as required for it to be a valid asdf file (the data required for that should be in the top-level .zarray key).
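
One rough way to prototype this with asdf's existing public API: wrap the raw chunk bytes in a uint8 array and write it with all_array_storage="external" (a real write_to option). The key-to-filename mapping here is made up for illustration:

```python
import asdf
import numpy as np


def write_chunk(key, chunk_bytes):
    """Persist one Zarr chunk (e.g. key '0.1') as an exploded ASDF file."""
    arr = np.frombuffer(chunk_bytes, dtype=np.uint8)
    af = asdf.AsdfFile({"chunk": arr})
    # "external" stores the array data in a separate block file next to
    # the small YAML file, i.e. an exploded ASDF layout
    af.write_to(f"chunk_{key}.asdf", all_array_storage="external")
```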

The issues I foresee with this are as follows:

  • We will need to return bytes back to Zarr for .zarray, .zgroup, and .zattrs, which limits what we can return from .zattrs; the interesting metadata would have to be exposed via a separate non-Zarr API. (We could suggest this to Zarr as an improvement: non-bytes metadata for backends that support it.)
  • There is no existing way of dealing with chunking in asdf, so a file written via Zarr would not be easy to open with asdf alone. This is something that could be added to asdf.

I think the best short-term approach would be that opening a file chunked in this manner with asdf returns a Zarr array. It would also be nice to get to the point where you could load these files with Zarr directly (via an asdf backend).
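
In code, that round trip might look something like this (entirely hypothetical; it assumes the AsdfStore sketched above, already populated from a chunked file):

```python
import zarr

store = AsdfStore()             # hypothetical ASDF-backed store from the sketch above
z = zarr.open(store, mode="r")  # real zarr call: any MutableMapping works
print(z.shape, z.chunks)        # metadata read from the '.zarray' key
```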

@perrygreenfield
Contributor

@Cadair let's talk about this some more when we can get together ;-)

@NumesSanguis

It seems that over at Zarr they're requesting features that ASDF supports: zarr-developers/zarr-python#389

@perrygreenfield
Contributor

@Cadair what you are describing appears to me to be this sort of flow (correct me if I'm wrong): the data comes from zarr and is passed to asdf for storage. That would be straightforward enough, but it seems like a very narrow use of asdf, and I wonder how it integrates with broader uses of asdf. For example, we should be able to have a chunked array as an entity in an asdf file, but not the only entity. One possible view is that zarr is an extension of existing asdf arrays, and zarr can access the corresponding asdf entities directly while ignoring all the associated asdf metadata. In this sense we are using zarr only for manipulating array data with parallel mechanisms, but use asdf for all other metadata and grouping mechanisms. Is that sufficient, or is a greater integration of asdf structures to be passed to zarr needed? (That is more difficult, I think, since it means zarr either has to treat these as opaque objects or add tools to interpret them.) The previous mechanism seems like an achievable first step without a lot of complication.

It does raise interesting questions about representations within the asdf block structure. The chunks could be exploded blocks, blocks stored within the same file, or chunks within one block (the last probably excludes compression, since it complicates how one retrieves a chunk within a block). I'll have to give this more thought.

And ultimately it means providing a means within asdf of accessing the embedded zarr array, perhaps with a dependency on zarr itself (again, the simplest approach, I'm guessing).

@Cadair
Contributor Author

Cadair commented Dec 12, 2019

Overall I see that this could work in both directions: as you say, either "the data comes from zarr and is passed to asdf for storage" or "a greater integration of asdf structures to be passed to zarr".

I agree with you that we have the power to action the first one quite easily, and within the asdf project. Once we have that working, I think we could talk to the Zarr developers about how zarr could pass much richer (non-JSON) metadata (attributes) through to asdf.

@rabernat

Hello folks! I am learning about ASDF for the first time here at SciPy. I am one of the core developers of Zarr. It would be great to have the two projects coordinating more, as it seems like our goals are broadly similar. Is anyone interested in trying to meet up at SciPy this week?

@Cadair
Contributor Author

Cadair commented Jul 11, 2022

I will be there (hopefully — my flight got cancelled), and it would be great to meet up. I know of a few others who would be interested and are around as well.

@perrygreenfield
Contributor

Very much interested. What days and times are you available? My and Nadia's schedules are pretty flexible, and probably William's too after the tutorial days.

@rabernat

Perhaps we could meet at 5:30pm on Wednesday during the evening break?

I'll try to entrain some other Zarr folks.

@perrygreenfield
Contributor

That works for the 3 of us. Any particular place to meet?

@martindurant

I'm sure I have mentioned this elsewhere, but the kerchunk project has ASDF on its shortlist of formats to process, such that it will be possible to view (groups of) ASDF datasets as zarr, even without ASDF installed. So you could load everything (including FITS, if there is no whole-file compression) using zarr, and get the lazy, cloud-friendly parallel access you would want, if only astropy worked with zarr.
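
For reference, the kerchunk/fsspec pattern looks like this: a JSON file of byte-range references is served through fsspec's "reference" filesystem and opened as a Zarr store (the refs.json filename here is assumed):

```python
import fsspec
import zarr

# refs.json: kerchunk-generated references into the original ASDF/FITS files
fs = fsspec.filesystem("reference", fo="refs.json")
z = zarr.open(fs.get_mapper(""), mode="r")  # lazy, cloud-friendly chunk access
```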

(kerchunk's first success was with HDF5, and I note the link to zarr-developers/zarr-python#535, above, which ought to now be closed for this reason)

The alternative discussed here, of using ASDF to load from zarr: well, it might well work, I don't know. Certainly zarr itself is simple enough. Instead of a zarr backend to astropy (et al.), we would have a zarr backend to ASDF, which is already (somewhat?) supported in astropy; that would still enable my point above, albeit somewhat circuitously.

@perrygreenfield
Contributor

@rabernat, @nden noticed the break is actually scheduled for 4:30 local time. Is that what you meant?

@pllim
Contributor

pllim commented Apr 17, 2023

Hi, any news on this front? Is this still planned? Thanks!

@braingram
Contributor

@pllim Thanks for asking. Yes, this is still planned.

Is there a use you have in mind? We are definitely curious to hear how this might be used.

Currently, a prototype extension can be found at https://github.com/braingram/asdf_zarr/tree/deferred_block. It allows saving some zarr arrays in an ASDF file, either as a reference to an external array or by copying the chunks into the ASDF file's binary blocks. The prototype is not yet compatible with the development version of ASDF, as the Converter in the new-style extension does not yet support using block storage. This open PR should address that issue (the zarr extension prototype is compatible with its source branch):
#1508
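
Hypothetical usage of that prototype (the actual API on the deferred_block branch may differ); the idea is that a Converter lets a zarr array sit in the ASDF tree like any other array:

```python
import asdf
import zarr

z = zarr.open("data.zarr", mode="r")  # some existing zarr array (path assumed)
af = asdf.AsdfFile({"data": z})
# chunks are either copied into ASDF binary blocks or kept as an
# external reference, depending on the extension's configuration
af.write_to("with_zarr.asdf")
```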

@pllim
Contributor

pllim commented Apr 17, 2023

Thanks for the quick response! cc @bmorris3 and @larrybradley
