Integration with Zarr and chunked arrays #718

Closed
Cadair opened this issue Dec 11, 2019 · 15 comments

Comments

@Cadair
Contributor

Cadair commented Dec 11, 2019

"Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays." It is well suited to parallel computation in Dask arrays, as well as being friendly to workflows where data are stored in the cloud in object stores etc.

The Zarr file format has a specification and supports multiple storage backends, for example DictStore, DirectoryStore, LMDBStore, and SQLiteStore. The storage backend API for the Python zarr package accepts anything that implements the MutableMapping interface; i.e., Zarr passes a backend store a string key and some bytes to store under that key.
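
For illustration, here is a minimal sketch of the store interface Zarr (v2) expects. The AsdfStore name and its dict-backed internals are hypothetical stand-ins, not a real asdf API:

```python
from collections.abc import MutableMapping


class AsdfStore(MutableMapping):
    """Hypothetical store that would persist Zarr keys inside an ASDF file."""

    def __init__(self):
        self._data = {}  # stand-in for real ASDF-backed storage

    def __getitem__(self, key):
        # Zarr reads chunk and metadata bytes through here
        return self._data[key]

    def __setitem__(self, key, value):
        # Zarr hands the store a string key and the bytes to persist under it
        self._data[key] = bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```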

The way this works is that each chunk of the array is passed as a bytes blob to the backend, with a key identifying the chunk. Three extra keys are defined in the Zarr specification for metadata: one for each group (.zgroup), one for each array (.zarray), and one for user metadata (.zattrs).
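
This key layout is easy to inspect: a plain dict satisfies the MutableMapping contract, so we can see exactly what a backend receives (zarr v2 API assumed):

```python
import numpy as np
import zarr

store = {}  # any MutableMapping works as a Zarr store
z = zarr.create(shape=(4, 4), chunks=(2, 2), dtype="f8", store=store)
z[:] = np.arange(16).reshape(4, 4)

print(sorted(store))
# ['.zarray', '0.0', '0.1', '1.0', '1.1']
# '.zarray' is JSON metadata; the 'i.j' keys hold the (compressed) chunk bytes.
```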

What I am proposing is that, with a little work, asdf could be a valid backing store for Zarr. This would enable workflows where astronomy-specific metadata can be easily encoded in files which can be read via the Zarr API, and which are therefore well suited to parallel workflows, specifically Dask arrays.

I would envisage that the Zarr layout in asdf would generally map to a single YAML tree. When asdf is passed the .zarray top-level metadata object (a JSON blob as bytes), we could decode it and save it into the tree. This would provide information such as dtype, top-level array size, and chunk size.
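
A sketch of that decoding step, continuing the dict-store example above (field names follow the Zarr v2 spec):

```python
import json

zarray_meta = json.loads(store[".zarray"])  # bytes -> dict
print(zarray_meta["shape"], zarray_meta["chunks"], zarray_meta["dtype"])
# [4, 4] [2, 2] <f8
# Under this proposal the decoded dict would be saved into the ASDF YAML tree.
```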

When asdf is passed the bytes for an individual array chunk, we would save them to an exploded asdf file, constructing the block header as required for it to be a valid asdf file (the data required for that should be in the top-level .zarray key).
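
One rough way to prototype this with asdf's existing public API: wrap the raw chunk bytes in a uint8 array and write it with all_array_storage="external" (a real write_to option). The key-to-filename mapping here is made up for illustration:

```python
import asdf
import numpy as np


def write_chunk(key, chunk_bytes):
    """Persist one Zarr chunk (e.g. key '0.1') as an exploded ASDF file."""
    arr = np.frombuffer(chunk_bytes, dtype=np.uint8)
    af = asdf.AsdfFile({"chunk": arr})
    # "external" stores the array data in a separate block file next to
    # the small YAML file, i.e. an exploded ASDF layout
    af.write_to(f"chunk_{key}.asdf", all_array_storage="external")
```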

The issues I foresee with this are as follows:

  • We will need to return bytes back to Zarr for .zarray, .zgroup, and .zattrs, which limits what we can return from .zattrs; the interesting metadata would have to be exposed via a separate non-Zarr API. (We could suggest this to Zarr as an improvement: non-bytes metadata for backends that support it.)
  • There is no existing way of dealing with chunking in asdf, so a file written via Zarr would not be easy to open with asdf alone. This is something that could be added to asdf.

I think the best short-term approach would be that opening a file chunked in this manner with asdf returns a Zarr array. It would also be nice to get to the point where you could load these files with Zarr directly (via an asdf backend).
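
In code, that round trip might look something like this (entirely hypothetical; it assumes the AsdfStore sketched above, already populated from a chunked file):

```python
import zarr

store = AsdfStore()             # hypothetical ASDF-backed store from the sketch above
z = zarr.open(store, mode="r")  # real zarr call: any MutableMapping works
print(z.shape, z.chunks)        # metadata read from the '.zarray' key
```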

@perrygreenfield
Contributor

@Cadair let's talk about this some more when we can get together ;-)

@NumesSanguis

It seems that over at Zarr they're requesting features that ASDF supports: zarr-developers/zarr-python#389

@perrygreenfield
Contributor

@Cadair what you are describing appears to me to be this sort of flow (correct me if I'm wrong): the data comes from zarr and is passed to asdf for storage. That would be straightforward enough, but it seems like a very narrow use of asdf, and I wonder how it integrates with broader uses of asdf. For example, we should be able to have a chunked array as an entity in an asdf file, but not the only entity. One possible view is that zarr is an extension of existing asdf arrays, and zarr can access the corresponding asdf entities directly while ignoring all the associated asdf metadata. In this sense we are using zarr only for manipulating array data with parallel mechanisms, but use asdf for all other metadata and grouping mechanisms. Is that sufficient, or is a greater integration of asdf structures to be passed to zarr needed? (That is more difficult, I think, since it means zarr either has to treat these as opaque objects or add tools to interpret them.) The previous mechanism seems like an achievable first step without a lot of complication.

It does raise interesting questions about representations within the asdf block structure. The chunks could be exploded blocks, blocks stored within the same file, or chunks within one block (the last probably excludes compression, since it complicates how one retrieves a chunk within a block). I'll have to give this more thought.

And ultimately it means providing a means within asdf of accessing the embedded zarr array, perhaps with a dependency on zarr itself (again, the simplest approach, I'm guessing).

@Cadair
Contributor Author

Cadair commented Dec 12, 2019

Overall I see that this could work in both directions: as you say, either "the data comes from zarr and is passed to asdf for storage" or "a greater integration of asdf structures to be passed to zarr".

I agree with you that we have the power to action the first one quite easily, and within the asdf project. Once we have that working, I think we could talk to the Zarr developers about how zarr could pass much richer (non-JSON) metadata (attributes) through to asdf.

@rabernat

Hello folks! I am learning about ASDF for the first time here at SciPy. I am one of the core developers of Zarr. It would be great to have the two projects coordinating more, as it seems like our goals are broadly similar. Is anyone interested in trying to meet up at SciPy this week?

@Cadair
Contributor Author

Cadair commented Jul 11, 2022

I will be there (hopefully — my flight got cancelled), and it would be great to meet up. I know of a few others who would be interested and are around as well.

@perrygreenfield
Contributor

Very much interested. What days and times are you available? My and Nadia's schedules are pretty flexible, and probably William's too after the tutorial days.

@rabernat

Perhaps we could meet at 5:30pm on Wednesday during the evening break?

I'll try to entrain some other Zarr folks.

@perrygreenfield
Contributor

That works for the 3 of us. Any particular place to meet?

@martindurant

I'm sure I have mentioned this elsewhere, but the kerchunk project has ASDF on its shortlist of formats to process, such that it will be possible to view (groups of) ASDF datasets as zarr, even without ASDF installed. So you could load everything (including FITS, if there is no whole-file compression) using zarr, and get the lazy, cloud-friendly parallel access you would want, if only astropy worked with zarr.
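
For reference, the kerchunk/fsspec pattern looks like this: a JSON file of byte-range references is served through fsspec's "reference" filesystem and opened as a Zarr store (the refs.json filename here is assumed):

```python
import fsspec
import zarr

# refs.json: kerchunk-generated references into the original ASDF/FITS files
fs = fsspec.filesystem("reference", fo="refs.json")
z = zarr.open(fs.get_mapper(""), mode="r")  # lazy, cloud-friendly chunk access
```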

(kerchunk's first success was with HDF5, and I note the link to zarr-developers/zarr-python#535, above, which ought to now be closed for this reason)

The alternative discussed here, of using ASDF to load from zarr: well, it might well work, I don't know. Certainly zarr itself is simple enough. Instead of a zarr backend to astropy (et al.), we would have a zarr backend to ASDF, which is already (somewhat?) supported in astropy; that would still enable my point above, albeit somewhat circuitously.

@perrygreenfield
Contributor

@rabernat, @nden noticed the break is actually scheduled for 4:30 local time. Is that what you meant?

@pllim
Contributor

pllim commented Apr 17, 2023

Hi, any news on this front? Is this still planned? Thanks!

@braingram
Contributor

@pllim Thanks for asking. Yes, this is still planned.

Is there a use you have in mind? We are definitely curious to hear how this might be used.

Currently, a prototype extension can be found at https://github.com/braingram/asdf_zarr/tree/deferred_block. It allows saving some zarr arrays in an ASDF file, either as a reference to an external array or by copying the chunks into the ASDF file's binary blocks. The prototype is not yet compatible with the development version of ASDF, as the Converter in the new-style extension does not yet support using block storage. This open PR should address that issue (the zarr extension prototype is compatible with its source branch):
#1508
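
Hypothetical usage of that prototype (the actual API on the deferred_block branch may differ); the idea is that a Converter lets a zarr array sit in the ASDF tree like any other array:

```python
import asdf
import zarr

z = zarr.open("data.zarr", mode="r")  # some existing zarr array (path assumed)
af = asdf.AsdfFile({"data": z})
# chunks are either copied into ASDF binary blocks or kept as an
# external reference, depending on the extension's configuration
af.write_to("with_zarr.asdf")
```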

@pllim
Contributor

pllim commented Apr 17, 2023

Thanks for the quick response! cc @bmorris3 and @larrybradley
