Integration with Zarr and chunked arrays #718
@Cadair let us talk about this some more when we can get together ;-)
Seems like over at Zarr they're requesting features that ASDF supports: zarr-developers/zarr-python#389
@Cadair what you are describing appears to me to be this sort of flow (correct me if I'm wrong): the data comes from zarr and is passed to asdf for storage. That would be straightforward enough, but it seems like a very narrow use of asdf and I wonder how it integrates with broader uses of asdf. For example, we should be able to have a chunked array as an entity in an asdf file, but not the only entity. One possible view is that zarr is an extension of existing asdf arrays, and zarr can access the corresponding asdf entities directly while ignoring all the associated asdf metadata. In that sense we would be using zarr only to manipulate array data with parallel mechanisms, while using asdf for all other metadata and grouping mechanisms. Is that sufficient, or is a greater integration needed where asdf structures are passed to zarr? (That is more difficult, I think, since it means zarr either has to treat these as opaque objects or add tools to interpret them.) The former mechanism seems like an achievable first step without a lot of complication.

It does raise interesting questions about representations within the asdf block structure. The chunks could be exploded blocks, blocks stored within the same file, or chunks within one block (the last probably excludes compression, since it complicates how one retrieves a chunk within a block). I'll have to give this more thought. Ultimately it also means providing a means within asdf of accessing the embedded zarr array, perhaps with a dependency on zarr itself (again, the simplest approach, I'm guessing).
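For reference, a minimal sketch of the baseline described above, where an array is just one entity in the tree alongside other metadata. It uses only standard asdf and numpy APIs, and the tree keys (`meta`, `data`, etc.) are made up; a zarr-backed chunked array would need to slot into the tree in the same way:

```python
import asdf
import numpy as np

tree = {
    # hypothetical metadata living alongside the array
    "meta": {"telescope": "example", "exposure_time": 30.0},
    # a plain array, stored by asdf as a binary block in the same file
    "data": np.arange(100_000, dtype="f8").reshape(100, 1000),
}

asdf.AsdfFile(tree).write_to("example.asdf")

with asdf.open("example.asdf") as af:
    print(af.tree["meta"]["telescope"], af.tree["data"].shape)
```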
Overall I see that this could work in both directions, as you say: either "The data comes from zarr and is passed to asdf for storage." or "a greater integration of asdf structures to be passed to zarr". I agree with you that we can act on the first one quite easily, and within the asdf project. I think maybe once we have that working we could talk to the Zarr developers about how we could have zarr pass much richer (non-json) metadata (attributes) through to asdf.
Hello folks! I am learning about ASDF for the first time here at SciPy. I am one of the core developers of Zarr. It would be great to have the two projects coordinating more, as it seems like our goals are broadly similar. Is anyone interested in trying to meet up at SciPy this week?
I will be there (hopefully; my flight got cancelled). It would be great to meet up, and I know of a few others who would be interested and are around as well.
Very much interested. What days and times are you available? My and Nadia's schedules are pretty flexible, probably also for William, after the tutorial days.
Perhaps we could meet at 5:30pm on Wednesday during the evening break? I'll try to entrain some other Zarr folks.
That works for the 3 of us. Any particular place to meet?
I'm sure I have mentioned this elsewhere, but the kerchunk project has ASDF on its shortlist of formats to process, such that it will be possible to view (groups of) ASDF datasets as zarr, even without ASDF installed. So you could load everything (including FITS, if there is no whole-file compression) using zarr, and get the lazy, cloud-friendly parallel access you would want, if only astropy worked with zarr. (kerchunk's first success was with HDF5, and I note the link to zarr-developers/zarr-python#535 above, which ought now to be closed for this reason.)

The alternative discussed here, of using ASDF to load from zarr, might well work; I don't know. Certainly zarr itself is simple enough. Instead of a zarr backend to astropy (et al.), we would have a zarr backend to ASDF, which is already (somewhat?) supported in astropy; that would still enable my point above, albeit somewhat circuitously.
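For context, the usual kerchunk consumption pattern looks roughly like the sketch below, assuming a reference file (here a hypothetical `refs.json`) has already been generated for some chunked file. Since an ASDF scanner is only on kerchunk's shortlist, the generation step is omitted and only the generic read path is shown:

```python
import fsspec
import zarr

# fsspec's "reference" filesystem maps zarr keys onto byte ranges of the
# original file described by the (hypothetical, pre-built) refs.json.
mapper = fsspec.get_mapper(
    "reference://",
    fo="refs.json",
    remote_protocol="https",  # wherever the underlying bytes actually live
)

data = zarr.open(mapper, mode="r")  # lazy, chunk-wise access to the original file
print(dict(data.attrs))
```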
Hi, any news on this front? Is this still planned? Thanks!
@pllim Thanks for asking. Yes, this is still planned. Is there a use you have in mind? We are definitely curious to hear how this might be used. Currently, a prototype extension can be found here: https://github.com/braingram/asdf_zarr/tree/deferred_block that allows saving some zarr arrays in an ASDF file (either as a reference to an external array or by copying the chunks into the ASDF file binary blocks). This prototype is not yet compatible with the development version of ASDF as the
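For anyone curious what this could look like in practice, here is a conceptual sketch of the two storage modes mentioned (a reference to an external array vs. copying the data into the file), using only public zarr and asdf calls. This is not the API of the linked asdf_zarr prototype, and key names like `zarr_source` are made up:

```python
import asdf
import numpy as np
import zarr

# A chunked zarr array written to a DirectoryStore on disk.
z = zarr.open("chunks.zarr", mode="w", shape=(4000, 4000),
              chunks=(1000, 1000), dtype="f4")
z[:] = np.random.random((4000, 4000)).astype("f4")

tree = {
    # Mode 1: reference the external zarr array; only its location is recorded.
    "external": {"zarr_source": "chunks.zarr"},   # "zarr_source" is a made-up key
    # Mode 2: copy the data into the ASDF file's own binary blocks
    # (materialised in memory here; the real extension works chunk by chunk).
    "internal": {"data": z[:]},
}

asdf.AsdfFile(tree).write_to("zarr_in_asdf.asdf")
```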
Thanks for the quick response! cc @bmorris3 and @larrybradley
"Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays." It is well suited to parallel computation in Dask arrays, as well as being friendly to workflows where data are stored in the cloud in object stores etc.
The Zarr file format has a specification and supports multiple storage backends; examples are `DictStore`, `DirectoryStore`, `LMDBStore` and `SqliteStore`. The storage backend API for the Python zarr package accepts anything that implements the `MutableMapping` interface, i.e. Zarr passes a backend store a string key and some bytes to store under that key. Each chunk of the array is passed as a bytes blob to the backend, with a key identifying the chunk and the bytes for that chunk. Three extra keys are defined in the Zarr specification which hold the metadata for each array, each group, and any user metadata (`.zarray`, `.zgroup` and `.zattrs`).
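To make that store contract concrete, here is a minimal in-memory store (essentially what `DictStore` does), assuming zarr's v2-style store API; an asdf-backed store would implement the same interface but route the chunk keys into asdf blocks:

```python
from collections.abc import MutableMapping

import numpy as np
import zarr


class TinyStore(MutableMapping):
    """An in-memory zarr store: string keys mapped to raw bytes."""

    def __init__(self):
        self._d = {}

    def __getitem__(self, key):         # key is e.g. ".zarray" or "0" (a chunk)
        return self._d[key]

    def __setitem__(self, key, value):  # value is raw bytes (metadata or a chunk)
        self._d[key] = value

    def __delitem__(self, key):
        del self._d[key]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)


store = TinyStore()
z = zarr.open(store, mode="w", shape=(100,), chunks=(10,), dtype="i4")
z[:] = np.arange(100)
print(sorted(store))  # '.zarray' plus chunk keys '0' ... '9'
```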
What I am proposing is that, with a little work, asdf could be a valid backing store for Zarr. This would enable workflows where astronomy-specific metadata can be easily encoded in files which can also be read via the Zarr API, and which are therefore well suited to parallel workflows, specifically Dask arrays.
I would envisage that the Zarr layout in asdf would generally be mappable to a single yaml tree. When asdf gets passed the `.zarray` top-level metadata object (a JSON blob as bytes), we could decode it and save it into the tree. This would provide information such as dtype, top-level array size and chunk size. When asdf gets passed the bytes for an individual array chunk, we would save it to an exploded asdf file, constructing the block header as required for that to be a valid asdf file (the information required for that should be in the top-level `.zarray` key).
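As a small illustration of that decode step (assuming zarr's v2 on-disk format), the `.zarray` value arrives as JSON bytes and carries exactly the shape/chunk/dtype information that would be recorded in the yaml tree:

```python
import json

import zarr

# Create a small chunked array on disk so zarr writes its ".zarray" metadata.
z = zarr.open("meta_demo.zarr", mode="w", shape=(4000, 4000),
              chunks=(1000, 1000), dtype="f4")

zarray_bytes = z.store[".zarray"]        # the raw JSON bytes a backend receives
zarray_meta = json.loads(zarray_bytes)   # decoded dict, ready for the yaml tree

print(zarray_meta["shape"], zarray_meta["chunks"], zarray_meta["dtype"])
```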
The issues I foresee with this are as follows: the Zarr metadata keys `.zarray`, `.zgroup` and `.zattrs` are JSON passed as bytes, which limits what we can return from `.zattrs`, so the interesting metadata would have to be exposed through a separate non-zarr API. (We could suggest this as an improvement to Zarr: supporting non-bytes metadata for backends that can handle it.)

I think the best short-term approach would probably be that if you opened a file chunked in this manner with asdf, it could return a Zarr array. It would also be nice to get to the point where you could load these files with Zarr directly (via an asdf backend).
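A purely hypothetical usage sketch of that short-term approach (none of these key names or behaviours are real API yet):

```python
import asdf

# Hypothetical: "chunked.asdf" holds a zarr-style chunked array under "data".
with asdf.open("chunked.asdf") as af:
    arr = af.tree["data"]        # would come back as a zarr.Array, not numpy
    block = arr[0:1000, 0:1000]  # only the chunks covering this slice are read
```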