Create a community structure for sharing VirtualiZarr workflows and Icechunk virtual stores / Kerchunk references #320

maxrjones opened this issue Nov 25, 2024 · 3 comments
Labels: usage example (Real world use case examples)

Comments

@maxrjones (Member)

Many Kerchunk workflows were developed as one-off Jupyter Notebooks that were shared as GitHub Gists or, at best, in Medium blog posts and conference presentations. While many of these examples were fantastic, they were often hard to find and to compare. https://github.com/ProjectPythia/kerchunk-cookbook provided a more consistent structure, but it was built after the fact by a small number of people. I think it would be valuable to promote a structure for sharing VirtualiZarr workflows earlier in the process, so that they are open, findable, and ideally consistently structured. I also think there's a lot to learn from STAC about this type of community organization, and I would like to propose mirroring the stactools-packages structure. In this model, we would:

I think it would be great if we had a way for people to easily clone their virtual data stores to a publicly accessible location (to my knowledge this isn't in place for STAC). IIRC @norlandrhagen suggested source.coop as a potential hub for sharing the actual virtual stores.
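
As a rough illustration of the kind of sharing meant here (the bucket name and file paths below are hypothetical, not an existing source.coop layout), a Kerchunk reference file could be published to a public bucket and re-opened by anyone like this:

```python
# Hedged sketch: push an existing Kerchunk reference file to a hypothetical public
# bucket so others can open the dataset without re-running the workflow.
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3")  # credentials resolved from the environment
fs.put("combined_refs.json", "s3://example-public-bucket/my-dataset/combined_refs.json")

# Anyone can then open the virtual dataset via Kerchunk's reference filesystem:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://example-public-bucket/my-dataset/combined_refs.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```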

@maxrjones added the "usage example" label on Nov 25, 2024
@norlandrhagen (Collaborator)

@maxrjones I just stumbled across https://github.com/stac-utils/xstac/tree/main. They have examples using Kerchunk references to create STAC assets.

@TomNicholas (Member) commented Jan 10, 2025

I love the forward-thinking-ness here, and I've also been mulling over what the world of findable virtual zarr stores could look like.

However, whilst I agree there is a lot to learn from STAC, I think we need to go quite a lot further than they have. In order to make all archival multidimensional scientific data actually "FAIR", we're going to need many layers:

  1. The location of the original data (hopefully in object storage, but maybe still behind an HTTP server),
  2. The virtualizarr workflow code which generated virtual references and dumped them into icechunk (see the sketch just after this list),
  3. One or more icechunk stores (i.e. manifest files in object storage somewhere, not necessarily in the same bucket as the data),
  4. A catalog entry which contains additional information about the contents of the icechunk store (e.g. where the code used to generate it is), conforming to some catalog schema,
  5. A searchable global index of catalog entries,
  6. A website (landing page) which displays the catalog entries.
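
To make (2) concrete, a minimal workflow sketch might look like the following. The input URLs and the concat dimension are placeholders, and the exact Icechunk store/session calls vary by version, so only the VirtualiZarr side is shown concretely:

```python
# Minimal sketch of layer (2): generate virtual references for existing NetCDF files.
# The input URLs are hypothetical; indexes={} avoids loading coordinate data eagerly.
import xarray as xr
from virtualizarr import open_virtual_dataset

urls = [f"s3://example-bucket/data/file_{i}.nc" for i in range(3)]
vds_list = [open_virtual_dataset(url, indexes={}) for url in urls]

# Concatenate the virtual datasets along a shared dimension
combined = xr.combine_nested(
    vds_list, concat_dim="time", coords="minimal", compat="override"
)

# Persist the references, e.g. as a Kerchunk JSON file...
combined.virtualize.to_kerchunk("combined_refs.json", format="json")

# ...or write them into an Icechunk store (call sketched from the docs; the exact
# store/session setup depends on the icechunk version in use):
# combined.virtualize.to_icechunk(icechunk_store)
```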

Of these layers, only (2) is actually executable code, which is why I don't think the solution to this will be to create lots and lots of small GitHub repos. In theory all 6 could live in different places, and even be managed by different organisations!

Currently (5) and (6) do not exist, at least not for Zarr specifically. (4) barely exists - Arraylake's catalog is arguably a version of this. There are lots of existing prototypes to draw from (including from the STAC ecosystem), but none of them are as general as Zarr's data model. And this is before we even consider the idea of having non-Zarr versioned data (e.g. Iceberg) living alongside Zarr data...
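
Purely as an illustration of what a catalog entry in layer (4) might record, and of how it would tie the other layers together, here is a sketch in which every field name is invented and does not correspond to any existing schema:

```python
# Hypothetical catalog entry linking the layers; all fields are made up for illustration.
catalog_entry = {
    "id": "example-reanalysis-virtual",
    "description": "Virtual Zarr view over an existing NetCDF archive",
    "source_data": "s3://example-archive-bucket/netcdf/",          # layer (1)
    "generation_code": "https://github.com/example-org/example-virtualizarr-workflow",  # layer (2)
    "icechunk_store": "s3://example-refs-bucket/icechunk/reanalysis/",  # layer (3)
    "license": "CC-BY-4.0",
    "keywords": ["reanalysis", "virtual", "zarr"],
}
```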


One model of a solution here is that (4), (5), and (6) are all built and managed by one organization. That's GitHub's model - they have catalog entries (repos), search, and a master catalog website (github.com). (They also actually hold the equivalent of (1-3) in their systems, but we don't really mind because every user of GitHub automatically has a local backup of the whole history of all their code, in an open format.)

Another model is that catalog entries are created and hosted by independent organisations, following some common schema / using common tooling. This is more like the MediaWiki model, of which Wikipedia is just one instance. The downside is that although anyone can create their own catalog (wiki), there is no built-in global search across wikis, so (5) and (6) end up being managed by a separate centralized entity anyway (Google).

@rabernat (Collaborator)

Later this spring, Earthmover will be launching a free tier which will provide a catalog service for public Icechunk datasets. (Similar to GitHub's free tier for public git repos.) We think this will really help the community share and discover great datasets.
