Create a community structure for sharing VirtualiZarr workflows and Icechunk virtual stores / Kerchunk references #320

maxrjones opened this issue Nov 25, 2024 · 3 comments
Labels: usage example (Real world use case examples)

Comments

@maxrjones (Member)

Many Kerchunk workflows were developed as one-off Jupyter Notebooks that were shared as GitHub Gists or, at best, in Medium blog posts and conference presentations. While many of these examples were fantastic, they were often hard to find and to compare. https://github.com/ProjectPythia/kerchunk-cookbook provided a more consistent structure, but it was built after the fact by a small number of people. I think it would be valuable to promote a structure for sharing VirtualiZarr workflows earlier in the process, so that they are open, findable, and ideally consistently structured. I also think there's a lot to learn from STAC about this type of community organization, and I would like to propose mirroring the stactools-packages structure. In this model, we would:

I think it would be great if we had a way for people to easily clone their virtual data stores to a publicly accessible location (to my knowledge this isn't in place for STAC). IIRC @norlandrhagen suggested source.coop as a potential hub for sharing the actual virtual stores.
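
As a rough illustration of the kind of sharing meant here (the bucket name and file paths below are hypothetical, not an existing source.coop layout), a Kerchunk reference file could be published to a public bucket and re-opened by anyone like this:

```python
# Hedged sketch: push an existing Kerchunk reference file to a hypothetical public
# bucket so others can open the dataset without re-running the workflow.
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3")  # credentials resolved from the environment
fs.put("combined_refs.json", "s3://example-public-bucket/my-dataset/combined_refs.json")

# Anyone can then open the virtual dataset via Kerchunk's reference filesystem:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://example-public-bucket/my-dataset/combined_refs.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```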

@maxrjones added the "usage example" label on Nov 25, 2024
@norlandrhagen (Collaborator)

@maxrjones I just stumbled across https://github.com/stac-utils/xstac/tree/main. They have examples using Kerchunk references to create STAC assets.

@TomNicholas (Member) commented Jan 10, 2025

I love the forward-thinking-ness here, and I've also been mulling over what the world of findable virtual zarr stores could look like.

However, whilst I agree there is a lot to learn from STAC, I think we need to go quite a lot further than they have. In order to make all archival multidimensional scientific data actually "FAIR", we're going to need many layers:

  1. The location of the original data (hopefully in object storage, but maybe still behind an HTTP server),
  2. The virtualizarr workflow code which generated virtual references and dumped them into icechunk (see the sketch just after this list),
  3. One or more icechunk stores (i.e. manifest files in object storage somewhere, not necessarily in the same bucket as the data),
  4. A catalog entry which contains additional information about the contents of the icechunk store (e.g. where the code used to generate it is), conforming to some catalog schema,
  5. A searchable global index of catalog entries,
  6. A website (landing page) which displays the catalog entries.
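
To make (2) concrete, a minimal workflow sketch might look like the following. The input URLs and the concat dimension are placeholders, and the exact Icechunk store/session calls vary by version, so only the VirtualiZarr side is shown concretely:

```python
# Minimal sketch of layer (2): generate virtual references for existing NetCDF files.
# The input URLs are hypothetical; indexes={} avoids loading coordinate data eagerly.
import xarray as xr
from virtualizarr import open_virtual_dataset

urls = [f"s3://example-bucket/data/file_{i}.nc" for i in range(3)]
vds_list = [open_virtual_dataset(url, indexes={}) for url in urls]

# Concatenate the virtual datasets along a shared dimension
combined = xr.combine_nested(
    vds_list, concat_dim="time", coords="minimal", compat="override"
)

# Persist the references, e.g. as a Kerchunk JSON file...
combined.virtualize.to_kerchunk("combined_refs.json", format="json")

# ...or write them into an Icechunk store (call sketched from the docs; the exact
# store/session setup depends on the icechunk version in use):
# combined.virtualize.to_icechunk(icechunk_store)
```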

Of these layers, only (2) is actually executable code, which is why I don't think the solution to this will be to create lots and lots of small GitHub repos. In theory all 6 could live in different places, and even be managed by different organisations!

Currently (5) and (6) do not exist, at least not for Zarr specifically. (4) barely exists - Arraylake's catalog is arguably a version of this. There are lots of existing prototypes to draw from (including from the STAC ecosystem), but none of them are as general as Zarr's data model. And this is before we even consider the idea of having non-Zarr versioned data (e.g. Iceberg) living alongside Zarr data...
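
Purely as an illustration of what a catalog entry in layer (4) might record, and of how it would tie the other layers together, here is a sketch in which every field name is invented and does not correspond to any existing schema:

```python
# Hypothetical catalog entry linking the layers; all fields are made up for illustration.
catalog_entry = {
    "id": "example-reanalysis-virtual",
    "description": "Virtual Zarr view over an existing NetCDF archive",
    "source_data": "s3://example-archive-bucket/netcdf/",          # layer (1)
    "generation_code": "https://github.com/example-org/example-virtualizarr-workflow",  # layer (2)
    "icechunk_store": "s3://example-refs-bucket/icechunk/reanalysis/",  # layer (3)
    "license": "CC-BY-4.0",
    "keywords": ["reanalysis", "virtual", "zarr"],
}
```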


One model of a solution here is that (4), (5), and (6) are all built and managed by one organization. That's GitHub's model - they have catalog entries (repos), search, and a master catalog website (github.com). (They also actually hold the equivalent of (1-3) in their systems, but we don't really mind because every user of GitHub automatically has a local backup of the whole history of all their code, in an open format.)

Another model is that catalog entries are created and hosted by independent organisations, following some common schema / using common tooling. This is more like the MediaWiki model, of which Wikipedia is just one instance. The downside is that although anyone can create their own catalog (wiki), there is no built-in global search across wikis, so (5) and (6) end up being managed by a separate centralized entity anyway (Google).

@rabernat (Collaborator)

Later this spring, Earthmover will be launching a free tier which will provide a catalog service for public Icechunk datasets. (Similar to GitHub's free tier for public git repos.) We think this will really help the community share and discover great datasets.
