Create a community structure for sharing VirtualiZarr workflows and Icechunk virtual stores / Kerchunk references #320
@maxrjones I just stumbled across https://github.com/stac-utils/xstac/tree/main. They have examples using Kerchunk references to create STAC assets.
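For anyone following along who hasn't looked inside a Kerchunk reference before: it is essentially a JSON mapping from Zarr chunk keys to `[url, byte_offset, byte_length]` triples pointing into the original, unmodified archival files. A minimal illustrative sketch (the bucket paths, variable name, and byte ranges below are made up for the example):

```python
import json

# A minimal Kerchunk "version 1" reference set. All S3 paths and byte
# ranges here are hypothetical, purely to show the shape of the format.
refs = {
    "version": 1,
    "refs": {
        # Zarr metadata is stored inline as JSON strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps({
            "shape": [2, 4], "chunks": [1, 4], "dtype": "<f4",
            "compressor": None, "filters": None,
            "fill_value": None, "order": "C", "zarr_format": 2,
        }),
        # Each chunk key maps to [url, byte_offset, byte_length] into
        # the original archival file -- no data is copied or rewritten.
        "temp/0.0": ["s3://example-bucket/file1.nc", 8192, 16],
        "temp/1.0": ["s3://example-bucket/file2.nc", 8192, 16],
    },
}

# The whole reference set serializes to plain JSON, which is what makes
# it cheap to host as a STAC asset or share alongside a catalog entry.
serialized = json.dumps(refs, indent=2)
print(len(serialized), "bytes of references describe the virtual store")
```

Because the references are just JSON, a catalog entry only needs to link to this small file; readers then fetch chunk bytes directly from the archival source.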
I love the forward-thinking-ness here, and I've also been mulling over what the world of findable virtual zarr stores could look like. However, whilst I agree there is a lot to learn from STAC, I think we need to go quite a lot further than they have. In order to make all archival multidimensional scientific data actually "FAIR", we're going to need many layers:
Of these layers, only (2) is actually executable code, which is why I don't think the solution to this will be to create lots and lots of small github repos. In theory all 6 could live in different places, and even be managed by different organisations! Currently (5) and (6) do not yet exist, at least not for Zarr specifically. (4) barely exists - arraylake's catalog is arguably a version of this. There are lots of existing prototypes to draw from (including from the STAC ecosystem), but none of them are as general as Zarr's data model. And this is before we even consider the idea of having non-zarr versioned data (e.g. iceberg) living alongside Zarr data...

One model of a solution here is that (4), (5), and (6) are all built and managed by one organization. That's GitHub's model - they have catalog entries (repos), search, and a master catalog website (github.com). (They also actually hold the equivalent of (1-3) in their systems too, we just don't really mind because every user of github automatically has a local backup of the whole history of all their code, in an open format.)

Another model is that catalog entries are created and hosted by independent organisations, following some common schema / using common tooling. This is more like the MediaWiki model, of which Wikipedia is just one instance. The downside of that is that although anyone can create their own catalog (wiki), there is no built-in global search across wikis, and so (5) & (6) end up being managed by a separate centralized entity anyway (Google).
Later this spring, Earthmover will be launching a free tier which will provide a catalog service for public Icechunk datasets. (Similar to GitHub's free tier for public git repos.) We think this will really help the community share and discover great datasets.
Many Kerchunk workflows were developed as one-off Jupyter Notebooks that were shared as GitHub Gists or, at most, Medium blog posts/conference presentations. While all these examples were fantastic, it was often difficult to find examples and understand their differences. https://github.com/ProjectPythia/kerchunk-cookbook provided a more consistent structure, but was built after the fact by a small number of people. I think it would be valuable to promote a structure for sharing VirtualiZarr workflows earlier in the process, so that they are open, findable, and ideally consistently structured. I also think there's a lot to learn from STAC in this type of community organization and would like to propose mirroring the stactools-packages structure. In this model, we would:
I think it would be great if we had a way for people to easily clone their virtual data stores to a publicly accessible location (to my knowledge this isn't in place for STAC). IIRC @norlandrhagen suggested source.coop as a potential hub for sharing the actual virtual stores.