Which tool to load and share metadata? #13

Open
AdrienWehrle opened this issue May 22, 2022 · 3 comments

@AdrienWehrle
Member

AdrienWehrle commented May 22, 2022

We initially explored datalad, but other options look promising too:

datalad

Very powerful because it is built directly on git-annex, but I still haven't fully understood how to use it properly and efficiently.
Datalad is a data management system and, to my knowledge, only that. Concentrating on this one task makes it very efficient, but it also limits what we can do with it, or calls for combining it with other tools. Which might be just fine.

intake

A simple but powerful set of tools. Because it is simple, the community could easily contribute new catalog entries (through YAML files).

  • Allows for local file caching

  • Dask capabilities for big data

  • Cloud access support

  • Possibility for a simple GUI

  • Storing catalog metadata in files makes the structure of our portal easy to understand and efficient.

  • The YAML format makes community contribution easier, even for non-coders (JSON, and XML even more so, can be intimidating for people not used to coding at all).

Intake is more than just a data management tool. It streamlines not only the data download step but also reading the data, through the many drivers available (and new ones are easy to implement).
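To make the YAML catalog point concrete, here is a minimal sketch of what a contributed entry might look like and how it could be opened. The entry name `example_stations`, the URL, and the metadata keys are hypothetical placeholders. The layout (`sources`, `driver`, `args`) assumes intake's classic YAML catalog format, which may differ between intake versions, and intake is imported only inside the function that needs it.

```python
from pathlib import Path

# Hypothetical catalog entry: the name, URL and metadata keys are placeholders.
CATALOG_YAML = """\
sources:
  example_stations:
    description: Example station table (placeholder entry)
    driver: csv
    args:
      urlpath: "https://example.org/data/stations.csv"
    metadata:
      region: placeholder
      license: placeholder
"""


def write_catalog(path):
    """Write the example catalog to disk and return its path."""
    path = Path(path)
    path.write_text(CATALOG_YAML, encoding="utf-8")
    return path


def list_sources(path):
    """Open the catalog with intake and return the names of its entries.

    Requires intake to be installed; assumes the classic YAML catalog format.
    """
    import intake  # imported lazily so the rest of the sketch runs without it

    catalog = intake.open_catalog(str(path))
    return list(catalog)
```

A contributor would only need to edit the YAML text; the Python side stays the same for every new entry.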

pooch

Simple and similar to intake, except that data sources are not really organized as catalogs. It was developed to download test data for libraries, so we might run into limitations for our metadata portal.
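For comparison, the core idea behind a pooch-style fetch (download once, keep a local copy, verify it against a known checksum) can be sketched with the standard library alone. This is an illustration of the technique, not pooch's implementation; the function names `sha256_of` and `fetch` are made up for this sketch.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, known_hash, cache_dir):
    """Return a local path for `url`, downloading only when the cached
    copy is missing or does not match `known_hash`."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after the last segment of the URL.
    target = cache_dir / url.rstrip("/").split("/")[-1]
    if not target.exists() or sha256_of(target) != known_hash:
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        if sha256_of(target) != known_hash:
            target.unlink()  # do not keep a corrupted or unexpected file
            raise ValueError(f"checksum mismatch for {url}")
    return target
```

pooch itself exposes this pattern through `pooch.retrieve(url, known_hash=...)`, so in practice the sketch above would be replaced by a single library call.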

This comparison will be refined further.

@AdrienWehrle
Member Author

AdrienWehrle commented May 22, 2022

I see a couple of points that are important to consider when choosing a tool:

Is the simplicity of our backend important? For us, for the users?

A simple backend tool

  • is easier for the dev team to set up and maintain
  • makes it easier for the community to contribute (new catalogs in the case of intake)

Datalad is very powerful, but 99% of the users would probably not understand how the metadata portal actually works. Use of the CLI (as opposed to a GUI) might therefore be limited to a small fraction of the community.

Intake is powerful but also simple, so it is easier for users to grasp. Because intake integrates with e.g. Dask, analysis and visualisation are a natural and simple step after the download. Intake is pure Python, with all that implies.

Is that an important point? I think it is.

@AdrienWehrle
Member Author

At present, my choice is: Intake

@mankoff
Member

mankoff commented May 22, 2022

Not sure why this is an issue and not part of the discussion (#12) :).

Anyway, maybe we should build a prototype using both to better understand the pros and cons. The suggested initial list is #4, but I'm not sure those datasets are good for our prototype given their large size. How about the following five datasets:

I suggest the prototype include:

  1. A reproducible script (shell, Python, whatever), Jupyter Notebook, or equivalent documenting the steps needed to recreate the prototype. This also serves one of our goals: a "How To Contribute" document.
  2. Notes on complications/issues
  3. Notes on how the tool (datalad, intake) handles or supports metadata and searching
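As a baseline for the third point, a tool-independent keyword search over catalog metadata could look like the sketch below. It assumes entries have already been parsed into plain dictionaries with `description` and `metadata` keys; the dataset names and values are hypothetical, and neither datalad nor intake is involved.

```python
def search(catalog, term):
    """Return the sorted names of entries whose name, description or
    metadata values contain `term` (case-insensitive)."""
    needle = term.lower()
    hits = []
    for name, entry in catalog.items():
        haystack = [name, entry.get("description", "")]
        haystack += [str(value) for value in entry.get("metadata", {}).values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(name)
    return sorted(hits)


# Hypothetical entries, shaped like parsed catalog records.
CATALOG = {
    "dataset_a": {
        "description": "Placeholder velocity grid",
        "metadata": {"variable": "velocity"},
    },
    "dataset_b": {
        "description": "Placeholder elevation grid",
        "metadata": {"variable": "elevation"},
    },
}
```

Comparing what each tool offers against a baseline like this would show whether its built-in metadata handling adds anything beyond simple text matching.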

Ideally, we each build both prototypes, so that we each understand both tools before the discussion and decision.


2 participants