add a metalad description #930

Merged 28 commits on Feb 14, 2023

christian-monch
Contributor

This PR contains a first draft of metalad documentation. It covers meta-add and meta-dump and the
transport of metadata between repositories.

Contributor

@adswa adswa left a comment

Thanks much for this PR @christian-monch. I believe a section on metalad is hugely important. It's really really highly sought after by handbook readers, and I think there is not a single accessible documentation of it anywhere in the datalad ecosystem.
I've added a first round of rewriting and restructuring on top of this draft. I have also added a couple of TODOs that would make the section even better, if implemented.

I have a number of proposals I'd like to get your opinion on:

  • I believe it would be helpful to see actual output from the meta-* commands. For this, we could turn what's currently code blocks into runrecords.
  • Related to the suggestion above, I saw that you already created a test repo with metadata - maybe it is possible to shrink that down so that the output of a meta-dump would be manageable?
  • I think it would be cool to see two extractor-related small examples in the section
    • how to use any of the built-in extractors
    • how to build your own extractor (maybe @jsheunis has built one once that we could showcase?)
  • I'd like to include some technical notes on how metadata is stored internally, and why it is stored this way. If you have suggestions on how to write that up or which aspects would be the most important to add, I'd be all ears. I also haven't fully understood the meta-dump command's -r parameter and how the internal representation of metadata interacts with it.
  • I'd like to include some example queries, e.g., with a simple pipe into jq such that there is some notion of what "querying" even means

@adswa
Contributor

adswa commented Feb 9, 2023

I'd like to link relevant cross-references to existing documentation in this PR, too, as the office hour this week showed how hard a time users have finding the relevant pieces. I'm thinking of including @mslw's upcoming guide to writing your own extractors, and also mentioning @jsheunis's catalog workflow pipeline (can I get a pointer to where I can find out more about it?). What else could I link, @christian-monch, @mslw, @jsheunis?

docs/beyond_basics/101-181-metalad.rst: six resolved review threads (five outdated)
@mslw
Collaborator

mslw commented Feb 9, 2023

Regarding external pieces, these are the things I had "saved":

For the datalad catalog workflow subcommand (FTR: what it currently does is combine extraction with the "metalad_core", "metalad_studyminimeta", "bids_dataset", and "datacite_gin" extractors if possible, translation of outputs, catalog create, and catalog add), I don't think there is dedicated documentation other than the examples in the catalog README, the catalog manpage, and the workflows source with its docstrings.

For Metalad, I think this gist by @christian-monch used to be the primary overview of the functionality (but this PR probably means to supersede it?)

For a manual breakdown of extract / translate / update catalog, I have @jsheunis's draft of a fairly big catalog workflow and my digest of it. Keep in mind, though, that the metalad->catalog translation is being formalized in the catalog (datalad/datalad-catalog#246).

And there are quite a few catalog docs.

@jsheunis
Contributor

jsheunis commented Feb 9, 2023

@mslw already added all the sources I wanted to list. There are also this primer and these tutorials: https://github.com/datalad/tutorials/blob/master/notebooks/catalog_tutorials/datalad_catalog_primer.ipynb

I will add that the "fairly big catalog workflow" might have some useful text descriptions but is itself outdated, and kind of superseded by datalad-catalog's workflow functionality.

I will add another point: what we want to describe in the handbook in terms of this type of workflow is pretty much the same as what we are working on now with the "non-portal" effort, which I think will deserve its own chapter (if not more). So I would keep whatever we add to the metalad or catalog chapters minimal and eventually refer to the to-be-added workflow chapters.

lastly:

I believe it would be helpful to see actual output from the meta-* commands. For this, we could turn what's currently code blocks into runrecords.

I agree!

adswa and others added 3 commits February 9, 2023 15:31
Co-authored-by: Stephan Heunis <jsheunis@gmail.com>
Co-authored-by: Stephan Heunis <jsheunis@gmail.com>
Co-authored-by: Stephan Heunis <jsheunis@gmail.com>
@jsheunis
Contributor

jsheunis commented Feb 9, 2023

BTW @christian-monch @adswa I think the way that metadata and its relation to primary data are described here is clear and intuitive, and I think there are some snippets that would be useful to port into the "non-portal" paper.

Comment on lines +202 to +203

Contributor

Does anyone have complex dumps from the real world? The docstring of meta-add contains the spec for this notation, but it has a complex-ish feel to it, so the more examples, the better :)

^^^^^^^^^^^^^^^^^

TODO: something more about meta-dump and concrete usage example with, e.g., ``jq``
Contributor

Does anyone have an example in their shell history, by any chance? :)

Contributor Author

This would create a list of all familyName values in the person lists that were extracted from .studyminimeta.yaml files by the metalad_studyminimeta extractor:

> datalad meta-dump -r | jq '.extracted_metadata["@graph"][3]["@list"][].familyName' | sort | uniq

Contributor

Something like this? (also studyforrest dataset)

Output array of all subdatasets:

$ datalad meta-extract -d . metalad_core > metadata_record_metalad_core.json
$ datalad meta-add -d . metadata_record_metalad_core.json
$ datalad meta-dump . | jq '.extracted_metadata["@graph"] | .[] | select(.["@type"] == "Dataset") | [.hasPart[]? | {"dataset_id": (.identifier | sub("^datalad:"; "")), "dataset_version": (.["@id"] | sub("^datalad:"; "")), "dataset_path": .name}]'

[
  {
    "dataset_id": "d5dd3da0-a631-4c0c-a4a9-de55dfc4620f",
    "dataset_version": "276ceffd8c11db1b99b30b4ae6965aac296fba02",
    "dataset_path": "artifact/3T_movie_eyetracking"
  },
  {
    "dataset_id": "ad9b6c66-4413-4b4f-b6da-b7f25d0d6397",
    "dataset_version": "787741801ad6ff9e4d0aea7a1403bc483f9e1a74",
    "dataset_path": "artifact/3T_structural_mri"
  },
  {
    "dataset_id": "6075c0fa-ab72-4bab-9888-3b597f0e63b1",
    "dataset_version": "64cf42ad00c707bb6533b5944041582c6bfdd384",
    "dataset_path": "artifact/3T_visuallocalizer"
  },
  {
    "dataset_id": "d4759300-5563-467d-be5f-e5b164fb3060",
    "dataset_version": "7b6ec16910eb6a77cd0cc3d2945e7fb292fb9672",
    "dataset_path": "artifact/7T_audiomovie"
  },
  {
    "dataset_id": "c08af312-e05b-43b3-b499-db0d2ad46bf6",
    "dataset_version": "7b58b38af7167d576489cc58e9ec4b3eee0822fc",
    "dataset_path": "artifact/7T_musicperception"
  },
  {
    "dataset_id": "da15d84c-9c8b-11e9-a3fb-f0d5bf7b5561",
    "dataset_version": "f168e8e2b8bef21f373e8a5385e528130ca6339d",
    "dataset_path": "artifact/media"
  },
  {
    "dataset_id": "126cd950-377c-4600-a921-045cf408bd9f",
    "dataset_version": "1b7cb46a97e75aeccab28ef7b61693f4f02a0f8d",
    "dataset_path": "artifact/movie_eyetracking"
  },
  {
    "dataset_id": "0f66b1ba-e9a9-46fd-b9d9-2e64fe94d307",
    "dataset_version": "c74b66cf37c0d4ed8914296c6d7792b2d25696aa",
    "dataset_path": "code/conversion_qa"
  },
  {
    "dataset_id": "7fcd8812-d0fe-11e7-8db2-a0369f7c647e",
    "dataset_version": "2ccaa115543c21e6658950d1cb8cc3038f14272f",
    "dataset_path": "derivative/aggregate_fmri_timeseries"
  },
  {
    "dataset_id": "c8ec2919-493b-4af5-9271-cbe9ebd08c43",
    "dataset_version": "74cd7ec0538448b05fb4d5f91119b279c5e9ab04",
    "dataset_path": "derivative/aligned_mri"
  },
  {
    "dataset_id": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
    "dataset_version": "aaac44e047d375cd8f791b1b6fe2b739f02c83b2",
    "dataset_path": "derivative/cortical_surfaces_freesurfer"
  },
  {
    "dataset_id": "ceb007ac-ef05-4392-98d2-35c02a774a21",
    "dataset_version": "688d8d8558fe847f4c1b19aed579745bcd6c7744",
    "dataset_path": "derivative/image_space_transformations"
  },
  {
    "dataset_id": "2d05f277-94b0-470b-8e11-4e56691d5b89",
    "dataset_version": "78e04a00fedbe3e055f86c2f9127aa48e1133d55",
    "dataset_path": "derivative/retinotopic_maps"
  },
  {
    "dataset_id": "92e65958-4a5a-4c34-a4f4-ee070f7a123b",
    "dataset_version": "203aa983534fc2b823ff4777a85f4f80d7a68656",
    "dataset_path": "derivative/visual_areas"
  },
  {
    "dataset_id": "5b1081d6-84d7-11e8-b00a-a0369fb55db0",
    "dataset_version": "b623351b43eb2715331ac59ad4cf41682e84ff7d",
    "dataset_path": "original/3T_multiresolution_fmri"
  },
  {
    "dataset_id": "1882e2e6-fbbf-4ade-a65f-3a1615235f51",
    "dataset_version": "e5d2f8368fc5f6717d8ef131041c6d943298d0c7",
    "dataset_path": "original/3T_structural_mri"
  },
  {
    "dataset_id": "3a8648b3-7df8-413f-8efb-4d39040ac174",
    "dataset_version": "10e23aafa8271f742e9022a3522d9a88d7fe30cf",
    "dataset_path": "original/7T_multiresolution_fmri"
  },
  {
    "dataset_id": "5eaff716-54eb-11e8-803d-a0369f7c647e",
    "dataset_version": "72f835ada046bd0479009ea0ff933b30a95b0076",
    "dataset_path": "original/phase2"
  },
  {
    "dataset_id": "4c536c4a-ec61-11e6-9440-00b56d060aa7",
    "dataset_version": "298e2a884d0598dd89753e9b5576bd93d19f335a",
    "dataset_path": "stimulus/computational_annotations"
  },
  {
    "dataset_id": "45b9ab26-07fc-11e8-8c71-f0d5bf7b5561",
    "dataset_version": "29dcce2b9477537b433996ebc342c531139e1d87",
    "dataset_path": "stimulus/curated_annotations"
  }
]
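
For readers less familiar with jq, the filter above can be read as a plain loop. The Python sketch below is an illustration only: `list_subdatasets` is a hypothetical helper, and `sample_record` is a hand-made stand-in that assumes the metalad_core record shape implied by the filter (`@graph`, `@type`, `hasPart`, `identifier`, `@id`, `name`, with a `datalad:` prefix on the identifiers). Its values are taken from the first entry of the output above.

```python
import json
import re


def list_subdatasets(record):
    """Return one dict per subdataset listed under a Dataset node's hasPart."""
    rows = []
    for node in record["extracted_metadata"]["@graph"]:
        if node.get("@type") != "Dataset":
            continue
        # A missing hasPart is treated as empty, like `.hasPart[]?` in the jq filter.
        for part in node.get("hasPart", []):
            rows.append({
                # Strip the leading "datalad:" prefix, like sub("^datalad:"; "").
                "dataset_id": re.sub(r"^datalad:", "", part["identifier"]),
                "dataset_version": re.sub(r"^datalad:", "", part["@id"]),
                "dataset_path": part["name"],
            })
    return rows


# Hypothetical record; the shape is an assumption, not actual meta-dump output.
sample_record = {
    "extracted_metadata": {
        "@graph": [
            {"@type": "agent", "name": "placeholder"},
            {
                "@type": "Dataset",
                "hasPart": [
                    {
                        "@id": "datalad:276ceffd8c11db1b99b30b4ae6965aac296fba02",
                        "identifier": "datalad:d5dd3da0-a631-4c0c-a4a9-de55dfc4620f",
                        "name": "artifact/3T_movie_eyetracking",
                    }
                ],
            },
        ]
    }
}

print(json.dumps(list_subdatasets(sample_record), indent=2))
```

Unlike the jq filter, which emits one array per Dataset node, this sketch collects all subdatasets into a single flat list.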

Contributor

Ooh I only saw your reply now @christian-monch. I think yours is a bit more user friendly :)

Contributor Author

I think it is a good example. Just wanted to mention that the first two lines can be combined into:

> datalad meta-extract -d . metalad_core | datalad meta-add -d . --json-lines -

And since -d . is equivalent to the default dataset location, the following would be identical:

> datalad meta-extract metalad_core | datalad meta-add --json-lines -

Contributor

Thanks a lot! I think the family name example would work really well. I get a few errors/warnings from jq in @christian-monch's example. Any idea what might be wrong?

datalad meta-dump -r | jq '.extracted_metadata["@graph"][3]["@list"][].familyName' | sort | uniq
jq: error (at <stdin>:16): Cannot iterate over null (null)
jq: error (at <stdin>:22): Cannot iterate over null (null)
jq: error (at <stdin>:23): Cannot iterate over null (null)
jq: error (at <stdin>:24): Cannot iterate over null (null)

Contributor

This is still unresolved; I'm merely hiding it in the output. If someone has advice, let me know.
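
One possible explanation for the `Cannot iterate over null` errors, offered as an assumption rather than a confirmed diagnosis: the dump may contain records that have no person list at `@graph[3]["@list"]` (for example, records from a different extractor), and jq raises this error when `[]` is applied to null. jq's error-suppressing form `[]?` skips such values instead. The Python sketch below shows the same tolerant lookup. The `family_names` helper and the sample records are hypothetical and only mimic the shape implied by the query above.

```python
import json


def family_names(json_lines):
    """Collect unique familyName values, skipping records without a person list."""
    names = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        meta = record.get("extracted_metadata")
        if not isinstance(meta, dict):
            continue
        graph = meta.get("@graph")
        # Records without "@graph", or with fewer than four entries, are skipped.
        if not isinstance(graph, list) or len(graph) <= 3:
            continue
        entry = graph[3]
        persons = entry.get("@list") if isinstance(entry, dict) else None
        if not isinstance(persons, list):
            continue
        for person in persons:
            if isinstance(person, dict) and "familyName" in person:
                names.add(person["familyName"])
    # Sorted and de-duplicated, like `sort | uniq` in the shell pipeline.
    return sorted(names)


# Hypothetical JSON lines; the names and record shapes are made up for illustration.
sample = [
    json.dumps({
        "extractor_name": "metalad_studyminimeta",
        "extracted_metadata": {
            "@graph": [{}, {}, {}, {"@list": [{"familyName": "Roe"}, {"familyName": "Doe"}]}]
        },
    }),
    json.dumps({
        "extractor_name": "metalad_core",
        "extracted_metadata": {"@graph": [{"@type": "Dataset"}]},
    }),
]

print(family_names(sample))
```

The second sample record has no fourth `@graph` entry, which is the kind of input that would make the unguarded jq query fail; here it is skipped.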

@christian-monch
Contributor Author

christian-monch commented Feb 13, 2023

@adswa : I think it might be better to either include core metadata in the example or to use exclusively core metadata in the examples. WDYT?

@adswa adswa marked this pull request as ready for review February 13, 2023 15:20
@adswa
Contributor

adswa commented Feb 14, 2023

Not everything in this PR is fully finalized, but it's a very good start. I'll merge it to have metadata documentation available.

@adswa adswa merged commit 703247d into datalad-handbook:main Feb 14, 2023
@adswa
Contributor

adswa commented Feb 14, 2023

hey @all-contributors please add @christian-monch for content

@allcontributors
Contributor

@adswa

@christian-monch already contributed before to content

@adswa
Contributor

adswa commented Feb 14, 2023

hey @all-contributors please add @jsheunis for review

@allcontributors
Contributor

@adswa

I've put up a pull request to add @jsheunis! 🎉

@adswa
Contributor

adswa commented Feb 14, 2023

hey https://github.com/all-contributors please add @mslw for review

@adswa
Contributor

adswa commented Feb 14, 2023

hey @all-contributors please add @mslw for review

@allcontributors
Contributor

@adswa

@mslw already contributed before to review

4 participants