add a metalad description #930
This is a first draft of metalad documentation. It covers meta-add and meta-dump and the transport of metadata between repositories.
Thanks much for this PR @christian-monch. I believe a section on metalad is hugely important. It's really really highly sought after by handbook readers, and I think there is not a single accessible documentation of it anywhere in the datalad ecosystem.
I've added a first round of rewriting and restructuring on top of this draft. I have also added a couple of TODOs that would make the section even better, if implemented.
I have a number of proposals I'd like to get your opinion on:
- I believe it would be helpful to see actual output from the `meta-*` commands. For this, we could turn what's currently code blocks into `runrecords`.
- Related to the suggestion above, I saw that you already created a test repo with metadata - maybe it is possible to shrink that down so that the output of a `meta-dump` would be manageable?
- I think it would be cool to see two small extractor-related examples in the section:
  - how to use any of the built-in extractors
  - how to build your own extractor (maybe @jsheunis has built one once that we could showcase?)
- I'd like to include some technical notes on how metadata is stored internally, and why it is stored this way. If you have suggestions on how to write that up or which aspects would be the most important to add, I'd be all ears. I also haven't fully understood the `meta-dump` command's `-r` parameter and how the internal representation of metadata interacts with it.
- I'd like to include some example queries, e.g., with a simple pipe into `jq`, so that there is some notion of what "querying" even means.
I'd like to link relevant cross-references to existing documentation in this PR, too, as the office hour this week showed how hard a time users have finding the relevant pieces. I'm thinking of including @mslw's upcoming guide to writing your own extractors, and also mentioning @jsheunis's catalog workflow pipeline (can I get a pointer to where I can find out more about it?). What else could I link, @christian-monch, @mslw, @jsheunis?
Regarding external pieces, these are the things I had "saved": For Metalad, I think this gist by @christian-monch used to be the primary overview of the functionality (but this PR probably means to supersede it?). For a manual breakdown of extract / translate / update catalog, I have @jsheunis's draft of a fairly big catalog workflow and my digest of it. Keep in mind, though, that metalad->catalog translation is being formalized in the catalog: datalad/datalad-catalog#246. And there is quite a bit of catalog docs.
@mslw already added all the sources I wanted to list. There are also this primer and tutorials: https://github.com/datalad/tutorials/blob/master/notebooks/catalog_tutorials/datalad_catalog_primer.ipynb
I will add that the "fairly big catalog workflow" might have some useful text descriptions but is itself outdated, and kind of superseded by datalad-catalog's workflow functionality.
Another point: what we want to describe in the handbook in terms of this type of workflow is pretty much the same as what we are working on now with the "non-portal" effort, which I think will deserve its own chapter (if not more). So I would keep whatever we add to the metalad or catalog chapters minimal and eventually refer to the to-be-added workflow chapters.
Lastly: I agree!
BTW @christian-monch @adswa I think the way that metadata and its relation to primary data are described here is clear and intuitive, and I think there are some snippets that would be useful to port into the "non-portal" paper.
Co-authored-by: Stephan Heunis <jsheunis@gmail.com>
Does anyone have complex dumps from the real world? The docstring of `meta-add` contains the spec for this notation, but it has a complex-ish feel to it, so the more examples, the better :)
TODO: something more about meta-dump and concrete usage example with, e.g., ``jq``
Does anyone have an example in their shell history, by any chance? :)
This would create a list of all `familyName` values in the `person` lists that were extracted from `.studyminimeta.yaml` files by the `metalad_studyminimeta` extractor:
> datalad meta-dump -r|jq '.extracted_metadata["@graph"][3]["@list"][].familyName'|sort|uniq
Something like this? (also studyforrest dataset)
Output array of all subdatasets:
$ datalad meta-extract -d . metalad_core > metadata_record_metalad_core.json
$ datalad meta-add -d . metadata_record_metalad_core.json
$ datalad meta-dump . | jq '.extracted_metadata["@graph"] | .[] | select(.["@type"] == "Dataset") | [.hasPart[]? | {"dataset_id": (.identifier | sub("^datalad:"; "")), "dataset_version": (.["@id"] | sub("^datalad:"; "")), "dataset_path": .name}]'
[
{
"dataset_id": "d5dd3da0-a631-4c0c-a4a9-de55dfc4620f",
"dataset_version": "276ceffd8c11db1b99b30b4ae6965aac296fba02",
"dataset_path": "artifact/3T_movie_eyetracking"
},
{
"dataset_id": "ad9b6c66-4413-4b4f-b6da-b7f25d0d6397",
"dataset_version": "787741801ad6ff9e4d0aea7a1403bc483f9e1a74",
"dataset_path": "artifact/3T_structural_mri"
},
{
"dataset_id": "6075c0fa-ab72-4bab-9888-3b597f0e63b1",
"dataset_version": "64cf42ad00c707bb6533b5944041582c6bfdd384",
"dataset_path": "artifact/3T_visuallocalizer"
},
{
"dataset_id": "d4759300-5563-467d-be5f-e5b164fb3060",
"dataset_version": "7b6ec16910eb6a77cd0cc3d2945e7fb292fb9672",
"dataset_path": "artifact/7T_audiomovie"
},
{
"dataset_id": "c08af312-e05b-43b3-b499-db0d2ad46bf6",
"dataset_version": "7b58b38af7167d576489cc58e9ec4b3eee0822fc",
"dataset_path": "artifact/7T_musicperception"
},
{
"dataset_id": "da15d84c-9c8b-11e9-a3fb-f0d5bf7b5561",
"dataset_version": "f168e8e2b8bef21f373e8a5385e528130ca6339d",
"dataset_path": "artifact/media"
},
{
"dataset_id": "126cd950-377c-4600-a921-045cf408bd9f",
"dataset_version": "1b7cb46a97e75aeccab28ef7b61693f4f02a0f8d",
"dataset_path": "artifact/movie_eyetracking"
},
{
"dataset_id": "0f66b1ba-e9a9-46fd-b9d9-2e64fe94d307",
"dataset_version": "c74b66cf37c0d4ed8914296c6d7792b2d25696aa",
"dataset_path": "code/conversion_qa"
},
{
"dataset_id": "7fcd8812-d0fe-11e7-8db2-a0369f7c647e",
"dataset_version": "2ccaa115543c21e6658950d1cb8cc3038f14272f",
"dataset_path": "derivative/aggregate_fmri_timeseries"
},
{
"dataset_id": "c8ec2919-493b-4af5-9271-cbe9ebd08c43",
"dataset_version": "74cd7ec0538448b05fb4d5f91119b279c5e9ab04",
"dataset_path": "derivative/aligned_mri"
},
{
"dataset_id": "3304e775-5f5f-435a-b68e-d98c9f5fb72a",
"dataset_version": "aaac44e047d375cd8f791b1b6fe2b739f02c83b2",
"dataset_path": "derivative/cortical_surfaces_freesurfer"
},
{
"dataset_id": "ceb007ac-ef05-4392-98d2-35c02a774a21",
"dataset_version": "688d8d8558fe847f4c1b19aed579745bcd6c7744",
"dataset_path": "derivative/image_space_transformations"
},
{
"dataset_id": "2d05f277-94b0-470b-8e11-4e56691d5b89",
"dataset_version": "78e04a00fedbe3e055f86c2f9127aa48e1133d55",
"dataset_path": "derivative/retinotopic_maps"
},
{
"dataset_id": "92e65958-4a5a-4c34-a4f4-ee070f7a123b",
"dataset_version": "203aa983534fc2b823ff4777a85f4f80d7a68656",
"dataset_path": "derivative/visual_areas"
},
{
"dataset_id": "5b1081d6-84d7-11e8-b00a-a0369fb55db0",
"dataset_version": "b623351b43eb2715331ac59ad4cf41682e84ff7d",
"dataset_path": "original/3T_multiresolution_fmri"
},
{
"dataset_id": "1882e2e6-fbbf-4ade-a65f-3a1615235f51",
"dataset_version": "e5d2f8368fc5f6717d8ef131041c6d943298d0c7",
"dataset_path": "original/3T_structural_mri"
},
{
"dataset_id": "3a8648b3-7df8-413f-8efb-4d39040ac174",
"dataset_version": "10e23aafa8271f742e9022a3522d9a88d7fe30cf",
"dataset_path": "original/7T_multiresolution_fmri"
},
{
"dataset_id": "5eaff716-54eb-11e8-803d-a0369f7c647e",
"dataset_version": "72f835ada046bd0479009ea0ff933b30a95b0076",
"dataset_path": "original/phase2"
},
{
"dataset_id": "4c536c4a-ec61-11e6-9440-00b56d060aa7",
"dataset_version": "298e2a884d0598dd89753e9b5576bd93d19f335a",
"dataset_path": "stimulus/computational_annotations"
},
{
"dataset_id": "45b9ab26-07fc-11e8-8c71-f0d5bf7b5561",
"dataset_version": "29dcce2b9477537b433996ebc342c531139e1d87",
"dataset_path": "stimulus/curated_annotations"
}
]
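For readers less fluent in `jq`, the filter above can be read roughly as the following Python sketch. It assumes that each line of `meta-dump` output is one JSON object, and it relies only on the keys visible in the filter (`extracted_metadata`, `@graph`, `@type`, `hasPart`, `identifier`, `@id`, `name`). The sample record is a hand-written, shortened stand-in built from the first entry of the output above, not real `meta-dump` output, and `strip_prefix` and `list_subdatasets` are hypothetical helper names.

```python
import json

PREFIX = "datalad:"


def strip_prefix(value, prefix=PREFIX):
    """Drop a leading prefix, like jq's sub("^datalad:"; "")."""
    return value[len(prefix):] if value.startswith(prefix) else value


def list_subdatasets(record):
    """Collect the hasPart entries of every Dataset node in one record."""
    subdatasets = []
    graph = record.get("extracted_metadata", {}).get("@graph", [])
    for node in graph:
        if node.get("@type") != "Dataset":
            continue
        # A missing hasPart is treated as "no subdatasets", like hasPart[]? in jq.
        for part in node.get("hasPart", []):
            subdatasets.append({
                "dataset_id": strip_prefix(part["identifier"]),
                "dataset_version": strip_prefix(part["@id"]),
                "dataset_path": part["name"],
            })
    return subdatasets


# Hand-written sample line (assumption: shaped like the records queried above).
SAMPLE_LINE = json.dumps({
    "extracted_metadata": {
        "@graph": [
            {"@type": "SomethingElse"},  # hypothetical non-Dataset node
            {
                "@type": "Dataset",
                "hasPart": [{
                    "@id": "datalad:276ceffd8c11db1b99b30b4ae6965aac296fba02",
                    "identifier": "datalad:d5dd3da0-a631-4c0c-a4a9-de55dfc4620f",
                    "name": "artifact/3T_movie_eyetracking",
                }],
            },
        ]
    }
})

SUBDATASETS = list_subdatasets(json.loads(SAMPLE_LINE))
print(json.dumps(SUBDATASETS, indent=2))
```

Unlike the `jq` filter, which emits one array per `Dataset` node, this sketch collects all entries of a record into a single list.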
Ooh I only saw your reply now @christian-monch. I think yours is a bit more user friendly :)
I think it is a good example. Just wanted to mention that the first two lines can be combined into:
> datalad meta-extract -d . metalad_core | datalad meta-add -d . --json-lines -
And since `-d .` is equivalent to the default dataset location, the following would be identical:
> datalad meta-extract metalad_core | datalad meta-add --json-lines -
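If the same pipe is needed from a script, something like the following Python sketch could work. The `pipe` helper and `add_core_metadata` are hypothetical names, not part of DataLad; the command lines are taken verbatim from the comment above. The sketch buffers the extractor's whole output in memory instead of streaming it, which should be fine for a single metadata record but is an assumption about output size.

```python
import subprocess


def pipe(producer_cmd, consumer_cmd):
    """Run producer_cmd and feed its stdout to consumer_cmd's stdin.

    Returns the consumer's stdout as text. Raises
    subprocess.CalledProcessError if either command exits non-zero.
    """
    producer = subprocess.run(
        producer_cmd, check=True, capture_output=True, text=True
    )
    consumer = subprocess.run(
        consumer_cmd,
        input=producer.stdout,
        check=True,
        capture_output=True,
        text=True,
    )
    return consumer.stdout


def add_core_metadata():
    """Equivalent of: datalad meta-extract metalad_core | datalad meta-add --json-lines -

    Needs datalad with the metalad extension installed, and must be run
    inside a dataset; it is therefore defined here but not called.
    """
    return pipe(
        ["datalad", "meta-extract", "metalad_core"],
        ["datalad", "meta-add", "--json-lines", "-"],
    )
```

Calling `add_core_metadata()` from within a dataset would then mirror the one-liner.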
Thanks a lot! I think the family name example would work really well. I get a few errors/warnings from jq in @christian-monch's example - any idea what might be wrong?
datalad meta-dump -r | jq '.extracted_metadata["@graph"][3]["@list"][].familyName' | sort | uniq
jq: error (at <stdin>:16): Cannot iterate over null (null)
jq: error (at <stdin>:22): Cannot iterate over null (null)
jq: error (at <stdin>:23): Cannot iterate over null (null)
jq: error (at <stdin>:24): Cannot iterate over null (null)
This is as yet unresolved; I'm merely hiding it in the output. If someone has advice, let me know.
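One possible explanation, which is an assumption and not confirmed in this thread, is that `meta-dump -r` also emits records whose `@graph` has no `@list` at index 3 (for example, records from other extractors), so `jq` ends up iterating over `null`. In `jq` itself, the `?` suffix (as already used in `hasPart[]?` in the other example) suppresses this kind of error. The Python sketch below shows the same idea of skipping records that do not have the expected shape. The sample lines and names are made up for illustration, and `family_names` is a hypothetical helper.

```python
import json


def family_names(json_lines):
    """Return the sorted, unique familyName values found at
    extracted_metadata["@graph"][3]["@list"], skipping records
    that lack that structure instead of failing on them."""
    names = set()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        metadata = record.get("extracted_metadata")
        if not isinstance(metadata, dict):
            continue
        graph = metadata.get("@graph")
        if not isinstance(graph, list) or len(graph) <= 3:
            continue
        entry = graph[3]
        persons = entry.get("@list") if isinstance(entry, dict) else None
        if not isinstance(persons, list):
            continue
        for person in persons:
            if isinstance(person, dict) and "familyName" in person:
                names.add(person["familyName"])
    return sorted(names)


# Made-up sample: one record with a person list, one without.
SAMPLE = "\n".join([
    json.dumps({"extracted_metadata": {"@graph": [
        {}, {}, {},
        {"@list": [
            {"familyName": "Roe"},
            {"familyName": "Doe"},
            {"familyName": "Roe"},
        ]},
    ]}}),
    json.dumps({"extracted_metadata": {"@graph": [{"@type": "Dataset"}]}}),
]) + "\n"

print(family_names(SAMPLE))
```

Note that this sketch silently drops persons without a `familyName`, whereas the `jq` expression would print `null` for them.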
@adswa: I think it might be better to either include core metadata in the example or to use exclusively core metadata in the examples. WDYT?
Not everything in this PR is fully finalized, but it's a very good start. I'll merge it to have metadata documentation available.
hey @all-contributors please add @christian-monch for content
@christian-monch already contributed before to content
hey @all-contributors please add @jsheunis for review
I've put up a pull request to add @jsheunis! 🎉
hey https://github.com/all-contributors please add @mslw for review
hey @all-contributors please add @mslw for review