Discussion: the future of metadata extractors in metalad #202

Open
jsheunis opened this issue Jan 19, 2022 · 4 comments

@jsheunis
Member

jsheunis commented Jan 19, 2022

This issue opens a discussion to gather input on which metadata extractors to prioritise for support in MetaLad.

Currently we have several extractors in datalad core (some with "toy"-status) that operate on the file and/or dataset level:

  • annex.py
  • audio.py
  • base.py
  • datacite.py
  • datalad_core.py
  • datalad_rfc822.py
  • exif.py
  • frictionless_datapackage.py
  • image.py
  • xmp.py

Since metalad is the next-generation metadata-handling extension of datalad, the idea is to move all metadata functionality out of core. Existing extractors that are useful will be improved and maintained in metalad; the others can be dropped.

Even more extractors currently exist in datalad extensions, such as the extractors in datalad-neuroimaging:

  • bids.py
  • dicom.py
  • nidm.py
  • nifti1.py

And some next generation metalad extractors are also available:

  • core.py
  • core_dataset.py
  • core_file.py
  • custom.py
  • external.py
  • external_dataset.py
  • external_file.py
  • runprov.py

Future extractors would either form part of an extension or be supported in metalad. A few are worth listing already based on known use cases:

Feel free to share thoughts on these/other extractors, which ones to support in metalad, which ones to support via extensions, and which ones to prioritise in terms of updates.

@jsheunis
Member Author

Note/question:

Metadata extractors may or may not require file content to be available locally. For example, when using the current version of the bids.py extractor with meta-extract, the process starts with datalad getting the files from their remote location. It could be beneficial to have an option to first extract whatever metadata is available without local file content, and then to get the files and extract from their content only if the user requests it. Is this already possible?
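
To make the question concrete, here is a minimal sketch of the two-phase behaviour I have in mind. It is not metalad code: `FileEntry`, `TwoPhaseExtractor`, and `fake_fetch` are hypothetical names, and the `fetch` callable is only a stand-in for whatever would actually get the file content.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FileEntry:
    """Hypothetical file record; `content_present` mimics whether content is local."""
    path: str
    content_present: bool = False


@dataclass
class TwoPhaseExtractor:
    """Hypothetical extractor that separates content-free from content-based work."""
    fetch: Callable[[FileEntry], None]  # stand-in for a step that gets file content
    fetched: List[str] = field(default_factory=list)

    def extract_without_content(self, entry: FileEntry) -> Dict[str, str]:
        # Phase 1: use only information available without file content (the path).
        name = entry.path.rsplit("/", 1)[-1]
        suffix = name.rsplit(".", 1)[-1] if "." in name else ""
        return {"path": entry.path, "suffix": suffix}

    def extract_with_content(self, entry: FileEntry) -> Dict[str, str]:
        # Phase 2: fetch content only if it is missing, then inspect it.
        if not entry.content_present:
            self.fetch(entry)
            self.fetched.append(entry.path)
        return {"content_available": str(entry.content_present)}

    def extract(self, entry: FileEntry, with_content: bool = False) -> Dict[str, str]:
        record = self.extract_without_content(entry)
        if with_content:  # content is fetched only on explicit request
            record.update(self.extract_with_content(entry))
        return record


def fake_fetch(entry: FileEntry) -> None:
    """Pretend to download the content of `entry`."""
    entry.content_present = True
```

With this shape, calling `extract(entry)` never triggers a fetch, while `extract(entry, with_content=True)` fetches only the files whose content is missing.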

@jsheunis
Member Author

> Metadata extractors may or may not require file content to be available locally. For example, when using the current version of the bids.py extractor with meta-extract, the process starts with datalad getting the files from their remote location. It could be beneficial to have an option to first extract whatever metadata is available without local file content, and then to get the files and extract from their content only if the user requests it. Is this already possible?

Progress on this documented here: datalad/datalad-neuroimaging#94

@jsheunis
Member Author

Some comments re datacite:

  • Datalad core has a datacite XML schema extractor (XML being the main schema format of datacite), which should be user-tested to determine how well it works
  • A JSON schema for datacite is in the works, but no stable release is out yet. (see also this page). It is / will be based on json-schema.
  • Datasets hosted on GIN can obtain a DOI by submitting a datacite.yml file (see example). A metadata extractor for this format could be useful, since datalad datasets hosted on GIN would likely already contain such a metadata file.
  • The datacite API can accept loads of different metadata formats (including DataCite XML, schema.org in JSON-LD, Crossref Unixref, Citeproc JSON, RIS, BibTeX) and can translate them and validate against Datacite XML.
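
As a rough idea of what an extractor for the datacite.yml format might do, here is a minimal sketch that maps an already-parsed file (a plain dict, e.g. as produced by a YAML parser) onto a flat metadata record. The field names (`authors`, `firstname`, `lastname`, `affiliation`, `title`, `description`, `keywords`, `license`) are assumptions about what such a file contains and should be checked against the GIN example; the function name `datacite_yml_to_record` is hypothetical.

```python
from typing import Any, Dict, List


def datacite_yml_to_record(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Map a parsed datacite.yml dict to a flat record (field names are assumed)."""
    authors: List[Dict[str, Any]] = []
    for author in doc.get("authors") or []:
        # Join whichever name parts are present into one display name.
        parts = (author.get("firstname"), author.get("lastname"))
        name = " ".join(p for p in parts if p)
        authors.append({"name": name, "affiliation": author.get("affiliation")})

    license_info = doc.get("license") or {}
    return {
        "title": doc.get("title"),
        "description": doc.get("description"),
        "authors": authors,
        "keywords": list(doc.get("keywords") or []),
        "license": license_info.get("name"),
    }
```

Missing fields come back as `None` or empty lists rather than raising, so that partially filled files still yield a record.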

@jsheunis
Member Author

jsheunis commented Feb 2, 2022

Another useful addition, possibly to the studyminimeta extractor or as a separate extractor, would be support for the Data Use Ontology, which allows semantic tagging of datasets with restrictions on data use.
