ENH: Allow DatasetMetadataExtractor.get_required_content() to return a generator #361

Merged
merged 8 commits into from
Mar 7, 2023
23 changes: 18 additions & 5 deletions datalad_metalad/extract.py
@@ -458,11 +458,24 @@ def perform_dataset_metadata_extraction(ep: ExtractionArguments,
     }

     # Let the extractor get the files it requires
-    if extractor.get_required_content() is False:
-        yield {
-            "status": "impossible",
-            **result_template
-        }
+    # Handle both possibilities of bool return and Generator yield
+    res = extractor.get_required_content()
+
+    if isinstance(res, bool) or res is None:
+        if res is False:
+            yield {
+                "status": "impossible",
+                **result_template
+            }
+            return
+    else:
+        failure_count = 0
+        for r in res:
+            if r['status'] in ["error", "impossible"]:
+                failure_count += 1
+            yield r
+        if failure_count > 0:
+            return

     # Process results
     result = extractor.extract(None)
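The control flow added in this hunk can be checked in isolation. The following is a minimal sketch of the same bool-or-generator dispatch, not the MetaLad implementation: the helper name `check_required_content`, the `template` values, and `fake_extractor_results` are hypothetical, and the helper collects records in a list instead of streaming them so that the outcome is easy to inspect.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

Record = Dict[str, Any]
FAILURE_STATES = ("error", "impossible")


def check_required_content(
    res: Union[Optional[bool], Iterable[Record]],
    template: Record,
) -> Tuple[List[Record], bool]:
    """Return (records to emit, whether extraction may proceed).

    Hypothetical helper mirroring the hunk above; not a datalad_metalad API.
    """
    if isinstance(res, bool) or res is None:
        # Boolean protocol: only an explicit False blocks extraction.
        if res is False:
            return [{"status": "impossible", **template}], False
        return [], True

    # Generator protocol: forward every record, then block on any failure.
    records = list(res)
    failed = any(r["status"] in FAILURE_STATES for r in records)
    return records, not failed


def fake_extractor_results():
    # Stand-in for an extractor's get_required_content() generator.
    yield {"status": "ok", "path": "CITATION.cff"}
    yield {"status": "error", "path": "README.md"}


template = {"type": "dataset"}  # illustrative values only
print(check_required_content(True, template))
print(check_required_content(False, template))
print(check_required_content(fake_extractor_results(), template))
```

Note that, as in the hunk, a generator that yields a failure record is still drained completely, so every record reaches the caller before extraction is skipped.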
30 changes: 21 additions & 9 deletions datalad_metalad/extractors/base.py
@@ -12,8 +12,9 @@
 import enum
 from typing import (
     Any,
-    IO,
     Dict,
+    Generator,
+    IO,
     List,
     Optional,
     Union,
@@ -158,17 +159,28 @@ def __init__(self,
         self.ref_commit = ref_commit
         self.parameter = parameter or {}

-    def get_required_content(self) -> bool:
-        """
-        Let the extractor get the content that it needs locally.
-        The default implementation is to do nothing.
+    def get_required_content(self) -> Union[bool, Generator]:
+        """Let the extractor get the content that it needs locally.
+
+        The default implementation is to do nothing and return True.
+        Extractors that override this function can return a boolean
+        (True/False) value OR yield DataLad result records.

         Returns
         -------
-        True if all required content could be fetched, False
-        otherwise. If False is returned, the extractor
-        infrastructure will signal an error and the extractor's
-        extract method will not be called.
+        bool
+            True if all required content could be fetched, False
+            otherwise. If False is returned, the extractor
+            infrastructure will signal an error and the extractor's
+            extract method will not be called.
+
+        Yields
+        ------
+        dict
+            DataLad result records. If a result record is yielded
+            with a failure 'status' (i.e. equal to 'impossible' or
+            'error') the extractor infrastructure will signal an error
+            and the extractor's extract method will not be called.
         """
         return True

47 changes: 43 additions & 4 deletions docs/source/user_guide/writing-extractors.rst
@@ -86,18 +86,55 @@ This function is used in dataset-level extractors only.
 It will be called by MetaLad prior to metadata extraction.
 Its purpose is to allow the extractor to ensure that content that is required for metadata extraction is present
 (relevant, for example, if some of the files to be inspected may be annexed).
-The function should return ``True`` if it has obtained the required content, or confirmed its presence.
-If it returns ``False``, metadata extraction will not proceed.

+The function should either return a boolean value (``True`` or ``False``) or be a generator that yields
+`DataLad result records`_. In the case of a boolean value, the function should return ``True`` if
+it has obtained the required content, or confirmed its presence. If it returns ``False``,
+metadata extraction will not proceed. Alternatively, yielding result records lets extractors
+report more expressive messages or errors. If a result record is yielded with a failure
+status (i.e. with ``status`` equal to ``impossible`` or ``error``), metadata extraction will not proceed.
+
 This function can be a place to call ``dataset.get()``.
-It is advisable to disable result rendering (``result_renderer="disabled"``), because during metadata extraction, users will typically want to redirect standard output to a file or another command.
+It is advisable to disable result rendering (``result_renderer="disabled"``), because during metadata
+extraction, users will typically want to redirect standard output to a file or another command.

-Example::
+Example 1::

     def get_required_content(self) -> bool:
         result = self.dataset.get("CITATION.cff", result_renderer="disabled")
         return result[0]["status"] in ("ok", "notneeded")

+Example 2::
+
+    from typing import Generator
+    def get_required_content(self) -> Generator:
+        yield from self.dataset.get("CITATION.cff", result_renderer="disabled")
+
+Example 3::
+
+    from typing import Generator
+    def get_required_content(self) -> Generator:
+        result = self.dataset.get("CITATION.cff", result_renderer="disabled")
+        failure_count = 0
+        result_dict = dict(
+            path=self.dataset.path,
+            type='dataset',
+        )
+        for r in result:
+            if r['status'] in ['error', 'impossible']:
+                failure_count += 1
+        if failure_count > 0:
+            result_dict.update({
+                'status': 'error',
+                'message': 'could not retrieve required content'
+            })
+        else:
+            result_dict.update({
+                'status': 'ok',
+                'message': 'required content retrieved'
+            })
+        yield result_dict
+
 ``is_content_required()``
 -------------------------

@@ -241,3 +278,5 @@ For example, a list of files with a given extension (including those in subfolde

     files = list(self.dataset.repo.call_git_items_(["ls-files", "*.xyz"]))

+
+.. _DataLad result records: https://docs.datalad.org/en/stable/design/result_records.html
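
A generator-based `get_required_content()` has to yield individual result records (dicts), because the caller reads `r['status']` on each yielded item. The sketch below illustrates that with a stub in place of a real dataset. `StubDataset`, `CitationExtractor`, and the canned records are hypothetical, and the stub only assumes that `get()` hands back a list of result records when result rendering is disabled.

```python
from typing import Any, Dict, Generator, List


class StubDataset:
    """Hypothetical stand-in for a DataLad dataset; returns canned records."""

    def __init__(self, records: List[Dict[str, Any]]):
        self._records = records

    def get(self, path: str, result_renderer: str = "disabled") -> List[Dict[str, Any]]:
        # Assumption: like the real get(), hand back a list of result records.
        return [dict(r, path=path) for r in self._records]


class CitationExtractor:
    """Hypothetical extractor showing only the content-fetching hook."""

    def __init__(self, dataset: StubDataset):
        self.dataset = dataset

    def get_required_content(self) -> Generator[Dict[str, Any], None, None]:
        # Yield each record on its own; yielding the list itself would hand
        # the caller a list where it expects a dict with a 'status' key.
        yield from self.dataset.get("CITATION.cff", result_renderer="disabled")


extractor = CitationExtractor(StubDataset([{"status": "ok"}, {"status": "impossible"}]))
records = list(extractor.get_required_content())
statuses = [r["status"] for r in records]
print(statuses)
```

With these canned records the second status is a failure state, so under the handling introduced in this PR metadata extraction would not proceed.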