Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let uncurl download progress reach git-annex #328

Closed
mih opened this issue May 2, 2023 · 4 comments · Fixed by #329
Closed

Let uncurl download progress reach git-annex #328

mih opened this issue May 2, 2023 · 4 comments · Fixed by #329

Comments

@mih
Copy link
Member

mih commented May 2, 2023

I am pulling a file via https and datalad get from the ICF data store (800M). Not a single progress update.

Pulling the same file from the same URL with datalad download (not download-url!) works fine and shows progress.

This is a serious issue, because progress reporting is necessary to avoid triggering annex stall detection -- which could easily happen for large-file downloads.

@mih mih changed the title No download progress from uncurl Let uncurl download progress reach git-annex May 2, 2023
@yarikoptic
Copy link
Member

yarikoptic commented May 2, 2023

would you be kind to share some uncurl handled URL I could try to addurl to see this issue manifest?

nevermind -- figured myself out an example

@mih
Copy link
Member Author

mih commented May 2, 2023

My interim conclusion after the devcall discussion today is to amend the special remote super main used in datalad-next and depending special remotes to amend the log handler setup to report progress messages to git annex via REMOTE.annex.progress().

This would be an approach that minimizes undesired interactions with other components. It makes the adjust only in a special remote process, and it does not require any yet-to-be-invented approach to bypass git-annex. There is also a reduced likelihood for more than one progress reporting to be ongoing in such a process.

@yarikoptic
Copy link
Member

yarikoptic commented May 2, 2023

FWIW uncurl does

        progress_id = self._get_progress_id(from_url, to_path)
        ...
        self._progress_report_start(
            progress_id,
            ('Download %s to %s', from_url, to_path),
            'downloading',
            expected_size,
        )
...
            for chunk in requests_tee(r, fp):
                self._progress_report_update(
                    progress_id, ('Downloaded chunk',), len(chunk))

whenever datalad special remote

pbar = ui.get_progressbar(label=url, fill_text=filepath, total=target_size)
...
for chunk in stream:
   ...
   pbar.update(total)

so, uncurl one does not ask UI directly, but goes through the logging system based progress bars (which generally work), and that one somehow doesn't seems to care about requesting progress bars, just reports

[INFO] git-annex-remote-uncurl:8>uncurl:526>main:78,55>annexremote:873,554,603>uncurl:369,480>any:184>http:152,251>url_operations.__init__:309  Downloaded chunk 

in a loop. I tried to dig deeper into logger based progress bar handling but my foo was not good enough for a rapid Aha moment.

@mih
Copy link
Member Author

mih commented May 2, 2023

Here is sketch of how the any progress logging compliant with http://docs.datalad.org/en/stable/design/progress_reporting.html can be reported to git-annex within a special remote process.

diff --git a/datalad/customremotes/main.py b/datalad/customremotes/main.py
index a2931ff04..d1a522c8c 100644
--- a/datalad/customremotes/main.py
+++ b/datalad/customremotes/main.py
@@ -45,6 +45,48 @@ def setup_parser(remote_name, description):
     return parser
 
 
+def only_progress_logrecords(record):
+    return hasattr(record, 'dlm_progress')
+
+
+class AnnexProgressHandler(logging.Handler):
+    def __init__(self, annexremote):
+        super().__init__()
+        self.annexremote = annexremote
+        self._ptrackers = {}
+
+    def close(self):
+        self._ptrackers = {}
+        super().close()
+
+    def emit(self, record):
+        if not hasattr(record, 'dlm_progress'):
+            # a filter should have been used to prevent this call
+            return
+
+        maint = getattr(record, 'dlm_progress_maint', None)
+        if maint in ('clear', 'refresh'):
+            return
+        pid = getattr(record, 'dlm_progress')
+        update = getattr(record, 'dlm_progress_update', None)
+        if pid not in self._ptrackers:
+            # this is new
+            prg = getattr(record, 'dlm_progress_initial', 0)
+            self._ptrackers[pid] = prg
+            self.annexremote.send_progress(prg)
+        elif update is None:
+            # not an update -> done
+            self._ptrackers.pop(pid)
+        else:
+            prg = self._ptrackers[pid]
+            if getattr(record, 'dlm_progress_increment', False):
+                prg += update
+            else:
+                prg = update
+            self._ptrackers[pid] = prg
+            self.annexremote.send_progress(prg)
+
+
 def _main(args, cls):
     """Unprotected portion"""
     assert(cls is not None)
@@ -52,6 +94,15 @@ def _main(args, cls):
     master = Master()
     remote = cls(master)
     master.LinkRemote(remote)
+
+    # we add an additional handler to the logger to deal with
+    # progress reports
+    dlroot_lgr = logging.getLogger('datalad')
+    phandler = AnnexProgressHandler(remote)
+    phandler.addFilter(only_progress_logrecords)
+    dlroot_lgr.addHandler(phandler)
+
+    # run the remote
     master.Listen()
     # cleanup
     if hasattr(remote, 'stop'):

Add dedicated log handler is attached to both the datalad "root" logger and the special remote class instance. The handler gets a filter to only makes it see progress log records (this is something that also the main progress handler should be doing). In the generic main() of special remotes, this log setup modifications is added as an additional log output -- no interference with any existing setup.

This sketch is limited/incomplete in that it assumes only a single progress tracker to be used (although it is already set up to support more than one). So in case of multiple concurrent trackers, it would send rubbish progress logs to git-annex.

That being said, with this change we see git-annex native progress reporting with git annex get and datalad native progress reporting with datalad get.

mih added a commit to mih/datalad-next that referenced this issue May 3, 2023
Add dedicated log handler is attached to both the datalad "root" logger
and the special remote class instance. The handler gets a filter to only
makes it see progress log records (this is something that also the main
progress handler should be doing). In the generic main() of special
remotes, this log setup modifications is added as an additional log
output -- no interference with any existing setup.

This sketch is limited/incomplete in that it assumes only a single
progress tracker to be used (although it is already set up to support
more than one). So in case of multiple concurrent trackers, it would
send rubbish progress logs to git-annex.

That being said, with this change we see git-annex native progress
reporting with git annex get and datalad native progress reporting with
datalad get.

Closes datalad#328
@mih mih closed this as completed in #329 May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants