PR for the Extending data ecosystem usecase #421

gi114 · 2020-03-19T14:45:26Z

Hi,

This PR proposes a new use case: using the git-annex globus special remote to retrieve dataset files hosted on Globus.org via datalad get file.

Special remotes are provided by git-annex, and adding a Globus use case would highlight the growing data availability that DataLad enables for researchers within the scientific data ecosystem.

adswa · 2020-03-19T15:56:57Z

Hi @gi114,
Wow, awesome! Thanks a lot for your PR. I'm very excited about this, and I think this would be a great addition to the handbook. :)

There are a couple of things to fine-tune, still. As this is the largest PR we have yet received, we actually don't have a standard procedure in place on how to best approach a review. Would you be up to a short video call, e.g. via Zoom or jitsi? This could help me gauge how I could best assist you in finalizing this.

yarikoptic

Just a cursory review; I pointed out some minor issues.

yarikoptic · 2020-03-19T19:24:19Z

docs/usecases/extending_data_ecosystem.rst

+
+.. code-block:: bash
+
+   $ ls ll path/to/file


what is ll ?

double-l (ll)? This command would show whether the file is annexed or not. It is used to compare before and after download

it must be your alias or something like that -- it is not a standard option of ls. I guess you meant -l - then indeed you could see if those are symlinks. But in cases of git-annex operating on crippled filesystems without symlinks support, or files being unlocked -- they wouldn't even be symlinks.
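
A minimal sketch of the check being discussed, assuming a POSIX shell. The scratch directory, the file names, and the helper `describe_file` are hypothetical, and the symlink below only imitates how a locked annexed file typically looks; it is not created by git-annex.

```shell
#!/bin/sh
# Hypothetical scratch setup: imitate a locked annexed file with a plain symlink.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
mkdir -p .git/annex/objects
echo "file content" > .git/annex/objects/EXAMPLE-KEY
ln -s .git/annex/objects/EXAMPLE-KEY annexed.txt
echo "plain content" > regular.txt

# describe_file (hypothetical helper): report whether a path is a symlink.
describe_file() {
    if [ -L "$1" ]; then
        echo "$1: symlink"
    else
        echo "$1: not a symlink"
    fi
}

# With -l, a symlink starts with "l" in the mode column and shows "-> target".
ls -l annexed.txt regular.txt

describe_file annexed.txt
describe_file regular.txt
```

An unlocked file, or a file on a filesystem without symlink support, would be reported as "not a symlink" even when it is annexed, so this check only covers the default case.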

yarikoptic · 2020-03-19T19:25:58Z

docs/usecases/extending_data_ecosystem.rst

+datasets across multiple locations are reduce the need of replicating data
+
+
+The Datalad Approach


here and elsewhere, DataLad (not Datalad). I will submit a PR to tune up some missed hits in other docs

yarikoptic · 2020-03-19T19:27:16Z

docs/usecases/extending_data_ecosystem.rst

+side and to respond back if data is available. Currently, git-annex-globus-remote
+only supports data download operations but it could potentially be useful for additional
+functionalities. When the globus-remote get initialized for the first time, the user
+has to authenticate to Globus.org using ORCHiD, Gmail or a specific Globus accounts:


I would use ORCID not ORCiD. The i with a dot appears only in the logo, in place of a full-sized capital I. They use ORCID elsewhere.

yarikoptic · 2020-03-19T19:29:18Z

docs/usecases/extending_data_ecosystem.rst

+
+.. code-block:: bash
+
+   $ datalad install -r <dataset>


It will not work on just any dataset. There should be a note that the dataset (somehow, how?) should be prepared and populated with information in the special remote on which files to get from Globus. Having an example dataset with access to data in Globus would be great.

Yes sounds good, will update on that

gi114 · 2020-03-19T20:02:27Z

Hi @adswa,

Thanks for your comment. Sure, Zoom would work. When would you be available? My timezone is (GMT-4)

adswa · 2020-03-19T21:17:35Z

I'm GMT+1, and I could make a call tomorrow during most of your morning & afternoon. I'll tentatively suggest 11 AM GMT-4 (would be 4 PM for me), but feel free to suggest any time between approx. 4 hours before or after this, or a day that fits you better. Everyone here is in home office and time management is flexible. I'll post a Zoom link in here.

adswa · 2020-03-20T12:49:59Z

I'm in this Zoom meeting now:

Adina Wagner is inviting you to a scheduled Zoom meeting.

Topic: CONP Dataset hosting on Globus
Time: Mar 20, 2020 04:00 PM Amsterdam, Berlin, Rome, Stockholm, Vienna

Join Zoom Meeting
https://zoom.us/j/390795640

Meeting ID: 390 795 640

One tap mobile
,,390795640# US Toll

Dial by your location
US Toll

Meeting ID: 390 795 640
Find your local number: https://zoom.us/u/abmnSiR7iO

adswa · 2020-03-20T15:09:57Z

Hi @gi114 I'm in the video call. I'll stay here a while longer, so join, if it fits your schedule. :)

adswa

I have added a few comments on your PR, but feel free to suggest in-person discussion via Zoom. :)

In general, I approve this PR. It's really cool to see how you use DataLad, and I'm very grateful that you took the time to contribute this usecase. So thank you very much for this!

The difficult aspect in reviewing such a PR is the subjectivity of the review. It's easy to find bugs/better alternatives in code PRs, but it's really tough to edit a proposed chapter for a book given that different people have different writing styles and idiosyncrasies (e.g., based on my writing it will sometimes be clear that my mother tongue is German, and based on your writing, I strongly suspect you are Italian).

I'm trying to achieve a balance between keeping your writing style, shaping the PR even more into the general usecase structure, and highlighting where I as a naive reader would need more information to understand something. For this, I have now added very "broad" comments (such as: A more specific title, or structuring parts of your writing into subsections of "Step-by-step") that I think will make it easier to understand the usecase when applied. I also agree with @yarikoptic that the most understandable approach would be to point to a publicly accessible dataset (but that's of course only possible if there is one publicly accessible).

From my own experience, I know that writing book content can take a few iterations: improve the structure, fine-tune example code, fix typos, and so on. Given that you have already invested a lot of time, I definitely don't want to make this PR a dreadful experience. My suggestion would be: As long as you're happy to apply suggestions, let's work on this PR together. But if you're saying "It's been cool, but I really can't invest too much more time into this", I can take over what is left. :)

adswa · 2020-03-20T15:16:18Z

docs/usecases/extending_data_ecosystem.rst

@@ -0,0 +1,151 @@
+.. _usecase_extending_data_ecosystem:
+
+Extending the data ecosystem


It can be hard to understand the contents of this usecase only from this title. I would go for something that emphasizes the nature of the use case - what is novel is how you use Globus.org to host data for the datasets used by CONP. One suggestion:
"Using Globus as a data store for the Canadian Open Neuroscience Portal"

Oh yes, sounds good actually !

adswa · 2020-03-20T15:16:36Z

docs/usecases/extending_data_ecosystem.rst

+
+Extending the data ecosystem
+----------------------------
+


To make your usecase findable in the index, you can add an index entry like this:

.. index:: ! Usecase; <name as you want it to appear in the index>

adswa · 2020-03-20T15:18:30Z

docs/usecases/extending_data_ecosystem.rst

+to retrieve actual files content, only on user need.
+
+Users log into the CONP portal and install Datalad datasets with
+``datalad install -r <dataset>`` to access annexed files (as mentioned


Personally, I don't know what the CONP portal looks like (I'm envisioning a web interface). If you think it is helpful, you could provide a screenshot in the step-by-step section showing what the portal looks like for a user who wants to obtain a dataset.

If you create a screenshot, it is best to save it into Git in docs/artwork/src (it's a subdataset of this repo - adding it there will require a PR against the artwork repository). You can add a figure that lies in docs/artwork/src like this:

.. figure:: ../artwork/src/screenshot.jpg

adswa · 2020-03-20T15:18:53Z

docs/usecases/extending_data_ecosystem.rst

+
+Globus as git-annex data store
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+


To shape this usecase into the structure we try to maintain, I would suggest moving this section under the Step-by-Step section:

1.1. The Challenge

1.2. The DataLad Approach

1.3. Step-by-Step

1.3.1. Globus as git-annex data store

You can achieve the lower hierarchy by using """"" to underline the heading.

adswa · 2020-03-20T15:20:05Z

docs/usecases/extending_data_ecosystem.rst

+
+Step-by-Step
+^^^^^^^^^^^^
+


I think this section could fit under a section From the perspective of a user (as a subsection of step-by-step):

1.1. The Challenge

1.2. The DataLad Approach

1.3. Step-by-Step

1.3.1. Globus as git-annex data store

1.3.2. From the perspective of a user

adswa · 2020-03-20T15:23:14Z

docs/usecases/extending_data_ecosystem.rst

+
+It always starts with a dataset:
+
+.. code-block:: bash


The code blocks are the most helpful part of a usecase. The previous parts explained the problem (How can we work with large amounts of data?) and sketched a solution (Have a dataset on Globus.org, use a globus special remote to get access, retrieve only the file content that you need), but here you can show very concretely how it can be done - the more comprehensive the code block is, the better.

Maybe you install an example dataset on your computer and copy-paste the complete output from the terminal into a code block? This would make it feel as if there actually is code execution, and could give readers a complete picture of what it would look like from the perspective of a user. And instead of saying "path/to/file", maybe really get a file locally, and copy-paste the output?

If the dataset is publicly accessible (e.g., if I or anyone else can also install it), you could also create executable code snippets like this:

.. runrecord:: _examples/globus-101
   :workdir: usecases/globus
   :language: console

   $ datalad install -r <url>

.. runrecord:: _examples/globus-102
   :workdir: usecases/globus/<name_of_dataset>
   :language: console

   $ ls -l <file>

This would work for all snippets but the second (with pip install and the special remote setup) -- but only if it's publicly available.

adswa · 2020-03-20T15:24:33Z

docs/usecases/extending_data_ecosystem.rst

+   $ git-annex-remore-globus setup
+
+We can see that most of the files in the dataset are annexed, highlighted in red. You can check
+the symlink for a given file by running


I think you refer to the syntax highlighting of some shells here, right? My shells are also configured to do this, but not all are. Many people in our institute, for example, only use black-and-white shells that do not color-code symlinks any differently from normal files. It would be easier to explain what you mean with a screenshot, or with the output of an ls -l that shows the symlinks.
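
One way to show symlinks without relying on colors, sketched under assumptions: the scratch tree and file names below are hypothetical, the symlinks merely stand in for annexed files, and `list_symlinks` is a made-up helper. `find -type l` reports symlinks regardless of how the terminal is configured.

```shell
#!/bin/sh
# Hypothetical scratch tree standing in for a dataset with two annexed files.
scratch=$(mktemp -d)
mkdir -p "$scratch/.git/annex/objects" "$scratch/data"
echo "a" > "$scratch/.git/annex/objects/KEY-A"
echo "b" > "$scratch/.git/annex/objects/KEY-B"
ln -s ../.git/annex/objects/KEY-A "$scratch/data/a.dat"
ln -s ../.git/annex/objects/KEY-B "$scratch/data/b.dat"
echo "readme" > "$scratch/README"

# list_symlinks (hypothetical helper): print all symlinks below a directory,
# skipping the .git directory itself, in a stable order.
list_symlinks() {
    (cd "$1" && find . -path ./.git -prune -o -type l -print | sort)
}

list_symlinks "$scratch"
```

The regular README is not listed, so the output separates symlinked files from ordinary ones in plain text.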

adswa · 2020-05-08T14:41:59Z

Thanks a lot for the updates! I see where the problem lies with Travis: the submodule is not in its most recent state. If you obtained this repository with datalad clone and run datalad status, I predict that you will see something like this:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   docs/artwork (new commits)

Running git add docs/artwork and committing should fix this. Unfortunately, I can't push to your PR, but I have added two commits with fixes onto your branch and submitted a new PR in #479 (all of your commits are still in the new PR). You could either:

cherry-pick my commits or redo them in your clone
close this PR in favor of #479 (Tmp globus), and I'll merge #479 instead (your commits are all preserved)
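
The submodule fix described above could look roughly like this. It is a sketch, assuming Git is installed; the helper name `record_submodule_state` and the commit message are made up for illustration, and docs/artwork is the submodule path from the status output.

```shell
#!/bin/sh
# record_submodule_state (hypothetical helper): stage the commit that a
# submodule currently has checked out and record it in the superproject.
record_submodule_state() {
    repo=$1        # path to the superproject clone
    submodule=$2   # submodule path relative to the superproject
    git -C "$repo" add "$submodule" &&
        git -C "$repo" commit -m "Update $submodule to its most recent state"
}

# Example call from the root of a clone (commented out to avoid side effects):
# record_submodule_state . docs/artwork
```

After the commit, the superproject no longer reports "modified: docs/artwork (new commits)".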

I will be away from my keyboard for a bit, but will get to this PR later today. :)

gi114 · 2020-05-08T19:36:34Z

Hi @adswa,

I will close this PR so you can merge #479

Thanks for your help

extending data ecosystem

eee74c9

mih added the new chapter! label Mar 19, 2020

yarikoptic suggested changes Mar 19, 2020

adswa approved these changes Mar 20, 2020

adswa mentioned this pull request Apr 29, 2020

Start preparing for a 0.13 release #468

Closed

7 tasks

gi114 added 3 commits May 1, 2020 10:42

Merge remote-tracking branch 'upstream/master'

7cfae1b

Small fixes to title, subshapters and colors

a6a02d9

added globus response snippets

58a012a

gi114 closed this May 8, 2020
