PR for the Extending data ecosystem usecase #421

Closed
wants to merge 4 commits into from

@gi114 gi114 commented Mar 19, 2020

Hi,

This PR proposes a new use case: using the git-annex globus special remote to retrieve dataset files hosted on Globus.org via datalad get file.

Special remotes are provided by git-annex, and adding a Globus use case would highlight the growing data availability that DataLad enables for researchers within the scientific data ecosystem.

adswa commented Mar 19, 2020

Hi @gi114,
Wow, awesome! Thanks a lot for your PR. I'm very excited about this, and I think this would be a great addition to the handbook. :)

There are a couple of things to fine-tune, still. As this is the largest PR we have yet received, we actually don't have a standard procedure in place on how to best approach a review. Would you be up to a short video call, e.g. via Zoom or jitsi? This could help me gauge how I could best assist you in finalizing this.

@yarikoptic yarikoptic left a comment

Just cursory review, pointed to some minor issues


.. code-block:: bash

$ ls ll path/to/file
Contributor

what is ll ?

Contributor Author

double-l (ll)? This command would show whether the file is annexed or not. It is used to compare before and after download

Contributor

it must be your alias or something like that -- it is not a standard option of ls. I guess you meant -l - then indeed you could see if those are symlinks. But in cases of git-annex operating on crippled filesystems without symlinks support, or files being unlocked -- they wouldn't even be symlinks.
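
To illustrate the point about symlinks, here is a minimal sketch that does not involve git-annex at all. It mimics an annexed file with a plain symlink in a temporary directory and checks it with `ls -l` and the POSIX `-L` test. The directory layout, the file names, and the `describe_file` helper are assumptions made up for this illustration. On a crippled filesystem or with unlocked files, the annexed file would be reported as a regular file instead.

```shell
#!/bin/sh
# Hypothetical demo: mimic an annexed file with a plain symlink (no git-annex needed).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/.git/annex/objects"
echo "file content" > "$demo_dir/.git/annex/objects/payload"
ln -s .git/annex/objects/payload "$demo_dir/data.nii"
echo "plain file" > "$demo_dir/README"

# describe_file PATH: report whether PATH is a symlink, a regular file, or missing.
describe_file() {
    if [ -L "$1" ]; then
        echo "symlink"
    elif [ -f "$1" ]; then
        echo "regular"
    else
        echo "missing"
    fi
}

# In the long listing, a symlink is shown with an arrow to its target.
ls -l "$demo_dir"
describe_file "$demo_dir/data.nii"
describe_file "$demo_dir/README"
```

The `-L` test has to come before `-f`, because `-f` follows symlinks and would also succeed for a link that points to a regular file.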

datasets across multiple locations are reduce the need of replicating data


The Datalad Approach
Contributor

here and elsewhere, DataLad (not Datalad). I will submit a PR to tune up some missed hits in other docs

side and to respond back if data is available. Currently, git-annex-globus-remote
only supports data download operations but it could potentially be useful for additional
functionalities. When the globus-remote get initialized for the first time, the user
has to authenticate to Globus.org using ORCHiD, Gmail or a specific Globus accounts:
Contributor

I would use ORCID, not ORCiD. The dotted i appears only in the logo, where it sits in place of the full-sized capital I. They use ORCID everywhere else.


.. code-block:: bash

$ datalad install -r <dataset>
Contributor

It will not work on just any dataset. There should be a note that the dataset must be prepared (somehow, how?) and populated with information in the special remote about which files to get from Globus. Having an example dataset with access to data in Globus would be great.

Contributor Author

Yes sounds good, will update on that

gi114 commented Mar 19, 2020

Hi @adswa,

Thanks for your comment. Sure, Zoom would work. When would you be available? My timezone is (GMT-4)

adswa commented Mar 19, 2020

I'm GMT+1, and I could make a call tomorrow during most of your morning & afternoon. I'll tentatively suggest 11 AM GMT-4 (would be 4 PM for me), but feel free to suggest any time between approx. 4 hours before or after this, or a day that fits you better. Everyone here is in home office and time management is flexible. I'll post a Zoom link in here.

adswa commented Mar 20, 2020

I'm in this Zoom meeting now:

Adina Wagner is inviting you to a scheduled Zoom meeting.

Topic: CONP Dataset hosting on Globus
Time: Mar 20, 2020 04:00 PM Amsterdam, Berlin, Rome, Stockholm, Vienna

Join Zoom Meeting
https://zoom.us/j/390795640

Meeting ID: 390 795 640

One tap mobile
,,390795640# US Toll

Dial by your location
US Toll

Meeting ID: 390 795 640
Find your local number: https://zoom.us/u/abmnSiR7iO

adswa commented Mar 20, 2020

Hi @gi114 I'm in the video call. I'll stay here a while longer, so join, if it fits your schedule. :)

@adswa adswa left a comment

I have added a few comments on your PR, but feel free to suggest in-person discussion via Zoom. :)

In general, I approve this PR. It's really cool to see how you use DataLad, and I'm very grateful that you took the time to contribute this usecase. So thank you very much for this!

The difficult aspect in reviewing such a PR is the subjectiveness of the review. It's easy to find bugs/better alternatives in code PRs, but it's really tough to edit a proposed chapter for a book given that different people have different writing styles and idiosyncrasies (e.g., based on my writing it will sometimes be clear that my mother tongue is German, and based on your writing, I strongly suspect you are Italian).

I'm trying to achieve a balance between keeping your writing style, shaping the PR even more into the general usecase structure, and highlighting where I as a naive reader would need more information to understand something. For this, I have now added very "broad" comments (such as: A more specific title, or structuring parts of your writing into subsections of "Step-by-step") that I think will make it easier to understand the usecase when applied. I also agree with @yarikoptic that the most understandable approach would be to point to a publicly accessible dataset (but that's of course only possible if there is one publicly accessible).

From my own experience, I know that writing book content can take a few iterations: improve the structure, fine-tune example code, fix typos, and so on. Given that you have already invested a lot of time, I definitely don't want to make this PR a dreadful experience. My suggestion would be: As long as you're happy to apply suggestions, let's work on this PR together. But if you're saying "It's been cool, but I really can't invest too much more time into this", I can take over what is left. :)

@@ -0,0 +1,151 @@
.. _usecase_extending_data_ecosystem:

Extending the data ecosystem
Contributor

It can be hard to understand the contents of this usecase only from this title. I would go for something that emphasizes the nature of the use case - what is novel is how you use Globus.org to host data for the datasets used by CONP. One suggestion:
"Using Globus as a data store for the Canadian Open Neuroscience Portal"

Contributor Author

Oh yes, sounds good actually !


Extending the data ecosystem
----------------------------

Contributor

To make your usecase findable in the index, you can add an index entry like this:

.. index:: ! Usecase; <name as you want it to appear in the index>

to retrieve actual files content, only on user need.

Users log into the CONP portal and install Datalad datasets with
``datalad install -r <dataset>`` to access annexed files (as mentioned
Contributor

Personally, I don't know what the CONP portal looks like (I'm envisioning a web interface). If you think it is helpful, you could provide a screenshot in the step-by-step section showing what the portal looks like for a user who wants to obtain a dataset.

If you create a screenshot, it is best to save it into Git in docs/artwork/src (it's a subdataset of this repo, so adding it there will require a PR against the artwork repository). You can add a figure that lies in docs/artwork/src like this:

.. figure:: ../artwork/src/screenshot.jpg


Globus as git-annex data store
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Contributor

To shape this usecase into the structure we try to maintain, I would suggest moving this section under the Step-by-Step section:

  • 1.1. The Challenge
  • 1.2. The DataLad Approach
  • 1.3. Step-by-Step
    • 1.3.1. Globus as git-annex data store

You can achieve the lower hierarchy by using """"" to underline the heading.


Step-by-Step
^^^^^^^^^^^^

Contributor

I think this section could fit under a section From the perspective of a user (as a subsection of step-by-step):

  • 1.1. The Challenge
  • 1.2. The DataLad Approach
  • 1.3. Step-by-Step
    • 1.3.1. Globus as git-annex data store
    • 1.3.2. From the perspective of a user


It always starts with a dataset:

.. code-block:: bash
Contributor

The code blocks are the most helpful part of a usecase. The previous parts explained the problem (How can we work with large amounts of data?) and sketched a solution (Have a dataset on Globus.org, use a globus special remote to get access, retrieve only the file content that you need), but here you can show very concretely how it can be done. The more comprehensive the code block is, the better.

Maybe you could install an example dataset on your computer and copy-paste the complete output from the terminal into a code block? This would make it feel as if there actually is code execution, and could give readers a complete picture of what it would look like from the perspective of a user. And instead of saying "path/to/file", maybe really get a file locally, and copy-paste the output?

If the dataset is publicly accessible (e.g., if I or anyone else can also install it), you could also create executable code snippets like this:

.. runrecord:: _examples/globus-101
   :workdir: usecases/globus
   :language: console

   $ datalad install -r <url>

.. runrecord:: _examples/globus-102
   :workdir: usecases/globus/<name_of_dataset>
   :language: console

   $ ls -l <file>

This would work for all snippets but the second (with pip install and the special remote setup), but only if it's publicly available.

$ git-annex-remote-globus setup

We can see that most of the files in the dataset are annexed, highlighted in red. You can check
the symlink for a given file by running
Contributor

I think you refer to the syntax highlighting of some shells here, right? My shells are also configured to do this, but not all are. Many people in our institute, for example, only use black-and-white shells that do not color-code symlinks any differently from normal files. It would be easier to explain what you mean with a screenshot, or with the output of an ls -l that shows the symlinks.
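
As one color-independent option, the sketch below lists symlinks with `find -type l` and shows the `->` arrow that `ls -l` prints for a symlink in any terminal. It runs in a throwaway directory without git-annex. The file names and the `list_symlinks` helper are hypothetical and only stand in for annexed files in a real dataset.

```shell
#!/bin/sh
# Hypothetical demo directory: one symlink (standing in for an annexed file)
# and one regular file. No git-annex is involved here.
work_dir=$(mktemp -d)
echo "content" > "$work_dir/stored_object"
ln -s stored_object "$work_dir/annexed_file.dat"
echo "notes" > "$work_dir/notes.txt"

# list_symlinks DIR: print the paths of all symlinks below DIR,
# relative to DIR and sorted.
list_symlinks() {
    find "$1" -type l | sed "s|^$1/||" | sort
}

list_symlinks "$work_dir"

# The "->" arrow in the long listing marks a symlink, with or without colors.
ls -l "$work_dir/annexed_file.dat"
```

In a real dataset, `.git` would usually be excluded from such a search. That is left out here to keep the sketch short.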

@adswa adswa mentioned this pull request Apr 29, 2020
adswa commented May 8, 2020

Thanks a lot for the updates! I see where the problem lies with Travis: the submodule is not in its most recent state. If you obtained this repository with datalad clone and run datalad status, I predict that you will see something like this:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   docs/artwork (new commits)

Running git add docs/artwork and committing should fix this. Unfortunately, I can't push to your PR, but I have added two commits with fixes onto your branch and submitted a new PR in #479 (all of your commits are still in the new PR). You could either:

  • cherry-pick my commits or redo them in your clone
  • close this PR in favor of Tmp globus #479, and I'll merge Tmp globus #479 instead (your commits are all preserved)

I will be away from my keyboard for a bit, but will get to this PR later today. :)

gi114 commented May 8, 2020

Hi @adswa,

I will close this PR so you can merge #479

Thanks for your help

@gi114 gi114 closed this May 8, 2020