PR for the Extending data ecosystem usecase #421
Conversation
Hi @gi114, There are a couple of things to fine-tune, still. As this is the largest PR we have yet received, we actually don't have a standard procedure in place on how to best approach a review. Would you be up to a short video call, e.g. via Zoom or jitsi? This could help me gauge how I could best assist you in finalizing this.
Just a cursory review; pointed out some minor issues.
.. code-block:: bash

   $ ls ll path/to/file
what is `ll`?
double-l (ll)? This command would show whether the file is annexed or not. It is used to compare the state before and after download.
it must be your alias or something like that -- it is not a standard option of `ls`. I guess you meant `-l` - then indeed you could see if those are symlinks. But in the case of git-annex operating on crippled filesystems without symlink support, or files being unlocked -- they wouldn't even be symlinks.
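To illustrate the distinction being discussed, here is a minimal sketch; the file name `data.nii.gz` and the object path are made up for demonstration and are not taken from the PR's dataset:

```shell
# Illustrative sketch only: hypothetical file names, not from the PR.
# In a standard (non-crippled) repository, annexed files are symlinks into
# .git/annex/objects, and `ls -l` (lowercase L, not "ll") reveals the target.
demo=$(mktemp -d)
cd "$demo"
mkdir -p .git/annex/objects
echo "big data" > .git/annex/objects/KEY
ln -s .git/annex/objects/KEY data.nii.gz

ls -l data.nii.gz   # prints something like: data.nii.gz -> .git/annex/objects/KEY

# A check that does not depend on ls output formatting. Note the caveat
# above: unlocked files or crippled filesystems yield plain files, not
# symlinks, so this alone cannot prove a file is not annexed.
[ -L data.nii.gz ] && echo "is a symlink"
```

The `[ -L ... ]` test is the scriptable equivalent of eyeballing the `->` in `ls -l` output.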
datasets across multiple locations are reduce the need of replicating data

The Datalad Approach
here and elsewhere, DataLad (not Datalad). I will submit a PR to tune up some missed hits in other docs
side and to respond back if data is available. Currently, git-annex-globus-remote
only supports data download operations but it could potentially be useful for additional
functionalities. When the globus-remote get initialized for the first time, the user
has to authenticate to Globus.org using ORCHiD, Gmail or a specific Globus accounts:
I would use `ORCID`, not `ORCiD`. The `i` with a dot is used only in the logo, on top of a full-sized capital I. They use ORCID elsewhere.
.. code-block:: bash

   $ datalad install -r <dataset>
It will not work on just any dataset. There should be a note that the dataset should (somehow, how?) be prepared and populated with information in the special remote about which files to get from Globus. Having an example dataset with access to data in Globus would be great.
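For context, a hedged sketch of what such preparation might look like, based only on git-annex's generic external special remote pattern; the remote name, package name, and the exact options accepted by git-annex-remote-globus are assumptions, not taken from this PR:

```shell
# HYPOTHETICAL sketch -- option values follow git-annex's generic external
# special remote convention; git-annex-remote-globus may differ.
pip install git-annex-remote-globus    # assumed package name (see PR text)
cd <dataset>

# Register an external special remote named "globus", backed by the
# git-annex-remote-globus helper (generic git-annex pattern):
git annex initremote globus type=external externaltype=globus encryption=none

# The dataset must then record, per file, where its content lives on Globus
# so that `datalad get <file>` knows to fetch it from this remote; exactly
# how that registration happens is the preparation step the use case
# should document.
```

A clone of such a prepared dataset would run `git annex enableremote globus` rather than `initremote`.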
Yes sounds good, will update on that
Hi @adswa, Thanks for your comment. Sure, Zoom would work. When would you be available? My timezone is (GMT-4)
I'm GMT+1, and I could make a call tomorrow during most of your morning & afternoon. I'll tentatively suggest 11am GMT-4 (which would be 4 pm for me), but feel free to suggest any time within approx. 4 hours before or after this, or a day that fits you better. Everyone here is in home office and time management is flexible. I'll post a Zoom link in here.
I'm in this Zoom meeting now: Adina Wagner is inviting you to a scheduled Zoom meeting. Topic: CONP Dataset hosting on Globus. Meeting ID: 390 795 640
Hi @gi114 I'm in the video call. I'll stay here a while longer, so join, if it fits your schedule. :)
I have added a few comments on your PR, but feel free to suggest in-person discussion via Zoom. :)
In general, I approve this PR. It's really cool to see how you use DataLad, and I'm very grateful that you took the time to contribute this usecase. So thank you very much for this!
The difficult aspect in reviewing such a PR is the subjectiveness of the review. It's easy to find bugs/better alternatives in code PRs, but it's really tough to edit a proposed chapter for a book, given that different people have different writing styles and idiosyncrasies (e.g., based on my writing it will sometimes be clear that my mother tongue is German, and based on your writing, I strongly suspect you are Italian).
I'm trying to achieve a balance between keeping your writing style, shaping the PR even more into the general usecase structure, and highlighting where I as a naive reader would need more information to understand something. For this, I have now added very "broad" comments (such as: a more specific title, or structuring parts of your writing into subsections of "Step-by-Step") that I think will make it easier to understand the usecase when applied. I also agree with @yarikoptic that the most understandable approach would be to point to a publicly accessible dataset (but that's of course only possible if there is one publicly accessible).
From my own experience, I know that writing book content can take a few iterations: improve the structure, fine-tune example code, fix typos... Given that you have already invested a lot of time, I definitely don't want to make this PR a dreadful experience. My suggestion would be: as long as you're happy to apply suggestions, let's work on this PR together. But if you're saying "It's been cool, but I really can't invest too much more time into this", I can take over what is left. :)
@@ -0,0 +1,151 @@
.. _usecase_extending_data_ecosystem:

Extending the data ecosystem
It can be hard to understand the contents of this usecase only from this title. I would go for something that emphasizes the nature of the use case - what is novel is how you use Globus.org to host data for the datasets used by CONP. One suggestion:
"Using Globus as a data store for the Canadian Open Neuroscience Portal"
Oh yes, that sounds good actually!
Extending the data ecosystem
----------------------------
To make your usecase findable in the index, you can add an index entry like this:
.. index:: ! Usecase; <name as you want it to appear in the index>
to retrieve actual files content, only on user need.

Users log into the CONP portal and install Datalad datasets with
``datalad install -r <dataset>`` to access annexed files (as mentioned
Personally I don't know what the CONP portal looks like (I'm envisioning a web interface). If you think it is helpful, you could provide a screenshot in the step-by-step section of how the portal looks for a user that wants to obtain a dataset.
If you create a screenshot, it is best to save it into Git in docs/artwork/src (it's a subdataset of this repo - adding it there will require a PR against the artwork repository). You can add a figure that lies in docs/artwork/src like this:

.. figure:: ../artwork/src/screenshot.jpg
Globus as git-annex data store
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To shape this usecase into the structure we try to maintain, I would suggest putting this section into the Step-by-Step section:
- 1.1. The Challenge
- 1.2. The DataLad Approach
- 1.3. Step-by-Step
- 1.3.1. Globus as git-annex data store
You can achieve the lower hierarchy by using `"""""` to underline the heading.
Step-by-Step
^^^^^^^^^^^^
I think this section could fit under a section ``From the perspective of a user`` (as a subsection of Step-by-Step):
- 1.1. The Challenge
- 1.2. The DataLad Approach
- 1.3. Step-by-Step
- 1.3.1. Globus as git-annex data store
- 1.3.2. From the perspective of a user
It always starts with a dataset:

.. code-block:: bash
The code blocks are the most helpful part of a usecase. The previous parts explained the problem (How can we work with large amounts of data?) and sketched a solution (Have datasets on Globus.org, use a globus special remote to get access, retrieve only the file content that you need), but here you can show very concretely how it can be done - the more comprehensive the code block is, the better.
Maybe you install an example dataset on your computer and copy-paste the complete output from the terminal in a code block? This would make it feel as if there actually is code execution, and could give readers a complete picture of how it would look from the perspective of a user. And instead of saying "path/to/file", maybe really get a file locally, and copy-paste the output?
If the dataset is publicly accessible (e.g., if you or anyone else can also install it), you could also create executable code snippets like this:

.. runrecord:: _examples/globus-101
   :workdir: usecases/globus
   :language: console

   $ datalad install -r <url>

.. runrecord:: _examples/globus-102
   :workdir: usecases/globus/<name_of_dataset>
   :language: console

   $ ls -l <file>

This would work for all snippets but the second (with pip install and the special remote setup) -- but only if it's publicly available.
$ git-annex-remore-globus setup

We can see that most of the files in the dataset are annexed, highlighted in red. You can check
the symlink for a given file by running
I think you refer to the syntax highlighting of some shells here, right? My shell is also configured to do this, but not all are. Many people in our institute, for example, only use black-and-white shells that do not color-code symlinks any differently from normal files. It would be easier to explain what you mean with a screenshot, or with the output of an `ls -l` that shows the symlinks.
Thanks a lot for the updates! I see where the problem lies with Travis, the submodule is not in its most recent state. If you obtained this repository with
running
I will be away from my keyboard for a bit, but will get to this PR later today. :) |
Hi,
This PR proposes a new use case about using the git-annex globus special remote to retrieve dataset files that exist on Globus.org via
datalad get file
Special remotes are a feature of git-annex, and the addition of a Globus use case would highlight the growing data availability for researchers enabled by DataLad within the scientific data ecosystem.