Add walkthrough for using S3 as a special remote #721

Merged: 8 commits, Jun 10, 2021

jsheunis commented

@adswa This builds successfully on my Mac. I decided to work with a separate simple dataset and not DataLad-101.

Would be good to have you (and/or more people) running through this.

One thing I was uncertain about is when to include term tags and hyperlinks. You'll see that I include them for terms at the start and tend to exclude them later in the doc. Not sure what our approach should be here, i.e. if we should just always include term tags when they are used, or not.

I've run the PNGs through optipng.

adswa commented May 21, 2021

I will get myself an S3 account and walk through it. Thanks a lot for all your work already!

adswa left a comment

I like it, and it works! Thank you! :) I have added a round of comments, with the most severe being a suggestion for using downloadable example data. Let me know what you think. If you don't want to redo the publication but stick to the new example data, I can push a screenshot of the S3 bucket contents.


Your DataLad dataset
^^^^^^^^^^^^^^^^^^^^
For this walkthrough, we are using a basic sample neuroimaging dataset with

adswa commented:

I think it would be cool to get some example data from somewhere to make this a copy-paste-working example. SPM has some small example datasets: https://www.fil.ion.ucl.ac.uk/spm/data/auditory/.

The command below

mkdir neuro-data-s3 && \
   wget https://www.fil.ion.ucl.ac.uk/spm/download/data/MoAEpilot/MoAEpilot.bids.zip -O neuro-data-s3.zip && \
   unzip neuro-data-s3.zip -d neuro-data-s3 && \
   rm neuro-data-s3.zip  

should result in this directory structure:

$ tree neuro-data-s3 
neuro-data-s3
└── MoAEpilot
    ├── CHANGES
    ├── dataset_description.json
    ├── README
    ├── sub-01
    │   ├── anat
    │   │   └── sub-01_T1w.nii
    │   └── func
    │       ├── sub-01_task-auditory_bold.nii
    │       └── sub-01_task-auditory_events.tsv
    └── task-auditory_bold.json

(could be tweaked to not include the MoAEpilot parent directory). It's a total of 50 MB of data, and anyone could download it to get the exact files the walkthrough uses.

jsheunis (author) commented:

I like the suggestion and it's a great dataset for this use case, thanks!
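
The tweak mentioned in the review (dropping the MoAEpilot parent directory) could look roughly like the sketch below. The helper name `flatten_single_subdir` is hypothetical, and the sketch assumes the archive unpacks into exactly one top-level directory without hidden files, as in the tree shown above.

```shell
# Hypothetical helper: move the contents of the single top-level
# directory (here: MoAEpilot) up one level and remove the empty parent.
flatten_single_subdir() {
    target=$1                      # e.g. neuro-data-s3
    inner=$2                       # e.g. MoAEpilot
    # Assumes there are no hidden files in "$target/$inner". If there are,
    # rmdir fails instead of deleting anything.
    mv "$target/$inner"/* "$target"/ &&
    rmdir "$target/$inner"
}

# Usage after the download and unzip steps:
#   flatten_single_subdir neuro-data-s3 MoAEpilot
```

Using `rmdir` rather than `rm -r` is a deliberate choice here: it refuses to remove a directory that still has content.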


$ cd <wherever-you-want-to-create-the-dataset>
$ mkdir neuro-data-s3
$ cd neuro-data-s3

adswa commented:

With the wget command, this code block could lose the mkdir and cp commands.

Comment on lines 103 to 105
If you already have a DataLad dataset, navigate to its root directory. If not, create a
new directory, navigate to it, copy your data, turn the directory into a DataLad dataset
with :command:`datalad create --force`, and lastly save the dataset with :command:`datalad save`:

adswa commented:

I suspect many people will read this walkthrough without having read much else from the handbook, and may get confused at this point when they have to figure out whether they already have a dataset. Someone who is fully unfamiliar with DataLad concepts could be unsure which commands from the following code block to copy, or come up with funky alternative interpretations (I am thinking of someone who has an unrelated dataset and now saves random new changes in it). We could make the distinction between "I already have a dataset" and "let's create example data together" at the start of the subsection (under the "Your DataLad dataset" heading), with something like "When you already have a small dataset to practice with, feel free to use it. For a general introduction, we now download data from a small neuroimaging dataset and transform it into a DataLad dataset."

People without a dataset can commit to copying code snippets right away; others with a concrete use case or existing datasets know which code blocks to skip.

jsheunis (author) commented:

Very good point, I will update accordingly.
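
The two cases discussed above ("I already have a dataset" versus "create example data") could be told apart with a sketch like the following. Both function names are hypothetical, and checking for `.datalad/config` is a heuristic assumed for this sketch, not something the walkthrough prescribes. The `datalad create --force` and `datalad save` calls are the ones named in the reviewed text.

```shell
# Heuristic (an assumption of this sketch): a DataLad dataset keeps its
# configuration in .datalad/config at the dataset root.
is_datalad_dataset() {
    [ -f "$1/.datalad/config" ]
}

# Turn a directory that already contains data into a dataset,
# unless it already is one.
ensure_dataset() {
    dir=$1
    if is_datalad_dataset "$dir"; then
        echo "existing dataset: $dir"
    else
        # --force is needed because the directory is not empty
        datalad create --force "$dir" &&
        datalad save -d "$dir" -m "Add example data"
    fi
}

# Usage:
#   ensure_dataset neuro-data-s3
```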

docs/basics/101-139-s3.rst (outdated, resolved)
defines the underlying transport of your files to and/or from a specific location.

In this section, we provide a walkthrough on how to set up Amazon S3 for hosting
your DataLad dataset, and how to access this data locally from GitHub.

adswa commented:

I don't know where exactly the right place would be, but I think it's worth highlighting in an importantnote that using AWS can potentially result in costs and that it isn't necessarily a free service.

jsheunis (author) commented:

Good point. Perhaps it fits best where we mention an Amazon account in the prerequisites. I will find a place for it there.

to "Buckets" to see your newly created bucket. It should only have a single
``annex-uuid`` file as content, since no actual file content has been pushed yet.

.. figure:: ../artwork/src/aws_s3_bucket_empty.png

adswa commented:

Really great to have this screenshot here.


Lastly, for git-annex to be able to download files from the bucket without requiring your
AWS credentials, it needs to know where to find the bucket. We do this by setting the bucket
URL, which takes a standard format and can also be copied from your AWS console:

adswa commented:

It took me an embarrassingly long time to find the "Copy URL" button, and then it only copied the URL with an "annex-UUID" suffix. Could we add the name of the button to press or how to find it (maybe even in brackets) to make it easier to search for it?

adswa commented:

Aaahh, sorry, I only understood now that I can simply copy the code snippet because you're using the environment variable 🤦‍♀️

jsheunis (author) commented:

Indeed. But if it wasn't immediately obvious, I can add some more words to make it obvious.
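
To make the point of this thread concrete, here is a minimal sketch of deriving the bucket URL from an environment variable instead of copying it from the AWS console. It assumes the virtual-hosted-style address `https://<bucket>.s3.amazonaws.com`; the helper `bucket_url` and the bucket name are hypothetical, and the `publicurl` parameter in the final comment is how I understand the git-annex S3 special remote to be configured, so check it against the git-annex documentation.

```shell
# Build the public URL of a bucket from its name.
# Assumes the virtual-hosted-style address https://<bucket>.s3.amazonaws.com.
bucket_url() {
    printf 'https://%s.s3.amazonaws.com\n' "$1"
}

BUCKET=my-example-bucket   # hypothetical bucket name
bucket_url "$BUCKET"

# The URL could then be registered with the special remote, e.g.:
#   git annex enableremote public-s3 publicurl="$(bucket_url "$BUCKET")"
```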


.. code-block:: bash

$ datalad create-sibling-github -d . neuro-data-s3 \

adswa commented:

Judging from "For consistency, we'll give the GitHub sibling the same name as the dataset name", I think you meant to write

Suggested change:
- $ datalad create-sibling-github -d . neuro-data-s3 \
+ $ datalad create-sibling-github -d . -s neuro-data-s3 \

(-s/--name). With the code as it is, the sibling is called "github", which is the default sibling name for a github sibling if no -s/--name flag is supplied. Smells a lot like a UX issue that datalad just ignores neuro-data-s3 without warning, though.

jsheunis (author) commented:

Actually, I was referring to the name with which the GitHub repo is created, not the name of the sibling known to DataLad. Unless my memory is playing tricks on me, I could change this parameter to change the repo name. I'll run this again to make sure, if only to re-educate myself.

jsheunis (author) commented:

Indeed, the -s flag sets the sibling name, while a name given without the -s flag sets the name of the GitHub repository.
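
The distinction settled in this thread can be sketched as a small wrapper that keeps the two names apart. The function `create_github_sibling` and the `DRY_RUN` switch are hypothetical conveniences for illustration; only the `-d`, `-s` and positional repository-name arguments come from the discussion above.

```shell
# Hypothetical helper: prints (default) or runs the command so that the
# sibling name and the GitHub repository name stay visibly separate.
create_github_sibling() {
    sibling_name=$1   # name DataLad uses locally (-s/--name)
    repo_name=$2      # name of the repository created on GitHub (positional)
    set -- datalad create-sibling-github -d . -s "$sibling_name" "$repo_name"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"     # dry run: only show what would be executed
    else
        "$@"
    fi
}

# Sibling called "github", repository called "neuro-data-s3":
create_github_sibling github neuro-data-s3
```

With `DRY_RUN=0` the command is actually executed, which requires DataLad and GitHub credentials.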

$ datalad siblings
.: here(+) [git]
.: public-s3(+) [git]
.: github(-) [https://github.com/jsheunis/sample-neuro-data.git (git)]

adswa commented:

See the comment about the sibling name above: if you change it there, adjust the sibling name here. If you don't change it, it's maybe worthwhile to point out that unnamed GitHub siblings are automatically called github.

adswa commented May 25, 2021

> One thing I was uncertain about is when to include term tags and hyperlinks. You'll see that I include them for terms at the start and tend to exclude them later in the doc. Not sure what our approach should be here, i.e. if we should just always include term tags when they are used, or not.

I am not consistent in that either. I, too, tend to add them at the start of the page, but don't repeat them afterwards. At my first read, I didn't spot any terms/hyperlinks I would miss

The first step is to ensure that you have a valid DataLad dataset,
with ``main`` as the default branch.

.. importantnote:: Ensure main is set as default branch for newly-created repositories

adswa commented:

#722 adds an FAQ on this that you could link to using

 :ref:`some random text of your choice <gitannexdefault>`

jsheunis (author) commented:

Great, thanks!
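
For readers wondering what "main as the default branch" means in practice, here is a minimal sketch using plain Git. It assumes Git 2.28 or newer, where the `init.defaultBranch` setting exists; the two function names are hypothetical.

```shell
# Create a repository whose initial branch is "main" (needs Git >= 2.28).
# -c applies the setting to this one command only; a permanent version is
#   git config --global init.defaultBranch main
init_with_main() {
    git -c init.defaultBranch=main init --quiet "$1"
}

# Print the branch a repository currently has checked out.
current_branch() {
    git -C "$1" symbolic-ref --short HEAD
}

# Usage:
#   init_with_main demo-repo && current_branch demo-repo
```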

adswa commented Jun 9, 2021

Sorry, I remembered this PR and that I forgot to check in. I'll build it and review it tomorrow!

adswa commented Jun 10, 2021

I think this is great! Thanks A LOT for writing this up! Do you think this is ready to go? (i.e., not WIP anymore?)
@all-contributors please add @jsheunis for content, example

allcontributors commented

@adswa

I've put up a pull request to add @jsheunis! 🎉

jsheunis (author) commented

Yup I think it's ready to go 👍

adswa commented Jun 10, 2021

Great! Can you merge the master branch into this branch and push it to fix the remaining conflicts? I can't push to this branch :)
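
The request above (merge the base branch into the PR branch, then push) could be carried out roughly as follows. The remote name `upstream`, the default branch name `master` and the function name `update_pr_branch` are assumptions of this sketch; adjust them to the local setup.

```shell
# Sketch: bring the currently checked-out PR branch up to date with the
# base branch of the upstream repository.
update_pr_branch() {
    remote=${1:-upstream}   # assumed name of the remote for the main repo
    base=${2:-master}       # base branch of the pull request
    git fetch "$remote" &&
    # --no-edit accepts the default merge message; the merge stops here
    # if there are conflicts that need manual resolution.
    git merge --no-edit "$remote/$base"
}

# After resolving any conflicts (git add, git commit), publish the result:
#   git push origin <your-pr-branch>
```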

@jsheunis jsheunis changed the title [WIP] Add walkthrough for using S3 as a special remote Add walkthrough for using S3 as a special remote Jun 10, 2021
@adswa adswa merged commit 406566e into datalad-handbook:master Jun 10, 2021