Add walkthrough for using S3 as a special remote #721

jsheunis · 2021-05-21T09:53:02Z

@adswa This builds successfully on my Mac. I decided to work with a separate simple dataset and not DataLad-101.

Would be good to have you (and/or more people) running through this.

One thing I was uncertain about is when to include term tags and hyperlinks. You'll see that I include them for terms at the start and tend to exclude them later in the doc. Not sure what our approach should be here, i.e. if we should just always include term tags when they are used, or not.

I've run the PNGs through optipng.

adswa · 2021-05-21T13:10:22Z

I will get myself an S3 account and walk through it - thanks a lot for all your work already!

adswa

I like it, and it works! Thank you! :) I have added a round of comments, with the most severe being a suggestion for using downloadable example data. Let me know what you think. If you don't want to redo the publication but stick to the new example data, I can push a screenshot of the S3 bucket contents.

adswa · 2021-05-25T06:44:19Z

docs/basics/101-139-s3.rst

+
+Your DataLad dataset
+^^^^^^^^^^^^^^^^^^^^
+For this walkthrough, we are using a basic sample neuroimaging dataset with


I think it would be cool to get some example data from somewhere to make this a copy-paste-working example. SPM has some small example datasets: https://www.fil.ion.ucl.ac.uk/spm/data/auditory/.

The command below

    mkdir neuro-data-s3 && \
    wget https://www.fil.ion.ucl.ac.uk/spm/download/data/MoAEpilot/MoAEpilot.bids.zip -O neuro-data-s3.zip && \
    unzip neuro-data-s3.zip -d neuro-data-s3 && \
    rm neuro-data-s3.zip

should result in this directory structure:

    $ tree neuro-data-s3
    neuro-data-s3
    └── MoAEpilot
        ├── CHANGES
        ├── dataset_description.json
        ├── README
        ├── sub-01
        │   ├── anat
        │   │   └── sub-01_T1w.nii
        │   └── func
        │       ├── sub-01_task-auditory_bold.nii
        │       └── sub-01_task-auditory_events.tsv
        └── task-auditory_bold.json

(could be tweaked to not include the MoAEpilot parent directory). It's a total of 50MB of data, and anyone could download it to get the exact files the walkthrough uses.

I like the suggestion and it's a great dataset for this use case, thanks!
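
The tweak mentioned above (dropping the MoAEpilot parent directory) could be sketched roughly as follows. This is only an illustration: the helper name `flatten_wrapper` is made up, and it assumes the archive unpacks into exactly one wrapper directory without hidden files.

```bash
# Hypothetical helper: move everything inside the wrapper directory
# (e.g. MoAEpilot) up one level, then remove the now-empty wrapper.
flatten_wrapper() {
    target="$1"    # directory the archive was unpacked into, e.g. neuro-data-s3
    wrapper="$2"   # wrapper directory created by the archive, e.g. MoAEpilot
    for entry in "$target/$wrapper"/*; do
        [ -e "$entry" ] || continue   # nothing matched the glob
        mv "$entry" "$target"/
    done
    # rmdir only succeeds on an empty directory, so leftover hidden files
    # make this step fail instead of being deleted silently
    rmdir "$target/$wrapper"
}

# Usage after the download/unzip step:
#   flatten_wrapper neuro-data-s3 MoAEpilot
```

Using `rmdir` rather than `rm -r` is a deliberate choice here: if anything was not moved, the command fails loudly and no data is lost.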

adswa · 2021-05-25T06:47:59Z

docs/basics/101-139-s3.rst

+
+   $ cd <wherever-you-want-to-create-the-dataset>
+   $ mkdir neuro-data-s3
+   $ cd neuro-data-s3


With the wget command, this code block could lose the mkdir and cp commands.

adswa · 2021-05-25T07:06:26Z

docs/basics/101-139-s3.rst

+If you already have a DataLad dataset, navigate to its root directory. If not, create a
+new directory, navigate to it, copy your data, turn the directory into a DataLad dataset
+with :command:`datalad create --force`, and lastly save the dataset with :command:`datalad save`:


I suspect many people will read this walkthrough without having read much else from the handbook, and may get confused at this point when they have to figure out whether they already have a dataset. Someone who is fully unfamiliar with any DataLad concept could be unsure which commands from the following code block to copy, or come up with funky alternative interpretations (I am thinking of someone who has an unrelated dataset, and now saves random new changes in it). We could make the distinction between "I already have a dataset" and "let's create example data together" at the start of the subsection (under the "Your DataLad dataset" heading), with something like "If you already have a small dataset to practice with, feel free to use it. For a general introduction, we now download data from a small neuroimaging dataset and transform it into a DataLad dataset."

People without a dataset can commit to copying code snippets right away; others with a concrete use case or existing datasets know which code blocks to skip.

Very good point, I will update accordingly.
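
For readers starting from the downloaded example data, the "create, then save" step discussed in this thread might look like the sketch below. It assumes `datalad` is installed; the function name and the commit message are placeholders, not taken from the walkthrough.

```bash
# Sketch: turn an already-populated directory into a DataLad dataset.
# The function name and the commit message are hypothetical.
init_dataset_in_place() {
    dir="$1"
    (
        cd "$dir" || exit 1
        # --force allows `datalad create` to run in a non-empty directory
        datalad create --force . &&
        # record the files that are already present as the first saved state
        datalad save -m "Add example data"
    )
}

# Usage:
#   init_dataset_in_place neuro-data-s3
```

Running the commands in a subshell keeps the caller's working directory unchanged, which matters when the snippet is pasted into an interactive session.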

adswa · 2021-05-25T07:30:01Z

docs/basics/101-139-s3.rst

+   defines the underlying transport of your files to and/or from a specific location.
+
+In this section, we provide a walkthrough on how to set up Amazon S3 for hosting
+your DataLad dataset, and how to access this data locally from GitHub.


I don't know where exactly the right place would be, but I think it's worth highlighting in an importantnote that using AWS can result in costs and that it isn't necessarily a free service.

Good point. Perhaps it fits best where we mention an Amazon account in the prerequisites. I will find a place for it there.

adswa · 2021-05-25T07:33:46Z

docs/basics/101-139-s3.rst

+to "Buckets" to see your newly created bucket. It should only have a single 
+``annex-uuid`` file as content, since no actual file content has been pushed yet.
+
+.. figure:: ../artwork/src/aws_s3_bucket_empty.png


really great to have this screenshot here

adswa · 2021-05-25T07:37:15Z

docs/basics/101-139-s3.rst

+
+Lastly, for git-annex to be able to download files from the bucket without requiring your
+AWS credentials, it needs to know where to find the bucket. We do this by setting the bucket
+URL, which takes a standard format and can also be copied from your AWS console:


It took me an embarrassingly long time to find the "Copy URL" button, and then it only copied the URL with an "annex-UUID" suffix. Could we add the name of the button to press or how to find it (maybe even in brackets) to make it easier to search for it?

Aaahh, sorry, I only understood now that I can simply copy the code snippet because you're using the environment variable 🤦‍♀️

Indeed. But if it wasn't immediately obvious, I can add some more words to make it obvious.
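
To make the "standard format" of the bucket URL explicit, here is a minimal sketch that derives it from the bucket name. It assumes the virtual-hosted-style form `https://<bucket>.s3.amazonaws.com`. The variable name `BUCKET`, the helper name, and the example bucket name are assumptions, since the thread does not show the snippet's environment variable.

```bash
# Build the public URL for a bucket, assuming the virtual-hosted-style
# form https://<bucket>.s3.amazonaws.com (helper name is hypothetical).
bucket_public_url() {
    printf 'https://%s.s3.amazonaws.com\n' "$1"
}

# Hypothetical usage; BUCKET and the bucket name are assumptions,
# public-s3 is the remote name shown in the `datalad siblings` output:
#   BUCKET=sample-neuro-data
#   git annex enableremote public-s3 publicurl="$(bucket_public_url "$BUCKET")"
```

Deriving the URL from the variable avoids the copy-from-console step entirely, which is the point made in the replies above.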

adswa · 2021-05-25T07:45:16Z

docs/basics/101-139-s3.rst

+
+.. code-block:: bash
+
+   $ datalad create-sibling-github -d . neuro-data-s3 \


Judging from "For consistency, we'll give the GitHub sibling the same name as the dataset name", I think you meant to write

Suggested change

-   $ datalad create-sibling-github -d . neuro-data-s3 \

+   $ datalad create-sibling-github -d . -s neuro-data-s3 \

(-s/--name). With the code as it is, the sibling is called "github", which is the default sibling name for a github sibling if no -s/--name flag is supplied. Smells a lot like a UX issue that datalad just ignores neuro-data-s3 without warning, though.

Actually, I was referring to the name with which the GitHub repo is created, not the name of the sibling known to DataLad. Unless my memory is playing tricks, I could change this parameter to change the repo name. I'll run this again to make sure, if only to re-educate myself.

Indeed, the -s flag sets the sibling name, while a name given without the -s flag sets the name of the GitHub repository.
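
The distinction settled in this thread can be sketched as a small wrapper that sets both names explicitly, so neither falls back to a default. It assumes `datalad` is installed and GitHub credentials are configured; the function name is hypothetical, and any further options from the original command are left out.

```bash
# Sketch: pass the repository name and the sibling name separately.
create_github_sibling() {
    repo_name="$1"      # name of the repository created on GitHub (positional argument)
    sibling_name="$2"   # name under which DataLad knows the sibling (-s/--name)
    datalad create-sibling-github -d . -s "$sibling_name" "$repo_name"
}

# Usage: a repository called neuro-data-s3, known locally as "github"
#   create_github_sibling neuro-data-s3 github
```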

adswa · 2021-05-25T07:50:51Z

docs/basics/101-139-s3.rst

+   $ datalad siblings
+   .: here(+) [git]
+   .: public-s3(+) [git]
+   .: github(-) [https://github.com/jsheunis/sample-neuro-data.git (git)]


See the comment about the sibling name above: if you change it there, adjust the sibling name here. If you don't change it, it's maybe worthwhile to point out that unnamed GitHub siblings are automatically called github.
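
To check which sibling names a dataset ended up with, the output format quoted above can be reduced to the names alone. This is a rough sketch that assumes each line looks like `.: name(+) [...]`, as in the excerpt; the helper name is made up.

```bash
# Extract sibling names from `datalad siblings`-style output lines such as
#   .: github(-) [https://github.com/... (git)]
sibling_names() {
    # drop the leading "<path>: " and everything from the first "(" onwards
    sed -E 's/^[^:]*: ([^(]+)\(.*/\1/'
}

# Usage:
#   datalad siblings | sibling_names
```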

adswa · 2021-05-25T08:02:26Z

One thing I was uncertain about is when to include term tags and hyperlinks. You'll see that I include them for terms at the start and tend to exclude them later in the doc. Not sure what our approach should be here, i.e. if we should just always include term tags when they are used, or not.

I am not consistent in that either. I, too, tend to add them at the start of the page, but don't repeat them afterwards. On my first read, I didn't spot any terms/hyperlinks I would miss.

adswa · 2021-05-25T09:32:56Z

docs/basics/101-139-s3.rst

+The first step is to ensure that you have a valid DataLad dataset,
+with ``main`` as the default branch.
+
+.. importantnote:: Ensure main is set as default branch for newly-created repositories


#722 adds an FAQ on this that you could link to using

:ref:`some random text of your choice <gitannexdefault>`

Great, thanks!
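
For context on the note being reviewed, one way to make ``main`` the default branch for newly created repositories is Git's `init.defaultBranch` setting (available since Git 2.28). The sketch below is an assumption about what such a step could look like, not a copy of the walkthrough's note; the function name is hypothetical.

```bash
# Sketch: set init.defaultBranch to "main" globally unless it already is.
ensure_main_default() {
    # --get prints the current value, or nothing if the setting is absent
    current=$(git config --global --get init.defaultBranch || true)
    if [ "$current" != "main" ]; then
        git config --global init.defaultBranch main
    fi
}

# Usage:
#   ensure_main_default
```

The setting only affects repositories created afterwards; existing repositories keep their current branch names.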

Co-authored-by: Adina Wagner <adina.wagner@t-online.de>

adswa · 2021-06-09T15:37:45Z

Sorry, I remembered this PR and that I forgot to check in. I'll build it and review it tomorrow!

adswa · 2021-06-10T07:19:33Z

I think this is great! Thanks A LOT for writing this up! Do you think this is ready to go? (i.e., not WIP anymore?)
@all-contributors please add @jsheunis for content, example

allcontributors · 2021-06-10T07:19:42Z

@adswa

I've put up a pull request to add @jsheunis! 🎉

jsheunis · 2021-06-10T07:25:48Z

Yup I think it's ready to go 👍

adswa · 2021-06-10T07:32:36Z

Great! Can you merge the master branch into this branch and push it to fix the remaining conflicts? I can't push to this branch :)

jsheunis added 3 commits May 21, 2021 11:10

add S3 special remote wqalkthrough

893c50b

add S3 special remote wqalkthrough to toc

7263453

save updates to artwork submodule state

791c34c

adswa reviewed May 25, 2021


jsheunis and others added 4 commits May 27, 2021 13:23

Update docs/basics/101-139-s3.rst

f1ae5df

Co-authored-by: Adina Wagner <adina.wagner@t-online.de>

Update docs/basics/101-139-s3.rst

d4a3664

Co-authored-by: Adina Wagner <adina.wagner@t-online.de>

Merge branch 'master' of https://github.com/datalad-handbook/book

80426c4

Updated walkthrough content for PR datalad-handbook#721

8aa0b1f

allcontributors bot mentioned this pull request Jun 10, 2021

docs: add jsheunis as a contributor for content, example #726

Merged

Merge remote-tracking branch 'upstream/master'

88c18ce

jsheunis changed the title ~~[WIP] Add walkthrough for using S3 as a special remote~~ Add walkthrough for using S3 as a special remote Jun 10, 2021

adswa merged commit 406566e into datalad-handbook:master Jun 10, 2021

jsheunis mentioned this pull request Jul 9, 2021

Walkthrough for setting up S3 as special remote #712

Closed
