From c68772d726d797f20ce234091528653be7082cab Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Fri, 25 Oct 2024 20:47:57 +0200 Subject: [PATCH 1/7] doc: update README This is a first attempt at providing a more complete documentation of the project. Some sections (e.g. synopsis) will likely be moved elsewhere at some point. --- README.md | 182 ++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 135 insertions(+), 47 deletions(-) diff --git a/README.md b/README.md index a734e7f..46049b1 100644 --- a/README.md +++ b/README.md @@ -6,28 +6,49 @@ [![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch) -**This code is a POC**, that means currently: -- code does not thoroughly validate inputs -- names might be inconsistent -- few tests -- fewer docs -- no support for locking - -This is a naive datalad compute extension that serves as a playground for -the datalad remake-project. - -It contains an annex remote that can compute content on demand. It uses template -files that specify the operations. It encodes computation parameters in URLs -that are associated with annex keys, which allows to compute dropped content -instead of fetching it from some storage system. It also contains the new -datalad command `compute` that -can trigger the computation of content, generate the parameterized URLs, and -associate this URL with the respective annex key. This information can then -be used by the annex remote to repeat the computation. +**NOTE:** This extension is currently work-in-progress! + + +## About + +This extension equips DataLad with the functionality to generate file content on +demand, based on a specified set of instructions. This is particularly useful +when the file content can be (re)obtained deterministically. If storing the +file content is more expensive than (re)generating it, this functionality can +lead to more effective resource utilization. Thus, this extension may be of +interest to a wide, interdisciplinary audience, including researchers, data +curators, and infrastructure administrators. + + +## How it works + +This extension provides a new command called `datalad make`. + +By default, `datalad make` triggers the computation of content, generates a URL, +and associates this URL with the respective file (represented by a git annex +key). The associated URL encodes all the information necessary to (re)make the +file content. + +It is also possible to perform a *speculative computation*, in which case the +URL is recorded, without initiating the computation. This URL can then be used +to actually perform the computation. + +If the computation is performed, the URL is associated with a FILE-KEY, +otherwise the URL is associated with a URL-KEY. For more information on git +annex backends, go [here](https://git-annex.branchable.com/backends/). + +The URLs are handled by a `datalad-remake` annex special remote, implemented in +this extension. + + +## Requirements + +This extension requires Python >= `3.11`. + ## Installation -There is no pypi-package yet. To install the extension, clone the repository +There is no PyPI package yet. To install the extension, clone the repository and install it via `pip` (preferably in a virtual environment): ```bash @@ -38,9 +59,67 @@ and install it via `pip` (preferably in a virtual environment): ``` +## Synopsis + +``` +datalad make [-i INPUT] [-o OUTPUT] [-p PARAMETER] [-u] TEMPLATE +``` + +By design, to perform the computation `datalad make` creates a temporary git +worktree. All inputs required for the computation are automatically provisioned +to this temporary worktree, then the specified computation is performed, and +finally, all requested outputs are transferred back to the original dataset. + +The command is invoked with the following arguments: + +**`-i INPUT, --input INPUT`** (optional) + +Specification of the input file(s) to be provisioned to a temporary git +worktree. Paths need to be specified relative to the dataset in which `datalad +make` is executed. + +**`-o OUTPUT, --output OUTPUT`** + +Specification of the output file(s) to transfer back to the target dataset after +the computation. Paths need to be specified relative to the dataset in which +`datalad make` is executed. + +**`-p PARAMETER, --parameter PARAMETER`** (optional) + +Parameters for the computation, specified in a key-value format (e.g. `-p +key=value`). + +**`-u`** (optional) + +Run the command in a URL-only mode. If specified, a *speculative computation* +will be performed, i.e. only the URL will be recorded, without initiating the +computation. + +**`TEMPLATE`** + +Name of the method template used to perform the computation. The template should +be stored in `$DATASET_ROOT/.datalad/make/methods`. The template itself is a +simple text file, containing the following variables: +- `command`: command to be used for the computation +- `parameters` (optional): list of strings, corresponding to the parameters for + the computation +- `use_shell`: a boolean determining whether to use shell interpretation + +Please note, that placeholders (denoted with curly braces) are supported to allow +for the parametrized execution of the command. + +Also, in some cases, it may be more convenient to store inputs, outputs, and +parameters in external files. To support this, uppercase variants of the +command options have been introduced, i.e. `-I`, `-O` and `-P`, respectively. + +``` +datalad make -I input.txt -O output.txt -P parameter.txt TEMPLATE +``` + + ## Example usage -Create a dataset +Create a dataset: ```bash @@ -48,7 +127,7 @@ Create a dataset > cd remake-test-1 ``` -Create the template directory and a template +Create a template and place it in the `.datalad/make/methods` directory: ```bash > mkdir -p .datalad/make/methods @@ -63,25 +142,19 @@ EOF > datalad save -m "add `one-to-many` remake method" ``` -Create a "datalad-remake" annex special remote: +Create a `datalad-remake` annex special remote: ```bash > git annex initremote datalad-remake encryption=none type=external externaltype=datalad-remake ``` Execute a computation and save the result: ```bash -> datalad make -p first=bob -p second=alice -p output=name -o name-1.txt \ --o name-2.txt one-to-many +> datalad make -p first=bob -p second=alice -p output=name \ +-o name-1.txt -o name-2.txt one-to-many ``` The method `one-to-many` will create two files with the names `-1.txt` -and `-2.txt`. That is why the two files `name-1.txt` and `name-2.txt` -are listed as outputs in the command above. - -Note that only output files that are defined by the `-o/--output` option will -be available in the dataset after `datalad make`. Similarly, only the files -defined by `-i/--input` will be available as inputs to the computation (the -computation is performed in a "scratch" directory, so the input files must be -copied there and the output files must be copied back). +and `-2.txt`. Thus, the two files `name-1.txt` and `name-2.txt` need to +be specified as outputs in the command above. ```bash > cat name-1.txt @@ -91,7 +164,7 @@ content: alice ``` Drop the content of `name-1.txt`, verify it is gone, recreate it via -`datalad get`, which "fetches" is from the compute remote: +`datalad get`, which "fetches" it from the `datalad-remake` remote: ```bash > datalad drop name-1.txt @@ -100,31 +173,45 @@ Drop the content of `name-1.txt`, verify it is gone, recreate it via > cat name-1.txt ``` -The command `datalad make` does also support to just record the parameters -that would lead to a certain computation, without actually performing the -computation. We refer to this as *speculative computation*. - -To use this feature, the following configuration value has to be set: +The `datalad make` command can also be used to perform a *speculative +computation*. To use this feature, the following configuration value +has to be set: ```bash > git config annex.security.allow-unverified-downloads ACKTHPPT ``` -Afterward, a speculative computation can be recorded by providing the `-u` option -(url-only) to `datalad make`. +Afterwards, a speculative computation can be initiated by using the `-u` +option: ```bash > datalad make -p first=john -p second=susan -p output=person \ -o person-1.txt -o person-2.txt -u one-to-many -> cat person-1.txt # this will fail, because the computation has not yet been performed ``` -`ls -l person-1.txt` will show a link to a not-downloaded URL-KEY. -`git annex whereis person-1.txt` will show the associated computation description URL. -No computation has been performed yet, `datalad make` just creates an URL-KEY and -associates a computation description URL with the URL-KEY. +This will fail, because no computation has been performed, and the file content +is unavailable: + +```bash +> cat person-1.txt +``` + +However, `ls -l` will show a symlink to a URL-KEY: + +```bash +> ls -l person-1.txt +``` + +Similarly, `git annex whereis` will show the associated URL, that encodes all +the information necessary and sufficient to generate the file content: + +```bash +> git annex whereis person-1.txt +``` + +Based on this URL, `datalad get` can be used to generate the file content for +the first time based on the specified instructions: -Use `datalad get` to perform the computation for the first time and receive the result:: ```bash > datalad get person-1.txt > cat person-1.txt @@ -136,7 +223,8 @@ Use `datalad get` to perform the computation for the first time and receive the See [CONTRIBUTING.md](CONTRIBUTING.md) if you are interested in internals or contributing to the project. -## Acknowledgements + +# Acknowledgements This development was supported by European Union’s Horizon research and innovation programme under grant agreement [eBRAIN-Health From c3e81082ab32f29faec64de730172fd39af3c37b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Mon, 28 Oct 2024 21:42:36 +0100 Subject: [PATCH 2/7] doc: rename speculative computation to prospective computation --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 46049b1..ec65d27 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ and associates this URL with the respective file (represented by a git annex key). The associated URL encodes all the information necessary to (re)make the file content. -It is also possible to perform a *speculative computation*, in which case the +It is also possible to perform a *prospective computation*, in which case the URL is recorded, without initiating the computation. This URL can then be used to actually perform the computation. @@ -91,7 +91,7 @@ key=value`). **`-u`** (optional) -Run the command in a URL-only mode. If specified, a *speculative computation* +Run the command in a URL-only mode. If specified, a *prospective computation* will be performed, i.e. only the URL will be recorded, without initiating the computation. @@ -173,7 +173,7 @@ Drop the content of `name-1.txt`, verify it is gone, recreate it via > cat name-1.txt ``` -The `datalad make` command can also be used to perform a *speculative +The `datalad make` command can also be used to perform a *prospective computation*. To use this feature, the following configuration value has to be set: @@ -181,7 +181,7 @@ has to be set: > git config annex.security.allow-unverified-downloads ACKTHPPT ``` -Afterwards, a speculative computation can be initiated by using the `-u` +Afterwards, a prospective computation can be initiated by using the `-u` option: ```bash From b618a7529f3a32a7da1f36ffcf7c5116e01d63d2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Mon, 28 Oct 2024 21:57:27 +0100 Subject: [PATCH 3/7] doc: fix typos in Git and git-annex names --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index ec65d27..f22bd63 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ curators, and infrastructure administrators. This extension provides a new command called `datalad make`. By default, `datalad make` triggers the computation of content, generates a URL, -and associates this URL with the respective file (represented by a git annex +and associates this URL with the respective file (represented by a git-annex key). The associated URL encodes all the information necessary to (re)make the file content. @@ -34,10 +34,10 @@ URL is recorded, without initiating the computation. This URL can then be used to actually perform the computation. If the computation is performed, the URL is associated with a FILE-KEY, -otherwise the URL is associated with a URL-KEY. For more information on git -annex backends, go [here](https://git-annex.branchable.com/backends/). +otherwise the URL is associated with a URL-KEY. For more information on +git-annex backends, go [here](https://git-annex.branchable.com/backends/). -The URLs are handled by a `datalad-remake` annex special remote, implemented in +The URLs are handled by a `datalad-remake` git-annex special remote, implemented in this extension. @@ -65,7 +65,7 @@ and install it via `pip` (preferably in a virtual environment): datalad make [-i INPUT] [-o OUTPUT] [-p PARAMETER] [-u] TEMPLATE ``` -By design, to perform the computation `datalad make` creates a temporary git +By design, to perform the computation `datalad make` creates a temporary Git worktree. All inputs required for the computation are automatically provisioned to this temporary worktree, then the specified computation is performed, and finally, all requested outputs are transferred back to the original dataset. @@ -74,7 +74,7 @@ The command is invoked with the following arguments: **`-i INPUT, --input INPUT`** (optional) -Specification of the input file(s) to be provisioned to a temporary git +Specification of the input file(s) to be provisioned to a temporary Git worktree. Paths need to be specified relative to the dataset in which `datalad make` is executed. @@ -142,7 +142,7 @@ EOF > datalad save -m "add `one-to-many` remake method" ``` -Create a `datalad-remake` annex special remote: +Create a `datalad-remake` git-annex special remote: ```bash > git annex initremote datalad-remake encryption=none type=external externaltype=datalad-remake ``` From 5ac5ad4d28784650e496562e4ac66cc738c709c2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Mon, 28 Oct 2024 22:20:56 +0100 Subject: [PATCH 4/7] doc: provide long form for -u option MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Michał Szczepanik --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index f22bd63..67df4a0 100644 --- a/README.md +++ b/README.md @@ -89,7 +89,7 @@ the computation. Paths need to be specified relative to the dataset in which Parameters for the computation, specified in a key-value format (e.g. `-p key=value`). -**`-u`** (optional) +**`-u, --url-only`** (optional) Run the command in a URL-only mode. If specified, a *prospective computation* will be performed, i.e. only the URL will be recorded, without initiating the @@ -181,8 +181,8 @@ has to be set: > git config annex.security.allow-unverified-downloads ACKTHPPT ``` -Afterwards, a prospective computation can be initiated by using the `-u` -option: +Afterwards, a prospective computation can be initiated by using the +`-u / --url-only` option: ```bash > datalad make -p first=john -p second=susan -p output=person \ From ca37e76a871144894e297a234faa912b741aabf1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Tue, 29 Oct 2024 12:22:52 +0100 Subject: [PATCH 5/7] doc: rework the opening paragraph --- README.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 67df4a0..30e34d2 100644 --- a/README.md +++ b/README.md @@ -11,13 +11,17 @@ ## About -This extension equips DataLad with the functionality to generate file content on -demand, based on a specified set of instructions. This is particularly useful -when the file content can be (re)obtained deterministically. If storing the -file content is more expensive than (re)generating it, this functionality can -lead to more effective resource utilization. Thus, this extension may be of -interest to a wide, interdisciplinary audience, including researchers, data -curators, and infrastructure administrators. +This extension equips DataLad with the functionality to (re)compute file +content on demand, based on a specified set of instructions. In particular, +it features a `datalad make` command for capturing instructions on how to +compute a given file, allowing the file content to be safely removed. It also +implements a git-annex special remote, which enables the (re)computation of +the file content based on the captured instructions. This is particularly +useful when the file content can be produced deterministically. If storing +the file content is more expensive than (re)producing it, this functionality +can lead to more effective resource utilization. Thus, this extension may be +of interest to a wide, interdisciplinary audience, including researchers, +data curators, and infrastructure administrators. ## How it works From af9073a3aecd7df33f2d621f628314c6c3ff9d78 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Tue, 29 Oct 2024 13:20:13 +0100 Subject: [PATCH 6/7] doc: apply suggestions from the code review - Hide implementation details (URL, URL-KEY, FILE-KEY) - Make use of git annex info --- README.md | 41 ++++++++++------------------------------- 1 file changed, 10 insertions(+), 31 deletions(-) diff --git a/README.md b/README.md index 30e34d2..978d8aa 100644 --- a/README.md +++ b/README.md @@ -24,27 +24,6 @@ of interest to a wide, interdisciplinary audience, including researchers, data curators, and infrastructure administrators. -## How it works - -This extension provides a new command called `datalad make`. - -By default, `datalad make` triggers the computation of content, generates a URL, -and associates this URL with the respective file (represented by a git-annex -key). The associated URL encodes all the information necessary to (re)make the -file content. - -It is also possible to perform a *prospective computation*, in which case the -URL is recorded, without initiating the computation. This URL can then be used -to actually perform the computation. - -If the computation is performed, the URL is associated with a FILE-KEY, -otherwise the URL is associated with a URL-KEY. For more information on -git-annex backends, go [here](https://git-annex.branchable.com/backends/). - -The URLs are handled by a `datalad-remake` git-annex special remote, implemented in -this extension. - - ## Requirements This extension requires Python >= `3.11`. @@ -95,8 +74,8 @@ key=value`). **`-u, --url-only`** (optional) -Run the command in a URL-only mode. If specified, a *prospective computation* -will be performed, i.e. only the URL will be recorded, without initiating the +If specified, a *prospective computation* will be performed, i.e. only the +instructions to compute a file will be recorded, without initiating the computation. **`TEMPLATE`** @@ -193,28 +172,28 @@ Afterwards, a prospective computation can be initiated by using the -o person-1.txt -o person-2.txt -u one-to-many ``` -This will fail, because no computation has been performed, and the file content -is unavailable: +The following command will fail, because no computation has been performed, +and the file content is unavailable: ```bash > cat person-1.txt ``` -However, `ls -l` will show a symlink to a URL-KEY: +We can further inspect this with `git annex info`: ```bash -> ls -l person-1.txt +> git annex info person-1.txt ``` -Similarly, `git annex whereis` will show the associated URL, that encodes all -the information necessary and sufficient to generate the file content: +Similarly, `git annex whereis` will show the URL, that can be handled by the +git-annex special remote: ```bash > git annex whereis person-1.txt ``` -Based on this URL, `datalad get` can be used to generate the file content for -the first time based on the specified instructions: +Finally, `datalad get` can be used to produce the file content (for the first +time!) based on the specified instructions: ```bash > datalad get person-1.txt From a8e0eb9624a06273d16fb76384a04a15cd52fa2b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ma=C5=82gorzata=20Wierzba?= Date: Fri, 8 Nov 2024 10:55:58 +0100 Subject: [PATCH 7/7] doc: remove synopsis from README --- README.md | 58 +++---------------------------------------------------- 1 file changed, 3 insertions(+), 55 deletions(-) diff --git a/README.md b/README.md index 978d8aa..98ff794 100644 --- a/README.md +++ b/README.md @@ -41,62 +41,10 @@ and install it via `pip` (preferably in a virtual environment): > pip install . ``` +To check your installation, run: -## Synopsis - -``` -datalad make [-i INPUT] [-o OUTPUT] [-p PARAMETER] [-u] TEMPLATE -``` - -By design, to perform the computation `datalad make` creates a temporary Git -worktree. All inputs required for the computation are automatically provisioned -to this temporary worktree, then the specified computation is performed, and -finally, all requested outputs are transferred back to the original dataset. - -The command is invoked with the following arguments: - -**`-i INPUT, --input INPUT`** (optional) - -Specification of the input file(s) to be provisioned to a temporary Git -worktree. Paths need to be specified relative to the dataset in which `datalad -make` is executed. - -**`-o OUTPUT, --output OUTPUT`** - -Specification of the output file(s) to transfer back to the target dataset after -the computation. Paths need to be specified relative to the dataset in which -`datalad make` is executed. - -**`-p PARAMETER, --parameter PARAMETER`** (optional) - -Parameters for the computation, specified in a key-value format (e.g. `-p -key=value`). - -**`-u, --url-only`** (optional) - -If specified, a *prospective computation* will be performed, i.e. only the -instructions to compute a file will be recorded, without initiating the -computation. - -**`TEMPLATE`** - -Name of the method template used to perform the computation. The template should -be stored in `$DATASET_ROOT/.datalad/make/methods`. The template itself is a -simple text file, containing the following variables: -- `command`: command to be used for the computation -- `parameters` (optional): list of strings, corresponding to the parameters for - the computation -- `use_shell`: a boolean determining whether to use shell interpretation - -Please note, that placeholders (denoted with curly braces) are supported to allow -for the parametrized execution of the command. - -Also, in some cases, it may be more convenient to store inputs, outputs, and -parameters in external files. To support this, uppercase variants of the -command options have been introduced, i.e. `-I`, `-O` and `-P`, respectively. - -``` -datalad make -I input.txt -O output.txt -P parameter.txt TEMPLATE +```bash +> datalad make --help ```