diff --git a/README.md b/README.md index 94dd3a5..231ce58 100644 --- a/README.md +++ b/README.md @@ -6,28 +6,32 @@ [![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch) -**This code is a POC**, that means currently: -- code does not thoroughly validate inputs -- names might be inconsistent -- few tests -- fewer docs -- no support for locking - -This is a naive datalad compute extension that serves as a playground for -the datalad remake-project. - -It contains an annex remote that can compute content on demand. It uses template -files that specify the operations. It encodes computation parameters in URLs -that are associated with annex keys, which allows to compute dropped content -instead of fetching it from some storage system. It also contains the new -datalad command `compute` that -can trigger the computation of content, generate the parameterized URLs, and -associate this URL with the respective annex key. This information can then -be used by the annex remote to repeat the computation. +**NOTE:** This extension is currently work-in-progress! + + +## About + +This extension equips DataLad with the functionality to (re)compute file +content on demand, based on a specified set of instructions. In particular, +it features a `datalad make` command for capturing instructions on how to +compute a given file, allowing the file content to be safely removed. It also +implements a git-annex special remote, which enables the (re)computation of +the file content based on the captured instructions. This is particularly +useful when the file content can be produced deterministically. If storing +the file content is more expensive than (re)producing it, this functionality +can lead to more effective resource utilization. Thus, this extension may be +of interest to a wide, interdisciplinary audience, including researchers, +data curators, and infrastructure administrators. + + +## Requirements + +This extension requires Python >= `3.11`. + ## Installation -There is no pypi-package yet. To install the extension, clone the repository +There is no PyPI package yet. To install the extension, clone the repository and install it via `pip` (preferably in a virtual environment): ```bash @@ -37,10 +41,16 @@ and install it via `pip` (preferably in a virtual environment): > pip install . ``` +To check your installation, run: + +```bash +> datalad make --help +``` + ## Example usage -Create a dataset +Create a dataset: ```bash @@ -48,7 +58,7 @@ Create a dataset > cd remake-test-1 ``` -Create the template directory and a template +Create a template and place it in the `.datalad/make/methods` directory: ```bash > mkdir -p .datalad/make/methods @@ -63,7 +73,7 @@ EOF > datalad save -m "add `one-to-many` remake method" ``` -Create a "datalad-remake" annex special remote: +Create a `datalad-remake` git-annex special remote: ```bash > git annex initremote datalad-remake encryption=none type=external externaltype=datalad-remake allow_untrusted_execution=true ``` @@ -71,17 +81,11 @@ Create a "datalad-remake" annex special remote: Execute a computation and save the result: ```bash > datalad make -p first=bob -p second=alice -p output=name -o name-1.txt \ --o name-2.txt --allow-untrusted-code one-to-many +-o name-2.txt --allow-untrusted-execution one-to-many ``` The method `one-to-many` will create two files with the names `-1.txt` -and `-2.txt`. That is why the two files `name-1.txt` and `name-2.txt` -are listed as outputs in the command above. - -Note that only output files that are defined by the `-o/--output` option will -be available in the dataset after `datalad make`. Similarly, only the files -defined by `-i/--input` will be available as inputs to the computation (the -computation is performed in a "scratch" directory, so the input files must be -copied there and the output files must be copied back). +and `-2.txt`. Thus, the two files `name-1.txt` and `name-2.txt` need to +be specified as outputs in the command above. ```bash > cat name-1.txt @@ -91,7 +95,7 @@ content: alice ``` Drop the content of `name-1.txt`, verify it is gone, recreate it via -`datalad get`, which "fetches" is from the compute remote: +`datalad get`, which "fetches" it from the `datalad-remake` remote: ```bash > datalad drop name-1.txt @@ -100,18 +104,16 @@ Drop the content of `name-1.txt`, verify it is gone, recreate it via > cat name-1.txt ``` -The command `datalad make` does also support to just record the parameters -that would lead to a certain computation, without actually performing the -computation. We refer to this as *speculative computation*. - -To use this feature, the following configuration value has to be set: +The `datalad make` command can also be used to perform a *prospective +computation*. To use this feature, the following configuration value +has to be set: ```bash > git config annex.security.allow-unverified-downloads ACKTHPPT ``` -Afterward, a speculative computation can be recorded by providing the `-u` option -(url-only) to `datalad make`. +Afterwards, a prospective computation can be initiated by using the +`-u / --url-only` option: ```bash > datalad make -p first=john -p second=susan -p output=person \ @@ -119,12 +121,29 @@ Afterward, a speculative computation can be recorded by providing the `-u` optio > cat person-1.txt # this will fail, because the computation has not yet been performed ``` -`ls -l person-1.txt` will show a link to a not-downloaded URL-KEY. -`git annex whereis person-1.txt` will show the associated computation description URL. -No computation has been performed yet, `datalad make` just creates an URL-KEY and -associates a computation description URL with the URL-KEY. +The following command will fail, because no computation has been performed, +and the file content is unavailable: + +```bash +> cat person-1.txt +``` + +We can further inspect this with `git annex info`: + +```bash +> git annex info person-1.txt +``` + +Similarly, `git annex whereis` will show the URL, that can be handled by the +git-annex special remote: + +```bash +> git annex whereis person-1.txt +``` + +Finally, `datalad get` can be used to produce the file content (for the first +time!) based on the specified instructions: -Use `datalad get` to perform the computation for the first time and receive the result:: ```bash > datalad get person-1.txt > cat person-1.txt @@ -136,7 +155,8 @@ Use `datalad get` to perform the computation for the first time and receive the See [CONTRIBUTING.md](CONTRIBUTING.md) if you are interested in internals or contributing to the project. -## Acknowledgements + +# Acknowledgements This development was supported by European Union’s Horizon research and innovation programme under grant agreement [eBRAIN-Health