Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README #26

Merged
merged 8 commits into from
Nov 11, 2024
Merged
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 66 additions & 47 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,28 +6,32 @@
[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch)


**This code is a POC**, that means currently:
- code does not thoroughly validate inputs
- names might be inconsistent
- few tests
- fewer docs
- no support for locking

This is a naive datalad compute extension that serves as a playground for
the datalad remake-project.

It contains an annex remote that can compute content on demand. It uses template
files that specify the operations. It encodes computation parameters in URLs
that are associated with annex keys, which allows to compute dropped content
instead of fetching it from some storage system. It also contains the new
datalad command `compute` that
can trigger the computation of content, generate the parameterized URLs, and
associate this URL with the respective annex key. This information can then
be used by the annex remote to repeat the computation.
**NOTE:** This extension is currently work-in-progress!


## About

This extension equips DataLad with the functionality to (re)compute file
content on demand, based on a specified set of instructions. In particular,
it features a `datalad make` command for capturing instructions on how to
compute a given file, allowing the file content to be safely removed. It also
implements a git-annex special remote, which enables the (re)computation of
the file content based on the captured instructions. This is particularly
useful when the file content can be produced deterministically. If storing
the file content is more expensive than (re)producing it, this functionality
can lead to more effective resource utilization. Thus, this extension may be
of interest to a wide, interdisciplinary audience, including researchers,
data curators, and infrastructure administrators.


## Requirements

This extension requires Python >= `3.11`.
m-wierzba marked this conversation as resolved.
Show resolved Hide resolved


## Installation

There is no pypi-package yet. To install the extension, clone the repository
There is no PyPI package yet. To install the extension, clone the repository
and install it via `pip` (preferably in a virtual environment):

```bash
Expand All @@ -37,18 +41,24 @@ and install it via `pip` (preferably in a virtual environment):
> pip install .
```

To check your installation, run:

```bash
> datalad make --help
```


## Example usage

Create a dataset
Create a dataset:


```bash
> datalad create remake-test-1
> cd remake-test-1
```

Create the template directory and a template
Create a template and place it in the `.datalad/make/methods` directory:

```bash
> mkdir -p .datalad/make/methods
Expand All @@ -63,25 +73,19 @@ EOF
> datalad save -m "add `one-to-many` remake method"
```

Create a "datalad-remake" annex special remote:
Create a `datalad-remake` git-annex special remote:
```bash
> git annex initremote datalad-remake encryption=none type=external externaltype=datalad-remake
```

Execute a computation and save the result:
```bash
> datalad make -p first=bob -p second=alice -p output=name -o name-1.txt \
-o name-2.txt one-to-many
> datalad make -p first=bob -p second=alice -p output=name \
-o name-1.txt -o name-2.txt one-to-many
```
The method `one-to-many` will create two files with the names `<output>-1.txt`
and `<output>-2.txt`. That is why the two files `name-1.txt` and `name-2.txt`
are listed as outputs in the command above.

Note that only output files that are defined by the `-o/--output` option will
be available in the dataset after `datalad make`. Similarly, only the files
defined by `-i/--input` will be available as inputs to the computation (the
computation is performed in a "scratch" directory, so the input files must be
copied there and the output files must be copied back).
and `<output>-2.txt`. Thus, the two files `name-1.txt` and `name-2.txt` need to
be specified as outputs in the command above.

```bash
> cat name-1.txt
Expand All @@ -91,7 +95,7 @@ content: alice
```

Drop the content of `name-1.txt`, verify it is gone, recreate it via
`datalad get`, which "fetches" is from the compute remote:
`datalad get`, which "fetches" it from the `datalad-remake` remote:

```bash
> datalad drop name-1.txt
Expand All @@ -100,31 +104,45 @@ Drop the content of `name-1.txt`, verify it is gone, recreate it via
> cat name-1.txt
```

The command `datalad make` does also support to just record the parameters
that would lead to a certain computation, without actually performing the
computation. We refer to this as *speculative computation*.

To use this feature, the following configuration value has to be set:
The `datalad make` command can also be used to perform a *prospective
computation*. To use this feature, the following configuration value
has to be set:

```bash
> git config annex.security.allow-unverified-downloads ACKTHPPT
```

Afterward, a speculative computation can be recorded by providing the `-u` option
(url-only) to `datalad make`.
Afterwards, a prospective computation can be initiated by using the
`-u / --url-only` option:

```bash
> datalad make -p first=john -p second=susan -p output=person \
-o person-1.txt -o person-2.txt -u one-to-many
> cat person-1.txt # this will fail, because the computation has not yet been performed
```

`ls -l person-1.txt` will show a link to a not-downloaded URL-KEY.
`git annex whereis person-1.txt` will show the associated computation description URL.
No computation has been performed yet, `datalad make` just creates an URL-KEY and
associates a computation description URL with the URL-KEY.
The following command will fail, because no computation has been performed,
and the file content is unavailable:

```bash
> cat person-1.txt
```

We can further inspect this with `git annex info`:

```bash
> git annex info person-1.txt
```

Similarly, `git annex whereis` will show the URL, that can be handled by the
git-annex special remote:

```bash
> git annex whereis person-1.txt
```

Finally, `datalad get` can be used to produce the file content (for the first
time!) based on the specified instructions:

Use `datalad get` to perform the computation for the first time and receive the result::
```bash
> datalad get person-1.txt
> cat person-1.txt
Expand All @@ -136,7 +154,8 @@ Use `datalad get` to perform the computation for the first time and receive the
See [CONTRIBUTING.md](CONTRIBUTING.md) if you are interested in internals or
contributing to the project.

## Acknowledgements

# Acknowledgements

This development was supported by European Union’s Horizon research and
innovation programme under grant agreement [eBRAIN-Health
Expand Down