Migrate code from repository christian-monch/compute here #17

Merged
merged 148 commits into from
Oct 22, 2024
799e9f2
clean up annex-remote-compute
christian-monch Jul 22, 2024
e7bf640
add compute-command stub
christian-monch Jul 22, 2024
a716049
add compute POC
christian-monch Jul 23, 2024
98a21c9
support `use_shell`, ignore `dependencies` in template
christian-monch Jul 23, 2024
37e2004
update README.md
christian-monch Jul 23, 2024
4c9faa5
add a POC-disclaimer to README.md
christian-monch Jul 23, 2024
f78846e
update python version to 3.11 in actions
christian-monch Jul 23, 2024
c293c60
update docs
christian-monch Jul 23, 2024
a2948ca
encode execution parameters in one URL
christian-monch Jul 25, 2024
331ff33
refactor compute-annex remote code
christian-monch Jul 26, 2024
da75383
change scheme to `datalad-make`
christian-monch Jul 26, 2024
1c592c1
adapt README.md to modified URL-scheme
christian-monch Jul 26, 2024
858c3d8
add a statement about tests to README.md
christian-monch Jul 26, 2024
50cc8ac
improve README.md
christian-monch Jul 27, 2024
406e232
fix a typo
christian-monch Aug 30, 2024
61e842b
[experimental] start a git data provider
christian-monch Aug 30, 2024
e83b4a6
start one-to-many support
christian-monch Sep 2, 2024
3ddfe1a
add a simple git-worktree provider
christian-monch Sep 5, 2024
9075a40
add support for subdatasets in provider
christian-monch Sep 10, 2024
c4d78de
add delete functionality
christian-monch Sep 10, 2024
50a9b45
disable result rendering
christian-monch Sep 11, 2024
d07100f
add tests for gitworktree-provider
christian-monch Sep 11, 2024
1f33e2a
improve gitworktree-provisioning
christian-monch Sep 12, 2024
44b9a60
improve cli-docs
christian-monch Sep 12, 2024
8c06941
enforce input specification for provision
christian-monch Sep 16, 2024
77cce97
add root id and default version to URLs
christian-monch Sep 16, 2024
6b45ee7
improve gitworktree provider cli-doc
christian-monch Sep 16, 2024
67cadb0
use real random branch names in provision worktree
christian-monch Sep 17, 2024
a147e6e
add a first complete version
christian-monch Sep 17, 2024
619efb9
refactor compute_cmd a little
christian-monch Sep 17, 2024
b8677f1
updates remote to new encodings
christian-monch Sep 18, 2024
e73ee3e
update README.md to new architecture
christian-monch Sep 18, 2024
98929de
update example method
christian-monch Sep 18, 2024
144fdfd
support no input files in git worktree provision
christian-monch Sep 18, 2024
d993ffa
update requirements-devel.txt
christian-monch Sep 18, 2024
b0b7f94
support no input files in compute-command
christian-monch Sep 18, 2024
ba95693
update TODO
christian-monch Sep 18, 2024
3b875fe
use packages and adapt imports
christian-monch Sep 18, 2024
19a359c
add installation instructions to README.md
christian-monch Sep 18, 2024
1a3354c
update README.md
christian-monch Sep 18, 2024
67e99b7
improve description of `datalad compute` in README.md
christian-monch Sep 18, 2024
8ccbb75
use reinject collection in compute special remote
christian-monch Sep 20, 2024
84b3331
[temp] add tests with dataset hierarchies
christian-monch Sep 20, 2024
d5e260c
fix collect mechanism in compute-command
christian-monch Sep 20, 2024
fac1e50
fix unlocking in provisioning in compute-command
christian-monch Sep 20, 2024
5954bae
clean up imports in compute-command
christian-monch Sep 20, 2024
d96472f
mimic `annex unlock` behavior for dangling links
christian-monch Sep 20, 2024
64e5c8f
add a test for computed data in subdatasets
christian-monch Sep 20, 2024
b879906
fix unlocking for `execute` and `collect`
christian-monch Sep 20, 2024
f6aa844
fix getting of `a1.txt` in hierarchy test
christian-monch Sep 21, 2024
7449352
add a test for get in subsubdataset
christian-monch Sep 21, 2024
e1c6697
improve end-to-end test for compute-remote
christian-monch Sep 21, 2024
5e97b7b
add input_list and output_list parameter to compute-command
christian-monch Sep 22, 2024
acfd2b2
add tests
christian-monch Sep 23, 2024
5a355e6
circumvent a problem with `Dataset.unlock`
christian-monch Sep 23, 2024
a62627d
fix parameter list argument spec
christian-monch Sep 23, 2024
11d6ebd
add {root_directory} placeholder to template resolution
christian-monch Sep 24, 2024
a841010
add an argument for temp-directory to gitworktree
christian-monch Sep 24, 2024
ead61ed
fix: give branch name in recursive calls
christian-monch Sep 24, 2024
d7d9ac2
improve collect
christian-monch Sep 25, 2024
abd9d70
improve gitworktree provisioning
christian-monch Sep 25, 2024
906e1f8
improve test code
christian-monch Sep 26, 2024
8576907
capture output of git-calls
christian-monch Sep 27, 2024
2a68d44
start globing support for input/output
christian-monch Sep 27, 2024
2fce20f
remove submodule url modification
christian-monch Sep 29, 2024
ed55f62
add globbing for input files in provide
christian-monch Sep 29, 2024
134eaa4
factor out resolve_patterns
christian-monch Sep 29, 2024
d64f7c2
fix directory creation in test_method
christian-monch Sep 29, 2024
9c1564f
remove output-globbing from compute_cmd
christian-monch Sep 29, 2024
9d31515
remove whitespace
christian-monch Sep 29, 2024
bfddeed
improve README.md
christian-monch Sep 30, 2024
0dff215
add an example for fmriprep-docker execution
christian-monch Sep 30, 2024
d042865
set all recommended annex configs in tests
christian-monch Oct 1, 2024
1534b44
ensure that local subdatasets are provisioned
christian-monch Oct 1, 2024
b111f6c
refactor gitworktree.py
christian-monch Oct 1, 2024
0b816cc
fix errors in fmriprep docker example
christian-monch Oct 1, 2024
a2b9d80
promote provision to a proper datalad command
christian-monch Oct 1, 2024
6fb2309
add get-candidate env-vars to provision cmd
christian-monch Oct 2, 2024
094eeea
adapt provision tests to new implementation
christian-monch Oct 2, 2024
ccb82fd
create worktree from locally available datasets
christian-monch Oct 3, 2024
74016e0
remove an unused variable
christian-monch Oct 4, 2024
7fa4f13
adapt provision tests to modified provision code
christian-monch Oct 4, 2024
ab0baa1
add registration test for provision-cmd
christian-monch Oct 4, 2024
2f6d8f9
remove an unused empty file
christian-monch Oct 4, 2024
43e6703
improve docstrings and log-messages
christian-monch Oct 5, 2024
d8bc2c7
improve collection in annex remote
christian-monch Oct 5, 2024
82e3d27
ensure that addurl is only called on annexed files
christian-monch Oct 5, 2024
068280f
use `pwd`-kwarg instead of `-C <path>`
christian-monch Oct 5, 2024
ab00680
change python version to 3.11 in tests
christian-monch Oct 5, 2024
bfe95c7
convert util tests to subdataset
christian-monch Oct 5, 2024
412f1d4
remove hash comparison of worktree and dataset files
christian-monch Oct 5, 2024
dba0ea0
add output pattern support for computation
christian-monch Oct 6, 2024
9305747
remove an unused import
christian-monch Oct 6, 2024
c3168e5
enable recursive globbing
christian-monch Oct 6, 2024
9042cad
update fmriprep docker example
christian-monch Oct 6, 2024
4c90987
refactor compute special remote tests
christian-monch Oct 6, 2024
aff6735
fix globing and add tests
christian-monch Oct 6, 2024
81ee142
add a provisioning context-manager
christian-monch Oct 6, 2024
8efedee
yield str instead of Path-object
christian-monch Oct 6, 2024
3534a77
fix end-to-end tests
christian-monch Oct 7, 2024
0f74916
rename annex special remote source file
christian-monch Oct 7, 2024
b1d67f8
fix root_dataset search in compute remote
christian-monch Oct 7, 2024
da37d83
add a test for the compute remote
christian-monch Oct 7, 2024
92e328c
improve log messages
christian-monch Oct 7, 2024
b61c8eb
remove `dataset-id` from recompute-URLs
christian-monch Oct 8, 2024
87c2244
fix bugs in compute-URL and dataset search
christian-monch Oct 8, 2024
1b95b6f
use stored specs in compute-URLs
christian-monch Oct 8, 2024
1816180
add a regression test for duplicated computations
christian-monch Oct 9, 2024
6f7a4ae
unlock spec files
christian-monch Oct 9, 2024
590355e
do not provision unclean files
christian-monch Oct 9, 2024
aada1a9
remove DEBUG output from compute-remote tests
christian-monch Oct 9, 2024
6fe6435
remove stale provision branches from datasets
christian-monch Oct 9, 2024
ec02442
fix the deletion of the worktree
christian-monch Oct 9, 2024
0261229
ensure that an annexed template is fetched
christian-monch Oct 9, 2024
67c688b
fix a bug in template directory calculation
christian-monch Oct 9, 2024
c5f4b53
remove print-statements from annexremote tests
christian-monch Oct 10, 2024
b74f95c
use only "present" local subdatasets in provision
christian-monch Oct 10, 2024
062d0c5
add --no-globbing option and provisioning tests
christian-monch Oct 11, 2024
47f219b
add missing provision enhancement fixes
christian-monch Oct 11, 2024
7acadbc
fix a bug in speculative computation URL-adding
christian-monch Oct 11, 2024
0f49b32
add a test for speculative computation
christian-monch Oct 11, 2024
b99d441
reduce output
christian-monch Oct 11, 2024
d78105b
remove unnecessary configuration settings from README.md
christian-monch Oct 11, 2024
8756611
add instructions for compute special remote to example
christian-monch Oct 11, 2024
69fa9a4
remove unnecessary configurations from code
christian-monch Oct 11, 2024
9a2f0ef
disable result renderer in `get`-command
christian-monch Oct 11, 2024
da57b2a
fix a bug in worktree_dir handling
christian-monch Oct 13, 2024
dfc7a64
remove duplicated code form test_compute.py
christian-monch Oct 12, 2024
335ecf4
removed duplicated code
christian-monch Oct 12, 2024
195f0b7
use Get()-interface, remove unused variable
christian-monch Oct 14, 2024
7d7a395
fix a bug in worktree parameter handling
christian-monch Oct 15, 2024
f9f1d05
add subdataset aware input globbing
christian-monch Oct 14, 2024
198da6b
fix an assert
christian-monch Oct 15, 2024
36c5295
fix use of `Get()`
christian-monch Oct 15, 2024
62a8118
remove unused imports
christian-monch Oct 15, 2024
0742e18
use `glob.glob` to glob input patterns
christian-monch Oct 15, 2024
eab5a60
refactor, add doc-strings, etc.
christian-monch Oct 16, 2024
485ac98
install locally available subdataset first
christian-monch Oct 16, 2024
15470ef
rename `compute` to `remake`
christian-monch Oct 17, 2024
1a8f361
add missing `hypothesis` dependency
christian-monch Oct 17, 2024
e591531
apply linter suggestions
christian-monch Oct 21, 2024
d2e8944
rf: reformat code
christian-monch Oct 21, 2024
630c2f5
rf: improve subprocess calls in compute
christian-monch Oct 22, 2024
588b2c3
ci: adjust lowest commitizen-checked sha
christian-monch Oct 22, 2024
7f6ddb6
fix: fix type errors
christian-monch Oct 22, 2024
400b9a8
fix: fix linter errors
christian-monch Oct 22, 2024
3bac8b1
fix: use automated formatting results
christian-monch Oct 22, 2024
81f08a3
ci: ignore missing imports in mypy for now
christian-monch Oct 22, 2024
4 changes: 2 additions & 2 deletions .appveyor.yml
@@ -68,14 +68,14 @@ environment:
   # Ubuntu core tests
   - job_name: test-linux
     APPVEYOR_BUILD_WORKER_IMAGE: Ubuntu2204
-    PY: 3.9
+    PY: 3.11
     INSTALL_GITANNEX: git-annex -m snapshot

   # same as 'test-linux', but TMPDIR is on a crippled filesystem, causing
   # most, if not all test datasets to be created on that filesystem
   - job_name: test-linux-crippled
     APPVEYOR_BUILD_WORKER_IMAGE: Ubuntu2204
-    PY: 3.9
+    PY: 3.11
     # datalad-annex git remote needs something after git-annex_8.20211x
     INSTALL_GITANNEX: git-annex -m snapshot

2 changes: 1 addition & 1 deletion .github/workflows/conventional-commits.yml
@@ -20,4 +20,4 @@ jobs:
       run: python -m pip install commitizen
     - name: Run commit message checks
       run: |
-        cz check --rev-range ${{ github.event.pull_request.base.sha }}..${{ github.event.pull_request.head.sha }}
+        cz check --rev-range 630c2f514fd8d42c4def3d7ee588487ffa64cc38..${{ github.event.pull_request.head.sha }}
27 changes: 27 additions & 0 deletions .github/workflows/docbuild.yml
@@ -0,0 +1,27 @@
name: docs

on: [push, pull_request]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - name: Set up environment
      run: |
        git config --global user.email "test@github.land"
        git config --global user.name "GitHub Almighty"
    - uses: actions/checkout@v4
    - name: Set up Python 3.11
      uses: actions/setup-python@v5
      with:
        python-version: 3.11
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements-devel.txt
        pip install .
    - name: Build docs
      run: |
        make -C docs html
8 changes: 5 additions & 3 deletions .github/workflows/mypy-pr.yml
@@ -31,11 +31,13 @@ jobs:
       if: steps.changed-py-files.outputs.any_changed == 'true'
       run: |
         # get any type stubs that mypy thinks it needs
-        hatch run types:mypy --install-types --non-interactive --follow-imports skip ${{ steps.changed-py-files.outputs.all_changed_files }}
+        hatch run types:mypy --install-types --non-interactive --ignore-missing-imports --follow-imports skip ${{ steps.changed-py-files.outputs.all_changed_files }}
         # run mypy on the modified files only, and do not even follow imports.
         # this results is a fairly superficial test, but given the overall
         # state of annotations, we strive to become more correct incrementally
         # with focused error reports, rather than barfing a huge complaint
         # that is unrelated to the changeset someone has been working on.
-        # run on the oldest supported Python version
-        hatch run types:mypy --python-version 3.9 --follow-imports skip --pretty --show-error-context ${{ steps.changed-py-files.outputs.all_changed_files }}
+        # run on the oldest supported Python version.
+        # specify `--ignore-missing-imports` until the datalad-packages have
+        # type stubs for all their modules.
+        hatch run types:mypy --python-version 3.11 --ignore-missing-imports --follow-imports skip --pretty --show-error-context ${{ steps.changed-py-files.outputs.all_changed_files }}
6 changes: 4 additions & 2 deletions .github/workflows/mypy-project.yml
@@ -25,5 +25,7 @@ jobs:
         # get any type stubs that mypy thinks it needs
         hatch run types:mypy --install-types --non-interactive --follow-imports skip datalad_core
         # run mypy on the full project.
-        # run on the oldest supported Python version
-        hatch run types:mypy --python-version 3.9 --pretty --show-error-context datalad_core
+        # run on the oldest supported Python version.
+        # specify `--ignore-missing-imports` until the datalad-packages have
+        # type stubs for all their modules.
+        hatch run types:mypy --python-version 3.11 --ignore-missing-imports --pretty --show-error-context datalad_core
Empty file.
132 changes: 132 additions & 0 deletions README.md
@@ -6,6 +6,138 @@
[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch)


**This code is a POC**. Currently, that means:
- code does not thoroughly validate inputs
- names might be inconsistent
- few tests
- fewer docs
- no support for locking

This is a naive datalad compute extension that serves as a playground for
the datalad remake project.

It contains an annex remote that can compute content on demand. It uses template
files that specify the operations. It encodes computation parameters in URLs
that are associated with annex keys, which allows dropped content to be
recomputed instead of fetched from some storage system. It also contains the new
datalad command `compute`, which can trigger the computation of content,
generate the parameterized URLs, and associate these URLs with the respective
annex keys. The annex remote can then use this information to repeat the
computation.
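
The idea of packing computation parameters into a URL can be sketched as follows. This is a minimal illustration, not the extension's actual encoding: the scheme value is taken from `url_scheme` in `datalad_remake/__init__.py`, while the query keys `method`, `parameter`, and `output` and both helper functions are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Same value as `url_scheme` in datalad_remake/__init__.py.
URL_SCHEME = 'datalad-remake'


def build_url(method: str, parameters: dict[str, str], outputs: list[str]) -> str:
    """Encode a computation description in a single URL (hypothetical layout)."""
    query = urlencode(
        {
            'method': method,
            # One `parameter=name=value` entry per parameter, sorted for stability.
            'parameter': [f'{k}={v}' for k, v in sorted(parameters.items())],
            'output': outputs,
        },
        doseq=True,
    )
    return f'{URL_SCHEME}:///?{query}'


def parse_url(url: str) -> tuple[str, dict[str, str], list[str]]:
    """Recover method name, parameters, and outputs from such a URL."""
    parts = urlsplit(url)
    if parts.scheme != URL_SCHEME:
        raise ValueError(f'unsupported scheme: {parts.scheme!r}')
    query = parse_qs(parts.query)
    parameters = dict(item.split('=', 1) for item in query.get('parameter', []))
    return query['method'][0], parameters, query.get('output', [])


url = build_url(
    'one-to-many',
    {'first': 'bob', 'second': 'alice', 'output': 'name'},
    ['name-1.txt', 'name-2.txt'],
)
print(url)
print(parse_url(url))
```

Because everything needed to repeat the computation travels inside the URL, a remote that receives it only has to parse it and look up the named template.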

## Installation

There is no PyPI package yet. To install the extension, clone the repository
and install it via `pip` (preferably in a virtual environment):

```bash
git clone https://github.com/christian-monch/datalad-compute.git
cd datalad-compute
pip install -r requirements-devel.txt
pip install .
```


## Example usage

Install the extension and create a dataset:


```bash
> datalad create compute-test-1
> cd compute-test-1
```

Create the template directory and a template:

```bash
> mkdir -p .datalad/compute/methods
> cat > .datalad/compute/methods/one-to-many <<EOF
inputs = ['first', 'second', 'output']

use_shell = 'true'
executable = 'echo'
arguments = [
"content: {first} > '{output}-1.txt';",
"echo content: {second} > '{output}-2.txt'",
]
EOF
> datalad save -m "add 'one-to-many' compute method"
```

Create a "compute" annex special remote:
```bash
> git annex initremote compute encryption=none type=external externaltype=compute
```

Execute a computation and save the result:
```bash
> datalad compute -p first=bob -p second=alice -p output=name -o name-1.txt \
-o name-2.txt one-to-many
```
The method `one-to-many` will create two files with the names `<output>-1.txt`
and `<output>-2.txt`. That is why the two files `name-1.txt` and `name-2.txt`
are listed as outputs in the command above.

Note that only output files that are defined by the `-o/--output` option will
be available in the dataset after `datalad compute`. Similarly, only the files
defined by `-i/--input` will be available as inputs to the computation (the
computation is performed in a "scratch" directory, so the input files must be
copied there and the output files must be copied back).
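
The scratch-directory behavior described above can be sketched as follows. This is a simplified model that copies plain files with the standard library; the function `run_in_scratch` and the `compute` callback are assumptions for illustration, and the extension's real provisioning may work differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_in_scratch(
    dataset: Path,
    inputs: Iterable[str],
    outputs: Iterable[str],
    compute: Callable[[Path], None],
) -> None:
    """Run `compute` in a temporary directory holding only the declared inputs."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        # Copy only the declared inputs into the scratch directory.
        for name in inputs:
            target = scratch / name
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(dataset / name, target)
        compute(scratch)
        # Copy only the declared outputs back; everything else is discarded.
        for name in outputs:
            target = dataset / name
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(scratch / name, target)


def shout(scratch: Path) -> None:
    """Toy computation: upper-case the input and leave an undeclared file behind."""
    text = (scratch / 'in.txt').read_text()
    (scratch / 'out.txt').write_text(text.upper())
    (scratch / 'undeclared.txt').write_text('left behind')


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp)
    (dataset / 'in.txt').write_text('hello')
    run_in_scratch(dataset, ['in.txt'], ['out.txt'], shout)
    result = (dataset / 'out.txt').read_text()
    leftover_exists = (dataset / 'undeclared.txt').exists()

print(result, leftover_exists)
```

The undeclared file never reaches the dataset, which mirrors the rule that only files named with `-o/--output` are kept.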

```bash
> cat name-1.txt
content: bob
> cat name-2.txt
content: alice
```

Drop the content of `name-1.txt`, verify that it is gone, and recreate it via
`datalad get`, which "fetches" it from the compute remote:

```bash
> datalad drop name-1.txt
> cat name-1.txt
> datalad get name-1.txt
> cat name-1.txt
```

The command `datalad compute` also supports recording just the parameters
that would lead to a certain computation, without actually performing the
computation. We refer to this as *speculative computation*.

To use this feature, the following configuration value has to be set:

```bash
> git config annex.security.allow-unverified-downloads ACKTHPPT
```

Afterward, a speculative computation can be recorded by providing the `-u` option
(url-only) to `datalad compute`.

```bash
> datalad compute -p first=john -p second=susan -p output=person \
-o person-1.txt -o person-2.txt -u one-to-many
> cat person-1.txt # this will fail, because the computation has not yet been performed
```

`ls -l person-1.txt` will show a link to a URL-KEY whose content has not been downloaded.
`git annex whereis person-1.txt` will show the associated computation description URL.
No computation has been performed yet; `datalad compute` only creates a URL-KEY and
associates a computation description URL with it.
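
Conceptually, a speculative computation behaves like a recorded recipe that is evaluated on first access. The toy model below illustrates only that idea; the class `ToyComputeStore` is hypothetical and involves neither git-annex nor DataLad.

```python
from typing import Callable


class ToyComputeStore:
    """Toy stand-in for an annex: recorded recipes plus locally present content."""

    def __init__(self) -> None:
        self.recipes: dict[str, Callable[[], str]] = {}
        self.content: dict[str, str] = {}

    def record(self, name: str, recipe: Callable[[], str]) -> None:
        # Comparable to `datalad compute -u`: remember how, but do not compute.
        self.recipes[name] = recipe

    def get(self, name: str) -> str:
        # Comparable to `datalad get`: compute only if content is not present.
        if name not in self.content:
            self.content[name] = self.recipes[name]()
        return self.content[name]

    def drop(self, name: str) -> None:
        # Comparable to `datalad drop`: forget content, keep the recipe.
        self.content.pop(name, None)


calls = []


def make_person_1() -> str:
    calls.append('person-1.txt')
    return 'content: john\n'


store = ToyComputeStore()
store.record('person-1.txt', make_person_1)
present_before_get = 'person-1.txt' in store.content
first = store.get('person-1.txt')
second = store.get('person-1.txt')  # served from content, no recomputation
store.drop('person-1.txt')
third = store.get('person-1.txt')  # recomputed after the drop
print(present_before_get, first, len(calls))
```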

Use `datalad get` to perform the computation for the first time and receive the result:
```bash
> datalad get person-1.txt
> cat person-1.txt
```


# Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) if you are interested in internals or
contributing to the project.

## Acknowledgements

This development was supported by European Union’s Horizon research and
56 changes: 39 additions & 17 deletions datalad_remake/__init__.py
@@ -1,25 +1,47 @@
"""DataLad remake extension"""

from __future__ import annotations

from datalad_remake._version import __version__

__all__ = [
    '__version__',
    'command_suite',
]

# command_suite = (
# # description of the command suite, displayed in cmdline help
# "Demo DataLad command suite",
# [
# # specification of a command, any number of commands can be defined
# (
# # importable module that contains the command implementation
# 'datalad_remake.commands.compute_cmd',
# # name of the command class implementation in above module
# 'Compute',
# # optional name of the command in the cmdline API
# 'compute',
# # optional name of the command in the Python API
# 'compute'
# ),
# ]
# )

# Defines a datalad command suite.
# This variable must be bound as a setuptools entrypoint
# to be found by datalad
command_suite = (
    # description of the command suite, displayed in cmdline help
    'DataLad remake command suite',
    [
        # specification of a command, any number of commands can be defined
        (
            # importable module that contains the command implementation
            'datalad_remake.commands.make_cmd',
            # name of the command class implementation in above module
            'Make',
            # optional name of the command in the cmdline API
            'make',
            # optional name of the command in the Python API
            'make',
        ),
        (
            # importable module that contains the command implementation
            'datalad_remake.commands.provision_cmd',
            # name of the command class implementation in above module
            'Provision',
            # optional name of the command in the cmdline API
            'provision',
            # optional name of the command in the Python API
            'provision',
        ),
    ],
)


url_scheme = 'datalad-remake'
template_dir = '.datalad/make/methods'
specification_dir = '.datalad/make/specifications'
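
How a consumer could turn a suite of `(module, class, cmdline name, Python name)` tuples into command classes can be sketched as below. This is only an illustration of the tuple layout, not DataLad's loader: the function `resolve_commands` is hypothetical, and standard-library classes stand in for `Make` and `Provision` so that the example runs without the extension installed.

```python
from importlib import import_module


def resolve_commands(suite: tuple) -> tuple[str, dict[str, type]]:
    """Import each command class named in a suite and index it by cmdline name."""
    description, specs = suite
    resolved = {}
    for module_name, class_name, cli_name, _api_name in specs:
        module = import_module(module_name)
        resolved[cli_name] = getattr(module, class_name)
    return description, resolved


# Stand-in suite with the same shape as `command_suite` above.
demo_suite = (
    'Demo suite',
    [
        ('collections', 'OrderedDict', 'ordered', 'ordered'),
        ('pathlib', 'PurePath', 'path', 'path'),
    ],
)

description, commands = resolve_commands(demo_suite)
print(description, sorted(commands))
```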
Empty file.