Support nerdctl run --gpus #251

Merged
merged 2 commits into from
Jun 15, 2021

ktock
Member

@ktock ktock commented Jun 14, 2021

Fixes: #248

This PR adds the --gpus option to nerdctl run, based on containerd's GPU support (github.com/containerd/containerd/contrib/nvidia, added in containerd/containerd#2330).

# nerdctl run --gpus all --rm -it nvidia/cuda:9.0-base nvidia-smi

For compose (https://github.com/compose-spec/compose-spec/blob/master/deploy.md#devices):

version: "3.8"
services:
  demo:
    image: nvidia/cuda:9.0-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
          - capabilities: ["utility"]
            driver: nvidia
            count: all

nvidia-container-cli must be installed on the host.
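
To illustrate what a value passed to --gpus could look like once parsed, here is a minimal sketch in Go. It assumes a Docker-style syntax: one CSV record whose fields are either a bare count ("all" or a number) or key=value pairs with the keys count, device and capabilities. The type gpuRequest and the functions parseCount and parseGPUs are hypothetical names introduced for this example, and the convention that -1 means "all" is also an assumption; none of this is the PR's actual implementation.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// gpuRequest is a hypothetical holder for a parsed --gpus value.
type gpuRequest struct {
	Count        int      // 0 = unset, -1 = all GPUs (convention assumed here)
	DeviceIDs    []string // GPU indexes or UUIDs
	Capabilities []string // e.g. "utility", "compute"
}

// parseCount accepts "all" or a decimal integer.
func parseCount(s string) (int, error) {
	if s == "all" {
		return -1, nil
	}
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("invalid GPU count %q", s)
	}
	return n, nil
}

// parseGPUs parses one CSV record of bare values and key=value pairs.
func parseGPUs(value string) (*gpuRequest, error) {
	fields, err := csv.NewReader(strings.NewReader(value)).Read()
	if err != nil {
		return nil, err
	}
	req := &gpuRequest{}
	for _, f := range fields {
		kv := strings.SplitN(f, "=", 2)
		if len(kv) == 1 {
			// A bare field is treated as a count: "all" or a number.
			if req.Count, err = parseCount(kv[0]); err != nil {
				return nil, err
			}
			continue
		}
		switch kv[0] {
		case "count":
			if req.Count, err = parseCount(kv[1]); err != nil {
				return nil, err
			}
		case "device":
			req.DeviceIDs = strings.Split(kv[1], ",")
		case "capabilities":
			req.Capabilities = strings.Split(kv[1], ",")
		default:
			return nil, fmt.Errorf("unknown --gpus key %q", kv[0])
		}
	}
	// A request names either a number of GPUs or specific devices, not both.
	if req.Count != 0 && len(req.DeviceIDs) > 0 {
		return nil, fmt.Errorf("count and device are mutually exclusive")
	}
	return req, nil
}

func main() {
	for _, v := range []string{"all", `"capabilities=utility,compute",count=2`} {
		req, err := parseGPUs(v)
		fmt.Printf("%q -> %+v, err=%v\n", v, req, err)
	}
}
```

Quoting a field, as in the second example value, lets a comma-separated capability list survive the CSV split.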

@AkihiroSuda
Member

👍

Can we support Compose as well? https://github.com/compose-spec/compose-spec/blob/master/deploy.md#devices

Can be another PR.

README.md Outdated
@@ -307,6 +307,9 @@ Metadata flags:
Shared memory flags:
- :whale: `--shm-size`: Size of `/dev/shm`

GPU flags:
- :whale: `--gpus`: GPU devices to add to the container ('all' to pass all GPUs). `nvidia-container-cli` is needed.
Member

Can we have ./docs/gpu.md to explain all the options?

We should also clarify how to set up GPU for rootless. (Can be another PR)

@ktock ktock marked this pull request as draft June 14, 2021 08:55
@ktock ktock force-pushed the gpus branch 5 times, most recently from 631261e to 5e3075c Compare June 14, 2021 11:49
@ktock ktock marked this pull request as ready for review June 14, 2021 11:49
@ktock
Member Author

ktock commented Jun 14, 2021

Added compose support and docs.

docs/gpu.md Outdated
The following example exposes all available GPUs.

```
nerdctl run -it --rm --gpus all ubuntu:20.04 nvidia-smi
```
Member

ubuntu:20.04 -> nvidia/cuda:9.0-base might be more useful?

Member Author

Fixed this.

- NVIDIA Drivers
  - Same requirement as when you use GPUs on Docker. For details, please refer to [the doc by NVIDIA](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#pre-requisites).
- `nvidia-container-cli`
  - containerd relies on this CLI to set up GPUs inside the container. You can install it via the [`libnvidia-container` package](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/arch-overview.html#libnvidia-container).
Member

Did you try rootless (on cgroup v1)?

I guess it needs setting no-cgroups = true
moby/moby#38729 (comment)

Member Author

It doesn't work as of now.

We might need to patch github.com/containerd/containerd/contrib/nvidia to allow passing the --no-cgroups option to nvidia-container-cli.

containerd doesn't use nvidia-container-runtime (it executes nvidia-container-cli directly), so we cannot use /etc/nvidia-container-runtime/config.toml for nerdctl.

Member Author

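A very hacky workaround is to wrap nvidia-container-cli so that --no-cgroups is always passed.

mkdir -p /opt/nvidia/bin
mv /usr/bin/nvidia-container-cli /opt/nvidia/bin/
cat <<'EOF' > /usr/bin/nvidia-container-cli
#!/bin/bash
# Insert --no-cgroups just before the last argument, then run the real binary.
exec /opt/nvidia/bin/nvidia-container-cli "${@:1:($#-1)}" --no-cgroups "${@:$#}"
EOF
# The wrapper must be executable, otherwise it cannot be invoked.
chmod +x /usr/bin/nvidia-container-cli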
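
The argument rewrite done by the wrapper above (everything except the last argument, then --no-cgroups, then the last argument) can be sketched in Go as follows. The function name insertBeforeLast and the placeholder arguments are hypothetical and only illustrate the slicing; the real argument list is whatever the caller passes to nvidia-container-cli.

```go
package main

import "fmt"

// insertBeforeLast returns a copy of args with flag placed immediately
// before the final element. With no arguments it returns just the flag,
// which is a choice made for this sketch.
func insertBeforeLast(args []string, flag string) []string {
	if len(args) == 0 {
		return []string{flag}
	}
	last := len(args) - 1
	out := make([]string, 0, len(args)+1)
	out = append(out, args[:last]...) // everything except the last argument
	out = append(out, flag)           // the injected flag
	out = append(out, args[last])     // the original last argument
	return out
}

func main() {
	// Placeholder arguments, not a real nvidia-container-cli invocation.
	args := []string{"subcommand", "--some-flag", "last-arg"}
	fmt.Println(insertBeforeLast(args, "--no-cgroups"))
	// prints: [subcommand --some-flag --no-cgroups last-arg]
}
```

Building a new slice rather than appending in place leaves the caller's argument list untouched.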

Member

Opened containerd/containerd#5603 for discussion

Member Author

@ktock ktock Jun 15, 2021

@AkihiroSuda containerd/containerd#5604 is merged.
Updated this PR to use --no-cgroups, and it now works in rootless environments as well (without any additional configuration in /etc/nvidia-container-runtime/config.toml, etc.).

A replace directive is needed in go.mod to force it to point to the latest commit of containerd.

@AkihiroSuda
Member

I'll release nerdctl v0.9.0 after merging this.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
if dev.Count != 0 {
	e = append(e, fmt.Sprintf("count=%d", dev.Count))
}

Member

@fahedouch fahedouch Jun 14, 2021

count and device_ids are mutually exclusive, so only one of them should be set at a time. Is this handled somewhere?
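
A check for this, together with the string building shown in the diff above, could look like the sketch below. The type deviceRequest is a hypothetical, simplified stand-in for a Compose device reservation, the function gpusValue is an invented name, and the output format (quoted key=value fields joined by commas) is an assumption for illustration rather than the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// deviceRequest is a hypothetical, simplified mirror of one entry under
// deploy.resources.reservations.devices in a Compose file.
type deviceRequest struct {
	Driver       string
	Count        int // 0 = unset; a negative value is assumed to mean "all"
	DeviceIDs    []string
	Capabilities []string
}

// gpusValue turns a device reservation into a single --gpus style string,
// rejecting requests that set both count and device_ids.
func gpusValue(dev deviceRequest) (string, error) {
	if dev.Count != 0 && len(dev.DeviceIDs) > 0 {
		return "", errors.New("count and device_ids are mutually exclusive")
	}
	var e []string
	if len(dev.Capabilities) > 0 {
		// Quote the field so the commas inside it are not treated as separators.
		e = append(e, fmt.Sprintf("\"capabilities=%s\"", strings.Join(dev.Capabilities, ",")))
	}
	if len(dev.DeviceIDs) > 0 {
		e = append(e, fmt.Sprintf("\"device=%s\"", strings.Join(dev.DeviceIDs, ",")))
	}
	if dev.Count != 0 {
		e = append(e, fmt.Sprintf("count=%d", dev.Count))
	}
	return strings.Join(e, ","), nil
}

func main() {
	ok := deviceRequest{Driver: "nvidia", Capabilities: []string{"utility"}, Count: 2}
	fmt.Println(gpusValue(ok))

	bad := deviceRequest{Driver: "nvidia", Count: 1, DeviceIDs: []string{"0"}}
	fmt.Println(gpusValue(bad))
}
```

Validating before building the string means an invalid Compose file fails early with a clear message instead of producing a contradictory request.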

Member Author

@fahedouch
Member

fahedouch commented Jun 14, 2021

I'll release nerdctl v0.9.0 after merging this.

@AkihiroSuda I will clear ticket #239 tonight. It would be good to have it in 0.9 :)

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@ktock ktock marked this pull request as draft June 15, 2021 00:37
@ktock ktock marked this pull request as ready for review June 15, 2021 01:07
golang.org/x/term v0.0.0-20210406210042-72f3dc4e9b72
gotest.tools/v3 v3.0.3
)

replace github.com/containerd/containerd => github.com/containerd/containerd v1.5.1-0.20210614183500-0a3a77bc4453
Member

Why replace?

Member Author

Without replace, go mod tidy wants to point to v1.5.2.

Member

😞

Member

@AkihiroSuda AkihiroSuda left a comment

Thanks

@AkihiroSuda AkihiroSuda merged commit 40cce9e into containerd:master Jun 15, 2021
@ktock ktock deleted the gpus branch June 15, 2021 06:51