
hostRequirements: gpu: optional is broken on Windows 11 and 10 #9385

Closed
sarphiv opened this issue Jan 11, 2024 · 20 comments
Labels
bug Issue identified by VS Code Team member as probable bug containers Issue in vscode-remote containers debt verified Verification succeeded

Comments

@sarphiv

sarphiv commented Jan 11, 2024

  • VSCode Version: >=1.84.2
  • Local OS Version: Multiple OS
  • Remote OS Version: ?
  • Remote Extension/Connection Type: Containers and WSL
  • Logs: N/A

Does this issue occur when you try this locally?: Yes
Does this issue occur when you try this locally and all extensions are disabled?: Yes

This issue is a continuation of #9220, which appears to have regressed recently. Read the previous issue for more context.

Steps to Reproduce:

  1. Set up Docker to support CUDA containers according to NVIDIA's official instructions
  2. Create devcontainer.json with "hostRequirements": { "gpu": "optional" }
  3. Open a devcontainer that is supposed to support CUDA with the above config
  4. Check for CUDA support in PyTorch, or by running nvidia-smi

On Linux Fedora 38 the above works - the container has access to the GPU.
On Windows 11 + WSL2 the above does not work. Troubleshooting steps have been described in #9220.

Adding "runArgs": [ "--gpus", "all" ] to devcontainer.json makes Windows 11 + WSL2 work. However, using the runArgs trick breaks the devcontainer for machines without GPUs (confirmed on Windows 11, macOS, and Linux Fedora).

As a temporary workaround, we are therefore currently maintaining two files: .devcontainer/gpu/devcontainer.json and .devcontainer/cpu/devcontainer.json.
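For illustration, a minimal sketch of the two variants (names and image are placeholders; the only functional difference is the extra runArgs in the GPU variant):

.devcontainer/cpu/devcontainer.json

{
    "name": "devcontainer (cpu)",
    "image": "pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime",
    "hostRequirements": { "gpu": "optional" }
}

.devcontainer/gpu/devcontainer.json

{
    "name": "devcontainer (gpu)",
    "image": "pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime",
    "hostRequirements": { "gpu": "optional" },
    "runArgs": [ "--gpus", "all" ]
}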

github-actions bot added the containers label Jan 11, 2024
@chrmarti
Contributor

What do you get for running docker info -f '{{.Runtimes.nvidia}}' on the command line?

chrmarti added the info-needed label Jan 25, 2024
@sarphiv
Author

sarphiv commented Jan 25, 2024

What do you get for running docker info -f '{{.Runtimes.nvidia}}' on the command line?
@chrmarti

The team member who experienced the issues on Windows 11 + WSL2 is currently on leave.

However, I found a Windows 10 machine with a GPU that has never had anything Docker nor NVIDIA container related installed on it. I installed Docker Desktop with WSL2 support, and oddly enough GPU passthrough appears to be supported by default, so I did nothing further.

Anyway, I ran your command and it gave:

> docker info -f '{{.Runtimes.nvidia}}'
'<no value>'

I guess your suspicion from the previous issue was correct.

To confirm that this machine was also affected by the bug, I created a folder with the following contents. Note that I just took some existing files and started deleting things, so there are probably some unrelated lines in the following:

.devcontainer/devcontainer.json

{
    "name": "Dockerfile devcontainer gpu",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile"
    },
    "workspaceFolder": "/workspace",
    "workspaceMount": "source=.,target=/workspace,type=bind",
    "hostRequirements": {
        "gpu": "optional"
    },
    "runArgs": [
        "--shm-size=4gb",
        "--gpus=all"
    ]
}

.devcontainer/Dockerfile

# Setup environment basics
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime


# Install packages
RUN apt update -y \
    && apt install -y sudo \
    && apt clean


# Set up user
ARG USERNAME=user
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME

USER $USERNAME


# Set up working directory
WORKDIR /workspace

# Set up environment variables
ENV PYTHONUNBUFFERED=True

Then I rebuilt and reopened the folder in a devcontainer via VSCode, and ran the following command to confirm I had access to a GPU (I also separately ensured PyTorch had access to CUDA acceleration):

> nvidia-smi

Everything worked perfectly. Afterwards, I commented out the "runArgs" key from the devcontainer.json file and repeated the above. This time nvidia-smi did not work and PyTorch had no CUDA acceleration.

sarphiv changed the title from "hostRequirements: gpu: optional is broken on Windows 11" to "hostRequirements: gpu: optional is broken on Windows 11 and 10" Jan 25, 2024
@chrmarti
Contributor

Great, what do you get for docker info -f '{{json .}}' on that machine? Thanks.

@sarphiv
Author

sarphiv commented Jan 25, 2024

I'm assuming you meant docker info -f json, because the other command fails.
Here's the output.json. I sadly don't see any GPU or NVIDIA references.

I also checked docker info -f '{{.Runtimes.nvidia}}' on Linux Fedora. Its output contains the string "nvidia-container-runtime", so I guess that's why it works on Linux. I then checked docker info -f json on Linux too, and it does contain the nvidia runtime, so I guess Windows is being weird.
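For reference, the two checks side by side, with the outputs reported in this thread as comments (other setups may differ):

# Runtime-level check the extension relies on (see the container log further down):
docker info -f '{{.Runtimes.nvidia}}'
#   Linux (Fedora):                        output contains "nvidia-container-runtime"
#   Windows 10/11 + WSL2 (Docker Desktop): '<no value>'

# Full engine info as JSON, for comparison:
docker info -f json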

@chrmarti
Contributor

We could add a machine-scoped setting to tell us whether a GPU is present, absent, or should be detected (the default, as today). That gives users a good out-of-the-box experience where the detection works, others can use the setting, and we can gradually (where possible) improve the detection.

chrmarti added the bug and debt labels and removed the info-needed label Feb 1, 2024
@sidecus

sidecus commented Mar 5, 2024

I am running into the same issue on my Windows machine.
nvidia-smi -L correctly returns the GPU info.
docker info doesn't return anything related to the GPU.

Shall we use nvidia-smi to detect NVidia GPU instead?

@sarphiv
Author

sarphiv commented Mar 5, 2024

I am running into the same issue on my Windows machine. nvidia-smi -L correctly returns the GPU info. docker info doesn't return anything related to the GPU.

Shall we use nvidia-smi to detect NVidia GPU instead?

If we only used nvidia-smi, this could fail on Linux, where you may have the NVIDIA drivers installed (nvidia-smi works) but not the NVIDIA Container Runtime (so there is no GPU inside containers).
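A sketch of how the two signals could be combined (illustrative only, not the extension's actual logic; the image tag is simply the one from the Dockerfile above):

# Driver-level check: does the host driver see a GPU at all?
nvidia-smi -L

# Runtime-level check: does Docker know about the NVIDIA runtime?
docker info -f '{{.Runtimes.nvidia}}'

# Bluntest probe: try to actually pass a GPU into a throwaway container.
# If this prints the GPU table, adding --gpus all to the dev container will work too.
docker run --rm --gpus all pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime nvidia-smi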

@sangotaro

sangotaro commented May 22, 2024

@chrmarti
I am using an Ubuntu 22.04 machine with an NVIDIA GPU (non-WSL), but the hostRequirements: gpu: optional is not working. The output of docker info -f '{{.Runtimes.nvidia}}' is <no value>, indicating that I am experiencing the same issue as in this case. The output of docker info is as follows:

docker-info.json

@pascal456

I stumbled upon this again in the last few days, after having had a solution via #9220 in January.

I'm working on a Windows workstation now and cannot get a dev container running via WSL with GPU support.

What about the intermediate solution of a machine-specific setting that @chrmarti mentioned above?

@chrmarti
Contributor

I agree that whether or not an SSH server machine can use its GPU in a Docker container should be a setting on the SSH server machine. It doesn't belong on the local machine.

One difficulty with the machine setting is that when connecting through an SSH server (or tunnel), we can't access its machine settings through VS Code's API, because the API only knows the local settings and the dev container settings (which it treats as "machine settings"). We can check for and read the machine settings.json in the extension, though. /cc @sandy081

@RaphaelMelanconAtBentley

Here is my hacky fix for docker compose in the meantime :) #10124 (comment)

@chrmarti
Contributor

Dev Containers 0.386.0-pre-release adds a user setting to override the automatic detection of a GPU:
[screenshot of the new "GPU Availability" setting in the Settings editor]

@maro-otto

@chrmarti
DevContainers v0.386.0 (pre-release)

Hello,

It seems that this feature is still broken (v0.386.0). If I create a remote machine (GCP) with a GPU and a fully installed NVIDIA stack, I can build and run the devcontainer using

"hostRequirements": {
    "gpu": "optional"
},

But if I remove the GPU from my remote machine, I can't start the Docker container anymore, as the extension claims to have detected a GPU despite the fact that no GPU is attached:

Output of devcontainer console is:
[21551 ms] Start: Run: docker info -f {{.Runtimes.nvidia}}
[21755 ms] GPU support found, add GPU flags to docker call.
...

If I run the command you use in your TypeScript code on the machine (which no longer has a GPU), I get:
{nvidia-container-runtime [] }

I think you are just checking whether the nvidia-container-runtime is available, but not whether an actual GPU is attached:
const runtimeFound = result.stdout.includes('nvidia-container-runtime');

So,

export async function extraRunArgs(common: ResolverParameters, params: DockerResolverParameters, config: DevContainerFromDockerfileConfig | DevContainerFromImageConfig) {
    const extraArguments: string[] = [];
    if (config.hostRequirements?.gpu) {
        if (await checkDockerSupportForGPU(params)) {
            common.output.write('GPU support found, add GPU flags to docker call.');
            extraArguments.push('--gpus', 'all');
        } else {
            if (config.hostRequirements?.gpu !== 'optional') {
                common.output.write('No GPU support found yet a GPU was required - consider marking it as "optional"', LogLevel.Warning);
            }
        }
    }
    return extraArguments;
}

will add --gpus all whenever the runtime is available, even if no GPU is attached. Unfortunately, the container won't start when --gpus all is passed but no GPU is attached to the machine. Am I missing something here?
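One possible refinement, sketched outside the extension's code (the helper below is hypothetical and only illustrates the extra check; it is not part of the extension's API):

import { exec } from 'child_process';
import { promisify } from 'util';

const execAsync = promisify(exec);

// Hypothetical helper: resolves to true only if the NVIDIA driver reports at least one GPU.
// Combined with the existing nvidia-container-runtime check, this would avoid adding
// `--gpus all` on machines where the runtime is installed but no GPU is attached.
async function hostHasNvidiaGpu(): Promise<boolean> {
    try {
        const { stdout } = await execAsync('nvidia-smi -L');
        return stdout.trim().length > 0;
    } catch {
        return false; // nvidia-smi missing or failing => assume no usable GPU
    }
}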

@chrmarti
Contributor

@maro-otto Good catch, I'll open a new issue for this. Thanks.

@eleanorjboyd
Member

Hello! Are users able to verify that this works (minus the new bug caught by @maro-otto)?

@eleanorjboyd
Member

Also @chrmarti, if no user is able to, could you clarify the steps? The originally filed issue is comprehensive, but I was wondering whether this can be tested without setting up CUDA containers according to NVIDIA's instructions (since it seems like the setting would apply in other dev container scenarios). Thanks!

eleanorjboyd added the verification-steps-needed and author-verification-requested labels Sep 26, 2024
@chrmarti
Contributor

Without a GPU, I suggest setting GPU Availability to all and verifying that a new dev container with "hostRequirements": { "gpu": "optional" } tries to enable the GPU for the container and fails.

With a GPU, you could set GPU Availability to none and verify that such a dev container indeed does not get the GPU (and cross-check that it does get the GPU with all).
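For completeness, the setting can also be pinned in the machine's settings.json; a sketch, assuming the setting ID is dev.containers.gpuAvailability with values all, detect, and none (the UI label is "GPU Availability"):

// settings.json on the machine that runs Docker (setting ID assumed)
{
    "dev.containers.gpuAvailability": "none"
}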

chrmarti removed the verification-steps-needed label Sep 27, 2024
rzhao271 added the verified label Sep 27, 2024
@rzhao271

rzhao271 commented Sep 27, 2024

My laptop has a GPU. When GPU Availability is set to none, the dev container with optional gpu host requirements still gets a GPU:
[2024-09-27T17:32:07.572Z] GPU support found, add GPU flags to docker call.

Host: Windows
Remote: Node.js & JavaScript container

rzhao271 reopened this Sep 27, 2024
rzhao271 added the verification-found label and removed the verified label Sep 27, 2024
chrmarti modified the milestones: September 2024, October 2024 Sep 30, 2024
chrmarti removed the verification-found and author-verification-requested labels Sep 30, 2024
@chrmarti
Contributor

@rzhao271 Could you rebuild the container and append the log from that? (F1 > Dev Containers: Show Container Log)

chrmarti added the info-needed label Sep 30, 2024
@rzhao271

rzhao271 commented Oct 2, 2024

Closing this issue. GPU Availability had to be set to none within the WSL settings, not the User settings.

rzhao271 closed this as completed Oct 2, 2024
rzhao271 added the verified label and removed the info-needed label Oct 2, 2024
rzhao271 modified the milestones: October 2024, September 2024 Oct 2, 2024
vs-code-engineering bot locked and limited conversation to collaborators Nov 16, 2024