Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Launcher does not honor CUDA_VISIBLE_DEVICES #5818

Closed
arnold-jr opened this issue Jul 31, 2024 · 2 comments
Closed

[BUG] Launcher does not honor CUDA_VISIBLE_DEVICES #5818

arnold-jr opened this issue Jul 31, 2024 · 2 comments
Assignees
Labels
bug Something isn't working inference

Comments

@arnold-jr
Copy link

Describe the bug
The documentation here implies that CUDA_VISIBLE_DEVICES is not supported, but the launcher script does attempt to handle that case.
Given that CUDA_VISIBLE_DEVICES is used so commonly, I think this still qualifies as a bug.

On a single node, when CUDA_VISIBLE_DEVICES=0,2 for example, when I launch deepspeed I get the error.

ValueError: No slot '2' specified on host 'localhost'

There is a simple, if clunky, workaround.

INCLUDE_STR="localhost:$CUDA_VISIBLE_DEVICES"
unset CUDA_VISIBLE_DEVICES
deepspeed --include $INCLUDE_STR ...

To Reproduce
Steps to reproduce the behavior:

  1. Simple inference script to reproduce
import collections
import os

import pytest


class FakeAccelerator:
    def __init__(self, num_devices: int = 2):
        self.num_devices = num_devices

    def device_count(self) -> int:
        return self.num_devices


def test_main_with_cuda_visible_devices(monkeypatch):
    fake_acc = FakeAccelerator(2)
    from deepspeed.launcher import runner

    monkeypatch.setattr(runner, "get_accelerator", lambda: fake_acc)

    cvd = "0,1"
    os.environ["CUDA_VISIBLE_DEVICES"] = cvd

    runner.main()

    cvd = "0,2"
    os.environ["CUDA_VISIBLE_DEVICES"] = cvd
    with pytest.raises(ValueError):
        runner.main()


def test_parse_resource_filter():
    from deepspeed.launcher.runner import parse_resource_filter

    resource_pool = collections.OrderedDict({"localhost": list(range(2))})
    parse_resource_filter(resource_pool, include_str="localhost:0,1", exclude_str="")

    with pytest.raises(ValueError):
        parse_resource_filter(
            resource_pool, include_str="localhost:0,2", exclude_str=""
        )


def test_parse_inclusion_exclusion():
    from deepspeed.launcher.runner import parse_inclusion_exclusion

    resource_pool = collections.OrderedDict({"localhost": 2})
    parse_inclusion_exclusion(resource_pool, inclusion="localhost:0,1", exclusion="")

    with pytest.raises(ValueError):
        parse_inclusion_exclusion(
            resource_pool, inclusion="localhost:0,2", exclusion=""
        )
  1. What packages are required and their versions
    pytest
  2. How to run the script
    pytest tests/unit/launcher/test_cuda_visible_devices.py

Expected behavior
Deepspeed should launch, setting include str to "localhost:0,2"

Additional context
As I mentioned in #4248, the code modifies the include_str to match CUDA_VISIBLE_DEVICES, but then relies on the accelerator to determine total number of devices. For the cuda accelerator, device_count considers CUDA_VISIBLE_DEVICES if set. Then it assumes in the parse_inclusion_exclusion function, it is assumed that the devices are numbered consecutively, starting from zero. This leads to a mismatch when trying to reconcile the include_str and the introspected resources when the index of visible devices is greater or equal to the total number of devices. Note that this would not be a problem if the accelerator always returned a device count of all physical devices, but in the case of the cuda accelerator, torch.cuda.device_count is used, which uses a cached value if possilble. So even though runner.main unsets the CUDA_VISIBLE_DEVICES env var, torch has likely already grabbed the value.

@tohtana
Copy link
Contributor

tohtana commented Sep 12, 2024

Hi @arnold-jr, thank you for the detailed investigation!
I drafted #6530 to address this issue. I would appreciate it if you could give us feedback.

github-merge-queue bot pushed a commit that referenced this issue Oct 8, 2024
This PR addresses #5818.
Instead of contiguous numbers based on the device count, this PR uses
device indices in `--include`.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
@tohtana
Copy link
Contributor

tohtana commented Oct 11, 2024

Closing as #6530 was merged.

@tohtana tohtana closed this as completed Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working inference
Projects
None yet
Development

No branches or pull requests

2 participants