Describe the bug
The documentation here implies that CUDA_VISIBLE_DEVICES is not supported, but the launcher script does attempt to handle that case.
Given that CUDA_VISIBLE_DEVICES is used so commonly, I think this still qualifies as a bug.
On a single node with, for example, CUDA_VISIBLE_DEVICES=0,2, launching deepspeed fails with the following error (there is a simple, if clunky, workaround):
ValueError: No slot '2' specified on host 'localhost'
To Reproduce
What packages are required and their versions: pytest
How to run the script: pytest tests/unit/launcher/test_cuda_visible_devices.py
Expected behavior
DeepSpeed should launch, setting the include string to "localhost:0,2".
Additional context
As I mentioned in #4248, the code modifies the include_str to match CUDA_VISIBLE_DEVICES, but then relies on the accelerator to determine the total number of devices. For the cuda accelerator, device_count respects CUDA_VISIBLE_DEVICES if it is set. The parse_inclusion_exclusion function then assumes that devices are numbered consecutively, starting from zero. This leads to a mismatch between the include_str and the introspected resources whenever the index of a visible device is greater than or equal to the total number of visible devices. Note that this would not be a problem if the accelerator always returned the count of all physical devices, but the cuda accelerator uses torch.cuda.device_count, which returns a cached value if possible. So even though runner.main unsets the CUDA_VISIBLE_DEVICES env var, torch has likely already grabbed the value.
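To make the mismatch concrete, here is a minimal sketch (a hypothetical simplification, not DeepSpeed's actual parse_inclusion_exclusion code) of why validating "localhost:0,2" against a device count of 2 fails: the slots are assumed to be 0..device_count-1, so index 2 is rejected even though it is a real, visible device.

```python
def parse_include(include_str, device_count):
    """Validate a host:slots spec against slots numbered 0..device_count-1,
    mirroring the consecutive-numbering assumption described above."""
    host, _, slots = include_str.partition(":")
    requested = [int(s) for s in slots.split(",")]
    available = set(range(device_count))  # assumes consecutive indices from 0
    for slot in requested:
        if slot not in available:
            raise ValueError(f"No slot '{slot}' specified on host '{host}'")
    return {host: requested}

# With CUDA_VISIBLE_DEVICES=0,2 the cuda accelerator reports device_count=2,
# but the include string still carries the original index 2 -> mismatch.
try:
    parse_include("localhost:0,2", device_count=2)
except ValueError as e:
    print(e)  # No slot '2' specified on host 'localhost'
```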
This PR addresses #5818.
Instead of contiguous numbers based on the device count, it uses the actual
device indices in `--include`.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
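The idea of the fix can be sketched as follows (a hedged illustration of the approach, not the PR's actual code): build the include string from the indices exactly as they appear in CUDA_VISIBLE_DEVICES, rather than renumbering them 0..device_count-1.

```python
import os

def include_from_visible_devices(hostname="localhost"):
    """Build an --include-style string from CUDA_VISIBLE_DEVICES, keeping
    the original device indices (e.g. "0,2") instead of remapping them.
    Hypothetical helper for illustration only."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        return None  # no restriction set; launcher would use all devices
    return f"{hostname}:{visible}"

os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
print(include_from_visible_devices())  # prints: localhost:0,2
```

With this, "localhost:0,2" round-trips cleanly because slot validation compares against the listed indices themselves rather than a contiguous range.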