Describe the bug
The documentation here implies that CUDA_VISIBLE_DEVICES is not supported, but the launcher script does attempt to handle that case.
Given that CUDA_VISIBLE_DEVICES is used so commonly, I think this still qualifies as a bug.
On a single node with, for example, CUDA_VISIBLE_DEVICES=0,2, launching deepspeed fails with the following error (there is a simple, if clunky, workaround):
ValueError: No slot '2' specified on host 'localhost'
To Reproduce
What packages are required and their versions: pytest
How to run the script: pytest tests/unit/launcher/test_cuda_visible_devices.py
Expected behavior
DeepSpeed should launch, setting the include string to "localhost:0,2".
Additional context
As I mentioned in #4248, the code modifies the include_str to match CUDA_VISIBLE_DEVICES, but then relies on the accelerator to determine the total number of devices. For the cuda accelerator, device_count respects CUDA_VISIBLE_DEVICES if it is set. The parse_inclusion_exclusion function then assumes that devices are numbered consecutively, starting from zero. This leads to a mismatch between the include_str and the introspected resources whenever the index of a visible device is greater than or equal to the total number of visible devices. Note that this would not be a problem if the accelerator always returned the count of all physical devices, but the cuda accelerator uses torch.cuda.device_count, which returns a cached value if possible. So even though runner.main unsets the CUDA_VISIBLE_DEVICES env var, torch has likely already grabbed the value.
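To make the mismatch concrete, here is a minimal sketch (a hypothetical simplification, not DeepSpeed's actual parse_inclusion_exclusion code) of why validating "localhost:0,2" against a device count of 2 fails: the slots are assumed to be 0..device_count-1, so index 2 is rejected even though it is a real, visible device.

```python
def parse_include(include_str, device_count):
    """Validate a host:slots spec against slots numbered 0..device_count-1,
    mirroring the consecutive-numbering assumption described above."""
    host, _, slots = include_str.partition(":")
    requested = [int(s) for s in slots.split(",")]
    available = set(range(device_count))  # assumes consecutive indices from 0
    for slot in requested:
        if slot not in available:
            raise ValueError(f"No slot '{slot}' specified on host '{host}'")
    return {host: requested}

# With CUDA_VISIBLE_DEVICES=0,2 the cuda accelerator reports device_count=2,
# but the include string still carries the original index 2 -> mismatch.
try:
    parse_include("localhost:0,2", device_count=2)
except ValueError as e:
    print(e)  # No slot '2' specified on host 'localhost'
```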
This PR addresses #5818.
Instead of contiguous numbers based on the device count, it uses the actual
device indices in `--include`.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
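The idea of the fix can be sketched as follows (a hedged illustration of the approach, not the PR's actual code): build the include string from the indices exactly as they appear in CUDA_VISIBLE_DEVICES, rather than renumbering them 0..device_count-1.

```python
import os

def include_from_visible_devices(hostname="localhost"):
    """Build an --include-style string from CUDA_VISIBLE_DEVICES, keeping
    the original device indices (e.g. "0,2") instead of remapping them.
    Hypothetical helper for illustration only."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        return None  # no restriction set; launcher would use all devices
    return f"{hostname}:{visible}"

os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
print(include_from_visible_devices())  # prints: localhost:0,2
```

With this, "localhost:0,2" round-trips cleanly because slot validation compares against the listed indices themselves rather than a contiguous range.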