fix cuda device not found error when LLM is initialized in ray actor #3198
After #2221, when tensor_parallel_size > 1, the driver process's CUDA_VISIBLE_DEVICES is set manually after the RayWorkerVllm actors are up. When the LLM engine is initialized inside a Ray actor with num_gpus=0, Ray sets that actor's CUDA_VISIBLE_DEVICES to '', so torch.cuda.is_available() returns False and any subsequent update to CUDA_VISIBLE_DEVICES has no effect. The root cause is that PyTorch initializes the device count only once: https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDAFunctions.cpp#L96-L113
We must be careful not to call any torch.cuda.* function before CUDA_VISIBLE_DEVICES is set.
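The caching behavior can be illustrated with a minimal pure-Python sketch. This is not PyTorch's actual code; it models the linked CUDAFunctions.cpp logic with `lru_cache`, and the return values for the unset/non-empty cases are illustrative assumptions:

```python
import os
from functools import lru_cache

# Toy model of c10::cuda::device_count(): PyTorch reads
# CUDA_VISIBLE_DEVICES only once and caches the result.
@lru_cache(maxsize=1)
def device_count() -> int:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible == "":
        return 0  # '' means no devices are visible
    if visible is None:
        return 1  # assume one GPU on the host (illustrative)
    return len(visible.split(","))

# Ray sets CUDA_VISIBLE_DEVICES='' for a num_gpus=0 actor.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
print(device_count())  # 0 -- "CUDA not available"

# Later the driver assigns real GPUs, but the cached value wins.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(device_count())  # still 0: the update is ignored
```

This is why the first torch.cuda.* call must happen only after CUDA_VISIBLE_DEVICES has its final value.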