vLLM's default multiprocessing method is incompatible with ROCm and Gaudi #2439

Open
tiran opened this issue Oct 11, 2024 · 6 comments
Labels: bug (Something isn't working), jira (This triggers jira sync), vllm (vLLM specific issues)

Comments

@tiran
Contributor

tiran commented Oct 11, 2024

Describe the bug
vLLM defaults to VLLM_WORKER_MULTIPROC_METHOD=fork (see https://docs.vllm.ai/en/v0.6.1/serving/env_vars.html). The fork start method is incompatible with ROCm and Gaudi.

To Reproduce

  1. Configure InstructLab to use more than one GPU
  2. Run ilab model serve on a system with more than one AMD GPU

Expected behavior
vLLM works

Screenshots

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Additional context
I recommend switching to "spawn". Python itself is moving away from fork as the default start method across platforms. The fork method has known issues; for example, it can lead to deadlocks when a process mixes threads and fork.
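For illustration, here is a minimal sketch of the failure mode, assuming PyTorch on a machine with CUDA or ROCm devices (this is not vLLM's actual worker code): once the parent process has initialized the accelerator runtime, a forked child cannot use it again, while spawned workers start from a clean interpreter.

```python
import multiprocessing as mp

import torch


def worker(rank: int) -> None:
    # With the "fork" start method this raises:
    #   RuntimeError: Cannot re-initialize CUDA in forked subprocess. ...
    # because the parent process has already initialized the runtime below.
    torch.cuda.set_device(rank)
    print(f"worker {rank} is using", torch.cuda.get_device_name(rank))


if __name__ == "__main__":
    torch.cuda.init()  # parent touches the CUDA/ROCm runtime, as vLLM's engine does
    ctx = mp.get_context("spawn")  # "fork" here reproduces the error above
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(torch.cuda.device_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```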

I switched InstructLab to spawn a while ago because fork was causing trouble on Gaudi, see #956. InstructLab should set VLLM_WORKER_MULTIPROC_METHOD=spawn by default.
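A minimal sketch of what that default could look like on the InstructLab side (the launch_vllm helper below is hypothetical, not the actual InstructLab code path): default VLLM_WORKER_MULTIPROC_METHOD to spawn in the child environment without overriding an explicit user setting.

```python
import os
import subprocess


def launch_vllm(serve_args: list[str]) -> subprocess.Popen:
    """Launch `vllm serve`, defaulting the worker start method to spawn."""
    env = os.environ.copy()
    # fork breaks on ROCm and Gaudi because the parent has already initialized
    # the accelerator runtime; spawn gives each worker a fresh interpreter.
    env.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")
    return subprocess.Popen(["vllm", "serve", *serve_args], env=env)
```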

@tiran tiran added the bug (Something isn't working), vllm (vLLM specific issues), and jira (This triggers jira sync) labels on Oct 11, 2024
@ktam3

ktam3 commented Oct 16, 2024

Additional comment by Russell

Indeed, setting the environment variable is the right thing to do in the short term, as it will work with the currently shipped version of vLLM.

FYI, for the future: spawn will be used automatically when you run vllm serve as of v0.6.3 (vllm-project/vllm#8823).

@nathan-weinberg
Member

@n1hility given @russellb's comment above, will this ticket be covered by the planned vLLM bump?

@nathan-weinberg nathan-weinberg added this to the 0.21.0 milestone Nov 1, 2024
@n1hility
Member

IMO we should fix this in the ODH branches for the vLLM versions we are pulling in, and also in any container definitions. I don't have a problem with adding some code to InstructLab as a redundancy to handle plugging in different versions of vLLM.

@nathan-weinberg nathan-weinberg removed this from the 0.21.0 milestone Nov 13, 2024
@nathan-weinberg nathan-weinberg added this to the 0.22.0 milestone Nov 27, 2024
@nathan-weinberg
Member

@n1hility does ODH vLLM 0.6.2 have this fix or do we need to wait for the next bump?

@nathan-weinberg nathan-weinberg modified the milestones: 0.22.0, 0.23.0 Dec 13, 2024
@n1hility
Member

Looks like we need to wait for another bump. The branches were created in ODH for Intel and AMD, but they are not in use yet, and we still need a patch here.

@nathan-weinberg nathan-weinberg removed this from the 0.23.0 milestone Jan 28, 2025
@nathan-weinberg
Member

@n1hility @tiran do y'all know whether ODH vLLM 0.6.4post1 (the vLLM version we are currently using) or 0.6.6post1 (the next version we plan to bump to) still has this issue, or can it be closed out?

cc @fabiendupont
