
[BugFix] Fix multiprocessing shutdown errors #7041

Closed
wants to merge 6 commits

Conversation

@njhill njhill (Member) commented Aug 1, 2024

  1. Don't use daemon threads in the multiproc worker machinery
  2. Ensure that the LLMEngine is garbage collected properly, so that the executor and its non-daemon threads are shut down and don't cause the process to hang
  3. Keep worker processes as daemons but add a check for sys.is_finalizing() to avoid logging any error messages in case they are killed non-cleanly prior to the main process (though this should no longer happen with changes 1 and 2)
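As an illustration of change 3, here is a minimal sketch of an is_finalizing() guard; the report_worker_exit callback, its arguments, and the logger wiring are hypothetical placeholders rather than the actual vLLM code:

import logging
import sys

logger = logging.getLogger(__name__)

def report_worker_exit(worker_name: str, exitcode: int) -> None:
    # Hypothetical callback invoked when a daemon worker process dies.
    # If the interpreter is already shutting down, daemon workers may be
    # killed non-cleanly, so skip the error log rather than alarm the user.
    if sys.is_finalizing():
        return
    if exitcode != 0:
        logger.error("Worker process %s died with exit code %d",
                     worker_name, exitcode)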

There are still two warnings that appear consistently, but I think these are benign and we can investigate them as a follow-on to this:

[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@njhill njhill requested a review from youkaichao August 1, 2024 17:01

github-actions bot commented Aug 1, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@njhill njhill (Member Author) commented Aug 1, 2024

@youkaichao with some more experimentation I found that the try/finally block there wasn't really sufficient anyhow. I've changed it now to include an excepthook to run the multiproc shutdown if the main thread exits abnormally. And with the other fix, the GC seems to work when it exits normally.
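For reference, a rough sketch of how such an excepthook can be chained on top of the existing one; the names _install_shutdown_excepthook and shutdown_workers are made up for illustration and are not the PR's actual code:

import sys

def _install_shutdown_excepthook(shutdown_workers) -> None:
    # Hypothetical helper: wrap the current sys.excepthook so the
    # multiprocessing workers are torn down when the main thread exits
    # with an unhandled exception, then delegate to the original hook.
    original_hook = sys.excepthook

    def _hook(exc_type, exc_value, exc_traceback):
        try:
            shutdown_workers()
        finally:
            original_hook(exc_type, exc_value, exc_traceback)

    sys.excepthook = _hook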

@youkaichao youkaichao (Member) left a comment


I don't have the full expertise to figure out the root cause, and will see what CI tells us.

Thanks for the hard work! Debugging GC-related problems is quite a pain.

@youkaichao youkaichao added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 1, 2024
Comment on lines +335 to +339
# Clean up globals
for var in ("openai_serving_chat", "openai_serving_completion",
"openai_serving_embedding", "openai_serving_tokenization",
"engine_args", "engine"):
globals().pop(var, None)
@youkaichao youkaichao (Member) commented:

does del work? globals().pop is too hacky.

@njhill njhill (Member Author) commented:

@youkaichao we'd have to check and del each one individually so it would be a lot more code. I want this to work whether or not each is defined, in case some error occurs after setting some and not others.

The way globals are used here is already hacky imo and I think we'll clean it up later. I wanted to keep this change as simple as possible as there are overlapping changes in #6883 which will be merged very soon.
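For comparison, roughly what the del-based alternative would look like for just two of the five globals (illustrative only):

# Each name needs its own existence check, since `del` on an undefined
# global raises NameError:
if "openai_serving_chat" in globals():
    del openai_serving_chat
if "openai_serving_completion" in globals():
    del openai_serving_completion
# ...and so on for the remaining names, versus the single
# globals().pop(var, None) loop in the diff above.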

@youkaichao youkaichao (Member) commented:

Got it. This is a minor concern, and you can skip it if it is difficult to solve.

The most important part is still making the CI pass 🙏

@cermeng cermeng (Contributor) commented Aug 23, 2024

Hi, any update on it? 👀

njhill added a commit to njhill/vllm that referenced this pull request Sep 14, 2024
Especially with multiprocessing

Replaces vllm-project#7041
@njhill njhill (Member Author) commented Sep 16, 2024

Superseded by #8492

@njhill njhill closed this Sep 16, 2024
@njhill njhill deleted the improve-mp-shutdown branch September 16, 2024 17:03