
[BugFix] Fix multiprocessing shutdown errors #7041

Closed
wants to merge 6 commits

Conversation

@njhill njhill (Member) commented Aug 1, 2024

  1. Don't use daemon threads in the multiproc worker machinery
  2. Ensure that the LLMEngine is garbage collected properly, so that the executor and its non-daemon threads are shut down and don't cause the process to hang
  3. Keep worker processes as daemons but add a check for sys.is_finalizing() to avoid logging any error messages in case they are killed non-cleanly prior to the main process (though this should no longer happen with changes 1 and 2)
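As an illustration of change 3, here is a minimal sketch of an is_finalizing() guard; the report_worker_exit callback, its arguments, and the logger wiring are hypothetical placeholders rather than the actual vLLM code:

import logging
import sys

logger = logging.getLogger(__name__)

def report_worker_exit(worker_name: str, exitcode: int) -> None:
    # Hypothetical callback invoked when a daemon worker process dies.
    # If the interpreter is already shutting down, daemon workers may be
    # killed non-cleanly, so skip the error log rather than alarm the user.
    if sys.is_finalizing():
        return
    if exitcode != 0:
        logger.error("Worker process %s died with exit code %d",
                     worker_name, exitcode)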

There are still two warnings that appear consistently, but I think these are benign and we can investigate them as a follow-on to this:

[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@njhill njhill requested a review from youkaichao August 1, 2024 17:01

github-actions bot commented Aug 1, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@njhill njhill (Member Author) commented Aug 1, 2024

@youkaichao with some more experimentation I found that the try/finally block there wasn't really sufficient anyhow. I've changed it now to include an excepthook to run the multiproc shutdown if the main thread exits abnormally. And with the other fix, the GC seems to work when it exits normally.
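For reference, a rough sketch of how such an excepthook can be chained on top of the existing one; the names _install_shutdown_excepthook and shutdown_workers are made up for illustration and are not the PR's actual code:

import sys

def _install_shutdown_excepthook(shutdown_workers) -> None:
    # Hypothetical helper: wrap the current sys.excepthook so the
    # multiprocessing workers are torn down when the main thread exits
    # with an unhandled exception, then delegate to the original hook.
    original_hook = sys.excepthook

    def _hook(exc_type, exc_value, exc_traceback):
        try:
            shutdown_workers()
        finally:
            original_hook(exc_type, exc_value, exc_traceback)

    sys.excepthook = _hook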

@youkaichao youkaichao (Member) left a comment


I don't have the full expertise to figure out the root cause, and will see what CI tells us.

Thanks for the hard work! Debugging GC-related problems is quite a pain.

@youkaichao youkaichao added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 1, 2024
Comment on lines +335 to +339
# Clean up globals
for var in ("openai_serving_chat", "openai_serving_completion",
"openai_serving_embedding", "openai_serving_tokenization",
"engine_args", "engine"):
globals().pop(var, None)
@youkaichao youkaichao (Member) commented:

does del work? globals().pop is too hacky.

@njhill njhill (Member Author) commented:

@youkaichao we'd have to check and del each one individually so it would be a lot more code. I want this to work whether or not each is defined, in case some error occurs after setting some and not others.

The way globals are used here is already hacky imo and I think we'll clean it up later. I wanted to keep this change as simple as possible as there are overlapping changes in #6883 which will be merged very soon.
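For comparison, roughly what the del-based alternative would look like for just two of the five globals (illustrative only):

# Each name needs its own existence check, since `del` on an undefined
# global raises NameError:
if "openai_serving_chat" in globals():
    del openai_serving_chat
if "openai_serving_completion" in globals():
    del openai_serving_completion
# ...and so on for the remaining names, versus the single
# globals().pop(var, None) loop in the diff above.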

@youkaichao youkaichao (Member) commented:

Got it. This is a minor concern, and you can skip it if it is difficult to solve.

The most important part is still making the CI pass 🙏

@cermeng cermeng (Contributor) commented Aug 23, 2024

Hi, any update on it? 👀

njhill added a commit to njhill/vllm that referenced this pull request Sep 14, 2024
Especially with multiprocessing

Replaces vllm-project#7041
@njhill njhill (Member Author) commented Sep 16, 2024

Superseded by #8492

@njhill njhill closed this Sep 16, 2024
@njhill njhill deleted the improve-mp-shutdown branch September 16, 2024 17:03