
CPU Bottleneck when using beam search #1646

Closed
physicsrob opened this issue Nov 13, 2023 · 2 comments

@physicsrob

I'm finding a surprising bottleneck in beam search generation in vllm 0.2.1.post1. I have one CPU process pegged at 100% CPU, and GPU utilization below 25%. When I use py-spy to inspect where the time is being spent, I see that vllm/sequence.py:fork is calling deepcopy(), and that over 80% of my CPU time goes there. So deepcopy() is clearly the bottleneck for this use case.
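For context, the hot path here is the per-beam fork, which copies the whole sequence object. In outline (a simplified sketch for illustration, not the actual vLLM source):

```python
import copy
from types import SimpleNamespace

def fork(seq, new_seq_id):
    # Called for every surviving beam candidate at every scheduling step.
    # deepcopy walks the entire object graph (token ids, logprobs, block
    # metadata), so its cost grows with sequence length and beam width.
    new_seq = copy.deepcopy(seq)
    new_seq.seq_id = new_seq_id
    return new_seq

# Toy illustration: the state being copied keeps growing as generation proceeds.
seq = SimpleNamespace(seq_id=0, token_ids=list(range(2048)), logprobs=[{} for _ in range(2048)])
children = [fork(seq, i) for i in range(1, 5)]  # beam width 4 -> 4 full copies per step
```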

FWIW this is with llama2-7b on an A100-80. I'm not yet sure whether this is a regression or if there has always been this bottleneck in vLLM.

Here's a simple example which reproduces the issue: https://gist.github.com/physicsrob/f7bc0be046c01cd6f959966e24022bba
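The reproduction is roughly of this shape (a minimal sketch; the model name and generation parameters are assumptions, and the gist above is the authoritative version):

```python
from vllm import LLM, SamplingParams

# Beam search in vLLM 0.2.x: n is the beam width and temperature must be 0.
params = SamplingParams(
    n=4,
    use_beam_search=True,
    temperature=0.0,
    max_tokens=256,
)

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed checkpoint; any Llama-2 7B works

prompts = ["Summarize the history of the Roman Empire in one paragraph."] * 32
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```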

@simon-mo
Collaborator

@physicsrob thank you for the detailed report. @zhuohan123 will take a look.

@kevinhu

kevinhu commented Nov 29, 2023

Replacing the deepcopy() call with pickle via new_seq = pickle.loads(pickle.dumps(self, -1)) doubles GPU utilization, but I think any further improvements will require overriding the copy methods, making the classes serializable with Pydantic, or refactoring this step to avoid copies altogether.
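One direction for the "override the copy methods" idea is to give the sequence class a custom __deepcopy__ that copies only the small per-beam mutable state and shares the rest by reference. A toy sketch (the Sequence class and its fields here are stand-ins, not the real vllm.sequence.Sequence):

```python
import copy

class Sequence:
    """Toy stand-in for vllm.sequence.Sequence, for illustration only."""

    def __init__(self, seq_id, token_ids, block_table):
        self.seq_id = seq_id
        self.token_ids = token_ids      # mutated per beam after a fork
        self.block_table = block_table  # assumed safe to share by reference

    def __deepcopy__(self, memo):
        # Copy only what each beam mutates; share everything else instead of
        # recursively copying the whole object graph.
        cls = self.__class__
        new_seq = cls.__new__(cls)
        memo[id(self)] = new_seq
        new_seq.seq_id = self.seq_id
        new_seq.token_ids = list(self.token_ids)
        new_seq.block_table = self.block_table
        return new_seq

seq = Sequence(0, [1, 2, 3], block_table={"blocks": [0, 1]})
child = copy.deepcopy(seq)
child.seq_id = 1
child.token_ids.append(4)
assert seq.token_ids == [1, 2, 3]            # per-beam state is independent
assert child.block_table is seq.block_table  # shared, not copied
```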

@hmellor closed this as not planned on Mar 25, 2024