[Feature]: Breaking Down Single Process into Asynchronous Tokenization, Model Inference, and Detokenization for Enhanced GPU Utilization #8295

Closed · 1 task done · hxer7963 opened this issue on Sep 9, 2024 · 1 comment

hxer7963 (Contributor) commented on Sep 9, 2024

🚀 The feature, motivation and pitch

Feature Proposal:
I would like to request an optimization where tokenization, model inference, and detokenization are performed asynchronously in separate processes, which should significantly improve GPU utilization. This setup would let the three stages execute in parallel, minimizing idle GPU time between pipeline phases and increasing overall throughput.

Motivation:
Currently, the three stages (tokenization, inference, detokenization) are handled sequentially, which leaves the GPU underutilized during the tokenization and detokenization phases. By splitting these stages into three asynchronous, collaborating processes, the GPU could be kept busy with inference, especially for large models where the tokenization and detokenization overhead becomes non-negligible.
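
To make the proposed shape concrete, here is a minimal sketch of a tri-process pipeline connected by queues. The tokenizer/inference/detokenizer bodies below are placeholders invented for illustration, not vLLM APIs; in a real implementation, the middle stage would own the GPU and run the engine's scheduling/continuous-batching loop.

```python
# Sketch of a tokenize -> infer -> detokenize pipeline in three processes.
# All stage bodies are placeholders, not vLLM code.
import multiprocessing as mp

SENTINEL = None  # marks end of the stream


def tokenizer_proc(in_q, out_q):
    # Placeholder tokenization: split on whitespace.
    while (prompt := in_q.get()) is not SENTINEL:
        out_q.put(prompt.split())
    out_q.put(SENTINEL)


def inference_proc(in_q, out_q):
    # Placeholder "model": echoes tokens back. A real worker would hold the
    # GPU here and never block on (de)tokenization happening in the other
    # processes.
    while (tokens := in_q.get()) is not SENTINEL:
        out_q.put(tokens)
    out_q.put(SENTINEL)


def detokenizer_proc(in_q, result_q):
    # Placeholder detokenization: join tokens back into a string.
    while (tokens := in_q.get()) is not SENTINEL:
        result_q.put(" ".join(tokens))
    result_q.put(SENTINEL)


if __name__ == "__main__":
    prompt_q, token_q, output_q, result_q = (mp.Queue() for _ in range(4))
    stages = [
        mp.Process(target=tokenizer_proc, args=(prompt_q, token_q)),
        mp.Process(target=inference_proc, args=(token_q, output_q)),
        mp.Process(target=detokenizer_proc, args=(output_q, result_q)),
    ]
    for p in stages:
        p.start()

    for prompt in ["hello world", "async pipelines keep the GPU busy"]:
        prompt_q.put(prompt)
    prompt_q.put(SENTINEL)

    # The three stages overlap: while one request is being detokenized,
    # the next can already be running inference.
    while (result := result_q.get()) is not SENTINEL:
        print(result)
    for p in stages:
        p.join()
```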

Pitch:
Implementing this feature could greatly enhance the performance of vLLM for high-throughput applications, leading to faster inference times and better resource utilization. I believe this would be beneficial for any workload where latency and throughput are critical.

Alternatives

One alternative would be to look into other frameworks such as sglang and lightllm, which have already implemented tri-process asynchronous collaboration for tokenization, model inference, and detokenization. However, those solutions may not be as optimized for, or compatible with, the specific features and design goals of vLLM. Another option could be to manually orchestrate separate tokenization, inference, and detokenization steps outside of vLLM, but this would add complexity and could introduce synchronization issues.

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
robertgshaw2-redhat (Collaborator) commented:

We have a few initiatives associated with this:
