🚀 The feature, motivation and pitch
Feature Proposal:
I would like to request an optimization in which tokenization, model inference, and detokenization run asynchronously in separate processes. Overlapping these stages would keep the GPU busy with inference while the CPU handles tokenization and detokenization, minimizing idle GPU time between pipeline phases and increasing overall throughput.
Motivation:
Currently, these three stages (tokenization, inference, detokenization) are handled sequentially, so the GPU sits idle during the tokenization and detokenization phases. Splitting them into three asynchronously cooperating processes would let the GPU be used more efficiently, especially for large models where tokenization and detokenization overhead becomes non-negligible.
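For illustration, here is a minimal sketch of the kind of decoupling I have in mind, using only the Python standard library. The tokenizer, model step, and detokenizer below are hypothetical placeholders standing in for vLLM's real components; the point is only the queue-based, three-process structure.

```python
# Sketch: three stages connected by queues, each in its own process.
# The tokenizer/model/detokenizer bodies are placeholders, not vLLM code.
import multiprocessing as mp

def tokenize_worker(in_q: mp.Queue, out_q: mp.Queue) -> None:
    # CPU-bound: convert raw text into token ids ahead of the GPU stage.
    while (prompt := in_q.get()) is not None:
        token_ids = [ord(c) for c in prompt]        # placeholder tokenizer
        out_q.put((prompt, token_ids))
    out_q.put(None)                                  # propagate shutdown

def inference_worker(in_q: mp.Queue, out_q: mp.Queue) -> None:
    # GPU-bound: ideally the only stage that ever waits on the accelerator.
    while (item := in_q.get()) is not None:
        prompt, token_ids = item
        output_ids = list(reversed(token_ids))       # placeholder "model"
        out_q.put((prompt, output_ids))
    out_q.put(None)

def detokenize_worker(in_q: mp.Queue, out_q: mp.Queue) -> None:
    # CPU-bound: turn output ids back into text while the GPU stage
    # is already working on the next request.
    while (item := in_q.get()) is not None:
        prompt, output_ids = item
        text = "".join(chr(i) for i in output_ids)   # placeholder detokenizer
        out_q.put((prompt, text))
    out_q.put(None)

if __name__ == "__main__":
    q_in, q_tok, q_out, q_final = mp.Queue(), mp.Queue(), mp.Queue(), mp.Queue()
    stages = [
        mp.Process(target=tokenize_worker, args=(q_in, q_tok)),
        mp.Process(target=inference_worker, args=(q_tok, q_out)),
        mp.Process(target=detokenize_worker, args=(q_out, q_final)),
    ]
    for p in stages:
        p.start()
    for prompt in ["hello", "world"]:
        q_in.put(prompt)
    q_in.put(None)                                   # signal end of input
    while (result := q_final.get()) is not None:
        print(result)
    for p in stages:
        p.join()
```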
Pitch:
Implementing this feature could greatly enhance the performance of vLLM for high-throughput applications, leading to faster inference times and better resource utilization. I believe this would be beneficial for any workload where latency and throughput are critical.
Alternatives
One alternative would be to use other frameworks such as sglang and lightllm, which have already implemented tri-process asynchronous collaboration for tokenization, model inference, and detokenization; however, they may not be as optimized for, or compatible with, vLLM's specific features and design goals. Another option is to manually orchestrate separate tokenization, inference, and detokenization steps outside of vLLM, but that adds complexity and could introduce synchronization issues.
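For reference, the manual-orchestration alternative could look roughly like the sketch below, which tokenizes the next batch on a background thread while the current batch runs inference. `tokenize`, `run_model`, and `detokenize` are hypothetical placeholders, not vLLM APIs.

```python
# Sketch: overlap tokenization of batch i+1 with inference on batch i.
# All three functions are stand-ins for whatever the user wires up manually.
from concurrent.futures import ThreadPoolExecutor

def tokenize(batch):
    return [[ord(c) for c in text] for text in batch]       # placeholder tokenizer

def run_model(token_batches):
    return [list(reversed(ids)) for ids in token_batches]   # placeholder inference

def detokenize(output_batches):
    return ["".join(chr(i) for i in ids) for ids in output_batches]

def pipelined_generate(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_tokens = pool.submit(tokenize, batches[0])
        for i, _ in enumerate(batches):
            tokens = next_tokens.result()
            if i + 1 < len(batches):
                # Start tokenizing the next batch before blocking on the GPU.
                next_tokens = pool.submit(tokenize, batches[i + 1])
            results.extend(detokenize(run_model(tokens)))
    return results

print(pipelined_generate([["hello", "world"], ["foo", "bar"]]))
```

A thread-based version like this only overlaps work to the extent that the inference call releases the GIL, which is one reason separate processes, as proposed above, are the more robust design.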
Additional context
No response