[Feature]: Breaking Down Single Process into Asynchronous Tokenization, Model Inference, and Detokenization for Enhanced GPU Utilization #8295

Closed · 1 task done · hxer7963 opened this issue on Sep 9, 2024 · 1 comment

hxer7963 (Contributor) commented on Sep 9, 2024

🚀 The feature, motivation and pitch

Feature Proposal:
I would like to request an optimization where tokenization, model inference, and detokenization are performed asynchronously in separate processes, which should significantly improve GPU utilization. This setup would let the three stages execute in parallel, minimizing idle GPU time between pipeline phases and increasing overall throughput.

Motivation:
Currently, the three stages (tokenization, inference, detokenization) are handled sequentially, which leaves the GPU underutilized during the tokenization and detokenization phases. By splitting these stages into three asynchronous, collaborating processes, the GPU could be kept busy with inference, especially for large models where the tokenization and detokenization overhead becomes non-negligible.
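
To make the proposed shape concrete, here is a minimal sketch of a tri-process pipeline connected by queues. The tokenizer/inference/detokenizer bodies below are placeholders invented for illustration, not vLLM APIs; in a real implementation, the middle stage would own the GPU and run the engine's scheduling/continuous-batching loop.

```python
# Sketch of a tokenize -> infer -> detokenize pipeline in three processes.
# All stage bodies are placeholders, not vLLM code.
import multiprocessing as mp

SENTINEL = None  # marks end of the stream


def tokenizer_proc(in_q, out_q):
    # Placeholder tokenization: split on whitespace.
    while (prompt := in_q.get()) is not SENTINEL:
        out_q.put(prompt.split())
    out_q.put(SENTINEL)


def inference_proc(in_q, out_q):
    # Placeholder "model": echoes tokens back. A real worker would hold the
    # GPU here and never block on (de)tokenization happening in the other
    # processes.
    while (tokens := in_q.get()) is not SENTINEL:
        out_q.put(tokens)
    out_q.put(SENTINEL)


def detokenizer_proc(in_q, result_q):
    # Placeholder detokenization: join tokens back into a string.
    while (tokens := in_q.get()) is not SENTINEL:
        result_q.put(" ".join(tokens))
    result_q.put(SENTINEL)


if __name__ == "__main__":
    prompt_q, token_q, output_q, result_q = (mp.Queue() for _ in range(4))
    stages = [
        mp.Process(target=tokenizer_proc, args=(prompt_q, token_q)),
        mp.Process(target=inference_proc, args=(token_q, output_q)),
        mp.Process(target=detokenizer_proc, args=(output_q, result_q)),
    ]
    for p in stages:
        p.start()

    for prompt in ["hello world", "async pipelines keep the GPU busy"]:
        prompt_q.put(prompt)
    prompt_q.put(SENTINEL)

    # The three stages overlap: while one request is being detokenized,
    # the next can already be running inference.
    while (result := result_q.get()) is not SENTINEL:
        print(result)
    for p in stages:
        p.join()
```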

Pitch:
Implementing this feature could greatly enhance the performance of vLLM for high-throughput applications, leading to faster inference times and better resource utilization. I believe this would be beneficial for any workload where latency and throughput are critical.

Alternatives

One alternative would be to look into other frameworks such as sglang and lightllm, which have already implemented tri-process asynchronous collaboration for tokenization, model inference, and detokenization. However, those solutions may not be as optimized for, or compatible with, the specific features and design goals of vLLM. Another option could be to manually orchestrate separate tokenization, inference, and detokenization steps outside of vLLM, but this would add complexity and could introduce synchronization issues.

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
robertgshaw2-redhat (Collaborator) commented:

We have a few initiatives associated with this:
