
[Usage]: How do I configure Phi-3-vision for high throughput? #7751

Open
hommayushi3 opened this issue Aug 21, 2024 · 8 comments
Labels
usage How to use vllm

Comments

@hommayushi3

hommayushi3 commented Aug 21, 2024

How would you like to use vllm

I want to run Phi-3-vision with vLLM to support parallel calls with high throughput. In my setup (an OpenAI-compatible vLLM 0.5.4 server on Hugging Face Inference Endpoints with an NVIDIA L4 24 GB GPU), I have configured Phi-3-vision with the following parameters:

DISABLE_SLIDING_WINDOW=true
DTYPE=bfloat16
ENFORCE_EAGER=true   # Tried both true/false
GPU_MEMORY_UTILIZATION=0.98  # Tried 0.6-0.99
MAX_MODEL_LEN=3072  # Smallest token length that supports my work
MAX_NUM_BATCHED_TOKENS=12288  # Tried 3072-12288
MAX_NUM_SEQS=16  # Tried 2-32
QUANTIZATION=fp8  # Tried fp8 and None
TRUST_REMOTE_CODE=true
VLLM_ATTENTION_BACKEND=FLASH_ATTN
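
For reference, assuming these environment variables map directly onto the corresponding vLLM engine arguments, the setup is roughly equivalent to the following offline configuration (the model name is shown for illustration; VLLM_ATTENTION_BACKEND is read from the environment rather than passed as an argument):

from vllm import LLM

# Rough offline equivalent of the server configuration above.
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    dtype="bfloat16",
    max_model_len=3072,
    max_num_seqs=16,
    max_num_batched_tokens=12288,
    gpu_memory_utilization=0.98,
    quantization="fp8",
    enforce_eager=True,
    disable_sliding_window=True,
    trust_remote_code=True,
)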

I am running into the issue that no matter what settings I use, adding more concurrent calls is increasing the total inference time linearly; the batching parallelism is not working. For example, running 4 concurrent requests takes 12 seconds, but 1 request by itself takes 3 seconds.

The logs show:

Avg prompt throughput: 3461 tokens/s, Avg generation throughput: 39.4 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 68.3%, CPU KV cache usage: 0.0%
Avg prompt throughput: 0 tokens/s, Avg generation throughput: 154.3 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 40.8%, CPU KV cache usage: 0.0%

Questions:

  1. Is this a configuration/usage issue? What other parameters might I be missing?
  2. Is this an issue with Phi-3-vision? (might be related to this issue)
  3. Would this be fixed with Phi-3.5-vision?
@hommayushi3 hommayushi3 added the usage How to use vllm label Aug 21, 2024
@Dineshkumar-Anandan-ZS0367

Have you deployed any vision language model across two machines, e.g. with pipeline parallelism? Can you suggest some ideas?

Thanks for any suggestions. How do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?

@DarkLight1337
Member

DarkLight1337 commented Aug 22, 2024

Have you deployed any vision language model across two machines, e.g. with pipeline parallelism?

PP is not yet supported for vision language models (#7684). Also, the model has not been fully TP'ed yet (#7186). The performance should improve after these PRs are completed.

@DarkLight1337
Member

DarkLight1337 commented Aug 22, 2024

Thanks for any suggestions. How do I send an API request to the vision model? I need to send both an image and a prompt. Does vLLM currently support text only?

vLLM's server supports image input via the OpenAI Chat Completions API. Please refer to OpenAI's docs for more details.
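
A minimal sketch with the openai Python client (the base URL, model name, and image URL below are placeholders for your own deployment):

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server; the API key is not checked unless the server enforces one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)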

@hommayushi3
Author

I don't think either of these is relevant to my issue. I am using a single NVIDIA L4, not a multi-GPU setup.

@DarkLight1337
Member

DarkLight1337 commented Aug 22, 2024

I suggest profiling the code to see where the bottleneck is. It's possible that most of the execution time is taken up by the model forward pass, in which case there can hardly be any improvement from adjusting the batching params.
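
As a quick client-side sanity check (not a full profile), something like the following can show whether end-to-end latency grows linearly with the number of in-flight requests; the base URL, model name, and prompt are placeholders, and you would add an image part to match your real workload:

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Hypothetical timing check: send the same request at increasing concurrency
# and compare total wall-clock time.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request(_):
    return client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=64,
    )

for concurrency in (1, 2, 4, 8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} concurrent requests: {elapsed:.2f}s total")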

@DarkLight1337
Member

@youkaichao @ywang96 perhaps you have a better idea of this?

@youkaichao
Member

definitely it needs profiling first.

@ywang96
Member

ywang96 commented Sep 2, 2024

For example, running 4 concurrent requests takes 12 seconds, but 1 request by itself takes 3 seconds.

@hommayushi3 Can you share the information on how you currently set up the workload, including

  • vLLM version and launch args of the server/LLM class
  • How you send your requests, and an example of a single request that you're sending.

Without this information, we can't really help on how to optimize for your workload.
