Do you support streaming generating outputs? #245
-
Replies: 6 comments
-
Yes, our FastAPI and OpenAI servers support streaming outputs. Just set up the server with python -m vllm.entrypoints.api_server or python -m vllm.entrypoints.openai.api_server, and then add "stream": True to the client request (by default it is False). See vllm/examples/api_client.py, line 26 in 665c489.
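For illustration, here is a minimal sketch of such a streaming client (not the exact contents of api_client.py). It assumes the demo FastAPI server started with python -m vllm.entrypoints.api_server is listening on localhost:8000, exposes a /generate endpoint, and streams JSON chunks separated by null bytes; check your vLLM version for the actual endpoint and framing.

# Minimal streaming-client sketch for the demo FastAPI server.
# Assumptions (verify for your version): server on localhost:8000,
# a /generate endpoint, and JSON chunks separated by null bytes.
import json
import requests

payload = {
    "prompt": "San Francisco is a",
    "max_tokens": 64,
    "stream": True,  # the flag discussed above; defaults to False
}
response = requests.post("http://localhost:8000/generate", json=payload, stream=True)

for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode("utf-8"))
        # Each chunk carries the text generated so far.
        print(data["text"])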
-
In addition, you can see the streaming result from the API server via python vllm/examples/api_client.py --stream.
-
@WoosukKwon python vllm/examples/api_client.py --stream works. But does it support the same streaming API as OpenAI? For example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "stream": true
  }'
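For what it is worth, here is a rough sketch of consuming the same stream with the openai Python package instead of curl (0.x-style client assumed; the base URL and model name are taken from the curl example above, and the exact streamed payload may differ by server version):

# Sketch: consuming the OpenAI-compatible streaming endpoint with the
# openai Python package (0.x-style API assumed).
import openai

openai.api_key = "EMPTY"  # placeholder; a local server typically ignores it
openai.api_base = "http://localhost:8000/v1"

stream = openai.ChatCompletion.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    stream=True,  # same flag as in the curl request
)

for chunk in stream:
    # Each streamed chunk carries a delta with newly generated text.
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)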
-
Does streaming output only support returning an async generator?
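For context, the question refers to the engine-level interface rather than the HTTP servers. Below is a rough sketch of how the async generator returned by vLLM's AsyncLLMEngine might be consumed (class and method names taken from the vLLM codebase; exact signatures can vary between versions):

# Sketch: consuming vLLM's engine-level streaming interface, which yields
# results from an async generator. Names and signatures may vary by version.
import asyncio
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=64)
    # generate() is an async generator: it yields RequestOutput objects
    # as new tokens are produced.
    async for output in engine.generate("Hello, my name is", params, request_id="0"):
        print(output.outputs[0].text)

asyncio.run(main())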
-
Your documentation feels quite hard to understand; after reading it for a while I still had to go look at the source code 😂 Could you consider adding a few more demos? Thanks.
-
How can I get the usage metrics (number of input tokens and output tokens) in streaming mode? @WoosukKwon