[Bug]: sending request using response_format json twice breaks vLLM #4070
Comments
request-body-sanitized.json
Looks like some issues with outlines' FSM copying and initialization. For this case, using lm-format-enforcer might be better: #3868
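For anyone who wants to try that alternative, here is a minimal sketch of selecting the lm-format-enforcer backend per request. It assumes a vLLM build that includes #3868 and accepts a `guided_decoding_backend` field in the request (otherwise the server-level `--guided-decoding-backend` flag is where to set it); the base URL and prompt are placeholders:

```python
# Sketch: ask vLLM to use the lm-format-enforcer guided-decoding backend for
# this request instead of outlines. Assumes a vLLM version that supports a
# per-request "guided_decoding_backend" field; base_url and the prompt are
# placeholder assumptions.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

completion = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "Return a JSON object describing a student."}],
    response_format={"type": "json_object"},
    max_tokens=200,
    # vLLM-specific field passed through the OpenAI client's extra_body hook.
    extra_body={"guided_decoding_backend": "lm-format-enforcer"},
)
print(completion.choices[0].message.content)
```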
There seems to be a bug with:

```python
response_format = {
    "type": "json_object"
},
```
```python
from openai import OpenAI
import os

def prompt_json_completion(messages):
    base_url = os.getenv("BASE_URL", "http://localhost:8000/v1")
    api_key = os.getenv("API_KEY", "EMPTY")
    # Cast to int: os.getenv returns a string when MAX_TOKENS is set in the environment.
    max_tokens = int(os.getenv("MAX_TOKENS", 100))
    client = OpenAI(api_key=api_key, base_url=base_url)
    completion = client.chat.completions.create(
        model=client.models.list().data[0].id,
        # response_format = {
        #     "type": "json_object"
        # },
        messages=messages,
        max_tokens=max_tokens,
    )
    # print(completion)
    print(completion.choices[0].message.content)

if __name__ == "__main__":
    user_prompt = "Generate example JSON data of a student in an SIS"
    messages = [
        {"role": "user", "content": user_prompt}
    ]
    prompt_json_completion(messages=messages)
```

I am getting all whitespace if I uncomment the response_format lines.
I have the same error with json_object. Did anyone encounter this error with a previous version?
Same problem when setting "response_format" to {"type": "json_object"}: text generation only stops when reaching the max model length. When setting "response_format" to {"type": "text"}, everything goes well.
Outlines has made several improvements to its JSON output, and vLLM previously pinned it to an older version. These issues might have been fixed with the nightly (see line 20 in abe855d).
@maxdebayser I recently tried
Yes, I am also having the same problem with these versions.
Same as you. I have to give up response_format.
Hi guys! Important to note: when you're using response_format with "type": "json_object", you need to instruct the model in the prompt to actually produce JSON. The following guidance is from the OpenAI docs, but it applies to vLLM as well: basically, if you try to force the model to generate JSON when it's trying to generate natural text, it may produce just whitespace if whitespace tokens are more likely than a `{` token. Hope that helps!
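To make that guidance concrete, here is a minimal sketch of the workaround: keep response_format set to json_object but also tell the model in the prompt to emit JSON (the base URL, model lookup, and prompt wording are assumptions, following the script posted earlier):

```python
# Sketch: pair response_format={"type": "json_object"} with an explicit
# instruction to produce JSON, as recommended by the OpenAI docs cited above.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [
    # The instruction to respond in JSON is the important part: without it,
    # the model may put most of its probability mass on whitespace tokens.
    {"role": "system", "content": "You are a helpful assistant. Respond only with a valid JSON object."},
    {"role": "user", "content": "Generate example JSON data of a student in an SIS."},
]

completion = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=200,
)
print(completion.choices[0].message.content)
```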
It may be worth adding something about this to the vLLM docs -- it seems to be a point of confusion; I have been discussing this with the Nous team too.
Same issue when using
Your current environment
🐛 Describe the bug
vLLM gets into a corrupted state and only responds with garbage after being sent a specific `response_format = json` request. For the first request vLLM is able to respond with a somewhat reasonable response, but once you repeat the same request it starts responding only with `\n\t\t\t\t...`, where `\t` repeats until `max_tokens` is reached.

Steps to reproduce: deploy `mistralai/Mistral-7B-Instruct-v0.2`. The following config was used; this is running on a single L4 GPU.
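A minimal reproduction sketch, assuming the server exposes the OpenAI-compatible API at http://localhost:8000/v1 and that the sanitized request body is roughly the chat request below (the prompt and max_tokens value are placeholders):

```python
# Sketch: send the same response_format={"type": "json_object"} request twice
# against a vLLM OpenAI-compatible server; per this report, the second response
# degrades into repeated whitespace. The URL, prompt, and max_tokens are
# placeholder assumptions.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

request = dict(
    model=model,
    messages=[{"role": "user", "content": "Generate example JSON data of a student in an SIS."}],
    response_format={"type": "json_object"},
    max_tokens=100,
)

for attempt in (1, 2):
    completion = client.chat.completions.create(**request)
    print(f"--- attempt {attempt} ---")
    # repr() makes the repeated \n and \t characters visible.
    print(repr(completion.choices[0].message.content))
```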
Current results:
- `\n` will repeat until max tokens is hit
- `\n\t\t\t\t`, where `\t` repeats until max tokens is hit

Occasionally vLLM gets into a bad state where all requests return errors as well, but I can't consistently get into that state. The following errors were seen when that happens:
This issue was originally reported in Lingo: substratusai/kubeai#96 but it seems to be an issue with vLLM itself.
Expected results
vLLM should not get into a broken state, where subsequent responses do not provide any results, due to using `response_format = json`.