System Info

Attempting to reuse an existing OpenAI client to stream responses from an HF endpoint doesn't work due to a couple of differences. In my case the differences break the .NET client in the Azure AI SDK, though I suspect they affect other clients too.
Differences found:
- When streaming response tokens, OpenAI terminates the stream with a final `[DONE]` string, while HF simply stops sending tokens. Clients expecting `[DONE]` get stuck waiting either for another token or for the termination string.
- OpenAI supports `0.0 <= top_p <= 1.0`, while HF supports only `0.0 < top_p < 1.0`.
- When sending `top_p = 0` to the HF endpoint, the service replies `200 OK` with an error body `{"error":"Input validation error: top_p must be > 0.0 and < 1.0","error_type":"validation"}` and no final `[DONE]`. Given the success status code and the missing termination string, the error is parsed as data and the client hangs waiting for the next token.
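The hang can be reproduced without any SDK. A minimal, self-contained sketch of the termination difference (the streams below are simulated stand-ins, not captured endpoint output):

```python
# Sketch of the termination difference: an OpenAI-style stream ends with a
# "data: [DONE]" sentinel, while an HF-style stream just stops sending lines.

def read_stream(lines):
    """Collect SSE data payloads until the [DONE] sentinel or end of stream."""
    tokens = []
    saw_done = False
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            saw_done = True
            break  # explicit termination (OpenAI behaviour)
        tokens.append(payload)
    # An HF-style stream falls out of the loop only when the connection
    # closes; a client blocked waiting for the next line never gets here.
    return tokens, saw_done

openai_style = ['data: {"c":"1"}', 'data: {"c":"+1=2"}', 'data: [DONE]']
hf_style     = ['data: {"c":"1"}', 'data: {"c":"+1=2"}']  # no sentinel

print(read_stream(openai_style))  # two tokens, saw_done=True
print(read_stream(hf_style))      # two tokens, saw_done=False
```

Over a real connection the HF case is worse than `saw_done=False`: the iteration blocks on the next read instead of returning.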
Information

- Docker
- The CLI directly

Tasks

- An officially supported command
- My own modifications
Reproduction
Example 1: error with top_p = 0
Request:
```shell
curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer ${HF_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":50,
      "temperature":0,
      "top_p":0.0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'
```
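Until the ranges are aligned, a caller that controls its own request building can clamp `top_p` into HF's open interval before sending. A hedged sketch (the epsilon margin is an arbitrary choice of mine, not anything either API documents):

```python
def clamp_top_p(top_p: float, eps: float = 1e-4) -> float:
    """Map an OpenAI-range top_p (0.0 <= p <= 1.0) into HF's open
    interval (0.0 < p < 1.0). eps is an arbitrary safety margin."""
    return min(max(top_p, eps), 1.0 - eps)

# The request above sends top_p = 0.0, which HF rejects; clamped:
print(clamp_top_p(0.0))  # 0.0001
print(clamp_top_p(1.0))  # 0.9999
print(clamp_top_p(0.5))  # unchanged
```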
Response (a `200 OK` with the validation error and no final `[DONE]`):

```
{"error":"Input validation error: top_p must be > 0.0 and < 1.0","error_type":"validation"}
```

OpenAI returns a response instead (see next example).
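Because the error arrives with a success status code inside the stream, a defensive client has to inspect each streamed payload for an error object before treating it as a token delta. A sketch using the error shape quoted above:

```python
import json

def classify_chunk(payload: str):
    """Return ('done' | 'error' | 'data', value) for one SSE data payload.
    Guards against HF's 200 OK + {"error": ...} in-stream failure mode."""
    if payload == "[DONE]":
        return ("done", None)
    obj = json.loads(payload)
    if "error" in obj:
        return ("error", obj["error"])
    return ("data", obj)

# The exact error body HF streams back for top_p = 0:
hf_error = ('{"error":"Input validation error: top_p must be > 0.0 and '
            '< 1.0","error_type":"validation"}')
print(classify_chunk(hf_error)[0])   # error
print(classify_chunk("[DONE]")[0])   # done
```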
Example 2: OpenAI response includes `[DONE]`
Request:
Response:
Example 3: HF response is missing `[DONE]`
Request:
Response:
Expected behavior
It would be great if it were possible to reuse OpenAI clients (and apps built on them) simply by pointing them at https://api-inference.huggingface.co.

While it's possible to work around the different `top_p` range by changing the code (if the app allows for it), the missing termination string makes these clients unusable as-is.
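Until HF emits the sentinel itself, one client-side workaround (a sketch of mine, not something any SDK provides) is to wrap the raw SSE line stream and append `data: [DONE]` when the upstream connection closes, so unmodified OpenAI clients terminate cleanly:

```python
def with_done_sentinel(upstream):
    """Yield SSE lines from `upstream` (any iterable of decoded SSE lines,
    hypothetical plumbing) and append the OpenAI-style terminator if the
    stream ends without one."""
    last = None
    for line in upstream:
        last = line
        yield line
    if last != "data: [DONE]":
        yield "data: [DONE]"  # injected terminator for HF-style streams

hf_lines = ['data: {"c":"2"}']  # HF stream: ends with no sentinel
print(list(with_done_sentinel(hf_lines))[-1])  # data: [DONE]
```

An OpenAI-style stream that already ends with the sentinel passes through unchanged.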