Update Dockerfile and Readme
Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
gavrissh committed May 10, 2024
1 parent ce9b0e8 commit 081c893
Showing 2 changed files with 16 additions and 7 deletions.
python/huggingface_server.Dockerfile (2 changes: 1 addition & 1 deletion)
@@ -9,7 +9,7 @@ ARG POETRY_HOME=/opt/poetry
 ARG POETRY_VERSION=1.7.1
 
 # Install vllm
-ARG VLLM_VERSION=0.4.0.post1
+ARG VLLM_VERSION=0.4.2
 
 RUN apt-get update -y && apt-get install gcc python3.10-venv python3-dev -y && apt-get clean && \
     rm -rf /var/lib/apt/lists/*
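The pinned vllm version is consumed when the image is rebuilt. A minimal build sketch, assuming the `python` directory is used as the build context and an illustrative image tag (neither is stated in this diff):

```bash
# Build the huggingface server image with the pinned vllm version (0.4.2)
docker build -t kserve/huggingfaceserver:latest \
  -f python/huggingface_server.Dockerfile python

# The declared ARG can also be overridden at build time if a different vllm release is needed
docker build --build-arg VLLM_VERSION=0.4.2 \
  -t kserve/huggingfaceserver:vllm-0.4.2 \
  -f python/huggingface_server.Dockerfile python
```
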
python/huggingfaceserver/README.md (21 changes: 15 additions & 6 deletions)
@@ -111,7 +111,7 @@ spec:
         cpu: 100m
         memory: 2Gi
 ```
-3. Serve the huggingface model using vllm runtime. vllm is the default runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
+3. Serve the huggingface model using vllm runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
 vllm supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
 ```yaml
 apiVersion: serving.kserve.io/v1beta1
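The InferenceService manifest that follows in the README is collapsed in this view. A minimal sketch along the same lines: the `--model_name`/`--model_id` args and the `cpu: "6"` limit mirror lines visible later in this diff, while the resource name, image tag, and remaining resource values are illustrative assumptions:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2            # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest   # assumed image tag
        args:
          - --model_name=llama2
          - --model_id=meta-llama/Llama-2-7b-chat-hf
        resources:                     # assumed resource values
          limits:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
```
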
@@ -138,7 +138,7 @@ spec:
 ```
-If vllm needs to be disabled include the flag `--disable_vllm` in the container args. In this case the KServe python runtime will be used.
+If vllm needs to be disabled include the flag `--backend=huggingface` in the container args. In this case the KServe python runtime will be used.
 ```yaml
 apiVersion: serving.kserve.io/v1beta1
@@ -153,7 +153,7 @@ spec:
         args:
           - --model_name=llama2
           - --model_id=meta-llama/Llama-2-7b-chat-hf
-          - --disable_vllm
+          - --backend=huggingface
         resources:
           limits:
             cpu: "6"
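Either variant is deployed like any other InferenceService. A short sketch of the usual flow, with the manifest file name and namespace as placeholder assumptions:

```bash
# Apply the manifest (file name and namespace are placeholders)
kubectl apply -f huggingface-llama2.yaml -n kserve-test

# Watch until the InferenceService reports READY
kubectl get inferenceservice huggingface-llama2 -n kserve-test -w

# Retrieve the serving URL once ready
kubectl get inferenceservice huggingface-llama2 -n kserve-test -o jsonpath='{.status.url}'
```
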
@@ -167,9 +167,18 @@ spec:
 Perform the inference for vllm specific runtime
-vllm runtime deployments only support `/generate` endpoint for inference. Please refer to [text generation API schema](https://github.com/kserve/open-inference-protocol/blob/main/specification/protocol/generate_rest.yaml) for more details.
+vllm runtime deployments only support OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
+Sample OpenAI Completions request
+```bash
+curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model": "gpt2", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
+{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"gpt2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
+```
+Sample OpenAI Chat request
 ```bash
-curl -H "content-type:application/json" -v localhost:8080/v2/models/gpt2/generate -d '{"text_input": "The capital of france is [MASK]." }'
+curl -H "content-type:application/json" -v localhost:8080/openai/v1/chat/completions -d '{"model": "gpt2", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
-{"text_output":"The capital of france is [MASK].\n\nThe capital of France is actually Paris.","model_name":"llama2","model_version":null,"details":null}
+{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"gpt2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
 ```
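The new samples send `"stream":false`; the same OpenAI-style endpoints also accept `"stream": true`, in which case the response arrives incrementally as server-sent events. A hedged sketch, assuming streaming is enabled in this runtime version (prompt and message text are placeholders):

```bash
# Streaming completion: chunks arrive as "data: {...}" lines; -N disables curl buffering
curl -H "content-type:application/json" -N localhost:8080/openai/v1/completions \
  -d '{"model": "gpt2", "prompt": "<prompt>", "stream": true, "max_tokens": 30 }'

# Streaming chat completion
curl -H "content-type:application/json" -N localhost:8080/openai/v1/chat/completions \
  -d '{"model": "gpt2", "messages": [{"role": "user", "content": "<message>"}], "stream": true }'
```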
