Add vLLM on Ray microservice (#285)
Signed-off-by: Xinyao Wang <xinyao.wang@intel.com>
XinyaoWa authored Jul 12, 2024
1 parent 80da5a8 commit ec3b2e8
Showing 14 changed files with 620 additions and 7 deletions.
10 changes: 3 additions & 7 deletions comps/llms/text-generation/ray_serve/README.md
@@ -21,13 +21,9 @@ export HF_TOKEN=<token>
And then you can make requests with the OpenAI-compatible APIs like below to check the service status:

```bash
curl http://127.0.0.1:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 32,
}'
curl http://172.17.0.1:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": <model_name>, "messages": [{"role": "user", "content": "How are you?"}], "max_tokens": 32 }'
```

For more information about the OpenAI APIs, you can check the [OpenAI official documentation](https://platform.openai.com/docs/api-reference/).
81 changes: 81 additions & 0 deletions comps/llms/text-generation/vllm-ray/README.md
@@ -0,0 +1,81 @@
# VLLM-Ray Endpoint Service

[Ray](https://docs.ray.io/en/latest/serve/index.html) is an LLM serving solution built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) that makes it easy to deploy and manage a variety of open-source LLMs. It has native support for autoscaling and multi-node deployments and is easy to use for LLM inference serving on Intel Gaudi2 accelerators. The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Please visit [Habana AI products](https://habana.ai/products) for more details.

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products).

This guide provides an example of how to launch a vLLM endpoint served with Ray Serve on Gaudi accelerators.

## Set up environment

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
export vLLM_RAY_ENDPOINT="http://${your_ip}:8006"
export LLM_MODEL=${your_hf_llm_model}
```

For gated models such as `LLAMA-2`, you will have to set the `HUGGINGFACEHUB_API_TOKEN` environment variable. Please follow the [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) guide to get an access token and export it as `HUGGINGFACEHUB_API_TOKEN`.
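
If you want to sanity-check the token before launching anything, below is a minimal sketch using `huggingface_hub` (already listed in `requirements.txt`); it assumes the token has been exported as described above:

```python
# Minimal sketch: verify that HUGGINGFACEHUB_API_TOKEN grants access before starting the service.
import os

from huggingface_hub import whoami

token = os.environ["HUGGINGFACEHUB_API_TOKEN"]
info = whoami(token=token)  # raises an error if the token is invalid
print(f"Authenticated to Hugging Face as: {info['name']}")
```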

## Set up VLLM Ray Gaudi Service

### Build docker

```bash
bash ./build_docker_vllmray.sh
```

### Launch the service

```bash
bash ./launch_vllmray.sh
```

The `launch_vllmray.sh` script accepts four parameters:

- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8006.
- model_name: The model served by the endpoint, with the default set to meta-llama/Llama-2-7b-chat-hf.
- parallel_number: The number of HPUs per worker process, with the default set to 2.
- enforce_eager: Whether to enforce eager execution, with the default set to False.

If you want to customize the settings, you can run:

```bash
bash ./launch_vllmray.sh ${port_number} ${model_name} ${parallel_number} False/True
```
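
For example, `bash ./launch_vllmray.sh 8006 meta-llama/Llama-2-7b-chat-hf 2 False` launches the default model on two HPUs without enforcing eager execution.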

### Query the service

You can then make requests with the OpenAI-compatible APIs, as shown below, to check the service status:

```bash
curl http://${your_ip}:8006/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": $LLM_MODEL, "messages": [{"role": "user", "content": "How are you?"}]}'
```
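
Because the endpoint is OpenAI-compatible, you can also query it from Python. Below is a minimal sketch using the `openai` client listed in `requirements.txt`; it assumes the `openai>=1.0` client interface and the `vLLM_RAY_ENDPOINT`/`LLM_MODEL` variables exported in the setup step:

```python
# Minimal sketch: query the vLLM-on-Ray endpoint through its OpenAI-compatible API.
# Assumes vLLM_RAY_ENDPOINT and LLM_MODEL are exported as in the setup section.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["vLLM_RAY_ENDPOINT"] + "/v1",
    api_key="not_needed",  # the endpoint does not validate API keys
)

completion = client.chat.completions.create(
    model=os.environ["LLM_MODEL"],
    messages=[{"role": "user", "content": "How are you?"}],
    max_tokens=32,
)
print(completion.choices[0].message.content)
```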

For more information about the OpenAI APIs, you can check the [OpenAI official documentation](https://platform.openai.com/docs/api-reference/).

## Set up OPEA microservice

Then we wrap the VLLM Ray service into an OPEA microservice.

### Build docker

```bash
bash ./build_docker_microservice.sh
```

### Launch the microservice

```bash
bash ./launch_microservice.sh
```

### Query the microservice

```bash
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'
```
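
The same request can be issued from Python. The sketch below mirrors the curl payload above and assumes the `requests` package is available on the client side (it is not part of `requirements.txt`):

```python
# Minimal sketch: query the OPEA LLM microservice; mirrors the curl example above.
import requests

url = "http://localhost:9000/v1/chat/completions"  # replace localhost with ${your_ip} if remote
payload = {
    "query": "What is Deep Learning?",
    "max_new_tokens": 17,
    "top_k": 10,
    "top_p": 0.95,
    "typical_p": 0.95,
    "temperature": 0.01,
    "repetition_penalty": 1.03,
    "streaming": False,
}

response = requests.post(url, json=payload, timeout=600)
response.raise_for_status()
print(response.json())
```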
2 changes: 2 additions & 0 deletions comps/llms/text-generation/vllm-ray/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
9 changes: 9 additions & 0 deletions comps/llms/text-generation/vllm-ray/build_docker_microservice.sh
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
-t opea/llm-vllm-ray:latest \
--build-arg https_proxy=$https_proxy \
--build-arg http_proxy=$http_proxy \
-f comps/llms/text-generation/vllm-ray/docker/Dockerfile.microservice .
12 changes: 12 additions & 0 deletions comps/llms/text-generation/vllm-ray/build_docker_vllmray.sh
@@ -0,0 +1,12 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd docker

docker build \
-f Dockerfile.vllmray ../../ \
-t vllm_ray:habana \
--network=host \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
--build-arg no_proxy=${no_proxy}
37 changes: 37 additions & 0 deletions comps/llms/text-generation/vllm-ray/docker/Dockerfile.microservice
@@ -0,0 +1,37 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM langchain/langchain:latest

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r /home/user/comps/llms/text-generation/vllm-ray/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/text-generation/vllm-ray

ENTRYPOINT ["python", "llm.py"]
31 changes: 31 additions & 0 deletions comps/llms/text-generation/vllm-ray/docker/Dockerfile.vllmray
@@ -0,0 +1,31 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8

WORKDIR /root/vllm-ray

# copy the source code to the package directory
COPY ../../vllm-ray/ /root/vllm-ray

RUN pip install --upgrade-strategy eager optimum[habana] && \
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.1
# RUN pip install -v git+https://github.com/HabanaAI/vllm-fork.git@ae3d6121
RUN pip install -v git+https://github.com/HabanaAI/vllm-fork.git@cf6952d
RUN pip install "ray>=2.10" "ray[serve,tune]>=2.10"

RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
service ssh restart

ENV no_proxy=localhost,127.0.0.1
ENV PYTHONPATH=$PYTHONPATH:/root:/root/vllm-ray

# Required by DeepSpeed
ENV RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES=1

ENV PT_HPU_LAZY_ACC_PAR_MODE=0

ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
13 changes: 13 additions & 0 deletions comps/llms/text-generation/vllm-ray/launch_microservice.sh
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

docker run -d --rm \
--name="llm-vllm-ray-server" \
-p 9000:9000 \
--ipc=host \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e vLLM_RAY_ENDPOINT=$vLLM_RAY_ENDPOINT \
-e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e LLM_MODEL=$LLM_MODEL \
opea/llm-vllm-ray:latest
43 changes: 43 additions & 0 deletions comps/llms/text-generation/vllm-ray/launch_vllmray.sh
@@ -0,0 +1,43 @@
#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Set default values
default_port=8006
default_model=$LLM_MODEL
default_parallel_number=2
default_enforce_eager=False

# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
parallel_number=${3:-$default_parallel_number}
enforce_eager=${4:-$default_enforce_eager}

# Validate the number of optional arguments
if [ "$#" -gt 4 ]; then
echo "Usage: $0 [port_number] [model_name] [parallel_number] [enforce_eager]"
echo "Please customize the arguments you want to use.
- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8006.
- model_name: The model name utilized for LLM, with the default set to meta-llama/Llama-2-7b-chat-hf.
- parallel_number: The number of HPUs per worker process, with the default being 2.
- enforce_eager: Whether to enforce eager execution, with the default being False."
exit 1
fi

# Run the vLLM-on-Ray serving container on Gaudi
docker run -d --rm \
--name="vllm-ray-service" \
--runtime=habana \
-v $PWD/data:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--ipc=host \
-p $port_number:8000 \
-e HTTPS_PROXY=$https_proxy \
-e HTTP_PROXY=$http_proxy \
-e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
vllm_ray:habana \
/bin/bash -c "ray start --head && python vllm_ray_openai.py --port_number 8000 --model_id_or_path $model_name --tensor_parallel_size $parallel_number --enforce_eager $enforce_eager"
83 changes: 83 additions & 0 deletions comps/llms/text-generation/vllm-ray/llm.py
@@ -0,0 +1,83 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI
from langsmith import traceable

from comps import GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice


@traceable(run_type="tool")
def post_process_text(text: str):
    # Frame each streamed chunk as an SSE "data:" line, encoding a single space as "@#$"
    # and a newline as "<br/>" so the consumer can reconstruct whitespace;
    # other pure-whitespace chunks are dropped.
    if text == " ":
        return "data: @#$\n\n"
    if text == "\n":
        return "data: <br/>\n\n"
    if text.isspace():
        return None
    new_text = text.replace(" ", "@#$")
    return f"data: {new_text}\n\n"


@register_microservice(
    name="opea_service@llm_vllm_ray",
    service_type=ServiceType.LLM,
    endpoint="/v1/chat/completions",
    host="0.0.0.0",
    port=9000,
)
@traceable(run_type="llm")
def llm_generate(input: LLMParamsDoc):
    # Endpoint and model come from the environment set up in launch_microservice.sh.
    llm_endpoint = os.getenv("vLLM_RAY_ENDPOINT", "http://localhost:8006")
    llm_model = os.getenv("LLM_MODEL", "meta-llama/Llama-2-7b-chat-hf")
    llm = ChatOpenAI(
        openai_api_base=llm_endpoint + "/v1",
        model_name=llm_model,
        openai_api_key=os.getenv("OPENAI_API_KEY", "not_needed"),
        max_tokens=input.max_new_tokens,
        temperature=input.temperature,
        streaming=input.streaming,
        request_timeout=600,
    )

    if input.streaming:

        async def stream_generator():
            # Stream tokens back as server-sent events, stopping at the end-of-sequence marker.
            chat_response = ""
            async for text in llm.astream(input.query):
                text = text.content
                chat_response += text
                processed_text = post_process_text(text)
                if text and processed_text:
                    if "</s>" in text:
                        res = text.split("</s>")[0]
                        if res != "":
                            yield res
                        break
                    yield processed_text
            print(f"[llm - chat_stream] stream response: {chat_response}")
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = llm.invoke(input.query)
        response = response.content
        return GeneratedDoc(text=response, prompt=input.query)


if __name__ == "__main__":
    opea_microservices["opea_service@llm_vllm_ray"].start()
15 changes: 15 additions & 0 deletions comps/llms/text-generation/vllm-ray/query.sh
@@ -0,0 +1,15 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

your_ip="0.0.0.0"

##query vllm ray service
curl http://${your_ip}:8006/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "How are you?"}]}'

##query microservice
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'
17 changes: 17 additions & 0 deletions comps/llms/text-generation/vllm-ray/requirements.txt
@@ -0,0 +1,17 @@
docarray[full]
fastapi
huggingface_hub
langchain==0.1.16
langchain_openai
langserve
langsmith
openai
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
ray[serve]>=2.10
setuptools==69.5.1
shortuuid
transformers
vllm