Commit

[llm.serving] add requirements sections to overview page (ray-project#50788)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Adding a requirements section to the LLM overview page so users know which
dependencies to install, and suggesting they pin `xgrammar==0.1.11` and
`pynvml==12.0.0` when paired with vLLM 0.7.2.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Gene Su <e870252314@gmail.com>
GeneDer authored and xsuler committed Mar 4, 2025
1 parent 9d8db59 commit 27b5213
Showing 1 changed file with 22 additions and 12 deletions.
34 changes: 22 additions & 12 deletions doc/source/serve/llm/overview.rst
@@ -11,6 +11,17 @@ Features
- 🔄 Multi-LoRA support with shared base models
- 🚀 Engine-agnostic architecture (e.g., vLLM, SGLang, etc.)

Requirements
--------------

.. code-block:: bash

    pip install "ray[serve,llm]>=2.43.0" "vllm>=0.7.2"

    # Suggested dependencies when using vllm 0.7.2:
    pip install xgrammar==0.1.11 pynvml==12.0.0

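If you want to verify what ended up installed, a quick optional check (a convenience snippet, not an additional requirement):

.. code-block:: python

    # Optional sanity check: print the installed versions of the packages above.
    from importlib.metadata import version

    for pkg in ("ray", "vllm", "xgrammar", "pynvml"):
        print(pkg, version(pkg))
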
Key Components
--------------

@@ -103,10 +114,10 @@ You can query the deployed models using either cURL or the OpenAI Python client:
.. code-block:: python

    from openai import OpenAI

    # Initialize client
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

    # Basic chat completion with streaming
    response = client.chat.completions.create(
        model="qwen-0.5b",
@@ -117,7 +128,7 @@ You can query the deployed models using either cURL or the OpenAI Python client:
    for chunk in response:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)

For deploying multiple models, you can pass a list of ``LLMConfig`` objects to the ``LLMRouter`` deployment:
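
The full example is collapsed in this diff; a minimal sketch of the pattern follows. The import paths are assumptions (only ``VLLMService``, ``LLMRouter``, and ``get_serve_options`` appear in the visible hunks), so treat this as illustrative rather than the exact API:

.. code-block:: python

    # Sketch: one deployment per LLMConfig, all bound to a single LLMRouter.
    from ray import serve
    from ray.serve.llm.configs import LLMConfig                     # assumed import path
    from ray.serve.llm.deployments import VLLMService, LLMRouter    # assumed import path

    configs = [
        LLMConfig(
            model_loading_config=dict(
                model_id="qwen-0.5b",
                model_source="Qwen/Qwen2.5-0.5B-Instruct",
            ),
            accelerator_type="A10G",
        ),
        LLMConfig(
            model_loading_config=dict(
                model_id="qwen-1.5b",
                model_source="Qwen/Qwen2.5-1.5B-Instruct",
            ),
            accelerator_type="A10G",
        ),
    ]

    # Build one vLLM-backed deployment per config, then route between them.
    deployments = [
        VLLMService.as_deployment(c.get_serve_options(name_prefix="VLLM:")).bind(c)
        for c in configs
    ]
    llm_app = LLMRouter.as_deployment().bind(deployments)
    serve.run(llm_app)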

@@ -174,22 +185,22 @@ For production deployments, Ray Serve LLM provides utilities for config-driven d
# config.yaml
applications:
- args:
    llm_configs:
      - model_loading_config:
          model_id: qwen-0.5b
          model_source: Qwen/Qwen2.5-0.5B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
      - model_loading_config:
          model_id: qwen-1.5b
          model_source: Qwen/Qwen2.5-1.5B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
  import_path: ray.serve.llm.builders:build_openai_app
@@ -204,7 +215,7 @@ For production deployments, Ray Serve LLM provides utilities for config-driven d
# config.yaml
applications:
- args:
    llm_configs:
      - models/qwen-0.5b.yaml
      - models/qwen-1.5b.yaml
@@ -221,7 +232,7 @@ For production deployments, Ray Serve LLM provides utilities for config-driven d
  model_source: Qwen/Qwen2.5-0.5B-Instruct
accelerator_type: A10G
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2
@@ -233,7 +244,7 @@ For production deployments, Ray Serve LLM provides utilities for config-driven d
  model_source: Qwen/Qwen2.5-1.5B-Instruct
accelerator_type: A10G
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2
@@ -251,8 +262,8 @@ For each usage pattern, we provide a server and client code snippet.
Multi-LoRA Deployment
~~~~~~~~~~~~~~~~~~~~~

You can use LoRA (Low-Rank Adaptation) to efficiently fine-tune models by configuring the ``LoraConfig``.
We use Ray Serve's multiplexing feature to serve multiple LoRA checkpoints from the same model.
This allows the weights to be loaded on each replica on-the-fly and be cached via an LRU mechanism.
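
A rough server-side sketch follows; the ``LoraConfig`` import path and field names (``dynamic_lora_loading_path``, ``max_num_adapters_per_replica``) are assumptions for illustration, so check the API reference — the authoritative snippets live in the tabs below:

.. code-block:: python

    # Sketch: one base model serving multiple LoRA adapters via Serve multiplexing.
    # LoraConfig field names are assumed for illustration.
    from ray import serve
    from ray.serve.llm.configs import LLMConfig, LoraConfig         # assumed import path
    from ray.serve.llm.deployments import VLLMService, LLMRouter    # assumed import path

    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id="qwen-0.5b",
            model_source="Qwen/Qwen2.5-0.5B-Instruct",
        ),
        accelerator_type="A10G",
        lora_config=LoraConfig(
            # Adapters under this path are discovered and loaded on demand.
            dynamic_lora_loading_path="s3://my-bucket/my-lora-checkpoints/",
            max_num_adapters_per_replica=16,
        ),
    )

    deployment = VLLMService.as_deployment(
        llm_config.get_serve_options(name_prefix="VLLM:")
    ).bind(llm_config)
    serve.run(LLMRouter.as_deployment().bind([deployment]))

If this mirrors the multiplexing scheme, clients select an adapter by suffixing its name onto the base model id (for example ``model="qwen-0.5b:my_adapter"``), which triggers the on-the-fly load described above.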

.. tab-set::
@@ -540,4 +551,3 @@ If you are using huggingface models, you can enable fast download by setting `HF
deployment = VLLMService.as_deployment(llm_config.get_serve_options(name_prefix="VLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)
