[data/llm/docs] LLM Batch API documentation improvements #50747
@@ -0,0 +1,6 @@
+{{ fullname | escape | underline}}
+
+.. currentmodule:: {{ module }}
+
+.. autoclass:: {{ objname }}
+    :members:
@@ -19,10 +19,18 @@ Perform batch inference with LLMs
 At a high level, the `ray.data.llm` module provides a `Processor` object which encapsulates
 logic for performing batch inference with LLMs on a Ray Data dataset.
 
-You can use the `build_llm_processor` API to construct a processor. In the following example, we use the `vLLMEngineProcessorConfig` to construct a processor for the `meta-llama/Llama-3.1-8B-Instruct` model.
+You can use the `build_llm_processor` API to construct a processor.
+In the following example, we use the `vLLMEngineProcessorConfig` to construct a processor for the `meta-llama/Llama-3.1-8B-Instruct` model.
+
+To run this example, install vLLM, which is a popular and optimized LLM inference engine.
+
+.. testcode::
+
+    pip install -U vllm
Review comment: Better to pin the version. You can add one sentence saying later versions should still work but not guaranteed.
Reply: Good point. Will pin to 0.7.2
 The vLLMEngineProcessorConfig is a configuration object for the vLLM engine.
-It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations. Upon execution, the Processor object instantiates replicas of the vLLM engine (using `map_batches` under the hood).
+It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations.
+Upon execution, the Processor object instantiates replicas of the vLLM engine (using `map_batches` under the hood).
 
 .. testcode::
@@ -50,10 +58,12 @@ It contains the model name, the number of GPUs to use, and the number of shards
         sampling_params=dict(
             temperature=0.3,
             max_tokens=250,
-        )
+        ),
+        **row
Review comment: I don't think you need this in preprocess. All columns are automatically carried over until the postprocess.
Reply: oh that's very unintuitive! OK I will document that.
     ),
     postprocess=lambda row: dict(
-        answer=row["generated_text"]
+        answer=row["generated_text"],
+        **row
     ),
 )
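Taken together, the pieces in this diff correspond to a complete pipeline. The following is a minimal sketch of that flow rather than a verbatim copy of the docs example: the system prompt, the input dataset, and the concurrency and batch-size values are illustrative placeholders. It also reflects the reviewer's note above: input columns are carried over to the postprocess stage automatically, so `**row` is only needed if you want to echo them into the output dictionary.

.. code-block:: python

    import ray
    from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

    # Engine configuration: one vLLM replica, 64 rows per batch (illustrative values).
    config = vLLMEngineProcessorConfig(
        model="meta-llama/Llama-3.1-8B-Instruct",
        concurrency=1,
        batch_size=64,
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            messages=[
                {"role": "system", "content": "You are a bot that responds with haikus."},
                {"role": "user", "content": row["item"]},
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=250,
            ),
            # Input columns such as "item" are carried over automatically,
            # so **row is not required here.
        ),
        postprocess=lambda row: dict(
            answer=row["generated_text"],
            **row,  # keep the original columns in the output
        ),
    )

    ds = ray.data.from_items(["a winter snowfall"])  # placeholder input; rows get an "item" column
    ds = processor(ds)
    ds.show(limit=1)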
@@ -67,6 +77,17 @@ It contains the model name, the number of GPUs to use, and the number of shards
 
     {'answer': 'Snowflakes gently fall\nBlanketing the winter scene\nFrozen peaceful hush'}
 
+Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument.
+
+.. testcode::
+
+    config = vLLMEngineProcessorConfig(
+        model="unsloth/Llama-3.1-8B-Instruct",
+        runtime_env={"env_vars": {"HF_TOKEN": "your_huggingface_token"}},
+        concurrency=1,
+        batch_size=64,
+    )
+
 .. _vllm_llm:
 
 Configure vLLM for LLM inference
@@ -78,7 +99,7 @@ Use the `vLLMEngineProcessorConfig` to configure the vLLM engine.
 
     from ray.data.llm import vLLMEngineProcessorConfig
 
-    processor_config = vLLMEngineProcessorConfig(
+    config = vLLMEngineProcessorConfig(
         model="unsloth/Llama-3.1-8B-Instruct",
         engine_kwargs={"max_model_len": 20000},
         concurrency=1,
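The diff truncates this example before its closing parenthesis; a completed version, assuming the same `batch_size=64` used in the other examples on this page, would look like:

.. code-block:: python

    from ray.data.llm import vLLMEngineProcessorConfig

    # Cap the context window at 20k tokens for this engine.
    config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        engine_kwargs={"max_model_len": 20000},
        concurrency=1,
        batch_size=64,  # assumed; the diff cuts off before this argument
    )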
@@ -89,7 +110,7 @@ For handling larger models, specify model parallelism.
 
 .. testcode::
 
-    processor_config = vLLMEngineProcessorConfig(
+    config = vLLMEngineProcessorConfig(
         model="unsloth/Llama-3.1-8B-Instruct",
         engine_kwargs={
             "max_model_len": 16384,
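This example is also cut off by the diff context. A plausible completion, assuming vLLM's standard `tensor_parallel_size` and `pipeline_parallel_size` engine arguments (each replica then requests tensor_parallel_size × pipeline_parallel_size GPUs):

.. code-block:: python

    from ray.data.llm import vLLMEngineProcessorConfig

    config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        engine_kwargs={
            "max_model_len": 16384,
            # Shard each replica across 2 GPUs (tensor parallelism)...
            "tensor_parallel_size": 2,
            # ...and split its layers into 2 stages (pipeline parallelism).
            "pipeline_parallel_size": 2,
        },
        concurrency=1,
        batch_size=64,
    )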
@@ -106,11 +127,21 @@ The underlying `Processor` object instantiates replicas of the vLLM engine and a
 configure parallel workers to handle model parallelism (for tensor parallelism and pipeline parallelism,
 if specified).
 
+To optimize model loading, you can configure the `load_format` to `runai_streamer` or `tensorizer`:
+
+.. testcode::
+
+    config = vLLMEngineProcessorConfig(
+        model="unsloth/Llama-3.1-8B-Instruct",
+        engine_kwargs={"load_format": "runai_streamer"},
+        concurrency=1,
+        batch_size=64,
+    )
+
 .. _openai_compatible_api_endpoint:
 
-OpenAI Compatible API Endpoint
-------------------------------
+Batch inference with an OpenAI-compatible endpoint
+--------------------------------------------------
 
 You can also make calls to deployed models that have an OpenAI compatible API endpoint.
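The diff ends before the code for this section. As a sketch only, assuming `ray.data.llm` exposes `HttpRequestProcessorConfig` with `url`, `headers`, and `qps` arguments, that the preprocess stage emits a `payload` column, and that the raw response comes back in an `http_response` column (the endpoint, model name, and query are placeholders):

.. code-block:: python

    import os

    import ray
    from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

    # Point the processor at any OpenAI-compatible chat completions endpoint.
    config = HttpRequestProcessorConfig(
        url="https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        qps=1,  # throttle outbound requests
    )

    processor = build_llm_processor(
        config,
        # Build the JSON payload for each input row.
        preprocess=lambda row: dict(
            payload=dict(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": row["item"]}],
                temperature=0.3,
                max_tokens=50,
            ),
        ),
        # Unpack the chat completion from the HTTP response.
        postprocess=lambda row: dict(
            answer=row["http_response"]["choices"][0]["message"]["content"],
        ),
    )

    ds = processor(ray.data.from_items(["Write a haiku about winter."]))
    ds.show(limit=1)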
Review comment: (before/after screenshots of the rendered documentation)