[data/llm/docs] LLM Batch API documentation improvements #50747

Merged 4 commits on Feb 20, 2025
@@ -0,0 +1,6 @@
{{ fullname | escape | underline}}

.. currentmodule:: {{ module }}

.. autoclass:: {{ objname }}
:members:
5 changes: 3 additions & 2 deletions doc/source/data/api/llm.rst
@@ -5,7 +5,7 @@ Large Language Model (LLM) API

.. currentmodule:: ray.data.llm

LLM Processor Builder
LLM processor builder
---------------------

.. autosummary::
@@ -23,11 +23,12 @@ Processor

~Processor

Processor Configs
Processor configs
-----------------

.. autosummary::
:nosignatures:
:template: autosummary/class_without_autosummary_noinheritance.rst
(Contributor author comment: before/after screenshots of the rendered API docs for this template change, Feb 19, 2025.)
:toctree: doc/

~ProcessorConfig
47 changes: 39 additions & 8 deletions doc/source/data/working-with-llms.rst
@@ -19,10 +19,19 @@ Perform batch inference with LLMs
At a high level, the `ray.data.llm` module provides a `Processor` object which encapsulates
logic for performing batch inference with LLMs on a Ray Data dataset.

You can use the `build_llm_processor` API to construct a processor. In the following example, we use the `vLLMEngineProcessorConfig` to construct a processor for the `meta-llama/Llama-3.1-8B-Instruct` model.
You can use the `build_llm_processor` API to construct a processor.
The following example uses the `vLLMEngineProcessorConfig` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model.

The vLLMEngineProcessorConfig is a configuration object for the vLLM engine.
It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations. Upon execution, the Processor object instantiates replicas of the vLLM engine (using `map_batches` under the hood).
To run this example, install vLLM, which is a popular and optimized LLM inference engine.

.. code-block:: bash

# Later versions *should* work but are not tested yet.
pip install -U vllm==0.7.2

The `vLLMEngineProcessorConfig` is a configuration object for the vLLM engine.
It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations.
Upon execution, the Processor object instantiates replicas of the vLLM engine (using `map_batches` under the hood).

.. testcode::

@@ -53,7 +62,8 @@ It contains the model name, the number of GPUs to use, and the number of shards
)
),
postprocess=lambda row: dict(
answer=row["generated_text"]
answer=row["generated_text"],
**row # This will return all the original columns in the dataset.
),
)

@@ -67,6 +77,17 @@ It contains the model name, the number of GPUs to use, and the number of shards

{'answer': 'Snowflakes gently fall\nBlanketing the winter scene\nFrozen peaceful hush'}
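
The output above is produced by applying the processor to a Ray Data dataset, roughly as follows (a minimal sketch; `ray.data.from_items` wraps plain strings into an `item` column, which the collapsed preprocess above is assumed to read):

.. code-block:: python

    import ray

    ds = ray.data.from_items(["A haiku about winter snow."])
    ds = processor(ds)   # Runs preprocess -> vLLM engine -> postprocess.
    ds.show(limit=1)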

Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument.

.. testcode::

config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
runtime_env={"env_vars": {"HF_TOKEN": "your_huggingface_token"}},
concurrency=1,
batch_size=64,
)

.. _vllm_llm:

Configure vLLM for LLM inference
@@ -78,7 +99,7 @@ Use the `vLLMEngineProcessorConfig` to configure the vLLM engine.

from ray.data.llm import vLLMEngineProcessorConfig

processor_config = vLLMEngineProcessorConfig(
config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={"max_model_len": 20000},
concurrency=1,
@@ -89,7 +110,7 @@ For handling larger models, specify model parallelism.

.. testcode::

processor_config = vLLMEngineProcessorConfig(
config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={
"max_model_len": 16384,
@@ -106,11 +127,21 @@ The underlying `Processor` object instantiates replicas of the vLLM engine and a
configure parallel workers to handle model parallelism (for tensor parallelism and pipeline parallelism,
if specified).
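
As a concrete sketch of that mapping (the kwargs are standard vLLM engine arguments; the GPU math assumes 4 GPUs available per engine replica), tensor and pipeline parallelism are passed through `engine_kwargs`:

.. code-block:: python

    config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        engine_kwargs={
            "max_model_len": 16384,
            "tensor_parallel_size": 2,
            "pipeline_parallel_size": 2,
        },
        concurrency=1,
        batch_size=64,
    )
    # Each engine replica uses tensor_parallel_size * pipeline_parallel_size = 4 GPUs;
    # `concurrency` controls how many such replicas run in parallel.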

To optimize model loading, you can configure the `load_format` to `runai_streamer` or `tensorizer`:

.. testcode::

config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={"load_format": "runai_streamer"},
concurrency=1,
batch_size=64,
)

.. _openai_compatible_api_endpoint:

OpenAI Compatible API Endpoint
------------------------------
Batch inference with an OpenAI-compatible endpoint
--------------------------------------------------

You can also make calls to deployed models that have an OpenAI compatible API endpoint.
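
For example, here is a minimal sketch using `HttpRequestProcessorConfig` (assumptions: an OpenAI-style chat completions endpoint, an API key in the `OPENAI_API_KEY` environment variable, and a `prompt` column in the dataset; adjust the model and URL for your deployment):

.. code-block:: python

    import os
    import ray
    from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

    config = HttpRequestProcessorConfig(
        url="https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        qps=1,  # Throttle the request rate to the endpoint.
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            payload=dict(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": row["prompt"]}],
            ),
        ),
        postprocess=lambda row: dict(
            answer=row["http_response"]["choices"][0]["message"]["content"],
            **row,  # Keep the original columns.
        ),
    )

    ds = ray.data.from_items([{"prompt": "Write a haiku about winter."}])
    ds = processor(ds)
    ds.show(limit=1)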

37 changes: 32 additions & 5 deletions python/ray/data/llm.py
@@ -11,7 +11,18 @@

@PublicAPI(stability="alpha")
class ProcessorConfig(_ProcessorConfig):
"""The processor configuration."""
"""The processor configuration.

Args:
batch_size: Configures batch size for the processor. Large batch sizes are
more likely to saturate the compute resources and can achieve higher throughput,
while small batch sizes are more fault-tolerant and can reduce bubbles in the
data pipeline. Tune the batch size to balance throughput and fault tolerance
for your use case.
accelerator_type: The accelerator type used by the LLM stage in a processor.
Defaults to None, meaning that only the CPU will be used.
concurrency: The number of workers for data parallelism. Defaults to 1.
"""

pass

@@ -69,9 +80,17 @@ class vLLMEngineProcessorConfig(_vLLMEngineProcessorConfig):

Args:
model: The model to use for the vLLM engine.
engine_kwargs: The kwargs to pass to the vLLM engine.
batch_size: The batch size to send to the vLLM engine. Large batch sizes are
more likely to saturate the compute resources and can achieve higher throughput,
while small batch sizes are more fault-tolerant and can reduce bubbles in the
data pipeline. Tune the batch size to balance throughput and fault tolerance
for your use case.
engine_kwargs: The kwargs to pass to the vLLM engine. Default engine kwargs are
pipeline_parallel_size: 1, tensor_parallel_size: 1, max_num_seqs: 128,
distributed_executor_backend: "mp".
task_type: The task type to use. If not specified, will use 'generate' by default.
runtime_env: The runtime environment to use for the vLLM engine.
runtime_env: The runtime environment to use for the vLLM engine. See
:ref:`this doc <handling_dependencies>` for more details.
max_pending_requests: The maximum number of pending requests. If not specified,
will use the default value from the vLLM engine.
max_concurrent_batches: The maximum number of concurrent batches in the engine.
@@ -86,6 +105,9 @@ class vLLMEngineProcessorConfig(_vLLMEngineProcessorConfig):
If not, vLLM will tokenize the prompt in the engine.
detokenize: Whether to detokenize the output.
has_image: Whether the input messages have images.
accelerator_type: The accelerator type used by the LLM stage in a processor.
Defaults to None, meaning that only the CPU will be used.
concurrency: The number of workers for data parallelism. Defaults to 1.

Examples:

@@ -144,9 +166,13 @@ def build_llm_processor(
config: The processor config.
preprocess: An optional lambda function that takes a row (dict) as input
and returns a preprocessed row (dict). The output row must contain the
required fields for the following processing stages.
required fields for the following processing stages. Each row
can contain a `sampling_params` field, which the engine uses for
row-specific sampling parameters; see the Examples section below.
Note that all columns are carried over until the postprocess stage.
postprocess: An optional lambda function that takes a row (dict) as input
and returns a postprocessed row (dict).
and returns a postprocessed row (dict). To keep all the original
columns, use the `**row` syntax in the returned dict.

Returns:
The built processor.
@@ -184,6 +210,7 @@
),
postprocess=lambda row: dict(
resp=row["generated_text"],
**row, # This will return all the original columns in the dataset.
),
)
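
To pass row-specific sampling parameters, the preprocess function can also emit a `sampling_params` field. An illustrative sketch (reusing the `config` from the example above; the keys follow vLLM's `SamplingParams`, and the `prompt` column is an assumption):

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=250),
    ),
    postprocess=lambda row: dict(
        resp=row["generated_text"],
        **row,  # Keep the original columns.
    ),
)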
