[data/llm/docs] LLM Batch API documentation improvements #50747

Merged: 4 commits, merged on Feb 20, 2025. Changes shown below are from 2 commits.

@@ -0,0 +1,6 @@
{{ fullname | escape | underline}}

.. currentmodule:: {{ module }}

.. autoclass:: {{ objname }}
:members:
5 changes: 3 additions & 2 deletions doc/source/data/api/llm.rst
@@ -5,7 +5,7 @@ Large Language Model (LLM) API

.. currentmodule:: ray.data.llm

LLM Processor Builder
LLM processor builder
---------------------

.. autosummary::
@@ -23,11 +23,12 @@ Processor

~Processor

Processor Configs
Processor configs
-----------------

.. autosummary::
:nosignatures:
:template: autosummary/class_without_autosummary_noinheritance.rst
Contributor Author:

Before: [screenshot of the rendered page]

After: [screenshot of the rendered page]

:toctree: doc/

~ProcessorConfig
47 changes: 39 additions & 8 deletions doc/source/data/working-with-llms.rst
@@ -19,10 +19,18 @@ Perform batch inference with LLMs
At a high level, the `ray.data.llm` module provides a `Processor` object which encapsulates
logic for performing batch inference with LLMs on a Ray Data dataset.

You can use the `build_llm_processor` API to construct a processor. In the following example, we use the `vLLMEngineProcessorConfig` to construct a processor for the `meta-llama/Llama-3.1-8B-Instruct` model.
You can use the `build_llm_processor` API to construct a processor.
In the following example, we use the `vLLMEngineProcessorConfig` to construct a processor for the `meta-llama/Llama-3.1-8B-Instruct` model.

To run this example, install vLLM, which is a popular and optimized LLM inference engine.

.. testcode::

pip install -U vllm
Collaborator: Better to pin the version. You can add one sentence saying that later versions should still work but aren't guaranteed.

Contributor Author: Good point. Will pin to 0.7.2.

The vLLMEngineProcessorConfig is a configuration object for the vLLM engine.
It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations. Upon execution, the Processor object instantiates replicas of the vLLM engine (using `map_batches` under the hood).
It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations.
Upon execution, the Processor object instantiates replicas of the vLLM engine (using `map_batches` under the hood).

.. testcode::

@@ -50,10 +58,12 @@ It contains the model name, the number of GPUs to use, and the number of shards
sampling_params=dict(
temperature=0.3,
max_tokens=250,
)
),
**row
Collaborator: I don't think you need this in preprocess. All columns are automatically carried over until the postprocess.

Contributor Author: Oh, that's very unintuitive! OK, I will document that.

),
postprocess=lambda row: dict(
answer=row["generated_text"]
answer=row["generated_text"],
**row
),
)

@@ -67,6 +77,17 @@ It contains the model name, the number of GPUs to use, and the number of shards

{'answer': 'Snowflakes gently fall\nBlanketing the winter scene\nFrozen peaceful hush'}
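
For reference, the middle of this example is collapsed in the diff above. The full pipeline in this file looks roughly like the following sketch; it mirrors the haiku example whose output is shown above, and any detail hidden by the collapsed lines is an assumption rather than a quote of the PR:

    import ray
    from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

    config = vLLMEngineProcessorConfig(
        model="meta-llama/Llama-3.1-8B-Instruct",
        concurrency=1,
        batch_size=64,
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            messages=[
                {"role": "system", "content": "You are a bot that responds with haikus."},
                {"role": "user", "content": row["item"]},
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=250,
            ),
            # Per the review comment above, **row is not needed here:
            # input columns are carried over automatically until postprocess.
        ),
        postprocess=lambda row: dict(
            answer=row["generated_text"],
            **row,  # keep the original columns in the output
        ),
    )

    ds = ray.data.from_items(["Start of the haiku is: Snowflakes are falling"])
    ds = processor(ds)
    ds.show(limit=1)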

Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument.

.. testcode::

config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
runtime_env={"env_vars": {"HF_TOKEN": "your_huggingface_token"}},
concurrency=1,
batch_size=64,
)

.. _vllm_llm:

Configure vLLM for LLM inference
@@ -78,7 +99,7 @@ Use the `vLLMEngineProcessorConfig` to configure the vLLM engine.

from ray.data.llm import vLLMEngineProcessorConfig

processor_config = vLLMEngineProcessorConfig(
config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={"max_model_len": 20000},
concurrency=1,
@@ -89,7 +110,7 @@ For handling larger models, specify model parallelism.

.. testcode::

processor_config = vLLMEngineProcessorConfig(
config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={
"max_model_len": 16384,
@@ -106,11 +127,21 @@ The underlying `Processor` object instantiates replicas of the vLLM engine and a
configure parallel workers to handle model parallelism (for tensor parallelism and pipeline parallelism,
if specified).
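
The parallelism-related engine kwargs are collapsed in the hunk above. A complete model-parallelism configuration might look like the following sketch; the parallel sizes are illustrative values, not taken from this PR:

    config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        engine_kwargs={
            "max_model_len": 16384,
            # Shard each engine replica across 2 GPUs per pipeline stage,
            # with 2 pipeline stages (4 GPUs per replica in total).
            "tensor_parallel_size": 2,
            "pipeline_parallel_size": 2,
        },
        concurrency=1,
        batch_size=64,
    )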

To optimize model loading, you can configure the `load_format` to `runai_streamer` or `tensorizer`:

.. testcode::

config = vLLMEngineProcessorConfig(
model="unsloth/Llama-3.1-8B-Instruct",
engine_kwargs={"load_format": "runai_streamer"},
concurrency=1,
batch_size=64,
)

.. _openai_compatible_api_endpoint:

OpenAI Compatible API Endpoint
------------------------------
Batch inference with an OpenAI-compatible endpoint
--------------------------------------------------

You can also make calls to deployed models that have an OpenAI compatible API endpoint.
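
The example itself is collapsed here. It presumably uses `HttpRequestProcessorConfig` from `ray.data.llm`; the sketch below is an assumption-laden illustration (the endpoint URL, model name, payload fields, and `http_response` column are not quoted from this PR):

    import ray
    from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

    config = HttpRequestProcessorConfig(
        url="https://api.openai.com/v1/chat/completions",
        headers={"Authorization": "Bearer YOUR_OPENAI_API_KEY"},
        qps=1,  # throttle outgoing requests per second
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            payload=dict(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": row["item"]}],
                max_tokens=100,
            ),
        ),
        postprocess=lambda row: dict(
            answer=row["http_response"]["choices"][0]["message"]["content"],
        ),
    )

    ds = ray.data.from_items(["Write a haiku about winter."])
    ds = processor(ds)
    ds.show(limit=1)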

27 changes: 24 additions & 3 deletions python/ray/data/llm.py
@@ -11,8 +11,18 @@

@PublicAPI(stability="alpha")
class ProcessorConfig(_ProcessorConfig):
"""The processor configuration."""
"""The processor configuration.

Args:
batch_size: Configures batch size for the processor. Large batch sizes are
likely to saturate the compute resources and could achieve higher throughput.
On the other hand, small batch sizes are more fault-tolerant and could
reduce bubbles in the data pipeline. You can tune the batch size to balance
the throughput and fault-tolerance based on your use case.
accelerator_type: The accelerator type used by the LLM stage in a processor.
Defaults to None, meaning that only the CPU will be used.
concurrency: The number of workers for data parallelism. Defaults to 1.
"""
pass
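
As a rough illustration of the trade-off this docstring describes, the two configs below (using the `vLLMEngineProcessorConfig` subclass, since `ProcessorConfig` is a base class; all values are illustrative) lean toward throughput and toward fault tolerance respectively:

    from ray.data.llm import vLLMEngineProcessorConfig

    # Throughput-oriented: large batches keep the engine saturated.
    throughput_config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        batch_size=256,
        concurrency=4,            # 4 data-parallel workers
        accelerator_type="A10G",  # assumption: A10G GPUs are available in the cluster
    )

    # Fault-tolerance-oriented: small batches lose less work on retries
    # and reduce bubbles in the data pipeline.
    resilient_config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        batch_size=16,
        concurrency=1,
    )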


@@ -69,9 +79,17 @@ class vLLMEngineProcessorConfig(_vLLMEngineProcessorConfig):

Args:
model: The model to use for the vLLM engine.
engine_kwargs: The kwargs to pass to the vLLM engine.
batch_size: The batch size to send to the vLLM engine. Large batch sizes are
likely to saturate the compute resources and could achieve higher throughput.
On the other hand, small batch sizes are more fault-tolerant and could
reduce bubbles in the data pipeline. You can tune the batch size to balance
the throughput and fault-tolerance based on your use case.
engine_kwargs: The kwargs to pass to the vLLM engine. Default engine kwargs are
pipeline_parallel_size: 1, tensor_parallel_size: 1, max_num_seqs: 128,
distributed_executor_backend: "mp".
task_type: The task type to use. If not specified, will use 'generate' by default.
runtime_env: The runtime environment to use for the vLLM engine.
runtime_env: The runtime environment to use for the vLLM engine. See
:ref:`this doc <handling_dependencies>` for more details.
max_pending_requests: The maximum number of pending requests. If not specified,
will use the default value from the vLLM engine.
max_concurrent_batches: The maximum number of concurrent batches in the engine.
Expand All @@ -86,6 +104,9 @@ class vLLMEngineProcessorConfig(_vLLMEngineProcessorConfig):
If not, vLLM will tokenize the prompt in the engine.
detokenize: Whether to detokenize the output.
has_image: Whether the input messages have images.
accelerator_type: The accelerator type used by the LLM stage in a processor.
Defaults to None, meaning that only the CPU will be used.
concurrency: The number of workers for data parallelism. Defaults to 1.

Examples:

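
The Examples section of this docstring is collapsed in the diff. As a minimal sketch of how the documented arguments fit together (values and the chosen model are illustrative, not quoted from this PR):

    from ray.data.llm import vLLMEngineProcessorConfig

    config = vLLMEngineProcessorConfig(
        model="unsloth/Llama-3.1-8B-Instruct",
        task_type="generate",                   # the documented default task type
        engine_kwargs={"max_model_len": 8192},  # supplements the default engine kwargs listed above
        runtime_env={"env_vars": {"HF_TOKEN": "your_huggingface_token"}},
        max_pending_requests=128,
        apply_chat_template=True,               # apply the model's chat template to messages
        tokenize=True,                          # tokenize prompts before sending them to vLLM
        detokenize=True,                        # convert generated token ids back to text
        accelerator_type=None,                  # default None; set e.g. "A10G" to request a GPU type
        concurrency=1,
        batch_size=64,
    )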