[Doc] add online speculative decoding example (vllm-project#7243)
stas00 authored and kylesayrs committed Aug 17, 2024
1 parent cd6ddd0 commit c13cdd5
Showing 1 changed file with 55 additions and 11 deletions: docs/source/models/spec_decode.rst

Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.

Speculating with a draft model
------------------------------

The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_model="facebook/opt-125m",
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

To perform the same in online mode, launch the server:

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
        --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
        --num_speculative_tokens 5 --gpu_memory_utilization 0.8

Then use a client:

.. code-block:: python

    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    # Completion API
    stream = False
    completion = client.completions.create(
        model=model,
        prompt="The future of AI is",
        echo=False,
        n=1,
        stream=stream,
    )

    print("Completion results:")
    if stream:
        for c in completion:
            print(c)
    else:
        print(completion)

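Speculative decoding targets inter-token latency, so it can be useful to time requests against the server. The following is a small illustrative sketch rather than part of the original example; it assumes the server launched above is still running and that the response carries a ``usage`` field with token counts:

.. code-block:: python

    import time

    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    model = client.models.list().data[0].id

    start = time.perf_counter()
    completion = client.completions.create(
        model=model,
        prompt="The future of AI is",
        max_tokens=128,
    )
    elapsed = time.perf_counter() - start

    # The server reports prompt and completion token counts in the usage field.
    n_tokens = completion.usage.completion_tokens
    print(f"Generated {n_tokens} tokens in {elapsed:.2f}s "
          f"({n_tokens / elapsed:.1f} tokens/s)")

Running the same request with and without ``--speculative_model`` gives a rough view of the end-to-end speedup for a given workload.
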
Speculating by matching n-grams in the prompt
---------------------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information read `this thread. <https:/

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_model="[ngram]",
        num_speculative_tokens=5,
        ngram_prompt_lookup_max=4,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

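To make the mechanism concrete, the toy sketch below shows the basic idea behind prompt-lookup proposals in plain Python: find an earlier occurrence of the prompt's trailing n-gram and propose the tokens that followed it. This is an illustration only, not vLLM's implementation, and the function name and example token list are made up for this purpose:

.. code-block:: python

    def propose_from_prompt(tokens, ngram_max=4, num_speculative_tokens=5):
        """Toy prompt-lookup: find an earlier occurrence of the trailing
        n-gram and propose the tokens that followed it."""
        for n in range(ngram_max, 0, -1):
            if len(tokens) <= n:
                continue
            tail = tokens[-n:]
            # Scan earlier positions, preferring the most recent match.
            for start in range(len(tokens) - n - 1, -1, -1):
                if tokens[start:start + n] == tail:
                    follow = tokens[start + n:start + n + num_speculative_tokens]
                    if follow:
                        return follow
        return []

    # The trailing 4-gram "The future of AI" also appears at the start of the
    # prompt, so the tokens that followed it there become the draft proposal.
    print(propose_from_prompt(
        ["The", "future", "of", "AI", "is", "bright", ".",
         "The", "future", "of", "AI"]))

The proposed tokens are then verified by the main model, so incorrect proposals are discarded rather than emitted.
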
Speculating using MLP speculators
---------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
`this technical report <https://arxiv.org/abs/2404.19124>`_.

.. code-block:: python

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
speculative models are relatively small, we still see significant speedups. However, this
limitation will be fixed in a future release.

A variety of speculative models of this type are available on the HF hub.
