
[WIP] [V1] TPU support #11936

Open · wants to merge 7 commits into main from tpu_v1
Conversation

alexm-redhat (Collaborator) commented Jan 10, 2025

This PR is a rebase and modification of @robertgshaw2-redhat's original PR for TPU support in vLLM V1 from 1.5 months ago: #10241

Currently, the TPU attention kernel does not support mixing prefills and decodes in the same scheduler iteration. As a result, this PR separates the scheduled requests into (1) prefills and (2) decodes and executes each group separately (see the sketch below). Google engineers are working on a new TPU attention kernel that will allow mixing prefills and decodes; once it is ready, we will be able to remove the separation logic and unify the requests, which should also improve performance.
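A minimal sketch of the separation idea (illustrative only; the Request container and split_prefills_and_decodes helper below are hypothetical, not the PR's actual code):

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: str
    num_computed_tokens: int   # tokens already in the KV cache
    num_scheduled_tokens: int  # tokens scheduled for this iteration


def split_prefills_and_decodes(
        requests: List[Request]) -> Tuple[List[Request], List[Request]]:
    """Keep prefills and decodes in separate groups so they never share a batch."""
    prefills, decodes = [], []
    for req in requests:
        if req.num_computed_tokens == 0:
            # Nothing in the KV cache yet: the whole prompt must be prefilled.
            prefills.append(req)
        else:
            # Past the prompt: generate one token per step.
            decodes.append(req)
    return prefills, decodes

The model runner then executes the two groups in separate forward passes (one padded pass per prompt for the prefills, and a single batched pass for the decodes).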

Notes:

  1. @mgoin verified correctness with GSM8K on a TPU instance
  2. No TP > 1 support yet
  3. Only the greedy sampler for now
  4. The V1 code had no support for multiple architectures (this PR adds support for CUDA and TPU); the resulting code duplication is avoided as much as possible by introducing base classes for the worker and the model runner (a sketch of this layout follows this list)
  5. Not performance-optimized yet
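A rough sketch of the base-class layout mentioned in note 4, using method names that appear elsewhere in this review (load_model, dummy_run, profile_run, execute_model); the actual class structure in the PR may differ:

from abc import ABC, abstractmethod


class ModelRunnerBase(ABC):
    """Device-agnostic model-runner logic shared by the CUDA and TPU backends."""

    @abstractmethod
    def load_model(self) -> None:
        """Load (and, on TPU, compile) the model for the target device."""

    @abstractmethod
    def dummy_run(self, batch_size: int, seq_len: int) -> None:
        """Run dummy inputs of a fixed shape, e.g. for warm-up/precompilation."""

    @abstractmethod
    def profile_run(self) -> None:
        """Profile peak memory usage so the KV cache can be sized."""

    @abstractmethod
    def execute_model(self, scheduler_output) -> object:
        """Execute one scheduler iteration and return the sampled tokens."""


class GPUModelRunner(ModelRunnerBase):
    ...  # CUDA-specific input preparation and execution


class TPUModelRunner(ModelRunnerBase):
    ...  # TPU-specific paths (separate prefill/decode execution for now)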

Follow-up tasks (this list may be incomplete):

  1. Add all sampler options
  2. Add prefix caching (currently supported in V0 TPU)
  3. Add prefill chunking
  4. Integrate with Google's new attention kernel to support mixing prefills and decodes
  5. Optimize

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mergify bot commented Jan 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment on lines 382 to 223
return PrefillInputData(
request_ids=prefill_request_ids,
prompt_lens=prefill_prompt_lens,
token_ids=prefill_token_ids,
position_ids=prefill_position_ids,
attn_metadata=prefill_attn_metadata,
)
Contributor
Remove the PrefillInputData data structure and make it consistent with gpu_model_runner?

Collaborator Author

This will be removed the moment Google provides the new attention kernel that supports chunked prefill.

Contributor

How is the new attention kernel related to the PrefillInputData data structure?

(3 outdated/resolved review comments on vllm/v1/worker/tpu_model_runner.py)
mgoin (Member) commented Jan 13, 2025

Successfully ran an eval on GSM8k

VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-1.5B-Instruct,max_model_len=2048,max_num_seqs=512 --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen/Qwen2.5-1.5B-Instruct,max_model_len=2048,max_num_seqs=512), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5989|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.5428|±  |0.0137|

(Outdated/resolved review comments on vllm/platforms/tpu.py ×3 and vllm/v1/worker/tpu_worker.py ×1)
Member
Please revert these changes

(Further review comments on vllm/platforms/tpu.py ×4 and vllm/v1/worker/gpu_model_runner.py ×1)
Comment on lines 248 to 202
# TODO: Remove prompt_len param here
prefill_attn_metadata.append(
PallasMetadata(
num_prefills=1,
num_prefill_tokens=prompt_len, # NOTE: This is not used.
num_decode_tokens=0,
slot_mapping=slot_mapping.to(self.device),
multi_modal_placeholder_index_maps=None,
block_tables=None,
context_lens=None,
effective_query_lens=None,
))
Member
Can you address this TODO?

Collaborator Author
Done

assert req_id is not None
req_state = self.requests[req_id]

# TODO: ASSERT NO CHUNKED PREFILL.
Member
Implement this TODO

Collaborator Author
It looks like the current assert combo is good enough

scheduler_output.num_scheduled_tokens[req_id])
assert seq_len == req_state.num_tokens

# TODO: Verify if req_id_to_index mapping is needed here!
Member
Ditto

Collaborator Author
Removed; it was an old comment.

Comment on lines 450 to 452
# TODO: ASSERT NO PREFIX CACHING.
assert req_state.num_computed_tokens == 0
seq_len = (req_state.num_computed_tokens +
scheduler_output.num_scheduled_tokens[req_id])

# TODO: ASSERT NO CHUNKED PREFILL.
Member
Could you make these asserts at the initialization level? Why would you need to assert this for each request?

Collaborator Author
They are now inside the platform's tpu.py; the asserts here are just in case something changes in the code and breaks an assumption. All of these will change the moment we have a chunked-prefill attention kernel.

mergify bot added the ci/build label (Jan 16, 2025)
Comment on lines 520 to 462
token_ids = torch.zeros((batch_size, seq_len),
dtype=torch.int32,
device=self.device)
Member
Why do you build these dummy tensors each time rather than allocating the max in the initializer and taking slices for each run like the gpu_model_runner?

Collaborator Author
taking slices will result in copies as well, no?
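For reference, a minimal sketch of the two allocation strategies being discussed; shapes and attribute names here are assumptions, not the PR's code:

import torch


class PerRunAllocation:
    """Build the dummy tensor from scratch on every run (current approach)."""

    def get_token_ids(self, batch_size: int, seq_len: int,
                      device: torch.device) -> torch.Tensor:
        return torch.zeros((batch_size, seq_len),
                           dtype=torch.int32,
                           device=device)


class PreallocateAndSlice:
    """Allocate once at init and slice per run (gpu_model_runner style)."""

    def __init__(self, max_batch_size: int, max_num_tokens: int,
                 device: torch.device):
        self._token_ids = torch.zeros((max_batch_size, max_num_tokens),
                                      dtype=torch.int32,
                                      device=device)

    def get_token_ids(self, batch_size: int, seq_len: int) -> torch.Tensor:
        # Basic slicing returns a view of the preallocated buffer, so no new
        # device allocation occurs; filling the view with fresh contents is
        # still a copy, which is the trade-off being debated above.
        return self._token_ids[:batch_size, :seq_len]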

alexm-redhat (Collaborator Author) left a comment
@mgoin @vanbasten23 thanks for the review comments!

(Outdated review comments on vllm/platforms/tpu.py ×5)
(Review comments on vllm/v1/worker/tpu_worker.py ×2)
mergify bot removed the needs-rebase label (Jan 22, 2025)
alexm-redhat force-pushed the tpu_v1 branch 2 times, most recently from dea6afd to c6f526c (January 22, 2025, 22:38)
mergify bot commented Jan 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@@ -89,4 +89,4 @@ repos:
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
verbose: true
Collaborator
nit

@@ -8,15 +8,15 @@
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params = SamplingParams() #temperature=0.8, top_p=0.95)
Collaborator
revert

@@ -34,4 +34,4 @@ run_mypy vllm/plugins
run_mypy vllm/prompt_adapter
run_mypy vllm/spec_decode
run_mypy vllm/worker
run_mypy vllm/v1
run_mypy vllm/v1
Collaborator
nit

alexm-redhat force-pushed the tpu_v1 branch 2 times, most recently from 90ecdbd to eee6378 (January 24, 2025, 19:44)
@vanbasten23

Hi @alexm-redhat, thanks for adding vLLM V1 support for TPU!
One quick question: these vLLM slides mention a few key changes in vLLM V1:

  • Simplified scheduler
  • Incremental input preparation
  • Piecewise CUDA graphs
  • Enhanced API server
  • More efficient Prefix caching
  • Fine-grained scheduling for VLMs

Could you help mark which changes are included in this PR and which will be made in future PRs?
cc @miladm

@@ -212,6 +212,13 @@ def schedule(self) -> "SchedulerOutput":
num_computed_tokens -= self.block_size
num_new_tokens = self.block_size
computed_blocks.pop()

# If chunked prefill is not enabled, then break out of the loop
Collaborator
This is a hack that hurts our development. We should find a way to not affect the scheduler.

Collaborator Author

@WoosukKwon I have addressed this issue by adding chunked-prompt support to TPU V1, and the PR is updated. Now there are no changes to the scheduler, so it is the same for both GPU and TPU. Thanks for pointing this out!

self.model(token_ids, position_ids, None, kv_caches)

def profile_run(self) -> None:
raise NotImplementedError()
Contributor
move this to base class ?

Collaborator Author

Good catch

mergify bot commented Jan 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label (Jan 28, 2025)
alexm-redhat (Collaborator Author)

@liangfu PrefillInputData stores a list of PallasMetadata, one per prompt.
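For reference, a rough reconstruction of that structure from the fields visible in the diff above (the exact types in the PR may differ):

from dataclasses import dataclass
from typing import List

import torch


@dataclass
class PrefillInputData:
    request_ids: List[str]
    prompt_lens: List[int]
    # One entry per prompt, since each prefill currently runs as its own
    # padded forward pass.
    token_ids: List[torch.Tensor]
    position_ids: List[torch.Tensor]
    attn_metadata: List["PallasMetadata"]  # TPU (Pallas) attention metadata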

alexm-redhat (Collaborator Author) commented Jan 28, 2025

@vanbasten23 @miladm @bvrockwell

Reply:

  • Simplified scheduler => No changes to the scheduler in this PR
  • Incremental input preparation => The input preparation is incremental here (same as for NVIDIA); however, it is not optimized yet (will work on it)
  • Piecewise CUDA graphs => Does TPU have support for this?
  • Enhanced API server => Same as NVIDIA; this PR does not touch the API server
  • More efficient prefix caching => Not enabled yet (will be next)
  • Fine-grained scheduling for VLMs => @mgoin's follow-up PR will have these optimizations. Need to land this PR so Michael can progress

Hope this helps!

Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Comment on lines +222 to +240
# TODO: Remove
# def _check_if_gpu_supports_dtype(torch_dtype: torch.dtype):
# # Check if the GPU supports the dtype.
# if torch_dtype == torch.bfloat16: # noqa: SIM102
# if not current_platform.has_device_capability(80):
# capability = current_platform.get_device_capability()
# gpu_name = current_platform.get_device_name()

# if capability is None:
# compute_str = "does not have a compute capability"
# else:
# version_str = capability.as_version_str()
# compute_str = f"has compute capability {version_str}"

# raise ValueError(
# "Bfloat16 is only supported on GPUs with compute capability "
# f"of at least 8.0. Your {gpu_name} GPU {compute_str}. "
# "You can use float16 instead by explicitly setting the"
# "`dtype` flag in CLI, for example: --dtype=half.")
Member

Remember to remove

else:
raise RuntimeError(
f"Not support device type: {self.device_config.device}")
assert self.device_config.device.type == "cuda"
Member

Let's keep the error message

Suggested change
assert self.device_config.device.type == "cuda"
assert self.device_config.device.type == "cuda", (
    f"Not supported device type: {self.device_config.device}")

Comment on lines +34 to +38
# FIXME(woosuk): Temporarily disabled top-p sampling since it's too slow.
_ENABLE_TOP_P = False
# FIXME(woosuk): A temporary hack to support `n > 1`.
# This can significantly affect the performance if too large.
_MAX_NUM_SAMPLES = 128
Member

I think these can be removed as unused for now

Member

nit: I think this should be called base_model_runner.py so that, as we add more "base" files, they are grouped together

Comment on lines +142 to +144
# input_tokens = torch.from_numpy(self.input_batch.token_ids_cpu[
# req_index, num_computed_tokens:padded_seq_len].reshape(1, -1))
# input_tokens[:, prompt_len:] = 0
Member

Remove cruft

Comment on lines +204 to +215
# TODO: Remove this
# if num_computed_tokens > 0:
# print("-------------------")
# print("input_tokens.shape = {}".format(input_tokens.shape))
# print("input_positions.shape = {}".format(
# input_positions.shape))
# print("slot_mapping.shape = {}".format(slot_mapping.shape))
# print("block_table.shape = {}".format(block_table.shape))
# print("context_lens.shape = {} data = {}".format(
# context_lens.shape, context_lens))
# print("effective_query_lens.shape = {} data = {}".format(
# effective_query_lens.shape, effective_query_lens))
Member

Remove cruft or hide behind debug var

Comment on lines +79 to +80
# use an empty tensor instead of `None`` to force Dynamo to pass
# it by reference, rather by specializing on the value ``None``.
Member

Suggested change
# use an empty tensor instead of `None`` to force Dynamo to pass
# it by reference, rather by specializing on the value ``None``.
# use an empty tensor instead of `None` to force Dynamo to pass
# it by reference, rather by specializing on the value `None`.
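For readers unfamiliar with the Dynamo detail above, a minimal standalone illustration of the pattern (hypothetical function, not the PR's code):

import torch


def forward(x: torch.Tensor, kv_cache: torch.Tensor) -> torch.Tensor:
    # kv_cache is always a tensor here. Using an empty tensor for the
    # "no cache" case keeps it a traced tensor input; passing None instead
    # would make Dynamo specialize the compiled graph on the value None.
    if kv_cache.numel() == 0:
        return x  # warm-up / profiling path with no cache contents
    return x + kv_cache


compiled = torch.compile(forward)
x = torch.randn(4, 8)
print(compiled(x, torch.empty(0)).shape)   # empty placeholder instead of None
print(compiled(x, torch.randn(4, 8)).shape)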

@@ -833,14 +629,15 @@ def load_model(self) -> None:
self.model_memory_usage / float(2**30))

@torch.inference_mode()
def _dummy_run(
def dummy_run(
Contributor

keep the underscore ?

Comment on lines +94 to +97
def _prepare_prompt_inputs(
self,
scheduler_output: "SchedulerOutput",
) -> PromptInputData:
Contributor

This is V0 style (_prepare_prompt_inputs + _prepare_decode_inputs). Can we reuse

def _prepare_inputs(self, scheduler_output: "SchedulerOutput"):

from the V1 GPU model runner, as it prepares both prefill and decode inputs?
