[Core] Add Additional Metrics to vLLM Server #12726
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
vllm/engine/metrics.py
Outdated
"Histogram of time spent per prefill token request in ms.", | ||
labelnames=labelnames, | ||
buckets=request_latency_buckets) | ||
self.gauge_model_load_time_request = self._gauge_cls( |
Please adhere to our standards and Prometheus best practices for units (https://prometheus.io/docs/practices/naming/).
Always seconds, never milliseconds.
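For illustration, here is a minimal sketch of a latency metric that follows this convention, using prometheus_client directly; the metric name, label, and buckets are assumptions for the example, not vLLM's actual definitions.

```python
# Hypothetical sketch: a latency histogram expressed in seconds, with the
# unit as the metric-name suffix per the Prometheus naming guidelines.
from prometheus_client import Histogram

histogram_time_per_prefill_token = Histogram(
    name="vllm:time_per_prefill_token_seconds",
    documentation="Histogram of time spent per prefill token, in seconds.",
    labelnames=["model_name"],
    buckets=[0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# Callers record seconds; any millisecond measurement is converted first.
histogram_time_per_prefill_token.labels(model_name="my-model").observe(12.3 / 1000)
```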
Comments:
- I find it a bit odd that load time is sent through Prometheus; it never changes after startup.
- I find that this metric also does not capture the full startup time (e.g. compiling CUDA graphs); see the sketch after this list.
- I find it overkill to have multiple load-time-related metrics.
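To make that point concrete, a hedged sketch of the single startup measurement hinted at above: time the entire engine bring-up once (weight loading, CUDA graph capture, initialization) and expose it as one gauge. `build_engine` is a placeholder, not a vLLM API.

```python
import time

from prometheus_client import Gauge

# Hypothetical single gauge covering the whole startup, set exactly once.
gauge_engine_startup_time = Gauge(
    "vllm:engine_startup_time_seconds",
    "Wall-clock time to fully initialize the engine, in seconds.",
)

def timed_startup(build_engine):
    """Wrap whatever constructs the engine and record the total startup time."""
    start = time.monotonic()
    engine = build_engine()  # placeholder for the real constructor
    gauge_engine_startup_time.set(time.monotonic() - start)
    return engine
```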
Thanks @robertgshaw2-redhat for the review! This time_per_prefill_token metric is requested in milliseconds in #5041, hence I referred to the vllm:model_forward_time_milliseconds code here (vllm/engine/metrics.py, line 192 in 18a88fc):

name="vllm:model_forward_time_milliseconds",
That's fine, but I would like vLLM to adhere to Prometheus best practices as much as we can.
Thanks @robertgshaw2-redhat, I have removed the model_load_time metric.
Are you going to add the metrics to V1?
vllm/engine/llm_engine.py
Outdated
seq_group.state.current_step)

# Calculate total tokens in queue
total_tokens_in_queue = 0
We cannot loop through all requests in waiting; this list is unbounded in size. For a batch request it can easily be 10,000 items (or more likely 100,000+).
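One way to sidestep the unbounded loop, sketched with illustrative names rather than vLLM's actual scheduler API: keep a running counter that is updated in O(1) whenever a request enters or leaves the waiting queue, and have the stats path read the counter instead of iterating.

```python
class QueueTokenAccounting:
    """Hypothetical O(1) bookkeeping for tokens sitting in the waiting queue."""

    def __init__(self) -> None:
        self.total_tokens_in_queue = 0

    def on_enqueue(self, num_prompt_tokens: int) -> None:
        # Called when a request is added to the waiting queue.
        self.total_tokens_in_queue += num_prompt_tokens

    def on_dequeue(self, num_prompt_tokens: int) -> None:
        # Called when a request is scheduled, aborted, or otherwise leaves the queue.
        self.total_tokens_in_queue -= num_prompt_tokens
```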
I have made the necessary changes; could you please take a look?
documentation="Maximum token capacity in tokens.", | ||
labelnames=labelnames, | ||
multiprocess_mode="sum") | ||
self.gauge_total_tokens_in_current_batch_request = self._gauge_cls( |
duplicate of: vllm:iteration_tokens_total
total_evicted = sum(seq.metrics.num_evicted_tokens
                    for seq in seq_group.get_seqs())
else:
    # For CPU mode, no token evictions
I don't think this is true...
Maybe the comment should say:
else:
# We do not count token evictions for CPU
@@ -200,7 +209,20 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
"Histogram of time spent in the model execute function in ms.",
labelnames=labelnames,
buckets=build_1_2_3_5_8_buckets(3000))
# Metadata
self.histogram_time_per_prefill_token_request = self._histogram_cls(
Why is this needed instead of vllm:time_to_first_token_seconds?
This metric is needed for disaggregated serving as it provides insight into the duration spent during the prefill stage.
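For context, a per-prefill-token figure could in principle be derived from quantities that are already recorded, roughly as below; the function and argument names are illustrative, not part of vLLM.

```python
def time_per_prefill_token_seconds(prefill_time_s: float,
                                   num_prompt_tokens: int) -> float:
    """Prefill duration normalized by prompt length, in seconds per token."""
    if num_prompt_tokens <= 0:
        return 0.0
    return prefill_time_s / num_prompt_tokens

# Example: a 4096-token prompt whose prefill took 0.8 s -> ~0.195 ms per token.
print(time_per_prefill_token_seconds(0.8, 4096))
```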
vllm/engine/llm_engine.py
Outdated
) + seq_group.sampling_params.max_tokens
seq_group.metrics.max_token_capacity = (
    max_token_capacity)
max_token_capacity_requests.append(max_token_capacity)
This will crash if seq_group.sampling_params is None
😱
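A hedged sketch of the guard being asked for; the attribute names mirror the snippet above, but the helper itself is illustrative.

```python
from typing import Optional

def max_token_capacity_for(prompt_len: int,
                           sampling_params: Optional[object]) -> int:
    """Prompt length plus max_tokens, tolerating sampling_params=None
    (e.g. requests that carry no sampling parameters)."""
    max_new_tokens = 0
    if sampling_params is not None:
        max_new_tokens = getattr(sampling_params, "max_tokens", None) or 0
    return prompt_len + max_new_tokens
```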
I will update the implementation shortly.
@robertgshaw2-redhat, I have updated the implementation and renamed the metric to max_token_capacity_per_batch. Could you please take a look?
vllm/engine/metrics.py
Outdated
@@ -232,6 +254,22 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
self.gauge_max_token_capacity_request = self._gauge_cls(
This needs a more descriptive name.
This metric captures the maximum number of tokens the model server can process in total at its maximum batch size (ref). How about max_token_capacity_per_batch?
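A hedged sketch of what the renamed gauge could expose: the scheduler's configured per-iteration token budget, set once at startup. The attribute name max_num_batched_tokens mirrors vLLM's scheduler config, but treat it as an assumption here.

```python
from prometheus_client import Gauge

gauge_max_token_capacity_per_batch = Gauge(
    "vllm:max_token_capacity_per_batch",
    "Maximum number of tokens the engine can process in a single batch.",
    ["model_name"],
)

def report_max_token_capacity(scheduler_config, model_name: str) -> None:
    # Assumed attribute holding the per-batch token budget.
    gauge_max_token_capacity_per_batch.labels(model_name=model_name).set(
        scheduler_config.max_num_batched_tokens)
```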
vllm/engine/metrics.py
Outdated
self._log_histogram(self.metrics.histogram_model_forward_time_request,
                    stats.model_forward_time_requests)
self._log_histogram(self.metrics.histogram_model_execute_time_request,
                    stats.model_execute_time_requests)
# Model load time
The _get_stats() function in llm_engine should format metrics properly so that this function is as clean as possible. Notice that every other metric is set up such that we can just call _log_xxx in this function.
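Roughly, the separation being requested looks like the sketch below: _get_stats() assembles plain lists of values, and the logging function does nothing but iterate them. The class names are illustrative stand-ins for the real Stats and logger classes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Stats:
    # _get_stats() in llm_engine fills these lists with ready-to-observe values.
    model_forward_time_requests: List[float] = field(default_factory=list)
    model_execute_time_requests: List[float] = field(default_factory=list)

class PrometheusLogger:
    def __init__(self, metrics) -> None:
        self.metrics = metrics

    def _log_histogram(self, histogram, values: List[float]) -> None:
        for value in values:
            histogram.observe(value)

    def log_prometheus(self, stats: Stats) -> None:
        # The logging path stays a flat sequence of _log_xxx calls.
        self._log_histogram(self.metrics.histogram_model_forward_time_request,
                            stats.model_forward_time_requests)
        self._log_histogram(self.metrics.histogram_model_execute_time_request,
                            stats.model_execute_time_requests)
```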
@@ -120,6 +120,15 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
name="vllm:tokens_total",
documentation="Number of prefill plus generation tokens processed.",
labelnames=labelnames)
self.counter_requests_with_evicted_tokens = self._counter_cls(
    name="vllm:requests_with_evicted_tokens_total",
Duplicate of vllm:num_preemptions_total.
Hi @robertgshaw2-redhat,
Please correct me if I am wrong, but I wanted to clarify:
- vllm:num_preemptions_total: from my understanding, this tracks the number of requests preempted during the current scheduling iteration.
- vllm:requests_with_evicted_tokens_total: on the other hand, this metric represents the number of requests that had tokens evicted from the KV cache.
I may be wrong here, but I suspect that a preempted request will always have its tokens evicted from the KV cache?
Is it possible for a request that was not preempted to have its tokens evicted anyway (e.g. due to capacity issues)? If so, the metrics will be different, right?
"Number of requests that had tokens evicted from KV cache", | ||
labelnames=labelnames) | ||
self.counter_total_evicted_tokens = self._counter_cls( | ||
name="vllm:total_evicted_tokens_total", |
We should call this total_preempted_tokens for consistency with vllm:num_preemptions_total.
Thanks for the PR!
TLDR Comments:
- Several of the metrics added by this PR are duplicates or very similar to existing metrics we have
- This only implements metrics on V0
Yes, that's the plan. Would you recommend using this PR, or opening a new one?
Fine to do in another PR. There is some ongoing work from @markmc on V1 metrics, so just make sure to coordinate.
* Add New Metrics to VLLM Server (To test) (#4)
  * Add metrics model_load_time and max_token_capacity
  * Add time_per_prefill_token
  * Add total_tokens_in_current_batch
  * Add total_tokens_in_queue (prefill + decode)
  * Add request_with_evicted_tokens
  * Add total_evicted_tokens and fix for request_with_evicted_tokens
  * Fix max_token_capacity metric
  * Fix code to have consistent naming of variables
  * Update metrics.py
  * Fix model_load_time metric and update scripts
  * Update scripts
  * Revert changes
  * Fix formatting
  * Fix model_loader.py script
  * Add tests
  * Fix pre-commit errors
  * Make ruff happy
  * Fix to track evictions in GPU mode
  * Fix merge conflicts
  * Fixes

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Force-pushed from 0529c9d to 47aee12
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
This pull request has merge conflicts that must be resolved before it can be merged.
Update max_token_capacity_per_batch Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
waiting_queue = scheduler.waiting
for waiting_seq_group in waiting_queue:
    # Add prompt tokens
    prompt_length = len(waiting_seq_group.prompt_token_ids)
Is there a way to avoid double counting tokens in the queue that may already exist in the kv cache when prefix caching is enabled?
If not, is there another metric we could consider that would enable us to identify how many duplicate tokens are in the queue that already exist in the kv cache when prefix caching is enabled?
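If the scheduler could report how many leading tokens of a queued prompt are already cached, the queue metric could subtract them, roughly like this; num_cached_prefix_tokens is a hypothetical callback, not an existing vLLM API.

```python
from typing import Callable, List

def queued_tokens_excluding_cached(
        prompt_token_ids: List[int],
        num_cached_prefix_tokens: Callable[[List[int]], int]) -> int:
    """Count queued prompt tokens, minus any prefix already in the KV cache."""
    cached = num_cached_prefix_tokens(prompt_token_ids)  # hypothetical lookup
    return max(len(prompt_token_ids) - cached, 0)
```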
…+CUDA graph capture+engine initialization). Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
…etrics Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Force-pushed from bcdc46e to 775f62a
This PR is an updated and improved version of PR #12627. Please see some discussion there.