
[Core] Add Additional Metrics to vLLM Server #12726

Open

wants to merge 10 commits into base: main
Conversation

sahelib25

This PR is an updated and improved version of PR #12627. Please see some discussion there.


github-actions bot commented Feb 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the v1 label Feb 4, 2025
"Histogram of time spent per prefill token request in ms.",
labelnames=labelnames,
buckets=request_latency_buckets)
self.gauge_model_load_time_request = self._gauge_cls(
Collaborator:

Please adhere to our standards and Prometheus best practices for units (https://prometheus.io/docs/practices/naming/).

Always seconds, never milliseconds
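
For illustration, here is a minimal sketch (not part of this PR) of the same histogram declared in base units per the Prometheus naming guide; the metric name, labels, and buckets below are assumptions, not the PR's actual values:

    from prometheus_client import Histogram

    # Hypothetical metric following Prometheus conventions: the base unit is
    # seconds and the unit appears as the "_seconds" suffix of the name.
    histogram_time_per_prefill_token = Histogram(
        name="vllm:time_per_prefill_token_seconds",
        documentation=(
            "Histogram of time spent per prefill token per request, "
            "in seconds."),
        labelnames=["model_name"],
        buckets=[0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
    )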

Collaborator:

Comments:

  • I find it a bit odd that load time is sent through Prometheus; it never changes after startup
  • I find that this metric also does not capture the full startup time (e.g. compiling CUDA graphs)
  • I find it overkill to have multiple load-time-related metrics

Author:

Thanks @robertgshaw2-redhat for the review! This time_per_prefill_token metric is requested in milliseconds in #5041, hence I referred to the vllm:model_forward_time_milliseconds code here:

name="vllm:model_forward_time_milliseconds",

Collaborator:

That's fine, but I would like vLLM to adhere to Prometheus best practices as much as we can.

Author:

Thanks @robertgshaw2-redhat, I have removed the model_load_time metric.

Collaborator @robertgshaw2-redhat left a comment:

Are you going to add the metrics to V1?

        seq_group.state.current_step)

    # Calculate total tokens in queue
    total_tokens_in_queue = 0
Collaborator @robertgshaw2-redhat commented Feb 4, 2025:

We cannot loop through all requests in waiting. This list is unbounded in size; e.g. for a batch request this can easily be 10,000 items (or more likely 100,000+).
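
One way to avoid the scan, sketched here with hypothetical names rather than vLLM's actual scheduler API, is to maintain a running total that is updated as requests enter and leave the waiting queue, so reading the metric is O(1):

    class WaitingQueueTokenCounter:
        """Hypothetical helper: tracks queued prompt tokens without
        iterating over the (unbounded) waiting list."""

        def __init__(self) -> None:
            self.total_tokens_in_queue = 0

        def on_enqueue(self, num_prompt_tokens: int) -> None:
            self.total_tokens_in_queue += num_prompt_tokens

        def on_dequeue(self, num_prompt_tokens: int) -> None:
            self.total_tokens_in_queue -= num_prompt_tokens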

Author:

I have made the necessary changes; could you please take a look?

documentation="Maximum token capacity in tokens.",
labelnames=labelnames,
multiprocess_mode="sum")
self.gauge_total_tokens_in_current_batch_request = self._gauge_cls(
Collaborator:

duplicate of: vllm:iteration_tokens_total

        total_evicted = sum(seq.metrics.num_evicted_tokens
                            for seq in seq_group.get_seqs())
    else:
        # For CPU mode, no token evictions
Collaborator:

I don't think this is true...


Maybe the comment should say:

else:
    # We do not count token evictions for CPU

@@ -200,7 +209,20 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
        "Histogram of time spent in the model execute function in ms.",
        labelnames=labelnames,
        buckets=build_1_2_3_5_8_buckets(3000))
    # Metadata
    self.histogram_time_per_prefill_token_request = self._histogram_cls(
Collaborator @robertgshaw2-redhat commented Feb 4, 2025:

Why is this needed instead of vllm:time_to_first_token_seconds?

Author:

This metric is needed for disaggregated serving as it provides insight into the duration spent during the prefill stage.
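
As a rough sketch (not the PR's implementation), assuming a per-request prefill duration is available, the per-token value reduces to a simple division:

    def time_per_prefill_token_seconds(prefill_time_s: float,
                                       num_prompt_tokens: int) -> float:
        """Prefill latency per prompt token, in seconds; prefill_time_s and
        num_prompt_tokens are assumed to be recorded per request."""
        if num_prompt_tokens <= 0:
            return 0.0
        return prefill_time_s / num_prompt_tokens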

    ) + seq_group.sampling_params.max_tokens
    seq_group.metrics.max_token_capacity = (
        max_token_capacity)
    max_token_capacity_requests.append(max_token_capacity)
Collaborator:

This will crash if seq_group.sampling_params is None 😱
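
A minimal sketch of one possible guard, using hypothetical helper and argument names; requests without sampling params (e.g. pooling requests) would simply be skipped:

    from typing import Optional

    def compute_max_token_capacity(prompt_len: int,
                                   max_tokens: Optional[int]) -> Optional[int]:
        """Prompt tokens plus the request's max new tokens, or None when the
        request has no sampling params / max_tokens to read."""
        if max_tokens is None:
            return None
        return prompt_len + max_tokens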


The implementation is to be updated shortly.

Author:

@robertgshaw2-redhat, I have updated the implementation and renamed the metric to max_token_capacity_per_batch. Could you please take a look?

@@ -232,6 +254,22 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
        labelnames=labelnames,
        buckets=build_1_2_5_buckets(max_model_len),
    )
    self.gauge_max_token_capacity_request = self._gauge_cls(
Collaborator @robertgshaw2-redhat commented Feb 4, 2025:

This needs a more descriptive name.


This metric captures the maximum tokens that can be processed by the model server in total at its maximum batch size (ref). How about max_token_capacity_per_batch?

    self._log_histogram(self.metrics.histogram_model_forward_time_request,
                        stats.model_forward_time_requests)
    self._log_histogram(self.metrics.histogram_model_execute_time_request,
                        stats.model_execute_time_requests)
    # Model load time
Collaborator:

The _get_stats() function in llm_engine should format metrics properly so that this function is as clean as possible. Notice that every other metric is set up such that we can just call _log_xxx in this function.
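
For reference, a simplified sketch of the pattern being asked for, with hypothetical Stats fields: _get_stats() fills plain values, and the Prometheus logger reduces to _log_xxx calls:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Stats:
        # Filled in by _get_stats(); field names here are assumptions.
        model_forward_time_requests: List[float] = field(default_factory=list)
        model_execute_time_requests: List[float] = field(default_factory=list)
        model_load_time_s: float = 0.0

    def log_stats(logger, stats: Stats) -> None:
        # Nothing but _log_xxx calls -- no per-request computation here.
        logger._log_histogram(
            logger.metrics.histogram_model_forward_time_request,
            stats.model_forward_time_requests)
        logger._log_histogram(
            logger.metrics.histogram_model_execute_time_request,
            stats.model_execute_time_requests)
        logger._log_gauge(logger.metrics.gauge_model_load_time_request,
                          stats.model_load_time_s)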

@@ -120,6 +120,15 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
        name="vllm:tokens_total",
        documentation="Number of prefill plus generation tokens processed.",
        labelnames=labelnames)
    self.counter_requests_with_evicted_tokens = self._counter_cls(
        name="vllm:requests_with_evicted_tokens_total",
Collaborator:

  • duplicate of vllm:num_preemptions_total

Author:

Hi @robertgshaw2-redhat,
Please correct me if I am wrong, but I wanted to clarify:

  • vllm:num_preemptions_total: From my understanding, this tracks the number of preempted requests during the current scheduling iteration.
  • vllm:requests_with_evicted_tokens_total: On the other hand, this metric represents the number of requests that had tokens evicted from the KV cache.

@psyhtest commented Feb 10, 2025:

I may be wrong here, but I suspect that a preempted request will always have its tokens evicted from the KV cache?

Is it possible to have a request not preempted, but to have its tokens evicted anyway (e.g. due to capacity issues)? If so, the metrics will be different, right?

"Number of requests that had tokens evicted from KV cache",
labelnames=labelnames)
self.counter_total_evicted_tokens = self._counter_cls(
name="vllm:total_evicted_tokens_total",
Collaborator:

  • we should call this total_preempted_tokens for consistency with vllm:num_preemptions_total

Collaborator @robertgshaw2-redhat left a comment:

Thanks for the PR!

TLDR Comments:

  • Several of the metrics added by this PR are duplicates or very similar to existing metrics we have
  • This only implements metrics on V0

@psyhtest commented Feb 4, 2025:

Are you going to add the metrics to V1?

Yes, that's the plan. Would you recommend to use this PR, or open a new one?

@robertgshaw2-redhat (Collaborator) commented:

Are you going to add the metrics to V1?

Yes, that's the plan. Would you recommend to use this PR, or open a new one?

Fine to do in another PR. There is some ongoing work from @markmc on V1 metrics so just make sure to coordinate

* Add New Metrics to VLLM Server (To test) (#4)

* Add metrics model_load_time and max_token_capacity

* Add time_per_prefill_token

* Add total_tokens_in_current_batch

* Add total_tokens_in_queue (prefill + decode)

* Add request_with_evicted_tokens

* Add total_evicted_tokens and fix for request_with_evicted_tokens.

* Fix max_token_capacity metric

* Fix code to have consistent naming of variables

* Update metrics.py

* Fix model_load_time metric and update scripts.

* Update Scripts.

* Revert changes.

* Fix formatting

* Fix model_loader.py script

* Add tests.

* Fix pre-commit errors.

* Make ruff happy.

* Fix to track evictions in GPU mode.

* Fix to track evictions in GPU mode.

* Fix to track evictions in GPU mode.

* fix merge conflicts.

* fix merge conflicts.

* fix merge conflicts.

* fix merge conflicts.

* Fix formatting

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>

* Fixes.

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>

---------

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
@miladm miladm self-requested a review February 7, 2025 19:34

mergify bot commented Feb 10, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sahelib25.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 10, 2025
Update max_token_capacity_per_batch

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
@mergify mergify bot removed the needs-rebase label Feb 10, 2025
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
    waiting_queue = scheduler.waiting
    for waiting_seq_group in waiting_queue:
        # Add prompt tokens
        prompt_length = len(waiting_seq_group.prompt_token_ids)


Is there a way to avoid double counting tokens in the queue that may already exist in the kv cache when prefix caching is enabled?

If not, is there another metric we could consider that would enable us to identify how many duplicate tokens are in the queue that already exist in the kv cache when prefix caching is enabled?
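
If the scheduler can report how many of a waiting request's prompt tokens already have blocks in the KV cache (a hypothetical num_cached_tokens value below), the queue metric could count only the uncached remainder; a minimal sketch:

    def uncached_prompt_tokens(prompt_length: int,
                               num_cached_tokens: int) -> int:
        """Prompt tokens that still need prefill work, excluding tokens
        already covered by prefix-cached KV blocks (num_cached_tokens is
        an assumed per-request value)."""
        return max(prompt_length - num_cached_tokens, 0)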

…+CUDA graph capture+engine initialization).

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>

mergify bot commented Feb 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sahelib25.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 14, 2025
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
…etrics

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>