[Core] Add Additional Metrics to vLLM Server #12726
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
vllm/engine/metrics.py
Outdated
"Histogram of time spent per prefill token request in ms.", | ||
labelnames=labelnames, | ||
buckets=request_latency_buckets) | ||
self.gauge_model_load_time_request = self._gauge_cls( |
Please adhere to our standards and Prometheus best practices for units (https://prometheus.io/docs/practices/naming/).
Always seconds, never milliseconds.
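For illustration, here is a minimal sketch of a latency metric that follows this convention, using prometheus_client directly; the metric name, label, and buckets are assumptions for the example, not vLLM's actual definitions.

```python
# Hypothetical sketch: a latency histogram expressed in seconds, with the
# unit as the metric-name suffix per the Prometheus naming guidelines.
from prometheus_client import Histogram

histogram_time_per_prefill_token = Histogram(
    name="vllm:time_per_prefill_token_seconds",
    documentation="Histogram of time spent per prefill token, in seconds.",
    labelnames=["model_name"],
    buckets=[0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# Callers record seconds; any millisecond measurement is converted first.
histogram_time_per_prefill_token.labels(model_name="my-model").observe(12.3 / 1000)
```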
Comments:
- I find it a bit odd that load time is sent through Prometheus; it never changes after startup.
- I find that this metric also does not capture the full startup time (e.g. compiling CUDA graphs); see the sketch after this list.
- I find it overkill to have multiple load-time-related metrics.
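To make that point concrete, a hedged sketch of the single startup measurement hinted at above: time the entire engine bring-up once (weight loading, CUDA graph capture, initialization) and expose it as one gauge. `build_engine` is a placeholder, not a vLLM API.

```python
import time

from prometheus_client import Gauge

# Hypothetical single gauge covering the whole startup, set exactly once.
gauge_engine_startup_time = Gauge(
    "vllm:engine_startup_time_seconds",
    "Wall-clock time to fully initialize the engine, in seconds.",
)

def timed_startup(build_engine):
    """Wrap whatever constructs the engine and record the total startup time."""
    start = time.monotonic()
    engine = build_engine()  # placeholder for the real constructor
    gauge_engine_startup_time.set(time.monotonic() - start)
    return engine
```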
Thanks @robertgshaw2-redhat for the review! This time_per_prefill_token metric is requested in milliseconds in #5041, hence I referred to the vllm:model_forward_time_milliseconds code here (vllm/engine/metrics.py, line 192 in 18a88fc):

name="vllm:model_forward_time_milliseconds",
That's fine, but I would like vLLM to adhere to Prometheus best practices as much as we can.
Thanks @robertgshaw2-redhat, I have removed the model_load_time metric.
Are you going to add the metrics to V1?
vllm/engine/llm_engine.py
Outdated
seq_group.state.current_step)

# Calculate total tokens in queue
total_tokens_in_queue = 0
We cannot loop through all requests in waiting; this list is unbounded in size. For a batch request it can easily be 10,000 items (or more likely 100,000+).
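One way to sidestep the unbounded loop, sketched with illustrative names rather than vLLM's actual scheduler API: keep a running counter that is updated in O(1) whenever a request enters or leaves the waiting queue, and have the stats path read the counter instead of iterating.

```python
class QueueTokenAccounting:
    """Hypothetical O(1) bookkeeping for tokens sitting in the waiting queue."""

    def __init__(self) -> None:
        self.total_tokens_in_queue = 0

    def on_enqueue(self, num_prompt_tokens: int) -> None:
        # Called when a request is added to the waiting queue.
        self.total_tokens_in_queue += num_prompt_tokens

    def on_dequeue(self, num_prompt_tokens: int) -> None:
        # Called when a request is scheduled, aborted, or otherwise leaves the queue.
        self.total_tokens_in_queue -= num_prompt_tokens
```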
I have made the necessary changes; could you please take a look?
documentation="Maximum token capacity in tokens.", | ||
labelnames=labelnames, | ||
multiprocess_mode="sum") | ||
self.gauge_total_tokens_in_current_batch_request = self._gauge_cls( |
duplicate of: vllm:iteration_tokens_total
total_evicted = sum(seq.metrics.num_evicted_tokens
                    for seq in seq_group.get_seqs())
else:
    # For CPU mode, no token evictions
I don't think this is true...
Maybe the comment should say:
else:
# We do not count token evictions for CPU
@@ -200,7 +209,20 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
"Histogram of time spent in the model execute function in ms.",
labelnames=labelnames,
buckets=build_1_2_3_5_8_buckets(3000))
# Metadata
self.histogram_time_per_prefill_token_request = self._histogram_cls(
Why is this needed instead of vllm:time_to_first_token_seconds?
This metric is needed for disaggregated serving as it provides insight into the duration spent during the prefill stage.
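For context, a per-prefill-token figure could in principle be derived from quantities that are already recorded, roughly as below; the function and argument names are illustrative, not part of vLLM.

```python
def time_per_prefill_token_seconds(prefill_time_s: float,
                                   num_prompt_tokens: int) -> float:
    """Prefill duration normalized by prompt length, in seconds per token."""
    if num_prompt_tokens <= 0:
        return 0.0
    return prefill_time_s / num_prompt_tokens

# Example: a 4096-token prompt whose prefill took 0.8 s -> ~0.195 ms per token.
print(time_per_prefill_token_seconds(0.8, 4096))
```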
vllm/engine/llm_engine.py
Outdated
) + seq_group.sampling_params.max_tokens
seq_group.metrics.max_token_capacity = (
    max_token_capacity)
max_token_capacity_requests.append(max_token_capacity)
This will crash if seq_group.sampling_params is None
😱
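A hedged sketch of the guard being asked for; the attribute names mirror the snippet above, but the helper itself is illustrative.

```python
from typing import Optional

def max_token_capacity_for(prompt_len: int,
                           sampling_params: Optional[object]) -> int:
    """Prompt length plus max_tokens, tolerating sampling_params=None
    (e.g. requests that carry no sampling parameters)."""
    max_new_tokens = 0
    if sampling_params is not None:
        max_new_tokens = getattr(sampling_params, "max_tokens", None) or 0
    return prompt_len + max_new_tokens
```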
I will update the implementation shortly.
@robertgshaw2-redhat, I have updated the implementation and renamed the metric to max_token_capacity_per_batch. Could you please take a look?
vllm/engine/metrics.py
Outdated
@@ -232,6 +254,22 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
self.gauge_max_token_capacity_request = self._gauge_cls(
This needs a more descriptive name.
This metric captures the maximum number of tokens the model server can process in total at its maximum batch size (ref). How about max_token_capacity_per_batch?
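A hedged sketch of what the renamed gauge could expose: the scheduler's configured per-iteration token budget, set once at startup. The attribute name max_num_batched_tokens mirrors vLLM's scheduler config, but treat it as an assumption here.

```python
from prometheus_client import Gauge

gauge_max_token_capacity_per_batch = Gauge(
    "vllm:max_token_capacity_per_batch",
    "Maximum number of tokens the engine can process in a single batch.",
    ["model_name"],
)

def report_max_token_capacity(scheduler_config, model_name: str) -> None:
    # Assumed attribute holding the per-batch token budget.
    gauge_max_token_capacity_per_batch.labels(model_name=model_name).set(
        scheduler_config.max_num_batched_tokens)
```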
vllm/engine/metrics.py
Outdated
self._log_histogram(self.metrics.histogram_model_forward_time_request,
                    stats.model_forward_time_requests)
self._log_histogram(self.metrics.histogram_model_execute_time_request,
                    stats.model_execute_time_requests)
# Model load time
The _get_stats() function in llm_engine should format metrics properly so that this function is as clean as possible. Notice that every other metric is set up such that we can just call _log_xxx in this function.
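Roughly, the separation being requested looks like the sketch below: _get_stats() assembles plain lists of values, and the logging function does nothing but iterate them. The class names are illustrative stand-ins for the real Stats and logger classes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Stats:
    # _get_stats() in llm_engine fills these lists with ready-to-observe values.
    model_forward_time_requests: List[float] = field(default_factory=list)
    model_execute_time_requests: List[float] = field(default_factory=list)

class PrometheusLogger:
    def __init__(self, metrics) -> None:
        self.metrics = metrics

    def _log_histogram(self, histogram, values: List[float]) -> None:
        for value in values:
            histogram.observe(value)

    def log_prometheus(self, stats: Stats) -> None:
        # The logging path stays a flat sequence of _log_xxx calls.
        self._log_histogram(self.metrics.histogram_model_forward_time_request,
                            stats.model_forward_time_requests)
        self._log_histogram(self.metrics.histogram_model_execute_time_request,
                            stats.model_execute_time_requests)
```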
@@ -120,6 +120,15 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
name="vllm:tokens_total",
documentation="Number of prefill plus generation tokens processed.",
labelnames=labelnames)
self.counter_requests_with_evicted_tokens = self._counter_cls(
    name="vllm:requests_with_evicted_tokens_total",
Duplicate of vllm:num_preemptions_total.
Hi @robertgshaw2-redhat,
Please correct me if I am wrong, but I wanted to clarify:
- vllm:num_preemptions_total: from my understanding, this tracks the number of requests preempted during the current scheduling iteration.
- vllm:requests_with_evicted_tokens_total: on the other hand, this metric represents the number of requests that had tokens evicted from the KV cache.
I may be wrong here, but I suspect that a preempted request will always have its tokens evicted from the KV cache?
Is it possible for a request that was not preempted to have its tokens evicted anyway (e.g. due to capacity issues)? If so, the metrics will be different, right?
"Number of requests that had tokens evicted from KV cache", | ||
labelnames=labelnames) | ||
self.counter_total_evicted_tokens = self._counter_cls( | ||
name="vllm:total_evicted_tokens_total", |
We should call this total_preempted_tokens for consistency with vllm:num_preemptions_total.
Thanks for the PR!
TLDR Comments:
- Several of the metrics added by this PR are duplicates or very similar to existing metrics we have
- This only implements metrics on V0
Yes, that's the plan. Would you recommend using this PR, or opening a new one?
Fine to do in another PR. There is some ongoing work from @markmc on V1 metrics, so just make sure to coordinate.
* Add New Metrics to VLLM Server (To test) (#4)
  * Add metrics model_load_time and max_token_capacity
  * Add time_per_prefill_token
  * Add total_tokens_in_current_batch
  * Add total_tokens_in_queue (prefill + decode)
  * Add request_with_evicted_tokens
  * Add total_evicted_tokens and fix for request_with_evicted_tokens
  * Fix max_token_capacity metric
  * Fix code to have consistent naming of variables
  * Update metrics.py
  * Fix model_load_time metric and update scripts
  * Update scripts
  * Revert changes
  * Fix formatting
  * Fix model_loader.py script
  * Add tests
  * Fix pre-commit errors
  * Make ruff happy
  * Fix to track evictions in GPU mode
  * Fix merge conflicts
  * Fixes

Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Force-pushed from 0529c9d to 47aee12
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
This pull request has merge conflicts that must be resolved before it can be merged.
Update max_token_capacity_per_batch Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
waiting_queue = scheduler.waiting
for waiting_seq_group in waiting_queue:
    # Add prompt tokens
    prompt_length = len(waiting_seq_group.prompt_token_ids)
Is there a way to avoid double counting tokens in the queue that may already exist in the kv cache when prefix caching is enabled?
If not, is there another metric we could consider that would enable us to identify how many duplicate tokens are in the queue that already exist in the kv cache when prefix caching is enabled?
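If the scheduler could report how many leading tokens of a queued prompt are already cached, the queue metric could subtract them, roughly like this; num_cached_prefix_tokens is a hypothetical callback, not an existing vLLM API.

```python
from typing import Callable, List

def queued_tokens_excluding_cached(
        prompt_token_ids: List[int],
        num_cached_prefix_tokens: Callable[[List[int]], int]) -> int:
    """Count queued prompt tokens, minus any prefix already in the KV cache."""
    cached = num_cached_prefix_tokens(prompt_token_ids)  # hypothetical lookup
    return max(len(prompt_token_ids) - cached, 0)
```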
…+CUDA graph capture+engine initialization). Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
…etrics Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
Force-pushed from bcdc46e to 775f62a
This PR is an updated and improved version of PR #12627. Please see some discussion there.