[Feature][v1]: Add metrics support #10582

rickyyx opened this issue Nov 22, 2024 · 3 comments

rickyyx commented Nov 22, 2024

🚀 The feature, motivation and pitch

We should also reach feature parity on metrics, covering most of the available stats where possible. At a high level:

  1. [P0] Support system and request stats logging
  2. [P0] Support metric export to Prometheus.
  3. [P1] Support or deprecate all metrics from V0
  4. [P1] Allow users to define their own Prometheus client and other arbitrary loggers (see the sketch after this list).
  5. [P2] Make it work with tracing too (there are some request-level stats that tracing needs, like queue time and TTFT). It should be possible to surface these request-level metrics in v1 as well.
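
For item 4, here is a rough sketch of what a pluggable stats logger could look like; the `StatLogger` protocol, the `SchedulerStats` fields, and the `LoggingStatLogger` name below are illustrative assumptions, not the actual vLLM interface:

```python
# Hypothetical sketch of a pluggable stats logger (not the actual vLLM API).
# The engine loop would call log() on every registered logger each iteration.
import logging
from dataclasses import dataclass
from typing import Protocol

logger = logging.getLogger("vllm.metrics")


@dataclass
class SchedulerStats:
    """Illustrative per-iteration snapshot of system state."""
    num_running: int
    num_waiting: int
    gpu_cache_usage: float


class StatLogger(Protocol):
    """Anything implementing log() could be registered by the user."""
    def log(self, stats: SchedulerStats) -> None: ...


class LoggingStatLogger:
    """A plain-text logger, the kind of 'arbitrary logger' item 4 refers to."""
    def log(self, stats: SchedulerStats) -> None:
        logger.info("Running: %d reqs, Waiting: %d reqs, GPU KV cache: %.1f%%",
                    stats.num_running, stats.num_waiting,
                    stats.gpu_cache_usage * 100)
```

A Prometheus-backed logger would implement the same protocol and export to a `/metrics` endpoint instead of writing log lines.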

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

rickyyx commented Nov 22, 2024

Opening the issue to track and collaborate, in case someone else is already looking into this.

rickyyx commented Nov 26, 2024

Prototype in #10651

markmc added a commit to markmc/vllm that referenced this issue Jan 24, 2025
Part of vllm-project#10582

Implement the vllm:num_requests_running and vllm:num_requests_waiting
gauges from V0. This is a simple starting point from which to iterate
towards parity with V0.

There's no need to use prometheus_client's "multi-processing mode"
(at least at this stage) because these metrics all exist within the
API server process.

Note this restores the following metrics - these were lost when we
started using multi-processing mode:

- python_gc_objects_collected_total
- python_gc_objects_uncollectable_total
- python_gc_collections_total
- python_info
- process_virtual_memory_bytes
- process_resident_memory_bytes
- process_start_time_seconds
- process_cpu_seconds_total
- process_open_fds
- process_max_fds

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
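
As a companion to the commit above, a minimal sketch of how the two gauges can live in the API server process and be exposed without prometheus_client's multi-processing mode; the FastAPI wiring shown is illustrative, not the actual vLLM server code:

```python
# Minimal sketch: register gauges in the API server process and expose the
# default registry, so no prometheus_client multiprocess mode is needed and
# the built-in python_* / process_* collectors listed above remain available.
from fastapi import FastAPI
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()

num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running.")
num_requests_waiting = Gauge(
    "vllm:num_requests_waiting",
    "Number of requests waiting to be scheduled.")

# Mount the Prometheus ASGI app so GET /metrics serves the scrape output.
app.mount("/metrics", make_asgi_app())
```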

markmc commented Feb 4, 2025

I thought it was about time to give an update on the latest status of this and note some TODOs.

Status

The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint.
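
For a quick client-side check of that endpoint, something like the following works (the host and port are assumptions, not part of this issue):

```python
# Scrape and parse the /metrics endpoint; localhost:8000 is an assumption.
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

text = urlopen("http://localhost:8000/metrics").read().decode()
for family in text_string_to_metric_families(text):
    if family.name.startswith("vllm"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```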

Several follow-up PRs should merge soon, which will mean we support the following metrics:

  • vllm:num_requests_running (Gauge)
  • vllm:num_requests_waiting (Gauge)
  • vllm:gpu_cache_usage_perc (Gauge)
  • vllm:prompt_tokens_total (Counter)
  • vllm:generation_tokens_total (Counter)
  • vllm:request_success_total (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:request_generation_tokens (Histogram)
  • vllm:time_to_first_token_seconds (Histogram)
  • vllm:time_per_output_token_seconds (Histogram)
  • vllm:e2e_request_latency_seconds (Histogram)
  • vllm:request_queue_time_seconds (Histogram)
  • vllm:request_inference_time_seconds (Histogram)
  • vllm:request_prefill_time_seconds (Histogram)
  • vllm:request_decode_time_seconds (Histogram)

Also, note that vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits (Counters) replace vllm:gpu_prefix_cache_hit_rate (Gauge).
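
The counter pair is preferable because the hit rate can then be derived downstream over any time window (e.g. a ratio of rates in PromQL) rather than being fixed by the server. A sketch of how the pair might be recorded; the hook name and arguments are illustrative:

```python
# Illustrative only: two monotonic counters from which the hit rate is
# derived downstream, e.g. in PromQL (names per the list above):
#   rate(vllm:gpu_prefix_cache_hits[5m]) / rate(vllm:gpu_prefix_cache_queries[5m])
from prometheus_client import Counter

gpu_prefix_cache_queries = Counter(
    "vllm:gpu_prefix_cache_queries",
    "Number of prefix cache blocks queried.")
gpu_prefix_cache_hits = Counter(
    "vllm:gpu_prefix_cache_hits",
    "Number of prefix cache blocks that hit.")


def record_prefix_cache_lookup(num_queried: int, num_hits: int) -> None:
    """Hypothetical hook called after each prefix cache lookup."""
    gpu_prefix_cache_queries.inc(num_queried)
    gpu_prefix_cache_hits.inc(num_hits)
```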

These are most of the metrics used by the example Grafana dashboard, with the exception of:

  • vllm:num_requests_swapped (Gauge)
  • vllm:cpu_cache_usage_perc (Gauge)
  • vllm:request_max_num_generation_tokens (Histogram)

Additionally, these are other metrics supported by v0, but not yet by v1:

  • vllm:num_preemptions_total (Counter)
  • vllm:cache_config_info (Gauge)
  • vllm:lora_requests_info (Gauge)
  • vllm:cpu_prefix_cache_hit_rate (Gauge)
  • vllm:tokens_total (Counter)
  • vllm:iteration_tokens_total (Histogram)
  • vllm:time_in_queue_requests (Histogram)
  • vllm:model_forward_time_milliseconds (Histogram)
  • vllm:model_execute_time_milliseconds (Histogram)
  • vllm:request_params_n (Histogram)
  • vllm:request_params_max_tokens (Histogram)
  • vllm:spec_decode_draft_acceptance_rate (Gauge)
  • vllm:spec_decode_efficiency (Gauge)
  • vllm:spec_decode_num_accepted_tokens_total (Counter)
  • vllm:spec_decode_num_draft_tokens_total (Counter)
  • vllm:spec_decode_num_emitted_tokens_total (Counter)

Next Steps

markmc added a commit to markmc/vllm that referenced this issue Feb 7, 2025
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time
timestamp recorded by the frontend.

For the rest ... we want to capture (in histograms) precise
per-request timing intervals between certain events in the engine
core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them. Where we record these timestamped
events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization
timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
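
A rough sketch of the frontend-side bookkeeping the commit above describes; the event and field names and the shape of the per-request state are assumptions for illustration, not the real EngineCoreOutput or scheduler classes:

```python
# Rough sketch of frontend-side interval computation from engine-core
# timestamps (class and field names are assumptions, not the real code).
import enum
import time
from dataclasses import dataclass

from prometheus_client import Histogram


class EngineCoreEventType(enum.Enum):
    QUEUED = 1      # recorded in scheduler add_request()
    SCHEDULED = 2   # recorded in scheduler schedule()


@dataclass
class EngineCoreEvent:
    type: EngineCoreEventType
    timestamp: float


@dataclass
class RequestState:
    """Frontend-side bookkeeping for one request; arrival_time is the
    monotonic timestamp recorded by the frontend on arrival."""
    arrival_time: float
    queued_ts: float | None = None
    scheduled_ts: float | None = None
    first_token_ts: float | None = None


queue_time = Histogram("vllm:request_queue_time_seconds",
                       "Time spent waiting in the queue.")
prefill_time = Histogram("vllm:request_prefill_time_seconds",
                         "Time from scheduling to first token.")
decode_time = Histogram("vllm:request_decode_time_seconds",
                        "Time from first token to completion.")
inference_time = Histogram("vllm:request_inference_time_seconds",
                           "Time from scheduling to completion.")
e2e_latency = Histogram("vllm:e2e_request_latency_seconds",
                        "Time from arrival to completion.")


def observe_intervals(state: RequestState, events: list[EngineCoreEvent],
                      new_token_ts: float, finished: bool) -> None:
    """Called by the frontend for each per-request output it receives."""
    for ev in events:
        if ev.type == EngineCoreEventType.QUEUED:
            state.queued_ts = ev.timestamp
        elif ev.type == EngineCoreEventType.SCHEDULED:
            state.scheduled_ts = ev.timestamp
            if state.queued_ts is not None:
                # queue interval: queued -> scheduled
                queue_time.observe(ev.timestamp - state.queued_ts)

    if state.first_token_ts is None:
        state.first_token_ts = new_token_ts
        if state.scheduled_ts is not None:
            # prefill interval: scheduled -> first token
            prefill_time.observe(new_token_ts - state.scheduled_ts)

    if finished:
        # decode interval: first token -> finished
        decode_time.observe(new_token_ts - state.first_token_ts)
        if state.scheduled_ts is not None:
            # inference interval: scheduled -> finished
            inference_time.observe(new_token_ts - state.scheduled_ts)
        # e2e latency: relative to the arrival_time recorded by the frontend
        e2e_latency.observe(time.monotonic() - state.arrival_time)
```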