[Feature][v1]: Add metrics support #10582

rickyyx opened this issue Nov 22, 2024 · 3 comments

rickyyx commented Nov 22, 2024

🚀 The feature, motivation and pitch

We should also reach feature parity on metrics, covering most of the available stats where possible. At a high level:

  1. [P0] Support system and request stats logging
  2. [P0] Support metric export to Prometheus.
  3. [P1] Support or deprecate all metrics from V0
  4. [P1] Allow users to define their own Prometheus client and other arbitrary loggers (see the sketch after this list).
  5. [P2] Make it work with tracing too (there are some request-level stats that tracing needs, like queue time and TTFT). It should be possible to surface these request-level metrics in v1 as well.
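
For item 4, here is a rough sketch of what a pluggable stats logger could look like; the `StatLogger` protocol, the `SchedulerStats` fields, and the `LoggingStatLogger` name below are illustrative assumptions, not the actual vLLM interface:

```python
# Hypothetical sketch of a pluggable stats logger (not the actual vLLM API).
# The engine loop would call log() on every registered logger each iteration.
import logging
from dataclasses import dataclass
from typing import Protocol

logger = logging.getLogger("vllm.metrics")


@dataclass
class SchedulerStats:
    """Illustrative per-iteration snapshot of system state."""
    num_running: int
    num_waiting: int
    gpu_cache_usage: float


class StatLogger(Protocol):
    """Anything implementing log() could be registered by the user."""
    def log(self, stats: SchedulerStats) -> None: ...


class LoggingStatLogger:
    """A plain-text logger, the kind of 'arbitrary logger' item 4 refers to."""
    def log(self, stats: SchedulerStats) -> None:
        logger.info("Running: %d reqs, Waiting: %d reqs, GPU KV cache: %.1f%%",
                    stats.num_running, stats.num_waiting,
                    stats.gpu_cache_usage * 100)
```

A Prometheus-backed logger would implement the same protocol and export to a `/metrics` endpoint instead of writing log lines.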

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

rickyyx commented Nov 22, 2024

Opening the issue to track and collaborate, in case someone else is already looking into this.

rickyyx commented Nov 26, 2024

Prototype in #10651

markmc added a commit to markmc/vllm that referenced this issue Jan 24, 2025
Part of vllm-project#10582

Implement the vllm:num_requests_running and vllm:num_requests_waiting
gauges from V0. This is a simple starting point from which to iterate
towards parity with V0.

There's no need to use prometheus_client's "multi-processing mode"
(at least at this stage) because these metrics all exist within the
API server process.

Note this restores the following metrics - these were lost when we
started using multi-processing mode:

- python_gc_objects_collected_total
- python_gc_objects_uncollectable_total
- python_gc_collections_total
- python_info
- process_virtual_memory_bytes
- process_resident_memory_bytes
- process_start_time_seconds
- process_cpu_seconds_total
- process_open_fds
- process_max_fds

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
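
As a companion to the commit above, a minimal sketch of how the two gauges can live in the API server process and be exposed without prometheus_client's multi-processing mode; the FastAPI wiring shown is illustrative, not the actual vLLM server code:

```python
# Minimal sketch: register gauges in the API server process and expose the
# default registry, so no prometheus_client multiprocess mode is needed and
# the built-in python_* / process_* collectors listed above remain available.
from fastapi import FastAPI
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()

num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running.")
num_requests_waiting = Gauge(
    "vllm:num_requests_waiting",
    "Number of requests waiting to be scheduled.")

# Mount the Prometheus ASGI app so GET /metrics serves the scrape output.
app.mount("/metrics", make_asgi_app())
```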

markmc commented Feb 4, 2025

I thought it was about time to give an update on the latest status of this and note some TODOs.

Status

The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint.
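
For a quick client-side check of that endpoint, something like the following works (the host and port are assumptions, not part of this issue):

```python
# Scrape and parse the /metrics endpoint; localhost:8000 is an assumption.
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

text = urlopen("http://localhost:8000/metrics").read().decode()
for family in text_string_to_metric_families(text):
    if family.name.startswith("vllm"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```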

Several follow-up PRs should merge soon, which will mean we support the following metrics:

  • vllm:num_requests_running (Gauge)
  • vllm:num_requests_waiting (Gauge)
  • vllm:gpu_cache_usage_perc (Gauge)
  • vllm:prompt_tokens_total (Counter)
  • vllm:generation_tokens_total (Counter)
  • vllm:request_success_total (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:request_generation_tokens (Histogram)
  • vllm:time_to_first_token_seconds (Histogram)
  • vllm:time_per_output_token_seconds (Histogram)
  • vllm:e2e_request_latency_seconds (Histogram)
  • vllm:request_queue_time_seconds (Histogram)
  • vllm:request_inference_time_seconds (Histogram)
  • vllm:request_prefill_time_seconds (Histogram)
  • vllm:request_decode_time_seconds (Histogram)

Also, note that vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits (Counters) replace vllm:gpu_prefix_cache_hit_rate (Gauge).
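
The counter pair is preferable because the hit rate can then be derived downstream over any time window (e.g. a ratio of rates in PromQL) rather than being fixed by the server. A sketch of how the pair might be recorded; the hook name and arguments are illustrative:

```python
# Illustrative only: two monotonic counters from which the hit rate is
# derived downstream, e.g. in PromQL (names per the list above):
#   rate(vllm:gpu_prefix_cache_hits[5m]) / rate(vllm:gpu_prefix_cache_queries[5m])
from prometheus_client import Counter

gpu_prefix_cache_queries = Counter(
    "vllm:gpu_prefix_cache_queries",
    "Number of prefix cache blocks queried.")
gpu_prefix_cache_hits = Counter(
    "vllm:gpu_prefix_cache_hits",
    "Number of prefix cache blocks that hit.")


def record_prefix_cache_lookup(num_queried: int, num_hits: int) -> None:
    """Hypothetical hook called after each prefix cache lookup."""
    gpu_prefix_cache_queries.inc(num_queried)
    gpu_prefix_cache_hits.inc(num_hits)
```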

These are most of the metrics used by the example Grafana dashboard, with the exception of:

  • vllm:num_requests_swapped (Gauge)
  • vllm:cpu_cache_usage_perc (Gauge)
  • vllm:request_max_num_generation_tokens (Histogram)

Additionally, these are other metrics supported by v0, but not yet by v1:

  • vllm:num_preemptions_total (Counter)
  • vllm:cache_config_info (Gauge)
  • vllm:lora_requests_info (Gauge)
  • vllm:cpu_prefix_cache_hit_rate (Gauge)
  • vllm:tokens_total (Counter)
  • vllm:iteration_tokens_total (Histogram)
  • vllm:time_in_queue_requests (Histogram)
  • vllm:model_forward_time_milliseconds (Histogram)
  • vllm:model_execute_time_milliseconds (Histogram)
  • vllm:request_params_n (Histogram)
  • vllm:request_params_max_tokens (Histogram)
  • vllm:spec_decode_draft_acceptance_rate (Gauge)
  • vllm:spec_decode_efficiency (Gauge)
  • vllm:spec_decode_num_accepted_tokens_total (Counter)
  • vllm:spec_decode_num_draft_tokens_total (Counter)
  • vllm:spec_decode_num_emitted_tokens_total (Counter)

Next Steps

markmc added a commit to markmc/vllm that referenced this issue Feb 7, 2025
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time
timestamp recorded by the frontend.

For the rest ... we want to capture (in histograms) precise
per-request timing intervals between certain events in the engine
core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them. Where we record these timestamped
events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization
timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
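
A rough sketch of the frontend-side bookkeeping the commit above describes; the event and field names and the shape of the per-request state are assumptions for illustration, not the real EngineCoreOutput or scheduler classes:

```python
# Rough sketch of frontend-side interval computation from engine-core
# timestamps (class and field names are assumptions, not the real code).
import enum
import time
from dataclasses import dataclass

from prometheus_client import Histogram


class EngineCoreEventType(enum.Enum):
    QUEUED = 1      # recorded in scheduler add_request()
    SCHEDULED = 2   # recorded in scheduler schedule()


@dataclass
class EngineCoreEvent:
    type: EngineCoreEventType
    timestamp: float


@dataclass
class RequestState:
    """Frontend-side bookkeeping for one request; arrival_time is the
    monotonic timestamp recorded by the frontend on arrival."""
    arrival_time: float
    queued_ts: float | None = None
    scheduled_ts: float | None = None
    first_token_ts: float | None = None


queue_time = Histogram("vllm:request_queue_time_seconds",
                       "Time spent waiting in the queue.")
prefill_time = Histogram("vllm:request_prefill_time_seconds",
                         "Time from scheduling to first token.")
decode_time = Histogram("vllm:request_decode_time_seconds",
                        "Time from first token to completion.")
inference_time = Histogram("vllm:request_inference_time_seconds",
                           "Time from scheduling to completion.")
e2e_latency = Histogram("vllm:e2e_request_latency_seconds",
                        "Time from arrival to completion.")


def observe_intervals(state: RequestState, events: list[EngineCoreEvent],
                      new_token_ts: float, finished: bool) -> None:
    """Called by the frontend for each per-request output it receives."""
    for ev in events:
        if ev.type == EngineCoreEventType.QUEUED:
            state.queued_ts = ev.timestamp
        elif ev.type == EngineCoreEventType.SCHEDULED:
            state.scheduled_ts = ev.timestamp
            if state.queued_ts is not None:
                # queue interval: queued -> scheduled
                queue_time.observe(ev.timestamp - state.queued_ts)

    if state.first_token_ts is None:
        state.first_token_ts = new_token_ts
        if state.scheduled_ts is not None:
            # prefill interval: scheduled -> first token
            prefill_time.observe(new_token_ts - state.scheduled_ts)

    if finished:
        # decode interval: first token -> finished
        decode_time.observe(new_token_ts - state.first_token_ts)
        if state.scheduled_ts is not None:
            # inference interval: scheduled -> finished
            inference_time.observe(new_token_ts - state.scheduled_ts)
        # e2e latency: relative to the arrival_time recorded by the frontend
        e2e_latency.observe(time.monotonic() - state.arrival_time)
```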