[Feature][v1]: Add metrics support #10582
Opening this issue to track and collaborate, in case someone else is already looking into this.

Prototype in #10651
Part of vllm-project#10582.

Implement the `vllm:num_requests_running` and `vllm:num_requests_waiting` gauges from V0. This is a simple starting point from which to iterate towards parity with V0.

There's no need to use prometheus_client's "multi-processing mode" (at least at this stage) because these metrics all exist within the API server process.

Note this restores the following metrics, which were lost when we started using multi-processing mode:

- python_gc_objects_collected_total
- python_gc_objects_uncollectable_total
- python_gc_collections_total
- python_info
- process_virtual_memory_bytes
- process_resident_memory_bytes
- process_start_time_seconds
- process_cpu_seconds_total
- process_open_fds
- process_max_fds

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
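As a rough illustration of the approach described above, here is a minimal sketch of how two such gauges could be defined and updated with prometheus_client inside the API server process. The metric names match the PR description; the `record_scheduler_stats` helper and its call site are illustrative only, not the actual vLLM code.

```python
# Minimal sketch: per-process gauges via prometheus_client, no
# multi-processing mode needed since everything lives in the API
# server process (this also keeps the default python_gc_* and
# process_* collectors from the default registry).
from prometheus_client import Gauge, generate_latest

gauge_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running on GPU.")
gauge_waiting = Gauge(
    "vllm:num_requests_waiting",
    "Number of requests waiting to be processed.")

def record_scheduler_stats(num_running: int, num_waiting: int) -> None:
    # Gauges are set to the latest snapshot value on each update.
    gauge_running.set(num_running)
    gauge_waiting.set(num_waiting)

record_scheduler_stats(3, 7)
print(generate_latest().decode())
```

Colons are valid in Prometheus metric names, so `vllm:`-prefixed names work directly with prometheus_client's default validation.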
I thought it was about time to update on the latest status of this and note some TODOs.

Status

The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint. The following PRs should merge soon:
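For context on what a Prometheus-compatible `/metrics` endpoint involves, here is a small sketch using prometheus_client's ASGI support. The frontend API server is an ASGI (FastAPI) application, so the metrics app can be mounted as a sub-application; the mount call is shown as a comment because it assumes a FastAPI `app` object not defined here.

```python
# Sketch: serve the default prometheus_client registry over ASGI.
from prometheus_client import make_asgi_app

metrics_app = make_asgi_app()  # an ASGI callable serving the default registry

# In the API server (illustrative, assumes a FastAPI/Starlette `app`):
#   app.mount("/metrics", metrics_app)
```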
Once merged, we will support the following metrics:
Also, note that these are most of the metrics used by the example Grafana dashboard, with the exception of:

Additionally, these metrics are supported by v0 but not yet by v1:
Next Steps
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time timestamp recorded by the frontend. For the rest, we want to capture (in histograms) precise per-request timing intervals between certain events in the engine core:

```
<< queued timestamp >>
  [ queue interval ]
<< scheduled timestamp >>
  [ prefill interval ]
<< new token timestamp (FIRST) >>
  [ inter-token interval ]
<< new token timestamp >>
  [ decode interval (relative to first token time) ]
  [ inference interval (relative to scheduled time) ]
<< new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the engine core freed up as much as possible. We need to calculate these intervals based on timestamps recorded by the engine core. The engine core will include these timestamps in EngineCoreOutput (per request) as a sequence of timestamped events, and the frontend will calculate intervals and log them.

Where we record these timestamped events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
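The interval arithmetic described above can be sketched as follows: the engine core records timestamps for the QUEUED / SCHEDULED / NEW_TOKENS events, and the frontend derives the histogram observations from them. The dataclass and field names here are illustrative, not the actual vLLM types.

```python
# Sketch of frontend-side interval calculation from engine-core
# timestamps. Field names are hypothetical stand-ins for the
# timestamped events carried in EngineCoreOutput.
from dataclasses import dataclass

@dataclass
class RequestTimestamps:
    queued: float       # QUEUED: scheduler add_request()
    scheduled: float    # SCHEDULED: scheduler schedule()
    first_token: float  # NEW_TOKENS (FIRST)
    finished: float     # NEW_TOKENS (FINISHED)

def compute_intervals(ts: RequestTimestamps) -> dict:
    return {
        "queue": ts.scheduled - ts.queued,
        "prefill": ts.first_token - ts.scheduled,
        "decode": ts.finished - ts.first_token,   # relative to first token
        "inference": ts.finished - ts.scheduled,  # relative to scheduled
    }

intervals = compute_intervals(
    RequestTimestamps(queued=0.0, scheduled=0.5, first_token=1.5, finished=4.0))
print(intervals)
# {'queue': 0.5, 'prefill': 1.0, 'decode': 2.5, 'inference': 3.5}
```

In the real design each interval would be observed into the corresponding `vllm:request_*_time_seconds` histogram in the frontend process, keeping this bookkeeping off the engine core's critical path.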
🚀 The feature, motivation and pitch
We should also aim for feature parity on metrics, covering most of the available stats if possible. At a high level:
Alternatives
No response
Additional context
No response