[Bugfix] Fix vllm metrics disappeared when --engine-use-ray is enabled #3938

Closed
AllenDou wants to merge 1 commit

Conversation

AllenDou
Contributor

@AllenDou commented Apr 9, 2024

Before:
When a user runs python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --served-model-name modelx --worker-use-ray --engine-use-ray, the --engine-use-ray flag moves the AsyncLLMEngine into a separate Ray actor process, so prometheus_client's REGISTRY cannot be shared between uvicorn.run and the engine process. As a result, vLLM's metrics are missing from the /metrics output, as shown below (a minimal sketch reproducing the root cause follows the output).

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 6290.0
python_gc_objects_collected_total{generation="1"} 8336.0
python_gc_objects_collected_total{generation="2"} 4726.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 826.0
python_gc_collections_total{generation="1"} 75.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.098353664e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.31774976e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71188972784e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.27
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 44.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
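
A minimal, hypothetical sketch of the root cause (demo names, not vLLM code): a Prometheus collector registered inside a Ray actor lives in that actor process's REGISTRY, so scraping the driver's REGISTRY never sees it.

import ray
from prometheus_client import REGISTRY, Counter, generate_latest

@ray.remote
class Engine:
    def __init__(self):
        # Registered in the actor's process-local REGISTRY,
        # not in the API server's.
        self.requests = Counter("demo_requests", "Requests seen by the engine")

    def step(self):
        self.requests.inc()

ray.init()
engine = Engine.remote()
ray.get(engine.step.remote())

# The driver's REGISTRY knows nothing about the actor's counter.
print(b"demo_requests" in generate_latest(REGISTRY))  # prints: False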

After:
This patch fetches vLLM's metrics from the engine's Ray actor through ray.remote/ray.get. The following invocations were tested:

python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --served-model-name modelx --worker-use-ray --engine-use-ray

python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --served-model-name modelx --worker-use-ray

python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --served-model-name modelx

All three worked; the vLLM metrics now appear in the output below (a minimal sketch of the approach follows the output).

# curl http://127.0.0.1:8000/metrics
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7836.0
python_gc_objects_collected_total{generation="1"} 10872.0
python_gc_objects_collected_total{generation="2"} 3383.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 1241.0
python_gc_collections_total{generation="1"} 111.0
python_gc_collections_total{generation="2"} 43.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.932583936e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.01712128e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71266241016e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 10.45
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 64.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65535.0
# HELP vllm:cache_config_info information of cache_config
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",enable_prefix_caching="False",forced_num_gpu_blocks="None",gpu_memory_utilization="0.9",num_cpu_blocks="7281",num_gpu_blocks="34090",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
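
For context, here is a minimal sketch of the approach (hypothetical names; the real change is in the diff): the engine actor serializes its process-local registry on request, and the API server's /metrics handler pulls it with ray.get and appends it to its own output.

import ray
from prometheus_client import REGISTRY, generate_latest

@ray.remote
class EngineActor:
    # Stand-in for the AsyncLLMEngine Ray actor.
    def scrape_metrics(self) -> bytes:
        # Serialize this process's registry in Prometheus text format.
        return generate_latest(REGISTRY)

ray.init()
engine = EngineActor.remote()

# Inside the API server's /metrics handler: combine the server-local
# metrics with those fetched from the engine actor.
payload = generate_latest(REGISTRY) + ray.get(engine.scrape_metrics.remote())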

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class, etc.).
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behavior of vLLM. This helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@AllenDou
Contributor Author

AllenDou commented Apr 9, 2024

@grandiose-pizza FYI.

@AllenDou changed the title from "bugfix, when --engine-use-ray is enabled, vllm metrics disappeared wh…" to "[Bugfix] vllm metrics disappeared when --engine-use-ray is enabled" on Apr 9, 2024
@grandiose-pizza
Contributor

Hi @AllenDou ,
Thanks for fixing this. Appreciate it.

@AllenDou changed the title from "[Bugfix] vllm metrics disappeared when --engine-use-ray is enabled" to "[Bugfix] Fix vllm metrics disappeared when --engine-use-ray is enabled" on Apr 10, 2024
@AllenDou
Contributor Author

Refactored again, @simon-mo PTAL.

@AllenDou closed this Apr 11, 2024