Measure model memory usage #3120

Merged
7 commits merged into vllm-project:main on Mar 7, 2024

Conversation

@mgoin (Member) commented Feb 29, 2024

There is already an indirect measure of KV cache memory usage, via the number of cache blocks allocated, but no direct measure of how much memory the model weights use. This PR tries to add that by wrapping the model loading with torch.cuda.max_memory_allocated() calls. I'm not sure how this will behave on non-NVIDIA devices, so I'm happy to disable it in that case.

Exposes a new ModelRunner.model_memory_usage member variable
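
For illustration, here is a minimal sketch of the measurement approach (not the exact code in this PR; it assumes a CUDA device and a load_model callable that builds the model on the GPU):

import torch

def load_model_with_tracking(load_model):
    # Reset the peak counter so it only reflects this load, then record the
    # peak allocation before and after loading; the delta approximates the
    # memory taken by the model weights.
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.max_memory_allocated()
    model = load_model()
    torch.cuda.synchronize()
    after = torch.cuda.max_memory_allocated()
    model_memory_usage = after - before
    print(f"Loading model weights took {model_memory_usage / (1 << 30):.4f} GB")
    return model, model_memory_usage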

Example code:

from vllm import LLM
LLM("facebook/opt-125m")

Output:

INFO 02-29 19:21:40 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, sparsity=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 02-29 19:21:44 weight_utils.py:176] Using model weights format ['*.bin']
>> INFO 02-29 19:21:45 model_runner.py:90] Loading model weights took 244.61 MB
INFO 02-29 19:21:45 llm_engine.py:338] # GPU blocks: 76240, # CPU blocks: 7281

@mgoin marked this pull request as ready for review on February 29, 2024 at 19:25
@mgoin (Member, Author) commented Mar 4, 2024

Hey @simon-mo, what do you think about this?

@mgoin (Member, Author) commented Mar 6, 2024

@WoosukKwon @zhuohan123 what do you think?

@zhuohan123 (Member) left a comment

LGTM! Left some small comments

@esmeetu (Collaborator) commented Mar 7, 2024

Is this correct when tensor_parallel_size > 1?

@mgoin (Member, Author) commented Mar 7, 2024

Thanks for the reviews @zhuohan123 and @esmeetu. For TP>1, my assumption is that it still makes sense to report per-worker model memory usage rather than trying to compute the total model memory and pipe it to every worker.

To be explicit, here is the output of running with TP=1 and TP=2:

TP=1

> CUDA_VISIBLE_DEVICES=7 python -c 'from vllm import LLM;LLM("facebook/opt-125m", tensor_parallel_size=1)'
INFO 03-07 15:15:10 llm_engine.py:88] Initializing an LLM engine (v0.3.3) with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-07 15:15:14 model_runner.py:96] Loading model weights took 0.2389 GB

TP=2

> CUDA_VISIBLE_DEVICES=6,7 python -c 'from vllm import LLM;LLM("facebook/opt-125m", tensor_parallel_size=2)'
2024-03-07 15:15:34,427 INFO worker.py:1724 -- Started a local Ray instance.
INFO 03-07 15:15:36 llm_engine.py:88] Initializing an LLM engine (v0.3.3) with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-07 15:15:47 model_runner.py:96] Loading model weights took 0.1189 GB
(RayWorkerVllm pid=798461) INFO 03-07 15:15:48 model_runner.py:96] Loading model weights took 0.1189 GB

Here you can see that the model weights are split roughly evenly between the workers.
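
As a rough back-of-the-envelope check (mine, not from the PR), the numbers line up with float16 weights:

# opt-125m has roughly 125M parameters at 2 bytes each in float16
total_gb = 125e6 * 2 / 1024**3   # ~0.23 GB, close to the 0.2389 GB reported with TP=1
per_worker_gb = total_gb / 2     # ~0.12 GB, roughly the 0.1189 GB reported by each TP=2 worker
print(f"{total_gb:.4f} GB total, {per_worker_gb:.4f} GB per worker")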

@zhuohan123 merged commit 385da2d into vllm-project:main on Mar 7, 2024
22 checks passed
AdrianAbeyta pushed a commit to AdrianAbeyta/vllm that referenced this pull request Mar 8, 2024
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024