Measure model memory usage #3120
Conversation
Hey @simon-mo, what do you think about this?
@WoosukKwon @zhuohan123 what do you think?
LGTM! Left some small comments
Is that right with
Thanks for the reviews @zhuohan123 and @esmeetu. For TP>1, my assumption is that it still makes sense to report the per-worker model memory usage rather than trying to figure out and pipe the whole-model memory usage to all workers. To be explicit, here is the output of running with TP=1 and TP=2:

TP=1

TP=2

Here you see that the model weights look evenly split between runners.
There is already a measure for KV cache block memory usage, indirectly through how many blocks were allocated, but no direct measure of how much memory the model weights are using. This PR tries to add that by wrapping the model loading with `torch.cuda.max_memory_allocated()` calls. I'm not sure how this will work with non-NVIDIA devices, so I'm happy to disable this in that case.
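The approach described above can be illustrated with a small standalone sketch. This is not the PR's actual code; the `nn.Linear` module simply stands in for vLLM's model loading step:

```python
import torch
import torch.nn as nn

# Record the peak allocation before loading the weights.
torch.cuda.reset_peak_memory_stats()
start = torch.cuda.max_memory_allocated()

# Stand-in for the real model loading; any module moved to the GPU works.
model = nn.Linear(4096, 4096).cuda()

torch.cuda.synchronize()
# The increase in peak allocation approximates the weight memory.
model_memory_usage = torch.cuda.max_memory_allocated() - start
print(f"Model weights use ~{model_memory_usage / 1024**2:.1f} MiB")
```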
Exposes a new `ModelRunner.model_memory_usage` member variable.

Example code:
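The original example code is not preserved here. The following is a hedged sketch of how the new attribute might be inspected: the `LLM` construction is ordinary vLLM usage, but the attribute access path through the engine internals is an assumption and may differ between versions.

```python
from vllm import LLM

# Any small model works for a quick check; the model name is illustrative.
llm = LLM(model="facebook/opt-125m")

# Assumed access path to the worker's ModelRunner; internal layout may vary
# across vLLM versions, so treat this as a sketch rather than a stable API.
model_runner = llm.llm_engine.driver_worker.model_runner
print(f"Model weights use {model_runner.model_memory_usage / 1024**3:.2f} GiB")
```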
Output: