This repo contains code to run vLLM inference with the Llama 3.1 70B model on both the GH200 and the H100.
Install PyTorch and vLLM as normal.
- GH200: `python bench.py --cpu-offload-gb 60 --max-model-len 4096 --num-tokens 1024`
- H100: `python bench.py --cpu-offload-gb 75 --max-model-len 4096 --num-tokens 1024`
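For reference, here is a minimal sketch of what a `bench.py` like this might look like with the vLLM Python API. The actual script in this repo may differ; the model ID, prompt, and timing logic below are placeholders.

```python
import argparse
import time

from vllm import LLM, SamplingParams

# Parse the same flags used in the commands above.
parser = argparse.ArgumentParser()
parser.add_argument("--cpu-offload-gb", type=float, required=True)
parser.add_argument("--max-model-len", type=int, default=4096)
parser.add_argument("--num-tokens", type=int, default=1024)
args = parser.parse_args()

# Offload part of the weights to CPU memory so the 70B model fits on one GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed model ID
    cpu_offload_gb=args.cpu_offload_gb,
    max_model_len=args.max_model_len,
)

# Force a fixed-length generation so the tokens/s number is comparable.
sampling = SamplingParams(max_tokens=args.num_tokens, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(["Write a long story about a benchmark."], sampling)
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.2f} tokens/s")
```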
Results:
| GPU | Tokens/s |
|---|---|
| H100 | 0.57 |
| GH200 | 4.33 |
Neither GPU can fit the 70B model entirely in GPU memory, so we have to rely on CPU offloading (via `--cpu-offload-gb`). The GH200 wins for two reasons: it has somewhat more GPU memory, so less of the model needs to be offloaded, and its CPU<>GPU transfer bandwidth is much higher, so streaming the offloaded weights back to the GPU costs far less per token.
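A back-of-the-envelope estimate shows why the interconnect dominates: with offloading, each decode step has to stream the offloaded portion of the weights from CPU to GPU, so token throughput is roughly bounded by interconnect bandwidth divided by offloaded bytes. The bandwidth figures below are nominal per-direction numbers (PCIe Gen5 x16 for the H100, NVLink-C2C for the GH200), not measurements from this repo.

```python
# Rough upper bound on tokens/s when decoding is limited by streaming the
# offloaded weights from CPU to GPU once per generated token.
GIB = 1024**3

configs = {
    # name: (offloaded weight bytes, nominal CPU->GPU bandwidth in bytes/s)
    "H100 (PCIe Gen5 x16, ~64 GB/s)": (75 * GIB, 64e9),
    "GH200 (NVLink-C2C, ~450 GB/s)": (60 * GIB, 450e9),
}

for name, (offloaded_bytes, bandwidth) in configs.items():
    seconds_per_token = offloaded_bytes / bandwidth
    print(f"{name}: <= {1 / seconds_per_token:.1f} tokens/s")
```

Measured throughput comes in below these bounds (kernel launches, KV-cache work, and less-than-peak transfer efficiency all cost time), but the ratio between the two rows tracks the results table above.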