Dual GPU performance regression after #4606 (#5324)
Comments
Have you tried using the "row" split method instead of the new default "layer" split? I saw around a 1/3 reduction in tokens/s until I switched back to rows (using 2x A6000 with an NVLink bridge).
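For reference, switching the split mode on the command line looks roughly like this; this sketch assumes the `main` binary and the `-sm`/`--split-mode` flag from builds of that period, and the model path is just a placeholder:

```sh
# New default: split the model by layers across GPUs
./main -m ./models/llama-2-70b.Q4_K_M.gguf -ngl 99 -sm layer -p "Hello" -n 128

# Older-style split: distribute each tensor by rows across GPUs
./main -m ./models/llama-2-70b.Q4_K_M.gguf -ngl 99 -sm row -p "Hello" -n 128
```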
Yes. That is the fastest it goes now, and I believe it matches the speed from before the layer split was even implemented. Try an older version and compare speeds.
I just booted up my AMD machine (2x MI100 and 2x W6800) this morning, updated, and did some testing. With Llama 2 70B I'm getting 5 t/s on the two W6800s, which is half of what I was getting a month ago when I last tested. I get ~7 t/s on the two MI100s, but only with the row split; otherwise they're slower than the W6800s. If I try to use the row split on the W6800s, it crashes as soon as I try to generate anything. We have indeed lost a lot of speed in the last month or so. The funny thing is the W6800s used to be faster than the MI100s.
At current git HEAD I got some speed back, up to ~17.4 t/s, and layer splitting is now faster than row. I also got a new processor, a Xeon Gold 5120. Prompt processing between the two versions is 299 t/s vs ~227 t/s on current git. It must be hitting AMD harder. You can play with the number of warps in #5394. I haven't been able to test the P40s because my new board fails to boot when powering the GPUs, so I have to power them externally, and that setup can only sustain four of them. I assume the P40s still do better with row split. I can reliably reproduce the speeds by switching versions. The highest speed I have ever seen is 19 t/s on the 3090s and about 9 t/s on the P40s. I also noticed, watching the load, that the GPU usage % is lower now.
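A rough way to compare the two split modes on a single build; this assumes the `llama-bench` tool supports the same `-sm` flag in these versions (the model path is illustrative):

```sh
# -p benchmarks prompt processing, -n benchmarks token generation
./llama-bench -m ./models/llama-2-70b.Q4_K_M.gguf -sm layer -p 512 -n 128
./llama-bench -m ./models/llama-2-70b.Q4_K_M.gguf -sm row   -p 512 -n 128
```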
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
A while ago, on 2x 3090s I would get 18.x tokens/s on 70B models. I didn't update for a bit and was dismayed to see performance dip down to 15 t/s. I had some hardware issues, so it took a while to figure out what was going on, but I narrowed it down to a commit between:
7082d24 and f679349
Reading through what happened in that week, the most likely culprits look to be 5bf3953 and dc68f00
I can't check against the first one in isolation because it produced errors on multi-GPU setups, which the second commit fixed. If I run versions from before both commits, my performance is back.
Linking the pulls: #4606 #4620
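For anyone trying to reproduce this, a minimal `git bisect` sketch over the commit range above; the build flags and benchmark step are placeholders that depend on your setup:

```sh
git bisect start
git bisect bad  f679349   # slow build
git bisect good 7082d24   # known-fast build

# At each step bisect checks out: rebuild, measure tokens/s, then mark it.
make clean && make LLAMA_CUBLAS=1 -j
# ...run your usual benchmark, then:
git bisect good   # or `git bisect bad`, depending on the measured speed
```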
Loading a model across 3 GPUs, like miqu Q5_K_M, the regression is even bigger: from 15.5 t/s down to 11 t/s. Memory use is improved, though; I had to rearrange how I split the model.
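Re-arranging a split across 3 GPUs looks roughly like this; this sketch assumes the `-ts`/`--tensor-split` flag, an illustrative 5:4:4 ratio, and a placeholder model path:

```sh
# Put proportionally more of the model on the first GPU
./main -m ./models/miqu-1-70b.Q5_K_M.gguf -ngl 99 -sm layer -ts 5,4,4 -p "Hello" -n 128
```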
Some proof (benchmark screenshots attached in the original issue):
Pre: [benchmark screenshot]
Post: [benchmark screenshot]