
Dual GPU performance regression After #4606 #5324

Closed

Ph0rk0z opened this issue Feb 4, 2024 · 6 comments

Ph0rk0z commented Feb 4, 2024

A while ago, on 2x3090 I would get 18.x tokens/s on 70B models. I didn't update for a bit and was dismayed to see performance dip to 15 t/s. I had some HW issues, so it took a while to figure out what was going on, but I narrowed it down to a commit between 7082d24 and f679349.

Reading through what happened in that week, the most likely culprits look to be 5bf3953 and dc68f00.

I can't check against the first one because it produced errors in multi-GPU, which the second commit fixed. If I run versions from before this range, my performance is back.
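
For anyone wanting to reproduce the narrowing, a minimal bisect sketch between those two commits (assuming a CUDA build via make; the model path, prompt, and build flags below are placeholders, not the exact setup used here):

```bash
# Bisect the regression between the known-fast and known-slow commits.
git bisect start
git bisect bad f679349     # slower build
git bisect good 7082d24    # faster build

# At each step, rebuild and time a fixed generation:
make clean && make LLAMA_CUBLAS=1 -j
./main -m /path/to/70b-q4_k_m.gguf -ngl 99 -n 200 -p "test prompt"

# Then mark the step and repeat until git reports the first bad commit:
git bisect good    # or: git bisect bad
```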

Linking the pulls: #4606 #4620

Loading a model across 3 GPUs, like miqu 5km, the regression is even bigger: from 15.5 t/s down to 11 t/s. Memory use is improved, though; I had to rearrange how I split the model.
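
For reference, a minimal sketch of how a 3-GPU split is expressed with the post-#4606 flags (the ratios, model filename, and prompt are placeholders, not the exact setup used here):

```bash
# Offload all layers and spread them over 3 GPUs with explicit proportions.
# --split-mode layer is the new default after #4606; --tensor-split sets the
# relative share per GPU (adjust the ratios to balance VRAM).
./main -m /path/to/miqu-q5_k_m.gguf -ngl 99 \
  --split-mode layer \
  --tensor-split 40,30,30 \
  -n 200 -p "test prompt"
```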

Some proof:

Pre:

llama_print_timings:        load time =     528.83 ms
llama_print_timings:      sample time =     112.26 ms /   200 runs   (    0.56 ms per token,  1781.55 tokens per second)
llama_print_timings: prompt eval time =     528.67 ms /    22 tokens (   24.03 ms per token,    41.61 tokens per second)
llama_print_timings:        eval time =   10762.82 ms /   199 runs   (   54.08 ms per token,    18.49 tokens per second)
llama_print_timings:       total time =   11874.81 ms
Output generated in 12.77 seconds (15.66 tokens/s, 200 tokens, context 22, seed 1952269572)

Post:

llama_print_timings:        load time =     495.04 ms
llama_print_timings:      sample time =     113.32 ms /   200 runs   (    0.57 ms per token,  1764.90 tokens per second)
llama_print_timings: prompt eval time =     494.91 ms /    22 tokens (   22.50 ms per token,    44.45 tokens per second)
llama_print_timings:        eval time =   12894.68 ms /   199 runs   (   64.80 ms per token,    15.43 tokens per second)
llama_print_timings:       total time =   14055.05 ms /   221 tokens
Output generated in 14.63 seconds (13.67 tokens/s, 200 tokens, context 22, seed 1842804206)
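
(For reference on reading these numbers: the llama_print_timings eval rate counts decode time only, while the "Output generated" line presumably also includes prompt processing, sampling, and webui overhead. A rough check of the arithmetic:)

```
pre:  199 runs / 10.763 s ≈ 18.49 t/s (eval only)   200 tokens / 12.77 s ≈ 15.66 t/s (end to end)
post: 199 runs / 12.895 s ≈ 15.43 t/s (eval only)   200 tokens / 14.63 s ≈ 13.67 t/s (end to end)
```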
Ph0rk0z changed the title from "Dual GPU performance regression from a while back." to "Dual GPU performance regression After #4606" on Feb 5, 2024
jukofyork (Contributor) commented:

Have you tried using the "row" split method instead of the new default "layer" split? I saw around a 1/3 reduction in tokens/s until I switched it back to rows (using 2x A6000 and an NVLink bridge).
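
A quick A/B sketch of the two modes, assuming the -sm/--split-mode flag introduced around #4606 (model path and prompt are placeholders):

```bash
# Same run twice, only the split mode changes.
./main -m /path/to/70b-q4_k_m.gguf -ngl 99 -sm layer -n 200 -p "test prompt"
./main -m /path/to/70b-q4_k_m.gguf -ngl 99 -sm row   -n 200 -p "test prompt"
```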


Ph0rk0z commented Feb 10, 2024

Yes. That's the fastest it goes now. I believe the version I'm comparing against is from before layer split was even implemented. Try the older version and compare speeds.


ccbadd commented Feb 10, 2024

I just booted up my AMD machine (2x MI100 and 2x W6800) this morning, updated, and did some testing. With Llama 2 70B I'm getting 5 t/s with the two W6800s, which is half of what I was getting a month ago when I last tested. I get ~7 t/s with the two MI100s, but I have to use row split or it's slower than the W6800s. If I try to use row split with the W6800s, it crashes as soon as I try to generate anything. We have indeed lost a lot of speed in the last month or so. The funny thing is the W6800s used to be faster than the MI100s.


Ph0rk0z commented Feb 11, 2024

At current git HEAD I got some speed back, up to ~17.4 t/s. Now layer splitting is faster than row. I got a new processor, a Xeon Gold 5120. Prompt processing is 299 t/s on the old version vs ~227 t/s with current git.
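
A quick way to compare prompt-processing and generation throughput between the two builds is llama-bench; a minimal sketch (the model path is a placeholder, and the -sm flag is assumed to be available in builds after #4606):

```bash
# Reports prompt-processing (pp) and token-generation (tg) rates per configuration.
./llama-bench -m /path/to/70b-q4_k_m.gguf -ngl 99 -p 512 -n 128 -sm layer
./llama-bench -m /path/to/70b-q4_k_m.gguf -ngl 99 -p 512 -n 128 -sm row
```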

It must be hitting AMD harder. You can play with the number of warps in #5394. I haven't been able to test with the P40s because my new board fails to boot when powering the GPUs, so I have to power them externally, and that can only sustain 4 of them. I assume P40s still do better with row.

I can freely reproduce the speeds by switching versions. The highest speed I have ever seen is 19 t/s on the 3090s and about 9 t/s on the P40s. I also noticed, watching the load, that the GPU usage % is lower.

github-actions bot commented:

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 2, 2024