
Dual GPU performance regression After #4606 #5324

Closed

Ph0rk0z opened this issue Feb 4, 2024 · 6 comments

Ph0rk0z commented Feb 4, 2024

A while ago, on 2x3090 I would get 18.x tokens/s on 70B models. I didn't update for a bit and was dismayed to see performance dip to 15 t/s. I had some HW issues, so it took a while to figure out what was going on, but I narrowed it down to a commit between 7082d24 and f679349.

Reading through what happened in that week, the most likely culprits look to be 5bf3953 and dc68f00.

I can't check against the first one because it produced errors in multi-GPU, which the second commit fixed. If I run versions from before this range, my performance is back.
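
For anyone wanting to reproduce the narrowing, a minimal bisect sketch between those two commits (assuming a CUDA build via make; the model path, prompt, and build flags below are placeholders, not the exact setup used here):

```bash
# Bisect the regression between the known-fast and known-slow commits.
git bisect start
git bisect bad f679349     # slower build
git bisect good 7082d24    # faster build

# At each step, rebuild and time a fixed generation:
make clean && make LLAMA_CUBLAS=1 -j
./main -m /path/to/70b-q4_k_m.gguf -ngl 99 -n 200 -p "test prompt"

# Then mark the step and repeat until git reports the first bad commit:
git bisect good    # or: git bisect bad
```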

Linking the pulls: #4606 #4620

Loading a model across 3 GPUs, like miqu 5km, the regression is even bigger: from 15.5 t/s down to 11 t/s. Memory use is improved, though; I had to rearrange how I split the model.
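
For reference, a minimal sketch of how a 3-GPU split is expressed with the post-#4606 flags (the ratios, model filename, and prompt are placeholders, not the exact setup used here):

```bash
# Offload all layers and spread them over 3 GPUs with explicit proportions.
# --split-mode layer is the new default after #4606; --tensor-split sets the
# relative share per GPU (adjust the ratios to balance VRAM).
./main -m /path/to/miqu-q5_k_m.gguf -ngl 99 \
  --split-mode layer \
  --tensor-split 40,30,30 \
  -n 200 -p "test prompt"
```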

Some proof:

Pre:

llama_print_timings:        load time =     528.83 ms
llama_print_timings:      sample time =     112.26 ms /   200 runs   (    0.56 ms per token,  1781.55 tokens per second)
llama_print_timings: prompt eval time =     528.67 ms /    22 tokens (   24.03 ms per token,    41.61 tokens per second)
llama_print_timings:        eval time =   10762.82 ms /   199 runs   (   54.08 ms per token,    18.49 tokens per second)
llama_print_timings:       total time =   11874.81 ms
Output generated in 12.77 seconds (15.66 tokens/s, 200 tokens, context 22, seed 1952269572)

Post:

llama_print_timings:        load time =     495.04 ms
llama_print_timings:      sample time =     113.32 ms /   200 runs   (    0.57 ms per token,  1764.90 tokens per second)
llama_print_timings: prompt eval time =     494.91 ms /    22 tokens (   22.50 ms per token,    44.45 tokens per second)
llama_print_timings:        eval time =   12894.68 ms /   199 runs   (   64.80 ms per token,    15.43 tokens per second)
llama_print_timings:       total time =   14055.05 ms /   221 tokens
Output generated in 14.63 seconds (13.67 tokens/s, 200 tokens, context 22, seed 1842804206)
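
(For reference on reading these numbers: the llama_print_timings eval rate counts decode time only, while the "Output generated" line presumably also includes prompt processing, sampling, and webui overhead. A rough check of the arithmetic:)

```
pre:  199 runs / 10.763 s ≈ 18.49 t/s (eval only)   200 tokens / 12.77 s ≈ 15.66 t/s (end to end)
post: 199 runs / 12.895 s ≈ 15.43 t/s (eval only)   200 tokens / 14.63 s ≈ 13.67 t/s (end to end)
```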
Ph0rk0z changed the title from "Dual GPU performance regression from a while back." to "Dual GPU performance regression After #4606" on Feb 5, 2024
jukofyork (Contributor) commented:

Have you tried using the "row" split method instead of the new default "layer" split? I saw around a 1/3 reduction in tokens/s until I switched it back to rows (using 2x A6000 and an NVLink bridge).
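
A quick A/B sketch of the two modes, assuming the -sm/--split-mode flag introduced around #4606 (model path and prompt are placeholders):

```bash
# Same run twice, only the split mode changes.
./main -m /path/to/70b-q4_k_m.gguf -ngl 99 -sm layer -n 200 -p "test prompt"
./main -m /path/to/70b-q4_k_m.gguf -ngl 99 -sm row   -n 200 -p "test prompt"
```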


Ph0rk0z commented Feb 10, 2024

Yes. That's the fastest it goes now. I believe the version I'm comparing against is from before layer split was even implemented. Try the older version and compare speeds.


ccbadd commented Feb 10, 2024

I just booted up my AMD machine (2x MI100 and 2x W6800) this morning, updated, and did some testing. With Llama 2 70B I'm getting 5 t/s with the two W6800s, which is half of what I was getting a month ago when I last tested. I get ~7 t/s with the two MI100s, but I have to use row split or it's slower than the W6800s. If I try to use row split with the W6800s, it crashes as soon as I try to generate anything. We have indeed lost a lot of speed in the last month or so. The funny thing is the W6800s used to be faster than the MI100s.


Ph0rk0z commented Feb 11, 2024

At current git HEAD I got some speed back, up to ~17.4 t/s. Now layer splitting is faster than row. I got a new processor, a Xeon Gold 5120. Prompt processing is 299 t/s on the old version vs ~227 t/s with current git.
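
A quick way to compare prompt-processing and generation throughput between the two builds is llama-bench; a minimal sketch (the model path is a placeholder, and the -sm flag is assumed to be available in builds after #4606):

```bash
# Reports prompt-processing (pp) and token-generation (tg) rates per configuration.
./llama-bench -m /path/to/70b-q4_k_m.gguf -ngl 99 -p 512 -n 128 -sm layer
./llama-bench -m /path/to/70b-q4_k_m.gguf -ngl 99 -p 512 -n 128 -sm row
```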

It must be hitting AMD harder. You can play with the number of warps in #5394. I haven't been able to test with the P40s because my new board fails to boot when powering the GPUs, so I have to power them externally, and that can only sustain 4 of them. I assume P40s still do better with row.

I can freely reproduce the speeds by switching versions. The highest speed I have ever seen is 19 t/s on the 3090s and about 9 t/s on the P40s. I also noticed, watching the load, that the GPU usage % is lower.

github-actions bot commented:

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 2, 2024