How's the inference speed and mem usage? #39

Closed
lucasjinreal opened this issue Mar 12, 2023 · 14 comments
Labels
performance Speed related topics

Comments

@lucasjinreal

How's the inference speed and mem usage?

@factfictional

factfictional commented Mar 12, 2023

Did some testing on my machine (an AMD 5700G with 32GB RAM on Arch Linux) and was able to run most of the models. With the 65B model I would need 40+ GB of RAM, and using swap to compensate was just too slow.
(Prompt was "They" on seed 1678609319)

| Quantized Model | Threads | Memory use | Time per token |
| --------------- | ------- | ---------- | -------------- |
| llama-7b  | 4  | 4.2 GB | 137.10 ms |
| llama-7b  | 6  | 4.2 GB | 100.43 ms |
| llama-7b  | 8  | 4.2 GB | 112.44 ms |
| llama-7b  | 10 | 4.2 GB | 131.63 ms |
| llama-7b  | 12 | 4.2 GB | 132.73 ms |
| llama-13b | 4  | 7.9 GB | 261.88 ms |
| llama-13b | 6  | 7.9 GB | 190.74 ms |
| llama-13b | 8  | 7.9 GB | 209.15 ms |
| llama-13b | 10 | 7.9 GB | 244.64 ms |
| llama-13b | 12 | 7.9 GB | 257.72 ms |
| llama-30b | 4  | 19 GB  | 645.15 ms |
| llama-30b | 6  | 19 GB  | 463.04 ms |
| llama-30b | 8  | 19 GB  | 476.64 ms |
| llama-30b | 10 | 19 GB  | 583.19 ms |
| llama-30b | 12 | 19 GB  | 593.75 ms |
My PC has 8 cores, so it seems like, as with whisper.cpp, keeping threads at 6/7 gives the best results.
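
For reference, a sweep like the one in the table can be scripted. This is only a rough sketch: it assumes a llama.cpp checkout with the quantized 7B model at ./models/7B/ggml-model-q4_0.bin and reuses the prompt/seed above; adjust paths and thread counts to your machine.

```bash
#!/usr/bin/env bash
# Sketch: sweep thread counts with a fixed prompt and seed, and grep the timing lines.
# The model path and binary name are assumptions; adjust to your local build.
MODEL=./models/7B/ggml-model-q4_0.bin
for t in 4 6 8 10 12; do
  echo "=== threads: $t ==="
  ./main -m "$MODEL" -p "They" -s 1678609319 -n 128 -t "$t" 2>&1 \
    | grep -i "per token"
done
```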

@ggerganov added the performance (Speed related topics) label Mar 12, 2023
@ElRoberto538

Will run the same tests on an EPYC 7443P to compare; it should be able to run 65B. Copying to my SSD now.

@lucasjinreal

lucasjinreal (Author) commented Mar 12, 2023

@ggerganov This looks very nice: it runs on the CPU yet gives a reasonable speed. I can even run 13B on my PC. What do you think the inference speed would be with a longer prompt?

@breakpointninja

A Ryzen 9 5900X running llama-65b eats 40GB of RAM.

```
mem per token = 70897348 bytes
load time = 68146.04 ms
sample time =  1002.82 ms
predict time = 478729.38 ms / 936.85 ms per token
total time = 550394.94 ms
```

@ElRoberto538

ElRoberto538 commented Mar 12, 2023

AMD EPYC 7443P, 24 cores (in a VM).
Prompt was "They" on seed 1678609319, as above.

| Quantized Model | Threads | Memory use | Time per token |
| --------------- | ------- | ---------- | -------------- |
| llama-7b  | 4  | 4.2 GB | 156.95 ms |
| llama-7b  | 6  | 4.2 GB | 113.06 ms |
| llama-7b  | 8  | 4.2 GB | 93.00 ms |
| llama-7b  | 10 | 4.2 GB | 85.18 ms |
| llama-7b  | 12 | 4.2 GB | 77.18 ms |
| llama-7b  | 21 | 4.2 GB | 76.53 ms |
| llama-7b  | 24 | 4.2 GB | 85.37 ms |
| llama-65b | 4  | 41 GB  | 1408.27 ms |
| llama-65b | 6  | 41 GB  | 978.18 ms |
| llama-65b | 8  | 41 GB  | 772.21 ms |
| llama-65b | 10 | 41 GB  | 654.20 ms |
| llama-65b | 12 | 41 GB  | 592.60 ms |
| llama-65b | 21 | 41 GB  | 577.96 ms |
| llama-65b | 24 | 41 GB  | 596.61 ms |
| llama-65b | 48 | 41 GB  | 1431.73 ms |

Interestingly, it doesn't seem to scale well with core count; I guess it prefers a few fast cores and high memory bandwidth?

@G2G2G2G

G2G2G2G commented Mar 12, 2023

It scales with real cores; once you get into virtual cores (SMT threads) it starts going badly.
If you have an 8-core/16-thread CPU, use 8 threads; for a 24-core/48-thread CPU, use 24 threads, and so on.
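
If you want to pass the physical core count to -t automatically rather than guessing, here is a minimal sketch for Linux (it assumes lscpu is available; the ./main invocation and model path are illustrative):

```bash
# Count physical cores (sockets x cores per socket), ignoring SMT threads,
# and hand that number to llama.cpp's -t flag.
phys=$(lscpu | awk '/^Core\(s\) per socket:/ {c=$4} /^Socket\(s\):/ {s=$2} END {print c*s}')
echo "physical cores: $phys"
./main -m ./models/7B/ggml-model-q4_0.bin -p "They" -n 128 -t "$phys"
```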

@ElRoberto538

I have 24 real cores, but if you look at the numbers above, it seems to hit a wall at around 12 threads and barely improves when raising the thread count to 21/24.

@pugzly

pugzly commented Mar 12, 2023

Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token.
Maybe prompt length had something to do with it, or my memory DIMMs are slower, or Arch is faster, or some combination of all those?
Not an issue, just mildly curious.

7b (6 threads):
main: predict time = 35909.73 ms / 120.91 ms per token

13b (6 threads):
main: predict time = 67519.31 ms / 227.34 ms per token

30b (6 threads):
main: predict time = 165125.56 ms / 555.98 ms per token

@ElRoberto538

> Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token. Maybe prompt length had something to do with it, or my memory DIMMs are slower, or Arch is faster, or some combination of all those? Not an issue, just mildly curious.
>
> 7b (6 threads): main: predict time = 35909.73 ms / 120.91 ms per token
>
> 13b (6 threads): main: predict time = 67519.31 ms / 227.34 ms per token
>
> 30b (6 threads): main: predict time = 165125.56 ms / 555.98 ms per token

My assumption is memory bandwidth: my per-core speed should be slower than yours according to benchmarks, but when I run with 6 threads I get faster performance. My RAM is slow, but 8 memory channels vs. 2 makes up for that, I guess.
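
One way to sanity-check the memory-bandwidth theory is a quick sysbench run at 1 thread vs. 6 threads. This is a sketch only, and the MiB/sec figures are ballpark rather than a proper STREAM-style measurement:

```bash
# Rough memory-bandwidth comparison: sequential reads over a large buffer,
# single-threaded vs. 6 threads. Treat the numbers as indicative only.
sysbench memory --memory-block-size=1M --memory-total-size=32G --memory-oper=read run
sysbench memory --memory-block-size=1M --memory-total-size=32G --memory-oper=read --threads=6 run
```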

@plhosk

plhosk commented Mar 13, 2023

Speeds on an old 4c/8t Intel i7 with the above prompt/seed:

| Threads | 7B (n=128), time per token | 13B, time per token |
| ------- | -------------------------- | ------------------- |
| 4 | 165 ms | 314 ms |
| 5 | 220 ms | 420 ms |
| 6 | 188 ms | 360 ms |
| 7 | 168 ms | 314 ms |
| 8 | 154 ms | 293 ms |
Interesting how the fastest runs are t=4 and t=8, with the ones in between being slower.

In comparison, I'm getting around 20-25 tokens/s (40-50 ms/token) on a 3060 Ti with the 7B model in text-generation-webui with the same prompt (although it gets much slower with more context). If only GPUs had cheap, expandable VRAM.

@pugzly

pugzly commented Mar 13, 2023

> > Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token. Maybe prompt length had something to do with it, or my memory DIMMs are slower, or Arch is faster, or some combination of all those? Not an issue, just mildly curious.
> > 7b (6 threads): main: predict time = 35909.73 ms / 120.91 ms per token
> > 13b (6 threads): main: predict time = 67519.31 ms / 227.34 ms per token
> > 30b (6 threads): main: predict time = 165125.56 ms / 555.98 ms per token
>
> My assumption is memory bandwidth: my per-core speed should be slower than yours according to benchmarks, but when I run with 6 threads I get faster performance. My RAM is slow, but 8 memory channels vs. 2 makes up for that, I guess.

Ah, yes. I have a Crucial 3200 MHz DDR4 (16GB x 2) kit in my system, but all this time it had been running at 2666 MHz for whatever reason. I didn't expect memory to be such a bottleneck on this workload; I would have blamed the CPU exclusively for every millisecond.

Now, after changing the setting in my BIOS and running at 3200 MHz, the numbers are still not exactly on par, but close enough:

7B:
main: predict time = 31586.56 ms / 106.35 ms per token
13B:
main: predict time = 59035.98 ms / 198.77 ms per token
30B:
main: predict time = 139936.17 ms / 484.21 ms per token
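
For anyone else chasing the same thing: the DIMM speed actually in effect can be checked from Linux without rebooting into the BIOS. A sketch; it needs root and the dmidecode package:

```bash
# Show rated vs. configured DIMM speeds. The exact field name varies by
# dmidecode version ("Configured Memory Speed" vs "Configured Clock Speed").
sudo dmidecode --type memory | grep -i "speed"
```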

@dennislysenko

This might be a dumb question, but is there any way to reduce the memory requirements even if it increases inference time? Or is it a constant determined by the model architecture and weights?

@plhosk

plhosk commented Mar 14, 2023

> This might be a dumb question, but is there any way to reduce the memory requirements even if it increases inference time?

Currently no, other than adding a lot of swap space, but even with a fast NVMe drive it will be orders of magnitude slower than running fully in memory.
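
If you want to try the swap route anyway, here is a minimal sketch of adding a swap file on Linux (size and path are placeholders, and expect generation to crawl once the model spills into it):

```bash
# Create and enable a 32G swap file; size and path are placeholders.
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # verify the swap area is active
```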

@prusnak

prusnak (Collaborator) commented Mar 18, 2023

Memory/disk requirements are being added in #269

As for the inference speed, feel free to discuss here, but I am closing this issue.

@prusnak closed this as not planned Mar 18, 2023