How's the inference speed and mem usage? #39
Comments
Did some testing on my machine (AMD 5700G with 32GB RAM on Arch Linux) and was able to run most of the models. With the 65B model, I would need 40+ GB of RAM, and using swap to compensate was just too slow.
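A rough back-of-envelope for those figures, as a hedged sketch: the ~5 bits/weight used below is an assumption (4-bit quantized weights plus per-block scaling factors), not a number taken from this thread or from the repo, so treat the output as an order-of-magnitude estimate only.

```cpp
// Back-of-envelope weight-memory estimate: parameters * effective bits per weight.
// bits_per_weight = 5.0 is an assumption; actual quantization formats may differ.
#include <cstdio>

int main() {
    const double params[] = {7e9, 13e9, 30e9, 65e9};
    const double bits_per_weight = 5.0;  // assumed: 4-bit weights + per-block scales

    for (double p : params) {
        const double gb = p * bits_per_weight / 8.0 / 1e9;
        std::printf("%3.0fB params -> ~%.1f GB of weights (plus KV cache / context)\n",
                    p / 1e9, gb);
    }
    return 0;
}
```

With these assumptions the 65B model lands at roughly 40 GB of weights, which is in line with the 40+ GB observed above.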
Will run the same tests on an EPYC 7443P to compare; should be able to run 65B - copying to my SSD now.
@ggerganov This looks very nice: it runs on the CPU but still gives reasonable speed. I can run 13B even on my PC. What do you think the inference speed would be with a longer prompt?
Ryzen 9 5900X on llama-65b eats 40GB of RAM.
AMD EPYC 7443P 24 core (in VM).
Interestingly, it doesn't seem to scale well with cores; I guess it likes a few fast cores and high memory bandwidth?
It scales with real cores. Once you get to virtual cores (SMT threads) it starts going badly.
I have 24 real cores, but if you look at the numbers above, it seems to hit a wall at around 12 threads and barely improves when doubling the threads to 21/24.
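That kind of wall is what a memory-bandwidth-bound workload looks like. A minimal, hypothetical streaming micro-benchmark (not from the repo) shows the same shape: aggregate read bandwidth stops improving once the memory channels are saturated, well before the core count is exhausted, and SMT threads add little or nothing.

```cpp
// Streaming read micro-benchmark: aggregate bandwidth vs. thread count.
// Token generation reads essentially all the weights once per token, so its
// thread scaling tends to follow this curve rather than the number of cores.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const size_t n = 256u * 1024 * 1024;          // 1 GiB of floats
    std::vector<float> data(n, 1.0f);

    for (unsigned t : {1u, 2u, 4u, 8u, 12u, 16u, 24u}) {
        std::vector<std::thread> workers;
        std::vector<double> sums(t, 0.0);
        const auto start = std::chrono::steady_clock::now();

        for (unsigned i = 0; i < t; ++i) {
            workers.emplace_back([&, i] {
                const size_t begin = n / t * i;
                const size_t end   = (i + 1 == t) ? n : n / t * (i + 1);
                sums[i] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
            });
        }
        for (auto &w : workers) w.join();

        const double sec = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%2u threads: %6.1f GB/s (checksum %.0f)\n",
                    t, n * sizeof(float) / sec / 1e9,
                    std::accumulate(sums.begin(), sums.end(), 0.0));
    }
    return 0;
}
```

Compile with optimizations and `-pthread`; on a dual-channel desktop the GB/s figure typically flattens out after a handful of threads.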
Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token. 7B (6 threads): 13B (6 threads): 30B (6 threads):
My assumption is memory bandwidth: my per-core speed should be slower than yours according to benchmarks, but when I run with 6 threads I get faster performance. My RAM is slow, but 8 memory channels vs 2 makes up for that, I guess.
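That reading is consistent with a simple roofline-style bound: each generated token streams roughly the whole set of weights through the CPU, so peak tokens/s is at most memory bandwidth divided by model size. The figures below are assumptions for illustration (DDR4-3200 at ~25.6 GB/s per channel, a ~4 GB quantized 7B model), not measurements from this thread.

```cpp
// Roofline-style upper bound: tokens/s <= memory bandwidth / bytes read per token.
// All numbers here are assumed for illustration.
#include <cstdio>

int main() {
    const double per_channel = 25.6e9;   // DDR4-3200, bytes/s per channel (assumed)
    const double model_bytes = 4.0e9;    // ~4 GB quantized 7B model (assumed)

    for (int channels : {2, 8}) {
        const double bw = channels * per_channel;
        std::printf("%d channels: ~%.0f GB/s -> at most ~%.0f tokens/s for a %.0f GB model\n",
                    channels, bw / 1e9, bw / model_bytes, model_bytes / 1e9);
    }
    return 0;
}
```

Under these assumptions a dual-channel desktop tops out around ~13 tokens/s on 7B, while an 8-channel EPYC has roughly four times the ceiling, which is why extra channels can beat faster individual cores.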
Speeds on an old 4c/8t Intel i7 with the above prompt/seed: 7B (n=128): 13B: Interesting how the fastest runs are t=4 and t=8, with the ones in between being slower. In comparison, I'm getting around 20-25 tokens/s (40-50 ms/token) on a 3060 Ti with the 7B model in text-generation-webui with the same prompt (although it gets much slower with higher amounts of context). If only GPUs had cheap, expandable VRAM.
Ah, yes. I have a Crucial 3200 MHz DDR4 (16GB x 2) kit in my system, but all this time it had been running at 2666 MHz, for whatever reason. I actually didn't expect memory to be such a bottleneck in this workload; I would have blamed the CPU exclusively for every millisecond. Now, after changing the settings in my BIOS to 3200 MHz, the numbers are still not exactly on par, but close enough: 7B:
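The bandwidth arithmetic roughly matches the ~20% gap reported earlier: dual-channel DDR4-2666 peaks around 2 x 21.3 ≈ 42.7 GB/s, while dual-channel DDR4-3200 gives 2 x 25.6 ≈ 51.2 GB/s, about 20% more, which is in the same ballpark as the per-token difference observed before the BIOS change.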
This might be a dumb question, but is there any way to reduce the memory requirements, even if it increases inference time? Or is this a constant based on the model architecture and weights?
Currently no, other than adding a lot of swap space, but even with a fast NVMe drive it will be orders of magnitude slower than running fully in memory.
Memory/disk requirements are being added in #269. As for the inference speed, feel free to discuss here, but I am closing this issue.