Run Llama 3 70b locally combining ram and vram like with other apps? #5965
Comments
So a few things. First off, I was asking about this earlier in the discussions: it takes a little while (a few weeks, I guess) for updates to llama.cpp to trickle into this program. That's because text-generation-webui doesn't use https://github.com/ggerganov/llama.cpp directly; it uses abetlen/llama-cpp-python, which is, as I understand it, a set of Python bindings for llama.cpp. So once llama.cpp updates, llama-cpp-python has to update, and THEN text-generation-webui has to update to be compatible with the new version of llama-cpp-python. You can see in the requirements.txt file that they just bumped this program to llama-cpp-python 0.2.64, when the most recent release of llama-cpp-python is 0.2.68. I guess you could edit the requirements.txt of your local install, but there's a good chance you'd break something, idk.

As to your main question, I'd recommend this version: https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF It wasn't quantized with the newest version of llama.cpp, but it's still pretty recent. The guy who makes them says he will have an even newer version of Llama 3 70B up today-ish, so keep an eye out for that. TheBloke is apparently retired, btw.

To my knowledge, the only way to properly use both your CPU and GPU together is to use GGUF, so that's what you want. You've got enough memory to run the Q6_K quant without too much trouble, I think; imo that's a pretty good sweet spot for reducing memory use without losing much accuracy. You will have to join the two split files together for Q6_K, but that's pretty easy: you can do it from the command line, just look it up.

To get it to run in text-generation-webui, just drop it into your models folder and then load it. It should automatically default to 8k context. The only thing you will have to play with is n-gpu-layers in the Model tab. Try like 20 or something to start, and keep an eye on your resource monitor and the CLI output of text-generation-webui. Every layer you add in n-gpu-layers adds to the VRAM usage on your GPU. You've just got to find the sweet spot.
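For joining the split files, here is a minimal Python sketch of the idea, assuming the parts are plain byte-wise splits that only need to be concatenated in order. The function name and the file names in the usage comment are made up for illustration. If the parts were instead produced by llama.cpp's gguf-split tool, plain concatenation is not the right approach and that tool's merge mode would be needed.

```python
import shutil
from pathlib import Path


def join_parts(parts, output):
    """Concatenate byte-split model parts, in the order given, into one file.

    Returns the size of the joined file in bytes.
    """
    output = Path(output)
    with output.open("wb") as dst:
        for part in parts:
            with Path(part).open("rb") as src:
                # Copy in 16 MiB chunks so a multi-gigabyte part never
                # has to fit in RAM all at once.
                shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)
    return output.stat().st_size


# Hypothetical usage (file names are placeholders, not the real ones):
# join_parts(["model.gguf.part1of2", "model.gguf.part2of2"], "model.gguf")
```

The order of the list matters: the first part carries the GGUF header, so it has to come first.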
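If you want a rough number to start from instead of guessing n-gpu-layers, something like this back-of-the-envelope sketch might help. It assumes every layer takes an equal share of the model file and that a few GB of VRAM stay reserved for context and other overhead. The function name, the 4 GB reserve, and the roughly 58 GB file size for the Q6_K quant are my assumptions; Llama 3 70B has 80 transformer layers.

```python
import math


def estimate_gpu_layers(model_file_gb, n_layers, vram_gb, reserve_gb=4.0):
    """Rough starting value for n-gpu-layers.

    Assumes all layers are the same size and that reserve_gb of VRAM is
    kept free for the KV cache, scratch buffers, and the desktop.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never suggest more layers than the model actually has.
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed numbers: ~58 GB Q6_K file, 80 layers, 24 GB card.
print(estimate_gpu_layers(58.0, 80, 24.0))  # prints 27
```

Treat the result as an upper-ish starting point only. Real usage depends on context length and what else is using the GPU, so still watch the VRAM numbers and step the value up or down from there.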
This issue has been closed due to inactivity for 6 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Sorry, I am pretty much a novice in the LLM space. I have noticed that some users are able to run the Llama 3 70B model locally as a quantized GGUF in other programs by splitting the model between CPU/RAM and GPU/VRAM, somehow with a larger context. I don't really see any information on how to do this in text-generation-webui (which I much prefer).
I have 24 GB VRAM and 64 GB RAM. Can anyone explain which model from TheBloke to download (or whether there's a better version, uncensored, etc.)? Someone referred me to this search, which lists versions that allow a larger context length for Llama 3: https://huggingface.co/models?sort=modified&search=llama+gradient+exl2 Also, what settings should I use, and is this currently possible with text-generation-webui at all?
I am also unsure whether this can be done with exl2 or whether I should be using GGUF.
Edit: Apparently flash attention was added to llama.cpp today (ggerganov/llama.cpp#5021) for larger contexts over 64k; not sure if this is relevant.