Add support for the latest GPTQ models with group-size #530
Conversation
I think you could make groupsize a parameter that defaults to 128 instead of hard-coding it. That would also allow -1 for loading old 4-bit models.
I tried loading old 4-bit models with -1 and it errored. I thought everything had to be re-quantized to work with the new GPTQ.
This need for re-quantization is a killer when combined with the fact that LLaMA is not a publicly available model that we can simply re-quantize and upload to Hugging Face.
The alpaca-native model can just be quantized the old way. I didn't see anyone delete any of the .pt files from Hugging Face; have they been doing that? Is the new GPTQ any faster?
https://huggingface.co/ozcur/alpaca-native-4bit This user has quantized the alpaca-native model from chavinlo.
I wish they had made a 13b; 7b just runs fast on everything.
GPTQ 4-bit does not load if it was made with act-order. I am currently testing true-sequential, if that's okay.
```python
if not pt_path:
    print(f"Could not find {pt_model}, exiting...")
    exit()
```

```diff
 # qwopqwop200's offload
 if shared.args.gptq_pre_layer:
-    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)
+    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, 128, shared.args.gptq_pre_layer)
 else:
```
Could this 128 break act-order? GPTQ states, "Currently, groupsize and act-order do not work together and you must choose one of them." So I would imagine that using 128 when you are not supposed to will cause issues.
This function is imported from here: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/llama_inference.py#L26
Maybe it hasn't been updated yet to work with act-order?
That's good to hear! I will make it a parameter.
I have added the groupsize parameter.
Should this be a parameter? If it has to match the group size used during quantization, it would be better to store this number in a file.
We could follow the naming of the GPTQ example (https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/README.md?plain=1#L119)? Having to use a file just for one parameter seems wasteful.
Sure, I have renamed all parameters and added deprecation warnings to the old parameter names.
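As a rough illustration of what that could look like (a sketch only, not the exact code in this PR; the new flag names follow the command used later in this thread, the deprecated names follow the diff above, and the defaults are assumptions):

```python
import argparse
import warnings

parser = argparse.ArgumentParser()

# New GPTQ-style flag names (--wbits, --groupsize, --pre_layer).
parser.add_argument('--wbits', type=int, default=0,
                    help='Quantization bit width, e.g. 4.')
parser.add_argument('--groupsize', type=int, default=-1,
                    help='Group size used at quantization time; -1 for old models without grouping.')
parser.add_argument('--pre_layer', type=int, default=0,
                    help='Pre-layer value for the qwopqwop200 CPU offload.')

# Deprecated aliases kept so old command lines keep working.
parser.add_argument('--gptq-bits', type=int, default=0, help=argparse.SUPPRESS)
parser.add_argument('--gptq-pre-layer', type=int, default=0, help=argparse.SUPPRESS)

args = parser.parse_args()

if args.gptq_bits:
    warnings.warn('--gptq-bits is deprecated; use --wbits instead.')
    args.wbits = args.gptq_bits
if args.gptq_pre_layer:
    warnings.warn('--gptq-pre-layer is deprecated; use --pre_layer instead.')
    args.pre_layer = args.gptq_pre_layer
```

The groupsize value can then be passed through to `load_quant` in place of the hard-coded 128, with -1 covering models quantized without grouping.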
Error with the command `python server.py --model alpaca-native-4bit --wbits 4 --model_type llama --groupsize 128`: Loading alpaca-native-4bit... I'm assuming it has to do with --pre_layer.
You are right, I messed up. Can you see if it works now?
The model loads fine; I think GPTQ-for-LLaMa is the problem: https://pastebin.com/irGyfc6L
@USBhost does that require the (currently) slow PyTorch version of GPTQ-for-LLaMa at inference time?
Yeah... but if you live at 128+ context it's faster, like character cards in cai-chat.
Kind of funny you call it slow; here in 65b land our delay was 50+ seconds.
Hello, a few questions if someone has the time, and I apologize if this isn't the place for them. I've gone over the new text-generation-webui install documentation. Am I correct in my understanding that this is only for Windows/Linux machines at this time, due to the need for CUDA (Nvidia)? The only model compatible with text-generation-webui on Apple Silicon seems to be chavinlo's 7b Alpaca-Native (too slow for practical use with 16 GB of RAM). For Apple Silicon there appear to be two options:
I've seen little interest in or documentation about getting Alpaca or LLaMA running on Apple Silicon. If anyone has additional information, I'd be interested.
Has anyone documented the RAM (not VRAM) costs for loading these? I don't see anything in the wiki, but it currently crashes for me when I try to load MetaIX_Alpaca_30B.
@mjlbach I suggest you add some swap space (perhaps 32 GB) and then run the command with `/usr/bin/time -v` in front of it to determine the peak memory use.
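If you would rather measure from inside Python, here is a minimal sketch (not part of the webui) that reports the peak resident set size; note that on Linux `ru_maxrss` is in KiB:

```python
import resource

def report_peak_rss():
    # ru_maxrss is the peak resident set size of this process; Linux reports it in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"Peak RSS: {peak_kib / 1024:.0f} MiB")

# Call this after the model has finished loading.
report_peak_rss()
```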
Can I use act-order + true-sequential in the CUDA kernel implementation, without group size, for both conversion and inference? I have a lot of models to re-do because I was using GPT-J/NeoX and OPTs. It was doing OK before these new additions, and the models weren't exactly "dumb". My card is old, so the difference for me will be much larger, probably to the point where chatting becomes a huge frustration.
Yes. The only issue with the CUDA kernel is that you cannot use act-order and group size together.
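For reference, that constraint could be expressed as a simple guard like the sketch below (an illustration only; the argument names are assumptions, not actual webui or GPTQ-for-LLaMa code):

```python
def validate_quant_options(groupsize: int, act_order: bool, use_cuda_kernel: bool) -> None:
    # The CUDA kernel does not support act-order combined with a group size;
    # groupsize == -1 means grouping is disabled.
    if use_cuda_kernel and act_order and groupsize != -1:
        raise ValueError(
            "act-order and groupsize cannot be used together with the CUDA kernel; "
            "quantize with groupsize=-1 or drop act-order."
        )
```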
After 3 consecutive days of trying to find a way to run server.py on WSL, I finally got it working, so I'm writing a simple, straightforward tutorial to share with the community. I want to include the download of these models (which are the ones that work flawlessly in this specific context), but downloading the torrent through the WSL terminal is a pain, as I don't know how to download only the required model instead of the complete torrent. So my question is: are these files available elsewhere (like HF, for example), or does someone know how I can download only one of the folders (models) from the torrent through the WSL terminal? I'd greatly appreciate it, because the tutorial is missing that last step. Keep up the amazing work, guys.
In Win11, you can run Linux GUI apps through WSL, or you can run a torrent client with a web UI inside of WSL. That way it's much easier to manage the torrents and select only the files you need.
Thanks! I'm using the latest version of Win10, which uses WSL2 (just like Win11), so it will run Linux GUI apps too. Still, I want this tutorial to be very straightforward, so a simple command line to retrieve only the needed folder from the torrent would be great. I've been researching ways to do this, but on the command line I can only find information on how to download the torrent in its entirety, so if someone knows how to do this from the Linux terminal, please let me know. Sorry if this seems out of context, but I thought asking here, where the torrent link is, would save someone time.
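One way to grab just a single folder from the command line is a scriptable torrent client. A minimal Python sketch using the python-libtorrent bindings is below (the torrent filename and folder name are placeholders, and this is only a suggestion, not an officially supported workflow):

```python
import time
import libtorrent as lt

TORRENT_FILE = "llama-4bit.torrent"    # placeholder: path to the downloaded .torrent file
WANTED_FOLDER = "llama-13b-4bit/"      # placeholder: the one folder you want to keep

info = lt.torrent_info(TORRENT_FILE)
ses = lt.session()
handle = ses.add_torrent({"ti": info, "save_path": "."})

# Skip (priority 0) every file outside the wanted folder.
for idx in range(info.num_files()):
    path = info.files().file_path(idx)
    handle.file_priority(idx, 1 if path.startswith(WANTED_FOLDER) else 0)

# progress is computed over the wanted files only, so 1.0 means the folder is done.
while handle.status().progress < 1.0:
    print(f"{handle.status().progress * 100:.1f}% complete", end="\r")
    time.sleep(5)
print("\nDownload finished.")
```

A GUI client inside WSL (as mentioned above) works too; this is just the most copy-pasteable option for a tutorial.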
My last update for visibility: #530 (comment)
Guys, I finally made it! You can now use the installation script I created to install this on WSL in a very easy way. The link is:
I'm very new to this whole GitHub world, so comments, negative or positive, will be of great help! I want to include a part saying something like "Thanks to oobabooga for text-generation-webui, USBhost for the 4-bit quantized models, and qwopqwop200 for GPTQ-for-LLaMa". Can anyone help me with how I should properly post this thanks message?
I also made a video tutorial, but I don't even know if it's OK to share it here; I hope so! --> https://youtu.be/RcHIOVtYB7g
Loading alpaca-native-4bit...
Fixed (LFS newbie).
**Warning: old 4-bit weights will not work anymore!** See here for how to get up-to-date weights: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-2-get-the-pre-converted-weights
That worked perfectly, thank you!
Example: https://huggingface.co/ozcur/alpaca-native-4bit
Usage: `python server.py --model alpaca-native-4bit --wbits 4 --model_type llama --groupsize 128`