Add support for the latest GPTQ models with group-size #530

Merged Mar 26, 2023 (24 commits)

Conversation

oobabooga
Owner

@oobabooga oobabooga commented Mar 24, 2023

Example: https://huggingface.co/ozcur/alpaca-native-4bit

Usage:

python server.py --model alpaca-native-4bit --wbits 4 --groupsize 128

@sgsdxzy
Contributor

sgsdxzy commented Mar 24, 2023

I think maybe you can make groupsize a parameter that defaults to 128, not a hard-coded one. That can also support -1 to load old 4bit models.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 24, 2023

I tried loading old 4-bit models with -1 and it errored. I thought everything had to be re-quantized to work with the new GPTQ.

@oobabooga
Owner Author

oobabooga commented Mar 24, 2023

This need for re-quantization is a killer when combined with the fact that LLaMA is not a publicly available model that we can just re-quantize and upload to Hugging Face freely.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 24, 2023

The alpaca-native model can just be quantized the old way. I haven't seen anyone delete any of the PT files from Hugging Face; have they been doing that?

Is the new GPTQ any faster?

@RandomInternetPreson
Contributor

https://huggingface.co/ozcur/alpaca-native-4bit

This guy has quantized the alpaca-native model from chavinlo.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 24, 2023

I wish they made a 13b.. 7b just runs fast on everything.

@USBhost
Contributor

USBhost commented Mar 24, 2023

GPTQ 4bit does not load if it was made with act-order. I am currently testing true-sequential if that's okay.
Edit: true-sequential loads so only act-order is broken.


         if not pt_path:
             print(f"Could not find {pt_model}, exiting...")
             exit()

         # qwopqwop200's offload
         if shared.args.gptq_pre_layer:
    -        model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)
    +        model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, 128, shared.args.gptq_pre_layer)
         else:
Contributor

Could these 128s break act-order? GPTQ states "Currently, groupsize and act-order do not work together and you must choose one of them." So I would imagine using 128 when you are not supposed to will cause issues.

Owner Author

@oobabooga oobabooga Mar 24, 2023

This function is imported from here: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/llama_inference.py#L26

Maybe it hasn't been updated yet to work with act-order?

Contributor

@USBhost USBhost Mar 24, 2023

I changed both 128 to -1 and I was able to load act-order. So I guess just make groupsize a parameter as @sgsdxzy said. If act-order becomes the default just have groupsize default to -1 as it already does on GPTQ.
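
The constraint discussed in this review thread can be sketched as a small guard that runs before loading. This is only an illustration: `validate_gptq_options` is a hypothetical helper, not part of text-generation-webui or GPTQ-for-LLaMa, and it assumes that -1 means "no grouping", as described above.

```python
def validate_gptq_options(groupsize: int, act_order: bool) -> None:
    """Reject option combinations that the quoted GPTQ note warns about.

    Assumption: groupsize == -1 means "no grouping", matching the default
    discussed in this thread. The function name is hypothetical.
    """
    if groupsize != -1 and groupsize <= 0:
        raise ValueError(f"groupsize must be -1 or a positive integer, got {groupsize}")
    if act_order and groupsize != -1:
        raise ValueError(
            "act-order and groupsize cannot be combined; "
            "use groupsize=-1 for checkpoints quantized with act-order"
        )
```

Calling `validate_gptq_options(-1, act_order=True)` passes, while `validate_gptq_options(128, act_order=True)` raises, which mirrors the behaviour reported above when both 128 values were changed to -1.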

Owner Author

That's good to hear! I will make it a parameter.

@oobabooga
Owner Author

I have added the --gptq-group-size parameter with the value set to -1 by default.

Should this be a parameter? If it has to match the group size used during quantization, it would be better to store this number in a config.json that is distributed with the model. But we can leave it like this for now.
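
A minimal sketch of the config-file idea, assuming a hypothetical `quantize_config.json` with a `groupsize` key stored next to the model. Neither the file name nor the key comes from this PR; the point is only to show how a stored value could serve as the default while an explicit command-line value still wins.

```python
import json
from pathlib import Path


def resolve_groupsize(model_dir, cli_groupsize=None, config_name="quantize_config.json"):
    """Pick the group size to load a quantized model with.

    Precedence in this sketch: explicit CLI value, then the value stored
    next to the model, then -1 (no grouping). The config file name and
    its "groupsize" key are hypothetical.
    """
    if cli_groupsize is not None:
        return cli_groupsize
    config_path = Path(model_dir) / config_name
    if config_path.is_file():
        with config_path.open(encoding="utf-8") as fh:
            return int(json.load(fh).get("groupsize", -1))
    return -1
```

With this layout, a model folder that ships the file would not need `--groupsize` on the command line, and older folders without it would fall back to -1.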

@USBhost
Contributor

USBhost commented Mar 24, 2023

We could follow the naming of the GPTQ example https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/README.md?plain=1#L119 ? Having to use a file just for one parameter seems wasteful.

@oobabooga
Owner Author

Sure, I have renamed all parameters and added deprecation warnings to the old parameter names.
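
One way such a rename with deprecation warnings could look is sketched below with `argparse`. The new names (`--wbits`, `--groupsize`, `--pre_layer`) appear in this thread; the old spellings `--gptq-bits` and `--gptq-pre-layer` are inferred from the `shared.args.gptq_*` attributes in the diff, and the defaults for `--wbits` and `--pre_layer` are assumptions. This is not the code that was merged.

```python
import argparse
import warnings

# Old attribute name -> new attribute name. The old spellings are inferred
# from the diff and the --gptq-group-size flag mentioned in this thread.
DEPRECATED_FLAGS = {
    "gptq_bits": "wbits",
    "gptq_group_size": "groupsize",
    "gptq_pre_layer": "pre_layer",
}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--wbits", type=int, default=0)       # default is an assumption
    parser.add_argument("--groupsize", type=int, default=-1)  # -1 as stated above
    parser.add_argument("--pre_layer", type=int, default=0)   # default is an assumption
    # Deprecated spellings stay accepted so existing commands keep working.
    parser.add_argument("--gptq-bits", type=int, default=None)
    parser.add_argument("--gptq-group-size", type=int, default=None)
    parser.add_argument("--gptq-pre-layer", type=int, default=None)
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    for old, new in DEPRECATED_FLAGS.items():
        value = getattr(args, old)
        if value is not None:
            warnings.warn(
                f"--{old.replace('_', '-')} is deprecated; use --{new} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # Copy the old value onto the new name so the loader reads one attribute.
            setattr(args, new, value)
    return args
```

Copying the old value onto the new attribute keeps the rest of the code reading only `args.wbits`, `args.groupsize`, and `args.pre_layer`.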

@TwitchPid

Error with command python server.py --model alpaca-native-4bit --wbits 4 --model_type llama --groupsize 128:

Loading alpaca-native-4bit...
Traceback (most recent call last):
  File "/run/media/user/disk/text-generation-webui/server.py", line 234, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/run/media/user/disk/text-generation-webui/modules/models.py", line 101, in load_model
    model = load_quantized(model_name)
  File "/run/media/user/disk/text-generation-webui/modules/GPTQ_loader.py", line 64, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'

I'm assuming it has to do with --pre_layer

@oobabooga
Owner Author

You are right, I messed up. Can you see if it works now?
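
The traceback above shows a three-argument call to `load_quant` on the branch that does not use `--pre_layer`, so `groupsize` was never passed there. A minimal sketch of the shape of the fix follows; `load_quant` here is a stand-in stub whose argument order is inferred from the diff and the traceback, not the real GPTQ-for-LLaMa function.

```python
def load_quant(model, checkpoint, wbits, groupsize, pre_layer=None):
    """Stand-in stub; the argument order is inferred from this thread."""
    return {
        "model": model,
        "checkpoint": checkpoint,
        "wbits": wbits,
        "groupsize": groupsize,
        "pre_layer": pre_layer,
    }


def load_quantized(path_to_model, pt_path, wbits, groupsize=-1, pre_layer=0):
    """Pass groupsize on both branches so neither call is short an argument."""
    if pre_layer:
        return load_quant(str(path_to_model), str(pt_path), wbits, groupsize, pre_layer)
    return load_quant(str(path_to_model), str(pt_path), wbits, groupsize)
```

Calling the stub as `load_quant("m", "m.pt", 4)` reproduces the same kind of `TypeError` as in the report, because `groupsize` is a required positional argument.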

@TwitchPid

The model loads fine, I think GPTQ-for-LLaMA is the problem: https://pastebin.com/irGyfc6L

@TwitchPid

Update: it works! I ran git reset --hard 5cdfad2a15dffebad5d1c24b443bc5dc291d8372 on GPTQ-for-LLaMA (the revision that alpaca-native-4bit used).
[Screenshot: Screenshot_20230324_184903]

@oobabooga
Owner Author

@USBhost does that require the (currently) slow Pytorch version of GPTQ-for-LLaMa at inference time?

@USBhost
Contributor

USBhost commented Mar 30, 2023

> @USBhost does that require the (currently) slow Pytorch version of GPTQ-for-LLaMa at inference time?

Yeah... But if you live in 128+ context it's faster. Like character cards on cai-chat

@USBhost
Contributor

USBhost commented Mar 30, 2023

Kind of funny you call it slow, here in 65b lands our delay was 50+ seconds

@borriodelrio

Hello, a few questions if someone has the time, and I apologize if this isn't the place for this. I've gone over the new text-generation-webui install documentation. Am I correct in understanding that this is only for Windows/Linux machines at this time, because it needs CUDA (NVIDIA)?

The only model compatible with text-generation-webui on Apple silicon seems to be Chavinlo's 7b Alpaca-Native (too slow for practical use on 16GB RAM).

For Apple Silicon there appear to be two options:

  1. [CPU] Llama.cpp/Alpaca.cpp with GGML
  2. [GPU] LLaMA_MPS (https://github.com/jankais3r/LLaMA_MPS). This project seems as though it could make use of Apple's Neural Engine (ANE) transformers.

I've seen little interest in, or documentation on, getting Alpaca or LLaMA running on Apple Silicon. If anyone has additional information, I'd be interested.

@mjlbach

mjlbach commented Mar 30, 2023

Has anyone documented the RAM (not VRAM) costs of loading these? I don't see anything in the wiki, but it currently crashes for me when I try to load MetaIX_Alpaca_30B with python server.py --wbits 4 --model MetaIX_Alpaca-30B-Int4 on a 3090 system with only 16 GB of RAM.

@neuhaus

neuhaus commented Mar 30, 2023

@mjlbach I suggest you add some swap space (perhaps 32 GB) and then run the command with "/usr/bin/time -v" in front of it to determine the peak memory use.
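
For anyone who would rather measure this from Python, here is a rough equivalent of the `/usr/bin/time -v` suggestion, sketched with the standard `resource` module (Unix only). It reports the peak resident set size of child processes, which on Linux is given in kilobytes. The function and script names are made up for this example.

```python
import resource
import subprocess
import sys


def run_and_report_peak(command):
    """Run `command`, wait for it, and return (exit_code, peak_rss).

    ru_maxrss is the largest resident set size among the child processes
    this interpreter has waited for; on Linux the unit is kilobytes.
    """
    completed = subprocess.run(command, check=False)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python peak_mem.py python server.py --wbits 4 --model <name>
    code, peak_kb = run_and_report_peak(sys.argv[1:])
    print(f"exit code {code}, peak RSS about {peak_kb / 1024:.0f} MiB")
    sys.exit(code)
```

Run it from a fresh interpreter per measurement, since the counter covers every child the process has waited for so far.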

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 30, 2023

> > @USBhost does that require the (currently) slow Pytorch version of GPTQ-for-LLaMa at inference time?
>
> Yeah... But if you live in 128+ context it's faster. Like character cards on cai-chat

Can I use act order + true sequential in the cuda kernel implementation without group size for both conversion and inference?

I have a lot of models to re-do because I was using GPT-J/Neox and opts. It was doing ok before these new additions and the models weren't exactly "dumb". My card is old so the difference for me will be much larger, probably to the point where chatting becomes a huge frustration.

@USBhost
Contributor

USBhost commented Mar 30, 2023

> > > @USBhost does that require the (currently) slow Pytorch version of GPTQ-for-LLaMa at inference time?
> >
> > Yeah... But if you live in 128+ context it's faster. Like character cards on cai-chat
>
> Can I use act order + true sequential in the cuda kernel implementation without group size for both conversion and inference?
>
> I have a lot of models to re-do because I was using GPT-J/Neox and opts. It was doing ok before these new additions and the models weren't exactly "dumb". My card is old so the difference for me will be much larger, probably to the point where chatting becomes a huge frustration.

Yes. The only issue with the CUDA kernel is that you cannot use act-order and group size together.

@Highlyhotgames

Groupsize 128 version. TL;DR: from 13B up, this torrent is slightly better than the baseline one, but it also uses slightly more VRAM.

magnet:?xt=urn:btih:88f7d9d2460ffcaf78b21e83012de00939eacb65&dn=LLaMA-HF-4bit-128g&tr=http%3a%2f%2fbt2.archive.org%3a6969%2fannounce&tr=http%3a%2f%2fbt1.archive.org%3a6969%2fannounce

Or unzip for the torrent file. LLaMA-HF-4bit-128g.zip

for f in *; do sha256sum "$f"/*.safetensors; done
d3073ef1a2c0b441f95a5d4f8a5aa3b82884eef45d8997270619cb29bcc994b8  llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
8b7d75d562938823c4503b956cb4b8af6ac0a5afbce2278566cc787da0f8f682  llama-30b-4bit-128g/llama-30b-4bit-128g.safetensors
f1418091e3307611fb0a213e50a0f52c80841b9c4bcba67abc1f6c64c357c850  llama-65b-4bit-128g/llama-65b-4bit-128g.safetensors
ed8ec9c9f0ebb83210157ad0e3c5148760a4e9fd2acfb02cf00f8f2054d2743b  llama-7b-4bit-128g/llama-7b-4bit-128g.safetensors
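
For checking a download against the hashes above without `sha256sum`, a small sketch using only the standard library follows. The `EXPECTED` mapping copies the four digests from this post; the helper names are made up for this example.

```python
import hashlib
from pathlib import Path

# Digests copied from the listing above, keyed by path relative to the download folder.
EXPECTED = {
    "llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors":
        "d3073ef1a2c0b441f95a5d4f8a5aa3b82884eef45d8997270619cb29bcc994b8",
    "llama-30b-4bit-128g/llama-30b-4bit-128g.safetensors":
        "8b7d75d562938823c4503b956cb4b8af6ac0a5afbce2278566cc787da0f8f682",
    "llama-65b-4bit-128g/llama-65b-4bit-128g.safetensors":
        "f1418091e3307611fb0a213e50a0f52c80841b9c4bcba67abc1f6c64c357c850",
    "llama-7b-4bit-128g/llama-7b-4bit-128g.safetensors":
        "ed8ec9c9f0ebb83210157ad0e3c5148760a4e9fd2acfb02cf00f8f2054d2743b",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root, expected):
    """Return the relative paths that are missing or whose digest differs."""
    bad = []
    for rel_path, want in expected.items():
        path = Path(root) / rel_path
        if not path.is_file() or sha256_of(path) != want.lower():
            bad.append(rel_path)
    return bad
```

`find_mismatches(".", EXPECTED)` returns an empty list when every file is present and matches; models that were not downloaded show up in the list as missing.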

Edit: Below is the original post with more information.

If you're on the main branch, do the following to try out this pull request:

git revert dcfd866402dfbbc849bd4441fd1de9448de18c75
git pull origin --no-ff pull/530/head

Here's my second torrent, using groupsize 128 + true-sequential. Due to the groupsize it will be slightly bigger than normal. It is better, but by how much I don't know... I'm just a noob with an old server lol. Lower is better. 7B is included for completeness, so only use this torrent for 13B and up.

| Model | Dataset | Baseline | Groupsize 128 |
| --- | --- | --- | --- |
| 7B | wikitext2 | 6.259988784790039 | 6.237235069274902 |
| 7B | ptb-new | 10.817036628723145 | 11.199039459228516 |
| 7B | c4-new | 7.802077293395996 | 8.000247955322266 |
| 13B | wikitext2 | 5.341851711273193 | 5.242600440979004 |
| 13B | ptb-new | 9.474738121032715 | 9.225408554077148 |
| 13B | c4-new | 7.071592330932617 | 6.912217617034912 |
| 30B | wikitext2 | 4.45449686050415 | 4.230341911315918 |
| 30B | ptb-new | 8.377615928649902 | 8.243087768554688 |
| 30B | c4-new | 6.390762805938721 | 6.231330394744873 |
| 65B | wikitext2 | 3.8416879177093506 | 3.658999443054199 |
| 65B | ptb-new | 7.881875991821289 | 7.780252456665039 |
| 65B | c4-new | 5.998412609100342 | 5.896479606628418 |
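
To make the "only use this for 13B and up" advice easier to check, the numbers above can be turned into relative changes. The sketch below copies the reported values and computes the percent change from baseline to groupsize 128; a negative value means the groupsize 128 score is lower, which is better here.

```python
# (baseline, groupsize 128) pairs copied from the results above.
RESULTS = {
    "7B": {
        "wikitext2": (6.259988784790039, 6.237235069274902),
        "ptb-new": (10.817036628723145, 11.199039459228516),
        "c4-new": (7.802077293395996, 8.000247955322266),
    },
    "13B": {
        "wikitext2": (5.341851711273193, 5.242600440979004),
        "ptb-new": (9.474738121032715, 9.225408554077148),
        "c4-new": (7.071592330932617, 6.912217617034912),
    },
    "30B": {
        "wikitext2": (4.45449686050415, 4.230341911315918),
        "ptb-new": (8.377615928649902, 8.243087768554688),
        "c4-new": (6.390762805938721, 6.231330394744873),
    },
    "65B": {
        "wikitext2": (3.8416879177093506, 3.658999443054199),
        "ptb-new": (7.881875991821289, 7.780252456665039),
        "c4-new": (5.998412609100342, 5.896479606628418),
    },
}


def relative_change(baseline, grouped):
    """Percent change from baseline; negative means groupsize 128 scored lower."""
    return (grouped - baseline) / baseline * 100.0


def summarize(results):
    """Map each model and dataset to its percent change."""
    return {
        model: {name: relative_change(*pair) for name, pair in datasets.items()}
        for model, datasets in results.items()
    }


if __name__ == "__main__":
    for model, changes in summarize(RESULTS).items():
        for name, pct in changes.items():
            print(f"{model:>4} {name:<10} {pct:+.2f}%")
```

On these numbers, 13B, 30B, and 65B improve on all three datasets, while 7B improves only on wikitext2 and gets worse on ptb-new and c4-new.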

magnet:?xt=urn:btih:88f7d9d2460ffcaf78b21e83012de00939eacb65&dn=LLaMA-HF-4bit-128g&tr=http%3a%2f%2fbt2.archive.org%3a6969%2fannounce&tr=http%3a%2f%2fbt1.archive.org%3a6969%2fannounce

Or unzip for the torrent file. LLaMA-HF-4bit-128g.zip

To run this one you need to use groupsize 128: python server.py --listen --wbits 4 --cai-chat --groupsize 128 --model llama-7b-4bit-128g

After 3 consecutive days of trying to run server.py on WSL, I finally got it working, so I'm writing a simple, straightforward tutorial to share with the community. I want to include the download of these models (the ones that work flawlessly in this specific setup), but downloading the torrent through the WSL terminal is a pain, as I don't know how to download only the required model instead of the complete torrent. So my question is whether these files are available elsewhere (on HF, for example), or whether someone knows how to download only one of the folders (models) from the torrent through the WSL terminal. I'd greatly appreciate it, because that last step is all the tutorial is missing :) Keep up the amazing work guys

@wywywywy
Contributor

In Win11, you can run Linux GUI apps through WSL. Or you can run a torrent client with a web UI inside of WSL. That way it's much easier to manage the torrents and select only the files you need.

@Highlyhotgames

Thanks! I'm on the latest version of Win10, which uses WSL2 (just like Win11), so it will run Linux GUI apps too. Still, I want this tutorial to be very straightforward, so a simple command line that retrieves only the needed folder from the torrent would be great. I've been researching ways to do this, but from the command line I can only find how to download the torrent in its entirety, so if someone knows how to do this from a Linux terminal, please let me know. Sorry if this seems off-topic, but I thought asking here, where the torrent link is, would save someone time.
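
One possible command-line route, assuming `aria2c` is installed inside WSL: its `--show-files` option lists the files in a torrent with their indexes, and `--select-file` downloads only the chosen indexes. The sketch below only builds the two command lines; the torrent file name and the index values in the usage note are placeholders, and this has not been tried against the torrent in this thread.

```python
import subprocess


def aria2c_list_command(torrent_path):
    """Command that prints the torrent's file list with indexes."""
    return ["aria2c", "--show-files", str(torrent_path)]


def aria2c_select_command(torrent_path, file_indexes, out_dir="."):
    """Command that downloads only the given file indexes, then stops seeding."""
    if not file_indexes:
        raise ValueError("pick at least one file index from the --show-files listing")
    selection = ",".join(str(i) for i in sorted(set(file_indexes)))
    return [
        "aria2c",
        f"--select-file={selection}",
        f"--dir={out_dir}",
        "--seed-time=0",
        str(torrent_path),
    ]


def run(command):
    """Run one of the commands above and return its exit code."""
    return subprocess.run(command, check=False).returncode
```

A typical session would first run `aria2c_list_command("LLaMA-HF-4bit-128g.torrent")` to find which indexes belong to the wanted model folder, then pass those indexes to `aria2c_select_command`.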

@USBhost
Contributor

USBhost commented Apr 5, 2023

My last update for visibility: #530 (comment)
I have updated Stock LLaMA to include the latest tokenizer fixes. See my edit for info. Anyways I will be uploading them to HF tomorrow.

Edit: https://huggingface.co/Neko-Institute-of-Science

@Highlyhotgames

Guys I finally made it. So now you can use the installation script I created to install this on WSL in a very easy way. The link is:
https://github.com/Highlyhotgames/fast_txtgen_7B

I'm very new to all of this github world, so comments - negative or positive - will be of great help! I want to include a part saying like "Thanks to oobabooga for the text-generation-webui, USBhost for the 4-bit quantized models and qwopqwop200 for the GPTQ-for-Llama" can anyone help me on how I should properly post this thanks message?

And I made a video tutorial too, but I don't even know if it's ok to share it here, I hope so! --> https://youtu.be/RcHIOVtYB7g

@thistleknot

Loading alpaca-native-4bit...
Loading model ...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/distvol/text-generation-webui/server.py:308 in │
│ │
│ 305 │ │ i = int(input()) - 1 │
│ 306 │ │ print() │
│ 307 │ shared.model_name = available_models[i] │
│ ❱ 308 shared.model, shared.tokenizer = load_model(shared.model_name) │
│ 309 if shared.args.lora: │
│ 310 │ add_lora_to_model(shared.args.lora) │
│ 311 │
│ │
│ /mnt/distvol/text-generation-webui/modules/models.py:102 in load_model │
│ │
│ 99 │ elif shared.args.wbits > 0: │
│ 100 │ │ from modules.GPTQ_loader import load_quantized │
│ 101 │ │ │
│ ❱ 102 │ │ model = load_quantized(model_name) │
│ 103 │ │
│ 104 │ # llamacpp model │
│ 105 │ elif shared.is_llamacpp: │
│ │
│ /mnt/distvol/text-generation-webui/modules/GPTQ_loader.py:135 in load_quantized │
│ │
│ 132 │ │ model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.a │
│ 133 │ else: │
│ 134 │ │ threshold = False if model_type == 'gptj' else 128 │
│ ❱ 135 │ │ model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.a │
│ 136 │ │ │
│ 137 │ │ # accelerate offload (doesn't work properly) │
│ 138 │ │ if shared.args.gpu_memory: │
│ │
│ /mnt/distvol/text-generation-webui/modules/GPTQ_loader.py:63 in _load_quant │
│ │
│ 60 │ │ from safetensors.torch import load_file as safe_load │
│ 61 │ │ model.load_state_dict(safe_load(checkpoint), strict=False) │
│ 62 │ else: │
│ ❱ 63 │ │ model.load_state_dict(torch.load(checkpoint), strict=False) │
│ 64 │ model.seqlen = 2048 │
│ 65 │ print('Done.') │
│ 66 │
│ │
│ /mnt/distvol/python_user/python_user/gpt/lib/python3.9/site-packages/torch/serialization.py:795 │
│ in load │
│ │
│ 792 │ │ │ │ return _legacy_load(opened_file, map_location, _weights_only_unpickler, │
│ 793 │ │ │ except RuntimeError as e: │
│ 794 │ │ │ │ raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None │
│ ❱ 795 │ │ return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args │
│ 796 │
│ 797 │
│ 798 # Register pickling support for layout instances such as │
│ │
│ /mnt/distvol/python_user/python_user/gpt/lib/python3.9/site-packages/torch/serialization.py:1002 │
│ in _legacy_load │
│ │
│ 999 │ │ │ f"Received object of type "{type(f)}". Please update to Python 3.8.2 or ne │
│ 1000 │ │ │ "functionality.") │
│ 1001 │ │
│ ❱ 1002 │ magic_number = pickle_module.load(f, **pickle_load_args) │
│ 1003 │ if magic_number != MAGIC_NUMBER: │
│ 1004 │ │ raise RuntimeError("Invalid magic number; corrupt file?") │
│ 1005 │ protocol_version = pickle_module.load(f, **pickle_load_args) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnpicklingError: invalid load key, 'v'.

@thistleknot

thistleknot commented Apr 16, 2023

fixed (lfs newbie)

	yum install git-lfs
	cd into cloned dir
	git lfs install
	git lfs pull
	git lfs smudge
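
The `invalid load key, 'v'` error above is consistent with the checkpoint on disk being a Git LFS pointer (a small text file that begins with `version https://git-lfs.github.com/spec/v1`) rather than the real weights, which fits the fix of pulling the LFS objects. A minimal sketch for spotting such files before loading follows; the helper names and the default file patterns are assumptions for this example.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(model_dir, patterns=("*.pt", "*.safetensors")):
    """List checkpoint files under model_dir that are still LFS pointers."""
    root = Path(model_dir)
    return sorted(
        path
        for pattern in patterns
        for path in root.rglob(pattern)
        if path.is_file() and is_lfs_pointer(path)
    )
```

If `find_lfs_pointers` returns anything for a cloned model folder, the large files were never downloaded and `git lfs pull` is the likely remedy.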

Ph0rk0z pushed a commit to Ph0rk0z/text-generation-webui-testing that referenced this pull request Apr 17, 2023
@Bortus-AI

> Guys I finally made it. So now you can use the installation script I created to install this on WSL in a very easy way. The link is: https://github.com/Highlyhotgames/fast_txtgen_7B
>
> I'm very new to all of this github world, so comments - negative or positive - will be of great help! I want to include a part saying like "Thanks to oobabooga for the text-generation-webui, USBhost for the 4-bit quantized models and qwopqwop200 for the GPTQ-for-Llama" can anyone help me on how I should properly post this thanks message?
>
> And I made a video tutorial too, but I don't even know if it's ok to share it here, I hope so! --> https://youtu.be/RcHIOVtYB7g

That worked perfectly thank you!

@gmm005
Copy link

gmm005 commented May 2, 2023

What is meant by these? Do we need to update these lines of code after downloading? I have followed your tutorial but could not run it successfully.
