New conversion script #545
Looks fantastic! 🎉 Agree that the conversion scripts should be merged as one. Have you checked that the sha256 checksums match for files produced with the old and new scripts? So that no bits or bytes are accidentally dropped roadside on the way. Minor comment, I think naming it like |
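One way to run the checksum comparison suggested above is sketched here. The function names `sha256_of` and `same_output` are hypothetical, and the sketch assumes both scripts were run on the same input with the same options; it only tells you whether the two output files are byte-identical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-gigabyte models need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_output(old_file: Path, new_file: Path) -> bool:
    """True if the files written by the old and new scripts are byte-identical."""
    return sha256_of(old_file) == sha256_of(new_file)
```

A mismatch does not by itself mean data was lost: a different tensor order or split layout would also change the hash.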
Could you maybe add safetensors support? People are starting to distribute GPTQ weights in that format instead, since it doesn't allow arbitrary code execution. Usually it's just a matter of using Edit: Looks like if you use |
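For context on why safetensors is easy to support without extra dependencies: a safetensors file starts with an 8-byte little-endian length, followed by a JSON header that maps each tensor name to its `dtype`, `shape`, and `data_offsets` (relative to the end of the header). The sketch below reads that header with the standard library only. The function names are hypothetical, and this is not the code in the conversion script.

```python
import json
import struct
from typing import Any, BinaryIO, Dict, Iterator, List, Tuple


def read_safetensors_header(f: BinaryIO) -> Tuple[Dict[str, Any], int]:
    """Return (header, data_start), where data_start is the file offset of the tensor bytes."""
    (header_len,) = struct.unpack("<Q", f.read(8))  # unsigned 64-bit, little-endian
    header = json.loads(f.read(header_len))
    return header, 8 + header_len


def list_tensors(f: BinaryIO) -> Iterator[Tuple[str, str, List[int], int, int]]:
    """Yield (name, dtype, shape, absolute_offset, byte_length) for every tensor."""
    header, data_start = read_safetensors_header(f)
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        begin, end = info["data_offsets"]
        yield name, info["dtype"], info["shape"], data_start + begin, end - begin
```

Because the header is plain JSON and the payload is raw bytes, nothing is unpickled, which is why the format cannot execute arbitrary code on load.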
I used the script to convert a 7B, 3.77GB 4-bit GPTQ model (no groups). The converted file, however, is 5.39GB. Is this expected? It's also very slow compared to the RTN q4 model, because its size now makes it swap to disk. |
You should verify whether your script works with the new techniques proposed by @qwopqwop200. I think it doesn't, as someone reported an error here: #442 (comment) |
@luxtiasco It's not finished yet. Once qwopqwop200 is able to make "act-order" and "groupsize 128" work together, we'll get a really great quantization 😄👍 |
https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_currently_quantizing_llama65b_30b_and_13b/
|
qwopqwop200/GPTQ-for-LLaMa@4e141a8 The madman did it! Now it's possible to get both groupsize and act-order ! |
🦙 ! I'm fully OK with this change - cannot comment on the Python code as I don't have experience |
Looks very promising! Single file models would be nice. The main thing I want is for the tensors to be mmap()'able. For that to happen, multi-dimensional tensors need to be laid out in the file so that they can be loaded without being reshaped. The memory layout on disk should be the same as what ggml wants in memory at runtime. Ideally, the format should also observe a 32-byte alignment. Does this change do that? If not, could it? |
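To make the alignment request concrete, here is a minimal sketch using NumPy. It assumes a hypothetical writer that pads with zero bytes before each tensor; it is not the ggml file format, and `write_aligned` and `map_tensor` are made-up names.

```python
import numpy as np

ALIGNMENT = 32  # bytes, the alignment asked for in the comment above


def write_aligned(f, array: np.ndarray) -> int:
    """Pad the file to the next ALIGNMENT boundary, write the array, and return its offset."""
    pad = -f.tell() % ALIGNMENT  # 0..31 zero bytes needed to reach the boundary
    f.write(b"\0" * pad)
    offset = f.tell()
    # Write the bytes in C order, i.e. exactly as they should sit in memory.
    f.write(np.ascontiguousarray(array).tobytes())
    return offset


def map_tensor(path, offset: int, dtype, shape) -> np.memmap:
    """Map a tensor straight from the file; no copy and no reshape of the data."""
    return np.memmap(path, dtype=dtype, mode="r", offset=offset, shape=shape)
```

With this layout the reader only needs the offset, dtype, and shape of each tensor, and the operating system pages the data in on demand.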
Unfortunately, the conversion script seems to break when applied to models generated using the latest version of qwopqwop200/GPTQ-for-LLaMa@4c15f16. In this case, the output is:
Loaded 'transformers' model split into 1 parts.
Writing vocab...
[1/291] Writing tensor tok_embeddings.weight, size 32001 x 4096...
[2/291] Writing tensor norm.weight, size 4096...
[3/291] Writing tensor output.weight, size 32001 x 4096...
Traceback (most recent call last):
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 673, in <module>
    main()
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 671, in main
    OutputFile.write_all(outfile, params, model, vocab)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 579, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 508, in bounded_parallel_map
    result = futures.pop(0).result()
  File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 577, in <lambda>
    ndarrays = bounded_parallel_map(lambda lazy_tensor: lazy_tensor.load().ggml_ndarray(), model.values(),
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 357, in load
    tensor = lazy_tensor.load()
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 399, in load
    return QuantizedTensor(model, namebase)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 187, in __init__
    scales = load_unquantized(model[f"{namebase}.scales"], np.float32)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 178, in load_unquantized
    tensor = lazy_tensor.load()
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 473, in load
    return UnquantizedTensor(storage.load(storage_offset, elm_count).reshape(size))
ValueError: cannot reshape array of size 1 into shape (1,4096) |
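The final `ValueError` in the traceback above is what NumPy raises when the number of elements read does not match the shape recorded for the tensor (here 1 element against a shape that needs 4096). A loader could check this up front and name the offending tensor. The helper below is a hypothetical sketch of such a check, not part of `convert.py`.

```python
import math
from typing import Tuple

import numpy as np


def checked_reshape(data: np.ndarray, shape: Tuple[int, ...], name: str) -> np.ndarray:
    """Reshape `data`, but fail with the tensor name if the element counts disagree."""
    expected = math.prod(shape)  # number of elements the recorded shape requires
    if data.size != expected:
        raise ValueError(
            f"{name}: read {data.size} element(s), but shape {shape} needs {expected}"
        )
    return data.reshape(shape)
```

This does not fix the underlying format mismatch; it only makes the failure easier to trace back to a specific tensor such as the `.scales` entry being loaded in the traceback.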
The memory layout matches, but there is currently no alignment. I was thinking of adding that, but it will require a format change, whereas this PR happens to be compatible with the existing format (since there was already an option to adjust the per-file split), so I decided to leave it out of this one. Regarding other feedback, I’ll take a look soon. I’m also thinking about adding support for reading files that are already in GGML format so that they can be upgraded without needing the original. This is despite the fact that I think it’s probably advisable to make the loader backwards-compatible moving forward rather than requiring upgrades. Even with a change to add mmap support, there should be a fallback path that supports existing non-aligned files. But if you want to actually benefit from mmap, you’ll need alignment and thus a format upgrade. |
Would it be possible to ensure that the tensor data is aligned by padding the tensor names with zeros? That should allow us to do it without changing the file format. |
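The name-padding idea can be sketched as follows. It assumes a hypothetical record layout in which the tensor data immediately follows the name, and that the reader strips trailing NUL bytes from names; neither assumption is confirmed by this thread, and `pad_name` is a made-up helper.

```python
ALIGNMENT = 32  # bytes


def pad_name(name: bytes, name_start: int) -> bytes:
    """Append NUL bytes so that the byte right after the name falls on an ALIGNMENT boundary.

    `name_start` is the file offset at which the name will be written.
    """
    end = name_start + len(name)
    return name + b"\0" * (-end % ALIGNMENT)


# Example: a name written at offset 20 is padded so the data starts at offset 64.
padded = pad_name(b"tok_embeddings.weight", 20)
data_offset = 20 + len(padded)
```

The stored name length would have to include the padding, which is the part of this approach that relies on the loader tolerating the extra zero bytes.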
I thought of that, but it seemed like an ugly hack for not much benefit. It’s not hard to change the C++ side; it just seemed convenient to make it a separate change to avoid merge conflicts and such. (Edit: Not that I have a particularly strong objection to doing it that way; it just isn’t what I’d choose.) |
Something that may also help (as suggested by xloem in discord) would be making sure that the tensors in the model file are in the same order as they are accessed during inference. This should especially help in systems without enough memory to keep the entire model in memory. I think it is already very close to being that way, but may be worth double checking. |
@slaren There's a discord server for the project or about llama in general? |
@Belluxx Kind of yes. @slaren and I have been collaborating on Redbean's Discord server, which has an #AI channel. There's no official chatroom for the llama.cpp project yet, however you're all welcome to join us on the Redbean Discord until that happens! https://discord.gg/AqSvHf4u |
Does the new conversion script work better with generic PyTorch models (such as https://huggingface.co/THUDM/chatglm-6b)? |
I can confirm it doesn't work with the new implementations of the GPTQ quantization.
I got this error:
|
Current status: Working, except for the latest GPTQ-for-LLaMa format that includes `g_idx`. This turns out to require changes to GGML, so for now it only works if you use the `--outtype` option to dequantize it back to f16 (which is pointless except for debugging).
I also included some cleanup for the C++ code.
This script is meant to replace all the existing conversion scripts (including the ones that convert from older GGML formats), while also adding support for some new formats. Specifically, I've tested with:
- [x] `LLaMA` (original)
- [x] `llama-65b-4bit`
- [x] `alpaca-native`
- [x] `alpaca-native-4bit`
- [x] LLaMA converted to 'transformers' format using `convert_llama_weights_to_hf.py`
- [x] `alpaca-native` quantized with `--true-sequential --act-order --groupsize 128` (dequantized only)
- [x] same as above plus `--save_safetensors`
- [x] GPT4All
- [x] stock unversioned ggml
- [x] ggmh
There's enough overlap in the logic needed to handle these different cases that it seemed best to move to a single script.
I haven't tried this with Alpaca-LoRA because I don't know where to find it.
Useful features:
- Uses multiple threads for a speedup in some cases (though the Python GIL limits the gain, and sometimes it's disk-bound anyway).
- Combines split models into a single file (both the intra-tensor split of the original and the inter-tensor split of 'transformers' format files). Single files are more convenient to work with and more friendly to future changes to use memory mapping on the C++ side. To accomplish this without increasing memory requirements, it has some custom loading code which avoids loading whole input files into memory at once.
- Because of the custom loading code, it no longer depends on PyTorch, which might make installing dependencies slightly easier or faster... although it still depends on NumPy and sentencepiece, so I don't know if there's any meaningful difference. In any case, I also added a requirements.txt file to lock the dependency versions in case of any future breaking changes.
- Type annotations checked with mypy.
- Some attempts to be extra user-friendly:
  - The script tries to be forgiving with arguments, e.g. you can specify either the model file itself or the directory containing it.
  - The script doesn't depend on config.json / params.json, just in case the user downloaded files individually and doesn't have those handy. But you still need tokenizer.model and, for Alpaca, added_tokens.json.
  - The script tries to give a helpful error message if added_tokens.json is missing.
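The multi-threading and "avoid loading whole files at once" points can be illustrated with a bounded parallel map: tensors are loaded in worker threads, results come back in input order, and only a fixed number of loads are in flight at any time. The name matches the `bounded_parallel_map` function visible in the traceback earlier in this thread, but the body below is an independent sketch and not the script's actual code.

```python
import collections
import concurrent.futures
import itertools
from typing import Callable, Iterable, Iterator, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def bounded_parallel_map(
    func: Callable[[In], Out], iterable: Iterable[In], concurrency: int
) -> Iterator[Out]:
    """Like map(), but runs up to `concurrency` calls in threads and yields results in order."""
    it = iter(iterable)
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        # Prime the window with at most `concurrency` pending calls.
        pending = collections.deque(
            executor.submit(func, item) for item in itertools.islice(it, concurrency)
        )
        while pending:
            result = pending.popleft().result()  # oldest first, so order is preserved
            # Refill the window with one new item, if any input is left.
            for item in itertools.islice(it, 1):
                pending.append(executor.submit(func, item))
            yield result
```

Because the window is bounded, memory use stays proportional to `concurrency` tensors rather than to the whole model, at the cost of stalling when one slow item sits at the head of the queue.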
Updates:
This weekend hopefully I'll get to fixing compatibility with the latest GPTQ. |
Looks like GitHub doesn't give me a merge button even with the approval and checks passing (not sure why), but feel free to merge, @ggerganov. Thanks! |
Thank you for the hard work and for another very well done contribution! |
Since |
after #545 we do not need torch, tqdm and requests in the dependencies
How can convert.py be used to migrate an old ggml model to the new ggml format? Attempting to do so blindly results in this error:
|
You can download the tokenizer.model it's missing from HF, eg at this link: https://huggingface.co/TheBloke/alpaca-lora-65B-HF/resolve/main/tokenizer.model PS. If you want a newer 65B Alpaca Lora model, using newer and better 4bit quantisation techniques, try the q4_0, q4_2 or q4_3 models from my repo here: https://huggingface.co/TheBloke/alpaca-lora-65B-GGML . q4_2 seems to be the quantisation format that people regard as best at the moment. |
Thank you for this! I still cannot get the conversion done due to a different error, but I downloaded your model and it is now working much better. Can you tell me what is the difference from the later q4_3 model? That one is larger. |
Which version of GPTQ-for-LLaMa produces a model without g_idx? |
@big-thousand I believe you are in the wrong repository. |
Would like to check if there is now support for converting GPTQ 4-bit quantized models to GGML |
@Interpause you will have better quality without the GPTQ in-between. |
I think they're asking because llama.cpp convert.py can convert old GPTQ models to GGML, but only if they don't have the new GPTQ However @big-thousand and @Interpause I do not recommend you do this. I tested using convert.py to convert GPTQ -> GGML and the perplexity (model accuracy) was very poor. Much worse than using llama.cpp's own quantize feature. I think this is partly because you have to use an old version of GPTQ to do the conversion. I suggest you make new GGMLs using llama.cpp That's what I do now for all my model releases on HF. I do float16 -> GPTQ, and separately I do float16 -> GGML |
I would just like to ask: the current GGML 4-bit quantization does some form of error correction, right? The main rationale for wanting to use GPTQ is to mitigate the increase in perplexity. Is GGML's 4-bit quantization already on par with or superior to GPTQ? |
I am actually testing that right this second. I wrote a perplexity calc for GPTQ that runs 100% the same algorithm as the Here are some early results from my testing (which I will publish properly soon): Llama 7B:
So you see that for 4bit, GPTQ is slightly better. Best result is 6.0422 or 6.0653. Although this requires But llama.cpp also offers 5bit, and this out-performs GPTQ 4bit. And now that llama.cpp has CUDA GPU acceleration, it may be it can compete on performance as well. So it will be up to the user to decide what is best for them and their use case. I will publish more results, and benchmarks, soon. |
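For readers unfamiliar with the metric being compared: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch of the standard definition follows; how the tools in this thread chunk the evaluation text and which tokens they score is not covered here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-(1/N) * sum(log p_i)) over the natural-log probabilities of the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

As a sanity check, a model that assigns probability 1/4 to every token has a perplexity of 4, which is why the metric is often read as the effective number of choices per token.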
@TheBloke - When doing the comparisons, don't forget to include the file sizes. These are important |
Yeah fair enough. I've edited that in. When I publish the full results I'll include a table and spreadsheet with all the details. |
this should be its own release on pypi |