New conversion script #545

Merged
merged 1 commit into from
Apr 14, 2023

Conversation

comex
Contributor

@comex comex commented Mar 27, 2023

Current status: Working, except for the latest GPTQ-for-LLaMa format that includes g_idx. This turns out to require changes to GGML, so for now it only works if you use the --outtype option to dequantize it back to f16 (which is pointless except for debugging).

I also included some cleanup for the C++ code.

This script is meant to replace all the existing conversion scripts (including the ones that convert from older GGML formats), while also adding support for some new formats. Specifically, I've tested with:

  • LLaMA (original)
  • llama-65b-4bit
  • alpaca-native
  • alpaca-native-4bit
  • LLaMA converted to 'transformers' format using convert_llama_weights_to_hf.py
  • alpaca-native quantized with --true-sequential --act-order --groupsize 128 (dequantized only)
  • same as above plus --save_safetensors
  • GPT4All
  • stock unversioned ggml
  • ggmh
  • alpaca-30b-4bit.pt
  • alpaca-30b-4bit.safetensors
  • alpaca-30b-4bit-128g.safetensors
  • koala-13B-HF
  • koala-13B-4bit-128g.safetensors (dequantized only)
  • koala-13B-4bit-128g.pt

There's enough overlap in the logic needed to handle these different cases that it seemed best to move to a single script.

I haven't tried this with Alpaca-LoRA because I don't know where to find it.

Useful features:

  • Uses multiple threads for a speedup in some cases (though the Python GIL limits the gain, and sometimes it's disk-bound anyway).

  • Combines split models into a single file (both the intra-tensor split of the original and the inter-tensor split of 'transformers' format files). Single files are more convenient to work with and more friendly to future changes to use memory mapping on the C++ side. To accomplish this without increasing memory requirements, it has some custom loading code which avoids loading whole input files into memory at once.

  • Because of the custom loading code, it no longer depends on PyTorch, which might make installing dependencies slightly easier or faster... although it still depends on NumPy and sentencepiece, so I don't know if there's any meaningful difference. In any case, I also added a requirements.txt file to lock the dependency versions in case of any future breaking changes.

  • Type annotations checked with mypy.

  • Some attempts to be extra user-friendly:

    • The script tries to be forgiving with arguments, e.g. you can specify either the model file itself or the directory containing it.

    • The script doesn't depend on config.json / params.json, just in case the user downloaded files individually and doesn't have those handy. But you still need tokenizer.model and, for Alpaca, added_tokens.json.

    • The script tries to give a helpful error message if added_tokens.json is missing.
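
For readers curious about the multi-threading bullet above: the tracebacks later in this thread show a helper named bounded_parallel_map that calls futures.pop(0).result(). The body below is only a guess at how such a helper could look, not the code from convert.py. It keeps at most `concurrency` loads in flight and yields results in input order, so memory use stays bounded while tensors are written sequentially.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def bounded_parallel_map(func: Callable[[In], Out],
                         iterable: Iterable[In],
                         concurrency: int) -> Iterator[Out]:
    """Yield func(item) in input order, with at most `concurrency` calls pending.

    Hypothetical reconstruction: only the function name and the
    `futures.pop(0).result()` call are taken from the traceback.
    """
    items = iter(iterable)
    exhausted = False
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures: List[Future] = []
        while not exhausted or futures:
            # Top up the queue until `concurrency` tasks are pending.
            while not exhausted and len(futures) < concurrency:
                try:
                    item = next(items)
                except StopIteration:
                    exhausted = True
                    break
                futures.append(executor.submit(func, item))
            if futures:
                # Always wait on the oldest task so output order matches input order.
                yield futures.pop(0).result()
```

Because the generator only submits new work when the consumer asks for the next result, a slow writer naturally throttles the loaders.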

@anzz1
Contributor

anzz1 commented Mar 27, 2023

Looks fantastic! 🎉

Agree that the conversion scripts should be merged as one.

Have you checked that the sha256 checksums match for files produced with the old and new scripts? So that no bits or bytes are accidentally dropped roadside on the way.
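
A check like this can be done with the standard library alone. The sketch below hashes each file in chunks so that multi-gigabyte model files never have to fit in memory; the function names are made up for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

You would call same_contents() on the output of the old script and the output of the new one. Note that the digests can only match for models the old scripts also wrote as a single file with the same settings.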

Minor comment: I think a name like convert-model-to-ggml.py would be more descriptive, as the name convert.py doesn't really tell you its purpose.

@green-s

green-s commented Mar 27, 2023

Could you maybe add safetensors support? People are starting to distribute GPTQ weights in that format instead, since it doesn't allow arbitrary code execution. Usually it's just a matter of using safetensors.torch.load_file in place of torch.load, but since you're not using torch.load it might be a bit trickier.

Edit:

Looks like if you use safetensors.safe_open you can load lazily/partially and in numpy format if you specify framework="numpy".
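
For a script that wants to avoid extra dependencies, the safetensors container is also simple enough to read by hand: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape and data_offsets, then the raw tensor bytes. The sketch below reads a single tensor's bytes without touching the rest of the file. The function names are hypothetical, and this is not necessarily how convert.py ends up doing it.

```python
import json
import struct
from typing import Any, BinaryIO, Dict, Tuple


def read_safetensors_header(f: BinaryIO) -> Tuple[Dict[str, Any], int]:
    """Parse the JSON header; return it with the offset where tensor data begins."""
    f.seek(0)
    (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
    header = json.loads(f.read(header_len).decode("utf-8"))
    return header, 8 + header_len


def read_tensor_bytes(f: BinaryIO, header: Dict[str, Any],
                      data_start: int, name: str) -> bytes:
    """Read only the bytes of one tensor; data_offsets are relative to data_start."""
    begin, end = header[name]["data_offsets"]
    f.seek(data_start + begin)
    return f.read(end - begin)
```

The header may also contain a "__metadata__" entry, which is not a tensor and should be skipped when iterating over names.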

@anzz1 anzz1 added enhancement New feature or request script Script related labels Mar 27, 2023
@Belluxx

Belluxx commented Mar 27, 2023

I used the script to convert a 7B, 3.77GB, 4-bit GPTQ model (no groups). The converted file, however, is 5.39GB. Is this expected?

It's also very slow compared to the RTN q4 model, because at that size it now swaps to disk.

@BadisG

BadisG commented Mar 28, 2023

You should verify that your script works with the new techniques proposed by @qwopqwop200:
https://github.com/qwopqwop200/GPTQ-for-LLaMa

I don't think it does, as someone reported an error here: #442 (comment)

@BadisG

BadisG commented Mar 28, 2023

@luxtiasco It's not finished yet. Once qwopqwop200 is able to make "act-order" and "groupsize 128" work together, we'll get a really great quantization 😄👍

@BadisG

BadisG commented Mar 28, 2023

https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_currently_quantizing_llama65b_30b_and_13b/

It looks like "act-order" gives smaller models and better output than "groupsize" (32 or 128), making the latter irrelevant when used alone.

@BadisG

BadisG commented Mar 28, 2023

qwopqwop200/GPTQ-for-LLaMa@4e141a8

The madman did it! Now it's possible to get both groupsize and act-order!

@ggerganov
Owner

🦙 !

I'm fully OK with this change. I cannot comment on the Python code as I don't have experience with it.
cc @jart - this change makes it possible to generate single-file models from the get-go. It might have some relevance for the mmap stuff, so I'm bringing it to your attention just in case.

@jart
Contributor

jart commented Mar 28, 2023

Looks very promising! Single-file models would be nice. The main thing I want is for the tensors to be mmap()'able. For that to happen, multi-dimensional tensors need to be laid out in the file in such a way that they don't need to be reshaped in order to be loaded. The memory layout on disk should be the same as what ggml wants in memory at runtime. Ideally, the format should also observe a 32-byte alignment. Does this change do that? If not, could it?
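
To make the alignment requirement concrete, here is a minimal sketch of a writer that pads each tensor payload to a 32-byte boundary, plus a reader that takes values straight out of an mmap at a known offset. This illustrates the idea only. It is not the GGML file format, and all names are invented for the example.

```python
import mmap
import struct

ALIGNMENT = 32  # bytes, as requested above


def padding_for(offset, alignment=ALIGNMENT):
    """Number of zero bytes needed to round `offset` up to a multiple of `alignment`."""
    return -offset % alignment


def write_aligned(f, payload):
    """Pad with zeros up to the next aligned offset, write payload, return its start."""
    f.write(b"\x00" * padding_for(f.tell()))
    start = f.tell()
    f.write(payload)
    return start


def read_floats(path, start, count):
    """Map the file read-only and decode `count` little-endian float32 values at `start`."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return list(struct.unpack_from(f"<{count}f", mm, start))
```

A C++ loader would go one step further and point tensor data directly at the mapped region. That only works if the on-disk layout already matches the in-memory layout, which is the point being made above.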

@plabadens

plabadens commented Mar 28, 2023

Unfortunately, the conversion script seems to break when applied to models generated using the latest version of qwopqwop200/GPTQ-for-LLaMa@4c15f16. In this case, it was the alpaca-native model, quantized to 4-bit with --act-order, --true-sequential and --groupsize 128.

Output
Loaded 'transformers' model split into 1 parts.
Writing vocab...
[1/291] Writing tensor tok_embeddings.weight, size 32001 x 4096...
[2/291] Writing tensor norm.weight, size 4096...
[3/291] Writing tensor output.weight, size 32001 x 4096...
Traceback (most recent call last):
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 673, in <module>
    main()
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 671, in main
    OutputFile.write_all(outfile, params, model, vocab)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 579, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 508, in bounded_parallel_map
    result = futures.pop(0).result()
  File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/nix/store/iw1vmh509hcbby8dbpsaanbri4zsq7dj-python3-3.10.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 577, in <lambda>
    ndarrays = bounded_parallel_map(lambda lazy_tensor: lazy_tensor.load().ggml_ndarray(), model.values(),
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 357, in load
    tensor = lazy_tensor.load()
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 399, in load
    return QuantizedTensor(model, namebase)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 187, in __init__
    scales = load_unquantized(model[f"{namebase}.scales"], np.float32)
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 178, in load_unquantized
    tensor = lazy_tensor.load()
  File "/home/pierre/Development/llama/llama.cpp/convert.py", line 473, in load
    return UnquantizedTensor(storage.load(storage_offset, elm_count).reshape(size))
ValueError: cannot reshape array of size 1 into shape (1,4096)

@slaren slaren mentioned this pull request Mar 29, 2023
4 tasks
@comex
Contributor Author

comex commented Mar 29, 2023

The memory layout on disk, should be the same as what ggml wants in memory at runtime. The format should also observe ideally a 32-byte alignment. Does this change do that? If not, could it?

The memory layout matches, but there is currently no alignment. I was thinking of adding that, but it will require a format change, whereas this PR happens to be compatible with the existing format (since there was already an option to adjust the per-file split), so I decided to leave it out of this one.

Regarding other feedback, I’ll take a look soon.

I’m also thinking about adding support for reading files that are already in GGML format so that they can be upgraded without needing the original. This is despite the fact that I think it’s probably advisable to make the loader backwards-compatible moving forward rather than requiring upgrades. Even with a change to add mmap support, there should be a fallback path that supports existing non-aligned files. But if you want to actually benefit from mmap, you’ll need alignment and thus a format upgrade.

@slaren
Collaborator

slaren commented Mar 29, 2023

Would it be possible to ensure that the tensor data is aligned by padding the tensor names with zeros? That should allow us to do it without changing the file format.
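
As a rough illustration of the name-padding idea: if each tensor record is laid out as three int32 fields (number of dimensions, name length, type), then one int32 per dimension, then the name bytes, then the data, the name can be extended with NUL bytes until the data lands on a 32-byte boundary. That header layout is an assumption made for this sketch, and the helper name is hypothetical.

```python
import struct

ALIGNMENT = 32


def padded_name(name: bytes, record_start: int, n_dims: int) -> bytes:
    """Append NUL bytes to `name` so the tensor data that follows is 32-byte aligned.

    Assumed record layout (not verified against the loader):
    int32 n_dims, int32 name_len, int32 ftype, n_dims * int32 dims, name, data.
    """
    fixed_size = struct.calcsize("<3i") + struct.calcsize(f"<{n_dims}i")
    data_start = record_start + fixed_size + len(name)
    return name + b"\x00" * (-data_start % ALIGNMENT)
```

The stored name length would then include the padding, so the loader would have to strip trailing NULs before looking the tensor up by name. Whether the existing loader tolerates that is not established here.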

@comex
Contributor Author

comex commented Mar 29, 2023

I thought of that, but it seemed like an ugly hack for not much benefit. It’s not hard to change the C++ side; it just seemed convenient to make it a separate change to avoid merge conflicts and such. (Edit: Not that I have a particularly strong objection to doing it that way; it just isn’t what I’d choose.)

@slaren
Collaborator

slaren commented Mar 29, 2023

Something that may also help (as suggested by xloem on Discord) would be making sure that the tensors in the model file are in the same order as they are accessed during inference. This should especially help on systems without enough memory to keep the entire model in memory. I think it is already very close to being that way, but it may be worth double-checking.

@Belluxx

Belluxx commented Mar 29, 2023

@slaren Is there a Discord server for the project, or about llama in general?

@jart
Contributor

jart commented Mar 29, 2023

@Belluxx Kind of, yes. @slaren and I have been collaborating on Redbean's Discord server, which has an #AI channel. There's no official chatroom for the llama.cpp project yet, but you're all welcome to join us on the Redbean Discord until that happens! https://discord.gg/AqSvHf4u

@linouxis9

Does the new conversion script work better with generic PyTorch models? (Such as https://huggingface.co/THUDM/chatglm-6b)
Thanks :-)

@BadisG

BadisG commented Apr 1, 2023

I can confirm it doesn't work with the new implementations of GPTQ quantization.
I tried it with the gpt4-x-alpaca-13b-native-4bit-128g.pt model, which was converted this way:

CUDA_VISIBLE_DEVICES=0 python llama.py ./models/chavinlo-gpt4-x-alpaca --wbits 4 --true-sequential --act-order --groupsize 128 --save gpt-x-alpaca-13b-native-4bit-128g.pt

I got this error:

D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python GPTQ-to-GGML.py gpt4-x-alpaca-13b-native-4bit-128g.pt --vocab-dir TokenDIR
Loaded 'transformers' model split into 1 parts.
Writing vocab...
[1/363] Writing tensor tok_embeddings.weight, size 32001 x 5120...
[2/363] Writing tensor norm.weight, size 5120...
[3/363] Writing tensor output.weight, size 32001 x 5120...
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 673, in <module>
    main()
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 671, in main
    OutputFile.write_all(outfile, params, model, vocab)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 579, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 508, in bounded_parallel_map
    result = futures.pop(0).result()
  File "C:\Users\Utilisateur\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "C:\Users\Utilisateur\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "C:\Users\Utilisateur\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 577, in <lambda>
    ndarrays = bounded_parallel_map(lambda lazy_tensor: lazy_tensor.load().ggml_ndarray(), model.values(),
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 357, in load
    tensor = lazy_tensor.load()
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 399, in load
    return QuantizedTensor(model, namebase)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 187, in __init__
    scales = load_unquantized(model[f"{namebase}.scales"], np.float32)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\GPTQ-to-GGML.py", line 181, in load_unquantized
    assert tensor.ndarray.dtype == expected_dtype, (tensor.ndarray.dtype, expected_dtype)
AssertionError: (dtype('float16'), <class 'numpy.float32'>)

@comex comex force-pushed the convert-script branch 2 times, most recently from 358bb6c to 80ae52a Compare April 2, 2023 02:47
comex added a commit to comex/llama.cpp that referenced this pull request Apr 2, 2023
@comex comex marked this pull request as ready for review April 2, 2023 03:04
comex added a commit to comex/llama.cpp that referenced this pull request Apr 14, 2023
@comex
Contributor Author

comex commented Apr 14, 2023

Updates:

  • Fixed Python 3.8 compatibility. (By the way, installing sentencepiece on Python 3.11 also works fine for me, but maybe it depends on the OS.)
  • Fixed faulthandler incompatibility with Windows.
  • Fixed TypeError: 'staticmethod' object is not callable.
  • Fixed error if scales is fp16 instead of fp32 (Koala in dequantize mode).

This weekend hopefully I'll get to fixing compatibility with the latest GPTQ.

@comex
Contributor Author

comex commented Apr 14, 2023

Looks like GitHub doesn't give me a merge button even with the approval and checks passing (not sure why), but feel free to merge, @ggerganov. Thanks!

@ggerganov ggerganov merged commit 723dac5 into ggerganov:master Apr 14, 2023
@ggerganov
Owner

@comex

Thank you for the hard work and for another very well done contribution!

@DannyDaemonic
Contributor

DannyDaemonic commented Apr 14, 2023

Since requirements.txt is going into the root directory, to avoid confusion we should consider renaming it to something like convert-reqs.txt or conversion-requirements.txt, as those requirements are specific to the conversion scripts and are not requirements for llama.cpp.

prusnak added a commit that referenced this pull request Apr 14, 2023
after #545 we do not need torch, tqdm and requests in the dependencies
sw pushed a commit that referenced this pull request Apr 14, 2023
after #545 we do not need torch, tqdm and requests in the dependencies
@vmajor

vmajor commented Apr 22, 2023

How can convert.py be used to migrate an old GGML model to the new GGML format? Attempting to do so blindly results in this error:

python convert.py --outfile ../alpaca.cpp_65b_ggml/new_ggml-model-q4_0.bin ../alpaca.cpp_65b_ggml/ggml-model-q4_0.bin

raise FileNotFoundError(f"Could not find tokenizer.model in {path} or its parent; if it's in another directory, pass the directory as --vocab-dir")
FileNotFoundError: Could not find tokenizer.model in ../alpaca.cpp_65b_ggml or its parent; if it's in another directory, pass the directory as --vocab-dir

@TheBloke
Contributor

How can convert.py be used to migrate old ggml model to the new ggml model? Attempting to do so blindly results in this error:

python convert.py --outfile ../alpaca.cpp_65b_ggml/new_ggml-model-q4_0.bin ../alpaca.cpp_65b_ggml/ggml-model-q4_0.bin

raise FileNotFoundError(f"Could not find tokenizer.model in {path} or its parent; if it's in another directory, pass the directory as --vocab-dir")
FileNotFoundError: Could not find tokenizer.model in ../alpaca.cpp_65b_ggml or its parent; if it's in another directory, pass the directory as --vocab-dir

You can download the missing tokenizer.model from HF, e.g. at this link: https://huggingface.co/TheBloke/alpaca-lora-65B-HF/resolve/main/tokenizer.model
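
Going by the error message, the script looks for tokenizer.model next to the model file and in the parent directory, unless --vocab-dir points elsewhere. The sketch below reconstructs that lookup from the message alone. The function name is hypothetical and the real code in convert.py may differ.

```python
from pathlib import Path
from typing import Optional


def find_tokenizer(model_path: Path, vocab_dir: Optional[Path] = None) -> Path:
    """Locate tokenizer.model, as inferred from the error message (not from convert.py).

    With vocab_dir given, only that directory is checked; otherwise the model's
    directory and its parent are tried in that order.
    """
    if vocab_dir is not None:
        candidates = [vocab_dir]
    else:
        base = model_path if model_path.is_dir() else model_path.parent
        candidates = [base, base.parent]
    for directory in candidates:
        candidate = directory / "tokenizer.model"
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"Could not find tokenizer.model in {candidates[0]} or its parent; "
        "if it's in another directory, pass the directory as --vocab-dir"
    )
```

So placing the downloaded tokenizer.model in ../alpaca.cpp_65b_ggml, or passing its directory with --vocab-dir, should get past this particular error.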

PS. If you want a newer 65B Alpaca Lora model, using newer and better 4bit quantisation techniques, try the q4_0, q4_2 or q4_3 models from my repo here: https://huggingface.co/TheBloke/alpaca-lora-65B-GGML . q4_2 seems to be the quantisation format that people regard as best at the moment.

@vmajor

vmajor commented Apr 23, 2023

Thank you for this! I still cannot get the conversion done due to a different error, but I downloaded your model and it is now working much better. Can you tell me how it differs from the later q4_3 model? That one is larger.

@Green-Sky
Collaborator

@vmajor #1121

@big-thousand

Which version of GPTQ-for-LLaMa produces a model without g_idx?

@Green-Sky
Collaborator

@big-thousand I believe you are in the wrong repository.

@Interpause

I would like to check whether there is now support for converting GPTQ 4-bit quantized models to GGML.

@Green-Sky
Collaborator

@Interpause you will have better quality without the GPTQ in-between.

@TheBloke
Contributor

I think they're asking because llama.cpp convert.py can convert old GPTQ models to GGML, but only if they don't have the new GPTQ g_idx format.

However @big-thousand and @Interpause I do not recommend you do this. I tested using convert.py to convert GPTQ -> GGML and the perplexity (model accuracy) was very poor. Much worse than using llama.cpp's own quantize feature.

I think this is partly because you have to use an old version of GPTQ to do the conversion.

I suggest you make new GGMLs using llama.cpp quantize. It will result in the highest quality model, and will be faster than going float16 -> GPTQ -> GGML.

That's what I do now for all my model releases on HF. I do float16 -> GPTQ, and separately I do float16 -> GGML

@Interpause

I would just like to ask: the current GGML 4-bit quantization does some form of error correction, right? The main rationale behind wanting to use GPTQ is to mitigate the increase in perplexity. Is GGML's 4-bit quantization already on par with or superior to GPTQ?

@TheBloke
Contributor

TheBloke commented May 15, 2023

Just would like to ask, current GGML 4 bit does some form of error correction right? The main rationale behind wanting to use GPTQ is to mitigate increase in perplexity. Is GGML's 4 bit quantization already on par or superior to GPTQ?

I am actually testing that right this second. I wrote a perplexity calc for GPTQ that runs 100% the same algorithm as the perplexity tool in llama.cpp, so the results are comparable.

Here are some early results from my testing (which I will publish properly soon):

Llama 7B:

  • float16 (13.0GB) : 5.9066
  • llama.cpp q4_0 (4.0GB) : 6.1565
  • llama.cpp q4_1 (4.8GB) : 6.0910
  • llama.cpp q5_0 (4.4GB) : 5.9862
  • llama.cpp q5_1 (4.8GB) : 5.9481
  • llama.cpp q8_0 (7.1GB) : 5.9069
  • AutoGPTQ 4bit 32g no desc_act (4.0GB) : 6.2650
  • AutoGPTQ 4bit 32g desc_act (4.0GB) : 6.0422
  • AutoGPTQ 4bit 128g no desc_act (3.7GB) : 6.3850
  • AutoGPTQ 4bit 128g desc_act (3.7GB) : 6.0653

So you can see that at 4-bit, GPTQ is slightly better: its best results are 6.0422 (32g) and 6.0653 (128g), against 6.0910 for llama.cpp q4_1. However, this requires desc_act, which currently has performance implications: it slows down inference a fair bit at the moment.

But llama.cpp also offers 5-bit, and this outperforms GPTQ 4-bit. And now that llama.cpp has CUDA GPU acceleration, it may be able to compete on performance as well.

So it will be up to the user to decide what is best for them and their use case.

I will publish more results, and benchmarks, soon.
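
For anyone who wants to compare these numbers more directly, the small script below expresses each perplexity as a percentage increase over the float16 baseline. It uses only the figures listed above; the dictionary layout and helper name are just for illustration.

```python
BASELINE = 5.9066  # float16 perplexity from the list above

# name -> (file size in GB, perplexity), copied from the results above
results = {
    "llama.cpp q4_0": (4.0, 6.1565),
    "llama.cpp q4_1": (4.8, 6.0910),
    "llama.cpp q5_0": (4.4, 5.9862),
    "llama.cpp q5_1": (4.8, 5.9481),
    "llama.cpp q8_0": (7.1, 5.9069),
    "AutoGPTQ 4bit 32g no desc_act": (4.0, 6.2650),
    "AutoGPTQ 4bit 32g desc_act": (4.0, 6.0422),
    "AutoGPTQ 4bit 128g no desc_act": (3.7, 6.3850),
    "AutoGPTQ 4bit 128g desc_act": (3.7, 6.0653),
}


def relative_increase(ppl, baseline=BASELINE):
    """Perplexity increase over the baseline, in percent."""
    return (ppl - baseline) / baseline * 100


# Print the formats from best to worst perplexity.
for name, (size_gb, ppl) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name:<32} {size_gb:>4.1f} GB  {ppl:.4f}  +{relative_increase(ppl):.2f}%")
```

Lower is better, and the file size column matters when deciding whether a smaller perplexity gap is worth the extra memory.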

@ggerganov
Owner

@TheBloke - When doing the comparisons, don't forget to include the file sizes. These are important

@TheBloke
Contributor

Yeah fair enough. I've edited that in.

When I publish the full results I'll include a table and spreadsheet with all the details.

@earonesty

this should be its own release on pypi
