Recompute llamafile-quantize documentation
Showing 2 changed files with 109 additions and 97 deletions.
@@ -1,76 +1,83 @@
LLAMAFILE-QUANTIZE(1)     BSD General Commands Manual     LLAMAFILE-QUANTIZE(1)

NAME
     llamafile-quantize — large language model quantizer

SYNOPSIS
     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
     [nthreads]

DESCRIPTION
     llamafile-quantize converts large language model weights from the float32
     or float16 formats into smaller data types from 2 to 8 bits in size.

OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: This can severely reduce quality compared to quantizing
             from 16-bit or 32-bit.

     --leave-output-tensor
             Will leave output.weight un(re)quantized. Increases model size
             but may also increase quality, especially when requantizing.

     --pure  Disable k-quant mixtures and quantize all tensors to the same
             type.

ARGUMENTS
     The following positional arguments are accepted:

     model-f32.gguf
             Is the input file, which contains the unquantized model weights
             in either the float32 or float16 format.

     model-quant.gguf
             Is the output file, which will contain quantized weights in the
             desired format. If this path isn't specified, it'll default to
             [inp path]/ggml-model-[ftype].gguf.

     type    Is the desired quantization format, which may be the integer id
             of a supported quantization type, or its name. See the
             QUANTIZATION TYPES section below for acceptable formats.

     nthreads
             Number of threads to use during computation (default: nproc/2).

QUANTIZATION TYPES
     The following quantization types are available:

     -   2  Q4_0    3.56G  +0.2166 ppl @ LLaMA-v1-7B
     -   3  Q4_1    3.90G  +0.1585 ppl @ LLaMA-v1-7B
     -   8  Q5_0    4.33G  +0.0683 ppl @ LLaMA-v1-7B
     -   9  Q5_1    4.70G  +0.0349 ppl @ LLaMA-v1-7B
     -  10  Q2_K    2.63G  +0.6717 ppl @ LLaMA-v1-7B
     -  12  Q3_K    alias for Q3_K_M
     -  11  Q3_K_S  2.75G  +0.5551 ppl @ LLaMA-v1-7B
     -  12  Q3_K_M  3.07G  +0.2496 ppl @ LLaMA-v1-7B
     -  13  Q3_K_L  3.35G  +0.1764 ppl @ LLaMA-v1-7B
     -  15  Q4_K    alias for Q4_K_M
     -  14  Q4_K_S  3.59G  +0.0992 ppl @ LLaMA-v1-7B
     -  15  Q4_K_M  3.80G  +0.0532 ppl @ LLaMA-v1-7B
     -  17  Q5_K    alias for Q5_K_M
     -  16  Q5_K_S  4.33G  +0.0400 ppl @ LLaMA-v1-7B
     -  17  Q5_K_M  4.45G  +0.0122 ppl @ LLaMA-v1-7B
     -  18  Q6_K    5.15G  -0.0008 ppl @ LLaMA-v1-7B
     -   7  Q8_0    6.70G  +0.0004 ppl @ LLaMA-v1-7B
     -   1  F16    13.00G  @ 7B
     -   0  F32    26.00G  @ 7B
     -  COPY  Only copy tensors, no quantizing.

SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual               December 5, 2023               Llamafile Manual
LLAMAFILE-QUANTIZE(1)        General Commands Manual      LLAMAFILE-QUANTIZE(1)

NAME
     llamafile-quantize — large language model quantizer

SYNOPSIS
     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
     [nthreads]

DESCRIPTION
     llamafile-quantize converts large language model weights from the
     float32 or float16 formats into smaller data types from 2 to 8 bits in
     size.
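     For example, the following invocation (the file names here are
     placeholders, not files shipped with llamafile) would convert a float16
     checkpoint to the Q4_K_M format using 8 threads:

           llamafile-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M 8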
OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: This can severely reduce quality compared to quantizing
             from 16-bit or 32-bit.

     --leave-output-tensor
             Will leave output.weight un(re)quantized. Increases model size
             but may also increase quality, especially when requantizing.

     --pure  Disable k-quant mixtures and quantize all tensors to the same
             type.
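     As a sketch of how these flags compose (the model names are again
     hypothetical), requantizing an already-quantized file while leaving
     output.weight untouched could look like:

           llamafile-quantize --allow-requantize --leave-output-tensor \
               model-Q8_0.gguf model-Q4_K_M.gguf Q4_K_M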
ARGUMENTS
     The following positional arguments are accepted:

     model-f32.gguf
             Is the input file, which contains the unquantized model weights
             in either the float32 or float16 format.

     model-quant.gguf
             Is the output file, which will contain quantized weights in the
             desired format. If this path isn't specified, it'll default to
             [inp path]/ggml-model-[ftype].gguf.

     type    Is the desired quantization format, which may be the integer id
             of a supported quantization type, or its name. See the
             QUANTIZATION TYPES section below for acceptable formats.

     nthreads
             Number of threads to use during computation (default: nproc/2).
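     Since type accepts either an id or a name, the following two invocations
     (with a hypothetical input file) request the same Q6_K output; both omit
     model-quant.gguf, so the result is written to the default
     [inp path]/ggml-model-[ftype].gguf location:

           llamafile-quantize model-f32.gguf Q6_K
           llamafile-quantize model-f32.gguf 18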
QUANTIZATION TYPES
     The following quantization types are available. This table shows the ID
     of the quantization format, its name, the file size of 7B model weights
     that use it, and finally the amount of quality badness it introduces as
     measured by the llamafile-perplexity tool averaged over 128 chunks with
     the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered in accordance with
     how recommended the quantization format is for general usage.

     -  18  Q6_K    5.6gb  +0.0446 ppl (q6 kawrakow)
     -   7  Q8_0    7.2gb  +0.0022 ppl (q8 gerganov)
     -   1  F16     14gb   +0.0000 ppl (best but biggest)
     -   8  Q5_0    4.7gb  +0.0817 ppl (q5 gerganov zero)
     -  17  Q5_K_M  4.8gb  +0.0836 ppl (q5 kawrakow medium)
     -  16  Q5_K_S  4.7gb  +0.1049 ppl (q5 kawrakow small)
     -  15  Q4_K_M  4.1gb  +0.3132 ppl (q4 kawrakow medium)
     -  14  Q4_K_S  3.9gb  +0.3408 ppl (q4 kawrakow small)
     -  13  Q3_K_L  3.6gb  +0.5736 ppl (q3 kawrakow large)
     -  12  Q3_K_M  3.3gb  +0.7612 ppl (q3 kawrakow medium)
     -  11  Q3_K_S  3.0gb  +1.3834 ppl (q3 kawrakow small)
     -  10  Q2_K    2.6gb  +4.2359 ppl (tiniest hallucinates most)
     -  32  BF16    14gb   +0.0000 ppl (canonical but cpu/cuda only)
     -   0  F32     27gb    9.0952 ppl (reference point)
     -   2  Q4_0    3.9gb  +0.3339 ppl (legacy)
     -   3  Q4_1    4.3gb  +0.4163 ppl (legacy)
     -   9  Q5_1    5.1gb  +0.1091 ppl (legacy)
     -  12  Q3_K    alias for Q3_K_M
     -  15  Q4_K    alias for Q4_K_M
     -  17  Q5_K    alias for Q5_K_M
     -  COPY  Only copy tensors, no quantizing.
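     The ppl deltas above could be reproduced with a pipeline along these
     lines (a sketch only: the file paths are placeholders, and the
     llama.cpp-style -m, -f, and --chunks flags of llamafile-perplexity are
     assumed here rather than documented on this page):

           llamafile-quantize tinyllama-1.1b-chat-f16.gguf tiny-Q6_K.gguf Q6_K
           llamafile-perplexity -m tiny-Q6_K.gguf -f wiki.test.raw --chunks 128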
SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual               December 5, 2023           LLAMAFILE-QUANTIZE(1)