Recompute llamafile-quantize documentation
Showing 2 changed files with 109 additions and 97 deletions.
@@ -1,76 +1,83 @@
LLAMAFILE-QUANTIZE(1)     BSD General Commands Manual     LLAMAFILE-QUANTIZE(1)

NAME
     llamafile-quantize — large language model quantizer

SYNOPSIS
     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
     [nthreads]

DESCRIPTION
     llamafile-quantize converts large language model weights from the float32
     or float16 formats into smaller data types from 2 to 8 bits in size.

OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: This can severely reduce quality compared to quantizing
             from 16-bit or 32-bit.

     --leave-output-tensor
             Will leave output.weight un(re)quantized. Increases model size
             but may also increase quality, especially when requantizing.

     --pure  Disable k-quant mixtures and quantize all tensors to the same
             type.

ARGUMENTS
     The following positional arguments are accepted:

     model-f32.gguf
             Is the input file, which contains the unquantized model weights
             in either the float32 or float16 format.

     model-quant.gguf
             Is the output file, which will contain quantized weights in the
             desired format. If this path isn't specified, it'll default to
             [inp path]/ggml-model-[ftype].gguf.

     type    Is the desired quantization format, which may be the integer id
             of a supported quantization type, or its name. See the
             QUANTIZATION TYPES section below for acceptable formats.

     nthreads
             Number of threads to use during computation (default: nproc/2).

QUANTIZATION TYPES
     The following quantization types are available:

     -   2  Q4_0    3.56G  +0.2166 ppl @ LLaMA-v1-7B
     -   3  Q4_1    3.90G  +0.1585 ppl @ LLaMA-v1-7B
     -   8  Q5_0    4.33G  +0.0683 ppl @ LLaMA-v1-7B
     -   9  Q5_1    4.70G  +0.0349 ppl @ LLaMA-v1-7B
     -  10  Q2_K    2.63G  +0.6717 ppl @ LLaMA-v1-7B
     -  12  Q3_K    alias for Q3_K_M
     -  11  Q3_K_S  2.75G  +0.5551 ppl @ LLaMA-v1-7B
     -  12  Q3_K_M  3.07G  +0.2496 ppl @ LLaMA-v1-7B
     -  13  Q3_K_L  3.35G  +0.1764 ppl @ LLaMA-v1-7B
     -  15  Q4_K    alias for Q4_K_M
     -  14  Q4_K_S  3.59G  +0.0992 ppl @ LLaMA-v1-7B
     -  15  Q4_K_M  3.80G  +0.0532 ppl @ LLaMA-v1-7B
     -  17  Q5_K    alias for Q5_K_M
     -  16  Q5_K_S  4.33G  +0.0400 ppl @ LLaMA-v1-7B
     -  17  Q5_K_M  4.45G  +0.0122 ppl @ LLaMA-v1-7B
     -  18  Q6_K    5.15G  -0.0008 ppl @ LLaMA-v1-7B
     -   7  Q8_0    6.70G  +0.0004 ppl @ LLaMA-v1-7B
     -   1  F16    13.00G  @ 7B
     -   0  F32    26.00G  @ 7B
     -  COPY  Only copy tensors, no quantizing.

SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual               December 5, 2023               Llamafile Manual
LLAMAFILE-QUANTIZE(1)        General Commands Manual      LLAMAFILE-QUANTIZE(1)

NAME
     llamafile-quantize — large language model quantizer

SYNOPSIS
     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
     [nthreads]

DESCRIPTION
     llamafile-quantize converts large language model weights from the
     float32 or float16 formats into smaller data types from 2 to 8 bits in
     size.
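     For example, the following invocation (the file names here are
     placeholders, not files shipped with llamafile) would convert a float16
     checkpoint to the Q4_K_M format using 8 threads:

           llamafile-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M 8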
OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: This can severely reduce quality compared to quantizing
             from 16-bit or 32-bit.

     --leave-output-tensor
             Will leave output.weight un(re)quantized. Increases model size
             but may also increase quality, especially when requantizing.

     --pure  Disable k-quant mixtures and quantize all tensors to the same
             type.
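     As a sketch of how these flags compose (the model names are again
     hypothetical), requantizing an already-quantized file while leaving
     output.weight untouched could look like:

           llamafile-quantize --allow-requantize --leave-output-tensor \
               model-Q8_0.gguf model-Q4_K_M.gguf Q4_K_M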
ARGUMENTS
     The following positional arguments are accepted:

     model-f32.gguf
             Is the input file, which contains the unquantized model weights
             in either the float32 or float16 format.

     model-quant.gguf
             Is the output file, which will contain quantized weights in the
             desired format. If this path isn't specified, it'll default to
             [inp path]/ggml-model-[ftype].gguf.

     type    Is the desired quantization format, which may be the integer id
             of a supported quantization type, or its name. See the
             QUANTIZATION TYPES section below for acceptable formats.

     nthreads
             Number of threads to use during computation (default: nproc/2).
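     Since type accepts either an id or a name, the following two invocations
     (with a hypothetical input file) request the same Q6_K output; both omit
     model-quant.gguf, so the result is written to the default
     [inp path]/ggml-model-[ftype].gguf location:

           llamafile-quantize model-f32.gguf Q6_K
           llamafile-quantize model-f32.gguf 18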
QUANTIZATION TYPES
     The following quantization types are available. This table shows the ID
     of the quantization format, its name, the file size of 7B model weights
     that use it, and finally the amount of quality badness it introduces as
     measured by the llamafile-perplexity tool averaged over 128 chunks with
     the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered in accordance with
     how recommended the quantization format is for general usage.

     -  18  Q6_K    5.6gb  +0.0446 ppl (q6 kawrakow)
     -   7  Q8_0    7.2gb  +0.0022 ppl (q8 gerganov)
     -   1  F16     14gb   +0.0000 ppl (best but biggest)
     -   8  Q5_0    4.7gb  +0.0817 ppl (q5 gerganov zero)
     -  17  Q5_K_M  4.8gb  +0.0836 ppl (q5 kawrakow medium)
     -  16  Q5_K_S  4.7gb  +0.1049 ppl (q5 kawrakow small)
     -  15  Q4_K_M  4.1gb  +0.3132 ppl (q4 kawrakow medium)
     -  14  Q4_K_S  3.9gb  +0.3408 ppl (q4 kawrakow small)
     -  13  Q3_K_L  3.6gb  +0.5736 ppl (q3 kawrakow large)
     -  12  Q3_K_M  3.3gb  +0.7612 ppl (q3 kawrakow medium)
     -  11  Q3_K_S  3.0gb  +1.3834 ppl (q3 kawrakow small)
     -  10  Q2_K    2.6gb  +4.2359 ppl (tiniest hallucinates most)
     -  32  BF16    14gb   +0.0000 ppl (canonical but cpu/cuda only)
     -   0  F32     27gb    9.0952 ppl (reference point)
     -   2  Q4_0    3.9gb  +0.3339 ppl (legacy)
     -   3  Q4_1    4.3gb  +0.4163 ppl (legacy)
     -   9  Q5_1    5.1gb  +0.1091 ppl (legacy)
     -  12  Q3_K    alias for Q3_K_M
     -  15  Q4_K    alias for Q4_K_M
     -  17  Q5_K    alias for Q5_K_M
     -  COPY  Only copy tensors, no quantizing.
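     The ppl deltas above could be reproduced with a pipeline along these
     lines (a sketch only: the file paths are placeholders, and the
     llama.cpp-style -m, -f, and --chunks flags of llamafile-perplexity are
     assumed here rather than documented on this page):

           llamafile-quantize tinyllama-1.1b-chat-f16.gguf tiny-Q6_K.gguf Q6_K
           llamafile-perplexity -m tiny-Q6_K.gguf -f wiki.test.raw --chunks 128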
SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual               December 5, 2023           LLAMAFILE-QUANTIZE(1)