Recompute llamafile-quantize documentation
jart committed May 25, 2024
1 parent 261dfe7 commit 07e87bf
Showing 2 changed files with 109 additions and 97 deletions.
47 changes: 26 additions & 21 deletions llama.cpp/quantize/quantize.1
@@ -38,49 +38,54 @@ Is the desired quantization format, which may be the integer id of a supported q
 Number of threads to use during computation (default: nproc/2)
 .El
 .Sh QUANTIZATION TYPES
-The following quantization types are available:
+The following quantization types are available. This table shows the ID
+of the quantization format, its name, the file size of 7B model weights
+that use it, and finally the amount of quality badness it introduces as
+measured by the llamafile-perplexity tool averaged over 128 chunks with
+the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered in accordance with
+how recommended the quantization format is for general usage.
 .Pp
 .Bl -dash -compact
 .It
-   2 Q4_0 3.56G +0.2166 ppl @ LLaMA-v1-7B
+  18 Q6_K 5.6gb +0.0446 ppl (q6 kawrakow)
 .It
-   3 Q4_1 3.90G +0.1585 ppl @ LLaMA-v1-7B
+   7 Q8_0 7.2gb +0.0022 ppl (q8 gerganov)
 .It
-   8 Q5_0 4.33G +0.0683 ppl @ LLaMA-v1-7B
+   1 F16 14gb +0.0000 ppl (best but biggest)
 .It
-   9 Q5_1 4.70G +0.0349 ppl @ LLaMA-v1-7B
+   8 Q5_0 4.7gb +0.0817 ppl (q5 gerganov zero)
 .It
-  10 Q2_K 2.63G +0.6717 ppl @ LLaMA-v1-7B
+  17 Q5_K_M 4.8gb +0.0836 ppl (q5 kawrakow medium)
 .It
-  12 Q3_K alias for Q3_K_M
+  16 Q5_K_S 4.7gb +0.1049 ppl (q5 kawrakow small)
 .It
-  11 Q3_K_S 2.75G +0.5551 ppl @ LLaMA-v1-7B
+  15 Q4_K_M 4.1gb +0.3132 ppl (q4 kawrakow medium)
 .It
-  12 Q3_K_M 3.07G +0.2496 ppl @ LLaMA-v1-7B
+  14 Q4_K_S 3.9gb +0.3408 ppl (q4 kawrakow small)
 .It
-  13 Q3_K_L 3.35G +0.1764 ppl @ LLaMA-v1-7B
+  13 Q3_K_L 3.6gb +0.5736 ppl (q3 kawrakow large)
 .It
-  15 Q4_K alias for Q4_K_M
+  12 Q3_K_M 3.3gb +0.7612 ppl (q3 kawrakow medium)
 .It
-  14 Q4_K_S 3.59G +0.0992 ppl @ LLaMA-v1-7B
+  11 Q3_K_S 3.0gb +1.3834 ppl (q3 kawrakow small)
 .It
-  15 Q4_K_M 3.80G +0.0532 ppl @ LLaMA-v1-7B
+  10 Q2_K 2.6gb +4.2359 ppl (tiniest hallucinates most)
 .It
-  17 Q5_K alias for Q5_K_M
+  32 BF16 14gb +0.0000 ppl (canonical but cpu/cuda only)
 .It
-  16 Q5_K_S 4.33G +0.0400 ppl @ LLaMA-v1-7B
+   0 F32 27gb 9.0952 ppl (reference point)
 .It
-  17 Q5_K_M 4.45G +0.0122 ppl @ LLaMA-v1-7B
+   2 Q4_0 3.9gb +0.3339 ppl (legacy)
 .It
-  18 Q6_K 5.15G -0.0008 ppl @ LLaMA-v1-7B
+   3 Q4_1 4.3gb +0.4163 ppl (legacy)
 .It
-   7 Q8_0 6.70G +0.0004 ppl @ LLaMA-v1-7B
+   9 Q5_1 5.1gb +0.1091 ppl (legacy)
 .It
-  32 BF16 Google Brain Floating Point
+  12 Q3_K alias for Q3_K_M
 .It
-   1 F16 13.00G @ 7B
+  15 Q4_K alias for Q4_K_M
 .It
-   0 F32 26.00G @ 7B
+  17 Q5_K alias for Q5_K_M
 .It
 COPY Only copy tensors, no quantizing.
 .El
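For readers skimming the recomputed table, here is an illustrative invocation following the synopsis documented below. The filenames are hypothetical, and Q6_K is chosen because the new table ranks it first for general usage:

    # Quantize float16 weights down to Q6_K using 8 threads.
    # Input/output filenames are examples, not part of this commit.
    llamafile-quantize llama-7b.F16.gguf llama-7b.Q6_K.gguf Q6_K 8
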
159 changes: 83 additions & 76 deletions llama.cpp/quantize/quantize.1.asc
@@ -1,76 +1,83 @@
-LLAMAFILE-QUANTIZE(1)     BSD General Commands Manual     LLAMAFILE-QUANTIZE(1)
-
-NAME
-     llamafile-quantize — large language model quantizer
-
-SYNOPSIS
-     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
-     [nthreads]
-
-DESCRIPTION
-     llamafile-quantize converts large language model weights from the float32
-     or float16 formats into smaller data types from 2 to 8 bits in size.
-
-OPTIONS
-     The following flags are available:
-
-     --allow-requantize
-             Allows requantizing tensors that have already been quantized.
-             Warning: This can severely reduce quality compared to quantizing
-             from 16bit or 32bit
-
-     --leave-output-tensor
-             Will leave output.weight un(re)quantized. Increases model size
-             but may also increase quality, especially when requantizing
-
-     --pure  Disable k-quant mixtures and quantize all tensors to the same
-             type
-
-ARGUMENTS
-     The following positional arguments are accepted:
-
-     model-f32.gguf
-             Is the input file, which contains the unquantized model weights
-             in either the float32 or float16 format.
-
-     model-quant.gguf
-             Is the output file, which will contain quantized weights in the
-             desired format. If this path isn't specified, it'll default to
-             [inp path]/ggml-model-[ftype].gguf.
-
-     type    Is the desired quantization format, which may be the integer id
-             of a supported quantization type, or its name. See the quantiza‐
-             tion types section below for acceptable formats.
-
-     nthreads
-             Number of threads to use during computation (default: nproc/2)
-
-QUANTIZATION TYPES
-     The following quantization types are available:
-
-     -    2 Q4_0 3.56G +0.2166 ppl @ LLaMA-v1-7B
-     -    3 Q4_1 3.90G +0.1585 ppl @ LLaMA-v1-7B
-     -    8 Q5_0 4.33G +0.0683 ppl @ LLaMA-v1-7B
-     -    9 Q5_1 4.70G +0.0349 ppl @ LLaMA-v1-7B
-     -   10 Q2_K 2.63G +0.6717 ppl @ LLaMA-v1-7B
-     -   12 Q3_K alias for Q3_K_M
-     -   11 Q3_K_S 2.75G +0.5551 ppl @ LLaMA-v1-7B
-     -   12 Q3_K_M 3.07G +0.2496 ppl @ LLaMA-v1-7B
-     -   13 Q3_K_L 3.35G +0.1764 ppl @ LLaMA-v1-7B
-     -   15 Q4_K alias for Q4_K_M
-     -   14 Q4_K_S 3.59G +0.0992 ppl @ LLaMA-v1-7B
-     -   15 Q4_K_M 3.80G +0.0532 ppl @ LLaMA-v1-7B
-     -   17 Q5_K alias for Q5_K_M
-     -   16 Q5_K_S 4.33G +0.0400 ppl @ LLaMA-v1-7B
-     -   17 Q5_K_M 4.45G +0.0122 ppl @ LLaMA-v1-7B
-     -   18 Q6_K 5.15G -0.0008 ppl @ LLaMA-v1-7B
-     -    7 Q8_0 6.70G +0.0004 ppl @ LLaMA-v1-7B
-     -    1 F16 13.00G @ 7B
-     -    0 F32 26.00G @ 7B
-     - COPY Only copy tensors, no quantizing.
-
-SEE ALSO
-     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
-     llava-quantize(1), zipalign(1), unzip(1)
-
-Llamafile Manual               December 5, 2023               Llamafile Manual
+LLAMAFILE-QUANTIZE(1)       General Commands Manual       LLAMAFILE-QUANTIZE(1)
+
+NAME
+     llamafile-quantize — large language model quantizer
+
+SYNOPSIS
+     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
+     [nthreads]
+
+DESCRIPTION
+     llamafile-quantize converts large language model weights from the
+     float32 or float16 formats into smaller data types from 2 to 8 bits in
+     size.
+
+OPTIONS
+     The following flags are available:
+
+     --allow-requantize
+             Allows requantizing tensors that have already been quantized.
+             Warning: This can severely reduce quality compared to quantiz‐
+             ing from 16bit or 32bit
+
+     --leave-output-tensor
+             Will leave output.weight un(re)quantized. Increases model size
+             but may also increase quality, especially when requantizing
+
+     --pure  Disable k-quant mixtures and quantize all tensors to the same
+             type
+
+ARGUMENTS
+     The following positional arguments are accepted:
+
+     model-f32.gguf
+             Is the input file, which contains the unquantized model weights
+             in either the float32 or float16 format.
+
+     model-quant.gguf
+             Is the output file, which will contain quantized weights in the
+             desired format. If this path isn't specified, it'll default to
+             [inp path]/ggml-model-[ftype].gguf.
+
+     type    Is the desired quantization format, which may be the integer id
+             of a supported quantization type, or its name. See the quanti‐
+             zation types section below for acceptable formats.
+
+     nthreads
+             Number of threads to use during computation (default: nproc/2)
+
+QUANTIZATION TYPES
+     The following quantization types are available. This table shows the ID
+     of the quantization format, its name, the file size of 7B model weights
+     that use it, and finally the amount of quality badness it introduces as
+     measured by the llamafile-perplexity tool averaged over 128 chunks with
+     the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered in accordance with
+     how recommended the quantization format is for general usage.
+
+     -   18 Q6_K 5.6gb +0.0446 ppl (q6 kawrakow)
+     -    7 Q8_0 7.2gb +0.0022 ppl (q8 gerganov)
+     -    1 F16 14gb +0.0000 ppl (best but biggest)
+     -    8 Q5_0 4.7gb +0.0817 ppl (q5 gerganov zero)
+     -   17 Q5_K_M 4.8gb +0.0836 ppl (q5 kawrakow medium)
+     -   16 Q5_K_S 4.7gb +0.1049 ppl (q5 kawrakow small)
+     -   15 Q4_K_M 4.1gb +0.3132 ppl (q4 kawrakow medium)
+     -   14 Q4_K_S 3.9gb +0.3408 ppl (q4 kawrakow small)
+     -   13 Q3_K_L 3.6gb +0.5736 ppl (q3 kawrakow large)
+     -   12 Q3_K_M 3.3gb +0.7612 ppl (q3 kawrakow medium)
+     -   11 Q3_K_S 3.0gb +1.3834 ppl (q3 kawrakow small)
+     -   10 Q2_K 2.6gb +4.2359 ppl (tiniest hallucinates most)
+     -   32 BF16 14gb +0.0000 ppl (canonical but cpu/cuda only)
+     -    0 F32 27gb 9.0952 ppl (reference point)
+     -    2 Q4_0 3.9gb +0.3339 ppl (legacy)
+     -    3 Q4_1 4.3gb +0.4163 ppl (legacy)
+     -    9 Q5_1 5.1gb +0.1091 ppl (legacy)
+     -   12 Q3_K alias for Q3_K_M
+     -   15 Q4_K alias for Q4_K_M
+     -   17 Q5_K alias for Q5_K_M
+     - COPY Only copy tensors, no quantizing.
+
+SEE ALSO
+     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
+     llava-quantize(1), zipalign(1), unzip(1)
+
+Llamafile Manual               December 5, 2023           LLAMAFILE-QUANTIZE(1)
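
The perplexity deltas in the new table come from the llamafile-perplexity tool, per the QUANTIZATION TYPES preamble. A minimal sketch of how one such figure might be reproduced, assuming a hypothetical quantized model filename, a wiki.test.raw test corpus, and llama.cpp-style -m/-f/--chunks flags:

    # Average perplexity over 128 chunks of a test corpus.
    # Model filename, corpus path, and --chunks usage are assumptions,
    # not taken from this commit.
    llamafile-perplexity -m TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf \
        -f wiki.test.raw --chunks 128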
