How to run DeepSeek-R1 IQ1_S 1.58bit at 140 Token/Sec #1591

Open
loretoparisi opened this issue Jan 28, 2025 · 8 comments

@loretoparisi

Following the blog post Run DeepSeek R1 Dynamic 1.58-bit, I tried to reproduce the claimed 140 tokens/second when running DeepSeek-R1-UD-IQ1_S, i.e. 1.58-bit / 131 GB / IQ1_S.

My setup was to offload all layers to the GPUs:

 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"

With this config and 2x NVIDIA A100-SXM4-80GB GPUs (per the nvidia-smi output below):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:27:00.0 Off |                    0 |
| N/A   34C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:2A:00.0 Off |                    0 |
| N/A   32C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

resulting in the following performance:

llama_perf_sampler_print:    sampling time =       2.37 ms /    35 runs   (    0.07 ms per token, 14767.93 tokens per second)
llama_perf_context_print:        load time =   21683.87 ms
llama_perf_context_print: prompt eval time =     927.17 ms /    10 tokens (   92.72 ms per token,    10.79 tokens per second)
llama_perf_context_print:        eval time =    2608.16 ms /    24 runs   (  108.67 ms per token,     9.20 tokens per second)
llama_perf_context_print:       total time =    3557.60 ms /    34 tokens

The full llama.cpp output with model details:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
build: 4575 (cae9fb43) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A100-SXM4-80GB) - 80627 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA A100-SXM4-80GB) - 80627 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1025 tensors from DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 BF16
llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   4:                         general.size_label str              = 256x20B
llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  15:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  16:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  17:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  18:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  21:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  22:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  23:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  24:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  25:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  26:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 128815
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - kv  44:                          general.file_type u32              = 24
llama_model_loader: - kv  45:                      quantize.imatrix.file str              = DeepSeek-R1.imatrix
llama_model_loader: - kv  46:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
llama_model_loader: - kv  47:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  48:              quantize.imatrix.chunks_count i32              = 124
llama_model_loader: - kv  49:                                   split.no u16              = 0
llama_model_loader: - kv  50:                        split.tensors.count i32              = 1025
llama_model_loader: - kv  51:                                split.count u16              = 3
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q4_K:  190 tensors
llama_model_loader: - type q5_K:  116 tensors
llama_model_loader: - type q6_K:  184 tensors
llama_model_loader: - type iq2_xxs:    6 tensors
llama_model_loader: - type iq1_s:  168 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ1_S - 1.5625 bpw
print_info: file size   = 130.60 GiB (1.67 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 819
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 128
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 24576
print_info: n_embd_v_gqa     = 16384
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = DeepSeek R1 BF16
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 128815 '<|PAD▁TOKEN|>'
print_info: LF token         = 131 'Ä'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloaded 61/62 layers to GPU
load_tensors:        CUDA0 model buffer size = 65208.70 MiB
load_tensors:        CUDA1 model buffer size = 67299.27 MiB
load_tensors:   CPU_Mapped model buffer size =  1222.09 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:      CUDA0 KV buffer size = 11284.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size = 10920.00 MiB
llama_init_from_model: KV self size  = 22204.00 MiB, K (q4_0): 6588.00 MiB, V (f16): 15616.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model:      CUDA0 compute buffer size =  2218.00 MiB
llama_init_from_model:      CUDA1 compute buffer size =  2218.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    30.01 MiB
llama_init_from_model: graph nodes  = 5025
llama_init_from_model: graph splits = 5 (with bs=512), 4 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12

system_info: n_threads = 12 (n_threads_batch = 12) / 64 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 3407
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.600
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

So my top generation speed was 9-10 tokens per second when offloading 61 layers with 12 threads.
How can I achieve 140 tokens/second?
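
For reference, llama-bench reports prompt-processing and token-generation speed as separate rows, which avoids conflating the two numbers. A minimal sketch, assuming llama-bench was built from the same tree and using the same model path as above:

 ./llama.cpp/build/bin/llama-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 61 -t 12 -ctk q4_0 \
    -p 512 -n 128

The pp512 row is prompt-processing throughput; the tg128 row is single-stream generation speed.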

@loretoparisi
Author

loretoparisi commented Jan 28, 2025

Reported to llama.cpp: ggerganov/llama.cpp#11474

@ikergarcia1996

Same here.

git clone https://github.com/ggerganov/llama.cpp.git

cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON -DLLAMA_SERVER_SSL=ON

cmake --build llama.cpp/build --config Release -j 16 --clean-first -t llama-quantize llama-server llama-cli llama-gguf-split

cp llama.cpp/build/bin/llama-* llama.cpp

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    --temp 0.6 \
    --ctx-size 4096 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Hardware

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0             71W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:10:00.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Performance

llama_perf_sampler_print:    sampling time =       0.62 ms /     5 runs   (    0.12 ms per token,  8038.59 tokens per second)
llama_perf_context_print:        load time =  497613.18 ms
llama_perf_context_print: prompt eval time =    1182.71 ms /    19 tokens (   62.25 ms per token,    16.06 tokens per second)
llama_perf_context_print:        eval time =   77634.37 ms /   601 runs   (  129.18 ms per token,     7.74 tokens per second)
llama_perf_context_print:       total time =   79714.03 ms /   620 tokens

@danielhanchen
Contributor

danielhanchen commented Jan 29, 2025

Hey! Whoops, apologies everyone - I just found out it should be 10 to 14 tokens/s for generation speed, not 140, on 2x H100 (140 tok/s is the prompt eval speed). 😢

Sorry I didn't get any sleep over the past week since I was too excited to pump out the 1.58bit and release it to everyone. 😢

I mentioned most people should expect to get 1 to 3 tokens / s on most local GPUs, so I'm unsure how I missed the 140 tokens / s.

The 140 tokens/s figure is the prompt eval speed - the generation / decode speed is in fact 10 to 14 tokens/s - so I must have reported the wrong line.

E.g. 137.66 tok/s for prompt processing and 10.69 tok/s for decoding:

llama_perf_sampler_print:    sampling time =     199.35 ms /  2759 runs   (    0.07 ms per token, 13839.98 tokens per second)
llama_perf_context_print:        load time =   32281.52 ms
llama_perf_context_print: prompt eval time =    1598.12 ms /   220 tokens (    7.26 ms per token,   137.66 tokens per second)
llama_perf_context_print:        eval time =  237358.50 ms /  2538 runs   (   93.52 ms per token,    10.69 tokens per second)
llama_perf_context_print:       total time =  239477.62 ms /  2758 tokens

I've changed the blog post, docs and everywhere to reflect this issue.

I also uploaded screen recording GIFs showing ~140 tok/s for prompt eval and ~10 tok/s for generation, covering the first minute and the last minute as an example:

[Screen recordings: prompt eval ~140 tok/s, generation ~10 tok/s]

So 140 tok/s is the prompt processing / eval speed, and I reported the wrong line - the decoding speed is 10 to 14 tok/s.

On further analysis, I can see via OpenRouter (https://openrouter.ai/deepseek/deepseek-r1) that the API speed for R1 is around 3 or 4 tokens/s.

Throughput, though, is a different measure - https://artificialanalysis.ai/models/deepseek-r1/providers reports 60 tok/s for DeepSeek's official API.

Assuming ~6 tok/s for DeepSeek per single user, that 60 tok/s throughput works out to roughly 10x the single-user speed (10 concurrent streams x 6 tok/s ≈ 60 tok/s), so batched throughput of that order should be attainable.

@danielhanchen
Contributor

danielhanchen commented Jan 29, 2025

Thanks again @loretoparisi for reporting the issue! I really appreciate the testing and checks - thanks also to @ikergarcia1996 for verifying.

I hope the 1.58-bit model at least functions well as reported! Thanks again for trying it out!

@loretoparisi
Author

@danielhanchen Further benchmarks:

  • 2x H100/80GB, matching ~12 tokens per second:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | type_k |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         pp512 |        276.56 ± 1.24 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         tg128 |         11.89 ± 0.01 |
  • 4x H100/80GB @ 214 TFLOPS:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | type_k |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         pp512 |        273.10 ± 1.41 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         tg128 |         11.84 ± 0.00 |

So apparently inference speed does not scale with the device count. My llama.cpp invocation is:

 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 24 \
    -no-cnv \
    --n-gpu-layers 62 \
    --prio 2 \
    -ub 128 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407

Still looking to further improve inference speed.
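
One option left untested here is llama.cpp's row split mode, which splits each weight tensor across the GPUs instead of assigning whole layers per device; whether it helps for this MoE architecture is an open question. A sketch, assuming the same paths and flags as above:

 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 24 -no-cnv --n-gpu-layers 62 --prio 2 \
    --split-mode row \
    --ctx-size 8192 --temp 0.6 --seed 3407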

@tenzinhl

tenzinhl commented Jan 29, 2025

Thanks for posting @loretoparisi. I was scratching my head at the same thing after trying a few different run parameters.

Can also confirm that I tend to get around 12 tokens/second on two H100 80GBs.

Note that even though the node I was running on had 8 GPUs, I used --device to load onto only two of them (as an aside, I observed the same effect as @loretoparisi: scaling to all 8 doesn't lead to any speedup):

root@deepseek-0 ~/r/l/b/bin> /mnt/home/repos/llama.cpp/build/bin/llama-cli --model /tmp/cache/huggingface/hub/models--unsloth--DeepSeek-R1-GGUF/snapshots/90bbbcf503d0e2566d397c2309897432053e5b58/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --device CUDA0,CUDA1 --cache-type-k q4_0 --threads 32 -no-cnv --n-gpu-layers 62 --prio 2 --temp 0.6 --ctx-size 1024 --seed 3407 --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes

[... model outputs ...]

llama_perf_sampler_print:    sampling time =      67.62 ms /  1024 runs   (    0.07 ms per token, 15143.00 tokens per second)
llama_perf_context_print:        load time =   20953.98 ms
llama_perf_context_print: prompt eval time =     686.04 ms /    12 tokens (   57.17 ms per token,    17.49 tokens per second)
llama_perf_context_print:        eval time =   84607.03 ms /  1011 runs   (   83.69 ms per token,    11.95 tokens per second)
llama_perf_context_print:       total time =   85492.17 ms /  1023 tokens

Going to look into batching and see if the throughput increases.

I see the blog post has been updated, but it still prominently displays "140 tokens/second" as the throughput number (I assume that's with batched inputs?). Was that experimentally verified @danielhanchen?

Update: I tested with ./llama-batched-bench on V3-Q2_K_L (sorry, I know it's not R1, but I think it's comparable: after trying a few different quantizations of R1 and V3 for single-user decode speed, I've found they all get around 11 tokens/second). With batching, the total tokens/second scales to at least 150 tokens/s (I wasn't trying to optimize this carefully, so it can probably go notably higher). This comes at a decent cost to single-user latency, though: a batch size of 8 leads to around 2.25x single-user latency. Also, I'm not sure this would work well in a multi-user server, since to my understanding the server would need to wait to process a user's request until the batch size is reached.

I ran the following command (results here: batch_results.md):

./llama-batched-bench --model /tmp/cache/huggingface/hub/models--unsloth--DeepSeek-V3-GGUF/snapshots/6b9a45d8a30b48660644560df90e5ded5c0cb0e9/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf --ctx-size 2048 --batch-size 2048 -ub 512 -ngl 62 --device CUDA0,CUDA1,CUDA2,CUDA3 -npp 64,128,256 -ntg 128,256 -npl 1,2,4,8
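
Here -npp sets the prompt lengths, -ntg the number of generated tokens per sequence, and -npl the parallel-sequence (batch) counts, so the sweep covers batch sizes 1/2/4/8 at a few prompt/generation lengths. On the wait-for-a-full-batch concern: llama-server schedules requests with continuous batching across its parallel slots, so new requests join in-flight decoding rather than waiting for a fixed batch to fill. A hedged sketch (flag names from llama-server's help as of early-2025 builds; continuous batching may already be on by default):

./llama-server \
    --model /tmp/cache/huggingface/hub/models--unsloth--DeepSeek-V3-GGUF/snapshots/6b9a45d8a30b48660644560df90e5ded5c0cb0e9/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf \
    -ngl 62 --device CUDA0,CUDA1,CUDA2,CUDA3 \
    --ctx-size 8192 \
    -np 8 -cb    # 8 parallel slots (each gets ctx_size/8 context), continuous batching enabled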

@danielhanchen
Contributor

@tenzinhl Nice results! Yes, it's still throughput, i.e. tokens/s x number of batches - the 140 figure is batched.

The single-user tokens/s can be improved to approximately 20 to 30 tokens/s via llama.cpp optimizations - for now it's around 15 tokens/s max.

Prompt processing, i.e. the prefill step, can attain a much higher tokens/s, but sadly, because Flash Attention is not yet enabled for this model, batch processing with Flash Attention won't help yet (it should help reduce latency).

DeepSeek themselves use, I think, 3 (or was it 4?) lm_heads for multi-token prediction, so they essentially added speculative decoding - if this were enabled, we could achieve 40 to 50 tokens/s.

Currently I don't think llama.cpp supports draft models with different vocabularies, but if that were enabled, one would assume using a distilled Llama 8B or Qwen 3B draft would help a lot.
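
If draft models with mismatched vocabularies were supported, the llama-speculative example shows roughly what the invocation would look like. A sketch only: the draft GGUF name here is hypothetical, the flag names are taken from llama-speculative's usage and may differ between builds, and this does not currently work for R1 because of the vocabulary mismatch:

./llama.cpp/llama-speculative \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -md DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
    --draft 8 -ngl 62 -ngld 99 \
    -p "<|User|>What is the capital of Italy?<|Assistant|>"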

@loretoparisi
Author

@danielhanchen Good points! Assuming that speculative decoding can be considered an external component, multi-token prediction could be an option. Regarding attention, do you mean Flash Attention in llama.cpp?
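
(For context: llama-cli and llama-server expose Flash Attention via the -fa / --flash-attn flag; at the time of this thread it was reportedly not usable for the deepseek2 architecture, so the flag may be ignored or rejected. A sketch of how one would try it once supported, reusing the command from the top of the thread:)

 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 --threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
    --flash-attn \
    --temp 0.6 --ctx-size 8192 --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"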
