
Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit #11474

Open · 1 of 5 tasks
loretoparisi opened this issue Jan 28, 2025 · 41 comments

@loretoparisi

Research Stage

  • Background Research (Let's try to avoid reinventing the wheel)
  • Hypothesis Formed (How do you think this will work and what its effect will be?)
  • Strategy / Implementation Forming
  • Analysis of results
  • Debrief / Documentation (So people in the future can learn from us)

Previous existing literature and research

Command

 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"

Model

DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S
1.58bit, 131GB

Hardware

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:27:00.0 Off |                    0 |
| N/A   34C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:2A:00.0 Off |                    0 |
| N/A   32C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Hypothesis

Reported performance is 140 tokens/second

Implementation

No response

Analysis

Llama.cpp Performance Analysis

Raw Benchmarks

llama_perf_sampler_print:    sampling time =       2.45 ms /    35 runs   (    0.07 ms per token, 14297.39 tokens per second)
llama_perf_context_print:        load time =   20988.11 ms
llama_perf_context_print: prompt eval time =    1233.88 ms /    10 tokens (  123.39 ms per token,     8.10 tokens per second)
llama_perf_context_print:        eval time =    2612.63 ms /    24 runs   (  108.86 ms per token,     9.19 tokens per second)
llama_perf_context_print:       total time =    3869.00 ms /    34 tokens

Detailed Analysis

1. Token Sampling Performance

  • Total Time: 2.45 ms for 35 runs
  • Per Token: 0.07 ms
  • Speed: 14,297.39 tokens per second
  • Description: This represents the speed at which the model can select the next token after processing. This is extremely fast compared to the actual generation speed, as it only involves the final selection process.

2. Model Loading

  • Total Time: 20,988.11 ms (≈21 seconds)
  • Description: One-time initialization cost to load the model into memory. This happens only at startup and doesn't affect ongoing performance.

3. Prompt Evaluation

  • Total Time: 1,233.88 ms for 10 tokens
  • Per Token: 123.39 ms
  • Speed: 8.10 tokens per second
  • Description: Initial processing of the prompt is slightly slower than subsequent token generation, as it needs to establish the full context for the first time.

4. Generation Evaluation

  • Total Time: 2,612.63 ms for 24 runs
  • Per Token: 108.86 ms
  • Speed: 9.19 tokens per second
  • Description: This represents the actual speed of generating new tokens, including all neural network computations.

5. Total Processing Time

  • Total Time: 3,869.00 ms
  • Tokens Processed: 34 tokens
  • Average Speed: ≈8.79 tokens per second (quick check below)
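A quick arithmetic check of the averages, using only the numbers from the perf output above (a minimal awk sketch; any calculator works):

awk 'BEGIN {
  printf "overall: %.2f tok/s\n", 34 / 3.869     # 34 tokens / 3.869 s total
  printf "decode:  %.2f tok/s\n", 24 / 2.61263   # 24 runs / 2.61263 s eval
}'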

Key Insights

  1. Performance Bottlenecks:

    • The main bottleneck is in the evaluation phase (actual token generation)
    • While sampling can handle 14K+ tokens per second, actual generation is limited to about 9 tokens per second
    • This difference highlights that the neural network computations, not the token selection process, are the limiting factor
  2. Processing Stages:

    • Model loading is a significant but one-time cost
    • Prompt evaluation is slightly slower than subsequent token generation
    • Sampling is extremely fast compared to evaluation
  3. Overall Performance:

    • The system demonstrates typical llama.cpp performance characteristics for a model of this size
    • The total processing rate of ~9 tokens per second is reasonable for local inference of a 671B-parameter MoE at this quantization

Relevant log output

@winston-bosan

winston-bosan commented Jan 29, 2025

Thanks for the replication! Here is a silly nitpick, shouldn't it be 14 tokens per second?

Reported performance is 140 tokens/second

Original:

The 1.58bit quantization should fit in 160GB of VRAM for fast inference (2x H100 80GB), with it attaining around 14 tokens per second. You don't need VRAM (GPU) to run 1.58bit R1, just 20GB of RAM (CPU) will work, however it may be slow. For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.

@danielhanchen
Contributor

danielhanchen commented Jan 29, 2025

Hey! Whoops, apologies guys - I just found out it should be 10 to 14 tokens / s for generation speed and not 140 (140 tok/s is the prompt eval speed) on 2xH100. 😢

Sorry I didn't get any sleep over the past week since I was too excited to pump out the 1.58bit and release it to everyone. 😢

I mentioned most people should expect to get 1 to 3 tokens / s on most local GPUs, so I'm unsure how I missed the 140 tokens / s.

The 140 tokens / s is the prompt eval time - the generation / decode speed is in fact 10 to 14 tokens / s - so I must have reported the wrong line.

E.g. 137.66 tok/s for prompt processing and 10.69 tok/s for decoding:

llama_perf_sampler_print:    sampling time =     199.35 ms /  2759 runs   (    0.07 ms per token, 13839.98 tokens per second)
llama_perf_context_print:        load time =   32281.52 ms
llama_perf_context_print: prompt eval time =    1598.12 ms /   220 tokens (    7.26 ms per token,   137.66 tokens per second)
llama_perf_context_print:        eval time =  237358.50 ms /  2538 runs   (   93.52 ms per token,    10.69 tokens per second)
llama_perf_context_print:       total time =  239477.62 ms /  2758 tokens

I've changed the blog post, docs and everywhere to reflect this issue.

I also uploaded a screen recording GIF showing 140 tok/s for prompt eval and 10 tok/s for generation, covering the 1st minute and the last minute, to show an example:

Image

Image

So 140 tok/s is the prompt eval speed, and so I reported the wrong line - the decoding speed is 10 to 14 tok/s.

On further analysis - I can see via OpenRouter https://openrouter.ai/deepseek/deepseek-r1 that the API speed is around 3 or 4 tokens/s for R1.

Throughput though is a different measure - https://artificialanalysis.ai/models/deepseek-r1/providers reports 60 tok / s for DeepSeek's official API.

Assuming 6 tok/s per single user for DeepSeek, total throughput of around 10x the single-user tokens/s should be attainable.

@danielhanchen
Contributor

danielhanchen commented Jan 29, 2025

Also @loretoparisi I extremely appreciate the testing, so thanks again!

Again thank you for testing the model out - hope the 1.58bit model functions well!

@ggerganov
Member

ggerganov commented Jan 29, 2025

You can try to offload also the non-repeating tensors by using -ngl 62 instead of -ngl 61. You might have to lower the physical batch size to -ub 128 or -ub 256 to reduce compute buffer sizes and maybe improve the pipeline parallelism with 2 GPUs.
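For concreteness, a minimal sketch of the adjusted invocation (same model path and sampling settings as the opening command; only -ngl and -ub change):

./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    -ub 256 \
    --temp 0.6 --ctx-size 8192 --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"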

Btw, here is a data point for M2 Studio:

r1.mp4

The prompt processing reported by llama-bench is only 23t/s which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention which can also be improved.

@danielhanchen
Contributor

@ggerganov Super cool! Glad it worked well on Mac! I'm a Linux user so I had to ask someone else to test it, but good thing you verified it works smoothly :) Good work again on llama.cpp!

@HabermannR

@ggerganov why did you strike out the fa? Is it working now?

@ggerganov
Member

@ggerganov why did you strike out the fa? Is it working now?

Sorry for the confusion. At first I thought that enabling FA only requires supporting n_embd_head_k != n_embd_head_v, which is doable. But then I remembered that DS uses MLA and thought that the FA implementation that we have is not compatible with this attention mechanism, so I struck it out. But now that I look at the code, it is actually compatible. So the initial point remains valid and FA can be enabled with some work.

@HabermannR

@ggerganov this would enable v quantization, right? And maybe some speed ups?

@ggerganov
Member

It will:

  • Reduce compute memory usage
  • Enable V quantization that reduces the KV cache memory
  • Improve performance at longer contexts

The Metal changes for FA should be relatively simple I think if someone wants to take a stab at it.
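If/when FA is enabled for this architecture, the relevant flags already exist in llama-cli, so the invocation would look roughly like this (a sketch only; at the time of writing the combination does not yet work for DeepSeek-R1):

./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --flash-attn \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --threads 12 --n-gpu-layers 62 --ctx-size 8192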

@loretoparisi
Author

You can try to offload also the non-repeating tensors by using -ngl 62 instead of -ngl 61. You might have to lower the physical batch size to -ub 128 or -ub 256 to reduce compute buffer sizes and maybe improve the pipeline parallelism with 2 GPUs.

Btw, here is a data point for M2 Studio:

r1.mp4
The prompt processing reported by llama-bench is only 23t/s which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention which can also be improved.

@ggerganov thanks! I ran with --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 -ub 256:

sampler seed: 3407
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.600
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

with a small improvement:

llama_perf_sampler_print:    sampling time =       1.98 ms /    35 runs   (    0.06 ms per token, 17667.84 tokens per second)
llama_perf_context_print:        load time =   27176.99 ms
llama_perf_context_print: prompt eval time =     916.83 ms /    10 tokens (   91.68 ms per token,    10.91 tokens per second)
llama_perf_context_print:        eval time =    2308.80 ms /    24 runs   (   96.20 ms per token,    10.40 tokens per second)

How to change pipeline parallelism?

@ggerganov
Member

How to change pipeline parallelism?

It's enabled by default. The -ub parameter will affect the prompt processing speed, and you can tune the value for optimal performance on your system. Just use a larger prompt, or llama-bench, because just 10 tokens of prompt will not give you meaningful results.
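For example, something along these lines with llama-bench (a sketch; llama-bench takes comma-separated lists, so several -ub values can be compared in one run):

./llama.cpp/build/bin/llama-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 62 -t 12 -ctk q4_0 \
    -ub 128,256,512 \
    -p 512 -n 128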

@loretoparisi
Author

How to change pipeline parallelism?

It's enabled by default. The -ub parameter will affect the prompt processing speed, and you can tune the value for optimal performance on your system. Just use a larger prompt, or llama-bench, because just 10 tokens of prompt will not give you meaningful results.

These are the updated results with a larger prompt. It scored 9.41 tokens per second for generation:

llama_perf_sampler_print:    sampling time =     103.30 ms /  1337 runs   (    0.08 ms per token, 12942.63 tokens per second)
llama_perf_context_print:        load time =   23387.53 ms
llama_perf_context_print: prompt eval time =    1102.60 ms /    20 tokens (   55.13 ms per token,    18.14 tokens per second)
llama_perf_context_print:        eval time =  139817.28 ms /  1316 runs   (  106.24 ms per token,     9.41 tokens per second)
llama_perf_context_print:       total time =  141240.50 ms /  1336 tokens

@loretoparisi
Author

@ggerganov these are the test results from llama-bench:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | type_k |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         pp512 |        189.38 ± 1.11 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         tg128 |         10.32 ± 0.01 |

@loretoparisi
Author

loretoparisi commented Jan 29, 2025

Further tests on 2x H100 / 80GB, reaching about 12 tokens per second:

llama_perf_sampler_print:    sampling time =      70.01 ms /  1128 runs   (    0.06 ms per token, 16112.67 tokens per second)
llama_perf_context_print:        load time =   28143.05 ms
llama_perf_context_print: prompt eval time =   54405.96 ms /    20 tokens ( 2720.30 ms per token,     0.37 tokens per second)
llama_perf_context_print:        eval time =   94147.64 ms /  1107 runs   (   85.05 ms per token,    11.76 tokens per second)
llama_perf_context_print:       total time =  148778.61 ms /  1127 tokens

and benchmarks:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | type_k |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         pp512 |        276.56 ± 1.24 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         tg128 |         11.89 ± 0.01 |

while on 4x H100/80GB @ 214 TFLOPS we have

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | type_k |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         pp512 |        273.10 ± 1.41 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB |   671.03 B | CUDA       |  62 |      12 |   q4_0 |         tg128 |         11.84 ± 0.00 |

or

llama_perf_sampler_print:    sampling time =      57.98 ms /  1128 runs   (    0.05 ms per token, 19453.98 tokens per second)
llama_perf_context_print:        load time =   23185.90 ms
llama_perf_context_print: prompt eval time =   49140.45 ms /    20 tokens ( 2457.02 ms per token,     0.41 tokens per second)
llama_perf_context_print:        eval time =   95991.54 ms /  1107 runs   (   86.71 ms per token,    11.53 tokens per second)
llama_perf_context_print:       total time =  145329.21 ms /  1127 tokens

So I didn't see any significant improvement from increasing GPUs or increasing threads beyond 12 right now. According to nvidia-smi, all {0,1,2,3} GPUs were in use.

@ggerganov
Member

This model is really something. I came up with a fun puzzle:

What could this mean: 'gwkki qieks'?

Solution by DeepSeek-R1 IQ1_S: [image]

It does not always get it right, but neither does the API.

@jukofyork
Contributor

jukofyork commented Jan 29, 2025

Surprised the generation speed (ie: non-prompt processing) is so similar for H100, A100 and M2 Ultra?

Isn't the memory bandwidth approximately: H100 = 2 x A100 = 4 x M2 Ultra?
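For reference, the vendor-published peak figures are roughly 3.35 TB/s (H100 SXM), 2.0 TB/s (A100 80GB SXM) and 0.8 TB/s (M2 Ultra), so the ratios are approximately:

awk 'BEGIN {
  h100 = 3350; a100 = 2039; m2u = 800             # GB/s, approximate peak bandwidth
  printf "H100 / A100:     %.1fx\n", h100 / a100  # ~1.6x
  printf "H100 / M2 Ultra: %.1fx\n", h100 / m2u   # ~4.2x
}'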

@jukofyork
Contributor

I'm even more confused now as people seem to be getting ~1.5 tokens per second using SSDs:

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/13

https://old.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/

At best these have 1/100th the memory bandwidth of even the M2 Ultra?

@jukofyork
Contributor

There's a Reddit thread on this now:

https://www.reddit.com/r/LocalLLaMA/comments/1idivqe/the_mac_m2_ultra_is_faster_than_2xh100s_in/

I have 9x3090 and while running Deepseek 2.5 q4, I got about 25 tok/s

With R1 IQ1_S I get 2.5 tok/s. There is a bottleneck somewhere.

IQ1_S are seemingly not the best quants for CUDA backend. What's with Q2K?

It would be interesting to see the results for K quants (and _0 if anyone can run them).

@loretoparisi
Author

@ggerganov This is vLLM for comparison, started as:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 8192 --enforce-eager

It seems that when the GPU KV cache usage is >0.0%, the generation throughput is ~30 tokens/s

...
INFO 01-30 16:15:44 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:49 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:54 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:59 engine.py:291] Aborted request chatcmpl-04d27b242dbc4bc0b0743235a31d53d7.
INFO 01-30 16:16:09 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 01-30 16:16:19 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

In fact when you have GPU KV cache usage: 0.0% you get

INFO 01-30 16:16:09 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

Worth noting that PyTorch eager execution mode was enforced.
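For what it's worth, --enforce-eager disables CUDA graph capture in vLLM, so decode throughput would likely be somewhat higher without it - the same launch, just dropping that flag (untested here, so treat it as a sketch):

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 8192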

@justinjja

justinjja commented Jan 30, 2025

@loretoparisi That 10.5 t/s is just an average between 31 T/s and 0 T/s after your prompt ended, so not actually a meaningful number, right?

@jukofyork
Contributor

@loretoparisi DeepSeek-R1-Distill-Qwen-32B isn't deepseek?

@loretoparisi
Author

@loretoparisi DeepSeek-R1-Distill-Qwen-32B isn't deepseek?

It's the official R1 distillation from Qwen2.5-32B.

@jukofyork
Contributor

@loretoparisi DeepSeek-R1-Distill-Qwen-32B isn't deepseek?

It's the official R1 distillation from Qwen2.5-32B.

But the whole point of this thread is to benchmark the deepseek-v3 architecture? :)

@accupham

Reporting 8 tok/s on 2x A100 (pcie).

@marvin-0042

1.58bit R1

  1. Reporting 3 tokens/s on 1x 4090 24GB with 192 CPU cores / huge CPU memory (>100GB).

  2. Reporting ~0.5 tokens/s on 1x 4090 24GB with limited CPU memory (~60GB).

  3. Setup #1's config:
    system_info: n_threads = 192 (n_threads_batch = 192) / 384 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

@agoralski-qc

Here are our results for DeepSeek-R1 IQ1_S 1.58bit:

AMD EPYC 9654 96-Core 768GB RAM, 1 * Nvidia RTX 3090 (24GB VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 8 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"

Results:

llama_perf_sampler_print:    sampling time =      29.78 ms /   526 runs   (    0.06 ms per token, 17662.27 tokens per second)
llama_perf_context_print:        load time =   21075.00 ms
llama_perf_context_print: prompt eval time =    2659.54 ms /    13 tokens (  204.58 ms per token,     4.89 tokens per second)
llama_perf_context_print:        eval time =  155924.77 ms /   512 runs   (  304.54 ms per token,     3.28 tokens per second)
llama_perf_context_print:       total time =  158686.10 ms /   525 tokens

AMD EPYC 7713 64-Core 952GB RAM, 8 * Nvidia L40 (45GB VRAM, 360GB total VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 62 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"

Results:

llama_perf_sampler_print:    sampling time =      55.95 ms /  1106 runs   (    0.05 ms per token, 19768.71 tokens per second)
llama_perf_context_print:        load time =   26832.58 ms
llama_perf_context_print: prompt eval time =   47971.09 ms /    13 tokens ( 3690.08 ms per token,     0.27 tokens per second)
llama_perf_context_print:        eval time =   96577.12 ms /  1092 runs   (   88.44 ms per token,    11.31 tokens per second)
llama_perf_context_print:       total time =  144832.33 ms /  1105 tokens

AMD EPYC 7V12 64-Core 1820GB RAM, 8 * A100 SXM4 (80GB VRAM, 640GB total VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 62 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"

Results:

llama_perf_sampler_print:    sampling time =     126.63 ms /  1593 runs   (    0.08 ms per token, 12579.66 tokens per second)
llama_perf_context_print:        load time =   33417.77 ms
llama_perf_context_print: prompt eval time =    1183.44 ms /    13 tokens (   91.03 ms per token,    10.98 tokens per second)
llama_perf_context_print:        eval time =  209953.23 ms /  1579 runs   (  132.97 ms per token,     7.52 tokens per second)
llama_perf_context_print:       total time =  211573.62 ms /  1592 tokens

@davidsyoung

6-11 t/s on 8x 3090. Bigger context gives the lower end, and vice versa.

@ryseek

ryseek commented Feb 1, 2025

6-11 t/s on 8x 3090. Bigger context gives the lower end, and vice versa.

Speed is plenty good for generation 👍
Can you share the prompt processing speed? Preferably on a 10k+ token prompt.
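Something like a long-prompt llama-bench run would give a repeatable number (a sketch, reusing the IQ1_S path from earlier in the thread; adjust -ngl and the tensor split to your setup):

./llama.cpp/build/bin/llama-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 62 -ctk q4_0 \
    -p 10240 -n 128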

@justinjja

Out of curiosity, I went searching for a dense model with total parameters similar to R1's active parameters.
Found an IQ1_S of Falcon 40B. (Yes, it's basically braindead lol)

I'm running 10x P40s, so both models fit in VRAM.
Tested with 500 input tokens and 500 output tokens:

Falcon 40b - iQ1s
Prompt: 160 T/s
Generation: 9.5T/s

DeepSeek R1 - iQ1s
Prompt: 53 T/s
Generation: 6.2 T/s

@davidsyoung

6-11 t/s on 8x 3090. Bigger context gives the lower end, and vice versa.

Speed is plenty good for generation 👍 Can you share the prompt processing speed? Preferably on a 10k+ token prompt.

So I had some issues with CUDA running out of memory during prompt processing at 10k+ context, even though it would let me load the model etc.

I also picked up another 3090 today, so I have 9x 3090 now. I loaded the DeepSeek-R1-UD-IQ1_M model instead of the 1.58bit one. However, I had to power-limit the GPUs to 280W as I only have 2x 1500W PSUs.

It's worth mentioning that with the MoE architecture each GPU only pulls ~130-150W during inference, but I believe the peak utilisation spikes too much, so I limited them to 280W.
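(For reference, that kind of cap can be applied with nvidia-smi, roughly like this - run as root, and add -i <index> to target a single card:)

sudo nvidia-smi -pm 1    # enable persistence mode
sudo nvidia-smi -pl 280  # cap the power limit at 280 W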

I'm still playing around with -ub at 128/256, and matching the context size for a nice balance.

Without FA, there's a lot of VRAM usage for context. I'm also using -ub at 128 to fit a bigger context. The layer split is also unbalanced; here's how it looks below. It's hard to balance right with tensor split.

load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:        CUDA0 model buffer size = 15740.59 MiB
load_tensors:        CUDA1 model buffer size = 16315.85 MiB
load_tensors:        CUDA2 model buffer size = 19035.16 MiB
load_tensors:        CUDA3 model buffer size = 19035.16 MiB
load_tensors:        CUDA4 model buffer size = 19035.16 MiB
load_tensors:        CUDA5 model buffer size = 16315.85 MiB
load_tensors:        CUDA6 model buffer size = 19035.16 MiB
load_tensors:        CUDA7 model buffer size = 19035.16 MiB
load_tensors:        CUDA8 model buffer size = 17040.83 MiB
load_tensors:   CPU_Mapped model buffer size =   497.11 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 10240
llama_init_from_model: n_ctx_per_seq = 10240
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 128
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (10240) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 10240, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:      CUDA0 KV buffer size =  3640.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  2730.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =  3185.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =  3185.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =  3185.00 MiB
llama_kv_cache_init:      CUDA5 KV buffer size =  2730.00 MiB
llama_kv_cache_init:      CUDA6 KV buffer size =  3185.00 MiB
llama_kv_cache_init:      CUDA7 KV buffer size =  3185.00 MiB
llama_kv_cache_init:      CUDA8 KV buffer size =  2730.00 MiB
llama_init_from_model: KV self size  = 27755.00 MiB, K (q4_0): 8235.00 MiB, V (f16): 19520.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model:      CUDA0 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA1 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA2 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA3 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA4 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA5 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA6 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA7 compute buffer size =   712.50 MiB
llama_init_from_model:      CUDA8 compute buffer size =   712.50 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    23.51 MiB
llama_init_from_model: graph nodes  = 5025
llama_init_from_model: graph splits = 10
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240

Prompt wise, here you go:

prompt eval time =   97867.98 ms /  7744 tokens (   12.64 ms per token,    79.13 tokens per second)
       eval time =  450625.60 ms /  2000 tokens (  225.31 ms per token,     4.44 tokens per second)
      total time =  548493.58 ms /  9744 tokens
srv  update_slots: all slots are idle
request: POST /v1/chat/completions 192.168.1.64 200
slot launch_slot_: id  0 | task 4105 | processing task
slot update_slots: id  0 | task 4105 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9920
slot update_slots: id  0 | task 4105 | kv cache rm [2, end)
slot update_slots: id  0 | task 4105 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.206452
slot update_slots: id  0 | task 4105 | kv cache rm [2050, end)
slot update_slots: id  0 | task 4105 | prompt processing progress, n_past = 4098, n_tokens = 2048, progress = 0.412903
slot update_slots: id  0 | task 4105 | kv cache rm [4098, end)
slot update_slots: id  0 | task 4105 | prompt processing progress, n_past = 6146, n_tokens = 2048, progress = 0.619355
slot update_slots: id  0 | task 4105 | kv cache rm [6146, end)
slot update_slots: id  0 | task 4105 | prompt processing progress, n_past = 8194, n_tokens = 2048, progress = 0.825806
slot update_slots: id  0 | task 4105 | kv cache rm [8194, end)
slot update_slots: id  0 | task 4105 | prompt processing progress, n_past = 9920, n_tokens = 1726, progress = 0.999798
slot update_slots: id  0 | task 4105 | prompt done, n_past = 9920, n_tokens = 1726
slot      release: id  0 | task 4105 | stop processing: n_past = 9969, truncated = 0
slot print_timing: id  0 | task 4105 | 
prompt eval time =  129716.52 ms /  9918 tokens (   13.08 ms per token,    76.46 tokens per second)
       eval time =   11962.27 ms /    50 tokens (  239.25 ms per token,     4.18 tokens per second)
      total time =  141678.79 ms /  9968 tokens

Prompt processing is quite slow with -ub 128. Token generation also got quite a bit slower. I would say that's a combination of bigger quant, -ub 128, and GPUs limited to 280w.

GPU utilisation during inference really sits around 10%, so I believe there is huge potential for optimisation here.
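For anyone fighting the same imbalance: --tensor-split takes per-GPU proportions, so one rough approach is to give CUDA0 a smaller share to leave headroom for its larger KV/compute buffers. A sketch only - the ratios are illustrative rather than tuned, and the model path is the IQ1_S one from earlier in the thread:

./llama.cpp/build/bin/llama-server \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 --n-gpu-layers 62 -ub 256 --ctx-size 9216 \
    --tensor-split 7,8,8,8,8,8,8,8,8   # nine values, one per GPU; smaller share for CUDA0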

@davidsyoung

davidsyoung commented Feb 1, 2025

Here are some more examples of inference:

slot launch_slot_: id  0 | task 5949 | processing task
slot update_slots: id  0 | task 5949 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 19
slot update_slots: id  0 | task 5949 | kv cache rm [2, end)
slot update_slots: id  0 | task 5949 | prompt processing progress, n_past = 19, n_tokens = 17, progress = 0.894737
slot update_slots: id  0 | task 5949 | prompt done, n_past = 19, n_tokens = 17
slot      release: id  0 | task 5949 | stop processing: n_past = 1017, truncated = 0
slot print_timing: id  0 | task 5949 | 
prompt eval time =    1029.55 ms /    17 tokens (   60.56 ms per token,    16.51 tokens per second)
       eval time =   93916.70 ms /   999 tokens (   94.01 ms per token,    10.64 tokens per second)
      total time =   94946.25 ms /  1016 tokens
slot launch_slot_: id  0 | task 6949 | processing task
slot update_slots: id  0 | task 6949 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 1180
slot update_slots: id  0 | task 6949 | kv cache rm [2, end)
slot update_slots: id  0 | task 6949 | prompt processing progress, n_past = 1180, n_tokens = 1178, progress = 0.998305
slot update_slots: id  0 | task 6949 | prompt done, n_past = 1180, n_tokens = 1178
slot      release: id  0 | task 6949 | stop processing: n_past = 1229, truncated = 0
slot print_timing: id  0 | task 6949 | 
prompt eval time =   14212.81 ms /  1178 tokens (   12.07 ms per token,    82.88 tokens per second)
       eval time =    5358.20 ms /    50 tokens (  107.16 ms per token,     9.33 tokens per second)
      total time =   19571.01 ms /  1228 tokens
slot launch_slot_: id  0 | task 4160 | processing task
slot update_slots: id  0 | task 4160 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 2299
slot update_slots: id  0 | task 4160 | kv cache rm [2, end)
slot update_slots: id  0 | task 4160 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.890822
slot update_slots: id  0 | task 4160 | kv cache rm [2050, end)
slot update_slots: id  0 | task 4160 | prompt processing progress, n_past = 2299, n_tokens = 249, progress = 0.999130
slot update_slots: id  0 | task 4160 | prompt done, n_past = 2299, n_tokens = 249
slot      release: id  0 | task 4160 | stop processing: n_past = 4032, truncated = 0
slot print_timing: id  0 | task 4160 | 
prompt eval time =   27000.31 ms /  2297 tokens (   11.75 ms per token,    85.07 tokens per second)
       eval time =  235668.83 ms /  1734 tokens (  135.91 ms per token,     7.36 tokens per second)
      total time =  262669.13 ms /  4031 tokens
slot launch_slot_: id  0 | task 7000 | processing task
slot update_slots: id  0 | task 7000 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 2205
slot update_slots: id  0 | task 7000 | kv cache rm [1, end)
slot update_slots: id  0 | task 7000 | prompt processing progress, n_past = 2049, n_tokens = 2048, progress = 0.928798
slot update_slots: id  0 | task 7000 | kv cache rm [2049, end)
slot update_slots: id  0 | task 7000 | prompt processing progress, n_past = 2205, n_tokens = 156, progress = 0.999546
slot update_slots: id  0 | task 7000 | prompt done, n_past = 2205, n_tokens = 156
slot      release: id  0 | task 7000 | stop processing: n_past = 4252, truncated = 0
slot print_timing: id  0 | task 7000 | 
prompt eval time =   26206.03 ms /  2204 tokens (   11.89 ms per token,    84.10 tokens per second)
       eval time =  280669.02 ms /  2048 tokens (  137.05 ms per token,     7.30 tokens per second)
      total time =  306875.06 ms /  4252 tokens

EDIT:

Was able to get -ub 256 working with an OK context size:

-ub 256 --ctx-size 9216 -m DeepSeek-R1-UD-IQ1_M

srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 5767 | processing task
slot update_slots: id  0 | task 5767 | new prompt, n_ctx_slot = 9216, n_keep = 0, n_prompt_tokens = 2299
slot update_slots: id  0 | task 5767 | kv cache rm [2, end)
slot update_slots: id  0 | task 5767 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.890822
slot update_slots: id  0 | task 5767 | kv cache rm [2050, end)
slot update_slots: id  0 | task 5767 | prompt processing progress, n_past = 2299, n_tokens = 249, progress = 0.999130
slot update_slots: id  0 | task 5767 | prompt done, n_past = 2299, n_tokens = 249
slot      release: id  0 | task 5767 | stop processing: n_past = 3770, truncated = 0
slot print_timing: id  0 | task 5767 | 
prompt eval time =   20164.03 ms /  2297 tokens (    8.78 ms per token,   113.92 tokens per second)
       eval time =  197513.46 ms /  1472 tokens (  134.18 ms per token,     7.45 tokens per second)
      total time =  217677.50 ms /  3769 tokens
slot launch_slot_: id  0 | task 7243 | processing task
slot update_slots: id  0 | task 7243 | new prompt, n_ctx_slot = 9216, n_keep = 0, n_prompt_tokens = 3923
slot update_slots: id  0 | task 7243 | kv cache rm [2, end)
slot update_slots: id  0 | task 7243 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.522049
slot update_slots: id  0 | task 7243 | kv cache rm [2050, end)
slot update_slots: id  0 | task 7243 | prompt processing progress, n_past = 3923, n_tokens = 1873, progress = 0.999490
slot update_slots: id  0 | task 7243 | prompt done, n_past = 3923, n_tokens = 1873
slot      release: id  0 | task 7243 | stop processing: n_past = 3972, truncated = 0
slot print_timing: id  0 | task 7243 | 
prompt eval time =   36030.69 ms /  3921 tokens (    9.19 ms per token,   108.82 tokens per second)
       eval time =    7179.06 ms /    50 tokens (  143.58 ms per token,     6.96 tokens per second)
      total time =   43209.75 ms /  3971 tokens
slot launch_slot_: id  0 | task 7297 | processing task
slot update_slots: id  0 | task 7297 | new prompt, n_ctx_slot = 9216, n_keep = 0, n_prompt_tokens = 2156
slot update_slots: id  0 | task 7297 | kv cache rm [2, end)
slot update_slots: id  0 | task 7297 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.949907
slot update_slots: id  0 | task 7297 | kv cache rm [2050, end)
slot update_slots: id  0 | task 7297 | prompt processing progress, n_past = 2156, n_tokens = 106, progress = 0.999072
slot update_slots: id  0 | task 7297 | prompt done, n_past = 2156, n_tokens = 106
slot      release: id  0 | task 7297 | stop processing: n_past = 3366, truncated = 0
slot print_timing: id  0 | task 7297 | 
prompt eval time =   18871.24 ms /  2154 tokens (    8.76 ms per token,   114.14 tokens per second)
       eval time =  156651.02 ms /  1211 tokens (  129.36 ms per token,     7.73 tokens per second)
      total time =  175522.27 ms /  3365 tokens

@loretoparisi
Author

loretoparisi commented Feb 2, 2025

This PR is relevant to this discussion because it adds Flash Attention to DeepSeek-R1.
Also, this issue reported against ollama is related to the performance issues benchmarked here and to the missing KV context quantization.

@siddartha-RE

I am the author of the PR mentioned above. To enable FA, the main issue I ran into was that ggml does not yet support padding FP16 tensors. I added a CUDA kernel to do FP16 padding, and now I have generation working, but there is a bug that I need to track down. I will work on this some more tomorrow, but I wanted to get an OK from the code owners that this seems like a reasonable approach. cc @loretoparisi

@davidsyoung

I am the author of the PR mentioned above. To enable FA, the main issue I ran into was that ggml does not yet support padding FP16 tensors. I added a CUDA kernel to do FP16 padding, and now I have generation working, but there is a bug that I need to track down. I will work on this some more tomorrow, but I wanted to get an OK from the code owners that this seems like a reasonable approach. cc @loretoparisi

Really appreciate your work, looking forward to trying your PR when you update it!

@lingster

lingster commented Feb 6, 2025

Hi, just wanted to add my results in here for the community:

GPUs are only running at 25%, so I presume we are data-transfer/memory-bandwidth constrained. Tweaking tensor-split and gpu-layers helps to max out memory usage on the GPUs. Interestingly enough, during inference GPU0 memory usage ticks up, hence the 19/20 layer split. Increasing threads does not meaningfully increase the tokens/s rate.

  • AMD EPYC 7713 64-Core Processor, 256GB RAM / 2x A6000 Ada (2x 48GB VRAM)
./build/bin/llama-cli --no-mmap  --tensor-split 19,20   --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf     --cache-type-k q4_0     --threads 16     --prio 2     --temp 0.6     --ctx-size 8192     --seed 3407     --n-gpu-layers 36     -no-cnv     --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>"
llama_perf_sampler_print:    sampling time =      84.43 ms /   982 runs   (    0.09 ms per token, 11631.07 tokens per second)
llama_perf_context_print:        load time =   62308.16 ms
llama_perf_context_print: prompt eval time =    1993.99 ms /    17 tokens (  117.29 ms per token,     8.53 tokens per second)
llama_perf_context_print:        eval time =  226708.40 ms /   964 runs   (  235.17 ms per token,     4.25 tokens per second)
llama_perf_context_print:       total time =  228952.05 ms /   981 tokens

With --ctx 16384 (needs to be reduced as more memory is used; system memory usage jumps from ~10GB to 120GB): slower at 2.43 tok/s, and it also appears to hallucinate/get confused in its thinking:

llama_perf_sampler_print:    sampling time =     178.56 ms /  1640 runs   (    0.11 ms per token,  9184.43 tokens per second)
llama_perf_context_print:        load time =  188784.21 ms
llama_perf_context_print: prompt eval time =    3323.42 ms /    17 tokens (  195.50 ms per token,     5.12 tokens per second)
llama_perf_context_print:        eval time =  667566.21 ms /  1622 runs   (  411.57 ms per token,     2.43 tokens per second)
llama_perf_context_print:       total time =  671509.54 ms /  1639 tokens
Interrupted by user

Just for comparison, using 2x A6000 via local RPC:
./build/bin/llama-cli --tensor-split 20,20 --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 16 --prio 2 --temp 0.6 --ctx-size 8192 --seed 3407 --n-gpu-layers 36 -no-cnv --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>" --rpc 127.0.0.1:50052,127.0.0.1:50053

yielded:

llama_perf_sampler_print:    sampling time =      80.14 ms /   982 runs   (    0.08 ms per token, 12253.10 tokens per second)
llama_perf_context_print:        load time =  195238.80 ms
llama_perf_context_print: prompt eval time =    1749.13 ms /    17 tokens (  102.89 ms per token,     9.72 tokens per second)
llama_perf_context_print:        eval time =  228672.05 ms /   964 runs   (  237.21 ms per token,     4.22 tokens per second)
llama_perf_context_print:       total time =  230671.65 ms /   981 tokens

@pickettd

pickettd commented Feb 7, 2025

I've been messing around with DeepSeek-R1 IQ1_S 1.58bit on my M1 Ultra 128gb Studio as well as experimenting with running it using RPC.

With just the M1 Ultra with 36 gpu layers and 1024 tokens of context (after fresh restart and setting iogpu.wired_limit_mb)

llama-cli \
      --model ./DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
      --cache-type-k q8_0 \
      --threads 16 -no-cnv --prio 2 -ub 128 \
      --temp 0.6 \
      --ctx-size 1024 \
      -n 256 \
      --seed 3407 \
      --n-gpu-layers 36 \
      --prompt "<|User|>Hi are you ready to chat?<|Assistant|>"

results look like this

llama_perf_sampler_print:    sampling time =       8.20 ms /   149 runs   (    0.06 ms per token, 18177.38 tokens per second)
llama_perf_context_print:        load time =    3104.13 ms
llama_perf_context_print: prompt eval time =    2300.45 ms /    10 tokens (  230.04 ms per token,     4.35 tokens per second)
llama_perf_context_print:        eval time =   38278.98 ms /   138 runs   (  277.38 ms per token,     3.61 tokens per second)
llama_perf_context_print:       total time =   40639.06 ms /   148 tokens

and then if I use the M1 Ultra plus RPC (through a wired gigabit router) for 2 machines (a 3090 and a 4090) with 61 gpu layers and 8k context

llama-cli \
      --model ./DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
      --cache-type-k q8_0 \
      --threads 16 -no-cnv --prio 2 -ub 128 \
      --temp 0.6 \
      --ctx-size 8192 \
      -n 256 \
      --seed 3407 \
      --n-gpu-layers 61 \
      --prompt "<|User|>Hi are you ready to chat?<|Assistant|>" \
      --rpc 192.168.1.25:50052,192.168.1.10:50052

results look like this:

llama_perf_sampler_print:    sampling time =      10.48 ms /   128 runs   (    0.08 ms per token, 12214.91 tokens per second)
llama_perf_context_print:        load time =  345285.60 ms
llama_perf_context_print: prompt eval time =    1107.83 ms /    10 tokens (  110.78 ms per token,     9.03 tokens per second)
llama_perf_context_print:        eval time =   16413.14 ms /   117 runs   (  140.28 ms per token,     7.13 tokens per second)
llama_perf_context_print:       total time =   17587.15 ms /   127 tokens

@lingster

lingster commented Feb 9, 2025

Just to add an RPC data point: I was experimenting with my 2x A6000 setup with RPC to 2x 3090 over a slowish 100Mb network. Even though I could load the entire model in memory, the network latency cancels out any gains. At startup it can take up to 30 mins to transfer the model layers to the remote servers (3090s). With a very small ctx size (1024), I get just over 1.2 tok/s:

./build/bin/llama-cli -ub 512 --split-mode row --tensor-split 17,16,8,8   --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf     --cache-type-k q4_0     --threads 16     --prio 2     --temp 0.6     --ctx-size 1024     --seed 3407     --n-gpu-layers 60     -no-cnv     --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>" --rpc 127.0.0.1:50052,127.0.0.1:50053,10.123.0.244:50052,10.123.0.244:50053 --verbose  --log-timestamps --log-prefix 
39.53.712.057 I llama_perf_sampler_print:    sampling time =      99.52 ms /  1024 runs   (    0.10 ms per token, 10289.91 tokens per second)
39.53.712.059 I llama_perf_context_print:        load time = 1604650.80 ms
39.53.712.060 I llama_perf_context_print: prompt eval time =    1693.55 ms /    17 tokens (   99.62 ms per token,    10.04 tokens per second)
39.53.712.062 I llama_perf_context_print:        eval time =  786774.41 ms /  1006 runs   (  782.08 ms per token,     1.28 tokens per second)
39.53.712.063 I llama_perf_context_print:       total time =  788762.14 ms /  1023 tokens

The initial loading time is also painfully long, taking up to 30 mins to get the 2x 3090 loaded with model data. I am investigating whether it might be possible to have the remote servers load the model locally if it is available on disk, instead of via the network.

@RodriMora

Epyc 7402 24c/48t
512GB RAM 3200MHz
4x 3090 power limited to 250W

./build/bin/llama-cli \
                                                --model ~/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
                                                --cache-type-k q4_0 \
                                                --threads 24 -no-cnv --n-gpu-layers 32 --prio 2 \
                                                --temp 0.6 \
                                                --ctx-size 8192 \
                                                --seed 3407 \
                                                --prompt "<|User|>What is the capital of Italy?<|Assistant|>"
llama_perf_sampler_print:    sampling time =       5.10 ms /    66 runs   (    0.08 ms per token, 12933.57 tokens per second)
llama_perf_context_print:        load time =   32697.96 ms
llama_perf_context_print: prompt eval time =    1078.48 ms /    10 tokens (  107.85 ms per token,     9.27 tokens per second)
llama_perf_context_print:        eval time =    9489.53 ms /    55 runs   (  172.54 ms per token,     5.80 tokens per second)
llama_perf_context_print:       total time =   10601.75 ms /    65 tokens

@bianxuxuxu

bianxuxuxu commented Feb 12, 2025

UD-IQ1_S(1.58b) on 4xA100-SXM-40G.

one-batch:
13~15tok/s

batched-bench:
./llama-batched-bench -m /path/to/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf -c 4096 -b 2048 -ub 512 -npp 128,256 -ntg 128,256 -npl 1,2,4,8,16,32,64 -ngl 61

Image

memory use:
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloaded 61/62 layers to GPU
load_tensors: CUDA0 model buffer size = 31559.07 MiB
load_tensors: CUDA1 model buffer size = 33649.64 MiB
load_tensors: CUDA2 model buffer size = 33649.64 MiB
load_tensors: CUDA3 model buffer size = 33649.64 MiB
load_tensors: CPU_Mapped model buffer size = 1222.09 MiB

@CHN-STUDENT

CHN-STUDENT commented Feb 13, 2025

I tried to do the same thing on our server, but I used the DeepSeek-R1-UD-IQ2_XXS and DeepSeek-R1-UD-IQ1_M models.

CPU: EPYC 9654 * 2
MEM: 32 * 24 RAM
GPU: 8 * NVIDIA L20 48GB

root@NF5468:~# llama-cli \
    --model /data/models/deepseek/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --n-gpu-layers 62 \
    --temp 0.6 \
    --ctx-size 8192 \
    --prio 2 \
    --seed 3407 \
    --prompt "<|User|>你好。<|Assistant|>" \
    -no-cnv

Here are the DeepSeek-R1-UD-IQ2_XXS results:
llama_perf_sampler_print: sampling time = 1.67 ms / 35 runs ( 0.05 ms per token, 20908.00 tokens per second)
llama_perf_context_print: load time = 24549.66 ms
llama_perf_context_print: prompt eval time = 186.49 ms / 5 tokens ( 37.30 ms per token, 26.81 tokens per second)
llama_perf_context_print: eval time = 2022.99 ms / 29 runs ( 69.76 ms per token, 14.34 tokens per second)
llama_perf_context_print: total time = 2227.81 ms / 34 tokens

And the DeepSeek-R1-UD-IQ1_M results:

llama_perf_sampler_print: sampling time = 0.84 ms / 21 runs ( 0.04 ms per token, 25000.00 tokens per second)
llama_perf_context_print: load time = 20218.08 ms
llama_perf_context_print: prompt eval time = 180.92 ms / 5 tokens ( 36.18 ms per token, 27.64 tokens per second)
llama_perf_context_print: eval time = 1037.82 ms / 15 runs ( 69.19 ms per token, 14.45 tokens per second)
llama_perf_context_print: total time = 1234.39 ms / 20 tokens

It gets about 5 tokens/second. I also want to expose the server API for UI calls - could you advise what number I can use for the concurrency setting (-np, --parallel N)? Yesterday I found that running the model with ollama is too slow, so I'm trying llama.cpp instead. I'm following your analysis to set the parameters and avoid the no-context-shift error.

@bianxuxuxu

@CHN-STUDENT Unlike the batched-bench test cases, llama.cpp or ollama can't batch multiple requests in a real server UI. It seems the requests run sequentially no matter how many '--parallel' you set, so the speed is very slow. If you want batched inference, you can use vLLM or SGLang.
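For reference on the flags @CHN-STUDENT asked about: -np sets the number of server slots and --ctx-size is the total context, which gets divided across the slots. A minimal sketch (values are illustrative only, and the batching caveat above still applies):

llama-server \
    --model /data/models/deepseek/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS.gguf \
    --cache-type-k q4_0 \
    --n-gpu-layers 62 \
    --ctx-size 16384 \
    -np 4    # 4 slots, each with 16384/4 = 4096 context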
