
llama : add DeepSeek-v2-Chat support #7118

Closed
DirtyKnightForVi opened this issue May 7, 2024 · 67 comments · Fixed by #7519
Labels: good first issue (Good for newcomers), model (Model specific)

Comments

@DirtyKnightForVi

please support deepseek-ai/DeepSeek-V2-Chat

https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat

@SinanAkkoyun

That would be awesome.

@jeff31415

Impressive model, and potentially a CPU-friendly one (if you have >96 GB of memory)

@SinanAkkoyun

@ggerganov I'd be very interested in helping; I want to get into porting models to inference engines.

Would you be so kind as to provide a rough outline of what needs to be done here? I'd then submit a draft PR and ask about the small details that don't work.

@ggerganov
Owner

Interesting - can we get a rundown of the multi-head latent KV cache technique:

[image: multi-head latent attention KV cache diagram from the DeepSeek-V2 paper]

@SinanAkkoyun Look at PRs that have already been merged and add support for new model arches

@DirtyKnightForVi
Author

Sure thing. Here's their tech report: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf

@ggerganov
Owner

Thanks, very cool work! Adding this to the roadmap to give it more visibility

@ggerganov ggerganov changed the title Please Support DeepSeek-v2-Chat llama : add DeepSeek-v2-Chat support May 9, 2024
@ggerganov ggerganov added good first issue Good for newcomers model Model specific labels May 9, 2024
@taozhiyuai

+1

@fairydreaming
Collaborator

I'm working on it right now: https://youtu.be/1AG-GUtDvaw
The code needs some cleanup, so it's not published yet.

@SinanAkkoyun

@fairydreaming Oh wow how awesome!! How does the ppl look?

@fairydreaming
Collaborator

fairydreaming commented May 15, 2024

@fairydreaming Oh wow how awesome!! How does the ppl look?

@SinanAkkoyun At this moment it's somewhat high (Q8_0):

perplexity: tokenizing the input ..
perplexity: tokenization took 1107.87 ms
perplexity: calculating perplexity over 596 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 91.94 seconds per pass - ETA 3 hours 48.32 minutes
[1]6.4552,[2]7.7478,[3]6.8637,[4]7.1755,[5]7.5298,[6]8.4102,[7]8.7088,[8]9.0019,[9]9.5003,[10]9.8350,[11]9.9215,[12]10.1602,[13]10.2808,[14]10.3361,[15]10.2942,[16]10.4948,[17]9.7985,[18]9.8037,[19]9.8295,[20]9.6260

@ggerganov
Owner

At this moment it's somewhat high (Q8_0)

This is normal for non-base models

@CyberTimon

Would love to see support for the smaller MoE models. They seem to be good and only use 2.5b active parameters for token generation.

@fairydreaming
Collaborator

fairydreaming commented May 17, 2024

You can try my branch if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2
The model works but there are several issues:

  • The implementation is suboptimal, since it permutes K and Q tensors during inference. I will try to avoid this by permuting model tensors during conversion instead.
  • I see some differences in YaRN implementation between DeepSeek-V2 and llama.cpp (calculation of mscale). Is there any YaRN expert on board?
  • The implementation still caches whole K and V tensors instead of the parts marked on the model diagram above (I don't think I'm going to change this, even the original transformers implementation does the same).
  • Some model-specific parameters are hardcoded in the code. I'm not sure what to do with them, I don't think we want to add every little parameter from the myriad of model architectures to gguf model files.

@ggerganov
Owner

I see some differences in YaRN implementation between DeepSeek-V2 and llama.cpp (calculation of mscale). Is there any YaRN expert on board?

There is this PR from a while ago: #4093

Though DS2 seems to not use the "GPT-NeoX RoPE" as we call it, so probably not relevant

Some model-specific parameters are hardcoded in the code. I'm not sure what to do with them, I don't think we want to add every little parameter from the myriad of model architectures to gguf model files.

How many are the parameters? I don't think we have a better solution than adding them to the GGUF header

@fairydreaming
Collaborator

How many are the parameters? I don't think we have a better solution than adding them to the GGUF header

@ggerganov here they are:

                    // TODO maybe move some of these to hparams
                    const uint32_t n_shared_experts = 2;
                    const uint32_t moe_intermediate_size = 1536;
                    const uint32_t q_lora_rank = 1536;
                    const uint32_t kv_lora_rank = 512;
                    const uint32_t first_k_dense_replace = 1;
  • moe_intermediate_size is needed because intermediate_size is used for dense FFN intermediate size,
  • q_lora_rank and kv_lora_rank are the latent compressed Q and KV dimensions (consult the image above),
  • first_k_dense_replace says from which layer onward MoE is used instead of a dense FFN (so layer 0 has no MoE, but a dense FFN instead).

What do you think?

@ggerganov
Owner

I think it's fine to add those parameters
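
For illustration, here is a rough sketch (plain C++; the field names are hypothetical, not llama.cpp's actual ones) of how the values listed above could be grouped into hparams, each backed by a new GGUF key:

#include <cstdint>

// Hypothetical grouping of the DeepSeek-V2-specific parameters listed above.
struct deepseek2_extra_hparams {
    uint32_t n_expert_shared    = 2;    // n_shared_experts
    uint32_t n_ff_exp           = 1536; // moe_intermediate_size (the dense FFN keeps the existing n_ff)
    uint32_t n_lora_q           = 1536; // q_lora_rank: latent dimension of the compressed Q
    uint32_t n_lora_kv          = 512;  // kv_lora_rank: latent dimension of the compressed KV
    uint32_t n_layer_dense_lead = 1;    // first_k_dense_replace: leading layers that use a dense FFN instead of MoE
};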

@fairydreaming
Collaborator

fairydreaming commented May 17, 2024

I see some differences in YaRN implementation between DeepSeek-V2 and llama.cpp (calculation of mscale). Is there any YaRN expert on board?

There is this PR from a while ago: #4093

Though DS2 seems to not use the "GPT-NeoX RoPE" as we call it, so probably not relevant

The difference in YaRN RoPE that I noticed is that llama.cpp scales sin and cos values with mscale calculated like this:

mscale *= 1.0f + 0.1f * logf(1.0f / freq_scale);

while the DeepSeek-V2 transformers implementation uses the following code:

        _mscale = float(
            yarn_get_mscale(self.scaling_factor, self.mscale)
            / yarn_get_mscale(self.scaling_factor, self.mscale_all_dim)
        )

where yarn_get_mscale is:

def yarn_get_mscale(scale=1, mscale=1):
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

It uses the same calculation as llama.cpp, but twice - first for self.mscale (which is 0.707 in the config.json), then for self.mscale_all_dim (which is also 0.707 in the config.json), and then divides the first calculated value by the second. However, this will be 1.0 since both mscales are the same. The DeepSeek-V2 vLLM implementation does the same. There's even a comment:

# Get n-d magnitude scaling corrected for interpolation.

In the DeepSeek-V2 paper there is: "Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy", but I'm not sure if they are talking about the difference I noticed.
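
For concreteness, a minimal standalone sketch (plain C++, not llama.cpp internals; it assumes freq_scale = 1/scaling_factor and the config.json values quoted above) comparing the two formulas:

#include <cmath>
#include <cstdio>

// llama.cpp-style attention magnitude scale
static float mscale_llama(float freq_scale) {
    return 1.0f + 0.1f * std::log(1.0f / freq_scale);
}

// DeepSeek-V2-style helper (same formula, applied twice and then divided)
static float yarn_get_mscale(float scale, float mscale) {
    if (scale <= 1.0f) return 1.0f;
    return 0.1f * mscale * std::log(scale) + 1.0f;
}

int main() {
    const float scaling_factor = 40.0f;  // rope scaling factor (see the GGUF metadata later in this thread)
    const float mscale         = 0.707f; // "mscale" in config.json
    const float mscale_all_dim = 0.707f; // "mscale_all_dim" in config.json

    std::printf("llama.cpp   mscale = %f\n", mscale_llama(1.0f / scaling_factor)); // 1 + 0.1*ln(40) ~ 1.369
    std::printf("DeepSeek-V2 mscale = %f\n", yarn_get_mscale(scaling_factor, mscale)
                                           / yarn_get_mscale(scaling_factor, mscale_all_dim)); // exactly 1.0
    return 0;
}

With mscale equal to mscale_all_dim the DeepSeek-V2 expression is a no-op, while the llama.cpp formula is not, which is exactly the discrepancy described above.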

@ggerganov
Owner

Hm, that's strange - what's the point of multiplying by 1.0. Not sure if we should modify our implementation - probably we just need to disable YARN for DS2 since it's basically a noop based on the python implementations

@fairydreaming
Collaborator

Would love to see support for the smaller MoE models. They seem to be good and only use 2.5b active parameters for token generation.

@CyberTimon I added support for the lite model in my branch, you can try it out now if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2

@fairydreaming
Collaborator

Hm, that's strange - what's the point of multiplying by 1.0. Not sure if we should modify our implementation - probably we just need to disable YARN for DS2 since it's basically a noop based on the python implementations

@ggerganov I think YaRN also affects the calculation of the sin/cos frequencies (the theta variable), so we can't simply disable it. Anyway, I found another quirk of DeepSeek-V2 - it uses a scalar value to scale the expert weights instead of normalizing them. After taking it into account, perplexity looks much better in the chat model (Q8_0):

perplexity: calculating perplexity over 596 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 63.76 seconds per pass - ETA 2 hours 38.33 minutes
[1]2.7414,[2]3.6534,[3]3.1132,[4]3.3036,[5]3.5037,[6]3.9715,[7]4.1896,[8]4.2031,[9]4.4069,[10]4.5289,[11]4.6015,[12]4.7431,[13]4.8987,[14]4.7905,[15]4.6756,[16]4.6905,[17]4.5251,[18]4.6219,[19]4.6456,[20]4.4898,[21]4.5219,[22]4.5331,[23]4.4675,[24]4.3658,[25]4.2529,[26]4.1937,[27]4.0689,[28]3.9773,[29]3.9261

Of course it will require another parameter to be added to the model headers.
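
For reference, a minimal sketch (plain C++, not the actual llama.cpp MoE code) of the difference: instead of renormalizing the selected experts' softmax weights so they sum to 1, DeepSeek-V2 multiplies them by a fixed scalar (the expert_weights_scale of 16.0 that shows up in the GGUF metadata later in this thread):

#include <numeric>
#include <vector>

// Common MoE behaviour: renormalize the top-k softmax weights so they sum to 1.
static void normalize_expert_weights(std::vector<float> & w) {
    const float sum = std::accumulate(w.begin(), w.end(), 0.0f);
    for (float & x : w) x /= sum;
}

// DeepSeek-V2 behaviour: keep the softmax weights and scale them by a constant.
static void scale_expert_weights(std::vector<float> & w, float expert_weights_scale /* 16.0 for DeepSeek-V2-Chat */) {
    for (float & x : w) x *= expert_weights_scale;
}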

@SinanAkkoyun

https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite

:P A model for everyone to test

@YavorGIvanov

YavorGIvanov commented May 22, 2024

The MLA approach can probably be combined with the Pyramid KV cache - https://arxiv.org/abs/2405.12532

@DirtyKnightForVi
Author

Is the main branch code now able to support DeepseekV2 inference?

@fairydreaming
Collaborator

Is the main branch code now able to support DeepseekV2 inference?

No, not yet

@foldl
Contributor

foldl commented May 24, 2024

For those who want to try DeepSeek-V2-Lite-Chat: chatllm.cpp now supports it (with conditions).

Compared to @fairydreaming's code, this one tries to follow the paper rather than modeling_deepseek.py.

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

[screenshot 2024-06-06 08-03-06]

DeepSeek is still running

@fairydreaming
Collaborator

@DirtyKnightForVi I have limited knowledge of Windows, but I guess there is some disk swap mechanism in use.

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

[screenshot 2024-06-06 08-13-38]

@fairydreaming I'm running it on Ubuntu. CPU offload may be the reason why it works well.

@ggerganov This might be a default setting. But are there other configurations that can fully load my CPU or GPU? I’m quite curious about the origin of this setting.

@fairydreaming
Collaborator

@DirtyKnightForVi Did you try some other model to see if your environment works correctly?

Running other models poses no issue. However, I'm curious as to why you encountered an OOM error, while I was able to smoothly infer a 200B large model with minimal resource consumption?

@DirtyKnightForVi It's because you ran it with the context size (n_ctx) set to 512, while on my machine it was set to the default training context size value of 163840.
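
A back-of-the-envelope sketch (plain C++; it assumes an f16 cache and uses the DeepSeek-V2 GGUF metadata quoted later in this thread: 60 layers, 128 KV heads, key_length 192, value_length 128) shows why the full training context is so expensive as long as whole K and V tensors are cached:

#include <cstdio>

int main() {
    const double n_layer   = 60;
    const double n_head_kv = 128;
    const double d_k       = 192; // per-head K size
    const double d_v       = 128; // per-head V size
    const double bytes     = 2;   // f16 cache

    const double per_token = n_layer * n_head_kv * (d_k + d_v) * bytes; // ~4.7 MiB per token

    const double ctxs[] = {512, 4096, 163840};
    for (double n_ctx : ctxs) {
        std::printf("n_ctx = %6.0f -> KV cache ~ %7.1f GiB\n", n_ctx,
                    per_token * n_ctx / (1024.0 * 1024.0 * 1024.0));
    }
    // ~2.3 GiB at 512, ~18.8 GiB at 4096, ~750 GiB at the default 163840.
    return 0;
}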

@KylixC

KylixC commented Jun 27, 2024

I got an error like this:

E:/WorkingArea/llama_cpp/llama.cpp $ main --override-kv deepseek2.attention.q_lora_rank=int:1536 --override-kv deepseek2.attention.kv_lora_rank=int:512 --override-kv deepseek2.expert_shared_count=int:2 --override-kv deepseek2.expert_weights_scale=float:16 --override-kv deepseek2.expert_feed_forward_length=int:1536 --override-kv deepseek2.leading_dense_block_count=int:1 --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707 -m E:/model_tmps/DeepSeek-V2-Chat.Q8_0.gguf -c 128 --color -i
Log start
main: build = 3083 (adc9ff38)
main: built with cc (GCC) 14.1.0 for x86_64-w64-mingw32
main: seed  = 1717592663
llama_model_loader: loaded meta data with 46 key-value pairs and 959 tensors from E:/model_tmps/DeepSeek-V2-Chat.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = Deepseek-V2-Chat
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 60
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 5120
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  13:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  14:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  15:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  16: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  17:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  18:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  19:                     deepseek2.expert_count u32              = 160
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:            deepseek2.attention.q_lora_rank i32              = 1536
llama_model_loader: - kv  33:           deepseek2.attention.kv_lora_rank i32              = 512
llama_model_loader: - kv  34:              deepseek2.expert_shared_count i32              = 2
llama_model_loader: - kv  35:             deepseek2.expert_weights_scale f32              = 16.000000
llama_model_loader: - kv  36:       deepseek2.expert_feed_forward_length i32              = 1536
llama_model_loader: - kv  37:        deepseek2.leading_dense_block_count i32              = 1
llama_model_loader: - kv  38: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  39:                      quantize.imatrix.file str              = imatrix.dat
llama_model_loader: - kv  40:                   quantize.imatrix.dataset str              = groups_merged.txt
llama_model_loader: - kv  41:             quantize.imatrix.entries_count i32              = 716
llama_model_loader: - kv  42:              quantize.imatrix.chunks_count i32              = 62
llama_model_loader: - kv  43:                                   split.no u16              = 0
llama_model_loader: - kv  44:                                split.count u16              = 0
llama_model_loader: - kv  45:                        split.tensors.count i32              = 959
llama_model_loader: - type  f32:  300 tensors
llama_model_loader: - type q8_0:  659 tensors
validate_override: Using metadata override (  int) 'deepseek2.leading_dense_block_count' = 1
validate_override: Using metadata override (  int) 'deepseek2.attention.q_lora_rank' = 1536
validate_override: Using metadata override (  int) 'deepseek2.attention.kv_lora_rank' = 512
validate_override: Using metadata override (  int) 'deepseek2.expert_feed_forward_length' = 1536
validate_override: Using metadata override (  int) 'deepseek2.expert_shared_count' = 2
validate_override: Using metadata override (float) 'deepseek2.expert_weights_scale' = 16.000000
validate_override: Using metadata override (float) 'deepseek2.rope.scaling.yarn_log_multiplier' = 0.070700
llama_model_load: error loading model: error loading model vocabulary: wstring_convert::from_bytes
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'E:/model_tmps/DeepSeek-V2-Chat.Q8_0.gguf'
main: error: unable to load model

The model is the latest llama.cpp-format version from leafspark/DeepSeek-V2-Chat-GGUF, merged with gguf-split.

Linux is OK, but Win10 does not work.

@fairydreaming
Collaborator

I got an error like this:
...
The model is the latest llama.cpp-format version from leafspark/DeepSeek-V2-Chat-GGUF, merged with gguf-split.

Linux is OK, but Win10 does not work.

On Linux on my workstation it crashes too; it seems to be related to changes in chat template support in #8068.

@llmlover

llmlover commented Jul 1, 2024

Can someone please explain why this implementation runs significantly slower compared to a dense model with same active parameter count?

@foldl
Contributor

foldl commented Jul 1, 2024

@llmlover Could you provide some data? I don't think there are significant differences.

Here is my result (I am using chatllm.cpp. Performance on CPU should be similar to llama.cpp) using the classic prompt "write a quick sort function in python".

CodeGemma 2.5B (Q8_0):

timings: prompt eval time =       352.95 ms /    16 tokens (    22.06 ms per token,    45.33 tokens per second)
timings:        eval time =      8555.32 ms /   139 tokens (    61.55 ms per token,    16.25 tokens per second)
timings:       total time =      8908.27 ms /   155 tokens

DeepSeek-v2-Chat, 2.7B active (Q8_0):

timings: prompt eval time =      1182.08 ms /    15 tokens (    78.81 ms per token,    12.69 tokens per second)
timings:        eval time =     16229.16 ms /   224 tokens (    72.45 ms per token,    13.80 tokens per second)
timings:       total time =     17411.25 ms /   239 tokens

Yes, it is slower, but not that significant.

@llmlover

llmlover commented Jul 1, 2024

@foldl Thank you for running the tests.
The results show an initial slowdown of almost 4x! Given that only 2.7B params are active, I am wondering why that's the case: what takes so much more compute if the routing MLP is not that deep?

@foldl
Contributor

foldl commented Jul 2, 2024

@llmlover

You are talking about "prompt eval time"? It's slower for DeepSeek-V2 because the model file is significantly larger than CodeGemma 2.5B's. The reported time is affected by model loading. If you measure a new round, "prompt eval time" becomes much shorter.

All in all I don't think it is significantly slower than a same-sized dense model.

@LostRuins
Collaborator

Anyone figured out the llama_model_load: error loading model: error loading model vocabulary: wstring_convert::from_bytes yet?

@fairydreaming
Collaborator

Anyone figured out the llama_model_load: error loading model: error loading model vocabulary: wstring_convert::from_bytes yet?

@LostRuins How did you get that error (in detail, if possible)?

@LostRuins
Collaborator

@fairydreaming

Model used is https://huggingface.co/mradermacher/DeepSeek-Coder-V2-Lite-Instruct-GGUF/blob/main/DeepSeek-Coder-V2-Lite-Instruct.Q3_K_S.gguf

I just cloned the repo, used w64devkit to make and ran with llama-cli. Here are my full logs of these 3 steps:

err_logs.txt

@LostRuins
Collaborator

LostRuins commented Jul 5, 2024

Worth noting that the CI builds which were made with MSVC do not seem to have this issue.

Also this is sort of off-topic, but looking at my logs again there seems to be... a type IQ4_NL used inside a Q3_K_S? Is that intentional or a bug?

@ikawrakow (apologies if I missed something, but it just stood out as weird; it does break things for me separately, since some backends like Vulkan don't support IQ quants, while the larger K quants work fine.)

@fairydreaming
Collaborator

@fairydreaming

Model used is https://huggingface.co/mradermacher/DeepSeek-Coder-V2-Lite-Instruct-GGUF/blob/main/DeepSeek-Coder-V2-Lite-Instruct.Q3_K_S.gguf

I just cloned the repo, used w64devkit to make and ran with llama-cli. Here are my full logs of these 3 steps:

err_logs.txt

I confirm the problem, but it's not llama.cpp's fault. For some reason the C++ standard library used in MinGW (it's part of w64devkit) is unable to convert certain Unicode characters. For example, this doesn't work:

#include <string>
#include <locale>
#include <codecvt>

int main()
{
	std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
	std::string s("𐐀-𐑏");
	conv.from_bytes(s);
}

So you would have to report this bug to MinGW, not llama.cpp; maybe they will have some idea of how to fix it.

@fairydreaming
Collaborator


@LostRuins I tried replacing std::wstring_convert with a custom function to avoid this problem, but another problem appeared later, this time with std::wregex.

@LostRuins
Collaborator

That's unfortunate. I tried googling, but all I could find were some mentions of setting std::locale::global, which I'm assuming won't help at all.

@SinanAkkoyun

@foldl No, they are most likely referring to eval rate:

deepseek coder v2:
total duration: 10.863677554s
load duration: 2.141128ms
prompt eval count: 10 token(s)
prompt eval duration: 56.846ms
prompt eval rate: 175.91 tokens/s
eval count: 905 token(s)
eval duration: 10.667807s
eval rate: 84.83 tokens/s

gemma:2b:
total duration: 2.280008205s
load duration: 1.951113ms
prompt eval count: 40 token(s)
prompt eval duration: 12.748ms
prompt eval rate: 3137.75 tokens/s
eval count: 432 token(s)
eval duration: 2.119003s
eval rate: 203.87 tokens/s

Immense performance difference. Why is that the case? I am curious as well.

@SinanAkkoyun

(ignore the eval token counts; they make no difference at that scale, I tested it)

@fairydreaming
Collaborator

fairydreaming commented Jul 7, 2024

@foldl @SinanAkkoyun The performance difference is most likely caused by the fact that there are many more operations in the attention implementation of DeepSeek-V2 compared to Gemma. When Gemma calculates the query, key and value vectors it does the following tensor operations (there are 35 tokens in the example):

ggml_debug:                   Qcur-0 = (f32)    MUL_MAT(blk.0.attn_q.weight{2048, 2048, 1, 1}, attn_norm-0{2048, 35, 1, 1}}) = {2048, 35, 1, 1}
ggml_debug:        Qcur-0 (reshaped) = (f32)    RESHAPE(Qcur-0{2048, 35, 1, 1}, }) = {256, 8, 35, 1}
ggml_debug:                   Qcur-0 = (f32)       ROPE(Qcur-0 (reshaped){256, 8, 35, 1}, inp_pos{35, 1, 1, 1}}) = {256, 8, 35, 1}
ggml_debug:            Qcur_scaled-0 = (f32)      SCALE(Qcur-0{256, 8, 35, 1}, }) = {256, 8, 35, 1}
ggml_debug:                   Kcur-0 = (f32)    MUL_MAT(blk.0.attn_k.weight{2048, 256, 1, 1}, attn_norm-0{2048, 35, 1, 1}}) = {256, 35, 1, 1}
ggml_debug:        Kcur-0 (reshaped) = (f32)    RESHAPE(Kcur-0{256, 35, 1, 1}, }) = {256, 1, 35, 1}
ggml_debug:                   Kcur-0 = (f32)       ROPE(Kcur-0 (reshaped){256, 1, 35, 1}, inp_pos{35, 1, 1, 1}}) = {256, 1, 35, 1}
ggml_debug:                   Vcur-0 = (f32)    MUL_MAT(blk.0.attn_v.weight{2048, 256, 1, 1}, attn_norm-0{2048, 35, 1, 1}}) = {256, 35, 1, 1}

while DeepSeek-V2 does the following in the same part of the model layer (there are 17 tokens in the example):

ggml_debug:                      q-0 = (f32)    MUL_MAT(blk.0.attn_q.weight{2048, 3072, 1, 1}, attn_norm-0{2048, 17, 1, 1}}) = {3072, 17, 1, 1}
ggml_debug:                 q_nope-0 = (f32)       VIEW(q-0{3072, 17, 1, 1}, }) = {128, 16, 17, 1}
ggml_debug:                   q_pe-0 = (f32)       VIEW(q-0{3072, 17, 1, 1}, }) = {64, 16, 17, 1}
ggml_debug:            q_pe-0 (cont) = (f32)       CONT(q_pe-0{64, 16, 17, 1}, }) = {64, 16, 17, 1}
ggml_debug:                   q_pe-0 = (f32)       ROPE(q_pe-0 (cont){64, 16, 17, 1}, inp_pos{17, 1, 1, 1}}) = {64, 16, 17, 1}
ggml_debug:               q_states-0 = (f32)     CONCAT(q_nope-0{128, 16, 17, 1}, q_pe-0{64, 16, 17, 1}}) = {192, 16, 17, 1} 
ggml_debug:      kv_pe_compresseed-0 = (f32)    MUL_MAT(blk.0.attn_kv_a_mqa.weight{2048, 576, 1, 1}, attn_norm-0{2048, 17, 1, 1}}) = {576, 17, 1, 1}
ggml_debug:          kv_compressed-0 = (f32)       VIEW(kv_pe_compresseed-0{576, 17, 1, 1}, }) = {512, 17, 1, 1}
ggml_debug:   kv_compressed-0 (cont) = (f32)       CONT(kv_compressed-0{512, 17, 1, 1}, }) = {512, 17, 1, 1} 
ggml_debug:                   norm-0 = (f32)   RMS_NORM(kv_compressed-0 (cont){512, 17, 1, 1}, }) = {512, 17, 1, 1}
ggml_debug:          kv_compressed-0 = (f32)        MUL(norm-0{512, 17, 1, 1}, blk.0.attn_kv_a_norm.weight{512, 1, 1, 1}}) = {512, 17, 1, 1}
ggml_debug:                     kv-0 = (f32)    MUL_MAT(blk.0.attn_kv_b.weight{512, 4096, 1, 1}, kv_compressed-0{512, 17, 1, 1}}) = {4096, 17, 1, 1}
ggml_debug:                 k_nope-0 = (f32)       VIEW(kv-0{4096, 17, 1, 1}, }) = {128, 16, 17, 1}
ggml_debug:                   k_pe-0 = (f32)       VIEW(kv_pe_compresseed-0{576, 17, 1, 1}, }) = {64, 1, 17, 1}
ggml_debug:            k_pe-0 (cont) = (f32)       CONT(k_pe-0{64, 1, 17, 1}, }) = {64, 1, 17, 1}
ggml_debug:                   k_pe-0 = (f32)       ROPE(k_pe-0 (cont){64, 1, 17, 1}, inp_pos{17, 1, 1, 1}}) = {64, 1, 17, 1}
ggml_debug:                  node_19 = (f32)     REPEAT(k_pe-0{64, 1, 17, 1}, }) = {64, 16, 17, 1}
ggml_debug:               k_states-0 = (f32)     CONCAT(k_nope-0{128, 16, 17, 1}, node_19{64, 16, 17, 1}}) = {192, 16, 17, 1}
ggml_debug:               v_states-0 = (f32)       VIEW(kv-0{4096, 17, 1, 1}, }) = {128, 16, 17, 1}
ggml_debug:               v_states-0 = (f32)       CONT(v_states-0{128, 16, 17, 1}, }) = {128, 16, 17, 1}
ggml_debug:               v_states-0 = (f32)       VIEW(v_states-0{128, 16, 17, 1}, }) = {2048, 17, 1, 1}

DeepSeek-V2 uses MLA (Multi-head Latent Attention), which means there are some additional tensor operations in this part of the model. Other than that, there are also CONT operations making tensors contiguous in memory (as implementations of most CUDA tensor operations require that input tensors are contiguous in memory); these operations also carry a performance penalty.
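
As a side note, a tiny illustration (plain C++, unrelated to the actual ggml kernels) of why each CONT costs something: a strided, non-contiguous view has to be copied into a packed buffer before a kernel that assumes contiguous rows can consume it, which is pure extra memory traffic:

#include <cstddef>
#include <vector>

// Copy a strided (non-contiguous) 2D view into a packed (contiguous) buffer.
static std::vector<float> make_contiguous(const float * base, size_t rows, size_t cols, size_t row_stride) {
    std::vector<float> packed(rows * cols);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            packed[r * cols + c] = base[r * row_stride + c]; // the extra traffic a CONT pays for
        }
    }
    return packed;
}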

@SinanAkkoyun

@fairydreaming Thank you a lot for your clear explanation!

I was also wondering how OAI manages (speculative decoding aside) to make GPT-4 run so fast with presumably 200B active parameters.

Also, is it possible to optimize DeepSeek-Coder-V2 even further, or is 80 TPS the practical limit of this architecture today?

@fairydreaming
Collaborator

@fairydreaming Thank you a lot for your clear explanation!

Also, is it possible to optimize DeepSeek-Coder-V2 even further, or is 80 TPS the practical limit of this architecture today?

Yes, this is not an optimal implementation; it simply required the least amount of work. It can definitely be optimized further.

@SinanAkkoyun

@fairydreaming Thank you for the insight. Based on your intuition, what could the performance gain look like in TPS and how much work would it require? I might know some people who would be interested in taking this on if the potential improvement justifies the effort

@fairydreaming
Collaborator

@SinanAkkoyun I think you should profile the CUDA implementation first to identify likely bottlenecks. A wise man once said: premature optimization is the root of all evil. My intuition mumbles something about a few percent of improvement, but I wouldn't rely on it.

@SinanAkkoyun

@fairydreaming Thank you very very much, I appreciate your comments!

@SinanAkkoyun

SinanAkkoyun commented Jul 8, 2024

@fairydreaming vLLM just added DS Coder V2 support.
I tested Lite in FP16 and it was way faster than ollama:
vLLM FP16: 110.87 TPS
ollama Q4: 87.64 TPS

Could it be that some bigger gains could still be made for llama.cpp? I am not knowledgeable enough to assess the implementation, but perhaps something went unnoticed; I cannot imagine that vLLM's paged attention makes much difference for 40 generated tokens.

@fairydreaming
Collaborator

@fairydreaming vLLM just added DS Coder V2 support. I tested Lite in FP16 and it was way faster than ollama: vLLM FP16: 110.87 TPS, ollama Q4: 87.64 TPS

Could it be that some bigger gains could still be made for llama.cpp? I am not knowledgeable enough to assess the implementation, but perhaps something went unnoticed; I cannot imagine that vLLM's paged attention makes much difference for 40 generated tokens.

@SinanAkkoyun I don't know, maybe.

@fairydreaming
Collaborator

I just ran my farel-bench benchmark on the updated DeepSeek-V2 and the score is amazing! It's better than any other open-weights model! This is also confirmed by the ZebraLogic benchmark. So if you still use the older model, I think it's wise to update.

By the way, I got almost the same scores by running the Q8_0 quant locally in llama.cpp (score of 87.78) and by using the OpenRouter API (score of 87.56), so the implementation of this model seems to be in good shape in llama.cpp.

@Ishankhan21

Still getting the error below with llama.cpp on an Apple M2 Mac with 16 GB RAM, trying to run the lowest quantised model (around 6 GB).

[screenshot: error message]

@fairydreaming
Collaborator

@Ishankhan21 you are getting out-of-memory errors because you run the model without setting the context size (which results in the default value of 163840 being used); try adding, for example, -c 4096 to your llama.cpp command-line options.
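
For example (the model path is a placeholder; other options as needed):

./llama-cli -m ./DeepSeek-V2-Lite-Chat-Q4_K_M.gguf -c 4096 -p "Hello"

With -c 4096 the KV cache is allocated for 4096 tokens instead of the full 163840-token training context.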
