
[WIP] add deepseek-v3 #35926

Open · wants to merge 60 commits into base: main

Conversation

@bzantium (Contributor) commented Jan 28, 2025

What does this PR do?

This PR adds the code for DeepSeek-V3. The code relies heavily on the original remote code.

Resolves: #35425

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

to: @ArthurZucker

@Rocketknight1 (Member)

Hi @bzantium, this looks great so far! We'll need tests added for the model plus a green CI, and then feel free to ping me to assign a reviewer, or if you have any problems with the port.

@bzantium changed the title from "[WIP] add deepseekv3" to "[WIP] add deepseek-v3" on Jan 29, 2025
@ArthurZucker (Collaborator) left a comment:

Ultra kudos! It's super nice.
Mostly missing tests; here you can use a similar approach to the gemma2 tests, which use inheritance!

@cuichenx

@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so, how many GPUs did you need? Thanks!

@ArthurZucker (Collaborator) commented Jan 29, 2025

One big thing would be TP support; the base_model_tp_plan would probably need to be updated to make sure each MLP's gate/up/down projections have the correct order, unless direct usage of dist removes this need.

@casper-hansen

This is great work and I'm looking forward to trying it out. For multi-token prediction, is this planned to be implemented in this PR via the num_nextn_predict_layers attribute in the config?

@bzantium (Contributor, Author) commented Jan 30, 2025

Thanks for the detailed comments; following them, I revised the code quite a lot and fixed some mismatches between the original code and this PR. I checked that the outputs from both are the same, so I think I can now add test code. For TP support, I think it can be applied only to the MLP layers but not to self_attn, because the attention functions split on the hidden_dim. I added the following:

    base_model_tp_plan = {
        "layers.*.gate_proj": "colwise",
        "layers.*.up_proj": "colwise",
        "layers.*.down_proj": "rowwise",
    }

to: @ArthurZucker

@mseeger commented Feb 18, 2025

OK, I sent a PR against the branch with some fixes.

One reason for the remaining failures could be that the head size of the K and V tensors differs from the usual case, because you have these additional qk_rope_head_dim entries. That may mean some of the common tests have to be generalized.
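
For concreteness, a tiny standalone sketch of the mismatch (the numbers below are the DeepSeek-V3 config values as I understand them, used here only for illustration):

    # Query/key heads carry both the "nope" and the rope part, while value heads do not,
    # so common tests that assume a single config.head_dim break.
    qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim   # 192
    assert qk_head_dim != v_head_dim                     # K and V head sizes differ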

@fungaren mentioned this pull request on Feb 18, 2025
@bzantium (Contributor, Author) commented Feb 18, 2025

Based on the test logs, I found two reasons:

  1. As @mseeger said, some tests are not compatible with multi-latent attention (they check head_dim), so I skipped some tests (maybe more need to be skipped).
  2. Because of the load_pre_hook, the loaded and saved tensors end up different, so we need to add a save_hook or remove the load_hook; see the sketch below.
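
For illustration, a minimal self-contained sketch of the load/save pairing this implies (not the PR's code; it only uses PyTorch's private module state-dict hooks, and the key names are made up):

    import torch
    import torch.nn as nn

    class RenameOnLoad(nn.Module):
        """Rename a checkpoint key on load and rename it back on save,
        so that save/load round-trips instead of diverging."""

        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(4, 4, bias=False)
            self._register_load_state_dict_pre_hook(self._load_hook)
            self._register_state_dict_hook(self._save_hook)

        @staticmethod
        def _load_hook(state_dict, prefix, *args):
            # checkpoints store the tensor under "old_proj.weight"
            old_key = prefix + "old_proj.weight"
            if old_key in state_dict:
                state_dict[prefix + "proj.weight"] = state_dict.pop(old_key)

        @staticmethod
        def _save_hook(module, state_dict, prefix, local_metadata):
            # invert the load-time rename so saved checkpoints keep the original format
            new_key = prefix + "proj.weight"
            if new_key in state_dict:
                state_dict[prefix + "old_proj.weight"] = state_dict.pop(new_key)

    m = RenameOnLoad()
    m.load_state_dict({"old_proj.weight": torch.randn(4, 4)})   # load hook remaps the key
    assert "old_proj.weight" in m.state_dict()                  # save hook maps it back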

@ArthurZucker (Collaborator)

Ah interesting, that is indeed an issue (load / save) to be careful about.

@ArthurZucker (Collaborator)

Okay, I'll give it another shot in a bit!

self.config = config
self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)

Replace this line with:

        if self.rope_type != "yarn":
            # If we pass `config` to `rope_init_fn`, the dimension is set to a wrong
            # value (for standard multi-head attention):
            self._rope_kwargs = dict(
                base=config.rope_theta, dim=config.qk_rope_head_dim,
            )
            if config.rope_scaling is not None:
                self._rope_kwargs["factor"] = config.rope_scaling["factor"]
            if config.max_position_embeddings is not None:
                self._rope_kwargs["max_position_embeddings"] = config.max_position_embeddings
        else:
            # TODO: `_compute_yarn_parameters` requires `config` and will lead
            # to wrong dimension in this case
            self._rope_kwargs = {"config": config}
        inv_freq, self.attention_scaling = self.rope_init_fn(device=device, **self._rope_kwargs)

@bzantium (Contributor, Author):

I revised the src/transformers/modeling_rope_utils.py script as well for RoPE; could you check this file?


Revise again:

        if self.rope_type != "yarn":
            # If we pass `config` to `rope_init_fn`, the dimension is set to a wrong
            # value (for standard multi-head attention):
            self._rope_kwargs = dict(
                base=config.rope_theta, dim=config.qk_rope_head_dim,
            )
            if config.rope_scaling is not None:
                self._rope_kwargs["factor"] = config.rope_scaling["factor"]
            if config.max_position_embeddings is not None:
                self._rope_kwargs["max_position_embeddings"] = config.max_position_embeddings
        else:
            # We can pass `config` and `dim` in this case:
            self._rope_kwargs = {"config": config, "dim": dim=config.qk_rope_head_dim}
        inv_freq, self.attention_scaling = self.rope_init_fn(device=device, **self._rope_kwargs)

"""
seq_len = torch.max(position_ids) + 1
if seq_len > self.max_seq_len_cached: # growth
inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)

Replace this line with:

inv_freq, self.attention_scaling = self.rope_init_fn(device=device, seq_len=seq_len, **self._rope_kwargs)

@bzantium (Contributor, Author):

Same as my comment above.

causal_mask,
position_ids,
past_key_values,
output_attentions,

Insert this line after output_attentions:

False,  # output_router_logits

@bzantium (Contributor, Author):

I removed output_router_logits for now, to first fix the essential modeling problems. Thanks for pointing out what I missed.

use_cache,
cache_position,
position_embeddings,
)

Insert this line:

**flash_attn_kwargs,

@mseeger left a comment:

Left some comments with fixes

@bzantium (Contributor, Author) commented Feb 19, 2025

> Left some comments with fixes

Thanks for the comments! You can give suggestions directly, like below (not just as text): click the "Add a suggestion" button after selecting the lines to fix, and replace the original code with your code.

[screenshot: the "Add a suggestion" button in the GitHub review toolbar]

Also, you should give suggestions on modular_deepseek_v3.py, because modeling_deepseek_v3.py is automatically generated from the modular file.

@bzantium (Contributor, Author)

I found more reasons for the failures:

    1 failed because `AssertionError: nan not found in [0.0, 1.0]` -> Parameter layers.2.mlp.gate.weight of model <class 'transformers.models.deepseek_v3.modeling_deepseek_v3.DeepseekV3Model'> seems not properly initialized
    1 failed because `AssertionError: False is not true` -> model.layers.2.mlp.experts.0.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
    2 failed because `AssertionError: DeepseekV3Model: Tensor layers.2.mlp.gate.weight` -> Tensor-likes are not close!
    2 failed because `AssertionError: False is not true` -> model.layers.2.mlp.experts.4.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!

  1. The first one is because the gate weight is initialized with torch.empty:
     self.weight = nn.Parameter(torch.empty((self.n_routed_experts, config.hidden_size)))
  2. The "has no gradient" problem is probably due to the top-k selection, which is the key part of the MoE. I think this is mainly caused by how I implemented get_topk_indices for the router; see the sketch below.

@@ -189,13 +189,31 @@ def _compute_yarn_parameters(
partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
dim = int(head_dim * partial_rotary_factor)

A problem is that dim is wrong for the DeepSeek model; it has to be config.qk_rope_head_dim. I'd suggest the following (sorry, I cannot comment on parts of the code that were not changed):

def _compute_yarn_parameters(
    config: PretrainedConfig,
    device: "torch.device",
    seq_len: Optional[int] = None,
    dim: Optional[int] = None,
    **rope_kwargs
) -> Tuple["torch.Tensor", float]:

Then, replace line 191 with:

if dim is None:
    dim = int(head_dim * partial_rotary_factor)

Now, we can call it with the correct dim.
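
As a tiny standalone illustration of that fallback (the function name below is made up; the real change would go into _compute_yarn_parameters in modeling_rope_utils.py):

    from typing import Optional

    def resolve_rope_dim(head_dim: int, partial_rotary_factor: float = 1.0, dim: Optional[int] = None) -> int:
        # derive dim from head_dim only when the caller did not pass it explicitly
        if dim is None:
            dim = int(head_dim * partial_rotary_factor)
        return dim

    assert resolve_rope_dim(head_dim=128) == 128         # existing models keep their behaviour
    assert resolve_rope_dim(head_dim=192, dim=64) == 64  # DeepSeek would pass dim=config.qk_rope_head_dim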

@bzantium (Contributor, Author):

I take care of this here, in the configuration file:

self.head_dim = qk_rope_head_dim


I would not recommend that. self.head_dim is used in other places and should remain independent of RoPE. In fact, the correct head_dim for DeepSeek models would be qk_nope_head_dim + qk_rope_head_dim.


The code in modeling_rope_utils.py mostly allows passing dim as an input, so we should just use that, I think.

factor = config.rope_scaling["factor"]
attention_factor = config.rope_scaling.get("attention_factor")

I don't know the intricacies of RoPE for DeepSeek, would trust you here.

self.config = config
self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)

Revise again:

        if self.rope_type != "yarn":
            # If we pass `config` to `rope_init_fn`, the dimension is set to a wrong
            # value (for standard multi-head attention):
            self._rope_kwargs = dict(
                base=config.rope_theta, dim=config.qk_rope_head_dim,
            )
            if config.rope_scaling is not None:
                self._rope_kwargs["factor"] = config.rope_scaling["factor"]
            if config.max_position_embeddings is not None:
                self._rope_kwargs["max_position_embeddings"] = config.max_position_embeddings
        else:
            # We can pass `config` and `dim` in this case:
            self._rope_kwargs = {"config": config, "dim": dim=config.qk_rope_head_dim}
        inv_freq, self.attention_scaling = self.rope_init_fn(device=device, **self._rope_kwargs)

pass


class DeepseekV3Model(LlamaModel):

OK, so IF modeling_deepseek_v3.py is indeed created from modular_deepseek_v3.py (and I am not sure about this), then this here will not work at all, right? This would simply be the LlamaModel.


One would at least have to copy the __init__ and make sure that DeepseekV3DecoderLayer is used. But we also need to use DeepseekV3RotaryEmbedding, etc.

@bzantium (Contributor, Author):

This is not just Python inheritance; you can check what the generated modeling_deepseek_v3.py looks like.
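
For readers unfamiliar with the mechanism, a rough sketch of the modular pattern (illustrative only; the class bodies below are stand-ins, and the expansion is done by utils/modular_model_converter.py in transformers):

    # modular_deepseek_v3.py (sketch): inherit from Llama and override only what changed.
    # The converter expands this into modeling_deepseek_v3.py, copying the inherited
    # Llama code, renaming "Llama" to "DeepseekV3", and applying the overrides.
    from transformers.models.llama.modeling_llama import LlamaModel, LlamaRotaryEmbedding

    class DeepseekV3RotaryEmbedding(LlamaRotaryEmbedding):
        pass  # overrides (e.g. yarn/mscale handling, dim=qk_rope_head_dim) would go here

    class DeepseekV3Model(LlamaModel):
        pass  # the converter emits a full DeepseekV3Model; whether the DeepseekV3*
              # submodules are picked up automatically is what this thread is discussing

The generated modeling file is what ships, and a repo-consistency check verifies that it matches what the converter produces from the modular file.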


I am still learning. Can you point me to where modeling is created from modular? My impression is that one can change them independently, and there is some automatic check for differences.

I also understand the role of the # Copied from ... comments; one can even specify some transformation rules at the end. But this is not the case for DeepseekV3Model. @ArthurZucker?


OK, and even if this really works, so that the classes called DeepseekXYZ in modeling_deepseek_v3.py are generated automatically from modular_deepseek_v3.py by taking the corresponding code for LlamaXYZ and simply replacing "Llama" with "DeepseekV3" everywhere (and I doubt that),

we still need to copy DeepseekV3RotaryEmbedding from modeling to modular, because that code has genuinely changed.

Collaborator:

> One would at least have to copy the __init__ and make sure that DeepseekV3DecoderLayer is used. But we also need to use DeepseekV3RotaryEmbedding, etc.

That is valid; this is what users should expect. We still need some tests to make sure that we raise an error when the layers being used are not changed, as it should be just inheritance!

pass


class DeepseekV3RotaryEmbedding(LlamaRotaryEmbedding):

Need to copy the changed code over from modeling_deepseek_v3.py, because the RoPE embeddings have changed from Llama. First, there are your changes with mscale, etc. Second, my changes to pass dim=config.qk_rope_head_dim.

@mseeger commented Feb 25, 2025

Is there something I can help with?

@ArthurZucker (Collaborator)

Sorry, I have been working on #36335 to give us the tools to run the model, as it was very, very slow just loading the full checkpoint!

@casper-hansen

This is a smaller, trained model using the DeepSeek-V3 architecture, in BF16 rather than FP8. It might be helpful :)
https://huggingface.co/moonshotai/Moonlight-16B-A3B-Instruct
