Add qwen2moe #6074
Conversation
ggml-backend.c
Outdated
@@ -1003,13 +1004,14 @@ static bool ggml_is_view_op(enum ggml_op op) {
 #endif

 #ifndef GGML_SCHED_MAX_SPLIT_INPUTS
-#define GGML_SCHED_MAX_SPLIT_INPUTS 16
+#define GGML_SCHED_MAX_SPLIT_INPUTS 256
What was the reason to increase this?
It's for the graph hash size.
I just increased and tested step by step, 16 → 32 → 64 → 128 → 256, and 256 works.
What is the problem that you are trying to solve with this change? This seems like a workaround for a different issue. What would be the correct fix? This constant will be removed in the short term in favor of dynamically allocating the array of splits, so whatever issue this is trying to workaround, we need to deal with that directly.
The problem is that when ggml_backend_sched_reserve calls ggml_gallocr_reserve_n, the hash values conflict.
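For readers unfamiliar with the failure mode, here is a minimal, purely illustrative sketch (not ggml's actual code) of why a fixed-capacity, open-addressing hash set starts to fail once a graph carries many more inputs than the capacity was sized for:

```c
#include <assert.h>
#include <stddef.h>

// Stand-in for a capacity derived from constants such as GGML_SCHED_MAX_SPLIT_INPUTS.
#define TABLE_SIZE 16

static const void * table[TABLE_SIZE];

// Linear-probing insert: once every slot is occupied, probing wraps all the way
// around and the insert cannot succeed, regardless of the hash value.
static size_t hash_insert(const void * key) {
    size_t h = ((size_t) key) % TABLE_SIZE;
    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        size_t slot = (h + i) % TABLE_SIZE;
        if (table[slot] == NULL || table[slot] == key) {
            table[slot] = key;
            return slot;
        }
    }
    assert(!"hash table full");  // a graph with 60 expert tensors per layer fills this quickly
    return (size_t) -1;
}
```

Raising GGML_SCHED_MAX_SPLIT_INPUTS enlarges the table and hides the symptom, which is why the question above is about the underlying cause rather than the constant.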
ggml.h
Outdated
@@ -227,7 +227,7 @@
 #define GGML_MAX_DIMS 4
 #define GGML_MAX_PARAMS 2048
 #define GGML_MAX_CONTEXTS 64
-#define GGML_MAX_SRC 10
+#define GGML_MAX_SRC 62
Hm, we are not well prepared for 60-expert MoE models. This change will more than double the size of ggml_tensor, and I still don't know how to support that many source buffers in Metal. What are our options here? cc @slaren
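For a rough sense of scale, here is a back-of-the-envelope sketch; the 352-byte figure for the rest of the struct is an assumption, not the exact sizeof(ggml_tensor) at the time. The point is that the sources live in a fixed array (`struct ggml_tensor * src[GGML_MAX_SRC]`), so every tensor pays for the maximum:

```c
#include <stdio.h>

int main(void) {
    const size_t ptr_size  = sizeof(void *); // 8 bytes on a 64-bit build
    const size_t base_size = 352;            // assumed size of ggml_tensor minus the src array

    // going from 10 to 62 sources adds 52 pointers, i.e. 416 extra bytes per tensor
    printf("GGML_MAX_SRC 10: ~%zu bytes\n", base_size + 10 * ptr_size);
    printf("GGML_MAX_SRC 62: ~%zu bytes\n", base_size + 62 * ptr_size);
    return 0;
}
```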
The size of each expert is very small. The ffn_moe_intermediate size is divided by 4.
We could store all the experts in the same tensor; I think it would be easy to adapt the implementations.
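A rough sketch of that idea, assuming a ggml context and placeholder hyperparameters (n_embd, n_ff_exp, n_expert are made up here); this only illustrates packing the experts along a third dimension and slicing one expert back out as a 2-D view, not the implementation that was eventually merged:

```c
#include "ggml.h"

int main(void) {
    // placeholder hyperparameters for illustration only
    const int64_t n_embd   = 2048;
    const int64_t n_ff_exp = 1408;
    const int64_t n_expert = 60;

    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // metadata only, no weight data
    };
    struct ggml_context * ctx = ggml_init(params);

    // all gate projections of all experts in a single 3-D tensor: [n_embd, n_ff_exp, n_expert]
    struct ggml_tensor * ffn_gate_exps =
        ggml_new_tensor_3d(ctx, GGML_TYPE_F16, n_embd, n_ff_exp, n_expert);

    // 2-D view of expert `e`, usable like a regular weight matrix
    const int e = 3;
    struct ggml_tensor * gate_e = ggml_view_2d(ctx, ffn_gate_exps,
            ffn_gate_exps->ne[0], ffn_gate_exps->ne[1],
            ffn_gate_exps->nb[1],
            e * ffn_gate_exps->nb[2]);
    (void) gate_e;

    ggml_free(ctx);
    return 0;
}
```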
Should we merge this PR first and then try to resolve this?
I am not sure if we need to merge this right now. As it is, this will only work with the CPU backend, the sort operators in the GPU backends also require the number of experts to be a power of two, so assuming that it is a large model, this implementation may not be very useful. When we have a model to test with, we can improve the implementation.
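As a side note on the power-of-two restriction: sort kernels of the bitonic-sort family need a power-of-two row length, and a common workaround (not what ggml did at the time) is to pad the row with sentinel values so the padding sorts to the end. A tiny sketch of the padding arithmetic:

```c
#include <stdint.h>
#include <stdio.h>

// Round n up to the next power of two, e.g. next_pow2(60) == 64.
static uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

int main(void) {
    // A 60-expert router row would be padded with 4 sentinel entries (e.g. -inf)
    // so the extra slots sort last and are never selected by top-k.
    printf("60 experts -> padded row length %u\n", next_pow2(60));
    return 0;
}
```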
I remember qwen2 has 1.8B×16. That makes sense.
@slaren do you have an implementation in mind to deal with this extra-large number of experts?
@ggerganov @slaren is there any update?
@simonJJJ @ggerganov
This PR looks awesome, can't wait for it to get merged into the main branch. Could you quickly take a look at the failed test case and patch it up?
We have to resolve the issues outlined earlier in the discussion before we can merge this. Sorry for the delay, but without proper CUDA and Metal support it does not make sense to merge. Now that there is a model available, it will be easier to complete.
I just tested the "Add qwen2moe #6074" PR on Metal (M3).

CONVERT: I tried to convert the HF model and got this error:

To make it work, I added:

USAGE: I then had another issue (the number of experts was not correct: 0), so I modified/added properties in the model's config.json. Now I get this error:

So far, I don't know how to fix this 😞. I hope this comment helps others.
I hope the update below could be merged to
I have added this model to ChatLLM.cpp. It works on CPU. Maybe add a warning when
It can't. Instead, we are moving all the experts to the same tensor. The work is being done in #6387.
@ggerganov just updated! Hope to merge ASAP.
llama.cpp
Outdated
ggml_tensor * cur_gate = ggml_mul_mat_id(ctx0, model.layers[il].ffn_gate_exps, selected_experts, i, cur);
cb(cur_gate, "ffn_moe_gate", il);

cur_gate = ggml_silu(ctx0, cur_gate);
There is now a build_moe_ffn helper function for mixtral, grok, and dbrx.
The arch is not the same: qwen2moe has shared experts in each block.
The MoE FFN still looks the same to me, and could be implemented using build_moe_ffn.
Another difference is that the "weights" are not normalized in qwen2moe.
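To make that difference concrete, here is a small standalone sketch (plain C, not ggml; the logits are arbitrary): Mixtral-style routing renormalizes the selected top-k probabilities so they sum to 1, while the qwen2moe routing described here keeps the raw softmax probabilities of the selected experts:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const float logits[4] = { 2.0f, 1.0f, 0.5f, -1.0f };  // arbitrary router logits
    float probs[4], sum = 0.0f;

    // softmax over all experts
    for (int i = 0; i < 4; ++i) { probs[i] = expf(logits[i]); sum += probs[i]; }
    for (int i = 0; i < 4; ++i) { probs[i] /= sum; }

    // suppose top-2 selects experts 0 and 1
    const float w0 = probs[0], w1 = probs[1];
    const float topk_sum = w0 + w1;

    printf("unnormalized (qwen2moe-style): %.3f %.3f (sum %.3f)\n", w0, w1, topk_sum);
    printf("renormalized (mixtral-style) : %.3f %.3f (sum %.3f)\n",
           w0 / topk_sum, w1 / topk_sum, 1.0f);
    return 0;
}
```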
llama.cpp
Outdated
// sigmoid
ggml_tensor * logits_shared_exp = ggml_silu(ctx0, gate_shared_exp);
cb(logits_shared_exp, "ffn_moe_logits_shared_exp", il);

ggml_tensor * probs_shared_exp = ggml_div(ctx0, logits_shared_exp, gate_shared_exp);
cb(probs_shared_exp, "ffn_moe_probs_shared_exp", il);

ggml_tensor * ffn_shared_exp = llm_build_ffn(ctx0, cur,
        model.layers[il].ffn_up_shared_exp, NULL,
        model.layers[il].ffn_gate_shared_exp, NULL,
        model.layers[il].ffn_down_shared_exp, NULL,
        NULL,
        LLM_FFN_SILU, LLM_FFN_PAR, cb, il);
cb(ffn_shared_exp, "ffn_moe_shared_exp", il);

ggml_tensor * ffn_shared_exp_out = ggml_mul(ctx0, ffn_shared_exp, probs_shared_exp);
cb(ffn_shared_exp_out, "ffn_moe_shared_exp_out", il);
Is this meant to be equivalent to this?
https://github.com/bozheng-hit/transformers/blob/d9dc993fdd6e6b0a61fe68ccbe838e00c73b9f80/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py#L860-L863
I understand now that the purpose of the division is to transform the silu into a sigmoid, but the names of the variables could be more descriptive. Is calling the silu the "logits" and the sigmoid the "probabilities" really accurate here?
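For anyone following along, the identity being relied on here is silu(x) = x·sigmoid(x), so silu(x)/x = sigmoid(x) for x ≠ 0; dividing the silu of the shared-expert gate by the gate itself recovers a plain sigmoid. A quick standalone check (not ggml code):

```c
#include <math.h>
#include <stdio.h>

static float sigmoid_f(float x) { return 1.0f / (1.0f + expf(-x)); }
static float silu_f(float x)    { return x * sigmoid_f(x); }

int main(void) {
    const float x = 0.7f;                        // arbitrary nonzero input
    printf("silu(x)/x  = %f\n", silu_f(x) / x);  // prints the same value...
    printf("sigmoid(x) = %f\n", sigmoid_f(x));   // ...as the sigmoid itself
    return 0;
}
```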
"logits" ∈(−∞,∞) and "probs" ∈[0,1] look good to me. I can not find better options here.
@ggerganov is there any check that hinders the merging?
Should be ready soon - slaren will merge when done with the review.
@@ -1723,6 +1751,7 @@ enum e_model {
     MODEL_MEDIUM,
     MODEL_LARGE,
     MODEL_XL,
+    MODEL_A2_7B,
This also needs an entry in llama_model_type_name for its string representation.
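Something along these lines, assuming the usual switch-based llama_model_type_name; the "A2.7B" string is a suggestion, not taken from the PR:

```c
// minimal sketch, not the full function from llama.cpp
static const char * llama_model_type_name_sketch(enum e_model type) {
    switch (type) {
        case MODEL_A2_7B: return "A2.7B";
        default:          return "?B";
    }
}
```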
* support qwen2moe
* fix-review
* metal : support unary ops for nelements % 4 != 0
* metal : require contiguousness for float4 unary kernels
* metal : require contiguousness for float4 unary kernels (cont)
* fix-review
* names : for brevity "SHARED_EXP" -> "SHEXP"
* llama : reuse build_moe_ffn()
* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This PR adds support for the upcoming Qwen2 MoE models on Hugging Face. I changed several macro values to support the 60-expert setting. @ggerganov