[BUG] Invalidate trace cache @ step 1: expected module 25, but got module 323 — how to resolve it? #5006
Comments
Hi @tohtana, I get a similar error when using DeepSpeed via Hugging Face Accelerate to train SDXL. It happens during evaluation after the first epoch, where training simply freezes:
My deepspeed config is as follows:

```json
{
  "fp16": {
    "enabled": true,
    "auto_cast": true,
    "initial_scale_power": 16
  },
  "bf16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 3,
    "round_robin_gradients": false,
    "load_from_fp32_weights": false,
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8,
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_quantized_weights": false,
    "zero_hpz_partition_size": 1,
    "zero_quantized_gradients": true
  },
  "gradient_clipping": 1.0,
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Using the following library versions:
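A note on the `"auto"` entries in the config above: when DeepSpeed is driven through Hugging Face Accelerate, those placeholders are filled in from the launcher's training arguments before the engine is initialized. The resolver below is a simplified illustration of that substitution, not Accelerate's actual code; the batch-size values are hypothetical.

```python
import json

# A DeepSpeed config fragment with "auto" placeholders, as in the config above.
ds_config = json.loads("""{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}""")

def resolve_auto(config, micro_batch, grad_accum, world_size):
    """Fill "auto" batch-size fields from launcher arguments (illustration only)."""
    filled = dict(config)
    if filled.get("train_micro_batch_size_per_gpu") == "auto":
        filled["train_micro_batch_size_per_gpu"] = micro_batch
    if filled.get("gradient_accumulation_steps") == "auto":
        filled["gradient_accumulation_steps"] = grad_accum
    if filled.get("train_batch_size") == "auto":
        # Global batch = per-GPU micro batch * accumulation steps * number of GPUs.
        filled["train_batch_size"] = micro_batch * grad_accum * world_size
    return filled

# Hypothetical values: micro batch 4, 2 accumulation steps, 2 GPUs.
resolved = resolve_auto(ds_config, micro_batch=4, grad_accum=2, world_size=2)
print(resolved["train_batch_size"])  # 4 * 2 * 2 = 16
```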
Same!
Same here. I'm not using MixtralForCausalLM but Qwen2ForCausalLM (without MoE); I still get the warning, but it doesn't break the training process. Even with the trace cache disabled, the warning still appears, but training is not interrupted.
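For reference, the trace cache drives ZeRO-3's parameter prefetching, so its influence can be reduced through the ZeRO-3 settings. A hedged sketch (these keys do exist in DeepSpeed's `zero_optimization` config; setting them to 0 disables prefetching and parameter-reuse tracking, trading some speed for robustness when the forward graph changes between steps — whether this silences the warning in your setup is not guaranteed):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_reuse_distance": 0
  }
}
```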
I'm facing the same issue as @JakobLS. After the first epoch, I get the message
I have the same issue. Is there a way to solve it?
Same issue, my deepspeed config is:
@tohtana Any insights on how to debug this, to find out whether the issue is with the code or the configuration?
@Griffintaur, can you please see if this new API can help?
Same issue.
@Chuge0335, can you clarify if your run is hanging or just printing the warning message? |
Describe the bug
Invalidate trace cache @ step 1: expected module 25, but got module 323
DeepSpeed version: 0.13.1
Code: