
[BUG] Invalidate trace cache @ step 1: expected module 25, but got module 323, how to resolve it? #5006

Open
awzhgw opened this issue Jan 24, 2024 · 10 comments
Assignees: tohtana
Labels: bug, training


awzhgw commented Jan 24, 2024

Describe the bug

  1. I am training the Mixtral 8x7B model. After about 270 training steps it hangs, and after 30 minutes NCCL times out and the process is killed. The log shows:

Invalidate trace cache @ step 1: expected module 25, but got module 323

  2. DeepSpeed version: deepspeed 0.13.1

  3. The code is:

import deepspeed
import transformers
from transformers import MixtralForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

config = transformers.AutoConfig.from_pretrained(model_args.model_name_or_path)
config.num_hidden_layers = 2  # shrink the model to 2 hidden layers
model = MixtralForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    **bnb_model_from_pretrained_args,
)
# Mark the sparse-MoE block as a ZeRO-3 leaf module
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
  4. My DeepSpeed config is:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5,
  "wall_clock_breakdown": false
}
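
For reference, a rough sketch of how the snippet and the ZeRO-3 config above fit together when driven directly through deepspeed.initialize rather than the Hugging Face Trainer; the checkpoint name and config filename below are illustrative placeholders, not the actual training setup:

import deepspeed
from deepspeed.utils import set_z3_leaf_modules
from transformers import MixtralForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

# Placeholder checkpoint name, not the reporter's actual model path.
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Treat each sparse-MoE block as a single ZeRO-3 "leaf": all expert parameters are
# gathered at once, so the data-dependent expert routing no longer changes the
# submodule execution order that the prefetch trace cache records.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# The ZeRO-3 JSON above declares the AdamW optimizer, so DeepSpeed builds it from
# model_parameters. Note that the "auto" values must be replaced with concrete
# numbers when calling deepspeed.initialize directly (the HF Trainer fills them in).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_zero3_config.json",  # placeholder filename for the config above
)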
awzhgw added the bug and training labels on Jan 24, 2024
tohtana self-assigned this on Jan 25, 2024

JakobLS commented Mar 5, 2024

Hi @tohtana,

I get a similar error when using DeepSpeed via Hugging Face Accelerate to train SDXL. It happens during evaluation after the first epoch, at which point training simply freezes:

Invalidate trace cache @ step 4: expected module 1928, but got module 6

My deepspeed config is as follows:

{
    "fp16": {
        "enabled": true, 
        "auto_cast": true,
        "initial_scale_power": 16
    }, 
    "bf16": {
        "enabled": false
    },
    "zero_optimization": {
        "stage": 3,
        "round_robin_gradients": false,
        "load_from_fp32_weights": false,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
        "stage3_gather_16bit_weights_on_model_save": true,
        "zero_quantized_weights": false,
        "zero_hpz_partition_size": 1,
        "zero_quantized_gradients": true
    },
    "gradient_clipping": 1.0,
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}

Using the following library versions:

accelerate==0.27.2
deepspeed==0.13.4
diffusers==0.27.0.dev0 
torch==2.1.1+cu118

@liuchengyuan123

> (quoting @JakobLS's comment and config above)

same!


Sander-houqi commented Apr 25, 2024

Same here, except I'm not using MixtralForCausalLM but Qwen2ForCausalLM (no MoE): I get the warning, but it does not break the training process.

Even with the trace cache effectively disabled (see the config below, with prefetching and parameter caching turned off), the warning still appears, but training does not break.

{
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },
  "bf16": {
      "enabled": "auto"
  },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
      "stage": 3,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": 5e8,
      "stage3_prefetch_bucket_size": 0,
      "stage3_param_persistence_threshold": 1e6,
      "stage3_max_live_parameters": 0,
      "stage3_max_reuse_distance": 0,
      "stage3_gather_16bit_weights_on_model_save": true
  }
}


vikram71198 commented Apr 25, 2024

I'm facing the same issue as @JakobLS.

After the first epoch, I get the message Invalidate trace cache @ step 0: expected module 0, but got module 456 and then training simply freezes and does not proceed.

@chenyunsai

I have the same issue. Is there a solution?


sxhysj commented Jun 26, 2024

Same issue, my deepspeed config is:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
           "device": "nvme",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp",
            "buffer_count": 40
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp2",
            "buffer_count": 40
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto",
        "reduce_bucket_size": 1e6
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

@Griffintaur

@tohtana Any insights on how to debug this and determine whether the issue is in the code or in the configuration?

tjruwase (Contributor) commented Jul 8, 2024

@Griffintaur, can you please see if this new API can help?
#4966
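
Assuming that refers to the ZeRO-3 leaf-module API (deepspeed.utils.set_z3_leaf_modules, the call already used in the original report), here is a minimal self-contained sketch; MyRoutedBlock is a made-up stand-in for any module whose forward pass takes data-dependent paths, such as MoE expert routing:

import torch
from deepspeed.utils import set_z3_leaf_modules

class MyRoutedBlock(torch.nn.Module):
    """Made-up example of a block with data-dependent control flow (like MoE routing)."""
    def __init__(self, dim: int = 16, num_experts: int = 4):
        super().__init__()
        self.experts = torch.nn.ModuleList(torch.nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pick an expert based on the input itself, so the set of submodules that
        # run changes from batch to batch -- exactly what breaks the trace cache.
        idx = int(x.abs().sum().item()) % len(self.experts)
        return self.experts[idx](x)

model = torch.nn.Sequential(MyRoutedBlock(), torch.nn.Linear(16, 4))

# Mark the routed block as a ZeRO-3 leaf: ZeRO-3 then gathers all of its parameters
# (every expert) in one shot instead of tracing and prefetching its submodules.
marked = set_z3_leaf_modules(model, [MyRoutedBlock])
print(f"Marked {len(marked)} module(s) as ZeRO-3 leaf modules")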

@Chuge0335

same issue

@tjruwase (Contributor)

@Chuge0335, can you clarify if your run is hanging or just printing the warning message?
