
[BUG] Invalidate trace cache @ step 1: expected module 25, but got module 323, how to resolve it? #5006

Open
awzhgw opened this issue Jan 24, 2024 · 10 comments
Assignees: tohtana
Labels: bug, training


awzhgw commented Jan 24, 2024

Describe the bug

  1. I am training the Mixtral 8x7B model. After about 270 training steps it hangs, and after 30 minutes NCCL times out and the process is killed. The log shows:

Invalidate trace cache @ step 1: expected module 25, but got module 323

  2. DeepSpeed version: deepspeed 0.13.1

  3. The code is:

import deepspeed
import transformers
from transformers import MixtralForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

config = transformers.AutoConfig.from_pretrained(model_args.model_name_or_path)
config.num_hidden_layers = 2  # shrink the model to 2 hidden layers
model = MixtralForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    **bnb_model_from_pretrained_args,
)
# Mark the sparse-MoE block as a ZeRO-3 leaf module
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
  4. My DeepSpeed config is:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5,
  "wall_clock_breakdown": false
}
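
For reference, a rough sketch of how the snippet and the ZeRO-3 config above fit together when driven directly through deepspeed.initialize rather than the Hugging Face Trainer; the checkpoint name and config filename below are illustrative placeholders, not the actual training setup:

import deepspeed
from deepspeed.utils import set_z3_leaf_modules
from transformers import MixtralForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

# Placeholder checkpoint name, not the reporter's actual model path.
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Treat each sparse-MoE block as a single ZeRO-3 "leaf": all expert parameters are
# gathered at once, so the data-dependent expert routing no longer changes the
# submodule execution order that the prefetch trace cache records.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# The ZeRO-3 JSON above declares the AdamW optimizer, so DeepSpeed builds it from
# model_parameters. Note that the "auto" values must be replaced with concrete
# numbers when calling deepspeed.initialize directly (the HF Trainer fills them in).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_zero3_config.json",  # placeholder filename for the config above
)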
awzhgw added the bug and training labels on Jan 24, 2024
tohtana self-assigned this on Jan 25, 2024

JakobLS commented Mar 5, 2024

Hi @tohtana,

I get a similar error when using DeepSpeed via Hugging Face Accelerate to train SDXL. It happens during evaluation after the first epoch, at which point training simply freezes:

Invalidate trace cache @ step 4: expected module 1928, but got module 6

My deepspeed config is as follows:

{
    "fp16": {
        "enabled": true, 
        "auto_cast": true,
        "initial_scale_power": 16
    }, 
    "bf16": {
        "enabled": false
    },
    "zero_optimization": {
        "stage": 3,
        "round_robin_gradients": false,
        "load_from_fp32_weights": false,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
        "stage3_gather_16bit_weights_on_model_save": true,
        "zero_quantized_weights": false,
        "zero_hpz_partition_size": 1,
        "zero_quantized_gradients": true
    },
    "gradient_clipping": 1.0,
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}

Using the following library versions:

accelerate==0.27.2
deepspeed==0.13.4
diffusers==0.27.0.dev0 
torch==2.1.1+cu118

@liuchengyuan123

> (quoting @JakobLS's comment and config above)

same!


Sander-houqi commented Apr 25, 2024

Same here, except I'm not using MixtralForCausalLM but Qwen2ForCausalLM (no MoE): I get the warning, but it does not break the training process.

Even with the trace cache effectively disabled (see the config below, with prefetching and parameter caching turned off), the warning still appears, but training does not break.

{
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },
  "bf16": {
      "enabled": "auto"
  },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
      "stage": 3,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": 5e8,
      "stage3_prefetch_bucket_size": 0,
      "stage3_param_persistence_threshold": 1e6,
      "stage3_max_live_parameters": 0,
      "stage3_max_reuse_distance": 0,
      "stage3_gather_16bit_weights_on_model_save": true
  }
}


vikram71198 commented Apr 25, 2024

I'm facing the same issue as @JakobLS.

After the first epoch, I get the message Invalidate trace cache @ step 0: expected module 0, but got module 456 and then training simply freezes and does not proceed.

@chenyunsai

I have the same issue. Is there a solution?


sxhysj commented Jun 26, 2024

Same issue, my deepspeed config is:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
           "device": "nvme",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp",
            "buffer_count": 40
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp2",
            "buffer_count": 40
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto",
        "reduce_bucket_size": 1e6
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

@Griffintaur

@tohtana Any insights on how to debug this and determine whether the issue is in the code or in the configuration?

tjruwase (Contributor) commented Jul 8, 2024

@Griffintaur, can you please see if this new API can help?
#4966
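
Assuming that refers to the ZeRO-3 leaf-module API (deepspeed.utils.set_z3_leaf_modules, the call already used in the original report), here is a minimal self-contained sketch; MyRoutedBlock is a made-up stand-in for any module whose forward pass takes data-dependent paths, such as MoE expert routing:

import torch
from deepspeed.utils import set_z3_leaf_modules

class MyRoutedBlock(torch.nn.Module):
    """Made-up example of a block with data-dependent control flow (like MoE routing)."""
    def __init__(self, dim: int = 16, num_experts: int = 4):
        super().__init__()
        self.experts = torch.nn.ModuleList(torch.nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pick an expert based on the input itself, so the set of submodules that
        # run changes from batch to batch -- exactly what breaks the trace cache.
        idx = int(x.abs().sum().item()) % len(self.experts)
        return self.experts[idx](x)

model = torch.nn.Sequential(MyRoutedBlock(), torch.nn.Linear(16, 4))

# Mark the routed block as a ZeRO-3 leaf: ZeRO-3 then gathers all of its parameters
# (every expert) in one shot instead of tracing and prefetching its submodules.
marked = set_z3_leaf_modules(model, [MyRoutedBlock])
print(f"Marked {len(marked)} module(s) as ZeRO-3 leaf modules")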

@Chuge0335

same issue

@tjruwase (Contributor)

@Chuge0335, can you clarify if your run is hanging or just printing the warning message?
