Training with PEFT + Accelerate randomly gets stuck with DeepSpeed after the first epoch #2724
Comments
@muellerzr @pacman100 @BenjaminBossan can you please help with this?
Would like an update.
Yeah, don't expect this to be resolved. I see a lot of issues here that are not even addressed by the maintainers of this repo. We're on our own, really.
I tried to reproduce the issue but couldn't run the script as-is because of memory issues (I used two T4s). Therefore, I had to make some changes to the script (most notably using a smaller model, see below). For me, this passed successfully when running with the changes below. I'm not sure which of the changes (if any) causes this to pass for me but not for you. In general, I would recommend checking this DeepSpeed + PEFT guide, as it is known to work.

1d0
<
8c7
< os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
---
> #os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" # BB
11c10
< os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
---
> os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
27d25
< import flash_attn
41,42c39,41
< model_name = "teknium/OpenHermes-2.5-Mistral-7B"
< output_dir = "enter-your-output-dir-here"
---
> #model_name = "teknium/OpenHermes-2.5-Mistral-7B"
> model_name = "facebook/opt-125m" # BB
> output_dir = "/tmp/"
46,47c45,46
< batch_size = 1
< max_length = 4096
---
> batch_size = 2 # BB
> max_length = 32 # BB
63c62
< for i in range(8):
---
> for i in range(2):
69c68
< for i in range(8):
---
> for i in range(2):
137c136
<
---
>
170c169
< model = AutoModelForCausalLM.from_pretrained(model_name, device_map = None, torch_dtype = torch.bfloat16)
---
> model = AutoModelForCausalLM.from_pretrained(model_name, device_map = None, torch_dtype = torch.float16) # BB
185c184
<
---
>
190c189
< gradient_accumulation_steps = 1,
---
> gradient_accumulation_steps = 4, # BB
198c197
< weight_decay = 0.2,
---
> weight_decay = 0.2,
200c199
< logging_strategy = "epoch",
---
> logging_strategy = "epoch",
203,205c202,205
< report_to = "none",
< deepspeed = deepspeed_config,
< bf16 = True,
---
> report_to = "none", # BB
> # deepspeed = deepspeed_config, # BB
> bf16 = False, # BB
> fp16 = True, # BB
225,268c225,226
< deepspeed_config = {
< "fp16": {"enabled": False},
< "bf16": {"enabled": True},
< "optimizer": {
< "type": "AdamW",
< "params": {
< "lr": "auto",
< "betas": "auto",
< "eps": "auto",
< "weight_decay": "auto",
< },
< },
< "scheduler": {
< "type": "WarmupLR",
< "params": {
< "warmup_min_lr": 0,
< "warmup_max_lr": 2e-4 * np.sqrt(8),
< "warmup_num_steps": "auto",
< },
< },
< "zero_optimization": {
< "stage": 3,
< "overlap_comm": False, #backwards prefetching
< "contiguous_gradients": True,
< "sub_group_size": 1000000000.0,
< "reduce_bucket_size": 500000000.0,
< "stage3_prefetch_bucket_size": 500000000.0,
< "stage3_param_persistence_threshold": 100000.0,
< "stage3_max_live_parameters": 1000000000.0,
< "stage3_max_reuse_distance": 1000000000.0,
< "stage3_gather_16bit_weights_on_model_save": True,
< "offload_param": {
< "device": "cpu",
< "pin_memory": False,
< },
< },
< "gradient_accumulation_steps": 1,
< "gradient_clipping": "auto",
< "steps_per_print": 39,
< "train_batch_size": 8,
< "train_micro_batch_size_per_gpu": 1,
< "wall_clock_breakdown": False,
< }
<
---
---
> # not used BB
> deepspeed_config = {}

accelerate env:

- `Accelerate` version: 0.30.1
- Platform: Linux-4.19.0-26-cloud-amd64-x86_64-with-glibc2.28
- `accelerate` bash location: /opt/conda/envs/env/bin/accelerate
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 51.10 GB
- GPU type: Tesla T4
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero3_save_16bit_model': False, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

PEFT: latest version from source (commit fb7f2796e5411ee86588447947d1fdd5b6395cad).
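For reference, note that in the diff above the `deepspeed = deepspeed_config` argument is commented out, so DeepSpeed comes from the Accelerate config shown in the `accelerate env` output instead. Below is a minimal sketch of that style of setup, in the spirit of the DeepSpeed + PEFT guide; the model name, LoRA hyperparameters, and toy dataset are placeholders, not taken from the repro script:

```python
# Minimal sketch, not the exact repro script: a small PEFT adapter trained with
# Trainer, with DeepSpeed ZeRO-3 supplied by the Accelerate config rather than
# the `deepspeed=` argument. Model, LoRA settings, and data are placeholders.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "facebook/opt-125m"  # small enough to run on 2x T4

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy.
    out = tokenizer(batch["text"], truncation=True, max_length=32, padding="max_length")
    out["labels"] = [ids.copy() for ids in out["input_ids"]]
    return out

train_ds = Dataset.from_dict({"text": ["hello world"] * 64}).map(
    tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="/tmp/peft-ds",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # kept consistent with the accelerate config above
    num_train_epochs=2,
    logging_strategy="epoch",
    save_strategy="epoch",          # the hang reportedly shows up around the epoch-end save
    fp16=True,                      # matches mixed_precision: fp16 above
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

A script like this would be started with `accelerate launch script.py` after `accelerate config` has produced DeepSpeed settings along the lines of the defaults printed above.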
I also hit this issue when the Trainer saves the model at the end of the first epoch.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I'm fine-tuning an LLM using soft prompt tuning, with DeepSpeed enabled implicitly via Accelerate through the `deepspeed` param in `TrainingArguments`. All goes well until after the first epoch, where I get this relatively obscure message.
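For context, the shape of the setup I'm describing looks roughly like this (a minimal sketch with a placeholder model and placeholder prompt-tuning hyperparameters, not my exact script):

```python
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Placeholder model; the real run uses a 7B model across 8x A100.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Soft prompt tuning: only the virtual prompt embeddings are trainable.
peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(model, peft_config)

# DeepSpeed is enabled implicitly: the config dict is handed to TrainingArguments
# and Accelerate drives DeepSpeed under the hood when the script is launched.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    num_train_epochs=3,
    save_strategy="epoch",
    bf16=True,
    deepspeed={
        "zero_optimization": {"stage": 3},
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    },
)
```

The full DeepSpeed config and training loop are in the repro further down.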
After some investigating, I came across a comment from a DeepSpeed maintainer (right here) on what this message means.
So I ignored it, but it turns out training just gets stuck after the first epoch and does not proceed.
Similar issues have been raised here & here.
I'm not entirely sure whether this is a DeepSpeed or an Accelerate issue, but I'm leaning towards it being an Accelerate one.
I'm running all of my experiments on a Databricks cluster with a p4d.24xlarge instance, which has 8x 40 GB NVIDIA A100s. These are my platform specs:
Libraries
Here is a minimal reproducible example:
Repro
And then in a separate notebook, I execute the following terminal command:
And this is the exact stacktrace I see:
Stacktrace
And at that point, it just gets stuck without proceeding any further.
Would really appreciate help getting to the bottom of this @muellerzr @pacman100. Thanks.