
cannot import name 'ShardedDDPOption' from 'transformers.trainer' #33242

Open
2 of 4 tasks
nishitanand opened this issue Sep 2, 2024 · 6 comments

@nishitanand
System Info

I am getting the following error, which should not occur:
cannot import name 'ShardedDDPOption' from 'transformers.trainer'

I have the following versions installed:
tokenizers-0.19.1
transformers-4.43.4
huggingface-hub-0.24.6

I have upgraded Vicuna-7B-v1.5 to Llama 3.1 8B in this GitHub repo - https://github.com/baaivision/EVE

This works with Vicuna-7B-v1.5, but not with Llama 3.1 8B. It should work, as there isn't much change. I earlier got a RoPE error, but solved it by upgrading transformers as guided in this discussion -
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15

Who can help?

https://github.com/amyeroberts
https://github.com/muellerzr
https://github.com/SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I run bash eve7b_prealign.sh 0 localhost


Expected behavior

The model should start training

@nishitanand nishitanand added the bug label Sep 2, 2024
@LysandreJik
Member

cc @muellerzr and @SunMarc

@SunMarc
Member

SunMarc commented Sep 2, 2024

Hey @nishitanand, thanks for reporting! Could you share your traceback? This shouldn't happen, as with your current version of transformers (4.43.4), ShardedDDPOption no longer exists. Maybe try to uninstall transformers and then install it again.

@nishitanand
Author

nishitanand commented Sep 2, 2024

Hi, I uninstalled and installed transformers again. I have tried with transformers version 4.44.2 as well. Same error.
I think the problem is that the code uses/requires sharded DDP, and sharded DDP was removed after transformers v4.34.0, i.e. from v4.35.0 onwards. Earlier I used Vicuna-v1.5 and the older version of transformers worked fine, but I have upgraded Vicuna-v1.5 to Llama 3.1, and Llama 3.1 requires a newer version of transformers, which sadly doesn't have sharded DDP.

Here is the traceback:

Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/train_mem.py", line 14, in
    from eve.train.train import train
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/train.py", line 43, in
    from eve.train.eve_trainer import EVETrainer
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/eve_trainer.py", line 8, in
    from transformers.trainer import (ALL_LAYERNORM_LAYERS, ShardedDDPOption,
ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer' (/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/transformers/trainer.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 671982) of binary: /fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/python
Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/torchrun", line 33, in
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_mem.py FAILED
Failures:
<NO_OTHER_FAILURES>
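For reference, a minimal compatibility guard for the failing import at the top of eve_trainer.py might look like the sketch below. This is not an official fix; it assumes the rest of the trainer subclass can treat the option as absent (None) when running on newer transformers versions.

```python
# Sketch of a compatibility shim for eve_trainer.py (not the official fix).
# ShardedDDPOption was removed from transformers.trainer in v4.35.0, so the
# import is guarded; on newer versions the name is defined as None and any
# code paths that check it must handle that case.
try:
    from transformers.trainer import ShardedDDPOption  # transformers < 4.35
except ImportError:  # also catches ModuleNotFoundError
    ShardedDDPOption = None  # removed upstream; callers must check for None
```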

@SunMarc
Member

SunMarc commented Sep 2, 2024

That's indeed the case. It looks like the code in their repo needs to be updated to work with the current Trainer. Sorry for the breaking change. Do you know what replaced ShardedDDP, @muellerzr, so that @nishitanand can fix the trainer subclass?
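Until a maintainer confirms the replacement, one hedged way to branch a trainer subclass on the installed transformers version is a small version gate. The v4.35.0 cutoff here is taken from the discussion above, not from the transformers changelog, and has_sharded_ddp is a hypothetical helper name:

```python
def has_sharded_ddp(transformers_version: str) -> bool:
    """Return True if this transformers version still ships ShardedDDPOption.

    Hypothetical helper: the v4.35.0 removal cutoff is taken from the
    discussion above, not verified against the transformers changelog.
    """
    # Compare only the (major, minor) components of the version string.
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    return (major, minor) < (4, 35)

print(has_sharded_ddp("4.34.0"))  # last line of releases with the option -> True
print(has_sharded_ddp("4.43.4"))  # the reporter's version -> False
```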

@nishitanand
Author

Hi @muellerzr, I'd really appreciate it if you could throw some light on the issue. I'm working on a priority project.

@nishitanand
Author

Hi @SunMarc, any pointers on how to solve the issue?
