
cannot import name 'ShardedDDPOption' from 'transformers.trainer' #33242

Open
2 of 4 tasks
nishitanand opened this issue Sep 2, 2024 · 6 comments

@nishitanand
System Info

I am getting the following error, which should not occur:
cannot import name 'ShardedDDPOption' from 'transformers.trainer'

I have the following versions installed:
tokenizers-0.19.1
transformers-4.43.4
huggingface-hub-0.24.6

I have upgraded Vicuna-7B-v1.5 to Llama 3.1 8B in this GitHub repo - https://github.com/baaivision/EVE

This works with Vicuna-7B-v1.5, but not with Llama 3.1 8B. It should work, as there isn't much change. I earlier got a RoPE error, but solved it by upgrading transformers as guided in this discussion -
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15

Who can help?

https://github.com/amyeroberts
https://github.com/muellerzr
https://github.com/SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I run bash eve7b_prealign.sh 0 localhost


Expected behavior

The model should start training

@nishitanand nishitanand added the bug label Sep 2, 2024
@LysandreJik
Member

cc @muellerzr and @SunMarc

@SunMarc
Member

SunMarc commented Sep 2, 2024

Hey @nishitanand, thanks for reporting! Could you share your traceback? This shouldn't happen, as with your current version of transformers (4.43.4), ShardedDDPOption no longer exists. Maybe try to uninstall transformers and then install it again.

@nishitanand
Author

nishitanand commented Sep 2, 2024

Hi, I uninstalled and installed transformers again. I have tried with transformers version 4.44.2 as well. Same error.
I think the problem is that the code uses/requires sharded DDP, and sharded DDP was removed after transformers v4.34.0, i.e. from v4.35.0 onwards. Earlier I used Vicuna-v1.5 and the older version of transformers worked fine, but I have upgraded Vicuna-v1.5 to Llama 3.1, and Llama 3.1 requires a newer version of transformers, which sadly doesn't have sharded DDP.

Here is the traceback:

Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/train_mem.py", line 14, in
    from eve.train.train import train
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/train.py", line 43, in
    from eve.train.eve_trainer import EVETrainer
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/eve_trainer.py", line 8, in
    from transformers.trainer import (ALL_LAYERNORM_LAYERS, ShardedDDPOption,
ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer' (/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/transformers/trainer.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 671982) of binary: /fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/python
Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/torchrun", line 33, in
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_mem.py FAILED
Failures:
<NO_OTHER_FAILURES>
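For reference, a minimal compatibility guard for the failing import at the top of eve_trainer.py might look like the sketch below. This is not an official fix; it assumes the rest of the trainer subclass can treat the option as absent (None) when running on newer transformers versions.

```python
# Sketch of a compatibility shim for eve_trainer.py (not the official fix).
# ShardedDDPOption was removed from transformers.trainer in v4.35.0, so the
# import is guarded; on newer versions the name is defined as None and any
# code paths that check it must handle that case.
try:
    from transformers.trainer import ShardedDDPOption  # transformers < 4.35
except ImportError:  # also catches ModuleNotFoundError
    ShardedDDPOption = None  # removed upstream; callers must check for None
```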

@SunMarc
Member

SunMarc commented Sep 2, 2024

That's indeed the case. It looks like the code in their repo needs to be updated to work with the current Trainer. Sorry for the breaking change. Do you know what replaced ShardedDDP, @muellerzr, so that @nishitanand can fix the trainer subclass?
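Until a maintainer confirms the replacement, one hedged way to branch a trainer subclass on the installed transformers version is a small version gate. The v4.35.0 cutoff here is taken from the discussion above, not from the transformers changelog, and has_sharded_ddp is a hypothetical helper name:

```python
def has_sharded_ddp(transformers_version: str) -> bool:
    """Return True if this transformers version still ships ShardedDDPOption.

    Hypothetical helper: the v4.35.0 removal cutoff is taken from the
    discussion above, not verified against the transformers changelog.
    """
    # Compare only the (major, minor) components of the version string.
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    return (major, minor) < (4, 35)

print(has_sharded_ddp("4.34.0"))  # last line of releases with the option -> True
print(has_sharded_ddp("4.43.4"))  # the reporter's version -> False
```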

@nishitanand
Author

Hi @muellerzr, I'd really appreciate it if you could throw some light on the issue. I'm working on a priority project.

@nishitanand
Author

Hi @SunMarc, any pointers on how to solve the issue?
