AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics' #854

Closed
Arij-Aladel opened this issue Nov 15, 2022 · 9 comments

Comments


Arij-Aladel commented Nov 15, 2022

System Info

python 3.9


`accelerate.yaml` file:

> compute_environment: LOCAL_MACHINE
> deepspeed_config: {}
> distributed_type: MULTI_GPU
> fsdp_config: {}
> machine_rank: 0
> main_process_ip: null
> main_process_port: null
> main_training_function: main
> mixed_precision: 'no'
> num_machines: 1
> num_processes: 10
> use_cpu: false

The rest is as in the [requirements](https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/requirements.txt) file of the example.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Go to run_summarization_no_trainer.py

Run:

accelerate launch --config_file='./accelerate.yaml' run_summarization_no_trainer.py --seed=42 --preprocessing_num_workers=1 --weight_decay='0.001' --output_dir="draft/" --per_device_train_batch_size=4 --per_device_eval_batch_size=8 --dataset_name="cnn_dailymail" --dataset_config "3.0.0" --num_train_epochs=10 --model_name_or_path 't5-small'

Expected behavior

The script runs normally.

The error I got is explained in this [issue](https://github.com/huggingface/transformers/issues/18189).
@muellerzr
Collaborator

@Arij-Aladel what version of accelerate do you have (`pip show accelerate`)? It may be quite old by the looks of your accelerate env report. I'd recommend `pip install accelerate -U`, as `gather_for_metrics` has been around since v0.12.0 in August.
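
If it helps, the same check can be done from Python (a small illustrative sketch, equivalent to reading the version out of `pip show accelerate`):

```python
# Sketch: confirm the installed accelerate build actually exposes
# gather_for_metrics, which was added in the v0.12.0 release.
import accelerate
from accelerate import Accelerator

print(accelerate.__version__)
print(hasattr(Accelerator, "gather_for_metrics"))  # False on builds predating v0.12.0
```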

@Arij-Aladel
Author

The latest one, I guess; it is installed exactly as in the requirements file of the example.

@muellerzr
Collaborator

The requirements file doesn't have a version specified, so it won't install a new version if one already exists on the system. Can you please show the output of `pip show accelerate`? 😃
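
(For illustration only: a minimum-version pin in that requirements.txt, such as the hypothetical line below, would force pip to upgrade an older install.)

> accelerate >= 0.12.0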

@Arij-Aladel
Author

  • Accelerate version: 0.12.0.dev0
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.31
  • Python version: 3.9.13
  • Numpy version: 1.22.3
  • PyTorch version (GPU?): 1.12.0 (True)
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: no
    - use_cpu: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - main_process_ip: None
    - main_process_port: None
    - main_training_function: main
    - deepspeed_config: {}
    - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'offload_params': False, 'sharding_strategy': 1, 'transformer_layer_cls_to_wrap': ''}

@muellerzr
Collaborator

Since it's running a dev build, it may be from a commit before `gather_for_metrics` was added. I'd recommend `pip install accelerate==0.12.0 --force-reinstall --no-deps` so you get the fully released v0.12.0, or `-U` to upgrade 😃
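
In case it is useful to see what the method does once a release that has it is installed, here is a minimal sketch; the toy tensor dataset and identity "prediction" are illustrative and not taken from run_summarization_no_trainer.py:

```python
# Minimal sketch of gather_for_metrics (assumes accelerate >= 0.12.0).
# It gathers per-process outputs and drops the samples that were duplicated
# to make the last batch divisible, so metrics see exactly len(dataset) rows.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

dataset = torch.arange(10).unsqueeze(1).float()          # 10 toy "samples"
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=3))

all_preds = []
for batch in dataloader:
    preds = batch  # stand-in for model output
    all_preds.append(accelerator.gather_for_metrics(preds))

if accelerator.is_main_process:
    print(torch.cat(all_preds).shape)  # torch.Size([10, 1]) regardless of process count
```

Launched with `accelerate launch --num_processes 2 script.py` (the filename is just a placeholder), a plain `gather` would also return the duplicated padding samples, while `gather_for_metrics` trims the result back to the 10 real ones.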

@Arij-Aladel
Author

OK, now there is another issue: at this line I am getting the error

10%|███████████████████████ | 7178/71780 [32:54<4:56:14, 3.63it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9826) of binary: /home/arij/anaconda3/envs/sum/bin/python
Traceback (most recent call last):
  File "/home/arij/anaconda3/envs/sum/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This is because after gathering on this line we get empty lists.

@muellerzr
Collaborator

Can you open a separate issue on the transformers repo for this and @ mention me? Thanks!

@Arij-Aladel
Author

Arij-Aladel commented Nov 16, 2022

@muellerzr I have already opened it and mentioned this issue; I will mention you there too.

@muellerzr
Collaborator

@Arij-Aladel is it safe to say this can be closed now? 😄
