
run_summarization_no_trainer #18189

Closed
Arij-Aladel opened this issue Jul 18, 2022 · 23 comments


Arij-Aladel commented Jul 18, 2022

@sgugger Hello! I just tried to run the code from this example: https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py

This is my yml file to build the env:

name: sum
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - jupyterlab
  - pip
  - python=3.9
  - pytorch
  - tensorboard
  - torchaudio
  - torchvision
  - tqdm
  - tokenizers
  - prettytable
  - einops
  - matplotlib
  - accelerate
  - datasets
  - sentencepiece != 0.1.92
  - protobuf
  - nltk
  - py7zr
  - transformers

Then I ran pip install rouge-score.

After that I simply ran the command

accelerate launch run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config '3.0.0' --source_prefix 'summarize: ' --output_dir output/tst-summarization

and got the error

Traceback (most recent call last):
File "/home/arij/anaconda3/envs/sum/bin/accelerate", line 10, in <module>
sys.exit(main())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 568, in launch_command
simple_launcher(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 235, in simple_launcher
mixed_precision = PrecisionType(args.mixed_precision.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
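For reference, the failing pattern can be reproduced in isolation. This is a minimal sketch (the `Args` class below is a hypothetical stand-in for the parsed launcher arguments, not accelerate code): when no default config exists, `mixed_precision` stays `None`, and calling `.lower()` on it raises exactly this `AttributeError`.

```python
class Args:
    # With no default_config.yaml, the launcher gets no value here.
    mixed_precision = None

args = Args()

# Failing pattern from the traceback:
#   PrecisionType(args.mixed_precision.lower())  -> AttributeError
# Defensive variant: fall back to "no" before lowering.
mixed_precision = (args.mixed_precision or "no").lower()
assert mixed_precision == "no"
```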

How to fix it?

sgugger (Collaborator) commented Jul 18, 2022

Did you run accelerte config? What's the result of accelerate env?

Arij-Aladel (Author) commented Jul 18, 2022

accelerate env

Copy-and-paste the text below in your GitHub issue

  • Accelerate version: 0.10.0
  • Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.31
  • Python version: 3.9.13
  • Numpy version: 1.22.3
  • PyTorch version (GPU?): 1.12.0 (True)
  • Accelerate default config:
    Not found

accelerate test

Running: accelerate-launch --config_file=None /home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/test_utils/test_script.py
stderr: Traceback (most recent call last):
stderr: File "/home/arij/anaconda3/envs/sum/bin/accelerate-launch", line 10, in <module>
stderr: sys.exit(main())
stderr: File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 574, in main
stderr: launch_command(args)
stderr: File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 523, in launch_command
stderr: defaults = load_config_from_file(args.config_file)
stderr: File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 45, in load_config_from_file
stderr: with open(config_file, "r", encoding="utf-8") as f:
stderr: FileNotFoundError: [Errno 2] No such file or directory: '/home/arij/.cache/huggingface/accelerate/default_config.yaml'
Traceback (most recent call last):
File "/home/arij/anaconda3/envs/sum/bin/accelerate", line 10, in <module>
sys.exit(main())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/test.py", line 52, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/test_utils/testing.py", line 276, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch --config_file=None /home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/test_utils/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/arij/anaconda3/envs/sum/bin/accelerate-launch", line 10, in
sys.exit(main())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 574, in main
launch_command(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 523, in launch_command
defaults = load_config_from_file(args.config_file)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 45, in load_config_from_file
with open(config_file, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arij/.cache/huggingface/accelerate/default_config.yaml'
accelerte config
accelerte: command not found

sgugger (Collaborator) commented Jul 18, 2022

That was a typo, sorry. You need to run accelerate config before running accelerate launch and answer the small questionnaire.

Arij-Aladel (Author):

One of the questions is "Do you want to use DeepSpeed? [yes/NO]:".
What is the better choice here?

Arij-Aladel (Author):

Could you please send any link that explains how to fill in the questionnaire when using DeepSpeed?

Arij-Aladel (Author) commented Jul 18, 2022

Anyway, these are my steps:

(sum) arij@dgx3:/summarization/tutorial$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 3
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 9
What is the IP address of the machine that will host the main process? ###########33(hidden for security)
What is the port you will use to communicate with the main process? 8887
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file:
Do you want to enable deepspeed.zero.Init when using ZeRO Stage-3 for constructing massive models? [yes/NO]:
Which Type of launcher do you want to use [0] pdsh, [1] standard, [2] openmpi, [3] mvapich)? [0]:
DeepSpeed configures multi-node compute resources with hostfile. Each row is of the format hostname slots=[num_gpus], e.g., localhost slots=2; for more information please refer official documentation. Please specify the location of hostfile:
Do you want to specify exclusion filter string? [yes/NO]:
Do you want to specify inclusion filter string? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:8
(sum) arij@dgx3:/summarization/tutorial$ accelerate launch run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config '3.0.0' --source_prefix 'summarize: ' --output_dir output/tst-summarization
[2022-07-18 20:47:06,728] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-18 20:47:06,728] [INFO] [runner.py:457:main] cmd = /home/arij/anaconda3/envs/sum/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config 3.0.0 --source_prefix summarize: --output_dir output/tst-summarization
[2022-07-18 20:47:08,004] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-07-18 20:47:08,004] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-07-18 20:47:08,004] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-07-18 20:47:08,004] [INFO] [launch.py:123:main] dist_world_size=2
[2022-07-18 20:47:08,004] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1
args:

Namespace(dataset_name='cnn_dailymail', dataset_config_name='3.0.0', train_file=None, validation_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix='summarize: ', preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, max_length=128, num_beams=None, pad_to_max_length=False, model_name_or_path='t5-small', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, learning_rate=5e-05, weight_decay=0.0, num_train_epochs=3, max_train_steps=None, gradient_accumulation_steps=1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, num_warmup_steps=0, output_dir='output/tst-summarization', seed=None, model_type=None, push_to_hub=False, hub_model_id=None, hub_token=None, checkpointing_steps=None, resume_from_checkpoint=None, with_tracking=False, report_to='all')
[2022-07-18 20:47:30,042] [INFO] [launch.py:210:main] Process 1054725 exits successfully.
args:

Namespace(dataset_name='cnn_dailymail', dataset_config_name='3.0.0', train_file=None, validation_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix='summarize: ', preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, max_length=128, num_beams=None, pad_to_max_length=False, model_name_or_path='t5-small', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, learning_rate=5e-05, weight_decay=0.0, num_train_epochs=3, max_train_steps=None, gradient_accumulation_steps=1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, num_warmup_steps=0, output_dir='output/tst-summarization', seed=None, model_type=None, push_to_hub=False, hub_model_id=None, hub_token=None, checkpointing_steps=None, resume_from_checkpoint=None, with_tracking=False, report_to='all')
[2022-07-18 20:47:39,051] [INFO] [launch.py:210:main] Process 1054726 exits successfully.

Still something is wrong)

Arij-Aladel (Author) commented Jul 18, 2022

I think there should be full instructions on how to use accelerate; it is not clear. Thanks for your reply.

soumyasanyal:

Interestingly, I was facing the exact same issue just now. The fix for me was to pass the local config file I created:

accelerate launch --config_file <your config file> your_file.py

Arij-Aladel (Author) commented Jul 19, 2022

@soumyasanyal could you please tell me the steps (I am absolutely new) or post your config?

soumyasanyal:

Sure! I just followed the steps in this link. Concretely:

accelerate config --config_file ./accelerate.yaml --> answer all the questions in the questionnaire
accelerate test --config_file ./accelerate.yaml
accelerate launch --config_file ./accelerate.yaml script.py

My config file is as follows (but it can change as per your requirements. I just wanted to run a job on 8 GPUs in a single node, without DeepSpeed or mixed precision):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
use_cpu: false

I was previously running accelerate launch script.py without mentioning the config file when I faced the issue that you reported here.

Also FYI, note that the doc says that integration of accelerate with DeepSpeed is experimental.

Arij-Aladel (Author) commented Nov 14, 2022

@sgugger sorry for reopening the issue. While using this script with T5 on the cnn_dailymail dataset
under this configuration

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

I got the error

AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics'
    generated_tokens, labels = accelerator.gather_for_metrics((generated_tokens, labels))
AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics'
    generated_tokens, labels = accelerator.gather_for_metrics((generated_tokens, labels))
AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics'

For this error, replacing gather_for_metrics with plain gather (as in the old version of this code) gives me empty lists for the gathered decoded_preds and decoded_labels, and gather_for_metrics did not work.
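For background (my understanding, not the accelerate source): gather_for_metrics was added in accelerate 0.12 to handle the last batch in distributed evaluation, where the sampler pads the dataset so every process sees equal-sized batches; plain gather returns those padded duplicates, which can skew metrics. A toy sketch of the idea:

```python
def gather(per_process_batches):
    # Mimic an all-gather: concatenate every process's batch.
    return [x for batch in per_process_batches for x in batch]

def gather_for_metrics(per_process_batches, dataset_len):
    # Drop the samples the sampler duplicated to pad the last batch.
    return gather(per_process_batches)[:dataset_len]

# 5 samples split across 2 processes: sample 0 is repeated as padding.
batches = [[0, 2, 4], [1, 3, 0]]
assert len(gather(batches)) == 6                  # padded duplicate included
assert gather_for_metrics(batches, 5) == [0, 2, 4, 1, 3]
```

The real implementation tracks the dataloader length internally; this sketch only illustrates why the truncation step exists.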

with this configuration

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
use_cpu: false

I get this error

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 7; 10.92 GiB total capacity; 9.83 GiB already allocated; 293.50 MiB free; 9.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 1; 10.92 GiB total capacity; 9.83 GiB already allocated; 245.50 MiB free; 9.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
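For the OOM, one common mitigation (a sketch, not a guaranteed fix) is to shrink --per_device_train_batch_size and compensate with --gradient_accumulation_steps so the effective batch size stays the same; both flags already appear in the Namespace dump above (defaults 8 and 1):

```shell
# Effective batch size per device stays 8: 2 samples x 4 accumulation steps.
accelerate launch run_summarization_no_trainer.py \
  --model_name_or_path t5-small \
  --dataset_name cnn_dailymail \
  --dataset_config '3.0.0' \
  --source_prefix 'summarize: ' \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --output_dir output/tst-summarization
```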

Arij-Aladel (Author):

@muellerzr

muellerzr (Contributor) commented Nov 16, 2022

@Arij-Aladel in this case you should reduce your batch size most likely, but I'll be running it myself in just a moment

Arij-Aladel (Author):

I did already; there is still the problem of not finding the gather_for_metrics attribute.

Arij-Aladel (Author):

You can simply run the example as is

muellerzr (Contributor) commented Nov 16, 2022

Thanks @Arij-Aladel, I think I have found the fix. Can you try running the following training script on your end to verify? (I have included a wget command to make your life easy.)

(Also as mentioned in the other post please make sure you have a pypi version of accelerate >= 0.12.0 to run the scripts, a PR was just merged yesterday to make them a requirement for all these scripts)
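That >= 0.12.0 floor can be checked quickly; a small helper sketch (meets_floor is my own illustrative function, and the naive compare only handles plain x.y.z version strings):

```python
from importlib import metadata

def meets_floor(ver: str, floor=(0, 12, 0)) -> bool:
    # Naive numeric compare; fine for plain x.y.z version strings.
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return parts >= floor

assert meets_floor("0.12.0")
assert not meets_floor("0.10.0")  # the version shown by `accelerate env` above

try:
    installed = metadata.version("accelerate")
    print("accelerate", installed, "ok" if meets_floor(installed) else "too old")
except metadata.PackageNotFoundError:
    print("accelerate is not installed")
```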

wget https://raw.githubusercontent.com/huggingface/transformers/muellerzr-fix-no-trainer/examples/pytorch/summarization/run_summarization_no_trainer.py

Arij-Aladel (Author) commented Nov 17, 2022

@muellerzr thanks for your response! As I understand it, your fix is just deleting this line:

706 decoded_preds, decoded_labels = accelerator.gather_for_metrics(decoded_preds, decoded_labels)
?

My life with wget did not get easier )))

wget https://raw.githubusercontent.com/huggingface/transformers/muellerzr-fix-no-trainer/examples/pytorch/summarization/run_summarization_no_trainer.py
--2022-11-17 10:45:24-- https://raw.githubusercontent.com/huggingface/transformers/muellerzr-fix-no-trainer/examples/pytorch/summarization/run_summarization_no_trainer.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-11-17 10:45:24 ERROR 404: Not Found.

Arij-Aladel (Author):

[screenshot]
Really, I do not know what is wrong with this script...

muellerzr (Contributor) commented Nov 17, 2022

@Arij-Aladel yes the fix got merged yesterday, you can find it here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py

I would highly recommend doing pip install -r transformers/examples/pytorch/summarization/requirements.txt -U (the txt file here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/requirements.txt) to avoid these dependency issues you have been struggling with as the script ran just fine for me.

Arij-Aladel (Author) commented Nov 17, 2022

[screenshot]
After

pip install -r transformers/examples/pytorch/summarization/requirements.txt -U

:)

Arij-Aladel (Author) commented Nov 18, 2022

OK, it seems it was a package installation issue. After your fix, I uninstalled all packages and reinstalled them according to the requirements file. It works now, thanks @muellerzr.

muellerzr (Contributor):

Great! Can this be closed now @Arij-Aladel? :)

Arij-Aladel (Author):

Yes, thanks. I am closing it.
