
run_summarization_no_trainer #18189

Closed
Arij-Aladel opened this issue Jul 18, 2022 · 23 comments


Arij-Aladel commented Jul 18, 2022

@sgugger Hello! I just tried to run the code from this example: https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py

This is my yml file to build the env:

name: sum
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - jupyterlab
  - pip
  - python=3.9
  - pytorch
  - tensorboard
  - torchaudio
  - torchvision
  - tqdm
  - tokenizers
  - prettytable
  - einops
  - matplotlib
  - accelerate
  - datasets
  - sentencepiece != 0.1.92
  - protobuf
  - nltk
  - py7zr
  - transformers

Then I ran pip install rouge-score.

After that I simply ran the command

accelerate launch run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config '3.0.0' --source_prefix 'summarize: ' --output_dir output/tst-summarization

and got the error

Traceback (most recent call last):
File "/home/arij/anaconda3/envs/sum/bin/accelerate", line 10, in <module>
sys.exit(main())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 568, in launch_command
simple_launcher(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 235, in simple_launcher
mixed_precision = PrecisionType(args.mixed_precision.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
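For reference, the failing pattern can be reproduced in isolation. This is a minimal sketch (the `Args` class below is a hypothetical stand-in for the parsed launcher arguments, not accelerate code): when no default config exists, `mixed_precision` stays `None`, and calling `.lower()` on it raises exactly this `AttributeError`.

```python
class Args:
    # With no default_config.yaml, the launcher gets no value here.
    mixed_precision = None

args = Args()

# Failing pattern from the traceback:
#   PrecisionType(args.mixed_precision.lower())  -> AttributeError
# Defensive variant: fall back to "no" before lowering.
mixed_precision = (args.mixed_precision or "no").lower()
assert mixed_precision == "no"
```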

How to fix it?

sgugger (Collaborator) commented Jul 18, 2022

Did you run accelerte config? What's the result of accelerate env?

Arij-Aladel (Author) commented Jul 18, 2022

accelerate env

Copy-and-paste the text below in your GitHub issue

  • Accelerate version: 0.10.0
  • Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.31
  • Python version: 3.9.13
  • Numpy version: 1.22.3
  • PyTorch version (GPU?): 1.12.0 (True)
  • Accelerate default config:
    Not found

accelerate test

Running: accelerate-launch --config_file=None /home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/test_utils/test_script.py
stderr: Traceback (most recent call last):
stderr: File "/home/arij/anaconda3/envs/sum/bin/accelerate-launch", line 10, in <module>
stderr: sys.exit(main())
stderr: File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 574, in main
stderr: launch_command(args)
stderr: File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 523, in launch_command
stderr: defaults = load_config_from_file(args.config_file)
stderr: File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 45, in load_config_from_file
stderr: with open(config_file, "r", encoding="utf-8") as f:
stderr: FileNotFoundError: [Errno 2] No such file or directory: '/home/arij/.cache/huggingface/accelerate/default_config.yaml'
Traceback (most recent call last):
File "/home/arij/anaconda3/envs/sum/bin/accelerate", line 10, in <module>
sys.exit(main())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/test.py", line 52, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/test_utils/testing.py", line 276, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch --config_file=None /home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/test_utils/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/arij/anaconda3/envs/sum/bin/accelerate-launch", line 10, in
sys.exit(main())
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 574, in main
launch_command(args)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/launch.py", line 523, in launch_command
defaults = load_config_from_file(args.config_file)
File "/home/arij/anaconda3/envs/sum/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 45, in load_config_from_file
with open(config_file, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arij/.cache/huggingface/accelerate/default_config.yaml'
accelerte config
accelerte: command not found

sgugger (Collaborator) commented Jul 18, 2022

That was a typo, sorry. You need to run accelerate config before running accelerate launch and answer the small questionnaire.

Arij-Aladel (Author):

One of the questions is "Do you want to use DeepSpeed? [yes/NO]:".
What is the better choice here?

Arij-Aladel (Author):

Could you please send any link that explains how to fill in the questionnaire when using DeepSpeed?

Arij-Aladel (Author) commented Jul 18, 2022

Anyway, these are my steps:

(sum) arij@dgx3:/summarization/tutorial$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 3
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 9
What is the IP address of the machine that will host the main process? ###########33(hidden for security)
What is the port you will use to communicate with the main process? 8887
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file:
Do you want to enable deepspeed.zero.Init when using ZeRO Stage-3 for constructing massive models? [yes/NO]:
Which Type of launcher do you want to use [0] pdsh, [1] standard, [2] openmpi, [3] mvapich)? [0]:
DeepSpeed configures multi-node compute resources with hostfile. Each row is of the format hostname slots=[num_gpus], e.g., localhost slots=2; for more information please refer official documentation. Please specify the location of hostfile:
Do you want to specify exclusion filter string? [yes/NO]:
Do you want to specify inclusion filter string? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:8
(sum) arij@dgx3:/summarization/tutorial$ accelerate launch run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config '3.0.0' --source_prefix 'summarize: ' --output_dir output/tst-summarization
[2022-07-18 20:47:06,728] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-18 20:47:06,728] [INFO] [runner.py:457:main] cmd = /home/arij/anaconda3/envs/sum/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config 3.0.0 --source_prefix summarize: --output_dir output/tst-summarization
[2022-07-18 20:47:08,004] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-07-18 20:47:08,004] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-07-18 20:47:08,004] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-07-18 20:47:08,004] [INFO] [launch.py:123:main] dist_world_size=2
[2022-07-18 20:47:08,004] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1
args:

Namespace(dataset_name='cnn_dailymail', dataset_config_name='3.0.0', train_file=None, validation_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix='summarize: ', preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, max_length=128, num_beams=None, pad_to_max_length=False, model_name_or_path='t5-small', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, learning_rate=5e-05, weight_decay=0.0, num_train_epochs=3, max_train_steps=None, gradient_accumulation_steps=1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, num_warmup_steps=0, output_dir='output/tst-summarization', seed=None, model_type=None, push_to_hub=False, hub_model_id=None, hub_token=None, checkpointing_steps=None, resume_from_checkpoint=None, with_tracking=False, report_to='all')
[2022-07-18 20:47:30,042] [INFO] [launch.py:210:main] Process 1054725 exits successfully.
args:

Namespace(dataset_name='cnn_dailymail', dataset_config_name='3.0.0', train_file=None, validation_file=None, ignore_pad_token_for_loss=True, max_source_length=1024, source_prefix='summarize: ', preprocessing_num_workers=None, overwrite_cache=None, max_target_length=128, val_max_target_length=None, max_length=128, num_beams=None, pad_to_max_length=False, model_name_or_path='t5-small', config_name=None, tokenizer_name=None, text_column=None, summary_column=None, use_slow_tokenizer=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, learning_rate=5e-05, weight_decay=0.0, num_train_epochs=3, max_train_steps=None, gradient_accumulation_steps=1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, num_warmup_steps=0, output_dir='output/tst-summarization', seed=None, model_type=None, push_to_hub=False, hub_model_id=None, hub_token=None, checkpointing_steps=None, resume_from_checkpoint=None, with_tracking=False, report_to='all')
[2022-07-18 20:47:39,051] [INFO] [launch.py:210:main] Process 1054726 exits successfully.

Still something is wrong)

Arij-Aladel (Author) commented Jul 18, 2022

I think there should be full instructions on how to use accelerate; it is not clear. Thanks for your reply.

soumyasanyal:

Interestingly, I was facing the exact same issue just now. The fix for me was to pass the local config file I created:

accelerate launch --config_file <your config file> your_file.py

Arij-Aladel (Author) commented Jul 19, 2022

@soumyasanyal could you please tell me the steps (I am absolutely new) or post your config?

soumyasanyal:

Sure! I just followed the steps in this link. Concretely:

accelerate config --config_file ./accelerate.yaml --> answer all the questions in the questionnaire
accelerate test --config_file ./accelerate.yaml
accelerate launch --config_file ./accelerate.yaml script.py

My config file is as follows (but it can change as per your requirements. I just wanted to run a job on 8 GPUs in a single node, without DeepSpeed or mixed precision):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
use_cpu: false

I was previously running accelerate launch script.py without mentioning the config file when I faced the issue that you reported here.

Also FYI, note that the doc says that integration of accelerate with DeepSpeed is experimental.

Arij-Aladel (Author) commented Nov 14, 2022

@sgugger sorry for reopening the issue. While using this script with T5 on the cnn_dailymail dataset
under this configuration

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

I got the error

AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics'
    generated_tokens, labels = accelerator.gather_for_metrics((generated_tokens, labels))
AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics'
    generated_tokens, labels = accelerator.gather_for_metrics((generated_tokens, labels))
AttributeError: 'Accelerator' object has no attribute 'gather_for_metrics'

For this error, replacing gather_for_metrics with plain gather (as in the old version of this code) gives me empty lists for the gathered decoded_preds and decoded_labels, and gather_for_metrics did not work.
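For background (my understanding, not the accelerate source): gather_for_metrics was added in accelerate 0.12 to handle the last batch in distributed evaluation, where the sampler pads the dataset so every process sees equal-sized batches; plain gather returns those padded duplicates, which can skew metrics. A toy sketch of the idea:

```python
def gather(per_process_batches):
    # Mimic an all-gather: concatenate every process's batch.
    return [x for batch in per_process_batches for x in batch]

def gather_for_metrics(per_process_batches, dataset_len):
    # Drop the samples the sampler duplicated to pad the last batch.
    return gather(per_process_batches)[:dataset_len]

# 5 samples split across 2 processes: sample 0 is repeated as padding.
batches = [[0, 2, 4], [1, 3, 0]]
assert len(gather(batches)) == 6                  # padded duplicate included
assert gather_for_metrics(batches, 5) == [0, 2, 4, 1, 3]
```

The real implementation tracks the dataloader length internally; this sketch only illustrates why the truncation step exists.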

with this configuration

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
use_cpu: false

I get this error

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 7; 10.92 GiB total capacity; 9.83 GiB already allocated; 293.50 MiB free; 9.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 1; 10.92 GiB total capacity; 9.83 GiB already allocated; 245.50 MiB free; 9.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
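For the OOM, one common mitigation (a sketch, not a guaranteed fix) is to shrink --per_device_train_batch_size and compensate with --gradient_accumulation_steps so the effective batch size stays the same; both flags already appear in the Namespace dump above (defaults 8 and 1):

```shell
# Effective batch size per device stays 8: 2 samples x 4 accumulation steps.
accelerate launch run_summarization_no_trainer.py \
  --model_name_or_path t5-small \
  --dataset_name cnn_dailymail \
  --dataset_config '3.0.0' \
  --source_prefix 'summarize: ' \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --output_dir output/tst-summarization
```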

Arij-Aladel (Author):

@muellerzr

muellerzr (Contributor) commented Nov 16, 2022

@Arij-Aladel in this case you should reduce your batch size most likely, but I'll be running it myself in just a moment

Arij-Aladel (Author):

I did already; there is still the problem of not finding the gather_for_metrics attribute.

Arij-Aladel (Author):

You can simply run the example as is

muellerzr (Contributor) commented Nov 16, 2022

Thanks @Arij-Aladel, I think I have found the fix. Can you try running the following training script on your end to verify? (I have included a wget command to make your life easy.)

(Also as mentioned in the other post please make sure you have a pypi version of accelerate >= 0.12.0 to run the scripts, a PR was just merged yesterday to make them a requirement for all these scripts)
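That >= 0.12.0 floor can be checked quickly; a small helper sketch (meets_floor is my own illustrative function, and the naive compare only handles plain x.y.z version strings):

```python
from importlib import metadata

def meets_floor(ver: str, floor=(0, 12, 0)) -> bool:
    # Naive numeric compare; fine for plain x.y.z version strings.
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return parts >= floor

assert meets_floor("0.12.0")
assert not meets_floor("0.10.0")  # the version shown by `accelerate env` above

try:
    installed = metadata.version("accelerate")
    print("accelerate", installed, "ok" if meets_floor(installed) else "too old")
except metadata.PackageNotFoundError:
    print("accelerate is not installed")
```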

wget https://raw.githubusercontent.com/huggingface/transformers/muellerzr-fix-no-trainer/examples/pytorch/summarization/run_summarization_no_trainer.py

Arij-Aladel (Author) commented Nov 17, 2022

@muellerzr thanks for your response! As I understand it, your fix is just deleting this line:

706 decoded_preds, decoded_labels = accelerator.gather_for_metrics(decoded_preds, decoded_labels)
?

My life with wget did not get easier )))

wget https://raw.githubusercontent.com/huggingface/transformers/muellerzr-fix-no-trainer/examples/pytorch/summarization/run_summarization_no_trainer.py
--2022-11-17 10:45:24-- https://raw.githubusercontent.com/huggingface/transformers/muellerzr-fix-no-trainer/examples/pytorch/summarization/run_summarization_no_trainer.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-11-17 10:45:24 ERROR 404: Not Found.

Arij-Aladel (Author):

[screenshot]
Really, I do not know what is wrong with this script...

muellerzr (Contributor) commented Nov 17, 2022

@Arij-Aladel yes the fix got merged yesterday, you can find it here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py

I would highly recommend doing pip install -r transformers/examples/pytorch/summarization/requirements.txt -U (the txt file here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/requirements.txt) to avoid these dependency issues you have been struggling with as the script ran just fine for me.

Arij-Aladel (Author) commented Nov 17, 2022

[screenshot]
After

pip install -r transformers/examples/pytorch/summarization/requirements.txt -U

:)

Arij-Aladel (Author) commented Nov 18, 2022

OK, it seems it was a package installation issue. After your fix, I uninstalled all packages and reinstalled them according to the requirements file. It works now, thanks @muellerzr.

muellerzr (Contributor):

Great! Can this be closed now @Arij-Aladel? :)

Arij-Aladel (Author):

Yes, thanks. I am closing it.
