
DeepSpeed CI failures in Transformers with the latest version 0.12.4 but works with 0.12.3 #4795

Closed
pacman100 opened this issue Dec 11, 2023 · 2 comments

@pacman100
Contributor

The DeepSpeed integration CI test suite in Transformers is failing, with 80+ tests impacted: https://github.com/huggingface/transformers/actions/runs/7155087776/job/19483398901

This is because the runner changes the command below:

stdout: [2023-12-11 14:03:02,193] [INFO] [runner.py:570:main] cmd = /raid/sourab/miniconda3/envs/hf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=10999 --enable_each_rank_log=None /raid/sourab/transformers/examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --train_file /raid/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/train.json --validation_file /raid/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/val.json --output_dir /tmp/tmpd708_3kl --overwrite_output_dir --max_source_length 32 --max_target_length 32 --val_max_target_length 32 --warmup_steps 8 --predict_with_generate --save_steps 0 --eval_steps 10 --group_by_length --label_smoothing_factor 0.1 --source_lang en --target_lang ro --report_to none --source_prefix "translate English to Romanian: " --fp16 --do_train --num_train_epochs 1 --max_train_samples 16 --per_device_train_batch_size 2 --learning_rate 3e-3 --do_eval --max_eval_samples 16 --per_device_eval_batch_size 2 --deepspeed /raid/sourab/transformers/tests/deepspeed/ds_config_zero2.json

to

stdout: [2023-12-11 14:03:02,193] [INFO] [runner.py:570:main] cmd = /raid/sourab/miniconda3/envs/hf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=10999 --enable_each_rank_log=None /raid/sourab/transformers/examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --train_file /raid/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/train.json --validation_file /raid/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/val.json --output_dir /tmp/tmpd708_3kl --overwrite_output_dir --max_source_length 32 --max_target_length 32 --val_max_target_length 32 --warmup_steps 8 --predict_with_generate --save_steps 0 --eval_steps 10 --group_by_length --label_smoothing_factor 0.1 --source_lang en --target_lang ro --report_to none --source_prefix translate English to Romanian: --fp16 --do_train --num_train_epochs 1 --max_train_samples 16 --per_device_train_batch_size 2 --learning_rate 3e-3 --do_eval --max_eval_samples 16 --per_device_eval_batch_size 2 --deepspeed /raid/sourab/transformers/tests/deepspeed/ds_config_zero2.json

Notice no " around translate English to Romanian: . Becuase of this HF Argument parser fails leading to many tests being impacted.

Possible cause: #4660
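For illustration, a minimal sketch (not DeepSpeed's actual launcher code) of why the missing quotes break parsing: when argv is flattened into a command string without quoting, an argument containing spaces is split back into several tokens on re-parse, whereas quoting each token (e.g. with shlex) round-trips cleanly.

```python
# Minimal sketch (not DeepSpeed's actual launcher code): an argument that
# contains spaces must be quoted when argv is flattened into a command string,
# otherwise re-parsing splits it into several tokens.
import shlex

user_args = [
    "run_translation.py",
    "--source_prefix", "translate English to Romanian: ",
    "--fp16",
]

# Naive join drops the quoting, so the prefix is split apart on re-parse.
naive_cmd = " ".join(user_args)
print(shlex.split(naive_cmd))
# ['run_translation.py', '--source_prefix', 'translate', 'English', 'to', 'Romanian:', '--fp16']

# Quoting each token (shlex.join, i.e. shlex.quote per argument) round-trips cleanly.
quoted_cmd = shlex.join(user_args)
print(shlex.split(quoted_cmd))
# ['run_translation.py', '--source_prefix', 'translate English to Romanian: ', '--fp16']
```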

@pacman100
Contributor Author

Reverting PR #4660 resolves the issue (I tested it).

@loadams
Collaborator

loadams commented Dec 11, 2023

Thanks @pacman100 - I've created a PR to revert it here. Will work on getting that merged.

We also know that our nv-transformers-v100 tests are failing with the latest transformers version; we are working on that as well.

loadams linked a pull request Dec 14, 2023 that will close this issue
mrwyattii added a commit that referenced this issue Dec 15, 2023
Splitting work from #4769 because we are still debugging transformers
integration issues.

Parsing was broken for user arguments (see #4795). Additionally, parsing
of user arguments is tricky and there are lots of edge cases. For
example: #4660, #4716, #3967. I've attempted to accommodate all of the
possible types of string inputs and added unit tests.
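
As a rough illustration of the round-trip property those unit tests need to cover (a hypothetical pytest-style sketch, not DeepSpeed's actual test suite):

```python
# Hypothetical pytest-style sketch (not DeepSpeed's actual unit tests) of the
# round-trip property: user arguments must survive being serialized into a
# launch command and parsed back unchanged.
import shlex
import pytest

@pytest.mark.parametrize("argv", [
    ["--source_prefix", "translate English to Romanian: "],  # embedded spaces
    ["--label", 'say "hello"'],                               # embedded double quotes
    ["--pattern", "a*b?[c]"],                                 # shell metacharacters
])
def test_user_args_round_trip(argv):
    cmd = shlex.join(argv)           # quote each token when building the command
    assert shlex.split(cmd) == argv  # parsing recovers the original argv exactly
```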
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue Feb 17, 2024