Trainer class: using the Accelerate launcher with Deepspeed #25356
cc @pacman100
Hello @nebrelbug, please update the accelerate config to correctly use 8 GPUs as shown below:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: '/home/bgubler7/.cache/huggingface/accelerate/ds_config.json'
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
# mixed_precision: fp16
num_machines: 1
- num_processes: 1
+ num_processes: 8
use_cpu: false
@pacman100 I updated my config and ran the code again. This time, all the GPUs filled up, but I'm still running into a CUDA out of memory error.
Am I configuring something wrong with fp16 or offload? I'm on a node with 8 A100 GPUs -- I believe I should be able to train even a 65B model, as long as I use half-precision.
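For a rough sense of whether a 65B model in half precision should fit, here is a back-of-the-envelope estimate; the numbers are an assumption for intuition, not taken from the issue:

```python
# Back-of-the-envelope memory estimate (assumption for intuition, not from the issue).
# ZeRO-3 shards the fp16 parameters across GPUs, and the config offloads optimizer
# state to CPU, so the sharded weights themselves fit comfortably on 80 GB A100s.
params = 65e9          # 65B parameters
bytes_per_param = 2    # fp16 / bf16
num_gpus = 8

weights_per_gpu_gib = params * bytes_per_param / num_gpus / 2**30
print(f"sharded fp16 weights: ~{weights_per_gpu_gib:.1f} GiB per GPU")  # ~15 GiB
# Activations, however, are neither sharded nor offloaded by ZeRO, which is why
# long sequences can still trigger CUDA OOM (see the reply below).
```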
Hello @nebrelbug, you need to use gradient checkpointing when training such a large model, as the activations aren't offloaded and they take up a lot of GPU memory for long sequences. To further increase throughput, use Flash Attention 2 as well.
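A minimal sketch of how those two suggestions could look in the training script; the actual loop.py is not shown in the issue, and the Flash Attention 2 flag name depends on the transformers version, so treat this as an assumption:

```python
from transformers import AutoModelForCausalLM, TrainingArguments

# Assumed checkpoint name -- the issue does not say which LLaMA weights are used.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",
    use_flash_attention_2=True,  # Flash Attention 2; flag name varies across transformers versions
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,  # recompute activations in backward instead of storing them
    fp16=True,
)
```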
System Info
transformers version: 4.32.0.dev0
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@ArthurZucker, @sgugger, @younesbelkada
Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I've written a very simple training loop using the Hugging Face Trainer class in order to fine-tune LLaMA. Here's the code:
loop.py
utils/dataloader_example.py
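Both files are collapsed in the issue. For reference, a minimal Trainer loop of this shape might look like the sketch below; the checkpoint name and the get_dataset helper are assumptions, not the actual contents of loop.py:

```python
# loop.py -- hypothetical reconstruction; the real file is not shown in the issue.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from utils.dataloader_example import get_dataset  # assumed helper

model_name = "huggyllama/llama-30b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

train_dataset = get_dataset(tokenizer)  # assumed to return a tokenized dataset

args = TrainingArguments(
    output_dir="llama-30b-finetune",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```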
I can train smaller models, like LLaMA 7B, without using DeepSpeed. But in order to use LLaMA 30B, I've been trying to use DeepSpeed ZeRO-3 with the Accelerate launcher.
Here's my accelerate config:
And my DeepSpeed config:
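Both config files are collapsed in the issue, but the deepspeed_config entries reported in the system info above (ZeRO stage 3, CPU offload for optimizer and parameters, 16-bit model saving) suggest a ds_config.json roughly like the following sketch; the exact keys and "auto" values are assumptions:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```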
When I run the code using accelerate launch loop.py, it seems to use the CPUs for model loading. The node I'm running on has 8 GPUs. Unfortunately, after the checkpoint shards have loaded, only one of the GPUs begins to fill up. This eventually results in a CUDA out of memory error. Am I configuring DeepSpeed incorrectly? I copied and pasted the configuration from the Hugging Face documentation.
Expected behavior
I'd expect that the 30B model would load, with parameters and optimizer offloaded to the CPUs. Then all GPUs would be utilized to some extent during the training loop.