Device failure w/ transformer training using DDP/multiple GPU devices #20519
jacksettles asked this question in DDP / multi-GPU / multi-node
I am working on a project training a transformer language model of about 1.2B parameters on a small-ish dataset of about 80M words. That is small by industry standards, but big for this project at least. The issue I keep running into is related to device failure. In plain PyTorch DDP, one device would always get ahead of the others, causing a hang, and the job would fail. Now I am using PyTorch Lightning, but the same issue seems to be happening.
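One knob I have been looking at (not sure it is the right one) is the process-group timeout on the DDP strategy, since the failure looks like ranks falling out of sync at a collective. A minimal sketch, assuming the `lightning>=2.0` import path:

```python
from datetime import timedelta
from lightning.pytorch.strategies import DDPStrategy

# Raise the collective-op (NCCL) timeout from the 30-minute default so a
# temporarily slow rank has more time to catch up before the process group aborts.
ddp_strategy = DDPStrategy(timeout=timedelta(hours=2))
```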
I was able to train this model on about 30M words with one GPU, but it didn't get very far (maybe 10-15 epochs). Since I am using more data, I want to use multiple GPUs to speed up training. I have tried plain PyTorch with DDP, and now I am on PyTorch Lightning with ddp as my strategy. I have access to 4 Nvidia A100 GPUs on my school's computing cluster, and I am submitting the jobs with SLURM. I am trying to figure out the right setup, because as of now the job keeps failing after a few hours. It looks like a memory issue, but I know the GPUs' memory isn't being fully used.
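For context, here is roughly the shape of my setup. The module/datamodule names and exact values below are placeholders rather than my actual code, and this assumes the `lightning>=2.0` import path:

```python
import lightning.pytorch as pl

# Placeholder LightningModule / LightningDataModule -- my real ones differ.
model = TransformerLMModule()                   # ~1.2B parameter transformer
datamodule = SentenceDataModule(batch_size=16)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,            # 4 x A100 on a single node
    num_nodes=1,
    strategy="ddp",
    max_epochs=50,        # placeholder value
)
trainer.fit(model, datamodule=datamodule)

# Submitted via sbatch with (roughly):
#   #SBATCH --nodes=1
#   #SBATCH --ntasks-per-node=4      # one task per GPU
#   #SBATCH --gpus-per-node=4
#   #SBATCH --cpus-per-task=16
#   srun python train.py
```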
I am currently using a batch size of 16 sentences, but I was wondering if that is too small. When I run nvidia-smi, I can see that the GPUs are not fully utilized in terms of memory, so I was wondering if increasing the batch size might help by reducing I/O overhead from having fewer batches. I was also wondering if using more workers in my dataloader might help. The max number of CPUs I can have for a 1-node job is 64, so with 4 GPUs I can only have 16 CPUs per task (i.e. per GPU device), and that didn't seem to work. I also tried using 2 GPUs with 32 CPUs per task and gave 28 workers to the dataloader, but it still died on me.
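And the DataLoader side, again with placeholder names; this is where I have been experimenting with batch_size and num_workers:

```python
from torch.utils.data import DataLoader

# Placeholder dataset of tokenized sentences -- name is illustrative.
train_loader = DataLoader(
    train_dataset,
    batch_size=16,             # per-GPU batch size I'm currently using
    num_workers=16,            # <= --cpus-per-task for the SLURM job
    pin_memory=True,
    persistent_workers=True,
    shuffle=True,              # Lightning swaps in a DistributedSampler under ddp
    drop_last=True,            # avoids a short final batch seen by only some ranks
)
```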
I can't seem to figure out the right configuration. Is my model size plus data size too big for a 1-node job? Are there any hyperparameters or special args I don't know about that might help in this scenario? Any help is greatly appreciated!!