Does pytorch lightning divide the loss by number of gradient accumulation steps? #17035
-
For example, in this code, does the Trainer divide the loss by the number of gradient accumulation steps (i.e. 7)?

```python
# Accumulate gradients for 7 batches
trainer = Trainer(accumulate_grad_batches=7)
```

I want to know this in order to set the learning rate correctly. I saw that the Hugging Face Accelerate package does divide the loss (https://huggingface.co/docs/accelerate/usage_guides/gradient_accumulation):

```python
loss = loss / gradient_accumulation_steps  # this is the line that divides the loss
accelerator.backward(loss)

if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

If the loss is not divided, the accumulated gradient becomes larger, which has implications for whether you should multiply the learning rate by the number of gradient accumulation steps.
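For reference, here is a minimal plain-PyTorch sketch of the scaling argument (the model, data, and hyperparameters are made up for illustration): dividing each micro-batch loss by the number of accumulation steps makes the accumulated gradient match the gradient of one large batch, instead of growing roughly 7x.

```python
import torch
from torch import nn

# Toy setup; all names and sizes are illustrative only.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accumulation_steps = 7

micro_batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(accumulation_steps)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    # Scaling each micro-batch loss by 1/accumulation_steps means the summed
    # gradients equal the gradient of the mean loss over all 7 micro-batches,
    # i.e. the same magnitude a single 28-sample batch would produce.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Without the division, the accumulated gradient corresponds to the sum of the per-micro-batch mean losses, which is what raises the question of whether to rescale the learning rate by the accumulation factor instead.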
-
Now I know why my model doesn't converge :)
-
From pytorch-lightning/src/lightning/pytorch/core/module.py, lines 707 to 709 in bb861cb:
https://github.com/Lightning-AI/lightning/blob/bb861cba7e2a4597c56def506f0a64c9a30b9e8a/src/lightning/pytorch/core/module.py#L1038-L1054
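A minimal way to check the behavior empirically (not from the original thread; it assumes Lightning >= 2.0, and ToyModule plus all hyperparameters are invented for illustration): print the gradient norm right before each optimizer step, then run once with `accumulate_grad_batches=7` / `batch_size=4` and once with `accumulate_grad_batches=1` / `batch_size=28` on the same seed. If the Trainer scales the loss by the accumulation factor, the two printed norms should be close rather than differing by roughly 7x.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def on_before_optimizer_step(self, optimizer):
        # Called once per optimizer step, i.e. after all accumulation micro-batches.
        grad_norm = torch.linalg.vector_norm(self.layer.weight.grad).item()
        print(f"grad norm before optimizer step: {grad_norm:.4f}")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


pl.seed_everything(0)
x, y = torch.randn(28, 10), torch.randn(28, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=4)  # 7 micro-batches per epoch

trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=7,
                     logger=False, enable_checkpointing=False, enable_progress_bar=False)
trainer.fit(ToyModule(), loader)
```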