Does pytorch lightning divide the loss by number of gradient accumulation steps? #17035
-
For example, in this code, does the Trainer divide the loss by the number of gradient accumulation steps (i.e. 7)?

```python
# Accumulate gradients for 7 batches
trainer = Trainer(accumulate_grad_batches=7)
```

I want to know this in order to set the learning rate correctly. I saw that the Hugging Face Accelerate package does divide the loss (https://huggingface.co/docs/accelerate/usage_guides/gradient_accumulation):

```python
loss = loss / gradient_accumulation_steps  # this is the line that divides the loss
accelerator.backward(loss)

if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

If the loss is not divided, the accumulated gradient becomes larger, which has implications for whether you should multiply the learning rate by the number of gradient accumulation steps.
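For reference, here is a minimal plain-PyTorch sketch of the scaling argument (the model, data, and hyperparameters are made up for illustration): dividing each micro-batch loss by the number of accumulation steps makes the accumulated gradient match the gradient of one large batch, instead of growing roughly 7x.

```python
import torch
from torch import nn

# Toy setup; all names and sizes are illustrative only.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accumulation_steps = 7

micro_batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(accumulation_steps)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    # Scaling each micro-batch loss by 1/accumulation_steps means the summed
    # gradients equal the gradient of the mean loss over all 7 micro-batches,
    # i.e. the same magnitude a single 28-sample batch would produce.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Without the division, the accumulated gradient corresponds to the sum of the per-micro-batch mean losses, which is what raises the question of whether to rescale the learning rate by the accumulation factor instead.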
-
Now I know why my model doesn't converge :)
-
From pytorch-lightning/src/lightning/pytorch/core/module.py, lines 707 to 709 in bb861cb:
https://github.com/Lightning-AI/lightning/blob/bb861cba7e2a4597c56def506f0a64c9a30b9e8a/src/lightning/pytorch/core/module.py#L1038-L1054
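A minimal way to check the behavior empirically (not from the original thread; it assumes Lightning >= 2.0, and ToyModule plus all hyperparameters are invented for illustration): print the gradient norm right before each optimizer step, then run once with `accumulate_grad_batches=7` / `batch_size=4` and once with `accumulate_grad_batches=1` / `batch_size=28` on the same seed. If the Trainer scales the loss by the accumulation factor, the two printed norms should be close rather than differing by roughly 7x.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def on_before_optimizer_step(self, optimizer):
        # Called once per optimizer step, i.e. after all accumulation micro-batches.
        grad_norm = torch.linalg.vector_norm(self.layer.weight.grad).item()
        print(f"grad norm before optimizer step: {grad_norm:.4f}")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


pl.seed_everything(0)
x, y = torch.randn(28, 10), torch.randn(28, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=4)  # 7 micro-batches per epoch

trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=7,
                     logger=False, enable_checkpointing=False, enable_progress_bar=False)
trainer.fit(ToyModule(), loader)
```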