Make note about grad accum and prec (#1296)
muellerzr authored Apr 6, 2023
1 parent 3cb9d5f commit 419ecf3
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions docs/source/concept_guides/performance.mdx
@@ -92,3 +92,9 @@ optimizer = AdamW(params=model.parameters(), lr=learning_rate)
You will also find that `accelerate` will step the learning rate based on the number of processes being trained on. This is because
of the observed batch size noted earlier. So in the case of 2 GPUs, the learning rate will be stepped twice as often as on a single GPU
to account for the batch size being twice as large (if no changes to the batch size on the single GPU instance are made).
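
As a minimal sketch of what this looks like in practice (the toy model and `LambdaLR` schedule below are purely illustrative, assuming the default `Accelerator` settings):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and schedule purely for illustration
model = torch.nn.Linear(8, 2)
optimizer = AdamW(params=model.parameters(), lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 0.99**step)

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

# On 2 GPUs, a single call here advances the underlying schedule twice,
# matching the larger observed batch size
scheduler.step()
```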

## Gradient Accumulation and Mixed Precision

When using gradient accumulation and mixed precision, some degradation in performance is expected due to how gradients are averaged
during accumulation and the precision loss introduced by mixed precision. This is most visible when comparing the batch-wise loss across
different compute setups. However, the overall loss, metric, and general performance at the end of training should be _roughly_ the same.
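
For reference, a rough sketch of enabling both features together (the toy model, data, and loss below are purely illustrative, and fp16 assumes a GPU is available):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Enable gradient accumulation and mixed precision together
accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")

# Toy model and data purely for illustration
model = torch.nn.Linear(8, 2)
optimizer = AdamW(params=model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 2))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    # Gradients are only synchronized and applied every 4 batches
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        # The batch-wise value of `loss` may differ slightly between setups
        # due to fp16 rounding and gradient averaging over accumulation steps
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```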
