Make note about grad accum and prec (#1296)
muellerzr authored Apr 6, 2023
1 parent 3cb9d5f commit 419ecf3
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions docs/source/concept_guides/performance.mdx
@@ -92,3 +92,9 @@ optimizer = AdamW(params=model.parameters(), lr=learning_rate)
You will also find that `accelerate` will step the learning rate based on the number of processes being trained on. This is because
of the observed batch size noted earlier. So in the case of 2 GPUs, the learning rate will be stepped twice as often as on a single GPU
to account for the batch size being twice as large (if no changes to the batch size on the single GPU instance are made).
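
As a minimal sketch of what this looks like in practice (the toy model and `LambdaLR` schedule below are purely illustrative, assuming the default `Accelerator` settings):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and schedule purely for illustration
model = torch.nn.Linear(8, 2)
optimizer = AdamW(params=model.parameters(), lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 0.99**step)

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

# On 2 GPUs, a single call here advances the underlying schedule twice,
# matching the larger observed batch size
scheduler.step()
```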

## Gradient Accumulation and Mixed Precision

When using gradient accumulation and mixed precision, some degradation in performance is expected due to how gradients are averaged
during accumulation and the precision loss introduced by mixed precision. This is most visible when comparing the batch-wise loss across
different compute setups. However, the overall loss, metric, and general performance at the end of training should be _roughly_ the same.
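
For reference, a rough sketch of enabling both features together (the toy model, data, and loss below are purely illustrative, and fp16 assumes a GPU is available):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Enable gradient accumulation and mixed precision together
accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")

# Toy model and data purely for illustration
model = torch.nn.Linear(8, 2)
optimizer = AdamW(params=model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 2))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    # Gradients are only synchronized and applied every 4 batches
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        # The batch-wise value of `loss` may differ slightly between setups
        # due to fp16 rounding and gradient averaging over accumulation steps
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```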
