Ensuring _fair_ validation on epoch end in multi-gpu setting #20566
Unanswered
strentom
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hello! I'm training a model on multiple devices with a validation epoch after each training epoch. The docs on validation warn:
Does this warning affect only DDP (multi-GPU / multi-node) training, or also DP?
My training selects the best model based on the validation F1 score. How do I know whether this score is correct? Doesn't it suffer from the same issue? Or is the error small enough that it doesn't matter?
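For context, the uneven-sharding behavior behind the docs' warning can be reproduced with plain PyTorch's `DistributedSampler`; the dataset size and rank count below are illustrative, not taken from the discussion:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# Illustrative numbers: 10 validation samples sharded across 4 ranks.
# DistributedSampler pads its index list to ceil(10 / 4) * 4 = 12,
# so two samples end up being evaluated twice.
dataset = TensorDataset(torch.arange(10))

all_indices = []
for rank in range(4):
    # Passing num_replicas/rank explicitly avoids needing an initialized
    # process group for this demonstration.
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=False)
    all_indices.extend(list(sampler))

print(len(all_indices))  # 12 indices for only 10 samples
dupes = [i for i in set(all_indices) if all_indices.count(i) > 1]
print(sorted(dupes))     # samples 0 and 1 are counted twice
```

Any metric averaged over these shards (F1 included) therefore weights the duplicated samples twice, which is why the bias shrinks as the dataset size grows relative to the number of devices but never fully disappears unless each sample is seen exactly once.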
Why, then, is validating on a single device recommended only for "Research [benchmarking] to be done the right way"? Industry production deployments also have high requirements for performance and reproducibility.
Thanks!