Ensuring _fair_ validation on epoch end in multi-gpu setting #20566
Unanswered
strentom
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Hello! I'm training a model on multiple devices with a validation epoch after each training epoch. The docs on validation warn:
Does this warning affect only DDP (multi-GPU / multi-node) training, or also DP?
My training selects the best model based on the validation F1 score. How do I know whether this score is correct? Doesn't it suffer from the same issue? Or is the error small enough that it doesn't matter?
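For context, the uneven-sharding behavior behind the docs' warning can be reproduced with plain PyTorch's `DistributedSampler`; the dataset size and rank count below are illustrative, not taken from the discussion:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# Illustrative numbers: 10 validation samples sharded across 4 ranks.
# DistributedSampler pads its index list to ceil(10 / 4) * 4 = 12,
# so two samples end up being evaluated twice.
dataset = TensorDataset(torch.arange(10))

all_indices = []
for rank in range(4):
    # Passing num_replicas/rank explicitly avoids needing an initialized
    # process group for this demonstration.
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=False)
    all_indices.extend(list(sampler))

print(len(all_indices))  # 12 indices for only 10 samples
dupes = [i for i in set(all_indices) if all_indices.count(i) > 1]
print(sorted(dupes))     # samples 0 and 1 are counted twice
```

Any metric averaged over these shards (F1 included) therefore weights the duplicated samples twice, which is why the bias shrinks as the dataset size grows relative to the number of devices but never fully disappears unless each sample is seen exactly once.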
Why, then, is validating on a single device recommended only for "Research [benchmarking] to be done the right way"? Industry production deployments also have high requirements for performance and reproducibility.
Thanks!