
Does loss dict keep the same order in different processes? #309

Closed
yelantf opened this issue Dec 29, 2018 · 4 comments

Comments

@yelantf (Contributor) commented Dec 29, 2018

❓ Questions and Help

In maskrcnn_benchmark.engine.trainer, the function reduce_loss_dict reduces all losses in loss_dict onto rank 0.

# relevant imports in maskrcnn_benchmark/engine/trainer.py:
#   import torch
#   import torch.distributed as dist
#   from maskrcnn_benchmark.utils.comm import get_world_size
def reduce_loss_dict(loss_dict):
    """
    Reduce the loss dictionary from all processes so that process with rank
    0 has the averaged results. Returns a dict with the same fields as
    loss_dict, after reduction.
    """
    world_size = get_world_size()
    if world_size < 2:
        return loss_dict
    with torch.no_grad():
        loss_names = []
        all_losses = []
        for k, v in loss_dict.items():
            loss_names.append(k)
            all_losses.append(v)
        all_losses = torch.stack(all_losses, dim=0)
        dist.reduce(all_losses, dst=0)
        if dist.get_rank() == 0:
            # only main process gets accumulated, so only divide by
            # world_size in this case
            all_losses /= world_size
        reduced_losses = {k: v for k, v in zip(loss_names, all_losses)}
    return reduced_losses

It uses loss_dict.items(), and I think the key order of a plain dictionary is not guaranteed to be the same across processes. Would it be better to use an OrderedDict?
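
To illustrate the concern, here is a hypothetical example (the loss names and values below are made up, not taken from the repo): if two processes happen to build loss_dict with the keys in different orders, dist.reduce sums the stacked tensors position-wise, so values from different losses get added together and attributed to the wrong names.

# Hypothetical illustration of the ordering hazard (loss names are made up).
# Suppose two processes stack their losses in different key orders:
rank0_order = ["loss_cls", "loss_box"]   # stacked as [0.5, 1.0] on rank 0
rank1_order = ["loss_box", "loss_cls"]   # stacked as [1.2, 0.4] on rank 1
# dist.reduce then sums the stacked tensors element-wise:
#   position 0: 0.5 (loss_cls on rank 0) + 1.2 (loss_box on rank 1) = 1.7
#   position 1: 1.0 (loss_box on rank 0) + 0.4 (loss_cls on rank 1) = 1.4
# Rank 0 would log loss_cls = 1.7 / 2 and loss_box = 1.4 / 2,
# silently mixing values from different losses.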

@fmassa (Contributor) commented Dec 29, 2018

Yes, an OrderedDict would be safer.
Another possibility would be to sort loss_names and use that order to build all_losses.
Can you send a PR with the change?
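
A minimal sketch of the sorted-keys approach (an illustration only, not necessarily the exact change that later landed in #310), assuming torch.distributed is already initialized:

import torch
import torch.distributed as dist

def reduce_loss_dict_sorted(loss_dict):
    """Sketch: iterate the keys in sorted order so every process stacks
    the losses identically before the reduce."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size < 2:
        return loss_dict
    with torch.no_grad():
        loss_names = sorted(loss_dict.keys())
        all_losses = torch.stack([loss_dict[k] for k in loss_names], dim=0)
        dist.reduce(all_losses, dst=0)
        if dist.get_rank() == 0:
            # only rank 0 holds the accumulated sum, so only it averages
            all_losses /= world_size
        return {k: v for k, v in zip(loss_names, all_losses)}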

@yelantf (Contributor, author) commented Dec 30, 2018

Yes, I can do that.

@yelantf (Contributor, author) commented Dec 30, 2018

By the way, the losses really did get mismatched during reduction. The issue became apparent after I added several new loss items to the original code: once I sorted the keys, the values of those losses in the log changed significantly. Fortunately, this only affects the logging; the training itself remains correct.

fmassa pushed a commit that referenced this issue Dec 30, 2018

@fmassa (Contributor) commented Dec 30, 2018

Thanks a lot for fixing it in #310!

fmassa closed this as completed Dec 30, 2018
BobZhangHT added a commit to BobZhangHT/maskrcnn-benchmark that referenced this issue Jan 2, 2019
nprasad2021 pushed a commit to nprasad2021/maskrcnn-benchmark that referenced this issue Jan 29, 2019