Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] Convert SyncBN to BN when training on DP #772

Merged
merged 14 commits into from
Sep 15, 2021

Conversation

clownrat6
Copy link
Contributor

@clownrat6 clownrat6 commented Aug 10, 2021

Incompatible between DP and SyncBN

run this command without setting --launcher:

python tools/train.py [config_path]

The python environment will report errors about process group:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

This error is caused by SyncBN. When SyncBN.training is True, SyncBN need to init process_group. However, process_group is only valid when DDP.

The situation that SyncBN is valid:

  • process_group is not None or torch.distributed.group.WORLD is not None;
  • SyncBN.training is False or SyncBN.eval();

Based on those mentioned above, we convert SyncBN to BN when training on DP.

This PR is blocked by MMCV #1253

MMCV PR #1253 may be released in MMCV 1.3.13 version, the mmcv compatibility of mmseg 0.18 will larger than 1.3.13.

The CI will pass when MMCV 1.3.13 version release

@codecov
Copy link

codecov bot commented Aug 10, 2021

Codecov Report

Merging #772 (fdc4208) into master (56e18ba) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head fdc4208 differs from pull request most recent head c7446d7. Consider uploading reports for the commit c7446d7 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
+ Coverage   89.02%   89.05%   +0.03%     
==========================================
  Files         111      111              
  Lines        6043     6051       +8     
  Branches      969      969              
==========================================
+ Hits         5380     5389       +9     
  Misses        467      467              
+ Partials      196      195       -1     
Flag Coverage Δ
unittests 89.05% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
mmseg/__init__.py 90.32% <100.00%> (ø)
mmseg/datasets/custom.py 92.43% <100.00%> (+0.34%) ⬆️
mmseg/datasets/pipelines/loading.py 98.52% <0.00%> (+1.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56e18ba...c7446d7. Read the comment docs.

@xvjiarui
Copy link
Collaborator

May move to models/utils

tools/train.py Outdated Show resolved Hide resolved
tools/train.py Outdated Show resolved Hide resolved
tools/train.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Junjun2016 Junjun2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the conflict.

@Junjun2016 Junjun2016 requested a review from xvjiarui August 31, 2021 16:12
@xvjiarui
Copy link
Collaborator

open-mmlab/mmcv#1253

@Junjun2016
Copy link
Collaborator

We can import from mmcv after it merged.

@Junjun2016
Copy link
Collaborator

Can import revert_sync_batchnorm from open-mmlab/mmcv#1253

tools/train.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Junjun2016 Junjun2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the conflict.

Copy link
Collaborator

@Junjun2016 Junjun2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xvjiarui
Copy link
Collaborator

Please upgrade mmcv requirement.

Copy link
Collaborator

@Junjun2016 Junjun2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@Junjun2016 Junjun2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xvjiarui xvjiarui merged commit cae715a into open-mmlab:master Sep 15, 2021
bowenroom pushed a commit to bowenroom/mmsegmentation that referenced this pull request Feb 25, 2022
* [Fix] Convert SyncBN to BN when training on DP.

* Modify SyncBN2BN.

* Add SyncBN2BN unit test.

* Resolve some comments.

* use mmcv official revert_sync_batchnorm

* Remove local syncbn2bn unit tests.

* Update mmcv version.

* Fix bugs of gather model tools.

* Modify warnings.

* Modify docker mmcv version.

* Update mmcv version table.
aravind-h-v pushed a commit to aravind-h-v/mmsegmentation that referenced this pull request Mar 27, 2023
* add accelerate to load models with smaller memory footprint

* remove low_cpu_mem_usage as it is reduntant

* move accelerate init weights context to modelling utils

* add test to ensure results are the same when loading with accelerate

* add tests to ensure ram usage gets lower when using accelerate

* move accelerate logic to single snippet under modelling utils and remove it from configuration utils

* format code using to pass quality check

* fix imports with isor

* add accelerate to test extra deps

* only import accelerate if device_map is set to auto

* move accelerate availability check to diffusers import utils

* format code

* add device map to pipeline abstraction

* lint it to pass PR quality check

* fix class check to use accelerate when using diffusers ModelMixin subclasses

* use low_cpu_mem_usage in transformers if device_map is not available

* NoModuleLayer

* comment out tests

* up

* uP

* finish

* Update src/diffusers/pipelines/stable_diffusion/safety_checker.py

* finish

* uP

* make style

Co-authored-by: Pi Esposito <piero.skywalker@gmail.com>
sibozhang pushed a commit to sibozhang/mmsegmentation that referenced this pull request Mar 22, 2024
* modify stat.py merge_docs

* unify merge_docs style

* fix bugs

* fix bugs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants