This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Initial mixed-precision training #196

Merged (5 commits) on Apr 19, 2019

Conversation

@slayton58 (Contributor) commented Nov 21, 2018

This PR adds initial mixed-precision training support via apex.amp.

Mixed-precision is controlled with the SOLVER.MIXED_PRECISION config argument.

Along with apex.amp support, I've moved DistributedDataParallel to apex.DistributedDataParallel as this is what we've been using to good effect over the last few months.

Please note that this does add the apex package as a requirement.
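[Editor's note] For readers unfamiliar with apex.amp, here is a minimal sketch of the training-loop pattern this PR is built around (it reflects the newer amp API the PR eventually moved to; build_model() and data_loader are placeholders, and the exact wiring in tools/train_net.py may differ):

import torch
from apex import amp

model = build_model().cuda()                      # placeholder for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# "O1" patches whitelisted ops to run in fp16 while keeping fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, targets in data_loader:               # placeholder data loader
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    # Loss scaling protects small fp16 gradients from underflowing to zero.
    with amp.scale_loss(losses, optimizer) as scaled_losses:
        scaled_losses.backward()
    optimizer.step()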

@facebook-github-bot added the CLA Signed label on Nov 21, 2018
@miguelvr (Contributor)

could you please add it to the Dockerfile as well?

@slayton58 (Contributor Author)

@miguelvr Done - hadn't noticed that file was there :)

@fmassa (Contributor) left a comment

Thanks a lot for the PR!

I've left a few comments and questions to get started, as I'm not familiar with APEX.

tools/train_net.py (two outdated review threads, resolved)
-            model, device_ids=[local_rank], output_device=local_rank,
-            # this should be removed if we update BatchNorm stats
-            broadcast_buffers=False,
+        model = DDP(
Contributor

Is there a difference now between DistributedDataParallel from PyTorch and from apex? What about the non-legacy DistributedDataParallel from c10d -- does it have similar performance?

Or does the apex one handle fp16 differently?

Contributor Author

apex.DistributedDataParallel has similar perf to the c10d implementation, and it's what we've been running for the last few months -- I'm not married to the change, but I might suggest using it until the c10d implementation works for Mask-RCNN (if it doesn't already in master)
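[Editor's note] For reference, a rough sketch of the two wrappers being compared; model and local_rank come from the surrounding training script, and the delay_allreduce flag is shown only as an example of how apex's DDP can be configured, not necessarily how this PR configures it:

import torch
from apex.parallel import DistributedDataParallel as ApexDDP

# PyTorch (c10d) wrapper, as used on master at the time:
model_torch_ddp = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank,
    broadcast_buffers=False,  # see the comment in tools/train_net.py
)

# apex wrapper discussed in this thread; it infers devices from the model and
# can defer the gradient all-reduce until the whole backward pass has finished.
model_apex_ddp = ApexDDP(model, delay_allreduce=True)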

Contributor

Do we actually need it for mixed precision to work?

Contributor Author

We shouldn't, but I haven't tested it.

@@ -17,6 +17,13 @@ def __init__(self, n):
         self.register_buffer("running_var", torch.ones(n))
 
     def forward(self, x):
+        # Cast all fixed parameters to half() if necessary
+        if x.type() == torch.half:
+            self.weight = self.weight.half()
Contributor

So it seems that we don't explicitly cast the model to fp16 during initialization -- is that right?
This seems a bit counter-intuitive to me; what happens if we just cast everything in the model to .half()?

Contributor Author

Because not everything can be half -- not all ops support half. One could write a function to cast only the ops that do support half (cast_some_to_half(model), maybe), but I decided to special-case this one -- I'm open to other approaches :)
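[Editor's note] A rough sketch of what the hypothetical cast_some_to_half(model) helper mentioned above could look like; the name comes from the comment, and the module whitelist is purely an assumption for illustration:

import torch.nn as nn

# Modules we are comfortable running in fp16 (illustrative whitelist only).
HALF_SAFE_MODULES = (nn.Conv2d, nn.Linear)

def cast_some_to_half(model):
    # Cast only whitelisted modules to half precision; everything else
    # (e.g. losses, normalization buffers) stays in fp32.
    for module in model.modules():
        if isinstance(module, HALF_SAFE_MODULES):
            module.half()
    return model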

Contributor

What are the ops that do not support half (apart from the custom ones that are in this repo)? I thought that all ops in pytorch supported fp16 for cuda (with potentially bad accuracy)

Contributor Author

In using "support" I chose my words badly. It's not necessarily support, more "can be used with a reasonable expectation of not losing accuracy". apex.amp takes a conservative approach by not moving ops to fp16 when we're not sure of their accuracy (the lists of what is / isn't moved to fp16 are in the files here).

We could try casting the entire model to half and see what happens -- but there's enough code, in the RPN especially, whose behavior in lower precision I'm just not sure about, so I decided to be conservative beyond what apex.amp does. Unfortunately, that means that until PyTorch can grok y_16 = a_32 * x_16 + b_32 (where the subscripts denote precision), we have to do something manual here.
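[Editor's note] To make the "something manual" concrete, a sketch (assumed code, not taken from this PR) of one way to handle fp32 frozen-BatchNorm parameters against fp16 activations: compute the affine factors in fp32, then cast them to the activation dtype so the multiply-add runs in a single precision:

def frozen_bn_affine(x, weight, bias, running_mean, running_var):
    # Compute scale/shift in fp32 for accuracy...
    scale = weight * running_var.rsqrt()
    shift = bias - running_mean * scale
    # ...then match the activation dtype so y = a * x + b works when x is fp16.
    scale = scale.to(x.dtype).reshape(1, -1, 1, 1)
    shift = shift.to(x.dtype).reshape(1, -1, 1, 1)
    return x * scale + shift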

maskrcnn_benchmark/layers/batch_norm.py (outdated review thread, resolved)
@@ -116,6 +116,6 @@ def forward(self, x, boxes):
         for level, (per_level_feature, pooler) in enumerate(zip(x, self.poolers)):
             idx_in_level = torch.nonzero(levels == level).squeeze(1)
             rois_per_level = rois[idx_in_level]
-            result[idx_in_level] = pooler(per_level_feature, rois_per_level)
+            result[idx_in_level] = pooler(per_level_feature, rois_per_level).to(dtype)
Contributor

Doesn't amp.float_function cast the values back to fp16 after they are computed?

Contributor Author

I'd need to run it again to work out exactly what case was happening here, but the type change was not happening correctly and I had to cast manually here to prevent errors.

Contributor

Quick question: is this casting still relevant?

Contributor Author

Yes: result[idx_in_level] is expected to be in fp16 (as it's the same precision as the input), but the pooler returns fp32 (explicitly, as the code hasn't had fp16 support added). To get around this, the result from the pooler needs to be cast.

Contributor

But I thought that amp.float_function would:
1 - cast to float
2 - compute
3 - cast back to fp16

Or is my understanding of it wrong?

Contributor

Ok.
I think the solution I'd potentially go for myself (while fp16 support is not present in the core pooler functions) is to just cast to float on the C++ side and cast back if the input type is fp16.
But I suppose this is not really a hard requirement here (it would just make things cleaner).

Contributor Author

Would you like a comment added explaining the current need for the cast (along with a TODO for full fp16 support in pooling, if you so desire)?

Contributor

I just find it very unintuitive why you had to add this casting only here, and not after NMS as well. :-/

Contributor Author

Because this is a strange case -- you're in fp16-land, you allocate your output as fp16, then run something that has to cast up to fp32. There's no automatic cast back (you're calling from inside a module, so there's no module boundary to trigger a cast), so it has to be done manually. If lines 111-115 didn't exist and the type were inferred from the return type of the pooling call, this explicit cast wouldn't have to be there.
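[Editor's note] A minimal sketch of that pattern (the helper and its name are illustrative, not from this PR): amp.float_function upcasts fp16 inputs to fp32 before the wrapped call, but since this is a plain function call inside a module there is no boundary at which the fp32 result would be cast back, hence the explicit cast:

from apex import amp

def run_in_fp32(op, *args, out_dtype):
    # The wrapped op sees fp32 inputs under amp...
    fp32_op = amp.float_function(op)
    result = fp32_op(*args)
    # ...but its fp32 output has to be cast back by hand to match the
    # surrounding (possibly fp16) tensors, as with the pooler change above.
    return result.to(out_dtype)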

Contributor Author

I understand it's a little weird, but that's where the code is right now -- if you want an explicit (non-AMP) fp16 version, that can also be done, but it'll be more invasive and I can't get to it until after the new year, when I'm back.

@miguelvr (Contributor)

any updates on this?

@slayton58 (Contributor Author)

I'm rebasing right now (along with moving back to the PyTorch default DistributedDataParallel); otherwise I'm waiting on more reviews / feedback.

tools/train_net.py (outdated review thread, resolved)
@fmassa (Contributor) left a comment

This looks good to me, thanks a lot @slayton58 !

@wat3rBro will be running a few training jobs to double check that accuracy is the same, and then I'll get this PR merged.

@wat3rBro (Contributor) commented Jan 9, 2019

@slayton58 Hi, I'm collecting training stats and found that training terminates silently during the first iteration when fp16 is enabled for the ResNeXt model (both detection and mask) on V100. I checked the verbose output from apex and couldn't find any difference compared with a successful run of the same model on P100. The exit seems to happen when running:

with optimizer.scale_loss(losses) as scaled_losses:
    scaled_losses.backward()

Is this a known issue?

@slayton58 (Contributor Author)

@wat3rBro I've never run with ResNeXt (it didn't exist in the version of the code I originally developed this against). There should be no difference in behavior between ResNet and ResNeXt -- does this happen on the first iteration? Also, do other tests pass on P100 (I develop on V100)?

@wat3rBro (Contributor)

@slayton58 it happens during the first iteration. Could you verify if it works for e2e_mask_rcnn_X_101_32x8d_FPN_1x.yaml? On P100 I have R-50-C4/FPN running successfully without accuracy loss.

Adds fp16 support via apex.amp
Also switches communication to apex.DistributedDataParallel
Added support to tools/test_net.py
SOLVER.MIXED_PRECISION -> DTYPE \in {float32, float16}
apex.amp not installed now raises ImportError
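[Editor's note] Based on the commit descriptions above, the switch moved from SOLVER.MIXED_PRECISION to a DTYPE option; a sketch of how such a flag plausibly maps onto amp (assumed wiring, shown for orientation only -- cfg, model and optimizer come from the surrounding training script):

from apex import amp

# maskrcnn_benchmark/config/defaults.py (sketch): _C.DTYPE = "float32",
# with "float16" enabling mixed-precision training.

# tools/train_net.py (sketch):
use_mixed_precision = cfg.DTYPE == "float16"
amp_opt_level = "O1" if use_mixed_precision else "O0"
model, optimizer = amp.initialize(model, optimizer, opt_level=amp_opt_level)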
@slayton58 (Contributor Author)

@wat3rBro I was having an issue with my branch due to some of the pre-trained model URLs having changed (this is the relevant commit) -- I rebased (and pushed) against latest master and I can start training on a single V100 now.

@wat3rBro (Contributor)

@slayton58 Does it run properly on your machine at least for the first few iterations?

@slayton58 (Contributor Author)

Yes. Are you still having issues?

2019-01-10 14:11:12,900 maskrcnn_benchmark.trainer INFO: Start training
2019-01-10 14:11:25,448 maskrcnn_benchmark.trainer INFO: eta: 1 day, 7:21:44  iter: 20  loss: 2.5158 (3.3709)  loss_classifier: 0.5149 (1.5699)  loss_box_reg: 0.0209 (0.0326)  loss_mask: 0.8690 (1.0426)  loss_objectness: 0.6652 (0.6536)  loss_rpn_box_reg: 0.0419 (0.0723)  time: 0.4033 (0.6273)  data: 0.0040 (0.1115)  lr: 0.000448  max mem: 6004
INFO:maskrcnn_benchmark.trainer:eta: 1 day, 7:21:44  iter: 20  loss: 2.5158 (3.3709)  loss_classifier: 0.5149 (1.5699)  loss_box_reg: 0.0209 (0.0326)  loss_mask: 0.8690 (1.0426)  loss_objectness: 0.6652 (0.6536)  loss_rpn_box_reg: 0.0419 (0.0723)  time: 0.4033 (0.6273)  data: 0.0040 (0.1115)  lr: 0.000448  max mem: 6004
2019-01-10 14:11:33,937 maskrcnn_benchmark.trainer INFO: eta: 1 day, 2:17:19  iter: 40  loss: 1.5658 (2.6424)  loss_classifier: 0.2581 (1.0473)  loss_box_reg: 0.0278 (0.0597)  loss_mask: 0.7494 (0.9107)  loss_objectness: 0.3690 (0.5331)  loss_rpn_box_reg: 0.0280 (0.0915)  time: 0.4112 (0.5259)  data: 0.0046 (0.0584)  lr: 0.000482  max mem: 6124
INFO:maskrcnn_benchmark.trainer:eta: 1 day, 2:17:19  iter: 40  loss: 1.5658 (2.6424)  loss_classifier: 0.2581 (1.0473)  loss_box_reg: 0.0278 (0.0597)  loss_mask: 0.7494 (0.9107)  loss_objectness: 0.3690 (0.5331)  loss_rpn_box_reg: 0.0280 (0.0915)  time: 0.4112 (0.5259)  data: 0.0046 (0.0584)  lr: 0.000482  max mem: 6124

@fmassa (Contributor) commented Jan 11, 2019

One thing to keep in mind: @wat3rBro what version of CUDNN are you using that gives the problem?

@wat3rBro (Contributor)

@fmassa I'm still on CuDNN 7.1.2 because of our infra.

@slayton58 (Contributor Author)

@wat3rBro Any progress on your side? If we want to track down more details on exactly where the failure is happening on your end, you could try running with CUDA_LAUNCH_BLOCKING=1 and see where the error is reported (without that env variable, PyTorch's normal async error checking applies and the reported error location can be really misleading).

@wat3rBro (Contributor)

@slayton58 that env variable doesn't seem to give any error report either. Anyway, we'll have 7.4 available in several days; hopefully this issue just goes away.

@wat3rBro (Contributor)

@slayton58 it does work with 7.4.2. On 7.1.4 I got Floating point exception (core dumped). Are you going to fix the old version?

@slayton58 (Contributor Author)

@wat3rBro could you please run with the following and send me the resulting cudnn.log file -- I can use it to file an internal bug (or email it to me at slayton (at) nvidia (dot) com if it's too big):

export CUDA_LAUNCH_BLOCKING=1
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=./cudnn.log
<normal single GPU run command here>

@slayton58 (Contributor Author)

@wat3rBro Any progress?

@wat3rBro (Contributor) commented Jan 18, 2019

Hey @slayton58, I've sent you the log via email, sorry if I didn't notify you.

@slayton58 (Contributor Author)

@wat3rBro Hmm, I haven't seen anything come through my email - what address did you send it from?

@wat3rBro (Contributor)

Maybe it's blocked; I just sent it again from my personal email.

@ClimbsRocks (Contributor)

Curious, what's the current state of this PR? Some of the X-152 models I'm training with are huge and slow. I've seen notable improvements from training with NVIDIA's half-precision code and PyTorch in the past.

@zimenglan-sysu-512 (Contributor)

hi @slayton58
curious, what's the current state of this PR?

@LeviViana (Contributor)

I've run some tests on 4 x 2080Ti and didn't see any improvement in training time. In my tests, I compared _C.DTYPE = "float16" vs _C.DTYPE = "float32". I tested:

  • e2e_faster_rcnn_R_50_FPN_1x.yaml for 500 iterations
  • e2e_mask_rcnn_X_101_32x8d_FPN_1x.yaml for 500 iterations

@slayton58 what could be going wrong in these tests? Could you provide some benchmarks on the improvements you've achieved?

@slayton58 (Contributor Author)

@zimenglan-sysu-512 I'm still waiting on a merge. The PR should still be up-to-date and compatible with recent refactors of apex.amp.

@LeviViana Speedup depends on the network and the per-GPU batch size -- e2e_mask_rcnn_R_50_FPN_1x.yaml is the config most tested, with per-GPU batch sizes of 1, 2 and 4. Based on the numbers from the model zoo and my iterations pasted above, the speedup on this config is ~0.45/0.4, i.e. about 12.5%. ResNeXt backbones (the second config you mention) don't seem to benefit as much from fp16, and I've never actually run Faster R-CNN, so I can't comment there.

@LeviViana (Contributor)

Thanks for your quick answer. Indeed, my previous tests used a 1 img/GPU setup. I re-ran some tests on the e2e_faster_rcnn_R_50_FPN_1x.yaml config and got roughly:

  • 10% speedup with 2 imgs/gpu
  • 20% speedup with 3 imgs/gpu

@zimenglan-sysu-512 (Contributor)

Thanks @slayton58.
Hi @fmassa, do you still plan to merge the PR?

@fmassa merged commit 08fcf12 into facebookresearch:master on Apr 19, 2019
@fmassa (Contributor) commented Apr 19, 2019

Sorry for the delay in merging, and thanks @slayton58 !

@zimenglan-sysu-512 (Contributor) commented Apr 22, 2019

Hi @slayton58,
I have a question: will apex change the type of the labels in targets during the forward pass in training, e.g. from torch.int64 to torch.float32?

@slayton58 (Contributor Author)

@zimenglan-sysu-512 No, labels will stay as torch.int64

@obendidi (Contributor) commented Jun 3, 2019

I don't know if this was taken into account, but apex throws an error when trying to run inference on a CPU-only machine.

What I installed:

RUN conda install pytorch-cpu=1.1.0 torchvision-cpu=0.3.0 -c pytorch

# install apex
RUN git clone https://github.com/NVIDIA/apex.git /apex
RUN cd /apex && git checkout 14e34f7f89967dcbe5876b8bf416e311dd90b9dd && python setup.py install --cpp_ext

# install PyTorch maskrcnn-benchmark
RUN git clone https://github.com/Sterblue/maskrcnn-benchmark.git /maskrcnn-benchmark \
    && cd /maskrcnn-benchmark \
    && git checkout 5c41f1225208c3cd22b9c4734fa1d89f3f4de592 \
    && python setup.py build develop

Error:

  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/__init__.py", line 2, in <module>
    from .detectors import build_detection_model
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/detectors.py", line 2, in <module>
    from .generalized_rcnn import GeneralizedRCNN
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 11, in <module>
    from ..backbone import build_backbone
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/backbone/__init__.py", line 2, in <module>
    from .backbone import build_backbone
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/backbone/backbone.py", line 7, in <module>
    from maskrcnn_benchmark.modeling.make_layers import conv_with_kaiming_uniform
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/make_layers.py", line 10, in <module>
    from maskrcnn_benchmark.layers import Conv2d
  File "/maskrcnn-benchmark/maskrcnn_benchmark/layers/__init__.py", line 10, in <module>
    from .nms import nms
  File "/maskrcnn-benchmark/maskrcnn_benchmark/layers/nms.py", line 5, in <module>
    from apex import amp
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/__init__.py", line 2, in <module>
    from . import amp
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/__init__.py", line 1, in <module>
    from .amp import init, half_function, float_function, promote_function,\
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/amp.py", line 3, in <module>
    from .lists import functional_overrides, torch_overrides, tensor_overrides
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/lists/torch_overrides.py", line 69, in <module>
    if utils.get_cuda_version() >= (9, 1, 0):
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/utils.py", line 9, in get_cuda_version
    return tuple(int(x) for x in torch.version.cuda.split('.'))
AttributeError: 'NoneType' object has no attribute 'split'
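[Editor's note] One possible workaround, sketched here as an assumption rather than a fix from this thread: guard the apex import in the layers that use it (the traceback points at maskrcnn_benchmark/layers/nms.py), so CPU-only installs fall back to the raw extension op:

# maskrcnn_benchmark/layers/nms.py (sketch)
from maskrcnn_benchmark import _C

try:
    from apex import amp
    # Under amp, force the custom op to run in fp32.
    nms = amp.float_function(_C.nms)
except (ImportError, AttributeError):
    # apex is missing, or fails to import on a CPU-only torch build
    # (torch.version.cuda is None, as in the traceback above).
    nms = _C.nms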

Lyears pushed a commit to Lyears/maskrcnn-benchmark that referenced this pull request Jun 28, 2020
* Initial multi-precision training

Adds fp16 support via apex.amp
Also switches communication to apex.DistributedDataParallel

* Add Apex install to dockerfile

* Fixes from @fmassa review

Added support to tools/test_net.py
SOLVER.MIXED_PRECISION -> DTYPE \in {float32, float16}
apex.amp not installed now raises ImportError

* Remove extraneous apex DDP import

* Move to new amp API