This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Initial mixed-precision training #196

Merged (5 commits) on Apr 19, 2019

Conversation

@slayton58 (Contributor) commented Nov 21, 2018

This PR adds initial mixed-precision training support via apex.amp.

Mixed-precision is controlled with the SOLVER.MIXED_PRECISION config argument.

Along with apex.amp support, I've moved DistributedDataParallel to apex.DistributedDataParallel as this is what we've been using to good effect over the last few months.

Please note that this does add the apex package as a requirement.
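[Editor's note] For readers unfamiliar with apex.amp, here is a minimal sketch of the training-loop pattern this PR is built around (it reflects the newer amp API the PR eventually moved to; build_model() and data_loader are placeholders, and the exact wiring in tools/train_net.py may differ):

import torch
from apex import amp

model = build_model().cuda()                      # placeholder for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# "O1" patches whitelisted ops to run in fp16 while keeping fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, targets in data_loader:               # placeholder data loader
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    # Loss scaling protects small fp16 gradients from underflowing to zero.
    with amp.scale_loss(losses, optimizer) as scaled_losses:
        scaled_losses.backward()
    optimizer.step()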

@facebook-github-bot added the CLA Signed label on Nov 21, 2018
@miguelvr (Contributor)

could you please add it to the Dockerfile as well?

@slayton58 (Contributor Author)

@miguelvr Done - hadn't noticed that file was there :)

@fmassa (Contributor) left a comment

Thanks a lot for the PR!

I've left a few comments and questions to get started, as I'm not familiar with APEX.

tools/train_net.py (two outdated review threads, resolved)
-            model, device_ids=[local_rank], output_device=local_rank,
-            # this should be removed if we update BatchNorm stats
-            broadcast_buffers=False,
+        model = DDP(
Contributor

Is there a difference now between DistributedDataParallel from PyTorch and from apex? What about the non-legacy DistributedDataParallel from c10d -- does it have similar performance?

Or does the apex one handle fp16 differently?

Contributor Author

apex.DistributedDataParallel has similar perf to the c10d implementation, and it's what we've been running for the last few months -- I'm not married to the change, but I might suggest using it until the c10d implementation works for Mask-RCNN (if it doesn't already in master)
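[Editor's note] For reference, a rough sketch of the two wrappers being compared; model and local_rank come from the surrounding training script, and the delay_allreduce flag is shown only as an example of how apex's DDP can be configured, not necessarily how this PR configures it:

import torch
from apex.parallel import DistributedDataParallel as ApexDDP

# PyTorch (c10d) wrapper, as used on master at the time:
model_torch_ddp = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank,
    broadcast_buffers=False,  # see the comment in tools/train_net.py
)

# apex wrapper discussed in this thread; it infers devices from the model and
# can defer the gradient all-reduce until the whole backward pass has finished.
model_apex_ddp = ApexDDP(model, delay_allreduce=True)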

Contributor

Do we actually need it for mixed precision to work?

Contributor Author

We shouldn't, but I haven't tested it.

@@ -17,6 +17,13 @@ def __init__(self, n):
         self.register_buffer("running_var", torch.ones(n))
 
     def forward(self, x):
+        # Cast all fixed parameters to half() if necessary
+        if x.type() == torch.half:
+            self.weight = self.weight.half()
Contributor

So it seems that we don't explicitly cast the model to fp16 during initialization -- is that right?
This seems a bit counter-intuitive to me; what happens if we just cast everything in the model to .half()?

Contributor Author

Because not everything can be half -- not all ops support half. One could write a function to cast only the ops that do support half (cast_some_to_half(model), maybe), but I decided to special-case this one -- I'm open to other approaches :)
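[Editor's note] A rough sketch of what the hypothetical cast_some_to_half(model) helper mentioned above could look like; the name comes from the comment, and the module whitelist is purely an assumption for illustration:

import torch.nn as nn

# Modules we are comfortable running in fp16 (illustrative whitelist only).
HALF_SAFE_MODULES = (nn.Conv2d, nn.Linear)

def cast_some_to_half(model):
    # Cast only whitelisted modules to half precision; everything else
    # (e.g. losses, normalization buffers) stays in fp32.
    for module in model.modules():
        if isinstance(module, HALF_SAFE_MODULES):
            module.half()
    return model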

Contributor

What are the ops that do not support half (apart from the custom ones that are in this repo)? I thought that all ops in pytorch supported fp16 for cuda (with potentially bad accuracy)

Contributor Author

In using "support" I chose my words badly. It's not necessarily support, more "can be used with a reasonable expectation of not losing accuracy". apex.amp takes a conservative approach by not moving ops to fp16 when we're not sure of their accuracy (the lists of what is / isn't moved to fp16 are in the files here).

We could try casting the entire model to half and see what happens -- but there's enough code, in the RPN especially, whose behavior in lower precision I'm just not sure about, so I decided to be conservative beyond what apex.amp does. Unfortunately, that means that until PyTorch can grok y_16 = a_32 * x_16 + b_32 (where the subscripts denote precision), we have to do something manual here.
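[Editor's note] To make the "something manual" concrete, a sketch (assumed code, not taken from this PR) of one way to handle fp32 frozen-BatchNorm parameters against fp16 activations: compute the affine factors in fp32, then cast them to the activation dtype so the multiply-add runs in a single precision:

def frozen_bn_affine(x, weight, bias, running_mean, running_var):
    # Compute scale/shift in fp32 for accuracy...
    scale = weight * running_var.rsqrt()
    shift = bias - running_mean * scale
    # ...then match the activation dtype so y = a * x + b works when x is fp16.
    scale = scale.to(x.dtype).reshape(1, -1, 1, 1)
    shift = shift.to(x.dtype).reshape(1, -1, 1, 1)
    return x * scale + shift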

maskrcnn_benchmark/layers/batch_norm.py (outdated review thread, resolved)
@@ -116,6 +116,6 @@ def forward(self, x, boxes):
         for level, (per_level_feature, pooler) in enumerate(zip(x, self.poolers)):
             idx_in_level = torch.nonzero(levels == level).squeeze(1)
             rois_per_level = rois[idx_in_level]
-            result[idx_in_level] = pooler(per_level_feature, rois_per_level)
+            result[idx_in_level] = pooler(per_level_feature, rois_per_level).to(dtype)
Contributor

Doesn't amp.float_function cast the values back to fp16 after they are computed?

Contributor Author

I'd need to run it again to work out exactly what case was happening here, but the type change was not happening correctly and I had to cast manually here to prevent errors.

Contributor

Quick question: is this casting still relevant?

Contributor Author

Yes: result[idx_in_level] is expected to be in fp16 (as it's the same precision as the input), but the pooler returns fp32 (explicitly, as the code hasn't had fp16 support added). To get around this, the result from the pooler needs to be cast.

Contributor

But I thought that amp.float_function would:
1 - cast to float
2 - compute
3 - cast back to fp16

Or is my understanding of it wrong?

Contributor

Ok.
I think the solution I'd potentially go for myself (while fp16 support is not present in the core pooler functions) is to just cast to float on the C++ side and cast back if the input type is fp16.
But I suppose this is not really a hard requirement here (it would just make things cleaner).

Contributor Author

Would you like a comment added explaining the current need for the cast (along with a TODO for full fp16 support in pooling, if you so desire)?

Contributor

I just find it very unintuitive why you had to add this casting only here, and not after NMS as well. :-/

Contributor Author

Because this is a strange case -- you're in fp16-land, you allocate your output as fp16, then run something that has to cast up to fp32. There's no automatic cast back (you're calling from inside a module, so there's no module boundary to trigger a cast), so it has to be done manually. If lines 111-115 didn't exist and the type were inferred from the return type of the pooling call, this explicit cast wouldn't have to be there.
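[Editor's note] A minimal sketch of that pattern (the helper and its name are illustrative, not from this PR): amp.float_function upcasts fp16 inputs to fp32 before the wrapped call, but since this is a plain function call inside a module there is no boundary at which the fp32 result would be cast back, hence the explicit cast:

from apex import amp

def run_in_fp32(op, *args, out_dtype):
    # The wrapped op sees fp32 inputs under amp...
    fp32_op = amp.float_function(op)
    result = fp32_op(*args)
    # ...but its fp32 output has to be cast back by hand to match the
    # surrounding (possibly fp16) tensors, as with the pooler change above.
    return result.to(out_dtype)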

Contributor Author

I understand it's a little weird, but that's where the code is right now -- if you want an explicit (non-AMP) fp16 version, that can also be done, but it'll be more invasive and I can't get to it until after the new year, when I'm back.

@miguelvr (Contributor)

any updates on this?

@slayton58 (Contributor Author)

I'm rebasing right now (along with moving back to the PyTorch default DistributedDataParallel); otherwise I'm waiting on more reviews / feedback.

tools/train_net.py (outdated review thread, resolved)
@fmassa (Contributor) left a comment

This looks good to me, thanks a lot @slayton58 !

@wat3rBro will be running a few training jobs to double check that accuracy is the same, and then I'll get this PR merged.

@wat3rBro (Contributor) commented Jan 9, 2019

@slayton58 Hi, I'm collecting training stats and found that training terminates silently during the first iteration when fp16 is enabled for the ResNeXt model (both detection and mask) on V100. I checked the verbose output from apex and couldn't find any difference compared with a successful run of the same model on P100. The exit seems to happen when running:

with optimizer.scale_loss(losses) as scaled_losses:
    scaled_losses.backward()

Is this a known issue?

@slayton58 (Contributor Author)

@wat3rBro I've never run with ResNeXt (it didn't exist in the version of the code I originally developed this against). There should be no difference in behavior between ResNet and ResNeXt -- does this happen on the first iteration? Also, do other tests pass on P100 (I develop on V100)?

@wat3rBro (Contributor)

@slayton58 it happens during the first iteration. Could you verify if it works for e2e_mask_rcnn_X_101_32x8d_FPN_1x.yaml? On P100 I have R-50-C4/FPN running successfully without accuracy loss.

Adds fp16 support via apex.amp
Also switches communication to apex.DistributedDataParallel
Added support to tools/test_net.py
SOLVER.MIXED_PRECISION -> DTYPE \in {float32, float16}
apex.amp not installed now raises ImportError
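[Editor's note] Based on the commit descriptions above, the switch moved from SOLVER.MIXED_PRECISION to a DTYPE option; a sketch of how such a flag plausibly maps onto amp (assumed wiring, shown for orientation only -- cfg, model and optimizer come from the surrounding training script):

from apex import amp

# maskrcnn_benchmark/config/defaults.py (sketch): _C.DTYPE = "float32",
# with "float16" enabling mixed-precision training.

# tools/train_net.py (sketch):
use_mixed_precision = cfg.DTYPE == "float16"
amp_opt_level = "O1" if use_mixed_precision else "O0"
model, optimizer = amp.initialize(model, optimizer, opt_level=amp_opt_level)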
@slayton58 (Contributor Author)

@wat3rBro I was having an issue with my branch due to some of the pre-trained model URLs having changed (this is the relevant commit) -- I rebased (and pushed) against latest master and I can start training on a single V100 now.

@wat3rBro (Contributor)

@slayton58 Does it run properly on your machine at least for the first few iterations?

@slayton58 (Contributor Author)

Yes. Are you still having issues?

2019-01-10 14:11:12,900 maskrcnn_benchmark.trainer INFO: Start training
2019-01-10 14:11:25,448 maskrcnn_benchmark.trainer INFO: eta: 1 day, 7:21:44  iter: 20  loss: 2.5158 (3.3709)  loss_classifier: 0.5149 (1.5699)  loss_box_reg: 0.0209 (0.0326)  loss_mask: 0.8690 (1.0426)  loss_objectness: 0.6652 (0.6536)  loss_rpn_box_reg: 0.0419 (0.0723)  time: 0.4033 (0.6273)  data: 0.0040 (0.1115)  lr: 0.000448  max mem: 6004
INFO:maskrcnn_benchmark.trainer:eta: 1 day, 7:21:44  iter: 20  loss: 2.5158 (3.3709)  loss_classifier: 0.5149 (1.5699)  loss_box_reg: 0.0209 (0.0326)  loss_mask: 0.8690 (1.0426)  loss_objectness: 0.6652 (0.6536)  loss_rpn_box_reg: 0.0419 (0.0723)  time: 0.4033 (0.6273)  data: 0.0040 (0.1115)  lr: 0.000448  max mem: 6004
2019-01-10 14:11:33,937 maskrcnn_benchmark.trainer INFO: eta: 1 day, 2:17:19  iter: 40  loss: 1.5658 (2.6424)  loss_classifier: 0.2581 (1.0473)  loss_box_reg: 0.0278 (0.0597)  loss_mask: 0.7494 (0.9107)  loss_objectness: 0.3690 (0.5331)  loss_rpn_box_reg: 0.0280 (0.0915)  time: 0.4112 (0.5259)  data: 0.0046 (0.0584)  lr: 0.000482  max mem: 6124
INFO:maskrcnn_benchmark.trainer:eta: 1 day, 2:17:19  iter: 40  loss: 1.5658 (2.6424)  loss_classifier: 0.2581 (1.0473)  loss_box_reg: 0.0278 (0.0597)  loss_mask: 0.7494 (0.9107)  loss_objectness: 0.3690 (0.5331)  loss_rpn_box_reg: 0.0280 (0.0915)  time: 0.4112 (0.5259)  data: 0.0046 (0.0584)  lr: 0.000482  max mem: 6124

@fmassa (Contributor) commented Jan 11, 2019

One thing to keep in mind: @wat3rBro what version of CUDNN are you using that gives the problem?

@wat3rBro (Contributor)

@fmassa I'm still on CuDNN 7.1.2 because of our infra.

@slayton58 (Contributor Author)

@wat3rBro Any progress on your side? If we want to track down more details on exactly where the failure is happening on your end, you could try running with CUDA_LAUNCH_BLOCKING=1 and see where the error is reported (without that env variable, PyTorch's normal async error checking applies and the reported error location can be really misleading).

@wat3rBro (Contributor)

@slayton58 that env variable doesn't seem to give any error report either. Anyway, we'll have 7.4 available in several days; hopefully this issue just goes away.

@wat3rBro (Contributor)

@slayton58 it does work with 7.4.2. On 7.1.4 I got Floating point exception (core dumped). Are you going to fix the old version?

@slayton58 (Contributor Author)

@wat3rBro could you please run with the following and send me the resulting cudnn.log file -- I can use it to file an internal bug (or email it to me at slayton (at) nvidia (dot) com if it's too big):

export CUDA_LAUNCH_BLOCKING=1
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=./cudnn.log
<normal single GPU run command here>

@slayton58 (Contributor Author)

@wat3rBro Any progress?

@wat3rBro (Contributor) commented Jan 18, 2019

Hey @slayton58, I've sent you the log via email, sorry if I didn't notify you.

@slayton58 (Contributor Author)

@wat3rBro Hmm, I haven't seen anything come through my email - what address did you send it from?

@wat3rBro (Contributor)

Maybe it's blocked; I just sent it again from my personal email.

@ClimbsRocks (Contributor)

Curious, what's the current state of this PR? Some of the X-152 models I'm training with are huge and slow. I've seen notable improvements from training with NVIDIA's half-precision code and PyTorch in the past.

@zimenglan-sysu-512 (Contributor)

hi @slayton58
curious, what's the current state of this PR?

@LeviViana (Contributor)

I've run some tests on 4 x 2080Ti and didn't see any improvement in training time. In my tests, I compared _C.DTYPE = "float16" vs _C.DTYPE = "float32". I tested:

  • e2e_faster_rcnn_R_50_FPN_1x.yaml for 500 iterations
  • e2e_mask_rcnn_X_101_32x8d_FPN_1x.yaml for 500 iterations

@slayton58 what could be going wrong in these tests? Could you provide some benchmarks on the improvements you've achieved?

@slayton58 (Contributor Author)

@zimenglan-sysu-512 I'm still waiting on a merge. The PR should still be up-to-date and compatible with recent refactors of apex.amp.

@LeviViana Speedup depends on the network and the per-GPU batch size -- e2e_mask_rcnn_R_50_FPN_1x.yaml is the config most tested, with per-GPU batch sizes of 1, 2 and 4. Based on the numbers from the model zoo and my iterations pasted above, the speedup on this config is ~0.45/0.4, i.e. about 12.5%. ResNeXt backbones (the second config you mention) don't seem to benefit as much from fp16, and I've never actually run Faster R-CNN, so I can't comment there.

@LeviViana (Contributor)

Thanks for your quick answer. Indeed, my previous tests used a 1 img/GPU setup. I re-ran some tests on the e2e_faster_rcnn_R_50_FPN_1x.yaml config and got roughly:

  • 10% speedup with 2 imgs/gpu
  • 20% speedup with 3 imgs/gpu

@zimenglan-sysu-512 (Contributor)

Thanks @slayton58.
Hi @fmassa, do you still plan to merge the PR?

@fmassa merged commit 08fcf12 into facebookresearch:master on Apr 19, 2019
@fmassa (Contributor) commented Apr 19, 2019

Sorry for the delay in merging, and thanks @slayton58 !

@zimenglan-sysu-512 (Contributor) commented Apr 22, 2019

Hi @slayton58,
I have a question: will apex change the type of the labels in targets during the forward pass in training, e.g. from torch.int64 to torch.float32?

@slayton58 (Contributor Author)

@zimenglan-sysu-512 No, labels will stay as torch.int64

@obendidi (Contributor) commented Jun 3, 2019

I don't know if this was taken into account, but apex throws an error when trying to run inference on a CPU-only machine.

What I installed:

RUN conda install pytorch-cpu=1.1.0 torchvision-cpu=0.3.0 -c pytorch

# install apex
RUN git clone https://github.com/NVIDIA/apex.git /apex
RUN cd /apex && git checkout 14e34f7f89967dcbe5876b8bf416e311dd90b9dd && python setup.py install --cpp_ext

# install PyTorch maskrcnn-benchmark
RUN git clone https://github.com/Sterblue/maskrcnn-benchmark.git /maskrcnn-benchmark \
    && cd /maskrcnn-benchmark \
    && git checkout 5c41f1225208c3cd22b9c4734fa1d89f3f4de592 \
    && python setup.py build develop

Error:

  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/__init__.py", line 2, in <module>
    from .detectors import build_detection_model
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/detectors.py", line 2, in <module>
    from .generalized_rcnn import GeneralizedRCNN
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 11, in <module>
    from ..backbone import build_backbone
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/backbone/__init__.py", line 2, in <module>
    from .backbone import build_backbone
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/backbone/backbone.py", line 7, in <module>
    from maskrcnn_benchmark.modeling.make_layers import conv_with_kaiming_uniform
  File "/maskrcnn-benchmark/maskrcnn_benchmark/modeling/make_layers.py", line 10, in <module>
    from maskrcnn_benchmark.layers import Conv2d
  File "/maskrcnn-benchmark/maskrcnn_benchmark/layers/__init__.py", line 10, in <module>
    from .nms import nms
  File "/maskrcnn-benchmark/maskrcnn_benchmark/layers/nms.py", line 5, in <module>
    from apex import amp
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/__init__.py", line 2, in <module>
    from . import amp
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/__init__.py", line 1, in <module>
    from .amp import init, half_function, float_function, promote_function,\
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/amp.py", line 3, in <module>
    from .lists import functional_overrides, torch_overrides, tensor_overrides
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/lists/torch_overrides.py", line 69, in <module>
    if utils.get_cuda_version() >= (9, 1, 0):
  File "/miniconda/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/utils.py", line 9, in get_cuda_version
    return tuple(int(x) for x in torch.version.cuda.split('.'))
AttributeError: 'NoneType' object has no attribute 'split'
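[Editor's note] One possible workaround, sketched here as an assumption rather than a fix from this thread: guard the apex import in the layers that use it (the traceback points at maskrcnn_benchmark/layers/nms.py), so CPU-only installs fall back to the raw extension op:

# maskrcnn_benchmark/layers/nms.py (sketch)
from maskrcnn_benchmark import _C

try:
    from apex import amp
    # Under amp, force the custom op to run in fp32.
    nms = amp.float_function(_C.nms)
except (ImportError, AttributeError):
    # apex is missing, or fails to import on a CPU-only torch build
    # (torch.version.cuda is None, as in the traceback above).
    nms = _C.nms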

Lyears pushed a commit to Lyears/maskrcnn-benchmark that referenced this pull request Jun 28, 2020
* Initial multi-precision training

Adds fp16 support via apex.amp
Also switches communication to apex.DistributedDataParallel

* Add Apex install to dockerfile

* Fixes from @fmassa review

Added support to tools/test_net.py
SOLVER.MIXED_PRECISION -> DTYPE \in {float32, float16}
apex.amp not installed now raises ImportError

* Remove extraneous apex DDP import

* Move to new amp API