MXNet 1.6.0 performance regression #16845

Closed
JonTanS opened this issue Nov 18, 2019 · 29 comments

@JonTanS
Contributor

JonTanS commented Nov 18, 2019

Description

I want to report a performance regression in MXNet 1.6.x compared to the previous release, MXNet 1.5.x.

p2.16xlarge

| MXnet Version | Model | Num GPU | avg_speed (sec/epoch) | min_speed (sec/epoch) | max_speed (sec/epoch) | avg_gpu_memory (MiB) | min_gpu_memory (MiB) | max_gpu_memory (MiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mxnet1.5 | resnet101_v1 | num_gpu_1 | 279.82 | 271.97 | 285.07 | 1269.93 | 220.00 | 10037.00 |
| mxnet1.6_x | resnet101_v1 | num_gpu_1 | 316.53 | 312.23 | 329.48 | 1261.46 | 1106.00 | 1465.00 |
| Percentage difference 1.6.x/1.5 | | | 1.131176404 | 1.148008905 | 1.155794134 | 0.993325863 | 5.027272727 | 0.145959948 |
| mxnet1.5 | resnet152_v1 | num_gpu_1 | 386.45 | 365.09 | 392.36 | 1351.67 | 1341.00 | 1356.00 |
| mxnet1.6_x | resnet152_v1 | num_gpu_1 | 433.14 | 418.71 | 462.85 | 1412.06 | 1392.00 | 1421.00 |
| Percentage difference 1.6.x/1.5 | | | 1.120815624 | 1.146855987 | 1.179660748 | 1.044674962 | 1.03803132 | 1.047935103 |

p3.16xlarge

| MXnet Version | Model | Num GPU | avg_speed (sec/epoch) | min_speed (sec/epoch) | max_speed (sec/epoch) | avg_gpu_memory (MiB) | min_gpu_memory (MiB) | max_gpu_memory (MiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mxnet1.5 | resnet101_v1 | num_gpu_1 | 194.04 | 164.38 | 201.67 | 1737.57 | 1734.00 | 1740.00 |
| mxnet1.6_x | resnet101_v1 | num_gpu_1 | 200.38 | 189.29 | 225.85 | 1988.92 | 1824.00 | 2136.00 |
| Percentage difference 1.6.x/1.5 | | | 1.032671936 | 1.151536731 | 1.119874587 | 1.144658131 | 1.051903114 | 1.227586207 |
| mxnet1.5 | resnet152_v1 | num_gpu_1 | 275.77 | 221.22 | 282.92 | 2016.00 | 2016.00 | 2016.00 |
| mxnet1.6_x | resnet152_v1 | num_gpu_1 | 337.50 | 317.80 | 348.30 | 2134.69 | 2130.00 | 2136.00 |
| Percentage difference 1.6.x/1.5 | | | 1.223845898 | 1.436586371 | 1.231101633 | 1.058871723 | 1.056547619 | 1.05952381 |

To Reproduce

  1. Create a new conda environment for each version.
     - To install 1.5: `pip install mxnet-cu101mkl`
     - To install 1.6.x:
       - Check out https://github.com/apeforest/mxnet-build-script
       - Change the version to v1.6.x in the Dockerfile
       - Follow the instructions to launch the container and build mxnet-cu100
       - Copy the pip wheel out of the container and `pip install` it
  2. Here are the links for the scripts:
     - Gluon Model Zoo site for CIFAR-10
     - Link for the CIFAR-10 training script
  3. Launch the scripts in the respective conda env (with the matching MXNet version installed):
     - Launch nvidia-smi to log GPU memory usage:
       `nvidia-smi --query-gpu=index,memory.used --format=csv -l 30 -f <file_location>`
     - Launch model training:
       `python train_cifar10.py --num-gpus 1 --model resnet152_v1 --num-epochs 40`
       `python train_cifar10.py --num-gpus 1 --model resnet101_v1 --num-epochs 40`

The script prints the time taken per epoch.
I used this regex to match and extract the per-epoch time, skipping the first 3 epochs as warmup:
`r'.*\[Epoch ([0-9]*).*\].* time: ([0-9]*\.?[0-9]*)'`
I then took the average, min, and max values.
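
(For reference, a minimal sketch of the post-processing described above; the log and CSV file names are hypothetical placeholders:)

```python
import re

EPOCH_RE = re.compile(r'.*\[Epoch ([0-9]*).*\].* time: ([0-9]*\.?[0-9]*)')
WARMUP_EPOCHS = 3

# Collect per-epoch times from the training log, skipping warmup epochs.
times = []
with open('train_resnet101_v1.log') as f:           # hypothetical log file
    for line in f:
        m = EPOCH_RE.match(line)
        if m and int(m.group(1)) >= WARMUP_EPOCHS:
            times.append(float(m.group(2)))
print('epoch time (s): avg=%.2f min=%.2f max=%.2f'
      % (sum(times) / len(times), min(times), max(times)))

# Collect GPU memory samples from the nvidia-smi CSV written with -f.
mem = []
with open('gpu_mem.csv') as f:                       # hypothetical CSV file
    for line in f:
        parts = line.strip().split(',')
        if len(parts) < 2 or 'memory.used' in line:
            continue                                 # skip header rows
        mem.append(float(parts[1].strip().split()[0]))  # "1234 MiB" -> 1234.0
print('gpu memory (MiB): avg=%.2f min=%.2f max=%.2f'
      % (sum(mem) / len(mem), min(mem), max(mem)))
```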

@pengzhao-intel
Contributor

Thanks for the validation. Have you had a chance to try the CPU?

@JonTanS
Contributor Author

JonTanS commented Nov 18, 2019

Hey @pengzhao-intel, I tested training on GPU and inference on CPU. For inference, the times came very close to each other so I did not see any noticeable regression there!

@pengzhao-intel
Contributor

pengzhao-intel commented Nov 18, 2019

> Hey @pengzhao-intel, I tested training on GPU and inference on CPU. For inference, the times came very close to each other so I did not see any noticeable regression there!

Thanks a lot for your efforts. It's great to hear this.
Feel free to ping me if there is anything our team can help with :)

@chinakook
Contributor

chinakook commented Nov 18, 2019

Use this script to build your own MXNet: https://github.com/apache/incubator-mxnet/blob/master/tools/staticbuild/build.sh. It will download all dependencies and build them.
If you have a very new Linux distro, e.g. Ubuntu 19.04/19.10 or Arch Linux, you can simply build with the Makefile and make/config.mk, because the system already has recent enough dependencies to get full MXNet performance.
PS: In my tests, the OpenCV dependency has a significant effect on training with small datasets such as MNIST and CIFAR, so you can build the latest OpenCV to improve performance.

@KellenSunderland
Contributor

KellenSunderland commented Nov 18, 2019

Seems like a significant drop. The p3 resnet152_v1 test should be fairly easy to reproduce. Has anyone else had a chance to verify this regression?

Edit: I notice you're comparing a CUDA 10.1 binary to a CUDA 10.0 one. Have you tried compiling for 10.1?

@ptrendx
Member

ptrendx commented Nov 18, 2019

I will look into this today and see if I can repro it.

ptrendx self-assigned this Nov 18, 2019
@JonTanS
Contributor Author

JonTanS commented Nov 18, 2019

@KellenSunderland So I noticed that the instances I was running on have CUDA 10.0, so I assumed the runtime would default to that version even though mxnet-cu101 was installed. I will rerun the tests with all the instances on cu100, because I'm having issues building cu101mkl.

@ptrendx
Member

ptrendx commented Nov 19, 2019

OK, so I looked into it and I can kind of see 1.6 being slower, but on the other hand this script is really not a great way to test GPU training performance. Because the kernels are tiny, the run time is actually dominated by gaps in execution while the CPU is trying to launch the kernels (and the command line you gave does not even use hybridization to offset this; enabling hybridization improves performance by ~2x).

Looking at the GPU kernel times I do not see any real difference, so the slowdown is most probably due to an increase in the time spent launching the ops.
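
(For context, hybridization just means calling `hybridize()` on the Gluon network before the training loop; a minimal sketch, with the model, batch size, and optimizer settings chosen only for illustration:)

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)
net = vision.resnet152_v1(classes=10)          # illustrative choice
net.initialize(mx.init.Xavier(), ctx=ctx)
# Compile the imperative graph into a static one; static_alloc/static_shape
# further reduce the per-iteration launch overhead.
net.hybridize(static_alloc=True, static_shape=True)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

# One synthetic CIFAR-10-sized training step.
data = mx.nd.random.uniform(shape=(128, 3, 32, 32), ctx=ctx)
label = mx.nd.random.randint(0, 10, shape=(128,), ctx=ctx).astype('float32')

with autograd.record():
    out = net(data)
    loss = loss_fn(out, label)
loss.backward()
trainer.step(128)
mx.nd.waitall()
```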

ptrendx removed their assignment Nov 19, 2019
@ptrendx
Member

ptrendx commented Nov 19, 2019

@apeforest Could somebody run a bisection to find when this was introduced?

@apeforest
Contributor

@jonatan1626 is currently running the experiments again on multiple machines. Earlier he was running them on the same instance, and we suspect there may have been some performance interference between runs.

@sxjscience
Member

Is the performance still worse if we turn on hybridization?

@apeforest
Contributor

@jonatan1626 Could you please update your latest performance comparison result here?

@JonTanS
Contributor Author

JonTanS commented Nov 20, 2019

Here are the results. @ptrendx was correct that the CIFAR-10 dataset is too small for GPU testing. It doesn't look like there is a regression between the two versions!

Running ImageNet training on p3.16xlarge GPUs with mxnet-cu100:

| MXnet Version | Model | Num GPU | Average Seconds per Epoch | Minimum Seconds per Epoch | Maximum Seconds per Epoch | Average Samples per Second | Minimum Samples per Second | Maximum Samples per Second | Average GPU Memory (MiB) | Minimum GPU Memory (MiB) | Maximum GPU Memory (MiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mxnet1.5_gpu | resnet101_v1 | 1.00 | 6445.86 | 6440.19 | 6451.08 | 201.58 | 198.70 | 203.86 | 14059.24 | 12420.00 | 14666.00 |
| mxnet1.6_gpu | resnet101_v1 | 1.00 | 6432.60 | 6430.46 | 6436.61 | 201.99 | 199.27 | 204.95 | 14525.09 | 12730.00 | 14804.00 |
| mxnet1.6_gpu_LT | resnet101_v1 | 1.00 | 6534.50 | 6527.72 | 6541.22 | 198.85 | 196.91 | 202.19 | 14160.36 | 12722.00 | 14658.00 |
| Percentage Difference 1.6/1.5 | | | 0.997943272 | 0.998490298 | 0.9977572 | 1.002056148 | 1.002849742 | 1.005366461 | 1.033134716 | 1.024959742 | 1.009409519 |
| Percentage Difference 1.6_LT/1.5 | | | 1.013752014 | 1.013591585 | 1.013973939 | 0.986460326 | 0.990967978 | 0.991784602 | 1.00719251 | 1.02431562 | 0.999454521 |
| Percentage Difference 1.6_LT/1.6 | | | 1.015841324 | 1.01512412 | 1.016253192 | 0.98443618 | 0.988152 | 0.986490638 | 0.974889813 | 0.999371563 | 0.990137801 |
| mxnet1.5_gpu | resnet152_v1 | 1.00 | 17546.07 | 17516.64 | 17575.51 | 73.56 | 71.99 | 76.28 | 11500.12 | 5540.00 | 15296.00 |
| mxnet1.6_gpu | resnet152_v1 | 1.00 | 17625.32 | 17623.83 | 17626.81 | 73.23 | 72.09 | 75.62 | 11587.57 | 5664.00 | 15320.00 |
| mxnet1.6_gpu_LT | resnet152_v1 | 1.00 | 17858.62 | 17858.62 | 17858.62 | 72.31 | 71.03 | 74.36 | 11669.76 | 5200.00 | 15308.00 |
| Percentage Difference 1.6/1.5 | | | 1.00451625 | 1.006119057 | 1.002918812 | 0.995580622 | 1.001427955 | 0.991343963 | 1.007604007 | 1.022382671 | 1.001569038 |
| Percentage Difference 1.6_LT/1.5 | | | 1.017812638 | 1.01952295 | 1.016108056 | 0.982986521 | 0.986669205 | 0.974823228 | 1.01475107 | 0.938628159 | 1.000784519 |
| Percentage Difference 1.6_LT/1.6 | | | 1.013236608 | 1.013322372 | 1.013150858 | 0.987349994 | 0.985262295 | 0.983335013 | 1.007093127 | 0.918079096 | 0.99921671 |
| mxnet1.5_gpu | resnet50_v1 | 1.00 | 3789.90 | 3786.47 | 3792.26 | 342.92 | 340.36 | 347.85 | 11081.41 | 10426.00 | 11536.00 |
| mxnet1.6_gpu | resnet50_v1 | 1.00 | 3790.38 | 3785.99 | 3794.03 | 342.85 | 337.95 | 346.17 | 11719.89 | 11142.00 | 12302.00 |
| mxnet1.6_gpu_LT | resnet50_v1 | 1.00 | 3772.54 | 3771.12 | 3774.58 | 344.49 | 341.66 | 348.85 | 11798.29 | 11048.00 | 12624.00 |
| Percentage Difference 1.6/1.5 | | | 1.000127854 | 0.99987373 | 1.000466181 | 0.999800748 | 0.992947517 | 0.995166781 | 1.057617053 | 1.068674468 | 1.066400832 |
| Percentage Difference 1.6_LT/1.5 | | | 0.995421661 | 0.995945003 | 0.995339585 | 1.004569816 | 1.003821662 | 1.002854867 | 1.064691864 | 1.059658546 | 1.094313454 |
| Percentage Difference 1.6_LT/1.6 | | | 0.995294409 | 0.996070777 | 0.994875792 | 1.004770018 | 1.010951379 | 1.007725425 | 1.006689389 | 0.991563454 | 1.026174606 |

@JonTanS
Contributor Author

JonTanS commented Nov 20, 2019

Running CIFAR-10 training on c5.18xlarge CPUs with mxnet-cu100 (note: I just set the number of GPUs to 0 for the script):

| MXnet Version | Model | Num GPU | Average Seconds per Epoch | Minimum Seconds per Epoch | Maximum Seconds per Epoch |
| --- | --- | --- | --- | --- | --- |
| mxnet1.5_cpu | resnet101_v1 | 0.00 | 2079.77 | 2061.01 | 2087.01 |
| mxnet1.6_cpu | resnet101_v1 | 0.00 | 1337.37 | 1311.35 | 1366.60 |
| mxnet1.6_cpu_LT | resnet101_v1 | 0.00 | 1887.85 | 1787.26 | 2033.10 |
| Percentage Difference 1.6/1.5 | | | 0.643038001 | 0.636263268 | 0.6548127 |
| Percentage Difference 1.6_LT/1.5 | | | 0.907721519 | 0.86717887 | 0.974171041 |
| Percentage Difference 1.6_LT/1.6 | | | 1.411614114 | 1.362924615 | 1.487709449 |
| mxnet1.5_cpu | resnet152_v1 | 0.00 | 2735.39 | 2726.98 | 2742.44 |
| mxnet1.6_cpu | resnet152_v1 | 0.00 | 2037.03 | 1972.81 | 2101.91 |
| mxnet1.6_cpu_LT | resnet152_v1 | 0.00 | 2752.12 | 2523.36 | 2865.49 |
| Percentage Difference 1.6/1.5 | | | 0.744695601 | 0.723441483 | 0.766436962 |
| Percentage Difference 1.6_LT/1.5 | | | 1.006116551 | 0.925330274 | 1.044867998 |
| Percentage Difference 1.6_LT/1.6 | | | 1.351044036 | 1.279067202 | 1.363279762 |
| mxnet1.5_cpu | resnet50_v1 | 0.00 | 1239.23 | 1234.64 | 1244.87 |
| mxnet1.6_cpu | resnet50_v1 | 0.00 | 688.89 | 671.58 | 704.92 |
| mxnet1.6_cpu_LT | resnet50_v1 | 0.00 | 831.51 | 756.42 | 887.07 |
| Percentage Difference 1.6/1.5 | | | 0.555901358 | 0.543949174 | 0.566261789 |
| Percentage Difference 1.6_LT/1.5 | | | 0.670990108 | 0.612663114 | 0.712583375 |
| Percentage Difference 1.6_LT/1.6 | | | 1.207030886 | 1.126324193 | 1.258399187 |

@ptrendx
Member

ptrendx commented Nov 20, 2019

Hi @jonatan1626, out of curiosity, what does the "_LT" stand for (e.g. in "mxnet1.6_gpu_LT")?

@JonTanS
Contributor Author

JonTanS commented Nov 20, 2019

@ptrendx LT means Large Tensor.
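
(In case it helps: the large-tensor builds differ by the INT64_TENSOR_SIZE compile flag, which can be checked at runtime; a minimal sketch, assuming an MXNet 1.5/1.6 install:)

```python
import mxnet as mx
from mxnet.runtime import Features

features = Features()
# INT64_TENSOR_SIZE is the compile-time flag behind the "large tensor" builds.
print('MXNet version:', mx.__version__)
print('Large tensor support:', features.is_enabled('INT64_TENSOR_SIZE'))
```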

@apeforest
Contributor

@jonatan1626 Thanks for the update. Can you also share the results for a few other models? If there is no performance regression in the 1.6 release, I think we can close this issue.

@sxjscience
Member

sxjscience commented Nov 20, 2019

I see that the results have changed. Were they obtained by running the same script you provided in the first comment?

@JonTanS
Contributor Author

JonTanS commented Nov 20, 2019

@sxjscience Apologies, I forgot to mention that I am now using this script for ImageNet:
train_imagenet.py
The examples can be found on this Gluon Site.

@JonTanS
Contributor Author

JonTanS commented Nov 20, 2019

@apeforest These are the results for the other model runs. They were done using cu101-mkl on p3.16xlarge machines with CUDA 10.0. The runs were executed sequentially, so the interference issue you mentioned might be a reason why the 1.6.x numbers are slightly slower. I will rerun these tests to revalidate.

| MXnet Version | Model | Num GPU | avg_speed (sec/epoch) | min_speed (sec/epoch) | max_speed (sec/epoch) | avg_samples_sec | min_samples_sec | max_samples_sec | avg_gpu_memory (MiB) | min_gpu_memory (MiB) | max_gpu_memory (MiB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mxnet1.5.x | ssd | num_gpu_1 | 241.39 | 234.14 | 244.12 | 75.49 | 28.25 | 94.23 | 15193.45 | 15160.00 | 15196.00 |
| mxnet1.6.x | ssd | num_gpu_1 | 249.57 | 238.78 | 255.18 | 76.83 | 29.36 | 95.04 | 13844.95 | 13808.00 | 13848.00 |
| mxnet1.6LT | ssd | num_gpu_1 | 355.82 | 352.32 | 360.77 | 59.88 | 17.37 | 93.01 | 15042.49 | 12806.00 | 15294.00 |
| 1.6 / 1.5 | | | 1.033896368 | 1.019808492 | 1.045318433 | 1.017670852 | 1.0392976 | 1.008542564 | 0.911244649 | 0.910817942 | 0.911292445 |
| 1.6LT/1.5 | | | 1.474059094 | 1.504719358 | 1.477852833 | 0.793207208 | 0.61484812 | 0.98696861 | 0.990063649 | 0.844722955 | 1.006449066 |
| 1.6LT/1.6 | | | 1.425731958 | 1.475492085 | 1.413782428 | 0.779433946 | 0.591599673 | 0.978608782 | 1.086495981 | 0.927433372 | 1.104419411 |
| mxnet1.5.x | word_language_model | num_gpu_1 | 134.82 | 132.15 | 137.45 | 16414.67 | 16106.89 | 16721.45 | 9280.24 | 3380.00 | 15304.00 |
| mxnet1.6.x | word_language_model | num_gpu_1 | 140.49 | 137.76 | 143.56 | 16070.68 | 15730.44 | 16425.65 | 9554.93 | 3492.00 | 15306.00 |
| mxnet1.6LT | word_language_model | num_gpu_1 | 135.71 | 133.23 | 139.13 | 16283.34 | 15848.56 | 16625.67 | 9703.43 | 3474.00 | 15320.00 |
| 1.6 / 1.5 | | | 1.042042802 | 1.042451759 | 1.044452528 | 0.979043711 | 0.976628014 | 0.982310147 | 1.029599291 | 1.033136095 | 1.000130685 |
| 1.6LT/1.5 | | | 1.006583913 | 1.008172531 | 1.012222626 | 0.991999123 | 0.983961522 | 0.994272028 | 1.045601522 | 1.027810651 | 1.001045478 |
| 1.6LT/1.6 | | | 0.965971754 | 0.967116725 | 0.969141822 | 1.013232721 | 1.007509008 | 1.012177296 | 1.015542193 | 0.994845361 | 1.000914674 |
| mxnet1.5.x | yolo3 | num_gpu_1 | 441.47 | 415.49 | 460.74 | 42.23 | 22.58 | 74.28 | 10324.93 | 2466.00 | 15322.00 |
| mxnet1.6.x | yolo3 | num_gpu_1 | 454.92 | 425.47 | 474.86 | 42.91 | 20.59 | 76.15 | 10849.40 | 2208.00 | 15316.00 |
| mxnet1.6LT | yolo3 | num_gpu_1 | 492.57 | 477.62 | 506.79 | 39.76 | 10.33 | 72.85 | 10895.04 | 2704.00 | 15322.00 |
| 1.6 / 1.5 | | | 1.030481648 | 1.024031982 | 1.030652996 | 1.01601748 | 0.911651388 | 1.025134964 | 1.050796059 | 0.895377129 | 0.999608406 |
| 1.6LT/1.5 | | | 1.11576745 | 1.149537411 | 1.099944003 | 0.941377794 | 0.45764138 | 0.980734797 | 1.055216859 | 1.096512571 | 1 |
| 1.6LT/1.6 | | | 1.082763048 | 1.122560068 | 1.0672302 | 0.926537006 | 0.501991645 | 0.956688467 | 1.004207097 | 1.224637681 | 1.000391747 |

@pengzhao-intel
Contributor

pengzhao-intel commented Nov 21, 2019

> Running CIFAR-10 training on c5.18xlarge CPUs with mxnet-cu100 (note: I just set the number of GPUs to 0 for the script): [results table above]

Could you run the CPU benchmark with mxnet-mkl or mxnet-cuXXmkl?

@apeforest
Contributor

@jonatan1626 Thanks for the detailed report. This looks great. Please run mxnet-mkl for the CPU performance test as @pengzhao-intel suggested. I guess we don't need to report mxnet1.6_LT since it's not an official release.

It would be great if you could put your run scripts together with the logs in a repo and share them here so we can reproduce or track them later on.

Thanks.

Lin

@pengzhao-intel
Contributor

> It would be great if you could put your run scripts together with the logs in a repo and share them here so we can reproduce or track them later on.

I remember we have a plan to make a dashboard to track the performance :)
Is this still on the table?

@JonTanS
Contributor Author

JonTanS commented Nov 21, 2019

@pengzhao-intel The runs just finished; there was an error when running resnet50_v1, so I have restarted that job and will post the results when it is done! It does look like there is a regression between the mkl versions.

@apeforest Let me compile and organize the data first, then I'll put it in a repo. I am also figuring out how to push the data to CloudWatch so we have a dashboard to track performance!

| MXnet Version | Model | Num GPU | Average Seconds per Epoch | Minimum Seconds per Epoch | Maximum Seconds per Epoch |
| --- | --- | --- | --- | --- | --- |
| mxnet1.5_cu101mkl | resnet101_v1 | num_gpu_0 | 410.73 | 404.96 | 420.19 |
| mxnet1.6_cu101mkl | resnet101_v1 | num_gpu_0 | 476.15 | 457.81 | 493.18 |
| Percentage difference 1.6/1.5 | | | 1.159284458 | 1.130490989 | 1.173697115 |
| mxnet1.5_cu101mkl | resnet152_v1 | num_gpu_0 | 606.61 | 599.58 | 618.81 |
| mxnet1.6_cu101mkl | resnet152_v1 | num_gpu_0 | 753.11 | 714.77 | 810.51 |
| Percentage difference 1.6/1.5 | | | 1.241503107 | 1.192118973 | 1.309792228 |
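
(For the CloudWatch idea above, a minimal sketch of pushing one benchmark datapoint with boto3; the region, namespace, metric, and dimension names here are made up for illustration:)

```python
import boto3

# Hypothetical region/namespace/dimension names; one datapoint per metric so
# CloudWatch can chart the benchmark over time.
cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')

def push_epoch_time(mxnet_version, model, seconds_per_epoch):
    cloudwatch.put_metric_data(
        Namespace='MXNetBenchmarks',
        MetricData=[{
            'MetricName': 'AvgSecondsPerEpoch',
            'Dimensions': [
                {'Name': 'MXNetVersion', 'Value': mxnet_version},
                {'Name': 'Model', 'Value': model},
            ],
            'Value': seconds_per_epoch,
            'Unit': 'Seconds',
        }],
    )

push_epoch_time('mxnet1.6_cu101mkl', 'resnet101_v1', 476.15)
```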

@JonTanS
Contributor Author

JonTanS commented Nov 21, 2019

I have also uploaded the scripts to: Here. Do let me know if there is anything wrong with how I'm running this!

@pengzhao-intel
Contributor

@rongzha1 please try to run the script and verify the CPU performance.

@pengzhao-intel
Contributor

cc @TaoLv

@rongzha1
Contributor

There seems to be no regression on the MKL-DNN CPU platform.

- Test platform: SKX-8180
- MXNet versions: v1.5.x and v1.6.x
- v1.5.x build cmd: `make -j USE_MKLDNN=1 USE_BLAS=mkl USE_GPERFTOOLS=0`
- v1.6.x build cmd: `make -j USE_MKLDNN=1 USE_BLAS=mkl USE_GPERFTOOLS=0 USE_INTEL_PATH=/opt/intel`
- Running scripts: the ones @jonatan1626 prepared above

| MXnet Version | Model | Num GPU | avg_speed (sec/epoch) | min_speed (sec/epoch) | max_speed (sec/epoch) |
| --- | --- | --- | --- | --- | --- |
| mxnet_v1.5.x | resnet101_v1 | num_gpu_0 | 421.02 | 417.34 | 426.82 |
| mxnet_v1.6.x | resnet101_v1 | num_gpu_0 | 436.60 | 431.00 | 441.72 |
| Percentage difference 1.6.x/1.5.x | | | 1.036998128 | 1.032732535 | 1.03490117 |
| mxnet_v1.5.x | resnet152_v1 | num_gpu_0 | 638.70 | 633.71 | 644.65 |
| mxnet_v1.6.x | resnet152_v1 | num_gpu_0 | 648.94 | 626.71 | 656.89 |
| Percentage difference 1.6.x/1.5.x | | | 1.01604219 | 0.988963644 | 1.018989986 |
| mxnet_v1.5.x | resnet50_v1 | num_gpu_0 | 231.60 | 228.20 | 234.31 |
| mxnet_v1.6.x | resnet50_v1 | num_gpu_0 | 238.22 | 236.86 | 239.74 |
| Percentage difference 1.6.x/1.5.x | | | 1.028615052 | 1.037968653 | 1.023160919 |

@ptrendx
Member

ptrendx commented Nov 25, 2019

What is the status of this issue? Based on the results gathered by @rongzha1, it seems we can close it?

JonTanS closed this as completed Nov 25, 2019