[Discussion] Unified performance tests and dashboard #15757
We can't use the CI system for performance measurements since it does not provide a consistent environment for various reasons (efficiency, maintainability, etc). Thus, we need a separate system that has the sole purpose of being entirely consistent. Also, I'm afraid that using tests to also measure performance could be misleading since tests might get extended or altered. I'd propose to have dedicated benchmarks instead. |
+1 It's a nice proposal that can save a lot of maintenance effort across different organizations with a single, unified dashboard, and it also makes it very easy to track performance regressions. Actually, there are many tasks to complete before achieving this goal. @juliusshufan can share some of our local experience first, and then we can go into the details of this proposal, including SW, HW, database, metrics, etc. |
Thanks @PatricZhao - This requires both hardware and software setup. Let us start small with whatever is available and incrementally expand it. Looking forward to more learnings from your experience. |
@ptrendx - Any inputs on the performance related tests / benchmarks / CI you maintain that can be upstreamed here? |
We can certainly push some of our benchmarks to that common repo, although I'm not sure how to handle the differences between our container version of MXNet and upstream. As for the performance testing insights - having a dedicated machine is important (so probably p3.16xlarge instance) as other tenants may skew the results, especially for the cases that are more CPU or IO intensive. |
An update on benchmarking and accuracy testing from the Intel side. Currently, we track the performance, accuracy, and convergence of the MXNet GitHub repo nightly, covering different models and MXNet operators. Kernel-level performance is also measured with each MKLDNN upgrade. Performance is measured on Xeon platforms, covering both the "top-bin" and "main-stream" SKUs. The scripts involve internal code and also leverage the public MXNet examples. The performance report is normally compared and presented by,
The detailed HW spec we use for performance tracking is in the table below; we use CentOS 7.5 on bare-metal machines with the HW spec below.
To reflect real production scenarios, the benchmark measurements are executed with different socket/core/instance configurations in the SW setup we use for performance tracking. Other details we can discuss offline. Thanks. |
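Running one benchmark instance per socket with its threads pinned to that socket's cores, as described above, is usually done through OpenMP environment variables. Below is a minimal sketch of that launch pattern; the script name `benchmark_script.py`, the core counts, and the affinity settings are illustrative assumptions, not Intel's actual harness.

```python
import os
import subprocess

CORES_PER_INSTANCE = 28   # assumption: physical cores per socket
NUM_INSTANCES = 2         # assumption: one benchmark instance per socket

def build_env(instance_id, cores_per_instance):
    """Build the environment for one benchmark instance pinned to a
    contiguous range of physical cores."""
    start = instance_id * cores_per_instance
    end = start + cores_per_instance - 1
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(cores_per_instance)
    # Keep OpenMP threads on their assigned cores (Intel and GNU runtimes).
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    env["GOMP_CPU_AFFINITY"] = f"{start}-{end}"
    return env

if __name__ == "__main__":
    # Launch all instances in parallel, then wait for them to finish.
    procs = [
        subprocess.Popen(
            ["python", "benchmark_script.py"],  # hypothetical entry point
            env=build_env(i, CORES_PER_INSTANCE),
        )
        for i in range(NUM_INSTANCES)
    ]
    for p in procs:
        p.wait()
```

Throughput-oriented runs typically use one instance spanning a whole socket, while latency-oriented runs use many small instances with a few cores each.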
@juliusshufan Thanks for providing the benchmark setup. Recently we have been running operator-level runtime comparisons between int32 and int64 data types for tensor indexing, using the MXNet OpPerf profiler contributed by @sandeep-krishnamurthy et al. However, we noticed large variations when we calibrate the runtime using the built-in profiler in MXNet, and also a mis-correlation with the runtime we measured directly using Python's time module. @ChaiBapchya can provide more detailed performance results. We need a universal way to calibrate runtime in order to track performance results. Any advice will be appreciated. |
Here are the links for the Large Tensor Operator benchmarks I ran, comparing Python's time module against the MXNet profiler (built-in C++ profiler): https://docs.google.com/spreadsheets/d/1VkZoBFacZo8NGNcdFU5P9gFs3dm7D_ykOkPUzUD-Yu4/edit?usp=sharing Tested on a p3.16xl instance |
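A common source of the variation discussed above is timing a single run and reporting the mean, which is sensitive to outliers. A percentile-based harness built on `time.perf_counter` is one way to calibrate consistently; this is a generic sketch with a stand-in workload, not the actual OpPerf code. A real MXNet run would also need to synchronize the async engine (e.g. `mx.nd.waitall()`) before stopping the timer, which is one likely cause of the mis-correlation with the built-in profiler.

```python
import time
import statistics

def time_op(fn, warmup=10, runs=100):
    """Time a callable and report percentile statistics in milliseconds.

    Reporting p50 instead of the mean keeps one super-large glitch run
    from skewing the headline number."""
    for _ in range(warmup):        # warm up caches / lazy initialization
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "mean": statistics.mean(samples),
    }

# Stand-in CPU workload for illustration; substitute an operator call here.
result = time_op(lambda: sum(i * i for i in range(10000)))
```

Comparing `p50` against `mean` across repeated invocations is a quick sanity check for whether outlier runs are distorting the averages.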
Thanks @apeforest @ChaiBapchya, we are testing the large tensor operators now. Will come back with the results soon |
@pengzhao-intel There was some mistake in the earlier results due to CPU sharing. Chai has re-run the profiling and collected the updated results here: https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing Please check the three sheets: Shape (1024, 1024), Shape (10000, 1) and Shape (10000, 100), corresponding to three different input shapes. The runtime numbers are the 50th percentile out of 100 runs. There are comparisons between int64/int32 and int64mkl/int32mkl. Please feel free to ping @ChaiBapchya or me should you have any questions. Thanks! |
Erm, why are we running CPU-only benchmarks on a p3.16xlarge?
|
@marcoabreu You are right. We should be more frugal :) @ChaiBapchya c5.x18 might be sufficient. |
It's not necessarily only about frugality, but also the c5.18xlarge contains different processors than the p3.16xlarge as far as I know. So the results don't really reflect reality - but I also don't think that they will make a big difference. But in the future we should let apples stay apples and pears be pears :)
|
The instance choice didn't occur to me; apologies for that. @marcoabreu Thanks for bringing it to our notice. Having said that, I wanted clarification on what you meant by "apples stay apples". Thanks |
We have just collected the performance numbers of some operators (like FullyConnected, softmax, etc.) with the MKL-DNN implementation. We also compared the results between MKL-DNN v0.20 and v1.0. Currently one local CLX-8280 with 28 physical cores is used to run the benchmarks. Later we may switch to an AWS EC2 C5 instance. https://docs.google.com/spreadsheets/d/10rhQEzDqnCjSKq27QlT04qNHegmAZjOoVqT_q287_ZU/edit?usp=sharing |
It doesn't reflect reality insofar as users would not run a CPU-only build on a p3.16xlarge but on a c5 instead. Right, they were run on the same instance, but I'm not sure (Intel, please confirm) whether the CPUs in a c5 might perform differently. In general I would doubt it and say that the relative results are still relevant, just not accurate.
I don't think it would make sense, to be honest. A user looks at throughput/$ (or latency, or whatever metric they optimize for). CPU instances are way cheaper but might underperform in a direct comparison. But if you normalize these results with the costs, you will get a picture that's much closer to the reality of how a real user will use MXNet. In the end, we're optimizing for real use cases, so we should make the benchmarks and environment as close to reality as possible.
Correct, that's what I meant :)
I didn't check in detail, and sorry if my proposal introduces too much complexity, but what do you think about considering not just the performance of one sequential execution, but instead measuring the performance a fully utilized system is capable of handling (think of a service)? Like high batch size with one process (throughput optimized) vs. batch size one with many processes (latency optimized). |
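The cost normalization suggested above is a simple division of throughput by instance price. A minimal sketch follows; the hourly prices are illustrative approximations of on-demand rates (they vary by region and over time), and the throughput figures are made up for demonstration.

```python
# Assumption: approximate on-demand hourly prices, for illustration only.
INSTANCE_PRICE_PER_HOUR = {
    "c5.18xlarge": 3.06,
    "p3.16xlarge": 24.48,
}

def throughput_per_dollar(samples_per_sec, instance):
    """Samples processed per dollar of instance time (higher is better)."""
    price = INSTANCE_PRICE_PER_HOUR[instance]
    return samples_per_sec * 3600.0 / price

# Hypothetical raw throughputs: the GPU instance is 5x faster in absolute
# terms, yet the CPU instance can still win once cost is factored in.
cpu = throughput_per_dollar(400.0, "c5.18xlarge")
gpu = throughput_per_dollar(2000.0, "p3.16xlarge")
```

The same normalization applies to a latency metric (e.g. requests served per dollar at a fixed latency target), which matches the throughput-optimized vs. latency-optimized split discussed above.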
Hi @wuxun-zhang Thanks for running the test and sharing the data. Are the performance numbers generated from your in-house profiling tool at Intel? We also noticed that using the average can sometimes be misleading due to glitches (one super-large number). We used the p50 number in the table instead. |
@apeforest I used Chai's large tensor benchmark scripts with the latest MXNet master. So I think the data should be averages, not p50 numbers. Later I will update the data using the p50 metric to ensure consistency with your data. |
@wuxun-zhang For p50, p90 and p99 numbers, I have PR #15953. Once that's merged, you will be able to get those numbers using Python's time module. With the profiler flag, you can choose between Python or Native. |
Hi @ChaiBapchya, are there any updates for this large tensor benchmark script? I tried to run the script at this commit and got the error below. It looks like the error is caused by incomplete input arguments (missing
|
Yes. This error is probably caused by an incorrect file being used; it was previously used for testing on my branch. But now with the latest master, a few pointers -
For the current master branch, let me know if that helps. |
Problem Statement
Proposal
This is a topic open for discussion. Please do comment with your suggestions/feedback.
CC: @apeforest @ChaiBapchya @access2rohit @samskalicky @PatricZhao @TaoLv @ptrendx @marcoabreu