[Discussion] Unified performance tests and dashboard #15757
We can't use the CI system for performance measurements since it does not provide a consistent environment for various reasons (efficiency, maintainability, etc). Thus, we need a separate system that has the sole purpose of being entirely consistent. Also, I'm afraid that using tests to also measure performance could be misleading since tests might get extended or altered. I'd propose to have dedicated benchmarks instead. |
+1 It's a nice proposal that can save a lot of maintenance effort across different organizations with a single, unified dashboard, and it also makes it very easy to track performance regressions. Actually, there are many tasks to complete before achieving this goal. @juliusshufan can share some of our local experience first, and then we can go into the details of this proposal, including SW, HW, database, metrics, etc. |
Thanks @PatricZhao - This requires both hardware and software setup. Let us start small with whatever is available and incrementally expand it. Looking forward to more learnings from your experience. |
@ptrendx - Any inputs on the performance related tests / benchmarks / CI you maintain that can be upstreamed here? |
We can certainly push some of our benchmarks to that common repo, although I'm not sure how to handle the differences between our container version of MXNet and upstream. As for the performance testing insights - having a dedicated machine is important (so probably p3.16xlarge instance) as other tenants may skew the results, especially for the cases that are more CPU or IO intensive. |
An update on benchmarking and accuracy testing from the Intel side. Currently, we track the performance, accuracy, and convergence of the MXNet GitHub repo nightly, covering different models and MXNet operators. Kernel-level performance is also measured with each MKLDNN upgrade. Performance is measured on Xeon platforms, covering both the "top-bin" and "main-stream" SKUs. The scripts involve internal code and also leverage the public MXNet examples. The performance report is normally compared and presented by,
The detailed HW spec we use for performance tracking is in the table below; we use CentOS 7.5 on bare-metal machines with the HW spec below.
To reflect real production scenarios, the benchmark measurements are executed with different socket/core/instance configurations in the SW setup we use for performance tracking. Other details we can discuss offline. Thanks. |
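Running one benchmark instance per socket with its threads pinned to that socket's cores, as described above, is usually done through OpenMP environment variables. Below is a minimal sketch of that launch pattern; the script name `benchmark_script.py`, the core counts, and the affinity settings are illustrative assumptions, not Intel's actual harness.

```python
import os
import subprocess

CORES_PER_INSTANCE = 28   # assumption: physical cores per socket
NUM_INSTANCES = 2         # assumption: one benchmark instance per socket

def build_env(instance_id, cores_per_instance):
    """Build the environment for one benchmark instance pinned to a
    contiguous range of physical cores."""
    start = instance_id * cores_per_instance
    end = start + cores_per_instance - 1
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(cores_per_instance)
    # Keep OpenMP threads on their assigned cores (Intel and GNU runtimes).
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    env["GOMP_CPU_AFFINITY"] = f"{start}-{end}"
    return env

if __name__ == "__main__":
    # Launch all instances in parallel, then wait for them to finish.
    procs = [
        subprocess.Popen(
            ["python", "benchmark_script.py"],  # hypothetical entry point
            env=build_env(i, CORES_PER_INSTANCE),
        )
        for i in range(NUM_INSTANCES)
    ]
    for p in procs:
        p.wait()
```

Throughput-oriented runs typically use one instance spanning a whole socket, while latency-oriented runs use many small instances with a few cores each.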
@juliusshufan Thanks for providing the benchmark setup. Recently we have been running operator-level runtime comparisons between int32 and int64 data types for tensor indexing, using the MXNet OpPerf profiler contributed by @sandeep-krishnamurthy et al. However, we noticed large variations when we calibrate the runtime using the built-in profiler in MXNet, and also a mis-correlation with the runtime we measured directly using Python's time module. @ChaiBapchya can provide more detailed performance results. We need a universal way to calibrate runtime in order to track performance results. Any advice will be appreciated. |
Here are the links for the Large Tensor Operator benchmarks I ran, comparing Python's time module against the MXNet profiler (built-in C++ profiler): https://docs.google.com/spreadsheets/d/1VkZoBFacZo8NGNcdFU5P9gFs3dm7D_ykOkPUzUD-Yu4/edit?usp=sharing Tested on a p3.16xl instance |
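A common source of the variation discussed above is timing a single run and reporting the mean, which is sensitive to outliers. A percentile-based harness built on `time.perf_counter` is one way to calibrate consistently; this is a generic sketch with a stand-in workload, not the actual OpPerf code. A real MXNet run would also need to synchronize the async engine (e.g. `mx.nd.waitall()`) before stopping the timer, which is one likely cause of the mis-correlation with the built-in profiler.

```python
import time
import statistics

def time_op(fn, warmup=10, runs=100):
    """Time a callable and report percentile statistics in milliseconds.

    Reporting p50 instead of the mean keeps one super-large glitch run
    from skewing the headline number."""
    for _ in range(warmup):        # warm up caches / lazy initialization
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "mean": statistics.mean(samples),
    }

# Stand-in CPU workload for illustration; substitute an operator call here.
result = time_op(lambda: sum(i * i for i in range(10000)))
```

Comparing `p50` against `mean` across repeated invocations is a quick sanity check for whether outlier runs are distorting the averages.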
Thanks @apeforest @ChaiBapchya, we are testing the large tensor operators now. Will come back with the results soon |
@pengzhao-intel There was some mistake in the earlier results due to CPU sharing. Chai has re-run the profiling and collected the updated results here: https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing Please check the three sheets: Shape (1024, 1024), Shape (10000, 1) and Shape (10000, 100), corresponding to three different input shapes. The runtime numbers are the 50th percentile out of 100 runs. There are comparisons between int64/int32 and int64mkl/int32mkl. Please feel free to ping @ChaiBapchya or me should you have any questions. Thanks! |
Erm, why are we running CPU-only benchmarks on a p3.16xlarge?
|
@marcoabreu You are right. We should be more frugal :) @ChaiBapchya c5.x18 might be sufficient. |
It's not necessarily only about frugality, but also the c5.18xlarge contains different processors than the p3.16xlarge as far as I know. So the results don't really reflect reality - but I also don't think that they will make a big difference. But in the future we should let apples stay apples and pears be pears :)
|
The instance choice didn't occur to me; apologies for that. @marcoabreu Thanks for bringing it to our notice. Having said that, I wanted clarification on what you meant by "apples stay apples". Thanks |
We have just collected the performance numbers of some operators (like FullyConnected, softmax, etc.) with the MKL-DNN implementation. We also compared the results between MKL-DNN v0.20 and v1.0. Currently one local CLX-8280 with 28 physical cores is used to run the benchmarks. Later we may switch to an AWS EC2 C5 instance. https://docs.google.com/spreadsheets/d/10rhQEzDqnCjSKq27QlT04qNHegmAZjOoVqT_q287_ZU/edit?usp=sharing |
It doesn't reflect reality insofar as users would not run a CPU-only build on a p3.16xlarge but on a c5 instead. Right, they were run on the same instance, but I'm not sure (Intel, please confirm) whether the CPUs in a c5 might perform differently. In general I would doubt it and say that the relative results are still relevant, just not accurate.
I don't think it would make sense, to be honest. A user looks at throughput/$ (or latency, or whatever metric they optimize for). CPU instances are way cheaper but might underperform in a direct comparison. But if you normalize these results with the costs, you will get a picture that's much closer to the reality of how a real user will use MXNet. In the end, we're optimizing for real use cases, so we should make the benchmarks and environment as close to reality as possible.
Correct, that's what I meant :)
I didn't check in detail, and sorry if my proposal introduces too much complexity, but what do you think about considering not just the performance of one sequential execution, but instead measuring the performance a fully utilized system is capable of handling (think of a service)? Like high batch size with one process (throughput optimized) vs. batch size one with many processes (latency optimized). |
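The cost normalization suggested above is a simple division of throughput by instance price. A minimal sketch follows; the hourly prices are illustrative approximations of on-demand rates (they vary by region and over time), and the throughput figures are made up for demonstration.

```python
# Assumption: approximate on-demand hourly prices, for illustration only.
INSTANCE_PRICE_PER_HOUR = {
    "c5.18xlarge": 3.06,
    "p3.16xlarge": 24.48,
}

def throughput_per_dollar(samples_per_sec, instance):
    """Samples processed per dollar of instance time (higher is better)."""
    price = INSTANCE_PRICE_PER_HOUR[instance]
    return samples_per_sec * 3600.0 / price

# Hypothetical raw throughputs: the GPU instance is 5x faster in absolute
# terms, yet the CPU instance can still win once cost is factored in.
cpu = throughput_per_dollar(400.0, "c5.18xlarge")
gpu = throughput_per_dollar(2000.0, "p3.16xlarge")
```

The same normalization applies to a latency metric (e.g. requests served per dollar at a fixed latency target), which matches the throughput-optimized vs. latency-optimized split discussed above.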
Hi @wuxun-zhang Thanks for running the test and sharing the data. Are the performance numbers generated from your in-house profiling tool at Intel? We also noticed that using the average can sometimes be misleading due to glitches (one super-large number). We used the p50 number in the table instead. |
@apeforest I used Chai's large tensor benchmark scripts with the latest MXNet master. So I think the data should be averages, not p50 numbers. Later I will update the data using the p50 metric to ensure consistency with your data. |
@wuxun-zhang For p50, p90 and p99 numbers, I have PR #15953. Once that's merged, you will be able to get those numbers using Python's time module. With the profiler flag, you can choose between Python or Native. |
Hi @ChaiBapchya, are there any updates for this large tensor benchmark script? I tried to run the script at this commit and got the error below. It looks like the error is caused by incomplete input arguments (missing
|
Yes. This error is probably caused by an incorrect file being used; it was previously used for testing on my branch. But now with the latest master, a few pointers -
For the current master branch, let me know if that helps. |
Problem Statement
Proposal
This is a topic open for discussion. Please do comment with your suggestions/feedback.
CC: @apeforest @ChaiBapchya @access2rohit @samskalicky @PatricZhao @TaoLv @ptrendx @marcoabreu