Add Lightning GBenchmark Suite #249
Conversation
Hello. You may have forgotten to update the changelog!
Codecov Report
@@           Coverage Diff            @@
##           master      #249   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            4         4
  Lines          366       366
=========================================
  Hits           366       366
Continue to review full report at Codecov.
Nice work @maliasadi
Nothing major to add here; we can just wait for the 0.23 tag before merging.
A few quick comments below too, but happy to go with whatever you think.
@@ -369,7 +369,7 @@ inline auto matrixVecProd(const std::vector<std::complex<T>> mat,
  * @param n1 Index of the first column.
  * @param n2 Index of the last column.
  */
-template <class T, size_t BLOCKSIZE = 32> // NOLINT(readability-magic-numbers)
+template <class T, size_t BLOCKSIZE = 16> // NOLINT(readability-magic-numbers)
Does 16 offer better performance for this?
Yes indeed! This is the result of running the following command:
$ python3 compare.py filters ./BuildGBench/benchmarks/utils "cf_transpose_cmplx<double, 16>" "cf_transpose_cmplx<double, 32>"
Running ./BuildGBench/benchmarks/utils
Run on (8 X 3877.22 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 1280 KiB (x4)
L3 Unified 12288 KiB (x1)
Load Average: 3.70, 2.74, 1.44
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
cf_transpose_cmplx<double, 16>/32 769 ns 769 ns 905299
cf_transpose_cmplx<double, 16>/64 3867 ns 3867 ns 181881
cf_transpose_cmplx<double, 16>/128 18741 ns 18741 ns 43778
cf_transpose_cmplx<double, 16>/256 223272 ns 223266 ns 3133
cf_transpose_cmplx<double, 16>/512 1028820 ns 1028753 ns 682
cf_transpose_cmplx<double, 16>/1024 5229414 ns 5229264 ns 129
cf_transpose_cmplx<double, 16>/2048 40673714 ns 40666706 ns 17
cf_transpose_cmplx<double, 16>/4096 165500143 ns 165467574 ns 4
cf_transpose_cmplx<double, 16>/8192 626944729 ns 626880717 ns 1
RUNNING: ./BuildGBench/benchmarks/utils --benchmark_filter=cf_transpose_cmplx<double, 32> --benchmark_out=/tmp/tmp_x_2ganq
2022-03-15T01:40:42-04:00
Running ./BuildGBench/benchmarks/utils
Run on (8 X 2341.99 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 1280 KiB (x4)
L3 Unified 12288 KiB (x1)
Load Average: 3.10, 2.66, 1.43
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
cf_transpose_cmplx<double, 32>/32 1067 ns 1067 ns 670864
cf_transpose_cmplx<double, 32>/64 5940 ns 5940 ns 106912
cf_transpose_cmplx<double, 32>/128 23607 ns 23607 ns 29433
cf_transpose_cmplx<double, 32>/256 226348 ns 226345 ns 2899
cf_transpose_cmplx<double, 32>/512 999435 ns 999460 ns 668
cf_transpose_cmplx<double, 32>/1024 5251783 ns 5250924 ns 128
cf_transpose_cmplx<double, 32>/2048 39253467 ns 39204690 ns 17
cf_transpose_cmplx<double, 32>/4096 169543968 ns 169549855 ns 4
cf_transpose_cmplx<double, 32>/8192 639742739 ns 639680299 ns 1
Comparing cf_transpose_cmplx<double, 16> to cf_transpose_cmplx<double, 32> (from ./BuildGBench/benchmarks/utils)
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/32 +0.3870 +0.3870 769 1067 769 1067
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/64 +0.5363 +0.5363 3867 5940 3867 5940
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/128 +0.2596 +0.2597 18741 23607 18741 23607
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/256 +0.0138 +0.0138 223272 226348 223266 226345
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/512 -0.0286 -0.0285 1028820 999435 1028753 999460
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/1024 +0.0043 +0.0041 5229414 5251783 5229264 5250924
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/2048 -0.0349 -0.0360 40673714 39253467 40666706 39204690
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/4096 +0.0244 +0.0247 165500143 169543968 165467574 169549855
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/8192 +0.0204 +0.0204 626944729 639742739 626880717 639680299
OVERALL_GEOMEAN +0.1156 +0.1155 0 0 0 0
Interesting. Would the same gain be seen on different CPUs, do you think?
We should probably run this on multiple types of processors to see how this works (I can try a Ryzen tomorrow to see how that fares).
This is a good idea. FYI, BLOCKSIZE = 2^n performs the transposition over 2^n x 2^n submatrices of the original matrix, and the gain from this blocking technique comes from the cache size and the number of cache misses. On my machine, with the cache info below, transposing submatrices of 2^8 elements (16 x 16) is more cache-friendly than transposing submatrices of 2^10 elements (32 x 32):
CPU Caches:
L1 Data 48 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 1280 KiB (x4)
L3 Unified 12288 KiB (x1)
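For illustration, here is a minimal sketch of the blocking idea; it is not the code changed in this PR, and the function name, signature, and row-major layout are assumptions.

```cpp
// Hypothetical sketch of a cache-blocked transpose: the n x n matrix is walked
// in BLOCKSIZE x BLOCKSIZE tiles so that each tile of the source and the
// destination stays resident in cache while it is being transposed.
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

template <class T, std::size_t BLOCKSIZE = 16>
void blockedTranspose(const std::vector<std::complex<T>> &in,
                      std::vector<std::complex<T>> &out, std::size_t n) {
    for (std::size_t bi = 0; bi < n; bi += BLOCKSIZE) {
        for (std::size_t bj = 0; bj < n; bj += BLOCKSIZE) {
            const std::size_t imax = std::min(bi + BLOCKSIZE, n);
            const std::size_t jmax = std::min(bj + BLOCKSIZE, n);
            for (std::size_t i = bi; i < imax; i++) {
                for (std::size_t j = bj; j < jmax; j++) {
                    out[j * n + i] = in[i * n + j]; // row-major indexing
                }
            }
        }
    }
}
```

With std::complex<double> elements, a 16 x 16 tile occupies 4 KiB, so a source tile and a destination tile together fit comfortably in the 48 KiB L1 data cache listed above, whereas a 32 x 32 tile already needs 16 KiB on its own.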
Hi @maliasadi, thanks for adding benchmarks for gate operations! As I was also adding benchmarks for generators and matrix operations (part of #245), I may open a subsequent PR after changing that code to use google-benchmark. Some suggestions are also listed below:
Really awesome, Ali. I think I can import this directly into a benchmark website. I'm guessing this might actually work with lightning-gpu too, right?
A few more comments.
Co-authored-by: Chae-Yeun Park <chae-yeun@xanadu.ai>
Nothing more to add from my side. Thanks @maliasadi
Thank you @chaeyeunpark for reviewing this PR. I added benchmarks for gate operations and leave cleaning …
Thank you @trevor-vincent! Yes, it should work with LightningGPU too. I believe there will be more GB scripts benchmarking different methods, kernels, and devices 🚀
Looks great to me! Thanks for the nice work again @maliasadi. As you mentioned, I will make a subsequent PR on benchmarking all gates/generators/matrix operations. Hopefully, we can benchmark all different kernels with a single command.
@@ -55,6 +55,49 @@ jobs:
       check_name: Test Report (C++) on Ubuntu
       files: Build/tests/results/report.xml

+  cpptestswithblas:
👍
Lightning GBenchmark Suite
This PR adds the PennyLane-Lightning benchmark suite powered by google-benchmark (GB). To use the GB scripts, one can run
make gbenchmark
The main requirement for these scripts is google-benchmark; we use the CMake FetchContent command to install the library if the find_package command fails to find GB.
Implementation details
Check the README file.
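As a rough illustration of how a benchmark in such a suite can be structured, the sketch below shows how a templated benchmark like cf_transpose_cmplx could be registered with google-benchmark so that entries like cf_transpose_cmplx<double, 16>/256 appear in the output. This is an assumption for illustration, not the suite's actual source; the transpose call itself is left as a placeholder.

```cpp
// Illustrative sketch only: registers a benchmark templated over BLOCKSIZE and
// sweeps the matrix dimension from 32 to 8192, producing benchmark names of the
// form "cf_transpose_cmplx<double, 16>/256".
#include <benchmark/benchmark.h>
#include <complex>
#include <cstddef>
#include <vector>

template <class T, std::size_t BLOCKSIZE>
static void cf_transpose_cmplx(benchmark::State &state) {
    const auto n = static_cast<std::size_t>(state.range(0));
    std::vector<std::complex<T>> mat(n * n, std::complex<T>{1, 0});
    std::vector<std::complex<T>> out(n * n);
    for (auto _ : state) {
        // The blocked transpose under test would be invoked here, e.g.
        // blockedTranspose<T, BLOCKSIZE>(mat, out, n);
        benchmark::DoNotOptimize(out.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK_TEMPLATE(cf_transpose_cmplx, double, 16)->RangeMultiplier(2)->Range(32, 8192);
BENCHMARK_TEMPLATE(cf_transpose_cmplx, double, 32)->RangeMultiplier(2)->Range(32, 8192);
BENCHMARK_MAIN();
```

Running the resulting binary with --benchmark_filter then yields per-size timings like those shown earlier in the conversation, and google-benchmark's compare.py script can be used to compare the two template instantiations.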