Add Lightning GBenchmark Suite #249

Merged
merged 46 commits into from
Mar 17, 2022

Conversation

maliasadi
Member

@maliasadi maliasadi commented Mar 8, 2022

Lightning GBenchmark Suite

This PR adds the PennyLane-Lightning benchmark suite powered by google-benchmark (GB). To use the GB scripts, run make gbenchmark, or run:

$ cmake pennylane_lightning/src/ -BBuildGBench -DBUILD_BENCHMARKS=ON -DENABLE_OPENMP=ON -DENABLE_BLAS=ON -DCMAKE_BUILD_TYPE=Release
$ cmake --build ./BuildGBench --target utils apply_operations apply_multirz

The main requirement for these scripts is google-benchmark. We use the CMake FetchContent command to install the library if the find_package command fails to find GB.

Implementation details

See the README file for implementation details.

@github-actions
Contributor

github-actions bot commented Mar 8, 2022

Hello. You may have forgotten to update the changelog!
Please edit .github/CHANGELOG.md with:

  • A one-to-two sentence description of the change. You may include a small working example for new features.
  • A link back to this PR.
  • Your name (or GitHub username) in the contributors section.

@codecov

codecov bot commented Mar 8, 2022

Codecov Report

Merging #249 (0a566ee) into master (27bc5f5) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #249   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          366       366           
=========================================
  Hits           366       366           
Impacted Files Coverage Δ
pennylane_lightning/_version.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 27bc5f5...0a566ee.

@github-actions
Contributor

github-actions bot commented Mar 9, 2022

Test Report (C++) on Ubuntu

       1 files  ±0         1 suites  ±0   0s ⏱️ ±0s
   555 tests ±0     555 ✔️ ±0  0 💤 ±0  0 ±0 
2 289 runs  ±0  2 289 ✔️ ±0  0 💤 ±0  0 ±0 

Results for commit 0a566ee. ± Comparison against base commit 27bc5f5.

♻️ This comment has been updated with latest results.

Member

@mlxd mlxd left a comment

Nice work @maliasadi
Nothing major to add here; we should just wait for the 0.23 tag before merging this.

A few quick comments too, but happy to go with whatever you think.

@@ -369,7 +369,7 @@ inline auto matrixVecProd(const std::vector<std::complex<T>> mat,
* @param n1 Index of the first column.
* @param n2 Index of the last column.
*/
-template <class T, size_t BLOCKSIZE = 32> // NOLINT(readability-magic-numbers)
+template <class T, size_t BLOCKSIZE = 16> // NOLINT(readability-magic-numbers)
Member

Does 16 offer better performance for this?

Member Author

Yes indeed! This is the result of running the following command:

$ python3 compare.py filters ./BuildGBench/benchmarks/utils "cf_transpose_cmplx<double, 16>" "cf_transpose_cmplx<double, 32>"
Running ./BuildGBench/benchmarks/utils
Run on (8 X 3877.22 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 3.70, 2.74, 1.44
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
cf_transpose_cmplx<double, 16>/32          769 ns          769 ns       905299
cf_transpose_cmplx<double, 16>/64         3867 ns         3867 ns       181881
cf_transpose_cmplx<double, 16>/128       18741 ns        18741 ns        43778
cf_transpose_cmplx<double, 16>/256      223272 ns       223266 ns         3133
cf_transpose_cmplx<double, 16>/512     1028820 ns      1028753 ns          682
cf_transpose_cmplx<double, 16>/1024    5229414 ns      5229264 ns          129
cf_transpose_cmplx<double, 16>/2048   40673714 ns     40666706 ns           17
cf_transpose_cmplx<double, 16>/4096  165500143 ns    165467574 ns            4
cf_transpose_cmplx<double, 16>/8192  626944729 ns    626880717 ns            1
RUNNING: ./BuildGBench/benchmarks/utils --benchmark_filter=cf_transpose_cmplx<double, 32> --benchmark_out=/tmp/tmp_x_2ganq
2022-03-15T01:40:42-04:00
Running ./BuildGBench/benchmarks/utils
Run on (8 X 2341.99 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 3.10, 2.66, 1.43
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
cf_transpose_cmplx<double, 32>/32         1067 ns         1067 ns       670864
cf_transpose_cmplx<double, 32>/64         5940 ns         5940 ns       106912
cf_transpose_cmplx<double, 32>/128       23607 ns        23607 ns        29433
cf_transpose_cmplx<double, 32>/256      226348 ns       226345 ns         2899
cf_transpose_cmplx<double, 32>/512      999435 ns       999460 ns          668
cf_transpose_cmplx<double, 32>/1024    5251783 ns      5250924 ns          128
cf_transpose_cmplx<double, 32>/2048   39253467 ns     39204690 ns           17
cf_transpose_cmplx<double, 32>/4096  169543968 ns    169549855 ns            4
cf_transpose_cmplx<double, 32>/8192  639742739 ns    639680299 ns            1
Comparing cf_transpose_cmplx<double, 16> to cf_transpose_cmplx<double, 32> (from ./BuildGBench/benchmarks/utils)
Benchmark                                                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/32                  +0.3870         +0.3870           769          1067           769          1067
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/64                  +0.5363         +0.5363          3867          5940          3867          5940
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/128                 +0.2596         +0.2597         18741         23607         18741         23607
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/256                 +0.0138         +0.0138        223272        226348        223266        226345
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/512                 -0.0286         -0.0285       1028820        999435       1028753        999460
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/1024                +0.0043         +0.0041       5229414       5251783       5229264       5250924
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/2048                -0.0349         -0.0360      40673714      39253467      40666706      39204690
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/4096                +0.0244         +0.0247     165500143     169543968     165467574     169549855
[cf_transpose_cmplx<double, 16> vs. cf_transpose_cmplx<double, 32>]/8192                +0.0204         +0.0204     626944729     639742739     626880717     639680299
OVERALL_GEOMEAN                                                                         +0.1156         +0.1155             0             0             0             0
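
For readers unfamiliar with the comparison table, the Time and CPU columns appear to be relative changes, (new - old) / |old|, so a positive value means the second benchmark (BLOCKSIZE = 32) is slower than the first (BLOCKSIZE = 16). The sketch below reproduces that arithmetic under this assumption. The helper names relative_change and geomean_change are hypothetical and are not part of compare.py or google-benchmark.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: relative change of new_val with respect to old_val.
// Positive means new_val is larger (slower, for timings).
inline double relative_change(double old_val, double new_val) {
    return (new_val - old_val) / std::abs(old_val);
}

// Hypothetical helper: geometric mean of the new/old ratios, minus one.
// Assumes both vectors are non-empty, have equal length, and contain
// strictly positive timings.
inline double geomean_change(const std::vector<double> &old_vals,
                             const std::vector<double> &new_vals) {
    double log_sum = 0.0;
    for (std::size_t k = 0; k < old_vals.size(); ++k) {
        log_sum += std::log(new_vals[k] / old_vals[k]);
    }
    return std::exp(log_sum / static_cast<double>(old_vals.size())) - 1.0;
}
```

For the /32 row, relative_change(769, 1067) gives roughly 0.3875, which agrees with the reported +0.3870 up to the rounding of the printed timings. Applying geomean_change to the nine Time Old and Time New values gives roughly +0.116, consistent with the OVERALL_GEOMEAN row.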

Member

Interesting. Do you think the same gain would be seen on different CPUs?

We should probably run this on multiple types of processors to see how this works (I can try a Ryzen tomorrow to see how that fares).

Member Author

This is a good idea. FYI, with BLOCKSIZE = 2^n the transposition is performed over submatrices of size 2^n x 2^n of the original matrix, and the performance we gain from this blocking technique should depend on the cache size and the number of cache misses. On my machine, with the following cache info, transposing submatrices of 2^8 elements (BLOCKSIZE = 16) is more cache-friendly than transposing submatrices of 2^10 elements (BLOCKSIZE = 32):

CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
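
The blocking idea described above can be sketched as follows. This is a minimal illustration using a plain tiled loop, not the Lightning implementation; the function name blocked_transpose and its parameters are hypothetical, and the matrix is assumed to be stored row-major in a flat std::vector.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not the Lightning code.
// Transposes a row-major m x n matrix tile by tile, so that each
// BLOCKSIZE x BLOCKSIZE tile is read and written while it is
// (ideally) still resident in cache.
template <class T, std::size_t BLOCKSIZE = 16>
std::vector<T> blocked_transpose(const std::vector<T> &mat, std::size_t m,
                                 std::size_t n) {
    std::vector<T> out(mat.size());
    for (std::size_t i0 = 0; i0 < m; i0 += BLOCKSIZE) {
        const std::size_t i1 = std::min(i0 + BLOCKSIZE, m);
        for (std::size_t j0 = 0; j0 < n; j0 += BLOCKSIZE) {
            const std::size_t j1 = std::min(j0 + BLOCKSIZE, n);
            // Transpose one tile: element (i, j) goes to (j, i).
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    out[j * m + i] = mat[i * n + j];
                }
            }
        }
    }
    return out;
}
```

On platforms where sizeof(std::complex<double>) is 16 bytes, a 16 x 16 tile occupies 4 KiB and a 32 x 32 tile occupies 16 KiB. Counting both the source and destination tiles, that is about 8 KiB versus 32 KiB against the 48 KiB L1 data cache listed above, which may partly explain the difference at small matrix sizes.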

Contributor

@chaeyeunpark chaeyeunpark left a comment

Hi @maliasadi, thanks for adding benchmarks for gate operations! Since I was also adding benchmarks for generators and matrix operations (part of #245), I may open a follow-up PR after changing that code to use google-benchmark. Some suggestions are listed below:

Contributor

@trevor-vincent trevor-vincent left a comment

Really awesome Ali. I think I can directly import this into a benchmark website for it. I'm guessing this might actually work with lightning-gpu too right?

Member

@mlxd mlxd left a comment

A few more comments.

Member

@mlxd mlxd left a comment

Nothing more to add from my side. Thanks @maliasadi

@maliasadi
Member Author

Thank you @chaeyeunpark for reviewing this PR. I added benchmarks for gate operations and will leave cleaning up ./src/examples to you, as this is mostly your playground and I don't want to unintentionally remove anything useful there. Feel free to update ./src/benchmarks afterwards. This PR is basically the first version of the GBenchmark suite in Lightning.

@maliasadi maliasadi requested a review from chaeyeunpark March 15, 2022 19:49
@maliasadi
Member Author

> Really awesome Ali. I think I can directly import this into a benchmark website for it. I'm guessing this might actually work with lightning-gpu too right?

Thank you @trevor-vincent! Yes, it should work with LightningGPU too. I believe there will be more GB scripts benchmarking different methods, kernels and devices 🚀

Contributor

@chaeyeunpark chaeyeunpark left a comment

Looks great to me! Thanks for the nice work again @maliasadi. As you mentioned, I will make a subsequent PR on benchmarking all gates/generators/matrix operations. Hopefully, we can benchmark all different kernels with a single command.

@@ -55,6 +55,49 @@ jobs:
check_name: Test Report (C++) on Ubuntu
files: Build/tests/results/report.xml

cpptestswithblas:
Contributor

👍

@maliasadi maliasadi merged commit e563736 into master Mar 17, 2022
@maliasadi maliasadi deleted the add_gb_utils branch March 17, 2022 20:04