Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compile-time support for AVX2/512 streaming operations in LQ #664

Merged
merged 28 commits into from
Apr 25, 2024
Merged
Show file tree
Hide file tree
Changes from 26 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
5269857
Add support for compile-time generation of streaming AVX kernels
mlxd Mar 27, 2024
6bb9a37
Add streaming and tuning docs
mlxd Mar 27, 2024
24f290a
Auto update version
github-actions[bot] Mar 27, 2024
f87bdc1
Trigger CI
mlxd Mar 27, 2024
68eb3cf
Update overloads
mlxd Mar 27, 2024
278bacf
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 3, 2024
61c4073
Auto update version
github-actions[bot] Apr 3, 2024
2ad6c39
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 4, 2024
9fd7c8f
Auto update version
github-actions[bot] Apr 4, 2024
f2525e4
Trigger CI
mlxd Apr 4, 2024
4ad7ca0
Update doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
mlxd Apr 5, 2024
083544b
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 19, 2024
ae9809e
Update changelog
mlxd Apr 19, 2024
7377b2e
Auto update version
github-actions[bot] Apr 19, 2024
75e31ca
Trigger CI
mlxd Apr 19, 2024
d0aaeec
Update doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
mlxd Apr 24, 2024
2b1236e
Update doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
mlxd Apr 24, 2024
1b4129c
Auto update version from '0.36.0-dev34' to '0.36.0-dev37'
ringo-but-quantum Apr 24, 2024
e950df0
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 24, 2024
22e1982
Updates from code review
mlxd Apr 24, 2024
f3fbcc2
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 24, 2024
9d43784
Auto update version from '0.36.0-dev37' to '0.36.0-dev38'
ringo-but-quantum Apr 24, 2024
c525904
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 24, 2024
540052f
Auto update version from '0.36.0-dev38' to '0.36.0-dev39'
ringo-but-quantum Apr 24, 2024
2cf4f03
Merge branch 'master' into feature/avx_streaming_ops
mlxd Apr 25, 2024
4be3c7f
Auto update version from '0.36.0-dev40' to '0.36.0-dev41'
ringo-but-quantum Apr 25, 2024
479c287
Update doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
mlxd Apr 25, 2024
5b43dbe
Update doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
mlxd Apr 25, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion .github/CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,9 @@

### New features since last release

* Add compile-time support for AVX2/512 streaming operations in `lightning.qubit`.
[(#664)](https://github.com/PennyLaneAI/pennylane-lightning/pull/664)

* `lightning.kokkos` supports mid-circuit measurements.
[(#672)](https://github.com/PennyLaneAI/pennylane-lightning/pull/672)

Expand Down Expand Up @@ -126,7 +129,7 @@

This release contains contributions from (in alphabetical order):

Ali Asadi, Amintor Dusko, Christina Lee, Vincent Michaud-Rioux, Mudit Pandey, Shuli Shu
Ali Asadi, Amintor Dusko, Christina Lee, Vincent Michaud-Rioux, Lee James O'Riordan, Mudit Pandey, Shuli Shu

---

Expand Down
18 changes: 16 additions & 2 deletions .github/workflows/tests_linux.yml
Original file line number Diff line number Diff line change
Expand Up @@ -71,18 +71,31 @@ jobs:
-DENABLE_COVERAGE=ON \
-DLQ_ENABLE_KERNEL_OMP=ON

cmake . -BBuildKernelAVXStream -G Ninja \
mlxd marked this conversation as resolved.
Show resolved Hide resolved
-DCMAKE_BUILD_TYPE=Debug \
mlxd marked this conversation as resolved.
Show resolved Hide resolved
-DBUILD_TESTS=ON \
-DENABLE_PYTHON=OFF \
-DPL_BACKEND=${{ matrix.pl_backend }} \
-DCMAKE_CXX_COMPILER=$(which g++-$GCC_VERSION) \
-DENABLE_COVERAGE=ON \
-DLQ_ENABLE_KERNEL_AVX_STREAM=ON \
-DLQ_ENABLE_KERNEL_OMP=ON


cmake --build ./Build
cmake --build ./BuildKernelOMP
cmake --build ./BuildKernelAVXStream
mlxd marked this conversation as resolved.
Show resolved Hide resolved

for d in Build BuildKernelOMP; do
for d in Build BuildKernelOMP BuildKernelAVXStream; do
cd ./$d
mkdir -p ./tests/results
for file in *runner ; do ./$file --order lex --reporter junit --out ./tests/results/report_$file.xml; done;
lcov --directory . -b ../pennylane_lightning/core/src --capture --output-file coverage.info
lcov --remove coverage.info '/usr/*' --output-file coverage.info
cd ..
done
lcov --add-tracefile ./Build/coverage.info -a ./BuildKernelOMP/coverage.info -o coverage.info
lcov --add-tracefile ./Build/coverage.info -a ./BuildKernelOMP/coverage.info \
--add-tracefile ./BuildKernelAVXStream/coverage.info -o coverage.info
mv coverage.info coverage-${{ github.job }}-${{ matrix.pl_backend }}.info

- name: Upload test results
Expand All @@ -93,6 +106,7 @@ jobs:
path: |
./Build/tests/results/
./BuildKernelOMP/tests/results/
./BuildKernelAVXStream/tests/results/
mlxd marked this conversation as resolved.
Show resolved Hide resolved

if-no-files-found: error

Expand Down
1 change: 1 addition & 0 deletions doc/lightning_qubit/development/avx_kernels/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,3 +22,4 @@ AVX2/AVX512 kernels

implementation
build_system
kernel_tuning
13 changes: 13 additions & 0 deletions doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
Kernel performance tuning
#########################

Lightning-Qubit's kernel implementations are by default tuned for high throughput single-threaded performance with gradient workloads. To enable this, we add OpenMP threading within the adjoint differentiation method implementation and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload.

However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead. For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels.

OpenMP threaded kernels
-----------------------

To enable OpenMP acceleration of the gate kernels, Lightning-Qubit can be compiled with the `-DLQ_ENABLE_KERNEL_OMP=ON` CMake flag. Not, that for gradient workloads with many observables, this may reduce performance in comparison with the default mode, so this behaviour is opt-in only.
mlxd marked this conversation as resolved.
Show resolved Hide resolved

For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck, and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the `-DLQ_ENABLE_KERNEL_AVX_STREAMING=on` CMake flag. This forces the data to avoid updating the CPU cache and can improve performance for larger workloads.
mlxd marked this conversation as resolved.
Show resolved Hide resolved
2 changes: 1 addition & 1 deletion pennylane_lightning/core/_version.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,4 @@
Version number (major.minor.patch[-label])
"""

__version__ = "0.36.0-dev40"
__version__ = "0.36.0-dev41"
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ add_library(lightning_qubit STATIC ${LQUBIT_FILES})
option(ENABLE_BLAS "Enable BLAS" OFF)
option(ENABLE_GATE_DISPATCHER "Enable gate kernel dispatching on AVX/AVX2/AVX512" ON)
option(LQ_ENABLE_KERNEL_OMP "Enable OpenMP pragmas for gate kernels" OFF)
option(LQ_ENABLE_KERNEL_AVX_STREAMING "Enable AVX2/512 streaming operations for gate kernels" OFF)
mlxd marked this conversation as resolved.
Show resolved Hide resolved

# Inform the compiler that this device is enabled.
target_compile_options(lightning_compile_options INTERFACE "-D_ENABLE_PLQUBIT=1")
Expand Down Expand Up @@ -51,6 +52,13 @@ if(LQ_ENABLE_KERNEL_OMP)
add_definitions("-DPL_LQ_KERNEL_OMP")
endif()

if(LQ_ENABLE_KERNEL_AVX_STREAMING)
if(NOT LQ_ENABLE_KERNEL_OMP)
message(WARNING "AVX streaming operations require `LQ_ENABLE_KERNEL_OMP` to be enabled.")
endif()
add_definitions("-DPL_LQ_KERNEL_AVX_STREAMING")
endif()

target_link_libraries(lightning_qubit PUBLIC lightning_compile_options
lightning_external_libs
lightning_base
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ template <typename T> struct AVX2Concept {
}

PL_FORCE_INLINE
static void store(std::complex<PrecisionT> *p, IntrinsicType value) {
static void store_(std::complex<PrecisionT> *p, IntrinsicType value) {
if constexpr (std::is_same_v<PrecisionT, float>) {
_mm256_store_ps(reinterpret_cast<PrecisionT *>(p), value);
} else if (std::is_same_v<PrecisionT, double>) {
Expand All @@ -91,6 +91,43 @@ template <typename T> struct AVX2Concept {
}
}

PL_FORCE_INLINE
static void stream_(std::complex<PrecisionT> *p, IntrinsicType value) {
if constexpr (std::is_same_v<PrecisionT, float>) {
_mm256_stream_ps(reinterpret_cast<PrecisionT *>(p), value);
} else if (std::is_same_v<PrecisionT, double>) {
_mm256_stream_pd(reinterpret_cast<PrecisionT *>(p), value);
} else {
static_assert(std::is_same_v<PrecisionT, float> ||
std::is_same_v<PrecisionT, double>);
}
}

PL_FORCE_INLINE
static void stream_(PrecisionT *p, IntrinsicType value) {
if constexpr (std::is_same_v<PrecisionT, float>) {
_mm256_stream_ps(p, value);
} else if (std::is_same_v<PrecisionT, double>) {
_mm256_stream_pd(p, value);
} else {
static_assert(std::is_same_v<PrecisionT, float> ||
std::is_same_v<PrecisionT, double>);
}
}

PL_FORCE_INLINE
static void store(std::complex<PrecisionT> *p, IntrinsicType value) {
store(reinterpret_cast<PrecisionT *>(p), value);
}
PL_FORCE_INLINE
static void store(PrecisionT *p, IntrinsicType value) {
#ifdef PL_LQ_KERNEL_AVX_STREAMING
store_(p, value);
#else
stream_(p, value);
#endif
}

PL_FORCE_INLINE
static auto mul(IntrinsicType v0, IntrinsicType v1) {
if constexpr (std::is_same_v<PrecisionT, float>) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -81,7 +81,7 @@ template <typename T> struct AVX512Concept {
}

PL_FORCE_INLINE
static void store(std::complex<PrecisionT> *p, IntrinsicType value) {
static void store_(std::complex<PrecisionT> *p, IntrinsicType value) {
if constexpr (std::is_same_v<PrecisionT, float>) {
_mm512_store_ps(p, value);
} else if (std::is_same_v<PrecisionT, double>) {
Expand All @@ -92,6 +92,44 @@ template <typename T> struct AVX512Concept {
}
}

PL_FORCE_INLINE
static void stream_(std::complex<PrecisionT> *p, IntrinsicType value) {
if constexpr (std::is_same_v<PrecisionT, float>) {
_mm512_stream_ps(p, value);
} else if (std::is_same_v<PrecisionT, double>) {
_mm512_stream_pd(p, value);
} else {
static_assert(std::is_same_v<PrecisionT, float> ||
std::is_same_v<PrecisionT, double>);
}
}

PL_FORCE_INLINE
static void stream_(PrecisionT *p, IntrinsicType value) {
if constexpr (std::is_same_v<PrecisionT, float>) {
_mm512_stream_ps(p, value);
} else if (std::is_same_v<PrecisionT, double>) {
_mm512_stream_pd(p, value);
} else {
static_assert(std::is_same_v<PrecisionT, float> ||
std::is_same_v<PrecisionT, double>);
}
}

PL_FORCE_INLINE
static void store(std::complex<PrecisionT> *p, IntrinsicType value) {
store(reinterpret_cast<PrecisionT *>(p), value);
}

PL_FORCE_INLINE
static void store(PrecisionT *p, IntrinsicType value) {
#ifdef PL_LQ_KERNEL_AVX_STREAMING
store_(p, value);
#else
stream_(p, value);
#endif
}

PL_FORCE_INLINE
static auto mul(IntrinsicType v0, IntrinsicType v1) {
if constexpr (std::is_same_v<PrecisionT, float>) {
Expand Down
Loading