Chunk Hamiltonian, PauliSentence, LinearCombination [sc-65680] #873
Conversation
Hello. You may have forgotten to update the changelog!
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master     #873     +/-   ##
=========================================
+ Coverage   88.10%   97.40%    +9.30%
=========================================
  Files          92      222      +130
  Lines       11764    30715    +18951
=========================================
+ Hits        10365    29919    +19554
+ Misses       1399      796      -603
```

☔ View full report in Codecov by Sentry.
Thanks @vincentmr
A few things I'd like to understand first.
Also, given the questions, before considering a merge I'd like to better understand the impact of this on previously run workloads: memory use, timing/memory when using multiple GPUs, potential validation of the MPI4PY workload in the paper, and non-H workloads with many terms.
Just flying by.
CSR typically outperforms the COO sparse representation.
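For illustration (SciPy here stands in for Lightning's internal kernels, so this is an analogy rather than the code under review), the usual pattern is to build in COO and convert to CSR before repeated sparse matrix-vector products:

```python
# Hedged sketch: CSR's row-pointer layout makes repeated SpMV (the core of
# applying a sparse observable to a state vector) faster than COO's
# unordered triplet layout; COO remains convenient for construction.
import numpy as np
from scipy import sparse

dim = 2**12
mat_coo = sparse.random(dim, dim, density=1e-3, format="coo", random_state=0)
vec = np.random.default_rng(0).standard_normal(dim)

mat_csr = mat_coo.tocsr()  # one-off conversion cost
result = mat_csr @ vec     # repeated products amortize the conversion
```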
@mlxd Let's focus on L-Qubit first for simplicity, then I'll port whatever solution we end up with to L-GPU(+MPI).
> Also, given the questions, I'd like to also better understand the impact of this on previously run workloads
I think at some point we might have lacked the ability to pass an entire Hamiltonian (or LinearCombination) down to the C++ layer. A way around this is to compute the Jacobian of the individual terms (which are simpler objects, like TensorProducts) and sum them up. This is embarrassingly parallel, but it requires a lot of memory and computation. I am not sure since when, but we can now pass a LinearCombination directly to the adjoint pipeline. I think the only part that is effectively parallelizable upon splitting is `applyObservables`, but this may seldom be a bottleneck.
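For concreteness, a minimal sketch of the term-by-term workaround described above, using public PennyLane APIs; the circuit and observable here are illustrative placeholders, not this repository's code.

```python
# Hedged sketch: differentiate each Hamiltonian term separately with the
# adjoint method, then sum the per-term Jacobians weighted by coefficients.
# Each term triggers its own forward + backward pass, which is
# embarrassingly parallel but duplicates work and state-vector memory.
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("lightning.qubit", wires=2)
coeffs = [0.5, -0.3]
terms = [qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(0)]

def make_qnode(obs):
    @qml.qnode(dev, diff_method="adjoint")
    def circuit(weights):
        qml.RX(weights[0], wires=0)
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[1], wires=1)
        return qml.expval(obs)
    return circuit

weights = np.array([0.1, 0.2], requires_grad=True)
jac = sum(c * qml.jacobian(make_qnode(op))(weights) for c, op in zip(coeffs, terms))
```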
pennylane_lightning/core/src/simulators/lightning_gpu/observables/ObservablesGPU.hpp
I will re-trigger your CIs as there was some problem with Test PyPI.
Looks good. I will come back later to check the CIs, after you merge master.
Beautiful! Thanks!
Thanks @vincentmr for the nice work! Just a few questions.
pennylane_lightning/core/src/simulators/lightning_kokkos/observables/ObservablesKokkos.hpp
Great job! Thanks @vincentmr!
**Before submitting**

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory!
- [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- [x] Ensure that the test suite passes, by running `make test`.
- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change and including a link back to the PR.
- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.

------------------------------------------------------------------------------------------------------------

**Context:**

Parallelizing over observables can accelerate the backward pass of adjoint Jacobian calculations. This PR revisits our implementation for L-Qubit and L-GPU, the two devices that support it. Certain observables, such as Hamiltonian, PauliSentence, and LinearCombination, can be split into many observables, enabling the cost of the expectation value computation to be distributed. This strategy is initiated by the serializer, which partitions the observables if `split_obs` is not `False`. The serializer proceeds to a complete partitioning, meaning a 1000-PauliWord PauliSentence is partitioned into 1000 PauliWords. We note in passing that L-Qubit does not split observables, since it does not pass a `split_obs` value to `_process_jacobian_tape`. This is wasteful because we end up in one of two situations:

- The Jacobian is computed N processes (threads, devices, etc.) at a time, which results in a lot of duplicate computation (forward/backward passes are repeated and the results combined);
- The Jacobian is parallelized over all observables, each of which requires a state-vector copy, which increases the memory requirements by as much.

We explore chunking instead of full partitioning for LinearCombination-like objects, meaning a 1000-PauliWord PauliSentence is partitioned into four 250-PauliWord PauliSentences if we parallelize over 4 processes. (A sketch of this strategy follows the description below.)

**Description of the Change:**

- Modify the serializer to chunk LinearCombination-like objects if `self.split_obs` is truthy.
- Correctly route `_batch_obs` such that L-Qubit splits observables.
- Enhance/adapt tests.

**Analysis:**

**Lightning-Qubit**

`applyObservable` is a bottleneck for somewhat large linear combinations (say 100s or 1000s of terms). Chunking isn't helpful for a circuit like

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)
```

because L-Qubit's `applyObservable` method is parallelized over terms for a single `Hamiltonian` observable. Chunking in this case is counter-productive because it requires extra state vectors, extra backward passes, etc. For a circuit like the following, however,

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return np.array([qml.expval(ham), qml.expval(qml.PauliZ(0))])
```

`applyObservable` is parallelized over observables, which only scales up to 2 threads, and with poor load balance. In this case, it is better to split the observable, which is what the current changes do.

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
| --- | ------------- | -------------- | ------------ | ------------- |
| CH4 | 1.793e+01 | 1.330e+01 | 1.819e+01 | 8.040e+00 |
| Li2 | 5.333e+01 | 3.354e+01 | 5.289e+01 | 1.839e+01 |
| CO  | 9.817e+01 | 5.945e+01 | 9.619e+01 | 2.559e+01 |
| H10 | 1.220e+02 | 7.317e+01 | 1.182e+02 | 3.305e+01 |

So for this circuit the current PR yields speed-ups ranging from 1.5x to more than 2x by using obs-batching + chunking (compared with the previous obs-batching).

**Lightning-GPU**

Lightning-GPU splits the observables as soon as `batch_obs` is true. The current code splits a Hamiltonian into all its individual terms, which is quite inefficient and induces a lot of redundant backward passes. This is visible when benchmarking the circuit

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)
```

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
| --- | ------------- | -------------- | ------------ | ------------- |
| CH4 | 1.463e+01 | forever | 5.583e+00 | 3.405e+00 |
| Li2 | 1.201e+01 | forever | 5.284e+00 | 2.658e+00 |
| CO  | 2.357e+01 | forever | 4.716e+00 | 4.577e+00 |
| H10 | 2.992e+01 | forever | 5.476e+00 | 5.469e+00 |
| HCN | 8.622e+01 | forever | 3.144e+01 | 2.452e+01 |

The batched L-GPU runs use 2 x A100 GPUs on ISAIC. The speed-ups for batched versus serial are OK, but most important is the optimization of `Hamiltonian::applyInPlace`, which brings nice speed-ups between master and this PR.

**Related GitHub Issues:**

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com>
Co-authored-by: AmintorDusko <amintor_dusko@hotmail.com>
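For concreteness, here is a minimal sketch of the chunking strategy described in the Context section above. It assumes a LinearCombination-like observable exposing `terms()`; `chunk_observable` is a hypothetical helper for illustration, not the serializer's actual implementation.

```python
# Hedged sketch: split an n-term Hamiltonian into `num_chunks` roughly equal
# sub-Hamiltonians (e.g. 1000 terms -> four 250-term chunks) instead of
# n singletons, so each worker gets one chunk and one state-vector copy.
import numpy as np
import pennylane as qml

def chunk_observable(obs, num_chunks):
    coeffs, ops = obs.terms()
    base, extra = divmod(len(ops), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(qml.Hamiltonian(coeffs[start:start + size],
                                          ops[start:start + size]))
        start += size
    return chunks

ham = qml.Hamiltonian(np.ones(10), [qml.PauliZ(i % 4) for i in range(10)])
print([len(c.terms()[1]) for c in chunk_observable(ham, 4)])  # [3, 3, 2, 2]
```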