Chunk Hamiltonian, PauliSentence, LinearCombination [sc-65680] #873
Conversation
Hello. You may have forgotten to update the changelog!
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master     #873     +/-   ##
=========================================
+ Coverage   88.10%   97.40%    +9.30%
=========================================
  Files          92      222      +130
  Lines       11764    30715    +18951
=========================================
+ Hits        10365    29919    +19554
+ Misses       1399      796      -603
```

☔ View full report in Codecov by Sentry.
Thanks @vincentmr
A few things I'd like to understand first.
Also, given the questions, before considering a merge I'd like to better understand the impact of this on previously run workloads: memory use, timing/memory when using multiple GPUs, potential validation of the MPI4PY workload in the paper, and non-H workloads with many terms.
Just flying by.
CSR typically outperforms the COO sparse representation.
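For illustration (SciPy here stands in for Lightning's internal kernels, so this is an analogy rather than the code under review), the usual pattern is to build in COO and convert to CSR before repeated sparse matrix-vector products:

```python
# Hedged sketch: CSR's row-pointer layout makes repeated SpMV (the core of
# applying a sparse observable to a state vector) faster than COO's
# unordered triplet layout; COO remains convenient for construction.
import numpy as np
from scipy import sparse

dim = 2**12
mat_coo = sparse.random(dim, dim, density=1e-3, format="coo", random_state=0)
vec = np.random.default_rng(0).standard_normal(dim)

mat_csr = mat_coo.tocsr()  # one-off conversion cost
result = mat_csr @ vec     # repeated products amortize the conversion
```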
@mlxd Let's focus on L-Qubit first for simplicity, then I'll port whatever solution we end up with to L-GPU(+MPI).
> Also, given the questions, I'd like to also better understand the impact of this on previously run workloads
I think at some point we might have lacked the ability to pass an entire Hamiltonian (or LinearCombination) down to the C++ layer. A way around this is to compute the Jacobian of the individual terms (which are simpler objects, like TensorProducts) and sum them up. This is embarrassingly parallel, but it requires a lot of memory and computation. I am not sure since when, but we can now pass a LinearCombination directly to the adjoint pipeline. I think the only part that is effectively parallelizable upon splitting is `applyObservables`, but this may seldom be a bottleneck.
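For concreteness, a minimal sketch of the term-by-term workaround described above, using public PennyLane APIs; the circuit and observable here are illustrative placeholders, not this repository's code.

```python
# Hedged sketch: differentiate each Hamiltonian term separately with the
# adjoint method, then sum the per-term Jacobians weighted by coefficients.
# Each term triggers its own forward + backward pass, which is
# embarrassingly parallel but duplicates work and state-vector memory.
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("lightning.qubit", wires=2)
coeffs = [0.5, -0.3]
terms = [qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(0)]

def make_qnode(obs):
    @qml.qnode(dev, diff_method="adjoint")
    def circuit(weights):
        qml.RX(weights[0], wires=0)
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[1], wires=1)
        return qml.expval(obs)
    return circuit

weights = np.array([0.1, 0.2], requires_grad=True)
jac = sum(c * qml.jacobian(make_qnode(op))(weights) for c, op in zip(coeffs, terms))
```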
pennylane_lightning/core/src/simulators/lightning_gpu/observables/ObservablesGPU.hpp
I will re-trigger your CIs as there was some problem with Test PyPI.
Looks good. I will come back later to check the CIs, after you merge master.
Beautiful! Thanks!
Thanks @vincentmr for the nice work! Just a few questions.
pennylane_lightning/core/src/simulators/lightning_kokkos/observables/ObservablesKokkos.hpp
Great job! Thanks @vincentmr!
**Before submitting**

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory!
- [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- [x] Ensure that the test suite passes, by running `make test`.
- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change and including a link back to the PR.
- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.

------------------------------------------------------------------------------------------------------------

**Context:**

Parallelizing over observables can accelerate the backward pass of adjoint Jacobian calculations. This PR revisits our implementation for L-Qubit and L-GPU, the two devices that support it. Certain observables, such as Hamiltonian, PauliSentence, and LinearCombination, can be split into many observables, enabling the cost of the expectation value computation to be distributed. This strategy is initiated by the serializer, which partitions the observables if `split_obs` is not `False`. The serializer proceeds to a complete partitioning, meaning a 1000-PauliWord PauliSentence is partitioned into 1000 PauliWords. We note in passing that L-Qubit does not split observables, since it does not pass a `split_obs` value to `_process_jacobian_tape`. This is wasteful because we end up in one of two situations:

- The Jacobian is computed N processes (threads, devices, etc.) at a time, which results in a lot of duplicate computation (forward/backward passes are repeated and the results combined);
- The Jacobian is parallelized over all observables, each of which requires a state-vector copy, which increases the memory requirements by as much.

We explore chunking instead of full partitioning for LinearCombination-like objects, meaning a 1000-PauliWord PauliSentence is partitioned into four 250-PauliWord PauliSentences if we parallelize over 4 processes. (A sketch of this strategy follows the description below.)

**Description of the Change:**

- Modify the serializer to chunk LinearCombination-like objects if `self.split_obs` is truthy.
- Correctly route `_batch_obs` such that L-Qubit splits observables.
- Enhance/adapt tests.

**Analysis:**

**Lightning-Qubit**

`applyObservable` is a bottleneck for somewhat large linear combinations (say 100s or 1000s of terms). Chunking isn't helpful for a circuit like

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)
```

because L-Qubit's `applyObservable` method is parallelized over terms for a single `Hamiltonian` observable. Chunking in this case is counter-productive because it requires extra state vectors, extra backward passes, etc. For a circuit like the following, however,

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return np.array([qml.expval(ham), qml.expval(qml.PauliZ(0))])
```

`applyObservable` is parallelized over observables, which only scales up to 2 threads, and with poor load balance. In this case, it is better to split the observable, which is what the current changes do.

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
| --- | ------------- | -------------- | ------------ | ------------- |
| CH4 | 1.793e+01 | 1.330e+01 | 1.819e+01 | 8.040e+00 |
| Li2 | 5.333e+01 | 3.354e+01 | 5.289e+01 | 1.839e+01 |
| CO  | 9.817e+01 | 5.945e+01 | 9.619e+01 | 2.559e+01 |
| H10 | 1.220e+02 | 7.317e+01 | 1.182e+02 | 3.305e+01 |

So for this circuit the current PR yields speed-ups ranging from 1.5x to more than 2x by using obs-batching + chunking (compared with the previous obs-batching).

**Lightning-GPU**

Lightning-GPU splits the observables as soon as `batch_obs` is true. The current code splits a Hamiltonian into all its individual terms, which is quite inefficient and induces a lot of redundant backward passes. This is visible when benchmarking the circuit

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)
```

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
| --- | ------------- | -------------- | ------------ | ------------- |
| CH4 | 1.463e+01 | forever | 5.583e+00 | 3.405e+00 |
| Li2 | 1.201e+01 | forever | 5.284e+00 | 2.658e+00 |
| CO  | 2.357e+01 | forever | 4.716e+00 | 4.577e+00 |
| H10 | 2.992e+01 | forever | 5.476e+00 | 5.469e+00 |
| HCN | 8.622e+01 | forever | 3.144e+01 | 2.452e+01 |

The batched L-GPU runs use 2 x A100 GPUs on ISAIC. The speed-ups for batched versus serial are OK, but most important is the optimization of `Hamiltonian::applyInPlace`, which brings nice speed-ups between master and this PR.

**Related GitHub Issues:**

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com>
Co-authored-by: AmintorDusko <amintor_dusko@hotmail.com>
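For concreteness, here is a minimal sketch of the chunking strategy described in the Context section above. It assumes a LinearCombination-like observable exposing `terms()`; `chunk_observable` is a hypothetical helper for illustration, not the serializer's actual implementation.

```python
# Hedged sketch: split an n-term Hamiltonian into `num_chunks` roughly equal
# sub-Hamiltonians (e.g. 1000 terms -> four 250-term chunks) instead of
# n singletons, so each worker gets one chunk and one state-vector copy.
import numpy as np
import pennylane as qml

def chunk_observable(obs, num_chunks):
    coeffs, ops = obs.terms()
    base, extra = divmod(len(ops), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(qml.Hamiltonian(coeffs[start:start + size],
                                          ops[start:start + size]))
        start += size
    return chunks

ham = qml.Hamiltonian(np.ones(10), [qml.PauliZ(i % 4) for i in range(10)])
print([len(c.terms()[1]) for c in chunk_observable(ham, 4)])  # [3, 3, 2, 2]
```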