Chunk Hamiltonian, PauliSentence, LinearCombination [sc-65680] #873

Merged: 37 commits into master on Sep 5, 2024

Conversation

@vincentmr (Contributor) commented Aug 27, 2024

Before submitting

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test.
      If you've fixed a bug or added code that should be tested, add a test to the
      [`tests`](../tests) directory!

- [x] All new functions and code must be clearly commented and documented.
      If you do make documentation changes, make sure that the docs build and
      render correctly by running `make docs`.

- [x] Ensure that the test suite passes, by running `make test`.

- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the
      change and including a link back to the PR.

- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed
line and fill in the pull request template.


**Context:**
Parallelizing over observables can accelerate the backward pass of adjoint Jacobian calculations. This PR revisits our implementation for L-Qubit and L-GPU, the two devices that support it. Certain observables, such as `Hamiltonian`, `PauliSentence`, and `LinearCombination`, can be split into many observables, enabling the cost of the expectation-value computation to be distributed. This strategy is initiated by the serializer, which partitions the observables if `split_obs` is not `False`. The serializer performs a complete partitioning, meaning a 1000-PauliWord `PauliSentence` is partitioned into 1000 `PauliWord` observables. We note in passing that L-Qubit does not split observables, since it does not pass a `split_obs` value to `_process_jacobian_tape`. This is wasteful because we end up in one of two situations:

- The Jacobian is computed N processes (threads, devices, etc.) at a time, which results in a lot of duplicate computation (forward/backward passes are repeated and the results combined);
- The Jacobian is parallelized over all observables, each of which requires a state-vector copy, which increases the memory requirements accordingly.

We explore chunking instead of full partitioning for `LinearCombination`-like objects, meaning a 1000-PauliWord `PauliSentence` is partitioned into four 250-PauliWord `PauliSentence`s if we parallelize over 4 processes, as the sketch below illustrates.
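A minimal, hypothetical sketch of chunking versus full partitioning (the helper `chunk_terms` and the stand-in term list are made up for illustration; the actual logic lives in the serializer in `pennylane_lightning/core/_serialize.py`):

```python
def chunk_terms(terms, num_chunks):
    """Split a list of terms into num_chunks contiguous chunks of near-equal size."""
    size = -(-len(terms) // num_chunks)  # ceiling division
    return [terms[i : i + size] for i in range(0, len(terms), size)]

terms = list(range(1000))        # stand-in for 1000 PauliWords
full = [[t] for t in terms]      # full partitioning: 1000 single-term observables
chunked = chunk_terms(terms, 4)  # chunking: four 250-term observables
assert len(full) == 1000
assert len(chunked) == 4 and all(len(c) == 250 for c in chunked)
```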

**Description of the Change:**
- Modify the serializer to chunk `LinearCombination`-like objects if `self.split_obs` is truthy.
- Correctly route `_batch_obs` such that L-Qubit splits observables.
- Enhance/adapt tests.
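From the user's side, splitting is driven by the device's `batch_obs` option; a minimal usage sketch (the `batch_obs` kwarg is real, the circuit itself is illustrative):

```python
import pennylane as qml

# Enable observable batching on Lightning-Qubit; with this change, large
# LinearCombination-like observables are chunked rather than fully split.
dev = qml.device("lightning.qubit", wires=2, batch_obs=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit(theta):
    qml.RX(theta, wires=0)
    ham = qml.Hamiltonian([0.5, -0.2], [qml.PauliZ(0), qml.PauliX(0) @ qml.PauliX(1)])
    return qml.expval(ham)
```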

**Analysis:**
**Lightning-Qubit**

`applyObservable` is a bottleneck for somewhat large linear combinations (say, 100s or 1000s of terms). Chunking isn't helpful for a circuit like

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)
```

because L-Qubit's `applyObservable` method is parallelized over terms for a single `Hamiltonian` observable. Chunking in this case is counter-productive because it requires extra state vectors, extra backward passes, etc.
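The benchmark circuits here assume a qchem setup along these lines (a hypothetical sketch; the molecule, geometry, and `electrons` are illustrative stand-ins, not the PR's actual benchmark inputs):

```python
import pennylane as qml
from pennylane import numpy as np

symbols = ["H", "H", "H", "H"]
geometry = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
                     0.0, 0.0, 2.0, 0.0, 0.0, 3.0])

# Build the molecular Hamiltonian (a LinearCombination-like observable).
ham, n_qubits = qml.qchem.molecular_hamiltonian(symbols, geometry)

electrons = 4
hf_state = qml.qchem.hf_state(electrons, n_qubits)
singles, doubles = qml.qchem.excitations(electrons, n_qubits)
wires = range(n_qubits)

dev = qml.device("lightning.qubit", wires=n_qubits, batch_obs=True)
```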

For a circuit like the following, however,

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return np.array([qml.expval(ham), qml.expval(qml.PauliZ(0))])
```

`applyObservable` is parallelized over observables, which only scales up to 2 threads, and with poor load balance. In this case, it is better to split the observable, which is what the current changes do.

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
| --- | ------------- | -------------- | ------------ | ------------- |
| CH4 | 1.793e+01     | 1.330e+01      | 1.819e+01    | 8.040e+00     |
| Li2 | 5.333e+01     | 3.354e+01      | 5.289e+01    | 1.839e+01     |
| CO  | 9.817e+01     | 5.945e+01      | 9.619e+01    | 2.559e+01     |
| H10 | 1.220e+02     | 7.317e+01      | 1.182e+02    | 3.305e+01     |

So for this circuit the current PR yields speed-ups ranging from 1.5x to over 2x by using obs-batching + chunking (compared with the previous obs-batching).

**Lightning-GPU**

Lightning-GPU splits the observables as soon as `batch_obs` is true. The current code splits a `Hamiltonian` into all its individual terms, which is quite inefficient and induces a lot of redundant backward passes. This is visible when benchmarking the circuit

```python
@qml.qnode(dev, diff_method="adjoint")
def c(weights):
    qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)
```

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
| --- | ------------- | -------------- | ------------ | ------------- |
| CH4 | 1.463e+01     | forever        | 5.583e+00    | 3.405e+00     |
| Li2 | 1.201e+01     | forever        | 5.284e+00    | 2.658e+00     |
| CO  | 2.357e+01     | forever        | 4.716e+00    | 4.577e+00     |
| H10 | 2.992e+01     | forever        | 5.476e+00    | 5.469e+00     |
| HCN | 8.622e+01     | forever        | 3.144e+01    | 2.452e+01     |

The batched L-GPU runs use 2 x A100 GPUs on ISAIC. The speed-ups for batched versus serial are OK, but most important is the optimization of `Hamiltonian::applyInPlace`, which brings nice speed-ups between master and this PR.

**Related GitHub Issues:**


Hello. You may have forgotten to update the changelog!
Please edit `.github/CHANGELOG.md` with:

- A one-to-two sentence description of the change. You may include a small working example for new features.
- A link back to this PR.
- Your name (or GitHub username) in the contributors section.


codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.

Project coverage is 97.40%. Comparing base (00ebcdf) to head (f4b8425).
Report is 79 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pennylane_lightning/core/_serialize.py | 66.66% | 9 Missing ⚠️ |
| ...ne_lightning/lightning_kokkos/_adjoint_jacobian.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #873      +/-   ##
==========================================
+ Coverage   88.10%   97.40%   +9.30%     
==========================================
  Files          92      222     +130     
  Lines       11764    30715   +18951     
==========================================
+ Hits        10365    29919   +19554     
+ Misses       1399      796     -603     
```


@vincentmr vincentmr changed the title Chunk hamiltonian [sc-65680] Chunk Hamiltonian, PauliSentence, LinearCombination [sc-65680] Aug 27, 2024
@mlxd (Member) left a comment
Thanks @vincentmr
A few things I'd like to understand first.

Also, given the questions, I'd like to better understand the impact of this on previously run workloads, especially memory use, timing/memory when using multiple GPUs, potential validation of the MPI4PY workload in the paper, and non-H workloads with many terms, before considering a merge.

@AmintorDusko (Contributor) left a comment
Just flying by.
CSR typically outperforms the COO sparse representation.

@vincentmr (Contributor, Author) left a comment
@mlxd Let's focus on L-Qubit first for simplicity, then I'll port whatever solution we end up with to L-GPU(+MPI).

> Also, given the questions, I'd like to also better understand the impact of this on previously run workloads

I think at some point we might have lacked the ability to pass an entire `Hamiltonian` (or `LinearCombination`) down to the C++ layer. A way around this is to compute the Jacobian of the individual terms (which are simpler objects, like tensor products) and sum them up. This is embarrassingly parallel, but it requires a lot of memory and computation. Not sure since when, but now we can pass a `LinearCombination` directly to the adjoint pipeline. I think the only part that is effectively parallelizable upon splitting is `applyObservables`, but this may seldom be a bottleneck. The term-by-term combination is sketched below.
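A minimal, hypothetical sketch of that term-by-term combination (the helper name and arrays are stand-ins; real per-term Jacobians would come from single-observable adjoint-Jacobian calls):

```python
import numpy as np

def combine_term_jacobians(coeffs, term_jacobians):
    """Jacobian of <H> with H = sum_k c_k * O_k equals sum_k c_k * Jacobian of <O_k>."""
    return sum(c * j for c, j in zip(coeffs, term_jacobians))

coeffs = [0.5, -0.2]                                           # c_k
term_jacobians = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]  # stand-in d<O_k>/dtheta
print(combine_term_jacobians(coeffs, term_jacobians))          # [-0.1  0.2]
```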

@vincentmr vincentmr added the ci:use-multi-gpu-runner Enable usage of Multi-GPU runner for this Pull Request label Aug 28, 2024
@vincentmr vincentmr requested a review from mlxd August 30, 2024 12:13
@AmintorDusko (Contributor) commented

I will re-trigger your CIs, as there was some problem with the test PyPI.

@AmintorDusko AmintorDusko added the ci:build_wheels Activate wheel building. label Sep 3, 2024
@AmintorDusko (Contributor) left a comment
Looks good. I will come back later to check the CIs, after you merge master.

@vincentmr vincentmr requested review from a team and AmintorDusko September 3, 2024 21:30
@AmintorDusko (Contributor) left a comment
Beautiful! Thanks!

@vincentmr vincentmr requested a review from a team September 4, 2024 13:16
@multiphaseCFD (Member) left a comment
Thanks @vincentmr for the nice work! Just a few questions.

@multiphaseCFD (Member) left a comment
Great job! Thanks @vincentmr !

@vincentmr vincentmr merged commit f4a6114 into master Sep 5, 2024
129 of 133 checks passed
@vincentmr vincentmr deleted the chunck_hamiltonian branch September 5, 2024 15:57
multiphaseCFD pushed a commit that referenced this pull request Sep 8, 2024
Labels
- ci:build_wheels (Activate wheel building.)
- ci:use-multi-gpu-runner (Enable usage of Multi-GPU runner for this Pull Request)