
Improve computation time of compute_matrix for parametric ops #2759

Merged: 24 commits merged into master on Jul 27, 2022

Conversation

dwierichs (Contributor)

With the support of parameter broadcasting in #2609 , the compute_matrix method of many parametric operations was updated.
Unfortunately, the resulting methods are sometimes considerably slower than the previous version.
This PR rewrites compute_matrix for a series of these operations, reducing the slowdown and even achieving speedups compared to the versions that didn't support broadcasting. At the same time, the new methods are more compact.

One key change is heavy usage of qml.math.einsum instead of nested qml.math.stack calls. This reduces the calls to autoray, is easy to write in a broadcastable manner, and results in shorter code.

qml.math.einsum("...,i->...i", x, y) # scalar*vector or outer product between vectors
qml.math.einsum("...i,il->...il", x, y) # broadcastable version of [_x * _y for _x, _y in zip(x, y)]
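As an illustration of these two signatures (plain NumPy standing in for qml.math here, with made-up variable names), the ellipsis lets one signature cover both a scalar and a batched first argument:

```python
import numpy as np

pm = np.array([-1.0, 1.0])
theta = np.array([0.1, 0.2, 0.3])  # batch of 3 parameters
phi = 0.3                          # single scalar parameter

# "...,i->...i": an outer product that degrades gracefully to scalar*vector
batched = np.einsum("...,i->...i", theta, pm)  # shape (3, 2)
single = np.einsum("...,i->...i", phi, pm)     # shape (2,)

# "...i,il->...il": scales the rows of a matrix by the entries of the
# (possibly batched) first argument
mat = np.arange(4.0).reshape(2, 2)
x = np.array([[0.1, 0.2], [0.3, 0.4]])         # batch of 2, i = 2
rows = np.einsum("...i,il->...il", x, mat)     # shape (2, 2, 2)
```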

codecov bot commented Jun 21, 2022

Codecov Report

Merging #2759 (0c49888) into master (4837dd7) will not change coverage (+0.00%).
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master    #2759    +/-   ##
========================================
  Coverage   99.63%   99.63%            
========================================
  Files         252      252            
  Lines       20772    20986   +214     
========================================
+ Hits        20696    20910   +214     
  Misses         76       76            
Impacted Files Coverage Δ
pennylane/ops/qubit/attributes.py 100.00% <ø> (ø)
pennylane/_qubit_device.py 99.52% <100.00%> (+0.06%) ⬆️
pennylane/devices/default_mixed.py 100.00% <100.00%> (ø)
pennylane/devices/default_qubit.py 100.00% <100.00%> (ø)
pennylane/devices/default_qubit_autograd.py 100.00% <100.00%> (ø)
pennylane/devices/default_qubit_jax.py 100.00% <100.00%> (ø)
pennylane/devices/default_qubit_tf.py 97.16% <100.00%> (+1.85%) ⬆️
pennylane/devices/default_qubit_torch.py 92.45% <100.00%> (+0.14%) ⬆️
pennylane/ops/qubit/parametric_ops.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mlxd mlxd requested review from AmintorDusko and mlxd June 21, 2022 14:38
@albi3ro albi3ro requested review from albi3ro June 21, 2022 18:44
albi3ro (Contributor) commented Jun 21, 2022

@dwierichs One important question I have about this, which will affect performance optimization choices: How often will parameter-broadcasting be used by a standard user? Do we want to make everything a little bit slower so we can make parameter-broadcasting faster? Or should we optimize the standard use-case at the expense of parameter-broadcasting?

I'm also a bit concerned about the readability, and thus maintainability, of this code. These operations are highly trafficked by external contributors and new developers. If we do use non-standard behaviour, like the einsum broadcasting, that needs to be thoroughly explained as close to the code as possible.

mlxd (Member) commented Jun 21, 2022

Hi @dwierichs. I agree with Christina's points, and wonder whether, instead of using einsum with the ellipsis to access broadcasting directly, we can make the broadcasting calls ourselves, removing the dependence on einsum altogether.

Einsum's syntax is a sizeable maintenance burden in my eyes, and if we can do things with direct calls instead, I think we should.

dwierichs (Contributor, Author)

Thanks for the feedback, @albi3ro and @mlxd ! :)

Do we want to make everything a little bit slower so we can make parameter-broadcasting faster?

As far as I am aware, master (and v0.24.0) already carries the slowdown from allowing broadcasting, introduced by #2609. This PR basically speeds up all scenarios, as far as my timings went (although I have to admit that, for some reason, standard timing with process_time() and lots of iterations was surprisingly unstable...).

If we do use non-standard behaviour, like the einsum broadcasting, that needs to be thoroughly explained as close to the code as possible.

That's an excellent point :) Since basically only the two signatures "...,i->...i" and "...i,il->...il" are used, maybe one could make them utility functions within parametric_ops that carry extensive docstrings?

whether, instead of using einsum with the ellipsis to access broadcasting directly, we can make the broadcasting calls ourselves, removing the dependence on einsum altogether.

I am not entirely sure I understand the alternative here? Would you manually differentiate between broadcasted and not broadcasted within the method, and have an if-else-loop for the distinction? :)

Einsum's syntax is a sizeable maintenance burden in my eyes, and if we can do things with direct calls instead I think we should

Thanks again for the feedback, I wasn't aware of this :)

mlxd (Member) commented Jun 21, 2022

I am not entirely sure I understand the alternative here? Would you manually differentiate between broadcasted and not broadcasted within the method, and have an if-else-loop for the distinction? :)

I guess what I was trying to map out is that, instead of using the native broadcasting that is part of einsum (enabled by the ellipsis, according to the docs), since we already know the structure of the gates we want, why not call the ops we want directly.

A poor example, but it does demonstrate a speed-up over einsum for me:

import pennylane as qml
from timeit import default_timer as timer

theta = qml.numpy.random.rand(20)

theta_scale = 0.5j * theta
pm = qml.numpy.array([-1, 1])

start = timer()
for i in range(10000):
    qml.math.einsum("...,i->...i", theta_scale, pm)
end = timer()
print(f"EINSUM BCAST:t={end-start},res={qml.math.einsum('...,i->...i', theta_scale, pm)}")

start = timer()
for i in range(10000):
    qml.numpy.outer(theta_scale, pm)
end = timer()
print(f"OPS DIRECT:t={end-start},res={qml.numpy.outer(theta_scale, pm)}")

We obtain the same data overall, but the second option is about 2/3 the runtime of the first for me. What do you think?
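The equivalence of the two calls can be checked directly (plain NumPy here in place of qml.numpy, purely for illustration):

```python
import numpy as np

theta_scale = 0.5j * np.linspace(0.0, 1.0, 20)
pm = np.array([-1, 1])

# Both calls compute out[j, i] = theta_scale[j] * pm[i]
a = np.einsum("...,i->...i", theta_scale, pm)
b = np.outer(theta_scale, pm)
assert a.shape == b.shape == (20, 2)
assert np.allclose(a, b)
```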

dwierichs (Contributor, Author) commented Jun 21, 2022

We obtain the same data overall, but the second option is about 2/3 the runtime of the first for me. What do you think?

Ah yes, this is indeed significantly faster.
So if we want a version that supports both, theta being scalar and theta being 1d, what would you propose? Manual check which one it is (via qml.math.ndim I suppose) and using either np.outer or simple multiplication for the two cases, respectively?

I noticed that outer works with (scalar, tensor) inputs when using the NumPy interface (though it produces an excess dimension), but it does not work, for example, with Torch:

import torch
import pennylane as qml
from timeit import default_timer as timer

theta = torch.tensor(0.514)  # scalar parameter this time

theta_scale = 0.5j * theta
pm = torch.tensor([-1, 1])

start = timer()
for i in range(10000):
    qml.math.einsum("...,i->...i", theta_scale, pm)
end = timer()
print(f"EINSUM BCAST:t={end-start},res={qml.math.einsum('...,i->...i', theta_scale, pm)}")

qml.math.outer(theta_scale, pm)  # <-- raises RuntimeError for the 0-d torch input

Edit: It seems that the manual if-else branch is indeed significantly faster, still :) Just want to make sure I'm getting your idea correctly!
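Concretely, that manual branch might look like this minimal sketch (plain NumPy in place of qml.math; diag_scaled is a hypothetical name, not a function from the PR):

```python
import numpy as np

def diag_scaled(theta, pm):
    # Scalar parameter: plain product, same shape as pm.
    if np.ndim(theta) == 0:
        return theta * pm
    # Broadcasted parameter of shape (batch,): outer product,
    # giving shape (batch, len(pm)).
    return np.outer(theta, pm)
```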

mlxd (Member) commented Jun 21, 2022

Edit: It seems that the manual if-else branch is indeed significantly faster, still :) Just want to make sure I'm getting your idea correctly!

Good to know that helps. My overall point is that I am not sure using einsum makes sense here, as we should already know the internal structure of the gates we are creating and should not need to offload to einsum to generate the values --- it is just a fancy wrapper around a dispatcher to more fundamental linalg ops & kernels (with some permutation and reordering steps too). As an example, if the ...,i->...i calling convention simply maps to an outer-product call, why not just call the outer product? The same goes for other uses of the library.

From the NumPy docs, the general einsum syntax maps to other NumPy built-ins (see the correspondence table in the np.einsum documentation; image omitted), and I presume TF, JAX, and Torch do similar.

Now, if we cannot replicate this functionality to work across all the frameworks, but using the einsum calling convention allows us to do so in a portable way that isn't causing overheads, then I think it becomes necessary to support this.

Though, maybe I am missing the bigger picture overall on why to use this.

dwierichs (Contributor, Author) commented Jun 21, 2022

as we should know the internal structure of the gates we are creating already,

As we want to support both broadcasted and scalar parameters, we do not know the internal structure at this point. There always are two scenarios, either producing a (2**num_wires, 2**num_wires) matrix or a (batch_size, 2**num_wires, 2**num_wires) tensor. If I'm getting our conversation right, this is the point at which we think differently? :)

Now, if we cannot replicate this functionality to work across all the frameworks, but using the einsum calling convention allows us to do so in a portable way that isn't causing overheads, then I think it becomes necessary to support this.

When I opened this PR, I thought that

  • the only way to cover both scenarios (broadcasted and scalar) in an all-framework-compatible way is einsum
  • manually distinguishing between the two scenarios might be too cumbersome/slow

But seeing now that manually deciding for theta * pm or qml.math.outer(theta, pm) actually is much faster, I agree that we should not use einsum. :)
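For concreteness, putting the two scenarios together for a diagonal gate, a broadcast-aware RZ-style matrix could be sketched like this (plain NumPy, illustrative only, not the code merged in this PR; valid for real-valued theta):

```python
import numpy as np

def rz_matrix(theta):
    # RZ(theta) = diag(exp(-i*theta/2), exp(+i*theta/2)).
    # Scalar theta -> shape (2, 2); theta of shape (batch,) -> (batch, 2, 2).
    phase = np.exp(-0.5j * np.asarray(theta))
    diag = np.stack([phase, np.conj(phase)], axis=-1)  # shape (..., 2)
    return diag[..., :, None] * np.eye(2)              # shape (..., 2, 2)
```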

mlxd (Member) commented Jun 21, 2022

As we want to support both broadcasted and scalar parameters, we do not know the internal structure at this point. There always are two scenarios, either producing a (2**num_wires, 2**num_wires) matrix or a (batch_size, 2**num_wires, 2**num_wires) tensor. If I'm getting our conversation right, this is the point at which we think differently? :)

Great, thanks for the clarification here!

But seeing now that manually deciding for theta * pm or qml.math.outer(theta, pm) actually is much faster, I agree that we should not use einsum. :)

Yep, sounds like the better choice to me. I still think there are scenarios where einsum is the best fit, but for simplified tensor-ops like we are doing here, I reckon it may be a bazooka to hunt a fly. 🙂

AmintorDusko (Contributor) commented Jun 27, 2022

Hi @dwierichs, thank you for putting this PR together.
I made some suggestions to the RZ gate that may improve performance overall if propagated to all gates.
Please double-check that, with these changes, the edge cases are well covered (like TensorFlow and Torch parameters).

dwierichs (Contributor, Author) left a review comment:

Hi @AmintorDusko ,
Thank you for your review and very helpful suggestions :)
And sorry for the extended silence on this PR.
I left some comments on the RZ code you discussed; I think it's a great idea to do this for one operation and then carry the fastest method over to the other operations (where applicable, of course :)).
Let me know what you think about the update :)

Review threads on pennylane/ops/qubit/parametric_ops.py (outdated, resolved)
AmintorDusko (Contributor)

Hi @dwierichs, thank you for testing the compatibility of my suggestions with other frameworks and adapting them!
I'm happy it helped to improve performance.

dwierichs (Contributor, Author)

Hi @mlxd and @albi3ro
I think this would now be in a review-ready state.
Amintor and I converged on the fastest method for RZ, which I then transferred to the other diagonal operations.
Happy to hear any thoughts you might have on the current version, now that the code changed multiple times :)

mlxd (Member) commented Jul 13, 2022

Hi David, I have done some additional runtime profiling of the new ops compared against v0.24.0, running compute_matrix for 1e5 samples per gate and calculating the mean runtime.

For reference, this script is used to get the runtime for each op

import pennylane as qml
from timeit import default_timer as timer

p = qml.numpy.random.random(100)  # broadcasted case; presumably a scalar p for the non-broadcasted runs below
num_samples = 10000
times = {}

def run_op(op):
    # Pre-run
    op.compute_matrix(p)
    start = timer()
    for i in range(num_samples):
        op.compute_matrix(p)
    end = timer()
    return (end-start)/num_samples

ops = [qml.RX, qml.RY, qml.RZ, qml.CRX, qml.CRY, qml.CRZ, qml.PhaseShift, qml.ControlledPhaseShift]

for op in ops:
    times[op.__name__] = run_op(op)

print({key: f"{t:.3e}" for key,t in times.items()})

For single (aka non-broadcasted) params, we get the following:

# This PR
{'RX': '2.234e-04', 'RY': '2.385e-04', 'RZ': '9.266e-05', 'CRX': '3.635e-04', 'CRY': '3.521e-04', 'CRZ': '9.233e-05', 'PhaseShift': '9.596e-05', 'ControlledPhaseShift': '9.349e-05'}

# v0.24.0
{'RX': '2.039e-04', 'RY': '2.139e-04', 'RZ': '1.906e-04', 'CRX': '3.672e-04', 'CRY': '3.606e-04', 'CRZ': '3.138e-04', 'PhaseShift': '1.768e-04', 'ControlledPhaseShift': '1.154e-04'}

and for broadcasting with 100 params:

# This PR
{'RX': '2.332e-04', 'RY': '2.457e-04', 'RZ': '8.272e-05', 'CRX': '3.787e-04', 'CRY': '3.755e-04', 'CRZ': '9.163e-05', 'PhaseShift': '8.159e-05', 'ControlledPhaseShift': '9.045e-05'}

# v0.24.0
{'RX': '2.239e-04', 'RY': '2.445e-04', 'RZ': '1.929e-04', 'CRX': '3.474e-04', 'CRY': '3.589e-04', 'CRZ': '3.543e-04', 'PhaseShift': '1.875e-04', 'ControlledPhaseShift': '3.025e-04'}

While we get a speedup in some of the ops (especially RZ), we see a ~10% slowdown on RX and RY pretty much consistently across the board. While the slowdown isn't huge (on the order of ~10 ms), I don't think this is an ideal scenario to be in, especially for larger systems making many calls to these ops.

dwierichs (Contributor, Author)

Hey Lee,
thanks for taking the time(s)! :)
I ran your code against master instead of v0.24.0:

## not broadcasted
# on master
{'RX': '4.312e-05', 'RY': '4.020e-05', 'RZ': '6.145e-05', 'CRX': '9.097e-05', 'CRY': '9.188e-05', 'CRZ': '9.558e-05', 'PhaseShift': '5.222e-05', 'ControlledPhaseShift': '2.979e-05'}

# this PR
{'RX': '4.619e-05', 'RY': '4.198e-05', 'RZ': '1.955e-05', 'CRX': '9.278e-05', 'CRY': '8.917e-05', 'CRZ': '2.197e-05', 'PhaseShift': '1.935e-05', 'ControlledPhaseShift': '1.924e-05'}

## broadcasted
# on master
{'RX': '1.890e-04', 'RY': '1.915e-04', 'RZ': '1.896e-04', 'CRX': '3.114e-04', 'CRY': '3.227e-04', 'CRZ': '2.974e-04', 'PhaseShift': '1.576e-04', 'ControlledPhaseShift': '2.744e-04'}

# this PR
{'RX': '2.012e-04', 'RY': '2.038e-04', 'RZ': '7.063e-05', 'CRX': '3.341e-04', 'CRY': '3.455e-04', 'CRZ': '7.834e-05', 'PhaseShift': '7.283e-05', 'ControlledPhaseShift': '8.079e-05'}

Now you might ask why I'm posting this. It is because, as far as I am aware, this PR does not actually touch the code of RX, RY, CRX or CRY, only that of diagonal operations. I'm therefore a bit confused by the fairly consistent slowdown for these operations (CRY without broadcasting being the only exception in these times), which hypothetically should not be affected at all compared to master 🤔

mlxd (Member) commented Jul 13, 2022

> Now you might ask why I'm posting this. It is because this PR does not actually touch the code of RX, RY, CRX or CRY as far as I am aware, but only of diagonal operations. [... full timings quoted from the previous comment]

Hmmm, this is an odd one indeed. I think we need to do some call-stack probing for this. Let me try it out and see if I can find some differences.

mlxd (Member) commented Jul 13, 2022

I've tried to inspect cache usage by the implementations, and for whatever reason I am seeing more L2 cache misses locally with this PR than with v0.24 or master. However, this could be an artifact of something else, and it seems to even out in the large-sample limit (1e5), at least for me. I wonder if there is some prefetching happening at the NumPy (C) level that is affecting this, or maybe something foolish like background processes for us both. All in all, I think this is worth ignoring, as it does not pose a major risk, provided we can accept a potential swing of up to 10% for these ops, which I assume will be a small fraction of actual calls in a given workflow.

If anybody else has any thoughts or comments on this, I'd welcome their suggestions.

I would like to see some timing runs of the test suite for this PR against master/v0.24.0 for the affected operations, but otherwise there will be no other blockers here from me.

AmintorDusko (Contributor) left a review comment:

Nice work, @dwierichs!

dwierichs (Contributor, Author)

Thanks all!
I will wait for @albi3ro to have a chance to look at this and merge afterwards :)

dwierichs added a commit that referenced this pull request Jul 18, 2022
albi3ro (Contributor) left a review comment:

The outer product is also much more readable than einsum, so that's a double boost to using it.

I also like having the if qml.math.ndim(theta) == 0: statements for readability, as those cases are often simpler code. That allows us to understand the non-broadcasting behaviour more easily.

Good improvement, and thanks performance team!

@dwierichs dwierichs merged commit 465642b into master Jul 27, 2022
@dwierichs dwierichs deleted the speedup-compute_matrix branch July 27, 2022 20:11
dwierichs added a commit that referenced this pull request Aug 3, 2022
* introduce Operator.ndim_params, Operator.batch_size, QuantumTape.batch_size

* linting

* changelog

* enable tf.function input_signature usage

* black

* test for unsilenced error

* Apply suggestions from code review

Co-authored-by: Josh Izaac <josh146@gmail.com>

* introduce device flag and batch_transform for unbroadcasting; use transform in device.batch_transform

* black, [skip ci]

* code review

* string formatting [skip ci]

* operation broadcasting interface tests

* unbroadcast_expand

* tests for expand function

* tests

* black

* compatibility with TensorFlow 2.6

* builtins unstack

* failing case coverage

* stop using I in operation.py [skip ci]

* commit old changes

* Apply suggestions from code review

Co-authored-by: Josh Izaac <josh146@gmail.com>

* intermed

* review

* Apply suggestions from code review

Co-authored-by: Josh Izaac <josh146@gmail.com>

* review [skip ci]

* move changelog section from "improvements" to "new features"

* changelog

* add missing files

* clean up, move broadcast dimension first

* namespace

* linting variable names

* update tests that manually set ndim_params for default ops

* pin protobuf<4.21.0

* improve shape coersion order

* first features

* fix sample selection

* pin protobuf<4.21.0

* changelog formatting

* more testing, better batch_size

* fix tf function batch_size

* broadcasted pow tests

* attribute test, ControlledQubitUnitary update

* more tf support

* test kwargs attributes

* Apply suggestions from code review

Co-authored-by: Josh Izaac <josh146@gmail.com>

* changelog

* review

* remove prints

* explicit attribute supports_broadcasting tests

* tests disentangle

* fix

* PauliRot broadcasted identity compatible with TF

* rename "batched" into "broadcasted" for uniform namespace

* old TF version support in qubitunitary unitarity check

* python3.7 support

* extend support, flip flag

* fix _apply_unitary when applying unbroadcasted operator to broadcasted state

* tests default_qubit

* speed up _get_batch_size by a factor 10

* [skip ci] ndim adaptation, fix dimension in _get_bathc_size call

* default qubit additional broadcasting support

* default_qubit broadcasted tests start

* most of test suite

* black

* complete default qubit tests (without interfaces)

* tmp autograd done, jax partially

* tf tests

* testing

* black

* linting

* tests

* black

* tf fix

* minor changes

* use get_batch_size

* some TF tests

* tf test finished

* torch tests

* changelog

* move _get_batch_size to DefaultQubit, give QubitDevice dummy _get_batch_size, tests

* linting, old tf version support

* test coverage

* test fix

* mark jax test

* self- review

* Apply suggestions from code review

* typo

* changelog example

* speedup par ops

* qchem matrices

* parametric ops eigvals

* linting

* lint

* changelog

* id addition

* isingxy

* isingxy tests

* black

* mention speedup in changelog

* singleexcitation fix

* add ndim_params properties and add qchem ops to attribute

* attribute and test

* Revert "merge"

This reverts commit 55cb10d, reversing
changes made to 1cac187.

* mask lookups

* black

* remove unnecessary eigvals

* remove speedups to put them in separate PR

* tmp

* converged matrix methods, priority for scalar speedups

* black

* lint

* exclude isingxy changes (duplicate from #2759)

* revert generator changes

* dtype namespace

* unfying functions for qchem compute_matrix methods in Single- and DoubleExcitations

* black

Co-authored-by: Josh Izaac <josh146@gmail.com>