Improve computation time of compute_matrix for parametric ops #2759

Conversation
Codecov Report

```
@@           Coverage Diff            @@
##           master    #2759    +/-  ##
========================================
  Coverage   99.63%   99.63%
========================================
  Files         252      252
  Lines       20772    20986    +214
========================================
+ Hits        20696    20910    +214
  Misses         76       76
```

Continue to review full report at Codecov.
@dwierichs One important question I have about this that will affect performance-optimization choices: How often will parameter broadcasting be used by a standard user? Do we want to make everything a little bit slower so we can make parameter broadcasting faster? Or should we optimize the standard use case at the expense of parameter broadcasting? I'm also a bit concerned about the readability, and thus maintainability, of this code. These operations are highly trafficked by external contributors and new developers. If we do use non-standard behaviour, like the einsum broadcasting, that needs to be thoroughly explained as close to the code as possible.
Hi @dwierichs. I agree with Christina's points, and wonder whether, instead of using einsum with the ellipsis to access broadcasting directly, we can make the broadcasting calls ourselves, removing the dependence on einsum altogether. Einsum's syntax is a sizeable maintenance burden in my eyes, and if we can do things with direct calls instead, I think we should.
Thanks for the feedback, @albi3ro and @mlxd! :)

As far as I am aware, master (and v0.24.0) currently already carries the slowdown from allowing broadcasting, introduced by #2609. This PR basically speeds up all scenarios, as far as my timings went (although I have to admit that for some reason the standard timing with …).

That's an excellent point :) Maybe one could, as there basically are only the two signatures `"...,i->...i"` and `"...i,ik->ik"` in use, make them utility functions within `qml.math`.

I am not entirely sure I understand the alternative here? Would you manually distinguish between broadcasted and non-broadcasted parameters within the method, and have an if-else branch for the distinction? :)

Thanks again for the feedback, I wasn't aware of this :)
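For illustration, a minimal NumPy sketch of what such a utility for the first signature could look like (the helper name `outer_like` and the plain-NumPy setting are assumptions for this sketch, not from the thread):

```python
import numpy as np

def outer_like(theta, vec):
    """Hypothetical helper replacing einsum("...,i->...i", theta, vec).

    For a scalar theta this returns theta * vec (shape (i,)); for a batched
    theta of shape (batch,) it returns shape (batch, i), matching the
    ellipsis-based einsum signature.
    """
    # A generalized outer product: result shape is theta.shape + vec.shape,
    # so leading (broadcast) dimensions of theta are kept intact.
    return np.multiply.outer(theta, vec)

pm = np.array([-1, 1])
assert np.allclose(outer_like(0.3, pm), np.einsum("...,i->...i", 0.3, pm))
assert np.allclose(outer_like(np.ones(5), pm), np.einsum("...,i->...i", np.ones(5), pm))
```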
I guess what I was trying to map out was that, instead of using the native broadcasting in einsum (enabled by the ellipsis ops, according to their docs), since we already know the structure of the gates we want, why not call the ops we want directly. A poor example, but it does demonstrate a speed-up over einsum for me:

```python
import pennylane as qml
from timeit import default_timer as timer

theta = qml.numpy.random.rand(20)
theta_scale = 0.5j * theta
pm = qml.numpy.array([-1, 1])

start = timer()
for i in range(10000):
    qml.math.einsum("...,i->...i", theta_scale, pm)
end = timer()
print(f"EINSUM BCAST:t={end-start},res={qml.math.einsum('...,i->...i', theta_scale, pm)}")

start = timer()
for i in range(10000):
    qml.numpy.outer(theta_scale, pm)
end = timer()
print(f"OPS DIRECT:t={end-start},res={qml.numpy.outer(theta_scale, pm)}")
```

We obtain the same data overall, but the second option is about 2/3 the runtime of the first for me. What do you think?
Ah yes, this is indeed significantly faster. I noticed the following, though:

```python
import pennylane as qml
import torch
from timeit import default_timer as timer

theta = qml.numpy.random.rand(20)
theta = torch.tensor(0.514)  # overwrite with a scalar Torch parameter
theta_scale = 0.5j * theta
pm = qml.numpy.array([-1, 1])
pm = torch.tensor(pm)

start = timer()
for i in range(10000):
    qml.math.einsum("...,i->...i", theta_scale, pm)
end = timer()
print(f"EINSUM BCAST:t={end-start},res={qml.math.einsum('...,i->...i', theta_scale, pm)}")

start = timer()
for i in range(10000):
    qml.math.outer(theta_scale, pm)  # <-- raises RuntimeError
end = timer()
print(f"OPS DIRECT:t={end-start},res={qml.numpy.outer(theta_scale, pm)}")
```

**edit** It seems that using this manual if-else branch is indeed still significantly faster :) Just want to make sure I'm getting your idea correctly!
As we want to support both broadcasted and scalar parameters, we do not know the internal structure at this point. There always are two scenarios, either producing a single matrix or a stack of matrices along a leading broadcast dimension.

When I opened this PR, I thought that einsum was the most convenient way of covering both scenarios with a single call.

But seeing now that manually deciding between the two cases is significantly faster, I agree that the explicit branch is the better choice.
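A minimal NumPy sketch of the branching idea being converged on here, using RZ as the example (plain NumPy stands in for `qml.math`; this is an illustration, not the actual PennyLane implementation):

```python
import numpy as np

def rz_matrix(theta):
    """Sketch: return the RZ matrix for a scalar theta, or a stack of
    matrices for a batched theta of shape (batch,)."""
    p = np.exp(-0.5j * np.asarray(theta))  # e^{-i theta / 2}
    if np.ndim(theta) == 0:
        # Scalar parameter: a single (2, 2) matrix.
        return np.diag([p, np.conj(p)])
    # Broadcasted parameter: a (batch, 2, 2) stack with the per-parameter
    # phases on the diagonal.
    mats = np.zeros(np.shape(theta) + (2, 2), dtype=complex)
    mats[..., 0, 0] = p
    mats[..., 1, 1] = np.conj(p)
    return mats

assert rz_matrix(0.3).shape == (2, 2)
assert rz_matrix(np.ones(100)).shape == (100, 2, 2)
```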
Great, thanks for the clarification here!

Yep, sounds like the better choice to me. I still think there are scenarios where einsum is the best fit, but for simple tensor ops like the ones we are doing here, I reckon it may be a bazooka to hunt a fly. 🙂
Hi @dwierichs, thank you for putting this PR together.
Hi @AmintorDusko,

Thank you for your review and very helpful suggestions :) And sorry for the extended silence on this PR.

I left some comments on the RZ code you discussed; I think it's a great idea to do this for one operation in general and carry the fastest method over to the other operations (as long as it is applicable, of course :)).

Let me know what you think about the update :)
Hi @dwierichs, thank you for testing the compatibility of my suggestions with other frameworks and adapting them!
Hi David, I have done some additional runtime profiling of the new ops compared against 0.24.0. For reference, this script is used to get the runtime for each op:

```python
import pennylane as qml
from timeit import default_timer as timer

p = qml.numpy.random.random(100)
num_samples = 10000
times = {}

def run_op(op):
    # Pre-run to exclude one-time setup costs from the timing
    op.compute_matrix(p)
    start = timer()
    for i in range(num_samples):
        op.compute_matrix(p)
    end = timer()
    return (end - start) / num_samples

ops = [qml.RX, qml.RY, qml.RZ, qml.CRX, qml.CRY, qml.CRZ, qml.PhaseShift, qml.ControlledPhaseShift]
for op in ops:
    times[op.__name__] = run_op(op)
print({key: f"{t:.3e}" for key, t in times.items()})
```

For single (aka non-broadcasted) params, we get the following:
and for broadcasting with 100 params:
While we get a speedup in some of the ops (especially RZ), we see a ~10% slow-down on RX and RY pretty much consistently across the board. While the slow-down isn't huge (on the order of ~10 µs per call), I think this isn't an ideal scenario to have, especially for larger systems making many calls to these ops.
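Presumably, for the non-broadcasted timings the same script is run with a scalar parameter instead of the length-100 array, e.g.:

```python
p = qml.numpy.random.random()  # scalar parameter exercises the non-broadcasted path
```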
Hey Lee,

```python
## not broadcasted
# on master
{'RX': '4.312e-05', 'RY': '4.020e-05', 'RZ': '6.145e-05', 'CRX': '9.097e-05', 'CRY': '9.188e-05', 'CRZ': '9.558e-05', 'PhaseShift': '5.222e-05', 'ControlledPhaseShift': '2.979e-05'}
# this PR
{'RX': '4.619e-05', 'RY': '4.198e-05', 'RZ': '1.955e-05', 'CRX': '9.278e-05', 'CRY': '8.917e-05', 'CRZ': '2.197e-05', 'PhaseShift': '1.935e-05', 'ControlledPhaseShift': '1.924e-05'}

## broadcasted
# on master
{'RX': '1.890e-04', 'RY': '1.915e-04', 'RZ': '1.896e-04', 'CRX': '3.114e-04', 'CRY': '3.227e-04', 'CRZ': '2.974e-04', 'PhaseShift': '1.576e-04', 'ControlledPhaseShift': '2.744e-04'}
# this PR
{'RX': '2.012e-04', 'RY': '2.038e-04', 'RZ': '7.063e-05', 'CRX': '3.341e-04', 'CRY': '3.455e-04', 'CRZ': '7.834e-05', 'PhaseShift': '7.283e-05', 'ControlledPhaseShift': '8.079e-05'}
```

Now you might ask why I'm posting this. It is because this PR does not actually touch the code of `RX` and `RY`.
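Reading off the numbers above: RX moves from 4.312e-05 s to 4.619e-05 s per call (about +7%), while RZ drops from 6.145e-05 s to 1.955e-05 s (roughly a 3x speedup); the broadcasted case shows the same pattern. This is consistent with the ~10% figure quoted for RX/RY and the large gains on the rewritten ops.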
Hmmm, this is an odd one indeed. I think we need to do some call-stack probing for this. Let me try it out and see if I can find some differences.
I've tried to inspect cache usage by the implementations, and for whatever reason, I am seeing more L2 cache misses locally with this PR than with 0.24 or master. However, this could be an artifact of something else entirely, and it seems to even out in the large-sample limit (1e5), at least for me. I wonder if there is some prefetching happening at the numpy (C) level that is affecting this, or maybe something foolish like background processes for us both. All in all, I think this is worth ignoring, as it does not pose a major risk, provided we can accept a potential swing of up to 10% for these ops, which I am assuming will be a small fraction of actual calls in a given workflow. If anybody else has any thoughts or comments on this, I'd welcome their suggestions. I would like to see some timing runs of the test suite for this PR against master/0.24.0 for the affected operations, but I think there will be no other blockers here from me.
Nice work, @dwierichs!
Thanks all!
The outer product is also much more readable than einsum, so that's a double boost to using it.

I also like having the `if qml.math.ndim(theta) == 0:` statements for readability, as those cases are often simpler code. That allows us to understand the non-broadcasting behaviour more easily.

Good improvement, and thanks, performance team!
With the support of parameter broadcasting in #2609, the `compute_matrix` method of many parametric operations was updated. Unfortunately, the resulting methods are sometimes considerably slower than the previous versions.

This PR rewrites `compute_matrix` for a series of these operations, reducing the slowdown and even achieving speedups compared to the versions that didn't support broadcasting. At the same time, the new methods are more compact.

One key change is heavy usage of `qml.math.einsum` instead of nested `qml.math.stack` calls. This reduces the number of calls to autoray, is easy to write in a broadcastable manner, and results in shorter code.
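For illustration, a hedged sketch of the kind of rewrite described, with plain NumPy standing in for `qml.math` (this mirrors the RZ-style diagonal discussed in the thread, not the literal PennyLane diff):

```python
import numpy as np

theta = np.random.rand(100)   # broadcasted parameters
pm = np.array([-1, 1])

# einsum with an ellipsis handles scalar and batched theta uniformly:
# entries[b] = (e^{-i theta_b / 2}, e^{+i theta_b / 2})
entries = np.exp(np.einsum("...,i->...i", 0.5j * theta, pm))
mats_einsum = entries[..., np.newaxis] * np.eye(2)   # shape (100, 2, 2)

# The nested-stack alternative builds every matrix element by element:
mats_stack = np.stack([
    np.stack([
        np.stack([np.exp(-0.5j * t), 0.0]),
        np.stack([0.0, np.exp(0.5j * t)]),
    ])
    for t in theta
])

assert np.allclose(mats_einsum, mats_stack)
```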