Add implementations for optimised gate kernels #85
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##           master      #85   +/-   ##
=======================================
  Coverage   98.03%   98.03%
=======================================
  Files           3        3
  Lines          51       51
=======================================
  Hits           50       50
  Misses          1        1
```

Continue to review full report at Codecov.
Awesome, thanks for this PR @ThomasLoke! 🎊 We'll have a look.
Looks good to me! 🥇 💯 Thank you so much for the quick PR, it helps a lot! 😊

A couple of suggestions/questions, but it's good to go from my side. I ran the PennyLane device test suite and it completed successfully.

Was wondering: could it make sense to add some special cases too? E.g., when applying PauliX, we skip the swap if the two vector elements are identical. 🤔 Such cases could potentially arise when there's a more complex algorithm and gates are applied to states that are not known in advance. Not necessarily in scope here, just a thought.
Benchmarked `T` for 23 qubits as an example:
master
```python
import pennylane as qml
dev = qml.device('lightning.qubit', wires=23)

%timeit dev.apply_lightning(dev._state, [qml.T(wires=[0])])
# 49.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
this PR

```python
import pennylane as qml
dev = qml.device('lightning.qubit', wires=23)

%timeit dev.apply_lightning(dev._state, [qml.T(wires=[0])])
# 19.9 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
default.qubit
```python
import pennylane as qml
dev = qml.device('default.qubit', wires=23)

%timeit dev.apply([qml.T(wires=[0])])
# 43.9 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Side note for benchmarking: using `dev.state` instead of `dev._state` with lightning takes ~10 ms.
pennylane_lightning/src/Apply.cpp (Outdated)

```cpp
            shiftedStatePtr[internalIndex] += matrix[baseIndex + j] * inputVector[j];
        }
    }
    gate->applyKernel(shiftedStatePtr, internalIndices, inputVector);
```
💯
pennylane_lightning/src/Gates.cpp (Outdated)

```cpp
@@ -61,6 +85,10 @@ const vector<CplxType> Pennylane::XGate::matrix{
    0, 1,
    1, 0 };

void Pennylane::XGate::applyKernel(CplxType* state, vector<size_t>& indices, vector<CplxType>& v) {
```
Adding constness, and making `v` unnamed if unused, would be a suggestion in other cases too:

```diff
- void Pennylane::XGate::applyKernel(CplxType* state, vector<size_t>& indices, vector<CplxType>& v) {
+ void Pennylane::XGate::applyKernel(CplxType* state, const vector<size_t>& indices, vector<CplxType>&) {
```
I'm okay with adding the `const`; I'm generally not a fan of removing variable names from method signatures, but I guess opinions may differ on this one. If nothing else, it's making it clear that it's overriding the same `virtual` method.
Got it! A reason to make `v` unnamed would be that, while reading, having the name `v` was somewhat confusing, as one would expect it to be used within the function (as it is in some of them).
True; although I think that's an indication that we should rework the API (i.e. we shouldn't be passing in parameters we don't need if we don't have to). As mentioned in the PR description, it's probably beneficial, although it will unfortunately result in some measure of code duplication.
Or, if branching off would result in a performance degradation, how about incorporating the internal loop in `applyKernel`, or creating another member function that implements the internal loop and depends on `applyKernel`? That way `inputVector` would only be created if used.
That would be ideal. However, ultimately I think we want to inline `applyKernel`. And for that to happen I think we need to duplicate the outer loop for every subclass, because of the requirement to be able to statically infer which method is being called. Again, probably beneficial in the long run, but just a bit painful to look at.
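To make that trade-off concrete, here is a minimal sketch of one alternative: hoist the outer loop into a function template so the concrete gate type is statically known and its kernel call can be inlined without duplicating the loop in every subclass. The helper name `applyGateLoop`, the index parameters, and the two-argument `applyKernel` signature are assumptions for illustration, not code from this PR.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using CplxType = std::complex<double>;  // assumed alias; the project defines CplxType elsewhere

// Hypothetical sketch: the outer loop lives in a function template, so the
// concrete gate type is known at compile time and its kernel call is a
// non-virtual call that the compiler is free to inline.
template <class GateT>
void applyGateLoop(GateT& gate, CplxType* state,
                   const std::vector<std::size_t>& externalIndices,
                   const std::vector<std::size_t>& internalIndices) {
    for (std::size_t externalIndex : externalIndices) {
        CplxType* shiftedStatePtr = state + externalIndex;
        // Statically known target: eligible for inlining, unlike a virtual call.
        gate.applyKernel(shiftedStatePtr, internalIndices);
    }
}
```

The dispatch on the gate type would still have to happen once per operation (e.g. via a switch or visitor), but the per-index work would no longer go through a virtual call.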
pennylane_lightning/src/Gates.hpp
Outdated
/** | ||
* Generic matrix-multiplication kernel | ||
*/ | ||
virtual void applyKernel(CplxType* state, std::vector<size_t>& gateIndices, std::vector<CplxType>& v); |
Worth making `v` more descriptive; it seems to stand for `inputVector`.
pennylane_lightning/src/Gates.cpp (Outdated)

```cpp
v[0] = state[indices[0]];
state[indices[0]] = -IMAG * state[indices[1]];
state[indices[1]] = IMAG * v[0];
```
How come we need to update `v[0]` here (and not for `PauliX`, for example)? It seems like it's used as a temp for the swap, but it would also have an effect afterwards, correct?
`v` is really just scratch space. If we move the outer loop in, we can avoid allocating `v` altogether, except in the base implementation. The requirement to pass `v` is one reason why I'm not entirely happy with this interface.
Got it!

> we can avoid allocating v altogether, except in the base implementation

This would be preferable, as just creating `v` would result in an overhead, correct? Could we branch off at the beginning of the outer loop based on the type of operation?
The allocation for `v` is really insignificant in the grand scheme of things, so I don't think branching is necessary. The first rule of optimisation is to not over-optimise, and I don't think this ever becomes a bottleneck, so it's more a question of code clarity as to which API we adopt.
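For context, here is a simplified sketch of what the generic base-class kernel does with the scratch vector, loosely following the Apply.cpp snippet above; the function name and exact signature are approximations, not the project's code. `v` only exists to snapshot the touched amplitudes so the matrix-vector product can be written back in place.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using CplxType = std::complex<double>;  // assumed alias for the project's CplxType

// Simplified sketch of the generic matrix-multiplication kernel: v is pure
// scratch space holding a copy of the touched amplitudes, so matrix * v can
// safely overwrite the same state entries.
void applyGenericKernel(CplxType* shiftedState,
                        const std::vector<std::size_t>& indices,
                        const std::vector<CplxType>& matrix,
                        std::vector<CplxType>& v) {
    const std::size_t dim = indices.size();
    for (std::size_t i = 0; i < dim; ++i) {
        v[i] = shiftedState[indices[i]];
    }
    for (std::size_t i = 0; i < dim; ++i) {
        CplxType sum(0.0, 0.0);
        for (std::size_t j = 0; j < dim; ++j) {
            sum += matrix[i * dim + j] * v[j];
        }
        shiftedState[indices[i]] = sum;
    }
}
```

The specialised kernels (e.g. the PauliX swap) never read `v`, which is what makes the unused parameter in the shared signature feel awkward.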
```diff
  , r1(c * std::pow(M_E, CplxType(0, (-phi - omega) / 2)))
  , r2(-s * std::pow(M_E, CplxType(0, (phi - omega) / 2)))
  , r3(s * std::pow(M_E, CplxType(0, (-phi + omega) / 2)))
  , r4(c * std::pow(M_E, CplxType(0, (phi + omega) / 2)))
  , matrix{
-     c * std::pow(M_E, CplxType(0, (-phi - omega) / 2)), -s * std::pow(M_E, CplxType(0, (phi - omega) / 2)),
-     s * std::pow(M_E, CplxType(0, (-phi + omega) / 2)), c * std::pow(M_E, CplxType(0, (phi + omega) / 2)) }
+     r1, r2,
+     r3, r4 }
```
Nice!
Thanks @ThomasLoke! This is really exciting! How can we help with this? Perhaps one thing would be to work on adding tests for the kernels.

More generally, we're looking at adding an adjoint method for differentiation in #83. For that, we need to:

- Apply the inverse operation
- Apply the generator of an operation, i.e., apply H from e^(i H t)

I was wondering how/whether these would impact the API suggested here. In PennyLane, the name of an inverse gate just has `.inv` appended at the end (e.g., `RX.inv` for the inverse of `RX`). Perhaps the first line of `constructAndApplyOperation()` could remove the `.inv` and set a boolean accordingly, and `constructGate()` would then accept that boolean to indicate whether to apply the inverse? We may then have to define `applyKernelInverse` for each gate.

For generators of parameterized gates, I imagine we might have an `applyGenerator()` method?

Anyway, just thinking out loud - these ideas could be included in later iterations.
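As a very rough illustration of the `.inv` idea above: strip the suffix from the operation label, remember it as a flag, and forward the flag to gate construction. The function `stripInverseSuffix` and its use are entirely hypothetical; nothing like it exists in this PR or in #83 yet.

```cpp
#include <string>

// Hypothetical sketch: remove a trailing ".inv" from the operation label and
// report whether the inverse should be applied. The caller could then forward
// the flag to constructGate(...) or select applyKernelInverse.
bool stripInverseSuffix(std::string& opLabel) {
    static const std::string suffix = ".inv";
    if (opLabel.size() > suffix.size() &&
        opLabel.compare(opLabel.size() - suffix.size(), suffix.size(), suffix) == 0) {
        opLabel.erase(opLabel.size() - suffix.size());
        return true;
    }
    return false;
}
```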
I can't get to these changes tonight, so it'll have to wait until tomorrow. Or just feel free to take this over and edit it as you see fit...

> could it make sense to add some special cases too? E.g., when applying PauliX, we skip the swap if the two vector elements are identical. 🤔 Such cases could potentially arise when there's a more complex algorithm and gates are applied to states that are not known in advance.

Branching statements in the interior of tight loops may well be detrimental to performance due to branch misprediction, so it's probably not worth it unless it's saving a non-trivial amount of effort.
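To make the concern concrete, here is a sketch of what the special case could look like inside a PauliX-style kernel; the function name and signature are assumptions for illustration, not code from this PR. The equality test is paid on every pair the outer loop visits.

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using CplxType = std::complex<double>;  // assumed alias for the project's CplxType

// Sketch: skipping the swap when both amplitudes are equal saves work only if
// equal pairs are common, but the comparison of two complex numbers (and a
// possibly mispredicted branch) is paid on every iteration of the tight loop.
void applyPauliXWithCheck(CplxType* state, const std::vector<std::size_t>& indices) {
    if (state[indices[0]] != state[indices[1]]) {
        std::swap(state[indices[0]], state[indices[1]]);
    }
}
```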
> More generally, we're looking at adding an adjoint method for differentiation in #83.

I haven't taken a close look at the PR, but the API you propose does sound sensible to me :)
I've bitten the bullet and moved the outer loop into `applyKernel`.
Approved! Very nice PR!
Thanks @ThomasLoke!
Thanks @ThomasLoke! For benchmarking, I created #87, which provides benchmarks for certain gates. It's not meant to be extensive, but it could give quick feedback when working on PRs. Thoughts on what else would be important to add in a similar fashion are appreciated.
I'm still mulling over the most natural API, but this is one attempt at it. This is comprised of two commits:

- ... the `AbstractGate` class.

A word of caution: I've yet to test this extensively. In order to eliminate the function call overhead (especially for the single-qubit gates), I think what needs to happen is to move the outer loop over `applyKernel` into said function (unfortunately, I don't think runtime polymorphism works with inline functions unless the concrete class can be statically determined). It may be a worthwhile exercise anyway for when we start vectorising, but again, I've not really tested much.
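For illustration, here is a rough sketch of what moving the outer loop into a single-qubit kernel could look like, so that the per-pair work is a plain swap with no virtual call or scratch vector inside the loop. The function name, the free-function form, and the index layout are assumptions, not the exact code in these commits.

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using CplxType = std::complex<double>;  // assumed alias for the project's CplxType

// Sketch: the PauliX kernel owns the loop over the affected blocks of the
// state vector, so the body is a statically known swap that the compiler can
// inline and, later, potentially vectorise.
void applyPauliX(CplxType* state,
                 const std::vector<std::size_t>& externalIndices,
                 const std::vector<std::size_t>& internalIndices) {
    for (std::size_t externalIndex : externalIndices) {
        CplxType* shifted = state + externalIndex;
        std::swap(shifted[internalIndices[0]], shifted[internalIndices[1]]);
    }
}
```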