Add implementations for optimised gate kernels #85

Merged (6 commits) on Mar 12, 2021

Conversation

@ThomasLoke (Contributor) commented Mar 11, 2021

I'm still mulling over the most natural API, but this is one attempt at it. It comprises two commits:

  1. A refactoring commit that simply moves the core of the gate application (matrix multiplication) into the base AbstractGate class.
  2. Adds optimised kernels for most gates that override the default base class implementation. Some gates (e.g. fully populated 2-by-2 matrices for single-qubit gates) gain little (if any) benefit over the default kernel, so I left them as-is.

A word of caution: I've yet to test this extensively. In order to eliminate the function call overhead (especially for the single qubit gates), I think what needs to happen is to move the outer loop over applyKernel into said function (unfortunately, I don't think runtime polymorphism works with inline functions unless the concrete class can be statically determined). It may be a worthwhile exercise anyway for when we start vectorising, but again, I've not really tested much.
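For context, a minimal sketch of what the generic base-class kernel being overridden might look like. All names and signatures here are illustrative, not the repository's exact API:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using CplxType = std::complex<double>;

// Hypothetical sketch of the generic matrix-multiplication kernel: apply a
// dim x dim gate matrix to the state amplitudes selected by `indices`,
// using `scratch` as temporary storage for the input amplitudes.
void applyGenericKernel(CplxType* state,
                        const std::vector<std::size_t>& indices,
                        const std::vector<CplxType>& matrix,
                        std::vector<CplxType>& scratch) {
    const std::size_t dim = indices.size();
    for (std::size_t i = 0; i < dim; ++i)
        scratch[i] = state[indices[i]];              // snapshot the inputs
    for (std::size_t i = 0; i < dim; ++i) {
        CplxType acc{0, 0};
        for (std::size_t j = 0; j < dim; ++j)
            acc += matrix[i * dim + j] * scratch[j]; // row-times-vector
        state[indices[i]] = acc;                     // write the outputs
    }
}
```

The per-gate kernels in this PR replace this dense triple loop with each gate's known sparsity (e.g. a plain swap for PauliX).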

codecov bot commented Mar 11, 2021

Codecov Report

Merging #85 (f305304) into master (a89ebde) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master      #85   +/-   ##
=======================================
  Coverage   98.03%   98.03%           
=======================================
  Files           3        3           
  Lines          51       51           
=======================================
  Hits           50       50           
  Misses          1        1           


@antalszava (Contributor):
Awesome, thanks for this PR @ThomasLoke! 🎊 we'll have a look

@antalszava (Contributor) left a comment

Looks good to me! 🥇 💯 Thank you so much for the quick PR, helps a lot! 😊

A couple of suggestions/questions, but good to go from my side. I ran the PennyLane device test suite and it completed without failures.

Was wondering: could it make sense to add some special cases too: e.g., when applying PauliX, we skip the swap if the two vector elements are identical. 🤔 Such cases could potentially arise when there's a more complex algorithm and gates are applied on states which are not known in advance. Not in scope here necessarily, just a thought.
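The special case being suggested could be sketched roughly like this (hypothetical helper, not the PR's API; it assumes `indices` selects the paired amplitudes a PauliX would swap):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using CplxType = std::complex<double>;

// Sketch of the suggested special case: skip the PauliX swap entirely when
// the two amplitudes are already equal, since the swap would be a no-op.
void applyPauliXWithSkip(CplxType* state,
                         const std::vector<std::size_t>& indices) {
    if (state[indices[0]] != state[indices[1]])
        std::swap(state[indices[0]], state[indices[1]]);
}
```

Whether the extra comparison pays for itself depends on how often the branch is taken; the trade-off is discussed later in the thread.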

Benchmarked T for 23 qubits as an example:

master

import pennylane as qml

dev = qml.device('lightning.qubit', wires=23)

%timeit dev.apply_lightning(dev._state, [qml.T(wires=[0])])
49.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#85

import pennylane as qml

dev = qml.device('lightning.qubit', wires=23)

%timeit dev.apply_lightning(dev._state, [qml.T(wires=[0])])
19.9 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

default.qubit

import pennylane as qml

dev = qml.device('default.qubit', wires=23)

%timeit dev.apply([qml.T(wires=[0])])
43.9 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Side note for benchmarking: using dev.state instead of dev._state with lightning takes ~10 ms.

shiftedStatePtr[internalIndex] += matrix[baseIndex + j] * inputVector[j];
}
}
gate->applyKernel(shiftedStatePtr, internalIndices, inputVector);

💯

@@ -61,6 +85,10 @@ const vector<CplxType> Pennylane::XGate::matrix{
0, 1,
1, 0 };

void Pennylane::XGate::applyKernel(CplxType* state, vector<size_t>& indices, vector<CplxType>& v) {

Adding const, and making v unnamed if unused, would be a suggestion in the other cases too:

Suggested change
void Pennylane::XGate::applyKernel(CplxType* state, vector<size_t>& indices, vector<CplxType>& v) {
void Pennylane::XGate::applyKernel(CplxType* state, const vector<size_t>& indices, vector<CplxType>&) {

Contributor Author:

I'm okay with adding the const; I'm generally not a fan of removing variable names from method signatures, but I guess opinions may differ on this one. If nothing else, keeping the name makes it clear that it's overriding the same virtual method.

Contributor:

Got it! A reason to make v unnamed would be that, while reading, having the name v was somewhat confusing, as one would expect it to be used within the function (as it is in some of them).

Contributor Author:

True; although I think that's an indication we should rework the API (i.e. we shouldn't be passing in parameters we don't need if we don't have to). As mentioned in the PR description, it's probably beneficial, although it will unfortunately result in some measure of code duplication.

Contributor:

Or if branching off would result in a performance degradation, how about incorporating the internal loop in applyKernel or creating another member function that implements the internal loop and depends on applyKernel? That way inputVector would only be created if used.

Contributor Author:

That would be ideal. However, ultimately I think we want to inline applyKernel. And for that to happen I think we need to duplicate the outer loop for every subclass, because of the requirement to be able to statically infer which method is being called. Again, probably beneficial in the long run, but just a bit painful to look at.
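The trade-off being described can be illustrated with a sketch (all names hypothetical): with the outer loop in the caller, every index group costs a virtual call, whereas duplicating the loop inside each subclass gives the compiler a statically known, inlinable body:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using CplxType = std::complex<double>;

// With the outer loop kept outside, each group pays for runtime dispatch:
//   for (auto& group : indexGroups)
//       gate->applyKernel(state, group, scratch);  // virtual call per group
//
// Duplicating the loop per subclass removes the dispatch entirely, e.g. for
// an XGate (hypothetical free function standing in for the member):
void xGateApplyAll(CplxType* state,
                   const std::vector<std::vector<std::size_t>>& indexGroups) {
    for (const auto& group : indexGroups)
        std::swap(state[group[0]], state[group[1]]);  // statically known body
}
```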

/**
* Generic matrix-multiplication kernel
*/
virtual void applyKernel(CplxType* state, std::vector<size_t>& gateIndices, std::vector<CplxType>& v);
Contributor:

Worth making v more descriptive; it seems to stand for inputVector.

Comment on lines 106 to 108
v[0] = state[indices[0]];
state[indices[0]] = -IMAG * state[indices[1]];
state[indices[1]] = IMAG * v[0];
Contributor:

How come we need to update v[0] here (and not for PauliX for example)? Seems like it's used as a temp for the swap, but it would also have effect afterwards, correct?

Contributor Author:

v is really just scratch space. If we move the outer loop in, we can avoid allocating v altogether, except in the base implementation. The requirement to pass v is one reason why I'm not entirely happy with this interface.
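With the outer loop moved inside the kernel, the PauliY-style update could use a local temporary instead of the shared scratch vector; a sketch under that assumption (hypothetical names; IMAG denotes the imaginary unit i, matching the snippet above):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using CplxType = std::complex<double>;
const CplxType IMAG{0, 1};

// PauliY action on the amplitude pair: new0 = -i * old1, new1 = i * old0.
// A local temporary replaces the scratch vector v, so nothing is allocated.
void applyPauliYLocalTemp(CplxType* state,
                          const std::vector<std::size_t>& indices) {
    const CplxType tmp = state[indices[0]];  // stack temp instead of v[0]
    state[indices[0]] = -IMAG * state[indices[1]];
    state[indices[1]] = IMAG * tmp;
}
```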

Contributor:

Got it!

we can avoid allocating v altogether, except in the base implementation

This would be preferable, as just creating v results in some overhead, correct? Could we branch off at the beginning of the outer loop based on the type of operation?

Contributor Author:

The allocation for v is really insignificant in the grand scheme of things, so I don't think branching is necessary. The first rule of optimisation is to not over-optimise, and I don't think this ever becomes a bottleneck, so it's more a question of code clarity as to which API we adopt.

Comment on lines +272 to +278
, r1(c* std::pow(M_E, CplxType(0, (-phi - omega) / 2)))
, r2(-s * std::pow(M_E, CplxType(0, (phi - omega) / 2)))
, r3(s* std::pow(M_E, CplxType(0, (-phi + omega) / 2)))
, r4(c* std::pow(M_E, CplxType(0, (phi + omega) / 2)))
, matrix{
c * std::pow(M_E, CplxType(0, (-phi - omega) / 2)), -s * std::pow(M_E, CplxType(0, (phi - omega) / 2)),
s * std::pow(M_E, CplxType(0, (-phi + omega) / 2)), c * std::pow(M_E, CplxType(0, (phi + omega) / 2)) }
r1, r2,
r3, r4 }
Contributor:

Nice!
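As a standalone sketch of the precomputation above: std::exp(z) is equivalent to std::pow(M_E, z), and c = cos(theta/2), s = sin(theta/2) are assumed from the surrounding constructor (the function name and return type here are illustrative):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>

using CplxType = std::complex<double>;

// Build the 2x2 Rot(phi, theta, omega) matrix, computing each complex
// exponential once (r1..r4 mirror the precomputed members in the diff).
std::array<CplxType, 4> rotMatrix(double phi, double theta, double omega) {
    const double c = std::cos(theta / 2);
    const double s = std::sin(theta / 2);
    return {
        c * std::exp(CplxType(0, (-phi - omega) / 2)),  // r1
        -s * std::exp(CplxType(0, (phi - omega) / 2)),  // r2
        s * std::exp(CplxType(0, (-phi + omega) / 2)),  // r3
        c * std::exp(CplxType(0, (phi + omega) / 2))    // r4
    };
}
```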

@trbromley (Contributor) left a comment

Thanks @ThomasLoke! This is really exciting! How can we help with this? Perhaps one thing would be to work on adding tests for the kernels.

More generally, we're looking at adding an adjoint method for differentiation in #83. For that, we need to:

  • Apply the inverse operation
  • Apply the generator of an operation, i.e., apply H from e^(i H t).

I was wondering how/whether these would impact the API suggested here. In PennyLane, the name of an inverse gate just has .inv appended at the end (e.g., RX.inv for inverse of RX). Perhaps the first line of constructAndApplyOperation() could remove the .inv and then set a boolean accordingly, then constructGate() would accept that boolean to indicate whether to apply the inverse? We may then have to define applyKernelInverse for each gate.
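The suggested name handling could be sketched as follows (hypothetical helper; the actual constructAndApplyOperation() signature may differ):

```cpp
#include <cassert>
#include <string>

// Strip a trailing ".inv" from a PennyLane gate name (e.g. "RX.inv" -> "RX")
// and report whether the inverse was requested; the caller would pass the
// returned flag on to constructGate().
bool stripInverseSuffix(std::string& name) {
    const std::string suffix = ".inv";
    if (name.size() > suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0) {
        name.erase(name.size() - suffix.size());
        return true;
    }
    return false;
}
```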

For generators of parameterized gates, I imagine we might have a applyGenerator() method?

Anyway, just thinking out loud - these ideas could be included in later iterations.

pennylane_lightning/src/Gates.cpp (thread resolved)
@ThomasLoke (Contributor, Author) left a comment

I can't get to these changes tonight, so it'll have to wait until tomorrow. Or just feel free to take this over and edit it as you see fit...

could it make sense to add some special cases too: e.g., when applying PauliX, we skip the swap if the two vector elements are identical. 🤔 Such cases could potentially arise when there's a more complex algorithm and gates are applied on states which are not known in advance.

Branching statements in the interior of tight loops may well be detrimental to performance due to branch misprediction, so it's probably not worth it unless it's saving a non-trivial amount of effort.

More generally, we're looking at adding an adjoint method for differentiation in #83.

I haven't taken a close look at the PR, but the API you propose does sound sensible to me :)


@ThomasLoke (Author) commented Mar 12, 2021

I've bitten the bullet and moved the outer loop into applyKernel(). I think the API looks cleaner, at the cost of some code duplication, which I can probably live with. I'd speculate that it should perform better due to reduced function-call overhead, and that the compiler has more to work with now that the outer loop no longer calls a runtime-polymorphic function, but it's hard to say what (if any) measurable difference it'll make without benchmarking it properly, which I'd appreciate some help with.

@trbromley (Contributor) left a comment

Approved! Very nice PR!

@trbromley trbromley merged commit 19a655c into PennyLaneAI:master Mar 12, 2021
@trbromley (Contributor):
Thanks @ThomasLoke!

@antalszava (Contributor) commented Mar 12, 2021

Thanks @ThomasLoke! For benchmarking, I created #87, which provides benchmarks for certain gates. It's not meant to be extensive, but could give quick feedback when working on PRs. Thoughts on what else would be important to add in a similar fashion are appreciated.

@trbromley trbromley mentioned this pull request Mar 16, 2021
@antalszava antalszava mentioned this pull request Mar 19, 2021