Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use rust gates for Optimize1QGatesDecomposition #12650

Merged
merged 21 commits into from
Jul 3, 2024

Conversation

mtreinish
Copy link
Member

@mtreinish mtreinish commented Jun 24, 2024

Summary

This commit moves to using rust gates for the Optimize1QGatesDecomposition transpiler pass. It takes in a sequence of runs (which are a list of DAGOpNodes) from the python side of the transpiler pass which are generated from DAGCircuit.collect_1q_runs() (which in the future should be moved to rust after #12550 merges). The rust portion of the pass now iterates over each run, performs the matrix multiplication to compute the unitary of the run, then synthesizes that unitary, computes the estimated error of the circuit synthesis and returns a tuple of the circuit sequence in terms of rust StandardGate enums. The python portion of the code then takes those sequences and does inplace substitution of each run with the sequence returned from rust.

Once #12550 merges we should be able to move the input collect_1q_runs() call and perform the output node substitions in rust making the full pass execute in the rust domain without any python interaction.

Additionally, the OneQubitEulerDecomposer class is updated to use rust for circuit generation instead of doing this python side. The internal changes done to use rust gates in the transpiler pass meant we were half way to this already by emitting rust StandardGates instead of python gate objects. The dag handling is still done in Python however until #12550 merges.

This also includes an implementation of the r gate, I temporarily added this to unblock this effort as it was the only gate missing needed to complete this. We can rebase this if a standalone implementation of the gate merges before this. Rebased on top of: #12662

Details and comments

TODO:

This commit moves to using rust gates for the Optimize1QGatesDecomposition
transpiler pass. It takes in a sequence of runs (which are a list of
DAGOpNodes) from the python side of the transpiler pass which are
generated from DAGCircuit.collect_1q_runs() (which in the future should
be moved to rust after Qiskit#12550 merges). The rust portion of the pass now
iterates over each run, performs the matrix multiplication to compute the
unitary of the run, then synthesizes that unitary, computes the
estimated error of the circuit synthesis and returns a tuple of the
circuit sequence in terms of rust StandardGate enums. The python portion
of the code then takes those sequences and does inplace substitution of
each run with the sequence returned from rust.

Once Qiskit#12550 merges we should be able to move the input collect_1q_runs()
call and perform the output node substitions in rust making the full
pass execute in the rust domain without any python interaction.

Additionally, the OneQubitEulerDecomposer class is updated to use
rust for circuit generation instead of doing this python side. The
internal changes done to use rust gates in the transpiler pass meant we
were half way to this already by emitting rust StandardGates instead of
python gate objects. The dag handling is still done in Python however
until Qiskit#12550 merges.

This also includes an implementation of the r gate, I temporarily added
this to unblock this effort as it was the only gate missing needed to
complete this. We can rebase this if a standalone implementation of the
gate merges before this.
@mtreinish mtreinish added performance Changelog: None Do not include in changelog synthesis Rust This PR or issue is related to Rust code in the repository mod: transpiler Issues and PRs related to Transpiler labels Jun 24, 2024
@mtreinish mtreinish added this to the 1.2.0 milestone Jun 24, 2024
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Cryoris
  • @Qiskit/terra-core
  • @ajavadia
  • @kevinhartman
  • @levbishop
  • @mtreinish

@mtreinish mtreinish changed the title [WIP] Use rust gates for Optimize1QGatesDecomposition Use rust gates for Optimize1QGatesDecomposition Jun 24, 2024
@coveralls
Copy link

coveralls commented Jun 24, 2024

Pull Request Test Coverage Report for Build 9652482316

Details

  • 354 of 400 (88.5%) changed or added relevant lines in 9 files are covered.
  • 18 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.04%) to 89.796%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/euler_one_qubit_decomposer.rs 218 219 99.54%
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 36 38 94.74%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 2 15 13.33%
crates/circuit/src/dag_node.rs 25 39 64.1%
crates/circuit/src/operations.rs 30 46 65.22%
Files with Coverage Reduction New Missed Lines %
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 1 95.35%
crates/qasm2/src/lex.rs 3 93.38%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 4 70.67%
crates/accelerate/src/euler_one_qubit_decomposer.rs 4 91.37%
crates/qasm2/src/parse.rs 6 97.61%
Totals Coverage Status
Change from base Build 9650973845: 0.04%
Covered Lines: 64062
Relevant Lines: 71342

💛 - Coveralls

Previously this PR was re-computing the target bases to synthesize with
for each run found in the circuit. But in cases where there were
multiple runs repeated on a qubit this was unecessary work. Prior to
moving this code to rust there was already caching code to make this
optimization, but the rust path short circuited around this. This commit
fixes this so we're caching the target bases for each qubit and only
computing it once.
@coveralls
Copy link

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9662366170

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 357 of 403 (88.59%) changed or added relevant lines in 9 files are covered.
  • 20 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.03%) to 89.786%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/euler_one_qubit_decomposer.rs 217 218 99.54%
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 40 42 95.24%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 2 15 13.33%
crates/circuit/src/dag_node.rs 25 39 64.1%
crates/circuit/src/operations.rs 30 46 65.22%
Files with Coverage Reduction New Missed Lines %
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 1 95.45%
crates/qasm2/src/expr.rs 1 94.02%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 4 70.67%
crates/qasm2/src/lex.rs 4 91.86%
crates/accelerate/src/euler_one_qubit_decomposer.rs 4 91.35%
crates/qasm2/src/parse.rs 6 97.61%
Totals Coverage Status
Change from base Build 9650973845: 0.03%
Covered Lines: 64056
Relevant Lines: 71343

💛 - Coveralls

@mtreinish
Copy link
Member Author

I took a quick attempt at implementing this with multithreading locally and I could get it to work but it required doing extra collection of all the data from the collect runs into a second vec to strip the PyRef from each node. This added more overhead than parallelism saved in this case. I think it should be possible to investigate doing this in parallel after #12550 merges because then we can generate the runs from rust directly. For multithreading we'll likely still need to collect the runs into a vec (to avoid mutation while iterating) but we can measure if that overhead is higher than the speedup from doing the synthesis in parallel and investigate in a followup.

@mtreinish
Copy link
Member Author

I threw together a quick benchmark to test this:

import statistics
import time

from qiskit.transpiler.passes import Optimize1qGatesDecomposition
from qiskit.circuit.random import random_circuit
from qiskit.converters import circuit_to_dag

seed = 42
n_qubits = 20
depth = 102400

qc = random_circuit(n_qubits, depth, measure=True, seed=seed)
dag = circuit_to_dag(qc)
basis_gates = ['sx', 'x', 'rz', 'cx', 'id']
opt_pass = Optimize1qGatesDecomposition(basis=basis_gates)
times = []
for i in range(10):
    start = time.perf_counter()
    opt_pass.run(dag)
    stop = time.perf_counter()
    times.append(stop - start)
print(f"Mean runtime: {statistics.geometric_mean(times)}")

and it's showing a pretty noticeable speedup over Qiskit 1.1. With this PR I got:

3.931521232593536 sec

and with Qiskit 1.1 I got:

6.184648169533495 sec

Looking at a profile with this script the majority of the time with this PR is spent in collect 1q runs and dag interactions in python.

@coveralls
Copy link

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9666227832

Details

  • 374 of 420 (89.05%) changed or added relevant lines in 9 files are covered.
  • 12 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.08%) to 89.797%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/euler_one_qubit_decomposer.rs 234 235 99.57%
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 40 42 95.24%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 2 15 13.33%
crates/circuit/src/dag_node.rs 25 39 64.1%
crates/circuit/src/operations.rs 30 46 65.22%
Files with Coverage Reduction New Missed Lines %
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 1 95.45%
crates/qasm2/src/lex.rs 3 92.62%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 4 70.67%
crates/accelerate/src/euler_one_qubit_decomposer.rs 4 91.51%
Totals Coverage Status
Change from base Build 9661630219: 0.08%
Covered Lines: 64079
Relevant Lines: 71360

💛 - Coveralls

@coveralls
Copy link

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9667036450

Details

  • 374 of 418 (89.47%) changed or added relevant lines in 9 files are covered.
  • 14 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.09%) to 89.805%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/euler_one_qubit_decomposer.rs 234 235 99.57%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 2 15 13.33%
crates/circuit/src/dag_node.rs 25 39 64.1%
crates/circuit/src/operations.rs 30 46 65.22%
Files with Coverage Reduction New Missed Lines %
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 1 96.92%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 4 70.67%
crates/accelerate/src/euler_one_qubit_decomposer.rs 4 91.51%
crates/qasm2/src/lex.rs 5 92.62%
Totals Coverage Status
Change from base Build 9661630219: 0.09%
Covered Lines: 64083
Relevant Lines: 71358

💛 - Coveralls

@mtreinish
Copy link
Member Author

To compare this against the current state of main I ran asv for the pass specific benchmarks:

Benchmarks that have improved:

| Change   | Before [1ed5951a] <optimize-1q-more-rust~3^2>   | After [130969a3] <optimize-1q-more-rust>   |   Ratio | Benchmark (Parameter)                                                                                         |
|----------|-------------------------------------------------|--------------------------------------------|---------|---------------------------------------------------------------------------------------------------------------|
| -        | 134±1ms                                         | 46.9±0.4ms                                 |    0.35 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(20, 1024, ['rz', 'x', 'sx', 'cx', 'id'])        |
| -        | 99.3±1ms                                        | 34.0±0.2ms                                 |    0.34 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(14, 1024, ['rz', 'x', 'sx', 'cx', 'id'])        |
| -        | 116±0.7ms                                       | 39.0±0.5ms                                 |    0.34 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(20, 1024, ['u', 'cx', 'id'])                    |
| -        | 85.6±0.3ms                                      | 27.9±0.1ms                                 |    0.33 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(14, 1024, ['u', 'cx', 'id'])                    |
| -        | 123±1ms                                         | 41.1±0.5ms                                 |    0.33 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(20, 1024, ['rx', 'ry', 'rz', 'r', 'rxx', 'id']) |
| -        | 92.0±0.5ms                                      | 29.6±0.2ms                                 |    0.32 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(14, 1024, ['rx', 'ry', 'rz', 'r', 'rxx', 'id']) |
| -        | 47.4±0.5ms                                      | 12.6±0.04ms                                |    0.27 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(5, 1024, ['rz', 'x', 'sx', 'cx', 'id'])         |
| -        | 43.6±0.3ms                                      | 10.6±0.05ms                                |    0.24 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(5, 1024, ['rx', 'ry', 'rz', 'r', 'rxx', 'id'])  |
| -        | 40.7±0.4ms                                      | 9.31±0.06ms                                |    0.23 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_decompose(5, 1024, ['u', 'cx', 'id'])                     |

Benchmarks that have stayed the same:

| Change   | Before [1ed5951a] <optimize-1q-more-rust~3^2>   | After [130969a3] <optimize-1q-more-rust>   |   Ratio | Benchmark (Parameter)                                                                                           |
|----------|-------------------------------------------------|--------------------------------------------|---------|-----------------------------------------------------------------------------------------------------------------|
|          | 2.65±0.01s                                      | 2.66±0.02s                                 |    1    | passes.PassBenchmarks.time_optimize_swap_before_measure(20, 1024)                                               |
|          | 231±1ms                                         | 230±1ms                                    |    1    | passes.PassBenchmarks.time_optimize_swap_before_measure(5, 1024)                                                |
|          | 737±10ms                                        | 719±6ms                                    |    0.98 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(14, 1024, ['rx', 'ry', 'rz', 'r', 'rxx', 'id']) |
|          | 1.53±0.01s                                      | 1.51±0s                                    |    0.98 | passes.PassBenchmarks.time_optimize_swap_before_measure(14, 1024)                                               |
|          | 1.14±0.01s                                      | 1.11±0s                                    |    0.97 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(20, 1024, ['rx', 'ry', 'rz', 'r', 'rxx', 'id']) |
|          | 256±2ms                                         | 247±1ms                                    |    0.97 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(5, 1024, ['rx', 'ry', 'rz', 'r', 'rxx', 'id'])  |
|          | 822±10ms                                        | 778±3ms                                    |    0.95 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(14, 1024, ['rz', 'x', 'sx', 'cx', 'id'])        |
|          | 746±7ms                                         | 706±2ms                                    |    0.95 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(14, 1024, ['u', 'cx', 'id'])                    |
|          | 1.26±0.01s                                      | 1.20±0s                                    |    0.95 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(20, 1024, ['rz', 'x', 'sx', 'cx', 'id'])        |
|          | 1.16±0.02s                                      | 1.10±0s                                    |    0.95 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(20, 1024, ['u', 'cx', 'id'])                    |
|          | 298±2ms                                         | 277±1ms                                    |    0.93 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(5, 1024, ['rz', 'x', 'sx', 'cx', 'id'])         |
|          | 260±2ms                                         | 239±1ms                                    |    0.92 | passes.MultipleBasisPassBenchmarks.time_optimize_1q_commutation(5, 1024, ['u', 'cx', 'id'])                     |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@coveralls
Copy link

coveralls commented Jun 26, 2024

Pull Request Test Coverage Report for Build 9678061360

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 375 of 419 (89.5%) changed or added relevant lines in 10 files are covered.
  • 29 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+0.08%) to 89.798%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/euler_one_qubit_decomposer.rs 234 235 99.57%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 2 15 13.33%
crates/circuit/src/dag_node.rs 25 39 64.1%
crates/circuit/src/operations.rs 30 46 65.22%
Files with Coverage Reduction New Missed Lines %
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 1 96.92%
crates/qasm2/src/expr.rs 1 94.02%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 4 70.67%
qiskit/synthesis/discrete_basis/solovay_kitaev.py 4 94.74%
crates/accelerate/src/euler_one_qubit_decomposer.rs 4 91.51%
crates/qasm2/src/lex.rs 5 92.62%
qiskit/providers/fake_provider/generic_backend_v2.py 10 94.71%
Totals Coverage Status
Change from base Build 9661630219: 0.08%
Covered Lines: 64084
Relevant Lines: 71365

💛 - Coveralls

mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 2, 2024
This commit moves to use rust gates for the ConsolidateBlocks transpiler
pass. Instead of generating the unitary matrices for the gates in a 2q
block Python side and passing that list to a rust function this commit
switches to passing a list of DAGOpNodes to the rust and then generating
the matrices inside the rust function directly. This is similar to what
was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster
to get the matrix for standard gates, it also reduces the eager
construction of Python gate objects which was a significant source of
overhead after Qiskit#12459. To that end this builds on the thread of work in
the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for
other passes to minimize eager gate object construction.
@jlapeyre
Copy link
Contributor

jlapeyre commented Jul 2, 2024

Another note for later. This function


has return type PyResult<OneQubitGateSequence>. But as far as I can tell it is only called in one place, and that is in the same file. The returned value is unwrapped with no check. In fact the return value is always ok. This is so the function can be called from Python. (maybe in the future, since I don't see this called anywhere) Reading and understanding the unwrap takes a bit of time.

I'd prefer that the Python-side function (if needed) wraps a function that returns OneQubitGateSequence rather than having this special requirement of the Python interface intrude in code that is only used in this function.

This is a quick fix. But probably outside the scope of this PR.

@jlapeyre
Copy link
Contributor

jlapeyre commented Jul 2, 2024

I left one more suggested simplification edit.

Other than that, this looks good to go.

@coveralls
Copy link

coveralls commented Jul 3, 2024

Pull Request Test Coverage Report for Build 9769535246

Details

  • 347 of 361 (96.12%) changed or added relevant lines in 7 files are covered.
  • 20 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.06%) to 89.846%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/circuit/src/dag_node.rs 37 38 97.37%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 2 15 13.33%
Files with Coverage Reduction New Missed Lines %
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py 1 96.92%
crates/qasm2/src/expr.rs 1 94.02%
qiskit/synthesis/one_qubit/one_qubit_decompose.py 4 70.67%
crates/qasm2/src/lex.rs 4 93.38%
crates/accelerate/src/euler_one_qubit_decomposer.rs 4 93.34%
crates/qasm2/src/parse.rs 6 96.69%
Totals Coverage Status
Change from base Build 9767176201: 0.06%
Covered Lines: 64785
Relevant Lines: 72107

💛 - Coveralls

@jlapeyre jlapeyre enabled auto-merge July 3, 2024 02:54
@jlapeyre jlapeyre added this pull request to the merge queue Jul 3, 2024
Merged via the queue into Qiskit:main with commit bfc69a4 Jul 3, 2024
15 checks passed
@mtreinish mtreinish deleted the optimize-1q-more-rust branch July 3, 2024 10:00
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 3, 2024
This commit moves to use rust gates for the ConsolidateBlocks transpiler
pass. Instead of generating the unitary matrices for the gates in a 2q
block Python side and passing that list to a rust function this commit
switches to passing a list of DAGOpNodes to the rust and then generating
the matrices inside the rust function directly. This is similar to what
was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster
to get the matrix for standard gates, it also reduces the eager
construction of Python gate objects which was a significant source of
overhead after Qiskit#12459. To that end this builds on the thread of work in
the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for
other passes to minimize eager gate object construction.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 3, 2024
In the recently merged Qiskit#12650 a new rust function was added for the
filter function of the collect_1q_runs() method's internal filter
function. As part of this we need to do exclude any non-DAGOpNode dag
nodes passed to the filter function. This was previously implemented
using the pyo3 extract() method so that if the pyobject can be coerced
into a DAGOpNode for later use we continue but if it errors we know the
input object is not a DAGOpNode and return false. However, as we don't
need to return the error to python we should be using downcast instead
of extract (as per the pyo3 performance guide [1]) to avoid the error
conversion cost. This was an oversight in Qiskit#12650 and I only used
extract() out of habit.

[1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 3, 2024
In the recently merged Qiskit#12650 a new rust function was added for the
filter function of the collect_1q_runs() method's internal filter
function. As part of this we need to do exclude any non-DAGOpNode dag
nodes passed to the filter function. This was previously implemented
using the pyo3 extract() method so that if the pyobject can be coerced
into a DAGOpNode for later use we continue but if it errors we know the
input object is not a DAGOpNode and return false. However, as we don't
need to return the error to python we should be using downcast instead
of extract (as per the pyo3 performance guide [1]) to avoid the error
conversion cost. This was an oversight in Qiskit#12650 and I only used
extract() out of habit.

[1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 3, 2024
In the recently merged Qiskit#12650 a new rust function was added for the
filter function of the collect_1q_runs() method's internal filter
function. As part of this we need to do exclude any non-DAGOpNode dag
nodes passed to the filter function. This was previously implemented
using the pyo3 extract() method so that if the pyobject can be coerced
into a DAGOpNode for later use we continue but if it errors we know the
input object is not a DAGOpNode and return false. However, as we don't
need to return the error to python we should be using downcast instead
of extract (as per the pyo3 performance guide [1]) to avoid the error
conversion cost. This was an oversight in Qiskit#12650 and I only used
extract() out of habit.

[1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
github-merge-queue bot pushed a commit that referenced this pull request Jul 3, 2024
In the recently merged #12650 a new rust function was added for the
filter function of the collect_1q_runs() method's internal filter
function. As part of this we need to do exclude any non-DAGOpNode dag
nodes passed to the filter function. This was previously implemented
using the pyo3 extract() method so that if the pyobject can be coerced
into a DAGOpNode for later use we continue but if it errors we know the
input object is not a DAGOpNode and return false. However, as we don't
need to return the error to python we should be using downcast instead
of extract (as per the pyo3 performance guide [1]) to avoid the error
conversion cost. This was an oversight in #12650 and I only used
extract() out of habit.

[1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
jlapeyre added a commit to jlapeyre/qiskit-core that referenced this pull request Jul 8, 2024
This commit moves to use rust gates for the ConsolidateBlocks transpiler
pass. Instead of generating the unitary matrices for the gates in a 2q
block Python side and passing that list to a rust function this commit
switches to passing a list of DAGOpNodes to the rust and then generating
the matrices inside the rust function directly. This is similar to what
was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster
to get the matrix for standard gates, it also reduces the eager
construction of Python gate objects which was a significant source of
overhead after Qiskit#12459. To that end this builds on the thread of work in
the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for
other passes to minimize eager gate object construction.
jlapeyre added a commit to jlapeyre/qiskit-core that referenced this pull request Jul 8, 2024
This commit moves to use rust gates for the ConsolidateBlocks transpiler
pass. Instead of generating the unitary matrices for the gates in a 2q
block Python side and passing that list to a rust function this commit
switches to passing a list of DAGOpNodes to the rust and then generating
the matrices inside the rust function directly. This is similar to what
was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster
to get the matrix for standard gates, it also reduces the eager
construction of Python gate objects which was a significant source of
overhead after Qiskit#12459. To that end this builds on the thread of work in
the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for
other passes to minimize eager gate object construction.
github-merge-queue bot pushed a commit that referenced this pull request Jul 8, 2024
* Use rust gates for ConsolidateBlocks

This commit moves to use rust gates for the ConsolidateBlocks transpiler
pass. Instead of generating the unitary matrices for the gates in a 2q
block Python side and passing that list to a rust function this commit
switches to passing a list of DAGOpNodes to the rust and then generating
the matrices inside the rust function directly. This is similar to what
was done in #12650 for Optimize1qGatesDecomposition. Besides being faster
to get the matrix for standard gates, it also reduces the eager
construction of Python gate objects which was a significant source of
overhead after #12459. To that end this builds on the thread of work in
the two PRs #12692 and #12701 which changed the access patterns for
other passes to minimize eager gate object construction.

* Add rust filter function for DAGCircuit.collect_2q_runs()

* Update crates/accelerate/src/convert_2q_block_matrix.rs

---------

Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 8, 2024
This commit builds off of what Qiskit#12650 did for the 1q decomposer and
moves to using rust gates for the 2q decomposer too. This means that the
circuit sequence generation is using rust's StandardGate representation
directly instead of relying on mapping strings. For places where
circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or
or `TwoQubitBasisDecomposer.__call__` without the `use_dag()` flag) the
entire circuit is generated in Rust and returned to Python.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Jul 8, 2024
This commit builds off of what Qiskit#12650 did for the 1q decomposer and
moves to using rust gates for the 2q decomposer too. This means that the
circuit sequence generation is using rust's StandardGate representation
directly instead of relying on mapping strings. For places where
circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or
or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the
entire circuit is generated in Rust and returned to Python.
github-merge-queue bot pushed a commit that referenced this pull request Jul 23, 2024
* Use Rust gates for 2q unitary synthesis

This commit builds off of what #12650 did for the 1q decomposer and
moves to using rust gates for the 2q decomposer too. This means that the
circuit sequence generation is using rust's StandardGate representation
directly instead of relying on mapping strings. For places where
circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or
or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the
entire circuit is generated in Rust and returned to Python.

* Run cargo fmt and black post rebase
ElePT pushed a commit to mtreinish/qiskit-core that referenced this pull request Jul 24, 2024
* Use Rust gates for 2q unitary synthesis

This commit builds off of what Qiskit#12650 did for the 1q decomposer and
moves to using rust gates for the 2q decomposer too. This means that the
circuit sequence generation is using rust's StandardGate representation
directly instead of relying on mapping strings. For places where
circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or
or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the
entire circuit is generated in Rust and returned to Python.

* Run cargo fmt and black post rebase
Procatv pushed a commit to Procatv/qiskit-terra-catherines that referenced this pull request Aug 1, 2024
* Use rust gates for Optimize1QGatesDecomposition

This commit moves to using rust gates for the Optimize1QGatesDecomposition
transpiler pass. It takes in a sequence of runs (which are a list of
DAGOpNodes) from the python side of the transpiler pass which are
generated from DAGCircuit.collect_1q_runs() (which in the future should
be moved to rust after Qiskit#12550 merges). The rust portion of the pass now
iterates over each run, performs the matrix multiplication to compute the
unitary of the run, then synthesizes that unitary, computes the
estimated error of the circuit synthesis and returns a tuple of the
circuit sequence in terms of rust StandardGate enums. The python portion
of the code then takes those sequences and does inplace substitution of
each run with the sequence returned from rust.

Once Qiskit#12550 merges we should be able to move the input collect_1q_runs()
call and perform the output node substitions in rust making the full
pass execute in the rust domain without any python interaction.

Additionally, the OneQubitEulerDecomposer class is updated to use
rust for circuit generation instead of doing this python side. The
internal changes done to use rust gates in the transpiler pass meant we
were half way to this already by emitting rust StandardGates instead of
python gate objects. The dag handling is still done in Python however
until Qiskit#12550 merges.

This also includes an implementation of the r gate, I temporarily added
this to unblock this effort as it was the only gate missing needed to
complete this. We can rebase this if a standalone implementation of the
gate merges before this.

* Cache target decompositions for each qubit

Previously this PR was re-computing the target bases to synthesize with
for each run found in the circuit. But in cases where there were
multiple runs repeated on a qubit this was unecessary work. Prior to
moving this code to rust there was already caching code to make this
optimization, but the rust path short circuited around this. This commit
fixes this so we're caching the target bases for each qubit and only
computing it once.

* Optimize rust implementation slightly

* Avoid extra allocations by inlining matrix multiplication

* Remove unnecessary comment

* Remove stray code block

* Add import path for rust gate

* Use rust gate in circuit constructor

* Remove duplicated op_name getter and just use existing name getter

* Apply suggestions from code review

Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>

* Simplify construction of target_basis_vec

* Fix rebase issue

* Update crates/accelerate/src/euler_one_qubit_decomposer.rs

Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>

* Update crates/accelerate/src/euler_one_qubit_decomposer.rs

---------

Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>
Procatv pushed a commit to Procatv/qiskit-terra-catherines that referenced this pull request Aug 1, 2024
In the recently merged Qiskit#12650 a new rust function was added for the
filter function of the collect_1q_runs() method's internal filter
function. As part of this we need to do exclude any non-DAGOpNode dag
nodes passed to the filter function. This was previously implemented
using the pyo3 extract() method so that if the pyobject can be coerced
into a DAGOpNode for later use we continue but if it errors we know the
input object is not a DAGOpNode and return false. However, as we don't
need to return the error to python we should be using downcast instead
of extract (as per the pyo3 performance guide [1]) to avoid the error
conversion cost. This was an oversight in Qiskit#12650 and I only used
extract() out of habit.

[1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
Procatv pushed a commit to Procatv/qiskit-terra-catherines that referenced this pull request Aug 1, 2024
* Use rust gates for ConsolidateBlocks

This commit moves to use rust gates for the ConsolidateBlocks transpiler
pass. Instead of generating the unitary matrices for the gates in a 2q
block Python side and passing that list to a rust function this commit
switches to passing a list of DAGOpNodes to the rust and then generating
the matrices inside the rust function directly. This is similar to what
was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster
to get the matrix for standard gates, it also reduces the eager
construction of Python gate objects which was a significant source of
overhead after Qiskit#12459. To that end this builds on the thread of work in
the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for
other passes to minimize eager gate object construction.

* Add rust filter function for DAGCircuit.collect_2q_runs()

* Update crates/accelerate/src/convert_2q_block_matrix.rs

---------

Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>
Procatv pushed a commit to Procatv/qiskit-terra-catherines that referenced this pull request Aug 1, 2024
* Use Rust gates for 2q unitary synthesis

This commit builds off of what Qiskit#12650 did for the 1q decomposer and
moves to using rust gates for the 2q decomposer too. This means that the
circuit sequence generation is using rust's StandardGate representation
directly instead of relying on mapping strings. For places where
circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or
or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the
entire circuit is generated in Rust and returned to Python.

* Run cargo fmt and black post rebase
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: None Do not include in changelog mod: transpiler Issues and PRs related to Transpiler performance Rust This PR or issue is related to Rust code in the repository synthesis
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants