PyO3 performance analysis: function overheads #1607
Comments
Suggested Improvements

Tabulating the above, I get on my machine for the no-args case:
cc @birkenfeld - sorry this took me so long to write up! Feel free to ask for any more information if you're interested in implementing any of the above.

Note that in addition to the

That doesn't support calling by kwarg though, right?

No, I suppose not, so it'd need to be opt-in (or an intentional backwards-incompatible change).

I've submitted #1608 which reduces the overhead of
Mmmm indeed we also don't support

The user could opt in with the positional-only-arguments separator

The positional-only-arguments separator is `/` (only added in Python 3.8).

Thanks, fixed my comment. We could support the separator independent of the Python version it was added in, because we control the argument-parsing code. Issue is tracked at #1439
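For reference, a quick plain-Python illustration of the `/` separator being discussed (PEP 570, Python 3.8+):

```python
def distance(x, y, /):
    """x and y are positional-only thanks to the "/" separator."""
    return (x * x + y * y) ** 0.5

distance(3, 4)        # OK -> 5.0
# distance(x=3, y=4)  # TypeError: x and y are positional-only
```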
@davidhewitt how much work would it be to reproduce the table from #1607 (comment) showing which components take up the time?
Not too much at all, though I didn't script it 😄. Note that I'm now using
Bit surprised to see the argument parsing and panic handling overheads reduced. It's possible that the optimizer is doing a better job at removing side-effect-free code after #1604, so I'm now attributing some of that cost (incorrectly) to

Indeed I wrote a quick benchmark in #1609, which suggested that we've achieved ~17ns of reduced overhead in

Either way, we're looking better! I think that experimenting with the various
As suggested in PyO3#1607. If I ran the benchmarks correctly, this shaves only about 5ns from the noargs case. But since it's relatively easy to do...
So I've hit a little bump while starting to implement fastcall support... The macros for pyfunction etc. have to generate different code depending on whether we are on

So I thought, let's just generate both versions, with the proper

How would you solve that?

Fwiw, I think that is why pyclass and pymethods have a cfg'd import and rename. It's why I had to duplicate the documentation for them. Those cfg's work in

Ugh, @mejrs is right that I used the cfg'd import in

But that approach doesn't really scale and created documentation pain, so I'm thinking about biting the bullet and reversing that decision. For this I think the right solution is probably to introduce a

I'd been hesitating on pushing that branch I'd been playing with because I didn't really want to have a 4th crate for something that seemed like a small user convenience, but if it also makes it possible for us to make better optimizations in the macro codegen then I think it's well worth it. I'll try to finish up and push that branch on Wednesday.

Perhaps to unblock you @birkenfeld, you could just copy pyo3's
Given that we landed a lot of improvements in 0.14 (in particular #1619, thanks @birkenfeld), I'm going to close this particular analysis as it's now quite stale.

I know this is closed, but I still think there's a lot more to do here. I'm using @davidhewitt's "bound" branch and seeing a very large difference between bare-metal code and the elegant API.

@samuelcolvin are you observing that the "bound" branch still has some overheads? That's not a surprise as we won't actually be able to remove the steps to create and remove the GILPool until 0.22. But the data structure itself should remain empty for completely migrated code and be much closer to a noop. It is definitely possible there are some extra optimisations to be found. If you can share some concrete examples of what you're measuring and what the target would be, it helps us establish what might be remaining.

Thanks, I'll create a new issue.

See #3827.
This is a follow-up from #1470 focussing specifically on the `#[pyfunction]` call overheads and how we can reduce the performance gap with CPython.

To keep it simple, I'm starting with the no-arguments, no-return-value case. Optimizations we make for this case can benefit more complex cases.

First I'll demonstrate the sample code I'm using to measure. Second, I'll talk about the underlying code. Finally, I'll suggest improvements.
Sample code
First up, the sample code is trivial: I'm using a no-args pyfunction like this one:
(pyo3/examples/pyo3-benchmarks/src/lib.rs, lines 24 to 25 at 1ae3d87)
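The embedded snippet isn't reproduced above; as a minimal sketch, a no-args `#[pyfunction]` of the kind being measured looks like this (function and module names are illustrative, not necessarily the benchmark crate's exact ones; written against the 0.13-era API):

```rust
use pyo3::prelude::*;

/// Trivial no-arguments, no-return-value function used to measure call overhead.
#[pyfunction]
fn none() {}

#[pymodule]
fn pyo3_benchmarks(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(none, m)?)?;
    Ok(())
}
```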
I install the extension module from the pyo3 root dir:

```bash
cd pyo3
pip install --use-feature=in-tree-build examples/pyo3-benchmarks
```
I can then import and run this in ipython:
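A sketch of that measurement (module and function names as assumed above; the figure is the one discussed below):

```python
In [1]: import pyo3_benchmarks

In [2]: %timeit pyo3_benchmarks.none()
# ~67.8 ns per call on my machine
```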
Comparing this to a pure-python implementation:
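Roughly, the pure-Python baseline:

```python
In [3]: def none_py():
   ...:     return None

In [4]: %timeit none_py()
# ~43 ns per call
```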
67.8ns vs 43ns ... suggests there's some work we can do even in this base case.
Underlying implementation
So the generated code in question is this macro output:
(pyo3/pyo3-macros-backend/src/pyfunction.rs, lines 448 to 460 at 1ae3d87)
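That snippet isn't reproduced above; as a heavily simplified sketch of its shape (identifier names and internal calls are approximations of the 0.13-era internals, not the exact expansion):

```rust
fn none() {} // stand-in for the user's #[pyfunction] body

unsafe extern "C" fn __wrap(
    _slf: *mut pyo3::ffi::PyObject,
    args: *mut pyo3::ffi::PyObject,
    kwargs: *mut pyo3::ffi::PyObject,
) -> *mut pyo3::ffi::PyObject {
    pyo3::callback::handle_panic(move |py| {
        // METH_VARARGS | METH_KEYWORDS hands us a tuple and an optional dict
        let _args = py.from_borrowed_ptr::<pyo3::types::PyTuple>(args);
        let _kwargs: Option<&pyo3::types::PyDict> = py.from_borrowed_ptr_or_opt(kwargs);
        // the argument-parsing helper runs about here, even for zero arguments
        none(); // call the user's function
        // convert the unit return value into a new reference to None
        pyo3::callback::convert(py, ())
    })
}
```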
I'm going to break this code into three parts:

1. The function signature `(PyObject*, PyObject*, PyObject*) -> PyObject*` shows we are using the `METH_VARARGS | METH_KEYWORDS` calling convention. There's several more efficient calling conventions we should consider using: `METH_NOARGS` for the no-arguments case, and (when not limited by the `abi3` feature) we should use `METH_FASTCALL | METH_KEYWORDS` — see the signature sketch just after this list.
2. `pyo3::callback::handle_panic` sets up a `GILPool` for pyo3 owned references and catches panics at the ffi boundary. The biggest change to this function will come from experimental: owned objects API #1308, which will pave the way to no longer need to create a `GILPool`. I don't think it's worth trying to optimize that helper before that PR makes progress.
3. The rest of the body parses arguments and calls the user's function, creating `None` and returning that. This is all Rust code that we can benchmark and optimize. We can accelerate the argument parsing code a lot if we use the `METH_FASTCALL | METH_KEYWORDS` calling convention, because we'll get a raw array of arguments and keyword argument names to work with instead of a Python tuple and dict.

We can get an idea of the rough performance impact of each piece of code by making changes to this macro output.
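The sketch promised above: under `METH_FASTCALL | METH_KEYWORDS` the callee receives a raw argument array rather than a tuple and dict. Its C-level shape, written against `pyo3::ffi` (mirroring CPython's `_PyCFunctionFastWithKeywords` typedef; the body here is illustrative):

```rust
unsafe extern "C" fn fastcall_shape(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *const *mut pyo3::ffi::PyObject, // positional + keyword argument values
    _nargs: pyo3::ffi::Py_ssize_t,          // number of positional arguments
    _kwnames: *mut pyo3::ffi::PyObject,     // tuple of keyword names, or NULL
) -> *mut pyo3::ffi::PyObject {
    // A no-args function would just check _nargs == 0 and _kwnames is
    // NULL/empty, then return a new reference to None:
    let none = pyo3::ffi::Py_None();
    pyo3::ffi::Py_INCREF(none);
    none
}
```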
First, let's make the contents of the body just return a reference to `None`, without `handle_panic` or the argument parsing code:
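A sketch of that stripped-down trampoline (at the ffi layer, "returning `None`" means handing back an increfed `Py_None`; names illustrative):

```rust
unsafe extern "C" fn __wrap(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject,
) -> *mut pyo3::ffi::PyObject {
    // No handle_panic, no GILPool, no argument parsing: just return
    // a new reference to None.
    let none = pyo3::ffi::Py_None();
    pyo3::ffi::Py_INCREF(none);
    none
}
```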
If I re-install the module and measure again, I get 24.8ns - that's even faster than the plain Python code! This is exciting; it shows us that we can get a lot faster. It also shows that we're spending about 44ns of time inside the Rust function body.
Ok, so let's make one more change, which is to put `handle_panic` back. This'll help tease apart how much is `handle_panic` (PyO3 framework overhead), and how much is PyO3's argument parsing.
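Sketched, assuming the 0.13-era `handle_panic` shape (a closure from `Python` to a `PyResult` of an owned ffi pointer; the exact trait bounds are an assumption):

```rust
use pyo3::IntoPyPointer; // for .into_ptr()

unsafe extern "C" fn __wrap(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject,
) -> *mut pyo3::ffi::PyObject {
    // handle_panic restored (GILPool + panic guard); argument parsing still removed.
    pyo3::callback::handle_panic(|py| Ok(py.None().into_ptr()))
}
```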
With this change we're back up to 60ns. So the argument parsing code is about 7ns, and the remaining 37ns is down to `handle_panic`.

If I tweak `handle_panic` to use `Python::assume_gil_acquired()` instead of `GILPool::new().python()`, I get back to 28ns. So unfortunately a lot of the slowdown is caused by creation of a `GILPool`, meaning we need #1308 to resolve it...

Run out of time now, will post suggested actions for interested contributors in a second message later or tomorrow...