Class creation is slow #661
Comments
Could you post your Rust build commands and the commands you used to benchmark this? Thanks |
Sure! I just used:

```python
import example

def test_class_creation_rust(benchmark):
    benchmark(example.Example)
```

This calls `example.Example` repeatedly under pytest-benchmark. |
Thank you for investigating that, but... |
Ah I see, that's too bad :/ I was under the assumption pyo3 was zero cost. Thank you for the information! |
Why does pyo3 track allocations and use PyRef rather than just increment and decrement the PyObject's reference count? This decision might predate everyone maintaining it now, but does anyone know? |
@ijl Now I'm too drunk to do a more detailed explanation..., but I'll do it soon, and I want to rethink this problem. |
I was able to reproduce this in IPython.
What sort of overhead is there that cython doesn't seem to require? |
@ethanhs we've made steps to improve this in recent releases. My guess is we're still not as fast as Cython, but hopefully we're closer, and maybe we'll find the next steps someday soon! |
@davidhewitt I ended up making a quick repo of my test for reproducibility: https://github.com/ethanhs/clscreate. The mean times I got on my Ryzen 9 3900X on Windows with nightly 2020-05-12:

Edit: I realized I should probably also try a non-static class, and other things which may get optimized in unrealistic ways by C but not in Rust, but I don't really have the time atm. |
Oh man, it's a bit disappointing that class creation is even slower than native Python classes. I mean, it's not surprising that they're also well optimised, but still... We shall definitely have to look at improvements here :) |
Yeah, it is rather unfortunate. If I remember correctly, when I initially posted it was also the case that PyO3 was slower than Python, so at least it isn't a regression. I look forward to the improvements to come! :) |
Just as an update to this: I updated the benchmarks. On my Core i7-10750H, with Python 3.9.5, I now get mean times of: Cython 52ns ... so we're now 2x Cython rather than 5x, and we're notably faster than plain Python! 🎉 I'm going to close this issue; also, we track a benchmark at
|
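For anyone wanting to repeat this kind of measurement without pytest-benchmark, here is a minimal sketch using the stdlib `timeit` module. The `Example` class is a plain-Python stand-in for the compiled extension type, so the absolute numbers will differ from the figures quoted above:

```python
import timeit

class Example:
    """Plain-Python stand-in for the compiled #[pyclass] type."""
    pass

# timeit calls Example() `number` times and returns the total seconds elapsed
n = 1_000_000
seconds = timeit.timeit(Example, number=n)
print(f"{seconds / n * 1e9:.0f} ns per instantiation")
```

Swapping `Example` for the imported extension class reproduces the comparison from the thread.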
I'm having a similar performance issue with a custom wrapper around a tuple (code below). It turned out so ridiculously slow that it was not funny: in a call to a Rust function that consumes a PyPoint instance and returns a PyPoint instance, the wrapper is the major bottleneck, eating about 95% of the execution time of the whole function. I've done some digging, and these are the results:

```rust
use std::ops::Deref;

use pyo3::exceptions::PyIndexError;
use pyo3::prelude::*;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }

    fn __repr__(&self) -> String {
        format!("[{}, {}]", self.0 .0, self.0 .1)
    }

    fn __getitem__(&self, i: usize) -> PyResult<f64> {
        match self.as_list().get(i) {
            // IndexError (not KeyError) so that Python's sequence
            // iteration protocol terminates correctly on out-of-range.
            None => Err(PyIndexError::new_err("index out of range")),
            Some(v) => Ok(*v),
        }
    }

    fn as_list(&self) -> [f64; 2] {
        [self.0 .0, self.0 .1]
    }

    fn as_tuple(&self) -> (f64, f64) {
        (self.0 .0, self.0 .1)
    }
}

impl Deref for PyPoint {
    type Target = Point;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

#[pyfunction]
fn noop_tuple(x: (f64, f64), y: (f64, f64)) -> Option<((f64, f64), (f64, f64))> {
    Some((x, y))
}
```

```python
import time

import o3  # the compiled extension module

class MyTuple:
    __slots__ = ['x', 'y']

    def __init__(self, d):
        self.x, self.y = d

    def as_tuple(self):
        return self.x, self.y

def bench():
    for name, fn in {
        "tuple_wrap": lambda _p1, _p2, _: [tuple(r) for r in [_p1, _p2]],
        "rs_function": lambda _p1, _p2, _: [tuple(r) for r in o3.noop_tuple(_p1, _p2)],
        "rs_class": lambda _p1, _p2, _: [tuple(r.as_tuple()) for r in [o3.PyPoint(_p1), o3.PyPoint(_p2)]],
        "py_class": lambda _p1, _p2, _: [tuple(r.as_tuple()) for r in [MyTuple(_p1), MyTuple(_p2)]],
    }.items():
        t1 = time.time()
        for N in range(10000):
            p1, p2 = (N, N), (N, N)
            _ = fn(p1, p2, 1.0)
        t2 = time.time()
        print(f"{name}: {(t2 - t1) * 1000} ms")
```
 |
@alexpyattaev thanks for challenging this. You're correct to say that PyO3 class creation is roughly the same speed as Python class creation (in some cases we are faster, in some slower). We are limited by needing to create a Python object and all the machinery that requires. I do hope that one day we can find optimisations that make us consistently faster than Python, but I don't think this will ever be an orders-of-magnitude difference.

I think you have been a bit too fast to conclude that tuples are always better. Your example has a lot going on, so to make it easier to compare I've reduced it to just three functions. |
Code at the bottom. Some example numbers (Mac M1, Python 3.10.6, Rust 1.66.0-beta.1):

```
$ python test.py
py_tuples: 2.0890235900878906 ms
rs_tuples: 3.6208629608154297 ms
rs_class: 3.233194351196289 ms
```

While there is some variance, with this measurement I see that creating `PyPoint` instances is comparable to (if not slightly faster than) converting to Rust tuples. You are correct that there is overhead converting between Python and Rust, and our user guide does discuss this. I still defend that it is better to use
```rust
use pyo3::prelude::*;
use pyo3::types::PyTuple;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }
}

#[pyfunction]
fn takes_py_tuples<'py>(x: &'py PyTuple, y: &'py PyTuple) -> (&'py PyTuple, &'py PyTuple) {
    (x, y)
}

#[pyfunction]
fn takes_rs_tuples(x: (f64, f64), y: (f64, f64)) -> ((f64, f64), (f64, f64)) {
    (x, y)
}

#[pyfunction]
fn takes_rs_tuples_makes_class(x: (f64, f64), y: (f64, f64)) -> (PyPoint, PyPoint) {
    (PyPoint::new(x), PyPoint::new(y))
}

#[pymodule]
fn pyo3_scratch(_: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<PyPoint>()?;
    m.add_function(wrap_pyfunction!(takes_py_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples_makes_class, m)?)?;
    Ok(())
}
```
```python
import time

import pyo3_scratch

def bench():
    for name, fn in {
        "py_tuples": lambda _p1, _p2: pyo3_scratch.takes_py_tuples(_p1, _p2),
        "rs_tuples": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples(_p1, _p2),
        "rs_class": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples_makes_class(_p1, _p2),
    }.items():
        t1 = time.time()
        for N in range(10000):
            p1, p2 = (N, N), (N, N)
            _ = fn(p1, p2)
        t2 = time.time()
        print(f"{name}: {(t2 - t1) * 1000} ms")
        assert len(_) == 2

if __name__ == "__main__":
    bench()
```
 |
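A side note on methodology: `time.time` has coarse resolution on some platforms. A sketch of the same harness using `time.perf_counter` and the minimum over several repeats (which tends to reduce scheduler noise) might look like the following; the pure-Python `takes_rs_tuples` stand-in is hypothetical, so the snippet runs without the extension module:

```python
import time

def takes_rs_tuples(x, y):
    # pure-Python stand-in for the compiled extension function
    return (x, y)

def bench_one(fn, repeats=5, n=10_000):
    """Return the best (minimum) wall time in milliseconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t1 = time.perf_counter()
        for i in range(n):
            fn((i, i), (i, i))
        best = min(best, time.perf_counter() - t1)
    return best * 1000

print(f"rs_tuples: {bench_one(takes_rs_tuples):.3f} ms")
```

Replacing the stand-in with `pyo3_scratch.takes_rs_tuples` gives the compiled-module measurement.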
Well, now we are diving further into the rabbit hole =) A few notes/updates to your suggested benchmark:
Below are the timings with updated tests:
TL;DR of the results so far is:
Bottom line: I think the docs should be patched to include this result (or maybe just drop this code straight into Examples? Not sure what would be best, please advise). There does not appear to be any performance loss in PyO3 itself (at least nothing major).

```rust
use pyo3::prelude::*;
use pyo3::types::PyTuple;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }
}

#[pyfunction]
fn takes_py_tuples<'py>(
    py: Python<'py>,
    x: &'py PyTuple,
    y: &'py PyTuple,
) -> PyResult<&'py PyTuple> {
    let xx: (f64, f64) = x.extract()?;
    let yy: (f64, f64) = y.extract()?;
    let r1 = PyTuple::new(py, [xx.0 + 1.0, yy.1 + 1.0]);
    let r2 = PyTuple::new(py, [xx.1 + 1.0, yy.0 + 1.0]);
    Ok(PyTuple::new(py, [r1, r2]))
}

#[pyfunction]
fn takes_rs_tuples(x: (f64, f64), y: (f64, f64)) -> ((f64, f64), (f64, f64)) {
    ((x.0 + 1.0, y.1 + 1.0), (x.1 + 1.0, y.0 + 1.0))
}

#[pyfunction]
fn takes_rs_tuples_makes_class(x: (f64, f64), y: (f64, f64)) -> (PyPoint, PyPoint) {
    (
        PyPoint::new((x.0 + 1.0, y.1 + 1.0)),
        PyPoint::new((x.1 + 1.0, y.0 + 1.0)),
    )
}

#[pyfunction]
fn takes_rs_class_makes_class(x: &PyPoint, y: &PyPoint) -> (PyPoint, PyPoint) {
    (
        PyPoint(Point(x.0 .0 + 1.0, y.0 .1 + 1.0)),
        PyPoint(Point(x.0 .1 + 1.0, y.0 .0 + 1.0)),
    )
}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<PyPoint>()?;
    m.add_function(wrap_pyfunction!(takes_py_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples_makes_class, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_class_makes_class, m)?)?;
    Ok(())
}
```

```python
import time

import pyo3_scratch

# shortcut bindings to avoid repeated attribute lookups
PyPoint = pyo3_scratch.PyPoint
takes_rs_class_makes_class = pyo3_scratch.takes_rs_class_makes_class
takes_rs_tuples_makes_class = pyo3_scratch.takes_rs_tuples_makes_class
takes_rs_tuples = pyo3_scratch.takes_rs_tuples

def py_fn(_p1, _p2):
    # this is the benchmark logic; yes, it is silly
    return (_p1[0] + 1.0, _p2[1] + 1.0), (_p1[1] + 1.0, _p2[0] + 1.0)

def bench():
    B = 10
    for name, fn in {
        "Just lambda, no work": lambda _p1, _p2: (_p1, _p2),
        "Normal Python function": lambda _p1, _p2: py_fn(_p1, _p2),
        "PyTuple in and out": lambda _p1, _p2: pyo3_scratch.takes_py_tuples(_p1, _p2),
        "Rust tuple in and out": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples(_p1, _p2),
        "Rust tuple in and out, local": lambda _p1, _p2: takes_rs_tuples(_p1, _p2),
        "Rust tuple in, pyclass out": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples_makes_class(_p1, _p2),
        "Rust tuple in, pyclass out, local": lambda _p1, _p2: takes_rs_tuples_makes_class(_p1, _p2),
        "pyclass in, pyclass out": lambda _p1, _p2: pyo3_scratch.takes_rs_class_makes_class(pyo3_scratch.PyPoint(_p1), pyo3_scratch.PyPoint(_p2)),
        "pyclass in, pyclass out, fast": lambda _p1, _p2: takes_rs_class_makes_class(PyPoint(_p1), PyPoint(_p2)),
    }.items():
        times = []
        for _ in range(B):
            t1 = time.time()
            for N in range(10000):
                p1, p2 = (N, N), (N, N)
                _ = fn(p1, p2)
            t2 = time.time()
            times.append(t2 - t1)
        mean = sum(times) / B
        # population variance; take the square root so the spread is
        # reported in the same units as the mean
        std = (sum((t - mean) ** 2 for t in times) / B) ** 0.5
        print(f"{name:35}: {mean * 1000:.3f} ms, +-{std * 1000:.3f} ms")
        assert len(_) == 2

if __name__ == "__main__":
    bench()
```
 |
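Hand-rolled mean/spread bookkeeping like that in the harness above can also be delegated to the stdlib `statistics` module, which reports a standard deviation in the same units as the mean. A small sketch with hypothetical per-run timings:

```python
import statistics

# hypothetical per-run timings in milliseconds
times_ms = [2.11, 2.09, 2.15, 2.08, 2.12]

mean = statistics.mean(times_ms)
stdev = statistics.pstdev(times_ms)  # population standard deviation
print(f"{mean:.3f} ms, +-{stdev:.3f} ms")  # → 2.110 ms, +-0.024 ms
```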
Thanks for the further exploration, and sorry for my slow reply. This shows that we can be fast to make |
My environment is rustc 1.40, python 3.8, Windows 10.
💥 Reproducing

Make a `pyclass` and benchmark it!

I realize this could be a quirk of my current setup, but I was doing some benchmarking and realized that it can take up to 440ns to create a Python class object. This is fine in isolation, but when performance matters and you are making a lot of these objects (or want to make them as fast as possible), it matters. I created a simple Cython test file to compare, and it was more than 2x faster at creating the class (~200ns).
Here is the sample Cython code; you can compile an extension from it with `cythonize -b test.pyx`. The Rust code is the obvious equivalent.
I realize this is probably not the biggest priority, but it seems odd that class creation would be so much slower.