
Class creation is slow #661

Closed
emmatyping opened this issue Nov 6, 2019 · 17 comments

@emmatyping (Contributor)

My environment is rustc 1.40, python 3.8, Windows 10.

💥 Reproducing

Make a pyclass and benchmark it!

I realize this could be a quirk of my current setup, but I was doing some benchmarking and realized that it can take up to 440ns to create a Python class object. This is fine in isolation, but it adds up when performance matters and you are creating a lot of these objects (or want to create them as fast as possible). I created a simple Cython test file to compare, and it was more than 2x faster at creating the class (~200ns).

Here is the sample Cython code:

# cython: language_level=3
cdef class Example:
    def __cinit__(self):
        self.a = 4
        self.b = "Test"
    cdef public long a
    cdef public str b

You can compile an extension from it with cythonize -b test.pyx.

The Rust code is the obvious equivalent:

use pyo3::prelude::*;

#[pyclass]
pub struct Example {
    #[pyo3(get, set)]
    pub a: i64,
    #[pyo3(get, set)]
    pub b: String,
}

#[pymethods]
impl Example {
    #[new]
    fn new(obj: &PyRawObject) {
        obj.init(Example {
            a: 4,
            b: String::from("Hello"),
        });
    }
}

#[pymodule]
fn example(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Example>()?;

    Ok(())
}

I realize this is probably not the biggest priority, but it seems odd that class creation would be so much slower.

@ehiggs

ehiggs commented Nov 6, 2019

Could you post your Rust build commands and the commands you used to benchmark this?

Thanks

@emmatyping (Contributor, Author)

Sure! I just used cargo build --release (I'd have used target-cpu=native, but I get a heap-corruption ICE if I do that). Then I wrote a simple benchmark using pytest-benchmark, like the following:

import example
def test_class_creation_rust(benchmark):
    benchmark(example.Example)

This calls example.Example(), which is the best benchmark I could think of atm.
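
For completeness, a sketch of the full comparison file (module names assumed from the snippets above: the Cython build as test, the Rust extension as example) - run with pytest, with pytest-benchmark installed:

# bench_test.py - a sketch; module names are assumptions from the snippets above
import example  # the Rust extension
import test     # the Cython module built with `cythonize -b test.pyx`


def test_class_creation_rust(benchmark):
    benchmark(example.Example)


def test_class_creation_cython(benchmark):
    benchmark(test.Example)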

@kngwyu (Member)

kngwyu commented Nov 7, 2019

Thank you for investigating that, but...
PyO3 has some fundamental overhead in its core, so I'm not sure we can fix it 🤔

@emmatyping (Contributor, Author)

Ah I see, that's too bad :/

I was under the assumption pyo3 was zero cost. Thank you for the information!

@ijl (Contributor)

ijl commented Nov 7, 2019

Why does pyo3 track allocations and use PyRef rather than just increment and decrement the PyObject's reference count? This decision might predate everyone maintaining it now, but does anyone know?

@kngwyu (Member)

kngwyu commented Nov 7, 2019

@ijl
I think we had some discussions about this... but for now I could only find this comment:
#271 (comment)

Now I'm too drunk to give a more detailed explanation, but I'll do it soon, and I want to rethink this problem.
If we can remove the ReleasePool, that would be really cool.

@ehiggs

ehiggs commented Nov 13, 2019

I was able to reproduce this in IPython. In the following, examplec is the Cython module.

$ ipython
Python 3.7.5rc1 (default, Oct  8 2019, 16:47:45) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import examplec                                                                                            

In [2]: %timeit examplec.Example()                                                                                 
68.7 ns ± 0.315 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: import example                                                                                             

In [4]: %timeit example.Example()                                                                                  
224 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

> PyO3 has some fundamental overhead in its core, so I'm not sure we can fix it 🤔

What sort of overhead is there that Cython doesn't seem to require?

@kngwyu (Member)

kngwyu commented Dec 1, 2019

@ehiggs
Sorry for the delay, but I wrote an explanation in #679

@davidhewitt (Member)

@ethanhs we've made steps to improve this in 0.10.1 - I'd be curious whether you'd be interested in rerunning your benchmarks to see what the improvement is like?

My guess is we're still not as fast as Cython, but hopefully we're closer, and maybe we'll find the next steps someday soon!

@emmatyping (Contributor, Author)

emmatyping commented May 21, 2020

@davidhewitt I ended up making a quick repo of my test for reproducibility. https://github.com/ethanhs/clscreate.

The mean times I got on my Ryzen 9 3900x on Windows with nightly 2020-05-12:
Cython: 75.9655 ns
Python: 247.2162 ns
PyO3: 354.3962 ns

Edit: I realized I should probably also try a non-static class, and other things which C may optimize in unrealistic ways but Rust does not - but I don't really have the time atm.

@davidhewitt (Member)

Oh man, it's a bit disappointing that class creation is even slower than native Python classes. I mean, it's not surprising that they're also well optimised, but still...

Definitely we shall have to look at improvements here :)

@emmatyping (Contributor, Author)

emmatyping commented May 22, 2020

Yeah, it is rather unfortunate. If I remember correctly, PyO3 was also slower than Python when I initially posted, so at least it isn't a regression.

I look forward to improvements to come! :)

@davidhewitt (Member)

Just as an update to this, I updated the clscreate test to PyO3 0.14.5 - https://github.com/davidhewitt/clscreate

On my Core i7-10750H, with Python 3.9.5, now I get mean times of:

Cython 52ns
PyO3 107ns
Python 174ns

... so we're now 2x Cython rather than 5x, and we're notably faster than plain Python! 🎉

I'm going to close this issue; we also track a class-creation benchmark in the PyO3 repo:

benchmark(pyo3_benchmarks.EmptyClass)

(though we don't yet compare against Cython).
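
For reference, a minimal sketch of what that tracked benchmark looks like as a pytest-benchmark test (pyo3_benchmarks being the extension module built by PyO3's benchmark suite; treat the exact names as illustrative):

# sketch: pytest-benchmark times repeated EmptyClass() construction
import pyo3_benchmarks


def test_empty_class_init(benchmark):
    benchmark(pyo3_benchmarks.EmptyClass)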

@alexpyattaev (Contributor)

I'm having a similar performance issue with a custom wrapper around a tuple (code below).

It turned out to be ridiculously slow: in a Rust function that consumes a PyPoint instance and returns a PyPoint instance, the wrapping is a major bottleneck, eating about 95% of the execution time of the whole function. I've done some digging, and these are the results:

  • I've made another function that takes "raw" tuples instead of PyPoint instances, and it appears that just unwrapping and wrapping tuples is about 50% of the cost of creating and destructuring PyPoint instances.
  • I've made a Python class (MyTuple, also below) to compare, and it has about the same creation and destructuring cost as the Rust class, so it does not appear that PyO3 is doing anything alarmingly wrong.
  • This is still a massive pitfall, and if it is impossible to avoid, maybe the docs should say that pyclass should be avoided wherever possible?
  • This does, of course, open a Pandora's box of typing problems across the Python/Rust boundary, as wrapper classes provide a very nice way to ensure Python code cannot feed random junk to Rust code, and with plain tuples that is trickier.
lib.rs

use std::ops::Deref;

use pyo3::exceptions::PyKeyError;
use pyo3::prelude::*;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }

    fn __repr__(&self) -> String {
        format!("[{}, {}]", self.0.0, self.0.1)
    }

    fn __getitem__(&self, i: usize) -> PyResult<f64> {
        match self.as_list().get(i) {
            None => Err(PyKeyError::new_err("index out of range")),
            Some(v) => Ok(*v),
        }
    }

    fn as_list(&self) -> [f64; 2] {
        [self.0.0, self.0.1]
    }

    fn as_tuple(&self) -> (f64, f64) {
        (self.0.0, self.0.1)
    }
}

impl Deref for PyPoint {
    type Target = Point;
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

#[pyfunction]
fn noop_tuple(x: (f64, f64), y: (f64, f64)) -> Option<((f64, f64), (f64, f64))> {
    Some((x, y))
}

// module registration (#[pymodule]) omitted

bench.py

import time

import o3  # the compiled Rust extension module


class MyTuple:
    __slots__ = ['x', 'y']

    def __init__(self, d):
        self.x, self.y = d

    def as_tuple(self):
        return self.x, self.y


def bench():
    for name, fn in {
        "tuple_wrap": lambda _p1, _p2, _: [tuple(r) for r in [_p1, _p2]],
        "rs_function": lambda _p1, _p2, _: [tuple(r) for r in o3.noop_tuple(_p1, _p2)],
        "rs_class": lambda _p1, _p2, _: [tuple(r.as_tuple()) for r in [o3.PyPoint(_p1), o3.PyPoint(_p2)]],
        "py_class": lambda _p1, _p2, _: [tuple(r.as_tuple()) for r in [MyTuple(_p1), MyTuple(_p2)]],
    }.items():
        t1 = time.time()
        for N in range(10000):
            p1, p2 = (N, N), (N, N)
            _ = fn(p1, p2, 1.0)
        t2 = time.time()
        print(f"{name}: {(t2 - t1) * 1000} ms")


if __name__ == "__main__":
    bench()

@davidhewitt (Member)

@alexpyattaev thanks for challenging this.

You're correct to say that PyO3 class creation is at roughly the same speed as Python class creation (in some cases we are faster, in some cases we are slower). We are limited by needing to create a Python object and all the machinery that requires. I do hope that one day we can find optimisations that mean we are consistently faster than Python, but I don't think it will ever be an orders-of-magnitude difference.

I think you have been a bit too fast to conclude that tuples are always better. Your example has a lot going on, so to make it easier to compare I've reduced it to just three #[pyfunction]s:

  • takes_py_tuples - the simplest case which doesn't create or destroy any Python objects
  • takes_rs_tuples - a variant of the above which converts the Python tuples into Rust tuples and back
  • takes_rs_tuples_makes_class - a variant of the above which converts the Python tuples into Rust tuples and then returns a tuple of PyPoint

Code at the bottom. Some example numbers (Mac M1, Python 3.10.6, rust 1.66.0-beta.1):

$ python test.py
py_tuples: 2.0890235900878906 ms
rs_tuples: 3.6208629608154297 ms
rs_class: 3.233194351196289 ms

While there is some variance, with this measurement I see that creating PyPoint from Rust is slightly faster than creating Python tuples.

You are correct that there is overhead converting between Python <-> Rust, and our user guide does discuss this. I still maintain that it is better to use #[pyclass] than raw tuples for most use cases - you have acknowledged the typing benefit yourself above. There will always be exceptions, and in some cases #[pyclass] may be slower; whichever is faster, it's always best to avoid creating and destroying Python objects in a hot loop if you can restructure your algorithm to avoid it.


lib.rs

use pyo3::prelude::*;
use pyo3::types::PyTuple;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }
}

#[pyfunction]
fn takes_py_tuples<'py>(x: &'py PyTuple, y: &'py PyTuple) -> (&'py PyTuple, &'py PyTuple) {
    (x, y)
}

#[pyfunction]
fn takes_rs_tuples(x: (f64, f64), y: (f64, f64)) -> ((f64, f64), (f64, f64)) {
    (x, y)
}

#[pyfunction]
fn takes_rs_tuples_makes_class(x: (f64, f64), y: (f64, f64)) -> (PyPoint, PyPoint) {
    (PyPoint::new(x), PyPoint::new(y))
}

#[pymodule]
fn pyo3_scratch(_: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<PyPoint>()?;
    m.add_function(wrap_pyfunction!(takes_py_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples_makes_class, m)?)?;
    Ok(())
}

test.py

import time

import pyo3_scratch


def bench():
    for name, fn in {
        "py_tuples": lambda _p1, _p2: pyo3_scratch.takes_py_tuples(_p1, _p2),
        "rs_tuples": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples(_p1, _p2),
        "rs_class": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples_makes_class(_p1, _p2),
    }.items():
        t1 = time.time()
        for N in range(10000):
            p1, p2 = (N, N), (N, N)
            _ = fn(p1, p2)
        t2 = time.time()
        print(f"{name}: {(t2 - t1) * 1000} ms")
        assert len(_) == 2


if __name__ == "__main__":
    bench()

@alexpyattaev (Contributor)

Well, now we are diving further down the rabbit hole =) A few notes/updates on your suggested benchmark:

  1. Taking &'py PyTuple and returning it right away is not a fair comparison, as in order to actually use its contents you still need to do some extra work. I've updated all the functions to do "useful work" which requires destructuring the Python objects; I think it levels the playing field a bit. With that refinement, dealing in PyTuple is no longer that beneficial compared to a Rust tuple (as it really should be the same thing).
  2. I recalled that Python has a significant namespace-lookup overhead, and binding names locally has a massive impact when making plenty of class instances to feed input values into functions. I've added appropriate benchmarks to illustrate that (as well as a baseline pure-Python implementation); a short sketch of the trick follows this list.
  3. Just measuring means did not feel right anymore, so I added the mean squared error too.
  4. SW/HW specs: rustc 1.64.0 (a55dd71d5 2022-09-19), AMD Ryzen 3500U; YMMV.
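
To show point 2 in isolation (a sketch - the "local" variants in the benchmark below apply the same trick inside the timing loop):

# bind the constructor to a local name once, instead of resolving
# pyo3_scratch.PyPoint afresh on every call in the hot loop
import pyo3_scratch

PyPoint = pyo3_scratch.PyPoint  # one attribute lookup, done up front
points = [PyPoint((float(i), float(i))) for i in range(10_000)]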

Below are the timings with updated tests:

Just lambda, no work               : 2.701 ms, +-0.013 us
Normal Python function             : 7.163 ms, +-0.044 us
PyTuple in and out                 : 8.640 ms, +-0.183 us
Rust tuple in and out              : 8.490 ms, +-0.364 us
Rust tuple in and out, local       : 7.404 ms, +-0.269 us
Rust tuple in, pyclass out         : 6.901 ms, +-0.064 us
Rust tuple in, pyclass out, local  : 5.997 ms, +-0.049 us
pyclass in, pyclass out            : 13.058 ms, +-0.424 us
pyclass in, pyclass out, fast      : 11.085 ms, +-0.123 us

TL;DR of the results so far:

  • When targeting the lowest possible overhead, Rust tuples seem to be the preferred approach (at least on my hardware; to test on yours, run the bench).
  • pyclass as a return type is fine, i.e. there is no significant extra cost compared to just returning a tuple.
  • If possible, local bindings should be used, especially when passing pyclass instances as input parameters.

Bottom line: I think the docs should be patched to include this result (or maybe just drop this code straight into the examples? Not sure what would be best - please advise). There does not appear to be any performance loss in PyO3 itself (at least nothing major).

lib.rs

use pyo3::prelude::*;
use pyo3::types::PyTuple;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }
}

#[pyfunction]
fn takes_py_tuples<'py>(
    py: pyo3::Python<'py>,
    x: &'py PyTuple,
    y: &'py PyTuple,
) -> PyResult<&'py PyTuple> {
    let xx: (f64, f64) = x.extract()?;
    let yy: (f64, f64) = y.extract()?;
    let r1 = PyTuple::new(py, [xx.0 + 1.0, yy.1 + 1.0]);
    let r2 = PyTuple::new(py, [xx.1 + 1.0, yy.0 + 1.0]);
    Ok(PyTuple::new(py, [r1, r2]))
}

#[pyfunction]
fn takes_rs_tuples(x: (f64, f64), y: (f64, f64)) -> ((f64, f64), (f64, f64)) {
    ((x.0 + 1.0, y.1 + 1.0), (x.1 + 1.0, y.0 + 1.0))
}

#[pyfunction]
fn takes_rs_tuples_makes_class(x: (f64, f64), y: (f64, f64)) -> (PyPoint, PyPoint) {
    (
        PyPoint::new((x.0 + 1.0, y.1 + 1.0)),
        PyPoint::new((x.1 + 1.0, y.0 + 1.0)),
    )
}

#[pyfunction]
fn takes_rs_class_makes_class(x: &PyPoint, y: &PyPoint) -> (PyPoint, PyPoint) {
    (
        PyPoint(Point(x.0 .0 + 1.0, y.0 .1 + 1.0)),
        PyPoint(Point(x.0 .1 + 1.0, y.0 .0 + 1.0)),
    )
}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<PyPoint>()?;
    m.add_function(wrap_pyfunction!(takes_py_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples_makes_class, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_class_makes_class, m)?)?;
    Ok(())
}
test.py

import time

import pyo3_scratch

# shortcut imports: bind names locally to avoid repeated module-attribute lookups
PyPoint = pyo3_scratch.PyPoint
takes_rs_class_makes_class = pyo3_scratch.takes_rs_class_makes_class
takes_rs_tuples_makes_class = pyo3_scratch.takes_rs_tuples_makes_class
takes_rs_tuples = pyo3_scratch.takes_rs_tuples


def py_fn(_p1, _p2):
    # this is the benchmark logic. Yes it is silly.
    return (_p1[0] + 1.0, _p2[1] + 1.0), (_p1[1] + 1.0, _p2[0] + 1.0)


def bench():
    B=10
    for name, fn in {
        "Just lambda, no work": lambda _p1, _p2: (_p1, _p2),
        "Normal Python function": lambda _p1, _p2: py_fn(_p1, _p2),
        "PyTuple in and out": lambda _p1, _p2: pyo3_scratch.takes_py_tuples(_p1, _p2),
        "Rust tuple in and out": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples(_p1, _p2),
        "Rust tuple in and out, local": lambda _p1, _p2: takes_rs_tuples(_p1, _p2),
        "Rust tuple in, pyclass out": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples_makes_class(_p1, _p2),
        "Rust tuple in, pyclass out, local": lambda _p1, _p2: takes_rs_tuples_makes_class(_p1, _p2),
        "pyclass in, pyclass out": lambda _p1, _p2: pyo3_scratch.takes_rs_class_makes_class(pyo3_scratch.PyPoint(_p1), pyo3_scratch.PyPoint(_p2)),
        "pyclass in, pyclass out, fast": lambda _p1, _p2: takes_rs_class_makes_class(PyPoint(_p1), PyPoint(_p2)),
    }.items():
        times = []
        for _ in range(B):
            t1 = time.time()
            for N in range(10000):
                p1, p2 = (N, N), (N, N)
                _ = fn(p1, p2)
            t2 = time.time()
            times.append(t2-t1)
        mean = sum(times) / B
        mse = sum(map(lambda i: (i - mean) ** 2, times)) / B  # variance of the run times
        print(f"{name:35}: {mean* 1000:.3f} ms, +-{mse*1e6:.3f} us")
        assert len(_) == 2


if __name__ == "__main__":
    bench()

@davidhewitt (Member)

Thanks for the further exploration, and sorry for my slow reply.

This shows that we can be fast at making #[pyclass] objects, but not when the creation is done from Python, so there should be wins to be had in the #[new] implementation within PyO3. It looks like CPython has a tp_vectorcall slot which can be used as an alternative to tp_new, but it's not available for the "heap types" PyO3 uses. I've asked upstream about this in python/cpython#100554
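
A rough way to see that asymmetry, reusing the pyo3_scratch module from the comments above (a sketch; absolute numbers will vary):

# constructing PyPoint by calling the class from Python goes through
# CPython's generic tp_call -> tp_new machinery, while receiving the
# same pyclass objects back from a #[pyfunction] avoids that dispatch
import timeit

import pyo3_scratch

print("pyclass in (class call):",
      timeit.timeit(lambda: pyo3_scratch.PyPoint((1.0, 2.0)), number=1_000_000))
print("pyclass out (#[pyfunction]):",
      timeit.timeit(lambda: pyo3_scratch.takes_rs_tuples_makes_class((1.0, 2.0), (3.0, 4.0)),
                    number=1_000_000))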
