
Class creation is slow #661

Closed
emmatyping opened this issue Nov 6, 2019 · 17 comments

@emmatyping (Contributor)

My environment is rustc 1.40, python 3.8, Windows 10.

💥 Reproducing

Make a pyclass and benchmark it!

I realize this could be a quirk of my current setup, but I was doing some benchmarking and realized that it can take up to 440ns to create a Python class object. This is fine in isolation, but it adds up when performance matters and you are creating a lot of these objects (or want to create them as fast as possible). I created a simple Cython test file to compare, and it was more than 2x faster at creating the class (~200ns).

Here is the sample Cython code:

# cython: language_level=3
cdef class Example:
    def __cinit__(self):
        self.a = 4
        self.b = "Test"
    cdef public long a
    cdef public str b

You can compile an extension from it with cythonize -b test.pyx.

The Rust code is the obvious equivalent:

use pyo3::prelude::*;

#[pyclass]
pub struct Example {
    #[pyo3(get, set)]
    pub a: i64,
    #[pyo3(get, set)]
    pub b: String,
}

#[pymethods]
impl Example {
    #[new]
    fn new(obj: &PyRawObject) {
        obj.init(Example {
            a: 4,
            b: String::from("Hello"),
        });
    }
}

#[pymodule]
fn example(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Example>()?;

    Ok(())
}

I realize this is probably not the biggest priority, but it seems odd that class creation would be so much slower.

@ehiggs

ehiggs commented Nov 6, 2019

Could you post your Rust build commands and the commands you used to benchmark this?

Thanks

@emmatyping (Contributor, Author)

Sure! I just used cargo build --release (I'd have used target-cpu=native, but I get a heap-corruption ICE if I do that). Then I wrote a simple benchmark using pytest-benchmark, like the following:

import example
def test_class_creation_rust(benchmark):
    benchmark(example.Example)

This calls example.Example(), which is the best benchmark I could think of atm.
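
For completeness, a sketch of the full comparison file (module names assumed from the snippets above: the Cython build as test, the Rust extension as example) - run with pytest, with pytest-benchmark installed:

# bench_test.py - a sketch; module names are assumptions from the snippets above
import example  # the Rust extension
import test     # the Cython module built with `cythonize -b test.pyx`


def test_class_creation_rust(benchmark):
    benchmark(example.Example)


def test_class_creation_cython(benchmark):
    benchmark(test.Example)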

@kngwyu (Member)

kngwyu commented Nov 7, 2019

Thank you for investigating that, but...
PyO3 has some fundamental overhead in its core, so I'm not sure we can fix it 🤔

@emmatyping (Contributor, Author)

Ah I see, that's too bad :/

I was under the assumption pyo3 was zero cost. Thank you for the information!

@ijl (Contributor)

ijl commented Nov 7, 2019

Why does pyo3 track allocations and use PyRef rather than just increment and decrement the PyObject's reference count? This decision might predate everyone maintaining it now, but does anyone know?

@kngwyu (Member)

kngwyu commented Nov 7, 2019

@ijl
I think we had some discussions about this... but for now I could only find this comment:
#271 (comment)

Now I'm too drunk to give a more detailed explanation, but I'll do it soon, and I want to rethink this problem.
If we can remove the ReleasePool, that would be really cool.

@ehiggs

ehiggs commented Nov 13, 2019

I was able to reproduce this in IPython. In the following, examplec is the Cython module.

$ ipython
Python 3.7.5rc1 (default, Oct  8 2019, 16:47:45) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import examplec                                                                                            

In [2]: %timeit examplec.Example()                                                                                 
68.7 ns ± 0.315 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: import example                                                                                             

In [4]: %timeit example.Example()                                                                                  
224 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

> PyO3 has some fundamental overhead in its core, so I'm not sure we can fix it 🤔

What sort of overhead is there that Cython doesn't seem to require?

@kngwyu (Member)

kngwyu commented Dec 1, 2019

@ehiggs
Sorry for the delay, but I wrote an explanation in #679

@davidhewitt (Member)

@ethanhs we've made steps to improve this in 0.10.1 - I'd be curious whether you'd be interested in rerunning your benchmarks to see what the improvement is like?

My guess is we're still not as fast as Cython, but hopefully we're closer, and maybe we'll find the next steps someday soon!

@emmatyping (Contributor, Author)

emmatyping commented May 21, 2020

@davidhewitt I ended up making a quick repo of my test for reproducibility. https://github.com/ethanhs/clscreate.

The mean times I got on my Ryzen 9 3900x on Windows with nightly 2020-05-12:
Cython: 75.9655 ns
Python: 247.2162 ns
PyO3: 354.3962 ns

Edit: I realized I should probably also try a non-static class, and other things which C may optimize in unrealistic ways but Rust does not - but I don't really have the time atm.

@davidhewitt (Member)

Oh man, it's a bit disappointing that class creation is even slower than native Python classes. I mean, it's not surprising that they're also well optimised, but still...

Definitely we shall have to look at improvements here :)

@emmatyping (Contributor, Author)

emmatyping commented May 22, 2020

Yeah, it is rather unfortunate. If I remember correctly, PyO3 was also slower than Python when I initially posted, so at least it isn't a regression.

I look forward to improvements to come! :)

@davidhewitt (Member)

Just as an update to this, I updated the clscreate test to PyO3 0.14.5 - https://github.com/davidhewitt/clscreate

On my Core i7-10750H, with Python 3.9.5, now I get mean times of:

Cython 52ns
PyO3 107ns
Python 174ns

... so we're now 2x Cython rather than 5x, and we're notably faster than plain Python! 🎉

I'm going to close this issue; we also track a class-creation benchmark in the PyO3 repo:

benchmark(pyo3_benchmarks.EmptyClass)

(though we don't yet compare against Cython).
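
For reference, a minimal sketch of what that tracked benchmark looks like as a pytest-benchmark test (pyo3_benchmarks being the extension module built by PyO3's benchmark suite; treat the exact names as illustrative):

# sketch: pytest-benchmark times repeated EmptyClass() construction
import pyo3_benchmarks


def test_empty_class_init(benchmark):
    benchmark(pyo3_benchmarks.EmptyClass)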

@alexpyattaev (Contributor)

I'm having a similar performance issue with a custom wrapper around a tuple (code below).

It turned out to be ridiculously slow: in a Rust function that consumes a PyPoint instance and returns a PyPoint instance, the wrapping is a major bottleneck, eating about 95% of the execution time of the whole function. I've done some digging, and these are the results:

  • I've made another function that takes "raw" tuples instead of PyPoint instances, and it appears that just unwrapping and wrapping tuples is about 50% of the cost of creating and destructuring PyPoint instances.
  • I've made a Python class (MyTuple, also below) to compare, and it has about the same creation and destructuring cost as the Rust class, so it does not appear that PyO3 is doing anything alarmingly wrong.
  • This is still a massive pitfall, and if it is impossible to avoid, maybe the docs should say that pyclass should be avoided wherever possible?
  • This does, of course, open a Pandora's box of typing problems across the Python/Rust boundary, as wrapper classes provide a very nice way to ensure Python code cannot feed random junk to Rust code, and with plain tuples that is trickier.
lib.rs

use std::ops::Deref;

use pyo3::exceptions::PyKeyError;
use pyo3::prelude::*;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }

    fn __repr__(&self) -> String {
        format!("[{}, {}]", self.0.0, self.0.1)
    }

    fn __getitem__(&self, i: usize) -> PyResult<f64> {
        match self.as_list().get(i) {
            None => Err(PyKeyError::new_err("index out of range")),
            Some(v) => Ok(*v),
        }
    }

    fn as_list(&self) -> [f64; 2] {
        [self.0.0, self.0.1]
    }

    fn as_tuple(&self) -> (f64, f64) {
        (self.0.0, self.0.1)
    }
}

impl Deref for PyPoint {
    type Target = Point;
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

#[pyfunction]
fn noop_tuple(x: (f64, f64), y: (f64, f64)) -> Option<((f64, f64), (f64, f64))> {
    Some((x, y))
}

// module registration (#[pymodule]) omitted

bench.py

import time

import o3  # the compiled Rust extension module


class MyTuple:
    __slots__ = ['x', 'y']

    def __init__(self, d):
        self.x, self.y = d

    def as_tuple(self):
        return self.x, self.y


def bench():
    for name, fn in {
        "tuple_wrap": lambda _p1, _p2, _: [tuple(r) for r in [_p1, _p2]],
        "rs_function": lambda _p1, _p2, _: [tuple(r) for r in o3.noop_tuple(_p1, _p2)],
        "rs_class": lambda _p1, _p2, _: [tuple(r.as_tuple()) for r in [o3.PyPoint(_p1), o3.PyPoint(_p2)]],
        "py_class": lambda _p1, _p2, _: [tuple(r.as_tuple()) for r in [MyTuple(_p1), MyTuple(_p2)]],
    }.items():
        t1 = time.time()
        for N in range(10000):
            p1, p2 = (N, N), (N, N)
            _ = fn(p1, p2, 1.0)
        t2 = time.time()
        print(f"{name}: {(t2 - t1) * 1000} ms")


if __name__ == "__main__":
    bench()

@davidhewitt (Member)

@alexpyattaev thanks for challenging this.

You're correct to say that PyO3 class creation is at roughly the same speed as Python class creation (in some cases we are faster, in some cases we are slower). We are limited by needing to create a Python object and all the machinery that requires. I do hope that one day we can find optimisations that mean we are consistently faster than Python, but I don't think it will ever be an orders-of-magnitude difference.

I think you have been a bit too fast to conclude that tuples are always better. Your example has a lot going on, so to make it easier to compare I've reduced it to just three #[pyfunction]s:

  • takes_py_tuples - the simplest case which doesn't create or destroy any Python objects
  • takes_rs_tuples - a variant of the above which converts the Python tuples into Rust tuples and back
  • takes_rs_tuples_makes_class - a variant of the above which converts the Python tuples into Rust tuples and then returns a tuple of PyPoint

Code at the bottom. Some example numbers (Mac M1, Python 3.10.6, rust 1.66.0-beta.1):

$ python test.py
py_tuples: 2.0890235900878906 ms
rs_tuples: 3.6208629608154297 ms
rs_class: 3.233194351196289 ms

While there is some variance, with this measurement I see that creating PyPoint from Rust is slightly faster than creating Python tuples.

You are correct that there is overhead converting between Python <-> Rust, and our user guide does discuss this. I still maintain that it is better to use #[pyclass] than raw tuples for most use cases - you have acknowledged the typing benefit yourself above. There will always be exceptions, and in some cases #[pyclass] may be slower; whichever is faster, it's always best to avoid creating and destroying Python objects in a hot loop if you can restructure your algorithm to avoid it.


lib.rs

use pyo3::prelude::*;
use pyo3::types::PyTuple;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }
}

#[pyfunction]
fn takes_py_tuples<'py>(x: &'py PyTuple, y: &'py PyTuple) -> (&'py PyTuple, &'py PyTuple) {
    (x, y)
}

#[pyfunction]
fn takes_rs_tuples(x: (f64, f64), y: (f64, f64)) -> ((f64, f64), (f64, f64)) {
    (x, y)
}

#[pyfunction]
fn takes_rs_tuples_makes_class(x: (f64, f64), y: (f64, f64)) -> (PyPoint, PyPoint) {
    (PyPoint::new(x), PyPoint::new(y))
}

#[pymodule]
fn pyo3_scratch(_: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<PyPoint>()?;
    m.add_function(wrap_pyfunction!(takes_py_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples_makes_class, m)?)?;
    Ok(())
}

test.py

import time

import pyo3_scratch


def bench():
    for name, fn in {
        "py_tuples": lambda _p1, _p2: pyo3_scratch.takes_py_tuples(_p1, _p2),
        "rs_tuples": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples(_p1, _p2),
        "rs_class": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples_makes_class(_p1, _p2),
    }.items():
        t1 = time.time()
        for N in range(10000):
            p1, p2 = (N, N), (N, N)
            _ = fn(p1, p2)
        t2 = time.time()
        print(f"{name}: {(t2 - t1) * 1000} ms")
        assert len(_) == 2


if __name__ == "__main__":
    bench()

@alexpyattaev (Contributor)

Well, now we are diving further down the rabbit hole =) A few notes/updates on your suggested benchmark:

  1. Taking &'py PyTuple and returning it right away is not a fair comparison, as in order to actually use its contents you still need to do some extra work. I've updated all the functions to do "useful work" which requires destructuring the Python objects; I think it levels the playing field a bit. With that refinement, dealing in PyTuple is no longer that beneficial compared to a Rust tuple (as it really should be the same thing).
  2. I recalled that Python has a significant namespace-lookup overhead, and binding names locally has a massive impact when making plenty of class instances to feed input values into functions. I've added appropriate benchmarks to illustrate that (as well as a baseline pure-Python implementation); a short sketch of the trick follows this list.
  3. Just measuring means did not feel right anymore, so I added the mean squared error too.
  4. SW/HW specs: rustc 1.64.0 (a55dd71d5 2022-09-19), AMD Ryzen 3500U; YMMV.
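
To show point 2 in isolation (a sketch - the "local" variants in the benchmark below apply the same trick inside the timing loop):

# bind the constructor to a local name once, instead of resolving
# pyo3_scratch.PyPoint afresh on every call in the hot loop
import pyo3_scratch

PyPoint = pyo3_scratch.PyPoint  # one attribute lookup, done up front
points = [PyPoint((float(i), float(i))) for i in range(10_000)]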

Below are the timings with updated tests:

Just lambda, no work               : 2.701 ms, +-0.013 us
Normal Python function             : 7.163 ms, +-0.044 us
PyTuple in and out                 : 8.640 ms, +-0.183 us
Rust tuple in and out              : 8.490 ms, +-0.364 us
Rust tuple in and out, local       : 7.404 ms, +-0.269 us
Rust tuple in, pyclass out         : 6.901 ms, +-0.064 us
Rust tuple in, pyclass out, local  : 5.997 ms, +-0.049 us
pyclass in, pyclass out            : 13.058 ms, +-0.424 us
pyclass in, pyclass out, fast      : 11.085 ms, +-0.123 us

TL;DR of the results so far:

  • When targeting the lowest possible overhead, Rust tuples seem to be the preferred approach (at least on my hardware; to test on yours, run the bench).
  • pyclass as a return type is fine, i.e. there is no significant extra cost compared to just returning a tuple.
  • If possible, local bindings should be used, especially when passing pyclass instances as input parameters.

Bottom line: I think the docs should be patched to include this result (or maybe just drop this code straight into the examples? Not sure what would be best - please advise). There does not appear to be any performance loss in PyO3 itself (at least nothing major).

lib.rs

use pyo3::prelude::*;
use pyo3::types::PyTuple;

#[derive(Debug, Copy, Clone)]
pub struct Point(f64, f64);

#[pyclass]
pub struct PyPoint(Point);

#[pymethods]
impl PyPoint {
    #[new]
    pub fn new(data: (f64, f64)) -> Self {
        PyPoint(Point(data.0, data.1))
    }
}

#[pyfunction]
fn takes_py_tuples<'py>(
    py: pyo3::Python<'py>,
    x: &'py PyTuple,
    y: &'py PyTuple,
) -> PyResult<&'py PyTuple> {
    let xx: (f64, f64) = x.extract()?;
    let yy: (f64, f64) = y.extract()?;
    let r1 = PyTuple::new(py, [xx.0 + 1.0, yy.1 + 1.0]);
    let r2 = PyTuple::new(py, [xx.1 + 1.0, yy.0 + 1.0]);
    Ok(PyTuple::new(py, [r1, r2]))
}

#[pyfunction]
fn takes_rs_tuples(x: (f64, f64), y: (f64, f64)) -> ((f64, f64), (f64, f64)) {
    ((x.0 + 1.0, y.1 + 1.0), (x.1 + 1.0, y.0 + 1.0))
}

#[pyfunction]
fn takes_rs_tuples_makes_class(x: (f64, f64), y: (f64, f64)) -> (PyPoint, PyPoint) {
    (
        PyPoint::new((x.0 + 1.0, y.1 + 1.0)),
        PyPoint::new((x.1 + 1.0, y.0 + 1.0)),
    )
}

#[pyfunction]
fn takes_rs_class_makes_class(x: &PyPoint, y: &PyPoint) -> (PyPoint, PyPoint) {
    (
        PyPoint(Point(x.0 .0 + 1.0, y.0 .1 + 1.0)),
        PyPoint(Point(x.0 .1 + 1.0, y.0 .0 + 1.0)),
    )
}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<PyPoint>()?;
    m.add_function(wrap_pyfunction!(takes_py_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_tuples_makes_class, m)?)?;
    m.add_function(wrap_pyfunction!(takes_rs_class_makes_class, m)?)?;
    Ok(())
}
test.py

import time

import pyo3_scratch

# shortcut imports: bind names locally to avoid repeated module-attribute lookups
PyPoint = pyo3_scratch.PyPoint
takes_rs_class_makes_class = pyo3_scratch.takes_rs_class_makes_class
takes_rs_tuples_makes_class = pyo3_scratch.takes_rs_tuples_makes_class
takes_rs_tuples = pyo3_scratch.takes_rs_tuples


def py_fn(_p1, _p2):
    # this is the benchmark logic. Yes it is silly.
    return (_p1[0] + 1.0, _p2[1] + 1.0), (_p1[1] + 1.0, _p2[0] + 1.0)


def bench():
    B=10
    for name, fn in {
        "Just lambda, no work": lambda _p1, _p2: (_p1, _p2),
        "Normal Python function": lambda _p1, _p2: py_fn(_p1, _p2),
        "PyTuple in and out": lambda _p1, _p2: pyo3_scratch.takes_py_tuples(_p1, _p2),
        "Rust tuple in and out": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples(_p1, _p2),
        "Rust tuple in and out, local": lambda _p1, _p2: takes_rs_tuples(_p1, _p2),
        "Rust tuple in, pyclass out": lambda _p1, _p2: pyo3_scratch.takes_rs_tuples_makes_class(_p1, _p2),
        "Rust tuple in, pyclass out, local": lambda _p1, _p2: takes_rs_tuples_makes_class(_p1, _p2),
        "pyclass in, pyclass out": lambda _p1, _p2: pyo3_scratch.takes_rs_class_makes_class(pyo3_scratch.PyPoint(_p1), pyo3_scratch.PyPoint(_p2)),
        "pyclass in, pyclass out, fast": lambda _p1, _p2: takes_rs_class_makes_class(PyPoint(_p1), PyPoint(_p2)),
    }.items():
        times = []
        for _ in range(B):
            t1 = time.time()
            for N in range(10000):
                p1, p2 = (N, N), (N, N)
                _ = fn(p1, p2)
            t2 = time.time()
            times.append(t2-t1)
        mean = sum(times) / B
        mse = sum(map(lambda i: (i - mean) ** 2, times)) / B  # variance of the run times
        print(f"{name:35}: {mean* 1000:.3f} ms, +-{mse*1e6:.3f} us")
        assert len(_) == 2


if __name__ == "__main__":
    bench()

@davidhewitt (Member)

Thanks for the further exploration, and sorry for my slow reply.

This shows that we can be fast at making #[pyclass] objects, but not when the creation is done from Python, so there should be wins to be had in the #[new] implementation within PyO3. It looks like CPython has a tp_vectorcall slot which can be used as an alternative to tp_new, but it's not available for the "heap types" PyO3 uses. I've asked upstream about this in python/cpython#100554
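
A rough way to see that asymmetry, reusing the pyo3_scratch module from the comments above (a sketch; absolute numbers will vary):

# constructing PyPoint by calling the class from Python goes through
# CPython's generic tp_call -> tp_new machinery, while receiving the
# same pyclass objects back from a #[pyfunction] avoids that dispatch
import timeit

import pyo3_scratch

print("pyclass in (class call):",
      timeit.timeit(lambda: pyo3_scratch.PyPoint((1.0, 2.0)), number=1_000_000))
print("pyclass out (#[pyfunction]):",
      timeit.timeit(lambda: pyo3_scratch.takes_rs_tuples_makes_class((1.0, 2.0), (3.0, 4.0)),
                    number=1_000_000))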
