Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: nb::ndarray arguments with nb::call_guard<nb::gil_scoped_release> on Python 3.12 #377

Closed
pawel-glomski opened this issue Dec 3, 2023 · 14 comments

Comments

@pawel-glomski
Copy link

pawel-glomski commented Dec 3, 2023

Problem description

Hi there, many thanks for this amazing project!

On Python 3.12, calling functions with nb::ndarray arguments and nb::call_guard<nb::gil_scoped_release> causes segmentation fault. This problem does not occur with previous versions of Python.

When nb::call_guard<nb::gil_scoped_release> is not used and instead nb::gil_scoped_release is used directly inside the function, this error does not occur.

System:

Python: 3.12
nanobind: 1.8.0
OS: Linux Mint 21.2 Cinnamon

Reproducible example code

Here is a modified version of nanobind_example:

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;
using namespace nb::literals;

NB_MODULE(nanobind_example_ext, m)
{
  m.def(
      "add",
      [](nb::ndarray<> const a) { return 0; },
      "a"_a,
      nb::call_guard<nb::gil_scoped_release>());
}

And a test:

import numpy as np
import nanobind_example as m


def test_add():
    a = np.ones(32)
    assert m.add(a) == 0
@wjakob
Copy link
Owner

wjakob commented Dec 4, 2023

Hum, it's not obvious to me why this would not work. And I just tried (different platform: macOS/Python 3.11) and had no issues.

Could I ask you to compile in debug mode and report a backtrace?

@pawel-glomski
Copy link
Author

pawel-glomski commented Dec 4, 2023

And I just tried (different platform: macOS/Python 3.11) and had no issues.

Everything works fine for me on 3.11, and for all previous python versions (>=3.8). It seems to be 3.12-only issue.

I will try the debug build in a sec.

@wjakob
Copy link
Owner

wjakob commented Dec 4, 2023

Mysterious. I don't really see what changed in 3.12 that could explain this difference. Except perhaps something related to the stable ABI, in case you are using that? (What's the filename of the compiled extension? Does it include the string abi3?)

@pawel-glomski
Copy link
Author

Yes, abi3 is present in the filename. I am using nanobind_example, which includes the following in its pyproject.toml.

# Build stable ABI wheels for CPython 3.12+
wheel.py-api = "cp312"

Disabling it makes no difference.

Running the test alone produces no backtrace. Just Segmentation fault (core dumped).

@pawel-glomski
Copy link
Author

I ran the test with LLDB. Here is the call stack:

PyMem_Free (@PyMem_Free:18)
nanobind::detail::ndarray_dec_ref(nanobind::detail::ndarray_handle*) (/home/pglomski/miniconda3/envs/pylibrb/lib/python3.12/site-packages/nanobind/src/nb_ndarray.cpp:545)
nanobind::ndarray<>::~ndarray() (/home/pglomski/miniconda3/envs/pylibrb/lib/python3.12/site-packages/nanobind/include/nanobind/ndarray.h:428)
operator() (/home/pglomski/miniconda3/envs/pylibrb/lib/python3.12/site-packages/nanobind/include/nanobind/nb_func.h:41)
operator() (/home/pglomski/miniconda3/envs/pylibrb/lib/python3.12/site-packages/nanobind/include/nanobind/nb_func.h:153)
_FUN (/home/pglomski/miniconda3/envs/pylibrb/lib/python3.12/site-packages/nanobind/include/nanobind/nb_func.h:124)
nanobind::detail::nb_func_vectorcall_complex(PyObject *, PyObject *const *, size_t, PyObject *) (/home/pglomski/miniconda3/envs/pylibrb/lib/python3.12/site-packages/nanobind/src/nb_func.cpp:651)
PyObject_Vectorcall (@PyObject_Vectorcall:27)
_PyEval_EvalFrameDefault (@_PyEval_EvalFrameDefault:312)
PyEval_EvalCode (@PyEval_EvalCode:46)
run_eval_code_obj (@run_eval_code_obj:24)
run_mod (@run_mod:41)
pyrun_file (@pyrun_file:46)
_PyRun_SimpleFileObject (@_PyRun_SimpleFileObject:113)
_PyRun_AnyFileObject (@_PyRun_AnyFileObject:25)
Py_RunMain (@Py_RunMain:167)
Py_BytesMain (@Py_BytesMain:14)
__libc_start_call_main (@__libc_start_call_main:29)
__libc_start_main_impl (@__libc_start_main@@GLIBC_2.34:43)
_start (@_start:14)

Debugger shows the following:
image

@pawel-glomski
Copy link
Author

image

For some reason strides is equalt to 1?

@pawel-glomski
Copy link
Author

Hmm. When I use a reference (nb::ndarray<> const& a) there is no error.

Not an expert here, but shouldn't we create the arguments before releasing the lock?
This is where the function is called:

    if constexpr (CheckGuard && !std::is_same_v<typename Info::call_guard, void>) {
        return func_create<ReturnRef, false>(
            [func = (forward_t<Func>) func](Args... args) NB_INLINE_LAMBDA {
                typename Info::call_guard::type g;
                (void) g;
                return func((forward_t<Args>) args...);
            },
            (Return(*)(Args...)) nullptr, is, extra...);
    }

The guard is created before the arguments are passed to the function. If the function takes the nb::ndarray object as a value, a new nb::ndarray will be created. Is this safe with the GIL released?

@wjakob
Copy link
Owner

wjakob commented Dec 4, 2023

I think I have guess as to what is wrong. Normally, passing around a nb::ndarray should be perfectly safe even when the GIL is not held, because all this does is to increase/decrease reference counts. However, once the reference count reaches zero, some cleanup code needs to be called, and that involves CPython API calls that require the GIL to be held.

I think the issue is that nanobind is now so aggressive in move-constructing arguments that the entire nb::ndarray ends up in the sub-function wrapped by the gil_scoped_release call guard, and it's not safe to call the cleanup routine there.

@wjakob
Copy link
Owner

wjakob commented Dec 4, 2023

Never mind the previous patch I posted, it was not a good solution to the problem. Could you please check if the following addresses the issue?

diff --git a/src/nb_ndarray.cpp b/src/nb_ndarray.cpp
index 6e74699..ced2343 100644
--- a/src/nb_ndarray.cpp
+++ b/src/nb_ndarray.cpp
@@ -537,6 +537,8 @@ void ndarray_dec_ref(ndarray_handle *th) noexcept {
     if (rc_value == 0) {
         check(false, "ndarray_dec_ref(): reference count became negative!");
     } else if (rc_value == 1) {
+        gil_scoped_acquire guard;
+
         Py_XDECREF(th->owner);
         Py_XDECREF(th->self);
         managed_dltensor *mt = th->ndarray;

@pawel-glomski
Copy link
Author

pawel-glomski commented Dec 4, 2023

Yes, both solutions fix the issue.

@pawel-glomski
Copy link
Author

Is this ndarray-only problem though?

@pawel-glomski pawel-glomski changed the title [BUG]: nb::ndarray arguments with nb::call_guard<nb::gil_scoped_acquire> on Python 3.12 [BUG]: nb::ndarray arguments with nb::call_guard<nb::gil_scoped_release> on Python 3.12 Dec 4, 2023
@wjakob wjakob closed this as completed in a958e8d Dec 4, 2023
@pawel-glomski
Copy link
Author

I Appreciate the quick response btw 🤗

@wjakob
Copy link
Owner

wjakob commented Dec 4, 2023

Is this ndarray-only problem though?

It's a problem for all types that require CPython API calls. That would be any wrapper type (e.g. nb::dict) as well as nb:ndarray. But it would not make sense to use nb::dict in a no-GIL context, so I think this is very specifically a ndarray problem.

fakufaku added a commit to LCAV/pyroomacoustics that referenced this issue Apr 25, 2024
fakufaku added a commit to LCAV/pyroomacoustics that referenced this issue Apr 25, 2024
* CI: removes python 3.7, adds 3.11 and 3.12

* switches to "build" package to build the package in CI.

* fixes bdist_wheel build

* Replaces the brittle crossing-based point-in-polygon (2D) algorithm by Dan Sunday's robust winding based algorithm.

* Fixes a bug with releasing the GIL and python 3.12 in the rir builder part of libroom. Related to wjakob/nanobind#377
@fakufaku
Copy link

I had the exact same problem but with pybind11. Removing the call_guard and adding a py::gil_scoped_release in the function itself solved the issue.

havogt added a commit to GridTools/gridtools that referenced this issue Jun 13, 2024
- nb::any is replaced by `nb::ssize_t(-1)`
- Because of a change in `~ndarray`, we need to make sure that Python is initialized in our tests (see wjakob/nanobind#377)
havogt added a commit to GridTools/gridtools that referenced this issue Jun 18, 2024
- nb::any is replaced by `nb::ssize_t(-1)`
- Because of a change in `~ndarray`, we need to make sure that Python is initialized in our tests (see wjakob/nanobind#377)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants