-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add pybind11/gil_safe_call_once.h (to fix deadlocks in pybind11/numpy.h) #4877
Conversation
…(also suggested by jbms@) is to add empty ctor + dtor.
@mtsokol @EthanSteinberg @henryiii FYI Not yet ready for review. |
For easy reference, a copy of the clang-tidy errors (GitHub Actions):
|
No, you can't |
Thanks! That's what I suspected and yes
Verified on godbolt, based on the link you shared with me: https://godbolt.org/z/jsE4MW99P I'll try |
Oh, hitting a wall much quicker than expected (
|
Also fails with C++17 on godbolt: https://godbolt.org/z/K8dP1fhv7 |
…e trick (also suggested by jbms@) is to add empty ctor + dtor." This reverts commit e7b8c4f.
I tried the aligned char storage alternative on godbolt: https://godbolt.org/z/q6oEj4dP9 This code (copied from there) compiles:
The Google-internal reproducer for the original deadlock also passes with the aligned char storage alternative. @tkoeppe You brought up elsewhere:
If I understand the godbolt output (link above) correctly, that version of the class is not constant-initializable, is that true? When is it critical that the class is constant-initializable? (I'll try the aligned char storage alternative in the GitHub Actions here.) |
It's not critical, but not having it be constant-initialized and destroyed means that you pay for a) a static guard flag (and its check) after all, and b) you add an element to the global destruction sequence. This should work in C++17: https://godbolt.org/z/Wa79nKz6e |
include/pybind11/numpy.h
Outdated
#include <cstdint> | ||
#include <cstdlib> | ||
#include <cstring> | ||
#include <functional> | ||
#include <numeric> | ||
#include <sstream> | ||
#include <stdalign.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this, this is a C interop header (and we're not writing C).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 109a165
include/pybind11/numpy.h
Outdated
template <typename T> | ||
class LazyInitializeAtLeastOnceDestroyNever { | ||
public: | ||
template <typename Initialize> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document that you expect the GIL to be held, or in any case that calls are serialized. (Otherwise the mutating accesses would cause races.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 1ce2715
include/pybind11/numpy.h
Outdated
T &Get(Initialize &&initialize) { | ||
if (!initialized_) { | ||
assert(PyGILState_Check()); | ||
// Multiple threads may run this concurrently, but that is fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is a bit misleading. Multiple threads may call Get
, but multiple threads definitely mustn't get to this point! E.g. initialized_
is not an atomic variable and it'd be a race to access it concurrently.
What is true is that multiple threads may possibly reach the inside of initialize
, but that's a separate matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm uncertain I understand the two points correctly.
My understanding:
From getting burned in the past, I've developed the strong belief that multiple threads do in fact get EDITconcurrently(see below) into the if (! initialized_)
branch.
If the first thread calls back into the Python C API (e.g. for a Python import), Python can and does in the general case release the GIL. That gives other threads the opportunity to also pass the if (! initialized_)
test, so the Python initialize()
really can run multiple times.
But after the initialize()
call we are sure to have the GIL again, therefore there is no race flipping initialized_
from false
to true
. One lucky thread gets there first. (All other threads that made it to the initialize()
call will write the true
again, but each under the GIL.) Only then will other threads no longer get into the if (! initialized_)
branch.
Would it help to change the comment to like this?
// Multiple threads may reach `initialize()` EDIT~concurrently~(see below), but each holding the GIL at that time.
Fundamentally, the "at least once" aspect here isn't great. "Guaranteed once" would be better.
My thinking:
-
Let's get the "at least once" solution stable. It almost seems to be there, and it solves the original deadlock problem.
-
Then let's look at it again (this PR maybe) to see what it takes to achieve a "guaranteed once" implementation (right here in pybind11, within the limitations of C++11).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think when I said "multiple threads can call Get
" that's not entirely right. By assuming that Get
is only called with the GIL held, we effectively say that only one thread can call Get
at a time. What I meant, but didn't say right, was that "mulitple threads can attept to lock the GIL and call Get
".
From getting burned in the past, I've developed the strong belief that multiple threads do in fact get concurrently into the
if (! initialized_)
branch.
I'm not sure how this would happen, and if it did, it'd definitely be a bug, since that's a race.
If the first thread calls back into the Python C API (e.g. for a Python import), Python can and does in the general case release the GIL. That gives other threads the opportunity to also pass the
if (! initialized_)
test, so the Pythoninitialize()
really can run multiple times.
OK, but that is not "concurrently". Only the thread holding the GIL can be accessing initialized_
. In order for one thread to release the GIL it must already have completed this access. (And here "completed" can be made precise in the sense of the memory model.)
I don't think "initialize()
really can run multiple times" is a precise enough statement to allow for a meaningful interpretation. It is certainly true that multiple threads can exectue inside initialize
, but it is (or definitely should) not be true that multiple threads reach the expression initialize()
on L56 "concurrently", in the sense of "without sequencing". If one thread is inside initialize
and another thread reaches L56, then the former thread must be blocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can say something more concrete and specific to this situation:
// It is possible that multiple threads execute `Get` with `initialized_` still being
// false, and thus proceed to execute `initialize()`. For this to happen, `initialize`
// has to release and reacquire the GIL internally. We accept this, and expect the
// operation to be both idempotent and cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I appear to be mixing up terms, which is critical here.
What is the correct term? (I want to edit-correct my comment.)
E.g.
I've developed the strong belief that multiple threads do in fact get
concurrentlyCORRECT_TERM_HERE into the if (! initialized_) branch.
I want to say what you are saying.
Although all I originally wanted: leave a meaningful, terse, and correct comment in the code, just enough to give readers a good clue to follow up on in case they want to fully understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'm seeing your suggested comment only now. Will adopt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 1ce2715
* `include\pybind11/numpy.h(24,10): fatal error C1083: Cannot open include file: 'stdalign.h': No such file or directory` * @tkoeppe wrote: this is a C interop header (and we're not writing C)
``` include/pybind11/eigen/../numpy.h:63:53: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing] return *reinterpret_cast<T *>(value_storage_); ^ ```
Document PRECONDITION. Adopt comment suggested by @tkoeppe: pybind#4877 (comment)
…ses: ``` g++ -o pybind11/tests/test_numpy_array.os -c -std=c++20 -fPIC -fvisibility=hidden -O0 -g -Wall -Wextra -Wconversion -Wcast-qual -Wdeprecated -Wundef -Wnon-virtual-dtor -Wunused-result -Werror -isystem /usr/include/python3.11 -isystem /usr/include/eigen3 -DPYBIND11_STRICT_ASSERTS_CLASS_HOLDER_VS_TYPE_CASTER_MIX -DPYBIND11_ENABLE_TYPE_CASTER_ODR_GUARD_IF_AVAILABLE -DPYBIND11_TEST_BOOST -Ipybind11/include -I/usr/local/google/home/rwgk/forked/pybind11/include -I/usr/local/google/home/rwgk/clone/pybind11/include /usr/local/google/home/rwgk/forked/pybind11/tests/test_numpy_array.cpp ``` ``` In file included from /usr/local/google/home/rwgk/forked/pybind11/tests/test_numpy_array.cpp:10: /usr/local/google/home/rwgk/forked/pybind11/include/pybind11/numpy.h: In static member function ‘static pybind11::detail::npy_api& pybind11::detail::npy_api::get()’: /usr/local/google/home/rwgk/forked/pybind11/include/pybind11/numpy.h:258:82: error: ‘constinit’ variable ‘api_init’ does not have a constant initializer 258 | PYBIND11_CONSTINIT static LazyInitializeAtLeastOnceDestroyNever<npy_api> api_init; | ^~~~~~~~ ``` ``` In file included from /usr/local/google/home/rwgk/forked/pybind11/tests/test_numpy_array.cpp:10: /usr/local/google/home/rwgk/forked/pybind11/include/pybind11/numpy.h: In static member function ‘static pybind11::object& pybind11::dtype::_dtype_from_pep3118()’: /usr/local/google/home/rwgk/forked/pybind11/include/pybind11/numpy.h:697:13: error: ‘constinit’ variable ‘imported_obj’ does not have a constant initializer 697 | imported_obj; | ^~~~~~~~~~~~ ```
…t use cases:" This reverts commit f07b28b.
Just to highlight with a full comment here: I think we'd need to get into |
You're missing the DMI on
|
Or rather:
|
…nt use cases:" This reverts commit 36be645.
…koeppe: pybind#4877 (comment) This fixes the errors reported under commit f07b28b.
This looks really good! One minor comment: This code will eventually be completely removed when we switch to numpy's public Python API in about two months. |
include/pybind11/numpy.h
Outdated
assert(PyGILState_Check()); | ||
auto value = initialize(); | ||
if (!initialized_) { | ||
new (reinterpret_cast<T *>(value_storage_)) T(std::move(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inner cast is bogus. Placement-new takes a void*
, so you should at best cast to void*
. However, this conversion is also available implicitly, and so unless you want to avoid hitting an overloaded placement new operator, you can just drop it:
new (value_storage_) T(args...); // OK
::new (static_cast<void*>(value_storage_)) T(args...); // Paranoidly correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be commit a864f21 (see below). Sorry I missed this before triggering the currently running GHA.
I'll work on moving things to the right places before running GHA again.
Semi-paranoid placement new (based on https://github.com/pybind/pybind11/pull/4877#discussion_r1350573114).
diff --git a/include/pybind11/numpy.h b/include/pybind11/numpy.h
index 47050a36..1c76177f 100644
--- a/include/pybind11/numpy.h
+++ b/include/pybind11/numpy.h
@@ -66,7 +66,7 @@ public:
assert(PyGILState_Check());
auto value = initialize();
if (!initialized_) {
- new (reinterpret_cast<T *>(value_storage_)) T(std::move(value));
+ ::new (value_storage_) T(std::move(value));
initialized_ = true;
}
}
include/pybind11/numpy.h
Outdated
~LazyInitializeAtLeastOnceDestroyNever() | ||
= default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this go on a single line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only with
// clang-format off
...
/// clang-format on
unfortunately. I feel it's better to just let clang-format do what it wants, but let me know if you prefer the override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK, sure, never mind!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a nicer solution: PYBIND11_DTOR_CONSTEXPR
Thanks, but
|
With the latest insights and removal of Regarding the API: we could just have
but that will
What would that look like with the simpler API? |
Unfortunately I developed a suspicion that the lock juggling magic is actually still not complete, and it does not want to go away: template <typename Callable>
gil_safe_call_once_and_store &call_once_and_store_result(Callable &&fn) {
if (!is_initialized_) { // This read is guarded by the GIL.
// Multiple threads may enter here, because CPython API calls in the
// `fn()` call below may release and reacquire the GIL.
gil_scoped_release gil_rel; // Needed to establish lock ordering.
std::call_once(once_flag_, [&] {
// Only one thread will ever enter here.
gil_scoped_acquire gil_acq;
::new (storage_) T(fn());
is_initialized_ = true; // This write is guarded by the GIL.
});
+ // Is it possible that one (or more) of the multiple threads entering above
+ // reaches this point BEFORE `is_initialized_` is true?
}
// Intentionally not returning `T &` to ensure the calling code is self-documenting.
return *this;
} Under https://en.cppreference.com/w/cpp/thread/call_once I found this:
I think we need to name the threads:
What we need is a mechanism that blocks the Limbo threads until the Lucky thread has flipped Is that somehow happening already? If not, how can we fix that? |
In the scenario you describe, the thread that calls To repeat: yes, |
Awesome, thanks! I was just beginning to think that must be the case after inspecting the
I don't think we need that, that's too much on the side of unit-testing the |
You shouldn't have to read any code to understand an API, especially not one as important as this one. If there's any aspect of any documentation that left you unsure about the semantics, please file appropriate bugs (or editorial issues). |
I think the fact that "every return from |
I think the documentation is great. It's just me: The problem for me is that I'm only vaguely familiar with the definition of terms used there, including some that may seem basic to someone more knowledgable about threads. My first instinct in such situations (very common) is to look at the code. |
With all comments stripped out, and defines resolved for a modern compiler, the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good to me now! Thanks for the explaining all the intricate parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All pending comment threads are resolved as far as I'm concerned.
Thanks for all the help, and the reviews! This PR was deployed Google-internally already a few hours ago. |
This PR updates pybind11 to v2.11.1. Even with the latest version of pybind11, we still have an issue with `pybind11::array_t` when cuCIM is used in multithread without importing numpy in the main thread. pybind/pybind11#4877 Will need to wait for the next release of pybind11.
This PR updates pybind11 to v2.11.1. Even with the latest version of pybind11, we still have an issue with `pybind11::array_t` when cuCIM is used in multithread without importing numpy in the main thread. pybind/pybind11#4877 Will need to wait for the next release of pybind11. Signed-off-by: Gigon Bae <gbae@nvidia.com>
This applies the following patches to pybind11: - pybind/pybind11#4857 - pybind/pybind11#4877 to avoid deadlock when using pybind11 without importing numpy in multi-threaded environment.
This PR updates pybind11 to v2.11.1. Even with the latest version of pybind11, we still have an issue with `pybind11::array_t` when cuCIM is used in multithread without importing numpy in the main thread. pybind/pybind11#4877 Will need to wait for the next release of pybind11. Signed-off-by: Gigon Bae <gbae@nvidia.com>
This applies the following patches to pybind11: - pybind/pybind11#4857 - pybind/pybind11#4877 to avoid deadlock when using pybind11 without importing numpy in multi-threaded environment.
…mmand (#618) ### Update Catch2 to v3.4.0 Without upgrading Catch2, The following error occurs when building on Ubuntu 22.04 due to glibc: ``` cucim/build-debug/_deps/deps-catch2-src/single_include/catch2/catch.hpp:10830:58: error: call to non-‘constexpr’ function ‘long int sysconf(int)’ 10830 | static constexpr std::size_t sigStackSize = 32768 >= MINSIGSTKSZ ? 32768 : MINSIGSTKSZ; ``` ### Update pybind11 to v2.11.1 Even with the latest version of pybind11, we still have an issue with `pybind11::array_t` when cuCIM is used in multithread without importing numpy in the main thread. See pybind/pybind11#4877 Will need to wait for the next release of pybind11. ### Use runtime option instead of using nvidia-docker command nvidia-docker binary is not available if user doesn't install nvidia-docker2 package. This change uses runtime option instead of using nvidia-docker command. ### Apply pybind11 patch to avoid deadlock (until new release is available) This applies the following patches to pybind11: - pybind/pybind11#4857 - pybind/pybind11#4877 to avoid deadlock when using pybind11 without importing numpy in multi-threaded environment. Authors: - Gigon Bae (https://github.com/gigony) - Gregory Lee (https://github.com/grlee77) Approvers: - Gregory Lee (https://github.com/grlee77) - https://github.com/jakirkham URL: #618
Description
See comments in pybind11/gil_safe_call_once.h for usage and technical details.
The primary (and quite common!) use case is initialization of C++
static
variables that involves interactions with the CPython API.The original motivation for adding pybind11/gil_safe_call_once.h was to fix deadlocks observed after deploying PR #4857 Google-internally (we had to roll back after a few hours).
Explanation in a nutshell (with a lot of help from @tkoeppe and @jbms):
static npy_api api = lookup();
introduces a hidden mutex: https://eel.is/c++draft/stmt.dcl#3In retrospect: This was a problem all along. PR #4857 only made it much more noticeable.
Suggested changelog entry: