Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for free-threaded Python #695

Merged
merged 17 commits into from
Sep 20, 2024
Merged

Support for free-threaded Python #695

merged 17 commits into from
Sep 20, 2024

Conversation

wjakob
Copy link
Owner

@wjakob wjakob commented Aug 21, 2024

This PR adapts nanobind to run on free-threaded Python builds. The changes involve:

  • A CMake parameter nanobind_add_module(... FREE_THREADED ...) to opt into free threaded behavior. It's a no-op in GIL-protected Python.

  • Adapated internal data structures to protect them from concurrent modification.

    1. Protected access to the high-bandwidth CPP->Python hash table (internals.inst_c2p) and and keep-alive lists data structure (internals.keep_alive). This uses sharding based on pybind11 PR #5148 by @colesbury to reduce lock contention.

    2. internals.type_c2p_fast turns into a fast thread-local cache around a slower globally locked internals.type_c2p_slow. (These are hash tables that map from C++ to Python types, and they are accessed relatively often).

    3. Ordinary locking for the remaining fields that aren't on the critical path.

    The design avoids centralized locks as much as possible to improve scalability (for example, pybind11 locks to the internals data structure several times per dispatched call, which is costly and seems somewhat counter to the design of the free-threaded Python implementation with its focus on local locks). A documentation string added to src/nb_internals.h provides a detailed explanation of the locking semantics of every individual field.

  • Threading awareness for list accessors and sequence type casters.

  • RAII wrappers for Python mutexes and critical sections.

  • Documentation explaining free-threading in general, and in the context of nanbind

  • Parallel tests that exercise various critical sections

The plan is to merge this into the upcoming nanobind release v2.2.0. Three related points:

  • Performance: although some internals were restructured, there is no impact on performance in non-free-threaded code. Optional features such as per-argument locking in free-threaded bindings are designed so that they don't add a runtime cost if not used.
  • API stability: no incompatibilities.
  • ABI stability: free-threading involves changes in internal data structures, requiring an ABI version bump. The upcoming version 2.2.0 already increments the ABI version.

@wjakob wjakob force-pushed the free-threaded branch 3 times, most recently from 7001e18 to 67f2752 Compare August 21, 2024 09:20
@wjakob
Copy link
Owner Author

wjakob commented Aug 21, 2024

Dear @colesbury,

following your work on the pybind11 free-threaded support, I was wondering if I could ask two quick questions about getting something similar into nanobind.

  1. Does the free-threaded version of Python provide a fast/portable way of getting a core and/or thread ID for sharding data structures? This is for situations where an existing pointer address isn't available that could be used to segregate a lookup.

  2. The failing test in this PR happens only in free-threaded builds, and I suspect that it is a regression in GC behavior. More details here: https://discuss.python.org/t/reference-leaks-in-free-threaded-3-13-version/61467/3. Do you have any ideas what might be going wrong?

  3. Is it necessary to set the PYTHON_GIL environment variable to 0 to get free-threaded behavior? Or is it also okay if the variable isn't set.

Thanks!

@vfdev-5
Copy link
Contributor

vfdev-5 commented Aug 21, 2024

few cents on the last question

Is it necessary to set the PYTHON_GIL environment variable to 0 to get free-threaded behavior? Or is it also okay if the variable isn't set.

@wjakob maybe you have seen this page, otherwise please check this : https://py-free-threading.github.io/running-gil-disabled/

Here is a PR for numpy enabling free-threading tests: https://github.com/numpy/numpy/pull/26463/files

HTH

@colesbury
Copy link
Contributor

Hi @wjakob,

  1. There are a few possible APIs depending on the use case. PyThreadState_Get() returns a unique address per-thread. _Py_ThreadId() is probably a bit faster, but it's not a public API. If you go that route, I'd suggest copy-pasting the implementation rather than calling _Py_ThreadId() directly, but that's up to you. If you need small consecutive integers, you are probably best off with a thread_local variable.

  2. I replied to the forum post: https://discuss.python.org/t/reference-leaks-in-free-threaded-3-13-version/61467/7

  3. If you every C extension calls PyUnstable_Module_SetGIL(m, Py_MOD_GIL_NOT_USED) then PYTHON_GIL=0 is not necessary. (The environment variable overrides what the extensions specify). In addition to @vfdev-5's link, there are some more details in the free-threading howto guide. If some PyUnstable_Module_SetGIL are missing and PYTHON_GIL=0 is not set, then you'll see a warning about the GIL being enabled printed in the tests. I don't see that warning in the test output, so I don't think you need to set the environment variable.

@colesbury
Copy link
Contributor

One more thing: PyObject_Malloc() now should only be used for allocating PyObject* (and not for PyObject** or other structs). I think some or all of the current PyObject_Malloc() usages should be PyMem_Malloc() (and PyObject_Free -> PyMem_Free().

https://docs.python.org/3.13/howto/free-threading-extensions.html#memory-allocation-apis

One annoying thing is that tp_doc needs to be PyObject_Malloc'd for 3.12, but PyMem_Malloc'd for 3.13+.

I think the other usages can use PyMem_Malloc unconditionally.

@wjakob
Copy link
Owner Author

wjakob commented Aug 22, 2024

Thank you for the clarification @colesbury.

  1. nanobind uses the PyType_FromMetaclass API on v3.12+ and doesn't allocate tp_docs, so that should be okay.

  2. Do you know if there is any difference between PyMem_Malloc and PyObject_Malloc for the other uses pre-v3.13? (If there is no difference, I would switch all of them over instead of making a version-dependent change).

    In general, the reason for using the Python allocation API for non-Python objects was to leverage its slab allocator, which has generally been faster than a C-style malloc() or C++ new. That would be nice to preserve when making a change here.

  3. The issue with the leak seems more complex, I responded on the Python forum.

@wjakob wjakob force-pushed the free-threaded branch 2 times, most recently from 03cb139 to 93a39a7 Compare August 22, 2024 07:24
@wjakob
Copy link
Owner Author

wjakob commented Aug 22, 2024

@colesbury: I have two more questions about functions that weren't mentioned in PEP 703.

  • are PyGILState_Ensure() and PyGILState_Release() no-ops in free-threaded builds? Although the GIL is gone, it seems to me that these at least still compile to function calls when looking at the CPython headers, so there is presumably some cost even if their implementations don't do anything. Or do they do something? Should I be careful to #ifdef calls to these functions when building free-threaded extensions?

  • same question, but about the opposite direction: what do PyEval_SaveThread() and PyEval_RestoreThread() in free-threaded builds.

(These pairs are commonly called in existing pybind11/nanobind extensions to mark boundaries between parallel C++ code and previously GIL-protected Python regions.)

And an unrelated question about PEP 703: Py_BEGIN_CRITICAL_SECTION can release prior locks held by the thread to avoid deadlocks. I'm wondering about the semantics of a critical section in this case. A traditional critical section executes atomically, but this approach seems inherently non-atomic. What is the purpose of such a relaxed critical section, and what assumptions can still be made about it?

I'm asking about critical sections because I would like to provide a builtin nanobind annotation to protect C++ instances from concurrent access (for unsafe/legacy code). Basically I'm wondering about the right primitive to build on, and whether Py_BEGIN_CRITICAL_SECTION provides enough guarantees.

Thanks!

@colesbury
Copy link
Contributor

Hi @wjakob -

  • There's not a substantial difference between PyMem_Malloc and PyObject_Malloc pre-3.13. They have the same performance characteristics so you can switch all of them over instead of making a version-dependent change.
  • PyGILState_Ensure and related APIs, including PyEval_SaveThread, are still required in the same places. They're not no-ops even when they no longer acquire/release a global interpreter lock. At a high level, you can think of them as managing thread states (i.e., PyThreadState). Threads are still required to have a valid, active thread state when calling most Python C APIs. More concretely, one of the major purposes is to synchronize with the garbage collector, which can't run while threads are mutating Python objects.

I'll reply about Py_BEGIN_CRITICAL_SECTION in a follow up comment.

@colesbury
Copy link
Contributor

The Py_BEGIN/END_CRITICAL_SECTION() pairs may (implicitly) define multiple atomically executed critical sections. For example:

Py_BEGIN_CRITICAL_SECTION(obj);
obj->x++;    // first atomically executed
obj->y--;    //  executed critical section
Py_DECREF(other); /* may call arbitrary code via finalizer */
obj->z *= 2; // second atomically executed critical section
Py_END_CRITICAL_SECTION();

In the above example:

  • Py_BEGIN_CRITICAL_SECTION(obj) acquires the per-object lock on obj.
  • obj->x++ and obj->y-- together execute atomically like in a traditional critical section
  • Py_DECREF(other) may internally temporarily release the per-object lock because finalizers can usually call arbitrary code. This can happen if it calls PyEval_SaveThread() or begins a new internal critical section (which might block). Importantly, when Py_DECREF returns, the thread is guaranteed to hold the lock for obj. It's just not guaranteed to have held it continuously.
  • obj->z *= 2 then executes atomically
  • Py_END_CRITICAL_SECTION() releases the lock on obj.

The implicit breakup into multiple smaller atomically executed critical sections occurs at the same places where the GIL would have been released -- Py_DECREF() in this case.

Py_BEGIN_CRITICAL_SECTION is less straight-forward than a plain mutex, but often easier to use when dealing with the Python C API because:

  1. It's semantics are similar to the GIL regarding which places it may be implicitly, temporarily released
  2. Many Python C API calls including the ubiquitous Py_DECREF() may call arbitrary code so using plain mutexes often requires substantial refactoring or risks deadlocks.

If you are dealing mostly with, for example, the C++ STL then plain mutexes are probably appropriate. If you're dealing mostly with the Python C API, then Py_BEGIN_CRITICAL_SECTION is probably easier to use.

@wjakob
Copy link
Owner Author

wjakob commented Aug 23, 2024

Thank you for the clarification @colesbury. That makes a lot of sense. So the new critical sections are really how things have always been with possible code execution through Py_DECREF() or even context switches to other threads while holding the GIL.

I'm think I'm done with all of the tweaks to nanobind and now wanted to test that it actually works as expected. This led me to a surprising result with v3.13rc1. I wrote a test that launches 8 threads that each construct 1M objects and destroy them afterwards.

You can run it as follows with this PR

$ python3.13 -m pytest  tests/test_thread.py

The file contains just a single test:

def test01_object_creation():
    from test_thread_ext import Counter

    def f():
        n = 1000000
        r = [None]*n
        for i in range(n):
            r[i] = Counter()
        del r

    parallelize(f, n_thread=8)
    gc.collect()

I would expect performance to stay mostly constant until the n_threads parameter exceeds the number of physical cores of the OS. Instead, performance seems to be roughly linear in n_threads, which means that there is a bottleneck somewhere.

Interrupting the interpreter via Ctrl-C indicates that this bottleneck seems to be related to the construction of the Counter type in the example. However, the bottleneck occurs inside Python, and before it is forwarded to nanobind. Here is a backtrace:

 * frame #0: 0x0000000198dfff88 libsystem_kernel.dylib` swtch_pri  + 8
    frame #1: 0x0000000198e3ce34 libsystem_pthread.dylib` cthread_yield  + 32
    frame #2: 0x000000010026aaa0 python3.13` _PyMutex_LockTimed [inlined] _Py_yield  + 208 at lock.c:46
    frame #3: 0x000000010026aa9c python3.13` _PyMutex_LockTimed(m=0x00000001005f8a88, timeout=-1, flags=_PY_LOCK_DET
ACH)  + 204 at lock.c:88
    frame #4: 0x000000010026b4c0 python3.13` PyMutex_Lock(m=<unavailable>)  + 8 at lock.c:582
    frame #5: 0x000000010022e980 python3.13` _PyCriticalSection_BeginSlow [inlined] _PyMutex_Lock(m=0x00000001005f8a
88)  + 80 at lock.h:49
    frame #6: 0x000000010022e96c python3.13` _PyCriticalSection_BeginSlow(c=0x0000000173e2aac8, m=0x00000001005f8a88
)  + 60 at critical_section.c:20
    frame #7: 0x000000010013fc5c python3.13` _PyType_LookupRef [inlined] _PyCriticalSection_BeginMutex(c=0x000000017
3e2aac8, m=<unavailable>)  + 524 at pycore_critical_section.h:119
    frame #8: 0x000000010013fc2c python3.13` _PyType_LookupRef(type=0x0000020000d70010, name=0x00000001005b96f8)  +
476 at typeobject.c:5256
    frame #9: 0x0000000100152370 python3.13` lookup_maybe_method(self='0x2001f611b40', attr=<unavailable>, unbound=0
x0000000173e2ab7c)  + 32 at typeobject.c:2521
    frame #10: 0x000000010014d850 python3.13` slot_tp_init [inlined] lookup_method(self='0x2001f611b40', attr=<unava
ilable>, unbound=0x0000000173e2ab7c)  + 28 at typeobject.c:2543
    frame #11: 0x000000010014d834 python3.13` slot_tp_init(self='0x2001f611b40', args=(), kwds='0x00000000')  + 56 a
t typeobject.c:9773
    frame #12: 0x00000001001433b0 python3.13` type_call(self='0x20000d70010', args=(), kwds='0x00000000')  + 464 at
typeobject.c:1990
    frame #13: 0x0000000100096bcc python3.13` _PyObject_MakeTpCall(tstate=0x0000000101019210, callable='0x20000d7001
0', args='0x1009fc270', nargs=4305437280, keywords='0x00000000')  + 356 at call.c:242
    frame #14: 0x0000000100096838 python3.13` _PyObject_VectorcallTstate(tstate=<unavailable>, callable=<unavailable
>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>)  + 228 at pycore_call.h:166
    frame #15: 0x0000000100097704 python3.13` PyObject_Vectorcall(callable=<unavailable>, args=<unavailable>, nargsf
=<unavailable>, kwnames=<unavailable>)  + 48 at call.c:327
    frame #16: 0x00000001001f9b90 python3.13` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=<unavailable>, th
rowflag=0)  + 13112 at generated_cases.c.h:813
    frame #17: 0x000000010009b1d4 python3.13` _PyObject_VectorcallTstate(tstate=0x0000000101019210, callable='0x2000
0756790', args='0x173e2aed8', nargsf=1, kwnames='0x00000000')  + 116 at pycore_call.h:168

Looking at the CPython code, it seems to me that there is a caching scheme in place that should make it possible to avoid the critical section. However, for some reason, this isn't happening.

Do you have any ideas what could be wrong?

Thank you!

@colesbury
Copy link
Contributor

colesbury commented Aug 23, 2024

I think there is a combination of two performance issues:

First, instances of nanobind.nb_method (i.e., Counter.__init__) don't use deferred reference counting (nor are they immortal). So reference counting on those methods from multiple threads is going to be a bottleneck and inhibit scaling. Unfortunately, we don't have a public API currently to mark objects as using deferred reference counting. I think we can probably figure out a workaround, but it'd be good to get a public API (likely with the PyUnstable_ prefix) in 3.14.

The second problem is in the implementation of the MRO type cache in CPython. Basically, the fast-path lookup fails so we fall back to the locking slow path, but we're also not enabling future lookups in the fast-path to succeed. This case only happens with entries that don't use deferred reference counting (like nanobind.nb_method), which is why we missed it, but it makes that case worse: instead of having a contended atomic operation, we have a contended lock, which is even slower.

@wjakob
Copy link
Owner Author

wjakob commented Aug 23, 2024

Thank you for looking into this @colesbury! For context: nanobind uses a custom function object to implement an efficient function dispatch mechanism (i.e., faster and more flexible than the builtin C function wrappers). It is central to the design of the library—this design choice preventing multi-threaded speedups in free-threaded builds would be bad news.

Could you let me know if you see any way to force nb_method / nb_func to adopt deferred reference counting, even if hacky and only usable in a particular Python version? Rather than having a nice API in v3.14, I think it would be good to introduce unstable hack for v3.13. The latter is what I expect people to play around with when attempting to adapt their extension projects for free-threading.

@wjakob
Copy link
Owner Author

wjakob commented Aug 26, 2024

I tried doing the hack myself. Based on the CPython internals, I added the function

void enable_deferred(PyObject *op) {
    assert(PyType_IS_GC(Py_TYPE(op)) && _Py_IsOwnedByCurrentThread(op) &&
           op->ob_ref_shared == 0);
    op->ob_gc_bits |= 64;
    op->ob_ref_local += 1;
    op->ob_ref_shared = _Py_REF_QUEUED;
}

(This function is called on a newly created nb_func object that is local to the current thread)

Calling this on the function objects seems to fix the issue with critical sections. However, I'm still not seeing multithreaded speedups. Examining the callstack shows that most of the time is spent dealing with shared refcount atomics.

Does this mean that nanobind needs to bite the bullet and make immortal types and function objects? Is it correct to assume that this will break nanobind's leak tracking? For context: one feature of the library is that it complains loudly if any of its objects (classes, functions, instances) are still alive after Python has full shut down and invokes the Py_AtExit() callbacks.

@wjakob
Copy link
Owner Author

wjakob commented Aug 26, 2024

I tried the main branch (v3.14.0) as well immortalizing the functions and still observed significant performance regressions.

Then I thought I'd try this in plain Python, and it turns out that the behavior there is similar. So perhaps it's better to understand the situation there.

Here is a reproducer:

import threading

def parallelize(func, n_threads):
    barrier = threading.Barrier(n_threads)

    def wrapper():
        barrier.wait()
        return func()

    workers = []
    for _ in range(n_threads):
        t = threading.Thread(target=wrapper)
        t.start()
        workers.append(t)

    for worker in workers:
        worker.join()

def f():
    def q():
        return 1
    for i in range(1000000):
        q()

parallelize(f, n_threads=8)

Performance on an M1 Max laptop (macOS) on AC power

n_threads Runtime
1 49ms
2 202ms
4 610ms
8 2310ms

When I move the q function to the top level, the performance is worse.

n_threads Runtime
1 63ms
2 246ms
4 1402ms
8 5601ms

Especially in the former case with a callable declared per thread, I would have expected the runtime cost to be relatively flat until 8 threads (matching the number of performance cores in this machine.)

Backtrace

this is where time seems to be spent (for the former case with a local function):

* thread #2
    frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940)  + 8 at pyatomic_gcc.h:130
   127
   128  static inline int
   129  _Py_atomic_compare_exchange_ssize(Py_ssize_t *obj, Py_ssize_t *expected, Py_ssize_t desired)
-> 130  { return __atomic_compare_exchange_n(obj, expected, desired, 0,
   131                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
   132
   133  static inline int
warning: python3.14t was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2
  * frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940)  + 8 at pyatomic_gcc.h:130
    frame #1: 0x00000001000c4e70 python3.14t` _Py_DecRefShared [inlined] _Py_DecRefSharedDebug(o=0x000003c380398960, filename=0x0000000000000000, lineno=0)  + 28 at object.c:353
    frame #2: 0x00000001000c4e54 python3.14t` _Py_DecRefShared(o=0x000003c380398960)  + 20 at object.c:371
    frame #3: 0x00000001001afed0 python3.14t` _PyEval_FrameClearAndPop  + 4 at ceval.c:1684
    frame #4: 0x00000001001afecc python3.14t` _PyEval_FrameClearAndPop(tstate=0x000000010081f200, frame=0x00000001006e0290)+ 232 at ceval.c:1710
    frame #5: 0x00000001001ab9d4 python3.14t` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=0x00000001006e0218, throwflag=<unavailable>)  + 58020 at generated_cases.c.h:6499
    frame #6: 0x000000010005e71c python3.14t` method_vectorcall [inlined] _PyObject_VectorcallTstate(tstate=0x000000010081f200, callable=0x000003c3805363c0, args=0x0000000171e12ee8, nargsf=1, kwnames=0x0000000000000000)  + 44 at pycore_call.h:167
    frame #7: 0x000000010005e6f0 python3.14t` method_vectorcall(method=<unavailable>, args=0x0000000100492b90, nargsf=<unavailable>, kwnames=0x0000000000000000)  + 284 at classobject.c:70
    frame #8: 0x000000010029a284 python3.14t` thread_run(boot_raw=0x0000600000c00e40)  + 128
    frame #9: 0x000000010022f2bc python3.14t` pythread_wrapper(arg=<unavailable>)  + 28
    frame #10: 0x0000000198e3ffa8 libsystem_pthread.dylib` _pthread_start  + 148

@colesbury
Copy link
Contributor

colesbury commented Aug 26, 2024

Does this mean that nanobind needs to bite the bullet and make immortal types and function objects?

Yes, deferred reference counting isn't fully implemented in 3.13 and is more of a placeholder, so functions and types need to be immortalized.

I tried the main branch (v3.14.0) as well immortalizing the functions...

I'll take a look at this today or tomorrow.

@colesbury
Copy link
Contributor

Hi @wjakob -

Here's a patch that demonstrates how to make functions and type objects immortal:

https://gist.github.com/colesbury/99e15d904d84d239231f465a5aca6c36

If you are doing multi-threaded benchmarking, you won't want to use suppress_immortalization. The tradeoff in 3.13 is that you either:

  1. immortalize objects (with the associated leaks on exit), and can scale to well to multiple threads/cores
  2. suppress immortalization (avoid leaks), but can't scale to multiple threads/cores

The other change was to move the from test_thread_ext import Counter outside of the function body of test01_object_creation. Closed over local variables are implemented using cell objects. Reading from a cell object currently takes a lock on the cell object instance. We should probably fix this for 3.14.

@wjakob
Copy link
Owner Author

wjakob commented Aug 27, 2024

Thank you very much @colesbury, it fixes the performance regression. I am now seeing the following performance on the v3.14 (main) branch:

n_threads Runtime
1 172ms
2 223ms
4 335ms
8 503ms

The cell performance caveat you mention sounds tricky to figure out for users, that would be good to fix indeed.

@wjakob
Copy link
Owner Author

wjakob commented Aug 27, 2024

Thank you for all your help so far @colesbury. I was wondering about one more thing:

In your pybind11 implementation, you sharded the (very frequently accessed) instance map based on the top 44 bits of the pointer address. The rationale was that this region is related the thread ID, and that way we can limit lock contention if every thread mostly works with its own objects. This sounded good to me, and I copied your approach in nanobind.

One potential issue is that nanobind extensions talk to each other through shared data structures, and that means that such a sharding scheme will effectively become part of the "ABI" (i.e. difficult to modify after the fact without breaking the ABI contract). It's the same also in pybind11. Hence I want to be sure that this is a good long-term solution.

Therefore, I was wondering where this came from, and what can be said about it? A few questions:

  1. Discarding the bottom 20 bits splits memory into 1MB-sized chunks., which isn't very much So it seems it can't directly map to the thread ID.

  2. There are actually two allocators: the system malloc() and mimalloc(). Do both of them allocate spatially segregated heaps for each thread? (In nanobind, pointers used for sharding can either be from a memory region allocated via PyObject_Malloc, or they can be on the C++ heap created via operator new/malloc())

  3. Do you know if this behavior is consistent across platforms?

Thank you!

@colesbury
Copy link
Contributor

I'm not sure that using the top 44 bits does much to help with lock contention compared to using all the bits. We are still applying a hash function (fmix64) and indexing a much smaller table. So I'd expect each thread to more-or-less pseudo-randomly choose a shard in the table with the contention depending on the number of shards and the number of threads concurrently trying to access it.

To the extent it helps, it's because a given thread is likely to re-use the same shard for lookups (fewer cache misses). The improvement here is small compared to reducing lock contention by increasing the number of shards.

There's a good deal of flexibility in terms of how many low bits to discard -- I don't think it's particularly sensitive to the choice or to the allocator. The important things are just that:

  • You use enough bits that different threads are likely to have different keys (i.e., it'd be bad if everyone used shard 0!)
  • But not too many bits -- otherwise consecutive lookups from the same thread are likely to use random shards.

The first condition is more important. You definitely don't want to have multiple threads forced to use the same shard. But we have a good deal of slack here. Glibc's arenas are (mostly) per-thread and by default 64 MiB. Mimalloc's segments are per-thread and 32 MiB.

If you are concerned about the choice of shift, you can put it in the internals struct (and use that value instead of hardcoding the shift). I think that would give more flexibility about changing it without affecting the ABI. In other words, the creator of the nb_internals struct determines the shift amount. I don't think it's particularly important though.

I think more likely paths for improvement are:

  1. Increasing the number of shards, but that's harder to justify without more real world data.
  2. Storing the C++ struct -> PyObject* mapping on the C++ struct itself when possible.

@wjakob
Copy link
Owner Author

wjakob commented Aug 30, 2024

Hi @colesbury,

thank you your the detailed explanation and clarifying my misunderstanding. The PR is now almost done in the sense that all the internal data structures are protected for free-threading.

I wanted to ask about your thoughts regarding one more thing:

Binding tools like nanobind and pybind11 are often used to bind existing code. However, that code might often not be thread-safe and therefore unsafe to use in free-threaded Python. It's worse than pure Python code because of the possibility of memory-unsafe operations that crash the process.

To amelierate such issues, I would like to add a way for nanobind to lock Python objects involved in a function call (without having to modify the function itself).

Example: with this feature, the user could change a binding of a function with 3 named arguments

m.def("func", &func, nb::arg("argument_1"), nb::arg("argument_2"), nb::arg("argument_3"));

into one guaranteeing that the PyMutex of the first and third argument is held while the function runs.

m.def("func", &func, nb::arg("argument_1").lock(), nb::arg("argument_2"), nb::arg("argument_3").lock());

In general, this sounds like something that Py_BEGIN_CRITICAL_SECTION() and Py_BEGIN_CRITICAL_SECTION2() are made for. However, the interface I am describing would in principle allow to take out more than 2 locks.

At first, I thought this is simple: I'll just sort all of the objects by pointer address and then run PyMutex_Lock() on the ob_mutex of their PyObject *. But then I realized that ob_mutex is likely not part of the public API, and that it will be locked away if there is a free-threaded limited API at some point in the future.

Do you think what I am trying to do is possible in some forward-compatible way?

Thanks!

@colesbury
Copy link
Contributor

Hi @wjakob - I don't think there's a way to do what you are proposing. You should avoid using PyMutex_Lock to lock PyObject.ob_mutex because that can lead to deadlocks. The object might already be locked by a critical section when your functions are called. Any given lock should only be locked with the critical section APIs or with PyMutex_Lock, but not a mix.

If possible, I'd recommend limiting the implementation to locking one or two arguments so that you can use the critical section APIs.

We do something similar in CPython with Argument Clinic: https://github.com/python/cpython/blob/782217f28f0d67916fc3ff82b03b88573686c0e7/Objects/setobject.c#L1998. In our cases:

  • The vast majority of uses lock a single object, typically the "self" object. There's only one a few cases where it makes sense to lock two objects at once.
  • It's not applicable everywhere: sometimes you need to lock a field on some argument or do some setup before locking.

@wjakob
Copy link
Owner Author

wjakob commented Sep 20, 2024

Thank you for the helpful feedback everyone! Two more updates:

  • d9b2b21, which makes the nb::dict iterator non-copyable while adding locking.
  • f17c414 makes it possible to lock an object with unnamed arguments without suffering the (very minor) overheads of the "complex" function dispatcher. Very niche, but maybe someone with a high performance use case will appreciate it 🤓

@wjakob
Copy link
Owner Author

wjakob commented Sep 20, 2024

This feature has been in the works for a while, and I will merge it for now. If there are any previously undiscovered problems, please let me know here or open a separate issue. My plan is to wait a bit more for things to stabilize and then push out a new release.

In principle, this PR should not affect existing (non-free-threaded) extensions. However, it is a whopping 2K diff with a rewrite of internal data structures. So I am again pinging users who have chimed in on GitHub in the past: @WKarel, @ManifoldFR, @hpkfft, @yosh-matsuda, @garth-wells, @luigifcruz, @awni, @qnzhou, @skallweitNV, @oremanj, @wojdyr, @jmalkin, @Chronum94. It would be great if you could compile your projects against the master branch and let me know if anything breaks between ee23846 (revision prior to merging this PR) and the latest master revision.

wjakob and others added 17 commits September 20, 2024 23:01
This commit adds the ``FREE_THREADED`` parameter to the CMake
``nanobind_add_module()`` command. It does not do anything for now
besides defining ``NB_FREE_THREADED`` in C++ compilation units.
Nanobind's internal data structures switch to a different layout when
free-threading was requested. We must give such extensions a different
ABI tag since they are incompatible with non-free-threaded ones.
This commit implements RAII scope guards that are thin wrappers around
existing Python API to lock Python mutexes and enter critical sections
with respect to one or two Python objects.

In ordinary (non-free-threaded) Python builds, they don't do anything
and can be optimized away.
The garbage collector in free-threaded Python does not like it when the
reference count of an object temporarily increases while traversing the
object graph in ``tp_traverse``, and doing so introduces leaks.
Unfortunately, example implementations of ``tp_traverse`` in both
documentation and test suite fall into this trap and must be adapted.
This commit enables free-threaded extension builds on Python 3.13+,
which involves the following changes:

- nanobind must notify Python that an extension supports free-threading.

- All internal data structures must be protected from concurrent
  modification. The approach taken varies with respect to the specific
  data structure, and a long comment in ``nb_internals.h`` explains the
  design decisions all of the changes. In general, the implementation
  avoids centralized locks as much as possible to improve scalability.

- Adopting safe versions of certain operations where needed, e.g.
  ``PyList_GetItemRef()``.

- Switching non-object allocation from ``PyObject_Allo()`` to
  ``PyMem_Alloc()``.
Global objects that undergo a high rate of reference count changes can
become a bottleneck in free-threaded Python extensions, since the
associated atomic operation require coordination between processor
cores. Function and type objects are a particular concern.

This commit immortalizes such objects, which exempts them from
free-threading. The downside of this is that they will leak.
Adapting C++ to handle parallelism due to free-threaded Python can be
tricky, especially when the original code is given as-is. This commit
an tentative API to retrofit locking onto existing code by locking the
arguments of function calls.
This commit documents free-threading in general and in the context of
nanobind extensions.
Several parallel tests to check that locking works as expected
This commit changes the ``nb::dict`` iterator so that nanobind can
implement the recommendation from

https://docs.python.org/3.14/howto/free-threading-extensions.html#pydict-next

The primary goal of ``nb::internal::dict_iterator`` was to be able to write

```cpp
nb::dict my_dict = /* ... */;
for (auto [k, v] : my_dict) {
    // ....
}
```

This in fact the only associated feature that is explicitly mentioned in
the documentation, and this continues to work.

However, some undocumented features are lost:

- The dictionary iterator is no longer copyable. This is because it
  must acquire an exclusive lock to the underlying dictionary.

- The pre-increment operator ``++dict_it`` (which relied on copying) is
  gone. Post-increment continues to work, and that is enough for the
  loop structure mentioned above.
This commit refactors argument the locking locking so that it occurs at
compile-time without imposing runtime overheads. The change applies to
free-threaded extensions.

Behavior differences compared to the prior approach:

- it is no longer possible to do ``nb::arg().lock(false)`` or
  ``.lock(runtime_determined_value)``

- we no longer prohibit locking self in ``__init__``; changing this
  would also require restoring ``cast_flags::lock``, and it's not clear
  that the benefit outweighs the complexity.
@wjakob wjakob merged commit 1dff7c4 into master Sep 20, 2024
15 checks passed
@wjakob wjakob deleted the free-threaded branch September 20, 2024 14:03
@awni
Copy link

awni commented Sep 20, 2024

MLX tests clear on current master. Thanks for the heads up.

@hpkfft
Copy link
Contributor

hpkfft commented Sep 21, 2024

My FFT extension is all good with the current master. Thanks!

@@ -819,24 +819,29 @@ struct fast_iterator {

class dict_iterator {
public:
NB_NONCOPYABLE(dict_iterator)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wjakob can you please suggest what would be the best way to fix a thirdparty code using dict_iterator copy and now failing to compile due to this change?
Here is the code:
https://github.com/openxla/xla/blob/182111eb601b6491b37625df51f488905d217ae2/xla/python/weakref_lru_cache.cc#L58-L60
Thanks!

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks complicated (weak references, LRU caches, dictionary iterators, ...) Could I ask you to explain in a bit more what the use case is, and why dict iterators need to be copied and hashed?

Maybe that code can be rewritten without needing a dictionary iterator. Or perhaps it can call .items() and then run a normal iterator over that.

The change made in the commit is difficult to roll back or avoid since it needed to make PyDict_Next safe in free-threaded builds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wjakob thanks! We'll finally rewrite openxla code to take into account this change.

The change made in the commit is difficult to roll back or avoid since it needed to make PyDict_Next safe in free-threaded builds.

Yes, let's keep this change as it is for the free-threading.

@stellaraccident
Copy link

Thanks! Worked like a charm.

@jmalkin
Copy link

jmalkin commented Sep 27, 2024

Sorry for the delay in responding. Our unit tests didn't notice the change so everything is good for us.

@wjakob
Copy link
Owner Author

wjakob commented Oct 3, 2024

@colesbury I've just released v2.2.0 on PyPI with the free-threading support included. Thank you for all of your help with this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants