Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in PyThreadState_DeleteCurrent: drop_gil: GIL is not locked (free-threading) #118727

Closed
colesbury opened this issue May 7, 2024 · 3 comments
Assignees
Labels
3.13 bugs and security fixes tests Tests in the Lib/test dir topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@colesbury
Copy link
Contributor

colesbury commented May 7, 2024

Crash report

cpython/Python/ceval_gil.c

Lines 232 to 239 in b9caa09

#ifdef Py_GIL_DISABLED
if (!_Py_atomic_load_int_relaxed(&gil->enabled)) {
return;
}
#endif
if (!_Py_atomic_load_int_relaxed(&gil->locked)) {
Py_FatalError("drop_gil: GIL is not locked");
}

Found by running ./python Lib/test/test_importlib/partial/pool_in_threads.py

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139836635715136) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139836635715136) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139836635715136, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f2eb2303476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f2eb22e97f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055f8d81dcb4b in fatal_error_exit (status=<optimized out>) at Python/pylifecycle.c:2894
#6  0x000055f8d81e9981 in fatal_error (fd=2, header=header@entry=1, prefix=prefix@entry=0x55f8d83592f8 <__func__.20> "drop_gil", msg=msg@entry=0x55f8d8358af4 "drop_gil: GIL is not locked", status=status@entry=-1) at Python/pylifecycle.c:3076
#7  0x000055f8d81e99f3 in _Py_FatalErrorFunc (func=func@entry=0x55f8d83592f8 <__func__.20> "drop_gil", msg=msg@entry=0x55f8d8358af4 "drop_gil: GIL is not locked") at Python/pylifecycle.c:3092
#8  0x000055f8d81a6a39 in drop_gil (interp=<optimized out>, tstate=0x0) at Python/ceval_gil.c:238
#9  0x000055f8d81a75bf in _PyEval_ReleaseLock (interp=<optimized out>, tstate=tstate@entry=0x0) at Python/ceval_gil.c:582
#10 0x000055f8d81ed323 in _PyThreadState_DeleteCurrent (tstate=tstate@entry=0x7f2ea80065e0) at Python/pystate.c:1837
#11 0x000055f8d82a5cff in thread_run (boot_raw=boot_raw@entry=0x7f2ea8006590) at ./Modules/_threadmodule.c:355
#12 0x000055f8d820b7ce in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:243
#13 0x00007f2eb2355ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007f2eb23e7660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Note that in the coredump gil->enabled is 0, so it looks like the gil was transiently enabled by a different thread during the call to _PyThreadState_DeleteCurrent.

cc @swtaarrs

Linked PRs

@colesbury colesbury added tests Tests in the Lib/test dir type-crash A hard crash of the interpreter, possibly with a core dump topic-free-threading labels May 7, 2024
@swtaarrs swtaarrs self-assigned this May 7, 2024
@swtaarrs
Copy link
Member

swtaarrs commented May 7, 2024

I think the race here isn't specific to the thread deletion sequence; I forgot that drop_gil() is called by detached threads and so gil->enabled can change during it (like take_gil(), which has protection against this). Fix shouldn't be too hard.

edit: I was wrong; the thread is still attached during drop_gil() and this is specific to thread deletion. The issue is that tstate_delete_common() decrements the countdown for any pending stop-the-world requests. If there is a thread waiting to enable the GIL, there's a decent chance that this will activate the stw which will immediately enable the GIL.

@ericsnowcurrently
Copy link
Member

FWIW, this issue (and the PR) reinforce a worry I have about the complexity of Python's runtime state relative to threads.

At a conceptual level the runtime has greater-than-one pieces of per-thread data for which the possible value classes/groupings (in combination) should map explicitly onto distinct operational states (while disallowing some value class combinations). We have meaningful gaps in identifying such a mapping and enforcing such constraints, which makes it hard to be confident about correctness (in such a critical part of the runtime), as well as making it easier to introduce unnecessary complexity. This observation is supported by the variety of defects we've seen relative to the GIL over the years, including recently, including this issue.

I recognize all to personally how hard a hard problem this is, and it's not new with the free-threading work, but it's something we should be extra thoughtful about sooner to avoid additional headache later.

FWIW, at one point I tried to tame this situation somewhat and codify it by adding PyThreadState._status, along with a bunch of asserts. However, that approach can only take us so far and, even then, I don't feel like I was rigorous enough nor was I able to constrain the states as much as I wanted (e.g. I had to comment out various asserts). As part of the free-threading work, I've seen a number of improvements in support of mapping data to state and constraining that programmatically. I've also noticed new gaps.

In conclusion, we've reached a point in complexity (which free-threading generally amplifies meaningfully) that it may be worth taking some time to more thoroughly identify and validate the (effective) state machine for the Python runtime relative to threads. Otherwise I fear we'll regularly run into bugs like this indefinitely.

@swtaarrs
Copy link
Member

swtaarrs commented May 9, 2024

it may be worth taking some time to more thoroughly identify and validate the (effective) state machine for the Python runtime relative to threads.

I agree. One that I'd especially like to address (it's a minor one but I run into it frequently) is that I believe this store is redundant, based on my understanding of how last_holder is managed. It's been around for over 13 years, so it's plausible that things have changed enough in the meantime.

colesbury pushed a commit that referenced this issue May 23, 2024
…ad holds it (#118745)

`drop_gil()` assumes that its caller is attached, which means that the current
thread holds the GIL if and only if the GIL is enabled, and the enabled-state
of the GIL won't change. This isn't true, though, because `detach_thread()`
calls `_PyEval_ReleaseLock()` after detaching and
`_PyThreadState_DeleteCurrent()` calls it after removing the current thread
from consideration for stop-the-world requests (effectively detaching it).

Fix this by remembering whether or not a thread acquired the GIL when it last
attached, in `PyThreadState._status.holds_gil`, and check this in `drop_gil()`
instead of `gil->enabled`.

This fixes a crash in `test_multiprocessing_pool_circular_import()`, so I've
reenabled it.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 23, 2024
…t thread holds it (pythonGH-118745)

`drop_gil()` assumes that its caller is attached, which means that the current
thread holds the GIL if and only if the GIL is enabled, and the enabled-state
of the GIL won't change. This isn't true, though, because `detach_thread()`
calls `_PyEval_ReleaseLock()` after detaching and
`_PyThreadState_DeleteCurrent()` calls it after removing the current thread
from consideration for stop-the-world requests (effectively detaching it).

Fix this by remembering whether or not a thread acquired the GIL when it last
attached, in `PyThreadState._status.holds_gil`, and check this in `drop_gil()`
instead of `gil->enabled`.

This fixes a crash in `test_multiprocessing_pool_circular_import()`, so I've
reenabled it.
(cherry picked from commit be1dfcc)

Co-authored-by: Brett Simmers <swtaarrs@users.noreply.github.com>
colesbury pushed a commit that referenced this issue May 23, 2024
…nt thread holds it (GH-118745) (#119474)

`drop_gil()` assumes that its caller is attached, which means that the current
thread holds the GIL if and only if the GIL is enabled, and the enabled-state
of the GIL won't change. This isn't true, though, because `detach_thread()`
calls `_PyEval_ReleaseLock()` after detaching and
`_PyThreadState_DeleteCurrent()` calls it after removing the current thread
from consideration for stop-the-world requests (effectively detaching it).

Fix this by remembering whether or not a thread acquired the GIL when it last
attached, in `PyThreadState._status.holds_gil`, and check this in `drop_gil()`
instead of `gil->enabled`.

This fixes a crash in `test_multiprocessing_pool_circular_import()`, so I've
reenabled it.
(cherry picked from commit be1dfcc)

Co-authored-by: Brett Simmers <swtaarrs@users.noreply.github.com>
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
…t thread holds it (python#118745)

`drop_gil()` assumes that its caller is attached, which means that the current
thread holds the GIL if and only if the GIL is enabled, and the enabled-state
of the GIL won't change. This isn't true, though, because `detach_thread()`
calls `_PyEval_ReleaseLock()` after detaching and
`_PyThreadState_DeleteCurrent()` calls it after removing the current thread
from consideration for stop-the-world requests (effectively detaching it).

Fix this by remembering whether or not a thread acquired the GIL when it last
attached, in `PyThreadState._status.holds_gil`, and check this in `drop_gil()`
instead of `gil->enabled`.

This fixes a crash in `test_multiprocessing_pool_circular_import()`, so I've
reenabled it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes tests Tests in the Lib/test dir topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

No branches or pull requests

3 participants