Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracemalloc regression: deadlocks in 3.12.9 #130093

Closed
itamarst opened this issue Feb 13, 2025 · 2 comments
Closed

tracemalloc regression: deadlocks in 3.12.9 #130093

itamarst opened this issue Feb 13, 2025 · 2 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@itamarst
Copy link

itamarst commented Feb 13, 2025

Bug report

Bug description:

After recent changes, Polars debug builds, which integrate tracemalloc tracking, started experiencing deadlocks. Ripping out the tracemalloc integration fixed them (pola-rs/polars#21231). This started happening specifically when GitHub Actions switched to 3.12.9. I imagine this is due to the recent tracemalloc bug fixes.

No reproducer yet, but as a starting point:

  1. I am fairly certain this was not after doing tracemalloc.start(), since it happened in e.g. doctest runs that don't do that at all.
  2. Polars in this mode would trace Rust allocations using PyTraceMalloc_Track(): https://github.com/pola-rs/polars/blob/f378b24aeaaa14a8e7dacb6b35103424e8b97f1c/py-polars/src/memory.rs#L53
  3. Polars is heavily multi-threaded, and tracemalloc memory tracking would happen both while holding GIL and not holding GIL.

So this suggests some mixture of GIL-holding and non-GIL holding threads calling PyTraceMalloc_Track() and friends can cause deadlocks.

CPython versions tested on:

3.12

Operating systems tested on:

Linux

@itamarst itamarst added the type-bug An unexpected behavior, bug, or error label Feb 13, 2025
@itamarst
Copy link
Author

Here's a plausible theory: the way Polars works is there's a main thread that dispatches tasks to a Rust thread pool, and waits for results. When using Python version of Polars, it's important that main thread release the GIL before dispatching to the thread pool, for situations where operations running threads may acquire the GIL. So I went and did that in the relevant APIs. Many operations would never acquire the GIL so likely I missed some cases.

With latest version of tracemalloc, when tracemalloc is compiled into Polars, any Rust memory allocation will acquire the GIL. So the scope for deadlocks is now much broader. Any operation will acquire the GIL.

If this is the case, is it a bug in CPython? Maaaaybe? It's a change in behavior, certainly always acquiring GIL when allocations are tracked, even when tracemalloc.start() isn't called. But arguably if you're going to work with tracemalloc you need to handle GIL acquisition since that was always going to happen after tracemalloc.start() was called.

So maybe not a CPython bug. But probably worth at minimum documenting that PyTraceMalloc_Track() acquires the GIL.

I will try to validate this theory, and if it is correct I will close this issue.

@itamarst
Copy link
Author

My theory above was correct so going to try to fix this in Polars. In any case if someone else gets deadlocks in latest CPython patch versions, the above might be relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

1 participant