Segfaults on 3.12 when using PySR and running Julia's GC #113591
Comments
Not sure if #97920 is related at all to this?
@pablogsal, sorry for the tag, but I'm wondering if you might have any intuition for how I could investigate this further?
I think it is very unlikely that this is related. That commit makes GC runs less common, and it schedules them at points that are safer for the runtime (meaning there are fewer chances of the runtime encountering illegal conditions). Unfortunately, the fact that this happens at GC time doesn't necessarily point at the GC: collection is normally when error conditions that already happened in the past (corruption, illegal references, or cycles) are discovered, because object links are heavily exercised during it. It is very difficult to make an informed suggestion from the traceback alone, but here is what I can observe:
The bug looks like it is in the extension layer, not in CPython (although that is possible), so unfortunately, without a reproducer that only involves CPython, we won't be able to help further.
Answering some additional things:
As mentioned before, the periodic GC in 3.12 is running less often, not more. It's only executed when Python executes bytecode or when
If a Julia object is visible to Python after being deallocated, that's an error condition in the extension. If an object is destroyed, the GC should NOT be able to see it.
Looking at the traceback more closely, I don't see CPython's GC anywhere. All the gc functions refer to Julia's GC, in particular https://github.com/JuliaLang/julia/blob/1b183b93f4b78f567241b1e7511138798cea6a0d/src/gc.c#L406. So this looks like an extension/Julia problem, as I don't see any CPython GC calls here.
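One low-cost way to see, from the Python side, which frames were active at the moment of a crash is the standard-library faulthandler module. The sketch below (the helper name enable_crash_tracebacks is hypothetical) makes CPython print the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. The same effect is available without code changes via `python -X faulthandler` or `PYTHONFAULTHANDLER=1`.

```python
import faulthandler
import sys


def enable_crash_tracebacks() -> None:
    """Dump Python tracebacks of all threads on SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL."""
    # Write to the original stderr: faulthandler needs a real file descriptor,
    # which a redirected sys.stderr (e.g. an io.StringIO) may not provide.
    faulthandler.enable(file=sys.__stderr__, all_threads=True)


if __name__ == "__main__":
    enable_crash_tracebacks()
    # Hypothetical: run the PySR/PyJulia reproducer after this point.
```

If the dump shows only the Python frames that called into Julia, that would be consistent with the crash originating inside Julia's collector rather than CPython's, although it is not proof on its own.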
Thanks very much for the advice. Indeed it sounds like an issue on the Julia side, so feel free to close. The traceback is from Julia, but I was wondering whether some change in the Python GC might have freed memory that PyCall was expecting to free itself; as you suggest, though, it appears to be something else. What puzzles me is that this issue occurs only when upgrading from Python 3.11 to 3.12, while the Julia version (or PyCall version) does not seem to affect it, and it manifests as these segfaults from the garbage collection code. I guess I will need to figure out whether any changes in 3.12 break assumptions in PyCall. (I don't think it's PySR-specific, as PySR is basically a lightweight wrapper around a few PyJulia calls; it's just the only way I've been able to consistently reproduce this so far.)
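One way to test the hypothesis that CPython's collector frees memory PyCall expects to own is to rerun the reproducer with CPython's cyclic GC switched off. In this sketch, run_with_cpython_gc_disabled and the reproducer callable are hypothetical names. If the segfault persists with the cyclic GC disabled, the cyclic collector is unlikely to be the trigger. This does not rule out Python-side frees entirely, because reference counting still deallocates an object as soon as its count reaches zero.

```python
import gc
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_cpython_gc_disabled(reproducer: Callable[[], T]) -> T:
    """Call `reproducer` with CPython's automatic cyclic GC off, then restore the previous state."""
    was_enabled = gc.isenabled()
    # gc.disable() stops automatic cyclic collection only; reference counting
    # still frees objects as soon as their reference count drops to zero.
    gc.disable()
    try:
        return reproducer()
    finally:
        if was_enabled:
            gc.enable()
```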
Maybe you can run your reproducer under valgrind; it may point to where the memory was allocated, or show that it was freed twice. You will probably need to look closely and filter out a lot of false positives, but the answer may be there. Another possibility is to use a memory sanitizer, as that normally tells you where the object was allocated.
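Part of valgrind's noise on the Python side comes from CPython's own small-object allocator (pymalloc). The CPython documentation describes PYTHONMALLOC=malloc, which routes Python allocations through the C library's malloc and is intended for use with external memory debuggers such as valgrind; the CPython source tree also ships a suppression file, Misc/valgrind-python.supp. The launcher below is a sketch under those assumptions: the helper names are hypothetical, the valgrind options mirror those used later in this thread, and any suppression paths are the caller's own.

```python
import os
import subprocess
import sys
from typing import Dict, List, Sequence


def build_valgrind_command(script: str, log_file: str, suppressions: Sequence[str] = ()) -> List[str]:
    """Build a valgrind command line that runs `script` with the current interpreter."""
    cmd = [
        "valgrind",
        "--smc-check=all-non-file",  # check self-modifying/JIT code in non-file-backed memory
        "--leak-check=full",
        "--show-leak-kinds=definite",
        "--track-origins=yes",
        "--trace-children=yes",
        f"--log-file={log_file}",
    ]
    cmd += [f"--suppressions={path}" for path in suppressions]
    cmd += [sys.executable, "-c", script]
    return cmd


def valgrind_env() -> Dict[str, str]:
    """Copy the current environment and switch CPython to the C library's malloc."""
    env = dict(os.environ)
    # Bypass pymalloc so valgrind sees each Python allocation individually.
    env["PYTHONMALLOC"] = "malloc"
    return env


def run_under_valgrind(script: str, log_file: str, suppressions: Sequence[str] = ()) -> int:
    """Run `script` under valgrind and return the process exit code."""
    cmd = build_valgrind_command(script, log_file, suppressions)
    result = subprocess.run(cmd, env=valgrind_env(), check=False)
    return result.returncode
```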
Thanks. I did a run of valgrind both on the pure Julia side and also the Python version that segfaults. It looks like most of the errors are just related to codegen and package loading (might just be false positives). I don't immediately notice anything stemming from the Python<->Julia interface.
Here is the valgrind output for the Python version, run with the following settings:
valgrind --smc-check=all-non-file --leak-check=full \
--show-leak-kinds=definite --track-origins=yes \
--verbose --log-file=valgrind-pysr-with-segfault2.txt \
--trace-children=yes --suppressions=/home/mc2473/juliavalgrind/julia/contrib/valgrind-julia.supp \
python -c 'from julia import Julia; jl = Julia(runtime="/home/mc2473/juliavalgrind/julia/usr/bin/julia", threads="auto"); from pysr import PySRRegressor as SR; model=SR(); model.fit([[1]], [1]); model.fit([[1]], [1])'
It's odd because running directly from Julia produces no errors, yet I don't really see anything related to Python in the errors other than maybe the ffi calls. In particular, valgrind says the error is from
I'm going to try rebuilding Julia and Python with a memory sanitizer; maybe that will help figure this out. For the record, I only see the segfault when running multi-threaded Julia, so I'm going to try a thread sanitizer too.
Crash report
What happened?
This is a segfault I am seeing on Python 3.12, when trying to use the Python and Julia runtimes simultaneously via the PyJulia package.
When an object is referenced by both the Julia and Python runtimes, there can be memory access errors: it seems as though Python is trying to free memory that has already been freed by Julia, or vice versa.
I am raising the issue here because it only started occurring on Python 3.12 and does not occur on 3.11. The Julia version does not seem to affect this behavior. So I am trying to understand what changes were made to the Python GC that might have triggered this, and whether the GC is perhaps more aggressive in some way.
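The ownership rule at stake for an object shared by two runtimes can be modeled in pure Python. In the hypothetical ForeignHandleTable below (an illustration, not PyCall's actual implementation), every handle the foreign side holds owns exactly one strong reference, and releasing it drops that reference exactly once. A second release is the Python-level analogue of a double Py_DECREF, which is the kind of double free described above. In C, the corresponding Py_DECREF must additionally be made by a thread that holds the GIL.

```python
from typing import Dict


class ForeignHandleTable:
    """Hypothetical model of a foreign runtime holding references to Python objects."""

    def __init__(self) -> None:
        self._owned: Dict[int, object] = {}
        self._next_handle = 0

    def acquire(self, obj: object) -> int:
        """Take one strong reference to `obj` and return an opaque handle for it."""
        handle = self._next_handle
        self._next_handle += 1
        self._owned[handle] = obj  # keeps obj alive while the foreign side uses it
        return handle

    def get(self, handle: int) -> object:
        """Return the object behind `handle`; using a released handle is an error."""
        if handle not in self._owned:
            raise RuntimeError(f"use after release of handle {handle}")
        return self._owned[handle]

    def release(self, handle: int) -> None:
        """Drop the strong reference exactly once; a second release is a bug."""
        if handle not in self._owned:
            raise RuntimeError(f"double release of handle {handle}")
        del self._owned[handle]
```

The table turns silent memory corruption into a loud error, which is the property a debug build of a bridge layer would want; the C version of the same rule is that each wrapper owns one reference and its finalizer decrements it once.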
Here is my current MWE, based on a package I maintain (PySR) that uses Julia as the backend for a Python frontend:
This is the smallest MWE I have been able to create thus far.
I also see the issue in my continuous integration tests on Python 3.12, but never before 3.12: MilesCranmer/PySR#450
For example, in one of those segfaults, I see the following backtrace:
I found this quite odd, as it seems as though the Julia and Python garbage collectors are interfering with each other. Here, it seems as though PyObject_Free is trying to free memory that was already freed. Perhaps one of the GCs is trying to free memory accessed by the other runtime. Looking at the backtrace, I suppose this could also be an issue with PyCall.jl (which calls Python functions from Julia), although it hasn't occurred in any previous Python version, so I'm not sure where the issue is coming from. Any help is appreciated. I am happy to provide as much debugging information as I can, as this issue is quite urgent to fix in the ecosystem of Python <-> Julia packages.
CPython versions tested on:
3.12
Operating systems tested on:
Linux, macOS, Windows
Output from running 'python -VV' on the command line:
Linux test performed on:
Python 3.12.1 (main, Dec 30 2023, 22:23:57) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)]