Still getting GC crashes even with GAP_jll disabling task stack scanning #1032
I get the following on my notebook with Julia 1.8.5 and with GAP.jl as in the current master branch.
|
@ThomasBreuer can you please also tell me the output of |
output of
|
Trying to reproduce the crash on 8 systems concurrently (one Mac, seven Linux machines; with different CPUs, RAM, Linux distros, etc.). So far, one machine crashed in
|
Also tried |
Will next try to set up a Linux VM on my Mac with a reduced heap size to see if that helps reproduce the issue. (The (sensible!) suggestion to SSH into the CI VMs is unfortunately of limited use to me, as those environments are too limited and slow for serious debugging.) |
I've pushed two updates for GAP_jll. The first rebuilds it once again, this time against the latest libjulia_jll (I made a mistake in the previous rebuild!); this may affect Julia nightly but should not matter for 1.11 and older. The second disables "precise marking", which I suspect may help with the crashes (if so, then I think it masks the issue rather than fixes it, but that's a separate concern). I also (re)discovered https://github.com/nektos/act, which may allow me to reproduce the crashes locally. |
Recently the CI tests on gap-system/gap involving GAP.jl started to crash when run in Julia nightly. See for example https://github.com/gap-system/gap/actions/runs/11495638665/job/31995581546?pr=5823
However, the "same" tests work here. I think this is because of the GC workaround patch I am including in GAP_jll:
From cfc9fe38fed4c09f37e5771265415f464592fbe2 Mon Sep 17 00:00:00 2001
From: Max Horn <max@quendi.de>
Date: Tue, 3 Sep 2024 23:12:40 +0200
Subject: [PATCH] kernel: disable task stack rescan optimization

This seems to prevent recent crashes in GAP.jl. It is not a viable
long-term solution as it causes a severe performance penalty, but
it is worthwhile to experiment with this to see if this really
fixes all the crashes.
---
 src/julia_gc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/julia_gc.c b/src/julia_gc.c
index d44c69fc5..a3d3ee0fe 100644
--- a/src/julia_gc.c
+++ b/src/julia_gc.c
@@ -621,7 +621,7 @@ static void GapTaskScanner(jl_task_t * task, int root_task)
         // age bit back to new if tasks are being switched.
         jl_taggedvalue_t * tag = jl_astaggedvalue(task);
         if (tag->bits.gc & 2)
-            rescan = 0;
+            rescan = 1;
     }
 
     char *active_start, *active_end, *total_start, *total_end;
-- 
2.46.0
Put another way: it seems the elusive GC crash that I couldn't reproduce now happens all the time with Julia nightly. That definitely needs to be looked into. On the upside, perhaps that means I can finally reproduce the issue again locally, and thus debug it properly! Now the main problem will be to find a date where there are not 5 other things with deadlines waiting for me to resolve them so I can focus on this :-( |
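For readers who don't know the GAP kernel, the effect of the one-line patch above can be modelled with a small standalone C sketch. Everything in it is a hypothetical stand-in: `toy_task_t`, `TOY_GC_OLD` and both functions are invented for illustration and are not the definitions from julia.h or src/julia_gc.c. It also assumes that `rescan` starts out as 1 before the check, which the hunk itself doesn't show, and that bit value 2 is the "old" age bit, as the context comment in the hunk suggests.

```c
#include <assert.h>

/* Hypothetical stand-in for the age bit tested by `tag->bits.gc & 2`. */
enum { TOY_GC_OLD = 2 };

/* Hypothetical stand-in for a task's tagged header. */
typedef struct {
    unsigned gc_bits; /* models tag->bits.gc */
} toy_task_t;

/* Unpatched behaviour: a task whose old bit is set is not rescanned.
 * Assumption: rescan defaults to 1 (not visible in the hunk). */
int rescan_before_patch(const toy_task_t *task)
{
    int rescan = 1;
    if (task->gc_bits & TOY_GC_OLD)
        rescan = 0; /* the optimization: skip the stack rescan */
    return rescan;
}

/* Patched behaviour: the branch is still there, but it no longer
 * turns the rescan off, so every task stack is rescanned. */
int rescan_after_patch(const toy_task_t *task)
{
    int rescan = 1;
    if (task->gc_bits & TOY_GC_OLD)
        rescan = 1; /* optimization disabled */
    return rescan;
}
```

In this model the two functions only differ for tasks with the old bit set. That matches the commit message: the patch trades the skipped rescans (hence the performance penalty) for always scanning the stack.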
I don't want to kill all of your confidence, but in the Julia nightly CI there is a build warning. I only found it because locally it results in an error instead (and IMO it should do so in CI as well). https://github.com/oscar-system/GAP.jl/actions/runs/11496373988/job/31997861877?pr=1059#step:7:282
|
thanks. indeed we should treat warnings as errors. and report an issue to julia that we need jl_gc_new_weakref to be public again. will do |
I am already working on the julia PR |
actually we could also switch to using jl_gc_new_weakref_th ... |
Yeah, true. This function seems to be available since JuliaLang/julia@f5b224d, which is in julia 0.5. |
using the newest julia nightly with the new |
... at least in CI, I cannot yet reproduce it locally.