-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory error during precompilation #44345
Conversation
Thanks for acting on this bug so quickly! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot be correct, since it doesn't change the access patterns, just the initialization size
Operationally it fixes the problem, though I agree it's not the cause. Just checking, you saw #44338 (comment), right? The hang happens when the Is there a lock I need to acquire or something? |
Looks like memory corruption? E.g. some of malloc's internal state getting clobbered. |
I'll try firing up |
Tested this PR and it works with my system & all the packages. |
Valgrind suggests there's a leak:
There are several "copies" of this report. But of all the htable uses, the only leak seems to come from here. Consequently, I suspect this isn't directly related. valgrind seems to shed no light on the problem. Maybe ASAN would? But see #44361. I'm pretty stuck until that's fixed. |
Definitely looks like a job for rr. |
This was tricky. In case it helps someone else, I usually get fatal
which is kind of a bummer. I found one system where I was able to capture a trace, but then during replay
So I ran a for i in {1..500}
do
rm -rf ~/.julia/compiled/v1.9/*
date -Iseconds && rr record ./julia --startup=no --project=/tmp/tim/pkgs2 -e 'using Pkg; Pkg.offline(true); Pkg.precompile()'
done (the Now it's tricky. The error is in malloc.c so I installed the
and (rr) p bck->fd
Cannot access memory at address 0x13 which seems like it might explain the crash. AFAICT (and I do not have a clear understanding of how malloc.c works),
So it looks a bit as if a check for a potential error condition failed to make sure that the check was valid? I don't know if anyone else has any other thoughts, but I may have reached the limit of how far I can carry this. If there is a possibility that this is a malloc bug and/or no one else has any good ideas, then it seems we have two options:
Thoughts? |
Found it! Phew. There's no reason I can see not to do the pre-allocation, so I decided to add the real fix to this PR. But I've tested this change in isolation, and it solves the problem on its own (>1000 successful precompilation of Crayons without a hitch). |
@vtjnash Does this look good to you now? |
This definitively fixes a serious bug and there is otherwise not much to this. I don't think it needs to wait on a review. |
htable_new(&visited, 0); | ||
if (list) { | ||
assert(jl_is_array(list)); | ||
size_t n0 = jl_array_len(list); | ||
htable_new(&visited, n0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The htable_free
below now access uninitialized memory and must be moved also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed in #44446
I don't know this part of the code, but the change makes sense now |
Fixes #44338
I tested this with
which only intermittently triggers #44338, but the frequency was well above 1 in 10 and probably more like 1 in 3. After a few manual runs, I did this:
49 of these 50 runs completed successfully. The one exception threw this backtrace:
which is also htable-related, but one that we've used this way for a long time. In particular, the crash probably occurred in a part of that line that has been part of Julia for 5 years:
julia/src/dump.c
Line 1077 in fb85cb3
One other interesting case: while almost all of these runs took between 90-92s each to compile, on one run it took 436s. I don't have any data about what caused that, so it's not really actionable.