Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop running finalizers on exit #51466

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Stop running finalizers on exit #51466

wants to merge 7 commits into from

Conversation

kpamnany
Copy link
Contributor

@kpamnany kpamnany commented Sep 26, 2023

Closes #50038.

We've observed that when GC is triggered during exit, Julia can freeze during stop-the-world. This is a symptom of a general problem with running finalizers at exit -- program threads are still running and the various live objects are potentially in use. Pausing threads before running finalizers would possibly introduce other problems.

For this reason, as observed below, other languages (Java, .NET, Go) do not run finalizers on exit and with this PR, Julia will also stop running finalizers on exit.

@gbaraldi
Copy link
Member

Do we need to run a GC before then disabling it?

@kpamnany
Copy link
Contributor Author

Given the following backtrace, I don't know if running GC here could also freeze? That's why I added this disable at this point.

Thread 1 (Thread 0x7ffff7da7040 (LWP 3646561) "julia"):
#0  0x00007ffff7837700 in _mm_pause () at /nix/store/gwdq0r345gjpnsbllqjgrb8c8z23aiw9-gcc-10.3.0/lib/gcc/x86_64-unknown-linux-gnu/10.3.0/include/xmmintrin.h:1332
#1  jl_gc_wait_for_the_world () at /build/source/src/gc.c:209
#2  ijl_gc_collect (collection=collection@entry=JL_GC_AUTO) at /build/source/src/gc.c:3395
#3  0x00007ffff78385dd in maybe_collect (ptls=0x464b60) at /build/source/src/gc.c:910
#4  jl_gc_pool_alloc_inner (osize=32, pool_offset=<optimized out>, ptls=0x464b60) at /build/source/src/gc.c:1254
#5  ijl_gc_pool_alloc (ptls=0x464b60, pool_offset=<optimized out>, osize=32) at /build/source/src/gc.c:1303
#6  0x00007fffe39b9471 in julia_inline_apply!_14014 () at compiler/ssair/inlining.jl:1031
#7  0x00007fffe3bfa064 in julia_process_simple!_12824 () at compiler/ssair/inlining.jl:1129
#8  0x00007fffe39b25d6 in julia_assemble_inline_todo!_18174 () at compiler/ssair/inlining.jl:1364
#9  0x00007fffe32ed2b6 in julia_ssa_inlining_pass!_13386 () at compiler/ssair/inlining.jl:82
#10 0x00007fffe32ed48d in jfptr_ssa_inlining_passNOT._13387.clone_1 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/lib/julia/sys.so
#11 0x00007fffe3bf543a in julia_run_passes_13834 () at compiler/optimize.jl:539
#12 0x00007fffe39cd180 in optimize () at compiler/optimize.jl:504
#13 julia__typeinf_15708 () at compiler/typeinfer.jl:261
#14 0x00007fffe3bf0861 in julia_typeinf_17317 () at compiler/typeinfer.jl:217
#15 0x00007fffe3bf27ea in julia_typeinf_edge_14764 () at compiler/typeinfer.jl:881
#16 0x00007fffe3ae6789 in julia_abstract_call_method_14701 () at compiler/abstractinterpretation.jl:647
#17 0x00007fffe39a4de0 in julia_abstract_call_gf_by_type_13619 () at compiler/abstractinterpretation.jl:139
#18 0x00007fffe398aa91 in julia_abstract_call_known_15092 () at compiler/abstractinterpretation.jl:1716
#19 0x00007fffe398be84 in julia_abstract_call_15027 () at compiler/abstractinterpretation.jl:1786
#20 0x00007fffe3befe79 in julia_abstract_call_15039 () at compiler/abstractinterpretation.jl:1753
#21 0x00007fffe39c1b81 in julia_abstract_eval_statement_16908 () at compiler/abstractinterpretation.jl:1910
#22 0x00007fffe39c8d0b in julia_typeinf_local_17838 () at compiler/abstractinterpretation.jl:2386
#23 0x00007fffe39cbd2e in julia_typeinf_nocycle_14583 () at compiler/abstractinterpretation.jl:2482
#24 0x00007fffe39ccbd9 in julia__typeinf_15708 () at compiler/typeinfer.jl:234
#25 0x00007fffe3bf0861 in julia_typeinf_17317 () at compiler/typeinfer.jl:217
#26 0x00007fffe3b2b4f7 in julia_typeinf_ext_12957 () at compiler/typeinfer.jl:971
#27 0x00007fffe330e45f in julia_typeinf_ext_toplevel_14792 () at compiler/typeinfer.jl:1004
#28 0x00007fffe330e813 in julia_typeinf_ext_toplevel_14788 () at compiler/typeinfer.jl:1000
#29 0x00007fffe330e853 in jfptr_typeinf_ext_toplevel_14789.clone_1 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/lib/julia/sys.so
#30 0x00007ffff77db6ac in jl_apply (nargs=3, args=0x7fffffffac00) at /build/source/src/julia.h:1842
#31 jl_type_infer (mi=mi@entry=0x7fffe479c660 <jl_system_image_data+8554016>, world=world@entry=32254, force=force@entry=0) at /build/source/src/gf.c:319
#32 0x00007ffff736dd15 in jl_generate_fptr_impl (mi=0x7fffe479c660 <jl_system_image_data+8554016>, world=32254) at /build/source/src/jitlayers.cpp:332
#33 0x00007ffff77d9fc7 in jl_compile_method_internal (world=32254, mi=0x7fffe479c660 <jl_system_image_data+8554016>) at /build/source/src/gf.c:2081
#34 jl_compile_method_internal (mi=0x7fffe479c660 <jl_system_image_data+8554016>, world=32254) at /build/source/src/gf.c:2025
#35 0x00007ffff77daeed in _jl_invoke (world=32254, mfunc=<optimized out>, nargs=1, args=0x7fffffffaf68, F=0x7fffe479c3f0 <jl_system_image_data+8553392>) at /build/source/src/gf.c:2359
#36 ijl_apply_generic (F=<optimized out>, args=0x7fffffffaf68, nargs=<optimized out>) at /build/source/src/gf.c:2549
#37 0x00007ffff782cfb7 in jl_apply (nargs=2, args=0x7fffffffaf60) at /build/source/src/julia.h:1842
#38 run_finalizer (ct=ct@entry=0x7fffedf08010, o=<optimized out>, ff=<optimized out>) at /build/source/src/gc.c:283
#39 0x00007ffff782de35 in jl_gc_run_finalizers_in_list (ct=ct@entry=0x7fffedf08010, list=list@entry=0x7fffffffb0f0) at /build/source/src/gc.c:370
#40 0x00007ffff782dfef in run_finalizers (ct=0x7fffedf08010) at /build/source/src/gc.c:413
#41 0x00007ffff77fba8b in ijl_atexit_hook (exitcode=exitcode@entry=0) at /build/source/src/init.c:236
#42 0x00007ffff7841618 in jl_repl_entrypoint (argc=<optimized out>, argv=0x7fffffffb608) at /build/source/src/jlapi.c:720
#43 0x0000000000401059 in main (argc=<optimized out>, argv=<optimized out>) at /build/source/cli/loader_exe.c:59

@gbaraldi
Copy link
Member

The comment in

        // we are about to start tearing everything down, so lets try not to get
        // upset by the local mess of things when we run the user's _atexit hooks
        // this also forces us into a GC-unsafe region without a safepoint
        jl_task_frame_noreturn(ct);

is suspicious

@kpamnany
Copy link
Contributor Author

Ah. So it looks like after jl_task_frame_noreturn(ct); GC cannot run safely?

@vtjnash
Copy link
Member

vtjnash commented Sep 27, 2023

jl_task_frame_noreturn

A GC-unsafe region is a place where allocations are safe

presumably because some threads have been torn down already

When threads are torn down (if they do get torn down at all) they end in such a way that GC will ignore them. If you wanted to ensure they get torn down correctly, there will be a new argument to exit to enable that, with kill_tasks=false in #50914

@kpamnany
Copy link
Contributor Author

When threads are torn down (if they do get torn down at all) they end in such a way that GC will ignore them. If you wanted to ensure they get torn down correctly, there will be a new argument to exit to enable that, with kill_tasks=false in #50914

No, that was only a theory as to why the thread whose backtrace I posted might have frozen in stop-the-world. The main issue is that the thread froze. If you look at the linked issue, the other backtraces suggest that some threads were stuck in sigprocmask...

This PR basically tries to duck the problem by turning off GC altogether, because we're exiting anyway.

@vtjnash
Copy link
Member

vtjnash commented Sep 27, 2023

other backtraces suggest that some threads were stuck in sigprocmask...

Those threads aren't stuck (sigprocmask can't get stuck as it is just a syscall), but they are spending most of their time there. The kill_tasks=false flag ensures that the child threads are allowed to pause gracefully, rather than sending them into an exceptional state where things can go very badly. The issue with them getting stuck I thought would have been addressed by #41616, but I am assuming you have that commit already? (n.b. that commit is not in the official v1.9 release)

@gbaraldi
Copy link
Member

So just for some context me and @vchuravy believe this might be similar or exactly the same as the hangs we see atexit on the libuv thread pr. https://buildkite.com/julialang/julia-master/builds/27576#018a85fd-4452-4f20-bdbe-3cade4f7d0d1

@kpamnany
Copy link
Contributor Author

kpamnany commented Sep 27, 2023

This is getting a bit confusing.

The kill_tasks=false flag ensures that the child threads are allowed to pause gracefully, rather than sending them into an exceptional state where things can go very badly.

Okay, 50914 looks pretty neat. But in this case, I'm not actually concerned about the threads shutting down gracefully or otherwise. I just don't want to freeze on exit.

The issue with them getting stuck I thought would have been addressed by #41616, but I am assuming you have that commit already? (n.b. that commit is not in the official v1.9 release)

It's not in 1.9, yeah. Looks like I should try to backport #41616 and also #47393? And that might prevent this freeze in stop-the-world on exit? Cc: @vchuravy for confirmation.

So just for some context me and @vchuravy believe this might be similar or exactly the same as the hangs we see atexit on the libuv thread pr. https://buildkite.com/julialang/julia-master/builds/27576#018a85fd-4452-4f20-bdbe-3cade4f7d0d1

So @gbaraldi, you don't think that 41616 and 47393 will help?

@gbaraldi
Copy link
Member

So looking at this issue and a couple others, I believe that running all finalizers like this at exit is not really tenable. Java and .NET went away from this because they were encountering the same issues as us.
i.e dotnet/runtime#16028
and dotnet/coreclr@cc9c314
dotnet-bot/corert@802eb6a

.net added something similar to our atexit call and changed to not running anything.

Their reasons were, you need to block all threads before running finalizers, because if you execute the finalizer of a live object it's now undefined. But blocking all threads is a syncronization issue, the decision was made to just not do it, and say that it's the users responsibility to ensure that resources get cleaned up before exiting.

@kpamnany
Copy link
Contributor Author

No longer running finalizers on exit is a more dramatic change that probably needs more thought/discussion. What this PR is doing is a much smaller change, no?

@gbaraldi
Copy link
Member

What I'm saying is that I don't think this is enough. We are still running finalizers of objects that threads can still reach

@JeffBezanson
Copy link
Member

Running arbitrary code with GC disabled is certainly dangerous.

@kpamnany
Copy link
Contributor Author

Okay. So what's a good solution here? Stop running finalizers at exit as .NET did?

@vchuravy
Copy link
Member

Yeah, if people need to run things at ext they should use atexit hooks. running finalizers and putting the objects into an illegal state while other code can still access them seems really bad. I think dotnet, Java, and Go are not running finalizers on exit.

@kpamnany kpamnany changed the title At exit, disable GC before running finalizers Stop running finalizers on exit Sep 30, 2023
@kpamnany
Copy link
Contributor Author

I've changed this to remove the call to jl_gc_run_all_finalizers in jl_atexit_hook. I'm not clear on what the overall impact of this change would be.

@PallHaraldsson
Copy link
Contributor

Will this be backported (to 1.10 at least)?

@kpamnany
Copy link
Contributor Author

kpamnany commented Oct 2, 2023

Bump.

@kpamnany
Copy link
Contributor Author

kpamnany commented Oct 2, 2023

Can/should we run PkgEval on this to see if there's an impact?

@vchuravy
Copy link
Member

vchuravy commented Oct 2, 2023

@nanosoldier runtests()

Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a docs update + news

Grepping for jl_gc_run_all_finalizers makes it show up in a few places. That document the behaviour.

@vchuravy vchuravy added GC Garbage collector needs docs Documentation for this change is required needs news A NEWS entry is required for this change minor change Marginal behavior change acceptable for a minor release labels Oct 2, 2023
@vchuravy
Copy link
Member

cc: @barche for #19911

Add a `compat` note that as of Julia 1.11, finalizers will not be run
on exit.
@NHDaly
Copy link
Member

NHDaly commented Oct 12, 2023

@vchuravy yeah agreed, re: "unpublicized" things. Some careful documentation can help, but it would still be better if it could be enforced in some way, to truly be a testing-only tool.

Anyway, like I said, not worth holding this up.

@JeffBezanson
Copy link
Member

I agree that we should not run finalizers at exit. A finalizer runs when the following two conditions are met:

  1. The object is unreferenced. That happens when you reach a particular point in the program's execution history. It's a synchronous event.
  2. GC runs. That is an unpredictable event based on heuristics and amount allocated etc. Installing more RAM will change when your finalizer runs.

For (1), exiting does not necessarily mean we have reached that event. For (2), it's our decision when to run GC, and on exit we're dropping the whole heap so we can certainly decide not to run it. So neither of these conditions are met, so we should not run finalizers.

@vtjnash
Copy link
Member

vtjnash commented Oct 13, 2023

commit 4796390
Author: Jeff Bezanson jeff.bezanson@gmail.com
Date: Fri Jul 5 14:24:37 2013 -0400

run all finalizers on exit. fixes #3624

@NHDaly
Copy link
Member

NHDaly commented Oct 13, 2023

(I guess disabling this won't re-break #3624 because now we have atexit_hooks right?)

@vtjnash
Copy link
Member

vtjnash commented Oct 13, 2023

Disabling this will reopen #3624 because it will revert the fix for it

@kpamnany
Copy link
Contributor Author

Unfortunately, that seems to be true. E.g. one of the test failures on this PR is, I think, because of this (shown below)?

function Process(cmd::Cmd, handle::Ptr{Cvoid})
    this = new(cmd, handle, devnull, devnull, devnull,
               typemin(fieldtype(Process, :exitcode)),
               typemin(fieldtype(Process, :termsignal)),
               ThreadSynchronizer())
    finalizer(uvfinalize, this)
    return this
end

In this case (but not in all cases), I think we want to do the same thing at exit as we do when a Process goes out of scope.

Suppose we add:

function atexit(f, x)
    _x = WeakRef(x)
    atexit() do
        _x.value !== nothing && f(x)
    end
    nothing
end

Then, would adding atexit(uvfinalize, this) after the finalizer(uvfinalize, this) do the right thing?

@kpamnany
Copy link
Contributor Author

But another thread could be running a task that is working with that Process. Then calling uvfinalize on the Process would cause that task to error.

We could prevent that by suspending threads in jl_atexit_hook. But then we're back to the original problem because whether running an atexit hook or a finalizer, we may have to compile it and some thread could be holding a needed lock when it is suspended => freeze. Compilation could trigger GC, and suspended threads can't reach safepoints => freeze.

Maybe we throw an InterruptException to all currently executing tasks and set a flag that stops all threads from picking up other tasks?

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Oct 13, 2023

We can have arbitrary many open files, and not flushed, so also flushing and thus closing can take an arbitrary long time, and thus exit(), if it done by triggering finalizers (or explicitly)?!

That doesn't seems good. Also does the order of flusing matter (I think it might, at least sometimes), and do finalizers run in that wanted order? If they don't or can't then I don't think we should use them for this.

as observed below, other languages (Java, .NET, Go) do not run finalizers on exit

Why do those get away with not doing it? Don't they flush?

All open files (file handles) are known, can we keep e.g. a queue of say last 10 not-yet flushed file handles and close them in the right order on exit? If we write to a file we add it to the front of the queue and flush the 11th file that then drops off it. On exit this then is a bounded (and very quick) problem. We could additionally do run(fsync), which is not done by flush I think.

Maybe finalizers do this in effect, I'm not sure, but they are for more (more than just files of any kind), and also for open "files" i.e. file handles to the web. I'm not sure if it's worth flushing those, nor do I know if we have a good way of knowing which "files" are really files and local. Unclear what we should do for remote filesystems.

@vtjnash
Copy link
Member

vtjnash commented Oct 13, 2023

Why do those get away with not doing it? Don't they flush?

They likely just don't flush? The C and C++ standards require running these finalizers (aka destructors) because stdout is flushed by them, and so yes, they will block on exit until that aspect of finalization is complete.

But for many GC languages (e.g. Java in that list), failing to call flush and/or close may result in the file itself getting truncated anyways due to finalizer ordering.

@PallHaraldsson
Copy link
Contributor

truncated anyways due to finalizer ordering.

I was only asking for (those) GC languages, why they get away with it, and you seem to confirm they put the responsibility on the user, i.e. even if they had run finalizers (e.g. on exit) that would not have worked (always).

So I think we should just not run the finalizers on exit. Destructors are a different story (in C++, they run unless you abort(), you can also do that, to exit...), C doesn't have them so responsibility is fully on the developer, and not on any standard to to it for them, just suggest what you should do.

Mojo has destructors I believe, but goes further than C++, it deallocates (heap) memory earlier than C++ can, so I assume if closes (and implicitly flushes) files earlier too. We should likely emulate that for files, and for heap deallocation, when possible.

@vtjnash
Copy link
Member

vtjnash commented Oct 13, 2023

C doesn't have them

C accidentally does have them, since it requires that struct FILE have them (sometimes exposed via non-portable functions like fcloseall), though you often have to use compiler extensions to get access to them for your own types

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Oct 13, 2023

I'm confused, Julia's exit calls jl_exit which calls C's exit, and it should flush files, thus no need to run finalizers for that?! Or is it about the wrong order, and C does it wrong? No, C doesn't have the destructor (concept), what I meant by "doesn't have", though file flushing/closing is sort of similar, but restricted to that, unlike in C++ where they are responsible for all cleanup (by default at least).

In some cases we call C's _exit(1) (or even _exit(0) which doesn't flush, as designed, are we incorrectly calling that?).

https://pubs.opengroup.org/onlinepubs/9699919799/functions/exit.html

The exit() function shall then flush all open streams with unwritten buffered data and close all open streams. Finally, the process shall be terminated [CX]

It does though have that CX footnote:

The functionality described is an extension to the ISO C standard. Application developers may make use of an extension as it is supported on all POSIX.1-2017-conforming systems.

With each function or header from the ISO C standard, a statement to the effect that "any conflict is unintentional" is included. That is intended to refer to a direct conflict. POSIX.1-2017 acts in part as a profile of the ISO C standard,

Do we for sure use POSIX.1-2017 (extension?), is that something we can and should opt into? But do not do and miss out?

https://pubs.opengroup.org/onlinepubs/9699919799/functions/_Exit.html

_exit(128 + signo);

This seems to be correct:

_exit(1);

but I'm not sure this is always correct:

_exit(128 + signo);

e.g. for SIGTERM too, i.e. exit code 143 = 128+15? Open issue for such here:

DataDog/datadog-agent#3940

https://stackoverflow.com/questions/18499497/how-to-process-sigterm-signal-gracefully

@NHDaly
Copy link
Member

NHDaly commented Oct 25, 2023

Then, would adding atexit(uvfinalize, this) after the finalizer(uvfinalize, this) do the right thing?

It seems to me that the most pragmatic solutions here are:

  • The proposal that we should simply be taking advantage of C's atexit hooks, which could already flush the files somehow? But it's unclear why this isn't happening for us already. Does uvfinalize have extra work to do that isn't covered by C's built-in flushing?
  • @kpamnany's proposal to flush files in an atexit hook, rather than in finalizers.

I think that the atexit hook proposal is very straightforward and should be possible to implement. Maybe we can proceed with that, until/unless we get more information on using C's atexit hooks?

Kiran: Rather than registering a new atexit hook for each file, you could also add the files to a per-thread WeakKeySet (WeakKeyDict{File,Nothing}) of open files, and then have a single atexit hook that flushes all the files in those dicts? This could reduce contention on the atexit hook lock, and also prevent a memory leak of those callbacks building up for every file that was ever opened.
(Are we guaranteed that user tasks are no longer running once the atexit hooks are running?)

@kpamnany
Copy link
Contributor Author

C's atexit hooks will flush buffered files, i.e. FILE *s. libuv uses fds under the hood, but as an event-driven system, implements a different sort of buffering.

jl_atexit_hook does clean up libuv, so it isn't clear to me why #3624 gets reopened by this change (I checked and it does 😢). Need to dig deeper to understand what's happening.

Implementing the atexit hooks to replace uvfinalize should use a WeakKeyDict, I agree. But there are still some unknowns that I need to dig into before getting to the implementation.

@ericphanson
Copy link
Contributor

btw, I think packages like Arrow and CSV rely on finalizers running on exit to release mmap'd memory, since they mmap inputs by default

@vchuravy
Copy link
Member

One thought I had the other day. Currently it is assumed in a few places that finalizers run in a certain sequence.

E.g. A references B means the finalizer for A runs before the finalizer for B. I have over the years observed issues with that and I think it might be due to finalizers at shoutdown ignoring the order?

@gbaraldi
Copy link
Member

That might be possible, because we schedule the whole list without much care

@vtjnash
Copy link
Member

vtjnash commented Dec 11, 2023

We should run them in the reverse of the order they were added (c.f. jl_gc_run_finalizers_in_list) if they both die at the same time. If two unrelated objects die around the same time, their order is unpredictable (since we may schedule the finalizers for one, and then, before running it, then schedule and run the finalizers for the other). Having a separate, single thread for finalizers to run on might fix some of that?

@vchuravy
Copy link
Member

Reverse order may help, but you can easily run into a situation where a younger object holds a reference to something older and depends on that state to still be valid.

This reinforces my belief that we should do this PR instead of promising that finalizers will run before the end of the program.

@vtjnash
Copy link
Member

vtjnash commented Dec 11, 2023

Reverse order may help, but you can easily run into a situation where a younger object holds a reference to something older and depends on that state to still be valid.

How would you construct that situation? Either they die at the same time, or the younger one dies first. The older one cannot die first, since there is a reference to it

@vchuravy
Copy link
Member

Isn't that the point? By running finalizers at exit (and not just those that are provable dead) we can construct such a situation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code GC Garbage collector triage This should be discussed on a triage call
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rare freeze on exit
9 participants