Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add devdocs on fixing precompile hangs #50914

Closed
wants to merge 7 commits into from
Closed

Conversation

timholy
Copy link
Member

@timholy timholy commented Aug 14, 2023

As promised, CC @MilesCranmer

I should say that these devdocs are not very satisfying. I didn't find it very useful to extract the PIDs, because I think the warning message from Pkg does the core job of telling you which causative package is hanging, and AFAICT that's the only useful thing you get from the PID. So the advice basically boils down to "comment half your code at a time and bisect until you've found the problematic lines, then stare at them until you figure out what's going on." There's a little more guidance than that, as you'll see if you review this PR, but I think we just don't have decent diagnostics to pinpoint what is going wrong.

@timholy timholy requested a review from vtjnash August 14, 2023 16:44
@timholy
Copy link
Member Author

timholy commented Aug 14, 2023

I'm also forgetting whether I have to do something to have this listed in the sidebar under "Developer Documentation", or whether that will happen automatically.

@timholy
Copy link
Member Author

timholy commented Aug 14, 2023

Other thoughts:

  • the beginnings of a systematic tool would to be to have a way of forcing re-compilation of all packages in a mode where each module's own __init__ function runs during precompilation. This would isolate the nastiest case to the source.
  • I kind of think we should link to this devdocs page the first time we show this warning. For that reason, I'm marking this as a blocking issue for the release of 1.10.

@timholy timholy added this to the 1.10 milestone Aug 14, 2023
@timholy timholy added the docs This change adds or pertains to documentation label Aug 14, 2023
@IanButterworth
Copy link
Member

Might it be worth mentioning using the pid with https://docs.julialang.org/en/v1/stdlib/Profile/#Triggered-During-Execution which will print the profile output into the Pkg.precompile report at the end? i.e. interrupt at least 2s seconds after sending the signal to give it time to run and print.

Or is profiling this kind of lingering task not very informative?

@timholy
Copy link
Member Author

timholy commented Aug 14, 2023

I'm not certain that profiling would capture a timer since it fires rarely, but honestly I don't know.

One issue is that getting the PID does not seem straightforward. The problem is that the detailed message doesn't show until you hit Ctrl-C, and at that point the process exits so there's nothing left to profile. The only way I've found to grab the PID while it's still running is to use julia> getpid() in the main Julia process from which I next trigger precompilation, and then in a separate shell run $ ps --ppid <JuliaMainPID> to learn what child processes got spawned from the main one. Seems like a pain.

@vtjnash
Copy link
Member

vtjnash commented Aug 14, 2023

The problem is that the detailed message doesn't show until you hit Ctrl-C, and at that point the process exits so there's nothing left to profile.

That is a separate issue, which Ian already has a couple PRs up to fix.

@IanButterworth
Copy link
Member

One issue is that getting the PID does not seem straightforward

JuliaLang/Pkg.jl#3583

@timholy
Copy link
Member Author

timholy commented Aug 15, 2023

Update: I interpret JuliaLang/Pkg.jl#3583 as suggesting that adding the PID doesn't really help much.

have a way of forcing re-compilation of all packages in a mode where each module's own init function runs during precompilation

If folks think this would be a good idea, I'd be happy to help with the C part of this effort (it's pretty trivial), but the Pkg logic is something that I could use some assistance with.

@vtjnash
Copy link
Member

vtjnash commented Aug 15, 2023

If folks think this would be a good idea, I'd be happy to help with the C part of this effort (it's pretty trivial), but the Pkg logic is something that I could use some assistance with.

The C part is just calling jl_task_wait_empty(). It could perhaps be added as a kwarg option to exit()?

@timholy
Copy link
Member Author

timholy commented Aug 15, 2023

Sorry, what I meant was "triggering __init__() during precompilation." This is about providing better tools to diagnose what I think of as the most annoying case to debug: where the hang is due to some work that happened in a dependency's __init__. The issue is that the blame does not obviously fall on the dependency, because the dependency's own __init__ doesn't run when it is being precompiled. So blame spuriously falls on the first package in the dependency tree to load the problematic package.

@vtjnash
Copy link
Member

vtjnash commented Aug 15, 2023

To be precise though, it fails on the first time the process calls jl_task_wait_empty after __init__, so you only need to trigger a call to that particular function after loading the package, and should not need to setup a call for anything else.

@timholy
Copy link
Member Author

timholy commented Aug 15, 2023

I see, so you're saying this could even be an Aqua test with no Julia changes required?

@vtjnash
Copy link
Member

vtjnash commented Aug 15, 2023

Potentially, yeah. That function just needs to be exposed somehow (which indeed will require a few small C changes)

@timholy
Copy link
Member Author

timholy commented Aug 19, 2023

6284153 now has the C infrastructure I think we'd need. I've verified that

~/src/julia/julia --startup=no --project -e 'using NoTimer; exit(; wait_task=true)'

exits cleanly for a fake package that does not start a timer in its __init__, while it hangs when it does start a timer. It remains to add an Aqua test, but I think this is probably enough of a proof-of-principle.

Alternately, @vtjnash, It occurs to me that you once surprised me by saying that we probably could run __init__ during precompilation. That might be an even cleaner fix, and eaiser to use. (With this approach, developers are going to have to learn to run Aqua.check_dependencies_for_hangs() whereas running __init__ during precompilation would make it obvious which package is causing the hang.) What do you think about solving it in this way? Is that a 1.10 change, or might that have too many potential complexities?

@sloede
Copy link
Contributor

sloede commented Aug 21, 2023

If I may ask: Since this is a very user-facing problem, why is the documentation buried in the devdocs? Shouldn't there be at least an explanation in the main manual and/or the Pkg docs as well?

@timholy
Copy link
Member Author

timholy commented Aug 22, 2023

Care to make a concrete suggestion of exactly where it should go?

base/initdefs.jl Outdated
@@ -25,8 +25,8 @@ Stop the program with an exit code. The default exit code is zero, indicating th
program completed successfully. In an interactive session, `exit()` can be called with
the keyboard shortcut `^D`.
"""
exit(n) = ccall(:jl_exit, Cvoid, (Int32,), n)
exit() = exit(0)
exit(n; wait_task::Bool=false) = ccall(:jl_exit, Cvoid, (Int32, Int32), n, wait_task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming bikeshedding time: should we say this is exit(0, process=true), since it specifies whether this call just kills this task or all tasks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with kill_tasks=true.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Just made a couple small comments

@timholy timholy added merge me PR is reviewed. Merge when all tests are passing and removed merge me PR is reviewed. Merge when all tests are passing labels Sep 4, 2023
@timholy
Copy link
Member Author

timholy commented Sep 4, 2023

Hmm, I realized one issue with the exit changes: when this problem is detected, often the easiest fix is to make the offender's __init__ function dependent on ccall(:jl_generating_output, Cint, ()) == 1 returning false. But adding an Aqua test calling exit(; kill_tasks=false) won't mimic that dependency, so it might lead to false positives. I'm a bit reluctant to "fake" the result of that ccall, though, that seems like a potentially-dangerous form of lying.

One thought would be to encourage people to write it

function __init__()
    if ccall(:jl_generating_output, Cint, ()) == 1 || parse(Bool, get(ENV, "JULIA_AQUA_TEST_INIT_TASKS", "false"))
        return nothing
    end
    ....
end

but that's less than elegant since it requires a new environment variable simply to avoid a false positive.

@timholy
Copy link
Member Author

timholy commented Sep 10, 2023

Given that the exit modification won't help detect working packages that check jl_generating_output, I propose that at a minimum we don't merge that part of this PR. It's a bit unclear to me whether we need devdocs or whether we can just point to the proposed Aqua test in JuliaTesting/Aqua.jl#174 (I tend to favor closing this altogether when that merges). After that merges, the plan would be to add a URL to the Aqua docs in the scary warning "Waiting for IO to finish..."

@@ -19,14 +19,17 @@ An array of the command line arguments passed to Julia, as strings.
const ARGS = String[]

"""
exit(code=0)
exit(code=0; kill_tasks=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't kill the tasks, just waits for them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is to kill all tasks (as before). The new optional behavior (kill_tasks=false) is to run the rest of the program to completion and only kill this task.

@@ -1852,7 +1852,7 @@ JL_DLLEXPORT int jl_is_initialized(void);
JL_DLLEXPORT void jl_atexit_hook(int status);
JL_DLLEXPORT void jl_task_wait_empty(void);
JL_DLLEXPORT void jl_postoutput_hook(void);
JL_DLLEXPORT void JL_NORETURN jl_exit(int status);
JL_DLLEXPORT void JL_NORETURN jl_exit(int status, int kill_tasks);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a breaking change; I think we should avoid it.

@JeffBezanson
Copy link
Member

Maybe doesn't need to be on the milestone?

@KristofferC
Copy link
Member

KristofferC commented Sep 11, 2023

when this problem is detected, often the easiest fix is to make the offender's __init__ function dependent on ccall(:jl_generating_output, Cint, ()) == 1 returning false.

One issue with this is that if you want to run a precompile workload involving that package as a dependency, the package will likely be non-functional. So I am not sure that is the best idea.

@vtjnash
Copy link
Member

vtjnash commented Sep 11, 2023

so it might lead to false positives

It somewhat only fixes the symptom, rather than the cause. The cause is that the user failed to clean up some resource when they should have, and so there is a leak in the code. In some cases, this code should (in addition to the precompile step to avoid making each and every precompile process expensive and slow) also have an explicit Base.uv_unref somewhere too to instruct Base that this is a resource that should not be cleaned up and should simply be terminated when the rest of the program is done.

@timholy
Copy link
Member Author

timholy commented Sep 12, 2023

I think "this" (meaning, a way of addressing the I/O hang, not actually this PR) does need to be on the milestone:

  • some packages break because of this change (including private ones we don't know about)
  • the message is opaque: almost no one will understand what it means, and it provides almost no guidance for fixing the problem

IMO it's the worst error in the language by a very large margin. But there seems to be very little we can do about it from a technical standpoint: for example, we don't have any tools for discovering where the task was launched from. So the only options seem to be documentation and, in difficult cases, encourage bisection-by-commenting.

That said, I no longer think the exit change is useful given #50914 (comment) and JuliaTesting/Aqua.jl#174. I plan to remove that, the question is whether we need this at all or if all the documentation should go in Aqua instead.

One issue with this is that if you want to run a precompile workload involving that package as a dependency, the package will likely be non-functional.

In JuliaTesting/Aqua.jl#174 I allude to that by saying "In more complex cases, you may need to set up independently-callable functions to launch the tasks and cleanly shut them down." But perhaps that section should be expanded.

@DilumAluthge DilumAluthge deleted the teh/devdocs_pchangs branch November 2, 2023 19:19
timholy added a commit that referenced this pull request Nov 13, 2023
This is #50914, with only the documentation changes, plus an improvement
to the warning message.

---------

Co-authored-by: Tim Holy <tim.holy@gmail.com>
Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>
KristofferC pushed a commit that referenced this pull request Nov 27, 2023
This is #50914, with only the documentation changes, plus an improvement
to the warning message.

---------

Co-authored-by: Tim Holy <tim.holy@gmail.com>
Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>
(cherry picked from commit 6d5787a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants