
Replace current, not global logger in safe loggers #622

Merged: 5 commits into JuliaGPU:master on Aug 28, 2024

Conversation

@danielwe (Contributor)

GPUCompiler's safe loggers temporarily replace the global logger with a simple non-yielding logger. But this does nothing if a different current logger is set, in which case the loggers are unsafe and will error if the custom logger yields, as seen in fonsp/Pluto.jl#3012. This PR changes the safe loggers to use with_logger instead of global_logger, to ensure that the yield-free logger is the one that's actually used.

In addition, I made it so that the safe logger inherits Logging.shouldlog from the current logger, not just Logging.min_enabled_level. The reasoning is that many loggers set min_enabled_level to the minimum possible value and rely on shouldlog for filtering; Pluto's logger is a case in point. I implemented a dedicated SafeLogger type to achieve this. Let me know if this is too heavy and you'd rather stick to a more barebones design without a dedicated type (in which case I think min_enabled_level should inherit from the global logger rather than the current logger).
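For concreteness, here is a minimal sketch of what such a dedicated type could look like. This is illustrative rather than the actual PR code: the forwarded methods match the description above, but the Core.println-based message handling and output format are assumptions.

using Logging

# Illustrative sketch: forward filtering decisions to the logger that was
# current when the safe logger was created, and keep handle_message free of
# task switches by printing via Core.println instead of Base I/O.
struct SafeLogger{L<:AbstractLogger} <: AbstractLogger
    parent::L
end

Logging.min_enabled_level(logger::SafeLogger) =
    Logging.min_enabled_level(logger.parent)

Logging.shouldlog(logger::SafeLogger, level, _module, group, id) =
    Logging.shouldlog(logger.parent, level, _module, group, id)

function Logging.handle_message(::SafeLogger, level, message, _module, group,
                                id, file, line; kwargs...)
    # Core.println does not perform task switches, unlike Base I/O
    Core.println("[", string(level), "] ", string(message))
end

# Installed for the current task only, instead of swapping the global logger:
with_logger(SafeLogger(current_logger())) do
    @info "hello from a yield-free logging context"
end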

I'm not sure how to write tests for this, since I'm not sure how to call the @safe_<loglevel> macros from a context where task switching raises an error. Suggestions or edits to the PR welcome.

@maleadt (Member) commented Aug 27, 2024

This is an easy way to trigger the issue when using regular loggers:

julia> GPUCompiler.@locked f() = @info "2"
f (generic function with 1 method)

julia> @sync begin
       @async while true
       @info 1
       end
       @async f()
       end
[ Info: 1
[ Info: 2
Assertion failed: (ptls->locks.len == 0), function ctx_switch, file task.c, line 439.

[44424] signal 6: Abort trap: 6
in expression starting at REPL[10]:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 23403711 (Pool: 23398303; Big: 5408); GC: 52

I.e., it's not allowed to switch tasks (as Base I/O does) when operating under the codegen lock. The @async is needed to ensure there's another task to switch to.
The assertion only fires on an assertions build; on other builds this may or may not segfault.

@danielwe (Contributor, Author)

Thanks! I added some tests that reliably segfault with the old safe loggers but pass with the fixed ones. I also added a test for @safe_show that redirects stdout through an async pipe. This passes as written but segfaults if I replace GPUCompiler.@safe_show with @show.
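For reference, the @safe_show test is roughly along these lines (a sketch under assumptions, not the test as committed; locked_show is a name invented for illustration):

using Test, GPUCompiler

# Print from inside the codegen lock; swapping @safe_show for @show here is
# what makes the process abort.
GPUCompiler.@locked locked_show(x) = GPUCompiler.@safe_show x

old_stdout = stdout
pipe = Pipe()
Base.link_pipe!(pipe; reader_supports_async=true, writer_supports_async=true)
redirect_stdout(pipe)           # stdout now goes through an async pipe
try
    locked_show(42)
finally
    redirect_stdout(old_stdout)
    close(pipe.in)              # close the write end so read() sees EOF
end
@test occursin("42", read(pipe, String))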

I have second thoughts about forwarding shouldlog to the current logger. What if the current logger's shouldlog yields? In principle it only takes a heap allocation, right? (Though I couldn't trigger a segfault in the tests by making shouldlog allocate, I had to insert an explicit yield() for that.)

Of course, if we're unlucky, the forwarded min_enabled_level could also yield, but that seems much less likely; min_enabled_level usually returns a constant, while shouldlog can contain arbitrarily complex filtering logic.

I'll push another commit with a simplified design that only forwards min_enabled_level. Then just let me know which design you prefer, and I'll squash commits as appropriate.

@maleadt (Member) commented Aug 27, 2024

What if the current logger's shouldlog yields?

Looking at some implementations of shouldlog, that seems likely: https://github.com/JuliaLang/julia/blob/f457a7561526aa28e20c4b2a8ff1c650f0c0f910/stdlib/Test/src/logging.jl#L91-L100 (reentrant locks yield when contended)
Given how many custom loggers implement Logging.shouldlog, I guess it would be best to avoid calling it altogether.
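As a concrete illustration of the concern (the logger type and rate-limiting logic here are invented for the example, not taken from the linked code), a shouldlog that takes a ReentrantLock can switch tasks whenever the lock is contended:

using Logging

# Illustrative logger whose filtering step can yield: lock() on a contended
# ReentrantLock suspends the current task, which is exactly what must not
# happen under the codegen lock.
struct RateLimitedLogger <: AbstractLogger
    lock::ReentrantLock
    counts::Dict{Any,Int}
end
RateLimitedLogger() = RateLimitedLogger(ReentrantLock(), Dict{Any,Int}())

Logging.min_enabled_level(::RateLimitedLogger) = Logging.BelowMinLevel

function Logging.shouldlog(logger::RateLimitedLogger, level, _module, group, id)
    lock(logger.lock) do            # may yield if another task holds the lock
        n = get(logger.counts, id, 0)
        logger.counts[id] = n + 1
        n < 10                      # drop a message id after 10 occurrences
    end
end

Logging.handle_message(::RateLimitedLogger, level, message, args...; kwargs...) =
    println(stderr, "[", level, "] ", message)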

@danielwe (Contributor, Author) commented Aug 27, 2024

In that case, let me know if you want me to make any edits to the code, comments, or git log/history; otherwise, I think this is good to go if you want it.

As a side note, I'm hitting this issue through Enzyme, but Enzyme is restricted to GPUCompiler <= 0.26. Any chance this could be backported to a 0.26.8 release? Alternatively, @wsmoses, do you foresee upgrading Enzyme to GPUCompiler 0.27 any time soon?

Edit: never mind, I just realized the most recent Enzyme is already on GPUCompiler 0.27.

Edit 2: Oh, but Enzyme is still incompatible with GPUCompiler 0.27 due to disjoint compat bounds for LLVM, so my question above still stands.

@maleadt (Member) commented Aug 27, 2024

We generally don't do backport releases here. If upgrading Enzyme.jl to the latest LLVM.jl/GPUCompiler.jl remains a problem, I'm happy to do one (you can create a PR for that if you want), but I know that @wsmoses is already working on upgrading Enzyme.jl to the latest versions of these packages, so it may be easier to wait a bit.

@maleadt merged commit db9780b into JuliaGPU:master on Aug 28, 2024 (1 check passed)
@danielwe (Contributor, Author) commented Sep 2, 2024

Enzyme just released a version that's compatible with LLVM 9, so no backport needed. Do you know when you'll cut the next GPUCompiler release including this patch?

@maleadt (Member) commented Sep 2, 2024

I've tagged a new version.
