
Fix data race in inference profiling (@snoopi_deep). #47258

Closed
wants to merge 1 commit

Conversation

NHDaly (Member) commented Oct 20, 2022

Previously, there wasn't a lock around reset_timings(), so if one thread resets the inference profile timings while another thread is in the middle of profiling type inference, this can crash type inference.

There is still a logical race (the first thread's profile would be cleared by the second thread, and when the first thread finishes profiling, it would disable the inference timing for the second thread), but this fixes the data race.

Type inference is currently guarded by a global typeinf_lock, which the type inference profiler takes advantage of to protect data access safety. If we ever make the inference lock more granular, or make inference concurrent, we should add a separate lock around the inference profiles.
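If the inference lock ever does become more granular, the shape of a dedicated profiling lock could look roughly like this (a minimal sketch only; `TIMINGS_LOCK`, `_timings`, `reset_timings!`, and `record_timing!` are hypothetical stand-ins, not the actual Base internals):

```julia
# Hypothetical sketch: protect a shared timings buffer with its own lock
# instead of piggybacking on the global typeinf lock.
const TIMINGS_LOCK = ReentrantLock()
const _timings = Any[]

function reset_timings!()
    lock(TIMINGS_LOCK) do
        empty!(_timings)
        # Placeholder root entry, standing in for the real ROOT Timing.
        push!(_timings, :ROOT)
    end
    return nothing
end

function record_timing!(t)
    lock(TIMINGS_LOCK) do
        push!(_timings, t)
    end
    return nothing
end
```

With this shape, a `reset_timings!` on one thread can no longer interleave with a `record_timing!` on another, which is the data race described above; the logical race (one profile clobbering another) would still need a higher-level design.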


For more details on our plan to allow concurrent callers to @snoopi_deep, see: https://docs.google.com/document/d/1JO43w3V1Cwtb2CxWf4sYfAB68xbkTrCMr973iaJOFwY/edit#.

Most of this can happen entirely in the SnoopCompile.jl package, but I wanted to come back here and close the loop on this, to tie up loose ends, in case anyone else encounters this. 😊


Before:

julia> Threads.nthreads()
4

julia> using SnoopCompile

julia> for _ in 1:1000 Threads.@spawn begin
               SnoopCompile.@snoopi_deep (sleep(0.001); 2+2)
           end
       end

Internal error: encountered unexpected error in runtime:
MethodError(f=Base.string, args=(Expr(:call, :===, Expr(:., Expr(:., :new_timer, :(:mi_info)), :(:mi)), Expr(:., :expected_mi_info, :(:mi))),), world=0x0000000000001495)
jl_method_error_bare at /Users/nathandaly/src/julia/src/gf.c:1942
jl_method_error at /Users/nathandaly/src/julia/src/gf.c:1960
jl_lookup_generic_ at /Users/nathandaly/src/julia/src/gf.c:2613 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:2628
macro expansion at ./error.jl:231 [inlined]
exit_current_timer at ./compiler/typeinfer.jl:163 [inlined]
typeinf at ./compiler/typeinfer.jl:212
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:931
...

After 😊:

julia> Threads.nthreads()
4

julia> using SnoopCompile

julia> for _ in 1:1000 Threads.@spawn begin
               SnoopCompile.@snoopi_deep (sleep(0.001); 2+2)
           end
       end

julia> 

NHDaly (Member Author) commented Oct 20, 2022

CC: @vilterp

Comment on lines +89 to +99
ccall(:jl_typeinf_lock_begin, Cvoid, ())
try
    empty!(_timings)
    push!(_timings, Timing(
        # The MethodInstance for ROOT(), and default empty values for other fields.
        InferenceFrameInfo(ROOTmi, 0x0, Any[], Any[Core.Const(ROOT)], 1),
        _time_ns()))
    return nothing
finally
    ccall(:jl_typeinf_lock_end, Cvoid, ())
end
NHDaly (Member Author):
Oh dip, it turns out that we've already done some work to shrink the inference lock (!! 🎉 exciting!! 🎉), so I can't rely on it for locking the inference profiling anymore. 😮 😢
#46324

Gonna move this PR to a draft; it needs more thinking now. We probably at least need a data lock around the `_timings` buffer, but the inference profiling also heavily assumed that inference was non-concurrent, which it no longer is.
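One possible direction for dropping that assumption would be to give each task its own timing stack via `task_local_storage`, so concurrent inference tasks never share a mutable buffer. This is a hedged sketch of the idea, not what this PR or Julia actually implements; `current_timing_stack` and `with_timing` are made-up names:

```julia
# Hypothetical sketch: per-task timing stacks, so concurrent tasks never
# mutate a shared buffer and no data lock is needed on the hot path.
function current_timing_stack()
    # task_local_storage() returns this task's private IdDict, so each
    # task lazily gets its own Vector on first use.
    get!(task_local_storage(), :timing_stack) do
        Any[]
    end::Vector{Any}
end

function with_timing(f, label)
    stack = current_timing_stack()
    push!(stack, (label, time_ns()))
    try
        return f()
    finally
        # Each task only ever pops its own frames.
        pop!(stack)
    end
end
```

Merging the per-task results into one profile at the end would still need coordination, but the hot path (every inference entry/exit) stays lock-free.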

@NHDaly NHDaly marked this pull request as draft October 21, 2022 20:14
@NHDaly
Copy link
Member Author

NHDaly commented Nov 18, 2022

Closing in favor of #47615

@NHDaly NHDaly closed this Nov 18, 2022
@giordano giordano deleted the nhd-snoopi_deep-data-race-reset_timings branch February 25, 2024 21:42