Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using output from --trace-compile from running a workload can sometimes cache more than running the workload itself #48048

Open
KristofferC opened this issue Dec 30, 2022 · 4 comments
Labels
compiler:latency Compiler latency

Comments

@KristofferC
Copy link
Sponsor Member

KristofferC commented Dec 30, 2022

This is a summarized MWE version of the discussion in Ferrite-FEM/Tensors.jl#190.

The following code creates a small package:

mkpath("Ts/src")
write("Ts/src/Ts.jl", """
module Ts
struct T{dim} 
    data::NTuple{dim, Float64}
end

h(::Type{T{dim}}, f) where {dim} = T{dim}((1.0, 1.0))
@inline f(::Type{T{dim}}) where {dim} = h(T{dim}, () -> 1.0)

end
""")

write("Ts/Project.toml", """
name = "Ts"
uuid = "38a634ad-e948-5137-8d70-aa71f2a747f4"
version = "0.1"

[deps]
SnoopPrecompile = "66db9d55-30c0-4569-8b51-7e840670fc0c"
""")

Pkg.activate("Ts")
Pkg.instantiate()

One way to cache compiled code is by using --trace-compile=stderr and put what comes out into the package file. So we run the following code (the [2][1] stuff is to trick constant propagation):

julia --project --trace-compile=stderr --startup-file=no -e 'using Ts; test() = Ts.f(Ts.T{[2][1]}); test();'
precompile(Tuple{typeof(Main.test)})
precompile(Tuple{Type{Ts.T{2}}, Tuple{Float64, Float64}})
precompile(Tuple{typeof(Ts.h), Type{Ts.T{2}}, Function})

The typeof(Main.test) is not interesting but if we put the two others precompile signatures into the package and rerun the workload we see that the methods in Ts are cached and no longer printed

❯ julia --project --trace-compile=stderr --startup-file=no -e 'using Ts; test() = Ts.f(Ts.T{[2][1]}); test();'
precompile(Tuple{typeof(Main.test)})

An alternativ (better) way is to just run this workload in the package itself. SnoopPrecompile has a macro that makes Julia try really hard to compile the stuff in it. So the two precompile signatures are replaced with:

using SnoopPrecompile
@precompile_all_calls f(T{[2][1]})

where again the [2][1] is to try avoid any constant propagation. Running the workload again:

❯ julia --project --trace-compile=stderr --startup-file=no -e 'using Ts; test() = Ts.f(Ts.T{[2][1]}); test();'
precompile(Tuple{typeof(Main.test)})
precompile(Tuple{typeof(Ts.h), Type{Ts.T{2}}, Function})

we see that in this case, the method Ts.h has failed to cache and gets recompiled (the constructor Ts.T managed to cache though). So putting the precompile signature into the package made it cache, but running the workload that emits the precompile signature made it not cache.

There are a few things we can do to fix the discrepancy:

  • Remove the @inline on f.
  • Force specialization of f in h with where {F}.

But ideally, running a workload that outputs certain precompile signatures in --trace-compile should be equally good as putting those precompile signatures explicitly into the package.

@KristofferC
Copy link
Sponsor Member Author

The example in Ferrite-FEM/Tensors.jl#190 is a little bit more complicated and in this case, specializion on f does not help:

module Ts
struct T{dim} 
    data::NTuple{dim, Float64}
end

@generated function h(::Type{T{dim}}, f::F) where {dim, F}
    [:(f($i)) for i=1:dim]
    :(T((f(0), f(0))))
end
@inline f(el::Number, ::Type{T{dim}}) where {dim} = h(T{dim}, i -> el)

using SnoopPrecompile
@precompile_all_calls f(1.0, T{2})

end

@timholy
Copy link
Sponsor Member

timholy commented Dec 31, 2022

This appears to be fixed in https://github.com/timholy/SnoopCompile.jl/tree/teh/snoop_pc (the teh/snoop_pc branch of SnoopCompile). However, the issue in Tensors is not; indeed, it's quite a different set of methods that get re-inferred:

julia> using SnoopCompileCore

julia> using Tensors

julia> using SnoopPrecompile

julia> tinf = @time_precompiled @snoopi_deep begin
       @precompile_all_calls begin
       for dim in (2, 3)
               v = ones(Tensor{1, dim, Float64,})
               σ = one(SymmetricTensor{2, dim, Float64})
               F = one(Tensor{2, dim, Float64,})
               E = one(SymmetricTensor{4, dim, Float64})
               C = one(Tensor{4, dim, Float64,})
               v * 1.0
               v  v
               v  v
               σ  v
               F  v
               σ  σ
               F  F
               σ * 1.0
               F * 1.0
               σ  σ
               F  F
               E  σ
               C  F
               E * 1.0
               C * 1.0

               # TODO: AD?
           end
       end
       end;

julia> using SnoopCompile

julia> tinf
InferenceTimingNode: 0.013773/0.159047 on Core.Compiler.Timings.ROOT() with 1 direct children

julia> tinf.children[1].children
14-element Vector{SnoopCompileCore.InferenceTimingNode}:
 InferenceTimingNode: 0.000637/0.033348 on *([1.0, 1.0]::Vec{2, Float64}, 1.0::Float64) with 2 direct children
 InferenceTimingNode: 0.006718/0.008715 on Tensors.otimes([1.0, 1.0]::Vec{2, Float64}, [1.0, 1.0]::Vec{2, Float64}) with 4 direct children
 InferenceTimingNode: 0.000077/0.031955 on LinearAlgebra.dot([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, [1.0, 1.0]::Vec{2, Float64}) with 1 direct children
 InferenceTimingNode: 0.000865/0.003543 on Tensors.otimes([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, [1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}) with 6 direct children
 InferenceTimingNode: 0.001037/0.005002 on Tensors.otimes([1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}, [1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}) with 7 direct children
 InferenceTimingNode: 0.000514/0.000649 on *([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, 1.0::Float64) with 1 direct children
 InferenceTimingNode: 0.000372/0.000497 on *([1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}, 1.0::Float64) with 1 direct children
 InferenceTimingNode: 0.000165/0.016104 on Tensors.dcontract([1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}, [1.0 0.0; 0.0 1.0]::SymmetricTensor{2, 2, Float64, 3}) with 1 direct children
 InferenceTimingNode: 0.000068/0.004072 on Tensors.dcontract([1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}, [1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}) with 1 direct children
 InferenceTimingNode: 0.000303/0.006795 on Tensors.dcontract([1.0 0.0; 0.0 0.0;;; 0.0 0.0; 1.0 0.0;;;; 0.0 1.0; 0.0 0.0;;; 0.0 0.0; 0.0 1.0]::Tensor{4, 2, Float64, 16}, [1.0 0.0; 0.0 1.0]::Tensor{2, 2, Float64, 4}) with 4 direct children
 InferenceTimingNode: 0.000459/0.004824 on *([1.0 0.0; 0.0 0.0;;; 0.0 0.5; 0.5 0.0;;;; 0.0 0.5; 0.5 0.0;;; 0.0 0.0; 0.0 1.0]::SymmetricTensor{4, 2, Float64, 9}, 1.0::Float64) with 2 direct children
 InferenceTimingNode: 0.000560/0.018592 on *([1.0 0.0; 0.0 0.0;;; 0.0 0.0; 1.0 0.0;;;; 0.0 1.0; 0.0 0.0;;; 0.0 0.0; 0.0 1.0]::Tensor{4, 2, Float64, 16}, 1.0::Float64) with 2 direct children
 InferenceTimingNode: 0.000060/0.000081 on iterate((2, 3)::Tuple{Int64, Int64}, 2::Int64) with 1 direct children
 InferenceTimingNode: 0.000170/0.000225 on iterate((2, 3)::Tuple{Int64, Int64}, ::Int64) with 1 direct children

@timholy
Copy link
Sponsor Member

timholy commented Dec 31, 2022

I added the following to SnoopPrecompile:

tim@diva:~/.julia/dev/SnoopCompile/SnoopPrecompile$ git diff src/SnoopPrecompile.jl
diff --git a/SnoopPrecompile/src/SnoopPrecompile.jl b/SnoopPrecompile/src/SnoopPrecompile.jl
index d0dcd6b..59f4c41 100644
--- a/SnoopPrecompile/src/SnoopPrecompile.jl
+++ b/SnoopPrecompile/src/SnoopPrecompile.jl
@@ -1,5 +1,7 @@
 module SnoopPrecompile

+using MethodAnalysis
+
 export @precompile_all_calls, @precompile_setup

 const verbose = Ref(false)    # if true, prints all the precompiles
@@ -11,7 +13,7 @@ function precompile_roots(roots)
     for child in roots
         mi = child.mi_info.mi
         precompile(mi.specTypes) # TODO: Julia should allow one to pass `mi` directly (would handle `invoke` properly)
-        verbose[] && println(mi)
+        verbose[] && printstyled(mi, '\n'; color=:cyan)
     end
 end

@@ -74,6 +76,11 @@ macro precompile_all_calls(ex::Expr)
     return esc(quote
         if ccall(:jl_generating_output, Cint, ()) == 1 || $SnoopPrecompile.verbose[]
             $ex
+            if $SnoopPrecompile.verbose[]
+                for mi in $methodinstances_owned_by(@__MODULE__)
+                    printstyled(mi, '\n'; color=:green)
+                end
+            end
         end
     end)
 end

and modified the precompile section of Ts.jl to be

using SnoopPrecompile
SnoopPrecompile.verbose[] = true
@precompile_all_calls f(T{[2][1]})

and got the following output (screenshot to show color):
image

Furthermore:

tim@diva:/tmp/pkgimage/Ts$ ~/src/juliaw/julia --project -q
julia> using PkgCacheInspector

julia> cf = info_cachefile("Ts");

julia> using MethodAnalysis

julia> methodinstances_owned_by(only(cf.modules))
3-element Vector{Core.MethodInstance}:
 MethodInstance for Ts.f(::Type{Ts.T{2}})
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Function)
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Ts.var"#1#2")

So it's being cached. Finally:

tim@diva:/tmp/pkgimage/Ts$ ~/src/juliaw/julia --project -q
julia> using Ts

julia> using MethodAnalysis

julia> methodinstances_owned_by(Ts)
3-element Vector{Core.MethodInstance}:
 MethodInstance for Ts.f(::Type{Ts.T{2}})
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Function)
 MethodInstance for Ts.h(::Type{Ts.T{2}}, ::Ts.var"#1#2")

So it's also being reloaded.

The issue is the abstract method (not the concrete one) gets re-inferred despite its presence in the cache. CC @aviatesk, xref #47960.

@KristofferC
Copy link
Sponsor Member Author

My follow up question is why it seems to work when the explicit precompile invocation is put into the package? To me, that seems like it indicates that Julia has the ability to cache this properly without re-inferring the method, it is "just" that when using the direct workload á la @precompile_all_calls that it fails. So perhaps the difference in how the method gets cached when precompile is printed vs how it is cached when it is directly executed is interesting to look at?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:latency Compiler latency
Projects
None yet
Development

No branches or pull requests

2 participants