Significant performance regression when precompiled #50749
Comments
Some slowdown is expected; an order of magnitude is not. I recommend a native profiler like Hotspot/perf/VTune. |
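Since the workload is JIT-compiled, perf needs Julia's jitdump workflow to resolve Julia frames; a minimal sketch (assuming a Julia built with perf JIT-events support and a `script.jl` that runs the slow loop):

```sh
# Record with the JIT-profiling hooks enabled, then inject JIT symbols.
ENABLE_JITPROFILING=1 perf record -k 1 --call-graph dwarf -o /tmp/perf.data julia script.jl
perf inject --jit --input /tmp/perf.data --output /tmp/perf-jit.data
perf report --input /tmp/perf-jit.data
```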
With precompilation, I get:

```julia
julia> using LinuxPerf
julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> Base.donotdelete( f(args...)), Base.OneTo(N))
foreachf! (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 10_000_000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 2.98e+10 60.0% # 3.9 cycles per ns
┌ instructions 4.07e+10 60.0% # 1.4 insns per cycle
│ branch-instructions 1.39e+10 60.0% # 34.3% of insns
└ branch-misses 9.95e+06 60.0% # 0.1% of branch insns
┌ task-clock 7.72e+09 100.0% # 7.7 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 2.16e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 1.56e+10 20.0%
└ L1-icache-load-misses 3.95e+05 20.0%
┌ dTLB-load-misses 1.50e+01 20.0% # 0.0% of dTLB loads
└ dTLB-loads 1.54e+10 20.0%
┌ iTLB-load-misses 1.66e+03 40.0% # 7.5% of iTLB loads
└ iTLB-loads 2.20e+04 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Without precompilation:

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 10_000_000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.87e+09 59.9% # 3.9 cycles per ns
┌ instructions 5.00e+09 60.0% # 2.7 insns per cycle
│ branch-instructions 2.40e+08 60.0% # 4.8% of insns
└ branch-misses 1.99e+03 60.0% # 0.0% of branch insns
┌ task-clock 4.84e+08 100.0% # 483.9 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 9.64e+03 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 2.51e+08 20.0%
└ L1-icache-load-misses 6.30e+03 20.0%
┌ dTLB-load-misses 0.00e+00 19.9% # 0.0% of dTLB loads
└ dTLB-loads 2.50e+08 19.9%
┌ iTLB-load-misses 5.01e+00 39.9% # 0.4% of iTLB loads
└ iTLB-loads 1.36e+03 39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

So, with precompilation, we have 2.78 times more branch instructions than we have total instructions without precompilation. I get about a 16x performance difference (7.7 s vs 0.484 s for the 10 million repetitions); see the quick arithmetic check after the session below. Also, FWIW, here is a copy-pastable script, if your CPU supports the same performance counters as mine:

```julia
julia> using VectorizedRNG, Random, LinuxPerf
Precompiling VectorizedRNG
1 dependency successfully precompiled in 2 seconds. 30 already precompiled.
julia> x64 = Vector{Float64}(undef, 255);
julia> drng = Random.default_rng(); lrng = local_rng();
julia> @benchmark rand!($lrng, $x64)
BenchmarkTools.Trial: 10000 samples with 986 evaluations.
Range (min … max): 52.124 ns … 88.696 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 52.493 ns ┊ GC (median): 0.00%
Time (mean ± σ): 52.680 ns ± 1.211 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▃█▇ ▁▁▁▁ ▂
▃▁██████▅▃▁▅▅▃▃▃▄▄▃▅▅▆▇██████████▇▇▇▇█▇▅▅▅▅▅▄▄▅▁▃▄▅▄▆▆▅▄▅▆▅ █
52.1 ns Histogram: log(frequency) by time 55.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> Base.donotdelete( f(args...)), Base.OneTo(N))
foreachf! (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 1000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 2.19e+07 57.4% # 3.5 cycles per ns
┌ instructions 2.11e+07 68.3% # 1.0 insns per cycle
│ branch-instructions 4.21e+06 68.3% # 20.0% of insns
└ branch-misses 2.16e+05 68.3% # 5.1% of branch insns
┌ task-clock 6.26e+06 100.0% # 6.3 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 7.50e+01 100.0%
┌ L1-dcache-load-misses 2.21e+05 15.7% # 4.5% of dcache loads
│ L1-dcache-loads 4.91e+06 15.7%
└ L1-icache-load-misses 1.41e+06 15.7%
┌ dTLB-load-misses 8.51e+03 16.0% # 0.1% of dTLB loads
└ dTLB-loads 6.72e+06 16.0%
┌ iTLB-load-misses 8.24e+03 32.0% # 22.7% of iTLB loads
└ iTLB-loads 3.63e+04 32.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 10_000_000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.87e+09 59.9% # 3.9 cycles per ns
┌ instructions 5.00e+09 60.0% # 2.7 insns per cycle
│ branch-instructions 2.40e+08 60.0% # 4.8% of insns
└ branch-misses 1.99e+03 60.0% # 0.0% of branch insns
┌ task-clock 4.84e+08 100.0% # 483.9 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 9.64e+03 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 2.51e+08 20.0%
└ L1-icache-load-misses 6.30e+03 20.0%
┌ dTLB-load-misses 0.00e+00 19.9% # 0.0% of dTLB loads
└ dTLB-loads 2.50e+08 19.9%
┌ iTLB-load-misses 5.01e+00 39.9% # 0.4% of iTLB loads
└ iTLB-loads 1.36e+03 39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
|
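For reference, the ratios quoted above follow directly from the counter values in the two `@pstats` reports (a back-of-the-envelope check; the numbers are copied from the reports):

```julia
# Counter values copied from the two @pstats reports above.
branch_insns_precompiled = 1.39e10  # branch-instructions, precompiled run
total_insns_plain        = 5.00e9   # instructions, non-precompiled run
t_precompiled            = 7.72     # seconds (task-clock), precompiled
t_plain                  = 0.484    # seconds (task-clock), non-precompiled

branch_insns_precompiled / total_insns_plain  # ≈ 2.78: more branches alone than
                                              # total instructions without pkgimage
t_precompiled / t_plain                       # ≈ 16x slowdown
```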
I can verify this with

```julia
function runmany!(rng, buf, n)
for i = 1:n
Base.donotdelete(rand!(rng, buf))
end
return buf
end
```

which is a check that the specialization heuristics (for the function and `Vararg` arguments of `foreachf!`) aren't to blame. |
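For background on why that check matters, a toy illustration of the heuristics in question (my sketch, not from the thread): Julia avoids specializing on bare `::Function` and `Vararg` arguments that are merely passed through, whereas a type parameter forces specialization. `foreachf!` relies on the forced form, and `runmany!` sidesteps the question entirely by calling `rand!` directly.

```julia
# f is only passed through, so Julia's heuristics may not specialize on it:
passthrough(f, xs) = map(f, xs)

# The F type parameter forces specialization on the concrete function type,
# the same trick foreachf! uses above:
forced(f::F, xs) where {F} = map(f, xs)
```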
Using Cthulhu I've checked the LLVM of the two versions, and unless I've screwed up they seem the same. Might they differ in their native-code optimizations? EDIT: moreover, their native code also looks nearly identical. |
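For anyone following along, that comparison can be made interactively with Cthulhu's `@descend` (a sketch; `lrng`/`x64` as in the other comments):

```julia
using Cthulhu, VectorizedRNG, Random
lrng = local_rng(); x64 = Vector{Float64}(undef, 255)
# Interactively descend into the call; the menu lets you switch between
# typed, LLVM, and native views of each method instance.
@descend rand!(lrng, x64)
```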
Try with … |
You'll want … Also, the second approach has the advantage that … |
Here is the script:

```
tim@flash:~/.julia/dev/VectorizedRNG$ cat script.jl
using VectorizedRNG, Random
drng = Random.default_rng(); lrng = local_rng();
x64 = Vector{Float64}(undef, 255);
function runmany!(rng, buf, n)
for i = 1:n
Base.donotdelete(rand!(rng, buf))
end
return buf
end
@time runmany!(lrng, x64, 1)
@time runmany!(lrng, x64, 10^6)
nothing
```

Here is a session from the (slow) version with precompilation:

```
tim@flash:~/.julia/dev/VectorizedRNG$ ~/src/julia-branch/julia --project
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.11.0-DEV.203 (2023-07-31)
_/ |\__'_|_|_|\__'_| | Commit ec8df3da35* (1 day old master)
|__/ |
julia> include("script.jl")
Precompiling VectorizedRNG
1 dependency successfully precompiled in 3 seconds. 30 already precompiled.
0.005778 seconds (2.63 k allocations: 184.414 KiB, 99.47% compilation time)
1.580240 seconds
julia> using Static
julia> open("/tmp/tim/slowpc/slow_native.log", "w") do io
code_native(io, VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
end
julia> open("/tmp/tim/slowpc/slow_llvm.log", "w") do io
code_llvm(io, VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
end
```

The changes for the fast version (which comments out the precompile workload) … |
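To compare against the fast build, the same dump can be taken there and diffed (a sketch; the `fastpc` path is illustrative):

```julia
open("/tmp/tim/fastpc/fast_llvm.log", "w") do io
    code_llvm(io, VectorizedRNG.samplevector!,
              Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2},
                    Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0},
                    Static.StaticInt{1}, typeof(identity)};
              dump_module=false)
end
# then compare from a shell: diff /tmp/tim/slowpc/slow_llvm.log /tmp/tim/fastpc/fast_llvm.log
```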
Oh interesting, if I profile the slow version I get: …

One bizarre feature: note the things that get listed with a file/line in the … For comparison, here's the profile of the fast one: …
|
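The profile listings themselves were elided above, but the standard way to collect them (a sketch reusing `runmany!`, `lrng`, and `x64` from the earlier comments):

```julia
using Profile
runmany!(lrng, x64, 1)      # compile first, so compilation doesn't dominate
Profile.clear()
@profile runmany!(lrng, x64, 10^6)
Profile.print(format=:tree, mincount=10)  # file/line-annotated call tree
```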
Those look exactly the same 🤔 |
The code, but not the profile. Which makes me puzzled. |
I imagine … |
I'll see if perf can see through it |
The other way to get the real, true LLVM that we've optimized is to run with … |
Notice that's one of the things attributed to the … |
I'm trying to look at the generated LLVM IR, but for the sysimage module it just makes up garbage. Is there a way for us to emit a package image to a .bc/.ll file? |
So, after talking a bit with @pchintalapudi, we realized what is happening: during precompilation, the llvmcall modules that are marked …
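For context on the moving parts here: VectorizedRNG's kernels are built on whole-module `Base.llvmcall`, roughly of the following toy shape (an illustrative sketch, not the package's actual IR), and the diagnosis above concerns how such modules are optimized and linked during precompilation:

```julia
# Toy whole-module llvmcall: the tuple form takes (IR string, entry name).
add1(x::Int64) = Base.llvmcall(
    ("""
     define i64 @entry(i64 %x) #0 {
         %r = add i64 %x, 1
         ret i64 %r
     }
     attributes #0 = { alwaysinline }
     """, "entry"),
    Int64, Tuple{Int64}, x)

add1(41)  # == 42
```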
Using the `teh/pc` branch of my fork of VectorizedRNG seems to show worse performance with a precompile workload than without. The `teh/pc` branch has that workload commented-out, and I get (on an old CPU): …

The key thing is the 139 ns median run time. (I deleted the plot of the histogram since that doesn't transfer well to GitHub.) Whereas when I uncomment the precompile block and use the same commands, I get: …

i.e., seemingly more than an order of magnitude slower.

I say seemingly because there's something quite interesting about the profile. Here's a snippet. Fast version: …

whereas the slow version is: …

There is a puzzle here: if the two differ by an order of magnitude, why is the number of samples approximately the same? (At top level, the slow one has 30 samples and the fast one 28.) Perhaps `samples=10000 evals=1` is insufficient to ensure that they are running the same workload, but it's not entirely clear.

CC @chriselrod
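The elided `@benchmark` invocations are presumably of this shape (reconstructed from the commands quoted elsewhere in the thread; the exact form here is an assumption):

```julia
using VectorizedRNG, Random, BenchmarkTools
lrng = local_rng()
x64 = Vector{Float64}(undef, 255)
@benchmark rand!($lrng, $x64) samples=10000 evals=1
```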