Significant performance regression when precompiled #50749
Comments
Some slowdown is expected; an order of magnitude is not. I recommend a native profiler like Hotspot/perf/VTune. |
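Since the workload is JIT-compiled, perf needs Julia's jitdump workflow to resolve Julia frames; a minimal sketch (assuming a Julia built with perf JIT-events support and a `script.jl` that runs the slow loop):

```sh
# Record with the JIT-profiling hooks enabled, then inject JIT symbols.
ENABLE_JITPROFILING=1 perf record -k 1 --call-graph dwarf -o /tmp/perf.data julia script.jl
perf inject --jit --input /tmp/perf.data --output /tmp/perf-jit.data
perf report --input /tmp/perf-jit.data
```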
With precompilation, I get:

```julia
julia> using LinuxPerf
julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> Base.donotdelete( f(args...)), Base.OneTo(N))
foreachf! (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 10_000_000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 2.98e+10 60.0% # 3.9 cycles per ns
┌ instructions 4.07e+10 60.0% # 1.4 insns per cycle
│ branch-instructions 1.39e+10 60.0% # 34.3% of insns
└ branch-misses 9.95e+06 60.0% # 0.1% of branch insns
┌ task-clock 7.72e+09 100.0% # 7.7 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 2.16e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 1.56e+10 20.0%
└ L1-icache-load-misses 3.95e+05 20.0%
┌ dTLB-load-misses 1.50e+01 20.0% # 0.0% of dTLB loads
└ dTLB-loads 1.54e+10 20.0%
┌ iTLB-load-misses 1.66e+03 40.0% # 7.5% of iTLB loads
└ iTLB-loads 2.20e+04 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Without precompilation:

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 10_000_000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.87e+09 59.9% # 3.9 cycles per ns
┌ instructions 5.00e+09 60.0% # 2.7 insns per cycle
│ branch-instructions 2.40e+08 60.0% # 4.8% of insns
└ branch-misses 1.99e+03 60.0% # 0.0% of branch insns
┌ task-clock 4.84e+08 100.0% # 483.9 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 9.64e+03 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 2.51e+08 20.0%
└ L1-icache-load-misses 6.30e+03 20.0%
┌ dTLB-load-misses 0.00e+00 19.9% # 0.0% of dTLB loads
└ dTLB-loads 2.50e+08 19.9%
┌ iTLB-load-misses 5.01e+00 39.9% # 0.4% of iTLB loads
└ iTLB-loads 1.36e+03 39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

So, with precompilation, we have 2.78 times more branch instructions than we have total instructions without precompilation. I get about a 16x performance difference (7.7 s vs 0.484 s for the 10 million repetitions); see the quick arithmetic check after the session below. Also, FWIW, here is a copy-pastable script, if your CPU supports the same performance counters as mine:

```julia
julia> using VectorizedRNG, Random, LinuxPerf
Precompiling VectorizedRNG
1 dependency successfully precompiled in 2 seconds. 30 already precompiled.
julia> x64 = Vector{Float64}(undef, 255);
julia> drng = Random.default_rng(); lrng = local_rng();
julia> @benchmark rand!($lrng, $x64)
BenchmarkTools.Trial: 10000 samples with 986 evaluations.
Range (min … max): 52.124 ns … 88.696 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 52.493 ns ┊ GC (median): 0.00%
Time (mean ± σ): 52.680 ns ± 1.211 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▃█▇ ▁▁▁▁ ▂
▃▁██████▅▃▁▅▅▃▃▃▄▄▃▅▅▆▇██████████▇▇▇▇█▇▅▅▅▅▅▄▄▅▁▃▄▅▄▆▆▅▄▅▆▅ █
52.1 ns Histogram: log(frequency) by time 55.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> Base.donotdelete( f(args...)), Base.OneTo(N))
foreachf! (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 1000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 2.19e+07 57.4% # 3.5 cycles per ns
┌ instructions 2.11e+07 68.3% # 1.0 insns per cycle
│ branch-instructions 4.21e+06 68.3% # 20.0% of insns
└ branch-misses 2.16e+05 68.3% # 5.1% of branch insns
┌ task-clock 6.26e+06 100.0% # 6.3 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 7.50e+01 100.0%
┌ L1-dcache-load-misses 2.21e+05 15.7% # 4.5% of dcache loads
│ L1-dcache-loads 4.91e+06 15.7%
└ L1-icache-load-misses 1.41e+06 15.7%
┌ dTLB-load-misses 8.51e+03 16.0% # 0.1% of dTLB loads
└ dTLB-loads 6.72e+06 16.0%
┌ iTLB-load-misses 8.24e+03 32.0% # 22.7% of iTLB loads
└ iTLB-loads 3.63e+04 32.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(rand!, 10_000_000, lrng, x64)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.87e+09 59.9% # 3.9 cycles per ns
┌ instructions 5.00e+09 60.0% # 2.7 insns per cycle
│ branch-instructions 2.40e+08 60.0% # 4.8% of insns
└ branch-misses 1.99e+03 60.0% # 0.0% of branch insns
┌ task-clock 4.84e+08 100.0% # 483.9 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 9.64e+03 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 2.51e+08 20.0%
└ L1-icache-load-misses 6.30e+03 20.0%
┌ dTLB-load-misses 0.00e+00 19.9% # 0.0% of dTLB loads
└ dTLB-loads 2.50e+08 19.9%
┌ iTLB-load-misses 5.01e+00 39.9% # 0.4% of iTLB loads
└ iTLB-loads 1.36e+03 39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
|
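For reference, the ratios quoted above follow directly from the counter values in the two `@pstats` reports (a back-of-the-envelope check; the numbers are copied from the reports):

```julia
# Counter values copied from the two @pstats reports above.
branch_insns_precompiled = 1.39e10  # branch-instructions, precompiled run
total_insns_plain        = 5.00e9   # instructions, non-precompiled run
t_precompiled            = 7.72     # seconds (task-clock), precompiled
t_plain                  = 0.484    # seconds (task-clock), non-precompiled

branch_insns_precompiled / total_insns_plain  # ≈ 2.78: more branches alone than
                                              # total instructions without pkgimage
t_precompiled / t_plain                       # ≈ 16x slowdown
```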
I can verify this with

```julia
function runmany!(rng, buf, n)
for i = 1:n
Base.donotdelete(rand!(rng, buf))
end
return buf
end
```

which is a check that the specialization heuristics (for the function and `Vararg` arguments of `foreachf!`) aren't to blame. |
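For background on why that check matters, a toy illustration of the heuristics in question (my sketch, not from the thread): Julia avoids specializing on bare `::Function` and `Vararg` arguments that are merely passed through, whereas a type parameter forces specialization. `foreachf!` relies on the forced form, and `runmany!` sidesteps the question entirely by calling `rand!` directly.

```julia
# f is only passed through, so Julia's heuristics may not specialize on it:
passthrough(f, xs) = map(f, xs)

# The F type parameter forces specialization on the concrete function type,
# the same trick foreachf! uses above:
forced(f::F, xs) where {F} = map(f, xs)
```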
Using Cthulhu I've checked the LLVM of the two versions, and unless I've screwed up they seem the same. Might they differ in their native-code optimizations? EDIT: moreover, their native code also looks nearly identical. |
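For anyone following along, that comparison can be made interactively with Cthulhu's `@descend` (a sketch; `lrng`/`x64` as in the other comments):

```julia
using Cthulhu, VectorizedRNG, Random
lrng = local_rng(); x64 = Vector{Float64}(undef, 255)
# Interactively descend into the call; the menu lets you switch between
# typed, LLVM, and native views of each method instance.
@descend rand!(lrng, x64)
```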
Try with … |
You'll want … Also, the second approach has the advantage that … |
Here is the script:

```
tim@flash:~/.julia/dev/VectorizedRNG$ cat script.jl
using VectorizedRNG, Random
drng = Random.default_rng(); lrng = local_rng();
x64 = Vector{Float64}(undef, 255);
function runmany!(rng, buf, n)
for i = 1:n
Base.donotdelete(rand!(rng, buf))
end
return buf
end
@time runmany!(lrng, x64, 1)
@time runmany!(lrng, x64, 10^6)
nothing
```

Here is a session from the (slow) version with precompilation:

```
tim@flash:~/.julia/dev/VectorizedRNG$ ~/src/julia-branch/julia --project
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.11.0-DEV.203 (2023-07-31)
_/ |\__'_|_|_|\__'_| | Commit ec8df3da35* (1 day old master)
|__/ |
julia> include("script.jl")
Precompiling VectorizedRNG
1 dependency successfully precompiled in 3 seconds. 30 already precompiled.
0.005778 seconds (2.63 k allocations: 184.414 KiB, 99.47% compilation time)
1.580240 seconds
julia> using Static
julia> open("/tmp/tim/slowpc/slow_native.log", "w") do io
code_native(io, VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
end
julia> open("/tmp/tim/slowpc/slow_llvm.log", "w") do io
code_llvm(io, VectorizedRNG.samplevector!, Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2}, Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{1}, typeof(identity)}; dump_module=false)
end
```

The changes for the fast version (which comments out the precompile workload) … |
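To compare against the fast build, the same dump can be taken there and diffed (a sketch; the `fastpc` path is illustrative):

```julia
open("/tmp/tim/fastpc/fast_llvm.log", "w") do io
    code_llvm(io, VectorizedRNG.samplevector!,
              Tuple{typeof(VectorizedRNG.random_uniform), VectorizedRNG.Xoshiro{2},
                    Vector{Float64}, Static.StaticInt{0}, Static.StaticInt{0},
                    Static.StaticInt{1}, typeof(identity)};
              dump_module=false)
end
# then compare from a shell: diff /tmp/tim/slowpc/slow_llvm.log /tmp/tim/fastpc/fast_llvm.log
```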
Oh interesting, if I profile the slow version I get: …

One bizarre feature: note the things that get listed with a file/line in the … For comparison, here's the profile of the fast one: …
|
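The profile listings themselves were elided above, but the standard way to collect them (a sketch reusing `runmany!`, `lrng`, and `x64` from the earlier comments):

```julia
using Profile
runmany!(lrng, x64, 1)      # compile first, so compilation doesn't dominate
Profile.clear()
@profile runmany!(lrng, x64, 10^6)
Profile.print(format=:tree, mincount=10)  # file/line-annotated call tree
```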
Those look exactly the same 🤔 |
The code, but not the profile. Which makes me puzzled. |
I imagine … |
I'll see if perf can see through it |
The other way to get the real, true LLVM that we've optimized is to run with … |
Notice that's one of the things attributed to the … |
I'm trying to look at the generated LLVM IR, but for the sysimage module it just makes up garbage. Is there a way for us to emit a package image to a .bc/.ll file? |
So, after talking a bit with @pchintalapudi, we realized what is happening: during precompilation, the llvmcall modules that are marked …
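For context on the moving parts here: VectorizedRNG's kernels are built on whole-module `Base.llvmcall`, roughly of the following toy shape (an illustrative sketch, not the package's actual IR), and the diagnosis above concerns how such modules are optimized and linked during precompilation:

```julia
# Toy whole-module llvmcall: the tuple form takes (IR string, entry name).
add1(x::Int64) = Base.llvmcall(
    ("""
     define i64 @entry(i64 %x) #0 {
         %r = add i64 %x, 1
         ret i64 %r
     }
     attributes #0 = { alwaysinline }
     """, "entry"),
    Int64, Tuple{Int64}, x)

add1(41)  # == 42
```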
Using the `teh/pc` branch of my fork of VectorizedRNG seems to show worse performance with a precompile workload than without. The `teh/pc` branch has that workload commented-out, and I get (on an old CPU): …

The key thing is the 139 ns median run time. (I deleted the plot of the histogram since that doesn't transfer well to GitHub.) Whereas when I uncomment the precompile block and use the same commands, I get: …

i.e., seemingly more than an order of magnitude slower.

I say seemingly because there's something quite interesting about the profile. Here's a snippet. Fast version: …

whereas the slow version is: …

There is a puzzle here: if the two differ by an order of magnitude, why is the number of samples approximately the same? (At top level, the slow one has 30 samples and the fast one 28.) Perhaps `samples=10000 evals=1` is insufficient to ensure that they are running the same workload, but it's not entirely clear.

CC @chriselrod
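The elided `@benchmark` invocations are presumably of this shape (reconstructed from the commands quoted elsewhere in the thread; the exact form here is an assumption):

```julia
using VectorizedRNG, Random, BenchmarkTools
lrng = local_rng()
x64 = Vector{Float64}(undef, 255)
@benchmark rand!($lrng, $x64) samples=10000 evals=1
```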