argmin() is much slower than findmin() due to automatic inline on AMD Zen4 #56375

Open
Moelf opened this issue Oct 28, 2024 · 9 comments
Labels: fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster), upstream (The issue is with an upstream dependency, e.g. LLVM)

Comments

Moelf (Contributor) commented Oct 28, 2024

julia/base/reducedim.jl

Lines 1245 to 1246 in 2cdfe06

"""
argmin(A::AbstractArray; dims=:) = findmin(A; dims=dims)[2]

but:

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 46 samples with 50 evaluations
 min    411.306 μs
 median 419.671 μs
 mean   419.933 μs
 max    448.662 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.239 ms
 median 2.246 ms
 mean   2.252 ms
 max    2.306 ms

somehow the [2] is destroying performance:

julia> slow_argmin(x) = findmin(x; dims=:)[2]
slow_argmin (generic function with 1 method)

julia> fast_argmin(x) = findmin(x; dims=:)
fast_argmin (generic function with 1 method)

julia> @be rand(Float64, 512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.233 ms
 median 2.264 ms
 mean   2.262 ms
 max    2.309 ms

julia> @be rand(Float64, 512000) fast_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
 min    412.383 μs
 median 421.955 μs
 mean   425.285 μs
 max    475.616 μs
Moelf (Contributor, Author) commented Oct 28, 2024

related to #41963

Moelf (Contributor, Author) commented Oct 28, 2024

Actually, if you force no inlining it's faster...

julia> noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
noinline_argmin (generic function with 1 method)

julia> @be rand(Float64, 512000) noinline_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
 min    415.286 μs (2 allocs: 48 bytes)
 median 420.158 μs (2 allocs: 48 bytes)
 mean   423.787 μs (2 allocs: 48 bytes)
 max    476.411 μs (2 allocs: 48 bytes)
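
(For reference, a minimal sketch of the same workaround as a standalone definition; my_argmin is a hypothetical name, not a proposed Base change. The parentheses make @noinline apply to the findmin call rather than to the tuple indexing.)

# hedged sketch: keep findmin out of the caller's inlined body so indexing
# the (value, index) tuple cannot change how the reduction kernel is compiled
my_argmin(A::AbstractArray; dims=:) = (@noinline findmin(A; dims=dims))[2]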

Moelf changed the title from “argmin() is much slower than findmin()” to “argmin() is much slower than findmin() due to automatic inline” on Oct 29, 2024
giordano added the performance (Must go faster) and fold (sum, maximum, reduce, foldl, etc.) labels on Oct 29, 2024
Zentrik (Member) commented Oct 29, 2024

Presumably this is just LLVM being dumb. Do you see the same issue on master?

Moelf (Contributor, Author) commented Oct 29, 2024

yes, still on master:

#Version 1.12.0-DEV.1508 (2024-10-28)

julia> @be rand(512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.214 ms
 median 2.217 ms
 mean   2.225 ms
 max    2.258 ms

julia> @be rand(512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
 min    409.705 μs
 median 411.422 μs
 mean   415.092 μs
 max    451.189 μs

Zentrik (Member) commented Oct 29, 2024

FWIW, it doesn't reproduce for me (or on LLVM 19, for that matter):

julia> versioninfo()
Julia Version 1.12.0-DEV.1502
Commit ee09ae70d9f (2024-10-26 01:01 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver1)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Moelf (Contributor, Author) commented Oct 29, 2024

Looks like you're on Zen1, and I can also confirm it doesn't happen on Zen2, but it does happen on Zen4!

julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 36 samples with 50 evaluations
 min    544.455 μs
 median 545.443 μs
 mean   545.703 μs
 max    554.404 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 34 samples with 50 evaluations
 min    564.647 μs
 median 566.045 μs
 mean   566.817 μs
 max    578.258 μs

julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
 min    405.372 μs
 median 407.367 μs
 mean   409.003 μs
 max    451.877 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.233 ms
 median 2.244 ms
 mean   2.266 ms
 max    2.389 ms

Moelf changed the title from “argmin() is much slower than findmin() due to automatic inline” to “argmin() is much slower than findmin() due to automatic inline on AMD Zen4” on Oct 29, 2024
giordano (Contributor) commented Oct 29, 2024

This really sounds like an upstream bug in LLVM:

$ julia +nightly -q
julia> using Chairmarks

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 33 samples with 50 evaluations
 min    558.099 μs
 median 558.759 μs
 mean   558.799 μs
 max    559.535 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 7 samples with 50 evaluations
 min    3.058 ms
 median 3.063 ms
 mean   3.063 ms
 max    3.067 ms

julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 384 × AMD EPYC 9654 96-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)
$ julia +nightly -Cx86_64 -q
julia> using Chairmarks

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 29 samples with 50 evaluations
 min    626.947 μs
 median 627.900 μs
 mean   627.837 μs
 max    628.868 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 28 samples with 50 evaluations
 min    697.109 μs
 median 698.185 μs
 mean   698.274 μs
 max    699.674 μs

argmin is over 4x faster when using the generic x86_64 target rather than the native target (znver4).

Zentrik (Member) commented Oct 30, 2024

Maybe it's llvm/llvm-project#91370, i.e. it's generating gather and scatter instructions which are slow.
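
A hedged way to check that hypothesis locally: this sketch (mine, not from the linked issue) dumps the native code for the slow path and counts instruction mnemonics that I would expect for gathers/scatters, packed AVX min/compare, and scalar fallbacks. slow_argmin is the same helper defined earlier in the thread.

using InteractiveUtils

slow_argmin(x) = findmin(x; dims=:)[2]

buf = IOBuffer()
code_native(buf, slow_argmin, Tuple{Vector{Float64}}; syntax=:intel, debuginfo=:none)
asm = String(take!(buf))

# count gather/scatter vs. packed vs. scalar double-precision min/compare instructions
for pat in ("vgather", "vscatter", "vminpd", "vcmppd", "vminsd", "vcmpsd")
    println(rpad(pat, 10), count(pat, asm))
end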

giordano (Contributor) commented Oct 30, 2024

This is the LLVM module I get on znver4 with Julia nightly (LLVM 18.1), and the corresponding native code: https://godbolt.org/z/fPMcMEM48 (unfortunately we can't choose the target in the Julia frontend of godbolt until #52949 is resolved, but you can get the LLVM IR with https://godbolt.org/z/99W5EG7d4 and then paste it into the LLVM IR input; a local way to dump the same IR is sketched at the end of this comment). I don't see gather/scatter instructions, but apart from a lone vunpcklpd there are no other packed instructions. If you change the target to znver3 you get more packed instructions, and argmin performance is also noticeably better:

$ julia +nightly -Cznver3 -q
julia> using Chairmarks

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 35 samples with 50 evaluations
 min    557.571 μs
 median 557.980 μs
 mean   558.067 μs
 max    558.717 μs

matching native findmin performance.
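
For completeness, a minimal sketch (my own, not from the godbolt links above) of dumping the same optimized LLVM IR locally so it can be pasted into godbolt's LLVM IR input; slow_argmin and the filename argmin.ll are just illustrative names.

using InteractiveUtils

slow_argmin(x) = findmin(x; dims=:)[2]

open("argmin.ll", "w") do io
    # full module, post-optimization, no debug info; easier to feed to llc/godbolt
    code_llvm(io, slow_argmin, Tuple{Vector{Float64}};
              raw=true, dump_module=true, optimize=true, debuginfo=:none)
end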

giordano added the upstream (The issue is with an upstream dependency, e.g. LLVM) label on Oct 30, 2024