argmin() is much slower than findmin() due to automatic inline on AMD Zen4 #56375

Open
Moelf opened this issue Oct 28, 2024 · 9 comments
Labels: fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster), upstream (The issue is with an upstream dependency, e.g. LLVM)

Comments

Moelf (Contributor) commented Oct 28, 2024

julia/base/reducedim.jl

Lines 1245 to 1246 in 2cdfe06

"""
argmin(A::AbstractArray; dims=:) = findmin(A; dims=dims)[2]

but:

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 46 samples with 50 evaluations
 min    411.306 μs
 median 419.671 μs
 mean   419.933 μs
 max    448.662 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.239 ms
 median 2.246 ms
 mean   2.252 ms
 max    2.306 ms

somehow the [2] is destroying performance:

julia> slow_argmin(x) = findmin(x; dims=:)[2]
slow_argmin (generic function with 1 method)

julia> fast_argmin(x) = findmin(x; dims=:)
fast_argmin (generic function with 1 method)

julia> @be rand(Float64, 512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.233 ms
 median 2.264 ms
 mean   2.262 ms
 max    2.309 ms

julia> @be rand(Float64, 512000) fast_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
 min    412.383 μs
 median 421.955 μs
 mean   425.285 μs
 max    475.616 μs
Moelf (Contributor, Author) commented Oct 28, 2024

related to #41963

Moelf (Contributor, Author) commented Oct 28, 2024

Actually, if you force no inlining it's faster...

julia> noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
noinline_argmin (generic function with 1 method)

julia> @be rand(Float64, 512000) noinline_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
 min    415.286 μs (2 allocs: 48 bytes)
 median 420.158 μs (2 allocs: 48 bytes)
 mean   423.787 μs (2 allocs: 48 bytes)
 max    476.411 μs (2 allocs: 48 bytes)
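
(For reference, a minimal sketch of the same workaround as a standalone definition; my_argmin is a hypothetical name, not a proposed Base change. The parentheses make @noinline apply to the findmin call rather than to the tuple indexing.)

# hedged sketch: keep findmin out of the caller's inlined body so indexing
# the (value, index) tuple cannot change how the reduction kernel is compiled
my_argmin(A::AbstractArray; dims=:) = (@noinline findmin(A; dims=dims))[2]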

Moelf changed the title from “argmin() is much slower than findmin()” to “argmin() is much slower than findmin() due to automatic inline” on Oct 29, 2024
giordano added the performance (Must go faster) and fold (sum, maximum, reduce, foldl, etc.) labels on Oct 29, 2024
Zentrik (Member) commented Oct 29, 2024

Presumably this is just LLVM being dumb. Do you see the same issue on master?

Moelf (Contributor, Author) commented Oct 29, 2024

yes, still on master:

#Version 1.12.0-DEV.1508 (2024-10-28)

julia> @be rand(512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.214 ms
 median 2.217 ms
 mean   2.225 ms
 max    2.258 ms

julia> @be rand(512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
 min    409.705 μs
 median 411.422 μs
 mean   415.092 μs
 max    451.189 μs

Zentrik (Member) commented Oct 29, 2024

FWIW, it doesn't reproduce for me (or on LLVM 19, for that matter):

julia> versioninfo()
Julia Version 1.12.0-DEV.1502
Commit ee09ae70d9f (2024-10-26 01:01 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver1)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Moelf (Contributor, Author) commented Oct 29, 2024

Looks like you're on Zen1, and I can also confirm it doesn't happen on Zen2, but it does happen on Zen4!

julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 36 samples with 50 evaluations
 min    544.455 μs
 median 545.443 μs
 mean   545.703 μs
 max    554.404 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 34 samples with 50 evaluations
 min    564.647 μs
 median 566.045 μs
 mean   566.817 μs
 max    578.258 μs

julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
 min    405.372 μs
 median 407.367 μs
 mean   409.003 μs
 max    451.877 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.233 ms
 median 2.244 ms
 mean   2.266 ms
 max    2.389 ms

Moelf changed the title from “argmin() is much slower than findmin() due to automatic inline” to “argmin() is much slower than findmin() due to automatic inline on AMD Zen4” on Oct 29, 2024
giordano (Contributor) commented Oct 29, 2024

This really sounds like an upstream bug in LLVM:

$ julia +nightly -q
julia> using Chairmarks

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 33 samples with 50 evaluations
 min    558.099 μs
 median 558.759 μs
 mean   558.799 μs
 max    559.535 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 7 samples with 50 evaluations
 min    3.058 ms
 median 3.063 ms
 mean   3.063 ms
 max    3.067 ms

julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 384 × AMD EPYC 9654 96-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)
$ julia +nightly -Cx86_64 -q
julia> using Chairmarks

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 29 samples with 50 evaluations
 min    626.947 μs
 median 627.900 μs
 mean   627.837 μs
 max    628.868 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 28 samples with 50 evaluations
 min    697.109 μs
 median 698.185 μs
 mean   698.274 μs
 max    699.674 μs

argmin is over 4x faster when using the generic x86_64 target rather than the native target (znver4).

Zentrik (Member) commented Oct 30, 2024

Maybe it's llvm/llvm-project#91370, i.e. it's generating gather and scatter instructions which are slow.
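
A hedged way to check that hypothesis locally: this sketch (mine, not from the linked issue) dumps the native code for the slow path and counts instruction mnemonics that I would expect for gathers/scatters, packed AVX min/compare, and scalar fallbacks. slow_argmin is the same helper defined earlier in the thread.

using InteractiveUtils

slow_argmin(x) = findmin(x; dims=:)[2]

buf = IOBuffer()
code_native(buf, slow_argmin, Tuple{Vector{Float64}}; syntax=:intel, debuginfo=:none)
asm = String(take!(buf))

# count gather/scatter vs. packed vs. scalar double-precision min/compare instructions
for pat in ("vgather", "vscatter", "vminpd", "vcmppd", "vminsd", "vcmpsd")
    println(rpad(pat, 10), count(pat, asm))
end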

giordano (Contributor) commented Oct 30, 2024

This is the LLVM module I get on znver4 with Julia nightly (LLVM 18.1), and the corresponding native code: https://godbolt.org/z/fPMcMEM48 (unfortunately we can't choose the target in the Julia frontend of godbolt until #52949 is resolved, but you can get the LLVM IR with https://godbolt.org/z/99W5EG7d4 and then paste it into the LLVM IR input; a local way to dump the same IR is sketched at the end of this comment). I don't see gather/scatter instructions, but apart from a lone vunpcklpd there are no other packed instructions. If you change the target to znver3 you get more packed instructions, and argmin performance is also noticeably better:

$ julia +nightly -Cznver3 -q
julia> using Chairmarks

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 35 samples with 50 evaluations
 min    557.571 μs
 median 557.980 μs
 mean   558.067 μs
 max    558.717 μs

matching native findmin performance.
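
For completeness, a minimal sketch (my own, not from the godbolt links above) of dumping the same optimized LLVM IR locally so it can be pasted into godbolt's LLVM IR input; slow_argmin and the filename argmin.ll are just illustrative names.

using InteractiveUtils

slow_argmin(x) = findmin(x; dims=:)[2]

open("argmin.ll", "w") do io
    # full module, post-optimization, no debug info; easier to feed to llc/godbolt
    code_llvm(io, slow_argmin, Tuple{Vector{Float64}};
              raw=true, dump_module=true, optimize=true, debuginfo=:none)
end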

giordano added the upstream (The issue is with an upstream dependency, e.g. LLVM) label on Oct 30, 2024