argmin() is much slower than findmin() due to automatic inline on AMD Zen4 #56375
related to #41963 |
Actually, if you force no inlining it's faster:
julia> noinline_argmin(x) = @noinline findmin(x; dims=:)[2]
noinline_argmin (generic function with 1 method)
julia> @be rand(Float64, 512000) noinline_argmin samples=100 evals=50
Benchmark: 45 samples with 50 evaluations
min 415.286 μs (2 allocs: 48 bytes)
median 420.158 μs (2 allocs: 48 bytes)
mean 423.787 μs (2 allocs: 48 bytes)
max 476.411 μs (2 allocs: 48 bytes) |
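For anyone who wants to try the workaround without installing Chairmarks, here is a minimal sketch using only Base. The helper name `best_time` is made up for this example, and the timings are crude compared to `@be`; it only checks that the non-inlined variant returns the same index as `argmin` and prints a rough best-of-N time for each.

```julia
# Call-site @noinline requires Julia 1.8 or newer.
noinline_argmin(x) = @noinline findmin(x; dims=:)[2]

x = rand(Float64, 512000)

# Both variants must agree on the index of the minimum.
@assert noinline_argmin(x) == argmin(x)

# Hypothetical helper: best wall-clock time over `reps` calls.
# The first call is a warm-up so compilation is not measured.
function best_time(f, x; reps=50)
    f(x)
    return minimum(@elapsed(f(x)) for _ in 1:reps)
end

println("argmin:          ", best_time(argmin, x), " s")
println("noinline_argmin: ", best_time(noinline_argmin, x), " s")
```

Whether the two numbers differ should depend on the CPU target, as the results later in this thread suggest.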
Presumably this is just llvm being dumb. Do you see the same issue on master? |
Yes, still on master (Version 1.12.0-DEV.1508, 2024-10-28):
julia> @be rand(512000) slow_argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.214 ms
median 2.217 ms
mean 2.225 ms
max 2.258 ms
julia> @be rand(512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
min 409.705 μs
median 411.422 μs
mean 415.092 μs
max 451.189 μs |
Fwiw doesn't reproduce on (or llvm 19 for that matter)
|
Looks like you're on Zen1, and I can also confirm it doesn't happen on Zen2, but it does happen on Zen4!
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 36 samples with 50 evaluations
min 544.455 μs
median 545.443 μs
mean 545.703 μs
max 554.404 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 34 samples with 50 evaluations
min 564.647 μs
median 566.045 μs
mean 566.817 μs
max 578.258 μs
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 47 samples with 50 evaluations
min 405.372 μs
median 407.367 μs
mean 409.003 μs
max 451.877 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.233 ms
median 2.244 ms
mean 2.266 ms
max 2.389 ms |
This really sounds like an upstream bug in LLVM:
$ julia +nightly -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 33 samples with 50 evaluations
min 558.099 μs
median 558.759 μs
mean 558.799 μs
max 559.535 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 7 samples with 50 evaluations
min 3.058 ms
median 3.063 ms
mean 3.063 ms
max 3.067 ms
julia> versioninfo()
Julia Version 1.12.0-DEV.1514
Commit e4dc9d357a1 (2024-10-29 17:18 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 384 × AMD EPYC 9654 96-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)
$ julia +nightly -Cx86_64 -q
julia> using Chairmarks
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 29 samples with 50 evaluations
min 626.947 μs
median 627.900 μs
mean 627.837 μs
max 628.868 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 28 samples with 50 evaluations
min 697.109 μs
median 698.185 μs
mean 698.274 μs
max 699.674 μs
|
Maybe it's llvm/llvm-project#91370, i.e. it's generating gather and scatter instructions which are slow. |
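One rough way to check the gather/scatter hypothesis locally is to dump the optimized LLVM IR for `argmin` and search it for the masked gather/scatter intrinsics. This is only a sketch: it assumes the intrinsics would appear under the usual `llvm.masked.gather` / `llvm.masked.scatter` names, and it inspects only the IR of the outer `argmin` method, so code in callees that were not inlined will not show up.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Capture the optimized LLVM IR for argmin(::Vector{Float64}) as a string.
ir = sprint(io -> code_llvm(io, argmin, Tuple{Vector{Float64}}; debuginfo=:none))

# Sys.CPU_NAME reports the CPU name LLVM detected for the host (e.g. znver4).
println("CPU target:         ", Sys.CPU_NAME)
println("gather intrinsics:  ", count("llvm.masked.gather", ir))
println("scatter intrinsics: ", count("llvm.masked.scatter", ir))
```

Running the same script with `julia -Cx86_64` would show whether the generic target produces different IR.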
This is the LLVM module I get on znver4 with Julia nightly (LLVM 18.1), and the corresponding native code: https://godbolt.org/z/fPMcMEM48 (unfortunately we can't choose the target in the Julia frontend of godbolt until #52949 is resolved, but you can get the LLVM IR with https://godbolt.org/z/99W5EG7d4 and then copy-paste it as input for LLVM IR). I don't see gather/scatter instructions, but apart from a lone
matching native |
julia/base/reducedim.jl
Lines 1245 to 1246 in 2cdfe06
but:
somehow [2] is destroying performance: