Keep Galley Plans Per Approximate Sparsity Pattern #679
Conversation
Small experiment showing the benefits of adapting to new inputs:
I have a few requests for this PR on the design of the GalleyExecutor.
Regarding the fix for the deferred node, I'd like to design a benchmark that shows the pathological behavior of #664 on the main branch before we declare it fixed here. I'm not quite sure when I first noticed the problem, but I don't see the 5 ms overhead on main anymore. I've updated the benchmark suite so that you can run `julia runjudge.jl -i high-level` to compare high-level benchmarks between main and the target branch. Do you think we could update the `SUITE["high-level"]["einsum_spmv_call_overhead"][scheduler_name]` benchmark so that the main branch exhibits the regression? Also, could we get `SUITE["high-level"]["einsum_spmv_call_adaptive"][scheduler_name]` to show the overhead of looking up among a large number of different kernels, so that we know whether we need a better search algorithm?
src/Galley/FinchCompat/executor.jl
was compiled for similar inputs and only compiles if it doesn't find one. If the `tag` argument is anything else,
it will only compile once for that tag and will skip this search process.
"""
@kwdef struct GalleyExecutor
I'd like the GalleyExecutor to have a configurable statistics similarity threshold, and to store the caches for each threshold and statistics type separately.
Also, I'd like to choose different executors for different compilation strategies (rather than using a sentinel tag value). Perhaps we can use the current executor for the "use first input strategy", and the GalleyExecutor for the "similar inputs" strategy. Then the GalleyExecutor wouldn't need a tag.
- It already caches it per-statistics type, and I can do the same for the threshold by making it a member of the struct. Will do that.
- I like this idea. It seems clearer. In this case, we need better names. Maybe we call this one "AdaptiveExecutor" and the other one "TagExecutor"? And only schedulers which rely on statistics will be valid choices for the "AdaptiveExecutor".
I'll make these changes.
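The split discussed above can be pictured roughly as follows. This is an illustrative Python sketch, not the Finch.jl API: `TagExecutor`, `AdaptiveExecutor`, `compile_fn`, and `similar_fn` are hypothetical names, and the real implementation would additionally key its caches per statistics type and threshold as discussed.

```python
class TagExecutor:
    """Compiles at most once per tag and reuses that plan unconditionally."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}  # tag -> compiled plan

    def get_plan(self, tag, program, stats):
        if tag not in self.cache:
            self.cache[tag] = self.compile_fn(program, stats)
        return self.cache[tag]


class AdaptiveExecutor:
    """Keeps one plan per 'approximately similar' input statistics."""

    def __init__(self, compile_fn, similar_fn, threshold=0.5):
        self.compile_fn = compile_fn
        self.similar_fn = similar_fn  # (stats_a, stats_b, threshold) -> bool
        self.threshold = threshold
        self.cache = {}  # program -> list of (stats, plan)

    def get_plan(self, program, stats):
        # Linear scan over previously seen statistics; a better search
        # structure would only matter once this list grows large.
        for cached_stats, plan in self.cache.setdefault(program, []):
            if self.similar_fn(cached_stats, stats, self.threshold):
                return plan
        plan = self.compile_fn(program, stats)
        self.cache[program].append((stats, plan))
        return plan
```

With this shape, the TagExecutor never needs a sentinel tag value for the adaptive behavior; callers simply pick the executor that matches the compilation strategy they want.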
I made all the changes, I think. Interestingly, there's a difference in the `compute()` and `@einsum` call overhead for the AdaptiveExecutor. I can't quite figure out why, so I'm leaving both overhead benchmarks in for the moment. I also tried to add a benchmark which shows the overhead of searching through the stats list. However, it's pretty fast, so even with this it's not super notable. Small note: Using
Awesome! If you run
When I run runjudge.jl, I'm also not seeing any slow call overhead on main. So, I don't know, maybe it was a fluke when you found it originally? At the very least, I think the changes to deferred are sensible and ought to prevent future issues.
I think my point is that these benchmarks don't appear to be able to stress the issue with hashing the actual inputs to the kernel. I wanted to see an instance where we could cause the Finch LogicExecutor to recompile when we give different inputs with the same program. Was this never an issue? I know the deferred nodes had inputs in them, so it's strange to see.
Ah, wait, it makes sense. The LogicExecutor doesn't hash any deferred nodes when looking up in the
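If that's right, the idea can be illustrated roughly as follows: the cache key hashes only the program structure and the types of deferred leaves, never the concrete input data, so calling the same program with different inputs hits the same cache entry. This is a plain-Python sketch with a made-up node encoding, not the actual LogicExecutor code:

```python
def cache_key(node):
    # Hypothetical node encoding: ("deferred", payload) for an input leaf,
    # or (op_name, [children...]) for an interior node.
    kind, payload = node
    if kind == "deferred":
        # Only the payload's type reaches the key, so two calls with the
        # same program but different data hash to the same cache entry.
        return ("deferred", type(payload).__name__)
    return (kind,) + tuple(cache_key(child) for child in payload)
```

Under this scheme, swapping in new input data of the same type leaves the key unchanged, which would explain why no recompilation shows up in the benchmarks.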
The code here looks great! I am concerned about a performance regression I'm noticing. It looks like this PR increases spmv call overhead by a factor of 378. Here's my benchmark command:
julia runjudge.jl -i high-level -e einsum_spmv_compile_overhead
Here's the results:
Benchmark Report for /Users/willow/Projects/Finch.jl
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Job Properties
==============
• Time of benchmarks:
• Target: 6 Jan 2025 - 10:40
• Baseline: 6 Jan 2025 - 10:43
• Package commits:
• Target: f52839
• Baseline: 061cda
• Julia commits:
• Target: 5e9a32
• Baseline: 5e9a32
• Julia command flags:
• Target: None
• Baseline: None
• Environment variables:
• Target: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
• Baseline: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
Results
=======
A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table
means that all benchmark results remained invariant between builds).
ID time ratio memory ratio
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– –––––––––––––– –––––––––––––
["high-level", "compute_spmv_call_overhead", "default_scheduler"] 0.92 (5%) ✅ 1.00 (1%)
["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 8.46 (5%) ❌ 5.41 (1%) ❌
["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 378.39 (5%) ❌ 27.91 (1%) ❌
["high-level", "einsum_spmv_call_overhead", "galley_scheduler"] 10.26 (5%) ❌ 8.48 (1%) ❌
["high-level", "matchain_adaptive_overhead", "default_scheduler"] 0.95 (5%) ✅ 1.00 (1%)
["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 14.93 (5%) ❌ 8.41 (1%) ❌
["high-level", "sddmm_fused", "default_scheduler"] 0.99 (5%) 0.80 (1%) ✅
["high-level", "sddmm_fused", "galley_scheduler"] 1.23 (5%) ❌ 1.02 (1%) ❌
["high-level", "sddmm_unfused", "galley_scheduler"] 1.01 (5%) 1.01 (1%) ❌
Benchmark Group List
====================
Here's a list of all the benchmark groups executed by this job:
• ["high-level", "compute_spmv_call_overhead"]
• ["high-level"]
• ["high-level", "einsum_spmv_call_overhead"]
• ["high-level", "matchain_adaptive_overhead"]
• ["high-level", "permutedims(Dense(Dense()))"]
• ["high-level", "permutedims(Dense(Sparse()))"]
• ["high-level", "sddmm_fused"]
• ["high-level", "sddmm_unfused"]
Julia versioninfo
=================
Target
––––––
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6440533 s 0 s 2738620 s 73397003 s 0 s
Memory: 32.0 GB (497.859375 MB free)
Uptime: 3.776767e6 sec
Load Avg: 1.88916015625 1.72412109375 2.83203125
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Baseline
––––––––
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6442474 s 0 s 2738975 s 73412682 s 0 s
Memory: 32.0 GB (707.78125 MB free)
Uptime: 3.776917e6 sec
Load Avg: 2.01904296875 1.88134765625 2.71826171875
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Target result
Benchmark Report for /Users/willow/Projects/Finch.jl
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Job Properties
==============
• Time of benchmark: 6 Jan 2025 - 10:40
• Package commit: f52839
• Julia commit: 5e9a32
• Julia command flags: None
• Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
Results
=======
Below is a table of this job's results, obtained by running the benchmarks. The values listed
in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages
accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the
reported value. An empty cell means that the value was zero.
ID time GC time memory allocations
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– –––––––– ––––––––––––––– –––––––––––
["high-level", "compute_spmv_call_overhead", "default_scheduler"] 17.625 μs (5%) 16.28 KiB (1%) 397
["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 157.125 μs (5%) 79.52 KiB (1%) 2335
["high-level", "einsum_spmv_baremetal"] 5.208 ns (5%)
["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 7.899 ms (5%) 485.77 KiB (1%) 9744
["high-level", "einsum_spmv_call_overhead", "galley_scheduler"] 56.253 ms (5%) 4.20 MiB (1%) 86087
["high-level", "matchain_adaptive_overhead", "default_scheduler"] 31.791 μs (5%) 30.81 KiB (1%) 627
["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 634.709 μs (5%) 245.55 KiB (1%) 9680
["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 264.683 ms (5%) 8.925 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 267.400 ms (5%) 9.613 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"] 83.605 ms (5%) 2.015 ms 170.31 MiB (1%) 337
["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"] 80.687 ms (5%) 170.37 MiB (1%) 337
["high-level", "sddmm_fused", "default_scheduler"] 6.808 ms (5%) 266.86 KiB (1%) 708
["high-level", "sddmm_fused", "galley_scheduler"] 1.691 ms (5%) 7.84 MiB (1%) 4693
["high-level", "sddmm_unfused", "default_scheduler"] 146.093 ms (5%) 7.89 MiB (1%) 853
["high-level", "sddmm_unfused", "galley_scheduler"] 184.917 ms (5%) 15.50 MiB (1%) 5709
Benchmark Group List
====================
Here's a list of all the benchmark groups executed by this job:
• ["high-level", "compute_spmv_call_overhead"]
• ["high-level"]
• ["high-level", "einsum_spmv_call_overhead"]
• ["high-level", "matchain_adaptive_overhead"]
• ["high-level", "permutedims(Dense(Dense()))"]
• ["high-level", "permutedims(Dense(Sparse()))"]
• ["high-level", "sddmm_fused"]
• ["high-level", "sddmm_unfused"]
Julia versioninfo
=================
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6440533 s 0 s 2738620 s 73397003 s 0 s
Memory: 32.0 GB (497.859375 MB free)
Uptime: 3.776767e6 sec
Load Avg: 1.88916015625 1.72412109375 2.83203125
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Baseline result
Benchmark Report for /Users/willow/Projects/Finch.jl
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Job Properties
==============
• Time of benchmark: 6 Jan 2025 - 10:43
• Package commit: 061cda
• Julia commit: 5e9a32
• Julia command flags: None
• Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
Results
=======
Below is a table of this job's results, obtained by running the benchmarks. The values listed
in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages
accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the
reported value. An empty cell means that the value was zero.
ID time GC time memory allocations
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– ––––––––– ––––––––––––––– –––––––––––
["high-level", "compute_spmv_call_overhead", "default_scheduler"] 19.125 μs (5%) 16.28 KiB (1%) 397
["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 18.583 μs (5%) 14.70 KiB (1%) 375
["high-level", "einsum_spmv_baremetal"] 5.209 ns (5%)
["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 20.875 μs (5%) 17.41 KiB (1%) 423
["high-level", "einsum_spmv_call_overhead", "galley_scheduler"] 5.484 ms (5%) 507.25 KiB (1%) 10180
["high-level", "matchain_adaptive_overhead", "default_scheduler"] 33.500 μs (5%) 30.81 KiB (1%) 627
["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 42.500 μs (5%) 29.20 KiB (1%) 615
["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 276.038 ms (5%) 10.369 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 274.323 ms (5%) 2.478 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"] 80.988 ms (5%) 170.32 MiB (1%) 337
["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"] 81.542 ms (5%) 1.023 ms 170.36 MiB (1%) 337
["high-level", "sddmm_fused", "default_scheduler"] 6.877 ms (5%) 332.42 KiB (1%) 712
["high-level", "sddmm_fused", "galley_scheduler"] 1.369 ms (5%) 7.69 MiB (1%) 669
["high-level", "sddmm_unfused", "default_scheduler"] 145.995 ms (5%) 7.96 MiB (1%) 857
["high-level", "sddmm_unfused", "galley_scheduler"] 183.967 ms (5%) 15.33 MiB (1%) 814
Benchmark Group List
====================
Here's a list of all the benchmark groups executed by this job:
• ["high-level", "compute_spmv_call_overhead"]
• ["high-level"]
• ["high-level", "einsum_spmv_call_overhead"]
• ["high-level", "matchain_adaptive_overhead"]
• ["high-level", "permutedims(Dense(Dense()))"]
• ["high-level", "permutedims(Dense(Sparse()))"]
• ["high-level", "sddmm_fused"]
• ["high-level", "sddmm_unfused"]
Julia versioninfo
=================
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6442474 s 0 s 2738975 s 73412682 s 0 s
Memory: 32.0 GB (707.78125 MB free)
Uptime: 3.776917e6 sec
Load Avg: 2.01904296875 1.88134765625 2.71826171875
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
I just ran `julia runjudge.jl -i high-level -e einsum_spmv_compile_overhead` and I only saw major differences on the galley_scheduler, which makes sense because that one was pretty significantly changed. Might be worth a quick meeting to troubleshoot?
Can confirm your results. I think either my `tune.json` was stale or my branch head was configured incorrectly (sometimes
the overhead is much, much higher here, but we expect this overhead to scale with the inputs.
Yeah, the one thing which I can't figure out is why the einsum interface and the compute interface have such different overheads using the AdaptiveExecutor. This also shows up somewhat intermittently for me.
Which two benchmarks are you comparing?
That overhead is quite large. I'm happy to merge this PR now and open an issue, or wait to merge until we figure it out. I'll leave it up to you how to proceed. Thanks for your help with this PR!
I'm happy to leave it for an issue, especially because it doesn't hit through the compute interface.
This PR changes the caching in Galley to keep a set of plans per program when called with `tag=:global`. It associates these plans with the statistics of the inputs. When the same program is invoked again, it checks this cache for a plan whose inputs had "similar" stats. If it finds one, it reuses that plan immediately. Otherwise, it compiles a new plan.
fixes #664
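One way to picture the "similar stats" check described above is a relative tolerance on estimated nonzero counts. This is a hedged sketch: the field names and the relative-nnz rule are illustrative, not Galley's actual similarity metric.

```python
def stats_similar(a, b, rel_tol=0.5):
    """a, b: dicts with a 'shape' and an estimated 'nnz' (illustrative fields).
    A cached plan is reused when shapes match and the nnz estimates agree
    within a relative tolerance; otherwise a fresh plan is compiled."""
    if a["shape"] != b["shape"]:
        return False
    hi = max(a["nnz"], b["nnz"])
    # Two empty inputs are trivially similar; otherwise compare relatively.
    return hi == 0 or abs(a["nnz"] - b["nnz"]) / hi <= rel_tol
```

Making `rel_tol` a configurable member of the executor (as discussed in review) would let callers trade plan reuse against plan quality per workload.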