Keep Galley Plans Per Approximate Sparsity Pattern #679
Conversation
Small experiment showing the benefits of adapting to new inputs:
I have a few requests for this PR on the design of the GalleyExecutor.
Regarding the fix for the deferred node, I'd like to design a benchmark that shows the pathological behavior of #664 on the main branch before we declare it fixed here. I'm not quite sure when I first noticed the problem, but I don't see the 5 ms overhead on main anymore. I've updated the benchmark suite so that you can run `julia runjudge.jl -i high-level` to compare high-level benchmarks between main and the target branch. Do you think we could update the `SUITE["high-level"]["einsum_spmv_call_overhead"][scheduler_name]` benchmark so that the main branch exhibits the regression? Also, could we get `SUITE["high-level"]["einsum_spmv_call_adaptive"][scheduler_name]` to show the overhead of looking up among a large number of different kernels, so that we know whether we need a better search algorithm?
src/Galley/FinchCompat/executor.jl
was compiled for similar inputs and only compiles if it doesn't find one. If the `tag` argument is anything else,
it will only compile once for that tag and will skip this search process.
"""
@kwdef struct GalleyExecutor
I'd like the GalleyExecutor to have a configurable statistics similarity threshold, and to store the caches for each threshold and statistics type separately.
Also, I'd like to choose different executors for different compilation strategies (rather than using a sentinel tag value). Perhaps we can use the current executor for the "use first input strategy", and the GalleyExecutor for the "similar inputs" strategy. Then the GalleyExecutor wouldn't need a tag.
- It already caches it per-statistics type, and I can do the same for the threshold by making it a member of the struct. Will do that.
- I like this idea. It seems clearer. In this case, we need better names. Maybe we call this one "AdaptiveExecutor" and the other one "TagExecutor"? And only schedulers which rely on statistics will be valid choices for the "AdaptiveExecutor".
I'll make these changes.
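The split discussed above can be pictured roughly as follows. This is an illustrative Python sketch, not the Finch.jl API: `TagExecutor`, `AdaptiveExecutor`, `compile_fn`, and `similar_fn` are hypothetical names, and the real implementation would additionally key its caches per statistics type and threshold as discussed.

```python
class TagExecutor:
    """Compiles at most once per tag and reuses that plan unconditionally."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}  # tag -> compiled plan

    def get_plan(self, tag, program, stats):
        if tag not in self.cache:
            self.cache[tag] = self.compile_fn(program, stats)
        return self.cache[tag]


class AdaptiveExecutor:
    """Keeps one plan per 'approximately similar' input statistics."""

    def __init__(self, compile_fn, similar_fn, threshold=0.5):
        self.compile_fn = compile_fn
        self.similar_fn = similar_fn  # (stats_a, stats_b, threshold) -> bool
        self.threshold = threshold
        self.cache = {}  # program -> list of (stats, plan)

    def get_plan(self, program, stats):
        # Linear scan over previously seen statistics; a better search
        # structure would only matter once this list grows large.
        for cached_stats, plan in self.cache.setdefault(program, []):
            if self.similar_fn(cached_stats, stats, self.threshold):
                return plan
        plan = self.compile_fn(program, stats)
        self.cache[program].append((stats, plan))
        return plan
```

With this shape, the TagExecutor never needs a sentinel tag value for the adaptive behavior; callers simply pick the executor that matches the compilation strategy they want.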
I made all the changes, I think. Interestingly, there's a difference in the `compute()` and `@einsum` call overhead for the AdaptiveExecutor. I can't quite figure out why, so I'm leaving both overhead benchmarks in for the moment. I also tried to add a benchmark which shows the overhead of searching through the stats list. However, it's pretty fast, so even with this it's not super notable. Small note: Using
Awesome! If you run
When I run runjudge.jl, I'm also not seeing any slow call overhead on main. So, I don't know, maybe it was a fluke when you found it originally? At the very least, I think the changes to deferred are sensible and ought to prevent future issues.
I think my point is that these benchmarks don't appear to be able to stress the issue with hashing the actual inputs to the kernel. I wanted to see an instance where we could cause the Finch LogicExecutor to recompile when we give different inputs with the same program. Was this never an issue? I know the deferred nodes had inputs in them, so it's strange to see.
Ah, wait, it makes sense. The LogicExecutor doesn't hash any deferred nodes when looking up in the
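If that's right, the idea can be illustrated roughly as follows: the cache key hashes only the program structure and the types of deferred leaves, never the concrete input data, so calling the same program with different inputs hits the same cache entry. This is a plain-Python sketch with a made-up node encoding, not the actual LogicExecutor code:

```python
def cache_key(node):
    # Hypothetical node encoding: ("deferred", payload) for an input leaf,
    # or (op_name, [children...]) for an interior node.
    kind, payload = node
    if kind == "deferred":
        # Only the payload's type reaches the key, so two calls with the
        # same program but different data hash to the same cache entry.
        return ("deferred", type(payload).__name__)
    return (kind,) + tuple(cache_key(child) for child in payload)
```

Under this scheme, swapping in new input data of the same type leaves the key unchanged, which would explain why no recompilation shows up in the benchmarks.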
The code here looks great! I am concerned about a performance regression I'm noticing. It looks like this PR increases spmv call overhead by a factor of 378. Here's my benchmark command:
julia runjudge.jl -i high-level -e einsum_spmv_compile_overhead
Here's the results:
Benchmark Report for /Users/willow/Projects/Finch.jl
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Job Properties
==============
• Time of benchmarks:
• Target: 6 Jan 2025 - 10:40
• Baseline: 6 Jan 2025 - 10:43
• Package commits:
• Target: f52839
• Baseline: 061cda
• Julia commits:
• Target: 5e9a32
• Baseline: 5e9a32
• Julia command flags:
• Target: None
• Baseline: None
• Environment variables:
• Target: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
• Baseline: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
Results
=======
A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table
means that all benchmark results remained invariant between builds).
ID time ratio memory ratio
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– –––––––––––––– –––––––––––––
["high-level", "compute_spmv_call_overhead", "default_scheduler"] 0.92 (5%) ✅ 1.00 (1%)
["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 8.46 (5%) ❌ 5.41 (1%) ❌
["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 378.39 (5%) ❌ 27.91 (1%) ❌
["high-level", "einsum_spmv_call_overhead", "galley_scheduler"] 10.26 (5%) ❌ 8.48 (1%) ❌
["high-level", "matchain_adaptive_overhead", "default_scheduler"] 0.95 (5%) ✅ 1.00 (1%)
["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 14.93 (5%) ❌ 8.41 (1%) ❌
["high-level", "sddmm_fused", "default_scheduler"] 0.99 (5%) 0.80 (1%) ✅
["high-level", "sddmm_fused", "galley_scheduler"] 1.23 (5%) ❌ 1.02 (1%) ❌
["high-level", "sddmm_unfused", "galley_scheduler"] 1.01 (5%) 1.01 (1%) ❌
Benchmark Group List
====================
Here's a list of all the benchmark groups executed by this job:
• ["high-level", "compute_spmv_call_overhead"]
• ["high-level"]
• ["high-level", "einsum_spmv_call_overhead"]
• ["high-level", "matchain_adaptive_overhead"]
• ["high-level", "permutedims(Dense(Dense()))"]
• ["high-level", "permutedims(Dense(Sparse()))"]
• ["high-level", "sddmm_fused"]
• ["high-level", "sddmm_unfused"]
Julia versioninfo
=================
Target
––––––
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6440533 s 0 s 2738620 s 73397003 s 0 s
Memory: 32.0 GB (497.859375 MB free)
Uptime: 3.776767e6 sec
Load Avg: 1.88916015625 1.72412109375 2.83203125
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Baseline
––––––––
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6442474 s 0 s 2738975 s 73412682 s 0 s
Memory: 32.0 GB (707.78125 MB free)
Uptime: 3.776917e6 sec
Load Avg: 2.01904296875 1.88134765625 2.71826171875
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Target result
Benchmark Report for /Users/willow/Projects/Finch.jl
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Job Properties
==============
• Time of benchmark: 6 Jan 2025 - 10:40
• Package commit: f52839
• Julia commit: 5e9a32
• Julia command flags: None
• Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
Results
=======
Below is a table of this job's results, obtained by running the benchmarks. The values listed
in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages
accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the
reported value. An empty cell means that the value was zero.
ID time GC time memory allocations
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– –––––––– ––––––––––––––– –––––––––––
["high-level", "compute_spmv_call_overhead", "default_scheduler"] 17.625 μs (5%) 16.28 KiB (1%) 397
["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 157.125 μs (5%) 79.52 KiB (1%) 2335
["high-level", "einsum_spmv_baremetal"] 5.208 ns (5%)
["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 7.899 ms (5%) 485.77 KiB (1%) 9744
["high-level", "einsum_spmv_call_overhead", "galley_scheduler"] 56.253 ms (5%) 4.20 MiB (1%) 86087
["high-level", "matchain_adaptive_overhead", "default_scheduler"] 31.791 μs (5%) 30.81 KiB (1%) 627
["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 634.709 μs (5%) 245.55 KiB (1%) 9680
["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 264.683 ms (5%) 8.925 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 267.400 ms (5%) 9.613 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"] 83.605 ms (5%) 2.015 ms 170.31 MiB (1%) 337
["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"] 80.687 ms (5%) 170.37 MiB (1%) 337
["high-level", "sddmm_fused", "default_scheduler"] 6.808 ms (5%) 266.86 KiB (1%) 708
["high-level", "sddmm_fused", "galley_scheduler"] 1.691 ms (5%) 7.84 MiB (1%) 4693
["high-level", "sddmm_unfused", "default_scheduler"] 146.093 ms (5%) 7.89 MiB (1%) 853
["high-level", "sddmm_unfused", "galley_scheduler"] 184.917 ms (5%) 15.50 MiB (1%) 5709
Benchmark Group List
====================
Here's a list of all the benchmark groups executed by this job:
• ["high-level", "compute_spmv_call_overhead"]
• ["high-level"]
• ["high-level", "einsum_spmv_call_overhead"]
• ["high-level", "matchain_adaptive_overhead"]
• ["high-level", "permutedims(Dense(Dense()))"]
• ["high-level", "permutedims(Dense(Sparse()))"]
• ["high-level", "sddmm_fused"]
• ["high-level", "sddmm_unfused"]
Julia versioninfo
=================
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6440533 s 0 s 2738620 s 73397003 s 0 s
Memory: 32.0 GB (497.859375 MB free)
Uptime: 3.776767e6 sec
Load Avg: 1.88916015625 1.72412109375 2.83203125
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Baseline result
Benchmark Report for /Users/willow/Projects/Finch.jl
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
Job Properties
==============
• Time of benchmark: 6 Jan 2025 - 10:43
• Package commit: 061cda
• Julia commit: 5e9a32
• Julia command flags: None
• Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e
einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
Results
=======
Below is a table of this job's results, obtained by running the benchmarks. The values listed
in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages
accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the
reported value. An empty cell means that the value was zero.
ID time GC time memory allocations
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– ––––––––– ––––––––––––––– –––––––––––
["high-level", "compute_spmv_call_overhead", "default_scheduler"] 19.125 μs (5%) 16.28 KiB (1%) 397
["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 18.583 μs (5%) 14.70 KiB (1%) 375
["high-level", "einsum_spmv_baremetal"] 5.209 ns (5%)
["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 20.875 μs (5%) 17.41 KiB (1%) 423
["high-level", "einsum_spmv_call_overhead", "galley_scheduler"] 5.484 ms (5%) 507.25 KiB (1%) 10180
["high-level", "matchain_adaptive_overhead", "default_scheduler"] 33.500 μs (5%) 30.81 KiB (1%) 627
["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 42.500 μs (5%) 29.20 KiB (1%) 615
["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 276.038 ms (5%) 10.369 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 274.323 ms (5%) 2.478 ms 762.94 MiB (1%) 126
["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"] 80.988 ms (5%) 170.32 MiB (1%) 337
["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"] 81.542 ms (5%) 1.023 ms 170.36 MiB (1%) 337
["high-level", "sddmm_fused", "default_scheduler"] 6.877 ms (5%) 332.42 KiB (1%) 712
["high-level", "sddmm_fused", "galley_scheduler"] 1.369 ms (5%) 7.69 MiB (1%) 669
["high-level", "sddmm_unfused", "default_scheduler"] 145.995 ms (5%) 7.96 MiB (1%) 857
["high-level", "sddmm_unfused", "galley_scheduler"] 183.967 ms (5%) 15.33 MiB (1%) 814
Benchmark Group List
====================
Here's a list of all the benchmark groups executed by this job:
• ["high-level", "compute_spmv_call_overhead"]
• ["high-level"]
• ["high-level", "einsum_spmv_call_overhead"]
• ["high-level", "matchain_adaptive_overhead"]
• ["high-level", "permutedims(Dense(Dense()))"]
• ["high-level", "permutedims(Dense(Sparse()))"]
• ["high-level", "sddmm_fused"]
• ["high-level", "sddmm_unfused"]
Julia versioninfo
=================
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
CPU: Apple M2 Max:
speed user nice sys idle irq
#1-12 2400 MHz 6442474 s 0 s 2738975 s 73412682 s 0 s
Memory: 32.0 GB (707.78125 MB free)
Uptime: 3.776917e6 sec
Load Avg: 2.01904296875 1.88134765625 2.71826171875
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
I just ran `julia runjudge.jl -i high-level -e einsum_spmv_compile_overhead` and I only saw major differences on the galley_scheduler, which makes sense because that one was pretty significantly changed. Might be worth a quick meeting to troubleshoot?
Can confirm your results. I think either my `tune.json` was stale or my branch head was configured incorrectly (sometimes
the overhead is much, much higher here, but we expect this overhead to scale with the inputs.
Yeah, the one thing which I can't figure out is why the einsum interface and the compute interface have such different overheads using the AdaptiveExecutor. This also shows up somewhat intermittently for me.
Which two benchmarks are you comparing?
That overhead is quite large. I'm happy to merge this PR now and open an issue, or wait to merge until we figure it out. I'll leave it up to you how to proceed. Thanks for your help with this PR!
I'm happy to leave it for an issue, especially because it doesn't hit through the compute interface.
This PR changes the caching in Galley to keep a set of plans per program when called with `tag=:global`. It associates these plans with the statistics of the inputs. When the same program is invoked again, it checks this cache for a plan whose inputs had "similar" stats. If it finds one, it reuses that plan immediately. Otherwise, it compiles a new plan.
fixes #664
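One way to picture the "similar stats" check described above is a relative tolerance on estimated nonzero counts. This is a hedged sketch: the field names and the relative-nnz rule are illustrative, not Galley's actual similarity metric.

```python
def stats_similar(a, b, rel_tol=0.5):
    """a, b: dicts with a 'shape' and an estimated 'nnz' (illustrative fields).
    A cached plan is reused when shapes match and the nnz estimates agree
    within a relative tolerance; otherwise a fresh plan is compiled."""
    if a["shape"] != b["shape"]:
        return False
    hi = max(a["nnz"], b["nnz"])
    # Two empty inputs are trivially similar; otherwise compare relatively.
    return hi == 0 or abs(a["nnz"] - b["nnz"]) / hi <= rel_tol
```

Making `rel_tol` a configurable member of the executor (as discussed in review) would let callers trade plan reuse against plan quality per workload.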