
Keep Galley Plans Per Approximate Sparsity Pattern #679

Merged · 28 commits · Jan 6, 2025

Conversation

@kylebd99 (Collaborator) commented Dec 19, 2024

This PR changes Galley's caching to keep a set of plans per program when called with tag=:global, associating each plan with the statistics of the inputs it was compiled for. When the same program is invoked again, the cache is searched for a plan compiled for inputs with "similar" stats; if one is found, it is returned immediately. Otherwise, a new plan is compiled.

fixes #664
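The caching strategy described above can be sketched roughly as follows. This is a hedged illustration, not Galley's actual code: the names `stats_similar` and `lookup_or_compile` are invented, the stats are simplified to tuples of density estimates, and the log-scale similarity test is an assumption about what "similar" might mean.

```python
import math

def stats_similar(old, new, threshold=2.0):
    """Hypothetical similarity test: density estimates count as 'similar'
    when each pair is within `threshold`x of each other on a log scale."""
    return all(abs(math.log10(a / b)) <= math.log10(threshold)
               for a, b in zip(old, new))

def lookup_or_compile(cache, program, input_stats, compile_fn):
    """Reuse a cached plan whose recorded stats resemble the new inputs;
    otherwise compile a fresh plan and remember it alongside its stats."""
    entries = cache.setdefault(program, [])  # program -> [(stats, plan), ...]
    for stats, plan in entries:
        if stats_similar(stats, input_stats):
            return plan  # similar inputs seen before: skip optimization
    plan = compile_fn(program, input_stats)  # new sparsity regime: optimize
    entries.append((input_stats, plan))
    return plan
```

The key design point is that the cache stores a *list* of plans per program, so the same einsum can carry different plans for different sparsity regimes.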

@kylebd99 kylebd99 linked an issue Dec 19, 2024 that may be closed by this pull request

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 32 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Galley/TensorStats/tensor-stats.jl | 63.76% | 25 Missing ⚠️ |
| src/Galley/FinchCompat/executor.jl | 89.18% | 4 Missing ⚠️ |
| src/Galley/utility-funcs.jl | 0.00% | 2 Missing ⚠️ |
| src/Galley/TensorStats/propagate-stats.jl | 50.00% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/Finch.jl | 89.79% <ø> (ø) |
| src/FinchLogic/nodes.jl | 71.42% <100.00%> (ø) |
| src/Galley/Galley.jl | 98.46% <ø> (+24.61%) ⬆️ |
| src/scheduler/LogicExecutor.jl | 90.24% <ø> (ø) |
| src/Galley/TensorStats/propagate-stats.jl | 80.48% <50.00%> (-0.38%) ⬇️ |
| src/Galley/utility-funcs.jl | 40.39% <0.00%> (-0.55%) ⬇️ |
| src/Galley/FinchCompat/executor.jl | 92.42% <89.18%> (+5.32%) ⬆️ |
| src/Galley/TensorStats/tensor-stats.jl | 70.42% <63.76%> (+0.09%) ⬆️ |

... and 2 files with indirect coverage changes

@kylebd99 (Collaborator, Author)

Small experiment showing the benefits of adapting to new inputs:

using Finch
using Plots
using BenchmarkTools

Finch.Galley.galley_codes = Dict()
galley_tag_times = []
galley_global_times = []
for i in 1:5
    A_density = 10.0^(-i)
    B_density = .1
    C_density = 10.0^(i-5)
    n = 1000
    A = lazy(Tensor(Dense(SparseList(Element(0.0))), fsprand(n,n, A_density)))
    B = lazy(Tensor(Dense(SparseList(Element(0.0))), fsprand(n,n, B_density)))
    C = lazy(Tensor(Dense(SparseList(Element(0.0))), fsprand(n,n, C_density)))

    println("TAG VERBOSE")
    compute(A*B*C; ctx=galley_scheduler(), tag=1, verbose=true)
    println("GLOBAL VERBOSE")
    compute(A*B*C; ctx=galley_scheduler(), verbose=true)
    galley_tag_time = @belapsed compute($A*$B*$C; ctx=galley_scheduler(), tag=1,)
    galley_global_time = @belapsed compute($A*$B*$C; ctx=galley_scheduler())
    println("A Density: $A_density B Density: $B_density C Density: $C_density")
    println("Tag Time: $galley_tag_time")
    println("Global Time: $galley_global_time")
    push!(galley_tag_times, galley_tag_time)
    push!(galley_global_times, galley_global_time)
end

plot(1:5, galley_tag_times ./ galley_global_times, title="ABC varying density of A and C", xlabel="d_A=10^(-i), d_B=.1, d_C=10^(i-5)", ylabel="Speedup", label="Speedup", lw=3)

[Screenshot from 2024-12-19: resulting speedup plot]

@willow-ahrens (Collaborator) left a comment:

I have a few requests for this PR on the design of the GalleyExecutor.

Regarding the fix to the deferred node, I'd like to design a benchmark that shows the pathological behavior of #664 on the main branch before we declare it fixed here. I'm not quite sure when I noticed the problem there, but I don't see the 5 ms overhead on main anymore. I've updated the benchmark suite so that you can run `julia runjudge.jl -i high-level` to compare high-level benchmarks on main against the target branch. Do you think we could update the `SUITE["high-level"]["einsum_spmv_call_overhead"][scheduler_name]` benchmark to get the main branch to exhibit the regression? Also, could we get `SUITE["high-level"]["einsum_spmv_call_adaptive"][scheduler_name]` to show the overhead associated with looking up among a large number of different kernels, so that we know whether we need a better search algorithm?

was compiled for similar inputs and only compiles if it doesn't find one. If the `tag` argument is anything else,
it will only compile once for that tag and will skip this search process.
"""
@kwdef struct GalleyExecutor
@willow-ahrens (Collaborator) commented:

I'd like the GalleyExecutor to have a configurable statistics similarity threshold, and to store the caches for each threshold and statistics type separately.

Also, I'd like to choose different executors for different compilation strategies (rather than using a sentinel tag value). Perhaps we can use the current executor for the "use first input strategy", and the GalleyExecutor for the "similar inputs" strategy. Then the GalleyExecutor wouldn't need a tag.

@kylebd99 (Collaborator, Author) replied:

1. It already caches per statistics type, and I can do the same for the threshold by making it a member of the struct. Will do that.

2. I like this idea; it seems clearer. In that case, we need better names. Maybe we call this one "AdaptiveExecutor" and the other one "TagExecutor"? And only schedulers that rely on statistics would be valid choices for the AdaptiveExecutor.

I'll make these changes.
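The proposed split could look roughly like this. This is a hedged sketch: the names TagExecutor and AdaptiveExecutor follow the proposal above, but the fields, the `DCStats` default, and the log-scale similarity test are illustrative assumptions, not Finch's implementation.

```python
import math

class TagExecutor:
    """Compile once per tag; later calls with the same tag reuse that plan."""
    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.codes = {}  # tag -> compiled plan

    def __call__(self, tag, program, input_stats):
        if tag not in self.codes:
            self.codes[tag] = self.compile_fn(program, input_stats)
        return self.codes[tag]

class AdaptiveExecutor:
    """Search prior plans for one whose input stats fall within `threshold`;
    caches are kept separately per (stats type, threshold) pair, so executors
    with different settings don't interfere."""
    _caches = {}  # (stats_type, threshold) -> {program: [(stats, plan), ...]}

    def __init__(self, compile_fn, stats_type="DCStats", threshold=2.0):
        self.compile_fn = compile_fn
        self.threshold = threshold
        self.cache = AdaptiveExecutor._caches.setdefault(
            (stats_type, threshold), {})

    def _similar(self, a, b):
        # Densities count as similar when within `threshold`x on a log scale.
        return all(abs(math.log10(x / y)) <= math.log10(self.threshold)
                   for x, y in zip(a, b))

    def __call__(self, program, input_stats):
        entries = self.cache.setdefault(program, [])
        for stats, plan in entries:
            if self._similar(stats, input_stats):
                return plan
        plan = self.compile_fn(program, input_stats)
        entries.append((input_stats, plan))
        return plan
```

With two executor types, the adaptive strategy no longer needs a sentinel tag value: the choice of executor is the choice of strategy.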

@kylebd99 (Collaborator, Author)

kylebd99 commented Jan 3, 2025

I made all the changes, I think. Interestingly, there's a difference in the `compute()` and `@einsum` call overhead for the AdaptiveExecutor. I can't quite figure out why, so I'm leaving both overhead benchmarks in for the moment. I also tried to add a benchmark that shows the overhead of searching through the stats list, but the search is fast enough that the effect isn't very noticeable.

Small note: using `with_scheduler` in the benchmark scripts doesn't seem to set the scheduler properly, so I added it to the setup phases instead.

@willow-ahrens (Collaborator)

willow-ahrens commented Jan 3, 2025

Awesome! If you run `runjudge.jl -i high-level` on the target branch, do the benchmarks show this PR improving #664?

@kylebd99 (Collaborator, Author)

kylebd99 commented Jan 3, 2025

When I run `runjudge.jl`, I'm also not seeing any slow call overhead on main. So maybe it was a fluke when you originally found it? At the very least, I think the changes to `deferred` are sensible and ought to prevent future issues.

@kylebd99 kylebd99 requested a review from willow-ahrens January 3, 2025 20:57
@willow-ahrens (Collaborator)

I think my point is that these benchmarks don't appear to stress the issue of hashing the actual inputs to the kernel. I wanted to see an instance where we could cause the Finch `LogicExecutor` to recompile when given different inputs with the same program. Was this never an issue? I know the deferred nodes had inputs in them, so it's strange to see.

@kylebd99 (Collaborator, Author)

kylebd99 commented Jan 3, 2025

Ah, wait, it makes sense. The `LogicExecutor` doesn't hash any `deferred` nodes when looking up in the codes dict: `get_structure` doesn't include them, and they are only introduced when the code is actually scheduled and compiled in `logic_executor_code`.
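That lookup behavior can be illustrated with a toy structure-hashing function. This is a sketch under stated assumptions: `get_structure` here is a stand-in for the real function, and the dict-based node representation is invented for illustration. The point is only that the cache key depends on operators and tensor formats, never on the concrete entries carried by deferred nodes.

```python
def get_structure(node):
    """Return a hashable key that keeps the operator tree and tensor formats
    but drops the concrete values held inside `deferred` nodes."""
    kind = node["kind"]
    if kind == "deferred":
        # Only the tensor's format matters for codegen, not its entries.
        return ("deferred", node["format"])
    return (kind,) + tuple(get_structure(c) for c in node.get("children", ()))
```

Two invocations of the same program on different arrays therefore produce the same key, so the executor finds the previously compiled code instead of recompiling.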

@willow-ahrens (Collaborator) left a comment:

The code here looks great! I am concerned about a performance regression I'm noticing. It looks like this PR increases spmv call overhead by a factor of 378. Here's my benchmark command:

julia runjudge.jl -i high-level -e einsum_spmv_compile_overhead

Here's the results:

  Benchmark Report for /Users/willow/Projects/Finch.jl
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Job Properties
  ==============

    •  Time of benchmarks:
       • Target: 6 Jan 2025 - 10:40
       • Baseline: 6 Jan 2025 - 10:43

    •  Package commits:
       • Target: f52839
       • Baseline: 061cda

    •  Julia commits:
       • Target: 5e9a32
       • Baseline: 5e9a32

    •  Julia command flags:
       • Target: None
       • Baseline: None

    •  Environment variables:
       • Target: FINCH_BENCHMARK_ARGS => -i high-level -e
       einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
       • Baseline: FINCH_BENCHMARK_ARGS => -i high-level -e
       einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1

  Results
  =======

  A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
  than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
  that indicate possible regressions or improvements - are shown below (thus, an empty table
  means that all benchmark results remained invariant between builds).

                                                                 ID     time ratio  memory ratio
  ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– –––––––––––––– –––––––––––––
  ["high-level", "compute_spmv_call_overhead", "default_scheduler"]   0.92 (5%) ✅     1.00 (1%)
   ["high-level", "compute_spmv_call_overhead", "galley_scheduler"]   8.46 (5%) ❌  5.41 (1%) ❌
   ["high-level", "einsum_spmv_call_overhead", "default_scheduler"] 378.39 (5%) ❌ 27.91 (1%) ❌
    ["high-level", "einsum_spmv_call_overhead", "galley_scheduler"]  10.26 (5%) ❌  8.48 (1%) ❌
  ["high-level", "matchain_adaptive_overhead", "default_scheduler"]   0.95 (5%) ✅     1.00 (1%)
   ["high-level", "matchain_adaptive_overhead", "galley_scheduler"]  14.93 (5%) ❌  8.41 (1%) ❌
                 ["high-level", "sddmm_fused", "default_scheduler"]      0.99 (5%)  0.80 (1%) ✅
                  ["high-level", "sddmm_fused", "galley_scheduler"]   1.23 (5%) ❌  1.02 (1%) ❌
                ["high-level", "sddmm_unfused", "galley_scheduler"]      1.01 (5%)  1.01 (1%) ❌

  Benchmark Group List
  ====================

  Here's a list of all the benchmark groups executed by this job:

    •  ["high-level", "compute_spmv_call_overhead"]

    •  ["high-level"]

    •  ["high-level", "einsum_spmv_call_overhead"]

    •  ["high-level", "matchain_adaptive_overhead"]

    •  ["high-level", "permutedims(Dense(Dense()))"]

    •  ["high-level", "permutedims(Dense(Sparse()))"]

    •  ["high-level", "sddmm_fused"]

    •  ["high-level", "sddmm_unfused"]

  Julia versioninfo
  =================

  Target
  ––––––

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: macOS (arm64-apple-darwin24.0.0)
    uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
    CPU: Apple M2 Max: 
                   speed         user         nice          sys         idle          irq
         #1-12  2400 MHz    6440533 s          0 s    2738620 s   73397003 s          0 s
    Memory: 32.0 GB (497.859375 MB free)
    Uptime: 3.776767e6 sec
    Load Avg:  1.88916015625  1.72412109375  2.83203125
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
  Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

  Baseline
  ––––––––

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: macOS (arm64-apple-darwin24.0.0)
    uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
    CPU: Apple M2 Max: 
                   speed         user         nice          sys         idle          irq
         #1-12  2400 MHz    6442474 s          0 s    2738975 s   73412682 s          0 s
    Memory: 32.0 GB (707.78125 MB free)
    Uptime: 3.776917e6 sec
    Load Avg:  2.01904296875  1.88134765625  2.71826171875
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
  Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)



▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
Target result

  Benchmark Report for /Users/willow/Projects/Finch.jl
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Job Properties
  ==============

    •  Time of benchmark: 6 Jan 2025 - 10:40

    •  Package commit: f52839

    •  Julia commit: 5e9a32

    •  Julia command flags: None

    •  Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e
       einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1

  Results
  =======

  Below is a table of this job's results, obtained by running the benchmarks. The values listed
  in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
  index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages
  accompanying time and memory values in the below table are noise tolerances. The "true"
  time/memory value for a given benchmark is expected to fall within this percentage of the
  reported value. An empty cell means that the value was zero.

                                                                   ID            time  GC time          memory allocations
  ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– –––––––– ––––––––––––––– –––––––––––
    ["high-level", "compute_spmv_call_overhead", "default_scheduler"]  17.625 μs (5%)           16.28 KiB (1%)         397
     ["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 157.125 μs (5%)           79.52 KiB (1%)        2335
                              ["high-level", "einsum_spmv_baremetal"]   5.208 ns (5%)                                     
     ["high-level", "einsum_spmv_call_overhead", "default_scheduler"]   7.899 ms (5%)          485.77 KiB (1%)        9744
      ["high-level", "einsum_spmv_call_overhead", "galley_scheduler"]  56.253 ms (5%)            4.20 MiB (1%)       86087
    ["high-level", "matchain_adaptive_overhead", "default_scheduler"]  31.791 μs (5%)           30.81 KiB (1%)         627
     ["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 634.709 μs (5%)          245.55 KiB (1%)        9680
   ["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 264.683 ms (5%) 8.925 ms 762.94 MiB (1%)         126
    ["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 267.400 ms (5%) 9.613 ms 762.94 MiB (1%)         126
  ["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"]  83.605 ms (5%) 2.015 ms 170.31 MiB (1%)         337
   ["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"]  80.687 ms (5%)          170.37 MiB (1%)         337
                   ["high-level", "sddmm_fused", "default_scheduler"]   6.808 ms (5%)          266.86 KiB (1%)         708
                    ["high-level", "sddmm_fused", "galley_scheduler"]   1.691 ms (5%)            7.84 MiB (1%)        4693
                 ["high-level", "sddmm_unfused", "default_scheduler"] 146.093 ms (5%)            7.89 MiB (1%)         853
                  ["high-level", "sddmm_unfused", "galley_scheduler"] 184.917 ms (5%)           15.50 MiB (1%)        5709

  Benchmark Group List
  ====================

  Here's a list of all the benchmark groups executed by this job:

    •  ["high-level", "compute_spmv_call_overhead"]

    •  ["high-level"]

    •  ["high-level", "einsum_spmv_call_overhead"]

    •  ["high-level", "matchain_adaptive_overhead"]

    •  ["high-level", "permutedims(Dense(Dense()))"]

    •  ["high-level", "permutedims(Dense(Sparse()))"]

    •  ["high-level", "sddmm_fused"]

    •  ["high-level", "sddmm_unfused"]

  Julia versioninfo
  =================

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: macOS (arm64-apple-darwin24.0.0)
    uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
    CPU: Apple M2 Max: 
                   speed         user         nice          sys         idle          irq
         #1-12  2400 MHz    6440533 s          0 s    2738620 s   73397003 s          0 s
    Memory: 32.0 GB (497.859375 MB free)
    Uptime: 3.776767e6 sec
    Load Avg:  1.88916015625  1.72412109375  2.83203125
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
  Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)



▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
Baseline result

  Benchmark Report for /Users/willow/Projects/Finch.jl
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Job Properties
  ==============

    •  Time of benchmark: 6 Jan 2025 - 10:43

    •  Package commit: 061cda

    •  Julia commit: 5e9a32

    •  Julia command flags: None

    •  Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e
       einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1

  Results
  =======

  Below is a table of this job's results, obtained by running the benchmarks. The values listed
  in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
  index into the BaseBenchmarks suite to retrieve the corresponding benchmarks. The percentages
  accompanying time and memory values in the below table are noise tolerances. The "true"
  time/memory value for a given benchmark is expected to fall within this percentage of the
  reported value. An empty cell means that the value was zero.

                                                                   ID            time   GC time          memory allocations
  ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– ––––––––– ––––––––––––––– –––––––––––
    ["high-level", "compute_spmv_call_overhead", "default_scheduler"]  19.125 μs (5%)            16.28 KiB (1%)         397
     ["high-level", "compute_spmv_call_overhead", "galley_scheduler"]  18.583 μs (5%)            14.70 KiB (1%)         375
                              ["high-level", "einsum_spmv_baremetal"]   5.209 ns (5%)                                      
     ["high-level", "einsum_spmv_call_overhead", "default_scheduler"]  20.875 μs (5%)            17.41 KiB (1%)         423
      ["high-level", "einsum_spmv_call_overhead", "galley_scheduler"]   5.484 ms (5%)           507.25 KiB (1%)       10180
    ["high-level", "matchain_adaptive_overhead", "default_scheduler"]  33.500 μs (5%)            30.81 KiB (1%)         627
     ["high-level", "matchain_adaptive_overhead", "galley_scheduler"]  42.500 μs (5%)            29.20 KiB (1%)         615
   ["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 276.038 ms (5%) 10.369 ms 762.94 MiB (1%)         126
    ["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 274.323 ms (5%)  2.478 ms 762.94 MiB (1%)         126
  ["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"]  80.988 ms (5%)           170.32 MiB (1%)         337
   ["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"]  81.542 ms (5%)  1.023 ms 170.36 MiB (1%)         337
                   ["high-level", "sddmm_fused", "default_scheduler"]   6.877 ms (5%)           332.42 KiB (1%)         712
                    ["high-level", "sddmm_fused", "galley_scheduler"]   1.369 ms (5%)             7.69 MiB (1%)         669
                 ["high-level", "sddmm_unfused", "default_scheduler"] 145.995 ms (5%)             7.96 MiB (1%)         857
                  ["high-level", "sddmm_unfused", "galley_scheduler"] 183.967 ms (5%)            15.33 MiB (1%)         814

  Benchmark Group List
  ====================

  Here's a list of all the benchmark groups executed by this job:

    •  ["high-level", "compute_spmv_call_overhead"]

    •  ["high-level"]

    •  ["high-level", "einsum_spmv_call_overhead"]

    •  ["high-level", "matchain_adaptive_overhead"]

    •  ["high-level", "permutedims(Dense(Dense()))"]

    •  ["high-level", "permutedims(Dense(Sparse()))"]

    •  ["high-level", "sddmm_fused"]

    •  ["high-level", "sddmm_unfused"]

  Julia versioninfo
  =================

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: macOS (arm64-apple-darwin24.0.0)
    uname: Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64 arm
    CPU: Apple M2 Max: 
                   speed         user         nice          sys         idle          irq
         #1-12  2400 MHz    6442474 s          0 s    2738975 s   73412682 s          0 s
    Memory: 32.0 GB (707.78125 MB free)
    Uptime: 3.776917e6 sec
    Load Avg:  2.01904296875  1.88134765625  2.71826171875
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
  Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

(Review thread on src/Galley/FinchCompat/executor.jl — outdated, resolved.)
@kylebd99 (Collaborator, Author)

kylebd99 commented Jan 6, 2025

I just ran `julia runjudge.jl -i high-level -e einsum_spmv_compile_overhead` and I only saw major differences on the galley_scheduler, which makes sense because that one was changed pretty significantly. Might be worth a quick meeting to troubleshoot?

Benchmark Report for /home/kylebd99/Research/Finch.jl
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Job Properties
  ==============

    •  Time of benchmarks:
       • Target: 6 Jan 2025 - 09:19
       • Baseline: 6 Jan 2025 - 09:23

    •  Package commits:
       • Target: 910b03
       • Baseline: 061cda

    •  Julia commits:
       • Target: 5e9a32
       • Baseline: 5e9a32

    •  Julia command flags:
       • Target: None
       • Baseline: None

    •  Environment variables:
       • Target: FINCH_BENCHMARK_ARGS => -i high-level -e einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1
       • Baseline: FINCH_BENCHMARK_ARGS => -i high-level -e einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1

  Results
  =======

  A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results that indicate possible regressions or improvements - are shown below
  (thus, an empty table means that all benchmark results remained invariant between builds).

                                                                 ID    time ratio  memory ratio
  ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––– –––––––––––––
   ["high-level", "compute_spmv_call_overhead", "galley_scheduler"]  8.68 (5%) ❌  5.34 (1%) ❌
                            ["high-level", "einsum_spmv_baremetal"]  1.06 (5%) ❌     1.00 (1%)
    ["high-level", "einsum_spmv_call_overhead", "galley_scheduler"]  9.36 (5%) ❌  8.32 (1%) ❌
  ["high-level", "matchain_adaptive_overhead", "default_scheduler"]  0.72 (5%) ✅     1.00 (1%)
   ["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 18.11 (5%) ❌ 10.16 (1%) ❌
                 ["high-level", "sddmm_fused", "default_scheduler"]  0.95 (5%) ✅  0.80 (1%) ✅
                  ["high-level", "sddmm_fused", "galley_scheduler"]  1.15 (5%) ❌  1.02 (1%) ❌
                ["high-level", "sddmm_unfused", "galley_scheduler"]     1.00 (5%)  1.01 (1%) ❌

  Benchmark Group List
  ====================

  Here's a list of all the benchmark groups executed by this job:

    •  ["high-level", "compute_spmv_call_overhead"]

    •  ["high-level"]

    •  ["high-level", "einsum_spmv_call_overhead"]

    •  ["high-level", "matchain_adaptive_overhead"]

    •  ["high-level", "permutedims(Dense(Dense()))"]

    •  ["high-level", "permutedims(Dense(Sparse()))"]

    •  ["high-level", "sddmm_fused"]

    •  ["high-level", "sddmm_unfused"]

  Julia versioninfo
  =================

  Target
  ––––––

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: Linux (x86_64-linux-gnu)
        Pop!_OS 22.04 LTS
    uname: Linux 6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1 SMP PREEMPT_DYNAMIC Wed N x86_64 x86_64
    CPU: Intel(R) Core(TM) i5-14600KF: 
                   speed         user         nice          sys         idle          irq
         #1-20  5300 MHz     850367 s      89072 s     228622 s   60427612 s          0 s
    Memory: 31.17447280883789 GB (9159.96875 MB free)
    Uptime: 309030.47 sec
    Load Avg:  1.98  1.84  1.51
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
  Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)

  Baseline
  ––––––––

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: Linux (x86_64-linux-gnu)
        Pop!_OS 22.04 LTS
    uname: Linux 6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1 SMP PREEMPT_DYNAMIC Wed N x86_64 x86_64
    CPU: Intel(R) Core(TM) i5-14600KF: 
                   speed         user         nice          sys         idle          irq
         #1-20  5300 MHz     855499 s      89076 s     229042 s   60470286 s          0 s
    Memory: 31.17447280883789 GB (9142.09375 MB free)
    Uptime: 309272.38 sec
    Load Avg:  1.9  2.03  1.67
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
  Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)



▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
Target result

  Benchmark Report for /home/kylebd99/Research/Finch.jl
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Job Properties
  ==============

    •  Time of benchmark: 6 Jan 2025 - 9:19

    •  Package commit: 910b03

    •  Julia commit: 5e9a32

    •  Julia command flags: None

    •  Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1

  Results
  =======

  Below is a table of this job's results, obtained by running the benchmarks. The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to index into the BaseBenchmarks suite to retrieve the corresponding
  benchmarks. The percentages accompanying time and memory values in the below table are noise tolerances. The "true" time/memory value for a given benchmark is expected to fall within this percentage of the reported value. An empty cell means that the
  value was zero.

                                                                   ID            time    GC time          memory allocations
  ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– –––––––––– ––––––––––––––– –––––––––––
    ["high-level", "compute_spmv_call_overhead", "default_scheduler"]  18.278 μs (5%)             16.50 KiB (1%)         401
     ["high-level", "compute_spmv_call_overhead", "galley_scheduler"] 146.628 μs (5%)             79.61 KiB (1%)        2338
                              ["high-level", "einsum_spmv_baremetal"]   3.789 ns (5%)                                       
     ["high-level", "einsum_spmv_call_overhead", "default_scheduler"]  18.655 μs (5%)             17.62 KiB (1%)         427
      ["high-level", "einsum_spmv_call_overhead", "galley_scheduler"]  74.575 ms (5%)              4.14 MiB (1%)       84804
    ["high-level", "matchain_adaptive_overhead", "default_scheduler"]  40.242 μs (5%)             31.14 KiB (1%)         633
     ["high-level", "matchain_adaptive_overhead", "galley_scheduler"] 691.978 μs (5%)            250.59 KiB (1%)        9715
   ["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 500.124 ms (5%)            762.94 MiB (1%)         126
    ["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 484.068 ms (5%)   4.799 ms 762.94 MiB (1%)         126
  ["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"]  78.612 ms (5%) 719.318 μs 170.28 MiB (1%)         337
   ["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"]  78.640 ms (5%)            170.34 MiB (1%)         337
                   ["high-level", "sddmm_fused", "default_scheduler"]   6.617 ms (5%)            265.56 KiB (1%)         714
                    ["high-level", "sddmm_fused", "galley_scheduler"]   2.530 ms (5%)              7.84 MiB (1%)        4696
                 ["high-level", "sddmm_unfused", "default_scheduler"] 109.267 ms (5%)              7.89 MiB (1%)         861
                  ["high-level", "sddmm_unfused", "galley_scheduler"] 158.425 ms (5%)             15.50 MiB (1%)        5715

  Benchmark Group List
  ====================

  Here's a list of all the benchmark groups executed by this job:

    •  ["high-level", "compute_spmv_call_overhead"]

    •  ["high-level"]

    •  ["high-level", "einsum_spmv_call_overhead"]

    •  ["high-level", "matchain_adaptive_overhead"]

    •  ["high-level", "permutedims(Dense(Dense()))"]

    •  ["high-level", "permutedims(Dense(Sparse()))"]

    •  ["high-level", "sddmm_fused"]

    •  ["high-level", "sddmm_unfused"]

  Julia versioninfo
  =================

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: Linux (x86_64-linux-gnu)
        Pop!_OS 22.04 LTS
    uname: Linux 6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1 SMP PREEMPT_DYNAMIC Wed N x86_64 x86_64
    CPU: Intel(R) Core(TM) i5-14600KF: 
                   speed         user         nice          sys         idle          irq
         #1-20  5300 MHz     850367 s      89072 s     228622 s   60427612 s          0 s
    Memory: 31.17447280883789 GB (9159.96875 MB free)
    Uptime: 309030.47 sec
    Load Avg:  1.98  1.84  1.51
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
  Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)



▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
Baseline result

  Benchmark Report for /home/kylebd99/Research/Finch.jl
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  Job Properties
  ==============

    •  Time of benchmark: 6 Jan 2025 - 9:23

    •  Package commit: 061cda

    •  Julia commit: 5e9a32

    •  Julia command flags: None

    •  Environment variables: FINCH_BENCHMARK_ARGS => -i high-level -e einsum_spmv_compile_overhead JULIA_NUM_THREADS => 1

  Results
  =======

  Below is a table of this job's results, obtained by running the benchmarks. The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to index into the BaseBenchmarks suite to retrieve the corresponding
  benchmarks. The percentages accompanying time and memory values in the below table are noise tolerances. The "true" time/memory value for a given benchmark is expected to fall within this percentage of the reported value. An empty cell means that the
  value was zero.

                                                                   ID            time  GC time          memory allocations
  ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– ––––––––––––––– –––––––– ––––––––––––––– –––––––––––
    ["high-level", "compute_spmv_call_overhead", "default_scheduler"]  18.188 μs (5%)           16.50 KiB (1%)         401
     ["high-level", "compute_spmv_call_overhead", "galley_scheduler"]  16.894 μs (5%)           14.92 KiB (1%)         379
                              ["high-level", "einsum_spmv_baremetal"]   3.591 ns (5%)                                     
     ["high-level", "einsum_spmv_call_overhead", "default_scheduler"]  18.736 μs (5%)           17.62 KiB (1%)         427
      ["high-level", "einsum_spmv_call_overhead", "galley_scheduler"]   7.968 ms (5%)          509.31 KiB (1%)       10187
    ["high-level", "matchain_adaptive_overhead", "default_scheduler"]  56.003 μs (5%)           31.14 KiB (1%)         633
     ["high-level", "matchain_adaptive_overhead", "galley_scheduler"]  38.203 μs (5%)           24.66 KiB (1%)         590
   ["high-level", "permutedims(Dense(Dense()))", "default_scheduler"] 488.125 ms (5%)          762.94 MiB (1%)         126
    ["high-level", "permutedims(Dense(Dense()))", "galley_scheduler"] 487.539 ms (5%) 4.617 ms 762.94 MiB (1%)         126
  ["high-level", "permutedims(Dense(Sparse()))", "default_scheduler"]  79.067 ms (5%)          170.31 MiB (1%)         337
   ["high-level", "permutedims(Dense(Sparse()))", "galley_scheduler"]  80.254 ms (5%)          170.35 MiB (1%)         337
                   ["high-level", "sddmm_fused", "default_scheduler"]   6.993 ms (5%)          333.75 KiB (1%)         718
                    ["high-level", "sddmm_fused", "galley_scheduler"]   2.205 ms (5%)            7.69 MiB (1%)         675
                 ["high-level", "sddmm_unfused", "default_scheduler"] 114.306 ms (5%)            7.96 MiB (1%)         865
                  ["high-level", "sddmm_unfused", "galley_scheduler"] 158.795 ms (5%)           15.33 MiB (1%)         823

  Benchmark Group List
  ====================

  Here's a list of all the benchmark groups executed by this job:

    •  ["high-level", "compute_spmv_call_overhead"]

    •  ["high-level"]

    •  ["high-level", "einsum_spmv_call_overhead"]

    •  ["high-level", "matchain_adaptive_overhead"]

    •  ["high-level", "permutedims(Dense(Dense()))"]

    •  ["high-level", "permutedims(Dense(Sparse()))"]

    •  ["high-level", "sddmm_fused"]

    •  ["high-level", "sddmm_unfused"]

  Julia versioninfo
  =================

  Julia Version 1.11.2
  Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: Linux (x86_64-linux-gnu)
        Pop!_OS 22.04 LTS
    uname: Linux 6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1 SMP PREEMPT_DYNAMIC Wed N x86_64 x86_64
    CPU: Intel(R) Core(TM) i5-14600KF: 
                   speed         user         nice          sys         idle          irq
         #1-20  5300 MHz     855499 s      89076 s     229042 s   60470286 s          0 s
    Memory: 31.17447280883789 GB (9142.09375 MB free)
    Uptime: 309272.38 sec
    Load Avg:  1.9  2.03  1.67
    WORD_SIZE: 64
    LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
  Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)

@willow-ahrens (Collaborator)

I can confirm your results. I think either my tune.json was stale or my branch head was configured incorrectly (sometimes `gh pr checkout ...` works oddly with runjudge.jl).

@willow-ahrens (Collaborator)

The overhead is much higher here, but we expect it to scale with the inputs.
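
The caching scheme this PR adds can be pictured as a per-program plan cache keyed by a coarse summary of the input statistics, so that inputs with "similar" sparsity reuse an existing plan. The sketch below is a hypothetical illustration of that idea, not the actual Galley internals; `approx_key`, `Plan`, and `get_plan` are all invented names:

```julia
# Quantize each input's nnz count to a power-of-two bucket, so inputs with
# similar sparsity map to the same cache key.
approx_key(nnzs::Vector{Int}) = Tuple(ceil(Int, log2(max(n, 1))) for n in nnzs)

struct Plan                 # stand-in for a compiled Galley plan
    order::Vector{Symbol}
end

const plan_cache = Dict{Any,Plan}()

function get_plan(program_id, nnzs::Vector{Int})
    key = (program_id, approx_key(nnzs))
    get!(plan_cache, key) do
        # Cache miss: optimize a fresh plan for these statistics.
        Plan([:i, :j])
    end
end

p1 = get_plan(:spmv, [1_000, 950])
p2 = get_plan(:spmv, [900, 1_020])  # same buckets, so the cached plan is reused
```

Under this keying, only the first call per (program, bucket) pair pays the optimization cost; subsequent calls with nearby statistics hit the cache, which is why the steady-state call overhead stays small.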

@kylebd99 (Collaborator, Author) commented Jan 6, 2025

Yeah, the one thing I can't figure out is why the einsum interface and the compute interface have such different overheads when using the AdaptiveExecutor. This also shows up somewhat intermittently for me.

@willow-ahrens (Collaborator)

Which two benchmarks are you comparing?

@kylebd99 (Collaborator, Author) commented Jan 6, 2025

`compute_spmv_call_overhead` and `einsum_spmv_call_overhead`
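
For reference, those two benchmarks exercise the same SpMV through Finch's two high-level entry points. A rough sketch of how one might reproduce the comparison is below; the exact tensor constructors and the assumption that both paths route through the same scheduler are mine, not taken from the benchmark suite:

```julia
using Finch, BenchmarkTools

# A sparse 1000x1000 matrix and a dense vector.
A = Tensor(Dense(SparseList(Element(0.0))), fsprand(1_000, 1_000, 0.01))
x = Tensor(Dense(Element(0.0)), rand(1_000))

# The compute-based path builds a lazy expression, then executes it.
spmv_compute(A, x) = compute(lazy(A) * lazy(x))

# The einsum-based path goes through the @einsum macro.
spmv_einsum(A, x) = @einsum y[i] += A[i, j] * x[j]

spmv_compute(A, x); spmv_einsum(A, x)  # warm up so compilation is excluded
@btime spmv_compute($A, $x)
@btime spmv_einsum($A, $x)
```

If the two paths hand equivalent logic programs to the executor, their steady-state overheads should match, so a large gap like the one in the table above points at extra per-call work on the einsum side.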

@willow-ahrens willow-ahrens disabled auto-merge January 6, 2025 21:23
@willow-ahrens (Collaborator)

That overhead is quite large. I'm happy to merge this PR now and open an issue, or wait to merge until we figure it out. I'll leave it up to you how to proceed. Thanks for your help with this PR!

@kylebd99 (Collaborator, Author) commented Jan 6, 2025

I'm happy to leave it for an issue, especially since it doesn't hit the compute interface.

@kylebd99 kylebd99 merged commit 2bd8d4c into main Jan 6, 2025
8 checks passed
@kylebd99 kylebd99 deleted the kbd-make-galley-adaptive-to-inputs branch January 6, 2025 23:35
Successfully merging this pull request may close these issues:

  •  Make Galley Adaptive To Inputs

  •  calling overhead of high-level interface performance regression