
Fix last cache, and make multithreaded benchmarks actually run multithreaded. #15

Merged
merged 15 commits into JuliaPerf:master from fixlastcache
Nov 2, 2021

Conversation

chriselrod
Contributor

The last_cachesize method from my previous PR accidentally returned the first cache size instead.
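For reference, the bug was presumably of this shape (a hypothetical sketch; the variable and function names are illustrative, not the package's actual internals):

```julia
# Hypothetical sketch: cache sizes ordered from L1 up to the last-level cache.
cachesizes = [32 * 1024, 512 * 1024, 16 * 1024 * 1024]  # L1, L2, L3 in bytes

last_cachesize_buggy(sizes) = first(sizes)  # bug: returns the L1 size
last_cachesize_fixed(sizes) = last(sizes)   # fix: returns the last-level size
```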

@codecov-commenter

codecov-commenter commented Nov 1, 2021

Codecov Report

Merging #15 (5c561fa) into master (2acd367) will increase coverage by 4.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
+ Coverage   90.54%   94.63%   +4.09%     
==========================================
  Files           3        3              
  Lines         148      149       +1     
==========================================
+ Hits          134      141       +7     
+ Misses         14        8       -6     
Impacted Files       Coverage Δ
src/benchmarks.jl     89.55% <100.00%> (ø)
src/original.jl       96.77% <100.00%> (+0.10%) ⬆️
src/kernels.jl       100.00% <0.00%> (+11.76%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2acd367...5c561fa. Read the comment docs.

@chriselrod chriselrod changed the title Fix last cache Fix last cache, and make multithreaded benchmarks actually run multithreaded. Nov 1, 2021
@carstenbauer
Member

Last fix seems to have broken CI.

@chriselrod
Contributor Author

The default vector length used in the tests, 10, was much too short.
Using larger vectors makes the tests pass locally, but it seems CI's variance is much higher, requiring even larger vectors.

@carstenbauer
Member

carstenbauer commented Nov 1, 2021

Tests pass locally for me as well (macOS). I guess we could

  • relax the tests (it's probably not a great idea anyway to check for a specific memory bandwidth range)
  • move CI to one of our cluster nodes to have a stable environment with less variance (see ThreadPinning.jl for an example)

@chriselrod
Contributor Author

Tests pass locally for me as well (macOS). I guess we could

  • relax the tests (it's probably not a great idea anyway to check for a specific memory bandwidth range)
  • move CI to one of our cluster nodes to have a stable environment with less variance (see ThreadPinning.jl for an example)

An upper bound could be reasonable, but it seems we're capable of hitting pretty extreme lower bounds in these benchmarks.
If it isn't too difficult for you to set up your own environment, that'd probably be a more reliable solution.

@carstenbauer
Member

If it isn't too difficult for you to set up your own environment, that'd probably be a more reliable solution.

Basic setup on master, see https://git.uni-paderborn.de/pc2-ci/julia/STREAMBenchmark-jl/-/jobs.

@carstenbauer
Member

https://github.com/JuliaPerf/STREAMBenchmark.jl/runs/4076840066?check_suite_focus=true#step:6:126

Benchmarks: Test Failed at /Users/runner/work/STREAMBenchmark.jl/STREAMBenchmark.jl/test/runtests.jl:31
  Expression: (memory_bandwidth()).median > (memory_bandwidth(write_allocate = false)).median
   Evaluated: 2477.1 > 2592.9

Maybe we should add a small tolerance here.
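One way to add such a tolerance (a sketch only; `tol` and its value are illustrative, not what the test suite necessarily adopted):

```julia
using Test, STREAMBenchmark

# Allow the write_allocate=false measurement to win by a small margin
# instead of requiring a strict inequality between the two medians.
tol = 0.10  # 10% tolerance; illustrative value
@test memory_bandwidth().median >
      (1 - tol) * memory_bandwidth(write_allocate = false).median
```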

@carstenbauer
Member

carstenbauer commented Nov 2, 2021

Hope you don't mind me taking the liberty to push here (merging the latest changes from master, etc.).

I've introduced a STREAM_VECTOR_LENGTH environment variable to let CI systems set the vector length manually.
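I haven't reproduced the exact implementation here, but consuming such a variable could look roughly like this (function names are illustrative):

```julia
# Sketch: prefer an explicit STREAM_VECTOR_LENGTH from the environment,
# falling back to the package's heuristic default otherwise.
function vector_length()
    haskey(ENV, "STREAM_VECTOR_LENGTH") ?
        parse(Int, ENV["STREAM_VECTOR_LENGTH"]) :
        STREAMBenchmark.default_vector_length()
end
```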

@carstenbauer
Member

carstenbauer commented Nov 2, 2021

On the PC² CI runner I now get

Precompiling project...
  1 dependency successfully precompiled in 6 seconds (47 already precompiled)
     Testing Running tests...
STREAMBenchmark.default_vector_length() = 14417920
╔══╡ Single-threaded:
╟─ COPY:  17783.6 MB/s
╟─ SCALE: 17684.2 MB/s
╟─ ADD:   16718.5 MB/s
╟─ TRIAD: 16702.5 MB/s
╟─────────────────────
║ Median: 17201.3 MB/s
╚═════════════════════

╔══╡ Multi-threaded:
╠══╡ (20 threads)
╟─ COPY:  109680.7 MB/s
╟─ SCALE: 108428.4 MB/s
╟─ ADD:   105860.4 MB/s
╟─ TRIAD: 103778.1 MB/s
╟─────────────────────
║ Median: 107144.4 MB/s
╚═════════════════════

1: 3604480 => 17103.4
2: 7208960 => 17063.7
3: 10813440 => 17103.8
4: 14417920 => 17086.5
1: 3604480 => 260755.7
2: 14417920 => 91404.6
- Creating folder "stream"
- Downloading C STREAM benchmark
- Done.
- Trying to compile "stream.c" using gcc
  Using options: -O3 -DSTREAM_ARRAY_SIZE=14417920
- Done.
- Trying to compile "stream.c" using carsten
Test Summary:      | Pass  Total
STREAMBenchmark.jl |   32     32
     Testing STREAMBenchmark tests passed

with this PR. This looks good. The expected bandwidth for 20 threads is ~100 GB/s, see https://github.com/JuliaPerf/BandwidthBenchmark.jl/blob/main/benchmarks/noctua_pc2/bwbench/bwbench.out#L185

@carstenbauer
Member

I'll merge this right away as it fixes an important bug (last_cachesize).

@carstenbauer carstenbauer merged commit a3443e8 into JuliaPerf:master Nov 2, 2021
@chriselrod chriselrod deleted the fixlastcache branch November 2, 2021 12:57
@carstenbauer
Member

The default vector length used in the tests, 10, was much too short.
Using larger vectors makes the tests pass locally, but it seems CI's variance is much higher, requiring even larger vectors.

FYI, I noticed that the ultra-low memory bandwidth values also occurred on the PC² CI runners (on our cluster). Investigating this, I noticed that only the run with Pkg.test(; coverage=true) showed this strange result, which led me to realize that we're hitting JuliaLang/julia#36142 here.

@chriselrod
Contributor Author

coverage=true also seems to make single-threaded @turbo very slow.
When coverage=true, the Octavian test suite runs only a fraction of the tests compared to when it is false.
Yet the coverage=false tests take about 10-12 minutes on CI, while many of the coverage=true tests take 2+ hours.
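Until that upstream issue is resolved, a practical workaround when timing-sensitive tests misbehave is to run them without coverage instrumentation (note that coverage=false is already Pkg's default):

```julia
using Pkg

# Run the test suite without coverage instrumentation, avoiding the
# slowdown from JuliaLang/julia#36142.
Pkg.test("STREAMBenchmark"; coverage = false)
```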
