
Fix last cache, and make multithreaded benchmarks actually run multithreaded. #15

Merged
merged 15 commits into JuliaPerf:master from fixlastcache
Nov 2, 2021

Conversation

chriselrod
Contributor

The last_cachesize method from my previous PR accidentally returned the first cache size instead.
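For reference, the bug was presumably of this shape (a hypothetical sketch; the variable and function names are illustrative, not the package's actual internals):

```julia
# Hypothetical sketch: cache sizes ordered from L1 up to the last-level cache.
cachesizes = [32 * 1024, 512 * 1024, 16 * 1024 * 1024]  # L1, L2, L3 in bytes

last_cachesize_buggy(sizes) = first(sizes)  # bug: returns the L1 size
last_cachesize_fixed(sizes) = last(sizes)   # fix: returns the last-level size
```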

@codecov-commenter

codecov-commenter commented Nov 1, 2021

Codecov Report

Merging #15 (5c561fa) into master (2acd367) will increase coverage by 4.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
+ Coverage   90.54%   94.63%   +4.09%     
==========================================
  Files           3        3              
  Lines         148      149       +1     
==========================================
+ Hits          134      141       +7     
+ Misses         14        8       -6     
Impacted Files       Coverage Δ
src/benchmarks.jl     89.55% <100.00%> (ø)
src/original.jl       96.77% <100.00%> (+0.10%) ⬆️
src/kernels.jl       100.00% <0.00%> (+11.76%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2acd367...5c561fa. Read the comment docs.

@chriselrod chriselrod changed the title Fix last cache Fix last cache, and make multithreaded benchmarks actually run multithreaded. Nov 1, 2021
@carstenbauer
Member

Last fix seems to have broken CI.

@chriselrod
Contributor Author

The default vector length used in the tests, 10, was much too short.
Using larger vectors makes the tests pass locally, but it seems CI's variance is much higher, requiring even larger vectors.

@carstenbauer
Member

carstenbauer commented Nov 1, 2021

Tests pass locally for me as well (macOS). I guess we could

  • relax the tests (it's probably not a great idea anyway to check for a specific memory bandwidth range)
  • move CI to one of our cluster nodes to have a stable environment with less variance (see ThreadPinning.jl for an example)

@chriselrod
Contributor Author

Tests pass locally for me as well (macOS). I guess we could

  • relax the tests (it's probably not a great idea anyway to check for a specific memory bandwidth range)
  • move CI to one of our cluster nodes to have a stable environment with less variance (see ThreadPinning.jl for an example)

An upper bound could be reasonable, but it seems we're capable of hitting pretty extreme lower bounds in these benchmarks.
If it isn't too difficult for you to set up your own environment, that'd probably be a more reliable solution.

@carstenbauer
Member

If it isn't too difficult for you to set up your own environment, that'd probably be a more reliable solution.

Basic setup on master, see https://git.uni-paderborn.de/pc2-ci/julia/STREAMBenchmark-jl/-/jobs.

@carstenbauer
Member

https://github.com/JuliaPerf/STREAMBenchmark.jl/runs/4076840066?check_suite_focus=true#step:6:126

Benchmarks: Test Failed at /Users/runner/work/STREAMBenchmark.jl/STREAMBenchmark.jl/test/runtests.jl:31
  Expression: (memory_bandwidth()).median > (memory_bandwidth(write_allocate = false)).median
   Evaluated: 2477.1 > 2592.9

Maybe we should add a small tolerance here.
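One way to add such a tolerance (a sketch only; `tol` and its value are illustrative, not what the test suite necessarily adopted):

```julia
using Test, STREAMBenchmark

# Allow the write_allocate=false measurement to win by a small margin
# instead of requiring a strict inequality between the two medians.
tol = 0.10  # 10% tolerance; illustrative value
@test memory_bandwidth().median >
      (1 - tol) * memory_bandwidth(write_allocate = false).median
```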

@carstenbauer
Member

carstenbauer commented Nov 2, 2021

Hope you don't mind me taking the liberty to push here (merging the latest changes from master, etc.).

I've introduced a STREAM_VECTOR_LENGTH environment variable to let CI systems set the vector length manually.
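I haven't reproduced the exact implementation here, but consuming such a variable could look roughly like this (function names are illustrative):

```julia
# Sketch: prefer an explicit STREAM_VECTOR_LENGTH from the environment,
# falling back to the package's heuristic default otherwise.
function vector_length()
    haskey(ENV, "STREAM_VECTOR_LENGTH") ?
        parse(Int, ENV["STREAM_VECTOR_LENGTH"]) :
        STREAMBenchmark.default_vector_length()
end
```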

@carstenbauer
Member

carstenbauer commented Nov 2, 2021

On the PC² CI runner I now get

Precompiling project...
  1 dependency successfully precompiled in 6 seconds (47 already precompiled)
     Testing Running tests...
STREAMBenchmark.default_vector_length() = 14417920
╔══╡ Single-threaded:
╟─ COPY:  17783.6 MB/s
╟─ SCALE: 17684.2 MB/s
╟─ ADD:   16718.5 MB/s
╟─ TRIAD: 16702.5 MB/s
╟─────────────────────
║ Median: 17201.3 MB/s
╚═════════════════════

╔══╡ Multi-threaded:
╠══╡ (20 threads)
╟─ COPY:  109680.7 MB/s
╟─ SCALE: 108428.4 MB/s
╟─ ADD:   105860.4 MB/s
╟─ TRIAD: 103778.1 MB/s
╟─────────────────────
║ Median: 107144.4 MB/s
╚═════════════════════

1: 3604480 => 17103.4
2: 7208960 => 17063.7
3: 10813440 => 17103.8
4: 14417920 => 17086.5
1: 3604480 => 260755.7
2: 14417920 => 91404.6
- Creating folder "stream"
- Downloading C STREAM benchmark
- Done.
- Trying to compile "stream.c" using gcc
  Using options: -O3 -DSTREAM_ARRAY_SIZE=14417920
- Done.
- Trying to compile "stream.c" using carsten
Test Summary:      | Pass  Total
STREAMBenchmark.jl |   32     32
     Testing STREAMBenchmark tests passed

with this PR. This looks good. The expected bandwidth for 20 threads is ~100 GB/s, see https://github.com/JuliaPerf/BandwidthBenchmark.jl/blob/main/benchmarks/noctua_pc2/bwbench/bwbench.out#L185

@carstenbauer
Member

I'll merge this right away as it fixes an important bug (last_cachesize).

@carstenbauer carstenbauer merged commit a3443e8 into JuliaPerf:master Nov 2, 2021
@chriselrod chriselrod deleted the fixlastcache branch November 2, 2021 12:57
@carstenbauer
Member

The default vector length used in the tests, 10, was much too short.
Using larger vectors makes the tests pass locally, but it seems CI's variance is much higher, requiring even larger vectors.

FYI, I noticed that the ultra-low memory bandwidth values also occurred on the PC² CI runners (on our cluster). Investigating this, I noticed that only the run with Pkg.test(; coverage=true) showed this strange result, which led me to realize that we're hitting JuliaLang/julia#36142 here.

@chriselrod
Contributor Author

coverage=true also seems to make single-threaded @turbo very slow.
When coverage=true, the Octavian test suite runs only a fraction of the tests compared to when it is false.
Yet the coverage=false tests take about 10-12 minutes on CI, while many of the coverage=true tests take 2+ hours.
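Until that upstream issue is resolved, a practical workaround when timing-sensitive tests misbehave is to run them without coverage instrumentation (note that coverage=false is already Pkg's default):

```julia
using Pkg

# Run the test suite without coverage instrumentation, avoiding the
# slowdown from JuliaLang/julia#36142.
Pkg.test("STREAMBenchmark"; coverage = false)
```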
