Coverage test is extremely slow on multiple threads #36142

Open
efaulhaber opened this issue Jun 4, 2020 · 11 comments · Fixed by #44359
Labels: multithreading (Base.Threads and related functionality), performance (Must go faster)

Comments


efaulhaber commented Jun 4, 2020

Running Pkg.test("MyModule", coverage=true) on more than one thread is extremely slow.

I measured the time on my machine using 1, 2, 8 and 16 threads:

1 thread: 339 seconds
2 threads: 3106 seconds
8 threads: 6401 seconds
16 threads: 5072 seconds

Running Pkg.test("MyModule") without coverage takes approximately the same time for every number of threads (around 2 minutes) for this module.

See also the Discourse discussion.

I am using Julia 1.4.1 on Windows 10 and my CPU has 8 cores/16 threads.
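
For reference, one hedged way to reproduce this kind of measurement (the module name "MyModule" and the thread counts are placeholders; the -t flag needs Julia 1.5 or newer, and the compiler may spend extra time on the first run):

# Sketch of a reproduction, assuming "MyModule" is available in the active
# environment; compare wall time with and without coverage=true.
using Pkg

for n in (1, 2, 8, 16)
    @info "coverage run" nthreads = n
    @time Pkg.test("MyModule"; coverage=true, julia_args=["-t$n"])
end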

carstenbauer (Member) commented Nov 3, 2021

FWIW, this has hit / confused us over at STREAMBenchmark.jl, leading to unrealistically low memory bandwidth measurements.

NHDaly (Member) commented May 15, 2024

@IanButterworth: Thanks for the original fix! We find, though, that for a large codebase like ours, adding threads still makes a significant difference; e.g., some of our tests go from ~1 min to ~10 min just by using -tauto.

I am reopening this issue, since it seems to need a more fundamental fix for large codebases, in addition to the work you've already done to limit the coverage computation to just the code under test.

NHDaly reopened this May 15, 2024

NHDaly (Member) commented May 15, 2024

For example, some solutions that we just brainstormed include (a sketch of the first option follows below):

  • thread-local coverage counters, merged at the end; or
  • sharding the counters by file hash into separately locked groups, to reduce contention.
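
To make the first option concrete, here is a purely illustrative Julia-level sketch; NLINES, PER_THREAD, record_hit!, and merged_counts are made-up names, and the real counters live in the runtime rather than in Julia code:

# Illustrative sketch only: per-thread coverage counters that are merged once
# after the run, so threads never write to the same counter memory.
# (Task migration is ignored here for simplicity.)
const NLINES = 1024                                   # assumed number of instrumented lines
const PER_THREAD = [zeros(Int, NLINES) for _ in 1:Threads.nthreads()]

@inline function record_hit!(line::Int)
    # each thread increments only its own vector; no cross-core invalidations
    PER_THREAD[Threads.threadid()][line] += 1
    return nothing
end

# one merge at the end of the test run produces the usual per-line counts
merged_counts() = reduce(+, PER_THREAD)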

NHDaly (Member) commented May 15, 2024

I think @KristofferC's explanation on Discourse is plausible: https://discourse.julialang.org/t/coverage-test-is-extremely-slow-on-multiple-threads/40682/2
and we think a solution along the lines he and @efaulhaber describe seems reasonable.

NHDaly (Member) commented May 15, 2024

Here is a simple example benchmark that demonstrates the thread scaling issue:

# Lots of whitespace to cause lots of "blocks" in code coverage
module MyPkg
function test(x,y)
    z = x + y
    if z > 0
        return z
    else
        return -z
    end
end
foo(x) = x + 1
bar(x) = x + 1
baz(x) = x + 1
end # module MyPkg

using .MyPkg

@show Threads.nthreads()

SCALE = parse(Int, ENV["SCALE"])

# Perform significant cpu-bound work in each task:
out = Threads.Atomic{Int}(0)
@sync for i in 1:50
    Threads.@spawn begin
        i = 0
        for _ in 1:($SCALE * 1_000_000)
            i += MyPkg.test($i,2)
            rand() < 0.01 && (i += MyPkg.foo(i))
            rand() < 0.01 && (i += MyPkg.bar(i))
            rand() < 0.01 && (i += MyPkg.baz(i))
        end
        Threads.atomic_add!(out, i)
    end
end
@show out

Thread-scaling without code coverage is as you'd expect:

SCALE=10 julia -e 'for i in 1:8
run(`time julia -t$i --proj=./MyPkg MyPkg/test/runtests.jl`)  # no coverage
end'
Threads.nthreads() = 1
out = Base.Threads.Atomic{Int64}(-8938018849759264776)
        8.88 real         9.23 user         0.75 sys
Threads.nthreads() = 2
out = Base.Threads.Atomic{Int64}(4590067741951977982)
        5.47 real        10.51 user         1.27 sys
Threads.nthreads() = 3
out = Base.Threads.Atomic{Int64}(5710305207362202079)
        4.21 real        10.82 user         1.89 sys
Threads.nthreads() = 4
out = Base.Threads.Atomic{Int64}(7271437011698658188)
        3.89 real        12.26 user         1.43 sys
Threads.nthreads() = 5
out = Base.Threads.Atomic{Int64}(5087105693316976996)
        3.38 real        12.81 user         1.56 sys
Threads.nthreads() = 6
out = Base.Threads.Atomic{Int64}(-4298445409765262958)
        3.32 real        14.25 user         1.25 sys
Threads.nthreads() = 7
out = Base.Threads.Atomic{Int64}(-2582078949504319358)
        2.96 real        14.59 user         1.85 sys
Threads.nthreads() = 8
out = Base.Threads.Atomic{Int64}(-5153601735755644603)
        3.49 real        19.98 user         1.86 sys

Thread-scaling with code-coverage sadly exhibits inverse scaling:

SCALE=1 julia -e 'for i in 1:8
run(`time julia -t$i --proj=./MyPkg --code-coverage=all MyPkg/test/runtests.jl`)
end'
Threads.nthreads() = 1
out = Base.Threads.Atomic{Int64}(-8989974450545929530)
        3.93 real         4.49 user         1.06 sys
Threads.nthreads() = 2
out = Base.Threads.Atomic{Int64}(7071944073563161211)
       18.38 real        36.62 user         0.51 sys
Threads.nthreads() = 3
out = Base.Threads.Atomic{Int64}(-1208807434831296872)
       20.40 real        57.07 user         1.11 sys
Threads.nthreads() = 4
out = Base.Threads.Atomic{Int64}(-1769835756041968564)
       22.92 real        85.85 user         1.33 sys
Threads.nthreads() = 5
out = Base.Threads.Atomic{Int64}(2079036137052582239)
       25.88 real       122.47 user         0.91 sys
Threads.nthreads() = 6
out = Base.Threads.Atomic{Int64}(-8918746429117640383)
       23.08 real       124.14 user         1.97 sys
Threads.nthreads() = 7
out = Base.Threads.Atomic{Int64}(-1729088419819035297)
       23.33 real       149.85 user         1.31 sys
Threads.nthreads() = 8
out = Base.Threads.Atomic{Int64}(2084586300043682145)
       21.86 real       150.49 user         2.03 sys

(Note: the benchmark is run at a lower data scale in the coverage case (SCALE=1 instead of SCALE=10) so it doesn't take forever.)

NHDaly (Member) commented May 15, 2024

Suggestion from @adnan-alhomssi:

If we only care about whether a line is covered or not, can we implement an easier solution, where we first load the counter value and increment it only if it is equal to zero?
https://github.com/JuliaLang/julia/blob/5f68f1ae249d1836820116b18a23a1cad29367b3/src/codegen.cpp#L2824-L2831C43

This seems like a good idea to me too. Do people actually use the "how many times was this line touched" mechanism? We are simply trying to track whether our code is covered or not (a binary) for tracking test coverage.

It seems that trace-profiling and code coverage are really two separate features that could be supported separately, so that code coverage on its own could be more performant. We could add a separate mode to indicate that this is the desired behavior.

And then the full profiling-mode that we currently have could be made more performant via separate per-thread counters.
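
To make the proposal concrete, here is a hedged Julia-level sketch of the "increment only if zero" idea; the real change would be in the counter increment that codegen emits (linked above), and COVERAGE / mark_covered! are illustrative names:

# Illustrative only: once a counter is nonzero, every later hit is a plain read,
# so the cache line can stay shared across cores instead of being invalidated
# on every execution of the covered line.
const COVERAGE = zeros(Int, 1024)        # hypothetical shared per-line counters

@inline function mark_covered!(line::Int)
    if COVERAGE[line] == 0               # cheap read; false after the first hit
        COVERAGE[line] = 1               # at most one write per line per run
    end
    return nothing
end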

efaulhaber (Author) commented

I don't know the internals, but might atomic adds with Atomix.jl be an easy quick fix to make this at least reasonably fast, without having to implement the whole thread-local coverage counter machinery? Maybe as a temporary solution?

vtjnash (Member) commented May 16, 2024

Atomics would make this a lot slower than it already is (we briefly tested to see how bad it would be, since we already knew atomics are about 100x slower, but were not initially certain whether that would dominate the overall time). I assume the penalty may be much lower on non-x86 platforms such as AArch64, but x86 forces atomic-ish TSO on every value even when you don't want it.
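
For anyone who wants to see the cost directly, here is a rough micro-benchmark along these lines (illustrative only: bump_plain! and bump_atomic! are made-up helpers, the first @time includes compilation, and the compiler may partially optimize the plain loop) comparing plain, racy increments with contended atomic increments:

using Base.Threads

# racy plain increments on a shared slot, similar in spirit to the current coverage counters
function bump_plain!(c::Vector{Int}, n)
    for _ in 1:n
        c[1] += 1
    end
end

# atomic increments: a locked read-modify-write on x86 for every hit
function bump_atomic!(c::Atomic{Int}, n)
    for _ in 1:n
        atomic_add!(c, 1)
    end
end

plain = zeros(Int, 1)
atom = Atomic{Int}(0)

@time @sync for _ in 1:nthreads()
    @spawn bump_plain!(plain, 10^7)
end
@time @sync for _ in 1:nthreads()
    @spawn bump_atomic!(atom, 10^7)
end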

gbaraldi (Member) commented

I wonder how the SanitizerCoveragePass that LLVM has performs.

vtjnash (Member) commented May 16, 2024

My understanding is that LLVM uses a per-thread allocation

NHDaly (Member) commented May 22, 2024

We are actually measuring on new M2 Macs, and we're seeing pretty dramatic slowdowns there. :( Sorry I didn't make that clear, but the numbers above were from my MacBook Pro.


We currently hypothesize that this is caused by thrashing of cache lines, since every thread invalidates the cached value on all other CPUs on every hit of an instrumented line. If so, Adnan's suggestion above seems like the best one.
@d-netto suggested we consider taking a cache-invalidation profile with a recent Linux's perf, but we haven't tried that yet.
