Heavily degraded performance while in extreme contention on some processors #201
I wonder if the result would be different if we used |
I don't have a Windows system either, and the benchmark wasn't done by me, but another user got similar results on Linux, so maybe there is a common denominator: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd3taac/ Linux system (rustc 1.41.0-nightly 2019-12-05, AMD 3900x, 12 cores with SMT).
extreme contention
heavy contention
light contention
no contention
|
I tested on my machine on Linux and cannot reproduce. Maybe it's a question of the whole system rather than just the operating system. I'm using Linux, rustc 1.40.0 (73528e339 2019-12-16) and an Intel i5-8250U (8 cores). I realize it's not ideal not having anyone able to reproduce, but the results were so unexpected that they deserved a bigger investigation. Especially when |
I ran the same tests on Windows instead of Linux. The results are here. Also a run of the "extreme contention" case with SwitchToThread instead of Sleep:
|
I've run the benchmark on Windows (the same machine from the blog post, just a different OS), and can't reproduce the perf degradation of parking_lot. |
@cynecx yes that's correct. I just tried it on some other hardware:
i5-6200U (Skylake, Notebook)
The Intel Core i7-5930k of the initial Windows post seems to be a Haswell-E. |
@matklad It seems to be related to the microprocessor architecture in use. @theunknownxy's results show the same degradation on both OSes (Windows and Linux). EDIT: I've accidentally removed my previous comment. I am not entirely sure, but perhaps this has to do with the
Perhaps someone could experiment with lowering the condition here: parking_lot/core/src/spinwait.rs Line 53 in 8f69413
|
@cynecx Lowering the value in the condition does not seem to have an impact on the results.
always using thread_yield (ignoring the counter):
|
Linux 4.15.0-72, AMD Ryzen 7 2700x (8 physical cores, enabled SMT), "extreme contention":
|
I'd recommend looking closer at AMD, especially Zen-based machines, as they seem to be the common denominator so far. |
The first Windows benchmark is Intel, though that specific model may be doing something similar to most AMD parts. Possibly related to Hyperthreading/SMT. Maybe if SMT is disabled and the bench is re-run, the performance degradation for |
Disabling SMT improves the situation, but the problem is still present:
|
Well, I don’t have access to an AMD Ryzen (and I can’t find a cloud service which specifically offers them), but for those who do, perhaps some profiling (e.g. perf) can shed some light? |
Well, it seems that EC2 does have AMD instances (AMD EPYC 7571), so I got this: System:
Doesn't seem to be affected :/ EDIT: Also, according to AMD's SKU naming scheme, this CPU is basically based on first-generation Zen; however, most of the problematic benchmarks above were run on second-generation Zen. |
Here are some results from a 3900x on windows:
It's obvious that the spinlock solution would do fairly well on that kind of machine. There are plenty of cores around, so the spinning threads don't need to get preempted that often. However, the parking_lot vs std::mutex difference is really big. I think the conclusion that it's AMD-specific hasn't held true so far, since the initial post used an Intel CPU. And it also doesn't seem to be constrained to Windows, since someone on a 3900x on Linux also observed a degradation. Interesting. I will try to play around with some settings |
I got a large improvement by increasing the number of spins before a sleep. But it took a really big increase, so my concern is that I thereby just degraded it into the spinlock:

pub fn spin(&mut self) -> bool {
    if self.counter >= 20 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 15 {
        cpu_relax(1 << self.counter);
    } else {
        thread_parker::thread_yield();
    }
    true
}

Result:
So this goes from 2^3 spins to 2^15. Even with 2^9 there was no noticeable improvement. E.g. going to

if self.counter >= 16 {
    return false;
}

which avoids all thread_yield calls just changes the result to
|
With Hyperthreading disabled I usually get
but sometimes also half of it
But it pretty much consistently alternates between 40 and 80ms, with nothing in between 😵 Without Hyperthreading I also get better results if I do not increase the overall spins, but insert busy-spins before the OS yield (in a similar fashion to crossbeam's Backoff):

pub fn spin(&mut self) -> bool {
    if self.counter >= 10 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 3 {
        cpu_relax(1 << self.counter);
    } else {
        cpu_relax(1 << self.counter);
        thread_parker::thread_yield();
    }
    true
}
With Hyperthreading enabled this actually did not make that much of a difference. |
You can read Linus' thoughts about thread_yield in this post: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752. |
I am also seeing the issue on my desktop:
It should be noted that this is a NUMA system, as mentioned in Linus' reply to the article. Here are the results for extreme contention (heavy, light, and none look okay):

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}
std::sync::Mutex     avg 55.576253ms   min 37.536375ms   max 63.957731ms
parking_lot::Mutex   avg 228.590416ms  min 188.487466ms  max 239.946827ms
spin::Mutex          avg 44.905108ms   min 31.901131ms   max 67.000027ms
AmdSpinlock          avg 46.383674ms   min 32.800095ms   max 93.837092ms

I also tried some of the modifications from @Matthias247:

2^15 spins
2^15 spins, never yield
2^3 spins immediately before yield
|
I can reproduce on ThreadRipper 3970X.
Ran a modified version running only one
And with
The GHz numbers caught my attention, on top of everything else, and empirically the CPU usage levels during a run in htop seem to indicate less CPU used overall while the
FWIW. I'm not going to dive deeper into this right now, but I hope the data above can be useful somehow. |
Are you sure you're measuring anything valid? Per Linus' response the code doesn't really measure what the blog poster thinks it's measuring. |
This is mostly unrelated to the issue at hand, but, just to clear some confusion, the benchmark in this issue is from this blog post, while Linus is responding to another post. The benchmarks are different: the first tries to measure throughput, the second tries to measure latency. This obviously doesn't mean that my benchmark isn't flawed in some terrible way as well :-) |
Basically what this test is measuring is how effective the spin back-off is. parking_lot uses an exponential back-off strategy where it doubles the amount of time it pauses between spins every time it misses the lock. This effectively allows the thread which currently holds the lock to rapidly lock/unlock without any cacheline interference from other threads. The reason why you are seeing different results on different microarchitectures basically boils down to a change Intel made in the Skylake and later microarchitectures which changed the duration of the PAUSE x86 instruction from 10 cycles to 140 cycles (relevant issue in .NET). Since I tuned this delay loop on my personal computer, which has a Skylake CPU, the delay loop is much shorter (by about 10x) on other microarchitectures. |
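For illustration, here is a minimal sketch of that exponential back-off pattern (a sketch only, not parking_lot's actual SpinWait; the Backoff type and the constants are invented for the example): each miss doubles the number of PAUSE hints executed, and past a cap the thread yields to the OS instead of spinning further.

use std::hint::spin_loop;

struct Backoff {
    counter: u32,
}

impl Backoff {
    fn new() -> Self {
        Backoff { counter: 0 }
    }

    // Called each time the lock was observed to be held by someone else.
    fn snooze(&mut self) {
        if self.counter < 10 {
            self.counter += 1;
        }
        if self.counter <= 4 {
            // 2, 4, 8, 16 PAUSE hints: the wait roughly doubles on every miss,
            // which keeps the contended cacheline quiet while the owner
            // rapidly locks and unlocks.
            for _ in 0..(1u32 << self.counter) {
                spin_loop();
            }
        } else {
            // Past the cap, hand the core back to the scheduler instead.
            std::thread::yield_now();
        }
    }
}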
So from reading Linus' posts and other info, the heavily degraded performance could be related to NUMA bouncing. Linus talks a lot about informing the kernel about which threads are waiting on each other; are there APIs for that? |
Hmm, the man page for
The rust documentation for |
My current thoughts about this:
Edit some kernel related links about this: |
Results on an AWS t3a.medium (EPYC 7571):
Results on my desktop (Ryzen 5 3600):
I'm not sure why my desktop (12 threads at 4.2GHz) is so much slower across the board than EC2 (2 threads at 2.2GHz), but I can't reproduce this on either. For what it's worth, my desktop is running Linux 5.3.1 and EC2 is Linux 4.19.80. Both are running NixOS 19.09 with (mostly) identical configurations. |
Results on Windows 10 (Ryzen 1800X):
|
If measuring latency is flawed, trying to measure throughput is likely flawed as well. |
I would think the best course of action going forward is to ask for help from kernel gurus, because I don't think any of this is measuring what you think |
Even if measuring latency or throughput is flawed, here's one that is not: the overall loop takes much longer with parking_lot with large contention. |
Benchmark code: https://github.com/matklad/lock-bench
This has 48 hyperthreads (12 ) across two NUMA nodes, so maybe I should rerun with a higher n_threads to be comparable.
I reran and used perf top during the first parking_lot::Mutex run; here are the top entries there
Running on a single NUMA node seems to produce the same results.
|
Windows 10 (1909), Ryzen 3600X. Initially I ran it with Firefox open in the background. The parking_lot average was about twice as long as std's, but the minimum was better. I closed Firefox, and surprisingly the numbers got worse. With FF open the avg varies from 40-80ms, and the minimum often dips below 10ms. With it closed, the average is very consistently around 85ms (±2), and the minimum 40-70ms. I ran it about 10 times total, alternating and without alternating (it didn't make a difference). I won't post them all since they were very similar. These were all "extreme contention" (
Firefox open
Firefox closed
|
That's nifty. On Windows, applications can ask the OS to increase the timer granularity. Media playback programs often do it, and probably browsers too. I don't expect it to have the same effect, but can you also test using different power management profiles (Balanced/Performance)? |
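For reference, the timer-granularity request mentioned above is typically made through winmm's timeBeginPeriod/timeEndPeriod. A minimal sketch (not part of the benchmark), assuming a Windows target:

// Values are in milliseconds; timeEndPeriod must be called with the same value.
#[link(name = "winmm")]
extern "system" {
    fn timeBeginPeriod(period_ms: u32) -> u32;
    fn timeEndPeriod(period_ms: u32) -> u32;
}

fn main() {
    // Ask the OS for 1 ms timer granularity, as media players and browsers do.
    unsafe { timeBeginPeriod(1) };

    // ... run the contended-lock benchmark here ...

    // Restore the default granularity when done.
    unsafe { timeEndPeriod(1) };
}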
Ahh, I misapprehended... Well, the parking_lot mutex seems to spin in some cases while waiting for a lock to become available, and so we fall into what Linus says is the bad behavior of userspace spinlocks on Linux. What's the highly contested perf if all the spinning code in the parking_lot mutex is removed? |
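A hypothetical way to run that experiment (a sketch against the spin() function quoted earlier in the thread, not an actual parking_lot patch): make the spin phase give up immediately so a contended lock() parks the thread via the OS right away, with no userspace back-off.

pub fn spin(&mut self) -> bool {
    // Report the spin budget as exhausted on the very first call;
    // the caller then falls through to parking the thread.
    false
}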
That's the first thing that came to mind but I checked the timer interval and it's held at 1ms regardless of Firefox (I guess there's a driver or service keeping it at that, because I have everything in the task bar and tray closed). I manually set the interval to the minimum (0.5ms), re-ran and got the same results.
Doesn't seem to make any significant difference. I ran the bench on the same machine on Linux (Manjaro, 5.4). Times are consistently like these, regardless of what's open:
|
I made some changes to the benchmark to use criterion, and I added kernel/user CPU time reports (which might be broken, as I think they don't account for task-switching costs). This changes the way each iteration is run compared to matklad's benchmark: criterion selects n_rounds and increases n_ops after each iteration. This led to some jitter in the first iterations, as the first locks seem to cost more. Here are the results: https://speedy37.github.io/lock-bench/ |
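For anyone who wants to reproduce that setup, a rough criterion skeleton could look like the following (a sketch, not Speedy37's actual code; the thread and op counts are placeholders standing in for the "extreme contention" case):

use criterion::{criterion_group, criterion_main, Criterion};
use std::sync::Arc;
use std::thread;

fn extreme_contention(c: &mut Criterion) {
    c.bench_function("parking_lot::Mutex extreme contention", |b| {
        b.iter(|| {
            // 32 threads hammering one shared lock, 10_000 increments each
            // (placeholder for the 32-thread / 2-lock extreme case).
            let lock = Arc::new(parking_lot::Mutex::new(0u64));
            let handles: Vec<_> = (0..32)
                .map(|_| {
                    let lock = Arc::clone(&lock);
                    thread::spawn(move || {
                        for _ in 0..10_000 {
                            *lock.lock() += 1;
                        }
                    })
                })
                .collect();
            for h in handles {
                h.join().unwrap();
            }
        })
    });
}

criterion_group!(benches, extreme_contention);
criterion_main!(benches);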
I tested removing the thread_yield for the Mutex while keeping the original spinning behaviour, and it improved the extreme contention benchmark significantly.
Heavy contention
And after the change:
Heavy contention
|
nvm |
Windows 10 - R7 1800X (8 cores, 16 threads), cacheline results (see my previous post for non-cacheline results). Avg is a bit worse.
|
parking_lot seems to be very good when running on a single CCX:
4 cores, 8 threads, 1 CCX
8 cores, 8 threads, 2CCX
|
No degradation on my i9-9960x or i7-7700HQ |
I can confirm on my AMD TR 1950X (16 Cores, 32 Threads): Extreme contention:
Heavy contention:
Light contention:
No contention:
For comparison, using only one NUMA node (8 cores of 16, 16 threads):
Using only one CCX (4 Cores of 16, 8 threads, same NUMA node, shared L3):
Using spinning without yield did not help, in contrast to my earlier comment where I ran it on an i5-6440HQ. |
This blog post, Mutexes Are Faster Than Spinlocks, generated a Reddit discussion about lock implementations. This made some users benchmark parking_lot on Windows, and some of the results shown are very problematic; others could simply be better (since parking_lot is always faster than spin on Linux and Mac, but not on Windows for some reason). I did not run this benchmark since I don't currently have Windows installed, but since the user hadn't filed an issue I decided to post it here. Their comment was: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd31um3/
The results:
Benchmark code: https://github.com/matklad/lock-bench
Windows 10 Pro
Intel Core i7-5930k @ 3.5 GHz
stable-x86_64-pc-windows-msvc (default)
rustc 1.40.0 (73528e339 2019-12-16)
extreme contention
heavy contention
light contention
no contention