Heavily degraded performance while in extreme contention on some processors #201
I wonder if the result would be different if we used |
I don't have a Windows system either, and the benchmark wasn't done by me, but another user got similar results on Linux, so maybe there is a common denominator: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd3taac/ Linux system (rustc 1.41.0-nightly 2019-12-05, AMD 3900x, 12 cores with SMT).
extreme contention
heavy contention
light contention
no contention
|
I tested on my machine on Linux and cannot reproduce. Maybe it's a question of the whole system rather than just the operating system. I'm using Linux, rustc 1.40.0 (73528e339 2019-12-16) and an Intel i5-8250U (8 cores). I realize it's not ideal not having anyone able to reproduce, but the results were so unexpected that they deserved a bigger investigation. Especially when |
I ran the same tests on Windows instead of Linux. The results are here. Also a run of the "extreme contention" case with SwitchToThread instead of Sleep:
|
I've run the benchmark on Windows (the same machine from the blog post, just a different OS), and can't reproduce the perf degradation of parking_lot. |
@cynecx yes that's correct. I just tried it on some other hardware:
i5-6200U (Skylake, Notebook)
The Intel Core i7-5930k of the initial Windows post seems to be a Haswell-E. |
@matklad It seems to be related to the microprocessor architecture in use. @theunknownxy's results show the same degradation on both OSes (Windows and Linux). EDIT: I've accidentally removed my previous comment. I am not entirely sure, but perhaps this has to do with the
Perhaps someone could experiment with lowering the condition here: parking_lot/core/src/spinwait.rs Line 53 in 8f69413
|
@cynecx Lowering the value in the condition does not seem to have an impact on the results.
always using thread_yield (ignoring the counter):
|
Linux 4.15.0-72, AMD Ryzen 7 2700x (8 physical cores, enabled SMT), "extreme contention":
|
I'd recommend looking closer at AMD, especially Zen-based machines, as they seem to be the common denominator so far. |
The first Windows benchmark is Intel, though that specific model may be doing something similar to most AMD parts. Possibly related to Hyperthreading/SMT. Maybe if SMT is disabled and the bench is re-run, the performance degradation for |
Disabling SMT improves the situation, but the problem is still present:
|
Well, I don’t have access to an AMD Ryzen (and I can’t find a cloud service which specifically offers them), but for those who do, perhaps some profiling (e.g. perf) can shed some light? |
Well, it seems that EC2 does have AMD instances (AMD EPYC 7571), so I got this: System:
Doesn't seem to be affected :/ EDIT: Also, according to AMD's SKU naming scheme, this CPU is basically based on first-generation Zen; however, most of the problematic benchmarks above were run on second-generation Zen. |
Here are some results from a 3900x on windows:
It's obvious that the spinlock solution would do fairly well on that kind of machine. There are plenty of cores around, so the spinning threads don't need to get preempted that often. However, the parking_lot vs std::mutex difference is really big. I think the conclusion that it's AMD-specific hasn't held true so far, since the initial post used an Intel CPU. And it also doesn't seem to be constrained to Windows, since someone on a 3900x on Linux also observed a degradation. Interesting. I will try to play around with some settings |
I got a large improvement by increasing the number of spins before a sleep. But it took a really big increase, so my concern is that I thereby just degraded it into the spinlock:

pub fn spin(&mut self) -> bool {
    if self.counter >= 20 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 15 {
        cpu_relax(1 << self.counter);
    } else {
        thread_parker::thread_yield();
    }
    true
}

Result:
So this goes from 2^3 spins to 2^15. Even with 2^9 there was no noticeable improvement. E.g. going to

if self.counter >= 16 {
    return false;
}

which avoids all thread_yield calls just changes the result to
|
With Hyperthreading disabled I usually get
but sometimes also half of it
But it pretty much consistently alternates between 40 and 80ms, with nothing in between 😵 Without Hyperthreading I also get better results if I do not increase the overall spins, but insert busy-spins before the OS yield (in a similar fashion to crossbeam's Backoff):

pub fn spin(&mut self) -> bool {
    if self.counter >= 10 {
        return false;
    }
    self.counter += 1;
    if self.counter <= 3 {
        cpu_relax(1 << self.counter);
    } else {
        cpu_relax(1 << self.counter);
        thread_parker::thread_yield();
    }
    true
}
With Hyperthreading enabled this actually did not make that much of a difference. |
You can read Linus' thoughts about thread_yield in this post: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752. |
I am also seeing the issue on my desktop:
It should be noted that this is a NUMA system, as mentioned in Linus' reply to the article. Here are the results for extreme contention (heavy, light, and none look okay):

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}
std::sync::Mutex     avg 55.576253ms   min 37.536375ms   max 63.957731ms
parking_lot::Mutex   avg 228.590416ms  min 188.487466ms  max 239.946827ms
spin::Mutex          avg 44.905108ms   min 31.901131ms   max 67.000027ms
AmdSpinlock          avg 46.383674ms   min 32.800095ms   max 93.837092ms

I also tried some of the modifications from @Matthias247:

2^15 spins
2^15 spins, never yield
2^3 spins immediately before yield
|
I can reproduce on ThreadRipper 3970X.
Ran a modified version running only one
And with
The GHz numbers caught my attention, on top of everything else, and empirically the CPU usage levels during a run in htop seem to indicate less CPU used overall while the
FWIW. I'm not going to dive deeper into this right now, but I hope the data above can be useful somehow. |
Are you sure you're measuring anything valid? Per Linus' response the code doesn't really measure what the blog poster thinks it's measuring. |
This is mostly unrelated to the issue at hand, but, just to clear some confusion, the benchmark in this issue is from this blog post, while Linus is responding to another post. The benchmarks are different: the first tries to measure throughput, the second tries to measure latency. This obviously doesn't mean that my benchmark isn't flawed in some terrible way as well :-) |
Basically what this test is measuring is how effective the spin back-off is. parking_lot uses an exponential back-off strategy where it doubles the amount of time it pauses between spins every time it misses the lock. This effectively allows the thread which currently holds the lock to rapidly lock/unlock without any cacheline interference from other threads. The reason why you are seeing different results on different microarchitectures basically boils down to a change Intel made in the Skylake and later microarchitectures which changed the duration of the PAUSE x86 instruction from 10 cycles to 140 cycles (relevant issue in .NET). Since I tuned this delay loop on my personal computer, which has a Skylake CPU, the delay loop is much shorter (by about 10x) on other microarchitectures. |
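For illustration, here is a minimal sketch of that exponential back-off pattern (a sketch only, not parking_lot's actual SpinWait; the Backoff type and the constants are invented for the example): each miss doubles the number of PAUSE hints executed, and past a cap the thread yields to the OS instead of spinning further.

use std::hint::spin_loop;

struct Backoff {
    counter: u32,
}

impl Backoff {
    fn new() -> Self {
        Backoff { counter: 0 }
    }

    // Called each time the lock was observed to be held by someone else.
    fn snooze(&mut self) {
        if self.counter < 10 {
            self.counter += 1;
        }
        if self.counter <= 4 {
            // 2, 4, 8, 16 PAUSE hints: the wait roughly doubles on every miss,
            // which keeps the contended cacheline quiet while the owner
            // rapidly locks and unlocks.
            for _ in 0..(1u32 << self.counter) {
                spin_loop();
            }
        } else {
            // Past the cap, hand the core back to the scheduler instead.
            std::thread::yield_now();
        }
    }
}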
So from reading Linus' posts and other info, the heavily degraded performance could be related to NUMA bouncing. Linus talks a lot about informing the kernel about which threads are waiting on each other; are there APIs for that? |
Hmm, the man page for
The rust documentation for |
My current thoughts about this:
Edit some kernel related links about this: |
Results on an AWS t3a.medium (EPYC 7571):
Results on my desktop (Ryzen 5 3600):
I'm not sure why my desktop (12 threads at 4.2GHz) is so much slower across the board than EC2 (2 threads at 2.2GHz), but I can't reproduce this on either. For what it's worth, my desktop is running Linux 5.3.1 and EC2 is Linux 4.19.80. Both are running NixOS 19.09 with (mostly) identical configurations. |
Results on Windows 10 (Ryzen 1800X):
|
If measuring latency is flawed, trying to measure throughput is likely flawed as well. |
I would think the best course of action going forward is to ask for help from kernel gurus, because I don't think any of this is measuring what you think |
Even if measuring latency or throughput is flawed, here's one that is not: the overall loop takes much longer with parking_lot with large contention. |
Benchmark code: https://github.com/matklad/lock-bench
This has 48 hyperthreads (12 ) across two NUMA nodes, so maybe I should rerun with a higher n_threads to be comparable.
I reran and used perf top during the first parking_lot::Mutex run; here are the top entries there
Running on a single NUMA node seems to produce the same results.
|
Windows 10 (1909), Ryzen 3600X. Initially I ran it with Firefox open in the background. The parking_lot average was about twice as long as std's, but the minimum was better. I closed Firefox, and surprisingly the numbers got worse. With FF open the avg varies from 40-80ms, and the minimum often dips below 10ms. With it closed, the average is very consistently around 85ms (±2), and the minimum 40-70ms. I ran it about 10 times total, alternating and without alternating (it didn't make a difference). I won't post them all since they were very similar. These were all "extreme contention" (
Firefox open
Firefox closed
|
That's nifty. On Windows, applications can ask the OS to increase the timer granularity. Media playback programs often do it, and probably browsers too. I don't expect it to have the same effect, but can you also test using different power management profiles (Balanced/Performance)? |
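For reference, the timer-granularity request mentioned above is typically made through winmm's timeBeginPeriod/timeEndPeriod. A minimal sketch (not part of the benchmark), assuming a Windows target:

// Values are in milliseconds; timeEndPeriod must be called with the same value.
#[link(name = "winmm")]
extern "system" {
    fn timeBeginPeriod(period_ms: u32) -> u32;
    fn timeEndPeriod(period_ms: u32) -> u32;
}

fn main() {
    // Ask the OS for 1 ms timer granularity, as media players and browsers do.
    unsafe { timeBeginPeriod(1) };

    // ... run the contended-lock benchmark here ...

    // Restore the default granularity when done.
    unsafe { timeEndPeriod(1) };
}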
Ahh, I misapprehended... Well, the parking_lot mutex seems to spin in some cases while waiting for a lock to become available, and so we fall into what Linus says is the bad behavior of userspace spinlocks on Linux. What's the highly contested perf if all the spinning code in the parking_lot mutex is removed? |
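A hypothetical way to run that experiment (a sketch against the spin() function quoted earlier in the thread, not an actual parking_lot patch): make the spin phase give up immediately so a contended lock() parks the thread via the OS right away, with no userspace back-off.

pub fn spin(&mut self) -> bool {
    // Report the spin budget as exhausted on the very first call;
    // the caller then falls through to parking the thread.
    false
}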
That's the first thing that came to mind but I checked the timer interval and it's held at 1ms regardless of Firefox (I guess there's a driver or service keeping it at that, because I have everything in the task bar and tray closed). I manually set the interval to the minimum (0.5ms), re-ran and got the same results.
Doesn't seem to make any significant difference. I ran the bench on the same machine on Linux (Manjaro, 5.4). Times are consistently like these, regardless of what's open:
|
I made some changes to the benchmark to use criterion, and I added kernel/user CPU time reports (which might be broken, as I think they don't account for task-switching costs). This changes the way each iteration is run compared to matklad's benchmark: criterion selects n_rounds and increases n_ops after each iteration. This led to some jitter in the first iterations, as the first locks seem to cost more. Here are the results: https://speedy37.github.io/lock-bench/ |
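For anyone who wants to reproduce that setup, a rough criterion skeleton could look like the following (a sketch, not Speedy37's actual code; the thread and op counts are placeholders standing in for the "extreme contention" case):

use criterion::{criterion_group, criterion_main, Criterion};
use std::sync::Arc;
use std::thread;

fn extreme_contention(c: &mut Criterion) {
    c.bench_function("parking_lot::Mutex extreme contention", |b| {
        b.iter(|| {
            // 32 threads hammering one shared lock, 10_000 increments each
            // (placeholder for the 32-thread / 2-lock extreme case).
            let lock = Arc::new(parking_lot::Mutex::new(0u64));
            let handles: Vec<_> = (0..32)
                .map(|_| {
                    let lock = Arc::clone(&lock);
                    thread::spawn(move || {
                        for _ in 0..10_000 {
                            *lock.lock() += 1;
                        }
                    })
                })
                .collect();
            for h in handles {
                h.join().unwrap();
            }
        })
    });
}

criterion_group!(benches, extreme_contention);
criterion_main!(benches);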
I tested removing the thread_yield for the Mutex while keeping the original spinning behaviour, and it improved the extreme contention benchmark significantly.
Heavy contention
And after the change:
Heavy contention
|
nvm |
Windows 10 - R7 1800X (8 cores, 16 threads), cacheline results (see my previous post for non-cacheline results). Avg is a bit worse.
|
parking_lot seems to be very good when running on a single CCX:
4 cores, 8 threads, 1 CCX
8 cores, 8 threads, 2CCX
|
No degradation on my i9-9960x or i7-7700HQ |
I can confirm on my AMD TR 1950X (16 Cores, 32 Threads): Extreme contention:
Heavy contention:
Light contention:
No contention:
For comparison, using only one NUMA node (8 cores of 16, 16 threads):
Using only one CCX (4 Cores of 16, 8 threads, same NUMA node, shared L3):
Using spinning without yield did not help, in contrast to my earlier comment where I ran it on an i5-6440HQ. |
This blog post, Mutexes Are Faster Than Spinlocks, generated a Reddit discussion about lock implementations. This made some users benchmark parking_lot on Windows, and some of the results shown are very problematic; others could simply be better (since parking_lot is always faster than spin on Linux and Mac, but not on Windows for some reason). I did not run this benchmark since I don't currently have Windows installed, but since the user hadn't filed an issue I decided to post it here. Their comment was: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd31um3/
The results:
Benchmark code: https://github.com/matklad/lock-bench
Windows 10 Pro
Intel Core i7-5930k @ 3.5 GHz
stable-x86_64-pc-windows-msvc (default)
rustc 1.40.0 (73528e339 2019-12-16)
extreme contention
heavy contention
light contention
no contention