Task spawning in a loop is too slow (not caching stacks right?) #15363
Here are the timings I got:

This is definitely affected by #11730. Looking into stack caching, it also looks like this is just #11730 again. What's happening here is that you have a pool of schedulers, say 8. The main task is on scheduler N and spawns the sub-task onto scheduler N. The main task is then immediately stolen to scheduler N+1. While the main task is being reawoken, the sub-task exits, caching its stack on scheduler N, not N+1. Hence, when the main task spawns another sub-task, the old stack is cached on the wrong scheduler. This is kind of like a game of cat-and-mouse. When I ran the program with RUST_THREADS=8, there were 5668 stack cache misses (compared to the 100k tasks spawned), whereas with RUST_THREADS=1 there is only one stack miss. Lots of these benchmarks are inherently single-threaded (like this one), so Go appears to blow us away, but keep in mind that they're single-threaded by default and we're multi-threaded by default. This is also compounded by #11730, which makes the situation even worse, sadly!

tl;dr - this is a dupe of #11730 unless we want to globally cache stacks instead of per-scheduler.
Wow, that's some seriously impressive investigation.
Would it be desirable to do more global stack caching? I can see something along the lines of "try to steal stacks from other schedulers before allocating" being workable, but that adds some overhead to the case of spawning when there aren't stacks available to steal. Might be able to get around that with additional cleverness, but I have no idea whether that is actually worth the complexity cost.
On Linux it could keep a cache for each CPU core and use |
@alexcrichton would you mind adding the Go source for the program you compared against in your investigation, to ease future attempts to reproduce your results?
@toddaaro, that's not a bad idea! I would suspect that an invocation of

@pnkfelix, sure!

```go
// foo.go
package main

func main() {
	for i := 0; i < 100000; i++ {
		c := make(chan int)
		go func() { c <- 5 }()
		<-c
	}
}
```

```rust
// foo.rs
#![feature(phase)]
#[phase(plugin)]
extern crate green;

use std::task::TaskBuilder;

green_start!(main)

fn main() {
    for _ in range(0u, 100000) {
        let future = TaskBuilder::new().stack_size(65536).try_future(proc() {});
        drop(future.unwrap());
    }
}
```
@alexcrichton the hackiest approach is to just copy the current workstealing code, but apply it to workstealing deques of stacks instead of tasks. I'll think about it some and try to test it out.
Using the sleeper list removal change I've linked in #11730, this test goes from ~10s to ~4s on my machine. Disclaimer being that I don't actually believe it works yet, but it seems promising.
Why tie it to schedulers instead of sharing the caching mechanism between native / green threads?
#17325 means this is no longer relevant to the standard libraries. Stack caching for native threads does exist but Rust is currently letting the C standard library handle it. |
This program is 10x slower than the equivalent in Go:

Profiling indicates lots of stack allocation. Interestingly enough, bumping up `RUST_MAX_CACHED_STACKS` makes things much slower, so I believe stacks are never getting cached. Perhaps the stack isn't being returned by the time the future is unwrapped?

cc @alexcrichton