Task spawning in a loop is too slow (not caching stacks right?) #15363

Closed
pcwalton opened this issue Jul 3, 2014 · 10 comments
Labels
A-concurrency (Area: Concurrency), A-runtime (Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows), I-slow (Issue: Problems and improvements with respect to performance of generated code)

Comments

@pcwalton (Contributor) commented Jul 3, 2014

This program is 10x slower than the equivalent in Go:

extern crate green;
extern crate native; // required for the `native::NativeTaskBuilder` import below

use std::task::TaskBuilder;
use green::{SchedPool, PoolConfig, GreenTaskBuilder};
use native::NativeTaskBuilder;

fn main() {
    let mut pool = SchedPool::new(PoolConfig::new());
    for _ in range(0u, 100000) {
        // Spawn a no-op green task with a 64 KiB stack and immediately
        // wait for it to finish.
        let future = TaskBuilder::new().green(&mut pool).stack_size(65536).try_future(proc() {});
        drop(future.unwrap());
    }
    pool.shutdown();
}

Profiling indicates lots of stack allocation. Interestingly enough, bumping up RUST_MAX_CACHED_STACKS makes things much slower. So I believe stacks are never getting cached. Perhaps the stack isn't being returned by the time the future is unwrapped?
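
For illustration, here is a minimal sketch of the kind of bounded, per-scheduler stack cache that RUST_MAX_CACHED_STACKS is meant to limit. This is present-day Rust with hypothetical Stack and StackCache types, not the actual runtime code:

// Hypothetical sketch: a bounded stack cache. Reusing a cached stack
// avoids a fresh mmap; the bound keeps memory usage in check.
struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

struct StackCache {
    cached: Vec<Stack>,
    max_cached: usize, // analogous to RUST_MAX_CACHED_STACKS
}

impl StackCache {
    fn take(&mut self, size: usize) -> Stack {
        // Reuse a cached stack when one is available (ignoring size
        // matching for simplicity); otherwise pay for a fresh allocation.
        self.cached
            .pop()
            .unwrap_or_else(|| Stack { mem: vec![0u8; size] })
    }

    fn give_back(&mut self, stack: Stack) {
        // Keep at most max_cached stacks; any extra is dropped,
        // i.e. its memory is released.
        if self.cached.len() < self.max_cached {
            self.cached.push(stack);
        }
    }
}

If the sub-task's stack is not back in this cache by the time the next iteration calls take(), every spawn falls into the allocation path, which would match the profile described above.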

cc @alexcrichton

@alexcrichton (Member)

Here are the timings I got:

$ go build foo.go
$ time ./foo
./foo  0.04s user 0.00s system 101% cpu 0.038 total
$ time GOMAXPROCS=8 ./foo
GOMAXPROCS=8 ./foo  0.20s user 0.09s system 208% cpu 0.140 total


$ rustc -O foo.rs
$ time ./foo
./foo  2.45s user 1.51s system 261% cpu 1.517 total
$ time RUST_THREADS=1 ./foo
RUST_THREADS=1 ./foo  0.13s user 0.00s system 99% cpu 0.136 total

This is definitely affected by #11730.

Looking into stack caching, it also looks like this is just #11730 again. What's happening here is that you have a pool of schedulers, let's say 8. The main task is on scheduler N and spawns the sub-task onto scheduler N. The main task is then immediately stolen by scheduler N+1. While the main task is being reawoken, the sub-task exits, caching its stack on scheduler N, not N+1. Hence, when the main task spawns another sub-task, the old stack is cached on the wrong scheduler.

This is kind of like a game of cat-and-mouse. When I ran the program with RUST_THREADS=8, there were 5668 stack cache misses (compared to the 100k tasks spawned), whereas with RUST_THREADS=1 there is only one stack cache miss.

Lots of these benchmarks are inherently single-threaded (like this one), so Go appears to blow us away, but keep in mind that Go is single-threaded by default and we're multi-threaded by default. This is also compounded by #11730, making the situation even worse, sadly!


tl;dr - this is a dupe of #11730 unless we want to globally cache stacks instead of per-scheduler.
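
To make the "globally cache stacks instead of per-scheduler" alternative concrete, here is a minimal sketch in present-day Rust. The names and the Mutex-protected pool are illustrative assumptions, not the runtime's design: one process-wide pool shared by every scheduler, so a stack freed on scheduler N can be reused by a task spawned from scheduler N+1.

// Hypothetical sketch: a single process-wide stack pool.
use std::sync::Mutex;

struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

static GLOBAL_STACK_POOL: Mutex<Vec<Stack>> = Mutex::new(Vec::new());

fn take_stack(size: usize) -> Stack {
    // Any scheduler can reuse any cached stack, so the
    // "wrong scheduler" miss described above disappears.
    GLOBAL_STACK_POOL
        .lock()
        .unwrap()
        .pop()
        .unwrap_or_else(|| Stack { mem: vec![0u8; size] })
}

fn give_back_stack(stack: Stack, max_cached: usize) {
    let mut pool = GLOBAL_STACK_POOL.lock().unwrap();
    if pool.len() < max_cached {
        pool.push(stack);
    }
    // otherwise the stack is dropped here and its memory released
}

The trade-off is that every spawn and exit now touches a shared lock, which is exactly the kind of contention a per-scheduler cache avoids.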

@pcwalton (Contributor, Author) commented Jul 3, 2014

Wow, that's some seriously impressive investigation.

@toddaaro (Contributor) commented Jul 6, 2014

Would it be desirable to do more global stack caching? I can see something along the lines of "try to steal stacks from other schedulers before allocating" being workable, but that adds some overhead to the case of spawning when there aren't stacks available to steal. Might be able to get around that with additional cleverness, but I have no idea if that is actually worth the complexity cost.
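
A rough sketch of that "steal before allocating" shape, in present-day Rust with made-up types rather than the actual scheduler code: check the local cache first, then the siblings' caches, and only allocate when every cache is empty.

// Hypothetical sketch: per-scheduler caches with a steal fallback.
use std::sync::{Arc, Mutex};

struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

struct Scheduler {
    local_stacks: Arc<Mutex<Vec<Stack>>>,
    sibling_stacks: Vec<Arc<Mutex<Vec<Stack>>>>,
}

impl Scheduler {
    fn take_stack(&self, size: usize) -> Stack {
        // Fast path: a stack this scheduler cached itself.
        if let Some(s) = self.local_stacks.lock().unwrap().pop() {
            return s;
        }
        // Slow path: try to steal a cached stack from a sibling.
        for sib in &self.sibling_stacks {
            if let Some(s) = sib.lock().unwrap().pop() {
                return s;
            }
        }
        // Nothing cached anywhere: this is the extra overhead in the
        // "no stacks available to steal" case mentioned above.
        Stack { mem: vec![0u8; size] }
    }
}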

@thestinger (Contributor)

On Linux it could keep a cache for each CPU core and use sched_getcpu to pick the right cache. An alternative is using an intrusive linked list through the stacks to build a lock-free queue. I don't see a reason to tie it to schedulers, because it should be using the same cache for both green / native threads.
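
A minimal sketch of the per-CPU-cache idea, in present-day Rust and assuming the libc crate for the sched_getcpu(3) binding (the types here are illustrative, not an actual implementation):

// Hypothetical sketch: one stack cache per CPU core, indexed by
// sched_getcpu, shared by green and native threads alike.
use std::sync::Mutex;

struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

struct PerCpuStackCache {
    per_cpu: Vec<Mutex<Vec<Stack>>>, // one slot per CPU core
}

impl PerCpuStackCache {
    fn current_slot(&self) -> usize {
        // sched_getcpu returns -1 on failure; fall back to slot 0.
        let cpu = unsafe { libc::sched_getcpu() };
        if cpu < 0 { 0 } else { cpu as usize % self.per_cpu.len() }
    }

    fn take(&self, size: usize) -> Stack {
        self.per_cpu[self.current_slot()]
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Stack { mem: vec![0u8; size] })
    }

    fn give_back(&self, stack: Stack) {
        // A real implementation would also bound each slot.
        self.per_cpu[self.current_slot()].lock().unwrap().push(stack);
    }
}

Since the thread that frees a stack and the thread that next needs one usually run on the same core, the cache stays warm without being tied to any particular scheduler. The lock-free intrusive-list variant would replace the Mutex-protected slots with a Treiber-style stack threaded through the cached segments themselves.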

@pnkfelix (Member) commented Jul 6, 2014

@alexcrichton would you mind adding the Go source for the program you compared against in your investigation, to ease future attempts to reproduce your results?

@alexcrichton (Member)

@toddaaro, that's not a bad idea! I would suspect that an invocation of mmap is far slower than attempting to steal a stack from other schedulers, but that's also just guesswork. I would also suspect that a nontrivial implementation may not be worth it, but that's also just guesswork!

@pnkfelix, sure!

// foo.go
package main

func main() {
  for i := 0; i < 100000; i++ {
    c := make(chan int)
    go func() { c <- 5 }()
    <-c
  }
}
// foo.rs
#![feature(phase)]

#[phase(plugin)]
extern crate green;

use std::task::TaskBuilder;

green_start!(main)

fn main() {
    for _ in range(0u, 100000) {
        let future = TaskBuilder::new().stack_size(65536).try_future(proc() {});
        drop(future.unwrap());
    }
}
[~] $ go version   
go version go1.2.1 linux/amd64
[~] $ go build foo.go                
[~] $ time GOMAXPROCS=1 ./foo   
GOMAXPROCS=1 ./foo  0.04s user 0.00s system 100% cpu 0.040 total
[~] $ time GOMAXPROCS=8 ./foo              
GOMAXPROCS=8 ./foo  0.16s user 0.08s system 203% cpu 0.117 total
[~] $ rustc -v                
rustc 0.11.0-nightly (459f155f81291c46633e86a480628b50304ffb1c 2014-07-04 23:46:44 +0000)
[~] $ rustc -O foo.rs                 
[~] $ time RUST_THREADS=1 ./foo      
RUST_THREADS=1 ./foo  0.12s user 0.00s system 99% cpu 0.124 total
[~] $ time RUST_THREADS=8 ./foo   
RUST_THREADS=8 ./foo  1.01s user 0.61s system 266% cpu 0.608 total

@toddaaro (Contributor) commented Jul 6, 2014

@alexcrichton the hackiest approach is to just copy the current workstealing code, but apply it to workstealing deques of stacks instead of tasks. I'll think about it some and try to test it out.

@toddaaro (Contributor) commented Jul 6, 2014

Using the sleeper list removal change I've linked in #11730, this test goes from ~10s to ~4s on my machine. The disclaimer being that I don't actually believe it works yet, but it seems promising.

@thestinger (Contributor)

Why tie it to schedulers instead of sharing the caching mechanism between native / green threads?

@thestinger (Contributor)

#17325 means this is no longer relevant to the standard libraries. Stack caching for native threads does exist but Rust is currently letting the C standard library handle it.
