Task spawning in a loop is too slow (not caching stacks right?) #15363

Closed
pcwalton opened this issue Jul 3, 2014 · 10 comments
Labels
A-concurrency (Area: Concurrency), A-runtime (Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows), I-slow (Issue: Problems and improvements with respect to performance of generated code)

Comments

@pcwalton (Contributor) commented Jul 3, 2014

This program is 10x slower than the equivalent in Go:

extern crate green;
extern crate native; // required for the `native::NativeTaskBuilder` import below

use std::task::TaskBuilder;
use green::{SchedPool, PoolConfig, GreenTaskBuilder};
use native::NativeTaskBuilder;

fn main() {
    let mut pool = SchedPool::new(PoolConfig::new());
    for _ in range(0u, 100000) {
        // Spawn a no-op green task with a 64 KiB stack and immediately
        // wait for it to finish.
        let future = TaskBuilder::new().green(&mut pool).stack_size(65536).try_future(proc() {});
        drop(future.unwrap());
    }
    pool.shutdown();
}

Profiling indicates lots of stack allocation. Interestingly enough, bumping up RUST_MAX_CACHED_STACKS makes things much slower. So I believe stacks are never getting cached. Perhaps the stack isn't being returned by the time the future is unwrapped?
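
For illustration, here is a minimal sketch of the kind of bounded, per-scheduler stack cache that RUST_MAX_CACHED_STACKS is meant to limit. This is present-day Rust with hypothetical Stack and StackCache types, not the actual runtime code:

// Hypothetical sketch: a bounded stack cache. Reusing a cached stack
// avoids a fresh mmap; the bound keeps memory usage in check.
struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

struct StackCache {
    cached: Vec<Stack>,
    max_cached: usize, // analogous to RUST_MAX_CACHED_STACKS
}

impl StackCache {
    fn take(&mut self, size: usize) -> Stack {
        // Reuse a cached stack when one is available (ignoring size
        // matching for simplicity); otherwise pay for a fresh allocation.
        self.cached
            .pop()
            .unwrap_or_else(|| Stack { mem: vec![0u8; size] })
    }

    fn give_back(&mut self, stack: Stack) {
        // Keep at most max_cached stacks; any extra is dropped,
        // i.e. its memory is released.
        if self.cached.len() < self.max_cached {
            self.cached.push(stack);
        }
    }
}

If the sub-task's stack is not back in this cache by the time the next iteration calls take(), every spawn falls into the allocation path, which would match the profile described above.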

cc @alexcrichton

@alexcrichton (Member)

Here are the timings I got:

$ go build foo.go
$ time ./foo
./foo  0.04s user 0.00s system 101% cpu 0.038 total
$ time GOMAXPROCS=8 ./foo
GOMAXPROCS=8 ./foo  0.20s user 0.09s system 208% cpu 0.140 total


$ rustc -O foo.rs
$ time ./foo
./foo  2.45s user 1.51s system 261% cpu 1.517 total
$ time RUST_THREADS=1 ./foo
RUST_THREADS=1 ./foo  0.13s user 0.00s system 99% cpu 0.136 total

This is definitely affected by #11730.

Looking into stack caching, it also looks like this is just #11730 again. What's happening here is that you have a pool of schedulers, let's say 8. The main task is on scheduler N and spawns the sub-task onto scheduler N. The main task is then immediately stolen by scheduler N+1. While the main task is being reawoken, the sub-task exits, caching its stack on scheduler N, not N+1. Hence, when the main task spawns another sub-task, the old stack is cached on the wrong scheduler.

This is kind of like a game of cat-and-mouse. When I ran the program with RUST_THREADS=8, there were 5668 stack cache misses (compared to the 100k tasks spawned), whereas with RUST_THREADS=1 there is only one stack cache miss.

Lots of these benchmarks are inherently single-threaded (like this one), so Go appears to blow us away, but keep in mind that Go is single-threaded by default and we're multi-threaded by default. This is also compounded by #11730, making the situation even worse, sadly!


tl;dr - this is a dupe of #11730 unless we want to globally cache stacks instead of per-scheduler.
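
To make the "globally cache stacks instead of per-scheduler" alternative concrete, here is a minimal sketch in present-day Rust. The names and the Mutex-protected pool are illustrative assumptions, not the runtime's design: one process-wide pool shared by every scheduler, so a stack freed on scheduler N can be reused by a task spawned from scheduler N+1.

// Hypothetical sketch: a single process-wide stack pool.
use std::sync::Mutex;

struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

static GLOBAL_STACK_POOL: Mutex<Vec<Stack>> = Mutex::new(Vec::new());

fn take_stack(size: usize) -> Stack {
    // Any scheduler can reuse any cached stack, so the
    // "wrong scheduler" miss described above disappears.
    GLOBAL_STACK_POOL
        .lock()
        .unwrap()
        .pop()
        .unwrap_or_else(|| Stack { mem: vec![0u8; size] })
}

fn give_back_stack(stack: Stack, max_cached: usize) {
    let mut pool = GLOBAL_STACK_POOL.lock().unwrap();
    if pool.len() < max_cached {
        pool.push(stack);
    }
    // otherwise the stack is dropped here and its memory released
}

The trade-off is that every spawn and exit now touches a shared lock, which is exactly the kind of contention a per-scheduler cache avoids.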

@pcwalton (Contributor, Author) commented Jul 3, 2014

Wow, that's some seriously impressive investigation.

@toddaaro (Contributor) commented Jul 6, 2014

Would it be desirable to do more global stack caching? I can see something along the lines of "try to steal stacks from other schedulers before allocating" being workable, but that adds some overhead to the case of spawning when there aren't stacks available to steal. Might be able to get around that with additional cleverness, but I have no idea if that is actually worth the complexity cost.
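
A rough sketch of that "steal before allocating" shape, in present-day Rust with made-up types rather than the actual scheduler code: check the local cache first, then the siblings' caches, and only allocate when every cache is empty.

// Hypothetical sketch: per-scheduler caches with a steal fallback.
use std::sync::{Arc, Mutex};

struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

struct Scheduler {
    local_stacks: Arc<Mutex<Vec<Stack>>>,
    sibling_stacks: Vec<Arc<Mutex<Vec<Stack>>>>,
}

impl Scheduler {
    fn take_stack(&self, size: usize) -> Stack {
        // Fast path: a stack this scheduler cached itself.
        if let Some(s) = self.local_stacks.lock().unwrap().pop() {
            return s;
        }
        // Slow path: try to steal a cached stack from a sibling.
        for sib in &self.sibling_stacks {
            if let Some(s) = sib.lock().unwrap().pop() {
                return s;
            }
        }
        // Nothing cached anywhere: this is the extra overhead in the
        // "no stacks available to steal" case mentioned above.
        Stack { mem: vec![0u8; size] }
    }
}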

@thestinger (Contributor)

On Linux it could keep a cache for each CPU core and use sched_getcpu to pick the right cache. An alternative is using an intrusive linked list through the stacks to build a lock-free queue. I don't see a reason to tie it to schedulers, because it should be using the same cache for both green / native threads.
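
A minimal sketch of the per-CPU-cache idea, in present-day Rust and assuming the libc crate for the sched_getcpu(3) binding (the types here are illustrative, not an actual implementation):

// Hypothetical sketch: one stack cache per CPU core, indexed by
// sched_getcpu, shared by green and native threads alike.
use std::sync::Mutex;

struct Stack {
    mem: Vec<u8>, // stand-in for an mmap'd stack segment
}

struct PerCpuStackCache {
    per_cpu: Vec<Mutex<Vec<Stack>>>, // one slot per CPU core
}

impl PerCpuStackCache {
    fn current_slot(&self) -> usize {
        // sched_getcpu returns -1 on failure; fall back to slot 0.
        let cpu = unsafe { libc::sched_getcpu() };
        if cpu < 0 { 0 } else { cpu as usize % self.per_cpu.len() }
    }

    fn take(&self, size: usize) -> Stack {
        self.per_cpu[self.current_slot()]
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Stack { mem: vec![0u8; size] })
    }

    fn give_back(&self, stack: Stack) {
        // A real implementation would also bound each slot.
        self.per_cpu[self.current_slot()].lock().unwrap().push(stack);
    }
}

Since the thread that frees a stack and the thread that next needs one usually run on the same core, the cache stays warm without being tied to any particular scheduler. The lock-free intrusive-list variant would replace the Mutex-protected slots with a Treiber-style stack threaded through the cached segments themselves.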

@pnkfelix (Member) commented Jul 6, 2014

@alexcrichton would you mind adding the Go source for the program you compared against in your investigation, to ease future attempts to reproduce your results?

@alexcrichton (Member)

@toddaaro, that's not a bad idea! I would suspect that an invocation of mmap is far slower than attempting to steal a stack from other schedulers, but that's also just guesswork. I would also suspect that a nontrivial implementation may not be worth it, but that's also just guesswork!

@pnkfelix, sure!

// foo.go
package main

func main() {
  for i := 0; i < 100000; i++ {
    c := make(chan int)
    go func() { c <- 5 }()
    <-c
  }
}
// foo.rs
#![feature(phase)]

#[phase(plugin)]
extern crate green;

use std::task::TaskBuilder;

green_start!(main)

fn main() {
    for _ in range(0u, 100000) {
        let future = TaskBuilder::new().stack_size(65536).try_future(proc() {});
        drop(future.unwrap());
    }
}
[~] $ go version   
go version go1.2.1 linux/amd64
[~] $ go build foo.go                
[~] $ time GOMAXPROCS=1 ./foo   
GOMAXPROCS=1 ./foo  0.04s user 0.00s system 100% cpu 0.040 total
[~] $ time GOMAXPROCS=8 ./foo              
GOMAXPROCS=8 ./foo  0.16s user 0.08s system 203% cpu 0.117 total
[~] $ rustc -v                
rustc 0.11.0-nightly (459f155f81291c46633e86a480628b50304ffb1c 2014-07-04 23:46:44 +0000)
[~] $ rustc -O foo.rs                 
[~] $ time RUST_THREADS=1 ./foo      
RUST_THREADS=1 ./foo  0.12s user 0.00s system 99% cpu 0.124 total
[~] $ time RUST_THREADS=8 ./foo   
RUST_THREADS=8 ./foo  1.01s user 0.61s system 266% cpu 0.608 total

@toddaaro (Contributor) commented Jul 6, 2014

@alexcrichton the hackiest approach is to just copy the current workstealing code, but apply it to workstealing deques of stacks instead of tasks. I'll think about it some and try to test it out.

@toddaaro (Contributor) commented Jul 6, 2014

Using the sleeper list removal change I've linked in #11730, this test goes from ~10s to ~4s on my machine. The disclaimer being that I don't actually believe it works yet, but it seems promising.

@thestinger (Contributor)

Why tie it to schedulers instead of sharing the caching mechanism between native / green threads?

@thestinger (Contributor)

#17325 means this is no longer relevant to the standard libraries. Stack caching for native threads does exist but Rust is currently letting the C standard library handle it.
