
Reduce contention in reactor #426

Closed
carllerche opened this issue Jun 15, 2018 · 20 comments

@carllerche
Member

carllerche commented Jun 15, 2018

The reactor requires state for each socket that it manages. Currently, this state is stored in a slab that is guarded by a RwLock. This was done for simplicity, but creates lock contention.

Option: Allocating per ScheduledIo

Storing each ScheduledIo in an Arc would not be easy. To do so would require using a *const ScheduledIo as the mio token. The problem is that mio can yield events for a token even after the socket has been dropped. So it would be possible for the Arc<ScheduledIo> to be dropped once all references are gone, and then for an event to arrive for that ScheduledIo. There would be no easy way to check whether that event is for a live ScheduledIo or a dropped one.
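To make the hazard concrete, here is a minimal sketch (types and fields are illustrative, not tokio's actual layout):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Illustrative stand-in for the reactor's per-socket state.
struct ScheduledIo {
    readiness: AtomicUsize,
}

// Use the allocation's address as the mio token.
fn token_for(io: &Arc<ScheduledIo>) -> usize {
    &**io as *const ScheduledIo as usize
}

// Called when mio yields an event for `token`. If every Arc clone was
// dropped before the event arrived, this pointer dangles, and nothing
// in the token itself distinguishes a live ScheduledIo from a freed one.
fn on_event(token: usize) {
    let io = unsafe { &*(token as *const ScheduledIo) };
    let _ = io.readiness.load(Ordering::SeqCst);
}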

Option: A lock-free slab

It would be possible to implement a lock-free slab. A slab requires:

  • Growable, indexable storage.
  • A list of free slots.

The storage can be achieved in a safe way by using an array of ScheduledIo vectors. These vectors would be lazily allocated as more storage is needed. The representation would be [AtomicPtr<ScheduledIo>; N], where N is large enough to satisfy the maximum storage requirements. The size of each ScheduledIo vector would depend on its index: at index 0, the size would be 64 (or some other constant), and the size would double for each subsequent index. Using this strategy, a token can be constructed that indexes a slot in this storage.

The main issue would be if multiple threads race to allocate the next vector: one thread would win and the others would have to throw their allocations away. This can be mitigated by using randomization to allocate the next chunk of storage before it is needed.
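Each page would sit behind one AtomicPtr entry in the fixed array and be allocated lazily; the sketch below shows only the token-to-slot addressing under the doubling scheme (constants are illustrative, and a 64-bit usize is assumed):

// Page 0 holds 64 slots, page 1 holds 128, page p holds 64 * 2^p, so
// page p covers tokens [64 * (2^p - 1), 64 * (2^(p+1) - 1)).
const FIRST_PAGE: usize = 64;

fn page_and_offset(token: usize) -> (usize, usize) {
    // Shift by FIRST_PAGE so that page boundaries fall on powers of two.
    let slot = token + FIRST_PAGE;
    // floor(log2(slot)) - log2(FIRST_PAGE) picks the page.
    let page = (63 - slot.leading_zeros() as usize) - FIRST_PAGE.trailing_zeros() as usize;
    let offset = slot - (FIRST_PAGE << page);
    (page, offset)
}

For example, token 0 maps to (page 0, offset 0), token 63 to (page 0, offset 63), and token 64 to (page 1, offset 0).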

The list of free slots can be implemented using a Treiber stack.
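A minimal sketch of a Treiber stack over free-slot indices, storing the "next" pointers intrusively in a parallel array so that pushes and pops allocate nothing (this sketch ignores the ABA problem, which a real implementation would need to address, e.g. with generation tags):

use std::sync::atomic::{AtomicUsize, Ordering};

const NULL: usize = usize::MAX; // sentinel: empty stack

struct FreeList {
    head: AtomicUsize,
}

impl FreeList {
    // Any thread may push a freed slot.
    fn push(&self, slot: usize, next_of: &[AtomicUsize]) {
        let mut head = self.head.load(Ordering::Acquire);
        loop {
            next_of[slot].store(head, Ordering::Relaxed);
            match self.head.compare_exchange(head, slot, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }

    // Any thread may pop a slot to claim it.
    fn pop(&self, next_of: &[AtomicUsize]) -> Option<usize> {
        let mut head = self.head.load(Ordering::Acquire);
        while head != NULL {
            let next = next_of[head].load(Ordering::Relaxed);
            match self.head.compare_exchange(head, next, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return Some(head),
                Err(actual) => head = actual,
            }
        }
        None
    }
}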

While this strategy would remove the lock and significantly reduce contention, it would still require all threads to synchronize on the Treiber stack. This could potentially be mitigated by sharding (having multiple lock-free slabs).

Another problem with this strategy is that the allocated memory never shrinks. This could potentially be mitigated by using crossbeam's epoch-based reclamation for the storage and by having a Treiber stack per chunk instead of one for the entire slab. In order to claim a slot, I believe the only solution would be to scan each chunk, from smallest to largest, for capacity. There would also need to be some heuristic for releasing larger chunks.

Option: Dedicated storage per thread

Because Tokio-based applications are expected to span a small number of physical threads, one option could be to have a slab per thread.

In this case, the slab would be "single consumer, multi producer": a single consumer (the owning thread) would claim slots in the slab, but any thread could free a slot. Modeling this would be similar to the above; however, each chunk would maintain an mpsc list of free slots. Also, because claiming is now single consumer, it should be possible to track which chunks have available capacity using an AtomicUsize as a bit field. When claiming a slot, this bit field would be scanned to find the smallest chunk with capacity.
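A sketch of what that bit field could look like (names and constants are illustrative; this supports up to 64 chunks on a 64-bit target):

use std::sync::atomic::{AtomicUsize, Ordering};

// One bit per chunk, set while the chunk has free slots.
struct ChunkMask {
    bits: AtomicUsize,
}

impl ChunkMask {
    // Only the owning (single-consumer) thread claims slots, so it can
    // simply pick the lowest set bit: the smallest chunk with capacity.
    fn smallest_with_capacity(&self) -> Option<usize> {
        let bits = self.bits.load(Ordering::Acquire);
        if bits == 0 {
            None
        } else {
            Some(bits.trailing_zeros() as usize)
        }
    }

    // Any thread may free a slot, so marking capacity is an atomic OR.
    fn mark_has_capacity(&self, chunk: usize) {
        self.bits.fetch_or(1 << chunk, Ordering::Release);
    }

    // The owning thread clears the bit when it fills a chunk.
    fn mark_full(&self, chunk: usize) {
        self.bits.fetch_and(!(1 << chunk), Ordering::Release);
    }
}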

The problem here is how to make these per-thread slabs indexable. There would probably need to be some registry where the slabs are tracked as they are created for each thread.

Conclusion

I don't know off the top of my head which would be the best strategy to explore. We should use benchmarks, including benchmarks for Tokio-based applications (such as hyper), to help guide us.

@gterzian

Are you referring to the io_dispatch: RwLock<Slab<ScheduledIo>>?

@carllerche
Member Author

That is correct.

I have some thoughts on how to address this lock. I will update the issue description with more details soon.

@gterzian

gterzian commented Jun 15, 2018

A perhaps naive idea:

Instead of sharing a Weak<Inner> inside HandlePriv, one could perhaps share a std::sync::mpsc::Sender<ReactorMsg>?

And replace what is currently done directly on Inner with sending messages to the reactor, which could do the equivalent in response.

So, for example, something like wakeup could send a message to the reactor, which could then simply do self.inner.wakeup.set_readiness(mio::Ready::readable()).unwrap() directly on its own Inner, which would not have to be shared.

Essentially HandlePriv could contain a sender, and Handle could contain an Option<Sender>...

The reactor could receive on a channel and handle messages inside run just before calling reactor.turn(None).unwrap().

By handling only one message before each call to turn, you would also get a few guarantees, for example that wakeup could never be called twice within one iteration of the event loop (you could even take it further and have different channels for different types of tasks, guaranteeing the sequence of their execution within one iteration of the event loop).
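A rough sketch of the shape of this proposal; ReactorMsg and turn_once are made-up names, and the real tokio calls are left as comments:

use std::sync::mpsc::{Receiver, TryRecvError};

// Illustrative message type, not a tokio type.
enum ReactorMsg {
    Wakeup,
    // further variants for the other operations done on Inner today
}

// Handle at most one pending message, then turn the event loop.
fn turn_once(rx: &Receiver<ReactorMsg>) {
    match rx.try_recv() {
        Ok(ReactorMsg::Wakeup) => {
            // would do: self.inner.wakeup.set_readiness(mio::Ready::readable()).unwrap()
        }
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
    }
    // would do: reactor.turn(None).unwrap()
}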

carllerche assigned ghost and unassigned seanmonstar Jun 15, 2018
@carllerche
Member Author

@ghotiphud if I understand your suggestion correctly, this would require an allocation per event. Sending a message over an mpsc channel allocates internally.

I updated the original issue text w/ some additional details.

@ghost

ghost commented Jul 23, 2018

Do we have any evidence that RwLock causes significant contention?

A few quick easy tweaks we could do, off the top of my head:

  • Replace std::sync::RwLock with parking_lot::RwLock.
  • Maybe we don't need to hold the lock in add_source when calling self.io.register.

Also, I worry that going for a lock-free solution would be a lot of work for questionable performance gains. Perhaps implementing a reasonably scalable slab by sharding locks would be good enough (e.g. akin to how ConcurrentHashMap is implemented in Java).
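For illustration, sharding could look roughly like this; the constants, names, and token-to-shard mapping are all made up here:

use std::sync::RwLock;

const SHARDS: usize = 16; // illustrative shard count

// N independent locks instead of one: readers touching different
// shards no longer contend on a single RwLock.
struct ShardedSlab<T> {
    shards: Vec<RwLock<Vec<Option<T>>>>,
}

impl<T> ShardedSlab<T> {
    fn new() -> Self {
        ShardedSlab {
            shards: (0..SHARDS).map(|_| RwLock::new(Vec::new())).collect(),
        }
    }

    // The low bits of the token pick the shard; the rest index into it.
    fn shard_for(token: usize) -> usize {
        token % SHARDS
    }

    fn with<R>(&self, token: usize, f: impl FnOnce(&T) -> R) -> Option<R> {
        let guard = self.shards[Self::shard_for(token)].read().unwrap();
        guard.get(token / SHARDS).and_then(|s| s.as_ref()).map(f)
    }
}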

@seanmonstar
Member

The "evidence" I have is noticing hyper's benchmarks are a fair amount slower (varies on environment) when using the default runtime versus using current_thread times the numer of cores. When profiling those slower benchmarks, pthread_rwlock_i_cant_remember_exactly_which_one was near the top.

@carllerche
Member Author

@stjepang these are all good ideas and worth doing as an initial step. We can then profile to see if it remains a top hit.

This will most likely be less of an issue once #424 lands. When that happens, there can be one RwLock shard per thread, and in most cases the lock will only be accessed from that thread. So that, plus the two items you listed, will probably be enough.

Thoughts?

@jonhoo
Contributor

jonhoo commented Jul 23, 2018

Anecdotally, I've found that RwLock scales quite poorly across cores, because all of the cores still have to update the same cache line even if they all take read locks. I don't know if parking_lot::RwLock is better. See also a decent amount of discussion of this in this thread.

@seanmonstar
Member

I tried running the hyper benchmarks with the default runtime by swapping RwLock from std to parking_lot, and at least in my constrained virtual environment, I didn't see any noticeable difference. It could be an improvement in general, though; I can't say.

@jonhoo
Contributor

jonhoo commented Jul 23, 2018

@seanmonstar that sounds about right -- I don't think parking_lot does something to make the locks more scalable; it just implements them with less overhead.

@Amanieu

Amanieu commented Aug 9, 2018

If you enable the "nightly" feature, then parking_lot uses inline asm to take advantage of Intel's HLE (hardware lock elision), which can completely eliminate contention on the reference count for readers.

@ghost

ghost commented Aug 9, 2018

@Amanieu

I used this tool to check whether I have HLE on my machine and the answer is yes: https://github.com/andikleen/tsx-tools

Then, I replaced the sharded lock with parking_lot::RwLock and enabled nightly for parking_lot:

parking_lot = { version = "0.6.3", features = ["nightly"] }

However, bench-poll shows the same results as a plain parking_lot::RwLock without nightly enabled. Is there something I'm missing? Maybe HLE for some reason doesn't kick in?

@Amanieu

Amanieu commented Aug 9, 2018

@stjepang Huh, that's strange. Try going into the benchmark directory in parking_lot and run these commands:

cargo run --features nightly --release --bin rwlock 0 4 1 0 1 2
cargo run --release --bin rwlock 0 4 1 0 1 2

The first one should be much faster if you have HLE.

@ghost

ghost commented Aug 9, 2018

The first command:

parking_lot::RwLock  - [write]      0.000 kHz [read]  53537.072 kHz

The second command:

parking_lot::RwLock  - [write]      0.000 kHz [read]  31690.419 kHz

There's definitely a difference. Is it as significant as you expected?

@Amanieu

Amanieu commented Aug 10, 2018

This is what I get:

parking_lot::RwLock  - [write]      0.000 kHz [read] 108301.867 kHz

parking_lot::RwLock  - [write]      0.000 kHz [read]  30468.357 kHz

This might be because I have a quad-core CPU and you have a dual-core CPU.

@ghost

ghost commented Oct 25, 2018

Ok, so we did two things to reduce contention:

  1. Use a sharded RW lock, which scales better than a plain RW lock under concurrent reads.
  2. Use a reactor per worker rather than a single global reactor.

The combination of those two mitigations makes contention in reactors a non-issue. I don't think there is any need (or even possibility) to improve things further.

@carllerche If you agree, let's close this issue?

@carllerche
Member Author

@stjepang we could close it, but now that there is one reactor per worker thread, we can do better.

The reactor could keep two slabs internally, one for access from the current thread and one for access cross thread.

That said, this would be a fairly large change, as we would need some sort of slab that allows for Sync reads of slots and Sync releases of slots.
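A rough sketch of the shape this could take (illustrative types only, not a design): the point is that the common case, access from the reactor's own thread, never takes a lock.

use std::cell::RefCell;
use std::sync::Mutex;

struct ScheduledIo; // stand-in for the per-socket state

struct Reactor {
    // Claimed and accessed only from the reactor's own thread.
    local: RefCell<Vec<Option<ScheduledIo>>>,
    // Used when a resource is registered or released from another
    // thread; this is the part that still needs Sync access.
    shared: Mutex<Vec<Option<ScheduledIo>>>,
}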

@leoyvens

Having a reactor per thread results in a different kind of contention issue: if a thread is busy with futures that are expensive to poll, then any ready I/O in its reactor has to wait, even if other threads are idle. The ideal solution would be for one thread to be able to poll the reactors of others when idle; is this feasible?

To exemplify: in my use case there is a relatively small number of tasks in the system, but they can be expensive to poll, so moving to one reactor per thread was probably a pessimization. I'm considering putting all of the work in blocking so that the reactors can always respond without waiting.

@carllerche
Member Author

@leodasvacas If you have expensive futures, you should annotate them w/ blocking. This roughly enables what you describe.
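For reference, a minimal sketch of what that looks like, assuming tokio-threadpool 0.1 and futures 0.1 (expensive() is a stand-in for the costly work):

extern crate futures;
extern crate tokio_threadpool;

use futures::future::{poll_fn, Future};
use tokio_threadpool::{blocking, ThreadPool};

// Stand-in for work that is expensive to poll.
fn expensive() -> u64 {
    (0..10_000_000u64).sum()
}

fn main() {
    let pool = ThreadPool::new();
    pool.spawn(
        // blocking() tells the pool to hand this worker's queue to
        // another thread while the closure runs, so the reactor and
        // other tasks are not starved.
        poll_fn(|| blocking(|| expensive()).map_err(|_| ()))
            .map(|n| println!("result: {}", n)),
    );
    pool.shutdown_on_idle().wait().unwrap();
}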

@carllerche
Member Author

Closing for now. I think there is more work we can do down the line.
