
Reduce contention in reactor #426

Closed
carllerche opened this issue Jun 15, 2018 · 20 comments

@carllerche
Member

carllerche commented Jun 15, 2018

The reactor requires state for each socket that it manages. Currently, this state is stored in a slab that is guarded by a RwLock. This was done for simplicity, but creates lock contention.

Option: Allocating per ScheduledIo

Storing each ScheduledIo in an Arc would not be easy. To do so would require using a *const ScheduledIo as the mio token. The problem is that mio can yield events for a token even after the socket has been dropped. So it would be possible for the Arc<ScheduledIo> to be dropped once all references are gone, and then for an event to arrive for that ScheduledIo. There would be no easy way to check whether that event is for a live ScheduledIo or a dropped one.
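To make the hazard concrete, here is a minimal sketch (types and fields are illustrative, not tokio's actual layout):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Illustrative stand-in for the reactor's per-socket state.
struct ScheduledIo {
    readiness: AtomicUsize,
}

// Use the allocation's address as the mio token.
fn token_for(io: &Arc<ScheduledIo>) -> usize {
    &**io as *const ScheduledIo as usize
}

// Called when mio yields an event for `token`. If every Arc clone was
// dropped before the event arrived, this pointer dangles, and nothing
// in the token itself distinguishes a live ScheduledIo from a freed one.
fn on_event(token: usize) {
    let io = unsafe { &*(token as *const ScheduledIo) };
    let _ = io.readiness.load(Ordering::SeqCst);
}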

Option: A lock-free slab

It would be possible to implement a lock-free slab. A slab requires:

  • Growable, indexable storage.
  • A list of free slots.

The storage can be achieved in a safe way by using an array of ScheduledIo vectors. These vectors would be lazily allocated as more storage is needed. The representation would be [AtomicPtr<ScheduledIo>; N], where N is large enough to satisfy the maximum storage requirements. The size of each ScheduledIo vector would depend on its index: at index 0, the size would be 64 (or some other constant), and the size would double for each subsequent index. Using this strategy, a token can be constructed that indexes a slot in this storage.

The main issue would be if multiple threads race to allocate the next vector: one thread would win and the others would have to throw their allocations away. This can be mitigated by using randomization to allocate the next chunk of storage before it is needed.
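Each page would sit behind one AtomicPtr entry in the fixed array and be allocated lazily; the sketch below shows only the token-to-slot addressing under the doubling scheme (constants are illustrative, and a 64-bit usize is assumed):

// Page 0 holds 64 slots, page 1 holds 128, page p holds 64 * 2^p, so
// page p covers tokens [64 * (2^p - 1), 64 * (2^(p+1) - 1)).
const FIRST_PAGE: usize = 64;

fn page_and_offset(token: usize) -> (usize, usize) {
    // Shift by FIRST_PAGE so that page boundaries fall on powers of two.
    let slot = token + FIRST_PAGE;
    // floor(log2(slot)) - log2(FIRST_PAGE) picks the page.
    let page = (63 - slot.leading_zeros() as usize) - FIRST_PAGE.trailing_zeros() as usize;
    let offset = slot - (FIRST_PAGE << page);
    (page, offset)
}

For example, token 0 maps to (page 0, offset 0), token 63 to (page 0, offset 63), and token 64 to (page 1, offset 0).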

The list of free slots can be implemented using a Treiber stack.
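A minimal sketch of a Treiber stack over free-slot indices, storing the "next" pointers intrusively in a parallel array so that pushes and pops allocate nothing (this sketch ignores the ABA problem, which a real implementation would need to address, e.g. with generation tags):

use std::sync::atomic::{AtomicUsize, Ordering};

const NULL: usize = usize::MAX; // sentinel: empty stack

struct FreeList {
    head: AtomicUsize,
}

impl FreeList {
    // Any thread may push a freed slot.
    fn push(&self, slot: usize, next_of: &[AtomicUsize]) {
        let mut head = self.head.load(Ordering::Acquire);
        loop {
            next_of[slot].store(head, Ordering::Relaxed);
            match self.head.compare_exchange(head, slot, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }

    // Any thread may pop a slot to claim it.
    fn pop(&self, next_of: &[AtomicUsize]) -> Option<usize> {
        let mut head = self.head.load(Ordering::Acquire);
        while head != NULL {
            let next = next_of[head].load(Ordering::Relaxed);
            match self.head.compare_exchange(head, next, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return Some(head),
                Err(actual) => head = actual,
            }
        }
        None
    }
}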

While this strategy would remove the lock and significantly reduce contention, it would still require all threads to synchronize on the Treiber stack. This could potentially be mitigated by sharding (having multiple lock-free slabs).

Another problem with this strategy is that the allocated memory never shrinks. This could potentially be mitigated by using crossbeam's epoch-based reclamation for the storage and by having a Treiber stack per chunk instead of one for the entire slab. In order to claim a slot, I believe the only solution would be to scan each chunk, from smallest to largest, for capacity. There would also need to be some heuristic for releasing larger chunks.

Option: Dedicated storage per thread

Because Tokio-based applications are expected to span a small number of physical threads, one option could be to have a slab per thread.

In this case, the slab would be "single consumer, multi producer": a single consumer (the owning thread) would claim slots in the slab, but any thread could free a slot. Modeling this would be similar to the above; however, each chunk would maintain an mpsc list of free slots. Also, because claiming is now single consumer, it should be possible to track which chunks have available capacity using an AtomicUsize as a bit field. When claiming a slot, this bit field would be scanned to find the smallest chunk with capacity.
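A sketch of what that bit field could look like (names and constants are illustrative; this supports up to 64 chunks on a 64-bit target):

use std::sync::atomic::{AtomicUsize, Ordering};

// One bit per chunk, set while the chunk has free slots.
struct ChunkMask {
    bits: AtomicUsize,
}

impl ChunkMask {
    // Only the owning (single-consumer) thread claims slots, so it can
    // simply pick the lowest set bit: the smallest chunk with capacity.
    fn smallest_with_capacity(&self) -> Option<usize> {
        let bits = self.bits.load(Ordering::Acquire);
        if bits == 0 {
            None
        } else {
            Some(bits.trailing_zeros() as usize)
        }
    }

    // Any thread may free a slot, so marking capacity is an atomic OR.
    fn mark_has_capacity(&self, chunk: usize) {
        self.bits.fetch_or(1 << chunk, Ordering::Release);
    }

    // The owning thread clears the bit when it fills a chunk.
    fn mark_full(&self, chunk: usize) {
        self.bits.fetch_and(!(1 << chunk), Ordering::Release);
    }
}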

The problem here is how to make these per-thread slabs indexable. There would probably need to be some registry where the slabs are tracked as they are created for each thread.

Conclusion

I don't know off the top of my head which would be the best strategy to explore. We should use benchmarks, including benchmarks for Tokio-based applications (such as hyper), to help guide us.

@gterzian

Are you referring to the io_dispatch: RwLock<Slab<ScheduledIo>>?

@carllerche
Member Author

That is correct.

I have some thoughts on how to address this lock. I will update the issue description with more details soon.

@gterzian

gterzian commented Jun 15, 2018

A perhaps naive idea:

Instead of sharing a Weak<Inner> inside HandlePriv, one could perhaps share a std::sync::mpsc::Sender<ReactorMsg>?

And replace what is currently done directly on Inner with sending messages to the reactor, which could do the equivalent in response.

So, for example, something like wakeup could send a message to the reactor, which could then simply do self.inner.wakeup.set_readiness(mio::Ready::readable()).unwrap() directly on its own Inner, which would not have to be shared.

Essentially HandlePriv could contain a sender, and Handle could contain an Option<Sender>...

The reactor could receive on a channel and handle messages inside run just before calling reactor.turn(None).unwrap().

By handling only one message before each call to turn, you would also get a few guarantees, for example that wakeup could never be called twice within one iteration of the event loop (you could even take it further and have different channels for different types of tasks, guaranteeing the sequence of their execution within one iteration of the event loop).
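A rough sketch of the shape of this proposal; ReactorMsg and turn_once are made-up names, and the real tokio calls are left as comments:

use std::sync::mpsc::{Receiver, TryRecvError};

// Illustrative message type, not a tokio type.
enum ReactorMsg {
    Wakeup,
    // further variants for the other operations done on Inner today
}

// Handle at most one pending message, then turn the event loop.
fn turn_once(rx: &Receiver<ReactorMsg>) {
    match rx.try_recv() {
        Ok(ReactorMsg::Wakeup) => {
            // would do: self.inner.wakeup.set_readiness(mio::Ready::readable()).unwrap()
        }
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
    }
    // would do: reactor.turn(None).unwrap()
}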

carllerche assigned ghost and unassigned seanmonstar Jun 15, 2018
@carllerche
Member Author

@ghotiphud if I understand your suggestion correctly, this would require an allocation per event. Sending a message over an mpsc channel allocates internally.

I updated the original issue text w/ some additional details.

@ghost

ghost commented Jul 23, 2018

Do we have any evidence that RwLock causes significant contention?

A few quick easy tweaks we could do, off the top of my head:

  • Replace std::sync::RwLock with parking_lot::RwLock.
  • Maybe we don't need to hold the lock in add_source when calling self.io.register.

Also, I worry that going for a lock-free solution would be a lot of work for questionable performance gains. Perhaps implementing a reasonably scalable slab by sharding locks would be good enough (e.g. akin to how ConcurrentHashMap is implemented in Java).
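For illustration, sharding could look roughly like this; the constants, names, and token-to-shard mapping are all made up here:

use std::sync::RwLock;

const SHARDS: usize = 16; // illustrative shard count

// N independent locks instead of one: readers touching different
// shards no longer contend on a single RwLock.
struct ShardedSlab<T> {
    shards: Vec<RwLock<Vec<Option<T>>>>,
}

impl<T> ShardedSlab<T> {
    fn new() -> Self {
        ShardedSlab {
            shards: (0..SHARDS).map(|_| RwLock::new(Vec::new())).collect(),
        }
    }

    // The low bits of the token pick the shard; the rest index into it.
    fn shard_for(token: usize) -> usize {
        token % SHARDS
    }

    fn with<R>(&self, token: usize, f: impl FnOnce(&T) -> R) -> Option<R> {
        let guard = self.shards[Self::shard_for(token)].read().unwrap();
        guard.get(token / SHARDS).and_then(|s| s.as_ref()).map(f)
    }
}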

@seanmonstar
Member

The "evidence" I have is noticing hyper's benchmarks are a fair amount slower (varies on environment) when using the default runtime versus using current_thread times the numer of cores. When profiling those slower benchmarks, pthread_rwlock_i_cant_remember_exactly_which_one was near the top.

@carllerche
Member Author

@stjepang these are all good ideas and worth doing as an initial step. We can then profile to see if it remains a top hit.

This will most likely be less of an issue once #424 lands. When that happens, there can be one RwLock shard per thread, and in most cases the lock will only be accessed from that thread. So that, plus the two items you listed, will probably be enough.

Thoughts?

@jonhoo
Contributor

jonhoo commented Jul 23, 2018

Anecdotally, I've found that RwLock scales quite poorly across cores, because all of the cores still have to update the same cache line even if they all take read locks. I don't know if parking_lot::RwLock is better. See also a decent amount of discussion of this in this thread.

@seanmonstar
Member

I tried running the hyper benchmarks with the default runtime by swapping RwLock from std to parking_lot, and at least in my constrained virtual environment, I didn't see any noticeable difference. It could be an improvement in general, though; I can't say.

@jonhoo
Contributor

jonhoo commented Jul 23, 2018

@seanmonstar that sounds about right -- I don't think parking_lot does something to make the locks more scalable; it just implements them with less overhead.

@Amanieu

Amanieu commented Aug 9, 2018

If you enable the "nightly" feature, then parking_lot uses inline asm to take advantage of Intel's HLE (hardware lock elision), which can completely eliminate contention on the reference count for readers.

@ghost

ghost commented Aug 9, 2018

@Amanieu

I used this tool to check whether I have HLE on my machine and the answer is yes: https://github.com/andikleen/tsx-tools

Then, I replaced the sharded lock with parking_lot::RwLock and enabled nightly for parking_lot:

parking_lot = { version = "0.6.3", features = ["nightly"] }

However, bench-poll shows the same results as a plain parking_lot::RwLock without nightly enabled. Is there something I'm missing? Maybe HLE for some reason doesn't kick in?

@Amanieu

Amanieu commented Aug 9, 2018

@stjepang Huh, that's strange. Try going into the benchmark directory in parking_lot and run these commands:

cargo run --features nightly --release --bin rwlock 0 4 1 0 1 2
cargo run --release --bin rwlock 0 4 1 0 1 2

The first one should be much faster if you have HLE.

@ghost

ghost commented Aug 9, 2018

The first command:

parking_lot::RwLock  - [write]      0.000 kHz [read]  53537.072 kHz

The second command:

parking_lot::RwLock  - [write]      0.000 kHz [read]  31690.419 kHz

There's definitely a difference. Is it as significant as you expected?

@Amanieu

Amanieu commented Aug 10, 2018

This is what I get:

parking_lot::RwLock  - [write]      0.000 kHz [read] 108301.867 kHz

parking_lot::RwLock  - [write]      0.000 kHz [read]  30468.357 kHz

This might be because I have a quad-core CPU and you have a dual-core CPU.

@ghost

ghost commented Oct 25, 2018

Ok, so we did two things to reduce contention:

  1. Use a sharded RW lock, which scales better than a plain RW lock under concurrent reads.
  2. Use a reactor per worker rather than a single global reactor.

The combination of those two mitigations makes contention in reactors a non-issue. I don't think there is any need (or even possibility) to improve things further.

@carllerche If you agree, let's close this issue?

@carllerche
Member Author

@stjepang we could close it, but now that there is one reactor per worker thread, we can do better.

The reactor could keep two slabs internally, one for access from the current thread and one for access cross thread.

That said, this would be a fairly large change, as we would need some sort of slab that allows for Sync reads of slots and Sync releases of slots.
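A rough sketch of the shape this could take (illustrative types only, not a design): the point is that the common case, access from the reactor's own thread, never takes a lock.

use std::cell::RefCell;
use std::sync::Mutex;

struct ScheduledIo; // stand-in for the per-socket state

struct Reactor {
    // Claimed and accessed only from the reactor's own thread.
    local: RefCell<Vec<Option<ScheduledIo>>>,
    // Used when a resource is registered or released from another
    // thread; this is the part that still needs Sync access.
    shared: Mutex<Vec<Option<ScheduledIo>>>,
}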

@leoyvens

Having a reactor per thread results in a different kind of contention issue: if a thread is busy with futures that are expensive to poll, then any ready I/O in its reactor has to wait, even if other threads are idle. The ideal solution would be for one thread to be able to poll the reactors of others when idle; is this feasible?

To exemplify: in my use case there is a relatively small number of tasks in the system, but they can be expensive to poll, so moving to one reactor per thread was probably a pessimization. I'm considering putting all of the work in blocking so that the reactors can always respond without waiting.

@carllerche
Member Author

@leodasvacas If you have expensive futures, you should annotate them w/ blocking. This roughly enables what you describe.
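For reference, a minimal sketch of what that looks like, assuming tokio-threadpool 0.1 and futures 0.1 (expensive() is a stand-in for the costly work):

extern crate futures;
extern crate tokio_threadpool;

use futures::future::{poll_fn, Future};
use tokio_threadpool::{blocking, ThreadPool};

// Stand-in for work that is expensive to poll.
fn expensive() -> u64 {
    (0..10_000_000u64).sum()
}

fn main() {
    let pool = ThreadPool::new();
    pool.spawn(
        // blocking() tells the pool to hand this worker's queue to
        // another thread while the closure runs, so the reactor and
        // other tasks are not starved.
        poll_fn(|| blocking(|| expensive()).map_err(|_| ()))
            .map(|n| println!("result: {}", n)),
    );
    pool.shutdown_on_idle().wait().unwrap();
}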

@carllerche
Member Author

Closing for now. I think there is more work we can do down the line.
