Reduce contention in reactor #426
Comments
Are you referring to the `RwLock` that guards the slab?
That is correct. I have some thoughts on how to address this lock. I will update the issue description with more details soon.
A perhaps naive idea: instead of sharing the reactor's state directly, other threads could send the reactor messages over a channel, and the reactor could receive on that channel and handle messages inside its poll loop. By handling only one message before each call to poll, the reactor would stay responsive to I/O while avoiding the shared lock.
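A rough sketch of that message-passing shape, with a made-up `Op` enum and `turn` method standing in for whatever the real reactor would use; this is illustrative, not tokio's API:

```rust
use std::sync::mpsc::{channel, Receiver};

// Hypothetical requests that other threads could send to the reactor instead
// of touching its slab directly. Names are illustrative, not tokio's API.
enum Op {
    Register(usize),
    Deregister(usize),
}

struct Reactor {
    rx: Receiver<Op>,
}

impl Reactor {
    // Handle at most one queued message per turn, as suggested above, then go
    // on to poll mio for I/O events.
    fn turn(&mut self) {
        if let Ok(op) = self.rx.try_recv() {
            match op {
                // The reactor owns its state, so no lock is needed here.
                Op::Register(token) => println!("register {}", token),
                Op::Deregister(token) => println!("deregister {}", token),
            }
        }
        // ... mio poll would happen here ...
    }
}

fn main() {
    let (tx, rx) = channel();
    let mut reactor = Reactor { rx };
    tx.send(Op::Register(7)).unwrap();
    reactor.turn();
}
```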
@ghotiphud if I understand your suggestion correctly, this would require an allocation per event, since sending a message over a channel means allocating for each send. I updated the original issue text with some additional details.
Do we have any evidence that this lock is actually a contention bottleneck? A few quick, easy tweaks we could do, off the top of my head:
Also, I worry that going for a lock-free solution would be a lot of work for questionable performance gains. Perhaps implementing a reasonably scalable slab by sharding locks would be good enough.
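To make the sharding idea concrete, here is a toy sketch (not tokio's implementation) where each shard has its own `Mutex` and the token encodes the shard; slot reuse is omitted for brevity:

```rust
use std::sync::Mutex;

const SHARDS: usize = 8; // illustrative shard count

// A toy sharded slab: each shard has its own lock, so threads touching
// different shards never contend. Freed slots are not reused here, to keep
// the sketch short.
struct ShardedSlab<T> {
    shards: Vec<Mutex<Vec<Option<T>>>>,
}

impl<T> ShardedSlab<T> {
    fn new() -> Self {
        ShardedSlab {
            shards: (0..SHARDS).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    // Insert into the shard picked by `shard_hint` (e.g. a thread id); the
    // returned token encodes both the shard and the slot within it.
    fn insert(&self, shard_hint: usize, value: T) -> usize {
        let shard = shard_hint % SHARDS;
        let mut slots = self.shards[shard].lock().unwrap();
        let slot = slots.len();
        slots.push(Some(value));
        slot * SHARDS + shard
    }

    fn remove(&self, token: usize) -> Option<T> {
        let (shard, slot) = (token % SHARDS, token / SHARDS);
        let mut slots = self.shards[shard].lock().unwrap();
        slots[slot].take()
    }
}

fn main() {
    let slab = ShardedSlab::new();
    let token = slab.insert(3, "socket state");
    assert_eq!(slab.remove(token), Some("socket state"));
}
```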
The "evidence" I have is noticing that hyper's benchmarks are a fair amount slower (it varies by environment) when using the default runtime versus using `current_thread` times the number of cores. When profiling those slower benchmarks, the slab's `RwLock` shows up as one of the top hits.
@stjepang these are all good ideas and worth doing as an initial step. We can then profile to see if it remains a top hit. This will most likely be less of an issue once #424 lands. When this happens, there can be one RwLock shard per thread and, in most cases, the lock will only be accessed from that thread. So that plus the two items you listed will probably be enough. Thoughts?
Anecdotally, I've found that
I tried running the hyper benchmarks with the default runtime by swapping
@seanmonstar that sounds about right -- I don't think
If you enable the "nightly" feature then `parking_lot` can take advantage of hardware lock elision (HLE).
I used this tool to check whether I have HLE on my machine, and the answer is yes: https://github.com/andikleen/tsx-tools

Then I replaced the sharded lock with `parking_lot = { version = "0.6.3", features = ["nightly"] }`. However, it didn't make a noticeable difference in the benchmarks.
@stjepang Huh, that's strange. Try going into the
The first one should be much faster if you have HLE.
The first command:
The second command:
There's definitely a difference. Is it as significant as you expected?
This is what I get:
This might be because I have a quad-core CPU and you have a dual core. |
Ok, so we did two things to reduce contention: the slab's lock is now sharded, and each worker thread has its own reactor.
The combination of those two mitigations makes contention in reactors a non-issue. I don't think there is any need (or even possibility) to improve things further. @carllerche If you agree, let's close this issue?
@stjepang we could close it, but now that there is one reactor per worker thread, we can do better. The reactor could keep two slabs internally: one for access from the current thread and one for cross-thread access. That said, this would be a fairly large change, as we would need some sort of slab that allows slots to be freed from other threads.
Having a reactor per thread results in a different kind of contention issue: if a thread is busy with futures that are expensive to poll, then any ready I/O in its reactor has to wait, even if other threads are idle. The ideal solution would be for an idle thread to be able to poll another thread's reactor; is this feasible? To exemplify: in my use case there is a relatively small number of tasks in the system, but they can be expensive to poll, so moving to one reactor per thread was probably a pessimization. I'm considering putting all work in the blocking poll so that the reactors can always respond without waiting.
@leodasvacas If you have expensive futures, you should annotate them with `blocking`.
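Assuming the elided advice refers to `tokio_threadpool::blocking` (the tokio 0.1-era API for marking a section of a future as blocking), here is a rough sketch of what such an annotation looks like; the workload and crate setup are illustrative:

```rust
extern crate futures;
extern crate tokio_threadpool;

use futures::future::{lazy, poll_fn};
use futures::Future;
use tokio_threadpool::{blocking, ThreadPool};

fn main() {
    let pool = ThreadPool::new();

    pool.spawn(lazy(|| {
        poll_fn(|| {
            // Work wrapped in `blocking` may take a long time; the pool
            // compensates by handing other tasks to another worker thread,
            // so the reactor and cheap futures keep making progress.
            blocking(|| {
                let mut sum = 0u64;
                for i in 0..10_000_000u64 {
                    sum = sum.wrapping_add(i);
                }
                sum
            })
            .map_err(|_| panic!("the thread pool shut down"))
        })
        .map(|sum| println!("sum = {}", sum))
    }));

    pool.shutdown_on_idle().wait().unwrap();
}
```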
Closing for now. I think there is more work we can do down the line.
The reactor requires state for each socket that it manages. Currently, this state is stored in a slab that is guarded by a `RwLock`. This was done for simplicity, but creates lock contention.

Option: Allocating per `ScheduledIo`

Storing each `ScheduledIo` in an `Arc` would not be easy. To do so would require using a `*const ScheduledIo` as the mio token. The problem is that mio can yield events for the token even after the socket has been dropped. So, it would be possible for the `Arc<ScheduledIo>` to be dropped due to all references being gone and then get an event for that `ScheduledIo`. There would be no easy way to check if that event is for a live `ScheduledIo` or a dropped one.
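To illustrate the hazard with a hypothetical, stripped-down `ScheduledIo` (this is a sketch, not tokio code): the raw pointer used as the token carries no information about whether the allocation behind it is still live.

```rust
use std::sync::Arc;

// Hypothetical per-socket state; the real ScheduledIo tracks readiness, wakers, etc.
struct ScheduledIo {
    readiness: usize,
}

fn main() {
    let io = Arc::new(ScheduledIo { readiness: 0 });

    // The mio token would be the address of the allocation.
    let token: usize = Arc::into_raw(io) as usize;

    // Later, when mio yields an event carrying `token`, the reactor has to turn
    // it back into an Arc to touch the state:
    let io = unsafe { Arc::from_raw(token as *const ScheduledIo) };
    assert_eq!(io.readiness, 0);

    // The problem: if every clone of the Arc had already been dropped, mio could
    // still deliver an event with this token, and the `from_raw` above would
    // resurrect a freed allocation. Nothing in the token distinguishes a live
    // ScheduledIo from a dropped one.
}
```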
Option: A lock free slab

It would be possible to implement a lock free slab. A slab requires:
- storage for the `ScheduledIo` entries
- a list of free slots
The storage can be achieved in a safe way by using an array of `ScheduledIo` vectors. These vectors would be lazily allocated as more storage is needed. The representation would be `[AtomicPtr<ScheduledIo>; N]`, where `N` is large enough to satisfy the maximum storage requirements. The size of each `ScheduledIo` vector would depend on the index. At index 0, the size would be 64 (or some other constant). The size of the vector would double for each index. Using this strategy, a token can be constructed that can index a slot in this storage. The main issue would be if multiple threads race to allocate the next vector. One thread would win and the other would have to throw the allocation away. This can be mitigated by using randomization to allocate the next chunk of storage before it is needed.

The list of free slots can be implemented using a Treiber stack.
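A minimal sketch of the doubling-page layout and the token-to-slot arithmetic described above; the constants, the stripped-down `ScheduledIo`, and the `locate` helper are illustrative assumptions rather than tokio code:

```rust
use std::sync::atomic::AtomicPtr;

// Hypothetical per-socket state.
#[allow(dead_code)]
struct ScheduledIo {
    readiness: usize,
}

// Illustrative constants: page 0 holds PAGE0 slots and each page doubles,
// so MAX_PAGES pages cover PAGE0 * (2^MAX_PAGES - 1) slots in total.
const PAGE0: usize = 64;
const MAX_PAGES: usize = 20;

// The `[AtomicPtr<ScheduledIo>; N]` representation: pages are allocated
// lazily, so every pointer starts out null.
#[allow(dead_code)]
struct Storage {
    pages: [AtomicPtr<ScheduledIo>; MAX_PAGES],
}

// Map a global slot index (usable as a mio token) to (page, offset in page).
fn locate(index: usize) -> (usize, usize) {
    let scaled = index / PAGE0 + 1;
    let page = (usize::BITS - 1 - scaled.leading_zeros()) as usize; // floor(log2)
    let page_start = PAGE0 * ((1usize << page) - 1);
    (page, index - page_start)
}

fn main() {
    assert_eq!(locate(0), (0, 0));
    assert_eq!(locate(63), (0, 63));
    assert_eq!(locate(64), (1, 0)); // page 1 holds 128 slots
    assert_eq!(locate(64 + 128), (2, 0)); // page 2 holds 256 slots

    let _storage = Storage {
        pages: std::array::from_fn(|_| AtomicPtr::new(std::ptr::null_mut())),
    };
}
```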
While this strategy would remove the lock and significantly reduce contention, it would still require all threads to synchronize on the Treiber stack. This could potentially be mitigated by sharding (having multiple lock free slabs).
Another problem with this strategy is that allocated memory never shrinks. This could potentially be mitigated by using crossbeam epochs with the storage and by having a Treiber stack per chunk instead of one for the entire slab. In order to claim a slot, I believe the only solution would be to scan each chunk, from smallest to largest, for capacity. There would also need to be some heuristic for releasing larger chunks.
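For the free list itself, a rough sketch of a Treiber-style stack over slot indices (again illustrative, not tokio code); a production version would also need to defend against the ABA problem, for example by packing a generation counter into `head`, which is omitted here:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const NIL: usize = usize::MAX;

// A Treiber-style stack of free slot indices: each free slot stores the index
// of the next free slot, and `head` points at the top of the stack.
struct FreeList {
    head: AtomicUsize,
    next: Vec<AtomicUsize>,
}

impl FreeList {
    fn new(capacity: usize) -> Self {
        // Initially every slot is free: 0 -> 1 -> ... -> capacity-1 -> NIL.
        let next = (0..capacity)
            .map(|i| AtomicUsize::new(if i + 1 < capacity { i + 1 } else { NIL }))
            .collect();
        FreeList { head: AtomicUsize::new(0), next }
    }

    // Pop a free slot, or None if the slab is full. NOTE: a real version must
    // also handle ABA (e.g. by tagging `head` with a generation counter).
    fn claim(&self) -> Option<usize> {
        let mut head = self.head.load(Ordering::Acquire);
        loop {
            if head == NIL {
                return None;
            }
            let next = self.next[head].load(Ordering::Relaxed);
            match self
                .head
                .compare_exchange(head, next, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => return Some(head),
                Err(actual) => head = actual,
            }
        }
    }

    // Push a slot back onto the stack; safe to call from any thread.
    fn release(&self, slot: usize) {
        let mut head = self.head.load(Ordering::Acquire);
        loop {
            self.next[slot].store(head, Ordering::Relaxed);
            match self
                .head
                .compare_exchange(head, slot, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }
}

fn main() {
    let free = FreeList::new(4);
    assert_eq!(free.claim(), Some(0));
    assert_eq!(free.claim(), Some(1));
    free.release(0);
    assert_eq!(free.claim(), Some(0)); // the released slot is handed out again
}
```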
Option: Dedicated storage per thread
Because Tokio based applications are expected to span a small number of physical threads, one option could be to have a slab per thread.
In this case, the slab would be "single consumer, multi producer". A single consumer (the thread) would claim slots in the slab. However, any thread could "free" a slot. Modeling this would be similar to the above, except that each chunk would maintain an mpsc list of free slots. Also, because this is now single consumer, it should be possible to track chunks that have available capacity using an `AtomicUsize` as a bit field. When claiming a slot, this bit field would be scanned to find the smallest chunk with capacity.

Now, the problem here is how to make this indexable. There would probably need to be some registry where these slabs are tracked as they are created for each thread.
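A small sketch of the `AtomicUsize` bit-field bookkeeping described above, with one bit per chunk; the type and method names are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Bit `i` is set while chunk `i` still has free slots. Only the owning thread
// claims slots, but any thread may free one, so the bit field is atomic.
struct ChunkCapacity {
    available: AtomicUsize,
}

impl ChunkCapacity {
    fn new(num_chunks: usize) -> Self {
        assert!(num_chunks < usize::BITS as usize);
        // Every chunk starts out with capacity.
        ChunkCapacity {
            available: AtomicUsize::new((1usize << num_chunks) - 1),
        }
    }

    // Smallest chunk index that still has capacity, if any.
    fn smallest_available(&self) -> Option<usize> {
        let bits = self.available.load(Ordering::Acquire);
        if bits == 0 {
            None
        } else {
            Some(bits.trailing_zeros() as usize)
        }
    }

    // Called by the owning thread when chunk `i` fills up.
    fn mark_full(&self, i: usize) {
        self.available.fetch_and(!(1usize << i), Ordering::AcqRel);
    }

    // Called by any thread when a slot in chunk `i` is freed.
    fn mark_available(&self, i: usize) {
        self.available.fetch_or(1usize << i, Ordering::AcqRel);
    }
}

fn main() {
    let caps = ChunkCapacity::new(4);
    assert_eq!(caps.smallest_available(), Some(0));
    caps.mark_full(0);
    caps.mark_full(1);
    assert_eq!(caps.smallest_available(), Some(2));
    caps.mark_available(0);
    assert_eq!(caps.smallest_available(), Some(0));
}
```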
Conclusion
I don't know off the top of my head which would be the best strategy to explore. We should use benchmarks, including benchmarks for Tokio based applications (such as hyper), to help guide us.