Barbara compares some cpp code (and has a performance problem) #144
The technical conclusion of the specific example in this doc is similar to rust-lang#132, but the story more generally is about how to compare Rust async code with async code in other languages (particularly C++20).
```rust
impl Client {
    async fn request(&self) -> Result<Data> {
        let bytes = self.inner.network_request().await?;
        Ok(serialization_libary::from_bytes(&bytes)?)
    }
}
```
This code is still problematic, even with the `spawn` below, because you've got blocking work on a non-blocking task. Ideally, it would use `spawn_blocking` to fix this:

```rust
Ok(tokio::task::spawn_blocking(move || serialization_libary::from_bytes(&bytes)).await??)
```

or, if the library was normally fast, but slow on some (big?) inputs, you might prefer `block_in_place` (which ensures that other tasks run) and the `spawn` solution you gave:

```rust
Ok(tokio::task::block_in_place(|| serialization_libary::from_bytes(&bytes))?)
```
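Spelled out in the context of the story's `request` method, the `spawn_blocking` version might look something like this - a sketch only, reusing the story's placeholder names and assuming an anyhow-style `Result` so both the `JoinError` and the parse error convert with `?`:

```rust
impl Client {
    async fn request(&self) -> Result<Data> {
        let bytes = self.inner.network_request().await?;
        // Hand the CPU-heavy deserialization to tokio's blocking pool so the
        // async worker thread stays free to poll other tasks.
        let data = tokio::task::spawn_blocking(move || {
            serialization_libary::from_bytes(&bytes)
        })
        .await??; // first ? handles the JoinError, second ? the parse error
        Ok(data)
    }
}
```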
Goes to show how hard this user story is to get right - the fix in this story still doesn't work well. There's a fun sequence here where you add `block_in_place` and it's still bad, and then add the `spawn`, which fixes it.

One that's bitten us in Mononoke (issue #131 here) is that even if the long-running work is also nicely async (i.e. just makes another network request), it can still go wrong.
TIL about `block_in_place`, though I find it hard to reason about. I think in this case, only `spawn_blocking` would work, but it has its own problems (can't be cancelled). (Edit: `block_in_place` + my `spawn` below work as well, as you mentioned, but in practice not having the `block_in_place` has not proven problematic for me - perhaps I have even more perf improvements lurking.) And like you mentioned, the CPU work is only expensive if the result is large, which is a runtime property. Not sure how we could detect that and do the correct thing. async-rs/async-std#631 looks like an attempt, but I need to read more into why it was closed.
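One naive way to act on that runtime property - purely a sketch, reusing the story's placeholder names, with a made-up threshold and using input size as a rough proxy for how expensive the parse will be - is to branch before deciding where to run the deserialization:

```rust
// Hypothetical cut-off; the right value would have to come from profiling.
const BLOCKING_THRESHOLD_BYTES: usize = 64 * 1024;

async fn deserialize(bytes: Vec<u8>) -> Result<Data> {
    if bytes.len() < BLOCKING_THRESHOLD_BYTES {
        // Small payloads: cheap enough to parse inline on the async worker.
        Ok(serialization_libary::from_bytes(&bytes)?)
    } else {
        // Large payloads: push the parse onto the blocking pool.
        let data = tokio::task::spawn_blocking(move || {
            serialization_libary::from_bytes(&bytes)
        })
        .await??;
        Ok(data)
    }
}
```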
What does `block_in_place` do? Isn't each tokio task already resilient to starving the others, because they run on independent threads? Or is there some way the work-stealing part of it locks up?
Each worker thread has its own (fixed-size) queue of tasks to service. There's a global queue for overflows, too, and a certain amount of rebalancing done by each worker thread to avoid starving the tasks in the global queue. If a worker thread goes idle, and the global queue is empty, then it will steal work from other workers. `block_in_place` ensures that the local queue is empty.

As a result, it all depends on the balance between thread count (typically one per CPU hart) and runnable task count; if you have (e.g.) a huge AMD EPYC server with 256 harts (128 cores with two threads each), and typically only 100 tasks to run, then you'll never see the difference between `spawn_blocking`/`block_in_place` and `spawn`, because there will be idle workers that steal work from blocked workers.

On the other hand, if you have a small system (say an older laptop) with 4 harts (two cores, two threads per core), and you still have 100 tasks to run, then without `block_in_place` you might see the runtime fail to migrate some of those tasks off the blocked worker and onto a running worker - the result is starvation for the tasks that don't get migrated to another worker thread.

I'm guessing that you've never seen the starvation issue because you normally have lots of worker threads, and not many runnable tasks at a time - as a result, a worker goes idle and steals work from the blocked thread. If you had small numbers of worker threads, but still lots of tasks, you'd see starvation more often.
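A small runnable sketch of that small-system scenario, assuming tokio with the `full` feature set (exact timings will vary by machine):

```rust
use std::time::{Duration, Instant};

fn busy_work() {
    // Simulate ~2 seconds of CPU-bound work that never yields to the runtime.
    let start = Instant::now();
    while start.elapsed() < Duration::from_secs(2) {}
}

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let ticker = tokio::spawn(async {
        for i in 0..4 {
            tokio::time::sleep(Duration::from_millis(500)).await;
            println!("tick {i}"); // under starvation, these ticks arrive late
        }
    });

    // Occupy both workers with CPU-bound work. Replacing the closure body with
    // `tokio::task::block_in_place(busy_work)` lets the ticker keep running.
    let blockers: Vec<_> = (0..2)
        .map(|_| tokio::spawn(async { busy_work() }))
        .collect();

    for b in blockers {
        b.await.unwrap();
    }
    ticker.await.unwrap();
}
```

Run as written, the ticks should bunch up after the busy work finishes; with the `block_in_place` variant they should arrive roughly on schedule.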
This sounds like it might make a nice addition to the story, e.g., Barbara is happy for a while, but then she sees starvation or something like that. That said, it could also be a good FAQ. I wouldn't change the main part of the story, though, since invoking `tokio::spawn` seems to be what "Barbara" actually did here.
Re-reading this comment I was definitely struck by the challenge of "how to jigger this exactly right". I'm curious, @farnz, whether you think that a more adaptive runtime -- similar to what Go does, or what was proposed in this blog post, but never adopted -- would be a better fit for Mononoke?

One of the challenges here seems to be a matter of composition. The "right place" to insert `block_in_place`, for example, is inside of `reqwest`, at least if I understand correctly, but it may not be well positioned to judge whether messages are likely to be large or not.
Oh I guess I was wrong about that last bit. Still, it seems like these kinds of issues could frequently cross library boundaries.
Oh also I see you referenced async-rs/async-std#631 in the FAQ :)
I'm not convinced by more adaptive runtimes - they tend to result in developers not realising when they've introduced accidental blocking, and I have a bias towards engineers knowing the cost of their decisions.
There are two orthogonal problems that I perceive here:
- Two different ways to flag a section as blocking - `block_in_place` and `spawn_blocking` - with room for semantic confusion. In my ideal world, there would only be one - `block_in_place` - and something suitable done in the back-end so that it interacts nicely with `spawn`, such that `spawn(async { block_in_place(|| thing()) })` is exactly the same as (or better than) `spawn_blocking(|| thing())` today (both patterns are sketched below). That way, the guidance would be `block_in_place` for things that might block the CPU, and `spawn` to get parallelism.
- Lots of different ways to end up accidentally starving a task that's ready to run - both long polls, and failure to poll a sub-scheduler in a timely fashion. If I could wave a wand, I'd have async task tracing that tells me which tasks are taking too long to poll, and what they're doing when they hog the worker thread; this would also catch the issue where a sub-scheduler doesn't poll, because I'd see the tracing showing the first poll in the sub-scheduler, and then a long gap.
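The two patterns from the first bullet, side by side as they exist today (just a sketch; `thing` stands in for some CPU-heavy work):

```rust
fn thing() -> u64 {
    // Stand-in for CPU-heavy work that would otherwise block an async worker.
    (0..10_000_000u64).sum()
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    // Pattern 1: run the work on tokio's dedicated blocking thread pool.
    let a = tokio::task::spawn_blocking(thing);

    // Pattern 2: a normal task that flags its blocking section; the worker
    // hands its queued tasks to another thread before it blocks.
    let b = tokio::spawn(async { tokio::task::block_in_place(thing) });

    println!("{} {}", a.await.unwrap(), b.await.unwrap());
}
```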
For point 2 there, `tracing-futures` provides the `Instrumented` wrapper which allows me to build a chunk of the tracing I'd want. Having the runtime automatically instrument everything so that I can see exactly when a future is polled (or not polled) and how long each poll takes would be better for my needs than an adaptive runtime; I could use `tracing-flame` to both identify blocking (long polls) and starvation (long gaps between a future being ready to wake and being polled), and fix up both.

The missing bit in `tracing-futures` is identifying when a Future is ready to be polled again - I'd love to see some sort of event tied to the future's span, so that I can see that (e.g.) it was ready to be polled again, but not polled for 10 seconds, or that it got polled again 2 µs after the waker was notified. I'd want to skip the first poll for this (the gap between creation and first poll) because that's a delay that I believe most people would predict in code review.
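For illustration, a rough sketch of wrapping a future in the `Instrumented` wrapper via the `Instrument` extension trait (the crate choices and subscriber setup here are assumptions, not something from this thread); with a subscriber like `tracing-flame` installed, long polls and long gaps between polls of the span become visible:

```rust
use std::time::Duration;
use tracing::info_span;
use tracing_futures::Instrument; // provides `.instrument(span)` on futures

async fn do_network_call() {
    // Stand-in for the real async work being traced.
    tokio::time::sleep(Duration::from_millis(50)).await;
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    // Every poll of this future is recorded inside the "network_call" span,
    // so long polls (blocking) and long gaps between polls (starvation)
    // show up in the trace output.
    do_network_call()
        .instrument(info_span!("network_call"))
        .await;
}
```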
Thanks @farnz. Helpful.
These are just suggestions - I'm bringing in two other archetypes because my experience of dealing with these problems with my peers is that you end up (a) making the same mistake Barbara did at first, and (b) finding it hard to explain your decisions to your team.

My goals are twofold: first, to emphasise that Barbara is doing the right thing when she hits the problem - it's not something that "should" have been obvious as she was writing the call to `serialization_libary::from_bytes` - and secondly to make it clear that the obvious fix of `spawn` is incomplete, and that another fix is also needed.
This looks about ready to land to me. I left some small requests.
Co-authored-by: Alice Ryhl <alice@ryhl.io>
Co-authored-by: Ryan Levick <rylev@users.noreply.github.com>
@nikomatsakis I think this covers all the comments?
I'm ready to merge, yes! This is great, thanks @guswynn. I just re-read and I feel like I got a deeper understanding.