
Barbara compares some cpp code (and has a performance problem) #144

Merged · 6 commits · Apr 18, 2021

Conversation

@guswynn (Contributor) commented Apr 12, 2021

The technical conclusion of the specific example in this doc is similar to #132, but the story in general can be about how to compare Rust async code with other languages' async code (particularly C++20).
@guswynn changed the title from "Create barbara_compares_some_cpp_code.md" to "Barbara compares some cpp code (and has a performance problem)" on Apr 12, 2021
    impl Client {
        async fn request(&self) -> Result<Data> {
            let bytes = self.inner.network_request().await?;
            Ok(serialization_libary::from_bytes(&bytes)?)
        }
    }

@farnz (Contributor) commented Apr 12, 2021

This code is still problematic, even with the spawn below, because you've got blocking work on a non-blocking task. Ideally, it would use spawn_blocking to fix this:

        Ok(tokio::task::spawn_blocking(move || serialization_libary::from_bytes(&bytes)).await??)

or, if the library is normally fast but slow on some (big?) inputs, you might prefer block_in_place (which ensures that other tasks keep running), combined with the spawn solution you gave:

        Ok(tokio::task::block_in_place(|| serialization_libary::from_bytes(&bytes))?)

Goes to show how hard this user story is to get right - the fix in this story still doesn't work well.

There's a fun sequence here where you add block_in_place and it's still bad, and then add the spawn, which fixes it.

One that's bitten us in Mononoke (issue #131 here) is that even if the long-running work is also nicely async (i.e. it just makes another network request), it can still go wrong.
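
A fuller sketch of the spawn_blocking variant suggested above (hypothetical; it reuses the story's Client/Data/serialization_libary names, adds move so the closure satisfies spawn_blocking's 'static bound, and assumes the error type can absorb a Tokio JoinError):

```rust
impl Client {
    async fn request(&self) -> Result<Data> {
        let bytes = self.inner.network_request().await?;
        // Run the CPU-heavy deserialization on Tokio's blocking thread pool
        // so it cannot stall this async worker thread.
        let data = tokio::task::spawn_blocking(move || {
            serialization_libary::from_bytes(&bytes)
        })
        .await??; // first ? handles the JoinError, second ? the parse error
        Ok(data)
    }
}
```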

@guswynn (Contributor, Author) commented Apr 12, 2021

TIL about block_in_place, though I find it hard to reason about. I think in this case only spawn_blocking would work, but it has its own problems (it can't be cancelled). (Edit: block_in_place plus my spawn below works as well, as you mentioned, but in practice not having the block_in_place hasn't proven problematic for me; perhaps I have even more perf improvements lurking.) And, like you mentioned, the CPU work is only expensive if the result is large, which is a runtime property; I'm not sure how we could detect that and do the correct thing. async-rs/async-std#631 looks like an attempt, but I need to read more into why it was closed.
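
A minimal sketch of the block_in_place-plus-spawn combination mentioned here (hypothetical; it reuses the story's types and requires Tokio's multi-thread runtime, since block_in_place panics on a current-thread runtime):

```rust
impl Client {
    async fn request(&self) -> Result<Data> {
        let bytes = self.inner.network_request().await?;
        // Spawn the deserialization as its own task, and mark the CPU-heavy
        // section with block_in_place so the worker's queued tasks can be
        // handed to another thread while it runs.
        let data = tokio::spawn(async move {
            tokio::task::block_in_place(|| serialization_libary::from_bytes(&bytes))
        })
        .await??;
        Ok(data)
    }
}
```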

Contributor Author commented:

What does block_in_place do? Isn't each tokio task already resilient to starving the others because they run on independent threads? Or is there some way the work-stealing part of it locks up?

Contributor commented:

Each worker thread has its own (fixed-size) queue of tasks to service. There's a global queue for overflows, too, and a certain amount of rebalancing done by each worker thread to avoid starving the tasks in the global queue.

If a worker thread goes idle, and the global queue is empty, then it will steal work from other workers. block_in_place ensures that the blocked worker's local queue is emptied before the blocking work runs. As a result, it all depends on the balance between thread count (typically one per CPU hart) and runnable task count; if you have (e.g.) a huge AMD EPYC server with 256 harts (two threads per core, 128 cores in the system), and typically only 100 tasks to run, then you'll never see the difference between spawn_blocking/block_in_place and spawn, because there will be idle workers that steal work from blocked workers.

On the other hand, if you have a small system (say an older laptop) with 4 harts (two cores, two threads per core), and you still have 100 tasks to run, without block_in_place you might see the runtime fail to migrate some of those tasks off the blocked worker and onto a running worker - the result is starvation for the tasks that don't get migrated to another worker thread.

I'm guessing that you've never seen the starvation issue because you normally have lots of worker threads, and not many runnable tasks at a time - as a result, a worker goes idle and steals work from the blocked thread. If you had small numbers of worker threads, but still lots of tasks, you'd see starvation more often.
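
A toy reproduction (not from this thread) of the small-runtime starvation described above; it assumes Tokio with the rt-multi-thread and macros features, and on most runs the small task is delayed by roughly the full two seconds:

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let start = Instant::now();

    // Enough accidentally-blocking tasks to occupy every worker thread.
    for _ in 0..2 {
        tokio::spawn(async {
            // Synchronous work inside an async task: this worker cannot poll
            // anything else until it returns. Wrapping it in
            // tokio::task::block_in_place(|| ...) would let the runtime hand
            // the worker's queued tasks to another thread first.
            std::thread::sleep(Duration::from_secs(2));
        });
    }

    // This task is ready immediately, but with both workers blocked it is
    // not polled until one of them frees up.
    let delay = tokio::spawn(async move { start.elapsed() }).await.unwrap();
    println!("small task ran after {:?}", delay);
}
```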

Contributor commented:

This sounds like it might make a nice addition to the story. e.g., Barbara is happy for a while, but then she sees starvation or something like that. That said, it could also be a good FAQ. I wouldn't change the main part of the story, though, since invoking tokio::spawn seems to be what "Barbara" actually did here.

Contributor commented:

Re-reading this comment I was definitely struck by the challenge of "how to jigger this exactly right". I'm curious, @farnz, whether you think that a more adaptive runtime -- similar to what Go does, or what was proposed in this blog post, but never adopted -- would be a better fit for Mononoke?

One of the challenges here seems to be a matter of composition. The "right place" to insert block_in_place, for example, is inside of reqwest, at least if I understand correctly, but it may not be well positioned to judge whether messages are likely to be large or not.

Contributor commented:

Oh I guess I was wrong about that last bit. Still, it seems like these kinds of issues could frequently cross library boundaries.

Contributor commented:

Oh also I see you referenced async-rs/async-std#631 in the FAQ :)

Contributor commented:

I'm not convinced by more adaptive runtimes - they tend to result in developers not realising when they've introduced accidental blocking, and I have a bias towards engineers knowing the cost of their decisions.

There are two orthogonal problems that I perceive here:

  1. Two different ways to flag a section as blocking - block_in_place and spawn_blocking - with room for semantic confusion. In my ideal world, there would only be one - block_in_place - and something suitable done in the back-end so that it interacts nicely with spawn, such that spawn(async { block_in_place(|| thing()) }) is exactly the same as (or better than) spawn_blocking(|| thing()) today. That way, the guidance would be block_in_place for things that might block the CPU, and spawn to get parallelism.
  2. Lots of different ways to end up accidentally starving a task that's ready to run - both long polls, and failure to poll a sub-scheduler in a timely fashion. If I could wave a wand, I'd have async task tracing that tells me which tasks are taking too long to poll, and what they're doing when they hog the worker thread; this would also catch the issue where a sub-scheduler doesn't poll, because I'd see the tracing showing the first poll in the sub-scheduler, and then a long gap.

For point 2 there, tracing-futures provides the Instrumented wrapper which allows me to build a chunk of the tracing I'd want. Having the runtime automatically instrument everything so that I can see exactly when a future is polled (or not polled) and how long each poll takes would be better for my needs than an adaptive runtime; I could use tracing-flame to both identify blocking (long polls), and starvation (long gaps between a future being ready to wake and being polled), and fix up both.
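
A minimal sketch of that kind of per-future instrumentation, assuming the tracing crate's Instrument extension trait (closely related to tracing-futures' Instrumented wrapper) and reusing the story's hypothetical Client type; each poll of the wrapped future enters and exits the span, so long polls show up directly in a tool like tracing-flame:

```rust
use tracing::Instrument;

async fn handle(client: &Client) -> Result<Data> {
    // Every poll of `request()` is recorded inside this span, so a poll that
    // hogs the worker thread is visible as one long span entry.
    client
        .request()
        .instrument(tracing::info_span!("client_request"))
        .await
}
```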

The missing bit in tracing-futures is identifying when a Future is ready to be polled again - I'd love to see some sort of event tied to the future's span, so that I can see that (e.g.) it was ready to be polled again, but not polled for 10 seconds, or that it got polled again 2 µs after the waker was notified. I'd want to skip the first poll for this (the gap between creation and first poll) because that's a delay that I believe most people would predict in code review.
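
That wake-to-poll gap is not something tracing-futures records today, but here is a rough, hypothetical sketch of how one could measure it by hand with a wrapper future and std's Wake trait (all names invented for illustration):

```rust
use std::{
    future::Future,
    pin::Pin,
    sync::{Arc, Mutex},
    task::{Context, Poll, Wake, Waker},
    time::Instant,
};

/// Wraps a future and reports how long it sat between being woken and
/// actually being polled again.
struct PollGap<F> {
    inner: Pin<Box<F>>,
    last_wake: Arc<Mutex<Option<Instant>>>,
}

impl<F: Future> PollGap<F> {
    fn new(inner: F) -> Self {
        PollGap {
            inner: Box::pin(inner),
            last_wake: Arc::new(Mutex::new(None)),
        }
    }
}

/// Waker that records the instant the wrapped future was last woken.
struct GapWaker {
    inner: Waker,
    last_wake: Arc<Mutex<Option<Instant>>>,
}

impl Wake for GapWaker {
    fn wake(self: Arc<Self>) {
        *self.last_wake.lock().unwrap() = Some(Instant::now());
        self.inner.wake_by_ref();
    }
}

impl<F: Future> Future for PollGap<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let this = self.get_mut();
        // If we were woken earlier, report how long the runtime took to get
        // back to us; this is the gap described above.
        if let Some(woken_at) = this.last_wake.lock().unwrap().take() {
            eprintln!("polled {:?} after being woken", woken_at.elapsed());
        }
        let waker = Waker::from(Arc::new(GapWaker {
            inner: cx.waker().clone(),
            last_wake: this.last_wake.clone(),
        }));
        let mut cx = Context::from_waker(&waker);
        this.inner.as_mut().poll(&mut cx)
    }
}
```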

Contributor commented:

Thanks @farnz. Helpful.

@nikomatsakis (Contributor) left a comment:

This is great, thanks. I think it'd be good to add something similar to what @farnz suggested, or perhaps @farnz might have a more concrete proposal.

@farnz (Contributor) left a comment:

These are just suggestions - I'm bringing in two other archetypes because my experience of dealing with these problems with my peers is that you end up (a) making the same mistake Barbara did at first, and (b) finding it hard to explain your decisions to your team.

My goals are twofold: first, to emphasise that Barbara is doing the right thing when she hits the problem - it's not something that "should" have been obvious as she was writing the call to serialization_libary::from_bytes - and secondly to make it clear that the obvious fix of spawn is incomplete, and that another fix is also needed.

@nikomatsakis (Contributor) left a comment:

This looks about ready to land to me. I left some small requests.


guswynn and others added 3 commits on April 14, 2021 (co-authored by Alice Ryhl <alice@ryhl.io> and Ryan Levick <rylev@users.noreply.github.com>).
@guswynn (Contributor, Author) commented Apr 15, 2021:

@nikomatsakis I think this covers all the comments?

@nikomatsakis (Contributor) left a comment:

I'm ready to merge, yes! This is great, thanks @guswynn. I just re-read and I feel like I got a deeper understanding.

Labels: status-quo-story-ideas ("Status quo" user story ideas)
Participants: 5