Barbara compares some cpp code (and has a performance problem) #144

Merged · 6 commits · Apr 18, 2021
src/vision/status_quo/barbara_compares_some_cpp_code.md (+111, −0)

# 😱 Status quo stories: Barbara compares some code (and has a performance problem)

## 🚧 Warning: Draft status 🚧

This is a draft "status quo" story submitted as part of the brainstorming period. It is derived from real-life experiences of actual Rust users and is meant to reflect some of the challenges that Async Rust programmers face today.

If you would like to expand on this story, or adjust the answers to the FAQ, feel free to open a PR making edits (but keep in mind that, as they reflect people's experiences, status quo stories [cannot be wrong], only inaccurate). Alternatively, you may wish to [add your own status quo story][htvsq]!

## The story

Barbara is recreating some code that has been written in other languages she has some familiarity with. These include C++, but
also GC'd languages like Python.

This code collates a large number of requests to network services, with each response containing a large amount of data.
To speed this up, Barbara uses `buffer_unordered`, and writes code like this:

```rust
use futures::stream::{StreamExt, TryStreamExt};

let queries = futures::stream::iter(...)
    .map(|query| async move {
        // Each request deserializes its response into `Data`.
        let d: Data = self.client.request(&query).await?;
        Ok::<_, Error>(d) // `Error`: whatever error type the `Result<Data>` alias uses
    })
    .buffer_unordered(32);

// `try_collect` propagates the first error; otherwise it yields all the data.
let results: Vec<Data> = queries.try_collect().await?;
```

Barbara thinks this is similar in function to things she has seen using
Python's [asyncio.wait](https://docs.python.org/3/library/asyncio-task.html#asyncio.wait),
as well as some code her coworkers have written using C++20 coroutines,
using [this helper from folly](https://github.com/facebook/folly/blob/master/folly/experimental/coro/Collect.h#L321):

```cpp
std::vector<folly::coro::Task<Data>> tasks;
for (const auto& query : queries) {
  tasks.push_back(
      folly::coro::co_invoke([this, &query]() -> folly::coro::Task<Data> {
        co_return co_await client_->co_request(query);
      }));
}
auto results = co_await folly::coro::collectAllWindowed(
    std::move(tasks), 32);
```

However, *the Rust code performs quite poorly compared to the other implementations,
appearing to complete the requests almost serially, despite looking, on the surface,
like effectively identical code.*

Barbara goes deep into investigating this: she spends time reading how `buffer_unordered` is
implemented, how its underlying `FuturesUnordered` is implemented, and even thinks about
how polling and the `tokio` runtime she is using work. She even tries to figure out if the
upstream service is doing some sort of queueing.

Eventually Barbara starts reading more about C++20 coroutines, looking closer at the folly
implementation used above, and notices that it works primarily with *tasks*, which are not exactly
equivalent to Rust `Future`s.
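
A Rust `Future` is inert until something polls it, and with `buffer_unordered` all of that polling happens on the single task that owns the stream. A minimal, hypothetical sketch of that laziness (the `fetch` function is illustrative, not from the story):

```rust
// Hypothetical sketch (not from the story): a future does nothing until polled.
async fn fetch(n: u32) -> u32 {
    println!("fetching {n}");
    n * 2
}

#[tokio::main]
async fn main() {
    let fut = fetch(1);     // nothing has been printed yet: the future is inert
    let result = fut.await; // "fetching 1" appears only now, on the task that polls it
    println!("got {result}");
}
```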

Then it strikes her! `request` is implemented something like this:
```rust
impl Client {
    async fn request(&self) -> Result<Data> {
        let bytes = self.inner.network_request().await?;
        // CPU-heavy: deserializing the (large) response happens right here,
        // on the same task that drives the whole stream.
        Ok(serialization_libary::from_bytes(&bytes)?)
    }
}
```

**@farnz** (Contributor), Apr 12, 2021:

This code is still problematic, even with the spawn below, because you've got blocking work on a non-blocking task. Ideally, it would use `spawn_blocking` to fix this:

        Ok(tokio::task::spawn_blocking(move || serialization_libary::from_bytes(&bytes)).await??)

or, if the library is normally fast, but slow on some (big?) inputs, you might prefer `block_in_place` (which ensures that other tasks run) plus the spawn solution you gave:

        Ok(tokio::task::block_in_place(|| serialization_libary::from_bytes(&bytes))?)

Goes to show how hard this user story is to get right: the fix in this story still doesn't work well. There's a fun sequence here where you add `block_in_place` and it's still bad, and then add the spawn, which fixes it.

One that's bitten us in Mononoke (issue #131 here) is that even if the long-running work is also nicely async (i.e. just makes another network request), it can still go wrong.
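
For illustration only (this is not code from the PR): a minimal sketch of the story's `request` method with the `spawn_blocking` suggestion applied. The `move` closure takes ownership of `bytes` so it is `'static`, and the double `?` assumes an error type (e.g. an `anyhow`-style error) that can absorb both the `JoinError` and the deserialization error.

```rust
impl Client {
    async fn request(&self) -> Result<Data> {
        let bytes = self.inner.network_request().await?;
        // `move` gives the closure ownership of `bytes`, satisfying the
        // 'static bound on `spawn_blocking`; the CPU-heavy deserialization
        // then runs on tokio's blocking thread pool instead of on this task.
        let data: Data =
            tokio::task::spawn_blocking(move || serialization_libary::from_bytes(&bytes))
                .await??;
        Ok(data)
    }
}
```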

**@guswynn** (Contributor, Author), Apr 12, 2021:

TIL about `block_in_place`, though I find it hard to reason about. I think in this case only `spawn_blocking` would work, but it has its own problems (it can't be cancelled). (Edit: `block_in_place` + my spawn below works as well, as you mentioned, but in practice not having the `block_in_place` has not proven problematic for me; perhaps I have even more perf improvements lurking.) And, as you mentioned, the CPU work is only expensive if the result is large, which is a runtime property. Not sure how we could detect that and do the correct thing. async-rs/async-std#631 looks like an attempt, but I need to read more into why it was closed.
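
As an aside, a hypothetical sketch (not from the PR) of the cancellation caveat mentioned here: aborting the `JoinHandle` of a `spawn_blocking` task does not interrupt a closure that has already started running.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let handle = tokio::task::spawn_blocking(|| {
        // Once this closure has started, nothing can interrupt it.
        std::thread::sleep(Duration::from_secs(2));
        println!("blocking work finished anyway");
    });

    tokio::time::sleep(Duration::from_millis(100)).await;
    handle.abort(); // no effect on a blocking task that is already running

    // Give the closure time to demonstrate that it kept going.
    tokio::time::sleep(Duration::from_secs(3)).await;
}
```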

**Contributor (PR author):**

What does `block_in_place` do? Isn't each tokio task already resilient to starving the others, because they run on independent threads? Or is there some way the work-stealing part of it locks up?

**Contributor:**

Each worker thread has its own (fixed-size) queue of tasks to service. There's a global queue for overflows, too, and a certain amount of rebalancing done by each worker thread to avoid starving the tasks in the global queue.

If a worker thread goes idle and the global queue is empty, it will steal work from other workers. `block_in_place` ensures that the local queue is empty before blocking. As a result, it all depends on the balance between thread count (typically one per CPU hart) and runnable task count: if you have (e.g.) a huge AMD EPYC server with 256 harts (two threads per core and 128 cores in the system), and typically only 100 tasks to run, then you'll never see the difference between `spawn_blocking`/`block_in_place` and `spawn`, because there will be idle workers that steal work from blocked workers.

On the other hand, if you have a small system (say an older laptop) with 4 harts (two cores, two threads per core), and you still have 100 tasks to run, then without `block_in_place` you might see the runtime fail to migrate some of those tasks off the blocked worker and onto a running worker; the result is starvation for the tasks that don't get migrated to another worker thread.

I'm guessing that you've never seen the starvation issue because you normally have lots of worker threads and not many runnable tasks at a time; as a result, a worker goes idle and steals work from the blocked thread. If you had a small number of worker threads but still lots of tasks, you'd see starvation more often.
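
To make the trade-off concrete, here is a small, hypothetical repro sketch (not from the PR): a runtime deliberately limited to two worker threads, a heartbeat task, and several tasks that block their worker. With the blocking section wrapped in `block_in_place` the heartbeat keeps ticking; remove the wrapper and, depending on the worker/task balance described above, the heartbeat can stall behind the blocked workers.

```rust
use std::time::Duration;

fn main() {
    // Deliberately small worker pool, like the "older laptop" case above.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        // A task we expect to keep ticking every 100 ms.
        let heartbeat = tokio::spawn(async {
            for i in 0..20 {
                tokio::time::sleep(Duration::from_millis(100)).await;
                println!("tick {i}");
            }
        });

        // Four tasks that block their worker thread for a second each
        // (a stand-in for CPU-heavy deserialization). With `block_in_place`
        // the scheduler hands the worker's queue to another thread; without
        // it, the heartbeat can sit unpolled behind the blocked workers.
        let blockers: Vec<_> = (0..4)
            .map(|_| {
                tokio::spawn(async {
                    tokio::task::block_in_place(|| {
                        std::thread::sleep(Duration::from_secs(1));
                    });
                })
            })
            .collect();

        for b in blockers {
            b.await.unwrap();
        }
        heartbeat.await.unwrap();
    });
}
```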

**Contributor:**

This sounds like it might make a nice addition to the story, e.g., Barbara is happy for a while, but then she sees starvation or something like that. That said, it could also be a good FAQ. I wouldn't change the main part of the story, though, since invoking `tokio::spawn` seems to be what "Barbara" actually did here.

**Contributor:**

Re-reading this comment, I was definitely struck by the challenge of "how to jigger this exactly right". I'm curious, @farnz, whether you think that a more adaptive runtime, similar to what Go does, or what was proposed in this blog post but never adopted, would be a better fit for Mononoke?

One of the challenges here seems to be a matter of composition. The "right place" to insert `block_in_place`, for example, is inside of reqwest, at least if I understand correctly, but it may not be well positioned to judge whether messages are likely to be large or not.

**Contributor:**

Oh, I guess I was wrong about that last bit. Still, it seems like these kinds of issues could frequently cross library boundaries.

**Contributor:**

Oh, also, I see you referenced async-rs/async-std#631 in the FAQ :)

**Contributor:**

I'm not convinced by more adaptive runtimes; they tend to result in developers not realising when they've introduced accidental blocking, and I have a bias towards engineers knowing the cost of their decisions.

There are two orthogonal problems that I perceive here:

1. Two different ways to flag a section as blocking, `block_in_place` and `spawn_blocking`, with room for semantic confusion. In my ideal world, there would only be one (`block_in_place`), with something suitable done in the back end so that it interacts nicely with `spawn`, such that `spawn(async { block_in_place(|| thing()) })` is exactly the same as (or better than) `spawn_blocking(|| thing())` today. That way, the guidance would be `block_in_place` for things that might block the CPU, and `spawn` to get parallelism.
2. Lots of different ways to end up accidentally starving a task that's ready to run: both long polls, and failure to poll a sub-scheduler in a timely fashion. If I could wave a wand, I'd have async task tracing that tells me which tasks are taking too long to poll, and what they're doing when they hog the worker thread; this would also catch the issue where a sub-scheduler doesn't poll, because I'd see the tracing showing the first poll in the sub-scheduler, and then a long gap.

For point 2, tracing-futures provides the `Instrumented` wrapper, which allows me to build a chunk of the tracing I'd want. Having the runtime automatically instrument everything, so that I can see exactly when a future is polled (or not polled) and how long each poll takes, would be better for my needs than an adaptive runtime; I could use tracing-flame to identify both blocking (long polls) and starvation (long gaps between a future being ready to wake and being polled), and fix up both.

The missing bit in tracing-futures is identifying when a `Future` is ready to be polled again. I'd love to see some sort of event tied to the future's span, so that I can see that (e.g.) it was ready to be polled again but not polled for 10 seconds, or that it got polled again 2 µs after the waker was notified. I'd want to skip the first poll for this (the gap between creation and first poll), because that's a delay that I believe most people would predict in code review.
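
As a reference point for the `Instrumented` approach mentioned above, a hypothetical sketch (none of this is from the PR) of wrapping each request future in its own span via `tracing::Instrument`; a layer such as tracing-flame can then turn long polls into a flame graph. The `handle_query` function stands in for the story's `client.request(&query)`.

```rust
use std::time::Duration;
use tracing::Instrument;

// Stand-in for `client.request(&query)`.
async fn handle_query(id: u32) -> u32 {
    tokio::time::sleep(Duration::from_millis(50)).await;
    id
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt().with_target(false).init();

    // Each request future gets its own span, so spans (and poll durations,
    // with a suitable layer) show up per request.
    let results: Vec<u32> = futures::future::join_all(
        (0..4).map(|id| handle_query(id).instrument(tracing::info_span!("request", id))),
    )
    .await;

    tracing::info!(?results, "done");
}
```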

**Contributor:**

Thanks @farnz. Helpful.

The results from the network service are VERY large, and the `BufferUnordered` stream is contained within a single tokio task.
**The request future does non-trivial CPU work to deserialize the data.
This causes significant slowdowns in wall-clock time, as the process can be bounded by the time it takes
the single thread running the tokio task to deserialize all the data.**

The solution is to spawn tasks (note this requires `'static` futures):

```rust
use futures::stream::{StreamExt, TryStreamExt};

let queries = futures::stream::iter(...)
    .map(|query| async move {
        // `tokio::spawn` moves each request onto its own task, so requests can
        // be polled (and their responses deserialized) on different worker
        // threads. The first `?` handles the `JoinError`, the second the
        // request error.
        let d: Data = tokio::spawn(self.client.request(&query)).await??;
        Ok::<_, Error>(d) // same `Error` as above
    })
    .buffer_unordered(32);

let results: Vec<Data> = queries.try_collect().await?;
```

Barbara was able to figure this out by reading enough and trying things out, but had that not worked, it
would have probably required figuring out how to use `perf` or some similar tool.

## 🤔 Frequently Asked Questions

### **What are the morals of the story?**
* Producing concurrent, performant code with async Rust is not always trivial. Debugging performance
  issues can be difficult.
* Rust's async model, particularly the blocking nature of polling, can be complex to reason about,
  and in some cases differs from other languages' choices in meaningful ways.
* CPU-bound code can easily be hidden.

### **What are the sources for this story?**
* This is an issue I personally hit while writing code required for production.

### **Why did you choose *Barbara* to tell this story?**
That's probably the person in the cast that I am most similar to, but Alan
and to some extent Grace make sense for the story as well.

### **How would this story have played out differently for the other characters?**
* Alan: May have taken longer to figure out.
* Grace: Likely would have been just as interested in the details of how polling works.
* Niklaus: Depends on their experience.