
rt: Hang on too small completion queue #147

Closed
ollie-etl opened this issue Oct 24, 2022 · 35 comments · May be fixed by #152

Comments

@ollie-etl
Contributor

Similar to #145, and again exercised by #144. If the size of the completion queue is smaller than the number of concurrent writers to the submission queue, the runtime just hangs.

Probably not committing?
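
A rough sketch of the shape of the failure (illustrative only: the paths, buffer sizes, and writer count are made up, and it assumes the runtime's ring was built with a completion queue smaller than the number of in-flight writers):

// Spawn more concurrent submitters than the completion queue has entries.
// Assumes the ring was set up with a small CQ (how to size it is not shown
// here); with default sizes you would need correspondingly more writers.
fn main() {
    tokio_uring::start(async {
        let mut handles = Vec::new();

        for i in 0..1024 {
            handles.push(tokio_uring::spawn(async move {
                let file = tokio_uring::fs::File::create(format!("/tmp/cq-test-{i}"))
                    .await
                    .unwrap();
                let (res, _buf) = file.write_at(vec![0u8; 4096], 0).await;
                res.unwrap();
            }));
        }

        // With the CQ smaller than the number of in-flight ops, this join
        // never completes: the runtime hangs.
        for h in handles {
            h.await.unwrap();
        }
    });
}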

@FrankReh
Collaborator

Yes, the driver may be making overly strict assumptions about when to submit and when to check the cqueue.

The driver isn't flushing the squeue and handling the cqueue as often as it could, and it could use the overflow bit to take an extra loop before getting back to the caller.

And also, as another issue and commit found and handled, it is not enough to submit only when the thread is going idle; there will be times we want submit and tick to run without going idle first. Whether we wait for the squeue to be full or not is debatable. The squeue drains completely with any call to submit, so it's probably okay to wait for the squeue to become full, but the app decides how large the squeue is, and if it is made too large, it can starve the kernel of work it could already be processing. Some day I want to ask the uring author what they think the right sizing is, but maybe this crate's benchmark abilities will already have given us some good ideas.

I believe my earlier thinking, and my pseudo fix, was that any time the squeue is found full, we should run a tick first, but we don't want to recurse and call a tick again if a tick completion handler tried to make its own submission entry and was also thwarted by a full squeue. We also don't want to force the kernel into overflow mode if we can help it, because that creates a slower path and burns cycles that can sometimes be avoided by checking the cqueue periodically.

And there is also the no-syscall mode, which the driver doesn't have an option for yet, and which would let the two sides run in parallel even more efficiently.

Great ways are being found to stress the system reliably. That will make it easier to push performance and to detect when regressions happen.
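
For the record, the shape I have in mind is roughly this (hypothetical names, not the real tokio-uring driver types):

// Sketch of "tick before submitting when the squeue is full". Illustrative
// only; the real driver batches ops and owns their buffers.
struct Driver {
    ring: io_uring::IoUring,
    in_tick: bool,
}

impl Driver {
    fn push_sqe(&mut self, sqe: io_uring::squeue::Entry) {
        // If the squeue is full, flush it and dispatch completions first,
        // but don't recurse: a completion handler that tries to submit while
        // we are already ticking must not trigger another tick.
        if self.ring.submission().is_full() && !self.in_tick {
            self.in_tick = true;
            self.ring.submit().expect("submit");
            self.dispatch_completions();
            self.in_tick = false;
        }
        unsafe {
            // Safety: the real driver keeps the op's buffers alive until its
            // completion is seen.
            self.ring.submission().push(&sqe).expect("push sqe");
        }
    }

    fn dispatch_completions(&mut self) {
        for cqe in self.ring.completion() {
            // Hand the result back to the op identified by cqe.user_data().
            let _ = cqe;
        }
    }
}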

@Noah-Kennedy
Contributor

I think we should probably take an approach here similar to what tokio does when it calls epoll_wait. This would mean getting a hook into tokio that lets us run a callback to flush the queue and dispatch completions during the maintenance loop. Tokio's maintenance loop runs every set number of scheduler ticks (a scheduler tick is basically the polling of one task). This would give us a good way to periodically flush the queue.

@FrankReh
Collaborator

@Noah-Kennedy Are you saying there is already a hook in tokio we can use, or are you going to try to get one added?

If it's new, then perhaps before it lands we can implement our own maintenance countdown in the op creation to trigger the same uring maintenance.

@Noah-Kennedy
Contributor

I'm going to try to get one added. In the meantime, I'm wondering whether it is best to address this in the op creation or to do a "duct tape and zip ties" fix by having an always-ready task that no-ops and yields back most of the time, like this (pardon my pseudocode):

async fn maintenance_hack() {
    loop {
        // Yield back to the runtime 16 times so other tasks get polled
        // between maintenance passes.
        for _ in 0..16 {
            tokio::task::yield_now().await;
        }

        // Flush the queue and dispatch completions.
        // I'm acting like the thread-local driver is a static here for
        // simplicity's sake.
        DRIVER.tick();
    }
}

This will basically run once every 16 full passes through the scheduler.

@Noah-Kennedy
Contributor

Crap, that won't work. It will cause us to busy loop instead of blocking on epoll_wait.

@Noah-Kennedy
Contributor

Yeah, we should probably try and do this like @FrankReh was suggesting for the time being.

@Noah-Kennedy
Contributor

And long-term move to a hook in tokio.

@Noah-Kennedy
Contributor

The more I think about it, this probably can't be solved in the op submission without removing the ability to batch SQE submissions. I'm wondering if it might make sense to do something like my earlier suggestion, but with some added logic around only letting the task run if there are unflushed submission queue entries.

@Noah-Kennedy
Contributor

Ah, I was thinking of a different issue here. Disregard my earlier comments; they are important, but they won't fix this issue. We should definitely dispatch CQEs first any time we submit SQEs.
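
Concretely, something like this ordering in the driver's submit path (sketch only; drain whatever is already visible in the CQ, then submit):

// Illustrative helper: dispatch pending completions before submitting, so the
// kernel has room to post new CQEs for what we are about to submit.
fn submit_with_drain(ring: &mut io_uring::IoUring) -> std::io::Result<usize> {
    for cqe in ring.completion() {
        // Dispatch the completion to its owning op.
        let _ = cqe;
    }
    ring.submit()
}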

@Noah-Kennedy
Contributor

Noah-Kennedy commented Oct 26, 2022

We should also be using NODROP, but I don't think that is in tokio-rs/io-uring yet. We would need to add that flag there first and get a release of it out.

@Noah-Kennedy
Contributor

I forgot that NODROP is a feature, not a flag...

@Noah-Kennedy
Contributor

Does this occur with the NODROP feature enabled?

@FrankReh
Collaborator

I've got buffer ring groups working in a fork and streaming multi-accept working in a fork. I think I'd like to get those in before we try to address how to handle a flood of submissions or a flood of responses, because then it is easier to create the floods and we can see all the reasons for doing so more easily (I think).

But my streaming multi-accept has a cancel ability in it that may not fit with your plans for cancel, so I'd like it to get a little more visibility because it has a nice aspect or two.

@FrankReh
Collaborator

Does this occur with the NODROP feature enabled?

NODROP just means the kernel will queue things up in extra kernel memory while waiting for us to drain the cq. So it doesn't help if our problem is that the kernel isn't seeing our sq entries, or that we aren't pulling from the cq because we thought we already had, when what we actually failed to do was ask for more to be pulled. That's what using the OVERFLOW bit is for.
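
Roughly, in io-uring crate terms (a sketch from memory; I'm assuming an accessor along the lines of SubmissionQueue::cq_overflow() for the IORING_SQ_CQ_OVERFLOW flag, so the exact names may differ):

use io_uring::IoUring;

// NODROP (IORING_FEAT_NODROP) only means the kernel buffers completions it
// could not post because our CQ ring was full; it does not make them visible
// to us by itself.
fn drain(ring: &mut IoUring) -> std::io::Result<()> {
    // Take whatever is already visible in the CQ ring.
    for cqe in ring.completion() {
        let _ = cqe; // dispatch to the owning op
    }

    // If the kernel flagged a CQ overflow, it is still holding completions
    // for us, but we have to enter the kernel again (GETEVENTS) so it can
    // flush them into the now-empty CQ ring; otherwise the ops waiting on
    // them hang.
    while ring.submission().cq_overflow() {
        ring.submit_and_wait(1)?;
        for cqe in ring.completion() {
            let _ = cqe; // dispatch these too
        }
    }
    Ok(())
}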

@Noah-Kennedy
Contributor

Ah, I see the issue. I didn't realize that the problem was that we had extra completions we didn't notice due to overflow. Sorry for the confusion!

@FrankReh
Collaborator

FrankReh commented Oct 26, 2022

Once I started handling the OVERFLOW bit, my test hangs went away.

My streaming fork is waiting for the slab linked list work by @ollie-etl to be merged. I think that PR was left in his court but I'll double check...

Yes, I did a review for him yesterday and he'll also need to resolve conflicts with the new master.

@FrankReh
Collaborator

It's not a blocker, but I want to get that writev PR committed, as the author put good work into it and we should be able to clear it off our table with no other dependencies.

@Noah-Kennedy
Contributor

I understand your earlier comments now. Once we get through the current PR queue, we can get this fix in and do a release.

@FrankReh
Collaborator

How about a release 4.0 tomorrow or Friday, regardless of what else is done? And none of our planned work is a breaking change, so maybe a 4.1 about a week later? I'll work on the changelog too. What's the best way to share changelog entries with you? Just a regular PR?

@FrankReh
Collaborator

I meant 0.4.0 and 0.4.1.

@Noah-Kennedy
Contributor

@FrankReh we should have done this a while ago TBH, but we need a CHANGELOG.md file where we can document changes. We will format it the same way tokio does, and we can use it for the release notes on the release tags.

@Noah-Kennedy
Contributor

We should probably list this as a "known issue"

@FrankReh
Collaborator

We should probably list this as a "known issue"

But it's good to see if it can be addressed while the OP has cycles and the ability to easily reproduce it on their end. With your PR in, let's see if @ollie-etl can reproduce the problem.

I can make it my priority to get something that handles OVERFLOW in, for @ollie-etl to then check out. I don't know that this can't be solved relatively easily, so let's see about getting this closed quickly.

@FrankReh
Collaborator

I could create a self-contained minimal reproducer that hangs at exactly the size of the completion queue, just as the title predicts.

@oliverbunting

My, you guys have been busy! I'll try to take a look at these this morning.

@FrankReh
Collaborator

I have a solution that solves it for my test case. It just involves a change to the io-uring crate. The tokio-uring crate can be left at master.

@Noah-Kennedy, @ollie-etl, would you like to change your Cargo.toml to use my fork/branch and report whether it fixes the problem you are seeing too?

io-uring = { git = "https://github.com/frankreh/io-uring", branch = "frankreh/cqueue-overflow", features = ["unstable"] }

@Noah-Kennedy
Contributor

@FrankReh what did you change?

@FrankReh
Collaborator

I changed the overflow bit handling in submit. I think I sent the idea out last night in one of these other issues.

@Noah-Kennedy
Contributor

Noah-Kennedy commented Oct 27, 2022

How did you change it though, and do you think that this would be appropriate for a PR in io-uring?

@FrankReh
Collaborator

Yes, I think it's appropriate. Just easier if someone else can vouch for it making a difference too.

@FrankReh
Collaborator

How did you change it though, and do you think that this would be appropriate for a PR in io-uring?

I don't understand where this question is coming from. Isn't it easier for you to look at the diff of the commit I put into that branch?

@Noah-Kennedy
Contributor

Sorry, I was on mobile earlier and not able to easily see it; I should have mentioned that. Looking at your change right now, this looks fine.

@FrankReh
Collaborator

This should be solved once tokio-rs/io-uring#152 or its equivalent is merged and this crate's version of io-uring is appropriately bumped.
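
For reference, the gist of the change (an illustrative sketch against the public API, not the actual diff in that branch): when the CQ has overflowed, the enter syscall has to ask for events even if there is nothing new to submit, so the kernel flushes the buffered completions back into the CQ ring.

// Illustrative only; the real change lives inside the io-uring crate's submit
// path. `cq_overflowed` stands in for the IORING_SQ_CQ_OVERFLOW flag that the
// crate reads from the SQ ring flags.
fn submit_flushing_overflow(
    ring: &io_uring::IoUring,
    cq_overflowed: bool,
) -> std::io::Result<usize> {
    if cq_overflowed {
        // Waiting for at least one completion makes the enter call pass
        // IORING_ENTER_GETEVENTS; the kernel flushes the overflowed
        // completions into the CQ ring, which satisfies the wait.
        ring.submit_and_wait(1)
    } else {
        ring.submit()
    }
}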

@Noah-Kennedy
Contributor

Does that change alone fix this, or is #152 also needed?

@Noah-Kennedy
Contributor

Nevermind, I just saw the PR in here.
