Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

driver: MultiCqe support for Driver Drop #164

Merged

Conversation

ollie-etl
Copy link
Contributor

@ollie-etl ollie-etl commented Nov 4, 2022

Handles LifeCycle::CompletionList in driver Drop logic

During the first pass over all lifecycles, identifies if CompletionLists are finished or not.

  • If they are, it changes the Lifecycle to Completed with null data.
  • If they are not, adds them to the cancel list, and converts to Ignored

This means that after this pass, we will never observe CompletionList in the lifecycles. This means the existing termination logic is ok unchanged. The ignored entry will stay until removed from the Lifecycle slab by the call to complete.

Fixes a gap left by #158 and unblocks #123

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 4, 2022

@Noah-Kennedy @FrankReh Could I get a review? I'm keen to unblock #123

@ollie-etl
Copy link
Contributor Author

I've another commit for this, in progress

src/driver/mod.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Wish it were testable because it is getting just complicated enough now that we might be overlooking something. The rust compiler and our ability to reason through the logic makes me pretty sure this does what is intended.

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 4, 2022

@FrankReh : I've made an update optimizing the drop logic

  • Now We don't have to allocate vec, we just mark the Op as ignored, and submit completions for those that have been ignored. This is ok, as no poll will occur during Drop, and the slab will be gone once its finished drop.

  • During the await loop, remove the Completed items from the slab, so we don't keep iterating over them. There is some odd structure there to avoid double borrows

@ollie-etl
Copy link
Contributor Author

I do wonder about both this and the original though. How do we know we've awaited all the AsyncCancel's we've pushed? I don't think either the oringal or this handle that properly

@FrankReh
Copy link
Collaborator

FrankReh commented Nov 4, 2022

How do we know we've awaited all the AsyncCancel's we've pushed?

It doesn't need to wait for the AsyncCancel's to be done. It uses a loop at the bottom to wait for the original's to move to complete I believe. I'll double check that now.

@FrankReh
Copy link
Collaborator

FrankReh commented Nov 4, 2022

Odd, this page. My approval shouldn't be counted if another change comes in.

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 4, 2022

Well there was a bug in the original - things wouldn't move to complete. Submitted may have moved to waiting, and would never be polled. That is fixed here

Looking at the wrong section

@ollie-etl ollie-etl requested a review from FrankReh November 4, 2022 15:06
@ollie-etl
Copy link
Contributor Author

@FrankReh thats a repo setting.

@ollie-etl
Copy link
Contributor Author

@FrankReh I'm happy this and the original awaited all the ops in the slab. I just don't know it its safe to drop the rings if there are in-fligh AsyncCancels, which we don't account for

src/driver/mod.rs Outdated Show resolved Hide resolved
@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 4, 2022

@FrankReh , I'm going to have 1 more edit. I think we can do the wait logic in a single pass of the slab

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 4, 2022

Ignore that - its too big a change. I'm happy with this PR now. I will not be pressing merge, i'll leave that to @FrankReh

Interesting - thats gone from being possible to me being not on the list - which is what I expected

Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like this version as much. Do I misunderstand what's going on in that final loop?

src/driver/mod.rs Outdated Show resolved Hide resolved
@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 4, 2022

@FrankReh Cancel is called exactly once for each outstanding op. The logic is pretty much the same as the orignal, only without a vec

  • Iterate over the slab, and change each lifecycle to either completed or ignored
  • iterate again (which may be more expensive than iterating over the vec, but doesn't allocate), and for each item marked ignore, issue a cancel exactty once for each item marked Ignored
  • Iterate a final time, waiting for the slab to be empty, which will occur once all the completions come in.

@FrankReh
Copy link
Collaborator

FrankReh commented Nov 4, 2022

I just don't know it its safe to drop the rings if there are in-fligh AsyncCancels, which we don't account for

I've never read a warning in the liburing stuff about having to wait for quiescence before closing a uring fd. I believe we should proceed like it is safe. If you want, I or one of us can pose the question to the liburing author as an issue on their git repo but I really think the answer is the kernel protects itself from the uring fd being closed and doesn't need the user land to wait for things to be cleaned up.

So to be clear, I think it's a good idea we make the drop wait to see the active operations completed but we don't need to wait to see our async cancels themselves being completed. It's only against the kernel modifying userland data that we are trying to protect ourselves but async cancels don't use userland buffers.

@ollie-etl
Copy link
Contributor Author

@FrankReh The number of defunct processes I've had whilst doing dev this week say otherwise

@FrankReh
Copy link
Collaborator

FrankReh commented Nov 4, 2022

@FrankReh Cancel is called exactly once for each outstanding op. The logic is pretty much the same as the orignal, only without a vec

  • Iterate over the slab, and change each lifecycle to either completed or ignored
  • iterate again (which may be more expensive than iterating over the vec, but doesn't allocate), and for each item marked ignore, issue a cancel exactty once for each item marked Ignored
  • Iterate a final time, waiting for the slab to be empty, which will occur once all the completions come in.

Fair enough. Let me look through it again.

@ollie-etl
Copy link
Contributor Author

Although, maybe AsyncCanels are ok. I don't know. I've been ungracefully exiting with send_zc buffers in flight

Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we should use a want=1 when waiting. The io-uring crate may provide a wait with timeout however. Or a scheme that just busy waits seems safer.

src/driver/mod.rs Show resolved Hide resolved
src/driver/mod.rs Outdated Show resolved Hide resolved
src/driver/mod.rs Outdated Show resolved Hide resolved
@FrankReh
Copy link
Collaborator

FrankReh commented Nov 4, 2022

@ollie-etl I don't think you should be trying to get this mergable today. I'm assuming you want it in the release but if a release goes out this weekend like Noah wondered about, I think we should have a follow up release very soon after that with the PRs that have been consuming so much of our time over the last two weeks. I wouldn't even try to get any of my work into the follow on. It would be purely to catch up to with the current PRs that were ready.

@ollie-etl
Copy link
Contributor Author

@FrankReh I think I've addressed the quadratic behaviour

@ollie-etl
Copy link
Contributor Author

@frank: Now I've seen the path to a single traversal, what do you think about refactoring the whole lot from 3 passes to 1? That would imply we wait for each op to complete or be cancelled before moving onto the next.

Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve but I want @Noah-Kennedy to have a look and approve too. This is significant.

src/driver/mod.rs Show resolved Hide resolved
@FrankReh FrankReh requested a review from Noah-Kennedy November 4, 2022 17:10
Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to ask for comment fixes when Noah hasn't reviewed yet. But the comments in the right place might make it easier for him to reason through too.

src/driver/mod.rs Outdated Show resolved Hide resolved
src/driver/mod.rs Show resolved Hide resolved
src/driver/mod.rs Outdated Show resolved Hide resolved
@FrankReh
Copy link
Collaborator

FrankReh commented Nov 4, 2022

Looking at the wrong section

@ollie-etl Yes, easy to look in the wrong section. So for @Noah-Kennedy 's benefit, we no longer think this fixes his earlier version. This just makes it possible to handle CompletionList entries too. But I probably like the last loop here better, even without counting the handling of CompletionList being included here.

@ollie-etl
Copy link
Contributor Author

@FrankReh Comments addressed

Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see this go in if @Noah-Kennedy finds it passes all the stress tests he had run before. I would expect this to be tweaked down the road, at least as we look for a way that avoids a possible wait hang with want=1. But maybe not before what could be today's release.

src/driver/mod.rs Show resolved Hide resolved
@Noah-Kennedy
Copy link
Contributor

@ollie-etl please run cargo fmt, then I will merge.

@ollie-etl
Copy link
Contributor Author

@Noah-Kennedy Done.

@Noah-Kennedy Noah-Kennedy enabled auto-merge (squash) November 5, 2022 21:30
@ollie-etl
Copy link
Contributor Author

@Noah-Kennedy I think this needs an approval from you, again it shows as blocked on approval. I guess the old one is stale?

@Noah-Kennedy Noah-Kennedy merged commit f5f1724 into tokio-rs:master Nov 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants