
rt: refactor runtime to avoid Rc<RefCell<...>> #142

Merged: 6 commits into master from noah/runtime-refactor, Oct 26, 2022

Conversation

Noah-Kennedy
Contributor

This change refactors the runtime to not use reference counting directly in the ops themselves. Instead, ops access the driver via thread local variables.

This is sound because dropping the driver (which happens when it is removed from its thread-local state) blocks the thread until all ops complete, ensuring that we do not free the contents of the driver until after all operations have completed.

@Noah-Kennedy
Contributor Author

Huh, that doesn't happen on my machine.

@Noah-Kennedy
Contributor Author

Hmm, actually it does; it's just dependent on the order the tests run in (maybe?).

@Noah-Kennedy
Contributor Author

Never mind. CLion has a bug and apparently hides test failures that involve double panics.

@Noah-Kennedy
Contributor Author

Well that worked.

@FrankReh
Collaborator

Nice. You've removed the Rc from the highest level because the driver reference isn't being kept per running Op any longer. The running Op can access the TLS variable to get to the slab and the uring instance.

This doesn't completely rule out ever supporting multiple urings in the same current-thread runtime, but that would require some further refactoring, as it would have anyway. It's good to differentiate the runtime in this crate from the uring instance; that should allow further flexibility later on.

@Noah-Kennedy
Contributor Author

I'm going to be opening several additional refactoring PRs. I'd like to combine a lot of our TCP and Unix logic similar to what tokio does internally with PollEvented.

@ollie-etl
Contributor

I like this

@ollie-etl
Contributor

ollie-etl commented Oct 24, 2022

Hmm, I tried merging #144 with this, and it seems this gives a performance regression. I have to admit I'm surprised by this.
Values are relative to #144.

no_op/1                 time:   [205.00 ms 205.42 ms 205.85 ms]
                        thrpt:  [485.78 Kelem/s 486.81 Kelem/s 487.81 Kelem/s]
                 change:
                        time:   [+35.581% +36.358% +37.090%] (p = 0.00 < 0.05)
                        thrpt:  [-27.055% -26.664% -26.244%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
no_op/32                time:   [29.356 ms 29.435 ms 29.521 ms]
                        thrpt:  [3.3874 Melem/s 3.3973 Melem/s 3.4064 Melem/s]
                 change:
                        time:   [+3.9926% +4.6145% +5.2585%] (p = 0.00 < 0.05)
                        thrpt:  [-4.9958% -4.4110% -3.8393%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
no_op/64                time:   [26.168 ms 26.226 ms 26.291 ms]
                        thrpt:  [3.8035 Melem/s 3.8130 Melem/s 3.8215 Melem/s]
                 change:
                        time:   [+3.5382% +4.0286% +4.5123%] (p = 0.00 < 0.05)
                        thrpt:  [-4.3175% -3.8725% -3.4173%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
no_op/256               time:   [24.032 ms 24.080 ms 24.135 ms]
                        thrpt:  [4.1434 Melem/s 4.1529 Melem/s 4.1612 Melem/s]
                 change:
                        time:   [-0.1899% +0.1773% +0.5497%] (p = 0.36 > 0.05)
                        thrpt:  [-0.5467% -0.1770% +0.1903%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

     Running benches/lai/no_op.rs (target/release/deps/lai_no_op-60be63e49ca92170)
runtime_only
  Instructions:               30336 (+6.464519%)
  L1 Accesses:                43128 (+6.368076%)
  L2 Accesses:                  282 (+16.52893%)
  RAM Accesses:                1014 (+3.787103%)
  Estimated Cycles:           80028 (+5.367935%)

no_op_x1
  Instructions:           782557272 (+55.93528%)
  L1 Accesses:           1167943224 (+54.27984%)
  L2 Accesses:              4511358 (+204.9696%)
  RAM Accesses:                1432 (+9.647779%)
  Estimated Cycles:      1190550134 (+55.73512%)

no_op_x32
  Instructions:           142350658 (+9.518945%)
  L1 Accesses:            221759389 (+8.211193%)
  L2 Accesses:               143991 (+208.8676%)
  RAM Accesses:                1497 (+9.190372%)
  Estimated Cycles:       222531739 (+8.439342%)

no_op_x64
  Instructions:           132014910 (+6.487231%)
  L1 Accesses:            206460259 (+5.337330%)
  L2 Accesses:                96183 (+155.6155%)
  RAM Accesses:                1565 (+8.756081%)
  Estimated Cycles:       206995949 (+5.482285%)

no_op_x256
  Instructions:           124419880 (+3.999416%)
  L1 Accesses:            194515551 (+2.956976%)
  L2 Accesses:               767606 (+18.30563%)
  RAM Accesses:                1990 (+6.588109%)
  Estimated Cycles:       198423231 (+3.217230%)

@ollie-etl
Contributor

More notes on performance: adding LTO flags to the release build (git cherry-pick 7d3d8df), vs #144 with the same flags.

The fairly dreadful no_op/1 benchmark regresses significantly vs #144, but the 256-way concurrency is actually better.

no_op/1                 time:   [182.93 ms 183.70 ms 184.50 ms]
                        thrpt:  [541.99 Kelem/s 544.37 Kelem/s 546.67 Kelem/s]
                 change:
                        time:   [+40.007% +40.654% +41.302%] (p = 0.00 < 0.05)
                        thrpt:  [-29.229% -28.904% -28.575%]
                        Performance has regressed.

no_op/256               time:   [19.945 ms 20.011 ms 20.087 ms]
                        thrpt:  [4.9784 Melem/s 4.9973 Melem/s 5.0139 Melem/s]
                 change:
                        time:   [-3.8566% -3.4359% -2.9896%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0817% +3.5582% +4.0114%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

no_op_x1
  Instructions:           566777937 (+53.60089%)
  L1 Accesses:            828856458 (+51.37721%)
  L2 Accesses:                  312 (-99.96116%)
  RAM Accesses:                1202 (+8.679928%)
  Estimated Cycles:       828900088 (+50.27212%)

no_op_x256
  Instructions:            89362937 (-1.095379%)
  L1 Accesses:            140236031 (-2.657184%)
  L2 Accesses:               732513 (+14.23632%)
  RAM Accesses:                1760 (+5.960265%)
  Estimated Cycles:       143960196 (-2.286151%)

@Noah-Kennedy
Contributor Author

This is likely noise. We might want to try using a longer benchmark period for this.

@ollie-etl
Contributor

ollie-etl commented Oct 24, 2022

@Noah-Kennedy the benchmarks perform 100k no-ops per iteration and iterate 100-300 times, so each criterion benchmark covers 10M+ operations. Each takes 5-20 seconds. I can reproduce this very consistently.

@Noah-Kennedy
Contributor Author

Ah, interesting.

@Noah-Kennedy
Contributor Author

I'll profile this tonight then to see what is happening.

@ollie-etl
Contributor

ollie-etl commented Oct 24, 2022

@Noah-Kennedy : You may find simple-benchmark-2 from git@github.com:etlsystems/tokio-uring helpful. It extends the simple-benchmark I've submitted as a PR with a few things:

Given your profile indicates you work for Cloudflare, I suspect you're fairly clued up on profiling anyway. My experience is if I haven't got something terribly wrong, a criterion result of more than 2% is real. Indeed, trying to find many 2% improvements is basically how I spend my days.

I'm having a look at the benchmarks on master currently

@ollie-etl
Contributor

And looking at the flamegraphs, you can just disregard the low-concurrency numbers; thread_park totally dominates.

@FrankReh
Collaborator

> the fairly dreadful no_op/1 benchmark regresses significantly vs #144, but the 256 way concurrency is actually better.

How should the no_op/1 and no_op/256 names be interpreted? Does the / indicate the number of concurrent tasks spawned? What does it mean that there are per-second numbers? Is a spawned task replaced by another when the first one finishes? Or are a certain number run per second?

I'm wondering if no_op/1 means just one operation is performed at a time? If so, there is little stress on the system, and the test has to wait for the round trip from userland to the kernel and back to userland for a single operation before the next is started.

And how many cores is this running on? Sorry if that is mentioned somewhere already. And is this a native machine or a VM on a hypervisor?

@Noah-Kennedy
Contributor Author

Do we want to gate this PR on improving the no-op perf? I'd vote for not doing so, but I'd like to get your thoughts @ollie-etl @FrankReh.

@FrankReh
Collaborator

> Do we want to gate this PR on improving the no-op perf? I'd vote for not doing so, but I'd like to get your thoughts @ollie-etl @FrankReh.

I thought that's what the two of you wanted and maybe you, Noah, were going to find time to check out the criterion results.

I would rather we get this committed after the review that I started a few minutes ago. At least two of us want to get better at criterion analysis, but we should err on the side of cleaning up the design so further work can be more fruitful. There is a backlog of work the three of us have, as well as PRs in the queue, and we now have the chance to get to them all soon.

@Noah-Kennedy
Contributor Author

@FrankReh I think that we should probably merge this asap since it touches things we want to change in other PRs. Merging it post-review sounds like a good idea here.

I'll resolve the conflicts.

Noah-Kennedy and others added 3 commits October 26, 2022 12:19
@Noah-Kennedy
Contributor Author

@FrankReh merge conflicts are resolved

@oliverbunting

From my perspective, please merge

@Noah-Kennedy
Contributor Author

Cool, as soon as @FrankReh signs off, I'll merge.

@FrankReh
Collaborator

Cool. Small questions.

Review comments on: src/driver/mod.rs, src/runtime/context.rs, src/runtime/mod.rs, src/util.rs
@FrankReh
Collaborator

Be back in an hour.

@Noah-Kennedy
Contributor Author

@FrankReh thanks for the review! I've addressed your findings.

@FrankReh
Collaborator

Looks very good to me!

@Noah-Kennedy Noah-Kennedy merged commit c069cc6 into master Oct 26, 2022
@Noah-Kennedy Noah-Kennedy deleted the noah/runtime-refactor branch October 26, 2022 19:19
@FrankReh
Collaborator

Interesting. I've done a few rounds of testing my buf_ring branch with these changes. Nothing is broken, but it could be that the driver tick isn't being called as often as it had been before. That's interesting. We want to visit a separate mechanism for submission and completion checking anyway.

@Noah-Kennedy
Contributor Author

Oh we absolutely need to. Like I've said before, we should do this the same way that tokio invokes epoll_wait, which is to say by leveraging the maintenance hook for periodic polling, and by then making use of parking (as we do now).

@Noah-Kennedy
Contributor Author

I suspect that the main impacting change is moving the driver ticking to a separate task.

@FrankReh
Collaborator

> I suspect that the main impacting change is moving the driver ticking to a separate task.

Me too. The tokio scheduler could make us crazy if we worry too much about which task gets to go most often and how long the ready queue is getting in our stress/flood tests.

@Noah-Kennedy
Contributor Author

It would probably be best if we avoided using the task as anything other than a way of getting tokio to unpark when uring completions arrive. I suspect that driving completions on unpark could improve perf a fair bit, as could periodically flushing the queue every X ticks by getting something into tokio to hook into the maintenance routine.

ollie-etl pushed a commit to etlsystems/tokio-uring that referenced this pull request Oct 27, 2022
- tokio-rs#148 exposed Runtime. It is now possible to use Criterion's
  async support directly. Do so.

- Adds criterion support for flamegraph generation of only the
  benchmarked code (pprof).

- Now crashes on the second call to block_on, as the AsyncFd still exists.
  Introduced in tokio-rs#142.
Noah-Kennedy pushed a commit that referenced this pull request Nov 5, 2022
# 0.4.0 (November 5th, 2022)

### Fixed

- Fix panic in Deref/DerefMut for Slice extending into uninitialized
part of the buffer ([#52])
- docs: all-features = true ([#84])
- fix fs unit tests to avoid parallelism ([#121])
- Box the socket address to allow moving the Connect future ([#126])
- rt: Fix data race ([#146])

### Added

- Implement fs::File::readv_at()/writev_at() ([#87])
- fs: implement FromRawFd for File ([#89])
- Implement `AsRawFd` for `TcpStream` ([#94])
- net: add TcpListener.local_addr method ([#107])
- net: add TcpStream.write_all ([#111])
- driver: add Builder API as an option to start ([#113])
- Socket and TcpStream shutdown ([#124])
- fs: implement fs::File::from_std ([#131])
- net: implement FromRawFd for TcpStream ([#132])
- fs: implement OpenOptionsExt for OpenOptions ([#133])
- Add NoOp support ([#134])
- Add writev to TcpStream ([#136])
- sync TcpStream, UnixStream and UdpSocket functionality ([#141])
- Add benchmarks for no-op submission ([#144])
- Expose runtime structure ([#148])

### Changed

- driver: batch submit requests and add benchmark ([#78])
- Depend on io-uring version ^0.5.8 ([#153])

### Internal Improvements

- chore: fix clippy lints ([#99])
- io: refactor post-op logic in ops into Completable ([#116])
- Support multi completion events: v2 ([#130])
- simplify driver operation futures ([#139])
- rt: refactor runtime to avoid Rc\<RefCell\<...>> ([#142])
- Remove unused dev-dependencies ([#143])
- chore: types and fields explicitly named ([#149])
- Ignore errors from uring while cleaning up ([#154])
- rt: drop runtime before driver during shutdown ([#155])
- rt: refactor drop logic ([#157])
- rt: fix error when calling block_on twice ([#162])

### CI changes

- chore: update actions/checkout action to v3 ([#90])
- chore: add all-systems-go ci check ([#98])
- chore: add clippy to ci ([#100])
- ci: run cargo test --doc ([#135])


[#52]: #52
[#78]: #78
[#84]: #84
[#87]: #87
[#89]: #89
[#90]: #90
[#94]: #94
[#98]: #98
[#99]: #99
[#100]: #100
[#107]: #107
[#111]: #111
[#113]: #113
[#116]: #116
[#121]: #121
[#124]: #124
[#126]: #126
[#130]: #130
[#131]: #131
[#132]: #132
[#133]: #133
[#134]: #134
[#135]: #135
[#136]: #136
[#139]: #139
[#141]: #141
[#142]: #142
[#143]: #143
[#144]: #144
[#146]: #146
[#148]: #148
[#149]: #149
[#153]: #153
[#154]: #154
[#155]: #155
[#157]: #157
[#162]: #162