feat: Use FxHasher in places where we don't need DDoS resistance #2342
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##             main    #2342   +/-  ##
=======================================
  Coverage   95.39%   95.40%
=======================================
  Files         115      115
  Lines       36982    36982
  Branches    36982    36982
=======================================
+ Hits        35279    35281     +2
+ Misses       1697     1694     -3
- Partials        6        7     +1
```

☔ View full report in Codecov by Sentry.
Failed Interop Tests: QUIC Interop Runner, client vs. server; differences relative to 1df2a5a. (neqo-latest as client, neqo-latest as server)
Succeeded Interop Tests: QUIC Interop Runner, client vs. server. (neqo-latest as client, neqo-latest as server)
Unsupported Interop Tests: QUIC Interop Runner, client vs. server. (neqo-latest as client, neqo-latest as server)
Benchmark results: Performance differences relative to 1df2a5a.

- decode 4096 bytes, mask ff: No change in performance detected. time: [12.320 µs 12.361 µs 12.411 µs]; change: [-1.1343% -0.2400% +0.4683%] (p = 0.60 > 0.05)
- decode 1048576 bytes, mask ff: 💚 Performance has improved. time: [2.8331 ms 2.8421 ms 2.8516 ms]; change: [-8.2453% -7.8524% -7.4374%] (p = 0.00 < 0.05)
- decode 4096 bytes, mask 7f: No change in performance detected. time: [20.859 µs 20.915 µs 20.975 µs]; change: [-0.5860% -0.0010% +0.5818%] (p = 1.00 > 0.05)
- decode 1048576 bytes, mask 7f: 💚 Performance has improved. time: [4.5404 ms 4.5619 ms 4.5952 ms]; change: [-13.094% -12.629% -11.975%] (p = 0.00 < 0.05)
- decode 4096 bytes, mask 3f: No change in performance detected. time: [8.2845 µs 8.3215 µs 8.3636 µs]; change: [+0.0551% +1.1111% +2.5631%] (p = 0.09 > 0.05)
- decode 1048576 bytes, mask 3f: 💚 Performance has improved. time: [1.5913 ms 1.6016 ms 1.6150 ms]; change: [-9.5069% -8.9206% -8.1949%] (p = 0.00 < 0.05)
- coalesce_acked_from_zero 1+1 entries: No change in performance detected. time: [90.959 ns 91.253 ns 91.546 ns]; change: [-0.5692% -0.0696% +0.4224%] (p = 0.80 > 0.05)
- coalesce_acked_from_zero 3+1 entries: No change in performance detected. time: [109.59 ns 109.87 ns 110.18 ns]; change: [-1.2378% -0.5216% -0.0079%] (p = 0.10 > 0.05)
- coalesce_acked_from_zero 10+1 entries: No change in performance detected. time: [109.47 ns 110.11 ns 110.86 ns]; change: [-1.1644% -0.3243% +0.3417%] (p = 0.43 > 0.05)
- coalesce_acked_from_zero 1000+1 entries: No change in performance detected. time: [93.224 ns 93.416 ns 93.637 ns]; change: [-1.8505% -0.5499% +0.6714%] (p = 0.42 > 0.05)
- RxStreamOrderer::inbound_frame(): Change within noise threshold. time: [111.71 ms 111.84 ms 112.06 ms]; change: [-0.5380% -0.4001% -0.1859%] (p = 0.00 < 0.05)
- SentPackets::take_ranges: No change in performance detected. time: [5.2751 µs 5.4262 µs 5.5801 µs]; change: [-5.2229% +3.1104% +18.372%] (p = 0.74 > 0.05)
- transfer/pacing-false/varying-seeds: Change within noise threshold. time: [34.013 ms 34.082 ms 34.162 ms]; change: [-0.8593% -0.5632% -0.2669%] (p = 0.00 < 0.05)
- transfer/pacing-true/varying-seeds: Change within noise threshold. time: [34.250 ms 34.313 ms 34.380 ms]; change: [-0.7082% -0.4630% -0.2111%] (p = 0.00 < 0.05)
- transfer/pacing-false/same-seed: Change within noise threshold. time: [34.302 ms 34.354 ms 34.410 ms]; change: [-1.3476% -1.0841% -0.8252%] (p = 0.00 < 0.05)
- transfer/pacing-true/same-seed: Change within noise threshold. time: [34.257 ms 34.319 ms 34.388 ms]; change: [-1.9185% -1.6382% -1.3676%] (p = 0.00 < 0.05)
- 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved. time: [800.15 ms 811.32 ms 822.73 ms]; thrpt: [121.55 MiB/s 123.26 MiB/s 124.98 MiB/s]; change: time: [-7.9392% -6.1693% -4.3384%] (p = 0.00 < 0.05), thrpt: [+4.5352% +6.5749% +8.6239%]
- 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💚 Performance has improved. time: [301.67 ms 305.86 ms 310.08 ms]; thrpt: [32.250 Kelem/s 32.695 Kelem/s 33.149 Kelem/s]; change: time: [-7.3405% -5.7200% -4.0728%] (p = 0.00 < 0.05), thrpt: [+4.2457% +6.0670% +7.9220%]
- 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected. time: [25.409 ms 25.558 ms 25.708 ms]; thrpt: [38.899 elem/s 39.127 elem/s 39.357 elem/s]; change: time: [-0.6559% +0.1866% +1.0525%] (p = 0.66 > 0.05), thrpt: [-1.0415% -0.1862% +0.6602%]
- 1-conn/1-100mb-resp/mtu-1504 (aka. Upload)/client: 💚 Performance has improved. time: [1.7796 s 1.8019 s 1.8241 s]; thrpt: [54.821 MiB/s 55.498 MiB/s 56.193 MiB/s]; change: time: [-5.4595% -4.0571% -2.6755%] (p = 0.00 < 0.05), thrpt: [+2.7491% +4.2286% +5.7748%]

Client/server transfer results: Performance differences relative to 1df2a5a. Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.
Signed-off-by: Lars Eggert <lars@eggert.org>
👍 in general.
That said, I would prefer only replacing the std::collections Hash* types where it proves beneficial, e.g. not in unit tests.
I've spotted two places where EnumMap could give us bigger wins.
I think that your performance gains largely derive from the changes to the client and server code. There, the security risk is limited (we're not using this server in real deployments).
Still, you should review the changes for security risk. This hasher could expose us to DoS if the hashed values are controlled by an adversary. I've checked the usage in our server code, which is fine because attackers don't get to control memory allocations (we use pointer values for the hash). Still, that makes me wonder whether we should be using Pin.
@martinthomson thanks for the analysis. My plan is to add some benches first in another PR; I'll add some for those instances you suggest looking into. Even if some of the macro benefits come from speeding up the demo client and server code, it's IMO still worth doing, since eliminating those overheads makes it easier to spot other bottlenecks. About security, I didn't do much of an analysis, but I think the main use of this insecure hasher would be when looking up items (streams, unacked chunks) that, while under the control of an attacker, are quite limited in what valid values they can take without immediately causing a connection closure.
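To make the threat model concrete: the DoS concern exists because FxHasher is unseeded, so its output is predictable across runs. A minimal sketch of the difference (assuming the rustc_hash crate; illustrative, not code from this PR):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};

use rustc_hash::FxHasher;

fn main() {
    // FxHasher is unseeded: this prints the same value on every run,
    // so an adversary who controls the keys can precompute collisions.
    let mut h = BuildHasherDefault::<FxHasher>::default().build_hasher();
    h.write_u64(0x1234);
    println!("fx:  {:016x}", h.finish());

    // std's default RandomState seeds SipHash per process, which is
    // what provides the hash-flooding (DoS) resistance being traded away.
    let s = RandomState::new();
    println!("std: {:016x}", s.hash_one(0x1234_u64));
}
```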
I definitely agree with the point about removing the overheads from our toy code as much as possible. This seems like a pretty substantial win there, so it's worth doing. I doubt that my …
I think the …
```diff
-    pub send_streams: HashMap<StreamId, Box<dyn SendStream>>,
     pub recv_streams: HashMap<StreamId, Box<dyn RecvStream>>,
     streams_with_pending_data: HashSet<StreamId>,
+    pub send_streams: IndexMap<StreamId, Box<dyn SendStream>>,
```
Conceptually, why is an IndexMap faster than a HashMap here? Both end up hashing the key, right?
It's not faster; it's possibly even slower. I made the change based on this comment: https://github.com/mozilla/neqo/pull/2342/files/bd061693b40e91b846c4d4cd1bc0ecfcd27c4e45#r1945509653
Do I understand correctly that we want to be able to iterate send_streams and the like in order? Given that HashMap does not guarantee order, you are suggesting to use IndexMap like we do in neqo-transport.
If so, the requirement for ordering might be worth documenting. It wasn't obvious to me.
In that case, my preference would be to use BTreeMap, solely because it is in std, and to only use IndexMap if it proves meaningfully more performant than BTreeMap.
Given that IndexMap is a fairly small dependency, the above is not a strong opinion.
So I'm not completely sure that we need consistently ordered iteration here. We rely on that in the transport crate because we want frame sending to follow a predictable pattern (clear the old stuff out first). However, that's not something we rely on at this layer of the stack, so whatever is fastest might be the right answer.
Maybe the first question is whether we need (insertion-)ordered iteration over this collection at all?
First, I think we should differentiate between RX and TX.
On RX (both here and in transport), I don't see a point in using anything that maintains order or has semantics that go beyond a simple map or set. I think all we do here is lookups based on received data.
On TX (both here and in transport), do we really care that iteration is consistently ordered? All these data structures should mostly have a stable order while there are no insertions or removals. When there are, do we care that we'll have a different order afterwards? I can't convince myself that we do.
In other words, I think we should use whatever is fastest.
(Also, unrelated, I find the duplication/overlap between the transport and http3 crates really odd.)
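To make the ordering trade-off in this thread concrete, a small illustrative sketch (not code from this PR):

```rust
use std::collections::HashMap;

use indexmap::IndexMap;

fn main() {
    let ids = [3u64, 1, 2];
    let im: IndexMap<u64, ()> = ids.iter().map(|&id| (id, ())).collect();
    let hm: HashMap<u64, ()> = ids.iter().map(|&id| (id, ())).collect();

    // IndexMap iterates in insertion order...
    assert_eq!(im.keys().copied().collect::<Vec<_>>(), vec![3, 1, 2]);
    // ...while HashMap iteration order is arbitrary (and, with a seeded
    // hasher, even varies between runs of the same program).
    println!("{:?}", hm.keys().collect::<Vec<_>>());
}
```

Note that removals matter too: IndexMap::swap_remove is O(1) but perturbs the order, while shift_remove preserves it at O(n) cost.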
```diff
@@ -43,8 +44,8 @@ pub struct WebTransportSession {
     state: SessionState,
     frame_reader: FrameReader,
     events: Box<dyn ExtendedConnectEvents>,
-    send_streams: BTreeSet<StreamId>,
-    recv_streams: BTreeSet<StreamId>,
+    send_streams: HashSet<StreamId>,
```
What is the benefit of a HashSet over a BTreeSet here?
I was assuming HashSet to be faster?
I think we should only make these changes if we have proof that they are more performant.
Connecting this to the discussion above: might ordering (and thus BTreeSet) be relevant here as well?
Fair point about only changing this after benchmarking.
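Since the thread keeps coming back to benchmarking first, a hypothetical criterion sketch of the comparison (the names, sizes, and ID distribution here are invented, not taken from #2444):

```rust
use std::collections::{BTreeSet, HashSet};
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

fn set_lookup(c: &mut Criterion) {
    // Assumed workload: a moderate number of sparse stream IDs.
    let ids: Vec<u64> = (0..1024).map(|i| i * 4).collect();
    let hash: HashSet<u64> = ids.iter().copied().collect();
    let btree: BTreeSet<u64> = ids.iter().copied().collect();

    c.bench_function("HashSet::contains", |b| {
        b.iter(|| ids.iter().filter(|id| hash.contains(black_box(*id))).count())
    });
    c.bench_function("BTreeSet::contains", |b| {
        b.iter(|| ids.iter().filter(|id| btree.contains(black_box(*id))).count())
    });
}

criterion_group!(benches, set_lookup);
criterion_main!(benches);
```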
```diff
@@ -75,6 +78,8 @@ pub use self::{
     version::Version,
 };
 
+pub type IndexMap<K, V> = indexmap::IndexMap<K, V, BuildHasherDefault<FxHasher>>;
```
Why is this more performant? Can you add a comment?
It's probably not; see above as to why.
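For context on the alias pattern itself: the third type parameter is what replaces the default seeded SipHash. A minimal sketch (assuming the rustc_hash crate; the alias names are illustrative, not necessarily the ones in this PR):

```rust
use std::hash::BuildHasherDefault;

use rustc_hash::FxHasher;

// Swap the default hasher for the faster, non-DoS-resistant FxHasher.
pub type FxHashMap<K, V> =
    std::collections::HashMap<K, V, BuildHasherDefault<FxHasher>>;
pub type FxIndexMap<K, V> =
    indexmap::IndexMap<K, V, BuildHasherDefault<FxHasher>>;

fn main() {
    // With a custom hasher there is no `new()`; construct via `default()`.
    let mut m: FxHashMap<u64, &str> = FxHashMap::default();
    m.insert(0, "stream 0");
    assert_eq!(m.get(&0), Some(&"stream 0"));
}
```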
Good point. Though before we introduce the complexity of …
Signed-off-by: Lars Eggert <lars@eggert.org>
Signed-off-by: Lars Eggert <lars@eggert.org>
Definitely the right question to be asking. I think that it might be possible to use the first connection ID as a key for this sort of thing, but we don't tend to keep that around today, once we stop using it. Everything else -- as far as I know -- is ephemeral and therefore not suitable.
I'm doing a benchmark in #2444 to quantify the benefits first. (It's not going well; there's a lot of variation run to run for some reason.)
That is counterintuitive for me, given that it uses …
I wonder if it's the CPU scheduler and frequency control on my Mac. Bencher seems much more stable.
For what it is worth, here is #2444 on my machine:
I don't see much deviation. Am I running the wrong version, @larseggert?
Can you run it again and see if there are changes run to run? That is where I see random improvements or regressions.
Signed-off-by: Lars Eggert <lars@eggert.org>
Here are two more runs with vanilla #2444. No significant deviations. Note that I am not running your optimizations in this pull request.
Oh good. I think it is core pinning being awkward on macOS then. BTW, I came across https://manuel.bernhardt.io/posts/2023-11-16-core-pinning/ today, and we should change the bencher accordingly.
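A sketch of what pinning in the bencher could look like, using the core_affinity crate (an assumption, not what the bencher currently uses; note also that macOS treats affinity as a hint at best, which may be exactly the awkwardness mentioned):

```rust
fn main() {
    // Pin the benchmarking thread to a single core to reduce scheduler noise.
    let cores = core_affinity::get_core_ids().expect("could not query core IDs");
    // Arbitrary choice: the last core often sees less OS housekeeping than core 0.
    let core = *cores.last().expect("at least one core");
    if !core_affinity::set_for_current(core) {
        eprintln!("core pinning not supported on this platform");
    }
    // ... run the benchmark workload here ...
}
```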
I think this may be worthwhile. The cargo bench results don't consistently show a benefit, but the loopback transfers on the bencher machine are faster, e.g., … without this PR but … with it.
(I'll see if I can improve CI so that we also see the differences to main for the table results.)