Speed `getIntLE` from `BytesReference` #90147
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
This speeds up `getIntLE` from `BytesReference`, which we'll be using in the upcoming dense representations for aggregations. Here's the performance:

```
(type)             Mode  Cnt  Before   Error   After   Error  Units
array              avgt    7   1.036 ± 0.062   0.261 ± 0.022  ns/op
paged_bytes_array  avgt    7   5.189 ± 0.172   5.317 ± 0.196  ns/op
composite_256kb    avgt    7  30.792 ± 0.834  11.240 ± 0.387  ns/op
composite_262344b  avgt    7  32.503 ± 1.017  11.155 ± 0.358  ns/op
composite_1mb      avgt    7  25.189 ± 0.449   8.379 ± 0.193  ns/op
```

The `array` method is how we'll use slices that don't span the edges of a netty buffer. The `paged_bytes_array` method doesn't really change and represents the default for internal stuff. I'll bet we could make it faster too, but I don't know that we use it in the hot path. The `composite_<size>` method is how we'll be reading large slabs from the netty byte buffer. We could probably do better if we relied on the sizes of the buffers being even, but we don't presently do that in the composite bytes array.

The different sizes following `composite` show that the performance is dominated by the number of slabs in the composite buffer. `1mb` looks like the largest buffer netty uses and `256kb` is the smallest. The wild byte count (`262344b`) is intentional: it keeps the ints from lining up on sensible boundaries. I don't think we'll use sizes like that, but it doesn't look like it makes a huge difference; we're dominated by the buffer choice.
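For a concrete picture of what the `array` fast path can look like, here's a minimal sketch (class and method names are assumed for illustration; this is not the PR's actual implementation) that reads a little-endian int from a plain `byte[]` in a single `VarHandle` view access instead of assembling four bytes by hand:

```
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch only: one bounds-checked load per int when the slice is
// backed by a plain byte[], rather than four byte reads plus shifts.
final class ArrayIntLE {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int getIntLE(byte[] bytes, int offset) {
        return (int) INT_LE.get(bytes, offset);
    }
}
```

Called as `ArrayIntLE.getIntLE(new byte[] {1, 0, 0, 0}, 0)` this returns `1`; plain `get` access through a byte-array view var handle doesn't require the offset to be aligned, which matters given the deliberately misaligned `262344b` case above.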
LGTM, just some optional suggestions :)
```
public long read() {
    int res = 0;
    for (int i = 0; i < SIZE; i++) {
        res = res ^ read.get(i); // xor everything together so the reads can't be dead-code eliminated
    }
    return res;
}
```
NIT: I think it might give a better benchmark to store `this.read` into a local variable before the loop. I don't think the JIT is able to remove the field access here; hard to tell how much that matters here, though.
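Concretely, the suggestion is something like the following sketch (the field name `read` and the `get(int)` accessor come from the diff above; the `IntReader` type is a placeholder):

```
// Sketch of the suggested change: hoist the field into a local so the
// JIT sees one stable reference for the whole loop instead of a field
// load it has to prove it can eliminate.
public long read() {
    IntReader local = this.read; // placeholder type; 'read' is the field from the diff
    int res = 0;
    for (int i = 0; i < SIZE; i++) {
        res = res ^ local.get(i);
    }
    return res;
}
```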
I can give it a go and see. I thought it'd be able to avoid both the things you mentioned but I dunno! 🧑🔬
It doesn't seem to make a difference. This morning the fast path clocks in at 0.22ns either way. My CPU claims 2.3GHz, which implies 0.43ns per cycle, so when this is an array we're looking at about half a cycle per read. Now I'm curious.
Could be that in this test we're able to inline this fully, so it's a loop over an array and gets vectorized (could verify that by printing assembly, e.g. with JMH's `-prof perfasm`). No need to change anything here then, I guess :)
Here's the hot block:
```
╭ 0x00007f0eb8ff0106: jmpq 0x00007f0eb8ff018b
│ 0x00007f0eb8ff010b: nopl 0x0(%rax,%rax,1)
0.00% │↗ 0x00007f0eb8ff0110: mov %ebx,%r10d ;*iload_1 {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@11 (line 95)
1.74% ││ ↗ 0x00007f0eb8ff0113: mov %r10d,%ebx
││ │ 0x00007f0eb8ff0116: shl $0x2,%ebx
0.00% ││ │ 0x00007f0eb8ff0119: movslq %ebx,%rbx
0.00% ││ │ 0x00007f0eb8ff011c: xor 0x10(%rsi,%rbx,1),%r11d
2.31% ││ │ 0x00007f0eb8ff0121: xor 0x14(%rsi,%rbx,1),%r11d
0.00% ││ │ 0x00007f0eb8ff0126: xor 0x18(%rsi,%rbx,1),%r11d
1.13% ││ │ 0x00007f0eb8ff012b: xor 0x1c(%rsi,%rbx,1),%r11d
1.72% ││ │ 0x00007f0eb8ff0130: xor 0x20(%rsi,%rbx,1),%r11d
1.70% ││ │ 0x00007f0eb8ff0135: xor 0x24(%rsi,%rbx,1),%r11d
1.68% ││ │ 0x00007f0eb8ff013a: xor 0x28(%rsi,%rbx,1),%r11d
1.72% ││ │ 0x00007f0eb8ff013f: xor 0x2c(%rsi,%rbx,1),%r11d
1.64% ││ │ 0x00007f0eb8ff0144: xor 0x30(%rsi,%rbx,1),%r11d
1.76% ││ │ 0x00007f0eb8ff0149: xor 0x34(%rsi,%rbx,1),%r11d
1.71% ││ │ 0x00007f0eb8ff014e: xor 0x38(%rsi,%rbx,1),%r11d
3.52% ││ │ 0x00007f0eb8ff0153: xor 0x3c(%rsi,%rbx,1),%r11d
1.74% ││ │ 0x00007f0eb8ff0158: xor 0x40(%rsi,%rbx,1),%r11d
1.72% ││ │ 0x00007f0eb8ff015d: xor 0x44(%rsi,%rbx,1),%r11d
1.70% ││ │ 0x00007f0eb8ff0162: xor 0x48(%rsi,%rbx,1),%r11d
1.70% ││ │ 0x00007f0eb8ff0167: xor 0x4c(%rsi,%rbx,1),%r11d ;*ixor {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@23 (line 95)
1.64% ││ │ 0x00007f0eb8ff016c: mov %r10d,%ebx
0.00% ││ │ 0x00007f0eb8ff016f: add $0x10,%ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@25 (line 94)
0.00% ││ │ 0x00007f0eb8ff0172: cmp %edx,%ebx
│╰ │ 0x00007f0eb8ff0174: jl 0x00007f0eb8ff0110 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@8 (line 94)
```
This reads to me like it's inlined the array access and unrolled the loop 16x (sixteen scalar xors per iteration, so it isn't actually vectorized). That's with the read going through the field.
Right, the field lookup is irrelevant here then. I guess the unexpected performance is probably due to op fusion in the CPU. It seems possible to fuse those xors in some form, maybe, but that's a little beyond my area of expertise :)
```
int i = getOffsetIndex(index);
int idx = index - offsets[i];
int end = idx + 4;
if (end <= references[i].length()) {
```
NIT: Maybe store `references[i]` in a variable? That should be a little faster here on average, I think, since I'd expect the fast path to be taken most of the time.
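A sketch of what that looks like against the excerpt above (the fast-path body is elided in the diff, so the call shown inside the `if` is illustrative, not the PR's actual code):

```
// Sketch: hoist references[i] so the fast path does one array load
// instead of two (one for length(), one for the read).
BytesReference reference = references[i];
int idx = index - offsets[i];
int end = idx + 4;
if (end <= reference.length()) {
    // fast path (illustrative): the whole int sits inside one slab
    return reference.getIntLE(idx);
}
// otherwise fall back to assembling the four bytes across slab boundaries
```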
This one helps!
Without:

```
Benchmark               (type)         Mode  Cnt  Score   Error  Units
IntArrayBenchmark.read  composite_1mb  avgt    7  8.131 ± 0.138  ns/op
```

With:

```
Benchmark               (type)         Mode  Cnt  Score   Error  Units
IntArrayBenchmark.read  composite_1mb  avgt    7  7.770 ± 0.140  ns/op
```
Update the instructions for running the benchmark with `perfasm`.