Speed `getIntLE` from `BytesReference` #90147
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
This speeds up `getIntLE` from `BytesReference`, which we'll be using in the upcoming dense representations for aggregations. Here's the performance:

```
(type)             Mode  Cnt  Before   Error   After   Error  Units
array              avgt    7   1.036 ± 0.062   0.261 ± 0.022  ns/op
paged_bytes_array  avgt    7   5.189 ± 0.172   5.317 ± 0.196  ns/op
composite_256kb    avgt    7  30.792 ± 0.834  11.240 ± 0.387  ns/op
composite_262344b  avgt    7  32.503 ± 1.017  11.155 ± 0.358  ns/op
composite_1mb      avgt    7  25.189 ± 0.449   8.379 ± 0.193  ns/op
```

The `array` method is how we'll use slices that don't span the edges of a netty buffer. The `paged_bytes_array` method doesn't really change and represents the default for internal stuff. I'll bet we could make it faster too, but I don't know that we use it in the hot path. The `composite_<size>` method is how we'll be reading large slabs from the netty byte buffer. We could probably do better if we relied on the sizes of the buffers being even, but we don't presently do that in the composite bytes array.

The different sizes following `composite` show that the performance is dominated by the number of slabs in the composite buffer. `1mb` looks like the largest buffer netty uses and `256kb` is the smallest. The wild byte count (`262344b`) is intentional: it keeps the ints from lining up on sensible boundaries. I don't think we'll use sizes like that, but it doesn't look like it makes a huge difference; we're dominated by the buffer choice.
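For a concrete picture of what the `array` fast path can look like, here's a minimal sketch (class and method names are assumed for illustration; this is not the PR's actual implementation) that reads a little-endian int from a plain `byte[]` in a single `VarHandle` view access instead of assembling four bytes by hand:

```
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch only: one bounds-checked load per int when the slice is
// backed by a plain byte[], rather than four byte reads plus shifts.
final class ArrayIntLE {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int getIntLE(byte[] bytes, int offset) {
        return (int) INT_LE.get(bytes, offset);
    }
}
```

Called as `ArrayIntLE.getIntLE(new byte[] {1, 0, 0, 0}, 0)` this returns `1`; plain `get` access through a byte-array view var handle doesn't require the offset to be aligned, which matters given the deliberately misaligned `262344b` case above.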
LGTM, just some optional suggestions :)
```
public long read() {
    int res = 0;
    for (int i = 0; i < SIZE; i++) {
        res = res ^ read.get(i); // xor everything together so the reads can't be dead-code eliminated
    }
    return res;
}
```
NIT: I think it might give a better benchmark to store `this.read` into a local variable before the loop. I don't think the JIT is able to remove the field access here; hard to tell how much that matters here, though.
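Concretely, the suggestion is something like the following sketch (the field name `read` and the `get(int)` accessor come from the diff above; the `IntReader` type is a placeholder):

```
// Sketch of the suggested change: hoist the field into a local so the
// JIT sees one stable reference for the whole loop instead of a field
// load it has to prove it can eliminate.
public long read() {
    IntReader local = this.read; // placeholder type; 'read' is the field from the diff
    int res = 0;
    for (int i = 0; i < SIZE; i++) {
        res = res ^ local.get(i);
    }
    return res;
}
```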
I can give it a go and see. I thought it'd be able to avoid both the things you mentioned but I dunno! 🧑🔬
It doesn't seem to make a difference. This morning the fast path clocks in at 0.22ns either way. My CPU claims 2.3GHz, which implies 0.43ns per cycle, so when this is an array we're looking at about half a cycle per read. Now I'm curious.
Could be that in this test we're able to inline this fully, so it's a loop over an array and gets vectorized (could verify that by printing assembly, e.g. with JMH's `-prof perfasm`). No need to change anything here then, I guess :)
Here's the hot block:
```
╭ 0x00007f0eb8ff0106: jmpq 0x00007f0eb8ff018b
│ 0x00007f0eb8ff010b: nopl 0x0(%rax,%rax,1)
0.00% │↗ 0x00007f0eb8ff0110: mov %ebx,%r10d ;*iload_1 {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@11 (line 95)
1.74% ││ ↗ 0x00007f0eb8ff0113: mov %r10d,%ebx
││ │ 0x00007f0eb8ff0116: shl $0x2,%ebx
0.00% ││ │ 0x00007f0eb8ff0119: movslq %ebx,%rbx
0.00% ││ │ 0x00007f0eb8ff011c: xor 0x10(%rsi,%rbx,1),%r11d
2.31% ││ │ 0x00007f0eb8ff0121: xor 0x14(%rsi,%rbx,1),%r11d
0.00% ││ │ 0x00007f0eb8ff0126: xor 0x18(%rsi,%rbx,1),%r11d
1.13% ││ │ 0x00007f0eb8ff012b: xor 0x1c(%rsi,%rbx,1),%r11d
1.72% ││ │ 0x00007f0eb8ff0130: xor 0x20(%rsi,%rbx,1),%r11d
1.70% ││ │ 0x00007f0eb8ff0135: xor 0x24(%rsi,%rbx,1),%r11d
1.68% ││ │ 0x00007f0eb8ff013a: xor 0x28(%rsi,%rbx,1),%r11d
1.72% ││ │ 0x00007f0eb8ff013f: xor 0x2c(%rsi,%rbx,1),%r11d
1.64% ││ │ 0x00007f0eb8ff0144: xor 0x30(%rsi,%rbx,1),%r11d
1.76% ││ │ 0x00007f0eb8ff0149: xor 0x34(%rsi,%rbx,1),%r11d
1.71% ││ │ 0x00007f0eb8ff014e: xor 0x38(%rsi,%rbx,1),%r11d
3.52% ││ │ 0x00007f0eb8ff0153: xor 0x3c(%rsi,%rbx,1),%r11d
1.74% ││ │ 0x00007f0eb8ff0158: xor 0x40(%rsi,%rbx,1),%r11d
1.72% ││ │ 0x00007f0eb8ff015d: xor 0x44(%rsi,%rbx,1),%r11d
1.70% ││ │ 0x00007f0eb8ff0162: xor 0x48(%rsi,%rbx,1),%r11d
1.70% ││ │ 0x00007f0eb8ff0167: xor 0x4c(%rsi,%rbx,1),%r11d ;*ixor {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@23 (line 95)
1.64% ││ │ 0x00007f0eb8ff016c: mov %r10d,%ebx
0.00% ││ │ 0x00007f0eb8ff016f: add $0x10,%ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@25 (line 94)
0.00% ││ │ 0x00007f0eb8ff0172: cmp %edx,%ebx
│╰ │ 0x00007f0eb8ff0174: jl 0x00007f0eb8ff0110 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@8 (line 94)
```
This reads to me like it's inlined the array access and unrolled the loop 16x (sixteen scalar xors per iteration, so it isn't actually vectorized). That's with the read going through the field.
Right, the field lookup is irrelevant here then. I guess the unexpected performance is probably due to op fusion in the CPU. It seems possible to fuse those xors in some form, maybe, but that's a little beyond my area of expertise :)
```
int i = getOffsetIndex(index);
int idx = index - offsets[i];
int end = idx + 4;
if (end <= references[i].length()) {
```
NIT: Maybe store `references[i]` in a variable? That should be a little faster here on average, I think, since I'd expect the fast path to be taken most of the time.
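A sketch of what that looks like against the excerpt above (the fast-path body is elided in the diff, so the call shown inside the `if` is illustrative, not the PR's actual code):

```
// Sketch: hoist references[i] so the fast path does one array load
// instead of two (one for length(), one for the read).
BytesReference reference = references[i];
int idx = index - offsets[i];
int end = idx + 4;
if (end <= reference.length()) {
    // fast path (illustrative): the whole int sits inside one slab
    return reference.getIntLE(idx);
}
// otherwise fall back to assembling the four bytes across slab boundaries
```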
This one helps!
Without:

```
Benchmark               (type)         Mode  Cnt  Score   Error  Units
IntArrayBenchmark.read  composite_1mb  avgt    7  8.131 ± 0.138  ns/op
```

With:

```
Benchmark               (type)         Mode  Cnt  Score   Error  Units
IntArrayBenchmark.read  composite_1mb  avgt    7  7.770 ± 0.140  ns/op
```
Update the instructions for running the benchmark with `perfasm`.