Speed `getIntLE` from `BytesReference` #90147

Merged: 4 commits merged into elastic:main from get_int_le_faster on Sep 21, 2022

Conversation

nik9000 (Member) commented Sep 19, 2022:

This speeds up `getIntLE` from `BytesReference`, which we'll be using in the upcoming dense representations for aggregations. Here's the performance:

```
           (type)  Mode  Cnt   Before  Error  After    Error  Units
            array  avgt    7   1.036 ± 0.062   0.261 ± 0.022  ns/op
paged_bytes_array  avgt    7   5.189 ± 0.172   5.317 ± 0.196  ns/op
  composite_256kb  avgt    7  30.792 ± 0.834  11.240 ± 0.387  ns/op
composite_262344b  avgt    7  32.503 ± 1.017  11.155 ± 0.358  ns/op
    composite_1mb  avgt    7  25.189 ± 0.449   8.379 ± 0.193  ns/op
```

The `array` case is how we'll read slices that don't span the edges of a netty buffer. The `paged_bytes_array` case doesn't really change and represents the default for internal stuff. I'll bet we could make it faster too, but I don't know that we use it in the hot path. The `composite_<size>` case is how we'll read large slabs from the netty byte buffer. We could probably do better if we relied on the buffer sizes being even, but we don't presently do that in the composite bytes array. The different sizes following `composite` show that the performance is dominated by the number of slabs in the composite buffer: `1mb` looks like the largest buffer netty uses and `256kb` is the smallest. The odd byte count, `262344b`, intentionally doesn't line the ints up on nice boundaries. I don't think we'll use sizes like that, but it looks like alignment doesn't make a huge difference anyway; we're dominated by the buffer choice.
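For context, here is a minimal sketch (not this PR's actual implementation) of the kind of single-`byte[]` little-endian read the `array` case exercises, using the JDK's `VarHandle` byte-array view; the class and method names are made up for illustration:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical helper, not the code in this PR: read a little-endian int from
// a plain byte[] with one bounds-checked 4-byte load instead of four 1-byte
// loads plus shifts.
final class IntLeSketch {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int getIntLE(byte[] bytes, int offset) {
        return (int) INT_LE.get(bytes, offset);
    }
}
```

The composite cases additionally have to find which slab contains the offset (the `getOffsetIndex` call visible in the review excerpt below), which is why their numbers are dominated by the number of slabs.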

nik9000 added the >non-issue, :Distributed/Distributed, and v8.5.0 labels on Sep 19, 2022
elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the Team:Distributed label on Sep 19, 2022
original-brownbear (Member) left a comment:

LGTM, just some optional suggestions :)

```java
public long read() {
    int res = 0;
    for (int i = 0; i < SIZE; i++) {
        res = res ^ read.get(i);
```
original-brownbear (Member):

NIT: I think it might give a better benchmark to store `this.read` in a local variable before the loop. I don't think the JIT is able to remove the field access here, though it's hard to tell how much that matters.
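The suggestion amounts to something like the following sketch; the field type and `SIZE` here are stand-ins, since the benchmark's actual declarations aren't shown in the excerpt:

```java
// Sketch of the reviewer's suggestion: hoist `this.read` into a local before
// the loop so only one field load is guaranteed, rather than relying on the
// JIT to hoist it. `IntsByIndex` is a hypothetical stand-in for the real type.
interface IntsByIndex {
    int get(int i);
}

class ReadLoopSketch {
    static final int SIZE = 1024;       // illustrative size
    IntsByIndex read = i -> i;          // illustrative backing data

    public long read() {
        IntsByIndex local = this.read;  // single field load, reused in the loop
        int res = 0;
        for (int i = 0; i < SIZE; i++) {
            res = res ^ local.get(i);
        }
        return res;
    }
}
```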

nik9000 (Member, Author):

I can give it a go and see. I thought it'd be able to avoid both the things you mentioned, but I dunno! 🧑‍🔬

nik9000 (Member, Author):

It doesn't seem to make a difference. This morning the fast path clocks in at 0.22ns either way. My CPU claims 2.3GHz, which implies 0.43ns per cycle. So when this is an array we're looking at about half a cycle per read. Now I'm curious.

original-brownbear (Member):

Could be that in this test we're able to inline this fully, so it's a loop over an array and gets vectorized (we could verify that by printing the assembly). No need to change anything here then, I guess :)
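For reference, one way to print the hot assembly with JMH is the `perfasm` profiler (a sketch; the repo has its own benchmark run instructions, and `perfasm` needs the `hsdis` disassembler installed for the JVM in use):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunReadWithPerfasm {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
            .include("IntArrayBenchmark\\.read")  // the benchmark discussed in this thread
            .addProfiler("perfasm")               // annotates the hottest compiled regions
            .build();
        new Runner(opts).run();
    }
}
```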

nik9000 (Member, Author):

Here's the hot block:

```
         ╭     0x00007f0eb8ff0106:   jmpq   0x00007f0eb8ff018b
         │     0x00007f0eb8ff010b:   nopl   0x0(%rax,%rax,1)
  0.00%  │↗    0x00007f0eb8ff0110:   mov    %ebx,%r10d                   ;*iload_1 {reexecute=0 rethrow=0 return_oop=0}
         ││                                                              ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@11 (line 95)
  1.74%  ││ ↗  0x00007f0eb8ff0113:   mov    %r10d,%ebx
         ││ │  0x00007f0eb8ff0116:   shl    $0x2,%ebx
  0.00%  ││ │  0x00007f0eb8ff0119:   movslq %ebx,%rbx
  0.00%  ││ │  0x00007f0eb8ff011c:   xor    0x10(%rsi,%rbx,1),%r11d
  2.31%  ││ │  0x00007f0eb8ff0121:   xor    0x14(%rsi,%rbx,1),%r11d
  0.00%  ││ │  0x00007f0eb8ff0126:   xor    0x18(%rsi,%rbx,1),%r11d
  1.13%  ││ │  0x00007f0eb8ff012b:   xor    0x1c(%rsi,%rbx,1),%r11d
  1.72%  ││ │  0x00007f0eb8ff0130:   xor    0x20(%rsi,%rbx,1),%r11d
  1.70%  ││ │  0x00007f0eb8ff0135:   xor    0x24(%rsi,%rbx,1),%r11d
  1.68%  ││ │  0x00007f0eb8ff013a:   xor    0x28(%rsi,%rbx,1),%r11d
  1.72%  ││ │  0x00007f0eb8ff013f:   xor    0x2c(%rsi,%rbx,1),%r11d
  1.64%  ││ │  0x00007f0eb8ff0144:   xor    0x30(%rsi,%rbx,1),%r11d
  1.76%  ││ │  0x00007f0eb8ff0149:   xor    0x34(%rsi,%rbx,1),%r11d
  1.71%  ││ │  0x00007f0eb8ff014e:   xor    0x38(%rsi,%rbx,1),%r11d
  3.52%  ││ │  0x00007f0eb8ff0153:   xor    0x3c(%rsi,%rbx,1),%r11d
  1.74%  ││ │  0x00007f0eb8ff0158:   xor    0x40(%rsi,%rbx,1),%r11d
  1.72%  ││ │  0x00007f0eb8ff015d:   xor    0x44(%rsi,%rbx,1),%r11d
  1.70%  ││ │  0x00007f0eb8ff0162:   xor    0x48(%rsi,%rbx,1),%r11d
  1.70%  ││ │  0x00007f0eb8ff0167:   xor    0x4c(%rsi,%rbx,1),%r11d      ;*ixor {reexecute=0 rethrow=0 return_oop=0}
         ││ │                                                            ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@23 (line 95)
  1.64%  ││ │  0x00007f0eb8ff016c:   mov    %r10d,%ebx
  0.00%  ││ │  0x00007f0eb8ff016f:   add    $0x10,%ebx                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
         ││ │                                                            ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@25 (line 94)
  0.00%  ││ │  0x00007f0eb8ff0172:   cmp    %edx,%ebx
         │╰ │  0x00007f0eb8ff0174:   jl     0x00007f0eb8ff0110           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         │  │                                                            ; - org.elasticsearch.benchmark.common.util.IntArrayBenchmark::read@8 (line 94)
```

This reads to me like it's inlined the array access and unrolled the loop. That's with the field read.

original-brownbear (Member):

Right, the field lookup is irrelevant here then. I guess the unexpected performance is probably due to op fusion in the CPU. It seems possible to fuse those xors in some form, but that's a little beyond my area of expertise :)

```java
int i = getOffsetIndex(index);
int idx = index - offsets[i];
int end = idx + 4;
if (end <= references[i].length()) {
```
original-brownbear (Member):

NIT: Maybe store `references[i]` in a variable? That should be a little faster here on average, I think, since I'd expect the fast path to be taken most of the time.
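Applied to the excerpt above, the suggestion looks roughly like this (the element type and the body of the `if` aren't visible in the excerpt, so they're assumed here):

```java
int i = getOffsetIndex(index);
int idx = index - offsets[i];
int end = idx + 4;
BytesReference reference = references[i];  // hoisted: load the slab once
if (end <= reference.length()) {
    // ... fast path, reusing `reference` instead of re-reading references[i]
}
```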

nik9000 (Member, Author):

This one helps!

Without:

```
Benchmark                      (type)  Mode  Cnt  Score   Error  Units
IntArrayBenchmark.read  composite_1mb  avgt    7  8.131 ± 0.138  ns/op
```

With:

```
Benchmark                      (type)  Mode  Cnt  Score   Error  Units
IntArrayBenchmark.read  composite_1mb  avgt    7  7.770 ± 0.140  ns/op
```

nik9000 added the auto-merge label on Sep 20, 2022
Update instructions for benchmark for `perfasm`.
nik9000 added and removed the auto-merge label on Sep 20, 2022
elasticsearchmachine merged commit 7ebb09b into elastic:main on Sep 21, 2022
nik9000 deleted the get_int_le_faster branch on September 21, 2022 at 15:17
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 22, 2022
* main: (186 commits)
  [DOCS] Add 8.5 release notes and fix links (elastic#90201)
  Mute DownsampleActionSingleNodeTests.testCannotRollupWhileOtherRollupInProgress (elastic#90198)
  Bump to version 8.6.0
  Increase the minimum size of the management pool to 2 (elastic#90193)
  Speed `getIntLE` from `BytesReference` (elastic#90147)
  Restrict nodes for testClusterPrimariesActive1 (elastic#90191)
  Fix bug with `BigIntArray` serialization (elastic#90142)
  Synthetic _source: test _source filtering (elastic#90138)
  Modernize cardinality agg tests (elastic#90114)
  Mute failing test (elastic#90186)
  Move assertion in ES85BloomFilterPostingsFormat to fix test (elastic#90150)
  Restrict nodes for testClusterPrimariesActive2 (elastic#90184)
  Batch index delete cluster state updates (elastic#90033)
  Register stable plugins in ActionModule (elastic#90067)
  Mute failing test (elastic#90180)
  [HealthAPI] Disk: Use _ for diagnosis id (elastic#90179)
  [HealtAPI] Disk: use shorter help URLs (elastic#90178)
  Fixing disk health indicator unit tests (elastic#90175)
  Enable the health node and the disk health indicator elastic#84811 (elastic#90085)
  Add missing Disk Indicator health api IDs (elastic#90174)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 22, 2022
Labels: >non-issue, :Distributed/Distributed, Team:Distributed, auto-merge, v8.5.0