Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize top() and bottom() using an incremental aggregator #8394

Merged
merged 1 commit into from
May 19, 2017

Conversation

jsternberg
Copy link
Contributor

@jsternberg jsternberg commented May 16, 2017

The previous version of top() and bottom() would gather all of the
points to use in a slice, filter them (if necessary), then use a
slightly modified heap sort to retrieve the top or bottom values.

This performed horrendously from the standpoint of memory. Since it
consumed so much memory and spent so much time in allocations (along
with sorting a potentially very large slice), this affected speed too.

These calls have now been modified so they keep the top or bottom points
in a min or max heap. For top(), a new point will read the minimum
value from the heap. If the new point is greater than the minimum point,
it will replace the minimum point and fix the heap with the new value.
If the new point is smaller, it discards that point. For bottom(), the
process is the opposite.

It will then sort the final result to ensure the correct ordering of the
selected points.

When top() or bottom() contain a tag to select, they have now been
modified so this query:

SELECT top(value, host, 2) FROM cpu

Essentially becomes this query:

SELECT top(value, 2), host FROM (
    SELECT max(value) FROM cpu GROUP BY host
)

This should drastically increase the performance of all top() and
bottom() queries.

  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated

@jsternberg
Copy link
Contributor Author

Benchmark comparison:

benchmark                    old ns/op     new ns/op     delta
BenchmarkSelect_Top_1K-8     381236352     64565949      -83.06%

benchmark                    old allocs     new allocs     delta
BenchmarkSelect_Top_1K-8     110            66             -40.00%

benchmark                    old bytes     new bytes     delta
BenchmarkSelect_Top_1K-8     529304706     12168         -100.00%

There should also be an improvement when using SELECT top(value, host, <n>) ..., but the code to benchmark this is less straightforward since the old implementation worked completely differently than this new implementation. I can try if needed, but it wouldn't be a 1-1 match between the code.

@jsternberg jsternberg force-pushed the js-top-bottom-performance branch 2 times, most recently from bfd0640 to 58def26 Compare May 16, 2017 20:45
@jsternberg jsternberg added this to the 1.3.0 milestone May 16, 2017
@discoduck2x
Copy link

@jsternberg very interesting , is this part of the nightly build now?

@jsternberg
Copy link
Contributor Author

This has not been merged yet. It still needs to be reviewed by someone before it gets merged.

@jsternberg jsternberg force-pushed the js-top-bottom-performance branch from 58def26 to 4e68349 Compare May 17, 2017 16:34
@jsternberg jsternberg requested a review from benbjohnson May 18, 2017 14:12
The previous version of `top()` and `bottom()` would gather all of the
points to use in a slice, filter them (if necessary), then use a
slightly modified heap sort to retrieve the top or bottom values.

This performed horrendously from the standpoint of memory. Since it
consumed so much memory and spent so much time in allocations (along
with sorting a potentially very large slice), this affected speed too.

These calls have now been modified so they keep the top or bottom points
in a min or max heap. For `top()`, a new point will read the minimum
value from the heap. If the new point is greater than the minimum point,
it will replace the minimum point and fix the heap with the new value.
If the new point is smaller, it discards that point. For `bottom()`, the
process is the opposite.

It will then sort the final result to ensure the correct ordering of the
selected points.

When `top()` or `bottom()` contain a tag to select, they have now been
modified so this query:

    SELECT top(value, host, 2) FROM cpu

Essentially becomes this query:

    SELECT top(value, 2), host FROM (
        SELECT max(value) FROM cpu GROUP BY host
    )

This should drastically increase the performance of all `top()` and
`bottom()` queries.
@jsternberg jsternberg force-pushed the js-top-bottom-performance branch from 4e68349 to 7b9b55b Compare May 19, 2017 16:56
@jsternberg jsternberg merged commit 4bdce21 into master May 19, 2017
@jsternberg jsternberg deleted the js-top-bottom-performance branch May 19, 2017 19:32
@jsternberg jsternberg removed the review label May 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants