Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for search_after and point-in-time #1190

Merged
merged 31 commits into from
Mar 25, 2021

Conversation

DJRickyB
Copy link
Contributor

@DJRickyB DJRickyB commented Feb 18, 2021

This change adds support for search_after usage and point-in-time interactions. Specifically:

  • Defines open-point-in-time and close-point-in-time operations, for use in composite contexts
  • Adds support to the query/search runner for search_after pagination, using a new operation type paginated-search
  • Defines scroll-search operation type for usability
  • Moves runner parsing logic to its own module for isolation
  • Adds test and documentation for the above

Bonus:

  • Re-factors existing query runner to reduce duplicated code
  • Adds a benchmark file for demonstrating and experimenting with parsing things in the vein of detailed-results and the properties required for search_after support
  • A minor documentation fix (corrected get-async-search to delete-async-search)
  • Style fixes in docs with extra whitespace

Closes #1141

@DJRickyB DJRickyB added enhancement Improves the status quo :Track Management New operations, changes in the track format, track download changes and the like highlight A substantial improvement that is worth mentioning separately in release notes labels Feb 18, 2021
@DJRickyB DJRickyB added this to the 2.1.0 milestone Feb 18, 2021
@DJRickyB DJRickyB self-assigned this Feb 18, 2021
.pylintrc Show resolved Hide resolved
docs/track.rst Outdated
@@ -883,8 +883,13 @@ Properties
* ``body`` (mandatory): The query body.
* ``response-compression-enabled`` (optional, defaults to ``true``): Allows to disable HTTP compression of responses. As these responses are sometimes large and decompression may be a bottleneck on the client, it is possible to turn off response compression.
* ``detailed-results`` (optional, defaults to ``false``): Records more detailed meta-data about queries. As it analyzes the corresponding response in more detail, this might incur additional overhead which can skew measurement results. This flag is ineffective for scroll queries.
* ``pages`` (optional): Number of pages to retrieve. If this parameter is present, a scroll query will be executed. If you want to retrieve all result pages, use the value "all".
* ``results-per-page`` (optional): Number of documents to retrieve per page for scroll queries.
* ``results-per-page`` (optional): Number of results to retrieve per page. This maps to the Search API's ``size`` parameter, and can be used for paginated and non-paginated searches. Defaults to ``10``
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just be called size? there didn't seem to be a good reason to make this only for scroll/pagination given it maps to something consequential to the single request_body search

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the name results-per-page provides less room for interpretation than size. We could, however, introduce size as a parameter and allow results-per-page as an alias.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can/should i deprecate results-per-page here?

esrally/track/params.py Outdated Show resolved Hide resolved
target_name = params.get("index")
type_name = params.get("type")
if not target_name:
target_name = params.get("data-stream", default_target)
Copy link
Contributor Author

@DJRickyB DJRickyB Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is data-stream intentionally undocumented for search? also undocumented here, tentatively

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at #1092 I see:

Assuming this PR is good, i will address this functionality next and then document together ie. i propose we don't document until #1054 is addressed.

This was intentional originally but I think we should document this property now.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I did a first pass and left a couple of thoughts.

docs/track.rst Outdated
@@ -883,8 +883,13 @@ Properties
* ``body`` (mandatory): The query body.
* ``response-compression-enabled`` (optional, defaults to ``true``): Allows to disable HTTP compression of responses. As these responses are sometimes large and decompression may be a bottleneck on the client, it is possible to turn off response compression.
* ``detailed-results`` (optional, defaults to ``false``): Records more detailed meta-data about queries. As it analyzes the corresponding response in more detail, this might incur additional overhead which can skew measurement results. This flag is ineffective for scroll queries.
* ``pages`` (optional): Number of pages to retrieve. If this parameter is present, a scroll query will be executed. If you want to retrieve all result pages, use the value "all".
* ``results-per-page`` (optional): Number of documents to retrieve per page for scroll queries.
* ``results-per-page`` (optional): Number of results to retrieve per page. This maps to the Search API's ``size`` parameter, and can be used for paginated and non-paginated searches. Defaults to ``10``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the name results-per-page provides less room for interpretation than size. We could, however, introduce size as a parameter and allow results-per-page as an alias.

docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
esrally/track/params.py Outdated Show resolved Hide resolved
target_name = params.get("index")
type_name = params.get("type")
if not target_name:
target_name = params.get("data-stream", default_target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at #1092 I see:

Assuming this PR is good, i will address this functionality next and then document together ie. i propose we don't document until #1054 is addressed.

This was intentional originally but I think we should document this property now.

esrally/track/params.py Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
tests/driver/parsing_test.py Outdated Show resolved Hide resolved
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating. I did another pass through the code and left a couple of suggestions.

docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
docs/track.rst Outdated Show resolved Hide resolved
esrally/driver/runner.py Outdated Show resolved Hide resolved
esrally/driver/runner.py Outdated Show resolved Hide resolved
esrally/track/track.py Outdated Show resolved Hide resolved
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! I left some minor suggestions and corrections on the docs and some formatting. SearchAfterExtractor looks good but I think we can use one shared instance per runner.

docs/track.rst Outdated
Meta-data
"""""""""

The following meta data are always returned:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this superfluous?

docs/track.rst Outdated

The following meta data are always returned:

* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mentions regular queries but the section is about paginated queries so we can simplify the sentence.

docs/track.rst Outdated
The following meta data are always returned:

* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.
* ``unit``: The unit in which to interpret ``weight``. Always "ops" for regular queries and "pages" for scroll queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mentions regular queries but the section is about paginated queries so we can simplify the sentence.

docs/track.rst Outdated

The following meta data are always returned:

* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mentions regular queries but the section is about scroll queries so we can simplify the sentence.

docs/track.rst Outdated
The following meta data are always returned:

* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.
* ``unit``: The unit in which to interpret ``weight``. Always "ops" for regular queries and "pages" for scroll queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mentions regular queries but the section is about scroll queries so we can simplify the sentence.

docs/track.rst Outdated

**Example**

In this example, a point-in-time is opened, used by a ``search_after``-based search operation, and closed::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... used by a paginated-search operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was going for a "this is what we are doing to elasticsearch" description rather than a "this is what we are doing in rally" description, but I take this hesitation as sufficient evidence it should be changed (so it was)

@@ -799,111 +821,156 @@ async def request_body_query(self, es, params):
# disable eager response parsing - responses might be huge thus skewing results
es.return_raw_response()

r = await self._raw_search(es, doc_type, index, body, request_params, headers=headers)
async def _search_after_query(es, params):
extract = SearchAfterExtractor()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively means that the regex is recompiled on every request on the hot code path. As this class is immutable I propose that we instead create one instance of SearchAfterExtractor in the runner's constructor and reuse it.

r = await self._raw_search(es, doc_type, index, body, params, headers=headers)

props = parse(r,
["_scroll_id", "hits.total", "hits.total.value", "hits.total.relation",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting of parameters looks funny here. Is this intentional?

# make sure pit_id is updated afterward
self.assertEqual("fedcba9876543211", runner.CompositeContext.get(pit_op))

es.transport.perform_request.assert_has_calls([mock.call('GET', '/_search', params={},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can you please check that we use double-quotes everywhere consistently?

self.assertEqual("fedcba9876543211", runner.CompositeContext.get(pit_op))

es.transport.perform_request.assert_has_calls([mock.call('GET', '/_search', params={},
body={'query': {'match-all': {}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Personally, I find the following formatting easier to read:

 body={
     'query': {
         'match-all': {}
     },
       'sort': [{
           'timestamp': 'asc',
           'tie_breaker_id': 'asc'
       }],
       'size': 2,
       'pit': {
           'id': '0123456789abcdef',
           'keep_alive': '1m'
       }
 },

I'm fine if we keep it as is but I want to offer it as a thought.

(and i'm all out of nits)
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating. Looks good now! :)

@DJRickyB DJRickyB merged commit 62595bd into elastic:master Mar 25, 2021
@DJRickyB DJRickyB deleted the point-in-time branch March 25, 2021 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo highlight A substantial improvement that is worth mentioning separately in release notes :Track Management New operations, changes in the track format, track download changes and the like
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Point-in-time API
2 participants