[Tiered Caching] Adding took time for QuerySearchResult #10510

peteralfonsi · 2023-10-09T17:28:39Z

Description

Adds time taken in nanoseconds to QuerySearchResult in the query phase, part of the shard-level query response. This will be used as part of tiered caching, to decide whether or not to move cached entries to the disk tier based on how long it would take to recompute them.

Unit tested in QuerySearchResultTests.java. The tests add a random delay to QueryPhase::execute() and assert the took time in the result is greater than or equal to this delay.

Related Issues

Resolves #10411
Milestone of larger tiered caching feature.

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff
[N/A] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
[N/A] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

msfroh · 2023-10-09T17:49:56Z

Is this already covered by #10351 ?

That one adds timing for each query phase to the search result.

github-actions · 2023-10-09T17:50:15Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/27451/
CommitID: 4dcc13c
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

github-actions · 2023-10-09T18:06:45Z

Compatibility status:

Checks if related components are compatible with change 81dc9ef

Incompatible components

Incompatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/performance-analyzer.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

sgup432 · 2023-10-09T23:29:41Z

@msfroh

Is this already covered by #10351 ?
That one adds timing for each query phase to the search result.

This is not related to #10351. This PR adds a took time to the shard-level response, which is eventually cached in the RequestCache as well. We want this information in the cached value so that, during evictions from the in-memory cache, we can decide whether or not to spill the entry to the disk cache.
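
A minimal sketch of the eviction-time decision described here, assuming a hypothetical `TookTimeSpillPolicy` class and a caller-supplied threshold. Neither name comes from this PR, and the actual policy is expected to live in a separate tiered caching change.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical policy, not part of this PR. It only illustrates how a cached
// took time could drive the "spill to disk tier or drop" decision on eviction.
final class TookTimeSpillPolicy {
    private final long thresholdNanos;

    TookTimeSpillPolicy(long threshold, TimeUnit unit) {
        this.thresholdNanos = unit.toNanos(threshold);
    }

    // A null took time means the value was never recorded (for example, the
    // result came from a node that does not send it), so the entry is not spilled.
    boolean shouldSpillToDisk(Long tookTimeNanos) {
        return tookTimeNanos != null && tookTimeNanos >= thresholdNanos;
    }
}
```

With a 10 ms threshold, an entry that took 1 ms to compute would be dropped on eviction, while one that took 10 ms or more would be kept for the disk tier.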

github-actions · 2023-10-20T04:58:51Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/28528/
CommitID: f5db1aa
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

github-actions · 2023-10-24T00:12:51Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/28878/
CommitID: a9ab327
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

sgup432 · 2023-10-26T18:19:22Z

server/src/main/java/org/opensearch/search/query/QuerySearchResult.java

@@ -364,6 +366,11 @@ public void readFromWithId(ShardSearchContextId id, StreamInput in) throws IOExc
        nodeQueueSize = in.readInt();
        setShardSearchRequest(in.readOptionalWriteable(ShardSearchRequest::new));
        setRescoreDocIds(new RescoreDocIds(in));
+        if (in.getVersion().onOrAfter(Version.V_3_0_0)) {
+            tookTimeNanos = in.readVLong();


Would it be better to make this an optional Long? Use in.readOptionalLong() and set it to null below instead of -1.
Setting it to -1 might be error-prone if the value is used, and it requires special checking. The same is true for null, but null makes the intent clearer.
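
A self-contained sketch of the nullable approach, using plain `java.io` streams rather than OpenSearch's `StreamInput`/`StreamOutput`. The `OptionalLongCodec` class and its presence-flag layout are assumptions made for illustration; the real `readOptionalLong()` wire format may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for optional-long stream helpers: a presence flag
// followed by the value, so "not set" is null rather than a sentinel like -1.
final class OptionalLongCodec {
    static void writeOptionalLong(DataOutputStream out, Long value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeLong(value);
        }
    }

    static Long readOptionalLong(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readLong() : null;
    }

    // Serializes and deserializes a value in memory.
    static Long roundTrip(Long value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeOptionalLong(out, value);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readOptionalLong(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, -1 stays an ordinary value and only null means that no took time was recorded.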

github-actions · 2023-10-26T19:25:40Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/29097/
CommitID: a68f44e
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

github-actions · 2023-10-26T20:25:39Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/29101/
CommitID: 4e57f4c
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

dblock · 2023-10-30T20:51:53Z

server/src/test/java/org/opensearch/search/query/QuerySearchResultTests.java

+            assertEquals(querySearchResult.getTookTimeNanos(), deserialized.getTookTimeNanos());
+            if (i == 1) {
+                assertNull(deserialized.getTookTimeNanos());
+            }


It would probably read better to iterate over a map of expectedInput: expected value.

github-actions · 2023-10-31T19:18:29Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/29400/
CommitID: 1d47f38
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

sgup432 · 2023-10-31T23:24:47Z

@msfroh Would you be able to help us get this reviewed and merged?

jainankitk

LGTM!

jainankitk · 2023-10-26T23:22:27Z

server/src/main/java/org/opensearch/search/query/QuerySearchResult.java

@@ -87,6 +88,7 @@ public final class QuerySearchResult extends SearchPhaseResult {
    private int nodeQueueSize = -1;

    private final boolean isNull;
+    private Long tookTimeNanos = null;


Ideally, tookTimeNanos should also be final, since we never modify it once it is set. However, I'm not sure how we can communicate that intent while invoking setTookTimeNanos from QueryPhase.

You can define this field with SetOnce
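
A rough sketch of the set-once idea, assuming a hand-written `SetOnceValue` holder so that the example is self-contained. Lucene ships a `SetOnce` utility with similar semantics, but the class below and `TimedResult` are hypothetical stand-ins, not the PR's code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a set-once holder. Unlike a plain field, a second
// assignment fails loudly instead of silently overwriting the first value.
final class SetOnceValue<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();

    void set(T value) {
        if (value == null) {
            throw new IllegalArgumentException("value must not be null in this sketch");
        }
        if (!ref.compareAndSet(null, value)) {
            throw new IllegalStateException("value was already set");
        }
    }

    // Returns null until set(...) has been called.
    T get() {
        return ref.get();
    }
}

final class TimedResult {
    // The field itself is final; only the wrapped value can be assigned, once.
    private final SetOnceValue<Long> tookTimeNanos = new SetOnceValue<>();

    void setTookTimeNanos(long nanos) {
        tookTimeNanos.set(nanos);
    }

    Long getTookTimeNanos() {
        return tookTimeNanos.get();
    }
}
```

This keeps the field final while still letting QueryPhase-style code assign the value after construction.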

jainankitk · 2023-11-03T11:53:08Z

server/src/main/java/org/opensearch/search/query/QueryPhase.java

@@ -131,13 +131,15 @@ public void preProcess(SearchContext context) {
    }

    public void execute(SearchContext searchContext) throws QueryPhaseExecutionException {
+        final long startTime = System.nanoTime();


Sorry, I missed this earlier. Given that we don't report this as stats anywhere, I'm wondering whether millis would suffice for our use case. Nano is generally much more expensive:

https://stackoverflow.com/questions/19052316/why-is-system-nanotime-way-slower-in-performance-than-system-currenttimemill

We need an accurate elapsed time for queryPhase so that we can take decisions based on that. So we need nanos for that instead of millis.

As I understand it, we need this time to evaluate whether or not to store the result on disk, and disk access can take a few ms, so I don't completely understand the reason for using nanos instead of millis. It should be fine to use nanos as well; I just want to make sure my understanding is not lacking.

It is not just about whether disk access takes a few ms. System.currentTimeMillis() is tied to the system clock, so using it to measure elapsed time can be error-prone. System.nanoTime(), in contrast, measures time relative to an arbitrary origin and is not affected by system clock skew. Elapsed time is calculated using nanoTime across OpenSearch, while millis are used to display human-readable dates to users.
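
The elapsed-time argument in this thread can be sketched as follows. `ElapsedTimer` is a hypothetical helper written for illustration, not an OpenSearch class, and the conversion to millis is only there to show that a coarser unit can still be derived afterwards.

```java
import java.util.concurrent.TimeUnit;

// Sketch only: measures elapsed time as the difference of two
// System.nanoTime() readings, which are not tied to wall-clock time.
final class ElapsedTimer {
    static long timeNanos(Runnable task) {
        final long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    // Truncating conversion, for callers that only need millisecond precision.
    static long toMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }
}
```

A difference of two System.currentTimeMillis() readings, by comparison, can change if the system clock is adjusted between the two calls.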

jainankitk · 2023-11-03T11:58:34Z

server/src/test/java/org/opensearch/search/query/QueryPhaseTests.java

+            try {
+                Thread.sleep(sleepMillis);
+            } catch (Exception ignored) {}
+            return super.searchWith(searchContext, searcher, query, collectors, hasFilterCollector, hasTimeout);


Ah! In general, this is bad coding practice, as it can eat exceptions, although here we are doing that intentionally.
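
For comparison, a narrower way to handle the sleep is sketched below. It assumes a hypothetical `Sleeper` helper that is not in the PR, catches only InterruptedException, and restores the interrupt flag instead of discarding it.

```java
// Hypothetical test helper: sleeps without swallowing unrelated exceptions.
final class Sleeper {
    // Returns true if the full sleep completed, false if it was interrupted.
    static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Re-assert the interrupt so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```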

https://stackoverflow.com/questions/48088/returning-from-a-finally-block-in-java

ticheng-aws · 2023-11-04T00:09:23Z

server/src/main/java/org/opensearch/search/query/QueryPhase.java

@@ -131,13 +131,15 @@ public void preProcess(SearchContext context) {
    }

    public void execute(SearchContext searchContext) throws QueryPhaseExecutionException {


Hi @peteralfonsi, could you please evaluate the use of Timer class for this change? I'd like to hear your thoughts on whether utilizing the existing Timer class is a feasible choice. Thank you.

opensearch-trigger-bot · 2023-12-07T15:21:07Z

This PR is stalled because it has been open for 30 days with no activity.

ticheng-aws · 2024-01-05T23:53:37Z

Hi @peteralfonsi, is this still being worked on? Please feel free to reach out to the maintainers for further reviews.

sohami · 2024-01-09T11:52:32Z

server/src/test/java/org/opensearch/search/query/QuerySearchResultTests.java

+        expectedValues.put(false, null);
+        expectedValues.put(true, 1000L);
+        for (Boolean doSetTookTime : expectedValues.keySet()) {
+            QuerySearchResult querySearchResult = createTestInstance();


Instead, you can update createTestInstance to randomly set the tookTime, and then assert that the tookTime of the created instance equals the tookTime of the deserialized one.

sohami · 2024-01-09T11:55:37Z

server/src/main/java/org/opensearch/search/query/QuerySearchResult.java

@@ -87,6 +88,7 @@ public final class QuerySearchResult extends SearchPhaseResult {
    private int nodeQueueSize = -1;

    private final boolean isNull;
+    private Long tookTimeNanos = null;


You can define this field with SetOnce

sohami · 2024-01-09T12:02:43Z

server/src/main/java/org/opensearch/search/query/QueryPhase.java

@@ -165,6 +167,7 @@ public void execute(SearchContext searchContext) throws QueryPhaseExecutionExcep
            );
            searchContext.queryResult().profileResults(shardResults);
        }
+        searchContext.queryResult().setTookTimeNanos(System.nanoTime() - startTime);


A couple of questions:

1. The query phase time is also computed in SearchOperationListenerExecutor, but I see that you need it in QueryResult, which is stored in the cache before SearchOperationListenerExecutor records the time.

2. If we store this time in QueryResult, then on eviction the cache will need to deserialize each evicted value to get the tookTime and decide whether or not to keep it in the disk tier. I wonder whether the cache should instead hold it in a separate object wrapping QueryResult, to make it easily available. Is there a need to output this value in the QueryResult? If so, we should probably use the value computed by SearchOperationListenerExecutor to keep it consistent with search stats.

For 1), I wasn't aware of that. Let me see if it's doable to get that stored time into the QueryResult.
For 2), in my tiered caching policies PR, which should hopefully be raised soon, there is a separate wrapper with just the information needed to decide whether to add a result to the disk tier. This gets written first, before the actual result, so when deciding, it can just read that wrapper and doesn't have to spend time deserializing the whole result. I will see if I can get the other value into there instead.

For 2, not sure if I understood correctly. If you already have a wrapper concept, then that wrapper can hold to the computed tookTime and the computation can be done here instead. That way QueryResult can be kept agnostic of tookTime
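
A minimal sketch of the wrapper idea discussed in this thread, assuming a hypothetical `CachedEntryCodec` and a byte layout invented for illustration: the took time is written ahead of the serialized result, so a policy can read it without deserializing the result itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical layout: [tookTimeNanos: 8 bytes][length: 4 bytes][result bytes].
// The result stays opaque and carries no took time of its own.
final class CachedEntryCodec {
    static byte[] encode(long tookTimeNanos, byte[] serializedResult) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(tookTimeNanos);
            out.writeInt(serializedResult.length);
            out.write(serializedResult);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads only the leading 8 bytes; both DataOutputStream and ByteBuffer
    // use big-endian order by default, so the value matches what was written.
    static long peekTookTimeNanos(byte[] entry) {
        return ByteBuffer.wrap(entry).getLong();
    }
}
```

Under this layout the component that builds the cache entry computes the took time, which matches the suggestion to keep QueryResult agnostic of it.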

github-actions · 2024-01-09T18:19:29Z

❌ Gradle check result for 81dc9ef: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

sohami · 2024-01-09T19:09:22Z

server/src/main/java/org/opensearch/search/query/QueryPhase.java

@@ -165,6 +167,7 @@ public void execute(SearchContext searchContext) throws QueryPhaseExecutionExcep
            );
            searchContext.queryResult().profileResults(shardResults);
        }
+        searchContext.queryResult().setTookTimeNanos(System.nanoTime() - startTime);


For 2, not sure if I understood correctly. If you already have a wrapper concept, then that wrapper can hold to the computed tookTime and the computation can be done here instead. That way QueryResult can be kept agnostic of tookTime

sohami · 2024-01-09T19:11:06Z

server/src/main/java/org/opensearch/search/query/QuerySearchResult.java

+        if (in.getVersion().onOrAfter(Version.V_3_0_0)) {
+            tookTimeNanos = new SetOnce<>(in.readOptionalLong());
+        } else {
+            tookTimeNanos = new SetOnce<>();
+        }


Suggested change

-        if (in.getVersion().onOrAfter(Version.V_3_0_0)) {
-            tookTimeNanos = new SetOnce<>(in.readOptionalLong());
-        } else {
-            tookTimeNanos = new SetOnce<>();
-        }
+        if (in.getVersion().onOrAfter(Version.V_3_0_0)) {
+            tookTimeNanos.set(in.readOptionalLong());
+        }

That is a very good point. I definitely should have thought of that... the wrapper concept came in much later as we were worried about reading the whole QSR into memory on each eviction, and by then I didn't consider going back and changing this. In that case, we can probably close this PR and just stick that computation into the upcoming policies PR.

github-actions bot added the enhancement Enhancement or improvement to existing feature or request label Oct 9, 2023

peteralfonsi changed the title ~~Adding took time for QuerySearchResult~~ [Tiered Caching] Adding took time for QuerySearchResult Oct 9, 2023

Peter Alfonsi added 3 commits October 23, 2023 16:52

Adds and tests took time for QuerySearchResult

08a4f6d

Signed-off-by: Peter Alfonsi <petealft@amazon.com>

Addressed Ankit's comments

dfd9128

Signed-off-by: Peter Alfonsi <petealft@amazon.com>

Removed minimum version variable

a9ab327

Signed-off-by: Peter Alfonsi <petealft@amazon.com>

peteralfonsi force-pushed the tooktime-squashed branch from f5db1aa to a9ab327 Compare October 23, 2023 23:52

sgup432 reviewed Oct 26, 2023

Changed type for tookTime from long to Long, to address Sagar's comment

4e57f4c

Signed-off-by: Peter Alfonsi <petealft@amazon.com>

peteralfonsi force-pushed the tooktime-squashed branch from a68f44e to 4e57f4c Compare October 26, 2023 20:05

sgup432 approved these changes Oct 26, 2023

dblock reviewed Oct 30, 2023

Addressed dblock's comment

1d47f38

Signed-off-by: Peter Alfonsi <petealft@amazon.com>

jainankitk reviewed Nov 3, 2023

ticheng-aws reviewed Nov 4, 2023

opensearch-trigger-bot bot added the stalled Issues that have stalled label Dec 7, 2023

opensearch-trigger-bot bot removed the stalled Issues that have stalled label Jan 7, 2024

sohami reviewed Jan 9, 2024

Addressed Sorabh's comments

81dc9ef

Signed-off-by: Peter Alfonsi <petealft@amazon.com>

peteralfonsi requested review from adnapibar and tlfeng as code owners January 9, 2024 18:08

github-actions bot added the Search Search query, autocomplete ...etc label Jan 9, 2024

sohami reviewed Jan 9, 2024

peteralfonsi closed this Jan 18, 2024
