Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Fix pagination during rolling upgrade #85399

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

Luegg
Copy link
Contributor

@Luegg Luegg commented Mar 28, 2022

Resolves #83726 by trying to redirect requests with a cursor with the wrong version to a node with the matching version.

The most frequent scenario that this PR addresses is the following situation during a rolling upgrade:

  1. An SQL query either hits a not-yet-upgraded node or it is redirected to the older node by the mechanism implemented in TransportActionUtils.executeRequestWithRetryAttempt. The returned cursor will only work on the nodes with the old version.
  2. A follow up scroll (or clean cursor) request hits a node that has already been upgraded. Previously, this failed during deserialization of the cursor. With this change, the request will be forwarded to a node with the cursor's version.

Especially in larger clusters, one might also encounter an additional error mode that is addressed by this PR:

  1. An index is allocated on nodes that have already been upgraded.
  2. A search request hits an upgraded node but won't be redirected because all shards fulfill the version requirement.
  3. The follow up scroll (or clean cursor) requests hits an old node.

The fix should also work with requests with format=txt but only if all nodes support the redirect. This case is more complicated because the cursor wraps another cursor together with the formatter and inner and outer cursor might not have the same version.

@Luegg Luegg force-pushed the fix/cursorIncompatibilityDuringRollingUpgrades branch from bc3c25d to 1b10dc1 Compare March 29, 2022 09:52
@Luegg Luegg force-pushed the fix/cursorIncompatibilityDuringRollingUpgrades branch from d355c5f to dd1e0f5 Compare March 30, 2022 06:58
@Luegg Luegg added >bug :Analytics/SQL SQL querying labels Mar 30, 2022
@Luegg Luegg marked this pull request as ready for review March 30, 2022 08:03
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Mar 30, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Hi @Luegg, I've created a changelog YAML for you.

assertTextCursorCompatibleAcrossVersions(bwcVersion, oldNodesClient, newNodesClient);
}

@AwaitsFix(bugUrl = "tbd")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See last remark in the PR description. I'll create an according issue once this PR is approved and ready to merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind about this comment. The @AwaitsFix referred to a previous attempt to address the issue. With the current attempt, text format based cursors should also be usable across versions if all nodes support the cursor redirect from this PR.

@Luegg Luegg force-pushed the fix/cursorIncompatibilityDuringRollingUpgrades branch from a3333d1 to 25c3ddb Compare March 30, 2022 08:30
@elasticsearchmachine
Copy link
Collaborator

Hi @Luegg, I've created a changelog YAML for you.

assertEquals(scrollJson.get("succeeded"), true);
}

public void testCursorForIndexAllocatedOnNewNodeWorkOnOldNode() throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this test and testTextCursorFromNewNodeWorksOnOldNode can only be tested once we have a release including the cursor redirect. Any ideas on how to run the test anyway would be highly appreciated :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since 8.2 has now been branched out, I can backport this PR to 8.2 when it's ready to merge, see how the tests are doing and then merge to master.

Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it seems pretty good.
Just leaving a couple of comments

private static DiscoveryNode findNodeWithVersionOtherThanLocalNode(ClusterService clusterService, Version expectedVersion) {
DiscoveryNode localNode = clusterService.state().nodes().getLocalNode();
if (expectedVersion.equals(localNode.getVersion())) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the logic here: if current node has the right version, it should be a good candidate itself... and a QlVersionMismatchException shouldn't have been thrown in the first place. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. This condition should not happen and I don't think you can end up in this situation. But if it is the case, you would want to avoid the retry because you could end up in an endless loop otherwise. Throwing an exception here is probably even better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it makes sense.
The exception makes the intention much more clear 👍

Consumer<Exception> onFailure,
Logger log
) {
if (log.isDebugEnabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(very minor note) IMHO this log does not belong here. This method is responsible of retrying the request, not of checking the version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a good place could also be the catch block in the calling methods but that would duplicate the logging code. I've added some more context to the log message that also emphasizes that it looks for a matching node.

@Luegg Luegg force-pushed the fix/cursorIncompatibilityDuringRollingUpgrades branch from 12b915a to 509a31f Compare March 30, 2022 16:07
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@quux00 quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023
@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
@elasticsearchmachine elasticsearchmachine added v8.14.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed v8.13.0 labels Feb 14, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/SQL SQL querying >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:QL (Deprecated) Meta label for query languages team v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: Scroll cursors issued by upgraded nodes during rolling upgrade are invalid