Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42090] Introduce sasl retry count in RetryingBlockTransferor #39611

Closed
wants to merge 1 commit into from

Conversation

tedyu
Copy link
Contributor

@tedyu tedyu commented Jan 16, 2023

What changes were proposed in this pull request?

This PR introduces sasl retry count in RetryingBlockTransferor.

Why are the changes needed?

Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

  1. SaslTimeoutException
  2. IOException
  3. SaslTimeoutException
  4. IOException

Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New test is added, courtesy of Mridul.

@github-actions github-actions bot added the CORE label Jan 16, 2023
@tedyu
Copy link
Contributor Author

tedyu commented Jan 16, 2023

cc @mridulm

@tedyu tedyu changed the title Introduce sasl retry count in RetryingBlockTransferor [SPARK-42090] Introduce sasl retry count in RetryingBlockTransferor Jan 16, 2023
@mridulm
Copy link
Contributor

mridulm commented Jan 16, 2023

+CC @akpatnam25, @otterc

@tedyu
Copy link
Contributor Author

tedyu commented Jan 16, 2023

From https://github.com/tedyu/spark/actions/runs/3931766275/jobs/6723482156

FAIL [0.692s]: test_shuffle_data_with_multiple_locations (pyspark.tests.test_shuffle.MergerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_shuffle.py", line 84, in test_shuffle_data_with_multiple_locations
    self.assertTrue(
AssertionError: False is not true

It seems the above error was not related to the PR.

@tedyu
Copy link
Contributor Author

tedyu commented Jan 16, 2023

In the previous test suite run, there was no test failure - only lint error:

https://github.com/tedyu/spark/actions/runs/3931003566

Copy link

@akpatnam25 akpatnam25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this corner case, I missed it in my PR. Just had a couple minor comments. LGTM overall.

@@ -341,6 +343,35 @@ public void testBlockTransferFailureAfterSasl() throws IOException, InterruptedE
assert(_retryingBlockTransferor.getRetryCount() == 1);
}

@Test
public void testIOExceptionFailsConnectionEvenWithSaslException()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think we should also expose saslRetryCount as @VisibleForTesting and assert the count in the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the change w.r.t. your 3rd comment.
With that change, I think we don't need to expose saslRetryCount.

In the future, we can use a map (type of exception -> outstanding retry count for that type) where we can think more about asserting on retry count for each type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. I had a similar idea here: 3440f12
But we ended up opting to go with a simple flag for now. I agree with using a map in the future where we might have retry counts for each type.

ImmutableMap.of("b0", saslExceptionInitial),
ImmutableMap.of("b0", ioException),
ImmutableMap.of("b0", saslExceptionFinal),
// will not get invoked

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add to this comment to explain why it won't get invoked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if my addition should be enhanced.

// If this is a non SASL request failure, reduce earlier SASL failures from retryCount
// since some subsequent SASL attempt was successful
if (!isSaslTimeout && saslRetryCount > 0) {
retryCount -= saslRetryCount;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also verify that retryCount >= saslRetryCount? Just so that we don't ever end up in some funky state where we end up with negative values (I don't think this can ever happen, but would good to add the check!). :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a runtime exception.
But as you said, subset should never be bigger than the whole.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks

@tedyu tedyu force-pushed the sasl-cnt branch 2 times, most recently from d9d78a9 to ea357f4 Compare January 16, 2023 23:20
Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a couple of comments.
I will merge it later tonight once the updates and tests pass.

@tedyu
Copy link
Contributor Author

tedyu commented Jan 17, 2023

@mridulm
Your comments have been addressed.

Copy link

@akpatnam25 akpatnam25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@mridulm mridulm closed this in f1a3f4a Jan 17, 2023
@mridulm
Copy link
Contributor

mridulm commented Jan 17, 2023

Merged to master.
Thanks for fixing this @tedyu !
Thanks for the review @akpatnam25 :-)

@mridulm
Copy link
Contributor

mridulm commented Jan 17, 2023

@akpatnam25, please do include this fix in your backport to 3.3 and 3.2.

akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 17, 2023
### What changes were proposed in this pull request?
This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at apache#2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step apache#4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 17, 2023
### What changes were proposed in this pull request?
This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at apache#2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step apache#4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 17, 2023
### What changes were proposed in this pull request?
This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at apache#2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step apache#4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 23, 2023
### What changes were proposed in this pull request?
This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at apache#2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step apache#4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 23, 2023
### What changes were proposed in this pull request?
This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at apache#2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step apache#4.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm pushed a commit that referenced this pull request Jan 24, 2023
### What changes were proposed in this pull request? This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change? No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes #39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihonggmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39710 from akpatnam25/SPARK-42090-backport-3.2.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm pushed a commit that referenced this pull request Jan 24, 2023
### What changes were proposed in this pull request? This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change? No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes #39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihonggmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39709 from akpatnam25/SPARK-42090-backport-3.3.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request? This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?
Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at apache#2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step apache#4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change? No

### How was this patch tested?
New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihonggmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes apache#39710 from akpatnam25/SPARK-42090-backport-3.2.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants