[#1608] feat(spark3): Ensure the compatibility of reassign and stageRetry #1783
@@ -638,7 +638,7 @@ public ShuffleHandleInfo getShuffleHandleInfoByShuffleId(int shuffleId) {
   @Override
   public int getMaxFetchFailures() {
     final String TASK_MAX_FAILURE = "spark.task.maxFailures";
-    return Math.max(1, sparkConf.getInt(TASK_MAX_FAILURE, 4) - 1);
+    return Math.max(0, sparkConf.getInt(TASK_MAX_FAILURE, 4) - 1);
I think the original lower bound of 1 is wrong when spark.task.maxFailures=1,
because that won't trigger the stage retry.
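To make the edge case concrete, here is a minimal standalone sketch that only reproduces the arithmetic from the two lines of the diff. The class and method names (`MaxFetchFailuresSketch`, `oldBound`, `newBound`) are hypothetical, and the `sparkConf` lookup is replaced by a plain `int` argument; it says nothing about how the rest of the shuffle manager consumes the value.

```java
public class MaxFetchFailuresSketch {
  // Lower bound of 1, as on the removed line of the diff.
  public static int oldBound(int taskMaxFailures) {
    return Math.max(1, taskMaxFailures - 1);
  }

  // Lower bound of 0, as on the added line of the diff.
  public static int newBound(int taskMaxFailures) {
    return Math.max(0, taskMaxFailures - 1);
  }

  public static void main(String[] args) {
    // 4 is the default used in the diff; 1 is the edge case under discussion.
    for (int maxFailures : new int[] {1, 2, 4}) {
      System.out.println(
          "spark.task.maxFailures=" + maxFailures
              + " old=" + oldBound(maxFailures)
              + " new=" + newBound(maxFailures));
    }
  }
}
```

The two bounds agree for every value of 2 or more and differ only when spark.task.maxFailures is 1 (or lower), where the old bound returns 1 and the new one returns 0.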
Could you add this into the comment?
OK.
client-spark/common/src/main/java/org/apache/uniffle/shuffle/manager/RssShuffleManagerBase.java
ping @jerqi
cc @rickyma
Overall LGTM.
What changes were proposed in this pull request?
Ensure the compatibility of reassign and stageRetry.
Why are the changes needed?
To improve job stability when both reassign and stage retry are in use.
For #1608
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests