Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1059] Fix callback not update if push worker excluded during retry #2005

Closed
wants to merge 2 commits into from

Conversation

onebox-li
Copy link
Contributor

What changes were proposed in this pull request?

When retry push data and revive succeed in ShuffleClientImpl#submitRetryPushData, if new location is excluded, the callback's lastest location has not been updated when wrappedCallback.onFailure is called in ShuffleClientImpl#isPushTargetWorkerExcluded. Therefore there may be problems with subsequent revive.

Why are the changes needed?

Ditto

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual test.

@codecov
Copy link

codecov bot commented Oct 20, 2023

Codecov Report

Merging #2005 (66a6205) into main (7456d9a) will increase coverage by 0.14%.
Report is 5 commits behind head on main.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2005      +/-   ##
==========================================
+ Coverage   46.74%   46.88%   +0.14%     
==========================================
  Files         165      165              
  Lines       10531    10531              
  Branches      959      959              
==========================================
+ Hits         4922     4936      +14     
+ Misses       5284     5272      -12     
+ Partials      325      323       -2     

see 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@onebox-li onebox-li changed the title [CELEBORN-1059] Fix callback not update if push worker excluded during retry [WIP][CELEBORN-1059] Fix callback not update if push worker excluded during retry Oct 20, 2023
@onebox-li onebox-li force-pushed the improve-push-exclude branch from 388acdc to 0ff6d01 Compare October 20, 2023 12:39
@onebox-li
Copy link
Contributor Author

The RetryReviveTest UT can always revive succeed because of the partition location's epoch not updated when retry. If do the update there(IMO this is correct rather than before one), LifeCycleManager will exclude the worker when revive in the original behavior. Finally the test will fail because of REVIVE_FAILED(SLOT_NOT_AVAILABLE). So here let LifeCycleManager doesn't exclude workers when test revive to avoid randomness in small test clusters.

@onebox-li onebox-li changed the title [WIP][CELEBORN-1059] Fix callback not update if push worker excluded during retry [CELEBORN-1059] Fix callback not update if push worker excluded during retry Oct 20, 2023
@@ -51,7 +51,7 @@ class RetryReviveTest extends AnyFunSuite
.config(updateSparkConf(sparkConf, ShuffleMode.HASH))
.getOrCreate()
val result = ss.sparkContext.parallelize(1 to 1000, 2)
.map { i => (i, Range(1, 1000).mkString(",")) }.groupByKey(16).collect()
.map { i => (i, Range(1, 1000).mkString(",")) }.groupByKey(4).collect()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here changing the groupByKey param is to decrease some logs produced by the UT. It doesn't matters.

Copy link
Contributor

@waitinfuture waitinfuture left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Merging to main(v0.4.0)/branch-0.3(v0.3.2)

waitinfuture pushed a commit that referenced this pull request Nov 1, 2023
…g retry

### What changes were proposed in this pull request?
When retry push data and revive succeed in ShuffleClientImpl#submitRetryPushData, if new location is excluded, the callback's `lastest` location has not been updated when wrappedCallback.onFailure is called in ShuffleClientImpl#isPushTargetWorkerExcluded. Therefore there may be problems with subsequent revive.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Closes #2005 from onebox-li/improve-push-exclude.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit cd8acf8)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
@onebox-li onebox-li deleted the improve-push-exclude branch November 1, 2023 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants