Reconstruct failed actors without sending tasks. #5161

Merged 8 commits on Jul 15, 2019

Conversation

raulchen
Contributor

@raulchen raulchen commented Jul 10, 2019

What do these changes do?

Previously, we had to send a task to a failed actor to trigger its reconstruction. This is a problem in some cases. For example, an actor that only reads data from an external DB never receives tasks, so it would never be reconstructed after a failure. With this PR, failed actors are reconstructed without a task having to be sent.
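To make the behavior change concrete, here is a toy model of the two trigger policies. It is only an illustration: `ToyActorManager`, `eager_reconstruction`, `on_actor_died`, and `submit_task` are hypothetical names made up for this sketch and do not correspond to Ray's actual code.

```python
class ToyActorManager:
    """Toy stand-in for whatever tracks actor state (hypothetical, not Ray code)."""

    def __init__(self, eager_reconstruction):
        # eager_reconstruction=False models the old behavior described above,
        # eager_reconstruction=True models the behavior this PR aims for.
        self.eager_reconstruction = eager_reconstruction
        self.alive = True
        self.reconstructions = 0

    def _reconstruct(self):
        self.alive = True
        self.reconstructions += 1

    def on_actor_died(self):
        """Called when the actor's failure is detected."""
        self.alive = False
        if self.eager_reconstruction:
            # New behavior: reconstruct as soon as the failure is noticed.
            self._reconstruct()

    def submit_task(self):
        """Old behavior: submitting a task is what triggers reconstruction."""
        if not self.alive:
            self._reconstruct()
        return "ok"


# An actor that never receives tasks stays dead under the old policy...
lazy = ToyActorManager(eager_reconstruction=False)
lazy.on_actor_died()

# ...but is brought back immediately under the new one.
eager = ToyActorManager(eager_reconstruction=True)
eager.on_actor_died()
```

In this model `lazy.alive` stays `False` until someone calls `lazy.submit_task()`, which is exactly the case the DB-reading actor never reaches.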

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@raulchen raulchen requested a review from stephanie-wang July 10, 2019 14:00
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15279/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1589/
Test FAILed.

@stephanie-wang stephanie-wang self-assigned this Jul 10, 2019
Contributor

@zhijunfu zhijunfu left a comment

looks good to me. thanks

Contributor

@stephanie-wang stephanie-wang left a comment

Thanks, this looks good!

@@ -94,3 +94,25 @@ def wait_for_errors(error_type, num_errors, timeout=10):
            return
        time.sleep(0.1)
    raise Exception("Timing out of wait.")


def wait_for_contition(condition_predictor,

Suggested change
- def wait_for_contition(condition_predictor,
+ def wait_for_condition(condition_predictor,

def wait_for_contition(condition_predictor,
                       timeout_ms=1000,
                       retry_interval_ms=100):
    """A helper function that wait until a conition is met.

Suggested change
-     """A helper function that wait until a conition is met.
+     """A helper function that waits until a condition is met.
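
For readers following along, a polling helper with this signature could look roughly like the sketch below. Only the name, the three parameters, and the docstring's first line come from the diff. The body, the use of `time.monotonic`, and the `True`/`False` return convention are assumptions, since the PR excerpt does not show the rest of the function.

```python
import time


def wait_for_condition(condition_predictor,
                       timeout_ms=1000,
                       retry_interval_ms=100):
    """A helper function that waits until a condition is met.

    Assumed behavior (not shown in the diff): returns True as soon as
    condition_predictor() is truthy, or False once timeout_ms has elapsed.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        if condition_predictor():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between polls so the loop does not spin at full speed.
        time.sleep(retry_interval_ms / 1000.0)
```

A test would then poll for the state it expects instead of sleeping for a fixed time, e.g. `assert wait_for_condition(actor_is_back, timeout_ms=5000)`, where `actor_is_back` is a hypothetical zero-argument predicate supplied by the test.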

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1636/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15328/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15332/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1640/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1641/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15333/
Test PASSed.

@stephanie-wang
Contributor

Looks like the unit test that was added failed on one of the Travis runs: https://travis-ci.com/ray-project/ray/jobs/215417720. We should increase the timeout for that test.

@raulchen
Contributor Author

thanks, increased to 5s

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15366/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1670/
Test FAILed.

@raulchen
Contributor Author

@stephanie-wang Tests have passed. Can you give a stamp?

Contributor

@stephanie-wang stephanie-wang left a comment

Thanks! :)

@stephanie-wang stephanie-wang merged commit ea6aa64 into ray-project:master Jul 15, 2019
@raulchen raulchen deleted the fast_reconstruct branch July 16, 2019 03:34