-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Fast Node Recovery #5053
[tune] Fast Node Recovery #5053
Conversation
Changes LogSync into a mixin, and adds tests for different functionalities.
Test FAILed. |
@hartikainen I've been running this for the last day or so on 8 worker nodes and have seen 9 successful restarts thus far. Thank you so much, it seems to work! |
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the couple of comments, I think this looks good.
@@ -541,8 +563,13 @@ def restore(self, trial, checkpoint=None): | |||
assert type(value) != Checkpoint, type(value) | |||
trial.runner.restore_from_object.remote(value) | |||
else: | |||
worker_ip = ray.get(trial.runner.current_ip.remote()) | |||
trial.sync_logger_to_new_location(worker_ip) | |||
# This can be very slow - a better fix would |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a TODO here and clarification when this would be very slow? Also might want to create an issue if this can cause problems.
python/ray/tune/trial_runner.py
Outdated
""" | ||
self._scheduler_alg.on_trial_error(self, trial) | ||
self.trial_executor.set_status(trial, Trial.PENDING) | ||
|
||
# Right now, this requeues the trial to the end of the queue. This is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also have a TODO and possibly an issue for fixing this?
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
Test FAILed. |
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test PASSed. |
What do these changes do?
Pre-empting nodes will result in actors sometimes being "lost". This
ends up taking book-keeping space and also resources that aren't filled.
This PR aims to address that.
Related issue number
Linter
scripts/format.sh
to lint the changes in this PR.