[tune] Fast Node Recovery #5053

richardliaw · 2019-06-28T01:10:02Z

What do these changes do?

When nodes are pre-empted, their actors are sometimes "lost". Lost actors
keep taking up book-keeping space and hold on to resources that are never filled.

This PR aims to address that.
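
To illustrate the problem, here is a minimal sketch of the kind of book-keeping involved. `ToyExecutor` and its methods are hypothetical names made up for this example; this is not Ray Tune's executor and not the code in this PR. It only shows the idea that, once a node is gone, the entries and committed resources for its trials should be released.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for a trial executor (not Ray Tune's class)."""
    committed_cpus: int = 0
    # Maps trial name -> (node name, CPUs committed to that trial).
    running: dict = field(default_factory=dict)

    def start(self, trial, node, cpus):
        """Record that a trial is running on a node and commit its CPUs."""
        self.running[trial] = (node, cpus)
        self.committed_cpus += cpus

    def on_node_lost(self, node):
        """Drop book-keeping for every trial on the lost node and
        release the CPUs that were committed to those trials."""
        lost = [t for t, (n, _) in self.running.items() if n == node]
        for trial in lost:
            _, cpus = self.running.pop(trial)
            self.committed_cpus -= cpus
        return lost


executor = ToyExecutor()
executor.start("trial_a", node="node-1", cpus=2)
executor.start("trial_b", node="node-2", cpus=2)

# Simulate node-1 being pre-empted.
lost_trials = executor.on_node_lost("node-1")
```

Without the cleanup in `on_node_lost`, `trial_a` would stay in `running` and its two CPUs would remain committed even though nothing is using them.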

Related issue number

Linter

I've run scripts/format.sh to lint the changes in this PR.

Changes LogSync into a mixin, and adds tests for different functionalities.

AmplabJenkins · 2019-07-04T10:09:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1450/
Test FAILed.

SudeepDasari · 2019-07-05T17:56:26Z

@hartikainen I've been running this for the last day or so on 8 worker nodes and have seen 9 successful restarts thus far. Thank you so much, it seems to work!

AmplabJenkins · 2019-07-06T02:10:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15126/
Test FAILed.

AmplabJenkins · 2019-07-06T14:56:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1473/
Test FAILed.

AmplabJenkins · 2019-07-10T07:38:34Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15263/
Test PASSed.

AmplabJenkins · 2019-07-10T07:59:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1573/
Test FAILed.

hartikainen

Apart from the couple of comments, I think this looks good.

hartikainen · 2019-07-11T01:37:26Z

python/ray/tune/ray_trial_executor.py

@@ -541,8 +563,13 @@ def restore(self, trial, checkpoint=None):
                assert type(value) != Checkpoint, type(value)
                trial.runner.restore_from_object.remote(value)
            else:
-                worker_ip = ray.get(trial.runner.current_ip.remote())
-                trial.sync_logger_to_new_location(worker_ip)
+                # This can be very slow - a better fix would


Maybe add a TODO here and clarify when this would be very slow? It might also be worth creating an issue if this can cause problems.

hartikainen · 2019-07-11T01:38:41Z

python/ray/tune/trial_runner.py

        """
        self._scheduler_alg.on_trial_error(self, trial)
        self.trial_executor.set_status(trial, Trial.PENDING)
+
+        # Right now, this requeues the trial to the end of the queue. This is


This should also have a TODO and possibly an issue for fixing this?
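
For context, the behaviour the diff comment describes (a failed trial is set back to pending and ends up at the end of the queue) can be sketched roughly as follows. `ToyTrial` and `requeue_trial` are hypothetical names for illustration only, and the queue here is a plain `deque`; the actual trial runner may store and order trials differently.

```python
from collections import deque

PENDING, RUNNING = "PENDING", "RUNNING"


class ToyTrial:
    """Hypothetical minimal trial object (not ray.tune's Trial)."""

    def __init__(self, name):
        self.name = name
        self.status = PENDING


def requeue_trial(queue, trial):
    """Mark a failed trial as PENDING and move it to the back of the queue."""
    trial.status = PENDING
    if trial in queue:
        queue.remove(trial)
    queue.append(trial)


a, b, c = ToyTrial("a"), ToyTrial("b"), ToyTrial("c")
queue = deque([a, b, c])

# Trial "a" was running when its node was lost, so it gets requeued.
a.status = RUNNING
requeue_trial(queue, a)
order = [t.name for t in queue]
```

The downside this makes visible is that the recovered trial now waits behind every other pending trial, which is presumably why a TODO or follow-up issue is being asked for.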

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

AmplabJenkins · 2019-07-11T02:43:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1602/
Test FAILed.

AmplabJenkins · 2019-07-11T04:27:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15291/
Test PASSed.

AmplabJenkins · 2019-07-11T04:56:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15293/
Test PASSed.

AmplabJenkins · 2019-07-11T05:44:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1608/
Test FAILed.

AmplabJenkins · 2019-07-11T07:44:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15299/
Test PASSed.

AmplabJenkins · 2019-07-11T20:31:50Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1626/
Test FAILed.

AmplabJenkins · 2019-07-11T20:33:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15317/
Test FAILed.

AmplabJenkins · 2019-07-11T22:24:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1629/
Test FAILed.

AmplabJenkins · 2019-07-11T23:56:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15320/
Test PASSed.

hartikainen and others added 30 commits March 21, 2019 15:56

Change the log syncing behavior

6527abf

Merge branch 'master' into bunch-of-log-sync-fixes

e760991

fix up abstractions for syncer

2ada6db

Finished checkpoint syncing

26fe09b

Code

f45110d

Set of changes to get things running

ee5c61d

Fixes for log syncing

045bfa4

Merge branch 'master' into bunch-of-log-sync-fixes

c5b1731

Fix parts

7d7ced1

Merge branch 'tune-submit-fix' into bunch-of-log-sync-fixes

5ce47d7

Lint and other fixes

979a04c

fix some test

91dad93

Remove extra parsing functionality

e3ecc72

Merge branch 'tune-relax-configs' into bunch-of-log-sync-fixes

26a538f

some test fixes

b0f6218

Fix up cloud syncing

5ca8eca

Another thing to do

b7fd1e9

Merge branch 'master' into bunch-of-log-sync-fixes

1ad642a

Fix up tests and local sync

5bc70af

Changes LogSync into a mixin, and adds tests for different functionalities.

Fix up tests, start on local migration

2b6d21f

Merge branch 'master' into bunch-of-log-sync-fixes

f11800b

fix distributed migrations

ed015f6

comments

a39279e

formatting

368f90b

Better checkpoint directory handling

2e38543

fix tests

099edb9

fix tests

9d38c10

fix click

10302af

comments

2fcdfa8

formatting comments

7324426

fix-tests

620f7fd

richardliaw added 2 commits July 9, 2019 21:55

Merge branch 'master' into fix_autoscaling_tune

afccf93

lint

62e942b

hartikainen reviewed Jul 11, 2019

hartikainen approved these changes Jul 11, 2019

richardliaw and others added 4 commits July 10, 2019 18:39

Update python/ray/tune/ray_trial_executor.py

fdcc4cb

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

Update python/ray/tune/ray_trial_executor.py

13a4fc5

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

Update python/ray/tune/ray_trial_executor.py

8ff64df

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

comment and stuff

0132b71

flake

caa49dc

Merge branch 'master' into fix_autoscaling_tune

f762dbf

richardliaw added 2 commits July 11, 2019 14:06

Merge branch 'master' into fix_autoscaling_tune

27dccbf

state

baadc87

richardliaw merged commit 1530389 into ray-project:master Jul 12, 2019

DmitriGekhtman mentioned this pull request Mar 23, 2021

[autoscaler][aws] Use subnets in only one VPC #14868

Merged
