
[jobs] revamp scheduling for managed jobs #4485

Merged
28 commits merged on Jan 16, 2025

Conversation

@cg505 (Collaborator) commented Dec 19, 2024

Detaches the job controller from the ray worker and the ray driver program, and uses our own scheduling and parallelism control mechanism, derived from the state tracked in the managed jobs sqlite database on the controller.

See the comments in sky/jobs/scheduler.py for more info.

Previously, the number of simultaneous jobs was limited to 4x the CPU count by our per-job ray placement group request:

CONTROLLER_PROCESS_CPU_DEMAND = 0.25

After this PR, there are two parallelism limits:

  • 4 * cpu_count jobs can be launching at the same time.
  • memory / 350M jobs can be running at the same time.

Common and max instance sizes and their parallelism limits:

| instance type | vCPUs | memory (GB) | old job parallelism | (new) launch parallelism | (new) run parallelism |
| --- | --- | --- | --- | --- | --- |
| m6i.large / Standard_D2s_v5 / n2-standard-2 | 2 | 8 | 8 launching/running at once | 8 launches at once | 22 running at once |
| r6i.large / Standard_E2s_v5 / n2-highmem-2 | 2 | 16 | 8 launching/running at once | 8 launches at once | 44 running at once |
| m6i.2xlarge / Standard_D8s_v2 / n2-standard-8 | 8 | 32 | 32 launching/running at once | 32 launches at once | 90 running at once |
| Standard_E96s_v5 | 96 | 672 | 384 launching/running at once | 384 launches at once | ~1930 running at once |
| n2-highmem-128 | 128 | 864 | 512 launching/running at once | 512 launches at once | ~2480 running at once |
| r6i.32xlarge | 128 | 1024 | 512 launching/running at once | 512 launches at once | ~2950 running at once |

Run parallelism varies slightly between clouds, since instances listed with the same amount of memory do not actually have exactly the same number of bytes.
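
For reference, a minimal sketch of how these two limits could be computed on the controller VM (illustrative only -- the constant names and the use of psutil are assumptions, not necessarily how the PR computes them):

```python
import psutil

# Numbers taken from the PR description above (assumed, illustrative).
JOB_MEMORY_MB = 350      # each running job is budgeted roughly 350 MB
LAUNCHES_PER_CPU = 4     # i.e. 1 / CONTROLLER_PROCESS_CPU_DEMAND = 1 / 0.25

launch_parallelism = LAUNCHES_PER_CPU * psutil.cpu_count()
run_parallelism = psutil.virtual_memory().total // (JOB_MEMORY_MB * 1024 * 1024)

# e.g. on an 8 vCPU / 32 GB controller this gives 32 launches and ~90 jobs
# running at once, matching the table above.
print(f'launch parallelism: {launch_parallelism}, run parallelism: {run_parallelism}')
```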

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Tested normal usage patterns
    • Load test with ~1.5k concurrent jobs
  • /quicktest-core
    • Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.
  • pytest tests/test_smoke.py --managed-jobs --aws
  • pytest tests/test_smoke.py --managed-jobs --azure
  • pytest tests/test_smoke.py --managed-jobs --gcp

@Michaelvll (Collaborator) left a comment

Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.

@cg505 cg505 marked this pull request as ready for review December 20, 2024 05:34
@cg505 cg505 requested a review from Michaelvll December 20, 2024 05:34
@cg505 cg505 changed the title from "revamp scheduling for managed jobs" to "[jobs/ revamp scheduling for managed jobs" Dec 20, 2024
@cg505 cg505 changed the title from "[jobs/ revamp scheduling for managed jobs" to "[jobs] revamp scheduling for managed jobs" Dec 20, 2024
@cg505 (Collaborator Author) commented Dec 20, 2024

/quicktest-core

@Michaelvll (Collaborator) left a comment

Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing with managed jobs, especially for:

  1. scheduling speed for jobs
  2. special cases that may get the scheduling stuck
  3. many jobs
  4. cancellation of jobs
  5. scheduling jobs in parallel

@cg505 cg505 requested a review from Michaelvll January 7, 2025 21:32
@Michaelvll (Collaborator)

/smoke-test managed_jobs

@zpoint (Collaborator) commented Jan 9, 2025

Need to merge this PR to get the smoke-test comment to work.
I have resolved the comment, could you help take a look again? @Michaelvll

@cg505 (Collaborator Author) commented Jan 11, 2025

/quicktest-core

@Michaelvll (Collaborator) left a comment

Thanks @cg505! It looks mostly good to me. I left several comments below.

assert job_name is None or job_id is None, (job_name, job_id)
if job_id is None and job_name is not None:
# exactly one of job_name, job_id should be None
if (job_name is None) == (job_id is None):
Collaborator

This is guaranteed by the caller -- Assertion seems good?

Collaborator Author

Ah, the caller guarantees that they are not both set. If they are both None, it's actually valid - we will just download the most recent job log. Will clarify this in the code.

@@ -2,10 +2,12 @@

JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2'
JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller'
JOBS_CONTROLLER_LOGS_DIR = '~/sky_logs/jobs_controller'
Collaborator

nit: in the place we use this, we need to make sure we create the folder if it does not exist. : )


@@ -353,11 +357,15 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
managed_job_state.set_recovering(job_id=self._job_id,
task_id=task_id,
callback_func=callback_func)
# Switch to LAUNCHING schedule_state here, since the entire recovery
# process is somewhat heavy.
scheduler.wait_until_launch_okay(self._job_id)
Collaborator

I am wondering if we should directly call the scheduler in the recovery strategy executor, i.e., in the launch() function. The current way of spreading the scheduler calls across multiple places is hard to track.

Same for the first launch: would it be possible to call the scheduler in launch()? We can set the STARTING state before it is scheduled, since the scheduling state shown in that column can indicate that a STARTING job is blocked waiting for controller resources.

Collaborator Author

> I am wondering if we should directly call the scheduler in the recovery strategy executor

I guess this is actually reasonable, and it lets us change to actually using a context as you suggested before #4485 (comment).
The only downside is that we will not hold the lock for cluster termination/cleanup before the re-launch. I am hoping that's ok.

Collaborator Author

> Same for the first launch, would it be possible to call the scheduler in the launch()?

We can leave everything as-is, it will just be a no-op.
https://github.com/skypilot-org/skypilot/pull/4485/files#diff-342edc587647149b7d7849c4687967f5a1aa87df4df525b02e284dc5ec3dc305R203-R207

Collaborator Author

There is another downside - the jobs will go to STARTING status before they are scheduled to launch. This could be confusing if a lot of jobs are stuck WAITING/STARTING.

Collaborator Author

I've attempted to make this change nonetheless, let me know what you think.

Collaborator

> The only downside is that we will not hold the lock for cluster termination/cleanup before the re-launch. I am hoping that's ok.

Does this mean that we will no longer be able to limit the total number of terminations/cleanups that run in parallel? Would that cause issues like the ones we hit with the credentials?

Collaborator

> There is another downside - the jobs will go to STARTING status before they are scheduled to launch. This could be confusing if a lot of jobs are stuck WAITING/STARTING.

Is our Description column showing that the job is actually waiting for resources on the controller to schedule the launch? If that is the case, I think it should be fine. : )

# parallelism control or updating the schedule_state of any job.
# Any code that takes this lock must conclude by calling
# maybe_start_waiting_jobs.
_MANAGED_JOB_SCHEDULER_LOCK = '~/.sky/locks/managed_job_scheduler.lock'
Collaborator

Should we create the folder, in case it is not created yet?
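
(Illustrative sketch of the folder-creation guard suggested here, assuming the lock path from the snippet above; not necessarily the exact fix made in the PR.)

```python
import os

lock_path = os.path.expanduser('~/.sky/locks/managed_job_scheduler.lock')
# Make sure the locks directory exists before filelock tries to create the lock file.
os.makedirs(os.path.dirname(lock_path), exist_ok=True)
```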

@cg505 cg505 requested a review from Michaelvll January 15, 2025 01:32
@Michaelvll (Collaborator)

/smoke-test managed_jobs

@Michaelvll (Collaborator) left a comment

This is awesome @cg505!!! Thanks for the great effort!

The current scheduler looks very clean! It should be good to go in once the tests pass.

logger.error(
'Failure happened before provisioning. Failover '
f'reasons: {reasons_str}')
with scheduler.scheduled_launch(self.job_id):
Collaborator

Nice!! This feels super clean now.
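
(For illustration: a minimal, self-contained sketch of the context-manager pattern used by `scheduled_launch` above, with a semaphore standing in for the sqlite-backed state and file lock that the real scheduler in sky/jobs/scheduler.py uses.)

```python
import contextlib
import threading

# Sketch only: a semaphore stands in for the real scheduler's persistent state.
_launch_slots = threading.Semaphore(4)  # e.g. 4 * cpu_count launch slots

@contextlib.contextmanager
def scheduled_launch(job_id: int):
    """Hold a launch slot for the duration of the block (illustrative)."""
    _launch_slots.acquire()      # wait until launching this job is okay
    try:
        yield
    finally:
        _launch_slots.release()  # free the slot so another waiting job can launch

# Usage, mirroring the diff above:
# with scheduled_launch(job_id):
#     ...  # perform the (re-)launch
```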

# parallelism control. If we cannot obtain the lock, exit immediately.
# The current lock holder is expected to launch any jobs it can before
# releasing the lock.
with filelock.FileLock(os.path.expanduser(_get_lock_path()),
Collaborator

no need to call expanduser again here?
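
(For context, a minimal sketch of the non-blocking acquisition pattern described in the comment in the diff above; illustrative only, with the lock path taken from the earlier snippet and a hypothetical function name.)

```python
import os

import filelock

_LOCK_PATH = os.path.expanduser('~/.sky/locks/managed_job_scheduler.lock')

def maybe_schedule_waiting_jobs() -> None:
    """Sketch: try to take the scheduler lock without blocking.

    If another process already holds the lock, exit immediately -- the current
    holder is expected to launch any jobs it can before releasing it.
    """
    try:
        with filelock.FileLock(_LOCK_PATH, timeout=0):
            pass  # ...start any waiting jobs while holding the lock...
    except filelock.Timeout:
        return
```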

Comment on lines +116 to +118
# The last two args of the command line should be --job-id <id>
job_args = process.cmdline()[-2:]
return process.is_running() and job_args == ['--job-id', str(job_id)]
Collaborator

minor: we should double-check the length limit of the cmdline, to make sure that the cmdline will always contain the info we need here on all the systems we use for the controller. In job_lib, we prepend an identifier so that it appears at the very beginning, to avoid any length limit as much as possible.

@cg505 (Collaborator Author) Jan 15, 2025

I think this should be okay. The cmdline should not be truncated - it's limited by the normal argument limit (getconf ARG_MAX), and if that is exceeded, the process will not be spawned in the first place.
In old Linuxes, /proc/<pid>/cmdline was truncated, but this hasn't been true since 2015. So unless we have observed issues with this behavior, I think we can ignore it.

Collaborator Author

Tested on my system:

  • ulimit -s 256
  • cat /proc/self/cmdline $(seq 10016) 2>/dev/null is not truncated
  • cat /proc/self/cmdline $(seq 10017) 2>/dev/null fails to spawn the cat process (argument limit exceeded)

Collaborator

I see. It should be good then. We should leave a comment where we start the process that --job-id must be placed at the end, to stay aligned with this check here.
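
(Illustrative restatement of the check being discussed, using psutil as in the diff snippet above; the function name and the NoSuchProcess handling are assumptions, not necessarily the PR's exact code.)

```python
import psutil

def controller_process_alive(pid: int, job_id: int) -> bool:
    """Sketch: treat the job's controller process as alive if it is running
    and its command line ends with '--job-id <job_id>'."""
    try:
        process = psutil.Process(pid)
        # The last two args of the command line should be: --job-id <id>
        return (process.is_running() and
                process.cmdline()[-2:] == ['--job-id', str(job_id)])
    except psutil.NoSuchProcess:
        return False
```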

# The controller is still running.
continue
# Otherwise, proceed to mark the job as failed.
logger.error(f'Controller for {job_id_} seems to be dead.')
Collaborator

Let's avoid the term controller here, as it can be confusing (VM vs process). In most places, we use controller process to refer to the process that manages one job.

@Michaelvll (Collaborator)

btw, please double check the buildkite failures : )

@cg505 (Collaborator Author) commented Jan 15, 2025

smoke test failures look pretty unrelated... may try to run locally instead

@cg505 (Collaborator Author) commented Jan 15, 2025

/quicktest-core

@cg505 (Collaborator Author) commented Jan 15, 2025

/smoke-test managed_jobs

@cg505 (Collaborator Author) commented Jan 15, 2025

manually running smoke tests:

  • pytest tests/test_smoke.py --managed-jobs --aws
  • pytest tests/test_smoke.py --managed-jobs --azure
  • pytest tests/test_smoke.py --managed-jobs --gcp

@Michaelvll (Collaborator) commented Jan 16, 2025

> manually running smoke tests:
>
>   • pytest tests/test_smoke.py --managed-jobs --aws
>   • pytest tests/test_smoke.py --managed-jobs --azure
>   • pytest tests/test_smoke.py --managed-jobs --gcp

cc'ing @zpoint: we should avoid this kind of requirement for running manual smoke tests as much as possible.

@zpoint (Collaborator) commented Jan 16, 2025

> we should avoid this kind of requirement for running manual smoke tests as much as possible

Fixing the issue of the same user_hash being used across different instances on buildkite.

@zpoint (Collaborator) commented Jan 16, 2025

/smoke-test managed_jobs

@cg505 cg505 requested a review from Michaelvll January 16, 2025 18:34

@cg505 cg505 enabled auto-merge (squash) January 16, 2025 22:37
@cg505 cg505 merged commit f8494b5 into skypilot-org:master Jan 16, 2025
18 checks passed