
[jobs] revamp scheduling for managed jobs #4485

Merged
28 commits merged on Jan 16, 2025

Conversation

@cg505 (Collaborator) commented Dec 19, 2024

Detaches the job controller from the ray worker and the ray driver program, and uses our own scheduling and parallelism control mechanism, derived from the state tracked in the managed jobs sqlite database on the controller.

See the comments in sky/jobs/scheduler.py for more info.

Previously, the number of simultaneous jobs was limited to 4x the CPU count by our per-job ray placement group request:

CONTROLLER_PROCESS_CPU_DEMAND = 0.25

After this PR, there are two parallelism limits:

  • 4 * cpu_count jobs can be launching at the same time.
  • memory / 350M jobs can be running at the same time.

Common and max instance sizes and their parallelism limits:

| instance type | vCPUs | memory (GB) | old job parallelism | (new) launch parallelism | (new) run parallelism |
| --- | --- | --- | --- | --- | --- |
| m6i.large / Standard_D2s_v5 / n2-standard-2 | 2 | 8 | 8 launching/running at once | 8 launches at once | 22 running at once |
| r6i.large / Standard_E2s_v5 / n2-highmem-2 | 2 | 16 | 8 launching/running at once | 8 launches at once | 44 running at once |
| m6i.2xlarge / Standard_D8s_v2 / n2-standard-8 | 8 | 32 | 32 launching/running at once | 32 launches at once | 90 running at once |
| Standard_E96s_v5 | 96 | 672 | 384 launching/running at once | 384 launches at once | ~1930 running at once |
| n2-highmem-128 | 128 | 864 | 512 launching/running at once | 512 launches at once | ~2480 running at once |
| r6i.32xlarge | 128 | 1024 | 512 launching/running at once | 512 launches at once | ~2950 running at once |

Run parallelism varies slightly between clouds, since instances listed with the same amount of memory do not actually have exactly the same number of bytes.
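
For reference, a minimal sketch of how these two limits could be computed on the controller VM (illustrative only -- the constant names and the use of psutil are assumptions, not necessarily how the PR computes them):

```python
import psutil

# Numbers taken from the PR description above (assumed, illustrative).
JOB_MEMORY_MB = 350      # each running job is budgeted roughly 350 MB
LAUNCHES_PER_CPU = 4     # i.e. 1 / CONTROLLER_PROCESS_CPU_DEMAND = 1 / 0.25

launch_parallelism = LAUNCHES_PER_CPU * psutil.cpu_count()
run_parallelism = psutil.virtual_memory().total // (JOB_MEMORY_MB * 1024 * 1024)

# e.g. on an 8 vCPU / 32 GB controller this gives 32 launches and ~90 jobs
# running at once, matching the table above.
print(f'launch parallelism: {launch_parallelism}, run parallelism: {run_parallelism}')
```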

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Tested normal usage patterns
    • Load test with ~1.5k concurrent jobs
  • /quicktest-core
    • Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.
  • pytest tests/test_smoke.py --managed-jobs --aws
  • pytest tests/test_smoke.py --managed-jobs --azure
  • pytest tests/test_smoke.py --managed-jobs --gcp

@Michaelvll (Collaborator) left a comment

Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.

@cg505 cg505 marked this pull request as ready for review December 20, 2024 05:34
@cg505 cg505 requested a review from Michaelvll December 20, 2024 05:34
@cg505 cg505 changed the title from "revamp scheduling for managed jobs" to "[jobs/ revamp scheduling for managed jobs" Dec 20, 2024
@cg505 cg505 changed the title from "[jobs/ revamp scheduling for managed jobs" to "[jobs] revamp scheduling for managed jobs" Dec 20, 2024
@cg505 (Collaborator Author) commented Dec 20, 2024

/quicktest-core

@Michaelvll (Collaborator) left a comment

Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing with managed jobs, especially for:

  1. scheduling speed for jobs
  2. special cases that may get the scheduling stuck
  3. many jobs
  4. cancellation of jobs
  5. scheduling jobs in parallel

@cg505 cg505 requested a review from Michaelvll January 7, 2025 21:32
@Michaelvll (Collaborator)

/smoke-test managed_jobs

@zpoint (Collaborator) commented Jan 9, 2025

Need to merge this PR to get the smoke-test comment to work.
I have resolved the comment, could you help take a look again? @Michaelvll

@cg505 (Collaborator Author) commented Jan 11, 2025

/quicktest-core

@Michaelvll (Collaborator) left a comment

Thanks @cg505! It looks mostly good to me. I left several comments below.

assert job_name is None or job_id is None, (job_name, job_id)
if job_id is None and job_name is not None:
# exactly one of job_name, job_id should be None
if (job_name is None) == (job_id is None):
Collaborator

This is guaranteed by the caller -- Assertion seems good?

Collaborator Author

Ah, the caller guarantees that they are not both set. If they are both None, it's actually valid - we will just download the most recent job log. Will clarify this in the code.

@@ -2,10 +2,12 @@

JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2'
JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller'
JOBS_CONTROLLER_LOGS_DIR = '~/sky_logs/jobs_controller'
Collaborator

nit: in the place we use this, we need to make sure we create the folder if it does not exist. : )


@@ -353,11 +357,15 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
managed_job_state.set_recovering(job_id=self._job_id,
task_id=task_id,
callback_func=callback_func)
# Switch to LAUNCHING schedule_state here, since the entire recovery
# process is somewhat heavy.
scheduler.wait_until_launch_okay(self._job_id)
Collaborator

I am wondering if we should directly call the scheduler in the recovery strategy executor, i.e., in the launch() function. The current way of spreading the scheduler calls across multiple places is hard to track.

Same for the first launch: would it be possible to call the scheduler in launch()? We can set the STARTING state before it is scheduled, since the scheduling state shown in that column can indicate that a STARTING job is blocked waiting for controller resources.

Collaborator Author

> I am wondering if we should directly call the scheduler in the recovery strategy executor

I guess this is actually reasonable, and it lets us change to actually using a context as you suggested before #4485 (comment).
The only downside is that we will not hold the lock for cluster termination/cleanup before the re-launch. I am hoping that's ok.

Collaborator Author

> Same for the first launch, would it be possible to call the scheduler in the launch()?

We can leave everything as-is, it will just be a no-op.
https://github.com/skypilot-org/skypilot/pull/4485/files#diff-342edc587647149b7d7849c4687967f5a1aa87df4df525b02e284dc5ec3dc305R203-R207

Collaborator Author

There is another downside - the jobs will go to STARTING status before they are scheduled to launch. This could be confusing if a lot of jobs are stuck WAITING/STARTING.

Collaborator Author

I've attempted to make this change nonetheless, let me know what you think.

Collaborator

> The only downside is that we will not hold the lock for cluster termination/cleanup before the re-launch. I am hoping that's ok.

Does this mean that we will no longer be able to limit the total number of terminations/cleanups that run in parallel? Would that cause issues like the ones we hit with the credentials?

Collaborator

> There is another downside - the jobs will go to STARTING status before they are scheduled to launch. This could be confusing if a lot of jobs are stuck WAITING/STARTING.

Is our Description column showing that the job is actually waiting for resources on the controller to schedule the launch? If that is the case, I think it should be fine. : )

# parallelism control or updating the schedule_state of any job.
# Any code that takes this lock must conclude by calling
# maybe_start_waiting_jobs.
_MANAGED_JOB_SCHEDULER_LOCK = '~/.sky/locks/managed_job_scheduler.lock'
Collaborator

Should we create the folder, in case it is not created yet?
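
(Illustrative sketch of the folder-creation guard suggested here, assuming the lock path from the snippet above; not necessarily the exact fix made in the PR.)

```python
import os

lock_path = os.path.expanduser('~/.sky/locks/managed_job_scheduler.lock')
# Make sure the locks directory exists before filelock tries to create the lock file.
os.makedirs(os.path.dirname(lock_path), exist_ok=True)
```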

@cg505 cg505 requested a review from Michaelvll January 15, 2025 01:32
@Michaelvll (Collaborator)

/smoke-test managed_jobs

@Michaelvll (Collaborator) left a comment

This is awesome @cg505!!! Thanks for the great effort!

The current scheduler looks very clean! It should be good to go in once the tests pass.

logger.error(
'Failure happened before provisioning. Failover '
f'reasons: {reasons_str}')
with scheduler.scheduled_launch(self.job_id):
Collaborator

Nice!! This feels super clean now.
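
(For illustration: a minimal, self-contained sketch of the context-manager pattern used by `scheduled_launch` above, with a semaphore standing in for the sqlite-backed state and file lock that the real scheduler in sky/jobs/scheduler.py uses.)

```python
import contextlib
import threading

# Sketch only: a semaphore stands in for the real scheduler's persistent state.
_launch_slots = threading.Semaphore(4)  # e.g. 4 * cpu_count launch slots

@contextlib.contextmanager
def scheduled_launch(job_id: int):
    """Hold a launch slot for the duration of the block (illustrative)."""
    _launch_slots.acquire()      # wait until launching this job is okay
    try:
        yield
    finally:
        _launch_slots.release()  # free the slot so another waiting job can launch

# Usage, mirroring the diff above:
# with scheduled_launch(job_id):
#     ...  # perform the (re-)launch
```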

# parallelism control. If we cannot obtain the lock, exit immediately.
# The current lock holder is expected to launch any jobs it can before
# releasing the lock.
with filelock.FileLock(os.path.expanduser(_get_lock_path()),
Collaborator

no need to call expanduser again here?
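
(For context, a minimal sketch of the non-blocking acquisition pattern described in the comment in the diff above; illustrative only, with the lock path taken from the earlier snippet and a hypothetical function name.)

```python
import os

import filelock

_LOCK_PATH = os.path.expanduser('~/.sky/locks/managed_job_scheduler.lock')

def maybe_schedule_waiting_jobs() -> None:
    """Sketch: try to take the scheduler lock without blocking.

    If another process already holds the lock, exit immediately -- the current
    holder is expected to launch any jobs it can before releasing it.
    """
    try:
        with filelock.FileLock(_LOCK_PATH, timeout=0):
            pass  # ...start any waiting jobs while holding the lock...
    except filelock.Timeout:
        return
```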

Comment on lines +116 to +118
# The last two args of the command line should be --job-id <id>
job_args = process.cmdline()[-2:]
return process.is_running() and job_args == ['--job-id', str(job_id)]
Collaborator

minor: we should double-check the length limit of the cmdline, to make sure that the cmdline will always contain the info we need here on all the systems we use for the controller. In job_lib, we prepend an identifier so that it appears at the very beginning, to avoid any length limit as much as possible.

@cg505 (Collaborator Author) Jan 15, 2025

I think this should be okay. The cmdline should not be truncated - it's limited by the normal argument limit (getconf ARG_MAX), and if that is exceeded, the process will not be spawned in the first place.
In old Linuxes, /proc/<pid>/cmdline was truncated, but this hasn't been true since 2015. So unless we have observed issues with this behavior, I think we can ignore it.

Collaborator Author

Tested on my system:

  • ulimit -s 256
  • cat /proc/self/cmdline $(seq 10016) 2>/dev/null is not truncated
  • cat /proc/self/cmdline $(seq 10017) 2>/dev/null fails to spawn the cat process (argument limit exceeded)

Collaborator

I see. It should be good then. We should leave a comment where we start the process that --job-id must be placed at the end, to stay aligned with this check here.
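
(Illustrative restatement of the check being discussed, using psutil as in the diff snippet above; the function name and the NoSuchProcess handling are assumptions, not necessarily the PR's exact code.)

```python
import psutil

def controller_process_alive(pid: int, job_id: int) -> bool:
    """Sketch: treat the job's controller process as alive if it is running
    and its command line ends with '--job-id <job_id>'."""
    try:
        process = psutil.Process(pid)
        # The last two args of the command line should be: --job-id <id>
        return (process.is_running() and
                process.cmdline()[-2:] == ['--job-id', str(job_id)])
    except psutil.NoSuchProcess:
        return False
```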

# The controller is still running.
continue
# Otherwise, proceed to mark the job as failed.
logger.error(f'Controller for {job_id_} seems to be dead.')
Collaborator

Let's avoid the term controller here, as it can be confusing (VM vs process). In most places, we use controller process to refer to the process that manages one job.

@Michaelvll (Collaborator)

btw, please double check the buildkite failures : )

@cg505 (Collaborator Author) commented Jan 15, 2025

smoke test failures look pretty unrelated... may try to run locally instead

@cg505 (Collaborator Author) commented Jan 15, 2025

/quicktest-core

@cg505 (Collaborator Author) commented Jan 15, 2025

/smoke-test managed_jobs

@cg505 (Collaborator Author) commented Jan 15, 2025

manually running smoke tests:

  • pytest tests/test_smoke.py --managed-jobs --aws
  • pytest tests/test_smoke.py --managed-jobs --azure
  • pytest tests/test_smoke.py --managed-jobs --gcp

@Michaelvll (Collaborator) commented Jan 16, 2025

> manually running smoke tests:
>
>   • pytest tests/test_smoke.py --managed-jobs --aws
>   • pytest tests/test_smoke.py --managed-jobs --azure
>   • pytest tests/test_smoke.py --managed-jobs --gcp

cc'ing @zpoint: we should avoid this kind of requirement for running manual smoke tests as much as possible.

@zpoint (Collaborator) commented Jan 16, 2025

> we should avoid this kind of requirement for running manual smoke tests as much as possible

Fixing the issue of the same user_hash being used across different instances on buildkite.

@zpoint (Collaborator) commented Jan 16, 2025

/smoke-test managed_jobs

@cg505 cg505 requested a review from Michaelvll January 16, 2025 18:34

@cg505 cg505 enabled auto-merge (squash) January 16, 2025 22:37
@cg505 cg505 merged commit f8494b5 into skypilot-org:master Jan 16, 2025
18 checks passed