[jobs] revamp scheduling for managed jobs #4485
Conversation
Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.
/quicktest-core
Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing with managed jobs, especially for:
- scheduling speed for jobs
- special cases that may get the scheduling stuck
- many jobs
- cancellation of jobs
- parallel job scheduling
/smoke-test managed_jobs
Need to merge this PR to get
/quicktest-core
Thanks @cg505! It looks mostly good to me. Left several comments below.
sky/backends/cloud_vm_ray_backend.py
Outdated
assert job_name is None or job_id is None, (job_name, job_id)
if job_id is None and job_name is not None:
# exactly one of job_name, job_id should be None
if (job_name is None) == (job_id is None):
This is guaranteed by the caller -- an assertion seems good?
Ah, the caller guarantees that they are not both set. If they are both None, it's actually valid - we will just download the most recent job log. Will clarify this in the code.
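For illustration, a minimal sketch of the check being discussed (not the exact code in cloud_vm_ray_backend.py; the helper name is hypothetical): both arguments being None is valid and means the most recent job, so only the both-set case is rejected.

```python
from typing import Optional


def _check_job_identifier(job_name: Optional[str],
                          job_id: Optional[int]) -> None:
    # Hypothetical helper: both being None is valid (use the most recent
    # job's log), but specifying both job_name and job_id is a caller error.
    if job_name is not None and job_id is not None:
        raise ValueError('Cannot specify both job_name and job_id: '
                         f'{job_name!r}, {job_id!r}')
```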
@@ -2,10 +2,12 @@
JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2'
JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller'
JOBS_CONTROLLER_LOGS_DIR = '~/sky_logs/jobs_controller'
nit: in the place where we use this, we need to make sure we create the folder if it does not exist. : )
sky/jobs/controller.py
Outdated
@@ -353,11 +357,15 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
managed_job_state.set_recovering(job_id=self._job_id,
                                 task_id=task_id,
                                 callback_func=callback_func)
# Switch to LAUNCHING schedule_state here, since the entire recovery
# process is somewhat heavy.
scheduler.wait_until_launch_okay(self._job_id)
I am wondering if we should directly call the scheduler in the recovery strategy executor, i.e., in the launch() function. The current way of spreading the scheduler across multiple places is hard to track.
Same for the first launch, would it be possible to call the scheduler in launch()? We can set the STARTING state before it is scheduled, since we can have the scheduling state shown in the column to indicate that the STARTING state is blocked by waiting for controller resources.
I am wondering if we should directly call the scheduler in the recovery strategy executor
I guess this is actually reasonable, and it lets us change to actually using a context as you suggested before #4485 (comment).
The only downside is that we will not hold the lock for cluster termination/cleanup before the re-launch. I am hoping that's ok.
Same for the first launch, would it be possible to call the scheduler in the launch()?
We can leave everything as-is, it will just be a no-op.
https://github.com/skypilot-org/skypilot/pull/4485/files#diff-342edc587647149b7d7849c4687967f5a1aa87df4df525b02e284dc5ec3dc305R203-R207
There is another downside - the jobs will go to STARTING status before they are scheduled to launch. This could be confusing if a lot of jobs are stuck WAITING/STARTING.
I've attempted to make this change nonetheless, let me know what you think.
The only downside is that we will not hold the lock for cluster termination/cleanup before the re-launch. I am hoping that's ok.
Does this mean that we will no longer be able to limit the total number of terminations / cleanups that run in parallel? Would that cause issues like the one we hit with the credentials?
There is another downside - the jobs will go to STARTING status before they are scheduled to launch. This could be confusing if a lot of jobs are stuck WAITING/STARTING.
Is our Description column showing that the job is actually waiting for resources on the controller to schedule the launch? If that is the case, I think it should be fine. : )
# parallelism control or updating the schedule_state of any job.
# Any code that takes this lock must conclude by calling
# maybe_start_waiting_jobs.
_MANAGED_JOB_SCHEDULER_LOCK = '~/.sky/locks/managed_job_scheduler.lock'
Should we create the folder, in case it is not created yet?
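For illustration, a minimal sketch of creating the lock directory lazily before acquiring the lock (assuming the filelock package; _get_lock_path and maybe_start_waiting_jobs are the names referenced in the snippets above, but the bodies here are only a sketch):

```python
import os

import filelock

_MANAGED_JOB_SCHEDULER_LOCK = '~/.sky/locks/managed_job_scheduler.lock'


def _get_lock_path() -> str:
    path = os.path.expanduser(_MANAGED_JOB_SCHEDULER_LOCK)
    # Create the parent directory if it does not exist yet, so that
    # filelock can create the lock file on first use.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


def maybe_start_waiting_jobs() -> None:
    try:
        # timeout=0: give up immediately if another process holds the lock.
        # That process is expected to launch any jobs it can before
        # releasing the lock, so exiting here is safe.
        with filelock.FileLock(_get_lock_path(), timeout=0):
            pass  # ...scan the jobs table and start whatever is allowed...
    except filelock.Timeout:
        return
```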
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
/smoke-test managed_jobs
This is awesome @cg505!!! Thanks for the great effort!
The current scheduler looks very clean! It should be good to go in once the tests pass.
logger.error(
    'Failure happened before provisioning. Failover '
    f'reasons: {reasons_str}')
with scheduler.scheduled_launch(self.job_id):
Nice!! This feels super clean now.
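For reference, a context-manager-based gate along these lines could look like the sketch below (illustrative only; wait_until_launch_okay matches the call in the diff above, while launch_finished is a hypothetical counterpart that releases the launch slot):

```python
import contextlib
from typing import Iterator


def wait_until_launch_okay(job_id: int) -> None:
    """Block until the scheduler grants `job_id` a launch slot (stub)."""


def launch_finished(job_id: int) -> None:
    """Release the launch slot for `job_id` and wake the scheduler (stub)."""


@contextlib.contextmanager
def scheduled_launch(job_id: int) -> Iterator[None]:
    # Hold a launch slot only for the duration of the block: the job is in
    # the LAUNCHING schedule_state while launching/recovering, and the slot
    # is released afterwards so another waiting job can be scheduled.
    wait_until_launch_okay(job_id)
    try:
        yield
    finally:
        launch_finished(job_id)
```

Usage would then match the snippet above: `with scheduled_launch(self.job_id): ...` wraps the launch or recovery body.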
sky/jobs/scheduler.py
Outdated
# parallelism control. If we cannot obtain the lock, exit immediately.
# The current lock holder is expected to launch any jobs it can before
# releasing the lock.
with filelock.FileLock(os.path.expanduser(_get_lock_path()),
no need to call expanduser again here?
# The last two args of the command line should be --job-id <id>
job_args = process.cmdline()[-2:]
return process.is_running() and job_args == ['--job-id', str(job_id)]
minor: we should double-check the length limit of the cmdline, to make sure that cmdline will always contain the info we need here on all the systems we use for the controller. In job_lib, we prepend an identifier so that it appears at the very beginning, to avoid any length limit as much as possible.
I think this should be okay. The cmdline should not be truncated - it's limited by the normal argument limit (getconf ARG_MAX), and if this is violated, the process will not be spawned in the first place.
In old Linuxes, /proc/<pid>/cmdline was truncated, but this hasn't been true since 2015. So unless we have observed issues with this behavior, I think we can ignore it.
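For context, the same limit can be read programmatically (os.sysconf is POSIX-only; this simply mirrors getconf ARG_MAX):

```python
import os

# Equivalent of `getconf ARG_MAX`: the maximum combined size of argv and the
# environment for a newly spawned process (POSIX systems only).
arg_max = os.sysconf('SC_ARG_MAX')
print(f'ARG_MAX = {arg_max} bytes')
```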
Tested on my system (with ulimit -s 256):
- cat /proc/self/cmdline $(seq 10016) 2>/dev/null is not truncated
- cat /proc/self/cmdline $(seq 10017) 2>/dev/null fails to spawn the cat process (argument limit exceeded)
I see. It should be good then. We should leave a comment where we start the process, noting that --job-id must be placed at the end to stay aligned with this check here.
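A self-contained sketch of the liveness-check pattern being discussed (psutil is assumed; the function name is illustrative, not necessarily what the PR uses):

```python
import psutil


def controller_process_alive(pid: int, job_id: int) -> bool:
    """Best-effort check that `pid` is still the controller process for `job_id`.

    Relies on the convention that the process is started with `--job-id <id>`
    as its last two command-line arguments, which also guards against the PID
    having been recycled by an unrelated process.
    """
    try:
        process = psutil.Process(pid)
        # The last two args of the command line should be --job-id <id>.
        job_args = process.cmdline()[-2:]
        return process.is_running() and job_args == ['--job-id', str(job_id)]
    except psutil.NoSuchProcess:
        return False
```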
sky/jobs/utils.py
Outdated
# The controller is still running.
continue
# Otherwise, proceed to mark the job as failed.
logger.error(f'Controller for {job_id_} seems to be dead.')
Let's avoid the term controller here, as it can be confusing between the controller VM and the controller process. In most places, we use controller process to refer to the process that manages one job.
btw, please double check the buildkite failures : )
smoke test failures look pretty unrelated... may try to run locally instead
/quicktest-core
/smoke-test managed_jobs
manually running smoke tests:
cc'ing @zpoint, we should avoid this kind of requirement for running manual smoke tests as much as possible
Fixing the issue of the same
/smoke-test managed_jobs
Detaches the job controller from the Ray worker and the Ray driver program, and uses our own scheduling and parallelism control mechanism, derived from the state tracked in the managed jobs sqlite database on the controller.
See the comments in sky/jobs/scheduler.py for more info.
Previously, the number of simultaneous jobs was limited to 4x the CPU count, by our per-job Ray placement group request:
skypilot/sky/skylet/constants.py
Line 293 in 1578108
After this PR, there are two parallelism limits:
- 4 * cpu_count jobs can be launching at the same time.
- memory / 350M jobs can be running at the same time.

Common and max instance sizes and their parallelism limits
Run parallelism varies slightly between clouds, as instances listed with the same amount of memory do not actually have exactly the same number of bytes.
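As a rough sketch of how these two limits could be computed (the constant and function names here are illustrative, not necessarily what the PR uses; psutil is assumed for reading total memory):

```python
import os

import psutil

# Assumed per-job memory footprint used for the running-jobs limit.
_JOB_MEMORY_MB = 350


def launch_parallelism() -> int:
    # Up to 4 * cpu_count jobs may be in a launching state at the same time.
    return 4 * (os.cpu_count() or 1)


def job_parallelism() -> int:
    # Up to total_memory / 350MB jobs may be running at the same time.
    return max(1, psutil.virtual_memory().total // (_JOB_MEMORY_MB * 1024**2))
```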
Tested (run the relevant ones):
bash format.sh