KEP: Coscheduling #2337
Why is tracking `Succeeded` and `Failed` useful at the PodGroup level? This seems to assume that pod groups are typically used only for run-to-completion jobs.
It just gives a summary of pod status here; it is not only for run-to-completion jobs.
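For illustration, such a status summary could look roughly like the Go sketch below; the field names are assumptions for this discussion, not necessarily the exact API in the KEP:

```go
// Hypothetical status summary for a PodGroup; field names are illustrative.
type PodGroupStatus struct {
	// Running is the number of pods of the group currently running.
	Running int32 `json:"running,omitempty"`
	// Succeeded is the number of pods that exited successfully; for
	// long-running services this is simply expected to stay at zero.
	Succeeded int32 `json:"succeeded,omitempty"`
	// Failed is the number of pods that exited with a failure.
	Failed int32 `json:"failed,omitempty"`
}
```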
nit: Mark it alpha explicitly & be consistent with the object kind: `alpha.incubator.scheduling.k8s.io`.
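For example, pods could reference their group through an annotation under that group; only the `alpha.incubator.scheduling.k8s.io` group comes from the comment above, the key suffix is an assumption for illustration:

```go
// Hypothetical annotation key: only the API group is taken from the comment
// above; the "group-name" suffix is an illustrative assumption.
const PodGroupNameAnnotationKey = "alpha.incubator.scheduling.k8s.io/group-name"

// On a pod this would look like:
//
//   metadata:
//     annotations:
//       alpha.incubator.scheduling.k8s.io/group-name: my-pod-group
```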
Hm... we can mention this alpha feature in the docs and release notes (in the future); that way, when we migrate this feature upstream, users will not need to update the annotation in their pods.
You may want to add that the use of annotations is temporary and for prototyping purposes. We will change it to a more permanent form, such as a field, in the future.
In that case, why not consider adding an alpha field?
An alpha field would have to be added to core, but PodGroup will be a CRD. So, I think we should postpone adding the field until we move PodGroup to the core.
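As a rough sketch of what the CRD route could look like (type and field names below are illustrative assumptions, not the final API):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodGroup as a CRD-backed type; illustrative only.
type PodGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PodGroupSpec   `json:"spec,omitempty"`
	Status PodGroupStatus `json:"status,omitempty"` // see the status sketch above
}

type PodGroupSpec struct {
	// MinMember is the minimum number of pods that must be schedulable
	// together before any pod of the group is bound.
	MinMember int32 `json:"minMember,omitempty"`
}
```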
So the batch scheduler that you intend to develop will not schedule pods that do not need gang scheduling?
Yes, for the first version. As we do not know when/whether the scheduler will get `PodGroup` in the future, the `PodGroup` is required for now, but the user can set `minMember` to 1. I'm also thinking of adding another "flag" to identify whether a `PodGroup` is required to schedule the pod. Supporting jobs without a `PodGroup` (no gang scheduling) in kube-batch should be another proposal.
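A minimal sketch of that workaround, reusing the hypothetical types above: a pod that does not need gang semantics still belongs to a PodGroup, just one with `minMember` set to 1:

```go
// A single-member "group" that effectively opts out of gang scheduling.
// Names reuse the illustrative types sketched earlier in this thread.
pg := &PodGroup{
	ObjectMeta: metav1.ObjectMeta{Name: "standalone-web"},
	Spec:       PodGroupSpec{MinMember: 1},
}
```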
Two issues to consider:
Yes, the controller will re-submit the entire job. Phantom pods may also be preempted by high-priority pods from the default scheduler. For such a mixed environment, we prefer to only share resources between elastic 'jobs', e.g. Spark executors and nginx/tomcat.
In kube-batch 0.1 we used this option; there are two major concerns: 1. it involves extra computation to pre-allocate and then free up resources, 2. after freeing up, the resources may not be usable by others, e.g. because of predicates.
So in kube-batch 0.2/0.3, I'm thinking of using preemption (or backfill) to handle this case: kube-batch will try to allocate resources up to each job's `minMember` as much as possible; if there are not enough resources, it first preempts `allocated` resources which are occupied by a pod group that cannot use them.
What triggers this "preemption"? When there are two pod groups with the same priority, each one is partially allocated, and there are no more resources in the cluster, which one preempts the other?
I'm not sure which term is better, preempt or backfill. Anyway, the assumed/allocated resources will be in the "Allocated" state if the job did not get enough resources; after the allocation phase, those allocated-but-not-bound resources will be re-considered for other jobs. For your case, the two jobs will try to get those allocated resources in order (e.g. FCFS).
I am not sure if FCFS works in this case. What if the pods of the two jobs are created/processed alternately? For example, the batch scheduler processes pod 1 of job 1, then pod 1 of job 2, then pod 2 of job 1, then pod 2 of job 2, and so on.
In kube-batch, allocation is done at the job/PodGroup level (FCFS is also per job/PodGroup); so for this case, we will not handle pod 2 of job 2 until all pods of job 1 are handled :)
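In code form, the difference is that the loop over pods is nested inside the loop over jobs; a sketch reusing the hypothetical `Job` type from above (illustrative only, not kube-batch's actual code):

```go
// Job-level FCFS: every pod of an earlier job is considered before any pod
// of a later job, so alternating pod creation order does not interleave
// allocation.
func allocateFCFS(jobs []*Job, assumePod func(j *Job, podIndex int) bool) {
	for _, j := range jobs { // jobs in arrival (FCFS) order
		for i := 0; i < j.MinMember; i++ {
			if !assumePod(j, i) {
				break // not enough resources for this job right now
			}
		}
	}
}
```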
Backfilling sounds too optimistic; many k8s clusters may not have that many jobs to backfill.
It's only one of the options :)
Why is batch distinguished here? A restart policy could similarly cause an endless restart loop in other types of tasks. The kubelet has a backoff mechanism to avoid a tight restart loop.
I feel this tip is not relevant to this proposal. It is better suited for a controller design proposal or user guide, and the job controller already has some documentation on this.
Several people asked about this when using kube-arbitrator, so I highlighted it here. I'm OK with highlighting it in the README or some other place instead.