Add support for affinity and tolerations, refactor unit tests #6232
base: master
Conversation
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff            @@
##           master    #6232   +/-   ##
=======================================
  Coverage   36.87%   36.87%
=======================================
  Files        1318     1318
  Lines      134647   134653       +6
=======================================
+ Hits        49647    49657      +10
+ Misses      80679    80676       -3
+ Partials     4321     4320       -1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Code Review Agent Run #d27a94: Actionable Suggestions - 4
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Code Review Agent Run #3fcfcf: Actionable Suggestions - 0
```go
if len(customPodSpec.Tolerations) > 0 {
	podSpec.Tolerations = customPodSpec.Tolerations
}
```
Could we use flytek8s.MergePodSpecs to merge the podSpec instead? Like https://github.com/flyteorg/flyte/pull/6085/files#diff-5225038bf6b0b87842d5e795c8485ba63d597946970cc180637955e8baf0e38bR266-R269
Yeah, I can do that.
@pingsutw I tried using the utility function, but it's not working because the utility does slice appending, and this results in a merged pod spec with a list of two ray-head or ray-worker containers. The logic later just picks the override container, which blows away all the values from the base pod spec derived from the task.
This may work for pod templates, but it doesn't seem to work for merging two pod specs with similar container names.
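For illustration, a minimal standalone sketch of the failure mode described here (the container and image names are made up; this is not the plugin code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Base pod spec derived from the task, and a user override that
	// customizes the same container by name.
	base := v1.PodSpec{Containers: []v1.Container{{Name: "ray-head", Image: "task-image"}}}
	override := v1.PodSpec{Containers: []v1.Container{{Name: "ray-head", Image: "override-image"}}}

	// A merge that appends container slices rather than merging containers
	// by name ends up with two "ray-head" entries; logic that later picks
	// the override container loses everything set on the base one.
	merged := base.DeepCopy()
	merged.Containers = append(merged.Containers, override.Containers...)

	fmt.Println(len(merged.Containers)) // prints 2: both containers are named "ray-head"
}
```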
> I tried using the utility function but it's not working because the utility does slice appending and this results in a merged pod spec

Even if you specify the primary container name?

```go
podSpec, err = flytek8s.MergePodSpecs(podSpec, customPodSpec, "ray-head", "")
```
Yes, that is what I've been doing.
```go
func mergeCustomPodSpec(primaryContainer *v1.Container, podSpec *v1.PodSpec, k8sPod *core.K8SPod) (*v1.PodSpec, error) {
	if k8sPod == nil {
		return podSpec, nil
	}

	if k8sPod.GetPodSpec() == nil {
		return podSpec, nil
	}

	var customPodSpec *v1.PodSpec
	err := utils.UnmarshalStructToObj(k8sPod.GetPodSpec(), &customPodSpec)
	if err != nil {
		return nil, flyteerr.Errorf(flyteerr.BadTaskSpecification,
			"Unable to unmarshal pod spec [%v], Err: [%v]", k8sPod.GetPodSpec(), err.Error())
	}

	podSpec, err = flytek8s.MergePodSpecs(podSpec, customPodSpec, primaryContainer.Name, "")
	if err != nil {
		return nil, err
	}

	return podSpec, nil
}
```
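For context, a call site for this helper might look roughly like the following (a sketch only; `headGroupSpec` and the choice of `podSpec.Containers[0]` as the primary container are assumptions, not the actual plugin code):

```go
podSpec, err = mergeCustomPodSpec(&podSpec.Containers[0], podSpec, headGroupSpec.GetK8SPod())
if err != nil {
	return nil, err
}
```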
Tracking issue
Closes #6229
Why are the changes needed?
If you are using Ray to do distributed training on GPUs, you may want to use tolerations and node affinity to ensure GPU workloads land on GPU nodes.
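As a concrete sketch of that motivation (the taint key and node label are illustrative, not prescribed by the plugin), a head/worker pod spec override would carry something like a GPU toleration plus required node affinity:

```go
customPodSpec := &v1.PodSpec{
	Tolerations: []v1.Toleration{{
		// Tolerate the taint commonly placed on GPU nodes.
		Key:      "nvidia.com/gpu",
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	}},
	Affinity: &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			// Require scheduling onto nodes labeled as GPU nodes.
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "accelerator",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"nvidia-gpu"},
					}},
				}},
			},
		},
	},
}
```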
What changes were proposed in this pull request?
Updates the Ray plugin to cherry-pick tolerations and affinity if they are non-null or non-empty in the customized head/worker pod specs.
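For illustration, the affinity side of that cherry-pick presumably mirrors the tolerations snippet from the diff above (a sketch, not the exact change):

```go
if customPodSpec.Affinity != nil {
	podSpec.Affinity = customPodSpec.Affinity
}
```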
I also changed how assertions are made in the unit tests since they were becoming cumbersome in the previous style.
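A hedged sketch of what that refactor direction could look like (the helper and test names are hypothetical, not the PR's actual tests): extracting the cherry-pick behavior into a small function and asserting it table-driven with testify (imports assumed: `testing`, `k8s.io/api/core/v1`, `github.com/stretchr/testify/assert`).

```go
func mergeTolerations(podSpec, customPodSpec *v1.PodSpec) {
	// Only override when the custom spec actually sets tolerations.
	if len(customPodSpec.Tolerations) > 0 {
		podSpec.Tolerations = customPodSpec.Tolerations
	}
}

func TestMergeTolerations(t *testing.T) {
	baseToleration := v1.Toleration{Key: "base", Operator: v1.TolerationOpExists}
	gpuToleration := v1.Toleration{Key: "nvidia.com/gpu", Operator: v1.TolerationOpExists}

	tests := []struct {
		name     string
		custom   []v1.Toleration
		expected []v1.Toleration
	}{
		{name: "empty override keeps base tolerations", custom: nil, expected: []v1.Toleration{baseToleration}},
		{name: "non-empty override replaces base tolerations", custom: []v1.Toleration{gpuToleration}, expected: []v1.Toleration{gpuToleration}},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			podSpec := &v1.PodSpec{Tolerations: []v1.Toleration{baseToleration}}
			mergeTolerations(podSpec, &v1.PodSpec{Tolerations: tt.custom})
			assert.Equal(t, tt.expected, podSpec.Tolerations)
		})
	}
}
```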
How was this patch tested?
Unit tests
Summary by Bito
This PR enhances the Ray plugin by implementing pod tolerations, node affinity configurations, and resource requirements for Ray head and worker pods. The changes enable custom scheduling requirements, particularly for GPU workloads, and include merging tolerations and affinity settings from custom pod specifications. The implementation includes refactored affinity handling to ensure proper treatment of resource specifications in Ray clusters, along with improved test suite organization and comprehensive test coverage.

Unit tests added: True
Estimated effort to review (1-5, lower is better): 2