-
Notifications
You must be signed in to change notification settings - Fork 688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for affinity and tolerations, refactor unit tests #6232
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Code Review Agent Run Status
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6232 +/- ##
=======================================
Coverage 36.87% 36.87%
=======================================
Files 1318 1318
Lines 134647 134653 +6
=======================================
+ Hits 49647 49657 +10
+ Misses 80679 80676 -3
+ Partials 4321 4320 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Code Review Agent Run #d27a94Actionable Suggestions - 4
Review Details
|
Changelist by BitoThis pull request implements the following key changes.
|
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Code Review Agent Run #3fcfcfActionable Suggestions - 0Review Details
|
Tracking issue
Closes #6229
Why are the changes needed?
If you are using ray to do distributed training on GPUs you may want to use tolerations and node affinity to ensure GPU workloads land on GPU nodes.
What changes were proposed in this pull request?
Updates the ray plugin to cherry pick tolerations and affinity if they are non-null or non-empty in the customized head/worker pod specs.
I also changed how assertions are made in the unit tests since they were becoming cumbersome in the previous style.
How was this patch tested?
Unit tests
Check all the applicable boxes
Related PRs
Docs link
Summary by Bito
This PR enhances the Ray plugin by implementing pod tolerations, node affinity configurations, and resource requirements for Ray head and worker pods. The changes enable custom scheduling requirements particularly for GPU workloads and include merging tolerations and affinity settings from custom pod specifications. The implementation includes refactoring of affinity configurations to ensure proper handling of resource specifications in Ray clusters, along with improved test suite organization and comprehensive test coverage.Unit tests added: True
Estimated effort to review (1-5, lower is better): 2