Add support for affinity and tolerations, refactor unit tests #6232
base: master
Conversation
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff            @@
##           master    #6232   +/-   ##
=======================================
  Coverage   36.87%   36.87%
=======================================
  Files        1318     1318
  Lines      134647   134653       +6
=======================================
+ Hits        49647    49657      +10
+ Misses      80679    80676       -3
+ Partials     4321     4320       -1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Code Review Agent Run #d27a94: Actionable Suggestions - 4
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Code Review Agent Run #3fcfcf: Actionable Suggestions - 0
```go
if len(customPodSpec.Tolerations) > 0 {
	podSpec.Tolerations = customPodSpec.Tolerations
}
```
Could we use flytek8s.MergePodSpecs to merge the podSpec instead? Like https://github.com/flyteorg/flyte/pull/6085/files#diff-5225038bf6b0b87842d5e795c8485ba63d597946970cc180637955e8baf0e38bR266-R269
Yeah, I can do that.
@pingsutw I tried using the utility function, but it's not working because the utility does slice appending, and this results in a merged pod spec with a list of two ray-head or ray-worker containers. The logic later just picks the override container, which blows away all the values from the base pod spec derived from the task.
This may work for pod templates, but it doesn't seem to work for merging two pod specs with similar container names.
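For illustration, a minimal standalone sketch of the failure mode described here (the container and image names are made up; this is not the plugin code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Base pod spec derived from the task, and a user override that
	// customizes the same container by name.
	base := v1.PodSpec{Containers: []v1.Container{{Name: "ray-head", Image: "task-image"}}}
	override := v1.PodSpec{Containers: []v1.Container{{Name: "ray-head", Image: "override-image"}}}

	// A merge that appends container slices rather than merging containers
	// by name ends up with two "ray-head" entries; logic that later picks
	// the override container loses everything set on the base one.
	merged := base.DeepCopy()
	merged.Containers = append(merged.Containers, override.Containers...)

	fmt.Println(len(merged.Containers)) // prints 2: both containers are named "ray-head"
}
```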
> I tried using the utility function but it's not working because the utility does slice appending and this results in a merged pod spec

Even if you specify the primary container name?

```go
podSpec, err = flytek8s.MergePodSpecs(podSpec, customPodSpec, "ray-head", "")
```
Yes, that is what I've been doing.
```go
func mergeCustomPodSpec(primaryContainer *v1.Container, podSpec *v1.PodSpec, k8sPod *core.K8SPod) (*v1.PodSpec, error) {
	if k8sPod == nil {
		return podSpec, nil
	}

	if k8sPod.GetPodSpec() == nil {
		return podSpec, nil
	}

	var customPodSpec *v1.PodSpec
	err := utils.UnmarshalStructToObj(k8sPod.GetPodSpec(), &customPodSpec)
	if err != nil {
		return nil, flyteerr.Errorf(flyteerr.BadTaskSpecification,
			"Unable to unmarshal pod spec [%v], Err: [%v]", k8sPod.GetPodSpec(), err.Error())
	}

	podSpec, err = flytek8s.MergePodSpecs(podSpec, customPodSpec, primaryContainer.Name, "")
	if err != nil {
		return nil, err
	}

	return podSpec, nil
}
```
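For context, a call site for this helper might look roughly like the following (a sketch only; `headGroupSpec` and the choice of `podSpec.Containers[0]` as the primary container are assumptions, not the actual plugin code):

```go
podSpec, err = mergeCustomPodSpec(&podSpec.Containers[0], podSpec, headGroupSpec.GetK8SPod())
if err != nil {
	return nil, err
}
```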
Tracking issue
Closes #6229
Why are the changes needed?
If you are using Ray to do distributed training on GPUs, you may want to use tolerations and node affinity to ensure GPU workloads land on GPU nodes.
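As a concrete sketch of that motivation (the taint key and node label are illustrative, not prescribed by the plugin), a head/worker pod spec override would carry something like a GPU toleration plus required node affinity:

```go
customPodSpec := &v1.PodSpec{
	Tolerations: []v1.Toleration{{
		// Tolerate the taint commonly placed on GPU nodes.
		Key:      "nvidia.com/gpu",
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	}},
	Affinity: &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			// Require scheduling onto nodes labeled as GPU nodes.
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "accelerator",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"nvidia-gpu"},
					}},
				}},
			},
		},
	},
}
```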
What changes were proposed in this pull request?
Updates the Ray plugin to cherry-pick tolerations and affinity if they are non-null or non-empty in the customized head/worker pod specs.
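For illustration, the affinity side of that cherry-pick presumably mirrors the tolerations snippet from the diff above (a sketch, not the exact change):

```go
if customPodSpec.Affinity != nil {
	podSpec.Affinity = customPodSpec.Affinity
}
```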
I also changed how assertions are made in the unit tests since they were becoming cumbersome in the previous style.
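A hedged sketch of what that refactor direction could look like (the helper and test names are hypothetical, not the PR's actual tests): extracting the cherry-pick behavior into a small function and asserting it table-driven with testify (imports assumed: `testing`, `k8s.io/api/core/v1`, `github.com/stretchr/testify/assert`).

```go
func mergeTolerations(podSpec, customPodSpec *v1.PodSpec) {
	// Only override when the custom spec actually sets tolerations.
	if len(customPodSpec.Tolerations) > 0 {
		podSpec.Tolerations = customPodSpec.Tolerations
	}
}

func TestMergeTolerations(t *testing.T) {
	baseToleration := v1.Toleration{Key: "base", Operator: v1.TolerationOpExists}
	gpuToleration := v1.Toleration{Key: "nvidia.com/gpu", Operator: v1.TolerationOpExists}

	tests := []struct {
		name     string
		custom   []v1.Toleration
		expected []v1.Toleration
	}{
		{name: "empty override keeps base tolerations", custom: nil, expected: []v1.Toleration{baseToleration}},
		{name: "non-empty override replaces base tolerations", custom: []v1.Toleration{gpuToleration}, expected: []v1.Toleration{gpuToleration}},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			podSpec := &v1.PodSpec{Tolerations: []v1.Toleration{baseToleration}}
			mergeTolerations(podSpec, &v1.PodSpec{Tolerations: tt.custom})
			assert.Equal(t, tt.expected, podSpec.Tolerations)
		})
	}
}
```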
How was this patch tested?
Unit tests
Summary by Bito
This PR enhances the Ray plugin by implementing pod tolerations, node affinity configurations, and resource requirements for Ray head and worker pods. The changes enable custom scheduling requirements, particularly for GPU workloads, and include merging tolerations and affinity settings from custom pod specifications. The implementation includes refactored affinity handling to ensure proper treatment of resource specifications in Ray clusters, along with improved test suite organization and comprehensive test coverage.

Unit tests added: True
Estimated effort to review (1-5, lower is better): 2