Training: Add documentation for the MultiKueue and spec.managedBy API #3956
base: master
Changes from all commits
4b02eec
fe517b1
af5ac9e
2c2cd7d
cbac0ed
713beb7
0df155e
e1c64a5
8f6c944
c5df54a
12ebd60
f7e7154
6fd0f16
@@ -0,0 +1,67 @@
+++
title = "How to manage Jobs in a multi-cluster environment"
description = "Using the managedBy field for MultiKueue"
weight = 10
+++

## Overview

The `spec.runPolicy.managedBy` field is a new feature introduced for MultiKueue support in the Kubeflow Training Operator. It specifies which entity manages a job, which makes multi-cluster job dispatching more robust.

## Prerequisites

1. Ensure that you have the latest version of the Kubeflow Training Operator installed.
2. Make sure that your Kueue installation is built against that operator version so that it can use the `spec.runPolicy.managedBy` field.

Comment on lines +13 to +14

Should we state which versions of the Kubeflow Training Operator and Kueue need to be installed?

Yes, that would make sense.

It will be supported since Kueue 0.11.

## Usage

To use the `spec.runPolicy.managedBy` field in your training jobs, include it in the job specification as shown below:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    ...
```
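
For contrast, the sketch below shows the same skeleton with the field pointing at the operator itself rather than at MultiKueue. It rests on an assumption not stated in this document: that `kubeflow.org/training-operator` is the reserved value under which the Training Operator reconciles the job on its own, and that leaving the field out has the same effect. Check both against the API reference for your operator version. The job name is hypothetical.

```yaml
# Sketch only: the managedBy value below is assumed to be the reserved name
# of the Training Operator's own controller; confirm it for your version.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob-local"  # hypothetical name
spec:
  runPolicy:
    managedBy: "kubeflow.org/training-operator"
  tfReplicaSpecs:
    ...
```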

## Example

Here is a complete example of a TensorFlow job using the `spec.runPolicy.managedBy` field:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
```
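
Since `runPolicy` is shared across the `kubeflow.org/v1` job kinds, the same field should carry over to other job types. The following is a minimal sketch for a PyTorchJob under that assumption. The job name, the queue name `user-queue`, the image, and the `train.py` script are hypothetical placeholders; the `kueue.x-k8s.io/queue-name` label is how Kueue assigns a job to a LocalQueue and is included here on the assumption that the job is submitted through Kueue.

```yaml
# Sketch only: names, image, and script are placeholders, not part of this PR.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "example-pytorchjob"                # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: "user-queue" # hypothetical LocalQueue
spec:
  runPolicy:
    # Delegate management of this job to MultiKueue.
    managedBy: "kueue.x-k8s.io/multikueue"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command: ["python", "train.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command: ["python", "train.py"]
```

The only line that differs from an ordinary PyTorchJob is `managedBy`; the replica specs are unchanged.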

## More Details

For more details on setting up and using MultiKueue with the Kubeflow Training Operator, refer to the following documentation pages:

- [Kueue/Kubeflow](https://kueue.sigs.k8s.io/docs/tasks/run/multikueue/kubeflow/)

Can we explain in the overview that this feature leverages the MultiKueue capability of the Kueue project?

I think we have provided the cross-references under More Details for that purpose, so that interested users can get a brief overview there.

Yeah, but the goal of the overview section is to provide a brief description of what the feature is.
It might be important to say that we leverage Kueue for that capability.
WDYT @Garvit-77 @mimowo @mszadkow ?

It allows this MultiKueue feature to be applied to Kubeflow. Yes, I think it makes sense to mention that.

I think that we can just point to the https://kueue.sigs.k8s.io/docs/concepts/multikueue/ document here to describe what MultiKueue is.