Training: Add documentation for the MultiKueue and spec.managedBy API #3956

Open
wants to merge 13 commits into
base: master
@@ -0,0 +1,67 @@
+++
title = "How to manage Jobs in a multi-cluster environment"
description = "Using the managedBy field for MultiKueue"
weight = 10
+++

## Overview

The `spec.runPolicy.managedBy` field is a new feature introduced for MultiQueue support in the Kubeflow Training Operator. This field allows for more robust management of multi-cluster job dispatching by specifying the managing entity.
Member

Can we explain in the overview that we leverage the MultiKueue capability of the Kueue project for this feature?

Author

I think the Cross-references/More Details section already serves that purpose, so interested users can find a brief explanation there.

Member

Yeah, but the goal of the overview section is to provide a brief description of what the feature is.
It might be important to say that we leverage Kueue for that capability.
WDYT @Garvit-77 @mimowo @mszadkow?

It allows this MultiKueue feature to be applied to Kubeflow, so yes, I think it makes sense to mention that.

Member

I think we can just point to the https://kueue.sigs.k8s.io/docs/concepts/multikueue/ document here to describe what MultiKueue is.

Suggested change
- The `spec.runPolicy.managedBy` field is a new feature introduced for MultiQueue support in the Kubeflow Training Operator. This field allows for more robust management of multi-cluster job dispatching by specifying the managing entity.
+ The `spec.runPolicy.managedBy` field is a new feature introduced for MultiKueue support in the Kubeflow Training Operator. This field allows for more robust management of multi-cluster job dispatching by specifying the managing entity.


## Prerequisites

1. Ensure that you have the latest version of the Kubeflow Training Operator installed.
2. Make sure Kueue is compiled against the new operator to leverage the `spec.runPolicy.managedBy` field.
Comment on lines +13 to +14
Member

Should we state the versions of the Kubeflow Training Operator and Kueue that need to be installed?

Author

Yes, that would make sense.
Trainer: up to v1.9,
but I have no idea about the Kueue version. Can you help me with that, @mimowo?

It will be supported starting with Kueue 0.11.


## Usage

To use the `spec.runPolicy.managedBy` field in your training jobs, include it in the job specification as shown below:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    ...
```

## Example

Here is a complete example of a TensorFlow job using the `spec.runPolicy.managedBy` field:

```YAML
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "example-tfjob"
spec:
  runPolicy:
    managedBy: "kueue.x-k8s.io/multikueue"
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              args: ["python", "model.py"]
```
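
Because `managedBy` sits in the shared `runPolicy`, the same pattern should carry over to other Training Operator v1 job kinds. The following is a minimal sketch rather than part of the original example: it assumes a PyTorchJob, a hypothetical LocalQueue named `user-queue` referenced through Kueue's `kueue.x-k8s.io/queue-name` label, and a placeholder image and entrypoint. Check the field names against the Training Operator and Kueue versions you have installed.

```yaml
# Sketch only: values marked "hypothetical" are illustrative assumptions.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "example-pytorchjob"                   # hypothetical job name
  labels:
    kueue.x-k8s.io/queue-name: "user-queue"    # hypothetical LocalQueue name
spec:
  runPolicy:
    # Hands the job over to MultiKueue instead of the local Training Operator.
    managedBy: "kueue.x-k8s.io/multikueue"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: "example.com/pytorch-train:placeholder"   # hypothetical image
              command: ["python", "train.py"]                  # hypothetical entrypoint
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: pytorch
              image: "example.com/pytorch-train:placeholder"   # hypothetical image
              command: ["python", "train.py"]                  # hypothetical entrypoint
```

Leaving `managedBy` unset is expected to keep the job under the Training Operator's own controller, which is the behavior to fall back on in a single-cluster setup.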

## More Details

For more details on setting up and using MultiQueue with the Kubeflow Training Operator, refer to the following documentation pages:
Suggested change
- For more details on setting up and using MultiQueue with the Kubeflow Training Operator, refer to the following documentation pages:
+ For more details on setting up and using MultiKueue with the Kubeflow Training Operator, refer to the following documentation pages:


- [Kueue/Kubeflow](https://kueue.sigs.k8s.io/docs/tasks/run/multikueue/kubeflow/)
@@ -1,7 +1,7 @@
+++
title = "TensorFlow Training (TFJob)"
description = "Using TFJob to train a model with TensorFlow"
-weight = 10
+weight = 20
+++

{{% alert title="Old Version" color="warning" %}}