Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix default label for Training Operators #1808

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/proposals/metrics-collector.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,7 +123,7 @@ In the namespace with `katib.kubeflow.org/metrics-collector-injection=enabled` l

In **Pod Level Injecting**,

1. Job operators (_e.x. TFjob/PyTorchjob_) tag the `job-role: master` ([#1064](https://github.com/kubeflow/tf-operator/pull/1064)) label on the master pod.
1. Job operators (_e.x. TFjob/PyTorchjob_) tag the `training.kubeflow.org/job-role: master` ([#1064](https://github.com/kubeflow/tf-operator/pull/1064)) label on the master pod.
2. The webhook inject the metric collector only if the webhook recognizes this label.
3. The webhook uses [ObjectSelector](https://github.com/kubernetes/kubernetes/pull/78505) to skip on irrelevant objects in order to optimize the performance.
4. ObjectSelector is only supported above _Kubernetes v1.15_. Without this new feature, there may be a [performance issue](https://github.com/kubeflow/katib/issues/685#issuecomment-516226070) in webhook. In this situation, the following **Job Level Injecting** mode may be a better option.
Expand Down
2 changes: 1 addition & 1 deletion docs/proposals/trial-custom-crd.md
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ For example, for TFJob:
```yaml
. . .
PrimaryPodLabel:
"job-role": "master"
"training.kubeflow.org/job-role": "master"
. . .
```

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ spec:
primaryContainerName: mxnet
# In this example we can collect metrics only from the Worker pods.
primaryPodLabels:
replica-type: worker
training.kubeflow.org/replica-type: worker
trialParameters:
- name: learningRate
description: Learning rate for the training model
Expand Down
3 changes: 2 additions & 1 deletion pkg/apis/controller/experiments/v1beta1/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -38,13 +38,14 @@ const (

var (
// DefaultKubeflowJobPrimaryPodLabels is the default value of spec.trialTemplate.primaryPodLabels for Kubeflow Training Job.
DefaultKubeflowJobPrimaryPodLabels = map[string]string{"job-role": "master"}
DefaultKubeflowJobPrimaryPodLabels = map[string]string{"training.kubeflow.org/job-role": "master"}

// KubeflowJobKinds is the list of Kubeflow Training Job kinds.
KubeflowJobKinds = map[string]bool{
"TFJob": true,
"PyTorchJob": true,
"XGBoostJob": true,
"MXJob": true,
"MPIJob": true,
}
)
4 changes: 2 additions & 2 deletions pkg/ui/v1beta1/frontend/src/reducers/general.js
Original file line number Diff line number Diff line change
Expand Up @@ -50,14 +50,14 @@ const initialState = {
value: 'status.conditions.#(type=="Complete")#|#(status=="True")#',
description: `Condition when Trial custom resource is succeeded.
Default value for k8s BatchJob: status.conditions.#(type=="Complete")#|#(status=="True")#.
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob): status.conditions.#(type=="Succeeded")#|#(status=="True")#.`,
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob, MPIJob): status.conditions.#(type=="Succeeded")#|#(status=="True")#.`,
},
{
name: 'FailureCondition',
value: 'status.conditions.#(type=="Failed")#|#(status=="True")#',
description: `Condition when Trial custom resource is failed.
Default value for k8s BatchJob: status.conditions.#(type=="Failed")#|#(status=="True")#.
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob): status.conditions.#(type=="Failed")#|#(status=="True")#.`,
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob, MPIJob): status.conditions.#(type=="Failed")#|#(status=="True")#.`,
},
{
name: 'Retain',
Expand Down
2 changes: 1 addition & 1 deletion test/e2e/v1beta1/argo_workflow.py
Original file line number Diff line number Diff line change
Expand Up @@ -167,7 +167,7 @@ def create_task_template(self, task_name, exec_image, command):
},
{
"name": "EXTRA_REPOS",
"value": "kubeflow/testing@HEAD;kubeflow/manifests@v1.4-branch"
"value": "kubeflow/testing@HEAD;kubeflow/manifests@v1.5-branch"
},
# Set GOPATH to test_dir because Katib repo is located under /src/github.com/kubeflow/katib
{
Expand Down
2 changes: 1 addition & 1 deletion test/e2e/v1beta1/scripts/setup-katib.sh
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,8 @@ cat "manifests/v1beta1/components/controller/katib-config.yaml"
echo "Creating Kubeflow namespace"
kubectl create namespace kubeflow

echo "Deploying training-operator from kubeflow/manifests v1.4 branch"
cd "${MANIFESTS_DIR}/apps/training-operator/upstream/overlays/kubeflow"
echo "Deploying Training Operator from kubeflow/manifests $(git rev-parse --abbrev-ref HEAD)"
kustomize build . | kubectl apply -f -

echo "Deploying Katib"
Expand Down