Support leader election for katib-controller #1713

7 changes: 7 additions & 0 deletions cmd/katib-controller/v1beta1/main.go
@@ -48,6 +48,8 @@ func main() {
	var injectSecurityContext bool
	var enableGRPCProbeInSuggestion bool
	var trialResources trialutil.GvkListFlag
+	var enableLeaderElection bool
+	var leaderElectionID string

	flag.StringVar(&experimentSuggestionName, "experiment-suggestion-name",
		"default", "The implementation of suggestion interface in experiment controller (default)")
@@ -56,6 +58,9 @@ func main() {
	flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
	flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
	flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
+	// For leader election
+	flag.BoolVar(&enableLeaderElection, "enable-leader-election", true, "Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller.")
+	flag.StringVar(&leaderElectionID, "leader-election-id", "3fbc96e9.katib.kubeflow.org", "The ID for leader election.")

	// TODO (andreyvelich): Currently it is not possible to set different webhook service name.
	// flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
@@ -95,6 +100,8 @@ func main() {
	// Create a new katib controller to provide shared dependencies and start components
	mgr, err := manager.New(cfg, manager.Options{
		MetricsBindAddress: metricsAddr,
+		LeaderElection:     enableLeaderElection,
+		LeaderElectionID:   leaderElectionID,
	})
	if err != nil {
		log.Error(err, "Failed to create the manager")
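The following is a minimal, standalone sketch of how the two new flags resolve to values, using only Go's standard `flag` package. The flag names and defaults are taken from the diff above; the `leaderElectionOptions` struct and the `parseLeaderElectionFlags` helper are hypothetical names introduced for illustration and are not part of the Katib code base. The struct only mirrors the two `manager.Options` fields set in `main.go`; no controller-runtime code is involved.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// leaderElectionOptions is a hypothetical stand-in for the two
// manager.Options fields that the diff populates.
type leaderElectionOptions struct {
	LeaderElection   bool
	LeaderElectionID string
}

// parseLeaderElectionFlags (hypothetical helper) parses args with the same
// flag names and default values that the diff registers in main.go.
func parseLeaderElectionFlags(args []string) (leaderElectionOptions, error) {
	var opts leaderElectionOptions
	fs := flag.NewFlagSet("katib-controller", flag.ContinueOnError)
	fs.BoolVar(&opts.LeaderElection, "enable-leader-election", true,
		"Enable leader election for katib-controller.")
	fs.StringVar(&opts.LeaderElectionID, "leader-election-id",
		"3fbc96e9.katib.kubeflow.org", "The ID for leader election.")
	if err := fs.Parse(args); err != nil {
		return leaderElectionOptions{}, err
	}
	return opts, nil
}

func main() {
	opts, err := parseLeaderElectionFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("leader election enabled=%t id=%s\n",
		opts.LeaderElection, opts.LeaderElectionID)
}
```

Because `enable-leader-election` is a boolean flag, Go's `flag` package expects the value to be attached with `=` when turning it off, for example `--enable-leader-election=false`; a separate `false` argument would be treated as a positional argument instead.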
18 changes: 10 additions & 8 deletions docs/developer-guide.md
@@ -56,14 +56,16 @@ make generate

Below is a list of command-line flags accepted by Katib controller:

-| Name | Type | Default | Description |
-| ------------------------------- | ------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------- |
-| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
-| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
-| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
-| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
-| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
-| webhook-port | int | 8443 | The port number to be used for admission webhook server |
+| Name | Type | Default | Description |
+| ------------------------------- | ------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
+| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
+| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
+| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
+| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
+| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
+| webhook-port | int | 8443 | The port number to be used for admission webhook server |
+| enable-leader-election | bool | true | Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller. |
+| leader-election-id | string | "3fbc96e9.katib.kubeflow.org" | The ID for leader election. |

## Workflow design

@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  # non HA katib-cert-manager
  - ../../katib-cert-manager
  # rbac for HA
  - ../overlays
replicas:
  - name: katib-controller
    count: 2
11 changes: 11 additions & 0 deletions manifests/v1beta1/installs/ha/katib-external-db/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
# non HA katib-external-db
- ../../katib-external-db
# rbac for HA
- ../overlays
replicas:
- name: katib-controller
count: 2
11 changes: 11 additions & 0 deletions manifests/v1beta1/installs/ha/katib-openshift/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
# non HA katib-openshift
- ../../katib-openshift
# rbac for HA
- ../overlays
replicas:
- name: katib-controller
count: 2
11 changes: 11 additions & 0 deletions manifests/v1beta1/installs/ha/katib-standalone/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
# non HA katib-standalone
- ../../katib-standalone
# rbac for HA
- ../overlays
replicas:
- name: katib-controller
count: 2
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  # non HA katib-with-kubeflow
  - ../../katib-with-kubeflow
  # rbac for HA
  - ../overlays
replicas:
  - name: katib-controller
    count: 2
5 changes: 5 additions & 0 deletions manifests/v1beta1/installs/ha/overlays/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ./rbac-ha.yaml
32 changes: 32 additions & 0 deletions manifests/v1beta1/installs/ha/overlays/rbac-ha.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: leader-election-role
namespace: kubeflow
rules:
- apiGroups:
- coordination.k8s.io
resources:
- leases
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: leader-election-rolebinding
namespace: kubeflow
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: leader-election-role
subjects:
- kind: ServiceAccount
name: katib-controller
namespace: kubeflow
tenzen-y marked this conversation as resolved.
Show resolved Hide resolved
2 changes: 1 addition & 1 deletion scripts/v1beta1/deploy.sh
@@ -19,4 +19,4 @@ set -o xtrace
SCRIPT_ROOT=$(dirname ${BASH_SOURCE})/../..

cd ${SCRIPT_ROOT}
-kustomize build manifests/v1beta1/installs/katib-standalone | kubectl apply -f -
+kustomize build manifests/v1beta1/installs/ha/katib-standalone | kubectl apply -f -
2 changes: 1 addition & 1 deletion scripts/v1beta1/undeploy.sh
@@ -23,4 +23,4 @@ sleep 10
SCRIPT_ROOT=$(dirname ${BASH_SOURCE})/../..

cd ${SCRIPT_ROOT}
-kustomize build manifests/v1beta1/installs/katib-standalone | kubectl delete -f -
+kustomize build manifests/v1beta1/installs/ha/katib-standalone | kubectl delete -f -
17 changes: 3 additions & 14 deletions test/e2e/v1beta1/scripts/setup-katib.sh
@@ -93,20 +93,9 @@ cd "${GOPATH}/src/github.com/kubeflow/katib"
make deploy

# Wait until all Katib pods are running.
-TIMEOUT=120
-PODNUM=$(kubectl get deploy -n kubeflow | grep -v NAME | wc -l)
-# 1 Pod for the cert-generator Job
-PODNUM=$((PODNUM + 1))
-until kubectl get pods -n kubeflow | grep -E 'Running|Completed' | [[ $(wc -l) -eq $PODNUM ]]; do
-  echo Pod Status $(kubectl get pods -n kubeflow | grep "1/1" | wc -l)/$PODNUM
-  sleep 10
-  TIMEOUT=$((TIMEOUT - 1))
-  if [[ $TIMEOUT -eq 0 ]]; then
-    echo "NG"
-    kubectl get pods -n kubeflow
-    exit 1
-  fi
-done
+TIMEOUT=120s
+kubectl wait --for=condition=complete --timeout=${TIMEOUT} -l katib.kubeflow.org/component=cert-generator -n kubeflow job
+kubectl wait --for=condition=ready --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in (controller,db-manager,mysql,ui)" -n kubeflow pod

echo "All Katib components are running."
echo "Katib deployments"