Add support for XGBoost Operator with LightGBM example #1603

Merged
2 changes: 2 additions & 0 deletions README.md
@@ -125,6 +125,8 @@ Katib has these CRD examples in upstream:

- [Kubeflow `MPIJob`](https://www.kubeflow.org/docs/components/training/mpi/)

- [Kubeflow `XGBoostJob`](https://github.com/kubeflow/xgboost-operator)

- [Tekton `Pipeline`](https://github.com/tektoncd/pipeline)

Thus, Katib supports multiple frameworks with the help of different job kinds.
6 changes: 6 additions & 0 deletions examples/v1beta1/README.md
@@ -391,3 +391,9 @@ docker.io/inaccel/jupyter:lab
```
docker.io/kubeflow/mpi-horovod-mnist
```

- XGBoost Operator LightGBM distributed example, [source](https://github.com/kubeflow/xgboost-operator/tree/master/config/samples/lightgbm-dist).

```
docker.io/kubeflowkatib/xgboost-lightgbm
```
123 changes: 123 additions & 0 deletions examples/v1beta1/xgboost-lightgbm.yaml
@@ -0,0 +1,123 @@
```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: xgboost-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: valid_1 auc
    additionalMetricNames:
      - valid_1 binary_logloss
      - training auc
      - training binary_logloss
  metricsCollectorSpec:
    source:
      filter:
        metricsFormat:
          - "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 6
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: num-leaves
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "60"
        step: "1"
  trialTemplate:
    primaryPodLabels:
      job-role: master
    primaryContainerName: xgboostjob
    successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLeaves
        description: Number of leaves for one tree
        reference: num-leaves
    trialSpec:
      # TODO (andreyvelich): Change to kubeflow.org/v1 once all-in-one operator is finished.
      apiVersion: xgboostjob.kubeflow.org/v1
      kind: XGBoostJob
      spec:
        xgbReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: xgboostjob
                    image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
                    ports:
                      - containerPort: 9991
                        name: xgboostjob-port
                    imagePullPolicy: Always
                    args:
                      - --job_type=Train
                      - --metric=binary_logloss,auc
                      - --learning_rate=${trialParameters.learningRate}
                      - --num_leaves=${trialParameters.numberLeaves}
                      - --num_trees=100
                      - --boosting_type=gbdt
                      - --objective=binary
                      - --metric_freq=1
                      - --is_training_metric=true
                      - --max_bin=255
                      - --data=data/binary.train
                      - --valid_data=data/binary.test
                      - --tree_learner=feature
                      - --feature_fraction=0.8
                      - --bagging_freq=5
                      - --bagging_fraction=0.8
                      - --min_data_in_leaf=50
                      - --min_sum_hessian_in_leaf=50
                      - --is_enable_sparse=true
                      - --use_two_round_loading=false
                      - --is_save_binary_file=false
          Worker:
            replicas: 2
            restartPolicy: ExitCode
            template:
              spec:
                containers:
                  - name: xgboostjob
                    image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
                    ports:
                      - containerPort: 9991
                        name: xgboostjob-port
                    imagePullPolicy: Always
                    args:
                      - --job_type=Train
                      - --metric=binary_logloss,auc
                      - --learning_rate=${trialParameters.learningRate}
                      - --num_leaves=${trialParameters.numberLeaves}
                      - --num_trees=100
                      - --boosting_type=gbdt
                      - --objective=binary
                      - --metric_freq=1
                      - --is_training_metric=true
                      - --max_bin=255
                      - --data=data/binary.train
                      - --valid_data=data/binary.test
                      - --tree_learner=feature
                      - --feature_fraction=0.8
                      - --bagging_freq=5
                      - --bagging_fraction=0.8
                      - --min_data_in_leaf=50
                      - --min_sum_hessian_in_leaf=50
                      - --is_enable_sparse=true
                      - --use_two_round_loading=false
                      - --is_save_binary_file=false
```
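The `metricsFormat` filter in this Experiment tells the metrics collector how to pull metrics out of the training output. After YAML unescaping, the pattern is `(\w+\s\w+)\s:\s((-?\d+)(\.\d+)?)`: the first capture group is read as the metric name (for example `valid_1 auc`, the `objectiveMetricName`) and the second as its value. The Python sketch below only illustrates that matching; it is not Katib's collector, and the sample log lines are made up, assuming LightGBM prints metrics in a `<dataset> <metric> : <value>` shape.

```python
import re

# metricsFormat after YAML unescaping ("\\w" in the manifest becomes "\w").
# Group 1 is the metric name, group 2 is the numeric value.
METRICS_FORMAT = re.compile(r"(\w+\s\w+)\s:\s((-?\d+)(\.\d+)?)")


def parse_metrics(log_lines):
    """Collect (name, value) pairs from log lines, in order of appearance."""
    metrics = []
    for line in log_lines:
        for match in METRICS_FORMAT.finditer(line):
            metrics.append((match.group(1), float(match.group(2))))
    return metrics


# Hypothetical log lines for illustration; real LightGBM output may differ.
sample_log = [
    "[LightGBM] [Info] Iteration:1, training auc : 0.75",
    "[LightGBM] [Info] Iteration:1, valid_1 auc : 0.72",
    "[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.6523",
    "[LightGBM] [Info] Finished loading data in 0.12 seconds",
]

for name, value in parse_metrics(sample_log):
    print(f"{name} = {value}")
```

Lines without a `name : value` pair, such as the data-loading message, produce no match. Note that the value group `(-?\d+)(\.\d+)?` has no exponent part, so a value printed as `1e-05` would be captured as just `1`.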
2 changes: 2 additions & 0 deletions manifests/v1beta1/components/controller/controller.yaml
@@ -30,6 +30,8 @@ spec:
```yaml
  - "--trial-resources=TFJob.v1.kubeflow.org"
  - "--trial-resources=PyTorchJob.v1.kubeflow.org"
  - "--trial-resources=MPIJob.v1.kubeflow.org"
  # TODO (andreyvelich): Change to v1.kubeflow.org once all-in-one operator is finished.
  - "--trial-resources=XGBoostJob.v1.xgboostjob.kubeflow.org"
  - "--trial-resources=PipelineRun.v1beta1.tekton.dev"
ports:
  - containerPort: 8443
```
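Each `--trial-resources` value reads as `Kind.version.group`, so `XGBoostJob.v1.xgboostjob.kubeflow.org` names kind `XGBoostJob`, version `v1`, and API group `xgboostjob.kubeflow.org`, the same group that `rbac.yaml` grants access to. The helper below is a hypothetical Python sketch of that decomposition, not Katib's Go implementation; `GroupVersionKind` and `parse_trial_resource` are illustrative names.

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    """Hypothetical container for the three parts of a trial resource."""
    group: str
    version: str
    kind: str


def parse_trial_resource(flag_value: str) -> GroupVersionKind:
    """Split 'Kind.version.group' at the first two dots; the group keeps the rest."""
    parts = flag_value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {flag_value!r}")
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)


for value in [
    "MPIJob.v1.kubeflow.org",
    "XGBoostJob.v1.xgboostjob.kubeflow.org",
    "PipelineRun.v1beta1.tekton.dev",
]:
    print(parse_trial_resource(value))
```

Splitting at only the first two dots keeps dotted API groups such as `xgboostjob.kubeflow.org` intact; Kubernetes kind and version names contain no dots, so the split is unambiguous.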
7 changes: 7 additions & 0 deletions manifests/v1beta1/components/controller/rbac.yaml
@@ -55,6 +55,13 @@ rules:
```yaml
    - mpijobs
  verbs:
    - "*"
# TODO (andreyvelich): Move to "apiGroup: kubeflow.org" once all-in-one operator is finished.
- apiGroups:
    - xgboostjob.kubeflow.org
  resources:
    - xgboostjobs
  verbs:
    - "*"
- apiGroups:
    - tekton.dev
  resources:
```