Add a new field "TensorflowImage" to KFP viewer CRD file template. #2544

jingzhang36 · 2019-11-05T08:11:27Z

We add a new field "TensorflowImage" to our viewer CRD file template.

Changes include:
(1) The frontend specifies a tensorflow image version and writes it to the viewer CRD file. The image version is hard-coded in the FE for now, but later the FE will take it from an input textbox.
(2) The backend controller parses the tensorflow image from the viewer CRD file and uses it to start a new TB instance. Meanwhile, any viewer CRD file that is missing the tensorflow image (i.e., written in the old way) will be removed by the backend controller, since our TensorBoard (TB) instance is stateless. After the old viewer CRD file is removed, the frontend will create a new viewer CRD file that contains the tensorflow image.

Note:
(1) Whether to bump the viewer CRD's version: I did some research and feel that it is overkill for this particular change. If we follow a standard version upgrade for CRDs, a separate webhook server needs to be brought up to migrate CRD files of the old version (https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/#webhook-conversion). This migration is not really necessary for us, since we can simply remove the old CRD files that were written in the old format.
(2) Whether to let the backend controller fall back to a default tensorflow image when it encounters an old viewer CRD file (written in the old format and hence missing the tensorflow image): I feel either way is OK. If we let the backend controller use a default tensorflow image, then we don't need to remove the old viewer CRD files. The alternative is to let the backend controller remove those old viewer CRD files, in which case no default tensorflow image is needed. I used the latter approach in the initial version of this PR for review. The good thing about that approach is that we get rid of the old viewer CRDs completely and have everything up to date. The downside is that users must upgrade their frontend server release when they upgrade their backend controller release (i.e., they cannot use a FE release that writes the old viewer CRD format together with a backend release that reads the new viewer CRD format).

Related bug: #2514

jingzhang36 · 2019-11-05T08:15:13Z

/assign @IronPan
/assign @Bobgy please check out the notes in the PR description, which address your previous comments.
/cc @rmgogogo

Bobgy · 2019-11-05T09:15:12Z

@jingzhang36 Sounds good to me. I think transient failures for a stateless service are acceptable when upgrading.

numerology · 2019-11-05T17:22:26Z

Shall we provide any sanity check on the user-provided image, or just let the backend throw failures when bad things happen?

jingzhang36 · 2019-11-06T00:28:48Z

> Shall we provide any sanity check on the user-provided image, or just let the backend throw failures when bad things happen?

@numerology Yes. Our plan is that when we add a UI input box for specifying the version, we'll provide a drop-down list of versions for the user to choose from.

jingzhang36 · 2019-11-06T00:52:19Z

/retest

Bobgy · 2019-11-06T03:09:25Z

> > Shall we provide any sanity check on the user-provided image, or just let the backend throw failures when bad things happen?

> @numerology Yes. Our plan is that when we add a UI input box for specifying the version, we'll provide a drop-down list of versions for the user to choose from.

I'm not sure whether there could be security implications here: if the backend uses this image without any sanity checking, then a user can run any image there, possibly even for malicious reasons.

IronPan · 2019-11-06T17:40:55Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

@@ -83,6 +83,16 @@ func (r *Reconciler) Reconcile(req reconcile.Request) (reconcile.Result, error)
 	}
 	glog.Infof("Got instance: %+v", view)

+	if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {


The viewer CRD is not TensorFlow-specific. It would be good to move the TensorBoard-specific code to a single place or abstract it out:

    switch view.Spec.Type {
    case viewerV1beta1.ViewerTypeTensorboard:
        tensorboard_handler()
    default:
        glog.Infof("Unsupported spec type: %q", view.Spec.Type)
        // Return nil to indicate nothing more to do here.
        return reconcile.Result{}, nil
    }
    ...
    tensorboard_handler() {
        if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {...}
    }
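A minimal Go sketch of the dispatch suggested above, so that all TensorBoard-specific checks live in one handler. The types (`Viewer`, `ViewerSpec`, `TensorboardSpec`) are simplified, hypothetical stand-ins for the real viewer CRD types, the standard `log` package replaces `glog`, and the hypothetical `reconcileTensorboard` just returns an error for a missing image purely for illustration (the PR itself first deleted such viewers and later fell back to a default image).

```go
package main

import (
	"fmt"
	"log"
)

// ViewerType mirrors the viewer CRD's spec type field (hypothetical, simplified).
type ViewerType string

const ViewerTypeTensorboard ViewerType = "tensorboard"

// TensorboardSpec holds TensorBoard-specific settings (hypothetical, simplified).
type TensorboardSpec struct {
	LogDir          string
	TensorflowImage string
}

// ViewerSpec is a trimmed-down stand-in for the real viewer spec.
type ViewerSpec struct {
	Type            ViewerType
	TensorboardSpec TensorboardSpec
}

// Viewer is a trimmed-down stand-in for the viewer custom resource.
type Viewer struct {
	Name string
	Spec ViewerSpec
}

// reconcile dispatches on the viewer type; only the TensorBoard branch
// knows anything about TensorFlow images.
func reconcile(v *Viewer) error {
	switch v.Spec.Type {
	case ViewerTypeTensorboard:
		return reconcileTensorboard(v)
	default:
		log.Printf("Unsupported spec type: %q", v.Spec.Type)
		// Nothing more to do for unsupported types.
		return nil
	}
}

// reconcileTensorboard holds all TensorBoard-specific checks.
func reconcileTensorboard(v *Viewer) error {
	if len(v.Spec.TensorboardSpec.TensorflowImage) == 0 {
		return fmt.Errorf("viewer %q: tensorflow image not set", v.Name)
	}
	// ... create or update the TensorBoard deployment here ...
	return nil
}

func main() {
	v := &Viewer{Name: "tb-1", Spec: ViewerSpec{
		Type:            ViewerTypeTensorboard,
		TensorboardSpec: TensorboardSpec{TensorflowImage: "tensorflow/tensorflow:2.0.0"},
	}}
	fmt.Println(reconcile(v)) // <nil>
}
```

With this shape, adding another viewer type would mean adding one `case` and one handler without touching the TensorBoard path.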

So far we don't support any other type:

pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go

Line 87 in d5e27e2

if view.Spec.Type != viewerV1beta1.ViewerTypeTensorboard {

For now, I plan to put the TensorBoard-specific code right after that line (quoted above), which should be sufficient for the time being. Later, if we add other types, we will probably separate most of the current logic (I feel most of the current code is TensorBoard-specific) from the new type support.

IronPan · 2019-11-06T17:52:42Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

+	if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {
+		if err := r.Client.Delete(context.Background(), view); err != nil {
+			glog.Infof("Error in deleting viewer CRD: %+v", err)
+			return reconcile.Result{}, err
+		} else {
+			glog.Infof("Deleted viewer CRD that is missing tensorflow image.")
+			return reconcile.Result{}, nil
+		}
+	}
+


Why add this logic instead of using the CRD's built-in OpenAPI validation?
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#specifying-a-structural-schema

Our current implementation doesn't seem to support a structural schema yet. After a quick browse of the doc, it looks like we'd first need to add and apply a schema YAML; after that, all viewer CRD YAMLs created later can be validated against the specifications in that schema. I'll do another PR to upgrade our current viewer CRD implementation to a structural schema and let this one focus on the TensorBoard version issue.

BTW, we have an intern here. She will work on adding the version selector and, hopefully, the structural schema implementation as well.

Then can we leave the validation logic out of this PR and add a TODO?

Sure. I'll leave validation until we have a structural CRD schema. For now, I'll just use a default version if no version is specified.
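A minimal sketch of the default-image fallback described above, assuming a simplified `TensorboardSpec` struct. The helper `tensorflowImageOrDefault` is hypothetical, and the value of `defaultTensorflowImage` is a placeholder: the thread does not say which default image the controller actually uses.

```go
package main

import "fmt"

// defaultTensorflowImage is a hypothetical placeholder; the actual default
// used by the controller is not stated in this thread.
const defaultTensorflowImage = "tensorflow/tensorflow:1.13.2"

// TensorboardSpec is a simplified, hypothetical stand-in for the real spec.
type TensorboardSpec struct {
	LogDir          string
	TensorflowImage string
}

// tensorflowImageOrDefault returns the image from the spec, falling back to
// the default for viewer CRD files written before the field existed.
func tensorflowImageOrDefault(spec TensorboardSpec) string {
	if spec.TensorflowImage == "" {
		return defaultTensorflowImage
	}
	return spec.TensorflowImage
}

func main() {
	oldStyle := TensorboardSpec{LogDir: "gs://bucket/logs"}
	newStyle := TensorboardSpec{LogDir: "gs://bucket/logs", TensorflowImage: "tensorflow/tensorflow:2.0.0"}
	fmt.Println(tensorflowImageOrDefault(oldStyle)) // tensorflow/tensorflow:1.13.2
	fmt.Println(tensorflowImageOrDefault(newStyle)) // tensorflow/tensorflow:2.0.0
}
```

Resolving the default at read time keeps viewer CRD files written by an older frontend working, which removes the need to upgrade the frontend and backend together (the downside the PR description notes for the delete-based approach).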

jingzhang36 · 2019-11-11T08:55:52Z

> > > Shall we provide any sanity check on the user-provided image, or just let the backend throw failures when bad things happen?

> > @numerology Yes. Our plan is that when we add a UI input box for specifying the version, we'll provide a drop-down list of versions for the user to choose from.

> I'm not sure whether there could be security implications here: if the backend uses this image without any sanity checking, then a user can run any image there, possibly even for malicious reasons.

Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok.

Bobgy · 2019-11-13T03:30:06Z

> Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok.

Any image. The frontend is never a place to put security checks: users can see all network requests, and they can also send forged requests to the backend server themselves.
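Since a UI drop-down can be bypassed with a forged request, any image restriction would have to be enforced on the backend. A minimal sketch of such a server-side allowlist check; `allowedTensorflowImages`, its entries, and `validateTensorflowImage` are all hypothetical, and the PR itself does not add this check.

```go
package main

import "fmt"

// allowedTensorflowImages is a hypothetical server-side allowlist. Because it
// is checked on the backend, forged requests that skip the UI are rejected too.
var allowedTensorflowImages = map[string]bool{
	"tensorflow/tensorflow:1.13.2": true,
	"tensorflow/tensorflow:1.14.0": true,
	"tensorflow/tensorflow:2.0.0":  true,
}

// validateTensorflowImage rejects any image that is not on the allowlist.
func validateTensorflowImage(image string) error {
	if !allowedTensorflowImages[image] {
		return fmt.Errorf("tensorflow image %q is not in the allowed list", image)
	}
	return nil
}

func main() {
	fmt.Println(validateTensorflowImage("tensorflow/tensorflow:2.0.0")) // <nil>
	fmt.Println(validateTensorflowImage("example.com/untrusted:latest"))
}
```

An allowlist narrows which images can run, but as noted below it does not replace running the TensorBoard pod under a suitably restricted service account.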

IronPan

/lgtm
/approve

k8s-ci-robot · 2019-11-18T18:04:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IronPan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~backend/OWNERS~~ [IronPan]
~~frontend/OWNERS~~ [IronPan]
~~manifests/kustomize/OWNERS~~ [IronPan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

IronPan · 2019-11-18T18:11:06Z

> > Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok.

> Any image. The frontend is never a place to put security checks: users can see all network requests, and they can also send forged requests to the backend server themselves.

It might be OK. This is similar to launching an arbitrary container from a pipeline.
The pod is launched with the default KSA in the namespace. One thing we could do is launch it using the same KSA that launches the pipeline. This ensures the TB pod doesn't accidentally have any unwanted permissions.
Regarding the security concern, a namespace isolated multi-user support should address it.

jingzhang36 added 6 commits November 1, 2019 18:09

Without version bump

289f104

fix the delete caller

7108d2e

return after delete

3dba158

reconciler removes old viewer crd file that misses image specification

992d09b

add frontend comment

cac161f

remove accidental changes that are irrelevant

853fe0f

k8s-ci-robot added the size/S label Nov 5, 2019

k8s-ci-robot requested review from Bobgy and IronPan November 5, 2019 08:11

Revise log message

0a76ee3

k8s-ci-robot assigned IronPan Nov 5, 2019

jingzhang36 added 2 commits November 5, 2019 16:39

Merge remote-tracking branch 'origin/master' into tf-2

d8f6bdc

Add error handling

128320b

Merge remote-tracking branch 'origin/master' into tf-2

16717fc

add test

d5a5e42

k8s-ci-robot added size/M and removed size/S labels Nov 6, 2019

IronPan reviewed Nov 6, 2019

jingzhang36 added 2 commits November 11, 2019 15:54

tensorflow image check only applies to viewer tensorboard type and thus put it after the type check.

d17ad8a

Merge remote-tracking branch 'origin/master' into tf-2

c72a511

jingzhang36 added 2 commits November 14, 2019 09:40

Use of default image instead of validation

475a31f

Merge remote-tracking branch 'origin/master' into tf-2

0807ef9

IronPan reviewed Nov 18, 2019

k8s-ci-robot added the lgtm label Nov 18, 2019

k8s-ci-robot added the approved label Nov 18, 2019

k8s-ci-robot merged commit f308abe into kubeflow:master Nov 18, 2019

dldaisy mentioned this pull request Dec 3, 2019

Support choosing tensorboard version from UI #2690

Merged
