Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new field "TensorflowImage" to KFP viewer CRD file template. #2544

Merged
merged 15 commits into from
Nov 18, 2019

Conversation

jingzhang36
Copy link
Contributor

@jingzhang36 jingzhang36 commented Nov 5, 2019

We add a new field "TensorflowImage" to our viewer CRD file template.

Changes include:
(1) Frontend specifies a tensorflow image version and writes it to the viewer CRD file. The image version is hard-coded in FE now but later, FE will take this from input textbox.
(2) Backend controller parses the tensorflow image from viewer CRD file and uses it to start a new TB instance. Meanwhile, any viewer CRD file that is missing tensorflow image (i.e., written in the old way) will be removed by backend controller, since our TB tensorbord is stateless. After the old viewer CRD file is removed, Frontend will create a new viewer CRD file that contains tensorflow image.

Note:
(1) Whether to increase viewer CRD's version: I did some research work and feel that it is a little bit overkilling for this particular change. If we follow a standard version upgrade for CRD, a separate webhook server needs to be brought up to migrate old version CRD files (https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/#webhook-conversion). This migration is not very necessary for us, since we can simply remove our old crd files that are written in the old way.
(2) Whether to let backend controller to have a default tensorflow image in case of working with old viewer crd file (written in the old way and hence missing tensorflow image): I feel either way is ok. If we let backend controller use a default tensorflow image, then we don't need to remove the old viewer CRD file. Alternative is that we let backend controller to remove those old viewer CRD files and thus a default tensorflow image is not needed. I used the latter approach in the first post for review. The good thing about that approach we'll get rid of the old viewer CRDs completely and have everything up-to-date. The downside is that users must upgrade their frontend server release when they upgrade their backend controller release. (I.e., they can use a FE release that writes old viewer CRD format together with a backend release that reads new viewer CRD format).

Related bug: #2514


This change is Reviewable

@jingzhang36
Copy link
Contributor Author

jingzhang36 commented Nov 5, 2019

/assign @IronPan
/assign @Bobgy please check out the notes in the PR description, which addresses your previous comments.
/cc @rmgogogo

@Bobgy
Copy link
Contributor

Bobgy commented Nov 5, 2019

@jingzhang36 Sounds good to me. I think transient failures for a stateless service is acceptable when upgrading.

@numerology
Copy link

Shall we provide any sanity check to the user provided image? Or just let the backend throw failures when bad things happen.

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Nov 6, 2019
@jingzhang36
Copy link
Contributor Author

Shall we provide any sanity check to the user provided image? Or just let the backend throw failures when bad things happen.

@numerology yes. Our plan is when we add UI input box for specifying the version, we'll provide a drop-down list of versions for user to choose from.

@jingzhang36
Copy link
Contributor Author

/retest

@Bobgy
Copy link
Contributor

Bobgy commented Nov 6, 2019

Shall we provide any sanity check to the user provided image? Or just let the backend throw failures when bad things happen.

@numerology yes. Our plan is when we add UI input box for specifying the version, we'll provide a drop-down list of versions for user to choose from.

I'm not sure if there could be security implications here. If backend uses this image without any sanity checking. Then user can run any image there, maybe even for bad reasons.

@@ -83,6 +83,16 @@ func (r *Reconciler) Reconcile(req reconcile.Request) (reconcile.Result, error)
}
glog.Infof("Got instance: %+v", view)

if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the viewer crd is not tensorflow specific. it's good to move tensorboard specific code to a single place or abstract that out.

switch view.Spec.Type :
 viewerV1beta1.ViewerTypeTensorboard :
     tensorboard_handler()
 any_other_type: 
      {
		glog.Infof("Unsupported spec type: %q", view.Spec.Type)
		// Return nil to indicate nothing more to do here.
		return reconcile.Result{}, nil
	}

...
tensorboard_handler(){
  if  len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {...}
}

Copy link
Contributor Author

@jingzhang36 jingzhang36 Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far we don't support other type

if view.Spec.Type != viewerV1beta1.ViewerTypeTensorboard {

For this time, I plan to put the specific code after the line (pasted above) and say it's sufficient for now. Later, if we have other types, we probably will separate most of the current logic (I feel most of current code is tensorboard specific...) from the added type support.

Comment on lines 86 to 95
if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {
if err := r.Client.Delete(context.Background(), view); err != nil {
glog.Infof("Error in deleting viewer CRD: %+v", err)
return reconcile.Result{}, err
} else {
glog.Infof("Deleted viewer CRD that is missing tensorflow image.")
return reconcile.Result{}, nil
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@jingzhang36 jingzhang36 Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our current implementation doesn't seem to support structural schema yet. After a quick browse of the doc, we'll need to first add/apply a schema yaml. And then all the viewer CRD yamls created later can be validated according to the specifications in the schema yaml. I'll do another PR to upgrade our current viewer crd implementation to structural schema and let this one focus on tensorboard version issue.

BTW, we have an intern here. She will work on adding the version selector and hopefully structural schema implementation as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not check in the validation logic here then and add a todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. Leave validation to when we have structural crd schema. Temporarily, I'll just use a default version if no version is specified.

@jingzhang36
Copy link
Contributor Author

Shall we provide any sanity check to the user provided image? Or just let the backend throw failures when bad things happen.

@numerology yes. Our plan is when we add UI input box for specifying the version, we'll provide a drop-down list of versions for user to choose from.

I'm not sure if there could be security implications here. If backend uses this image without any sanity checking. Then user can run any image there, maybe even for bad reasons.

Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok.

@Bobgy
Copy link
Contributor

Bobgy commented Nov 13, 2019

Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok.

Any image. Because frontend is never a place you can put security checks. User can see all network requests and they can send any forged requests to backend server by themselves too.

Copy link
Member

@IronPan IronPan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IronPan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@IronPan
Copy link
Member

IronPan commented Nov 18, 2019

Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok.

Any image. Because frontend is never a place you can put security checks. User can see all network requests and they can send any forged requests to backend server by themselves too.

It might be ok. This is similar to launching arbitrary container from pipeline.
The pod is launched with default KSA under the namespace, one thing we could do is to launch it using the same KSA as launching the pipeline. This ensures the TB pod doesn't by accident have any unwanted permission.
Regarding the security concern, a namespace isolated multi-user support should address it.

@k8s-ci-robot k8s-ci-robot merged commit f308abe into kubeflow:master Nov 18, 2019
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this pull request Oct 22, 2023
* Adding new manifests.

Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com>

* Adding new manifests.

Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com>

* Deleted Stateful set for kserve-manager-controller

Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com>

* Fixing relative link in Release process

Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com>

Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants