-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new field "TensorflowImage" to KFP viewer CRD file template. #2544
Conversation
@jingzhang36 Sounds good to me. I think transient failures for a stateless service is acceptable when upgrading. |
Shall we provide any sanity check to the user provided image? Or just let the backend throw failures when bad things happen. |
@numerology yes. Our plan is when we add UI input box for specifying the version, we'll provide a drop-down list of versions for user to choose from. |
/retest |
I'm not sure if there could be security implications here. If backend uses this image without any sanity checking. Then user can run any image there, maybe even for bad reasons. |
@@ -83,6 +83,16 @@ func (r *Reconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) | |||
} | |||
glog.Infof("Got instance: %+v", view) | |||
|
|||
if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the viewer crd is not tensorflow specific. it's good to move tensorboard specific code to a single place or abstract that out.
switch view.Spec.Type :
viewerV1beta1.ViewerTypeTensorboard :
tensorboard_handler()
any_other_type:
{
glog.Infof("Unsupported spec type: %q", view.Spec.Type)
// Return nil to indicate nothing more to do here.
return reconcile.Result{}, nil
}
...
tensorboard_handler(){
if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 {...}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far we don't support other type
if view.Spec.Type != viewerV1beta1.ViewerTypeTensorboard { |
For this time, I plan to put the specific code after the line (pasted above) and say it's sufficient for now. Later, if we have other types, we probably will separate most of the current logic (I feel most of current code is tensorboard specific...) from the added type support.
if len(view.Spec.TensorboardSpec.TensorflowImage) == 0 { | ||
if err := r.Client.Delete(context.Background(), view); err != nil { | ||
glog.Infof("Error in deleting viewer CRD: %+v", err) | ||
return reconcile.Result{}, err | ||
} else { | ||
glog.Infof("Deleted viewer CRD that is missing tensorflow image.") | ||
return reconcile.Result{}, nil | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding this logic instead of using CRD built-in openapi validation
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#specifying-a-structural-schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our current implementation doesn't seem to support structural schema yet. After a quick browse of the doc, we'll need to first add/apply a schema yaml. And then all the viewer CRD yamls created later can be validated according to the specifications in the schema yaml. I'll do another PR to upgrade our current viewer crd implementation to structural schema and let this one focus on tensorboard version issue.
BTW, we have an intern here. She will work on adding the version selector and hopefully structural schema implementation as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not check in the validation logic here then and add a todo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. Leave validation to when we have structural crd schema. Temporarily, I'll just use a default version if no version is specified.
…us put it after the type check.
Any tensorflow image or any image? If we restrict it to a list of tensorflow images for users to choose from, I feel it is ok. |
Any image. Because frontend is never a place you can put security checks. User can see all network requests and they can send any forged requests to backend server by themselves too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: IronPan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
It might be ok. This is similar to launching arbitrary container from pipeline. |
* Adding new manifests. Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com> * Adding new manifests. Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com> * Deleted Stateful set for kserve-manager-controller Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com> * Fixing relative link in Release process Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com> Signed-off-by: rachitchauhan43 <rachitchauhan43@gmail.com>
We add a new field "TensorflowImage" to our viewer CRD file template.
Changes include:
(1) Frontend specifies a tensorflow image version and writes it to the viewer CRD file. The image version is hard-coded in FE now but later, FE will take this from input textbox.
(2) Backend controller parses the tensorflow image from viewer CRD file and uses it to start a new TB instance. Meanwhile, any viewer CRD file that is missing tensorflow image (i.e., written in the old way) will be removed by backend controller, since our TB tensorbord is stateless. After the old viewer CRD file is removed, Frontend will create a new viewer CRD file that contains tensorflow image.
Note:
(1) Whether to increase viewer CRD's version: I did some research work and feel that it is a little bit overkilling for this particular change. If we follow a standard version upgrade for CRD, a separate webhook server needs to be brought up to migrate old version CRD files (https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/#webhook-conversion). This migration is not very necessary for us, since we can simply remove our old crd files that are written in the old way.
(2) Whether to let backend controller to have a default tensorflow image in case of working with old viewer crd file (written in the old way and hence missing tensorflow image): I feel either way is ok. If we let backend controller use a default tensorflow image, then we don't need to remove the old viewer CRD file. Alternative is that we let backend controller to remove those old viewer CRD files and thus a default tensorflow image is not needed. I used the latter approach in the first post for review. The good thing about that approach we'll get rid of the old viewer CRDs completely and have everything up-to-date. The downside is that users must upgrade their frontend server release when they upgrade their backend controller release. (I.e., they can use a FE release that writes old viewer CRD format together with a backend release that reads new viewer CRD format).
Related bug: #2514
This change is