Save Suggestion state in persistent volume #1250
Comments
Why use pickling to store data as opposed to YAML? If you use PDs, are you going to have to manage disks? Could you persist to an object store or some other easily managed datastore? Would it make sense to treat this data as metadata and store it in the metadata store?
Wrt 1, we will need to have consistent naming with other values. The rest looks good to me.
Thanks for the comment @jlewi.
A Suggestion is just a Kubernetes deployment running a script for an HP or NAS algorithm that produces new Trial parameters (in the case of HP tuning, hyperparameters) from the search space. Currently, when a user submits an Experiment, the controller creates this Suggestion deployment. When the Experiment is finished, the Suggestion deployment can be deleted or kept always running (if the user wants to resume the Experiment later). I think one of the mechanisms to save the Suggestion Python script's state can be pickling the executable class.
I am not sure that we want to manage them, because we should not be specific to GCP. The question is: what should be the default structure for the StorageClass and PVCs?
Do you have any ideas what it could be for Kubernetes?
Can we save serialized objects to the metadata store?
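For concreteness, a minimal sketch of what a default StorageClass could look like; the name and the no-provisioner choice are illustrative assumptions, not a decided design:

```yaml
# Hypothetical default StorageClass for Suggestion state.
# Which provisioner to use is exactly the open question here:
# local volumes cannot be dynamically provisioned upstream.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: katib-suggestion            # hypothetical name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```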
Why is the experiment manager managing internal storage of individual HP algorithms? Why not adopt a microservice architecture? What if different algorithms require different types of internal storage? E.g., suppose one algorithm needs to store a couple of meta-parameters and a YAML file works well, vs. another algorithm which needs to store time series using a time-series database.

Why can't HP tuner algorithm authors configure their own storage backend? E.g., the algorithm author provides a kustomize package to deploy their algorithm, and this is parameterized depending on the storage they accept: PVC, S3/GCS URL, SQL database, etc.

If you don't want to waste resources when the service isn't being used, can't we use autoscaling for that? E.g., deploy the suggestion service server for that algorithm using Knative?

This isn't specific to GCP. PVCs imply volumes, which in many cases map to some form of "disk" as opposed to a network or cloud filesystem.
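Not a Katib manifest, just a rough sketch of the Knative idea: a suggestion service deployed as a Knative Service can scale to zero when idle. The name and image are hypothetical, and port 6789 is assumed to be the suggestion gRPC port.

```yaml
# Rough sketch (not part of Katib) of deploying a suggestion
# service with Knative Serving so it scales to zero when idle.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: suggestion-grid            # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"   # allow scale-to-zero
    spec:
      containers:
      - image: docker.io/example/suggestion-grid  # hypothetical image
        ports:
        - name: h2c            # gRPC needs HTTP/2
          containerPort: 6789  # assumed suggestion gRPC port
```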
This is exactly what we want to do. The remaining question with PVCs is: what should the default technique be if the user doesn't want to make any changes in the configuration, but wants to use the Resume Experiment feature?
Please, can you show some examples from Knative projects where they use this?
@andreyvelich I would suggest talking to the KFServing folks to better understand Knative autoscaling.
@andreyvelich Is katib-config providing the config for the suggestion microservices? Did you consider a microservice architecture? E.g., for each suggestion service, have a set of YAML files describing its configuration (e.g. Deployment, ConfigMap, PVCs if needed). The other parts of Katib, e.g. katib-config, could then just take the URL of the suggestion endpoint.
Yes, it was created to give the user additional control over the Suggestion service/deployment: https://www.kubeflow.org/docs/components/hyperparameter-tuning/katib-config/#suggestion-settings.
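For reference, the suggestion settings from the linked docs look roughly like this; the algorithm name, image, and service account values here are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  suggestion: |-
    {
      "random": {
        "image": "docker.io/kubeflowkatib/suggestion-hyperopt",
        "serviceAccountName": "random-suggestion-sa"
      }
    }
```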
E.g., suppose I have two suggestion services, say NAS and GridSearch. Each of these may have different backend requirements. Let's suppose NAS uses an SQL DB and GridSearch uses an object store. So each of them would have YAML manifests for the K8s resources (Deployment, Service, etc.) that they need. The operator would just customize them as needed.
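A rough sketch of what one such per-algorithm package could look like; the file names and the SQL/object-store split are hypothetical:

```yaml
# suggestion-nas/kustomization.yaml -- hypothetical package for a
# NAS suggestion service that brings its own SQL backend config.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml    # the suggestion gRPC server
- service.yaml       # endpoint Katib would call
- sql-secret.yaml    # backend-specific: SQL DB credentials
# A GridSearch package would instead list e.g. an object-store
# secret, with no change needed on the Katib side.
```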
@jlewi We are not giving the user functionality to define the whole YAML manifest for the Suggestion resource, to make it very easy to submit a Katib Experiment and get results. The controller creates the k8s deployment and service automatically for the Suggestion. For users that would like to modify the default Suggestion installation, we provide a few settings, e.g. the service account name. My thought is to add another setting that represents a different volume technique. As I said, we can start with various StorageClass provisioners.
A few thoughts after the discussion at the Katib meeting:
If the user wants to use this feature, the controller creates a PV and PVC and binds them to the Experiment's Suggestion deployment. Later, we can add functionality where the user can specify a StorageClass name in the Katib config for the Suggestion, because some Kubernetes clusters don't support creating PVs manually. What do you think @gaocegege @johnugeorge @jlewi ?
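A minimal sketch of the pair the controller could create per Experiment; the names, hostPath, and size are placeholders, not a decided layout:

```yaml
# Hypothetical PV/PVC pair created by the controller for the
# Suggestion of an Experiment named "random-example".
apiVersion: v1
kind: PersistentVolume
metadata:
  name: random-example-suggestion-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  hostPath:
    path: /tmp/katib/suggestion/random-example   # placeholder path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: random-example-suggestion
  namespace: kubeflow
spec:
  storageClassName: ""                 # bind statically to the PV above
  volumeName: random-example-suggestion-pv
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```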
LGTM
/assign @andreyvelich
/kind feature
To continue the idea proposed here: #1062.
We would like to attach a persistent volume to every deployed Suggestion, to save the Suggestion state after the corresponding pod is deleted.
To implement this, we should follow these steps (a sketch of the resulting user-facing API follows the list):

1. Extend ResumePolicyType with the new type. My idea is to name it VolumeSource; any other ideas?
2. Add a new StorageClass YAML to the Katib deployed manifests. I am not sure what the default provisioner should be for us, since k8s doesn't support dynamic volume provisioning for local storage (https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/#limitations-of-ga). We could be specific to GKE and use Persistent Disks (https://kubernetes.io/docs/concepts/storage/storage-classes/#gce-pd), or use the 3rd-party local path provisioner (https://github.com/rancher/local-path-provisioner), which requires an additional controller. What do you think?
3. Implement the new logic in the controller.
4. Extend katib-config with the new parameters for the Suggestion PVC.

What do you think @gaocegege @johnugeorge ?
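To make step 1 concrete, the user-facing API could look roughly like this, assuming the VolumeSource name proposed above and the v1beta1 Experiment API; nothing here is merged yet:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example
spec:
  resumePolicy: VolumeSource   # proposed new ResumePolicyType value
  algorithm:
    algorithmName: random
  # objective, parameters, and trialTemplate as usual ...
```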
/cc @sperlingxx @c-bata @jlewi
/priority p0