Create a doc to describe the deployment process. (#1159)

* This is not a proposal but a description of the current state.
kubeflow · Jul 27, 2018 · 645db22 · 645db22
1 parent 1628d96
commit 645db22
Showing 1 changed file with 282 additions and 0 deletions.
diff --git a/docs_dev/kubeflow_deployment.md b/docs_dev/kubeflow_deployment.md
@@ -0,0 +1,282 @@
+# Deploying Kubeflow
+
+Authors:
+
+  * jlewi@google.com
+
+Status:
+
+  * 2018-07-08 Created
+
+## TL;DR
+
+This document describes how we are approaching the problem of deploying Kubeflow.
+
+This document is primarily intended for Kubeflow contributors and, in particular,
+contributors interested in:
+
+- Improving the user experience
+- Creating an optimized experience for a particular platform.
+
+## Goals For Deploying Kubeflow
+
+Kubeflow is a Kubernetes native platform for machine learning.
+When it comes to deploying Kubeflow, the primary goals are:
+
+1. Creating a very simple getting started experience
+1. Allowing Kubeflow to be customized and optimized for different platforms
+
+## Glossary
+
+Platform - We use platform to refer to the entirety of a user's solution. This
+  includes not only Kubeflow and the Kubernetes distribution but also supporting services
+  that may be external to Kubernetes (e.g. S3/GCS).
+
+## Getting Started Experience: one click or one command
+
+With our 0.3 release we'd like to offer users a one command or one click 
+experience.
+
+1. User deploys Kubeflow on their platform using one click or one command
+
+   * Using the command line (with required tools preinstalled) the experience should 
+     be something like
+
+     ```
+     curl -L https://raw.githubusercontent.com/kubeflow/kubeflow/master/scripts/gke/deploy.sh | bash
+     ```
+
+
+   * Alternatively, we'd like to offer a web-app that allows deployments with one click
+
+     * A web-app creates an opportunity to offer a wizard that can help walk users through 
+       the setup process
+
+     * A web-app can also eliminate the need to install any tools client side
+
+
+   * We'd like to offer scripts for various platforms including major clouds as well
+     as solutions for deploying Kubeflow locally on one's machine (e.g. docker, minikube, microk8s) 
+
+   * The script should take care of setting up Kubernetes and supporting services
+     for each platform
+
+1. Scripts/users use standard K8s patterns to determine when Kubeflow is ready
+
+   ```
+   kubectl get application kubeflow
+   ```
+
+   * We use the sig apps [application resource](https://github.com/kubernetes-sigs/application)
+     to represent Kubeflow and provide status information
+
+   * K8s events will be used to surface information about the deployment process.
+
+1. Users can easily navigate to JupyterLab after deploying Kubeflow
+
+   * JupyterLab provides a suitable environment for running Kubeflow codelabs
+   * Jupyter provides a notebook solution, a text editor, and a command line shell.
+   * We will provide a curated Jupyter notebook for running Kubeflow codelabs.
+
+1. Users clone Kubeflow codelabs in JupyterLab and go through them
+
+1. Users can easily transition to managing/customizing Kubeflow from their local
+   machine
+
+   * If they deployed via script the ksonnet app should already be on their local
+     machine
+
+   * If they deployed by webapp they can easily clone the app to their local machine; e.g.
+
+     ```
+     kubectl cp kubeflow-admin/kubeflow-bootstrapper-0:/opt/bootstrap/default ~/my-kubeflow
+     ```
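
The readiness check in step 2 could be scripted along these lines. This is a minimal sketch, not part of the Kubeflow scripts: the function name `wait_for_application`, the `KUBECTL` override, and the retry defaults are assumptions made for illustration. It only checks that `kubectl get application` succeeds, because the status information the application resource will expose is still being worked out.

```shell
# Hypothetical helper: poll until `kubectl get application <name>` succeeds.
# KUBECTL can be overridden (e.g. with a stub) to try this without a cluster.
wait_for_application() {
  name="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-10}"      # assumed default: 10 seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "${KUBECTL:-kubectl}" get application "$name" >/dev/null 2>&1; then
      echo "application $name found after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for application $name" >&2
  return 1
}
```

A deploy script might end with `wait_for_application kubeflow 60 5` so that it returns only once the resource exists.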
+
+## Customizing & Optimizing Kubeflow For Different Platforms
+
+One of the goals of Kubeflow is to run anywhere Kubernetes runs. 
+**However we also want to optimize Kubeflow to take advantage of features in
+different K8s versions and platforms.**
+
+We've tried a variety of approaches, but the approach that is gaining traction
+is:
+
+1. We provide simple, platform-specific deployment scripts like this [one for GKE](https://github.com/kubeflow/kubeflow/blob/master/scripts/gke/deploy.sh)
+
+1. A corresponding platform-specific getting started page ([see here](https://github.com/kubeflow/website/tree/master/content/docs/started)) provides platform-specific instructions
+
+Here are some **guidelines (not requirements)** for creating the above scripts and instructions
+
+* Platform scripts should assume users are starting from scratch
+
+  * Scripts should create K8s clusters and supporting services as needed
+
+  * This is based on anecdotal evidence that most Kubeflow users to date
+    are creating infrastructure specifically for Kubeflow rather than deploying on
+    existing infrastructure.
+
+  * New features for ML (e.g. GPU support, advanced job scheduling) are landing
+    in each K8s release. So spinning up a new K8s cluster with appropriate
+    features enabled is often advantageous.
+
+* Declarative approaches to managing infrastructure (e.g. Terraform) are preferred
+
+  * Kubeflow aims to adopt K8s patterns; managing infrastructure declaratively
+    is one such pattern
+
+  * Another Kubeflow principle is low bar, high ceiling
+
+    * Declarative configs help this by providing a starting point for users
+      looking to customize the deployment further
+
+* Scripts should tend toward simplicity and readability
+
+  * Scripts should primarily be sugar that saves users from having to
+    copy and paste a bunch of commands.
+
+  * Scripts should be readable
+
+* Scripts should have a linear flow e.g.
+
+   * Create non K8s resources (including K8s cluster if appropriate)
+
+   * Create Kubeflow K8s resources
+
+* To avoid hitting GitHub API limits and requiring users to create a GitHub token
+  the recommended approach is
+
+  1. Fetch and unpack an archive copy of the repository.
+
+     ```
+     curl -L -o /tmp/source.tar.gz https://github.com/kubeflow/kubeflow/archive/v0.2.0.tar.gz
+     mkdir -p /tmp/source
+     tar -zxvf /tmp/source.tar.gz -C /tmp/source --strip-components=1
+     ```
+
+  1. Add the registry as a file path and not a git registry
+
+     ```
+     ks registry add kubeflow /tmp/source/kubeflow
+     ```
+
+  * Since the registry is added from a local file and not Git, the registry will only
+    be valid on that machine. To mitigate the effects of moving the ksonnet app to other
+    machines, you can install any packages you think users might want.
+
+  * Installed packages are stored in the ksonnet app vendor directory; they will be
+    valid on other machines provided you preserve the vendor directory (e.g. check it in to source control).
+
+  * There is an open issue [ksonnet/ksonnet#641](https://github.com/ksonnet/ksonnet/issues/641) to support this pattern
+    natively in ksonnet and address these limitations.
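
Taken together, the guidelines above suggest a script shaped roughly like the following. This is a hedged sketch rather than the actual `deploy.sh`: the names `run`, `DRY_RUN`, `KUBEFLOW_VERSION`, `WORK_DIR`, and `APP_NAME` are hypothetical, the infrastructure step is only a placeholder because it is platform specific, and `kubeflow/core` is used purely as an example package. It assumes the ksonnet registry lives in the `kubeflow/` directory of the unpacked repository.

```shell
# Illustrative settings; all of these names and defaults are assumptions.
KUBEFLOW_VERSION="${KUBEFLOW_VERSION:-v0.2.0}"
WORK_DIR="${WORK_DIR:-/tmp/kubeflow-source}"
APP_NAME="${APP_NAME:-my-kubeflow}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Archive URL for a release tag; avoids the GitHub API entirely.
archive_url() {
  echo "https://github.com/kubeflow/kubeflow/archive/$1.tar.gz"
}

# Step 1: non-K8s resources (cluster, storage, ...). Placeholder only.
create_infrastructure() {
  echo "step 1: create cluster and supporting services (platform specific)"
}

# Step 2: fetch and unpack an archive copy of the repository.
fetch_source() {
  run mkdir -p "$WORK_DIR" &&
  run curl -L -o "$WORK_DIR/source.tar.gz" "$(archive_url "$KUBEFLOW_VERSION")" &&
  run tar -zxf "$WORK_DIR/source.tar.gz" -C "$WORK_DIR" --strip-components=1
}

# Step 3: create the ksonnet app and register the local copy as a file registry.
create_app() {
  run ks init "$APP_NAME" &&
  run cd "$APP_NAME" &&
  run ks registry add kubeflow "$WORK_DIR/kubeflow" &&
  run ks pkg install kubeflow/core
}

# Linear flow: each step runs only if the previous one succeeded.
deploy() {
  create_infrastructure && fetch_source && create_app
}
```

Setting `DRY_RUN=1` before calling `deploy` prints the commands without executing them, which keeps the script easy to read and review.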
+
+## Open Question: Ensuring a common experience
+
+An open question is finding the right balance between ensuring a common experience across all platforms
+while still giving maintainers of a particular platform the ability to customize that experience.
+
+I think we'd like to have a well defined concept of base Kubeflow which consists of a good set of components
+that we think all users will want. Platform maintainers should be discouraged from removing components in base.
+If users want to remove certain base components they are free to do so and can easily do so.
+
+We have yet to find a good method for defining the base components. Currently, the base components
+are defined by the components generated in the deploy script. This works, but it will probably be difficult
+to avoid fragmentation as scripts are added for more platforms.
+
+We had previously tried defining a ksonnet core prototype consisting of all the components that should be
+installed by default. This had some unfortunate side effects, and I think we want to avoid creating a single prototype
+for multiple components going forward:
+
+   * A user should be able to manage the components individually; e.g. updating JupyterHub without updating other components
+   * We end up combining the parameters for all the components into a single prototype file which is a bit unwieldy
+   * It limits our ability to organize components into ksonnet packages by application (e.g. TensorFlow, PyTorch, Jupyter)
+
+## Monitoring Kubeflow Deployment
+
+Monitoring deployment is an open area. The initial focus is:
+
+  1. Informing the user when Kubeflow is fully deployed and ready to use
+  1. Surfacing relevant events (e.g. as K8s events)
+
+For more info please take a look at some of the open issues and consider
+opening more. 
+
+The current thinking is to follow the guidance of sig-apps and use an [application resource](https://github.com/kubernetes-sigs/application)
+to represent Kubeflow and attach events, status, and other metrics to that application as appropriate.
+
+
+* [kubeflow/kubeflow#1106](https://github.com/kubeflow/kubeflow/issues/1106) - Use Application CRD to describe Kubeflow
+* [kubeflow/kubeflow#1142](https://github.com/kubeflow/kubeflow/issues/1142) - Report K8s events to indicate when Kubeflow is ready
+* [kubeflow/kubeflow#955](https://github.com/kubeflow/kubeflow/issues/955) - Determine when ingress to Kubeflow is ready
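
Until that work lands, a script or user can already read whatever K8s events exist in the deployment namespace. A small sketch, assuming Kubeflow is deployed into a namespace passed as the first argument; the function name `show_deploy_events` and the `KUBECTL` override are hypothetical.

```shell
# List events in a namespace, oldest first.
# KUBECTL can be overridden (e.g. with a stub) to try this without a cluster.
show_deploy_events() {
  "${KUBECTL:-kubectl}" get events --namespace "$1" \
    --sort-by=.metadata.creationTimestamp
}
```

Sorting by creation time keeps the output in roughly the order the deployment progressed, which makes it easier to spot where a rollout stalled.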
+
+## Testing
+
+Continuous testing of deploy scripts is critical. Once our test infrastructure
+is sufficiently mature, CI will likely be required for a deploy script to be an officially
+recommended solution on kubeflow.org.
+
+For more information on testing:
+
+* Refer to [docs](https://github.com/kubeflow/testing) for our test infrastructure
+
+# References
+
+* [kubeflow/kubeflow#105](https://github.com/kubeflow/kubeflow/issues/105) - Make it
+  easy to get started with Kubeflow
+* [kubeflow/kubeflow#23](https://github.com/kubeflow/kubeflow/issues/23) - Original issue about tooling and configuration.
+* [kubeflow/kubeflow#376](https://github.com/kubeflow/kubeflow/issues/376) - Discussion of ksonnet patterns
+
+# Appendix: Other Approaches to Customization
+
+## Auto Configuration
+
+One of the ideas we considered was creating a simple program (originally called bootstrapper) that could automatically optimize a 
+Kubeflow configuration
+based on a user's setup. For example, if the K8s cluster has a default storage
+class the program would automatically configure JupyterHub to use persistent 
+volumes for notebook storage.
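
The storage class example could be detected roughly as follows. This is an illustrative sketch only and not how the bootstrapper is implemented: it relies on `kubectl get storageclass` marking the default class with `(default)` in its table output, and the function names and the `KUBECTL` override are hypothetical.

```shell
# Succeeds if the cluster reports a default storage class.
# KUBECTL can be overridden (e.g. with a stub) to try this without a cluster.
has_default_storage_class() {
  "${KUBECTL:-kubectl}" get storageclass 2>/dev/null | grep -q '(default)'
}

# Decide how notebook storage should be configured.
notebook_storage_mode() {
  if has_default_storage_class; then
    echo "persistent"
  else
    echo "ephemeral"
  fi
}
```

A configuration program could then branch on the output of `notebook_storage_mode` when deciding whether to enable persistent volumes for JupyterHub.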
+
+So far this idea hasn't gained much traction. 
+
+Most of the setup to date has focused on users starting from scratch. In this situation we 
+have complete control of the setup so there's no reason to optimize with respect to an
+existing deployment.
+
+The scaffolding/prototype for this (bootstrapper) is still in place and we haven't
+rejected this idea completely. So contributions pursuing this idea further would
+be welcome.
+
+## Cloud ksonnet parameter
+
+An early idea was to have a simple ksonnet/template parameter 
+(originally called cloud). This parameter would correspond to different
+platforms (e.g. GKE or Azure or minikube).
+
+This idea has largely proven unworkable and has been abandoned.
+
+* We need more customizability such as the ability to install different
+  components based on the user's setup.
+
+* It led to really complex jsonnet, e.g. using late binding to inject different
+  types of credentials. The current thinking is to find better patterns (e.g. admission controllers) that avoid complex jsonnet;
+  see [kubeflow/kubeflow#376](https://github.com/kubeflow/kubeflow/issues/376).
+
+## YAML manifests
+
+For a while we tried to get an experience that was the equivalent of
+
+```
+kubectl create -f https://.../kubeflow_manifests.yaml
+```
+
+We are moving away from this approach in favor of deploy.sh for a variety of reasons:
+
+* Some amount of customization of the YAML was always needed and a deploy script
+  can help automate this
+
+* This approach worked by creating a bootstrapper that ran on the cluster and then 
+  deployed the app. This had the following drawbacks
+
+  1. You had to grant the pod elevated permissions
+  1. We still needed a deploy.sh script to create the cluster itself at which point
+     it just made sense to run all the logic on the client.
+
+* A click to deploy web app provides a better experience in the case where users
+  don't want to install any command line tools to get started.