From 1dd668e703671216a55fa97f257c03d6d71a94ef Mon Sep 17 00:00:00 2001
From: Iman Tabrizian
Date: Fri, 2 Aug 2019 01:15:52 -0400
Subject: [PATCH] Add troubleshooting guide for notebooks (#1008)

* feat: add template of troubleshooting guide for notebooks
* feat: improve the notebooks troubleshooting guide
* fix: improve the note for GCP users
* fix: minor bug in the link
* fix: improve the troubleshooting guide
* fix: fix troubleshooting guides
---
 content/docs/notebooks/setup.md              |  4 +-
 content/docs/notebooks/troubleshoot.md       | 95 ++++++++++++++++++++
 content/docs/other-guides/troubleshooting.md | 31 +------
 3 files changed, 99 insertions(+), 31 deletions(-)
 create mode 100644 content/docs/notebooks/troubleshoot.md

diff --git a/content/docs/notebooks/setup.md b/content/docs/notebooks/setup.md
index e1292a0d1e..a8918a33da 100644
--- a/content/docs/notebooks/setup.md
+++ b/content/docs/notebooks/setup.md
@@ -265,4 +265,6 @@ exposed to the internet and is an unsecured endpoint by default.
 * Learn the advanced features available from a Kubeflow notebook, such as
 [submitting Kubernetes resources](/docs/notebooks/submit-kubernetes/) or
 [building Docker images](/docs/notebooks/submit-docker-image/).
- 
\ No newline at end of file
+* Visit the [troubleshooting guide](/docs/notebooks/troubleshoot) for fixing common
+ errors in creating Jupyter notebooks in Kubeflow.
+
diff --git a/content/docs/notebooks/troubleshoot.md b/content/docs/notebooks/troubleshoot.md
new file mode 100644
index 0000000000..235dbd2141
--- /dev/null
+++ b/content/docs/notebooks/troubleshoot.md
@@ -0,0 +1,95 @@
++++
+title = "Troubleshooting Guide"
+description = "Fixing common problems in Kubeflow notebooks"
+weight = 50
++++
+
+## Persistent Volumes and Persistent Volume Claims
+
+First, make sure that PVCs are bound when using Jupyter notebooks. This should
+not be a problem when using managed Kubernetes. If you are running Kubeflow on-prem
+in a multi-node Kubernetes cluster, see the guide to [Kubeflow on-prem in a multi-node Kubernetes cluster](/docs/use-cases/kubeflow-on-multinode-cluster/). Otherwise, see the [Pods stuck in Pending State](/docs/other-guides/troubleshooting/#pods-stuck-in-pending-state) guide to troubleshoot this problem.
+
+## Check the status of notebooks
+
+Run the commands below.
+
+```
+kubectl get notebooks -o yaml ${NOTEBOOK}
+kubectl describe notebooks ${NOTEBOOK}
+```
+
+Check the `events` section to make sure that there are no errors.
+
+## Check the status of statefulsets
+
+Make sure that the current number of StatefulSet replicas equals the desired number. If
+it does not, check for errors using `kubectl describe`.
+
+
+```
+kubectl get statefulsets -o yaml ${NOTEBOOK}
+kubectl describe statefulsets ${NOTEBOOK}
+```
+
+
+Without `-o yaml`, the output of `kubectl get statefulsets` should look like this:
+```
+NAME            DESIRED   CURRENT   AGE
+your-notebook   1         1         9m4s
+```
+## Check the status of Pods
+
+If the current number of StatefulSet replicas does not match the desired number, check
+whether the number of Pods matches the number of desired Pods in the first command.
+If it does not match, follow the steps below to investigate the issue further.
+
+```
+kubectl get pod -o yaml ${NOTEBOOK}-0
+```
+
+* The name of the Pod should start with `jupyter`
+* If you are using username/password auth with Jupyter, the pod will be named
+
+```
+jupyter-${USERNAME}
+```
+
+* If you are using IAP on GKE, the pod will be named
+
+```
+jupyter-accounts-2egoogle-2ecom-3USER-40DOMAIN-2eEXT
+```
+ * Where USER@DOMAIN.EXT is the Google account used with IAP
+
+Once you know the name of the pod, run
+
+```
+kubectl describe pod ${NOTEBOOK}-0
+```
+
+* Look at the `events` to see if there are any errors trying to schedule the pod
+* One common error is not being able to schedule the pod because there aren’t enough resources in the cluster.
+
+
+If the error persists, check the container logs for errors.
+
+```
+kubectl logs ${NOTEBOOK}-0
+```
+
+## Note for GCP Users
+
+You may encounter the error below:
+```
+Type     Reason        Age                     From                    Message
+----     ------        ----                    ----                    -------
+Warning  FailedCreate  2m19s (x26 over 7m39s)  statefulset-controller  create Pod test1-0 in StatefulSet test1 failed error: pods "test1-0" is forbidden: error looking up service account kubeflow/default-editor: serviceaccount "default-editor" not found
+```
+
+To fix this problem, create a service account named `default-editor` in the namespace shown in the error (`kubeflow` in this example) and bind it to the `cluster-admin` role.
+
+```
+kubectl create sa default-editor -n kubeflow
+kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --serviceaccount kubeflow:default-editor
+```
diff --git a/content/docs/other-guides/troubleshooting.md b/content/docs/other-guides/troubleshooting.md
index 8a5b9be180..30d896c09b 100644
--- a/content/docs/other-guides/troubleshooting.md
+++ b/content/docs/other-guides/troubleshooting.md
@@ -78,37 +78,8 @@ how RBAC interacts with IAM on GCP.

 ## Problems spawning Jupyter pods

-If you're having trouble spawning Jupyter notebooks, check that the pod is getting
-scheduled
+This section has been moved to the [Jupyter Notebooks Troubleshooting Guide](/docs/notebooks/troubleshoot/).

-```
-kubectl -n ${NAMESPACE} get pods
-```
-
- * Look for pods whose name starts with juypter
- * If you are using username/password auth with Jupyter the pod will be named
-
- ```
- jupyter-${USERNAME}
- ```
-
- * If you are using IAP on GKE the pod will be named
-
- ```
- jupyter-accounts-2egoogle-2ecom-3USER-40DOMAIN-2eEXT
- ```
-
- * Where USER@DOMAIN.EXT is the Google account used with IAP
-
-Once you know the name of the pod do
-
-```
-kubectl -n ${NAMESPACE} describe pods ${PODNAME}
-```
-
- * Look at the events to see if there are any errors trying to schedule the pod
- * One common error is not being able to schedule the pod because there aren't
- enough resources in the cluster.
## Pods stuck in Pending state