Extremely long loading times in UI on Minikube #3070

Closed
frsann opened this issue Feb 13, 2020 · 25 comments · Fixed by #3118
Labels: area/frontend, help wanted, priority/p2, status/triaged


frsann commented Feb 13, 2020

What happened:
I am continuously, but somewhat sporadically, experiencing extremely long loading times for (at least) pipelines and experiments in the UI. When the issue occurs, the loading time can vary from 30 seconds to 6 minutes. The page renders, but I'm shown a spinning wheel where the items should be listed.

What did you expect to happen:
I expected the experiments and pipelines to load "instantly", or at least within seconds.

What steps did you take:

  • Created a new Minikube cluster (4 CPUs, 16 GB RAM, 110 GB storage).
  • Deployed the latest version (1.0 rc4) of Kubeflow (a rough command sketch for these first two steps follows the list).
  • Waited for everything to start.
  • Accessed the Kubeflow and KFP UIs.
  • Clicked around in the UI.
  • Occasionally (say every 2-3 minutes) the loading times would become extremely long.
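For reference, the first two steps correspond roughly to commands like the following. This is only a sketch under assumptions: the exact minikube flags and kfctl invocation may differ from what was actually run, and the config URI is the one mentioned further down this thread.

# Create the Minikube cluster with the stated resources (memory given in MB here).
minikube start --cpus 4 --memory 16384 --disk-size 110g

# Deploy Kubeflow 1.0 with kfctl, using the kfdef config quoted later in this thread.
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml"
kfctl apply -V -f ${CONFIG_URI}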

Anything else you would like to add:
The ml-pipeline-ui pod is reporting some errors. Here is a log dump from a short test session that included multiple stalls:

[HPM] Proxy created: / -> http://10.109.114.45:9090
[HPM] Proxy created: / -> http://127.0.0.1
[HPM] Subscribed to http-proxy events: [ 'error', 'close' ]
[HPM] Proxy created: / -> http://127.0.0.1
[HPM] Subscribed to http-proxy events: [ 'error', 'close' ]
[HPM] Proxy created: / -> http://10.105.245.70:8888
[HPM] Subscribed to http-proxy events: [ 'proxyReq', 'error', 'close' ]
[HPM] Proxy created: / -> http://10.105.245.70:8888
[HPM] Subscribed to http-proxy events: [ 'proxyReq', 'error', 'close' ]
Server listening at http://localhost:3000
GET /pipeline/apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc
GET /pipeline/apis/v1beta1/pipelines?page_size=5&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/pipelines?page_size=5&sort_by=created_at%20desc
GET /pipeline/apis/v1beta1/pipelines?page_size=5&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/pipelines?page_size=5&sort_by=created_at%20desc
GET /pipeline/apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc
GET /pipeline/apis/v1beta1/pipelines?page_size=5&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/pipelines?page_size=5&sort_by=created_at%20desc
GET /pipeline/apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc
GET /pipeline/
GET /pipeline/static/css/main.de5d904e.css
GET /pipeline/static/js/main.6a1ff432.js
GET /pipeline/static/css/main.de5d904e.css.map
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/system/cluster-name
GET /pipeline/apis/v1beta1/pipelines?page_token=&page_size=10&sort_by=created_at%20desc&filter=
Proxied request: /apis/v1beta1/pipelines?page_token=&page_size=10&sort_by=created_at%20desc&filter=
GET /pipeline/system/project-id
GET /pipeline/static/js/main.6a1ff432.js.map
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1455:11)
at ClientRequest.emit (events.js:223:5)
at Socket.socketErrorListener (_http_client.js:406:9)
at Socket.emit (events.js:223:5)
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:1) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1455:11)
at ClientRequest.emit (events.js:223:5)
at Socket.socketErrorListener (_http_client.js:406:9)
at Socket.emit (events.js:223:5)
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/apis/v1beta1/experiments?page_token=&page_size=10&sort_by=created_at%20desc&filter=
Proxied request: /apis/v1beta1/experiments?page_token=&page_size=10&sort_by=created_at%20desc&filter=
GET /pipeline/apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc&resource_reference_key.type=EXPERIMENT&resource_reference_key.id=34543ca2-0e65-4939-9cd0-9c476f13548f&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
Proxied request: /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc&resource_reference_key.type=EXPERIMENT&resource_reference_key.id=34543ca2-0e65-4939-9cd0-9c476f13548f&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
GET /pipeline/apis/v1beta1/pipelines?page_token=&page_size=10&sort_by=created_at%20desc&filter=
Proxied request: /apis/v1beta1/pipelines?page_token=&page_size=10&sort_by=created_at%20desc&filter=
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/apis/v1beta1/pipelines/c402e282-a6b8-4e89-aa93-ccdbe45f7923
Proxied request: /apis/v1beta1/pipelines/c402e282-a6b8-4e89-aa93-ccdbe45f7923
GET /pipeline/apis/v1beta1/pipeline_versions?resource_key.type=PIPELINE&resource_key.id=c402e282-a6b8-4e89-aa93-ccdbe45f7923&page_size=50&sort_by=created_at%20desc
Proxied request: /apis/v1beta1/pipeline_versions?resource_key.type=PIPELINE&resource_key.id=c402e282-a6b8-4e89-aa93-ccdbe45f7923&page_size=50&sort_by=created_at%20desc
GET /pipeline/apis/v1beta1/pipelines/c402e282-a6b8-4e89-aa93-ccdbe45f7923/templates
Proxied request: /apis/v1beta1/pipelines/c402e282-a6b8-4e89-aa93-ccdbe45f7923/templates
GET /pipeline/apis/v1beta1/healthz
GET /pipeline/apis/v1beta1/pipelines/c402e282-a6b8-4e89-aa93-ccdbe45f7923
Proxied request: /apis/v1beta1/pipelines/c402e282-a6b8-4e89-aa93-ccdbe45f7923
GET /pipeline/apis/v1beta1/pipeline_versions/c402e282-a6b8-4e89-aa93-ccdbe45f7923
Proxied request: /apis/v1beta1/pipeline_versions/c402e282-a6b8-4e89-aa93-ccdbe45f7923
GET /pipeline/system/project-id
GET /pipeline/system/cluster-name
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1455:11)
at ClientRequest.emit (events.js:223:5)
at Socket.socketErrorListener (_http_client.js:406:9)
at Socket.emit (events.js:223:5)
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1455:11)
at ClientRequest.emit (events.js:223:5)
at Socket.socketErrorListener (_http_client.js:406:9)
at Socket.emit (events.js:223:5)
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 4)
GET /pipeline/apis/v1beta1/experiments?page_token=&page_size=10&sort_by=created_at%20desc&filter=
Proxied request: /apis/v1beta1/experiments?page_token=&page_size=10&sort_by=created_at%20desc&filter=
GET /pipeline/system/cluster-name
GET /pipeline/system/project-id
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1455:11)
at ClientRequest.emit (events.js:223:5)
at Socket.socketErrorListener (_http_client.js:406:9)
at Socket.emit (events.js:223:5)
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 5)
(node:1) UnhandledPromiseRejectionWarning: FetchError: request to http://metadata/computeMetadata/v1/project/project-id failed, reason: getaddrinfo ENOTFOUND metadata
at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1455:11)
at ClientRequest.emit (events.js:223:5)
at Socket.socketErrorListener (_http_client.js:406:9)
at Socket.emit (events.js:223:5)
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 6)

Here is a screenshot showing the network activity when loading the experiments view, which contains 2 experiments and 2 completed runs (6 min loading time in total):
[Screenshot 2020-02-12 at 10:52:35]


Ark-kun commented Feb 13, 2020

Can you please check that all services are running (e.g. metadata-service)?


Ark-kun commented Feb 13, 2020

@numerology Can this be related to the PROJECT_ID substitution?


frsann commented Feb 13, 2020

@Ark-kun

Everything seems to be running, though there were some errors causing restarts during startup of the cluster. When the stall occurs, everything is running:

~ : k get pods -A                                                                                                                                                            10:03:47
NAMESPACE              NAME                                                           READY   STATUS      RESTARTS   AGE
cert-manager           cert-manager-5d849b9888-6xsmz                                  1/1     Running     1          25h
cert-manager           cert-manager-cainjector-dccb4d7f-smk97                         1/1     Running     1          25h
cert-manager           cert-manager-webhook-695df7dbb-l9jpq                           1/1     Running     1          25h
istio-system           grafana-86f89dbd84-9grtq                                       1/1     Running     1          25h
istio-system           istio-citadel-74966f47d6-ftjbp                                 1/1     Running     2          25h
istio-system           istio-cleanup-secrets-1.1.6-wzt9j                              0/1     Completed   0          25h
istio-system           istio-egressgateway-5c64d575bc-22qd8                           1/1     Running     1          25h
istio-system           istio-galley-784b9f6d75-2kddf                                  1/1     Running     1          25h
istio-system           istio-grafana-post-install-1.1.6-rq4kg                         0/1     Completed   0          25h
istio-system           istio-ingressgateway-589ff776dd-jr7q6                          1/1     Running     1          25h
istio-system           istio-pilot-677df6b6d4-4llr9                                   2/2     Running     2          25h
istio-system           istio-policy-6f74d9d95d-d8hw2                                  2/2     Running     7          25h
istio-system           istio-security-post-install-1.1.6-z98bd                        0/1     Completed   0          25h
istio-system           istio-sidecar-injector-866f4b98c7-td8rn                        1/1     Running     1          25h
istio-system           istio-telemetry-549c8f9dcb-b5j64                               2/2     Running     6          25h
istio-system           istio-tracing-555cf644d-m48xg                                  1/1     Running     1          25h
istio-system           kiali-7db44d6dfb-hcwfq                                         1/1     Running     1          25h
istio-system           prometheus-d44645598-27xnd                                     1/1     Running     1          25h
knative-serving        activator-5484756f7b-vdznj                                     2/2     Running     7          24h
knative-serving        autoscaler-8dc957c8-29g9b                                      2/2     Running     6          24h
knative-serving        autoscaler-hpa-5654b69d4c-cvjhv                                1/1     Running     1          24h
knative-serving        controller-66654bc6f7-7xzzj                                    1/1     Running     1          24h
knative-serving        networking-istio-557465cf96-rktn2                              1/1     Running     1          24h
knative-serving        webhook-585767d97f-4jnl8                                       1/1     Running     0          5m16s
kube-system            coredns-5c98db65d4-bhhj2                                       1/1     Running     1          25h
kube-system            coredns-5c98db65d4-t6b2n                                       1/1     Running     1          25h
kube-system            etcd-minikube                                                  1/1     Running     1          25h
kube-system            kube-addon-manager-minikube                                    1/1     Running     1          25h
kube-system            kube-apiserver-minikube                                        1/1     Running     1          25h
kube-system            kube-controller-manager-minikube                               1/1     Running     1          25h
kube-system            kube-proxy-74rlb                                               1/1     Running     1          25h
kube-system            kube-scheduler-minikube                                        1/1     Running     1          25h
kube-system            nginx-ingress-controller-657fd58d97-kvxzd                      1/1     Running     2          25h
kube-system            storage-provisioner                                            1/1     Running     2          25h
kubeflow               admission-webhook-bootstrap-stateful-set-0                     1/1     Running     1          24h
kubeflow               admission-webhook-deployment-569558c8b6-bjkwp                  1/1     Running     0          5m16s
kubeflow               application-controller-stateful-set-0                          1/1     Running     1          25h
kubeflow               argo-ui-7ffb9b6577-l22t9                                       1/1     Running     1          24h
kubeflow               centraldashboard-659bd78c-5rfjl                                1/1     Running     1          24h
kubeflow               conditional-execution-pipeline-mwrm9-1673514438                0/2     Completed   0          24h
kubeflow               conditional-execution-pipeline-mwrm9-353249933                 0/2     Completed   0          24h
kubeflow               conditional-execution-pipeline-mwrm9-436159354                 0/2     Completed   0          24h
kubeflow               jupyter-web-app-deployment-5b5bc97ff8-p2njf                    1/1     Running     1          24h
kubeflow               katib-controller-7f58569f7d-zl6sk                              1/1     Running     2          24h
kubeflow               katib-db-manager-54b66f9f9d-98dvl                              1/1     Running     1          24h
kubeflow               katib-mysql-dcf7dcbd5-q4sm7                                    1/1     Running     1          24h
kubeflow               katib-ui-6f97756598-nq92k                                      1/1     Running     1          24h
kubeflow               kfserving-controller-manager-0                                 2/2     Running     3          24h
kubeflow               metacontroller-0                                               1/1     Running     1          24h
kubeflow               metadata-db-65fb5b695d-prjv2                                   1/1     Running     1          24h
kubeflow               metadata-deployment-65ccddfd4c-g8sk7                           1/1     Running     1          24h
kubeflow               metadata-envoy-deployment-7754f56bff-q6wwt                     1/1     Running     1          24h
kubeflow               metadata-grpc-deployment-7557fdc6bb-s8qm4                      1/1     Running     2          24h
kubeflow               metadata-ui-7c85545947-g5kq7                                   1/1     Running     1          24h
kubeflow               minio-69b4676bb7-r5lmv                                         1/1     Running     1          24h
kubeflow               ml-pipeline-5cddb75848-bfzb9                                   1/1     Running     3          24h
kubeflow               ml-pipeline-ml-pipeline-visualizationserver-7f6fcb68c8-tkdvg   1/1     Running     1          24h
kubeflow               ml-pipeline-persistenceagent-6ff9fb86dc-vxr8b                  1/1     Running     1          24h
kubeflow               ml-pipeline-scheduledworkflow-7f84b54646-fpbrf                 1/1     Running     1          24h
kubeflow               ml-pipeline-ui-6758f58868-s2x6l                                1/1     Running     1          24h
kubeflow               ml-pipeline-viewer-controller-deployment-745dbb444d-z87kc      1/1     Running     1          24h
kubeflow               mysql-6bcbfbb6b8-hpk9c                                         1/1     Running     1          24h
kubeflow               notebook-controller-deployment-54f455c5c9-wqqmc                1/1     Running     1          24h
kubeflow               parallel-pipeline-lfq4j-1031451100                             0/2     Completed   0          24h
kubeflow               parallel-pipeline-lfq4j-2500367655                             0/2     Completed   0          24h
kubeflow               parallel-pipeline-lfq4j-2879362660                             0/2     Completed   0          24h
kubeflow               profiles-deployment-78f694bffb-5b4ms                           2/2     Running     2          24h
kubeflow               pytorch-operator-cf8c5c497-4zvs6                               1/1     Running     1          24h
kubeflow               seldon-controller-manager-6b4b969447-lqsf5                     1/1     Running     1          24h
kubeflow               sequential-pipeline-92fbw-3504141718                           0/2     Completed   0          23h
kubeflow               sequential-pipeline-92fbw-3740095253                           0/2     Completed   0          23h
kubeflow               spark-operatorcrd-cleanup-95v2h                                0/2     Completed   0          24h
kubeflow               spark-operatorsparkoperator-76dd5f5688-nhqhc                   1/1     Running     1          24h
kubeflow               spartakus-volunteer-5dc96f4447-96xzd                           1/1     Running     1          24h
kubeflow               tensorboard-5f685f9d79-kpjzh                                   1/1     Running     1          24h
kubeflow               tf-job-operator-5fb85c5fb7-5h8vt                               1/1     Running     1          24h
kubeflow               workflow-controller-689d6c8846-tkfrm                           1/1     Running     1          24h
kubernetes-dashboard   dashboard-metrics-scraper-74c99fbfdf-2zw9l                     1/1     Running     1          25h
kubernetes-dashboard   kubernetes-dashboard-86d44f77cf-prd4h                          1/1     Running     2          25h

@rmgogogo

@Bobgy

rmgogogo added the status/triaged label Feb 13, 2020
@rmgogogo

Which version of KF/KFP is running inside the Minikube cluster?
Does it work well in a normal cluster?


frsann commented Feb 13, 2020

@rmgogogo

I'm using this: https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml

and kfctl v1.0-rc.3-1-g24b60e8. Does that answer your question?

I haven't tried deploying on a cloud-based cluster.


andrewgdavis commented Feb 13, 2020

I think the slowness is attributable to the request to http://metadata, which is Google Cloud specific. DNS resolution in Kubernetes uses ndots, which can take some time (causing the intermittent delays in the UI). In a Minikube environment, this call shouldn't be made at all (unless an explicit service is installed with Pipelines).

https://github.com/kubeflow/pipelines/blob/master/frontend/server/handlers/gke-metadata.ts#L25

The way I worked around it was to set up nginx and an ExternalName Service that resolves "metadata" within the namespace where Pipelines is deployed, and the UI no longer has issues.
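For reference, a minimal sketch of the Service half of that workaround (the namespace, Service name, and the nginx stub's DNS name are assumptions; the nginx backend still has to answer the /computeMetadata/... paths with dummy values):

# Make the bare name "metadata" resolvable inside the kubeflow namespace by
# aliasing it to an assumed nginx stub Service.
kubectl apply -n kubeflow -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: metadata
spec:
  type: ExternalName
  externalName: metadata-stub-nginx.kubeflow.svc.cluster.local
EOF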


Bobgy commented Feb 14, 2020

@frsann Can you verify which part is slow: the frontend server or the backend server?
E.g. you can try using the kfp Python client to list experiments and runs and measure the delay, then compare it with the network requests in the UI.
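For example, a rough way to time the backend directly with the kfp SDK (a sketch only; the port-forward target and host URL assume a default install in the kubeflow namespace):

# Expose the KFP API server locally.
kubectl -n kubeflow port-forward svc/ml-pipeline 8888:8888 &
sleep 3

# Time a couple of list calls against the backend and compare with the UI's network tab.
time python3 -c "
import kfp
client = kfp.Client(host='http://localhost:8888')
print(client.list_experiments(page_size=10))
print(client.list_runs(page_size=10))
"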


Bobgy commented Feb 14, 2020

@andrewgdavis these requests do not block other requests, so I am not sure that's the root cause.

Did you experience exactly the same issue as @frsann? If so, I'm guessing that either the frontend server or the browser has a limit on concurrent requests: because the project-id and cluster-name requests keep running for a long time, other requests may be blocked by throttling.

If we can confirm the issue, I can submit a fix to cache the cluster-name request in either the UI or the frontend server (or probably add a configuration option to short-circuit this behavior).

What do you think?

Bobgy added the help wanted, priority/p2, and needs investigation labels Feb 14, 2020

frsann commented Feb 14, 2020

@Bobgy
I used the KFP CLI tool (version 0.2.2.1) to query pipelines multiple times. The results returned within 1-2 seconds. I also tried querying the pipelines WHILE the stall was occurring in the UI, and the results still returned within seconds. So it seems like a frontend issue?

@andrewgdavis

I am not sure about the exact root cause, but I did see a correlation between the error message in the ml-pipeline-ui logs and the browser stalling while trying to render the UI. The error message is quite specific, and realizing that in Minikube there was no "http://metadata" endpoint, I gave it one and have not seen the issue since.

There are ways to dig in further, but perhaps it comes down to whether or not the metadata request should even be made on Minikube.


andrewgdavis commented Feb 14, 2020

As a quick test, I left the ExternalName Service in place so that DNS resolution was not an issue, but removed the nginx backend that was handling the request, and the UI slowness became visible again.

Any GET /system/cluster-name or GET /system/project-id request resulted in the UnhandledPromiseRejectionWarning error in the logs. In the browser dev tools there is no status code or bytes transferred for the GET.

When nginx was placed back in the mix, an initial request resulted in an immediate 200 and subsequent requests returned a 304 (Not Modified).

Because it is intermittent, perhaps it is an issue at the network layer with socket timeouts on the connection... the error message does say "reason: connect ECONNREFUSED".


Bobgy commented Feb 16, 2020

@frsann and @andrewgdavis Thanks for the verification! I'm convinced that's the root cause.

I guess the long stall is caused by the browser's limit on concurrent requests (when not using HTTP/2): the bad metadata endpoint waits forever, blocking other requests in the UI.

I think we can do a few things to improve this:

  1. The UI should request this data only once.
  2. The frontend node server can add an env config to disable this endpoint.


frsann commented Feb 16, 2020

Not surprisingly, the problem occurs on a standalone KFP deployment (0.2.3) as well.


Bobgy commented Feb 19, 2020

Made a PR to avoid sending a lot of GKE metadata requests from the UI: #3117

This should mitigate most of the issues here.


Bobgy commented Feb 19, 2020

@frsann @andrewgdavis Do you think the fix #3118 will be enough for you?

@andrewgdavis

@Bobgy I think that should do it.


Bobgy commented Feb 20, 2020

This issue is resolved by #3118 (comment), released in KFP 0.2.5. It hasn't been released in KF yet (it is planned to be part of KF 1.0.2).

If you are affected, please upgrade to KFP 0.2.5+ and add an env variable DISABLE_GKE_METADATA=true to the ml-pipeline-ui deployment.
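For a standalone deployment, something like the following should work (assuming the default kubeflow namespace and deployment name):

kubectl -n kubeflow set env deployment/ml-pipeline-ui DISABLE_GKE_METADATA=true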

@miramar-labs

I am running on a pure Kubernetes install on my home ESXi system .. 3 masters, 8 workers ..
I tried DISABLE_GKE_METADATA=true
Doesn't seem to fix the problem.....
Any other suggestions?


Bobgy commented Apr 21, 2020

I am running on a pure Kubernetes install on my home ESXi system .. 3 masters, 8 workers ..
I tried DISABLE_GKE_METADATA=true
Doesn't seem to fix the problem.....
Any other suggestions?

Which version did you install? You can check the image field of the ml-pipeline-ui deployment.
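For example (assuming the default kubeflow namespace and a single container in the deployment):

kubectl -n kubeflow get deployment ml-pipeline-ui \
  -o jsonpath='{.spec.template.spec.containers[0].image}'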


Bobgy commented Apr 21, 2020

@miramar-labs

@miramar-labs

Ah, OK, it looks like it is 0.2.0. I guess I should change that to 0.2.5?


Bobgy commented Apr 21, 2020

@miramar-labs I'd suggest installing Kubeflow 1.0.2 instead: https://www.kubeflow.org/docs/started/getting-started/

It has KFP 0.2.5 (not all features may work if you just update the UI server's version).


miramar-labs commented Apr 21, 2020

OK, well, I just installed 1.0.2 on a fresh k8s cluster. It seemed to install fine, but now I'm seeing an 'etcdserver request timed out' error in the UI when:

  1. I create the initial namespace in the 'setup' tab
  2. I add my JupyterLab notebook server

Any ideas what that's about?

However, I created an experiment and ran a test pipeline. All seems good now and the UI is nice and snappy, so thanks for that!


Bobgy commented Apr 21, 2020

@miramar-labs Sorry, I've never seen that. Since the Jupyter notebook server has the same problem, it sounds like it's not related to Kubeflow Pipelines; maybe you can ask about it in the Kubeflow community or repo?
