Extremely long loading times in UI on Minikube #3070
Can you please check that all services are running (e.g. metadata-service)?
@numerology Can this be related to the PROJECT_ID substitution?
Everything seems to be running, but some errors cause restarts during startup of the cluster. When the stall occurs, everything is running.
Which version of KF/KFP is running inside minikube?
I'm using this: https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml and I haven't tried deploying on a cloud-based cluster.
I think the slowness is attributable to the request made in https://github.com/kubeflow/pipelines/blob/master/frontend/server/handlers/gke-metadata.ts#L25. I worked around it by setting up nginx and an ExternalName Service that resolves "metadata" within the namespace where pipelines is deployed, and the UI no longer has issues.
@frsann Can you verify which part is slow: the frontend server or the backend server?
@andrewgdavis These requests do not block other requests, so I am not sure that's the root cause. Did you experience exactly the same issue as @frsann? If so, I'm guessing that either the frontend server or the browser has a limit on in-flight requests: because the project-id, cluster-name ... requests keep running for a long time, other requests may be blocked by throttling. If we can confirm the issue, I can submit a fix that adds cluster-name request caching in either the UI or the frontend server (or perhaps a configuration option to short-circuit this behavior). What do you think?
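For illustration, the caching and short-circuit ideas proposed above could look roughly like the sketch below. This is not the code from the eventual fix: `cacheOnce`, `makeClusterNameLookup`, and the `disabled` flag are hypothetical names, and the loader stands in for whatever call actually reaches the metadata endpoint.

```typescript
// A loader is any function that starts an asynchronous lookup.
type Loader<T> = () => Promise<T>;

// Hypothetical helper: run the lookup at most once and let every caller
// share the same promise, whether it is still pending or already settled.
// Note that a rejected lookup is cached too, so later callers fail fast.
function cacheOnce<T>(load: Loader<T>): Loader<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = load();
    }
    return pending;
  };
}

// Hypothetical short-circuit: when metadata lookups are disabled (for
// example on a cluster with no metadata endpoint), never call the loader
// and resolve with an empty name instead.
function makeClusterNameLookup(
  load: Loader<string>,
  disabled: boolean,
): Loader<string> {
  if (disabled) {
    return () => Promise.resolve('');
  }
  return cacheOnce(load);
}
```

With this shape, ten components asking for the cluster name produce one outgoing request instead of ten, and a deployment that sets the (assumed) disable flag produces none.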
@Bobgy I am not sure about the exact root cause, but I did see a correlation between the error message in the ml-pipeline-ui logs and the browser trying to render the UI when making requests. The error message is quite specific, and realizing that in minikube there was no "http://metadata" endpoint, I gave it one and have not seen any issue since. There are ways to dig in, but perhaps it comes down to whether or not the metadata request should even be made in minikube.
As a quick test, I left the ExternalName service in place so that DNS resolution was not an issue, but removed the nginx backend that was handling the request, and the UI slowness is visible again. Any GET /system/cluster-name or GET /system/project-id request resulted in the UnhandledPromiseRejectionWarning error in the logs. In the browser dev tools there is no status code or bytes transferred for the GET. When nginx was put back in the mix, an initial request resulted in an immediate 200 and subsequent requests returned a 304 (Not Modified). Because it is intermittent, perhaps it is an issue at the network layer with socket timeouts on the connection... The error message does say "reason: connect ECONNREFUSED".
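The symptom described above (a rejection logged on the server, no status code in the browser) is what one would expect if the handler awaits the metadata call without catching failures. A minimal sketch of a handler that always answers, assuming a hypothetical `fetchName` function in place of the real metadata call and a cut-down `Responder` interface instead of a real web framework:

```typescript
// Minimal response shape so the sketch does not depend on a web framework.
interface Responder {
  status(code: number): Responder;
  send(body: string): void;
}

// Hypothetical handler: `fetchName` stands in for the request to the
// metadata endpoint, which rejects (e.g. ECONNREFUSED) when nothing listens.
async function clusterNameHandler(
  fetchName: () => Promise<string>,
  res: Responder,
): Promise<void> {
  try {
    const name = await fetchName();
    res.status(200).send(name);
  } catch (err) {
    // Catching here means the client always gets a response, instead of
    // the rejection going unhandled while the request is left open.
    const reason = err instanceof Error ? err.message : String(err);
    res.status(500).send(`Failed fetching cluster name: ${reason}`);
  }
}
```

Whether a 500 or an empty 200 is the better answer on a non-GKE cluster is a design choice; the point of the sketch is only that the request is closed promptly either way.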
@frsann and @andrewgdavis Thanks for the verification! I'm convinced that's the root cause. I guess the long stall happens because the browser has a limit on concurrent requests (if not using HTTP/2) and the bad metadata endpoint waits forever, blocking other requests in the UI. I think we can do a few things to improve this.
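To make the blocking explanation concrete, here is a small simulation of a per-host connection limit. The numbers are assumptions for illustration only: a limit of six connections (a common HTTP/1.1 default in desktop browsers) and metadata requests that hang for 60 s; neither figure comes from this issue's logs.

```typescript
// Simulate a browser that allows at most `limit` concurrent requests to
// one host. durations[i] is how long request i takes in milliseconds;
// all requests are queued at t = 0 and dispatched in order.
// Returns the time at which each request actually starts.
function startTimes(durations: number[], limit: number): number[] {
  if (limit < 1) {
    throw new Error('limit must be at least 1');
  }
  // freeAt[s] is the time at which connection slot s becomes available.
  const freeAt = new Array<number>(limit).fill(0);
  return durations.map((duration) => {
    const start = Math.min(...freeAt); // earliest available slot
    const slot = freeAt.indexOf(start);
    freeAt[slot] = start + duration;
    return start;
  });
}

// Six hanging metadata requests followed by one quick list request.
const hanging = [60000, 60000, 60000, 60000, 60000, 60000];
const withMetadata = startTimes([...hanging, 200], 6);
const withoutMetadata = startTimes([200], 6);
console.log(withMetadata[6], withoutMetadata[0]); // 60000 0
```

In this model the quick request cannot start until one of the hanging requests gives up its slot, which is consistent with a page that renders but shows a spinner for a long time.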
Not surprisingly, the problem also occurs on a standalone KFP deployment ( |
Made a PR to avoid sending a lot of GKE metadata requests in the UI: #3117. This should mitigate most of the issues here.
@frsann @andrewgdavis Do you think the fix #3118 will be enough for you?
@Bobgy I think that should do it.
This issue is resolved by #3118 (comment) released in KFP 0.2.5. It hasn't been released in KF yet (planned to be part of KF 1.0.2). If you are affected, please upgrade to KFP 0.2.5+ and add an env variable |
I am running on a pure Kubernetes install on my home ESXi system .. 3 masters, 8 workers .. |
Which version did you install? You can check image field of |
ah! ok it looks like it is 0.2.0 .. I guess I should change that to 0.2.5 ? |
@miramar-labs I'd suggest installing Kubeflow 1.0.2 instead: https://www.kubeflow.org/docs/started/getting-started/ It ships KFP 0.2.5 (not all features may work if you only update the UI server's version).
ok well I just installed 1.0.2 on a fresh k8s cluster .. seemed to install ok but now I'm seeing an error 'etcdserver request timed out' in the UI when:
Any ideas what that's about? However, I created an experiment and ran a test pipeline .. all seems good now and the UI is nice and snappy, so thanks for that!
@miramar-labs Sorry, I've never seen that. Because the Jupyter notebook server also has the same problem, it sounds like it's not related to Kubeflow Pipelines. Maybe you can ask about it in the Kubeflow community or repo?
Signed-off-by: Aleksey Karpov <86011874+alekseyolg@users.noreply.github.com>
What happened:
I am continuously, but somewhat sporadically, experiencing extremely long loading times for (at least) pipelines and experiments in the UI. When the issue occurs, the loading time can vary from 30 s to 6 min. The page renders, but I'm shown a spinning wheel where the items should be listed.
What did you expect to happen:
I expected the experiments and pipelines to load "instantly", within seconds at least.
What steps did you take:
Anything else you would like to add:
The ml-pipeline-ui pod is reporting some errors. Here is a log dump from a short test session including multiple stalls:
Here is a screenshot showing the network activity for loading the experiments view, containing 2 experiments and 2 completed runs (6 min loading time in total).