New runs' details page always loading #3763
/assign @Bobgy
Workaround: run
I'm thinking that even if we can't fix the issue directly, adding a liveness probe to the persistence agent would be the best option. However, the persistence agent doesn't simply hang: it keeps working properly, just without picking up new information. If anyone has ideas on how to implement a liveness probe for this situation, that would be awesome.
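One heuristic worth sketching: since the failure mode is "agent runs but never sees *new* workflows", a probe could check how long it has been since the agent last observed a previously unseen workflow. This is a toy sketch, not existing KFP code; it assumes a hypothetical hook where the agent records a timestamp each time it syncs a workflow it hadn't seen before.

```python
import time

# Hypothetical threshold: how long without seeing any new workflow
# before we declare the agent unhealthy.
STALENESS_THRESHOLD_SECONDS = 300

def is_live(last_new_workflow_seen: float, now: float,
            threshold: float = STALENESS_THRESHOLD_SECONDS) -> bool:
    """Liveness heuristic: healthy only if a *new* workflow was observed
    recently. This targets the failure mode described above, where the
    agent keeps re-listing old workflows but never detects new ones."""
    return (now - last_new_workflow_seen) < threshold

# Example: healthy if a new workflow was seen 10s ago, unhealthy after 400s.
now = time.time()
print(is_live(now - 10, now))   # True
print(is_live(now - 400, now))  # False
```

The obvious caveat is that a quiet cluster with no new runs at all would look unhealthy under this check, which is part of why the comment above calls this situation hard to probe.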
1 extra data point: we got one more external report about this.
Copy notes here: "One thing strange is that, when looking at persistence agent logs, it still loops properly listing all the old workflows, but it no longer detect newly created workflows." "it happens after reschedule pod to another node-pool"
The other report has a different root cause from this one.
A quick question: does the empty details page usually turn normal after a while, or does it stay empty?
@jingzhang36 No, it doesn't recover by itself.
Then, do you still have an instance where this issue happens?
I don't, but I can ping you the next time I reproduce it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/frozen |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen |
I no longer see this happening.
@Bobgy I'm curious if you have any updates on this issue. After we upgraded our cluster to KFP 1.3, this issue seems to occur more often, especially when the cluster is busy. When it happens, the run status is shown as Unknown on the experiment page, and the run details page is empty with a spinning wheel. It can be fixed by deleting the ml-pipeline-persistenceagent pod.
@daikeshi I no longer see this issue after my last post here. My suspicion is that this might have something to do with the controller-runtime version. Can you try upgrading the persistence agent and see if there's any change? (You can check the change history to find out which version we upgraded controller-runtime to.)
hmm, we are using
Yes, I had the wrong impression. There's a new pending PR that updates this lib and the k8s client; that might help. Based on my investigation, this seems to be a problem with the controller boilerplate code or the library.
@Bobgy that's awesome! Would you mind sharing the link to that new PR, so I can keep an eye on it when it gets merged and released? Thank you!
Sure, it's #5792
I'm having the same issue, and solved it by deleting ... when there was a problem. I looked at the logs and found that Kubeflow Pipeline's ... after I re-create ...
Yes, it's expected that
Deleting the ml-pipeline-persistenceagent is not helping.
@Bobgy How many workflows is too many? It is supposed to scale horizontally, so what is the bottleneck here?
Per #3763 (comment), there was a fix (#5792) for this issue, which was released in 1.7.0.
We did. It helped. Thank you!
@Bobgy It is happening again, all the time. Here are logs from ml-pipeline: Error ... Here is the log from ml-pipeline-persistence-agent: ...
Here are logs from kube-dns: 2022-02-03 02:16:49.025 PST Error while fetching metric descriptors for kubedns: googleapi: Error 503: The service is currently unavailable., backendError
@Bobgy @james-jwu I found that the "new runs" hanging is caused by network interruptions between Kubeflow Pipelines and the MySQL database, and it still happens in the latest KFP versions. That is, this issue always happens when you see errors like ... See the issues that users are raising when they have database connection issues:
From Kubeflow Pipelines' perspective, we should probably make database network issues fail more catastrophically, so that users are not left with a semi-working Kubeflow Pipelines and no understanding of why things are not working. For users, the solution is to fix your cluster's network access to your MySQL, which can be quite hard to debug, as network issues usually are. If your MySQL is a managed service (like AWS RDS or Google Cloud SQL), look for VPC routing issues such as asymmetric routing. For example, I had a case where the cluster accessed the database via an AWS VPN Gateway, but there was no route back from the database to the cluster. Note that MySQL sometimes initiates new TCP connections back to the client from the "server side", which will obviously fail in the above case where there is no route back to the client.
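As a first debugging step for the forward direction of the network path described above, you can check plain TCP reachability from inside the cluster to the MySQL endpoint. This is a generic sketch (the hostname below is a hypothetical example, not a KFP-defined name); run it from a debug pod in the cluster.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical example endpoint -- substitute your own MySQL host/port:
# can_reach("mysql.kubeflow.svc.cluster.local", 3306)
```

Note this only exercises the forward route; per the comment above, asymmetric routing (no route back from the database to the cluster) can still break things even when this check passes.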
Closing this issue, as no users have reported this error since 2022. Feel free to reopen it if the issue remains in the latest releases. /close
@rimolive: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What steps did you take:
What happened:
Sometimes, after a job is submitted and successfully executed, the KFP UI fails to display run details (an empty page with a spinning wheel) via the URL https://${CLUSTER_URI}/_/pipeline/?ns=keshi#/runs/details/${RUN_ID}, for every new run.
Additionally, the problem is that the KFP DB doesn't have that information, because the persistence agent stopped syncing new workflows.
One strange thing is that, looking at the persistence agent logs, it still loops properly, listing all the old workflows, but it no longer detects newly created workflows.
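The symptom above (old workflows keep getting relisted, new ones never appear) is consistent with a list/watch-style sync loop whose watch has silently died while relists are served from a stale snapshot. The following is a toy model of that failure mode, purely illustrative and not actual KFP or persistence agent code:

```python
def sync_cycle(relist, drain_watch, handled):
    """One cycle of a toy list/watch sync loop (illustrative only)."""
    for wf in relist():        # periodic relist: keeps "working" normally
        handled.add(wf)
    for wf in drain_watch():   # watch events: should deliver new workflows
        handled.add(wf)

# Simulate the failure mode: relists are served from a stale snapshot and
# the watch has silently died (yields nothing), so a workflow created after
# the snapshot never reaches `handled` -- matching the logs described above.
stale_snapshot = ["old-run-1", "old-run-2"]
handled = set()
for _ in range(5):
    sync_cycle(lambda: stale_snapshot, lambda: [], handled)
# "new-run" exists on the cluster, but is never handled.
```

Under this model, every relist "succeeds" and the agent looks perfectly healthy in its logs, which is exactly why a naive liveness probe doesn't catch it.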
What did you expect to happen:
PA should keep syncing workflows.
Environment:
How did you deploy Kubeflow Pipelines (KFP)?
Kubeflow deployment
KFP version:
I don't remember clearly, but I think I've seen the issue on rare occasions from 0.2.0 to 0.5.0.
/kind bug