New runs' details page always loading #3763
/assign @Bobgy
Workaround: run
I'm thinking that even if we can't fix the issue directly, adding a liveness probe to the persistence agent would be the best option. However, the persistence agent doesn't simply hang: it keeps working properly, just without picking up new information. If anyone has ideas on how to implement a liveness probe for this situation, that would be awesome.
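One heuristic worth sketching: since the failure mode is "agent runs but never sees *new* workflows", a probe could check how long it has been since the agent last observed a previously unseen workflow. This is a toy sketch, not existing KFP code; it assumes a hypothetical hook where the agent records a timestamp each time it syncs a workflow it hadn't seen before.

```python
import time

# Hypothetical threshold: how long without seeing any new workflow
# before we declare the agent unhealthy.
STALENESS_THRESHOLD_SECONDS = 300

def is_live(last_new_workflow_seen: float, now: float,
            threshold: float = STALENESS_THRESHOLD_SECONDS) -> bool:
    """Liveness heuristic: healthy only if a *new* workflow was observed
    recently. This targets the failure mode described above, where the
    agent keeps re-listing old workflows but never detects new ones."""
    return (now - last_new_workflow_seen) < threshold

# Example: healthy if a new workflow was seen 10s ago, unhealthy after 400s.
now = time.time()
print(is_live(now - 10, now))   # True
print(is_live(now - 400, now))  # False
```

The obvious caveat is that a quiet cluster with no new runs at all would look unhealthy under this check, which is part of why the comment above calls this situation hard to probe.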
1 extra data point: we got one more external report about this.
Copy notes here: "One thing strange is that, when looking at persistence agent logs, it still loops properly listing all the old workflows, but it no longer detect newly created workflows." "it happens after reschedule pod to another node-pool"
The other report has a different root cause from this one.
A quick question: does the empty details page usually turn normal after a while, or does it stay empty?
@jingzhang36 No, it doesn't recover by itself.
Then, do you still have an instance where this issue happens?
I don't, but I can ping you the next time I reproduce it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/frozen |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen |
I no longer see this happening.
@Bobgy I'm curious if you have any updates on this issue. After we upgraded our cluster to KFP 1.3, this issue seems to occur more often, especially when the cluster is busy. When it happens, the run status is shown as Unknown on the experiment page, and the run details page is empty with a spinning wheel. It can be fixed by deleting the ml-pipeline-persistenceagent pod.
@daikeshi I no longer see this issue after my last post here. My suspicion is that this might have something to do with the controller-runtime version. Can you try upgrading the persistence agent and see if there's any change? (You can check the change history to find out which version we upgraded controller-runtime to.)
hmm, we are using
Yes, I had the wrong impression. There's a new pending PR that updates this lib and the k8s client; that might help. Based on my investigation, this seems to be a problem with the controller boilerplate code or the library.
@Bobgy that's awesome! Would you mind sharing the link to that new PR, so I can keep an eye on it when it gets merged and released? Thank you!
Sure, it's #5792
I'm having the same issue, and solved it by deleting ... when there was a problem. I looked at the logs and found that Kubeflow Pipeline's ... after I re-create ...
Yes, it's expected that
Deleting the ml-pipeline-persistenceagent is not helping.
@Bobgy How many workflows is too many? It is supposed to scale horizontally, so what is the bottleneck here?
Per #3763 (comment), there was a fix (#5792) for this issue, which was released in 1.7.0.
We did. It helped. Thank you!
@Bobgy It is happening again, all the time. Here are logs from ml-pipeline: Error ... Here is the log from ml-pipeline-persistence-agent: ...
Here are logs from kube-dns: 2022-02-03 02:16:49.025 PST Error while fetching metric descriptors for kubedns: googleapi: Error 503: The service is currently unavailable., backendError
@Bobgy @james-jwu I found that the "new runs" hanging is caused by network interruptions between Kubeflow Pipelines and the MySQL database, and it still happens in the latest KFP versions. That is, this issue always happens when you see errors like ... See the issues that users are raising when they have database connection issues:
From Kubeflow Pipelines' perspective, we should probably make database network issues fail more catastrophically, so that users are not left with a semi-working Kubeflow Pipelines and no understanding of why things are not working. For users, the solution is to fix your cluster's network access to your MySQL, which can be quite hard to debug, as network issues usually are. If your MySQL is a managed service (like AWS RDS or Google Cloud SQL), look for VPC routing issues such as asymmetric routing. For example, I had a case where the cluster accessed the database via an AWS VPN Gateway, but there was no route back from the database to the cluster. Note that MySQL sometimes initiates new TCP connections back to the client from the "server side", which will obviously fail in the above case where there is no route back to the client.
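As a first debugging step for the forward direction of the network path described above, you can check plain TCP reachability from inside the cluster to the MySQL endpoint. This is a generic sketch (the hostname below is a hypothetical example, not a KFP-defined name); run it from a debug pod in the cluster.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical example endpoint -- substitute your own MySQL host/port:
# can_reach("mysql.kubeflow.svc.cluster.local", 3306)
```

Note this only exercises the forward route; per the comment above, asymmetric routing (no route back from the database to the cluster) can still break things even when this check passes.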
Closing this issue, as no users have reported this error since 2022. Feel free to reopen it if the issue remains in the latest releases. /close
@rimolive: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What steps did you take:
What happened:
Sometimes, after a job is submitted and successfully executed, the KFP UI fails to display run details (an empty page with a spinning wheel) via the URL https://${CLUSTER_URI}/_/pipeline/?ns=keshi#/runs/details/${RUN_ID}, for every new run.
Additionally, the problem is that the KFP DB doesn't have that information, because the persistence agent stopped syncing new workflows.
One strange thing is that, looking at the persistence agent logs, it still loops properly, listing all the old workflows, but it no longer detects newly created workflows.
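The symptom above (old workflows keep getting relisted, new ones never appear) is consistent with a list/watch-style sync loop whose watch has silently died while relists are served from a stale snapshot. The following is a toy model of that failure mode, purely illustrative and not actual KFP or persistence agent code:

```python
def sync_cycle(relist, drain_watch, handled):
    """One cycle of a toy list/watch sync loop (illustrative only)."""
    for wf in relist():        # periodic relist: keeps "working" normally
        handled.add(wf)
    for wf in drain_watch():   # watch events: should deliver new workflows
        handled.add(wf)

# Simulate the failure mode: relists are served from a stale snapshot and
# the watch has silently died (yields nothing), so a workflow created after
# the snapshot never reaches `handled` -- matching the logs described above.
stale_snapshot = ["old-run-1", "old-run-2"]
handled = set()
for _ in range(5):
    sync_cycle(lambda: stale_snapshot, lambda: [], handled)
# "new-run" exists on the cluster, but is never handled.
```

Under this model, every relist "succeeds" and the agent looks perfectly healthy in its logs, which is exactly why a naive liveness probe doesn't catch it.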
What did you expect to happen:
PA should keep syncing workflows.
Environment:
How did you deploy Kubeflow Pipelines (KFP)?
Kubeflow deployment
KFP version:
I don't remember clearly, but I think I've seen the issue on rare occasions from 0.2.0 to 0.5.0.
/kind bug