New runs' details page always loading #3763

Closed
Bobgy opened this issue May 14, 2020 · 33 comments

@Bobgy
Contributor

Bobgy commented May 14, 2020

What steps did you take:

  • Deploy a KFP cluster
  • Wait for a long time (likely something that happens during this period is required, e.g. upgrading the cluster's Kubernetes version)

What happened:

Sometimes, after a job is submitted and successfully executed, the KFP UI fails to display run details (an empty page with a spinning wheel) at https://${CLUSTER_URI}/_/pipeline/?ns=keshi#/runs/details/${RUN_ID}. This happens for every new run.

Additionally, the underlying problem is that the KFP DB doesn't have that information, because the persistence agent stopped syncing new workflows.
One strange thing: looking at the persistence agent logs, it still loops properly, listing all the old workflows, but it no longer detects newly created workflows.

What did you expect to happen:

The persistence agent (PA) should keep syncing workflows.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?

Kubeflow deployment

KFP version:
I don't remember clearly, but I think I've seen the issue on rare occasions from 0.2.0 to 0.5.0.

/kind bug

@Bobgy
Contributor Author

Bobgy commented May 14, 2020

/assign @Bobgy
/cc @IronPan @jingzhang36
Recording a known issue here.

@Bobgy Bobgy added status/triaged Whether the issue has been explicitly triaged priority/p2 labels May 14, 2020
@Bobgy
Contributor Author

Bobgy commented May 14, 2020

Workaround: run kubectl delete pod ml-pipeline-persistenceagent-xxxxxxx-xxxx -n kubeflow to restart the persistence agent.

@Bobgy
Contributor Author

Bobgy commented May 14, 2020

I'm thinking that even if we can't fix the issue directly, adding a liveness probe to the persistence agent would be the next best thing. However, the persistence agent doesn't simply hang; it keeps working properly, just without receiving new information.

If anyone has ideas for implementing a liveness probe for this situation, that would be awesome.
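For the record, one possible shape for such a probe — a minimal sketch, not anything in the KFP codebase; the port, path, and staleness threshold are all assumptions — is a watchdog that records when the sync loop last observed a workflow and fails a health endpoint once that timestamp goes stale. As noted above, a naive version like this would false-positive on an idle cluster with genuinely no new workflows:

```go
// Hypothetical watchdog sketch: fail /healthz when the persistence agent has
// not observed any workflow event for too long. Not KFP's actual code.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

var lastEventUnix atomic.Int64 // updated by the sync loop on every observed workflow

// markEvent would be called from the agent's informer/sync loop.
func markEvent() { lastEventUnix.Store(time.Now().Unix()) }

func main() {
	markEvent()                         // start healthy
	const staleAfter = 10 * time.Minute // assumed threshold

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		age := time.Since(time.Unix(lastEventUnix.Load(), 0))
		if age > staleAfter {
			// Kubernetes restarts the pod when the liveness probe fails.
			http.Error(w, fmt.Sprintf("no workflow event for %v", age), http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	http.ListenAndServe(":8080", nil)
}
```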

@Bobgy Bobgy changed the title [Persistence Agent] In rare occasions, PA stops syncing new workflows to DB UI fails to display run details -- empty page with a spinning wheel Jun 1, 2020
@Bobgy
Contributor Author

Bobgy commented Jun 1, 2020

One extra data point: we got one more external report about this.

@Bobgy Bobgy changed the title UI fails to display run details -- empty page with a spinning wheel UI fails to display every new runs' details -- empty page with a spinning wheel Jun 1, 2020
@Bobgy Bobgy changed the title UI fails to display every new runs' details -- empty page with a spinning wheel New run's details page always empty Jun 1, 2020
@Bobgy Bobgy changed the title New run's details page always empty New runs' details page always empty Jun 1, 2020
@rmgogogo
Contributor

rmgogogo commented Jun 9, 2020

Copying notes here:

"One strange thing: looking at the persistence agent logs, it still loops properly, listing all the old workflows, but it no longer detects newly created workflows."

"it happens after the pod is rescheduled to another node pool"

@Bobgy
Contributor Author

Bobgy commented Jun 11, 2020

The other report has a different root cause from this one.
Making a summary here first:

  • The symptom: "new runs' details page always empty" means the persistence agent is not syncing workflows to the KFP DB.
  • The root cause can vary.
    • The reason listed above is one possibility; it can be worked around by restarting the persistence agent.
    • Another possibility is that there are simply too many workflows in the cluster, so many that the Kubernetes API server starts to crash. In this case, the suggested fix is to set a shorter workflow TTL so that the total workflow count doesn't grow too large (see the sketch below).
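For illustration, a minimal sketch of setting such a TTL through Argo's Go types (the import path assumes Argo Workflows v3, and the 24-hour value is an arbitrary example). The same field can be set as spec.ttlStrategy.secondsAfterCompletion in workflow YAML:

```go
// Sketch: give a workflow a TTL so completed Workflow objects are
// garbage-collected and the total count in the cluster stays bounded.
package workflowttl

import (
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// WithTTL sets a 24-hour post-completion TTL (an example value; tune it).
func WithTTL(wf *wfv1.Workflow) {
	ttl := int32(24 * 60 * 60)
	wf.Spec.TTLStrategy = &wfv1.TTLStrategy{
		SecondsAfterCompletion: &ttl,
	}
}
```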

@jingzhang36
Contributor

A quick question: does the empty details page usually turn normal after a while, or does it stay empty?

@Bobgy
Contributor Author

Bobgy commented Jun 18, 2020

@jingzhang36 No, it doesn't recover by itself.

@jingzhang36
Contributor

@jingzhang36 No, it doesn't recover by itself.

Then, do you still have the instance where this issue happens?

@Bobgy
Copy link
Contributor Author

Bobgy commented Jun 18, 2020

I don't; I can ping you the next time I reproduce it.

@Bobgy Bobgy changed the title New runs' details page always empty New runs' details page always loading Jul 1, 2020
@stale

stale bot commented Sep 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 30, 2020
@Bobgy
Contributor Author

Bobgy commented Sep 30, 2020

/frozen

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 30, 2020
@stale

stale bot commented Dec 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Bobgy
Contributor Author

Bobgy commented Jan 23, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale The issue / pull request is stale, any activities remove this label. labels Jan 23, 2021
@Bobgy
Contributor Author

Bobgy commented Jan 23, 2021

I no longer see this happening.

@daikeshi
Contributor

daikeshi commented Jun 4, 2021

@Bobgy I'm curious if you have any updates on this issue. After we upgraded our cluster to KFP 1.3, this issue seems to occur more often, especially when the cluster is busy. When it happened, the run status was shown as unknown on the experiments page, and the run details page was empty with a spinning wheel.
(screenshot: kfp_ui_issue)

It can be fixed by deleting the ml-pipeline-persistenceagent pod in the kubeflow namespace. I also tried increasing the resource configuration for ml-pipeline-persistenceagent. That seems to help (it's less frequent now), but it can still occur from time to time. Do you have more info regarding this issue, or is there anything we can help look into?

@Bobgy
Contributor Author

Bobgy commented Jun 4, 2021

@daikeshi I no longer see this issue after my last post here. My suspicion is that this might have something to do with the controller-runtime version.

Can you try upgrading the persistence agent and see if there's any change? (You can check the change history to find out in which version we upgraded controller-runtime.)

@daikeshi
Contributor

daikeshi commented Jun 4, 2021

Hmm, we are using gcr.io/ml-pipeline/persistenceagent:1.3.0. Are you referring to the controller-runtime lib version here? It seems it hasn't been changed since v1.3.0.

@Bobgy
Contributor Author

Bobgy commented Jun 5, 2021

Yes, I had the wrong impression. There's a pending PR that updates this lib and the k8s client; that might help.

Based on my investigation, this seems to be a problem with the controller boilerplate code or the library.

@daikeshi
Contributor

daikeshi commented Jun 5, 2021

@Bobgy that's awesome! Would you mind sharing the link to that new PR, so I can keep an eye on it when it gets merged and released? Thank you!

@Bobgy
Contributor Author

Bobgy commented Jun 5, 2021

Sure, it's #5792

@kim-sardine
Contributor

kim-sardine commented Jun 17, 2021

I'm having the same issue, and I solved it by deleting ml-pipeline-persistenceagent.

When the problem occurred, I found the log message Unknown node phase: undefined in Chrome DevTools' console,
which came from Status.tsx or StatusUtils.ts.

I also found that Kubeflow Pipelines' getRun API doesn't have a status property in its response
(found by accessing https://ENDPOINT/pipeline/apis/v1beta1/runs/RUN_ID).

After I re-created the ml-pipeline-persistenceagent pod, the getRun API passed the status property correctly.
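For reference, the same check can be scripted. A minimal sketch, assuming the endpoint is reachable without extra auth headers (real deployments often require them), that prints whether run.status is populated:

```go
// Sketch: query the v1beta1 getRun endpoint and report whether the run's
// status field is populated. Usage: go run main.go ENDPOINT RUN_ID
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	url := fmt.Sprintf("https://%s/pipeline/apis/v1beta1/runs/%s", os.Args[1], os.Args[2])
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var body struct {
		Run struct {
			Status string `json:"status"`
		} `json:"run"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if body.Run.Status == "" {
		fmt.Println("status missing -- persistence agent is likely stuck")
	} else {
		fmt.Println("status:", body.Run.Status)
	}
}
```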

@Bobgy
Contributor Author

Bobgy commented Jul 31, 2021

Yes, it's expected that the status property is empty when ml-pipeline-persistenceagent is stuck, because it stops syncing status from the Argo workflow to the KFP DB.
The root cause is still in the persistence agent. Any insights there, or ideas for detecting this problem and wiring it up as a liveness hook (when the liveness hook fails, the server restarts), would help a lot.

@kvamshi

kvamshi commented Feb 1, 2022

Deleting the ml-pipeline-persistenceagent pod is not helping.
Our KFP version is 1.4.1, installed on GCP as part of AI Platform Pipelines.

@kvamshi

kvamshi commented Feb 1, 2022

@Bobgy How many workflows is too many? It is supposed to scale horizontally, so what is the bottleneck here?
Are you referring to running workflows, or also the finished ones?

@Bobgy
Contributor Author

Bobgy commented Feb 1, 2022

cc @zijianjoy @chensun

@chensun
Member

chensun commented Feb 1, 2022

Deleting the ml-pipeline-persistenceagent pod is not helping.
Our KFP version is 1.4.1, installed on GCP as part of AI Platform Pipelines.

Per #3763 (comment), there was a fix (#5792) for this issue, which was released in 1.7.0.
Can you please try upgrading your deployment and see if it helps?

@kvamshi

kvamshi commented Feb 1, 2022

We did. It helped. TY !!

@kvamshi

kvamshi commented Feb 4, 2022

@Bobgy It is happening again, all the time.
Below are logs from ml-pipeline and ml-pipeline-persistence-agent.

Here are logs from ml-pipeline:

All of the following at error severity, 2022-02-03 23:28:51.209 PST:

github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:260
main.apiServerInterceptor
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/interceptor.go:32
github.com/kubeflow/pipelines/backend/api/go_client._ReportService_ReportWorkflow_Handler
	/go/src/github.com/kubeflow/pipelines/backend/api/go_client/report.pb.go:339
google.golang.org/grpc.(*Server).processUnaryRPC
	/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1210
google.golang.org/grpc.(*Server).handleStream
	/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1533
google.golang.org/grpc.(*Server).serveStreams.func1.2
	/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:871
runtime.goexit

Here is the log from ml-pipeline-persistence-agent:

Error
2022-02-03 23:24:22.011 PSTtime="2022-02-04T07:24:22Z" level=error msg="Permanent failure while syncing resource (default/pipeline-mhp6s): CustomError (code: 1): Syncing Workflow (pipeline-mhp6s): permanent failure: CustomError (code: 1): Error while reporting workflow resource (code: NotFound, message: Report workflow failed.: NotFoundError: Failed to add PersistedFinalState label to workflow pipeline-mhp6s: workflows.argoproj.io "pipeline-mhp6s" not found): rpc error: code = NotFound desc = Report workflow failed.: NotFoundError: Failed to add PersistedFinalState label to workflow pipeline-mhp6s: workflows.argoproj.io "pipeline-mhp6s" not found, &Workflow{ObjectMeta:{pipeline-mhp6s pipeline- default /apis/argoproj.io/v1alpha1/namespaces/default/workflows/pipeline-mhp6s f5ae9bc3-1ea4-43a5-84a5-e87480cf62ec 1726982 6 2022-02-03 05:02:33 +0000 UTC map[pipeline/persistedFinalState:true pipeline/runid:b8be3778-8764-4d74-b64d-5fb2aaaa65d3 pipelines.kubeflow.org/kfp_sdk_version:1.8.1 workflows.argoproj.io/completed:true workflows.argoproj.io/phase:Succeeded] map[pipelines.kubeflow.org/kfp_sdk_version:1.8.1 pipelines.kubeflow.org/pipeline_compilation_time:2022-02-03T05:02:33.431281 pipelines.kubeflow.org/pipeline_spec:{"name": "Pipeline"} pipelines.kubeflow.org/run_name:pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling] [] [] [{workflow-controller Update argoproj.io/v1alpha1 2022-02-03 05:05:36 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{"f:workflows.argoproj.io/completed":{},"f:workflows.argoproj.io/phase":{}}},"f:status":{"f:artifactRepositoryRef":{},"f:conditions":{},"f:finishedAt":{},"f:nodes":{},"f:phase":{},"f:progress":{},"f:resourcesDuration":{},"f:startedAt":{}}}} {apiserver Update argoproj.io/v1alpha1 2022-02-03 05:05:37 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:pipelines.kubeflow.org/kfp_sdk_version":{},"f:pipelines.kubeflow.org/pipeline_compilation_time":{},"f:pipelines.kubeflow.org/pipeline_spec":{},"f:pipelines.kubeflow.org/run_name":{}},"f:generateName":{},"f:labels":{".":{},"f:pipeline/persistedFinalState":{},"f:pipeline/runid":{},"f:pipelines.kubeflow.org/kfp_sdk_version":{}}},"f:spec":{".":{},"f:arguments":{},"f:entrypoint":{},"f:podMetadata":{},"f:serviceAccountName":{},"f:templates":{}},"f:status":{}}}]},Spec:WorkflowSpec{Templates:[]Template{Template{Name:pipeline,Inputs:Inputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},},Outputs:Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},NodeSelector:map[string]string{},Affinity:nil,Metadata:Metadata{Annotations:map[string]string{sidecar.istio.io/inject: false,},Labels:map[string]string{pipelines.kubeflow.org/cache_enabled: 
true,},},Daemon:nil,Steps:[]ParallelSteps{},Container:nil,Script:nil,Resource:nil,DAG:&DAGTemplate{Target:,Tasks:[]DAGTask{DAGTask{Name:pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling,Template:pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling,Arguments:Arguments{Parameters:[]Parameter{},Artifacts:[]Artifact{},},TemplateRef:nil,Dependencies:[],WithItems:[]Item{},WithParam:,WithSequence:nil,When:,ContinueOn:nil,OnExit:,Depends:,Hooks:LifecycleHooks{},},DAGTask{Name:update-workflow-as-success,Template:update-workflow-as-success,Arguments:Arguments{Parameters:[]Parameter{},Artifacts:[]Artifact{},},TemplateRef:nil,Dependencies:[pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling],WithItems:[]Item{},WithParam:,WithSequence:nil,When:,ContinueOn:nil,OnExit:,Depends:,Hooks:LifecycleHooks{},},},FailFast:nil,},Suspend:nil,Volumes:[]Volume{},InitContainers:[]UserContainer{},Sidecars:[]UserContainer{},ArchiveLocation:nil,ActiveDeadlineSeconds:,RetryStrategy:nil,Parallelism:nil,Tolerations:[]Toleration{},SchedulerName:,PriorityClassName:,Priority:nil,ServiceAccountName:,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,Metrics:nil,Synchronization:nil,Memoize:nil,Timeout:,Data:nil,ContainerSet:nil,FailFast:nil,},Template{Name:pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling,Inputs:Inputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},},Outputs:Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},NodeSelector:map[string]string{},Affinity:nil,Metadata:Metadata{Annotations:map[string]string{pipelines.kubeflow.org/arguments.parameters: {"env": "prod", "is_prereq": "False", "task_id": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling_scheduling", "workflow_id": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling"},pipelines.kubeflow.org/component_ref: {"digest": "4855580cf7de83bf122e3c55e3b7a016fd792ad4cb718d7f09bed5487dcb351c", "url": "/training-platform/barista/training/workflow/kfp/kfp_process_task_component.yaml"},pipelines.kubeflow.org/component_spec: {"implementation": {"container": {"args": ["process_task", "-e", {"inputValue": "env"}, "-wfid", {"inputValue": "workflow_id"}, "-tid", {"inputValue": "task_id"}, "-prereq", {"inputValue": "is_prereq"}], "image": "gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2"}}, "inputs": [{"name": "workflow_id"}, {"name": "task_id"}, {"name": "env"}, {"name": "is_prereq"}], "name": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling_scheduling"},pipelines.kubeflow.org/max_cache_staleness: P0D,sidecar.istio.io/inject: false,},Labels:map[string]string{pipelines.kubeflow.org/cache_enabled: true,pipelines.kubeflow.org/enable_caching: true,pipelines.kubeflow.org/kfp_sdk_version: 1.8.1,pipelines.kubeflow.org/pipeline-sdk-type: kfp,},},Daemon:nil,Steps:[]ParallelSteps{},Container:&v1.Container{Name:,Image:gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2,Command:[],Args:[process_task -e prod -wfid pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling -tid 
pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling_scheduling -prereq False],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:SNAP_BARISTA_ENV,Value:ad_ranking,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{150 -3} {} 150m DecimalSI},memory: {{1500 6} {} 1500M DecimalSI},},Requests:ResourceList{cpu: {{70 -3} {} 70m DecimalSI},memory: {{400 6} {} 400M DecimalSI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:host-docker-sock,ReadOnly:false,MountPath:/var/run/docker.sock,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:,ImagePullPolicy:,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},Script:nil,Resource:nil,DAG:nil,Suspend:nil,Volumes:[]Volume{{host-docker-sock {&HostPathVolumeSource{Path:/var/run/docker.sock,Type:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}},},InitContainers:[]UserContainer{},Sidecars:[]UserContainer{},ArchiveLocation:nil,ActiveDeadlineSeconds:,RetryStrategy:nil,Parallelism:nil,Tolerations:[]Toleration{},SchedulerName:,PriorityClassName:,Priority:nil,ServiceAccountName:,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,Metrics:nil,Synchronization:nil,Memoize:nil,Timeout:,Data:nil,ContainerSet:nil,FailFast:nil,},Template{Name:update-workflow-as-success,Inputs:Inputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},},Outputs:Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},NodeSelector:map[string]string{},Affinity:nil,Metadata:Metadata{Annotations:map[string]string{pipelines.kubeflow.org/arguments.parameters: {"env": "prod", "workflow_id": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling"},pipelines.kubeflow.org/component_ref: {"digest": "adfa904281fe5c1168b52253d2c48b26f07347dbc1259190ba5c1bd0b2f3f6a8", "url": "/training-platform/barista/training/workflow/kfp/kfp_update_workflow_status_component.yaml"},pipelines.kubeflow.org/component_spec: {"implementation": {"container": {"args": ["update_workflow_status", "-e", {"inputValue": "env"}, "-wfid", {"inputValue": "workflow_id"}], "image": "gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2"}}, "inputs": [{"name": "workflow_id"}, {"name": "env"}], "name": "Update workflow as Success"},pipelines.kubeflow.org/max_cache_staleness: P0D,sidecar.istio.io/inject: false,},Labels:map[string]string{pipelines.kubeflow.org/cache_enabled: true,pipelines.kubeflow.org/enable_caching: true,pipelines.kubeflow.org/kfp_sdk_version: 1.8.1,pipelines.kubeflow.org/pipeline-sdk-type: kfp,},},Daemon:nil,Steps:[]ParallelSteps{},Container:&v1.Container{Name:,Image:gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2,Command:[],Args:[update_workflow_status -e prod -wfid pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:SNAP_BARISTA_ENV,Value:ad_ranking,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{100 -3} {} 100m DecimalSI},},Requests:ResourceList{cpu: {{100 -3} {} 100m 
DecimalSI},},},VolumeMounts:[]VolumeMount{},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:,ImagePullPolicy:,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},Script:nil,Resource:nil,DAG:nil,Suspend:nil,Volumes:[]Volume{},InitContainers:[]UserContainer{},Sidecars:[]UserContainer{},ArchiveLocation:nil,ActiveDeadlineSeconds:,RetryStrategy:nil,Parallelism:nil,Tolerations:[]Toleration{},SchedulerName:,PriorityClassName:,Priority:nil,ServiceAccountName:,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,Metrics:nil,Synchronization:nil,Memoize:nil,Timeout:,Data:nil,ContainerSet:nil,FailFast:nil,},},Entrypoint:pipeline,Arguments:Arguments{Parameters:[]Parameter{},Artifacts:[]Artifact{},},ServiceAccountName:pipeline-runner,Volumes:[]Volume{},VolumeClaimTemplates:[]PersistentVolumeClaim{},Parallelism:nil,ArtifactRepositoryRef:nil,Suspend:nil,NodeSelector:map[string]string{},Affinity:nil,Tolerations:[]Toleration{},ImagePullSecrets:[]LocalObjectReference{},HostNetwork:nil,DNSPolicy:nil,DNSConfig:nil,OnExit:,ActiveDeadlineSeconds:nil,Priority:nil,SchedulerName:,PodGC:nil,PodPriorityClassName:,PodPriority:nil,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,TTLStrategy:nil,PodDisruptionBudget:nil,Metrics:nil,Shutdown:,WorkflowTemplateRef:nil,Synchronization:nil,VolumeClaimGC:nil,RetryStrategy:nil,PodMetadata:&Metadata{Annotations:map[string]string{},Labels:map[string]string{pipeline/runid: b8be3778-8764-4d74-b64d-5fb2aaaa65d3,},},TemplateDefaults:nil,},Status:WorkflowStatus{Phase:Succeeded,StartedAt:2022-02-03 05:02:33 +0000 UTC,FinishedAt:2022-02-03 05:05:36 +0000 UTC,Message:,CompressedNodes:,Nodes:Nodes{pipeline-mhp6s: {pipeline-mhp6s pipeline-mhp6s pipeline-mhp6s DAG pipeline nil local/pipeline-mhp6s Succeeded 2022-02-03 05:02:33 +0000 UTC 2022-02-03 05:05:36 +0000 UTC 0 2/2 5m17s*(1 cpu),9m52s*(100Mi memory) nil nil [pipeline-mhp6s-2692902509] [pipeline-mhp6s-34572902] nil nil},pipeline-mhp6s-2692902509: {pipeline-mhp6s-2692902509 pipeline-mhp6s.pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling Pod pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling nil local/pipeline-mhp6s Succeeded pipeline-mhp6s 2022-02-03 05:02:33 +0000 UTC 2022-02-03 05:04:13 +0000 UTC 0 1/1 3m16s*(1 cpu),7m51s*(100Mi memory) nil &Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{Artifact{Name:main-logs,Path:,Mode:nil,From:,ArtifactLocation:ArtifactLocation{ArchiveLogs:nil,S3:&S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,},Key:artifacts/pipeline-mhp6s/2022/02/03/pipeline-mhp6s-2692902509/main.log,},Git:nil,HTTP:nil,Artifactory:nil,HDFS:nil,Raw:nil,OSS:nil,GCS:nil,},GlobalName:,Archive:nil,Optional:false,SubPath:,RecurseMode:false,FromExpression:,},},Result:nil,ExitCode:0,} [pipeline-mhp6s-34572902] [] gke-snapads-kubeflow-kubeflow-workflo-cee5d2f9-5l8g nil nil},pipeline-mhp6s-34572902: {pipeline-mhp6s-34572902 pipeline-mhp6s.update-workflow-as-success update-workflow-as-success Pod update-workflow-as-success nil local/pipeline-mhp6s Succeeded pipeline-mhp6s 2022-02-03 05:04:23 +0000 UTC 
2022-02-03 05:05:26 +0000 UTC 0 1/1 2m1s(1 cpu),2m1s*(100Mi memory) nil &Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{Artifact{Name:main-logs,Path:,Mode:nil,From:,ArtifactLocation:ArtifactLocation{ArchiveLogs:nil,S3:&S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,},Key:artifacts/pipeline-mhp6s/2022/02/03/pipeline-mhp6s-34572902/main.log,},Git:nil,HTTP:nil,Artifactory:nil,HDFS:nil,Raw:nil,OSS:nil,GCS:nil,},GlobalName:,Archive:nil,Optional:false,SubPath:,RecurseMode:false,FromExpression:,},},Result:nil,ExitCode:0,} [] [] gke-snapads-kubeflow-kubeflow-workflo-cee5d2f9-5l8g nil nil},},PersistentVolumeClaims:[]Volume{},Outputs:nil,StoredTemplates:map[string]Template{},OffloadNodeStatusVersion:,ResourcesDuration:ResourcesDuration{cpu: 5m17s,memory: 9m52s,},Conditions:[]Condition{Condition{Type:PodRunning,Status:False,Message:,},Condition{Type:Completed,Status:True,Message:,},},StoredWorkflowSpec:nil,Synchronization:nil,EstimatedDuration:0,Progress:2/2,ArtifactRepositoryRef:default-artifact-repository,},}: rpc error: code = NotFound desc = Report workflow failed.: NotFoundError: Failed to add PersistedFinalState label to workflow pipeline-mhp6s: workflows.argoproj.io "pipeline-mhp6s" not found: CustomError (code: 1): Error while reporting workflow resource (code: NotFound, message: Report workflow failed.: NotFoundError: Failed to add PersistedFinalState label to workflow pipeline-mhp6s: workflows.argoproj.io "pipeline-mhp6s" not found): rpc error: code = NotFound desc = Report workflow failed.: NotFoundError: Failed to add PersistedFinalState label to workflow pipeline-mhp6s: workflows.argoproj.io "pipeline-mhp6s" not found, &Workflow{ObjectMeta:{pipeline-mhp6s pipeline- default /apis/argoproj.io/v1alpha1/namespaces/default/workflows/pipeline-mhp6s f5ae9bc3-1ea4-43a5-84a5-e87480cf62ec 1726982 6 2022-02-03 05:02:33 +0000 UTC map[pipeline/persistedFinalState:true pipeline/runid:b8be3778-8764-4d74-b64d-5fb2aaaa65d3 pipelines.kubeflow.org/kfp_sdk_version:1.8.1 workflows.argoproj.io/completed:true workflows.argoproj.io/phase:Succeeded] map[pipelines.kubeflow.org/kfp_sdk_version:1.8.1 pipelines.kubeflow.org/pipeline_compilation_time:2022-02-03T05:02:33.431281 pipelines.kubeflow.org/pipeline_spec:{"name": "Pipeline"} pipelines.kubeflow.org/run_name:pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling] [] [] [{workflow-controller Update argoproj.io/v1alpha1 2022-02-03 05:05:36 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{"f:workflows.argoproj.io/completed":{},"f:workflows.argoproj.io/phase":{}}},"f:status":{"f:artifactRepositoryRef":{},"f:conditions":{},"f:finishedAt":{},"f:nodes":{},"f:phase":{},"f:progress":{},"f:resourcesDuration":{},"f:startedAt":{}}}} {apiserver Update argoproj.io/v1alpha1 2022-02-03 05:05:37 +0000 UTC FieldsV1 
{"f:metadata":{"f:annotations":{".":{},"f:pipelines.kubeflow.org/kfp_sdk_version":{},"f:pipelines.kubeflow.org/pipeline_compilation_time":{},"f:pipelines.kubeflow.org/pipeline_spec":{},"f:pipelines.kubeflow.org/run_name":{}},"f:generateName":{},"f:labels":{".":{},"f:pipeline/persistedFinalState":{},"f:pipeline/runid":{},"f:pipelines.kubeflow.org/kfp_sdk_version":{}}},"f:spec":{".":{},"f:arguments":{},"f:entrypoint":{},"f:podMetadata":{},"f:serviceAccountName":{},"f:templates":{}},"f:status":{}}}]},Spec:WorkflowSpec{Templates:[]Template{Template{Name:pipeline,Inputs:Inputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},},Outputs:Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},NodeSelector:map[string]string{},Affinity:nil,Metadata:Metadata{Annotations:map[string]string{sidecar.istio.io/inject: false,},Labels:map[string]string{pipelines.kubeflow.org/cache_enabled: true,},},Daemon:nil,Steps:[]ParallelSteps{},Container:nil,Script:nil,Resource:nil,DAG:&DAGTemplate{Target:,Tasks:[]DAGTask{DAGTask{Name:pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling,Template:pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling,Arguments:Arguments{Parameters:[]Parameter{},Artifacts:[]Artifact{},},TemplateRef:nil,Dependencies:[],WithItems:[]Item{},WithParam:,WithSequence:nil,When:,ContinueOn:nil,OnExit:,Depends:,Hooks:LifecycleHooks{},},DAGTask{Name:update-workflow-as-success,Template:update-workflow-as-success,Arguments:Arguments{Parameters:[]Parameter{},Artifacts:[]Artifact{},},TemplateRef:nil,Dependencies:[pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling],WithItems:[]Item{},WithParam:,WithSequence:nil,When:,ContinueOn:nil,OnExit:,Depends:,Hooks:LifecycleHooks{},},},FailFast:nil,},Suspend:nil,Volumes:[]Volume{},InitContainers:[]UserContainer{},Sidecars:[]UserContainer{},ArchiveLocation:nil,ActiveDeadlineSeconds:,RetryStrategy:nil,Parallelism:nil,Tolerations:[]Toleration{},SchedulerName:,PriorityClassName:,Priority:nil,ServiceAccountName:,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,Metrics:nil,Synchronization:nil,Memoize:nil,Timeout:,Data:nil,ContainerSet:nil,FailFast:nil,},Template{Name:pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling,Inputs:Inputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},},Outputs:Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},NodeSelector:map[string]string{},Affinity:nil,Metadata:Metadata{Annotations:map[string]string{pipelines.kubeflow.org/arguments.parameters: {"env": "prod", "is_prereq": "False", "task_id": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling_scheduling", "workflow_id": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling"},pipelines.kubeflow.org/component_ref: {"digest": "4855580cf7de83bf122e3c55e3b7a016fd792ad4cb718d7f09bed5487dcb351c", "url": "/training-platform/barista/training/workflow/kfp/kfp_process_task_component.yaml"},pipelines.kubeflow.org/component_spec: {"implementation": {"container": {"args": ["process_task", "-e", {"inputValue": "env"}, "-wfid", {"inputValue": "workflow_id"}, "-tid", {"inputValue": "task_id"}, "-prereq", {"inputValue": "is_prereq"}], "image": "gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2"}}, 
"inputs": [{"name": "workflow_id"}, {"name": "task_id"}, {"name": "env"}, {"name": "is_prereq"}], "name": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling_scheduling"},pipelines.kubeflow.org/max_cache_staleness: P0D,sidecar.istio.io/inject: false,},Labels:map[string]string{pipelines.kubeflow.org/cache_enabled: true,pipelines.kubeflow.org/enable_caching: true,pipelines.kubeflow.org/kfp_sdk_version: 1.8.1,pipelines.kubeflow.org/pipeline-sdk-type: kfp,},},Daemon:nil,Steps:[]ParallelSteps{},Container:&v1.Container{Name:,Image:gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2,Command:[],Args:[process_task -e prod -wfid pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling -tid pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling_scheduling -prereq False],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:SNAP_BARISTA_ENV,Value:ad_ranking,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{150 -3} {} 150m DecimalSI},memory: {{1500 6} {} 1500M DecimalSI},},Requests:ResourceList{cpu: {{70 -3} {} 70m DecimalSI},memory: {{400 6} {} 400M DecimalSI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:host-docker-sock,ReadOnly:false,MountPath:/var/run/docker.sock,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:,ImagePullPolicy:,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},Script:nil,Resource:nil,DAG:nil,Suspend:nil,Volumes:[]Volume{{host-docker-sock {&HostPathVolumeSource{Path:/var/run/docker.sock,Type:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}},},InitContainers:[]UserContainer{},Sidecars:[]UserContainer{},ArchiveLocation:nil,ActiveDeadlineSeconds:,RetryStrategy:nil,Parallelism:nil,Tolerations:[]Toleration{},SchedulerName:,PriorityClassName:,Priority:nil,ServiceAccountName:,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,Metrics:nil,Synchronization:nil,Memoize:nil,Timeout:,Data:nil,ContainerSet:nil,FailFast:nil,},Template{Name:update-workflow-as-success,Inputs:Inputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},},Outputs:Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},NodeSelector:map[string]string{},Affinity:nil,Metadata:Metadata{Annotations:map[string]string{pipelines.kubeflow.org/arguments.parameters: {"env": "prod", "workflow_id": "pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling"},pipelines.kubeflow.org/component_ref: {"digest": "adfa904281fe5c1168b52253d2c48b26f07347dbc1259190ba5c1bd0b2f3f6a8", "url": "/training-platform/barista/training/workflow/kfp/kfp_update_workflow_status_component.yaml"},pipelines.kubeflow.org/component_spec: {"implementation": {"container": {"args": ["update_workflow_status", "-e", {"inputValue": "env"}, "-wfid", {"inputValue": "workflow_id"}], "image": "gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2"}}, "inputs": [{"name": "workflow_id"}, {"name": "env"}], "name": "Update workflow as Success"},pipelines.kubeflow.org/max_cache_staleness: P0D,sidecar.istio.io/inject: 
false,},Labels:map[string]string{pipelines.kubeflow.org/cache_enabled: true,pipelines.kubeflow.org/enable_caching: true,pipelines.kubeflow.org/kfp_sdk_version: 1.8.1,pipelines.kubeflow.org/pipeline-sdk-type: kfp,},},Daemon:nil,Steps:[]ParallelSteps{},Container:&v1.Container{Name:,Image:gcr.io/snap-ads-debug/training-platform-trainer-bento-processor:20220122-133733-ruizhacky_code_for_multihead_nce_calculation-57e4e8443-rzhang2,Command:[],Args:[update_workflow_status -e prod -wfid pixel_lat_refresh_subbydataset_20220119_2315_71_fixed_eval_only_score_rescaling],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:SNAP_BARISTA_ENV,Value:ad_ranking,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{100 -3} {} 100m DecimalSI},},Requests:ResourceList{cpu: {{100 -3} {} 100m DecimalSI},},},VolumeMounts:[]VolumeMount{},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:,ImagePullPolicy:,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},Script:nil,Resource:nil,DAG:nil,Suspend:nil,Volumes:[]Volume{},InitContainers:[]UserContainer{},Sidecars:[]UserContainer{},ArchiveLocation:nil,ActiveDeadlineSeconds:,RetryStrategy:nil,Parallelism:nil,Tolerations:[]Toleration{},SchedulerName:,PriorityClassName:,Priority:nil,ServiceAccountName:,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,Metrics:nil,Synchronization:nil,Memoize:nil,Timeout:,Data:nil,ContainerSet:nil,FailFast:nil,},},Entrypoint:pipeline,Arguments:Arguments{Parameters:[]Parameter{},Artifacts:[]Artifact{},},ServiceAccountName:pipeline-runner,Volumes:[]Volume{},VolumeClaimTemplates:[]PersistentVolumeClaim{},Parallelism:nil,ArtifactRepositoryRef:nil,Suspend:nil,NodeSelector:map[string]string{},Affinity:nil,Tolerations:[]Toleration{},ImagePullSecrets:[]LocalObjectReference{},HostNetwork:nil,DNSPolicy:nil,DNSConfig:nil,OnExit:,ActiveDeadlineSeconds:nil,Priority:nil,SchedulerName:,PodGC:nil,PodPriorityClassName:,PodPriority:nil,HostAliases:[]HostAlias{},SecurityContext:nil,PodSpecPatch:,AutomountServiceAccountToken:nil,Executor:nil,TTLStrategy:nil,PodDisruptionBudget:nil,Metrics:nil,Shutdown:,WorkflowTemplateRef:nil,Synchronization:nil,VolumeClaimGC:nil,RetryStrategy:nil,PodMetadata:&Metadata{Annotations:map[string]string{},Labels:map[string]string{pipeline/runid: b8be3778-8764-4d74-b64d-5fb2aaaa65d3,},},TemplateDefaults:nil,},Status:WorkflowStatus{Phase:Succeeded,StartedAt:2022-02-03 05:02:33 +0000 UTC,FinishedAt:2022-02-03 05:05:36 +0000 UTC,Message:,CompressedNodes:,Nodes:Nodes{pipeline-mhp6s: {pipeline-mhp6s pipeline-mhp6s pipeline-mhp6s DAG pipeline nil local/pipeline-mhp6s Succeeded 2022-02-03 05:02:33 +0000 UTC 2022-02-03 05:05:36 +0000 UTC 0 2/2 5m17s(1 cpu),9m52s*(100Mi memory) nil nil [pipeline-mhp6s-2692902509] [pipeline-mhp6s-34572902] nil nil},pipeline-mhp6s-2692902509: {pipeline-mhp6s-2692902509 pipeline-mhp6s.pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling Pod pixel-lat-refresh-subbydataset-20220119-2315-71-fixed-eval-only-score-rescaling-scheduling nil local/pipeline-mhp6s Succeeded pipeline-mhp6s 2022-02-03 05:02:33 +0000 UTC 2022-02-03 05:04:13 +0000 UTC 0 1/1 3m16s*(1 cpu),7m51s*(100Mi memory) nil 
&Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{Artifact{Name:main-logs,Path:,Mode:nil,From:,ArtifactLocation:ArtifactLocation{ArchiveLogs:nil,S3:&S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,},Key:artifacts/pipeline-mhp6s/2022/02/03/pipeline-mhp6s-2692902509/main.log,},Git:nil,HTTP:nil,Artifactory:nil,HDFS:nil,Raw:nil,OSS:nil,GCS:nil,},GlobalName:,Archive:nil,Optional:false,SubPath:,RecurseMode:false,FromExpression:,},},Result:nil,ExitCode:0,} [pipeline-mhp6s-34572902] [] gke-snapads-kubeflow-kubeflow-workflo-cee5d2f9-5l8g nil nil},pipeline-mhp6s-34572902: {pipeline-mhp6s-34572902 pipeline-mhp6s.update-workflow-as-success update-workflow-as-success Pod update-workflow-as-success nil local/pipeline-mhp6s Succeeded pipeline-mhp6s 2022-02-03 05:04:23 +0000 UTC 2022-02-03 05:05:26 +0000 UTC 0 1/1 2m1s(1 cpu),2m1s*(100Mi memory) nil &Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{Artifact{Name:main-logs,Path:,Mode:nil,From:,ArtifactLocation:ArtifactLocation{ArchiveLogs:nil,S3:&S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,},Key:artifacts/pipeline-mhp6s/2022/02/03/pipeline-mhp6s-34572902/main.log,},Git:nil,HTTP:nil,Artifactory:nil,HDFS:nil,Raw:nil,OSS:nil,GCS:nil,},GlobalName:,Archive:nil,Optional:false,SubPath:,RecurseMode:false,FromExpression:,},},Result:nil,ExitCode:*0,} [] [] gke-snapads-kubeflow-kubeflow-workflo-cee5d2f9-5l8g nil nil},},PersistentVolumeClaims:[]Volume{},Outputs:nil,StoredTemplates:map[string]Template{},OffloadNodeStatusVersion:,ResourcesDuration:ResourcesDuration{cpu: 5m17s,memory: 9m52s,},Conditions:[]Condition{Condition{Type:PodRunning,Status:False,Message:,},Condition{Type:Completed,Status:True,Message:,},},StoredWorkflowSpec:nil,Synchronization:nil,EstimatedDuration:0,Progress:2/2,ArtifactRepositoryRef:default-artifact-repository,},}: rpc error: code = NotFound desc = Report workflow failed.: NotFoundError: Failed to add PersistedFinalState label to workflow pipeline-mhp6s: workflows.argoproj.io "pipeline-mhp6s" not found"

@kvamshi

kvamshi commented Feb 4, 2022

Here are logs from kube-dns:

2022-02-03 02:16:49.025 PST  Error while fetching metric descriptors for kubedns: googleapi: Error 503: The service is currently unavailable., backendError
Warning  2022-02-03 04:06:44.906 PST  pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Warning  2022-02-03 04:06:44.906 PST  pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Warning  2022-02-03 04:06:54.844 PST  pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Error    2022-02-03 04:06:54.844 PST  pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.Service: Get "https://10.224.0.1:443/api/v1/services?allowWatchBookmarks=true&resourceVersion=2109172&timeoutSeconds=529&watch=true": http2: client connection lost

@thesuperzapper
Member

thesuperzapper commented May 14, 2023

@Bobgy @james-jwu I found that "new runs" hanging is caused by network interruptions between Kubeflow Pipelines and the MySQL database, and still happens in the latest KFP versions.

That is, this issue always happens when you see errors like [mysql] XXXX/XX/XX XX:XX:XX packets.go:36: unexpected EOF in your cache-server and other pods, and goes away when the network issue is resolved.

See the issues users raise when they have database connection issues.

From Kubeflow Pipelines' perspective, we should probably make database network issues fail more catastrophically, so that users are not left with a semi-working Kubeflow Pipelines deployment and no understanding of why things are not working.


For users, the solution is to fix your cluster's network access to your MySQL, which could be quite hard to debug, as network issues usually are.

If your MySQL is a managed service (like AWS RDS or Google Cloud SQL), look for VPC routing issues like asymmetric routing. For example, I had a case where the cluster accessed the database via an AWS VPN Gateway, but there was no route back from the database to the cluster.

Note that MySQL sometimes initiates new TCP connections back to the client from the "server side", which will obviously fail in the above case where it has no route back to the client.
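A quick way to tell a network/routing problem apart from a KFP bug is to ping the database from a pod inside the cluster using the same driver KFP uses (go-sql-driver/mysql, the source of the packets.go:36 errors above). A minimal sketch; the DSN host and credentials below are placeholders:

```go
// Sketch: test in-cluster connectivity to the KFP MySQL database.
// The DSN is a placeholder; substitute your own host and credentials.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(mysql.kubeflow:3306)/mlpipeline")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		log.Fatalf("cannot reach MySQL (network/route problem likely): %v", err)
	}
	log.Println("MySQL reachable")
}
```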

@rimolive
Member

Closing this issue, as no users have reported this error since 2022. Feel free to reopen it if the issue remains in the latest releases.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue, as no users have reported this error since 2022. Feel free to reopen it if the issue remains in the latest releases.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
