problem when deploying kubeflow 0.4.0 #676

Closed
hamedhsn opened this issue Jan 14, 2019 · 17 comments

@hamedhsn
Contributor

I am trying to deploy 0.4.0 in my cluster, and the pipeline component does not seem to be deployed correctly.
My setup: deploying on EKS, with ksonnet==0.13.0

list of my pods:

ml-pipeline-scheduledworkflow-5f47df7d54-4s45c           1/1       Running            0          4d
ml-pipeline-persistenceagent-5b7d65db44-2w8pg            0/1       CrashLoopBackOff   1137       4d
ml-pipelines-load-samples-dxg59                          0/1       Error              0          4d
ml-pipeline-ui-784bc748c4-dj8qh                          1/1       Running            0          4d

log from load-samples:

goroutine 1 [running]:
github.com/kubeflow/pipelines/vendor/github.com/golang/glog.stacks(0xc0003f5d00, 0xc000560000, 0x68, 0x9b)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/golang/glog/glog.go:769 +0xd4
github.com/kubeflow/pipelines/vendor/github.com/golang/glog.(*loggingT).output(0x263bee0, 0xc000000003, 0xc00001e790, 0x239a6b6, 0x8, 0x128, 0x0)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/golang/glog/glog.go:720 +0x329
github.com/kubeflow/pipelines/vendor/github.com/golang/glog.(*loggingT).printf(0x263bee0, 0x3, 0x15d6e4d, 0x2, 0xc000145a70, 0x1, 0x1)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/kubeflow/pipelines/vendor/github.com/golang/glog.Fatalf(0x15d6e4d, 0x2, 0xc000145a70, 0x1, 0x1)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/golang/glog/glog.go:1148 +0x67
github.com/kubeflow/pipelines/backend/src/common/util.TerminateIfError(0x17814e0, 0xc0005e4050)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:296 +0x79
main.initMysql(0xc0003486c8, 0x5, 0x53d1ac1000, 0x0, 0x0)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:204 +0x2c6
main.initDBClient(0x53d1ac1000, 0x15)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:137 +0x54c
main.(*ClientManager).init(0xc000145d50)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:103 +0x80
main.newClientManager(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:242 +0x90
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/main.go:53 +0x88

and when I open the UI:
I see `Error: failed to retrieve list of pipelines. Click Details for more information.`
@hamedhsn hamedhsn changed the title issue when deploying 0.4 issue when deploying kubeflow 0.4.0 Jan 14, 2019
@hamedhsn hamedhsn changed the title issue when deploying kubeflow 0.4.0 problem when deploying kubeflow 0.4.0 Jan 14, 2019
@gaoning777
Contributor

The upcoming Kubeflow release will include pipeline 0.1.7, which has the fix for the load-samples bug in Kubeflow 0.4.0 (pipeline 0.1.6).
Fixed in #615
@IronPan @neuromage correct me if it is not the case.

@hamedhsn
Contributor Author

@gaoning777 when will you release the new version?

@gaoning777
Contributor

The current one-click deployment should install pipeline version 0.1.7, which contains the fix.
Go to https://deploy.kubeflow.cloud/#/deploy and choose Kubeflow version 0.4.1.

@hamedhsn
Contributor Author

hamedhsn commented Jan 17, 2019

We are using EKS, so I have to deploy it through the ks component in Kubeflow. What is the best way to upgrade that?

@gaoning777
Contributor

gaoning777 commented Jan 17, 2019

I guess you have been following the instructions here? If you can deploy Kubeflow again, replace "export KUBEFLOW_TAG=v0.3.5" with "export KUBEFLOW_TAG=v0.4.1".

@IronPan
Member

IronPan commented Jan 17, 2019

@hamedhsn Could you try https://github.com/kubeflow/kubeflow/releases/tag/v0.4.1
I think 0.4.0 doesn't have all the necessary fixes.

@hamedhsn
Contributor Author

@IronPan All the containers are running with v0.4.1, and the original problem seems to be resolved.

One more question: I port-forward the UI, but when I open the pipeline UI I see this at the top:
Error: failed to retrieve list of pipelines. Click Details for more information.

When I try to create an experiment, I get this:
Error occured while trying to proxy to: localhost:8081/apis/v1beta1/experiments

@hamedhsn
Contributor Author

Also, after a while, I see that the pipeline pod is failing as well:

> kubectl logs ml-pipeline-c97d95f46-swf8s -n kubeflow

F0121 19:49:03.763668       1 error.go:296] dial tcp 172.20.60.82:3306: connect: connection timed out
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc0001d7f00, 0xc000532a00, 0x66, 0x9b)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:769 +0xd4
github.com/golang/glog.(*loggingT).output(0x254d200, 0xc000000003, 0xc00029c630, 0x22abd72, 0x8, 0x128, 0x0)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:720 +0x329
github.com/golang/glog.(*loggingT).printf(0x254d200, 0x3, 0x15dea7e, 0x2, 0xc000141a28, 0x1, 0x1)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(0x15dea7e, 0x2, 0xc000141a28, 0x1, 0x1)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:1148 +0x67
github.com/kubeflow/pipelines/backend/src/common/util.TerminateIfError(0x178bbe0, 0xc000382000)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:296 +0x79
main.initMysql(0xc000352738, 0x5, 0x53d1ac1000, 0x0, 0x0)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:211 +0x2c6
main.initDBClient(0x53d1ac1000, 0x15)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:143 +0x588
main.(*ClientManager).init(0xc000141d20)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:108 +0x80
main.newClientManager(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/client_manager.go:249 +0x90
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/apiserver/main.go:54 +0x88
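
The fatal line in this log is a plain TCP timeout to the MySQL service on port 3306, so one way to narrow it down is to test that connection directly from a pod in the same cluster. The following is a minimal sketch, not part of Kubeflow; the function name `can_connect` and the 3-second timeout are illustrative assumptions, and the address in the usage comment is the one from the log above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call from a pod in the cluster (address taken from the log above):
#   can_connect("172.20.60.82", 3306)
```

If this returns False from inside the cluster, the problem is likely with the database pod or its Service rather than with the API server itself.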

@hamedhsn
Contributor Author

log from ml-pipeline-persistenceagent pod

> kubectl logs ml-pipeline-persistenceagent-5669f69cdd-9ln6m -n kubeflow

W0124 12:03:42.356695       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-01-24T12:05:58Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp 172.20.34.212:8888: i/o timeout: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp 172.20.34.212:8888: i/o timeout"
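
The persistence agent gives up after repeatedly polling the API server's healthz endpoint. To reproduce that check by hand, a small retry loop along these lines can help. This is a sketch under assumptions: the helper names, the attempt count, and the delay are made up for illustration and are not the values the agent uses; only the URL comes from the log above.

```python
import time
import urllib.request
from typing import Callable

# URL copied from the persistence agent log above.
HEALTHZ_URL = "http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz"


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,   # assumed value, not the agent's real setting
    delay: float = 5.0,   # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)
    return False


# Example call from a pod in the cluster:
#   wait_until_healthy(lambda: http_ok(HEALTHZ_URL))
```

Passing the probe in as a callable keeps the retry logic independent of the HTTP call, so the loop can be exercised without a cluster.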

@gaoning777
Contributor

@IronPan @neuromage can you take a look at the apiserver failing in 0.4.1? Thanks

@neuromage
Contributor

/assign @neuromage

@neuromage
Contributor

@hamedhsn Sorry for the trouble, but I can't reproduce this error. I suspect something went wrong with the db when you upgraded to 0.4.1. Can you try to delete everything with kubectl delete namespace kubeflow and re-install? And then port-forward again with:

kubectl -n kubeflow port-forward $(kubectl -n kubeflow get pods -l app=ml-pipeline-ui -o jsonpath='{.items[0].metadata.name}') 8888:3000

@hamedhsn
Contributor Author

It turned out I had a problem with the PVCs: minio and mysql were not running properly, and that caused crashes everywhere. I configured EFS and created the PVCs on it, and everything seems to be working now.
It might be good to have some instructions, and maybe parameters to set, so that it works on AWS straight away. I will create a PR for that later.
Thanks for the help, guys.

@vackysh

vackysh commented May 2, 2019

Hi @hamedhsn: I have a similar issue where the PVCs, minio and mysql are not working properly.
Can you please provide a detailed solution for this, i.e. how to configure EFS and create the PVC?

@hamedhsn
Contributor Author

hamedhsn commented May 2, 2019

Hi @vackysh
For EFS, you need to create the file system first, then use efs-provisioner (https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs) to expose EFS as a PV in k8s, and then create a PVC on it. Once the PVC is ready, change the ks files for minio and mysql in the pipeline package so that they use your existing PVC: remove the part that creates the PVC and adjust the name of the PVC that is used. See https://github.com/kubeflow/kubeflow/blob/master/kubeflow/pipeline/mysql.libsonnet#L80
Hope that helps.
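
For the "create the PVC" step, the claim can be written as a JSON manifest and piped to kubectl. The sketch below only builds that manifest. Several values are assumptions to adapt: the StorageClass name `aws-efs` must match whatever your efs-provisioner deployment defines, the `10Gi` request is a placeholder for the required size field, and the function name is made up. The claim name `mysql-pv-claim` is the one the pipeline's mysql component refers to in this thread.

```python
import json


def efs_pvc_manifest(
    name: str,
    namespace: str = "kubeflow",
    storage_class: str = "aws-efs",  # assumption: must match your provisioner's StorageClass
    size: str = "10Gi",              # assumption: placeholder for the required request field
) -> dict:
    """Build a PersistentVolumeClaim manifest bound to an EFS-backed StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            # EFS is a shared file system, so the claim asks for ReadWriteMany.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
        },
    }


# kubectl accepts JSON as well as YAML, so the output can be applied directly.
print(json.dumps(efs_pvc_manifest("mysql-pv-claim"), indent=2))
```

Saved as a script (the file name `make_pvc.py` is hypothetical), it could be used as `python make_pvc.py | kubectl apply -f -`. A second claim for minio would be built the same way with its own name.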

@vackysh

vackysh commented May 3, 2019

Hi @hamedhsn ,

Thanks for your response.

I followed steps similar to the ones you described above, but I am getting the error below:

ERRO[0086] (Will retry) Component pipeline apply failed; Error: handle object: patching object from cluster: merging object with existing state: PersistentVolumeClaim "mysql-pv-claim" is invalid: spec: Forbidden: is immutable after creation except resources.requests for bound claims filename="ksonnet/ksonnet.go:174"

Can you please help me resolve this issue?

Regards,
Varun

@hamedhsn
Contributor Author

hamedhsn commented May 3, 2019

It seems that ksonnet is trying to recreate the PVC.
Comment out or remove the PVC creation in minio.libsonnet: //$.parts(namespace).pvc,
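
If you would rather script that edit than do it by hand, the idea can be sketched as a small text transformation. This is an illustration only: the function name is made up, and the sample libsonnet snippet in the usage note is hypothetical, not the real file contents. It relies on `//` starting a line comment in Jsonnet.

```python
def comment_out_pvc(source: str, target: str = "$.parts(namespace).pvc,") -> str:
    """Prefix every line that starts with `target` (after indentation) with `//`."""
    out = []
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith(target):
            # Keep the original indentation and comment out the rest of the line.
            indent = line[: len(line) - len(stripped)]
            out.append(f"{indent}//{stripped}")
        else:
            # Lines that are already commented out do not match and pass through.
            out.append(line)
    return "".join(out)
```

For example, given a hypothetical fragment such as `"all:: [\n  $.parts(namespace).pvc,\n  $.parts(namespace).service,\n]\n"`, only the `pvc` entry is commented out and the other entries are left untouched. Running the function a second time changes nothing.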
