[Kubeflow Dex Distribution] KF Pipelines 100% Unusable - MULTIPLE PEOPLE REPORTING #5223

Closed
ReggieCarey opened this issue Mar 1, 2021 · 15 comments

@ReggieCarey

ReggieCarey commented Mar 1, 2021

What steps did you take:

KFP in KF 1.2.0 with Dex on K8s 1.18.9 does not work. I receive an error in the KF dashboard when attempting to view pipelines:

Error: failed to retrieve list of pipelines. Click Details for more information. -> An error occured, no healthy upstream

What happened:

Installed Kubeflow 1.2.0 on-prem as per installation instructions. Any attempt to see pipelines or use pipelines fails.

What did you expect to happen:

I expected to be able to use Pipelines

Environment:

Kubernetes version 1.18.9
Kubeflow version 1.2.0
Installed with Dex, configured after deploy to use LDAP.

The ml-pipeline pod fails to start completely. Its logs (shown below) repeat a MySQL "unexpected EOF" error.

How did you deploy Kubeflow Pipelines (KFP)?

Installed Kubeflow Pipelines as part of Kubeflow installation for on-prem with dex.

KFP version: 1.0.4

KFP SDK version:
I HAVEN'T GOTTEN FAR ENOUGH TO USE THIS!

Anything else you would like to add:

ml-pipeline pod refuses to run:

$ kubectl get pods -n kubeflow
NAME                                                     READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0               1/1     Running   0          4d20h
admission-webhook-deployment-5d9ccb5696-f6zs6            1/1     Running   0          4d20h
application-controller-stateful-set-0                    1/1     Running   0          4d21h
argo-ui-684bcb587f-z84nh                                 1/1     Running   0          4d16h
cache-deployer-deployment-6667847478-7h2w8               2/2     Running   2          4d21h
cache-server-bd9c859db-755zj                             2/2     Running   527        4d21h
centraldashboard-895c4c768-46xgc                         1/1     Running   0          4d21h
jupyter-web-app-deployment-6588c6f544-c5m45              1/1     Running   0          3d3h
katib-controller-75c8d47f8c-5k2tr                        1/1     Running   0          4d21h
katib-db-manager-6c88c68d79-cgxdh                        1/1     Running   0          4d16h
katib-mysql-858f68f588-zvhnj                             1/1     Running   0          4d21h
katib-ui-68f59498d4-bkscp                                1/1     Running   0          4d21h
kfserving-controller-manager-0                           2/2     Running   0          36h
kubeflow-pipelines-profile-controller-69c94df75b-xtpfj   1/1     Running   0          4d21h
metacontroller-0                                         1/1     Running   0          4d21h
metadata-db-757dc9c7b5-pt75k                             1/1     Running   0          4d21h
metadata-envoy-deployment-6ff58757f6-57pjc               1/1     Running   0          4d21h
metadata-grpc-deployment-76d69f69c8-xcmjk                1/1     Running   3          4d21h
metadata-writer-6d94ffb7df-mhnxj                         2/2     Running   1          4d21h
minio-66c9cd74c9-jrss8                                   1/1     Running   0          4d21h

ml-pipeline-54989c9946-s2f46                             1/2     Running   926        4d21h

ml-pipeline-persistenceagent-7f6bf7646-ldct6             2/2     Running   0          4d21h
ml-pipeline-scheduledworkflow-66db7bcf5d-q244j           2/2     Running   0          4d16h
ml-pipeline-ui-756b58fb-gpwn9                            2/2     Running   0          4d21h
ml-pipeline-viewer-crd-58f59f87db-dmj2l                  2/2     Running   2          4d21h
ml-pipeline-visualizationserver-6f9ff4974-k4cf9          2/2     Running   0          4d21h
mpi-operator-77bb5d8f4b-w4dhj                            1/1     Running   0          4d21h
mxnet-operator-68b688bb69-b5985                          1/1     Running   0          4d16h
mysql-7694c6b8b7-jthp6                                   2/2     Running   0          4d17h
notebook-controller-deployment-58447d4b4c-6ll57          1/1     Running   0          4d21h
profiles-deployment-78d4549cbc-z9lld                     2/2     Running   0          4d21h
pytorch-operator-b79799447-f8nnl                         1/1     Running   0          4d21h
seldon-controller-manager-5fc5dfc86c-nh2qm               1/1     Running   0          4d21h
spark-operatorsparkoperator-67c6bc65fb-8tgn5             1/1     Running   0          4d21h
tf-job-operator-5c97f4bf7-g5vtw                          1/1     Running   0          4d21h
workflow-controller-5c7cc7976d-5n6tb                     1/1     Running   0          4d16h
$ kubectl logs -n kubeflow ml-pipeline-54989c9946-s2f46 ml-pipeline-api-server 
I0301 20:22:00.153656       6 client_manager.go:134] Initializing client manager
I0301 20:22:00.153817       6 config.go:50] Config DBConfig.ExtraParams not specified, skipping
[mysql] 2021/03/01 20:22:01 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:02 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:04 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:07 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:10 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:13 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:16 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:23 packets.go:36: unexpected EOF
$ kubectl logs -n kubeflow mysql-7694c6b8b7-jthp6 mysql
...
MySQL init process done. Ready for start up.

2021-02-25 03:04:17 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2021-02-25 03:04:17 0 [Note] mysqld (mysqld 5.6.44) starting as process 1 ...
2021-02-25 03:04:17 1 [Note] Plugin 'FEDERATED' is disabled.
2021-02-25 03:04:17 1 [Note] InnoDB: Using atomics to ref count buffer pool pages
2021-02-25 03:04:17 1 [Note] InnoDB: The InnoDB memory heap is disabled
2021-02-25 03:04:17 1 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2021-02-25 03:04:17 1 [Note] InnoDB: Memory barrier is not used
2021-02-25 03:04:17 1 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-02-25 03:04:17 1 [Note] InnoDB: Using Linux native AIO
2021-02-25 03:04:17 1 [Note] InnoDB: Using CPU crc32 instructions
2021-02-25 03:04:17 1 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2021-02-25 03:04:17 1 [Note] InnoDB: Completed initialization of buffer pool
2021-02-25 03:04:17 1 [Note] InnoDB: Highest supported file format is Barracuda.
2021-02-25 03:04:17 1 [Note] InnoDB: 128 rollback segment(s) are active.
2021-02-25 03:04:17 1 [Note] InnoDB: Waiting for purge to start
2021-02-25 03:04:17 1 [Note] InnoDB: 5.6.44 started; log sequence number 1625997
2021-02-25 03:04:17 1 [Note] Server hostname (bind-address): '*'; port: 3306
2021-02-25 03:04:17 1 [Note] IPv6 is available.
2021-02-25 03:04:17 1 [Note]   - '::' resolves to '::';
2021-02-25 03:04:17 1 [Note] Server socket created on IP: '::'.
2021-02-25 03:04:17 1 [Warning] Insecure configuration for --pid-file: Location '/var/run/mysqld' in the path is accessible to all OS users. Consider choosing a different directory.
2021-02-25 03:04:17 1 [Warning] 'proxies_priv' entry '@ root@mysql-7694c6b8b7-jthp6' ignored in --skip-name-resolve mode.
2021-02-25 03:04:17 1 [Note] Event Scheduler: Loaded 0 events
2021-02-25 03:04:17 1 [Note] mysqld: ready for connections.
Version: '5.6.44'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)

The cache server is also unable to connect to MySQL:

$ kubectl logs -n kubeflow cache-server-bd9c859db-755zj  server 
2021/03/01 20:19:21 Initing client manager....
[mysql] 2021/03/01 20:19:22 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:24 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:25 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:27 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:30 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:33 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:39 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:46 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:55 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:20:07 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:20:26 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:21:02 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:21:40 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:22:35 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:23:58 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:09 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:50 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:51 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:52 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:54 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:56 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:25:59 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:26:02 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:26:06 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:26:15 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:26:20 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:26:34 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:27:03 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:27:45 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:28:11 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:29:39 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:30:12 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:31:32 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:32:07 packets.go:36: unexpected EOF
F0301 20:32:07.437107       1 error.go:305] invalid connection
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc000786600, 0xc0004790a0, 0x3f, 0x40)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:769 +0xd4
github.com/golang/glog.(*loggingT).output(0x237c4c0, 0xc000000003, 0xc000479080, 0x20d8f16, 0x8, 0x131, 0x0)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:720 +0x329
github.com/golang/glog.(*loggingT).printf(0x237c4c0, 0x3, 0x14ca0b3, 0x2, 0xc0006c58f8, 0x1, 0x1)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(0x14ca0b3, 0x2, 0xc0006c58f8, 0x1, 0x1)
	/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:1148 +0x67
github.com/kubeflow/pipelines/backend/src/common/util.TerminateIfError(0x1649b00, 0xc0005eca40)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:305 +0x79
main.initMysql(0x7ffefc6905bd, 0x5, 0x7ffefc6905cd, 0x5, 0x7ffefc6905dd, 0x4, 0x7ffefc6905ec, 0x7, 0x7ffefc6905fe, 0x4, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:157 +0x466
main.initDBClient(0x7ffefc6905bd, 0x5, 0x7ffefc6905cd, 0x5, 0x7ffefc6905dd, 0x4, 0x7ffefc6905ec, 0x7, 0x7ffefc6905fe, 0x4, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:71 +0x599
main.(*ClientManager).init(0xc0006c5db8, 0x7ffefc6905bd, 0x5, 0x7ffefc6905cd, 0x5, 0x7ffefc6905dd, 0x4, 0x7ffefc6905ec, 0x7, 0x7ffefc6905fe, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:57 +0x80
main.NewClientManager(0x7ffefc6905bd, 0x5, 0x7ffefc6905cd, 0x5, 0x7ffefc6905dd, 0x4, 0x7ffefc6905ec, 0x7, 0x7ffefc6905fe, 0x4, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:169 +0xab
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/cache/main.go:71 +0x367
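
For readers comparing their own logs against the ones above, it can help to look at the spacing between the "unexpected EOF" lines rather than the lines themselves. The following is a minimal sketch, not part of KFP: the helper names `eof_timestamps` and `retry_gaps` are hypothetical, and the regular expression assumes the exact line shape seen in this issue. The sample is the first five lines of the cache-server log above.

```python
import re
from datetime import datetime

# Matches lines such as:
# [mysql] 2021/03/01 20:19:22 packets.go:36: unexpected EOF
LINE_RE = re.compile(
    r"^\[mysql\] (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) packets\.go:\d+: unexpected EOF"
)


def eof_timestamps(log_text):
    """Return the timestamp of every 'unexpected EOF' line, in log order."""
    stamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S"))
    return stamps


def retry_gaps(stamps):
    """Return the number of seconds between consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


sample = """\
[mysql] 2021/03/01 20:19:22 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:24 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:25 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:27 packets.go:36: unexpected EOF
[mysql] 2021/03/01 20:19:30 packets.go:36: unexpected EOF
"""

print(retry_gaps(eof_timestamps(sample)))  # prints [2, 1, 2, 3]
```

Run over the full cache-server log, the gaps grow from seconds to over a minute, which is consistent with a client that keeps retrying with increasing delays and never completes a handshake. That reading is an interpretation of the timestamps, not something the log states.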

Attempted suggestions for repair (ALL fail - please do not suggest)

  1. Istio: switch the TLS mode from ISTIO_MUTUAL to DISABLE. This allows the MySQL DB to be populated, but the KFP UI will NOT start up.
  2. Istio: configure mTLS as STRICT vs. PERMISSIVE. Pipelines and Jupyter Notebooks will not come up.

The product as advertised online does not work on a vanilla on-prem K8s installation. It appears to work on GCP, Azure, AWS, and possibly IBM.

Provided diagnostic tools are not compatible with an on-prem installation:

$ kfp diagnose_me
Google Cloud SDK is not installed, gcloud, gsutil and kubectl are required for this app to run. Please follow instructions at https://cloud.google.com/sdk/install to install the SDK.

/kind bug

@ReggieCarey
Author

Note: most users of Kubeflow experiencing this bug cannot afford to wait 6-9 months for the next release, which may or may not address this problem. This bug breaks the product. We need a solution immediately. This part of the product is considered STABLE.

@ReggieCarey
Author

ReggieCarey commented Mar 1, 2021

Kubernetes installed via Kubespray

@Bobgy
Contributor

Bobgy commented Mar 2, 2021

Hi @ReggieCarey, sorry for your bad experience.

KF 1.2.0 with Dex on K8s is a community maintained distribution.

/assign @yanniszark
Can you take a look or suggest who else can take a look at this issue?

@yanniszark
Contributor

@ReggieCarey I'm sorry for your bad experience with 1.2. From Arrikto's side, we supported this distribution until Kubeflow 1.1. In Kubeflow 1.2, we started transitioning out of kfctl and do not support this distribution in 1.2.

Our current efforts are focused on releasing 1.3. In Kubeflow 1.3, we plan to support a similar distribution but without kfctl at all. To give you an idea of the current timeline, the release candidate for 1.3 is March 15th. Distributions will be tested and the release will be finalized soon after.

@ReggieCarey
Author

@yanniszark thanks for looking into this. Some more info. I disabled sidecar injection in the kubeflow namespace. This allowed ml pipeline to connect with MySQL and populate 1 pipeline. I can't get to the pipeline dashboard from the KF dashboard but I can get to it via kubectl proxy. Experiments still don't work. Need to check the cache server status.
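
The comment above does not say how sidecar injection was disabled. One common way, assuming the cluster uses Istio's automatic injection driven by the `istio-injection` namespace label, is to set that label to `disabled`. The sketch below only builds the patch and prints a `kubectl` command for it; the function names are hypothetical and nothing is sent to a cluster.

```python
import json
import shlex


def injection_patch(enabled):
    """Build a merge patch that sets the namespace's istio-injection label."""
    value = "enabled" if enabled else "disabled"
    return {"metadata": {"labels": {"istio-injection": value}}}


def kubectl_patch_command(namespace, enabled):
    """Render a kubectl command that applies the patch to the namespace."""
    patch = json.dumps(injection_patch(enabled), separators=(",", ":"))
    return " ".join([
        "kubectl", "patch", "namespace", shlex.quote(namespace),
        "--type=merge", "-p", shlex.quote(patch),
    ])


# The "kubeflow" namespace comes from the pod listing in this issue.
print(kubectl_patch_command("kubeflow", enabled=False))
```

Changing the label only affects pods created afterwards; pods that already carry a sidecar keep it until they are recreated.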

Q: Why are you consuming MySQL instead of a SQL service?

@Bobgy changed the title from "KF Pipelines 100% Unusable - MULTIPLE PEOPLE REPORTING" to "[Kubeflow Dex Distribution] KF Pipelines 100% Unusable - MULTIPLE PEOPLE REPORTING" on Mar 2, 2021
@Bobgy
Contributor

Bobgy commented Mar 2, 2021

So I guess there was some miscommunication. I thought Arrikto was still supporting Kubeflow 1.2 with Dex. If that's not the case, we should have deleted the Dex distribution from the kubeflow.org documentation during the release.

@davidspek
Contributor

@Bobgy Just dropping in here quickly as I came across the issue. I think what @yanniszark said is just that deployment with kfctl is not being supported by Arrikto for 1.2. I believe the Dex part is still included, though (as one of the OIDC provider options). Please correct me if I am wrong.

@Bobgy
Contributor

Bobgy commented Mar 19, 2021

You are right, I was only referring to the distribution.

@Shaked

Shaked commented Jun 3, 2021

@Bobgy is there a solution for that?

@stale

stale bot commented Sep 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Sep 3, 2021
@solarist

We are experiencing the same problem in 1.4.0 (on-prem installation with Dex).

@stale stale bot removed the lifecycle/stale label Jan 24, 2022
@stale

stale bot commented Apr 28, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Apr 28, 2022
@thesuperzapper
Member

For everyone watching, any time you see an error like [mysql] XXXX/XX/XX XX:XX:XX packets.go:36: unexpected EOF in the cache-server pods, you are almost certainly dealing with a network-level issue between your pods and the MySQL database.

The most likely case is that there is some asymmetric routing going on. That is, your Pods might be able to create a connection to MySQL, but MySQL might not have a route back to your Pod. Note, MySQL connections are complex and will sometimes initiate new TCP connections back to the client, which will fail in the previous case.

I have a more detailed write up on this issue: #3763 (comment)
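
To narrow down where a connection dies, one option is to open a plain TCP connection to the database and wait for the server's first packet. In the MySQL protocol the server speaks first, sending a greeting that starts with a 4-byte header (3-byte little-endian length, 1-byte sequence number) followed by the protocol version and a NUL-terminated server version string. If the TCP connect succeeds but the stream ends before that greeting arrives, the client sees the same symptom as the "unexpected EOF" lines. This is a minimal sketch under those assumptions, not a KFP tool; all function names are hypothetical.

```python
import socket
from typing import Tuple


def parse_packet_header(header: bytes) -> Tuple[int, int]:
    """Split a 4-byte MySQL packet header into (payload_length, sequence_id)."""
    if len(header) != 4:
        raise ValueError("a MySQL packet header is exactly 4 bytes")
    # 3-byte little-endian payload length, then a 1-byte sequence number.
    return int.from_bytes(header[:3], "little"), header[3]


def parse_greeting(payload: bytes) -> Tuple[int, str]:
    """Return (protocol_version, server_version) from a handshake payload."""
    end = payload.index(b"\x00", 1)  # the version string is NUL-terminated
    return payload[0], payload[1:end].decode("ascii", "replace")


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read up to `count` bytes; a shorter result means the peer closed early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            break
        data += chunk
    return data


def probe_mysql(host: str, port: int = 3306, timeout: float = 5.0) -> str:
    """Classify what happens when a client connects and waits for the greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            header = recv_exact(sock, 4)
            if len(header) < 4:
                return "EOF: TCP connected, but closed before the greeting"
            length, _seq = parse_packet_header(header)
            payload = recv_exact(sock, length)
            if len(payload) < length:
                return "EOF: greeting was cut off mid-packet"
            if payload[:1] == b"\xff":
                return "server answered with an error packet"
            protocol, version = parse_greeting(payload)
            return f"ok: protocol {protocol}, server version {version}"
    except (OSError, ValueError) as exc:
        # Covers refused connections, timeouts, resets, and malformed greetings.
        return f"failed: {exc}"
```

Calling `probe_mysql("mysql.kubeflow")` from a pod inside the cluster would be a typical use; that host name is an assumption about the service name and namespace, so substitute whatever your deployment uses. Running it once from a pod with a sidecar and once from a pod without one can show whether the proxy is what ends the stream.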

@stale stale bot removed the lifecycle/stale label May 14, 2023
@rimolive
Member

Closing this issue, as no users have reported this since 2022. Feel free to reopen if the issue remains in the latest releases.

/close


@rimolive: Closing this issue.

