Deploying helm chart 3.22.2 fails for a new cluster while 3.21.4 works #5775

Closed

EugenMayer opened this issue Mar 18, 2022 · 12 comments

@EugenMayer (Contributor) commented Mar 18, 2022

I'm creating a cluster from scratch via Terraform to reproduce #5575.
When using 3.21.4 the cluster comes up just fine, while 3.22.2 fails.

It fails in a strange way: the only concrete symptom I can see is that cert-manager fails to talk to the Calico API. The tigera-operator itself seems to come up just fine.

I understand that I need to provide better logs here - what should I look for / fetch? Which pods are interesting to get the logs from? The logs I have so far are at https://gist.github.com/EugenMayer/94866586c516591f95b4ea8184ff8c13

Env:

  • ubuntu 20.04 (latest) LTS
  • rke2 v1.22.3+rke2r1 / rke2 v1.23.52+rke2r1 / k3s 1.22.6-k3s1
  • 1 server node, 2 agent nodes (workloads)

I deploy it plainly via Terraform, nothing special - these are the chart values:

    bgp: Disabled
    linuxDataplane: ${dataPlane}
    hostPorts: ${hostPorts}
    ipPools:
      - blockSize: 26
        # TODO: make this configurable; for now it matches k3s deployments
        cidr: 10.42.0.0/16
        encapsulation: ${encapsulation}
        natOutgoing: Enabled
        nodeSelector: all()
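
For reference, a minimal sketch of how such a deployment would be applied by hand, assuming the tigera-operator chart with a values.yaml carrying the block above (the repo URL and file name here are illustrative, taken from the Calico docs of that era, not from this report):

    # Add the Calico chart repo and install the operator chart with the values above
    helm repo add projectcalico https://projectcalico.docs.tigera.io/charts
    helm install calico projectcalico/tigera-operator --version v3.22.2 -f values.yaml
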
@EugenMayer (Contributor, Author)

I tried the usual dance from https://projectcalico.docs.tigera.io/security/tutorials/kubernetes-policy-basic and I can access the nginx running on workload node 1 from a different pod on node 2 - so basic node-to-node and pod-to-pod communication seems to be working.

@caseydavenport (Member)

panic: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, projectcalico.org/v3: the server is currently unable to handle the request

This suggests a potential problem with the Calico API server - could you check the status and logs of the pod within the calico-apiserver namespace?

Checking the APIService status would be useful as well.
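
A sketch of how to check both, assuming the default operator-managed names (the APIService name matches the errors quoted further down in this thread):

    # Status and logs of the Calico API server pods
    kubectl get pods -n calico-apiserver
    kubectl logs -n calico-apiserver deployment/calico-apiserver
    # Availability of the aggregated APIService
    kubectl get apiservice v3.projectcalico.org
    kubectl describe apiservice v3.projectcalico.org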

@EugenMayer (Contributor, Author) commented Mar 23, 2022

To rule this out, I applied the fix from #5575 (ethtool -K ens3 rx-gro-hw off) prior to deploying Calico, to ensure this is not a related network issue.

The Calico API server pods (2 of them, since I have 2 workload nodes) look just fine, I guess:

I0323 08:23:49.538370       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0323 08:23:49.538393       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0323 08:23:49.639386       1 run_server.go:69] Running the API server
I0323 08:23:49.643289       1 run_server.go:58] Starting watch extension
W0323 08:23:49.643310       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0323 08:23:49.757200       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0323 08:23:49.757214       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0323 08:23:49.757230       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0323 08:23:49.757232       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0323 08:23:49.757239       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0323 08:23:49.757241       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0323 08:23:49.757417       1 secure_serving.go:202] Serving securely on [::]:5443
I0323 08:23:49.757514       1 run_server.go:80] apiserver is ready.
I0323 08:23:49.757525       1 dynamic_serving_content.go:130] Starting serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key
I0323 08:23:49.757541       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0323 08:23:49.857964       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0323 08:23:49.857977       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0323 08:23:49.858015       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController

The status of both is Running, with no restarts.

What is the APIService status - how would I check that? If you mean the kube-apiserver, its logs do point to an issue:

E0323 08:38:35.285094       1 controller.go:116] loading OpenAPI spec for "v3.projectcalico.org" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0323 08:38:35.285202       1 controller.go:129] OpenAPI AggregationController: action for item v3.projectcalico.org: Rate Limited Requeue.
E0323 08:38:39.270463       1 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.13.29.100:443/apis/metrics.k8s.io/v1beta1: Get "https://10.13.29.100:443/apis/metrics.k8s.io/v1beta1": context deadline exceeded
E0323 08:38:39.286149       1 available_controller.go:524] v3.projectcalico.org failed with: failing or missing response from https://10.13.73.131:443/apis/projectcalico.org/v3: Get "https://10.13.73.131:443/apis/projectcalico.org/v3": dial tcp 10.13.73.131:443: i/o timeout
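
The i/o timeout above is against what looks like the ClusterIP of the aggregated Calico API Service, so one sanity check is whether that Service has endpoints at all (a sketch; the namespace assumes the default operator install):

    # Does the aggregated Calico API Service resolve to healthy backend pods?
    kubectl get svc,endpoints -n calico-apiserver
    # Availability of the two APIServices that fail in the logs above
    kubectl get apiservice v3.projectcalico.org v1beta1.metrics.k8s.io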

It is kind of similar with pgo (the Postgres Operator):

time="2022-03-23T08:29:17Z" level=info msg="metrics server is starting to listen" addr=":8080" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:130" func="log.(*DelegatingLogger).Info" version=5.0.4-0
panic: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, projectcalico.org/v3: the server is currently unable to handle the request

goroutine 1 [running]:
main.assertNoError(...)
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:39
main.isOpenshift({0x1864d10, 0xc000391600}, 0x0)
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:110 +0x1f1
main.addControllersToManager({0x1864d10, 0xc000391600}, {0x189f8b8, 0xc00029c000})
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:96 +0xb4
main.main()
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:80 +0x1f5

It would be great to have all your questions grouped together, since it's hard for me to keep the dev server broken just for tests - we need it for other things too. Thank you for all the effort.

@caseydavenport (Member)

Get "https://10.13.73.131:443/apis/projectcalico.org/v3": dial tcp 10.13.73.131:443: i/o timeout

This seems to suggest that the main Kubernetes API server is having trouble contacting the Calico API server - this is probably where we want to dig deeper. A few example commands for these checks are sketched after the list below.

Some ideas:

  • The thing to figure out here is why this connection is failing - is routing configured properly between the nodes hosting the Calico API server and the k8s API server?

  • Is the kube-proxy healthy and are services working?

  • Is calico/node running on your control plane nodes?
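
A sketch of commands for these checks, assuming an operator-based install where calico-node runs in the calico-system namespace (the labels used here are the usual ones and may differ per distribution):

    # Is calico-node scheduled and Ready on every node, including the control plane?
    kubectl get pods -n calico-system -o wide -l k8s-app=calico-node
    # Is kube-proxy (if used at all) healthy?
    kubectl get pods -n kube-system -l k8s-app=kube-proxy
    # Are Services generally working, e.g. the default kubernetes Service and its endpoints?
    kubectl get svc,endpoints kubernetes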

@EugenMayer (Contributor, Author) commented Mar 27, 2022

The thing to figure out here is why this connection is failing - is routing configured properly between the nodes hosting the Calico API server and the k8s API server?

How would I do that?

Is the kube-proxy healthy and are services working?

There is no kube-proxy, since I run with eBPF.

Is calico/node running on your control plane nodes?

Yes, it runs on the control plane (server) and on all workload nodes (agents).

One question: does this setup work for you - Ubuntu 20.04 / Calico 3.22.1 via Helm / eBPF (VXLAN)?
Is this again something special about the underlay network (OpenStack) or the kernel (5.4)?
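
One thing worth verifying in a kube-proxy-less eBPF setup: Calico then has to reach the Kubernetes API server directly, which for operator installs is configured via a kubernetes-services-endpoint ConfigMap (per the Calico eBPF docs; namespace and name sketched here accordingly):

    # The direct API server host/port Calico uses when kube-proxy is disabled
    kubectl get configmap kubernetes-services-endpoint -n tigera-operator -o yaml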

@EugenMayer (Contributor, Author)

Anything I could do to help find the cause of this? Surprisingly, it seems that not a lot of people are affected, right?

Could I assume that not many people are using eBPF yet, and those who do are not upgrading that quickly - and that this is why it doesn't seem to be much of an issue (yet)?

@EugenMayer (Contributor, Author)

Let me know if I can do anything to drive progress here - currently I'm very dependent on your feedback, it seems.

If you assume I run some esoteric / complex / unusual setup, let me know which aspect you consider to be the critical part that makes it so. Thanks!

@caseydavenport (Member)

@EugenMayer sorry for the delay. Your scenario isn't especially esoteric.

Given that this works on v3.21 but not v3.22, I wonder if @tomastigera is aware of any changes that may be impacting you here.

@EugenMayer (Contributor, Author) commented Apr 23, 2022

Even though I kind of expect nobody really cares, I'll still leave this note:

  • I can now also reproduce this with rke2 1.23 (so k8s 1.23)

More interestingly, I can now reproduce this on an EC2 instance on AWS, in a single-node setup with:

  • ubuntu 20
  • k3s 1.22.6-k3s1
  • calico 3.22.2

A usual t3.large instance, so you should be able to reproduce this very easily on AWS.
Again, downgrading to 3.21.4 fixes the issue (without changing anything else).
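
For reference, a rough single-node reproduction matching the setup above (the exact k3s flags and file names here are illustrative, not taken verbatim from this report):

    # k3s without its bundled CNI / network policy, so Calico can take over
    curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.22.6+k3s1" \
      sh -s - --flannel-backend=none --disable-network-policy
    # tigera-operator chart at v3.22.2 with the eBPF values shown at the top of this issue
    helm install calico projectcalico/tigera-operator --version v3.22.2 -f values.yaml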

@EugenMayer EugenMayer changed the title Deploying helm chart 3.22.1 fails for a new cluster while 3.21.4 works Deploying helm chart 3.22.2 fails for a new cluster while 3.21.4 works Apr 23, 2022
@caseydavenport (Member) commented May 2, 2022

@EugenMayer sorry for the delay on this - @tomastigera is currently investigating some oddities around the way services behave in Calico v3.22 (eBPF mode) that might be relevant here.

xref: #5957

@EugenMayer (Contributor, Author)

@caseydavenport happy that someone is looking into it. I understand it takes time. Thanks!

@caseydavenport (Member)

GitHub claims this was fixed by #5498
