Deploying helm chart 3.22.2 fails for a new cluster while 3.21.4 works #5775

Closed

EugenMayer opened this issue Mar 18, 2022 · 12 comments

@EugenMayer (Contributor) commented Mar 18, 2022

I'm creating a cluster from scratch via Terraform to reproduce #5575.
When using 3.21.4 the cluster comes up just fine, while 3.22.2 fails.

It fails in a strange way: the only concrete symptom I can see is that cert-manager fails to talk to the Calico API. The tigera-operator itself seems to come up just fine.

I understand that I need to provide better logs here - what should I look for / fetch? Which pods are interesting to get the logs from? The logs I have so far are at https://gist.github.com/EugenMayer/94866586c516591f95b4ea8184ff8c13

Env:

  • ubuntu 20.04 (latest) LTS
  • rke2 v1.22.3+rke2r1 / rke2 v1.23.52+rke2r1 / k3s 1.22.6-k3s1
  • 1 server node, 2 agent nodes (workloads)

I deploy it plainly via Terraform, nothing special - these are the chart values:

    bgp: Disabled
    linuxDataplane: ${dataPlane}
    hostPorts: ${hostPorts}
    ipPools:
      - blockSize: 26
        # TODO: make this configurable; for now it matches k3s deployments
        cidr: 10.42.0.0/16
        encapsulation: ${encapsulation}
        natOutgoing: Enabled
        nodeSelector: all()
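
For reference, a minimal sketch of how such a deployment would be applied by hand, assuming the tigera-operator chart with a values.yaml carrying the block above (the repo URL and file name here are illustrative, taken from the Calico docs of that era, not from this report):

    # Add the Calico chart repo and install the operator chart with the values above
    helm repo add projectcalico https://projectcalico.docs.tigera.io/charts
    helm install calico projectcalico/tigera-operator --version v3.22.2 -f values.yaml
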
@EugenMayer (Contributor, Author)

I tried the usual dance from https://projectcalico.docs.tigera.io/security/tutorials/kubernetes-policy-basic and I can access the nginx running on workload node 1 from a different pod on node 2 - so basic node-to-node and pod-to-pod communication seems to be working.

@caseydavenport (Member)

panic: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, projectcalico.org/v3: the server is currently unable to handle the request

This suggests a potential problem with the Calico API server - could you check the status and logs of the pod within the calico-apiserver namespace?

Checking the APIService status would be useful as well.
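
A sketch of how to check both, assuming the default operator-managed names (the APIService name matches the errors quoted further down in this thread):

    # Status and logs of the Calico API server pods
    kubectl get pods -n calico-apiserver
    kubectl logs -n calico-apiserver deployment/calico-apiserver
    # Availability of the aggregated APIService
    kubectl get apiservice v3.projectcalico.org
    kubectl describe apiservice v3.projectcalico.org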

@EugenMayer (Contributor, Author) commented Mar 23, 2022

To rule this out, I applied the fix from #5575 (ethtool -K ens3 rx-gro-hw off) prior to deploying Calico, to ensure this is not a related network issue.

The Calico API server pods (2 of them, since I have 2 workload nodes) look just fine, I guess:

I0323 08:23:49.538370       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0323 08:23:49.538393       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0323 08:23:49.639386       1 run_server.go:69] Running the API server
I0323 08:23:49.643289       1 run_server.go:58] Starting watch extension
W0323 08:23:49.643310       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0323 08:23:49.757200       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0323 08:23:49.757214       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0323 08:23:49.757230       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0323 08:23:49.757232       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0323 08:23:49.757239       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0323 08:23:49.757241       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0323 08:23:49.757417       1 secure_serving.go:202] Serving securely on [::]:5443
I0323 08:23:49.757514       1 run_server.go:80] apiserver is ready.
I0323 08:23:49.757525       1 dynamic_serving_content.go:130] Starting serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key
I0323 08:23:49.757541       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0323 08:23:49.857964       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0323 08:23:49.857977       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0323 08:23:49.858015       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController

The status of both is Running, with no restarts.

What is the APIService status - how would I check that? If you mean the kube-apiserver, its logs do point to an issue:

E0323 08:38:35.285094       1 controller.go:116] loading OpenAPI spec for "v3.projectcalico.org" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0323 08:38:35.285202       1 controller.go:129] OpenAPI AggregationController: action for item v3.projectcalico.org: Rate Limited Requeue.
E0323 08:38:39.270463       1 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.13.29.100:443/apis/metrics.k8s.io/v1beta1: Get "https://10.13.29.100:443/apis/metrics.k8s.io/v1beta1": context deadline exceeded
E0323 08:38:39.286149       1 available_controller.go:524] v3.projectcalico.org failed with: failing or missing response from https://10.13.73.131:443/apis/projectcalico.org/v3: Get "https://10.13.73.131:443/apis/projectcalico.org/v3": dial tcp 10.13.73.131:443: i/o timeout
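
The i/o timeout above is against what looks like the ClusterIP of the aggregated Calico API Service, so one sanity check is whether that Service has endpoints at all (a sketch; the namespace assumes the default operator install):

    # Does the aggregated Calico API Service resolve to healthy backend pods?
    kubectl get svc,endpoints -n calico-apiserver
    # Availability of the two APIServices that fail in the logs above
    kubectl get apiservice v3.projectcalico.org v1beta1.metrics.k8s.io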

It is kind of similar with pgo (the Postgres Operator):

time="2022-03-23T08:29:17Z" level=info msg="metrics server is starting to listen" addr=":8080" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:130" func="log.(*DelegatingLogger).Info" version=5.0.4-0
panic: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, projectcalico.org/v3: the server is currently unable to handle the request

goroutine 1 [running]:
main.assertNoError(...)
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:39
main.isOpenshift({0x1864d10, 0xc000391600}, 0x0)
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:110 +0x1f1
main.addControllersToManager({0x1864d10, 0xc000391600}, {0x189f8b8, 0xc00029c000})
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:96 +0xb4
main.main()
	github.com/crunchydata/postgres-operator/cmd/postgres-operator/main.go:80 +0x1f5

It would be great to have all your questions grouped together, since it's hard for me to keep the dev server broken just for tests - we need it for other things too. Thank you for all the effort.

@caseydavenport (Member)

Get "https://10.13.73.131:443/apis/projectcalico.org/v3": dial tcp 10.13.73.131:443: i/o timeout

This seems to suggest that the main Kubernetes API server is having trouble contacting the Calico API server - this is probably where we want to dig deeper. A few example commands for these checks are sketched after the list below.

Some ideas:

  • The thing to figure out here is why this connection is failing - is routing configured properly between the nodes hosting the Calico API server and the k8s API server?

  • Is the kube-proxy healthy and are services working?

  • Is calico/node running on your control plane nodes?
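
A sketch of commands for these checks, assuming an operator-based install where calico-node runs in the calico-system namespace (the labels used here are the usual ones and may differ per distribution):

    # Is calico-node scheduled and Ready on every node, including the control plane?
    kubectl get pods -n calico-system -o wide -l k8s-app=calico-node
    # Is kube-proxy (if used at all) healthy?
    kubectl get pods -n kube-system -l k8s-app=kube-proxy
    # Are Services generally working, e.g. the default kubernetes Service and its endpoints?
    kubectl get svc,endpoints kubernetes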

@EugenMayer (Contributor, Author) commented Mar 27, 2022

The thing to figure out here is why this connection is failing - is routing configured properly between the nodes hosting the Calico API server and the k8s API server?

How would I do that?

Is the kube-proxy healthy and are services working?

There is no kube-proxy, since I run with eBPF.

Is calico/node running on your control plane nodes?

Yes, it runs on the control plane (server) and on all workload nodes (agents).

One question: does this setup work for you - Ubuntu 20.04 / Calico 3.22.1 via Helm / eBPF (VXLAN)?
Is this again something special about the underlay network (OpenStack) or the kernel (5.4)?
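
One thing worth verifying in a kube-proxy-less eBPF setup: Calico then has to reach the Kubernetes API server directly, which for operator installs is configured via a kubernetes-services-endpoint ConfigMap (per the Calico eBPF docs; namespace and name sketched here accordingly):

    # The direct API server host/port Calico uses when kube-proxy is disabled
    kubectl get configmap kubernetes-services-endpoint -n tigera-operator -o yaml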

@EugenMayer (Contributor, Author)

Anything I could do to help find the cause of this? Surprisingly, it seems that not a lot of people are affected, right?

Could I assume that not many people are using eBPF yet, and those who do are not upgrading that quickly - and that this is why it doesn't seem to be much of an issue (yet)?

@EugenMayer (Contributor, Author)

Let me know if I can do anything to drive progress here - currently I'm very dependent on your feedback, it seems.

If you assume I run some esoteric / complex / unusual setup, let me know which aspect you consider to be the critical part that makes it so. Thanks!

@caseydavenport (Member)

@EugenMayer sorry for the delay. Your scenario isn't especially esoteric.

Given that this works on v3.21 but not v3.22, I wonder if @tomastigera is aware of any changes that may be impacting you here.

@EugenMayer (Contributor, Author) commented Apr 23, 2022

Even though I kind of expect nobody really cares, I'll still leave this note:

  • I can now also reproduce this with rke2 1.23 (so k8s 1.23)

More interestingly, I can now reproduce this on an EC2 instance on AWS, in a single-node setup with:

  • ubuntu 20
  • k3s 1.22.6-k3s1
  • calico 3.22.2

A usual t3.large instance, so you should be able to reproduce this very easily on AWS.
Again, downgrading to 3.21.4 fixes the issue (without changing anything else).
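
For reference, a rough single-node reproduction matching the setup above (the exact k3s flags and file names here are illustrative, not taken verbatim from this report):

    # k3s without its bundled CNI / network policy, so Calico can take over
    curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.22.6+k3s1" \
      sh -s - --flannel-backend=none --disable-network-policy
    # tigera-operator chart at v3.22.2 with the eBPF values shown at the top of this issue
    helm install calico projectcalico/tigera-operator --version v3.22.2 -f values.yaml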

@EugenMayer EugenMayer changed the title Deploying helm chart 3.22.1 fails for a new cluster while 3.21.4 works Deploying helm chart 3.22.2 fails for a new cluster while 3.21.4 works Apr 23, 2022
@caseydavenport (Member) commented May 2, 2022

@EugenMayer sorry for the delay on this - @tomastigera is currently investigating some oddities around the way services behave in Calico v3.22 (eBPF mode) that might be relevant here.

xref: #5957

@EugenMayer (Contributor, Author)

@caseydavenport happy that someone is looking into it. I understand it takes time. Thanks!

@caseydavenport (Member)

GitHub claims this was fixed by #5498
