Deploying helm chart 3.22.2 fails for a new cluster while 3.21.4 works #5775
I tried the usual dance (https://projectcalico.docs.tigera.io/security/tutorials/kubernetes-policy-basic) and I can access the nginx running on workload node1 from a different pod on node2, so basic node-to-node and pod-to-pod communication seems to be working.
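For reference, the connectivity check from that tutorial boils down to roughly the following (a sketch following the tutorial's defaults; the pod and service names are the tutorial's, not necessarily the ones used here):

```
# Create the nginx deployment and expose it as a service
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80

# From another pod (scheduled on a different node), try to reach it
kubectl run access --rm -ti --image=busybox -- wget -q nginx -O -
```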
This suggests a potential problem with the Calico API server - could you check the status and logs of the API server pods?
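A minimal sketch of that check, assuming the default operator-managed install where the API server lives in the calico-apiserver namespace:

```
# Pod status and restart counts for the Calico API server
kubectl get pods -n calico-apiserver -o wide

# Recent logs from the API server deployment
kubectl logs -n calico-apiserver deployment/calico-apiserver --tail=100

# Operator-reported health of the API server component
kubectl get tigerastatus apiserver
```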
So to test this, I applied the fix from #5575. The API server pods (2 of those, since I have 2 workload nodes) look just fine (I guess):
The status of both is Running, no restarts. What is the …
Kind of similar with pgo (the Postgres operator).
It would be great to have all your questions somewhat grouped, since it's hard for me to keep the dev server broken just for tests - we need to do other things with it too. Thank you for all the effort.
This seems to suggest that the main Kubernetes API server is having trouble contacting the Calico API server - this is probably where we want to dig in deeper. Some ideas:
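One concrete way to check that path (a sketch; the APIService name below is the default one registered by the Calico API server and may differ in a customized install):

```
# Aggregated API registration for projectcalico.org/v3 -
# "Available: False" here usually means the main API server
# cannot reach the Calico API server service
kubectl get apiservice v3.projectcalico.org -o yaml

# A quick end-to-end request through the aggregated API
kubectl get felixconfigurations.projectcalico.org default
```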
How to do that?
There is no kube-proxy since I run with eBPF.
Actually it runs on the control-plane (server) and on all workloads (agents). One question: does this setup work for you? Ubuntu 20.04 / Calico 3.22.1 via Helm / eBPF (VXLAN)?
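A sketch of how the eBPF / no-kube-proxy configuration can be confirmed in an operator-managed install (the resource and ConfigMap names follow the Calico eBPF docs and are assumptions about this particular setup):

```
# The Installation resource should report the BPF dataplane
kubectl get installation default -o jsonpath='{.spec.calicoNetwork.linuxDataplane}'

# Without kube-proxy, Calico needs the real API server endpoint;
# the eBPF install docs put it in this ConfigMap
kubectl get configmap -n tigera-operator kubernetes-services-endpoint -o yaml
```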
Anything I could do to help find the cause for this? Surprisingly, it seems that not a lot of people are affected, right? Could I assume that not a lot of people are using eBPF yet, and that those who do are not upgrading that quickly - which would explain why this isn't that big of an issue (yet)?
Let me know if I can do anything to drive progress here; currently I'm very dependent on your feedback, or so it seems to me at least. If you assume I run some esoteric / complex / unusual setup, let me know which aspect you consider to be the critical part that makes it so. Thanks!
@EugenMayer sorry for the delay. Your scenario isn't especially esoteric. Given that this works on v3.21 but not v3.22, I wonder if @tomastigera is aware of any changes that may be impacting you here. |
Even though I kind of expect nobody really cares, I'll still leave the note:
But more interestingly, I can now reproduce this on an EC2 instance on AWS, a one-node setup only, with a usual t3.large instance. So you should be able to reproduce this very easily on AWS.
@EugenMayer sorry for the delay on this - @tomastigera is currently investigating some oddities around the way services behave in Calico v3.22 (eBPF mode) that might be relevant here. xref: #5957
@caseydavenport happy that someone is looking into that. I understand it takes time. Thanks!
GitHub claims this was fixed by #5498
I'm creating a cluster from scratch via Terraform to reproduce #5575 -
When using 3.21.4 the cluster comes up just fine, while 3.22.2 fails.
It fails in a strange way: the only real thing I can see is that cert-manager fails to talk to the Calico API. The Tigera operator itself seems to come up and work just fine.
I understand that I need to provide better logs here - what should I look for / fetch? Which pods are interesting to get the logs from? The logs I have are at https://gist.github.com/EugenMayer/94866586c516591f95b4ea8184ff8c13
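A sketch of a first log-gathering pass for this kind of failure (namespaces and workload names assume the default operator-managed install and may differ here):

```
# Operator-reported component health
kubectl get tigerastatus

# Operator, node agent and API server logs
kubectl logs -n tigera-operator deployment/tigera-operator --tail=200
kubectl logs -n calico-system daemonset/calico-node --tail=200
kubectl logs -n calico-apiserver deployment/calico-apiserver --tail=200

# Recent cluster events, often the quickest pointer to what is failing
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```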
Env:
I deploy plainly via TF, nothing special:
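For context, a Helm-level equivalent of such a deployment would be roughly the following (repo URL and chart name as in the Calico docs for the 3.22 release line; the values are illustrative, not the actual ones used here):

```
# Add the Calico chart repo and install the operator chart
helm repo add projectcalico https://projectcalico.docs.tigera.io/charts
helm install calico projectcalico/tigera-operator \
  --version v3.22.2 \
  --namespace tigera-operator --create-namespace
```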