k3s: All pods crashing with latest version #181790
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/kubernetes-bringup-on-current-unstable-2022-07-17/20403/2
I have reproduced the problem with an empty cluster. The testing method needs to be improved/fixed as well; the tests should have failed. Have you been able to solve the problem (with the information from Discourse)?
I have not been able to test it yet. I will get a chance to test this evening 😄
Seems that a CNI is missing; if there is no CNI, pods will crash and burn. EDIT: Seems that k3s runs flannel as a process, not as a pod.
EDIT: Will try to reproduce on my system to help.
To try to access the Kubernetes service you can use the following command:
Once inside, do a … In my test, pods fail to access the K8s API server, which is a little bit strange, so maybe I am missing something (I just copied the k3s NixOS Wiki entry). In the case of OP, what is strange to me is the SIGTERM at coredns and the local-path provisioner.
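For anyone following along, a minimal sketch of one way to check API-server reachability from inside the cluster (the `curl-test` pod name and the `curlimages/curl` image are illustrative choices, not from the comment above):

```bash
# Rough sketch, not the original command: start a throwaway pod with curl in it.
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- sh

# Once inside that pod, the in-cluster API endpoint should answer
# (-k skips CA verification, which is fine for a quick reachability check):
curl -k https://kubernetes.default.svc.cluster.local/version
```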
Maybe related: #179741
Thanks for the detailed report, @collinarnett! I think I was able to reproduce and work around / fix this on my machine. For me, the problem indeed showed up as "SandboxChanged":
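These entries show up among the pod's events; as a rough sketch of how to see them (the coredns pod name is a placeholder):

```bash
# Hedged sketch: SandboxChanged appears in the pod events, visible via
# `kubectl describe` on the affected pod or in the cluster-wide event list.
kubectl -n kube-system describe pod coredns-<hash>   # <hash> is a placeholder
kubectl get events -A | grep -i sandbox
```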
The logs from the kubelet ended up not being all that useful; however, as soon as I ran … The k3s NixOS module right now adds `cgroup-driver=systemd` as a kubelet argument.
However, containerd for me didn't think it was using the systemd driver, i.e. doing stuff like … Also …

This was already reported on the upstream k3s repo here: k3s-io/k3s#5454. Removing the `cgroup-driver=systemd` kubelet argument fixed it for me.

The issue also suggested another workaround, which also worked (collapsed since it's verbose):

Workaround
(Note, this is just …, but with …)
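For anyone wanting to check which cgroup driver their k3s-managed containerd actually ended up with, a rough sketch (the path below is where k3s writes its generated containerd config by default; adjust if your setup differs):

```bash
# Hedged sketch: if SystemdCgroup is false (or absent) here while the kubelet
# is told to use the systemd driver, the two disagree, which is the mismatch
# described above.
grep -i systemdcgroup /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```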
So, where does that leave us? Well, according to the upstream issue, systemd detection should be fixed soon upstream such that it defaults to the right containerd cgroup driver, which presumably will cause the kubelet to do the right thing too. That "soon", though, is the next release (maybe a month? I don't know the exact release schedule), not the just-released 1.24.3.

In the meanwhile, I think the easiest resolution for us is probably to drop `cgroup-driver=systemd` from the k3s module. The original motivation for it was to match docker's cgroup driver, and docker support has since been dropped from k3s, so it should no longer be needed.

I'll put up a PR with that change, and hopefully that fixes things here! I'm optimistic since I do think I'm observing the same issue in my repro, and this does seem to fix it for me.
Setting `cgroup-driver=systemd` was originally necessary to match with docker, else the kubelet would not start (NixOS#111835). However, since then, docker support has been dropped from k3s (NixOS#177790). As such, this option is much less necessary.

More importantly, it now seems to be actively causing issues. Due to an upstream k3s bug, it's resulting in the kubelet and containerd having different cgroup drivers, which seems to result in some difficult-to-debug failure modes. See NixOS#181790 (comment) for a description of this problem.

Removing this flag entirely seems reasonable to me, and it results in k3s working again on my machine.
@collinarnett Can you test from master now?
First, thanks for coming to save the day. Your expertise is always appreciated. :-) Would you consider some improvements to the k3s testing to avoid this happening again?
$ k get deployments -n kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
local-path-provisioner 1/1 1 1 2m8s
coredns 1/1 1 1 2m8s
metrics-server 1/1 1 1 2m7s
traefik 1/1 1 1 105s

Thank you all for your help 😄 At some point I hope I can help with these issues, since I think Kubernetes on NixOS is the way to go. I might also get a discussion about a Matrix channel started on Discourse, since there seems to be a small cluster of kube-nix people lurking around. If there is no more discussion to be had on this issue, feel free to close it.
If we have network access in NixOS tests, maybe use some greps or jq to check pod status, deploy some NGINX pods, and check connectivity between pods/services, along the lines of the sketch below.
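A rough sketch of that kind of check (the `nginx-test` and `probe` names and the jq filter are illustrative, not an existing test):

```bash
# 1. Wait until every pod is Running (or Succeeded, for completed jobs).
while kubectl get pods -A -o json \
    | jq '[.items[].status.phase] | any(. != "Running" and . != "Succeeded")' \
    | grep -q true; do
  sleep 5
done

# 2. Deploy an NGINX pod behind a service and check connectivity to it
#    from another pod.
kubectl create deployment nginx-test --image=nginx
kubectl expose deployment nginx-test --port=80
kubectl rollout status deployment/nginx-test
kubectl run probe --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sf http://nginx-test.default.svc.cluster.local
```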
Yup, absolutely, we should have tests that would catch this. I keep meaning to write a reasonable multi-node test, but keep not finding the time... If anyone wants to write one, I'd be happy to review, and if not, I'll probably eventually get to it 😅

For this one, I think any test that ran "long enough" would work, since I think the single-node test we have didn't catch it just because it ran so quickly, before the kubelet got the chance to poll and decide to recreate the pod.
We could use the best solution to any race condition: sleep! Seriously now, we can: …
Now, talking about hacks, how "ugly" is it to have network access in NixOS tests?
That ends up being a bit of a challenge, since running base components naively requires networking to download all those images. The existing test right now just skips all those components for that reason: nixpkgs/nixos/tests/k3s-single-node.nix (line 52 at 9731530).
There is a potential solution there though: k3s upstream has an "airgapped images" tarball, and I think we could fetch that into the nix store, which we can then access in the test VM and then …

Anyway, I did get a multi-node test written (over here: #182445), but I don't have much confidence it would actually have caught this issue, though it should catch flannel issues and anything that would crashloop earlier / quicker.
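For reference, a rough sketch of that airgap idea (K3S_AIRGAP_TARBALL is a hypothetical placeholder for the upstream tarball fetched into the nix store, e.g. via fetchurl; the images directory is where k3s looks for tarballs to import at startup):

```bash
# Hedged sketch: k3s imports any image tarball it finds in this directory
# when it starts, so the test VM never needs network access to pull images.
mkdir -p /var/lib/rancher/k3s/agent/images
cp "$K3S_AIRGAP_TARBALL" /var/lib/rancher/k3s/agent/images/k3s-airgap-images-amd64.tar
```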
Describe the bug
This is going to be a long description, since I'm not entirely sure where the bounds of k3s are when it comes to statefulness. I'm currently running the latest version of k3s in nixpkgs, and I am unable to stand up the cluster without all pods failing after helm deploys traefik. I have searched upstream's issues and there doesn't seem to be anything relevant there. I have spent quite a lot of time scouring the logs of both the pods and k3s itself. I also tried reverting the package to a previous commit, and that did not work either.
I can't see anything obvious in the k3s logs. There are quite a few warnings and errors, but I'm not sure which are genuine errors and which are just the kubelet complaining about the state not being ready. The pods don't seem to indicate anything explicit either.
Steps To Reproduce
Here is the current config I have:
Expected behavior
Pods come up in the kube-system namespace.
Screenshots
journalctl
k3s_logs.txt
pods
Additional context
If you want to look at my entire config, it's hosted here: https://github.com/collinarnett/brew
Notify maintainers
@euank
@superherointj
@Mic92
@kalbasit