Use CRI to obtain pod sandbox IDs instead of Kubernetes API #714
Conversation
Force-pushed a6c8a1c to 0886166
@mogren we've been running this on our staging cluster (on CRI-O) for a few days and it's been smooth sailing. I've manually tested that it works with Docker as well. There are a few more edge cases to think about before this is ready to review, but I did have a couple of questions for you:
@drakedevel Hi, sorry for the slow reply, I'll go through this PR today. I agree that it's fine to have the daemonset set that up. I'm also not sure about the Docker UID issue, @jaypipes, do you have some insight?
@drakedevel I built and ran some tests using these changes, but I could only get pods to schedule on the first two ENIs that were already attached to the node. I triggered the issue by scheduling 150 pods on 3 m5.xlarge nodes that should be able to handle 56 pods each. The logs had a lot of errors from the pods that failed to schedule, looking like:
I guess the issue is that since I schedule 50 on each node but they only have around 28 available IPs, the first 28 pods schedule fine, but the 29th gets an error back from ipamd because the data_store has no more free IPs, and then the kubelet tries to clean up 10 times. I switched back to v1.6.0-rc4, and then things worked just fine again. Will try to do some more tests.
@mogren Thanks for trying it out! Did you cherry-pick onto release-1.6, or did you run this commit directly? (This branch is on top of master, specifically d73d118) Our staging cluster is actually ENI-limited; every node in that cluster is using more than one ENI, a handful have used every available EIP. So I'm a little puzzled by that result :P That rebase I did right before I pinged you was intended to pull in the integration test scripts with no functional changes (although I didn't end up getting those tests working), but I just realized d73d118 is itself a significant functional change that I missed while rebasing. I haven't tested that commit, and it isn't on the release-1.6 branch -- is it possibly related? I'll spin up an EKS cluster to test with and see if I can repro either way.
@mogren Reproduced on an EKS cluster. Oddly, one of the nodes stalled at 14 pods, the other at 28, using 1 and 2 ENIs respectively (despite having all 3 in I tried deploying a build of 7eab704 (the original root of this branch on master) and actually got the same result! So I think master is just broken in default EKS, and this PR isn't involved (unless it somehow got the nodes into a bad state). I don't have any theories why it works on our staging cluster 😛 The main differences are the Kubernetes version (1.16 on staging, 1.14 on EKS) and network config (our staging cluster uses ENIConfig CRDs). I'll rebase this PR against release-1.6 for now to keep things easier to test.
Force-pushed 0886166 to f5daa9e
Rebased, and the new image ( I'll roll that image out on our staging cluster to get it baking there. There's not a lot of diff between release-1.6 and 7eab704; the only relevant functional changes appear to be c74841a and 1ee59a0.
@drakedevel Thanks a lot for testing this. I'll do some more testing as well to try and figure out when this happens. I've been terminating the nodes completely between the tests. Next I'll try with just restarting the aws-node pods. Something is definitely going on here... Btw, I tested by running a build of your commit.
@mogren just to clarify in case my comments were confusing: I was able to reproduce the issue you described on this branch, but also on master unmodified. I've since rebased this branch onto release-1.6 and I can no longer reproduce that issue, so I think it's an issue with the state of master rather than this PR. (Either way the issue only reproduces on EKS for me -- both master and release-1.6 seem to work fine on our staging cluster.) The build of this commit (f5daa9e) should be suitable for testing again now that it's been rebased.
I got some time today to dig into the open questions I had in this PR, and found some interesting results! The main insight was discovering that the Docker

It seems like the main reason for the UID collision check (which is also present in

We only care about living pods, though, since dead pods don't need IPs. It's still technically possible to get multiple ready sandboxes for the same pod from the CRI, but it appears to be very much an edge case. The easiest way to trigger it is to manually create a sandbox with the appropriate labels, but it may be possible under various sorts of CRI bugs (e.g. if it fails to actually kill a previous iteration of the sandbox). Kubelet has explicit handling for this case in the pod sync loop, and will respond by killing all of the ready sandboxes and making a fresh one.

I would say for now the best way to handle multiple ready sandboxes with the same UID on startup is to just crash -- it's an unusual case, and kubelet should be cleaning it up, so by the time we restart there's a good chance it'll be fixed. That's what the Docker code does (although the check for

I haven't come up with any way besides a manually-crafted sandbox that would lead to two sandboxes having the same pod UID but different pod names/namespaces. I think the safest thing would still be to bail out in that case, since there's nothing sensible we can hope to make out of that state.

Future work

I think this provides more evidence in favor of the strategy to avoid the Kubernetes API entirely for pod details. IPAMD gets told that a sandbox is being created/destroyed, and allocates/frees an IP in response. On startup, it checks if there are any changes that it missed notifications for and does the same thing. Everything needed to support this is available from the CRI, as I mentioned in the issue thread. What's clearer now is that there's significant complexity we're exposed to because we try to think about the sandboxes in Pod terms. Sandboxes aren't 1:1 with pods, but if we were oblivious to pods as a concept, we wouldn't have to care. We could leave the job of reconciling the CRI with the Kubernetes API to kubelet 😛

Concretely, that would mean treating the sandbox ID as the single identity to associate with an IP in the datastore, and treating any information like pod name/namespace/uid as a best-effort logging and debugging aid. It should simplify things a lot. And, while I'm sure it's not a high priority goal (if it's a goal at all), a free side-effect of that refactor would be that this CNI would work in non-Kubernetes environments (minus custom network config).
@drakedevel Thanks again for really researching this issue! We should definitely keep track of the namespace as well. I'll rebuild with these latest changes and run some more tests.
Really good stuff, @drakedevel, appreciate this PR very much.
I have a couple minor nits and suggestions inline, but nothing major.
ipamd/ipamd.go
Outdated
@@ -420,6 +423,27 @@ func (c *IPAMContext) getLocalPodsWithRetry() ([]*k8sapi.K8SPodInfo, error) {
		return nil, nil
	}

	var containers map[string]*cri.ContainerInfo
Might be nice to put a comment above this block that explains why you're doing this. Something like this might work...
// Ask CRI for the locally-running containers and attempt to set the
// Pod.Container field to the value of the *sandbox* container. This
// will ensure that we don't unassign IPs from Pods due to looking up
// Pods with the wrong container ID.
Will do -- should hopefully be clearer once the sandbox rename is done, too.
Worth noting that this code was largely pre-existing -- I restored it in the first commit with a revert, and modified it to use the cri package instead of docker, but didn't otherwise touch it (e.g. the TODO was there before).
Added a comment and tidied up the other comment to be a bit clearer.
Added a comment and tidied up the other comment to be a bit clearer.
pkg/cri/cri.go
Outdated
	return &Client{}
}

func (c *Client) GetRunningContainers() (map[string]*ContainerInfo, error) {
nit: might be better to name this GetRunningSandboxContainers() to make it clearer at the calling point what this is actually doing. :)
Good idea! For consistency with Kubernetes and the CRI (and thus less confusion hopefully), I think it's probably better to consistently refer to them as "pod sandboxes." The fact that a sandbox is a container under the hood is a common implementation choice but not universal: e.g. under Kata Containers a sandbox is a full-blown VM in which other containers run.
I'm renaming things now to use the pod sandbox terminology -- let me know if you'd prefer a different name.
This reverts commit 14de538.
@drakedevel When I tried this change using Docker, I saw that existing pods never got removed from the data_store when they were deleted. I haven't really figured out why the containerIDs don't match yet; I've been busy with some other issues. Will try to get back to testing ASAP.
Addressed CR comments, included two small changes in separate commits, as well:
I'm not sure what tests you were running @mogren but is it possible these are the cause of the issue you were seeing? The behavior you described is what I would expect to see if the tests ran without a valid
This is looking excellent, thanks very much @drakedevel. I left one question in there, but it's about a corner case and not something I would hold this PR up for.
The other question I had was around this:
amazon-vpc-cni-k8s/config/v1.5/aws-k8s-cni.yaml
Lines 129 to 131 in 3bc194d
      - name: dockersock
        hostPath:
          path: /var/run/docker.sock
With use of CRI instead of Docker, would that still be necessary?
	log.Tracef("Update for pod %s: %+v, %+v", podName, pod.Status, pod.Spec)

	// Save pod info
	d.workerPods[key] = &K8SPodInfo{
		Name:      podName,
		Namespace: pod.GetNamespace(),
		Container: containerID,
Can the Pod's K8S_POD_INFRA_CONTAINER_ID ever change? If so, we may need to update the K8SPodInfo.Sandbox attribute accordingly.
I don't think we care about the Sandbox field here, specifically. The only use of this data in IPAMD is during nodeInit, where we extract a copy and then promptly overwrite the Sandbox field with data from the CRI. I'm pretty sure we never look at it again; I don't know why we even keep the watch active. (Per #714 (comment), there's really no reason for the K8S Pod API to be involved even as a one-shot.)

More generally, the sandbox ID for a Pod can change (more on that later), but the CNI/CRI side of things doesn't care. The ipamd/datastore PodKey store includes the sandbox ID, so we'll treat a sandbox change as a completely unrelated pod. It'll look exactly like one pod getting shut down and another pod getting started up. The fact that they were "the same pod" in some sense doesn't affect anything.
As to how a sandbox ID for a pod can change:

The key value here comes from MetaNamespaceKeyFunc, which just returns a namespace/name string. The sandbox ID associated with that can very easily change -- restarting a pod that's part of a StatefulSet will produce a new pod with the same namespace and name, but a different sandbox (and a different UID).

A less obvious way it can happen is if the sandbox gets re-created. This can happen if the sandbox process gets killed, as well as in some more obscure scenarios which I'm not sure how to trigger (e.g. pod spec changes, which are normally immutable but there's code to handle anyway). In that case, you'll get the same name, namespace, and UID; the only change is the Attempt field in the CRI PodSandboxMetadata structure, which is not exposed in the Kubernetes API.
Because information about the sandbox is generally not exposed through the API, the only observable effects of a sandbox re-creation are that all of the normal containers restart. There's no way to distinguish that the cause of the restart was a sandbox change.
If we treat the CNI/CRI as the source of truth then the complexity around how and when Kubelet manages various obscure sync edge-cases just disappears completely. But per #712 (comment), saving that for a follow-up PR 😄
@drakedevel Sorry for my late reply. I tested this last night, and scaling up worked with no issues this time. I did see some problems when changing the config and restarting the CNI in the middle of scaling up; the pod got stuck in a crash loop. I didn't get the logs since I was in the middle of something else, but I think the cause is the second change you mentioned earlier. I wonder why the sandbox ID is no longer matching...
That socket is no longer necessary, but you'll need a CRI socket. We're using a manifest that looks like this:

    ...
    volumeMounts:
      ...
      - mountPath: /var/run/cri.sock
        name: crisock
    ...
    volumes:
      ...
      - name: crisock
        hostPath:
          path: /var/run/dockershim.sock
          type: Socket
That path on the host varies with the CRI in use; for CRI-O it's
Not sure of your PR label conventions but based on the name I'm not sure
@mogren The second change will cause crashes under three conditions (after several retries):
All of these have different errors/logs, so those logs would be very useful for debugging 🙂 The first two are most likely transient (if the CRI crashed or got restarted or something), so if it keeps happening it seems likely caused by a misconfiguration of some sort, somewhere. The latter would be very weird, and it suggests a kubelet bug if it persists, since I'm pretty sure the pod sync loop will correct that situation.

It's also possible I misread or miswrote some code somewhere, so any logs/configs/etc. you can share would be very helpful in tracking the issue down 😛 This has been rock-solid on our CRI-O staging cluster, and I haven't been able to reproduce any issues on Docker while torture-testing on EKS, so I'm a little bit stuck on debugging without more to go on.
@drakedevel Thanks a lot for all the work on this, looks great. So far I haven't seen any new issues, but I'll keep running tests before making the next v1.6.0 release candidate.
Awesome work on this, @drakedevel. Nudging this on its merry way (and changed from enhancement to priority/P0 bug fix)
* Revert "fix: get container ID from kube rather than docker (aws#371)". This reverts commit 14de538.
* go mod tidy
* Add initial proof-of-concept implementation of CRI support.
* Update CRI socket path to /var/run/cri.sock
* Filter ready sandboxes and abort if there's a pod UID conflict.
* Revert "fix: get container ID from kube rather than docker (aws#371)". This reverts commit 14de538.
* Address review comments, refactor to use pod sandbox nomenclature consistently.
* Bail if we can't retrieve local pod sandboxes on startup.
Issue #, if available: #712
Description of changes:
Reverts #371, and then replaces the use of the Docker API with the CRI. The revert is in a separate commit, to make it easier to see my changes.
Work in progress: I've done basic testing to ensure the values returned by pkg/cri are reasonable, unit tests pass, etc. I've tested it briefly on a Docker staging cluster, and for several days on a CRI-O staging cluster, and all is smooth so far. There's still a couple of edge cases to think about and TODOs.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.