Updating from v0.3.0 to v0.3.1 causing multi-node network issues #163

Closed

spowelljr opened this issue Feb 16, 2023 · 14 comments · Fixed by #168

Comments

@spowelljr
Contributor

Reference kubernetes/minikube#15870

In kubernetes/minikube#15752 we updated cri-dockerd, which resulted in some consistently failing multi-node network tests, such as DeployApp2Nodes: deploys an app to a multinode cluster and makes sure all nodes can serve traffic

I see in the release notes that PR #147 updated a CNI dependency, which could be related to the problem.

Logs for the failing test can be seen here: https://storage.googleapis.com/minikube-builds/logs/15752/27907/Docker_Linux.html#fail_TestMultiNode/serial/DeployApp2Nodes

@evol262
Contributor

evol262 commented Feb 16, 2023

Can you distill this down? This is an absurdly large amount of logs, and we have no idea whatsoever what your test environment is like. What's the network config? Which CNI? What do the CNI logs say?

It looks like only one pod is getting an address, but it's an absolute unknown what CNI you're using, whether it's finished reconciling, what (if anything) happened to the host OS, whether it's eBPF CNI or overlay or bridge, and so on.

A brief overview of what the environment is supposed to look like is important.

@medyagh

medyagh commented Feb 21, 2023

hello @evol262
here is a summary of what the test is doing, I hope this information answers your question:
1- The test starts a new minikube cluster with two nodes without any additional CNI (default settings)
$ minikube start --wait=true --memory=2200 --nodes=2

2- An app is deployed to both nodes (busybox)

$ kubectl  -- apply -f ./testdata/multinodes/multinode-pod-dns-test.yaml

if you want to see the content of "multinode-pod-dns-test.yaml", here it is:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  labels:
    app: busybox
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        # flaky nslookup in busybox versions newer than 1.28:
        # https://github.com/docker-library/busybox/issues/48
        # note: k8s.gcr.io/e2e-test-images/agnhost:2.32
        # has similar issues (ie, resolves but returns exit code 1)
        image: gcr.io/k8s-minikube/busybox:1.28
        command:
          - sleep
          - "3600"
        imagePullPolicy: IfNotPresent
      restartPolicy: Always
      affinity:
        # ⬇⬇⬇ This ensures pods will land on separate hosts
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions: [{ key: app, operator: In, values: [busybox] }]
            topologyKey: "kubernetes.io/hostname"

3- Then the test verifies that both nodes serve traffic

$ kubectl  -- rollout status deployment/busybox
$ kubectl  -- rollout status deployment/busybox: (1.805356933s)

The first problem is in this verification:

kubectl -p  -- get pods -o jsonpath='{.items[*].status.podIP}'

expected 2 Pod IPs but got 1
and then the test goes on ...

$ kubectl -- get pods -o jsonpath='{.items[*].metadata.name}'
$ kubectl -- exec busybox-6b86dd6d48-8vp4k -- nslookup kubernetes.io

$ kubectl -- exec busybox-6b86dd6d48-8vp4k -- nslookup kubernetes.default.svc.cluster.local: exit status 1 (187.571295ms)
-- stdout --
Server: 10.96.0.10
Address 1: 10.96.0.10

-- /stdout --
** stderr **
nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
command terminated with exit code 1
** /stderr **

I can confirm this has been causing the issue consistently since we upgraded cri-dockerd

@medyagh

medyagh commented Feb 21, 2023

so in summary it seems like we deploy an app on two nodes, but only one node gets the app

@evol262
Contributor

evol262 commented Feb 22, 2023

@medyagh Please actually verify this (leave the test environment up and actually check with the docker CLI or other).

I have no idea what "default settings" means for a CNI. Presumably it is whatever the default is for minikube. We don't ship one. Do you know what CNI is in use? Failure to issue an address falls strictly in the domain of CNI IPAM, not cri-dockerd, and the fact that this is neither reproducible in testing nor reported by anyone else using a version released over a month ago, outside of a "deploy with no CNI to assign addresses" test scenario, leads me to believe the problem lies somewhere there.

I'm happy to help debug, but it really needs a lot more information.

@medyagh

medyagh commented Feb 22, 2023

@evol262 fair enough, the default CNI is actually kindnet (https://github.com/kubernetes-sigs/kind/tree/main/images/kindnetd/cmd/kindnetd)

as seen in the minikube cluster:

kube-system   kindnet-68wxq                      1/1     Running   0             65s
kube-system   kindnet-t69r5                      1/1     Running   0             61s

@evol262
Contributor

evol262 commented Feb 22, 2023

Again, please check the actual nodes. Or the CNI logs. Post back your findings, please. Pod IP assignment is strictly up to the CNI, and establishing whether the CNI thinks it's doing the right thing is crucial before we go ghost hunting to see whether a dep update for a CVE somehow broke kindnet and no other CNIs.

@medyagh

medyagh commented Feb 22, 2023

sounds good, we will do some more investigation on our end to see why this happens on the Debian Linux CI machines, hopefully with more useful logs for you

@prezha
Contributor

prezha commented Mar 23, 2023

hello @evol262, we've investigated this issue further and can share with you our findings and details:

the context:

  • we deploy two pods to two nodes (cp and worker) using anti-affinity rules
  • we then noticed that the pod on the worker node eventually does get an ip address assigned ~40sec after deployment was initiated (and ~30+ sec after the pod on the cp node got an ip address)
  • during the period when the cp's pod has an ip address and the worker's pod still doesn't have one assigned, both pods (and hence also the deployment) are marked as Ready
  • this breaks our tests, as we expect pods to have an ip address if they are Ready (see the sketch after this list)
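
for reference, a minimal sketch of that expectation using client-go - the namespace, label selector, and kubeconfig location are assumptions based on our test setup, not part of the test code itself:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// isReady reports whether the pod's Ready condition is True.
func isReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	// assumes a kubeconfig at the default location pointing at the test cluster
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pods, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=busybox"})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		// the invariant the test relies on: a Ready pod has an assigned IP
		if isReady(pod) && pod.Status.PodIP == "" {
			fmt.Printf("pod %s is Ready but has no IP assigned\n", pod.Name)
		}
	}
}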

the problems we observed:

  1. significant delay before a pod gets an ip address
    note: i think this is actually a minor problem as it might be down to the cni - kindnet in this case; when we tried with other CNIs (and with pods other than busybox), the delay was lower but still present
  2. worker's pod being marked as Ready while not having an ip address assigned (yet) - this will be the focus of further details below

please see the following screenshot capturing the problem:
[screenshot: TestMultinode]

on the left-hand side, red rectangles mark Info messages from the cri-dockerd log - eg:

Mar 19 20:53:16 multinode-877589-m02 cri-dockerd[1059]: time=2023-03-19T20:53:16Z level=info msg="Failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "busybox-6b86dd6d48-ttnwd_default": unexpected command output Device "eth0" does not exist.\n with error: exit status 1"

the corresponding kubelet/PLEG log around that same specific time (2023-03-19T20:53:16Z) is:

Mar 19 20:53:16 multinode-877589-m02 kubelet[1337]: I0319 20:53:16.289327 1337 generic.go:453] "PLEG: Write status" pod="default/busybox-6b86dd6d48-ttnwd" podStatus=&{ID:6fe19c31-5d2c-444e-8562-69f4c31186f4 Name:busybox-6b86dd6d48-ttnwd Namespace:default IPs:[] ContainerStatuses:[] SandboxStatuses:[&PodSandboxStatus{Id:a926fc57856373944f563b0d8ebc1bd3befda7fe42e00932dc6ae0e8b97d91e7,Metadata:&PodSandboxMetadata{Name:busybox-6b86dd6d48-ttnwd,Uid:6fe19c31-5d2c-444e-8562-69f4c31186f4,Namespace:default,Attempt:0,},State:SANDBOX_READY,CreatedAt:1679259195437451087,Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},},Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,UsernsOptions:nil,},},},Labels:map[string]string{app: busybox,io.kubernetes.pod.name: busybox-6b86dd6d48-ttnwd,io.kubernetes.pod.namespace: default,io.kubernetes.pod.uid: 6fe19c31-5d2c-444e-8562-69f4c31186f4,pod-template-hash: 6b86dd6d48,},Annotations:map[string]string{kubernetes.io/config.seen: 2023-03-19T20:53:15.093241411Z,kubernetes.io/config.source: api,},RuntimeHandler:,}] TimeStamp:0001-01-01 00:00:00 +0000 UTC}

similarly, on the right-hand side, highlighted are sections affirming that the pod has no ip address assigned (and no eth0 interface) yet, but it's marked as Ready, which assumes it should be able to communicate - ie, based on the Pod Lifecycle/Pod conditions:

Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.

according to the CRI Networking Specifications Requirements:

RunPodSandbox must set up pod's network. This includes, but is not limited to allocating a pod IP, configuring the pod's network interfaces and default network route. Kubelet expects the pod sandbox to have an IP which is routable within the k8s cluster, if RunPodSandbox returns successfully. RunPodSandbox must return an error if it fails to set up the pod's network. If the pod's network has already been set up, RunPodSandbox must skip network setup and proceed.

so, it looks like it's the cri's responsibility to ensure that the pod has an ip address before marking it as Ready?

i tried to dig a bit deeper, and i might be completely wrong, but sharing anyway - this code segment might be relevant:

v1Response, err = as.ds.RunPodSandbox(ctx, &v1Request)
if v1Response != nil {
	resp := &runtimeapi_alpha.RunPodSandboxResponse{}
	err = utils.V1ResponseToAlphaResponse(v1Response, resp)
	if err == nil {
		res = resp
	}
	return res, err
}
return nil, err

here the v1Response is used before the error is checked, whereas the ds.RunPodSandbox might return both a non-nil error and a non-nil response (perhaps here?)
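
for illustration, here's a minimal sketch of the reordering i mean - not the actual cri-dockerd code, just the quoted fragment with the error check hoisted before the response is used (the enclosing function and its return values are assumed):

v1Response, err := as.ds.RunPodSandbox(ctx, &v1Request)
if err != nil {
	// surface the sandbox error first, even if a partial response was also returned
	return nil, err
}
if v1Response != nil {
	resp := &runtimeapi_alpha.RunPodSandboxResponse{}
	if convErr := utils.V1ResponseToAlphaResponse(v1Response, resp); convErr != nil {
		return nil, convErr
	}
	return resp, nil
}
return nil, nil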

also, the above Info log message comes from cri-dockerd's PodSandboxStatus, which calls getIPs, which has this segment that might be related in the call chain:

// If all else fails, warn but don't return an error, as pod status
// should generally not return anything except fatal errors
logrus.Infof("Failed to read pod IP from plugin/docker: %v", err)
return ips

for completeness, i'm attaching the logs from kubelet (log verbosity 7, to be able to capture PLEG logs there) and cri-dockerd:

i hope these more specific details help (happy to share more, if needed)

what are your thoughts, please?

@evol262
Contributor

evol262 commented Mar 23, 2023

I mean, the short answer here is "yes, but no". It's handed off to the CNI, which deals with IPAM after that, and the responsibility of the CRI is to raise any errors back up the stack, or cancel it after the timeout (220 seconds, as a default).

If the plugin says "ok, done" and returns a result which is not an err, we believe it, and we have no reason not to.

@prezha
Contributor

prezha commented Mar 23, 2023

thanks, @evol262, for your quick reply!

so if SetUpPod returns an error back to the caller - ds.RunPodSandbox - it might still return both a non-nil error and a non-nil response, and then RunPodSandbox might process the response before/without checking whether the error is non-nil (effectively ignoring the error)?

v1Response, err = as.ds.RunPodSandbox(ctx, &v1Request)
if v1Response != nil {
	resp := &runtimeapi_alpha.RunPodSandboxResponse{}
	err = utils.V1ResponseToAlphaResponse(v1Response, resp)
	if err == nil {
		res = resp
	}
	return res, err
}
return nil, err

@prezha
Contributor

prezha commented Mar 23, 2023

btw, our temporary workaround is to wait (retry) for the pod to get an ip address, which eventually happens, whereas ideally we should only rely on the Ready status
not sure whether similar retries should be done on the cni or the cri side though
actually, instead of waiting on the cni/cri side, i think it would be sufficient if the cri returned a non-ready status to kubelet/pleg when the pod does not have an ip assigned - that's probably where PodSandboxStatus could play a role
currently, it only logs an Info message:

// If all else fails, warn but don't return an error, as pod status
// should generally not return anything except fatal errors
logrus.Infof("Failed to read pod IP from plugin/docker: %v", err)
return ips

perhaps checking for IP[s] and [re]setting the status/state if needed between the two blocks here, or moving the ip check before the state is set and then also checking it here - something like the sketch below?
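
to illustrate the idea, a rough sketch of the check - not actual cri-dockerd code; getIPs is the function mentioned above, but hostNetwork and the exact place where state is computed are assumptions:

ips := ds.getIPs(podSandboxID, sandbox)
if state == runtimeapi.PodSandboxState_SANDBOX_READY && len(ips) == 0 && !hostNetwork {
	// no pod IP yet: report the sandbox as not ready so kubelet/PLEG does not
	// mark the pod Ready before the CNI has finished assigning an address
	state = runtimeapi.PodSandboxState_SANDBOX_NOTREADY
}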

@prezha
Contributor

prezha commented Mar 24, 2023

@evol262 if the above makes sense, i've created a pr #168 that should add that check

@evol262
Contributor

evol262 commented Mar 25, 2023

Thanks! Merged.

I'll cut a release early next week

@prezha
Contributor

prezha commented Mar 25, 2023

great, thanks!
