
Pod unable to reach kube API: i/o timeout #272

Closed
xrl opened this issue Dec 18, 2018 · 10 comments
Labels
needs investigation priority/P1 Must be staffed and worked currently or soon. Is a candidate for next release

Comments


xrl commented Dec 18, 2018

I am running the nginx-ingress controller, which needs to talk to the kube API to get virtual-host config information. I have 3 replicas; 2 of them are working and handling my production traffic, but 1 is in a restart loop, unable to reach the kube API. It complains:

$ k -n nginx-ingress logs nginx-ingress-6b597b96c6-dm4nx --previous
I1218 03:07:49.404026       1 main.go:92] Starting NGINX Ingress controller Version=1.3.0 GitCommit=b8acd634
F1218 03:08:19.406232       1 main.go:147] Error when getting nginx-ingress/default-server-secret: Get https://10.43.168.1:443/api/v1/namespaces/nginx-ingress/secrets/default-server-secret: dial tcp 10.43.168.1:443: i/o timeout
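The failing dial in the log is a plain TCP connect to the kube API service VIP. It can be reproduced from inside the affected pod's network namespace with a minimal connect test (a sketch; the helper name is mine):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """TCP connect test, mirroring the dial the controller performs."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # timeout, refused, unreachable, ...
        return False

# Run from inside the affected pod (e.g. via `kubectl exec`) to confirm the
# service VIP really is unreachable from its network namespace:
#   can_reach("10.43.168.1", 443)  # expected False while the pod is stuck
```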

I know this issue can be rectified with k delete pod $POD, but I am seeing this error routinely throughout my system: something like 5% of my pods fall into this routability trap. They are unable to send out messages, yet their software may not die immediately, so the pod looks OK when it is actually stuck.
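The manual `k delete pod $POD` workaround can be scripted. A rough sketch (helper name is mine) that picks the stuck pods out of `kubectl get pods --all-namespaces -o json`, each of which could then be deleted so it is rescheduled with a fresh network namespace:

```python
def crashlooping_pods(pods_json: dict) -> list:
    """Return (namespace, name) for pods with a container in CrashLoopBackOff.

    pods_json is the parsed output of:
        kubectl get pods --all-namespaces -o json
    """
    stuck = []
    for item in pods_json.get("items", []):
        for cs in item.get("status", {}).get("containerStatuses") or []:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                meta = item["metadata"]
                stuck.append((meta["namespace"], meta["name"]))
                break
    return stuck

# For each (ns, name) returned:
#   kubectl -n <ns> delete pod <name>
```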

To continue gathering information:

$ k -n nginx-ingress describe pod nginx-ingress-6b597b96c6-dm4nx
Name:           nginx-ingress-6b597b96c6-dm4nx
Namespace:      nginx-ingress
Node:           ip-10-43-169-250.ec2.internal/10.43.169.250
Start Time:     Mon, 17 Dec 2018 20:09:39 -0400
Labels:         app=nginx-ingress
                pod-template-hash=2615365272
Annotations:    <none>
Status:         Running
IP:             10.43.168.164
Controlled By:  ReplicaSet/nginx-ingress-6b597b96c6
Containers:
  nginx-ingress:
    Container ID:  docker://a757c37787c2ed64cbc5b655f0e7ac4788302db7ce9dc907347a9bc0637f3531
    Image:         nginx/nginx-ingress:1.3.0
    Image ID:      docker-pullable://nginx/nginx-ingress@sha256:87df118b4a8032a755e8498c6137872498e1ee35d58ade97a09fee05bb13dfcb
    Ports:         80/TCP, 443/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      -nginx-configmaps=$(POD_NAMESPACE)/nginx-configmap
      -default-server-tls-secret=$(POD_NAMESPACE)/default-server-secret
      -report-ingress-status
      -v=0
      -logtostderr
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 17 Dec 2018 23:07:49 -0400
      Finished:     Mon, 17 Dec 2018 23:08:19 -0400
    Ready:          False
    Restart Count:  36
    Environment:
      POD_NAMESPACE:  nginx-ingress (v1:metadata.namespace)
      POD_NAME:       nginx-ingress-6b597b96c6-dm4nx (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-token-sb8xb (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  nginx-ingress-token-sb8xb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-token-sb8xb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                     From                                    Message
  ----     ------   ----                    ----                                    -------
  Normal   Pulled   29m (x31 over 178m)     kubelet, ip-10-43-169-250.ec2.internal  Container image "nginx/nginx-ingress:1.3.0" already present on machine
  Warning  BackOff  4m41s (x708 over 178m)  kubelet, ip-10-43-169-250.ec2.internal  Back-off restarting failed container

The kube node ip-10-43-169-250.ec2.internal is running a variety of pods; some of them work just fine.

Name:               ip-10-43-169-250.ec2.internal
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.4xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1c
                    kops.k8s.io/instancegroup=nodes
                    kubernetes.io/hostname=ip-10-43-169-250.ec2.internal
                    kubernetes.io/role=node
                    node-role.kubernetes.io/node=
Annotations:        kubernetes-ec2-srcdst-controller.ottoyiu.com/srcdst-check-disabled: true
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 03 Dec 2018 21:14:07 -0400
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Mon, 17 Dec 2018 23:15:58 -0400   Mon, 03 Dec 2018 21:14:36 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Mon, 17 Dec 2018 23:15:58 -0400   Mon, 03 Dec 2018 21:14:36 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 17 Dec 2018 23:15:58 -0400   Mon, 03 Dec 2018 21:14:36 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 17 Dec 2018 23:15:58 -0400   Mon, 03 Dec 2018 21:14:36 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 17 Dec 2018 23:15:58 -0400   Mon, 03 Dec 2018 21:15:47 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.43.169.250
  InternalDNS:  ip-10-43-169-250.ec2.internal
  Hostname:     ip-10-43-169-250.ec2.internal
Capacity:
 cpu:                16
 ephemeral-storage:  245675440Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             64473092Ki
 pods:               110
Allocatable:
 cpu:                16
 ephemeral-storage:  226414485130
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             64370692Ki
 pods:               110
System Info:
 Machine ID:                 ec2ae4f604dec1a2f41bfa37aa7b8598
 System UUID:                EC2AE4F6-04DE-C1A2-F41B-FA37AA7B8598
 Boot ID:                    5ec0605c-9a3f-48b3-bcb4-064509b0fc68
 Kernel Version:             4.9.0-7-amd64
 OS Image:                   Debian GNU/Linux 9 (stretch)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.10.7
 Kube-Proxy Version:         v1.10.7
ProviderID:                  aws:///us-east-1c/i-08239c90775af14c4
Non-terminated Pods:         (15 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                         ------------  ----------  ---------------  -------------  ---
  elasticsearch              metrics-dev-kibana-7584d95669-rg4hg          0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h8m
  elasticsearch              metrics-preprod-telegraf-85495c8b48-j857j    0 (0%)        0 (0%)      0 (0%)           0 (0%)         5h2m
  kube-logs                  coordinator-0                                0 (0%)        0 (0%)      10Gi (16%)       10Gi (16%)     3h8m
  kube-system                aws-node-9gjbf                               10m (0%)      0 (0%)      0 (0%)           0 (0%)         14d
  kube-system                filebeat-927zt                               100m (0%)     0 (0%)      100Mi (0%)       200Mi (0%)     16m
  kube-system                kube-proxy-ip-10-43-169-250.ec2.internal     100m (0%)     0 (0%)      0 (0%)           0 (0%)         14d
  kube-system                metrics-server-54458b9f58-98h5g              0 (0%)        0 (0%)      0 (0%)           0 (0%)         161m
  logstash                   idb-esvt-to-es-preprod-1                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
  logstash                   idb-omnibus-events-to-es-dev-0               0 (0%)        0 (0%)      0 (0%)           0 (0%)         11d
  logstash                   idb-omnibus-events-to-es-preprod-1           0 (0%)        0 (0%)      0 (0%)           0 (0%)         9d
  logstash                   idb-syslogs-to-es-dev-0                      2 (12%)       4 (25%)     8Gi (13%)        8Gi (13%)      4d12h
  logstash                   idb-syslogs-to-es-preprod-1                  2 (12%)       4 (25%)     8Gi (13%)        8Gi (13%)      14d
  logstash                   idb-threshold-events-to-es-preprod-1         0 (0%)        0 (0%)      0 (0%)           0 (0%)         11d
  monitoring                 node-exporter-72lc5                          112m (0%)     122m (0%)   200Mi (0%)       220Mi (0%)     13d
  nginx-ingress              nginx-ingress-6b597b96c6-dm4nx               0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h8m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                4322m (27%)    8122m (50%)
  memory             26924Mi (42%)  27044Mi (43%)
  ephemeral-storage  0 (0%)         0 (0%)
Events:              <none>

The aws-node pod on that kube node doesn't report any errors:

$ k -n kube-system logs aws-node-9gjbf
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: W1204 01:15:36.071671      12 client_config.go:533] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.

The AWS VPC CNI is configured as follows:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: eniconfigs.crd.k8s.amazonaws.com
spec:
  scope: Cluster
  group: crd.k8s.amazonaws.com
  version: v1alpha1
  names:
    plural: eniconfigs
    singular: eniconfig
    kind: ENIConfig
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"k8s-app":"aws-node"},"name":"aws-node","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"aws-node"}},"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":""},"labels":{"k8s-app":"aws-node"}},"spec":{"containers":[{"env":[{"name":"AWS_VPC_K8S_CNI_LOGLEVEL","value":"DEBUG"},{"name":"MY_NODE_NAME","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}}],"image":"602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.0.0","name":"aws-node","resources":{"requests":{"cpu":"10m"}},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/host/opt/cni/bin","name":"cni-bin-dir"},{"mountPath":"/host/etc/cni/net.d","name":"cni-net-dir"},{"mountPath":"/host/var/log","name":"log-dir"},{"mountPath":"/var/run/docker.sock","name":"dockersock"}]}],"hostNetwork":true,"serviceAccountName":"aws-node","tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"},{"key":"CriticalAddonsOnly","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/opt/cni/bin"},"name":"cni-bin-dir"},{"hostPath":{"path":"/etc/cni/net.d"},"name":"cni-net-dir"},{"hostPath":{"path":"/var/log"},"name":"log-dir"},{"hostPath":{"path":"/var/run/docker.sock"},"name":"dockersock"}]}},"updateStrategy":{"type":"RollingUpdate"}}}
  creationTimestamp: null
  generation: 1
  labels:
    k8s-app: aws-node
  name: aws-node
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/aws-node
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        k8s-app: aws-node
    spec:
      containers:
      - env:
        # - name: AWS_VPC_K8S_CNI_LOGLEVEL
        #   value: DEBUG
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
        imagePullPolicy: IfNotPresent
        name: aws-node
        resources:
          requests:
            cpu: 10m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /host/var/log
          name: log-dir
        - mountPath: /var/run/docker.sock
          name: dockersock
      dnsPolicy: ClusterFirst
      hostNetwork: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: aws-node
      serviceAccountName: aws-node
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - key: CriticalAddonsOnly
        operator: Exists
      - key: "ids.purpose"
        operator: "Equal"
        value: "elasticsearch"
        effect: "NoSchedule"
      volumes:
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log
          type: ""
        name: log-dir
      - hostPath:
          path: /var/run/docker.sock
          type: ""
        name: dockersock
  templateGeneration: 6
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

I am running Kubernetes 1.10.7 and I use kops 1.10 to manage my cluster in AWS.

The network is laid out with CloudFormation. It's a small VPC (10.43.168.0/21), and the kube subnets are 10.43.168.0/23, 10.43.170.0/23, and 10.43.172.0/23.
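A quick sanity check (using the stdlib ipaddress module) that these subnets actually fit inside the /21 without overlapping, with the classic/badlands /25s from the template below included:

```python
import ipaddress

# Values copied from the CloudFormation template.
vpc = ipaddress.ip_network("10.43.168.0/21")
subnets = [ipaddress.ip_network(s) for s in [
    "10.43.168.0/23", "10.43.170.0/23", "10.43.172.0/23",   # kube
    "10.43.174.0/25", "10.43.174.128/25",
    "10.43.175.0/25", "10.43.175.128/25",                   # classic/badlands
]]

# Every subnet must sit inside the VPC CIDR...
assert all(s.subnet_of(vpc) for s in subnets)
# ...and no two subnets may overlap.
for i, a in enumerate(subnets):
    for b in subnets[i + 1:]:
        assert not a.overlaps(b)

# Raw address count across the three kube /23s (AWS reserves a few
# addresses per subnet on top of this).
print(sum(s.num_addresses for s in subnets[:3]))  # 1536
```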

from troposphere import Ref, Template, ec2

t = Template(Description="IDS Network")
vpc = ec2.VPC(
    "vpc",
    CidrBlock="10.43.168.0/21",
    EnableDnsSupport=True,
    EnableDnsHostnames=True,
    Tags=[
        {"Key": "Name",
         "Value": "IDS VPC"}
    ]
)
t.add_resource(vpc)

route_table = ec2.RouteTable(
    "routetable",
    VpcId=Ref("vpc"),
    Tags=[
      {"Key": "Name",
       "Value": "IDS Route Table"}
    ]
)
t.add_resource(route_table)

gateway = ec2.VPNGateway(
  "gateway",
  Type="ipsec.1",
  Tags=[
    {"Key": "Name",
     "Value": "IDS Virtual Private Gateway"}
  ]
)
t.add_resource(gateway)

# route_propagation = ec2.VPNGatewayRoutePropagation(
#   "routepropagation",
#   RouteTableIds=[Ref(route_table)],
#   VpnGatewayId=Ref(vpc),
#   DependsOn=[route_table.name,gateway.name]
# )
# t.add_resource(route_propagation)

t.add_resource(ec2.VPCGatewayAttachment(
  "vpcgatewayattachment",
  VpcId=Ref("vpc"),
  VpnGatewayId=Ref("gateway")
))

# Cloudformation-managed subnets must be tagged so Kops can
# find them:
# https://github.com/kubernetes/kops/blob/master/docs/run_in_existing_vpc.md#advanced-options-for-creating-clusters-in-existing-vpcs
subnets = [
  ("10.43.168.0/23", "us-east-1c", "kube1"),
  ("10.43.170.0/23", "us-east-1d", "kube2"),
  ("10.43.172.0/23", "us-east-1f", "kube3"),
  ("10.43.174.0/25", "us-east-1c", "classic1"),
  ("10.43.174.128/25", "us-east-1d", "classic2"),
  ("10.43.175.0/25", "us-east-1f", "classic3"),
  ("10.43.175.128/25", "us-east-1f", "badlands")
]

for (cidrblock,az,name) in subnets:
    tags = [{"Key": "Name",
             "Value": "IDS %s" % name}]
    if "kube" in name:
        tags += [
          {"Key": "kubernetes.io/cluster/kube.ids.io",
           "Value": "shared"},
          {"Key": "kubernetes.io/role/internal-elb",
           "Value": "1"},
          {"Key": "SubnetType",
           "Value": "Private"},
        ]
    subnet = ec2.Subnet(
        "useast1%s" % name,
        AvailabilityZone=az,
        CidrBlock=cidrblock,
        MapPublicIpOnLaunch=False,
        Tags=tags,
        VpcId=Ref("vpc")
    )
    t.add_resource(subnet)

    t.add_resource(ec2.SubnetRouteTableAssociation(
      "useast1%sroutetableassoc" % name,
      RouteTableId=Ref("routetable"),
      SubnetId=Ref(subnet)
    ))

About the kube nodes: all of them are m5.4xlarge or r5.2xlarge, so they should have plenty of IPs to dole out. I am running 200 pods across 12 kube nodes with general workloads and 12 kube nodes with 2-5 pods each.
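A back-of-the-envelope check supports this. Under the VPC CNI, pod IP capacity per node is roughly max ENIs times (IPv4 addresses per ENI minus one, since each ENI keeps its own primary address). The per-type limits below are assumptions taken from AWS's published ENI limits (the `eni_max 8` in the metrics further down is consistent with m5.4xlarge); verify them against current documentation:

```python
# (max ENIs, IPv4 addresses per ENI) -- assumed AWS limits, verify.
ENI_LIMITS = {
    "m5.4xlarge": (8, 30),
    "r5.2xlarge": (4, 15),
}

def pod_ip_capacity(instance_type: str) -> int:
    """Rough pod IP capacity of a node under the AWS VPC CNI."""
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    # One address per ENI is the ENI's own primary address.
    return enis * (ips_per_eni - 1)

for t in ENI_LIMITS:
    print(t, pod_ip_capacity(t))
```

Even the smaller r5.2xlarge should comfortably hold the 2-5 pods reported per node, so raw IP exhaustion looks unlikely.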

Kernel logs from the kube-node show a lot of activity on the eth devices (related to the restarts?):

Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.673144] IPVS: Creating netns size=2104 id=545
Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.708047] IPVS: Creating netns size=2104 id=546
Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.803933] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.851904] IPv6: ADDRCONF(NETDEV_UP): eni34a07cedb06: link is not ready
Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.854319] IPv6: ADDRCONF(NETDEV_CHANGE): eni34a07cedb06: link becomes ready
Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.855476] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 18 00:09:39 ip-10-43-169-250 kernel: [1205731.856901] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 18 00:09:40 ip-10-43-169-250 kernel: [1205731.900564] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 18 00:09:42 ip-10-43-169-250 kernel: [1205734.821726] IPVS: Creating netns size=2104 id=547
Dec 18 00:09:43 ip-10-43-169-250 kernel: [1205734.939264] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 18 00:09:43 ip-10-43-169-250 kernel: [1205734.940301] IPv6: ADDRCONF(NETDEV_UP): eni4b147937e89: link is not ready
Dec 18 00:09:43 ip-10-43-169-250 kernel: [1205734.941280] IPv6: ADDRCONF(NETDEV_CHANGE): eni4b147937e89: link becomes ready
Dec 18 00:09:43 ip-10-43-169-250 kernel: [1205734.942320] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 18 00:21:34 ip-10-43-169-250 kernel: [1206446.077789] IPVS: Creating netns size=2104 id=548
Dec 18 00:21:34 ip-10-43-169-250 kernel: [1206446.190435] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 18 00:21:34 ip-10-43-169-250 kernel: [1206446.243819] IPv6: ADDRCONF(NETDEV_UP): eni43a9e7a4be0: link is not ready
Dec 18 00:21:34 ip-10-43-169-250 kernel: [1206446.245619] IPv6: ADDRCONF(NETDEV_CHANGE): eni43a9e7a4be0: link becomes ready
Dec 18 00:21:34 ip-10-43-169-250 kernel: [1206446.246793] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 18 00:36:19 ip-10-43-169-250 kernel: [1207331.709655] IPVS: Creating netns size=2104 id=549
Dec 18 00:36:19 ip-10-43-169-250 kernel: [1207331.868125] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 18 00:36:20 ip-10-43-169-250 kernel: [1207331.911918] IPv6: ADDRCONF(NETDEV_UP): eni4a26bfebb3d: link is not ready
Dec 18 00:36:20 ip-10-43-169-250 kernel: [1207331.913466] IPv6: ADDRCONF(NETDEV_CHANGE): eni4a26bfebb3d: link becomes ready
Dec 18 00:36:20 ip-10-43-169-250 kernel: [1207331.914525] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 18 03:01:07 ip-10-43-169-250 kernel: [1216019.206013] IPVS: Creating netns size=2104 id=550
Dec 18 03:01:07 ip-10-43-169-250 kernel: [1216019.336390] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 18 03:01:07 ip-10-43-169-250 kernel: [1216019.387921] IPv6: ADDRCONF(NETDEV_UP): eni6fb90311837: link is not ready
Dec 18 03:01:07 ip-10-43-169-250 kernel: [1216019.389467] IPv6: ADDRCONF(NETDEV_CHANGE): eni6fb90311837: link becomes ready
Dec 18 03:01:07 ip-10-43-169-250 kernel: [1216019.390581] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Some debugging output from the aws-node running on the kube node hosting the failed nginx-ingress pod:

# curl -s http://localhost:61678/v1/enis | python -m json.tool
{
    "AssignedIPs": 12,
    "ENIIPPools": {
        "eni-063ebbe7bf2157386": {
            "AssignedIPv4Addresses": 11,
            "DeviceNumber": 0,
            "ID": "eni-063ebbe7bf2157386",
            "IPv4Addresses": {
                "10.43.168.10": {
                    "Assigned": false
                },
                "10.43.168.122": {
                    "Assigned": true
                },
                "10.43.168.123": {
                    "Assigned": false
                },
                "10.43.168.134": {
                    "Assigned": true
                },
                "10.43.168.136": {
                    "Assigned": false
                },
                "10.43.168.172": {
                    "Assigned": true
                },
                "10.43.168.214": {
                    "Assigned": true
                },
                "10.43.168.219": {
                    "Assigned": false
                },
                "10.43.168.235": {
                    "Assigned": true
                },
                "10.43.168.246": {
                    "Assigned": true
                },
                "10.43.168.39": {
                    "Assigned": false
                },
                "10.43.168.47": {
                    "Assigned": true
                },
                "10.43.168.63": {
                    "Assigned": false
                },
                "10.43.168.94": {
                    "Assigned": false
                },
                "10.43.168.98": {
                    "Assigned": true
                },
                "10.43.169.11": {
                    "Assigned": false
                },
                "10.43.169.133": {
                    "Assigned": true
                },
                "10.43.169.149": {
                    "Assigned": false
                },
                "10.43.169.150": {
                    "Assigned": false
                },
                "10.43.169.169": {
                    "Assigned": false
                },
                "10.43.169.202": {
                    "Assigned": true
                },
                "10.43.169.21": {
                    "Assigned": true
                },
                "10.43.169.245": {
                    "Assigned": false
                },
                "10.43.169.254": {
                    "Assigned": false
                },
                "10.43.169.35": {
                    "Assigned": false
                },
                "10.43.169.68": {
                    "Assigned": false
                },
                "10.43.169.69": {
                    "Assigned": false
                },
                "10.43.169.76": {
                    "Assigned": false
                },
                "10.43.169.97": {
                    "Assigned": false
                }
            },
            "IsPrimary": true
        },
        "eni-0b8d23a1eee7d04a6": {
            "AssignedIPv4Addresses": 1,
            "DeviceNumber": 2,
            "ID": "eni-0b8d23a1eee7d04a6",
            "IPv4Addresses": {
                "10.43.168.143": {
                    "Assigned": false
                },
                "10.43.168.164": {
                    "Assigned": true
                },
                "10.43.168.165": {
                    "Assigned": false
                },
                "10.43.168.173": {
                    "Assigned": false
                },
                "10.43.168.175": {
                    "Assigned": false
                },
                "10.43.168.198": {
                    "Assigned": false
                },
                "10.43.168.23": {
                    "Assigned": false
                },
                "10.43.168.27": {
                    "Assigned": false
                },
                "10.43.168.40": {
                    "Assigned": false
                },
                "10.43.168.53": {
                    "Assigned": false
                },
                "10.43.168.71": {
                    "Assigned": false
                },
                "10.43.168.73": {
                    "Assigned": false
                },
                "10.43.168.79": {
                    "Assigned": false
                },
                "10.43.168.82": {
                    "Assigned": false
                },
                "10.43.168.88": {
                    "Assigned": false
                },
                "10.43.168.89": {
                    "Assigned": false
                },
                "10.43.169.118": {
                    "Assigned": false
                },
                "10.43.169.13": {
                    "Assigned": false
                },
                "10.43.169.131": {
                    "Assigned": false
                },
                "10.43.169.138": {
                    "Assigned": false
                },
                "10.43.169.141": {
                    "Assigned": false
                },
                "10.43.169.157": {
                    "Assigned": false
                },
                "10.43.169.218": {
                    "Assigned": false
                },
                "10.43.169.234": {
                    "Assigned": false
                },
                "10.43.169.28": {
                    "Assigned": false
                },
                "10.43.169.50": {
                    "Assigned": false
                },
                "10.43.169.9": {
                    "Assigned": false
                },
                "10.43.169.92": {
                    "Assigned": false
                },
                "10.43.169.98": {
                    "Assigned": false
                }
            },
            "IsPrimary": false
        }
    },
    "TotalIPs": 58
}
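The /v1/enis dump above is verbose; a small helper (hypothetical, the name is mine) to boil it down to assigned/free counts per ENI, given the parsed JSON:

```python
def eni_summary(enis_doc: dict) -> dict:
    """Summarize ipamd's /v1/enis introspection output per ENI."""
    out = {}
    for eni_id, pool in enis_doc["ENIIPPools"].items():
        addrs = pool["IPv4Addresses"]
        assigned = sum(1 for a in addrs.values() if a["Assigned"])
        out[eni_id] = {
            "device": pool["DeviceNumber"],
            "assigned": assigned,
            "free": len(addrs) - assigned,
        }
    return out

# Feed it json.loads() of: curl -s http://localhost:61678/v1/enis
```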
# curl -s http://localhost:61678/v1/pods | python -m json.tool
{
    "coordinator-0_kube-logs_cd95adf621ad3d3e8685f12383755d031babee2a025187773bdc8633e7ca1e62": {
        "DeviceNumber": 0,
        "IP": "10.43.168.134"
    },
    "filebeat-927zt_kube-system_daf109fa6c13005afd4f43cdae5b7947bd525637db16d57c6f4f01424d8aa330": {
        "DeviceNumber": 0,
        "IP": "10.43.168.246"
    },
    "idb-esvt-to-es-preprod-1_logstash_631aed34c4ad172e87cd87a371b012686f6b4a27cf07fa5ae58044f01f46de5c": {
        "DeviceNumber": 0,
        "IP": "10.43.168.98"
    },
    "idb-omnibus-events-to-es-dev-0_logstash_1b187d44b1a18f026d5e61c908f7e263cd195557baf076713fd3c33d4616ca05": {
        "DeviceNumber": 0,
        "IP": "10.43.169.21"
    },
    "idb-omnibus-events-to-es-preprod-1_logstash_4a1b6cfc85a4890f799653140cb05f1cc249b4b3d1945b1781f449ea00d20b13": {
        "DeviceNumber": 0,
        "IP": "10.43.168.172"
    },
    "idb-syslogs-to-es-dev-0_logstash_60807c660f4125608cd98440d39b660d0e988da4c6f58b413292f6b4dd199518": {
        "DeviceNumber": 0,
        "IP": "10.43.168.122"
    },
    "idb-syslogs-to-es-preprod-1_logstash_7365b94ff9974e35c62342911ba5c01765c99ced87dce564403564cc7d78882b": {
        "DeviceNumber": 0,
        "IP": "10.43.168.47"
    },
    "idb-threshold-events-to-es-preprod-1_logstash_3d25bd772d23584ed6c3cf9487afdb714319e6d0eb469d073b583ed9c5f2a17a": {
        "DeviceNumber": 0,
        "IP": "10.43.169.133"
    },
    "metrics-dev-kibana-7584d95669-rg4hg_elasticsearch_07e0ede645263210ae39ab2e3f816a60ad7e781db41d033f39f9c6b5bb439753": {
        "DeviceNumber": 0,
        "IP": "10.43.168.235"
    },
    "metrics-preprod-telegraf-85495c8b48-j857j_elasticsearch_d2ecf6a4d6a8ab52c89e82acbb1ba86a6c22041385d3043242110f06c4c0a811": {
        "DeviceNumber": 0,
        "IP": "10.43.168.214"
    },
    "metrics-server-54458b9f58-98h5g_kube-system_5e53a38e6ab6433d6d54be0feaf6cf62a233e09c4ee11c02a8ce074c865f8b14": {
        "DeviceNumber": 0,
        "IP": "10.43.169.202"
    },
    "nginx-ingress-6b597b96c6-dm4nx_nginx-ingress_aeb03f0ea99daa03f59f741424f1972640bce3a85d78c731fd8efa432c96be3b": {
        "DeviceNumber": 2,
        "IP": "10.43.168.164"
    }
}
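One detail worth surfacing from the /v1/pods output above: every healthy pod sits on DeviceNumber 0 (the primary ENI), while the failing nginx-ingress pod is the only one on DeviceNumber 2 (the secondary ENI). That may point at the policy routing the CNI sets up for secondary ENIs. A small helper (hypothetical, names are mine) to group the output by device:

```python
from collections import defaultdict

def pods_by_device(pods_doc: dict) -> dict:
    """Group pods from ipamd's /v1/pods output by ENI device number.

    Keys in pods_doc look like "<pod-name>_<namespace>_<container-id>".
    """
    groups = defaultdict(list)
    for key, info in pods_doc.items():
        pod_name = key.split("_")[0]
        groups[info["DeviceNumber"]].append((pod_name, info["IP"]))
    return dict(groups)

# Feed it json.loads() of: curl -s http://localhost:61678/v1/pods
```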
# curl -s http://localhost:61678/metrics
# HELP add_ip_req_count The number of add IP address request
# TYPE add_ip_req_count counter
add_ip_req_count 550
# HELP assigned_ip_addresses The number of IP addresses assigned to pods
# TYPE assigned_ip_addresses gauge
assigned_ip_addresses 12
# HELP del_ip_req_count The number of delete IP address request
# TYPE del_ip_req_count counter
del_ip_req_count{reason="PodDeleted"} 540
del_ip_req_count{reason="SetupNSFailed"} 1
# HELP eni_allocated The number of ENIs allocated
# TYPE eni_allocated gauge
eni_allocated 2
# HELP eni_max The maximum number of ENIs that can be attached to the instance
# TYPE eni_max gauge
eni_max 8
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 7.9707e-05
go_gc_duration_seconds{quantile="0.25"} 9.5658e-05
go_gc_duration_seconds{quantile="0.5"} 0.000101121
go_gc_duration_seconds{quantile="0.75"} 0.000111446
go_gc_duration_seconds{quantile="1"} 0.008658322
go_gc_duration_seconds_sum 1.5565753340000001
go_gc_duration_seconds_count 10156
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 27
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 9.588984e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 4.5244160648e+10
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.780852e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 1.48049097e+08
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 917504
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 9.588984e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.111232e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.3615104e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 77860
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 1.236992e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.9726336e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.545103838066779e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 780928
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 1.48126957e+08
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 27776
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 32768
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 211736
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 245760
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.6407552e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 3.894404e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 2.29376e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 2.29376e+06
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.8891384e+07
# HELP ip_max The maximum number of IP addresses that can be allocated to the instance
# TYPE ip_max gauge
ip_max 232
# HELP ipamd_action_inprogress The number of ipamd actions inprogress
# TYPE ipamd_action_inprogress gauge
ipamd_action_inprogress{fn="increaseIPPool"} 0
ipamd_action_inprogress{fn="nodeIPPoolReconcile"} 0
ipamd_action_inprogress{fn="nodeInit"} 0
ipamd_action_inprogress{fn="retryAllocENIIP"} 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1442.5
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 10
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.0845312e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.54388613528e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 6.1489152e+07
# HELP total_ip_addresses The total number of IP addresses
# TYPE total_ip_addresses gauge
total_ip_addresses 58
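For context on the `ip_max 232` gauge: ipamd derives the instance's addressable-IP ceiling from its ENI limits, and for many EC2 instance types this works out to ENIs × (IPv4 addresses per ENI − 1), since the first address on each ENI is the ENI's own primary. A minimal sketch (the per-instance-type numbers below are illustrative assumptions; check the EC2 documentation for authoritative limits):

```python
# Sketch: deriving ipamd's ip_max ceiling from ENI limits.
# The instance-type figures are assumptions for illustration, not
# authoritative values.
ENI_LIMITS = {
    # instance type: (max ENIs, IPv4 addresses per ENI)
    "c5.4xlarge": (8, 30),
    "t3.medium": (3, 6),
}

def ip_max(instance_type: str) -> int:
    """Addressable pod IPs: one address per ENI is reserved as its primary."""
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    return enis * (ips_per_eni - 1)

print(ip_max("c5.4xlarge"))  # 232, matching the ip_max gauge above
```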

xrl commented Jan 3, 2019

I have switched AMIs to Amazon Linux 2 and it's still the same problem. I am going to try updating Docker to 18.


xrl commented Jan 9, 2019

I have rolled back the Amazon Linux 2 hosts; they were failing to create any working pods.


recollir commented Feb 20, 2019

I have reported something similar. #318

@sdavids13

This is happening to us as well, using a CentOS 7 AMI with EKS 1.11, at about the same frequency described in the ticket. Today was especially fun because it happened to our coredns pod, which stayed in the load balancer (no readiness probe installed by default :() and was therefore causing a lot of DNS errors inside our cluster, which our services didn't appreciate :/. This has become a huge headache and a cause of quite a bit of frustration for our team, please help...
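One mitigation for the coredns symptom: give the deployment a readiness probe so an unroutable pod is pulled out of the Service endpoints instead of black-holing DNS queries. A hypothetical patch fragment for the coredns Deployment's container spec, assuming the Corefile enables the `ready` plugin on its default port 8181 (older configs may only expose `health` on 8080; adjust accordingly):

```yaml
# Assumption: the coredns Corefile enables the `ready` plugin (port 8181).
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```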


lnr0626 commented Mar 8, 2019

I've seen this as well with a similar setup to sdavids

It looked like it might only be happening on secondary ENIs; i.e., all of the pods in our cluster that have IPs allocated on the primary ENI consistently come up correctly.


recollir commented Mar 8, 2019

Have you tried running from master, @lnr0626? It seems that solved it for us.

@tabern tabern added the priority/P1 Must be staffed and worked currently or soon. Is a candidate for next release label Mar 8, 2019

xrl commented Jul 15, 2019

I got this during an autoscaler/pod rush from our GitLab CI spawning build pods:

Events:
  Type     Reason                  Age                    From                                                     Message
  ----     ------                  ----                   ----                                                     -------
  Normal   TriggeredScaleUp        8m18s                  cluster-autoscaler                                       pod triggered scale-up: [{K3-EKS-ondemandasgsubnet09df9044a965c5907-FYER7FLJR931 2->3 (max: 5)}]
  Warning  FailedScheduling        7m20s (x4 over 8m20s)  default-scheduler                                        0/24 nodes are available: 12 Insufficient pods, 13 Insufficient memory, 6 Insufficient cpu.
  Warning  FailedScheduling        6m23s (x3 over 6m34s)  default-scheduler                                        0/25 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 12 Insufficient pods, 13 Insufficient memory, 6 Insufficient cpu.
  Normal   Scheduled               6m23s                  default-scheduler                                        Successfully assigned gitlab/runner-2s5ezhug-project-6-concurrent-4tlz8n to ip-172-30-19-231.eu-central-1.compute.internal
  Warning  FailedCreatePodSandBox  6m22s                  kubelet, ip-172-30-19-231.eu-central-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "32ef33d345b2d715ccb78e88d543a2fe610586b252aea99b34e40ba0d78448e4" network for pod "runner-2s5ezhug-project-6-concurrent-4tlz8n": NetworkPlugin cni failed to set up pod "runner-2s5ezhug-project-6-concurrent-4tlz8n_gitlab" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "32ef33d345b2d715ccb78e88d543a2fe610586b252aea99b34e40ba0d78448e4" network for pod "runner-2s5ezhug-project-6-concurrent-4tlz8n": NetworkPlugin cni failed to teardown pod "runner-2s5ezhug-project-6-concurrent-4tlz8n_gitlab" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
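The `dial tcp 127.0.0.1:50051: connect: connection refused` in that sandbox failure is the CNI plugin failing to reach the local ipamd gRPC endpoint on the node, as opposed to the `i/o timeout` at the top of this issue, where the dial simply never completes. A minimal, self-contained Python sketch of the same kind of TCP reachability check (the port number is just the one from the log):

```python
# Minimal TCP reachability probe, mirroring the CNI's "dial tcp" failure
# modes: a closed port yields "connection refused", a blackholed address
# yields an i/o timeout.
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False

if __name__ == "__main__":
    # On an affected node you would probe the ipamd gRPC port:
    print(can_connect("127.0.0.1", 50051))
```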


jaypipes commented Nov 6, 2019

@xrl we are fairly confident that this issue was fixed with #360. Would you mind please telling us whether you continue to see issues with the current version of the CNI plugin? Thanks much!

@sdavids13

We are no longer having this issue after upgrading to a newer CNI.


jaypipes commented Nov 6, 2019

@sdavids13 thx for the update. I'm going to close this issue out. @xrl please feel free to let us know if you see similar issues now and I will reopen.

@jaypipes jaypipes closed this as completed Nov 6, 2019