Process errors with i/o timeout against the kube api endpoint #542

Closed
xrl opened this issue Sep 21, 2018 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


xrl commented Sep 21, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: I installed kube-state-metrics from kube-prometheus and it has been in a restart loop, timing out while talking to the kube API.

What you expected to happen: kube-state-metrics should not time out talking to the kube API.

How to reproduce it (as minimally and precisely as possible):

Use this container definition:

      - args:
        - --host=127.0.0.1
        - --port=8081
        - --telemetry-host=127.0.0.1
        - --telemetry-port=8082
        # - --apiserver
        image: quay.io/coreos/kube-state-metrics:v1.4.0
        name: kube-state-metrics
        resources:
          limits:
            cpu: 100m
            memory: 150Mi
          requests:
            cpu: 100m
            memory: 150Mi

The full deployment, which I'm running verbatim, is here: https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/kube-state-metrics-deployment.yaml

Once that's installed, run k get pods:

$ k get pods
NAME                                  READY     STATUS             RESTARTS   AGE
alertmanager-main-0                   1/2       CrashLoopBackOff   1081       3d
grafana-5b68464b84-mg6pf              1/1       Running            0          3d
kube-state-metrics-594cc76cfc-dpqfb   3/4       CrashLoopBackOff   767        2d
node-exporter-4lr5b                   2/2       Running            0          3d
node-exporter-5qzl4                   2/2       Running            0          3d
node-exporter-kddxg                   2/2       Running            0          3d
node-exporter-mb4xz                   2/2       Running            0          3d
node-exporter-n44vr                   2/2       Running            0          3d
node-exporter-wntnh                   2/2       Running            0          3d
prometheus-k8s-0                      3/3       Running            1          3d
prometheus-k8s-1                      3/3       Running            1          3d
prometheus-operator-587d64f4c-tkl4w   1/1       Running            0          3d

and then look at the logs:

$ k logs kube-state-metrics-594cc76cfc-dpqfb -c kube-state-metrics
I0921 15:26:29.547189       1 main.go:76] Using default collectors
I0921 15:26:29.547238       1 main.go:90] Using all namespace
I0921 15:26:29.547248       1 main.go:96] No metric whitelist or blacklist set. No filtering of metrics will be done.
W0921 15:26:29.547270       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0921 15:26:29.547962       1 main.go:145] Testing communication with server
F0921 15:26:59.548419       1 main.go:112] Failed to create client: ERROR communicating with apiserver: Get https://10.43.168.1:443/version?timeout=32s: dial tcp 10.43.168.1:443: i/o timeout

Anything else we need to know?:

I have confirmed from a standalone pod that I can access the kube cluster:

# curl -vk https://kubernetes.default.svc
* Rebuilt URL to: https://kubernetes.default.svc/
*   Trying 10.43.168.1...
* TCP_NODELAY set
* Connected to kubernetes.default.svc (10.43.168.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=kubernetes-master
*  start date: Sep 11 17:15:28 2018 GMT
*  expire date: Sep 10 17:15:28 2028 GMT
*  issuer: CN=kubernetes
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55c98221c8e0)
> GET / HTTP/2
> Host: kubernetes.default.svc
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 401
< content-type: application/json
< www-authenticate: Basic realm="kubernetes-master"
< content-length: 165
< date: Fri, 21 Sep 2018 15:31:20 GMT
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
* Connection #0 to host kubernetes.default.svc left intact
}

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-10T11:44:36Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T09:56:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kube-state-metrics image version: 1.4.0 or 1.3.1; both fail the same way.
k8s-ci-robot added the kind/bug label Sep 21, 2018

xrl commented Sep 21, 2018

I should also note that I have a "jumppod", which just runs Ubuntu with kubectl installed and an RBAC setup. The relevant deployment bits:

      serviceAccount: doit
      containers:
      - image: 981810374974.dkr.ecr.us-east-1.amazonaws.com/doit:2018-09-18-17-19
        imagePullPolicy: Always
        name: doit
        # command: ["bash", "-c", "find /data -mtime +1 -name *.log -exec echo {} > /data/cron_out \; -exec gzip {} \;"]
        command: ["sleep", "infinity"]
        workingDir: /root
        env:
        - name: USER
          value: root
        - name: TERM
          value: xterm
        volumeMounts:
        - mountPath: /root
          subPath: root-homedir
          name: doit-home
        securityContext:
          privileged: true
      volumes:
        - name: doit-home
          persistentVolumeClaim:
            claimName: doit-home

and RBAC:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: doit
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  # "namespace" omitted since ClusterRoles are not namespaced
  name: doit-cluster-role
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["all", "clusters", "configmaps", "cronjobs", "daemonsets", "events", "ingresses", "jobs", "namespaces", "nodes", "persistentvolumeclaims", "persistentvolumes", "pods", "replicasets", "replicationcontrollers", "secrets", "services", "statefulsets", "storageclasses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["deployments.extensions"] # "" indicates the core API group
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# This binding grants the doit ServiceAccount the doit-cluster-role across the whole cluster.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: doit-cluster-role-binding
subjects:
- kind: ServiceAccount
  name: doit
  namespace: default
roleRef:
  kind: ClusterRole
  name: doit-cluster-role
  apiGroup: rbac.authorization.k8s.io

and from that jumppod I can run:

$ k -n default exec -it doit-75544c9c89-lxf6k bash
root@doit-75544c9c89-lxf6k:~# kubectl get pods
NAME                    READY     STATUS    RESTARTS   AGE
doit-75544c9c89-lxf6k   1/1       Running   0          2d

so from inside the cluster it's totally fine to talk to the API using the in-cluster config.
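
For reference, the in-cluster config here just means the client reads the service account credentials mounted into the pod and talks to the API at kubernetes.default.svc. A minimal sketch of exercising that same path by hand (assuming the standard service account mount path; this is not the exact code kube-state-metrics runs):

# Hedged sketch: read the mounted service account credentials and hit the API directly.
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat $SA/token)
curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/version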


xrl commented Sep 21, 2018

Also interesting: if I build the kube-state-metrics executable on my Ubuntu Bionic jumppod, I can see it connects to the cluster (although the jumppod does not have the proper RBAC, it's a good sign it gets that far):

# apt-get install -y golang
# go get k8s.io/kube-state-metrics
# cd go/bin/
~/go/bin# ./kube-state-metrics --host=127.0.0.1 --port=8081 --telemetry-host=127.0.0.1 --telemetry-port=8082
I0921 17:27:50.014572   15792 main.go:77] Using default collectors
I0921 17:27:50.014600   15792 main.go:91] Using all namespace
I0921 17:27:50.014609   15792 main.go:97] No metric whitelist or blacklist set. No filtering of metrics will be done.
W0921 17:27:50.014622   15792 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0921 17:27:50.016275   15792 main.go:146] Testing communication with server
I0921 17:27:50.025239   15792 main.go:151] Running with Kubernetes cluster version: v1.10. git version: v1.10.7. git tree state: clean. commit: 0c38c362511b20a098d7cd855f1314dad92c2780. platform: linux/amd64
I0921 17:27:50.025266   15792 main.go:153] Communication with server successful
I0921 17:27:50.025517   15792 main.go:162] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
I0921 17:27:50.026206   15792 main.go:241] Active collectors: resourcequotas,namespaces,secrets,configmaps,pods,endpoints,daemonsets,replicationcontrollers,services,jobs,cronjobs,statefulsets,persistentvolumes,persistentvolumeclaims,limitranges,nodes,replicasets,horizontalpodautoscalers,deployments
I0921 17:27:50.026228   15792 main.go:187] Starting metrics server: 127.0.0.1:8081
E0921 17:27:50.027926   15792 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:79: Failed to list *v1beta1.StatefulSet: statefulsets.apps is forbidden: User "system:serviceaccount:default:doit" cannot list statefulsets.apps at the cluster scope
E0921 17:27:50.030528   15792 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:79: Failed to list *v1.ResourceQuota: resourcequotas is forbidden: User "system:serviceaccount:default:doit" cannot list resourcequotas at the cluster scope

Perhaps there is something "wrong" with the container I am using? Or perhaps that container is incompatible in some way?


xrl commented Sep 22, 2018

The more I poke at this error, the more I realize it's probably the configuration of the host kube node. Tracking issue with the aws-vpc-cni is here: aws/amazon-vpc-cni-k8s#180
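
If anyone else hits this, a quick way to see whether the problem is node-specific is to compare which nodes the failing and healthy pods landed on (a sketch only; the node name below is a placeholder):

kubectl get pods -o wide              # the NODE column shows where each pod landed
kubectl describe node <suspect-node>  # check Conditions and recent events for CNI/network problems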


xrl commented Sep 22, 2018

Terminating those EC2 instances and letting the kops-configured instance group autoscaler replace them has worked; the pods can now resolve DNS and connect to other hosts.
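
For completeness, roughly what that looked like (a sketch with placeholder node and instance IDs; the kops-managed instance group brings up replacement instances on its own):

# Drain the suspect node so workloads reschedule, then terminate the instance;
# the instance group replaces it with a freshly configured node.
kubectl drain ip-10-43-x-x.ec2.internal --ignore-daemonsets
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0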


mxinden commented Sep 22, 2018

Thanks for the detailed report @xrl!

Terminating those EC2 instances and letting the kops-configured instance group autoscaler replace them has worked; the pods can now resolve DNS and connect to other hosts.

I am not quite sure I understand. Are you saying the issue is resolved?

If not, can you try running your doit container on the same host as kube-state-metrics?
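
For example, something along these lines (a sketch only; the node name is a placeholder you would take from the -o wide output):

kubectl get pod kube-state-metrics-594cc76cfc-dpqfb -o wide   # note the NODE column
kubectl patch deployment doit --type merge \
  -p '{"spec":{"template":{"spec":{"nodeName":"<node-from-the-output-above>"}}}}'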


xrl commented Sep 22, 2018

@mxinden correct, the issue is "resolved". More precisely, it's not kube-state-metrics' fault :)

Closing!

xrl closed this as completed Sep 22, 2018