Process errors with i/o timeout against the kube api endpoint #542

Closed
xrl opened this issue Sep 21, 2018 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


xrl commented Sep 21, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: I installed kube-state-metrics from kube-prometheus and it has been in a restart loop, timing out while talking to the kube API.

What you expected to happen: kube-state-metrics should not time out talking to the kube API.

How to reproduce it (as minimally and precisely as possible):

Use this container definition:

      - args:
        - --host=127.0.0.1
        - --port=8081
        - --telemetry-host=127.0.0.1
        - --telemetry-port=8082
        # - --apiserver
        image: quay.io/coreos/kube-state-metrics:v1.4.0
        name: kube-state-metrics
        resources:
          limits:
            cpu: 100m
            memory: 150Mi
          requests:
            cpu: 100m
            memory: 150Mi

The full deployment, which I'm running verbatim, is here: https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/kube-state-metrics-deployment.yaml

Once that's installed, run k get pods:

$ k get pods
NAME                                  READY     STATUS             RESTARTS   AGE
alertmanager-main-0                   1/2       CrashLoopBackOff   1081       3d
grafana-5b68464b84-mg6pf              1/1       Running            0          3d
kube-state-metrics-594cc76cfc-dpqfb   3/4       CrashLoopBackOff   767        2d
node-exporter-4lr5b                   2/2       Running            0          3d
node-exporter-5qzl4                   2/2       Running            0          3d
node-exporter-kddxg                   2/2       Running            0          3d
node-exporter-mb4xz                   2/2       Running            0          3d
node-exporter-n44vr                   2/2       Running            0          3d
node-exporter-wntnh                   2/2       Running            0          3d
prometheus-k8s-0                      3/3       Running            1          3d
prometheus-k8s-1                      3/3       Running            1          3d
prometheus-operator-587d64f4c-tkl4w   1/1       Running            0          3d

and then look at the logs:

$ k logs kube-state-metrics-594cc76cfc-dpqfb -c kube-state-metrics
I0921 15:26:29.547189       1 main.go:76] Using default collectors
I0921 15:26:29.547238       1 main.go:90] Using all namespace
I0921 15:26:29.547248       1 main.go:96] No metric whitelist or blacklist set. No filtering of metrics will be done.
W0921 15:26:29.547270       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0921 15:26:29.547962       1 main.go:145] Testing communication with server
F0921 15:26:59.548419       1 main.go:112] Failed to create client: ERROR communicating with apiserver: Get https://10.43.168.1:443/version?timeout=32s: dial tcp 10.43.168.1:443: i/o timeout

Anything else we need to know?:

I have confirmed from a standalone pod that I can access the kube cluster:

# curl -vk https://kubernetes.default.svc
* Rebuilt URL to: https://kubernetes.default.svc/
*   Trying 10.43.168.1...
* TCP_NODELAY set
* Connected to kubernetes.default.svc (10.43.168.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=kubernetes-master
*  start date: Sep 11 17:15:28 2018 GMT
*  expire date: Sep 10 17:15:28 2028 GMT
*  issuer: CN=kubernetes
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55c98221c8e0)
> GET / HTTP/2
> Host: kubernetes.default.svc
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 401
< content-type: application/json
< www-authenticate: Basic realm="kubernetes-master"
< content-length: 165
< date: Fri, 21 Sep 2018 15:31:20 GMT
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
* Connection #0 to host kubernetes.default.svc left intact
}

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-10T11:44:36Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T09:56:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kube-state-metrics image version: 1.4.0 or 1.3.1; both fail the same way.
k8s-ci-robot added the kind/bug label Sep 21, 2018

xrl commented Sep 21, 2018

I should also note that I have a "jumppod", which just runs Ubuntu with kubectl installed and an RBAC setup. The relevant deployment bits:

      serviceAccount: doit
      containers:
      - image: 981810374974.dkr.ecr.us-east-1.amazonaws.com/doit:2018-09-18-17-19
        imagePullPolicy: Always
        name: doit
        # command: ["bash", "-c", "find /data -mtime +1 -name *.log -exec echo {} > /data/cron_out \; -exec gzip {} \;"]
        command: ["sleep", "infinity"]
        workingDir: /root
        env:
        - name: USER
          value: root
        - name: TERM
          value: xterm
        volumeMounts:
        - mountPath: /root
          subPath: root-homedir
          name: doit-home
        securityContext:
          privileged: true
      volumes:
        - name: doit-home
          persistentVolumeClaim:
            claimName: doit-home

and RBAC:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: doit
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  # "namespace" omitted since ClusterRoles are not namespaced
  name: doit-cluster-role
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["all", "clusters", "configmaps", "cronjobs", "daemonsets", "events", "ingresses", "jobs", "namespaces", "nodes", "persistentvolumeclaims", "persistentvolumes", "pods", "replicasets", "replicationcontrollers", "secrets", "services", "statefulsets", "storageclasses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["deployments.extensions"] # "" indicates the core API group
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# This binding grants the doit ServiceAccount the doit-cluster-role across the whole cluster.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: doit-cluster-role-binding
subjects:
- kind: ServiceAccount
  name: doit
  namespace: default
roleRef:
  kind: ClusterRole
  name: doit-cluster-role
  apiGroup: rbac.authorization.k8s.io

and from that jumppod I can run:

$ k -n default exec -it doit-75544c9c89-lxf6k bash
root@doit-75544c9c89-lxf6k:~# kubectl get pods
NAME                    READY     STATUS    RESTARTS   AGE
doit-75544c9c89-lxf6k   1/1       Running   0          2d

so from inside the cluster it's totally fine to talk to the API using the in-cluster config.
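
For reference, the in-cluster config here just means the client reads the service account credentials mounted into the pod and talks to the API at kubernetes.default.svc. A minimal sketch of exercising that same path by hand (assuming the standard service account mount path; this is not the exact code kube-state-metrics runs):

# Hedged sketch: read the mounted service account credentials and hit the API directly.
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat $SA/token)
curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/version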


xrl commented Sep 21, 2018

Also interesting: if I build the kube-state-metrics executable on my Ubuntu Bionic jumppod, I can see it connects to the cluster (although the jumppod does not have the proper RBAC, it's a good sign it gets that far):

# apt-get install -y golang
# go get k8s.io/kube-state-metrics
# cd go/bin/
~/go/bin# ./kube-state-metrics --host=127.0.0.1 --port=8081 --telemetry-host=127.0.0.1 --telemetry-port=8082
I0921 17:27:50.014572   15792 main.go:77] Using default collectors
I0921 17:27:50.014600   15792 main.go:91] Using all namespace
I0921 17:27:50.014609   15792 main.go:97] No metric whitelist or blacklist set. No filtering of metrics will be done.
W0921 17:27:50.014622   15792 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0921 17:27:50.016275   15792 main.go:146] Testing communication with server
I0921 17:27:50.025239   15792 main.go:151] Running with Kubernetes cluster version: v1.10. git version: v1.10.7. git tree state: clean. commit: 0c38c362511b20a098d7cd855f1314dad92c2780. platform: linux/amd64
I0921 17:27:50.025266   15792 main.go:153] Communication with server successful
I0921 17:27:50.025517   15792 main.go:162] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
I0921 17:27:50.026206   15792 main.go:241] Active collectors: resourcequotas,namespaces,secrets,configmaps,pods,endpoints,daemonsets,replicationcontrollers,services,jobs,cronjobs,statefulsets,persistentvolumes,persistentvolumeclaims,limitranges,nodes,replicasets,horizontalpodautoscalers,deployments
I0921 17:27:50.026228   15792 main.go:187] Starting metrics server: 127.0.0.1:8081
E0921 17:27:50.027926   15792 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:79: Failed to list *v1beta1.StatefulSet: statefulsets.apps is forbidden: User "system:serviceaccount:default:doit" cannot list statefulsets.apps at the cluster scope
E0921 17:27:50.030528   15792 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:79: Failed to list *v1.ResourceQuota: resourcequotas is forbidden: User "system:serviceaccount:default:doit" cannot list resourcequotas at the cluster scope

Perhaps there is something "wrong" with the container I am using? Or perhaps that container is incompatible in some way?


xrl commented Sep 22, 2018

The more I poke at this error, the more I realize it's probably the configuration of the host kube node. Tracking issue with the aws-vpc-cni is here: aws/amazon-vpc-cni-k8s#180
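
If anyone else hits this, a quick way to see whether the problem is node-specific is to compare which nodes the failing and healthy pods landed on (a sketch only; the node name below is a placeholder):

kubectl get pods -o wide              # the NODE column shows where each pod landed
kubectl describe node <suspect-node>  # check Conditions and recent events for CNI/network problems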


xrl commented Sep 22, 2018

Terminating those EC2 instances and letting the kops-configured instance group autoscaler replace them has worked; the pods can now resolve DNS and connect to other hosts.
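
For completeness, roughly what that looked like (a sketch with placeholder node and instance IDs; the kops-managed instance group brings up replacement instances on its own):

# Drain the suspect node so workloads reschedule, then terminate the instance;
# the instance group replaces it with a freshly configured node.
kubectl drain ip-10-43-x-x.ec2.internal --ignore-daemonsets
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0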


mxinden commented Sep 22, 2018

Thanks for the detailed report @xrl!

Terminating those EC2 instances and letting the kops-configured instance group autoscaler replace them has worked; the pods can now resolve DNS and connect to other hosts.

I am not quite sure I understand. Are you saying the issue is resolved?

If not, can you try running your doit container on the same host as kube-state-metrics?
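
For example, something along these lines (a sketch only; the node name is a placeholder you would take from the -o wide output):

kubectl get pod kube-state-metrics-594cc76cfc-dpqfb -o wide   # note the NODE column
kubectl patch deployment doit --type merge \
  -p '{"spec":{"template":{"spec":{"nodeName":"<node-from-the-output-above>"}}}}'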


xrl commented Sep 22, 2018

@mxinden correct, the issue is "resolved". More precisely, it's not kube-state-metrics' fault :)

Closing!

xrl closed this as completed Sep 22, 2018