Couldn't get resource list for external.metrics.k8s.io/v1beta1 #486

Closed
hden opened this issue May 25, 2019 · 36 comments

@hden

hden commented May 25, 2019

Image: 602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.4.1

Symptoms

All nodes are ready and healthy, and secondary IPs are being allocated, but the aws-node pod periodically prints the following error message:

ERROR: logging before flag.Parse: E0525 01:44:40.871589      11 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>

Why is it trying to access the external metrics anyway?

Extra information

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-25T02:52:13Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl describe pod aws-node-4g4wp -n kube-system
Name:               aws-node-4g4wp
Namespace:          kube-system
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               <redacted>.compute.internal/x.x.x.x
Start Time:         Sat, 25 May 2019 10:27:21 +0900
Labels:             controller-revision-hash=7f8557c6b8
                    k8s-app=aws-node
                    pod-template-generation=1
Annotations:        scheduler.alpha.kubernetes.io/critical-pod: 
Status:             Running
IP:                 10.100.3.5
Controlled By:      DaemonSet/aws-node
Containers:
  aws-node:
    Container ID:   docker://7139948319c5d13793a4349bd8a891f6a362546c839a8070a4638dc210b9473d
    Image:          602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.4.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni@sha256:7652301be646d5796febc547a29781f4da6bdd61abc4940e22ec3098c884211e
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Sat, 25 May 2019 10:27:30 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  10m
    Environment:
      AWS_VPC_K8S_CNI_LOGLEVEL:  DEBUG
      MY_NODE_NAME:               (v1:spec.nodeName)
      WATCH_NAMESPACE:           kube-system (v1:metadata.namespace)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log from log-dir (rw)
      /var/run/docker.sock from dockersock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-8cbrg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dockersock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  
  aws-node-token-8cbrg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-node-token-8cbrg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:          <none>
@nickdgriffin

This appears to be happening in 1.5 too.

@homme

homme commented Jul 16, 2019

I noticed that after installing the CloudWatch adapter (kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml) these errors changed from the external.metrics endpoint to custom.metrics:
ERROR: logging before flag.Parse: E0716 12:54:37.701124 13 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>
It would be interesting to see what metric(s) aws-node is looking for and what impact the errors have on the cluster.
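
One generic way to see what that API group is actually serving (this is plain kubectl discovery, nothing aws-node specific) is to query the aggregated endpoint directly:

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
# An empty "resources" list or an error here corresponds to the discovery failure
# that aws-node logs; the same check works for external.metrics.k8s.io/v1beta1.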

@RuBiCK

RuBiCK commented Oct 8, 2019

Same issue with 1.14
v1.14.6-eks-5047ed

@harshal-shah

Same issue with v1.14.8-eks-b7174d

@SevenSilentStars

Any updates on this issue?

@mogren
Contributor

mogren commented Dec 19, 2019

What version of the CNI?

kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d ":" -f 2-3 | tr -d '[:space:]'

I'd recommend updating to v1.5.5.

@labria
Contributor

labria commented Dec 31, 2019

Happens on v1.5.5 too, but does not seem to affect anything.

@rafaelmagu

1.6 is out. Has anyone verified if this issue is resolved?

@yourilefers

I've deployed version 1.6 on our clusters and it seems to have been resolved! Maybe someone can verify this for their situation?

@hineze

hineze commented Mar 3, 2020

I've deployed version 1.6 on our clusters and it seems to have been resolved! Maybe someone can verify this for their situation?

Just updated the plugin in all 4 of my clusters and I can confirm 1.6 resolves this issue

@rafaelmagu

I've deployed version 1.6 on our clusters and it seems to have been resolved! Maybe someone can verify this for their situation?

Just updated the plugin in all 4 of my clusters and I can confirm 1.6 resolves this issue

Same here. None of my clusters shows the issue any more with 1.6. This ticket can be closed.

@jaypipes
Contributor

jaypipes commented Mar 3, 2020

Thanks for the feedback, folks! Closing out.

@ragarcia26

ragarcia26 commented Jun 8, 2020

I'm seeing this when going from 1.6.1 to 1.6.2, and rolling back to 1.6.1 seemed to have no effect.

I've tried extending the probes as described in #872, since I noticed the same 1s timeout issue, but with the same result. There's no eksctl involved here.

$ kubectl logs -n kube-system aws-node-s2ljv -f
Copying portmap binary ... Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0608 20:07:32.183918       8 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:
  Normal   Scheduled  7m14s                   default-scheduler                                   Successfully assigned kube-system/aws-node-s2ljv to ip-172-16-11-2.us-east-2.compute.internal
  Normal   Pulled     4m44s (x4 over 7m13s)   kubelet, ip-172-16-11-2.us-east-2.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.2"
  Normal   Created    4m44s (x4 over 7m13s)   kubelet, ip-172-16-11-2.us-east-2.compute.internal  Created container aws-node
  Normal   Started    4m44s (x4 over 7m13s)   kubelet, ip-172-16-11-2.us-east-2.compute.internal  Started container aws-node
  Normal   Pulling    3m18s (x5 over 7m13s)   kubelet, ip-172-16-11-2.us-east-2.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.2"
  Warning  BackOff    2m10s (x12 over 5m59s)  kubelet, ip-172-16-11-2.us-east-2.compute.internal  Back-off restarting failed container

@ragarcia26

We identified that our issue was that the IAM role carrying the CNI policy had an incorrect target, so the node was unable to use that role and its permissions.
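
For anyone checking the same thing, a couple of generic AWS CLI calls can confirm which policies and trust relationship the node role actually has; the role name below is just a placeholder for your node instance role:

$ aws iam get-role --role-name <node-instance-role> --query 'Role.AssumeRolePolicyDocument'
$ aws iam list-attached-role-policies --role-name <node-instance-role>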

@nesl247

nesl247 commented Aug 18, 2020

We're still getting this on 1.6.3. Any ideas?

@anupash147

anupash147 commented Aug 25, 2020

I am using 1.6.3 and the issue persists for me as well, and it only happens on my tainted nodes.

@SaranBalaji90
Contributor

SaranBalaji90 commented Aug 31, 2020

@nesl247 and @anupash147, is ipamd not starting up at all on these affected nodes, or does it eventually start up?

@mogren mogren reopened this Aug 31, 2020
@SaranBalaji90
Contributor

@mogren let's reopen this and investigate the issue further. I'm not sure what we actually did to mitigate the issue.

@SaranBalaji90
Contributor

SaranBalaji90 commented Aug 31, 2020

We have to understand two things:

  1. Why are we seeing this error message: memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>?
  2. Why is ipamd failing to start in some cases? (Connectivity issues to the EC2 instance metadata service, the API server, or the EC2 API? Should we add more logging to the failure paths?)

Not sure if (1) affects (2), but (1) might give some hints for answering (2).

@nesl247

nesl247 commented Aug 31, 2020

For (1), it would probably be one of two things:

  • Cannot connect due to security groups
  • Cannot connect due to RBAC (a quick check for this is sketched below)

@SaranBalaji90 It is starting. This doesn't seem to cause any issues, as far as we can tell.
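
As a quick sanity check for the RBAC side (generic kubectl impersonation; it assumes your own user is allowed to impersonate service accounts, and the verbs/resources here are only examples):

$ kubectl auth can-i list pods --as=system:serviceaccount:kube-system:aws-node
$ kubectl auth can-i get nodes --as=system:serviceaccount:kube-system:aws-node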

@anguslees
Contributor

anguslees commented Aug 31, 2020

This error (memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1) means an aggregated API server has been configured in k8s but is not functional for some reason. The vpc-cni code (somewhere deep in client-go) is trying to list all available cluster resources, just the discovery metadata, not the actual resources themselves. This is fairly common in client-go, and the internet is full of bug reports like this one :/ You can reproduce the above with something like kubectl api-versions.

I agree that vpc-cni doesn't actually need this information, which is why the OP found that vpc-cni continued just fine despite this "error" message.
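
To see this from the client side with plain kubectl (same discovery path, nothing vpc-cni specific):

$ kubectl api-versions
# runs a full discovery pass and will surface the same error
$ kubectl get apiservices
# any aggregated API whose AVAILABLE column is not True is the one discovery trips over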

@MiLk

MiLk commented Aug 31, 2020

We have the same issue, and for us, the v1beta1.external.metrics.k8s.io apiservice has been created for the datadog cluster agent.
https://docs.datadoghq.com/agent/cluster_agent/external_metrics/

I'm not sure where to go from here.

@anguslees
Contributor

anguslees commented Sep 1, 2020

I'm not sure where to go from here.

Just ignore the "error"? It's purely noise. (I think the "failed to start" issue above was found to be unrelated?)

Confusion is bad, and we should absolutely fix the code not to report this "error", which will require a change to client-go. But at the same time there's no actual functional impact here, so it's going to be a low priority for everyone.

If you have this error yourself and want it to go away, your best bet is to fix the metrics service (or whatever aggregated API server is reported in the error) or remove the relevant apiservice registration (kubectl get apiservices). Test with kubectl api-versions.
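
As a concrete sketch of that workaround, using the apiservice name from this thread's error message (only delete the registration if you are sure nothing in the cluster needs the external metrics API):

$ kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml
# shows which backing service it points at and why it is unavailable
$ kubectl delete apiservice v1beta1.external.metrics.k8s.io
# removes the registration; re-test with kubectl api-versions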

@nesl247

nesl247 commented Sep 1, 2020

Just as an FYI, in case this is ever determined to actually be an issue: the metrics services (in our case metrics-server and Datadog) are not reporting any errors.

@SaranBalaji90
Contributor

SaranBalaji90 commented Sep 1, 2020

It will be really helpful if someone can paste the output of /var/log/aws-routed-eni/ipamd.log when the issue happens. ipamd logs should give us some more information.

For example,

{"level":"info","ts":"2020-08-30T18:55:51.585Z","caller":"aws-k8s-agent/main.go:30","msg":"Starting L-IPAMD aws-node-cni-v1  ..."}
{"level":"info","ts":"2020-08-30T18:56:21.587Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2020-08-30T18:56:21.591Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.17+. git version: v1.17.9-eks-4c6976. git tree state: clean. commit: 4c6976793196d70bc5cd29d56ce5440c9473648e. platform: linux/amd64"}
...
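
In case it saves someone a hop onto the node: since the pod mounts the host's /var/log at /host/var/log (see the pod description earlier in this issue), the log can usually be pulled with kubectl exec. The pod name is a placeholder and the in-container path is assumed from that mount:

$ kubectl exec -n kube-system <aws-node-pod> -- cat /host/var/log/aws-routed-eni/ipamd.log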

@achevuru
Contributor

achevuru commented Sep 2, 2020

The error indicates that there are no custom metrics and the resource list for that API group was empty. As @anguslees pointed out, it shouldn't have any functional impact, and this error log for an empty resource list was addressed/removed in K8s 1.14+. I'm interested to know your K8s versions, @hden @homme @nesl247 @anupash147.

https://github.com/kubernetes/kubernetes/blob/release-1.11/staging/src/k8s.io/client-go/discovery/cached/memcache.go#L146

kubernetes/kubernetes@ee8b6f5

@nesl247

nesl247 commented Sep 2, 2020

For us it's been 1.14-1.17.

@achevuru
Contributor

achevuru commented Sep 2, 2020

@nesl247 Is the error "nil" for you in the log, or do you have a valid error string? There is another open issue, #1078, around this tied to the kube-proxy version.

@SaranBalaji90
Contributor

It will be really helpful if someone can paste the output of /var/log/aws-routed-eni/ipamd.log when the issue happens. ipamd logs should give us some more information.

For example,

{"level":"info","ts":"2020-08-30T18:55:51.585Z","caller":"aws-k8s-agent/main.go:30","msg":"Starting L-IPAMD aws-node-cni-v1  ..."}
{"level":"info","ts":"2020-08-30T18:56:21.587Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2020-08-30T18:56:21.591Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.17+. git version: v1.17.9-eks-4c6976. git tree state: clean. commit: 4c6976793196d70bc5cd29d56ce5440c9473648e. platform: linux/amd64"}
...

From #1055 (comment), it looks like reading the docker socket is taking time. I'd like to see what others are seeing.

@mogren
Contributor

mogren commented Sep 9, 2020

If anyone is seeing Couldn't get resource list for external.metrics.k8s.io/v1beta1 in their logs, this should not cause issues for the CNI. Updating client-go is tracked in #745.

If aws-node fails to start up, that is caused by something else; we are currently investigating issues with docker in #1055 and kube-proxy in #1078.

@serhatcetinkaya

serhatcetinkaya commented Sep 10, 2020

Same issue on EKS v1.17.9-eks-4c6976 with aws-node v1.7.1; the container keeps restarting:

 {"level":"info","ts":"2020-09-10T10:05:51.110Z","caller":"entrypoint.sh","msg":"Copying CNI plugin binaries ... "}
 {"level":"info","ts":"2020-09-10T10:05:51.355Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
 {"level":"info","ts":"2020-09-10T10:05:51.658Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
 {"level":"info","ts":"2020-09-10T10:05:51.755Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
 ERROR: logging before flag.Parse: E0910 10:05:52.877086      14 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>
 ERROR: logging before flag.Parse: E0910 10:05:57.878681      14 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

EDIT:

For me the errors were caused by an outdated aws-cni deployment. You might have an outdated version (like me); check here and update your cluster accordingly. This solved the error and now everything works fine.

@s4mur4i

s4mur4i commented Oct 6, 2020

Since issue #950 has been closed, I will update here.
On EKS v1.17.9-eks-4c6976 with aws-node v1.7.3, the custom metrics error still gets logged.

{"level":"info","ts":"2020-10-06T10:49:10.927Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-10-06T10:49:11.224Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2020-10-06T10:49:11.323Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E1006 10:49:12.628595      10 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>
{"level":"info","ts":"2020-10-06T10:49:18.925Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2020-10-06T10:49:19.024Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2020-10-06T10:49:19.025Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
ERROR: logging before flag.Parse: E1006 10:50:12.830684      10 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>
ERROR: logging before flag.Parse: E1006 10:51:13.029515      10 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>
ERROR: logging before flag.Parse: E1006 10:52:12.832894      10 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>

Our deployment file is up to date according to the example 1.7 config.

@max-rocket-internet
Contributor

I'm seeing this in CNI version 1.7.5 also:

{"level":"info","ts":"2020-12-16T15:00:38.209Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-12-16T15:00:38.226Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2020-12-16T15:00:38.227Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E1216 15:00:38.293483       9 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
{"level":"info","ts":"2020-12-16T15:00:40.243Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2020-12-16T15:00:40.246Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2020-12-16T15:00:40.246Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
ERROR: logging before flag.Parse: E1216 15:01:38.327168       9 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
ERROR: logging before flag.Parse: E1216 15:02:38.329119       9 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>

There is no such resource as external.metrics.k8s.io, but there is externalmetrics.metrics.aws 🤔

Interestingly, I don't see this error on a newer 1.17 cluster, only on a 1.17 cluster that has existed since 1.14.

@rmakoto

rmakoto commented Dec 18, 2020

@max-rocket-internet I'm also seeing something similar, but in my case it is a cluster that was created on 1.17.

{"level":"info","ts":"2020-12-18T14:38:54.869Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-12-18T14:38:54.883Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2020-12-18T14:38:54.884Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2020-12-18T14:38:56.898Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2020-12-18T14:38:56.900Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2020-12-18T14:38:56.901Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
ERROR: logging before flag.Parse: E1218 18:40:00.063760       9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E1218 19:21:00.076885       9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Kubernetes version: 1.17
Platform version: eks.4

aws-node:
amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

kube-proxy:
kube-proxy:v1.17.9-eksbuild.1

@hugoShaka

Cause

The issue happens when the external metrics API is enabled (most likely because the Prometheus adapter is deployed) but no external metric is declared. This is a totally valid scenario, but it is not handled correctly.

You can check whether you're in this situation by running the following command and checking that the resources field is empty.

$ kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]} 

This behaviour comes from k8s.io/client-go and was fixed in November 2018 by this commit: kubernetes/client-go@5865254#diff-8cc59cc0342ea488b3db0c5a07f4bffda3edb171ca58c6fc0265d10e515018dfL134

amazon-vpc-cni-k8s uses an old version of the client-go library, from before the fix:

k8s.io/client-go v0.0.0-20180806134042-1f13a808da65
(2018-08)

How to fix

To fix this issue, k8s.io/client-go has to be updated to a version that contains the commit mentioned earlier.

Workaround

If you don't want the prometheus adapter

Properly remove the prometheus-adapter chart; this will unregister the external and custom metrics APIs from the API server.

If you do want the prometheus adapter

We don't control the CNI version AWS deploys when we request an EKS cluster, so as long as the library is not updated we'll have to live with it. The easiest way to make sure the error stops popping up is to declare a dummy external metric.

For example with the prometheus-adapter chart one could do:

metrics:
  external:
    - seriesQuery: '{__name__=~"^prometheus_build_info$"}'
      resources:
        template: <<.Resource>>
      name:
        matches: "^prometheus_build_info$"
        as: "dummy_external_metric"
      metricsQuery: 'count(<<.Series>>)'
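
Once the adapter reloads that configuration, the raw check from earlier in this thread should stop returning an empty list, and the memcache.go error should stop appearing:

$ kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
# the "resources" array should now contain the dummy_external_metric defined above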

@jayanthvn
Contributor

client-go has been updated and the operator-sdk dependency removed in #1419. Release 1.8 has these fixes. Closing the issue for now; please try it and let us know if the issue still exists.
