Pod can't reliably establish watches properly #2118

jmirc · 2019-01-21T21:44:03Z

Bug Report

What is the issue?

I am running the latest version of linkerd edge 19.1.2 and I am getting this error

WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

How can it be reproduced?

I just deployed the latest version. Nothing more

Logs, error output, etc

output for linkerd logs --control-plane-component controller

linkerd linkerd-controller-7bc49fd77f-lwt8q linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

output for linkerd logs --control-plane-component controller -c proxy-api

linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T13:54:55Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 15:57:34.899318       1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3417120 (3420499)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:25:43Z" level=info msg="Establishing watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:32:18Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 17:49:54.531144       1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3437967 (3439015)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T21:32:21Z" level=info msg="Establishing watch on endpoint linkerd-prometheus.linkerd:9090"

(If the output is long, please create a gist and
paste the link here.)

`linkerd check` output

kubernetes-api
--------------
✔ can initialize the client
✔ can query the Kubernetes API

kubernetes-version
------------------
✔ is running the minimum Kubernetes API version

linkerd-existence
-----------------
✔ control plane namespace exists
✔ controller pod is running
✔ can initialize the client
✔ can query the control plane API

linkerd-api
-----------
✔ control plane pods are ready
✔ can query the control plane API
✔ [kubernetes] control plane can talk to Kubernetes
✔ [prometheus] control plane can talk to Prometheus

linkerd-service-profile
-----------------------
✔ no invalid service profiles

linkerd-version
---------------
✔ can determine the latest version
✔ cli is up-to-date

control-plane-version
---------------------
✔ control plane is up-to-date

Status check results are ✔

Environment

Kubernetes Version:
Cluster Environment: (GKE, AKS, kops, ...) EKS
Host OS: Amazon ami
Linkerd version: edge 19.1.2

Possible solution

Additional context

The text was updated successfully, but these errors were encountered:

ihcsim · 2019-01-22T05:49:02Z

Any chance that you have some minimal YAML manifests that we can use to reproduce the issue? I am not seeing the errors on EKS 1.11.5. Is this a new installation of edge 19.1.2 or an upgrade from an older version of Linkerd2? Also, do you know if this happens on other EKS clusters too, or just one particular cluster?

jmirc · 2019-01-22T11:27:27Z

I am sorry I can't share our code. We have only one EKS cluster with Linkerd.

However, I can do testing on our side.

I upgraded the linkerd stack for each new small update. I can delete the linkerd stack and try with a new fresh installation. Let me know.

jmirc · 2019-01-22T11:41:22Z

I just deleted the namespace linkerd and re-install and I am getting the following errors.
I don't have this exception 172.20.0.1:443: connect: connection refused from other apps.

linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:01Z" level=info msg="running version edge-19.1.2"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:01Z" level=info msg="waiting for caches to sync"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.556917       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://172.20.0.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.557836       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://172.20.0.1:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.557934       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.20.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.558011       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://172.20.0.1:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.558411       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://172.20.0.1:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="caches synced"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="starting gRPC server on 127.0.0.1:8088"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="starting admin server on :9998"
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error

ihcsim · 2019-01-22T21:00:57Z

So after letting my EKS cluster run overnight, my linkerd proxies in the linkerd and application namespaces are flooded with the WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("linkerd-grafana.linkerd.svc.cluster.local")), port: 3000 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" }) error. However, my application is still up and running.

fwiw, I was using t3.small workers and I did see an OOM error on one of my nodes, causing application to be unavailable. But the node recovered. I assume you are using bigger machines than mine?

The connect: connection refused errors you saw might just be start up errors. I saw them on GKE and Minikube too when the control plane was first deployed. And yes, definitely seeing many error fetching profile warnings on other platforms too. But I don't think they are related to your problem.

Are all your meshed services unreachable?

jmirc · 2019-01-22T21:25:35Z

Hi,

I was able to reach from ingress the service but I was not able to call an external service. By external, I mean another service running on Kubernetes. However, I was able to call this service from another service that was not meshed by linkerd.

I will retry later today or tomorrow with a fresh install of linkerd. So you know why we are getting all these errors

olix0r · 2019-01-22T21:58:33Z

It seems at least possible that this issue is related to #2111. I think it probably makes sense for us to spend some more time on #2111 first to see if that bug could explain the behavior reported in this issue.

jmirc · 2019-01-22T22:17:42Z

Interesting. We have grpc service also. I will try to mesh only the services that are not using grpc to see if I am still getting the errors

jmirc · 2019-01-23T00:14:10Z

I injected linkerd inside a service and I am getting the same error in the linkerd-proxy which is

WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

Inside the injected container, I am trying to call another service and I am getting the following error

webmaster@cs-ch-gateway-adapter:/app$ curl http://cs-ch-domain-manager-v1:8080/service/security/clients  -v
*   Trying 172.20.1.80...
* TCP_NODELAY set
* Connected to cs-ch-domain-manager-v1 (172.20.1.80) port 8080 (#0)
> GET /service/security/clients HTTP/1.1
> Host: cs-ch-domain-manager-v1:8080
> User-Agent: curl/7.62.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< content-length: 0
< date: Wed, 23 Jan 2019 00:07:22 GMT
<
* Connection #0 to host cs-ch-domain-manager-v1 left intact

The error 500 is not returned by the service, it is returned by linkerd-proxy

If I am calling the same service through the external route53 - I am getting the response.
Other services can communicate with the linkerd injected service. So, only outbound connections are affected

hawkw · 2019-01-23T18:43:13Z

This does not appear to be related to issue #2111; that issue was caused by a client sending requests that were not standards-compliant.

jmirc · 2019-01-23T19:00:22Z

Ok, let me know if I can help on anything to fix this issue

hawkw · 2019-01-23T20:23:42Z

@jmirc just to clarify, you only began seeing this error after upgrading to edge-19.1.2? What version were you on previously?

jmirc · 2019-01-23T20:40:52Z

Yes. I didn't have this issue with the previous version which was edge-19.1.1.
I will retry to install the version edge-19.1.1 this evening to see if I am getting the issue.

hawkw · 2019-01-23T20:42:10Z

Thanks @jmirc! That narrows the search space considerably. 🙂

klingerf · 2019-01-23T21:06:15Z

I can confirm that the warnings in the linkerd-proxy container started appearing as of the edge-19.1.2 release.

If I install the edge-19.1.1 release and open the dashboard, in the controller's linkerd-proxy logs, I see:

$ kubectl -n linkerd logs deploy/linkerd-controller linkerd-proxy --timestamps
2019-01-23T20:43:46.8655514Z INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("localhost.")), port: 8086 }))
2019-01-23T20:43:46.8656894Z INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
2019-01-23T20:43:46.8670232Z INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
2019-01-23T20:43:46.8670578Z INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
2019-01-23T20:43:46.8670856Z INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
2019-01-23T20:43:46.8672108Z INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
...

Whereas if I install the edge-19.1.2 release and open the dashboard, in the controller's linkerd-proxy logs, I see:

$ kubectl -n linkerd logs deploy/linkerd-controller linkerd-proxy --timestamps
2019-01-23T21:02:39.7247423Z INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("localhost.")), port: 8086 }))
2019-01-23T21:02:39.724812Z INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
2019-01-23T21:02:39.7248405Z INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
2019-01-23T21:02:39.7248669Z INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
2019-01-23T21:02:39.7248925Z INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
2019-01-23T21:02:39.7249283Z INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
...
2019-01-23T21:03:01.3305407Z WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:07.338784Z WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:11.4398961Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:11.4554226Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:12.0799139Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-controller-api.linkerd.svc.cluster.local:8085: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

jmirc · 2019-01-23T21:09:48Z

I can confirm too. just installed the previous version (edge-19.1.1) and everything works.

INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("linkerd-proxy-api.linkerd.svc.cluster.local")), port: 8086 }))
INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}

hawkw · 2019-01-23T21:10:49Z

Excellent, thanks again for confirming that for me. I have some theories about possible causes of this issue and I'm looking into them.

hawkw · 2019-01-23T23:28:58Z

Interesting, I've been testing with a fresh linkerd install, and I'm thus far only seeing the

error fetching profile for ...:  Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

log line, not the

 Destination.Get stream errored for NameAddr { ... }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

error. I believe the "error fetching profile" warning is relatively benign, and only the failure to get destinations should result in the proxy returning 500 errors.

ihcsim · 2019-01-24T00:11:24Z

@hawkw To date with the latest edge version, I've only see the Destination.Get stream errored for NameAddr error in EKS, not GKE or Minikube, after leaving the app running for a few hours. But even with the errors, the emojivoto application seems to work just fine.

hawkw · 2019-01-24T00:18:03Z

@ihcsim Iiiiiiinteresting! I've been testing exclusively in GKE, but I note that @jmirc is also running in EKS. Issue #2115, which AFAICT is just the timeout error, was reported on Azure.

jon-walton · 2019-01-24T01:22:16Z

to add, I've also seen (in azure and starting with edge-19.1.2. edge-19.1.1 was fine)

Destination.Get stream errored for NameAddr { ... }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

my thoughts are that the error fetching profile is benign but the ingress controller starts returning 500s once Destination.Get stream errored shows up in the logs. The only way to recover is to restart the ingress controller pods

to note: we only have http rest services, no grpc

hawkw · 2019-01-24T01:28:54Z

thanks @jon-walton, that confirms what I've been seeing.

the issue isn't specific to gRPC services, as the proxy itself uses gRPC to talk to the control plane's service discovery API.

bourquep · 2019-01-24T18:09:39Z

FWIW, I've installed edge 19.1.2 on an Azure AKS cluster and am seeing the same kinds of errors.

Some notes:

That cluster used to run linkerd stable 2.1.0, but I had uninjected the proxy from all deployments and deleted the linked namespace about a week ago.
I have installed the edge-19.1.2 CLI on my mac, and ran the linkerd install | kubectl apply -f - command
I have not yet injected the proxy into any deployment.

Highlights from the controller logs:

A bunch of linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for some-service-endpoint
A few public-api E0124 17:53:25.043216 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod (and v1.AnythingElse)

Full result of linkerd logs --control-plane-component controller are in thie attached file.

controller.log

bourquep · 2019-01-24T18:23:41Z

I did a cleanup as thorough as I could think of:

k delete ns linkerd
k delete clusterrolebinding linkerd-linkerd-controller
k delete clusterrolebinding linkerd-linkerd-prometheus
k delete clusterrole linkerd-linkerd-controller
k delete clusterrole linkerd-linkerd-prometheus
k delete crd serviceprofiles.linkerd.io

And installed edge-19.1.2 again (linkerd install | kubectl apply -f -)

Now, I'm not seeing the "error fetching profile" errors anymore, just the "Failed to list *v1.Pod/Deployment/..." errors)

controller2.log

bourquep · 2019-01-24T19:19:08Z

I've just tried with stable-2.1.0, and I see the same errors:

k logs linkerd-controller-55cf8bd769-pfbw6 -n linkerd -c public-api


time="2019-01-24T18:35:50Z" level=info msg="running version stable-2.1.0"
time="2019-01-24T18:35:50Z" level=info msg="starting admin server on :9995"
time="2019-01-24T18:35:50Z" level=info msg="starting HTTP server on :8085"
time="2019-01-24T18:35:50Z" level=info msg="waiting for caches to sync"
E0124 18:35:50.653890       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.653890       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.653958       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.654014       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.654092       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662692       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662692       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662747       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662792       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.663056       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669449       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669456       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669506       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669549       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669558       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677299       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677299       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677385       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677439       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677690       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:54.991160       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: EOF
time="2019-01-24T18:35:56Z" level=info msg="caches synced"

klingerf · 2019-01-24T20:01:24Z

@bourquep Thanks for the additional info. I just wanted to chime in and say that those "connection refused" messages that appear prior to the "caches synced" message are (unfortunately) expected. They're a result of the public-api trying to query the kubernetes API before the linkerd-proxy container in the same pod is ready to serve requests. They eventually succeed if you see the "caches synced" message.

For more context, we use k8s.io/client-go to query the kubernetes API, and that package uses glog to log errors when the API is unreachable, before retrying. We would be better off suppressing all of the glog logs, but we have to redirect them to stderr, due to all of the reasons mentioned in kubernetes/kubernetes#61006. Kubernetes recently swapped out glog with it's own fork (called klog 🙄) that is apparently more configurable. So it's possible that by updating to a more recent version of client-go we could suppress those message, but we haven't gotten around to it yet.

bourquep · 2019-01-24T20:02:52Z

Aahh, thanks for clarifying that. :)

siggy · 2019-01-24T20:04:21Z

@bourquep Similar to @klingerf's previous reply, those profiles error fetching profile for some-service-endpoint messages coming from linkerd-proxy are also benign. They indicate that a service profile has not been installed for a service, which is default behavior unless you have explicitly added service profiles. We're working on a fix to make those log lines look less like errors.

bourquep · 2019-01-24T20:28:08Z

So considering that my install went like this, everything should be fine regardless of those errors?

(go to 1:15 to see the logs)

siggy · 2019-01-24T20:50:00Z

A little more info on the error fetching profile for errors. I observed these running locally with edge-19.1.2, and then they went away once I installed service profiles for the control plane...

linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

linkerd install-sp | kubectl apply -f -

linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

olix0r · 2019-01-24T22:24:10Z

We've narrowed the error fetching profile warning down to a recent controller change. Though, my understanding is that this warning should not indicate any functional problem for traffic. However, I believe folks have reported seeing communication fail in this situation? Is that true?

The errors related to the Destination.Get stream errored for NameAddr are more concerning, as this implies an error is occurring in an unrelated API path, though.

hawkw · 2019-01-25T01:11:57Z

The edge-19.1.3 release that just went out contains a few changes that should be relevant here: first, we've fixed the issue which was causing the (benign) error fetching profile ... log spam, and second (more importantly), we've improved the diagnostics for gRPC errors --- it should hopefully be much easier to now to track down where the Destination.Get error is coming from on the new version.

bourquep · 2019-01-25T01:13:26Z

I have 45 minutes before my son's hockey game starts, installing now! :)

hawkw · 2019-01-25T01:17:38Z

thanks @bourquep ❤️

bourquep · 2019-01-25T01:20:27Z

Hey, I'm in the middle of nowhere, nothing else to do. :)

https://www.google.com/maps/@46.5172089,-72.7554464,12z

jmirc · 2019-01-25T01:39:51Z

@bourquep not too far from me. I am in Montreal ;) I am starting to test this version

bourquep · 2019-01-25T01:40:36Z

@jmirc Je suis sur la rive-sud de Mtl, mais en tournoi de hockey à Shawi :)

jmirc · 2019-01-25T01:54:52Z

I finished my test and everything works as expected. No more errors in the log of linkerd-proxy.
My service is able to connect external service.

This new version has fixed all the problems I had previously.

grampelberg · 2019-01-25T17:49:57Z

@hawkw sounds like we can close this one out?

hawkw · 2019-01-25T17:56:27Z

@grampelberg I'd prefer to hear back from @bourquep and @jon-walton before closing this, to confirm that the issue has been resolved for everyone affected.

bourquep · 2019-01-25T18:08:08Z

So far so good on my side!

hawkw · 2019-01-25T18:10:13Z

FWIW, I spun up an AKS test cluster with edge-19.1.3 last night, and after 17 hours, I haven't seen any failures either.

jon-walton · 2019-01-26T00:40:57Z

Yep, 19.1.3 solved all the issues I was having with 19.1.2, awesome 👍

hawkw · 2019-01-26T00:53:59Z

That's great to hear! :D

grampelberg added area/controller bug labels Jan 22, 2019

klingerf mentioned this issue Jan 22, 2019

Nginx timeouts when proxy is injected #2000

Closed

hawkw self-assigned this Jan 23, 2019

hawkw mentioned this issue Jan 23, 2019

error fetching profile after upgrading to edge-19.1.2 #2115

Closed

grampelberg added the priority/P0 Release Blocker label Jan 23, 2019

hawkw closed this as completed Jan 26, 2019

jimmykarily mentioned this issue May 20, 2021

Internal communication with TLS epinio/epinio#130

Closed

github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021

Pod can't reliably establish watches properly #2118

Pod can't reliably establish watches properly #2118

Comments

jmirc commented Jan 21, 2019

Bug Report

What is the issue?

How can it be reproduced?

Logs, error output, etc

linkerd check output

Environment

Possible solution

Additional context

ihcsim commented Jan 22, 2019

jmirc commented Jan 22, 2019

jmirc commented Jan 22, 2019 • edited Loading

ihcsim commented Jan 22, 2019

jmirc commented Jan 22, 2019 • edited Loading

olix0r commented Jan 22, 2019

jmirc commented Jan 22, 2019

jmirc commented Jan 23, 2019 • edited Loading

hawkw commented Jan 23, 2019

jmirc commented Jan 23, 2019

hawkw commented Jan 23, 2019

jmirc commented Jan 23, 2019

hawkw commented Jan 23, 2019

klingerf commented Jan 23, 2019

jmirc commented Jan 23, 2019 • edited Loading

hawkw commented Jan 23, 2019

hawkw commented Jan 23, 2019

ihcsim commented Jan 24, 2019

hawkw commented Jan 24, 2019

jon-walton commented Jan 24, 2019 • edited Loading

hawkw commented Jan 24, 2019

bourquep commented Jan 24, 2019

bourquep commented Jan 24, 2019

bourquep commented Jan 24, 2019

klingerf commented Jan 24, 2019

bourquep commented Jan 24, 2019

siggy commented Jan 24, 2019

bourquep commented Jan 24, 2019 • edited Loading

siggy commented Jan 24, 2019

olix0r commented Jan 24, 2019

hawkw commented Jan 25, 2019

bourquep commented Jan 25, 2019

hawkw commented Jan 25, 2019

bourquep commented Jan 25, 2019

jmirc commented Jan 25, 2019 • edited Loading

bourquep commented Jan 25, 2019

jmirc commented Jan 25, 2019

grampelberg commented Jan 25, 2019

hawkw commented Jan 25, 2019

bourquep commented Jan 25, 2019

hawkw commented Jan 25, 2019

jon-walton commented Jan 26, 2019

hawkw commented Jan 26, 2019

`linkerd check` output

jmirc commented Jan 22, 2019 •

edited

Loading

jmirc commented Jan 22, 2019 •

edited

Loading

jmirc commented Jan 23, 2019 •

edited

Loading

jmirc commented Jan 23, 2019 •

edited

Loading

jon-walton commented Jan 24, 2019 •

edited

Loading

bourquep commented Jan 24, 2019 •

edited

Loading

jmirc commented Jan 25, 2019 •

edited

Loading