Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pod can't reliably establish watches properly #2118

Closed
jmirc opened this issue Jan 21, 2019 · 43 comments
Closed

Pod can't reliably establish watches properly #2118

jmirc opened this issue Jan 21, 2019 · 43 comments
Assignees

Comments

@jmirc
Copy link

jmirc commented Jan 21, 2019

Bug Report

What is the issue?

I am running the latest version of linkerd edge 19.1.2 and I am getting this error

WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

How can it be reproduced?

I just deployed the latest version. Nothing more

Logs, error output, etc

output for linkerd logs --control-plane-component controller

linkerd linkerd-controller-7bc49fd77f-lwt8q linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

output for linkerd logs --control-plane-component controller -c proxy-api

linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T13:54:55Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 15:57:34.899318       1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3417120 (3420499)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:25:43Z" level=info msg="Establishing watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:32:18Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 17:49:54.531144       1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3437967 (3439015)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T21:32:21Z" level=info msg="Establishing watch on endpoint linkerd-prometheus.linkerd:9090"

(If the output is long, please create a gist and
paste the link here.)

linkerd check output

kubernetes-api
--------------
✔ can initialize the client
✔ can query the Kubernetes API

kubernetes-version
------------------
✔ is running the minimum Kubernetes API version

linkerd-existence
-----------------
✔ control plane namespace exists
✔ controller pod is running
✔ can initialize the client
✔ can query the control plane API

linkerd-api
-----------
✔ control plane pods are ready
✔ can query the control plane API
✔ [kubernetes] control plane can talk to Kubernetes
✔ [prometheus] control plane can talk to Prometheus

linkerd-service-profile
-----------------------
✔ no invalid service profiles

linkerd-version
---------------
✔ can determine the latest version
✔ cli is up-to-date

control-plane-version
---------------------
✔ control plane is up-to-date

Status check results are ✔

Environment

  • Kubernetes Version:
  • Cluster Environment: (GKE, AKS, kops, ...) EKS
  • Host OS: Amazon ami
  • Linkerd version: edge 19.1.2

Possible solution

Additional context

@ihcsim
Copy link
Contributor

ihcsim commented Jan 22, 2019

Any chance that you have some minimal YAML manifests that we can use to reproduce the issue? I am not seeing the errors on EKS 1.11.5. Is this a new installation of edge 19.1.2 or an upgrade from an older version of Linkerd2? Also, do you know if this happens on other EKS clusters too, or just one particular cluster?

@jmirc
Copy link
Author

jmirc commented Jan 22, 2019

I am sorry I can't share our code. We have only one EKS cluster with Linkerd.

However, I can do testing on our side.

I upgraded the linkerd stack for each new small update. I can delete the linkerd stack and try with a new fresh installation. Let me know.

@jmirc
Copy link
Author

jmirc commented Jan 22, 2019

I just deleted the namespace linkerd and re-install and I am getting the following errors.
I don't have this exception 172.20.0.1:443: connect: connection refused from other apps.

linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:01Z" level=info msg="running version edge-19.1.2"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:01Z" level=info msg="waiting for caches to sync"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.556917       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://172.20.0.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.557836       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://172.20.0.1:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.557934       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.20.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.558011       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://172.20.0.1:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap E0122 11:34:01.558411       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://172.20.0.1:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: connect: connection refused
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="caches synced"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="starting gRPC server on 127.0.0.1:8088"
linkerd linkerd-controller-7bc49fd77f-j7j8l tap time="2019-01-22T11:34:02Z" level=info msg="starting admin server on :9998"
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-7bc49fd77f-j7j8l linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error

@ihcsim
Copy link
Contributor

ihcsim commented Jan 22, 2019

So after letting my EKS cluster run overnight, my linkerd proxies in the linkerd and application namespaces are flooded with the WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("linkerd-grafana.linkerd.svc.cluster.local")), port: 3000 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" }) error. However, my application is still up and running.

fwiw, I was using t3.small workers and I did see an OOM error on one of my nodes, causing application to be unavailable. But the node recovered. I assume you are using bigger machines than mine?

The connect: connection refused errors you saw might just be start up errors. I saw them on GKE and Minikube too when the control plane was first deployed. And yes, definitely seeing many error fetching profile warnings on other platforms too. But I don't think they are related to your problem.

Are all your meshed services unreachable?

@jmirc
Copy link
Author

jmirc commented Jan 22, 2019

Hi,

I was able to reach from ingress the service but I was not able to call an external service. By external, I mean another service running on Kubernetes. However, I was able to call this service from another service that was not meshed by linkerd.

I will retry later today or tomorrow with a fresh install of linkerd. So you know why we are getting all these errors

@olix0r
Copy link
Member

olix0r commented Jan 22, 2019

It seems at least possible that this issue is related to #2111. I think it probably makes sense for us to spend some more time on #2111 first to see if that bug could explain the behavior reported in this issue.

@jmirc
Copy link
Author

jmirc commented Jan 22, 2019

Interesting. We have grpc service also. I will try to mesh only the services that are not using grpc to see if I am still getting the errors

@jmirc
Copy link
Author

jmirc commented Jan 23, 2019

  1. I injected linkerd inside a service and I am getting the same error in the linkerd-proxy which is
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
  1. Inside the injected container, I am trying to call another service and I am getting the following error
webmaster@cs-ch-gateway-adapter:/app$ curl http://cs-ch-domain-manager-v1:8080/service/security/clients  -v
*   Trying 172.20.1.80...
* TCP_NODELAY set
* Connected to cs-ch-domain-manager-v1 (172.20.1.80) port 8080 (#0)
> GET /service/security/clients HTTP/1.1
> Host: cs-ch-domain-manager-v1:8080
> User-Agent: curl/7.62.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< content-length: 0
< date: Wed, 23 Jan 2019 00:07:22 GMT
<
* Connection #0 to host cs-ch-domain-manager-v1 left intact

The error 500 is not returned by the service, it is returned by linkerd-proxy

  1. If I am calling the same service through the external route53 - I am getting the response.

  2. Other services can communicate with the linkerd injected service. So, only outbound connections are affected

@hawkw hawkw self-assigned this Jan 23, 2019
@hawkw
Copy link
Contributor

hawkw commented Jan 23, 2019

This does not appear to be related to issue #2111; that issue was caused by a client sending requests that were not standards-compliant.

@jmirc
Copy link
Author

jmirc commented Jan 23, 2019

Ok, let me know if I can help on anything to fix this issue

@hawkw
Copy link
Contributor

hawkw commented Jan 23, 2019

@jmirc just to clarify, you only began seeing this error after upgrading to edge-19.1.2? What version were you on previously?

@grampelberg grampelberg added the priority/P0 Release Blocker label Jan 23, 2019
@jmirc
Copy link
Author

jmirc commented Jan 23, 2019

Yes. I didn't have this issue with the previous version which was edge-19.1.1.
I will retry to install the version edge-19.1.1 this evening to see if I am getting the issue.

@hawkw
Copy link
Contributor

hawkw commented Jan 23, 2019

Thanks @jmirc! That narrows the search space considerably. 🙂

@klingerf
Copy link
Contributor

I can confirm that the warnings in the linkerd-proxy container started appearing as of the edge-19.1.2 release.

If I install the edge-19.1.1 release and open the dashboard, in the controller's linkerd-proxy logs, I see:

$ kubectl -n linkerd logs deploy/linkerd-controller linkerd-proxy --timestamps
2019-01-23T20:43:46.8655514Z INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("localhost.")), port: 8086 }))
2019-01-23T20:43:46.8656894Z INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
2019-01-23T20:43:46.8670232Z INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
2019-01-23T20:43:46.8670578Z INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
2019-01-23T20:43:46.8670856Z INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
2019-01-23T20:43:46.8672108Z INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
...

Whereas if I install the edge-19.1.2 release and open the dashboard, in the controller's linkerd-proxy logs, I see:

$ kubectl -n linkerd logs deploy/linkerd-controller linkerd-proxy --timestamps
2019-01-23T21:02:39.7247423Z INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("localhost.")), port: 8086 }))
2019-01-23T21:02:39.724812Z INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
2019-01-23T21:02:39.7248405Z INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
2019-01-23T21:02:39.7248669Z INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
2019-01-23T21:02:39.7248925Z INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
2019-01-23T21:02:39.7249283Z INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}
...
2019-01-23T21:03:01.3305407Z WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:07.338784Z WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:11.4398961Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:11.4554226Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
2019-01-23T21:03:12.0799139Z WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-controller-api.linkerd.svc.cluster.local:8085: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

@jmirc
Copy link
Author

jmirc commented Jan 23, 2019

I can confirm too. just installed the previous version (edge-19.1.1) and everything works.

INFO linkerd2_proxy::app::main using controller at Some(Name(NameAddr { name: DnsName(DNSName("linkerd-proxy-api.linkerd.svc.cluster.local")), port: 8086 }))
INFO linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
INFO linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
INFO linkerd2_proxy::app::main serving Prometheus metrics on V4(0.0.0.0:4191)
INFO linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 3306}
INFO linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 3306}

@hawkw
Copy link
Contributor

hawkw commented Jan 23, 2019

Excellent, thanks again for confirming that for me. I have some theories about possible causes of this issue and I'm looking into them.

@hawkw
Copy link
Contributor

hawkw commented Jan 23, 2019

Interesting, I've been testing with a fresh linkerd install, and I'm thus far only seeing the

error fetching profile for ...:  Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

log line, not the

 Destination.Get stream errored for NameAddr { ... }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

error. I believe the "error fetching profile" warning is relatively benign, and only the failure to get destinations should result in the proxy returning 500 errors.

@ihcsim
Copy link
Contributor

ihcsim commented Jan 24, 2019

@hawkw To date with the latest edge version, I've only see the Destination.Get stream errored for NameAddr error in EKS, not GKE or Minikube, after leaving the app running for a few hours. But even with the errors, the emojivoto application seems to work just fine.

@hawkw
Copy link
Contributor

hawkw commented Jan 24, 2019

@ihcsim Iiiiiiinteresting! I've been testing exclusively in GKE, but I note that @jmirc is also running in EKS. Issue #2115, which AFAICT is just the timeout error, was reported on Azure.

@jon-walton
Copy link
Contributor

jon-walton commented Jan 24, 2019

to add, I've also seen (in azure and starting with edge-19.1.2. edge-19.1.1 was fine)

Destination.Get stream errored for NameAddr { ... }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })

my thoughts are that the error fetching profile is benign but the ingress controller starts returning 500s once Destination.Get stream errored shows up in the logs. The only way to recover is to restart the ingress controller pods

to note: we only have http rest services, no grpc

@hawkw
Copy link
Contributor

hawkw commented Jan 24, 2019

thanks @jon-walton, that confirms what I've been seeing.

the issue isn't specific to gRPC services, as the proxy itself uses gRPC to talk to the control plane's service discovery API.

@bourquep
Copy link
Contributor

FWIW, I've installed edge 19.1.2 on an Azure AKS cluster and am seeing the same kinds of errors.

Some notes:

  • That cluster used to run linkerd stable 2.1.0, but I had uninjected the proxy from all deployments and deleted the linked namespace about a week ago.
  • I have installed the edge-19.1.2 CLI on my mac, and ran the linkerd install | kubectl apply -f - command
  • I have not yet injected the proxy into any deployment.

Highlights from the controller logs:

  • A bunch of linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for some-service-endpoint
  • A few public-api E0124 17:53:25.043216 1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod (and v1.AnythingElse)

Full result of linkerd logs --control-plane-component controller are in thie attached file.

controller.log

@bourquep
Copy link
Contributor

I did a cleanup as thorough as I could think of:

k delete ns linkerd
k delete clusterrolebinding linkerd-linkerd-controller
k delete clusterrolebinding linkerd-linkerd-prometheus
k delete clusterrole linkerd-linkerd-controller
k delete clusterrole linkerd-linkerd-prometheus
k delete crd serviceprofiles.linkerd.io

And installed edge-19.1.2 again (linkerd install | kubectl apply -f -)

Now, I'm not seeing the "error fetching profile" errors anymore, just the "Failed to list *v1.Pod/Deployment/..." errors)

controller2.log

@bourquep
Copy link
Contributor

I've just tried with stable-2.1.0, and I see the same errors:

k logs linkerd-controller-55cf8bd769-pfbw6 -n linkerd -c public-api


time="2019-01-24T18:35:50Z" level=info msg="running version stable-2.1.0"
time="2019-01-24T18:35:50Z" level=info msg="starting admin server on :9995"
time="2019-01-24T18:35:50Z" level=info msg="starting HTTP server on :8085"
time="2019-01-24T18:35:50Z" level=info msg="waiting for caches to sync"
E0124 18:35:50.653890       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.653890       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.653958       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.654014       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:50.654092       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662692       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662692       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662747       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.662792       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:51.663056       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669449       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669456       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669506       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669549       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:52.669558       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677299       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.Deployment: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677299       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677385       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677439       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:53.677690       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1beta2.ReplicaSet: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/apis/apps/v1beta2/replicasets?limit=500&resourceVersion=0: dial tcp 52.237.10.158:443: connect: connection refused
E0124 18:35:54.991160       1 reflector.go:205] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Get https://studyo-16fb7fc6.hcp.canadacentral.azmk8s.io:443/api/v1/pods?limit=500&resourceVersion=0: EOF
time="2019-01-24T18:35:56Z" level=info msg="caches synced"

@klingerf
Copy link
Contributor

@bourquep Thanks for the additional info. I just wanted to chime in and say that those "connection refused" messages that appear prior to the "caches synced" message are (unfortunately) expected. They're a result of the public-api trying to query the kubernetes API before the linkerd-proxy container in the same pod is ready to serve requests. They eventually succeed if you see the "caches synced" message.

For more context, we use k8s.io/client-go to query the kubernetes API, and that package uses glog to log errors when the API is unreachable, before retrying. We would be better off suppressing all of the glog logs, but we have to redirect them to stderr, due to all of the reasons mentioned in kubernetes/kubernetes#61006. Kubernetes recently swapped out glog with it's own fork (called klog 🙄) that is apparently more configurable. So it's possible that by updating to a more recent version of client-go we could suppress those message, but we haven't gotten around to it yet.

@bourquep
Copy link
Contributor

Aahh, thanks for clarifying that. :)

@siggy
Copy link
Member

siggy commented Jan 24, 2019

@bourquep Similar to @klingerf's previous reply, those profiles error fetching profile for some-service-endpoint messages coming from linkerd-proxy are also benign. They indicate that a service profile has not been installed for a service, which is default behavior unless you have explicitly added service profiles. We're working on a fix to make those log lines look less like errors.

@bourquep
Copy link
Contributor

bourquep commented Jan 24, 2019

So considering that my install went like this, everything should be fine regardless of those errors?

(go to 1:15 to see the logs)

asciicast

@siggy
Copy link
Member

siggy commented Jan 24, 2019

A little more info on the error fetching profile for errors. I observed these running locally with edge-19.1.2, and then they went away once I installed service profiles for the control plane...

linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-prometheus-74d66f86f6-cc9zv linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-prometheus.linkerd.svc.cluster.local:9090: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd install-sp | kubectl apply -f -
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
linkerd linkerd-controller-b8c4c48c8-x5n4d linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for localhost:6443: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))

@olix0r
Copy link
Member

olix0r commented Jan 24, 2019

We've narrowed the error fetching profile warning down to a recent controller change. Though, my understanding is that this warning should not indicate any functional problem for traffic. However, I believe folks have reported seeing communication fail in this situation? Is that true?

The errors related to the Destination.Get stream errored for NameAddr are more concerning, as this implies an error is occurring in an unrelated API path, though.

@hawkw
Copy link
Contributor

hawkw commented Jan 25, 2019

The edge-19.1.3 release that just went out contains a few changes that should be relevant here: first, we've fixed the issue which was causing the (benign) error fetching profile ... log spam, and second (more importantly), we've improved the diagnostics for gRPC errors --- it should hopefully be much easier to now to track down where the Destination.Get error is coming from on the new version.

@bourquep
Copy link
Contributor

I have 45 minutes before my son's hockey game starts, installing now! :)

@hawkw
Copy link
Contributor

hawkw commented Jan 25, 2019

thanks @bourquep ❤️

@bourquep
Copy link
Contributor

Hey, I'm in the middle of nowhere, nothing else to do. :)

https://www.google.com/maps/@46.5172089,-72.7554464,12z

@jmirc
Copy link
Author

jmirc commented Jan 25, 2019

@bourquep not too far from me. I am in Montreal ;) I am starting to test this version

@bourquep
Copy link
Contributor

@jmirc Je suis sur la rive-sud de Mtl, mais en tournoi de hockey à Shawi :)

@jmirc
Copy link
Author

jmirc commented Jan 25, 2019

I finished my test and everything works as expected. No more errors in the log of linkerd-proxy.
My service is able to connect external service.

This new version has fixed all the problems I had previously.

@grampelberg
Copy link
Contributor

@hawkw sounds like we can close this one out?

@hawkw
Copy link
Contributor

hawkw commented Jan 25, 2019

@grampelberg I'd prefer to hear back from @bourquep and @jon-walton before closing this, to confirm that the issue has been resolved for everyone affected.

@bourquep
Copy link
Contributor

So far so good on my side!

@hawkw
Copy link
Contributor

hawkw commented Jan 25, 2019

FWIW, I spun up an AKS test cluster with edge-19.1.3 last night, and after 17 hours, I haven't seen any failures either.

@jon-walton
Copy link
Contributor

Yep, 19.1.3 solved all the issues I was having with 19.1.2, awesome 👍

@hawkw
Copy link
Contributor

hawkw commented Jan 26, 2019

That's great to hear! :D

@hawkw hawkw closed this as completed Jan 26, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants