Pod can't reliably establish watches properly #2118
Comments
Any chance that you have some minimal YAML manifests that we can use to reproduce the issue? I am not seeing the errors on EKS 1.11.5. Is this a new installation of edge 19.1.2 or an upgrade from an older version of Linkerd2? Also, do you know if this happens on other EKS clusters too, or just one particular cluster? |
I am sorry I can't share our code. We have only one EKS cluster with Linkerd. However, I can do testing on our side. I upgraded the linkerd stack for each new small update. I can delete the linkerd stack and try with a new fresh installation. Let me know. |
I just deleted the linkerd namespace and reinstalled, and I am getting the following errors.
|
So after letting my EKS cluster run overnight, my linkerd proxies in the fwiw, I was using The Are all your meshed services unreachable? |
Hi, I was able to reach the service from the ingress, but I was not able to call an external service. By external, I mean another service running on Kubernetes. However, I was able to call this service from another service that was not meshed by linkerd. I will retry later today or tomorrow with a fresh install of linkerd. Do you know why we are getting all these errors? |
Interesting. We have grpc service also. I will try to mesh only the services that are not using grpc to see if I am still getting the errors |
The error
|
This does not appear to be related to issue #2111; that issue was caused by a client sending requests that were not standards-compliant. |
Ok, let me know if I can help with anything to fix this issue. |
@jmirc just to clarify, you only began seeing this error after upgrading to edge-19.1.2? What version were you on previously? |
Yes. I didn't have this issue with the previous version which was edge-19.1.1. |
Thanks @jmirc! That narrows the search space considerably. 🙂 |
I can confirm that the warnings in the linkerd-proxy container started appearing as of the edge-19.1.2 release. If I install the edge-19.1.1 release and open the dashboard, in the controller's linkerd-proxy logs, I see:
Whereas if I install the edge-19.1.2 release and open the dashboard, in the controller's linkerd-proxy logs, I see:
|
I can confirm too. I just installed the previous version (edge-19.1.1) and everything works.
|
Excellent, thanks again for confirming that for me. I have some theories about possible causes of this issue and I'm looking into them. |
Interesting, I've been testing with a fresh linkerd install, and I'm thus far only seeing the
log line, not the
error. I believe the "error fetching profile" warning is relatively benign, and only the failure to get destinations should result in the proxy returning 500 errors. |
@hawkw To date with the latest edge version, I've only seen the |
to add, I've also seen (in azure and starting with edge-19.1.2. edge-19.1.1 was fine)
my thoughts are that the error fetching profile is benign but the ingress controller starts returning 500s once to note: we only have http rest services, no grpc |
thanks @jon-walton, that confirms what I've been seeing. the issue isn't specific to gRPC services, as the proxy itself uses gRPC to talk to the control plane's service discovery API. |
FWIW, I've installed edge 19.1.2 on an Azure AKS cluster and am seeing the same kinds of errors. Some notes:
Highlights from the controller logs:
Full result of |
I did a cleanup as thorough as I could think of:
And installed edge-19.1.2 again ( Now, I'm not seeing the "error fetching profile" errors anymore, just the "Failed to list *v1.Pod/Deployment/..." errors) |
I've just tried with stable-2.1.0, and I see the same errors:
|
@bourquep Thanks for the additional info. I just wanted to chime in and say that those "connection refused" messages that appear prior to the "caches synced" message are (unfortunately) expected. They're a result of the public-api trying to query the kubernetes API before the linkerd-proxy container in the same pod is ready to serve requests. They eventually succeed if you see the "caches synced" message. For more context, we use k8s.io/client-go to query the kubernetes API, and that package uses glog to log errors when the API is unreachable, before retrying. We would be better off suppressing all of the glog logs, but we have to redirect them to stderr, due to all of the reasons mentioned in kubernetes/kubernetes#61006. Kubernetes recently swapped out glog with its own fork (called klog 🙄) that is apparently more configurable. So it's possible that by updating to a more recent version of client-go we could suppress those messages, but we haven't gotten around to it yet. |
Aahh, thanks for clarifying that. :) |
@bourquep Similar to @klingerf's previous reply, those |
A little more info on the
linkerd install-sp | kubectl apply -f -
|
We've narrowed the error fetching profile warning down to a recent controller change. Though, my understanding is that this warning should not indicate any functional problem for traffic. However, I believe folks have reported seeing communication fail in this situation? Is that true? The errors related to the |
The |
I have 45 minutes before my son's hockey game starts, installing now! :) |
thanks @bourquep ❤️ |
Hey, I'm in the middle of nowhere, nothing else to do. :) |
@bourquep not too far from me. I am in Montreal ;) I am starting to test this version |
@jmirc I'm on the South Shore of Mtl, but at a hockey tournament in Shawi :) |
I finished my test and everything works as expected. No more errors in the log of linkerd-proxy. This new version has fixed all the problems I had previously. |
@hawkw sounds like we can close this one out? |
@grampelberg I'd prefer to hear back from @bourquep and @jon-walton before closing this, to confirm that the issue has been resolved for everyone affected. |
So far so good on my side! |
FWIW, I spun up an AKS test cluster with |
Yep, 19.1.3 solved all the issues I was having with 19.1.2, awesome 👍 |
That's great to hear! :D |
Bug Report
What is the issue?
I am running the latest version of linkerd edge 19.1.2 and I am getting this error
How can it be reproduced?
I just deployed the latest version. Nothing more
Logs, error output, etc
output for
linkerd logs --control-plane-component controller
output for
linkerd logs --control-plane-component controller -c proxy-api
(If the output is long, please create a gist and
paste the link here.)
linkerd check output
Environment
Possible solution
Additional context