first request slow #4680
Some more intel, although I still feel like I have no idea why this is happening:
The first request takes 5s, the second 0s. Now, if I invert the two curl calls, the first request takes substantially less (<1s) and the second 0s. I have no idea what's going on here, but I figured I might share this 🙏
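For what it's worth, a minimal sketch of that kind of order-inversion test, assuming a throwaway curl pod and two arbitrary target hosts (none of these names or URLs are from the original report):

```sh
# Same two requests in opposite orders, each from a fresh pod; if the slowness
# follows whichever request comes first, the target host is probably not the issue.
kubectl run order-test-a --rm -i --restart=Never --image=curlimages/curl --command -- sh -c '
  curl -sS -o /dev/null -w "first:  %{time_total}s\n" https://www.google.com
  curl -sS -o /dev/null -w "second: %{time_total}s\n" https://www.example.com
'
kubectl run order-test-b --rm -i --restart=Never --image=curlimages/curl --command -- sh -c '
  curl -sS -o /dev/null -w "first:  %{time_total}s\n" https://www.example.com
  curl -sS -o /dev/null -w "second: %{time_total}s\n" https://www.google.com
'
```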
Your 5s is virtually certain to be from a DNS timeout. I'm guessing you're not running NodeLocal DNSCache and that one of your CoreDNS pods is unreachable from your test pod. Because Kubernetes is load balancing between the two CoreDNS pods, you are seeing the problem on roughly every other curl request.
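One way to check that theory, assuming the CoreDNS Service is named kube-dns and the test pod is called my-test-pod (both assumptions), is to list the CoreDNS endpoints and query each one directly from the pod:

```sh
# If one of the two endpoint IPs is unreachable from the pod, roughly every
# other lookup will hang for the resolver's default 5s timeout.
kubectl -n kube-system get endpoints kube-dns -o wide

kubectl exec -it my-test-pod -- sh -c '
  for ip in 172.20.1.10 172.20.2.10; do   # substitute the endpoint IPs listed above
    echo "=== $ip ==="
    nslookup kubernetes.default.svc.cluster.local "$ip"
  done
'
```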
Hey @nickperry, interesting, I'll try with more than one DNS request. Do you think there is any reason why this happens only within a cluster with calico? I think I can reproduce it pretty reliably by switching calico on and off (I need new nodes when removing calico though), and it only happens when calico is on.
@itajaja did you ever figure this one out? It does seem likely to be a DNS issue rather than a Calico one, given what you've said above. When you say it only happens with Calico on, what are you comparing it against?
Hey @caseydavenport, I had the same cluster with calico removed and re-added to test with and without (reflushing the nodes in between), and it reliably only happens with calico. It definitely also has to do with DNS, but it's not just DNS, as this behavior does not happen without calico. To summarize: this only happens within the first seconds after the pod starts, only happens with DNS requests, and only happens with calico (with or without network policies for that pod). This is what I can confidently say; anything else is probably beyond my knowledge and probably requires a deep understanding of the technologies involved. But I'd be curious to see if this is replicable by anyone else in their calico-instrumented clusters. It should be easy to try: spin up a fresh pod, make two requests back to back, and see if the first is much slower than the second.
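A quick way to try that, and to see whether the time is going into name resolution specifically, might look like this (image and target host are assumptions):

```sh
# Run a few identical requests from a fresh pod and print curl's own timing
# breakdown; a large time_namelookup on the first request only would point
# at DNS rather than the connection itself.
kubectl run dns-timing --rm -i --restart=Never --image=curlimages/curl --command -- sh -c '
  for i in 1 2 3; do
    curl -sS -o /dev/null \
      -w "lookup=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n" \
      https://www.google.com
  done
'
```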
Another bit of intel: in the calico-less cluster, if I try the snippet above, I get very reliable values (it happens every time). So even in the calico-less cluster the first request is much slower than the second, but still many orders of magnitude faster than with calico (avg speed of first request
We seem to have the same issue. We did some research as well and here are our findings. We are on... The 5s delay certainly seems to be a compound issue, where a delay that exists during pod startup causes an additional delay in DNS. We're not exactly sure what's going on there yet, as we haven't had a chance to dig through the
The delays only happen (at least for us) when using the Amazon VPC CNI and Calico together. For example (we repeated these tests many times and these examples are representative of all of the results), we compared:
- Amazon VPC CNI without Calico
- Calico CNI and Calico
- Amazon VPC CNI and Calico
The 500ms sleep was just to give a tcpdump running in the background time to start. The delay would happen with no sleep or with up to ~2s of sleep. Obviously a sleep is not a very scientific test; we were just curious if it was a "first request" problem or something that cleared up after some time. It certainly seemed to clear up with time, not with the number of requests. The request delays were generally long enough that only the first request was affected (again, not a very scientific part of the test).

We have a pretty simple reproduction setup and would be happy to provide that, look into anything further, provide additional logs, etc. This issue is currently causing us some problems (pods restart due to DNS timeouts when they first come up and then cause erroneous alerts due to those restarts), so we'd love to find a solution.
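For reference, a rough sketch of that kind of test, with the image, target host, and exact capture command as assumptions rather than the setup actually used:

```sh
# Start a packet capture in the background, give it ~500ms to come up, then
# make two back-to-back requests and print their timings.
kubectl run cni-timing --rm -i --restart=Never --image=nicolaka/netshoot --command -- sh -c '
  tcpdump -i eth0 -w /tmp/dns.pcap port 53 &
  sleep 0.5
  curl -sS -o /dev/null -w "first:  %{time_total}s\n" https://www.google.com
  curl -sS -o /dev/null -w "second: %{time_total}s\n" https://www.google.com
'
```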
It's possible that what you are seeing is a result of slow kubelet status reporting of the pod's IP address. Calico CNI has a built-in workaround for this, but the Amazon VPC plugin does not, which is why you see it when both are active. We're working with Amazon on a solution to this. Here's the main issue for that: #3530, which effectively comes down to this: kubernetes/kubernetes#39113. My guess is that you're hitting that window and then getting bit by backoff timers.
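One way to see whether a pod is hitting that window is to watch how long its IP takes to show up in the pod's API status after it starts; a rough sketch (the pod name is an assumption):

```sh
# Poll the pod's reported phase and IP while it is starting; an empty podIP
# for several seconds after the pod is Running would match the
# status-reporting delay described above.
while true; do
  kubectl get pod first-request-test \
    -o jsonpath='{.status.phase}{" podIP="}{.status.podIP}{"\n"}'
  sleep 0.5
done
```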
I'm actually going to close this because it seems highly likely that this is a duplicate of #3530, which we're working on a fix for. Thanks all.
I notice that, for some reason (related to calico), the first requests when a pod starts are slow. There are also some slow requests in other circumstances (e.g. when my flask service receives a request and needs to make an upstream call).
Steps to Reproduce (for bugs)
to reproduce, I run the following pod:
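A minimal pod along these lines, with the name, image, and target URL as assumptions rather than the exact spec from the report, might look like:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: first-request-test
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl
      command: ["sh", "-c"]
      args:
        - |
          curl -sS -o /dev/null -w "first:  %{time_total}s\n" https://www.google.com
          curl -sS -o /dev/null -w "second: %{time_total}s\n" https://www.google.com
EOF

kubectl logs -f first-request-test
```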
the output I receive is the following:
You can see that it took 5 seconds for the first request, while the second one was instantaneous. This happens with 100% reliability, and it's not a random time, it's always around 5s. I am pretty sure that this is related to calico because I ran the same pod on a similar cluster without calico and I don't see this behavior. I also deleted calico from this cluster (and got a new node, so that calico didn't configure the node) and the behavior didn't occur. I re-installed calico, and again I see the slow request. So, I am not sure why this happens, but I am pretty sure it's related to calico.
Your Environment
Truth be told, I have been experiencing this behavior for quite some time, since before I decided to update, so I can say it's not due to a bug introduced in the latest versions.