first request slow #4680

Closed

itajaja opened this issue Jun 15, 2021 · 9 comments

Comments

itajaja commented Jun 15, 2021

I notice that, for some reason (related to Calico), the first requests when a pod starts are slow. There are also some slow requests in other circumstances (e.g. when my Flask service receives a request and needs to make an upstream call).

Steps to Reproduce (for bugs)

To reproduce, I run the following pod:

kubectl run --restart='Never' --rm -it calicotest --image=python -- sh -c '
echo starting
curl https://example.com > /dev/null
curl https://example.com > /dev/null'

The output I receive is the following:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1256  100  1256    0     0    247      0  0:00:05  0:00:05 --:--:--   284
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1256  100  1256    0     0  83733      0 --:--:-- --:--:-- --:--:-- 83733

You can see that the first request took 5 seconds, while the second one was instantaneous. This happens with 100% reliability, and it's not a random delay: it's always around 5s. I am pretty sure this is related to Calico, because I ran the same pod on a similar cluster without Calico and I don't see this behavior there. I also deleted Calico from this cluster (and got a new node, so that Calico didn't configure it) and the behavior didn't occur. When I re-installed Calico, the slow first request came back. So I am not sure why this happens, but I am pretty sure it's related to Calico.
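
A quick way to see where that 5s actually goes is curl's -w timing variables. As a rough diagnostic sketch (same image and pod name as the repro above, nothing else assumed):

kubectl run --restart='Never' --rm -it calicotest --image=python -- sh -c '
curl -s -o /dev/null -w "dns: %{time_namelookup}s connect: %{time_connect}s total: %{time_total}s\n" https://example.com
curl -s -o /dev/null -w "dns: %{time_namelookup}s connect: %{time_connect}s total: %{time_total}s\n" https://example.com'

If the delay on the first call sits almost entirely in time_namelookup, that points at DNS resolution rather than the transfer itself.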

Your Environment

Truth be told, I have been experiencing this behavior for quite some time, since before I decided to update, so I can say it's not a bug introduced in the latest versions.

itajaja commented Jun 16, 2021

Some more intel, although I still feel like I have no idea why this is happening:

kubectl run --restart='Never' --rm -it cltest --image=python -- sh -c '
curl http://example.com > /dev/null
curl http://99.84.39.78 > /dev/null'

the first request takes 5s, the second 0s. Now, if I invert the two curl calls:

kubectl run --restart='Never' --rm -it cltest --image=python -- sh -c '
curl http://99.84.39.78 > /dev/null
curl http://example.com > /dev/null
'

the first request takes substantially less (<1s) and the second 0s. I have no idea what's going on here, but I figure I might share this 🙏

@nickperry
Contributor

Your 5s is virtually certain to be from a DNS timeout. I guess you're not running Node Local DNS Cache and that one of your CoreDNS pods is unreachable from your test pod. Because Kubernetes is load balancing between two CoreDNS pods, you are seeing the problem on roughly every other curl request.
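
One way to check that, as a rough sketch (the 172.20.0.10 below is only a placeholder; substitute a CoreDNS endpoint IP from the first command):

# List the individual CoreDNS endpoints behind the kube-dns Service
kubectl -n kube-system get endpoints kube-dns -o wide

# From a throwaway pod, inspect resolv.conf (the default resolver timeout is 5s,
# which matches the delay above) and query each CoreDNS endpoint directly
kubectl run --restart='Never' --rm -it dnsprobe --image=busybox -- sh -c '
cat /etc/resolv.conf
# 172.20.0.10 is a placeholder: use a CoreDNS endpoint IP from the command above
nslookup example.com 172.20.0.10'

If one endpoint answers and the other times out, that would match the every-other-request pattern described above.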

itajaja commented Jun 23, 2021

hey @nickperry, interesting, I'll try with more than one DNS request. Do you think there is any reason why this happens only within a cluster with Calico? I think I can reproduce it pretty reliably by switching Calico on and off (I need new nodes when removing Calico though), and it only happens when Calico is on.

@caseydavenport
Member

@itajaja did you ever figure this one out? It does seem likely to be a DNS issue rather than a Calico one, given what you've said above.

When you say it only happens with Calico on, what are you comparing it against?

itajaja commented Aug 11, 2021

hey @caseydavenport, I had the same cluster with Calico removed and re-added to test with and without it (reflushing the nodes in between), and it reliably only happens with Calico. It definitely also has to do with DNS, but it's not just DNS, as this behavior does not happen without Calico. To summarize: it only happens within the first seconds after the pod starts, only happens with DNS requests, and only happens with Calico (with or without network policies for that pod). That is what I can confidently say; anything else is probably beyond my knowledge and would require a deep understanding of the technologies involved. But I'd be curious to see if anyone else can replicate this in their Calico-instrumented clusters. It should be easy to try, just:

kubectl run --restart='Never' --rm -it calicotest --image=python -- sh -c '
echo starting
curl https://example.com > /dev/null
curl https://example.com > /dev/null'

and see if the first request is much slower than the second.

itajaja commented Aug 11, 2021

Another bit of intel: in the Calico-less cluster, if I try the snippet above:

  • first curl https://example.com > /dev/null: avg speed 89714
  • second curl https://example.com > /dev/null: avg speed 175k

These are very reliable values (it happens every time), so even in the Calico-less cluster the first request is much slower, but still orders of magnitude faster than with Calico (avg speed of the first request: 247).

lheinlen commented Oct 1, 2021

We seem to have the same issue. We did some research as well and here are our findings.

We are on...
Kubernetes: 1.21
Amazon VPC CNI: 1.9.0
Tigera Operator: 1.20.1
Calico: 3.20.0
Calico Config (using what AWS recommends):
https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/357dfd6a7b4d17de4145b73a82947e7a99461f30/config/master/calico-operator.yaml
https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/357dfd6a7b4d17de4145b73a82947e7a99461f30/config/master/calico-crs.yaml

The 5s delay certainly seems to be a compound issue where a delay that exists during pod startup causes an additional delay in DNS. We're not exactly sure what's going on there yet, as we haven't had a chance to dig through the tcpdumps of the requests. As an example of the issue happening outside of DNS requests, we did a test that makes HTTP requests directly to the IP of another pod in the cluster (a rough shell equivalent is sketched after the log below)...

2021/10/01 20:35:17 SUCCEEDED - duration: 1.0218743s
2021/10/01 20:35:17 SUCCEEDED - duration: 1.202938ms
2021/10/01 20:35:17 SUCCEEDED - duration: 1.010641ms
2021/10/01 20:35:17 SUCCEEDED - duration: 1.048572ms
2021/10/01 20:35:17 SUCCEEDED - duration: 1.010367ms
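
For reference, a rough shell equivalent of that direct-to-IP loop (10.0.1.23:8080 is only a placeholder for the target pod's IP and port):

kubectl run --restart='Never' --rm -it iptest --image=python -- sh -c '
# 10.0.1.23:8080 is a placeholder: point this at another pod in your cluster
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" http://10.0.1.23:8080/
done'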

The delays only happen (at least for us) when using the Amazon VPC CNI and Calico together. For example (we repeated these tests many times and these examples are representative of all of the results)...

Amazon VPC CNI without Calico

2021/10/01 18:23:04 Sleeping 500ms before doing DNS lookups
2021/10/01 18:23:05 Trying to look up google.com
2021/10/01 18:23:05 SUCCEEDED - duration: 4.017169ms
2021/10/01 18:23:05 Trying to look up google.com
2021/10/01 18:23:05 SUCCEEDED - duration: 4.725565ms

Calico CNI and Calico

2021/10/01 17:53:09 Sleeping 500ms before doing DNS lookups
2021/10/01 17:53:09 Trying to look up google.com
2021/10/01 17:53:09 SUCCEEDED - duration: 6.146613ms
2021/10/01 17:53:09 Trying to look up google.com
2021/10/01 17:53:09 SUCCEEDED - duration: 4.709482ms

Amazon VPC CNI and Calico

2021/10/01 17:45:55 Sleeping 500ms before doing DNS lookups
2021/10/01 17:45:56 Trying to look up google.com
2021/10/01 17:46:01 SUCCEEDED - duration: 5.050024024s
2021/10/01 17:46:01 Trying to look up google.com
2021/10/01 17:46:01 SUCCEEDED - duration: 2.477653ms

The 500ms sleep was just to give a tcpdump running in the background time to start. The delay would still happen with no sleep, or with up to ~2s of sleep. Obviously a sleep is not a very scientific test; we were just curious whether it was a "first request" problem or something that cleared up after some time. It certainly seemed to clear up with time, not with the number of requests. The request delays were generally long enough that only the first request was affected (again, not a very scientific part of the test).

We have a pretty simple reproduction setup and would be happy to provide that, look into anything further, provide additional logs, etc. This issue is currently causing us some problems (pods restart due to DNS timeouts when they first come up, and those restarts then cause erroneous alerts), so we'd love to find a solution.

@caseydavenport
Member

It's possible that what you are seeing is a result of slow kubelet status reporting of the pod's IP address. Calico CNI has a built-in workaround for this, but the Amazon VPC plugin does not, which is why you see it when both are active. We're working with Amazon on a solution to this.

Here's the main issue for that: #3530

Which effectively comes down to this: kubernetes/kubernetes#39113

My guess is that you're hitting that window and then getting bit by backoff timers.
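
For anyone who wants to check whether they are in that window, a rough sketch (pod name reused from the repro above) is to watch how quickly the pod's IP shows up in the API after the container starts:

# Watch the pod: a gap between the container reaching Running and the IP column
# being populated is the window described above
kubectl get pod calicotest -o wide --watch

# or poll the podIP field directly
kubectl get pod calicotest -o jsonpath='{.status.podIP}{"\n"}'

A visible gap there would line up with #3530.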

@caseydavenport
Member

I'm actually going to close this because it seems highly likely that this is a duplicate of #3530, which we're working on a fix for. Thanks all.
