The number of max concurrent queries for the dns resolver is 100 #2214

Closed
vladimir-avinkin opened this issue Jul 1, 2018 · 21 comments

@vladimir-avinkin

I'm reopening #2082 because the initial question was not answered during the discussion.

To reiterate:
The current limit is defined here: https://github.com/docker/libnetwork/blob/7e5ff9e9cb4b91cee895cdfa7a7786b3886c366f/resolver.go#L70

It is not configurable, it is quite easily reached with legitimate network code, and there are no easy workarounds.

The initial commit gives no rationale for the change, and the author (@sanimej) later increased it beyond the initial 50.
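
For reference, the limit is a hardcoded constant in resolver.go. A paraphrased sketch of the relevant block (written from memory of that commit, so names and values other than maxConcurrent may differ slightly):

const (
    dnsPort         = "53"
    ptrIPv4domain   = ".in-addr.arpa."
    ptrIPv6domain   = ".ip6.arpa."
    respTTL         = 600
    maxExtDNS       = 3 // max number of external servers to try
    extIOTimeout    = 4 * time.Second
    defaultRespSize = 512
    maxConcurrent   = 100 // the hardcoded cap this issue is about
    logInterval     = 2 * time.Second
)

Because it is a compile-time constant, there is no daemon flag or API to change it without rebuilding.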

@NirBenor

NirBenor commented Jul 2, 2018

Hi, this also occurs for our use case in times of high load.
Our use case is an app that performs many asynchronous reverse-DNS lookups using getnameinfo.

At peak times we have many concurrent queries (we can see a few thousand lookup requests per second), so even a small percentage of timed-out queries is enough to reach that hardcoded value.
We would also be happy to see this made configurable.

Example lines from /var/log/messages:

Jul  2 13:14:11 <host> dockerd: time="2018-07-02T13:14:11.793212614Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:40772"
Jul  2 13:14:29 <host> dockerd: time="2018-07-02T13:14:29.628400657Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:52957"
Jul  2 13:14:36 <host> dockerd: time="2018-07-02T13:14:36.323500409Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:37852"
Jul  2 13:14:41 <host> dockerd: time="2018-07-02T13:14:41.840606882Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:52817"
Jul  2 13:14:51 <host> dockerd: time="2018-07-02T13:14:51.106120756Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:33767"

Regarding our DNS configuration:
The container runs inside an Azure machine, and we do not override ndots or any other resolv.conf parameters. These errors occurred using both Azure's internal DNS and Google DNS (8.8.8.8 and 8.8.4.4), which is the one we currently use.

We ran the Docker daemon in debug mode for a while when our system was under high load and copied the relevant logs for that time period. It appears that some (~4%) of the queries result in i/o timeout errors.

root@<host> /d/l/# cat /tmp/messages.log | grep "IP To resolve" | wc -l
11304
root@<host> /d/l/# cat /tmp/messages.log | grep "read from DNS server failed" | wc -l
436
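
For context, a minimal sketch of the kind of workload described above: many reverse lookups in flight at once. It uses Go's net.LookupAddr purely for illustration (our app uses getnameinfo), and the addresses are made up:

package main

import (
    "fmt"
    "net"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    // Fire off 500 reverse lookups; far more than 100 can be outstanding at once.
    for i := 0; i < 500; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            ip := fmt.Sprintf("10.0.%d.%d", i/256, i%256) // illustrative addresses
            // Each lookup goes to the embedded resolver at 127.0.0.11; when the
            // upstream is slow or times out, these stay outstanding for seconds.
            if names, err := net.LookupAddr(ip); err == nil {
                fmt.Println(ip, names)
            }
        }(i)
    }
    wg.Wait()
}

With a slow or rate-limited upstream, each lookup can stay pending for several seconds, so a burst like this easily keeps more than 100 queries in flight at the embedded resolver.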

@euanh @fcrisciani @ddebroy

@DmitryFrolovTri

Hi all, this issue is a plague for us. Because of it we can't use Docker networks and have to stick to legacy Docker networks, where the resolver behavior is different. On some other occasions we have had to hardcode IPs.
We keep running into this, and yet nothing has moved.
How hard would it be to raise this parameter to 200 in the meantime, or simply make it configurable? That would be very nice indeed.

@ipodsekin

ipodsekin commented Jul 31, 2018

We have the same problem. It's critical to our infrastructure. We are using Docker Enterprise.

@thaJeztah
Member

(Copying here as well.)

Slightly more detail can be found in moby/moby#22185. If you're using a public DNS service, also be sure to check whether it has a rate limit. For example, Google's DNS servers have a rate limit of 100 QPS (which can be raised on request); if you hit that limit, DNS responses will fail or stall, which causes the queue (and failures) in Docker's embedded DNS to grow as well.

Having said the above, I don't know whether the libnetwork maintainers have strong objections to raising the limit once more for situations where requests to upstream DNS servers cannot be processed fast enough to keep the queue under 100 outstanding requests.

@euanh
Collaborator

euanh commented Jul 31, 2018

I don't think there is an objection in principle to increasing the limit; however, it's good to understand why the limit is being hit, to rule out bugs in Docker or application misbehaviour. For example, one problem reported on #2082 was caused by a monitoring system which was issuing hundreds of requests for long-dead containers. The internal resolver could not resolve them, so they were forwarded to the upstream, which also didn't know about them. Increasing the limit in that case would just have hidden a problem which wasn't going to go away.

If we were to change the limit or make it configurable, it would still take some time to appear in a release you can use. In the meantime, for the case of monitoring a fairly static set of servers, could you set up a caching resolver and point your cluster at that instead of Google's DNS? That would reduce the average upstream DNS response latency and therefore reduce the chance of filling up the 100 outstanding upstream requests allowed by the libnetwork resolver. It would also reduce the risk of hitting rate limits imposed by Google.
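
A minimal sketch of that workaround, assuming dnsmasq runs on the Docker host and forwards to Google DNS; the listen address is illustrative and must be reachable from the containers' networks:

# /etc/dnsmasq.conf on the Docker host
cache-size=10000          # cache answers locally to cut upstream latency
dns-forward-max=300       # dnsmasq's own concurrent-query cap (its default is 150)
server=8.8.8.8            # forward cache misses upstream
server=8.8.4.4
listen-address=172.17.0.1 # an address the containers can reach (illustrative)
bind-interfaces

# /etc/docker/daemon.json: make the embedded resolver forward to the host cache
{
  "dns": ["172.17.0.1"]
}

Containers on user-defined networks still query the embedded resolver at 127.0.0.11; it simply forwards cache misses to the local dnsmasq instead of going straight to Google.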

@vladimir-avinkin
Author

I don't think there is an objection in principle to increasing the limit; however, it's good to understand why the limit is being hit, to rule out bugs in Docker or application misbehaviour.

It's cool and all, but I'll again reiterate the original question: why was it added in the first place?

Am I the only one bothered by the fact that there is no rationale stated for the change?

Making it configurable is a start; however, I don't see a need for the limit at all.

@ctelfer
Contributor

ctelfer commented Jul 31, 2018

Well, I can't speak to the original author's intent or the history, but I can think of at least two very good reasons for a concurrent query limit to be present. First, each outstanding query consumes resources in the Docker daemon. A misbehaving container can consume shared resources in the daemon at the expense of other containers, whether file descriptors or memory within the daemon (and cycles and bandwidth, although those are less likely to be problematic IMHO in this case). Second, it is reasonable practice to limit outgoing DNS connections for the same reason it is reasonable to rate-limit outgoing ICMP queries: DoS prevention/mitigation. A compromised container, or simply one tricked into making a DNS query, can become part of a DoS amplification attack. I suspect others could think of further good reasons.

DNS queries are generally not supposed to be repeated per connection, and their results are designed to be cached either locally or a few hops away. So lots of outstanding simultaneous queries from a single process is usually cause for question/analysis; #2082 really demonstrated exactly this. That said, I think we all want Docker to be a flexible platform for distributed computing: "batteries included, but swappable". Hence, no one is objecting to the notion of making the limit configurable. Hopefully we also all want the default behavior to be stable, responsible and debuggable.

@DmitryFrolovTri

DmitryFrolovTri commented Aug 1, 2018

Hello All,

I am not so sure about the original intent of the 100 concurrent DNS query limit. However, I saw a scenario where it wasn't enough:

An x1.32xlarge AWS host: 128 vCPUs, 10G network, 2 TB of memory, running Docker with roughly 14K threads.

  • After migrating from the default Docker network (where we use the AWS DNS resolver) to a custom network, SMTP servers that send and receive mail across multiple containers hit the 100-query limit, because domains must be resolved and IPs reverse-resolved during mail sending and logging. We were not able to bypass this behavior with custom DNS servers/settings inside the containers, and we migrated back to the default Docker network, where we use the AWS resolver. The host was sending millions of e-mails daily; my estimate is that for our case even a limit of 150 concurrent DNS queries might not be enough.

dnsmasq has a different default setting than Docker (150), and its documentation notes that even this is not enough in the following situation:

http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html

-0, --dns-forward-max=
Set the maximum number of concurrent DNS queries. The default value is 150, which should be fine for most setups. The only known situation where this needs to be increased is when using web-server log file resolvers, which can generate large numbers of concurrent queries.

And of course dnsmasq allows changing it to whatever value we like, so obviously they have seen cases where that is needed. Since Docker is expected to run almost any workload, there are cases where the default of 100 needs to be raised.

The following solutions are possible:

  • Hardcode to a higher value 150 or 200 (easy)
  • Make this limit globally configurable (easy)
  • Remove this limit (easy)
  • Make this limit configurable per docker network (medium)
  • Allow docker networks to use non-standard resolver via configuration and avoid internal resolver (hard, the best solution)

So I am voting for "make this limit globally configurable (easy)": it's easy to do and would give Docker users control over this setting. If someone hits the bottleneck, for whatever reason, they have a signal to investigate, or they can simply increase the limit.
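
Purely to illustrate what "globally configurable" could look like, a hypothetical daemon option; neither this flag nor any equivalent exists in dockerd today, the name is invented:

# Hypothetical flag, shown only to make the proposal concrete; it does not exist.
dockerd --dns-max-concurrent-queries=500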

@ipodsekin

Hi guys,

We have huge docker hosts with many hundreds of containers.

"640K ought to be enough for anyone."
Bill Gates

It's the same idea. It would be great if we had the ability to configure such parameters on Docker hosts. Either way, I don't see a reason to hardcode them.

@DmitryFrolovTri

DmitryFrolovTri commented Aug 1, 2018

@mouzfun

It's cool and all, but I'll again reiterate the original question: why was it added in the first place?

Limits are introduced to avoid DDoS or resource over-utilization. It is normal to have such limits on DNS servers.

Am I the only one bothered by the fact that there is no rationale stated for the change?

I hope the above use case explains at least one real case where the 100 limit is exceeded.
As for why this limit is there, beyond what is stated above I can't think of a reason, except that such limits are present in most DNS server configurations.

Making it configurable is a start; however, I don't see a need for the limit at all.

I would be fine with its removal. I guess a test that removes it and issues an outrageous number of requests might pinpoint the issues for us.

@thiagoalves
Contributor

I have created a custom build that raises the max concurrent queries to 10000, and used resperf to analyze the Docker resolver's performance. I was able to run up to 5k queries per second on it.

@thiagoalves
Contributor

Installed dnsperf as described here:
https://gist.github.com/i0rek/369a6bcd172e214fd791

Then ran some experiments like this:

root@9d50de753c90:/dnsperf# head -n200 queryfile-example-current > scrambled-200

root@9d50de753c90:/dnsperf# for i in {1..5000}; do cat scrambled-200 >> 200-scrambled-multiple; done

root@9d50de753c90:/dnsperf# resperf -m 6000 -d 200-scrambled-multiple -s127.0.0.11
DNS Resolution Performance Testing Tool
Nominum Version 2.0.0.0

[Status] Command line: resperf -m 6000 -d 200-scrambled-multiple -s127.0.0.11
[Status] Sending
[Status] Waiting for more responses
[Status] Testing complete

Statistics:

  Queries sent:         180000
  Queries completed:    180000
  Queries lost:         0
  Run time (s):         100.000001
  Maximum throughput:   5926.000000 qps
  Lost at that point:   0.00%

This was executed on AWS. Docker was configured to use a dnsmasq service running on the host with cache-size=10000 and dns-forward-max=10000

@thiagoalves
Contributor

@DmitryFrolovTri I am sending a pull request with the change. On our production hosts (running a custom build), our benchmark indicates that we can run as many as 9k queries per second using the Docker resolver.

@swift1911

Is there any progress on this issue?

@thiagoalves
Contributor

@swift1911 PR Merged

#2262

@swift1911

@thiagoalves Awesome! So is there any plan for Docker to release a bugfix version containing this PR?

@thiagoalves
Contributor

@swift1911 The fix is already merged to master, so it is going to be released soon. I don't know the exact time frame, but I would say it will take a few months or so.

@fcrisciani

The moby backport is already in progress (moby/moby#38031); once that one is merged, you can try the nightly build.

@thaJeztah
Member

moby/moby#38031 was merged, so this should be resolved on master / nightly

robertgzr pushed a commit to balena-os/balena-libnetwork that referenced this issue Mar 4, 2019
This addresses/alleviates moby#2214

The new proposed limit should remediate the issue for most users.

Signed-off-by: Thiago Alves Silva <thiago.alves@aurea.com>
vodolaz095 added a commit to vodolaz095/node-dnsbl-lookup that referenced this issue May 1, 2019
2. eslint + linting code
3. npm test fixed (commented out lists not working)
4. add function `.setServers` to set custom DNS server being used for resolve.

.addServers is important, because in docker, this package hits rate limit
moby/libnetwork#2082
moby/libnetwork#2214
cpuguy83 pushed a commit to cpuguy83/docker that referenced this issue May 25, 2021
This addresses/alleviates moby/libnetwork#2214

The new proposed limit should remediate the issue for most users.

Signed-off-by: Thiago Alves Silva <thiago.alves@aurea.com>