The number of max concurrent queries for the dns resolver is 100 #2214

Closed
vladimir-avinkin opened this issue Jul 1, 2018 · 21 comments

@vladimir-avinkin

I'm reopening #2082 because the initial question was not answered during the discussion.

To reiterate:
The current limit is defined here: https://github.com/docker/libnetwork/blob/7e5ff9e9cb4b91cee895cdfa7a7786b3886c366f/resolver.go#L70

It is not configurable, it is quite easily reached with legitimate network code, and there are no easy workarounds.

The initial commit gives no rationale for the change, and the author (@sanimej) later increased it beyond the initial 50.
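
For reference, the limit is a hardcoded constant in resolver.go. A paraphrased sketch of the relevant block (written from memory of that commit, so names and values other than maxConcurrent may differ slightly):

const (
    dnsPort         = "53"
    ptrIPv4domain   = ".in-addr.arpa."
    ptrIPv6domain   = ".ip6.arpa."
    respTTL         = 600
    maxExtDNS       = 3 // max number of external servers to try
    extIOTimeout    = 4 * time.Second
    defaultRespSize = 512
    maxConcurrent   = 100 // the hardcoded cap this issue is about
    logInterval     = 2 * time.Second
)

Because it is a compile-time constant, there is no daemon flag or API to change it without rebuilding.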

@NirBenor

NirBenor commented Jul 2, 2018

Hi, this also occurs for our use case in times of high load.
Our use case is an app that performs many asynchronous reverse-DNS lookups using getnameinfo.

At peak times we have many concurrent queries (we can see a few thousand lookup requests per second), so even a small percentage of timed-out queries is enough to reach that hardcoded value.
We would also be happy to see this made configurable.

Example lines from /var/log/messages:

Jul  2 13:14:11 <host> dockerd: time="2018-07-02T13:14:11.793212614Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:40772"
Jul  2 13:14:29 <host> dockerd: time="2018-07-02T13:14:29.628400657Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:52957"
Jul  2 13:14:36 <host> dockerd: time="2018-07-02T13:14:36.323500409Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:37852"
Jul  2 13:14:41 <host> dockerd: time="2018-07-02T13:14:41.840606882Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:52817"
Jul  2 13:14:51 <host> dockerd: time="2018-07-02T13:14:51.106120756Z" level=error msg="[resolver] more than 100 concurrent queries from 172.18.0.5:33767"

Regarding our DNS configuration:
The container runs inside an Azure machine, and we do not override ndots or any other resolv.conf parameters. These errors occurred using both Azure's internal DNS and Google DNS (8.8.8.8 and 8.8.4.4), which is the one we currently use.

We ran the Docker daemon in debug mode for a while when our system was under high load and copied the relevant logs for that time period. It appears that some (~4%) of the queries result in i/o timeout errors.

root@<host> /d/l/# cat /tmp/messages.log | grep "IP To resolve" | wc -l
11304
root@<host> /d/l/# cat /tmp/messages.log | grep "read from DNS server failed" | wc -l
436
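
For context, a minimal sketch of the kind of workload described above: many reverse lookups in flight at once. It uses Go's net.LookupAddr purely for illustration (our app uses getnameinfo), and the addresses are made up:

package main

import (
    "fmt"
    "net"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    // Fire off 500 reverse lookups; far more than 100 can be outstanding at once.
    for i := 0; i < 500; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            ip := fmt.Sprintf("10.0.%d.%d", i/256, i%256) // illustrative addresses
            // Each lookup goes to the embedded resolver at 127.0.0.11; when the
            // upstream is slow or times out, these stay outstanding for seconds.
            if names, err := net.LookupAddr(ip); err == nil {
                fmt.Println(ip, names)
            }
        }(i)
    }
    wg.Wait()
}

With a slow or rate-limited upstream, each lookup can stay pending for several seconds, so a burst like this easily keeps more than 100 queries in flight at the embedded resolver.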

@euanh @fcrisciani @ddebroy

@DmitryFrolovTri

Hi all, this issue is a plague for us. Because of it we can't use Docker networks and have to stick to legacy Docker networks, where the resolver behavior is different. On some other occasions we have had to hardcode IPs.
We keep running into this, and yet nothing has moved.
How hard would it be to raise this parameter to 200 in the meantime, or simply make it configurable? That would be very nice indeed.

@ipodsekin

ipodsekin commented Jul 31, 2018

We have the same problem. It's critical to our infrastructure. We are using Docker Enterprise.

@thaJeztah
Member

(Copying here as well.)

Slightly more detail can be found in moby/moby#22185. If you're using a public DNS service, also be sure to check whether it has a rate limit. For example, Google's DNS servers have a rate limit of 100 QPS (which can be raised on request); if you hit that limit, DNS responses will fail or stall, which causes the queue (and failures) in Docker's embedded DNS to grow as well.

Having said the above, I don't know whether the libnetwork maintainers have strong objections to raising the limit once more for situations where requests to upstream DNS servers cannot be processed fast enough to keep the queue under 100 outstanding requests.

@euanh
Collaborator

euanh commented Jul 31, 2018

I don't think there is an objection in principle to increasing the limit; however, it's good to understand why the limit is being hit, to rule out bugs in Docker or application misbehaviour. For example, one problem reported on #2082 was caused by a monitoring system which was issuing hundreds of requests for long-dead containers. The internal resolver could not resolve them, so they were forwarded to the upstream, which also didn't know about them. Increasing the limit in that case would just have hidden a problem which wasn't going to go away.

If we were to change the limit or make it configurable, it would still take some time to appear in a release you can use. In the meantime, for the case of monitoring a fairly static set of servers, could you set up a caching resolver and point your cluster at that instead of Google's DNS? That would reduce the average upstream DNS response latency and therefore reduce the chance of filling up the 100 outstanding upstream requests allowed by the libnetwork resolver. It would also reduce the risk of hitting rate limits imposed by Google.
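
A minimal sketch of that workaround, assuming dnsmasq runs on the Docker host and forwards to Google DNS; the listen address is illustrative and must be reachable from the containers' networks:

# /etc/dnsmasq.conf on the Docker host
cache-size=10000          # cache answers locally to cut upstream latency
dns-forward-max=300       # dnsmasq's own concurrent-query cap (its default is 150)
server=8.8.8.8            # forward cache misses upstream
server=8.8.4.4
listen-address=172.17.0.1 # an address the containers can reach (illustrative)
bind-interfaces

# /etc/docker/daemon.json: make the embedded resolver forward to the host cache
{
  "dns": ["172.17.0.1"]
}

Containers on user-defined networks still query the embedded resolver at 127.0.0.11; it simply forwards cache misses to the local dnsmasq instead of going straight to Google.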

@vladimir-avinkin
Author

I don't think there is an objection in principle to increasing the limit; however, it's good to understand why the limit is being hit, to rule out bugs in Docker or application misbehaviour.

It's cool and all, but I'll again reiterate the original question: why was it added in the first place?

Am I the only one bothered by the fact that there is no rationale stated for the change?

Making it configurable is a start; however, I don't see a need for the limit at all.

@ctelfer
Contributor

ctelfer commented Jul 31, 2018

Well, I can't speak to the original author's intent or the history, but I can think of at least two very good reasons for a concurrent query limit to be present. First, each outstanding query consumes resources in the Docker daemon. A misbehaving container can consume shared resources in the daemon at the expense of other containers, whether file descriptors or memory within the daemon (and cycles and bandwidth, although those are less likely to be problematic IMHO in this case). Second, it is reasonable practice to limit outgoing DNS connections for the same reason it is reasonable to rate-limit outgoing ICMP queries: DoS prevention/mitigation. A compromised container, or simply one tricked into making a DNS query, can become part of a DoS amplification attack. I suspect others could think of further good reasons.

DNS queries are generally not supposed to be repeated per connection, and their results are designed to be cached either locally or a few hops away. So lots of outstanding simultaneous queries from a single process is usually cause for question/analysis; #2082 really demonstrated exactly this. That said, I think we all want Docker to be a flexible platform for distributed computing: "batteries included, but swappable". Hence, no one is objecting to the notion of making the limit configurable. Hopefully we also all want the default behavior to be stable, responsible and debuggable.

@DmitryFrolovTri

DmitryFrolovTri commented Aug 1, 2018

Hello All,

I am not so sure about the original intent of the 100 concurrent DNS query limit. However, I saw a scenario where it wasn't enough:

An x1.32xlarge AWS host: 128 vCPUs, 10G network, 2 TB of memory, running Docker with roughly 14K threads.

  • After migrating from the default Docker network (where we use the AWS DNS resolver) to a custom network, SMTP servers that send and receive mail across multiple containers hit the 100-query limit, because domains must be resolved and IPs reverse-resolved during mail sending and logging. We were not able to bypass this behavior with custom DNS servers/settings inside the containers, and we migrated back to the default Docker network, where we use the AWS resolver. The host was sending millions of e-mails daily; my estimate is that for our case even a limit of 150 concurrent DNS queries might not be enough.

dnsmasq has a different default setting than Docker (150), and its documentation notes that even this is not enough in the following situation:

http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html

-0, --dns-forward-max=
Set the maximum number of concurrent DNS queries. The default value is 150, which should be fine for most setups. The only known situation where this needs to be increased is when using web-server log file resolvers, which can generate large numbers of concurrent queries.

And of course dnsmasq allows changing it to whatever value we like, so obviously they have seen cases where that is needed. Since Docker is expected to run almost any workload, there are cases where the default of 100 needs to be raised.

The following solutions are possible:

  • Hardcode to a higher value 150 or 200 (easy)
  • Make this limit globally configurable (easy)
  • Remove this limit (easy)
  • Make this limit configurable per docker network (medium)
  • Allow docker networks to use non-standard resolver via configuration and avoid internal resolver (hard, the best solution)

So I am voting for "make this limit globally configurable (easy)": it's easy to do and would give Docker users control over this setting. If someone hits the bottleneck, for whatever reason, they have a signal to investigate, or they can simply increase the limit.
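
Purely to illustrate what "globally configurable" could look like, a hypothetical daemon option; neither this flag nor any equivalent exists in dockerd today, the name is invented:

# Hypothetical flag, shown only to make the proposal concrete; it does not exist.
dockerd --dns-max-concurrent-queries=500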

@ipodsekin

Hi guys,

We have huge docker hosts with many hundreds of containers.

"640K ought to be enough for anyone."
Bill Gates

It's the same idea. It would be great if we had the ability to configure such parameters on Docker hosts. Either way, I don't see a reason to hardcode them.

@DmitryFrolovTri

DmitryFrolovTri commented Aug 1, 2018

@mouzfun

It's cool and all, but I'll again reiterate the original question: why was it added in the first place?

Limits are introduced to avoid DDoS or resource over-utilization. It is normal to have such limits on DNS servers.

Am I the only one bothered by the fact that there is no rationale stated for the change?

I hope the above use case explains at least one real case where the 100 limit is exceeded.
As for why this limit is there, beyond what is stated above I can't think of a reason, except that such limits are present in most DNS server configurations.

Making it configurable is a start; however, I don't see a need for the limit at all.

I would be fine with its removal. I guess a test that removes it and issues an outrageous number of requests might pinpoint the issues for us.

@thiagoalves
Contributor

I have created a custom build that raises the max concurrent queries to 10000, and used resperf to analyze the Docker resolver's performance. I was able to run up to 5k queries per second on it.

@thiagoalves
Contributor

Installed dnsperf as described here:
https://gist.github.com/i0rek/369a6bcd172e214fd791

Then ran some experiments like this:

root@9d50de753c90:/dnsperf# head -n200 queryfile-example-current > scrambled-200

root@9d50de753c90:/dnsperf# for i in {1..5000}; do cat scrambled-200 >> 200-scrambled-multiple; done

root@9d50de753c90:/dnsperf# resperf -m 6000 -d 200-scrambled-multiple -s127.0.0.11
DNS Resolution Performance Testing Tool
Nominum Version 2.0.0.0

[Status] Command line: resperf -m 6000 -d 200-scrambled-multiple -s127.0.0.11
[Status] Sending
[Status] Waiting for more responses
[Status] Testing complete

Statistics:

  Queries sent:         180000
  Queries completed:    180000
  Queries lost:         0
  Run time (s):         100.000001
  Maximum throughput:   5926.000000 qps
  Lost at that point:   0.00%

This was executed on AWS. Docker was configured to use a dnsmasq service running on the host with cache-size=10000 and dns-forward-max=10000

@thiagoalves
Contributor

@DmitryFrolovTri I am sending a pull request with the change. On our production hosts (running a custom build), our benchmark indicates that we can run as many as 9k queries per second using the Docker resolver.

@swift1911

Is there any progress on this issue?

@thiagoalves
Contributor

@swift1911 PR Merged

#2262

@swift1911

@thiagoalves Awesome! So is there any plan for Docker to release a bugfix version containing this PR?

@thiagoalves
Contributor

@swift1911 The fix is already merged to master, so it is going to be released soon. I don't know the exact time frame, but I would say it will take a few months or so.

@fcrisciani

The moby backport is already in progress (moby/moby#38031); once that one is merged, you can try the nightly build.

@thaJeztah
Member

moby/moby#38031 was merged, so this should be resolved on master / nightly

robertgzr pushed a commit to balena-os/balena-libnetwork that referenced this issue Mar 4, 2019
This addresses/alleviates moby#2214

The new proposed limit should remediate the issue for most users.

Signed-off-by: Thiago Alves Silva <thiago.alves@aurea.com>
vodolaz095 added a commit to vodolaz095/node-dnsbl-lookup that referenced this issue May 1, 2019
2. eslint + linting code
3. npm test fixed (commented out lists not working)
4. add function `.setServers` to set custom DNS server being used for resolve.

.addServers is important, because in docker, this package hits rate limit
moby/libnetwork#2082
moby/libnetwork#2214
cpuguy83 pushed a commit to cpuguy83/docker that referenced this issue May 25, 2021
This addresses/alleviates moby/libnetwork#2214

The new proposed limit should remediate the issue for most users.

Signed-off-by: Thiago Alves Silva <thiago.alves@aurea.com>