The number of max concurrent queries for the dns resolver is 100 #2214
Hi, this also occurs for our use case in times of high load. At peak we may have lots of concurrent queries (we can get a few thousand lookup requests per second), so even a small percentage of timed-out queries is enough to reach that hardcoded value. Example lines from
Regarding our DNS configuration: we ran the Docker daemon in debug mode for a while when our system was under high load, and copied the relevant logs from that time period. It appears that some (~4%) of the queries result in i/o timeout errors.
Hi all, this issue is a plague for us. Because of it we can't use Docker networks and have to stick to legacy Docker networks, as the resolver behavior is different there. On some other occasions we had to hardcode IP addresses.
We have the same problem. It's critical to our infrastructure. We are using Docker Enterprise.
(Copying here as well.) Slightly more details can be found in moby/moby#22185. If you're using a public DNS, also be sure to check whether it has a rate limit. For example, Google's DNS servers have a rate limit of 100 QPS (which can be raised on request); if you hit that rate limit, DNS responses will fail or stall, which causes the queue of outstanding queries/failures in Docker's embedded DNS to grow as well. Having said the above, I don't know if the libnetwork maintainers have big objections to raising the limit once more for situations where requests to upstream DNS servers cannot be processed fast enough to keep the queue under 100 outstanding requests.
I don't think there is an objection in principle to increasing the limit; however, it's good to understand why the limit is being hit, to rule out bugs in Docker or application misbehaviour. For example, one problem reported in #2082 was caused by a monitoring system which was issuing hundreds of requests for long-dead containers. The internal resolver could not resolve them, so they were forwarded to the upstream, which also didn't know about them. Increasing the limit in that case would just have hidden a problem which wasn't going to go away.

If we were to change the limit or make it configurable, it would still take some time to appear in a release you can use. In the meantime, for the case of monitoring a fairly static set of servers, could you set up a caching resolver and point your cluster at that instead of Google's DNS? That would reduce the average upstream DNS response latency and therefore reduce the chance of 'filling up' the 100 outstanding upstream requests allowed by the libnetwork resolver. It would also reduce the risk of hitting rate limits imposed by Google.
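For some rough intuition on why lower upstream latency helps (back-of-the-envelope figures only, assuming numbers in line with the "few thousand lookup requests per second" and ~4% timeouts reported earlier in this thread): the number of in-flight upstream queries is roughly the forwarded query rate times the average upstream response time. At, say, 2,000 forwarded queries per second and a 50 ms upstream round trip, about 2,000 × 0.05 = 100 queries are outstanding at any moment, right at the limit; and if a few percent of queries hang until a multi-second i/o timeout, those alone can occupy more than the entire 100-slot budget (2,000 × 0.04 × 3 s ≈ 240). A local cache that answers most queries in about a millisecond keeps the in-flight count down to a handful.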
It's cool and all, but I'll again reiterate the original question: why was it added in the first place? Am I the only one bothered by the fact that there is no stated rationale behind the change? Making it configurable is a start; however, I don't see a need for it at all.
Well, I can't speak to the original author's intent or the history, but I can think of at least two very good reasons for a concurrent query limit to be present. First, each outstanding query consumes resources in the Docker daemon; a misbehaving container can consume shared resources in the daemon at the expense of other containers, whether in the form of file descriptors, memory, or CPU cycles. Second, DNS queries are generally not supposed to be repeated per connection, and their results are designed to be cached either locally or a few hops away, so lots of outstanding simultaneous queries from a single process is usually a cause for question/analysis. #2082 really demonstrated exactly this.

That said, I think we all want Docker to be a flexible platform for distributed computing: "batteries included, but swappable". Hence, no one is objecting to the notion of making the limit configurable. Hopefully we also all want the default behavior to be stable, responsible and debuggable.
Hello all, I am not so sure about the original intent behind the limit of 100 concurrent DNS requests. However, I saw a scenario where it wasn't enough:
dnsmasq has a different default than Docker: 150. They claim that even this is not enough in the following situation: http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
And of course they allow changing it to whatever we like, so obviously they had a case where this is needed. So I would say that, as Docker is expected to run almost any workload, there are cases when it needs to be tweaked above the default of 100. The following solutions are possible:
So I am voting to make this limit globally configurable (easy): it's easy to do and would give control over this setting to Docker users. If someone hits the bottleneck, for whichever reason, they have a signal to investigate, or they can increase the limit.
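As a rough illustration of what "globally configurable" could mean in code, here is a minimal sketch; the environment variable name and fallback are invented for illustration and are not an existing Docker or libnetwork option:

```go
// Hypothetical sketch only: DOCKER_DNS_MAX_CONCURRENT is an invented name,
// not a real Docker or libnetwork setting.
package resolver

import (
	"os"
	"strconv"
)

const defaultMaxConcurrent = 100 // today's hardcoded default

// maxConcurrentQueries returns the cap on outstanding upstream DNS queries,
// taking an override from the environment and falling back to the default
// when the override is unset or invalid.
func maxConcurrentQueries() int {
	if v := os.Getenv("DOCKER_DNS_MAX_CONCURRENT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultMaxConcurrent
}
```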
Hi guys, we have huge Docker hosts with many hundreds of containers.
It's the same for us. It would be great if we had the ability to configure such parameters of Docker hosts. Either way, I don't see a reason to hardcode such parameters.
@mouzfun
Limits are introduced to avoid DDoS or resource overutilization. It is normal to have such limits on DNS servers.
I hope the above use case explains at least one real case where the limit of 100 is exceeded.
I would be fine with its removal. I guess a test of removing it and trying outrageous amounts of requests might point us to the issues.
I have created a custom build, raising the max concurrent queries to 10000, and used it for the benchmarks below.
Installed dnsperf as described here: Then ran some experiments like this:
This was executed on AWS. Docker was configured to use a dnsmasq service running on the host.
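For anyone who wants to reproduce this kind of measurement without dnsperf, the following is a minimal Go sketch, not the setup described above; the hostname, worker count, query count, and timeout are placeholders. Run it inside a container attached to a user-defined network so the lookups go through Docker's embedded resolver (127.0.0.11) and count how many fail:

```go
// Minimal DNS load-generator sketch (not the dnsperf setup described above).
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		workers = 200            // lookups allowed in flight at once
		total   = 20000          // total lookups to issue
		name    = "some-service" // placeholder: a name resolvable on your network
	)

	var failures int64
	var wg sync.WaitGroup
	sem := make(chan struct{}, workers) // simple counting semaphore

	start := time.Now()
	for i := 0; i < total; i++ {
		sem <- struct{}{}
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()

			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			defer cancel()
			// Timeouts and SERVFAILs from the embedded resolver surface as errors here.
			if _, err := net.DefaultResolver.LookupHost(ctx, name); err != nil {
				atomic.AddInt64(&failures, 1)
			}
		}()
	}
	wg.Wait()

	elapsed := time.Since(start)
	fmt.Printf("%d lookups in %v (%.0f qps), %d failed\n",
		total, elapsed, float64(total)/elapsed.Seconds(), failures)
}
```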
Wow! So this is unblocked; has it been released to everyone?
@DmitryFrolovTri I am sending a pull request with the change. On our production hosts (with a custom build), our benchmarks indicate that we can run as many as 9k queries per second through the Docker DNS resolver.
Is there any progress on this issue?
@swift1911 PR merged.
@thiagoalves Awesome! So is there any plan for Docker to release a bugfix version containing this PR?
@swift1911 The fix is already merged to master, so it is going to be released soon. I don't know the exact time frame, but I would say it will take a few months or so.
The moby backport is already in progress (moby/moby#38031). As that one is merged, you can try the nightly build.
moby/moby#38031 was merged, so this should be resolved on master / nightly |
This addresses/alleviates moby/libnetwork#2214. The new proposed limit should remediate the issue for most users. Signed-off-by: Thiago Alves Silva <thiago.alves@aurea.com>
I'm reopening #2082 because the initial question was not answered during the discussion.
To reiterate:
The current limit (https://github.com/docker/libnetwork/blob/7e5ff9e9cb4b91cee895cdfa7a7786b3886c366f/resolver.go#L70) is not configurable, is quite easily reached with legitimate network code, and there are no easy workarounds.
The initial commit lacks any rationale for the change, and @sanimej (the author) later increased it beyond the initial 50.
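For context, the limit being discussed is essentially a counter of in-flight forwarded queries checked against a hardcoded constant. The following is an illustrative reconstruction of that pattern, not the actual libnetwork source; the names and structure are invented, see resolver.go for the real code:

```go
// Illustrative reconstruction of a hardcoded concurrent-query cap of the kind
// discussed in this issue. Names are invented; see libnetwork's resolver.go
// for the real implementation.
package resolver

import "sync"

const maxConcurrent = 100 // the hardcoded limit in question

type resolver struct {
	mu    sync.Mutex
	count int // queries currently forwarded upstream and not yet answered
}

// forwardStart reserves a slot for one upstream query. It returns false when
// maxConcurrent queries are already outstanding, in which case the incoming
// query is not forwarded.
func (r *resolver) forwardStart() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.count >= maxConcurrent {
		return false
	}
	r.count++
	return true
}

// forwardDone releases the slot once the upstream response (or a timeout)
// has been handled. Slow or timed-out upstream queries hold slots longer,
// which is why high upstream latency makes the limit easier to hit.
func (r *resolver) forwardDone() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.count--
}
```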