
Improve DNS resolver #260

Merged
mikz merged 11 commits into master from dns-resolver on Feb 17, 2017

Conversation

@mikz (Contributor) commented Feb 13, 2017

Improve performance of our DNS resolver.

* design document
* parallelization
* query de-duplication
* stale cache on error
* management api

In the end there is going to be a local caching DNS resolver in the same container. It is much easier to leave the hard work to another component.

That means that in a native deployment the user is responsible for providing a local caching DNS server for optimal performance. /cc @3scale/support

@3scale/productization after merging we are going to require s2i-openresty with dnsmasq (https://github.com/3scale/s2i-openresty/releases/tag/1.11.2.2-2).

@octobot octobot added the T-obux label Feb 13, 2017
@mikz mikz force-pushed the dns-resolver branch 4 times, most recently from a17bf09 to 8d3f784 on February 15, 2017 09:02
* When cache is stale, use stale record and trigger de-duplicated query in the background.
* If record can't be updated, use stale cache for unlimited time, but try to recover.

This has one drawback: in a low-traffic scenario, the first requests after an idle period can hit the wrong server before the cache is updated.
mikz (Contributor, Author):

I hope to solve that in the future in the load balancer: if a server is unavailable, remove the cached record and try the query again. The load balancer code should be able to mark hosts as unhealthy and do retries (it does not do that now). I'm not describing it here as it is really a future concern for now.


Using OpenResty's `lua-resty-dns` we can connect to any DNS server and implement our own resolver.

The DNS resolver caches each record individually and fetches it from the cache recursively. That allows us to have layers of cache and fill in just the missing layers. Example: CNAME (TTL 3600) -> CNAME (TTL 360) -> A (TTL 60) will query just the A record when it expires.
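
A minimal sketch of this layered caching, assuming `lua-resty-dns` for the queries and `lua-resty-lrucache` as the per-record cache (the cache size, recursion limit, and nameserver are illustrative; this is not APIcast's actual module):

```lua
local resolver = require "resty.dns.resolver"
local lrucache = require "resty.lrucache"

local cache = assert(lrucache.new(1000))  -- illustrative size

local function resolve(r, name, depth)
  depth = depth or 1
  if depth > 8 then return nil, "CNAME chain too long" end

  local record = cache:get(name)
  if not record then
    local answers, err = r:query(name, { qtype = r.TYPE_A })
    if not answers then return nil, err end
    if answers.errcode then return nil, answers.errstr end

    for _, answer in ipairs(answers) do
      -- cache every layer of the chain under its own name and its own TTL
      cache:set(answer.name, answer, answer.ttl)
    end
    record = cache:get(name)
    if not record then return nil, "name not present in answers" end
  end

  if record.cname then
    -- follow the chain through the cache; only expired layers are re-queried
    return resolve(r, record.cname, depth + 1)
  end

  return record  -- an A record: record.address holds the IP
end

local r = assert(resolver:new{ nameservers = { "127.0.0.1" } })
local record, err = resolve(r, "www.example.com")
```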
mikz (Contributor, Author):

@unleashed and @jmprusi suggested using a local caching DNS resolver instead.

While that is a good idea, it might be harder to package another dependency.

@unleashed (Contributor) left a comment:

Comments - relaying queries to a specific DNS server has some (small) drawbacks, but I think it's the way to go, with a cache and stale-information layer on top.


* caching for duration of TTL
* resolve A and CNAME records
* using stale cache when TTL expires and cache can't be updated
Contributor:

This should probably be expanded - the case in which a stale cached host is used forever because name resolution never comes back should produce noisy alerts to administrators.

mikz (Contributor, Author):

Agreed, this should produce alerts. But IMO it should not produce errors when the connection works just fine. What we saw in the Dyn outage was that DNS records weren't updated, and everything would have worked if you'd just kept connecting to the stale records.

Eventually, when those records are marked as unhealthy by the load balancer, it should return 500s.

* resolve A and CNAME records
* using stale cache when TTL expires and cache can't be updated
* many parallel requests to one worker result in one query
* connect to multiple DNS servers in parallel
Contributor:

Is this a real requirement? Why? I think supporting multiple DNS servers should be OK, but having to connect to all or a subset of them in parallel is probably wasted effort most of the time, unless the queries are for different hostnames.

Also, if multiple hosts need resolving at the same time, IIRC you can coalesce all the queries into a single DNS request, but that makes the response as slow as the slowest name resolution (so if used, this needs care).

mikz (Contributor, Author):

OpenShift provides you with two nameservers: one for internal and another for external hostnames.
The internal one does not resolve external requests and vice versa.
It would be possible, but quite hard, to detect whether a query is supposed to go to one or the other.

IMO the easiest way is to just do two queries. But if we delegate this work to a local caching DNS server it is not that critical (and there would be just one server anyway).


### Low traffic

A request is made to APIcast and the DNS cache is updated. Then there is no request for the duration of that record's TTL, and another request is made after the record expires. Should APIcast try to use the stale cache or hold the request? For how long?
Contributor:

If the cache is updated, no DNS request has to be performed. I assume the set of hosts that need resolving is going to be generally fixed at some low number. Having a timer re-resolve out-of-band before the TTL expires should be good. If some hosts are seldom used, keeping an indication of how often they need resolving would save on extra requests: perhaps wait a multiple of their TTL if the number of hosts to resolve is large, and return the stale info when an unusual request arrives (triggering an early update of the DNS info, on the assumption that the request that just happened is likely to be followed by another one in the near future).

mikz (Contributor, Author):

I was thinking about that too. But here are the counterarguments I could imagine:

* the number of scheduled timers in OpenResty is limited to 256 or some similarly low number
* this has to work on hosted APIcast, with theoretically hundreds of services running on a multi-tenant instance
* hosts can appear and disappear pretty much all the time; the cache is LRU

I think preloading of any kind will be expensive when we get to the scale of hundreds of services in one instance.

I was thinking that maybe we could try to update the cache within some fraction of the TTL when there is a request. For example, with a TTL of 60, when a request comes in and there are 6 seconds left, issue one timer to refresh the cache (see the sketch below).

What would not work is to schedule a refresh timer for TTL-1 seconds when the record is first fetched.

But that is the same as what you are explaining, right?
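
A sketch of that fraction-of-the-TTL refresh, assuming each cached record carries its original `ttl` and an `expires_at` timestamp (both assumptions of this sketch), plus a hypothetical `resolve_and_cache(name)` helper that creates its own resolver inside the timer (a request-bound cosocket cannot be reused there):

```lua
local REFRESH_FRACTION = 0.1  -- refresh when <10% of the TTL remains (illustrative)
local refreshing = {}         -- per-worker guard against duplicate refresh timers

local function maybe_refresh(name, record)
  local remaining = record.expires_at - ngx.now()
  if remaining > record.ttl * REFRESH_FRACTION or refreshing[name] then
    return
  end

  refreshing[name] = true
  -- one short-lived, on-demand timer per refresh keeps us well under the timer cap
  local ok, err = ngx.timer.at(0, function(premature)
    if not premature then
      resolve_and_cache(name)  -- hypothetical helper performing the real query
    end
    refreshing[name] = nil
  end)
  if not ok then
    refreshing[name] = nil
    ngx.log(ngx.WARN, "could not schedule DNS refresh for ", name, ": ", err)
  end
end
```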

Contributor:

Yes, the idea is to hide resolution latency as much as possible. Also, when the out-of-band refresh cannot finish before the TTL expires, returning stale data (as opposed to waiting on the refresh to finish) is probably good enough, and is consistent with the behaviour that would occur if the refresh never finished or didn't finish in a reasonable time.


1000 requests are made at the same time. The first one to arrive starts resolving DNS and the others should wait until it is completed. After that query completes, the other requests use its result.

If the first request times out, the next one should try another DNS query and the remaining requests should wait for it to complete.
Contributor:

So:

* The first resolving request for a given hostname "grabs" the DNS resolver and performs the query.
* Further requests for the same hostname wait on the response from the ongoing DNS request.
* A limit on the number of concurrent DNS requests should be set; requests for new hostnames wait for resources to be freed (or a timeout) or fail.

I would add the possibility of a warm-up, again assuming not many different hostnames will be needed, by querying the list of expected hostnames at boot time.
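
A minimal sketch of such a warm-up, assuming the host list is known at boot and reusing the hypothetical `resolve_and_cache` helper from above (cosockets are unavailable in `init_worker`, hence the zero-delay timer):

```lua
init_worker_by_lua_block {
    local hosts = { "backend.example.com", "system-app" }  -- illustrative list

    ngx.timer.at(0, function(premature)
        if premature then return end
        for _, host in ipairs(hosts) do
            resolve_and_cache(host)  -- hypothetical helper; pre-fills the DNS cache
        end
    end)
}
```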

mikz (Contributor, Author):

Preloading would make sense with the configuration loaded on boot as defined in #230.
It would not be possible in the cloud lazy-load mode (or only once the configuration is loaded).


### High traffic (with cache)

With constant traffic of 1000 requests per second, when the DNS cache expires, should we use the stale cache until it is refreshed? Or should requests wait for it to be updated?
Contributor:

As pointed out above: try to keep the entries refreshed before they expire. The least used entries should gradually stop being updated before the TTL expires, to save on resources (how this is implemented is something to be decided later). The stale cache is a last resort to keep the API working as a best-effort mechanism, but it should produce loud alerts (and/or be configurable on/off).

mikz (Contributor, Author):

That would be nice, but if we use dnsmasq that is not how it is going to work: dnsmasq expires the record and only then queries again.

While I agree it would be nice, it might be too much work to implement properly. Using the stale cache for the duration of the refresh (a few ms, up to a second?) shouldn't hurt anyone and is much easier to implement than an ahead-of-time cache refresh. And it is almost impossible when we use dnsmasq.


Querying multiple nameservers in parallel is accomplished by using "light threads". The first nameserver that responds is used.
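
A sketch of that first-answer-wins strategy with `ngx.thread` (each nameserver gets its own `resty.dns.resolver` instance; the timeout value is illustrative):

```lua
local resolver = require "resty.dns.resolver"

local function query_one(nameserver, name)
  local r, err = resolver:new{ nameservers = { nameserver }, timeout = 2000 }
  if not r then return nil, err end
  return r:query(name, { qtype = r.TYPE_A })
end

local function query_any(nameservers, name)
  local threads = {}
  for i, nameserver in ipairs(nameservers) do
    -- one light thread per nameserver, all querying concurrently
    threads[i] = ngx.thread.spawn(query_one, nameserver, name)
  end

  -- ngx.thread.wait returns as soon as the first thread terminates;
  -- note that the first to finish may have failed - a fuller version
  -- would keep waiting on the remaining threads in that case
  local ok, answers, err = ngx.thread.wait(unpack(threads))

  for _, thread in ipairs(threads) do
    ngx.thread.kill(thread)  -- drop the queries still in flight
  end

  if not ok then return nil, answers end
  return answers, err
end
```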

De-duplicating queries is achieved by using ngx.semaphore to wait for already-executing queries. The waiting algorithm could be much improved, as right now it waits for any query being executed, not the specific one. The stale cache is returned when the timeout is reached and the query can't be completed in time. Queries are performed in the background timer loop.
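
A sketch of the de-duplication idea with `ngx.semaphore`, here as a per-name variant rather than the PR's wait-on-any-query version (the 5-second timeout and the `do_query` callback are illustrative; error handling around `do_query` is omitted):

```lua
local semaphore = require "ngx.semaphore"

local inflight = {}  -- per-worker: name -> { sema = ..., waiters = n }

local function dedup_query(name, do_query)
  local pending = inflight[name]
  if pending then
    -- a query for this name is already running: wait for its result
    pending.waiters = pending.waiters + 1
    local ok, err = pending.sema:wait(5)
    if not ok then
      return nil, err  -- on timeout the caller may fall back to the stale cache
    end
    return pending.answers
  end

  pending = { sema = semaphore.new(), waiters = 0 }
  inflight[name] = pending

  pending.answers = do_query(name)  -- the single real DNS query

  inflight[name] = nil
  if pending.waiters > 0 then
    pending.sema:post(pending.waiters)  -- wake everyone waiting on this name
  end
  return pending.answers
end
```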
Contributor:

This is very similar to what I explained above, except that entries are continually refreshed unless they cross an "unused" threshold.


De-duplicating queries is achieved by using ngx.semaphore to wait for already-executing queries. The waiting algorithm could be much improved, as right now it waits for any query being executed, not the specific one. The stale cache is returned when the timeout is reached and the query can't be completed in time. Queries are performed in the background timer loop.

The search scope defined in `/etc/resolv.conf` is evaluated one by one until a matching record is returned. This could be optimized by executing all the queries in parallel.
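
A sketch of that sequential walk, assuming `search` is the parsed search scope from `/etc/resolv.conf` (ndots handling and parallelization are left out):

```lua
local function resolve_with_search(r, name, search)
  -- a dotted name might already be fully qualified: try it as-is first
  if name:find("%.") then
    local answers = r:query(name, { qtype = r.TYPE_A })
    if answers and not answers.errcode then return answers end
  end

  for _, domain in ipairs(search) do
    local answers = r:query(name .. "." .. domain, { qtype = r.TYPE_A })
    if answers and not answers.errcode then
      return answers  -- first scope that returns a record wins
    end
  end

  return nil, "no search scope matched " .. name
end
```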
Contributor:

I would try to avoid having to do this at all. If the code that requests this is all local, using fully-qualified names could be set as a rule, avoiding this altogether - note that "random_service" can actually be a fully-qualified name for the default search scope of "." (not sure whether this applies to the way OpenShift does name resolving).

mikz (Contributor, Author):

I'm not sure we can convince all our customers to use fully-qualified names everywhere. All the utilities work just fine with relative names. I was struggling with this myself and spent some time debugging why I could curl system-app:3000 but not connect to it from APIcast.

In theory we could try the following heuristics to determine whether a name is local (sketched after this list):

* no `.` - internal
* some `.` but matches the search scope - internal
* some `.` but does not match the search scope - external

The issue is that even if we can detect which hostnames are internal/external, we don't know which name server is going to resolve what. So we would have to implement logic to remember which search scope belongs to which name server.

That just sounds too hard. Even dnsmasq connects to all name servers in parallel and uses the first response.
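
Those heuristics as a sketch, again assuming `search` holds the resolv.conf search scope:

```lua
local function is_internal(name, search)
  if not name:find("%.") then
    return true  -- no dot: internal
  end

  for _, domain in ipairs(search) do
    local suffix = "." .. domain
    if name:sub(-#suffix) == suffix then
      return true  -- dotted, but matches a search scope: internal
    end
  end

  return false  -- dotted and unmatched: external
end
```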


The following rules apply to handle the scenarios described above:

* When cache is missing, all incoming requests de-duplicate queries and wait for the first one to complete. If it does not complete in time, the next one tries again, and so on.
@unleashed (Contributor) commented Feb 15, 2017:

I would change this to make all requests pending on a name being resolved wait on the result of the query. Requests for name resolution should seldom trigger a real DNS request right away. We do something akin to this in XC, and the strategies are:

* At boot time, do a cache warm-up of the expected host names' addresses.
* Keep an out-of-band, continuous task going through the most used hostnames and resolving them before their TTL expires.
* When a stale entry is found, return the stale addresses and schedule an immediate resolution of the hostname, on the assumption that the entry is becoming hotter and further requests may follow (which also means it would be auto-resolved before the TTL expires for a few cycles, until deemed seldom used again).

mikz (Contributor, Author):

Using dnsmasq, we can use it as the synchronisation mechanism: it de-duplicates queries.

So if there are 1000 parallel queries, we can just hit dnsmasq and let it resolve them.

As described in #260 (comment), boot cache warm-up would make sense in some configurations and would be a good addition.

Having the continuous out-of-band resolver seems less needed, as it will not (or will be hard to) scale up to hundreds of services, as described in #260 (comment).

Returning the stale entry and triggering a refresh would be a reasonable option. But as described in #260 (comment), dnsmasq would not allow us to refresh the cache until it has expired. I'd try to avoid scheduling work into the future because there is a hard cap on the number of timers.
I guess in the context of a sparsely used API we can just not preload the DNS and let the user see the query time.
Then we should use the stale cache only when there is a query running for that record.

The following rules apply to handle the scenarios described above:

* When cache is missing, all incoming requests de-duplicate queries and wait for the first one to complete. If it does not complete in time, the next one tries again, and so on.
* When cache is stale, use stale record and trigger de-duplicated query in the background.
Contributor:

Ah, I think we are agreeing - this matches what I just wrote above.


* When cache is missing, all incoming requests de-duplicate queries and wait for the first one to complete. If it does not complete in time, the next one tries again, and so on.
* When cache is stale, use stale record and trigger de-duplicated query in the background.
* If record can't be updated, use stale cache for unlimited time, but try to recover.
Contributor:

Add to this: sound the alarm and transition to red alert, make noise, and eventually drop the stale cache and start returning 500s. Admins should know ASAP, and if it is not taken care of, things should start failing hard (instead of, for example, failing partially, because some host not being available is not a fatal error but a subtle one). I guess there are a lot of different possible behaviours; perhaps configuration, or overriding of the default behaviour, is desirable.

mikz (Contributor, Author):

Yes. As described in #260 (comment), I think the stale cache is not the issue here, but rather marking hosts down in the load balancer.
That would eventually start raising 500s.

@mikz mikz force-pushed the dns-resolver branch 5 times, most recently from bab625f to 4f091d3 on February 16, 2017 15:23
@mikz mikz changed the title [wip] Improve DNS resolver Improve DNS resolver Feb 17, 2017
@mikz mikz merged commit b04ad80 into master Feb 17, 2017
@mikz mikz deleted the dns-resolver branch February 17, 2017 09:38
mikz added a commit that referenced this pull request Feb 17, 2017
* caching for duration of TTL
* resolve A and CNAME records
* using stale cache when TTL expires and cache can't be updated
* many parallel requests to one worker result in one query


Is this sentence related to this one?

"When cache is missing, all incoming requests de-duplicate queries and wait for the first one to complete."

mikz (Contributor, Author):

Yep. But in the end we delegate this work to the local DNS cache server.
