Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle DNS SERVFAIL differently than NXDOMAIN #2776

Merged
merged 14 commits into from
Apr 11, 2024

Conversation

tkountis
Copy link
Contributor

@tkountis tkountis commented Dec 8, 2023

Motivation:
An UnknownHostException can have multiple root causes, each with a different application logic.
By far the most prominent is due to an NXDOMAIN from the nameserver, meaning that the domain in the query doesn't not exist. Other reasons could be a SERVFAIL or timeouts during resolutions. Due to insufficient information in the error itself, the dns-client has no obvious way of distinctly handling the various causes.

Modification:
Netty will expose that information as an init-cause. ST can leverage the enrichment to handle the different scenarios.

Results:
Better behavior.
As part of this effort a behavior change is expected. NXDOMAIN errors will now cause a DNS state invalidation. To retain previous behavior use the property io.servicetalk.dns.discovery.nxdomain.invalidation set to false.

Depends on netty/netty#13721
Depends on netty/netty#13850

@tkountis tkountis self-assigned this Dec 8, 2023
Copy link
Member

@Scottmitch Scottmitch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few cleanup related PRs but in general makes sense!

@tkountis tkountis marked this pull request as ready for review December 12, 2023 02:09
Copy link
Contributor

@bryce-anderson bryce-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert here but it seems good to me.

Copy link
Contributor

@bryce-anderson bryce-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'll give it the 👍 once we can get the required netty changes landed and this PR building.

@@ -1,5 +1,5 @@
/*
* Copyright © 2018-2020 Apple Inc. and the ServiceTalk project authors
* Copyright © 2018-2020, 2023 Apple Inc. and the ServiceTalk project authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2024?

Copy link
Contributor

@bryce-anderson bryce-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me. The apparently failing test in #2882 is curious.

Comment on lines 316 to 317
ensurePositive(srvHostNameRepeatInitialDelay, "srvHostNameRepeatInitialDelay");
ensurePositive(srvHostNameRepeatJitter, "srvHostNameRepeatJitter");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it seems reasonable to have a zero jitter. It's common for testing purposes to make things predicable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that RepeatStreategies.repeatWithConstantBackoffDeltaJitter uses ThreadLocalRandom.nextLong(long, long) that does not allow the same values for upper/lower bounds. So, jitter should be at least 1ns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a bug in repeatWithConstantBackoffDeltaJitter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put up a PR to address it: #2888

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR was merged and we should now support zero length jitters.

Comment on lines 316 to 317
ensurePositive(srvHostNameRepeatInitialDelay, "srvHostNameRepeatInitialDelay");
ensurePositive(srvHostNameRepeatJitter, "srvHostNameRepeatJitter");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that RepeatStreategies.repeatWithConstantBackoffDeltaJitter uses ThreadLocalRandom.nextLong(long, long) that does not allow the same values for upper/lower bounds. So, jitter should be at least 1ns

@tkountis tkountis changed the title Handle DNS SERVFAIL differently that NXDOMAIN Handle DNS SERVFAIL differently than NXDOMAIN Mar 27, 2024
@tkountis tkountis requested a review from idelpivnitskiy March 27, 2024 17:55
@@ -62,6 +62,9 @@ public final class DefaultDnsServiceDiscovererBuilder implements DnsServiceDisco
*/
@Deprecated // FIXME: 0.43 - consider removing this system property
private static final String SKIP_BINDING_PROPERTY = "io.servicetalk.dns.discovery.netty.skipBinding";
private static final String NX_DOMAIN_INVALIDATES_PROPERTY = "io.servicetalk.dns.discovery.nxdomain.invalidation";
@SuppressWarnings("PMD.MutableStaticState")
static boolean NX_DOMAIN_INVALIDATES = parseProperty(NX_DOMAIN_INVALIDATES_PROPERTY, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that worries me is that if DNS server had a blip or if someone removed entries before adding new entries, DNS client could receive NXDOMAIN, then invalidate all entries, and then end up in an awkward state without any addresses until another resolution happens. Could you please elaborate a bit more on why it is important to invalidate addresses in this case? Can we just wait for another result set (keep using pre-existing IPs) while we retry UnknownHostException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NXDOMAIN is not / should not, be a blip. Its a valid answer that declares that DNS has no knowledge of that domain. The resolver makes every effort to resolve from the nameservers before it gives up. DNS misbehavior (such as SERVFAIL or timeouts) will still be handled more gracefully. Every other case (e.g., removed entries before adding entries) are flows that are beyond the scope of a resolver to heuristically work around, we should propagate the correct state at all times, which makes it easier to debug and reason about.

Comment on lines +978 to +980
// string matching is done on purpose to avoid the hard Netty dependency
(t.getCause() != null && t.getCause().getClass().getName()
.equals("io.netty.resolver.dns.DnsErrorCauseException")) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting work around.

@tkountis tkountis merged commit e605c87 into apple:main Apr 11, 2024
15 checks passed
@tkountis tkountis deleted the dns-debugging branch April 11, 2024 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants