[DocDB] Update DNS cache in background #22930

Closed
1 task done
spolitov opened this issue Jun 20, 2024 · 0 comments
Labels
2.18 Backport Required 2.20 Backport Required 2024.1 Backport Required area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

spolitov commented Jun 20, 2024

Jira Link: DB-11847

Description

Issue

The DNS cache is controlled by the gflag dns_cache_expiration_ms (60000 ms by default).

When a new RPC call to a node addressed by hostname is started and the cached DNS record has expired, we start a new DNS resolution, and the RPC call (in the outbound queue) waits until that resolution completes. If DNS resolution is slow, it delays the execution of these RPC calls by the reactor threads in Yugabyte.

The reactor threads handle the RPCs, including the consensus RPCs that maintain the leadership status of the tablets. If an RPC is delayed by more than leader_lease_duration_ms, the leader of the tablet can step down. Some time later, the load balancer can move leadership back to the old node (the previous leader) to keep the number of leaders balanced. DNS resolution problems can therefore cause unexpected leadership changes and leader moves in the cluster, which negatively impact query latencies.

Workaround/Fix

To tolerate DNS resolution problems, it is better to perform the DNS resolution in the background while using the previously cached DNS record for the current RPC. This reduces the chance of leadership changes when DNS resolution is slow or failing.

Once the background thread gets the response to the DNS request, it updates the cached entry, which is then used by subsequent RPC calls.
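
The approach described above can be sketched roughly as follows. This is an illustration in Python, not the YugabyteDB C++ implementation: the class name `BackgroundRefreshDnsCache`, the `resolve` callback, and the injectable `clock` are all hypothetical, and only the 60000 ms default comes from the issue text.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BackgroundRefreshDnsCache:
    """Serve the last known address right away; refresh expired entries in the background."""

    def __init__(self, resolve, expiration_ms=60000, clock=time.monotonic):
        self._resolve = resolve                    # hypothetical blocking resolver: host -> address
        self._expiration_s = expiration_ms / 1000.0
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}                         # host -> (address, expires_at)
        self._refreshing = set()                   # hosts with a refresh already in flight
        self._pool = ThreadPoolExecutor(max_workers=1)

    def lookup(self, host):
        with self._lock:
            entry = self._entries.get(host)
            if entry is not None:
                address, expires_at = entry
                expired = self._clock() >= expires_at
                if expired and host not in self._refreshing:
                    # Schedule one background refresh; do not wait for it.
                    self._refreshing.add(host)
                    self._pool.submit(self._refresh, host)
                return address                     # possibly stale, but the caller never blocks
        # Nothing cached yet for this host, so the very first lookup has to wait.
        return self._refresh(host)

    def _refresh(self, host):
        try:
            address = self._resolve(host)
            with self._lock:
                self._entries[host] = (address, self._clock() + self._expiration_s)
            return address
        finally:
            with self._lock:
                self._refreshing.discard(host)

    def close(self):
        # Wait for any in-flight refresh, then stop the worker thread.
        self._pool.shutdown(wait=True)
```

In this sketch only the first lookup of a host blocks, because there is no previous address to fall back on. Every later lookup returns immediately, and a failed background refresh simply leaves the old entry in place.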

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@spolitov spolitov added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jun 20, 2024
@spolitov spolitov self-assigned this Jun 20, 2024
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Jun 20, 2024
spolitov added a commit that referenced this issue Jun 21, 2024
Summary:
We have DNS cache controlled by gflag dns_cache_expiration_ms (60000 by default).
When a new RPC call to a node addressed by hostname is started and the existing record has expired, we start a new DNS resolution,
and the RPC call waits until this DNS resolution completes.

But we don't actually have to wait until it completes.
The address from the previous resolution can be used for this RPC call, while the cache update happens in the background.
Once the response to the new DNS request is received, we update the cached entry and use the new address for all new RPC calls.

This diff implements such behaviour.

Also added flag dns_cache_failure_expiration_ms (2s by default) to control the time before DNS resolution retry in case of failure.
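
One way to picture the role of dns_cache_failure_expiration_ms is a small policy object that schedules the next resolution attempt. This is a sketch under assumptions: the summary only says the flag controls the time before a retry after a failure, so the class `ResolutionRetryPolicy` and its methods are hypothetical and do not mirror the actual code.

```python
import time


class ResolutionRetryPolicy:
    """Decide when the next resolution attempt is due after a success or a failure."""

    def __init__(self, expiration_ms=60000, failure_expiration_ms=2000, clock=time.monotonic):
        self._expiration_s = expiration_ms / 1000.0                  # after a successful resolution
        self._failure_expiration_s = failure_expiration_ms / 1000.0  # after a failed resolution
        self._clock = clock
        self._next_attempt_at = None                                 # None: never attempted

    def should_resolve(self):
        # Resolve on first use, then only once the current deadline has passed.
        return self._next_attempt_at is None or self._clock() >= self._next_attempt_at

    def record_success(self):
        self._next_attempt_at = self._clock() + self._expiration_s

    def record_failure(self):
        # Retry sooner than the normal expiration, but not on every RPC call.
        self._next_attempt_at = self._clock() + self._failure_expiration_s
```

With the defaults quoted above, a failed resolution would be retried after about 2 seconds instead of after the full 60-second expiration.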

Added metric dns_resolve_latency that reflects the time spent on DNS resolution.
Jira: DB-11847, DB-11222
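
To illustrate what a metric like dns_resolve_latency measures, a resolver can be wrapped so that every call is timed. This is only a sketch: the summary does not say how the metric is recorded or in which units, so the wrapper `ResolveLatencyRecorder`, its millisecond samples, and `max_ms` are assumptions.

```python
import time


class ResolveLatencyRecorder:
    """Wrap a resolver callback and record how long each resolution takes."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve      # hypothetical blocking resolver: host -> address
        self._clock = clock
        self.samples_ms = []         # one latency sample per resolution attempt

    def __call__(self, host):
        start = self._clock()
        try:
            return self._resolve(host)
        finally:
            # Record the latency for failed resolutions as well as successful ones.
            self.samples_ms.append((self._clock() - start) * 1000.0)

    def max_ms(self):
        return max(self.samples_ms, default=0.0)
```

Timing failed attempts as well matters here, because slow or failing resolutions are exactly the case this issue is about.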

Test Plan: Jenkins

Reviewers: qhu, rthallam, slingam

Reviewed By: qhu, slingam

Subscribers: slingam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35993
karthik-ramanathan-3006 pushed a commit to karthik-ramanathan-3006/yugabyte-db that referenced this issue Jun 24, 2024
Differential Revision: https://phorge.dev.yugabyte.com/D35993
spolitov added a commit that referenced this issue Jul 4, 2024

Original commit: bf0fb4b/D35993

Test Plan: Jenkins

Reviewers: qhu, rthallam, slingam

Reviewed By: qhu

Subscribers: ybase, slingam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36156
spolitov added a commit that referenced this issue Jul 14, 2024

Original commit: bf0fb4b/D35993

Test Plan: Jenkins

Reviewers: qhu, rthallam, slingam

Reviewed By: qhu

Subscribers: slingam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36404
spolitov added a commit that referenced this issue Jul 14, 2024

Original commit: bf0fb4b/D35993

Test Plan: Jenkins

Reviewers: qhu, rthallam, slingam

Reviewed By: qhu

Subscribers: ybase, slingam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36403
spolitov added a commit that referenced this issue Jul 14, 2024

Original commit: bf0fb4b/D35993

Test Plan: Jenkins

Reviewers: qhu, rthallam, slingam

Reviewed By: qhu

Subscribers: slingam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36402