[DocDB] Update DNS cache in background #22930

spolitov · 2024-06-20T05:35:29Z

Description

Issue

We have a DNS cache controlled by the gflag dns_cache_expiration_ms (60000 by default).

When a new RPC call to a node addressed by hostname is started and the existing DNS record has expired, we start a new DNS resolution, and the RPC call (in the outbound queue) waits until this resolution completes. If DNS resolution is slow, this delays the RPC calls from being executed by the reactor threads in Yugabyte.

The reactor threads are responsible for handling the RPCs, including the consensus RPCs that maintain the leadership status of the tablets. If an RPC is delayed by more than leader_lease_duration_ms, the leader of the tablet can step down. Some time later, the load balancer can move the leader back to the old node (the previous leader) to keep the number of leaders balanced. So when DNS resolution runs into issues, it can cause unexpected leadership changes and leader moves in the cluster, which negatively impact query latencies.
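To make the blocking behaviour concrete, here is a minimal sketch of a cache that re-resolves expired entries inline. It is an illustrative model only, not the YugabyteDB implementation: the class `BlockingDnsCache`, its `lookup` method, and the injected `resolve` and `now_ms` callables are hypothetical names. Only the 60000 ms default comes from the description above.

```python
import time
from typing import Callable, Dict, Tuple


class BlockingDnsCache:
    """Hypothetical model: an expired entry is re-resolved on the caller's thread."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        expiration_ms: int = 60000,  # mirrors dns_cache_expiration_ms
        now_ms: Callable[[], float] = lambda: time.monotonic() * 1000,
    ) -> None:
        self._resolve = resolve
        self._expiration_ms = expiration_ms
        self._now_ms = now_ms
        # host -> (address, expires_at_ms)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, host: str) -> str:
        entry = self._entries.get(host)
        if entry is not None and self._now_ms() < entry[1]:
            return entry[0]  # fresh entry: no resolver call
        # Missing or expired: the caller waits for the resolver here,
        # so a slow resolver directly delays the RPC that asked.
        address = self._resolve(host)
        self._entries[host] = (address, self._now_ms() + self._expiration_ms)
        return address
```

In this model, every lookup that lands just after expiry pays the full resolver latency, which is the delay the issue is about.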

Workaround/Fix

To tolerate DNS resolution problems, it is better to perform the DNS resolution in the background while using the previously cached DNS record for the current RPC. This reduces the chances of leadership changes in the event of DNS resolution issues.

Once the background thread gets the response to the DNS request, it can update the cached entry, which is then used by subsequent RPC calls.
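The idea can be sketched as a cache that serves the stale record and refreshes it in the background. This is a rough illustration under assumptions, not the actual change: `BackgroundRefreshDnsCache`, `lookup`, `wait_for_refresh`, and the injected `resolve` and `now_ms` callables are hypothetical, and a single-worker thread pool stands in for whatever background mechanism the real code uses.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Callable, Dict


class BackgroundRefreshDnsCache:
    """Hypothetical model: expired entries are served stale while a refresh runs."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        expiration_ms: int = 60000,  # mirrors dns_cache_expiration_ms
        now_ms: Callable[[], float] = lambda: time.monotonic() * 1000,
    ) -> None:
        self._resolve = resolve
        self._expiration_ms = expiration_ms
        self._now_ms = now_ms
        self._lock = threading.Lock()
        self._addresses: Dict[str, str] = {}      # host -> last known address
        self._expires_at: Dict[str, float] = {}   # host -> expiry time in ms
        self._refreshing: Dict[str, Future] = {}  # host -> refresh in flight
        self._pool = ThreadPoolExecutor(max_workers=1)

    def lookup(self, host: str) -> str:
        with self._lock:
            address = self._addresses.get(host)
            if address is not None:
                expired = self._now_ms() >= self._expires_at[host]
                if expired and host not in self._refreshing:
                    # Start at most one background refresh per host.
                    self._refreshing[host] = self._pool.submit(self._refresh, host)
                return address  # possibly stale, but returned without waiting
        # No previous record exists, so the first lookup resolves inline.
        # (Concurrent first lookups may each resolve; acceptable for a sketch.)
        return self._refresh(host)

    def _refresh(self, host: str) -> str:
        try:
            address = self._resolve(host)
            with self._lock:
                self._addresses[host] = address
                self._expires_at[host] = self._now_ms() + self._expiration_ms
            return address
        finally:
            with self._lock:
                self._refreshing.pop(host, None)

    def wait_for_refresh(self, host: str) -> None:
        """Block until any in-flight refresh for host has finished (test helper)."""
        with self._lock:
            future = self._refreshing.get(host)
        if future is not None:
            wait([future])
```

The trade-off in this sketch is that a caller may use an address that is older than the expiration interval until the refresh lands. If the refresh fails, the old address stays in place and the next lookup starts another attempt.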

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

Summary: We have a DNS cache controlled by the gflag dns_cache_expiration_ms (60000 by default). When a new RPC call to a node addressed by hostname is started and the existing record has expired, we start a new DNS resolution, and the RPC call waits until this resolution completes. But we don't actually have to wait. The address from the previous resolution can be used for this RPC call while the cache update happens in the background. Once the response to the new DNS request is received, we update the cached entry and use the new address for all new RPC calls. This diff implements that behaviour. It also adds the flag dns_cache_failure_expiration_ms (2s by default) to control the time before a DNS resolution is retried after a failure, and the metric dns_resolve_latency, which reflects the time spent on DNS resolution.

Jira: DB-11847, DB-11222 Test Plan: Jenkins Reviewers: qhu, rthallam, slingam Reviewed By: qhu, slingam Subscribers: slingam, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35993
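A possible reading of the failure retry interval and the latency metric, as a minimal single-threaded sketch. Everything here is an assumption for illustration: `RetryPolicyDnsCache`, `refresh_if_due`, and the `resolve_latencies_ms` list (a stand-in for the dns_resolve_latency metric) are hypothetical names, and the real code may schedule retries differently. Only the two defaults, 60000 ms and 2000 ms, come from the summary above.

```python
import time
from typing import Callable, Dict, List, Optional


class RetryPolicyDnsCache:
    """Hypothetical model: successes and failures get different retry delays."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        expiration_ms: int = 60000,         # mirrors dns_cache_expiration_ms
        failure_expiration_ms: int = 2000,  # mirrors dns_cache_failure_expiration_ms
        now_ms: Callable[[], float] = lambda: time.monotonic() * 1000,
    ) -> None:
        self._resolve = resolve
        self._expiration_ms = expiration_ms
        self._failure_expiration_ms = failure_expiration_ms
        self._now_ms = now_ms
        self._address: Dict[str, str] = {}
        self._next_attempt_ms: Dict[str, float] = {}
        # Stand-in for a latency metric: one sample per resolution attempt.
        self.resolve_latencies_ms: List[float] = []

    def refresh_if_due(self, host: str) -> Optional[str]:
        """Resolve host if its retry time has passed; return the last good address."""
        if self._now_ms() < self._next_attempt_ms.get(host, float("-inf")):
            return self._address.get(host)
        started = self._now_ms()
        try:
            self._address[host] = self._resolve(host)
            delay_ms = self._expiration_ms
        except OSError:  # socket.gaierror is a subclass of OSError
            # Keep any previous address and retry sooner than a normal expiry.
            delay_ms = self._failure_expiration_ms
        finished = self._now_ms()
        self.resolve_latencies_ms.append(finished - started)
        self._next_attempt_ms[host] = finished + delay_ms
        return self._address.get(host)
```

With these defaults, a failed resolution is retried after about 2 seconds instead of 60, and each attempt, successful or not, contributes one latency sample.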

Summary: Identical to the summary of the original commit above. Jira: DB-11847, DB-11222 Original commit: bf0fb4b/D35993 Test Plan: Jenkins Reviewers: qhu, rthallam, slingam Reviewed By: qhu Subscribers: ybase, slingam Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36156

Summary: Identical to the summary of the original commit above. Jira: DB-11847, DB-11222 Original commit: bf0fb4b/D35993 Test Plan: Jenkins Reviewers: qhu, rthallam, slingam Reviewed By: qhu Subscribers: slingam, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36404

Summary: Identical to the summary of the original commit above. Jira: DB-11847, DB-11222 Original commit: bf0fb4b/D35993 Test Plan: Jenkins Reviewers: qhu, rthallam, slingam Reviewed By: qhu Subscribers: ybase, slingam Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36403

Summary: Identical to the summary of the original commit above. Jira: DB-11847, DB-11222 Original commit: bf0fb4b/D35993 Test Plan: Jenkins Reviewers: qhu, rthallam, slingam Reviewed By: qhu Subscribers: slingam, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36402

spolitov added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jun 20, 2024

spolitov self-assigned this Jun 20, 2024

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Jun 20, 2024

rthallamko3 added 2024.1 Backport Required 2.20 Backport Required 2.18 Backport Required and removed status/awaiting-triage Issue awaiting triage labels Jun 21, 2024

rthallamko3 closed this as completed Jul 17, 2024

spolitov commented Jun 20, 2024 •

edited by rthallamko3