[DocDB] Update DNS cache in background #22930
Labels
2.18 Backport Required
2.20 Backport Required
2024.1 Backport Required
area/docdb: YugabyteDB core features
kind/enhancement: This is an enhancement of an existing feature
priority/medium: Medium priority issue
Comments
spolitov added the area/docdb and status/awaiting-triage labels on Jun 20, 2024
yugabyte-ci added the kind/enhancement and priority/medium labels on Jun 20, 2024
rthallamko3 added the 2024.1 Backport Required, 2.20 Backport Required, and 2.18 Backport Required labels and removed the status/awaiting-triage label on Jun 21, 2024
spolitov added a commit that referenced this issue on Jun 21, 2024
Summary: The DNS cache is controlled by the gflag dns_cache_expiration_ms (60000 by default). When a new RPC call to a node addressed by hostname starts and the existing record has expired, we start a new DNS resolution, and the RPC call waits until that resolution completes.
The call does not actually have to wait. The address from the previous resolution can be used for this RPC call while the cache is updated in the background. Once the response to the new DNS request is received, we update the cached entry and use the new address for all new RPC calls. This diff implements that behaviour.
Also added the flag dns_cache_failure_expiration_ms (2s by default), which controls how long to wait before retrying DNS resolution after a failure, and the metric dns_resolve_latency, which reflects the time spent on DNS resolution.
Jira: DB-11847, DB-11222
Test Plan: Jenkins
Reviewers: qhu, rthallam, slingam
Reviewed By: qhu, slingam
Subscribers: slingam, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D35993
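The two expiration flags in the summary can be illustrated with a small Python sketch. This is not the YugabyteDB implementation: the class `RefreshPolicy`, its method names, and the `latencies_ms` list (a stand-in for the dns_resolve_latency metric) are hypothetical, and recording latency for failed attempts as well as successful ones is an assumption.

```python
import time


class RefreshPolicy:
    """Hypothetical helper: decides when a hostname is next due for resolution."""

    def __init__(self, expiration_ms=60000, failure_expiration_ms=2000,
                 clock=time.monotonic):
        self._expiration_s = expiration_ms / 1000.0
        self._failure_expiration_s = failure_expiration_ms / 1000.0
        self._clock = clock
        # Stand-in for the dns_resolve_latency metric (one sample per attempt).
        self.latencies_ms = []

    def attempt(self, resolver, hostname):
        """Run one resolution; return (address or None, time of next attempt)."""
        start = self._clock()
        try:
            address = resolver(hostname)
            ttl = self._expiration_s          # success: normal cache lifetime
        except OSError:
            address = None
            ttl = self._failure_expiration_s  # failure: retry much sooner
        finish = self._clock()
        # Assumption: latency is recorded whether or not the attempt succeeded.
        self.latencies_ms.append((finish - start) * 1000.0)
        return address, finish + ttl
```

With the defaults quoted above, a successful resolution is considered valid for 60 seconds, while a failed one is retried after 2 seconds instead of waiting for the full cache lifetime.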
karthik-ramanathan-3006 pushed a commit to karthik-ramanathan-3006/yugabyte-db that referenced this issue on Jun 24, 2024
Summary: same text as the Jun 21, 2024 commit above.
Differential Revision: https://phorge.dev.yugabyte.com/D35993
spolitov added a commit that referenced this issue on Jul 4, 2024
Summary: same text as the Jun 21, 2024 commit above.
Jira: DB-11847, DB-11222
Original commit: bf0fb4b/D35993
Test Plan: Jenkins
Reviewers: qhu, rthallam, slingam
Reviewed By: qhu
Subscribers: ybase, slingam
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D36156
spolitov added a commit that referenced this issue on Jul 14, 2024
Summary: same text as the Jun 21, 2024 commit above.
Jira: DB-11847, DB-11222
Original commit: bf0fb4b/D35993
Test Plan: Jenkins
Reviewers: qhu, rthallam, slingam
Reviewed By: qhu
Subscribers: slingam, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D36404
spolitov added a commit that referenced this issue on Jul 14, 2024
Summary: same text as the Jun 21, 2024 commit above.
Jira: DB-11847, DB-11222
Original commit: bf0fb4b/D35993
Test Plan: Jenkins
Reviewers: qhu, rthallam, slingam
Reviewed By: qhu
Subscribers: ybase, slingam
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D36403
spolitov added a commit that referenced this issue on Jul 14, 2024
Summary: same text as the Jun 21, 2024 commit above.
Jira: DB-11847, DB-11222
Original commit: bf0fb4b/D35993
Test Plan: Jenkins
Reviewers: qhu, rthallam, slingam
Reviewed By: qhu
Subscribers: slingam, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D36402
Jira Link: DB-11847
Description
Issue
We have a DNS cache controlled by the gflag dns_cache_expiration_ms (60000 by default). When a new RPC call to a node addressed by hostname is started and the existing DNS record has expired, we start a new DNS resolution, and the RPC call (in the outbound queue) waits until this resolution completes. If DNS resolution is slow, this delays the execution of RPC calls by the reactor threads in Yugabyte.
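The blocking behavior described above can be modeled with a minimal Python sketch. It is an illustration only, not YugabyteDB code: the class `BlockingDnsCache`, the `resolver` callable, and the injectable `clock` are hypothetical names introduced here.

```python
import time


class BlockingDnsCache:
    """Illustrative model of a cache whose callers wait for every refresh."""

    def __init__(self, resolver, expiration_ms=60000, clock=time.monotonic):
        self._resolver = resolver   # hypothetical callable: hostname -> address
        self._expiration_s = expiration_ms / 1000.0
        self._clock = clock
        self._entries = {}          # hostname -> (address, expiry time)

    def resolve(self, hostname):
        entry = self._entries.get(hostname)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]         # fresh entry: no DNS traffic, no waiting
        # Expired or missing entry: the caller (an outbound RPC in the
        # scenario above) is stuck here for as long as the resolver takes.
        address = self._resolver(hostname)
        self._entries[hostname] = (address, self._clock() + self._expiration_s)
        return address
```

In this model, any slowness in `resolver` is added directly to the latency of the call that happens to find the entry expired.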
The reactor threads are responsible for handling the RPCs, including the consensus RPCs that maintain the leadership status of the tablets. If the delay in an RPC is greater than leader_lease_duration_ms, the leader of the tablet can step down. Some time later, however, the load balancer can move the leader back to the old node (the previous leader) to keep the number of leaders balanced. As a result, problems with DNS resolution can lead to unexpected leadership changes and leader moves in the cluster, which negatively impact query latencies.
Workaround/Fix
To tolerate DNS resolution problems, it is better to perform the DNS resolution in the background while using the previously cached DNS record for the current RPC. This reduces the chance of leadership changes when DNS resolution runs into issues.
Once the background thread receives the response to the DNS request, it updates the cached entry, which is then used by subsequent RPC calls.
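The approach can be sketched in Python as a cache that serves the stale address immediately and refreshes it on a background thread. This is a simplified illustration under stated assumptions, not the actual change: `BackgroundRefreshDnsCache`, `wait_for_refresh`, the `resolver` callable, and the `clock` parameter are all hypothetical, and the sketch does not throttle retries after a failed refresh.

```python
import threading
import time


class BackgroundRefreshDnsCache:
    """Serves the cached address right away; refreshes expired entries off-path."""

    def __init__(self, resolver, expiration_ms=60000, clock=time.monotonic):
        self._resolver = resolver   # hypothetical callable: hostname -> address
        self._expiration_s = expiration_ms / 1000.0
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}          # hostname -> (address, expiry time)
        self._refreshing = {}       # hostname -> in-flight refresh thread

    def resolve(self, hostname):
        with self._lock:
            entry = self._entries.get(hostname)
            if entry is not None:
                address, expiry = entry
                if self._clock() >= expiry and hostname not in self._refreshing:
                    # Expired: start one background refresh, but do not wait.
                    worker = threading.Thread(
                        target=self._refresh, args=(hostname,), daemon=True)
                    self._refreshing[hostname] = worker
                    worker.start()
                return address      # previous address is used for this call
        # Nothing cached yet: the very first lookup still has to wait.
        address = self._resolver(hostname)
        with self._lock:
            self._entries[hostname] = (address, self._clock() + self._expiration_s)
        return address

    def _refresh(self, hostname):
        try:
            address = self._resolver(hostname)
        except OSError:
            address = None          # keep serving the old address on failure
        with self._lock:
            if address is not None:
                self._entries[hostname] = (
                    address, self._clock() + self._expiration_s)
            self._refreshing.pop(hostname, None)

    def wait_for_refresh(self, hostname, timeout=None):
        """Test helper: block until any in-flight refresh has finished."""
        with self._lock:
            worker = self._refreshing.get(hostname)
        if worker is not None:
            worker.join(timeout)
```

The design choice here is that only the first lookup for a hostname blocks; every later call returns whatever address is cached, so a slow or failing resolver cannot stall callers that already have an address.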
Issue Type
kind/enhancement