High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters #5472
kmuthukk changed the title from "[YSQL] High CPU utilization in 2DC enabled YugabyteDB clusters" to "[YSQL] High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters" on Aug 24, 2020.
nchandrappa changed the title from "[YSQL] High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters" to "High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters" on Aug 24, 2020.
yugabyte-ci added the community/request (Issues created by external users) label and removed the area/cdc (Change Data Capture) label on Aug 24, 2020.
kmuthukk added the area/cdc (Change Data Capture), area/docdb (YugabyteDB core features), and priority/high (High Priority) labels and removed the community/request (Issues created by external users) label on Aug 24, 2020.
nspiegelberg added a commit that referenced this issue on Sep 2, 2020:
Summary: In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU utilization (~70%) in both the clusters without any workloads running on the clusters. The CDC GetChanges call for identifying the new changes in the table is very aggressive to minimize latency and ensure minimal lag in high volume situations. Our new heuristic has 2 goals:
1. The Producer is active, we need to minimize lag and keep up.
2. The Producer is mostly idle, we don't want to waste hw resources.
For #2, we add an idle delay after X consecutive requests with no data. As soon as we get new data from GetChanges, we reset the delay.
Test Plan: TwoDCTest.PollAndObserveIdleDampening
Reviewers: bogdan, kannan, alan, rahuldesirazu
Reviewed By: rahuldesirazu
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D9253
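For illustration, here is a minimal C++ sketch of the idle-dampening heuristic described in the commit summary. The class, method, and constant names (PollBackoff, MaybeDelay, kMaxIdlePolls, kIdleDelayMs) and the specific thresholds are hypothetical, not YugabyteDB's actual implementation:

```cpp
#include <chrono>
#include <thread>

// Hypothetical sketch of the idle-dampening heuristic: after several
// consecutive GetChanges polls that return no records, start sleeping between
// polls; reset as soon as new data arrives so an active producer sees no
// added lag. Names and thresholds are illustrative only.
class PollBackoff {
 public:
  void RecordPollResult(bool got_data) {
    if (got_data) {
      idle_polls_ = 0;  // Producer is active again: drop the idle delay.
    } else {
      ++idle_polls_;
    }
  }

  void MaybeDelay() const {
    // Only back off once the producer has looked idle for a while.
    if (idle_polls_ >= kMaxIdlePolls) {
      std::this_thread::sleep_for(std::chrono::milliseconds(kIdleDelayMs));
    }
  }

 private:
  static constexpr int kMaxIdlePolls = 5;   // "X consecutive requests with no data"
  static constexpr int kIdleDelayMs = 100;  // illustrative value only
  int idle_polls_ = 0;
};
```

A poller following this pattern would call MaybeDelay() before each GetChanges request and RecordPollResult() after it, so an active producer is polled back-to-back while an idle one is polled at a relaxed cadence.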
nspiegelberg added a commit that referenced this issue on Sep 15, 2020:
Summary: In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU utilization (~70%) in both the clusters without any workloads running on the clusters. The CDC GetChanges call for identifying the new changes in the table is very aggressive to minimize latency and ensure minimal lag in high volume situations. Our new heuristic has 2 goals:
1. The Producer is active, we need to minimize lag and keep up.
2. The Producer is mostly idle, we don't want to waste hw resources.
For #2, we add an idle delay after X consecutive requests with no data. As soon as we get new data from GetChanges, we reset the delay.
Test Plan: Jenkins: rebase: 2.2.2.1
Reviewers: bogdan, rahuldesirazu, hector
Reviewed By: hector
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D9380
nspiegelberg added a commit that referenced this issue on Sep 15, 2020:
Summary: In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU utilization (~70%) in both the clusters without any workloads running on the clusters. The CDC GetChanges call for identifying the new changes in the table is very aggressive to minimize latency and ensure minimal lag in high volume situations. Our new heuristic has 2 goals:
1. The Producer is active, we need to minimize lag and keep up.
2. The Producer is mostly idle, we don't want to waste hw resources.
For #2, we add an idle delay after X consecutive requests with no data. As soon as we get new data from GetChanges, we reset the delay.
Test Plan: Jenkins: rebase: 2.2
Reviewers: bogdan, kannan, alan, rahuldesirazu, hector
Reviewed By: hector
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D9377
In YugabyteDB clusters with bi-directional replication enabled, I'm seeing high CPU utilization (~70%) in both clusters without any workloads running on them.

Root cause:
The GetChanges call for identifying new changes in a table is very aggressive, and it significantly increases the number of RPC calls in the clusters, which results in high CPU utilization. The GFlag --async_replication_polling_delay_ms controls how often GetChanges runs. By default it is set to --async_replication_polling_delay_ms=0, which is very aggressive; as a workaround, we can make polling for new data less aggressive by increasing the polling delay.

Workaround:
Increasing the polling delay to 3 ms brought the CPU utilization down to ~3%. Update the GFlag to --async_replication_polling_delay_ms=3 and perform a rolling restart of both clusters if they are already created; otherwise, set this GFlag during cluster creation.
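As a rough illustration of what this flag controls, the sketch below shows a consumer-side poll loop gated by the delay. Only the flag name and its default of 0 come from the issue; the loop structure and the FetchChangesAndApply helper are hypothetical, not YugabyteDB's actual consumer code:

```cpp
#include <chrono>
#include <thread>

#include <gflags/gflags.h>

// Flag name and default taken from the issue; everything else in this file is
// a hypothetical illustration.
DEFINE_int32(async_replication_polling_delay_ms, 0,
             "Delay between successive GetChanges polls on the consumer.");

// Hypothetical helper: issue one GetChanges RPC and apply whatever it returns.
void FetchChangesAndApply() { /* RPC and apply logic omitted in this sketch */ }

void PollLoop() {
  while (true) {
    FetchChangesAndApply();

    // With the default of 0 this loop issues back-to-back RPCs, which is what
    // drove CPU to ~70% on otherwise idle clusters. A small delay such as 3 ms
    // breaks the tight loop with negligible impact on replication lag.
    std::this_thread::sleep_for(
        std::chrono::milliseconds(FLAGS_async_replication_polling_delay_ms));
  }
}
```

With --async_replication_polling_delay_ms=3 set on the tserver command line, each poll cycle pauses for 3 ms instead of spinning, which matches the ~3% CPU observed after applying the workaround.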