
High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters #5472

Closed
nchandrappa opened this issue Aug 24, 2020 · 0 comments
Labels: area/cdc Change Data Capture · area/docdb YugabyteDB core features · priority/high High Priority

Comments

nchandrappa (Contributor) commented Aug 24, 2020

In YugabyteDB clusters with bi-directional replication enabled, I'm seeing high CPU utilization (~70%) in both clusters, even with no workloads running on them.

Root cause:

The GetChanges call, which polls each table for new changes, is very aggressive and significantly increases the number of RPC calls in the clusters, resulting in high CPU utilization. The GFlag --async_replication_polling_delay_ms controls how often GetChanges runs. By default it is set to --async_replication_polling_delay_ms=0, which is very aggressive; as a workaround, we can make polling for new data less aggressive by increasing the polling delay.

Result of top -H:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
27382 yugabyte  20   0 7666960 1.221g  33272 S 26.9 16.7  46:57.55 TabletServer_re
27380 yugabyte  20   0 7666960 1.221g  33272 S 24.6 16.7  42:22.55 TabletServer_re
14182 yugabyte  20   0 7666960 1.221g  33272 S 19.6 16.7   4:29.32 CDCConsumerHand
14183 yugabyte  20   0 7666960 1.221g  33272 S 19.3 16.7   4:29.10 CDCConsumerHand
14184 yugabyte  20   0 7666960 1.221g  33272 S 19.3 16.7   4:28.75 CDCConsumerHand
14185 yugabyte  20   0 7666960 1.221g  33272 S 19.3 16.7   4:29.32 CDCConsumerHand
27381 yugabyte  20   0 7666960 1.221g  33272 R 17.3 16.7  31:45.60 TabletServer_re
27383 yugabyte  20   0 7666960 1.221g  33272 S 15.3 16.7  28:37.63 TabletServer_re
14171 yugabyte  20   0 7666960 1.221g  33272 S 12.0 16.7   2:44.45 CDCConsumerRemo
14172 yugabyte  20   0 7666960 1.221g  33272 S 12.0 16.7   2:44.49 CDCConsumerRemo
14174 yugabyte  20   0 7666960 1.221g  33272 R 12.0 16.7   2:42.02 CDCConsumerRemo
14173 yugabyte  20   0 7666960 1.221g  33272 R 11.6 16.7   2:41.43 CDCConsumerRemo

Workaround:

Increasing the polling delay to 3 ms brought CPU utilization down to ~3%. Set the GFlag --async_replication_polling_delay_ms=3 and perform a rolling restart of both clusters if they already exist; otherwise, set this GFlag at cluster creation.
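As a sketch, the workaround could be applied on each tablet server like this. The flag name comes from this issue; the flagfile path and yb-tserver invocation are illustrative assumptions — apply the flag however your deployment manages GFlags:

```shell
# Illustrative only: append the polling-delay GFlag to each tablet server's
# flagfile, then restart yb-tserver processes one node at a time (rolling).
echo '--async_replication_polling_delay_ms=3' >> /opt/yugabyte/tserver.conf

# Or pass it directly when starting the server (path is an assumption):
# ./bin/yb-tserver --flagfile=/opt/yugabyte/tserver.conf
```

A rolling restart is needed because GFlags set in the flagfile are read at process startup.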

@nchandrappa nchandrappa added area/ysql Yugabyte SQL (YSQL) area/cdc Change Data Capture labels Aug 24, 2020
@kmuthukk kmuthukk changed the title [YSQL] High CPU utilization in 2DC enabled YugabyteDB clusters [YSQL] High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters Aug 24, 2020
@nchandrappa nchandrappa changed the title [YSQL] High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters Aug 24, 2020
@kmuthukk kmuthukk removed the area/ysql Yugabyte SQL (YSQL) label Aug 24, 2020
@yugabyte-ci yugabyte-ci added community/request Issues created by external users and removed area/cdc Change Data Capture labels Aug 24, 2020
@kmuthukk kmuthukk added area/cdc Change Data Capture area/docdb YugabyteDB core features priority/high High Priority and removed community/request Issues created by external users labels Aug 24, 2020
@nspiegelberg nspiegelberg self-assigned this Aug 26, 2020
@bmatican bmatican removed their assignment Aug 26, 2020
nspiegelberg added a commit that referenced this issue Sep 2, 2020
Summary:
In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU
utilization (~70%) in both clusters, even with no workloads running. The CDC
GetChanges call that identifies new changes in a table is very aggressive, to minimize latency
and ensure minimal lag in high-volume situations. Our new heuristic has two goals:
  1. When the Producer is active, minimize lag and keep up.
  2. When the Producer is mostly idle, don't waste hardware resources.
For goal 2, we add an idle delay after X consecutive requests with no data. As soon as we get new data
from GetChanges, we reset the delay.
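The idle-dampening heuristic described above can be sketched as follows. This is a hedged illustration, not YugabyteDB's actual implementation: the function name, threshold, and delay values are made up for the example; only the reset-on-new-data behavior comes from the commit message.

```shell
#!/usr/bin/env bash
# Sketch of idle dampening for a CDC poller (all names/values illustrative).

IDLE_THRESHOLD=5    # consecutive empty GetChanges responses before backing off
IDLE_DELAY_MS=100   # delay applied once the producer looks idle

empty_polls=0
next_delay_ms=0

# record_poll_result <num_records>: update poll delay after one GetChanges call.
record_poll_result() {
  local records=$1
  if [ "$records" -gt 0 ]; then
    # New data arrived: reset and go back to aggressive (zero-delay) polling.
    empty_polls=0
    next_delay_ms=0
  else
    empty_polls=$((empty_polls + 1))
    if [ "$empty_polls" -ge "$IDLE_THRESHOLD" ]; then
      # Producer looks idle: stop burning CPU on back-to-back empty polls.
      next_delay_ms=$IDLE_DELAY_MS
    fi
  fi
}

# Simulate: six empty polls (delay kicks in), then data arrives (delay resets).
for r in 0 0 0 0 0 0 100; do
  record_poll_result "$r"
done
echo "$next_delay_ms"   # prints 0: the last poll returned data
```

The key property is that backoff only engages after sustained idleness and is dropped immediately on new data, so an active Producer never pays the extra latency.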

Test Plan: TwoDCTest.PollAndObserveIdleDampening

Reviewers: bogdan, kannan, alan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9253
nspiegelberg added a commit that referenced this issue Sep 15, 2020
Summary:
In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU
utilization (~70%) in both clusters, even with no workloads running. The CDC
GetChanges call that identifies new changes in a table is very aggressive, to minimize latency
and ensure minimal lag in high-volume situations. Our new heuristic has two goals:
  1. When the Producer is active, minimize lag and keep up.
  2. When the Producer is mostly idle, don't waste hardware resources.
For goal 2, we add an idle delay after X consecutive requests with no data. As soon as we get new data
from GetChanges, we reset the delay.

Test Plan: Jenkins: rebase: 2.2.2.1

Reviewers: bogdan, rahuldesirazu, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9380
nspiegelberg added a commit that referenced this issue Sep 15, 2020
Summary:
In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU
utilization (~70%) in both clusters, even with no workloads running. The CDC
GetChanges call that identifies new changes in a table is very aggressive, to minimize latency
and ensure minimal lag in high-volume situations. Our new heuristic has two goals:
  1. When the Producer is active, minimize lag and keep up.
  2. When the Producer is mostly idle, don't waste hardware resources.
For goal 2, we add an idle delay after X consecutive requests with no data. As soon as we get new data
from GetChanges, we reset the delay.

Test Plan: Jenkins: rebase: 2.2

Reviewers: bogdan, kannan, alan, rahuldesirazu, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9377