
High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters #5472

Closed
nchandrappa opened this issue Aug 24, 2020 · 0 comments
Labels: area/cdc Change Data Capture · area/docdb YugabyteDB core features · priority/high High Priority

Comments

nchandrappa (Contributor) commented Aug 24, 2020

In YugabyteDB clusters with bi-directional replication enabled, I'm seeing high CPU utilization (~70%) in both clusters, even with no workloads running on them.

Root cause:

The GetChanges call, which polls each table for new changes, is very aggressive and significantly increases the number of RPC calls in the clusters, resulting in high CPU utilization. The GFlag --async_replication_polling_delay_ms controls how often GetChanges runs. By default it is set to --async_replication_polling_delay_ms=0, which is very aggressive; as a workaround, we can make polling for new data less aggressive by increasing the polling delay.

Result of top -H:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
27382 yugabyte  20   0 7666960 1.221g  33272 S 26.9 16.7  46:57.55 TabletServer_re
27380 yugabyte  20   0 7666960 1.221g  33272 S 24.6 16.7  42:22.55 TabletServer_re
14182 yugabyte  20   0 7666960 1.221g  33272 S 19.6 16.7   4:29.32 CDCConsumerHand
14183 yugabyte  20   0 7666960 1.221g  33272 S 19.3 16.7   4:29.10 CDCConsumerHand
14184 yugabyte  20   0 7666960 1.221g  33272 S 19.3 16.7   4:28.75 CDCConsumerHand
14185 yugabyte  20   0 7666960 1.221g  33272 S 19.3 16.7   4:29.32 CDCConsumerHand
27381 yugabyte  20   0 7666960 1.221g  33272 R 17.3 16.7  31:45.60 TabletServer_re
27383 yugabyte  20   0 7666960 1.221g  33272 S 15.3 16.7  28:37.63 TabletServer_re
14171 yugabyte  20   0 7666960 1.221g  33272 S 12.0 16.7   2:44.45 CDCConsumerRemo
14172 yugabyte  20   0 7666960 1.221g  33272 S 12.0 16.7   2:44.49 CDCConsumerRemo
14174 yugabyte  20   0 7666960 1.221g  33272 R 12.0 16.7   2:42.02 CDCConsumerRemo
14173 yugabyte  20   0 7666960 1.221g  33272 R 11.6 16.7   2:41.43 CDCConsumerRemo

Workaround:

Increasing the polling delay to 3 ms brought CPU utilization down to ~3%. Set the GFlag --async_replication_polling_delay_ms=3 and perform a rolling restart of both clusters if they already exist; otherwise, set this GFlag at cluster creation.
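As a sketch, the workaround could be applied on each tablet server like this. The flag name comes from this issue; the flagfile path and yb-tserver invocation are illustrative assumptions — apply the flag however your deployment manages GFlags:

```shell
# Illustrative only: append the polling-delay GFlag to each tablet server's
# flagfile, then restart yb-tserver processes one node at a time (rolling).
echo '--async_replication_polling_delay_ms=3' >> /opt/yugabyte/tserver.conf

# Or pass it directly when starting the server (path is an assumption):
# ./bin/yb-tserver --flagfile=/opt/yugabyte/tserver.conf
```

A rolling restart is needed because GFlags set in the flagfile are read at process startup.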

@nchandrappa nchandrappa added area/ysql Yugabyte SQL (YSQL) area/cdc Change Data Capture labels Aug 24, 2020
@kmuthukk kmuthukk changed the title [YSQL] High CPU utilization in 2DC enabled YugabyteDB clusters [YSQL] High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters Aug 24, 2020
@nchandrappa nchandrappa changed the title [YSQL] High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters High CPU utilization in xCluster Async (2DC) enabled YugabyteDB clusters Aug 24, 2020
@kmuthukk kmuthukk removed the area/ysql Yugabyte SQL (YSQL) label Aug 24, 2020
@yugabyte-ci yugabyte-ci added community/request Issues created by external users and removed area/cdc Change Data Capture labels Aug 24, 2020
@kmuthukk kmuthukk added area/cdc Change Data Capture area/docdb YugabyteDB core features priority/high High Priority and removed community/request Issues created by external users labels Aug 24, 2020
@nspiegelberg nspiegelberg self-assigned this Aug 26, 2020
@bmatican bmatican removed their assignment Aug 26, 2020
nspiegelberg added a commit that referenced this issue Sep 2, 2020
Summary:
In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU
utilization (~70%) in both clusters, even with no workloads running. The CDC
GetChanges call that identifies new changes in a table is very aggressive, to minimize latency
and ensure minimal lag in high-volume situations. Our new heuristic has two goals:
  1. When the Producer is active, minimize lag and keep up.
  2. When the Producer is mostly idle, don't waste hardware resources.
For goal 2, we add an idle delay after X consecutive requests with no data. As soon as we get new data
from GetChanges, we reset the delay.
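The idle-dampening heuristic described above can be sketched as follows. This is a hedged illustration, not YugabyteDB's actual implementation: the function name, threshold, and delay values are made up for the example; only the reset-on-new-data behavior comes from the commit message.

```shell
#!/usr/bin/env bash
# Sketch of idle dampening for a CDC poller (all names/values illustrative).

IDLE_THRESHOLD=5    # consecutive empty GetChanges responses before backing off
IDLE_DELAY_MS=100   # delay applied once the producer looks idle

empty_polls=0
next_delay_ms=0

# record_poll_result <num_records>: update poll delay after one GetChanges call.
record_poll_result() {
  local records=$1
  if [ "$records" -gt 0 ]; then
    # New data arrived: reset and go back to aggressive (zero-delay) polling.
    empty_polls=0
    next_delay_ms=0
  else
    empty_polls=$((empty_polls + 1))
    if [ "$empty_polls" -ge "$IDLE_THRESHOLD" ]; then
      # Producer looks idle: stop burning CPU on back-to-back empty polls.
      next_delay_ms=$IDLE_DELAY_MS
    fi
  fi
}

# Simulate: six empty polls (delay kicks in), then data arrives (delay resets).
for r in 0 0 0 0 0 0 100; do
  record_poll_result "$r"
done
echo "$next_delay_ms"   # prints 0: the last poll returned data
```

The key property is that backoff only engages after sustained idleness and is dropped immediately on new data, so an active Producer never pays the extra latency.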

Test Plan: TwoDCTest.PollAndObserveIdleDampening

Reviewers: bogdan, kannan, alan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9253
nspiegelberg added a commit that referenced this issue Sep 15, 2020
Summary:
In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU
utilization (~70%) in both clusters, even with no workloads running. The CDC
GetChanges call that identifies new changes in a table is very aggressive, to minimize latency
and ensure minimal lag in high-volume situations. Our new heuristic has two goals:
  1. When the Producer is active, minimize lag and keep up.
  2. When the Producer is mostly idle, don't waste hardware resources.
For goal 2, we add an idle delay after X consecutive requests with no data. As soon as we get new data
from GetChanges, we reset the delay.

Test Plan: Jenkins: rebase: 2.2.2.1

Reviewers: bogdan, rahuldesirazu, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9380
nspiegelberg added a commit that referenced this issue Sep 15, 2020
Summary:
In YugabyteDB clusters with bi-directional CDC enabled, we were seeing high CPU
utilization (~70%) in both clusters, even with no workloads running. The CDC
GetChanges call that identifies new changes in a table is very aggressive, to minimize latency
and ensure minimal lag in high-volume situations. Our new heuristic has two goals:
  1. When the Producer is active, minimize lag and keep up.
  2. When the Producer is mostly idle, don't waste hardware resources.
For goal 2, we add an idle delay after X consecutive requests with no data. As soon as we get new data
from GetChanges, we reset the delay.

Test Plan: Jenkins: rebase: 2.2

Reviewers: bogdan, kannan, alan, rahuldesirazu, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9377