v6.5 OOM, long checkpoint/resolved-ts lag on 50k tables #7872

Closed
dbsid opened this issue Dec 10, 2022 · 11 comments

Comments

@dbsid
Contributor

dbsid commented Dec 10, 2022

What did you do?

test 50k tables via benchbot

python3 main.py gen-benchbot-case --email "xxx@pingcap.com" --token 'xxx'  --testbed_size "2xl" --benchmark_id 6000048  --resource_pool benchbot-cdc --bench_type cdc_workload --bench_sub_types "bank" \
--cdc_workload_options  "-row-count 10000000 -table-count 50000 -tps 14000" \
--duration=30m --threads 64 --versions v6.1.3,v6.5.0 --tidb_replicas 4 --tikv_replicas 6 --cdc_replicas 4 \
--tikv_configs "{raftstore: {store-io-pool-size: 1}, raft-engine: {enable: true}, import: {num-threads: 128}, cdc: {min-ts-interval: '1000ms'}}" \
--cdc_configs "{per-table-memory-quota: 33554432, owner-flush-interval: '200ms', processor-flush-interval: '100ms'}" \
--sink_options "?protocol=canal-json&kafka-version=3.1.0&partition-num=6&max-message-bytes=67108864&replication-factor=1" \
--warmup False --tidb_configs "{new_collations_enabled_on_first_bootstrap: false}" \
--storage_url s3://perftest/cdc_workload_bank_50k_t_20m_r --db_name bank50k --sink_target kafka --br_options "--ddl-batch-size 64"  --staging True --description "kafka-50k-table"
[2022/12/10 16:29:29.015 +08:00] [INFO] [utils.go:353] ["tiup cdc:v6.5.0 cli changefeed create --pd=http://pd-peer.benchbot-cdc-kafka-amd64-2xl-cdc-workload-tps-1325828-1-198:2379 --sink-uri=\"kafka://ds-kafka-kafka-peer:9092/cdc-topic-1670660969?protocol=canal-json&kafka-version=3.1.0&partition-num=6&max-message-bytes=67108864&replication-factor=1\" --changefeed-id=\"simple-replication-task\" --no-confirm --config=/tmp/2022-12-10T16:29:28+08:00.cf.toml"]
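
For reference, a minimal sketch of how the --cdc_configs string above would read if written out as a TiCDC server configuration file, assuming benchbot forwards these keys verbatim to cdc server --config (the file layout is an assumption, not verified against the benchbot source):

# Sketch only: write out the three settings passed via --cdc_configs.
cat > cdc-server.toml <<'EOF'
# per-table memory quota for the sort/sink path, in bytes (32 MiB here)
per-table-memory-quota = 33554432
# how often the owner flushes changefeed progress
owner-flush-interval = "200ms"
# how often processors flush table positions
processor-flush-interval = "100ms"
EOF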

What did you expect to see?

No OOM, and no long checkpoint/resolved-ts lag.

What did you see instead?

v6.5 OOMs and shows long checkpoint and resolved-ts lag, while v6.1.3 is fine.
[screenshot: checkpoint/resolved-ts lag metrics]

Versions of the cluster

v6.5 release test

@dbsid added the area/ticdc (Issues or PRs related to TiCDC) and type/bug (The issue is confirmed as a bug) labels on Dec 10, 2022
@dbsid
Contributor Author

dbsid commented Dec 10, 2022

The QPS on the upstream cluster is much higher on v6.5 (7810 vs 5480); not sure whether the higher QPS on v6.5 is what is causing the CDC OOM.

[screenshot: upstream QPS comparison]

@dbsid
Contributor Author

dbsid commented Dec 10, 2022

/severity critical

@dbsid
Contributor Author

dbsid commented Dec 10, 2022

/affects-6.5

@nongfushanquan
Contributor

Is the first phase on v6.1.3 and the second on v6.5?

@dbsid
Copy link
Contributor Author

dbsid commented Dec 10, 2022

> Is the first phase on v6.1.3 and the second on v6.5?

Yes, the first phase is v6.1.3.

@dbsid
Contributor Author

dbsid commented Dec 10, 2022

A heap profile taken when the memory usage is ~30 GB:

[screenshot: heap profile]
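
For anyone reproducing this, one way to capture a comparable heap profile is via the Go pprof endpoint on the TiCDC status port (assuming the default status port 8300 and that the pprof handlers are served on it; adjust the host/port to the actual deployment):

# Dump a heap profile from a capture node and list the top allocators.
curl -s http://<cdc-host>:8300/debug/pprof/heap -o cdc-heap.pb.gz
go tool pprof -top cdc-heap.pb.gz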

@nongfushanquan
Contributor

The OOM is caused by periodic timeouts when connecting to PD, which cause the processor to restart.
As a workaround, we can set a larger value for that timeout (such as 30s).
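
A rough way to check whether PD round trips from the CDC hosts are in fact slow enough to hit such a timeout is to time a PD HTTP API call from each capture node (this is only an HTTP-level sanity check, not the exact client path TiCDC uses; replace the endpoint with the cluster's real PD address):

# Repeatedly time the PD members API from a cdc node; sustained values close
# to the client timeout would support the periodic-timeout explanation.
for i in $(seq 1 10); do
  curl -o /dev/null -s -w 'pd members RTT: %{time_total}s\n' \
    http://pd-peer.example:2379/pd/api/v1/members
  sleep 1
done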

@nongfushanquan
Contributor

/severity moderate

@nongfushanquan
Contributor

/remove critical

@nongfushanquan
Contributor

/remove severity/critical

@sdojjy
Member

sdojjy commented Mar 31, 2023

With PR #8699, cdc v6.5 can handle about 100k tables with 4 nodes.
