v6.5 OOM, long checkpoint/resolved-ts lag on 50k tables #7872

Closed
dbsid opened this issue Dec 10, 2022 · 11 comments

Comments

@dbsid
Contributor

dbsid commented Dec 10, 2022

What did you do?

test 50k tables via benchbot

python3 main.py gen-benchbot-case --email "xxx@pingcap.com" --token 'xxx'  --testbed_size "2xl" --benchmark_id 6000048  --resource_pool benchbot-cdc --bench_type cdc_workload --bench_sub_types "bank" \
--cdc_workload_options  "-row-count 10000000 -table-count 50000 -tps 14000" \
--duration=30m --threads 64 --versions v6.1.3,v6.5.0 --tidb_replicas 4 --tikv_replicas 6 --cdc_replicas 4 \
--tikv_configs "{raftstore: {store-io-pool-size: 1}, raft-engine: {enable: true}, import: {num-threads: 128}, cdc: {min-ts-interval: '1000ms'}}" \
--cdc_configs "{per-table-memory-quota: 33554432, owner-flush-interval: '200ms', processor-flush-interval: '100ms'}" \
--sink_options "?protocol=canal-json&kafka-version=3.1.0&partition-num=6&max-message-bytes=67108864&replication-factor=1" \
--warmup False --tidb_configs "{new_collations_enabled_on_first_bootstrap: false}" \
--storage_url s3://perftest/cdc_workload_bank_50k_t_20m_r --db_name bank50k --sink_target kafka --br_options "--ddl-batch-size 64"  --staging True --description "kafka-50k-table"
[2022/12/10 16:29:29.015 +08:00] [INFO] [utils.go:353] ["tiup cdc:v6.5.0 cli changefeed create --pd=http://pd-peer.benchbot-cdc-kafka-amd64-2xl-cdc-workload-tps-1325828-1-198:2379 --sink-uri=\"kafka://ds-kafka-kafka-peer:9092/cdc-topic-1670660969?protocol=canal-json&kafka-version=3.1.0&partition-num=6&max-message-bytes=67108864&replication-factor=1\" --changefeed-id=\"simple-replication-task\" --no-confirm --config=/tmp/2022-12-10T16:29:28+08:00.cf.toml"]
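
For reference, a minimal sketch of how the --cdc_configs string above would read if written out as a TiCDC server configuration file, assuming benchbot forwards these keys verbatim to cdc server --config (the file layout is an assumption, not verified against the benchbot source):

# Sketch only: write out the three settings passed via --cdc_configs.
cat > cdc-server.toml <<'EOF'
# per-table memory quota for the sort/sink path, in bytes (32 MiB here)
per-table-memory-quota = 33554432
# how often the owner flushes changefeed progress
owner-flush-interval = "200ms"
# how often processors flush table positions
processor-flush-interval = "100ms"
EOF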

What did you expect to see?

No OOM, and no long checkpoint/resolved-ts lag.

What did you see instead?

v6.5 OOMs and shows long checkpoint and resolved-ts lag, while v6.1.3 is fine.
[screenshot: checkpoint/resolved-ts lag metrics]

Versions of the cluster

v6.5 release test

@dbsid added the area/ticdc (Issues or PRs related to TiCDC) and type/bug (The issue is confirmed as a bug) labels on Dec 10, 2022
@dbsid
Contributor Author

dbsid commented Dec 10, 2022

The QPS on the upstream cluster is much higher on v6.5 (7810 vs 5480); not sure whether the higher QPS on v6.5 is what is causing the CDC OOM.

[screenshot: upstream QPS comparison]

@dbsid
Contributor Author

dbsid commented Dec 10, 2022

/severity critical

@dbsid
Contributor Author

dbsid commented Dec 10, 2022

/affects-6.5

@nongfushanquan
Contributor

Is the first phase on v6.1.3 and the second on v6.5?

@dbsid
Copy link
Contributor Author

dbsid commented Dec 10, 2022

> Is the first phase on v6.1.3 and the second on v6.5?

Yes, the first phase is v6.1.3.

@dbsid
Contributor Author

dbsid commented Dec 10, 2022

A heap profile taken when the memory usage is ~30 GB:

[screenshot: heap profile]
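
For anyone reproducing this, one way to capture a comparable heap profile is via the Go pprof endpoint on the TiCDC status port (assuming the default status port 8300 and that the pprof handlers are served on it; adjust the host/port to the actual deployment):

# Dump a heap profile from a capture node and list the top allocators.
curl -s http://<cdc-host>:8300/debug/pprof/heap -o cdc-heap.pb.gz
go tool pprof -top cdc-heap.pb.gz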

@nongfushanquan
Contributor

The OOM is caused by periodic timeouts when connecting to PD, which cause the processor to restart.
As a workaround, we can set a larger value for that timeout (such as 30s).
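
A rough way to check whether PD round trips from the CDC hosts are in fact slow enough to hit such a timeout is to time a PD HTTP API call from each capture node (this is only an HTTP-level sanity check, not the exact client path TiCDC uses; replace the endpoint with the cluster's real PD address):

# Repeatedly time the PD members API from a cdc node; sustained values close
# to the client timeout would support the periodic-timeout explanation.
for i in $(seq 1 10); do
  curl -o /dev/null -s -w 'pd members RTT: %{time_total}s\n' \
    http://pd-peer.example:2379/pd/api/v1/members
  sleep 1
done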

@nongfushanquan
Contributor

/severity moderate

@nongfushanquan
Contributor

/remove critical

@nongfushanquan
Contributor

/remove severity/critical

@sdojjy
Member

sdojjy commented Mar 31, 2023

With PR #8699, cdc v6.5 can handle about 100k tables with 4 nodes.
