Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/client: add incremental scan region count limit #1899

Merged
merged 18 commits into from
Jun 4, 2021

Conversation

amyangfei
Copy link
Contributor

@amyangfei amyangfei commented May 31, 2021

What problem does this PR solve?

This PR adds a token limiter to kv client, in order to control the concurrency of incremental scan region, for some reasons:

  • A large scale incremental scan tasks could cause pressure in TiKV server, which could lead to OOM in TiKV.
  • TiKV could also add concurrency limit of incremental scan regions and prevents too much memory consumption, however it will cause too many uninitialized pending regions in TiCDC kv clients. Region reconnect could happen before the TiKV can serve them.

What is changed and how it works?

  • Add a token based region router.
    • When the used token doesn't reach upper limit, redirect the region request as soon as possible.
    • Otherwise buffer the region request, and check 100ms to see whether there exists new token opportunity.
    • The token is grouped by store address
  • Separate some logic from dispatchRequest into requestRegionToStore
    • dispatchRequest only reads singleRegionInfo from region channel, try to get gRPC context and sends the singleRegionInfo(which has been filled in with gRPC context) into region router
    • requestRegionToStore reads singleRegionInfo from the output channel of region based region router, and sends real gRPC request to TiKV store.

Check List

Tests

  • Unit test
  • Integration test

Related changes

  • Need to cherry-pick to the release branch
  • Need to update key monitor metrics in both TiCDC document and official document

Release note

  • Add concurrency limit to the region incremental scan in kv client.

@amyangfei amyangfei added component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. labels May 31, 2021
@amyangfei amyangfei added this to the v5.0.3 milestone May 31, 2021
@ti-chi-bot ti-chi-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 31, 2021
@amyangfei
Copy link
Contributor Author

/run-all-tests

1 similar comment
@amyangfei
Copy link
Contributor Author

/run-all-tests

@amyangfei
Copy link
Contributor Author

/run-integration-tests

@codecov-commenter
Copy link

Codecov Report

Merging #1899 (2f32548) into master (99ee8fb) will increase coverage by 0.8089%.
The diff coverage is 73.0769%.

@@               Coverage Diff                @@
##             master      #1899        +/-   ##
================================================
+ Coverage   53.4083%   54.2173%   +0.8089%     
================================================
  Files           154        161         +7     
  Lines         16166      16764       +598     
================================================
+ Hits           8634       9089       +455     
- Misses         6608       6704        +96     
- Partials        924        971        +47     

@amyangfei amyangfei added the status/ptal Could you please take a look? label Jun 1, 2021
@@ -67,6 +67,13 @@ var (
Name: "channel_size",
Help: "size of each channel in kv client",
}, []string{"id", "channel"})
clientRegionTokenSize = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "ticdc",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe extract the ticdc kvclient in

Namespace: "ticdc",
Subsystem: "kvclient",

to const ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep they are constant values, but I will not change them to constants in this PR, since too many literal constants are used in metric names, maybe in another PR we can address it

)

const (
regionOutputChanSize = 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could u add more comment explain why 16, thx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

}

func (r *sizedRegionRouter) Run(ctx context.Context) error {
ticker := time.NewTicker(time.Millisecond * 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe defined as const also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

cdc/kv/client.go Outdated
Comment on lines 521 to 523
// The channel to put the region that will be sent requests.
regionCh chan singleRegionInfo
regionRouter LimitRegionRouter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not a channel anymore, could you update the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

cdc/kv/client.go Outdated
@@ -70,6 +70,9 @@ const (
// failed region will be reloaded via `BatchLoadRegionsWithKeyRange` API. So we
// don't need to force reload region any more.
regionScheduleReload = false

// defines the scan region limit for each table
regionScanLimitPerTable = 40
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to limit scan region for each store?

@amyangfei amyangfei force-pushed the kv-client-region-scan-limit branch from f3f42d4 to 97ab2f4 Compare June 2, 2021 07:10
@liuzix
Copy link
Contributor

liuzix commented Jun 2, 2021

Can you comment on how your design is similar or different form a "cancellable semaphore"?

@ti-chi-bot ti-chi-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 2, 2021
@amyangfei
Copy link
Contributor Author

Can you comment on how your design is similar or different form a "cancellable semaphore"?

The semaphore counter will be used for fast check and metric counter

  • the fast check means we will check whether we have token available, which can be achieved by semaphore.TryAcquire(n) and semaphore.Release(n), but it is simpler with counter
  • golang native semaphore doesn't expose available counter

@liuzix

@amyangfei
Copy link
Contributor Author

/run-all-tests /tidb=release-5.0 /tikv=release-5.0 /pd=release-5.0 /tiflash=release-5.0

@amyangfei
Copy link
Contributor Author

/run-leak-tests

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 3, 2021
@amyangfei
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 9a07fd2

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 3, 2021
@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Jun 4, 2021
@amyangfei
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: c01b492

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 4, 2021
@amyangfei
Copy link
Contributor Author

/run-unit-tests

@amyangfei
Copy link
Contributor Author

/run-kafka-tests

@ti-chi-bot ti-chi-bot merged commit f7ab5ba into pingcap:master Jun 4, 2021
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this pull request Jun 4, 2021
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #1930.

ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this pull request Jun 4, 2021
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #1931.

@amyangfei amyangfei deleted the kv-client-region-scan-limit branch June 4, 2021 02:10
amyangfei added a commit to amyangfei/tiflow that referenced this pull request Jun 24, 2021
Note pingcap#1926 has picked part of pingcap#1899, this PR picks the remaining change
amyangfei added a commit to amyangfei/tiflow that referenced this pull request Jun 24, 2021
Note pingcap#1926 has picked part of pingcap#1899, this PR picks the remaining change
amyangfei added a commit to amyangfei/tiflow that referenced this pull request Jul 7, 2021
Note pingcap#1926 has picked part of pingcap#1899, this PR picks the remaining change
amyangfei added a commit to amyangfei/tiflow that referenced this pull request Jul 7, 2021
Note pingcap#1926 has picked part of pingcap#1899, this PR picks the remaining change
ti-chi-bot pushed a commit that referenced this pull request Jul 8, 2021
Note #1926 has picked part of #1899, this PR picks the remaining change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. status/ptal Could you please take a look?
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants