Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce locking contention for region heartbeat handling #5586

Closed
Tracked by #5648
hihihuhu opened this issue Oct 10, 2022 · 4 comments
Closed
Tracked by #5648

reduce locking contention for region heartbeat handling #5586

hihihuhu opened this issue Oct 10, 2022 · 4 comments
Assignees
Labels
affects-6.1 severity/minor type/bug The issue is confirmed as a bug. type/enhancement The issue or PR belongs to an enhancement.

Comments

@hihihuhu
Copy link

Enhancement Task

we are running a tidb cluster to host about 90TB data on 1.8m regions, and seeing significant locking contention that causes 2 issues when there are massive region movement, i.e. when scales out tikv.

  1. region heartbeat handling itself, which could lead to pd OOM because of pending heartbeat requests
  2. the latency GetRegion requests increase significantly because of the locking, thus make the sql queries timeout.

dig into pd code, the locks has coarse scope and could be improved to increase the throughput. for example,

  1. https://github.com/tikv/pd/blob/v6.3.0/server/cluster/cluster.go#L809 doesn't have to lock everything but could use a per region/per region bucket lock
  2. https://github.com/tikv/pd/blob/v6.3.0/server/core/basic_cluster.go#L30, this locks 2 different data structures, could be separated.
  3. or even better, BasicCluster could only lock thing based on its business needs, and using lock-free or thread-safe data structure instead of lock at high-level, for example, a thread-safe b-tree could have a lock for each node instead of lock the whole tree.
@hihihuhu hihihuhu added the type/enhancement The issue or PR belongs to an enhancement. label Oct 10, 2022
@rleungx
Copy link
Member

rleungx commented Oct 11, 2022

  1. https://github.com/tikv/pd/blob/v6.3.0/server/cluster/cluster.go#L809 doesn't have to lock everything but could use a per region/per region bucket lock
  2. https://github.com/tikv/pd/blob/v6.3.0/server/core/basic_cluster.go#L30, this locks 2 different data structures, could be separated.
  3. or even better, BasicCluster could only lock thing based on its business needs, and using lock-free or thread-safe data structure instead of lock at high-level, for example, a thread-safe b-tree could have a lock for each node instead of lock the whole tree.

For 1, I think the purpose of this lock is that we wanna keep the atomicity of the whole progress of updating the cache. And both 2 and 3 are good choices.

@rleungx
Copy link
Member

rleungx commented Oct 11, 2022

Related to #3557

ti-chi-bot added a commit that referenced this issue Oct 25, 2022
ref #5586

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 25, 2022
ref tikv#5586

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue Oct 27, 2022
ref #5586

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@nolouch nolouch added the type/bug The issue is confirmed as a bug. label Nov 14, 2022
@nolouch
Copy link
Contributor

nolouch commented Nov 18, 2022

More details task list here: #5648

rleungx added a commit to ti-chi-bot/pd that referenced this issue Nov 22, 2022
ref tikv#5586

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
rleungx added a commit to ti-chi-bot/pd that referenced this issue Nov 22, 2022
ref tikv#5586

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
rleungx added a commit to ti-chi-bot/pd that referenced this issue Nov 22, 2022
ref tikv#5586

Signed-off-by: Ryan Leung <rleungx@gmail.com>
ti-chi-bot added a commit that referenced this issue Nov 25, 2022
ref #5586, ref #5587

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ryan Leung <rleungx@gmail.com>
ti-chi-bot added a commit that referenced this issue Nov 30, 2022
ref #5586, ref #5648, ref #5712

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@rleungx
Copy link
Member

rleungx commented Mar 3, 2023

Close it for now since #5648 is closed.

@rleungx rleungx closed this as completed Mar 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.1 severity/minor type/bug The issue is confirmed as a bug. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

No branches or pull requests

4 participants