The hot cache cannot be cleared, when the interval is less than 60 #4390

Closed
lhy1024 opened this issue Nov 26, 2021 · 4 comments · Fixed by #4396 or #4446
lhy1024 commented Nov 26, 2021

Bug Report

What did you do?

Stop the benchmark.

What did you expect to see?

The hot cache is cleared.

What did you see instead?

Many hot peers remain in the hot cache.

What version of PD are you using (pd-server -V)?

v5.0.4

@lhy1024 lhy1024 added the type/bug The issue is confirmed as a bug. label Nov 26, 2021
@lhy1024 lhy1024 self-assigned this Nov 26, 2021
ti-chi-bot added a commit that referenced this issue Nov 29, 2021
* move file

Signed-off-by: lhy1024 <admin@liudos.us>

* ref #4390

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

lhy1024 commented Dec 6, 2021

Our current heartbeat handling works like this.

If a reported interval is shorter than the default heartbeat interval (60 seconds), we put the item into the cache temporarily and wait until it has accumulated 60 seconds before deciding whether it is hot enough.

For a peer that has just been reported, if its region is already in the hot cache, there are three cases:

  1. The store is the same: the old item is used directly.
  2. The store is different, but there was a move peer or transfer leader in the previous round: the inheritable item is used directly as the old item.
  3. The store is different and there is no inheritable peer: any peer is chosen as the old item.

The problem occurs in the third branch. If the old peer's item is used directly without being cloned, the old item and the new item are written at the same time. And when the new peer's interval is less than 60 seconds, the item is put into the cache temporarily.

If the old peer is about to be cooled down at this point, the peer stays in the hot cache for a long time and cannot exit.
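The accumulation rule described above can be sketched in Go roughly as follows. This is a minimal sketch and not PD's actual code: `peerStat`, `report`, and `hotThreshold` are hypothetical names, and the threshold value is arbitrary.

```go
package main

import "fmt"

// reportInterval is the default heartbeat interval from the discussion (60 seconds).
const reportInterval = 60

// hotThreshold is a hypothetical bytes-per-second threshold, for illustration only.
const hotThreshold = 1024.0

// peerStat is a hypothetical stand-in for a hot peer cache item.
type peerStat struct {
	sumInterval int     // seconds accumulated since the last hotness decision
	sumBytes    float64 // bytes accumulated over sumInterval
}

// report adds one heartbeat to the item. It returns decided=false while fewer
// than reportInterval seconds have been collected; once enough time has been
// collected it judges hotness and resets the accumulators.
func (p *peerStat) report(interval int, bytes float64) (decided, hot bool) {
	p.sumInterval += interval
	p.sumBytes += bytes
	if p.sumInterval < reportInterval {
		return false, false // still only temporarily cached
	}
	hot = p.sumBytes/float64(p.sumInterval) >= hotThreshold
	p.sumInterval, p.sumBytes = 0, 0
	return true, hot
}

func main() {
	p := &peerStat{}
	for i := 1; i <= 12; i++ {
		decided, hot := p.report(5, 10240) // 5s heartbeats carrying 10240 bytes each
		fmt.Println(i, decided, hot)
	}
}
```

With 5-second heartbeats, the item is only judged on every twelfth report; until then it sits in the cache undecided, which is the window in which the bug in this issue shows up.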
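The three cases above can be sketched as a single lookup function. This is an illustrative sketch, not PD's implementation: `item`, `regionCache`, `inheritFrom`, and `pickOldItem` are hypothetical names, and picking the smallest store ID in the third case is just one deterministic way to "choose any peer".

```go
package main

import "fmt"

// item is a hypothetical stand-in for a cached hot peer statistic.
type item struct {
	storeID uint64
}

// regionCache holds hypothetical per-region state: cached items keyed by store,
// plus the store a peer was moved or transferred from in the previous round
// (0 means there is no inheritable peer).
type regionCache struct {
	items       map[uint64]*item
	inheritFrom uint64
}

// pickOldItem returns the old item for a peer reported on storeID and the
// name of the case that applied.
func pickOldItem(c *regionCache, storeID uint64) (*item, string) {
	// Case 1: the store is the same.
	if it, ok := c.items[storeID]; ok {
		return it, "same store"
	}
	// Case 2: the store differs, but there is an inheritable item.
	if c.inheritFrom != 0 {
		if it, ok := c.items[c.inheritFrom]; ok {
			return it, "inherit"
		}
	}
	// Case 3: the store differs and nothing is inheritable; choose any peer.
	var picked *item
	for _, it := range c.items {
		if picked == nil || it.storeID < picked.storeID {
			picked = it
		}
	}
	if picked != nil {
		return picked, "adopt"
	}
	return nil, "none"
}

func main() {
	c := &regionCache{items: map[uint64]*item{2: {storeID: 2}, 3: {storeID: 3}}}
	_, how := pickOldItem(c, 2)
	fmt.Println(how)
	c.inheritFrom = 3
	_, how = pickOldItem(c, 1)
	fmt.Println(how)
	c.inheritFrom = 0
	_, how = pickOldItem(c, 1)
	fmt.Println(how)
}
```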
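To illustrate why the missing clone matters, here is a small Go sketch of the aliasing effect. It assumes the cache stores pointers to items; `item`, `clone`, and `accumulate` are hypothetical names and do not reproduce PD's code.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a cached hot peer statistic.
type item struct {
	storeID     uint64
	sumInterval int
}

// clone returns an independent copy of the item.
func (it *item) clone() *item {
	c := *it
	return &c
}

// accumulate builds the new peer's item from an old item. With cloneFirst set
// to false the old item is reused directly, so every write to the new item is
// also a write to the old one.
func accumulate(old *item, storeID uint64, interval int, cloneFirst bool) *item {
	base := old
	if cloneFirst {
		base = old.clone()
	}
	base.storeID = storeID
	base.sumInterval += interval
	return base
}

func main() {
	// Without clone: the old item on store2 is overwritten by the new peer.
	old := &item{storeID: 2, sumInterval: 55}
	fresh := accumulate(old, 1, 5, false)
	fmt.Println(old == fresh, old.storeID, old.sumInterval)

	// With clone: the old item keeps its own store and interval.
	old = &item{storeID: 2, sumInterval: 55}
	fresh = accumulate(old, 1, 5, true)
	fmt.Println(old == fresh, old.storeID, old.sumInterval)
}
```

In the first call the old and new items are the same object, so the old peer's accumulated interval and store are changed by the new peer's report; in the second call they are independent.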

ti-chi-bot pushed a commit that referenced this issue Dec 7, 2021
… the interval is less than 60 (#4396)

* fix cache in 5.2/5.3 ref #4390

Signed-off-by: lhy1024 <admin@liudos.us>

* fix test

Signed-off-by: lhy1024 <admin@liudos.us>

* fix ci

Signed-off-by: lhy1024 <admin@liudos.us>

* address comment

Signed-off-by: lhy1024 <admin@liudos.us>

* address comment

Signed-off-by: lhy1024 <admin@liudos.us>

* fix ci

Signed-off-by: lhy1024 <admin@liudos.us>

* fix test

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 commented Dec 10, 2021

We call the third case (the store is different, there is no inheritable peer, and any peer is chosen as the old item) "adopt", as in taking a relative's child as one's own son or daughter.

Suppose there is a cluster with a region heartbeat interval of 5s.

At t1, the peer on store1 becomes cold while the other peers are still hot, so it adopts the item from store2 as its own. This is expected.

| store | source | hot/cold | sum of intervals |
| --- | --- | --- | --- |
| store1 | adopt from store2 at t1 | cold | 60s |
| store2 | inherit | hot | 55s |
| store3 | inherit | hot | 55s |

The problem occurs when the other peers then become cold as well. At t2, a peer on store2 or store3 adopts from store1, but that child was itself adopted from store2 at t1.

| store | source | hot/cold | sum of intervals |
| --- | --- | --- | --- |
| store1 | adopt from store2 at t1 | cold | 5s |
| store2 | adopt from store1 at t2 (it is from store2 at t1!) | cold | 60s |
| store3 | adopt from store1 at t2 (it is from store2 at t1!) | cold | 60s |

So we should avoid adopting from an adopted item before it is confirmed whether that item is hot.
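One way to express that rule is to mark adopted items and skip them as adoption sources until their hotness has been decided. The Go sketch below only illustrates the idea and is not the fix that was merged for this issue: the `adopted` flag, `pickAdoptable`, and `confirm` are hypothetical.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a cached hot peer statistic.
type item struct {
	storeID     uint64
	sumInterval int
	adopted     bool // hypothetical flag: adopted and not yet confirmed hot or cold
}

// pickAdoptable returns a clone of the first cached item that may be adopted
// by a peer on storeID. Items that were themselves adopted and are still
// unconfirmed are skipped, so an adopted item is never adopted again.
func pickAdoptable(items []*item, storeID uint64) *item {
	for _, it := range items {
		if it.storeID == storeID || it.adopted {
			continue
		}
		c := *it // clone so the source item is not written through the new one
		c.storeID = storeID
		c.adopted = true
		return &c
	}
	return nil
}

// confirm clears the flag once the item's hotness has been decided.
func (it *item) confirm() {
	it.adopted = false
}

func main() {
	items := []*item{
		{storeID: 1, sumInterval: 5, adopted: true}, // adopted at t1, unconfirmed
		{storeID: 3, sumInterval: 55},
	}
	got := pickAdoptable(items, 2)
	fmt.Println(got.storeID, got.sumInterval, got.adopted)

	// With only the unconfirmed adopted item available, nothing is adopted.
	fmt.Println(pickAdoptable(items[:1], 2) == nil)

	// After confirmation the item becomes a valid adoption source again.
	items[0].confirm()
	fmt.Println(pickAdoptable(items[:1], 2) != nil)
}
```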


nolouch commented Dec 13, 2021

/assign @lhy1024

@nolouch nolouch added the status/TODO The issue will be done in the future. label Dec 13, 2021
@mayjiang0203

/severity major

ti-chi-bot pushed a commit that referenced this issue Dec 20, 2021
* fix hot peer cache

Signed-off-by: lhy1024 <admin@liudos.us>

* fix

Signed-off-by: lhy1024 <admin@liudos.us>

* fix

Signed-off-by: lhy1024 <admin@liudos.us>

* add more test

Signed-off-by: lhy1024 <admin@liudos.us>

* fix ci

Signed-off-by: lhy1024 <admin@liudos.us>

* address comment

Signed-off-by: lhy1024 <admin@liudos.us>

* ref #4390

Signed-off-by: lhy1024 <admin@liudos.us>

* add comment and test

Signed-off-by: lhy1024 <admin@liudos.us>

* address comments

Signed-off-by: lhy1024 <admin@liudos.us>

* fix

Signed-off-by: lhy1024 <admin@liudos.us>

* add more test

Signed-off-by: lhy1024 <admin@liudos.us>

* add comment

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: ShuNing <nolouch@gmail.com>
@nolouch nolouch added needs-cherry-pick-release-4.0 The PR needs to cherry pick to release-4.0 branch. needs-cherry-pick-release-5.0 The PR needs to cherry pick to release-5.0 branch. needs-cherry-pick-release-5.1 Type: Need cherry pick to release-5.1 needs-cherry-pick-release-5.2 Type: Need cherry pick to release-5.2 needs-cherry-pick-release-5.3 Type: Need cherry pick to release-5.3 labels Dec 20, 2021
ti-chi-bot added a commit that referenced this issue Dec 21, 2021
… the interval is less than 60 (#4396) (#4432)

* This is an automated cherry-pick of #4396

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

* fix conflict without sync test

Signed-off-by: lhy1024 <admin@liudos.us>

* fix lint

Signed-off-by: lhy1024 <admin@liudos.us>

* ref #4390

Signed-off-by: lhy1024 <admin@liudos.us>

* statistics: fix hot peer cache (#4446)

* fix hot peer cache

Signed-off-by: lhy1024 <admin@liudos.us>

* fix

Signed-off-by: lhy1024 <admin@liudos.us>

* fix

Signed-off-by: lhy1024 <admin@liudos.us>

* add more test

Signed-off-by: lhy1024 <admin@liudos.us>

* fix ci

Signed-off-by: lhy1024 <admin@liudos.us>

* address comment

Signed-off-by: lhy1024 <admin@liudos.us>

* ref #4390

Signed-off-by: lhy1024 <admin@liudos.us>

* add comment and test

Signed-off-by: lhy1024 <admin@liudos.us>

* address comments

Signed-off-by: lhy1024 <admin@liudos.us>

* fix

Signed-off-by: lhy1024 <admin@liudos.us>

* add more test

Signed-off-by: lhy1024 <admin@liudos.us>

* add comment

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: ShuNing <nolouch@gmail.com>
Signed-off-by: lhy1024 <admin@liudos.us>

* fix test

Signed-off-by: lhy1024 <admin@liudos.us>

* revert log

Signed-off-by: lhy1024 <admin@liudos.us>

* remove todo

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
Co-authored-by: ShuNing <nolouch@gmail.com>
@jebter jebter added affects-4.0 affects-5.0 affects-5.1 affects-5.2 affects-5.3 and removed needs-cherry-pick-release-4.0 The PR needs to cherry pick to release-4.0 branch. needs-cherry-pick-release-5.0 The PR needs to cherry pick to release-5.0 branch. needs-cherry-pick-release-5.1 Type: Need cherry pick to release-5.1 needs-cherry-pick-release-5.2 Type: Need cherry pick to release-5.2 needs-cherry-pick-release-5.3 Type: Need cherry pick to release-5.3 labels Jan 18, 2022
ti-chi-bot added a commit that referenced this issue Jan 26, 2022
… the interval is less than 60 (#4396) (#4433)

* This is an automated cherry-pick of #4396

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

* fix test

Signed-off-by: lhy1024 <admin@liudos.us>

* ref #4390

Signed-off-by: lhy1024 <admin@liudos.us>

* fix ci

Signed-off-by: lhy1024 <admin@liudos.us>

* pick for #4446, #4512

Signed-off-by: lhy1024 <admin@liudos.us>

* fix test

Signed-off-by: lhy1024 <admin@liudos.us>

* update

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
ti-chi-bot added a commit that referenced this issue Feb 23, 2022
… the interval is less than 60 (#4396) (#4435)

close #4390, ref #4396, ref #4446

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
ti-chi-bot added a commit that referenced this issue Apr 14, 2022
… the interval is less than 60 (#4396) (#4434)

close #4390, ref #4396, ref #4446, ref #4512

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
Co-authored-by: 混沌DM <hundundm@gmail.com>