Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server/cluster: add regionState to record abnormal regions #5272

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

LLThomas
Copy link
Contributor

@LLThomas LLThomas commented Jul 6, 2022

What problem does this PR solve?

Issue Number: Ref #4684

What is changed and how does it work?

In order to track abnormal regions, this pr has built a generic checker that currently only detects regions that have expired due to machine downtime.

server/cluster: add a regionState to record abnormal regions

Check List

Tests

  • Unit test

Release note

None.

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Jul 6, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JmPotato

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels Jul 6, 2022
@codecov
Copy link

codecov bot commented Jul 6, 2022

Codecov Report

Merging #5272 (b840ec2) into master (c0a623f) will increase coverage by 0.06%.
The diff coverage is 100.00%.

❗ Current head b840ec2 differs from pull request most recent head ad34dfc. Consider uploading reports for the commit ad34dfc to get more accurate results

@@            Coverage Diff             @@
##           master    #5272      +/-   ##
==========================================
+ Coverage   75.67%   75.74%   +0.06%     
==========================================
  Files         312      313       +1     
  Lines       30966    30991      +25     
==========================================
+ Hits        23435    23473      +38     
+ Misses       5520     5515       -5     
+ Partials     2011     2003       -8     
Flag Coverage Δ
unittests 75.74% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/cluster/coordinator.go 71.16% <100.00%> (-1.33%) ⬇️
server/cluster/metrics.go 100.00% <100.00%> (ø)
server/cluster/region_stat.go 100.00% <100.00%> (ø)
pkg/errs/errs.go 75.00% <0.00%> (-25.00%) ⬇️
server/storage/storage.go 67.30% <0.00%> (-3.85%) ⬇️
server/region_syncer/server.go 82.96% <0.00%> (-3.30%) ⬇️
server/region_syncer/client.go 85.07% <0.00%> (-2.99%) ⬇️
server/storage/endpoint/meta.go 61.36% <0.00%> (-2.28%) ⬇️
server/schedule/hbstream/heartbeat_streams.go 72.72% <0.00%> (-2.03%) ⬇️
... and 18 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Member

@JmPotato JmPotato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not implement this feature through RegionStatistics? Checker should be more like a struct that will generate certain operators for scheduling purpose.

@LLThomas
Copy link
Contributor Author

LLThomas commented Jul 6, 2022

Why not implement this feature through RegionStatistics? Checker should be more like a struct that will generate certain operators for scheduling purpose.

You are right, I misunderstood the issue. I'll make a change.

Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
@LLThomas LLThomas requested a review from JmPotato July 6, 2022 08:53
@rleungx
Copy link
Member

rleungx commented Jul 6, 2022

FYI, #3901 is another way to implement it. But there is an extra cost since we need to scan more times.

@LLThomas LLThomas changed the title checker: add a checker to check abnormal regions server/cluster: add a regionState to record abnormal regions Jul 6, 2022
@LLThomas LLThomas changed the title server/cluster: add a regionState to record abnormal regions server/cluster: add regionState to record abnormal regions Jul 6, 2022

// region state type
const (
RegionStateDown regionStateType = 1 << iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variables of type private should not be public.
Why use 1 << iota?

type regionState struct {
sync.RWMutex
opt *config.PersistOptions
states map[regionStateType]map[uint64]*core.RegionInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace map with a fixed-length array.


// regionState is used to record abnormal regions.
type regionState struct {
sync.RWMutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design does not require a lock?

}

// GetRegionStateByType gets the states of the region by types. The regions here need to be cloned, otherwise, it may cause data race problems.
func (r *regionState) getRegionStateByType(typ regionStateType) []*core.RegionInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function only used in tests?

}

// Check verifies a region's state, recording it if need.
func (r *regionState) observe(region *core.RegionInfo) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could pass in a slice of regions and compute time.Now only once.

}

// Collect collects the metrics of the regions' states.
func (r *regionState) collect() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to clean up a region if it does not exist in the region tree?

Signed-off-by: LLThomas <zs033@qq.com>
)

// clearThreadhold indicates when we do cleanup operations.
// When the number of times the collect() function is executed is over than this threshold,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// When the number of times the collect() function is executed is over than this threshold,
// When the number of times the collect() function is executed exceeds this threshold,

regionStateTypeLen
)

// regionState is used to record abnormal regions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add some comments to show its difference with the RegionStatistics.

Suggested change
// regionState is used to record abnormal regions.
// regionState is used to record abnormal regions which could not be detected by the region heartbeat.
// For example, the down regions won’t send any heartbeat anymore.

Comment on lines 73 to 74
// Collect collects the metrics of the regions' states.
func (r *regionState) collect() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about renaming this function to collectAndClean to indicate the possible clean-up work?

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 13, 2022
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 16, 2022
@LLThomas LLThomas marked this pull request as ready for review July 18, 2022 14:16
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 18, 2022
@disksing disksing removed their request for review July 25, 2022 04:29
@nolouch
Copy link
Contributor

nolouch commented Jul 25, 2022

PTAL @HunDunDM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 1, 2022
@LLThomas
Copy link
Contributor Author

LLThomas commented Aug 1, 2022

Picture of down-region-count :

image

@LLThomas LLThomas requested a review from HunDunDM August 2, 2022 04:38

// region state type
const (
regionStateDown regionStateType = iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider replacing Down with NoLeader to prevent confusion with down-peer?

@@ -119,6 +119,14 @@ var (
Name: "store_sync",
Help: "The state of store sync config",
}, []string{"address", "state"})

regionStateGauge = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a duplication with regionStatusGauge of server/statistics/metrics.go.

type regionState struct {
cluster schedule.Cluster
states [regionStateTypeLen]map[uint64]*core.RegionInfo
count int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a more specific name.

@nolouch
Copy link
Contributor

nolouch commented Aug 17, 2022

ptal @LLThomas

Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
@ti-chi-bot
Copy link
Member

@LLThomas: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants