-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server/cluster: add regionState to record abnormal regions #5272
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: LLThomas <zs033@qq.com>
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
Signed-off-by: LLThomas <zs033@qq.com>
Codecov Report
@@ Coverage Diff @@
## master #5272 +/- ##
==========================================
+ Coverage 75.67% 75.74% +0.06%
==========================================
Files 312 313 +1
Lines 30966 30991 +25
==========================================
+ Hits 23435 23473 +38
+ Misses 5520 5515 -5
+ Partials 2011 2003 -8
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not implement this feature through RegionStatistics? Checker
should be more like a struct that will generate certain operators for scheduling purpose.
You are right, I misunderstood the issue. I'll make a change. |
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
FYI, #3901 is another way to implement it. But there is an extra cost since we need to scan more times. |
server/cluster/region_stat.go
Outdated
|
||
// region state type | ||
const ( | ||
RegionStateDown regionStateType = 1 << iota |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variables of type private should not be public.
Why use 1 << iota
?
server/cluster/region_stat.go
Outdated
type regionState struct { | ||
sync.RWMutex | ||
opt *config.PersistOptions | ||
states map[regionStateType]map[uint64]*core.RegionInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace map
with a fixed-length array.
server/cluster/region_stat.go
Outdated
|
||
// regionState is used to record abnormal regions. | ||
type regionState struct { | ||
sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current design does not require a lock?
server/cluster/region_stat.go
Outdated
} | ||
|
||
// GetRegionStateByType gets the states of the region by types. The regions here need to be cloned, otherwise, it may cause data race problems. | ||
func (r *regionState) getRegionStateByType(typ regionStateType) []*core.RegionInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function only used in tests?
server/cluster/region_stat.go
Outdated
} | ||
|
||
// Check verifies a region's state, recording it if need. | ||
func (r *regionState) observe(region *core.RegionInfo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could pass in a slice of regions and compute time.Now
only once.
server/cluster/region_stat.go
Outdated
} | ||
|
||
// Collect collects the metrics of the regions' states. | ||
func (r *regionState) collect() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to clean up a region if it does not exist in the region tree?
server/cluster/region_stat.go
Outdated
) | ||
|
||
// clearThreadhold indicates when we do cleanup operations. | ||
// When the number of times the collect() function is executed is over than this threshold, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// When the number of times the collect() function is executed is over than this threshold, | |
// When the number of times the collect() function is executed exceeds this threshold, |
server/cluster/region_stat.go
Outdated
regionStateTypeLen | ||
) | ||
|
||
// regionState is used to record abnormal regions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add some comments to show its difference with the RegionStatistics
.
// regionState is used to record abnormal regions. | |
// regionState is used to record abnormal regions which could not be detected by the region heartbeat. | |
// For example, the down regions won’t send any heartbeat anymore. |
server/cluster/region_stat.go
Outdated
// Collect collects the metrics of the regions' states. | ||
func (r *regionState) collect() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about renaming this function to collectAndClean
to indicate the possible clean-up work?
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
PTAL @HunDunDM |
server/cluster/region_stat.go
Outdated
|
||
// region state type | ||
const ( | ||
regionStateDown regionStateType = iota |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider replacing Down
with NoLeader
to prevent confusion with down-peer
?
@@ -119,6 +119,14 @@ var ( | |||
Name: "store_sync", | |||
Help: "The state of store sync config", | |||
}, []string{"address", "state"}) | |||
|
|||
regionStateGauge = prometheus.NewGaugeVec( | |||
prometheus.GaugeOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of a duplication with regionStatusGauge
of server/statistics/metrics.go
.
server/cluster/region_stat.go
Outdated
type regionState struct { | ||
cluster schedule.Cluster | ||
states [regionStateTypeLen]map[uint64]*core.RegionInfo | ||
count int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a more specific name.
ptal @LLThomas |
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
Signed-off-by: LLThomas <zs033@qq.com>
@LLThomas: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What problem does this PR solve?
Issue Number: Ref #4684
What is changed and how does it work?
In order to track abnormal regions, this pr has built a generic checker that currently only detects regions that have expired due to machine downtime.
Check List
Tests
Release note