Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store : add store limit to restrain bad store from occupying too much token limit. #12779

Merged
merged 41 commits into from
Nov 13, 2019

Conversation

AilinKid
Copy link
Contributor

@AilinKid AilinKid commented Oct 17, 2019

What problem does this PR solve?

In some cases, when a store is broken in tikv cluster, the store will hang all write requests sent to it, which will cause token limit exhausted in tidb server, consequently leading QPS to decrease sharply.

What is changed and how it works?

Add storeLimit in Store structure, we should do the count on it before sending the request to each store, eventually, control the traffic of each store.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Code changes

  • Store : add store limit check before sending request

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release note

  • add store limit to restrain bad store from occupying too much token limit.

@AilinKid AilinKid changed the title store : add store limit to restrain bad store fromoccupying store : add store limit to restrain bad store from occupying too much token limit. Oct 17, 2019
@shenli
Copy link
Member

shenli commented Oct 17, 2019

hi @AilinKid, Please refine the description.

@AilinKid
Copy link
Contributor Author

hi @AilinKid, Please refine the description.
ok, done

@AilinKid AilinKid force-pushed the tokenLimit branch 2 times, most recently from 0d23ba0 to d4c95e5 Compare October 17, 2019 05:06
@codecov
Copy link

codecov bot commented Oct 17, 2019

Codecov Report

Merging #12779 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #12779   +/-   ##
===========================================
  Coverage   80.2157%   80.2157%           
===========================================
  Files           471        471           
  Lines        114864     114864           
===========================================
  Hits          92139      92139           
  Misses        15579      15579           
  Partials       7146       7146

@AilinKid AilinKid force-pushed the tokenLimit branch 2 times, most recently from 2895d1e to e81f939 Compare October 18, 2019 03:26
@AilinKid
Copy link
Contributor Author

/rebuild

config/config.go Outdated
@@ -339,6 +340,7 @@ var defaultConf = Config{
SplitTable: true,
Lease: "45s",
TokenLimit: 1000,
StoreLimit: -1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to set the default value to 0 and make compatible with the old configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

@@ -15,6 +15,7 @@ package tikv

import (
"context"
"github.com/pingcap/tidb/config"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The github.com/** should be placed in an individual group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed,Done.

@@ -178,6 +192,27 @@ func (s *RegionRequestSender) sendReqToRegion(bo *Backoffer, ctx *RPCContext, re
return
}

func (s *RegionRequestSender) getStoreToken(storeID uint64) error {
s.storeLimit.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lock is too coarse. Maybe we can define a guard field in the store struct and use atomic to reduce contention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have tried to reconstruct the code to decrease granularity of the lock

@@ -167,7 +170,18 @@ func (s *RegionRequestSender) sendReqToRegion(bo *Backoffer, ctx *RPCContext, re
if e := tikvrpc.SetContext(req, ctx.Meta, ctx.Peer); e != nil {
return nil, false, errors.Trace(e)
}
resp, err = s.client.SendRequest(bo.ctx, ctx.Addr, req, timeout)
// judge the store limit switch.
if atomic.LoadInt32(&config.GetGlobalConfig().StoreLimit) != -1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "> 0" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed!

return nil
}
s.storeLimit.Unlock()
return errors.New("store token is up to the limit")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an error number and refined message are needed, such as 'maybe TiKV ...(IP) is unavailable'. Otherwise users don't know what to do in this emergency.

s.storeLimit.Lock()
limit, ok := s.storeLimit.limit[storeID]
if !ok {
limit = atomic.LoadInt32(&config.GetGlobalConfig().StoreLimit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if it's up to StoreLimit, and the user wants to raise StoreLimit, it won't work, is it right?
I think it's better in this way:s.storeLimit.limit[storeID] is initialized to 0, and increases each time a request comes in, then compares it to s.storeLimit.limit[storeID].
In this way, StoreLimit can be hot reloaded and work immediately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

@djshow832
Copy link
Contributor

hi @AilinKid, Please refine the description.
ok, done

It's still empty in "What is changed and how it works?"

@AilinKid
Copy link
Contributor Author

AilinKid commented Oct 30, 2019

hi @AilinKid, Please refine the description.
ok, done

It's still empty in "What is changed and how it works?"

good, done

@bb7133 bb7133 added type/enhancement The issue or PR belongs to an enhancement. needs-cherry-pick-2.1 and removed needs-cherry-pick-2.1 labels Oct 30, 2019

}

func (s *RegionRequestSender) releaseStoreToken(st *Store) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think st.tokenCount cannot be less than or equal to 0, even if it is really less than 0, it doesn't matter.
So prefer just st.tokenCount.Sub(1). (But it's not a strong opinion)
@djshow832 @bb7133

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok here. @.@

Copy link
Contributor

@lonng lonng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AilinKid AilinKid added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Nov 13, 2019
@@ -0,0 +1,8 @@
package storeutil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a license.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, Addressed.

# store-limit is used to restrain TiDB from sending request to some stores which is up to the limit.
# If a store has been up to the limit, it will return error for the successive request in same store.
# default 0 means shutting off store limit.
store-limit = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, addressed.

@@ -348,6 +348,8 @@ func ValidateSetSystemVar(vars *SessionVars, name string, value string) (string,
return value, ErrWrongValueForVar.GenWithStackByArgs(name, value)
case SQLSelectLimit:
return checkUInt64SystemVar(name, value, 0, math.MaxUint64, vars)
case TiDBStoreLimit:
return checkInt64SystemVar(name, value, 0, math.MaxInt64, vars)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add an upper limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard to define a valid upper limit (big table scanning can cause store limit increase). now is math.MaxInt64.

@AilinKid
Copy link
Contributor Author

/run-all-tests

Copy link
Contributor

@zimulala zimulala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zimulala zimulala added require-LGT3 Indicates that the PR requires three LGTM. status/LGT3 The PR has already had 3 LGTM. status/can-merge Indicates a PR has been approved by a committer. and removed component/DDL-need-LGT3 status/LGT2 Indicates that a PR has LGTM 2. labels Nov 13, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Nov 13, 2019

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Nov 13, 2019

@AilinKid merge failed.

@AilinKid
Copy link
Contributor Author

/run-all-tests

@AilinKid AilinKid merged commit 4283263 into pingcap:master Nov 13, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Nov 13, 2019

cherry pick to release-2.1 failed

@sre-bot
Copy link
Contributor

sre-bot commented Nov 13, 2019

cherry pick to release-3.0 failed

@sre-bot
Copy link
Contributor

sre-bot commented Nov 13, 2019

cherry pick to release-3.1 failed

@sre-bot
Copy link
Contributor

sre-bot commented Apr 7, 2020

It seems that, not for sure, we failed to cherry-pick this commit to release-2.1. Please comment '/run-cherry-picker' to try to trigger the cherry-picker if we did fail to cherry-pick this commit before. @AilinKid PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
require-LGT3 Indicates that the PR requires three LGTM. status/can-merge Indicates a PR has been approved by a committer. status/LGT3 The PR has already had 3 LGTM. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants