Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

br restore hang #27534

Closed
dbsid opened this issue Aug 24, 2021 · 4 comments · Fixed by #28497
Closed

br restore hang #27534

dbsid opened this issue Aug 24, 2021 · 4 comments · Fixed by #28497
Labels
component/br This issue is related to BR of TiDB. severity/critical type/bug The issue is confirmed as a bug.

Comments

@dbsid
Copy link
Contributor

dbsid commented Aug 24, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

tiup br:nightly restore db --db=test --pd xxxx:2679 --storage s3://benchmark/sysbench-32-10000000 --s3.endpoint http://xxx:9000 --send-credentials-to-tikv=true

2. What did you expect to see? (Required)

br successfully

3. What did you see instead (Required)

br does not exit with log

[2021/08/24 06:52:41.588 +00:00] [INFO] [split_client.go:538] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 642 is not fully replicated"] [attempt-remain=2]
[2021/08/24 06:52:44.790 +00:00] [INFO] [split_client.go:538] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 642 is not fully replicated"] [attempt-remain=1]
[2021/08/24 06:52:44.790 +00:00] [WARN] [split.go:293] ["scatter region failed, stop retry"] [region="{ID=642,startKey=7480000000000000FF3D5F728000000000FF4C37660000000000FA,endKey=7480000000000000FF3D5F728000000000FF52998D0000000000FA,epoch=\"conf_ver:1 version:352 \",peers=\"id:643 store_id:1 \"}"] [error="rpc error: code = Unknown desc = region 642 is not fully replicated; rpc error: code = Unknown desc = region 642 is not fully replicated; rpc error: code = Unknown desc = region 642 is not fully replicated; rpc error: code = Unknown desc = region 642 is not fully replicated; rpc error: code = Unknown desc = region 642 is not fully replicated; rpc error: code = Unknown desc = region 642 is not fully replicated; rpc error: code = Unknown desc = region 642 is not fully replicated"] [errorVerbose="the following errors occurred:\n -  rpc error: code = Unknown desc = region 642 is not fully replicated\n -  rpc error: code = Unknown desc = region 642 is not fully replicated\n -  rpc error: code = Unknown desc = region 642 is not fully replicated\n -  rpc error: code = Unknown desc = region 642 is not fully replicated\n -  rpc error: code = Unknown desc = region 642 is not fully replicated\n -  rpc error: code = Unknown desc = region 642 is not fully replicated\n -  rpc error: code = Unknown desc = region 642 is not fully replicated"]
[2021/08/24 06:52:44.791 +00:00] [INFO] [split_client.go:538] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 644 is not fully replicated"] [attempt-remain=7]
[2021/08/24 06:52:44.892 +00:00] [INFO] [split_client.go:538] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 644 is not fully replicated"] [attempt-remain=6]
[2021/08/24 06:52:45.093 +00:00] [INFO] [split_client.go:538] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 644 is not fully replicated"] [attempt-remain=5]
[2021/08/24 06:52:45.494 +00:00] [INFO] [split_client.go:538] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 644 is not fully replicated"] [attempt-remain=4]

4. What is your TiDB version? (Required)

nightly

Release Version: v5.1.1
Edition: Community
Git Commit Hash: 797bddd
Git Branch: heads/refs/tags/v5.1.1
UTC Build Time: 2021-07-28 11:00:26
GoVersion: go1.16.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

@dbsid dbsid added type/bug The issue is confirmed as a bug. component/br This issue is related to BR of TiDB. labels Aug 24, 2021
@YuJuncen
Copy link
Contributor

YuJuncen commented Aug 26, 2021

The root cause is the same as #27425.

@Yui-Song
Copy link
Contributor

The issue is still not fixed with the latest PR.

[INFO] [info.go:49] ["Welcome to Backup & Restore (BR)"] [release-version=v5.3.0-alpha-248-g091ae3bde] [git-hash=091ae3bde0a54b126f2f2ed275dd20b0986d74d3] [git-branch=master] [go-version=go1.16.4] [utc-build-time="2021-09-29 15:24:44"] [race-enabled=false]
……
[2021/09/29 16:25:23.628 +00:00] [INFO] [split_client.go:435] ["checking whether need to scatter"] [store=3] [max-replica=3]
[2021/09/29 16:25:23.628 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=7]
[2021/09/29 16:25:23.729 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=6]
[2021/09/29 16:25:23.930 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=5]
[2021/09/29 16:25:24.346 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=4]
[2021/09/29 16:25:25.147 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=3]
[2021/09/29 16:25:26.749 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=2]
[2021/09/29 16:25:29.963 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated"] [attempt-remain=1]
[2021/09/29 16:25:29.963 +00:00] [WARN] [split.go:295] ["scatter region failed, stop retry"] [region="{ID=2372,startKey=7480000000000000FF4F00000000000000F8,endKey=7480000000000000FF4F5F698000000000FF0000010380000000FF0000001F03800000FF0003566D38000000FC,epoch=\"conf_ver:5 version:175 \",peers=\"id:2373 store_id:1 ,id:2374 store_id:4 ,id:2375 store_id:5 \"}"] [error="rpc error: code = Unknown desc = region 2372 is not fully replicated; rpc error: code = Unknown desc = region 2372 is not fully replicated; rpc error: code = Unknown desc = region 2372 is not fully replicated; rpc error: code = Unknown desc = region 2372 is not fully replicated; rpc error: code = Unknown desc = region 2372 is not fully replicated; rpc error: code = Unknown desc = region 2372 is not fully replicated; rpc error: code = Unknown desc = region 2372 is not fully replicated"] [errorVerbose="the following errors occurred:\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated\n -  rpc error: code = Unknown desc = region 2372 is not fully replicated"]
[2021/09/29 16:25:29.972 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=7]
[2021/09/29 16:25:30.073 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=6]
[2021/09/29 16:25:30.274 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=5]
[2021/09/29 16:25:30.676 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=4]
[2021/09/29 16:25:31.490 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=3]
[2021/09/29 16:25:33.092 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=2]
[2021/09/29 16:25:36.292 +00:00] [INFO] [split_client.go:608] ["scatter region failed, retring"] [error="rpc error: code = Unknown desc = region 2376 is not fully replicated"] [attempt-remain=1]

@YuJuncen
Copy link
Contributor

This is caused by the placement rules. #27483 has fixed the problem by comparing max-replicas with TiKV nodes in the cluster and skip scattering if it seems one region can never fit the replica number requirement. But this doesn't consider the placement rules. However, in @Yui-Song's cluster, #28353 writes some extra rules which requires 3 more replicas, so the requirement cannot be reached.

The simplest way to fix it is the "Plan B" of fixing #27425: the root reason of being stuck is we scatter region sequentially:

for _, region := range newRegions {
// Wait for a while until the regions successfully split.
rs.waitForSplit(ctx, region.Region.Id)
if err := utils.WithRetry(ctx,
func() error { return rs.client.ScatterRegion(ctx, region) },
// backoff about 6s, or we give up scattering this region.
&scatterBackoffer{
attempt: 7,
baseBackoff: 100 * time.Millisecond,
},
); err != nil {
log.Warn("scatter region failed, stop retry", logutil.Region(region.Region), zap.Error(err))
}
}

If we call the scatter requests parallel or just move the scatter call outside the for loop and just call the batch version of scatter: 'ScatterRegions'. BR won't be stuck more than one full retry.

BTW, BR only get stuck at a unhealthy cluster (with many regions lose peer). IMO, maybe this bug shouldn't be so critical...

@github-actions
Copy link

github-actions bot commented Oct 9, 2021

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/br This issue is related to BR of TiDB. severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants