Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store info in region cache won't be refreshed if tiflash nodes were ever scaled-in #7408

Closed
zanmato1984 opened this issue May 3, 2023 · 4 comments
Labels
affects-5.0 affects-5.1 affects-5.2 affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. component/compute component/coprocessor severity/major type/bug The issue is confirmed as a bug.

Comments

@zanmato1984
Copy link
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Do some queries covering all tiflash nodes;
  2. Scale-in tiflash nodes;
  3. Do some queries again. This time, produce many remote reads by e.g. scale-out some tiflash nodes (to trigger region movement).

2. What did you expect to see? (Required)

The queries after scale-in succeed.

3. What did you see instead (Required)

Query failed with:

other error for mpp stream: ... get store failed:  2: invalid store ID ...

This is because that a stale store info (the scaled-in store) exists in region cache in tiflash client-c. When doing remote read, tiflash will ask pd about the latest region info using the stale store id. However, when encountering the error of invalid store id, the backoff logic doesn't invalidate the store info in the region cache before, causing tiflash keeps using the invalid store id until the backoff threshold is triggered and subsequently reports error.

4. What is your TiFlash version? (Required)

6.1.5

@zanmato1984 zanmato1984 added type/bug The issue is confirmed as a bug. component/coprocessor affects-5.0 affects-5.1 affects-5.2 affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 component/compute affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.2 affects-6.3 affects-6.4 affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-6.6 affects-7.0 affects-7.1 This bug affects the 7.1.x(LTS) versions. severity/major labels May 3, 2023
@solotzg
Copy link
Contributor

solotzg commented May 19, 2023

PD: v6.1.5

error message is put into grpc status error: https://github.com/tikv/pd/blob/v6.1.5/server/grpc_service.go#L468-L470
The behavior of tiflash is as expected because it's recommended to retry grpc error until timeout.

PD: master

error message is put into response of GetStore: https://github.com/tikv/pd/blob/bde0a1b4255e623c4c86d9dfe07fbf0bd552502b/server/grpc_service.go#L483-L488

We need to close it issue and let PD cherry pick commit bde0a1b4255e623c4c86d9dfe07fbf0bd552502b to related branches( <= release-6.1).

@windtalker
Copy link
Contributor

But I think we can not just cherry pick PD commit, since client-c does not handle the error in the response of GetStore?

https://github.com/tikv/client-c/blob/88d295d74502803780266a73873e6e1f685d4e13/src/pd/Client.cc#L423 just call return response.store();

@zanmato1984 zanmato1984 removed affects-6.2 affects-6.3 affects-6.4 affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-6.6 affects-7.0 affects-7.1 This bug affects the 7.1.x(LTS) versions. labels May 19, 2023
@zanmato1984
Copy link
Contributor Author

PD's fix is tikv/pd#5310. This PR will be backported into 6.1 and released in 6.1.7. Closing this one.

@nolouch
Copy link
Member

nolouch commented Jun 30, 2023

PD part only miss in release 6.1, and the cherry-pick is tikv/pd#6724

ti-chi-bot bot pushed a commit to tikv/pd that referenced this issue Jun 30, 2023
close #5161, ref #5309, ref #5310, ref pingcap/tiflash#7408

put the unknown error into the header instead of directly returning a gRPC error.

Signed-off-by: husharp <jinhao.hu@pingcap.com>

Co-authored-by: Hu# <jinhao.hu@pingcap.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.0 affects-5.1 affects-5.2 affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. component/compute component/coprocessor severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

4 participants