Configurable Replica Read Timeout with Retry Feature Request
Is your feature request related to a problem? Please describe:
One of the common problems when running TiDB in the cloud on network-attached disks (Amazon EBS, Google PD, or Azure managed disks) is temporarily elevated disk I/O latency. This can happen when a cloud provider storage node fails and goes through a repair procedure. During the repair phase, a network-attached disk exhibits 100 ms or even single-digit-second latency, versus single-digit-millisecond latency under normal conditions.
Describe the feature you'd like:
If a TiDB customer uses the Follower Read or Stale Read feature, a request that initially landed on a TiKV node whose network disk exhibits elevated latency can be retried on another TiKV replica. While a retry policy already exists in the tikv go-client, the default network timeout is tens of seconds.
An OLTP workload on TiDB could leverage a new system variable, tidb_tikv_read_timeout, which would be passed as a context timeout for the TiKV requests made by the TiDB layer, relying on the existing selector logic to retry requests on other replicas. The implementation of this feature also needs to take care of the following:
- The intermediate timeouts should still bubble up in metrics.
- A request may hit a DataIsNotReady error, but after the timeout it should go to another replica.
Describe alternatives you've considered:
TiDB already has a max_execution_time system variable, but it is not used as a context deadline in the go-client for network calls from TiDB to TiKV. Moreover, if a TiKV request takes longer than max_execution_time, the session is marked as killed and no retry happens.
Teachability, Documentation, Adoption, Migration Strategy:
The feature would be fully controlled by the session variable tidb_tikv_read_timeout.
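Assuming the variable is implemented as proposed (session scope; the millisecond unit and the Stale Read syntax shown are assumptions of this sketch, not part of the proposal text), adoption could look like:

```sql
-- Cap each TiKV read request for this session at 500 ms (hypothetical semantics);
-- requests exceeding it would be retried on another replica.
SET tidb_tikv_read_timeout = 500;

-- A Stale Read that benefits from the retry, since it may be served by any replica.
SELECT * FROM orders AS OF TIMESTAMP NOW() - INTERVAL 5 SECOND WHERE id = 1;
```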
Controlling the timeout behavior of the tikv-client is reasonable and requires such a parameter. However, the newly added variable overlaps with the existing tidb_load_based_replica_read_threshold variable. I personally suggest keeping only tidb_tikv_read_timeout and gradually deprecating tidb_load_based_replica_read_threshold in the future.