- Authors: cfzjywxk
- Tracking issue: TODO
TiKV requests are typically processed quickly, taking only a few milliseconds. However, if there is disk IO or network latency jitter on a specified TiKV node, the duration of these requests may spike to several seconds or more. This usually happens when there is an EBS issue like the EBS is repairing its replica.
Currently, the timeout limit for the TiKV requests is fixed and cannot be adjusted. For example:
ReadTimeoutShort
: 30 seconds, which is used for the Get RPC request.ReadTimeoutMedium
: 60 seconds, which is used for the BatchGet, Cop RPC request.
The kv-client or TiDB may wait for the responses from TiKV nodes until timeout happens, it could not retry or redirect the requests to other nodes quickly as the timeout configuration is usually tens of seconds. As a result, the application may experience a query latency spike when TiKV node jitter is encountered.
For latency sensitive applications, providing predictable sub-second read latency by set fast retrying
is valuable. queries that are usually very fast (such as point-select query), setting the value
of tikv_client_read_timeout
to short value like 500ms
, the TiDB cluster would be more tolerable
for network latency or io latency jitter for a single storage node, because the retry are more quickly.
This would be helpful if the requests could be processed by more than 1 candidate targets, for example follower or stale read requests that could be handled by multiple available peers.
A possible improvement suggested by #44771 is to make the timeout values of specific KV requests configurable. For example:
- Adding a session variable
tikv_client_read_timeout
, which is used to control the timeout for a single TiKV read RPC request. When the user sets the value of this variable, all read RPC request timeouts will use this value. The default value of this variable is 0, and the timeout of TiKV read RPC requests is still the original value ofReadTimeoutShort
andReadTimeoutMedium
. - Support statement level hint like
SELECT /*+ set_var(tikv_client_read_timeout=500) */ * FROM t where id = ?;
to set the timeout value of the KV requests of this single query to the certain value.
Consider the stale read usage case:
set @@tidb_read_staleness=-5;
# The unit is miliseconds. The session variable usage.
set @@tidb_tikv_tidb_timeout=500;
select * from t where id = 1;
# The unit is miliseconds. The query hint usage.
select /*+ set_var(tikv_client_read_timeout=500) */ * FROM t where id = 1;
- Setting the variable
tikv_client_read_timeout
may not be easy if it affects the timeout for all TiKV read requests, such as Get, BatchGet, Cop in this session.A timeout of 1 second may be sufficient for GET requests, but may be small for COP requests. Some large COP requests may keep timing out and could not be processed properly. - If the value of the variable or query hint
tikv_client_read_timeout
is set too small, more retries will occur, increasing the load pressure on the TiDB cluster. In the worst case the query would not return until the max backoff timeout is reached if thetikv_client_read_timeout
is set to a value which none of the replicas could finish processing within that time.
The query hint usage would be more flexible and safer as the impact is limited to a single query.
- Add related filed in the
KVSnapshot
struct like
type KVSnapshot struct {
...
KVReadTimeout: Duration
}
This change is to be done in the client-go
repository.
- Add set interface
SetKVReadTimeout
and get interfaceGetKVReadTimeout
forKVSnapshot
func (s *KVSnapshot) SetKVReadTimeout(KVTimeout Duration) {
s.KVReadTimeout = KVTimeout
}
This change is to be done both in the client-go
repository and tidb
repository.
- Call the set interfaces to pass the timeout value from
StmtContext
to theKVSnapshot
structures, when thePointGet
andBatchPointGet
executors are built. - Pass the timeout value into the
Context.max_execution_duration_ms
field of the RPCContext
like
message Context {
...
uint64 max_execution_duration_ms = 14;
...
}
Note by now this field is used only by write requests like Prewrite
and Commit
.
- Support timeout check during the request handling in TiKV. When there's new point get and batch point get
requests are received, the
kv_read_timeout
value should be read from theGetReuqest
andBatchGetRequest
requests and passed to thepointGetter
. By now there's no canceling mechanism when the task is scheduled to the read pool in TiKV, a simpler way is to check the deadline duration before next processing and try to return directly if the deadline is exceeded, but once the read task is spawned to the read pool it could not be canceled anymore.
pub fn get(
&self,
mut ctx: Context,
key: Key,
start_ts: TimeStamp,
) -> impl Future<Output = Result<(Option<Value>, KvGetStatistics)>> {
...
let res = self.read_pool.spawn_handle(
async move {
// TODO: Add timeout check here.
let snapshot =
Self::with_tls_engine(|engine| Self::snapshot(engine, snap_ctx)).await?;
{
// TODO: Add timeout check here.
...
let result = Self::with_perf_context(CMD, || {
let _guard = sample.observe_cpu();
let snap_store = SnapshotStore::new(
snapshot,
start_ts,
ctx.get_isolation_level(),
!ctx.get_not_fill_cache(),
bypass_locks,
access_locks,
false,
);
// TODO: Add timeout check here.
snap_store
.get(&key, &mut statistics)
});
...
...
);
}
These changes need to be done in the tikv
repository.
- Add related field in the
copTask
struct like
type copTask struct {
...
KVReadTimeout: Duration
}
- Pass the timeout value into the
Context.max_execution_duration_ms
field
message Context {
...
uint64 max_execution_duration_ms = 14;
...
}
- Use the
kv_read_timeout
value passed in to calculate thedeadline
result inparse_and_handle_unary_request
,
fn parse_request_and_check_memory_locks(
&self,
mut req: coppb::Request,
peer: Option<String>,
is_streaming: bool,
) -> Result<(RequestHandlerBuilder<E::Snap>, ReqContext)> {
...
req_ctx = ReqContext::new(
tag,
context,
ranges,
self.max_handle_duration, // Here use the specified timeout value.
peer,
Some(is_desc_scan),
start_ts.into(),
cache_match_version,
self.perf_level,
);
}
This change needs to be done in the tikv
repository.
- When timeout happens, retry the next available peer for the read requests like stale read and
non-leader-only
snapshot read requests. The strategy is:- Try the next peer if the first request times out immediately.
- If all the responses from all the peers are
timeout
errors, returntimeout
error to the upper layer, - Backoff with the
timeout
error and retry the requests with the original defaultReadShortTimeout
andReadMediumTimeout
values. - Continue to process like before.
- The timeout retry details should be recorded and handled properly in the request tracker, so are the metrics.
- If the timeout value is configured to a small value like hundreds of milliseconds, there could be more
timeout tasks. When task is spawned to the read pool in TiKV, it could not be canceled immediately for
some reasons:
- When the read task is polled and executed, there's no timeout checking mechanism in the task scheduler by now.
- The read operations are synchronous, so if the read task is blocked by slow io the task could not be scheduled or canceled.
- Though some simple timeout checks could be added in certain places like "after getting snapshot". A complete timeout check or scheduling design is still needed in the future. These preconditions are complex and require a lot of work including making the read processing asynchronous in TiKV.
The requests with configurable timeout values would take effect on newer versions. These fields are expected not to take effect when down-grading the cluster theoretically.
There’s already a query level timeout configuration max_execution_time
which could be configured using
variables and query hints.
If the timeout of RPC or TiKV requests could derive the timeout values from the max_execution_time
in an
intelligent way, it’s not needed to expose another variable or usage like tikv_client_read_timeout
.
For example, consider the strategy:
- The
max_execution_time
is configured to1s
on the query level - The first RPC timeout is configured to some proportional value of the
max_execution_time
like500ms
- When first RPC times out the retry RPC is configured to left value of the
max_execution_time
like400ms
From the application’s perspective, TiDB is trying to finish the requests ASAP considering the
input max_execution_time
value, and when TiDB could not finish in time the timeout error is returned
to the application and the query fails fast.
Though it's easier to use and would not introduce more configurations or variables, it's difficult to make the timeout calculation intelligent or automatic, considering things like different request types, node pressures and latencies, etc. A lot of experiments are needed to figure out the appropriate strategy.