Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The timeout value for TSO stream inaccurate #5207

Closed
Defined2014 opened this issue Jun 21, 2022 · 3 comments · Fixed by #5551
Closed

The timeout value for TSO stream inaccurate #5207

Defined2014 opened this issue Jun 21, 2022 · 3 comments · Fixed by #5551

Comments

@Defined2014
Copy link

Defined2014 commented Jun 21, 2022

Bug Report

pd/client/client.go

Lines 784 to 822 in f04b89e

streamChoosingLoop:
for {
connectionCtx := c.chooseStream(&connectionCtxs)
if connectionCtx != nil {
streamAddr, stream, streamCtx, cancel = connectionCtx.streamAddr, connectionCtx.stream, connectionCtx.ctx, connectionCtx.cancel
}
// Check stream and retry if necessary.
if stream == nil {
log.Info("[pd] tso stream is not ready", zap.String("dc", dc))
c.updateConnectionCtxs(dispatcherCtx, dc, &connectionCtxs)
if retryTimeConsuming >= c.option.timeout {
err = errs.ErrClientCreateTSOStream.FastGenByArgs("retry timeout")
log.Error("[pd] create tso stream error", zap.String("dc-location", dc), errs.ZapError(err))
c.ScheduleCheckLeader()
c.finishTSORequest(tbc.getCollectedRequests(), 0, 0, 0, errors.WithStack(err))
retryTimeConsuming = 0
continue tsoBatchLoop
}
select {
case <-dispatcherCtx.Done():
return
case <-time.After(time.Second):
retryTimeConsuming += time.Second
continue
}
}
retryTimeConsuming = 0
select {
case <-streamCtx.Done():
log.Info("[pd] tso stream is canceled", zap.String("dc", dc), zap.String("stream-addr", streamAddr))
// Set `stream` to nil and remove this stream from the `connectionCtxs` due to being canceled.
connectionCtxs.Delete(streamAddr)
cancel()
stream = nil
continue
default:
break streamChoosingLoop
}
}

See the link above, the retryTimeConsuming will increase 1s for every loop which is inaccurate. Because the updateConnectionCtxs function will take up to 5s when network unreachable. The [pd] tso stream is not ready log appears every 6 seconds and update leader every 4 * 6 = 24s.

[2022/06/21 15:22:39.372 +08:00] [ERROR] [client.go:792] ["[pd] create tso stream error"] [dc-location=global] [error="[PD:client:ErrClientCreateTSOStream]create TSO stream failed, retry timeout"]
[2022/06/21 15:22:39.372 +08:00] [INFO] [client.go:788] ["[pd] tso stream is not ready"] [dc=global]
[2022/06/21 15:22:39.372 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://127.0.0.1:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE"]
[2022/06/21 15:22:39.372 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://127.0.0.1:2382] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2382: connect: connection refused\" target:127.0.0.1:2382 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2382: connect: connection refused\" target:127.0.0.1:2382 status:TRANSIENT_FAILURE"]
[2022/06/21 15:22:39.372 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://127.0.0.1:2384] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2384: connect: connection refused\" target:127.0.0.1:2384 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2384: connect: connection refused\" target:127.0.0.1:2384 status:TRANSIENT_FAILURE"]
[2022/06/21 15:22:39.372 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://127.0.0.1:2379 http://127.0.0.1:2382 http://127.0.0.1:2384] error"]
[2022/06/21 15:22:39.579 +08:00] [INFO] [client_batch.go:581] ["batchRecvLoop re-create streaming fail"] [target=127.0.0.1:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/06/21 15:22:40.481 +08:00] [INFO] [client_batch.go:581] ["batchRecvLoop re-create streaming fail"] [target=127.0.0.1:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/06/21 15:22:40.559 +08:00] [INFO] [domain.go:578] ["server topology syncer restarted"]
[2022/06/21 15:22:41.278 +08:00] [INFO] [client_batch.go:573] ["batchRecvLoop re-create streaming success"] [target=127.0.0.1:20160] [forwardedHost=]
[2022/06/21 15:22:41.412 +08:00] [INFO] [client_batch.go:581] ["batchRecvLoop re-create streaming fail"] [target=127.0.0.1:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/06/21 15:22:43.279 +08:00] [INFO] [client_batch.go:573] ["batchRecvLoop re-create streaming success"] [target=127.0.0.1:20160] [forwardedHost=]
[2022/06/21 15:22:44.375 +08:00] [ERROR] [client.go:907] ["[pd] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""]
[2022/06/21 15:22:45.279 +08:00] [INFO] [client_batch.go:573] ["batchRecvLoop re-create streaming success"] [target=127.0.0.1:20160] [forwardedHost=]
[2022/06/21 15:22:45.376 +08:00] [INFO] [client.go:788] ["[pd] tso stream is not ready"] [dc=global]
[2022/06/21 15:22:46.587 +08:00] [INFO] [client_batch.go:581] ["batchRecvLoop re-create streaming fail"] [target=127.0.0.1:20160] [forwardedHost=] [error="context deadline exceeded"]
[2022/06/21 15:22:48.583 +08:00] [INFO] [client_batch.go:573] ["batchRecvLoop re-create streaming success"] [target=127.0.0.1:20160] [forwardedHost=]
[2022/06/21 15:22:50.379 +08:00] [ERROR] [client.go:907] ["[pd] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""]
[2022/06/21 15:22:51.380 +08:00] [INFO] [client.go:788] ["[pd] tso stream is not ready"] [dc=global]
[2022/06/21 15:22:51.637 +08:00] [INFO] [data_window.go:249] ["Error exists when getting the SQL Metric."]
[2022/06/21 15:22:56.383 +08:00] [ERROR] [client.go:907] ["[pd] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""]
[2022/06/21 15:22:57.384 +08:00] [INFO] [client.go:788] ["[pd] tso stream is not ready"] [dc=global]
[2022/06/21 15:23:02.387 +08:00] [ERROR] [client.go:907] ["[pd] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""]
[2022/06/21 15:23:02.387 +08:00] [ERROR] [client.go:792] ["[pd] create tso stream error"] [dc-location=global] [error="[PD:client:ErrClientCreateTSOStream]create TSO stream failed, retry timeout"]
[2022/06/21 15:23:02.387 +08:00] [INFO] [client.go:788] ["[pd] tso stream is not ready"] [dc=global]
[2022/06/21 15:23:02.388 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://127.0.0.1:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE"]
[2022/06/21 15:23:02.388 +08:00] [ERROR] [pd.go:236] ["updateTS error"] [txnScope=global] [error="[PD:client:ErrClientCreateTSOStream]create TSO stream failed, retry timeout"]
[2022/06/21 15:23:02.391 +08:00] [INFO] [base_client.go:350] ["[pd] switch leader"] [new-leader=http://127.0.0.1:2384] [old-leader=http://127.0.0.1:2379]
[2022/06/21 15:23:07.391 +08:00] [ERROR] [client.go:907] ["[pd] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""]
[2022/06/21 15:23:08.391 +08:00] [INFO] [client.go:788] ["[pd] tso stream is not ready"] [dc=global]
@Defined2014 Defined2014 added the type/bug The issue is confirmed as a bug. label Jun 21, 2022
@mayjiang0203
Copy link

/severity moderate

@King-Dylan
Copy link

We found this issue could cause tidb qps to drop zero for too long (v4.x 14s vs v6.x 18s),May need to escalate the priority to fix as soon as possible.
v4.x-qps0-14s.log
v6.x-18s-qps0.log
v6.x-12s-qps0.log

@nolouch
Copy link
Contributor

nolouch commented Aug 26, 2022

ptal @CabinfeverB

ti-chi-bot added a commit that referenced this issue Oct 10, 2022
close #5207

fix Stream timeout logic

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 10, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 10, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 10, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 10, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 10, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 10, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to pingcap/tidb that referenced this issue Oct 12, 2022
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Oct 13, 2022
close tikv#5207

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue Oct 21, 2022
close #5207, ref #5551

fix Stream timeout logic

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: Yongbo Jiang <cabinfeveroier@gmail.com>
Co-authored-by: Cabinfever_B <cabinfeveroier@gmail.com>
ti-chi-bot added a commit that referenced this issue Nov 1, 2022
close #5207, ref #5551

fix Stream timeout logic

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: Yongbo Jiang <cabinfeveroier@gmail.com>
Co-authored-by: Cabinfever_B <cabinfeveroier@gmail.com>
ti-chi-bot added a commit that referenced this issue Nov 14, 2022
close #5207, ref #5551

fix Stream timeout logic

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: Yongbo Jiang <cabinfeveroier@gmail.com>
Co-authored-by: Cabinfever_B <cabinfeveroier@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants