TimeoutInfo replay will produce deadlock #7833

Closed
TheWinds opened this issue Feb 16, 2022 · 6 comments

Comments

@TheWinds

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):
v0.34.x
v0.35.x

ABCI app (name for built-in, URL for self-written if it's publicly available):
any

Environment:

  • OS (e.g. from /etc/os-release):
  • Install tools:
  • Others:
    centos7

What happened:
The consensus module gets stuck (deadlocks) when replaying a timeoutInfo message.

What you expected to happen:

Have you tried the latest version: yes/no
yes

How to reproduce it (as minimally and precisely as possible):

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):

Config (you can paste only the changes you've made):

node command runtime flags:

Please provide the output from the http://<ip>:<port>/dump_consensus_state RPC endpoint for consensus bugs
No, this API call gets stuck as well.

Anything else we need to know:

Maybe this code should be moved before the catch-up replay?

https://github.com/tendermint/tendermint/blob/master/internal/consensus/state.go#L433-L440

The catchUpReplay method calls cs.scheduleTimeout, but the timeoutRoutine has not been started at that point, so the following code blocks:

func (t *timeoutTicker) ScheduleTimeout(ti timeoutInfo) {
	t.tickChan <- ti
}
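
The blocking send described above can be reproduced in isolation. The following is a minimal sketch, not Tendermint's code: `trySchedule`, `countAccepted`, and the capacity of 2 are hypothetical names and values chosen for the demo, and the real channel's capacity is not assumed here. It shows that with no goroutine receiving, sends succeed only until the channel's buffer is full; the next send would block forever without the timeout guard.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutInfo is a stand-in for the consensus type of the same name.
type timeoutInfo struct {
	Height int64
	Round  int32
}

// trySchedule performs the same channel send as ScheduleTimeout, but gives
// up after wait so the demo can report a blocked send instead of hanging.
func trySchedule(tickChan chan timeoutInfo, ti timeoutInfo, wait time.Duration) bool {
	select {
	case tickChan <- ti:
		return true
	case <-time.After(wait):
		return false
	}
}

// countAccepted sends up to n timeouts while no receiver is running and
// returns how many were accepted before a send would have blocked.
func countAccepted(capacity, n int) int {
	tickChan := make(chan timeoutInfo, capacity)
	accepted := 0
	for i := 0; i < n; i++ {
		ti := timeoutInfo{Height: 1, Round: int32(i)}
		if !trySchedule(tickChan, ti, 50*time.Millisecond) {
			break
		}
		accepted++
	}
	return accepted
}

func main() {
	// A capacity of 2 is a hypothetical value chosen for the demo.
	fmt.Println("accepted before blocking:", countAccepted(2, 5))
}
```

If the real channel is buffered, this would also be consistent with the hang appearing only when the replay schedules many timeouts, for example after several rounds.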
@williambanfield
Contributor

Thanks for filing the issue @TheWinds, would you mind attaching some logs from the process from when this happens?

@ancazamfir
Contributor

I also ran into this when restarting a node/validator during consensus with multiple rounds. I can confirm that moving the cs.timeoutTicker.Start(ctx) call before the replay loop resolves the deadlock.
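
The reordering can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual patch: `toyTicker`, `replay`, and `runFixedOrder` are hypothetical names, and the toy uses an unbuffered channel so that the effect of the ordering is visible on the very first send. Because the receiving goroutine is started before the replay loop, every scheduled timeout is consumed and the loop returns.

```go
package main

import (
	"context"
	"fmt"
)

// timeoutInfo is a stand-in for the consensus type of the same name.
type timeoutInfo struct{ Round int32 }

// toyTicker is a hypothetical, simplified ticker, not Tendermint's timeoutTicker.
type toyTicker struct {
	tickChan chan timeoutInfo
	received chan int // reports how many timeouts the routine consumed
}

func newToyTicker() *toyTicker {
	return &toyTicker{
		tickChan: make(chan timeoutInfo), // unbuffered: an assumption for the demo
		received: make(chan int, 1),
	}
}

// Start launches the receiving goroutine, which drains tickChan until ctx
// is cancelled and then reports the number of timeouts it consumed.
func (t *toyTicker) Start(ctx context.Context) {
	go func() {
		n := 0
		for {
			select {
			case <-t.tickChan:
				n++
			case <-ctx.Done():
				t.received <- n
				return
			}
		}
	}()
}

// ScheduleTimeout blocks until the receiving goroutine takes the value.
func (t *toyTicker) ScheduleTimeout(ti timeoutInfo) { t.tickChan <- ti }

// replay stands in for the catch-up replay loop: one timeout per round.
func replay(t *toyTicker, rounds int) {
	for r := 0; r < rounds; r++ {
		t.ScheduleTimeout(timeoutInfo{Round: int32(r)})
	}
}

// runFixedOrder starts the receiver first and replays afterwards.
func runFixedOrder(rounds int) int {
	ctx, cancel := context.WithCancel(context.Background())
	t := newToyTicker()
	t.Start(ctx)      // receiver is running before any send
	replay(t, rounds) // each send completes because someone is receiving
	cancel()
	return <-t.received
}

func main() {
	fmt.Println("timeouts consumed:", runFixedOrder(25))
}
```

Swapping the `t.Start(ctx)` and `replay(t, rounds)` calls in this toy makes the first send block indefinitely, which mirrors the hang reported in this issue.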

@TheWinds
Author

> Thanks for filing the issue @TheWinds, would you mind attaching some logs from the process from when this happens?

Because the replay process is blocked, no helpful logs are output. You can reproduce this problem with continuous writes at scale.

@TheWinds
Author

> I was also running into this with node/ validator restart during consensus with multiple rounds. I can confirm that moving the cs.timeoutTicker.Start(ctx) before the replay loop solves the locking.

Yes, I did that too.

@williambanfield
Contributor

backported to v0.35: #8082
backported to v0.34: #8079

@TheWinds
Author

good job
