
raft: Don't panic on commit index regression #10166

Closed
bdarnell opened this issue Oct 9, 2018 · 9 comments

bdarnell (Contributor) commented Oct 9, 2018

We occasionally see issues in production in which writes to our raft log are not fully persisted before MsgAppResps are sent (sometimes this is deliberate, as when users disable fsync for performance; sometimes it's a bug in one layer or another). The raft package does not handle this gracefully: if a follower has sent an MsgAppResp for a given index, the leader will send it that index as committed in future MsgApps. If the raft log was not written successfully, this causes the follower to crash with "panic: tocommit(265625) is out of range [lastIndex(265624)]. Was the raft log corrupted, truncated, or lost?".

The message is correct because the log was truncated, and this behavior could result in a loss of committed writes if it happened on multiple nodes. However, this is a very severe failure mode. The follower panics, and will repeatedly panic until its disk is wiped and it is started from scratch. In most cases a less extreme recovery path is possible. If the leader still has the log entries in question, it can simply re-send them, and if it does not, it can send a snapshot. I'd like to at least have the option to disable this panic and recover a node in-place when it reaches this state.

I haven't thought this all the way through, but I think the solution would involve synthesizing an MsgAppResp with Reject: true and an appropriate IndexHint when the panic would be triggered.
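
A minimal sketch of what that could look like, written against etcd/raft internals as they stand today (maybeCommitTo is an invented helper name for illustration; the hint would ride in the existing RejectHint field of pb.Message):

```go
// Sketch only: instead of panicking when the leader's commit index is
// ahead of our own log, reply with a rejecting MsgAppResp so the leader
// backs up and re-sends the missing entries (or falls back to a snapshot).
// maybeCommitTo is a hypothetical helper; it does not exist in etcd/raft.
func (r *raft) maybeCommitTo(m pb.Message) bool {
	if m.Commit > r.raftLog.lastIndex() {
		// We previously acknowledged entries that are no longer in our log
		// (e.g. an unsynced write was lost). Tell the leader where our log
		// actually ends instead of crashing the process.
		r.send(pb.Message{
			To:         m.From,
			Type:       pb.MsgAppResp,
			Index:      m.Index,
			Reject:     true,
			RejectHint: r.raftLog.lastIndex(),
		})
		return false
	}
	r.raftLog.commitTo(m.Commit)
	return true
}
```

Whether the leader's Progress tracking copes with a rejection for an index it already considers matched is exactly the part that would need thinking through.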

wenjiaswe (Contributor) commented

/cc @jpbetz

xiang90 (Contributor) commented Oct 12, 2018

> I'd like to at least have the option to disable this panic and recover a node in-place when it reaches this state.

How about adding a corruption hint in Ready, or something like that, so the upper layer can do something when raft thinks it is in a bad state?
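
For concreteness, that suggestion might look something like the following; the LogMismatch field is entirely made up here and does not exist in the raft package:

```go
// Hypothetical extension of the Ready struct, invented purely to
// illustrate the "corruption hint" idea above.
type Ready struct {
	// ... existing fields elided ...

	// LogMismatch, if non-nil, would report that raft's persisted log is
	// behind what it previously acknowledged (a commit index ahead of
	// lastIndex). The upper layer could then decide whether to crash,
	// wipe and rejoin, or wait for a snapshot.
	LogMismatch *LogMismatch
}

// LogMismatch is likewise hypothetical.
type LogMismatch struct {
	Committed uint64 // commit index the leader asked us to apply
	LastIndex uint64 // highest index actually present in our log
}
```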

bdarnell (Contributor, Author) commented

Maybe (that would be an alternative to the panics we have all over the place now), but I think it might be difficult to use correctly in this case. Here we know we're missing information but we can recover if we get the right log entries. "Corruption" in the general case is difficult (if not impossible) to handle without just crashing and giving up.

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Apr 7, 2020
stale bot closed this as completed Apr 28, 2020
chaochn47 (Member) commented Jun 30, 2022

Follow-up from the community meeting to revive this conversation. @serathius @lavacat

Another use case for not panicking right away:

  • adding a new member from an etcd db snapshot, to reduce the time until it is ready to accept client traffic.

To support this, a few more things may need to change: today an etcd db snapshot backup can only be used to create a fresh new cluster from that data, as a disaster recovery mechanism.

lavacat commented Jul 1, 2022

For reference, here is the original commit that added commitTo with the panic: #1740
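
For context, the check in question looks roughly like this in raft/log.go (paraphrased; see the linked commit for the exact original):

```go
// Paraphrase of raftLog.commitTo: the panic fires when the leader asks us
// to commit past the end of our own log.
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```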

stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 31, 2022
ahrtr (Member) commented Dec 31, 2022

@tbg Is this issue still valid? Please feel free to raise an issue in the new raft repo if needed.

tbg (Contributor) commented Jan 4, 2023

I raised it here again: etcd-io/raft#18
Thanks for the heads up.
