raft: Don't panic on commit index regression #10166

bdarnell · 2018-10-09T14:53:04Z

We occasionally see issues in production in which writes to our raft log are not fully persisted before MsgAppResps are sent. (Sometimes this is deliberate, as when users disable fsync for performance; sometimes it's a bug in some layer or another.) The raft package does not handle this gracefully: if a follower has sent an MsgAppResp for a given index, the leader will send it that index as committed in future MsgApps. If the raft log was not written successfully, this will cause the follower to crash with "panic: tocommit(265625) is out of range [lastIndex(265624)]. Was the raft log corrupted, truncated, or lost?".

The message is correct because the log was truncated, and this behavior could result in a loss of committed writes if it happened on multiple nodes. However, this is a very severe failure mode. The follower panics, and will repeatedly panic until its disk is wiped and it is started from scratch. In most cases a less extreme recovery path is possible. If the leader still has the log entries in question, it can simply re-send them, and if it does not, it can send a snapshot. I'd like to at least have the option to disable this panic and recover a node in-place when it reaches this state.

I haven't thought this all the way through, but I think the solution would involve synthesizing an MsgAppResp with Reject: true and an appropriate IndexHint when the panic would be triggered.
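
A rough sketch of that idea, under heavy assumptions: `appResp` is a simplified, hypothetical message type whose `Reject` and `IndexHint` fields simply mirror the wording above, and `handleCommit` is an invented helper rather than anything in the raft package. It only shows the shape of "reject instead of panic".

```go
package main

import "fmt"

// appResp is a hypothetical, simplified MsgAppResp. The field names follow
// the issue text and are assumptions, not the raftpb message definition.
type appResp struct {
	Reject    bool
	Index     uint64 // index this response refers to
	IndexHint uint64 // where the leader should resume sending from
}

// handleCommit decides how a follower reacts to the leader's commit index.
// It returns the follower's new commit index and the response to send.
func handleCommit(lastIndex, committed, leaderCommit uint64) (uint64, appResp) {
	if leaderCommit > lastIndex {
		// Entries the leader thinks we have are missing. Instead of
		// panicking, report our real last index so the leader can
		// re-send the entries (or fall back to a snapshot).
		return committed, appResp{
			Reject:    true,
			Index:     leaderCommit,
			IndexHint: lastIndex,
		}
	}
	if leaderCommit > committed {
		committed = leaderCommit // the commit index never decreases
	}
	return committed, appResp{Index: lastIndex}
}

func main() {
	committed, resp := handleCommit(265624, 265000, 265625)
	fmt.Printf("committed=%d resp=%+v\n", committed, resp)
}
```

Whether the leader can actually recover from such a rejection depends on it still holding the entries or being able to send a snapshot, as described above.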

wenjiaswe · 2018-10-11T18:13:18Z

/cc @jpbetz

xiang90 · 2018-10-12T22:09:53Z

> I'd like to at least have the option to disable this panic and recover a node in-place when it reaches this state.

How about adding a corruption hint to Ready, or something like that, so the upper layer can act when raft thinks it is in a bad state?

bdarnell · 2018-10-15T15:03:51Z

Maybe (that would be an alternative to the panics we have all over the place now), but I think it might be difficult to use correctly in this case. Here we know we're missing information but we can recover if we get the right log entries. "Corruption" in the general case is difficult (if not impossible) to handle without just crashing and giving up.

stale · 2020-04-07T04:12:03Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

chaochn47 · 2022-06-30T19:21:59Z

Following up from the community meeting to revive this conversation. @serathius @lavacat

Another use case for not panicking right away:

Adding a new member from an etcd db snapshot, to shorten the time until it is ready to accept client traffic.

To support this, a few more things may need to change. Right now, an etcd db snapshot taken as a backup is only meant for creating a fresh cluster from that data as a disaster recovery mechanism.

lavacat · 2022-07-01T06:22:37Z

For reference, here is the original commit that added commitTo with the panic: #1740

stale · 2022-12-31T23:11:26Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

ahrtr · 2022-12-31T23:22:42Z

@tbg Is this issue still valid? Please feel free to raise an issue in the new raft repo if needed.

tbg · 2023-01-04T08:26:54Z

I raised it here again: etcd-io/raft#18
Thanks for the heads up.

stale bot added the stale label Apr 7, 2020

stale bot closed this as completed Apr 28, 2020

serathius reopened this Jul 4, 2022

stale bot removed the stale label Jul 4, 2022

lavacat mentioned this issue Nov 24, 2022

etcdutl: add check broken command to check whether the data file is broked #14844

Closed

stale bot added the stale label Dec 31, 2022

ahrtr closed this as completed Dec 31, 2022

tbg mentioned this issue Jan 4, 2023

Better handling of detected invariant violations etcd-io/raft#18

Open

chaochn47 mentioned this issue Mar 24, 2023

Etcd corruption detection triggers during in-place cluster recovery #15548

Closed

CaojiamingAlan mentioned this issue May 17, 2023

Add AllowInvariantViolations config, commit index regression test and… etcd-io/raft#53

Closed
