
[Help Wanted] Possibility of data loss when server restarts immediately after key-value put with explained conditions #14364

Closed
hasethuraman opened this issue Aug 19, 2022 · 6 comments


@hasethuraman

I observed a possible data loss scenario, and I would like the community to comment or correct me if I am wrong.

Before explaining that, I would like to walk through the happy path when a user does a PUT <key, value>. I have tried to include only the steps necessary to focus on this issue, and I have considered a single etcd instance.

====================================================================================
----------api thread --------------

  1. User calls etcdctl PUT k v

  2. It lands in v3_server.go::put function with the message about k,v

  3. Call delegates to series of function calls and enters v3_server.go::processInternalRaftRequestOnce

  4. It registers for a signal with wait utility against this keyid

  5. Call delegates further to series of function calls and enters raft/node.go::stepWithWaitOption(..message..)

  6. It wraps this message in a msgResult channel and updates its result channel; then sends this message to propc channel.

  7. After sending it waits on msgResult.channel
    ----------api thread waiting --------------

  8. On seeing a message in the propc channel, raft/node.go::run() wakes up and a sequence of calls adds the message.Entries to raftLog

  9. Notifies the msgResult.channel

----------api thread wakes--------------
10. Upon seeing the result on msgResult.channel, the api thread wakes, returns down the stack to v3_server.go::processInternalRaftRequestOnce, and waits for the signal that it registered at step #4
----------api thread waiting --------------

  11. In the next iteration of raft/node.go::run(), it gets the entry from raftLog and adds it to readyc
  12. etcdserver/raft.go::start wakes up on seeing this entry in readyc and adds this entry to the applyc channel
  13. and synchronously writes to the wal log ---------------------> wal log
  14. etcdserver/server.go wakes up on seeing the entry in the applyc channel (added in step #12)
  15. From step #14, the call goes through a series of calls and lands in server.go::applyEntryNormal
  16. applyEntryNormal calls applyV3.apply, which eventually puts the KV into the mvcc kvstore txn kvindex
  17. applyEntryNormal now sends the signal for this key, which wakes up the api thread that is waiting at step #10 (registered at step #4)

----------api thread wakes--------------
18. The user thread wakes here and sends back the acknowledgement
----------user sees ok--------------

  19. The batcher flushes the entries added to the kvstore txn kvindex to the database file. (This can also happen before step #18, based on its timer.)
    ====================================================================================
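To make the hand-off in steps 4, 10 and 17 easier to follow, here is a minimal, self-contained Go sketch of the register/wait/trigger pattern described above. It only mirrors the spirit of etcd's pkg/wait utility; the type and function names below are illustrative, not etcd's actual code.

package main

import (
	"fmt"
	"sync"
	"time"
)

// wait is a minimal stand-in for the wait utility described in steps 4, 10 and 17.
type wait struct {
	mu sync.Mutex
	m  map[uint64]chan interface{}
}

func newWait() *wait { return &wait{m: make(map[uint64]chan interface{})} }

// Register returns a channel that receives the apply result for this request id (step 4).
func (w *wait) Register(id uint64) <-chan interface{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan interface{}, 1)
	w.m[id] = ch
	return ch
}

// Trigger delivers the apply result to whoever registered the id (step 17).
func (w *wait) Trigger(id uint64, result interface{}) {
	w.mu.Lock()
	ch := w.m[id]
	delete(w.m, id)
	w.mu.Unlock()
	if ch != nil {
		ch <- result
	}
}

func main() {
	w := newWait()
	const reqID = 42

	// Step 4: the api thread registers for the apply result and later blocks on it (step 10).
	done := w.Register(reqID)

	// Steps 11-17: the apply goroutine applies the entry, then signals the waiter.
	go func() {
		time.Sleep(10 * time.Millisecond) // pretend to apply the entry to the kvstore
		w.Trigger(reqID, "applied")       // step 17: wake the waiting api thread
	}()

	// Step 18: the api thread unblocks and acknowledges the user.
	fmt.Println("user sees:", <-done)
}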

Here, if the thread performing step #13 is pre-empted by the underlying operating system and only rescheduled after step #18 completes, and there is a power failure right at the end of step #18 (after the user has already seen OK), then the kv is written neither to the wal nor to the database file.

I think this is not seen today because the window is small: the server has to crash immediately after step #18, and immediately after step #12 the underlying OS must have pre-empted etcdserver/raft.go::start and put it at the end of the runnable queue. Given that these multiple conditions must coincide, it appears that we don't see data loss in practice.

But it appears from the code that it is possible. To simulate it, I added a sleep (and an exit) after step #12 and after step #19. I was able to see OK, but the data is in neither the wal nor the db.
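To illustrate the window concretely, here is a toy, self-contained Go simulation of the ordering described above (purely hypothetical code, not etcd's): the raft goroutine hands the entry to the apply path before the WAL save, the apply goroutine acknowledges the user, and the process dies before the WAL write ever happens.

package main

import (
	"fmt"
	"time"
)

func main() {
	applyc := make(chan string, 1)
	acked := make(chan struct{})

	// analogue of etcdserver/raft.go::start (steps 12-13)
	go func() {
		applyc <- "put k1=v1"             // step 12: hand the entry to the apply path
		time.Sleep(50 * time.Millisecond) // pre-emption / injected sleep before the WAL save
		fmt.Println("wal: entry saved")   // step 13: never printed, the process exits first
	}()

	// analogue of the etcdserver/server.go apply loop (steps 14-17)
	go func() {
		e := <-applyc
		fmt.Println("applied to in-memory store:", e) // step 16
		close(acked)                                  // step 17: wake the waiting api thread
	}()

	<-acked
	fmt.Println("user sees ok")
	// main returns here, i.e. the "server" dies before the WAL save: the put
	// was acknowledged but is neither in the WAL nor in the db file.
}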

If I am not correct, my apology and also please correct my understanding.

@ahrtr
Member

ahrtr commented Aug 19, 2022

Please provide the following info:

  1. What's the etcd version?
  2. Run etcdctl endpoint status -w json --cluster. Or provide the etcd configurations.
  3. The detailed steps to reproduce the issue if possible.

@hasethuraman
Author

@ahrtr
(1)
It is applicable to both 3.4 and 3.5
(2)
Nothing specific to any runtime status; this is from a code walkthrough.

(3)
This is from my own code walkthrough, to explain that there is a possibility of data loss under the conditions explained above. I will send the code diff to simulate the data loss if that is the ask. I may be missing something, so I reached out here to confirm.

@ahrtr
Member

ahrtr commented Aug 22, 2022

The workflow may not be correct. Refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md

Please open a new issue if you could reproduce the issue in your test environment.

@ahrtr ahrtr closed this as completed Aug 22, 2022
@hasethuraman
Author

@ahrtr what I see from the code is different from that document. Can I provide the code diff and necessary screenshots? I hope you also went through the steps I gave in the bug description.

@hasethuraman
Author

@ahrtr Please find the code changes and steps to repro below. Please note that these changes are just a sleep and an exit to simulate the condition that I explained in the bug description.

  1. Do the code changes in raft.go

     [screenshot: raft.go changes]

  2. Do the code changes in tx.go

     [screenshot: tx.go changes]

  3. Rebuild the etcd server

//1. Start etcd server with changes

//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (with just sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK

//3. Remove this control flag file and restart the etcd server
$ rm /tmp/exitnow

//4. Check if key present
$ ./bin/etcdctl get /k --prefix
$

// We can see no key-value
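In case the screenshots above do not render, the injected fault is roughly of the following shape (a hypothetical reconstruction, not the actual diff): a helper gated on the /tmp/exitnow flag file, called from etcdserver/raft.go between the applyc send and the WAL save, and from the backend tx.go path before the boltdb commit.

// Hypothetical reconstruction of the simulation hack (NOT the actual diff).
package etcdserver

import (
	"os"
	"time"
)

// crashIfFlagged delays long enough for the apply path to acknowledge the
// user, then kills the process, so neither the WAL entries nor the boltdb
// batch are persisted.
func crashIfFlagged() {
	// /tmp/exitnow is the control flag used in the repro steps above.
	if _, err := os.Stat("/tmp/exitnow"); err == nil {
		time.Sleep(5 * time.Second) // let the user see "OK" first
		os.Exit(1)                  // exit before the WAL / db writes complete
	}
}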

@ahrtr
Member

ahrtr commented Aug 23, 2022

Note that this is expected behavior by design. If you really want high availability, then you need to set up a cluster with at least 3 members.

Due to performance concerns with bboltDB, etcd commits the transaction periodically instead of committing on each request. So in theory it's possible that the bboltDB commit fails because of some system or hardware issue. But it isn't a problem if either of the following conditions is true:

  1. The local WAL entries are persisted;
  2. There are other healthy members;

If the local WAL entries are successfully persisted, then etcd replays the WAL entries on startup. If there are other healthy members, then the leader will sync the missing data to the other members, including the previously problematic one.

In your case, you intentionally created a situation in which both conditions are false, so it eventually caused data loss. Please note that it's beyond etcd's capacity to resolve such an extreme catastrophic situation, and I believe it's also beyond the capacity of any single project. You need to think about/resolve this at a higher level of the system architecture, for example with backup & restore.
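For illustration, here is a minimal Go sketch of the periodic-commit pattern described above, written against go.etcd.io/bbolt directly (this is not etcd's backend code; the file name, bucket name and commit trigger are assumptions for the example):

package main

import (
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One long-lived read-write transaction buffers the writes.
	tx, err := db.Begin(true)
	if err != nil {
		log.Fatal(err)
	}
	bucket, err := tx.CreateBucketIfNotExists([]byte("kv"))
	if err != nil {
		log.Fatal(err)
	}

	// Writes land in the open transaction, not yet in the db file.
	for k, v := range map[string]string{"k1": "v1", "k2": "v2"} {
		if err := bucket.Put([]byte(k), []byte(v)); err != nil {
			log.Fatal(err)
		}
	}

	// The commit happens on a timer (etcd's batch interval defaults to 100ms),
	// not per request.
	<-time.After(100 * time.Millisecond)
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}

If the process dies before tx.Commit(), demo.db contains neither k1 nor k2; that is exactly the window that condition 1 (WAL replay) or condition 2 (other healthy members) is meant to cover.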
