
vault HA problem in vault 0.7.3 #3031

Closed
shsjshentao opened this issue Jul 18, 2017 · 13 comments

@shsjshentao

vault 0.7.3
etcd 3.1.9
API v3

Vault cannot fail over leadership to the other standby nodes.

Error log:
2017/07/18 03:28:39.935689 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found

@xiang90
Contributor

xiang90 commented Jul 18, 2017

How to reproduce this issue?

@shsjshentao
Author

shsjshentao commented Jul 18, 2017

Three nodes.
Shut down the active node.
On the other two nodes the Vault leader is <none>.
Start the active node again.
All three nodes still report the leader as <none>.

Like this:

[root@SGDLITVM0819 ~]# vault status -tls-skip-verify
Sealed: false
Key Shares: 1
Key Threshold: 1
Unseal Progress: 0
Unseal Nonce:
Version: 0.7.3
Cluster Name: vault-cluster-b7c6edd5
Cluster ID: 2c3ccf8c-f387-6bd4-8cb1-5b4f1551ad98

High-Availability Enabled: true
Mode: standby
Leader: <none>

@xiang90
Contributor

xiang90 commented Jul 18, 2017

@shsjshentao

I tried, but it works; I cannot reproduce what you described.

Can you provide a step-by-step script? (How did you initialize Vault? How did you shut down the active node? What log output should I expect to see?)

@shsjshentao
Author

This problem happens to me often.
I reinstalled the three masters just now.
I found that the command "vault step-down" also does not work.
The other two nodes cannot take over leadership and log [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found.
I also found that etcd performance is poor, possibly because of my HDD.

2017/07/19 07:31:14.425524 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found
2017-07-19T07:31:15Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6458]: INFO Backend set to env
2017-07-19T07:31:15Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6458]: INFO Starting confd
2017-07-19T07:31:15Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6458]: INFO Backend nodes set to
2017-07-19T07:31:17Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6466]: INFO Backend set to env
2017-07-19T07:31:17Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6466]: INFO Starting confd
2017-07-19T07:31:17Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6466]: INFO Backend nodes set to
2017-07-19T07:31:19Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6474]: INFO Backend set to env
2017-07-19T07:31:19Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6474]: INFO Starting confd
2017-07-19T07:31:19Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6474]: INFO Backend nodes set to
2017-07-19T07:31:21Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6483]: INFO Backend set to env
2017-07-19T07:31:21Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6483]: INFO Starting confd
2017-07-19T07:31:21Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6483]: INFO Backend nodes set to
2017-07-19T07:31:23Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6491]: INFO Backend set to env
2017-07-19T07:31:23Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6491]: INFO Starting confd
2017-07-19T07:31:23Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6491]: INFO Backend nodes set to
2017/07/19 07:31:24.430583 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found

2017-07-19 07:21:09.939587 I | raft: 2046c458caf5773f [logterm: 7, index: 15633, vote: 0] ignored MsgVote from d33e6e92eef6099a [logterm: 7, index: 15633] at term 7: lease is not expired (remaining ticks: 10)
2017-07-19 07:21:09.948538 I | raft: 2046c458caf5773f [term: 7] received a MsgApp message with higher term from d33e6e92eef6099a [term: 14]
2017-07-19 07:21:09.948585 I | raft: 2046c458caf5773f became follower at term 14
2017-07-19 07:21:13.337983 W | etcdserver: timed out waiting for read index response
2017-07-19 07:21:18.488810 W | wal: sync duration of 20.093414577s, expected less than 1s
2017-07-19 07:22:15.515277 I | mvcc: store.index: compact 3106
2017-07-19 07:22:15.518657 I | mvcc: finished scheduled compaction at 3106 (took 1.72943ms)
2017-07-19 07:24:49.255967 W | etcdserver: apply entries took too long [20.245638093s for 1 entries]
2017-07-19 07:24:49.256026 W | etcdserver: avoid queries with large range/delete range!
2017-07-19 07:25:17.555745 W | etcdserver: timed out waiting for read index response
2017-07-19 07:25:24.556360 W | etcdserver: timed out waiting for read index response
2017-07-19 07:25:30.152730 W | wal: sync duration of 20.17741517s, expected less than 1s
2017-07-19 07:25:30.176633 W | etcdserver: ignored out-of-date read index response (want [119 63 93 89 151 89 82 208], got [119 63 93 89 151 89 82 216])
2017-07-19 07:25:31.834635 W | wal: sync duration of 1.09778233s, expected less than 1s
2017-07-19 07:26:15.480884 W | etcdserver: timed out waiting for read index response
2017-07-19 07:26:22.481279 W | etcdserver: timed out waiting for read index response
2017-07-19 07:26:26.944909 W | wal: sync duration of 20.207759118s, expected less than 1s
2017-07-19 07:26:26.955119 W | etcdserver: ignored out-of-date read index response (want [119 63 93 89 151 89 83 160], got [119 63 93 89 151 89 83 184])
2017-07-19 07:26:26.957652 W | etcdserver: ignored out-of-date read index response (want [119 63 93 89 151 89 83 175], got [119 63 93 89 151 89 83 184])
2017-07-19 07:27:08.779148 W | wal: sync duration of 1.043000802s, expected less than 1s
2017-07-19 07:27:15.528865 I | mvcc: store.index: compact 3615
2017-07-19 07:27:15.531434 I | mvcc: finished scheduled compaction at 3615 (took 1.434634ms)
2017-07-19 07:28:35.413686 W | etcdserver: timed out waiting for read index response
2017-07-19 07:28:42.414061 W | etcdserver: timed out waiting for read index response
2017-07-19 07:28:47.959655 W | wal: sync duration of 20.182056724s, expected less than 1s
2017-07-19 07:31:42.777420 W | wal: sync duration of 1.042830191s, expected less than 1s
2017-07-19 07:32:15.547140 I | mvcc: store.index: compact 4148
2017-07-19 07:32:15.550025 I | mvcc: finished scheduled compaction at 4148 (took 2.002339ms)

@xiang90
Contributor

xiang90 commented Jul 25, 2017

2017-07-19 07:28:47.959655 W | wal: sync duration of 20.182056724s, expected less than 1s

Your etcd cluster is not in good shape; fsync is taking far too long.

@bhiggins
Contributor

I believe it's because vault lost its etcd lease. Looking at the etcd client code (that vault uses), I don't see anything for handling that situation, e.g. code for acquiring a new lease.

Due to 64d412e, I believe if you want vault + etcd to be healthy, you must ensure vault and etcd can always talk to each other within 15 seconds (used to be 60 seconds). If that's not possible, due to fsync, or network problems, or for any other reason, you will need to restart vault to acquire a new etcd lease.
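
For context, here is a minimal sketch (not Vault's actual backend code) of how an etcd v3 concurrency session ties a lock to a lease with a TTL; the endpoint, lock prefix, and the 15-second TTL below are illustrative assumptions:

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
	// Connect to a local etcd endpoint (assumed address).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session grants a lease and keeps it alive in the background.
	// A 15-second TTL mirrors the window described above; the exact
	// option Vault passes is not shown in this thread.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// The lock key is written with the session's lease attached. If the
	// lease has already expired, etcd rejects the attempt with
	// "etcdserver: requested lease not found".
	mutex := concurrency.NewMutex(session, "/vault/core/lock")
	if err := mutex.Lock(context.TODO()); err != nil {
		log.Printf("failed to acquire lock: %v", err)
	}
}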

@bhiggins
Contributor

I looked into this a bit more, because we've seen the same problem, where a standby vault will eventually end up printing out "core: failed to acquire lock: error=etcdserver: requested lease not found" over and over after the leader goes down. I don't know what caused it to get into that state (network problems? etcd problems? otherwise?) and to some degree, I don't care. I just want things to recover eventually :-)

I don't know much about etcd, but it seems to me that EtcdLock's Lock() function needs to check if the session is still valid (i.e. is the channel returned by Done() still open?). If the session isn't valid anymore--because the lease expired, I assume--then a new one needs to be made, instead of continuing to use the old session.
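
Roughly, as a sketch of that idea (the type and field names below are illustrative, not the backend's actual ones):

package etcd3

import (
	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

// EtcdLock here is a simplified stand-in for the backend's lock type.
type EtcdLock struct {
	etcd        *clientv3.Client
	etcdSession *concurrency.Session
	etcdMu      *concurrency.Mutex
	prefix      string
}

// ensureSession replaces the session (and the mutex built on it) when the
// previous lease has expired, which the client signals by closing the
// channel returned by Done().
func (c *EtcdLock) ensureSession() error {
	select {
	case <-c.etcdSession.Done():
		// The old lease is gone; make a new session instead of reusing
		// the dead one. (TTL options omitted for brevity.)
		session, err := concurrency.NewSession(c.etcd)
		if err != nil {
			return err
		}
		c.etcdSession = session
		c.etcdMu = concurrency.NewMutex(session, c.prefix)
	default:
		// Session is still alive; keep using it.
	}
	return nil
}

Lock() would call something like this before each acquisition attempt.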

If that sounds reasonable, I can provide a patch.

@bhiggins
Contributor

My colleague @willmo wonders if there may be a better way to address this.

The question is around the Lock interface, specifically what to do if an error is returned from Lock(). The code in core.go, specifically acquireLock and runStandby (which calls acquireLock), assumes that all errors returned from Lock() will be transient; in other words, an error won't mean that the lock is permanently unusable. However, for etcd3, this is currently not the case. The lock returned by LockWith() can become invalid if its associated session/lease expires. Without knowing more about Consul and Zookeeper, it's not immediately clear if a similar problem might exist with those backends as well.

That leads us to ask: what should change here? Should core.go change so it calls LockWith() again if Lock() returns an error? Should the etcd3 backend change (which was my proposal above) and the Lock documentation improve to explicitly state the contract? Even if the current contract remains, for the sake of robustness, should we do both?
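
For illustration, a sketch of what the core.go-side alternative could look like against the physical.HABackend interface (the constants, retry policy, and helper name here are made up, not Vault's actual code):

package vault

import (
	"time"

	"github.com/hashicorp/vault/physical"
)

// Illustrative values; core.go defines its own lock path and retry delay.
const (
	coreLockPath      = "core/lock"
	lockRetryInterval = 10 * time.Second
)

// acquireLockWithRetry discards the lock object whenever Lock() fails and
// asks the backend for a brand-new one via LockWith(), so a permanently
// broken lock (for example, one whose etcd lease expired) cannot wedge a
// standby forever.
func acquireLockWithRetry(b physical.HABackend, stopCh <-chan struct{}) (physical.Lock, <-chan struct{}, error) {
	for {
		lock, err := b.LockWith(coreLockPath, "held")
		if err != nil {
			return nil, nil, err
		}
		leaderCh, err := lock.Lock(stopCh)
		if err == nil {
			return lock, leaderCh, nil
		}
		// Wait a bit, then loop around and create a fresh lock.
		select {
		case <-time.After(lockRetryInterval):
		case <-stopCh:
			return nil, nil, nil
		}
	}
}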

What do you think, @jefferai and @armon?

@xiang90
Contributor

xiang90 commented Oct 24, 2017

I am going to take a look.

@bhiggins
Contributor

@xiang90, thanks for taking a look.

Just wanted to note how to reproduce this:

  1. Have 2 (or more) vaults in HA configuration using etcd3 as a backend.
  2. Manually revoke one of the standby vault's leases from etcd3.
  3. Kill the vault leader.

After this, the standby that had its lease removed will start logging "requested lease not found" every 10 seconds and never recover.

In the real world, my guess for what took the place of step (2) in our case was temporary network problems.

@bhiggins
Contributor

@xiang90, please take a look at my PR. Thanks.

bhiggins added a commit to bhiggins/vault that referenced this issue Oct 31, 2017
This change makes these errors transient instead of permanent:

[ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found

After this change, there can still be one of these errors when a
standby vault that lost its lease tries to become leader, but on the
next lock acquisition attempt a new session will be created. With this
new session, the standby will be able to become the leader.
jefferai pushed a commit that referenced this issue Nov 3, 2017
This change makes these errors transient instead of permanent:

[ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found

After this change, there can still be one of these errors when a
standby vault that lost its lease tries to become leader, but on the
next lock acquisition attempt a new session will be created. With this
new session, the standby will be able to become the leader.
@xiang90
Contributor

xiang90 commented Nov 3, 2017

@jefferai, can you please close this one? It is fixed by @bhiggins's PR.

@jefferai
Member

jefferai commented Nov 3, 2017

Sure!

@jefferai jefferai closed this as completed Nov 3, 2017