
Continuously attempt to unseal if sealed keys are supported #6039

Merged: 4 commits merged into hashicorp:master from sethvargo/watch_init on Jan 23, 2019
Conversation

sethvargo (Contributor):

This fixes a bug / adds a missing feature that occurs when bootstrapping an initial cluster. Given a collection of Vault nodes and an uninitialized storage backend, they will all go into standby, waiting for initialization. After one node is initialized, the other nodes have no mechanism by which they "re-check" whether unseal keys are present. This adds a goroutine to the server command which continually waits for unseal keys to exist. It exits under the following conditions:

  • the node is unsealed
  • the node does not support stored keys
  • a fatal error occurs (as defined by Vault)
  • the server is shutting down

In all other situations, the routine wakes up at the specified interval and attempts to unseal with the stored keys.
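For illustration, here is a minimal sketch of what such a retry loop can look like in plain Go. The `sealer` interface, the `errUnsupported` sentinel, and the `isFatal` callback are stand-ins for the corresponding pieces of Vault's server command and core, not its actual API:

```go
package server

import (
	"context"
	"errors"
	"log"
	"time"
)

// sealer is a stand-in for the small slice of vault.Core this sketch needs;
// the method names mirror the real ones, but the interface is illustrative.
type sealer interface {
	Sealed() bool
	UnsealWithStoredKeys(ctx context.Context) error
}

// errUnsupported is an illustrative sentinel for "this seal has no stored keys".
var errUnsupported = errors.New("stored unseal keys are not supported")

// runUnsealLoop wakes at the given interval and retries the stored-key unseal.
// It returns when the node is unsealed, stored keys are unsupported, a fatal
// error occurs, or ctx is cancelled (server shutdown).
func runUnsealLoop(ctx context.Context, c sealer, interval time.Duration, isFatal func(error) bool) {
	t := time.NewTicker(interval)
	defer t.Stop()

	for {
		select {
		case <-ctx.Done():
			return // server is shutting down
		case <-t.C:
		}

		if !c.Sealed() {
			return // node is already unsealed
		}

		err := c.UnsealWithStoredKeys(ctx)
		switch {
		case err == nil:
			return // unsealed successfully
		case errors.Is(err, errUnsupported):
			return // seal does not support stored keys; stop retrying
		case isFatal(err):
			log.Printf("[ERROR] seal: fatal error during auto-unseal: %v", err)
			return
		default:
			log.Printf("[WARN] seal: unseal attempt failed, will retry: %v", err)
		}
	}
}
```

The loop deliberately treats "unsupported" and "fatal" as terminal, matching the exit conditions listed above, so it never keeps retrying attempts that cannot succeed.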

Why is this important?

When running on a scheduler for the first time with auto-unseal configured, Vault will start all the pods/services, but the Vault nodes all immediately enter a "deadlock" mode in which they wait for unseal keys (because the config says stored keys are supported), yet no keys exist because the storage backend is uninitialized. Without this change, after initializing the storage backend, the only way to get the pods to unseal themselves is to fully restart them. This isn't a great experience and is incredibly difficult to automate.

Other things considered

Have a sidecar container that runs vault init ...

This works great in theory, but Vault's storage backends don't support CAS operations. If multiple Vault servers start at the same time (as they would when scheduled by something like Kubernetes or Nomad), the sidecar container would run vault init... on each of them. Depending on the storage backend, there's a race condition where one init process will overwrite another between the time of the init call and the time of the data persistence. This sounds rare, but in practice I've been able to reproduce it reliably 88% of the time. See also #5356.
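As a toy illustration of that race (not Vault code; all names here are made up): with a blind overwrite, both racing init attempts appear to succeed and the later persistence silently discards the earlier one, whereas a put-if-absent/CAS-style primitive would let only the first initializer win and make the second fail loudly.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy stand-in for a storage backend keyed by path.
type store struct {
	mu   sync.Mutex
	data map[string]string
}

// put blindly overwrites: this models a backend without CAS, where the
// second init's persistence clobbers the first.
func (s *store) put(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = val
}

// putIfAbsent models the CAS-style primitive the description says is
// missing: only the first writer succeeds.
func (s *store) putIfAbsent(key, val string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.data[key]; ok {
		return false
	}
	s.data[key] = val
	return true
}

func main() {
	s := &store{data: map[string]string{}}
	s.put("root-of-trust", "keys-from-init-A")
	s.put("root-of-trust", "keys-from-init-B") // silently wins the race
	fmt.Println(s.data["root-of-trust"])       // keys-from-init-B

	s2 := &store{data: map[string]string{}}
	fmt.Println(s2.putIfAbsent("root-of-trust", "keys-from-init-A")) // true
	fmt.Println(s2.putIfAbsent("root-of-trust", "keys-from-init-B")) // false: race detected
}
```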

Add / update API

I considered adding a new API like sys/unseal-check, and I considered updating the existing sys/unseal endpoint to not require a key parameter and instead try to unseal with stored keys. I lied: I didn't consider it, I actually wrote it first. It worked; I deployed and tested it. It just didn't "feel" correct. It's difficult to orchestrate an API call to all the Vault nodes after their deployment, and it honestly felt like an unnecessary extra step. I kept asking why Vault wouldn't just do this for me, since it's already pointed at the shared storage backend. I'm not opposed to going back to this route, but it felt wrong.

Check on HUP

This is probably the simplest option in terms of code changes. We simply update the reload handlers to attempt an auto-unseal. The drawbacks, however, are not as simple. Just like the new API endpoint, it's difficult to orchestrate sending signals across multiple pods. Additionally, it creates a weird edge case where someone manually sealed a Vault node, but another process was able to HUP it, effectively unsealing it. It felt a little priv-esc to me, and I think there's probably a security hole there if you look hard enough.

Do nothing

This is my least favorite option, as it pushes far too much orchestration back onto the user.


hashicorp-cla commented Jan 14, 2019

CLA assistant check
All committers have signed the CLA.

@@ -741,7 +741,7 @@ CLUSTER_SYNTHESIS_COMPLETE:
	// Initialize the core
	core, newCoreError := vault.NewCore(coreConfig)
	if newCoreError != nil {
		if !errwrap.ContainsType(newCoreError, new(vault.NonFatalError)) {
sethvargo (Contributor, Author):

These changes are somewhat unrelated, but the "if not contains a non-fatal error" got really confusing.

@@ -937,13 +937,40 @@ CLUSTER_SYNTHESIS_COMPLETE:
		return 1
	}

	// Attempt an initial unseal.
	if err := core.UnsealWithStoredKeys(context.Background()); err != nil {
sethvargo (Contributor, Author):

Not sure what to do here. We already have code that attempts an unseal with stored keys. The previous method would return nil if unseal keys aren't supported; now it returns a fatal error, so I think we need to wrap these calls with a check to see if unseal keys are supported. Additionally, I'm not sure we need this block at all, since the goroutine will do it later. The only difference is that the goroutine won't force the CLI to exit on error (it'll just log).
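If it helps, a sketch of the guard described here could look like the following. `SupportsStoredKeys` and `isFatal` are hypothetical stand-ins (one for however the seal reports stored-key support, one for Vault's fatal/non-fatal classification); this is not existing Vault API:

```go
package server

import (
	"context"
	"fmt"
)

// storedKeyUnsealer is an illustrative interface; SupportsStoredKeys is a
// hypothetical helper, not an existing Vault method.
type storedKeyUnsealer interface {
	SupportsStoredKeys() bool
	UnsealWithStoredKeys(ctx context.Context) error
}

// initialUnseal wraps the one-shot startup unseal with the "supported?"
// check. Only fatal errors propagate (and would make the CLI exit non-zero);
// transient failures are left for the background retry goroutine.
func initialUnseal(ctx context.Context, c storedKeyUnsealer, isFatal func(error) bool) error {
	if !c.SupportsStoredKeys() {
		return nil // nothing to attempt for seals without stored keys
	}
	if err := c.UnsealWithStoredKeys(ctx); err != nil {
		if isFatal(err) {
			return fmt.Errorf("error unsealing with stored keys: %w", err)
		}
		// Transient failure: ignore here; the background retry loop will try again.
	}
	return nil
}
```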

sethvargo (Contributor, Author):

@briankassouf @jefferai here's a fun one for ya 😄. This isn't merge-ready yet because I'm looking for feedback from you all around the best paths here.

jefferai added this to the 1.0.3 milestone Jan 14, 2019
jefferai (Member) left a comment:

I think this looks good except for that one comment about suspected log spamming.

The double-double negative was really confusing, and this pattern is used a few places in Vault. This negates the double negative, making the devx a bit easier to follow.
Prior to this commit, "nil" could have meant unsupported auto-unseal, a transient error, or success. This updates the function to return the correct error type, signaling to the caller whether they should retry or fail.
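To illustrate the caller side of that contract, a hedged sketch: `NonFatalError` here mirrors the idea of `vault.NonFatalError` from the diff above, and `shouldRetryUnseal` is purely illustrative, not Vault's actual code.

```go
package server

import "errors"

// NonFatalError mirrors the idea of vault.NonFatalError for this sketch:
// an error the caller may treat as transient and retry.
type NonFatalError struct{ Err error }

func (e *NonFatalError) Error() string { return "non-fatal error: " + e.Err.Error() }
func (e *NonFatalError) Unwrap() error { return e.Err }

// shouldRetryUnseal interprets the result of an unseal attempt under the new
// contract: nil means unsealed (stop), a non-fatal error means try again,
// and anything else is fatal (stop and surface the error).
func shouldRetryUnseal(err error) bool {
	if err == nil {
		return false // success: the node is unsealed
	}
	var nf *NonFatalError
	return errors.As(err, &nf) // retry only errors marked non-fatal
}
```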
jefferai previously approved these changes Jan 23, 2019
briankassouf previously approved these changes Jan 23, 2019
briankassouf (Contributor) left a comment:

Just one question, otherwise looks great!

Review thread on command/server.go (outdated, resolved)
sethvargo dismissed stale reviews from briankassouf and jefferai via 9e646ee January 23, 2019 21:24
jefferai (Member):

Mergin'

jefferai merged commit f0ab6b5 into hashicorp:master Jan 23, 2019
jefferai (Member):

Thanks Seth!

sethvargo deleted the sethvargo/watch_init branch January 23, 2019 21:35