Continuously attempt to unseal if sealed keys are supported #6039
Conversation
@@ -741,7 +741,7 @@ CLUSTER_SYNTHESIS_COMPLETE:
	// Initialize the core
	core, newCoreError := vault.NewCore(coreConfig)
	if newCoreError != nil {
		if !errwrap.ContainsType(newCoreError, new(vault.NonFatalError)) {
These changes are somewhat unrelated, but the "if not contains a non-fatal error" got really confusing.
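For comparison, one way to make the inverted condition readable is to name it. A sketch only; `isFatalError` is a hypothetical helper for illustration, not necessarily what this change introduces:

```go
// import "github.com/hashicorp/errwrap"

// isFatalError reports whether err is anything other than a
// vault.NonFatalError, i.e. the server should abort startup.
// Hypothetical helper name.
func isFatalError(err error) bool {
	return !errwrap.ContainsType(err, new(vault.NonFatalError))
}
```

With that, the call site becomes `if isFatalError(newCoreError) { ... }`, which reads positively.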
command/server.go
@@ -937,13 +937,40 @@ CLUSTER_SYNTHESIS_COMPLETE:
		return 1
	}

	// Attempt an initial unseal.
	if err := core.UnsealWithStoredKeys(context.Background()); err != nil {
Not sure what to do here. We already have code that attempted an unseal with stored keys. The previous method would return nil if unseal keys aren't supported. Now it returns a fatal error, so I think we need to wrap these calls with a check to see if unseal keys are supported. Additionally, I'm not sure if we need this block anyway since the goroutine will do it later. The only difference is that the goroutine won't force the CLI to exit on error (it'll just log).
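A sketch of the guard described here (hedged; whether the support check is exposed as `core.SealAccess().StoredKeysSupported()` is my assumption about the seal interface):

```go
// Only attempt the initial stored-keys unseal when the seal supports
// stored keys; otherwise the (now fatal) "unsupported" error would
// force the CLI to exit on configurations without stored keys.
if core.SealAccess().StoredKeysSupported() {
	if err := core.UnsealWithStoredKeys(context.Background()); err != nil {
		c.UI.Error(fmt.Sprintf("Error unsealing with stored keys: %s", err))
		return 1
	}
}
```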
@briankassouf @jefferai here's a fun one for ya 😄. This isn't merge-ready yet because I'm looking for feedback from you all around the best paths here.
I think this looks good except for that one comment about suspected log spamming.
The double-double negative was really confusing, and this pattern is used in a few places in Vault. This negates the double negative, making the devx a bit easier to follow.
Prior to this commit, "nil" could have meant unsupported auto-unseal, a transient error, or success. This updates the function to return the correct error type, signaling to the caller whether they should retry or fail.
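A hedged sketch of what the caller-side distinction could look like under the new return semantics (helper names such as `vault.IsFatalError` are assumptions):

```go
if err := core.UnsealWithStoredKeys(ctx); err != nil {
	if vault.IsFatalError(err) {
		// Stored keys unsupported or misconfigured: retrying won't help.
		return err
	}
	// Transient failure (e.g. storage not initialized yet): retry later.
	logger.Warn("failed to unseal core with stored keys", "error", err)
}
```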
Just one question, otherwise looks great!
This fixes a bug that occurs on bootstrapping an initial cluster. Given a collection of Vault nodes and an initialized storage backend, they will all go into standby waiting for initialization. After one node is initialized, the other nodes had no mechanism by which they "re-check" to see if unseal keys are present. This adds a goroutine to the server command which continually waits for unseal keys to exist. It exits in the following conditions:

- the node is unsealed
- the node does not support stored keys
- a fatal error occurs (as defined by Vault)
- the server is shutting down

In all other situations, the routine wakes up at the specified interval and attempts to unseal with the stored keys.
Mergin'
Thanks Seth!
This fixes a bug / adds a missing feature that occurs on bootstrapping an initial cluster. Given a collection of Vault nodes and an initialized storage backend, they will all go into standby waiting for initialization. After one node is initialized, the other nodes had no mechanism by which they "re-check" to see if unseal keys are present. This adds a goroutine to the server command which continually waits for unseal keys to exist. It exits in the following conditions:

- the node is unsealed
- the node does not support stored keys
- a fatal error occurs (as defined by Vault)
- the server is shutting down

In all other situations, the routine wakes up at the specified interval and attempts to unseal with the stored keys.
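A minimal sketch of that goroutine, mapping to the exit conditions above (the interval, channel, logger, and `vault.IsFatalError` helper are illustrative assumptions, not the exact implementation):

```go
go func() {
	for {
		select {
		case <-c.ShutdownCh:
			return // the server is shutting down
		case <-time.After(5 * time.Second):
		}

		if !core.Sealed() {
			return // the node is unsealed
		}

		if err := core.UnsealWithStoredKeys(context.Background()); err != nil {
			if vault.IsFatalError(err) {
				// Fatal (as defined by Vault) or stored keys unsupported.
				c.logger.Error("error unsealing core", "error", err)
				return
			}
			// Transient: wake up on the next tick and try again.
			c.logger.Warn("failed to unseal core", "error", err)
		}
	}
}()
```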
Why is this important?
When running on a scheduler for the first time with auto-unseal configured, Vault will start all the pods/services, but then the Vault nodes all immediately enter a "deadlock" mode in which they are waiting for unseal keys (because the config says they are supported), but no keys exist because the storage backend is uninitialized. Without this change, upon initializing the storage backend, the only way to get the pods to unseal themselves is to fully restart each pod. This isn't a great experience and is incredibly difficult to automate.
Other things considered
Have a sidecar container that runs `vault init ...`

This works great in theory, but Vault's storage backends don't support CAS operations. If multiple Vault servers start at the same time (as they would when scheduled by something like Kubernetes or Nomad), the sidecar container would run `vault init ...` on each of them. Depending on the storage backend, there's a race condition where one init process will overwrite another between the time of the init call and the time of the data persistence. This sounds rare, but in practice I've been able to reproduce it reliably 88% of the time. See also #5356.

Add / update API
I considered adding a new API like `sys/unseal-check` and considered updating the existing `sys/unseal` endpoint to not require a `key` parameter and instead try to unseal with stored keys. I lied, I didn't consider it, I actually wrote it first. It worked. I deployed and tested it. It just didn't "feel" correct. It's difficult to orchestrate an API call to all the Vault nodes after their deployment, and it honestly felt like an unnecessary extra step. I kept asking why Vault wouldn't just do this for me since it's already pointed at the shared storage backend. I'm not opposed to going back this route, but it felt wrong.

Check on HUP
This is probably the simplest option in terms of code changes. We simply update the reload handlers to attempt an auto-unseal. The drawbacks, however, are not as simple. Just like the new API endpoint, it's difficult to orchestrate sending signals across multiple pods. Additionally, it creates a weird edge case where someone manually sealed a Vault node, but another process was able to HUP it, effectively unsealing it. It felt a little priv-esc to me, and I think there's probably a security hole there if you look hard enough.
Do nothing
This is my least favorite option, as it pushes far too much orchestration back onto the user.