-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SecretDiscoveryService server implementation #667
Conversation
This commit adds an implementation of the Envoy Secret Discovery Service to the SPIRE agent. The SDS server is off by default and must be enabled with the `enable_sds` config value. Specifically, the SDS server is hosted on the same UDS socket as the Workload API. It performs workload attestation to determine which secrets to make available to the caller. Resources currently map as follows: - SVIDs map to TLSCertificate resources with a resource name matching the SPIFFE ID of the SVID. - Bundles map to ValidationContext resources with a resource name matching the trust domain ID of the bundle. At some point, we may consider symbolic names like "default_tls_certificate" or "default_validation_context" that map to certain resources, but that is out of scope for the initial implementation. Signed-off-by: Andrew Harding <azdagron@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks great, thanks @azdagron
I reviewed this 1) while sick, and 2) without intimate knowledge of expected xDS API behavior (though I did reference the docs)... so, apologies in advance for any strange questions or comments :). Hopefully I'll be better by the time this is ready for another pass
pkg/agent/endpoints/sds/handler.go
Outdated
} | ||
defer done() | ||
|
||
metrics = metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics is unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
pkg/agent/endpoints/sds/handler.go
Outdated
for { | ||
select { | ||
case newReq := <-reqch: | ||
h.c.Log.Debugf("stream-secrets: recv: res=%q ver=%q typ=%q nonce=%q", newReq.ResourceNames, newReq.VersionInfo, newReq.TypeUrl, newReq.ResponseNonce) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good opportunity to take advantage of the structured logger fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call!
pkg/agent/endpoints/sds/handler.go
Outdated
// match the last sent nonce, otherwise the request should be | ||
// ignored. | ||
if ((lastNonce == "") != (newReq.ResponseNonce == "")) || lastNonce != newReq.ResponseNonce { | ||
h.c.Log.Warnf("unexpected nonce %q (expected %q); ignoring request", newReq.ResponseNonce, lastNonce) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: log message should be capitalized. A little more clarity would be good too e.g. Received unexpected nonce %q (expected %q); ignoring request
. The subsystem_name
field should be enough to indicate that it's related to the SDS API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pkg/agent/endpoints/sds/handler.go
Outdated
// The nonce should be empty (if we've never sent a response) or | ||
// match the last sent nonce, otherwise the request should be | ||
// ignored. | ||
if ((lastNonce == "") != (newReq.ResponseNonce == "")) || lastNonce != newReq.ResponseNonce { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this different than if lastNonce != newReq.ResponseNonce {
? Seems to me that if we have lastNonce set to none, we should also expect newReq to have no nonce set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems to me you are correct
pkg/agent/endpoints/sds/handler.go
Outdated
if req.ErrorDetail != nil { | ||
// The caller has failed to apply the secrets. Wait until the | ||
// next update to send down new info. | ||
h.c.Log.Errorf("caller failed to apply secrets: %q", req.ErrorDetail.Message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: clearer message and capitalization would be good (we know its x509 etc) e.g. Envoy instance failed to install an SVID: %q
Is it possible to know a SPIFFE ID or something here to identify which instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "secret" could be a validation context :) i will add some details though, like the resource names.
pkg/agent/endpoints/sds/handler.go
Outdated
return err | ||
} | ||
|
||
h.c.Log.Debugf("stream-secrets: send: ver=%q typ=%q nonce=%q count=%d", resp.VersionInfo, resp.TypeUrl, resp.Nonce, len(resp.Resources)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here on logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
if req.ErrorDetail != nil { | ||
// The caller has failed to apply the secrets. Wait until the | ||
// next update to send down new info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What sort of conditions can cause this to fail? Waiting for the next update might be too little too late...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the update is malformed, or empty for example. There may be transient error conditions, but my interpretation of https://github.com/envoyproxy/data-plane-api/blob/master/XDS_PROTOCOL.md#when-to-send-an-update is that I shouldn't re-send the update unless there is some change from the last DiscoveryResponse that was sent. I assume there is no corrective action I can take on the payload to get envoy to suddenly accept it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh... ok yes, I see that now. I misinterpreted one of the error diagrams in that doc.
I guess I figured that there are probably transient error conditions on envoy side, but maybe there are not...? My main concern here was a transient envoy error preventing an SVID update, then subsequently falling offline because presumably that SVID would expire before we try to create a new one (which would really be the third one).
I am definitely no expert here... I guess we can take it as is and see if we run into anything
pkg/agent/endpoints/sds/handler.go
Outdated
} | ||
|
||
func (h *Handler) FetchSecrets(ctx context.Context, req *api_v2.DiscoveryRequest) (*api_v2.DiscoveryResponse, error) { | ||
h.c.Log.Debugf("fetch-secrets: res=%q ver=%q typ=%q", req.ResourceNames, req.VersionInfo, req.TypeUrl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here on logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pkg/agent/endpoints/sds/handler.go
Outdated
} | ||
defer done() | ||
|
||
metrics = metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics is unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pkg/agent/endpoints/sds/handler.go
Outdated
func (h *Handler) callerPID(ctx context.Context) (pid int32, err error) { | ||
info, ok := auth.CallerFromContext(ctx) | ||
if !ok { | ||
return 0, errors.New("Unable to fetch credentials from context") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these errors should be lowercased
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Andrew Harding <azdagron@gmail.com>
Signed-off-by: Andrew Harding <azdagron@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wooooo 🔵 🐙
errch <- err | ||
return | ||
} | ||
reqch <- req |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The xDS docs say that the management server should respond to the latest request received by envoy, though it seems like there is room for a second request to race the first response. It's not super clear to me how we can avoid that gracefully, but thought I'd point it out... maybe I am missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it is possible to avoid the race no matter what you do, at least the way the protocol is constructed 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I thought the same... 🤷
This PR adds an implementation of the Envoy Secret Discovery Service to the SPIRE agent. The SDS server is off by default and must be enabled with the
enable_sds
config value.Specifically, the SDS server is hosted on the same UDS socket as the Workload API. It performs workload attestation to determine which secrets to make available to the caller. Resources currently map as follows:
At some point, we may consider symbolic names like "default_tls_certificate" or default_validation_context" that map to certain resources, but that is out of scope for the initial
implementation.