-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: add jwks rpc and http api #18035
Changes from all commits
ab1ea61
59dd317
1f76474
b28436b
027c6b2
3f252d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,12 +4,79 @@ | |
package agent | ||
|
||
import ( | ||
"crypto/ed25519" | ||
"fmt" | ||
"net/http" | ||
"strings" | ||
"time" | ||
|
||
"github.com/go-jose/go-jose/v3" | ||
"github.com/hashicorp/nomad/helper" | ||
"github.com/hashicorp/nomad/nomad/structs" | ||
) | ||
|
||
// jwksMinMaxAge is the minimum amount of time the JWKS endpoint will instruct | ||
// consumers to cache a response for. | ||
const jwksMinMaxAge = 15 * time.Minute | ||
|
||
// JWKSRequest is used to handle JWKS requests. JWKS stands for JSON Web Key | ||
// Sets and returns the public keys used for signing workload identities. Third | ||
// parties may use this endpoint to validate workload identities. Consumers | ||
// should cache this endpoint, preferably until an unknown kid is encountered. | ||
func (s *HTTPServer) JWKSRequest(resp http.ResponseWriter, req *http.Request) (any, error) { | ||
if req.Method != http.MethodGet { | ||
return nil, CodedError(405, ErrInvalidMethod) | ||
} | ||
|
||
args := structs.GenericRequest{} | ||
if s.parse(resp, req, &args.Region, &args.QueryOptions) { | ||
return nil, nil | ||
} | ||
|
||
var rpcReply structs.KeyringListPublicResponse | ||
if err := s.agent.RPC("Keyring.ListPublic", &args, &rpcReply); err != nil { | ||
return nil, err | ||
} | ||
setMeta(resp, &rpcReply.QueryMeta) | ||
|
||
// Key set will change after max(CreateTime) + RotationThreshold. | ||
var newestKey int64 | ||
jwks := make([]jose.JSONWebKey, 0, len(rpcReply.PublicKeys)) | ||
for _, pubKey := range rpcReply.PublicKeys { | ||
if pubKey.CreateTime > newestKey { | ||
newestKey = pubKey.CreateTime | ||
} | ||
|
||
jwk := jose.JSONWebKey{ | ||
KeyID: pubKey.KeyID, | ||
Algorithm: pubKey.Algorithm, | ||
Use: pubKey.Use, | ||
} | ||
switch alg := pubKey.Algorithm; alg { | ||
case structs.PubKeyAlgEdDSA: | ||
// Convert public key bytes to an ed25519 public key | ||
jwk.Key = ed25519.PublicKey(pubKey.PublicKey) | ||
default: | ||
s.logger.Warn("unknown public key algorithm. server is likely newer than client", "alg", alg) | ||
} | ||
Comment on lines
+55
to
+61
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this error message is suggesting that the request came to a Nomad client and was forwarded to a server that was newer, but the text isn't super clear and it's not necessarily the case (ex. the request was sent to a Nomad server that hadn't been upgraded and then that was forwarded to the leader which was). What if the RPC returned both the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I struggled with what responsibilities to put in the RPC vs HTTP handlers. If I understand your proposal it would mean the type KeyringPublicKey struct {
KeyID string `json:"kid"`
// Always "OKP" (octet key pair) currently
KeyType string `json:"kty"`
PublicKey []byte `json:"x"`
// Always "EdDSA" currently
Algorithm string `json:"crv"`
// Always "Ed25519" currently, must match alg
Curve string `json:"alg"`
// Always "sig" currently
Use string `json:"use"`
// Purely for Nomad's cache control use. Not part of JWK spec.
CreateTime int64 `json:"-"`
} Then we would just JSON marshal Reason 1 Against: Too DIYI avoided this initially because it felt too DIY. The nice thing about That being said go-jose is "saving" us 2 fields currently ( So my "too DIY to skip go-jose" argument seems weak at best... Reason 2 Against: Validating JWTs in the AgentBut there's another reason to do the conversion here: if we ever want to validate JWTs in the Agent, the Agent needs to do the conversion we're doing here. There's no way to avoid converting The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys. In my other branch I have a I think this is reason enough to keep Alternative: JSONWebKeySet internallyPlan: just have the RPC return Pro: All of the key handling is the Con: I don't think it's safe to send go-jose's My main concern is I'm also a little worried that if we started using X509 certificates in the future that the I don't know... these may be unreasonable concerns, but There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yeah that's pretty much what I was thinking. To be clear, I'm less concerned about where in the code the conversion is happening than avoiding weird cross-version problems in doing that conversion with info we got from the server. But...
I think I'm probably sold on the basis of this alone. We know we're likely to want to do this in the near-ish future. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you think about my comment about the text of the error message though? |
||
|
||
jwks = append(jwks, jwk) | ||
} | ||
|
||
// Have nonzero create times and threshold so set a reasonable cache time. | ||
if newestKey > 0 && rpcReply.RotationThreshold > 0 { | ||
exp := time.Unix(0, newestKey).Add(rpcReply.RotationThreshold) | ||
maxAge := helper.ExpiryToRenewTime(exp, time.Now, jwksMinMaxAge) | ||
resp.Header().Set("Cache-Control", fmt.Sprintf("max-age=%d", int(maxAge.Seconds()))) | ||
} | ||
|
||
out := &jose.JSONWebKeySet{ | ||
Keys: jwks, | ||
} | ||
|
||
return out, nil | ||
} | ||
|
||
// KeyringRequest is used route operator/raft API requests to the implementing | ||
// functions. | ||
func (s *HTTPServer) KeyringRequest(resp http.ResponseWriter, req *http.Request) (interface{}, error) { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
// Copyright (c) HashiCorp, Inc. | ||
// SPDX-License-Identifier: MPL-2.0 | ||
|
||
package helper | ||
|
||
import ( | ||
"time" | ||
) | ||
|
||
// ExpiryToRenewTime calculates how long until clients should try to renew | ||
// credentials based on their expiration time and now. | ||
// | ||
// Renewals will begin halfway between now and the expiry plus some jitter. | ||
// | ||
// If the expiration is in the past or less than the min wait, then the min | ||
// wait time will be used with jitter. | ||
Comment on lines
+15
to
+16
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But doesn't this behavior floor the result at 15min? If so, when would the client ever renew if the time to renew is always 15min in the future? Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, but oof you're right this function has a sort of Fence Post Problem: if you call it before each fetch to determine whether it's time to fetch: you'll never renew! However if you call it after you fetch it will give you a reasonable time to fetch in. JWKS is the latter, so I think it's safe:
If Consul was using something like ExpiryToRenewTime prior to each fetch I think we'd have the problem you mention. I'll see if there's a safer way to write this function when I post the next PR that references it. It's possible we don't want to actually share this code at all and can just make it a private function for JWKS to use. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (also just realized I named it "...to renew time" when it returns a duration which is a it silly) |
||
func ExpiryToRenewTime(exp time.Time, now func() time.Time, minWait time.Duration) time.Duration { | ||
left := exp.Sub(now()) | ||
|
||
renewAt := left / 2 | ||
|
||
if renewAt < minWait { | ||
return minWait + RandomStagger(minWait/10) | ||
} | ||
|
||
return renewAt + RandomStagger(renewAt/10) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
// Copyright (c) HashiCorp, Inc. | ||
// SPDX-License-Identifier: MPL-2.0 | ||
|
||
package helper | ||
|
||
import ( | ||
"testing" | ||
"time" | ||
|
||
"github.com/shoenig/test/must" | ||
) | ||
|
||
// TestExpiryToRenewTime_0Min asserts that ExpiryToRenewTime with a 0 min wait | ||
// will cause an immediate renewal | ||
func TestExpiryToRenewTime_0Min(t *testing.T) { | ||
exp := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC) | ||
now := func() time.Time { | ||
return time.Date(2023, 1, 1, 0, 0, 1, 0, time.UTC) | ||
} | ||
|
||
renew := ExpiryToRenewTime(exp, now, 0) | ||
|
||
must.Zero(t, renew) | ||
} | ||
|
||
// TestExpiryToRenewTime_14Days asserts that ExpiryToRenewTime begins trying to | ||
// renew at or after 7 days of a 14 day expiration window. | ||
func TestExpiryToRenewTime_30Days(t *testing.T) { | ||
exp := time.Date(2023, 1, 15, 0, 0, 0, 0, time.UTC) | ||
now := func() time.Time { | ||
return time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC) | ||
} | ||
min := 20 * time.Minute | ||
|
||
renew := ExpiryToRenewTime(exp, now, min) | ||
|
||
// Renew should be much greater than min wait | ||
must.Greater(t, min, renew) | ||
|
||
// Renew should be >= 7 days | ||
must.GreaterEq(t, 7*24*time.Hour, renew) | ||
} | ||
|
||
// TestExpiryToRenewTime_UnderMin asserts that ExpiryToRenewTime uses the min | ||
// wait + jitter if it is greater than the time until expiry. | ||
func TestExpiryToRenewTime_UnderMin(t *testing.T) { | ||
exp := time.Date(2023, 1, 1, 0, 0, 10, 0, time.UTC) | ||
now := func() time.Time { | ||
return time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC) | ||
} | ||
min := 20 * time.Second | ||
|
||
renew := ExpiryToRenewTime(exp, now, min) | ||
|
||
// Renew should be >= min wait (jitter can be 0) | ||
must.GreaterEq(t, min, renew) | ||
|
||
// When we fallback to the min wait it means we miss the expiration, but this | ||
// is necessary to prevent stampedes after outages and partitions. | ||
must.GreaterEq(t, exp.Sub(now()), renew) | ||
Comment on lines
+58
to
+60
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be nice to have this comment in the docstring for |
||
} | ||
|
||
// TestExpiryToRenewTime_Expired asserts that ExpiryToRenewTime defaults to | ||
// minWait (+jitter) if the renew time has already elapsed. | ||
func TestExpiryToRenewTime_Expired(t *testing.T) { | ||
exp := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC) | ||
now := func() time.Time { | ||
return time.Date(2023, 2, 1, 0, 0, 0, 0, time.UTC) | ||
} | ||
min := time.Hour | ||
|
||
renew := ExpiryToRenewTime(exp, now, min) | ||
|
||
must.Greater(t, min, renew) | ||
must.Less(t, min*2, renew) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically the HTTP server will be protected with mTLS, which won't be reachable by third parties. So this probably needs to get served on a separate HTTP listener. (On its own port, I suppose?) This PR is already a nice size so we don't have to solve that in this PR, but just wanted to make sure we didn't forget about that problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah in the RFC (internal only, sorry folks) I call this out as a problem.
The Task API (+Workload Identity) gets around the mTLS problem for things running in Nomad... which our friends Consul and Vault usually are not.
...so then you're left with the option of running an mTLS terminating proxy just for this endpoint which is a hassle to say the least. How the proxy and the JWT's
issuer
field interacted could be problematic as well, although I don't think for Consul and Vault because their JWT auth methods allow you to specify the JWKSURL explicitly and only seem to use theissuer
claim for mapping.That being said this is only a problem for folks running with
tls.verify_https_client = true
explicitly as it defaults tofalse
even with mTLS enabled. I think this is still an ok default and recommendation as the HTTP endpoints will still be protected by (asymmetric) TLS, and the Consul and Vault JWT auth methods support setting a custom CA cert for that........even after all of that we should probably add a "Only TLS, Not mTLS" HTTP listener for this, the eventual OIDC discovery endpoint, and maybe other "low risk" endpoints like metrics.
While I feel like the situation has fairly reasonable options, I think it's clearly too complicated. Somebody hardening their cluster by setting
tls.verify_https_client = true
one day breaking their Consul integration due to the JWT auth method lacking a client certificate would just be hellish to debug. Definitely a : (╯°□°)╯︵ ┻━┻ situation I want to avoid.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep forgetting about
tls.verify_https_client = false
being the default. Even Vault's equivalent is set to false.We definitely don't need to solve this in this PR for sure 😀