-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a configurable SVID rotation threshold #4599
Conversation
aabc08f
to
e6acdea
Compare
"github.com/spiffe/spire/pkg/agent/client" | ||
) | ||
|
||
const ( | ||
gracePeriodThreshold = 12 * time.Hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can tune this value along with minimum number of the availability_target.
@@ -55,6 +55,8 @@ const ( | |||
|
|||
bundleFormatPEM = "pem" | |||
bundleFormatSPIFFE = "spiffe" | |||
|
|||
minimumAvailabilityTarget = 24 * time.Hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can tune this value.
e6acdea
to
f487182
Compare
|
||
return ttl <= jitteredHalfLife | ||
return ttl <= availabilityTarget |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think, should we introduce the jitter for availability target ?
IMO:
- ±10% jitter is too large and may shift the availability target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think this may have a jitter, this can even simplify your code here too, because you can get halftime or your availabilityTarget
as initial step for the shitter.
if you are worried about 10%, you can reduce that amount when availabilityTarget is set,
what value do you think is good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that we should preserve the jitter (with a reduced percentage in this case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since availability_target is more than 24 hours, what about the using absolute maximum jitter instead of percentage? For example, 0 ~ +10min.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple high level comments on docs just to make sure the functionality is clearly communicated ... let me know what you think on those. @amartinezfayo is going to help out reviewing some of the renewal logic changes, so a couple more comments to come. Thanks so much for the contribution @hiyosi 🙏
doc/spire_agent.md
Outdated
@@ -68,6 +68,7 @@ This may be useful for templating configuration files, for example across differ | |||
| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem | | |||
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | | | |||
| `workload_x509_svid_key_type` | The workload X509 SVID key type <rsa-2048|ec-p256> | ec-p256 | | |||
| `availability_target` | The amount of time to guarantee the SVID availability. Must be grater than 24h. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `availability_target` | The amount of time to guarantee the SVID availability. Must be grater than 24h. | | | |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. Must be greater than 24h. | | |
doc/spire_agent.md
Outdated
@@ -105,6 +106,13 @@ These are the available profiles that can be set in the `profiling_freq` configu | |||
- `trace` | |||
- `cpu` | |||
|
|||
### Availability Target | |||
|
|||
If the `availability_target` is set, the agent will rotate when remaining lifetime of the SVID reaches the `availability_target`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the `availability_target` is set, the agent will rotate when remaining lifetime of the SVID reaches the `availability_target`. | |
If the `availability_target` is set, the agent will rotate an SVID when its remaining lifetime reaches the `availability_target`. |
486084d
to
f948911
Compare
pkg/agent/svid/rotator.go
Outdated
@@ -144,7 +144,8 @@ func (r *rotator) rotateSVIDIfNeeded(ctx context.Context) (err error) { | |||
return fmt.Errorf("unexpected value type: %T", r.state.Value()) | |||
} | |||
|
|||
if rotationutil.ShouldRotateX509(r.clk.Now(), state.SVID[0]) { | |||
log := r.c.Log.WithField(telemetry.SPIFFEID, state.SVID[0].URIs[0].String()) | |||
if rotationutil.ShouldRotateX509(r.clk.Now(), state.SVID[0], r.c.AvailabilityTarget, log) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so should rotate will be affected by log, and that log can have different fields, for example here you added SPIFFEID but in updateCache
this will contains entryID and spiffeID,
maybe we can unify? just to be consistent in what we are logging?
and can we quaranty that state.SVID[0].URIs[0].String()
will return the same string we have in updateCache
?
in case of x509 you can get the spiffeID from certificate itself and add the fields when required in the function itselft
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored some of them and would be happy to review them again.
pkg/agent/manager/manager.go
Outdated
|
||
cachedSVID, ok := m.cache.GetJWTSVID(spiffeID, audience) | ||
if ok && !rotationutil.JWTSVIDExpiresSoon(cachedSVID, now) { | ||
if ok && !rotationutil.JWTSVIDExpiresSoon(cachedSVID, now, m.c.AvailabilityTarget, log) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is the same for all uses of JWTSVIDExpiresSoon
or rotationutil.ShouldRotateX509
,
since it depends of some metadata from the caller (introduced in logs)
can you test that those functions are getting the values we expect?
maybe it is better to just provide an spiffeID as parameter and add the field in the function itself?
pkg/agent/svid/rotator_config.go
Outdated
@@ -40,6 +40,8 @@ type RotatorConfig struct { | |||
|
|||
// Clk is the clock that the rotator will use to create a ticker | |||
Clk clock.Clock | |||
|
|||
AvailabilityTarget time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment?
// ShouldRotateX509 determines if a given SVID should be rotated, based | ||
// on presented current time, and the certificate's expiration. | ||
func ShouldRotateX509(now time.Time, cert *x509.Certificate) bool { | ||
return shouldRotate(now, cert.NotBefore, cert.NotAfter) | ||
func ShouldRotateX509(now time.Time, cert *x509.Certificate, availabilityTarget time.Duration, log logrus.FieldLogger) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this approach, looks now functions requires to have an availabilityTarged, if it is jwt or x.509, and in both cases we requires to have a logger too.
May we create an struct? and create at the start of the agent? and set the availabilityTarget
there? (and maybe add the logger too)
So places where we are using thi public function
somehting like:
type RotationUtil interface {
X509(now time.Time, cert *x509.Certificate) bool{}
JWTSVIDExpiresSoon(svid *client.JWTSVID, now time.Time) bool {}
}
type rotationUtil struct {
availabilityTarget time.Duration
log logrus.FieldLogger
}
func (r *rotationUtil) X509(now time.Time, cert *x509.Certificate) bool{}
func (r *rotationUtil) JWTSVIDExpiresSoon(svid *client.JWTSVID, now time.Time) bool {}
but maybe it is an overkill, what do you think about this? (And I dont like interface name :D)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored some of them and would be happy to review them again.
if availabilityTarget > 0 { | ||
log.Warn("SVID lifetime isn't long enough to guarantee the availability_target, fell back to the default rotation strategy") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks to me like this log must not be here, since this will happen only when logic from shouldFallbackDefaultRotationStrategy
return true because the fracePreriod,
maybe you can move this log there? but that will result in spamming messages...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that the log should be done in the shouldFallbackDefaultRotationStrategy
function. The current implementation already logs this every time that the SVID is being rotated. I'm not sure if I would suggest to log it only once. That could cause to remain unnoticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is OK if the warning log for each entry to be output at every intervals?
but that will result in spamming messages...
Right, I thought to suppress it.
e.g.
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0009] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0009] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0009] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0010] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0028] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0029] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0029] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0029] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0033] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0034] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0034] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0034] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0038] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0039] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0039] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0039] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0043] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0044] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0044] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0044] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0048] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0049] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0049] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0049] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy spiffe_id="spiffe://example.org/workload" subsystem_name=manager
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont like that... so you can maybe keep the log where it is, or refactor to simplify it, I think that code can be affected for some of the another comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this approch?
When creating/renewing SVIDs, to check if the availability_target is enough or not.
INFO[0000] SVID is not found. Starting node attestation subsystem_name=attestor trust_domain_id="spiffe://example.org"
INFO[0000] Node attestation was successful rettestable=false spiffe_id="spiffe://example.org/spire/agent/join_token/f42c9691-21f9-4c9c-a1d0-388e3b60be4b" subsystem_name=attestor trust_domain_id="spiffe://example.org"
DEBU[0000] Entry created entry=50c1d620-74b2-4bf3-887e-b6585a08b84f selectors_added=1 spiffe_id="spiffe://example.org/workload" subsystem_name=cache_manager
DEBU[0000] Entry created entry=f886c54d-775f-4d21-9013-082ef8bdef6d selectors_added=1 spiffe_id="spiffe://example.org/token" subsystem_name=cache_manager
DEBU[0000] Entry created entry=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 selectors_added=1 spiffe_id="spiffe://example.org/test-workload" subsystem_name=cache_manager
DEBU[0000] Renewing stale entries cache_type=workload count=3 limit=500 subsystem_name=manager
INFO[0000] Creating X509-SVID entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=manager
INFO[0000] Creating X509-SVID entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=manager
INFO[0000] Creating X509-SVID entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0000] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0000] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0000] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=manager
DEBU[0000] SVID updated entry=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=cache_manager
DEBU[0000] SVID updated entry=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=cache_manager
DEBU[0000] SVID updated entry=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=cache_manager
DEBU[0000] Bundle added subsystem_name=svid_store_cache trust_domain_id=example.org
INFO[0000] Starting Workload and SDS APIs address=/tmp/spire-agent/public/api.sock network=unix subsystem_name=endpoints
DEBU[0000] Initializing health checkers subsystem_name=health
DEBU[0812] Updating expiring SVIDs in cache cache_type=workload expiring_svids=1 subsystem_name=manager
DEBU[0812] Renewing stale entries cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0812] Renewing X509-SVID entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f expires_at="2023-11-20T01:13:15Z" spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0812] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=manager
DEBU[0812] SVID updated entry=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=cache_manager
DEBU[0827] Updating expiring SVIDs in cache cache_type=workload expiring_svids=1 subsystem_name=manager
DEBU[0827] Renewing stale entries cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0827] Renewing X509-SVID entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d expires_at="2023-11-20T01:13:15Z" spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0827] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=manager
DEBU[0827] SVID updated entry=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=cache_manager
DEBU[0832] Updating expiring SVIDs in cache cache_type=workload expiring_svids=1 subsystem_name=manager
DEBU[0832] Renewing stale entries cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0832] Renewing X509-SVID entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 expires_at="2023-11-20T01:13:15Z" spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0832] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
DEBU[0832] SVID updated entry=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=cache_manager
DEBU[0850] Rotating agent SVID subsystem_name=manager
INFO[0850] Successfully rotated agent SVID spiffe_id="spiffe://example.org/spire/agent/join_token/f42c9691-21f9-4c9c-a1d0-388e3b60be4b" subsystem_name=manager
|
||
return ttl <= jitteredHalfLife | ||
return ttl <= availabilityTarget |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think this may have a jitter, this can even simplify your code here too, because you can get halftime or your availabilityTarget
as initial step for the shitter.
if you are worried about 10%, you can reduce that amount when availabilityTarget is set,
what value do you think is good enough?
temp, err := tc.makeTemplate() | ||
require.NoError(t, err) | ||
|
||
cert, _, err := util.SelfSign(temp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: signing a x.509 certificate looks like an overkill.. we only need cert.NotBefore
& cert.NotAfter
...
maybe you can create a x509.Certificate struct with those values set?
"github.com/spiffe/spire/pkg/agent/client" | ||
) | ||
|
||
const ( | ||
gracePeriodThreshold = 12 * time.Hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: may we have the same gracePeriodThreshold
for x509 and jwt? it looks to me like jwt ones must live less time,
and I'm not sure if we even want to have this for those SVIDs... and i case we have it 12hs looks like a lot of time for a minimun
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the consideration about JWT-SVIDs is a valid point.
The default JWT-SVID TTL is 5m, and users in general will want a much shorter JWT-SVID TTL than the X509-SVID TTL.
@hiyosi What's the JWT-SVID TTL in your use case? Or was it based mainly on requirements for X509-SVID?
Knowing more about how JWT-SVIDs are issued and consumed in your use case would help to have a better idea about what's the best approach for this in regards to JWT-SVIDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, I proposed a feature that would only affect X509.
Then, based on the discussion in the proposal, I implemented a configurable that affects both X509 and JWT.
But it may not be the final conclusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hiyosi We discussed this again and we feel that it would be kind of weird if this configurable affects JWT-SVIDs also. We think that JWT-SVIDs should be excluded, and have that documented properly. There is no need to make a change in the configuration name, but the implementation should be updated to not affect JWT-SVIDs. The documentation should then be clear that this affects agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reply.
We think that JWT-SVIDs should be excluded,
I understand. No problem for me either.
if availabilityTarget > 0 { | ||
log.Warn("SVID lifetime isn't long enough to guarantee the availability_target, fell back to the default rotation strategy") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that the log should be done in the shouldFallbackDefaultRotationStrategy
function. The current implementation already logs this every time that the SVID is being rotated. I'm not sure if I would suggest to log it only once. That could cause to remain unnoticed.
|
||
return ttl <= jitteredHalfLife | ||
return ttl <= availabilityTarget |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that we should preserve the jitter (with a reduced percentage in this case).
doc/spire_agent.md
Outdated
| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem | | ||
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | | | ||
| `workload_x509_svid_key_type` | The workload X509 SVID key type <rsa-2048|ec-p256> | ec-p256 | | ||
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. Must be greater than 24h. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. Must be greater than 24h. | | | |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. If set, must be greater than 24h. | | |
doc/spire_agent.md
Outdated
|
||
If the `availability_target` is set, the agent will rotate an SVID when its remaining lifetime reaches the `availability_target`. | ||
|
||
To guarantee the `availability_target`, grace period(`SVID lifetime - availability_target`) must be greater than 12h. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To guarantee the `availability_target`, grace period(`SVID lifetime - availability_target`) must be greater than 12h. | |
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be greater than 12h. |
4f06c55
to
e39a464
Compare
3072757
to
349024b
Compare
I addressed comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hiyosi for your patience. I think that we are pretty close here.
I left a few more comments.
doc/spire_agent.md
Outdated
| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem | | ||
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | | | ||
| `workload_x509_svid_key_type` | The workload X509 SVID key type <rsa-2048|ec-p256> | ec-p256 | | ||
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively X509 SVIDs should be rotated. If set, must be greater than 24h. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it would be good to reference and add a link to the Availability Target section here.
doc/spire_agent.md
Outdated
@@ -105,6 +106,15 @@ These are the available profiles that can be set in the `profiling_freq` configu | |||
- `trace` | |||
- `cpu` | |||
|
|||
### Availability Target | |||
|
|||
NOTE: The `availability_target` only affects the agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
NOTE: The `availability_target` only affects the agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs. | |
_Note: The `availability_target` only affects the agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs._ |
doc/spire_agent.md
Outdated
| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem | | ||
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | | | ||
| `workload_x509_svid_key_type` | The workload X509 SVID key type <rsa-2048|ec-p256> | ec-p256 | | ||
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively X509 SVIDs should be rotated. If set, must be greater than 24h. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the logic is that the availability target must be at least 24, which allows setting a target of 24h. It should say that must be at least 24h instead of greater than 24h.
doc/spire_agent.md
Outdated
|
||
If the `availability_target` is set, the agent will rotate an X509 SVID when its remaining lifetime reaches the `availability_target`. | ||
|
||
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be greater than 12h. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be greater than 12h. | |
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be at least 12h. |
func shouldFallbackX509Default(lifetime, availabilityTarget time.Duration) bool { | ||
// if the grace period less than the threshold, it should be felt back to the default rotation strategy | ||
gracePeriod := lifetime - availabilityTarget | ||
return gracePeriod < gracePeriodThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's easier to understand if we have a grace period of at least x hours.
return gracePeriod < gracePeriodThreshold | |
return gracePeriod <= gracePeriodThreshold |
cmd/spire-agent/cli/run/run.go
Outdated
return nil, fmt.Errorf("unable to parse availability_target: %w", err) | ||
} | ||
if t < minimumAvailabilityTarget { | ||
return nil, fmt.Errorf("availability_target must be greater than %s", minimumAvailabilityTarget.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, fmt.Errorf("availability_target must be greater than %s", minimumAvailabilityTarget.String()) | |
return nil, fmt.Errorf("availability_target must be at least %s", minimumAvailabilityTarget.String()) |
pkg/agent/manager/sync.go
Outdated
"spiffe_id": chain[0].URIs[0].String(), | ||
"entry_id": entryID, | ||
}) | ||
log.Warn("X509 SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Warn("X509 SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy") | |
log.Warn("X509 SVID lifetime isn't long enough to guarantee the availability_target, falling back to the default rotation strategy") |
349024b
to
a8a0228
Compare
@amartinezfayo Thank you for your review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hiyosi!
@hiyosi There is a conflict due to a recent merged change in the main branch. Could you please resolve it so we can have this merged? Thanks! |
c591f27
to
ee6ecb9
Compare
ee6ecb9
to
4117352
Compare
Signed-off-by: Tomoya Usami <tousami@zlab.co.jp>
4117352
to
c2885c5
Compare
I resolved the conflict. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hiyosi for fixing the conflicts.
We squash the commits before merging, so there is no need to force push a single commit. Force pushing to a branch that is being reviewed makes the review more difficult, so we try to avoid that when is possible :)
Thanks again for this contribution, is greatly appreciated!
I understand. I will respect that policy. |
Signed-off-by: Tomoya Usami <tousami@zlab.co.jp>
Pull Request check list
Affected functionality
svid rotation in agent
Description of change
introduce a
availability_target
configurable to control SVID rotation periold.Which issue this PR fixes
fixes #4115